Journal of Applied Statistics. 2020 Nov 18;49(5):1105–1120. doi: 10.1080/02664763.2020.1849053

Structured sparse support vector machine with ordered features

Kuangnan Fang a,b, Peng Wang a, Xiaochen Zhang a, Qingzhao Zhang a,b,c
PMCID: PMC9041777  PMID: 35707509

ABSTRACT

In the application of high-dimensional data classification, several attempts have been made to achieve variable selection by replacing the $\ell_2$-penalty with other penalties for the support vector machine (SVM). However, these high-dimensional SVM methods usually do not take into account the special structure among covariates (features). In this article, we consider a classification problem in which the covariates are ordered in some meaningful way and the number of covariates p can be much larger than the sample size n. We propose a structured sparse SVM to tackle this type of problem, which combines a non-convex penalty and a cubic-spline-type estimation procedure (i.e. penalizing the second-order derivatives of the coefficients) with the SVM. From a theoretical point of view, the proposed method satisfies the local oracle property. Simulations show that the method works effectively in both feature selection and classification accuracy. A real application is conducted to illustrate the benefits of the method.

Keywords: Structured sparse, support vector machine, variable selection, local oracle property

1. Introduction

Classification is one of the most important research fields in statistics and machine learning, and it is also a common practical problem. The support vector machine (SVM) [19] is a powerful classification tool with high accuracy and great flexibility. In this article, we will focus on a classification problem with n cases having class labels $\{y_i \in \{-1, 1\};\ i=1,\ldots,n\}$ and features $\{x_{ij};\ i=1,2,\ldots,n,\ j=1,2,\ldots,p\}$. The SVM has an equivalent formulation as the $\ell_2$-penalized hinge loss [11]:

$$\min_{(\beta_0,\beta)}\ \frac{1}{n}\sum_{i=1}^{n}\ell\{y_i(\beta_0 + x_i^\top\beta)\} + \frac{\lambda}{2}\|\beta\|^2, \qquad (1)$$

where the loss $\ell(t) = [1-t]_+$ is called the hinge loss, and $\|\cdot\|$ is the $\ell_2$-norm. $\lambda$ is the tuning parameter, which is used to control the tradeoff between loss and penalty.

In the application of high-dimensional data classification, several attempts have been made to achieve variable selection by replacing the $\ell_2$-penalty with other penalties for the SVM, such as the $\ell_1$-SVM [1,28], $\ell_0$-SVM and $\ell_\infty$-SVM [10], $\ell_p$-SVM [4], SCAD-SVM [25,27], hybrid Huberized SVM [20,21], and MCP-SVM [27]. Since the hinge loss is not differentiable everywhere, it causes some computational difficulties. [6,15] considered the squared hinge loss in the SVM; [17,20–22] suggested using the Huberized hinge loss in the SVM. [17] points out that the Huberized regularized solution paths are less affected by outliers than those of the non-Huberized squared loss.

This paper concerns a class of structured sparse classification problems with ordered features, i.e. the covariates can be ordered as $x_1, x_2, \ldots, x_p$ in some sense. A motivating example comes from protein mass spectroscopy. For each blood serum sample i, we observe the intensity $x_{ij}$ for many time-of-flight values $t_j$. Time of flight is related to the mass-over-charge ratio m/z of the constituent proteins in the blood. Figure 1 shows an example of protein mass spectroscopy data taken from [16]. We plot the intensity $x_{ij}$ on the vertical axis against m/z on the horizontal axis. The features are ordered in a meaningful way, i.e. the $x_{ij}$ are ordered by m/z, which may lead to high correlation among closely located variables. The SVM methods mentioned above for high-dimensional data classification do not consider the structure in which the variables are arranged in order. Our goal is to predict the label from the ordered features, especially for $p \gg n$.

Figure 1. Protein mass spectroscopy data: average profiles from control (–) and ovarian cancer patients (–).

Besides the above example, many other data types share this structure, such as gene expression data in microarray studies, single nucleotide polymorphism (SNP) data in genome-wide association studies (GWAS), and graph and image data [9]. Such special structures among variables may cause successive coefficients to vary slowly. The fused lasso [18] encourages flatness of the coefficients by penalizing the $\ell_1$-norm of the successive differences of the coefficients. However, it may not perform well when the coefficients vary smoothly rather than being step-like. To capture smooth features within a group, the smooth-lasso [12] replaces the $\ell_1$-penalty on the differences of adjacent coefficients in the fused lasso by an $\ell_2$-penalty. Recently, [9] proposed the spline-lasso and spline-MCP, in which an $\ell_2$-penalty is imposed on the discrete version of the second derivatives of the coefficients. However, the methods mentioned above are mainly used in regression.

To the best of our knowledge, the present article is the first to develop theory and methodology for the SVM that incorporate the ordered structure among predictors. This study advances the existing literature in the following aspects. First, the structured sparse SVM can achieve variable selection as well as capture the ordered structure of features. The subsequent numerical analysis shows that ignoring this data structure harms both classification accuracy and variable selection. Second, we theoretically prove that our method has the local oracle property. Even when the number of covariates grows exponentially with the sample size, the local oracle property still holds for the structured sparse SVM. Finally, the algorithm and asymptotic properties are established in a general form, and many kinds of loss functions can be accommodated in the formulation of the structured sparse SVM, such as the Huberized hinge loss and the squared hinge loss.

The rest of the article is organized as follows. In Section 2, we describe the model, an efficient algorithm, and local oracle properties for structured sparse SVM. Simulation results and an application of the proposed method to a protein mass spectroscopy dataset are presented in Sections 3 and 4. Discussions of the proposed method and results are given in Section 5. Proofs for the oracle properties of structured sparse SVM are provided in the Appendix.

2. Structured sparse support vector machine

2.1. Methodology

In this paper, we allow the number of covariates p to increase with the sample size n; it is even possible that p is much larger than n. We assume that the true parameter is sparse and that the features are ordered in some meaningful way. Thus, we seek an estimator that enjoys the structured sparsity property.

The structured sparse SVM is formulated in terms of a loss function that is regularized by penalty terms. Our proposed minimization objective function is

$$\min_{(\beta_0,\beta)}\ \frac{1}{n}\sum_{i=1}^{n}\ell\{y_i(\beta_0 + x_i^\top\beta)\} + \sum_{j=1}^{p} p_{\lambda_1}(|\beta_j|) + \lambda_2\sum_{j=2}^{p-1}\big(\Delta_j^{(2)}\beta\big)^2. \qquad (2)$$

In (2), the first part is a convex loss function. Many kinds of loss functions can be accommodated. The Huberized hinge loss,

$$\ell_\delta(t) = \begin{cases} 0, & t > 1,\\ (1-t)^2/(2\delta), & 1-\delta < t \le 1,\\ 1 - t - \delta/2, & t \le 1-\delta, \end{cases}$$

is adopted in this paper. We fix the pre-specified constant $\delta = 2$ following [21]. The results with other losses are provided in the supplemental materials.
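
As a concrete illustration, the following is a minimal NumPy sketch of the Huberized hinge loss and its first derivative (the derivative is what the algorithm of Section 2.2 uses); the function names are ours and are not part of the paper.

```python
import numpy as np

def huberized_hinge(t, delta=2.0):
    """Huberized hinge loss l_delta(t), evaluated elementwise."""
    t = np.asarray(t, dtype=float)
    loss = np.zeros_like(t)
    quad = (t <= 1.0) & (t > 1.0 - delta)        # quadratic middle piece
    lin = t <= 1.0 - delta                       # linear left piece
    loss[quad] = (1.0 - t[quad]) ** 2 / (2.0 * delta)
    loss[lin] = 1.0 - t[lin] - delta / 2.0
    return loss

def huberized_hinge_deriv(t, delta=2.0):
    """First derivative l'_delta(t), needed for the GCD updates."""
    t = np.asarray(t, dtype=float)
    grad = np.zeros_like(t)
    quad = (t <= 1.0) & (t > 1.0 - delta)
    lin = t <= 1.0 - delta
    grad[quad] = -(1.0 - t[quad]) / delta
    grad[lin] = -1.0
    return grad
```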

The second part is used to achieve variable selection. We consider the penalized SVM with a general class of non-convex penalties, such as the smoothly clipped absolute deviation (SCAD) penalty [7] and the minimax concave penalty (MCP) [24]. The SCAD penalty is defined by $p_\lambda(x) = \lambda\int_0^{|x|}\min\{1, (a - t/\lambda)_+/(a-1)\}\,dt$ for some $a > 2$. The MCP is defined by $p_\lambda(x) = \lambda\int_0^{|x|}\{1 - t/(a\lambda)\}_+\,dt$ for some $a > 1$. The experiments with different a values are presented in the supplemental materials. We find our results to be insensitive to these choices, and for brevity, we fixed a = 3.7 for the SCAD penalty and a = 3 for the MCP as suggested in the literature [3,7,27,29].
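
Below is a minimal sketch of the SCAD and MCP penalties and their derivatives (the derivatives enter the local linear approximation used in Section 2.2), assuming the closed forms implied by the integral definitions above; the function names are ours.

```python
import numpy as np

def scad_deriv(x, lam, a=3.7):
    """Derivative p'_lambda(|x|) of the SCAD penalty (a > 2)."""
    x = np.abs(np.asarray(x, dtype=float))
    return lam * np.minimum(1.0, np.maximum(a - x / lam, 0.0) / (a - 1.0))

def mcp_deriv(x, lam, a=3.0):
    """Derivative p'_lambda(|x|) of the MCP (a > 1)."""
    x = np.abs(np.asarray(x, dtype=float))
    return lam * np.maximum(1.0 - x / (a * lam), 0.0)

def scad_penalty(x, lam, a=3.7):
    """SCAD penalty p_lambda(|x|): the integral of scad_deriv from 0 to |x|."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(
        x <= lam,
        lam * x,
        np.where(x <= a * lam,
                 (2.0 * a * lam * x - x ** 2 - lam ** 2) / (2.0 * (a - 1.0)),
                 lam ** 2 * (a + 1.0) / 2.0),
    )

def mcp_penalty(x, lam, a=3.0):
    """MCP p_lambda(|x|): the integral of mcp_deriv from 0 to |x|."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x <= a * lam, lam * x - x ** 2 / (2.0 * a), a * lam ** 2 / 2.0)
```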

The third part mimics a cubic spline to encourage smoothness of the coefficients. As the coefficients of the variables may vary smoothly with position, we encourage smoothness by penalizing the $\ell_2$ norm of the discrete version of the second-order derivatives of the coefficients. Denote the first- and second-order differences (or discrete versions of derivatives) of the coefficients by $\Delta_j\beta := \beta_{j+1} - \beta_j$ and $\Delta_j^{(2)}\beta = \Delta_j\beta - \Delta_{j-1}\beta = \beta_{j+1} - 2\beta_j + \beta_{j-1}$. Then the penalty on the discrete second-order derivatives of the coefficients is $\sum_{j=2}^{p-1}(\Delta_j^{(2)}\beta)^2$. The estimator obtained by minimizing (2) enjoys the structured sparsity property.
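
To make the smoothness term concrete, here is a small sketch that builds the second-difference matrix (written D in Section 2.2) and evaluates the third term of (2); the function names and the example vectors are ours.

```python
import numpy as np

def second_difference_matrix(p):
    """(p-2) x p matrix D with rows (..., 1, -2, 1, ...), so that
    (D @ beta)[j] is the discrete second-order derivative of beta at j."""
    D = np.zeros((p - 2, p))
    for i in range(p - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    return D

def smoothness_penalty(beta, lam2):
    """lambda_2 * sum_j (Delta_j^{(2)} beta)^2, the third term of (2)."""
    D = second_difference_matrix(len(beta))
    return lam2 * np.sum((D @ beta) ** 2)

# A smoothly varying coefficient vector is penalized far less than a rough one.
smooth = np.sin(np.linspace(0.0, np.pi, 50))
rough = np.random.default_rng(0).normal(size=50)
print(smoothness_penalty(smooth, 1.0), smoothness_penalty(rough, 1.0))
```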

Remark 2.1

The idea of the third part in (2) is similar to the spline-lasso [9], which is used in regression. The computation of the spline-lasso can be converted to a lasso problem through a suitable transformation. However, this conversion is no longer applicable when we solve the SVM problem, and the computation is more complicated for the SVM than for regression. The details of the algorithm are shown in Section 2.2.

2.2. Algorithm

We now give the algorithm to solve this problem. Without loss of generality, we assume that the input data are standardized: $\sum_{i=1}^n x_{ij}/n = 0$, $\sum_{i=1}^n x_{ij}^2/n = 1$, $j = 1, 2, \ldots, p$. We use the generalized coordinate descent (GCD) algorithm [22] to solve the structured sparse SVM problem.

When using the GCD algorithm, the loss function $\ell(\cdot)$ should satisfy the following condition:

$$\ell(t + a) \le \ell(a) + \ell'(a)\,t + \frac{M}{2}t^2 \qquad \forall\, a, t, \qquad (3)$$

where M is a constant greater than 0. The corresponding value of M is $2/\delta$ for the Huberized hinge loss. It can be shown that common loss functions such as the Huberized hinge loss, the logistic loss, and the squared hinge loss satisfy the above condition. Although the hinge loss, the loss of the standard SVM, does not satisfy (3), [21] showed that the Huberized hinge loss with parameter $\delta = 0.01$ is nearly identical to the hinge loss.

Let D be a $(p-2)\times p$ matrix with $D_{ii} = D_{i,i+2} = 1$, $D_{i,i+1} = -2$, and $D_{ij} = 0$ otherwise. Given the current estimate $\{\tilde\beta_0, \tilde\beta\}$, define the current margin $r_i = y_i(\tilde\beta_0 + x_i^\top\tilde\beta)$. The coordinate descent algorithm cyclically minimizes

$$F(\beta_j \mid \tilde\beta_0, \tilde\beta) = \frac{1}{n}\sum_{i=1}^{n}\ell\{r_i + y_i x_{ij}(\beta_j - \tilde\beta_j)\} + p_{\lambda_1}(|\beta_j|) + \lambda_2 (D^\top D)_{jj}\beta_j^2 + 2\lambda_2\sum_{l=1,\,l\ne j}^{p}(D^\top D)_{lj}\tilde\beta_l\beta_j \qquad (4)$$

with respect to $\beta_j$. According to the local linear approximation (LLA) [29], we have $p_{\lambda_1}(|\beta_j|) \approx p_{\lambda_1}(|\tilde\beta_j|) + p'_{\lambda_1}(|\tilde\beta_j|)(|\beta_j| - |\tilde\beta_j|)$ for $\beta_j \approx \tilde\beta_j$. As pointed out by a referee, the CCCP (constrained concave–convex procedure) algorithm is also an efficient algorithm for solving this problem, which is worth investigating as future work. When $\ell(\cdot)$ satisfies (3), we have $F(\beta_j \mid \tilde\beta_0, \tilde\beta) \le \hat F(\beta_j \mid \tilde\beta_0, \tilde\beta)$, where

$$\hat F(\beta_j \mid \tilde\beta_0, \tilde\beta) = \frac{1}{n}\sum_{i=1}^{n}\ell(r_i) + \frac{1}{n}\sum_{i=1}^{n}\ell'(r_i)\,y_i x_{ij}(\beta_j - \tilde\beta_j) + \frac{M}{2}(\beta_j - \tilde\beta_j)^2 + p'_{\lambda_1}(|\tilde\beta_j|)\,|\beta_j| + \lambda_2 (D^\top D)_{jj}\beta_j^2 + 2\lambda_2\sum_{l=1,\,l\ne j}^{p}(D^\top D)_{lj}\tilde\beta_l\beta_j. \qquad (5)$$

Since $\hat F$ is a quadratic majorization of F, we obtain the new update by minimizing $\hat F$:

$$\hat\beta_j^{\mathrm{new}} = \arg\min_{\beta_j}\hat F(\beta_j \mid \tilde\beta_0, \tilde\beta) = \frac{S\big(z,\ p'_{\lambda_1}(|\tilde\beta_j|)\big)}{M + 2\lambda_2 (D^\top D)_{jj}}, \qquad (6)$$

where $S(z, t) = (|z| - t)_+\,\mathrm{sign}(z)$ and $z = M\tilde\beta_j - \frac{1}{n}\sum_{i=1}^{n}\ell'(r_i)\,y_i x_{ij} - 2\lambda_2\sum_{l=1,\,l\ne j}^{p}(D^\top D)_{lj}\tilde\beta_l$.

Likewise, we can update the intercept by minimizing

$$\hat F(\beta_0 \mid \tilde\beta_0, \tilde\beta) = \frac{1}{n}\sum_{i=1}^{n}\ell(r_i) + \frac{1}{n}\sum_{i=1}^{n}\ell'(r_i)\,y_i(\beta_0 - \tilde\beta_0) + \frac{M}{2}(\beta_0 - \tilde\beta_0)^2. \qquad (7)$$

Then the intercept is updated by

$$\hat\beta_0^{\mathrm{new}} = \arg\min_{\beta_0}\hat F(\beta_0 \mid \tilde\beta_0, \tilde\beta) = \tilde\beta_0 - \frac{\sum_{i=1}^{n}\ell'(r_i)\,y_i}{Mn}. \qquad (8)$$

Then we can iterate (4)–(8) until convergence.
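
The following is one possible end-to-end sketch of the GCD sweep described by (4)–(8), written for the Huberized hinge loss and the SCAD penalty via its LLA derivative. It reuses the helper functions sketched earlier (huberized_hinge_deriv, scad_deriv, second_difference_matrix). Details such as the convergence check, the full recomputation of the margins at each coordinate, and the zero initialization are our simplifications, not the authors' implementation (the paper initializes with the $\ell_1$-penalized SVM, see Remark 2.2).

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = (|z| - t)_+ sign(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def gcd_structured_svm(X, y, lam1, lam2, delta=2.0, a=3.7, max_sweeps=100, tol=1e-5):
    """Cyclic GCD updates (6) and (8) for objective (2), as a rough sketch."""
    n, p = X.shape
    M = 2.0 / delta                                   # majorization constant from the text
    D = second_difference_matrix(p)
    DtD = D.T @ D
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(max_sweeps):
        beta_old = beta.copy()
        for j in range(p):
            r = y * (beta0 + X @ beta)                # current margins r_i
            lp = huberized_hinge_deriv(r, delta)
            z = (M * beta[j]
                 - np.mean(lp * y * X[:, j])
                 - 2.0 * lam2 * (DtD[:, j] @ beta - DtD[j, j] * beta[j]))
            beta[j] = soft_threshold(z, scad_deriv(beta[j], lam1, a)) / (M + 2.0 * lam2 * DtD[j, j])
        r = y * (beta0 + X @ beta)
        beta0 -= np.mean(huberized_hinge_deriv(r, delta) * y) / M      # intercept update (8)
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta0, beta
```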

Remark 2.2

In this paper, we specify the initial value using the $\ell_1$-penalized SVM, following [27], which leads to satisfactory results.

Remark 2.3

The above algorithm satisfies the majorization–minimization (MM) principle [5,13,14], and the MM principle ensures the descent property of the GCD algorithm. The proof is similar to [21,22] and is omitted here.

2.3. Asymptotic properties

In this subsection, we establish the theory of the local oracle property for the structured sparse SVM, namely the oracle estimator is one of the local minimizers of (2).

Since $\beta_0$ does not affect variable selection, we set $\beta_0 = 0$ in this section without loss of generality. Let $\beta^* = (\beta_1^*, \beta_2^*, \ldots, \beta_p^*)^\top$ denote the true parameter value, which is defined as the minimizer of the population loss: $\beta^* = \arg\min_\beta L(\beta) = \arg\min_\beta E\{\ell(y\,x^\top\beta)\}$.

We use $p_n$ in this section to denote the number of features. Let $A = \{j : \beta_j^* \ne 0,\ 1 \le j \le p_n\}$ be the index set of the non-zero coefficients and $A^c = \{j : \beta_j^* = 0,\ 1 \le j \le p_n\}$ be the index set of the zero coefficients. $q_n = |A|$ is the cardinality of the set A. $D_A$ is the submatrix formed by removing from D the columns corresponding to the elements of $A^c$, and $\beta_{A^c} = (\beta_j)_{j\in A^c}$ is the vector composed of the components of $\beta$ corresponding to the elements of $A^c$. Then the oracle estimator $\hat\beta$ is defined as $\hat\beta = \arg\min_{\beta_{A^c}=0}\{\frac{1}{n}\sum_{i=1}^{n}\ell(y_i x_{iA}^\top\beta_A) + \lambda_{2n}\beta_A^\top D_A^\top D_A\beta_A\}$.

Theorem 2.1

Assume that Conditions 1–5 listed in the Appendix hold. Let $B_n(\lambda_{1n}, \lambda_{2n})$ be the set of local minimizers of the objective function

$$Q_n(\beta) = L_n(\beta) + \sum_{j=1}^{p_n} p_{\lambda_{1n}}(|\beta_j|) + \lambda_{2n}\beta^\top D^\top D\beta$$

with regularization parameters $\lambda_{1n}$, $\lambda_{2n}$. The oracle estimator $\hat\beta = (\hat\beta_A^\top, 0^\top)^\top$ satisfies

$$\Pr\{\hat\beta \in B_n(\lambda_{1n}, \lambda_{2n})\} \to 1$$

as $n \to \infty$, if $q_n\sqrt{\log p_n}\,\log n/\sqrt{n} = o(\lambda_{1n})$, $\lambda_{2n} q_n^{1/2} n^{-1/2} = o(\lambda_{1n})$, and $\lambda_{1n} = o(n^{-(1-c_3)/2})$.

From Theorem 2.1, we can see that if we take $\lambda_{1n} = n^{-1/2+\tau}$ for some $c_1 < \tau < c_3/2$, then the oracle property holds even for $p = o\{\exp(n^{(\tau - c_1)/2})\}$. Thus, even when the number of covariates grows exponentially with the sample size, the local oracle property still holds for the structured sparse SVM.

3. Simulations

In this section, numerical experiments are conducted to study the performance of our proposed method. We use Spline-penalty-HSVM, where the penalty is SCAD or MCP, to denote our proposed method (i.e. Spline-SCAD-HSVM and Spline-MCP-HSVM). To investigate the performance, we compare the proposed method with alternatives that do not consider structured sparsity: SCAD-HSVM and MCP-HSVM.

Three data generation processes are considered in this paper. We set n = 100 and p = 1000. In Example 3.1, the non-zero coefficients of the variables are completely smooth in position. Partial non-zero coefficients are smooth in Example 3.2. In Example 3.3, the non-zero coefficients are not smooth. Within each example, our simulated data consist of a training set and a testing set. Models are fitted on the training data only, and the testing set with sample size 500 is used to evaluate the predictions of each method. The optimal regularization parameters $\lambda_1$ and $\lambda_2$ are selected on a 15-by-20 mesh grid through 5-fold cross validation. Possible values of $\lambda_2$ are $\{0.1, 0.2, \ldots, 1.9, 2\}$. For each fixed $\lambda_2$, we compute the solutions for a fine grid of $\lambda_1$ values. Following [21], we start with $\lambda_1^{(\max)}$, the smallest $\lambda_1$ that sets all $\beta_j$ to zero, and set $\lambda_1^{(\min)} = 0.01\lambda_1^{(\max)}$. Between $\lambda_1^{(\min)}$ and $\lambda_1^{(\max)}$, 15 points are placed uniformly on the log scale. We then select the optimal regularization parameters that achieve the maximum classification accuracy rate. A small sketch of this tuning grid is given below, followed by the details of the three scenarios.
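
A minimal sketch of the grid construction just described; the value of lambda_1_max below is only a placeholder, since in practice it is computed from the data as the smallest $\lambda_1$ that zeroes out all coefficients.

```python
import numpy as np

# 20 values of lambda_2 and, for each, 15 values of lambda_1 placed
# log-uniformly between 0.01 * lambda_1_max and lambda_1_max.
lam2_grid = np.round(np.arange(0.1, 2.01, 0.1), 1)
lam1_max = 1.0  # placeholder; data-dependent in practice
lam1_grid = np.exp(np.linspace(np.log(0.01 * lam1_max), np.log(lam1_max), 15))
```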

Example 3.1

Consider $x \sim N(0, \Sigma)$ with $\Sigma = (0.5^{|i-j|})_{p\times p}$, $\beta_j = j/40$ for $j = 1, 2, \ldots, 20$; $\beta_j = 1 - j/40$ for $j = 21, \ldots, 40$; $\beta_j = \sin(\pi j/40)$ for $j = 81, \ldots, 120$; $\beta_j = 0.5\{1 - \cos(\pi j/20)\}$ for $j = 161, \ldots, 200$; and $\beta_j = 0$ otherwise. $\Pr(y = 1 \mid x) = 1/\{1 + \exp(-x^\top\beta)\}$ and $\Pr(y = -1 \mid x) = 1/\{1 + \exp(x^\top\beta)\}$. The Bayes rule is $\mathrm{sgn}(x^\top\beta)$ with Bayes error 5.2%.
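
For concreteness, the following is a small sketch of how one could simulate a data set under our reading of Example 3.1; the function name and the random seed are ours, and the label mechanism assumes the logistic form stated above.

```python
import numpy as np

def make_example_3_1(n=100, p=1000, rho=0.5, seed=0):
    """Generate (X, y, beta) following the layout described in Example 3.1."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # Sigma_{ij} = 0.5^{|i-j|}
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    j = np.arange(1, p + 1, dtype=float)
    beta = np.zeros(p)
    beta[0:20] = j[0:20] / 40.0                                        # j = 1,...,20
    beta[20:40] = 1.0 - j[20:40] / 40.0                                # j = 21,...,40
    beta[80:120] = np.sin(np.pi * j[80:120] / 40.0)                    # j = 81,...,120
    beta[160:200] = 0.5 * (1.0 - np.cos(np.pi * j[160:200] / 20.0))    # j = 161,...,200
    prob = 1.0 / (1.0 + np.exp(-X @ beta))                             # Pr(y = 1 | x)
    y = np.where(rng.uniform(size=n) < prob, 1, -1)
    return X, y, beta
```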

Example 3.2

The setting of x is the same as in Example 3.1. $\beta_j$ takes the same value as in Example 3.1 for $j = 1, 2, \ldots, 120$; $\beta_j \sim \mathrm{Uniform}(-0.5, 0.5)$ for $j = 160, \ldots, 200$; and $\beta_j = 0$ otherwise. $\Pr(y = 1 \mid x) = \Phi(x^\top\beta)$, where $\Phi(\cdot)$ is the distribution function of the standard normal distribution. The Bayes rule is $\mathrm{sgn}(x^\top\beta)$ with Bayes error 9.1%.

Example 3.3

The generation of x and y is the same as in Example 3.1. $\beta_j \sim \mathrm{Uniform}(0, 1)$ for $j = 1, 2, \ldots, 40$; and $\beta_j = 0$ otherwise. The Bayes rule is $\mathrm{sgn}(x^\top\beta)$ with Bayes error 8.3%.

The performance of different methods will be examined in two aspects: classification prediction and feature selection. In the evaluation of classification prediction, the classification accuracy rate (ACC), area under curve (AUC), true positive rate (TPR) and false positive rate (FPR) are adopted. As for feature selection, we compare TPR and FPR for different methods.

The results of the simulations are shown in Table 1. When the coefficients vary smoothly, the SVM with the spline penalty is significantly better in classification prediction and variable selection than the SVM without the spline penalty (Examples 3.1 and 3.2). The proposed method performs better especially when the non-zero coefficients are completely smooth in position, in terms of classification prediction. In Figure 2, we present the estimation results for the correlated features in Example 3.1 by the four different methods. From the figure, we can conclude that both Spline-SCAD-HSVM and Spline-MCP-HSVM give good estimates of the coefficients, while SCAD-HSVM and MCP-HSVM do not clean out the noisy signals very well. The improvement is not surprising, since Spline-SCAD-HSVM and Spline-MCP-HSVM can capture the smooth changes in the coefficients. When the non-zero coefficients are not smooth, which means the structural information described in this paper does not exist, the performances of the methods are similar (Example 3.3). This observation indicates that the proposed models are also applicable even when the coefficients do not vary smoothly.

Table 1.

The simulation results obtained from 100 Monte Carlo repetitions (with standard errors in parentheses).

  Classification prediction Variable selection
Method ACC AUC TPR FPR TPR FPR
Example 3.1
SCAD-HSVM 0.782 (0.021) 0.874 (0.018) 0.769 (0.046) 0.205 (0.040) 0.732 (0.038) 0.309 (0.029)
MCP-HSVM 0.775 (0.025) 0.867 ( 0.024) 0.789 (0.031) 0.237 (0.047) 0.761 (0.028) 0.304 (0.036)
Spline-SCAD-HSVM 0.922 (0.022) 0.983 (0.011) 0.911 (0.034) 0.067 (0.030) 0.874 (0.019) 0.254 (0.021)
Spline-MCP-HSVM 0.932 (0.010) 0.987 (0.010) 0.920 (0.033) 0.055 (0.028) 0.868 (0.024) 0.143 (0.019)
Example 3.2
SCAD-HSVM 0.756 (0.011) 0.845 (0.008) 0.723 (0.010) 0.258 (0.008) 0.611 (0.019) 0.253 (0.008)
MCP-HSVM 0.738 (0.015) 0.834 (0.007) 0.715 (0.009) 0.263 (0.008) 0.606 (0.020) 0.265 (0.009)
Spline-SCAD-HSVM 0.798 (0.013) 0.853 (0.011) 0.729 (0.009) 0.259 (0.008) 0.799 (0.019) 0.223 (0.010)
Spline-MCP-HSVM 0.804 (0.012) 0.854 (0.012) 0.731 (0.015) 0.257 (0.008) 0.799 (0.020) 0.225 (0.011)
Example 3.3
SCAD-HSVM 0.788 (0.009) 0.880 (0.012) 0.799 (0.010) 0.244 (0.002) 0.782 (0.018) 0.215 (0.005)
MCP-HSVM 0.788 (0.009) 0.880 (0.012) 0.799 (0.010) 0.244 (0.002) 0.781 (0.020) 0.215 (0.005)
Spline-SCAD-HSVM 0.801 (0.006) 0.878 (0.011) 0.791 (0.010) 0.290 (0.003) 0.790 (0.019) 0.229 (0.011)
Spline-MCP-HSVM 0.808 (0.006) 0.878 (0.011) 0.795 (0.010) 0.281 (0.003) 0.790 (0.015) 0.229 (0.012)

Figure 2. The average estimation results for the correlated features in Example 3.1 over 100 Monte Carlo repetitions, by four different methods: SCAD-HSVM, MCP-HSVM, Spline-SCAD-HSVM and Spline-MCP-HSVM. The solid curve is the true $\beta$, and the scatter dots represent the estimates for each method.

4. Real data analysis

In this section, we apply our methods to the Ovarian Dataset 8-7-02. The dataset is provided by the US Food and Drug Administration (FDA) and the National Cancer Institute (NCI) and can be downloaded from http://home.ccr.cancer.gov/. The data were collected as serum samples from normal subjects and cancer patients, and the mass spectrometry technique was combined with the WCX2 protein chip and SELDI-TOF. The sample set includes 91 controls and 162 ovarian cancers, which were not randomized. Each mass spectrometry sample contains a 15,154-dimensional mass-to-charge ratio (m/z)/intensity profile. As mentioned in Section 1, the features are ordered in a meaningful way. Following the original researchers, we ignored m/z sites below 100, where chemical artifacts can occur [16].

We randomly choose 173 samples from the data as the training set, and the remaining 80 samples are used as the testing set. Four methods with the Huberized hinge loss, i.e. SCAD-HSVM, MCP-HSVM, Spline-SCAD-HSVM, and Spline-MCP-HSVM, are fitted using the training set. Additional results with other losses are provided in the supplemental materials. Tuning parameters are chosen by 5-fold cross validation based on the training set. We select the optimal regularization parameters that achieve the maximum classification accuracy rate among the grid points using a two-dimensional grid search. We run the sample-splitting procedure 100 times, and the results are summarized in Table 2. We can see that the performance of Spline-SCAD-HSVM and Spline-MCP-HSVM is slightly better than that of SCAD-HSVM and MCP-HSVM. The ACC, AUC and TPR are slightly higher when we consider the model that explicitly incorporates the special structure among the features.

Table 2.

Results of 100 random splits of the ovarian cancer dataset (with standard errors in parentheses).

Method ACC AUC TPR FPR
SCAD-HSVM 0.919 (0.025) 0.968 (0.010) 0.930 (0.021) 0.061 (0.012)
MCP-HSVM 0.921 (0.026) 0.971 (0.008) 0.932 (0.022) 0.060 (0.015)
Spline-SCAD-HSVM 0.947 (0.018) 0.989 (0.004) 0.972 (0.009) 0.043 (0.011)
Spline-MCP-HSVM 0.947 (0.019) 0.992 (0.003) 0.975 (0.009) 0.041 (0.010)

To complement the estimation and identification analysis, we also evaluate the stability of the analysis by computing the observed occurrence index (OOI). For each feature identified using the training data, we compute its probability of being identified out of the 100 resamplings; this probability has been referred to as the OOI (a small sketch of this computation is given below). The median OOI values of SCAD-HSVM, MCP-HSVM, Spline-SCAD-HSVM, and Spline-MCP-HSVM are 0.736, 0.739, 0.857, and 0.862, respectively. Figure 3 shows the number of selected proteins versus the selection frequency of the four different methods over the 100 random splits. We can conclude that the OOI values of the models with the spline term are significantly higher, which indicates that Spline-SCAD-HSVM and Spline-MCP-HSVM are more stable than SCAD-HSVM and MCP-HSVM.
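
Under our reading of the OOI definition above (and not the authors' code), the computation could be sketched as follows; the function name and argument layout are ours.

```python
import numpy as np

def observed_occurrence_index(selected, identified):
    """selected: (B, p) boolean matrix; selected[b, j] is True if feature j is
    chosen on resample b. identified: indices of features identified on the
    full training data. Returns each identified feature's selection frequency
    over the B resamples (its OOI) and the median OOI reported in the text."""
    freq = np.asarray(selected, dtype=float).mean(axis=0)
    ooi = freq[np.asarray(identified)]
    return ooi, float(np.median(ooi))
```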

Figure 3. The number of selected proteins versus the selection frequency for four different methods: (a) SCAD-HSVM, (b) MCP-HSVM, (c) Spline-SCAD-HSVM, and (d) Spline-MCP-HSVM.

5. Discussion

In this article, we consider a high-dimensional data classification problem in which the features are ordered in some meaningful way. When the coefficients are sparse and change smoothly, we propose a structured sparse SVM, which combines a non-convex penalty and a cubic-spline-type estimation procedure (i.e. penalizing the second-order derivatives of the coefficients) with the SVM, and we prove that it satisfies the local oracle property under some conditions. The simulation and empirical results show that the proposed method achieves higher classification and prediction accuracy than the existing methods.

In the future, we plan to extend this work to more complex data, for example, high-dimensional data whose variables have group structure and whose intra-group features are ordered in some meaningful way. Moreover, our approach could also be extended to the frameworks of semi-supervised learning and multi-class classification.

Supplementary Material

Supplementary_Data

Acknowledgments

We would like to thank the editor, associate editor and two reviewers for their constructive comments that have led to a significant improvement of the manuscript.

Appendices.

Appendix 1. Regularity conditions.

To facilitate our technical proofs, we impose the following regularity conditions.

Condition A.1

The loss function $\ell(\cdot)$ is convex and has a continuous first-order derivative. There exist constants $M_1$ and $M_2$ such that $|\ell'(t)| \le M_1(|t| + 1)$ and $|\partial(\ell'(t))| \le M_2$ for all t, where $\partial(\cdot)$ represents the subgradient.

Condition A.2

$q_n = O(n^{c_1})$, $0 \le c_1 < 1/2$; $\lambda_{2n}\|D\beta^*\| = O(n^{-c_2})$, $(1 - c_1)/2 < c_2 \le 1/2$.

Condition A.3

The Hessian matrix $H(\beta_A^*) = E[\nabla^2\ell(y\,x_A^\top\beta_A^*)]$ satisfies

$$0 < M_3 < \lambda_{\min}\{H(\beta_A^*)\} \le \lambda_{\max}\{H(\beta_A^*)\} < M_4 < \infty,$$

where $x_A$ is the subvector of x formed by removing the components corresponding to the elements of $A^c$, and $\lambda_{\min}$ and $\lambda_{\max}$ denote the smallest and largest eigenvalues, respectively.

Condition A.4

There is a constant $M_5 > 0$ such that $\lambda_{\max}(n^{-1}X_A^\top X_A) \le M_5$, where $X_A$ denotes the $n\times q_n$ design submatrix with rows $x_{iA}^\top$. It is further assumed that the $x_{ij}$ are sub-Gaussian random variables for $1 \le i \le n$, $j \in A^c$.

Condition A.5 Condition on the true model dimension —

There exist positive constants $c_3$ and $M_6$ such that $1 - c_1 \le c_3 \le 1$ and $n^{(1-c_3)/2}\min_{j\in A}|\beta_j^*| \ge M_6$.

Remark A.1

Condition 1 requires that the loss function be smooth and that its derivative change gently, which is satisfied by common SVM loss functions such as the Huberized hinge loss and the squared hinge loss. Condition 2 states that the number of non-zero coefficients cannot diverge faster than $n^{1/2}$ and that the coefficients change slowly in position, which supports our introduction of the spline penalty. Under Condition 3, the Hessian matrix of the loss function is assumed to be positive definite with uniformly bounded eigenvalues. The condition on the largest eigenvalue of the design matrix, which is assumed in Condition 4, is similar to that of [23,26,27]. Condition 5 simply states that the signals cannot decay too quickly.

Appendix 2. Some lemmas.

The proof of Theorem 2.1 relies on the following lemmas.

Lemma A.1

Assume that Conditions 1–5 are satisfied. Then the oracle estimator satisfies $\|\hat\beta_A - \beta_A^*\| = O_p(\sqrt{q_n/n})$.

Proof.

Let $\alpha_n = \sqrt{q_n/n}$ and $Q_n(\beta_A) = \frac{1}{n}\sum_{i=1}^{n}\ell(y_i x_{iA}^\top\beta_A) + \lambda_{2n}\beta_A^\top D_A^\top D_A\beta_A$. We want to show that for any given $\epsilon > 0$, there exists a constant $C > 0$ such that

$$\Pr\Big\{\inf_{\|u\| = C} Q_n(\beta_A^* + \alpha_n u) > Q_n(\beta_A^*)\Big\} \ge 1 - \epsilon. \qquad (\mathrm{A.1})$$

This implies that there exists a local minimum in the ball $\{\beta_A^* + \alpha_n u : \|u\| \le C\}$ with probability at least $1 - \epsilon$. Hence, there exists a local minimizer such that $\|\hat\beta_A - \beta_A^*\| = O_p(\alpha_n)$.

Let

$$\Lambda_n(u) = Q_n(\beta_A^* + \alpha_n u) - Q_n(\beta_A^*) = \frac{1}{n}\sum_{i=1}^{n}\big[\ell\{y_i x_{iA}^\top(\beta_A^* + \alpha_n u)\} - \ell(y_i x_{iA}^\top\beta_A^*)\big] + \lambda_{2n}\big\{(\beta_A^* + \alpha_n u)^\top D_A^\top D_A(\beta_A^* + \alpha_n u) - \beta_A^{*\top} D_A^\top D_A\beta_A^*\big\}.$$

By applying a Taylor series expansion around $\beta_A^*$, we have

$$\begin{aligned}
\Lambda_n(u) &= \frac{1}{n}\sum_{i=1}^{n}\Big[\alpha_n\nabla\ell(y_i x_{iA}^\top\beta_A^*)^\top u + \frac{\alpha_n^2}{2}u^\top\nabla^2\ell(y_i x_{iA}^\top\tilde\beta_A)u\Big] + 2\lambda_{2n}\alpha_n\beta_A^{*\top} D_A^\top D_A u + \lambda_{2n}\alpha_n^2 u^\top D_A^\top D_A u\\
&\ge \frac{\alpha_n}{n}\sum_{i=1}^{n}\nabla\ell(y_i x_{iA}^\top\beta_A^*)^\top u + \frac{\alpha_n^2}{2n}\,u^\top\Big\{\sum_{i=1}^{n}\nabla^2\ell(y_i x_{iA}^\top\tilde\beta_A)\Big\}u + 2\lambda_{2n}\alpha_n\beta_A^{*\top} D_A^\top D_A u\\
&=: I_1 + I_2 + I_3, \qquad (\mathrm{A.2})
\end{aligned}$$

where $\tilde\beta_A = \beta_A^* + \alpha_n t u$, $0 < t < 1$. By Conditions 1–3, we have

$$|I_1| = \Big|\frac{\alpha_n}{n}\sum_{i=1}^{n}\nabla\ell(y_i x_{iA}^\top\beta_A^*)^\top u\Big| \le \alpha_n\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla\ell(y_i x_{iA}^\top\beta_A^*)\Big\|\,\|u\| = \alpha_n\,O_p(\sqrt{q_n/n})\,\|u\| = O_p(\alpha_n^2)\,\|u\|.$$

With Conditions 1 and 4, using the Chebyshev inequality similarly as in [8], we have, when $q_n = O(n^{c_1})$, $0 \le c_1 < 1/2$,

$$\Pr\Big\{\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla^2\ell(y_i x_{iA}^\top\beta_A^*) - H(\beta_A^*)\Big\| \ge \epsilon\,q_n^{-1}\Big\} \le \frac{q_n^2}{n\epsilon^2} = o(1).$$

Thus

$$\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla^2\ell(y_i x_{iA}^\top\beta_A^*) - H(\beta_A^*)\Big\| = o_p(q_n^{-1}).$$

Then

$$I_2 = \frac{1}{2}\alpha_n^2\,u^\top H(\beta_A^*)\,u\,\{1 + o_p(1)\}.$$

By choosing a sufficiently large C, the second term $I_2$ dominates the first term $I_1$ uniformly in $\|u\| = C$. By the Cauchy–Schwarz inequality and Condition 2, we have

$$|I_3| = |2\lambda_{2n}\alpha_n\beta_A^{*\top} D_A^\top D_A u| \le 2\lambda_{2n}\alpha_n\|D_A\beta_A^*\|\,\|D_A u\| = 2\lambda_{2n}\alpha_n\|D\beta^*\|\sqrt{u^\top D_A^\top D_A u} \le 2\lambda_{2n}\alpha_n\|D\beta^*\|\,\|u\|\sqrt{\lambda_{\max}(D_A^\top D_A)} = o(\alpha_n^2)\,\|u\|.$$

This is also dominated by the second term of (A.2). Hence, by choosing a sufficiently large C, (A.1) holds. This completes the proof of the lemma.

Lemma A.2

Assume that Conditions 1–5 hold and that $q_n\sqrt{\log p_n}\,\log n/\sqrt{n} = o(\lambda_{1n})$. Then

$$\Pr\Big\{\max_{j\in A^c}\Big|\frac{1}{n}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)\Big| > \frac{\lambda_{1n}}{2}\Big\} \to 0,$$

as $n \to \infty$.

Proof.

Recall that $E[n^{-1}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)] = 0$, and that $\max_i|x_{ij}| = O_p(\log n)$ for sub-Gaussian random variables. For some positive constant C, we have

$$|y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)| \le M_1|x_{ij}|\,(|x_{iA}^\top\beta_A^*| + 1) \le M_1|x_{ij}|\,(\|x_{iA}\|\,\|\beta_A^*\| + 1) \le C q_n\log n.$$

By Lemma 14.11 of [2], we have

$$\Pr\Big\{\Big|\frac{1}{n}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)\Big| > \frac{\lambda_{1n}}{2}\Big\} \le 2\exp\Big(-\frac{n\lambda_{1n}^2}{8C^2 q_n^2\log^2 n}\Big).$$

Then

$$\begin{aligned}
\Pr\Big\{\max_{j\in A^c}\Big|\frac{1}{n}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)\Big| > \frac{\lambda_{1n}}{2}\Big\} &= \Pr\Big\{\bigcup_{j\in A^c}\Big|\frac{1}{n}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)\Big| > \frac{\lambda_{1n}}{2}\Big\}\\
&\le 2p_n\exp\Big(-\frac{n\lambda_{1n}^2}{8C^2 q_n^2\log^2 n}\Big) = 2\exp\Big\{\log p_n\Big(1 - \frac{n\lambda_{1n}^2}{8C^2 q_n^2\log p_n\log^2 n}\Big)\Big\} \to 0
\end{aligned}$$

as $n \to \infty$, by the fact that $q_n\sqrt{\log p_n}\,\log n/\sqrt{n} = o(\lambda_{1n})$.

Lemma A.3

Suppose that Conditions 1–5 hold, $q_n\sqrt{\log p_n}\,\log n/\sqrt{n} = o(\lambda_{1n})$, $\lambda_{2n} q_n^{1/2} n^{-1/2} = o(\lambda_{1n})$, and $\lambda_{1n} = o(n^{-(1-c_3)/2})$. For $j = 1, 2, \ldots, p_n$, denote

$$s_j(\hat\beta) = \frac{\partial\big[L_n(\beta) + \lambda_{2n}\beta^\top D^\top D\beta\big]}{\partial\beta_j}\bigg|_{\beta = \hat\beta}.$$

For the oracle estimator $\hat\beta$ and $s_j(\hat\beta)$, with probability approaching 1, we have

$$s_j(\hat\beta) = 0,\quad |\hat\beta_j| \ge \Big(a + \frac{1}{2}\Big)\lambda_{1n},\qquad j\in A,$$
$$|s_j(\hat\beta)| \le \lambda_{1n},\quad |\hat\beta_j| = 0,\qquad j\in A^c.$$

Proof.

The objective function $L_n(\beta_A) + \lambda_{2n}\beta_A^\top D_A^\top D_A\beta_A$ is convex and differentiable. By convex optimization theory, we have $s_j(\hat\beta) = 0$ for $j\in A$.

Note that $\min_{j\in A}|\hat\beta_j| \ge \min_{j\in A}|\beta_j^*| - \max_{j\in A}|\hat\beta_j - \beta_j^*|$. Furthermore, we have $\min_{j\in A}|\beta_j^*| \ge M_6 n^{-(1-c_3)/2}$ by Condition 5, and $\max_{j\in A}|\hat\beta_j - \beta_j^*| \le \|\hat\beta_A - \beta_A^*\| = O_p(\sqrt{q_n/n}) = O_p(n^{-(1-c_1)/2}) = o_p(n^{-(1-c_3)/2})$. Then, since $\lambda_{1n} = o(n^{-(1-c_3)/2})$, we have

$$\Pr\Big\{|\hat\beta_j| \ge \Big(a + \frac{1}{2}\Big)\lambda_{1n}\Big\} \to 1,\qquad \text{for } j\in A.$$

For $j\in A^c$, we have

$$s_j(\hat\beta) = \frac{1}{n}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\hat\beta_A) + 2\lambda_{2n}\sum_{l=1}^{p_n}(D^\top D)_{lj}\hat\beta_l. \qquad (\mathrm{A.3})$$

We observe that

$$\begin{aligned}
\Pr\Big\{\max_{j\in A^c}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\hat\beta_A)\Big| > \lambda_{1n}\Big\} &\le \Pr\Big\{\max_{j\in A^c}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\beta_A^*)\Big| > \frac{\lambda_{1n}}{2}\Big\}\\
&\quad + \Pr\Big\{\max_{j\in A^c}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\big[\ell'(y_i x_{iA}^\top\hat\beta_A) - \ell'(y_i x_{iA}^\top\beta_A^*)\big]\Big| > \frac{\lambda_{1n}}{2}\Big\}. \qquad (\mathrm{A.4})
\end{aligned}$$

By Lemma A.2, the first term of inequality (A.4) is $o_p(1)$. From Lemma A.1, the second term of inequality (A.4) is bounded by

$$\Pr\Big\{\max_{j\in A^c}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\big[\ell'(y_i x_{iA}^\top\hat\beta_A) - \ell'(y_i x_{iA}^\top\beta_A^*)\big]\Big| > \frac{\lambda_{1n}}{2}\Big\} \le \Pr\Big\{\max_{j\in A^c}\sup_{\|\beta_A - \beta_A^*\| \le C\sqrt{q_n/n}}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\big[\ell'(y_i x_{iA}^\top\beta_A) - \ell'(y_i x_{iA}^\top\beta_A^*)\big]\Big| > \frac{\lambda_{1n}}{2}\Big\}. \qquad (\mathrm{A.5})$$

Together with Conditions 1 and 4, we have

$$\begin{aligned}
&\max_{j\in A^c}\sup_{\|\beta_A - \beta_A^*\| \le C\sqrt{q_n/n}}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\big[\ell'(y_i x_{iA}^\top\beta_A) - \ell'(y_i x_{iA}^\top\beta_A^*)\big]\Big|\\
&\quad\le M_2\sup_{\|\beta_A - \beta_A^*\| \le C\sqrt{q_n/n}}\max_{i,j}|x_{ij}|\sqrt{n^{-1}\sum_{i=1}^{n}(\beta_A - \beta_A^*)^\top x_{iA}x_{iA}^\top(\beta_A - \beta_A^*)}\\
&\quad= M_2\sup_{\|\beta_A - \beta_A^*\| \le C\sqrt{q_n/n}}\max_{i,j}|x_{ij}|\sqrt{(\beta_A - \beta_A^*)^\top(n^{-1}X_A^\top X_A)(\beta_A - \beta_A^*)}\\
&\quad\le M_2\sup_{\|\beta_A - \beta_A^*\| \le C\sqrt{q_n/n}}\max_{i,j}|x_{ij}|\,\|\beta_A - \beta_A^*\|\sqrt{\lambda_{\max}(n^{-1}X_A^\top X_A)}\\
&\quad= O_p\{\log(p_n n)\}\sqrt{q_n/n} = o_p(\lambda_{1n}), \qquad (\mathrm{A.6})
\end{aligned}$$

as $n \to \infty$, by the fact that $q_n\sqrt{\log p_n}\,\log n/\sqrt{n} = o(\lambda_{1n})$.

By (A.4)–(A.6), as $n \to \infty$, we have

$$\Pr\Big\{\max_{j\in A^c}\Big|n^{-1}\sum_{i=1}^{n}y_i x_{ij}\,\ell'(y_i x_{iA}^\top\hat\beta_A)\Big| > \lambda_{1n}\Big\} \to 0. \qquad (\mathrm{A.7})$$

Then, by the structure of the matrix $D^\top D$,

$$2\lambda_{2n}\Big|\sum_{l=1}^{p_n}(D^\top D)_{lj}\hat\beta_l\Big| \le 2\lambda_{2n}\Big|\sum_{l=1}^{p_n}(D^\top D)_{lj}(\hat\beta_l - \beta_l^*)\Big| + 2\lambda_{2n}\Big|\sum_{l=1}^{p_n}(D^\top D)_{lj}\beta_l^*\Big| \le 2\sqrt{70}\,\lambda_{2n}\|\hat\beta_A - \beta_A^*\| + 2\sqrt{6}\,\lambda_{2n}\|D\beta^*\| = o_p(\lambda_{1n}). \qquad (\mathrm{A.8})$$

By (A.3), (A.7) and (A.8), we have $|s_j(\hat\beta)| \le \lambda_{1n}$ for $j\in A^c$.

As the oracle estimator $\hat\beta$ is defined as

$$\hat\beta = \arg\min_{\beta_{A^c} = 0}\Big\{\frac{1}{n}\sum_{i=1}^{n}\ell(y_i x_{iA}^\top\beta_A) + \lambda_{2n}\beta_A^\top D_A^\top D_A\beta_A\Big\},$$

$|\hat\beta_j| = 0$ for $j\in A^c$ holds naturally.

Appendix 3. Proof of Theorem 2.1.

Proof.

Let

$$Q_n(\beta) = L_n(\beta) + \sum_{j=1}^{p_n}p_{\lambda_{1n}}(|\beta_j|) + \lambda_{2n}\beta^\top D^\top D\beta \overset{\Delta}{=} g(\beta) - h(\beta),$$

where

$$g(\beta) = L_n(\beta) + \lambda_{1n}\sum_{j=1}^{p_n}|\beta_j| + \lambda_{2n}\beta^\top D^\top D\beta,\qquad h(\beta) = \lambda_{1n}\sum_{j=1}^{p_n}|\beta_j| - \sum_{j=1}^{p_n}p_{\lambda_{1n}}(|\beta_j|).$$

With $Q_n(\beta)$ written as $g(\beta) - h(\beta)$, we need to show that $\hat\beta$ is a local minimizer of $Q_n(\beta)$. Based on Lemma A.3, the proof is similar to that of Theorem 3.2 in [27], and we omit it here.

Funding Statement

This study was supported by the National Natural Science Foundation of China [grant number 11971404], [grant number 71471152], Humanity and Social Science Youth Foundation of Ministry of Education of China [grant number 19YJC910010], [grant number 20YJC910004], the 111 Project (B13028) and Fundamental Research Funds for the Central Universities [grant number 20720181003].

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

1. Bradley P.S. and Mangasarian O.L., Feature selection via concave minimization and support vector machines, ICML 98 (1998), pp. 82–90.
2. Bühlmann P. and van de Geer S., Statistics for High-dimensional Data: Methods, Theory and Applications, Springer, New York, 2011.
3. Chan W.H., Mohamad M.S., Deris S., Corchado J.M., and Kasim S., An improved gSVM-SCADL2 with firefly algorithm for identification of informative genes and pathways, Int. J. Bioinf. Res. Appl. 12 (2016), pp. 72–93.
4. Chen W.J. and Tian Y.J., $\ell_p$-norm proximal support vector machine and its applications, Procedia Computer Sci. 1 (2010), pp. 2417–2423.
5. De Leeuw J. and Heiser W.J., Convergence of correction matrix algorithms for multidimensional scaling, in Geometric Representations of Relational Data, Vol. 36, Mathesis Press, Ann Arbor, 1977, pp. 735–752.
6. Fan J. and Fan Y., High dimensional classification using features annealed independence rules, Ann. Stat. 36 (2008), pp. 2605–2637.
7. Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360.
8. Fan J. and Peng H., Nonconcave penalized likelihood with a diverging number of parameters, Ann. Stat. 32 (2004), pp. 928–961.
9. Guo J., Hu J., Jing B.Y., and Zhang Z., Spline-lasso in high-dimensional linear regression, J. Am. Stat. Assoc. 111 (2016), pp. 288–297.
10. Guyon I., Weston J., Barnhill S., and Vapnik V., Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (2002), pp. 389–422.
11. Hastie T., Tibshirani R., and Friedman J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, 2001.
12. Hebiri M. and van de Geer S., The smooth-lasso and other $\ell_1+\ell_2$-penalized methods, Electron. J. Stat. 5 (2011), pp. 1184–1226.
13. Hunter D.R. and Lange K., A tutorial on MM algorithms, Am. Stat. 58 (2004), pp. 30–37.
14. Lange K., Hunter D.R., and Yang I., Optimization transfer using surrogate objective functions, J. Comput. Graph. Stat. 9 (2000), pp. 1–20.
15. Mangasarian O.L., A finite Newton method for classification, Optim. Methods Softw. 17 (2002), pp. 913–929.
16. Petricoin III E.F., Ardekani A.M., Hitt B.A., Levine P.J., Fusaro V.A., Steinberg S.M., Mills G.B., Simone C., Fishman D.A., Kohn E.C., and Liotta L.A., Use of proteomic patterns in serum to identify ovarian cancer, The Lancet 359 (2002), pp. 572–577.
17. Rosset S. and Zhu J., Piecewise linear regularized solution paths, Ann. Stat. 35 (2007), pp. 1012–1030.
18. Tibshirani R., Saunders M., Rosset S., Zhu J., and Knight K., Sparsity and smoothness via the fused lasso, J. R. Stat. Soc. Ser. B (Methodological) 67 (2005), pp. 91–108.
19. Vapnik V., The Nature of Statistical Learning Theory, Springer, New York, 1995.
20. Wang L., Zhu J., and Zou H., Hybrid huberized support vector machines for microarray classification and gene selection, Bioinformatics 24 (2008), pp. 412–419.
21. Yang Y. and Zou H., An efficient algorithm for computing the HHSVM and its generalizations, J. Comput. Graph. Stat. 22 (2013), pp. 396–415.
22. Yang Y. and Zou H., A fast unified algorithm for solving group-lasso penalized learning problems, Stat. Comput. 25 (2015), pp. 1129–1141.
23. Yuan M., High dimensional inverse covariance matrix estimation via linear programming, J. Mach. Learn. Res. 11 (2010), pp. 2261–2286.
24. Zhang C.H., Nearly unbiased variable selection under minimax concave penalty, Ann. Stat. 38 (2010), pp. 894–942.
25. Zhang H.H., Ahn J., Lin X., and Park C., Gene selection using support vector machines with non-convex penalty, Bioinformatics 22 (2005), pp. 88–95.
26. Zhang C.H. and Huang J., The sparsity and bias of the lasso selection in high-dimensional linear regression, Ann. Stat. 36 (2008), pp. 1567–1594.
27. Zhang X., Wu Y., Wang L., and Li R., Variable selection for support vector machines in moderately high dimensions, J. R. Stat. Soc. Ser. B (Methodological) 78 (2016), pp. 53–76.
28. Zhu J., Rosset S., Hastie T., and Tibshirani R., 1-norm support vector machines, Adv. Neural Inf. Process. Syst. 16 (2004), pp. 49–56.
29. Zou H. and Li R., One-step sparse estimates in nonconcave penalized likelihood models, Ann. Stat. 36 (2008), pp. 1509–1533.
