Journal of Applied Statistics. 2021 Sep 17;50(3):703–723. doi: 10.1080/02664763.2021.1975662

A generalized l2,p-norm regression based feature selection algorithm

X. Zhi, J. Liu, S. Wu, C. Niu
PMCID: PMC9930865  PMID: 36819074

Abstract

Feature selection is an important data dimension reduction method, and it has been used widely in applications involving high-dimensional data such as genetic data analysis and image processing. To achieve robust feature selection, recent works apply the l2,1-norm or l2,p-norm of a matrix to the loss function and regularization terms in regression, and have achieved encouraging results. However, these existing works rigidly set the matrix norms used in the loss function and the regularization terms to the same l2,1 or l2,p-norm, which limits their applications. In addition, the solution algorithms they present either have high computational complexity and are not suitable for large data sets, or cannot provide satisfying performance due to approximate calculation. To address these problems, we present a generalized l2,p-norm regression based feature selection (l2,p-RFS) method based on a new optimization criterion. The criterion extends the optimization criterion of l2,p-RFS to the case where the loss function and the regularization terms in regression use different matrix norms. We cast the new optimization criterion in a regression framework without regularization. In this framework, the new criterion can be solved using an iterative re-weighted least squares (IRLS) procedure, in which each least squares problem can be solved efficiently by the least squares QR decomposition (LSQR) algorithm. We have conducted extensive experiments to evaluate the proposed algorithm on various well-known gene expression and image data sets, and compared it with other related feature selection methods.

Keywords: Feature selection; sparse regression; l2,p-norm; iterative re-weighted least squares; least square QR decomposition

1. Introduction

In many applications such as genetic data analysis, image processing and data mining, one often encounters very high-dimensional data. Some features of the high-dimensional data are related to the target task, while many others are redundant [23]. Therefore, dimension reduction has become an important stage of data preprocessing in such applications [12,13]. Feature selection and feature extraction are the two main dimension reduction methods [2,22]. Feature extraction transforms the original data into a new low-dimensional subspace, whereas feature selection selects a low-dimensional subset of the original high-dimensional features according to certain rules. The latter retains the original representation of the data without changing the original features and is therefore interpretable, while the former is not [23]. Over the years, research on feature selection has received more and more attention and has made considerable progress.

Generally, according to how the classification algorithm is incorporated in evaluating and selecting features, feature selection methods can be organized into three categories [9,23]: wrapper methods, filter methods, and embedded methods. Compared with filter methods, wrapper methods and embedded methods are tightly coupled with a specific classifier, so they often perform well but at a very high computational cost. In contrast, filter methods such as the F-statistic [4], Laplacian score (LS) [10], ReliefF (RF) [15], minimum redundancy maximum relevance (mRMR) [21] and trace ratio (TR) [19] are often very efficient. In this paper, we focus on filter-type methods for supervised feature selection.

In recent years, filter-type feature selection techniques based on the combination of a transformation method (such as linear discriminant analysis [8] or regression) with sparse regularization theory have become a hot spot in the study of feature selection [16,18,23,24]. The sparse regularization technique forces the transformation matrix to have more zero rows, thereby enabling feature selection. For multi-class problems, these methods search for a subset of features shared by all the classes, and so are also known as multi-task feature learning (MTFL) [1]. As far as we know, Masaeli et al. [16] first proposed to combine LDA with sparse regularization of the transformation matrix, obtaining a new filter-based feature selection algorithm named linear discriminant feature selection (LDFS). By enforcing row sparsity of the transformation matrix of LDA via $l_{\infty,1}$-norm regularization, LDFS uses both the discriminative information and the learning mechanism [23]. Nie et al. [18] proposed an l2,1-norm based sparse regression feature selection (l2,1-RFS) algorithm by applying the l2,1-norm to both the loss function and the regularization of regression, which can be solved using the Lagrangian multiplier method. The convergence of the l2,1-RFS algorithm has been proved, and experimental results on a large amount of data have verified that l2,1-RFS is able to perform robust feature selection. To obtain a sparser solution, Wang et al. extended l2,1-RFS to the l2,p-norm ($0<p\le 1$) and proposed the l2,p-norm based feature selection (l2,p-RFS) algorithm [24]. Considering the expensive computation of the Lagrangian multiplier based algorithm, Wang et al. further proposed a one-step gradient projection based algorithm to reduce the complexity. However, the one-step gradient projection based l2,p-RFS algorithm in fact also has expensive computational costs, and the approximation in its solution process has a large impact on the classification accuracy of the subsequent classification algorithm. In [11], the sparse regression based feature selection method was extended to the case of feature selection directly on matrix data. In [17], the l2,1-norm based sparse regression feature selection method was modified to achieve 'robust and flexible' feature selection by adding an additional l1,2 regularization term.

It is beneficial to use the l2,p-norm rather than the traditional F-norm in the loss function of sparse regression based feature selection methods. However, in both l2,1-RFS [18] and l2,p-RFS [24], the matrix norms of the loss function and the regularization term are strictly set to be the same, which limits their application. In this paper, we extend l2,p-RFS to the case where the least-squares regression loss term and the regularization term use different matrix norms. The main contributions of this paper include:

  • We propose a new sparse regression based feature selection method, namely $l^{\alpha}_{2,\beta}$-RFS, which extends the l2,p-RFS method to the case where the loss function and the regularization terms in regression use different matrix norms.

  • We present an effective and efficient algorithm for $l^{\alpha}_{2,\beta}$-RFS by casting it in an IRLS framework, and also present a proof of the convergence of the proposed algorithm.

  • We have conducted extensive experiments on gene expression and image data sets to evaluate the effectiveness of $l^{\alpha}_{2,\beta}$-RFS and compare it with other related feature selection methods, especially the sparse regression based feature selection methods.

The rest of this article is organized as follows: Section 2 introduces the notations and definitions used in this paper and briefly reviews two related works. In Section 3, a new criterion for the l2,p-RFS method is proposed, and an effective and efficient algorithm is presented for it. A comprehensive study of the performance of the $l^{\alpha}_{2,\beta}$-RFS algorithm is presented in Section 4. We conclude in Section 5 with a discussion of related future work.

2. Related works

In this section, we introduce the notations and the definitions of norms used in this paper and briefly review two main related works: l2,1-RFS [18] and l2,p-RFS [24].

2.1. Notations and definitions

In this section, we introduce the notations and definitions of norms used in this paper. Matrices are written as boldface uppercase letters. Vectors are written as boldface lowercase letters. For a matrix $\mathbf{M}=(m_{ij})$, its $i$th row and $j$th column are denoted by $\mathbf{m}^i$ and $\mathbf{m}_j$, respectively. $\mathbf{M}^T$ denotes the transpose of $\mathbf{M}$.

The $l_p$-norm ($p>0$) of a vector $\mathbf{v}\in\mathbb{R}^n$ is defined as $\|\mathbf{v}\|_p=\big(\sum_{i=1}^n |v_i|^p\big)^{1/p}$, and the $l_0$-norm is defined as $\|\mathbf{v}\|_0=\sum_{i=1}^n |v_i|^0$. Strictly speaking, neither the $l_p$-norm ($0<p<1$) nor the $l_0$-norm is a valid norm, because the former does not satisfy the triangle inequality and the latter does not satisfy positive scalability [23].

The $l_{r,p}$-norm, $l_{2,0}$-norm, $l_{2,1}$-norm and $l_{2,p}$-norm ($0<p\le 1$) of a matrix $\mathbf{M}$ are defined as

$$\|\mathbf{M}\|_{r,p}=\Big(\sum_{i=1}^{n}\Big(\sum_{j=1}^{m}|m_{ij}|^{r}\Big)^{\frac{p}{r}}\Big)^{\frac{1}{p}}=\Big(\sum_{i=1}^{n}\|\mathbf{m}^i\|_r^p\Big)^{\frac{1}{p}},$$

$$\|\mathbf{M}\|_{2,0}=\sum_{i=1}^{n}\Big(\sum_{j=1}^{m}|m_{ij}|^{2}\Big)^{0}=\sum_{i=1}^{n}\|\mathbf{m}^i\|_2^0,\qquad \|\mathbf{M}\|_{2,1}=\sum_{i=1}^{n}\Big(\sum_{j=1}^{m}|m_{ij}|^{2}\Big)^{\frac{1}{2}}=\sum_{i=1}^{n}\|\mathbf{m}^i\|_2,$$

and

$$\|\mathbf{M}\|_{2,p}=\Big(\sum_{i=1}^{n}\Big(\sum_{j=1}^{m}|m_{ij}|^{2}\Big)^{\frac{p}{2}}\Big)^{\frac{1}{p}}=\Big(\sum_{i=1}^{n}\|\mathbf{m}^i\|_2^p\Big)^{\frac{1}{p}},$$

respectively.
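As a quick sanity check on the row-wise definition above, the $l_{2,p}$-norm of a matrix can be computed directly; the following NumPy sketch (the function name `l2p_norm` is ours, not from the paper) illustrates it:

```python
import numpy as np

def l2p_norm(M, p):
    """Compute ||M||_{2,p} = (sum_i ||m^i||_2^p)^(1/p) over the rows m^i of M."""
    row_norms = np.linalg.norm(M, axis=1)      # l2-norm of each row
    return np.sum(row_norms ** p) ** (1.0 / p)
```

For p = 2 this coincides with the Frobenius norm, and for p = 1 it reduces to the l2,1-norm.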

In the field of pattern recognition and machine learning, the l2,1-norm and l2,p-norm of a matrix can be used not only to improve the robustness of an algorithm to data with outliers [5,18,24,25], but also to induce sparsity of the transformation matrix for feature selection [1,16,18,23,24], or for both purposes [18,24].

2.2. l2,1-RFS algorithm

Given the data matrix $\mathbf{X}=[\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_n]\in\mathbb{R}^{d\times n}$ and the corresponding class label matrix $\mathbf{B}=[\mathbf{b}_1,\mathbf{b}_2,\dots,\mathbf{b}_c]\in\mathbb{R}^{n\times c}$, the optimization criterion of the l2,1-RFS algorithm [18] is defined as

$$\min_{\mathbf{G}} J(\mathbf{G})=\|\mathbf{X}^T\mathbf{G}-\mathbf{B}\|_{2,1}+\gamma\|\mathbf{G}\|_{2,1}, \tag{1}$$

where $\mathbf{G}$ is the regression matrix to be solved and $\gamma$ is the regularization coefficient: the larger it is, the sparser $\mathbf{G}$ becomes, and the fewer features are selected.

In the optimization criterion (1), the l2,1-norm serves two purposes: it enforces row sparsity of the regression transformation matrix $\mathbf{G}$, and it makes the regression loss robust, since the residual is not squared and outliers therefore carry less weight than with a squared residual [18].

The unconstrained optimization problem (1) can be transformed equivalently into the following constrained optimization problem:

$$\min_{\mathbf{Y}} \|\mathbf{Y}\|_{2,1}\quad \text{s.t.}\quad \mathbf{M}\mathbf{Y}=\mathbf{B}, \tag{2}$$

where $\mathbf{Y}=[\mathbf{G}^T,\mathbf{E}^T]^T\in\mathbb{R}^{m\times c}$, $\mathbf{M}=[\mathbf{X}^T,\gamma\mathbf{I}]\in\mathbb{R}^{n\times m}$, $\mathbf{E}=\frac{1}{\gamma}(\mathbf{B}-\mathbf{X}^T\mathbf{G})$, $m=n+d$, and $\mathbf{I}\in\mathbb{R}^{n\times n}$ is an identity matrix. Solving problem (2) with the Lagrange multiplier method yields $\mathbf{Y}=\mathbf{D}^{-1}\mathbf{M}^T(\mathbf{M}\mathbf{D}^{-1}\mathbf{M}^T)^{-1}\mathbf{B}$, where $\mathbf{D}$ is a diagonal matrix whose $i$th diagonal element is $d_{ii}=\frac{1}{2\|\mathbf{y}^i\|_2}$, with $\mathbf{y}^i$ the $i$th row of $\mathbf{Y}$. Computing $\mathbf{Y}$ requires $\mathbf{D}$, and computing $\mathbf{D}$ in turn requires $\mathbf{Y}$, so an iterative algorithm is needed to obtain the final $\mathbf{Y}$. The detailed algorithm steps are presented in Algorithm 1.
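To make the alternating update concrete, here is a minimal dense NumPy sketch of the iteration described above (the function name and the small ε safeguard against zero rows are our choices; the paper's exact procedure is Algorithm 1):

```python
import numpy as np

def l21_rfs(X, B, gamma, n_iter=30, eps=1e-6):
    """Sketch of the l2,1-RFS iteration: Y = D^{-1} M^T (M D^{-1} M^T)^{-1} B.

    X : (d, n) data matrix;  B : (n, c) class label matrix.
    Returns G, the (d, c) regression matrix whose row norms rank features.
    """
    d, n = X.shape
    M = np.hstack([X.T, gamma * np.eye(n)])        # (n, m) with m = n + d
    m = n + d
    Dinv = np.ones(m)                              # D^{-1} stored as a vector
    for _ in range(n_iter):
        DM = Dinv[:, None] * M.T                   # D^{-1} M^T, shape (m, n)
        Y = DM @ np.linalg.solve(M @ DM, B)        # (m, c)
        # d_ii = 1 / (2 ||y^i||_2), so D^{-1}_ii = 2 ||y^i||_2 (+ eps guard)
        Dinv = 2.0 * np.sqrt((Y ** 2).sum(axis=1)) + eps
    return Y[:d]                                   # G is the top d rows of Y
```

Features are then ranked by the row norms of the returned G, as in the text.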

[Algorithm 1]

The time complexity of the l2,1-RFS algorithm (Algorithm 1) can be analyzed as follows. Line 2 takes $O(nm^2+n^2m)$ time to compute the matrix $\mathbf{M}\mathbf{D}_t^{-1}\mathbf{M}^T$, $O(n^3)$ time to invert it, and $O(nm^2+n^2m+nmc)$ time for the remaining matrix multiplications. Line 3 takes $O(mc)$ time to compute the diagonal matrix $\mathbf{D}_{t+1}$. Hence, the total time complexity of the l2,1-RFS algorithm is

$$O\big(t(nm^2+n^2m+n^3+nmc)\big),$$

where t is the number of iterations. The l2,1-RFS algorithm [18] adopts the l2,1-norm to ensure the robustness of the loss function and the sparsity of the regularization. However, the application of the algorithm is easily limited, since the norms used in the loss function and the regularization term are rigidly fixed to the same l2,1-norm. In addition, from the complexity analysis above, it can be seen that this algorithm is computationally expensive for relatively large data sets.

2.3. l2,p-norm regression based feature selection algorithm

In order to obtain a more robust and sparser solution, Wang et al. extended l2,1-RFS [18] to the l2,p-norm and proposed the l2,p-RFS algorithm [24]. In l2,p-RFS, the l2,p-norm ($0<p\le 1$) is adopted for both the regression loss and the regularization terms. Its optimization criterion is defined as

$$\min_{\mathbf{G}} J(\mathbf{G})=\|\mathbf{X}^T\mathbf{G}-\mathbf{B}\|_{2,p}^{p}+\gamma\|\mathbf{G}\|_{2,p}^{p}, \tag{3}$$

where $p\in(0,1]$.

For any $p\in(0,1]$, a distant outlier contributes no more to (3) than to (1). Thus model (3) is expected to be more robust than (1) [24].

Similar to the solution method of l2,1-RFS [18], the above unconstrained optimization problem can be transformed equivalently into the following constrained optimization problem:

$$\min_{\mathbf{Y}} \|\mathbf{Y}\|_{2,p}^{p}\quad \text{s.t.}\quad \mathbf{M}\mathbf{Y}=\mathbf{B}, \tag{4}$$

where $\mathbf{Y}=[\mathbf{G}^T,\mathbf{E}^T]^T\in\mathbb{R}^{m\times c}$, $\mathbf{M}=[\mathbf{X}^T,\gamma\mathbf{I}]\in\mathbb{R}^{n\times m}$, $\mathbf{E}=\frac{1}{\gamma}(\mathbf{B}-\mathbf{X}^T\mathbf{G})$ and $m=n+d$. Using the Lagrangian multiplier method, the solution to problem (4) can be obtained as $\mathbf{Y}=\mathbf{D}^{-1}\mathbf{M}^T(\mathbf{M}\mathbf{D}^{-1}\mathbf{M}^T)^{-1}\mathbf{B}$, where $\mathbf{D}$ is a diagonal matrix whose $i$th diagonal element is $d_{ii}=\frac{1}{2\|\mathbf{y}^i\|_2^{2-p}}$. However, similar to l2,1-RFS [18], this solution involves a matrix inversion that is computationally expensive for large data sets. Note that when $\mathbf{D}$ is fixed, the optimization problem (4) can be equivalently transformed into:

$$\min_{\mathbf{Y}} f_k(\mathbf{Y})=\frac{1}{2}\operatorname{Tr}(\mathbf{Y}^T\mathbf{D}_k\mathbf{Y})\quad \text{s.t.}\quad \mathbf{M}\mathbf{Y}=\mathbf{B}. \tag{5}$$

Wang et al. proposed to apply the projection gradient algorithm and its one-step approximation to solve the constrained nonlinear optimization problem (5) efficiently [24]. The detailed steps of the one-step projection gradient based l2,p-RFS algorithm are described in Algorithm 2.

[Algorithm 2]

The time complexity of the l2,p-RFS algorithm (Algorithm 2) can be analyzed as follows. Line 1 takes $O(n^2m)$ time for the QR decomposition. Line 2 takes $O(nm^2)+O(mn^2)+O(n^3)+O(nmc)$ time for matrix multiplication and matrix inversion. Line 3 takes $O(t(m^2c+mc^2))$ time for iteratively solving the transformation matrix. Hence, the total time complexity of the l2,p-RFS algorithm is

$$O\big(nm^2+mn^2+n^3+nmc+t(m^2c+mc^2)\big).$$

l2,p-RFS [24] extends l2,1-RFS [18] to the l2,p-norm ($0<p\le 1$), providing more models to choose from in practical applications and thus increasing the flexibility of the original algorithm. However, we find in experiments that the numerical approximation in the one-step gradient projection method costs the algorithm some accuracy. In addition, from the complexity analysis above, it can be seen that the one-step gradient projection based l2,p-RFS algorithm is not as efficient as the authors claimed, and it is computationally expensive for relatively large data sets.

3. Proposed method

The regression loss function and the regularization terms in the criteria of the l2,1-RFS [18] and l2,p-RFS [24] algorithms use the same l2,1-norm or l2,p-norm ($0<p\le 1$), respectively. The norm in the regression loss function both measures the residual of the regression and tolerates the influence of outliers in the data [18,24], while the norm in the regularization term achieves sparseness of the regression transformation matrix. Since the two norms play different roles, rigidly setting them to the same norm may prevent the algorithms from obtaining good feature selection performance. Therefore, in this section, we consider a more general form of l2,p-RFS that uses different l2,p-norms for the regression loss function and the penalty term. We thus obtain a new criterion for l2,p-RFS, which is denoted $l^{\alpha}_{2,\beta}$-RFS in the following discussion for convenience.

3.1. A new criterion for l2,p-RFS

$l^{\alpha}_{2,\beta}$-RFS can be formulated as the following optimization problem:

$$\min_{\mathbf{G}} J(\mathbf{G})=\|\mathbf{X}^T\mathbf{G}-\mathbf{B}\|_{2,\alpha}^{\alpha}+\gamma\|\mathbf{G}\|_{2,\beta}^{\beta}, \tag{6}$$

where $\alpha$ ($0<\alpha\le 2$) and $\beta$ ($0<\beta<2$) are the norm indexes of the regression loss function and the penalty term respectively, and they are allowed to take different values. From criterion (6), we can see that the proposed $l^{\alpha}_{2,\beta}$-RFS criterion is a general form of the l2,p-RFS criterion [24], to which it degenerates when $\alpha=\beta$ ($0<\alpha,\beta\le 1$). At the same time, setting the two norm exponents separately better ensures the flexibility of the algorithm when facing complex data in practical applications. For example, we can set the norm index of the regression loss function to $\alpha=2$ in the proposed criterion; the regression loss then becomes exactly the squared F-norm, which corresponds to the assumption of normally distributed residuals, an assumption that has been widely verified to be valid. Note that the range of the parameter p in l2,p-RFS [24] cannot be expanded from (0, 1] to (0, 2] directly, since the norms of the loss metric and the regularization term in l2,p-RFS are fixed to the same l2,p-norm ($0<p\le 1$): if p were set to 2, the regularization term would become the squared F-norm of the regression transformation matrix, which cannot induce row sparsity, and so feature selection could not be performed.

3.2. An effective and efficient algorithm

The algorithm for the l2,p-RFS criterion cannot be used for the newly proposed $l^{\alpha}_{2,\beta}$-RFS criterion. In order to present an effective and efficient algorithm for the new criterion (6), we first show that the optimization problem (6) can be converted into a nonlinear regression problem without a regularization term.

In fact, the criterion in problem (6) can be reformulated as

$$J(\mathbf{G})=\|\mathbf{X}^T\mathbf{G}-\mathbf{B}\|_{2,\alpha}^{\alpha}+\gamma\|\mathbf{I}_d\mathbf{G}-\mathbf{0}\|_{2,\beta}^{\beta}=\sum_{i=1}^{n}\|\mathbf{x}_i^T\mathbf{G}-\mathbf{b}^i\|_2^{\alpha}+\sum_{r=1}^{d}\|\mu\mathbf{e}_r^T\mathbf{G}-\mathbf{0}^r\|_2^{\beta}=\sum_{i=1}^{n}\Big(\sum_{j=1}^{c}(\mathbf{x}_i^T\mathbf{g}_j-b_{ij})^2\Big)^{\frac{\alpha}{2}}+\sum_{r=1}^{d}\Big(\sum_{j=1}^{c}(\mu\mathbf{e}_r^T\mathbf{g}_j-0)^2\Big)^{\frac{\beta}{2}}, \tag{7}$$

where $\mu=\gamma^{1/\beta}$, $\mathbf{b}^i$ is the $i$th row of $\mathbf{B}$, $\mathbf{I}_d$ is the $d\times d$ identity matrix, $\mathbf{e}_r$ is its $r$th column, $\mathbf{0}\in\mathbb{R}^{d\times c}$ is a zero matrix and $\mathbf{0}^r$ is its $r$th row. Let $\mathbf{U}=[\mathbf{X},\mu\mathbf{I}_d]\in\mathbb{R}^{d\times m}$, $\mathbf{V}=[\mathbf{B}^T,\mathbf{0}^T]^T\in\mathbb{R}^{m\times c}$ and $m=n+d$. Then the optimization problem (6) can be equivalently written as

$$\min_{\mathbf{G}} J(\mathbf{G})=\sum_{i=1}^{m}\|\mathbf{u}_i^T\mathbf{G}-\mathbf{v}^i\|_2^{\theta}, \tag{8}$$

where $\theta=\alpha$ if $i\le n$ and $\theta=\beta$ otherwise, $\mathbf{u}_i$ is the $i$th column of $\mathbf{U}$, and $\mathbf{v}^i$ is the $i$th row of $\mathbf{V}$.

We now show that problem (8) can be solved using the iteratively re-weighted least squares (IRLS) method [14, Section 4.5.2]. Taking the derivative of $J(\mathbf{G})$ in Equation (8) w.r.t. $\mathbf{g}_j$ and setting $\frac{\partial J(\mathbf{G})}{\partial \mathbf{g}_j}=\mathbf{0}$, we obtain the first-order optimality condition

$$\sum_{i=1}^{m}\theta\Big(\sum_{j=1}^{c}(\mathbf{u}_i^T\mathbf{g}_j-v_{ij})^2\Big)^{\frac{\theta-2}{2}}(\mathbf{u}_i\mathbf{u}_i^T\mathbf{g}_j-\mathbf{u}_i v_{ij})=\mathbf{0}, \tag{9}$$

where $v_{ij}$ is the $j$th component of $\mathbf{v}^i$. Let

$$w_i=\theta\Big(\sum_{j=1}^{c}(\mathbf{u}_i^T\mathbf{g}_j-v_{ij})^2+\varepsilon\Big)^{\frac{\theta-2}{2}},\qquad \varepsilon>0, \tag{10}$$

where the small constant $\varepsilon$ prevents the residual in Equation (10) from being zero. Then we have

$$\sum_{i=1}^{m}w_i\mathbf{u}_i\mathbf{u}_i^T\mathbf{g}_j=\sum_{i=1}^{m}w_i\mathbf{u}_i v_{ij}. \tag{11}$$

From Equation (11) we see that $\mathbf{G}$ can be solved iteratively: with $\mathbf{g}_l$ $(l=1,2,\dots,c,\ l\ne j)$ fixed, solve Equation (11) to update $\mathbf{g}_j$; repeat this process over each column of $\mathbf{G}$ until no column of $\mathbf{G}$ changes.

When solving for $\mathbf{g}_j$, Equation (11) is a nonlinear equation w.r.t. $\mathbf{g}_j$:

$$\sum_{i=1}^{m}w_i(\dots,\mathbf{g}_j,\dots)\,\mathbf{u}_i\mathbf{u}_i^T\mathbf{g}_j=\sum_{i=1}^{m}w_i(\dots,\mathbf{g}_j,\dots)\,\mathbf{u}_i v_{ij}. \tag{12}$$

Equation (12) can be solved with an iterative algorithm: given $w_i$ $(i=1,2,\dots,m)$, $\mathbf{g}_j$ is computed by solving the remaining linear equation in (12); and given $\mathbf{g}_j$, the weights $w_i$ $(i=1,2,\dots,m)$ are updated by computing Equation (10). That is, we can solve Equation (12) by executing the following iterative scheme:

$$\sum_{i=1}^{m}w_i(\dots,\mathbf{g}_j^{(k)},\dots)\,\mathbf{u}_i\mathbf{u}_i^T\mathbf{g}_j^{(k+1)}=\sum_{i=1}^{m}w_i(\dots,\mathbf{g}_j^{(k)},\dots)\,\mathbf{u}_i v_{ij}. \tag{13}$$

Note that, given $w_i$ $(i=1,2,\dots,m)$, the solution to Equation (12) is exactly the solution to the weighted least squares problem

$$\min_{\mathbf{g}_j} J(\mathbf{g}_j)=\sum_{i=1}^{m}w_i(\mathbf{u}_i^T\mathbf{g}_j-v_{ij})^2. \tag{14}$$

So the solution procedure for $\mathbf{g}_j$ discussed above is in essence an IRLS algorithm [3], which we denote IRLS-g in the following discussion. Note that the weighted least squares problem (14) can be converted into a standard least squares problem, which can be solved efficiently using the LSQR algorithm [20]. Algorithm 3 presents the detailed steps of IRLS-g.

[Algorithm 3: IRLS-g]
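As an illustration of IRLS-g, the following sketch solves the single-column case (c = 1), where the weight in Equation (10) involves only one residual per sample. Each weighted least squares subproblem (14) is reduced to a standard one by scaling rows with √wᵢ and then solved with SciPy's LSQR, as in the text; the function name and iteration counts are our choices:

```python
import numpy as np
from scipy.sparse.linalg import lsqr

def irls_g(U, v, theta, n_iter=5, eps=1e-6):
    """IRLS sketch for min_g sum_i |u_i^T g - v_i|^theta (c = 1 case).

    U : (d, m) matrix whose columns are u_i;  v : (m,) target vector.
    """
    A = U.T                        # (m, d): rows are u_i^T
    g = lsqr(A, v)[0]              # unweighted least squares start
    for _ in range(n_iter):
        r2 = (A @ g - v) ** 2      # squared residuals
        w = theta * (r2 + eps) ** ((theta - 2.0) / 2.0)   # weights, Eq. (10)
        s = np.sqrt(w)
        g = lsqr(s[:, None] * A, s * v)[0]   # weighted LS via row scaling
    return g
```

For θ = 2 the weights are constant, so the iteration reproduces the ordinary least squares solution, which gives a convenient correctness check.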

Based on the IRLS- g algorithm above, we can present an iterative algorithm for the optimal G:

  1. Call the IRLS-g algorithm (Algorithm 3) repeatedly to update each column of G in turn, until all the columns of G have been updated.

  2. Repeat this process until G is no longer updated.

In this way, we finally obtain the solution of problem (8), namely the solution of the original problem (6). Summarizing the above analysis, we obtain Algorithm 4. In addition, Algorithm 4 can be initialized using the solution to the following optimization problem:

$$\min_{\mathbf{G}^{(0)}} J(\mathbf{G}^{(0)})=\|\mathbf{U}^T\mathbf{G}^{(0)}-\mathbf{V}\|_F^2+\gamma\|\mathbf{G}^{(0)}\|_F^2. \tag{15}$$

Problem (15) is a standard squared F-norm regularized regression problem, which can also be solved efficiently using the LSQR algorithm [20].
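The initialization (15) is a ridge regression, and LSQR handles it directly through its damping parameter: with damp = √γ, LSQR minimizes ||Ax − b||² + γ||x||². A small sketch (the function name is ours):

```python
import numpy as np
from scipy.sparse.linalg import lsqr

def init_G(U, V, gamma):
    """Solve (15): min ||U^T G - V||_F^2 + gamma ||G||_F^2, column by column.

    LSQR with damp = sqrt(gamma) minimizes ||A x - b||^2 + gamma ||x||^2,
    so each column of G^(0) is one damped LSQR solve.
    """
    A = U.T                                    # (m, d)
    cols = [lsqr(A, V[:, j], damp=np.sqrt(gamma))[0] for j in range(V.shape[1])]
    return np.stack(cols, axis=1)              # (d, c)
```

The result agrees with the closed-form ridge solution (AᵀA + γI)⁻¹AᵀV while avoiding the explicit matrix inversion.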

[Algorithm 4]

3.3. Time complexity analysis

The time complexity of the proposed IRLS-g algorithm (Algorithm 3) can be analyzed as follows. Line 1 takes $O(cdm)$ time to update $w_i$ $(i=1,2,\dots,m)$; Line 2 takes $O(t_1(3m+5d+2md))$ time to update $\mathbf{g}_j$ with the LSQR algorithm, where $t_1$ is the number of LSQR iterations. So the total time complexity of the proposed IRLS-g algorithm is $O(t_2(cdm+t_1(3m+5d+2md)))$. Based on this, the time complexity of the proposed $l^{\alpha}_{2,\beta}$-RFS algorithm (Algorithm 4) can be analyzed as follows. Line 1 takes $O(t_0c(3m+5d+2md))$ time for computing $\mathbf{G}^{(0)}$ with the LSQR algorithm [20]. Line 2 takes $O(ct_2(cdm+t_1(3m+5d+2md)))$ time for updating $\mathbf{g}_j$ $(j=1,2,\dots,c)$ by running Algorithm 3. Line 3 takes $O(dc)$ time for computing $b(\mathbf{G})$, and $O(ds)$ time for sorting the features. Hence, the total time complexity of the proposed $l^{\alpha}_{2,\beta}$-RFS algorithm is

$$O\big(t_0c(3m+5d+2md)+t_3ct_2(cdm+t_1(3m+5d+2md))\big),$$

which can be further simplified to

$$O\big((t_0+t_3t_2t_1)c(3m+5d+2md)+t_3c^2dm\big).$$

3.4. Convergence analysis

Theorem 3.1

  1. When $0<\theta\le 2$, the objective function sequence $J(\mathbf{G}^{(k)})$ generated by Algorithm 4 is monotonically decreasing and convergent.

  2. When $1\le\theta\le 2$, the matrix sequence $\mathbf{G}^{(k)}$ generated by Algorithm 4, or a subsequence of it, converges to a global minimum point of problem (6); when $0<\theta<1$, the sequence $\mathbf{G}^{(k)}$ generated by Algorithm 4, or a subsequence of it, converges to a local minimum point of problem (6).

Proof.

  1. Let $Z_{ij}=\sum_{l=1,l\ne j}^{c}(\mathbf{u}_i^T\mathbf{g}_l-v_{il})^2$ and $r_{i,j}=\mathbf{u}_i^T\mathbf{g}_j-v_{ij}$. With $\mathbf{g}_l$ $(l=1,2,\dots,c,\ l\ne j)$ fixed, problem (8) can be rewritten as
     $$\min_{\mathbf{g}_j} J(\mathbf{g}_j)=\sum_{i=1}^{m}(r_{i,j}^2+Z_{ij})^{\frac{\theta}{2}}. \tag{16}$$
     Without loss of generality, problem (16) can be formulated as the following optimization problem:
     $$\min_{\mathbf{g}} J(\mathbf{g})=\sum_{i=1}^{m}(r_i^2+Z_i)^{\frac{\theta}{2}}, \tag{17}$$
     where $r_i=\mathbf{u}_i^T\mathbf{g}-v_{ij}$. First, we claim that, when running Algorithm 3, $J(\mathbf{g})$ decreases at each iteration, that is, $J(\mathbf{g}^{(k+1)})\le J(\mathbf{g}^{(k)})$. In fact, for $r>0$, let $\rho_i(r)=(r^2+Z_i)^{\frac{\theta}{2}}$ and $h_i(r)=(r+Z_i)^{\frac{\theta}{2}}$. It follows that $\rho_i(r)=h_i(r^2)$ and $w_i(r)=2h_i'(r^2)$. Since $h_i'(r)=\frac{\theta}{2}(r+Z_i)^{\frac{\theta-2}{2}}$ is decreasing for $0<\theta\le 2$, $h_i(r)$ is concave for $0<\theta\le 2$. Problem (17) can therefore be rewritten as
     $$\min_{\mathbf{g}} J(\mathbf{g})=\sum_{i=1}^{m}h_i(r_i^2). \tag{18}$$
     By the concavity of $h_i$, we have
     $$J(\mathbf{g}^{(k+1)})-J(\mathbf{g}^{(k)})\le\sum_{i=1}^{m}h_i'\big(r_i(\mathbf{g}^{(k)})^2\big)\big(r_i(\mathbf{g}^{(k+1)})^2-r_i(\mathbf{g}^{(k)})^2\big)=\frac{1}{2}\sum_{i=1}^{m}w_i\big(r_i(\mathbf{g}^{(k+1)})-r_i(\mathbf{g}^{(k)})\big)\big(r_i(\mathbf{g}^{(k+1)})+r_i(\mathbf{g}^{(k)})\big). \tag{19}$$
     Since $r_i(\mathbf{g}^{(k+1)})-r_i(\mathbf{g}^{(k)})=(\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)})^T\mathbf{u}_i$ and $r_i(\mathbf{g}^{(k+1)})+r_i(\mathbf{g}^{(k)})=\mathbf{u}_i^T(\mathbf{g}^{(k)}+\mathbf{g}^{(k+1)})-2v_{ij}$, using Equation (13) we have
     $$J(\mathbf{g}^{(k+1)})-J(\mathbf{g}^{(k)})\le\frac{1}{2}(\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)})^T\sum_{i=1}^{m}w_i\mathbf{u}_i\mathbf{u}_i^T\big(\mathbf{g}^{(k)}+\mathbf{g}^{(k+1)}-2\mathbf{g}^{(k+1)}\big)=\frac{1}{2}(\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)})^T\sum_{i=1}^{m}w_i\mathbf{u}_i\mathbf{u}_i^T\big(\mathbf{g}^{(k)}-\mathbf{g}^{(k+1)}\big). \tag{20}$$
     Since $\sum_{i=1}^{m}w_i\mathbf{u}_i\mathbf{u}_i^T$ is positive semi-definite, $J(\mathbf{g}^{(k+1)})-J(\mathbf{g}^{(k)})\le 0$; that is, the sequence $J(\mathbf{g}^{(k)})$ is decreasing. At the same time, it is bounded from below, so the sequence $J(\mathbf{g}^{(k)})$ generated by Algorithm 3 converges. Since Algorithm 4 is a process of repeatedly calling Algorithm 3, the sequence $J(\mathbf{G}^{(k)})$ generated by Algorithm 4 is also monotonically decreasing; it too is bounded from below, so it converges. This proves (1).
  2. When $1\le\theta\le 2$, $J(\mathbf{G})$ is convex, so any minimum point must be a global minimum point. If the minimum point of $J(\mathbf{G})$ is unique, it can be shown that $\mathbf{G}^{(k)}$ converges to it. In fact, since $J(\mathbf{G}^{(k)})$ converges, the matrix sequence $\mathbf{G}^{(k)}$ is bounded. It has a subsequence with a limit $\mathbf{G}^{*}$, which by continuity satisfies (11) and is hence a minimum point of (6). If the convergent subsequence of $\mathbf{G}^{(k)}$ is unique, then $\mathbf{G}^{(k)}\to\mathbf{G}^{*}$; otherwise, there would exist a subsequence of $\mathbf{G}^{(k)}$ bounded away from $\mathbf{G}^{*}$, which in turn would have a convergent subsequence with a limit different from $\mathbf{G}^{*}$ that would also satisfy (11) and hence be a minimum point of (6). This contradicts the assumption that $J(\mathbf{G})$ has a unique minimum point. If the minimum point of $J(\mathbf{G})$ is not unique, then by the analysis above $\mathbf{G}^{(k)}$ has a subsequence which converges to a minimum point of $J(\mathbf{G})$. When $0<\theta<1$, $J(\mathbf{G})$ is non-convex; by the same discussion, the sequence generated by Algorithm 4, or a subsequence of it, converges to a local minimum point of problem (6).

Remark 3.2

The proof presented above is in fact a concrete instance of the convergence proof method presented in [14, Section 9.1], mainly because our Algorithm 3 can be regarded as a concrete example of Algorithm 6 in [14, Section 4.5.2]. Indeed, the convergence theorem already given in [14, Section 4.5.2] directly yields a conclusion almost identical to Theorem 3.1; for clarity, we nevertheless present a detailed proof here.

4. Experiments

We evaluate the effectiveness of $l^{\alpha}_{2,\beta}$-RFS in this section. It contains four parts. The convergence of the algorithm is first studied empirically in Section 4.1. In Sections 4.2 and 4.3, we compare $l^{\alpha}_{2,\beta}$-RFS with other feature selection methods, including Laplacian score (LS) [10], ReliefF (RF) [15], minimal-redundancy-maximal-relevance (mRMR) [21], trace ratio (TR) [19], l2,1-RFS [18] and l2,p-RFS [24], in terms of classification accuracy and running time, respectively. Finally, we study the effects of the parameters involved in $l^{\alpha}_{2,\beta}$-RFS on its performance in Section 4.4. In the experiments, the 1-Nearest-Neighbor (1NN) algorithm [7] is applied to classify the low-dimensional data resulting from all seven feature selection algorithms, and five-fold cross validation is used to compute classification accuracy. In the $l^{\alpha}_{2,\beta}$-RFS algorithm, the numbers of iterations of the LSQR algorithm [20] for computing $\mathbf{G}^{(0)}$ and updating $\mathbf{G}$ are set to 20 and 5, respectively; the numbers of iterations of Algorithm 3 and Algorithm 4 are set to 1 and 20, respectively. Seven data sets are selected for the experiments. Five are gene expression data sets, ALLAML, GLIOMA, LUNG, PRO-GE [18,24] and COLON [23], and two are image data sets, COIL20 and USPS [23]. All data sets were preprocessed by standardization and centralization. The relevant statistics of each data set are shown in Table 1. The experimental environment is an Intel(R) Core(TM) i7-8700 CPU@3.3 GHz with 16 GB RAM, running the Windows 10 operating system and the MATLAB 2014a simulation tool.

Table 1.

Summary of the test data sets used in our experiment.

Data set Size (n) Dimension (m) # of classes (c)
ALLAML 72 3571 2
GLIOMA 50 4434 4
LUNG 203 3312 5
PRO-GE 102 5966 2
COLON 62 2000 2
COIL20 1440 1024 20
USPS 9298 256 10

4.1. Convergence

In this experiment, we study the convergence of the $l^{\alpha}_{2,\beta}$-RFS algorithm empirically. Figure 1 shows the objective function value of the $l^{\alpha}_{2,\beta}$-RFS algorithm over the iterations on the 7 data sets, where the parameters $\alpha$, $\beta$ and $\gamma$ are fixed at 1.25, 0.25 and 10. We can observe from Figure 1 that the objective function value shows a non-increasing trend during the iterations and converges after about six iterations. This shows that the $l^{\alpha}_{2,\beta}$-RFS algorithm converges fast. We observed similar trends under other parameter values; those results are omitted.

Figure 1. Convergence analysis curve.

4.2. Classification accuracy

In this experiment, the feature selection ability of the proposed $l^{\alpha}_{2,\beta}$-RFS algorithm is evaluated in terms of the classification accuracy of the subsequent 1NN classifier. The regularization parameter $\gamma$ in l2,1-RFS [18], l2,p-RFS [24] and the proposed $l^{\alpha}_{2,\beta}$-RFS algorithm is selected from the set $\{10^{-6},10^{-5},10^{-4},10^{-3},10^{-2},10^{-1},10^{0},10^{1},10^{2},10^{3},10^{4},10^{5},10^{6}\}$; the parameter p in l2,p-RFS [24] is selected from $\{0.25,0.5,0.75,1\}$; the parameters $\alpha$ and $\beta$ in the proposed $l^{\alpha}_{2,\beta}$-RFS algorithm are selected from $\{0.25,0.5,0.75,1,1.25,1.5,1.75,2\}$ and $\{0.25,0.5,0.75,1,1.25,1.5,1.75\}$, respectively. The classification accuracy and its standard deviation (SD) corresponding to the best parameters of each algorithm are reported in Tables 2–5, which show the results when the number of selected features is 20%, 40%, 60% and 80% of all features, respectively. The last line of each table is the average classification accuracy of each feature selection algorithm over all data sets. In the tables, '×' means that the corresponding algorithm did not terminate and thus failed to output a result.

Table 2.

Classification accuracy (%) and its standard deviation of 1NN using five-fold cross-validation for the top 20% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 92.95±7.15 91.42±11.74 92.85±10.10 92.85±8.75 95.71±6.39 94.29± 5.98 98.57±3.91
GLIOMA 74.00±11.40 78.00±8.37 76.00±5.48 66.00±11.40 80.00±7.07 74.80±4.47 86.00±8.94
LUNG 95.07±3.45 95.08±2.98 94.58±2.69 96.06±3.70 97.04±2.07 97.04±2.07 98.52±1.35
PRO-GE 69.57±8.27 83.29±4.52 89.19±6.47 84.33±6.23 90.19±6.99 91.19±5.33 90.14±6.56
COLON 64.74±10.94 77.43±8.72 79.10±9.22 74.23±2.92 82.31±5.25 80.77±6.86 88.72±4.36
COIL20 91.73±1.18 99.79±0.19 99.86±0.19 98.61±1.26 100±0.00 100±0.00 99.79±0.31
USPS 83.05±1.99 93.84±0.71 92.23±0.91 89.04±0.65 96.36±0.14 × 95.17±0.32
Mean accuracy 81.59 88.41 89.12 85.88 91.95 89.68 93.85

Table 5.

Classification accuracy (%) and its standard deviation of 1NN using five-fold cross-validation for the top 80% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 94.29±7.82 92.95±7.15 94.29±7.82 92.86±10.10 94.29±7.83 94.29±7.83 95.71±3.91
GLIOMA 76.00±8.94 76.00±8.94 80.00±7.07 78.00±8.37 78.00±8.37 78.00±8.37 80.00±7.07
LUNG 94.59±1.07 93.12±1.98 94.57±2.08 94.59±1.73 95.07±1.73 94.59±2.66 95.56±2.06
PRO-GE 83.24±7.59 82.24±9.12 87.24±7.44 83.29±8.12 85.24±8.12 85.24±7.84 87.19±5.69
COLON 74.36±7.94 77.56±5.98 74.36±5.79 74.36±5.79 77.56±5.98 75.90±5.06 80.77±3.51
COIL20 99.79±0.31 99.86±0.19 100.00±0.00 99.86±0.00 100.00±0.00 100.00±0.00 99.86±0.31
USPS 97.02±0.39 97.05±0.24 97.25±0.37 96.92±0.37 97.41±0.378 × 97.56±0.32
Mean accuracy 88.47 88.40 89.67 88.55 89.81 88.00 90.95

Table 3.

Classification accuracy (%) and its standard deviation of 1NN using five-fold cross-validation for the top 40% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 95.71±6.39 94.29±7.82 95.71±6.39 94.29±7.82 95.71±3.91 97.14±3.91 97.14±3.91
GLIOMA 72.00±13.04 78.00±4.47 74.00±8.94 74.00±8.94 80.00±7.07 78.00±4.47 82.00±8.37
LUNG 95.06±3.03 94.60±3.16 96.06±2.19 96.05±1.38 96.54±2.22 97.04±2.07 97.54±2.47
PRO-GE 79.33±9.04 81.24±6.80 86.24±4.27 81.38±3.96 88.19±5.69 89.19±4.11 90.14±5.33
COLON 67.95±8.68 79.23±6.24 78.97±4.65 75.90±5.06 80.77±6.86 80.64±4.36 83.97±5.25
COIL20 97.29±0.37 100.00±0.00 99.93±0.15 99.86±0.19 100.00±0.00 100.00±0.00 99.86±0.31
USPS 92.07±0.85 95.97±0.45 96.49±0.41 91.96±0.53 97.20±0.35 × 97.28±0.40
Mean accuracy 85.63 89.05 89.63 87.63 91.41 90.34 92.56

Table 4.

Classification accuracy (%) and its standard deviation of 1NN using five-fold cross-validation for the top 60% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 95.71±6.39 92.95±7.15 95.71±6.39 95.71±6.39 94.29±7.83 95.71±6.39 97.14±3.91
GLIOMA 76.00±2.66 78.00±2.14 76.00±5.48 78.00±8.37 80.00±7.07 80.00±7.07 80.00±7.07
LUNG 94.59±2.19 93.61±1.38 95.06±2.49 94.57±3.20 97.04± 2.07 96.55±2.79 97.04±2.07
PRO-GE 84.29±6.50 81.29±5.64 87.29±6.52 82.29±5.8 87.19± 7.61 86.24±6.40 88.19±6.47
COLON 72.82±8.07 77.69±9.72 78.97±4.65 77.56±5.98 77.44±3.43 77.44±3.43 83.97±5.25
COIL20 98.68±0.66 100.00±0.00 100.00±0.00 99.86±0.19 100.00±0.00 100.00±0.00 99.93 ±0.16
USPS 95.53±0.34 96.58±0.42 97.34±0.40 95.88±0.46 97.41±0.31 × 97.63±0.26
Mean accuracy 88.23 88.59 90.05 89.13 90.48 89.32 91.99

It can be seen from the above four tables that, for most data sets and most numbers of selected features, the $l^{\alpha}_{2,\beta}$-RFS algorithm achieves the best classification accuracy among the seven algorithms. For most numbers of selected features, the $l^{\alpha}_{2,\beta}$-RFS algorithm also outperforms the other six algorithms in terms of average classification accuracy. This is most obvious when the number of selected features is 20%. In particular, on COLON, the proposed $l^{\alpha}_{2,\beta}$-RFS algorithm not only performs best but also achieves classification accuracies about 6% and 8% higher than l2,1-RFS [18] and l2,p-RFS [24], respectively; on GLIOMA, $l^{\alpha}_{2,\beta}$-RFS achieves accuracies 6% and about 10% higher than the l2,1-RFS and l2,p-RFS algorithms, respectively.

4.3. Running time

In this experiment, we evaluate the efficiency of the l2,βα-RFS algorithm in terms of running time. Tables 6–9 present the running times corresponding to the optimal accuracies of all seven algorithms on the seven data sets under different parameters. From Tables 6–9, we can see that, overall, LS and TR are the most efficient of the seven algorithms. We also see that l2,1-RFS and l2,p-RFS are sensitive to n (the number of data samples), which is most evident on the USPS data set: the running times of these two algorithms are very long (l2,p-RFS runs so slowly there that it could not produce a result). This is consistent with our earlier analysis of the computational complexity of l2,1-RFS and l2,p-RFS: the complexities of both algorithms grow cubically with n. The efficiency of l2,βα-RFS, in contrast, is robust to the data scale: its running time changes smoothly across different numbers of samples and feature dimensions.
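The source of this gap is the inner least squares solver. As an illustrative sketch (not the paper's code), SciPy's LSQR routine solves a least squares subproblem purely through matrix–vector products, so it never forms or factorizes the large normal-equations matrix whose factorization cost grows cubically with the data size; the sizes below are arbitrary for illustration:

```python
import numpy as np
from scipy.sparse.linalg import lsqr

# Hypothetical sizes: n samples, d features.
rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# LSQR iterates using only products with X and X.T; it never builds
# the d x d (or n x n) matrix that a direct solver would factorize.
w, istop, itn = lsqr(X, y)[:3]

# Its solution matches the direct least squares solution.
w_direct = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(w, w_direct, atol=1e-4)
```

Since LSQR's per-iteration cost is proportional to the number of nonzeros of X, an IRLS outer loop built on it avoids the cubic dependence on n described above.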

Table 7.

Running time (in seconds) of classification of 1NN using five-fold cross-validation for the top 40% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 0.05 83.84 0.89 0.30 0.18 0.24 8.23
GLIOMA 0.06 143.73 0.79 0.37 0.16 0.36 39.01
LUNG 0.14 81.88 2.71 0.31 0.67 0.28 15.87
PRO-GE 0.13 315.06 2.10 0.82 0.47 0.72 90.69
COLON 0.01 36.40 0.60 0.09 0.06 0.06 5.04
COIL20 1.13 11.57 13.60 1.30 11.15 1.46 11.84
USPS 23.57 15.54 105.13 24.20 623.15 × 21.88

Table 8.

Running time (in seconds) of classification of 1NN using five-fold cross-validation for the top 60% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 0.10 127.26 0.94 0.38 0.19 0.17 9.86
GLIOMA 0.14 225.80 0.87 0.48 0.17 0.35 15.02
LUNG 0.24 121.35 2.82 0.44 0.75 0.45 6.11
PRO-GE 0.27 482.67 2.25 1.05 0.46 0.55 56.07
COLON 0.03 50.74 0.61 0.12 0.06 0.05 7.28
COIL20 1.99 15.56 14.48 2.29 10.21 2.21 8.95
USPS 38.64 18.99 120.26 38.88 614.03 × 38.80

Table 6.

Running time (in seconds) of classification of 1NN using five-fold cross-validation for the top 20% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 0.02 39.65 0.86 0.25 0.18 0.16 9.16
GLIOMA 0.02 64.64 0.75 0.31 0.16 0.36 36.78
LUNG 0.09 38.76 2.66 0.22 0.64 0.25 17.68
PRO-GE 0.06 129.88 2.02 0.68 0.45 1.01 78.22
COLON 0.01 19.38 0.59 0.08 0.06 0.06 9.22
COIL20 0.54 6.42 12.98 0.61 9.83 0.92 12.73
USPS 10.90 11.62 92.38 11.92 617.18 × 12.52

Table 9.

Running time (in seconds) of classification of 1NN using five-fold cross-validation for the top 80% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 0.22 158.64 1.05 0.54 0.20 0.27 17.21
GLIOMA 0.32 283.96 1.04 0.68 0.17 0.26 38.10
LUNG 0.41 149.06 2.99 0.65 0.78 0.42 4.36
PRO-GE 0.59 597.89 2.58 1.48 0.51 0.59 71.46
COLON 0.05 60.44 0.64 0.17 0.06 0.05 7.46
COIL20 3.31 18.12 15.82 3.64 10.55 3.45 15.44
USPS 56.08 21.85 137.11 55.89 628.44 × 51.51

4.4. Effect of parameters

The l2,βα-RFS algorithm has three parameters: α, β and γ. α controls the robustness of the loss function, β controls the sparsity of the transformation matrix, and γ is the regularization parameter that balances the least squares term against the regularization term. In this experiment, we empirically study the effect of the three parameters on the performance of the l2,βα-RFS algorithm in terms of the classification accuracy of the subsequent 1NN classifier.
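As a small illustration of what these exponents control (the helper below is ours, not from the paper), the l2,p-norm of a matrix is the p-norm of the vector of its row-wise l2 norms; driving whole rows of the transformation matrix to zero is what turns the regression into a feature selector, and smaller exponents reward such row sparsity more strongly:

```python
import numpy as np

def l2p_norm(M, p):
    """(sum_i ||m_i||_2^p)^(1/p): the l_{2,p}-norm over the rows of M."""
    row_norms = np.linalg.norm(M, axis=1)
    return (row_norms ** p).sum() ** (1.0 / p)

W = np.array([[3.0, 4.0],   # row norm 5
              [0.0, 0.0],   # row norm 0: a pruned feature
              [0.6, 0.8]])  # row norm 1

# p = 1 recovers the l_{2,1}-norm; p = 2 recovers the Frobenius norm.
assert np.isclose(l2p_norm(W, 1.0), 6.0)            # 5 + 0 + 1
assert np.isclose(l2p_norm(W, 2.0), np.sqrt(26.0))  # Frobenius
```

In the l2,βα-RFS criterion, α is the exponent applied to the row norms of the residual (robustness to outlying samples) and β is the exponent applied to the row norms of W (row sparsity).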

In order to evaluate the influence of parameter γ on the performance of the proposed l2,βα-RFS algorithm, we fix parameters α and β to 1 and plot the classification accuracy of l2,βα-RFS as a function of γ, as shown in Figure 2. We can observe that γ does have an impact on the performance of the l2,βα-RFS algorithm. It appears that, for most data sets, [10^{-4}, 10^{-1}] is a relatively robust selection range for γ.

Figure 2.

Classification accuracies on seven data sets with different γ, where (a), (b), (c) and (d) show the classification accuracy when the proportion of selected features is 20%, 40%, 60% and 80%, respectively.

In order to evaluate the influence of parameter β on the performance of the proposed l2,βα-RFS algorithm, we fix parameters α and γ to 1 and plot the classification accuracy of l2,βα-RFS as a function of β, as shown in Figure 3. We can observe that β does have an impact on the performance of the l2,βα-RFS algorithm; this is especially obvious on some data sets such as GLIOMA and PROSTATE. We can also observe that, on several data sets, the optimal classification accuracy is not obtained at β=1 (corresponding to the l2,1-norm). This result shows that it is generally necessary to widen the value range of parameter β. It appears that, for most data sets, [0.25,0.75] is a relatively robust selection range for β.

Figure 3.

Classification accuracies on seven data sets with different β, where (a), (b), (c) and (d) show the classification accuracy when the proportion of selected features is 20%, 40%, 60% and 80%, respectively.

In order to evaluate the influence of parameter α on the performance of the proposed l2,βα-RFS algorithm, we fix parameters β and γ to 1 and 10 respectively, and plot the classification accuracy of l2,βα-RFS as a function of α, as shown in Figure 4. We can observe that α does have an impact on the performance of the l2,βα-RFS algorithm; this is especially obvious on some data sets such as GLIOMA and PROSTATE. We can also observe that, on several data sets, the optimal classification accuracy is not obtained at α=1 (corresponding to the l2,1-norm) or α=2 (corresponding to the Frobenius norm). This result shows that it is generally necessary to widen the value range of parameter α. It appears that, for most data sets, [1.25,1.5] is a relatively robust selection range for α.

Figure 4.

Classification accuracies on seven data sets with different α, where (a), (b), (c) and (d) show the classification accuracy when the proportion of selected features is 20%, 40%, 60% and 80%, respectively.

In order to further test the influence of parameters α and β on the performance of the proposed l2,βα-RFS algorithm, we run l2,1-RFS, l2,p-RFS and the proposed l2,βα-RFS using our IRLS algorithm (Algorithm 4). For ease of comparison, we denote the l2,p-RFS algorithm as l2,β-RFS. Figure 5 depicts the classification accuracy as a function of β on all seven data sets, where the accuracy of l2,βα-RFS is the optimal accuracy over different values of α. From Figure 5, we can see that, for several data sets, there are values of β at which the classification accuracy of l2,β-RFS exceeds that of l2,1-RFS; this shows that it is beneficial to extend l2,1-RFS to l2,β-RFS. We also see that, for all data sets and all values of β, the classification accuracy of l2,βα-RFS is always better than that of l2,β-RFS, which shows that it is beneficial to further extend l2,β-RFS to l2,βα-RFS.
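For concreteness, a minimal dense sketch of the IRLS scheme shared by these variants is given below. It follows the standard reweighting derivation for min ||XW−Y||_{2,α}^α + γ||W||_{2,β}^β; the γβ/α scaling of the regularizer weights and the ridge initialization are our assumptions for illustration, and the inner system is solved directly with np.linalg.solve, whereas Algorithm 4 in the paper uses LSQR:

```python
import numpy as np

def irls_l2ab(X, Y, alpha=1.5, beta=0.5, gamma=0.1, n_iter=30, eps=1e-8):
    """Illustrative IRLS for min ||XW-Y||_{2,a}^a + gamma*||W||_{2,b}^b.

    Each iteration solves a weighted ridge system whose weights come
    from the row norms of the previous iterate's residual and of W.
    """
    d = X.shape[1]
    # Ridge initialization so the first regularizer weights are finite.
    W = np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ Y)
    for _ in range(n_iter):
        E = X @ W - Y
        # Loss reweighting: ||e_i||^(alpha-2); eps guards zero residuals.
        d1 = (np.linalg.norm(E, axis=1) + eps) ** (alpha - 2)
        # Regularizer reweighting: ||w_j||^(beta-2); near-zero rows get
        # heavily penalized, which drives them to zero (row sparsity).
        d2 = (np.linalg.norm(W, axis=1) + eps) ** (beta - 2)
        A = X.T @ (d1[:, None] * X) + (gamma * beta / alpha) * np.diag(d2)
        W = np.linalg.solve(A, X.T @ (d1[:, None] * Y))
    return W

# Toy check: only the first two features generate Y, so the first two
# rows of the recovered W should carry the largest l2 norms.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
W_true = np.zeros((10, 3))
W_true[:2] = rng.standard_normal((2, 3))
Y = X @ W_true + 0.01 * rng.standard_normal((100, 3))
W = irls_l2ab(X, Y)
top2 = set(np.argsort(np.linalg.norm(W, axis=1))[-2:])
assert top2 == {0, 1}
```

Setting alpha=2 and beta=1 in this sketch recovers an l2,1-regularized least squares problem, the l2,1-RFS special case.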

Figure 5.

The classification accuracy of the l2,1-RFS, l2,p-RFS and l2,βα-RFS feature selection algorithms on seven data sets with different β.

5. Conclusion and future work

In this paper, a new optimization criterion for the l2,p-norm regression based feature selection (l2,p-RFS) algorithm [24], itself an extension of the l2,1-norm regression based feature selection algorithm [18], is proposed. The new criterion generalizes the optimization criterion of the l2,p-RFS algorithm by allowing the l2,p-norm used in the regression loss to differ from the one used in the regularization term, which improves the flexibility of the algorithm across different data sets. Based on the iterative re-weighted least squares framework, an effective and efficient algorithm, which we denote l2,βα-RFS, is proposed for the new criterion. Experimental results on a variety of real-world data sets show that the new algorithm is competitive with other related feature selection algorithms in terms of classification accuracy and efficiency. The parameters of the proposed l2,βα-RFS algorithm have a certain, and sometimes significant, influence on its performance. Automatic parameter selection remains an open problem; a promising direction is nested cross-validation [6] for supervised learning problems, which constitutes our future work.
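As a hedged sketch of that future direction (using scikit-learn and synthetic data, not the paper's pipeline; tuning the neighbor count here merely stands in for tuning α, β and γ), nested cross-validation selects hyperparameters in an inner loop while the outer loop scores the tuned model on folds the tuning never saw:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data set.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Inner loop: grid-search the hyperparameter on each training split.
inner = GridSearchCV(KNeighborsClassifier(),
                     {"n_neighbors": [1, 3, 5]}, cv=3)

# Outer loop: the reported accuracy comes from folds that played no
# part in choosing the hyperparameter, avoiding optimistic bias.
scores = cross_val_score(inner, X, y, cv=5)
print(scores.mean())
```

The outer estimate is unbiased with respect to the hyperparameter search, at the cost of refitting the inner search once per outer fold.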

Funding Statement

This work is supported by the National Science Foundation of China (Grant nos. 62071378, 62071379, 62071380, 61601362, 61671377 and 61901365), the National Science Basic Research Plan in Shaanxi Province of China (Nos. 2020JM-580 and 2021JM-461), the New Star Team of Xi'an University of Posts and Telecommunications (Grant no. xyt2016-01), and the Science Plan Foundation of the Education Bureau of Shaanxi Province of China (No. 18JK0719).

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1.Argyriou A., Evgeniou T., and Pontil M., Multi-task feature learning, Adv. Neural. Inf. Process. Syst. 19 (2007), pp. 41–48. [Google Scholar]
  • 2.Chen J., Ma Z., and Liu Y., Local coordinates alignment with global preservation for dimensionality reduction, IEEE Trans. Neural. Netw. Learn. Syst. 24 (2013), pp. 106–117. [DOI] [PubMed] [Google Scholar]
  • 3.Daubechies I., Devore R. and Fornasier M., Iteratively re-weighted least squares minimization: proof of faster than linear rate for sparse recovery, 42nd Annual Conference on Information Sciences and Systems, USA, pp. 26–29, 2008.
  • 4.Ding C. and Peng H., Minimum redundancy feature selection from microarray gene expression data. Proceedings of 2nd IEEE Computational Systems Bioinformatics Conference. pp. 523–528, August 2003. [DOI] [PubMed]
  • 5.Ding C., Zhou D., He X., and Zha H., R1-PCA: rotational invariant l1-norm principal component analysis for robust subspace factorization, Proc. Intl. Conf. Mach. Learn. 23 (2006), pp. 281–288. [Google Scholar]
  • 6.Dora L., Agrawal S., Panda R., et al., Nested cross-validation based adaptive sparse representation algorithm and its application to pathological brain classification, Expert. Syst. Appl. 114 (2018), pp. 313–321. [Google Scholar]
  • 7.Duda R.O., Hart P.E., and Stork D., Pattern Classification, New York, Wiley, 2000. [Google Scholar]
  • 8.Fukunaga K., Statistical Pattern Recognition, Academic Press, 2nd edition, 1990. [Google Scholar]
  • 9.Guyon I. and Elisseeff A., An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003), pp. 1157–1182. [Google Scholar]
  • 10.He X., Cai D., and Niyogi P., Laplacian score for feature selection, Adv. Neural. Inf. Process. Syst. 18 (2005), pp. 507–514. [Google Scholar]
  • 11.Hou C., Jiao Y., Nie F., Luo T., and Zhou Z., 2D feature selection by sparse matrix regression, IEEE Trans. Image Process. 26 (2017), pp. 4255–4268. [DOI] [PubMed] [Google Scholar]
  • 12.Hou C., Wang J., and Wu Y., Local linear transformation embedding, Neurocomputing 72 (2009), pp. 2368–2378. [Google Scholar]
  • 13.Hou C., Zhang C., and Wu Y., Multiple view semi-supervised dimensionality reduction, Pattern. Recognit. 43 (2010), pp. 720–730. [Google Scholar]
  • 14.Huber P.J., Robust Statistics, New York, Wiley, 1981. [Google Scholar]
  • 15.Kononenko I., Estimating attributes: analysis and extensions of relief, Eur. Conf. Mach. Learn. 7 (1994), pp. 171–182. [Google Scholar]
  • 16.Masaeli M., Fung G. and Dy J.G., From transformation-based dimensionality reduction to feature selection, Proceedings of the 27th International Conference on Machine Learning, pp. 751–758, 2010.
  • 17.Ming D. and Ding C., Robust flexible feature selection via exclusive L21 regularization. Proceedings of the 28th International Joint Conference on Artificial Intelligence. pp. 3158–3164, 2019.
  • 18.Nie F., Huang H., and Cai X., Efficient and robust feature selection via joint l2,1-norms minimization, Adv. Neural Inf. Process. Syst. 23 (2010), pp. 1–9. [Google Scholar]
  • 19.Nie F., Xiang S. and Jia Y., Trace ratio criterion for feature selection, Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pp. 671–676, 2008.
  • 20.Paige C. and Saunders M., LSQR: an algorithm for sparse linear equations and sparse least squares, ACM Trans. Math. Softw. 8 (1982), pp. 43–71. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Peng H., Long F., and Ding C., Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005), pp. 1226–1238. [DOI] [PubMed] [Google Scholar]
  • 22.Shao L., Liu L., and Li X., Feature learning for image classification via multiobjective genetic programming, IEEE. Trans. Neural. Netw. Learn. Syst. 25 (2014), pp. 1359–1371. [Google Scholar]
  • 23.Tao H., Hou C., and Nie F., Effective discriminative feature selection with nontrivial solution, IEEE. Trans. Neural. Netw. Learn. Syst. 27 (2016), pp. 796–808. [DOI] [PubMed] [Google Scholar]
  • 24.Wang L., Chen S., and Wang Y., A unified algorithm for mixed l2,p-minimizations and its application in feature selection, Comput. Optim. Appl. 58 (2014), pp. 409–421. [Google Scholar]
  • 25.Zhao H., Wang Z., and Nie F., A new formulation of linear discriminant analysis for robust dimensionality reduction, IEEE. Trans. Knowl. Data Eng. 31 (2018), pp. 629–640. [Google Scholar]

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
