Journal of Applied Statistics. 2021 Sep 17;50(3):703–723. doi: 10.1080/02664763.2021.1975662

A generalized l2,p-norm regression based feature selection algorithm

X. Zhi, J. Liu, S. Wu, C. Niu
PMCID: PMC9930865  PMID: 36819074

Abstract

Feature selection is an important data dimension reduction method, and it has been used widely in applications involving high-dimensional data such as genetic data analysis and image processing. To achieve robust feature selection, recent works apply the l2,1-norm or l2,p-norm of a matrix to the loss function and regularization terms in regression, and have achieved encouraging results. However, these existing works rigidly set the matrix norms used in the loss function and the regularization terms to the same l2,1 or l2,p-norm, which limits their applications. In addition, the solution algorithms they present either have high computational complexity and are not suitable for large data sets, or cannot provide satisfying performance due to approximate calculation. To address these problems, we present a generalized l2,p-norm regression based feature selection (l2,p-RFS) method based on a new optimization criterion. The criterion extends the optimization criterion of l2,p-RFS to the case where the loss function and the regularization terms in regression use different matrix norms. We cast the new optimization criterion in a regression framework without regularization. In this framework, the new criterion can be solved using an iterative re-weighted least squares (IRLS) procedure, in which each least squares problem can be solved efficiently by the least squares QR decomposition (LSQR) algorithm. We have conducted extensive experiments to evaluate the proposed algorithm on various well-known gene expression and image data sets, and compared it with other related feature selection methods.

Keywords: Feature selection; sparse regression; l2,p-norm; iterative re-weighted least squares; least square QR decomposition

1. Introduction

In many applications such as genetic data analysis, image processing and data mining, one often encounters very high-dimensional data. Some features of the high-dimensional data are related to the target task, while many others are redundant [23]. Therefore, dimension reduction has become an important stage of data preprocessing in such applications [12,13]. Feature selection and feature extraction are the two main dimension reduction methods [2,22]. Feature extraction transforms the original data into a new low-dimensional subspace, whereas feature selection selects a low-dimensional subset of the original high-dimensional features according to certain rules. The latter retains the original representation of the data without changing the original features and is therefore interpretable, while the former is not [23]. Over the years, research on feature selection has received more and more attention and has made considerable progress.

Generally, according to how the classification algorithm is incorporated in evaluating and selecting features, feature selection methods can be organized into three categories [9,23]: wrapper methods, filter methods, and embedded methods. Compared with filter methods, wrapper methods and embedded methods are tightly coupled with a specific classifier, so they often perform well but at a very high computational cost. In contrast, filter methods such as the F-statistic [4], Laplacian score (LS) [10], ReliefF (RF) [15], minimum redundancy maximum relevance (mRMR) [21] and trace ratio (TR) [19] are often very efficient. In this paper, we focus on filter-type methods for supervised feature selection.

In recent years, filter-type feature selection techniques based on the combination of a transformation method (such as linear discriminant analysis [8] or regression) with sparse regularization theory have become a hot spot in the study of feature selection [16,18,23,24]. The sparse regularization technique forces the transformation matrix to have more zero rows, thereby enabling feature selection. For multi-class problems, these methods search for a subset of features shared by all the classes, and so are also known as multi-task feature learning (MTFL) [1]. As far as we know, Masaeli et al. [16] first proposed to combine LDA with sparse regularization of the transformation matrix, obtaining a new filter-based feature selection algorithm named linear discriminant feature selection (LDFS). By enforcing row sparsity of the transformation matrix of LDA via $l_{\infty,1}$-norm regularization, LDFS uses both the discriminative information and the learning mechanism [23]. Nie et al. [18] proposed an l2,1-norm based sparse regression feature selection (l2,1-RFS) algorithm by applying the l2,1-norm to both the loss function and the regularization of regression, which can be solved using the Lagrangian multiplier method. The convergence of the l2,1-RFS algorithm has been proved, and experimental results on a large amount of data have verified that l2,1-RFS is able to perform robust feature selection. To obtain a sparser solution, Wang et al. extended l2,1-RFS to the l2,p-norm ($0<p\le 1$) and proposed the l2,p-norm based feature selection (l2,p-RFS) algorithm [24]. Considering the expensive computation of the Lagrangian multiplier based algorithm, Wang et al. further proposed a one-step gradient projection based algorithm to reduce the complexity. However, the one-step gradient projection based l2,p-RFS algorithm in fact also has expensive computational costs, and the approximation in its solution process has a large impact on the classification accuracy of the subsequent classification algorithm. In [11], the sparse regression based feature selection method was extended to the case of feature selection directly on matrix data. In [17], the l2,1-norm based sparse regression feature selection method was modified to achieve 'robust and flexible' feature selection by adding an additional l1,2 regularization term.

It is beneficial to use the l2,p-norm rather than the traditional F-norm in the loss function of sparse regression based feature selection methods. However, in both l2,1-RFS [18] and l2,p-RFS [24], the matrix norms of the loss function and the regularization term are strictly set to be the same, which limits their application. In this paper, we extend l2,p-RFS to the case where the least-squares regression loss term and the regularization term use different matrix norms. The main contributions of this paper include:

  • We propose a new sparse regression based feature selection method, namely $l^{\alpha}_{2,\beta}$-RFS, which extends the l2,p-RFS method to the case where the loss function and the regularization terms in regression use different matrix norms.

  • We present an effective and efficient algorithm for $l^{\alpha}_{2,\beta}$-RFS by casting it in an IRLS framework, and also present a proof of the convergence of the proposed algorithm.

  • We have conducted extensive experiments on gene expression and image data sets to evaluate the effectiveness of $l^{\alpha}_{2,\beta}$-RFS and compare it with other related feature selection methods, especially the sparse regression based feature selection methods.

The rest of this article is organized as follows: Section 2 introduces the notations and definitions used in this paper and briefly reviews two related works. In Section 3, a new criterion for the l2,p-RFS method is proposed, and an effective and efficient algorithm is presented for it. A comprehensive study of the performance of the $l^{\alpha}_{2,\beta}$-RFS algorithm is presented in Section 4. We conclude in Section 5 with a discussion of related future work.

2. Related works

In this section, we introduce the notations and the definitions of norms used in this paper and briefly review two main related works: l2,1-RFS [18] and l2,p-RFS [24].

2.1. Notations and definitions

In this section, we introduce the notations and definitions of norms used in this paper. Matrices are written as boldface uppercase letters. Vectors are written as boldface lowercase letters. For a matrix $\mathbf{M}=(m_{ij})$, its $i$th row and $j$th column are denoted by $\mathbf{m}^i$ and $\mathbf{m}_j$, respectively. $\mathbf{M}^T$ denotes the transpose of $\mathbf{M}$.

The $l_p$-norm ($p>0$) of a vector $\mathbf{v}\in\mathbb{R}^n$ is defined as $\|\mathbf{v}\|_p=\big(\sum_{i=1}^n |v_i|^p\big)^{1/p}$, and the $l_0$-norm is defined as $\|\mathbf{v}\|_0=\sum_{i=1}^n |v_i|^0$. Strictly speaking, neither the $l_p$-norm ($0<p<1$) nor the $l_0$-norm is a valid norm, because the former does not satisfy the triangle inequality and the latter does not satisfy positive scalability [23].

The $l_{r,p}$-norm, $l_{2,0}$-norm, $l_{2,1}$-norm and $l_{2,p}$-norm ($0<p\le 1$) of a matrix $\mathbf{M}$ are defined as

$$\|\mathbf{M}\|_{r,p}=\Big(\sum_{i=1}^{n}\Big(\sum_{j=1}^{m}|m_{ij}|^{r}\Big)^{\frac{p}{r}}\Big)^{\frac{1}{p}}=\Big(\sum_{i=1}^{n}\|\mathbf{m}^i\|_r^p\Big)^{\frac{1}{p}},$$

$$\|\mathbf{M}\|_{2,0}=\sum_{i=1}^{n}\Big(\sum_{j=1}^{m}|m_{ij}|^{2}\Big)^{0}=\sum_{i=1}^{n}\|\mathbf{m}^i\|_2^0,\qquad \|\mathbf{M}\|_{2,1}=\sum_{i=1}^{n}\Big(\sum_{j=1}^{m}|m_{ij}|^{2}\Big)^{\frac{1}{2}}=\sum_{i=1}^{n}\|\mathbf{m}^i\|_2,$$

and

$$\|\mathbf{M}\|_{2,p}=\Big(\sum_{i=1}^{n}\Big(\sum_{j=1}^{m}|m_{ij}|^{2}\Big)^{\frac{p}{2}}\Big)^{\frac{1}{p}}=\Big(\sum_{i=1}^{n}\|\mathbf{m}^i\|_2^p\Big)^{\frac{1}{p}},$$

respectively.
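As a quick sanity check on the row-wise definition above, the $l_{2,p}$-norm of a matrix can be computed directly; the following NumPy sketch (the function name `l2p_norm` is ours, not from the paper) illustrates it:

```python
import numpy as np

def l2p_norm(M, p):
    """Compute ||M||_{2,p} = (sum_i ||m^i||_2^p)^(1/p) over the rows m^i of M."""
    row_norms = np.linalg.norm(M, axis=1)      # l2-norm of each row
    return np.sum(row_norms ** p) ** (1.0 / p)
```

For p = 2 this coincides with the Frobenius norm, and for p = 1 it reduces to the l2,1-norm.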

In the field of pattern recognition and machine learning, the l2,1-norm and l2,p-norm of a matrix can be used not only to improve the robustness of an algorithm to data with outliers [5,18,24,25], but also to induce sparsity of the transformation matrix for feature selection [1,16,18,23,24], or for both purposes [18,24].

2.2. l2,1-RFS algorithm

Given the data matrix $\mathbf{X}=[\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_n]\in\mathbb{R}^{d\times n}$ and the corresponding class label matrix $\mathbf{B}=[\mathbf{b}_1,\mathbf{b}_2,\dots,\mathbf{b}_c]\in\mathbb{R}^{n\times c}$, the optimization criterion of the l2,1-RFS algorithm [18] is defined as

$$\min_{\mathbf{G}} J(\mathbf{G})=\|\mathbf{X}^T\mathbf{G}-\mathbf{B}\|_{2,1}+\gamma\|\mathbf{G}\|_{2,1}, \tag{1}$$

where $\mathbf{G}$ is the regression matrix to be solved and $\gamma$ is the regularization coefficient: the larger it is, the sparser $\mathbf{G}$ becomes, and the fewer features are selected.

In the optimization criterion (1), the l2,1-norm serves two purposes: it enforces row sparsity of the regression transformation matrix $\mathbf{G}$, and it makes the regression loss robust, since the residual is not squared and outliers therefore carry less weight than with a squared residual [18].

The unconstrained optimization problem (1) can be transformed equivalently into the following constrained optimization problem:

$$\min_{\mathbf{Y}} \|\mathbf{Y}\|_{2,1}\quad \text{s.t.}\quad \mathbf{M}\mathbf{Y}=\mathbf{B}, \tag{2}$$

where $\mathbf{Y}=[\mathbf{G}^T,\mathbf{E}^T]^T\in\mathbb{R}^{m\times c}$, $\mathbf{M}=[\mathbf{X}^T,\gamma\mathbf{I}]\in\mathbb{R}^{n\times m}$, $\mathbf{E}=\frac{1}{\gamma}(\mathbf{B}-\mathbf{X}^T\mathbf{G})$, $m=n+d$, and $\mathbf{I}\in\mathbb{R}^{n\times n}$ is an identity matrix. Solving problem (2) with the Lagrange multiplier method yields $\mathbf{Y}=\mathbf{D}^{-1}\mathbf{M}^T(\mathbf{M}\mathbf{D}^{-1}\mathbf{M}^T)^{-1}\mathbf{B}$, where $\mathbf{D}$ is a diagonal matrix whose $i$th diagonal element is $d_{ii}=\frac{1}{2\|\mathbf{y}^i\|_2}$, with $\mathbf{y}^i$ the $i$th row of $\mathbf{Y}$. Computing $\mathbf{Y}$ requires $\mathbf{D}$, and computing $\mathbf{D}$ in turn requires $\mathbf{Y}$, so an iterative algorithm is needed to obtain the final $\mathbf{Y}$. The detailed algorithm steps are presented in Algorithm 1.
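To make the alternating update concrete, here is a minimal dense NumPy sketch of the iteration described above (the function name and the small ε safeguard against zero rows are our choices; the paper's exact procedure is Algorithm 1):

```python
import numpy as np

def l21_rfs(X, B, gamma, n_iter=30, eps=1e-6):
    """Sketch of the l2,1-RFS iteration: Y = D^{-1} M^T (M D^{-1} M^T)^{-1} B.

    X : (d, n) data matrix;  B : (n, c) class label matrix.
    Returns G, the (d, c) regression matrix whose row norms rank features.
    """
    d, n = X.shape
    M = np.hstack([X.T, gamma * np.eye(n)])        # (n, m) with m = n + d
    m = n + d
    Dinv = np.ones(m)                              # D^{-1} stored as a vector
    for _ in range(n_iter):
        DM = Dinv[:, None] * M.T                   # D^{-1} M^T, shape (m, n)
        Y = DM @ np.linalg.solve(M @ DM, B)        # (m, c)
        # d_ii = 1 / (2 ||y^i||_2), so D^{-1}_ii = 2 ||y^i||_2 (+ eps guard)
        Dinv = 2.0 * np.sqrt((Y ** 2).sum(axis=1)) + eps
    return Y[:d]                                   # G is the top d rows of Y
```

Features are then ranked by the row norms of the returned G, as in the text.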

[Algorithm 1]

The time complexity of the l2,1-RFS algorithm (Algorithm 1) can be analyzed as follows. Line 2 takes $O(nm^2+n^2m)$ time to compute the matrix $\mathbf{M}\mathbf{D}_t^{-1}\mathbf{M}^T$, $O(n^3)$ time to invert it, and $O(nm^2+n^2m+nmc)$ time for the remaining matrix multiplications. Line 3 takes $O(mc)$ time to compute the diagonal matrix $\mathbf{D}_{t+1}$. Hence, the total time complexity of the l2,1-RFS algorithm is

$$O\big(t(nm^2+n^2m+n^3+nmc)\big),$$

where t is the number of iterations. The l2,1-RFS algorithm [18] adopts the l2,1-norm to ensure the robustness of the loss function and the sparsity of the regularization. However, the application of the algorithm is easily limited, since the norms used in the loss function and the regularization term are rigidly fixed to the same l2,1-norm. In addition, from the complexity analysis above, it can be seen that this algorithm is computationally expensive for relatively large data sets.

2.3. l2,p-norm regression based feature selection algorithm

In order to obtain a more robust and sparser solution, Wang et al. extended l2,1-RFS [18] to the l2,p-norm and proposed the l2,p-RFS algorithm [24]. In l2,p-RFS, the l2,p-norm ($0<p\le 1$) is adopted for both the regression loss and the regularization terms. Its optimization criterion is defined as

$$\min_{\mathbf{G}} J(\mathbf{G})=\|\mathbf{X}^T\mathbf{G}-\mathbf{B}\|_{2,p}^{p}+\gamma\|\mathbf{G}\|_{2,p}^{p}, \tag{3}$$

where $p\in(0,1]$.

For any $p\in(0,1]$, a distant outlier contributes no more to (3) than to (1). Thus model (3) is expected to be more robust than (1) [24].

Similar to the solution method of l2,1-RFS [18], the above unconstrained optimization problem can be transformed equivalently into the following constrained optimization problem:

$$\min_{\mathbf{Y}} \|\mathbf{Y}\|_{2,p}^{p}\quad \text{s.t.}\quad \mathbf{M}\mathbf{Y}=\mathbf{B}, \tag{4}$$

where $\mathbf{Y}=[\mathbf{G}^T,\mathbf{E}^T]^T\in\mathbb{R}^{m\times c}$, $\mathbf{M}=[\mathbf{X}^T,\gamma\mathbf{I}]\in\mathbb{R}^{n\times m}$, $\mathbf{E}=\frac{1}{\gamma}(\mathbf{B}-\mathbf{X}^T\mathbf{G})$ and $m=n+d$. Using the Lagrangian multiplier method, the solution to problem (4) can be obtained as $\mathbf{Y}=\mathbf{D}^{-1}\mathbf{M}^T(\mathbf{M}\mathbf{D}^{-1}\mathbf{M}^T)^{-1}\mathbf{B}$, where $\mathbf{D}$ is a diagonal matrix whose $i$th diagonal element is $d_{ii}=\frac{1}{2\|\mathbf{y}^i\|_2^{2-p}}$. However, similar to l2,1-RFS [18], this solution involves a matrix inversion that is computationally expensive for large data sets. Note that when $\mathbf{D}$ is fixed, the optimization problem (4) can be equivalently transformed into:

$$\min_{\mathbf{Y}} f_k(\mathbf{Y})=\frac{1}{2}\operatorname{Tr}(\mathbf{Y}^T\mathbf{D}_k\mathbf{Y})\quad \text{s.t.}\quad \mathbf{M}\mathbf{Y}=\mathbf{B}. \tag{5}$$

Wang et al. proposed to apply the projection gradient algorithm and its one-step approximation to solve the constrained nonlinear optimization problem (5) efficiently [24]. The detailed steps of the one-step projection gradient based l2,p-RFS algorithm are described in Algorithm 2.

[Algorithm 2]

The time complexity of the l2,p-RFS algorithm (Algorithm 2) can be analyzed as follows. Line 1 takes $O(n^2m)$ time for the QR decomposition. Line 2 takes $O(nm^2)+O(mn^2)+O(n^3)+O(nmc)$ time for matrix multiplication and matrix inversion. Line 3 takes $O(t(m^2c+mc^2))$ time for iteratively solving the transformation matrix. Hence, the total time complexity of the l2,p-RFS algorithm is

$$O\big(nm^2+mn^2+n^3+nmc+t(m^2c+mc^2)\big).$$

l2,p-RFS [24] extends l2,1-RFS [18] to the l2,p-norm ($0<p\le 1$), providing more models to choose from in practical applications and thus increasing the flexibility of the original algorithm. However, we find in experiments that the numerical approximation in the one-step gradient projection method costs the algorithm some accuracy. In addition, from the complexity analysis above, it can be seen that the one-step gradient projection based l2,p-RFS algorithm is not as efficient as the authors claimed, and it is computationally expensive for relatively large data sets.

3. Proposed method

The regression loss function and the regularization terms in the criteria of the l2,1-RFS [18] and l2,p-RFS [24] algorithms use the same l2,1-norm or l2,p-norm ($0<p\le 1$), respectively. The norm in the regression loss function both measures the residual of the regression and tolerates the influence of outliers in the data [18,24], while the norm in the regularization term achieves sparseness of the regression transformation matrix. Since the two norms play different roles, rigidly setting them to the same norm may prevent the algorithms from obtaining good feature selection performance. Therefore, in this section, we consider a more general form of l2,p-RFS that uses different l2,p-norms for the regression loss function and the penalty term. We thus obtain a new criterion for l2,p-RFS, which is denoted $l^{\alpha}_{2,\beta}$-RFS in the following discussion for convenience.

3.1. A new criterion for l2,p-RFS

$l^{\alpha}_{2,\beta}$-RFS can be formulated as the following optimization problem:

$$\min_{\mathbf{G}} J(\mathbf{G})=\|\mathbf{X}^T\mathbf{G}-\mathbf{B}\|_{2,\alpha}^{\alpha}+\gamma\|\mathbf{G}\|_{2,\beta}^{\beta}, \tag{6}$$

where $\alpha$ ($0<\alpha\le 2$) and $\beta$ ($0<\beta<2$) are the norm indexes of the regression loss function and the penalty term respectively, and they are allowed to take different values. From criterion (6), we can see that the proposed $l^{\alpha}_{2,\beta}$-RFS criterion is a general form of the l2,p-RFS criterion [24], to which it degenerates when $\alpha=\beta$ ($0<\alpha,\beta\le 1$). At the same time, setting the two norm exponents separately better ensures the flexibility of the algorithm when facing complex data in practical applications. For example, we can set the norm index of the regression loss function to $\alpha=2$ in the proposed criterion; the regression loss then becomes exactly the squared F-norm, which corresponds to the assumption of normally distributed residuals, an assumption that has been widely verified to be valid. Note that the range of the parameter p in l2,p-RFS [24] cannot be expanded from (0, 1] to (0, 2] directly, since the norms of the loss metric and the regularization term in l2,p-RFS are fixed to the same l2,p-norm ($0<p\le 1$): if p were set to 2, the regularization term would become the squared F-norm of the regression transformation matrix, which cannot induce row sparsity, and so feature selection could not be performed.

3.2. An effective and efficient algorithm

The algorithm for the l2,p-RFS criterion cannot be used for the newly proposed $l^{\alpha}_{2,\beta}$-RFS criterion. In order to present an effective and efficient algorithm for the new criterion (6), we first show that the optimization problem (6) can be converted into a nonlinear regression problem without a regularization term.

In fact, the criterion in problem (6) can be reformulated as

$$J(\mathbf{G})=\|\mathbf{X}^T\mathbf{G}-\mathbf{B}\|_{2,\alpha}^{\alpha}+\gamma\|\mathbf{I}_d\mathbf{G}-\mathbf{0}\|_{2,\beta}^{\beta}=\sum_{i=1}^{n}\|\mathbf{x}_i^T\mathbf{G}-\mathbf{b}^i\|_2^{\alpha}+\sum_{r=1}^{d}\|\mu\mathbf{e}_r^T\mathbf{G}-\mathbf{0}^r\|_2^{\beta}=\sum_{i=1}^{n}\Big(\sum_{j=1}^{c}(\mathbf{x}_i^T\mathbf{g}_j-b_{ij})^2\Big)^{\frac{\alpha}{2}}+\sum_{r=1}^{d}\Big(\sum_{j=1}^{c}(\mu\mathbf{e}_r^T\mathbf{g}_j-0)^2\Big)^{\frac{\beta}{2}}, \tag{7}$$

where $\mu=\gamma^{1/\beta}$, $\mathbf{b}^i$ is the $i$th row of $\mathbf{B}$, $\mathbf{I}_d$ is the $d\times d$ identity matrix, $\mathbf{e}_r$ is its $r$th column, $\mathbf{0}\in\mathbb{R}^{d\times c}$ is a zero matrix and $\mathbf{0}^r$ is its $r$th row. Let $\mathbf{U}=[\mathbf{X},\mu\mathbf{I}_d]\in\mathbb{R}^{d\times m}$, $\mathbf{V}=[\mathbf{B}^T,\mathbf{0}^T]^T\in\mathbb{R}^{m\times c}$ and $m=n+d$. Then the optimization problem (6) can be equivalently written as

$$\min_{\mathbf{G}} J(\mathbf{G})=\sum_{i=1}^{m}\|\mathbf{u}_i^T\mathbf{G}-\mathbf{v}^i\|_2^{\theta}, \tag{8}$$

where $\theta=\alpha$ if $i\le n$ and $\theta=\beta$ otherwise, $\mathbf{u}_i$ is the $i$th column of $\mathbf{U}$, and $\mathbf{v}^i$ is the $i$th row of $\mathbf{V}$.

We now show that problem (8) can be solved using the iteratively re-weighted least squares (IRLS) method [14, Section 4.5.2]. Taking the derivative of $J(\mathbf{G})$ in Equation (8) w.r.t. $\mathbf{g}_j$ and setting $\frac{\partial J(\mathbf{G})}{\partial \mathbf{g}_j}=\mathbf{0}$, we obtain the first-order optimality condition

$$\sum_{i=1}^{m}\theta\Big(\sum_{j=1}^{c}(\mathbf{u}_i^T\mathbf{g}_j-v_{ij})^2\Big)^{\frac{\theta-2}{2}}(\mathbf{u}_i\mathbf{u}_i^T\mathbf{g}_j-\mathbf{u}_i v_{ij})=\mathbf{0}, \tag{9}$$

where $v_{ij}$ is the $j$th component of $\mathbf{v}^i$. Let

$$w_i=\theta\Big(\sum_{j=1}^{c}(\mathbf{u}_i^T\mathbf{g}_j-v_{ij})^2+\varepsilon\Big)^{\frac{\theta-2}{2}},\qquad \varepsilon>0, \tag{10}$$

where the small constant $\varepsilon$ prevents the residual in Equation (10) from being zero. Then we have

$$\sum_{i=1}^{m}w_i\mathbf{u}_i\mathbf{u}_i^T\mathbf{g}_j=\sum_{i=1}^{m}w_i\mathbf{u}_i v_{ij}. \tag{11}$$

From Equation (11) we see that $\mathbf{G}$ can be solved iteratively: with $\mathbf{g}_l$ $(l=1,2,\dots,c,\ l\ne j)$ fixed, solve Equation (11) to update $\mathbf{g}_j$; repeat this process over each column of $\mathbf{G}$ until no column of $\mathbf{G}$ changes.

When solving for $\mathbf{g}_j$, Equation (11) is a nonlinear equation w.r.t. $\mathbf{g}_j$:

$$\sum_{i=1}^{m}w_i(\dots,\mathbf{g}_j,\dots)\,\mathbf{u}_i\mathbf{u}_i^T\mathbf{g}_j=\sum_{i=1}^{m}w_i(\dots,\mathbf{g}_j,\dots)\,\mathbf{u}_i v_{ij}. \tag{12}$$

Equation (12) can be solved with an iterative algorithm: given $w_i$ $(i=1,2,\dots,m)$, $\mathbf{g}_j$ is computed by solving the remaining linear equation in (12); and given $\mathbf{g}_j$, the weights $w_i$ $(i=1,2,\dots,m)$ are updated by computing Equation (10). That is, we can solve Equation (12) by executing the following iterative scheme:

$$\sum_{i=1}^{m}w_i(\dots,\mathbf{g}_j^{(k)},\dots)\,\mathbf{u}_i\mathbf{u}_i^T\mathbf{g}_j^{(k+1)}=\sum_{i=1}^{m}w_i(\dots,\mathbf{g}_j^{(k)},\dots)\,\mathbf{u}_i v_{ij}. \tag{13}$$

Note that, given $w_i$ $(i=1,2,\dots,m)$, the solution to Equation (12) is exactly the solution to the weighted least squares problem

$$\min_{\mathbf{g}_j} J(\mathbf{g}_j)=\sum_{i=1}^{m}w_i(\mathbf{u}_i^T\mathbf{g}_j-v_{ij})^2. \tag{14}$$

So the solution procedure for $\mathbf{g}_j$ discussed above is in essence an IRLS algorithm [3], which we denote IRLS-g in the following discussion. Note that the weighted least squares problem (14) can be converted into a standard least squares problem, which can be solved efficiently using the LSQR algorithm [20]. Algorithm 3 presents the detailed steps of IRLS-g.

[Algorithm 3: IRLS-g]
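As an illustration of IRLS-g, the following sketch solves the single-column case (c = 1), where the weight in Equation (10) involves only one residual per sample. Each weighted least squares subproblem (14) is reduced to a standard one by scaling rows with √wᵢ and then solved with SciPy's LSQR, as in the text; the function name and iteration counts are our choices:

```python
import numpy as np
from scipy.sparse.linalg import lsqr

def irls_g(U, v, theta, n_iter=5, eps=1e-6):
    """IRLS sketch for min_g sum_i |u_i^T g - v_i|^theta (c = 1 case).

    U : (d, m) matrix whose columns are u_i;  v : (m,) target vector.
    """
    A = U.T                        # (m, d): rows are u_i^T
    g = lsqr(A, v)[0]              # unweighted least squares start
    for _ in range(n_iter):
        r2 = (A @ g - v) ** 2      # squared residuals
        w = theta * (r2 + eps) ** ((theta - 2.0) / 2.0)   # weights, Eq. (10)
        s = np.sqrt(w)
        g = lsqr(s[:, None] * A, s * v)[0]   # weighted LS via row scaling
    return g
```

For θ = 2 the weights are constant, so the iteration reproduces the ordinary least squares solution, which gives a convenient correctness check.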

Based on the IRLS- g algorithm above, we can present an iterative algorithm for the optimal G:

  1. Call the IRLS-g algorithm (Algorithm 3) repeatedly to update each column of G in turn, until all the columns of G have been updated.

  2. Repeat this process until G is no longer updated.

In this way, we finally obtain the solution of problem (8), namely the solution of the original problem (6). Summarizing the above analysis, we obtain Algorithm 4. In addition, Algorithm 4 can be initialized using the solution to the following optimization problem:

$$\min_{\mathbf{G}^{(0)}} J(\mathbf{G}^{(0)})=\|\mathbf{U}^T\mathbf{G}^{(0)}-\mathbf{V}\|_F^2+\gamma\|\mathbf{G}^{(0)}\|_F^2. \tag{15}$$

Problem (15) is a standard squared F-norm regularized regression problem, which can also be solved efficiently using the LSQR algorithm [20].
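The initialization (15) is a ridge regression, and LSQR handles it directly through its damping parameter: with damp = √γ, LSQR minimizes ||Ax − b||² + γ||x||². A small sketch (the function name is ours):

```python
import numpy as np
from scipy.sparse.linalg import lsqr

def init_G(U, V, gamma):
    """Solve (15): min ||U^T G - V||_F^2 + gamma ||G||_F^2, column by column.

    LSQR with damp = sqrt(gamma) minimizes ||A x - b||^2 + gamma ||x||^2,
    so each column of G^(0) is one damped LSQR solve.
    """
    A = U.T                                    # (m, d)
    cols = [lsqr(A, V[:, j], damp=np.sqrt(gamma))[0] for j in range(V.shape[1])]
    return np.stack(cols, axis=1)              # (d, c)
```

The result agrees with the closed-form ridge solution (AᵀA + γI)⁻¹AᵀV while avoiding the explicit matrix inversion.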

[Algorithm 4]

3.3. Time complexity analysis

The time complexity of the proposed IRLS-g algorithm (Algorithm 3) can be analyzed as follows. Line 1 takes $O(cdm)$ time to update $w_i$ $(i=1,2,\dots,m)$; Line 2 takes $O(t_1(3m+5d+2md))$ time to update $\mathbf{g}_j$ with the LSQR algorithm, where $t_1$ is the number of LSQR iterations. So the total time complexity of the proposed IRLS-g algorithm is $O(t_2(cdm+t_1(3m+5d+2md)))$. Based on this, the time complexity of the proposed $l^{\alpha}_{2,\beta}$-RFS algorithm (Algorithm 4) can be analyzed as follows. Line 1 takes $O(t_0c(3m+5d+2md))$ time for computing $\mathbf{G}^{(0)}$ with the LSQR algorithm [20]. Line 2 takes $O(ct_2(cdm+t_1(3m+5d+2md)))$ time for updating $\mathbf{g}_j$ $(j=1,2,\dots,c)$ by running Algorithm 3. Line 3 takes $O(dc)$ time for computing $b(\mathbf{G})$, and $O(ds)$ time for sorting the features. Hence, the total time complexity of the proposed $l^{\alpha}_{2,\beta}$-RFS algorithm is

$$O\big(t_0c(3m+5d+2md)+t_3ct_2(cdm+t_1(3m+5d+2md))\big),$$

which can be further simplified to

$$O\big((t_0+t_3t_2t_1)c(3m+5d+2md)+t_3c^2dm\big).$$

3.4. Convergence analysis

Theorem 3.1

  1. When $0<\theta\le 2$, the objective function sequence $J(\mathbf{G}^{(k)})$ generated by Algorithm 4 is monotonically decreasing and convergent.

  2. When $1\le\theta\le 2$, the matrix sequence $\mathbf{G}^{(k)}$ generated by Algorithm 4, or a subsequence of it, converges to a global minimum point of problem (6); when $0<\theta<1$, the sequence $\mathbf{G}^{(k)}$ generated by Algorithm 4, or a subsequence of it, converges to a local minimum point of problem (6).

Proof.

  1. Let $Z_{ij}=\sum_{l=1,l\ne j}^{c}(\mathbf{u}_i^T\mathbf{g}_l-v_{il})^2$ and $r_{i,j}=\mathbf{u}_i^T\mathbf{g}_j-v_{ij}$. With $\mathbf{g}_l$ $(l=1,2,\dots,c,\ l\ne j)$ fixed, problem (8) can be rewritten as
     $$\min_{\mathbf{g}_j} J(\mathbf{g}_j)=\sum_{i=1}^{m}(r_{i,j}^2+Z_{ij})^{\frac{\theta}{2}}. \tag{16}$$
     Without loss of generality, problem (16) can be formulated as the following optimization problem:
     $$\min_{\mathbf{g}} J(\mathbf{g})=\sum_{i=1}^{m}(r_i^2+Z_i)^{\frac{\theta}{2}}, \tag{17}$$
     where $r_i=\mathbf{u}_i^T\mathbf{g}-v_{ij}$. First, we claim that, when running Algorithm 3, $J(\mathbf{g})$ decreases at each iteration, that is, $J(\mathbf{g}^{(k+1)})\le J(\mathbf{g}^{(k)})$. In fact, for $r>0$, let $\rho_i(r)=(r^2+Z_i)^{\frac{\theta}{2}}$ and $h_i(r)=(r+Z_i)^{\frac{\theta}{2}}$. It follows that $\rho_i(r)=h_i(r^2)$ and $w_i(r)=2h_i'(r^2)$. Since $h_i'(r)=\frac{\theta}{2}(r+Z_i)^{\frac{\theta-2}{2}}$ is decreasing for $0<\theta\le 2$, $h_i(r)$ is concave for $0<\theta\le 2$. Problem (17) can therefore be rewritten as
     $$\min_{\mathbf{g}} J(\mathbf{g})=\sum_{i=1}^{m}h_i(r_i^2). \tag{18}$$
     By the concavity of $h_i$, we have
     $$J(\mathbf{g}^{(k+1)})-J(\mathbf{g}^{(k)})\le\sum_{i=1}^{m}h_i'\big(r_i(\mathbf{g}^{(k)})^2\big)\big(r_i(\mathbf{g}^{(k+1)})^2-r_i(\mathbf{g}^{(k)})^2\big)=\frac{1}{2}\sum_{i=1}^{m}w_i\big(r_i(\mathbf{g}^{(k+1)})-r_i(\mathbf{g}^{(k)})\big)\big(r_i(\mathbf{g}^{(k+1)})+r_i(\mathbf{g}^{(k)})\big). \tag{19}$$
     Since $r_i(\mathbf{g}^{(k+1)})-r_i(\mathbf{g}^{(k)})=(\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)})^T\mathbf{u}_i$ and $r_i(\mathbf{g}^{(k+1)})+r_i(\mathbf{g}^{(k)})=\mathbf{u}_i^T(\mathbf{g}^{(k)}+\mathbf{g}^{(k+1)})-2v_{ij}$, using Equation (13) we have
     $$J(\mathbf{g}^{(k+1)})-J(\mathbf{g}^{(k)})\le\frac{1}{2}(\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)})^T\sum_{i=1}^{m}w_i\mathbf{u}_i\mathbf{u}_i^T\big(\mathbf{g}^{(k)}+\mathbf{g}^{(k+1)}-2\mathbf{g}^{(k+1)}\big)=\frac{1}{2}(\mathbf{g}^{(k+1)}-\mathbf{g}^{(k)})^T\sum_{i=1}^{m}w_i\mathbf{u}_i\mathbf{u}_i^T\big(\mathbf{g}^{(k)}-\mathbf{g}^{(k+1)}\big). \tag{20}$$
     Since $\sum_{i=1}^{m}w_i\mathbf{u}_i\mathbf{u}_i^T$ is positive semi-definite, $J(\mathbf{g}^{(k+1)})-J(\mathbf{g}^{(k)})\le 0$; that is, the sequence $J(\mathbf{g}^{(k)})$ is decreasing. At the same time, it is bounded from below, so the sequence $J(\mathbf{g}^{(k)})$ generated by Algorithm 3 converges. Since Algorithm 4 is a process of repeatedly calling Algorithm 3, the sequence $J(\mathbf{G}^{(k)})$ generated by Algorithm 4 is also monotonically decreasing; it too is bounded from below, so it converges. This proves (1).
  2. When $1\le\theta\le 2$, $J(\mathbf{G})$ is convex, so any minimum point must be a global minimum point. If the minimum point of $J(\mathbf{G})$ is unique, it can be shown that $\mathbf{G}^{(k)}$ converges to it. In fact, since $J(\mathbf{G}^{(k)})$ converges, the matrix sequence $\mathbf{G}^{(k)}$ is bounded. It has a subsequence with a limit $\mathbf{G}^{*}$, which by continuity satisfies (11) and is hence a minimum point of (6). If the convergent subsequence of $\mathbf{G}^{(k)}$ is unique, then $\mathbf{G}^{(k)}\to\mathbf{G}^{*}$; otherwise, there would exist a subsequence of $\mathbf{G}^{(k)}$ bounded away from $\mathbf{G}^{*}$, which in turn would have a convergent subsequence with a limit different from $\mathbf{G}^{*}$ that would also satisfy (11) and hence be a minimum point of (6). This contradicts the assumption that $J(\mathbf{G})$ has a unique minimum point. If the minimum point of $J(\mathbf{G})$ is not unique, then by the analysis above $\mathbf{G}^{(k)}$ has a subsequence which converges to a minimum point of $J(\mathbf{G})$. When $0<\theta<1$, $J(\mathbf{G})$ is non-convex; by the same discussion, the sequence generated by Algorithm 4, or a subsequence of it, converges to a local minimum point of problem (6).

Remark 3.2

The proof presented above is in fact a concrete instance of the convergence proof method presented in [14, Section 9.1], mainly because our Algorithm 3 can be regarded as a concrete example of Algorithm 6 in [14, Section 4.5.2]. Indeed, the convergence theorem already given in [14, Section 4.5.2] directly yields a conclusion almost identical to Theorem 3.1; for clarity, we nevertheless present a detailed proof here.

4. Experiments

We evaluate the effectiveness of $l^{\alpha}_{2,\beta}$-RFS in this section. It contains four parts. The convergence of the algorithm is first studied empirically in Section 4.1. In Sections 4.2 and 4.3, we compare $l^{\alpha}_{2,\beta}$-RFS with other feature selection methods, including Laplacian score (LS) [10], ReliefF (RF) [15], minimal-redundancy-maximal-relevance (mRMR) [21], trace ratio (TR) [19], l2,1-RFS [18] and l2,p-RFS [24], in terms of classification accuracy and running time, respectively. Finally, we study the effects of the parameters involved in $l^{\alpha}_{2,\beta}$-RFS on its performance in Section 4.4. In the experiments, the 1-Nearest-Neighbor (1NN) algorithm [7] is applied to classify the low-dimensional data resulting from all seven feature selection algorithms, and five-fold cross validation is used to compute classification accuracy. In the $l^{\alpha}_{2,\beta}$-RFS algorithm, the numbers of iterations of the LSQR algorithm [20] for computing $\mathbf{G}^{(0)}$ and updating $\mathbf{G}$ are set to 20 and 5, respectively; the numbers of iterations of Algorithm 3 and Algorithm 4 are set to 1 and 20, respectively. Seven data sets are selected for the experiments. Five are gene expression data sets, ALLAML, GLIOMA, LUNG, PRO-GE [18,24] and COLON [23], and two are image data sets, COIL20 and USPS [23]. All data sets were preprocessed by standardization and centralization. The relevant statistics of each data set are shown in Table 1. The experimental environment is an Intel(R) Core(TM) i7-8700 CPU@3.3 GHz with 16 GB RAM, running the Windows 10 operating system and the MATLAB 2014a simulation tool.

Table 1.

Summary of the test data sets used in our experiment.

Data set Size (n) Dimension (m) # of classes (c)
ALLAML 72 3571 2
GLIOMA 50 4434 4
LUNG 203 3312 5
PRO-GE 102 5966 2
COLON 62 2000 2
COIL20 1440 1024 20
USPS 9298 256 10

4.1. Convergence

In this experiment, we study the convergence of the $l^{\alpha}_{2,\beta}$-RFS algorithm empirically. Figure 1 shows the objective function value of the $l^{\alpha}_{2,\beta}$-RFS algorithm over the iterations on the 7 data sets, where the parameters $\alpha$, $\beta$ and $\gamma$ are fixed at 1.25, 0.25 and 10. We can observe from Figure 1 that the objective function value shows a non-increasing trend during the iterations and converges after about six iterations. This shows that the $l^{\alpha}_{2,\beta}$-RFS algorithm converges fast. We observed similar trends under other parameter values; those results are omitted.

Figure 1. Convergence analysis curve.

4.2. Classification accuracy

In this experiment, the feature selection ability of the proposed $l^{\alpha}_{2,\beta}$-RFS algorithm is evaluated in terms of the classification accuracy of the subsequent 1NN classifier. The regularization parameter $\gamma$ in l2,1-RFS [18], l2,p-RFS [24] and the proposed $l^{\alpha}_{2,\beta}$-RFS algorithm is selected from the set $\{10^{-6},10^{-5},10^{-4},10^{-3},10^{-2},10^{-1},10^{0},10^{1},10^{2},10^{3},10^{4},10^{5},10^{6}\}$; the parameter p in l2,p-RFS [24] is selected from $\{0.25,0.5,0.75,1\}$; the parameters $\alpha$ and $\beta$ in the proposed $l^{\alpha}_{2,\beta}$-RFS algorithm are selected from $\{0.25,0.5,0.75,1,1.25,1.5,1.75,2\}$ and $\{0.25,0.5,0.75,1,1.25,1.5,1.75\}$, respectively. The classification accuracy and its standard deviation (SD) corresponding to the best parameters of each algorithm are reported in Tables 2–5, which show the results when the number of selected features is 20%, 40%, 60% and 80% of all features, respectively. The last line of each table is the average classification accuracy of each feature selection algorithm over all data sets. In the tables, '×' means that the corresponding algorithm did not terminate and thus failed to output a result.

Table 2.

Classification accuracy (%) and its standard deviation of 1NN using five-fold cross-validation for the top 20% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 92.95±7.15 91.42±11.74 92.85±10.10 92.85±8.75 95.71±6.39 94.29± 5.98 98.57±3.91
GLIOMA 74.00±11.40 78.00±8.37 76.00±5.48 66.00±11.40 80.00±7.07 74.80±4.47 86.00±8.94
LUNG 95.07±3.45 95.08±2.98 94.58±2.69 96.06±3.70 97.04±2.07 97.04±2.07 98.52±1.35
PRO-GE 69.57±8.27 83.29±4.52 89.19±6.47 84.33±6.23 90.19±6.99 91.19±5.33 90.14±6.56
COLON 64.74±10.94 77.43±8.72 79.10±9.22 74.23±2.92 82.31±5.25 80.77±6.86 88.72±4.36
COIL20 91.73±1.18 99.79±0.19 99.86±0.19 98.61±1.26 100±0.00 100±0.00 99.79±0.31
USPS 83.05±1.99 93.84±0.71 92.23±0.91 89.04±0.65 96.36±0.14 × 95.17±0.32
Mean accuracy 81.59 88.41 89.12 85.88 91.95 89.68 93.85

Table 5.

Classification accuracy (%) and its standard deviation of 1NN using five-fold cross-validation for the top 80% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 94.29±7.82 92.95±7.15 94.29±7.82 92.86±10.10 94.29±7.83 94.29±7.83 95.71±3.91
GLIOMA 76.00±8.94 76.00±8.94 80.00±7.07 78.00±8.37 78.00±8.37 78.00±8.37 80.00±7.07
LUNG 94.59±1.07 93.12±1.98 94.57±2.08 94.59±1.73 95.07±1.73 94.59±2.66 95.56±2.06
PRO-GE 83.24±7.59 82.24±9.12 87.24±7.44 83.29±8.12 85.24±8.12 85.24±7.84 87.19±5.69
COLON 74.36±7.94 77.56±5.98 74.36±5.79 74.36±5.79 77.56±5.98 75.90±5.06 80.77±3.51
COIL20 99.79±0.31 99.86±0.19 100.00±0.00 99.86±0.00 100.00±0.00 100.00±0.00 99.86±0.31
USPS 97.02±0.39 97.05±0.24 97.25±0.37 96.92±0.37 97.41±0.378 × 97.56±0.32
Mean accuracy 88.47 88.40 89.67 88.55 89.81 88.00 90.95

Table 3.

Classification accuracy (%) and its standard deviation of 1NN using five-fold cross-validation for the top 40% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 95.71±6.39 94.29±7.82 95.71±6.39 94.29±7.82 95.71±3.91 97.14±3.91 97.14±3.91
GLIOMA 72.00±13.04 78.00±4.47 74.00±8.94 74.00±8.94 80.00±7.07 78.00±4.47 82.00±8.37
LUNG 95.06±3.03 94.60±3.16 96.06±2.19 96.05±1.38 96.54±2.22 97.04±2.07 97.54±2.47
PRO-GE 79.33±9.04 81.24±6.80 86.24±4.27 81.38±3.96 88.19±5.69 89.19±4.11 90.14±5.33
COLON 67.95±8.68 79.23±6.24 78.97±4.65 75.90±5.06 80.77±6.86 80.64±4.36 83.97±5.25
COIL20 97.29±0.37 100.00±0.00 99.93±0.15 99.86±0.19 100.00±0.00 100.00±0.00 99.86±0.31
USPS 92.07±0.85 95.97±0.45 96.49±0.41 91.96±0.53 97.20±0.35 × 97.28±0.40
Mean accuracy 85.63 89.05 89.63 87.63 91.41 90.34 92.56

Table 4.

Classification accuracy (%) and its standard deviation of 1NN using five-fold cross-validation for the top 60% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 95.71±6.39 92.95±7.15 95.71±6.39 95.71±6.39 94.29±7.83 95.71±6.39 97.14±3.91
GLIOMA 76.00±2.66 78.00±2.14 76.00±5.48 78.00±8.37 80.00±7.07 80.00±7.07 80.00±7.07
LUNG 94.59±2.19 93.61±1.38 95.06±2.49 94.57±3.20 97.04± 2.07 96.55±2.79 97.04±2.07
PRO-GE 84.29±6.50 81.29±5.64 87.29±6.52 82.29±5.8 87.19± 7.61 86.24±6.40 88.19±6.47
COLON 72.82±8.07 77.69±9.72 78.97±4.65 77.56±5.98 77.44±3.43 77.44±3.43 83.97±5.25
COIL20 98.68±0.66 100.00±0.00 100.00±0.00 99.86±0.19 100.00±0.00 100.00±0.00 99.93 ±0.16
USPS 95.53±0.34 96.58±0.42 97.34±0.40 95.88±0.46 97.41±0.31 × 97.63±0.26
Mean accuracy 88.23 88.59 90.05 89.13 90.48 89.32 91.99

It can be seen from the above four tables that, for most data sets and most numbers of selected features, the $l^{\alpha}_{2,\beta}$-RFS algorithm achieves the best classification accuracy among the seven algorithms. For most numbers of selected features, the $l^{\alpha}_{2,\beta}$-RFS algorithm also outperforms the other six algorithms in terms of average classification accuracy. This is most obvious when the number of selected features is 20%. In particular, on COLON, the proposed $l^{\alpha}_{2,\beta}$-RFS algorithm not only performs best but also achieves classification accuracies about 6% and 8% higher than l2,1-RFS [18] and l2,p-RFS [24], respectively; on GLIOMA, $l^{\alpha}_{2,\beta}$-RFS achieves accuracies 6% and about 10% higher than the l2,1-RFS and l2,p-RFS algorithms, respectively.

4.3. Running time

In this experiment, we evaluate the efficiency of the l2,βα-RFS algorithm in terms of running time. Tables 6–9 present the running times corresponding to the optimal accuracies of all seven algorithms on the seven data sets under different parameters. From Tables 6–9, we can see that, overall, LS and TR are the most efficient of the seven algorithms. We also see that l2,1-RFS and l2,p-RFS are sensitive to n (the number of data samples), which is most evident on the USPS data set: the running times of these two algorithms are very long (l2,p-RFS runs so slowly there that it could not produce a result). This is consistent with our earlier analysis of the computational complexity of l2,1-RFS and l2,p-RFS: the complexities of both algorithms grow cubically with n. The efficiency of l2,βα-RFS, in contrast, is robust to the data scale: its running time changes smoothly across different numbers of samples and feature dimensions.
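The source of this gap is the inner least squares solver. As an illustrative sketch (not the paper's code), SciPy's LSQR routine solves a least squares subproblem purely through matrix–vector products, so it never forms or factorizes the large normal-equations matrix whose factorization cost grows cubically with the data size; the sizes below are arbitrary for illustration:

```python
import numpy as np
from scipy.sparse.linalg import lsqr

# Hypothetical sizes: n samples, d features.
rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# LSQR iterates using only products with X and X.T; it never builds
# the d x d (or n x n) matrix that a direct solver would factorize.
w, istop, itn = lsqr(X, y)[:3]

# Its solution matches the direct least squares solution.
w_direct = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(w, w_direct, atol=1e-4)
```

Since LSQR's per-iteration cost is proportional to the number of nonzeros of X, an IRLS outer loop built on it avoids the cubic dependence on n described above.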

Table 7.

Running time (in seconds) of classification of 1NN using five-fold cross-validation for the top 40% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 0.05 83.84 0.89 0.30 0.18 0.24 8.23
GLIOMA 0.06 143.73 0.79 0.37 0.16 0.36 39.01
LUNG 0.14 81.88 2.71 0.31 0.67 0.28 15.87
PRO-GE 0.13 315.06 2.10 0.82 0.47 0.72 90.69
COLON 0.01 36.40 0.60 0.09 0.06 0.06 5.04
COIL20 1.13 11.57 13.60 1.30 11.15 1.46 11.84
USPS 23.57 15.54 105.13 24.20 623.15 × 21.88

Table 8.

Running time (in seconds) of classification of 1NN using five-fold cross-validation for the top 60% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 0.10 127.26 0.94 0.38 0.19 0.17 9.86
GLIOMA 0.14 225.80 0.87 0.48 0.17 0.35 15.02
LUNG 0.24 121.35 2.82 0.44 0.75 0.45 6.11
PRO-GE 0.27 482.67 2.25 1.05 0.46 0.55 56.07
COLON 0.03 50.74 0.61 0.12 0.06 0.05 7.28
COIL20 1.99 15.56 14.48 2.29 10.21 2.21 8.95
USPS 38.64 18.99 120.26 38.88 614.03 × 38.80

Table 6.

Running time (in seconds) of classification of 1NN using five-fold cross-validation for the top 20% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 0.02 39.65 0.86 0.25 0.18 0.16 9.16
GLIOMA 0.02 64.64 0.75 0.31 0.16 0.36 36.78
LUNG 0.09 38.76 2.66 0.22 0.64 0.25 17.68
PRO-GE 0.06 129.88 2.02 0.68 0.45 1.01 78.22
COLON 0.01 19.38 0.59 0.08 0.06 0.06 9.22
COIL20 0.54 6.42 12.98 0.61 9.83 0.92 12.73
USPS 10.90 11.62 92.38 11.92 617.18 × 12.52

Table 9.

Running time (in seconds) of classification of 1NN using five-fold cross-validation for the top 80% of features.

Data set LS mRMR RF TR l2,1-RFS l2,p-RFS l2,βα-RFS
ALLAML 0.22 158.64 1.05 0.54 0.20 0.27 17.21
GLIOMA 0.32 283.96 1.04 0.68 0.17 0.26 38.10
LUNG 0.41 149.06 2.99 0.65 0.78 0.42 4.36
PRO-GE 0.59 597.89 2.58 1.48 0.51 0.59 71.46
COLON 0.05 60.44 0.64 0.17 0.06 0.05 7.46
COIL20 3.31 18.12 15.82 3.64 10.55 3.45 15.44
USPS 56.08 21.85 137.11 55.89 628.44 × 51.51

4.4. Effect of parameters

The l2,βα-RFS algorithm has three parameters: α, β and γ. α controls the robustness of the loss function, β controls the sparsity of the transformation matrix, and γ is the regularization parameter that balances the least squares term against the regularization term. In this experiment, we empirically study the effect of the three parameters on the performance of the l2,βα-RFS algorithm in terms of the classification accuracy of the subsequent 1NN classifier.
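As a small illustration of what these exponents control (the helper below is ours, not from the paper), the l2,p-norm of a matrix is the p-norm of the vector of its row-wise l2 norms; driving whole rows of the transformation matrix to zero is what turns the regression into a feature selector, and smaller exponents reward such row sparsity more strongly:

```python
import numpy as np

def l2p_norm(M, p):
    """(sum_i ||m_i||_2^p)^(1/p): the l_{2,p}-norm over the rows of M."""
    row_norms = np.linalg.norm(M, axis=1)
    return (row_norms ** p).sum() ** (1.0 / p)

W = np.array([[3.0, 4.0],   # row norm 5
              [0.0, 0.0],   # row norm 0: a pruned feature
              [0.6, 0.8]])  # row norm 1

# p = 1 recovers the l_{2,1}-norm; p = 2 recovers the Frobenius norm.
assert np.isclose(l2p_norm(W, 1.0), 6.0)            # 5 + 0 + 1
assert np.isclose(l2p_norm(W, 2.0), np.sqrt(26.0))  # Frobenius
```

In the l2,βα-RFS criterion, α is the exponent applied to the row norms of the residual (robustness to outlying samples) and β is the exponent applied to the row norms of W (row sparsity).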

In order to evaluate the influence of parameter γ on the performance of the proposed l2,βα-RFS algorithm, we fix parameters α and β to 1 and plot the classification accuracy of l2,βα-RFS as a function of γ, as shown in Figure 2. We can observe that γ does have an impact on the performance of the l2,βα-RFS algorithm. It appears that, for most data sets, [10^{-4}, 10^{-1}] is a relatively robust selection range for γ.

Figure 2.

Classification accuracies on seven data sets with different γ, where (a), (b), (c) and (d) show the classification accuracy when the proportion of selected features is 20%, 40%, 60% and 80%, respectively.

In order to evaluate the influence of parameter β on the performance of the proposed l2,βα-RFS algorithm, we fix parameters α and γ to 1 and plot the classification accuracy of l2,βα-RFS as a function of β, as shown in Figure 3. We can observe that β does have an impact on the performance of the l2,βα-RFS algorithm; this is especially obvious on some data sets such as GLIOMA and PROSTATE. We can also observe that, on several data sets, the optimal classification accuracy is not obtained at β=1 (corresponding to the l2,1-norm). This result shows that it is generally necessary to widen the value range of parameter β. It appears that, for most data sets, [0.25,0.75] is a relatively robust selection range for β.

Figure 3.

Classification accuracies on seven data sets with different β, where (a), (b), (c) and (d) show the classification accuracy when the proportion of selected features is 20%, 40%, 60% and 80%, respectively.

In order to evaluate the influence of parameter α on the performance of the proposed l2,βα-RFS algorithm, we fix parameters β and γ to 1 and 10 respectively, and plot the classification accuracy of l2,βα-RFS as a function of α, as shown in Figure 4. We can observe that α does have an impact on the performance of the l2,βα-RFS algorithm; this is especially obvious on some data sets such as GLIOMA and PROSTATE. We can also observe that, on several data sets, the optimal classification accuracy is not obtained at α=1 (corresponding to the l2,1-norm) or α=2 (corresponding to the Frobenius norm). This result shows that it is generally necessary to widen the value range of parameter α. It appears that, for most data sets, [1.25,1.5] is a relatively robust selection range for α.

Figure 4.

Classification accuracies on seven data sets with different α, where (a), (b), (c) and (d) show the classification accuracy when the proportion of selected features is 20%, 40%, 60% and 80%, respectively.

In order to further test the influence of parameters α and β on the performance of the proposed l2,βα-RFS algorithm, we run l2,1-RFS, l2,p-RFS and the proposed l2,βα-RFS using our IRLS algorithm (Algorithm 4). For ease of comparison, we denote the l2,p-RFS algorithm as l2,β-RFS. Figure 5 depicts the classification accuracy as a function of β on all seven data sets, where the accuracy of l2,βα-RFS is the optimal accuracy over different values of α. From Figure 5, we can see that, for several data sets, there are values of β at which the classification accuracy of l2,β-RFS exceeds that of l2,1-RFS; this shows that it is beneficial to extend l2,1-RFS to l2,β-RFS. We also see that, for all data sets and all values of β, the classification accuracy of l2,βα-RFS is always better than that of l2,β-RFS, which shows that it is beneficial to further extend l2,β-RFS to l2,βα-RFS.
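For concreteness, a minimal dense sketch of the IRLS scheme shared by these variants is given below. It follows the standard reweighting derivation for min ||XW−Y||_{2,α}^α + γ||W||_{2,β}^β; the γβ/α scaling of the regularizer weights and the ridge initialization are our assumptions for illustration, and the inner system is solved directly with np.linalg.solve, whereas Algorithm 4 in the paper uses LSQR:

```python
import numpy as np

def irls_l2ab(X, Y, alpha=1.5, beta=0.5, gamma=0.1, n_iter=30, eps=1e-8):
    """Illustrative IRLS for min ||XW-Y||_{2,a}^a + gamma*||W||_{2,b}^b.

    Each iteration solves a weighted ridge system whose weights come
    from the row norms of the previous iterate's residual and of W.
    """
    d = X.shape[1]
    # Ridge initialization so the first regularizer weights are finite.
    W = np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ Y)
    for _ in range(n_iter):
        E = X @ W - Y
        # Loss reweighting: ||e_i||^(alpha-2); eps guards zero residuals.
        d1 = (np.linalg.norm(E, axis=1) + eps) ** (alpha - 2)
        # Regularizer reweighting: ||w_j||^(beta-2); near-zero rows get
        # heavily penalized, which drives them to zero (row sparsity).
        d2 = (np.linalg.norm(W, axis=1) + eps) ** (beta - 2)
        A = X.T @ (d1[:, None] * X) + (gamma * beta / alpha) * np.diag(d2)
        W = np.linalg.solve(A, X.T @ (d1[:, None] * Y))
    return W

# Toy check: only the first two features generate Y, so the first two
# rows of the recovered W should carry the largest l2 norms.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
W_true = np.zeros((10, 3))
W_true[:2] = rng.standard_normal((2, 3))
Y = X @ W_true + 0.01 * rng.standard_normal((100, 3))
W = irls_l2ab(X, Y)
top2 = set(np.argsort(np.linalg.norm(W, axis=1))[-2:])
assert top2 == {0, 1}
```

Setting alpha=2 and beta=1 in this sketch recovers an l2,1-regularized least squares problem, the l2,1-RFS special case.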

Figure 5.

The classification accuracy of the l2,1-RFS, l2,p-RFS and l2,βα-RFS feature selection algorithms on seven data sets with different β.

5. Conclusion and future work

In this paper, a new optimization criterion for the l2,p-norm regression based feature selection (l2,p-RFS) algorithm [24], itself an extension of the l2,1-norm regression based feature selection algorithm [18], is proposed. The new criterion generalizes the optimization criterion of the l2,p-RFS algorithm by allowing the l2,p-norm used in the regression loss to differ from the one used in the regularization term, which improves the flexibility of the algorithm across different data sets. Based on the iterative re-weighted least squares framework, an effective and efficient algorithm, which we denote l2,βα-RFS, is proposed for the new criterion. Experimental results on a variety of real-world data sets show that the new algorithm is competitive with other related feature selection algorithms in terms of classification accuracy and efficiency. The parameters of the proposed l2,βα-RFS algorithm have a certain, and sometimes significant, influence on its performance. Automatic parameter selection remains an open problem; a promising direction is nested cross-validation [6] for supervised learning problems, which constitutes our future work.
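As a hedged sketch of that future direction (using scikit-learn and synthetic data, not the paper's pipeline; tuning the neighbor count here merely stands in for tuning α, β and γ), nested cross-validation selects hyperparameters in an inner loop while the outer loop scores the tuned model on folds the tuning never saw:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data set.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Inner loop: grid-search the hyperparameter on each training split.
inner = GridSearchCV(KNeighborsClassifier(),
                     {"n_neighbors": [1, 3, 5]}, cv=3)

# Outer loop: the reported accuracy comes from folds that played no
# part in choosing the hyperparameter, avoiding optimistic bias.
scores = cross_val_score(inner, X, y, cv=5)
print(scores.mean())
```

The outer estimate is unbiased with respect to the hyperparameter search, at the cost of refitting the inner search once per outer fold.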

Funding Statement

This work is supported by the National Science Foundation of China (Grant nos. 62071378, 62071379, 62071380, 61601362, 61671377 and 61901365), the National Science Basic Research Plan in Shaanxi Province of China (Nos. 2020JM-580 and 2021JM-461), the New Star Team of Xi'an University of Posts and Telecommunications (Grant no. xyt2016-01), and the Science Plan Foundation of the Education Bureau of Shaanxi Province of China (No. 18JK0719).

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1.Argyriou A., Evgeniou T., and Pontil M., Multi-task feature learning, Adv. Neural. Inf. Process. Syst. 19 (2007), pp. 41–48. [Google Scholar]
  • 2.Chen J., Ma Z., and Liu Y., Local coordinates alignment with global preservation for dimensionality reduction, IEEE Trans. Neural. Netw. Learn. Syst. 24 (2013), pp. 106–117. [DOI] [PubMed] [Google Scholar]
  • 3.Daubechies I., Devore R. and Fornasier M., Iteratively re-weighted least squares minimization: proof of faster than linear rate for sparse recovery, 42nd Annual Conference on Information Sciences and Systems, USA, pp. 26–29, 2008.
  • 4.Ding C. and Peng H., Minimum redundancy feature selection from microarray gene expression data. Proceedings of 2nd IEEE Computational Systems Bioinformatics Conference. pp. 523–528, August 2003. [DOI] [PubMed]
  • 5.Ding C., Zhou D., He X., and Zha H., R1-PCA: rotational invariant l1-norm principal component analysis for robust subspace factorization, Proc. Intl. Conf. Mach. Learn. 23 (2006), pp. 281–288. [Google Scholar]
  • 6.Dora L., Agrawal S., Panda R., et al., Nested cross-validation based adaptive sparse representation algorithm and its application to pathological brain classification, Expert. Syst. Appl. 114 (2018), pp. 313–321. [Google Scholar]
  • 7.Duda R.O., Hart P.E., and Stork D., Pattern Classification, New York, Wiley, 2000. [Google Scholar]
  • 8.Fukunaga K., Statistical Pattern Recognition, Academic Press, 2nd edition, 1990. [Google Scholar]
  • 9.Guyon I. and Elisseeff A., An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003), pp. 1157–1182. [Google Scholar]
  • 10.He X., Cai D., and Niyogi P., Laplacian score for feature selection, Adv. Neural. Inf. Process. Syst. 18 (2005), pp. 507–514. [Google Scholar]
  • 11.Hou C., Jiao Y., Nie F., Luo T., and Zhou Z., 2D feature selection by sparse matrix regression, IEEE Trans. Image Process. 26 (2017), pp. 4255–4268. [DOI] [PubMed] [Google Scholar]
  • 12.Hou C., Wang J., and Wu Y., Local linear transformation embedding, Neurocomputing 72 (2009), pp. 2368–2378. [Google Scholar]
  • 13.Hou C., Zhang C., and Wu Y., Multiple view semi-supervised dimensionality reduction, Pattern. Recognit. 43 (2010), pp. 720–730. [Google Scholar]
  • 14.Huber P.J., Robust Statistics, New York, Wiley, 1981. [Google Scholar]
  • 15.Kononenko I., Estimating attributes: analysis and extensions of relief, Eur. Conf. Mach. Learn. 7 (1994), pp. 171–182. [Google Scholar]
  • 16.Masaeli M., Fung G. and Dy J.G., From transformation-based dimensionality reduction to feature selection, Proceedings of the 27th International Conference on Machine Learning, pp. 751–758, 2010.
  • 17.Ming D. and Ding C., Robust flexible feature selection via exclusive L21 regularization. Proceedings of the 28th International Joint Conference on Artificial Intelligence. pp. 3158–3164, 2019.
  • 18.Nie F., Huang H., and Cai X., Efficient and robust feature selection via joint l2,1-norms minimization, Adv. Neural Inf. Process. Syst. 23 (2010), pp. 1–9. [Google Scholar]
  • 19.Nie F., Xiang S. and Jia Y., Trace ratio criterion for feature selection, Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pp. 671–676, 2008.
  • 20.Paige C. and Saunders M., LSQR: an algorithm for sparse linear equations and sparse least squares, ACM Trans. Math. Softw. 8 (1982), pp. 43–71. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Peng H., Long F., and Ding C., Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005), pp. 1226–1238. [DOI] [PubMed] [Google Scholar]
  • 22.Shao L., Liu L., and Li X., Feature learning for image classification via multiobjective genetic programming, IEEE. Trans. Neural. Netw. Learn. Syst. 25 (2014), pp. 1359–1371. [Google Scholar]
  • 23.Tao H., Hou C., and Nie F., Effective discriminative feature selection with nontrivial solution, IEEE. Trans. Neural. Netw. Learn. Syst. 27 (2016), pp. 796–808. [DOI] [PubMed] [Google Scholar]
  • 24.Wang L., Chen S., and Wang Y., A unified algorithm for mixed l2,p-minimizations and its application in feature selection, Comput. Optim. Appl. 58 (2014), pp. 409–421. [Google Scholar]
  • 25.Zhao H., Wang Z., and Nie F., A new formulation of linear discriminant analysis for robust dimensionality reduction, IEEE. Trans. Knowl. Data Eng. 31 (2018), pp. 629–640. [Google Scholar]

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
