Author manuscript; available in PMC: 2022 Jun 2.
Published in final edited form as: IEEE Trans Neural Netw Learn Syst. 2021 Jun 2;32(6):2595–2609. doi: 10.1109/TNNLS.2020.3006877

Discriminative Ridge Machine: A Classifier for High-Dimensional Data or Imbalanced Data

Chong Peng 1, Qiang Cheng 2
PMCID: PMC8219475  NIHMSID: NIHMS1710909  PMID: 32692682

Abstract

In this paper, we introduce a discriminative ridge regression approach to supervised classification. It estimates a representation model while accounting for discriminativeness between classes, thereby enabling accurate derivation of categorical information. This new type of regression model extends existing models such as ridge, lasso, and group lasso by explicitly incorporating discriminative information. As a special case, we focus on a quadratic model that admits a closed-form analytical solution. The corresponding classifier is called the discriminative ridge machine (DRM). Three iterative algorithms are further established for the DRM to enhance the efficiency and scalability for real applications. Our approach and the algorithms are applicable to general types of data including images, high-dimensional data, and imbalanced data. We compare the DRM with current state-of-the-art classifiers. Our extensive experimental results show superior performance of the DRM and confirm the effectiveness of the proposed approach.

Index Terms—: ridge regression, discriminative, label information, high-dimensional data, imbalanced data

I. Introduction

Classification is a critically important technique for enabling automated data-driven machine intelligence, and it has numerous applications in diverse fields ranging from science and technology to medicine, military, and business. For classifying large-sample data, theoretical understanding and practical applications have been successfully developed; nonetheless, when the data dimension or sample size becomes large, accuracy or scalability usually becomes an issue.

Modern data are increasingly high dimensional yet the sample size may be small. Such data usually have a large number of irrelevant entries; feature selection and dimension reduction techniques are often needed before applying classical classifiers that typically require a large sample size. Although some research efforts have been devoted to developing methods capable of classifying high-dimensional data without feature selection, classifiers with more desirable accuracy and efficiency are yet to be discovered.

Diverse areas of scientific research and everyday life are currently deluged with large-scale and/or imbalanced data. Achieving high classification accuracy, scalability, and balance of precision and recall (in the case of imbalanced data) simultaneously is challenging. While several classic data analytics and classifiers have been adapted to the large-scale or imbalanced setting, their performance is still far from desirable. There is an urgent need for developing effective, scalable data mining and prediction techniques suitable for large-scale and/or imbalanced data.

K-Nearest Neighbor (KNN) has been one of the most widely used classifiers due to its simplicity and efficiency. It predicts the label of an example by seeking the K-nearest neighbors among the training data. It typically relies on hard supervision, where binary labels over the samples indicate whether they are similar to the test example or not. However, binary similarities are not sufficient to represent complicated relationships of the data because they do not fully exploit rich structured and continuous similarity information. Thus, the KNN has difficulty finding proper neighbors for high-dimensional data. Moreover, the KNN omits important discriminative information of the data since the class information is not used in finding appropriate neighbors. To overcome the drawbacks of the KNN, in this paper, we introduce a discriminative ridge regression approach to classification. Instead of seeking binary similarity, it estimates a representation for a new example as a soft similarity vector by minimizing the fitting error in a regression framework, which allows the model to have continuous labels for the KNN supervision. Moreover, we inject the class information into the soft similarity metrics, such that the proposed model explicitly incorporates discriminative information between classes. Thus, the learned representation vector for the target example essentially represents the soft label concept coupled with the discriminative information by the proposed method. That is, we leverage the soft labels to distinguish the potentially highly complex patterns with maximum within-class similarity and between-class separability. Because our models explicitly account for discrimination, this new family of regression models is more geared toward classification. Based on the estimated rich representation, categorical information for the new example is subsequently derived.
Our method can be considered as a class-aware convex generalization of KNN beyond binary supervision. Also, this new type of discriminative regression models can be regarded as an extension of existing regression models such as the ridge, lasso, and group lasso regression, and when particular parameter values are taken the new models fall back to several existing models. As a special case, we consider a quadratic model, called the discriminative ridge machine (DRM), as the particular classifier of interest in this paper. The DRM admits a closed-form analytical solution, which is suitable for small-scale or high-dimensional data. For large-scale data, three optimization algorithms of improved computational cost are established for the DRM.

The family of discriminative ridge regression-based classifiers is applicable to general types of data including imagery or other high-dimensional data as well as imbalanced data. Extensive experiments on a variety of real world data sets demonstrate that the classification accuracy of the DRM is comparable to the support vector machine (SVM), a commonly used classifier, on classic large-sample data, and superior to several existing state-of-the-art classifiers, including the SVM, on high-dimensional data and imbalanced data. The DRM with linear kernel, called linear DRM in analogy to the linear SVM, has classification accuracy superior to the linear SVM on large-scale data. The efficiency of linear DRM algorithms is provably linear in data size, on a par with that of the linear SVM. Consequently, the linear DRM is a scalable classifier.

As an outline, the main contributions of this paper can be summarized as follows: 1) A new approach to classification, and thereby a family of new regression models and corresponding classifiers are introduced, which explicitly incorporate the discriminative information for multi-class classification on general data. The formulation is strongly convex and admits a unique global optimum. 2) The DRM is constructed as a special case of the family of new classifiers. It involves only quadratic optimization, with simple formulation yet powerful performance. 3) The new method can be regarded as a class-aware convex generalization of KNN beyond binary supervision as well as a discriminative extension of ridge regression and lasso. 4) The closed-form solution to the DRM optimization is obtained, which is computationally suited to classification of moderate-scale or high-dimensional data. 5) For large-scale data, three iterative algorithms are proposed for solving the DRM optimization substantially more efficiently. They all have theoretically proven convergence at a rate no worse than linear; the third one is an accelerated version of the second one. 6) The DRM with a general kernel demonstrates an empirical classification accuracy that is comparable to the SVM on classic (small- to mid-scale) large-sample data, while superior to the SVM or other state-of-the-art classifiers on high-dimensional data and imbalanced data. 7) The linear DRM demonstrates an empirical classification accuracy superior to the linear SVM on large-scale data, using any of the three iterative optimization algorithms. All methods have linear efficiency and scalability in the sample size or the number of features.

The rest of this paper is organized as follows. Section II discusses related research work. Sections III and IV introduce our formulations of discriminative ridge regression and its kernel version. The discriminative ridge regression-based classification is derived in Section V. In Section VI we construct and analyze the DRM. Experimental results are presented in Section VII. Finally, Section VIII concludes this paper.

II. Related Work

There is of course a vast literature for classification and a variety of methods have been proposed. Here we briefly review some closely related methods only; more thorough accounts of various classifiers and their properties are extensively discussed in [28], [30], [39], [77]. Fisher’s linear discriminant analysis (LDA) [30] has been commonly used in a variety of applications, for example, Fisherface and its variants for face recognition [9]. It is afflicted, however, by the rank deficiency problem or covariance matrix estimation problem when the sample size is less than the number of features. For a given test example, the nearest neighbor (NN) method [28] uses only one closest training example, or more generally, KNN uses K closest training examples, to supply categorical information, while the nearest subspace method [41] adopts all training examples to do it. These methods are affected much less by the rank deficiency problem than the LDA. Due to increasing availability of high-dimensional data such as text documents, images and genomic profiles, analytics of such data have been actively explored for about a decade, including classifiers tailored to high-dimensional data. As the number of features is often tens of thousands while the sample size may be only a few tens, some traditional classifiers such as the LDA suffer from the curse of dimensionality. Several variants of the LDA have been proposed to alleviate the affliction. For example, regularized discriminant analysis (RDA) uses the regularization technique to address the rank deficiency problem [37], while penalized linear discriminant analysis (PLDA) adds a penalization term to improve covariance matrix estimation [81].

Support vector machine (SVM) [25] is one of the most widely used classification methods with solid theoretical foundation and excellent practical applications. By minimizing a hinge loss with either an L2 or L1 penalty, it constructs a maximum-margin hyperplane between two classes in the instance space and extends to a nonlinear decision boundary in the feature space using kernels. Many variants have been proposed, including primal and dual solvers, and extensions to multiclass cases [70], [78].

The size of the training sample can be small in practice. Besides reducing the number of features with feature selection [23], [32], [53], [64], the size of the training sample can be increased with relevance feedback [69] and active learning techniques, which may help specify and refine the query information from the users as well. However, due to the high cost of wet-lab experiments in biomedical research or the rarity of certain events, it is often hard or even impractical to increase the sizes of minority classes in many real world applications; therefore, imbalanced data sets are often encountered in practice [16], [51], [80], [84]. It has been noted that the random forest (RF) [13] is relatively insensitive to imbalanced data. Furthermore, various methods have been particularly developed for handling imbalanced data [40], [47], which may be classified into three categories: data-level, algorithm-level, and hybrid methods.

Data-level methods change the training set to make the distributions of training examples over different classes more balanced, with under-sampling [59], [83], over-sampling [15], [20], and hybrid-sampling techniques [34]. Algorithm-level methods may compensate for the bias of some existing learning methods towards majority classes. Typical methods include cost-sensitive learning [60], [79], kernel perturbation techniques [55], and multi-objective optimization approaches [3], [10], [72]. Hybrid methods attempt to combine the advantages of data-level and algorithm-level methods [26]. In this paper we develop an algorithm-level method whose objective function is convex, leading to a high-quality, fast, scalable optimization and solution.

Large-scale data learning has been an active topic of research in recent years [31], [36], [46], [82]. For large-scale data, especially big data, the linear kernel is often adopted in learning techniques as a viable way to gain scalability. For classification on large-scale data, fast solvers for the primal SVM with the linear kernel, called linear SVM, have been extensively studied. Early work uses finite Newton methods to train the L2 primal SVM [54], [45]. Because the hinge loss is not smooth, a generalized second-order derivative and generalized Hessian matrix need to be used, which is not sufficiently efficient [71]. A stochastic gradient descent method has been proposed to solve the primal linear SVM for a wide class of loss functions [85]. SVMperf [44] solves the L1 primal SVM with a cutting plane technique. Pegasos alternates between gradient descent and projection steps for training large-scale linear SVM [71]. Another method [11], similar to the Pegasos, also uses stochastic descent for solving the primal linear SVM, and has more competitive performance than the SVMperf. For the L2 primal linear SVM, a Trust RegiOn Newton method (TRON) has been proposed which converges at a fast rate though the arithmetic complexity at each iteration is high [52]. It can also be used to optimize logistic regression problems. More efficient algorithms than the Pegasos or TRON use coordinate descent methods for solving the L2 primal linear SVM - primal coordinate descent (PCD) works in the primal domain [19] while dual coordinate descent (DCD) in the dual domain [43]. A library of methods is provided in the toolboxes Liblinear and Libsvm [18], [29].

III. Problem Statement and Discriminative Ridge Regression

Our setup is the usual multiclass setting where the training data set is denoted by $\{(x_i, y_i)\}_{i=1}^n$, with observation $x_i \in \mathbb{R}^p$ and class label $y_i \in \{1, \cdots, g\}$. Supposing the $j$th class $C_j$ has $n_j$ observations, we may also denote these training examples in $C_j$ by $x^j = [x_1^j, \cdots, x_{n_j}^j]$. Obviously $\sum_{j=1}^g n_j = n$. Without loss of generality, we assume the examples are given in groups; that is, $x_j^k = x_m$ with $m = \sum_{i=1}^{k-1} n_i + j$, $k = 1, \cdots, g$ and $j = 1, \cdots, n_k$. A data matrix is formed as $A = [x_1, \cdots, x_n] = [x^1, \cdots, x^g]$ of size $p \times n$. For traditional large-sample data, $n \gg p$, with a particular case being big data where $n$ is big; while for high-dimensional data, $n \ll p$. As will be clear later on, the proposed discriminative ridge regressions have no restrictions on $n$ or $p$, and thus are applicable to general data including both large-sample and high-dimensional data. Given a test example $x \in \mathbb{R}^p$, the task is to decide what class this test example belongs to.

In this paper, we first consider linear relationship of the examples in the instance space; subsequently in Section IV we will account for the nonlinearity by exploiting kernel techniques to map the data to a kernel space where linear relationship may be better observed.

If the class information is available and $x \in C_j$, a linear combination of $x_k^j$, $k = 1, \cdots, n_j$, is often used in the literature to approximate $x$ in the instance space; see, for example, [2], [6], [9], [50] for face and object recognition. By doing so, it is equivalent to assuming $x \in \mathrm{span}\{x_1^j, \cdots, x_{n_j}^j\}$. In the absence of the class information for $x$, a linear model can be obtained from all available training examples:

$x = Aw + e = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + e$, (1)

where $w = [w_1, w_2, \cdots, w_n]^T$ is a vector of combining coefficients to be estimated, and $e \in \mathbb{R}^p$ is additive zero-mean noise. Note that we may re-index $w = [(w^1)^T, \cdots, (w^g)^T]^T$ with $w^i \in \mathbb{R}^{n_i}$ in accordance with the groups of $A = [x^1, \cdots, x^g]$.

To estimate $w$, which is regarded as a representation of $x$, classic ridge regression or lasso uses an optimization model,

$\hat{w} = \arg\min_{w \in \mathbb{R}^n} \frac{1}{2}\|x - Aw\|_2^2 + \lambda\|w\|_q$, (2)

with q = 2 corresponding to ridge regression or q = 1 to lasso. The first term of this model is for data fitting while the second for regularization. While this model has been widely employed, a potential limitation in terms of classification is that the discriminative information about the classes is not taken into account in the formulation.
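As a reference point for what follows, model (2) with the common squared-norm choice $\rho(x) = x^2$ admits the familiar closed-form ridge estimate. Below is a minimal NumPy sketch; the function and variable names are ours, not from the paper:

```python
import numpy as np

def ridge_solution(A, x, lam):
    """Closed-form minimizer of (1/2)||x - A w||_2^2 + lam * ||w||_2^2,
    i.e., model (2) with q = 2 and a squared norm:
    w = (A^T A + 2*lam*I)^{-1} A^T x."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + 2 * lam * np.eye(n), A.T @ x)
```

The discriminative models introduced next modify exactly this objective by adding the within-class term $S_w(w)$.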

To enable the model to be more geared toward classification, we propose to incorporate the class discriminativeness into the regression. To this end, we desire w to have the following properties [86]: 1) maximal between-class separability and 2) maximal within-class similarity.

The class-level regularization may help achieve between-class separability. Besides this, we desire maximal within-class similarity because it may help enhance accuracy and also robustness, for example, in combating class noise [86]. To explicitly incorporate maximal within-class similarity into the objective function, inspired by the LDA, we propose to minimize the within-class dis-similarity induced by the combining vector w and defined as

$S_w(w) := \frac{1}{2}\operatorname{tr}\Big(\sum_{i=1}^g \sum_{j=1}^{n_i} (w_j^i x_j^i - \mu_i)(w_j^i x_j^i - \mu_i)^T\Big)$, (3)

where $\mu_i := \frac{1}{n_i}\sum_{j=1}^{n_i} w_j^i x_j^i$ are the weighted class mean vectors. Being a quadratic function in $w$, this $S_w(w)$ helps facilitate efficient optimization as well as scalability of our approach.

It is noted that $S_w(w)$ defined above is starkly different from the classic LDA formulation of the scatter matrix. The LDA seeks a unit vector - a direction - to linearly project the features; whereas we formulate the within-class dis-similarity using the instances weighted by an arbitrary vector $w$. Consequently, the projection direction in the LDA is sought in the $p$-dimensional feature space, while the weight vector $w$ is defined in the $n$-dimensional instance space.

We may also define w-weighted between-class separability and total separability as follows,

$S_b(w) := \frac{1}{2}\operatorname{tr}\Big(\sum_{i=1}^g n_i (\mu_i - \mu)(\mu_i - \mu)^T\Big), \quad S_t(w) := \frac{1}{2}\operatorname{tr}\Big(\sum_{k=1}^n (w_k x_k - \mu)(w_k x_k - \mu)^T\Big),$

where $\mu = \frac{1}{n}\sum_{i=1}^g n_i \mu_i = \frac{1}{n}\sum_{k=1}^n w_k x_k$. With straightforward algebra the following equality can be verified.

Proposition III.1.

$S_t(w) = S_w(w) + S_b(w)$, for any $w \in \mathbb{R}^n$.

Incorporating discriminative information between classes into the regression, we formulate the following unconstrained minimization problem,

$\min_{w \in \mathbb{R}^n} \frac{1}{2}\|x - Aw\|_2^2 + \alpha S_w(w) + \beta\rho(\|w\|_{C_{q,r}})$, (4)

where $\alpha$ and $\beta$ are nonnegative balancing factors, $\|w\|_{C_{q,r}}$ is defined as $\|(\|w^1\|_q, \|w^2\|_q, \cdots, \|w^g\|_q)^T\|_r$, $(q, r) \in [0, \infty] \times [0, \infty]$, and $\rho(\cdot)$ is a nonnegative valued function. Often $\rho(\cdot)$ is chosen to be the identity function or other simple, convex functions to facilitate optimization.

Remarks.

1. When $\alpha = \beta = 0$, model (4) is simply least squares regression. 2. When $\alpha = 0$, $(q, r) = (2, 2)$, and $\rho(x) = x^2$, (4) reduces to standard ridge regression. 3. When $\alpha = 0$, $(q, r) = (1, 1)$, and $\rho(x) = x$, (4) falls back to the classic lasso type of sparse representation. 4. When $\alpha \ne 0$, discriminative information is injected into the regression, hence the name discriminative ridge regression. 5. We consider mainly $\beta \ne 0$ for regularizing the minimization.

IV. Discriminative Ridge Regression in Kernel Space

When approximating x from the training observations x1, ⋯, xn, the objective function of (4) is empowered with discriminative ability explicitly. This new discriminative ridge regression model, nonetheless, takes no account of any nonlinearity in the input space. As shown in [68], the examples may reside (approximately) on a low-dimensional nonlinear manifold in an ambient space of Rp. To capture the nonlinear effect, we allow this model to account for the nonlinearity in the input space by exploiting the kernel technique.

Consider a potentially nonlinear mapping $\phi(\cdot): \mathbb{R}^p \to \mathbb{R}^D$, where $D \in \mathbb{N} \cup \{+\infty\}$ is the dimension of the image space after the mapping. To reduce computational complexity, the kernel trick [70] is applied to calculate the Euclidean inner product in the kernel space, $\langle \phi(x), \phi(y) \rangle = k(x, y)$, where $k(\cdot, \cdot): \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ is a kernel function induced by $\phi(\cdot)$. The kernel function satisfies the finitely positive semidefinite property: for any $x_1, \cdots, x_m \in \mathbb{R}^p$, the Gram matrix $G \in \mathbb{R}^{m \times m}$ with elements $G_{ij} = k(x_i, x_j)$ is symmetric and positive semidefinite (p.s.d.). Suppose we can find a nonlinear mapping $\phi(\cdot)$ such that, after mapping into the kernel space, the examples approximately satisfy the linearity assumption; then (4) will be applicable in such a kernel space.
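For instance, the widely used RBF kernel $k(x, y) = \exp(-\gamma\|x - y\|^2)$ induces such a symmetric p.s.d. Gram matrix. A small NumPy sketch of the Gram-matrix construction follows; the function name and the $p \times n$ column-per-example data layout are our assumptions:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix G_ij = exp(-gamma * ||x_i - x_j||^2) for a p x n data
    matrix X whose columns are examples; symmetric and p.s.d."""
    sq = np.sum(X**2, axis=0)                       # squared column norms
    D = sq[:, None] + sq[None, :] - 2 * X.T @ X     # pairwise squared distances
    return np.exp(-gamma * np.maximum(D, 0.0))      # clamp tiny negatives
```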

Specifically, we consider the extended linear model (1) in the kernel space, $\phi(x) = A_\phi w + e$, to minimize $S_w^\phi(w)$, $\rho(\|w\|_{C_{q,r}})$, and the variance of $e$ as (4) does. Here $A_\phi := [\phi(x_1), \phi(x_2), \cdots, \phi(x_n)]$, $S_w^\phi(w)$ is obtained by replacing each $x_k$ with $\phi(x_k)$ in (3), and $e \in \mathbb{R}^D$. The kernel matrix of training examples is denoted by

$K = [k(x_i, x_j)]_{i,j=1}^n$. (5)

Obviously the kernel matrix $K$ is p.s.d. by the property of $k(\cdot, \cdot)$. We may derive a simple matrix form of $S_w^\phi(w)$ with straightforward algebra,

$S_w^\phi(w) = \frac{1}{2}\Big(w^T H w - \sum_{i=1}^g \frac{1}{n_i}(w^i)^T K_i w^i\Big) = \frac{1}{2} w^T (H - B) w,$

where

$H := \operatorname{diag}(k(x_1, x_1), \cdots, k(x_n, x_n)), \quad B := \operatorname{diag}(B_1, B_2, \cdots, B_g),$ (6)

$K_i := [k(x_s^i, x_t^i)]_{s,t=1}^{n_i}$, and $B_i := K_i / n_i$, $i = 1, \cdots, g$.
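Given a kernel matrix whose columns are ordered by class, the matrices $H$ and $B$ of (6) can be assembled directly. A minimal sketch under our own naming:

```python
import numpy as np

def build_H_B(K, group_sizes):
    """Build H = diag(k(x_1,x_1), ..., k(x_n,x_n)) and the block-diagonal
    B = diag(K_1/n_1, ..., K_g/n_g) of (6) from a kernel matrix K whose
    rows/columns are ordered by class, with group_sizes = [n_1, ..., n_g]."""
    H = np.diag(np.diag(K))
    B = np.zeros_like(K)
    start = 0
    for ni in group_sizes:
        B[start:start + ni, start:start + ni] = K[start:start + ni, start:start + ni] / ni
        start += ni
    return H, B
```

By Proposition IV.1 below, the resulting $H - B$ is p.s.d. whenever $K$ is a valid kernel matrix.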

Proposition IV.1.

The matrix $(H - B)$ is p.s.d.

Proof:

Because $H - B$ is block diagonal, we need only to show that $H_i - B_i$ is p.s.d. with $H_i := \operatorname{diag}(k(x_1^i, x_1^i), \cdots, k(x_{n_i}^i, x_{n_i}^i))$. For any $z \in \mathbb{R}^{n_i}$, we have

$z^T (H_i - B_i) z = \sum_{l=1}^{n_i} z_l^2 k(x_l^i, x_l^i) - \frac{1}{n_i} \sum_{u=1}^{n_i} \sum_{v=1}^{n_i} z_u z_v k(x_u^i, x_v^i) = \frac{1}{2 n_i} \sum_{u=1}^{n_i} \sum_{v=1}^{n_i} \big(z_u^2 k(x_u^i, x_u^i) + z_v^2 k(x_v^i, x_v^i) - 2 z_u z_v k(x_u^i, x_v^i)\big) \ge 0.$

The last inequality holds because the kernel is finitely p.s.d.; that is, for any $z_u$ and $z_v$,

$(z_u, z_v) \begin{bmatrix} k(x_u, x_u) & k(x_u, x_v) \\ k(x_v, x_u) & k(x_v, x_v) \end{bmatrix} \begin{pmatrix} z_u \\ z_v \end{pmatrix} \ge 0.$

Similarly we obtain a simple matrix form of $S_b^\phi(w)$,

$S_b^\phi(w) = \frac{1}{2} w^T \Big(B - \frac{1}{n} K\Big) w.$ (7)

Proposition IV.2.

The matrix $\big(B - \frac{1}{n} K\big)$ is p.s.d.

Proof:

Similar to the proof of Proposition IV.1. ■

Having formulated (4) in the input space, we directly extend it to the discriminative ridge regression in the kernel space

$\min_{w \in \mathbb{R}^n} \frac{1}{2}\|\phi(x) - A_\phi w\|_2^2 + \alpha S_w^\phi(w) + \beta \rho(\|w\|_{C_{q,r}}).$

Plugging in the expression of $S_w^\phi(w)$ and omitting the constant term, we obtain the discriminative ridge regression model,

$\min_{w \in \mathbb{R}^n} \frac{1}{2} w^T (K + \alpha H - \alpha B) w - (K_x)^T w + \beta \rho(\|w\|_{C_{q,r}}),$ (8)

where

$(K_x)^T := [(K_x^1)^T, (K_x^2)^T, \cdots, (K_x^g)^T],$ (9)

and $K_x^i := [k(x, x_1^i), k(x, x_2^i), \cdots, k(x, x_{n_i}^i)]^T$, $i = 1, \cdots, g$. Apparently we have $K_x = [k(x, x_1), \cdots, k(x, x_n)]^T$.

With special types of $(q, r)$-regularization such as linear, quadratic, and conic functions, the optimization in (8) is a quadratic (cone) programming problem that admits a global optimum. Regarding $\rho(\cdot)$, we have the freedom to choose it so as to facilitate the optimization thanks to the following property.

Proposition IV.3.

With $\beta$ varying over the range $(0, \infty)$ while $\alpha$ is fixed, the set of minimum points of (8) remains the same for any monotonically increasing function $\rho(\cdot)$.

Proof:

It can be shown by scalarization for multicriterion optimization similarly to Proposition 3.2 of [22]. ■

In the following sections, we will explain how to optimize (8) and how to perform classification with w. For clarity and smoothness of the organization of this paper, we will first discuss how to perform classification with the optimal w in Section V and then provide the detailed optimization in Section VI.

V. Discriminative Ridge Regression-Based Classification

Having estimated from (8) the vector of optimal combining coefficients, denoted by w, we use it to determine the category of x. In this paper, we mainly consider the case in which the g groups are exhaustive of the whole instance space; in the non-exhaustive case, the corresponding minimax decision rule [22] can be used analogously.

In the exhaustive case, the decision function is built with the projection of $x$ onto the subspace spanned by each group. Specifically, the projection of $x$ onto the subspace of $C_i$ is $\psi_i := A w|_{C_i}$, where $w|_{C_i}$ represents restricting $w$ to $C_i$ in that $(w|_{C_i})_j = w_j \mathbf{1}(x_j \in C_i)$, with $\mathbf{1}(\cdot)$ being the indicator function. Similarly, denoting $\bigcup_{k=1, k \ne i}^{g} C_k$ by $\bar{C}_i$, the projection of $x$ onto the subspace of $\bar{C}_i$ is $\bar{\psi}_i := A w|_{\bar{C}_i}$. Now define

$\delta_i := \|x - \psi_i\|_2^2 + \|\bar{\psi}_i\|_2^2,$

which measures the dis-similarity between $x$ and the examples in class $C_i$. Then the decision rule chooses the class with the minimal dis-similarity. This rule has a clear geometrical interpretation and it works intuitively: When $x$ truly belongs to $C_k$, $\psi_k = \sum_{i=1}^{n_k} x_i^k (w|_{C_k})_i \approx x$, while $\bar{\psi}_k \approx 0$ due to the class separability properties imposed by the discriminative ridge regression, and thus $\delta_k$ is approximately 0; whereas for $j \ne k$, $\psi_j = \sum_{i=1}^{n_j} x_i^j (w|_{C_j})_i \approx 0$, while $\bar{\psi}_j \approx x$ as $\bar{C}_j \supseteq C_k$, and thus $\delta_j$ is approximately $2\|x\|_2^2$. Hence the decision rule picks $C_k$ for $x$. In the kernel space the corresponding $\delta_i^\phi$ is derived in the minimax sense as follows [22],

$\delta_i^\phi = (w|_{C_i})^T K\, w|_{C_i} + (w|_{\bar{C}_i})^T K\, w|_{\bar{C}_i} - 2 (w|_{C_i})^T K_x.$ (10)

And the corresponding decision rule is

$\hat{i} = \arg\min_{i \in \{1, \cdots, g\}} \delta_i^\phi.$ (11)

It should be noted that the decision rule (11) depends on the weighting coefficient vector $w$, which accounts for discriminative information between classes by minimizing the discriminative ridge regression model (8). Thus, with $w$, the dis-similarity $\delta_i$ or $\delta_i^\phi$ accounts for the discriminativeness of the classes and helps to classify the target example. Now we obtain the discriminative ridge regression-based classification outlined in Algorithm 1. Related parameters such as $\alpha$ and $\beta$, and a proper kernel, can be chosen using standard cross validation (CV).
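To make the decision rule concrete, the kernel-space dis-similarity (10) and rule (11) can be evaluated as follows. This is a sketch with our own function names, where `group_slices[i]` holds the training indices of class $C_i$:

```python
import numpy as np

def drm_predict(w, K, Kx, group_slices):
    """Decision rule (11): return the index of the class minimizing the
    kernel-space dis-similarity (10), given coefficients w, training
    kernel matrix K, and test kernel vector Kx."""
    n = len(w)
    deltas = []
    for idx in group_slices:
        wi = np.zeros(n)
        wi[idx] = w[idx]          # w restricted to C_i
        wbar = w - wi             # w restricted to the complement of C_i
        delta = wi @ K @ wi + wbar @ K @ wbar - 2 * wi @ Kx
        deltas.append(delta)
    return int(np.argmin(deltas))
```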

[Algorithm 1: Discriminative ridge regression-based classification.]

How to solve the optimization problem (8) efficiently in Step 3 is critical to our algorithm. We mainly intend to use a $\rho(\|w\|_{C_{q,r}})$ that is convex in $w$ to efficiently attain the global optimum of (8). Any standard optimization tool may be employed, including, for example, the CVX package [35]. In the following, we shall focus on a special case of $\rho(\|w\|_{C_{q,r}})$ for the sake of deriving efficient and scalable optimization algorithms for high-dimensional data and large-scale data; nonetheless, it is noted that the discriminative ridge regression-based classification is constructed for general $(q, r)$ and $\rho(\cdot)$.

VI. Discriminative ridge machine

Discriminative ridge regression-based classification with a particular regularization of (q, r) = (2, 2) will be considered, leading to the DRM.

A. Closed-Form Solution to DRM

Let $(q, r) = (2, 2)$ and $\rho(x) = \frac{1}{2} x^2$; then we have the regularization term $\rho(\|w\|_{C_{q,r}}) = \frac{1}{2}\|w\|_{C_{2,2}}^2 = \frac{1}{2}\|w\|_2^2$. The discriminative ridge regression problem (8) reduces to

$(\text{DRM}) \quad \min_{w \in \mathbb{R}^n} \frac{1}{2} w^T (K + \alpha(H - B) + \beta I) w - w^T K_x,$ (12)

which is an unconstrained convex quadratic optimization problem leading to a closed-form solution,

$w = (K + \alpha H - \alpha B + \beta I)^{-1} K_x,$ (13)

with $\alpha \ge 0$ and $\beta > 0$. Hereafter we always require $\beta > 0$ to ensure the existence of the inverse matrix, because $Q := K + \alpha H - \alpha B$ is p.s.d. and $Q + \beta I$ is strictly positive definite. The minimization problem (12) can be regarded as a generalization of kernel ridge regression (KRR). Indeed, when $\alpha = 0$ the KRR is obtained. More interestingly, when $\alpha \ne 0$ the discriminative information is incorporated to go beyond the KRR.

For clarity, the DRM with the closed-form formula is outlined as Algorithm 2. It should be noted that only the term $K_x$ involves the information from the testing example. Thus, for a specific problem and given training data, the computation of $(K + \alpha H - \alpha B + \beta I)^{-1}$ can simply be regarded as a pre-processing step. Then for any new example $x$, the prediction reduces to a simple matrix-vector product.
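A minimal sketch of the closed-form solution (13) follows; names are ours, and in practice the factorization of the training-only matrix would be cached and reused across test examples, as noted above:

```python
import numpy as np

def drm_closed_form(K, H, B, Kx, alpha, beta):
    """Closed-form DRM solution (13):
    w = (K + alpha*(H - B) + beta*I)^{-1} K_x, with beta > 0
    guaranteeing invertibility since Q = K + alpha*(H - B) is p.s.d."""
    n = K.shape[0]
    M = K + alpha * (H - B) + beta * np.eye(n)
    return np.linalg.solve(M, Kx)
```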

Remark.

For large-sample data with $n \gg p$, the complexity of (13) is $O(n^3)$, which renders Algorithm 2 impractical with big $n$. To have a more efficient and scalable, though possibly inexact, solution to (12), in the sequel we will construct three iterative algorithms suitable for large-scale data.

B. DRM Algorithm Based on Gradient Descent (DRM-GD)

For large-scale data with big $n$ and $n \gg p$, the closed-form solution (13) may be computationally impractical in real applications. Three new algorithms will be established to alleviate this computational bottleneck. In light of the linear SVM on large-scale data classification, e.g., [11], we will consider the use of the linear kernel in the DRM, which turns out to have linear cost. The first algorithm relies on the classic gradient descent optimization method, denoted by DRM-GD; the second hinges on an idea of proximal-point approximation (PPA) similar to [87] to eliminate the matrix inversion and induce efficient and scalable computation with linear kernels; the third uses a method of accelerated proximal gradient line search for theoretically proven accelerated convergence. This section presents the algorithm of DRM-GD, and the other two algorithms will be derived in subsequent sections.

Denote by f(w) the objective function of the DRM,

$f(w) = \frac{1}{2} w^T Q w - w^T K_x + \frac{\beta}{2}\|w\|_2^2.$

Its gradient is

$\nabla_w f(w) = (Q + \beta I) w - K_x.$ (14)

Suppose we have obtained $w^{(t)}$ at iteration $t$. At the next iteration, the descent direction is $-\nabla_w f(w^{(t)})$, and the optimal step size is given by exact line search,

$d^{(t)} = \arg\min_{d \ge 0} f\big(w^{(t)} - d \nabla_w f(w^{(t)})\big) = \frac{(\nabla_w f(w^{(t)}))^T \nabla_w f(w^{(t)})}{(\nabla_w f(w^{(t)}))^T (Q + \beta I) \nabla_w f(w^{(t)})}.$ (15)

Thus, the updating rule using gradient descent is

$w^{(t+1)} = w^{(t)} - d^{(t)} \nabla_w f(w^{(t)}).$ (16)
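Steps (14)-(16) amount to a short exact-line-search loop. Below is a NumPy sketch under our own naming, with a small guard against a vanishing gradient added for safety:

```python
import numpy as np

def drm_gd(Q, Kx, beta, iters=100):
    """DRM-GD sketch: exact-line-search gradient descent (14)-(16) on
    f(w) = (1/2) w^T Q w - w^T Kx + (beta/2) ||w||^2."""
    n = Q.shape[0]
    M = Q + beta * np.eye(n)
    w = np.zeros(n)
    for _ in range(iters):
        g = M @ w - Kx                 # gradient (14)
        gg = g @ g
        if gg < 1e-30:                 # already at the optimum
            break
        d = gg / (g @ (M @ g))         # exact step size (15)
        w = w - d * g                  # update (16)
    return w
```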


The resulting DRM-GD is outlined in Algorithm 3. Its main computational burden is on $Q$, which costs, in general, $n^2 p$ floating-point operations (flops) [12] for a general kernel. This paper mainly counts multiplications as flops. After getting $Q$, $\nabla_w f(w^{(t)})$ is obtained by matrix-vector products with $n^2 + n$ flops, and so is $d^{(t)}$ with $n^2 + 3n$ flops. Thus, the overall count of flops is about $(p + 2)n^2$ with a general kernel.

With the linear kernel the computational cost can be further reduced by exploiting the particular structure $K = A^T A$. Given any $v \in \mathbb{R}^n$, $B_i v^i$ can be computed as $\frac{1}{n_i}(x^i)^T (x^i v^i)$, which costs $2 p n_i$ flops for any $1 \le i \le g$, and thus computing $Bv = [(B_1 v^1)^T, \cdots, (B_g v^g)^T]^T$ requires $2pn$ flops. Similarly, $Kv = A^T(Av)$ takes $pn$ flops by matrix-vector product, since $Av = \sum_i x^i v^i$ can be readily obtained from the $Bv$ computation. As $Hv = [v_1 (x_1^T x_1), \cdots, v_n (x_n^T x_n)]^T$, the total count of flops is $(p + 1)n$ for getting $Hv$. Therefore, computing $Qv$ needs $(4p + 1)n$ flops. Each of (14) and (15) requires $Qv$ and $\beta v$ types of computation; hence each iteration of updating $w^{(t)}$ to $w^{(t+1)}$ costs $(8p + 7)n$ overall flops, including $3n$ additional ones for computing $d^{(t)}$ and $d^{(t)} \nabla_w f(w^{(t)})$.
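The linear-kernel bookkeeping above can be collected into a single matrix-free routine that evaluates $Qv$ in $O(pn)$ flops without ever forming $K$, $H$, or $B$. A sketch with our own names, where `group_slices[i]` holds the indices of class $C_i$:

```python
import numpy as np

def Qv_linear(A, v, alpha, group_slices):
    """Compute Q v = (K + alpha*(H - B)) v with the linear kernel
    K = A^T A in O(pn) flops, never forming the n x n matrices."""
    Kv = A.T @ (A @ v)                       # K v = A^T (A v)
    Hv = np.einsum('ij,ij->j', A, A) * v     # H v, H = diag(x_k^T x_k)
    Bv = np.zeros_like(v)
    for idx in group_slices:
        Ai = A[:, idx]
        Bv[idx] = Ai.T @ (Ai @ v[idx]) / len(idx)  # B_i v^i = K_i v^i / n_i
    return Kv + alpha * (Hv - Bv)
```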


As a summary, the property of the DRM-GD, including the cost, is given in Theorem VI.1.

Theorem VI.1 (Property of DRM-GD).

With any $w^{(0)} \in \mathbb{R}^n$, the sequence $\{w^{(t)}\}_{t=0}^\infty$ generated by Algorithm 3 with $\beta > 0$ converges in function values to the unique minimum of $f(w)$ at a linear rate. Each iteration costs $(p + 2)n^2$ flops with a general kernel and, in particular, $(8p + 7)n$ flops with the linear kernel.

Proof:

For the gradient descent method, the sequence of objective function values $\{f(w^{(t)})\}$ converges to the optimal value with a worst-case rate of $O(1/t)$ because, by Proposition IV.1, $f(w)$ is convex with $\beta \ge 0$; in particular, the convergence rate is linear because $f(w)$ is strongly convex with $\beta > 0$ [49]. The convergence and its rate are standard [12]. Also, with $\beta > 0$, $f(w)$ is strongly convex, and thus the minimum is unique. □

C. DRM Algorithm Based on Proximal Point Approximation

PPA is a general method for finding a zero of a maximal monotone operator and solving non-convex optimization problems [67]. Many algorithms have been shown to be its special cases, including the method of multipliers, the alternating direction method of multipliers, and so on. Our DRM optimization is a convex problem; nonetheless, we employ the idea of PPA in this paper for potential benefits of efficiency and scalability. At iteration $t$, having obtained the minimizer $w^{(t)}$, we construct an augmented function around it,

$F(w; w^{(t)}) := f(w) + \frac{1}{2}(w - w^{(t)})^T (cI - Q)(w - w^{(t)}),$

where $c$ is a constant satisfying $c \ge \sigma_{\max}(Q)$. We minimize $F(w; w^{(t)})$ to update the minimizer,

w(t + 1) := argmin_{w ∈ R^n} F(w; w(t)).

As F(w; w(t)) is a strongly convex, quadratic function, its minimizer is calculated directly by setting its first-order derivative equal to zero,

∇F(w; w(t)) = Qw − Kx + βw − Q(w − w(t)) + c(w − w(t)) = 0,

which results in

w(t + 1) = (Kx − Qw(t) + cw(t))/(β + c). (17)

The computational cost of each iteration is reduced compared to the closed-form solution (13), which is formally stated in the following.

Proposition VI.1.

With a general kernel, the updating rule (17) costs about (p + 2)n2 flops.

Proof:

The cost of computing K is pn^2, since it has n^2 elements and each costs p flops. Subsequently, α(H − B) needs about n^2 flops. Hence computing Q needs about (p + 1)n^2 flops. Finally, Qw(t) needs n^2 flops. □

As a summary, the PPA-based DRM, called DRM-PPA, is outlined in Algorithm 4. Starting from any initial point w(0) ∈ R^n, repeatedly applying (17) in Algorithm 4 generates a sequence of points {w(t)}_{t=0}^∞. The properties of this sequence are analyzed subsequently.
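To make iteration (17) concrete, here is a minimal NumPy sketch (our own illustration; the function name and the stopping rule based on successive iterates are assumptions). Given Q, Kx, β, and a constant c ≥ σmax(Q), it repeats update (17):

```python
import numpy as np

def drm_ppa(Q, Kx, beta, c, w0, tol=1e-5, max_iter=150):
    """Repeat the PPA update (17): w <- (Kx - Q w + c w) / (beta + c),
    stopping when successive iterates are within tol."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(max_iter):
        w_new = (Kx - Q @ w + c * w) / (beta + c)
        if np.linalg.norm(w_new - w) <= tol:
            return w_new
        w = w_new
    return w
```

By Theorem VI.2 the iterates converge to the closed-form solution (Q + βI)^{−1}Kx of (13), so the sketch can be checked against a direct solve.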


Theorem VI.2 (Convergence and optimality of DRM-PPA).

With any w(0) ∈ R^n, the sequence of points {w(t)}_{t=0}^∞ generated by Algorithm 4 gives a monotonically non-increasing value sequence {f(w(t))}_{t=0}^∞ which converges to the globally minimal value of f(w). Furthermore, the sequence {w(t)}_{t=0}^∞ itself converges to the solution w* in (13), the unique minimizer of f(w).

Before the proof, we point out two properties of F(w; w(t)) which are immediate by the definition,

F(w; w(t)) ≥ f(w), ∀w;  F(w(t); w(t)) = f(w(t)). (18)

Proof of Theorem VI.2:

By re-writing f(w) as

(1/2)∥(Q + βI)^{1/2} w − (Q + βI)^{−1/2} Kx∥2^2 − (1/2)(Kx)^T (Q + βI)^{−1} Kx,

we know f(w) is lower bounded by −(1/2)(Kx)^T(Q + βI)^{−1}Kx. From the following chain of inequalities it is clear that {f(w(t))}_{t=0}^∞ is a monotonically non-increasing sequence,

f(w(t)) = F(w(t); w(t)) ≥ F(w(t + 1); w(t)) ≥ f(w(t + 1)).

The first inequality in the chain holds by (17), since w(t + 1) is the global minimizer of F(w; w(t)), and the second holds by (18). Being non-increasing and lower bounded, {f(w(t))}_{t=0}^∞ converges.

Next we further prove that lim_{t→∞} w(t) exists and equals w*, the unique minimizer of f(w) given by (13). By using the following chain of equalities

w(t + 1) − w* = (Kx − Qw(t) + cw(t))/(β + c) − w* = ((Q + βI)w* − Qw(t) + cw(t))/(β + c) − w* = (cI − Q)(w(t) − w*)/(β + c),

where the first equality holds by (17) and the second by (13), we have

∥w(t + 1) − w*∥2 ≤ ∥cI − Q∥2 ∥w(t) − w*∥2/(β + c) = ((c − σmin(Q))/(c + β)) ∥w(t) − w*∥2 ≤ ∥w(0) − w*∥2 ((c − σmin(Q))/(c + β))^{t+1}.

Here, the first inequality holds by the definition of the spectral norm, and the subsequent equality holds because ∥cI − Q∥2 = σmax(cI − Q) = c − σmin(Q). Because β > 0 and Q is p.s.d., we have σmin(Q) ≥ 0 and (c − σmin(Q))/(c + β) < 1. Hence, as t → ∞, ∥w(t + 1) − w*∥2 → 0 for any w(0); that is, w(t + 1) → w*. The uniqueness follows from the strict convexity of f(w). □

As a consequence of the above proof, the convergence rate is also readily obtained.

Theorem VI.3 (Convergence rate of DRM-PPA).

The convergence rate of {w(t)}_{t=0}^∞ generated by Algorithm 4 is at least linear. Given a convergence threshold ϵ > 0, the number of iterations is upper bounded by log(∥w(0) − w*∥2/ϵ) / log((c + β)/(c − σmin(Q))), which is approximately ((c − σmin(Q))/(σmin(Q) + β)) log(∥w(0) − w*∥2/ϵ) when (σmin(Q) + β)/(c − σmin(Q)) is small.

Proof:

Because

lim sup_{t→∞} ∥w(t + 1) − w*∥2 / ∥w(t) − w*∥2 ≤ (c − σmin(Q))/(c + β) < 1,

the convergence rate is at least linear. For a given ϵ, the maximal number of iterations t satisfies ∥w(0) − w*∥2 ((c − σmin(Q))/(c + β))^t ≤ ϵ by the chain of inequalities in the proof of Theorem VI.2, hence the upper bound in the theorem. When (σmin(Q) + β)/(c − σmin(Q)) is small, log((c + β)/(c − σmin(Q))) is approximately (σmin(Q) + β)/(c − σmin(Q)) by using log(1 + x) ≈ x for small x. □

Remark.

With large n yet small p, σmax(Q) can be obtained efficiently, as shown in the next subsection. When both n and p are large, we may use a loose upper bound

σmax(Q) = ∥Q∥2 ≤ ∥K∥2 + α∥H − B∥2 ≤ min{Tr(K), ∥K∥∞} + α∥H∥2 = min{Σ_{i=1}^n k(x_i, x_i), max_i{Σ_{j=1}^n |k(x_i, x_j)|}} + α max_i{k(x_i, x_i)}. (19)

Here, the first inequality is by the triangle inequality, and the second by the Gershgorin Theorem. Usually this upper bound is much larger than σmax(Q), and the convergence of the DRM-PPA would then be slower. Alternatively, the implicitly restarted Lanczos method (IRLM) by Sorensen [73] can be used to find the largest eigenvalue. The IRLM relies only on matrix-vector products and can compute an approximation of σmax(Q) as c to any specified accuracy tolerance at little cost. As yet another alternative, the backtracking method may be used to iteratively find a proper value of c, as in Algorithm 6 given in Section VI-E.
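In the same matrix-vector-product-only spirit as the IRLM, a plain power iteration already gives a usable estimate of σmax(Q). The sketch below is our own illustration (the function name, iteration count, and Rayleigh-quotient stopping are assumptions), not the IRLM itself:

```python
import numpy as np

def power_iteration_sigma_max(matvec, n, iters=100, seed=0):
    """Approximate the largest eigenvalue of a symmetric p.s.d. operator
    given only its matrix-vector product; a simple stand-in for the IRLM."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        u = matvec(v)
        lam = v @ u                 # Rayleigh quotient estimate
        nrm = np.linalg.norm(u)
        if nrm == 0.0:
            return 0.0
        v = u / nrm
    return lam
```

Because it only calls `matvec`, the estimate costs O(pn) per iteration with the factored linear-kernel products, rather than O(n^2).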


D. Scalable Linear-DRM-PPA for Large-Scale Data

To further reduce the complexity, we now consider using the linear kernel in the DRM-PPA. With the special structure K = A^T A, we exploit matrix-vector multiplication to improve efficiency. Given a vector v ∈ R^n, since B is block diagonal and Bv = [(B_1 v_1)^T, ⋯, (B_g v_g)^T]^T, we need only compute each block separately to obtain Bv,

u_i ← x^i v_i,  B_i v_i ← (1/n_i)(x^i)^T u_i,  1 ≤ i ≤ g. (20)

Consequently, the computation of Bv costs (2p + 1)n flops. The computation of Kv is given by

u ← Av = Σ_{i=1}^g x^i v_i,  Kv ← A^T u, (21)

which only needs pn flops. Finally, Hv is computed as

Hv = [(x_1^T x_1)v_1, ⋯, (x_n^T x_n)v_n]^T, (22)

which takes (p + 1)n flops. Overall, it takes (4p + 2)n flops to get Qv, and thus each iteration of (17) costs (4p + 4)n flops. For clarity, the linear-DRM-PPA is outlined in Algorithm 5.

Remark.

In Step 2, σmax(Q) needs to be estimated. With big n and small p, σmax(K) may be computed as σmax(A^T A) = σmax(AA^T) = σmax(Σ_{i=1}^n x_i x_i^T), at a cost of O(p^2 n); meanwhile σmax(H) = max_i{x_i^T x_i}, taking pn flops. Thus, σmax(Q) ≤ σmax(K) + ασmax(H), taking O((p^2 + p)n) flops.
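The estimate in this remark is easy to sketch; the following NumPy illustration (our own; the function name is an assumption) computes the bound σmax(K) + ασmax(H) via the small p × p eigenproblem for AA^T:

```python
import numpy as np

def sigma_max_bound_linear(A, alpha):
    """Upper bound sigma_max(Q) <= sigma_max(K) + alpha * sigma_max(H)
    for the linear kernel at O(p^2 n) cost, using the identity
    sigma_max(A^T A) = sigma_max(A A^T) with A of size p x n."""
    s_K = np.linalg.eigvalsh(A @ A.T)[-1]       # largest eig of p x p matrix
    s_H = np.max(np.einsum('ij,ij->j', A, A))   # max_i x_i^T x_i
    return s_K + alpha * s_H
```

The returned value is a valid choice of c for the PPA update, since it dominates σmax(Q).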

By using the matrix-vector product, Algorithm 5 has a computational cost linear in n when p is small, which is summarized as follows.

Proposition VI.2.

The cost of Algorithm 5 is (4p + 4)mn flops, taking (4p + 4)n flops per iteration for the m iterations needed to converge, whose upper bound is given by Theorem VI.3.

E. DRM Algorithm Based on Accelerated Proximal Gradient

Accelerated proximal gradient (APG) techniques, originally due to Nesterov [58] for smooth convex functions, achieve a convergence rate of O(t^{−2}) in function values [7]. For its potential efficiency on large-scale data, we exploit the APG idea here to build an algorithm for the DRM, denoted DRM-APG. It does not compute the exact step size in the gradient descent direction, but rather uses a proximal approximation of the gradient. Thus the DRM-APG may be regarded as a combination of the PPA and gradient descent methods for potentially faster convergence.

Evidently, f(w) has a Lipschitz continuous gradient because ∥∇f(w) − ∇f(v)∥2 ≤ ∥Q + βI∥2 ∥w − v∥2 ≤ b∥w − v∥2 for any w, v ∈ R^n, where b is an upper bound for ∥Q + βI∥2, for example, b = c + β with c defined in Section VI-C. First, let us consider the case in which the Lipschitz constant b is known. Around a given v, a proximal approximation to f(w) is built,

G(w|v, b) = f(v) + (w − v)^T ∇f(v) + (b/2)∥w − v∥2^2. (23)

Note that G(w|v, b) is an upper bound of f(w), and both functions have identical values and first-order derivatives at w = v. We minimize G(w|v, b) to update w based on v,

ŵ_b(v) := argmin_w G(w|v, b) = v − (1/b)∇f(v).

Next, we consider the case in which b is unknown. In practice, especially for large-scale data, the Lipschitz constant might not be readily available. In this case, a backtracking strategy is often used: starting from its current value, b is iteratively increased until G(w|v, b) becomes an upper bound of f(w) at the point ŵ_b(v); that is, f(ŵ_b(v)) ≤ G(ŵ_b(v)|v, b). By straightforward algebra this is equivalent to

∇f(v)^T Q ∇f(v) ≤ (b − β)∥∇f(v)∥2^2. (24)

Thus, more specifically, the strategy is to repeatedly multiply b by a constant factor η > 1 until (24) is met, whenever the current value of b fails to satisfy (24).

In iteration t + 1, after obtaining a proper b(t + 1) with the backtracking, the update rule of w(t + 1) is

w(t + 1) = ŵ_{b(t+1)}(v(t)) = v(t) − (1/b(t + 1))∇f(v(t)) = v(t) − (1/b(t + 1))((Q + βI)v(t) − Kx). (25)

The stepsize d(t + 1) and the auxiliary v(t + 1) are updated by,

d(t + 1) = (1 + √(1 + 4(d(t))^2))/2,  v(t + 1) = w(t + 1) + ((d(t) − 1)/d(t + 1))(w(t + 1) − w(t)). (26)

In summary, the procedure of the DRM-APG is outlined in Algorithm 6. Matrix-vector multiplication may be employed in Step 4, which costs, assuming ∇f(v(t)) has been obtained, i_t(n + 1)^2 flops with a general kernel and i_t(2(p + 1)n + p(g + 1)) flops with the linear kernel. Here, the integer i_t, the number of backtracking steps, is always finite because of the Lipschitz condition on the gradient of f(w). Steps 5 and 6 cost n^2 flops with a general kernel, and (4p + 5)n flops in the particular case of the linear kernel. The convergence property and cost are stated in Theorem VI.4.
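The steps (24)–(26) with backtracking can be sketched as follows (an illustrative NumPy sketch of a FISTA-style loop in the DRM-APG spirit; the function name, defaults `b0` and `eta`, and the stopping rule are our assumptions, not the paper's Algorithm 6 verbatim):

```python
import numpy as np

def drm_apg(Q, Kx, beta, b0=1.0, eta=2.0, tol=1e-5, max_iter=1000):
    """APG iteration for f(w) = (1/2) w^T (Q + beta I) w - w^T Kx,
    with backtracking on the Lipschitz estimate b per condition (24),
    gradient step (25), and momentum update (26)."""
    n = Kx.shape[0]
    Qb = Q + beta * np.eye(n)            # gradient of f is Qb @ w - Kx
    w = np.zeros(n)
    v = w.copy()
    d = 1.0
    b = b0
    for _ in range(max_iter):
        g = Qb @ v - Kx
        # backtracking: grow b by eta until condition (24) holds
        while g @ (Q @ g) > (b - beta) * (g @ g):
            b *= eta
        w_new = v - g / b                               # update (25)
        d_new = (1.0 + np.sqrt(1.0 + 4.0 * d * d)) / 2.0
        v = w_new + ((d - 1.0) / d_new) * (w_new - w)   # update (26)
        if np.linalg.norm(w_new - w) <= tol:
            return w_new
        w, d = w_new, d_new
    return w
```

As with the PPA sketch, the iterates can be checked against the closed-form solution (Q + βI)^{−1}Kx of (13).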

Theorem VI.4 (Property of DRM-APG).

With any w(0) ∈ R^n, {w(t)}_{t=0}^∞ generated by Algorithm 6 converges in function value to the global minimum of f(w) with a worst-case linear rate. Each iteration costs O(pn^2) flops with a general kernel and O(pn) flops with the linear kernel.

Proof:

With β ≥ 0, f(w) is convex and the convergence to an optimal value has a worst-case rate of O(1/t^2). In particular, with β > 0, f(w) is strongly convex, and thus the global minimizer is unique and the worst-case rate is linear. The convergence and its rate are standard; see, for example, [58]. The cost is analyzed similarly to Proposition VI.2 or Theorem VI.1. □

The desirable properties of linear efficiency and scalability of Algorithms 3, 5, and 6 generalize to other kernels, provided certain conditions such as a low-rank approximation are met. One way of generalizing is stated in Proposition VI.3.

Proposition VI.3 (Generalization of Linear DRM Algorithms to Any Kernel).

Algorithms 3, 5, and 6 are applicable to any kernel function k(·, ·), provided its corresponding kernel matrix K can be approximately factorized as K ≈ G^T G, with G ∈ R^{r×n} and r ≪ n. The computational cost of the generalized algorithm is O(rmn), with m iterations.

Proof:

Replacing A by G and, correspondingly, p by r in (20) to (22), the conclusion follows for Algorithm 5; Algorithms 3 and 6 follow similarly. □
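Proposition VI.3 asks only that K admit a factorization K ≈ G^T G. One standard way to obtain such a G is the Nyström method; the sketch below is our own illustration (the function name, landmark-sampling scheme, and eigenvalue cutoff are assumptions, and the method itself is not part of the paper):

```python
import numpy as np

def nystrom_factor(X, kernel, r, seed=0):
    """Nystrom-style factorization K ~= G^T G with G in R^{r x n}:
    sample r landmark columns, form K ~= C W^+ C^T, and return
    G = W^{+1/2} C^T using the pseudo-inverse square root of W.
    X: p x n data; kernel(X1, X2) returns the Gram block X1-vs-X2."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    idx = rng.choice(n, size=r, replace=False)   # landmark indices
    C = kernel(X, X[:, idx])                     # n x r cross-Gram block
    W = C[idx, :]                                # r x r landmark Gram block
    lam, U = np.linalg.eigh(W)
    keep = lam > 1e-10                           # drop numerically zero modes
    W_inv_sqrt = (U[:, keep] / np.sqrt(lam[keep])) @ U[:, keep].T
    return W_inv_sqrt @ C.T                      # G, of size r x n
```

For the linear kernel with r ≥ p generic landmarks, the factorization is exact, which provides a simple sanity check.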


VII. Experiments

In this section, we present how our method performs on standard test data sets from real-world applications. First, we show our results on high- and low-dimensional data and compare them with several state-of-the-art methods. Then, we compare the linear DRM with the linear SVM on large-scale data and evaluate our three algorithms, DRM-PPA, DRM-GD, and DRM-APG, in terms of accuracy, training time, and convergence rate. Last, we benchmark the DRM on imbalanced data classification.

Unless otherwise specified, all experiments are implemented in Matlab on a 4-core Intel Core i7-4510U 2.00GHz laptop with 16GB memory. We terminate the DRM algorithms when ∥w(t + 1) − w(t)∥2 ≤ 10^{−5} or when they reach the maximal number of iterations of 150. As shown experimentally, all three algorithms usually converge within a few tens of iterations. Code for our method can be found at https://www.researchgate.net/profile/Chong_Peng10.

A. Application to Real World Data

We conduct experiments on 19 data sets in three categories: high-dimensional data for gene expression classification, image data for face recognition, and low-dimensional data including hand-written digits and others. Among them, the first two categories are high-dimensional while the last is low-dimensional. Their characteristics and the partition into training and testing sets are listed in Table I. For each data set, we perform experiments on 5 random trials and report the average accuracy rates and standard deviations for the different algorithms.

TABLE I:

Summary of Small to Medium Sized Data Sets Used in Experiments

Type Data Set Dim. Size Class Training Testing Notes
High Dim. 1. Nakayama 22,283 112 10 84 28 multi-class, small size
2. Leukemia 7,129 38 2 30 8 binary-class, small size
3. Lung cancer 12,533 181 2 137 44 binary-class, small size
4. Prostate 12,600 136 2 103 33 binary-class, small size
5. GCM 16,036 189 14 150 39 multi-class, small size
6. Sun data 54,613 180 4 137 43 multi-class, small size
7. Ramaswamy 16,063 198 14 154 44 multi-class, small size
8. NCI 9,712 60 9 47 13 multi-class, small size
Image Data 9. AR 1,024 1,300 50 1,050 250 multi-class, moderate size
10. EYaleB 1,024 2,414 38 1,814 600 multi-class, moderate size
11. Jaffe 676 213 10 172 41 multi-class, small size
12. Yale 1,024 165 15 135 30 multi-class, small size
13. PIX 10,000 100 10 80 20 multi-class, small size
Low Dim. 14. Semeion 256 1593 10 1,279 314 multi-class, moderate size
15. Pen digits 16 10,992 10 1,575 9,417 multi-class, moderate size
16. Optical pen 64 1,797 10 1,352 445 multi-class, moderate size
17. Iris 4 150 3 114 36 multi-class, small size
18. Wine 13 178 3 135 43 multi-class, small size
19. Tic-tac-toe 9 958 2 719 239 binary-class, moderate size

1). Gene Expression:

For high-dimensional data, we have eight gene expression data sets: Global Cancer Map (GCM), Lung cancer, Sun data [75], Prostate, Ramaswamy, Leukemia, NCI, and Nakayama [57].

To illustrate the effectiveness of the proposed method, we compare it with current state-of-the-art classifiers for such data, including the SVM, KNN, RF, C4.5, and Naive Bayes (NB). For each of these classifiers, leave-one-out cross validation (LOOCV) is conducted for parameter selection. For the SVM, we use linear, polynomial, and radial basis function (rbf) kernels. The order of the polynomial kernel is selected from {2, 3, 4, 5, 8, 10}, and the variance parameter of the rbf kernel is selected from {0.001, 0.01, 0.1, 1, 10, 100, 1000}. The balancing parameter of the SVM is selected from {0.001, 0.01, 0.1, 1, 10, 100, 1000}. For the KNN, we adopt Euclidean and cosine distances, and the number of neighbors is selected within {2, 3, 4, 5, 6, 7, 8}. For the RF, the number of trees is selected from {10, 20, 30, 40}. For the C4.5, we choose the percentage of incorrectly assigned examples at a node from {5%, 6%, 7%, 8%, 9%, 10%}. For the DRM, the kernel options remain the same as for the SVM, and the values of α and β range in {0.001, 0.01, 0.1, 1, 10, 100, 1000}. All these parameters are determined by the LOOCV. We conduct experiments on both scaled and unscaled data for all classifiers and report the better performance; here, for each data set, the scaled version is obtained by dividing each feature value by the infinity norm of the corresponding feature vector. We report the average performance over the 5 splits in Table II. For clearer illustration of how these methods perform, we separately report the best performances of the rbf and polynomial kernels for the SVM and DRM, as well as those of the cosine and Euclidean distances for the KNN. It should be noted that in practice the kernel and distance types can be determined by cross-validation.

TABLE II:

Classification Accuracy of SVM, KNN, RF, C4.5, NB, and DRM on High- and Low-Dimensional Data

Type Data SVM(R) SVM(P) KNN(E) KNN(C) RF C4.5 NB DRM(R) DRM(P)
High Dim. 1 .6286±.0621 .7619±.1010 .7429±.0722 .7524±.0782 .6762±.0916 .3619±.1043 —————– .7619±.0583 .7905±.0543
2 .7500±.0000 1.000±.0000 .9500±.1118 .9750±.0559 .9250±.1118 .8500±.2054 1.000±.0000 1.000±.0000 1.000±.0000
3 .9864±.0203 .9909±.0124 .9955±.0102 .9955±.0102 .9955±.0102 .9273±.0466 .9864±.0124 1.000±.0000 1.000±.0000
4 .6788±.0507 .9455±.0498 .8061±.0819 .8121±.0498 .8727±.0657 .6545±.0628 .6061±.0851 .8848±.0449 .9333±.0542
5 .4974±.0344 .6462±.0380 .6154±.0480 .6462±.0380 .6051±.0466 .2410±.0562 .7077±.0229 .8308±.0389 .8410±.0493
6 .5535±.0477 .6884±.0746 .7302±.0535 .7070±.0628 .7163±.0104 .4884±.0465 .6977±.0658 .7442±.0637 .7488±.0666
7 .6182±.0743 .6906±.0380 .7000±.0296 .7000±.0296 .6909±.0903 .5273±.0943 .6227±.0674 .8273±.0729 .8000±.0588
8 .0769±.0000 .5385±.0769 .4308±.1397 .4154±.1287 .4462±.1668 —————– —————– .6308±.1141 .5231±.1264
Image Data 9 .6320±.0210 .9464±.0112 .6552±.0411 .6816±.0346 .7944±.0384 .1040±.0216 —————– .9856±.0112 .9776±.0112
10 .8163±.0146 .9233±.0210 .7490±.0103 .8507±.0126 .9700±.0047 .1370±.0117 —————– .9830±.0040 .9920±.0059
11 1.000±.0000 .4146±.0000 .9951±.0109 .9951±.0109 1.000±.0000 .8098±.0966 —————– .9951±.0109 1.000±.0000
12 .7133±.0960 .7067±.1188 .6467±.0691 .6533±.0989 .7733±.0435 .4067±.0925 .6267±.0596 .8067±.0596 .8133±.0558
13 .9400±.0584 .9300±.0477 .9600±.0652 .9600±.0652 .9800±.0274 .8537±.0827 .8900±.0962 .9600±.0652 .9600±.0652
Low Dim. 14 .9433±.0042 .9204±.0050 .9255±.0138 .9185±.0036 .9236±.0172 .6287±.0486 —————– .9624±.0116 .9611±.0079
15 .9867±.0020 .9865±.0017 .9835±.0019 .9839±.0019 .9686±.0031 .6963±.0278 —————– .9911±.0017 .9910±.0015
16 .9924±.0034 .9856±.0020 .9888±.0036 .9879±.0059 .9730±.0086 .1011±.0000 —————– .9915±.0040 .9924±.0030
17 .9778±.0232 .9778±.0232 .9667±.0304 .9833±.0152 .9556±.0317 .9611±.0317 .9596±.0421 .9667±.0562 .9833±.0152
18 .9674±.0353 .9814±.0195 .8791±.0602 .9349±.0382 .9860±.0208 .9395±.0353 .9767±.0233 .9116±.0504 .9581±.0382
19 .8494±.0118 .9724±.0128 .8418±.0174 .8427±.0158 .9264±.0134 .7665±.0321 .6996±.0130 .9950±.0035 .9941±.0092

For each data set, the best accuracy rate is highlighted in red. The performance is represented as average accuracy ± standard deviation. The data sets are represented as their indices, which follow Table I.

From Table II, it is seen that the proposed method achieves the highest accuracy on seven of the eight high-dimensional data sets, all except Prostate, with significant improvements in classification accuracy. For example, the DRM has at least 20%, 13%, and 10% improvements on the GCM, Ramaswamy, and NCI data sets, respectively. Moreover, the DRM achieves the second-best performance on Prostate, which is comparable to the best. These observations confirm the effectiveness of the DRM on high-dimensional data and the application of gene expression classification.

2). Face or Facial Expression Recognition:

The recognition task is important in both supervised and unsupervised learning problems [63], [65], [66]. As an important classification problem, we examine the performance of the DRM on high-dimensional data in this application. In this test, five commonly used face data sets are used: Extended Yale B (EYaleB) [33], AR [56], Jaffe, Yale [8], and PIX [42]. The EYaleB data set has 2,414 images of 38 individuals under different viewing conditions; all original images are of size 192 × 168, and their cropped and resized versions of 32 × 32 pixels [33] are used in the experiments. For the AR data set, 25 women and 25 men are uniformly selected, with each image of 32 × 32 pixels. The Jaffe data set contains face images of 10 Japanese females with 7 different facial expressions, each image of size 26 × 26. The PIX data set collects 100 gray-scale images of size 100 × 100 from 10 objects. The Yale data set contains 165 gray-scale images of 15 persons, with 11 images of size 32 × 32 per person. In the experiments, all images are vectorized as columns of the corresponding data matrix. In the literature, robust PCA has been used as a pre-processing step to remove noise and shadows from face images [61], [62]; to fully test the performance of the DRM, we use the original face image data sets.

It should be pointed out that deep convolutional neural networks (CNNs) [48] have achieved promising performance on many large-scale image classification and other supervised learning tasks [21]. However, in this paper, we do not include a CNN as a baseline method for the following reasons: 1) the key reason is the lack of comprehensive theoretical understanding of learning with deep neural networks, such as the absence of theoretical guarantees on the convergence of the optimization and on the generalization ability, so that they are often used as a "black box" [1]; 2) a CNN is a specialized classifier for image data, whereas the DRM is a general classifier that can be used in various scenarios, so it would be unfair to compare the DRM with such methods; 3) it appears reasonable to claim at least comparable performance of the DRM to CNNs on these image data sets.

In this test, all experimental settings remain the same as in Section VII-A1, and we report the average performance over 5 splits in Table II. The proposed method achieves the best performance on 4 out of 5 data sets, with significant improvements. On the PIX data, the DRM is not the best, but it is still comparable to the best and obtains the second-best performance.

In summary, the DRM achieves the highest accuracy rates on eleven of the thirteen high-dimensional data sets, including the gene expression and face image data sets, and comparable performance on the remaining two. Even though the other methods may achieve the best performance on a few data sets, in many cases they are not as competitive. Meanwhile, the DRM consistently shows competitive performance, which implies its effectiveness.

3). Low-Dimensional Data:

Though the DRM is mainly intended for classifying high-dimensional data, we also test its classification capability on classic low-dimensional data. For this test, we use six data sets, including three hand-written digit data sets and three others. Hand-written digits are image data sets with far fewer pixels than face images and are thus treated as low-dimensional data sets. Three widely used hand-written data sets [4] are tested: hand-written pen digits, optical pen, and Semeion Handwritten Digit (SHD) [76]. They all have 10 classes representing the digits 0–9. The image sizes in optical pen, pen digits, and SHD are 8 × 8, 4 × 4, and 16 × 16, respectively. All images are reshaped into vectors. Besides, three data sets, iris, wine, and tic-tac-toe, from the UCI Machine Learning Repository are included to show the applicability of the DRM to classic, low-dimensional data. All experimental settings remain the same as in the previous tests, and we report the average performance over five splits in Table II.

Table II indicates that the DRM performs the best on five data sets, including all the hand-written digits data sets. Moreover, the DRM has comparable performance using both rbf and polynomial kernels. These observations verify that the DRM is also effective on low-dimensional data classification.

To verify the efficiency of the DRM in these applications, we report its average training time on the data sets in Table II. Experiments in this part, including the results reported in Table II and Fig. 1, are all conducted in Matlab on a 4-core Intel Core i7-8550U 1.80GHz laptop with 16GB memory. As pointed out in Section VI-A, the construction of the kernel matrix K, as well as the matrices H and B, does not rely on the testing example x. Also, the matrix inversion in (13) can be reused for any testing example. Thus, it makes sense to consider the matrix-vector multiplication of (K + α(H − B) + βI)^{−1} and Kx in (13) as training and the other computations as pre-processing operations. To better understand the efficiency of the DRM, we report the time cost for training as well as for the pre-processing steps. Specifically, we report the average time cost of calculating the matrices K, B, and H, and of calculating the vector Kx, in Figs. 1(a) and 1(b), respectively. With K, B, H, and Kx given, we also report the time cost of (13) in Fig. 1(c). Without loss of generality, we use the linear kernel and set α = 1, β = 1 in these experiments. It is seen that every step completes within a few seconds or less. While the pre-processing step takes relatively longer, it is a one-time computation that can be reused for any testing example; to be more efficient, these matrices can also be computed in parallel. The third step takes less than one second; since the reported time includes the matrix inversion as well as the training, the training alone is even faster. These observations empirically corroborate the efficiency of the DRM.

Fig. 1:

Fig. 1:

Average time cost of the DRM on training data sets. (a) and (b) are pre-processing steps. (c) involves matrix inversion and training with pre-calculated B, H, K, and Kx.

4). Large-Scale Data:

We evaluate the performance of the linear-DRM-PPA, linear-DRM-GD, and linear-DRM-APG on 5 standard large-scale data sets: UCI Poker [4], ijcnn1 [17], SensIT Vehicle (Acoustic) [27], SensIT Vehicle (Seismic) [27], and Shuttle.

All data sets have already been split into training and testing sets. Note that such large sample sizes effectively avoid the effects of sampling randomness, so it is unnecessary to re-split the data; thus we follow the original splitting. We conduct random-split CV on the training sets to select the parameters α and β with a grid search from 10^{−6} to 10^{−1} and from 10^{−6} to 10^{8}, respectively. To reduce computational complexity, we randomly select a subset from the training set to do the CV; the size of the subset is usually set to 3000. All data sets are scaled to [−1, 1] by dividing each feature by the infinity norm of the corresponding feature vector. We compare the prediction accuracy with the L1-SVM and L2-SVM, for which TRON [52] is employed as the solver. Table IV gives the accuracy rates, from which two observations are made: (1) the accuracy of the SVM is (almost) exactly the same as that reported in [19]; (2) the proposed DRM outperforms the SVM significantly on all the data sets; in particular, on Shuttle the DRM exceeds the SVM by 53.2%.

TABLE IV:

Accuracy by Using linear SVM and linear DRM

Dataset L1-SVM L2-SVM DRM-GD DRM-PPA DRM-APG
UCI Poker 31.5% 33.8% 50.1% 50.1% 50.1%
IJCNN1 67.3% 74.2% 90.1% 90.5% 90.5%
SensIT Vehicle (A) 43.5% 45.9% 65.1% 64.6% 65.5%
SensIT Vehicle (S) 41.6% 42.2% 62.8% 62.8% 62.8%
Shuttle 35.9% 29.7% 89.6% 89.2% 90.0%

As analyzed in Section VI, all three algorithms take O(mpn) flops. When p is small, all of them cost O(mn) flops; nonetheless, they have different constant factors. To empirically demonstrate the computational cost, Fig. 2 plots the time cost versus training sample size n for all three algorithms, where we run experiments for 100 examples and record the average time as well as the standard deviation. Here, we fix the iteration number at 50, which guarantees the convergence of the algorithms, as will be seen in a later section. We observe that, with the same convergence tolerance, the DRM-APG has the largest time cost and the DRM-PPA the smallest on the same data set. Fig. 2 verifies that the average training time of the three algorithms is linear in n. Note that the proposed algorithms are naturally suited to parallel computing because they treat one test example at a time.

Fig. 2:

Fig. 2:

Training time of an example versus training sample size n, for different data sets. All algorithms run 50 iterations with α = 10−3 and β = 104.

To numerically verify the theoretical convergence properties of our algorithms, we plot the value of the objective function versus iteration in Fig. 3. We use all training examples and the same testing example for the three algorithms, with α and β fixed at 10^{−3} and 10^{4}, respectively. Fig. 3 indicates that, on most data sets, the DRM-GD converges in the fewest iterations and the DRM-PPA in the most; however, each iteration of the DRM-GD is more time-consuming than that of the DRM-PPA.

Fig. 3:

Fig. 3:

Value of objective function versus iteration, for the DRM on different data sets.

We have theoretically shown that a larger β ensures faster convergence for the DRM-PPA. As shown in Fig. 4, similar conclusions can be drawn for the DRM-GD and DRM-APG. In particular, all three algorithms converge within a few iterations on all data sets when β is large. We also report the time to convergence in Fig. 4. It is seen that the proposed algorithms converge fast, needing only a few seconds, which indicates their potential for real-world applications.

Fig. 4:

Fig. 4:

Iterations and time needed by three DRM algorithms on different data sets, with ϵ = 1e − 5, α = 1e − 3. SensV(A) and SensV(S) stand for SensITVehicle (Acoustic) and (Seismic), respectively.

B. Application to Imbalanced Data

Imbalanced data are often observed in real-world applications; typically, the classes are not equally represented. In this experiment, we evaluate the proposed method on classifying imbalanced data. For the test, we use 15 data sets from the KEEL database, whose key statistics are provided in Table V. Here, the imbalance ratio (IR) measures the number of examples in the largest class over that in the smallest class, where a larger value indicates more imbalance. We adopt the G-mean for evaluation, which is often used in the literature [5], [74]; for its detailed definition, please refer to [40].
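As a quick reference for the metric used below: the G-mean is commonly defined as the geometric mean of the per-class recalls, which for binary problems reduces to √(sensitivity × specificity). A small NumPy sketch (our own; the helper name is hypothetical):

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls; for binary labels this is
    sqrt(sensitivity * specificity)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in np.unique(y_true):
        mask = (y_true == c)
        recalls.append(np.mean(y_pred[mask] == c))  # recall of class c
    return float(np.prod(recalls) ** (1.0 / len(recalls)))
```

Unlike plain accuracy, a classifier that ignores the minority class scores a G-mean of 0, which is why the metric is preferred for imbalanced data.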

TABLE V:

Summary of Imbalanced Data Sets

Data Set Dim Size IR Data Set Dim Size IR Data Set Dim Size IR
1. haberman 3 306 2.78 6. glass6 9 214 6.38 11. glass2 9 214 11.59
2. ecoli1 7 336 3.36 7. yeast3 8 1484 8.10 12. cleveland-0vs4 13 173 12.31
3. glass4 9 214 3.98 8. ecoli3 7 336 8.60 13. yeast-1vs7 7 459 14.30
4. new thyroid 2 5 215 5.14 9. yeast-2vs4 8 514 9.08 14. ecoli4 7 336 15.80
5. ecoli2 7 336 5.46 10. yeast-05679vs4 8 528 9.35 15. page-blocks-13vs45 10 472 15.86

These data sets are commonly used for evaluating classifiers on imbalanced data and are originally split into five folds. In our experiments, we follow the original splits to conduct 5 trials and report the average performance. We include one more classifier in this test, namely sparse linear discriminant analysis (SLDA) [24]; we did not include it in the previous experiments because it is overly time-consuming to train on high-dimensional data. For the weight on its elastic-net regression, we choose the value from {0.001, 0.01, 0.1, 1, 10, 100, 1000} by the LOOCV, and the reduced dimension is determined by the LOOCV from {1, 3, 5, 7, 9}. Except for the change of evaluation metric, all other settings remain the same as in Section VII-A1. We report the average performance over these five splits in Table VI. It is seen that the DRM obtains the best performance on 12 of the 15 data sets, while the SVM, KNN, and RF each obtain the best performance on one. Note that even where the DRM is not the best, its performance is still quite competitive; for example, on the page-blocks-13vs45 data set, the DRM has a G-mean value of 0.9932, which is inferior to the RF by only about 0.005. In contrast, when the SVM, KNN, and RF are not the best, their performance is in many cases far from comparable to the DRM. These observations suggest the effectiveness of the DRM for imbalanced data classification.

TABLE VI:

Classification Performance (G-mean) of SVM, KNN, RF, C4.5, NB, SLDA, and DRM on Imbalanced Data

Data SVM(R) SVM(P) KNN(E) KNN(C) RF C4.5 NB SLDA DRM(R) DRM(P)
1 .4212±.1451 .5578±.1249 .5315±.0455 .5188±.1318 .4532±.0761 —————- .5000±.1562 .4359±.1048 .6541±.0364 .6129±.0618
2 .8830±.0383 .4082±.2618 .8651±.0426 .8435±.0830 .8517±.0777 .4093±.0312 —————- .7792±.0413 .8714±.0366 .8728±.0466
3 .6849±.1855 .8858±.1311 .8479±.1973 .8479±.1973 .6358±.3858 —————- .7210±.1623 .3528±.3364 .8926±.1290 .9219±.1257
4 .8936±.0442 .9824±.0322 .9852±.0332 .9796±.0308 .9543±.0511 .3596±.0192 .9859±.0142 .5227±.0880 .9703±.0406 .9852±.0332
5 .8143±.0438 .0000±.0000 .9380±.0494 .9494±.0398 .8592±.0715 .3468±.0495 —————- .5880±.1993 .9415±.0510 .9057±.0573
6 .8176±.0836 .8628±.0797 .8625±.0766 .8843±.0747 .8673±.0759 —————- .8837±.0919 .8503±.0984 .8995±.0940 .9323±.0496
7 .7253±.0508 —————- .8141±.0224 .7752±.0398 .8222±.0423 .2129±.1447 —————- .8510±.0271 .8509±.0539 .8860±.0279
8 .3610±.2159 —————- .7797±.1493 .7657±.1438 .6923±.1161 .2597±.0361 —————- .6632±.1561 .8862±.0983 .8668±.0822
9 .6542±.1782 —————- .8322±.0779 .7710±.0886 .8222±.1121 .0685±.1553 —————- .5558±.0665 .8822±.0459 .8765±.0582
10 .1258±.1723 —————- .5686±.1330 .6375±.1204 .5840±.1850 .0557±.0142 —————- .4578±.0768 .7911±.0945 .7915±.1058
11 —————- .3517±.3378 .5663±.0736 .3996±.3816 .1140±.2550 —————- .4677±.2643 .3901±.2497 .6652±.1441 .7370±.0869
12 .5093±.2909 .5011±.4858 .6332±.3805 .5294±.3125 .6103±.3742 .2880±.0720 .6962±.4245 .5596±.0461 .8814±.1783 .7503±.1635
13 —————- —————- .5482±.1487 .4397±.2643 .4612±.2578 —————- —————- .4759±.2709 .6633±.1161 .7174±.1029
14 .8025±.0870 —————- .9464±.0734 .8610±.1038 .8435±.2033 .2196±.0494 —————- .8955±.0694 .9545±.0573 .9343±.0723
15 .6642±.1955 .9782±.0395 .9755±.0456 .9352±.0592 .9989±.0025 .2539±.0225 .7324±.1346 .6589±.0492 .9433±.0530 .9932±.0062

For each data set, the best G-mean value is highlighted in red. The performance is represented as average G-mean ± standard deviation. The data sets are represented as their indices, which follow Table V.

Moreover, to further illustrate the effectiveness of the DRM, we report the results of baseline classifiers combined with up-sampling techniques, including the synthetic minority over-sampling technique (SMOTE) [20] and its variants, the borderline SMOTE (BorSMOTE) [38] and Safe-Level SMOTE (SL-SMOTE) [14]. Due to space limitations, we report the results of the SVM, KNN, and RF, since they have relatively better performance. All settings remain the same except that the SVM, KNN, and RF are applied to data sets pre-processed with up-sampling. We use 5 neighbors for all the up-sampling methods to generate synthetic examples. For the SVM and KNN, we merge the results of different kernels or distances and report whichever is higher. For the DRM, we report the higher of the two results from Table VI. The classification results are given in Table VII. With up-sampling, the performance of the baseline methods is significantly improved, whereas the DRM still shows comparable performance. It should be noted that the performance of the DRM could be further improved by applying sampling techniques such as the SMOTE; however, a full discussion of sampling techniques is beyond the scope of this paper. It is also worth mentioning that the DRM by itself is competitive with the combinations of sampling techniques and baseline classifiers.
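The SMOTE family used above generates synthetic minority examples by interpolating between a minority sample and one of its minority-class nearest neighbors; a minimal sketch of plain SMOTE [20] (the variants differ mainly in which samples are eligible for interpolation; the function and toy data are ours):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: each synthetic point lies on the segment
    between a random minority sample and one of its k nearest
    minority-class neighbors (Chawla et al., 2002)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                # exclude each point itself
    nn = np.argsort(d, axis=1)[:, :k]          # indices of k nearest neighbors
    out = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                    # random minority sample
        p = X_min[nn[i, rng.integers(k)]]      # one of its neighbors
        out[j] = X_min[i] + rng.random() * (p - X_min[i])
    return out

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_syn = smote(X_min, n_new=4, k=2, rng=0)      # 4 synthetic minority points
```

Each synthetic point is a convex combination of two real minority samples, so it stays inside the minority class's local neighborhood.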

TABLE VII:

Classification Performance (G-mean) of DRM and Sampling + Benchmarking Classifiers

Data S+SVM BorS+SVM SL-S+SVM S+KNN BorS+KNN SL-S+KNN S+RF BorS+RF SL-S+RF DRM
1 .6435±.0727 —————– —————– .5598±.0447 .5856±.0568 .5871±.0230 .5870±.0770 .5697±.0519 .5550±.0376 .6541±.0364
2 .8735±.0399 .8813±.0101 .8757±.0367 .8220±.0905 .8233±.0667 .8648±.0878 .8803±.0443 .8778±.0544 .8904±.0708 .8728±.0466
3 .9003±.1236 .9003±.1236 .8883±.1334 .9313±.1257 .9313±.1257 .9313±.1257 .7924±.4431 .8956±.1353 .8492±.1951 .9219±.1257
4 .9824±.0322 .9824±.0322 .9944±.0077 1.0000±.0000 1.0000±.0000 .9944±.0077 .9569±.0487 .9329±.0379 .9661±.0534 .9852±.0332
5 .8973±.0544 .8464±.0744 .9035±.0630 .9056±.0681 .8957±.0724 .9327±.0462 .8924±.0660 .8770±.0408 .9107±.0408 .9415±.0510
6 .9269±.0511 .8965±.0656 .8958±.0601 .9129±.0739 .9271±.0738 .9207±.0752 .9226±.0362 .9350±.0543 .8893±.0767 .9329±.0379
7 .9003±.0362 .8714±.0225 .8965±.0244 .8401±.0167 .8228±.0140 .8532±.0333 .8824±.0311 .8876±.0178 .8758±.0352 .8860±.0279
8 .8866±.0205 .8701±.0271 .8829±.0220 .8081±.0881 —————– .8134±.0690 .7970±.1716 .7641±.1122 .8483±.0645 .8862±.0983
9 .8905±.0362 .8405±.0787 .8877±.0416 —————– .7573±.1194 .8641±.0680 .8544±.0649 .8781±.0414 .8575±.0987 .8822±.0459
10 .7936±.0283 .8057±.0495 .7928±.0825 .7280±.0421 .7016±.0774 .7210±.1273 .7685±.0703 .7106±.1253 .7201±.0964 .7915±.1058
11 .7311±.1618 .7036±.1243 .5421±.0275 .6183±.1520 .6178±.1506 .5996±.1267 .4924±.2857 .4580±.2698 .2460±.3431 .7370±.0869
12 .9190±.0649 .9451±.0255 .9158±.0641 .8040±.1791 .7585±.2039 .8063±.1770 .7048±.1737 .7660±.2063 .6278±.4077 .8814±.1783
13 .7081±.0785 .7452±.0708 .6865±.0910 .5666±.0600 .6002±.0706 .5321±.0708 .5249±.1285 .5556±.0924 .4127±.2599 .7174±.1029
14 .9544±.0233 .9361±.0233 .9709±.0222 .9375±.0750 .9375±.0750 .9391±.0765 .9137±.0719 .9085±.1266 .9167±.0725 .9545±.0573
15 .9782±.0395 .9782±.0395 .9782±.0395 .9955±.0074 .9744±.0452 .9955±.0074 1.0000±.0000 1.0000±.0000 .9977±.0051 .9932±.0062

For each data set, the best G-mean value is highlighted in red. The performance is represented as average G-mean ± standard deviation. The data sets are represented as their indices, which follow Table V. S, BorS, and SL-S represent SMOTE, BorSMOTE, and SL-SMOTE, respectively.

A few comments about the effectiveness of the DRM on imbalanced data are now in order. The model in (8) assigns different weights to different groups through the matrix B, where each weight is determined by the group size as defined in (6). These weights effectively alleviate the adverse effect of class imbalance in the data; in fact, the spirit of this approach is consistent with that of algorithm-level methods for imbalanced data classification. Moreover, the decision rule in (11) has a clear geometric interpretation of how the class of a testing example is determined. Thus, from the model itself, it is reasonable to expect promising performance from the DRM.
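The actual weights are defined by (6) in the paper; as a purely illustrative sketch of the general algorithm-level idea, one common choice weights each example inversely to its class size so that every class contributes the same total weight (the function below is our hypothetical illustration, not the paper's B):

```python
import numpy as np

def class_weight_matrix(labels):
    """Hypothetical sketch of a group-size-based diagonal weight matrix:
    each training example is weighted by 1 / (size of its class), so a
    small class is not dominated by a large one. This only illustrates
    the general idea; the paper's Eq. (6) defines the actual weights."""
    labels = np.asarray(labels)
    counts = {c: np.sum(labels == c) for c in np.unique(labels)}
    w = np.array([1.0 / counts[c] for c in labels])
    return np.diag(w)

B = class_weight_matrix([0, 0, 0, 0, 1])   # 4-vs-1 imbalance
# Each class now contributes a total weight of 1 regardless of its size.
```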

C. Ablation Study

In this subsection, we conduct extensive experiments to illustrate the significance of the component terms in our model. Without loss of generality, we conduct the ablation study on the imbalanced data sets. Since the first term of (4) is the loss while the third term is known to prevent overfitting and strengthen robustness in ridge regression, we focus on the second term, i.e., Sw(w). In particular, we compare the DRM with classic kernel ridge regression (KRR). First, to show the significance of Sw(w), we compare the best performance of the KRR and DRM with all parameters selected by CV, where the setting follows Section VII-A. We show the results using RBF and polynomial kernels separately in Fig. 5. The DRM performs better than the KRR, which reveals that the improvement comes from the discriminative information introduced by Sw(w). To further investigate how Sw(w) performs, we compare the KRR and DRM as follows. We take the value of β from {10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}}. For each β value, we report the best performance of the KRR and DRM, where the other parameters are optimally selected among all possible combinations. We show the results for the RBF and polynomial kernels separately in Fig. 6. The performance curve of the DRM is almost always above that of the KRR, implying that with the term Sw(w) the DRM outperforms the KRR for any β value. Thus, it is convincing to claim that the improvement consistently comes from the new term, which demonstrates its significance in our model.
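The KRR baseline admits the standard closed-form solution α = (K + λI)^{-1} y; a minimal sketch with an RBF kernel (λ and γ here are placeholders for the CV-selected values, and the code is our illustration, not the paper's implementation):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K_ij = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=1e-2, gamma=1.0):
    """Closed-form KRR coefficients: alpha = (K + lam*I)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_test, gamma=1.0):
    return rbf_kernel(X_test, X_train, gamma) @ alpha

# Tiny regression-on-labels example: with a small lam the fit
# essentially interpolates the training targets.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 0.0])
alpha = krr_fit(X, y, lam=1e-8)
preds = krr_predict(X, alpha, X)
```

For multi-class classification, y is typically replaced by a one-hot indicator matrix and the predicted class is the argmax over the output columns.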

Fig. 5: Comparison between the DRM and KRR, where all parameters are selected by CV.

Fig. 6: Comparison between the DRM and KRR by fixing the β value and varying the others. Best viewed zoomed-in and in color.

VIII. Conclusion

In this paper, we introduce a class of discriminative ridge regression models for supervised classification. This class can be regarded as an extension of several existing regression models in the literature, such as ridge, lasso, and group lasso. The new models explicitly incorporate discriminative information between different classes and hence are more suitable for classification than those existing regression models. As a special case of the new models, we focus on a quadratic model, the DRM, as our classifier. The DRM admits a closed-form solution, which is computationally suitable for high-dimensional data with a small sample size. For large-scale data, we establish three iterative algorithms with reduced computational cost. With the linear kernel, all three algorithms have cost linear in the sample size and theoretically guaranteed global convergence. Extensive experiments on standard, real-world data sets demonstrate that the DRM outperforms several state-of-the-art classification methods, especially on high-dimensional or imbalanced data; furthermore, the three iterative algorithms with the linear kernel combine linear cost, global convergence, and scalability while being significantly more accurate than the linear SVM. The ablation study confirms the significance of the key component of the proposed method.

TABLE III:

Large-Scale Data Used in Experiments

Dataset Size Class Training Testing
UCI Poker 1,025,010 × 10 10 25,010 1,000,000
ijcnn1 141,691 × 22 2 49,990 91,701
SensIT Vehicle (A) 98,528 × 50 3 78,823 19,705
SensIT Vehicle (S) 98,528 × 50 3 78,823 19,705
Shuttle 58,000 × 9 7 43,500 14,500

For the five data sets, we follow the original splitting.

Acknowledgment

C.P. is partially supported by the National Natural Science Foundation of China (NSFC) under Grant 61806106 and the Shandong Provincial Natural Science Foundation, China, under Grant ZR2019QF009; Q.C. is partially supported by NIH UH3 NS100606-03, R01HD101508-01, and a grant from the University of Kentucky. Q.C. is the corresponding author.

Biographies


Chong Peng received his PhD degree in Computer Science from Southern Illinois University, Carbondale, IL, USA in 2017. Currently, he is an assistant professor at the College of Computer Science and Technology, Qingdao University, China. He has published about 40 research papers in top-tier conferences and journals. His research interests include data mining, computer vision, machine learning, and pattern recognition.


Qiang Cheng received the PhD degree from the Department of Electrical and Computer Engineering at the University of Illinois, Urbana-Champaign. Currently, he is an associate professor at the Institute for Biomedical Informatics and the Department of Computer Science at the University of Kentucky. He has published about 100 peer-reviewed papers in various premium conferences and journals. His research interests include data science, machine learning, pattern recognition, artificial intelligence, and biomedical informatics.

Footnotes

1. As reported in later parts of this section, we test the DRM on 8 image data sets, including 5 face image data sets and 3 digit image data sets. The DRM achieves higher than 96% accuracy on 7 data sets, of which 5 are above 98% and 4 are above 99%. Thus, it appears sensible to state that the DRM is competitive even if a CNN performs better, given the little room left for improvement.

Contributor Information

Chong Peng, College of Computer Science and Technology, Qingdao University.

Qiang Cheng, Institute of Biomedical Informatics & Department of Computer Science, University of Kentucky.

References

  • [1].Alain G and Bengio Y, “Understanding intermediate layers using linear classifier probes,” arXiv preprint arXiv:1610.01644, 2016. [Google Scholar]
  • [2].Ao L, Deyun C, Zhiqiang W, Guanglu S, Kezheng L, and Quanquan G, “Self-supervised sparse coding scheme for image classification based on low rank representation,” PLoS One, vol. 13, no. 6, p. e0199141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Aşkan A and Sayın S, “Svm classification for imbalanced data sets using a multiobjective optimization framework,” Annals of Operations Research, vol. 216, no. 1, pp. 191–203, 2014. [Google Scholar]
  • [4].Bache K and Lichman M, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
  • [5].Barua S, Islam MM, Yao X, and Murase K, “Mwmote–majority weighted minority oversampling technique for imbalanced data set learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 2, pp. 405–425, 2014. [Google Scholar]
  • [6].Basri R and Jacobs D, “Lambertian reflectance and linear subspaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 2, pp. 218–233, 2003. [Google Scholar]
  • [7].Beck A and Teboulle M, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal of Imaging Science, vol. 2, no. 1, pp. 183–202, 2009. [Google Scholar]
  • [8].Belhumeur PN, Hespanha JP, and Kriegman DJ, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” IEEE Transactions on pattern analysis and machine intelligence, vol. 19, no. 7, pp. 711–720, 1997. [Google Scholar]
  • [9].Bellhumer PN, Hespanha J, and Kriegman D, “Eigenfaces vs. fisherfaces: Recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 17, no. 7, pp. 711–720, 1997. [Google Scholar]
  • [10].Bhowan U, Johnston M, Zhang M, and Yao X, “Evolving diverse ensembles using genetic programming for classification with unbalanced data,” IEEE Transactions on Evolutionary Computation, vol. 17, no. 3, pp. 368–386, 2013. [Google Scholar]
  • [11].Bottou L, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010. Springer, 2010, pp. 177–186. [Google Scholar]
  • [12].Boyd S and Vandenberghe L, Convex Optimization. Cambridge University Press, 2004. [Google Scholar]
  • [13].Breiman L, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001. [Google Scholar]
  • [14].Bunkhumpornpat C, Sinapiromsaran K, and Lursinsap C, “Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem,” in Pacific-Asia conference on knowledge discovery and data mining. Springer, 2009, pp. 475–482. [Google Scholar]
  • [15].——, “Dbsmote: density-based synthetic minority over-sampling technique,” Applied Intelligence, vol. 36, no. 3, pp. 664–684, 2012. [Google Scholar]
  • [16].Castellanos FJ, Valero-Mas JJ, Calvo-Zaragoza J, and Rico-Juan JR, “Oversampling imbalanced data in the string space,” Pattern Recognition Letters, vol. 103, pp. 32–38, 2018. [Google Scholar]
  • [17].Chang C-C and Lin C-J, “Ijcnn 2001 challenge: Generalization ability and text decoding,” in Proceedings of IJCNN. IEEE, 2001, pp. 1031–1036. [Google Scholar]
  • [18].——, “Libsvm: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 1–27, 2011. [Google Scholar]
  • [19].Chang K-W, Hsieh C-J, and Lin C-J, “Coordinate descent method for large-scale l2-loss linear support vector machines,” The Journal of Machine Learning Research, vol. 9, pp. 1369–1398, 2008. [Google Scholar]
  • [20].Chawla NV, Bowyer KW, Hall LO, and Kegelmeyer WP, “Smote: synthetic minority over-sampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002. [Google Scholar]
  • [21].Chen C, Wang G, Peng C, Zhang X, and Qin H, “Improved robust video saliency detection based on long-term spatial-temporal information,” IEEE Transactions on Image Processing, vol. 29, pp. 1090–1100, 2020. [DOI] [PubMed] [Google Scholar]
  • [22].Cheng Q, Zhou H, Cheng J, and Li H, “A minimax framework for classification with applications to images and high dimensional data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 11, pp. 2117–2130, 2014. [DOI] [PubMed] [Google Scholar]
  • [23].Cheng Q, Zhou H, and Cheng J, “The fisher-markov selector: Fast selecting maximally separable feature subset for multiclass classification with applications to high-dimensional data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1217–1233, 2011. [DOI] [PubMed] [Google Scholar]
  • [24].Clemmensen L, Hastie T, Witten D, and Ersbøll B, “Sparse discriminant analysis,” Technometrics, vol. 53, no. 4, pp. 406–413, 2011. [Google Scholar]
  • [25].Cortes C and Vapnik V, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995. [Google Scholar]
  • [26].Del Río S, López V, Benítez JM, and Herrera F, “On the use of mapreduce for imbalanced big data using random forest,” Information Sciences, vol. 285, pp. 112–137, 2014. [Google Scholar]
  • [27].Duarte MF and Hu YH, “Vehicle classification in distributed sensor networks,” Journal of Parallel and Distributed Computing, vol. 64, no. 7, pp. 826–838, 2004. [Google Scholar]
  • [28].Duda R, Hart P, and Stork D, Pattern Classification, 2nd ed. John Wiley & Sons, 2001. [Google Scholar]
  • [29].Fan R-E, Chang K-W, Hsieh C-J, Wang X-R, and Lin C-J, “Liblinear: A library for large linear classification,” The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008. [Google Scholar]
  • [30].Fukunaga K, Introduction to Statistical Pattern Recognition. Academic Press, 1990. [Google Scholar]
  • [31].Gallego A-J, Calvo-Zaragoza J, Valero-Mas JJ, and Rico-Juan JR, “Clustering-based k-nearest neighbor classification for large-scale data with neural codes representation,” Pattern Recognition, vol. 74, pp. 531–543, 2018. [Google Scholar]
  • [32].Gan J, Wen G, Yu H, Zheng W, and Lei C, “Supervised feature selection by self-paced learning regression,” Pattern Recognition Letters, 2018. [Google Scholar]
  • [33].Georghiades AS, Belhumeur PN, and Kriegman D, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001. [Google Scholar]
  • [34].Gónzalez S, García S, Lázaro M, Figueiras-Vidal AR, and Herrera F, “Class switching according to nearest enemy distance for learning from highly imbalanced data-sets,” Pattern Recognition, vol. 70, pp. 12–24, 2017. [Google Scholar]
  • [35].Grant M, Boyd S, and Ye Y, “CVX: Matlab software for disciplined convex programming,” Available at http://www.stanford.edu/boyd/cvx.
  • [36].Gu X, Chung F.-l., and Wang S, “Fast convex-hull vector machine for training on large-scale ncrna data classification tasks,” Knowledge-Based Systems, vol. 151, pp. 149–164, 2018. [Google Scholar]
  • [37].Guo Y, Hastie T, and Tibshirani R, “Regularized linear discriminant analysis and its application in microarrays,” Biostatistics, vol. 8, no. 1, pp. 86–100, 2007. [DOI] [PubMed] [Google Scholar]
  • [38].Han H, Wang W-Y, and Mao B-H, “Borderline-smote: a new oversampling method in imbalanced data sets learning,” in International conference on intelligent computing. Springer, 2005, pp. 878–887. [Google Scholar]
  • [39].Hastie T, Tibshirani R, and Friedman J, The Elements of Statistical Learning. Springer, 2009. [Google Scholar]
  • [40].He H and Garcia EA, “Learning from imbalanced data,” IEEE Transactions on knowledge and data engineering, vol. 21, no. 9, pp. 1263–1284, 2009. [Google Scholar]
  • [41].Ho J, Yang M, Lim J, Lee K, and Kriegman D, “Clustering appearances of objects under varying illumination conditions,” in Proceedings of Computer Vision and Pattern Recognition, 2003. [Google Scholar]
  • [42].Hond D and Spacek L, “Distinctive descriptions for face processing.” in BMVC, no. 0.2, 1997, pp. 0–4. [Google Scholar]
  • [43].Hsieh C-J, Chang K-W, Lin C-J, Keerthi SS, and Sundararajan S, “A dual coordinate descent method for large-scale linear svm,” in Proceedings of the 25th International Conference on Machine learning. ACM, 2008, pp. 408–415. [Google Scholar]
  • [44].Joachims T, “Training linear svms in linear time,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006, pp. 217–226. [Google Scholar]
  • [45].Keerthi SS and DeCoste D, “A modified finite newton method for fast solution of large scale linear svms,” in Journal of Machine Learning Research, 2005, pp. 341–361. [Google Scholar]
  • [46].Kiela D, Grave E, Joulin A, and Mikolov T, “Efficient large-scale multi-modal classification,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [Google Scholar]
  • [47].Krawczyk B, “Learning from imbalanced data: open challenges and future directions,” Progress in Artificial Intelligence, vol. 5, no. 4, pp. 221–232, 2016. [Google Scholar]
  • [48].Krizhevsky A, Sutskever I, and Hinton GE, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105. [Google Scholar]
  • [49].Levitin ES and Polyak BT, “Constrained minimization methods,” Compu. Math. Math. Phys, vol. 6, pp. 787–823, 1966. [Google Scholar]
  • [50].Li A, Wu Z, Lu H, Chen D, and Sun G, “Collaborative self-regression method with nonlinear feature based on multi-task learning for image classification,” IEEE Access, vol. 6, pp. 43 513–43 525, 2018. [Google Scholar]
  • [51].Li J, Du Q, Li Y, and Li W, “Hyperspectral image classification with imbalanced data based on orthogonal complement subspace projection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 7, pp. 3838–3851, 2018. [Google Scholar]
  • [52].Lin C-J, Weng RC, and Keerthi SS, “Trust region newton method for logistic regression,” The Journal of Machine Learning Research, vol. 9, pp. 627–650, 2008. [Google Scholar]
  • [53].Liu K, Yang X, Yu H, Mi J, Wang P, and Chen X, “Rough set based semi-supervised feature selection via ensemble selector,” Knowledge-Based Systems, vol. 165, pp. 282–296, 2019. [Google Scholar]
  • [54].Mangasarian OL, “A finite newton method for classification,” Optimization Methods and Software, vol. 17, no. 5, pp. 913–929, 2002. [Google Scholar]
  • [55].Maratea A, Petrosino A, and Manzo M, “Adjusted f-measure and kernel scaling for imbalanced data learning,” Information Sciences, vol. 257, pp. 331–341, 2014. [Google Scholar]
  • [56].Martinez AM, “The ar face database,” CVC Technical Report, vol. 24, 1998. [Google Scholar]
  • [57].Nakayama R, Nemoto T, Takahashi H, Ohta T, Kawai A, Seki K, Yoshida T, Toyama Y, Ichikawa H, and Hasegawa T, “Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma,” Modern Pathology, vol. 20, no. 7, pp. 749–759, 2007. [DOI] [PubMed] [Google Scholar]
  • [58].Nesterov Y, “A method of solving a convex programming problem with convergence rate of O(1/k^2),” Soviet Math. Doklady, vol. 27, pp. 372–376, 1983. [Google Scholar]
  • [59].Ng WW, Hu J, Yeung DS, Yin S, and Roli F, “Diversified sensitivity-based undersampling for imbalance classification problems,” IEEE transactions on cybernetics, vol. 45, no. 11, pp. 2402–2412, 2015. [DOI] [PubMed] [Google Scholar]
  • [60].Nikolaou N, Edakunni N, Kull M, Flach P, and Brown G, “Cost-sensitive boosting algorithms: Do we really need them?” Machine Learning, vol. 104, no. 2–3, pp. 359–384, 2016. [Google Scholar]
  • [61].Peng C, Chen C, Kang Z, Li J, and Cheng Q, “Res-pca: A scalable approach to recovering low-rank matrices,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7317–7325. [Google Scholar]
  • [62].Peng C, Chen Y, Kang Z, Chen C, and Cheng Q, “Robust principal component analysis: A factorization-based approach with linear complexity,” Information Sciences, vol. 513, pp. 581–599, 2020. [Google Scholar]
  • [63].Peng C, Kang Z, Cai S, and Cheng Q, “Integrate and conquer: Double-sided two-dimensional k-means via integrating of projection and manifold construction,” ACM Trans. Intell. Syst. Technol, vol. 9, no. 5, June. 2018. [Online]. Available: 10.1145/3200488 [DOI] [Google Scholar]
  • [64].Peng C, Kang Z, and Cheng Q, “Feature selection embedded subspace clustering,” IEEE Signal Processing Letters, vol. 23, no. 7, pp. 1018–1022, 2016. [Google Scholar]
  • [65].——, “Subspace clustering via variance regularized ridge regression,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [Google Scholar]
  • [66].Peng C, Kang Z, Li H, and Cheng Q, “Subspace clustering using log-determinant rank approximation,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 925–934. [Online]. Available: 10.1145/2783258.2783303 [DOI] [Google Scholar]
  • [67].Rockafellar RT, “Monotone operators and the proximal point algorithm,” SIAM journal on control and optimization, vol. 14, no. 5, pp. 877–898, 1976. [Google Scholar]
  • [68].Roweis S and Saul L, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, p. 2323, 2000. [DOI] [PubMed] [Google Scholar]
  • [69].Rui Y, Huang T, Ortega M, and Mehrotra S, “Relevance feedback: A power tool for interactive content-based image retrieval,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 5, pp. 644–655, 2002. [Google Scholar]
  • [70].Scholkopf B and Smola A, Learning with Kernels. MIT Press, 2002. [Google Scholar]
  • [71].Shalev-Shwartz S, Singer Y, Srebro N, and Cotter A, “Pegasos: Primal estimated sub-gradient solver for svm,” Mathematical Programming, vol. 127, no. 1, pp. 3–30, 2011. [Google Scholar]
  • [72].Soda P, “A multi-objective optimisation approach for class imbalance learning,” Pattern Recognition, vol. 44, no. 8, pp. 1801–1810, 2011. [Google Scholar]
  • [73].Sorensen DC, “Implicit application of polynomial filters in a k-step Arnoldi method,” SIAM Journal on Matrix Analysis and Applications, vol. 13, pp. 357–385, 1992. [Google Scholar]
  • [74].Su C-T and Hsiao Y-H, “An evaluation of the robustness of mts for imbalanced data,” IEEE Transactions on knowledge and data engineering, vol. 19, no. 10, 2007. [Google Scholar]
  • [75].Sun L, Hui A-M, Su Q, Vortmeyer A, Kotliarov Y, Pastorino S, Passaniti A, Menon J, Walling J, Bailey R et al. , “Neuronal and glioma-derived stem cell factor induces angiogenesis within the brain,” Cancer cell, vol. 9, no. 4, pp. 287–300, 2006. [DOI] [PubMed] [Google Scholar]
  • [76].Tactile, “Semeion data set,” Semeion Research Center of Sciences of Communication, via Sersale 117, 00128 Rome, Italy. Tattile Via Gaetano Donizetti, 1-3-5,25030 Mairano; (Brescia), Italy. [Google Scholar]
  • [77].Vapnik V, The nature of statistical learning theory. springer, 2000. [Google Scholar]
  • [78].Vapnik VN, Statistical Learning Theory. Wiley; New York, 1998, vol. 2. [Google Scholar]
  • [79].Veropoulos K, Campbell C, Cristianini N et al. , “Controlling the sensitivity of support vector machines,” in Proceedings of the international joint conference on AI, vol. 55, 1999, p. 60. [Google Scholar]
  • [80].Wang Y, Gan W, Yang J, Wu W, and Yan J, “Dynamic curriculum learning for imbalanced data classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5017–5026. [Google Scholar]
  • [81].Witten DM and Tibshirani R, “Penalized classification using fisher’s linear discriminant,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 5, pp. 753–772, 2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [82].Xu X, Liang T, Zhu J, Zheng D, and Sun T, “Review of classical dimensionality reduction and sample selection methods for large-scale data processing,” Neurocomputing, vol. 328, pp. 5–15, 2019. [Google Scholar]
  • [83].Yen S-J and Lee Y-S, “Cluster-based under-sampling approaches for imbalanced data distributions,” Expert Systems with Applications, vol. 36, no. 3, pp. 5718–5727, 2009. [Google Scholar]
  • [84].You C, Li C, Robinson DP, and Vidal R, “Scalable exemplar-based subspace clustering on class-imbalanced data,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 67–83. [Google Scholar]
  • [85].Zhang T, “Solving large scale linear prediction problems using stochastic gradient descent algorithms,” in Proceedings of the 21st International Conference on Machine Learning. ACM, 2004, p. 116. [Google Scholar]
  • [86].Zhou H and Cheng Q, “Sufficient conditions for generating group level sparsity in a robust minimax framework,” in Advances in Neural Information Processing Systems, 2010. [Google Scholar]
  • [87].——, “A scalable projective scaling algorithm for lp loss with convex penalizations,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 2, pp. 265–272, 2015. [DOI] [PubMed] [Google Scholar]
