Abstract
We consider the classification problem when the input features are represented as matrices rather than vectors. To preserve the intrinsic structures for classification, a successful method is the Support Matrix Machine (SMM) in [19], which optimizes an objective function with a hinge loss plus a so-called spectral elastic net penalty. However, the issue of extending SMM to multicategory classification remains open. Moreover, in practice, it is common to see the training data contaminated by outlying observations, which can affect the robustness of existing matrix classification methods. In this paper, we address these issues by introducing a robust angle-based classifier, which handles binary and multicategory problems within a unified framework. Benefiting from the use of truncated hinge loss functions, the proposed classifier achieves certain robustness to outliers. The underlying optimization model becomes nonconvex, but admits a natural DC (difference of two convex functions) representation. We develop a new and efficient algorithm by combining the DC algorithm with primal-dual first-order methods. The proposed DC algorithm adaptively chooses the accuracy of the subproblem at each iteration while guaranteeing the overall convergence of the algorithm. The use of primal-dual methods avoids the inherent complexity of the linear operator in the subproblems and enables us to rely only on the proximal operators of the objective terms and matrix-vector operations. This advantage allows us to solve large-scale problems efficiently. Theoretical and numerical results indicate that for problems with potential outliers, our method is highly competitive among existing methods.
Keywords: Angle-based classifiers, DCA (difference of convex function) algorithm, Fisher consistency, Nonconvex optimization, Robustness, Spectral elastic net
1. Introduction
Many popular classification methods were originally developed for data with a vector of covariates, such as linear discriminant analysis, logistic regression, the support vector machine (SVM), and AdaBoost [12]. Recent advances in technology enable the generation of a wealth of data with complex structures, where the input features are represented by multi-linear geometric objects such as matrices or tensors, rather than by vectors or scalars. Matrix-type datasets are encountered in a wide range of real applications, e.g., face recognition [31] and the analysis of medical images, such as electroencephalogram data [36].
One common strategy to handle matrix data classification is to stack a matrix into a long vector, and then employ an existing vector-based method. This approach has several drawbacks. First, after vectorization, the dimensionality of the resulting vector typically becomes exceedingly high, which in turn leads to the curse of dimensionality, i.e., the "large p, small n" phenomenon. Second, vectorization of matrix-type data can destroy informative structure and correlation of the data matrix, such as neighborhood information and adjacency relations. Third, under the statistical learning framework, the regularization of vector and matrix data should differ due to their intrinsic structures. To exploit the correlation among the columns or rows of the data matrix, several methods were developed, for example, [6], [27], [24], [14]. These methods are essentially built on a low-rank assumption. Another major direction is to extend regularization techniques commonly used in vector-based classification methods to matrix-type data, under certain sparsity assumptions. Regularization with the nuclear norm of the matrix of parameters is popular in a variety of settings; see [7] for matrix completion with a low-rank constraint, and [36] for matrix regression problems based on generalized linear models. In particular, [19] proposed the Support Matrix Machine (SMM), which employs a so-called spectral elastic net penalty for binary classification problems. The spectral elastic net penalty is the combination of the squared Frobenius matrix norm and the nuclear norm, in parallel to the elastic net [37]. They showed that the SMM classifier enjoys the grouping-effect property while keeping a low-rank representation.
Our approach and contribution:
Though the SMM model is simple and effective, two major issues remain. The first is how to extend it to multicategory classification. One may reduce the multicategory problem to a sequence of binary problems, for example, using one-versus-rest or one-versus-one techniques. However, the one-versus-rest method can be inconsistent when there is no dominating class, and the one-versus-one method may suffer from tie votes [17, 18]. The second issue is that existing classifiers may not be robust against outliers, and thus may have unstable performance in practice [30]. To address these two issues, we propose a new multicategory angle-based SMM using truncated hinge loss functions, which not only provides a natural generalization of binary SMM methods, but also achieves certain robustness to outliers. Our proposed classifier can be viewed as a robust matrix counterpart of the robust vector-based classifier in [32]. We show that the proposed classifier enjoys Fisher consistency and other attractive theoretical properties.
Because the truncated hinge loss is nonconvex and the spectral elastic net regularization is nonsmooth, the optimization problem involved in our classifier is highly non-trivial. We first show that this problem admits a global optimal solution by exploiting special structures of the model. Next, we show that the optimization problem has a natural DC (difference of two convex functions) decomposition. Hence, one can apply a DC algorithm (DCA) [2] to solve this problem. However, the convex subproblem is rather complicated, with nonsmooth objective functions and linear operators, and cannot be solved exactly. This prevents us from applying DCA directly to our nonconvex problem. We instead develop a new variant, namely the inexact proximal DCA, to solve this problem. By adding a proximal term, we obtain a strongly convex subproblem. Then, to approximately solve this subproblem, we propose to use the primal-dual first-order methods of [8, 28]. These methods allow us to exploit the special structures of the problem by utilizing the proximal operators of the objective terms and matrix-vector multiplications. One challenge of this approach is to match the number of inner iterations in the primal-dual scheme with the inexactness tolerance of the proximal DCA scheme. By exploiting the problem structure, we show how to estimate this number of inner iterations at each step of the DCA scheme to obtain a unified algorithm for solving the nonconvex optimization problem. We prove that by adaptively controlling the number of iterations in the primal-dual routine, we can still achieve global convergence of our DCA variant to a stationary point. Our method can be implemented efficiently and does not require estimating any parameter at high computational cost. To the best of our knowledge, the only efficient method for solving SMM-type problems in the literature is the ADMM-based scheme of [5].
In order to examine the efficiency of our method, we compare it with an ADMM-based scheme [5]. As shown in Section 5, our method outperforms ADMM in terms of computational time, and our new model has highly competitive performance among existing methods in different aspects.
Paper organization:
The rest of the article is organized as follows. In Section 2, we briefly review some related works, and then introduce our proposed model and methodology. In Section 3, we describe a new inexact proximal DCA algorithm and investigate its convergence. Some statistical learning results, including Fisher consistency, risk and robustness analysis, are presented in Section 4. Numerical studies are given in Section 5 on both synthetic and real data. Section 6 concludes our work with some remarks, and theoretical proofs are delineated in the appendix.
Notation:
For a matrix A ∈ ℝp×q of rank r (r ≤ min(p, q)), A = UAΣAVA⊤ represents the condensed singular value decomposition (SVD) of A, where UA ∈ ℝp×r and VA ∈ ℝq×r satisfy UA⊤UA = Ir and VA⊤VA = Ir, and ΣA = diag{σ1(A), ⋯, σr(A)} with σ1(A) ≥ ⋯ ≥ σr(A) > 0. For each τ > 0, the singular value thresholding operator is defined as follows:
Dτ(A) = UA diag{(σ1(A) − τ)+, ⋯, (σr(A) − τ)+} VA⊤,
where (x)+ = max(x, 0). For A ∈ ℝp×q, ∥A∥F denotes the Frobenius norm of A, ∥A∥* denotes the nuclear norm of A, and ∥A∥2 = σ1(A) stands for the spectral norm of A. The inner product between two matrices A, B ∈ ℝp×q is defined as 〈A, B〉 = tr(A⊤B). It is well-known that the nuclear norm ∥A∥*, as a mapping from ℝp×q to ℝ, is not differentiable, but convex. Alternatively, one considers the subdifferential of ∥A∥*, which is the set of subgradients and is denoted by ∂∥A∥*. For a matrix A, vec(A) denotes its vectorization. We use 〈·, ·〉 to denote the inner product.
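As a concrete illustration of the singular value thresholding operator, here is a minimal NumPy sketch (the helper name `svt` is ours, not from the paper):

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: soft-threshold the singular values of A,
    keeping the singular vectors unchanged."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```

For example, applying `svt` with τ = 2 to diag(3, 1) zeroes out the smaller singular value and shrinks the larger one to 1, producing a rank-one matrix.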
For a proper, closed and convex function φ: ℝn → ℝ ∪ {+∞}, dom(φ) denotes the domain of φ, proxφ(v) = arg minx {φ(x) + (1/2)∥x − v∥2} denotes its proximal operator, and φ* denotes its Fenchel conjugate. We say that φ has a "friendly" proximal operator if proxφ can be computed efficiently by, e.g., a closed form or a polynomial-time algorithm. We say that φ is μφ-strongly convex if φ(·) − (μφ/2)∥·∥2 is convex, where μφ ≥ 0. Given a nonnegative real number x, we denote by ⌊x⌋ the largest integer that is less than or equal to x.
2. Methodology
Assume that the underlying joint distribution of (X, y) is Pr, where X ∈ ℝp×q is the matrix of predictors and y ∈ {1, ⋯, K} is the label. We are given a set of training samples {(Xi, yi)}i=1N of matrix-type data collected independently and identically distributed (i.i.d.) from Pr, where Xi ∈ ℝp×q is the ith input sample and yi is its corresponding class label. Here, we assume that the Xi's are zero-centered; otherwise we can transform Xi ← Xi − X̄, where X̄ = N−1 Σi=1N Xi. We take the structure into consideration and handle all Xi's in matrix form. Based on the given training set, the target of a classification problem is to estimate a classifier ŷ(·) by minimizing the empirical prediction error
where 1{·} is the indicator function. Because the 0-1 loss is discontinuous, in practice we use some surrogate loss function to approximate it. For example, in the case of the SVM, the hinge loss is adopted.
2.1. Review of the Support Matrix Machine
We take the binary problem as a special example, with the class labels encoded as {+1, −1}. The optimization problem of the SMM of [19] can be expressed as
| (1) |
where M1 ∈ ℝp×q and b ∈ ℝ are the coefficient matrix and intercept, H(u) = (1 − u)+ is the hinge loss, τ ≥ 0 controls the balance between the Frobenius norm and the nuclear norm, and λ > 0 is a tuning parameter that balances the loss and regularization terms. The SMM (1) is a soft margin classifier, and it has a close connection to the ordinary SVM [4, 10]. With τ = 0, by vectorization of the coefficient matrix M1, the SMM reduces to the standard form of the SVM.
The penalty term, (1/2)∥M1∥F2 + τ∥M1∥*, can be re-expressed as Σi {(1/2)σi(M1)2 + τσi(M1)}.
Clearly, this term is essentially the elastic net penalty applied to the singular values of the regression matrix M1, and thus it is referred to as the spectral elastic net penalty. Such regularization encourages a low-rank structure in the coefficient matrix. This can be better understood through the dual problem of (1), which is presented as follows:
| (2) |
where C = (Nλ)−1, and the optimum M1 is recovered from the dual solution by singular value thresholding. The derivation of (2) is given in the appendix. Under the low-rank assumption, small singular values are more likely to be noise, and hence the SMM can be more efficient than the SVM by thresholding them with an appropriate choice of τ. Moreover, due to the use of the trace norm, [19] also showed that there is a stronger grouping effect in the estimation of M1 than in the ordinary SVM.
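For reference, the spectral elastic net penalty is straightforward to evaluate numerically; a short sketch (the helper name is ours, assuming NumPy):

```python
import numpy as np

def spectral_elastic_net(M, tau):
    """Spectral elastic net: 0.5*||M||_F^2 + tau*||M||_*, i.e., an elastic
    net penalty applied to the singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return 0.5 * np.sum(s ** 2) + tau * np.sum(s)
```

Setting τ = 0 recovers half the squared Frobenius norm, while a large τ emphasizes the nuclear norm and hence low-rank solutions.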
2.2. Robust Multicategory SMM
To extend a binary classification method to the multicategory case, a common approach is to use K classification functions to represent the K categories, with the prediction rule based on which function has the largest value. Recently, [32] showed that this approach can be inefficient and suboptimal, and proposed an angle-based classification framework that trains only K − 1 classification functions f = (f1, ⋯, fK−1)⊤. Angle-based classifiers can enjoy better prediction performance and faster computation [33, 34, 26]. Hence, we adopt this strategy here. For simplicity, we focus on linear learning.
To be more specific, consider a centered simplex with K vertices W = (w1, ⋯, wK) in ℝK−1, where these vertices are given by wk = (K − 1)−1/2 1 if k = 1, and wk = −(1 + √K)(K − 1)−3/2 1 + (K/(K − 1))1/2 ek−1 if 2 ≤ k ≤ K.
Here, ek is the unit vector of length K − 1 with the kth entry 1 and 0 otherwise, and 1 is the vector of all ones. One can verify that each vector wk has Euclidean norm 1, and the matrix W introduces a symmetric simplex in ℝK−1. Each wk represents the kth class label. Let M be the linear transformation matrix which maps an input X into a (K − 1)-variate vector f(X) = M · vec(X), where M = (vec(M1), ⋯, vec(MK−1))⊤ and Mj ∈ ℝp×q for any j ∈ {1, ⋯, K − 1}. The angle ∠(f(X), wk) measures the confidence that the sample X belongs to class k. Thus the prediction rule is based on which angle is the smallest, i.e.,
It can also be verified that the least-angle prediction rule is equivalent to the largest-inner-product rule, i.e., ŷ(X) = arg maxk∈{1,⋯,K} 〈f(X), wk〉.
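The simplex vertices can be constructed and checked numerically; the sketch below follows the standard construction in the angle-based literature (the function name is ours):

```python
import numpy as np

def simplex_vertices(K):
    """Return W of shape (K-1, K) whose columns are the K simplex vertices:
    unit norm, pairwise inner product -1/(K-1), and summing to zero."""
    W = np.zeros((K - 1, K))
    W[:, 0] = (K - 1) ** (-0.5)                       # w_1: scaled all-ones
    for k in range(1, K):
        W[:, k] = -(1.0 + np.sqrt(K)) / (K - 1) ** 1.5
        W[k - 1, k] += np.sqrt(K / (K - 1.0))         # shift along e_{k-1}
    return W
```

For K = 2 this reduces to the usual {+1, −1} encoding, so the angle-based framework naturally contains the binary case.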
Based on the structure of matrix-type data, our proposed Robust Multicategory Support Matrix Machine (RMSMM) solves
| (3) |
where
with fj(X) = 〈Mj,X〉 for j = 1, ⋯,K − 1;
, where τ ≥ 0 is a balancing parameter;
Here, s ≤ 0 is a parameter that controls the location of truncation, and γ ∈ [0, 1] is a convex combination parameter.
In (3), the loss term can be written as , where
The first term of the above representation is a generalization of the reinforced multicategory loss function in the angle-based framework proposed by [33]. Note that it explicitly encourages 〈f(X), wk〉 to be small for k ≠ y. In parallel to [33], we will show later that this convex combination of hinge loss functions enjoys Fisher consistency when γ ∈ [0, 1/2] and s ≤ 0.
The use of the second term is motivated by [30] to alleviate the effect of potential outliers, resulting in a truncated hinge loss. It can be seen that for any potential outlier (X, y) with a sizable 〈f(X), wy〉, its loss is upper bounded by a constant for any f. Thus, the impact of outliers can be alleviated by the truncation. Note that when s > 0, Ts(u) and Rs(u) are constant within [−s, s]. In this case, the loss for some correctly classified observations would be the same as that of misclassified ones. Hence, it is more desirable to set s ≤ 0. As recommended by [32], the choice s = −(K − 1)−1 works well and will be used in our simulation study.
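To illustrate the truncation idea in isolation, consider a generic truncated hinge (a simplified stand-in for Ts and Rs, following the construction in [30]); note that it is exactly a difference of two convex hinges, which previews the DC decomposition used in Section 3:

```python
def hinge(u, a=1.0):
    """Hinge H_a(u) = max(a - u, 0)."""
    return max(a - u, 0.0)

def truncated_hinge(u, s=-0.5):
    """Truncated hinge H_1(u) - H_s(u): agrees with the usual hinge for
    u >= s, but is capped at the constant 1 - s for u < s, so a single
    outlier contributes at most a bounded loss."""
    return hinge(u, 1.0) - hinge(u, s)
```

For s = −0.5 the loss never exceeds 1.5, no matter how badly an observation is misclassified, which is exactly the boundedness used in the robustness argument above.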
The truncated hinge loss is nonconvex, which makes the optimization problem (3) more involved than that of SMM. We next present an efficient algorithm to implement our RMSMM.
3. Optimization Algorithm
Since the optimization problem (3) admits a DC decomposition, we propose to apply DCA [2] to solve this problem. At each iteration of DCA, one needs to solve a convex subproblem, which does not have a closed-form solution. We instead solve this convex subproblem up to a given accuracy and design an inexact variant of DCA that automatically adapts the accuracy of the subproblem to guarantee the overall convergence of the full algorithm.
3.1. A DC Representation of (3)
Problem (3) is nonconvex, but fortunately, it possesses a natural DC representation. Indeed, since each truncated loss is, by construction, a difference of two hinge functions, we can write
where ⊗ denotes the Kronecker product. Let us define
| (4) |
Then, we can rewrite problem (3) as
| (5) |
Problem (5) has a DC representation as follows:
| (6) |
where
| (7) |
Here, both functions Φ and Ψ are convex but nonsmooth. In addition, Ψ is polyhedral. Note that we can always add a strongly convex function S to both Φ and Ψ to write F = Φ − Ψ as
| (8) |
to obtain a new DC representation. The latter representation shows that both convex functions Φ + S and Ψ + S are strongly convex. This representation also leads to a strongly convex subproblem at each iteration of DCA, as we will see in the sequel. However, the choice of S is crucial and affects the performance of the algorithm. In our implementation, we simply add a convex quadratic function, which leads to a proximal DCA.
Note that dom(Φ) ∩ dom(Ψ) ≠ ∅. Since problem (6) is nonconvex, any point M* satisfying
| (9) |
is called a stationary point of (6). If M* satisfies ∂Φ(M*) ∩ ∂Ψ(M*) ≠ ∅, then we say that M* is a critical point of (6). We show in the following theorem that (6) has a global optimal solution.
Theorem 1
If λ > 0, then problem (6) has at least one global optimal solution M*.
Proof We first write the objective function F of (5) as the sum of a strongly convex quadratic term and a remainder function F1, where F1 combines the truncated hinge losses and the nuclear norm terms in J.
Next, we show that F1 is Lipschitz continuous. Indeed, since the hinge loss is Lipschitz, the truncated losses Ts and Rs are both Lipschitz continuous. In addition, the nuclear norm ∥Mj∥* is Lipschitz for j = 1, ⋯, K − 1. As a consequence, F1 defined above is Lipschitz continuous. That is, there exists LF1 ∈ (0, +∞) such that |F1(M) − F1(M′)| ≤ LF1∥M − M′∥F for all M, M′.
Using a fixed point , we can bound F as
Hence, F is coercive, i.e., F(M) → +∞ as ∥M∥F → +∞. Consequently, its sublevel sets are closed and bounded. By the well-known Weierstrass theorem, (6) has at least one global optimal solution M*. ◻
3.2. Inexact Proximal DCA Scheme
Let us start with the standard DCA scheme [2] and propose an inexact proximal DCA scheme to solve (6). The proximal DCA is equivalent to DCA applied to the DC decomposition (8) mentioned above, but often uses an adaptive strongly convex term S.
3.2.1. The Standard DCA Scheme and Its Proximal Variant
The DCA method for solving (6) is very simple. At each iteration t ≥ 0, given Mt, we compute a subgradient ∇Ψ(Mt) ∈ ∂Ψ(Mt) and form the subproblem:
| (10) |
to compute the next iterate Mt+1 as an exact solution of (10). The subproblem (10) is convex. However, it is fully nonsmooth and does not have a closed-form solution.
In the proximal DC variant, we instead apply DCA to the DC decomposition (8) with S(M) = (ρt/2)∥M − Mt∥F2, which leads to the following scheme:
| (11) |
where ρt > 0 is a given proximal parameter. Clearly, Mt+1 is well-defined and unique.
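To make the proximal DCA scheme (11) concrete, here is a toy one-dimensional instance (entirely illustrative, not the paper's problem): minimize x² − |x|, with Φ(x) = x² and Ψ(x) = |x|. Each step linearizes Ψ at the current iterate and solves the strongly convex subproblem, which here has a closed form:

```python
def proximal_dca(x0, rho=1.0, iters=100):
    """Proximal DCA on  min_x  x**2 - |x|  (Phi(x)=x**2, Psi(x)=|x|).
    Step: x_{t+1} = argmin_x  x**2 + (rho/2)*(x - x_t)**2 - g_t*x,
    where g_t is a subgradient of |x| at x_t."""
    x = x0
    for _ in range(iters):
        g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # subgradient of |x|
        x = (g + rho * x) / (2.0 + rho)  # closed-form subproblem solution
    return x
```

Starting from x0 = 1, the iterates converge to the stationary point x* = 1/2, where 2x* = sign(x*); by symmetry, starting from x0 = −1 gives −1/2.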
3.2.2. Inexact Proximal DCA Scheme
Clearly, the subproblem in the proximal DCA scheme (11) does not have a closed-form solution. We can only obtain an approximate solution of this problem, which certainly affects the convergence of (11). We therefore propose an inexact variant of (11) by approximately solving
| (12) |
where :≈ stands for the approximation between the approximate solution Mt+1 and the true solution of the subproblem (12), and is characterized via the objective residual as
| (13) |
We note that this condition is implementable if we apply first-order methods from convex optimization to approximately solve (12).
Clearly, by strong convexity, we have
This leads to a bound on ∥Mt+1 − M̄t+1∥F, which quantifies the difference between the approximate solution Mt+1 and the true one M̄t+1.
Under the inexact criterion (13), we can still prove the following descent property of the inexact proximal DCA scheme (12).
Lemma 1
Let Ψ be μΨ-strongly convex with μΨ ≥ 0. Let {Mt} be the sequence generated by the inexact DCA scheme (12) under the inexact criterion (13). Then
| (14) |
Proof Using the optimality condition of (12), we have
From the μΦ- and μΨ-strong convexity of Φ and Ψ, respectively, we have
Summing up the last two inequalities and using the above optimality condition, we obtain
Here, F (M) = Φ(M) − Ψ(M). Next, using (13), we have
Summing up the last two inequalities and using F = Φ − Ψ again, we obtain
This implies (14) by neglecting the remaining nonnegative term. ◻
3.3. Solution of The Convex Subproblem
By rescaling the objective function by a constant factor, we can rewrite the strongly convex subproblem (12) at iteration t of the inexact proximal DCA scheme as follows:
| (15) |
where
and
Here, the problem involves a linear operator concatenating all the vectors ai and bik, and the subgradient ∇Ψ(Mt) enters Pt; Pt is a nonsmooth convex function but has a "friendly" proximal operator that can be computed in linear time (see Subsection 3.5 for more details). Due to the strong convexity of J, (15) is strongly convex even for ρt = 0. However, one can adaptively choose ρt ≥ 0 to obtain a "good" strong convexity parameter. Without the regularization term J, (15) would be strongly convex only when ρt > 0. Since μΨ = 0 in (6), to strictly obtain the descent property in Lemma 1, we require ρt > 0. The following lemma will be used in the sequel; its proof is given in the appendix.
Lemma 2
The objective function Pt(·) of (15) is Lipschitz continuous, i.e., there exists L0 ∈ (0, +∞), independent of t, such that |Pt(u) − Pt(v)| ≤ L0∥u − v∥ for all u, v. Consequently, the domain of the conjugate of Pt is bounded uniformly in t, i.e., its diameter is finite and independent of t.
Denote by
| (16) |
the sublevel set of (5). As we proved in Theorem 1, this sublevel set is closed and bounded for any level value. We define
| (17) |
the diameter of this sublevel set, which is finite.
3.3.1. Primal-dual Schemes for Solving (15)
Problem (15) can be written as a minimax saddle-point problem using the Fenchel conjugate of Pt. It is natural to apply primal-dual first-order methods to solve this problem. We propose in this subsection two different primal-dual schemes to solve (15).
Our first algorithm is the well-known Chambolle-Pock primal-dual method proposed in [8]. This method is described as follows. Starting from an initial primal point and an initial dual variable Y0 = 0, we set the stepsizes as specified below, and at each inner iteration l ≥ 0, we perform
| (18) |
Here, we use the index t for the DCA scheme as the outer iteration counter, and the index l for the inner iteration counter. The initial stepsizes are set using the operator norm of the linear operator in (15) and the constant c = 0.999; the scheme uses the adjoint of this linear operator (i.e., its transpose when the operator is a matrix), the proximal operator of the Fenchel conjugate of Pt, and the proximal operator of ω · Qt. Alternatively, we can also apply [28, Algorithm 2] to solve (15). Originally, [28, Algorithm 2] works directly in the primal space, and has a convergence guarantee on the primal sequence that is independent of the dual variable, as we can see in Lemma 3 below. Let us describe this scheme here for solving (15). Starting from given initial points, at each inner iteration l ≥ 0, we update
| (19) |
Here, the initial values are given.
Note that the two schemes (18) and (19) look quite similar at first glance, but they are fundamentally different. First, the dual step in (19) fixes its dual center for all iterations l, while the dual update in (18) is recursive. Second, (18) has an extra averaging step in the last line, while (19) has a linear coupling step in the last line, which works similarly to the accelerated gradient method of Nesterov [23]. Finally, the ways of updating the parameters in the two schemes are quite different.
In terms of complexity, (18) and (19) have essentially the same per-iteration cost: one evaluation of the proximal operator of the Fenchel conjugate of Pt, one evaluation of the proximal operator of Qt, one matrix-vector multiplication with the linear operator, and one with its adjoint.
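As a self-contained illustration of the primal-dual template min_x Q(x) + P(Ax) that (18) instantiates, consider the toy problem min_x ½∥x − b∥² + ∥Ax∥₁ (our own example, not the paper's subproblem (15)); here the dual step is a projection onto the ℓ∞ unit ball, which is the proximal operator of the conjugate of the ℓ₁ norm:

```python
import numpy as np

def chambolle_pock(A, b, iters=500):
    """Chambolle-Pock iterations for  min_x 0.5*||x - b||^2 + ||A x||_1,
    i.e.  min_x max_{||y||_inf <= 1}  0.5*||x - b||^2 + <A x, y>."""
    m, n = A.shape
    L = np.linalg.norm(A, 2)          # operator norm ||A||
    tau = sigma = 0.9 / L             # step sizes with tau*sigma*||A||^2 < 1
    x = np.zeros(n); x_bar = x.copy(); y = np.zeros(m)
    for _ in range(iters):
        y = np.clip(y + sigma * (A @ x_bar), -1.0, 1.0)        # prox of sigma*P*
        x_new = (x - tau * (A.T @ y) + tau * b) / (1.0 + tau)  # prox of tau*Q
        x_bar = 2.0 * x_new - x                                # extrapolation
        x = x_new
    return x
```

With A = I and b = (2, 0.5), the solution is the soft-thresholding of b at level 1, namely (1, 0), which the iterations recover quickly.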
The following lemma provides us with conditions to design a stopping criterion for the inner loop (i.e., the l-iteration loop); its proof is given in the appendix.
Lemma 3
Let M̄t+1 be the unique solution of (15) at outer iteration t. Then, the sequence generated by (18) satisfies
| (20) |
where is the corresponding exact dual solution of (15). Alternatively, the sequence generated by (19) satisfies
| (21) |
where L0 is given in Lemma 2.
One advantage of (19) over (18) is that the right-hand side of the bound (21) does not depend on the dual variables, unlike (20).
3.3.2. The Upper Bound of the Inner Iterations
Our next step is to specify the maximum number of inner iterations lmax(t) to guarantee the condition (13) at each outer iteration t.
First, from both schemes (18) and (19), one can see that the dual iterates remain in the domain of the Fenchel conjugate of Pt. Hence, by Lemma 2, we can bound the dual distance terms. On the other hand, by Theorem 1, the sublevel set defined by (16) is bounded, so we can also bound the primal distance terms by the diameter given in (17). Using these upper bounds in (20), we can show that
Fixing a suitable constant, in order to guarantee (13), it suffices to choose the number of inner iterations l to be at most
| (22) |
Here, α > 1 is a given constant specified by the user. With such a choice of δt, the bound lmax(t) is independent of the unknown constants.
If we apply (19) to solve (15), then we have the bound (21). To achieve the accuracy required by (13), we bound the right-hand side of (21) by the prescribed tolerance, which yields the required number of inner iterations. Hence, we can choose
| (23) |
to terminate the primal-dual scheme (19). With such a choice of δt, lmax(t) can be evaluated exactly and is likewise independent of the unknown constants.
Remark 1
By the choice of δt as in (22) or (23), the maximum number of inner iterations lmax(t) is independent of the two unknown constants. These constants only show up when we prove the convergence of Algorithm 1 in Theorem 2; they do not need to be evaluated in Algorithm 1 below. Hence, in the implementation of Algorithm 1, we simply use (22) for (18), or (23) for (19), to specify the maximum number of inner iterations, where α > 1 is a given number, e.g., α = 1.1.
Algorithm 1.
(Inexact Proximal DC Algorithm with primal-dual iterations)
| 1: Initialization: |
| 2: Input an accuracy ε > 0, and choose an initial point M0. |
| 3: Choose two parameters and an initial value σ0. |
| 4: For t = 0 to T, perform |
| 5: Evaluate a subgradient ∇Ψ(Mt) ∈ ∂Ψ(Mt) and choose ρt > 0. |
| 6: Initialization of inner loop: Initialize the inner-loop variables and compute lmax(t). |
| 7: Inner loop: For l = 0, 1,···, lmax(t), perform either (18) or (19). |
| 8: Terminate the inner loop: If l ≥ lmax(t), then set Mt+1 to the last inner iterate and exit the inner loop. |
| 9: Stopping criterion: If ∥Mt+1 − Mt∥F ≤ ε, then terminate and return Mt+1. |
| 10: End for |
3.4. The Overall Algorithm and Its Convergence Guarantee
We now combine the inexact proximal DCA scheme (12), and the primal-dual scheme (18) (or (19)) to complete the full algorithm for solving (5) as in Algorithm 1.
In the sequel, we will explicitly specify the evaluation of a subgradient ∇Ψ(Mt) of Ψ, the choice of ρt, and the evaluation of the proximal operators. The maximum number of outer iterations T does not need to be specified precisely; we use T only as a safeguard to prevent the algorithm from running into an infinite loop. In practice, we can set T to a relatively large value, e.g., T = 103; the stopping criterion at Step 9 will typically terminate Algorithm 1 earlier. For large-scale problems, we can evaluate the operator norm of the linear operator by a power method.
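The operator norm needed for the stepsizes can be estimated with a standard power method; a minimal sketch for a dense matrix operator (the helper name is ours):

```python
import numpy as np

def operator_norm(A, iters=100, seed=0):
    """Estimate ||A||_2 (the largest singular value) by power iteration
    on A^T A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A.T @ (A @ v)          # one step of power iteration on A^T A
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(A @ v))
```

Each iteration costs one multiplication with A and one with its adjoint, matching the per-iteration cost of the primal-dual schemes themselves.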
We state the overall convergence of Algorithm 1 in the following theorem.
Theorem 2 (Overall convergence)
Let {Mt} be the sequence generated by Algorithm 1 using (18) (respectively, (19)) for approximately solving (12) with up to lmax(t) inner iterations as in (22) (respectively, (23)). Then, we have
Moreover, the sequence {Mt} is bounded. Any cluster point M* of {Mt} is a stationary point of (5). Consequently, the whole sequence {Mt} converges to a stationary point of (5).
Proof Suppose we apply (19) to solve the subproblem (12); with the choice of δt as in (23), we can derive from Lemma 1 that
By Theorem 1, we have F(MT+1) ≥ F(M*) > −∞, where F(M*) is the global optimal value of (5). Hence, summing up and using this lower bound, we obtain
Here, the error terms are summable due to the choice of δt. This is exactly the first estimate in Theorem 2. The second limit in Theorem 2 is a direct consequence of the first one.
By Theorem 1 again, the sublevel set defined by (16) is bounded, and since F(Mt+1) ≤ F(Mt) by Lemma 1, the sequence {Mt} stays in this sublevel set and is therefore bounded. For any cluster point M* of {Mt}, there exists a subsequence that converges to M*. Now, we prove that M* is a stationary point of (5). Using the optimality condition of (12), we have
| (24) |
Note that δt → 0 as t → ∞ due to the choice of δt. Here, we can pass this limit to a subsequence if necessary. Using this limit and the fact that limt→∞∥Mt+1 − Mt∥F = 0, we can pass to the limit in the optimality condition above. Using the definition of Φ and Ψ, we can see that the subgradient ∇Ψ(Mt) of Ψ is uniformly bounded and independent of t. By taking a further subsequence if necessary, the corresponding subgradients of Φ and Ψ converge to some ∇Φ(M*) and ∇Ψ(M*), respectively. By [25, Theorem 24.4], we have ∇Φ(M*) ∈ ∂Φ(M*) and ∇Ψ(M*) ∈ ∂Ψ(M*). Using this fact and the boundedness of ρt, we can show that 0 ∈ ∂Φ(M*) − ∂Ψ(M*). Hence, M* is a stationary point of (5). By the boundedness of {Mt} and limt→∞∥Mt+1 − Mt∥F = 0, one can then use routine techniques to show that the whole sequence {Mt} converges to M*. ◻
While the convergence result given in Theorem 2 is rather standard and similar to those in [2], its analysis for the inexact proximal DCA seems to be new to the best of our knowledge. Note that the convex subproblem in DCA-type methods is often general and may not have closed-form solutions. It is natural to incorporate inexactness in an adaptive manner to guarantee the convergence of the overall algorithm.
3.5. Implementation Details and Comparison with ADMM
In Algorithm 1, we need to compute the proximal operators of the Fenchel conjugate of Pt and of Qt. In addition, in order to compare our method with other optimization methods, we describe the well-known ADMM for solving (12) as our comparison candidate.
3.5.1. Evaluation of Subgradient ∇Ψ(Mt) and The Choice of ρt
Using the definition of Ψ from (7), we have
Here, sign(·) is the common sign function.
To choose ρt, we first choose a range within (0, +∞) and let {ρt} be any sequence in this range. We can also fix ρt for all t, e.g., ρt = 10−3. From our experience, if ρt is small, the strong convexity parameter of (15), which is 1 + ρt, is also small; hence, the number of inner iterations lmax(t) is large, although the number of outer iterations t may be small. In the opposite case, if ρt is large, then we need a small lmax(t). Nevertheless, due to the short step Mt+1 − Mt, the number of outer iterations may increase. Therefore, trading off the value of ρt is crucial and affects the performance of Algorithm 1.
3.5.2. Evaluation of Proximal Operators
To compute the proximal operator of in (18), we can use Moreau’s identity [3]:
where the operator on the right-hand side is the well-known soft-thresholding operator.
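For a concrete instance of Moreau's identity (our own example, with P taken to be the ℓ₁ norm, whose conjugate is the indicator of the ℓ∞ unit ball):

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam*||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_conj_l1(v, sigma):
    """prox of sigma*P* for P = ||.||_1, via Moreau's identity:
       prox_{sigma P*}(v) = v - sigma * prox_{P/sigma}(v/sigma)."""
    return v - sigma * soft_threshold(v / sigma, 1.0 / sigma)
```

Since P* here is the indicator of {∥y∥∞ ≤ 1}, the result must coincide with entrywise clipping to [−1, 1], for any σ > 0.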
To compute the proximal operator of Qt, we note that (here, τj = τ)
Hence, we have
where
This operator can be computed in closed form using the SVD of its argument, applying the soft-thresholding operator defined above to the singular values with threshold r = ωτj/[1 + ω(1 + ρt)].
3.5.3. ADMM Method for Solving (15)
In Algorithm 1, we can apply ADMM to solve the subproblem (15) instead of primal-dual methods. We split the nuclear norm in Qt of (15) by introducing an auxiliary variable S and rewrite (15) as
| (25) |
We define the corresponding augmented Lagrangian function of (25) as
where β > 0 is a penalty parameter. Starting from given initial points, our ADMM scheme for solving (25) updates at inner iteration l according to the following steps:
| (26) |
In this scheme, the auxiliary sequence can be computed in closed form using the SVD, as in Subsection 3.5.2. The other primal update requires solving a general convex problem. However, this problem has a special structure, so that its dual formulation becomes a box-constrained convex quadratic program, which is very similar to (2). Hence, we solve this problem by coordinate descent methods; see, e.g., [29]. In summary, if we apply ADMM to solve (15), then our inexact proximal DCA has three loops: the DCA outer iterations, the ADMM inner iterations, and the coordinate descent iterations for the latter update.
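To illustrate the ADMM splitting pattern of (25)-(26) in the simplest possible setting, here is a scalar toy problem (ours, not the paper's subproblem), splitting the smooth term from the nonsmooth one exactly as (25) splits off the nuclear norm:

```python
import numpy as np

def admm_scalar(c, lam, beta=1.0, iters=200):
    """ADMM (scaled dual form) for  min_x 0.5*(x - c)**2 + lam*|x|,
    split as  min 0.5*(x - c)**2 + lam*|z|  subject to  x = z."""
    x = z = u = 0.0
    for _ in range(iters):
        x = (c + beta * (z - u)) / (1.0 + beta)                 # smooth block
        z = np.sign(x + u) * max(abs(x + u) - lam / beta, 0.0)  # prox block
        u += x - z                                              # dual update
    return z
```

The limit is the soft-thresholding of c at level lam; e.g., c = 2, lam = 1 gives 1. In the matrix problem (25), the prox block becomes the singular value thresholding step of Subsection 3.5.2.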
Remark 2 (Convergence of the ADMM scheme (26))
Note that (15) is strongly convex, and both subproblems of (26) are strongly convex and therefore uniquely solvable. Consequently, this scheme converges, as proved, e.g., in [5, Appendix A]. Together with asymptotic convergence guarantees, the convergence rates of ADMM, of which (26) is a special case, have been studied in, e.g., [11, 13, 21]. We omit the details here.
4. Statistical Properties
In this section, we explore some statistical properties of our proposed classifier RMSMM (3). In the first part, we establish the Fisher consistency of the RMSMM and study a finite-sample bound on the misclassification rate. In the second part, we analyze the robustness of the RMSMM via breakdown point theory.
4.1. Classification Consistency
Fisher consistency is a fundamental property of classification methods. For observed matrix-type data with fixed X, denote by the conditional probability of class k ∈ {1, 2, ⋯, K}. One can verify that the best prediction rule, namely the Bayes rule, which minimizes the misclassification error rate, is ŷBayes(X) = arg maxk Pk(X).
For a classifier, denote by ϕ(f(X), y) its surrogate loss function for classification using f as the classification function, and by ŷf the corresponding prediction rule. Define the conditional loss L(X) = E[ϕ(f(X), y) | X], where the expectation is taken with respect to the marginal distribution of . We denote the theoretical minimizer of the conditional loss by f*(X) = arg minf L(X). When ŷf*(X) = ŷBayes(X), we say the classifier is Fisher consistent. Let us denote by the loss function in (3). Then, we have the following result.
Theorem 3
The classifier with the loss is Fisher consistent when and s ≤ 0.
This result can be viewed as a generalization of Theorem 1 in [34], which is devised for vector-type observations. By this theorem, our classifier RMSMM can achieve the best classification accuracy given a sufficiently large matrix-type training dataset and a rich family . The following theorem provides an upper bound on the prediction error using the training dataset.
The proofs of Theorems 3 and 4 can be found in the appendix.
Theorem 4
Suppose that the conditional distribution of X given is the same as the distribution of Ck + E, where is a constant matrix and the entries of E are i.i.d. random variables with mean zero and finite fourth moment. Let denote the solution of (5). Then, with probability at least 1 − δ, the misclassification rate of the classifier ŷ corresponding to can be bounded as
| (27) |
where , and c is a constant specified in the proof.
Theorem 4 measures the gap between the expected error and the empirical error, which allows us to better understand the utility of the nuclear norm. For each category, the decision matrix contains p × q parameters; therefore, if we only imposed Frobenius constraints [34], we would expect at best rates of the order . By taking the low-rank structure of the decision matrices into account, we use the nuclear norm penalty to control the singular values of the decision matrices. For the i-th singular triplet of the k-th decision matrix, there are p + q + 1 free parameters in total [22]: one for the singular value σki and the others for the orthogonal singular vectors of dimensions p and q. Its contribution to the gap is . Hence, under the low-rank structure of the decision matrices, the nuclear-norm-penalized estimator achieves a substantially faster rate.
The rate in Theorem 4 can be further improved if we additionally impose a low-rank constraint on the noise term of Xi. For example, consider E = UΛV⊤, where is low-rank noise with all entries i.i.d. with mean zero and finite fourth moment, and U and V are orthogonal projection matrices independent of Λ. One can verify that the term in the rate above can then be replaced by . Finally, as a side remark, consider the special case q = 1, i.e., the features are vectors rather than matrices. In this situation, the nuclear norm reduces to the Euclidean norm, and the last term of the upper bound in (27) becomes , which matches existing results; see, for example, [34].
4.2. Breakdown Point Analysis
Robustness theory has been developed since the 1960s to evaluate the stability of statistical procedures [15]. Breakdown point theory focuses on the smallest fraction of contaminated data that can cause an estimator to diverge arbitrarily from the original model. Here we consider the breakdown point analysis for multicategory classification models.
Let be the original n observations, let be the contaminated sample in which m observations are contaminated, and let be the parameters estimated from the contaminated sample. We extend the sample angular breakdown point of [35] to the multicategory classification problem as
where is the estimated decision matrix from the original sample. Since angle-based classifiers make decisions by comparing the angles between the (K − 1)-dimensional classification function f and the K vertices of the simplex , it is reasonable to quantify the divergence between classifiers via the angles between the decision vectors and their original counterparts, . When there exists a category k such that the angle between the two decision vectors is larger than π/2, the two classifiers behave totally differently on this category. Consequently, the classifier trained on contaminated samples would “break down”.
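The simplex-vertex construction and the angular breakdown criterion can be illustrated as follows. The explicit vertex formula is the standard angle-based construction (unit-norm vertices with equal pairwise angles that sum to zero); since the paper's own display is not reproduced here, it should be read as an assumption for illustration.

```python
import numpy as np

def simplex_vertices(K):
    """Vertices w_1..w_K of a regular simplex in R^(K-1), one per class.

    Each vertex has unit norm, the vertices sum to zero, and every pair
    has inner product -1/(K-1), i.e., equal pairwise angles.
    """
    W = np.zeros((K, K - 1))
    W[0] = (K - 1) ** -0.5 * np.ones(K - 1)
    for k in range(1, K):
        W[k] = -(1 + np.sqrt(K)) / (K - 1) ** 1.5 * np.ones(K - 1)
        W[k, k - 1] += np.sqrt(K / (K - 1))
    return W

def broken_down(f_orig, f_cont):
    """Angular breakdown check: True if, for some category (row), the
    contaminated decision vector makes an angle > pi/2 with the original
    one, i.e., their inner product is negative."""
    return bool(np.any(np.sum(f_orig * f_cont, axis=1) < 0))
```

Here `f_orig` and `f_cont` hold the category-wise decision vectors (one per row) for the clean and contaminated fits.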
The following theorem compares the sample breakdown points of the proposed RMSMM and the multicategory SMM (MSMM), which generalizes the SMM of [19] via angle-based methods, i.e., γ = 1/2 and s = −∞ in Eq. (3).
Theorem 5
Assume that . Then the breakdown point of MSMM is 1/n, while the breakdown point of RMSMM is not smaller than , where
By this theorem, a single contaminated observation is enough to make the MSMM classifier break down. In other words, this estimator may not work well in the presence of even a few outliers. In contrast, the breakdown point of the proposed RMSMM, thanks to the truncated hinge loss, has a fixed lower bound. Thus, the RMSMM is substantially more outlier-resistant than its counterpart without truncation. This robustness property is carefully examined via numerical comparisons in the next section.
5. Numerical Experiments
In this section, we investigate the performance of our proposed robust angle-based SMM on simulated and real datasets. Our configuration of the algorithm is as follows. For the primal-dual method described in Algorithm 1, we use M0 = 0 and ρt = 0.01 for every t. We use the stopping criterion ∥Mt+1 − Mt∥F ≤ 10−4 max {1, ∥Mt∥F}. All simulation results are based on 100 replications.
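The stopping criterion above amounts to a relative-change test on the iterates; a minimal sketch:

```python
import numpy as np

def converged(M_next, M, tol=1e-4):
    """Relative-change stopping rule: ||M_next - M||_F <= tol * max(1, ||M||_F)."""
    return np.linalg.norm(M_next - M) <= tol * max(1.0, np.linalg.norm(M))
```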
5.1. Simulation Results
We generate simulated datasets under the following two scenarios. In the first scenario, the input matrices have dimensions 50 × 50. For the kth category, to make the matrices low-rank, we randomly generate two 50 × 5 matrices, Uk and Vk, whose columns are orthonormal. More precisely, we first generate two 50 × 5 matrices with all entries i.i.d. from the standard normal distribution and obtain Uk and Vk by the Gram-Schmidt process. The center of each class is then specified by . The observations in each class are generated by Ck + E, k = 1,⋯, K, where E is a 50 × 50 normal random matrix with all entries i.i.d. from . The contaminated observations are generated by 3C1 + E for .
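A sketch of this generating mechanism (Scenario I) might look as follows. The class-center form Ck = UkVk⊤ and the noise scale are illustrative assumptions, since the exact displays are not reproduced above, and a QR factorization replaces the explicit Gram-Schmidt process (they produce the same orthonormalization).

```python
import numpy as np

def scenario1_class(K, sigma, n_per_class, p=50, r=5, rng=None):
    """Illustrative Scenario (I): low-rank class centers plus Gaussian noise.

    Each center C_k = U_k V_k^T with U_k, V_k column-orthonormal, so that
    rank(C_k) = r; observations are C_k + sigma * (i.i.d. N(0,1) matrix).
    """
    rng = np.random.default_rng(rng)
    X, y = [], []
    for k in range(K):
        Uk, _ = np.linalg.qr(rng.standard_normal((p, r)))  # orthonormal cols
        Vk, _ = np.linalg.qr(rng.standard_normal((p, r)))
        Ck = Uk @ Vk.T                                     # rank-r center
        for _ in range(n_per_class):
            X.append(Ck + sigma * rng.standard_normal((p, p)))
            y.append(k)
    return np.array(X), np.array(y)
```

Contaminated observations would then replace Ck by 3C1, as described above.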
In the second scenario, the dimensions of the input matrices are fixed at 80 × 100. We follow the settings in [36] to generate the true array signals by , where each entry of Ck is 0 or 1 and , p1 = 80 and p2 = 100. To control the rank and the percentage of nonzero entries, we set r = 10 and generate Ck,i so that each row contains exactly one entry equal to one and zeros elsewhere, with all positions equally likely. All entries of the noise matrix E are i.i.d. from σ · t(3), where t(3) denotes Student’s t-distribution with three degrees of freedom. The outliers are generated in the same way as in the first scenario.
We use 103 observations for training, 104 for tuning, and 104 for testing. The contamination ratio in the training sample, ρ, is chosen as 0%, 10%, and 20%. For training the truncated model, we use the solution of the ordinary SMM as the initial point. Following the suggestion of [33], we choose γ = 1/2 as it provides stable classification performance. The truncation parameter s is fixed at −1/(K − 1). The other hyper-parameters, C and τ, are selected via a grid search on the tuning set.
We first consider the binary classification problem, i.e., K = 2. We compare our RMSMM with the SMM in [19]. We also include a naive benchmark: the standard SVM applied to the stacked-up vectors. Fig. 1 presents the classification error rates of RMSMM, SMM, and SVM on the simulated data under Scenario (I) with K = 2. Three noise magnitudes are considered: σ = 0.5, 0.7, and 0.9. Both support-matrix-based methods, RMSMM and SMM, perform much better than the SVM. RMSMM generally outperforms SMM when outliers exist, and its advantage becomes more pronounced for larger ρ. All methods are affected by the value of σ, but the comparison conclusions hold across the various σ.
Fig. 1.
Classification error rates for RMSMM, SMM, and SVM on the simulated data with Scenario (I) and K = 2. Here, ρ stands for the percentage of data that are contaminated. SMM: [19]’s support matrix machine; SVM: the standard SVM applied to the stacked-up vectors.
Next we consider the multicategory case. Fig. 2 depicts the boxplots of the classification error rates of RMSMM and its competitors under Scenario (I) with K = 3 and 5. Three benchmarks are considered: the multicategory SMM using angle-based methods, MSMM; the angle-based multicategory SVM classifier [32]; and its robust version, the RMSVM classifier [34]. In the case of ρ = 0, the RMSMM and its non-robust counterpart MSMM perform almost identically, which demonstrates that the truncation parameter s can adapt to the data structure, keeping the efficiency loss of RMSMM relative to MSMM minimal when there are no outliers. When ρ = 0.1 or ρ = 0.2, the advantage of RMSMM is clear: the means and standard deviations of its classification error rates are generally smaller. From this figure, we can also see that the benefit of the nuclear norm is prominent: the two SMM-based classifiers perform much better than the two SVM-based ones. Similar conclusions can be drawn from Fig. 3, which reports the classification error rates of RMSMM and the other three methods under Scenario (II) with σ = 3, 4, and 5.
Fig. 2.
Classification error rates for RMSMM, SMM, and SVM on the simulated data with Scenario (I). The top three panels: the case with K = 3; the bottom three panels: the case with K = 5. MSMM: multicategory generalization of SMM using angle-based methods; MSVM: the angle-based multicategory SVM [32]; RMSVM: the robust angle-based multicategory SVM [34].
Fig. 3.
Classification error rates for RMSMM, SMM, and SVM on the simulated data with Scenario (II). The top three panels: the case of K = 3; the bottom three panels: the case of K = 5.
Finally, we present some comparison results of the ADMM and primal-dual algorithms for solving the RMSMM optimization problem (5). Fig. 4 reports the classification error rates and the corresponding computational time (in seconds) of the RMSMM using the two different primal-dual algorithms, (18) and (19), under Scenario (I) with σ = 0.7 and Scenario (II) with σ = 4 when K = 3. The bottom two panels record the total run time, including the selection of tuning parameters. The tuning parameters λ and τ in the RMSMM are selected via a grid search. More specifically, λ ∈ [0.1, 104], and for each choice of λ, τ is tuned so that the decision matrix ranges from full-rank to rank one. The two algorithms perform very similarly in terms of classification rates, but the proposed primal-dual algorithm is significantly faster, and its advantage becomes more remarkable as ρ increases. This is further confirmed by Fig. 5, which depicts the decay of the RMSMM objective function values versus computational time until the two algorithms reach the desired accuracy. We consider the case under Scenario (II) with K = 3 and σ = 4 for a given combination of tuning parameters. In particular, we fix a combination of (λ, τ) and record the objective function values at each iteration. Clearly, the primal-dual algorithm is generally more stable and converges much faster than ADMM.
Fig. 4.
Comparison between the ADMM and primal-dual algorithms: Primal-Dual stands for (18), and Proximal-Alter stands for (19) for solving the RMSMM optimization problem (5). The top two panels: classification error rates under Scenario (I) with σ = 0.7 and Scenario (II) with σ = 4 when K = 3; The bottom two panels: the corresponding computational time (in seconds).
Fig. 5.
The decrease of the RMSMM objective values with respect to the computational time under Scenario (II) with K = 3 and σ = 4.
5.2. A Real-data Example
We apply the RMSMM model (5) to the Daily and Sports Activities Dataset [1], available from the UCI Machine Learning Repository. The dataset comprises motion sensor data of 19 daily sport activities, each performed by 8 subjects (4 females, 4 males, between the ages of 20 and 30) in their own style for 5 minutes. The dataset was collected by several sensors. The input matrices are of dimension 125 × 45, where each column contains 125 samples acquired by a sensor over a period of 5 seconds at a 25 Hz sampling frequency, and each row contains the data acquired from all 45 sensor axes at a particular sampling instant.
To demonstrate the performance of the proposed RMSMM model, we select only the first 10 categories of the dataset for simplicity. The total number of instances is thus N = 10 × 8 × 60 = 4,800. This is a 10-category, balanced classification problem with 480 instances per category. We randomly and equally divide the data into three parts, for training, tuning, and testing, so each part has sample size 1,600.
We choose s = −K + 1 and select the other parameters by a grid search. We report the classification error rates of RMSMM, MSMM, RMSVM, and MSVM in Fig. 6-(left). The two matrix-based methods achieve lower classification error rates than the two vector-based classifiers, owing to the benefit of the nuclear norm. This improvement is made clearer by Fig. 7, which presents heatmaps of the decision matrices of RMSMM and RMSVM; the former has a sparser structure than the latter.
Fig. 6.
Classification error rates for RMSMM, MSMM, RMSVM, and MSVM on the Daily and Sports Activities Dataset. The left and right panels present the results when the data are clean or contaminated, respectively.
Fig. 7.
Heatmaps of the first decision matrices of RMSMM (left panel) and RMSVM (right panel)
To demonstrate the effect of potential outliers on classification accuracy, we artificially contaminate the dataset by randomly relabeling 10% of the training set into other classes. From Fig. 6-(right), we observe that the performance of every method deteriorates under this manipulation, while the RMSMM performs best. The two robust classifiers, RMSMM and RMSVM, are less affected by the outliers than the two non-robust methods. All the numerical examples above suggest that the RMSMM is a practical and robust classifier for multicategory classification problems when the input features are represented as matrices.
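The contamination mechanism used here, randomly relabeling a fraction of the training labels into a different class, can be sketched as follows (the helper name is ours):

```python
import numpy as np

def relabel_fraction(y, frac, K, rng=None):
    """Outlier injection: flip a fraction `frac` of the labels in y
    (integer labels 0..K-1) to a different class chosen uniformly."""
    rng = np.random.default_rng(rng)
    y = y.copy()
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    for i in idx:
        choices = [k for k in range(K) if k != y[i]]  # exclude the true class
        y[i] = rng.choice(choices)
    return y
```

Because the new class is always different from the original one, exactly ⌊frac · n⌋ labels are corrupted.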
6. Concluding Remarks
In this paper, we consider how to devise a robust multicategory classifier when the input features are represented as matrices. Our method is constructed in the angle-based classification framework by embedding a truncated hinge loss function into the support matrix machine. Although the corresponding optimization problem is nonconvex, it admits a natural DC (difference of two convex functions) representation, so it is natural to apply the DC algorithm (DCA). Unfortunately, the convex subproblem in DCA is rather complex and does not have a closed-form solution. We therefore develop an inexact proximal DCA variant for the underlying optimization problem, in which the convex subproblem is solved approximately by primal-dual first-order methods. Combining inexact proximal DCA with primal-dual methods yields a new proximal DCA scheme. We prove that our optimization model admits a global optimal solution and that the sequence generated by our DCA variant globally converges to a stationary point.
From a statistical learning perspective, we prove Fisher consistency and prediction error bounds. Numerical results demonstrate that our new classifiers are quite efficient and much more robust than existing methods in the presence of outlying observations. We conclude the article with two remarks. First, our unified framework is demonstrated using linear classifiers. Though linear learning is well recognized as an effective solution in many real applications, it may be suboptimal for problems with complex feature structures; it is thus of interest to study nonlinear learning thoroughly under the proposed framework. Second, our numerical results show that the proposed procedure works well in large-dimensional scenarios. A theoretical investigation of the conditions under which the statistical guarantees of RMSMM hold is another interesting topic for future study.
Acknowledgments
The authors are grateful to the editor and the reviewers for their insightful comments, which have significantly improved the article. Qian and Zou were supported in part by NNSF of China Grants 11690015, 11622104, and 11431006, and NSF of Tianjin Grant 18JCJQJC46000. Tran-Dinh was supported in part by US NSF Grant DMS-1619884. Liu was supported in part by US NSF Grants IIS-1632951 and DMS-1821231, and NIH Grant R01GM126550.
A Appendix: Proofs of Technical Results
In this appendix, we provide all the remaining proofs of the results presented in the main text.
A.1. Proof of Lemma 2: Lipschitz continuity and boundedness
Since [a]+ = max {0, a} = (a + |a|)/2, the function Pt defined in (15) can be rewritten as for some matrix and vectors μ and dt. Here, . Moreover, ψ is Lipschitz continuous by its definition. This implies that ∇ψ(Mt) is uniformly bounded, i.e., there exists a constant C0 ∈ (0, +∞) such that ||∇ψ(Mt)||F ≤ C0 for all . As a consequence, Pt is Lipschitz continuous with a uniform constant L0 independent of t, i.e., for all . The boundedness of the conjugate follows from [3, Corollary 17.19].
A.2. The proof of Lemma 3: The convergence of the primal-dual methods
Let , where is the Fenchel conjugate of Pt. Applying [9, Theorem 4] with f = 0, for any M and Y, we have
| (28) |
where , and .
By the update rule in (18), we have . Hence, by induction, we have . On the other hand, by [8, Lemma 2], with the choice of , we have
Using this estimate and , we have
where . Hence, we can estimate Tl as . Using this estimate of Tl, , and , we obtain from (28) that
This is exactly (20).
Next, we prove (21). By introducing , we can reformulate the strongly convex subproblem (15) into the following constrained convex problem:
| (29) |
Note that Qt is strongly convex with the strong convexity parameter 1 + ρt. We can apply [28, Algorithm 2] to solve (29). If we define
then, from the proof of [28, Theorem 2], we can show that
| (30) |
By Lemma 2, Pt is Lipschitz continuous with the Lipschitz constant L0. Then we have
Combining (30) and this estimate, we obtain
Similar to the proof of [28, Corollary 1], by using , the last inequality leads to
Combining the two last estimates, we obtain
which is exactly (21). ◻
A.3. Proof of statistical properties
We provide the proof of Theorems 3 and 4 in this section.
A.3.1. Proof of Theorem 3: Fisher’s consistency
In our RMSMM (3), one can abstract the truncated hinge loss function as
Then, the conditional loss can be rewritten as
[34, Theorem 1] showed that for vector data x, the robust classifier based on the loss function ϕ(f(x), y) is Fisher consistent when and s ≤ 0. Vectorizing the matrix data X into x = vec(X), all settings here coincide with those of Theorem 1 in [34]. Hence, the Fisher consistency result transfers naturally to matrix-type data. ◻
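For concreteness, the truncated hinge loss and its DC decomposition can be written as a difference of two ordinary hinges, following the construction in [30] and [34]; the exact form used in the paper's displays is not reproduced above, so this snippet is an illustrative assumption.

```python
import numpy as np

def hinge(u, a=1.0):
    """H_a(u) = [a - u]_+, the hinge function with kink at a."""
    return np.maximum(a - u, 0.0)

def truncated_hinge(u, s):
    """Truncated hinge as a difference of two convex hinges (the DC
    decomposition exploited by the DCA): T_s(u) = H_1(u) - H_s(u).

    The loss equals the plain hinge 1 - u for s <= u <= 1, is zero for
    u > 1, and is capped at the constant 1 - s for u <= s, which is what
    bounds the influence of outliers.
    """
    return hinge(u, 1.0) - hinge(u, s)
```

The flat region below s is exactly why a single outlier cannot drive the loss, and hence the estimator, to infinity.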
A.3.2. Proof of Theorem 4: Misclassification rates
First, we introduce the Rademacher complexity. Let be a class of loss functions. Given the sample , we define the empirical Rademacher complexity of as
where are i.i.d. random variables with Pr(σ1 = 1) = Pr(σ1 = −1) = 1/2. The Rademacher complexity of is defined as
For our model, let
and
To prove Theorem 4, we first recall the following lemma which provides a bound on by the empirical error and the Rademacher complexity.
Lemma 4
For any h ∈ H, with probability at least 1 − δ, we have
The proof of Lemma 4 can be found in [34].
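For a finite function class (an illustrative simplification of the classes used above), the empirical Rademacher complexity just defined can be approximated by Monte Carlo over the sign variables:

```python
import numpy as np

def empirical_rademacher(H_values, n_draws=2000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity
    R_n(H) = E_sigma sup_{h in H} (1/n) sum_i sigma_i h(z_i).

    H_values has shape (num_functions, n): row j holds h_j evaluated on
    the fixed sample z_1..z_n; sigma_i are i.i.d. +/-1 with prob. 1/2.
    """
    rng = np.random.default_rng(rng)
    m, n = H_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
        total += np.max(H_values @ sigma) / n    # sup over the class
    return total / n_draws
```

For the symmetric class {h, −h} with h ≡ 1 on n = 4 points, the exact value is E|Σσi|/4 = 0.375, which the estimate recovers up to Monte Carlo error.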
Now, we need to derive the upper bound of the Rademacher complexity used in Lemma 4. Since is -Lipschitz, we have
where denotes and . The first inequality is due to Lemma 4.2 in [20], and the absolute values of the entries in wy − wk are all bounded by 1.
First, by assumption, we can write X = E(X) + E, where , and the variance and fourth moment of the entries are σ2 and . Accordingly, , where . Since are i.i.d. copies of , we have
Because and , by Theorem 2 in [16] we have
where c is a constant which does not depend on . By similar arguments, it is easy to see that
Accordingly, we obtain the upper bound of the Rademacher complexity as
The proof is completed by using Lemma 4 with this bound and the fact that the continuous indicator function is an upper bound of the indicator function for any κ. ◻
A.3.3. Proof of Theorem 5: Breakdown Point Analysis
Let F(M, ) denote the loss function (3) with the sample , and
For the MSMM classifier, we can choose the contaminated observation as (Xo, k) with . For any M ∈ Δ+, , then as c → ∞. In this situation, the loss term corresponding to this contaminated observation will tend to infinity. Hence, we have and the classifier breaks down.
For the RMSMM, since is an interior point of Δ+, the claim
is true. Note that the loss function
is bounded by (K − 1)(1 − s). For any m ≤ nϵ1/[2(1 + δ)(K − 1)(1 − s)] with δ > 0 an arbitrary positive constant, any corresponding clean subset of n − m observations , and any , we have
Therefore,
and
The last inequality reveals that and thus the classifier would not break down when m ≤ nϵ1/[2(1 + δ)(K − 1)(1 − s)] observations are contaminated. Finally, the proof is complete by setting δ → 0. ◻
A.4. Derivation of Eq. (2): The dual problem
Lemma 5
For a p × q real matrix A, the subdifferential of the nuclear norm ∥·∥* is given as
where UAΣAVA⊤ is the SVD of A, and ∂ stands for the operator of subgradients.
Lemma 6
Suppose that , ∂G(X) = ρX − P + τ∂∥X∥*, where is a constant matrix w.r.t. X. Let the SVD of P be
where Σ0 contains the singular values of P which are greater than τ, and Σ1 contains the rest. Then, we have 0 ∈ ∂G(X*), where .
Lemma 6 can be verified by using Lemma 5 with .
Now we derive the dual problem (2) of (1). As in the classical SVM, by setting C = (Nλ)−1, we can rewrite (1) into the following form:
The corresponding Lagrange function of this problem can be written as
| (31) |
where αi ≥ 0 and μi ≥ 0 are the corresponding Lagrange multipliers. Setting the derivatives of this Lagrange function with respect to b and ξi to zero, we get
Based on Lemma 6 and setting the derivative w.r.t. M to zero, we have . Substituting these conditions into (31), we obtain
Contributor Information
Chengde Qian, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, P. R. China..
Quoc Tran-Dinh, Department of Statistics and Operations Research, The University of North Carolina at Chapel Hill. quoctd@email.unc.edu.
Sheng Fu, Department of Industrial and Systems Engineering, National University of Singapore.
Changliang Zou, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, P. R. China. nk.chlzou@gmail.com.
Yufeng Liu, Department of Statistics and Operations Research, Department of Genetics, Department of Biostatistics, Carolina Center for Genome Sciences, Lineberger Comprehensive Cancer Center, The University of North Carolina at Chapel Hill. yfliu@email.unc.edu.
References
- 1. Altun K and Barshan B. Human activity recognition using inertial/magnetic sensor units. In International Workshop on Human Behavior Understanding, pages 38–51. Springer, 2010.
- 2. Le Thi HA and Pham Dinh T. Solving a class of linearly constrained indefinite quadratic problems by DC algorithms. Journal of Global Optimization, 11(3):253–285, 1997.
- 3. Bauschke H and Combettes PL. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer International Publishing, 2nd edition, 2017.
- 4. Boser BE, Guyon IM, and Vapnik VN. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM, 1992.
- 5. Boyd S. Alternating direction method of multipliers. Talk at NIPS Workshop on Optimization and Machine Learning, 2011.
- 6. Cai D, He X, Wen J-R, Han J, and Ma W-Y. Support tensor machines for text categorization. Technical report, 2006.
- 7. Cai J-F, Candès EJ, and Shen Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20(4):1956–1982, 2010.
- 8. Chambolle A and Pock T. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1):120–145, 2011.
- 9. Chambolle A and Pock T. On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program., 159(1–2):253–287, 2016.
- 10. Cortes C and Vapnik V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
- 11. Davis D and Yin W. Faster convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions. Math. Oper. Res., 2014.
- 12. Friedman J, Hastie T, and Tibshirani R. The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag New York, 2nd edition, 2001.
- 13. He BS and Yuan XM. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM J. Numer. Anal., 50:700–709, 2012.
- 14. Hou C, Nie F, Zhang C, Yi D, and Wu Y. Multiple rank multi-linear SVM for matrix data classification. Pattern Recognition, 47(1):454–469, 2014.
- 15. Huber PJ and Ronchetti E. Robust Statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, New Jersey, 2nd edition, 2009.
- 16. Latala R. Some estimates of norms of random matrices. Proceedings of the American Mathematical Society, 133(5):1273–1282, 2005.
- 17. Lee Y, Lin Y, and Wahba G. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67–81, 2004.
- 18. Liu Y. Fisher consistency of multicategory support vector machines. In Artificial Intelligence and Statistics, pages 291–298, 2007.
- 19. Luo L, Xie Y, Zhang Z, and Li W-J. Support matrix machines. In Proceedings of the 32nd International Conference on Machine Learning, pages 938–947, Lille, France, 2015.
- 20. Mohri M, Rostamizadeh A, and Talwalkar A. Foundations of Machine Learning. Adaptive Computation and Machine Learning Series. MIT Press, Cambridge, MA, 2012.
- 21. Monteiro RDC and Svaiter BF. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM J. Optim., 23(1):475–507, 2013.
- 22. Negahban S and Wainwright MJ. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.
- 23. Nesterov Y. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.
- 24. Pirsiavash H, Ramanan D, and Fowlkes C. Bilinear classifiers for visual recognition. In Advances in Neural Information Processing Systems, pages 1482–1490, 2009.
- 25. Rockafellar RT. Convex Analysis, volume 28 of Princeton Mathematical Series. Princeton University Press, New Jersey, 1970.
- 26. Sun H, Craig B, and Zhang L. Angle-based multicategory distance-weighted SVM. Journal of Machine Learning Research, 18(85):1–21, 2017.
- 27. Tao D, Li X, Wu X, Hu W, and Maybank SJ. Supervised tensor learning. Knowledge and Information Systems, 13(1):1–42, 2007.
- 28. Tran-Dinh Q. Proximal alternating penalty algorithms for constrained convex optimization. Comput. Optim. Appl., 72(1):1–43, 2019.
- 29. Wright SJ. Coordinate descent algorithms. Math. Program., 151(1):3–34, 2015.
- 30. Wu Y and Liu Y. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479):974–983, 2007.
- 31. Yang J, Zhang D, Frangi AF, and Yang J-Y. Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(1):131–137, 2004.
- 32. Zhang C and Liu Y. Multicategory angle-based large-margin classification. Biometrika, 101(3):625–640, 2014.
- 33. Zhang C, Liu Y, Wang J, and Zhu H. Reinforced angle-based multicategory support vector machines. Journal of Computational and Graphical Statistics, 25(3):806–825, 2016.
- 34. Zhang C, Pham M, Fu S, and Liu Y. Robust multicategory support vector machines using difference convex algorithm. Math. Program., 169(1):277–305, 2018.
- 35. Zhao J, Yu G, Liu Y, et al. Assessing robustness of classification using an angular breakdown point. The Annals of Statistics, 46(6B):3362–3389, 2018.
- 36. Zhou H and Li L. Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):463–483, 2014.
- 37. Zou H and Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.