Abstract
We consider the classification problem when the input features are represented as matrices rather than vectors. To preserve the intrinsic structures for classification, a successful method is the Support Matrix Machine (SMM) in [19], which optimizes an objective function with a hinge loss plus a so-called spectral elastic net penalty. However, the issue of extending SMM to multicategory classification remains open. Moreover, in practice, it is common to see the training data contaminated by outlying observations, which can affect the robustness of existing matrix classification methods. In this paper, we address these issues by introducing a robust angle-based classifier, which handles binary and multicategory problems within a unified framework. Benefiting from the use of truncated hinge loss functions, the proposed classifier achieves certain robustness to outliers. The underlying optimization model becomes nonconvex, but admits a natural DC (difference of two convex functions) representation. We develop a new and efficient algorithm by combining the DC algorithm with primal-dual first-order methods. The proposed DC algorithm adaptively chooses the accuracy of the subproblem at each iteration while guaranteeing the overall convergence of the algorithm. The use of primal-dual methods avoids the inherent complexity of the linear operator in the subproblems and enables us to rely only on the proximal operators of the objective terms and matrix-vector operations. This advantage allows us to solve large-scale problems efficiently. Theoretical and numerical results indicate that for problems with potential outliers, our method is highly competitive among existing methods.
Keywords: Angle-based classifiers, DCA (difference of convex function) algorithm, Fisher consistency, Nonconvex optimization, Robustness, Spectral elastic net
1. Introduction
Many popular classification methods were originally developed for data with a vector of covariates, such as linear discriminant analysis, logistic regression, the support vector machine (SVM), and AdaBoost [12]. Recent advances in technology enable the generation of a wealth of data with complex structures, where the input features are represented by multi-linear geometric objects such as matrices or tensors, rather than by vectors or scalars. Matrix-type datasets are encountered in a wide range of real applications, e.g., face recognition [31] and the analysis of medical images, such as electroencephalogram data [36].
One common strategy to handle matrix data classification is to stack a matrix into a long vector, and then employ an existing vector-based method. This approach has several drawbacks. First, after vectorization, the dimensionality of the resulting vector typically becomes exceedingly high, which in turn leads to the curse of dimensionality, i.e., the "large p, small n" phenomenon. Second, vectorization of matrix-type data can destroy informative structure and correlation of the data matrix, such as neighborhood information and adjacency relations. Third, under the statistical learning framework, the regularization of vector and matrix data should differ due to their intrinsic structures. To exploit the correlation among the columns or rows of the data matrix, several methods were developed, for example, [6], [27], [24], [14]. These methods are essentially built on a low-rank assumption. Another major direction is to extend regularization techniques commonly used in vector-based classification methods to matrix-type data, under certain sparsity assumptions. Regularization with the nuclear norm of the matrix of parameters is popular in a variety of settings; see [7] for matrix completion with a low-rank constraint, and [36] for matrix regression problems based on generalized linear models. In particular, [19] proposed the Support Matrix Machine (SMM), which employs a so-called spectral elastic net penalty for binary classification problems. The spectral elastic net penalty is the combination of the squared Frobenius matrix norm and the nuclear norm, in parallel to the elastic net [37]. They showed that the SMM classifier enjoys the grouping-effect property while keeping a low-rank representation.
Our approach and contribution:
Though the SMM model is simple and effective, two major issues remain. The first is how to extend it to multicategory classification. One may reduce the multicategory problem to a sequence of binary problems, for example, using one-versus-rest or one-versus-one techniques. However, the one-versus-rest method can be inconsistent when there is no dominating class, and the one-versus-one method may suffer from tie votes [17, 18]. The second issue is that existing classifiers may not be robust against outliers, and thus may have unstable performance in practice [30]. To address these two issues, we propose a new multicategory angle-based SMM using truncated hinge loss functions, which not only provides a natural generalization of binary SMM methods, but also achieves certain robustness to outliers. Our proposed classifier can be viewed as a robust matrix counterpart of the robust vector-based classifier in [32]. We show that the proposed classifier enjoys Fisher consistency and other attractive theoretical properties.
Because the truncated hinge loss is nonconvex and the spectral elastic net regularization is nonsmooth, the optimization problem involved in our classifier is highly non-trivial. We first show that this problem admits a global optimal solution by exploiting special structures of the model. Next, we show that the optimization problem has a natural DC (difference of two convex functions) decomposition. Hence, one can apply a DC algorithm (DCA) [2] to solve this problem. However, the convex subproblem is rather complicated, with nonsmooth objective functions and linear operators, and cannot be solved exactly. This prevents us from applying DCA directly to our nonconvex problem. We instead develop a new variant, namely the inexact proximal DCA, to solve this problem. By adding a proximal term, we obtain a strongly convex subproblem. Then, to approximately solve this subproblem, we propose to use the primal-dual first-order methods of [8, 28]. These methods allow us to exploit the special structures of the problem by utilizing the proximal operators of the objective terms and matrix-vector multiplications. One challenge of this approach is to match the number of inner iterations in the primal-dual scheme with the inexactness tolerance of the proximal DCA scheme. By exploiting the problem structure, we show how to estimate this number of inner iterations at each step of the DCA scheme to obtain a unified algorithm for solving the nonconvex optimization problem. We prove that by adaptively controlling the number of iterations in the primal-dual routine, we can still achieve global convergence of our DCA variant to a stationary point. Our method can be implemented efficiently and does not require estimating any parameter at high computational cost. To the best of our knowledge, the only efficient method for solving SMM-type problems in the literature is the ADMM-based scheme of [5].
In order to examine the efficiency of our method, we compare it with an ADMM-based scheme [5]. As shown in Section 5, our method outperforms ADMM in terms of computational time, and our new model has highly competitive performance among existing methods in different aspects.
Paper organization:
The rest of the article is organized as follows. In Section 2, we briefly review some related works, and then introduce our proposed model and methodology. In Section 3, we describe a new inexact proximal DCA algorithm and investigate its convergence. Some statistical learning results, including Fisher consistency, risk and robustness analysis, are presented in Section 4. Numerical studies are given in Section 5 on both synthetic and real data. Section 6 concludes our work with some remarks, and theoretical proofs are delineated in the appendix.
Notation:
For a matrix A ∈ ℝp×q of rank r (r ≤ min(p, q)), A = UAΣAVA⊤ represents the condensed singular value decomposition (SVD) of A, where UA ∈ ℝp×r and VA ∈ ℝq×r satisfy UA⊤UA = Ir and VA⊤VA = Ir, and ΣA = diag{σ1(A), ⋯, σr(A)} with σ1(A) ≥ ⋯ ≥ σr(A) > 0. For each τ > 0, the singular value thresholding operator is defined as follows:
Dτ(A) = UA diag{(σ1(A) − τ)+, ⋯, (σr(A) − τ)+} VA⊤,
where (x)+ = max(x, 0). For A ∈ ℝp×q, ∥A∥F denotes the Frobenius norm of A, ∥A∥* denotes the nuclear norm of A, and ∥A∥2 = σ1(A) stands for the spectral norm of A. The inner product between two matrices A, B ∈ ℝp×q is defined as 〈A, B〉 = tr(A⊤B). It is well-known that the nuclear norm ∥A∥*, as a mapping from ℝp×q to ℝ, is not differentiable, but convex. Alternatively, one considers the subdifferential of ∥A∥*, which is the set of subgradients and is denoted by ∂∥A∥*. For a matrix A, vec(A) denotes its vectorization. We use 〈·, ·〉 to denote the inner product.
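As a concrete illustration of the singular value thresholding operator, here is a minimal NumPy sketch (the helper name `svt` is ours, not from the paper):

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: soft-threshold the singular values of A,
    keeping the singular vectors unchanged."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```

For example, applying `svt` with τ = 2 to diag(3, 1) zeroes out the smaller singular value and shrinks the larger one to 1, producing a rank-one matrix.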
For a proper, closed and convex function φ: ℝn → ℝ ∪ {+∞}, dom(φ) denotes the domain of φ, proxφ(v) = arg minx {φ(x) + (1/2)∥x − v∥2} denotes its proximal operator, and φ* denotes its Fenchel conjugate. We say that φ has a "friendly" proximal operator if proxφ can be computed efficiently by, e.g., a closed form or a polynomial-time algorithm. We say that φ is μφ-strongly convex if φ(·) − (μφ/2)∥·∥2 is convex, where μφ ≥ 0. Given a nonnegative real number x, we denote by ⌊x⌋ the largest integer that is less than or equal to x.
2. Methodology
Assume that the underlying joint distribution of (X, y) is Pr, where X ∈ ℝp×q is the matrix of predictors and y ∈ {1, ⋯, K} is the label. We are given a set of training samples {(Xi, yi)}i=1N of matrix-type data collected independently and identically distributed (i.i.d.) from Pr, where Xi ∈ ℝp×q is the ith input sample and yi is its corresponding class label. Here, we assume that the Xi's are zero-centered; otherwise we can transform Xi ← Xi − X̄, where X̄ = N−1 Σi=1N Xi. We take the structure into consideration and handle all Xi's in matrix form. Based on the given training set, the target of a classification problem is to estimate a classifier ŷ(·) by minimizing the empirical prediction error
where 1{·} is the indicator function. Because the 0-1 loss is discontinuous, in practice we use some surrogate loss function to approximate it. For example, in the case of the SVM, the hinge loss is adopted.
2.1. Review of the Support Matrix Machine
We take the binary problem as a special example, with the class labels encoded as {+1, −1}. The optimization problem of the SMM of [19] can be expressed as
| (1) |
where M1 ∈ ℝp×q and b ∈ ℝ are the coefficient matrix and intercept, H(u) = (1 − u)+ is the hinge loss, τ ≥ 0 controls the balance between the Frobenius norm and the nuclear norm, and λ > 0 is a tuning parameter that balances the loss and regularization terms. The SMM (1) is a soft margin classifier, and it has a close connection to the ordinary SVM [4, 10]. With τ = 0, by vectorization of the coefficient matrix M1, the SMM reduces to the standard form of the SVM.
The penalty term, (1/2)∥M1∥F2 + τ∥M1∥*, can be re-expressed as Σi {(1/2)σi(M1)2 + τσi(M1)}.
Clearly, this term is essentially the elastic net penalty applied to the singular values of the regression matrix M1, and thus it is referred to as the spectral elastic net penalty. Such regularization encourages a low-rank structure in the coefficient matrix. This can be better understood through the dual problem of (1), which is presented as follows:
| (2) |
where C = (Nλ)−1, and the optimum M1 is recovered from the dual solution by singular value thresholding. The derivation of (2) is given in the appendix. Under the low-rank assumption, small singular values are more likely to be noise, and hence the SMM can be more efficient than the SVM by thresholding them with an appropriate choice of τ. Moreover, due to the use of the trace norm, [19] also showed that there is a stronger grouping effect in the estimation of M1 than in the ordinary SVM.
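For reference, the spectral elastic net penalty is straightforward to evaluate numerically; a short sketch (the helper name is ours, assuming NumPy):

```python
import numpy as np

def spectral_elastic_net(M, tau):
    """Spectral elastic net: 0.5*||M||_F^2 + tau*||M||_*, i.e., an elastic
    net penalty applied to the singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return 0.5 * np.sum(s ** 2) + tau * np.sum(s)
```

Setting τ = 0 recovers half the squared Frobenius norm, while a large τ emphasizes the nuclear norm and hence low-rank solutions.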
2.2. Robust Multicategory SMM
To extend a binary classification method to the multicategory case, a common approach is to use K classification functions to represent the K categories, with the prediction rule based on which function has the largest value. Recently, [32] showed that this approach can be inefficient and suboptimal, and proposed an angle-based classification framework that trains only K − 1 classification functions f = (f1, ⋯, fK−1)⊤. Angle-based classifiers can enjoy better prediction performance and faster computation [33, 34, 26]. Hence, we adopt this strategy here. For simplicity, we focus on linear learning.
To be more specific, consider a centered simplex with K vertices W = (w1, ⋯, wK) in ℝK−1, where these vertices are given by wk = (K − 1)−1/2 1 if k = 1, and wk = −(1 + √K)(K − 1)−3/2 1 + (K/(K − 1))1/2 ek−1 if 2 ≤ k ≤ K.
Here, ek is the unit vector of length K − 1 with the kth entry 1 and 0 otherwise, and 1 is the vector of all ones. One can verify that each vector wk has Euclidean norm 1, and the matrix W introduces a symmetric simplex in ℝK−1. Each wk represents the kth class label. Let M be the linear transformation matrix which maps an input X into a (K − 1)-variate vector f(X) = M · vec(X), where M = (vec(M1), ⋯, vec(MK−1))⊤ and Mj ∈ ℝp×q for any j ∈ {1, ⋯, K − 1}. The angle ∠(f(X), wk) measures the confidence that the sample X belongs to class k. Thus the prediction rule is based on which angle is the smallest, i.e.,
It can also be verified that the least-angle prediction rule is equivalent to the largest-inner-product rule, i.e., ŷ(X) = arg maxk∈{1,⋯,K} 〈f(X), wk〉.
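The simplex vertices can be constructed and checked numerically; the sketch below follows the standard construction in the angle-based literature (the function name is ours):

```python
import numpy as np

def simplex_vertices(K):
    """Return W of shape (K-1, K) whose columns are the K simplex vertices:
    unit norm, pairwise inner product -1/(K-1), and summing to zero."""
    W = np.zeros((K - 1, K))
    W[:, 0] = (K - 1) ** (-0.5)                       # w_1: scaled all-ones
    for k in range(1, K):
        W[:, k] = -(1.0 + np.sqrt(K)) / (K - 1) ** 1.5
        W[k - 1, k] += np.sqrt(K / (K - 1.0))         # shift along e_{k-1}
    return W
```

For K = 2 this reduces to the usual {+1, −1} encoding, so the angle-based framework naturally contains the binary case.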
Based on the structure of matrix-type data, our proposed Robust Multicategory Support Matrix Machine (RMSMM) solves
| (3) |
where
with fj(X) = 〈Mj,X〉 for j = 1, ⋯,K − 1;
, where τ ≥ 0 is a balancing parameter;
Here, s ≤ 0 is a parameter that controls the location of truncation, and γ ∈ [0, 1] is a convex combination parameter.
In (3), the loss term can be written as , where
The first term of the above representation is a generalization of the reinforced multicategory loss function in the angle-based framework proposed by [33]. Note that it explicitly encourages 〈f(X), wk〉 to be small for k ≠ y. In parallel to [33], we will show later that this convex combination of hinge loss functions enjoys Fisher consistency when γ ∈ [0, 1/2] and s ≤ 0.
The use of the second term is motivated by [30] to alleviate the effect of potential outliers, resulting in a truncated hinge loss. It can be seen that for any potential outlier (X, y) with a sizable 〈f(X), wy〉, its loss is upper bounded by a constant for any f. Thus, the impact of outliers can be alleviated by the truncation. Note that when s > 0, Ts(u) and Rs(u) are constant within [−s, s]. In this case, the loss for some correctly classified observations would be the same as that of misclassified ones. Hence, it is more desirable to set s ≤ 0. As recommended by [32], the choice s = −(K − 1)−1 works well and will be used in our simulation study.
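To illustrate the truncation idea in isolation, consider a generic truncated hinge (a simplified stand-in for Ts and Rs, following the construction in [30]); note that it is exactly a difference of two convex hinges, which previews the DC decomposition used in Section 3:

```python
def hinge(u, a=1.0):
    """Hinge H_a(u) = max(a - u, 0)."""
    return max(a - u, 0.0)

def truncated_hinge(u, s=-0.5):
    """Truncated hinge H_1(u) - H_s(u): agrees with the usual hinge for
    u >= s, but is capped at the constant 1 - s for u < s, so a single
    outlier contributes at most a bounded loss."""
    return hinge(u, 1.0) - hinge(u, s)
```

For s = −0.5 the loss never exceeds 1.5, no matter how badly an observation is misclassified, which is exactly the boundedness used in the robustness argument above.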
The truncated hinge loss is nonconvex, which makes the optimization problem (3) more involved than that of SMM. We next present an efficient algorithm to implement our RMSMM.
3. Optimization Algorithm
Since the optimization problem (3) admits a DC decomposition, we propose to apply DCA [2] to solve this problem. At each iteration of DCA, one needs to solve a convex subproblem, which does not have a closed-form solution. We instead solve this convex subproblem up to a given accuracy and design an inexact variant of DCA that automatically adapts the accuracy of the subproblem to guarantee the overall convergence of the full algorithm.
3.1. A DC Representation of (3)
Problem (3) is nonconvex, but fortunately, it possesses a natural DC representation. Indeed, since each truncated loss is, by construction, a difference of two hinge functions, we can write
where ⊗ denotes the Kronecker product. Let us define
| (4) |
Then, we can rewrite problem (3) as
| (5) |
Problem (5) has a DC representation as follows:
| (6) |
where
| (7) |
Here, both functions Φ and Ψ are convex but nonsmooth. In addition, Ψ is polyhedral. Note that we can always add a strongly convex function S to both Φ and Ψ to write F = Φ − Ψ as
| (8) |
to obtain a new DC representation. The latter representation shows that both convex functions Φ + S and Ψ + S are strongly convex. This representation also leads to a strongly convex subproblem at each iteration of DCA, as we will see in the sequel. However, the choice of S is crucial and affects the performance of the algorithm. In our implementation, we simply add a convex quadratic function, which leads to a proximal DCA.
Note that dom(Φ) ∩ dom(Ψ) ≠ ∅. Since problem (6) is nonconvex, any point M* satisfying
| (9) |
is called a stationary point of (6). If M* satisfies ∂Φ(M*) ∩ ∂Ψ(M*) ≠ ∅, then we say that M* is a critical point of (6). We show in the following theorem that (6) has a global optimal solution.
Theorem 1
If λ > 0, then problem (6) has at least one global optimal solution M*.
Proof We first write the objective function F of (5) as the sum of a strongly convex quadratic term and a remainder function F1, where F1 combines the truncated hinge losses and the nuclear norm terms in J.
Next, we show that F1 is Lipschitz continuous. Indeed, since the hinge loss is Lipschitz, the truncated losses Ts and Rs are both Lipschitz continuous. In addition, the nuclear norm ∥Mj∥* is Lipschitz for j = 1, ⋯, K − 1. As a consequence, F1 defined above is Lipschitz continuous. That is, there exists LF1 ∈ (0, +∞) such that |F1(M) − F1(M′)| ≤ LF1∥M − M′∥F for all M, M′.
Using a fixed point , we can bound F as
Hence, F is coercive, i.e., F(M) → +∞ as ∥M∥F → +∞. Consequently, its sublevel sets are closed and bounded. By the well-known Weierstrass theorem, (6) has at least one global optimal solution M*. ◻
3.2. Inexact Proximal DCA Scheme
Let us start with the standard DCA scheme [2] and propose an inexact proximal DCA scheme to solve (6). The proximal DCA is equivalent to DCA applied to the DC decomposition (8) mentioned above, but often uses an adaptive strongly convex term S.
3.2.1. The Standard DCA Scheme and Its Proximal Variant
The DCA method for solving (6) is very simple. At each iteration t ≥ 0, given Mt, we compute a subgradient ∇Ψ(Mt) ∈ ∂Ψ(Mt) and form the subproblem:
| (10) |
to compute the next iterate Mt+1 as an exact solution of (10). The subproblem (10) is convex. However, it is fully nonsmooth and does not have a closed-form solution.
In the proximal DC variant, we instead apply DCA to the DC decomposition (8) with S(M) = (ρt/2)∥M − Mt∥F2, which leads to the following scheme:
| (11) |
where ρt > 0 is a given proximal parameter. Clearly, Mt+1 is well-defined and unique.
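To make the proximal DCA scheme (11) concrete, here is a toy one-dimensional instance (entirely illustrative, not the paper's problem): minimize x² − |x|, with Φ(x) = x² and Ψ(x) = |x|. Each step linearizes Ψ at the current iterate and solves the strongly convex subproblem, which here has a closed form:

```python
def proximal_dca(x0, rho=1.0, iters=100):
    """Proximal DCA on  min_x  x**2 - |x|  (Phi(x)=x**2, Psi(x)=|x|).
    Step: x_{t+1} = argmin_x  x**2 + (rho/2)*(x - x_t)**2 - g_t*x,
    where g_t is a subgradient of |x| at x_t."""
    x = x0
    for _ in range(iters):
        g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # subgradient of |x|
        x = (g + rho * x) / (2.0 + rho)  # closed-form subproblem solution
    return x
```

Starting from x0 = 1, the iterates converge to the stationary point x* = 1/2, where 2x* = sign(x*); by symmetry, starting from x0 = −1 gives −1/2.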
3.2.2. Inexact Proximal DCA Scheme
Clearly, the subproblem in the proximal DCA scheme (11) does not have a closed-form solution. We can only obtain an approximate solution of this problem, which certainly affects the convergence of (11). We therefore propose an inexact variant of (11) by approximately solving
| (12) |
where :≈ stands for the approximation between the approximate solution Mt+1 and the true solution of the subproblem (12), and is characterized via the objective residual as
| (13) |
We note that this condition is implementable if we apply first-order methods from convex optimization to approximately solve (12).
Clearly, by strong convexity, we have
This leads to a bound on ∥Mt+1 − M̄t+1∥F, which quantifies the difference between the approximate solution Mt+1 and the true one M̄t+1.
Under the inexact criterion (13), we can still prove the following descent property of the inexact proximal DCA scheme (12).
Lemma 1
Let Ψ be μΨ-strongly convex with μΨ ≥ 0. Let {Mt} be the sequence generated by the inexact DCA scheme (12) under the inexact criterion (13). Then
| (14) |
Proof Using the optimality condition of (12), we have
From the μΦ- and μΨ-strong convexity of Φ and Ψ, respectively, we have
Summing up the last two inequalities and using the above optimality condition, we obtain
Here, F (M) = Φ(M) − Ψ(M). Next, using (13), we have
Summing up the last two inequalities and using F = Φ − Ψ again, we obtain
This implies (14) by neglecting the remaining nonnegative term. ◻
3.3. Solution of The Convex Subproblem
By rescaling the objective function by a constant factor, we can rewrite the strongly convex subproblem (12) at iteration t of the inexact proximal DCA scheme as follows:
| (15) |
where
and
Here, the problem involves a linear operator concatenating all the vectors ai and bik, and the subgradient ∇Ψ(Mt) enters Pt; Pt is a nonsmooth convex function but has a "friendly" proximal operator that can be computed in linear time (see Subsection 3.5 for more details). Due to the strong convexity of J, (15) is strongly convex even for ρt = 0. However, one can adaptively choose ρt ≥ 0 to obtain a "good" strong convexity parameter. Without the regularization term J, (15) would be strongly convex only when ρt > 0. Since μΨ = 0 in (6), to strictly obtain the descent property in Lemma 1, we require ρt > 0. The following lemma will be used in the sequel; its proof is given in the appendix.
Lemma 2
The objective function Pt(·) of (15) is Lipschitz continuous, i.e., there exists L0 ∈ (0, +∞), independent of t, such that |Pt(u) − Pt(v)| ≤ L0∥u − v∥ for all u, v. Consequently, the domain of the conjugate of Pt is bounded uniformly in t, i.e., its diameter is finite and independent of t.
Denote by
| (16) |
the sublevel set of (5). As we proved in Theorem 1, this sublevel set is closed and bounded for any level value. We define
| (17) |
the diameter of this sublevel set, which is finite.
3.3.1. Primal-dual Schemes for Solving (15)
Problem (15) can be written as a minimax saddle-point problem using the Fenchel conjugate of Pt. It is natural to apply primal-dual first-order methods to solve this problem. We propose in this subsection two different primal-dual schemes to solve (15).
Our first algorithm is the well-known Chambolle-Pock primal-dual method proposed in [8]. This method is described as follows. Starting from an initial primal point and an initial dual variable Y0 = 0, we set the stepsizes as specified below, and at each inner iteration l ≥ 0, we perform
| (18) |
Here, we use the index t for the DCA scheme as the outer iteration counter, and the index l for the inner iteration counter. The initial stepsizes are set using the operator norm of the linear operator in (15) and the constant c = 0.999; the scheme uses the adjoint of this linear operator (i.e., its transpose when the operator is a matrix), the proximal operator of the Fenchel conjugate of Pt, and the proximal operator of ω · Qt. Alternatively, we can also apply [28, Algorithm 2] to solve (15). Originally, [28, Algorithm 2] works directly in the primal space, and has a convergence guarantee on the primal sequence that is independent of the dual variable, as we can see in Lemma 3 below. Let us describe this scheme here for solving (15). Starting from given initial points, at each inner iteration l ≥ 0, we update
| (19) |
Here, the initial values are given.
Note that the two schemes (18) and (19) look quite similar at first glance, but they are fundamentally different. First, the dual step in (19) fixes its dual center for all iterations l, while the dual update in (18) is recursive. Second, (18) has an extra averaging step in the last line, while (19) has a linear coupling step in the last line, which works similarly to the accelerated gradient method of Nesterov [23]. Finally, the ways of updating the parameters in the two schemes are quite different.
In terms of complexity, (18) and (19) have essentially the same per-iteration cost: one evaluation of the proximal operator of the Fenchel conjugate of Pt, one evaluation of the proximal operator of Qt, one matrix-vector multiplication with the linear operator, and one with its adjoint.
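As a self-contained illustration of the primal-dual template min_x Q(x) + P(Ax) that (18) instantiates, consider the toy problem min_x ½∥x − b∥² + ∥Ax∥₁ (our own example, not the paper's subproblem (15)); here the dual step is a projection onto the ℓ∞ unit ball, which is the proximal operator of the conjugate of the ℓ₁ norm:

```python
import numpy as np

def chambolle_pock(A, b, iters=500):
    """Chambolle-Pock iterations for  min_x 0.5*||x - b||^2 + ||A x||_1,
    i.e.  min_x max_{||y||_inf <= 1}  0.5*||x - b||^2 + <A x, y>."""
    m, n = A.shape
    L = np.linalg.norm(A, 2)          # operator norm ||A||
    tau = sigma = 0.9 / L             # step sizes with tau*sigma*||A||^2 < 1
    x = np.zeros(n); x_bar = x.copy(); y = np.zeros(m)
    for _ in range(iters):
        y = np.clip(y + sigma * (A @ x_bar), -1.0, 1.0)        # prox of sigma*P*
        x_new = (x - tau * (A.T @ y) + tau * b) / (1.0 + tau)  # prox of tau*Q
        x_bar = 2.0 * x_new - x                                # extrapolation
        x = x_new
    return x
```

With A = I and b = (2, 0.5), the solution is the soft-thresholding of b at level 1, namely (1, 0), which the iterations recover quickly.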
The following lemma provides us with conditions to design a stopping criterion for the inner loop (i.e., the l-iteration loop); its proof is given in the appendix.
Lemma 3
Let M̄t+1 be the unique solution of (15) at outer iteration t. Then, the sequence generated by (18) satisfies
| (20) |
where is the corresponding exact dual solution of (15). Alternatively, the sequence generated by (19) satisfies
| (21) |
where L0 is given in Lemma 2.
One advantage of (19) over (18) is that the right-hand side of the bound (21) does not depend on the dual variables, unlike (20).
3.3.2. The Upper Bound of the Inner Iterations
Our next step is to specify the maximum number of inner iterations lmax(t) to guarantee the condition (13) at each outer iteration t.
First, from both schemes (18) and (19), one can see that the dual iterates remain in the domain of the Fenchel conjugate of Pt. Hence, by Lemma 2, we can bound the dual distance terms. On the other hand, by Theorem 1, the sublevel set defined by (16) is bounded, so we can also bound the primal distance terms by the diameter given in (17). Using these upper bounds in (20), we can show that
Fixing a suitable constant, in order to guarantee (13), it suffices to choose the number of inner iterations l to be at most
| (22) |
Here, α > 1 is a given constant specified by the user. With such a choice of δt, the bound lmax(t) is independent of the unknown constants.
If we apply (19) to solve (15), then we have the bound (21). To achieve the accuracy required by (13), we bound the right-hand side of (21) by the prescribed tolerance, which yields the required number of inner iterations. Hence, we can choose
| (23) |
to terminate the primal-dual scheme (19). With such a choice of δt, lmax(t) can be evaluated exactly and is likewise independent of the unknown constants.
Remark 1
By the choice of δt as in (22) or (23), the maximum number of inner iterations lmax(t) is independent of the two unknown constants. These constants only show up when we prove the convergence of Algorithm 1 in Theorem 2; they do not need to be evaluated in Algorithm 1 below. Hence, in the implementation of Algorithm 1, we simply use (22) for (18), or (23) for (19), to specify the maximum number of inner iterations, where α > 1 is a given number, e.g., α = 1.1.
Algorithm 1.
(Inexact Proximal DC Algorithm with primal-dual iterations)
| 1: Initialization: |
| 2: Input an accuracy ε > 0, and choose an initial point M0. |
| 3: Choose two parameters and an initial value σ0. |
| 4: For t = 0 to T, perform |
| 5: Evaluate a subgradient ∇Ψ(Mt) ∈ ∂Ψ(Mt) and choose ρt > 0. |
| 6: Initialization of inner loop: Initialize the inner-loop variables and compute lmax(t). |
| 7: Inner loop: For l = 0, 1,···, lmax(t), perform either (18) or (19). |
| 8: Terminate the inner loop: If l ≥ lmax(t), then set Mt+1 to the last inner iterate and exit the inner loop. |
| 9: Stopping criterion: If ∥Mt+1 − Mt∥F ≤ ε, then terminate and return Mt+1. |
| 10: End for |
3.4. The Overall Algorithm and Its Convergence Guarantee
We now combine the inexact proximal DCA scheme (12), and the primal-dual scheme (18) (or (19)) to complete the full algorithm for solving (5) as in Algorithm 1.
In the sequel, we will explicitly specify the evaluation of a subgradient ∇Ψ(Mt) of Ψ, the choice of ρt, and the evaluation of the proximal operators. The maximum number of outer iterations T does not need to be specified precisely; we use T only as a safeguard to prevent the algorithm from running into an infinite loop. In practice, we can set T to a relatively large value, e.g., T = 103; the stopping criterion at Step 9 will typically terminate Algorithm 1 earlier. For large-scale problems, we can evaluate the operator norm of the linear operator by a power method.
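The operator norm needed for the stepsizes can be estimated with a standard power method; a minimal sketch for a dense matrix operator (the helper name is ours):

```python
import numpy as np

def operator_norm(A, iters=100, seed=0):
    """Estimate ||A||_2 (the largest singular value) by power iteration
    on A^T A."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = A.T @ (A @ v)          # one step of power iteration on A^T A
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(A @ v))
```

Each iteration costs one multiplication with A and one with its adjoint, matching the per-iteration cost of the primal-dual schemes themselves.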
We state the overall convergence of Algorithm 1 in the following theorem.
Theorem 2 (Overall convergence)
Let {Mt} be the sequence generated by Algorithm 1 using (18) (respectively, (19)) for approximately solving (12) with up to lmax(t) inner iterations as in (22) (respectively, (23)). Then, we have
Moreover, the sequence {Mt} is bounded. Any cluster point M* of {Mt} is a stationary point of (5). Consequently, the whole sequence {Mt} converges to a stationary point of (5).
Proof Suppose we apply (19) to solve the subproblem (12); with the choice of δt as in (23), we can derive from Lemma 1 that
By Theorem 1, we have F(MT+1) ≥ F(M*) > −∞, where F(M*) is the global optimal value of (5). Hence, summing up and using this lower bound, we obtain
Here, the error terms are summable due to the choice of δt. This is exactly the first estimate in Theorem 2. The second limit in Theorem 2 is a direct consequence of the first one.
By Theorem 1 again, the sublevel set defined by (16) is bounded, and since F(Mt+1) ≤ F(Mt) by Lemma 1, the sequence {Mt} stays in this sublevel set and is therefore bounded. For any cluster point M* of {Mt}, there exists a subsequence that converges to M*. Now, we prove that M* is a stationary point of (5). Using the optimality condition of (12), we have
| (24) |
Note that δt → 0 as t → ∞ due to the choice of δt. Here, we can pass this limit to a subsequence if necessary. Using this limit and the fact that limt→∞∥Mt+1 − Mt∥F = 0, we can pass to the limit in the optimality condition above. Using the definition of Φ and Ψ, we can see that the subgradient ∇Ψ(Mt) of Ψ is uniformly bounded and independent of t. By taking a further subsequence if necessary, the corresponding subgradients of Φ and Ψ converge to some ∇Φ(M*) and ∇Ψ(M*), respectively. By [25, Theorem 24.4], we have ∇Φ(M*) ∈ ∂Φ(M*) and ∇Ψ(M*) ∈ ∂Ψ(M*). Using this fact and the boundedness of ρt, we can show that 0 ∈ ∂Φ(M*) − ∂Ψ(M*). Hence, M* is a stationary point of (5). By the boundedness of {Mt} and limt→∞∥Mt+1 − Mt∥F = 0, one can then use routine techniques to show that the whole sequence {Mt} converges to M*. ◻
While the convergence result given in Theorem 2 is rather standard and similar to those in [2], its analysis for the inexact proximal DCA seems to be new to the best of our knowledge. Note that the convex subproblem in DCA-type methods is often general and may not have closed-form solutions. It is natural to incorporate inexactness in an adaptive manner to guarantee the convergence of the overall algorithm.
3.5. Implementation Details and Comparison with ADMM
In Algorithm 1, we need to compute the proximal operators of the Fenchel conjugate of Pt and of Qt. In addition, in order to compare our method with other optimization methods, we describe the well-known ADMM for solving (12) as our comparison candidate.
3.5.1. Evaluation of Subgradient ∇Ψ(Mt) and The Choice of ρt
Using the definition of Ψ from (7), we have
Here, sign(·) is the common sign function.
To choose ρt, we first choose a range within (0, +∞) and let {ρt} be any sequence in this range. We can also fix ρt for all t, e.g., ρt = 10−3. From our experience, if ρt is small, the strong convexity parameter of (15), which is 1 + ρt, is also small; hence, the number of inner iterations lmax(t) is large, although the number of outer iterations t may be small. In the opposite case, if ρt is large, then we need a small lmax(t). Nevertheless, due to the short step Mt+1 − Mt, the number of outer iterations may increase. Therefore, trading off the value of ρt is crucial and affects the performance of Algorithm 1.
3.5.2. Evaluation of Proximal Operators
To compute the proximal operator of in (18), we can use Moreau’s identity [3]:
where the operator on the right-hand side is the well-known soft-thresholding operator.
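For a concrete instance of Moreau's identity (our own example, with P taken to be the ℓ₁ norm, whose conjugate is the indicator of the ℓ∞ unit ball):

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam*||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_conj_l1(v, sigma):
    """prox of sigma*P* for P = ||.||_1, via Moreau's identity:
       prox_{sigma P*}(v) = v - sigma * prox_{P/sigma}(v/sigma)."""
    return v - sigma * soft_threshold(v / sigma, 1.0 / sigma)
```

Since P* here is the indicator of {∥y∥∞ ≤ 1}, the result must coincide with entrywise clipping to [−1, 1], for any σ > 0.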
To compute the proximal operator of Qt, we note that (here, τj = τ)
Hence, we have
where
This operator can be computed in closed form using the SVD of its argument, applying the soft-thresholding operator defined above to the singular values with threshold r = ωτj/[1 + ω(1 + ρt)].
3.5.3. ADMM Method for Solving (15)
In Algorithm 1, we can apply ADMM to solve the subproblem (15) instead of primal-dual methods. We split the nuclear norm in Qt of (15) by introducing an auxiliary variable S and rewrite (15) as
| (25) |
We define the corresponding augmented Lagrangian function of (25) as
where β > 0 is a penalty parameter. Starting from given initial points, our ADMM scheme for solving (25) updates at inner iteration l according to the following steps:
| (26) |
In this scheme, the auxiliary sequence can be computed in closed form using the SVD, as in Subsection 3.5.2. The other primal update requires solving a general convex problem. However, this problem has a special structure, so that its dual formulation becomes a box-constrained convex quadratic program, which is very similar to (2). Hence, we solve this problem by coordinate descent methods; see, e.g., [29]. In summary, if we apply ADMM to solve (15), then our inexact proximal DCA has three loops: the DCA outer iterations, the ADMM inner iterations, and the coordinate descent iterations for the latter update.
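To illustrate the ADMM splitting pattern of (25)-(26) in the simplest possible setting, here is a scalar toy problem (ours, not the paper's subproblem), splitting the smooth term from the nonsmooth one exactly as (25) splits off the nuclear norm:

```python
import numpy as np

def admm_scalar(c, lam, beta=1.0, iters=200):
    """ADMM (scaled dual form) for  min_x 0.5*(x - c)**2 + lam*|x|,
    split as  min 0.5*(x - c)**2 + lam*|z|  subject to  x = z."""
    x = z = u = 0.0
    for _ in range(iters):
        x = (c + beta * (z - u)) / (1.0 + beta)                 # smooth block
        z = np.sign(x + u) * max(abs(x + u) - lam / beta, 0.0)  # prox block
        u += x - z                                              # dual update
    return z
```

The limit is the soft-thresholding of c at level lam; e.g., c = 2, lam = 1 gives 1. In the matrix problem (25), the prox block becomes the singular value thresholding step of Subsection 3.5.2.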
Remark 2 (Convergence of the ADMM scheme (26))
Note that (15) is strongly convex, and both subproblems of (26) are strongly convex and therefore uniquely solvable. Consequently, this scheme converges, as proved, e.g., in [5, Appendix A]. Together with asymptotic convergence guarantees, the convergence rates of ADMM, of which (26) is a special case, have been studied in, e.g., [11, 13, 21]. We omit the details here.
4. Statistical Properties
In this section, we explore some statistical properties of our proposed classifier RMSMM (3). In the first part, we establish the Fisher consistency of the RMSMM and study a finite-sample bound on the misclassification rate. In the second part, we analyze the robustness of the RMSMM via breakdown point theory.
4.1. Classification Consistency
Fisher consistency is a fundamental property of classification methods. For observed matrix-type data with fixed X, denote by the conditional probability of class k ∈ {1, 2, ⋯, K}. One can verify that the best prediction rule, namely the Bayes rule, which minimizes the misclassification error rate, is ŷBayes(X) = arg maxk Pk(X).
For a classifier, denote by ϕ(f(X), y) its surrogate loss function for classification using f as the classification function, and by ŷf the corresponding prediction rule. Define the conditional loss L(X) = E[ϕ(f(X), y) | X], where the expectation is taken with respect to the marginal distribution of . We denote the theoretical minimizer of the conditional loss by f*(X) = arg minf L(X). When ŷf*(X) = ŷBayes(X), we say the classifier is Fisher consistent. Let us denote by the loss function in (3). Then, we have the following result.
Theorem 3
The classifier with the loss is Fisher consistent when and s ≤ 0.
This result can be viewed as a generalization of Theorem 1 in [34], which is devised for vector-type observations. By this theorem, our classifier RMSMM can achieve the best classification accuracy given a sufficiently large matrix-type training dataset and a rich family . The following theorem provides an upper bound on the prediction error using the training dataset.
The proofs of Theorems 3 and 4 can be found in the appendix.
Theorem 4
Suppose that the conditional distribution of X given is the same as the distribution of Ck + E, where is a constant matrix and the entries of E are i.i.d. random variables with mean zero and finite fourth moment. Let denote the solution of (5). Then, with probability at least 1 − δ, the misclassification rate of the classifier ŷ corresponding to can be bounded as
| (27) |
where , and c is a constant specified in the proof.
Theorem 4 measures the gap between the expected error and the empirical error, which allows us to better understand the utility of the nuclear norm. For each category, the decision matrix contains p × q parameters; therefore, if we only imposed Frobenius constraints [34], we would expect at best rates of the order . By taking the low-rank structure of the decision matrices into account, we use the nuclear norm penalty to control the singular values of the decision matrices. For the i-th singular triplet of the k-th decision matrix, there are p + q + 1 free parameters in total [22]: one for the singular value σki and the others for the orthogonal singular vectors of dimensions p and q. Its contribution to the gap is . Hence, under the low-rank structure of the decision matrices, the nuclear-norm-penalized estimator achieves a substantially faster rate.
The rate in Theorem 4 can be further improved if we additionally impose a low-rank constraint on the noise term of Xi. For example, consider E = UΛV⊤, where is low-rank noise with all entries i.i.d. with mean zero and finite fourth moment, and U and V are orthogonal projection matrices independent of Λ. One can verify that the term in the rate above can then be replaced by . Finally, as a side remark, consider the special case q = 1, i.e., the features are vectors rather than matrices. In this situation, the nuclear norm reduces to the Euclidean norm, and the last term of the upper bound in (27) becomes , which matches existing results; see, for example, [34].
4.2. Breakdown Point Analysis
Robustness theory has been developed since the 1960s to evaluate the stability of statistical procedures [15]. Breakdown point theory focuses on the smallest fraction of contaminated data that can cause an estimator to diverge arbitrarily from the original model. Here we consider the breakdown point analysis for multicategory classification models.
Let be the original n observations, let be the contaminated sample in which m observations are contaminated, and let be the parameters estimated from the contaminated sample. We extend the sample angular breakdown point of [35] to the multicategory classification problem as
where is the estimated decision matrix from the original sample. Since angle-based classifiers make decisions by comparing the angles between the (K − 1)-dimensional classification function f and the K vertices of the simplex , it is reasonable to quantify the divergence between classifiers via the angles between the decision vectors and their original counterparts, . When there exists a category k such that the angle between the two decision vectors is larger than π/2, the two classifiers behave totally differently on this category. Consequently, the classifier trained on contaminated samples would “break down”.
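The simplex-vertex construction and the angular breakdown criterion can be illustrated as follows. The explicit vertex formula is the standard angle-based construction (unit-norm vertices with equal pairwise angles that sum to zero); since the paper's own display is not reproduced here, it should be read as an assumption for illustration.

```python
import numpy as np

def simplex_vertices(K):
    """Vertices w_1..w_K of a regular simplex in R^(K-1), one per class.

    Each vertex has unit norm, the vertices sum to zero, and every pair
    has inner product -1/(K-1), i.e., equal pairwise angles.
    """
    W = np.zeros((K, K - 1))
    W[0] = (K - 1) ** -0.5 * np.ones(K - 1)
    for k in range(1, K):
        W[k] = -(1 + np.sqrt(K)) / (K - 1) ** 1.5 * np.ones(K - 1)
        W[k, k - 1] += np.sqrt(K / (K - 1))
    return W

def broken_down(f_orig, f_cont):
    """Angular breakdown check: True if, for some category (row), the
    contaminated decision vector makes an angle > pi/2 with the original
    one, i.e., their inner product is negative."""
    return bool(np.any(np.sum(f_orig * f_cont, axis=1) < 0))
```

Here `f_orig` and `f_cont` hold the category-wise decision vectors (one per row) for the clean and contaminated fits.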
The following theorem compares the sample breakdown points of the proposed RMSMM and the multicategory SMM (MSMM), which generalizes the SMM of [19] via angle-based methods, i.e., γ = 1/2 and s = −∞ in Eq. (3).
Theorem 5
Assume that . Then the breakdown point of MSMM is 1/n, while the breakdown point of RMSMM is not smaller than , where
By this theorem, a single contaminated observation is enough to make the MSMM classifier break down. In other words, this estimator may not work well in the presence of even a few outliers. In contrast, the breakdown point of the proposed RMSMM, thanks to the truncated hinge loss, has a fixed lower bound. Thus, the RMSMM is substantially more outlier-resistant than its counterpart without truncation. This robustness property is carefully examined via numerical comparisons in the next section.
5. Numerical Experiments
In this section, we investigate the performance of our proposed robust angle-based SMM on simulated and real datasets. Our configuration of the algorithm is as follows. For the primal-dual method described in Algorithm 1, we use M0 = 0 and ρt = 0.01 for every t. We use the stopping criterion ∥Mt+1 − Mt∥F ≤ 10−4 max {1, ∥Mt∥F}. All simulation results are based on 100 replications.
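The stopping criterion above amounts to a relative-change test on the iterates; a minimal sketch:

```python
import numpy as np

def converged(M_next, M, tol=1e-4):
    """Relative-change stopping rule: ||M_next - M||_F <= tol * max(1, ||M||_F)."""
    return np.linalg.norm(M_next - M) <= tol * max(1.0, np.linalg.norm(M))
```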
5.1. Simulation Results
We generate simulated datasets under the following two scenarios. In the first scenario, the input matrices have dimensions 50 × 50. For the kth category, to make the matrices low-rank, we randomly generate two 50 × 5 matrices, Uk and Vk, whose columns are orthonormal. More precisely, we first generate two 50 × 5 matrices with all entries i.i.d. from the standard normal distribution and obtain Uk and Vk by the Gram-Schmidt process. The center of each class is then specified by . The observations in each class are generated by Ck + E, k = 1,⋯, K, where E is a 50 × 50 normal random matrix with all entries i.i.d. from . The contaminated observations are generated by 3C1 + E for .
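A sketch of this generating mechanism (Scenario I) might look as follows. The class-center form Ck = UkVk⊤ and the noise scale are illustrative assumptions, since the exact displays are not reproduced above, and a QR factorization replaces the explicit Gram-Schmidt process (they produce the same orthonormalization).

```python
import numpy as np

def scenario1_class(K, sigma, n_per_class, p=50, r=5, rng=None):
    """Illustrative Scenario (I): low-rank class centers plus Gaussian noise.

    Each center C_k = U_k V_k^T with U_k, V_k column-orthonormal, so that
    rank(C_k) = r; observations are C_k + sigma * (i.i.d. N(0,1) matrix).
    """
    rng = np.random.default_rng(rng)
    X, y = [], []
    for k in range(K):
        Uk, _ = np.linalg.qr(rng.standard_normal((p, r)))  # orthonormal cols
        Vk, _ = np.linalg.qr(rng.standard_normal((p, r)))
        Ck = Uk @ Vk.T                                     # rank-r center
        for _ in range(n_per_class):
            X.append(Ck + sigma * rng.standard_normal((p, p)))
            y.append(k)
    return np.array(X), np.array(y)
```

Contaminated observations would then replace Ck by 3C1, as described above.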
In the second scenario, the dimensions of the input matrices are fixed at 80 × 100. We follow the settings in [36] to generate the true array signals by , where each entry of Ck is 0 or 1 and , p1 = 80 and p2 = 100. To control the rank and the percentage of nonzero entries, we set r = 10 and generate Ck,i so that each row contains exactly one entry equal to one and zeros elsewhere, with all positions equally likely. All entries of the noise matrix E are i.i.d. from σ · t(3), where t(3) denotes Student’s t-distribution with three degrees of freedom. The outliers are generated in the same way as in the first scenario.
We use 103 observations for training, 104 for tuning, and 104 for testing. The contamination ratio in the training sample, ρ, is chosen as 0%, 10%, and 20%. For training the truncated model, we use the solution of the ordinary SMM as the initial point. Following the suggestion of [33], we choose γ = 1/2 as it provides stable classification performance. The truncation parameter s is fixed at −1/(K − 1). The other hyper-parameters, C and τ, are selected via a grid search on the tuning set.
We first consider the binary classification problem, i.e., K = 2. We compare our RMSMM with the SMM in [19]. We also include a naive benchmark: the standard SVM applied to the stacked-up vectors. Fig. 1 presents the classification error rates of RMSMM, SMM, and SVM on the simulated data under Scenario (I) with K = 2. Three noise magnitudes are considered: σ = 0.5, 0.7, and 0.9. Both support-matrix-based methods, RMSMM and SMM, perform much better than the SVM. RMSMM generally outperforms SMM when outliers exist, and its advantage becomes more pronounced for larger ρ. All methods are affected by the value of σ, but the comparison conclusions hold across the various σ.
Fig. 1.
Classification error rates for RMSMM, SMM, and SVM on the simulated data with Scenario (I) and K = 2. Here, ρ stands for the percentage of data that are contaminated. SMM: [19]’s support matrix machine; SVM: the standard SVM applied to the stacked-up vectors.
Next we consider the multicategory case. Fig. 2 depicts the boxplots of the classification error rates of RMSMM and its competitors under Scenario (I) with K = 3 and 5. Three benchmarks are considered: the multicategory SMM using angle-based methods, MSMM; the angle-based multicategory SVM classifier [32]; and its robust version, the RMSVM classifier [34]. In the case of ρ = 0, the RMSMM and its non-robust counterpart MSMM perform almost identically, which demonstrates that the truncation parameter s can adapt to the data structure, keeping the efficiency loss of RMSMM relative to MSMM minimal when there are no outliers. When ρ = 0.1 or ρ = 0.2, the advantage of RMSMM is clear: the means and standard deviations of its classification error rates are generally smaller. From this figure, we can also see that the benefit of the nuclear norm is prominent: the two SMM-based classifiers perform much better than the two SVM-based ones. Similar conclusions can be drawn from Fig. 3, which reports the classification error rates of RMSMM and the other three methods under Scenario (II) with σ = 3, 4, and 5.
Fig. 2.
Classification error rates for RMSMM, SMM, and SVM on the simulated data with Scenario (I). The top three panels: the case with K = 3; the bottom three panels: the case with K = 5. MSMM: multicategory generalization of SMM using angle-based methods; MSVM: the angle-based multicategory SVM [32]; RMSVM: the robust angle-based multicategory SVM [34].
Fig. 3.
Classification error rates for RMSMM, SMM, and SVM on the simulated data with Scenario (II). The top three panels: the case of K = 3; the bottom three panels: the case of K = 5.
Finally, we present some comparison results of the ADMM and primal-dual algorithms for solving the RMSMM optimization problem (5). Fig. 4 reports the classification error rates and the corresponding computational time (in seconds) of the RMSMM using the two different primal-dual algorithms, (18) and (19), under Scenario (I) with σ = 0.7 and Scenario (II) with σ = 4 when K = 3. The bottom two panels record the total run time, including the selection of tuning parameters. The tuning parameters λ and τ in the RMSMM are selected via a grid search. More specifically, λ ∈ [0.1, 104], and for each choice of λ, τ is tuned so that the decision matrix ranges from full-rank to rank one. The two algorithms perform very similarly in terms of classification rates, but the proposed primal-dual algorithm is significantly faster, and its advantage becomes more remarkable as ρ increases. This is further confirmed by Fig. 5, which depicts the decay of the RMSMM objective function values versus computational time until the two algorithms reach the desired accuracy. We consider the case under Scenario (II) with K = 3 and σ = 4 for a given combination of tuning parameters. In particular, we fix a combination of (λ, τ) and record the objective function values at each iteration. Clearly, the primal-dual algorithm is generally more stable and converges much faster than ADMM.
Fig. 4.
Comparison between the ADMM and primal-dual algorithms: Primal-Dual stands for (18), and Proximal-Alter stands for (19) for solving the RMSMM optimization problem (5). The top two panels: classification error rates under Scenario (I) with σ = 0.7 and Scenario (II) with σ = 4 when K = 3; The bottom two panels: the corresponding computational time (in seconds).
Fig. 5.
The decrease of the RMSMM objective values with respect to the computational time under Scenario (II) with K = 3 and σ = 4.
5.2. A Real-data Example
We apply the RMSMM model (5) to the Daily and Sports Activities Dataset [1], available from the UCI Machine Learning Repository. The dataset comprises motion sensor data of 19 daily sport activities, each performed by 8 subjects (4 females, 4 males, between the ages of 20 and 30) in their own style for 5 minutes. The dataset was collected by several sensors. The input matrices are of dimension 125 × 45, where each column contains 125 samples acquired by a sensor over a period of 5 seconds at a 25 Hz sampling frequency, and each row contains the data acquired from all 45 sensor axes at a particular sampling instant.
To demonstrate the performance of the proposed RMSMM model, we select only the first 10 categories of the dataset for simplicity. The total number of instances is thus N = 10 × 8 × 60 = 4,800. This is a 10-category, balanced classification problem with 480 instances per category. We randomly and equally divide the data into three parts, for training, tuning, and testing, so each part has sample size 1,600.
We choose s = −K + 1 and select the other parameters by a grid search. We report the classification error rates of RMSMM, MSMM, RMSVM, and MSVM in Fig. 6-(left). The two matrix-based methods achieve lower classification error rates than the two vector-based classifiers, owing to the benefit of the nuclear norm. This improvement is made clearer by Fig. 7, which presents heatmaps of the decision matrices of RMSMM and RMSVM; the former has a sparser structure than the latter.
Fig. 6.
Classification error rates for RMSMM, MSMM, RMSVM, and MSVM on the Daily and Sports Activities Dataset. The left and right panels present the results when the data are clean or contaminated, respectively.
Fig. 7.
Heatmaps of the first decision matrices of RMSMM (left panel) and RMSVM (right panel)
To demonstrate the effect of potential outliers on classification accuracy, we artificially contaminate the dataset by randomly relabeling 10% of the training set into other classes. From Fig. 6-(right), we observe that the performance of every method deteriorates under this manipulation, while the RMSMM performs best. The two robust classifiers, RMSMM and RMSVM, are less affected by the outliers than the two non-robust methods. All the numerical examples above suggest that the RMSMM is a practical and robust classifier for multicategory classification problems when the input features are represented as matrices.
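The contamination mechanism used here, randomly relabeling a fraction of the training labels into a different class, can be sketched as follows (the helper name is ours):

```python
import numpy as np

def relabel_fraction(y, frac, K, rng=None):
    """Outlier injection: flip a fraction `frac` of the labels in y
    (integer labels 0..K-1) to a different class chosen uniformly."""
    rng = np.random.default_rng(rng)
    y = y.copy()
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    for i in idx:
        choices = [k for k in range(K) if k != y[i]]  # exclude the true class
        y[i] = rng.choice(choices)
    return y
```

Because the new class is always different from the original one, exactly ⌊frac · n⌋ labels are corrupted.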
6. Concluding Remarks
In this paper, we consider how to devise a robust multicategory classifier when the input features are represented as matrices. Our method is constructed in the angle-based classification framework by embedding a truncated hinge loss function into the support matrix machine. Although the corresponding optimization problem is nonconvex, it admits a natural DC (difference of two convex functions) representation, so it is natural to apply the DC algorithm (DCA). Unfortunately, the convex subproblem in DCA is rather complex and does not have a closed-form solution. We therefore develop an inexact proximal DCA variant for the underlying optimization problem, in which the convex subproblem is solved approximately by primal-dual first-order methods. Combining inexact proximal DCA with primal-dual methods yields a new proximal DCA scheme. We prove that our optimization model admits a global optimal solution and that the sequence generated by our DCA variant globally converges to a stationary point.
From a statistical learning perspective, we prove Fisher consistency and prediction error bounds. Numerical results demonstrate that our new classifiers are quite efficient and much more robust than existing methods in the presence of outlying observations. We conclude the article with two remarks. First, our unified framework is demonstrated using linear classifiers. Though linear learning is well recognized as an effective solution in many real applications, it may be suboptimal for problems with complex feature structures; it is thus of interest to study nonlinear learning thoroughly under the proposed framework. Second, our numerical results show that the proposed procedure works well in large-dimensional scenarios. A theoretical investigation of the conditions under which the statistical guarantees of RMSMM hold is another interesting topic for future study.
Acknowledgments
The authors are grateful to the editor and the reviewers for their insightful comments, which have significantly improved the article. Qian and Zou were supported in part by NNSF of China Grants 11690015, 11622104, and 11431006, and NSF of Tianjin Grant 18JCJQJC46000. Tran-Dinh was supported in part by US NSF Grant DMS-1619884. Liu was supported in part by US NSF Grants IIS-1632951 and DMS-1821231, and NIH Grant R01GM126550.
A Appendix: Proofs of Technical Results
In this appendix, we provide all the remaining proofs of the results presented in the main text.
A.1. Proof of Lemma 2: Lipschitz continuity and boundedness
Since [a]+ = max {0, a} = (a + |a|)/2, the function Pt defined in (15) can be rewritten as for some matrix and vectors μ and dt. Here, . Moreover, ψ is Lipschitz continuous by its definition. This implies that ∇ψ(Mt) is uniformly bounded, i.e., there exists a constant C0 ∈ (0, +∞) such that ||∇ψ(Mt)||F ≤ C0 for all . As a consequence, Pt is Lipschitz continuous with a uniform constant L0 independent of t, i.e., for all . The boundedness of the conjugate follows from [3, Corollary 17.19].
A.2. The proof of Lemma 3: The convergence of the primal-dual methods
Let , where is the Fenchel conjugate of Pt. Applying [9, Theorem 4] with f = 0, for any M and Y, we have
| (28) |
where , and .
By the update rule in (18), we have . Hence, by induction, we have . On the other hand, by [8, Lemma 2], with the choice of , we have
Using this estimate and , we have
where . Hence, we can estimate Tl as . Using this estimate of Tl, , and , we obtain from (28) that
This is exactly (20).
Next, we prove (21). By introducing , we can reformulate the strongly convex subproblem (15) into the following constrained convex problem:
| (29) |
Note that Qt is strongly convex with the strong convexity parameter 1 + ρt. We can apply [28, Algorithm 2] to solve (29). If we define
then, from the proof of [28, Theorem 2], we can show that
| (30) |
By Lemma 2, Pt is Lipschitz continuous with the Lipschitz constant L0. Then we have
Combining (30) and this estimate, we obtain
Similar to the proof of [28, Corollary 1], by using , the last inequality leads to
Combining the two last estimates, we obtain
which is exactly (21). ◻
A.3. Proof of statistical properties
We provide the proof of Theorems 3 and 4 in this section.
A.3.1. Proof of Theorem 3: Fisher’s consistency
In our RMSMM (3), one can abstract the truncated hinge loss function as
Then, the conditional loss can be rewritten as
[34, Theorem 1] showed that for vector data x, the robust classifier based on the loss function ϕ(f(x), y) is Fisher consistent when and s ≤ 0. Vectorizing the matrix data X into x = vec(X), all settings here coincide with those of Theorem 1 in [34]. Hence, the Fisher consistency result transfers naturally to matrix-type data. ◻
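For concreteness, the truncated hinge loss and its DC decomposition can be written as a difference of two ordinary hinges, following the construction in [30] and [34]; the exact form used in the paper's displays is not reproduced above, so this snippet is an illustrative assumption.

```python
import numpy as np

def hinge(u, a=1.0):
    """H_a(u) = [a - u]_+, the hinge function with kink at a."""
    return np.maximum(a - u, 0.0)

def truncated_hinge(u, s):
    """Truncated hinge as a difference of two convex hinges (the DC
    decomposition exploited by the DCA): T_s(u) = H_1(u) - H_s(u).

    The loss equals the plain hinge 1 - u for s <= u <= 1, is zero for
    u > 1, and is capped at the constant 1 - s for u <= s, which is what
    bounds the influence of outliers.
    """
    return hinge(u, 1.0) - hinge(u, s)
```

The flat region below s is exactly why a single outlier cannot drive the loss, and hence the estimator, to infinity.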
A.3.2. Proof of Theorem 4: Misclassification rates
First, we introduce the Rademacher complexity. Let be a class of loss functions. Given the sample , we define the empirical Rademacher complexity of as
where are i.i.d. random variables with Pr(σ1 = 1) = Pr(σ1 = −1) = 1/2. The Rademacher complexity of is defined as
For our model, let
and
To prove Theorem 4, we first recall the following lemma which provides a bound on by the empirical error and the Rademacher complexity.
Lemma 4
For any h ∈ H, with probability at least 1 − δ, we have
The proof of Lemma 4 can be found in [34].
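For a finite function class (an illustrative simplification of the classes used above), the empirical Rademacher complexity just defined can be approximated by Monte Carlo over the sign variables:

```python
import numpy as np

def empirical_rademacher(H_values, n_draws=2000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity
    R_n(H) = E_sigma sup_{h in H} (1/n) sum_i sigma_i h(z_i).

    H_values has shape (num_functions, n): row j holds h_j evaluated on
    the fixed sample z_1..z_n; sigma_i are i.i.d. +/-1 with prob. 1/2.
    """
    rng = np.random.default_rng(rng)
    m, n = H_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
        total += np.max(H_values @ sigma) / n    # sup over the class
    return total / n_draws
```

For the symmetric class {h, −h} with h ≡ 1 on n = 4 points, the exact value is E|Σσi|/4 = 0.375, which the estimate recovers up to Monte Carlo error.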
Now, we need to derive the upper bound of the Rademacher complexity used in Lemma 4. Since is -Lipschitz, we have
where denotes and . The first inequality is due to Lemma 4.2 in [20], and the absolute values of the entries in wy − wk are all bounded by 1.
First, by assumption, we can write X = E(X) + E, where , and the variance and fourth moment of the entries are σ2 and . Accordingly, , where . Since are i.i.d. copies of , we have
Because and , by Theorem 2 in [16] we have
where c is a constant which does not depend on . By similar arguments, it is easy to see that
Accordingly, we obtain the upper bound of the Rademacher complexity as
The proof is completed by using Lemma 4 with this bound and the fact that the continuous indicator function is an upper bound of the indicator function for any κ. ◻
A.3.3. Proof of Theorem 5: Breakdown Point Analysis
Let F(M, ) denote the loss function (3) with the sample , and
For the MSMM classifier, we can choose the contaminated observation as (Xo, k) with . For any M ∈ Δ+, , then as c → ∞. In this situation, the loss term corresponding to this contaminated observation will tend to infinity. Hence, we have and the classifier breaks down.
For the RMSMM, since is an interior point of Δ+, the claim
is true. Note that the loss function
is bounded by (K − 1)(1 − s). For any m ≤ nϵ1/[2(1 + δ)(K − 1)(1 − s)] with δ > 0 an arbitrary positive constant, any corresponding clean subset of n − m observations , and any , we have
Therefore,
and
The last inequality reveals that and thus the classifier would not break down when m ≤ nϵ1/[2(1 + δ)(K − 1)(1 − s)] observations are contaminated. Finally, the proof is complete by setting δ → 0. ◻
A.4. Derivation of Eq. (2): The dual problem
Lemma 5
For a p × q real matrix A, the subdifferential of the nuclear norm ∥·∥* is given as
where UAΣAVA⊤ is the SVD of A, and ∂ stands for the operator of subgradients.
Lemma 6
Suppose that , ∂G(X) = ρX − P + τ∂∥X∥*, where is a constant matrix w.r.t. X. Let the SVD of P be
where Σ0 contains the singular values of P which are greater than τ, and Σ1 contains the rest. Then, we have 0 ∈ ∂G(X*), where .
Lemma 6 can be verified by using Lemma 5 with .
Now we derive the dual problem (2) of (1). As in the classical SVM, by setting C = (Nλ)−1, we can rewrite (1) into the following form:
The corresponding Lagrange function of this problem can be written as
| (31) |
where αi ≥ 0 and μi ≥ 0 are the corresponding Lagrange multipliers. Setting the derivatives of this Lagrange function with respect to b and ξi to zero, we get
Based on Lemma 6 and setting the derivative w.r.t. M to zero, we have . Substituting these conditions into (31), we obtain
Contributor Information
Chengde Qian, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, P. R. China..
Quoc Tran-Dinh, Department of Statistics and Operations Research, The University of North Carolina at Chapel Hill. quoctd@email.unc.edu.
Sheng Fu, Department of Industrial and Systems Engineering, National University of Singapore.
Changliang Zou, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, P. R. China. nk.chlzou@gmail.com.
Yufeng Liu, Department of Statistics and Operations Research, Department of Genetics, Department of Biostatistics, Carolina Center for Genome Sciences, Lineberger Comprehensive Cancer Center, The University of North Carolina at Chapel Hill. yfliu@email.unc.edu.
References
- 1. Altun K and Barshan B. Human activity recognition using inertial/magnetic sensor units. In International Workshop on Human Behavior Understanding, pages 38–51. Springer, 2010.
- 2. Le Thi HA and Pham Dinh T. Solving a class of linearly constrained indefinite quadratic problems by DC algorithms. Journal of Global Optimization, 11(3):253–285, 1997.
- 3. Bauschke H and Combettes PL. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer International Publishing, 2nd edition, 2017.
- 4. Boser BE, Guyon IM, and Vapnik VN. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM, 1992.
- 5. Boyd S. Alternating direction method of multipliers. Talk at NIPS Workshop on Optimization and Machine Learning, 2011.
- 6. Cai D, He X, Wen J-R, Han J, and Ma W-Y. Support tensor machines for text categorization. Technical report, 2006.
- 7. Cai J-F, Candès EJ, and Shen Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20(4):1956–1982, 2010.
- 8. Chambolle A and Pock T. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1):120–145, 2011.
- 9. Chambolle A and Pock T. On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program., 159(1–2):253–287, 2016.
- 10. Cortes C and Vapnik V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
- 11. Davis D and Yin W. Faster convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions. Math. Oper. Res., 2014.
- 12. Friedman J, Hastie T, and Tibshirani R. The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag New York, 2nd edition, 2001.
- 13. He BS and Yuan XM. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM J. Numer. Anal., 50:700–709, 2012.
- 14. Hou C, Nie F, Zhang C, Yi D, and Wu Y. Multiple rank multi-linear SVM for matrix data classification. Pattern Recognition, 47(1):454–469, 2014.
- 15. Huber PJ and Ronchetti E. Robust Statistics. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., Hoboken, New Jersey, 2nd edition, 2009.
- 16. Latala R. Some estimates of norms of random matrices. Proceedings of the American Mathematical Society, 133(5):1273–1282, 2005.
- 17. Lee Y, Lin Y, and Wahba G. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67–81, 2004.
- 18. Liu Y. Fisher consistency of multicategory support vector machines. In Artificial Intelligence and Statistics, pages 291–298, 2007.
- 19. Luo L, Xie Y, Zhang Z, and Li W-J. Support matrix machines. In Proceedings of the 32nd International Conference on Machine Learning, pages 938–947, Lille, France, 2015.
- 20. Mohri M, Rostamizadeh A, and Talwalkar A. Foundations of Machine Learning. Adaptive Computation and Machine Learning Series. MIT Press, Cambridge, MA, 2012.
- 21. Monteiro RDC and Svaiter BF. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM J. Optim., 23(1):475–507, 2013.
- 22. Negahban S and Wainwright MJ. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, 39(2):1069–1097, 2011.
- 23. Nesterov Y. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.
- 24. Pirsiavash H, Ramanan D, and Fowlkes C. Bilinear classifiers for visual recognition. In Advances in Neural Information Processing Systems, pages 1482–1490, 2009.
- 25. Rockafellar RT. Convex Analysis, volume 28 of Princeton Mathematical Series. Princeton University Press, New Jersey, 1970.
- 26. Sun H, Craig B, and Zhang L. Angle-based multicategory distance-weighted SVM. Journal of Machine Learning Research, 18(85):1–21, 2017.
- 27. Tao D, Li X, Wu X, Hu W, and Maybank SJ. Supervised tensor learning. Knowledge and Information Systems, 13(1):1–42, 2007.
- 28. Tran-Dinh Q. Proximal alternating penalty algorithms for constrained convex optimization. Comput. Optim. Appl., 72(1):1–43, 2019.
- 29. Wright SJ. Coordinate descent algorithms. Math. Program., 151(1):3–34, 2015.
- 30. Wu Y and Liu Y. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479):974–983, 2007.
- 31. Yang J, Zhang D, Frangi AF, and Yang J-Y. Two-dimensional PCA: a new approach to appearance-based face representation and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(1):131–137, 2004.
- 32. Zhang C and Liu Y. Multicategory angle-based large-margin classification. Biometrika, 101(3):625–640, 2014.
- 33. Zhang C, Liu Y, Wang J, and Zhu H. Reinforced angle-based multicategory support vector machines. Journal of Computational and Graphical Statistics, 25(3):806–825, 2016.
- 34. Zhang C, Pham M, Fu S, and Liu Y. Robust multicategory support vector machines using difference convex algorithm. Math. Program., 169(1):277–305, 2018.
- 35. Zhao J, Yu G, Liu Y, et al. Assessing robustness of classification using an angular breakdown point. The Annals of Statistics, 46(6B):3362–3389, 2018.
- 36. Zhou H and Li L. Regularized matrix regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(2):463–483, 2014.
- 37. Zou H and Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.