Journal of Applied Statistics
2020 Mar 18; 47(13-15): 2623–2640. doi: 10.1080/02664763.2020.1742296

The linearized alternating direction method of multipliers for low-rank and fused LASSO matrix regression model

M. Li, Q. Guo, W. J. Zhai, B. Z. Chen
PMCID: PMC9041721  PMID: 35707412

Abstract

Datasets in which each sampling unit contains both a matrix and a vector are increasingly common in modern scientific fields. Based on the structure of such datasets, both matrix and vector coefficients need to be estimated. Existing matrix regression models mainly focus on the matrix variable and ignore the vector variable. In order to fully explore the complex structure of such datasets, we propose a novel matrix regression model that combines the fused LASSO and nuclear norm penalties and can handle data containing matrix and vector variables simultaneously. Our main contribution is an efficient algorithm for solving the proposed low-rank and fused LASSO matrix regression model. Following an existing idea, we design a linearized alternating direction method of multipliers and establish its global convergence. Finally, we carry out numerical experiments to demonstrate the efficiency of our method. In particular, we apply our model to two real datasets, i.e. signal shapes and trip time prediction from partial trajectories.

Keywords: Matrix regression, fused LASSO, low rank, linearized alternating direction method of multipliers, global convergence

1. Introduction

In the era of big data, modern scientific applications are increasingly complex, and sampling units often combine matrices with vectors instead of containing a single form. A well-known example is the electroencephalography (EEG) study of alcoholism in [23]. The study consists of 122 subjects in two groups, an alcoholic group and a normal control group, with each subject being exposed to a stimulus. Voltage values are measured from 64 channels of electrodes placed on the subject's scalp at 256 time points, so each sampling unit is a 256×64 matrix. Vectorizing this matrix would create intricate challenges. On the one hand, the dimension is p=256×64=16384, but the sample size is only n=122≪16384. On the other hand, vectorization destroys the structural information in the matrix data. It is therefore crucial to propose a novel matrix regression model for such sampling data. Zhou and Li [23] proposed the matrix regression

$$y_i = \mathrm{trace}(X_i^T B) + Z_i^T\gamma + \varepsilon_i, \qquad (1)$$

where $y_i \in \mathbb{R}$ is the response, $B \in \mathbb{R}^{m\times q}$ and $\gamma \in \mathbb{R}^{p}$ are the regression coefficients of interest, $X_i \in \mathbb{R}^{m\times q}$ is the matrix variate, $Z_i \in \mathbb{R}^{p}$ is the vector variate, and $\varepsilon_i \in \mathbb{R}$ is the noise, which follows a normal distribution with mean 0 and standard deviation $\sigma$. However, Zhou and Li [23] only analysed the properties and the algorithm for the case $\gamma = 0$. At present, few researchers have considered the matrix regression model (1), let alone this model with penalized regularization. It is necessary to study this model because the data contain matrix and vector variables at the same time.

In this paper, we focus on the low-rank and fused LASSO (LRFL) matrix regression

$$\min_{B \in \mathbb{R}^{m\times q},\,\gamma \in \mathbb{R}^{p}}\; \frac{1}{2}\sum_{i=1}^{N}\bigl(y_i - \mathrm{trace}(X_i^T B) - Z_i^T\gamma\bigr)^2 + \lambda_1\|B\|_* + \lambda_2\|\gamma\|_1 + \lambda_3\sum_{j=2}^{p}|\gamma_j - \gamma_{j-1}|, \qquad (2)$$

where $\lambda_i > 0$ $(i = 1, 2, 3)$ are tuning parameters. Unlike typical models containing only vectors or matrices, the two $\ell_1$-norm terms in this model induce sparsity of both the coefficients $\gamma$ and their successive differences, and the nuclear norm $\|B\|_*$ induces low rank of the unknown regression matrix $B$. The sparsity of $\gamma$ helps us to choose the most important variables, and the low rank of $B$ picks up the pivotal information in the matrix variable. If we remove $B$ from model (2), the model degenerates into the fused LASSO (FLASSO) introduced by Tibshirani et al. [18]. If we remove $\gamma$ from model (2), it becomes the nuclear norm regularized matrix regression studied in Zhou and Li [23]. In this paper, we mainly focus on designing an efficient method for solving the LRFL matrix regression model (2).

In recent years, researchers have proposed many regularized regression models with different penalties, such as the power family [7], the elastic net [24], the log-penalty [1,3], SCAD [6], and MC+ [22]. Meanwhile, Zhou and Li [23] proposed a matrix regression model and considered the low rank of $B$ based on spectral regularization. There is also other work related to matrix data, such as [4,13,14,17,20,21]. However, none of these considered matrix and vector variables together. Thus, a model combining matrix and vector variables is essential to study. The basic work is to study the statistical properties and design an algorithm for computing the solution of the LRFL matrix regression model (2). Zhou and Li [23] used the Nesterov method to solve spectral regularized matrix regression. Although this method is analytically simple, it is not suitable for model (2), which involves two variables. Moreover, Li et al. [12] proposed the linearized alternating direction method of multipliers (LADMM) for solving FLASSO. Since the objective function in our model (2) is convex with respect to $B$ and $\gamma$, a natural extension to our model can be considered. Following the procedure of Li et al. [12], we develop an LADMM algorithm for solving model (2).

The rest of the paper is organized as follows. In Section 2, we introduce some preliminaries that are useful for further discussion; in particular, we present two important optimization problems that appear in our algorithm. In Section 3, we give the LADMM algorithm for the LRFL matrix regression model (2). In Section 4, we establish the convergence of the algorithm. In Section 5, we conduct extensive numerical experiments to evaluate its performance. We conclude the paper in the final section.

2. Preliminary

In this section, we introduce some preliminaries that are useful for further discussion. First, we give some notation for derivatives. Then, we briefly present the solutions of two important optimization problems.

2.1. Matrix calculation

  • If $f(B)$ is a real-valued function of $B \in \mathbb{R}^{m\times q}$, the derivative of $f$ with respect to $B$ is defined as
    $$\frac{\partial f}{\partial B} = \begin{pmatrix} \frac{\partial f}{\partial B_{11}} & \frac{\partial f}{\partial B_{12}} & \cdots & \frac{\partial f}{\partial B_{1q}} \\ \frac{\partial f}{\partial B_{21}} & \frac{\partial f}{\partial B_{22}} & \cdots & \frac{\partial f}{\partial B_{2q}} \\ \vdots & \vdots & & \vdots \\ \frac{\partial f}{\partial B_{m1}} & \frac{\partial f}{\partial B_{m2}} & \cdots & \frac{\partial f}{\partial B_{mq}} \end{pmatrix}.$$
  • If $B \in \mathbb{R}^{m\times q}$, $Y \in \mathbb{R}^{n\times p}$ and each $Y_{ij}$ is a real-valued function of $B$, the derivative of $Y$ with respect to $B$ is defined as
    $$\frac{\partial Y}{\partial B} = \begin{pmatrix} \frac{\partial Y_{11}}{\partial B_{11}} & \frac{\partial Y_{11}}{\partial B_{12}} & \cdots & \frac{\partial Y_{11}}{\partial B_{mq}} \\ \frac{\partial Y_{12}}{\partial B_{11}} & \frac{\partial Y_{12}}{\partial B_{12}} & \cdots & \frac{\partial Y_{12}}{\partial B_{mq}} \\ \vdots & \vdots & & \vdots \\ \frac{\partial Y_{np}}{\partial B_{11}} & \frac{\partial Y_{np}}{\partial B_{12}} & \cdots & \frac{\partial Y_{np}}{\partial B_{mq}} \end{pmatrix}.$$
  • For any matrices $A \in \mathbb{R}^{n\times m}$, $B \in \mathbb{R}^{m\times q}$, $C \in \mathbb{R}^{q\times n}$, the derivative of the trace function satisfies
    $$\frac{\partial\,\mathrm{tr}(ABC)}{\partial B} = A^T C^T.$$
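As a quick sanity check of the trace-derivative rule above, the following NumPy snippet (illustrative, not the authors' code) compares the analytic derivative $A^TC^T$ with a finite-difference approximation; since $\mathrm{tr}(ABC)$ is linear in $B$, the two agree up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, q = 3, 4, 5
A = rng.normal(size=(n, m))
B = rng.normal(size=(m, q))
C = rng.normal(size=(q, n))

# Analytic derivative of tr(ABC) with respect to B: A^T C^T.
analytic = A.T @ C.T

# Finite-difference check: perturb each entry of B in turn.
eps = 1e-6
numeric = np.zeros_like(B)
for i in range(m):
    for j in range(q):
        E = np.zeros_like(B)
        E[i, j] = eps
        numeric[i, j] = (np.trace(A @ (B + E) @ C) - np.trace(A @ B @ C)) / eps

assert np.allclose(analytic, numeric, atol=1e-6)
```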

2.2. Two important optimization problems

Here, we present two results which play an important role in the algorithm.

2.2.1. Soft-thresholding

The minimization problem

$$\min_{x\in\mathbb{R}^p}\Bigl\{\|x\|_1 + \frac{\beta}{2}\|x - r\|^2\Bigr\},$$

with $\beta > 0$ and $r \in \mathbb{R}^p$, has a closed-form solution, which is given by the soft-thresholding operator

$$x^* = \mathrm{shrink}(r, 1/\beta) := \mathrm{sign}(r)\odot\max\{0,\,|r| - 1/\beta\}, \qquad (3)$$

where $\mathrm{sign}(\cdot)$ is the sign function and all operations are applied entrywise.
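The soft-thresholding operator is one line in code. A minimal NumPy sketch (the function name `shrink` is chosen to match the notation above; this is illustrative, not the authors' MATLAB code):

```python
import numpy as np

def shrink(r, tau):
    """Entrywise soft-thresholding: sign(r) * max(0, |r| - tau)."""
    return np.sign(r) * np.maximum(0.0, np.abs(r) - tau)

# shrink(r, 1/beta) minimizes ||x||_1 + (beta/2) * ||x - r||^2.
r = np.array([1.5, -0.2, 0.0, -3.0])
print(shrink(r, 1.0))  # entries shrunk toward zero: [0.5, 0., 0., -2.]
```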

2.2.2. Matrix soft-thresholding

The minimization problem

$$\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\|B\|_* + \frac{\beta}{2}\|\mathrm{vec}(B) - r\|^2\Bigr\},$$

with $\beta > 0$ and $r \in \mathbb{R}^{mq}$, has a closed-form solution, which is given by

$$B^* = U\,\mathrm{diag}\bigl(\mathrm{shrink}(\lambda, 1/\beta)\bigr)V^T, \qquad (4)$$

where $R$ is the $m\times q$ matrix obtained by arranging $r$ column-wise, $U, \Lambda, V$ come from the singular value decomposition (SVD) of $R$, i.e. $R = U\Lambda V^T$, and $\lambda$ is the diagonal of $\Lambda$. The proof can be found in [2] (Theorem 2.1) or [19] (Theorem 3).
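The matrix soft-thresholding step can be sketched in NumPy as follows (an illustrative singular value thresholding helper, with `tau` playing the role of $1/\beta$; names are ours, not the paper's):

```python
import numpy as np

def shrink(v, tau):
    """Entrywise soft-thresholding."""
    return np.sign(v) * np.maximum(0.0, np.abs(v) - tau)

def matrix_shrink(R, tau):
    """Singular value thresholding: soft-threshold the singular values of R."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

rng = np.random.default_rng(1)
R = rng.normal(size=(6, 4))
B = matrix_shrink(R, 1.0)
# Thresholding singular values can only reduce the rank.
assert np.linalg.matrix_rank(B) <= np.linalg.matrix_rank(R)
```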

3. Linearized alternating direction method of multipliers

In this section, we propose the LADMM algorithm to solve the LRFL matrix regression model (2). For completeness, we first recall the ADMM algorithm. Consider a convex minimization problem

$$\min\; F(\beta) + G(\gamma) \quad \text{s.t.}\quad X\beta + Y\gamma = b,\; \beta\in\Pi,\; \gamma\in\Gamma. \qquad (5)$$

The objective function in (5) is separable and the constraint is linear. Here $\Pi\subseteq\mathbb{R}^n$ and $\Gamma\subseteq\mathbb{R}^m$ are given non-empty, closed and convex sets, $F$ and $G$ are closed convex functions, $X\in\mathbb{R}^{l\times n}$ and $Y\in\mathbb{R}^{l\times m}$ are given matrices, and $b\in\mathbb{R}^{l}$ is a given vector. Throughout, we assume that the solution set of (5) is non-empty. The augmented Lagrangian function of (5) is

$$L_\mu(\beta,\gamma,\alpha) := F(\beta) + G(\gamma) - \alpha^T(X\beta + Y\gamma - b) + \frac{\mu}{2}\|X\beta + Y\gamma - b\|_2^2,$$

where $\mu > 0$ is a penalty parameter and $\alpha\in\mathbb{R}^{l}$ is the Lagrange multiplier. The augmented Lagrangian method (ALM) of Hestenes [10] and Powell [16] can be applied to solve (5). With a given initial point $(\beta^0, \gamma^0, \alpha^0)$, the iterative scheme of the ALM for (5) is

$$(\beta^{k+1}, \gamma^{k+1}) = \arg\min\{L_\mu(\beta,\gamma,\alpha^k)\,|\,\beta\in\Pi,\,\gamma\in\Gamma\},\qquad \alpha^{k+1} = \alpha^k - \mu(X\beta^{k+1} + Y\gamma^{k+1} - b). \qquad (6)$$

The direct application of the ALM to (5) yields scheme (6), which at each iteration requires minimizing over $\beta$ and $\gamma$ jointly. The ADMM algorithm of Gabay and Mercier [8] and Glowinski and Marrocco [9] decomposes the minimization problem in (6) into two separable subproblems and minimizes them alternately:

$$\beta^{k+1} = \arg\min\{L_\mu(\beta,\gamma^k,\alpha^k)\,|\,\beta\in\Pi\},\quad \gamma^{k+1} = \arg\min\{L_\mu(\beta^{k+1},\gamma,\alpha^k)\,|\,\gamma\in\Gamma\},\quad \alpha^{k+1} = \alpha^k - \mu(X\beta^{k+1} + Y\gamma^{k+1} - b). \qquad (7)$$

Because the joint minimization in (6) is split into the alternating steps in (7), the subproblems in (7) are easier than the original problem (5). Moreover, for many applications, including the LASSO and GLASSO, the resulting subproblems in (7) have closed-form solutions. This fact makes the ADMM particularly efficient for a wide class of problems, so we consider applying it to the LRFL matrix regression model (2).

Now we analyse how to solve the LRFL matrix regression model (2) by applying the ADMM. In order to reformulate model (2), we define a matrix $A\in\mathbb{R}^{(p-1)\times p}$ as

$$A = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & -1 \end{pmatrix}.$$

In fact, denoting $y = (y_1, y_2, \ldots, y_n)^T$, $X = (\mathrm{vec}(X_1), \mathrm{vec}(X_2), \ldots, \mathrm{vec}(X_n))^T$ and $Z = (Z_1, Z_2, \ldots, Z_n)^T$, model (2) can be written as

$$\min_{B\in\mathbb{R}^{m\times q},\,\gamma\in\mathbb{R}^p}\Bigl\{\frac{1}{2}\|y - Z\gamma - X\mathrm{vec}(B)\|_2^2 + \lambda_1\|B\|_* + \lambda_2\|\gamma\|_1 + \lambda_3\|A\gamma\|_1\Bigr\}. \qquad (8)$$

Letting $\xi = A\gamma \in \mathbb{R}^{p-1}$, (8) can be rewritten as

$$\min_{B\in\mathbb{R}^{m\times q},\,\gamma\in\mathbb{R}^p,\,\xi\in\mathbb{R}^{p-1}}\Bigl\{\frac{1}{2}\|y - Z\gamma - X\mathrm{vec}(B)\|_2^2 + \lambda_1\|B\|_* + \lambda_2\|\gamma\|_1 + \lambda_3\|\xi\|_1\Bigr\} \quad \text{s.t.}\; A\gamma = \xi. \qquad (9)$$
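The reformulation hinges on the first-order difference matrix $A$: multiplying $\gamma$ by $A$ collects exactly the successive differences penalized by the fused LASSO term. A minimal NumPy sketch (the helper name is illustrative):

```python
import numpy as np

def difference_matrix(p):
    """(p-1) x p matrix A with rows (..., 1, -1, ...): (A @ g)[j] = g[j] - g[j+1]."""
    A = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    A[idx, idx] = 1.0
    A[idx, idx + 1] = -1.0
    return A

g = np.array([3.0, 1.0, 4.0, 1.0])
A = difference_matrix(4)
print(A @ g)  # successive differences: [2., -3., 3.]
```

With this helper, `np.abs(A @ g).sum()` is the fused penalty $\sum_j |\gamma_j - \gamma_{j+1}|$ from model (2).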

The augmented Lagrangian function of (9) is

$$L_\mu(B,\gamma,\xi,\alpha) := \frac{1}{2}\|y - Z\gamma - X\mathrm{vec}(B)\|_2^2 + \lambda_1\|B\|_* + \lambda_2\|\gamma\|_1 + \lambda_3\|\xi\|_1 - \alpha^T(A\gamma - \xi) + \frac{\mu}{2}\|A\gamma - \xi\|_2^2, \qquad (10)$$

where $\alpha\in\mathbb{R}^{p-1}$ is the Lagrange multiplier and $\mu > 0$ is a given penalty parameter. The iterative scheme of the ADMM for (10) is

$$B^{k+1} = \arg\min_{B\in\mathbb{R}^{m\times q}} L_\mu(B,\gamma^k,\xi^k,\alpha^k),\quad \gamma^{k+1} = \arg\min_{\gamma\in\mathbb{R}^p} L_\mu(B^{k+1},\gamma,\xi^k,\alpha^k),\quad \xi^{k+1} = \arg\min_{\xi\in\mathbb{R}^{p-1}} L_\mu(B^{k+1},\gamma^{k+1},\xi,\alpha^k),\quad \alpha^{k+1} = \alpha^k - \mu(A\gamma^{k+1} - \xi^{k+1}). \qquad (11)$$

Now let us look at the resulting subproblems in (11). First, after trivial manipulation, the $B$-subproblem in (11) can be written as

$$B^{k+1} = \arg\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\frac{1}{2}\|y - Z\gamma^k - X\mathrm{vec}(B)\|_2^2 + \lambda_1\|B\|_*\Bigr\} = \arg\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\lambda_1\|B\|_* + \frac{1}{2}\|X\mathrm{vec}(B) - \hat{y}^k\|^2\Bigr\}, \qquad (12)$$

where $\hat{y}^k = y - Z\gamma^k \in \mathbb{R}^n$. The subproblem does not have a closed-form solution because $X$ is not an identity matrix. As in Wang and Yuan [19], we linearize the quadratic term $\frac{1}{2}\|X\mathrm{vec}(B) - \hat{y}^k\|^2$ in (12), replacing it by

$$\bigl(X^T(X\mathrm{vec}(B^k) - \hat{y}^k)\bigr)^T\bigl(\mathrm{vec}(B) - \mathrm{vec}(B^k)\bigr) + \frac{v}{2}\|\mathrm{vec}(B) - \mathrm{vec}(B^k)\|_2^2,$$

where the parameter $v > 0$ controls the proximity to $B^k$. Overall, we solve the following subproblem:

$$\begin{aligned} B^{k+1} &= \arg\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\lambda_1\|B\|_* + \bigl(X^T(X\mathrm{vec}(B^k) - \hat{y}^k)\bigr)^T\bigl(\mathrm{vec}(B) - \mathrm{vec}(B^k)\bigr) + \frac{v}{2}\|\mathrm{vec}(B) - \mathrm{vec}(B^k)\|_2^2\Bigr\} \\ &= \arg\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\lambda_1\|B\|_* + \frac{v}{2}\bigl\|\mathrm{vec}(B) - \mathrm{vec}(B^k) + X^T(X\mathrm{vec}(B^k) - \hat{y}^k)/v\bigr\|_2^2\Bigr\} \\ &= \arg\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\lambda_1\|B\|_* + \frac{v}{2}\langle B - C_1, B - C_1\rangle\Bigr\} = \arg\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\lambda_1\|B\|_* + \frac{v}{2}\|B - C_1\|_F^2\Bigr\}, \qquad (13)\end{aligned}$$

where $C_1$ is the matrix with $\mathrm{vec}(C_1) = C$ and $C = \mathrm{vec}(B^k) - X^T(X\mathrm{vec}(B^k) - \hat{y}^k)/v$. Then, by (4), the closed-form solution of (13) is

$$B^{k+1} = U\,\mathrm{diag}\bigl(\mathrm{shrink}(\lambda, \lambda_1/v)\bigr)V^T, \qquad (14)$$

where $U, \Lambda, V$ come from the singular value decomposition (SVD) of $C_1$, i.e. $C_1 = U\Lambda V^T$, and $\lambda$ is the diagonal of $\Lambda$.

Second, for the $\gamma$-subproblem in (11), we briefly display its solution as follows:

$$\gamma^{k+1} = \arg\min_{\gamma\in\mathbb{R}^p}\Bigl\{\frac{1}{2}\|y - Z\gamma - X\mathrm{vec}(B^{k+1})\|_2^2 + \lambda_2\|\gamma\|_1 + \frac{\mu}{2}\|A\gamma - \xi^k - \alpha^k/\mu\|_2^2\Bigr\} = \arg\min_{\gamma\in\mathbb{R}^p}\Bigl\{\lambda_2\|\gamma\|_1 + \frac{1}{2}\|\tilde{X}\gamma - \tilde{y}^k\|^2\Bigr\},$$

where $\tilde{X} = (Z^T, \sqrt{\mu}A^T)^T$ and $\tilde{y}^k = \bigl((y - X\mathrm{vec}(B^{k+1}))^T, \sqrt{\mu}(\xi^k + \alpha^k/\mu)^T\bigr)^T$. The subproblem does not have a closed-form solution because $\tilde{X}$ is not an identity matrix. Similarly, we linearize the quadratic term $\frac{1}{2}\|\tilde{X}\gamma - \tilde{y}^k\|^2$, replacing it by

$$\bigl(\tilde{X}^T(\tilde{X}\gamma^k - \tilde{y}^k)\bigr)^T(\gamma - \gamma^k) + \frac{v}{2}\|\gamma - \gamma^k\|_2^2,$$

where the parameter $v > 0$ controls the proximity to $\gamma^k$. Overall, we solve the following subproblem:

$$\gamma^{k+1} = \arg\min_{\gamma\in\mathbb{R}^p}\Bigl\{\lambda_2\|\gamma\|_1 + \bigl(\tilde{X}^T(\tilde{X}\gamma^k - \tilde{y}^k)\bigr)^T(\gamma - \gamma^k) + \frac{v}{2}\|\gamma - \gamma^k\|_2^2\Bigr\} = \arg\min_{\gamma\in\mathbb{R}^p}\Bigl\{\lambda_2\|\gamma\|_1 + \frac{v}{2}\bigl\|\gamma - \gamma^k + \tilde{X}^T(\tilde{X}\gamma^k - \tilde{y}^k)/v\bigr\|_2^2\Bigr\}.$$

Then, according to (3), the closed-form solution of the $\gamma$-subproblem in (11) is

$$\gamma^{k+1} = \mathrm{shrink}\bigl(\gamma^k - \tilde{X}^T(\tilde{X}\gamma^k - \tilde{y}^k)/v,\; \lambda_2/v\bigr). \qquad (15)$$

Third, for the $\xi$-subproblem in (11), a closed-form solution can be obtained directly from the first-order optimality condition of the augmented Lagrangian function:

$$\xi^{k+1} = \arg\min_{\xi\in\mathbb{R}^{p-1}}\Bigl\{\lambda_3\|\xi\|_1 - (\alpha^k)^T(A\gamma^{k+1} - \xi) + \frac{\mu}{2}\|A\gamma^{k+1} - \xi\|_2^2\Bigr\} = \arg\min_{\xi\in\mathbb{R}^{p-1}}\Bigl\{\frac{\mu}{2}\|\xi - A\gamma^{k+1} + \alpha^k/\mu\|_2^2 + \lambda_3\|\xi\|_1\Bigr\}.$$

We obtain the solution by (3),

$$\xi^{k+1} = \mathrm{shrink}\bigl(A\gamma^{k+1} - \alpha^k/\mu,\; \lambda_3/\mu\bigr). \qquad (16)$$

In summary, the iterative scheme of the LADMM algorithm for the LRFL matrix regression model can be described as follows.

[Algorithm 1. LADMM for the LRFL matrix regression model: given an initial point $(B^0, \gamma^0, \xi^0, \alpha^0)$, repeat the updates (14), (15), (16) and $\alpha^{k+1} = \alpha^k - \mu(A\gamma^{k+1} - \xi^{k+1})$ until a stopping criterion is met.]

Remark 3.1

When solving the $B$-subproblem, the rank of $B^{k+1}$ is determined by $\lambda_1/v$: if $\lambda_1/v$ increases, the rank of $B^{k+1}$ decreases, and vice versa. Thus, to obtain a low-rank estimator, we only need to choose a large $\lambda_1/v$. On the other hand, for a given $\lambda_1/v$, we do not need to compute all the singular values of $C_1$: from the solution in (14), only the singular values greater than $\lambda_1/v$ matter. Thus, in the implementation of Algorithm 1, we use the truncation technique that can be found in [15].
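Putting the three closed-form updates together, the following NumPy sketch outlines one possible implementation of Algorithm 1 (the paper's implementation is in MATLAB and uses a truncated SVD; all names, the step-size choice and the stopping rule here are illustrative):

```python
import numpy as np

def shrink(v, tau):
    """Entrywise soft-thresholding."""
    return np.sign(v) * np.maximum(0.0, np.abs(v) - tau)

def ladmm_lrfl(y, X, Z, A, m, q, lam1, lam2, lam3, mu, max_iter=500, tol=1e-4):
    """Sketch of the LADMM iteration for model (2); X has rows vec(X_i)^T."""
    p = Z.shape[1]
    Xt = np.vstack([Z, np.sqrt(mu) * A])        # tilde-X = (Z^T, sqrt(mu) A^T)^T
    # Step size v must exceed the spectral radii of X^T X and Z^T Z + mu A^T A.
    v = 1.01 * max(np.linalg.norm(X.T @ X, 2),
                   np.linalg.norm(Z.T @ Z + mu * A.T @ A, 2))
    b = np.zeros(m * q)                         # b stands for vec(B)
    g = np.zeros(p)
    xi = np.zeros(p - 1)
    alpha = np.zeros(p - 1)
    for _ in range(max_iter):
        # B-step (14): linearize the quadratic, then singular value thresholding.
        yk = y - Z @ g
        C = (b - X.T @ (X @ b - yk) / v).reshape(m, q, order='F')
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        b_new = (U @ np.diag(shrink(s, lam1 / v)) @ Vt).reshape(-1, order='F')
        # gamma-step (15): linearize, then soft-thresholding.
        yt = np.concatenate([y - X @ b_new, np.sqrt(mu) * (xi + alpha / mu)])
        g_new = shrink(g - Xt.T @ (Xt @ g - yt) / v, lam2 / v)
        # xi-step (16) and multiplier update.
        xi = shrink(A @ g_new - alpha / mu, lam3 / mu)
        alpha = alpha - mu * (A @ g_new - xi)
        done = max(np.linalg.norm(b_new - b), np.linalg.norm(g_new - g)) < tol
        b, g = b_new, g_new
        if done:
            break
    return b.reshape(m, q, order='F'), g
```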

4. Convergence analysis

In this section, we focus on the convergence analysis of Algorithm 1. The procedure is similar to some existing work, such as [12]; for ease of understanding, we give a succinct proof here.

4.1. Convergence of Algorithm 1

Note that the Lagrange function of (9) is

$$\frac{1}{2}\|y - Z\gamma - X\mathrm{vec}(B)\|_2^2 + \lambda_1\|B\|_* + \lambda_2\|\gamma\|_1 + \lambda_3\|\xi\|_1 - \alpha^T(A\gamma - \xi), \qquad (17)$$

where $\alpha\in\mathbb{R}^{p-1}$ is the Lagrange multiplier. By the first-order optimality condition of (17), solving (9) is equivalent to finding $(B^*,\gamma^*,\xi^*,\alpha^*)\in\Omega := \mathbb{R}^{m\times q}\times\mathbb{R}^p\times\mathbb{R}^{p-1}\times\mathbb{R}^{p-1}$, $f(B^*)\in\partial\|B^*\|_*$ (with respect to $\mathrm{vec}(B)$), $g(\gamma^*)\in\partial(\|\gamma^*\|_1)$ and $h(\xi^*)\in\partial(\|\xi^*\|_1)$ such that

$$\begin{aligned} 0 &= \lambda_1 f(B^*) - X^T\bigl(y - Z\gamma^* - X\mathrm{vec}(B^*)\bigr),\\ 0 &= \lambda_2 g(\gamma^*) - Z^T\bigl(y - Z\gamma^* - X\mathrm{vec}(B^*)\bigr) - A^T\alpha^*,\\ 0 &= \lambda_3 h(\xi^*) + \alpha^*,\\ 0 &= A\gamma^* - \xi^*. \qquad (18)\end{aligned}$$

Here $\partial(\cdot)$ denotes the subdifferential operator of a non-smooth convex function. We denote by $\Omega^*$ the set of all elements of $\Omega$ satisfying (18). Then, using the notation $\omega = \bigl((\mathrm{vec}(B))^T, \gamma^T, \xi^T, \alpha^T\bigr)^T$, (18) can be written as a variational inequality (VI) problem: find $\omega^*\in\Omega$, $f(B^*)\in\partial\|B^*\|_*$, $g(\gamma^*)\in\partial(\|\gamma^*\|_1)$ and $h(\xi^*)\in\partial(\|\xi^*\|_1)$ such that

$$(\omega - \omega^*)^T F(\omega^*) \geq 0 \quad \forall\,\omega\in\Omega, \qquad (19)$$

where $\omega^* = \bigl(\mathrm{vec}(B^*)^T, (\gamma^*)^T, (\xi^*)^T, (\alpha^*)^T\bigr)^T$ and

$$F(\omega) = \begin{pmatrix} \lambda_1 f(B) - X^T(y - Z\gamma - X\mathrm{vec}(B)) \\ \lambda_2 g(\gamma) - Z^T(y - Z\gamma - X\mathrm{vec}(B)) - A^T\alpha \\ \lambda_3 h(\xi) + \alpha \\ A\gamma - \xi \end{pmatrix}.$$

For conciseness, we introduce the following matrix:

$$G = \begin{pmatrix} vI_{mq} - X^TX & 0 & 0 & 0 \\ 0 & vI_p - \tilde{X}^T\tilde{X} & 0 & 0 \\ 0 & 0 & \mu I_{p-1} & 0 \\ 0 & 0 & 0 & \frac{1}{\mu}I_{p-1} \end{pmatrix}.$$

For the proof to go through, the matrix $G$ must be positive definite. Since $\tilde{X} = (Z^T, \sqrt{\mu}A^T)^T$, the positive definiteness of $G$ is guaranteed by the conditions $v > \rho(X^TX)$ and $v > \rho(Z^TZ + \mu A^TA)$, where $\rho(\cdot)$ denotes the spectral radius of a matrix. In order to establish the convergence of the LADMM algorithm, in the following lemma we characterize the $(k+1)$th iteration of Algorithm 1 as a VI problem.
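The positive definiteness condition on $G$ is easy to check numerically. The sketch below (illustrative dimensions and data) verifies that choosing $v$ just above both spectral radii makes the two non-trivial diagonal blocks of $G$ positive definite:

```python
import numpy as np

rng = np.random.default_rng(3)
n, mq, p, mu = 30, 12, 6, 1.0
X = rng.normal(size=(n, mq))
Z = rng.normal(size=(n, p))
A = np.eye(p - 1, p) - np.eye(p - 1, p, k=1)   # first-order difference matrix

# Any v strictly above both spectral radii works; 2-norm = largest singular value.
v = 1.01 * max(np.linalg.norm(X.T @ X, 2),
               np.linalg.norm(Z.T @ Z + mu * A.T @ A, 2))

block1 = v * np.eye(mq) - X.T @ X
block2 = v * np.eye(p) - Z.T @ Z - mu * A.T @ A    # equals v*I - tilde-X^T tilde-X
assert np.all(np.linalg.eigvalsh(block1) > 0)
assert np.all(np.linalg.eigvalsh(block2) > 0)
```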

Lemma 4.1

Let {ωk} be a sequence generated by Algorithm 1. Then we have

$$(\omega - \omega^{k+1})^T\bigl(F(\omega^{k+1}) + M(\xi^k - \xi^{k+1}) - G(\omega^k - \omega^{k+1})\bigr) \geq 0 \quad \forall\,\omega\in\Omega,$$

where

$$M = \begin{pmatrix} O_{mq\times(p-1)} \\ -\mu A^T \\ \mu I_{p-1} \\ O_{(p-1)\times(p-1)} \end{pmatrix}.$$

Proof.

First, we have

$$X^T(X\mathrm{vec}(B^{k+1}) - \hat{y}^k) = X^T\bigl(X\mathrm{vec}(B^{k+1}) - (y - Z\gamma^k)\bigr) = X^T(X\mathrm{vec}(B^{k+1}) - y) + X^TZ\gamma^k$$

and

$$\begin{aligned}\tilde{X}^T(\tilde{X}\gamma^{k+1} - \tilde{y}^k) &= Z^TZ\gamma^{k+1} - Z^Ty + Z^TX\mathrm{vec}(B^{k+1}) + \mu A^TA\gamma^{k+1} - \mu A^T\xi^k - A^T\alpha^k \\ &= Z^T\bigl(Z\gamma^{k+1} - (y - X\mathrm{vec}(B^{k+1}))\bigr) + \mu A^TA\gamma^{k+1} - \mu A^T\xi^k - A^T\alpha^k \\ &= Z^T\bigl(Z\gamma^{k+1} - (y - X\mathrm{vec}(B^{k+1}))\bigr) - A^T\alpha^{k+1} - \mu A^T(\xi^k - \xi^{k+1}).\end{aligned}$$

It follows from (11) that

$$\alpha^k = \alpha^{k+1} + \mu(A\gamma^{k+1} - \xi^{k+1}). \qquad (20)$$

Deriving the first-order optimality conditions of the minimization problems (14), (15) and (16), we see that the iterative scheme (11) is equivalent to finding $\omega^{k+1} = \bigl(\mathrm{vec}(B^{k+1})^T, (\gamma^{k+1})^T, (\xi^{k+1})^T, (\alpha^{k+1})^T\bigr)^T\in\Omega$, $f(B^{k+1})\in\partial\|B^{k+1}\|_*$, $g(\gamma^{k+1})\in\partial(\|\gamma^{k+1}\|_1)$ and $h(\xi^{k+1})\in\partial(\|\xi^{k+1}\|_1)$ such that

$$\begin{aligned} 0 &= \lambda_1 f(B^{k+1}) + X^T(X\mathrm{vec}(B^k) - \hat{y}^k) + v\bigl(\mathrm{vec}(B^{k+1}) - \mathrm{vec}(B^k)\bigr),\\ 0 &= \lambda_2 g(\gamma^{k+1}) + \tilde{X}^T(\tilde{X}\gamma^k - \tilde{y}^k) + v(\gamma^{k+1} - \gamma^k),\\ 0 &= \lambda_3 h(\xi^{k+1}) + \alpha^k - \mu(A\gamma^{k+1} - \xi^{k+1}),\\ 0 &= A\gamma^{k+1} - \xi^{k+1} - (\alpha^k - \alpha^{k+1})/\mu. \qquad (21)\end{aligned}$$

Inserting (20) into (21) and using the definition of $G$, (21) can be written as

$$\begin{aligned} 0 &= \bigl(\lambda_1 f(B^{k+1}) - X^T(y - Z\gamma^k - X\mathrm{vec}(B^{k+1}))\bigr) + (vI_{mq} - X^TX)\bigl(\mathrm{vec}(B^{k+1}) - \mathrm{vec}(B^k)\bigr),\\ 0 &= \bigl(\lambda_2 g(\gamma^{k+1}) - Z^T(y - Z\gamma^{k+1} - X\mathrm{vec}(B^{k+1})) - A^T\alpha^{k+1}\bigr) + (vI_p - \tilde{X}^T\tilde{X})(\gamma^{k+1} - \gamma^k) - \mu A^T(\xi^k - \xi^{k+1}),\\ 0 &= \lambda_3 h(\xi^{k+1}) + \alpha^k - \mu(A\gamma^{k+1} - \xi^{k+1}),\\ 0 &= (A\gamma^{k+1} - \xi^{k+1}) - (\alpha^k - \alpha^{k+1})/\mu.\end{aligned}$$

Then we obtain the conclusion.

The following lemma can be easily derived from Lemma 4.1. For completeness, we give the detailed proof.

Lemma 4.2

Let $\{\omega^k\}$ be a sequence generated by Algorithm 1. Then, for any $\omega^*\in\Omega^*$, we have

$$(\omega^k - \omega^*)^T G(\omega^k - \omega^{k+1}) \geq (\omega^k - \omega^{k+1})^T G(\omega^k - \omega^{k+1}) - (\alpha^k - \alpha^{k+1})^T(\xi^k - \xi^{k+1}).$$

Proof.

From Lemma 4.1, for any $\omega^*\in\Omega^*$, setting $\omega = \omega^*$ yields

$$(\omega^* - \omega^{k+1})^T\bigl(F(\omega^{k+1}) + M(\xi^k - \xi^{k+1}) - G(\omega^k - \omega^{k+1})\bigr) \geq 0, \qquad (22)$$

where $\omega^*$ is an arbitrary point in $\Omega^*$. Note that $A\gamma^* = \xi^*$ for $\omega^*\in\Omega^*$; thus (22) leads to

$$(\omega^{k+1} - \omega^*)^T G(\omega^k - \omega^{k+1}) \geq (\omega^{k+1} - \omega^*)^T F(\omega^{k+1}) - \mu(A\gamma^{k+1} - \xi^{k+1})^T(\xi^k - \xi^{k+1}).$$

Since $\alpha^k - \alpha^{k+1} = \mu(A\gamma^{k+1} - \xi^{k+1})$, the above inequality becomes

$$(\omega^{k+1} - \omega^*)^T G(\omega^k - \omega^{k+1}) \geq (\omega^{k+1} - \omega^*)^T F(\omega^{k+1}) - (\alpha^k - \alpha^{k+1})^T(\xi^k - \xi^{k+1}).$$

On the other hand, since $\|\cdot\|_1$ and $\|\cdot\|_*$ are both convex, the mapping $F(\omega)$ is monotone. We thus have

$$(\omega^{k+1} - \omega^*)^T\bigl(F(\omega^{k+1}) - F(\omega^*)\bigr) \geq 0 \qquad (23)$$

and

$$(\omega^{k+1} - \omega^*)^T F(\omega^{k+1}) \geq (\omega^{k+1} - \omega^*)^T F(\omega^*) \geq 0. \qquad (24)$$

Then, replacing $\omega^{k+1} - \omega^*$ by $(\omega^{k+1} - \omega^k) + (\omega^k - \omega^*)$ in the inequality above and using (24), we obtain the desired conclusion.

Using Lemma 4.2, we can bound the distance between the sequence $\{\omega^k\}$ generated by Algorithm 1 and the solution set.

Lemma 4.3

Let {ωk} be the sequence generated by Algorithm 1. Then we have

$$\|\omega^{k+1} - \omega^*\|_G^2 \leq \|\omega^k - \omega^*\|_G^2 - \|\omega^k - \omega^{k+1}\|_G^2 \quad \forall\,\omega^*\in\Omega^*. \qquad (25)$$

Proof.

For $\omega^*\in\Omega^*$, it follows from Lemma 4.2 that

$$\begin{aligned}\|\omega^{k+1} - \omega^*\|_G^2 &= \|\omega^k - \omega^*\|_G^2 + \|\omega^k - \omega^{k+1}\|_G^2 - 2(\omega^k - \omega^*)^T G(\omega^k - \omega^{k+1}) \\ &\leq \|\omega^k - \omega^*\|_G^2 - \|\omega^k - \omega^{k+1}\|_G^2 + 2(\alpha^k - \alpha^{k+1})^T(\xi^k - \xi^{k+1}). \qquad (26)\end{aligned}$$

As shown before, we have $\lambda_3 h(\xi^k) + \alpha^k = 0$ for any $k$. Thus, we have

$$(\xi^{k+1} - \xi^k)^T\bigl(\lambda_3 h(\xi^k) + \alpha^k\bigr) \geq 0, \qquad (\xi^k - \xi^{k+1})^T\bigl(\lambda_3 h(\xi^{k+1}) + \alpha^{k+1}\bigr) \geq 0.$$

Adding these two inequalities and using the monotonicity of the subdifferential, we obtain

$$(\alpha^k - \alpha^{k+1})^T(\xi^k - \xi^{k+1}) \leq \lambda_3(\xi^k - \xi^{k+1})^T\bigl(h(\xi^{k+1}) - h(\xi^k)\bigr) \leq 0. \qquad (27)$$

Inserting (27) into (26), we prove the assertion (25).

Lemma 4.3 implies that the sequence generated by Algorithm 1 is contractive with respect to the solution set $\Omega^*$. The following corollary follows directly from inequality (25), so we omit the proof.

Corollary 4.4

Let {ωk} be the sequence generated by Algorithm 1. Then we have

  1. $\lim_{k\to\infty}\|\omega^k - \omega^{k+1}\|_G = 0$.

  2. The sequence {ωk} is bounded.

  3. For any $\omega^*\in\Omega^*$, the sequence $\{\|\omega^k - \omega^*\|_G\}$ is monotonically non-increasing.

Now we can obtain the convergence of the linearized alternating direction method of multipliers for the LRFL matrix regression model.

Theorem 4.5

For any $\mu > 0$, $v > \rho(X^TX)$ and $v > \rho(Z^TZ + \mu A^TA)$, let $\{\omega^k = (\mathrm{vec}(B^k), \gamma^k, \xi^k, \alpha^k)\}$ be the sequence generated by Algorithm 1. Then $\{\omega^k\}$ converges to a point $\omega^\infty = (\mathrm{vec}(B^\infty), \gamma^\infty, \xi^\infty, \alpha^\infty)$, where $(B^\infty, \gamma^\infty, \xi^\infty)$ is an optimal solution of the LRFL matrix regression model (9).

Proof.

Property 1 in Corollary 4.4 means that

$$\lim_{k\to\infty}\|\mathrm{vec}(B^k) - \mathrm{vec}(B^{k+1})\| = 0,\quad \lim_{k\to\infty}\|\gamma^k - \gamma^{k+1}\| = 0,\quad \lim_{k\to\infty}\|\xi^k - \xi^{k+1}\| = 0,\quad \lim_{k\to\infty}\|\alpha^k - \alpha^{k+1}\| = 0.$$

In addition, property 2 in Corollary 4.4 implies that the sequence $\{\omega^k\}$ has at least one cluster point. We denote it by $\omega^\infty = (\mathrm{vec}(B^\infty), \gamma^\infty, \xi^\infty, \alpha^\infty)$ and let $\{\omega^{k_j}\}$ be a subsequence converging to $\omega^\infty$. Thus, we have

$$\mathrm{vec}(B^{k_j}) \to \mathrm{vec}(B^\infty),\quad \gamma^{k_j} \to \gamma^\infty,\quad \xi^{k_j} \to \xi^\infty,\quad \alpha^{k_j} \to \alpha^\infty \qquad (28)$$

and

$$\lim_{j\to\infty}\|\mathrm{vec}(B^{k_j}) - \mathrm{vec}(B^{k_j+1})\| = 0,\quad \lim_{j\to\infty}\|\gamma^{k_j} - \gamma^{k_j+1}\| = 0,\quad \lim_{j\to\infty}\|\xi^{k_j} - \xi^{k_j+1}\| = 0,\quad \lim_{j\to\infty}\|\alpha^{k_j} - \alpha^{k_j+1}\| = 0. \qquad (29)$$

Next we show that the cluster point $\omega^\infty$ satisfies the optimality condition (18). By the variational inequality (19) and (29), we have

$$\lim_{j\to\infty}(\omega - \omega^{k_j})^T F(\omega^{k_j}) \geq 0 \quad \forall\,\omega\in\Omega.$$

Then, according to (28), we obtain that

$$(\omega - \omega^\infty)^T F(\omega^\infty) \geq 0 \quad \forall\,\omega\in\Omega.$$

Thus, the limit point $\omega^\infty$ satisfies (19), i.e. $\omega^\infty\in\Omega^*$. Considering property 3 in Corollary 4.4, we have

$$\|\omega^{k+1} - \omega^\infty\|_G \leq \|\omega^k - \omega^\infty\|_G \quad \forall\,k \geq 0.$$

Therefore, the sequence $\{\omega^k\}$ has the unique cluster point $\omega^\infty$, and $(B^\infty, \gamma^\infty, \xi^\infty)$ is an optimal solution of the LRFL matrix regression model (9).

4.2. Convergence rate

Following the work in [24], we can establish a worst-case $O(1/k)$ convergence rate, measured by the iteration complexity in the ergodic sense, for Algorithm 1. That is, after $k$ iterations, the average of the $k$ iterates generated by Algorithm 1 is an approximate solution of the LRFL matrix regression model with accuracy $O(1/k)$. For succinctness, we omit the proof.

5. Numerical experiments

5.1. Simulation

We consider a class of matrix models with different ranks and sparsity levels used in [23]. Specifically, we generate matrix covariates $X$ of size $64\times 64$ and vector covariates $Z\in\mathbb{R}^{500}$, both consisting of independent standard normal entries. We set the sample size to n = 500, while the number of parameters is $64\times 64 + 500 = 4596$. We set $\gamma = (1,\ldots,1)^T\in\mathbb{R}^{500}$ and generate the true array signal as $B = B_1B_2^T$, where $B_1\in\mathbb{R}^{m\times R}$, $B_2\in\mathbb{R}^{q\times R}$, and $R$ controls the rank of the signal. Moreover, each entry of $B$ is 0 or 1, and the percentage of non-zero entries is controlled by a sparsity level constant $s$; i.e. each entry of $B_1, B_2$ is Bernoulli with probability of 1 equal to $1 - (1-s)^{1/R}$. We vary the rank R = 1, 5 and the (non-)sparsity level s = 0.01, 0.05, 0.1 (s = 0.05 means that about 5% of the entries of $B$ are 1s and the rest are 0s). We generate the response as $y = \langle B, X\rangle + \langle \gamma, Z\rangle + \varepsilon$, with $\varepsilon$ following a standard normal distribution. All computations are performed on an Intel Core(TM) i7-2640M CPU (2.80 GHz) with 8 GB RAM. The code for Algorithm 1 is written in MATLAB, and the initial point is set to $B^0 = O_{m\times q}$, $\gamma^0 = O_p$, $\xi^0 = O_{p-1}$, $\alpha^0 = 1_{p-1}$. The maximum iteration number is set to 500. For the tuning parameters $\lambda_1, \lambda_2, \lambda_3$, we search over a large grid of values and, for each model, choose the parameters giving the best test root-mean-square error (RMSE), where the test set is generated as above except that n = 2500. The other parameters are chosen as $\mu = (1+\sqrt{5})/2$ and $v = 10\max\{\rho(X^TX), \rho(Z^TZ + \mu A^TA)\}$.
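The data-generating mechanism described above can be sketched as follows (a NumPy sketch following the stated recipe; the Bernoulli probability uses the formula in the text, and all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2024)
n, m, q, p, R, s = 500, 64, 64, 500, 5, 0.05

# Low-rank, sparse true signal B = B1 @ B2^T with Bernoulli factors.
prob = 1.0 - (1.0 - s) ** (1.0 / R)
B1 = rng.binomial(1, prob, size=(m, R)).astype(float)
B2 = rng.binomial(1, prob, size=(q, R)).astype(float)
B = B1 @ B2.T
gamma = np.ones(p)

# Covariates and response y_i = <B, X_i> + <gamma, Z_i> + eps_i.
X = rng.normal(size=(n, m, q))
Z = rng.normal(size=(n, p))
eps = rng.normal(size=n)
y = np.einsum('ij,nij->n', B, X) + Z @ gamma + eps
```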

In the experiment, we simulate the model 100 times. Then we evaluate the performance of our method from two aspects: parameter estimation and prediction. For the former, we employ

$$\max\Bigl\{\frac{\|B^k - B^{k-1}\|_F}{\max\{\|B^k\|_F, 1\}},\; \frac{\|\gamma^k - \gamma^{k-1}\|_2}{\max\{\|\gamma^k\|_2, 1\}}\Bigr\} < 10^{-4}$$

as the evaluation criterion. For the latter, we use independent validation data to evaluate the prediction error measured by the RMSE of the response. For simplicity, we use LRFL to stand for LRFL matrix regression model. We report the performance in Tables 1–5. Specifically, RMSE-B is the root-mean-squared error of B for training data, RMSE-γ is the root-mean-squared error of γ for training data, RMSE-PRE is the root-mean-squared error of prediction for test data, and the numerical values in parentheses are the standard deviations of the corresponding terms. The average CPU time (in seconds) is also included.

Table 1. Performance of Algorithm 1 when the true rank of coefficient matrix R = 1.

Sparsity s RMSE-B RMSE-γ RMSE-PRE CPU
0.01 0.23(0.007) 0.05(0.025) 0.23(0.006) 1.48
0.05 0.30(0.006) 0.08(0.019) 0.31(0.016) 1.67
0.1 0.40(0.004) 0.07(0.030) 0.40(0.008) 1.58

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 2. Performance of Algorithm 1 when the true rank of coefficient matrix R = 5.

Sparsity s RMSE-B RMSE-γ RMSE-PRE CPU
0.01 0.20(0.008) 0.05(0.023) 0.20(0.014) 1.86
0.05 0.27(0.008) 0.09(0.018) 0.29(0.014) 1.54
0.1 0.40(0.005) 0.09(0.028) 0.40(0.013) 1.56

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 3. Results of comparing LRFL with matrix LASSO when the true rank of coefficient matrix R = 2.

  R = 2
Sparsity s Method RMSE-B RMSE-γ RMSE-PRE Rank
0.01 LRFL 0.38(0.055) 0.54(0.023) 0.66(0.017) 1
  matrix LASSO 0.33(0.042) 0.71(0.018) 0.78(0.037) 24.5
0.05 LRFL 0.46(0.027) 0.54(0.015) 0.70(0.018) 1
  matrix LASSO 0.38(0.019) 0.71(0.011) 0.81(0.018) 24
0.1 LRFL 0.45(0.042) 0.55(0.024) 0.72(0.031) 1
  matrix LASSO 0.38(0.029) 0.71(0.018) 0.81(0.030) 22
0.2 LRFL 0.59(0.036) 0.55(0.017) 0.80(0.037) 1
  matrix LASSO 0.54(0.017) 0.72(0.086) 0.90(0.014) 25
0.5 LRFL 0.62(0.056) 0.57(0.021) 0.84(0.037) 1
  matrix LASSO 0.95(0.020) 0.74(0.019) 1.22(0.039) 27

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 4. Results of comparing LRFL with matrix LASSO when the true rank of coefficient matrix R = 5.

  R = 5
Sparsity s Method RMSE-B RMSE-γ RMSE-PRE Rank
0.01 LRFL 0.29(0.025) 0.67(0.012) 0.75(0.032) 6.5
  matrix LASSO 0.31(0.069) 0.79(0.023) 0.85(0.028) 10.5
0.05 LRFL 0.48(0.046) 0.55(0.022) 0.72(0.028) 6
  matrix LASSO 0.43(0.028) 0.72(0.010) 0.84(0.027) 25
0.1 LRFL 0.55(0.025) 0.53(0.013) 0.74(0.023) 6
  matrix LASSO 0.44(0.031) 0.71(0.012) 0.84(0.023) 23.5
0.2 LRFL 0.66(0.018) 0.55(0.023) 0.84(0.039) 7
  matrix LASSO 0.61(0.024) 0.72(0.017) 0.94(0.023) 26
0.5 LRFL 0.80(0.060) 0.56(0.023) 0.98(0.053) 6
  matrix LASSO 0.90(0.028) 0.73(0.015) 1.15(0.046) 27.5

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 5. Results of comparing LRFL with matrix LASSO when the true rank of coefficient matrix R = 10.

  R = 10
Sparsity s Method RMSE-B RMSE-γ RMSE-PRE Rank
0.01 LRFL 0.43(0.043) 0.55(0.043) 0.70(0.039) 11
  matrix LASSO 0.31(0.057) 0.71(0.011) 0.78(0.038) 24
0.05 LRFL 0.48(0.026) 0.53(0.019) 0.72(0.023) 15
  matrix LASSO 0.34(0.032) 0.71(0.012) 0.78(0.017) 23
0.1 LRFL 0.54(0.015) 0.53(0.012) 0.76(0.027) 15.5
  matrix LASSO 0.46(0.028) 0.72(0.012) 0.86(0.035) 24
0.2 LRFL 0.72(0.014) 0.55(0.018) 0.91(0.020) 16.5
  matrix LASSO 0.68(0.015) 0.73(0.016) 1.01(0.026) 26.5
0.5 LRFL 0.88(0.072) 0.59(0.042) 1.07(0.068) 15
  matrix LASSO 1.05(0.024) 0.74(0.015) 1.29(0.033) 28

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

From Tables 1 and 2 and Figure 1, we find that the accuracy of $B$, $\gamma$ and the test prediction deteriorates as the sparsity level increases, while the standard deviations remain almost unchanged. This suggests that our method converges stably. Comparing the accuracy of $B$ and $\gamma$, the estimate of $\gamma$ is better than that of $B$.

Figure 1. Boxplots of RMSE-B, RMSE-γ and RMSE-PRE for each pair $(R, s)$, where $R\in\{1,5\}$ is the rank of $B$ and $s\in\{0.01, 0.05, 0.1\}$ is the sparsity level of $B$.

In Tables 3–5, we compare LRFL with the matrix LASSO in terms of the accuracy and the estimated rank of $B$. The ranks listed in Tables 3–5 are the medians over the 100 simulations. For the matrix LASSO method, we stack the elements of the predictor $X$ by column, so the matrix model is reformulated as a vector model that the matrix LASSO can solve. From Tables 3–5, the LASSO gives better accuracy for $B$, but LRFL gives a much better rank estimate. There is no doubt that vectorization destroys the structure of $X$; as a result, the estimated coefficient matrix is not low-rank.

5.2. Real data

In this subsection, we will use our algorithm to deal with two real datasets.

5.2.1. Signal shapes

Recently, signal shapes have attracted wide attention from researchers [11,23]. In the LRFL matrix regression model (2), $X$ is a $64\times 64$ matrix with independent standard normal entries, and $Z$ is a five-dimensional vector. We set $B$ to be binary with the true signal shapes, and generate $\gamma$ from a standard normal distribution. The response is $y = \langle X, B\rangle + \langle Z, \gamma\rangle + \varepsilon$, where $\varepsilon$ follows a normal distribution. In our numerical experiments, we vary the sample size over n = 500, 750 and 1000; the results are displayed in Tables 6–8, which report the root-mean-squared errors (RMSEs) for the vector coefficient $\gamma$, the matrix coefficient $B$ and the response $y$, together with their standard deviations. Note that the RMSEs of $B$, $\gamma$ and $y$ decrease significantly as the sample size increases.

Table 6. Performance for signal shapes of Algorithm 1 when n = 500.
Shapes RMSE-B RMSE-γ RMSE-y
Cross 0.29(0.004) 1.64(0.007) 0.20(0.002)
Square 0.32(0.004) 1.89(0.005) 0.25(0.007)
Tshape 0.32(0.005) 1.12(0.005) 0.27(0.008)
Triangle 0.31(0.001) 1.99(0.002) 0.24(0.001)

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 7. Performance for signal shapes of Algorithm 1 when n = 750.
Shapes RMSE-B RMSE-γ RMSE-y
Cross 0.21(0.001) 1.44(0.008) 0.18(0.007)
Square 0.28(0.002) 0.95(0.005) 0.23(0.004)
Tshape 0.26(0.002) 1.01(0.003) 0.22(0.005)
Triangle 0.22(0.009) 1.64(0.007) 0.19(0.004)

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 8. Performance for signal shapes of Algorithm 1 when n = 1000.
Shapes RMSE-B RMSE-γ RMSE-y
Cross 0.20(0.008) 1.29(0.001) 0.17(0.004)
Square 0.26(0.001) 0.88(0.004) 0.22(0.007)
Tshape 0.24(0.002) 0.98(0.001) 0.22(0.002)
Triangle 0.20(0.003) 0.73(0.012) 0.18(0.003)

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

5.2.2. Trip time prediction from partial trajectories

A classical example is the ECML/PKDD Discovery Challenge 2015 competition on estimating taxi travel time for complete trips [5,11]. The data contain 7733 taxi trajectories in Porto over a period of one year, where each trajectory includes multiple features. The latitude and longitude coordinates are recorded every 15 s while the taxi is running. The data also contain seven regular variates, such as trip id, call type, origin stand and day type. The trip id is a unique identifier for each trip. The call type identifies how the service was demanded: the trip was dispatched from the central, demanded directly from a taxi driver at a specific stand, or otherwise. The origin stand is a unique identifier for the taxi stand. The day type indicates a holiday or other special day, a day before a holiday, or otherwise. In the LRFL matrix regression model (2), $X$ is a matrix in $\mathbb{R}^{922\times 2}$ and $Z$ is a vector in $\mathbb{R}^7$. We remove 32 trajectories because of missing coordinate observations. We use our matrix regression model to predict taxi travel time for complete journeys.

Table 9 shows the root-mean-squared error of prediction on test data under 5-fold and 10-fold cross-validation. Furthermore, we choose 1000 trips as a test dataset and vary the size of the training dataset; the results are displayed in Table 10.

Table 9. Results of trip time prediction under 5-fold and 10-fold cross-validation.
5-fold 10-fold
Training dataset Test dataset RMSE-PRE Training dataset Test dataset RMSE-PRE
6161 1540 0.86(0.010) 6931 770 0.82(0.013)

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 10. Results of trip time prediction with different training dataset sizes. All measurements are means over 100 repetitions.
Training dataset Test dataset RMSE-PRE
200 1000 0.86(0.034)
500 1000 0.86(0.041)
1000 1000 0.81(0.089)
1500 1000 0.82(0.068)
2000 1000 0.77(0.046)

Note: The numbers in parentheses are the corresponding standard errors.

6. Summary

In this paper, we propose the LRFL matrix regression model, which combines the nuclear norm and fused LASSO penalties. The inspiration for this model comes from the fact that, in many fields, sampling units contain matrix and vector variables at the same time. In order to solve the LRFL matrix regression model, we develop a linearized ADMM algorithm and establish its global convergence. Finally, we demonstrate the efficiency of our method through numerical experiments on simulated and real datasets. Comparing the LRFL matrix regression model with the matrix LASSO, our model gives more accurate and lower-rank estimates. We mainly focus on the algorithm for solving the LRFL matrix regression model; the statistical properties of the model are also worthy of study.

Acknowledgments

The authors are very grateful to the two anonymous reviewers and the associate editor for their insightful remarks and comments, which considerably improved the presentation of the paper.

Funding Statement

The work was supported in part by the National Natural Science Foundation of China (11671029), the Fundamental Research Funds for the Central Universities (2019YJS200) and Colleges and Universities in Hebei Province Science and Technology Research Project (Z2019032).

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1.Armagan A., Dunson D. and Lee J., Generalized double Pareto shrinkage, Stat. Sin. 23 (2013), pp. 119–143. [PMC free article] [PubMed] [Google Scholar]
  • 2.Cai J., Candès E. and Shen Z., A singular value thresholding algorithm for matrix completion, SIAM J. Optim. 20 (2010), pp. 1956–1982. doi: 10.1137/080738970 [DOI] [Google Scholar]
  • 3.Candès E., Wakin M. and Boyd S., Enhancing sparsity by reweighted l1 minimization, J. Fourier Anal. Appl. 14 (2008), pp. 877–905. doi: 10.1007/s00041-008-9045-x [DOI] [Google Scholar]
  • 4.Chen B. and Kong L., High-dimensional least square matrix regression via elastic net penalty, Pac. J. Optim. 13 (2017), pp. 185–196. [Google Scholar]
  • 5.Discovery challenge: On learning from taxi GPS traces: ECML-PKDD, 2015. Available at http://www.geolink.pt/ecmlpkdd2015-challenge/.
  • 6.Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360. doi: 10.1198/016214501753382273 [DOI] [Google Scholar]
  • 7.Frank I. and Friedman J., A statistical view of some chemometrics regression tools, Technometrics 35 (1993), pp. 109–135. doi: 10.1080/00401706.1993.10485033 [DOI] [Google Scholar]
  • 8.Gabay D. and Mercier B., A dual algorithm for the solution of nonlinear variational problems via finite element approximation, Comp. Math. Appl. 2 (1976), pp. 17–40. doi: 10.1016/0898-1221(76)90003-1 [DOI] [Google Scholar]
  • 9.Glowinski R. and Marrocco A., Sur l'approximation, par éléments finis et la résolution, par pénalisation-dualité d'une classe de problémes de Dirichlet non linéaires, ESAIM: Math. Model. Numer. Anal. 9 (1975), pp. 41–76. [Google Scholar]
  • 10.Hestenes M., Multiplier and gradient methods, J. Optim. Theory Appl. 4 (1969), pp. 303–320. doi: 10.1007/BF00927673 [DOI] [Google Scholar]
  • 11.Li M. and Kong L., Double fused Lasso penalized LAD for matrix regression, Appl. Math. Comput. 357 (2019), pp. 119–138. doi: 10.1016/j.cam.2019.02.009 [DOI] [Google Scholar]
  • 12.Li X., Mo L., Yuan X. and Zhang J., Linearized alternating direction method of multipliers for sparse group and fused Lasso models, Comput. Stat. Data Anal. 79 (2014), pp. 203–221. doi: 10.1016/j.csda.2014.05.017 [DOI] [Google Scholar]
  • 13.Lin Z., Liu R. and Su Z., Linearized alternating direction method with adaptive penalty for low-rank representation, Adv. Neural Inf. Process. Syst. 24 (2011), pp. 612–620. [Google Scholar]
  • 14.Luo L., Yang J., Qian J., Tai Y. and Lu G., Robust image regression based on the extended matrix variate power exponential distribution of dependent noise, IEEE Trans. Neur. Net. Lear. 28 (2017), pp. 2168–2182. doi: 10.1109/TNNLS.2016.2573644 [DOI] [PubMed] [Google Scholar]
  • 15.Ma S., Goldfarb D. and Chen L., Fixed point and Bregman iterative methods for matrix rank minimization, Math. Program. 128 (2011), pp. 321–353. doi: 10.1007/s10107-009-0306-5 [DOI] [Google Scholar]
  • 16.Powell M., A Method for Nonlinear Constraints in Minimization Problems, Optimization. Academic Press, New York, 1969. [Google Scholar]
  • 17.Qian J., Yang J., Zhang F. and Lin Z., Robust low-rank regularized regression for face recognition with occlusion, Biometrics Workshop in Conjunction with IEEE Conference on Computer Vision and Pattern Recognition (CVPRW), Columbus, Ohio, June 23–28, 2014.
  • 18.Tibshirani R., Saunders M., Rosset S., Zhu J. and Knight K., Sparsity and smoothness via the fused Lasso, J. R. Stat. Soc. 67 (2005), pp. 91–108. doi: 10.1111/j.1467-9868.2005.00490.x [DOI] [Google Scholar]
  • 19.Wang X. and Yuan X., The linearized alternating direction method for Dantzig selector, SIAM J. Sci. Comput. 34 (2012), pp. A2792–A2811. doi: 10.1137/110833543 [DOI] [Google Scholar]
  • 20.Xie J., Yang J., Qian J., Tai Y. and Zhang H., Robust nuclear norm-based matrix regression with applications to robust face recognition, IEEE Trans. Image Process. 26 (2017), pp. 2286–2295. doi: 10.1109/TIP.2017.2662213 [DOI] [PubMed] [Google Scholar]
  • 21.Yang J., Luo L., Qian J., Tai Y., Zhang F. and Xu Y., Nuclear norm based matrix regression with applications to face recognition with occlusion and illumination changes, IEEE Trans. Pattern Anal. 39 (2017), pp. 156–171. doi: 10.1109/TPAMI.2016.2535218 [DOI] [PubMed] [Google Scholar]
  • 22.Zhang C., Nearly unbiased variable selection under minimax concave penalty, Ann. Stat. 38 (2010), pp. 894–942. doi: 10.1214/09-AOS729 [DOI] [Google Scholar]
  • 23.Zhou H. and Li L., Regularized matrix regression, J. R. Stat. Soc. Ser. B Stat. Methodol. 76 (2014), pp. 463–483. doi: 10.1111/rssb.12031 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Zou H. and Hastie T., Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005), pp. 301–320. doi: 10.1111/j.1467-9868.2005.00503.x [DOI] [Google Scholar]

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
