Journal of Applied Statistics
2020 Mar 18; 47(13-15): 2623–2640. doi: 10.1080/02664763.2020.1742296

The linearized alternating direction method of multipliers for low-rank and fused LASSO matrix regression model

M. Li, Q. Guo, W. J. Zhai, B. Z. Chen
PMCID: PMC9041721  PMID: 35707412

Abstract

Datasets in which each sampling unit contains both a matrix and a vector are increasingly common in modern scientific fields. Based on the structure of such datasets, both matrix and vector coefficients need to be estimated. Existing matrix regression models mainly focus on the matrix variable and ignore the vector variable. In order to fully explore the complex structure of such datasets, we propose a novel matrix regression model that combines the fused LASSO and nuclear norm penalties and can handle data containing matrix and vector variables simultaneously. Our main contribution is an efficient algorithm for solving the proposed low-rank and fused LASSO matrix regression model. Following an existing idea, we design a linearized alternating direction method of multipliers and establish its global convergence. Finally, we carry out numerical experiments to demonstrate the efficiency of our method. In particular, we apply our model to two real datasets, i.e. signal shapes and trip time prediction from partial trajectories.

Keywords: Matrix regression, fused LASSO, low rank, linearized alternating direction method of multipliers, global convergence

1. Introduction

In the era of big data, modern scientific applications are increasingly complex, and sampling units often combine matrices with vectors instead of containing a single form. A well-known example is the electroencephalography (EEG) study of alcoholism in [23]. The study consists of 122 subjects in two groups, an alcoholic group and a normal control group, with each subject being exposed to a stimulus. Voltage values are measured from 64 channels of electrodes placed on the subject's scalp at 256 time points, so each sampling unit is a 256×64 matrix. Vectorizing this matrix would create intricate challenges. On the one hand, the dimension is p=256×64=16384, but the sample size is only n=122≪16384. On the other hand, vectorization destroys the structural information in the matrix data. It is therefore crucial to propose a novel matrix regression model for such sampling data. Zhou and Li [23] proposed the matrix regression

$$y_i = \mathrm{trace}(X_i^T B) + Z_i^T\gamma + \varepsilon_i, \qquad (1)$$

where $y_i \in \mathbb{R}$ is the response, $B \in \mathbb{R}^{m\times q}$ and $\gamma \in \mathbb{R}^{p}$ are the regression coefficients of interest, $X_i \in \mathbb{R}^{m\times q}$ is the matrix variate, $Z_i \in \mathbb{R}^{p}$ is the vector variate, and $\varepsilon_i \in \mathbb{R}$ is the noise, which follows a normal distribution with mean 0 and standard deviation $\sigma$. However, Zhou and Li [23] only analysed the properties and the algorithm for the case $\gamma = 0$. At present, few researchers have considered the matrix regression model (1), let alone this model with penalized regularization. It is necessary to study this model because the data contain matrix and vector variables at the same time.

In this paper, we focus on the low-rank and fused LASSO (LRFL) matrix regression

$$\min_{B \in \mathbb{R}^{m\times q},\,\gamma \in \mathbb{R}^{p}}\; \frac{1}{2}\sum_{i=1}^{N}\bigl(y_i - \mathrm{trace}(X_i^T B) - Z_i^T\gamma\bigr)^2 + \lambda_1\|B\|_* + \lambda_2\|\gamma\|_1 + \lambda_3\sum_{j=2}^{p}|\gamma_j - \gamma_{j-1}|, \qquad (2)$$

where $\lambda_i > 0$ $(i = 1, 2, 3)$ are tuning parameters. Unlike typical models containing only vectors or matrices, the two $\ell_1$-norm terms in this model induce sparsity of both the coefficients $\gamma$ and their successive differences, and the nuclear norm $\|B\|_*$ induces low rank of the unknown regression matrix $B$. The sparsity of $\gamma$ helps us to choose the most important variables, and the low rank of $B$ picks up the pivotal information in the matrix variable. If we remove $B$ from model (2), the model degenerates into the fused LASSO (FLASSO) introduced by Tibshirani et al. [18]. If we remove $\gamma$ from model (2), it becomes the nuclear norm regularized matrix regression studied in Zhou and Li [23]. In this paper, we mainly focus on designing an efficient method for solving the LRFL matrix regression model (2).

In recent years, researchers have proposed many regularized regression models with different penalties, such as the power family [7], the elastic net [24], the log-penalty [1,3], SCAD [6], and MC+ [22]. Meanwhile, Zhou and Li [23] proposed a matrix regression model and considered the low rank of $B$ based on spectral regularization. There is also other work related to matrix data, such as [4,13,14,17,20,21]. However, none of these considered matrix and vector variables together. Thus, a model combining matrix and vector variables is essential to study. The basic work is to study the statistical properties and design an algorithm for computing the solution of the LRFL matrix regression model (2). Zhou and Li [23] used the Nesterov method to solve spectral regularized matrix regression. Although this method is analytically simple, it is not suitable for model (2), which involves two variables. Moreover, Li et al. [12] proposed the linearized alternating direction method of multipliers (LADMM) for solving FLASSO. Since the objective function in our model (2) is convex with respect to $B$ and $\gamma$, a natural extension to our model can be considered. Following the procedure of Li et al. [12], we develop an LADMM algorithm for solving model (2).

The rest of the paper is organized as follows. In Section 2, we introduce some preliminaries that are useful for further discussion; in particular, we present two important optimization problems that appear in our algorithm. In Section 3, we give the LADMM algorithm for the LRFL matrix regression model (2). In Section 4, we establish the convergence of the algorithm. In Section 5, we conduct extensive numerical experiments to evaluate its performance. We conclude the paper in the final section.

2. Preliminary

In this section, we introduce some preliminaries that are useful for further discussion. First, we give some notation for derivatives. Then, we briefly present the solutions of two important optimization problems.

2.1. Matrix calculation

  • If $f(B)$ is a real-valued function of $B \in \mathbb{R}^{m\times q}$, the derivative of $f$ with respect to $B$ is defined as
    $$\frac{\partial f}{\partial B} = \begin{pmatrix} \frac{\partial f}{\partial B_{11}} & \frac{\partial f}{\partial B_{12}} & \cdots & \frac{\partial f}{\partial B_{1q}} \\ \frac{\partial f}{\partial B_{21}} & \frac{\partial f}{\partial B_{22}} & \cdots & \frac{\partial f}{\partial B_{2q}} \\ \vdots & \vdots & & \vdots \\ \frac{\partial f}{\partial B_{m1}} & \frac{\partial f}{\partial B_{m2}} & \cdots & \frac{\partial f}{\partial B_{mq}} \end{pmatrix}.$$
  • If $B \in \mathbb{R}^{m\times q}$, $Y \in \mathbb{R}^{n\times p}$ and each $Y_{ij}$ is a real-valued function of $B$, the derivative of $Y$ with respect to $B$ is defined as
    $$\frac{\partial Y}{\partial B} = \begin{pmatrix} \frac{\partial Y_{11}}{\partial B_{11}} & \frac{\partial Y_{11}}{\partial B_{12}} & \cdots & \frac{\partial Y_{11}}{\partial B_{mq}} \\ \frac{\partial Y_{12}}{\partial B_{11}} & \frac{\partial Y_{12}}{\partial B_{12}} & \cdots & \frac{\partial Y_{12}}{\partial B_{mq}} \\ \vdots & \vdots & & \vdots \\ \frac{\partial Y_{np}}{\partial B_{11}} & \frac{\partial Y_{np}}{\partial B_{12}} & \cdots & \frac{\partial Y_{np}}{\partial B_{mq}} \end{pmatrix}.$$
  • For any matrices $A \in \mathbb{R}^{n\times m}$, $B \in \mathbb{R}^{m\times q}$, $C \in \mathbb{R}^{q\times n}$, the derivative of the trace function satisfies
    $$\frac{\partial\,\mathrm{tr}(ABC)}{\partial B} = A^T C^T.$$
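As a quick sanity check of the trace-derivative rule above, the following NumPy snippet (illustrative, not the authors' code) compares the analytic derivative $A^TC^T$ with a finite-difference approximation; since $\mathrm{tr}(ABC)$ is linear in $B$, the two agree up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, q = 3, 4, 5
A = rng.normal(size=(n, m))
B = rng.normal(size=(m, q))
C = rng.normal(size=(q, n))

# Analytic derivative of tr(ABC) with respect to B: A^T C^T.
analytic = A.T @ C.T

# Finite-difference check: perturb each entry of B in turn.
eps = 1e-6
numeric = np.zeros_like(B)
for i in range(m):
    for j in range(q):
        E = np.zeros_like(B)
        E[i, j] = eps
        numeric[i, j] = (np.trace(A @ (B + E) @ C) - np.trace(A @ B @ C)) / eps

assert np.allclose(analytic, numeric, atol=1e-6)
```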

2.2. Two important optimization problems

Here, we present two results which play an important role in the algorithm.

2.2.1. Soft-thresholding

The minimization problem

$$\min_{x\in\mathbb{R}^p}\Bigl\{\|x\|_1 + \frac{\beta}{2}\|x - r\|^2\Bigr\},$$

with $\beta > 0$ and $r \in \mathbb{R}^p$, has a closed-form solution, which is given by the soft-thresholding operator

$$x^* = \mathrm{shrink}(r, 1/\beta) := \mathrm{sign}(r)\odot\max\{0,\,|r| - 1/\beta\}, \qquad (3)$$

where $\mathrm{sign}(\cdot)$ is the sign function and all operations are applied entrywise.
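The soft-thresholding operator is one line in code. A minimal NumPy sketch (the function name `shrink` is chosen to match the notation above; this is illustrative, not the authors' MATLAB code):

```python
import numpy as np

def shrink(r, tau):
    """Entrywise soft-thresholding: sign(r) * max(0, |r| - tau)."""
    return np.sign(r) * np.maximum(0.0, np.abs(r) - tau)

# shrink(r, 1/beta) minimizes ||x||_1 + (beta/2) * ||x - r||^2.
r = np.array([1.5, -0.2, 0.0, -3.0])
print(shrink(r, 1.0))  # entries shrunk toward zero: [0.5, 0., 0., -2.]
```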

2.2.2. Matrix soft-thresholding

The minimization problem

$$\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\|B\|_* + \frac{\beta}{2}\|\mathrm{vec}(B) - r\|^2\Bigr\},$$

with $\beta > 0$ and $r \in \mathbb{R}^{mq}$, has a closed-form solution, which is given by

$$B^* = U\,\mathrm{diag}\bigl(\mathrm{shrink}(\lambda, 1/\beta)\bigr)V^T, \qquad (4)$$

where $R$ is the $m\times q$ matrix obtained by arranging $r$ column-wise, $U, \Lambda, V$ come from the singular value decomposition (SVD) of $R$, i.e. $R = U\Lambda V^T$, and $\lambda$ is the diagonal of $\Lambda$. The proof can be found in [2] (Theorem 2.1) or [19] (Theorem 3).
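The matrix soft-thresholding step can be sketched in NumPy as follows (an illustrative singular value thresholding helper, with `tau` playing the role of $1/\beta$; names are ours, not the paper's):

```python
import numpy as np

def shrink(v, tau):
    """Entrywise soft-thresholding."""
    return np.sign(v) * np.maximum(0.0, np.abs(v) - tau)

def matrix_shrink(R, tau):
    """Singular value thresholding: soft-threshold the singular values of R."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

rng = np.random.default_rng(1)
R = rng.normal(size=(6, 4))
B = matrix_shrink(R, 1.0)
# Thresholding singular values can only reduce the rank.
assert np.linalg.matrix_rank(B) <= np.linalg.matrix_rank(R)
```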

3. Linearized alternating direction method of multipliers

In this section, we propose the LADMM algorithm to solve the LRFL matrix regression model (2). For completeness, we first recall the ADMM algorithm. Consider a convex minimization problem

$$\min\; F(\beta) + G(\gamma) \quad \text{s.t.}\quad X\beta + Y\gamma = b,\; \beta\in\Pi,\; \gamma\in\Gamma. \qquad (5)$$

The objective function in (5) is separable and the constraint is linear. Here $\Pi\subseteq\mathbb{R}^n$ and $\Gamma\subseteq\mathbb{R}^m$ are given non-empty, closed and convex sets, $F$ and $G$ are closed convex functions, $X\in\mathbb{R}^{l\times n}$ and $Y\in\mathbb{R}^{l\times m}$ are given matrices, and $b\in\mathbb{R}^{l}$ is a given vector. Throughout, we assume that the solution set of (5) is non-empty. The augmented Lagrangian function of (5) is

$$L_\mu(\beta,\gamma,\alpha) := F(\beta) + G(\gamma) - \alpha^T(X\beta + Y\gamma - b) + \frac{\mu}{2}\|X\beta + Y\gamma - b\|_2^2,$$

where $\mu > 0$ is a penalty parameter and $\alpha\in\mathbb{R}^{l}$ is the Lagrange multiplier. The augmented Lagrangian method (ALM) of Hestenes [10] and Powell [16] can be applied to solve (5). With a given initial point $(\beta^0, \gamma^0, \alpha^0)$, the iterative scheme of the ALM for (5) is

$$(\beta^{k+1}, \gamma^{k+1}) = \arg\min\{L_\mu(\beta,\gamma,\alpha^k)\,|\,\beta\in\Pi,\,\gamma\in\Gamma\},\qquad \alpha^{k+1} = \alpha^k - \mu(X\beta^{k+1} + Y\gamma^{k+1} - b). \qquad (6)$$

The direct application of the ALM to (5) yields scheme (6), which at each iteration requires minimizing over $\beta$ and $\gamma$ jointly. The ADMM algorithm of Gabay and Mercier [8] and Glowinski and Marrocco [9] decomposes the minimization problem in (6) into two separable subproblems and minimizes them alternately:

$$\beta^{k+1} = \arg\min\{L_\mu(\beta,\gamma^k,\alpha^k)\,|\,\beta\in\Pi\},\quad \gamma^{k+1} = \arg\min\{L_\mu(\beta^{k+1},\gamma,\alpha^k)\,|\,\gamma\in\Gamma\},\quad \alpha^{k+1} = \alpha^k - \mu(X\beta^{k+1} + Y\gamma^{k+1} - b). \qquad (7)$$

Because the joint minimization in (6) is split into the alternating steps in (7), the subproblems in (7) are easier than the original problem (5). Moreover, for many applications, including the LASSO and GLASSO, the resulting subproblems in (7) have closed-form solutions. This fact makes the ADMM particularly efficient for a wide class of problems, so we consider applying it to the LRFL matrix regression model (2).

Now we analyse how to solve the LRFL matrix regression model (2) by applying the ADMM. In order to reformulate model (2), we define a matrix $A\in\mathbb{R}^{(p-1)\times p}$ as

$$A = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & -1 \end{pmatrix}.$$

In fact, denoting $y = (y_1, y_2, \ldots, y_n)^T$, $X = (\mathrm{vec}(X_1), \mathrm{vec}(X_2), \ldots, \mathrm{vec}(X_n))^T$ and $Z = (Z_1, Z_2, \ldots, Z_n)^T$, model (2) can be written as

$$\min_{B\in\mathbb{R}^{m\times q},\,\gamma\in\mathbb{R}^p}\Bigl\{\frac{1}{2}\|y - Z\gamma - X\mathrm{vec}(B)\|_2^2 + \lambda_1\|B\|_* + \lambda_2\|\gamma\|_1 + \lambda_3\|A\gamma\|_1\Bigr\}. \qquad (8)$$

Letting $\xi = A\gamma \in \mathbb{R}^{p-1}$, (8) can be rewritten as

$$\min_{B\in\mathbb{R}^{m\times q},\,\gamma\in\mathbb{R}^p,\,\xi\in\mathbb{R}^{p-1}}\Bigl\{\frac{1}{2}\|y - Z\gamma - X\mathrm{vec}(B)\|_2^2 + \lambda_1\|B\|_* + \lambda_2\|\gamma\|_1 + \lambda_3\|\xi\|_1\Bigr\} \quad \text{s.t.}\; A\gamma = \xi. \qquad (9)$$
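The reformulation hinges on the first-order difference matrix $A$: multiplying $\gamma$ by $A$ collects exactly the successive differences penalized by the fused LASSO term. A minimal NumPy sketch (the helper name is illustrative):

```python
import numpy as np

def difference_matrix(p):
    """(p-1) x p matrix A with rows (..., 1, -1, ...): (A @ g)[j] = g[j] - g[j+1]."""
    A = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    A[idx, idx] = 1.0
    A[idx, idx + 1] = -1.0
    return A

g = np.array([3.0, 1.0, 4.0, 1.0])
A = difference_matrix(4)
print(A @ g)  # successive differences: [2., -3., 3.]
```

With this helper, `np.abs(A @ g).sum()` is the fused penalty $\sum_j |\gamma_j - \gamma_{j+1}|$ from model (2).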

The augmented Lagrangian function of (9) is

$$L_\mu(B,\gamma,\xi,\alpha) := \frac{1}{2}\|y - Z\gamma - X\mathrm{vec}(B)\|_2^2 + \lambda_1\|B\|_* + \lambda_2\|\gamma\|_1 + \lambda_3\|\xi\|_1 - \alpha^T(A\gamma - \xi) + \frac{\mu}{2}\|A\gamma - \xi\|_2^2, \qquad (10)$$

where $\alpha\in\mathbb{R}^{p-1}$ is the Lagrange multiplier and $\mu > 0$ is a given penalty parameter. The iterative scheme of the ADMM for (10) is

$$B^{k+1} = \arg\min_{B\in\mathbb{R}^{m\times q}} L_\mu(B,\gamma^k,\xi^k,\alpha^k),\quad \gamma^{k+1} = \arg\min_{\gamma\in\mathbb{R}^p} L_\mu(B^{k+1},\gamma,\xi^k,\alpha^k),\quad \xi^{k+1} = \arg\min_{\xi\in\mathbb{R}^{p-1}} L_\mu(B^{k+1},\gamma^{k+1},\xi,\alpha^k),\quad \alpha^{k+1} = \alpha^k - \mu(A\gamma^{k+1} - \xi^{k+1}). \qquad (11)$$

Now let us look at the resulting subproblems in (11). First, after trivial manipulation, the $B$-subproblem in (11) can be written as

$$B^{k+1} = \arg\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\frac{1}{2}\|y - Z\gamma^k - X\mathrm{vec}(B)\|_2^2 + \lambda_1\|B\|_*\Bigr\} = \arg\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\lambda_1\|B\|_* + \frac{1}{2}\|X\mathrm{vec}(B) - \hat{y}^k\|^2\Bigr\}, \qquad (12)$$

where $\hat{y}^k = y - Z\gamma^k \in \mathbb{R}^n$. The subproblem does not have a closed-form solution because $X$ is not an identity matrix. As in Wang and Yuan [19], we linearize the quadratic term $\frac{1}{2}\|X\mathrm{vec}(B) - \hat{y}^k\|^2$ in (12), replacing it by

$$\bigl(X^T(X\mathrm{vec}(B^k) - \hat{y}^k)\bigr)^T\bigl(\mathrm{vec}(B) - \mathrm{vec}(B^k)\bigr) + \frac{v}{2}\|\mathrm{vec}(B) - \mathrm{vec}(B^k)\|_2^2,$$

where the parameter $v > 0$ controls the proximity to $B^k$. Overall, we solve the following subproblem:

$$\begin{aligned} B^{k+1} &= \arg\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\lambda_1\|B\|_* + \bigl(X^T(X\mathrm{vec}(B^k) - \hat{y}^k)\bigr)^T\bigl(\mathrm{vec}(B) - \mathrm{vec}(B^k)\bigr) + \frac{v}{2}\|\mathrm{vec}(B) - \mathrm{vec}(B^k)\|_2^2\Bigr\} \\ &= \arg\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\lambda_1\|B\|_* + \frac{v}{2}\bigl\|\mathrm{vec}(B) - \mathrm{vec}(B^k) + X^T(X\mathrm{vec}(B^k) - \hat{y}^k)/v\bigr\|_2^2\Bigr\} \\ &= \arg\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\lambda_1\|B\|_* + \frac{v}{2}\langle B - C_1, B - C_1\rangle\Bigr\} = \arg\min_{B\in\mathbb{R}^{m\times q}}\Bigl\{\lambda_1\|B\|_* + \frac{v}{2}\|B - C_1\|_F^2\Bigr\}, \qquad (13)\end{aligned}$$

where $C_1$ is the matrix with $\mathrm{vec}(C_1) = C$ and $C = \mathrm{vec}(B^k) - X^T(X\mathrm{vec}(B^k) - \hat{y}^k)/v$. Then, by (4), the closed-form solution of (13) is

$$B^{k+1} = U\,\mathrm{diag}\bigl(\mathrm{shrink}(\lambda, \lambda_1/v)\bigr)V^T, \qquad (14)$$

where $U, \Lambda, V$ come from the singular value decomposition (SVD) of $C_1$, i.e. $C_1 = U\Lambda V^T$, and $\lambda$ is the diagonal of $\Lambda$.

Second, for the $\gamma$-subproblem in (11), we briefly display its solution as follows:

$$\gamma^{k+1} = \arg\min_{\gamma\in\mathbb{R}^p}\Bigl\{\frac{1}{2}\|y - Z\gamma - X\mathrm{vec}(B^{k+1})\|_2^2 + \lambda_2\|\gamma\|_1 + \frac{\mu}{2}\|A\gamma - \xi^k - \alpha^k/\mu\|_2^2\Bigr\} = \arg\min_{\gamma\in\mathbb{R}^p}\Bigl\{\lambda_2\|\gamma\|_1 + \frac{1}{2}\|\tilde{X}\gamma - \tilde{y}^k\|^2\Bigr\},$$

where $\tilde{X} = (Z^T, \sqrt{\mu}A^T)^T$ and $\tilde{y}^k = \bigl((y - X\mathrm{vec}(B^{k+1}))^T, \sqrt{\mu}(\xi^k + \alpha^k/\mu)^T\bigr)^T$. The subproblem does not have a closed-form solution because $\tilde{X}$ is not an identity matrix. Similarly, we linearize the quadratic term $\frac{1}{2}\|\tilde{X}\gamma - \tilde{y}^k\|^2$, replacing it by

$$\bigl(\tilde{X}^T(\tilde{X}\gamma^k - \tilde{y}^k)\bigr)^T(\gamma - \gamma^k) + \frac{v}{2}\|\gamma - \gamma^k\|_2^2,$$

where the parameter $v > 0$ controls the proximity to $\gamma^k$. Overall, we solve the following subproblem:

$$\gamma^{k+1} = \arg\min_{\gamma\in\mathbb{R}^p}\Bigl\{\lambda_2\|\gamma\|_1 + \bigl(\tilde{X}^T(\tilde{X}\gamma^k - \tilde{y}^k)\bigr)^T(\gamma - \gamma^k) + \frac{v}{2}\|\gamma - \gamma^k\|_2^2\Bigr\} = \arg\min_{\gamma\in\mathbb{R}^p}\Bigl\{\lambda_2\|\gamma\|_1 + \frac{v}{2}\bigl\|\gamma - \gamma^k + \tilde{X}^T(\tilde{X}\gamma^k - \tilde{y}^k)/v\bigr\|_2^2\Bigr\}.$$

Then, according to (3), the closed-form solution of the $\gamma$-subproblem in (11) is

$$\gamma^{k+1} = \mathrm{shrink}\bigl(\gamma^k - \tilde{X}^T(\tilde{X}\gamma^k - \tilde{y}^k)/v,\; \lambda_2/v\bigr). \qquad (15)$$

Third, for the $\xi$-subproblem in (11), a closed-form solution can be obtained directly from the first-order optimality condition of the augmented Lagrangian function:

$$\xi^{k+1} = \arg\min_{\xi\in\mathbb{R}^{p-1}}\Bigl\{\lambda_3\|\xi\|_1 - (\alpha^k)^T(A\gamma^{k+1} - \xi) + \frac{\mu}{2}\|A\gamma^{k+1} - \xi\|_2^2\Bigr\} = \arg\min_{\xi\in\mathbb{R}^{p-1}}\Bigl\{\frac{\mu}{2}\|\xi - A\gamma^{k+1} + \alpha^k/\mu\|_2^2 + \lambda_3\|\xi\|_1\Bigr\}.$$

We obtain the solution by (3),

$$\xi^{k+1} = \mathrm{shrink}\bigl(A\gamma^{k+1} - \alpha^k/\mu,\; \lambda_3/\mu\bigr). \qquad (16)$$

In summary, the iterative scheme of the LADMM algorithm for the LRFL matrix regression model can be described as follows.

[Algorithm 1. LADMM for the LRFL matrix regression model: given an initial point $(B^0, \gamma^0, \xi^0, \alpha^0)$, repeat the updates (14), (15), (16) and $\alpha^{k+1} = \alpha^k - \mu(A\gamma^{k+1} - \xi^{k+1})$ until a stopping criterion is met.]

Remark 3.1

When solving the $B$-subproblem, the rank of $B^{k+1}$ is determined by $\lambda_1/v$: if $\lambda_1/v$ increases, the rank of $B^{k+1}$ decreases, and vice versa. Thus, to obtain a low-rank estimator, we only need to choose a large $\lambda_1/v$. On the other hand, for a given $\lambda_1/v$, we do not need to compute all the singular values of $C_1$: from the solution in (14), only the singular values greater than $\lambda_1/v$ matter. Thus, in the implementation of Algorithm 1, we use the truncation technique that can be found in [15].
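Putting the three closed-form updates together, the following NumPy sketch outlines one possible implementation of Algorithm 1 (the paper's implementation is in MATLAB and uses a truncated SVD; all names, the step-size choice and the stopping rule here are illustrative):

```python
import numpy as np

def shrink(v, tau):
    """Entrywise soft-thresholding."""
    return np.sign(v) * np.maximum(0.0, np.abs(v) - tau)

def ladmm_lrfl(y, X, Z, A, m, q, lam1, lam2, lam3, mu, max_iter=500, tol=1e-4):
    """Sketch of the LADMM iteration for model (2); X has rows vec(X_i)^T."""
    p = Z.shape[1]
    Xt = np.vstack([Z, np.sqrt(mu) * A])        # tilde-X = (Z^T, sqrt(mu) A^T)^T
    # Step size v must exceed the spectral radii of X^T X and Z^T Z + mu A^T A.
    v = 1.01 * max(np.linalg.norm(X.T @ X, 2),
                   np.linalg.norm(Z.T @ Z + mu * A.T @ A, 2))
    b = np.zeros(m * q)                         # b stands for vec(B)
    g = np.zeros(p)
    xi = np.zeros(p - 1)
    alpha = np.zeros(p - 1)
    for _ in range(max_iter):
        # B-step (14): linearize the quadratic, then singular value thresholding.
        yk = y - Z @ g
        C = (b - X.T @ (X @ b - yk) / v).reshape(m, q, order='F')
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        b_new = (U @ np.diag(shrink(s, lam1 / v)) @ Vt).reshape(-1, order='F')
        # gamma-step (15): linearize, then soft-thresholding.
        yt = np.concatenate([y - X @ b_new, np.sqrt(mu) * (xi + alpha / mu)])
        g_new = shrink(g - Xt.T @ (Xt @ g - yt) / v, lam2 / v)
        # xi-step (16) and multiplier update.
        xi = shrink(A @ g_new - alpha / mu, lam3 / mu)
        alpha = alpha - mu * (A @ g_new - xi)
        done = max(np.linalg.norm(b_new - b), np.linalg.norm(g_new - g)) < tol
        b, g = b_new, g_new
        if done:
            break
    return b.reshape(m, q, order='F'), g
```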

4. Convergence analysis

In this section, we focus on the convergence analysis of Algorithm 1. The procedure is similar to some existing work, such as [12]; for ease of understanding, we give a succinct proof here.

4.1. Convergence of Algorithm 1

Note that the Lagrange function of (9) is

$$\frac{1}{2}\|y - Z\gamma - X\mathrm{vec}(B)\|_2^2 + \lambda_1\|B\|_* + \lambda_2\|\gamma\|_1 + \lambda_3\|\xi\|_1 - \alpha^T(A\gamma - \xi), \qquad (17)$$

where $\alpha\in\mathbb{R}^{p-1}$ is the Lagrange multiplier. By the first-order optimality condition of (17), solving (9) is equivalent to finding $(B^*,\gamma^*,\xi^*,\alpha^*)\in\Omega := \mathbb{R}^{m\times q}\times\mathbb{R}^p\times\mathbb{R}^{p-1}\times\mathbb{R}^{p-1}$, $f(B^*)\in\partial\|B^*\|_*$ (with respect to $\mathrm{vec}(B)$), $g(\gamma^*)\in\partial(\|\gamma^*\|_1)$ and $h(\xi^*)\in\partial(\|\xi^*\|_1)$ such that

$$\begin{aligned} 0 &= \lambda_1 f(B^*) - X^T\bigl(y - Z\gamma^* - X\mathrm{vec}(B^*)\bigr),\\ 0 &= \lambda_2 g(\gamma^*) - Z^T\bigl(y - Z\gamma^* - X\mathrm{vec}(B^*)\bigr) - A^T\alpha^*,\\ 0 &= \lambda_3 h(\xi^*) + \alpha^*,\\ 0 &= A\gamma^* - \xi^*. \qquad (18)\end{aligned}$$

Here $\partial(\cdot)$ denotes the subdifferential operator of a non-smooth convex function. We denote by $\Omega^*$ the set of all elements of $\Omega$ satisfying (18). Then, using the notation $\omega = \bigl((\mathrm{vec}(B))^T, \gamma^T, \xi^T, \alpha^T\bigr)^T$, (18) can be written as a variational inequality (VI) problem: find $\omega^*\in\Omega$, $f(B^*)\in\partial\|B^*\|_*$, $g(\gamma^*)\in\partial(\|\gamma^*\|_1)$ and $h(\xi^*)\in\partial(\|\xi^*\|_1)$ such that

$$(\omega - \omega^*)^T F(\omega^*) \geq 0 \quad \forall\,\omega\in\Omega, \qquad (19)$$

where $\omega^* = \bigl(\mathrm{vec}(B^*)^T, (\gamma^*)^T, (\xi^*)^T, (\alpha^*)^T\bigr)^T$ and

$$F(\omega) = \begin{pmatrix} \lambda_1 f(B) - X^T(y - Z\gamma - X\mathrm{vec}(B)) \\ \lambda_2 g(\gamma) - Z^T(y - Z\gamma - X\mathrm{vec}(B)) - A^T\alpha \\ \lambda_3 h(\xi) + \alpha \\ A\gamma - \xi \end{pmatrix}.$$

For conciseness, we introduce the following matrix:

$$G = \begin{pmatrix} vI_{mq} - X^TX & 0 & 0 & 0 \\ 0 & vI_p - \tilde{X}^T\tilde{X} & 0 & 0 \\ 0 & 0 & \mu I_{p-1} & 0 \\ 0 & 0 & 0 & \frac{1}{\mu}I_{p-1} \end{pmatrix}.$$

For the proof to go through, the matrix $G$ must be positive definite. Since $\tilde{X} = (Z^T, \sqrt{\mu}A^T)^T$, the positive definiteness of $G$ is guaranteed by the conditions $v > \rho(X^TX)$ and $v > \rho(Z^TZ + \mu A^TA)$, where $\rho(\cdot)$ denotes the spectral radius of a matrix. In order to establish the convergence of the LADMM algorithm, in the following lemma we characterize the $(k+1)$th iteration of Algorithm 1 as a VI problem.
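The positive definiteness condition on $G$ is easy to check numerically. The sketch below (illustrative dimensions and data) verifies that choosing $v$ just above both spectral radii makes the two non-trivial diagonal blocks of $G$ positive definite:

```python
import numpy as np

rng = np.random.default_rng(3)
n, mq, p, mu = 30, 12, 6, 1.0
X = rng.normal(size=(n, mq))
Z = rng.normal(size=(n, p))
A = np.eye(p - 1, p) - np.eye(p - 1, p, k=1)   # first-order difference matrix

# Any v strictly above both spectral radii works; 2-norm = largest singular value.
v = 1.01 * max(np.linalg.norm(X.T @ X, 2),
               np.linalg.norm(Z.T @ Z + mu * A.T @ A, 2))

block1 = v * np.eye(mq) - X.T @ X
block2 = v * np.eye(p) - Z.T @ Z - mu * A.T @ A    # equals v*I - tilde-X^T tilde-X
assert np.all(np.linalg.eigvalsh(block1) > 0)
assert np.all(np.linalg.eigvalsh(block2) > 0)
```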

Lemma 4.1

Let {ωk} be a sequence generated by Algorithm 1. Then we have

$$(\omega - \omega^{k+1})^T\bigl(F(\omega^{k+1}) + M(\xi^k - \xi^{k+1}) - G(\omega^k - \omega^{k+1})\bigr) \geq 0 \quad \forall\,\omega\in\Omega,$$

where

$$M = \begin{pmatrix} O_{mq\times(p-1)} \\ -\mu A^T \\ \mu I_{p-1} \\ O_{(p-1)\times(p-1)} \end{pmatrix}.$$

Proof.

First, we have

$$X^T(X\mathrm{vec}(B^{k+1}) - \hat{y}^k) = X^T\bigl(X\mathrm{vec}(B^{k+1}) - (y - Z\gamma^k)\bigr) = X^T(X\mathrm{vec}(B^{k+1}) - y) + X^TZ\gamma^k$$

and

$$\begin{aligned}\tilde{X}^T(\tilde{X}\gamma^{k+1} - \tilde{y}^k) &= Z^TZ\gamma^{k+1} - Z^Ty + Z^TX\mathrm{vec}(B^{k+1}) + \mu A^TA\gamma^{k+1} - \mu A^T\xi^k - A^T\alpha^k \\ &= Z^T\bigl(Z\gamma^{k+1} - (y - X\mathrm{vec}(B^{k+1}))\bigr) + \mu A^TA\gamma^{k+1} - \mu A^T\xi^k - A^T\alpha^k \\ &= Z^T\bigl(Z\gamma^{k+1} - (y - X\mathrm{vec}(B^{k+1}))\bigr) - A^T\alpha^{k+1} - \mu A^T(\xi^k - \xi^{k+1}).\end{aligned}$$

It follows from (11) that

$$\alpha^k = \alpha^{k+1} + \mu(A\gamma^{k+1} - \xi^{k+1}). \qquad (20)$$

Deriving the first-order optimality conditions of the minimization problems (14), (15) and (16), we see that the iterative scheme (11) is equivalent to finding $\omega^{k+1} = \bigl(\mathrm{vec}(B^{k+1})^T, (\gamma^{k+1})^T, (\xi^{k+1})^T, (\alpha^{k+1})^T\bigr)^T\in\Omega$, $f(B^{k+1})\in\partial\|B^{k+1}\|_*$, $g(\gamma^{k+1})\in\partial(\|\gamma^{k+1}\|_1)$ and $h(\xi^{k+1})\in\partial(\|\xi^{k+1}\|_1)$ such that

$$\begin{aligned} 0 &= \lambda_1 f(B^{k+1}) + X^T(X\mathrm{vec}(B^k) - \hat{y}^k) + v\bigl(\mathrm{vec}(B^{k+1}) - \mathrm{vec}(B^k)\bigr),\\ 0 &= \lambda_2 g(\gamma^{k+1}) + \tilde{X}^T(\tilde{X}\gamma^k - \tilde{y}^k) + v(\gamma^{k+1} - \gamma^k),\\ 0 &= \lambda_3 h(\xi^{k+1}) + \alpha^k - \mu(A\gamma^{k+1} - \xi^{k+1}),\\ 0 &= A\gamma^{k+1} - \xi^{k+1} - (\alpha^k - \alpha^{k+1})/\mu. \qquad (21)\end{aligned}$$

Inserting (20) into (21) and using the definition of $G$, (21) can be written as

$$\begin{aligned} 0 &= \bigl(\lambda_1 f(B^{k+1}) - X^T(y - Z\gamma^k - X\mathrm{vec}(B^{k+1}))\bigr) + (vI_{mq} - X^TX)\bigl(\mathrm{vec}(B^{k+1}) - \mathrm{vec}(B^k)\bigr),\\ 0 &= \bigl(\lambda_2 g(\gamma^{k+1}) - Z^T(y - Z\gamma^{k+1} - X\mathrm{vec}(B^{k+1})) - A^T\alpha^{k+1}\bigr) + (vI_p - \tilde{X}^T\tilde{X})(\gamma^{k+1} - \gamma^k) - \mu A^T(\xi^k - \xi^{k+1}),\\ 0 &= \lambda_3 h(\xi^{k+1}) + \alpha^k - \mu(A\gamma^{k+1} - \xi^{k+1}),\\ 0 &= (A\gamma^{k+1} - \xi^{k+1}) - (\alpha^k - \alpha^{k+1})/\mu.\end{aligned}$$

Then we obtain the conclusion.

The following lemma can be easily derived from Lemma 4.1. For completeness, we give the detailed proof.

Lemma 4.2

Let $\{\omega^k\}$ be a sequence generated by Algorithm 1. Then, for any $\omega^*\in\Omega^*$, we have

$$(\omega^k - \omega^*)^T G(\omega^k - \omega^{k+1}) \geq (\omega^k - \omega^{k+1})^T G(\omega^k - \omega^{k+1}) - (\alpha^k - \alpha^{k+1})^T(\xi^k - \xi^{k+1}).$$

Proof.

From Lemma 4.1, for any $\omega^*\in\Omega^*$, setting $\omega = \omega^*$ yields

$$(\omega^* - \omega^{k+1})^T\bigl(F(\omega^{k+1}) + M(\xi^k - \xi^{k+1}) - G(\omega^k - \omega^{k+1})\bigr) \geq 0, \qquad (22)$$

where $\omega^*$ is an arbitrary point in $\Omega^*$. Note that $A\gamma^* = \xi^*$ for $\omega^*\in\Omega^*$; thus (22) leads to

$$(\omega^{k+1} - \omega^*)^T G(\omega^k - \omega^{k+1}) \geq (\omega^{k+1} - \omega^*)^T F(\omega^{k+1}) - \mu(A\gamma^{k+1} - \xi^{k+1})^T(\xi^k - \xi^{k+1}).$$

Since $\alpha^k - \alpha^{k+1} = \mu(A\gamma^{k+1} - \xi^{k+1})$, the above inequality becomes

$$(\omega^{k+1} - \omega^*)^T G(\omega^k - \omega^{k+1}) \geq (\omega^{k+1} - \omega^*)^T F(\omega^{k+1}) - (\alpha^k - \alpha^{k+1})^T(\xi^k - \xi^{k+1}).$$

On the other hand, since $\|\cdot\|_1$ and $\|\cdot\|_*$ are both convex, the mapping $F(\omega)$ is monotone. We thus have

$$(\omega^{k+1} - \omega^*)^T\bigl(F(\omega^{k+1}) - F(\omega^*)\bigr) \geq 0 \qquad (23)$$

and

$$(\omega^{k+1} - \omega^*)^T F(\omega^{k+1}) \geq (\omega^{k+1} - \omega^*)^T F(\omega^*) \geq 0. \qquad (24)$$

Then, replacing $\omega^{k+1} - \omega^*$ by $(\omega^{k+1} - \omega^k) + (\omega^k - \omega^*)$ in the inequality above and using (24), we obtain the desired conclusion.

Using Lemma 4.2, we can bound the distance between the sequence $\{\omega^k\}$ generated by Algorithm 1 and the solution set.

Lemma 4.3

Let {ωk} be the sequence generated by Algorithm 1. Then we have

$$\|\omega^{k+1} - \omega^*\|_G^2 \leq \|\omega^k - \omega^*\|_G^2 - \|\omega^k - \omega^{k+1}\|_G^2 \quad \forall\,\omega^*\in\Omega^*. \qquad (25)$$

Proof.

For $\omega^*\in\Omega^*$, it follows from Lemma 4.2 that

$$\begin{aligned}\|\omega^{k+1} - \omega^*\|_G^2 &= \|\omega^k - \omega^*\|_G^2 + \|\omega^k - \omega^{k+1}\|_G^2 - 2(\omega^k - \omega^*)^T G(\omega^k - \omega^{k+1}) \\ &\leq \|\omega^k - \omega^*\|_G^2 - \|\omega^k - \omega^{k+1}\|_G^2 + 2(\alpha^k - \alpha^{k+1})^T(\xi^k - \xi^{k+1}). \qquad (26)\end{aligned}$$

As shown before, we have $\lambda_3 h(\xi^k) + \alpha^k = 0$ for any $k$. Thus, we have

$$(\xi^{k+1} - \xi^k)^T\bigl(\lambda_3 h(\xi^k) + \alpha^k\bigr) \geq 0, \qquad (\xi^k - \xi^{k+1})^T\bigl(\lambda_3 h(\xi^{k+1}) + \alpha^{k+1}\bigr) \geq 0.$$

Adding these two inequalities and using the monotonicity of the subdifferential, we obtain

$$(\alpha^k - \alpha^{k+1})^T(\xi^k - \xi^{k+1}) \leq \lambda_3(\xi^k - \xi^{k+1})^T\bigl(h(\xi^{k+1}) - h(\xi^k)\bigr) \leq 0. \qquad (27)$$

Inserting (27) into (26), we prove the assertion (25).

Lemma 4.3 implies that the sequence generated by Algorithm 1 is contractive with respect to the solution set $\Omega^*$. The following corollary follows directly from inequality (25), so we omit the proof.

Corollary 4.4

Let {ωk} be the sequence generated by Algorithm 1. Then we have

  1. $\lim_{k\to\infty}\|\omega^k - \omega^{k+1}\|_G = 0$.

  2. The sequence {ωk} is bounded.

  3. For any $\omega^*\in\Omega^*$, the sequence $\{\|\omega^k - \omega^*\|_G\}$ is monotonically non-increasing.

Now we can obtain the convergence of the linearized alternating direction method of multipliers for the LRFL matrix regression model.

Theorem 4.5

For any $\mu > 0$, $v > \rho(X^TX)$ and $v > \rho(Z^TZ + \mu A^TA)$, let $\{\omega^k = (\mathrm{vec}(B^k), \gamma^k, \xi^k, \alpha^k)\}$ be the sequence generated by Algorithm 1. Then $\{\omega^k\}$ converges to a point $\omega^\infty = (\mathrm{vec}(B^\infty), \gamma^\infty, \xi^\infty, \alpha^\infty)$, where $(B^\infty, \gamma^\infty, \xi^\infty)$ is an optimal solution of the LRFL matrix regression model (9).

Proof.

Property 1 in Corollary 4.4 means that

$$\lim_{k\to\infty}\|\mathrm{vec}(B^k) - \mathrm{vec}(B^{k+1})\| = 0,\quad \lim_{k\to\infty}\|\gamma^k - \gamma^{k+1}\| = 0,\quad \lim_{k\to\infty}\|\xi^k - \xi^{k+1}\| = 0,\quad \lim_{k\to\infty}\|\alpha^k - \alpha^{k+1}\| = 0.$$

In addition, property 2 in Corollary 4.4 implies that the sequence $\{\omega^k\}$ has at least one cluster point. We denote it by $\omega^\infty = (\mathrm{vec}(B^\infty), \gamma^\infty, \xi^\infty, \alpha^\infty)$ and let $\{\omega^{k_j}\}$ be a subsequence converging to $\omega^\infty$. Thus, we have

$$\mathrm{vec}(B^{k_j}) \to \mathrm{vec}(B^\infty),\quad \gamma^{k_j} \to \gamma^\infty,\quad \xi^{k_j} \to \xi^\infty,\quad \alpha^{k_j} \to \alpha^\infty \qquad (28)$$

and

$$\lim_{j\to\infty}\|\mathrm{vec}(B^{k_j}) - \mathrm{vec}(B^{k_j+1})\| = 0,\quad \lim_{j\to\infty}\|\gamma^{k_j} - \gamma^{k_j+1}\| = 0,\quad \lim_{j\to\infty}\|\xi^{k_j} - \xi^{k_j+1}\| = 0,\quad \lim_{j\to\infty}\|\alpha^{k_j} - \alpha^{k_j+1}\| = 0. \qquad (29)$$

Next we show that the cluster point $\omega^\infty$ satisfies the optimality condition (18). By the variational inequality (19) and (29), we have

$$\lim_{j\to\infty}(\omega - \omega^{k_j})^T F(\omega^{k_j}) \geq 0 \quad \forall\,\omega\in\Omega.$$

Then, according to (28), we obtain that

$$(\omega - \omega^\infty)^T F(\omega^\infty) \geq 0 \quad \forall\,\omega\in\Omega.$$

Thus, the limit point $\omega^\infty$ satisfies (19), i.e. $\omega^\infty\in\Omega^*$. Considering property 3 in Corollary 4.4, we have

$$\|\omega^{k+1} - \omega^\infty\|_G \leq \|\omega^k - \omega^\infty\|_G \quad \forall\,k \geq 0.$$

Therefore, the sequence $\{\omega^k\}$ has the unique cluster point $\omega^\infty$, and $(B^\infty, \gamma^\infty, \xi^\infty)$ is an optimal solution of the LRFL matrix regression model (9).

4.2. Convergence rate

Following the work in [24], we can establish a worst-case $O(1/k)$ convergence rate, measured by the iteration complexity in the ergodic sense, for Algorithm 1. That is, after $k$ iterations, the average of the $k$ iterates generated by Algorithm 1 is an approximate solution of the LRFL matrix regression model with accuracy $O(1/k)$. For succinctness, we omit the proof.

5. Numerical experiments

5.1. Simulation

We consider a class of matrix models with different ranks and sparsity levels used in [23]. Specifically, we generate matrix covariates $X$ of size $64\times 64$ and vector covariates $Z\in\mathbb{R}^{500}$, both consisting of independent standard normal entries. We set the sample size to n = 500, while the number of parameters is $64\times 64 + 500 = 4596$. We set $\gamma = (1,\ldots,1)^T\in\mathbb{R}^{500}$ and generate the true array signal as $B = B_1B_2^T$, where $B_1\in\mathbb{R}^{m\times R}$, $B_2\in\mathbb{R}^{q\times R}$, and $R$ controls the rank of the signal. Moreover, each entry of $B$ is 0 or 1, and the percentage of non-zero entries is controlled by a sparsity level constant $s$; i.e. each entry of $B_1, B_2$ is Bernoulli with probability of 1 equal to $1 - (1-s)^{1/R}$. We vary the rank R = 1, 5 and the (non-)sparsity level s = 0.01, 0.05, 0.1 (s = 0.05 means that about 5% of the entries of $B$ are 1s and the rest are 0s). We generate the response as $y = \langle B, X\rangle + \langle \gamma, Z\rangle + \varepsilon$, with $\varepsilon$ following a standard normal distribution. All computations are performed on an Intel Core(TM) i7-2640M CPU (2.80 GHz) with 8 GB RAM. The code for Algorithm 1 is written in MATLAB, and the initial point is set to $B^0 = O_{m\times q}$, $\gamma^0 = O_p$, $\xi^0 = O_{p-1}$, $\alpha^0 = 1_{p-1}$. The maximum iteration number is set to 500. For the tuning parameters $\lambda_1, \lambda_2, \lambda_3$, we search over a large grid of values and, for each model, choose the parameters giving the best test root-mean-square error (RMSE), where the test set is generated as above except that n = 2500. The other parameters are chosen as $\mu = (1+\sqrt{5})/2$ and $v = 10\max\{\rho(X^TX), \rho(Z^TZ + \mu A^TA)\}$.
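The data-generating mechanism described above can be sketched as follows (a NumPy sketch following the stated recipe; the Bernoulli probability uses the formula in the text, and all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2024)
n, m, q, p, R, s = 500, 64, 64, 500, 5, 0.05

# Low-rank, sparse true signal B = B1 @ B2^T with Bernoulli factors.
prob = 1.0 - (1.0 - s) ** (1.0 / R)
B1 = rng.binomial(1, prob, size=(m, R)).astype(float)
B2 = rng.binomial(1, prob, size=(q, R)).astype(float)
B = B1 @ B2.T
gamma = np.ones(p)

# Covariates and response y_i = <B, X_i> + <gamma, Z_i> + eps_i.
X = rng.normal(size=(n, m, q))
Z = rng.normal(size=(n, p))
eps = rng.normal(size=n)
y = np.einsum('ij,nij->n', B, X) + Z @ gamma + eps
```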

In the experiment, we simulate the model 100 times. Then we evaluate the performance of our method from two aspects: parameter estimation and prediction. For the former, we employ

$$\max\Bigl\{\frac{\|B^k - B^{k-1}\|_F}{\max\{\|B^k\|_F, 1\}},\; \frac{\|\gamma^k - \gamma^{k-1}\|_2}{\max\{\|\gamma^k\|_2, 1\}}\Bigr\} < 10^{-4}$$

as the evaluation criterion. For the latter, we use independent validation data to evaluate the prediction error measured by the RMSE of the response. For simplicity, we use LRFL to stand for LRFL matrix regression model. We report the performance in Tables 1–5. Specifically, RMSE-B is the root-mean-squared error of B for training data, RMSE-γ is the root-mean-squared error of γ for training data, RMSE-PRE is the root-mean-squared error of prediction for test data, and the numerical values in parentheses are the standard deviations of the corresponding terms. The average CPU time (in seconds) is also included.

Table 1. Performance of Algorithm 1 when the true rank of coefficient matrix R = 1.

Sparsity s RMSE-B RMSE-γ RMSE-PRE CPU
0.01 0.23(0.007) 0.05(0.025) 0.23(0.006) 1.48
0.05 0.30(0.006) 0.08(0.019) 0.31(0.016) 1.67
0.1 0.40(0.004) 0.07(0.030) 0.40(0.008) 1.58

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 2. Performance of Algorithm 1 when the true rank of coefficient matrix R = 5.

Sparsity s RMSE-B RMSE-γ RMSE-PRE CPU
0.01 0.20(0.008) 0.05(0.023) 0.20(0.014) 1.86
0.05 0.27(0.008) 0.09(0.018) 0.29(0.014) 1.54
0.1 0.40(0.005) 0.09(0.028) 0.40(0.013) 1.56

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 3. Results of comparing LRFL with matrix LASSO when the true rank of coefficient matrix R = 2.

  R = 2
Sparsity s Method RMSE-B RMSE-γ RMSE-PRE Rank
0.01 LRFL 0.38(0.055) 0.54(0.023) 0.66(0.017) 1
  matrix LASSO 0.33(0.042) 0.71(0.018) 0.78(0.037) 24.5
0.05 LRFL 0.46(0.027) 0.54(0.015) 0.70(0.018) 1
  matrix LASSO 0.38(0.019) 0.71(0.011) 0.81(0.018) 24
0.1 LRFL 0.45(0.042) 0.55(0.024) 0.72(0.031) 1
  matrix LASSO 0.38(0.029) 0.71(0.018) 0.81(0.030) 22
0.2 LRFL 0.59(0.036) 0.55(0.017) 0.80(0.037) 1
  matrix LASSO 0.54(0.017) 0.72(0.086) 0.90(0.014) 25
0.5 LRFL 0.62(0.056) 0.57(0.021) 0.84(0.037) 1
  matrix LASSO 0.95(0.020) 0.74(0.019) 1.22(0.039) 27

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 4. Results of comparing LRFL with matrix LASSO when the true rank of coefficient matrix R = 5.

  R = 5
Sparsity s Method RMSE-B RMSE-γ RMSE-PRE Rank
0.01 LRFL 0.29(0.025) 0.67(0.012) 0.75(0.032) 6.5
  matrix LASSO 0.31(0.069) 0.79(0.023) 0.85(0.028) 10.5
0.05 LRFL 0.48(0.046) 0.55(0.022) 0.72(0.028) 6
  matrix LASSO 0.43(0.028) 0.72(0.010) 0.84(0.027) 25
0.1 LRFL 0.55(0.025) 0.53(0.013) 0.74(0.023) 6
  matrix LASSO 0.44(0.031) 0.71(0.012) 0.84(0.023) 23.5
0.2 LRFL 0.66(0.018) 0.55(0.023) 0.84(0.039) 7
  matrix LASSO 0.61(0.024) 0.72(0.017) 0.94(0.023) 26
0.5 LRFL 0.80(0.060) 0.56(0.023) 0.98(0.053) 6
  matrix LASSO 0.90(0.028) 0.73(0.015) 1.15(0.046) 27.5

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 5. Results of comparing LRFL with matrix LASSO when the true rank of coefficient matrix R = 10.

  R = 10
Sparsity s Method RMSE-B RMSE-γ RMSE-PRE Rank
0.01 LRFL 0.43(0.043) 0.55(0.043) 0.70(0.039) 11
  matrix LASSO 0.31(0.057) 0.71(0.011) 0.78(0.038) 24
0.05 LRFL 0.48(0.026) 0.53(0.019) 0.72(0.023) 15
  matrix LASSO 0.34(0.032) 0.71(0.012) 0.78(0.017) 23
0.1 LRFL 0.54(0.015) 0.53(0.012) 0.76(0.027) 15.5
  matrix LASSO 0.46(0.028) 0.72(0.012) 0.86(0.035) 24
0.2 LRFL 0.72(0.014) 0.55(0.018) 0.91(0.020) 16.5
  matrix LASSO 0.68(0.015) 0.73(0.016) 1.01(0.026) 26.5
0.5 LRFL 0.88(0.072) 0.59(0.042) 1.07(0.068) 15
  matrix LASSO 1.05(0.024) 0.74(0.015) 1.29(0.033) 28

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

From Tables 1 and 2 and Figure 1, we find that the accuracy of $B$, $\gamma$ and the test prediction deteriorates as the sparsity level increases, while the standard deviations remain almost unchanged. This suggests that our method converges stably. Comparing the accuracy of $B$ and $\gamma$, the estimate of $\gamma$ is better than that of $B$.

Figure 1. Boxplots of RMSE-B, RMSE-γ and RMSE-PRE for each pair $(R, s)$, where $R\in\{1,5\}$ is the rank of $B$ and $s\in\{0.01, 0.05, 0.1\}$ is the sparsity level of $B$.

In Tables 3–5, we compare LRFL with the matrix LASSO in terms of the accuracy and the estimated rank of $B$. The ranks listed in Tables 3–5 are the medians over the 100 simulations. For the matrix LASSO method, we stack the elements of the predictor $X$ by column, so the matrix model is reformulated as a vector model that the matrix LASSO can solve. From Tables 3–5, the LASSO gives better accuracy for $B$, but LRFL gives a much better rank estimate. There is no doubt that vectorization destroys the structure of $X$; as a result, the estimated coefficient matrix is not low-rank.

5.2. Real data

In this subsection, we will use our algorithm to deal with two real datasets.

5.2.1. Signal shapes

Recently, signal shapes have attracted wide attention from researchers [11,23]. In the LRFL matrix regression model (2), $X$ is a $64\times 64$ matrix with independent standard normal entries, and $Z$ is a five-dimensional vector. We set $B$ to be binary with the true signal shapes, and generate $\gamma$ from a standard normal distribution. The response is $y = \langle X, B\rangle + \langle Z, \gamma\rangle + \varepsilon$, where $\varepsilon$ follows a normal distribution. In our numerical experiments, we vary the sample size over n = 500, 750 and 1000; the results are displayed in Tables 6–8, which report the root-mean-squared errors (RMSEs) for the vector coefficient $\gamma$, the matrix coefficient $B$ and the response $y$, together with their standard deviations. Note that the RMSEs of $B$, $\gamma$ and $y$ decrease significantly as the sample size increases.

Table 6. Performance for signal shapes of Algorithm 1 when n = 500.
Shapes RMSE-B RMSE-γ RMSE-y
Cross 0.29(0.004) 1.64(0.007) 0.20(0.002)
Square 0.32(0.004) 1.89(0.005) 0.25(0.007)
Tshape 0.32(0.005) 1.12(0.005) 0.27(0.008)
Triangle 0.31(0.001) 1.99(0.002) 0.24(0.001)

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 7. Performance for signal shapes of Algorithm 1 when n = 750.
Shapes RMSE-B RMSE-γ RMSE-y
Cross 0.21(0.001) 1.44(0.008) 0.18(0.007)
Square 0.28(0.002) 0.95(0.005) 0.23(0.004)
Tshape 0.26(0.002) 1.01(0.003) 0.22(0.005)
Triangle 0.22(0.009) 1.64(0.007) 0.19(0.004)

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 8. Performance for signal shapes of Algorithm 1 when n = 1000.
Shapes RMSE-B RMSE-γ RMSE-y
Cross 0.20(0.008) 1.29(0.001) 0.17(0.004)
Square 0.26(0.001) 0.88(0.004) 0.22(0.007)
Tshape 0.24(0.002) 0.98(0.001) 0.22(0.002)
Triangle 0.20(0.003) 0.73(0.012) 0.18(0.003)

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

5.2.2. Trip time prediction from partial trajectories

A classical example is the ECML/PKDD Discovery Challenge 2015 competition on estimating taxi travel time for complete trips [5,11]. The data contain 7733 taxi trajectories in Porto over a period of one year, where each trajectory includes multiple features. The latitude and longitude coordinates are recorded every 15 s while the taxi is running. The data also contain seven regular variates, such as trip id, call type, origin stand and day type. The trip id is a unique identifier for each trip. The call type identifies how the service was demanded: the trip was dispatched from the central, demanded directly from a taxi driver at a specific stand, or otherwise. The origin stand is a unique identifier for the taxi stand. The day type indicates a holiday or other special day, a day before a holiday, or otherwise. In the LRFL matrix regression model (2), $X$ is a matrix in $\mathbb{R}^{922\times 2}$ and $Z$ is a vector in $\mathbb{R}^7$. We remove 32 trajectories because of missing coordinate observations. We use our matrix regression model to predict taxi travel time for complete journeys.

Table 9 shows the root-mean-squared error of prediction on test data under 5-fold and 10-fold cross-validation. Furthermore, we choose 1000 trips as a test dataset and vary the size of the training dataset; the results are displayed in Table 10.

Table 9. Results of trip time prediction under 5-fold and 10-fold cross-validation.
5-fold 10-fold
Training dataset Test dataset RMSE-PRE Training dataset Test dataset RMSE-PRE
6161 1540 0.86(0.010) 6931 770 0.82(0.013)

Notes: All measurements are means over 100 repetitions. The numbers in parentheses are the corresponding standard errors.

Table 10. Results of trip time prediction with different training dataset sizes. All measurements are means over 100 repetitions.
Training dataset Test dataset RMSE-PRE
200 1000 0.86(0.034)
500 1000 0.86(0.041)
1000 1000 0.81(0.089)
1500 1000 0.82(0.068)
2000 1000 0.77(0.046)

Note: The numbers in parentheses are the corresponding standard errors.

6. Summary

In this paper, we propose the LRFL matrix regression model, which combines the nuclear norm and fused LASSO penalties. The inspiration for this model comes from the fact that, in many fields, sampling units contain matrix and vector variables at the same time. In order to solve the LRFL matrix regression model, we develop a linearized ADMM algorithm and establish its global convergence. Finally, we demonstrate the efficiency of our method through numerical experiments on simulated and real datasets. Comparing the LRFL matrix regression model with the matrix LASSO, our model gives more accurate and lower-rank estimates. We mainly focus on the algorithm for solving the LRFL matrix regression model; the statistical properties of the model are also worthy of study.

Acknowledgments

The authors are very grateful to the two anonymous reviewers and the associate editor for their insightful remarks and comments, which considerably improved the presentation of the paper.

Funding Statement

The work was supported in part by the National Natural Science Foundation of China (11671029), the Fundamental Research Funds for the Central Universities (2019YJS200) and Colleges and Universities in Hebei Province Science and Technology Research Project (Z2019032).

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1.Armagan A., Dunson D. and Lee J., Generalized double Pareto shrinkage, Stat. Sin. 23 (2013), pp. 119–143. [PMC free article] [PubMed] [Google Scholar]
  • 2.Cai J., Candès E. and Shen Z., A singular value thresholding algorithm for matrix completion, SIAM J. Optim. 20 (2010), pp. 1956–1982. doi: 10.1137/080738970 [DOI] [Google Scholar]
  • 3.Candès E., Wakin M. and Boyd S., Enhancing sparsity by reweighted l1 minimization, J. Fourier Anal. Appl. 14 (2008), pp. 877–905. doi: 10.1007/s00041-008-9045-x [DOI] [Google Scholar]
  • 4.Chen B. and Kong L., High-dimensional least square matrix regression via elastic net penalty, Pac. J. Optim. 13 (2017), pp. 185–196. [Google Scholar]
  • 5.Discovery challenge: On learning from taxi GPS traces: ECML-PKDD, 2015. Available at http://www.geolink.pt/ecmlpkdd2015-challenge/.
  • 6.Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360. doi: 10.1198/016214501753382273 [DOI] [Google Scholar]
  • 7.Frank I. and Friedman J., A statistical view of some chemometrics regression tools, Technometrics 35 (1993), pp. 109–135. doi: 10.1080/00401706.1993.10485033 [DOI] [Google Scholar]
  • 8.Gabay D. and Mercier B., A dual algorithm for the solution of nonlinear variational problems via finite element approximation, Comp. Math. Appl. 2 (1976), pp. 17–40. doi: 10.1016/0898-1221(76)90003-1 [DOI] [Google Scholar]
  • 9.Glowinski R. and Marrocco A., Sur l'approximation, par éléments finis et la résolution, par pénalisation-dualité d'une classe de problémes de Dirichlet non linéaires, ESAIM: Math. Model. Numer. Anal. 9 (1975), pp. 41–76. [Google Scholar]
  • 10.Hestenes M., Multiplier and gradient methods, J. Optim. Theory Appl. 4 (1969), pp. 303–320. doi: 10.1007/BF00927673 [DOI] [Google Scholar]
  • 11.Li M. and Kong L., Double fused Lasso penalized LAD for matrix regression, Appl. Math. Comput. 357 (2019), pp. 119–138. doi: 10.1016/j.cam.2019.02.009 [DOI] [Google Scholar]
  • 12.Li X., Mo L., Yuan X. and Zhang J., Linearized alternating direction method of multipliers for sparse group and fused Lasso models, Comput. Stat. Data Anal. 79 (2014), pp. 203–221. doi: 10.1016/j.csda.2014.05.017 [DOI] [Google Scholar]
  • 13.Lin Z., Liu R. and Su Z., Linearized alternating direction method with adaptive penalty for low-rank representation, Adv. Neural Inf. Process. Syst. 24 (2011), pp. 612–620. [Google Scholar]
  • 14.Luo L., Yang J., Qian J., Tai Y. and Lu G., Robust image regression based on the extended matrix variate power exponential distribution of dependent noise, IEEE Trans. Neur. Net. Lear. 28 (2017), pp. 2168–2182. doi: 10.1109/TNNLS.2016.2573644 [DOI] [PubMed] [Google Scholar]
  • 15.Ma S., Goldfarb D. and Chen L., Fixed point and Bregman iterative methods for matrix rank minimization, Math. Program. 128 (2011), pp. 321–353. doi: 10.1007/s10107-009-0306-5 [DOI] [Google Scholar]
  • 16.Powell M., A Method for Nonlinear Constraints in Minimization Problems, Optimization. Academic Press, New York, 1969. [Google Scholar]
  • 17.Qian J., Yang J., Zhang F. and Lin Z., Robust low-rank regularized regression for face recognition with occlusion, Biometrics Workshop in Conjunction with IEEE Conference on Computer Vision and Pattern Recognition (CVPRW), Columbus, Ohio, June 23–28, 2014.
  • 18.Tibshirani R., Saunders M., Rosset S., Zhu J. and Knight K., Sparsity and smoothness via the fused Lasso, J. R. Stat. Soc. 67 (2005), pp. 91–108. doi: 10.1111/j.1467-9868.2005.00490.x [DOI] [Google Scholar]
  • 19.Wang X. and Yuan X., The linearized alternating direction method for Dantzig selector, SIAM J. Sci. Comput. 34 (2012), pp. A2792–A2811. doi: 10.1137/110833543 [DOI] [Google Scholar]
  • 20.Xie J., Yang J., Qian J., Tai Y. and Zhang H., Robust nuclear norm-based matrix regression with applications to robust face recognition, IEEE Trans. Image Process. 26 (2017), pp. 2286–2295. doi: 10.1109/TIP.2017.2662213 [DOI] [PubMed] [Google Scholar]
  • 21.Yang J., Luo L., Qian J., Tai Y., Zhang F. and Xu Y., Nuclear norm based matrix regression with applications to face recognition with occlusion and illumination changes, IEEE Trans. Pattern Anal. 39 (2017), pp. 156–171. doi: 10.1109/TPAMI.2016.2535218 [DOI] [PubMed] [Google Scholar]
  • 22.Zhang C., Nearly unbiased variable selection under minimax concave penalty, Ann. Stat. 38 (2010), pp. 894–942. doi: 10.1214/09-AOS729 [DOI] [Google Scholar]
  • 23.Zhou H. and Li L., Regularized matrix regression, J. R. Stat. Soc. Ser. B Stat. Methodol. 76 (2014), pp. 463–483. doi: 10.1111/rssb.12031 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Zou H. and Hastie T., Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005), pp. 301–320. doi: 10.1111/j.1467-9868.2005.00503.x [DOI] [Google Scholar]

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
