Author manuscript; available in PMC: 2014 Oct 9.
Published in final edited form as: KDD. 2012;2012:480–488. doi: 10.1145/2339530.2339609

Optimal Exact Least Squares Rank Minimization

Shuo Xiang 1,2, Yunzhang Zhu 3, Xiaotong Shen 3, Jieping Ye 1,2
PMCID: PMC4191838  NIHMSID: NIHMS497477  PMID: 25309806

Abstract

In multivariate analysis, rank minimization emerges when a low-rank structure of matrices is desired together with a small estimation error. Rank minimization is nonconvex and generally NP-hard, which poses a major computational challenge. In this paper, we consider a nonconvex least squares formulation, which seeks to minimize the least squares loss function subject to a rank constraint. Computationally, we develop efficient algorithms to compute a global solution as well as an entire regularization solution path. Theoretically, we show that our method reconstructs the oracle estimator exactly from noisy data. As a result, it recovers the true rank optimally against any method and yields sharper parameter estimation than its counterpart. Finally, the utility of the proposed method is demonstrated by simulations and image reconstruction from a noisy background.

General Terms: Algorithms

Keywords: Nonconvex, global optimality, rank minimization

1. INTRODUCTION

In multivariate analysis, estimation of lower-dimensional structures has received attention in statistics, signal processing and machine learning. One such structure is a low rank of matrices [5, 22], where the rank measures the dimension of a multivariate response. Rank minimization approximates multivariate data with the smallest possible rank of matrices. It has many applications in, for instance, multi-task learning [6, 11], multi-class classification [2], matrix completion [8, 17], collaborative filtering [33, 1], clustering [20, 29], and computer vision [35, 21, 18], among others. The central topic this article addresses is least squares rank minimization.

Consider multi-response linear regression in which a k-dimensional response vector zi follows

$$z_i = \Theta^T a_i + \varepsilon_i; \qquad \mathbb{E}\,\varepsilon_i = 0,\ \ \mathrm{Cov}(\varepsilon_i) = \sigma^2 I_{k \times k}; \qquad i = 1, \ldots, n, \qquad (1)$$

where ai is a p-dimensional design vector, Θ is a p × k regression parameter matrix, and components of εi are independent. Model (1) reduces to the widely-used linear model in compressed sensing when k = 1 and becomes a multivariate autoregressive model with ai = zi−1. Denote the rank of Θ as r(Θ) and rewrite (1) in the matrix form as follows:

$$Z = A\Theta + e, \qquad (2)$$

where Z = (z_1, ⋯, z_n)^T ∈ ℝ^{n×k}, A = (a_1, ⋯, a_n)^T ∈ ℝ^{n×p} and e = (ε_1, ⋯, ε_n)^T ∈ ℝ^{n×k} are the data, design and error matrices. In (1), we estimate Θ based on n pairs of observation vectors (a_i, z_i)_{i=1}^n, with a priori knowledge that r(Θ) is relatively small in comparison to min(n, k, p), where the numbers of unknown parameters k and p can greatly exceed the sample size n.

Least squares rank minimization, as described, solves

$$\min_{\Theta}\ \|A\Theta - Z\|_F^2 \quad \text{s.t.}\ r(\Theta) \le s, \qquad (3)$$

where ‖·‖_F is the Frobenius-norm and s is an integer-valued tuning parameter taking values in [1, min(n, k, p)]. General rank minimization is nonconvex and NP-hard [23], much like L0-minimization in univariate analysis. Consequently, neither an exact global solution to (3) nor its statistical properties are known, due primarily to the discreteness and non-convexity of the rank function.

Estimation under the restriction that r(Θ) = r has been studied when n → ∞ with k and p held fixed; see [3, 4, 15, 28, 26]. Two major computational approaches have been proposed for approximating the optimal solution of (3). The first involves regularization with a surrogate function, such as the nuclear-norm, which is the convex envelope of the rank function [13] and can be handled by efficient algorithms [8, 19, 34, 25, 16]. In some cases, the solution of this convex problem coincides with a global minimizer of (3) under certain isometry assumptions [27]. However, these assumptions can be strong and difficult to check. Recently, [7] obtained a global minimizer of a regularized version of (3).

The second approach attacks (3) by approximating the rank function iteratively, either by calculating the largest singular vector through greedy search [30] or by singular value projection (SVP) through a local gradient method [17]. Under isometry assumptions [27, 10, 9] weaker than those of the nuclear-norm approach, these methods guarantee an exact solution of (3), but they suffer from the same difficulties as the regularization method [30], although they have achieved promising results on both simulated and real-world data.

Theoretically, error bounds under the Frobenius-norm for the first, regularization-based approach are obtained in [24], and rank selection consistency is established in [7]. Unfortunately, to the best of our knowledge, whether similar conclusions hold for our formulation (3) remains largely unknown.

In this paper, we advance on two fronts. Computationally, we derive a general closed form for a global minimizer of (3) in Theorem 1, and give a condition under which (3) and its nonconvex regularized counterpart are equivalent with regard to global minimizers, although the two methods are not generally equivalent. Moreover, we develop an efficient algorithm for computing the entire regularization solution path at the cost of computing only one solution for a single regularization parameter. Theoretically, we establish optimality for a global minimizer of (3). More specifically, the proposed method is optimal against any other method in that it reconstructs the oracle estimator exactly, and thus the true rank, under (1). It is important to note that this exact recovery result is a much stronger property than consistency, which is attributed to the discrete nature of the rank function as well as of the tuning parameter s. Such a result may not be shared by its regularized counterpart with a continuous tuning parameter. In addition, the method enjoys a higher degree of accuracy for parameter estimation than nuclear-norm rank estimation.

After the first draft of this paper was completed, we became aware that [14] and [32] gave an expression of the solution in Theorem 1. However, neither paper considered the computational and statistical aspects of the solution. Inevitably, some partial overlap exists between our Theorem 1 and theirs.

The rest of the paper is organized as follows. Section 2 presents a closed-form solution to (3). Section 3 gives an efficient path algorithm for a regularized version of (3). Section 4 is devoted to theoretical investigation, followed by Section 5 discussing methods for tuning. Section 6 presents proofs for all the theorems we develop and Section 7 presents results of empirical evaluations, where several rank minimization methods are compared. Section 8 concludes the paper.

Notation: Table 1 summarizes the notations used in the rest of the paper.

Table 1.

Notation used throughout the paper

Notation   Description
A          The design matrix
e          The error matrix
Θ0         The ground truth model
Θ̂_s        Estimator obtained by optimizing (3)
Θ_λ^*      Estimator obtained by optimizing (7)
r(A)       The rank of matrix A
℘_s(Z)     The best rank-s approximation of matrix Z in terms of the Frobenius-norm

2. PROPOSED METHOD: CLOSED-FORM SOLUTION

This section derives a closed-form solution to (3). The strategy is to simplify (3) through the singular value decomposition (SVD) of the matrix A and the properties of the rank function. Before proceeding, we present a motivating lemma, known as the Eckart-Young theorem [12].

Lemma 1. The best rank-s approximation, in terms of the Frobenius-norm, of a rank-t matrix Z with t ≥ s, i.e., a global minimizer Θ* of

$$\min_{\Theta}\ \|\Theta - Z\|_F^2 \quad \text{s.t.}\ r(\Theta) \le s \qquad (4)$$

is given by Θ* = ℘_s(Z) = U_z D_s V_z^T, where D_s retains the largest s singular values of Z in the SVD Z = U_z D V_z^T.

Intuitively ℘s(Z) may be viewed as a projection of Z onto a set of matrices whose ranks are no more than s. Note that (4) is a special case of (3) with matrix A being the identity matrix. This motivates us to solve (3) through the simpler problem (4).
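
To make the projection concrete, the following short numpy sketch computes ℘_s(Z) by truncating the SVD; the function name best_rank_s is ours, introduced only for illustration.

```python
import numpy as np

def best_rank_s(Z, s):
    """Best rank-s approximation of Z in Frobenius norm (Eckart-Young),
    i.e., the projection P_s(Z) used throughout the paper."""
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    d[s:] = 0.0                       # keep only the s largest singular values
    return U @ np.diag(d) @ Vt
```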

When A is nonsingular, (3) clearly has a global minimizer A⁻¹℘_s(Z), since multiplication by a nonsingular matrix preserves rank. When A is singular, we first assume that r(A) ≥ s; this assumption is by no means mandatory and will be relaxed later. Given the SVD A = UDV^T, with orthogonal matrices U ∈ ℝ^{n×n} and V ∈ ℝ^{p×p} and diagonal matrix D ∈ ℝ^{n×p}, we have

$$\|A\Theta - Z\|_F = \|U^T(A\Theta - Z)\|_F = \|DV^T\Theta - U^TZ\|_F.$$

This follows from the fact that the Frobenius-norm is invariant under any orthogonal transformation. Let Y = V^TΘ and W = U^TZ; clearly r(Y) = r(Θ). Solving (3) now amounts to solving the equivalent problem:

$$\min_{Y}\ \|DY - W\|_F^2 \quad \text{s.t.}\ r(Y) \le s. \qquad (5)$$

Consequently, a global minimizer of (3) becomes VY*, where Y* is a global minimizer of (5) and is given by the following theorem.

Theorem 1. Let D, Y, Z and s be as defined above. If s ≤ r(A), then a global minimizer of (5) is given by

$$Y^* = \begin{bmatrix} D_{r(A)}^{-1}\,\wp_s(W_{r(A)}) \\ a \end{bmatrix}, \qquad (6)$$

where D_{r(A)} is a diagonal matrix consisting of all the nonzero singular values of A, a can be set to the zero matrix, and W_{r(A)} consists of the first r(A) rows of W.

Here are some remarks regarding the above theorem:

Remark 1. The solution to problem (5) is generally not unique. Specifically, the matrix a in (6) need not be fixed at zero, as long as it does not change the rank of Y*. However, if A is of full column rank, i.e., when r(A) = p, then a vanishes and Y* can be uniquely determined. In this case, the optimal solution of (3) is also unique.

Remark 2. The optimal Y* can also be computed for a general matrix A of arbitrary rank, i.e., when r(A) < s. See the proof of Theorem 1 in Section 6.

It is important to note that the value of a is irrelevant for prediction but matters for parameter estimation. In other words, when r(A) < p, a global minimizer is not unique, and hence parameter estimation is not identifiable; see Section 4 for a discussion. For simplicity, we set a = 0 in Y* subsequently.

In what follows, our estimator is defined as Θ̂_s, together with the estimated rank r(Θ̂_s). Algorithm 1 below summarizes the main steps for computing Θ̂_s for s ≤ min(n, k, p), where LSRM stands for Least Squares Rank Minimization.

Algorithm 1.

Exact solution of (3)

Input: A, Z, s ≤ r(A)
Output: A global minimizer Θ of (3)
Function: LSRM(A, Z, s)
  1: if A is nonsingular then
  2:   Θ = A−1s(Z)
  3: else
  4:   Perform SVD on A: A = UDVT
  5:   Extract the first r(A) rows of U^TZ and denote them by W_{r(A)}
  6:   Θ = V [D_{r(A)}^{-1} ℘_s(W_{r(A)}); 0]
  7: end if
  8: return Θ

The complexity of Algorithm 1 is determined mainly by its most expensive operations, matrix inversion and SVD: at most one matrix inversion and two SVDs are required. These operations have roughly cubic time complexity¹.
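
For illustration, here is a minimal numpy sketch mirroring Algorithm 1 under the assumption s ≤ r(A); it reuses the best_rank_s helper sketched above, sets the free block a to zero, and is not the authors' implementation.

```python
import numpy as np

def lsrm(A, Z, s):
    """Sketch of Algorithm 1 (LSRM): a global minimizer of (3) with r(Theta) <= s,
    assuming s <= r(A) and taking a = 0 in (6)."""
    n, p = A.shape
    r = np.linalg.matrix_rank(A)
    if r == n == p:                          # A nonsingular: Theta = A^{-1} P_s(Z)
        return np.linalg.solve(A, best_rank_s(Z, s))
    U, d, Vt = np.linalg.svd(A)              # full SVD: A = U D V^T
    W_r = (U.T @ Z)[:r, :]                   # first r(A) rows of U^T Z
    Y_top = np.diag(1.0 / d[:r]) @ best_rank_s(W_r, s)     # D_{r(A)}^{-1} P_s(W_{r(A)})
    Y = np.vstack([Y_top, np.zeros((p - r, Z.shape[1]))])  # pad with a = 0
    return Vt.T @ Y                          # Theta = V Y
```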

3. REGULARIZATION AND SOLUTION PATH

This section studies a regularized counterpart of (3):

$$\min_{\Theta}\ \|A\Theta - Z\|_F^2 + \lambda\, r(\Theta), \qquad (7)$$

where λ > 0 is a continuous regularization parameter corresponding to s in (3), and Θ_λ^* denotes a global minimizer of (7). The next theorem establishes an equivalence between (7) and (3) when Θ_λ^* is unique, which occurs when r(A) = p. Such a result is not generally anticipated for a nonconvex problem.

Theorem 2 (Equivalence). When p = r(A), (7) has a unique global minimizer. Moreover, (7) and (3) are equivalent with respect to their solutions: for any Θ_λ^* with λ ≥ 0, there exists s* = r(Θ_λ^*) ≥ 1 such that Θ_λ^* = Θ̂_{s*}, and vice versa.

Next we develop an algorithm for computing an entire solution path for all values of λ with complexity comparable to that of solving (7) at a single λ-value. For motivation, first consider a special case of the identity A in (7):

$$g(\lambda) = \min_{\Theta}\ \|\Theta - Z\|_F^2 + \lambda\, r(\Theta). \qquad (8)$$

3.1 Monotone property

In (8), r(Θ_λ^*) decreases as λ increases from 0, passing through every integer value from r(Z) down to 0 as λ becomes sufficiently large. In addition, g(λ) is nondecreasing in λ. The next theorem summarizes these results.

Theorem 3 (Monotone property). Let r(Z) be r. Then the following properties hold:

  1. There exists a solution path vector 𝒮 of length r + 2 satisfying the following:
     $$\mathcal{S}_0 = 0,\quad \mathcal{S}_{r+1} = +\infty,\quad \mathcal{S}_{k+1} > \mathcal{S}_k,\ \ k = 0, 1, \ldots, r; \qquad \Theta_\lambda^* = \wp_{r-k}(Z)\ \ \text{if}\ \ \mathcal{S}_k \le \lambda < \mathcal{S}_{k+1},$$
  2. Function g(λ) is nondecreasing and piecewise linear.

The monotone property leads to an efficient algorithm for calculating the pathwise solution of (8). Figure 1 displays the solution path by plotting g(λ) and r(Θ_λ^*) as functions of λ.

Figure 1. Piecewise linearity of g(·) and the rank of the optimal solution with respect to λ.

3.2 Pathwise algorithm

Through the monotone property, we compute the optimal solution of (8) at a particular λ by locating λ in the correct interval in the solution path vector 𝒮, which can be achieved efficiently via a simple binary search. Algorithm 2 describes the main steps.

Algorithm 2.

Pathwise solution of (8)

Input: Θ, Z
Output: Solution path vector 𝒮, pathwise solution Θ
Function: pathwise(Θ, Z)
  1: Initialize: 𝒮_0 = 0, Θ^0 = Z, r = r(Z)
  2: Perform SVD on Z: Z = UDV^T
  3: for i = r down to 1 do
  4:   𝒮_{r−i+1} ← σ_i²
  5:   Θ^{r−i+1} = Θ^{r−i} − σ_i u_i v_i^T
  6: end for
  7: return 𝒮, Θ

Algorithm 2 requires only one SVD operation, therefore its complexity is of the same order as that of Algorithm 1 at a single s-value. When Z is a low-rank matrix, existing software for SVD computation such as PROPACK is applicable to further improve computational efficiency.
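
A minimal numpy sketch of Algorithm 2 follows; it returns the kink points 𝒮 and the corresponding path of solutions, and is meant only to illustrate the update Θ^{r−i+1} = Θ^{r−i} − σ_i u_i v_i^T rather than to serve as a reference implementation.

```python
import numpy as np

def pathwise(Z, tol=1e-12):
    """Sketch of Algorithm 2: solution path of (8) for the identity design.
    path[k] is the rank-(r-k) minimizer, valid for S[k] <= lambda < S[k+1]."""
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    r = int(np.sum(d > tol))                 # numerical rank of Z
    S = [0.0]
    Theta = Z.astype(float).copy()
    path = [Theta.copy()]                    # rank-r solution
    for i in range(r, 0, -1):                # peel off singular values, smallest first
        S.append(d[i - 1] ** 2)              # S_{r-i+1} = sigma_i^2
        Theta = Theta - d[i - 1] * np.outer(U[:, i - 1], Vt[i - 1, :])
        path.append(Theta.copy())            # rank-(i-1) solution
    S.append(np.inf)                         # S_{r+1} = +infinity
    return S, path
```

Given a particular λ, a binary search locating the interval 𝒮_k ≤ λ < 𝒮_{k+1} then returns path[k].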

3.3 Extension to general A

For a general design matrix A, note that

$$\|A\Theta - Z\|_F^2 = \|DY - W\|_F^2 = \|W_{\bar r}\|_F^2 + \|D_r Y_r - W_r\|_F^2,$$

where $W = \begin{bmatrix} W_r \\ W_{\bar r} \end{bmatrix}$ is partitioned into its first r = r(A) rows W_r and the remaining rows W_{\bar r}. After dropping the constant term ‖W_{\bar r}‖²_F, we solve

$$\min_{Y_r}\ \|D_r Y_r - W_r\|_F^2 + \lambda\, r(Y_r).$$

Since D_r is nonsingular, the problem reduces to the simple case

$$\min_{\hat Y}\ \|\hat Y - W_r\|_F^2 + \lambda\, r(\hat Y),$$

where Ŷ = D_r Y_r. The solution path can then be obtained directly from Algorithm 2.
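
Under the same assumptions as the sketches above, this reduction can be coded directly by transforming to Ŷ = D_r Y_r, running the identity-design path algorithm, and mapping each path solution back through V; the function pathwise refers to the sketch given after Algorithm 2.

```python
import numpy as np

def pathwise_general(A, Z, tol=1e-12):
    """Sketch of the Section 3.3 reduction: solution path of (7) for a general
    design A, obtained by reusing pathwise() on the transformed problem."""
    U, d, Vt = np.linalg.svd(A)
    r = int(np.sum(d > tol))                     # r = r(A)
    W_r = (U.T @ Z)[:r, :]                       # W_r: first r(A) rows of U^T Z
    S, Yhat_path = pathwise(W_r)                 # path in Yhat = D_r Y_r
    p, k = A.shape[1], Z.shape[1]
    Theta_path = []
    for Yhat in Yhat_path:                       # map back: Theta = V [D_r^{-1} Yhat; 0]
        Y = np.vstack([np.diag(1.0 / d[:r]) @ Yhat, np.zeros((p - r, k))])
        Theta_path.append(Vt.T @ Y)
    return S, Theta_path
```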

4. STATISTICAL PROPERTIES

This section is devoted to the theoretical investigation of least squares rank minimization, which remains largely unexplored, although nuclear-norm regularization has been studied. In particular, we reveal the best possible performance for prediction as well as the optimal risk for parameter estimation, and we establish the optimality of the proposed method. In fact, the proposed method reconstructs the oracle estimator, the optimal estimator obtained as if the true rank were known in advance. Here the oracle estimator Θ̂⁰ is defined as a global minimizer of ‖AΘ − Z‖²_F subject to r(Θ) = r0, where Θ0 and r0 = r(Θ0) ≥ 1 denote the true parameter matrix and the true rank, respectively. This leads to exact rank recovery, in addition to reconstruction of the optimal performance of the oracle estimator. In other words, the proposed method is optimal against any other method, such as nuclear-norm rank regularization.

Given the design matrix A, we study the accuracy of rank recovery as well as prediction and parameter estimation. Let ℙ and 𝔼 be the true probability and expectation under Θ0 given A. For rank recovery, we use the metric ℙ(r̂ = r0), where r̂ = r(Θ̂_s) is the estimated rank. For prediction and parameter estimation, we employ the risks 𝔼K(Θ̂_s, Θ0) and 𝔼‖Θ̂_s − Θ0‖²_F, respectively, where

$$K(\hat\Theta_s, \Theta_0) = (2\sigma^2 n)^{-1}\sum_{i=1}^{n}\|a_i^T(\hat\Theta_s - \Theta_0)\|_2^2 = (2\sigma^2 n)^{-1}\|A(\hat\Theta_s - \Theta_0)\|_F^2$$

is the Kullback-Leibler loss and ‖·‖_2 is the L2-norm of a vector. Note that the predictive risk equals 2σ²𝔼K(Θ̂_s, Θ0), and parameter estimation is considered only when it is identifiable, i.e., when r(A) = p.

Now we present the risk bounds under (1) without a Gaussian error assumption.

Theorem 4. Under (1), the oracle estimator is exactly reconstructed by our method in that Θ̂_{r0} = Θ̂⁰ under ℙ, when r0 ≤ min(r(A), p). As a result, exact reconstruction of the optimal performance is achieved by our estimator Θ̂_{r0}. In particular,

$$\mathbb{E}\,K(\hat\Theta_{r_0}, \Theta_0)\ \begin{cases} = \dfrac{r_0 k}{2n} & \text{if } r_0 = r(A) \\[6pt] \le \dfrac{2\,\mathbb{E}\big(\sum_{j=1}^{r_0}\sigma_j^2\big)}{n} & \text{if } r_0 < r(A) \end{cases}$$

and

$$\mathbb{E}\,\|\hat\Theta_{r_0} - \Theta_0\|_F^2\ \begin{cases} = \dfrac{r_0 k}{\sigma_{\min}^2\, n} & \text{if } r_0 = r(A) = p \\[6pt] \le \dfrac{\mathbb{E}\big(\sum_{j=1}^{r_0}\sigma_j^2\big)}{\sigma_{\min}^2\, n} & \text{if } r_0 < r(A) = p, \end{cases}$$

where σ_j and σ_min > 0 are the jth largest and the smallest nonzero singular values of e′ = (U^Te)_{r(A)} and n^{−1/2}A, respectively, and (U^Te)_{r(A)} denotes the first r(A) rows of U^Te.

Remark: In general, 𝔼∑_{j=1}^{r0} σ_j² ≤ r0 𝔼σ_1².

Theorem 4 says that the optimal oracle estimator is exactly reconstructed by our method. Interestingly, the true rank is exactly recovered from noisy data, which is attributed to the discreteness of the rank and is analogous to maximum likelihood estimation over a discrete parameter space. Concerning prediction and parameter estimation, the optimal Kullback-Leibler risk is of order r0k/n, while the risk under the Frobenius-norm is of order r0k/(σ_min² n). For prediction, only the effective degrees of freedom ∑_{j=1}^{r0} σ_j² matters, as opposed to p; this is in contrast to a rate of kp/n without a rank restriction, and it permits p to be much larger than n, i.e., k, p ≫ n. For estimation, however, p enters the risk through σ_min², and p cannot be larger than n, i.e., max(k, p) ≤ n.

5. TUNING

As shown in Section 4, theoretically, exact rank reconstruction can be accomplished through tuning. Practically, we employ a predictive measure for rank selection.

The predictive performance of Θ̂s is measured by

$$\mathrm{MSE}(\hat\Theta_s) = \frac{1}{n}\,\mathbb{E}\,\|A\hat\Theta_s - Z\|_F^2,$$

which is proportional to the risk R(Θ̂s), where the expectation is taken with respect to (Z, A).

To estimate s over the integers in {0, 1, ⋯, min(n, p, k)}, one may use cross-validation with a tuning data set, which estimates the MSE. Alternatively, one may use the generalized degrees of freedom [31] through data perturbation, without a tuning set:

$$\widehat{\mathrm{GDF}}(\hat\Theta_s) = n^{-1}\|Z - A\hat\Theta_s\|_F^2 + 2n^{-1}\sum_{i=1}^{n}\sum_{l=1}^{k}\widehat{\mathrm{Cov}}\big(z_{il}, (a_i^T\hat\Theta_s)_l\big), \qquad (9)$$

where Ĉov(z_{il}, (a_i^TΘ̂_s)_l) is the estimated covariance between the lth component of z_i and the lth component of a_i^TΘ̂_s. In the case that ε_i in (1) follows N(0, σ²I_{k×k}), the method of data perturbation of [31] is applicable. Specifically, sample e_i* from N(0, σ²I_{k×k}) and let

$$Z^* = Z + \tau e^*, \qquad \widehat{\mathrm{Cov}}\big(z_{il}, (a_i^T\hat\Theta_s)_l\big) = \tau^{-1}\,\mathrm{Cov}^*\big(z_{il}^*, (a_i^T\hat\Theta_s^*)_l\big),$$

where Cov*(z*_{il}, (a_i^TΘ̂*_s)_l) is the Monte Carlo sample covariance based on T perturbations. For the types of problems we consider, we fix T to be n.
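
As a concrete, deliberately simple illustration of the tuning-set option above, the following sketch selects s by minimizing the tuning-set prediction error; it reuses the lsrm sketch from Section 2 and does not implement the GDF/data-perturbation estimator.

```python
import numpy as np

def tune_s(A_train, Z_train, A_tune, Z_tune, s_max):
    """Sketch: pick s in {1, ..., s_max} minimizing the tuning-set MSE of the
    LSRM fit; a stand-in for the cross-validation option described above."""
    best_s, best_mse = None, np.inf
    for s in range(1, s_max + 1):
        Theta_hat = lsrm(A_train, Z_train, s)
        mse = np.linalg.norm(A_tune @ Theta_hat - Z_tune, 'fro') ** 2 / A_tune.shape[0]
        if mse < best_mse:
            best_s, best_mse = s, mse
    return best_s, best_mse
```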

6. PROOF OF THEOREMS

In this section we present detailed proofs of the theorems developed in the previous sections. We first present a technical lemma to be used in the proof of Theorem 4.

Lemma 2. Suppose A and B are two n₁ × n₂ matrices. Then,

$$\langle A, B\rangle \le \|A\|_F\,\|\wp_{r(A)}(B)\|_F, \qquad (10)$$

where ⟨A, B⟩ = Tr(A^TB) = Tr(B^TA), Tr denotes the trace, and r(A) is the rank of A.

Proof. Let the singular value decompositions of A and B be A = U₁Σ₁V₁^T and B = U₂Σ₂V₂^T, where U_i and V_i, i = 1, 2, are orthogonal matrices, and Σ₁ and Σ₂ are diagonal matrices whose diagonal elements are the singular values of A and B, respectively. Then

$$\langle A, B\rangle = \mathrm{Tr}(V_1\Sigma_1^T U_1^T U_2\Sigma_2 V_2^T) = \mathrm{Tr}(\Sigma_1^T U_1^T U_2\Sigma_2 V_2^T V_1) = \mathrm{Tr}(\Sigma_1^T U\Sigma_2 V^T),$$

where U = U₁^TU₂ and V = V₁^TV₂ continue to be orthogonal. Let the ordered singular values of A be σ₁ ≥ ⋯ ≥ σ_{r(A)}, and write B̃ = (b̃_{ij}) = UΣ₂V^T. By the Cauchy-Schwarz inequality,

$$\mathrm{Tr}(\Sigma_1^T U\Sigma_2 V^T) = \mathrm{Tr}(\Sigma_1^T\tilde{B}) = \sum_{i=1}^{r(A)}\sigma_i\tilde{b}_{ii} \le \sqrt{\sum_{i=1}^{r(A)}\sigma_i^2}\,\sqrt{\sum_{i=1}^{r(A)}\tilde{b}_{ii}^2} = \|A\|_F\sqrt{\sum_{i=1}^{r(A)}\tilde{b}_{ii}^2}. \qquad (11)$$

Similarly, let the ordered singular values of B be η₁ ≥ ⋯ ≥ η_{r(B)}. Then it suffices to show that ∑_{i=1}^{r(A)} b̃_{ii}² ≤ ∑_{i=1}^{r(A)} η_i². Assume, without loss of generality, that η_i = 0 if i > r(B). Let n = min(n₁, n₂). By the Cauchy-Schwarz inequality,

$$\sum_{i=1}^{r(A)}\tilde{b}_{ii}^2 = \sum_{i=1}^{r(A)}\Big(\sum_{k=1}^{n}u_{ik}\eta_k v_{ik}\Big)^2 \le \sum_{i=1}^{r(A)}\Big(\sum_{k=1}^{n}u_{ik}^2\eta_k^2\Big)\Big(\sum_{k=1}^{n}v_{ik}^2\Big) \le \sum_{i=1}^{r(A)}\Big(\sum_{k=1}^{n}u_{ik}^2\eta_k^2\Big) = \sum_{k=1}^{n}\eta_k^2\Big(\sum_{i=1}^{r(A)}u_{ik}^2\Big) \le \sum_{k=1}^{r(A)}\eta_k^2,$$

where the last step uses the facts that ∑_{i=1}^{r(A)} u_{ik}² ≤ 1 and ∑_{k=1}^{n}∑_{i=1}^{r(A)} u_{ik}² = ∑_{i=1}^{r(A)}∑_{k=1}^{n} u_{ik}² ≤ r(A). A combination of the above bounds leads to the desired result. This completes the proof.

6.1 Proof of Theorem 1

First partition D and W as follows:

$$D = \begin{bmatrix} D_{r(A)} & 0 \\ 0 & 0 \end{bmatrix}, \qquad W = \begin{bmatrix} W_{r(A)} \\ W_{\bar r} \end{bmatrix},$$

then

$$DY - W = \begin{bmatrix} D_{r(A)}Y_{r(A)} \\ 0 \end{bmatrix} - \begin{bmatrix} W_{r(A)} \\ W_{\bar r} \end{bmatrix} = \begin{bmatrix} D_{r(A)}Y_{r(A)} - W_{r(A)} \\ -W_{\bar r} \end{bmatrix}.$$

Evidently, only the first r(A) rows of Y are involved in minimizing ‖DY − W‖²_F, which amounts to computing the global minimizer Y*_{r(A)} of

$$\arg\min_{Y_{r(A)}}\ \|D_{r(A)}Y_{r(A)} - W_{r(A)}\|_F^2.$$

Then Y*_{r(A)} = D_{r(A)}^{-1}℘_s(W_{r(A)}) by the non-singularity of D_{r(A)} and Lemma 1 with s ≤ r(A). For s > r(A), recall that only the upper part of Y* is relevant in minimizing (5). The result then follows. This completes the proof.

6.2 Proof of Theorem 2

For any Θ_λ^* with λ > 0, let s* = r(Θ_λ^*). Next we prove by contradiction that Θ_λ^* = Θ̂_{s*}. Suppose Θ_λ^* ≠ Θ̂_{s*}. By the uniqueness of Θ̂_{s*} given in Theorem 1 and the definition of minimization, ‖AΘ̂_{s*} − Z‖²_F < ‖AΘ_λ^* − Z‖²_F. This, together with r(Θ̂_{s*}) = r(Θ_λ^*), implies that

$$\|A\hat\Theta_{s^*} - Z\|_F^2 + \lambda\, r(\hat\Theta_{s^*}) < \|A\Theta_\lambda^* - Z\|_F^2 + \lambda\, r(\Theta_\lambda^*).$$

This contradicts the fact that Θ_λ^* is a minimizer of (7). The converse is shown in the proof of Theorem 3.

6.3 Proof of Theorem 3

We prove the first conclusion by constructing such a solution path vector 𝒮. Let 𝒮_0 = 0 and 𝒮_{r+1} = +∞. Define 𝒮_k for 1 ≤ k ≤ r as the solution of the equation

$$\|\wp_{r-k+1}(Z) - Z\|_F^2 + \mathcal{S}_k\,(r-k+1) = \|\wp_{r-k}(Z) - Z\|_F^2 + \mathcal{S}_k\,(r-k).$$

It follows that

$$\mathcal{S}_k = \|\wp_{r-k}(Z) - Z\|_F^2 - \|\wp_{r-k+1}(Z) - Z\|_F^2 = \sum_{j=r-k+1}^{r}\sigma_j^2 - \sum_{j=r-k+2}^{r}\sigma_j^2 = \sigma_{r-k+1}^2, \qquad (12)$$

where σ_j is the jth largest nonzero singular value of Z. By (12), 𝒮_k is increasing in k. In addition, by the definition of 𝒮_k and 𝒮_{k+1}, whenever λ falls into the interval [𝒮_k, 𝒮_{k+1}), the rank of a global minimizer Θ* of (8) is no more than r − k and larger than r − k − 1. In other words, Θ_λ^* is of rank r − k and is given by ℘_{r−k}(Z). Therefore, the constructed solution path vector 𝒮 satisfies all the requirements in the theorem.

Moreover, when 𝒮k ≤ λ < 𝒮k+1,

$$g(\lambda) = \|\wp_{r-k}(Z) - Z\|_F^2 + \lambda\, r(\wp_{r-k}(Z)) = \|\wp_{r-k}(Z) - Z\|_F^2 + (r-k)\lambda. \qquad (13)$$

Since ℘_{r−k}(Z) is independent of λ, g(λ) is a nondecreasing linear function of λ in each interval [𝒮_k, 𝒮_{k+1}). Combined with the definition of the solution path vector 𝒮, we conclude that g(λ) is nondecreasing and piecewise linear with each element of 𝒮 as a kink point, as shown in Figure 1. This completes the proof.

6.4 Proof of Theorem 4

The proof uses direct calculations. First we bound the Kullback-Leibler loss. By Theorem 1,

$$A\hat\Theta_{r_0} = U_{r(A)}\,\wp_{r_0}(W_{r(A)}),$$

where

$$W_{r(A)} = D_{r(A)}(V^T\Theta_0)_{r(A)} + (U^Te)_{r(A)}.$$

Denote B = D_{r(A)}(V^TΘ_0)_{r(A)}, which has rank r(B) ≤ r0. Since the Frobenius-norm is invariant under orthogonal transformations, it follows that

$$\|A\hat\Theta_{r_0} - A\Theta_0\|_F^2 = \|U_{r(A)}\wp_{r_0}(W_{r(A)}) - UDV^T\Theta_0\|_F^2 = \|U_{r(A)}\wp_{r_0}(W_{r(A)}) - U_{r(A)}D_{r(A)}(V^T\Theta_0)_{r(A)}\|_F^2 = \|\wp_{r_0}(W_{r(A)}) - D_{r(A)}(V^T\Theta_0)_{r(A)}\|_F^2 = \|\wp_{r_0}(B + (U^Te)_{r(A)}) - B\|_F^2 = \|\wp_{r_0}(B + e') - B\|_F^2,$$

where e′ = (U^Te)_{r(A)} and U_{r(A)} denotes the submatrix of the first r(A) columns of U. From the definition of ℘_{r0}(B + e′) we conclude that

$$\|\wp_{r_0}(B + e') - B - e'\|_F^2 \le \|B - B - e'\|_F^2 = \|e'\|_F^2,$$

which implies that,

$$\|\wp_{r_0}(B + e') - B\|_F^2 \le 2\big\langle \wp_{r_0}(B + e') - B,\ e'\big\rangle \le 2\,\|\wp_{r_0}(B + e') - B\|_F\,\|\wp_{r_0}(e')\|_F,$$

where the last inequality follows from Lemma 2. Thus,

$$\|\wp_{r_0}(B + e') - B\|_F^2 \le 4\,\|\wp_{r_0}(e')\|_F^2 = 4\sum_{j=1}^{r_0}\sigma_j^2. \qquad (14)$$

The risk bounds then follow.

Second, we bound ‖Θ̂_{r0} − Θ0‖²_F, which is equal to

$$\Big\|V\begin{bmatrix} D_{r(A)}^{-1}\wp_{r_0}(W_{r(A)}) \\ 0 \end{bmatrix} - \Theta_0\Big\|_F^2 = \Big\|\begin{bmatrix} D_{r(A)}^{-1}\wp_{r_0}(W_{r(A)}) \\ 0 \end{bmatrix} - V^T\Theta_0\Big\|_F^2 = \|D_{r(A)}^{-1}\wp_{r_0}(W_{r(A)}) - (V^T\Theta_0)_{r(A)}\|_F^2 + \|(V^T\Theta_0)_{r(A)^c}\|_F^2 \le \frac{1}{\sigma_{\min}^2\, n}\|\wp_{r_0}(W_{r(A)}) - D_{r(A)}(V^T\Theta_0)_{r(A)}\|_F^2 + \|(V^T\Theta_0)_{r(A)^c}\|_F^2,$$

where σ_{r(A)}(n^{−1/2}A) = σ_min, i.e., σ_{r(A)}(A) = n^{1/2}σ_min. If p = r(A), then the last term vanishes. Thus

$$\|\hat\Theta_{r_0} - \Theta_0\|_F^2 \le \frac{1}{\sigma_{\min}^2\, n}\,\|\wp_{r_0}(W_{r(A)}) - D_{r(A)}(V^T\Theta_0)_{r(A)}\|_F^2 \le \frac{4\sum_{j=1}^{r_0}\sigma_j^2}{\sigma_{\min}^2\, n}.$$

Finally, if r(A) ≥ r0, then 𝔼‖Θ̂_{r0} − Θ0‖²_F ≤ 4(σ_min² n)^{−1} 𝔼(∑_{j=1}^{r0} σ_j²) and 𝔼‖AΘ̂_{r0} − AΘ0‖²_F ≤ 4𝔼(∑_{j=1}^{r0} σ_j²). In particular, if r(A) = r0, then

$$\mathbb{E}\,\|\hat\Theta_{r_0} - \Theta_0\|_F^2 = \frac{1}{\sigma_{\min}^2}\cdot\frac{r_0 k}{n}, \qquad \mathbb{E}\,\|A\hat\Theta_{r_0} - A\Theta_0\|_F^2 = \frac{r_0 k}{n}.$$

7. EMPIRICAL EVALUATIONS

This section examines the effectiveness of the proposed method and compares it with nuclear-norm regularization as well as the SVP method [17]. One benefit of our exact method is that it can be used to evaluate the approximation quality of inexact methods. For the SVP, we choose the default initial value 0 for this local method, since no other choice is guaranteed to deliver better performance; our numerical experiments, not reported here, suggest that the SVP method is indeed sensitive to the choice of the initial value. For nuclear-norm regularization, we select a regularization parameter value whose solution satisfies the rank constraint in (3).

7.1 Synthetic Data

Simulations are performed under (2). First, the n × p design matrix A is sampled, with each entry iid N(10, 1). Second, the p × k truth Θ0 is generated by multiplying a p × r matrix and an r × k matrix, each entry of which is drawn independently from N(10, 1). The data matrix Z is then sampled according to (2), with the entries of e iid N(0, σ²) and σ = 0.5.
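
For reproducibility of the set-up (not the exact code used in the paper), a minimal numpy sketch of this data-generating process is given below; the function name simulate and the rng argument are ours.

```python
import numpy as np

def simulate(n, p, k, r0, sigma=0.5, mu=10.0, rng=None):
    """Sketch of the Section 7.1 set-up: design A with N(mu, 1) entries,
    rank-r0 truth Theta0 = B C with N(mu, 1) entries, noise N(0, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.normal(mu, 1.0, size=(n, p))
    Theta0 = rng.normal(mu, 1.0, size=(p, r0)) @ rng.normal(mu, 1.0, size=(r0, k))
    Z = A @ Theta0 + rng.normal(0.0, sigma, size=(n, k))
    return A, Theta0, Z
```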

For predictive performance and rank recovery, we compute the MSE ‖A(Θ0 − Θ̂)‖²_F and the absolute rank difference |r̂ − r0|, and record the training time for each method, averaged over 100 simulation replications on a test set of size 10n, where s is tuned over the integers in [0, min(n, p)] using an independent tuning set of size 2n. We consider three possible situations, i.e., k = r0 < p, k > p > r0, and p > k > r0. To illustrate our theoretical conclusion in Theorem 4 that the prediction error bound does not depend on the value of p, we add one more case of p > k > r0 with a larger value of p. The simulation results are summarized in Tables 2–4.

Table 2.

Prediction results for the synthetic data: averaged MSEs as well as their standard deviations, for three competing methods based on the selected tuning parameters over 100 simulation replications. Our, SVP, and Nuclear-Norm refer to our method, the SVP method and nuclear-norm regularization method.

Set-up Our SVP Nuclear-Norm

k = r0 < p, 5 = 5 < 99 0.0110 (0.0065) 0.4591 (0.0386) 0.4130 (0.0360)
k > p > r0, 50 > 40 > 5 12.0010 (0.6125) 10166.5052 (1236.3237) 7709.0779 (960.6982)
p > k > r0, 50 > 40 > 5 13.1945 (0.5682) 10185.7699 (1132.3515) 8116.2653 (943.6058)
p > k > r0, 99 > 40 > 5 121.4019 (70.3054) 14597.0728 (1185.7200) 13153.3217 (1056.5155)

Table 4.

Run time for the synthetic data: average training time in seconds for three competing methods based on the selected tuning parameters over 100 simulation replications. Our, SVP, and Nuclear-Norm refer to our method, the SVP method and nuclear-norm regularization.

Set-up Our SVP Nuclear-Norm

k = r0 < p, 5 = 5 < 99 0.0224 (0.0045) 14.1523 (0.3933) 0.9766 (0.0074)
k > p > r0, 50 > 40 > 5 0.0349 (0.0031) 41.2122 (1.8511) 5.8937 (0.2945)
p > k > r0, 50 > 40 > 5 0.0368 (0.0025) 29.5856 (1.1786) 5.7447 (0.1173)
p > k > r0, 99 > 40 > 5 0.0542 (0.0039) 79.9850 (3.6724) 7.6046 (0.1011)

As suggested in Tables 2 and 3, our exact method is much more precise than the other two methods in prediction in all cases, and in rank recovery in all cases except k = r0 = 5 < p = 99. This is in agreement with the theoretical results in Theorem 4 that exact reconstruction of the oracle estimator is achieved through tuning. Note that the best MSE value does not necessarily yield the best rank recovery, as in the case of k = r0 = 5 < p = 99, which is due to the bias/variance trade-off. As indicated in Table 4, our method is, on average, 10–20 times faster than the other two.

Table 3.

Rank recovery for the synthetic data: averaged values of |r0| as well as their standard deviations, for three competing methods based on the selected tuning parameters over 100 simulation replications. Our, SVP, and Nuclear-Norm refer to our method, the SVP method and nuclear-norm regularization method.

Set-up Our SVP Nuclear-Norm

k = r0 < p, 5 = 5 < 99 0.8400 (0.8005) 0.0000 (0.0000) 0.0000 (0.0000)
k > p > r0, 50 > 40 > 5 0.0000 (0.0000) 3.5300 (1.6358) 1.6400 (0.6439)
p > k > r0, 50 > 40 > 5 0.0000 (0.0000) 4.0800 (1.3830) 1.6500 (0.5925)
p > k > r0, 99 > 40 > 5 0.0000 (0.0000) 2.9800 (1.9121) 0.2200 (0.4399)

7.2 MIT logo Recovery

Next we examine the three competing methods for reconstructing the MIT logo image, which was studied in [27, 17]. The original logo is displayed in Figure 2; we use a grayscale image of size 44 × 85 whose image matrix has rank 7.

Figure 2. Original MIT logo image.

Our objective is to reconstruct this image from its noisy version and examine the quality of reconstruction. Toward this end, we sample the design matrix A with each entry iid N(0, 1), where the sample size n ranges from 20 to 80. To generate a noisy version, we add random error sampled from N(0, 0.5²) to each element of the sampled data. The reconstruction results are displayed in Figure 3 for the three methods, with the default initial value 0 for the SVP.

Figure 3. Reconstruction of a noisy version of the MIT logo with varying sampling size n. From left to right (for each case of n): SVP; nuclear-norm regularization; our method.

Visually, our method delivers highly competitive performance as compared to the other two methods, as displayed in Figure 3, and yields nearly perfect reconstruction when the size of the design matrix A becomes larger, say n = 60 and n = 80. For all the methods, better reconstruction can be reached as n increases, and comparable results are achieved when n is small, say n = 20 and n = 40. This conclusion is consistent with our theoretical analysis.

In practice, the exact rank of the matrix to be estimated is unknown but may reasonably be assumed to be small. In this sense, s needs to be tuned or estimated. Next we investigate the effect of the choice of s on the reconstruction quality, measured by the MSE and the relative recovery error. The latter is defined as the ratio ‖Θ̂_s − Θ0‖_F / ‖Θ0‖_F, commonly used to measure the quality of parameter estimation. We display both as functions of s in Figures 4 and 5, where s is defined in (3).

Figure 4. Relative recovery error as a function of the sampling size for the MIT logo image under different rank constraints.

Figure 5. MSE as a function of the sampling size for the MIT logo image under different rank constraints. Note that the MSE of our method in (b) is not zero but indistinguishable from zero, unlike (c) and (d), in which the MSEs are identical to zero.

For the relative recovery error, a clear transition of our solution occurs around s = 44, after which perfect recovery is achieved, whereas no improvement occurs for the other two methods as s increases. In the case of an underdetermined A, i.e., when A is not of full column rank, all three methods produce similar recovery results. For the MSEs, our method yields the perfect result of zero when s ≥ 7 and a reasonably small value when s = 3, 5, whereas the other two methods lead to elevated MSEs as s increases. This is in accordance with our theory, which suggests that our method reconstructs the oracle estimator, giving perfect image reconstruction, when s ≥ 7.

8. CONCLUSION

This paper considers a nonconvex least squares formulation based on a rank constraint/regularization. We establish the optimality of the global solution against any other method. Experimental results on synthetic and real data demonstrate the efficiency and effectiveness of the proposed algorithm. In future work, we plan to extend the present work to general loss functions.

ACKNOWLEDGEMENT

This work was supported by NSF grants IIS-0953662 and DMS-0906616, NIH grants R01LM010730, 2R01GM081535-01, and R01HL105397.

Footnotes

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

1. More specifically, for a matrix of dimension n × p, the SVD has a complexity of O(min{n²p, p²n}), whereas the matrix inversion has a complexity of O(r(A)³), which can be improved to O(r(A)^{2.807}) when the Strassen algorithm is utilized.

REFERENCES

  1. Abernethy J, Bach F, Evgeniou T, Vert J. Low-rank matrix factorization with attributes. Arxiv preprint cs/0611124. 2006.
  2. Amit Y, Fink M, Srebro N, Ullman S. Uncovering shared structures in multiclass classification. In: Proceedings of the 24th Annual International Conference on Machine Learning. ACM; 2007. pp. 17–24.
  3. Anderson T. Estimating linear restrictions on regression coefficients for multivariate normal distributions. The Annals of Mathematical Statistics. 1951;22(3):327–351.
  4. Anderson T. Asymptotic distribution of the reduced rank regression estimator under general conditions. The Annals of Statistics. 1999;27(4):1141–1154.
  5. André T, Nowak R, Van Veen B. Low-rank estimation of higher order statistics. IEEE Transactions on Signal Processing. 1997;45(3):673–685.
  6. Argyriou A, Evgeniou T, Pontil M. Multi-task feature learning. Advances in Neural Information Processing Systems. 2007;19:41.
  7. Bunea F, She Y, Wegkamp M. Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics. 2011;39(2):1282–1309.
  8. Cai J, Candès E, Shen Z. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization. 2010;20(4):1956–1982.
  9. Candès E, Plan Y. Matrix completion with noise. Arxiv preprint arXiv:0903.3131. 2009.
  10. Candès E, Recht B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics. 2009;9(6):717–772.
  11. Chen J, Liu J, Ye J. Learning incoherent sparse and low-rank patterns from multiple tasks. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2010. pp. 1179–1188.
  12. Eckart C, Young G. The approximation of one matrix by another of lower rank. Psychometrika. 1936;1(3):211–218.
  13. Fazel M, Hindi H, Boyd S. A rank minimization heuristic with application to minimum order system approximation. In: Proceedings of the American Control Conference, vol. 6. IEEE; 2001. pp. 4734–4739.
  14. Friedland S, Torokhti A. Generalized rank-constrained matrix approximations. SIAM Journal on Matrix Analysis and Applications. 2007;29(2):656–659.
  15. Izenman A. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis. 1975;5(2):248–264.
  16. Jaggi M, Sulovský M. A Simple Algorithm for Nuclear Norm Regularized Problems. In: Proceedings of the 27th Annual International Conference on Machine Learning; 2010.
  17. Jain P, Meka R, Dhillon I. Guaranteed rank minimization via singular value projection. Advances in Neural Information Processing Systems. 2010;23:937–945.
  18. Ji H, Liu C, Shen Z, Xu Y. Robust video denoising using low rank matrix completion. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2010. pp. 1791–1798.
  19. Ji S, Ye J. An accelerated gradient method for trace norm minimization. In: Proceedings of the 26th Annual International Conference on Machine Learning; 2009. pp. 457–464.
  20. Kulis B, Surendran A, Platt J. Fast low-rank semidefinite programming for embedding and clustering. In: Eleventh International Conference on Artificial Intelligence and Statistics, AISTATS 2007; 2007.
  21. Liu G, Lin Z, Yan S, Sun J, Yu Y, Ma Y. Robust recovery of subspace structures by low-rank representation. Arxiv preprint arXiv:1010.2955. 2010. doi: 10.1109/TPAMI.2012.88.
  22. Luo X. High dimensional low rank and sparse covariance matrix estimation via convex minimization. Arxiv preprint arXiv:1111.1133. 2011.
  23. Natarajan BK. Sparse approximate solutions to linear systems. SIAM J. Comput. 1995;24(2):227–234.
  24. Negahban S, Wainwright MJ. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics. 2011;39(2):1069–1097.
  25. Pong T, Tseng P, Ji S, Ye J. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization. 2010;20:3465–3489.
  26. Rao C. Matrix approximations and reduction of dimensionality in multivariate statistical analysis. Multivariate Analysis. 1980;5:3–22.
  27. Recht B, Fazel M, Parrilo P. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review. 2010;52(3):471–501.
  28. Reinsel G, Velu R. Multivariate Reduced-Rank Regression: Theory and Applications. New York: Springer; 1998.
  29. Savas B, Dhillon I. Clustered low rank approximation of graphs in information science applications. In: Proceedings of the SIAM Data Mining Conference; 2011.
  30. Shalev-Shwartz S, Gonen A, Shamir O. Large-Scale Convex Minimization with a Low-Rank Constraint. In: Proceedings of the 28th Annual International Conference on Machine Learning; 2011.
  31. Shen X, Huang H. Optimal model assessment, selection, and combination. Journal of the American Statistical Association. 2006;101(474):554–568.
  32. Sondermann D. Best approximate solutions to matrix equations under rank restrictions. Statistical Papers. 1986;27(1):57–66.
  33. Srebro N, Rennie J, Jaakkola T. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems. 2005;17:1329–1336.
  34. Toh K, Yun S. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization. 2010;6:615–640.
  35. Wright J, Ganesh A, Rao S, Ma Y. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. Advances in Neural Information Processing Systems. 2009.
