Author manuscript; available in PMC: 2014 Oct 9.
Published in final edited form as: KDD. 2012;2012:480–488. doi: 10.1145/2339530.2339609

Optimal Exact Least Squares Rank Minimization

Shuo Xiang 1,2, Yunzhang Zhu 3, Xiaotong Shen 3, Jieping Ye 1,2
PMCID: PMC4191838  NIHMSID: NIHMS497477  PMID: 25309806

Abstract

In multivariate analysis, rank minimization emerges when a low-rank structure of matrices is desired together with a small estimation error. Rank minimization is nonconvex and generally NP-hard, which poses a major computational challenge. In this paper, we consider a nonconvex least squares formulation, which seeks to minimize the least squares loss function subject to a rank constraint. Computationally, we develop efficient algorithms to compute a global solution as well as an entire regularization solution path. Theoretically, we show that our method reconstructs the oracle estimator exactly from noisy data. As a result, it recovers the true rank optimally against any method and yields sharper parameter estimation than its counterpart. Finally, the utility of the proposed method is demonstrated by simulations and image reconstruction from a noisy background.

General Terms: Algorithms

Keywords: Nonconvex, global optimality, rank minimization

1. INTRODUCTION

In multivariate analysis, estimation of lower-dimensional structures has received attention in statistics, signal processing and machine learning. One such structure is a low rank of matrices [5, 22], where the rank measures the dimension of a multivariate response. Rank minimization approximates multivariate data with the smallest possible rank of matrices. It has many applications in, for instance, multi-task learning [6, 11], multi-class classification [2], matrix completion [8, 17], collaborative filtering [33, 1], clustering [20, 29], and computer vision [35, 21, 18], among others. The central topic this article addresses is least squares rank minimization.

Consider multi-response linear regression in which a k-dimensional response vector zi follows

$$z_i = \Theta^T a_i + \varepsilon_i; \qquad \mathbb{E}\,\varepsilon_i = 0,\ \ \mathrm{Cov}(\varepsilon_i) = \sigma^2 I_{k \times k}; \qquad i = 1, \ldots, n, \qquad (1)$$

where ai is a p-dimensional design vector, Θ is a p × k regression parameter matrix, and components of εi are independent. Model (1) reduces to the widely-used linear model in compressed sensing when k = 1 and becomes a multivariate autoregressive model with ai = zi−1. Denote the rank of Θ as r(Θ) and rewrite (1) in the matrix form as follows:

$$Z = A\Theta + e, \qquad (2)$$

where Z = (z_1, ⋯, z_n)^T ∈ ℝ^{n×k}, A = (a_1, ⋯, a_n)^T ∈ ℝ^{n×p} and e = (ε_1, ⋯, ε_n)^T ∈ ℝ^{n×k} are the data, design and error matrices. In (1), we estimate Θ based on n pairs of observation vectors (a_i, z_i)_{i=1}^n, with a priori knowledge that r(Θ) is relatively small in comparison to min(n, k, p), where the numbers of unknown parameters k and p can greatly exceed the sample size n.

Least squares rank minimization, as described, solves

$$\min_{\Theta}\ \|A\Theta - Z\|_F^2 \quad \text{s.t.}\ r(\Theta) \le s, \qquad (3)$$

where ‖·‖_F is the Frobenius-norm and s is an integer-valued tuning parameter taking values in [1, min(n, k, p)]. General rank minimization is nonconvex and NP-hard [23], much like L0-minimization in univariate analysis. Consequently, neither an exact global solution to (3) nor its statistical properties are known, due primarily to the discreteness and non-convexity of the rank function.

Estimation under the restriction that r(Θ) = r has been studied when n → ∞ with k and p held fixed; see [3, 4, 15, 28, 26]. Two major computational approaches have been proposed for approximating the optimal solution of (3). The first involves regularization with a surrogate function, such as the nuclear-norm, which is the convex envelope of the rank function [13] and can be handled by efficient algorithms [8, 19, 34, 25, 16]. In some cases, the solution of this convex problem coincides with a global minimizer of (3) under certain isometry assumptions [27]. However, these assumptions can be strong and difficult to check. Recently, [7] obtained a global minimizer of a regularized version of (3).

The second approach attacks (3) by approximating the rank function iteratively, either by calculating the largest singular vector through greedy search [30] or by singular value projection (SVP) through a local gradient method [17]. Under isometry assumptions [27, 10, 9] weaker than those of the nuclear-norm approach, these methods guarantee an exact solution of (3), but they suffer from the same difficulties as the regularization method [30], although they have achieved promising results on both simulated and real-world data.

Theoretically, error bounds under the Frobenius-norm for the first, regularization-based approach are obtained in [24], and rank selection consistency is established in [7]. Unfortunately, to the best of our knowledge, whether similar conclusions hold for our formulation (3) remains largely unknown.

In this paper, we advance on two fronts. Computationally, we derive a general closed form for a global minimizer of (3) in Theorem 1, and give a condition under which (3) and its nonconvex regularized counterpart are equivalent with regard to global minimizers, although the two methods are not generally equivalent. Moreover, we develop an efficient algorithm for computing the entire regularization solution path at the cost of computing only one solution for a single regularization parameter. Theoretically, we establish optimality for a global minimizer of (3). More specifically, the proposed method is optimal against any other method in that it reconstructs the oracle estimator exactly, and thus the true rank, under (1). It is important to note that this exact recovery result is a much stronger property than consistency, which is attributed to the discrete nature of the rank function as well as of the tuning parameter s. Such a result may not be shared by its regularized counterpart with a continuous tuning parameter. In addition, the method enjoys a higher degree of accuracy for parameter estimation than nuclear-norm rank estimation.

After the first draft of this paper was completed, we became aware that [14] and [32] gave an expression of the solution in Theorem 1. However, neither paper considered the computational and statistical aspects of the solution. Inevitably, some partial overlap exists between our Theorem 1 and theirs.

The rest of the paper is organized as follows. Section 2 presents a closed-form solution to (3). Section 3 gives an efficient path algorithm for a regularized version of (3). Section 4 is devoted to theoretical investigation, followed by Section 5 discussing methods for tuning. Section 6 presents proofs for all the theorems we develop and Section 7 presents results of empirical evaluations, where several rank minimization methods are compared. Section 8 concludes the paper.

Notation: Table 1 summarizes the notations used in the rest of the paper.

Table 1.

Notation used throughout the paper

Notation   Description
A          The design matrix
e          The error matrix
Θ0         The ground truth model
Θ̂_s        Estimator obtained by optimizing (3)
Θ_λ^*      Estimator obtained by optimizing (7)
r(A)       The rank of matrix A
℘_s(Z)     The best rank-s approximation of matrix Z in terms of the Frobenius-norm

2. PROPOSED METHOD: CLOSED-FORM SOLUTION

This section derives a closed-form solution to (3). The strategy is to simplify (3) through the singular value decomposition (SVD) of the matrix A and the properties of the rank function. Before proceeding, we present a motivating lemma, known as the Eckart-Young theorem [12].

Lemma 1. The best rank-s approximation, in terms of the Frobenius-norm, of a rank-t matrix Z with t ≥ s, i.e., a global minimizer Θ* of

$$\min_{\Theta}\ \|\Theta - Z\|_F^2 \quad \text{s.t.}\ r(\Theta) \le s \qquad (4)$$

is given by Θ* = ℘_s(Z) = U_z D_s V_z^T, where D_s retains the largest s singular values of Z in the SVD Z = U_z D V_z^T.

Intuitively ℘s(Z) may be viewed as a projection of Z onto a set of matrices whose ranks are no more than s. Note that (4) is a special case of (3) with matrix A being the identity matrix. This motivates us to solve (3) through the simpler problem (4).
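
To make the projection concrete, the following short numpy sketch computes ℘_s(Z) by truncating the SVD; the function name best_rank_s is ours, introduced only for illustration.

```python
import numpy as np

def best_rank_s(Z, s):
    """Best rank-s approximation of Z in Frobenius norm (Eckart-Young),
    i.e., the projection P_s(Z) used throughout the paper."""
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    d[s:] = 0.0                       # keep only the s largest singular values
    return U @ np.diag(d) @ Vt
```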

When A is nonsingular, (3) clearly has a global minimizer A⁻¹℘_s(Z), since multiplication by a nonsingular matrix preserves rank. When A is singular, we first assume that r(A) ≥ s; this assumption is by no means mandatory and will be relaxed later. Given the SVD A = UDV^T, with orthogonal matrices U ∈ ℝ^{n×n} and V ∈ ℝ^{p×p} and diagonal matrix D ∈ ℝ^{n×p}, we have

$$\|A\Theta - Z\|_F = \|U^T(A\Theta - Z)\|_F = \|DV^T\Theta - U^TZ\|_F.$$

This follows from the fact that the Frobenius-norm is invariant under any orthogonal transformation. Let Y = V^TΘ and W = U^TZ; clearly r(Y) = r(Θ). Solving (3) now amounts to solving the equivalent problem:

$$\min_{Y}\ \|DY - W\|_F^2 \quad \text{s.t.}\ r(Y) \le s. \qquad (5)$$

Consequently, a global minimizer of (3) becomes VY*, where Y* is a global minimizer of (5) and is given by the following theorem.

Theorem 1. Let D, Y, Z and s be as defined above. If s ≤ r(A), then a global minimizer of (5) is given by

$$Y^* = \begin{bmatrix} D_{r(A)}^{-1}\,\wp_s(W_{r(A)}) \\ a \end{bmatrix}, \qquad (6)$$

where D_{r(A)} is a diagonal matrix consisting of all the nonzero singular values of A, a can be set to the zero matrix, and W_{r(A)} consists of the first r(A) rows of W.

Here are some remarks regarding the above theorem:

Remark 1. The solution to problem (5) is generally not unique. Specifically, the matrix a in (6) need not be fixed at zero, as long as it does not change the rank of Y*. However, if A is of full column rank, i.e., when r(A) = p, then a vanishes and Y* can be uniquely determined. In this case, the optimal solution of (3) is also unique.

Remark 2. The optimal Y* can also be computed for a general matrix A of arbitrary rank, i.e., when r(A) < s. See the proof of Theorem 1 in Section 6.

It is important to note that the value of a is irrelevant for prediction but matters for parameter estimation. In other words, when r(A) < p, a global minimizer is not unique, and hence parameter estimation is not identifiable; see Section 4 for a discussion. For simplicity, we set a = 0 in Y* subsequently.

In what follows, our estimator is defined as Θ̂_s, together with the estimated rank r(Θ̂_s). Algorithm 1 below summarizes the main steps for computing Θ̂_s for s ≤ min(n, k, p), where LSRM stands for Least Squares Rank Minimization.

Algorithm 1.

Exact solution of (3)

Input: A, Z, s ≤ r(A)
Output: A global minimizer Θ of (3)
Function: LSRM(A, Z, s)
  1: if A is nonsingular then
  2:   Θ = A−1s(Z)
  3: else
  4:   Perform SVD on A: A = UDVT
  5:   Extract the first r(A) rows of U^TZ and denote them by W_{r(A)}
  6:   Θ = V [D_{r(A)}^{-1} ℘_s(W_{r(A)}); 0]
  7: end if
  8: return Θ

The complexity of Algorithm 1 is determined mainly by its most expensive operations, matrix inversion and SVD: at most one matrix inversion and two SVDs are required. These operations have roughly cubic time complexity¹.
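
For illustration, here is a minimal numpy sketch mirroring Algorithm 1 under the assumption s ≤ r(A); it reuses the best_rank_s helper sketched above, sets the free block a to zero, and is not the authors' implementation.

```python
import numpy as np

def lsrm(A, Z, s):
    """Sketch of Algorithm 1 (LSRM): a global minimizer of (3) with r(Theta) <= s,
    assuming s <= r(A) and taking a = 0 in (6)."""
    n, p = A.shape
    r = np.linalg.matrix_rank(A)
    if r == n == p:                          # A nonsingular: Theta = A^{-1} P_s(Z)
        return np.linalg.solve(A, best_rank_s(Z, s))
    U, d, Vt = np.linalg.svd(A)              # full SVD: A = U D V^T
    W_r = (U.T @ Z)[:r, :]                   # first r(A) rows of U^T Z
    Y_top = np.diag(1.0 / d[:r]) @ best_rank_s(W_r, s)     # D_{r(A)}^{-1} P_s(W_{r(A)})
    Y = np.vstack([Y_top, np.zeros((p - r, Z.shape[1]))])  # pad with a = 0
    return Vt.T @ Y                          # Theta = V Y
```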

3. REGULARIZATION AND SOLUTION PATH

This section studies a regularized counterpart of (3):

$$\min_{\Theta}\ \|A\Theta - Z\|_F^2 + \lambda\, r(\Theta), \qquad (7)$$

where λ > 0 is a continuous regularization parameter corresponding to s in (3), and Θ_λ^* denotes a global minimizer of (7). The next theorem establishes an equivalence between (7) and (3) when Θ_λ^* is unique, which occurs when r(A) = p. Such a result is not generally anticipated for a nonconvex problem.

Theorem 2 (Equivalence). When p = r(A), (7) has a unique global minimizer. Moreover, (7) and (3) are equivalent with respect to their solutions: for any Θ_λ^* with λ ≥ 0, there exists s* = r(Θ_λ^*) ≥ 1 such that Θ_λ^* = Θ̂_{s*}, and vice versa.

Next we develop an algorithm for computing an entire solution path for all values of λ with complexity comparable to that of solving (7) at a single λ-value. For motivation, first consider a special case of the identity A in (7):

$$g(\lambda) = \min_{\Theta}\ \|\Theta - Z\|_F^2 + \lambda\, r(\Theta). \qquad (8)$$

3.1 Monotone property

In (8), r(Θ_λ^*) decreases as λ increases from 0, passing through every integer value from r(Z) down to 0 as λ becomes sufficiently large. In addition, g(λ) is nondecreasing in λ. The next theorem summarizes these results.

Theorem 3 (Monotone property). Let r(Z) be r. Then the following properties hold:

  1. There exists a solution path vector 𝒮 of length r + 2 satisfying the following:
     $$\mathcal{S}_0 = 0,\quad \mathcal{S}_{r+1} = +\infty,\quad \mathcal{S}_{k+1} > \mathcal{S}_k,\ \ k = 0, 1, \ldots, r; \qquad \Theta_\lambda^* = \wp_{r-k}(Z)\ \ \text{if}\ \ \mathcal{S}_k \le \lambda < \mathcal{S}_{k+1},$$
  2. Function g(λ) is nondecreasing and piecewise linear.

The monotone property leads to an efficient algorithm for calculating the pathwise solution of (8). Figure 1 displays the solution path by plotting g(λ) and r(Θ_λ^*) as functions of λ.

Figure 1. Piecewise linearity of g(·) and the rank of the optimal solution with respect to λ.

3.2 Pathwise algorithm

Through the monotone property, we compute the optimal solution of (8) at a particular λ by locating λ in the correct interval in the solution path vector 𝒮, which can be achieved efficiently via a simple binary search. Algorithm 2 describes the main steps.

Algorithm 2.

Pathwise solution of (8)

Input: Θ, Z
Output: Solution path vector 𝒮, pathwise solution Θ
Function: pathwise(Θ, Z)
  1: Initialize: 𝒮_0 = 0, Θ^0 = Z, r = r(Z)
  2: Perform SVD on Z: Z = UDV^T
  3: for i = r down to 1 do
  4:   𝒮_{r−i+1} ← σ_i²
  5:   Θ^{r−i+1} = Θ^{r−i} − σ_i u_i v_i^T
  6: end for
  7: return 𝒮, Θ

Algorithm 2 requires only one SVD operation, therefore its complexity is of the same order as that of Algorithm 1 at a single s-value. When Z is a low-rank matrix, existing software for SVD computation such as PROPACK is applicable to further improve computational efficiency.
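
A minimal numpy sketch of Algorithm 2 follows; it returns the kink points 𝒮 and the corresponding path of solutions, and is meant only to illustrate the update Θ^{r−i+1} = Θ^{r−i} − σ_i u_i v_i^T rather than to serve as a reference implementation.

```python
import numpy as np

def pathwise(Z, tol=1e-12):
    """Sketch of Algorithm 2: solution path of (8) for the identity design.
    path[k] is the rank-(r-k) minimizer, valid for S[k] <= lambda < S[k+1]."""
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    r = int(np.sum(d > tol))                 # numerical rank of Z
    S = [0.0]
    Theta = Z.astype(float).copy()
    path = [Theta.copy()]                    # rank-r solution
    for i in range(r, 0, -1):                # peel off singular values, smallest first
        S.append(d[i - 1] ** 2)              # S_{r-i+1} = sigma_i^2
        Theta = Theta - d[i - 1] * np.outer(U[:, i - 1], Vt[i - 1, :])
        path.append(Theta.copy())            # rank-(i-1) solution
    S.append(np.inf)                         # S_{r+1} = +infinity
    return S, path
```

Given a particular λ, a binary search locating the interval 𝒮_k ≤ λ < 𝒮_{k+1} then returns path[k].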

3.3 Extension to general A

For a general design matrix A, note that

$$\|A\Theta - Z\|_F^2 = \|DY - W\|_F^2 = \|W_{\bar r}\|_F^2 + \|D_r Y_r - W_r\|_F^2,$$

where $W = \begin{bmatrix} W_r \\ W_{\bar r} \end{bmatrix}$ is partitioned into its first r = r(A) rows W_r and the remaining rows W_{\bar r}. After dropping the constant term ‖W_{\bar r}‖²_F, we solve

$$\min_{Y_r}\ \|D_r Y_r - W_r\|_F^2 + \lambda\, r(Y_r).$$

Since D_r is nonsingular, the problem reduces to the simple case

$$\min_{\hat Y}\ \|\hat Y - W_r\|_F^2 + \lambda\, r(\hat Y),$$

where Ŷ = D_r Y_r. The solution path can then be obtained directly from Algorithm 2.
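
Under the same assumptions as the sketches above, this reduction can be coded directly by transforming to Ŷ = D_r Y_r, running the identity-design path algorithm, and mapping each path solution back through V; the function pathwise refers to the sketch given after Algorithm 2.

```python
import numpy as np

def pathwise_general(A, Z, tol=1e-12):
    """Sketch of the Section 3.3 reduction: solution path of (7) for a general
    design A, obtained by reusing pathwise() on the transformed problem."""
    U, d, Vt = np.linalg.svd(A)
    r = int(np.sum(d > tol))                     # r = r(A)
    W_r = (U.T @ Z)[:r, :]                       # W_r: first r(A) rows of U^T Z
    S, Yhat_path = pathwise(W_r)                 # path in Yhat = D_r Y_r
    p, k = A.shape[1], Z.shape[1]
    Theta_path = []
    for Yhat in Yhat_path:                       # map back: Theta = V [D_r^{-1} Yhat; 0]
        Y = np.vstack([np.diag(1.0 / d[:r]) @ Yhat, np.zeros((p - r, k))])
        Theta_path.append(Vt.T @ Y)
    return S, Theta_path
```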

4. STATISTICAL PROPERTIES

This section is devoted to the theoretical investigation of least squares rank minimization, which remains largely unexplored, although nuclear-norm regularization has been studied. In particular, we reveal the best possible performance for prediction as well as the optimal risk for parameter estimation, and we establish the optimality of the proposed method. In fact, the proposed method reconstructs the oracle estimator, the optimal estimator obtained as if the true rank were known in advance. Here the oracle estimator Θ̂⁰ is defined as a global minimizer of ‖AΘ − Z‖²_F subject to r(Θ) = r0, where Θ0 and r0 = r(Θ0) ≥ 1 denote the true parameter matrix and the true rank, respectively. This leads to exact rank recovery, in addition to reconstruction of the optimal performance of the oracle estimator. In other words, the proposed method is optimal against any other method, such as nuclear-norm rank regularization.

Given the design matrix A, we study the accuracy of rank recovery as well as prediction and parameter estimation. Let ℙ and 𝔼 be the true probability and expectation under Θ0 given A. For rank recovery, we use the metric ℙ(r̂ = r0), where r̂ = r(Θ̂_s) is the estimated rank. For prediction and parameter estimation, we employ the risks 𝔼K(Θ̂_s, Θ0) and 𝔼‖Θ̂_s − Θ0‖²_F, respectively, where

$$K(\hat\Theta_s, \Theta_0) = (2\sigma^2 n)^{-1}\sum_{i=1}^{n}\|a_i^T(\hat\Theta_s - \Theta_0)\|_2^2 = (2\sigma^2 n)^{-1}\|A(\hat\Theta_s - \Theta_0)\|_F^2$$

is the Kullback-Leibler loss and ‖·‖_2 is the L2-norm of a vector. Note that the predictive risk equals 2σ²𝔼K(Θ̂_s, Θ0), and parameter estimation is considered only when it is identifiable, i.e., when r(A) = p.

Now we present the risk bounds under (1) without a Gaussian error assumption.

Theorem 4. Under (1), the oracle estimator is exactly reconstructed by our method in that Θ̂_{r0} = Θ̂⁰ under ℙ, when r0 ≤ min(r(A), p). As a result, exact reconstruction of the optimal performance is achieved by our estimator Θ̂_{r0}. In particular,

$$\mathbb{E}\,K(\hat\Theta_{r_0}, \Theta_0)\ \begin{cases} = \dfrac{r_0 k}{2n} & \text{if } r_0 = r(A) \\[6pt] \le \dfrac{2\,\mathbb{E}\big(\sum_{j=1}^{r_0}\sigma_j^2\big)}{n} & \text{if } r_0 < r(A) \end{cases}$$

and

$$\mathbb{E}\,\|\hat\Theta_{r_0} - \Theta_0\|_F^2\ \begin{cases} = \dfrac{r_0 k}{\sigma_{\min}^2\, n} & \text{if } r_0 = r(A) = p \\[6pt] \le \dfrac{\mathbb{E}\big(\sum_{j=1}^{r_0}\sigma_j^2\big)}{\sigma_{\min}^2\, n} & \text{if } r_0 < r(A) = p, \end{cases}$$

where σ_j and σ_min > 0 are the jth largest and the smallest nonzero singular values of e′ = (U^Te)_{r(A)} and n^{−1/2}A, respectively, and (U^Te)_{r(A)} denotes the first r(A) rows of U^Te.

Remark: In general, 𝔼∑_{j=1}^{r0} σ_j² ≤ r0 𝔼σ_1².

Theorem 4 says that the optimal oracle estimator is exactly reconstructed by our method. Interestingly, the true rank is exactly recovered from noisy data, which is attributed to the discreteness of the rank and is analogous to maximum likelihood estimation over a discrete parameter space. Concerning prediction and parameter estimation, the optimal Kullback-Leibler risk is of order r0k/n, while the risk under the Frobenius-norm is of order r0k/(σ_min² n). For prediction, only the effective degrees of freedom ∑_{j=1}^{r0} σ_j² matters, as opposed to p; this is in contrast to a rate of kp/n without a rank restriction, and it permits p to be much larger than n, i.e., k, p ≫ n. For estimation, however, p enters the risk through σ_min², and p cannot be larger than n, i.e., max(k, p) ≤ n.

5. TUNING

As shown in Section 4, theoretically, exact rank reconstruction can be accomplished through tuning. Practically, we employ a predictive measure for rank selection.

The predictive performance of Θ̂s is measured by

$$\mathrm{MSE}(\hat\Theta_s) = \frac{1}{n}\,\mathbb{E}\,\|A\hat\Theta_s - Z\|_F^2,$$

which is proportional to the risk R(Θ̂s), where the expectation is taken with respect to (Z, A).

To estimate s over the integers in {0, 1, ⋯, min(n, p, k)}, one may use cross-validation with a tuning data set, which estimates the MSE. Alternatively, one may use the generalized degrees of freedom [31] through data perturbation, without a tuning set:

$$\widehat{\mathrm{GDF}}(\hat\Theta_s) = n^{-1}\|Z - A\hat\Theta_s\|_F^2 + 2n^{-1}\sum_{i=1}^{n}\sum_{l=1}^{k}\widehat{\mathrm{Cov}}\big(z_{il}, (a_i^T\hat\Theta_s)_l\big), \qquad (9)$$

where Ĉov(z_{il}, (a_i^TΘ̂_s)_l) is the estimated covariance between the lth component of z_i and the lth component of a_i^TΘ̂_s. In the case that ε_i in (1) follows N(0, σ²I_{k×k}), the method of data perturbation of [31] is applicable. Specifically, sample e_i* from N(0, σ²I_{k×k}) and let

$$Z^* = Z + \tau e^*, \qquad \widehat{\mathrm{Cov}}\big(z_{il}, (a_i^T\hat\Theta_s)_l\big) = \tau^{-1}\,\mathrm{Cov}^*\big(z_{il}^*, (a_i^T\hat\Theta_s^*)_l\big),$$

where Cov*(z*_{il}, (a_i^TΘ̂*_s)_l) is the Monte Carlo sample covariance based on T perturbations. For the types of problems we consider, we fix T to be n.
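
As a concrete, deliberately simple illustration of the tuning-set option above, the following sketch selects s by minimizing the tuning-set prediction error; it reuses the lsrm sketch from Section 2 and does not implement the GDF/data-perturbation estimator.

```python
import numpy as np

def tune_s(A_train, Z_train, A_tune, Z_tune, s_max):
    """Sketch: pick s in {1, ..., s_max} minimizing the tuning-set MSE of the
    LSRM fit; a stand-in for the cross-validation option described above."""
    best_s, best_mse = None, np.inf
    for s in range(1, s_max + 1):
        Theta_hat = lsrm(A_train, Z_train, s)
        mse = np.linalg.norm(A_tune @ Theta_hat - Z_tune, 'fro') ** 2 / A_tune.shape[0]
        if mse < best_mse:
            best_s, best_mse = s, mse
    return best_s, best_mse
```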

6. PROOF OF THEOREMS

In this section we present detailed proofs of the theorems developed in the previous sections. We first present a technical lemma to be used in the proof of Theorem 4.

Lemma 2. Suppose A and B are two n₁ × n₂ matrices. Then,

$$\langle A, B\rangle \le \|A\|_F\,\|\wp_{r(A)}(B)\|_F, \qquad (10)$$

where ⟨A, B⟩ = Tr(A^TB) = Tr(B^TA), Tr denotes the trace, and r(A) is the rank of A.

Proof. Let the singular value decompositions of A and B be A = U₁Σ₁V₁^T and B = U₂Σ₂V₂^T, where U_i and V_i, i = 1, 2, are orthogonal matrices, and Σ₁ and Σ₂ are diagonal matrices whose diagonal elements are the singular values of A and B, respectively. Then

$$\langle A, B\rangle = \mathrm{Tr}(V_1\Sigma_1^T U_1^T U_2\Sigma_2 V_2^T) = \mathrm{Tr}(\Sigma_1^T U_1^T U_2\Sigma_2 V_2^T V_1) = \mathrm{Tr}(\Sigma_1^T U\Sigma_2 V^T),$$

where U = U₁^TU₂ and V = V₁^TV₂ continue to be orthogonal. Let the ordered singular values of A be σ₁ ≥ ⋯ ≥ σ_{r(A)}, and write B̃ = (b̃_{ij}) = UΣ₂V^T. By the Cauchy-Schwarz inequality,

$$\mathrm{Tr}(\Sigma_1^T U\Sigma_2 V^T) = \mathrm{Tr}(\Sigma_1^T\tilde{B}) = \sum_{i=1}^{r(A)}\sigma_i\tilde{b}_{ii} \le \sqrt{\sum_{i=1}^{r(A)}\sigma_i^2}\,\sqrt{\sum_{i=1}^{r(A)}\tilde{b}_{ii}^2} = \|A\|_F\sqrt{\sum_{i=1}^{r(A)}\tilde{b}_{ii}^2}. \qquad (11)$$

Similarly, let the ordered singular values of B be η₁ ≥ ⋯ ≥ η_{r(B)}. Then it suffices to show that ∑_{i=1}^{r(A)} b̃_{ii}² ≤ ∑_{i=1}^{r(A)} η_i². Assume, without loss of generality, that η_i = 0 if i > r(B). Let n = min(n₁, n₂). By the Cauchy-Schwarz inequality,

$$\sum_{i=1}^{r(A)}\tilde{b}_{ii}^2 = \sum_{i=1}^{r(A)}\Big(\sum_{k=1}^{n}u_{ik}\eta_k v_{ik}\Big)^2 \le \sum_{i=1}^{r(A)}\Big(\sum_{k=1}^{n}u_{ik}^2\eta_k^2\Big)\Big(\sum_{k=1}^{n}v_{ik}^2\Big) \le \sum_{i=1}^{r(A)}\Big(\sum_{k=1}^{n}u_{ik}^2\eta_k^2\Big) = \sum_{k=1}^{n}\eta_k^2\Big(\sum_{i=1}^{r(A)}u_{ik}^2\Big) \le \sum_{k=1}^{r(A)}\eta_k^2,$$

where the last step uses the facts that ∑_{i=1}^{r(A)} u_{ik}² ≤ 1 and ∑_{k=1}^{n}∑_{i=1}^{r(A)} u_{ik}² = ∑_{i=1}^{r(A)}∑_{k=1}^{n} u_{ik}² ≤ r(A). A combination of the above bounds leads to the desired result. This completes the proof.

6.1 Proof of Theorem 1

First partition D and W as follows:

$$D = \begin{bmatrix} D_{r(A)} & 0 \\ 0 & 0 \end{bmatrix}, \qquad W = \begin{bmatrix} W_{r(A)} \\ W_{\bar r} \end{bmatrix},$$

then

$$DY - W = \begin{bmatrix} D_{r(A)}Y_{r(A)} \\ 0 \end{bmatrix} - \begin{bmatrix} W_{r(A)} \\ W_{\bar r} \end{bmatrix} = \begin{bmatrix} D_{r(A)}Y_{r(A)} - W_{r(A)} \\ -W_{\bar r} \end{bmatrix}.$$

Evidently, only the first r(A) rows of Y are involved in minimizing ‖DY − W‖²_F, which amounts to computing the global minimizer Y*_{r(A)} of

$$\arg\min_{Y_{r(A)}}\ \|D_{r(A)}Y_{r(A)} - W_{r(A)}\|_F^2.$$

Then Y*_{r(A)} = D_{r(A)}^{-1}℘_s(W_{r(A)}) by the non-singularity of D_{r(A)} and Lemma 1 with s ≤ r(A). For s > r(A), recall that only the upper part of Y* is relevant in minimizing (5). The result then follows. This completes the proof.

6.2 Proof of Theorem 2

For any Θ_λ^* with λ > 0, let s* = r(Θ_λ^*). Next we prove by contradiction that Θ_λ^* = Θ̂_{s*}. Suppose Θ_λ^* ≠ Θ̂_{s*}. By the uniqueness of Θ̂_{s*} given in Theorem 1 and the definition of minimization, ‖AΘ̂_{s*} − Z‖²_F < ‖AΘ_λ^* − Z‖²_F. This, together with r(Θ̂_{s*}) = r(Θ_λ^*), implies that

$$\|A\hat\Theta_{s^*} - Z\|_F^2 + \lambda\, r(\hat\Theta_{s^*}) < \|A\Theta_\lambda^* - Z\|_F^2 + \lambda\, r(\Theta_\lambda^*).$$

This contradicts the fact that Θ_λ^* is a minimizer of (7). The converse is shown in the proof of Theorem 3.

6.3 Proof of Theorem 3

We prove the first conclusion by constructing such a solution path vector 𝒮. Let 𝒮_0 = 0 and 𝒮_{r+1} = +∞. Define 𝒮_k for 1 ≤ k ≤ r as the solution of the equation

$$\|\wp_{r-k+1}(Z) - Z\|_F^2 + \mathcal{S}_k\,(r-k+1) = \|\wp_{r-k}(Z) - Z\|_F^2 + \mathcal{S}_k\,(r-k).$$

It follows that

$$\mathcal{S}_k = \|\wp_{r-k}(Z) - Z\|_F^2 - \|\wp_{r-k+1}(Z) - Z\|_F^2 = \sum_{j=r-k+1}^{r}\sigma_j^2 - \sum_{j=r-k+2}^{r}\sigma_j^2 = \sigma_{r-k+1}^2, \qquad (12)$$

where σ_j is the jth largest nonzero singular value of Z. By (12), 𝒮_k is increasing in k. In addition, by the definition of 𝒮_k and 𝒮_{k+1}, whenever λ falls into the interval [𝒮_k, 𝒮_{k+1}), the rank of a global minimizer Θ* of (8) is no more than r − k and larger than r − k − 1. In other words, Θ_λ^* is of rank r − k and is given by ℘_{r−k}(Z). Therefore, the constructed solution path vector 𝒮 satisfies all the requirements in the theorem.

Moreover, when 𝒮k ≤ λ < 𝒮k+1,

$$g(\lambda) = \|\wp_{r-k}(Z) - Z\|_F^2 + \lambda\, r(\wp_{r-k}(Z)) = \|\wp_{r-k}(Z) - Z\|_F^2 + (r-k)\lambda. \qquad (13)$$

Since ℘_{r−k}(Z) is independent of λ, g(λ) is a nondecreasing linear function of λ in each interval [𝒮_k, 𝒮_{k+1}). Combined with the definition of the solution path vector 𝒮, we conclude that g(λ) is nondecreasing and piecewise linear with each element of 𝒮 as a kink point, as shown in Figure 1. This completes the proof.

6.4 Proof of Theorem 4

The proof uses direct calculations. First we bound the Kullback-Leibler loss. By Theorem 1,

$$A\hat\Theta_{r_0} = U_{r(A)}\,\wp_{r_0}(W_{r(A)}),$$

where

$$W_{r(A)} = D_{r(A)}(V^T\Theta_0)_{r(A)} + (U^Te)_{r(A)}.$$

Denote B = D_{r(A)}(V^TΘ_0)_{r(A)}, which has rank r(B) ≤ r0. Since the Frobenius-norm is invariant under orthogonal transformations, it follows that

$$\|A\hat\Theta_{r_0} - A\Theta_0\|_F^2 = \|U_{r(A)}\wp_{r_0}(W_{r(A)}) - UDV^T\Theta_0\|_F^2 = \|U_{r(A)}\wp_{r_0}(W_{r(A)}) - U_{r(A)}D_{r(A)}(V^T\Theta_0)_{r(A)}\|_F^2 = \|\wp_{r_0}(W_{r(A)}) - D_{r(A)}(V^T\Theta_0)_{r(A)}\|_F^2 = \|\wp_{r_0}(B + (U^Te)_{r(A)}) - B\|_F^2 = \|\wp_{r_0}(B + e') - B\|_F^2,$$

where e′ = (U^Te)_{r(A)} and U_{r(A)} denotes the submatrix of the first r(A) columns of U. From the definition of ℘_{r0}(B + e′) we conclude that

$$\|\wp_{r_0}(B + e') - B - e'\|_F^2 \le \|B - B - e'\|_F^2 = \|e'\|_F^2,$$

which implies that,

$$\|\wp_{r_0}(B + e') - B\|_F^2 \le 2\big\langle \wp_{r_0}(B + e') - B,\ e'\big\rangle \le 2\,\|\wp_{r_0}(B + e') - B\|_F\,\|\wp_{r_0}(e')\|_F,$$

where the last inequality follows from Lemma 2. Thus,

$$\|\wp_{r_0}(B + e') - B\|_F^2 \le 4\,\|\wp_{r_0}(e')\|_F^2 = 4\sum_{j=1}^{r_0}\sigma_j^2. \qquad (14)$$

The risk bounds then follow.

Second, we bound ‖Θ̂_{r0} − Θ0‖²_F, which is equal to

$$\Big\|V\begin{bmatrix} D_{r(A)}^{-1}\wp_{r_0}(W_{r(A)}) \\ 0 \end{bmatrix} - \Theta_0\Big\|_F^2 = \Big\|\begin{bmatrix} D_{r(A)}^{-1}\wp_{r_0}(W_{r(A)}) \\ 0 \end{bmatrix} - V^T\Theta_0\Big\|_F^2 = \|D_{r(A)}^{-1}\wp_{r_0}(W_{r(A)}) - (V^T\Theta_0)_{r(A)}\|_F^2 + \|(V^T\Theta_0)_{r(A)^c}\|_F^2 \le \frac{1}{\sigma_{\min}^2\, n}\|\wp_{r_0}(W_{r(A)}) - D_{r(A)}(V^T\Theta_0)_{r(A)}\|_F^2 + \|(V^T\Theta_0)_{r(A)^c}\|_F^2,$$

where σ_{r(A)}(n^{−1/2}A) = σ_min, i.e., σ_{r(A)}(A) = n^{1/2}σ_min. If p = r(A), then the last term vanishes. Thus

$$\|\hat\Theta_{r_0} - \Theta_0\|_F^2 \le \frac{1}{\sigma_{\min}^2\, n}\,\|\wp_{r_0}(W_{r(A)}) - D_{r(A)}(V^T\Theta_0)_{r(A)}\|_F^2 \le \frac{4\sum_{j=1}^{r_0}\sigma_j^2}{\sigma_{\min}^2\, n}.$$

Finally, if r(A) ≥ r0, then 𝔼‖Θ̂_{r0} − Θ0‖²_F ≤ 4(σ_min² n)^{−1} 𝔼(∑_{j=1}^{r0} σ_j²) and 𝔼‖AΘ̂_{r0} − AΘ0‖²_F ≤ 4𝔼(∑_{j=1}^{r0} σ_j²). In particular, if r(A) = r0, then

$$\mathbb{E}\,\|\hat\Theta_{r_0} - \Theta_0\|_F^2 = \frac{1}{\sigma_{\min}^2}\cdot\frac{r_0 k}{n}, \qquad \mathbb{E}\,\|A\hat\Theta_{r_0} - A\Theta_0\|_F^2 = \frac{r_0 k}{n}.$$

7. EMPIRICAL EVALUATIONS

This section examines the effectiveness of the proposed method and compares it with nuclear-norm regularization as well as the SVP method [17]. One benefit of our exact method is that it can be used to evaluate the approximation quality of inexact methods. For the SVP, we choose the default initial value 0 for this local method, since no other choice is guaranteed to deliver better performance; our numerical experiments, not reported here, suggest that the SVP method is indeed sensitive to the choice of the initial value. For nuclear-norm regularization, we select a regularization parameter value whose solution satisfies the rank constraint in (3).

7.1 Synthetic Data

Simulations are performed under (2). First, the n × p design matrix A is sampled, with each entry iid N(10, 1). Second, the p × k truth Θ0 is generated by multiplying a p × r matrix and an r × k matrix, each entry of which is drawn independently from N(10, 1). The data matrix Z is then sampled according to (2), with the entries of e iid N(0, σ²) and σ = 0.5.
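
For reproducibility of the set-up (not the exact code used in the paper), a minimal numpy sketch of this data-generating process is given below; the function name simulate and the rng argument are ours.

```python
import numpy as np

def simulate(n, p, k, r0, sigma=0.5, mu=10.0, rng=None):
    """Sketch of the Section 7.1 set-up: design A with N(mu, 1) entries,
    rank-r0 truth Theta0 = B C with N(mu, 1) entries, noise N(0, sigma^2)."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.normal(mu, 1.0, size=(n, p))
    Theta0 = rng.normal(mu, 1.0, size=(p, r0)) @ rng.normal(mu, 1.0, size=(r0, k))
    Z = A @ Theta0 + rng.normal(0.0, sigma, size=(n, k))
    return A, Theta0, Z
```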

For predictive performance and rank recovery, we compute the MSE ‖A(Θ0 − Θ̂)‖²_F and the absolute rank difference |r̂ − r0|, and record the training time for each method, averaged over 100 simulation replications on a test set of size 10n, where s is tuned over the integers in [0, min(n, p)] using an independent tuning set of size 2n. We consider three possible situations, i.e., k = r0 < p, k > p > r0, and p > k > r0. To illustrate our theoretical conclusion in Theorem 4 that the prediction error bound does not depend on the value of p, we add one more case of p > k > r0 with a larger value of p. The simulation results are summarized in Tables 2–4.

Table 2.

Prediction results for the synthetic data: averaged MSEs as well as their standard deviations, for three competing methods based on the selected tuning parameters over 100 simulation replications. Our, SVP, and Nuclear-Norm refer to our method, the SVP method and nuclear-norm regularization method.

Set-up Our SVP Nuclear-Norm

k = r0 < p, 5 = 5 < 99 0.0110 (0.0065) 0.4591 (0.0386) 0.4130 (0.0360)
k > p > r0, 50 > 40 > 5 12.0010 (0.6125) 10166.5052 (1236.3237) 7709.0779 (960.6982)
p > k > r0, 50 > 40 > 5 13.1945 (0.5682) 10185.7699 (1132.3515) 8116.2653 (943.6058)
p > k > r0, 99 > 40 > 5 121.4019 (70.3054) 14597.0728 (1185.7200) 13153.3217 (1056.5155)

Table 4.

Run time for the synthetic data: average training time in seconds for three competing methods based on the selected tuning parameters over 100 simulation replications. Our, SVP, and Nuclear-Norm refer to our method, the SVP method and nuclear-norm regularization.

Set-up Our SVP Nuclear-Norm

k = r0 < p, 5 = 5 < 99 0.0224 (0.0045) 14.1523 (0.3933) 0.9766 (0.0074)
k > p > r0, 50 > 40 > 5 0.0349 (0.0031) 41.2122 (1.8511) 5.8937 (0.2945)
p > k > r0, 50 > 40 > 5 0.0368 (0.0025) 29.5856 (1.1786) 5.7447 (0.1173)
p > k > r0, 99 > 40 > 5 0.0542 (0.0039) 79.9850 (3.6724) 7.6046 (0.1011)

As suggested in Tables 2 and 3, our exact method is much more precise than the other two methods in prediction in all cases, and in rank recovery in all cases except k = r0 = 5 < p = 99. This is in agreement with the theoretical results in Theorem 4 that exact reconstruction of the oracle estimator is achieved through tuning. Note that the best MSE value does not necessarily yield the best rank recovery, as in the case of k = r0 = 5 < p = 99, which is due to the bias/variance trade-off. As indicated in Table 4, our method is, on average, 10–20 times faster than the other two.

Table 3.

Rank recovery for the synthetic data: averaged values of |r0| as well as their standard deviations, for three competing methods based on the selected tuning parameters over 100 simulation replications. Our, SVP, and Nuclear-Norm refer to our method, the SVP method and nuclear-norm regularization method.

Set-up Our SVP Nuclear-Norm

k = r0 < p, 5 = 5 < 99 0.8400 (0.8005) 0.0000 (0.0000) 0.0000 (0.0000)
k > p > r0, 50 > 40 > 5 0.0000 (0.0000) 3.5300 (1.6358) 1.6400 (0.6439)
p > k > r0, 50 > 40 > 5 0.0000 (0.0000) 4.0800 (1.3830) 1.6500 (0.5925)
p > k > r0, 99 > 40 > 5 0.0000 (0.0000) 2.9800 (1.9121) 0.2200 (0.4399)

7.2 MIT logo Recovery

Next we examine the three competing methods for reconstructing the MIT logo image, which was studied in [27, 17]. The original logo is displayed in Figure 2; we use a grayscale image of size 44 × 85 whose image matrix has rank 7.

Figure 2. Original MIT logo image.

Our objective is to reconstruct this image from its noisy version and examine the quality of reconstruction. Toward this end, we sample the design matrix A with each entry iid N(0, 1), where the sample size n ranges from 20 to 80. To generate a noisy version, we add random error sampled from N(0, 0.5²) to each element of the sampled data. The reconstruction results are displayed in Figure 3 for the three methods, with the default initial value 0 for the SVP.

Figure 3. Reconstruction of a noisy version of the MIT logo with varying sampling size n. From left to right (for each case of n): SVP; nuclear-norm regularization; our method.

Visually, our method delivers highly competitive performance as compared to the other two methods, as displayed in Figure 3, and yields nearly perfect reconstruction when the size of the design matrix A becomes larger, say n = 60 and n = 80. For all the methods, better reconstruction can be reached as n increases, and comparable results are achieved when n is small, say n = 20 and n = 40. This conclusion is consistent with our theoretical analysis.

In practice, the exact rank of the matrix to be estimated is unknown but may reasonably be assumed to be small. In this sense, s needs to be tuned or estimated. Next we investigate the effect of the choice of s on the reconstruction quality, measured by the MSE and the relative recovery error. The latter is defined as the ratio ‖Θ̂_s − Θ0‖_F / ‖Θ0‖_F, commonly used to measure the quality of parameter estimation. We display both as functions of s in Figures 4 and 5, where s is defined in (3).

Figure 4. Relative recovery error as a function of the sampling size for the MIT logo image under different rank constraints.

Figure 5. MSE as a function of the sampling size for the MIT logo image under different rank constraints. Note that the MSE of our method in (b) is not zero but indistinguishable from zero, unlike (c) and (d), in which the MSEs are identical to zero.

For the relative recovery error, a clear transition of our solution occurs around s = 44, after which perfect recovery is achieved, whereas no improvement occurs for the other two methods as s increases. In the case of an underdetermined A, i.e., when A is not of full column rank, all three methods produce similar recovery results. For the MSEs, our method yields the perfect result of zero when s ≥ 7 and a reasonably small value when s = 3, 5, whereas the other two methods lead to elevated MSEs as s increases. This is in accordance with our theory, which suggests that our method reconstructs the oracle estimator, giving perfect image reconstruction, when s ≥ 7.

8. CONCLUSION

This paper considers a nonconvex least squares formulation based on a rank constraint/regularization. We establish the optimality of the global solution against any other method. Experimental results on synthetic and real data demonstrate the efficiency and effectiveness of the proposed algorithm. In future work, we plan to extend the present work to general loss functions.

ACKNOWLEDGEMENT

This work was supported by NSF grants IIS-0953662 and DMS-0906616, NIH grants R01LM010730, 2R01GM081535-01, and R01HL105397.

Footnotes

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

1. More specifically, for a matrix of dimension n × p, the SVD has a complexity of O(min{n²p, p²n}), whereas the matrix inversion has a complexity of O(r(A)³), which can be improved to O(r(A)^{2.807}) when the Strassen algorithm is utilized.

REFERENCES

  1. Abernethy J, Bach F, Evgeniou T, Vert J. Low-rank matrix factorization with attributes. Arxiv preprint cs/0611124. 2006.
  2. Amit Y, Fink M, Srebro N, Ullman S. Uncovering shared structures in multiclass classification. In: Proceedings of the 24th Annual International Conference on Machine Learning. ACM; 2007. pp. 17–24.
  3. Anderson T. Estimating linear restrictions on regression coefficients for multivariate normal distributions. The Annals of Mathematical Statistics. 1951;22(3):327–351.
  4. Anderson T. Asymptotic distribution of the reduced rank regression estimator under general conditions. The Annals of Statistics. 1999;27(4):1141–1154.
  5. André T, Nowak R, Van Veen B. Low-rank estimation of higher order statistics. IEEE Transactions on Signal Processing. 1997;45(3):673–685.
  6. Argyriou A, Evgeniou T, Pontil M. Multi-task feature learning. Advances in Neural Information Processing Systems. 2007;19:41.
  7. Bunea F, She Y, Wegkamp M. Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics. 2011;39(2):1282–1309.
  8. Cai J, Candès E, Shen Z. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization. 2010;20(4):1956–1982.
  9. Candès E, Plan Y. Matrix completion with noise. Arxiv preprint arXiv:0903.3131. 2009.
  10. Candès E, Recht B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics. 2009;9(6):717–772.
  11. Chen J, Liu J, Ye J. Learning incoherent sparse and low-rank patterns from multiple tasks. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2010. pp. 1179–1188.
  12. Eckart C, Young G. The approximation of one matrix by another of lower rank. Psychometrika. 1936;1(3):211–218.
  13. Fazel M, Hindi H, Boyd S. A rank minimization heuristic with application to minimum order system approximation. In: Proceedings of the American Control Conference, vol. 6. IEEE; 2001. pp. 4734–4739.
  14. Friedland S, Torokhti A. Generalized rank-constrained matrix approximations. SIAM Journal on Matrix Analysis and Applications. 2007;29(2):656–659.
  15. Izenman A. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis. 1975;5(2):248–264.
  16. Jaggi M, Sulovský M. A Simple Algorithm for Nuclear Norm Regularized Problems. In: Proceedings of the 27th Annual International Conference on Machine Learning; 2010.
  17. Jain P, Meka R, Dhillon I. Guaranteed rank minimization via singular value projection. Advances in Neural Information Processing Systems. 2010;23:937–945.
  18. Ji H, Liu C, Shen Z, Xu Y. Robust video denoising using low rank matrix completion. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2010. pp. 1791–1798.
  19. Ji S, Ye J. An accelerated gradient method for trace norm minimization. In: Proceedings of the 26th Annual International Conference on Machine Learning; 2009. pp. 457–464.
  20. Kulis B, Surendran A, Platt J. Fast low-rank semidefinite programming for embedding and clustering. In: Eleventh International Conference on Artificial Intelligence and Statistics, AISTATS 2007; 2007.
  21. Liu G, Lin Z, Yan S, Sun J, Yu Y, Ma Y. Robust recovery of subspace structures by low-rank representation. Arxiv preprint arXiv:1010.2955. 2010. doi: 10.1109/TPAMI.2012.88.
  22. Luo X. High dimensional low rank and sparse covariance matrix estimation via convex minimization. Arxiv preprint arXiv:1111.1133. 2011.
  23. Natarajan BK. Sparse approximate solutions to linear systems. SIAM J. Comput. 1995;24(2):227–234.
  24. Negahban S, Wainwright MJ. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Annals of Statistics. 2011;39(2):1069–1097.
  25. Pong T, Tseng P, Ji S, Ye J. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization. 2010;20:3465–3489.
  26. Rao C. Matrix approximations and reduction of dimensionality in multivariate statistical analysis. Multivariate Analysis. 1980;5:3–22.
  27. Recht B, Fazel M, Parrilo P. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review. 2010;52(3):471–501.
  28. Reinsel G, Velu R. Multivariate Reduced-Rank Regression: Theory and Applications. New York: Springer; 1998.
  29. Savas B, Dhillon I. Clustered low rank approximation of graphs in information science applications. In: Proceedings of the SIAM Data Mining Conference; 2011.
  30. Shalev-Shwartz S, Gonen A, Shamir O. Large-Scale Convex Minimization with a Low-Rank Constraint. In: Proceedings of the 28th Annual International Conference on Machine Learning; 2011.
  31. Shen X, Huang H. Optimal model assessment, selection, and combination. Journal of the American Statistical Association. 2006;101(474):554–568.
  32. Sondermann D. Best approximate solutions to matrix equations under rank restrictions. Statistical Papers. 1986;27(1):57–66.
  33. Srebro N, Rennie J, Jaakkola T. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems. 2005;17:1329–1336.
  34. Toh K, Yun S. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization. 2010;6:615–640.
  35. Wright J, Ganesh A, Rao S, Ma Y. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. Advances in Neural Information Processing Systems. 2009.
