Author manuscript; available in PMC 2021 Jul 23. Published in final edited form as: SIAM J Optim. 2020 Oct 28; 30(4): 3098–3121. doi: 10.1137/19m1290000

NOISY MATRIX COMPLETION: UNDERSTANDING STATISTICAL GUARANTEES FOR CONVEX RELAXATION VIA NONCONVEX OPTIMIZATION

YUXIN CHEN, YUEJIE CHI, JIANQING FAN, CONG MA, YULING YAN

Abstract

This paper studies noisy low-rank matrix completion: given partial and noisy entries of a large low-rank matrix, the goal is to estimate the underlying matrix faithfully and efficiently. Arguably one of the most popular paradigms to tackle this problem is convex relaxation, which achieves remarkable efficacy in practice. However, the theoretical support of this approach is still far from optimal in the noisy setting, falling short of explaining its empirical success.

We make progress towards demystifying the practical efficacy of convex relaxation vis-à-vis random noise. When the rank and the condition number of the unknown matrix are bounded by a constant, we demonstrate that the convex programming approach achieves near-optimal estimation errors — in terms of the Euclidean loss, the entrywise loss, and the spectral norm loss — for a wide range of noise levels. All of this is enabled by bridging convex relaxation with the nonconvex Burer–Monteiro approach, a seemingly distinct algorithmic paradigm that is provably robust against noise. More specifically, we show that an approximate critical point of the nonconvex formulation serves as an extremely tight approximation of the convex solution, thus allowing us to transfer the desired statistical guarantees of the nonconvex approach to its convex counterpart.

Keywords: matrix completion, minimaxity, stability, convex relaxation, nonconvex optimization, Burer–Monteiro approach

AMS subject classifications: 90C25, 90C26

1. Introduction.

Suppose we are interested in a large low-rank data matrix, but only get to observe a highly incomplete subset of its entries. Can we hope to estimate the underlying data matrix in a reliable manner? This problem, often dubbed as low-rank matrix completion, spans a diverse array of science and engineering applications (e.g. collaborative filtering [82], localization [86], system identification [70], magnetic resonance parameter mapping [99], joint alignment [21]), and has inspired a flurry of research activities in the past decade. In the statistics literature, matrix completion also falls under the category of factor models with a large amount of missing data, which finds numerous statistical applications such as controlling false discovery rates for dependence data [36,37,39,40], factor-adjusted variable selection [41,63], principal component regression [3, 44, 58, 79], and large covariance matrix estimation [42, 43]. Recent years have witnessed the development of many tractable algorithms that come with statistical guarantees, with convex relaxation being one of the most popular paradigms [14, 15, 46]. See [25, 34] for an overview of this topic.

This paper focuses on noisy low-rank matrix completion, assuming that the revealed entries are corrupted by a certain amount of noise. Setting the stage, consider the task of estimating a rank-r data matrix M^⋆ = [M^⋆_{ij}]_{1≤i,j≤n} ∈ ℝ^{n×n},¹ and suppose that this needs to be performed on the basis of a subset of noisy entries

\[
M_{ij}=M^{\star}_{ij}+E_{ij},\qquad (i,j)\in\Omega, \tag{1.1}
\]

where Ω ⊆ {1, ⋯ , n} × {1, ⋯ , n} denotes a set of indices, and Eij stands for the additive noise at the location (i, j). As we shall elaborate shortly, solving noisy matrix completion via convex relaxation, while practically exhibiting excellent stability (in terms of the estimation errors against noise), is far less understood theoretically compared to the noiseless setting.

1.1. Convex relaxation: limitations of prior results.

Naturally, one would search for a low-rank solution that best fits the observed entries. One choice is the regularized least-squares formulation given by

\[
\operatorname*{minimize}_{Z\in\mathbb{R}^{n\times n}}\quad \frac{1}{2}\sum_{(i,j)\in\Omega}\bigl(Z_{ij}-M_{ij}\bigr)^{2}+\lambda\,\operatorname{rank}(Z), \tag{1.2}
\]

where λ > 0 is some regularization parameter. In words, this approach optimizes a certain trade-off between the goodness of fit (through the squared loss in the first term of (1.2)) and the low-rank structure (through the rank function in the second term of (1.2)). Due to the computational intractability of rank minimization, we often resort to convex relaxation in order to obtain computationally feasible solutions. One notable example is the following convex program:

\[
\operatorname*{minimize}_{Z\in\mathbb{R}^{n\times n}}\quad g(Z)\triangleq\frac{1}{2}\sum_{(i,j)\in\Omega}\bigl(Z_{ij}-M_{ij}\bigr)^{2}+\lambda\|Z\|_{*}, \tag{1.3}
\]

where ‖Z‖_* denotes the nuclear norm (i.e. the sum of singular values) of Z — a convex surrogate for the rank function. A significant portion of existing theory supports the use of this paradigm in the noiseless setting: when E_{ij} vanishes for all (i, j) ∈ Ω, the solution to (1.3) is known to be faithful (i.e. the estimation error becomes zero) even under near-minimal sample complexity [13–15, 20, 51, 80].
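To make the proximal view of (1.3) concrete, the following is a minimal sketch of a proximal gradient (singular value thresholding) iteration for (1.3). It is not the solver of [77] used in our experiments; the function names, step size, and iteration count are illustrative assumptions. Each step takes a gradient step on the squared loss over Ω and then applies singular value soft-thresholding, the proximal operator of the nuclear norm.

```python
import numpy as np

def svt(Z, tau):
    """Singular value soft-thresholding: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def solve_convex(M_obs, mask, lam, step=1.0, n_iter=500):
    """Proximal gradient for g(Z) = 0.5 * sum_{(i,j) in Omega} (Z_ij - M_ij)^2 + lam * ||Z||_*.

    M_obs is the observed data, zero-padded outside Omega; mask is the boolean
    indicator of Omega. The smooth part has gradient P_Omega(Z - M), whose
    Lipschitz constant is 1, so a unit step size is admissible.
    """
    Z = np.zeros_like(M_obs)
    for _ in range(n_iter):
        Z = svt(Z - step * (mask * (Z - M_obs)), step * lam)
    return Z
```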

By contrast, the performance of convex relaxation remains largely unclear when it comes to noisy settings (which are often more practically relevant). Candès and Plan [13] first studied the stability of an equivalent variant² of (1.3) against noise. The estimation error ‖Z_cvx − M^⋆‖_F derived therein for the solution Z_cvx to (1.3) is significantly larger than the oracle lower bound. This does not explain well the effectiveness of (1.3) in practice. In fact, the numerical experiments reported in [13] already indicated that the performance of convex relaxation is far better than their theoretical bounds. This discrepancy between numerical performance and existing theoretical bounds gives rise to the following natural yet challenging questions: Where does the convex program (1.3) stand in terms of its stability vis-à-vis additive noise? Can we establish statistical performance guarantees that match its practical effectiveness?

We note in passing that several other convex relaxation formulations have been thoroughly analyzed for noisy matrix completion, most notably by Negahban and Wainwright [74] and by Koltchinskii et al. [64]. These works have significantly advanced our understanding of the power of convex relaxation. However, the estimators studied therein, particularly the one in [64], are quite different from the one (1.3) considered here; as a consequence, the analysis therein does not lead to improved statistical guarantees of (1.3). Moreover, the performance guarantees provided for these variants are also suboptimal when restricted to the class of “incoherent” or “de-localized” matrices, unless the magnitudes of the noise are fairly large. See Section 1.4 for more detailed discussions as well as numerical comparisons of these algorithms.

1.2. A detour: nonconvex optimization.

While the focus of the current paper is convex relaxation, we take a moment to discuss a seemingly distinct algorithmic paradigm: nonconvex optimization, which turns out to be remarkably helpful in understanding convex relaxation. Inspired by the Burer–Monteiro approach [7], the nonconvex scheme starts by representing the rank-r decision matrix (or parameters) Z as Z = XY^⊤ via low-rank factors X, Y ∈ ℝ^{n×r}, and proceeds by solving the following nonconvex (regularized) least-squares problem [60]

\[
\operatorname*{minimize}_{X,Y\in\mathbb{R}^{n\times r}}\quad \frac{1}{2}\sum_{(i,j)\in\Omega}\bigl[(XY^{\top})_{ij}-M_{ij}\bigr]^{2}+\operatorname{reg}(X,Y). \tag{1.4}
\]

Here, reg(·, ·) denotes a certain regularization term that promotes additional structural properties.

To see its intimate connection with the convex program (1.3), we make the following observation: if the solution to (1.3) has rank r, then it must coincide with the solution to

\[
\operatorname*{minimize}_{X,Y\in\mathbb{R}^{n\times r}}\quad \frac{1}{2}\sum_{(i,j)\in\Omega}\bigl[(XY^{\top})_{ij}-M_{ij}\bigr]^{2}+\underbrace{\frac{\lambda}{2}\|X\|_{\mathrm{F}}^{2}+\frac{\lambda}{2}\|Y\|_{\mathrm{F}}^{2}}_{\triangleq\,\operatorname{reg}(X,Y)}. \tag{1.5}
\]

This can be easily verified by recognizing the elementary fact that

\[
\|Z\|_{*}=\inf_{X,Y\in\mathbb{R}^{n\times r}:\,XY^{\top}=Z}\Bigl\{\frac{1}{2}\|X\|_{\mathrm{F}}^{2}+\frac{1}{2}\|Y\|_{\mathrm{F}}^{2}\Bigr\} \tag{1.6}
\]

for any rank-r matrix Z [73, 87]. Note, however, that it is very challenging to predict when the key assumption in establishing this connection — namely, the rank-r assumption of the solution to the convex program (1.3) — can possibly hold (and in particular, whether it can hold under minimal sample complexity requirement).
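As a quick numerical sanity check of the identity (1.6) — not part of the paper's experiments — the balanced factorization built from the SVD of a rank-r matrix Z attains the infimum:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 50, 3
Z = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))   # a rank-r matrix

nuc = np.linalg.norm(Z, ord='nuc')                               # sum of singular values

# Balanced factors X = U sqrt(S), Y = V sqrt(S) attain the infimum in (1.6).
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
X = U[:, :r] * np.sqrt(s[:r])
Y = Vt[:r, :].T * np.sqrt(s[:r])

print(np.allclose(Z, X @ Y.T))                                   # True
print(np.isclose(nuc, 0.5 * (np.linalg.norm(X, 'fro')**2
                             + np.linalg.norm(Y, 'fro')**2)))    # True
```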

Despite the nonconvexity of (1.4), simple first-order optimization methods, in conjunction with proper initialization, are often effective in solving (1.4). Examples include gradient descent on manifolds [60, 61, 96], gradient descent [71, 88], and projected gradient descent [31, 101]. Apart from its practical efficiency, the nonconvex optimization approach is also appealing in theory. To begin with, algorithms tailored to (1.4) often enable exact recovery in the noiseless setting. Perhaps more importantly, for a wide range of noise settings, the nonconvex approach achieves appealing estimation accuracy [31, 71], which can be significantly better than the bounds derived for convex relaxation discussed earlier. See [25, 33] for a summary of recent results. The appealing estimation accuracy, together with the lower computational cost, makes nonconvex approaches suitable for large-scale problems. Having said that, the convex approach is also widely used in practice. The convex program is often observed to enjoy better stability against model mismatch (e.g. non-uniform sampling of the matrix entries [32]), which makes it appealing for moderate-size problems. In contrast, the existing theory for the convex method does not explain its empirical performance well; see Section 1.1. This motivates us to take a closer inspection of the underlying connection between the two contrasting algorithmic frameworks, in the hope that one can utilize the existing theory for the nonconvex approach to improve the stability analysis of the convex relaxation approach.

1.3. Empirical evidence: convex and nonconvex solutions are often close.

In order to obtain a better sense of the relationship between the convex and nonconvex approaches, we begin by comparing the estimates returned by the two approaches via numerical experiments. Fix n = 1000 and r = 5. We generate M^⋆ = X^⋆Y^{⋆⊤}, where X^⋆, Y^⋆ ∈ ℝ^{n×r} are random orthonormal matrices. Each entry M^⋆_{ij} of M^⋆ is observed with probability p = 0.2 independently, and then corrupted by an independent Gaussian noise E_{ij} ~ N(0, σ²). Throughout the experiments, we set λ = 5σ√(np). The convex program (1.3) is solved by the proximal gradient method [77], whereas we attempt solving the nonconvex formulation (1.5) by gradient descent with spectral initialization (see [33] for details). Let Z_cvx (resp. Z_ncvx = X_ncvx Y_ncvx^⊤) be the solution returned by the convex program (1.3) (resp. the nonconvex program (1.5)). Figure 1 displays the relative estimation errors of both methods (‖Z_cvx − M^⋆‖_F/‖M^⋆‖_F and ‖Z_ncvx − M^⋆‖_F/‖M^⋆‖_F) as well as the relative distance ‖Z_cvx − Z_ncvx‖_F/‖M^⋆‖_F between the two estimates. The results are averaged over 20 independent trials.
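For concreteness, here is a hedged sketch of one trial of this experiment. It reuses the solve_convex routine from the earlier sketch for (1.3) and substitutes a plain gradient descent with spectral initialization for the nonconvex solver of [33]; the step size, iteration counts, and initialization details are illustrative choices rather than the exact ones used to produce Figure 1.

```python
import numpy as np

def run_trial(n=1000, r=5, p=0.2, sigma=1e-3, seed=0):
    """One trial of the Section 1.3 experiment: relative errors of Z_cvx, Z_ncvx,
    and the relative distance between them."""
    rng = np.random.default_rng(seed)
    # Random orthonormal factors via QR, as in the experimental setup.
    X_star = np.linalg.qr(rng.standard_normal((n, r)))[0]
    Y_star = np.linalg.qr(rng.standard_normal((n, r)))[0]
    M_star = X_star @ Y_star.T
    mask = rng.random((n, n)) < p
    M_obs = mask * (M_star + sigma * rng.standard_normal((n, n)))

    lam = 5 * sigma * np.sqrt(n * p)
    Z_cvx = solve_convex(M_obs, mask, lam)          # earlier proximal-gradient sketch

    # Nonconvex route: gradient descent on (1.5) with spectral initialization.
    U, s, Vt = np.linalg.svd(M_obs / p, full_matrices=False)
    X, Y = U[:, :r] * np.sqrt(s[:r]), Vt[:r, :].T * np.sqrt(s[:r])
    eta = 0.2 / s[0]
    for _ in range(2000):
        R = mask * (X @ Y.T) - M_obs                # P_Omega(X Y^T - M)
        X, Y = X - eta * (R @ Y + lam * X), Y - eta * (R.T @ X + lam * Y)
    Z_ncvx = X @ Y.T

    nrm = np.linalg.norm(M_star)
    return (np.linalg.norm(Z_cvx - M_star) / nrm,
            np.linalg.norm(Z_ncvx - M_star) / nrm,
            np.linalg.norm(Z_cvx - Z_ncvx) / nrm)
```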

Fig. 1. The relative estimation errors of both Z_cvx (the estimate of the convex program (1.3)) and Z_ncvx (the estimate returned by the nonconvex approach tailored to (1.5)), and the relative distance between them, vs. the standard deviation σ of the noise. The results are reported for n = 1000, r = 5, p = 0.2, λ = 5σ√(np) and are averaged over 20 independent trials.

Interestingly, the distance between the convex and the nonconvex solutions seems extremely small (e.g. ‖Z_cvx − Z_ncvx‖_F/‖M^⋆‖_F is typically below 10^{−7}); in comparison, the relative estimation errors of both Z_cvx and Z_ncvx are substantially larger. In other words, the estimate returned by the nonconvex approach serves as a remarkably accurate approximation of the convex solution. Given that the nonconvex approach is often guaranteed to achieve intriguing statistical guarantees vis-à-vis random noise [71], this suggests that the convex program is equally stable — a phenomenon that was not captured by prior theory [13]. Can we leverage existing theory for the nonconvex scheme to improve the statistical analysis of the convex relaxation approach?

Before continuing, we remark that the above numerical connection between convex relaxation (1.3) and nonconvex optimization (1.5) has already been observed multiple times in prior literature [45, 61, 73, 81, 87]. Nevertheless, all prior observations on this connection were either completely empirical, or provided in a way that does not lead to improved statistical error bounds of the convex paradigm (1.3). In fact, the difficulty in rigorously justifying the above numerical observations has been noted in the literature; see e.g. [61].3

1.4. Models and main results.

The numerical experiments reported in Section 1.3 suggest an alternative route for analyzing convex relaxation for noisy matrix completion. If one can formally justify the proximity between the convex and the nonconvex solutions, then it is possible to propagate the appealing stability guarantees from the nonconvex scheme to the convex approach. As it turns out, this simple idea leads to significantly enhanced statistical guarantees for the convex program (1.3), which we formally present in this subsection.

1.4.1. Models and assumptions.

Before proceeding, we introduce a few model assumptions that play a crucial role in our theory.

Assumption 1.
  1. (Random sampling) Each index (i, j) belongs to the index set Ω independently with probability p.

  2. (Random noise) The noise matrix E = [E_{ij}]_{1≤i,j≤n} is composed of i.i.d. zero-mean sub-Gaussian random variables with sub-Gaussian norm at most σ > 0, i.e. ‖E_{ij}‖_{ψ₂} ≤ σ (see [94, Definition 5.7]).

In addition, let M^⋆ = U^⋆Σ^⋆V^{⋆⊤} be the singular value decomposition (SVD) of M^⋆, where U^⋆, V^⋆ ∈ ℝ^{n×r} consist of orthonormal columns and Σ^⋆ = diag(σ^⋆₁, σ^⋆₂, …, σ^⋆_r) ∈ ℝ^{r×r} is a diagonal matrix obeying σ_max ≜ σ^⋆₁ ≥ σ^⋆₂ ≥ ⋯ ≥ σ^⋆_r ≕ σ_min. Denote by κ ≜ σ_max/σ_min the condition number of M^⋆. We impose the following incoherence condition on M^⋆, which is known to be crucial for reliable recovery of M^⋆ [14, 20].

Definition 1.1.

A rank-r matrix M^⋆ ∈ ℝ^{n×n} with SVD M^⋆ = U^⋆Σ^⋆V^{⋆⊤} is said to be μ-incoherent if

\[
\|U^{\star}\|_{2,\infty}\le\sqrt{\frac{\mu}{n}}\,\|U^{\star}\|_{\mathrm{F}}=\sqrt{\frac{\mu r}{n}}
\qquad\text{and}\qquad
\|V^{\star}\|_{2,\infty}\le\sqrt{\frac{\mu}{n}}\,\|V^{\star}\|_{\mathrm{F}}=\sqrt{\frac{\mu r}{n}}.
\]

Here, ‖U‖_{2,∞} denotes the largest ℓ₂ norm of the rows of a matrix U.
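Definition 1.1 can be checked directly from an SVD. A minimal sketch (the function name is ours) returning the smallest μ for which both bounds hold:

```python
import numpy as np

def incoherence(M, r):
    """Incoherence parameter mu of a rank-r matrix M (Definition 1.1)."""
    n = M.shape[0]
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    U, V = U[:, :r], Vt[:r, :].T
    row_u = np.max(np.sum(U**2, axis=1))    # ||U||_{2,inf}^2
    row_v = np.max(np.sum(V**2, axis=1))    # ||V||_{2,inf}^2
    return n * max(row_u, row_v) / r        # smallest mu satisfying both bounds
```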

Remark 1.2.

It is worth noting that several other conditions on the low-rank matrix have been proposed in the noisy setting. Examples include the spikiness condition [74] and the bounded ℓ_∞ norm condition [64]. However, these conditions alone are often unable to ensure identifiability of the true matrix even in the absence of noise.

1.4.2. Theoretical guarantees: when both the rank and the condition number are constants.

With these in place, we are positioned to present our improved statistical guarantees for convex relaxation. For convenience of presentation, we shall begin with a simple yet fundamentally important class of settings when the rank r and the condition number κ are both fixed constants. As it turns out, this class of problems arises in a variety of engineering applications. For example, in a fundamental problem in cryo-EM called angular synchronization [85], one needs to deal with rank-2 or rank-3 matrices with κ = 1; in a joint shape mapping problem that arises in computer graphics [29, 54], the matrix under consideration has low rank and a condition number equal to 1; and in structure from motion in computer vision [91], one often seeks to estimate a matrix with r ≤ 3 and a small condition number. Encouragingly, our theory delivers near-optimal statistical guarantees for such practically important scenarios.

Theorem 1.3.

Let M^⋆ be rank-r and μ-incoherent with a condition number κ, where the rank and the condition number satisfy r, κ = O(1).⁴ Suppose that Assumption 1 holds and take λ = C_λ σ√(np) in (1.3) for some large enough constant C_λ > 0. Assume the sample size obeys n²p ≥ Cμ²n log³ n for some sufficiently large constant C > 0, and the noise satisfies σ√(n/p) ≤ cn‖M^⋆‖_∞/√(μ³ log n) for some sufficiently small constant c > 0. Then with probability exceeding 1 − O(n^{−3}):

  1. Any minimizer Z_cvx of (1.3) obeys
\[
\|Z_{\mathrm{cvx}}-M^{\star}\|_{\mathrm{F}}\lesssim\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}\,\|M^{\star}\|_{\mathrm{F}};\qquad
\|Z_{\mathrm{cvx}}-M^{\star}\|\lesssim\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}\,\|M^{\star}\|; \tag{1.7a}
\]
\[
\|Z_{\mathrm{cvx}}-M^{\star}\|_{\infty}\lesssim\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{\mu n\log n}{p}}\,\|M^{\star}\|_{\infty}. \tag{1.7b}
\]
  2. Letting Z_{cvx,r} ≜ argmin_{Z: rank(Z)≤r} ‖Z − Z_cvx‖_F be the best rank-r approximation of Z_cvx, we have⁵
\[
\|Z_{\mathrm{cvx},r}-Z_{\mathrm{cvx}}\|_{\mathrm{F}}\le\frac{1}{n^{3}}\cdot\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}\,\|M^{\star}\|, \tag{1.8}
\]
     and the error bounds in (1.7) continue to hold if Z_cvx is replaced by Z_{cvx,r}.

To explain the applicability of the above theorem, we first remark on the conditions required for this theorem to hold; for simplicity, we assume that μ = O(1).

  • Sample complexity. To begin with, the sample size needs to exceed the order of n poly log(n), which is information-theoretically optimal up to some logarithmic term [15].

  • Noise size. We then turn attention to the noise requirement, i.e. σ√(n/p) ≲ n‖M^⋆‖_∞/√(log n). Note that under the sample size condition n²p ≥ Cn log³ n, the size of the noise in each entry is allowed to be substantially larger than the maximum entry in the matrix. In other words, the signal-to-noise ratio w.r.t. each observed entry could be very small. According to prior literature (e.g. [61, Theorem 1.1] and [71, Theorem 2]), such noise conditions are typically required for spectral methods to perform noticeably better than random guessing.

  • Regularization parameter. Finally, we remark on the choice of the regularization parameter λ. As in most regularized estimators, we need to pick λ large enough to suppress the noise E (which controls the variance), and small enough so as not to shrink the signal M^⋆ too much (which controls the estimation bias). It turns out that setting λ ≍ σ√(np) achieves the desired bias-variance trade-off; see Lemma 3.

    Further, Theorem 1.3 has several important implications about the power of convex relaxation. The discussions below again concentrate on the case where μ = O(1).

  • Near-optimal stability guarantees. Our results reveal that the Euclidean error of any convex optimizer Zcvx of (1.3) obeys
\[
\|Z_{\mathrm{cvx}}-M^{\star}\|_{\mathrm{F}}\lesssim\sigma\sqrt{n/p}, \tag{1.9}
\]
    implying that the performance of convex relaxation degrades gracefully as the signal-to-noise ratio decreases. This result matches the oracle lower bound derived in [13, Eq. (III.13)], and also improves upon their statistical guarantee. Specifically, Candès and Plan [13] provided a stability guarantee in the presence of arbitrary bounded noise. When applied to the random noise model assumed here, their results yield ‖Z_cvx − M^⋆‖_F ≲ σn^{3/2}, which could be O(√(n²p)) times more conservative than our bound (1.9).
  • Nearly low-rank structure of the convex solution. In light of (1.8), the optimizer of the convex program (1.3) is almost, if not exactly, rank-r. When the true rank r is known a priori, it is not uncommon for practitioners to return the rank-r approximation of Z_cvx (see the short SVD-truncation sketch following this list). Our theorem formally justifies that there is no loss of statistical accuracy — measured in terms of either ‖·‖_F or ‖·‖_∞ — when performing the rank-r projection operation.

  • Entrywise and spectral norm error control. Moving beyond the Euclidean loss, our theory uncovers that the estimation errors of the convex optimizer are fairly spread out across all entries, thus implying near-optimal entrywise error control. This is a stronger form of error bounds, as an optimal Euclidean estimation accuracy alone does not preclude the possibility of the estimation errors being spiky and localized. Furthermore, the spectral norm error of the convex optimizer is also well-controlled. See Figure 2 for the numerical support.

  • Implicit regularization. As a byproduct of the entrywise error control, this result indicates that the additional constraint ‖Z‖_∞ ≤ α suggested by [74] is automatically satisfied and is hence unnecessary. In other words, the convex approach implicitly controls the spikiness of its entries, without resorting to explicit regularization. This is also confirmed by the numerical experiments reported in Figure 3, where we see that the estimation error of (1.3) and that of the constrained version considered in [74] are nearly identical.

  • Statistical guarantees for fast iterative optimization methods. Various iterative algorithms have been developed to solve the nuclear norm regularized least-squares problem (1.3) up to an arbitrarily prescribed accuracy, examples including SVT (or proximal gradient methods) [8], FPC [72], SOFT–IMPUTE [73], FISTA [5, 90], to name just a few. Our theory immediately provides statistical guarantees for these algorithms. As we shall make precise in Section 2, any point Z with g(Z) ≤ g(Zcvx) + ε (where g(·) is defined in (1.3)) enjoys the same error bounds as in (1.7) (with Zcvx replaced by Z in (1.7)), provided that ε > 0 is sufficiently small. In other words, when these convex optimization algorithms converge w.r.t. the objective value, they are guaranteed to return a statistically reliable estimate.
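As referenced in the bullet on the nearly low-rank structure, returning the best rank-r approximation Z_{cvx,r} of Z_cvx amounts to truncating its SVD (by the Eckart–Young theorem). A minimal sketch:

```python
import numpy as np

def best_rank_r(Z, r):
    """Best rank-r approximation of Z in Frobenius norm, via SVD truncation."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]
```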

Fig. 2. The relative estimation error of Z_cvx measured in both ‖·‖_∞ (i.e. ‖Z_cvx − M^⋆‖_∞/‖M^⋆‖_∞) and ‖·‖ (i.e. ‖Z_cvx − M^⋆‖/‖M^⋆‖) vs. the standard deviation σ of the noise. The results are reported for n = 1000, r = 5, p = 0.2, λ = 5σ√(np) and are averaged over 20 independent trials.

Fig. 3. The relative estimation errors of Ẑ, measured in terms of ‖·‖_F and ‖·‖_∞, vs. the standard deviation σ of the noise. Here Ẑ can be either the modified convex estimator in [64], the constrained convex estimator in [74], or the vanilla convex estimator (1.3). The results are reported for n = 1000, r = 5, p = 0.2, and are averaged over 20 Monte Carlo trials. For the modified convex estimator in [64], we choose the regularization parameter λ therein to be 1.5 max{σ, ‖M^⋆‖_∞}√(1/(n³p)), as suggested by their theory. For the constrained one in [74], the regularization parameter λ is set to be 5σ√(np) and the constraint level α is set to be ‖M^⋆‖_∞; both choices are recommended by [74]. As for (1.3), we set λ = 5σ√(np).

To better understand our contributions, we take a moment to discuss two important but different convex programs studied in [74] and [64]. To begin with, under a spikiness assumption on the low-rank matrix, Negahban and Wainwright [74] proposed to enforce an extra entrywise constraint ‖Zα when solving (1.3), in order to explicitly control the spikiness of the estimate. When applied to our model with r, κ, μ ≍ 1, their results read (up to some logarithmic factor)

\[
\|\hat Z-M^{\star}\|_{\mathrm{F}}\lesssim\max\{\sigma,\|M^{\star}\|_{\infty}\}\sqrt{n/p}, \tag{1.10}
\]

where Ẑ is the estimate returned by their modified convex algorithm. While this matches the optimal bound when σ ≳ ‖M^⋆‖_∞, it becomes suboptimal when σ ≪ ‖M^⋆‖_∞ (under our models). Moreover, as we have already discussed, the extra spikiness constraint becomes unnecessary in the regime considered herein. This also means that our result complements existing theory about the convex program in [74] by demonstrating its minimaxity for an additional range of noise. Another work by Koltchinskii et al. [64] investigated a completely different convex algorithm, which is effectively a spectral method (namely, one round of soft singular value thresholding on a rescaled zero-padded data matrix). The algorithm is shown to be minimax optimal over the class of low-rank matrices with bounded ℓ_∞ norm (note that this is very different from the set of incoherent matrices studied here). When specialized to our model, their error bound is the same as (1.10) (modulo some log factor), which also becomes suboptimal as σ decreases. The advantage of the convex program (1.3) is shown in Figure 3.
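To illustrate the one-step estimator just described, here is a simplified rendering in the spirit of [64]: soft-threshold the singular values of the inverse-probability-weighted, zero-padded data matrix. The rescaling and the threshold below are placeholders; the exact choices in [64] differ in constants and logarithmic factors.

```python
import numpy as np

def spectral_soft_threshold(M_obs, mask, lam):
    """One round of singular value soft-thresholding on the rescaled,
    zero-padded data matrix (a simplified stand-in for the estimator of [64])."""
    p_hat = mask.mean()                       # empirical sampling rate
    U, s, Vt = np.linalg.svd(M_obs / p_hat, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt
```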

1.4.3. Theoretical guarantees: extensions to more general settings.

So far we have presented results when the true matrix has bounded rank and condition number, i.e. r, κ = O(1). Our theory actually accommodates a significantly broader range of scenarios, where the rank and the condition number are both allowed to grow with the dimension n.

Theorem 1.4.

Let M^⋆ be rank-r and μ-incoherent with a condition number κ. Suppose Assumption 1 holds and take λ = C_λσ√(np) in (1.3) for some large enough constant C_λ > 0. Assume the sample size obeys n²p ≥ Cκ⁴μ²r²n log³ n for some sufficiently large constant C > 0, and the noise satisfies σ√(n/p) ≤ cσ_min/√(κ⁴μr log n) for some sufficiently small constant c > 0. Then with probability exceeding 1 − O(n^{−3}),

  1. Any minimizer Z_cvx of (1.3) obeys
\[
\|Z_{\mathrm{cvx}}-M^{\star}\|_{\mathrm{F}}\lesssim\kappa\,\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}\,\|M^{\star}\|_{\mathrm{F}}, \tag{1.11a}
\]
\[
\|Z_{\mathrm{cvx}}-M^{\star}\|_{\infty}\lesssim\kappa^{3}\sqrt{\mu r}\,\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n\log n}{p}}\,\|M^{\star}\|_{\infty}, \tag{1.11b}
\]
\[
\|Z_{\mathrm{cvx}}-M^{\star}\|\lesssim\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}\,\|M^{\star}\|; \tag{1.11c}
\]
  2. Letting Z_{cvx,r} ≜ argmin_{Z: rank(Z)≤r} ‖Z − Z_cvx‖_F be the best rank-r approximation of Z_cvx, we have
\[
\|Z_{\mathrm{cvx},r}-Z_{\mathrm{cvx}}\|_{\mathrm{F}}\le\frac{1}{n^{3}}\cdot\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}\,\|M^{\star}\|, \tag{1.12}
\]
     and the error bounds in (1.11) continue to hold if Z_cvx is replaced by Z_{cvx,r}.
Remark 1.5 (The noise condition).

The incoherence condition (cf. Definition 1.1) guarantees that the largest entry ‖M^⋆‖_∞ of the matrix M^⋆ is no larger than κμrσ_min/n. As a result, the noise condition stated in Theorem 1.4 covers all scenarios obeying

\[
\sigma\sqrt{\frac{n}{p}}\;\lesssim\;\frac{n}{\sqrt{\kappa^{6}\mu^{3}r^{3}\log n}}\,\|M^{\star}\|_{\infty}.
\]

Therefore, the typical size of the noise is allowed to be much larger than the size of the largest entry of M^⋆, provided that p ≫ κ⁶μ³r³ log n / n. In particular, when r, κ = O(1), this recovers the noise condition in Theorem 1.3.

Notably, the sample size condition for noisy matrix completion (i.e. n²p ≳ κ⁴μ²r²n log³ n) is more stringent than that in the noiseless setting (i.e. n²p ≳ nr log² n), and our statistical guarantees are likely suboptimal with respect to the dependency on r and κ. It turns out that both convex and nonconvex methods work well numerically even when the number of samples n²p is on the order of nr log n, which is much smaller than the required sample complexity κ⁴r²n log³ n in Theorem 1.4; see Figure 4 for an illustration. From a technical point of view, this sub-optimality is mainly due to the analysis of nonconvex optimization, a key ingredient of our analysis of convex relaxation. In fact, the state-of-the-art nonconvex analyses [31, 61, 71] require the sample size to be much larger than the optimal one (e.g. n²p ≳ n·poly(r)·poly(κ)) even in the noiseless setting. It would certainly be interesting, and in fact important, to see whether it is possible to develop a theory with optimal dependency on r and κ. We leave this for future investigation.

Fig. 4. The estimation errors of Z_cvx and Z_ncvx, measured in terms of ‖·‖_F, vs. the rescaled sample complexity n²p/(2nr log n). The results are reported for n = 1000, p = 0.2, σ = 10^{−4}, and are averaged over 20 Monte Carlo trials. The rank r is varied from 1 to 25 and the condition number κ is chosen from {1, 10, 100}.

It is also instrumental to compare our sample complexity and error bounds with those in the prior literature [10, 62, 64, 74]. See Table 1 for a comparison when the incoherence parameter μ is O(1). Here

\[
\alpha\triangleq\frac{n\|M^{\star}\|_{\infty}}{\|M^{\star}\|_{\mathrm{F}}},
\qquad\text{and}\qquad
\|M^{\star}\|_{\max}\triangleq\min_{M^{\star}=UV^{\top}}\|U\|_{2,\infty}\|V\|_{2,\infty}.
\]

Indeed, all the papers [10, 62, 64, 74] studied convex relaxation approaches for noisy matrix completion. However, there are two main differences from our analysis here. The first is regarding the convex method itself. All four papers [10, 62, 64, 74] require extra knowledge about the underlying low-rank matrix for the convex approach, e.g. the ℓ_∞ norm of the low-rank matrix in [10, 62, 74] and the sharpness constant in [74]. This results in convex formulations different from the one we analyze here. The second major difference lies in the performance guarantees. In both the well-conditioned and the highly ill-conditioned regimes, the estimation error provided in [10, 62, 64, 74] does not vanish as the size of the noise decreases to zero. This is in stark contrast to our theory, which guarantees the stability of the convex approach for noisy matrix completion.

Table 1.

Comparisons of sample complexities and Euclidean estimation errors when μ = O(1).

Paper | Sample complexity | Euclidean estimation error
[64] | n log² n | max{σ, ‖M^⋆‖_∞}·√((nr log n)/p)
[75] | n log n | max{σ, 1/√n}·α·√((nr log n)/p)
[62] | n log³ n | max{σ, ‖M^⋆‖_∞}·√((nr log n)/p)
[10] | n | √(max{σ, ‖M^⋆‖_∞}·‖M^⋆‖_max)·n^{3/4}/p^{1/4}
Ours | κ⁴r²n log³ n | κ²σ√(nr/p)

Despite the above sub-optimality issue, implications similar to those of Theorem 1.3 hold for this general setting. To begin with, the nearly low-rank structure of the convex solution is preserved (cf. (1.12)). In addition, the estimation error of the convex estimate is spread out across entries (cf. (1.11b)), thus uncovering an implicit regularization phenomenon underlying convex relaxation (which implicitly regularizes the spikiness constraint on the solution). Last but not least, the upper bounds (1.11) and (1.12) continue to hold for approximate minimizers of the convex program (1.3), thus yielding statistical guarantees for numerous iterative algorithms aimed at minimizing (1.3).

1.5. Numerical experiments.

This subsection collects numerical support for our theoretical findings in Section 1.4.

Entrywise and spectral norm error of the convex approach.

The experimental setting is similar to that used to produce Figure 1. For completeness, we repeat it here. Fix n = 1000 and r = 5. We generate M^⋆ = X^⋆Y^{⋆⊤}, where X^⋆, Y^⋆ ∈ ℝ^{n×r} are random orthonormal matrices. Each entry M^⋆_{ij} of M^⋆ is observed with probability p = 0.2 independently, and then corrupted by an independent Gaussian noise E_{ij} ~ N(0, σ²). Throughout the experiments, we set λ = 5σ√(np). The convex program (1.3) is solved by the proximal gradient method [77]. Figure 2 displays the relative estimation errors of the convex approach (1.3) in both the ℓ_∞ norm and the spectral norm. As can be seen, both forms of estimation error scale linearly with the noise level, corroborating our theory.

Comparisons with other convex approaches.

Utilizing the same experimental setting as before, we compare the numerical performance of (1.3) with two important but different convex programs studied in [74] and [64]. For the modified convex estimator in [64], we choose the regularization parameter λ therein to be 1.5 max{σ, ‖M^⋆‖_∞}√(1/(n³p)), as suggested by their theory. For the constrained one in [74], the regularization parameter λ is set to be 5σ√(np) and the constraint level α is set to be ‖M^⋆‖_∞; both choices are recommended by [74]. Figure 3 displays the relative Euclidean and entrywise estimation errors of the three convex programs. As can be seen, the estimation error of the thresholding-based spectral algorithm [64] does not decrease as the noise shrinks, and it is uniformly outperformed by convex relaxation (1.3) and the constrained estimator in [74]. In fact, this is part of our motivation to pursue an improved theoretical understanding of the formulation (1.3).

Numerical sample complexity of both convex and nonconvex methods.

Theorem 1.4 requires the sample complexity n²p to exceed κ⁴r²n log³ n, which we believe could be improved. To justify this, under a similar setting to that of Figure 1, we plot the estimation errors of Z_cvx and Z_ncvx vs. the rescaled sample complexity n²p/(2nr log n) in Figure 4. It can be seen that the estimation errors degrade gracefully as the rescaled sample size gets smaller, and that the performance does not change w.r.t. the condition number κ.

2. Strategy and novelty.

In this section, we introduce the strategy for proving our main theorem, i.e. Theorem 1.4. Theorem 1.3 follows immediately. Informally, the main technical difficulty stems from the lack of closed-form expressions for the primal solution to (1.3), which in turn makes it difficult to construct a dual certificate. This is in stark contrast to the noiseless setting, where one clearly anticipates the ground truth M to be the primal solution; in fact, this is precisely why the analysis for the noisy case is significantly more challenging. Our strategy, as we shall detail below, mainly entails invoking an iterative nonconvex algorithm to “approximate” such a primal solution.

Before continuing, we introduce a few more notations. Let P_Ω(·): ℝ^{n×n} → ℝ^{n×n} represent the projection onto the subspace of matrices supported on Ω, namely,
\[
\bigl[\mathcal{P}_{\Omega}(Z)\bigr]_{ij}=\begin{cases} Z_{ij}, & \text{if }(i,j)\in\Omega,\\ 0, & \text{otherwise,}\end{cases} \tag{2.1}
\]

for any matrix Z ∈ ℝ^{n×n}. For a rank-r matrix M with singular value decomposition M = UΣV^⊤, denote by T the tangent space of the rank-r manifold at M, i.e.
\[
\mathcal{T}=\bigl\{UA^{\top}+BV^{\top}\mid A,B\in\mathbb{R}^{n\times r}\bigr\}. \tag{2.2}
\]

Correspondingly, let P_T(·) be the orthogonal projection onto the subspace T, that is,
\[
\mathcal{P}_{\mathcal{T}}(Z)=UU^{\top}Z+ZVV^{\top}-UU^{\top}ZVV^{\top} \tag{2.3}
\]
for any matrix Z ∈ ℝ^{n×n}. In addition, let T^⊥ and P_{T^⊥}(·) denote the orthogonal complement of T and the projection onto T^⊥, respectively. Regarding the ground truth, we denote
\[
X^{\star}=U^{\star}(\Sigma^{\star})^{1/2}\qquad\text{and}\qquad Y^{\star}=V^{\star}(\Sigma^{\star})^{1/2}. \tag{2.4}
\]
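The two projections (2.1) and (2.3) are straightforward to implement; a minimal sketch (function names are ours):

```python
import numpy as np

def P_Omega(Z, mask):
    """Projection (2.1) onto matrices supported on Omega (mask is boolean)."""
    return mask * Z

def P_T(Z, U, V):
    """Projection (2.3) onto the tangent space T at a rank-r matrix whose
    column/row spaces are spanned by U and V (orthonormal columns)."""
    UUt, VVt = U @ U.T, V @ V.T
    return UUt @ Z + Z @ VVt - UUt @ Z @ VVt

def P_T_perp(Z, U, V):
    """Projection onto the orthogonal complement of T."""
    return Z - P_T(Z, U, V)
```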

The nonconvex problem (1.5) is equivalent to
\[
\operatorname*{minimize}_{X,Y\in\mathbb{R}^{n\times r}}\quad f(X,Y)\triangleq\frac{1}{2p}\bigl\|\mathcal{P}_{\Omega}\bigl(XY^{\top}-M\bigr)\bigr\|_{\mathrm{F}}^{2}+\frac{\lambda}{2p}\|X\|_{\mathrm{F}}^{2}+\frac{\lambda}{2p}\|Y\|_{\mathrm{F}}^{2}, \tag{2.5}
\]
where we have inserted an extra factor 1/p (compared to (1.5)) to simplify the presentation of the analysis later on.
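For reference, the objective (2.5) and its gradient can be written in a few lines; the sketch below is ours and only mirrors the formulas above (M_obs denotes the zero-padded noisy data).

```python
import numpy as np

def f_and_grad(X, Y, M_obs, mask, lam, p):
    """Objective (2.5) and its gradient with respect to (X, Y)."""
    R = mask * (X @ Y.T) - M_obs                      # P_Omega(X Y^T - M)
    f = (0.5 * np.linalg.norm(R)**2
         + 0.5 * lam * (np.linalg.norm(X)**2 + np.linalg.norm(Y)**2)) / p
    grad_X = (R @ Y + lam * X) / p                    # (1/p) P_Omega(XY^T - M) Y + (lam/p) X
    grad_Y = (R.T @ X + lam * Y) / p
    return f, grad_X, grad_Y
```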

2.1. Exact duality.

In order to analyze the convex program (1.3), it is natural to start with the first-order optimality condition. Specifically, suppose that Z ∈ ℝ^{n×n} is a (primal) solution to (1.3) with SVD Z = UΣV^⊤.⁶ As before, let T be the tangent space of Z, and let T^⊥ be the orthogonal complement of T. Then the first-order optimality condition for (1.3) reads: there exists a matrix W ∈ T^⊥ (called a dual certificate) such that

\[
\frac{1}{\lambda}\mathcal{P}_{\Omega}(M-Z)=UV^{\top}+W; \tag{2.6a}
\]
\[
\|W\|\le 1. \tag{2.6b}
\]

This condition is not only necessary to certify the optimality of Z, but also “almost sufficient” in guaranteeing the uniqueness of the solution Z; see Appendix SM2 for in-depth discussions.

The challenge then boils down to identifying such a primal-dual pair (Z, W) satisfying the optimality condition (2.6). For the noise-free case, the primal solution is clearly Z = M if exact recovery is to be expected; the dual certificate can then be either constructed exactly by the least-squares solution to a certain underdetermined linear system [14, 15], or produced approximately via a clever golfing scheme pioneered by Gross [51]. For the noisy case, however, it is often difficult to hypothesize on the primal solution Z, as it depends on the random noise in a complicated way. In fact, the lack of a suitable guess of Z (and hence W) was the major hurdle that prior works faced when carrying out the duality analysis.

2.2. A candidate primal solution via nonconvex optimization.

Motivated by the numerical experiments in Section 1.3, we propose to examine whether the optimizer of the nonconvex problem (1.5) stays close to the solution to the convex program (1.3). Towards this, suppose that X, Y ∈ ℝ^{n×r} form a critical point of (1.5) with rank(X) = rank(Y) = r.⁷ Then the first-order condition reads

\[
\frac{1}{\lambda}\mathcal{P}_{\Omega}\bigl(M-XY^{\top}\bigr)Y=X; \tag{2.7a}
\]
\[
\frac{1}{\lambda}\bigl[\mathcal{P}_{\Omega}\bigl(M-XY^{\top}\bigr)\bigr]^{\top}X=Y. \tag{2.7b}
\]

To develop some intuition about the connection between (2.6) and (2.7), let us take a look at the case with r = 1. Denote X = x and Y = y, and assume that the two rank-1 factors are "balanced", namely, ‖x‖₂ = ‖y‖₂ ≠ 0. It then follows from (2.7) that λ^{−1}P_Ω(M − xy^⊤) has a singular value 1, whose corresponding left and right singular vectors are x/‖x‖₂ and y/‖y‖₂, respectively. In other words, one can express

\[
\frac{1}{\lambda}\mathcal{P}_{\Omega}\bigl(M-xy^{\top}\bigr)=\frac{1}{\|x\|_{2}\|y\|_{2}}\,xy^{\top}+W, \tag{2.8}
\]

where W is orthogonal to the tangent space of xy^⊤; this is precisely the condition (2.6a). It remains to argue that (2.6b) is valid as well. Towards this end, the first-order condition (2.7) alone is insufficient, as there might be non-global critical points (e.g. saddle points) that are unable to approximate the convex solution well. Fortunately, as long as the candidate xy^⊤ is not far away from the ground truth M^⋆, one can guarantee ‖W‖ < 1 as required in (2.6b).

The above informal argument about the link between the convex and the nonconvex problems can be formalized. To begin with, we introduce the following conditions on the regularization parameter λ.

Condition 1 (Regularization parameter).

The regularization parameter λ satisfies

  1. (Relative to noise) ‖P_Ω(E)‖ < λ/8.

  2. (Relative to nonconvex solution) ‖P_Ω(XY^⊤ − M^⋆) − p(XY^⊤ − M^⋆)‖ < λ/8.

Remark 2.1.

Condition 1 requires that the regularization parameter λ dominate a certain norm of the noise, as well as of the deviation of P_Ω(XY^⊤ − M^⋆) from its mean p(XY^⊤ − M^⋆); as will be seen shortly, the latter condition can be met when (X, Y) is sufficiently close to (X^⋆, Y^⋆).

With the above condition in place, the following result demonstrates that a critical point (X, Y) of the nonconvex problem (1.5) readily translates to the unique minimizer of the convex program (1.3). This lemma is established in Appendix SM3.1.

Lemma 2.2 (Exact nonconvex vs. convex optimizers).

Suppose that (X, Y) is a critical point of (1.5) satisfying rank(X) = rank(Y) = r, and that the sampling operator P_Ω is injective when restricted to elements of the tangent space T of XY^⊤, namely,
\[
\mathcal{P}_{\Omega}(H)=0\;\Longrightarrow\;H=0,\qquad\text{for all }H\in\mathcal{T}. \tag{2.9}
\]
Under Condition 1, the point Z ≜ XY^⊤ is the unique minimizer of (1.3).

In order to apply Lemma 2.2, one needs to locate a critical point of (1.5) that is sufficiently close to the truth, for which one natural candidate is the global optimizer of (1.5). The caveat, however, is the lack of theory characterizing directly the properties of the optimizer of (1.5). Instead, what is available in prior theory is the characterization of some iterative sequence (e.g. gradient descent iterates) aimed at solving (1.5). It is unclear from prior theory whether the iterative algorithm under study (e.g. gradient descent) converges to the global optimizer in the presence of noise. This leads to technical difficulty in justifying the proximity between the nonconvex optimizer and the convex solution via Lemma 2.2.

2.3. Approximate nonconvex optimizers.

Fortunately, perfect knowledge of the nonconvex optimizer is not pivotal. Instead, an approximate solution to the nonconvex problem (1.5) (or equivalently (2.5)) suffices to serve as a reasonably tight approximation of the convex solution. More precisely, we desire two factors (X, Y) that result in nearly zero (rather than exactly zero) gradients:

\[
\nabla_{X}f(X,Y)\approx 0\qquad\text{and}\qquad\nabla_{Y}f(X,Y)\approx 0,
\]

where f(·, ·) is the nonconvex objective function as defined in (2.5). This relaxes the condition discussed in Lemma 2.2 (which only applies to critical points of (1.5) as opposed to approximate critical points). As it turns out, such points can be found via gradient descent tailored to (1.5). The sufficiency of the near-zero gradient condition is made possible by slightly strengthening the injectivity assumption (2.9), which is stated below.

Condition 2 (Injectivity).

Let T be the tangent space of XY^⊤. There is a quantity c_inj > 0 such that
\[
\frac{1}{p}\bigl\|\mathcal{P}_{\Omega}(H)\bigr\|_{\mathrm{F}}^{2}\ge c_{\mathrm{inj}}\,\|H\|_{\mathrm{F}}^{2},\qquad\text{for all }H\in\mathcal{T}. \tag{2.10}
\]

The following lemma states quantitatively how an approximate nonconvex optimizer serves as an excellent proxy of the convex solution, which we establish in Appendix SM3.2.

Lemma 2.3 (Approximate nonconvex vs. convex optimizers).

Suppose that (X, Y) obeys
\[
\bigl\|\nabla f(X,Y)\bigr\|_{\mathrm{F}}\le \frac{c\,c_{\mathrm{inj}}\sqrt{p}}{\kappa}\cdot\frac{\lambda}{p}\sqrt{\sigma_{\min}} \tag{2.11}
\]
for some sufficiently small constant c > 0. Further assume that every singular value of X and of Y lies in [√(σ_min/2), √(2σ_max)]. Then under Conditions 1 and 2, any minimizer Z_cvx of (1.3) satisfies
\[
\bigl\|XY^{\top}-Z_{\mathrm{cvx}}\bigr\|_{\mathrm{F}}\lesssim\frac{\kappa}{c_{\mathrm{inj}}}\cdot\frac{1}{\sqrt{\sigma_{\min}}}\bigl\|\nabla f(X,Y)\bigr\|_{\mathrm{F}}. \tag{2.12}
\]
Remark 2.4.

In fact, this lemma continues to hold if Z_cvx is replaced by any Z obeying g(Z) ≤ g(XY^⊤), where g(·) is the objective function defined in (1.3) and X, Y are low-rank factors obeying the conditions of Lemma 2.3. This is important in providing statistical guarantees for iterative methods like SVT [8], FPC [72], SOFT-IMPUTE [73], FISTA [5], etc. To be more specific, suppose that (X, Y) results in an approximate optimizer of (1.3), namely, g(XY^⊤) = g(Z_cvx) + ε for some sufficiently small ε > 0. Then for any Z obeying g(Z) ≤ g(XY^⊤) = g(Z_cvx) + ε, one has
\[
\bigl\|XY^{\top}-Z\bigr\|_{\mathrm{F}}\lesssim\frac{\kappa}{c_{\mathrm{inj}}}\cdot\frac{1}{\sqrt{\sigma_{\min}}}\bigl\|\nabla f(X,Y)\bigr\|_{\mathrm{F}}. \tag{2.13}
\]

As a result, as long as the above-mentioned algorithms converge in terms of the objective value, they must return a solution obeying (2.13), which is exceedingly close to XY^⊤ if ‖∇f(X, Y)‖_F is small.

It is clear from Lemma 2.3 that, as the size of the gradient ∇f(X, Y) gets smaller, the nonconvex estimate XY^⊤ becomes an increasingly tight approximation of any convex optimizer of (1.3), which is consistent with Lemma 2.2. In contrast to Lemma 2.2, due to the lack of strong convexity, a nonconvex estimate with a near-zero gradient does not imply the uniqueness of the optimizer of the convex program (1.3); rather, it indicates that any minimizer of (1.3) lies within a sufficiently small neighborhood surrounding XY^⊤ (cf. (2.12)).

2.4. Construction of an approximate nonconvex optimizer.

So far, Lemmas 2.2–2.3 are both deterministic results based on Condition 1. As we will soon see, under Assumption 1, we can derive simpler conditions that — with high probability — guarantee Condition 1. We start with Condition 1(a).

Lemma 2.5.

Suppose n²p ≥ Cn log² n for some sufficiently large constant C > 0. Then with probability at least 1 − O(n^{−10}), one has ‖P_Ω(E)‖ ≲ σ√(np). As a result, Condition 1(a) holds (i.e. ‖P_Ω(E)‖ < λ/8) as long as λ = C_λσ√(np) for some sufficiently large constant C_λ > 0.

Proof.

This follows from [31, Lemma 11] with a slight and straightforward modification to accommodate the asymmetric noise here. For brevity, we omit the proof. □

Turning attention to Condition 1(b) and Condition 2, we have the following lemma, the proof of which is deferred to Appendix SM3.3.

Lemma 2.6.

Under the assumptions of Theorem 1.4, with probability exceeding 1 − O(n^{−10}), the two properties
\[
\bigl\|\mathcal{P}_{\Omega}\bigl(XY^{\top}-M^{\star}\bigr)-p\bigl(XY^{\top}-M^{\star}\bigr)\bigr\|<\lambda/8 \qquad\text{(Condition 1(b))}
\]
and
\[
\frac{1}{p}\bigl\|\mathcal{P}_{\Omega}(H)\bigr\|_{\mathrm{F}}^{2}\ge\frac{1}{32\kappa}\|H\|_{\mathrm{F}}^{2}\quad\text{for all }H\in\mathcal{T} \qquad\text{(Condition 2 with }c_{\mathrm{inj}}=(32\kappa)^{-1}\text{)}
\]
hold simultaneously for all (X, Y) obeying
\[
\max\bigl\{\|X-X^{\star}\|_{2,\infty},\,\|Y-Y^{\star}\|_{2,\infty}\bigr\}\le C\kappa\left(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n\log n}{p}}+\frac{\lambda}{p\,\sigma_{\min}}\right)\max\bigl\{\|X^{\star}\|_{2,\infty},\,\|Y^{\star}\|_{2,\infty}\bigr\}. \tag{2.14}
\]
Here, T denotes the tangent space of XY^⊤, and C > 0 is some absolute constant.

This lemma is a uniform result, namely, the bounds hold irrespective of the statistical dependency between (X, Y) and Ω. As a consequence, to demonstrate the proximity between the convex and the nonconvex solutions (cf. (2.12)), it remains to identify a point (X, Y) with vanishingly small gradient (cf. (2.11)) that is sufficiently close to the truth (cf. (2.14)).

As we already alluded to previously, a simple gradient descent algorithm aimed at solving the nonconvex problem (1.5) might help us produce an approximate nonconvex optimizer. This procedure is summarized in Algorithm 2.1. Our hope is this: when initialized at the ground truth and run for sufficiently many iterations, the GD trajectory produced by Algorithm 2.1 will contain at least one approximate stationary point of (1.5) with the desired properties (2.11) and (2.14). We shall note that Algorithm 2.1 is not practical, since it starts from the ground truth (X^⋆, Y^⋆); this is an auxiliary step mainly to simplify the theoretical analysis. While we can certainly make it practical by adopting spectral initialization as in [19, 71], this requires lengthier proofs without further improving our statistical guarantees.

Algorithm 2.1 (Gradient descent for (2.5), initialized at the ground truth). Set (X⁰, Y⁰) = (X^⋆, Y^⋆); for t = 0, 1, …, t₀ − 1, update X^{t+1} = X^t − η∇_X f(X^t, Y^t) and Y^{t+1} = Y^t − η∇_Y f(X^t, Y^t), where η > 0 is the step size.
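A hedged sketch of Algorithm 2.1, reusing the f_and_grad routine from the earlier sketch of (2.5); returning the iterate with the smallest gradient norm along the trajectory reflects how the algorithm is used in the analysis (cf. (2.18)), though the exact bookkeeping here is our own.

```python
import numpy as np

def gd_from_ground_truth(X_star, Y_star, M_obs, mask, lam, p, eta, t0):
    """Gradient descent on (2.5) initialized at the ground-truth factors.
    Returns the smallest gradient norm seen and the corresponding iterate."""
    X, Y = X_star.copy(), Y_star.copy()
    best = (np.inf, X.copy(), Y.copy())
    for _ in range(t0):
        _, gX, gY = f_and_grad(X, Y, M_obs, mask, lam, p)
        gnorm = np.sqrt(np.linalg.norm(gX)**2 + np.linalg.norm(gY)**2)
        if gnorm < best[0]:
            best = (gnorm, X.copy(), Y.copy())
        X, Y = X - eta * gX, Y - eta * gY
    return best
```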

2.5. Properties of the nonconvex iterates.

In this subsection, we will build upon the literature on nonconvex low-rank matrix completion to justify that the estimates returned by Algorithm 2.1 satisfy the requirement stated in (2.14). Our theory will be largely established upon the leave-one-out strategy introduced by Ma et al. [71], which is an effective analysis technique to control the ℓ_{2,∞} error of the estimates. This strategy has recently been extended by Chen et al. [19] to the more general rectangular case with an improved sample complexity bound.

Before continuing, we introduce several useful notations. Notice that the matrix product X^⋆Y^{⋆⊤} is invariant under a global orthonormal transformation, namely, for any orthonormal matrix R ∈ ℝ^{r×r} one has (X^⋆R)(Y^⋆R)^⊤ = X^⋆Y^{⋆⊤}. Viewed in this light, we shall consider distance metrics modulo global rotation. In particular, the theory relies heavily on a specific global rotation matrix defined as follows:

\[
H^{t}\triangleq\operatorname*{argmin}_{R\in\mathcal{O}^{r\times r}}\Bigl(\|X^{t}R-X^{\star}\|_{\mathrm{F}}^{2}+\|Y^{t}R-Y^{\star}\|_{\mathrm{F}}^{2}\Bigr)^{1/2}, \tag{2.16}
\]

where 𝒪^{r×r} is the set of r × r orthonormal matrices.
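The minimization in (2.16) is an orthogonal Procrustes problem and admits a closed-form solution via one SVD; a minimal sketch (function name is ours):

```python
import numpy as np

def align(Xt, Yt, X_star, Y_star):
    """Best global rotation H^t in (2.16): argmin over orthonormal R of
    ||Xt R - X*||_F^2 + ||Yt R - Y*||_F^2, solved via orthogonal Procrustes."""
    A = Xt.T @ X_star + Yt.T @ Y_star
    U, _, Vt = np.linalg.svd(A)
    return U @ Vt
```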

We are now ready to present the performance guarantees for Algorithm 2.1.

Lemma 2.7 (Quality of the nonconvex estimates).

Instate the notation and hypotheses of Theorem 1.4. With probability at least 1 − O(n^{−3}), the iterates {(X^t, Y^t)}_{0≤t≤t₀} of Algorithm 2.1 satisfy
\[
\max\bigl\{\|X^{t}H^{t}-X^{\star}\|_{\mathrm{F}},\,\|Y^{t}H^{t}-Y^{\star}\|_{\mathrm{F}}\bigr\}\le C_{\mathrm{F}}\left(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}+\frac{\lambda}{p\,\sigma_{\min}}\right)\|X^{\star}\|_{\mathrm{F}}, \tag{2.17a}
\]
\[
\max\bigl\{\|X^{t}H^{t}-X^{\star}\|,\,\|Y^{t}H^{t}-Y^{\star}\|\bigr\}\le C_{\mathrm{op}}\left(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}+\frac{\lambda}{p\,\sigma_{\min}}\right)\|X^{\star}\|, \tag{2.17b}
\]
\[
\max\bigl\{\|X^{t}H^{t}-X^{\star}\|_{2,\infty},\,\|Y^{t}H^{t}-Y^{\star}\|_{2,\infty}\bigr\}\le C_{\infty}\,\kappa\left(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n\log n}{p}}+\frac{\lambda}{p\,\sigma_{\min}}\right)\max\bigl\{\|X^{\star}\|_{2,\infty},\,\|Y^{\star}\|_{2,\infty}\bigr\}, \tag{2.17c}
\]
\[
\min_{0\le t<t_{0}}\bigl\|\nabla f\bigl(X^{t},Y^{t}\bigr)\bigr\|_{\mathrm{F}}\le\frac{1}{n^{5}}\cdot\frac{\lambda}{p}\sqrt{\sigma_{\min}}, \tag{2.18}
\]

where C_F, C_op, C_∞ > 0 are some absolute constants, provided that η ≍ 1/(nκ³σ_max) and that t₀ = n¹⁸.

This lemma, which we establish in Appendix SM4, reveals that for a polynomially large number of iterations, all iterates of the gradient descent sequence — when initialized at the ground truth — remain fairly close to the true low-rank factors. This holds in terms of the estimation errors measured by the Frobenius norm, the spectral norm, and the ℓ_{2,∞} norm. In particular, the proximity in terms of the ℓ_{2,∞} norm error plays a pivotal role in implementing our analysis strategy (particularly Lemmas 2.3–2.6) described previously. In addition, this lemma (cf. (2.18)) guarantees the existence of a small-gradient point within the sequence {(X^t, Y^t)}_{0≤t≤t₀}, a somewhat straightforward property of GD tailored to smooth problems [76]. This in turn enables us to invoke Lemma 2.3.

As immediate consequences of Lemma 2.7, with high probability we have

\[
\|X^{t}Y^{t\top}-M^{\star}\|_{\mathrm{F}}\le 3\kappa C_{\mathrm{F}}\left(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}+\frac{\lambda}{p\,\sigma_{\min}}\right)\|M^{\star}\|_{\mathrm{F}}, \tag{2.19a}
\]
\[
\|X^{t}Y^{t\top}-M^{\star}\|_{\infty}\le 3C_{\infty}\,\kappa^{3}\sqrt{\mu r}\left(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n\log n}{p}}+\frac{\lambda}{p\,\sigma_{\min}}\right)\|M^{\star}\|_{\infty}, \tag{2.19b}
\]
\[
\|X^{t}Y^{t\top}-M^{\star}\|\le 3C_{\mathrm{op}}\left(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}+\frac{\lambda}{p\,\sigma_{\min}}\right)\|M^{\star}\| \tag{2.19c}
\]

for all 0 ≤ t ≤ t₀. The proof is deferred to Appendix SM4.12.

2.6. Proof of Theorem 1.4.

Let t^⋆ ≜ argmin_{0≤t<t₀} ‖∇f(X^t, Y^t)‖_F, and take (X_ncvx, Y_ncvx) = (X^{t^⋆}H^{t^⋆}, Y^{t^⋆}H^{t^⋆}) (cf. (2.16)). It is straightforward to verify that (X_ncvx, Y_ncvx) obeys (i) the small-gradient condition (2.11), and (ii) the proximity condition (2.14). We are now positioned to invoke Lemma 2.3: for any optimizer Z_cvx of (1.3), one has

\[
\bigl\|Z_{\mathrm{cvx}}-X_{\mathrm{ncvx}}Y_{\mathrm{ncvx}}^{\top}\bigr\|_{\mathrm{F}}
\lesssim\frac{\kappa}{c_{\mathrm{inj}}}\cdot\frac{1}{\sqrt{\sigma_{\min}}}\bigl\|\nabla f\bigl(X_{\mathrm{ncvx}},Y_{\mathrm{ncvx}}\bigr)\bigr\|_{\mathrm{F}}
\lesssim\frac{\kappa^{2}}{n^{5}}\cdot\frac{\lambda}{p}
=\frac{\kappa}{n^{5}}\cdot\frac{\lambda}{p\,\sigma_{\min}}\cdot(\kappa\sigma_{\min})
=\frac{\kappa}{n^{5}}\cdot\frac{\lambda}{p\,\sigma_{\min}}\,\|M^{\star}\|
\le\frac{1}{n^{4}}\cdot\frac{\lambda}{p\,\sigma_{\min}}\,\|M^{\star}\|. \tag{2.20}
\]

The last inequality arises since n ≥ κ — a consequence of the sample complexity condition np ≳ κ⁴μ²r² log³ n (and hence n ≥ np ≳ κ⁴μ²r² log³ n ≥ κ⁴). This, taken collectively with the property (2.19), implies that

\[
\|Z_{\mathrm{cvx}}-M^{\star}\|_{\mathrm{F}}
\le\bigl\|Z_{\mathrm{cvx}}-X_{\mathrm{ncvx}}Y_{\mathrm{ncvx}}^{\top}\bigr\|_{\mathrm{F}}+\bigl\|X_{\mathrm{ncvx}}Y_{\mathrm{ncvx}}^{\top}-M^{\star}\bigr\|_{\mathrm{F}}
\lesssim\frac{1}{n^{4}}\cdot\frac{\lambda}{p\,\sigma_{\min}}\|M^{\star}\|+\kappa\left(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}+\frac{\lambda}{p\,\sigma_{\min}}\right)\|M^{\star}\|_{\mathrm{F}}
\lesssim\kappa\left(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}+\frac{\lambda}{p\,\sigma_{\min}}\right)\|M^{\star}\|_{\mathrm{F}}.
\]

In other words, since X_ncvx Y_ncvx^⊤ and Z_cvx are exceedingly close, the error Z_cvx − M^⋆ is mainly attributed to X_ncvx Y_ncvx^⊤ − M^⋆. Similar arguments lead to

\[
\|Z_{\mathrm{cvx}}-M^{\star}\|\lesssim\left(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n}{p}}+\frac{\lambda}{p\,\sigma_{\min}}\right)\|M^{\star}\|,
\]
\[
\|Z_{\mathrm{cvx}}-M^{\star}\|_{\infty}\lesssim\kappa^{3}\sqrt{\mu r}\left(\frac{\sigma}{\sigma_{\min}}\sqrt{\frac{n\log n}{p}}+\frac{\lambda}{p\,\sigma_{\min}}\right)\|M^{\star}\|_{\infty}.
\]

We are left with proving the properties of Zcvx,r. Since Zcvx,r is defined to be the best rank-r approximation of Zcvx, one can invoke (2.20) to derive

\[
\|Z_{\mathrm{cvx}}-Z_{\mathrm{cvx},r}\|_{\mathrm{F}}\le\bigl\|Z_{\mathrm{cvx}}-X_{\mathrm{ncvx}}Y_{\mathrm{ncvx}}^{\top}\bigr\|_{\mathrm{F}}\lesssim\frac{1}{n^{4}}\cdot\frac{\lambda}{p\,\sigma_{\min}}\|M^{\star}\|,
\]

from which (1.12) follows. Repeating the above calculations implies that (1.11) holds if Zcvx is replaced by Zcvx,r, thus concluding the proof.

3. Prior art.

Nuclear norm minimization, pioneered by the seminal works [14, 15, 45, 81], has been a popular and principled approach to low-rank matrix recovery. In the noiseless setting, i.e. E = 0, it amounts to solving the following constrained convex program

\[
\operatorname*{minimize}_{Z\in\mathbb{R}^{n\times n}}\quad\|Z\|_{*}\qquad\text{subject to}\qquad\mathcal{P}_{\Omega}(Z)=\mathcal{P}_{\Omega}(M), \tag{3.1}
\]

which enjoys great theoretical success. Informally, this approach enables exact recovery of a rank-r matrix M^⋆ ∈ ℝ^{n×n} as soon as the sample size is about the order of nr — the intrinsic number of degrees of freedom of a rank-r matrix [20, 51, 80]. In particular, Gross [51] blazed a trail by developing an ingenious golfing scheme for dual construction — an analysis technique that has found applications far beyond matrix completion. When it comes to the noisy case, Candès and Plan [13] first studied the stability of convex programming when the noise is bounded and possibly adversarial, followed by [74] and [64] using two modified convex programs. As we have already discussed, none of these papers provides optimal statistical guarantees under our model when r = O(1). Other related papers such as [10, 62] include similar estimation error bounds and suffer from similar sub-optimality issues.

Turning to nonconvex optimization, we note that this approach has recently received much attention for various low-rank matrix factorization problems, owing to its superior computational advantage compared to convex programming (e.g. [12, 22, 56, 60, 92, 98]). Convergence guarantees for matrix completion have been established for various algorithms such as gradient descent on manifolds [60, 61], alternating minimization [53, 56], gradient descent [19, 71, 88, 95], and projected gradient descent [31], provided that a suitable initialization (like spectral initialization) is available [23, 56, 60, 71, 88]. Our work is mostly related to [19, 71], which studied (vanilla) gradient descent for nonconvex matrix completion. This algorithm was first analyzed by [71] via a leave-one-out argument — a technique that proves useful in analyzing various statistical algorithms [1, 26, 27, 35, 38, 67, 89, 102]. In the absence of noise and omitting logarithmic factors, [71] showed that O(nr³) samples are sufficient for vanilla GD to yield ε accuracy in O(log(1/ε)) iterations (without the need of extra regularization procedures); the sample complexity was further improved to O(nr²) by [19]. Apart from gradient descent, other nonconvex methods (e.g. [16, 35, 48, 52, 53, 55–57, 65, 82, 83, 93, 96, 97, 100]) and landscape/geometry properties have been investigated [18, 49, 50, 78, 84]; these are, however, beyond the scope of the current paper.

Another line of works asserted that a large family of SDPs admits low-rank solutions [4], which in turn motivates the Burer–Monteiro approach [6, 7]. When applied to matrix completion, however, the generic theoretical guarantees therein lead to conservative results. Take the noiseless case (3.1) for instance: these results reveal the existence of a solution of rank at most O(√(n²p)), which however is often much larger than the true rank (e.g. when r ≍ 1 and p ≍ poly log(n)/n, one has √(n²p) ≫ r). Moreover, this line of works does not imply that all solutions to the SDP of interest are (approximately) low-rank.

Finally, the connection between convex and nonconvex optimization has also been explored in line spectral estimation [66], although the context therein is drastically different from ours.

4. Discussion.

This paper provides an improved statistical analysis for the natural convex program (1.3), without the need of enforcing additional spikiness constraint. Our theoretical analysis uncovers an intriguing connection between convex relaxation and nonconvex optimization, which we believe is applicable to many other problems beyond matrix completion. Having said that, our current theory leaves open a variety of important directions for future exploration. Here we sample a few interesting ones.

  • Improving dependency on r and κ. While our theory is optimal when r and κ are both constants, it becomes increasingly looser as either r or κ grows. For instance, in the noiseless setting, it has been shown that the sample complexity for convex relaxation scales as O(nr) — linear in r and independent of κ — which is better than the current results. It is worth noting that existing theory for nonconvex matrix factorization typically falls short of providing optimal scaling in r and κ [19, 31, 60, 71, 88]. Thus, tightening the dependency of sample complexity on r and κ might call for new analysis tools.

  • Approximate low-rank structure. So far our theory is built upon the assumption that the ground-truth matrix M is exactly low-rank, which falls short of accommodating the more realistic scenario where M is only approximately low-rank. For the approximate low-rank case, it is not yet clear whether the nonconvex factorization approach can still serve as a tight proxy. In addition, the landscape of nonconvex optimization for the approximately low-rank case [18] might shed light on how to handle this case.

  • Extension to deterministic noise. Our current theory — in particular, the leave-one-out analysis for the nonconvex approach — relies heavily on the randomness assumption (i.e. i.i.d. sub-Gaussian) of the noise. In order to justify the broad applicability of convex relaxation, it would be interesting to see whether one can generalize the theory to cover deterministic noise with bounded magnitudes.

  • Extension to structured matrix completion. Many applications involve low-rank matrices that exhibit additional structures, enabling a further reduction of the sample complexity [9, 24, 47]. For instance, if a matrix is Hankel and low-rank, then the sample complexity can be O(n) times smaller than the generic low-rank case. The existing stability guarantee of Hankel matrix completion, however, is overly pessimistic compared to practical performance [24]. The analysis framework herein might be amenable to the study of Hankel matrix completion and help close the theory-practice gap.

  • Extension to robust PCA and blind deconvolution. Moving beyond matrix completion, there are other problems that are concerned with recovering low-rank matrices. Notable examples include robust principal component analysis [11, 17, 30], blind deconvolution [2, 68] and blind demixing [59, 69]. The stability analyses of the convex relaxation approaches for these problems [2, 69, 103] often adopt a similar approach as [13], and consequently are sub-optimal. The insights from the present paper might promise tighter statistical guarantees for such problems.

Finally, we remark that the intimate link between convex and nonconvex optimization enables statistically optimal inference and uncertainty quantification for noisy matrix completion (e.g. construction of optimal confidence intervals for each missing entry). The interested readers are referred to our companion paper [28] for in-depth discussions.

Supplementary Material

supplement

Acknowledgements.

This work was done in part while Y. Chen was visiting the Kavli Institute for Theoretical Physics (supported in part by NSF grant PHY-1748958). Y. Chen thanks Emmanuel Candès for motivating discussions about noisy matrix completion.

Funding: Y. Chen is supported in part by the AFOSR YIP award FA9550-19-1-0030, by the ONR grant N00014-19-1-2120, by the ARO YIP award W911NF-20-1-0097, by the ARO grant W911NF-18-1-0303, by the NSF grants CCF-1907661, IIS-1900140 and DMS-2014279. Y. Chi is supported in part by ONR under the grants N00014-18-1-2142 and N00014-19-1-2404, by ARO under the grant W911NF-18-1-0303, and by NSF under the grants CAREER ECCS-1818571 and CCF-1806154. J. Fan is supported in part by NSF Grants DMS-1662139 and DMS-1712591, ONR grant N00014-19-1-2120, and NIH Grant R01-GM072611-12.

Footnotes

1

It is straightforward to rephrase our discussions for a general rectangular matrix of size n₁ × n₂. The current paper sets n = n₁ = n₂ throughout for simplicity of presentation.

2

Technically, [13] deals with the constrained version of (1.3), which is equivalent to the Lagrangian form as in (1.3) with a proper choice of the regularization parameter.

3

The seminal work [61] by Keshavan, Montanari and Oh stated that "In view of the identity (1.6) it might be possible to use the results in this paper to prove stronger guarantees on the nuclear norm minimization approach. Unfortunately this implication is not immediate … Trying to establish such an implication, and clarifying the relation between the two approaches is nevertheless a promising research direction."

4

Here and throughout, f(n) ≲ g(n) or f(n) = O(g(n)) means |f(n)|/|g(n)| ≤ C for some constant C > 0 when n is sufficiently large; f(n) ≳ g(n) means |f(n)|/|g(n)| ≥ C for some constant C > 0 when n is sufficiently large; and f(n) ≍ g(n) if and only if f(n) ≲ g(n) and f(n) ≳ g(n). In addition, ‖·‖_∞ denotes the entrywise ℓ_∞ norm, whereas ‖·‖ is the spectral norm.

5

The factor 1/n³ in (1.8) can be replaced by 1/n^c for an arbitrarily large fixed constant c > 0 (e.g. c = 100).

6

Here and below, we use Z (rather than Zcvx) for notational simplicity, whenever it is clear from the context.

7

Once again, we abuse the notation (X, Y) (instead of using (Xncvx, Yncvx)) for notational simplicity, whenever it is clear from the context.

REFERENCES

  • [1]. Abbe E, Fan J, Wang K, and Zhong Y, Entrywise eigenvector analysis of random matrices with low expected rank, arXiv preprint arXiv:1709.09565, (2017).
  • [2]. Ahmed A, Recht B, and Romberg J, Blind deconvolution using convex programming, IEEE Transactions on Information Theory, 60 (2014), pp. 1711–1732.
  • [3]. Bai J and Ng S, Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions, Econometrica, 74 (2006), pp. 1133–1150.
  • [4]. Barvinok AI, Problems of distance geometry and convex properties of quadratic maps, Discrete & Computational Geometry, 13 (1995), pp. 189–202.
  • [5]. Beck A and Teboulle M, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202.
  • [6]. Boumal N, Voroninski V, and Bandeira A, The non-convex Burer-Monteiro approach works on smooth semidefinite programs, in NIPS, 2016, pp. 2757–2765.
  • [7]. Burer S and Monteiro RD, A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization, Mathematical Programming, 95 (2003), pp. 329–357.
  • [8]. Cai JF, Candès EJ, and Shen Z, A singular value thresholding algorithm for matrix completion, SIAM Journal on Optimization, 20 (2010), pp. 1956–1982.
  • [9]. Cai J-F, Wang T, and Wei K, Fast and provable algorithms for spectrally sparse signal reconstruction via low-rank Hankel matrix completion, Applied and Computational Harmonic Analysis, 46 (2019), pp. 94–121.
  • [10]. Cai TT and Zhou W-X, Matrix completion via max-norm constrained optimization, Electronic Journal of Statistics, 10 (2016), pp. 1493–1525.
  • [11]. Candès E, Li X, Ma Y, and Wright J, Robust principal component analysis?, Journal of ACM, 58 (2011), pp. 11:1–11:37.
  • [12]. Candès E, Li X, and Soltanolkotabi M, Phase retrieval via Wirtinger flow: Theory and algorithms, IEEE Transactions on Information Theory, 61 (2015), pp. 1985–2007.
  • [13]. Candès E and Plan Y, Matrix completion with noise, Proceedings of the IEEE, 98 (2010), pp. 925–936.
  • [14]. Candès E and Recht B, Exact matrix completion via convex optimization, Foundations of Computational Mathematics, 9 (2009), pp. 717–772.
  • [15]. Candès E and Tao T, The power of convex relaxation: Near-optimal matrix completion, IEEE Transactions on Information Theory, 56 (2010), pp. 2053–2080.
  • [16]. Cao Y and Xie Y, Poisson matrix recovery and completion, IEEE Transactions on Signal Processing, 64 (2016), pp. 1609–1620.
  • [17]. Chandrasekaran V, Sanghavi S, Parrilo PA, and Willsky AS, Rank-sparsity incoherence for matrix decomposition, SIAM Journal on Optimization, 21 (2011), pp. 572–596.
  • [18]. Chen J and Li X, Memory-efficient kernel PCA via partial matrix sampling and nonconvex optimization: a model-free analysis of local minima, arXiv:1711.01742, (2017).
  • [19]. Chen J, Liu D, and Li X, Nonconvex rectangular matrix completion via gradient descent without ℓ2,∞ regularization, arXiv:1901.06116v1, (2019).
  • [20]. Chen Y, Incoherence-optimal matrix completion, IEEE Transactions on Information Theory, 61 (2015), pp. 2909–2923.
  • [21]. Chen Y and Candès E, The projected power method: An efficient algorithm for joint alignment from pairwise differences, Comm. Pure and Appl. Math, 71 (2018), pp. 1648–1714.
  • [22]. Chen Y and Candès EJ, Solving random quadratic systems of equations is nearly as easy as solving linear systems, Comm. Pure Appl. Math, 70 (2017), pp. 822–883, 10.1002/cpa.21638.
  • [23]. Chen Y, Cheng C, and Fan J, Asymmetry helps: Eigenvalue and eigenvector analyses of asymmetrically perturbed low-rank matrices, arXiv preprint arXiv:1811.12804, (2018).
  • [24]. Chen Y and Chi Y, Robust spectral compressed sensing via structured matrix completion, IEEE Transactions on Information Theory, 60 (2014), pp. 6576–6601.
  • [25]. Chen Y and Chi Y, Harnessing structures in big data via guaranteed low-rank matrix estimation: Recent theory and fast algorithms via convex and nonconvex optimization, IEEE Signal Processing Magazine, 35 (2018), pp. 14–31, 10.1109/MSP.2018.2821706.
  • [26]. Chen Y, Chi Y, Fan J, and Ma C, Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval, Mathematical Programming, 176 (2019), pp. 5–37.
  • [27]. Chen Y, Fan J, Ma C, and Wang K, Spectral method and regularized MLE are both optimal for top-K ranking, Annals of Statistics, 47 (2019), pp. 2204–2235.
  • [28]. Chen Y, Fan J, Ma C, and Yan Y, Inference and uncertainty quantification for noisy matrix completion, arXiv preprint arXiv:1906.04159, (2019).
  • [29]. Chen Y, Guibas LJ, and Huang Q, Near-optimal joint object matching via convex relaxation, International Conference on Machine Learning (ICML), (2014), pp. 100–108.
  • [30]. Chen Y, Jalali A, Sanghavi S, and Caramanis C, Low-rank matrix recovery from errors and erasures, IEEE Transactions on Information Theory, 59 (2013), pp. 4324–4337.
  • [31]. Chen Y and Wainwright MJ, Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees, arXiv preprint arXiv:1509.03025, (2015).
  • [32]. Cheng Y and Ge R, Non-convex matrix completion against a semi-random adversary, arXiv preprint arXiv:1803.10846, (2018).
  • [33]. Chi Y, Lu YM, and Chen Y, Nonconvex optimization meets low-rank matrix factorization: An overview, IEEE Transactions on Signal Processing, 67 (2019), pp. 5239–5269.
  • [34]. Davenport MA and Romberg J, An overview of low-rank matrix recovery from incomplete observations, IEEE Journal of Selected Topics in Signal Processing, 10 (2016), pp. 608–622.
  • [35]. Ding L and Chen Y, The leave-one-out approach for matrix completion: Primal and dual analysis, arXiv preprint arXiv:1803.07554, (2018).
  • [36]. Efron B, Correlation and large-scale simultaneous significance testing, Journal of the American Statistical Association, 102 (2007), pp. 93–103.
  • [37]. Efron B, Correlated z-values and the accuracy of large-scale statistical estimates, Journal of the American Statistical Association, 105 (2010), pp. 1042–1055.
  • [38]. El Karoui N, On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators, Probability Theory and Related Fields, (2015), pp. 1–81.
  • [39]. Fan J, Han X, and Gu W, Estimating false discovery proportion under arbitrary covariance dependence, Journal of the American Statistical Association, 107 (2012), pp. 1019–1035.
  • [40]. Fan J, Ke Y, Sun Q, and Zhou W-X, FarmTest: Factor-adjusted robust multiple testing with approximate false discovery control, Journal of the American Statistical Association, (2019+).
  • [41]. Fan J, Ke Y, and Wang K, Factor-adjusted regularized model selection, arXiv preprint arXiv:1612.08490, (2018).
  • [42]. Fan J, Liao Y, and Mincheva M, Large covariance estimation by thresholding principal orthogonal complements, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75 (2013), pp. 603–680.
  • [43]. Fan J, Wang W, and Zhong Y, Robust covariance estimation for approximate factor models, Journal of Econometrics, 208 (2019), pp. 5–22.
  • [44]. Fan J, Xue L, and Yao J, Sufficient forecasting using factor models, Journal of Econometrics, 201 (2017), pp. 292–306.
  • [45]. Fazel M, Matrix rank minimization with applications, PhD thesis, 2002.
  • [46]. Fazel M, Hindi H, and Boyd S, Rank minimization and applications in system theory, in American Control Conference, vol. 4, 2004, pp. 3273–3278.
  • [47]. Fazel M, Hindi H, and Boyd SP, Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices, American Control Conference, (2003).
  • [48]. Fornasier M, Rauhut H, and Ward R, Low-rank matrix recovery via iteratively reweighted least squares minimization, SIAM Journal on Optimization, 21 (2011), pp. 1614–1640.
  • [49]. Ge R, Jin C, and Zheng Y, No spurious local minima in nonconvex low rank problems: A unified geometric analysis, arXiv preprint arXiv:1704.00708, (2017).
  • [50]. Ge R, Lee JD, and Ma T, Matrix completion has no spurious local minimum, in Advances in Neural Information Processing Systems, 2016, pp. 2973–2981.
  • [51]. Gross D, Recovering low-rank matrices from few coefficients in any basis, IEEE Transactions on Information Theory, 57 (2011), pp. 1548–1566.
  • [52]. Gunasekar S, Acharya A, Gaur N, and Ghosh J, Noisy matrix completion using alternating minimization, in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2013, pp. 194–209.
  • [53]. Hardt M, Understanding alternating minimization for matrix completion, in Foundations of Computer Science (FOCS), 2014, pp. 651–660.
  • [54]. Huang Q-X and Guibas L, Consistent shape maps via semidefinite programming, in Computer Graphics Forum, vol. 32, Wiley Online Library, 2013, pp. 177–186.
  • [55]. Jain P, Meka R, and Dhillon IS, Guaranteed rank minimization via singular value projection, in Advances in Neural Information Processing Systems, 2010, pp. 937–945.
  • [56]. Jain P, Netrapalli P, and Sanghavi S, Low-rank matrix completion using alternating minimization, in ACM Symposium on Theory of Computing, 2013, pp. 665–674.
  • [57]. Jin C, Kakade SM, and Netrapalli P, Provable efficient online matrix completion via non-convex stochastic gradient descent, in NIPS, 2016, pp. 4520–4528.
  • [58]. Jolliffe IT, A note on the use of principal components in regression, Journal of the Royal Statistical Society: Series C (Applied Statistics), 31 (1982), pp. 300–303.
  • [59]. Jung P, Krahmer F, and Stöger D, Blind demixing and deconvolution at near-optimal rate, IEEE Transactions on Information Theory, 64 (2017), pp. 704–727.
  • [60]. Keshavan RH, Montanari A, and Oh S, Matrix completion from a few entries, IEEE Transactions on Information Theory, 56 (2010), pp. 2980–2998.
  • [61]. Keshavan RH, Montanari A, and Oh S, Matrix completion from noisy entries, J. Mach. Learn. Res, 11 (2010), pp. 2057–2078.
  • [62]. Klopp O, Noisy low-rank matrix completion with general sampling distribution, Bernoulli, 20 (2014), pp. 282–303.
  • [63]. Kneip A and Sarda P, Factor models and variable selection in high-dimensional regression analysis, The Annals of Statistics, 39 (2011), pp. 2410–2447.
  • [64]. Koltchinskii V, Lounici K, and Tsybakov AB, Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion, Ann. Statist, 39 (2011), pp. 2302–2329, 10.1214/11-AOS894.
  • [65]. Lai M-J, Xu Y, and Yin W, Improved iteratively reweighted least squares for unconstrained smoothed ℓq minimization, SIAM Journal on Numerical Analysis, 51 (2013), pp. 927–957.
  • [66]. Li Q and Tang G, Approximate support recovery of atomic line spectral estimation: A tale of resolution and precision, Applied and Computational Harmonic Analysis, (2018).
  • [67]. Li Y, Ma C, Chen Y, and Chi Y, Nonconvex matrix factorization from rank-one measurements, arXiv:1802.06286, accepted to AISTATS, (2018).
  • [68]. Ling S and Strohmer T, Self-calibration and biconvex compressive sensing, Inverse Problems, 31 (2015), p. 115002.
  • [69]. Ling S and Strohmer T, Blind deconvolution meets blind demixing: Algorithms and performance bounds, IEEE Transactions on Information Theory, 63 (2017), pp. 4497–4520.
  • [70]. Liu Z and Vandenberghe L, Interior-point method for nuclear norm approximation with application to system identification, SIAM Journal on Matrix Analysis and Applications, 31 (2009), pp. 1235–1256.
  • [71]. Ma C, Wang K, Chi Y, and Chen Y, Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution, arXiv preprint arXiv:1711.10467, accepted to Foundations of Computational Mathematics, (2017).
  • [72]. Ma S, Goldfarb D, and Chen L, Fixed point and Bregman iterative methods for matrix rank minimization, Mathematical Programming, 128 (2011), pp. 321–353.
  • [73]. Mazumder R, Hastie T, and Tibshirani R, Spectral regularization algorithms for learning large incomplete matrices, Journal of Machine Learning Research, 11 (2010), pp. 2287–2322.
  • [74]. Negahban S and Wainwright M, Restricted strong convexity and weighted matrix completion: Optimal bounds with noise, Journal of Machine Learning Research, (2012), pp. 1665–1697.
  • [75]. Negahban S and Wainwright MJ, Restricted strong convexity and weighted matrix completion: optimal bounds with noise, J. Mach. Learn. Res, 13 (2012), pp. 1665–1697.
  • [76]. Nesterov Y, How to make the gradients small, Optima, 88 (2012), pp. 10–11.
  • [77]. Parikh N and Boyd S, Proximal algorithms, Foundations and Trends® in Optimization, 1 (2014), pp. 127–239.
  • [78]. Park D, Kyrillidis A, Caramanis C, and Sanghavi S, Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach, in Artificial Intelligence and Statistics, 2017, pp. 65–74.
  • [79]. Paul D, Bair E, Hastie T, and Tibshirani R, “Preconditioning” for feature selection and regression in high-dimensional problems, The Annals of Statistics, 36 (2008), pp. 1595–1618.
  • [80]. Recht B, A simpler approach to matrix completion, Journal of Machine Learning Research, 12 (2011), pp. 3413–3430.
  • [81]. Recht B, Fazel M, and Parrilo PA, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review, 52 (2010), pp. 471–501.
  • [82]. Rennie JD and Srebro N, Fast maximum margin matrix factorization for collaborative prediction, in International Conference on Machine Learning, ACM, 2005, pp. 713–719.
  • [83]. Rohde A, Tsybakov AB, et al., Estimation of high-dimensional low-rank matrices, The Annals of Statistics, 39 (2011), pp. 887–930.
  • [84]. Shapiro A, Xie Y, and Zhang R, Matrix completion with deterministic pattern: A geometric perspective, IEEE Transactions on Signal Processing, 67 (2019), pp. 1088–1103.
  • [85]. Singer A, Angular synchronization by eigenvectors and semidefinite programming, Applied and Computational Harmonic Analysis, 30 (2011), pp. 20–36.
  • [86]. So AM-C and Ye Y, Theory of semidefinite programming for sensor network localization, Mathematical Programming, 109 (2007), pp. 367–384.
  • [87]. Srebro N and Shraibman A, Rank, trace-norm and max-norm, in International Conference on Computational Learning Theory, Springer, 2005, pp. 545–560.
  • [88]. Sun R and Luo Z-Q, Guaranteed matrix completion via non-convex factorization, IEEE Transactions on Information Theory, 62 (2016), pp. 6535–6579.
  • [89]. Sur P, Chen Y, and Candès EJ, The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square, arXiv:1706.01191, accepted to Probability Theory and Related Fields, (2017).
  • [90]. Toh K-C and Yun S, An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems, Pacific Journal of Optimization, 6 (2010), p. 15.
  • [91]. Tomasi C and Kanade T, Shape and motion from image streams under orthography: a factorization method, International Journal of Computer Vision, 9 (1992), pp. 137–154.
  • [92]. Tu S, Boczar R, Simchowitz M, Soltanolkotabi M, and Recht B, Low-rank solutions of linear matrix equations via Procrustes flow, in International Conference on Machine Learning, 2016, pp. 964–973.
  • [93]. Vandereycken B, Low-rank matrix completion by Riemannian optimization, SIAM Journal on Optimization, 23 (2013), pp. 1214–1236.
  • [94]. Vershynin R, Introduction to the non-asymptotic analysis of random matrices, Compressed Sensing, Theory and Applications, (2012), pp. 210–268.
  • [95]. Wang L, Zhang X, and Gu Q, A unified computational and statistical framework for nonconvex low-rank matrix estimation, arXiv preprint arXiv:1610.05275, (2016).
  • [96]. Wei K, Cai J-F, Chan T, and Leung S, Guarantees of Riemannian optimization for low rank matrix recovery, SIAM Journal on Matrix Analysis and Applications, 37 (2016), pp. 1198–1222.
  • [97]. Wen Z, Yin W, and Zhang Y, Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm, Mathematical Programming Computation, 4 (2012), pp. 333–361.
  • [98]. Zhang H, Zhou Y, Liang Y, and Chi Y, A nonconvex approach for phase retrieval: Reshaped Wirtinger flow and incremental algorithms, The Journal of Machine Learning Research, 18 (2017), pp. 5164–5198.
  • [99]. Zhang T, Pauly JM, and Levesque IR, Accelerating parameter mapping with a locally low rank constraint, Magnetic Resonance in Medicine, 73 (2015), pp. 655–661.
  • [100]. Zhao T, Wang Z, and Liu H, A nonconvex optimization framework for low rank matrix estimation, in NIPS, 2015, pp. 559–567.
  • [101]. Zheng Q and Lafferty J, Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent, arXiv:1605.07051, (2016).
  • [102]. Zhong Y and Boumal N, Near-optimal bound for phase synchronization, SIAM Journal on Optimization, (2018).
  • [103]. Zhou Z, Li X, Wright J, Candès E, and Ma Y, Stable principal component pursuit, in International Symposium on Information Theory, 2010, pp. 1518–1522.
