A Sparse Interactive Model for Matrix Completion with Side Information

Jin Lu; Guannan Liang; Jiangwen Sun; Jinbo Bi

. Author manuscript; available in PMC: 2017 Oct 25.

Published in final edited form as: Adv Neural Inf Process Syst. 2016;29:4071–4079.

A Sparse Interactive Model for Matrix Completion with Side Information

Jin Lu ¹, Guannan Liang ¹, Jiangwen Sun ¹, Jinbo Bi ¹

PMCID: PMC5656014 NIHMSID: NIHMS909907 PMID: 29081639

Abstract

Matrix completion methods can benefit from side information besides the partially observed matrix. The use of side features that describe the row and column entities of a matrix has been shown to reduce the sample complexity for completing the matrix. We propose a novel sparse formulation that explicitly models the interaction between the row and column side features to approximate the matrix entries. Unlike early methods, this model does not require the low rank condition on the model parameter matrix. We prove that when the side features span the latent feature space of the matrix to be recovered, the number of observed entries needed for an exact recovery is O(log N) where N is the size of the matrix. If the side features are corrupted latent features of the matrix with a small perturbation, our method can achieve an ε-recovery with O(log N) sample complexity. If side information is useless, our method maintains a O(N^3/2) sampling rate similar to classic methods. An efficient linearized Lagrangian algorithm is developed with a convergence guarantee. Empirical results show that our approach outperforms three state-of-the-art methods both in simulations and on real world datasets.

1 Introduction

Matrix completion has been a basis of many machine learning approaches for computer vision [6], recommender systems [21, 24], signal processing [19, 27], and among many others. Classically, low-rank matrix completion methods are based on matrix decomposition techniques which require only the partially observed data in the matrix [15, 3, 14] by solving the following problem

\min_{E} {‖ E ‖}_{*}, subject to R_{Ω} (E) = R_{Ω} (F),

(1)

where F ∈ ℝ^m×n is the partially observed low-rank matrix (with a rank of r) that needs to be recovered, Ω ⊆ {1, ⋯, m} × {1, ⋯, n} be the set of indexes where the corresponding components in F are observed, the mapping R_Ω(M): ℝ^m×ⁿ → ℝ^m×n (i, j)-th entry is M_i,j if (i, j) ∈ Ω (or 0 otherwise), and ||E||_* computes the nuclear norm of E. Early theoretical analysis [4, 5, 20] proves that O(Nr log² N) entries are sufficient for an exact recovery if the observed entries are uniformly sampled at random where N = max{n, m}.

Recent studies start to explore side information for matrix completion and factorization [1, 18, 7, 17, 8]. For example, to infer the missing ratings in a user-movie rating matrix, descriptors of the users and movies are often known and may help to build a content-based recommender system. For instance, kids tend to like cartoons, so the age of a user likely interacts with the cartoon feature of a movie. When few ratings are known, this side information could be the main source for completing the matrix. Although based on empirical studies, several works found that side features are helpful [17, 18], those methods are based on non-convex matrix factorization formulations without any theoretical guarantees. Three recent methods have focussed on convex nuclear-norm regularized objectives, which leads to theoretical guarantees on matrix recovery [13, 28, 9, 16]. These methods all construct an inductive model X^T GY so that R_Ω(X^T GY) = R_Ω(F) where the side matrices X and Y consist of side features, respectively, for the row entities (e.g., users) and column entities (e.g., movies) of a (rating) matrix. This inductive model has a parameter matrix G which is either required to be low rank [13] or to have a minimal nuclear norm ||G||_* [28]. Recovering G of a (usually) smaller size is argued to be easier than directly recovering the matrix F. With a very strong assumption on ‘perfect’ side information, i.e., both X and Y are orthonormal matrices and respectively in the latent column and row space of the matrix F, the method in [28] is proved to require much reduced sample complexity O(log N) for an exact recovery of F. Because most side features X and Y are not perfect in practice, a very recent work [9] proposes to use a residual matrix N to handle the noisy side features. This method constructs an inductive model X^T GY + N to approximate F and requires both G and N to be low rank, or have a low nuclear norm. It uses the nuclear norm of the residual to quantify the usefulness of side information, and proves O(log N) sampling rate for an ε-recovery when X and Y span the full latent feature space of F, and o(N) sample complexity when X and Y contain corrupted latent features of F. An ε-recovery is defined as that the expected discrepancy between the predicted matrix and the true matrix is less than an arbitrarily small ε > 0 under a certain probability.

In this paper, we propose a new method for matrix recovery by constructing a sparse interactive model X^T GY to approximate F where G can be sparse but does not need to be low rank. The (i, j)-th element of G determines the role of the interaction between the i-th feature of users and the j-th feature of products. The low-rank property of F is commonly assumed to characterize the observation that similar users tend to rate similar products similarly [4]. When using an inductive approximation F = X^T GY, rank(F) ≤ rank(G), so a low-rank requirement on G can be a sufficient condition on the low-rank condition of F. Previous relevant methods [13, 28, 9] all impose the low-rank condition on G, which is however not a necessary condition for F to be low rank (only becomes a necessary condition when X and Y are full rank). Given general side matrices $X \in ℝ^{d_{1} \times m}$ and $Y \in ℝ^{d_{2} \times n}$ where the numbers of features d₁, d₂ ≪ N, limiting the interactive model of $G \in ℝ^{d_{1} \times d_{2}}$ to be low rank can be an over-restrictive constraint. In our model, we use a low-rank matrix E to directly approximate F and then estimate E from the interactive model of X and Y with a sparse regularizer on G. We show empirically that a low-rank F can be recovered from a corresponding full (or high) rank G. Our contributions are summarized as follows:

We propose a new formulation that estimates both E and G by imposing a nuclear-norm constraint on E but a general regularizer on G, e.g., the sparse regularizer ||G||₁. The proposed model has recovery guarantees depending on the quality of the side features: (1) when X and Y are full row rank and span the entire latent feature space of F (but are not required to satisfy the much stronger condition of being orthonormal as in [28]), O(log N) observations are still sufficient for our method to achieve an exact recovery of F. (2) When the side matrices are not full rank and corrupted from the original latent features of F, i.e., X and Y do not contain enough basis to exactly recover F, O(log N) observed entries can be sufficient for an ε-recovery.
A new linearized alternating direction method of multipliers (LADMM) is developed to efficiently solve the proposed formulation. Existing methods that use side information are solved by standard block-wise coordinate descent algorithms which have convergence guarantee to a global solution only when each block-wise subproblem has a unique solution [26]. Our LADMM has stronger convergence property [29] and benefits from the linear convergence rate of ADMM [11,23].
Prior methods focus on the recovery of F, and little light has been shed to understand whether the interactive model of G can be retrieved. Because of the explicit use of E and G, our method aims to directly recover both. The unique G in the case of exact recovery of F can be attained by our algorithm. When G is not unique in the ε-recovery case, our algorithm converges to a point in the optimal solution set.

2 The Proposed Interactive Model

To utilize the side information in X and Y to complete F, we consider to build a predictive model from the observed components that predicts the missing ones. One can simply build a linear model: f = x^Tu + y^Tv + g, where x and y are the feature vectors respectively for a user and a product, and u, v and g are model parameters. In real life applications, interactive terms between the features in X and Y can be very important. For example, male users tend to rate science fiction and action movies higher than female, which can be informative when predicting their ratings. Therefore, a linear model considering no interactive terms can be oversimple and have low predictive power for missing entries. We hence add interactive terms by introducing an interaction matrix $H^{d_{1} \times d_{2}}$ into the predictive model, which can be written as: f = x^THy + x^Tu + y^Tv + g. By defining $\bar{x} = {[x^{T} 1]}^{T}, \bar{y} = {[y^{T} 1]}^{T}$ and $G^{(a = d_{1} + 1) \times (b = d_{2} + 1)} = (\begin{matrix} H & u \\ v^{T} & g \end{matrix})$ the above model can be simplified to: $f = {\bar{x}}^{T} G \bar{y}$ . The following optimization problem can be solved to obtain the model parameter G.

\min_{G, E} g (G) + λ_{E} {‖ E ‖}_{*}, subject to {\bar{X}}^{T} G \bar{Y} = E, R_{Ω} (E) = R_{Ω} (F),

where E is a completed version of F, ${\bar{X}}^{a \times m}$ and ${\bar{Y}}^{b \times n}$ are two matrices that are created by augmenting one row of all ones to X and to Y, respectively, and g(G) and ||E||_* are used to incorporate the (sparsity) prior of G and low rank prior of E. Because the side information data can be noisy and not all the features and their interactions are helpful to the prediction of F, a sparse G is often expected. Our implementation has used g(G) = ||G||₁. It is natural to impose low rank requirement on E because it is a completed version of a low rank matrix F. The tuning parameter λ_E is used to balance the two priors in the objective.

Without loss of generality and for convenience of notation, we simply use X and Y to denote the augmented matrices. Denote the Frobenius norm of a matrix by ||·||_F. To account for Gaussian noise, we relax the equality constraint X^T GY = E and replace it by minimizing their squared residual: ${‖ X^{T} GY - E ‖}_{F}^{2}$ and solve the following convex optimization problem to obtain G and E:

\min_{G, E} \frac{1}{2} {‖ X^{T} GY - E ‖}_{F}^{2} + λ_{G} g (G) + λ_{E} {‖ E ‖}_{*}, subject to R_{Ω} (E) = R_{Ω} (F) .

(2)

where λ_G is another tuning parameter that together with λ_E balances the three terms in the objective. Especially, the regularizer g(·) in our theoretical analysis can take any general matrix norm that satisfies ||M||_* ≤ Cg(M)), ∀M, for a constant C, so for instance g(·) can be ||G||₁, or ||G||_F, or ||G||₂. Throughout this paper, the matrices X (and Y) refer to, i.e., either the original $X^{d_{1} \times m}$ (and $Y^{d_{2} \times n}$ ) or the augmented ${\bar{X}}^{a \times m}$ (and ${\bar{Y}}^{b \times n}$ ) depending on the user-specified model.

Our formulation (2) differs from existing methods that make use of side information for matrix completion in several ways. Existing methods [28, 13, 9] solve the problem by finding $\hat{H}$ that minimizes ||H||_* subject to R_Ω(X^T HY) = R_Ω(F), but we expand it to include the linear term within the interactive model. The proposed model adds the flexibility to consider both linear and quadratically interactive terms, and allows the algorithm to determine the terms that should be used in the model by enforcing the sparsity in H (or G). Because E = X^T GY, the rank of G bounds that of E from above. The existing methods all control the rank of G (e.g. by minimizing ||G||_*) to incorporate the prior of low rank E (and thus low rank F) in their formulations. However, when the rank of G is not properly chosen during the tuning of hyperparameters, it may not even be a sufficient condition to ensure low rank E (if rank(E) ≪ the pre-specified rank(G)). It is easy to see that besides G a low-rank X or Y can lead to a low-rank E as well. Enforcing a low-rank condition on H or G may limit the search space of the interactive model and thus impair the prediction performance on missing matrix entries, which are demonstrated in our empirical results. Moreover, one can observe that when λ_G is sufficiently large, Eq.(2) is reduced to the standard matrix completion problem (1) without side information because G may be degenerated into a zero matrix, so our formulation is applicable when no access to useful side information.

3 Recovery Analysis

Let E₀ and G₀ be the two matrices such that R_Ω(F) = R_Ω(E₀) and E₀ = X^T G₀Y. In this section, we give our theoretical results on the sample complexity for achieving an exact recovery of E₀ and G₀ when X and Y are both full row rank (i.e., rank(X) = a and rank(Y) = b), and an ε-recovery of E₀ when the two side matrices are corrupted and less informative. The proofs of all theorems are given in supplementary materials.

3.1 Sample Complexity for Exact Recovery

Before presenting our results, we give a few definitions. Let $F = U \sum V^{T}, X^{T} = U_{X} \sum_{X} V_{X}^{T}$ and $Y^{T} = U_{Y} \sum_{Y} V_{Y}^{T}$ be the singular value decomposition of F, X^T and Y^T, respectively, where all Σ matrices are full rank, meaning that singular vectors corresponding to the singular value 0 are not included in the respective U and V matrices. Let

P_{U} = {UU}^{T} \in ℝ^{m \times m}, P_{V} = {VV}^{T} \in ℝ^{n \times n}, P_{X} = U_{X} U_{X}^{T} = X^{T} V_{X} \sum_{X}^{- 2} V_{X}^{T} X \in ℝ^{m \times m}, P_{Y} = U_{Y} U_{Y}^{T} = Y^{T} V_{Y} \sum_{Y}^{- 2} V_{Y}^{T} Y \in ℝ^{n \times n},

where P_U, P_V, P_X and P_Y project a vector onto the subspaces spanned, respectively, by the columns in U, V and rows in X, and Y. For any matrix M^m×n that satisfies M = P_XMP_Y, we define two linear operators: P_T : ℝ^m×n → ℝ^m×n and P_T^⊥ : ℝ^m×n → ℝ^m×n as follows:

P_{T} (M) = P_{U} M P_{Y} + P_{X} M P_{V} - P_{U} M P_{V}

P_{T^{⊥}} (M) = (P_{X} - P_{U}) M (P_{Y} - P_{V}) = P_{X^{⊥}} M P_{Y^{⊥}} .

Let μo and μ₁ be the two coherence measures of F and be defined as follows as discussed in [4, 16]:

μ_{0} = \max (\frac{m}{r} \max_{1 \leq i \leq m} {‖ P_{U} e_{i} ‖}_{2}, \frac{n}{r} \max_{1 \leq j \leq n} {‖ P_{V} e_{j} ‖}_{2}), μ_{1} = \max_{i, j} \frac{mn}{r} {({[U V^{T}]}_{i, j})}^{2},

where e_i is the unit vector with the ith entry equal to 1. Let μ_XY be the coherence measurement between X and Y and be defined as:

μ_{XY} = \max (\max_{1 \leq i \leq m} \frac{m {‖ x_{i} ‖}_{2}^{2}}{a}, \max_{1 \leq j \leq n} \frac{n {‖ y_{i} ‖}_{2}^{2}}{b}) .

With the above definitions, we show in the following theorem that when X and Y are both full row rank, (G₀, E₀) is the unique solution to Eq.(2) with high probability as long as there are O(r log N) observed components in F. In other words, with a sampling rate of O(r log N), our method can fully recover both E₀ and G₀ with a high probability when X and Y are full row rank.

Theorem 1

Let μ = max(μ₀, μ_XY), $σ = \max ({‖ \sum_{X}^{- 1} ‖}_{*}, {‖ \sum_{Y}^{- 1} ‖}_{*})$ , N = max(m, n), $q_{0} = \frac{1}{2} (1 + log a - log r), T_{0} = \frac{128 p}{3} σ μ \max (μ_{1}, μ) r (a + b) log N$ and $T_{1} = \frac{8 p}{3} σ^{2} μ^{2} (ab + r^{2}) log N$ , where p is a constant. Assume T₁ ≥ q₀T₀, X and Y are both full row rank. For any p > 1, with a probability at least 1 − 4(q₀ + 1)N⁻^p⁺¹ − 2q₀N^−p+2, (G₀, E₀) is the unique optimizer to Problem (2) with necessary sampling rate as few as O(r log N). More precisely, the sampling size |Ω| should satisfy that $| Ω | \geq \frac{64 p}{3} σ μ \max (μ_{1}, μ) (1 + log a - log r) r (a + b) log N$ .

When r ≪ N and r = O(1), the sampling rate for the exact recovery of both E₀ and G₀ reduces to O(log N). A similar sampling rate for a full recovery of E₀ has been developed in [28] where both X and Y, however, need to be orthonormal matrices in their derivation. In Theorem 1, because σ is mainly determined by the smallest singular values of the side information matrices, and sampling rate increases when σ increases, it suggests that side information matrices of lower rank would require more observed F entries for a full recovery of F. An advanced model without the orthonormal assumption has been given in [9], but exact recovery is not discussed. In our case, the two matrices are only required to be full row rank. Moreover, the theoretical or empirical results in our work give the first careful investigation on the recovery of both G₀ and E₀.

3.2 Sample Complexity for ε-Recovery

The condition for full-rank side information matrices may not be satisfied in some cases to fully recover E₀ (or F). We analyze the error bound of our model and prove a reduced sample complexity in comparison with standard matrix completion methods for an ε-recovery when the side information matrices are not full row rank or their rank is difficult to attain.

Theorem 2

Denote ||E||_* ≤ α, ||G||₁ ≤ γ, ||X^T GY − E||_F ≤ ϕ and the perfect side feature matrices (containing latent features of F) are corrupted with ∆X and ∆Y where ||∆X||_F ≤ s₁, ||∆Y||_F ≤ s₂ and S = max(s₁,s₂). To ε-recover F that the expected loss $E [l (f, F)] < ε$ for a given arbitrarily small $ε > 0, O (\min ((γ^{2} + ϕ^{2}) log N, S^{2} α \sqrt{N}) / ε^{2})$ observations are sufficient for our model when corrupted factors of side information are bounded.

Theorem 2 can be inferred from the fact that the trace norm of E and the ℓ₁-norm of G affect sample complexity of our model. It meets the intuition that higher rank matrix ought to require more observations to recover. Besides, for the discovery of G, a sparse interactive matrix can lead to the decrease of the sample complexity, which implies that the side information, even though when it is not perfect, could be informative enough such that the original matrix can be compressed by sparse coding via the estimated interaction between the features of row and column entities of the matrix. Our empirical evaluations have confirmed the utility of even imperfect side features. When the rank of the original data matrix r = O(1) (r ≪ N), and correspondingly α = O(1), Theorem 2 points out that only O(log N) sampling rate is required for an ε-recovery. The classic matrix completion analysis without side information shows that under certain conditions, one can achieve O(N poly log N) sample complexity for both perfect recovery [4] and ε-recovery [25], which is higher than our complexity. However, the condition for these existing bounds is that the observed entries follow a certain distribution. Recent studies [22] found that if no specific distribution is pre-assumed for observed entries, O(N^3/2) sampling rate is sufficient for an ε-recovery. Compared to those results, our analysis does not require any assumption on the distribution of observed entries. When X and Y contain insufficient interaction information about F and ||E||_* = O(N), the sample complexity of our method increases to O(N³/²) in the worst case, which means that our model maintains the same complexity as the classic methods.

4 Adaptive LADMM Algorithm

In this section, we develop an adaptive LADMM algorithm [29] to solve problem (2). First, we show that the ADMM is applicable in our problem and we then derive LADMM steps. A convergence proof is established to guarantee the performance of our algorithm.

Because it requires separable blocks of variables in order to use ADMM, we first define C = E − X^T GY and use it in Eq.(2). Then the augmented Lagrangian function of (2) is given by

ℒ (E, G, C, M_{1}, M_{2}, β) = \frac{1}{2} {‖ C ‖}_{F}^{2} + λ_{E} {‖ E ‖}_{*} + λ_{G} {‖ G ‖}_{1} + 〈 M_{1}, R_{Ω} (E - F) 〉 + + 〈 M_{2}, E - X^{T} GY - C 〉 + \frac{β}{2} {‖ R_{Ω} (E - F) ‖}_{F}^{2} + \frac{β}{2} {‖ E - X^{T} GY - C ‖}_{F}^{2}

(3)

where M₁, M₂ ∈ ℝ^m^×ⁿ are Lagrange multipliers and β > 0 is the penalty parameter. Given C^k, G^k, E^k, $M_{1}^{k}$ and $M_{2}^{K}$ at iteration k, each group of the variables yields their respective subproblems:

C^{k + 1} = arg \min_{C} ℒ (E^{k}, G^{k}, M_{2}^{k}, C, β_{k}), G^{k + 1} = arg \min_{G} ℒ (E^{k}, G, M_{2}^{k}, C^{k + 1}, β_{k}), E^{k + 1} = arg \min_{E} ℒ (E, G^{k + 1}, M_{1}^{k}, M_{2}^{k}, C^{k + 1}, β_{k}),

(4)

After solving these subproblems, we update the multipliers M₁ and M₂ as follows;

M_{1}^{k + 1} = M_{1}^{k} + β_{k} (R_{Ω} (E^{k + 1} - F)), M_{2}^{K + 1} = M_{2}^{K} + β_{k} (E^{k + 1} - X^{T} G^{k + 1} Y - C^{k + 1}) .

(5)

We focus on demonstrating the iterative steps of the adaptive LADMM. Given C^k, G^k, E^k, $M_{1}^{k}$ and $M_{2}^{K}$ , Algorithm 1 describes how to obtain the next iterate (C, E, G, M₁, M₂). A closed-form solution has been derived for each subproblem in the supplementary material.

Algorithm 1.

The adaptive LADMM algorithm to solve C^k, G^k, E^k, k = 1, …, K

Input: X, Y and R_Ω(F) with parameters λ_G, λ_E, τ_A, τ_B, ρ and β_max.

Output: C, G, E;

Initialize E⁰, G⁰, $M_{1}^{0}$ , $M_{2}^{0}$ . Compute A = Y^T ⊗ X^T. k = 0, repeat;
$C^{k + 1} = \frac{β_{k}}{β_{k} + 1} (E^{k} - X^{T} G^{k} Y + M_{2}^{k} / β_{k})$ ;
$G^{k + 1} = reshape (max (| g^{k} - f_{1}^{k} / τ_{A} | - \frac{λ_{G}}{τ_{A} β_{k}}, 0) ⊙ sgn (g^{k} - f_{1}^{k} / τ_{A}))$ where $f_{1}^{k} = A^{T} (A g^{k} + c^{k} - b_{1}^{k}) = A^{T} (A g^{k} + c^{k} - e^{k} - m_{2}^{k} / β_{k})$ and e = vec(E), g = vec(G), m = vec(M), c = vec(C).
$E^{k + 1} = SVT (E^{k} - (f_{2}^{k} + f_{3}^{k}) / (2 τ_{B}), λ_{E} / 2 (β_{k} τ_{B}))$ where $f_{2}^{k} = R_{Ω} (E^{k} - F + M_{1}^{k} / β_{k}); f_{3}^{k} = E^{k} - X^{T} G^{k + 1} Y - C^{k} + M_{2}^{k} / β_{k} .$
$M_{1}^{k + 1} = M_{1}^{k} + β_{k} (R_{Ω} (E^{k + 1} - F)) .$
$M_{2}^{K + 1} = M_{2}^{k} + β_{k} (E^{k + 1} - X^{T} G^{k + 1} Y - C^{k + 1}) .$
β_k+1 = min (β_max, ρβ_k).
k = k + 1 until convergence;

Return C, G, E;

Open in a new tab

The adaptive parameter in Algorithm 1 is ρ > 1, and β_max controls the upper bound of {β_k}. The operator reshape(g) converts a vector g ∈ ℝ^ab into a matrix G ∈ ℝ^a×b, which is the inverse operator of vec(G). The operator SVT(E, t) is the singular value thresholding process defined in [3] for soft-thresholding the singular values of an arbitrary matrix E by a threshold t. The matrix A = Y^T ⊗ X^T where ⊗ indicates the Kronecker product. In the initialization step, $M_{1}^{0}$ , $M_{2}^{0}$ are randomly drawn from the standard Gaussian distribution; we initialize E₀ and G₀ by the iterative soft-thresholding algorithm [2] and SVT operator respectively.

The adaptive LADMM can effectively solve the proposed optimization problem in several aspects. First, the convergence of the commonly-used block-wise coordinate descent (BCD) method, sometimes referred to as alternating minimization methods, requires typically that the optimization problem be strictly convex (or quasiconvex but hemivariate). The strongest result for BCD so far is established in [26] which requires the alternating subproblems to be optimized in each iteration to its unique optimal solution. This requirement is often restrictive in practice. Our convex (but not strictly convex) problem can be solved by the adaptive LADMM with the global convergence guarantee which is characterized in Theorem 3. Second, two of the subproblems are non-smooth due to the ℓ₁-norm or the nuclear norm, so it can be difficult to obtain a closed-form formula to efficiently compute a solution by standard optimization tools; however, adaptive LADMM utilizes the linearization technique which leads to a closed-form solution for each linearized subproblem, and significantly enhances the efficiency of the iterative process. Third, adaptive LADMM can be practically parallelizable by a similar scheme to that of ADMM. It is also noted that the convergence rate of LADMM [11] and parallel LADMM is O(1/k) [23] whereas the BCD method still lacks of clear theoretical results of its convergence rate.

Theorem3

Define the operators $A$ and $ℬ$ as $A (G) = (\binom{0}{- X^{T} GY}), ℬ (E) = (\binom{R_{Ω} (E)}{E})$ , and let $M = (\binom{M_{1}}{M_{2}})$ . If β_k is non-decreasing and upper-bounded, $τ_{A} > {‖ A ‖}^{2}$ , and $τ_{B} > {‖ ℬ ‖}^{2}$ , then the sequence {(C^k, G^k, E^k, M^k)} generated by the adaptive LADMM Algorithm 1 converges to a global minimizer of Eq. (2).

5 Experimental Results

We validated our method in both simulations and the analysis of two real world datasets: MovieLens (movie rating) and NCI-DREAM (drug discovery) datasets. Three most recent matrix completion methods that also utilized side information, MAXIDE[28], IMC[13] and DirtyIMC[9], were compared against our method. The design of our experiments focused on demonstrating the effectiveness of our method in practice. The performance of all methods was measured by the relative mean squared error (RMSE) calculated on missing entries: ${‖ R_{Ω} (X^{T} GY - F) ‖}_{2}^{2} / {‖ R_{Ω} (F) ‖}_{2}^{2}$ . For both synthetic and real-world datasets, we randomly set q percent of the components in each observed matrix F to be missing. The hyperparameters λ’s and the rank of G (required by IMC and Dirty IMC) were tuned via the same cross validation process: we randomly picked 10% of the given entries to form a validation set. Then models were obtained by applying each method to the remaining entries with a specific choice of λ from 10⁻³,10⁻²,…, 10⁴. The average validation RMSE was examined by repeating the above procedure six times. The hyperparameter values that gave the best average validation RMSE were chosen for each method. For IMC and DirtyIMC, the best rank of G was chosen from = 1 to 15 within each data split. For each choice of q, we repeated the above entire procedure six times and reported the average RMSE on the missing entries.

5.1 Synthetic Datasets

We created two different simulation tests with and without full row rank X and Y. For all the synthetic datasets, we first randomly created X and Y. In order to make our simulations reminiscent real situations where distributions of side features can be heterogeneous, data for each feature in both X and Y were generated according to a distribution that was randomly selected from Gaussian, Poisson and Gamma distributions. We created the sparse G matrices as follows. The location of the non-zero entries of G were randomly picked but their values were generated by multiplying a value drawn from $N (0, 100)$ , which we repeated several times to chose the matrices that showed full or high rank. We then generated F with F = X^T GY + N where N represents noise and each component N_i,j was drawn from $N (0, 1)$ . For each simulated F, we ran all methods with q ∈ [10% – 80%] with an increase step of 10%.

We compared the different methods in three settings, which were labeled as synthetic experiment I, II and III in our results. In the first setting, the dimension of X and Y was set to 15 × 50 and 20 × 140 and all features in these two matrices were randomly generated to make them full row rank. Both the last two settings corresponded to the second test where X and Y were not full row rank. The dimension of X and Y was set to 16 × 50, 21 × 140 and 20 × 50, 25 × 140, respectively, for these two settings where the first 15 features in X and 20 features in Y were randomly created, but the remaining features were generated by arbitrarily linear combinations of the randomly created features. For all three settings, we used 10 synthetic datasets and reported mean and standard deviation of RMSE on missing values as shown in Figure 1.

The Comparison of RMSE for Experiments I, II, and III.

Our approach outperformed all other compared methods significantly in almost all these settings. When the missing rate q increased, the RMSE of our method grew much slower than other methods. We studied the rank of the recovered G and E in the first setting. For all methods, the corresponding G and E that gave the best performance were examined. The ranks of G and E from our method, MAXIDE, IMC, DirtyIMC were 15, 8, 1, 1 and 15, 15, 1, 2, respectively. These results suggested that incorporating the strong prior of low rank G might hurt the recovery performance. The retrieved model matrices G of all compared methods (when using q =10% of missing entries in one of the 10 synthetic datasets) together with the true G are plotted in Figure 2. Only our method was able to recover the true G and all the other methods merely found approximations.

The heatmap of the true G and recovered G matrices in Synthetic Experiment I.

5.2 Real-world Datasets

We used two relatively large datasets that we could find as suitable for our empirical evaluation. Note that early methods employing side information were often tested on datasets with either X or Y but not both although some of them might be larger than the two datasets we used.

5.2.1. MovieLens

This dataset was downloaded from [12] and contained 100,000 user ratings (integers from 1 to 5) from 943 users on 1682 movies. There were 20 movie features such as genre and release date, as well as 24 user features describing users’ demographic information such as age and gender. We compared all methods with four different q values: 20–50%. The RMSE values of each method are shown in Table 1, which shows that our approach significantly outperformed other methods, especially when q was large. Figure 3 shows the constructed G matrix that shows some interesting observations. For instance, male users tend to rate action, science fiction, thriller and war movies high but low for children’ movies, exhibiting some common intuitions.

Table 1.

The Comparison of RMSE values of different methods on real-world datasets.

	MovieLens Data				NCI-Dream Challenge

Methods	20%	30%	40%	50%	20%	30%	40%	50%
Our approach	0.276 (± 0.001)	0.279 (± 0.002)	0.284 (± 0.001)	0.292 (± 0.001)	0.181 (± 0.069)	0.139 (± 0.010)	0.145 (± 0.018)	0.190 (± 0.031)
MAXIDE	0.424 (±0.016)	0.425 (±0.013)	0.419 (±0.008)	0.421 (±0.013)	0.268 (±0.036)	0.240 (±0.007)	0.255 (±0.016)	0.288 (±0.022)
IMC	0.935 (±0.001)	0.943 (±0.001)	0.945 (±0.001)	0.959 (±0.001)	0.437 (±0.031)	0.489 (±0.003)	0.557 (±0.013)	0.637 (±0.011)
DirtyIMC	0.705 (±0.001)	0.738 (±0.001)	0.775 (±0.001)	0.814 (±0.001)	0.432 (±0.033)	0.475 (±0.008)	0.551 (±0.018)	0.632 (±0.011)

Open in a new tab

5.2.2 NCI-DREAM Challenge

The data on the reactions of 46 breast cancer cell lines to 26 drugs and the expression data of 18633 genes for all the cell lines were provided by NCI-DREAM Challenge [10]. For each drug, we had 14 features that describes their chemical and physical properties such as molecular weight, XLogP3 and hydrogen bond donor count, and were downloaded from National Center for Biotechnology Information (http://pubchem.ncbi.nlm.nih.gov/). For the cell line features, we ran principle component analysis (PCA) and used the top 45 principal components that accounted for more than 99.99% of the total data variance. We compared the four different methods with four different q values: 20–50%. The RMSE values of all methods are provided in Table 1 where our method again shows the best performance. We examined the ranks of both G and E obtained by all the methods. They were 15, 15, 1, 1 for G and 2, 15, 1, 2 for E, respectively, for our approach, MAXIDE, IMC and DirtyIMC in sequence. This demonstrates that a low rank E but a high rank G give the best performance on this dataset. In other words, requiring a low rank G may hurt the performance of recovering a low rank E.

The constructed G by our method is plotted in Figure 4, where columns represent cell line features (i.e., principle components) and rows represent drug features. Please refer to the supplementary material for the names of these features. According to this figure, drug features: XlogP (F2), hydrogen bond donor (HBD) (F3), Hydrogen bond acceptor (HBA) (F4) and Rotatable Bond number (F5) all played important roles in drug sensitivity. This result aligns well with biological knowledge, as all these four features are very important descriptors for cellular entry and retention.

HeatMap of *sign*(G) log(|G|) for NCI-DREAM for a better illustration

6 Conclusion

In this paper, we have proposed a novel sparse inductive model that utilizes side features describing the row and column entities of a partially observed matrix to predict its missing entries. This method models the linear predictive power of side features as well as interaction between the features of row and column entities. Theoretical analysis shows that this model has advantages of reduced sample complexity over classical matrix completion methods, requiring only O(log N) observed entries to achieve a perfect recovery of the original matrix when the side features reflect the true latent feature space of the matrix. When the side features are less informative, our model requires O(log N) observations for an ε-recovery of the matrix. Unlike early methods that use a BCD algorithm, we have developed a LADMM algorithm to optimize the proposed formulation. Given the optimization problem is convex, this algorithm can converge to a global solution. Computational results demonstrate the superior performance of this method over three recent methods. Future work includes the examination of other types and quality of side information and the understanding of whether our method will benefit a variety of relevant problems, such as multi-label learning, and semi-supervised clustering etc.

Supplementary Material

suppl

NIHMS909907-supplement-suppl.pdf^{(332.9KB, pdf)}

Acknowledgments

Jinbo Bi and her students Jin Lu, Guannan Liang and Jiangwen Sun were supported by NSF grants IIS-1320586, DBI-1356655, and CCF-1514357 and NIH R01DA037349.

Footnotes

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

References

1.Abernethy J, Bach F, Evgeniou T, Vert JP. A new approach to collaborative filtering: Operator estimation with spectral regularization. The Journal of Machine Learning Research. 2009;10:803–826. [Google Scholar]
2.Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences. 2009;2(1):183–202. [Google Scholar]
3.Cai JF, Cande’s EJ, Shen Z. A singular value thresholding algorithm for matrix completion. SIAM J on Optimization. 2010 Mar;20(4):1956–1982. [Google Scholar]
4.Candès EJ, Recht B. Exact matrix completion via convex optimization. Foundations of Computational mathematics. 2009;9(6):717–772. [Google Scholar]
5.Candès EJ, Tao T. The power of convex relaxation: Near-optimal matrix completion. Information Theory, IEEE Transactions on. 2010;56(5):2053–2080. [Google Scholar]
6.Chen P, Suter D. Recovering the missing components in a large noisy low-rank matrix: Application to sfm. IEEE Trans Pattern Anal Mach Intell. 2004 Aug;26(8):1051–1063. doi: 10.1109/TPAMI.2004.52. [DOI] [PubMed] [Google Scholar]
7.Chen T, Zhang W, Lu Q, Chen K, Zheng Z, Yu Y. Svdfeature: a toolkit for feature-based collaborative filtering. The Journal of Machine Learning Research. 2012;13(1):3619–3622. [Google Scholar]
8.Chiang KY, Hsieh CJ, Dhillon EIS. Robust principal component analysis with side information. Proceedings of The 33rd International Conference on Machine Learning. 2016:2291–2299. [Google Scholar]
9.Chiang KY, Hsieh CJ, Dhillon IS. Matrix completion with noisy side information. Advances in Neural Information Processing Systems. 2015;28:3429–3437. [Google Scholar]
10.Daemen A, Griffith OL, Heiser LM, Wang NJ, Enache OM, Sanborn Z, Pepin F, Durinck S, Korkola JE, Griffith M, et al. Modeling precision treatment of breast cancer. Genome Biol. 2013;14(10):R110. doi: 10.1186/gb-2013-14-10-r110. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Fang EX, He B, Liu H, Yuan X. Generalized alternating direction method of multipliers: new theoretical insights and applications. Mathematical Programming Computation. 2015;7(2):149–187. doi: 10.1007/s12532-015-0078-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Harper FM, Konstan JA. The movielens datasets: History and context. ACM Trans Interact Intell Syst. 2015 Dec;5(4):19:1–19:19. [Google Scholar]
13.Jain P, Dhillon IS. Provable inductive matrix completion. 2013 arXiv preprint arXiv:1306.0626. [Google Scholar]
14.Keshavan R, Montanari A, Oh S. Matrix completion from a few entries. Information Theory, IEEE Transactions on. 2010 Jun;56(6):2980–2998. [Google Scholar]
15.Lin Z, Chen M, Ma Y. The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrices. Mathematical Programming. 2010 [Google Scholar]
16.Liu G, Li P. Low-rank matrix completion in the presence of high coherence. IEEE Transactions on Signal Processing. 2016 Nov;64(21):5623–5633. [Google Scholar]
17.Menon AK, Chitrapura KP, Garg S, Agarwal D, Kota N. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2011. Response prediction using collaborative filtering with hierarchies and side-information; pp. 141–149. [Google Scholar]
18.Natarajan N, Dhillon IS. Inductive matrix completion for predicting gene–disease associations. Bioinformatics. 2014;30(12):i60–i68. doi: 10.1093/bioinformatics/btu269. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Ning X, Karypis G. Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys ’12. New York, NY, USA: ACM; 2012. Sparse linear methods with side information for top-n recommendations; pp. 155–162. [Google Scholar]
20.Recht B. A simpler approach to matrix completion. The Journal of Machine Learning Research. 2011;12:3413–3430. [Google Scholar]
21.Rennie JDM, Srebro N. Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05. New York, NY, USA: ACM; 2005. Fast maximum margin matrix factorization for collaborative prediction; pp. 713–719. [Google Scholar]
22.Shamir O, Shalev-Shwartz S. Matrix completion with the trace norm: learning, bounding, and transducing. The Journal of Machine Learning Research. 2014;15(1):3401–3423. [Google Scholar]
23.Shi W, Ling Q, Wu G, Yin W. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing. 2015 Nov;63(22):6013–6023. [Google Scholar]
24.Sindhwani V, Bucak S, Hu J, Mojsilovic A. One-class matrix completion with low-density factorizations. Data Mining (ICDM), 2010 IEEE 10th International Conference on. 2010 Dec;:1055–1060. [Google Scholar]
25.Srebro N, Shraibman A. Rank, trace-norm and max-norm. 2005:545–560. [Google Scholar]
26.Tseng P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications. 2001;109(3):475–494. [Google Scholar]
27.Weng Z, Wang X. Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE; 2012. Low-rank matrix completion for array signal processing; pp. 2697–2700. [Google Scholar]
28.Xu M, Jin R, hua Zhou Z. Speedup matrix completion with side information: Application to multi-label learning. Advances in Neural Information Processing Systems. 2013;26:2301–2309. [Google Scholar]
29.Yang J, Yuan XM. Linearized augmented lagrangian and alternating direction methods for nuclear norm minimization. Math Comput. 2013;82 [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

suppl

NIHMS909907-supplement-suppl.pdf^{(332.9KB, pdf)}

[R1] 1.Abernethy J, Bach F, Evgeniou T, Vert JP. A new approach to collaborative filtering: Operator estimation with spectral regularization. The Journal of Machine Learning Research. 2009;10:803–826. [Google Scholar]

[R2] 2.Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences. 2009;2(1):183–202. [Google Scholar]

[R3] 3.Cai JF, Cande’s EJ, Shen Z. A singular value thresholding algorithm for matrix completion. SIAM J on Optimization. 2010 Mar;20(4):1956–1982. [Google Scholar]

[R4] 4.Candès EJ, Recht B. Exact matrix completion via convex optimization. Foundations of Computational mathematics. 2009;9(6):717–772. [Google Scholar]

[R5] 5.Candès EJ, Tao T. The power of convex relaxation: Near-optimal matrix completion. Information Theory, IEEE Transactions on. 2010;56(5):2053–2080. [Google Scholar]

[R6] 6.Chen P, Suter D. Recovering the missing components in a large noisy low-rank matrix: Application to sfm. IEEE Trans Pattern Anal Mach Intell. 2004 Aug;26(8):1051–1063. doi: 10.1109/TPAMI.2004.52. [DOI] [PubMed] [Google Scholar]

[R7] 7.Chen T, Zhang W, Lu Q, Chen K, Zheng Z, Yu Y. Svdfeature: a toolkit for feature-based collaborative filtering. The Journal of Machine Learning Research. 2012;13(1):3619–3622. [Google Scholar]

[R8] 8.Chiang KY, Hsieh CJ, Dhillon EIS. Robust principal component analysis with side information. Proceedings of The 33rd International Conference on Machine Learning. 2016:2291–2299. [Google Scholar]

[R9] 9.Chiang KY, Hsieh CJ, Dhillon IS. Matrix completion with noisy side information. Advances in Neural Information Processing Systems. 2015;28:3429–3437. [Google Scholar]

[R10] 10.Daemen A, Griffith OL, Heiser LM, Wang NJ, Enache OM, Sanborn Z, Pepin F, Durinck S, Korkola JE, Griffith M, et al. Modeling precision treatment of breast cancer. Genome Biol. 2013;14(10):R110. doi: 10.1186/gb-2013-14-10-r110. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] 11.Fang EX, He B, Liu H, Yuan X. Generalized alternating direction method of multipliers: new theoretical insights and applications. Mathematical Programming Computation. 2015;7(2):149–187. doi: 10.1007/s12532-015-0078-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] 12.Harper FM, Konstan JA. The movielens datasets: History and context. ACM Trans Interact Intell Syst. 2015 Dec;5(4):19:1–19:19. [Google Scholar]

[R13] 13.Jain P, Dhillon IS. Provable inductive matrix completion. 2013 arXiv preprint arXiv:1306.0626. [Google Scholar]

[R14] 14.Keshavan R, Montanari A, Oh S. Matrix completion from a few entries. Information Theory, IEEE Transactions on. 2010 Jun;56(6):2980–2998. [Google Scholar]

[R15] 15.Lin Z, Chen M, Ma Y. The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrices. Mathematical Programming. 2010 [Google Scholar]

[R16] 16.Liu G, Li P. Low-rank matrix completion in the presence of high coherence. IEEE Transactions on Signal Processing. 2016 Nov;64(21):5623–5633. [Google Scholar]

[R17] 17.Menon AK, Chitrapura KP, Garg S, Agarwal D, Kota N. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2011. Response prediction using collaborative filtering with hierarchies and side-information; pp. 141–149. [Google Scholar]

[R18] 18.Natarajan N, Dhillon IS. Inductive matrix completion for predicting gene–disease associations. Bioinformatics. 2014;30(12):i60–i68. doi: 10.1093/bioinformatics/btu269. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] 19.Ning X, Karypis G. Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys ’12. New York, NY, USA: ACM; 2012. Sparse linear methods with side information for top-n recommendations; pp. 155–162. [Google Scholar]

[R20] 20.Recht B. A simpler approach to matrix completion. The Journal of Machine Learning Research. 2011;12:3413–3430. [Google Scholar]

[R21] 21.Rennie JDM, Srebro N. Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05. New York, NY, USA: ACM; 2005. Fast maximum margin matrix factorization for collaborative prediction; pp. 713–719. [Google Scholar]

[R22] 22.Shamir O, Shalev-Shwartz S. Matrix completion with the trace norm: learning, bounding, and transducing. The Journal of Machine Learning Research. 2014;15(1):3401–3423. [Google Scholar]

[R23] 23.Shi W, Ling Q, Wu G, Yin W. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing. 2015 Nov;63(22):6013–6023. [Google Scholar]

[R24] 24.Sindhwani V, Bucak S, Hu J, Mojsilovic A. One-class matrix completion with low-density factorizations. Data Mining (ICDM), 2010 IEEE 10th International Conference on. 2010 Dec;:1055–1060. [Google Scholar]

[R25] 25.Srebro N, Shraibman A. Rank, trace-norm and max-norm. 2005:545–560. [Google Scholar]

[R26] 26.Tseng P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications. 2001;109(3):475–494. [Google Scholar]

[R27] 27.Weng Z, Wang X. Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE; 2012. Low-rank matrix completion for array signal processing; pp. 2697–2700. [Google Scholar]

[R28] 28.Xu M, Jin R, hua Zhou Z. Speedup matrix completion with side information: Application to multi-label learning. Advances in Neural Information Processing Systems. 2013;26:2301–2309. [Google Scholar]

[R29] 29.Yang J, Yuan XM. Linearized augmented lagrangian and alternating direction methods for nuclear norm minimization. Math Comput. 2013;82 [Google Scholar]

PERMALINK

A Sparse Interactive Model for Matrix Completion with Side Information

Jin Lu

Guannan Liang

Jiangwen Sun

Jinbo Bi

Abstract

1 Introduction

2 The Proposed Interactive Model