From Bilinear Regression to Inductive Matrix Completion: A Quasi-Bayesian Analysis

The Tien Mai

doi:10.3390/e25020333

2023 Feb 11;25(2):333. doi: 10.3390/e25020333

Editors: Augustine Wong, Xiaoping Shi

PMCID: PMC9955477 PMID: 36832699

Abstract

In this paper, we study the problem of bilinear regression, a type of statistical modeling that deals with multiple variables and multiple responses. One of the main difficulties that arise in this problem is the presence of missing data in the response matrix, a problem known as inductive matrix completion. To address these issues, we propose a novel approach that combines elements of Bayesian statistics with a quasi-likelihood method. Our proposed method starts by addressing the problem of bilinear regression using a quasi-Bayesian approach. The quasi-likelihood method that we employ in this step allows us to handle the complex relationships between the variables in a more robust way. Next, we adapt our approach to the context of inductive matrix completion. We make use of a low-rankness assumption and leverage the powerful PAC-Bayes bound technique to provide statistical properties for our proposed estimators and for the quasi-posteriors. To compute the estimators, we propose a Langevin Monte Carlo method to obtain approximate solutions to the problem of inductive matrix completion in a computationally efficient manner. To demonstrate the effectiveness of our proposed methods, we conduct a series of numerical studies. These studies allow us to evaluate the performance of our estimators under different conditions and provide a clear illustration of the strengths and limitations of our approach.

Keywords: bilinear regression, matrix completion, low-rank model, PAC-Bayesian bound, Langevin Monte Carlo

1. Introduction

In this paper, we investigate the bilinear regression model, a statistical method that assumes a linear relationship between a set of multiple response variables and two sets of covariates. This model, also known as the growth curve model or generalized multivariate analysis model, is commonly used for analyzing longitudinal data, as shown in previous studies such as [1,2,3,4,5,6]. However, these studies only cover the scenario in which the response matrix is fully observed.

Recently, the bilinear regression model with incomplete response has been introduced and studied as the so-called inductive matrix completion, which is a generalization of the matrix completion problem [7,8]. This problem has attracted significant attention in various fields, such as drug repositioning [9], collaborative filtering [10], and genomics [7]. Inductive matrix completion is a challenging problem that arises when some of the entries in the response matrix are missing, which makes it difficult to infer the underlying relationship between the variables.

In this work, we explore the problem of bilinear regression and inductive matrix completion under a low-rank constraint on the coefficient matrix. Most existing approaches for these problems are frequentist methods, such as maximum likelihood estimation [2] or penalized optimization [7]. These methods are effective in providing point estimates for the parameters of the model but lack the ability to provide a full probabilistic characterization of the uncertainty. Recently, Bayesian approaches have been considered for these problems. For example, the paper [11] proposed a Bayesian approach for bilinear regression, and a Bayesian method was proposed for inductive matrix completion in the work [9]. However, unlike frequentist approaches, the statistical properties of the Bayesian approach for these models have not been fully explored yet.

The aim of this paper is to address an existing gap in the understanding of the bilinear regression and inductive matrix completion problems. To achieve this goal, we propose a novel approach that combines elements of Bayesian statistics with a quasi-likelihood method. Specifically, we start by addressing the problem of bilinear regression using a quasi-Bayesian approach, where a quasi-likelihood is employed. We then generalize this approach to the problem of inductive matrix completion. To ensure that our method is adaptive to the rank of the coefficient matrix, we use a spectral scaled Student prior distribution, which allows us to prove that the posterior mean satisfies a tight oracle inequality. This result demonstrates that our method is able to accurately estimate the parameters of the model, even when the rank of the coefficient matrix is unknown. Additionally, we prove contraction properties of the quasi-posteriors, which complement the guarantee obtained for the posterior mean.

The proposed method in this paper, the quasi-Bayesian approach, is an extension of the traditional Bayesian approach and is becoming increasingly popular in statistics and machine learning as a technique for generalized Bayesian inference, as noted in studies such as [12,13,14]. This approach allows for more flexibility in the modeling assumptions by replacing the likelihood function with a more general notion of risk or quasi-likelihood.

To provide theoretical guarantees for our proposed quasi-posteriors, we make use of the PAC-Bayesian technique [15,16,17]. This technique provides bounds on the generalization error of a learned estimator, and has been widely used in the literature as described in recent reviews and introductions, such as [18,19]. The PAC-Bayes bounds have been successfully applied in the context of matrix estimation problems as shown in studies such as [20,21,22,23]. Interestingly, by using the PAC-Bayesian technique for inductive matrix completion, we do not need to make any assumptions about the distribution of the missing entries in the response matrix. This is in contrast with previous works on matrix completion, such as [24,25,26,27,28,29], which typically require assumptions about the missing data. This makes our method more versatile and applicable to a wider range of problems.

The proposed method in this paper makes use of a spectral scaled Student prior, which is a specific choice of prior distribution. This choice is motivated by recent works in which it has been shown to lead to optimal rates in a variety of problems, including high-dimensional regression [30], image denoising [31] and reduced rank regression [23]. Although this prior is not conjugate to our problems, it allows for the convenient implementation of gradient-based sampling methods, which makes it computationally efficient.

To compute the proposed estimators and sample from the quasi-posterior, we employ a Langevin Monte Carlo (LMC) method. This method is a widely used algorithm for approximating complex distributions and allows for efficient computation of the proposed estimators. The LMC method allows us to obtain approximate solutions to the problem of bilinear regression and inductive matrix completion in a computationally efficient manner.

Furthermore, we use numerical studies to demonstrate the effectiveness of our proposed methods. These studies evaluate the performance of our estimators in various scenarios and provide insight into the capabilities and limitations of our approach. We also compare our method with the ordinary least squares method. This comparison shows that our method improves on the traditional approach in certain scenarios, such as when the response matrix contains missing data.

The remainder of this paper is organized as follows: In Section 2, we present the problem of bilinear regression, and introduce the low-rank promoting prior distribution that we use to address this problem. In Section 3, we extend our approach to the problem of inductive matrix completion. In Section 4, we discuss the Langevin Monte Carlo method used for the computation of the estimators, and present numerical studies to demonstrate the effectiveness of our proposed methods. Finally, we conclude our work and provide a summary of our findings in Section 5. The technical proofs are provided in Appendix A for the interested readers.

Notation 1.

Let $R^{n_{1} \times n_{2}}$ denote the set of $n_{1} \times n_{2}$ matrices with real entries. For $A \in R^{n_{1} \times n_{2}}$ , let $A^{⊺} \in R^{n_{2} \times n_{1}}$ denote the transpose of A. For any $A \in R^{n_{1} \times n_{2}}$ and $I = (i, j) \in \{1, \dots, n_{1}\} \times \{1, \dots, n_{2}\}$ , we denote by $A_{I} = A_{(i, j)} = A_{i, j}$ the entry of A in the i-th row and j-th column. The matrix in $R^{n_{1} \times n_{2}}$ with all entries equal to 0 is denoted by $0_{n_{1} \times n_{2}}$ . For a matrix $B \in R^{n_{1} \times n_{1}}$ , we let $Tr (B)$ denote its trace. The identity matrix in $R^{n_{1} \times n_{1}}$ is denoted by $I_{n_{1}}$ . For $A \in R^{n_{1} \times n_{2}}$ , we define its sup-norm ${∥ A ∥}_{\infty} = \max_{i, j} | A_{i, j} |$ ; its Frobenius norm ${∥ A ∥}_{F}$ is defined by ${∥ A ∥}_{F}^{2} = Tr (A^{⊺} A) = \sum_{i, j} A_{i, j}^{2}$ , and $rank (A)$ denotes its rank.

2. Bilinear Regression

2.1. Model

Let $Y \in R^{n \times q}$ consist of n independent response vectors, $X \in R^{n \times p}$ be a given between-individuals design matrix and $Z \in R^{k \times q}$ be a known within-individuals design matrix. Consider the bilinear regression model as follows:

Y = X M^{*} Z + E,

(1)

where $M^{*} \in R^{p \times k}$ is the unknown parameter matrix. The random noise matrix E is assumed to have zero mean, $E (E) = 0 .$ The main assumption here is the low-rank restriction on the model parameter that $rank (M^{*}) < min (p, k)$ .

The model presented in Equation (1) is a bilinear regression model, which can be seen as a generalization of the reduced rank regression problem. In the case where $k = q$ and $Z = I_{q}$ , the model simplifies to the traditional reduced rank regression problem, which has been well-studied in the literature, such as [32,33]. However, in this paper, we consider the more general case where the matrix Z contains additional explanatory variables.

The low-rank assumption in this model can be interpreted as indicating the presence of a latent process that affects the response data, not only through the “between-individuals” structure of the model but also through the “within-individuals” structure. This model is often referred to as the growth curve model or the generalized multivariate analysis model (GMANOVA) and has been studied in depth in the literature, such as [2].

Assumption 1.

There is a known constant $C < + \infty$ such that ${∥ X M^{*} Z ∥}_{\infty} \leq C .$

Under Assumption 1, it is not sensible to return predictions $X M Z$ with entries outside the interval $[- C, C]$ . However, for computational reasons, it is extremely convenient to employ an unbounded prior for M. Therefore, we propose to use an unbounded prior distribution for M but to use, as a predictor, a truncated version of $X M Z$ rather than $X M Z$ itself. For a matrix A, let

Π_{C} (A) = arg min_{{∥ B ∥}_{\infty} \leq C} {∥ A - B ∥}_{F}

be the orthogonal projection of A onto the set of matrices with entries bounded by C in absolute value. Note that $Π_{C} (A)$ is simply obtained by replacing entries of A larger than C by C, and entries smaller than $- C$ by $- C$ .
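As a small illustration of this truncation, the projection amounts to entrywise clipping. The sketch below assumes NumPy, and the function name `project_sup_norm` is hypothetical, introduced only for this example.

```python
import numpy as np

def project_sup_norm(A, C):
    """Entrywise truncation of A to [-C, C], i.e. the projection Pi_C(A)."""
    return np.clip(A, -C, C)

# Entries above C are replaced by C, entries below -C by -C, the rest are unchanged.
A = np.array([[2.5, -0.3], [-4.0, 1.0]])
P = project_sup_norm(A, C=1.0)
```

Because the Frobenius distance decomposes over entries, clipping each entry separately is enough to obtain the minimizer over the sup-norm ball.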

For a matrix $M \in R^{p \times k}$ , we denote by $r (M)$ the empirical risk of M as

r (M) = \frac{1}{n q} {∥ Y - Π_{C} (X M Z) ∥}_{F}^{2}

and its expectation is denoted by

R (M) = E [r (M)] = E [{(Y_{11} - {(Π_{C} (X M Z))}_{11})}^{2}] .

The focus of our work in this paper is on the predictive aspects of the model, that is, a matrix M predicts almost as well as $M^{*}$ if $R (M) - R (M^{*})$ is small. Under the assumption that $E_{i j}$ has a finite variance, using the Pythagorean theorem, we have

R (M) - R (M^{*}) = \frac{1}{n q} {∥ Π_{C} (X M Z) - X M^{*} Z ∥}_{F}^{2}

(2)

for any M, which means that our results can also be interpreted in terms of the Frobenius norm.
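The Pythagorean step behind Equation (2) can be spelled out as follows. This is a sketch under the stated assumptions (X and Z fixed, noise entries with zero mean and finite variance), using $Π_{C} (X M^{*} Z) = X M^{*} Z$ from Assumption 1. To avoid a clash with the noise matrix E, the expectation is written $\mathbb{E}$ here, and the shorthands $P = Π_{C} (X M Z)$ and $P^{*} = X M^{*} Z$ are introduced only for this derivation.

```latex
\begin{aligned}
n q \, R(M)
  &= \mathbb{E} \, \| Y - P \|_F^2
   = \mathbb{E} \, \| E + (P^{*} - P) \|_F^2 \\
  &= \mathbb{E} \, \| E \|_F^2
     + 2 \, \langle \mathbb{E}[E], \, P^{*} - P \rangle
     + \| P^{*} - P \|_F^2 \\
  &= \mathbb{E} \, \| E \|_F^2 + \| P - P^{*} \|_F^2 ,
  \qquad \text{since } \mathbb{E}[E] = 0 . \\
\text{Taking } M = M^{*}: \quad
n q \, R(M^{*}) &= \mathbb{E} \, \| E \|_F^2 ,
\quad \text{hence} \quad
R(M) - R(M^{*}) = \frac{1}{n q} \, \| P - P^{*} \|_F^2 .
\end{aligned}
```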

Let $π$ be a prior distribution on $R^{p \times k}$ (see Section 2.2). For any $λ > 0$ , we define the quasi-posterior

{\hat{ρ}}_{λ} (d M) \propto exp (- λ r (M)) π (d M) .

It is worth noting that for the specific choice $λ = n q / (2 σ^{2})$ , the quasi-posterior coincides with the posterior obtained when the noise terms $E_{i j}$ are assumed to be Gaussian with mean 0 and variance $σ^{2}$ . However, our theoretical results hold under a more general class of noise distributions. It is known that a small enough $λ$ is sufficient when the model is misspecified [14]. Additionally, even in the case of Gaussian noise, in high-dimensional settings, a smaller value of $λ$ than $n q / (2 σ^{2})$ leads to better adaptation properties [30,34]. The precise choice of $λ$ in our method will be further discussed below.

We consider the following posterior mean of $X M Z$ , given by

{\hat{X M Z}}_{λ} = \int Π_{C} (X M Z) {\hat{ρ}}_{λ} (d M) .

(3)

It is worth noting that, in the simulation experiments, for reasonable values of C the Monte Carlo algorithm never samples matrices M such that $Π_{C} (X M Z) \neq X M Z$ . In other words, the boundedness constraint has very little impact in practice, and it is mainly necessary for the technical proofs. If one is interested in obtaining an estimator of $M^{*}$ instead of an estimator of $X M^{*} Z$ , when $X^{⊺} X$ and $Z Z^{⊺}$ are invertible, one can consider the estimator ${\hat{M}}_{λ} = {(X^{⊺} X)}^{- 1} X^{⊺} {\hat{X M Z}}_{λ} Z^{⊺} {(Z Z^{⊺})}^{- 1}$ and note that $X {\hat{M}}_{λ} Z = {\hat{X M Z}}_{λ}$ . This estimator is practical whenever inverting $X^{⊺} X$ and $Z Z^{⊺}$ is computationally feasible.
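The recovery of a coefficient estimate from a fitted predictor can be sketched numerically. The example below is only an illustration: it assumes NumPy, uses a synthetic predictor of the exact form $X M Z$ (that is, it assumes the truncation is inactive), and replaces explicit inverses by linear solves.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k, q = 50, 4, 3, 6
X = rng.standard_normal((n, p))
Z = rng.standard_normal((k, q))
M = rng.standard_normal((p, k))
fitted = X @ M @ Z  # stands in for the posterior-mean predictor

# M_hat = (X'X)^{-1} X' fitted Z' (ZZ')^{-1}, computed with solves.
left = np.linalg.solve(X.T @ X, X.T @ fitted)        # shape (p, q)
M_hat = np.linalg.solve(Z @ Z.T, (left @ Z.T).T).T   # shape (p, k); uses symmetry of ZZ'
```

In this noiseless construction `M_hat` reproduces `M`, and `X @ M_hat @ Z` reproduces `fitted`, which matches the identity stated above.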

The quasi-posterior distribution investigated in this paper is often referred to as the “Gibbs posterior” in the PAC-Bayes approach. This terminology is used in the literature such as [17,18,19,35,36]. Additionally, the estimator ${\hat{M}}_{λ}$ is sometimes referred to as the Gibbs estimator or the exponentially weighted aggregate (EWA) in the literature, such as [34,37,38].

2.2. Prior Specification

We consider, in this paper, the following spectral scaled Student prior distribution, with parameter $τ > 0$ :

\begin{matrix} π (M) \propto det {(τ^{2} I_{p} + M M^{⊺})}^{- (p + k + 2) / 2} . \end{matrix}

(4)

This prior distribution is designed to promote low-rankness by placing more probability mass on matrices whose singular values are mostly close to zero, that is, on matrices with an approximately sparse spectrum. This prior has a similar form to the one used in related works such as [39,40], and it is known to lead to good performance in different problems, such as high-dimensional regression and image denoising. It can be verified that

π (M) \propto \prod_{j = 1}^{p} {(τ^{2} + s_{j} {(M)}^{2})}^{- (p + k + 2) / 2},

where $s_{j} (M)$ denotes the jth largest singular value of M, with the convention $s_{j} (M) = 0$ for $j > min (p, k)$ . Each factor in the above expression is a scaled Student density evaluated at $s_{j} (M)$ , which can be seen as a way to induce approximate sparsity on $s_{j} (M)$ [30]. The log-sum function $\sum_{j = 1}^{p} log (τ^{2} + s_{j} {(M)}^{2})$ used by [39,40] is also known to enforce approximate sparsity on the singular values $s_{j} (M)$ . This means that under this prior, most of the $s_{j} (M)$ are close to 0, which implies that M is well approximated by a low-rank matrix. Therefore, the prior has the ability to promote the low-rankness of M.
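The equivalence between the determinant form and the singular-value form of the prior can be checked numerically. The sketch below assumes NumPy and assumes that M is $p \times k$, so that the identity matrix is $p \times p$ and the exponent is $(p + k + 2) / 2$; both function names are hypothetical.

```python
import numpy as np

def log_prior_det(M, tau):
    """Unnormalized log prior from the determinant form."""
    p, k = M.shape
    _, logdet = np.linalg.slogdet(tau**2 * np.eye(p) + M @ M.T)
    return -0.5 * (p + k + 2) * logdet

def log_prior_singular_values(M, tau):
    """Same quantity from the singular values, padded with zeros up to length p."""
    p, k = M.shape
    s = np.zeros(p)
    sv = np.linalg.svd(M, compute_uv=False)  # min(p, k) singular values
    s[:sv.size] = sv
    return -0.5 * (p + k + 2) * np.sum(np.log(tau**2 + s**2))

rng = np.random.default_rng(0)
M_tall = rng.standard_normal((5, 3))
M_wide = rng.standard_normal((3, 5))
```

The zero padding matters only when $p > k$: the extra factors then each contribute $τ^{2}$ to the determinant.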

As previously stated, this prior is not conjugate to the problem at hand. However, it is particularly convenient to implement gradient-based sampling algorithms, such as the Langevin Monte Carlo method, which will be discussed in more detail in Section 4. This is because the gradient of the log-posterior can be computed efficiently, and it allows for an efficient implementation of the LMC algorithm.
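To make the gradient computation concrete, here is a minimal sketch of the log quasi-posterior and its gradient for the bilinear model. It is an illustration rather than the paper's implementation: it assumes NumPy, ignores the truncation $Π_{C}$ (assumed inactive), takes M to be $p \times k$ with prior exponent $(p + k + 2) / 2$, and uses hypothetical function names.

```python
import numpy as np

def log_quasi_posterior(M, X, Y, Z, lam, tau):
    """Unnormalized log quasi-posterior: -lam * r(M) + log prior (no truncation)."""
    n, q = Y.shape
    p, k = M.shape
    risk = np.sum((Y - X @ M @ Z) ** 2) / (n * q)
    _, logdet = np.linalg.slogdet(tau**2 * np.eye(p) + M @ M.T)
    return -lam * risk - 0.5 * (p + k + 2) * logdet

def grad_log_quasi_posterior(M, X, Y, Z, lam, tau):
    """Gradient of log_quasi_posterior with respect to M."""
    n, q = Y.shape
    p, k = M.shape
    # d/dM of ||Y - X M Z||_F^2 / (n q) is -2 X'(Y - X M Z) Z' / (n q).
    grad_risk = -2.0 * X.T @ (Y - X @ M @ Z) @ Z.T / (n * q)
    # d/dM of log det(tau^2 I + M M') is 2 (tau^2 I + M M')^{-1} M.
    grad_prior = -(p + k + 2) * np.linalg.solve(tau**2 * np.eye(p) + M @ M.T, M)
    return -lam * grad_risk + grad_prior

rng = np.random.default_rng(1)
n, p, k, q = 8, 3, 2, 4
X = rng.standard_normal((n, p))
Z = rng.standard_normal((k, q))
M0 = rng.standard_normal((p, k))
Y = X @ M0 @ Z + rng.standard_normal((n, q))
g = grad_log_quasi_posterior(M0, X, Y, Z, lam=n * q, tau=1.0)
```

Each gradient evaluation costs a few matrix products and one $p \times p$ linear solve, which is what makes gradient-based samplers practical for this prior.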

2.3. Theoretical Results

We make a sub-exponential assumption on the noise.

Assumption 2.

The entries $E_{i, j}$ of E are independent. There exist two known constants $σ > 0$ and $ξ > 0$ such that

$\forall k \geq 2, E (| E_{i, j} |^{k}) \leq σ^{2} k! ξ^{k - 2} / 2 .$

Let us put

C_{1} = 8 (σ^{2} + C^{2}); C_{2} = 64 C max (ξ, C); τ^{*} = \sqrt{C_{1} (k + p) / (n k q {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2})} .

The statistical properties of the mean estimator are given in the following theorem, which provides a non-asymptotic analysis.

Theorem 1.

Let Assumptions 1 and 2 be satisfied. Fix the parameter $τ = τ^{*}$ in the prior. Fix $δ > 0$ and define $λ^{*} : = n q min (1 / (2 C_{2}), δ / [C_{1} (1 + δ)])$ . Then, for any $ε \in (0, 1)$ , we have, with probability at least $1 - ε$ on the sample,

$\begin{aligned} {∥ {\hat{X M Z}}_{λ^{*}} - X M^{*} Z ∥}_{F}^{2} \leq \inf_{0 \leq r \leq p k} \; \inf_{\substack{\bar{M} \in R^{p \times k} \\ rank (\bar{M}) \leq r}} \Big\{ & (1 + δ) {∥ X \bar{M} Z - X M^{*} Z ∥}_{F}^{2} \\ & + \frac{C_{1} {(1 + δ)}^{2}}{δ} \Big[ 4 r (k + p + 2) \log \Big( 1 + \frac{{∥ X ∥}_{F} {∥ Z ∥}_{F} {∥ \bar{M} ∥}_{F}}{\sqrt{C_{1}}} \sqrt{\frac{n k q}{r (k + p)}} \Big) + k + p + 2 \log \frac{2}{ε} \Big] \Big\} . \end{aligned}$

The choice of $λ = λ^{*}$ is determined by optimizing an upper bound on the risk R (as shown in the proof of this theorem). However, it is important to note that this choice may not necessarily be the best choice in practice, even though it gives a good estimate of the order of magnitude for $λ$ . To ensure optimal performance, the user can use cross-validation to properly adjust the temperature parameter. Additionally, it is worth noting that $rank (\bar{M}) \neq 0$ is not a requirement in the above formula. If $rank (\bar{M}) = 0$ , then $\bar{M} = 0$ and we interpret $0 log (1 + 0 / 0)$ as 0.
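For orientation, the constants and the theoretical tuning values can be computed directly from their definitions. The sketch below assumes NumPy; the function name `tuning_constants` is hypothetical, and the inputs `sigma`, `xi`, `C` and `delta` are the constants of Assumptions 1 and 2 and of Theorem 1, assumed known.

```python
import numpy as np

def tuning_constants(X, Z, sigma, xi, C, delta):
    """Return C1, C2, tau* and lambda* as defined before and in Theorem 1."""
    n, p = X.shape
    k, q = Z.shape
    C1 = 8.0 * (sigma**2 + C**2)
    C2 = 64.0 * C * max(xi, C)
    frob_X_sq = np.sum(X**2)  # squared Frobenius norm of X
    frob_Z_sq = np.sum(Z**2)  # squared Frobenius norm of Z
    tau_star = np.sqrt(C1 * (k + p) / (n * k * q * frob_X_sq * frob_Z_sq))
    lam_star = n * q * min(1.0 / (2.0 * C2), delta / (C1 * (1.0 + delta)))
    return C1, C2, tau_star, lam_star

# Tiny example with all-ones designs: n = 2, p = 1, k = 1, q = 2.
C1, C2, tau_star, lam_star = tuning_constants(
    np.ones((2, 1)), np.ones((1, 2)), sigma=1.0, xi=1.0, C=1.0, delta=1.0)
```

These values give an order of magnitude only; as noted above, cross-validation may yield better choices in practice.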

The proof of this theorem is based on the PAC-Bayes theory, and it is provided in Appendix A. By taking $\bar{M} = M^{*}$ , we can obtain an upper bound on the infimum, leading to the following result.

Corollary 1.

Under the same assumptions and the same $τ, λ^{*}$ as in Theorem 1, let $r^{*} = rank (M^{*})$ . Then, for any $ε \in (0, 1)$ , we have, with probability at least $1 - ε$ on the sample,

$\begin{aligned} {∥ {\hat{X M Z}}_{λ^{*}} - X M^{*} Z ∥}_{F}^{2} \leq \frac{C_{1} {(1 + δ)}^{2}}{δ} \Big[ 4 r^{*} (k + p + 2) \log \Big( 1 + \frac{{∥ X ∥}_{F} {∥ Z ∥}_{F} {∥ M^{*} ∥}_{F}}{\sqrt{C_{1}}} \sqrt{\frac{n k q}{r^{*} (k + p)}} \Big) + k + p + 2 \log \frac{2}{ε} \Big] . \end{aligned}$

Theorem 1 provides an understanding of the statistical properties of the posterior mean. However, it is also important to understand the contraction properties of the quasi-posterior distribution. In the following theorem, we aim to provide a result that demonstrates this aspect of the proposed method.

Theorem 2.

Under the assumptions for Theorem 1, let $ε_{n}$ be any sequence in $(0, 1)$ such that $ε_{n} \to 0$ when $n \to \infty$ . Define

$\begin{aligned} M_{n} = \Big\{ M \in R^{p \times k} : {∥ Π_{C} (X M Z) - X M^{*} Z ∥}_{F}^{2} \leq \inf_{0 \leq r \leq p k} \; \inf_{\substack{\bar{M} \in R^{p \times k} \\ rank (\bar{M}) \leq r}} \Big[ & (1 + δ) {∥ X \bar{M} Z - X M^{*} Z ∥}_{F}^{2} \\ & + \frac{C_{1} {(1 + δ)}^{2}}{δ} \Big( 4 r (k + p + 2) \log \Big( 1 + \frac{{∥ X ∥}_{F} {∥ Z ∥}_{F} {∥ \bar{M} ∥}_{F}}{\sqrt{C_{1}}} \sqrt{\frac{n k q}{r (k + p)}} \Big) + k + p + 2 \log \frac{2}{ε_{n}} \Big) \Big] \Big\} . \end{aligned}$

Then,

$E [P_{M \sim {\hat{ρ}}_{λ^{*}}} (M \in M_{n})] \geq 1 - ε_{n} \underset{n \to \infty}{\to} 1 .$

The proof of this theorem is provided in Appendix A.

3. Inductive Matrix Completion

3.1. Model and Method

In the context of inductive matrix completion, given two side information matrices X and Z, we assume that only a random subset $Ω$ of the entries of Y in model (1) is observed. More precisely, we assume that we observe m independent and identically distributed random pairs $(I_{1}, Y_{1}), \dots, (I_{m}, Y_{m})$ given by

Y_{i} = {(X M^{*} Z)}_{I_{i}} + E_{i}, i = 1, \dots, m

(5)

where $M^{*} \in R^{p \times k}$ is the unknown parameter matrix, expected to be low-rank, and the sample size is assumed to satisfy $m < n q$ . The noise variables $E_{i}$ are assumed to be independent with $E (E_{i}) = 0 .$ The variables $I_{i}$ are independent and identically distributed copies of a random variable $I$ having distribution $Π$ on the set $\{1, \dots, n\} \times \{1, \dots, q\}$ ; we denote $Π_{x, y} : = Π (I = (x, y))$ .

Our goal in this paper is to investigate the problem of bilinear regression and also address the case where the response matrix contains missing data, a problem known as inductive matrix completion. In particular, when $p = n, k = q$ and $X = I_{n}, Z = I_{q}$ are the identity matrices, the problem reduces to the traditional matrix completion problem, which has been well studied in the literature [26]. Similarly, when $k = q$ and $Z = I_{q}$ is the identity matrix, the problem becomes the reduced rank regression problem with incomplete response, which has also been studied in recent works, such as [23,41]. However, in the context of inductive matrix completion, we focus on the more general case where X and Z contain additional explanatory variables; in other words, we consider the side information from the n users and the q items in our model [8].

It has been acknowledged that there are two different ways to model the observed values of Y, either by including or excluding the possibility of observing the same entry multiple times. Previous studies have examined both of these methods, such as the examination of matrix completion without replacement in [25] and with replacement in [26]. Both methods have practical uses and use similar techniques for estimation. This particular study focuses on the scenario where the variables $I_{i}$ are independently and identically distributed, meaning that it is possible to observe the same entry multiple times. Additionally, it is important to note that, according to the findings presented in Section 6 of [42], the results of this study can also be applied to the scenario of sampling without replacement, as long as the sampling is performed uniformly and there is no observation noise.

We now adapt the quasi-Bayesian approach for bilinear regression of Section 2 to the context of inductive matrix completion. For a probability distribution P on $\{1, \dots, n_{1}\} \times \{1, \dots, n_{2}\}$ , we generalize the Frobenius norm by ${∥ A ∥}_{F, P}^{2} = \sum_{i, j} P [(i, j)] A_{i, j}^{2}$ ; note that when P is the uniform distribution, then ${∥ A ∥}_{F, P}^{2} = {∥ A ∥}_{F}^{2} / (n_{1} n_{2})$ .
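The generalized Frobenius norm is a weighted sum of squared entries, which the following sketch makes explicit. It assumes NumPy and represents the distribution P as an array of the same shape as A whose entries sum to one; the function name is hypothetical.

```python
import numpy as np

def weighted_frobenius_sq(A, P):
    """Squared generalized Frobenius norm: sum over (i, j) of P[i, j] * A[i, j]**2."""
    return float(np.sum(P * A**2))

A = np.array([[1.0, 2.0], [3.0, 4.0]])
P_uniform = np.full(A.shape, 1.0 / A.size)  # uniform sampling distribution
value = weighted_frobenius_sq(A, P_uniform)  # equals ||A||_F^2 / (n1 * n2) = 30 / 4
```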

For a matrix $M \in R^{p \times k}$ , we denote the empirical risk of M, $r^{'} (M)$ , and its expected risk $R^{'} (M)$ respectively as

r^{'} (M) = \frac{1}{m} \sum_{i = 1}^{m} {(Y_{i} - {(Π_{C} (X M Z))}_{I_{i}})}^{2},

R^{'} (M) = E [r^{'} (M)] = E [{(Y_{1} - {(Π_{C} (X M Z))}_{I_{1}})}^{2}] .

As in Section 2, we will focus on the predictive aspects of the model, that is, a matrix M predicts almost as well as $M^{*}$ if $R^{'} (M) - R^{'} (M^{*})$ is small. Under the assumption that $E_{i}$ has a finite variance, based on the Pythagorean theorem, we have

R^{'} (M) - R^{'} (M^{*}) = {∥ Π_{C} (X M Z) - X M^{*} Z ∥}_{F, Π}^{2}

(6)

for any M, which means that our results can also be interpreted in terms of an estimation of $M^{*}$ with respect to a generalized Frobenius norm.
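The empirical risk over observed entries can be sketched as follows. This is an illustration under assumptions not fixed by the text: NumPy, uniform sampling with replacement, Gaussian noise with standard deviation 0.1, and a hypothetical function name.

```python
import numpy as np

def empirical_risk_observed(M, X, Z, rows, cols, y, C):
    """Mean squared error over the observed (row, col) pairs, with truncated predictions."""
    pred = np.clip(X @ M @ Z, -C, C)
    return float(np.mean((y - pred[rows, cols]) ** 2))

rng = np.random.default_rng(0)
n, p, k, q, m = 20, 3, 2, 8, 60
X = rng.standard_normal((n, p))
Z = rng.standard_normal((k, q))
M_star = rng.standard_normal((p, k))
signal = X @ M_star @ Z
C = np.abs(signal).max()  # bound of Assumption 1, known here by construction

# Uniform sampling with replacement: the same entry may be drawn more than once.
rows = rng.integers(0, n, size=m)
cols = rng.integers(0, q, size=m)
y = signal[rows, cols] + 0.1 * rng.standard_normal(m)

risk_true = empirical_risk_observed(M_star, X, Z, rows, cols, y, C)
risk_zero = empirical_risk_observed(np.zeros((p, k)), X, Z, rows, cols, y, C)
```

Only the m sampled entries enter the risk, so the computation never requires the full response matrix.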

Here, the prior $π$ is the low-rank inducing prior specified in Section 2.2 above. For any $λ > 0$ , we define the quasi-posterior

{\hat{ρ}}_{λ}^{'} (d M) \propto exp (- λ r^{'} (M)) π (d M) .

We will actually specify our choice of $λ$ below.

The truncated posterior mean of $X M Z$ is given by

{\hat{X M Z}}_{λ} = \int Π_{C} (X M Z) {\hat{ρ}}_{λ}^{'} (d M) .

(7)

As in the context of bilinear regression, the truncation is needed for technical reasons and has very little impact in practice for reasonable values of C.

3.2. Theoretical Results

In this section, we derive the statistical properties of the posterior ${\hat{ρ}}_{λ}^{'}$ and the mean estimator ${\hat{X M Z}}_{λ}$ for the context of inductive matrix completion. Let us first state our assumptions on this model.

Assumption 3.

The noise variables $E_{1}, \dots, E_{m}$ are independent of $I_{1}, \dots, I_{m}$ . There exist two known constants $σ^{'} > 0$ and $ξ^{'} > 0$ such that

$\forall k \geq 2, E (| E_{i} |^{k}) \leq σ^{' 2} k! ξ^{' k - 2} / 2 .$

Assumptions 1 and 3 are both standard; they have been used in [41] for theoretical analysis of reduced rank regression and in [26] for trace regression and matrix completion.

Let us put

C_{1}^{'} = 8 (σ^{' 2} + C^{2}); C_{2}^{'} = 64 C max (ξ^{'}, C); τ^{*} = \sqrt{C_{1}^{'} (k + p) / (m k p {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2})} .

Theorem 3.

Let Assumptions 1 and 3 be satisfied. Fix the parameter $τ = τ^{*}$ in the prior. Fix $δ > 0$ and define $λ^{' *} : = m min (1 / (2 C_{2}^{'}), δ / [C_{1}^{'} (1 + δ)])$ . Then, for any $ε \in (0, 1)$ , we have, with probability at least $1 - ε$ on the sample,

$\begin{aligned} {∥ {\hat{X M Z}}_{λ^{' *}} - X M^{*} Z ∥}_{F, Π}^{2} \leq \inf_{0 \leq r \leq p k} \; \inf_{\substack{\bar{M} \in R^{p \times k} \\ rank (\bar{M}) \leq r}} \Big\{ & (1 + δ) {∥ X \bar{M} Z - X M^{*} Z ∥}_{F, Π}^{2} \\ & + \frac{C_{1}^{'} {(1 + δ)}^{2}}{δ} \cdot \frac{4 r (k + p + 2) \log \Big( 1 + \frac{{∥ X ∥}_{F} {∥ Z ∥}_{F} {∥ \bar{M} ∥}_{F}}{\sqrt{C_{1}^{'}}} \sqrt{\frac{m k p}{r (k + p)}} \Big) + k + p + 2 \log \frac{2}{ε}}{m} \Big\} . \end{aligned}$

As in the context of bilinear regression, the choices $λ = λ^{' *}$ and $τ = τ^{*}$ come from the optimization of an upper bound on the risk $R^{'}$ (in the proof of this theorem). Therefore, these choices may not necessarily be the best in practice, even though they give a good order of magnitude for tuning these parameters. The user could use cross-validation to properly tune them in practice. Note again that $rank (\bar{M}) \neq 0$ is not required in the above formula: if $rank (\bar{M}) = 0$ , then $\bar{M} = 0$ and we interpret $0 log (1 + 0 / 0)$ as 0. The proof of this theorem is provided in Appendix A. In particular, we can upper bound the infimum on $\bar{M}$ by taking $\bar{M} = M^{*}$ , which leads to the following result.

Corollary 2.

Under the assumptions of Theorem 3, let $r^{*} = rank (M^{*})$ . Put

$R_{δ, m, p, k, r^{*}, ε} : = \frac{C_{1}^{'} {(1 + δ)}^{2}}{δ} \cdot \frac{4 r^{*} (k + p + 2) \log \Big( 1 + \frac{{∥ X ∥}_{F} {∥ Z ∥}_{F} {∥ M^{*} ∥}_{F}}{\sqrt{C_{1}^{'}}} \sqrt{\frac{m k p}{r^{*} (k + p)}} \Big) + k + p + 2 \log \frac{2}{ε}}{m},$

then, with probability at least $1 - ε$ on the sample,

$\begin{matrix} {∥{\hat{X M Z}}_{λ^{' *}} - X M^{*} Z∥}_{F, Π}^{2} \leq R_{δ, m, p, k, r^{*}, ε} \end{matrix}$

and in particular, if the sampling distribution Π is uniform,

$\begin{matrix} \frac{{∥ {\hat{X M Z}}_{λ^{' *}} - X M^{*} Z ∥}_{F}^{2}}{n q} \leq R_{δ, m, p, k, r^{*}, ε} . \end{matrix}$

Remark 1.

Up to a logarithmic term, our error rate $r^{*} (k + p) / m$ is similar to the best rate known to date, derived in [8].

While Theorem 3 is about the finite sample convergence rate of the posterior mean, it is actually possible to prove that the quasi-posterior ${\hat{ρ}}_{λ}^{'}$ contracts around $M^{*}$ at the same rate.

Theorem 4.

Under the same assumptions as in Theorem 3, and with the same definitions of τ and $λ^{' *}$ , let $ε_{m}$ be any sequence in $(0, 1)$ such that $ε_{m} \to 0$ when $m \to \infty$ . Define

$\begin{aligned} Ω_{m} = \Big\{ M \in R^{p \times k} : {∥ Π_{C} (X M Z) - X M^{*} Z ∥}_{F, Π}^{2} \leq \inf_{0 \leq r \leq p k} \; \inf_{\substack{\bar{M} \in R^{p \times k} \\ rank (\bar{M}) \leq r}} \Big[ & (1 + δ) {∥ X \bar{M} Z - X M^{*} Z ∥}_{F, Π}^{2} \\ & + \frac{C_{1}^{'} {(1 + δ)}^{2}}{δ} \cdot \frac{4 r (k + p + 2) \log \Big( 1 + \frac{{∥ X ∥}_{F} {∥ Z ∥}_{F} {∥ \bar{M} ∥}_{F}}{\sqrt{C_{1}^{'}}} \sqrt{\frac{m k p}{r (k + p)}} \Big) + k + p + 2 \log \frac{2}{ε_{m}}}{m} \Big] \Big\} . \end{aligned}$

Then

$E [P_{M \sim {\hat{ρ}}_{λ^{' *}}^{'}} (M \in Ω_{m})] \geq 1 - ε_{m} \underset{m \to \infty}{\to} 1 .$

The proof of this theorem is provided in Appendix A.

4. Numerical Studies

4.1. Langevin Monte Carlo Implementation

In this section, we propose to sample from the (quasi) posterior, in Section 2 and Section 3, by a suitable version of the Langevin Monte Carlo (LMC) algorithm, a gradient-based sampling method. We propose to use a constant step-size unadjusted LMC algorithm; see [43] for more details. The algorithm is given by an initial matrix $M_{0}$ and the recursion

M_{k + 1} = M_{k} + h \nabla log {\hat{ρ}}_{λ} (M_{k}) + \sqrt{2 h} N_{k}, k = 0, 1, \dots

(8)

where $h > 0$ is the step-size, ${\hat{ρ}}_{λ}$ is the (quasi) posterior and $N_{0}, N_{1}, \dots$ are independent random matrices with independent and identically distributed standard Gaussian entries. We provide pseudo-code for LMC in Algorithm A1. For small values of the step-size h, the output of Algorithm A1, $\hat{M}$ , is very close to the integral (3) of interest. However, when h is not small enough, the sum can explode [44]. In such cases, we consider including a Metropolis–Hastings correction in the algorithm. Another possible choice is to take a smaller h and restart the algorithm; although this slows down the algorithm, we keep some control over its execution time. The Metropolis–Hastings approach, on the other hand, ensures convergence to the desired distribution; however, the algorithm is greatly slowed down by the additional acceptance/rejection step at each iteration.

Next, we propose a Metropolis–Hastings correction to the LMC algorithm. It guarantees the convergence to the (quasi) posterior, and it also provides a useful way of choosing h. More precisely, we consider the update rule in (8) as a proposal for a new candidate:

\begin{matrix} {\tilde{M}}_{k + 1} = M_{k} + h \nabla log {\hat{ρ}}_{λ} (M_{k}) + \sqrt{2 h} N_{k}, k = 0, 1, \dots, \end{matrix}

(9)

Note that the matrix ${\tilde{M}}_{k + 1}$ is normally distributed with mean $M_{k} + h \nabla log {\hat{ρ}}_{λ} (M_{k})$ and covariance equal to $2 h$ times the identity. This proposal is then accepted or rejected according to the Metropolis–Hastings rule, with acceptance probability:

A_{M A L A} : = min \{1, \frac{{\hat{ρ}}_{λ} ({\tilde{M}}_{k + 1}) q (M_{k} | {\tilde{M}}_{k + 1})}{{\hat{ρ}}_{λ} (M_{k}) q ({\tilde{M}}_{k + 1} | M_{k})}\},

(10)

where

q (x^{'} | x) \propto exp (- \frac{1}{4 h} {∥ x^{'} - x - h \nabla log {\hat{ρ}}_{λ} (x) ∥}_{F}^{2})

is the transition probability density from x to $x^{'}$ . The details of the Metropolis-adjusted Langevin algorithm (denoted by MALA) are presented in Algorithm A2. Compared to the random-walk Metropolis–Hastings, MALA usually proposes moves into regions of higher probability, which are then more likely to be accepted.

We note that the step-size h for MALA is chosen such that the acceptance rate is approximately $0.5$ following [45], while the step-size for LMC in the same setting should be smaller than the one for MALA [46].
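The two samplers of this subsection can be sketched in a few lines. This is not a transcription of Algorithms A1 and A2: it assumes NumPy, uses the standard Langevin drift along the gradient of the log target, and runs on a toy target with independent standard Gaussian entries chosen purely for illustration. All function names are hypothetical.

```python
import numpy as np

def ula_step(M, grad_log_density, h, rng):
    """One unadjusted Langevin step: drift along the gradient plus Gaussian noise."""
    noise = rng.standard_normal(M.shape)
    return M + h * grad_log_density(M) + np.sqrt(2.0 * h) * noise

def log_proposal(M_to, M_from, grad_log_density, h):
    """Log density (up to a constant) of proposing M_to when starting from M_from."""
    mean = M_from + h * grad_log_density(M_from)
    return -np.sum((M_to - mean) ** 2) / (4.0 * h)

def mala_step(M, log_density, grad_log_density, h, rng):
    """One Metropolis-adjusted Langevin step; returns (new state, accepted flag)."""
    proposal = ula_step(M, grad_log_density, h, rng)
    log_ratio = (log_density(proposal)
                 + log_proposal(M, proposal, grad_log_density, h)
                 - log_density(M)
                 - log_proposal(proposal, M, grad_log_density, h))
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return proposal, True
    return M, False

# Toy target (an assumption for illustration): independent N(0, 1) entries.
def log_density(M):
    return -0.5 * np.sum(M ** 2)

def grad_log_density(M):
    return -M

rng = np.random.default_rng(0)
M = np.zeros((2, 2))
samples, n_accept = [], 0
for _ in range(2000):
    M, accepted = mala_step(M, log_density, grad_log_density, h=0.1, rng=rng)
    n_accept += accepted
    samples.append(M)
acceptance_rate = n_accept / 2000
sample_mean = np.mean(samples, axis=0)
```

In a real run, `log_density` and `grad_log_density` would be replaced by the log quasi-posterior and its gradient, and h would be adjusted until `acceptance_rate` is near the target value mentioned above.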

4.2. Simulation Studies for Bilinear Regression

We perform some numerical studies on simulated data to assess the performance of our proposed algorithms. All simulations were conducted using the R statistical software [47].

For fixed dimensions $q = 10, k = 20$ of the data, we take $n = 100$ and $n = 1000$ to check the effect of the sample size, whereas the row dimension of the coefficient matrix is taken as $p = 10$ and $p = 100$ . The entries of the design matrices X and Z are independently simulated from the standard Gaussian $N (0, 1)$ . Then, given a matrix $M^{*}$ , we simulate the response matrix Y from model (1), where the entries of the noise matrix E are independently and identically sampled from $N (0, 1)$ . We consider the following setups for the true coefficient matrix:

Model I: The true coefficient matrix $M^{*}$ is a rank-2 matrix that is generated as $M^{*} = B_{1} B_{2}^{⊤}$ where $B_{1} \in R^{p \times 2}, B_{2} \in R^{k \times 2}$ and all entries in $B_{1}$ and $B_{2}$ are independent and identically sampled from $N (0, 1)$ .
Model II: An approximate low-rank setup is studied. This series of simulations is similar to Model I, except that the true coefficient matrix is no longer of rank 2, but it can be well approximated by a rank-2 matrix:
$M^{*} = 2 \cdot B_{1} B_{2}^{⊤} + U,$
where U is a matrix whose entries are independent and identically sampled from $N (0, 0.1)$ .
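The two setups can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the helper names (`randn`, `matmul`, `simulate`) are hypothetical, Z is taken to be of size k × q so that Y = X M* Z + E is n × q, and the second argument of $N (0, 0.1)$ is read as a variance (so the standard deviation is the square root of 0.1), which the text does not state explicitly.

```python
import random


def randn(rows, cols, rng, sd=1.0):
    """Matrix (list of rows) with independent N(0, sd^2) entries."""
    return [[rng.gauss(0.0, sd) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    bt = list(zip(*b))
    return [[sum(u * v for u, v in zip(row, col)) for col in bt] for row in a]


def simulate(n, p, k, q, model="I", noise_sd=1.0, u_sd=0.1 ** 0.5, seed=0):
    """Draw (X, Z, M*, Y) for Model I (exact rank 2) or Model II (approximate)."""
    rng = random.Random(seed)
    x = randn(n, p, rng)          # n x p design
    z = randn(k, q, rng)          # k x q design (assumed orientation)
    b1 = randn(p, 2, rng)
    b2 = randn(k, 2, rng)
    b2_t = [list(col) for col in zip(*b2)]
    low_rank = matmul(b1, b2_t)   # B1 B2^T, a p x k matrix of rank at most 2
    if model == "I":
        m_star = low_rank
    else:
        u = randn(p, k, rng, sd=u_sd)
        m_star = [[2.0 * a + b for a, b in zip(r1, r2)]
                  for r1, r2 in zip(low_rank, u)]
    e = randn(n, q, rng, sd=noise_sd)
    xmz = matmul(matmul(x, m_star), z)
    y = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(xmz, e)]
    return x, z, m_star, y
```

For example, `simulate(100, 10, 20, 10, model="II")` would produce one replication of the smallest approximate low-rank setting.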

We compare our approaches, denoted by LMC and MALA, against the (generalized) ordinary least squares estimator [2], denoted by OLS. The OLS estimator is defined as follows:

{\hat{M}}_{OLS} = {(X^{⊤} X)}^{†} X^{⊤} Y Z^{⊤} {(Z Z^{⊤})}^{†}

where $A^{†}$ denotes the Moore–Penrose inverse of matrix A. We fixed $λ = n q, τ = 1$ , and the LMC and MALA methods are initiated at the OLS estimator and are run with 10,000 iterations, where the first 1000 steps are removed as burn-in periods.

The evaluations are performed by using the mean squared estimation error (Est) and the normalized (relative) mean square error (Nmse):

Est : = ∥ \hat{M} - M^{*} ∥_{F}^{2} / (p k), Nmse : = ∥ \hat{M} - M^{*} ∥_{F}^{2} / {∥ M^{*} ∥}_{F}^{2},

and the prediction error (Pred) as

Pred : = ∥ X (\hat{M} - M^{*}) {Z ∥}_{F}^{2} / (n q),

where $\hat{M}$ here is one of the estimators for LMC, MALA or OLS. We report the averages and the standard deviation of these errors over 100 data replications.
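The three error measures can be computed directly from their definitions. The following Python sketch is illustrative only; the function names are hypothetical, matrices are stored as lists of rows, and X, M, Z are assumed to have sizes n × p, p × k and k × q.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    bt = list(zip(*b))
    return [[sum(u * v for u, v in zip(row, col)) for col in bt] for row in a]


def frob_sq(a):
    """Squared Frobenius norm."""
    return sum(v * v for row in a for v in row)


def error_metrics(m_hat, m_star, x, z):
    """Return (Est, Nmse, Pred) for an estimate m_hat of m_star."""
    p, k = len(m_star), len(m_star[0])
    n, q = len(x), len(z[0])
    diff = [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(m_hat, m_star)]
    d2 = frob_sq(diff)
    est = d2 / (p * k)
    nmse = d2 / frob_sq(m_star)
    pred = frob_sq(matmul(matmul(x, diff), z)) / (n * q)
    return est, nmse, pred
```

In the scalar case X = 2, Z = 3, M* = 1 and an estimate of 2, the three errors are 1, 1 and 36, since X (M̂ − M*) Z = 6.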

The results of our study are presented in Table 1 and Table 2. From the tables, it can be observed that our proposed methods perform similarly to the OLS method. However, the estimator obtained from the MALA algorithm often results in smaller prediction errors, particularly in high-dimensional settings. This advantage is even more pronounced when the method is applied in the context of inductive matrix completion, as discussed in the next subsection.

Table 1.

Simulation results on simulated data in Model I in bilinear regression, with fixed q = 10, k = 20, for different methods, with their standard error in parentheses over 100 replications. (Est: average of estimation errors; Pred: average of prediction errors; Nmse: average of normalized estimation errors).

		Errors	LMC	MALA	OLS
n = 100		Est	1.0053 (0.5480)	1.0342 (0.5559)	1.0052 (0.5478)
	$p = 10$	Pred	0.1138 (0.0171)	0.0985 (0.0151)	0.1014 (0.0154)
		Nmse	0.4931 (0.1178)	0.5100 (0.1207)	0.4930 (0.1178)
		Est	1.3544 (0.5867)	1.3384 (0.5836)	1.3544 (0.5867)
	$p = 100$	Pred	1.0066 (0.0430)	0.8761 (0.0756)	1.0030 (0.0424)
		Nmse	0.7049 (0.2944)	0.6963 (0.2927)	0.7049 (0.2944)
n = 1000		Est	1.0776 (0.5671)	1.0900 (0.5670)	1.0776 (0.5671)
	$p = 10$	Pred	0.0099 (0.0013)	0.0099 (0.0013)	0.0099 (0.0013)
		Nmse	0.5185 (0.1198)	0.5264 (0.1219)	0.5185 (0.1198)
		Est	0.9662 (0.3240)	0.9688 (0.3244)	0.9662 (0.3240)
	$p = 100$	Pred	0.0999 (0.0051)	0.0989 (0.0049)	0.0998 (0.0051)
		Nmse	0.4961 (0.1183)	0.4976 (0.1191)	0.4961 (0.1183)


Table 2.

Simulation results on simulated data in Model II (approximate low-rank) in bilinear regression, with fixed $q = 10, k = 20$ , for different methods, with their standard error in parentheses over 100 replications. (Est: average of estimation errors; Pred: average of prediction errors; Nmse: average of normalized estimation errors).

		Errors	LMC	MALA	OLS
n = 100		Est	4.0731 (1.828)	4.0989 (1.821)	4.0731 (1.828)
	$p = 10$	Pred	0.1090 (0.0160)	0.0969 (0.0140)	0.0987 (0.0145)
		Nmse	0.5119 (0.1226)	0.5162 (0.1241)	0.5118 (0.1226)
		Est	4.6047 (1.812)	4.6038 (1.813)	4.6047 (1.812)
	$p = 100$	Pred	1.0062 (0.0462)	1.0597 (0.0495)	1.0006 (0.0469)
		Nmse	0.5801 (0.1942)	0.5800 (0.1941)	0.5801 (0.1942)
n = 1000		Est	3.6733 (1.606)	3.6884 (1.606)	3.6733 (1.606)
	$p = 10$	Pred	0.0098 (0.0015)	0.0098 (0.0015)	0.0098 (0.0015)
		Nmse	0.4812 (0.1271)	0.4835 (0.1260)	0.4813 (0.1271)
		Est	3.9972 (1.375)	3.9986 (1.376)	3.9972 (1.375)
	$p = 100$	Pred	0.1000 (0.0043)	0.1032 (0.0057)	0.0999 (0.0043)
		Nmse	0.5013 (0.1061)	0.5014 (0.1063)	0.5013 (0.1062)


4.3. Simulation Studies for Inductive Matrix Completion

The simulation settings for inductive matrix completion are similar to the settings for bilinear regression, Section 4.2. However, after obtaining the response matrix Y, we remove uniformly at random $κ = 10 %$ and $κ = 30 %$ of the entries of Y. Here, $κ$ denotes the missing rate. We denote the response matrix with missing entries by $Y_{miss}$ .
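The masking step can be sketched as follows. This is a hedged illustration: the function name `remove_entries` is hypothetical, missing entries are encoded as `None`, and the sketch removes exactly a fraction κ of the entries (rounded to an integer count), whereas the text would also be compatible with removing each entry independently with probability κ.

```python
import random


def remove_entries(y, kappa, seed=0):
    """Return a copy of y with a fraction kappa of entries set to None."""
    rng = random.Random(seed)
    n, q = len(y), len(y[0])
    n_missing = round(kappa * n * q)
    # Sample the positions to delete uniformly at random, without replacement.
    missing = set(rng.sample(range(n * q), n_missing))
    return [[None if i * q + j in missing else y[i][j] for j in range(q)]
            for i in range(n)]
```

For a 100 × 10 response matrix and κ = 30%, this leaves 700 observed entries.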

In the context of inductive matrix completion, we only observe the response matrix with missing entries, $Y_{miss}$ , and thus we cannot construct the OLS estimator as in the case of bilinear regression. Instead, we first impute the missing entries in $Y_{miss}$ by using the R package softImpute [48], where the rank of $M^{*}$ is specified as the true rank for the matrix $Y_{miss}$ . We denote the resulting imputed matrix by $Y_{imp}$ .

We compare our approaches, denoted by LMC and MALA, against the (imputed and generalized) ordinary least squares estimator, denoted by OLS_imp. The OLS_imp estimator is defined as follows:

{\hat{M}}_{OLS_imp} = {(X^{⊤} X)}^{†} X^{⊤} Y_{imp} Z^{⊤} {(Z Z^{⊤})}^{†}

where $A^{†}$ denotes the Moore–Penrose inverse of matrix A. The LMC and MALA methods are initiated at the OLS_imp estimator and are run with 10,000 iterations, where the first 1000 steps are removed as burn-in periods.

As previously discussed in Section 4.2, we present the averages and the standard deviation of the mean squared estimation error (Est), the normalized (relative) mean square error (Nmse), and the prediction error (Pred) over 100 data replications in our results.

The results are detailed in Table 3 and Table 4. It is evident from these tables that the results obtained from our MALA method surpass those of the other methods in terms of prediction error in most of the settings considered. This advantage becomes more pronounced as the missing rate in the response matrix increases. Additionally, it is worth noting that our MALA method is robust and performs well in the approximate low-rank setting (Model II), while the OLS_imp and LMC methods do not.

Table 3.

Simulation results on simulated data in Model I in inductive matrix completion, with fixed $q = 10, k = 20$ , for different methods, with their standard error in parentheses over 100 replications. ( $κ$ is the missing rate; Est: average of estimation errors; Pred: average of prediction errors; Nmse: average of normalized estimation errors).

		Errors	LMC	MALA	OLS_imp
n = 100 $κ$ = 10%		Est	1.0559 (0.5060)	1.0803 (0.5122)	1.0559 (0.5060)
	$p = 10$	Pred	0.1028 (0.0193)	0.1082 (0.0143)	0.1020 (0.0197)
		Nmse	0.4986 (0.1116)	0.5139 (0.1197)	0.4986 (0.1116)
		Est	1.4008 (0.8555)	1.3987 (0.8542)	1.4009 (0.8555)
	$p = 100$	Pred	1.2250 (0.4568)	1.4468 (0.4137)	1.2252 (0.4570)
		Nmse	0.7148 (0.3591)	0.7136 (0.3581)	0.7148 (0.3591)
n = 100 $κ$ = 30%		Est	1.0432 (0.4963)	1.0917 (0.5085)	1.0432 (0.4963)
	$p = 10$	Pred	0.2402 (0.2705)	0.1447 (0.0204)	0.2446 (0.2780)
		Nmse	0.5242 (0.1257)	0.5538 (0.1335)	0.5242 (0.1257)
		Est	1.6242 (0.8179)	1.6224 (0.8169)	1.6242 (0.8179)
	$p = 100$	Pred	9.8879 (14.11)	10.807 (13.84)	9.8901 (14.11)
		Nmse	0.7993 (0.3340)	0.7985 (0.3334)	0.7993 (0.3340)
n = 1000 $κ$ = 10%		Est	0.9810 (0.4532)	0.9882 (0.4478)	0.9810 (0.4532)
	$p = 10$	Pred	0.0114 (0.0033)	0.0112 (0.0015)	0.0114 (0.0033)
		Nmse	0.4933 (0.1076)	0.4984 (0.1075)	0.4933 (0.1076)
		Est	1.0063 (0.3465)	1.0088 (0.3471)	1.0063 (0.3465)
	$p = 100$	Pred	0.1902 (0.1758)	0.1116 (0.0049)	0.1902 (0.1759)
		Nmse	0.5069 (0.1049)	0.5082 (0.1050)	0.5069 (0.1049)
n = 1000 $κ$ = 30%		Est	1.0110 (0.4886)	1.0223 (0.4872)	1.0110 (0.4886)
	$p = 10$	Pred	0.0539 (0.0599)	0.0141 (0.0019)	0.0540 (0.0599)
		Nmse	0.5129 (0.1030)	0.5206 (0.1043)	0.5129 (0.1030)
		Est	1.0291 (0.3567)	1.0312 (0.3555)	1.0291 (0.3567)
	$p = 100$	Pred	1.7529 (1.914)	0.1475 (0.0078)	1.7530 (1.913)
		Nmse	0.5054 (0.1055)	0.5067 (0.1053)	0.5054 (0.1055)


Table 4.

Simulation results on simulated data in Model II (approximate low-rank) in inductive matrix completion, with fixed $q = 10, k = 20$ , for different methods, with their standard error in parentheses over 100 replications. ( $κ$ is the missing rate; Est: average of estimation errors; Pred: average of prediction errors; Nmse: average of normalized estimation errors).

		Errors	LMC	MALA	OLS_imp
n = 100 $κ$ = 10%		Est	3.8319 (1.691)	3.8749 (1.719)	3.8319 (1.690)
	$p = 10$	Pred	0.1604 (0.1271)	0.1092 (0.0153)	0.1598 (0.1322)
		Nmse	0.5116 (0.1154)	0.5169 (0.1147)	0.5116 (0.1155)
		Est	5.9500 (2.834)	5.9452 (2.835)	5.9500 (2.834)
	$p = 100$	Pred	4.7640 (5.272)	4.6964 (5.515)	4.7658 (5.275)
		Nmse	0.7313 (0.3454)	0.7307 (0.3455)	0.7313 (0.3454)
n = 100 $κ$ = 30%		Est	4.1838 (1.850)	4.2535 (1.859)	4.1839 (1.850)
	$p = 10$	Pred	0.7221 (0.7562)	0.1498 (0.0183)	0.7371 (0.7741)
		Nmse	0.5182 (0.1128)	0.5283 (0.1147)	0.5182 (0.1128)
		Est	7.1589 (4.084)	7.1558 (4.083)	7.1589 (4.084)
	$p = 100$	Pred	39.899 (52.40)	40.233 (51.76)	39.908 (52.41)
		Nmse	0.8998 (0.3821)	0.8994 (0.3820)	0.8998 (0.3821)
n = 1000 $κ$ = 10%		Est	3.9618 (1.678)	3.9788 (1.677)	3.9618 (1.678)
	$p = 10$	Pred	0.0409 (0.0269)	0.0110 (0.0015)	0.0409 (0.0269)
		Nmse	0.4968 (0.1196)	0.4989 (0.1195)	0.4968 (0.1196)
		Est	4.1153 (1.295)	4.1163 (1.294)	4.1153 (1.295)
	$p = 100$	Pred	1.0250 (0.9988)	0.1135 (0.0051)	1.0250 (0.9988)
		Nmse	0.5060 (0.1096)	0.5062 (0.1096)	0.5060 (0.1096)
n = 1000 $κ$ = 30%		Est	4.1647 (1.990)	4.1836 (1.995)	4.1647 (1.990)
	$p = 10$	Pred	0.4615 (0.3497)	0.0141 (0.0017)	0.4616 (0.3498)
		Nmse	0.4905 (0.1157)	0.4933 (0.1171)	0.4905 (0.1157)
		Est	4.0578 (1.400)	4.0565 (1.397)	4.0578 (1.400)
	$p = 100$	Pred	8.5608 (6.419)	0.1538 (0.0069)	8.5609 (6.419)
		Nmse	0.4944 (0.1184)	0.4943 (0.1180)	0.4944 (0.1184)


5. Discussion and Conclusions

In this paper, we focus on the problem of bilinear regression and its extension, the problem of inductive matrix completion, where the response matrix contains missing data. We propose a novel approach that combines elements of Bayesian statistics with a quasi-likelihood method. Our proposed method first addresses the problem of bilinear regression using a quasi-Bayesian approach and then adapts this approach to the problem of inductive matrix completion. By making use of a low-rankness assumption and leveraging the powerful PAC-Bayes bound technique, we provide statistical properties for our proposed estimators and for the quasi-posteriors.

Our proposed method includes an efficient gradient-based sampling algorithm that is designed to sample from the (quasi) posterior distribution. This algorithm allows for the approximate computation of mean estimators. These methods, referred to as LMC and MALA, were tested in various simulation studies and were found to perform well when compared to the ordinary least squares method. The ability to accurately sample from the (quasi) posterior distribution and compute mean estimators makes these methods a valuable tool for data analysis and modeling.

There are still some unresolved issues that require further investigation. One of these is the presence of missing data in the covariate matrices X and Z. This can have a significant impact on the analysis and may lead to biased results. Another area that needs further exploration is the assumption of independent and identically distributed data. In some cases, this assumption may not hold, and alternative models that allow for a dispersion matrix may be needed. These are potential topics for future research.

Acknowledgments

The author would like to thank the anonymous referees for their useful comments.

Appendix A. Proofs

The main technique for our proofs is the oracle-type PAC-Bayes bounds, in the spirit of [35]. We start with a few preliminary lemmas.

Appendix A.1. Preliminary Lemmas

First, we state a version of Bernstein’s inequality from Proposition 2.9 page 24 in [49].

Lemma A1

(Bernstein’s inequality). Let $U_{1}$ , …, $U_{n}$ be independent real valued random variables. Let us assume that there are two constants v and w such that $\sum_{i = 1}^{n} E [U_{i}^{2}] \leq v$ and for all integers $k \geq 3$ , $\sum_{i = 1}^{n} E [{(U_{i})}^{k}] \leq v \frac{k! w^{k - 2}}{2} .$ Then, for any $ζ \in (0, 1 / w)$ ,

$E exp [ζ \sum_{i = 1}^{n} [U_{i} - E (U_{i})]] \leq exp (\frac{v ζ^{2}}{2 (1 - w ζ)}) .$

Another basic tool to derive the PAC-Bayes bounds is Donsker and Varadhan’s variational inequality, see Lemma 1.1.3 in Catoni [17] for a proof (among others). From now, for any $Θ \subset R^{n_{1} \times n_{2}}$ , we let $P (Θ)$ denote the set of all probability distributions on $Θ$ equipped with the Borel $σ$ -algebra. For $(μ, ν) \in P {(Θ)}^{2}$ , the Kullback–Leibler divergence is defined by $K (ν, μ) = \int log (\frac{d ν}{d μ} (θ)) ν (d θ)$ if $ν$ admits a density $\frac{d ν}{d μ}$ with respect to $μ$ , and $K (ν, μ) = + \infty$ otherwise.

Lemma A2

(Donsker and Varadhan’s variational formula). Let $μ \in P (Θ)$ . For any measurable, bounded function $h : Θ \to R$ , we have

$log \int e^{h (θ)} μ (d θ) = sup_{ρ \in P (Θ)} [\int h (θ) ρ (d θ) - K (ρ, μ)] .$

Moreover, the supremum with respect to ρ in the right-hand side is reached for the Gibbs measure $μ_{h}$ defined by its density with respect to μ

$d μ_{h} (θ) = \frac{e^{h (θ)} d μ}{\int e^{h (ϑ)} μ (d ϑ)} .$ (A1)
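Lemma A2 can be checked numerically on a finite parameter set, where integrals become sums. The Python sketch below is purely illustrative: the three-point set, the weights `mu` and the values `h` are arbitrary choices made here, and the function names are hypothetical. It evaluates the objective on the right-hand side at the Gibbs measure (A1), where it should equal the left-hand side, and at other distributions, where it should not exceed it.

```python
import math


def kl(rho, mu):
    """Kullback-Leibler divergence K(rho, mu) on a finite set."""
    return sum(r * math.log(r / m) for r, m in zip(rho, mu) if r > 0.0)


def dv_objective(h, rho, mu):
    """Integral of h under rho minus K(rho, mu)."""
    return sum(hi * ri for hi, ri in zip(h, rho)) - kl(rho, mu)


def gibbs(h, mu):
    """Gibbs measure with density proportional to exp(h) with respect to mu."""
    w = [m * math.exp(hi) for hi, m in zip(h, mu)]
    total = sum(w)
    return [wi / total for wi in w]


mu = [0.5, 0.3, 0.2]
h = [1.0, -0.5, 2.0]
lhs = math.log(sum(m * math.exp(hi) for hi, m in zip(h, mu)))
rho_star = gibbs(h, mu)
gap_at_gibbs = abs(dv_objective(h, rho_star, mu) - lhs)   # zero up to rounding
gap_at_prior = lhs - dv_objective(h, mu, mu)              # nonnegative
```

The same identity is what turns the supremum over ρ in (A4) into a log-integral with respect to the prior.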

These two lemmas are the only tools we need to prove Theorems 1 and 2. Their proofs are quite similar, with a few differences. For the sake of simplicity, we will state the common parts of the proofs as a separate result in Lemma A3. Note that the proof of this lemma will use Lemmas A1 and A2.

Lemma A3.

Under Assumptions 1 and 2, put

$α = (λ - \frac{λ^{2} C_{1}}{2 n q (1 - \frac{C_{2} λ}{n q})}) and β = (λ + \frac{λ^{2} C_{1}}{2 n q (1 - \frac{C_{2} λ}{n q})}) .$ (A2)

Then, for any $ε \in (0, 1)$ , and $λ \in (0, n q / C_{2})$ ,

$\begin{matrix} E [\int exp \{α (R (M) - R (M^{*})) + λ (- r (M) + r (M^{*})) - log [\frac{d {\hat{ρ}}_{λ}}{d π} (M)] - log \frac{2}{ε}\} {\hat{ρ}}_{λ} (d M)] \leq \frac{ε}{2} \end{matrix}$ (A3)

and

$\begin{matrix} E sup_{ρ \in P (R^{p \times k})} exp [β (- \int R d ρ + R (M^{*})) + λ (\int r d ρ - r (M^{*})) - K (ρ, π) - log \frac{2}{ε}] \leq \frac{ε}{2} . \end{matrix}$ (A4)

Proof of Lemma A3.

We prove the first inequality (A3) as follows. Fix any M with ${∥ X M Z ∥}_{\infty} \leq C$ and put

$T_{i j} = {(Y_{i j} - {(X M^{*} Z)}_{i j})}^{2} - {(Y_{i j} - {(Π_{C} (X M Z))}_{i j})}^{2} .$

Note that the random variables $T_{i j}$ with $i = 1, \dots, n; j = 1, \dots, q$ are independent by construction. We have

$\begin{matrix} \sum_{i = 1}^{n} \sum_{j = 1}^{q} E [T_{i j}^{2}] & = \sum_{i = 1}^{n} \sum_{j = 1}^{q} E [{(2 Y_{i j} - {(X M^{*} Z)}_{i j} - Π_{C} {(X M Z)}_{i j})}^{2} {({(X M^{*} Z)}_{i j} - Π_{C} {(X M Z)}_{i j})}^{2}] \\ = \sum_{i = 1}^{n} \sum_{j = 1}^{q} E [{(2 E_{i j} + {(X M^{*} Z)}_{i j} - Π_{C} {(X M Z)}_{i j})}^{2} {({(X M^{*} Z)}_{i j} - Π_{C} {(X M Z)}_{i j})}^{2}] \\ \leq \sum_{i = 1}^{n} \sum_{j = 1}^{q} E [8 [E_{i j}^{2} + C^{2}] {[{(X M^{*} Z)}_{i j} - Π_{C} {(X M Z)}_{i j}]}^{2}] \\ \leq \sum_{i = 1}^{n} \sum_{j = 1}^{q} 8 [σ^{2} + C^{2}] E {[{(X M^{*} Z)}_{i j} - {(X M Z)}_{i j}]}^{2} \\ \leq 8 n q (σ^{2} + C^{2}) [R (M) - R (M^{*})] \\ = n q C_{1} [R (M) - R (M^{*})] = : v (M, M^{*}) . \end{matrix}$

Next we have, for any integer $k \geq 3$ , that

$\begin{matrix} \sum_{i = 1}^{n} \sum_{j = 1}^{q} E [{(T_{i j})}^{k}] \\ \leq & \sum_{i = 1}^{n} \sum_{j = 1}^{q} E [{|2 Y_{i j} - {(X M^{*} Z)}_{i j} - Π_{C} {(X M Z)}_{i j}|}^{k} {|{(X M^{*} Z)}_{i j} - Π_{C} {(X M Z)}_{i j}|}^{k}] \\ \leq & \sum_{i = 1}^{n} \sum_{j = 1}^{q} E [2^{k - 1} [| 2 E_{i j} |^{k} + {(2 C)}^{k}] {|{(X M^{*} Z)}_{i j} - Π_{C} {(X M Z)}_{i j}|}^{k}] \\ \leq & \sum_{i = 1}^{n} \sum_{j = 1}^{q} E [2^{2 k - 1} (| E_{i j} |^{k} + C^{k}) {(2 C)}^{k - 2} {|{(X M^{*} Z)}_{i j} - Π_{C} {(X M Z)}_{i j}|}^{2}] \\ \leq & 2^{2 k - 1} [σ^{2} k! ξ^{k - 2} + C^{k}] {(2 C)}^{k - 2} \sum_{i = 1}^{n} \sum_{j = 1}^{q} E {|{(X M^{*} Z)}_{i j} - {(X M Z)}_{i j}|}^{2} \\ \leq & \frac{2^{3 k - 3} [σ^{2} k! ξ^{k - 2} + C^{k}] C^{k - 2}}{8 (σ^{2} + C^{2})} v (M, M^{*}) \\ \leq & \frac{2^{3 k - 6} [σ^{2} ξ^{k - 2} + C^{k}] C^{k - 2}}{(σ^{2} + C^{2})} k! v (M, M^{*}) \\ \leq & 2^{3 k - 5} [ξ^{k - 2} + C^{k - 2}] C^{k - 2} k! v (M, M^{*}) \\ \leq & 2^{3 k - 4} max {(ξ, C)}^{k - 2} C^{k - 2} k! v (M, M^{*}) \\ = & {[2^{3} max (ξ, C) C]}^{k - 2} 2^{2} k! v (M, M^{*}) \end{matrix}$

and use the fact that, for any $k \geq 3$ , $2^{2} \leq 2^{3 (k - 2)} / 2$ to obtain

$\begin{matrix} \sum_{i = 1}^{n} \sum_{j = 1}^{q} E [{(T_{i j})}^{k}] \leq \frac{{[2^{6} max (ξ, C) C]}^{k - 2} k! v (M, M^{*})}{2} = v (M, M^{*}) \frac{k! C_{2}^{k - 2}}{2} . \end{matrix}$

Thus, we can apply Lemma A1 with $U_{i} : = T_{i}$ , $v : = v (M, M^{*})$ , $w : = C_{2}$ and $ζ : = λ / n q$ . We obtain, for any $λ \in (0, n q / w) = (0, n q / C_{2})$ ,

$\begin{matrix} E exp [λ (R (M) - R (M^{*}) - r (M) + r (M^{*}))] & \leq exp [\frac{v λ^{2}}{2 {(n q)}^{2} (1 - \frac{w λ}{n q})}] \\ = exp [\frac{C_{1} [R (M) - R (M^{*})] λ^{2}}{2 n q (1 - \frac{C_{2} λ}{n q})}] . \end{matrix}$

Rearranging terms, and using the definition of $α$ in (A2),

$E exp [α (R (M) - R (M^{*})) + λ (- r (M) + r (M^{*}))] \leq 1 .$

Multiplying both sides by $ε / 2$ and then integrating with respect to the probability distribution $π (.)$ , we obtain

$\int E [exp \{α (R (M) - R (M^{*})) + λ (- r (M) + r (M^{*})) - log \frac{2}{ε}\}] π (d M) \leq \frac{ε}{2} .$

Next, Fubini’s theorem gives

$E [\int exp [α (R (M) - R (M^{*})) + λ (- r (M) + r (M^{*})) - log \frac{2}{ε}] π (d M)] \leq \frac{ε}{2} .$

and note that for any measurable function h,

$\begin{matrix} \int exp [h (M)] π (d M) = \int exp [h (M) - log \frac{d {\hat{ρ}}_{λ}}{d π} (M)] {\hat{ρ}}_{λ} (d M) \end{matrix}$

to obtain (A3).

Let us now prove (A4). Here again, we start with an application of Lemma A1, but this time with $U_{i} : = - T_{i}$ (we keep $v : = v (M, M^{*})$ , $w : = C_{2}$ and $ζ : = λ / n q$ ). We obtain, for any $λ \in (0, n q / C_{2})$ ,

$E exp [λ (r (M) - r (M^{*}) - R (M) + R (M^{*}))] \leq exp [\frac{C_{1} [R (M) - R (M^{*})] λ^{2}}{2 n q (1 - \frac{C_{2} λ}{n q})}] .$

Rearranging terms, using the definition of $β$ in (A2) and multiplying both sides by $ε / 2$ , we obtain

$E exp [β (- R (M) + R (M^{*})) + λ (r (M) - r (M^{*})) - log \frac{2}{ε}] \leq \frac{ε}{2} .$

We integrate with respect to $π$ and use Fubini to obtain

$E [\int exp [β (- R (M) + R (M^{*})) + λ (r (M) - r (M^{*})) - log \frac{2}{ε}] π (d M)] \leq \frac{ε}{2} .$

Here, we use a different argument from the proof of the first inequality: we use Lemma A2 on the integral, and this gives directly (A4). □

Finally, in both proofs, we will use quite often distributions $ρ \in P (R^{p \times k})$ that will be defined as translations of the prior $π$ . We introduce the following notation.

Definition A1.

For any matrix $\bar{M} \in R^{p \times k}$ , we define $ρ_{\bar{M}} \in P (R^{p \times k})$ by

$ρ_{\bar{M}} (M) = π (\bar{M} - M) .$

The following technical lemmas from [31] will be useful in the proofs.

Lemma A4

(Lemma 1 in [31]). We have ${\int ∥ M ∥}_{F}^{2} π (d M) \leq p k τ^{2} .$

Lemma A5

(Lemma 2 in [31]). For any $\bar{M} \in R^{p \times k}$ , we have

$K (ρ_{\bar{M}}, π) \leq 2 rank (\bar{M}) (k + p + 2) log (1 + \frac{∥ \bar{M} ∥_{F}}{τ \sqrt{2 rank (\bar{M})}})$

with the convention $0 log (1 + 0 / 0) = 0$ .

Appendix A.2. Proof of Theorem 1

Proof of Theorem 1.

An application of Jensen’s inequality on inequality (A3) yields

$E exp [α (\int R d {\hat{ρ}}_{λ} - R (M^{*})) + λ (- \int r d {\hat{ρ}}_{λ} + r (M^{*})) - K ({\hat{ρ}}_{λ}, π) - log \frac{2}{ε}] \leq \frac{ε}{2} .$

Using the standard Chernoff’s trick to transform an exponential moment inequality into a deviation inequality, that is: $exp (x) \geq 1_{R_{+}} (x)$ , we obtain

$\begin{matrix} P \{[α (\int R d {\hat{ρ}}_{λ} - R (M^{*})) + λ (- \int r d {\hat{ρ}}_{λ} + r (M^{*})) - K ({\hat{ρ}}_{λ}, π) - log \frac{2}{ε}] \geq 0\} \leq \frac{ε}{2} . \end{matrix}$ (A5)

Using (2) we have

$\begin{matrix} \int R d {\hat{ρ}}_{λ} - R (M^{*}) & = \frac{1}{n q} \int {∥Π_{C} (X M Z) - X M^{*} Z∥}_{F}^{2} {\hat{ρ}}_{λ} (d M) \\ \geq \frac{1}{n q} {∥\int Π_{C} (X M Z) {\hat{ρ}}_{λ} (d M) - X M^{*} Z∥}_{F}^{2} \\ \geq \frac{1}{n q} {∥{\hat{X M Z}}_{λ} - X M^{*} Z∥}_{F}^{2} \end{matrix}$

where we used Jensen’s inequality in the second line, and the definition of ${\hat{X M Z}}_{λ}$ from the second to the third line. Plugging this into our probability bound (A5), and dividing both sides by $α$ , we obtain

$P \{\frac{1}{n q} {∥{\hat{X M Z}}_{λ} - X M^{*} Z∥}_{F}^{2} \leq \frac{\int r d {\hat{ρ}}_{λ} - r (M^{*}) + \frac{1}{λ} [K ({\hat{ρ}}_{λ}, π) + log \frac{2}{ε}]}{\frac{α}{λ}}\} \geq 1 - \frac{ε}{2}$

under the additional condition that $λ$ is such that $α > 0$ , which we will assume from now (note that this is satisfied by $λ^{*}$ ). Using Lemma A2, we can rewrite this as

$P \{\frac{1}{n q} {∥{\hat{X M Z}}_{λ} - X M^{*} Z∥}_{F}^{2} \leq inf_{ρ \in P (R^{p \times k})} \frac{\int r d ρ - r (M^{*}) + \frac{1}{λ} [K (ρ, π) + log \frac{2}{ε}]}{\frac{α}{λ}}\} \geq 1 - \frac{ε}{2} .$ (A6)

We consider now the consequences of the second inequality in Lemma A3, that is (A4). With Chernoff’s trick and rearranging terms a little, we obtain

$\begin{matrix} P \{\forall ρ \in P (R^{p \times k}), \int r d ρ - r (M^{*}) \leq \frac{β}{λ} [\int R d ρ - R (M^{*})] + \frac{1}{λ} [K (ρ, π) + log \frac{2}{ε}]\} \geq 1 - \frac{ε}{2} . \end{matrix}$

which we can rewrite as, $\forall ρ \in P (R^{p \times k}),$ with probability at least $1 - \frac{ε}{2}$ ,

$\begin{matrix} \int r d ρ - r (M^{*}) \leq \frac{β}{λ} \int \frac{1}{n q} {∥ Π_{C} (X M Z) - X M^{*} Z ∥}_{F}^{2} ρ (d M) + \frac{1}{λ} [K (ρ, π) + log \frac{2}{ε}] . \end{matrix}$ (A7)

Combining (A7) and (A6) with a union bound argument gives the following bound, with probability of at least $1 - ε$ ,

$\begin{matrix} \frac{1}{n q} {∥{\hat{X M Z}}_{λ} - X M^{*} Z∥}_{F}^{2} \\ \leq inf_{ρ \in P (R^{p \times k})} \frac{β \int \frac{1}{n q} {∥ Π_{C} (X M Z) - X M^{*} Z ∥}_{F}^{2} ρ (d M) + 2 [K (ρ, π) + log \frac{2}{ε}]}{α} . \end{matrix}$

Noting that, for any $(i, j)$ , ${(X M^{*} Z)}_{i, j} \in [- C, C]$ implies that

$| {(Π_{C} (X M Z))}_{i, j} - {(X M^{*} Z)}_{i, j} | \leq | {(X M Z)}_{i, j} - {(X M^{*} Z)}_{i, j} |$

and thus

$\begin{matrix} \frac{1}{n q} {∥{\hat{X M Z}}_{λ} - X M^{*} Z∥}_{F}^{2} \leq inf_{ρ \in P (R^{p \times k})} \frac{β \int \frac{1}{n q} {∥ X M Z - X M^{*} Z ∥}_{F}^{2} ρ (d M) + 2 [K (ρ, π) + log \frac{2}{ε}]}{α} . \end{matrix}$
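The entrywise inequality used in this step says that truncation never moves a value away from a target that already lies in [−C, C]. A small Python check is sketched below, assuming that $Π_{C}$ acts entrywise as truncation to [−C, C]; the function names and the test grids are hypothetical.

```python
def clip(value, c):
    """Truncate a scalar to the interval [-c, c]."""
    return max(-c, min(c, value))


def clipping_is_contraction(c, values, targets):
    """Check |clip(a) - b| <= |a - b| for all a in values and b in targets within [-c, c]."""
    return all(abs(clip(a, c) - b) <= abs(a - b)
               for a in values
               for b in targets
               if -c <= b <= c)


values = [x / 4.0 for x in range(-40, 41)]    # grid on [-10, 10]
targets = [x / 10.0 for x in range(-20, 21)]  # grid on [-2, 2]
holds = clipping_is_contraction(2.0, values, targets)
```

Summing the squared entrywise inequality over all (i, j) is what allows $Π_{C} (X M Z)$ to be replaced by $X M Z$ inside the Frobenius norm.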

The end of the proof consists in making the right-hand side in the inequality more explicit. In order to do so, we restrict the infimum bound above to the distributions given by Definition A1:

$\begin{matrix} P \{\frac{1}{n q} {∥{\hat{X M Z}}_{λ} - X M^{*} Z∥}_{F}^{2} \\ \leq inf_{\bar{M} \in R^{p \times k}} \frac{β \int \frac{1}{n q} {∥ X M Z - X M^{*} Z ∥}_{F}^{2} ρ_{\bar{M}} (d M) + 2 [K (ρ_{\bar{M}}, π) + log \frac{2}{ε}]}{α}\} \geq 1 - ε . \end{matrix}$ (A8)

We see immediately that Dalalyan’s lemma will be extremely useful for that. First, Lemma A5 provides an upper bound on $K (ρ_{\bar{M}}, π)$ . Moreover,

$\begin{matrix} \int ∥ X M Z - X M^{*} {Z ∥}_{F}^{2} ρ_{\bar{M}} (d M) \\ \leq \int ∥ X \bar{M} Z - X M^{*} {Z - X M Z ∥}_{F}^{2} π (d M) \\ = ∥ X \bar{M} Z - X M^{*} {Z ∥}_{F}^{2} - 2 \int \sum_{i, j} {(X \bar{M} Z - X M^{*} Z)}_{j, i} {(X M Z)}_{i, j} π (d M) + \int {∥ X M Z ∥}_{F}^{2} π (d M) . \end{matrix}$

The second term in the right-hand side is null because $π$ is centered, and thus

$\begin{matrix} \int ∥ X M Z - X M^{*} {Z ∥}_{F}^{2} ρ_{\bar{M}} (d M) & \leq ∥ X \bar{M} Z - X M^{*} {Z ∥}_{F}^{2} + \int {∥ X M Z ∥}_{F}^{2} π (d M) \\ \leq ∥ X \bar{M} Z - X M^{*} {Z ∥}_{F}^{2} + {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2} \int {∥ M ∥}_{F}^{2} π (d M) \\ \leq ∥ X \bar{M} Z - X M^{*} {Z ∥}_{F}^{2} + {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2} p k τ^{2} \end{matrix}$

where we used elementary properties of the Frobenius norm, and Lemma A4 in the last line. We can now plug this (and Lemma A5) back into (A8) to obtain

$\begin{matrix} P \{\frac{1}{n q} {∥{\hat{X M Z}}_{λ} - X M^{*} Z∥}_{F}^{2} \leq inf_{\bar{M} \in R^{p \times k}} [\frac{β}{α} \frac{1}{n q} {∥ X \bar{M} Z - X M^{*} Z ∥}_{F}^{2} + \frac{β}{α} \frac{1}{n q} {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2} p k τ^{2} \\ + \frac{1}{α} (4 rank (\bar{M}) (k + p + 2) log (1 + \frac{∥ \bar{M} ∥_{F}}{τ \sqrt{2 rank (\bar{M})}}) + 2 log \frac{2}{ε})]\} \geq 1 - ε . \end{matrix}$

We now make the constants explicit. First, if $λ \leq n q / (2 C_{2})$ , then $2 n q (1 - C_{2} λ / n q) \geq n q$ and thus

$\frac{β}{α} = \frac{1 + \frac{λ C_{1}}{2 n q (1 - \frac{C_{2} λ}{n q})}}{1 - \frac{λ C_{1}}{2 n q (1 - \frac{C_{2} λ}{n q})}} \leq \frac{1 + \frac{λ C_{1}}{n q}}{1 - \frac{λ C_{1}}{n q}} .$

Then, $λ \leq \frac{n q δ}{C_{1} (1 + δ)}$ leads to $\frac{β}{α} \leq (1 + δ) .$

Note that $λ^{*} = n q min (1 / (2 C_{2}), δ / [C_{1} (1 + δ)])$ satisfies these two conditions, so from now, $λ = λ^{*}$ . We also use the following:

$\begin{matrix} \frac{1}{α} = \frac{1}{λ^{*} (1 - \frac{λ^{*} C_{1}}{2 n q (1 - C_{2} λ^{*} / n q)})} \leq \frac{β}{λ^{*} α} \leq \frac{(1 + δ)}{n q min (1 / (2 C_{2}), δ / [C_{1} (1 + δ)])} \leq \frac{C_{1} {(1 + δ)}^{2}}{n q δ} . \end{matrix}$

So far the bound is:

$\begin{matrix} P \{\frac{1}{n q} {∥{\hat{X M Z}}_{λ^{*}} - X M^{*} Z∥}_{F}^{2} \leq inf_{\bar{M} \in R^{p \times k}} [\frac{(1 + δ)}{n q} {∥ X \bar{M} Z - X M^{*} Z ∥}_{F}^{2} + \\ \frac{(1 + δ)}{n q} {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2} p k τ^{2} + \\ \frac{C_{1} {(1 + δ)}^{2} (4 rank (\bar{M}) (k + p + 2) log (1 + \frac{∥ \bar{M} ∥_{F}}{τ \sqrt{2 rank (\bar{M})}}) + 2 log \frac{2}{ε})}{n q δ}]\} \geq 1 - ε . \end{matrix}$

In particular, with probability at least $1 - ε$ , the choice $τ^{2} = C_{1} (k + p) / (n k q {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2})$ gives

$\begin{matrix} \frac{1}{n q} {∥{\hat{X M Z}}_{λ^{*}} - X M^{*} Z∥}_{F}^{2} \leq inf_{\bar{M} \in R^{p \times k}} [\frac{(1 + δ)}{n q} {∥ X \bar{M} Z - X M^{*} Z ∥}_{F}^{2} + \frac{C_{1} (1 + δ) (k + p)}{n q} + \\ \frac{C_{1} {(1 + δ)}^{2} (4 rank (\bar{M}) (k + p + 2) log (1 + \frac{{∥ X ∥}_{F} {∥ Z ∥}_{F} {∥ \bar{M} ∥}_{F}}{\sqrt{C_{1}}} \sqrt{\frac{n k q}{(k + p) rank (\bar{M})}}) + 2 log \frac{2}{ε})}{n q δ}] . \end{matrix}$

□

Appendix A.3. Proof of Theorem 2

Proof of Theorem 2.

We also start with an application of Lemma A3, and focus on (A3), applied to $ε : = ε_{n}$ , that is:

$\begin{matrix} E [\int exp \{α (R (M) - R (M^{*})) + λ (- r (M) + r (M^{*})) - log [\frac{d {\hat{ρ}}_{λ}}{d π} (M)] - log \frac{2}{ε_{n}}\} {\hat{ρ}}_{λ} (d M)] \leq \frac{ε_{n}}{2} . \end{matrix}$

Using Chernoff’s trick, this gives

$E [P_{M \sim {\hat{ρ}}_{λ}} (M \in A_{n})] \geq 1 - \frac{ε_{n}}{2}$

where

$A_{n} = \{M : α (R (M) - R (M^{*})) + λ (- r (M) + r (M^{*})) \leq log [\frac{d {\hat{ρ}}_{λ}}{d π} (M)] + log \frac{2}{ε_{n}}\} .$

Using the definition of ${\hat{ρ}}_{λ}$ , for $M \in A_{n}$ we have

$\begin{matrix} α (R (M) - R (M^{*})) & \leq λ (r (M) - r (M^{*})) + log [\frac{d {\hat{ρ}}_{λ}}{d π} (M)] + log \frac{2}{ε_{n}} \\ \leq - log \int exp [- λ r (M)] π (d M) - λ r (M^{*}) + log \frac{2}{ε_{n}} \\ = λ (\int r (M) {\hat{ρ}}_{λ} (d M) - r (M^{*})) + K ({\hat{ρ}}_{λ}, π) + log \frac{2}{ε_{n}} \\ = inf_{ρ} \{λ (\int r (M) ρ (d M) - r (M^{*})) + K (ρ, π) + log \frac{2}{ε_{n}}\} . \end{matrix}$

Now, let us define

$B_{n} = \{\forall ρ : β (- \int R d ρ + R (M^{*})) + λ (\int r d ρ - r (M^{*})) \leq K (ρ, π) + log \frac{2}{ε_{n}}\} .$

Using (A4), we have that

$E [1_{B_{n}}] \geq 1 - \frac{ε_{n}}{2} .$

We will now prove that, if $λ$ is such that $α > 0$ ,

$E [P_{M \sim {\hat{ρ}}_{λ}} (M \in M_{n})] \geq E [P_{M \sim {\hat{ρ}}_{λ}} (M \in A_{n}) 1_{B_{n}}]$

which, together with

$\begin{matrix} E [P_{M \sim {\hat{ρ}}_{λ}} (M \in A_{n}) 1_{B_{n}}] & = E [(1 - P_{M \sim {\hat{ρ}}_{λ}} (M \notin A_{n})) (1 - 1_{B_{n}^{c}})] \\ \geq E [1 - P_{M \sim {\hat{ρ}}_{λ}} (M \notin A_{n}) - 1_{B_{n}^{c}}] \\ \geq 1 - ε_{n} \end{matrix}$

will bring

$E [P_{M \sim {\hat{ρ}}_{λ}} (M \in M_{n})] \geq 1 - ε_{n} .$

In order to do so, assume that we are on the set $B_{n}$ , and let $M \in A_{n}$ . Then,

$\begin{matrix} α (R (M) - R (M^{*})) & \leq inf_{ρ} \{λ (\int r (M) ρ (d M) - r (M^{*})) + K (ρ, π) + log \frac{2}{ε_{n}}\} \\ \leq inf_{ρ} \{β (\int R (M) ρ (d M) - R (M^{*})) + 2 K (ρ, π) + 2 log \frac{2}{ε_{n}}\} \end{matrix}$

that is,

$R (M) - R (M^{*}) \leq inf_{ρ \in P (R^{p \times k})} \frac{β [\int R d ρ - R (M^{*})] + 2 [K (ρ, π) + log \frac{2}{ε_{n}}]}{α}$

or, rewriting it in terms of norms,

${∥Π_{C} (X M Z) - X M^{*} Z∥}_{F}^{2} \leq inf_{\bar{M} \in R^{p \times k}} \frac{β \int ∥ X M Z - X M^{*} {Z ∥}_{F}^{2} ρ_{\bar{M}} (d M) + 2 [K (ρ_{\bar{M}}, π) + log \frac{2}{ε_{n}}]}{α} .$

We upper-bound the right-hand side exactly as in the proof of Theorem 1, which gives $M \in M_{n}$ . □

Appendix A.4. Proof of Theorem 3

Lemma A6.

Under Assumptions 1 and 3, put

$α^{'} = (λ - \frac{λ^{2} C_{1}^{'}}{2 m (1 - \frac{C_{2}^{'} λ}{m})}) and β^{'} = (λ + \frac{λ^{2} C_{1}^{'}}{2 m (1 - \frac{C_{2}^{'} λ}{m})}) .$ (A9)

Then for any $ε \in (0, 1)$ , and $λ \in (0, m / C_{2}^{'})$ ,

$\begin{matrix} E [\int exp {α^{'} (R^{'} (M) - R^{'} (M^{*})) + λ (- r^{'} (M) + r^{'} (M^{*})) - \\ log [\frac{d {\hat{ρ}}_{λ}^{'}}{d π} (M)] - log \frac{2}{ε}} {\hat{ρ}}_{λ}^{'} (d M)] \leq \frac{ε}{2} \end{matrix}$ (A10)

and

$\begin{matrix} E sup_{ρ \in P (R^{p \times k})} exp [β^{'} (- \int R^{'} d ρ + R^{'} (M^{*})) + λ (\int r^{'} d ρ - r^{'} (M^{*})) - K (ρ, π) - log \frac{2}{ε}] \leq \frac{ε}{2} . \end{matrix}$ (A11)

Proof of Lemma A6.

The inequality (A10) is proved in a similar way to the proof of Lemma A3. That is, we apply Lemma A1 to the following independent random variables

$V_{i} = {(Y_{i} - {(X M^{*} Z)}_{i})}^{2} - {(Y_{i} - {(Π_{C} (X M Z))}_{i})}^{2}, i = 1, \dots, m .$

The proof of the inequality (A11) proceeds as in the proof of Lemma A3: we apply Lemma A1 to the independent random variables $- V_{i}, i = 1, \dots, m .$ □

Proof of Theorem 3.

We proceed as in the proof of Theorem 1 up to (A6). Using (6), we have

$\begin{matrix} \int R^{'} d {\hat{ρ}}_{λ} - R^{'} (M^{*}) & = \int {∥Π_{C} (X M Z) - X M^{*} Z∥}_{F, Π}^{2} {\hat{ρ}}_{λ}^{'} (d M) \\ \geq {∥\int Π_{C} (X M Z) {\hat{ρ}}_{λ}^{'} (d M) - X M^{*} Z∥}_{F, Π}^{2} \\ \geq {∥{\hat{X M Z}}_{λ} - X M^{*} Z∥}_{F, Π}^{2}, \end{matrix}$

and thus, we obtain

$P \{{∥{\hat{X M Z}}_{λ} - X M^{*} Z∥}_{F, Π}^{2} \leq inf_{ρ \in P (R^{p \times k})} \frac{\int r^{'} d ρ - r^{'} (M^{*}) + \frac{1}{λ} [K (ρ, π) + log \frac{2}{ε}]}{\frac{α^{'}}{λ}}\} \geq 1 - \frac{ε}{2} .$ (A12)

We consider now the consequences of inequality (A11) in Lemma A6. With Chernoff’s trick and rearranging terms a little, we obtain $\forall ρ \in P (R^{p \times k})$ , with probability at least $1 - \frac{ε}{2}$ ,

$\begin{matrix} \int r^{'} d ρ - r^{'} (M^{*}) \leq \frac{β^{'}}{λ} \int {∥ Π_{C} (X M Z) - X M^{*} Z ∥}_{F, Π}^{2} ρ (d M) + \frac{1}{λ} [K (ρ, π) + log \frac{2}{ε}] . \end{matrix}$ (A13)

We combine (A13) and (A12) with a union bound argument and note that, for any $(i, j)$ , ${(X M^{*} Z)}_{i, j} \in [- C, C]$ implies that $| {(Π_{C} (X M Z))}_{i, j} - {(X M^{*} Z)}_{i, j} | \leq | {(X M Z)}_{i, j} - {(X M^{*} Z)}_{i, j} |$ . This gives

$\begin{matrix} P \{{∥{\hat{X M Z}}_{λ} - X M^{*} Z∥}_{F, Π}^{2} \\ \leq inf_{ρ \in P (R^{p \times k})} \frac{β^{'} \int {∥ X M Z - X M^{*} Z ∥}_{F, Π}^{2} ρ (d M) + 2 [K (ρ, π) + log \frac{2}{ε}]}{α^{'}}\} \geq 1 - ε . \end{matrix}$

We are now making the right-hand side in the inequality more explicit. In order to do so, we restrict the infimum bound above to the distributions given by Definition A1:

$P \{ {∥{\hat{X M Z}}_{λ} - X M^{*} Z∥}_{F, Π}^{2} \leq inf_{\bar{M} \in R^{p \times k}} \frac{β^{'} \int {∥ X M Z - X M^{*} Z ∥}_{F, Π}^{2} ρ_{\bar{M}} (d M) + 2 [K (ρ_{\bar{M}}, π) + log \frac{2}{ε}]}{α^{'}} \} \geq 1 - ε .$ (A14)

We see immediately that Dalalyan’s lemma will be extremely useful for that. First, Lemma A5 provides an upper bound on $K (ρ_{\bar{M}}, π)$ .

Moreover,

$\begin{matrix} \int {∥ X M Z - X M^{*} Z ∥}_{F, Π}^{2} ρ_{\bar{M}} (d M) \\ \leq \int {∥ X \bar{M} Z - X M^{*} Z - X M Z ∥}_{F, Π}^{2} π (d M) \\ = {∥ X \bar{M} Z - X M^{*} Z ∥}_{F, Π}^{2} - 2 \int \sum_{i, j} Π_{i, j} {(X \bar{M} Z - X M^{*} Z)}_{i, j} {(X M Z)}_{i, j} π (d M) \\ + \int {∥ X M Z ∥}_{F, Π}^{2} π (d M) . \end{matrix}$

The second term on the right-hand side above vanishes because $π$ is centered, and thus

$\begin{matrix} \int {∥ X M Z - X M^{*} Z ∥}_{F, Π}^{2} ρ_{\bar{M}} (d M) \\ \leq {∥ X \bar{M} Z - X M^{*} Z ∥}_{F, Π}^{2} + \int {∥ X M Z ∥}_{F, Π}^{2} π (d M) \\ \leq {∥ X \bar{M} Z - X M^{*} Z ∥}_{F, Π}^{2} + \int {∥ X M Z ∥}_{F}^{2} π (d M) \\ \leq {∥ X \bar{M} Z - X M^{*} Z ∥}_{F, Π}^{2} + {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2} \int {∥ M ∥}_{F}^{2} π (d M) \\ \leq {∥ X \bar{M} Z - X M^{*} Z ∥}_{F, Π}^{2} + {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2} p k τ^{2} , \end{matrix}$

where we used the elementary properties of the Frobenius norm, and Lemma A4 in the last line. We can now plug this (and Lemma A5) back into (A14) to obtain:

$\begin{matrix} P \{ {∥{\hat{X M Z}}_{λ} - X M^{*} Z∥}_{F, Π}^{2} \leq inf_{\bar{M} \in R^{p \times k}} [ \frac{β^{'}}{α^{'}} {∥ X \bar{M} Z - X M^{*} Z ∥}_{F, Π}^{2} + \frac{β^{'}}{α^{'}} {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2} p k τ^{2} \\ + \frac{1}{α^{'}} (4 rank (\bar{M}) (k + p + 2) log (1 + \frac{{∥ \bar{M} ∥}_{F}}{τ \sqrt{2 rank (\bar{M})}}) + 2 log \frac{2}{ε}) ] \} \geq 1 - ε . \end{matrix}$

We now make the constants explicit. First, if $λ \leq m / (2 C_{2}^{'})$ , then $2 m (1 - C_{2}^{'} λ / m) \geq m$ and thus

$\frac{β^{'}}{α^{'}} = \frac{1 + \frac{λ C_{1}^{'}}{2 m (1 - \frac{C_{2}^{'} λ}{m})}}{1 - \frac{λ C_{1}^{'}}{2 m (1 - \frac{C_{2}^{'} λ}{m})}} \leq \frac{1 + \frac{λ C_{1}^{'}}{m}}{1 - \frac{λ C_{1}^{'}}{m}} .$
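As a sanity check on the display above, the short Python sketch below evaluates both sides for given values of $λ$ , $m$ , $C_{1}^{'}$ and $C_{2}^{'}$ . The function names and any numerical values passed to them are illustrative assumptions, not taken from the paper. The comparison is only meaningful when $λ \leq m / (2 C_{2}^{'})$ and $λ C_{1}^{'} / m < 1$ , so that both denominators are positive.

```python
def ratio_exact(lam, m, C1, C2):
    """Left-hand side: beta'/alpha' with y = lam*C1 / (2m(1 - C2*lam/m))."""
    y = lam * C1 / (2 * m * (1 - C2 * lam / m))
    return (1 + y) / (1 - y)


def ratio_simplified(lam, m, C1):
    """Right-hand side: the same ratio with y replaced by x = lam*C1/m."""
    x = lam * C1 / m
    return (1 + x) / (1 - x)
```

The inequality holds because $λ \leq m / (2 C_{2}^{'})$ makes the quantity `y` no larger than `x`, and the map $t \mapsto (1 + t) / (1 - t)$ is increasing on $[0, 1)$ .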

Then, $λ \leq \frac{m δ}{C_{1}^{'} (1 + δ)}$ leads to $\frac{β^{'}}{α^{'}} \leq (1 + δ) .$

Note that $λ^{' *} = m min (1 / (2 C_{2}^{'}), δ / [C_{1}^{'} (1 + δ)])$ satisfies these two conditions, so from now on we take $λ = λ^{' *}$ . We also use the following:

$\begin{matrix} \frac{1}{α^{'}} = \frac{1}{λ^{' *} (1 - \frac{λ^{' *} C_{1}^{'}}{2 m (1 - C_{2}^{'} λ^{' *} / m)})} \leq \frac{β^{'}}{λ^{' *} α^{'}} \leq \frac{(1 + δ)}{m min (1 / (2 C_{2}^{'}), δ / [C_{1}^{'} (1 + δ)])} \leq \frac{C_{1}^{'} {(1 + δ)}^{2}}{m δ} . \end{matrix}$

So far the bound is:

$\begin{matrix} P \{ {∥{\hat{X M Z}}_{λ^{' *}} - X M^{*} Z∥}_{F, Π}^{2} \leq inf_{\bar{M} \in R^{p \times k}} [ (1 + δ) {∥ X \bar{M} Z - X M^{*} Z ∥}_{F, Π}^{2} \\ + (1 + δ) {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2} p k τ^{2} \\ + \frac{C_{1}^{'} {(1 + δ)}^{2} (4 rank (\bar{M}) (k + p + 2) log (1 + \frac{{∥ \bar{M} ∥}_{F}}{τ \sqrt{2 rank (\bar{M})}}) + 2 log \frac{2}{ε})}{m δ} ] \} \geq 1 - ε . \end{matrix}$

In particular, with probability at least $1 - ε$ , the choice $τ^{2} = C_{1}^{'} (k + p) / (m k p {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2})$ gives

$\begin{matrix} {∥{\hat{X M Z}}_{λ^{' *}} - X M^{*} Z∥}_{F, Π}^{2} \leq inf_{\bar{M} \in R^{p \times k}} [(1 + δ) {∥ X \bar{M} Z - X M^{*} Z ∥}_{F, Π}^{2} + \frac{C_{1}^{'} (1 + δ) (k + p)}{m} + \\ \frac{C_{1}^{'} {(1 + δ)}^{2} (4 rank (\bar{M}) (k + p + 2) log (1 + \frac{{∥ X ∥}_{F} {∥ Z ∥}_{F} {∥ \bar{M} ∥}_{F}}{\sqrt{C_{1}^{'}}} \sqrt{\frac{m k p}{(k + p) rank (\bar{M})}}) + 2 log \frac{2}{ε})}{m δ}] . \end{matrix}$
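To see where the second term comes from, one can substitute the stated choice of $τ^{2}$ into the term $(1 + δ) {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2} p k τ^{2}$ of the previous bound. This is a direct computation, written out here for convenience:

$(1 + δ) {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2} p k τ^{2} = (1 + δ) {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2} p k \frac{C_{1}^{'} (k + p)}{m k p {∥ X ∥}_{F}^{2} {∥ Z ∥}_{F}^{2}} = \frac{C_{1}^{'} (1 + δ) (k + p)}{m} .$

Likewise, $\frac{1}{τ} = \frac{{∥ X ∥}_{F} {∥ Z ∥}_{F}}{\sqrt{C_{1}^{'}}} \sqrt{\frac{m k p}{k + p}}$ , and the argument of the logarithm follows after bounding $1 / \sqrt{2 rank (\bar{M})}$ by $1 / \sqrt{rank (\bar{M})}$ .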

□

Appendix A.5. Proof of Theorem 4

Proof.

The proof proceeds exactly as the proof of Theorem 2 in Appendix A.3.

□

Appendix B. Comments on Algorithm Implementation

For the case of inductive matrix completion, the logarithm of the density of the quasi-posterior is, up to an additive constant,

$log {\hat{ρ}}_{λ} (M) = - \frac{λ}{m} \sum_{i = 1}^{m} {(Y_{i} - {(Π_{C} (X M Z))}_{i})}^{2} - \frac{p + k + 2}{2} log det (τ^{2} I_{p} + M M^{⊺}) .$

Let us now differentiate this expression with respect to $M \in R^{p \times k}$ . Note that the term ${(Y_{i} - {(Π_{C} (X M Z))}_{i})}^{2}$ does not depend on M locally if ${(X M Z)}_{i} \notin [- C, C]$ ; in this case, its differential with respect to M is $0_{p \times k}$ . Otherwise, ${(Y_{i} - {(Π_{C} (X M Z))}_{i})}^{2} = {(Y_{i} - {(X M Z)}_{i})}^{2}$ . To differentiate the term ${(X M Z)}_{i}$ , let us introduce a notation for the entries of $I_{i}$ : $I_{i} = (a_{i}, b_{i})$ . Then $\nabla {(X M Z)}_{i} = D$ , where the matrix $D \in R^{p \times k}$ satisfies $D_{x, y} = X_{a_{i}, x} Z_{y, b_{i}}$ . Then

$\nabla log {\hat{ρ}}_{λ} (M) = \frac{2 λ}{m} \sum_{i = 1}^{m} (\nabla {(X M Z)}_{i}) (Y_{i} - {(X M Z)}_{i}) 1_{{| {(X M Z)}_{i} | < C}} - (p + k + 2) {(τ^{2} I_{p} + M M^{⊺})}^{- 1} M .$

The above calculation still requires a $p \times p$ matrix inversion at each iteration; for very large p, this might be expensive and can slow down the algorithm. Therefore, we could replace this matrix inversion by an accurate approximation obtained through a convex optimization. Note that the matrix $B : = {(τ^{2} I_{p} + M M^{⊺})}^{- 1} M$ is the solution to the following convex optimization problem: ${min}_{B} \{ {∥ I_{k} - M^{⊤} B ∥}_{F}^{2} + τ^{2} {∥ B ∥}_{F}^{2} \} .$ The solution of this optimization problem can be conveniently obtained by using the package ‘glmnet’ [50] (with the family option ‘mgaussian’). This avoids performing a matrix inversion or other costly calculations. However, we note that the LMC algorithm is then used with an approximate gradient evaluation; a theoretical assessment of this approach can be found in [51].
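The equivalence between the matrix inverse and the ridge-type problem can be illustrated with a small NumPy sketch. It assumes that M is a $p \times k$ matrix and that each identity matrix has the size that makes the products conformable. It uses an ordinary least-squares solver on an augmented system as a stand-in for ‘glmnet’, so it demonstrates the identity rather than the computational saving. The function names are ours.

```python
import numpy as np


def prior_term_direct(M, tau):
    """Compute (tau^2 I + M M^T)^{-1} M with a direct linear solve."""
    p = M.shape[0]
    return np.linalg.solve(tau**2 * np.eye(p) + M @ M.T, M)


def prior_term_ridge(M, tau):
    """Minimise ||I - M^T B||_F^2 + tau^2 ||B||_F^2 over p x k matrices B.

    The penalised problem is rewritten as an unpenalised least-squares
    problem by stacking tau * I below M^T and zeros below the targets.
    """
    p, k = M.shape
    design = np.vstack([M.T, tau * np.eye(p)])           # (k + p) x p
    targets = np.vstack([np.eye(k), np.zeros((p, k))])   # (k + p) x k
    B, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return B
```

Both functions should return the same $p \times k$ matrix up to numerical error, since the normal equations of the stacked problem read $(M M^{⊺} + τ^{2} I) B = M$ .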

Algorithm A1 LMC

Input: The data.
Parameters: Positive real numbers $λ, τ, h, T$ .
Output: The matrix $\hat{M}$
Initialize: $M_{0}$ ; $\hat{M} = 0_{p \times k}$
for $k \leftarrow 1$ to T do
Sample $M_{k}$ from (8);
$\hat{M} \leftarrow \hat{M} + M_{k} / T$
end for
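Algorithm A1 can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that are not spelled out in this appendix: M is $p \times k$ ; the observed entries are given by index arrays `rows` and `cols`; the log quasi-posterior is the squared error on clipped fitted values, scaled by $λ$ over the number of observations, plus the log-determinant prior term; and (8) is the usual unadjusted Langevin step $M_{k} = M_{k - 1} + h \nabla log {\hat{ρ}}_{λ} (M_{k - 1}) + \sqrt{2 h} N_{k}$ with standard Gaussian $N_{k}$ . All function and argument names are ours.

```python
import numpy as np


def fitted_entries(M, X, Z, rows, cols):
    """Entries of X M Z at the observed positions (rows[i], cols[i])."""
    return np.einsum("ip,pk,ki->i", X[rows], M, Z[:, cols])


def log_post(M, X, Z, rows, cols, y, lam, tau, C):
    """Assumed log quasi-posterior, up to an additive constant."""
    p, k = M.shape
    resid = y - np.clip(fitted_entries(M, X, Z, rows, cols), -C, C)
    _, logdet = np.linalg.slogdet(tau**2 * np.eye(p) + M @ M.T)
    return -lam / len(y) * np.sum(resid**2) - 0.5 * (p + k + 2) * logdet


def grad_log_post(M, X, Z, rows, cols, y, lam, tau, C):
    """Gradient of log_post; entries whose fitted value is clipped contribute zero."""
    p, k = M.shape
    fit = fitted_entries(M, X, Z, rows, cols)
    w = (y - fit) * (np.abs(fit) < C)
    data = 2 * lam / len(y) * (X[rows].T * w) @ Z[:, cols].T
    prior = (p + k + 2) * np.linalg.solve(tau**2 * np.eye(p) + M @ M.T, M)
    return data - prior


def lmc(X, Z, rows, cols, y, lam, tau, C, h, T, rng):
    """Unadjusted Langevin iterations; returns the running average of the chain."""
    p, k = X.shape[1], Z.shape[0]
    M = np.zeros((p, k))
    M_hat = np.zeros((p, k))
    for _ in range(T):
        drift = grad_log_post(M, X, Z, rows, cols, y, lam, tau, C)
        M = M + h * drift + np.sqrt(2 * h) * rng.standard_normal((p, k))
        M_hat += M / T
    return M_hat
```

A call such as `lmc(X, Z, rows, cols, y, lam, tau, C, h, T, np.random.default_rng(0))` returns the averaged iterate $\hat{M}$ . The sketch starts the chain at zero and applies no burn-in, both of which are simplifications.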


Algorithm A2 MALA

Input: The data.
Parameters: Positive real numbers $λ, τ, h, T$
Output: The matrix $\hat{M}$
Initialize: $M_{0}$ ; $\hat{M} = 0_{p \times k}$
for $k = 1$ to T do
Sample ${\tilde{M}}_{k}$ from (9).
Set $M_{k} = {\tilde{M}}_{k}$ with probability $A_{M A L A}$ , from (10), otherwise $M_{k} = M_{k - 1}$ .
$\hat{M} \leftarrow \hat{M} + M_{k} / T$ .
end for
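Algorithm A2 can be sketched in the same spirit. The code below is a generic Metropolis-adjusted Langevin sampler written for an arbitrary differentiable log-density. It assumes that (9) is the Gaussian Langevin proposal with mean $M_{k - 1} + h \nabla log {\hat{ρ}}_{λ} (M_{k - 1})$ and variance $2 h$ per entry, and that (10) is the standard Metropolis-Hastings acceptance probability for that proposal. The function `mala` and its arguments are hypothetical names.

```python
import numpy as np


def mala(log_density, grad_log_density, M0, h, T, rng):
    """Metropolis-adjusted Langevin sampler; returns (chain average, acceptance rate)."""

    def log_q(to, frm):
        # Log-density, up to a constant, of proposing `to` when the chain is at `frm`.
        mean = frm + h * grad_log_density(frm)
        return -np.sum((to - mean) ** 2) / (4 * h)

    M = M0.copy()
    M_hat = np.zeros_like(M0)
    accepted = 0
    for _ in range(T):
        noise = rng.standard_normal(M.shape)
        prop = M + h * grad_log_density(M) + np.sqrt(2 * h) * noise
        log_ratio = (log_density(prop) + log_q(M, prop)
                     - log_density(M) - log_q(prop, M))
        if np.log(rng.uniform()) < log_ratio:
            M = prop
            accepted += 1
        M_hat += M / T   # rejected proposals repeat the current state
    return M_hat, accepted / T
```

To target the quasi-posterior, one would pass its log-density and gradient as the first two arguments. Unlike the unadjusted scheme, the accept-reject step corrects the discretisation error of the Langevin proposal, at the cost of one extra density and gradient evaluation per iteration.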


Institutional Review Board Statement

Not applicable.

Data Availability Statement

The R codes used in the numerical experiments are available at: https://github.com/tienmt/blr_imc (accessed on 10 February 2023).

Conflicts of Interest

The author declares no conflict of interest.

Funding Statement

TTM is supported by the Norwegian Research Council grant number 309960 through the Centre for Geophysical Forecasting at NTNU.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

1. Rosen D.V. Methodology and Applications of Statistics. Springer; Berlin/Heidelberg, Germany: 2021. Bilinear Regression with Rank Restrictions on the Mean and Dispersion Matrix; pp. 193–211.
2. Von Rosen D. Bilinear regression analysis: An Introduction. Volume 220. Springer; Berlin/Heidelberg, Germany: 2018. Lecture Notes in Statistics.
3. Potthoff R.F., Roy S. A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika. 1964;51:313–326. doi: 10.1093/biomet/51.3-4.313.
4. Woolson R.F., Leeper J.D. Growth curve analysis of complete and incomplete longitudinal data. Commun. Stat.-Theory Methods. 1980;9:1491–1513. doi: 10.1080/03610928008827977.
5. Kshirsagar A., Smith W. Growth Curves. Volume 145. CRC Press; Boca Raton, FL, USA: 1995.
6. Jana S. Ph.D. Thesis. McMaster University; Hamilton, ON, Canada: 2017. Inference for Generalized Multivariate Analysis of Variance (GMANOVA) Models and High-Dimensional Extensions. Available online: http://hdl.handle.net/11375/22043 (accessed on 26 January 2023).
7. Natarajan N., Dhillon I.S. Inductive matrix completion for predicting gene–disease associations. Bioinformatics. 2014;30:i60–i68. doi: 10.1093/bioinformatics/btu269.
8. Zilber P., Nadler B. Inductive Matrix Completion: No Bad Local Minima and a Fast Algorithm; Proceedings of the 39th ICML; Baltimore, MD, USA. 17–23 July 2022; pp. 27671–27692. PMLR.
9. Zhang W., Xu H., Li X., Gao Q., Wang L. DRIMC: An improved drug repositioning approach using Bayesian inductive matrix completion. Bioinformatics. 2020;36:2839–2847. doi: 10.1093/bioinformatics/btaa062.
10. Hsieh C.J., Natarajan N., Dhillon I. PU learning for matrix completion; Proceedings of the International Conference on Machine Learning, PMLR; Lille, France. 7–9 July 2015; pp. 2445–2453.
11. Jana S., Balakrishnan N., Hamid J.S. Bayesian growth curve model useful for high-dimensional longitudinal data. J. Appl. Stat. 2019;46:814–834. doi: 10.1080/02664763.2018.1517145.
12. Knoblauch J., Jewson J., Damoulas T. An Optimization-centric View on Bayes’ Rule: Reviewing and Generalizing Variational Inference. J. Mach. Learn. Res. 2022;23:1–109.
13. Bissiri P.G., Holmes C.C., Walker S.G. A general framework for updating belief distributions. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2016;78:1103–1130. doi: 10.1111/rssb.12158.
14. Grünwald P., Van Ommen T. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Anal. 2017;12:1069–1103. doi: 10.1214/17-BA1085.
15. McAllester D. Some PAC-Bayesian theorems; Proceedings of the Eleventh Annual Conference on Computational Learning Theory; Madison, WI, USA. 24–26 July 1998; New York, NY, USA: ACM; 1998. pp. 230–234.
16. Shawe-Taylor J., Williamson R. A PAC analysis of a Bayes estimator; Proceedings of the Tenth Annual Conference on Computational Learning Theory; Nashville, TN, USA. 6–9 July 1997; New York, NY, USA: ACM; 1997. pp. 2–9.
17. Catoni O. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics; Beachwood, OH, USA: 2007. p. xii+163. (IMS Lecture Notes—Monograph Series, 56).
18. Guedj B. A primer on PAC-Bayesian learning. arXiv. 2019. arXiv:1901.05353.
19. Alquier P. User-friendly introduction to PAC-Bayes bounds. arXiv. 2021. arXiv:2110.11216.
20. Mai T.T., Alquier P. A Bayesian approach for noisy matrix completion: Optimal rate under general sampling distribution. Electron. J. Statist. 2015;9:823–841. doi: 10.1214/15-EJS1020.
21. Cottet V., Alquier P. 1-Bit matrix completion: PAC-Bayesian analysis of a variational approximation. Mach. Learn. 2018;107:579–603. doi: 10.1007/s10994-017-5667-z.
22. Mai T.T., Alquier P. Pseudo-Bayesian quantum tomography with rank-adaptation. J. Stat. Plan. Inference. 2017;184:62–76. doi: 10.1016/j.jspi.2016.11.003.
23. Mai T.T., Alquier P. Optimal quasi-Bayesian reduced rank regression with incomplete response. arXiv. 2022. arXiv:2206.08619.
24. Jain P., Dhillon I.S. Provable inductive matrix completion. arXiv. 2013. arXiv:1306.0626.
25. Candès E.J., Plan Y. Matrix completion with noise. Proc. IEEE. 2010;98:925–936. doi: 10.1109/JPROC.2009.2035722.
26. Koltchinskii V., Lounici K., Tsybakov A.B. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist. 2011;39:2302–2329. doi: 10.1214/11-AOS894.
27. Foygel R., Shamir O., Srebro N., Salakhutdinov R. Learning with the weighted trace-norm under arbitrary sampling distributions; Proceedings of the Advances in Neural Information Processing Systems; Granada, Spain. 12–15 December 2011; pp. 2133–2141.
28. Klopp O. Noisy low-rank matrix completion with general sampling distribution. Bernoulli. 2014;20:282–303. doi: 10.3150/12-BEJ486.
29. Negahban S., Wainwright M.J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. J. Mach. Learn. Res. 2012;13:1665–1697.
30. Dalalyan A.S., Tsybakov A.B. Sparse regression learning by aggregation and Langevin Monte-Carlo. J. Comput. Syst. Sci. 2012;78:1423–1443. doi: 10.1016/j.jcss.2011.12.023.
31. Dalalyan A.S. Exponential weights in multivariate regression and a low-rankness favoring prior. Annales de l’Institut Henri Poincaré Probabilités et Statistiques. 2020;56:1465–1483. doi: 10.1214/19-AIHP1010.
32. Anderson T.W. Estimating linear restrictions on regression coefficients for multivariate normal distributions. Ann. Math. Stat. 1951;22:327–351. doi: 10.1214/aoms/1177729580.
33. Izenman A.J. Modern multivariate statistical techniques. Regres. Classif. Manifold Learn. 2008;10:978.
34. Dalalyan A., Tsybakov A.B. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Mach. Learn. 2008;72:39–61. doi: 10.1007/s10994-008-5051-0.
35. Catoni O. Statistical learning theory and stochastic optimization. In: Picard J., editor. Saint-Flour Summer School on Probability Theory 2001. Volume 1851. Springer; Berlin, Germany: 2004. p. viii+272. Lecture Notes in Mathematics.
36. Alquier P., Ridgway J., Chopin N. On the properties of variational approximations of Gibbs posteriors. J. Mach. Learn. Res. 2016;17:8374–8414.
37. Rigollet P., Tsybakov A.B. Sparse estimation by exponential weighting. Stat. Sci. 2012;27:558–575. doi: 10.1214/12-STS393.
38. Dalalyan A.S., Grappin E., Paris Q. On the exponentially weighted aggregate with the Laplace prior. Ann. Stat. 2018;46:2452–2478. doi: 10.1214/17-AOS1626.
39. Candes E.J., Wakin M.B., Boyd S.P. Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl. 2008;14:877–905. doi: 10.1007/s00041-008-9045-x.
40. Yang L., Fang J., Duan H., Li H., Zeng B. Fast low-rank Bayesian matrix completion with hierarchical gaussian prior models. IEEE Trans. Signal Process. 2018;66:2804–2817. doi: 10.1109/TSP.2018.2816575.
41. Luo C., Liang J., Li G., Wang F., Zhang C., Dey D.K., Chen K. Leveraging mixed and incomplete outcomes via reduced-rank modeling. J. Multivar. Anal. 2018;167:378–394. doi: 10.1016/j.jmva.2018.04.011.
42. Hoeffding W. Probability Inequalities for Sums of Bounded Random Variables. J. Am. Stat. Assoc. 1963;58:13–30. doi: 10.1080/01621459.1963.10500830.
43. Durmus A., Moulines E. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli. 2019;25:2854–2882. doi: 10.3150/18-BEJ1073.
44. Roberts G.O., Stramer O. Langevin diffusions and Metropolis-Hastings algorithms. Methodol. Comput. Appl. Probab. 2002;4:337–357. doi: 10.1023/A:1023562417138.
45. Roberts G.O., Rosenthal J.S. Optimal scaling of discrete approximations to Langevin diffusions. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1998;60:255–268. doi: 10.1111/1467-9868.00123.
46. Dalalyan A.S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2017;79:651–676. doi: 10.1111/rssb.12183.
47. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2022.
48. Hastie T., Mazumder R. softImpute: Matrix Completion via Iterative Soft-Thresholded SVD, 2021. R Package Version 1.4-1. Available online: https://cran.r-project.org/package=softImpute (accessed on 26 January 2023).
49. Massart P. Concentration Inequalities and Model Selection. Volume 1896. Springer; Berlin, Germany: 2007. p. xiv+337. Lecture Notes in Mathematics.
50. Friedman J., Hastie T., Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 2010;33:1–22. doi: 10.18637/jss.v033.i01.
51. Dalalyan A.S., Karagulyan A. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stoch. Process. Their Appl. 2019;129:5278–5311. doi: 10.1016/j.spa.2019.02.016.
