Entropy. 2023 Feb 11;25(2):333. doi: 10.3390/e25020333

From Bilinear Regression to Inductive Matrix Completion: A Quasi-Bayesian Analysis

The Tien Mai
Editors: Augustine Wong, Xiaoping Shi
PMCID: PMC9955477  PMID: 36832699

Abstract

In this paper, we study the problem of bilinear regression, a type of statistical modeling that deals with multiple variables and multiple responses. One of the main difficulties that arise in this problem is the presence of missing data in the response matrix, a problem known as inductive matrix completion. To address these issues, we propose a novel approach that combines elements of Bayesian statistics with a quasi-likelihood method. Our proposed method starts by addressing the problem of bilinear regression using a quasi-Bayesian approach. The quasi-likelihood method that we employ in this step allows us to handle the complex relationships between the variables in a more robust way. Next, we adapt our approach to the context of inductive matrix completion. We make use of a low-rankness assumption and leverage the powerful PAC-Bayes bound technique to provide statistical properties for our proposed estimators and for the quasi-posteriors. To compute the estimators, we propose a Langevin Monte Carlo method to obtain approximate solutions to the problem of inductive matrix completion in a computationally efficient manner. To demonstrate the effectiveness of our proposed methods, we conduct a series of numerical studies. These studies allow us to evaluate the performance of our estimators under different conditions and provide a clear illustration of the strengths and limitations of our approach.

Keywords: bilinear regression, matrix completion, low-rank model, PAC-Bayesian bound, Langevin Monte Carlo

1. Introduction

In this paper, we investigate the bilinear regression model, a statistical method that assumes a linear relationship between a set of multiple response variables and two sets of covariates. This model, also known as the growth curve model or generalized multivariate analysis model, is commonly used for analyzing longitudinal data, as shown in previous studies such as [1,2,3,4,5,6]. However, these studies only cover the scenario in which the response matrix is fully observed.

Recently, the bilinear regression model with incomplete response has been introduced and studied as the so-called inductive matrix completion, which is a generalization of the matrix completion problem [7,8]. This problem has attracted significant attention in various fields, such as drug repositioning [9], collaborative filtering [10], and genomics [7]. Inductive matrix completion is a challenging problem that arises when some of the entries in the response matrix are missing, which makes it difficult to infer the underlying relationship between the variables.

In this work, we explore the problem of bilinear regression and inductive matrix completion under a low-rank constraint on the coefficient matrix. Most existing approaches for these problems are frequentist methods, such as maximum likelihood estimation [2] or penalized optimization [7]. These methods are effective in providing point estimates for the parameters of the model but lack the ability to provide a full probabilistic characterization of the uncertainty. Recently, Bayesian approaches have been considered for these problems. For example, the paper [11] proposed a Bayesian approach for bilinear regression, and a Bayesian method was proposed for inductive matrix completion in the work [9]. However, unlike frequentist approaches, the statistical properties of the Bayesian approach for these models have not been fully explored yet.

The aim of this paper is to address an existing gap in the understanding of the bilinear regression and inductive matrix completion problems. To achieve this goal, we propose a novel approach that combines elements of Bayesian statistics with a quasi-likelihood method. Specifically, we start by addressing the problem of bilinear regression using a quasi-Bayesian approach, where a quasi-likelihood is employed. We then generalize this approach to the problem of inductive matrix completion. To ensure that our method is adaptive to the rank of the coefficient matrix, we use a spectral scaled Student prior distribution, which allows us to prove that the posterior mean satisfies a tight oracle inequality. This result shows that our method accurately estimates the parameters of the model even when the rank of the coefficient matrix is unknown. We also prove contraction properties of the quasi-posteriors, which complement the guarantees on the posterior mean.

The proposed method in this paper, the quasi-Bayesian approach, is an extension of the traditional Bayesian approach and is becoming increasingly popular in statistics and machine learning as a technique for generalized Bayesian inference, as noted in studies such as [12,13,14]. This approach allows for more flexibility in the modeling assumptions by replacing the likelihood function with a more general notion of risk or quasi-likelihood.

To provide theoretical guarantees for our proposed quasi-posteriors, we make use of the PAC-Bayesian technique [15,16,17]. This technique provides bounds on the generalization error of a learned estimator, and has been widely used in the literature as described in recent reviews and introductions, such as [18,19]. The PAC-Bayes bounds have been successfully applied in the context of matrix estimation problems as shown in studies such as [20,21,22,23]. Interestingly, by using the PAC-Bayesian technique for inductive matrix completion, we do not need to make any assumptions about the distribution of the missing entries in the response matrix. This is in contrast with previous works on matrix completion, such as [24,25,26,27,28,29], which typically require assumptions about the missing data. This makes our method more versatile and applicable to a wider range of problems.

The proposed method in this paper makes use of a spectral scaled Student prior, which is a specific choice of prior distribution. This choice is motivated by recent works in which it has been shown to lead to optimal rates in a variety of problems, including high-dimensional regression [30], image denoising [31] and reduced rank regression [23]. Although this prior is not conjugate to our problems, it allows for the convenient implementation of gradient-based sampling methods, which makes it computationally efficient.

To compute the proposed estimators and sample from the quasi-posterior, we employ a Langevin Monte Carlo (LMC) method, a widely used gradient-based algorithm for sampling from complex distributions. It yields approximate solutions to the bilinear regression and inductive matrix completion problems in a computationally efficient manner.

Furthermore, we use numerical studies to assess the proposed methods. These studies evaluate the performance of our estimators in various scenarios and illustrate the strengths and limitations of our approach. We also compare our method with the ordinary least squares method; this comparison shows where our method improves on the traditional approach, in particular when the response matrix contains missing data.

The remainder of this paper is organized as follows: In Section 2, we present the problem of bilinear regression, and introduce the low-rank promoting prior distribution that we use to address this problem. In Section 3, we extend our approach to the problem of inductive matrix completion. In Section 4, we discuss the Langevin Monte Carlo method used for the computation of the estimators, and present numerical studies to demonstrate the effectiveness of our proposed methods. Finally, we conclude our work and provide a summary of our findings in Section 5. The technical proofs are provided in Appendix A for the interested readers.

Notation 1.

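Let $\mathbb{R}^{n_1\times n_2}$ denote the set of $n_1\times n_2$ matrices with real elements. Let $A^{\top}\in\mathbb{R}^{n_2\times n_1}$ denote the transpose of $A$. For any $A\in\mathbb{R}^{n_1\times n_2}$ and $I=(i,j)\in\{1,\dots,n_1\}\times\{1,\dots,n_2\}$, we denote by $A_I=A_{(i,j)}=A_{i,j}$ the entry of $A$ in the $i$-th row and $j$-th column. The matrix in $\mathbb{R}^{n_1\times n_2}$ with all entries equal to 0 is denoted by $0_{n_1\times n_2}$. For a matrix $B\in\mathbb{R}^{n_1\times n_1}$, we let $\mathrm{Tr}(B)$ denote its trace. The identity matrix in $\mathbb{R}^{n_1\times n_1}$ is denoted by $I_{n_1}$. For $A\in\mathbb{R}^{n_1\times n_2}$, we define its sup-norm $\|A\|_\infty=\max_{i,j}|A_{i,j}|$; its Frobenius norm $\|A\|_F$ is defined by $\|A\|_F^2=\mathrm{Tr}(A^{\top}A)=\sum_{i,j}A_{i,j}^2$, and $\mathrm{rank}(A)$ denotes its rank.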

2. Bilinear Regression

2.1. Model

Let $Y\in\mathbb{R}^{n\times q}$ consist of $n$ independent response vectors, $X\in\mathbb{R}^{n\times p}$ be a given between-individuals design matrix and $Z\in\mathbb{R}^{k\times q}$ be a known within-individuals design matrix. Consider the bilinear regression model as follows:

$$Y=XM^{*}Z+E, \tag{1}$$

where $M^{*}\in\mathbb{R}^{p\times k}$ is the unknown parameter matrix. The random noise matrix $E$ is assumed to have zero mean, $\mathbb{E}(E)=0$. The main assumption here is the low-rank restriction on the model parameter, $\mathrm{rank}(M^{*})<\min(p,k)$.

The model presented in Equation (1) is a bilinear regression model, which can be seen as a generalization of the reduced rank regression problem. In the case where $k=q$ and $Z=I_q$, the model simplifies to the traditional reduced rank regression problem, which has been well-studied in the literature, such as [32,33]. However, in this paper, we consider the more general case where the matrix $Z$ contains additional explanatory variables.

The low-rank assumption in this model can be interpreted as indicating the presence of a latent process that affects the response data, not only through the “between-individuals” structure of the model but also through the “within-individuals” structure. This model is often referred to as the growth curve model or the generalized multivariate analysis model (GMANOVA) and has been studied in depth in the literature, such as [2].

Assumption 1.

There is a known constant $C<+\infty$ such that $\|XM^{*}Z\|_\infty\le C$.

Under Assumption 1, it is not sensible to return predictions $XMZ$ with entries outside the interval $[-C,C]$. However, for computational reasons, it is extremely convenient to employ an unbounded prior for $M$. Therefore, we propose to use unbounded distributions for $M$ but to use, as a predictor, a truncated version of $XMZ$ rather than $XMZ$ itself. For a matrix $A$, let

$$\Pi_C(A)=\arg\min_{\|B\|_\infty\le C}\|A-B\|_F$$

be the orthogonal projection of $A$ onto the set of matrices with entries bounded by $C$ in absolute value. Note that $\Pi_C(A)$ is simply obtained by replacing entries of $A$ larger than $C$ by $C$, and entries smaller than $-C$ by $-C$.
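As a small illustration of the truncation operator, the following sketch implements the entrywise clipping described above with NumPy. The function name `project_sup_ball` and the toy matrix are hypothetical and chosen only for this example.

```python
import numpy as np

def project_sup_ball(A, C):
    """Entrywise truncation of A to [-C, C], i.e. the projection Pi_C(A)."""
    return np.clip(A, -C, C)

A = np.array([[3.0, -0.5],
              [-4.0, 1.5]])
P = project_sup_ball(A, C=2.0)
print(P)  # the entries 3.0 and -4.0 are clipped to 2.0 and -2.0
```

Because the Frobenius norm separates over entries, clipping each entry independently is enough to solve the minimization problem that defines the projection.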

For a matrix $M\in\mathbb{R}^{p\times k}$, we denote by $r(M)$ the empirical risk of $M$,

$$r(M)=\frac{1}{nq}\|Y-\Pi_C(XMZ)\|_F^2,$$

and its expectation is denoted by

$$R(M)=\mathbb{E}[r(M)]=\frac{1}{nq}\,\mathbb{E}\|Y-\Pi_C(XMZ)\|_F^2.$$

The focus of our work in this paper is on the predictive aspects of the model, that is, a matrix $M$ predicts almost as well as $M^{*}$ if $R(M)-R(M^{*})$ is small. Under the assumption that $E_{ij}$ has a finite variance, using the Pythagorean theorem, we have

$$R(M)-R(M^{*})=\frac{1}{nq}\|\Pi_C(XMZ)-XM^{*}Z\|_F^2 \tag{2}$$

for any M, which means that our results can also be interpreted in terms of the Frobenius norm.
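For completeness, identity (2) can be checked in a few lines. This is a sketch under the assumptions that $M$ is fixed (not data-dependent), that the noise entries have finite variance, and that $R(M)$ is read as the expectation of the empirical risk. Write $D=XM^{*}Z-\Pi_C(XMZ)$, which is deterministic, so that $Y-\Pi_C(XMZ)=D+E$:

```latex
\begin{aligned}
nq\,R(M) &= \mathbb{E}\,\|D+E\|_F^2
          = \|D\|_F^2 + 2\,\langle D,\mathbb{E}(E)\rangle + \mathbb{E}\,\|E\|_F^2
          = \|D\|_F^2 + \mathbb{E}\,\|E\|_F^2, \\
nq\,R(M^{*}) &= \mathbb{E}\,\|E\|_F^2
          \quad\text{since } \Pi_C(XM^{*}Z)=XM^{*}Z \text{ under Assumption 1.}
\end{aligned}
```

Subtracting the second line from the first gives (2).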

Let $\pi$ be a prior distribution on $\mathbb{R}^{p\times k}$ (see Section 2.2). For any $\lambda>0$, we define the quasi-posterior

$$\hat\rho_\lambda(dM)\propto\exp(-\lambda r(M))\,\pi(dM).$$

It is worth noting that for the specific choice $\lambda=nq/(2\sigma^2)$, the quasi-posterior coincides with the posterior obtained when the noise terms $E_{ij}$ are assumed to be Gaussian with mean 0 and variance $\sigma^2$. However, our theoretical results hold under a more general class of noise distributions. It is known that a small enough $\lambda$ is sufficient when the model is misspecified [14]. Additionally, even in the case of Gaussian noise, in high-dimensional settings, a value of $\lambda$ smaller than $nq/(2\sigma^2)$ leads to better adaptation properties [30,34]. The precise choice of $\lambda$ in our method will be further discussed below.
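The Gaussian correspondence mentioned above can be made explicit in one line. This is only a short check based on the definition of the empirical risk, treating the truncated predictor as the mean of the responses:

```latex
\exp\!\big(-\lambda\, r(M)\big)\Big|_{\lambda = nq/(2\sigma^2)}
  = \exp\!\Big(-\frac{1}{2\sigma^2}\,\|Y-\Pi_C(XMZ)\|_F^2\Big)
  \;\propto\; \prod_{i=1}^{n}\prod_{j=1}^{q}
     \mathcal{N}\!\big(Y_{ij}\,;\,(\Pi_C(XMZ))_{ij},\,\sigma^2\big).
```

Smaller values of the temperature parameter flatten this quasi-likelihood and give more weight to the prior.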

We consider the following posterior mean of $XMZ$, given by

$$\widehat{XMZ}_\lambda=\int\Pi_C(XMZ)\,\hat\rho_\lambda(dM). \tag{3}$$

In our simulation experiments, for reasonable values of $C$, the Monte Carlo algorithm never samples matrices $M$ such that $\Pi_C(XMZ)\ne XMZ$. In other words, the boundedness constraint has very little impact in practice, and it is mainly necessary for technical proofs. If one is interested in obtaining an estimator of $M^{*}$ instead of an estimator of $XM^{*}Z$, when $X^{\top}X$ and $ZZ^{\top}$ are invertible, one can consider the estimator $\hat M_\lambda=(X^{\top}X)^{-1}X^{\top}\,\widehat{XMZ}_\lambda\,Z^{\top}(ZZ^{\top})^{-1}$ and note that $X\hat M_\lambda Z=\widehat{XMZ}_\lambda$. This estimator can be used whenever inverting $X^{\top}X$ and $ZZ^{\top}$ is computationally feasible.

The quasi-posterior distribution investigated in this paper is often referred to as the “Gibbs posterior” in the PAC-Bayes approach. This terminology is used in the literature such as [17,18,19,35,36]. Additionally, the estimator $\hat M_\lambda$ is sometimes referred to as the Gibbs estimator or the exponentially weighted aggregate (EWA) in the literature, such as [34,37,38].

2.2. Prior Specification

We consider, in this paper, the following spectral scaled Student prior distribution, with parameter $\tau>0$:

$$\pi(M)\propto\det(\tau^2I_k+M^{\top}M)^{-(p+k+2)/2}. \tag{4}$$

This prior distribution is designed to promote low-rankness by placing more probability mass on matrices whose singular values are mostly close to zero, that is, on matrices with an approximately sparse spectrum. This prior has a similar form to the one used in related works such as [39,40], and it is known to lead to good performance in different problems, such as high-dimensional regression and image denoising. It can be verified that

$$\pi(M)\propto\prod_{j=1}^{k}\big(\tau^2+s_j(M)^2\big)^{-(p+k+2)/2},$$

where $s_j(M)$ denotes the $j$th largest singular value of $M$ (set to 0 for $j>\min(p,k)$). The above expression is a scaled Student distribution evaluated at $s_j(M)$, which can be seen as a way to approximate sparsity on $s_j(M)$ [30]. The log-sum function $\sum_{j=1}^{k}\log(\tau^2+s_j(M)^2)$ used by [39,40] is also known to enforce approximate sparsity on the singular values of $M$. This means that under this prior, most of the $s_j(M)$ are close to 0, which implies that $M$ is well approximated by a low-rank matrix. Therefore, the prior has the ability to promote the low-rankness of $M$.
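The two expressions of the prior can be compared numerically. The sketch below assumes that $M$ has shape $p\times k$ with $k\le p$ and that the exponent is $(p+k+2)/2$; the function name `log_prior` and the toy dimensions are hypothetical. It also checks that a rank-one matrix receives a larger prior density than a full-rank matrix with the same Frobenius norm.

```python
import numpy as np

def log_prior(M, tau):
    """Unnormalised log-density of the spectral scaled Student prior,
    written through the singular values of M (assumed exponent (p+k+2)/2)."""
    p, k = M.shape
    s = np.linalg.svd(M, compute_uv=False)   # min(p, k) singular values
    return -0.5 * (p + k + 2) * np.sum(np.log(tau**2 + s**2))

rng = np.random.default_rng(0)
p, k, tau = 5, 3, 1.0
M = rng.standard_normal((p, k))              # full rank with probability one

# Same quantity through the determinant form (valid as written when k <= p).
_, logdet = np.linalg.slogdet(tau**2 * np.eye(k) + M.T @ M)
via_det = -0.5 * (p + k + 2) * logdet

# Rank-one matrix rescaled to the same Frobenius norm as M.
M_low = rng.standard_normal((p, 1)) @ rng.standard_normal((1, k))
M_low *= np.linalg.norm(M) / np.linalg.norm(M_low)

print(log_prior(M, tau), via_det, log_prior(M_low, tau))
```

The last comparison follows from the concavity of the logarithm: for a fixed sum of squared singular values, the log-sum penalty is smallest when the spectrum is concentrated on a single singular value.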

As previously stated, this prior is not conjugate to the problem at hand. However, it is particularly convenient to implement gradient-based sampling algorithms, such as the Langevin Monte Carlo method, which will be discussed in more detail in Section 4. This is because the gradient of the log-posterior can be computed efficiently, and it allows for an efficient implementation of the LMC algorithm.
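To illustrate why gradients are cheap here, the sketch below writes a closed-form gradient of the log-prior and checks it against central finite differences. The closed form $-(p+k+2)\,M(\tau^2I_k+M^{\top}M)^{-1}$ is derived for this example under the assumption that the exponent of the prior is $(p+k+2)/2$ for a $p\times k$ matrix; the function names are hypothetical.

```python
import numpy as np

def log_prior(M, tau):
    """Unnormalised log-prior, determinant form."""
    p, k = M.shape
    _, logdet = np.linalg.slogdet(tau**2 * np.eye(k) + M.T @ M)
    return -0.5 * (p + k + 2) * logdet

def grad_log_prior(M, tau):
    """Closed form: -(p + k + 2) * M (tau^2 I_k + M^T M)^{-1}."""
    p, k = M.shape
    S = tau**2 * np.eye(k) + M.T @ M
    # solve(S, M.T) = S^{-1} M^T; its transpose is M S^{-1} because S is symmetric.
    return -(p + k + 2) * np.linalg.solve(S, M.T).T

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 3))
tau, eps = 0.5, 1e-6

# Central finite differences, entry by entry.
num = np.zeros_like(M)
for i in range(M.shape[0]):
    for j in range(M.shape[1]):
        D = np.zeros_like(M)
        D[i, j] = eps
        num[i, j] = (log_prior(M + D, tau) - log_prior(M - D, tau)) / (2 * eps)

max_abs_diff = np.max(np.abs(num - grad_log_prior(M, tau)))
print(max_abs_diff)
```

Each gradient evaluation only requires solving a $k\times k$ linear system, which is the main cost of the prior term in a gradient-based sampler.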

2.3. Theoretical Results

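We make the following sub-exponential assumption on the noise.

Assumption 2.

The entries $E_{i,j}$ of $E$ are independent. There exist two known constants $\sigma>0$ and $\xi>0$ such that

$$\forall \ell\ge 2,\quad\mathbb{E}(|E_{i,j}|^{\ell})\le\sigma^2\,\ell!\,\xi^{\ell-2}/2.$$

Let us put

$$C_1=8(\sigma^2+C^2);\quad C_2=64C\max(\xi,C);\quad \tau^{*}=\sqrt{C_1(k+p)/(nkq\|X\|_F^2\|Z\|_F^2)}.$$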

The statistical properties of the mean estimator are given in the following theorem, which provides a non-asymptotic analysis.

Theorem 1.

Let Assumptions 1 and 2 be satisfied. Fix the parameter $\tau=\tau^{*}$ in the prior. Fix $\delta>0$ and define $\lambda^{*}:=nq\min(1/(2C_2),\,\delta/[C_1(1+\delta)])$. Then, for any $\varepsilon\in(0,1)$, we have, with probability at least $1-\varepsilon$ on the sample,

$$\|\widehat{XMZ}_{\lambda^{*}}-XM^{*}Z\|_F^2\le\inf_{0\le r\le\min(p,k)}\ \inf_{\substack{\bar M\in\mathbb{R}^{p\times k}\\ \mathrm{rank}(\bar M)\le r}}\Bigg\{(1+\delta)\|X\bar MZ-XM^{*}Z\|_F^2+\frac{C_1(1+\delta)^2}{\delta}\Bigg[4r(k+p+2)\log\Bigg(1+\frac{\|X\|_F\|Z\|_F\|\bar M\|_F}{\sqrt{C_1}}\sqrt{\frac{nkq}{r(k+p)}}\Bigg)+k+p+2\log\frac{2}{\varepsilon}\Bigg]\Bigg\}.$$

The choice $\lambda=\lambda^{*}$ is obtained by optimizing an upper bound on the risk $R$ (see the proof of this theorem). This choice is not necessarily the best in practice, even though it gives a good order of magnitude for $\lambda$; the user can use cross-validation to tune the temperature parameter. Note also that $\mathrm{rank}(\bar M)\ne 0$ is not required in the above formula: if $\mathrm{rank}(\bar M)=0$, then $\bar M=0$ and we interpret $0\log(1+0/0)$ as 0.
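To get a feeling for the orders of magnitude involved, the following sketch evaluates the constants of this section for given design matrices. The function name `tuning_constants` and the numerical inputs are hypothetical, and the square root in the expression of the prior scale is an assumption of this example (it makes the scale homogeneous to the singular values of the coefficient matrix).

```python
import numpy as np

def tuning_constants(X, Z, sigma, xi, C, delta):
    """Constants C1, C2 and the theoretical values lambda*, tau* (bilinear regression)."""
    n, p = X.shape
    k, q = Z.shape
    C1 = 8.0 * (sigma**2 + C**2)
    C2 = 64.0 * C * max(xi, C)
    lam_star = n * q * min(1.0 / (2.0 * C2), delta / (C1 * (1.0 + delta)))
    # Square root assumed for tau* (see the lead-in above).
    fro2 = np.linalg.norm(X, "fro")**2 * np.linalg.norm(Z, "fro")**2
    tau_star = np.sqrt(C1 * (k + p) / (n * k * q * fro2))
    return C1, C2, lam_star, tau_star

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))   # n = 100, p = 10
Z = rng.standard_normal((20, 10))    # k = 20, q = 10
C1, C2, lam_star, tau_star = tuning_constants(X, Z, sigma=1.0, xi=1.0, C=1.0, delta=1.0)

gaussian_lam = X.shape[0] * Z.shape[1] / (2 * 1.0**2)   # nq / (2 sigma^2)
print(lam_star, gaussian_lam, tau_star)
```

With these inputs the theoretical temperature is much smaller than the value matching a Gaussian likelihood, which is consistent with the remark that the theoretical choice is conservative and that tuning by cross-validation may be preferable in practice.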

The proof of this theorem is based on PAC-Bayes theory and is provided in Appendix A. Taking $\bar M=M^{*}$ gives an upper bound on the infimum, which leads to the following result.

Corollary 1.

Under the same assumptions and the same $\tau,\lambda^{*}$ as in Theorem 1, let $r^{*}=\mathrm{rank}(M^{*})$. Then, for any $\varepsilon\in(0,1)$, we have, with probability at least $1-\varepsilon$ on the sample,

$$\|\widehat{XMZ}_{\lambda^{*}}-XM^{*}Z\|_F^2\le\frac{C_1(1+\delta)^2}{\delta}\Bigg[4r^{*}(k+p+2)\log\Bigg(1+\frac{\|X\|_F\|Z\|_F\|M^{*}\|_F}{\sqrt{C_1}}\sqrt{\frac{nkq}{r^{*}(k+p)}}\Bigg)+k+p+2\log\frac{2}{\varepsilon}\Bigg].$$

Theorem 1 provides an understanding of the statistical properties of the posterior mean. However, it is also important to understand the contraction properties of the quasi-posterior distribution. In the following theorem, we aim to provide a result that demonstrates this aspect of the proposed method.

Theorem 2.

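Under the assumptions of Theorem 1, let $\varepsilon_n$ be any sequence in $(0,1)$ such that $\varepsilon_n\to 0$ when $n\to\infty$. Define

$$\mathcal{M}_n=\Bigg\{M\in\mathbb{R}^{p\times k}:\ \|\Pi_C(XMZ)-XM^{*}Z\|_F^2\le\inf_{0\le r\le\min(p,k)}\ \inf_{\substack{\bar M\in\mathbb{R}^{p\times k}\\ \mathrm{rank}(\bar M)\le r}}\Bigg[(1+\delta)\|X\bar MZ-XM^{*}Z\|_F^2+\frac{C_1(1+\delta)^2}{\delta}\Bigg(4r(k+p+2)\log\Bigg(1+\frac{\|X\|_F\|Z\|_F\|\bar M\|_F}{\sqrt{C_1}}\sqrt{\frac{nkq}{r(k+p)}}\Bigg)+k+p+2\log\frac{2}{\varepsilon_n}\Bigg)\Bigg]\Bigg\}.$$

Then,

$$\mathbb{E}\Big[\mathbb{P}_{M\sim\hat\rho_{\lambda^{*}}}(M\in\mathcal{M}_n)\Big]\ge 1-\varepsilon_n\xrightarrow[n\to\infty]{}1.$$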

The proof of this theorem is provided in Appendix A.

3. Inductive Matrix Completion

3.1. Model and Method

In the context of inductive matrix completion, given two side information matrices $X$ and $Z$, we assume that only a random subset $\Omega$ of the entries of $Y$ in model (1) is observed. More precisely, we assume that we observe $m$ independent and identically distributed random pairs $(I_1,Y_1),\dots,(I_m,Y_m)$ given by

$$Y_i=(XM^{*}Z)_{I_i}+E_i,\quad i=1,\dots,m, \tag{5}$$

where $M^{*}\in\mathbb{R}^{p\times k}$ is the unknown parameter matrix, expected to be low-rank, and the sample size is assumed to satisfy $m<nq$. The noise variables $E_i$ are assumed to be independent with $\mathbb{E}(E_i)=0$. The variables $I_i$ are independent and identically distributed copies of a random variable $I$ with distribution $\Pi$ on the set $\{1,\dots,n\}\times\{1,\dots,q\}$; we denote $\Pi_{x,y}:=\Pi(I=(x,y))$.

Our goal in this paper is to investigate the problem of bilinear regression and also address the case where the response matrix contains missing data, a problem known as inductive matrix completion. In particular, when $p=n$, $k=q$ and $X=I_n$, $Z=I_q$ are the identity matrices, the problem reduces to the traditional matrix completion problem, which has been well studied in the literature [26]. Similarly, when $k=q$ and $Z=I_q$ is the identity matrix, the problem becomes the reduced rank regression problem with incomplete response, which has also been studied in recent works, such as [23,41]. However, in the context of inductive matrix completion, we focus on the more general case where $X$ and $Z$ contain additional explanatory variables; in other words, we consider the side information from the $n$ users and the $q$ items in our model [8].

It has been acknowledged that there are two different ways to model the observed values of Y, either by including or excluding the possibility of observing the same entry multiple times. Previous studies have examined both of these methods, such as the examination of matrix completion without replacement in [25] and with replacement in [26]. Both methods have practical uses and use similar techniques for estimation. This particular study focuses on the scenario where the variables Ii are independently and identically distributed, meaning that it is possible to observe the same entry multiple times. Additionally, it is important to note that, according to the findings presented in Section 6 of [42], the results of this study can also be applied to the scenario of sampling without replacement, as long as the sampling is performed uniformly and there is no observation noise.

We now adapt the quasi-Bayesian approach for bilinear regression in Section 2 to the context of inductive matrix completion. For a probability distribution $P$ on $\{1,\dots,n_1\}\times\{1,\dots,n_2\}$, we generalize the Frobenius norm by $\|A\|_{F,P}^2=\sum_{i,j}P[(i,j)]A_{i,j}^2$; note that when $P$ is the uniform distribution, $\|A\|_{F,P}^2=\|A\|_F^2/(n_1n_2)$.
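The generalized norm is a weighted sum of squared entries, as the following sketch shows. The function name `weighted_frobenius_sq` and the toy matrix are hypothetical; the sampling distribution is represented as an array of probabilities with the same shape as the matrix.

```python
import numpy as np

def weighted_frobenius_sq(A, P):
    """Generalised squared Frobenius norm: sum_{i,j} P[i, j] * A[i, j]**2."""
    return float(np.sum(P * A**2))

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
P_uniform = np.full(A.shape, 1.0 / A.size)
val = weighted_frobenius_sq(A, P_uniform)   # 30 / 4 = 7.5
print(val)
```

With non-uniform sampling, entries that are rarely observed contribute little to this norm, so guarantees stated in this norm say less about them.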

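For a matrix $M\in\mathbb{R}^{p\times k}$, we denote the empirical risk of $M$, $r(M)$, and its expected risk $R(M)$ respectively as

$$r(M)=\frac{1}{m}\sum_{i=1}^{m}\big(Y_i-(\Pi_C(XMZ))_{I_i}\big)^2,$$
$$R(M)=\mathbb{E}[r(M)]=\mathbb{E}\big[\big(Y_1-(\Pi_C(XMZ))_{I_1}\big)^2\big].$$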
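A minimal sketch of the empirical risk over observed entries is given below. It assumes uniform sampling with replacement and Gaussian noise purely for illustration; the function name `empirical_risk_imc`, the dimensions and the noise level are hypothetical.

```python
import numpy as np

def empirical_risk_imc(M, X, Z, rows, cols, y, C):
    """Mean squared error of the truncated predictor over the observed pairs."""
    pred = np.clip(X @ M @ Z, -C, C)
    return float(np.mean((y - pred[rows, cols])**2))

rng = np.random.default_rng(0)
n, p, k, q, m = 20, 3, 2, 5, 40
X = rng.standard_normal((n, p))
Z = rng.standard_normal((k, q))
M_star = np.outer([1.0, -1.0, 0.5], [2.0, 1.0])   # rank-one coefficient matrix
signal = X @ M_star @ Z
C = float(np.max(np.abs(signal)))                  # so that Assumption 1 holds on this draw

# Sample m index pairs uniformly with replacement, then add Gaussian noise.
rows = rng.integers(0, n, size=m)
cols = rng.integers(0, q, size=m)
y = signal[rows, cols] + 0.1 * rng.standard_normal(m)

risk_true = empirical_risk_imc(M_star, X, Z, rows, cols, y, C)
risk_zero = empirical_risk_imc(np.zeros((p, k)), X, Z, rows, cols, y, C)
print(risk_true, risk_zero)
```

Only the observed index pairs enter the risk, so the same entry may be counted several times when it is sampled more than once.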

As in Section 2, we will focus on the predictive aspects of the model, that is, a matrix $M$ predicts almost as well as $M^{*}$ if $R(M)-R(M^{*})$ is small. Under the assumption that $E_i$ has a finite variance, based on the Pythagorean theorem, we have

$$R(M)-R(M^{*})=\|\Pi_C(XMZ)-XM^{*}Z\|_{F,\Pi}^2 \tag{6}$$

for any M, which means that our results can also be interpreted in terms of an estimation of M* with respect to a generalized Frobenius norm.

Here, the prior $\pi$ is the low-rank inducing prior specified in Section 2.2 above. For any $\lambda>0$, we define the quasi-posterior

$$\hat\rho_\lambda(dM)\propto\exp(-\lambda r(M))\,\pi(dM).$$

Our choice of $\lambda$ is specified below.

The truncated posterior mean of $XMZ$ is given by

$$\widehat{XMZ}_\lambda=\int\Pi_C(XMZ)\,\hat\rho_\lambda(dM). \tag{7}$$

As in the context of bilinear regression, this truncation has very little impact in practice for reasonable values of $C$.

3.2. Theoretical Results

In this section, we derive the statistical properties of the quasi-posterior $\hat\rho_\lambda$ and the mean estimator $\widehat{XMZ}_\lambda$ in the context of inductive matrix completion. Let us first state our assumptions on this model.

Assumption 3.

The noise variables $E_1,\dots,E_m$ are independent of $I_1,\dots,I_m$. There exist two known constants $\sigma>0$ and $\xi>0$ such that

$$\forall \ell\ge 2,\quad\mathbb{E}(|E_i|^{\ell})\le\sigma^2\,\ell!\,\xi^{\ell-2}/2.$$

Assumptions 1 and 3 are both standard; they have been used in [41] for theoretical analysis of reduced rank regression and in [26] for trace regression and matrix completion.

Let us put

$$C_1=8(\sigma^2+C^2);\quad C_2=64C\max(\xi,C);\quad \tau^{*}=\sqrt{C_1(k+p)/(mkp\|X\|_F^2\|Z\|_F^2)}.$$

Theorem 3.

Let Assumptions 1 and 3 be satisfied. Fix the parameter $\tau=\tau^{*}$ in the prior. Fix $\delta>0$ and define $\lambda^{*}:=m\min(1/(2C_2),\,\delta/[C_1(1+\delta)])$. Then, for any $\varepsilon\in(0,1)$, we have, with probability at least $1-\varepsilon$ on the sample,

$$\|\widehat{XMZ}_{\lambda^{*}}-XM^{*}Z\|_{F,\Pi}^2\le\inf_{0\le r\le\min(p,k)}\ \inf_{\substack{\bar M\in\mathbb{R}^{p\times k}\\ \mathrm{rank}(\bar M)\le r}}\Bigg\{(1+\delta)\|X\bar MZ-XM^{*}Z\|_{F,\Pi}^2+\frac{C_1(1+\delta)^2}{\delta}\cdot\frac{4r(k+p+2)\log\Big(1+\frac{\|X\|_F\|Z\|_F\|\bar M\|_F}{\sqrt{C_1}}\sqrt{\frac{mkp}{r(k+p)}}\Big)+k+p+2\log\frac{2}{\varepsilon}}{m}\Bigg\}.$$

As in the context of bilinear regression, the choices $\lambda=\lambda^{*}$ and $\tau=\tau^{*}$ come from the optimization of an upper bound on the risk $R$ (see the proof of this theorem). These choices are therefore not necessarily the best in practice, even though they give a good order of magnitude for tuning these parameters; the user could use cross-validation to tune them. Note again that $\mathrm{rank}(\bar M)\ne 0$ is not required in the above formula: if $\mathrm{rank}(\bar M)=0$, then $\bar M=0$ and we interpret $0\log(1+0/0)$ as 0. The proof of this theorem is provided in Appendix A. In particular, we can upper bound the infimum on $\bar M$ by taking $\bar M=M^{*}$, which leads to the following result.

Corollary 2.

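Under the assumptions of Theorem 3, let $r^{*}=\mathrm{rank}(M^{*})$. Put

$$R_{\delta,m,p,k,r^{*},\varepsilon}:=\frac{C_1(1+\delta)^2}{\delta}\cdot\frac{4r^{*}(k+p+2)\log\Big(1+\frac{\|X\|_F\|Z\|_F\|M^{*}\|_F}{\sqrt{C_1}}\sqrt{\frac{mkp}{r^{*}(k+p)}}\Big)+k+p+2\log\frac{2}{\varepsilon}}{m},$$

then, for any $\varepsilon\in(0,1)$, with probability at least $1-\varepsilon$ on the sample,

$$\|\widehat{XMZ}_{\lambda^{*}}-XM^{*}Z\|_{F,\Pi}^2\le R_{\delta,m,p,k,r^{*},\varepsilon}$$

and in particular, if the sampling distribution $\Pi$ is uniform,

$$\frac{\|\widehat{XMZ}_{\lambda^{*}}-XM^{*}Z\|_F^2}{nq}\le R_{\delta,m,p,k,r^{*},\varepsilon}.$$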

Remark 1.

Up to a log-term, our error rate $r^{*}(k+p)/m$ is similar to the best rate currently known, derived in [8].

While Theorem 3 is about the finite sample convergence rate of the posterior mean, it is actually possible to prove that the quasi-posterior $\hat\rho_\lambda$ contracts around $M^{*}$ at the same rate.

Theorem 4.

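Under the same assumptions as for Theorem 3, and the same definitions of $\tau$ and $\lambda^{*}$, let $\varepsilon_m$ be any sequence in $(0,1)$ such that $\varepsilon_m\to 0$ when $m\to\infty$. Define

$$\Omega_m=\Bigg\{M\in\mathbb{R}^{p\times k}:\ \|\Pi_C(XMZ)-XM^{*}Z\|_{F,\Pi}^2\le\inf_{1\le r\le\min(p,k)}\ \inf_{\substack{\bar M\in\mathbb{R}^{p\times k}\\ \mathrm{rank}(\bar M)\le r}}\Bigg[(1+\delta)\|X\bar MZ-XM^{*}Z\|_{F,\Pi}^2+\frac{C_1(1+\delta)^2}{\delta}\cdot\frac{4r(k+p+2)\log\Big(1+\frac{\|X\|_F\|Z\|_F\|\bar M\|_F}{\sqrt{C_1}}\sqrt{\frac{mkp}{r(k+p)}}\Big)+k+p+2\log\frac{2}{\varepsilon_m}}{m}\Bigg]\Bigg\}.$$

Then

$$\mathbb{E}\Big[\mathbb{P}_{M\sim\hat\rho_{\lambda^{*}}}(M\in\Omega_m)\Big]\ge 1-\varepsilon_m\xrightarrow[m\to\infty]{}1.$$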

The proof of this theorem is provided in Appendix A.

4. Numerical Studies

4.1. Langevin Monte Carlo Implementation

In this section, we propose to sample from the (quasi) posteriors of Sections 2 and 3 using a suitable version of the Langevin Monte Carlo (LMC) algorithm, a gradient-based sampling method. We use a constant step-size unadjusted LMC algorithm; see [43] for more details. The algorithm is given by an initial matrix $M_0$ and the recursion

$$M_{k+1}=M_k+h\nabla\log\hat\rho_\lambda(M_k)+\sqrt{2h}\,N_k,\quad k=0,1,\dots, \tag{8}$$

where $h>0$ is the step-size, $\hat\rho_\lambda$ is the (quasi) posterior density and $N_0,N_1,\dots$ are independent random matrices with independent and identically distributed standard Gaussian entries. Pseudo-code for LMC is given in Algorithm A1. For small values of the step-size $h$, the output of Algorithm A1, $\hat M$, is very close to the integral (3) of interest. However, when $h$ is not small enough, the iterates can diverge [44]. In such cases, we consider including a Metropolis–Hastings correction in the algorithm. Another possible choice is to take a smaller $h$ and restart the algorithm; although this slows the algorithm down, we keep some control over its execution time. The Metropolis–Hastings approach, on the other hand, ensures convergence to the desired distribution, but the algorithm is greatly slowed down by the additional acceptance/rejection step at each iteration.

Next, we propose a Metropolis–Hastings correction to the LMC algorithm. It guarantees convergence to the (quasi) posterior, and it also provides a useful way of choosing $h$. More precisely, we consider the update rule in (8) as a proposal for a new candidate:

$$\tilde M_{k+1}=M_k+h\nabla\log\hat\rho_\lambda(M_k)+\sqrt{2h}\,N_k,\quad k=0,1,\dots \tag{9}$$

Note that, given $M_k$, the matrix $\tilde M_{k+1}$ is normally distributed with mean $M_k+h\nabla\log\hat\rho_\lambda(M_k)$ and covariance equal to $2h$ times the identity. This proposal is then accepted or rejected according to the Metropolis–Hastings algorithm, with acceptance probability

$$A_{\mathrm{MALA}}:=\min\Bigg\{1,\frac{\hat\rho_\lambda(\tilde M_{k+1})\,q(M_k\mid\tilde M_{k+1})}{\hat\rho_\lambda(M_k)\,q(\tilde M_{k+1}\mid M_k)}\Bigg\}, \tag{10}$$

where

$$q(x'\mid x)\propto\exp\Big(-\frac{1}{4h}\big\|x'-x-h\nabla\log\hat\rho_\lambda(x)\big\|_F^2\Big)$$

is the transition probability density from $x$ to $x'$. The details of the Metropolis-adjusted Langevin algorithm (denoted by MALA) are presented in Algorithm A2. Compared to the random-walk Metropolis–Hastings, MALA usually proposes moves into regions of higher probability, which are then more likely to be accepted.
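The two samplers can be sketched in a few lines of NumPy. This is not the implementation of Algorithms A1 and A2, only an illustration under stated assumptions: the truncation is left out (it is assumed inactive), the drift moves along the gradient of the log quasi-posterior, the prior exponent is taken as $(p+k+2)/2$, and all function names, dimensions and the step-size are hypothetical choices for a toy problem.

```python
import numpy as np

def make_target(X, Y, Z, lam, tau):
    """Log quasi-posterior (up to a constant) and its gradient, truncation left out."""
    n, q = Y.shape
    p, k = X.shape[1], Z.shape[0]
    a = p + k + 2

    def log_post(M):
        S = tau**2 * np.eye(k) + M.T @ M
        resid = Y - X @ M @ Z
        return -lam * np.sum(resid**2) / (n * q) - 0.5 * a * np.linalg.slogdet(S)[1]

    def grad(M):
        S = tau**2 * np.eye(k) + M.T @ M
        resid = Y - X @ M @ Z
        return (2.0 * lam / (n * q)) * X.T @ resid @ Z.T - a * np.linalg.solve(S, M.T).T

    return log_post, grad

def lmc_step(M, grad, h, rng):
    """One unadjusted Langevin step (also used as the MALA proposal)."""
    return M + h * grad(M) + np.sqrt(2.0 * h) * rng.standard_normal(M.shape)

def mala_step(M, log_post, grad, h, rng):
    """One Metropolis-adjusted Langevin step; returns (new state, accepted flag)."""
    prop = lmc_step(M, grad, h, rng)

    def log_q(to, frm):  # log proposal density of `to` given `frm`, up to a constant
        return -np.sum((to - frm - h * grad(frm))**2) / (4.0 * h)

    log_ratio = (log_post(prop) + log_q(M, prop)) - (log_post(M) + log_q(prop, M))
    if np.log(rng.uniform()) < log_ratio:
        return prop, True
    return M, False

# Toy data with small dimensions so that the example runs quickly.
rng = np.random.default_rng(0)
n, p, k, q = 50, 3, 2, 4
X = rng.standard_normal((n, p))
Z = rng.standard_normal((k, q))
M_star = np.outer([1.0, -1.0, 0.5], [2.0, 1.0])   # rank-one truth
Y = X @ M_star @ Z + rng.standard_normal((n, q))

log_post, grad = make_target(X, Y, Z, lam=n * q, tau=1.0)
h, n_iter, burn_in = 1e-4, 3000, 1000

M_lmc, M_mala = np.zeros((p, k)), np.zeros((p, k))
sum_lmc, sum_mala = np.zeros((p, k)), np.zeros((p, k))
n_accept = 0
for t in range(n_iter):
    M_lmc = lmc_step(M_lmc, grad, h, rng)
    M_mala, accepted = mala_step(M_mala, log_post, grad, h, rng)
    n_accept += accepted
    if t >= burn_in:          # average the iterates after burn-in
        sum_lmc += M_lmc
        sum_mala += M_mala

M_hat_lmc = sum_lmc / (n_iter - burn_in)
M_hat_mala = sum_mala / (n_iter - burn_in)
accept_rate = n_accept / n_iter

def pred_error(M_hat):
    return float(np.sum((X @ (M_hat - M_star) @ Z)**2) / (n * q))

print(pred_error(M_hat_lmc), pred_error(M_hat_mala), accept_rate)
```

The acceptance ratio is computed on the log scale to avoid overflow, and the only difference between the two samplers is the accept/reject step, which costs extra evaluations of the log-density and its gradient at each iteration.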

We note that the step-size h for MALA is chosen such that the acceptance rate is approximately 0.5 following [45], while the step-size for LMC in the same setting should be smaller than the one for MALA [46].

4.2. Simulation Studies for Bilinear Regression

We perform some numerical studies on simulated data to assess the performance of our proposed algorithms. All simulations were conducted using the R statistical software [47].

For fixed dimensions $q=10$ and $k=20$, we consider $n=100$ and $n=1000$ to assess the effect of the sample size, and we take $p=10$ and $p=100$ for the row dimension of the coefficient matrix. The entries of the design matrices $X$ and $Z$ are independently simulated from the standard Gaussian $\mathcal{N}(0,1)$. Then, given a matrix $M^{*}$, we simulate the response matrix $Y$ from model (1), where the entries of the noise matrix $E$ are independent and identically distributed as $\mathcal{N}(0,1)$. We consider the following setups for the true coefficient matrix:

  • Model I: The true coefficient matrix $M^{*}$ is a rank-2 matrix that is generated as $M^{*}=B_1B_2^{\top}$, where $B_1\in\mathbb{R}^{p\times 2}$, $B_2\in\mathbb{R}^{k\times 2}$ and all entries in $B_1$ and $B_2$ are independent and identically distributed as $\mathcal{N}(0,1)$.

  • Model II: An approximate low-rank setup is studied. This series of simulations is similar to Model I, except that the true coefficient matrix is no longer of rank 2, but it can be well approximated by a rank-2 matrix:
    $M^{*}=2\cdot B_1B_2^{\top}+U,$
    where $U$ is a matrix whose entries are independent and identically distributed as $\mathcal{N}(0,0.1)$.
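The two simulation designs above can be sketched as follows. The original experiments were run in R; this NumPy version is only an illustration, the function name `simulate_bilinear` is hypothetical, and the value 0.1 in Model II is interpreted as a variance, which is an assumption.

```python
import numpy as np

def simulate_bilinear(n, p, k, q, model="I", rng=None):
    """Draw (X, Z, M_star, Y) from the bilinear model with standard Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, p))
    Z = rng.standard_normal((k, q))
    B1 = rng.standard_normal((p, 2))
    B2 = rng.standard_normal((k, 2))
    M_star = B1 @ B2.T                                    # Model I: exact rank 2
    if model == "II":
        U = rng.normal(0.0, np.sqrt(0.1), size=(p, k))    # 0.1 read as a variance (assumption)
        M_star = 2.0 * M_star + U                         # Model II: approximately rank 2
    Y = X @ M_star @ Z + rng.standard_normal((n, q))
    return X, Z, M_star, Y

rng = np.random.default_rng(0)
X1, Z1, M1, Y1 = simulate_bilinear(100, 10, 20, 10, model="I", rng=rng)
X2, Z2, M2, Y2 = simulate_bilinear(100, 10, 20, 10, model="II", rng=rng)
print(np.linalg.matrix_rank(M1), np.linalg.matrix_rank(M2))
```

In Model II the perturbation makes the coefficient matrix full rank, while its two leading singular values still carry most of the signal.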

We compare our approaches, denoted by LMC and MALA, against the (generalized) ordinary least squares estimator [2], denoted by OLS and defined as

$$\hat M_{\mathrm{OLS}}=(X^{\top}X)^{\dagger}X^{\top}YZ^{\top}(ZZ^{\top})^{\dagger},$$

where $A^{\dagger}$ denotes the Moore–Penrose inverse of a matrix $A$. We fix $\lambda=nq$ and $\tau=1$; the LMC and MALA methods are initialized at the OLS estimator and run for 10,000 iterations, of which the first 1000 are discarded as burn-in.
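A direct transcription of this estimator with NumPy is sketched below. The function name `ols_bilinear` and the toy dimensions are hypothetical; `numpy.linalg.pinv` computes the Moore–Penrose inverse.

```python
import numpy as np

def ols_bilinear(X, Y, Z):
    """Generalised least squares estimate (X^T X)^+ X^T Y Z^T (Z Z^T)^+."""
    return np.linalg.pinv(X.T @ X) @ X.T @ Y @ Z.T @ np.linalg.pinv(Z @ Z.T)

rng = np.random.default_rng(0)
n, p, k, q = 40, 3, 2, 6
X = rng.standard_normal((n, p))
Z = rng.standard_normal((k, q))
M_star = rng.standard_normal((p, k))
Y_clean = X @ M_star @ Z          # noise-free responses

M_ols = ols_bilinear(X, Y_clean, Z)
print(np.max(np.abs(M_ols - M_star)))
```

In this toy case both Gram matrices are invertible and the coefficient matrix is recovered exactly from noise-free data. When $k>q$, as in the simulation settings of this section, $ZZ^{\top}$ is singular, which is why a pseudo-inverse is needed rather than an ordinary inverse.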

The evaluations are performed by using the mean squared estimation error (Est) and the normalized (relative) mean square error (Nmse):

$$\mathrm{Est}:=\|\hat M-M^{*}\|_F^2/(pk),\qquad\mathrm{Nmse}:=\|\hat M-M^{*}\|_F^2/\|M^{*}\|_F^2,$$

and the prediction error (Pred) as

$$\mathrm{Pred}:=\|X(\hat M-M^{*})Z\|_F^2/(nq),$$

where $\hat M$ is one of the estimators obtained from LMC, MALA or OLS. We report the averages and the standard deviations of these errors over 100 data replications.
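The three error measures translate directly into code. The sketch below uses hypothetical function names and a tiny hand-checkable example.

```python
import numpy as np

def est_error(M_hat, M_star):
    """Est: squared Frobenius error divided by the number of entries p * k."""
    return float(np.sum((M_hat - M_star)**2) / M_star.size)

def nmse(M_hat, M_star):
    """Nmse: squared Frobenius error relative to the squared norm of M_star."""
    return float(np.sum((M_hat - M_star)**2) / np.sum(M_star**2))

def pred_error(M_hat, M_star, X, Z):
    """Pred: squared Frobenius error of the fitted values divided by n * q."""
    n, q = X.shape[0], Z.shape[1]
    return float(np.sum((X @ (M_hat - M_star) @ Z)**2) / (n * q))

M_star = np.array([[2.0, 0.0], [0.0, 1.0]])
M_hat = np.array([[1.0, 0.0], [0.0, 1.0]])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # n = 3, p = 2
Z = np.eye(2)                                        # k = q = 2

print(est_error(M_hat, M_star))         # 1 / 4
print(nmse(M_hat, M_star))              # 1 / 5
print(pred_error(M_hat, M_star, X, Z))  # 2 / 6
```

Est and Nmse compare coefficient matrices, whereas Pred compares fitted values and therefore depends on the design matrices.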

The results of our study are presented in Table 1 and Table 2. From the tables, it can be observed that our proposed methods perform similarly to the OLS method. However, the estimation method obtained from the MALA algorithm often results in smaller prediction errors, particularly in high-dimensional settings. This advantage is even more pronounced when the method is applied in the context of inductive matrix completion, as discussed in the next subsection.

Table 1.

Simulation results on simulated data in Model I in bilinear regression, with fixed q = 10, k = 20, for different methods, with their standard error in parentheses over 100 replications. (Est: average of estimation errors; Pred: average of prediction errors; Nmse: average of normalized estimation errors).

| n | p | Errors | LMC | MALA | OLS |
|---|---|---|---|---|---|
| 100 | 10 | Est | 1.0053 (0.5480) | 1.0342 (0.5559) | 1.0052 (0.5478) |
| 100 | 10 | Pred | 0.1138 (0.0171) | 0.0985 (0.0151) | 0.1014 (0.0154) |
| 100 | 10 | Nmse | 0.4931 (0.1178) | 0.5100 (0.1207) | 0.4930 (0.1178) |
| 100 | 100 | Est | 1.3544 (0.5867) | 1.3384 (0.5836) | 1.3544 (0.5867) |
| 100 | 100 | Pred | 1.0066 (0.0430) | 0.8761 (0.0756) | 1.0030 (0.0424) |
| 100 | 100 | Nmse | 0.7049 (0.2944) | 0.6963 (0.2927) | 0.7049 (0.2944) |
| 1000 | 10 | Est | 1.0776 (0.5671) | 1.0900 (0.5670) | 1.0776 (0.5671) |
| 1000 | 10 | Pred | 0.0099 (0.0013) | 0.0099 (0.0013) | 0.0099 (0.0013) |
| 1000 | 10 | Nmse | 0.5185 (0.1198) | 0.5264 (0.1219) | 0.5185 (0.1198) |
| 1000 | 100 | Est | 0.9662 (0.3240) | 0.9688 (0.3244) | 0.9662 (0.3240) |
| 1000 | 100 | Pred | 0.0999 (0.0051) | 0.0989 (0.0049) | 0.0998 (0.0051) |
| 1000 | 100 | Nmse | 0.4961 (0.1183) | 0.4976 (0.1191) | 0.4961 (0.1183) |

Table 2.

Simulation results on simulated data in Model II (approximate low-rank) in bilinear regression, with fixed q=10,k=20, for different methods, with their standard error in parentheses over 100 replications. (Est: average of estimation errors; Pred: average of prediction errors; Nmse: average of normalized estimation errors).

| n | p | Errors | LMC | MALA | OLS |
|---|---|---|---|---|---|
| 100 | 10 | Est | 4.0731 (1.828) | 4.0989 (1.821) | 4.0731 (1.828) |
| 100 | 10 | Pred | 0.1090 (0.0160) | 0.0969 (0.0140) | 0.0987 (0.0145) |
| 100 | 10 | Nmse | 0.5119 (0.1226) | 0.5162 (0.1241) | 0.5118 (0.1226) |
| 100 | 100 | Est | 4.6047 (1.812) | 4.6038 (1.813) | 4.6047 (1.812) |
| 100 | 100 | Pred | 1.0062 (0.0462) | 1.0597 (0.0495) | 1.0006 (0.0469) |
| 100 | 100 | Nmse | 0.5801 (0.1942) | 0.5800 (0.1941) | 0.5801 (0.1942) |
| 1000 | 10 | Est | 3.6733 (1.606) | 3.6884 (1.606) | 3.6733 (1.606) |
| 1000 | 10 | Pred | 0.0098 (0.0015) | 0.0098 (0.0015) | 0.0098 (0.0015) |
| 1000 | 10 | Nmse | 0.4812 (0.1271) | 0.4835 (0.1260) | 0.4813 (0.1271) |
| 1000 | 100 | Est | 3.9972 (1.375) | 3.9986 (1.376) | 3.9972 (1.375) |
| 1000 | 100 | Pred | 0.1000 (0.0043) | 0.1032 (0.0057) | 0.0999 (0.0043) |
| 1000 | 100 | Nmse | 0.5013 (0.1061) | 0.5014 (0.1063) | 0.5013 (0.1062) |

4.3. Simulation Studies for Inductive Matrix Completion

The simulation settings for inductive matrix completion are similar to the settings for bilinear regression, Section 4.2. However, after obtaining the response matrix Y, we remove uniformly at random κ=10% and κ=30% of the entries of Y. Here, κ denotes the missing rate. We denote the response matrix with missing entries by Ymiss.
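Removing a fixed fraction of the entries uniformly at random can be sketched as follows. The function name `remove_entries` is hypothetical, and marking missing entries with NaN is a convention chosen for this example.

```python
import numpy as np

def remove_entries(Y, kappa, rng):
    """Return a copy of Y in which a fraction kappa of the entries, chosen
    uniformly at random without replacement, is set to NaN."""
    n, q = Y.shape
    n_missing = int(round(kappa * n * q))
    flat_idx = rng.choice(n * q, size=n_missing, replace=False)
    rows, cols = np.unravel_index(flat_idx, (n, q))
    Y_miss = np.array(Y, dtype=float)   # copy, so that Y itself is left untouched
    Y_miss[rows, cols] = np.nan
    return Y_miss

rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 10))
Y_miss = remove_entries(Y, kappa=0.3, rng=rng)
observed = ~np.isnan(Y_miss)
print(int(np.isnan(Y_miss).sum()), "entries removed out of", Y.size)
```

The boolean mask `observed` gives the index pairs that enter the empirical risk; here each observed entry appears exactly once, which corresponds to sampling without replacement.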

In the context of inductive matrix completion, we only observe the response matrix with missing entries, $Y_{\mathrm{miss}}$, and thus we cannot construct the OLS estimator as in the case of bilinear regression. We therefore first impute the missing entries of $Y_{\mathrm{miss}}$ using the R package softImpute [48], with the rank set to the true rank of $M^{*}$. We denote the resulting imputed matrix by $Y_{\mathrm{imp}}$.

We compare our approaches, denoted by LMC and MALA, against the (imputed and generalized) ordinary least squares estimator, denoted by OLS_imp and defined as

$$\hat M_{\mathrm{OLS\_imp}}=(X^{\top}X)^{\dagger}X^{\top}Y_{\mathrm{imp}}Z^{\top}(ZZ^{\top})^{\dagger},$$

where $A^{\dagger}$ denotes the Moore–Penrose inverse of a matrix $A$. The LMC and MALA methods are initialized at the OLS_imp estimator and run for 10,000 iterations, of which the first 1000 are discarded as burn-in.

As previously discussed in Section 4.2, we present the averages and the standard deviation of the mean squared estimation error (Est), the normalized (relative) mean square error (Nmse), and the prediction error (Pred) over 100 data replications in our results.

The results are detailed in Table 3 and Table 4. These tables show that our MALA method attains a smaller prediction error than the other methods in most of the settings considered. This advantage becomes more pronounced as the missing rate in the response matrix increases. Additionally, our MALA method is robust and performs well in the approximate low-rank setting (Model II), whereas the OLS_imp and LMC methods do not.

Table 3.

Simulation results on simulated data in Model I in inductive matrix completion, with fixed q=10,k=20, for different methods, with their standard error in parentheses over 100 replications. (κ is the missing rate; Est: average of estimation errors; Pred: average of prediction errors; Nmse: average of normalized estimation errors).

| n | κ | p | Error | LMC | MALA | OLS_imp |
|---|---|---|---|---|---|---|
| 100 | 10% | 10 | Est | 1.0559 (0.5060) | 1.0803 (0.5122) | 1.0559 (0.5060) |
| 100 | 10% | 10 | Pred | 0.1028 (0.0193) | 0.1082 (0.0143) | 0.1020 (0.0197) |
| 100 | 10% | 10 | Nmse | 0.4986 (0.1116) | 0.5139 (0.1197) | 0.4986 (0.1116) |
| 100 | 10% | 100 | Est | 1.4008 (0.8555) | 1.3987 (0.8542) | 1.4009 (0.8555) |
| 100 | 10% | 100 | Pred | 1.2250 (0.4568) | 1.4468 (0.4137) | 1.2252 (0.4570) |
| 100 | 10% | 100 | Nmse | 0.7148 (0.3591) | 0.7136 (0.3581) | 0.7148 (0.3591) |
| 100 | 30% | 10 | Est | 1.0432 (0.4963) | 1.0917 (0.5085) | 1.0432 (0.4963) |
| 100 | 30% | 10 | Pred | 0.2402 (0.2705) | 0.1447 (0.0204) | 0.2446 (0.2780) |
| 100 | 30% | 10 | Nmse | 0.5242 (0.1257) | 0.5538 (0.1335) | 0.5242 (0.1257) |
| 100 | 30% | 100 | Est | 1.6242 (0.8179) | 1.6224 (0.8169) | 1.6242 (0.8179) |
| 100 | 30% | 100 | Pred | 9.8879 (14.11) | 10.807 (13.84) | 9.8901 (14.11) |
| 100 | 30% | 100 | Nmse | 0.7993 (0.3340) | 0.7985 (0.3334) | 0.7993 (0.3340) |
| 1000 | 10% | 10 | Est | 0.9810 (0.4532) | 0.9882 (0.4478) | 0.9810 (0.4532) |
| 1000 | 10% | 10 | Pred | 0.0114 (0.0033) | 0.0112 (0.0015) | 0.0114 (0.0033) |
| 1000 | 10% | 10 | Nmse | 0.4933 (0.1076) | 0.4984 (0.1075) | 0.4933 (0.1076) |
| 1000 | 10% | 100 | Est | 1.0063 (0.3465) | 1.0088 (0.3471) | 1.0063 (0.3465) |
| 1000 | 10% | 100 | Pred | 0.1902 (0.1758) | 0.1116 (0.0049) | 0.1902 (0.1759) |
| 1000 | 10% | 100 | Nmse | 0.5069 (0.1049) | 0.5082 (0.1050) | 0.5069 (0.1049) |
| 1000 | 30% | 10 | Est | 1.0110 (0.4886) | 1.0223 (0.4872) | 1.0110 (0.4886) |
| 1000 | 30% | 10 | Pred | 0.0539 (0.0599) | 0.0141 (0.0019) | 0.0540 (0.0599) |
| 1000 | 30% | 10 | Nmse | 0.5129 (0.1030) | 0.5206 (0.1043) | 0.5129 (0.1030) |
| 1000 | 30% | 100 | Est | 1.0291 (0.3567) | 1.0312 (0.3555) | 1.0291 (0.3567) |
| 1000 | 30% | 100 | Pred | 1.7529 (1.914) | 0.1475 (0.0078) | 1.7530 (1.913) |
| 1000 | 30% | 100 | Nmse | 0.5054 (0.1055) | 0.5067 (0.1053) | 0.5054 (0.1055) |

Table 4.

Simulation results on simulated data in Model II (approximate low-rank) in inductive matrix completion, with fixed q=10,k=20, for different methods, with their standard error in parentheses over 100 replications. (κ is the missing rate; Est: average of estimation errors; Pred: average of prediction errors; Nmse: average of normalized estimation errors).

| n | κ | p | Error | LMC | MALA | OLS_imp |
|---|---|---|---|---|---|---|
| 100 | 10% | 10 | Est | 3.8319 (1.691) | 3.8749 (1.719) | 3.8319 (1.690) |
| 100 | 10% | 10 | Pred | 0.1604 (0.1271) | 0.1092 (0.0153) | 0.1598 (0.1322) |
| 100 | 10% | 10 | Nmse | 0.5116 (0.1154) | 0.5169 (0.1147) | 0.5116 (0.1155) |
| 100 | 10% | 100 | Est | 5.9500 (2.834) | 5.9452 (2.835) | 5.9500 (2.834) |
| 100 | 10% | 100 | Pred | 4.7640 (5.272) | 4.6964 (5.515) | 4.7658 (5.275) |
| 100 | 10% | 100 | Nmse | 0.7313 (0.3454) | 0.7307 (0.3455) | 0.7313 (0.3454) |
| 100 | 30% | 10 | Est | 4.1838 (1.850) | 4.2535 (1.859) | 4.1839 (1.850) |
| 100 | 30% | 10 | Pred | 0.7221 (0.7562) | 0.1498 (0.0183) | 0.7371 (0.7741) |
| 100 | 30% | 10 | Nmse | 0.5182 (0.1128) | 0.5283 (0.1147) | 0.5182 (0.1128) |
| 100 | 30% | 100 | Est | 7.1589 (4.084) | 7.1558 (4.083) | 7.1589 (4.084) |
| 100 | 30% | 100 | Pred | 39.899 (52.40) | 40.233 (51.76) | 39.908 (52.41) |
| 100 | 30% | 100 | Nmse | 0.8998 (0.3821) | 0.8994 (0.3820) | 0.8998 (0.3821) |
| 1000 | 10% | 10 | Est | 3.9618 (1.678) | 3.9788 (1.677) | 3.9618 (1.678) |
| 1000 | 10% | 10 | Pred | 0.0409 (0.0269) | 0.0110 (0.0015) | 0.0409 (0.0269) |
| 1000 | 10% | 10 | Nmse | 0.4968 (0.1196) | 0.4989 (0.1195) | 0.4968 (0.1196) |
| 1000 | 10% | 100 | Est | 4.1153 (1.295) | 4.1163 (1.294) | 4.1153 (1.295) |
| 1000 | 10% | 100 | Pred | 1.0250 (0.9988) | 0.1135 (0.0051) | 1.0250 (0.9988) |
| 1000 | 10% | 100 | Nmse | 0.5060 (0.1096) | 0.5062 (0.1096) | 0.5060 (0.1096) |
| 1000 | 30% | 10 | Est | 4.1647 (1.990) | 4.1836 (1.995) | 4.1647 (1.990) |
| 1000 | 30% | 10 | Pred | 0.4615 (0.3497) | 0.0141 (0.0017) | 0.4616 (0.3498) |
| 1000 | 30% | 10 | Nmse | 0.4905 (0.1157) | 0.4933 (0.1171) | 0.4905 (0.1157) |
| 1000 | 30% | 100 | Est | 4.0578 (1.400) | 4.0565 (1.397) | 4.0578 (1.400) |
| 1000 | 30% | 100 | Pred | 8.5608 (6.419) | 0.1538 (0.0069) | 8.5609 (6.419) |
| 1000 | 30% | 100 | Nmse | 0.4944 (0.1184) | 0.4943 (0.1180) | 0.4944 (0.1184) |

5. Discussion and Conclusions

In this paper, we focus on the problem of bilinear regression and its extension, the problem of inductive matrix completion, where the response matrix contains missing data. We propose a novel approach that combines elements of Bayesian statistics with a quasi-likelihood method. Our proposed method first addresses the problem of bilinear regression using a quasi-Bayesian approach and then adapts this approach to the problem of inductive matrix completion. By making use of a low-rankness assumption and leveraging the powerful PAC-Bayes bound technique, we provide statistical properties for our proposed estimators and for the quasi-posteriors.

Our proposed method includes an efficient gradient-based sampling algorithm designed to sample from the (quasi) posterior distribution, which allows for the approximate computation of mean estimators. The resulting methods, referred to as LMC and MALA, were tested in various simulation studies and were found to perform well when compared to the ordinary least squares method. The ability to accurately sample from the (quasi) posterior distribution and compute mean estimators makes these methods a valuable tool for data analysis and modeling.

There are still some unresolved issues that require further investigation. One of these is the presence of missing data in the covariate matrices X and Z. This can have a significant impact on the analysis and may lead to biased results. Another area that needs further exploration is the assumption of independent and identically distributed data. In some cases, this assumption may not hold, and alternative models that allow for a dispersion matrix may be needed. These are potential topics for future research.

Acknowledgments

The author would like to thank the anonymous referees for their useful comments.

Appendix A. Appendix: Proofs

The main technique for our proofs is the oracle-type PAC-Bayes bounds, in the spirit of [35]. We start with a few preliminary lemmas.

Appendix A.1. Preliminary Lemmas

First, we state a version of Bernstein’s inequality from Proposition 2.9 page 24 in [49].

Lemma A1

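(Bernstein’s inequality). Let $U_1, \ldots, U_n$ be independent real-valued random variables. Let us assume that there are two constants $v$ and $w$ such that $\sum_{i=1}^{n} \mathbb{E}[U_i^2] \le v$ and, for all integers $k \ge 3$, $\sum_{i=1}^{n} \mathbb{E}[(U_i)_+^k] \le v\,\frac{k!\,w^{k-2}}{2}$. Then, for any $\zeta \in (0, 1/w)$,

$$\mathbb{E}\exp\left[\zeta \sum_{i=1}^{n} \big(U_i - \mathbb{E}(U_i)\big)\right] \le \exp\left(\frac{v\zeta^2}{2(1 - w\zeta)}\right).$$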

Another basic tool to derive the PAC-Bayes bounds is Donsker and Varadhan’s variational inequality; see Lemma 1.1.3 in Catoni [17] for a proof (among others). From now on, for any $\Theta \subset \mathbb{R}^{n_1 \times n_2}$, we let $\mathcal{P}(\Theta)$ denote the set of all probability distributions on $\Theta$ equipped with the Borel $\sigma$-algebra. For $(\mu,\nu) \in \mathcal{P}(\Theta)^2$, the Kullback–Leibler divergence is defined by $\mathcal{K}(\nu,\mu) = \int \log\left[\frac{d\nu}{d\mu}(\theta)\right] \nu(d\theta)$ if $\nu$ admits a density $\frac{d\nu}{d\mu}$ with respect to $\mu$, and $\mathcal{K}(\nu,\mu) = +\infty$ otherwise.

Lemma A2

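(Donsker and Varadhan’s variational formula). Let $\mu \in \mathcal{P}(\Theta)$. For any measurable, bounded function $h: \Theta \to \mathbb{R}$, we have

$$\log \int e^{h(\theta)} \mu(d\theta) = \sup_{\rho \in \mathcal{P}(\Theta)} \left[\int h(\theta)\,\rho(d\theta) - \mathcal{K}(\rho,\mu)\right].$$

Moreover, the supremum with respect to $\rho$ in the right-hand side is reached for the Gibbs measure $\mu_h$ defined by its density with respect to $\mu$

$$\frac{d\mu_h}{d\mu}(\theta) = \frac{e^{h(\theta)}}{\int e^{h(\vartheta)} \mu(d\vartheta)}. \tag{A1}$$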
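The variational formula can be checked numerically on a finite set, where all integrals become finite sums. The sketch below does this for a three-point Θ; the particular values of μ and h, and the helper names `kl` and `dv_objective`, are illustrative assumptions and do not come from the paper.

```python
import math

def kl(rho, mu):
    """Kullback-Leibler divergence K(rho, mu) on a finite set (mu > 0 everywhere)."""
    return sum(r * math.log(r / m) for r, m in zip(rho, mu) if r > 0)

def dv_objective(rho, mu, h):
    """The functional rho -> int h d(rho) - K(rho, mu) appearing in Lemma A2."""
    return sum(r * hv for r, hv in zip(rho, h)) - kl(rho, mu)

mu = [0.2, 0.5, 0.3]          # reference measure on a 3-point Theta
h = [1.0, -0.5, 2.0]          # bounded function h

# Left-hand side of the formula: log int e^h d(mu).
norm = sum(m * math.exp(hv) for m, hv in zip(mu, h))
lhs = math.log(norm)

# Gibbs measure (A1): density e^h / int e^h d(mu) with respect to mu.
gibbs = [m * math.exp(hv) / norm for m, hv in zip(mu, h)]

value_at_gibbs = dv_objective(gibbs, mu, h)   # attains the supremum
value_at_mu = dv_objective(mu, mu, h)         # another rho gives a smaller value
```

Evaluating the functional at the Gibbs measure reproduces the left-hand side up to rounding error, while any other choice of ρ (here ρ = μ) gives a strictly smaller value.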

These two lemmas are the only tools we need to prove Theorems 1 and 2. Their proofs are quite similar, with a few differences. For the sake of simplicity, we will state the common parts of the proofs as a separate result in Lemma A3. Note that the proof of this lemma will use Lemmas A1 and A2.

Lemma A3.

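Under Assumptions 1 and 2, put

$$\alpha = \lambda - \frac{\lambda^2 C_1}{2nq\left(1 - \frac{C_2\lambda}{nq}\right)} \quad\text{and}\quad \beta = \lambda + \frac{\lambda^2 C_1}{2nq\left(1 - \frac{C_2\lambda}{nq}\right)}. \tag{A2}$$

Then, for any $\varepsilon \in (0,1)$ and $\lambda \in (0, nq/C_2)$,

$$\mathbb{E}\left[\int \exp\left\{\alpha\big(R(M) - R(M^*)\big) + \lambda\big(-r(M) + r(M^*)\big) - \log\left[\frac{d\hat\rho_\lambda}{d\pi}(M)\right] - \log\frac{2}{\varepsilon}\right\} \hat\rho_\lambda(dM)\right] \le \frac{\varepsilon}{2} \tag{A3}$$

and

$$\mathbb{E}\left[\sup_{\rho \in \mathcal{P}(\mathbb{R}^{p\times k})} \exp\left\{\beta\left(-\int R\,d\rho + R(M^*)\right) + \lambda\left(\int r\,d\rho - r(M^*)\right) - \mathcal{K}(\rho,\pi) - \log\frac{2}{\varepsilon}\right\}\right] \le \frac{\varepsilon}{2}. \tag{A4}$$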

Proof of Lemma A3.

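We prove the first inequality (A3) as follows. Fix any $M$ with $\|XMZ^\top\|_\infty \le C$ and put

$$T_{ij} = \big(Y_{ij} - (XM^*Z^\top)_{ij}\big)^2 - \big(Y_{ij} - (\Pi_C(XMZ^\top))_{ij}\big)^2.$$

Note that the random variables $T_{ij}$ with $i = 1, \ldots, n$; $j = 1, \ldots, q$ are independent by construction. We have

$$\begin{aligned}
\sum_{i=1}^{n}\sum_{j=1}^{q} \mathbb{E}[T_{ij}^2]
&= \sum_{i=1}^{n}\sum_{j=1}^{q} \mathbb{E}\Big[\big(2Y_{ij} - (XM^*Z^\top)_{ij} - \Pi_C(XMZ^\top)_{ij}\big)^2 \big((XM^*Z^\top)_{ij} - \Pi_C(XMZ^\top)_{ij}\big)^2\Big] \\
&= \sum_{i=1}^{n}\sum_{j=1}^{q} \mathbb{E}\Big[\big(2E_{ij} + (XM^*Z^\top)_{ij} - \Pi_C(XMZ^\top)_{ij}\big)^2 \big((XM^*Z^\top)_{ij} - \Pi_C(XMZ^\top)_{ij}\big)^2\Big] \\
&\le \sum_{i=1}^{n}\sum_{j=1}^{q} \mathbb{E}\Big[8\big[E_{ij}^2 + C^2\big] \big((XM^*Z^\top)_{ij} - \Pi_C(XMZ^\top)_{ij}\big)^2\Big] \\
&\le \sum_{i=1}^{n}\sum_{j=1}^{q} 8\big[\sigma^2 + C^2\big]\, \mathbb{E}\Big[\big((XM^*Z^\top)_{ij} - (XMZ^\top)_{ij}\big)^2\Big] \\
&= 8nq(\sigma^2 + C^2)\big[R(M) - R(M^*)\big] = nq\,C_1\big[R(M) - R(M^*)\big] =: v(M, M^*).
\end{aligned}$$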

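Next we have, for any integer $k \ge 3$, that

$$\begin{aligned}
\sum_{i=1}^{n}\sum_{j=1}^{q} \mathbb{E}\big[(T_{ij})_+^k\big]
&\le \sum_{i=1}^{n}\sum_{j=1}^{q} \mathbb{E}\Big[\big|2Y_{ij} - (XM^*Z^\top)_{ij} - \Pi_C(XMZ^\top)_{ij}\big|^k \big|(XM^*Z^\top)_{ij} - \Pi_C(XMZ^\top)_{ij}\big|^k\Big] \\
&\le \sum_{i=1}^{n}\sum_{j=1}^{q} \mathbb{E}\Big[2^{k-1}\big[|2E_{ij}|^k + (2C)^k\big] \big|(XM^*Z^\top)_{ij} - \Pi_C(XMZ^\top)_{ij}\big|^k\Big] \\
&\le \sum_{i=1}^{n}\sum_{j=1}^{q} \mathbb{E}\Big[2^{2k-1}\big[|E_{ij}|^k + C^k\big](2C)^{k-2} \big|(XM^*Z^\top)_{ij} - \Pi_C(XMZ^\top)_{ij}\big|^2\Big] \\
&\le 2^{2k-1}\big[\sigma^2 k!\,\xi^{k-2} + C^k\big](2C)^{k-2} \sum_{i=1}^{n}\sum_{j=1}^{q} \mathbb{E}\Big[\big((XM^*Z^\top)_{ij} - (XMZ^\top)_{ij}\big)^2\Big] \\
&= \frac{2^{3k-3}\big[\sigma^2 k!\,\xi^{k-2} + C^k\big]C^{k-2}}{8(\sigma^2 + C^2)}\, v(M, M^*) \\
&\le \frac{2^{3k-6}\big[\sigma^2 \xi^{k-2} + C^k\big]C^{k-2}}{\sigma^2 + C^2}\, k!\, v(M, M^*) \\
&\le 2^{3k-5}\big[\xi^{k-2} + C^{k-2}\big]C^{k-2}\, k!\, v(M, M^*) \\
&\le 2^{3k-4}\max(\xi, C)^{k-2}\, C^{k-2}\, k!\, v(M, M^*) \\
&= \big[2^3\max(\xi, C)\,C\big]^{k-2}\, 2^2\, k!\, v(M, M^*)
\end{aligned}$$

and use the fact that, for any $k \ge 3$, $2^2 \le 2^{3(k-2)}/2$ to obtain

$$\sum_{i=1}^{n}\sum_{j=1}^{q} \mathbb{E}\big[(T_{ij})_+^k\big] \le \frac{\big[2^6\max(\xi, C)\,C\big]^{k-2}\, k!\, v(M, M^*)}{2} = v(M, M^*)\,\frac{k!\, C_2^{k-2}}{2}.$$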

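Thus, we can apply Lemma A1 to the $nq$ variables $T_{ij}$ (in the role of the $U_i$), with $v := v(M, M^*)$, $w := C_2$ and $\zeta := \lambda/(nq)$. We obtain, for any $\lambda \in (0, nq/w) = (0, nq/C_2)$,

$$\mathbb{E}\exp\Big[\lambda\big(R(M) - R(M^*) - r(M) + r(M^*)\big)\Big] \le \exp\left[\frac{v\lambda^2}{2(nq)^2\left(1 - \frac{w\lambda}{nq}\right)}\right] = \exp\left[\frac{C_1\big[R(M) - R(M^*)\big]\lambda^2}{2nq\left(1 - \frac{C_2\lambda}{nq}\right)}\right].$$

Rearranging terms, and using the definition of $\alpha$ in (A2),

$$\mathbb{E}\exp\Big[\alpha\big(R(M) - R(M^*)\big) + \lambda\big(-r(M) + r(M^*)\big)\Big] \le 1.$$

Multiplying both sides by $\varepsilon/2$ and then integrating with respect to the probability distribution $\pi(\cdot)$, we obtain

$$\int \mathbb{E}\exp\left[\alpha\big(R(M) - R(M^*)\big) + \lambda\big(-r(M) + r(M^*)\big) - \log\frac{2}{\varepsilon}\right] \pi(dM) \le \frac{\varepsilon}{2}.$$

Next, Fubini’s theorem gives

$$\mathbb{E}\int \exp\left[\alpha\big(R(M) - R(M^*)\big) + \lambda\big(-r(M) + r(M^*)\big) - \log\frac{2}{\varepsilon}\right] \pi(dM) \le \frac{\varepsilon}{2},$$

and note that for any measurable function $h$,

$$\int \exp[h(M)]\,\pi(dM) = \int \exp\left[h(M) - \log\frac{d\hat\rho_\lambda}{d\pi}(M)\right] \hat\rho_\lambda(dM)$$

to obtain (A3).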

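Let us now prove (A4). Here again, we start with an application of Lemma A1, but this time with $U_i := -T_{ij}$ (we keep $v := v(M, M^*)$, $w := C_2$ and $\zeta := \lambda/(nq)$). We obtain, for any $\lambda \in (0, nq/C_2)$,

$$\mathbb{E}\exp\Big[\lambda\big(r(M) - r(M^*) - R(M) + R(M^*)\big)\Big] \le \exp\left[\frac{C_1\big[R(M) - R(M^*)\big]\lambda^2}{2nq\left(1 - \frac{C_2\lambda}{nq}\right)}\right].$$

Rearranging terms, using the definition of $\beta$ in (A2) and multiplying both sides by $\varepsilon/2$, we obtain

$$\mathbb{E}\exp\left[\beta\big(-R(M) + R(M^*)\big) + \lambda\big(r(M) - r(M^*)\big) - \log\frac{2}{\varepsilon}\right] \le \frac{\varepsilon}{2}.$$

We integrate with respect to $\pi$ and use Fubini to obtain

$$\mathbb{E}\int \exp\left[\beta\big(-R(M) + R(M^*)\big) + \lambda\big(r(M) - r(M^*)\big) - \log\frac{2}{\varepsilon}\right] \pi(dM) \le \frac{\varepsilon}{2}.$$

Here, we use a different argument from the proof of the first inequality: we use Lemma A2 on the integral, and this gives directly (A4).    □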

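Finally, in both proofs, we will often use distributions $\rho \in \mathcal{P}(\mathbb{R}^{p\times k})$ that are defined as translations of the prior $\pi$. We introduce the following notation.

Definition A1.

For any matrix $\bar{M} \in \mathbb{R}^{p\times k}$, we define $\rho_{\bar{M}} \in \mathcal{P}(\mathbb{R}^{p\times k})$ by

$$\rho_{\bar{M}}(M) = \pi(\bar{M} - M).$$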

The following technical lemmas from [31] will be useful in the proofs.

Lemma A4

(Lemma 1 in [31]). We have $\int \|M\|_F^2\, \pi(dM) \le pk\tau^2$.

Lemma A5

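(Lemma 2 in [31]). For any $\bar{M} \in \mathbb{R}^{p\times k}$, we have

$$\mathcal{K}(\rho_{\bar{M}}, \pi) \le 2\,\mathrm{rank}(\bar{M})(k + p + 2)\log\left(1 + \frac{\|\bar{M}\|_F}{\tau\sqrt{2\,\mathrm{rank}(\bar{M})}}\right)$$

with the convention $0\log(1 + 0/0) = 0$.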

Appendix A.2. Proof of Theorem 1

Proof of Theorem 1.

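An application of Jensen’s inequality on inequality (A3) yields

$$\mathbb{E}\exp\left[\alpha\left(\int R\,d\hat\rho_\lambda - R(M^*)\right) + \lambda\left(-\int r\,d\hat\rho_\lambda + r(M^*)\right) - \mathcal{K}(\hat\rho_\lambda, \pi) - \log\frac{2}{\varepsilon}\right] \le \frac{\varepsilon}{2}.$$

Using the standard Chernoff’s trick to transform an exponential moment inequality into a deviation inequality, that is, $\exp(x) \ge \mathbf{1}_{\mathbb{R}_+}(x)$, we obtain

$$\mathbb{P}\left\{\left[\alpha\left(\int R\,d\hat\rho_\lambda - R(M^*)\right) + \lambda\left(-\int r\,d\hat\rho_\lambda + r(M^*)\right) - \mathcal{K}(\hat\rho_\lambda, \pi) - \log\frac{2}{\varepsilon}\right] \ge 0\right\} \le \frac{\varepsilon}{2}. \tag{A5}$$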
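For completeness, the Chernoff step behind (A5) can be written out in one line. Here $S$ denotes the random quantity in square brackets in (A5); the symbol $S$ is introduced only for this remark and does not appear in the paper.

```latex
\mathbb{P}(S \ge 0)
  = \mathbb{E}\big[\mathbf{1}_{\mathbb{R}_+}(S)\big]
  \le \mathbb{E}\big[\exp(S)\big]
  \le \frac{\varepsilon}{2}.
```

The first inequality is the pointwise bound $\mathbf{1}_{\mathbb{R}_+}(x) \le \exp(x)$, and the second is the exponential moment bound obtained from (A3) via Jensen’s inequality.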

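Using (2) we have

$$\begin{aligned}
\int R\,d\hat\rho_\lambda - R(M^*)
&= \frac{1}{nq}\int \big\|\Pi_C(XMZ^\top) - XM^*Z^\top\big\|_F^2\, \hat\rho_\lambda(dM) \\
&\ge \frac{1}{nq}\left\|\int \Pi_C(XMZ^\top)\,\hat\rho_\lambda(dM) - XM^*Z^\top\right\|_F^2 \\
&= \frac{1}{nq}\big\|\widehat{XMZ^\top}_\lambda - XM^*Z^\top\big\|_F^2
\end{aligned}$$

where we used Jensen’s inequality in the second line, and the definition of $\widehat{XMZ^\top}_\lambda$ from the second to the third line. Plugging this into our probability bound (A5), and dividing both sides by $\alpha$, we obtain

$$\mathbb{P}\left\{\frac{1}{nq}\big\|\widehat{XMZ^\top}_\lambda - XM^*Z^\top\big\|_F^2 \le \frac{\int r\,d\hat\rho_\lambda - r(M^*) + \frac{1}{\lambda}\left[\mathcal{K}(\hat\rho_\lambda, \pi) + \log\frac{2}{\varepsilon}\right]}{\frac{\alpha}{\lambda}}\right\} \ge 1 - \frac{\varepsilon}{2}$$

under the additional condition that $\lambda$ is such that $\alpha > 0$, which we will assume from now on (note that this is satisfied by $\lambda^*$). Using Lemma A2, we can rewrite this as

$$\mathbb{P}\left\{\frac{1}{nq}\big\|\widehat{XMZ^\top}_\lambda - XM^*Z^\top\big\|_F^2 \le \inf_{\rho \in \mathcal{P}(\mathbb{R}^{p\times k})} \frac{\int r\,d\rho - r(M^*) + \frac{1}{\lambda}\left[\mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon}\right]}{\frac{\alpha}{\lambda}}\right\} \ge 1 - \frac{\varepsilon}{2}. \tag{A6}$$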

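We now consider the consequences of the second inequality in Lemma A3, that is, (A4). With Chernoff’s trick and rearranging terms a little, we obtain

$$\mathbb{P}\left\{\forall \rho \in \mathcal{P}(\mathbb{R}^{p\times k}),\ \int r\,d\rho - r(M^*) \le \frac{\beta}{\lambda}\left[\int R\,d\rho - R(M^*)\right] + \frac{1}{\lambda}\left[\mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon}\right]\right\} \ge 1 - \frac{\varepsilon}{2},$$

which we can rewrite as, $\forall \rho \in \mathcal{P}(\mathbb{R}^{p\times k})$, with probability at least $1 - \frac{\varepsilon}{2}$,

$$\int r\,d\rho - r(M^*) \le \frac{\beta}{\lambda}\,\frac{1}{nq}\int \big\|\Pi_C(XMZ^\top) - XM^*Z^\top\big\|_F^2\, \rho(dM) + \frac{1}{\lambda}\left[\mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon}\right]. \tag{A7}$$

Combining (A7) and (A6) with a union bound argument gives the following bound, with probability at least $1 - \varepsilon$,

$$\frac{1}{nq}\big\|\widehat{XMZ^\top}_\lambda - XM^*Z^\top\big\|_F^2 \le \inf_{\rho \in \mathcal{P}(\mathbb{R}^{p\times k})} \frac{\beta\,\frac{1}{nq}\int \big\|\Pi_C(XMZ^\top) - XM^*Z^\top\big\|_F^2\, \rho(dM) + 2\left[\mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon}\right]}{\alpha}.$$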

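Note that, for any $(i,j)$, $(XM^*Z^\top)_{i,j} \in [-C, C]$ implies that

$$\big|(\Pi_C(XMZ^\top))_{i,j} - (XM^*Z^\top)_{i,j}\big| \le \big|(XMZ^\top)_{i,j} - (XM^*Z^\top)_{i,j}\big|$$

and thus

$$\frac{1}{nq}\big\|\widehat{XMZ^\top}_\lambda - XM^*Z^\top\big\|_F^2 \le \inf_{\rho \in \mathcal{P}(\mathbb{R}^{p\times k})} \frac{\beta\,\frac{1}{nq}\int \big\|XMZ^\top - XM^*Z^\top\big\|_F^2\, \rho(dM) + 2\left[\mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon}\right]}{\alpha}.$$

The end of the proof consists in making the right-hand side of the inequality more explicit. In order to do so, we restrict the infimum above to the distributions given by Definition A1:

$$\mathbb{P}\left\{\frac{1}{nq}\big\|\widehat{XMZ^\top}_\lambda - XM^*Z^\top\big\|_F^2 \le \inf_{\bar{M} \in \mathbb{R}^{p\times k}} \frac{\beta\,\frac{1}{nq}\int \big\|XMZ^\top - XM^*Z^\top\big\|_F^2\, \rho_{\bar{M}}(dM) + 2\left[\mathcal{K}(\rho_{\bar{M}}, \pi) + \log\frac{2}{\varepsilon}\right]}{\alpha}\right\} \ge 1 - \varepsilon. \tag{A8}$$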

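We see immediately that Dalalyan’s lemma will be extremely useful for that. First, Lemma A5 provides an upper bound on $\mathcal{K}(\rho_{\bar{M}}, \pi)$. Moreover,

$$\begin{aligned}
\int \big\|XMZ^\top - XM^*Z^\top\big\|_F^2\, \rho_{\bar{M}}(dM)
&= \int \big\|X\bar{M}Z^\top - XM^*Z^\top - XMZ^\top\big\|_F^2\, \pi(dM) \\
&= \big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_F^2 - 2\int \sum_{i,j} (X\bar{M}Z^\top - XM^*Z^\top)_{i,j}(XMZ^\top)_{i,j}\, \pi(dM) + \int \big\|XMZ^\top\big\|_F^2\, \pi(dM).
\end{aligned}$$

The second term in the right-hand side is null because $\pi$ is centered, and thus

$$\begin{aligned}
\int \big\|XMZ^\top - XM^*Z^\top\big\|_F^2\, \rho_{\bar{M}}(dM)
&= \big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_F^2 + \int \big\|XMZ^\top\big\|_F^2\, \pi(dM) \\
&\le \big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_F^2 + \|X\|_F^2\|Z\|_F^2 \int \|M\|_F^2\, \pi(dM) \\
&\le \big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_F^2 + \|X\|_F^2\|Z\|_F^2\, pk\tau^2
\end{aligned}$$

where we used elementary properties of the Frobenius norm, and Lemma A4 in the last line. We can now plug this (and Lemma A5) back into (A8) to obtain

$$\mathbb{P}\Bigg\{\frac{1}{nq}\big\|\widehat{XMZ^\top}_\lambda - XM^*Z^\top\big\|_F^2 \le \inf_{\bar{M} \in \mathbb{R}^{p\times k}} \Bigg[\frac{\beta}{\alpha}\frac{1}{nq}\big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_F^2 + \frac{\beta}{\alpha}\frac{1}{nq}\|X\|_F^2\|Z\|_F^2\, pk\tau^2 + \frac{1}{\alpha}\left(4\,\mathrm{rank}(\bar{M})(k + p + 2)\log\left(1 + \frac{\|\bar{M}\|_F}{\tau\sqrt{2\,\mathrm{rank}(\bar{M})}}\right) + 2\log\frac{2}{\varepsilon}\right)\Bigg]\Bigg\} \ge 1 - \varepsilon.$$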

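We are now making the constants explicit. First, if $\lambda \le nq/(2C_2)$, then $2nq(1 - C_2\lambda/(nq)) \ge nq$ and thus

$$\frac{\beta}{\alpha} = \frac{1 + \frac{\lambda C_1}{2nq\left(1 - \frac{C_2\lambda}{nq}\right)}}{1 - \frac{\lambda C_1}{2nq\left(1 - \frac{C_2\lambda}{nq}\right)}} \le \frac{1 + \frac{\lambda C_1}{nq}}{1 - \frac{\lambda C_1}{nq}}.$$

Then, $\lambda \le \frac{nq\,\delta}{C_1(1+\delta)}$ leads to $\frac{\beta}{\alpha} \le (1 + \delta)$.

Note that $\lambda^* = nq\min\big(1/(2C_2),\ \delta/[C_1(1+\delta)]\big)$ satisfies these two conditions, so from now on, $\lambda = \lambda^*$. We also use the following:

$$\frac{1}{\alpha} = \frac{1}{\lambda^*\left[1 - \frac{\lambda^* C_1}{2nq(1 - C_2\lambda^*/(nq))}\right]} \le \frac{\beta}{\lambda^*\alpha} \le \frac{(1+\delta)}{nq\min\big(1/(2C_2),\ \delta/[C_1(1+\delta)]\big)} \le \frac{C_1(1+\delta)^2}{nq\,\delta}.$$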
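To make the tuning concrete, the sketch below evaluates $C_1$, $C_2$, $\lambda^*$, $\alpha$ and $\beta$ for given problem constants. The expressions $C_1 = 8(\sigma^2 + C^2)$ and $C_2 = 2^6\max(\xi, C)\,C$ are read off from the proof of Lemma A3; the function name `pac_bayes_constants` and the numerical inputs are illustrative assumptions, not values used in the paper.

```python
def pac_bayes_constants(sigma2, xi, C, n, q, delta):
    """Constants from the proof of Lemma A3 and the tuning lambda* of Theorem 1."""
    C1 = 8.0 * (sigma2 + C ** 2)             # from the bound on sum E[T_ij^2]
    C2 = 2 ** 6 * max(xi, C) * C             # from the bound on the k-th moments
    lam = n * q * min(1.0 / (2.0 * C2), delta / (C1 * (1.0 + delta)))
    correction = lam ** 2 * C1 / (2.0 * n * q * (1.0 - C2 * lam / (n * q)))
    alpha = lam - correction                 # definition (A2)
    beta = lam + correction                  # definition (A2)
    return {"C1": C1, "C2": C2, "lambda": lam, "alpha": alpha, "beta": beta}

# Illustrative values only (not taken from the numerical experiments).
consts = pac_bayes_constants(sigma2=1.0, xi=1.0, C=1.0, n=100, q=10, delta=0.5)
```

With these inputs the sketch gives $\lambda^* = 7.8125$, which lies inside $(0, nq/C_2)$, and a positive $\alpha$, as the proof requires.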

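So far the bound is:

$$\mathbb{P}\Bigg\{\frac{1}{nq}\big\|\widehat{XMZ^\top}_{\lambda^*} - XM^*Z^\top\big\|_F^2 \le \inf_{\bar{M} \in \mathbb{R}^{p\times k}} \Bigg[\frac{(1+\delta)}{nq}\big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_F^2 + \frac{(1+\delta)}{nq}\|X\|_F^2\|Z\|_F^2\, pk\tau^2 + \frac{C_1(1+\delta)^2\left(4\,\mathrm{rank}(\bar{M})(k + p + 2)\log\left(1 + \frac{\|\bar{M}\|_F}{\tau\sqrt{2\,\mathrm{rank}(\bar{M})}}\right) + 2\log\frac{2}{\varepsilon}\right)}{nq\,\delta}\Bigg]\Bigg\} \ge 1 - \varepsilon.$$

In particular, with probability at least $1 - \varepsilon$, the choice $\tau^2 = C_1(k+p)/(nkq\|X\|_F^2\|Z\|_F^2)$ gives

$$\frac{1}{nq}\big\|\widehat{XMZ^\top}_{\lambda^*} - XM^*Z^\top\big\|_F^2 \le \inf_{\bar{M} \in \mathbb{R}^{p\times k}} \Bigg[\frac{(1+\delta)}{nq}\big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_F^2 + \frac{C_1(1+\delta)(k+p)}{nq} + \frac{C_1(1+\delta)^2\left(4\,\mathrm{rank}(\bar{M})(k + p + 2)\log\left(1 + \frac{\|X\|_F\|Z\|_F\|\bar{M}\|_F}{\sqrt{C_1}}\sqrt{\frac{nkq}{(k+p)\,\mathrm{rank}(\bar{M})}}\right) + 2\log\frac{2}{\varepsilon}\right)}{nq\,\delta}\Bigg].$$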

   □

Appendix A.3. Proof of Theorem 2

Proof of Theorem 2.

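We also start with an application of Lemma A3, and focus on (A3), applied to $\varepsilon := \varepsilon_n$, that is:

$$\mathbb{E}\left[\int \exp\left\{\alpha\big(R(M) - R(M^*)\big) + \lambda\big(-r(M) + r(M^*)\big) - \log\left[\frac{d\hat\rho_\lambda}{d\pi}(M)\right] - \log\frac{2}{\varepsilon_n}\right\} \hat\rho_\lambda(dM)\right] \le \frac{\varepsilon_n}{2}.$$

Using Chernoff’s trick, this gives

$$\mathbb{E}\Big[\mathbb{P}_{M\sim\hat\rho_\lambda}(M \in A_n)\Big] \ge 1 - \frac{\varepsilon_n}{2}$$

where

$$A_n = \left\{M : \alpha\big(R(M) - R(M^*)\big) + \lambda\big(-r(M) + r(M^*)\big) \le \log\left[\frac{d\hat\rho_\lambda}{d\pi}(M)\right] + \log\frac{2}{\varepsilon_n}\right\}.$$

Using the definition of $\hat\rho_\lambda$, for $M \in A_n$ we have

$$\begin{aligned}
\alpha\big(R(M) - R(M^*)\big)
&\le \lambda\big(r(M) - r(M^*)\big) + \log\left[\frac{d\hat\rho_\lambda}{d\pi}(M)\right] + \log\frac{2}{\varepsilon_n} \\
&= -\log\int \exp\big[-\lambda r(M)\big]\,\pi(dM) - \lambda r(M^*) + \log\frac{2}{\varepsilon_n} \\
&= \lambda\left(\int r(M)\,\hat\rho_\lambda(dM) - r(M^*)\right) + \mathcal{K}(\hat\rho_\lambda, \pi) + \log\frac{2}{\varepsilon_n} \\
&= \inf_{\rho}\left\{\lambda\left(\int r(M)\,\rho(dM) - r(M^*)\right) + \mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon_n}\right\}.
\end{aligned}$$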

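Now, let us define

$$B_n = \left\{\forall \rho : \beta\left(-\int R\,d\rho + R(M^*)\right) + \lambda\left(\int r\,d\rho - r(M^*)\right) \le \mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon_n}\right\}.$$

Using (A4), we have that

$$\mathbb{E}\big[\mathbf{1}_{B_n}\big] \ge 1 - \frac{\varepsilon_n}{2}.$$

We will now prove that, if $\lambda$ is such that $\alpha > 0$,

$$\mathbb{E}\Big[\mathbb{P}_{M\sim\hat\rho_\lambda}(M \in \mathcal{M}_n)\Big] \ge \mathbb{E}\Big[\mathbb{P}_{M\sim\hat\rho_\lambda}(M \in A_n)\,\mathbf{1}_{B_n}\Big]$$

which, together with

$$\begin{aligned}
\mathbb{E}\Big[\mathbb{P}_{M\sim\hat\rho_\lambda}(M \in A_n)\,\mathbf{1}_{B_n}\Big]
&= \mathbb{E}\Big[\big(1 - \mathbb{P}_{M\sim\hat\rho_\lambda}(M \notin A_n)\big)\big(1 - \mathbf{1}_{B_n^c}\big)\Big] \\
&\ge \mathbb{E}\Big[1 - \mathbb{P}_{M\sim\hat\rho_\lambda}(M \notin A_n) - \mathbf{1}_{B_n^c}\Big] \ge 1 - \varepsilon_n
\end{aligned}$$

will bring

$$\mathbb{E}\Big[\mathbb{P}_{M\sim\hat\rho_\lambda}(M \in \mathcal{M}_n)\Big] \ge 1 - \varepsilon_n.$$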

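In order to do so, assume that we are on the set $B_n$, and let $M \in A_n$. Then,

$$\begin{aligned}
\alpha\big(R(M) - R(M^*)\big)
&\le \inf_{\rho}\left\{\lambda\left(\int r(M)\,\rho(dM) - r(M^*)\right) + \mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon_n}\right\} \\
&\le \inf_{\rho}\left\{\beta\left(\int R(M)\,\rho(dM) - R(M^*)\right) + 2\mathcal{K}(\rho, \pi) + 2\log\frac{2}{\varepsilon_n}\right\}
\end{aligned}$$

that is,

$$R(M) - R(M^*) \le \inf_{\rho \in \mathcal{P}(\mathbb{R}^{p\times k})} \frac{\beta\left[\int R\,d\rho - R(M^*)\right] + 2\left[\mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon_n}\right]}{\alpha}$$

or, rewriting it in terms of norms,

$$\big\|\Pi_C(XMZ^\top) - XM^*Z^\top\big\|_F^2 \le \inf_{\bar{M} \in \mathbb{R}^{p\times k}} \frac{\beta\int \big\|XMZ^\top - XM^*Z^\top\big\|_F^2\, \rho_{\bar{M}}(dM) + 2\left[\mathcal{K}(\rho_{\bar{M}}, \pi) + \log\frac{2}{\varepsilon_n}\right]}{\alpha}.$$

We upper-bound the right-hand side exactly as in the proof of Theorem 1, which gives $M \in \mathcal{M}_n$.   □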

Appendix A.4. Proof of Theorem 3

Lemma A6.

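Under Assumptions 1 and 3, put

$$\alpha = \lambda - \frac{\lambda^2 C_1}{2m\left(1 - \frac{C_2\lambda}{m}\right)} \quad\text{and}\quad \beta = \lambda + \frac{\lambda^2 C_1}{2m\left(1 - \frac{C_2\lambda}{m}\right)}. \tag{A9}$$

Then, for any $\varepsilon \in (0,1)$ and $\lambda \in (0, m/C_2)$,

$$\mathbb{E}\left[\int \exp\left\{\alpha\big(R(M) - R(M^*)\big) + \lambda\big(-r(M) + r(M^*)\big) - \log\left[\frac{d\hat\rho_\lambda}{d\pi}(M)\right] - \log\frac{2}{\varepsilon}\right\} \hat\rho_\lambda(dM)\right] \le \frac{\varepsilon}{2} \tag{A10}$$

and

$$\mathbb{E}\left[\sup_{\rho \in \mathcal{P}(\mathbb{R}^{p\times k})} \exp\left\{\beta\left(-\int R\,d\rho + R(M^*)\right) + \lambda\left(\int r\,d\rho - r(M^*)\right) - \mathcal{K}(\rho,\pi) - \log\frac{2}{\varepsilon}\right\}\right] \le \frac{\varepsilon}{2}. \tag{A11}$$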

Proof of Lemma A6.

The inequality (A10) is proved in the same way as (A3) in Lemma A3. That is, we apply Lemma A1 to the following independent random variables

$$V_i = \big(Y_i - (XM^*Z^\top)_i\big)^2 - \big(Y_i - (\Pi_C(XMZ^\top))_i\big)^2, \quad i = 1, \ldots, m.$$

The proof of the inequality (A11) proceeds as the proof of (A4) in Lemma A3, applying Lemma A1 to the independent random variables $-V_i$, $i = 1, \ldots, m$.    □

Proof of Theorem 3.

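We proceed as in the proof of Theorem 1 up to (A6). Using (6), we have

$$\begin{aligned}
\int R\,d\hat\rho_\lambda - R(M^*)
&= \int \big\|\Pi_C(XMZ^\top) - XM^*Z^\top\big\|_{F,\Pi}^2\, \hat\rho_\lambda(dM) \\
&\ge \left\|\int \Pi_C(XMZ^\top)\,\hat\rho_\lambda(dM) - XM^*Z^\top\right\|_{F,\Pi}^2 \\
&= \big\|\widehat{XMZ^\top}_\lambda - XM^*Z^\top\big\|_{F,\Pi}^2,
\end{aligned}$$

and thus, we obtain

$$\mathbb{P}\left\{\big\|\widehat{XMZ^\top}_\lambda - XM^*Z^\top\big\|_{F,\Pi}^2 \le \inf_{\rho \in \mathcal{P}(\mathbb{R}^{p\times k})} \frac{\int r\,d\rho - r(M^*) + \frac{1}{\lambda}\left[\mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon}\right]}{\frac{\alpha}{\lambda}}\right\} \ge 1 - \frac{\varepsilon}{2}. \tag{A12}$$

We now consider the consequences of inequality (A11) in Lemma A6. With Chernoff’s trick and rearranging terms a little, we obtain, $\forall \rho \in \mathcal{P}(\mathbb{R}^{p\times k})$, with probability at least $1 - \frac{\varepsilon}{2}$,

$$\int r\,d\rho - r(M^*) \le \frac{\beta}{\lambda}\int \big\|\Pi_C(XMZ^\top) - XM^*Z^\top\big\|_{F,\Pi}^2\, \rho(dM) + \frac{1}{\lambda}\left[\mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon}\right]. \tag{A13}$$

We combine (A13) and (A12) with a union bound argument and note that, for any $(i,j)$, $(XM^*Z^\top)_{i,j} \in [-C, C]$ implies that $\big|(\Pi_C(XMZ^\top))_{i,j} - (XM^*Z^\top)_{i,j}\big| \le \big|(XMZ^\top)_{i,j} - (XM^*Z^\top)_{i,j}\big|$. This gives

$$\mathbb{P}\left\{\big\|\widehat{XMZ^\top}_\lambda - XM^*Z^\top\big\|_{F,\Pi}^2 \le \inf_{\rho \in \mathcal{P}(\mathbb{R}^{p\times k})} \frac{\beta\int \big\|XMZ^\top - XM^*Z^\top\big\|_{F,\Pi}^2\, \rho(dM) + 2\left[\mathcal{K}(\rho, \pi) + \log\frac{2}{\varepsilon}\right]}{\alpha}\right\} \ge 1 - \varepsilon.$$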

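We now make the right-hand side of the inequality more explicit. In order to do so, we restrict the infimum above to the distributions given by Definition A1:

$$\mathbb{P}\left\{\big\|\widehat{XMZ^\top}_\lambda - XM^*Z^\top\big\|_{F,\Pi}^2 \le \inf_{\bar{M} \in \mathbb{R}^{p\times k}} \frac{\beta\int \big\|XMZ^\top - XM^*Z^\top\big\|_{F,\Pi}^2\, \rho_{\bar{M}}(dM) + 2\left[\mathcal{K}(\rho_{\bar{M}}, \pi) + \log\frac{2}{\varepsilon}\right]}{\alpha}\right\} \ge 1 - \varepsilon. \tag{A14}$$

We see immediately that Dalalyan’s lemma will be extremely useful for that. First, Lemma A5 provides an upper bound on $\mathcal{K}(\rho_{\bar{M}}, \pi)$.

Moreover,

$$\begin{aligned}
\int \big\|XMZ^\top - XM^*Z^\top\big\|_{F,\Pi}^2\, \rho_{\bar{M}}(dM)
&= \int \big\|X\bar{M}Z^\top - XM^*Z^\top - XMZ^\top\big\|_{F,\Pi}^2\, \pi(dM) \\
&= \big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_{F,\Pi}^2 - 2\int \sum_{i,j} \Pi_{i,j}(X\bar{M}Z^\top - XM^*Z^\top)_{i,j}(XMZ^\top)_{i,j}\, \pi(dM) + \int \big\|XMZ^\top\big\|_{F,\Pi}^2\, \pi(dM).
\end{aligned}$$

The second term in the above right-hand side is null because $\pi$ is centered, and thus

$$\begin{aligned}
\int \big\|XMZ^\top - XM^*Z^\top\big\|_{F,\Pi}^2\, \rho_{\bar{M}}(dM)
&= \big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_{F,\Pi}^2 + \int \big\|XMZ^\top\big\|_{F,\Pi}^2\, \pi(dM) \\
&\le \big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_{F,\Pi}^2 + \int \big\|XMZ^\top\big\|_F^2\, \pi(dM) \\
&\le \big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_{F,\Pi}^2 + \|X\|_F^2\|Z\|_F^2\int \|M\|_F^2\, \pi(dM) \\
&\le \big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_{F,\Pi}^2 + \|X\|_F^2\|Z\|_F^2\, pk\tau^2
\end{aligned}$$

where we used the elementary properties of the Frobenius norm, and Lemma A4 in the last line. We can now plug this (and Lemma A5) back into (A14) to obtain:

$$\mathbb{P}\Bigg\{\big\|\widehat{XMZ^\top}_\lambda - XM^*Z^\top\big\|_{F,\Pi}^2 \le \inf_{\bar{M} \in \mathbb{R}^{p\times k}} \Bigg[\frac{\beta}{\alpha}\big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_{F,\Pi}^2 + \frac{\beta}{\alpha}\|X\|_F^2\|Z\|_F^2\, pk\tau^2 + \frac{1}{\alpha}\left(4\,\mathrm{rank}(\bar{M})(k + p + 2)\log\left(1 + \frac{\|\bar{M}\|_F}{\tau\sqrt{2\,\mathrm{rank}(\bar{M})}}\right) + 2\log\frac{2}{\varepsilon}\right)\Bigg]\Bigg\} \ge 1 - \varepsilon.$$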

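We are now making the constants explicit. First, if $\lambda \le m/(2C_2)$, then $2m(1 - C_2\lambda/m) \ge m$ and thus

$$\frac{\beta}{\alpha} = \frac{1 + \frac{\lambda C_1}{2m\left(1 - \frac{C_2\lambda}{m}\right)}}{1 - \frac{\lambda C_1}{2m\left(1 - \frac{C_2\lambda}{m}\right)}} \le \frac{1 + \frac{\lambda C_1}{m}}{1 - \frac{\lambda C_1}{m}}.$$

Then, $\lambda \le \frac{m\,\delta}{C_1(1+\delta)}$ leads to $\frac{\beta}{\alpha} \le (1 + \delta)$.

Note that $\lambda^* = m\min\big(1/(2C_2),\ \delta/[C_1(1+\delta)]\big)$ satisfies these two conditions, so from now on $\lambda = \lambda^*$. We also use the following:

$$\frac{1}{\alpha} = \frac{1}{\lambda^*\left[1 - \frac{\lambda^* C_1}{2m(1 - C_2\lambda^*/m)}\right]} \le \frac{\beta}{\lambda^*\alpha} \le \frac{(1+\delta)}{m\min\big(1/(2C_2),\ \delta/[C_1(1+\delta)]\big)} \le \frac{C_1(1+\delta)^2}{m\,\delta}.$$

So far the bound is:

$$\mathbb{P}\Bigg\{\big\|\widehat{XMZ^\top}_{\lambda^*} - XM^*Z^\top\big\|_{F,\Pi}^2 \le \inf_{\bar{M} \in \mathbb{R}^{p\times k}} \Bigg[(1+\delta)\big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_{F,\Pi}^2 + (1+\delta)\|X\|_F^2\|Z\|_F^2\, pk\tau^2 + \frac{C_1(1+\delta)^2\left(4\,\mathrm{rank}(\bar{M})(k + p + 2)\log\left(1 + \frac{\|\bar{M}\|_F}{\tau\sqrt{2\,\mathrm{rank}(\bar{M})}}\right) + 2\log\frac{2}{\varepsilon}\right)}{m\,\delta}\Bigg]\Bigg\} \ge 1 - \varepsilon.$$

In particular, with probability at least $1 - \varepsilon$, the choice $\tau^2 = C_1(k+p)/(mkp\|X\|_F^2\|Z\|_F^2)$ gives

$$\big\|\widehat{XMZ^\top}_{\lambda^*} - XM^*Z^\top\big\|_{F,\Pi}^2 \le \inf_{\bar{M} \in \mathbb{R}^{p\times k}} \Bigg[(1+\delta)\big\|X\bar{M}Z^\top - XM^*Z^\top\big\|_{F,\Pi}^2 + \frac{C_1(1+\delta)(k+p)}{m} + \frac{C_1(1+\delta)^2\left(4\,\mathrm{rank}(\bar{M})(k + p + 2)\log\left(1 + \frac{\|X\|_F\|Z\|_F\|\bar{M}\|_F}{\sqrt{C_1}}\sqrt{\frac{mkp}{(k+p)\,\mathrm{rank}(\bar{M})}}\right) + 2\log\frac{2}{\varepsilon}\right)}{m\,\delta}\Bigg].$$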

   □

Appendix A.5. Proof of Theorem 4

Proof. 

The proof proceeds exactly as the proof of Theorem 2 in Appendix A.3.

   □

Appendix B. Comments on Algorithm Implementation

For the case of inductive matrix completion, we write the logarithm of the density of the posterior

$$\log\hat\rho_\lambda(M) = -\frac{\lambda}{n}\sum_{i=1}^{n}\big(Y_i - (\Pi_C(XMZ^\top))_i\big)^2 - \frac{p+m+2}{2}\log\det\big(\tau^2 I_m + MM^\top\big).$$

Let us now differentiate this expression in $M$. Note that the term $\big(Y_i - (\Pi_C(XMZ^\top))_i\big)^2$ does not depend on $M$ locally if $(XMZ^\top)_i \notin [-C, C]$; in this case its differential with respect to $M$ is $0_{p\times m}$. Otherwise, $\big(Y_i - (\Pi_C(XMZ^\top))_i\big)^2 = \big(Y_i - (XMZ^\top)_i\big)^2$. In order to differentiate the term $(XMZ^\top)_i$, let us introduce a notation for the entries of $I_i$: $I_i = (a_i, b_i)$. Then $\nabla(XMZ^\top)_i = D$, where the matrix $D \in \mathbb{R}^{p\times m}$ satisfies $D_{x,y} = \mathbf{1}\{x = b_j\}X_{a_j,y}$. Then

$$\nabla\log\hat\rho_\lambda(M) = \frac{2\lambda}{n}\sum_{i=1}^{n}\nabla\big((XMZ^\top)_i\big)\big(Y_i - (XMZ^\top)_i\big)\mathbf{1}\{|(XMZ^\top)_i| < C\} - (p+m+2)\big(\tau^2 I_m + MM^\top\big)^{-1}M.$$

The above calculation still requires computing a p×p matrix inversion at each iteration; for very large p, this might be expensive and can slow down the algorithm. Therefore, we could replace this matrix inversion by an accurate approximation obtained through convex optimization. Indeed, the matrix $B := (\tau^2 I_m + MM^\top)^{-1}M$ is the solution to the following convex optimization problem: $\min_B \big\{\|I_p - M^\top B\|_F^2 + \tau^2\|B\|_F^2\big\}$. The solution of this optimization problem can be conveniently obtained by using the package ‘glmnet’ [50] (with the family option ‘mgaussian’). This avoids performing matrix inversion or other costly calculations. However, we note that the LMC algorithm is then used with an approximate gradient evaluation; a theoretical assessment of this approach can be found in [51].
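The link between the matrix inverse and the ridge-type problem can be checked on a small example. The sketch below takes M of size m × p with m = 2 and p = 3 (a convention assumed here so that the dimensions in the optimization problem match), computes the closed-form solution directly, and verifies that perturbing it increases the objective. It does not call glmnet; the helper names and toy values are hypothetical.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inv2(A):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def objective(B, M, tau):
    """|| I_p - M'B ||_F^2 + tau^2 || B ||_F^2."""
    MtB = matmul(transpose(M), B)
    p = len(MtB)
    fit = sum((float(i == j) - MtB[i][j]) ** 2 for i in range(p) for j in range(p))
    return fit + tau ** 2 * sum(v ** 2 for row in B for v in row)

tau = 0.7
M = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0]]           # m = 2, p = 3 (toy values)
MMt = matmul(M, transpose(M))
A = [[MMt[i][j] + (tau ** 2 if i == j else 0.0) for j in range(2)] for i in range(2)]
B_star = matmul(inv2(A), M)                        # (tau^2 I_m + M M')^{-1} M

# The objective is strictly convex, so any perturbation of B_star increases it.
B_pert = [[v + 0.01 for v in row] for row in B_star]
f_star, f_pert = objective(B_star, M, tau), objective(B_pert, M, tau)
```

The closed form follows from the normal equations $(MM^\top + \tau^2 I_m)B = M$ of the ridge-type problem, which is why an iterative solver can replace the explicit inverse.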

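Algorithm A1 LMC
  • Input: The data.

  • Parameters: Positive real numbers $\lambda, \tau, h, T$.

  • Output: The matrix $\hat{M}$.

  • Initialize: $M_0$; $\hat{M} = 0_{m\times p}$.

  • for $k \leftarrow 1$ to $T$ do

  •     Sample $M_k$ from (8).

  •     $\hat{M} \leftarrow \hat{M} + M_k/T$.

  • end for

Algorithm A2 MALA
  • Input: The data.

  • Parameters: Positive real numbers $\lambda, \tau, h, T$.

  • Output: The matrix $\hat{M}$.

  • Initialize: $M_0$; $\hat{M} = 0_{m\times p}$.

  • for $k \leftarrow 1$ to $T$ do

  •     Sample $\tilde{M}_k$ from (9).

  •     Set $M_k = \tilde{M}_k$ with probability $A_{\mathrm{MALA}}$, from (10); otherwise $M_k = M_{k-1}$.

  •     $\hat{M} \leftarrow \hat{M} + M_k/T$.

  • end for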
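The loop structure of Algorithm A2 can be illustrated on a scalar toy target. The sketch below uses the textbook MALA proposal $x + h\nabla\log p(x) + \sqrt{2h}\,\xi$ and Metropolis–Hastings acceptance step, which is assumed here to correspond to (9) and (10); those equations are not reproduced in this appendix, so the exact update may differ. The function name `mala_mean`, the standard normal target and all numerical settings are illustrative assumptions.

```python
import math
import random

def mala_mean(log_density, grad_log_density, x0, h, T, burn_in=0, seed=0):
    """Textbook MALA for a scalar target; returns the post-burn-in sample mean."""
    rng = random.Random(seed)

    def log_q(x_to, x_from):
        # Log-density (up to a constant) of the Langevin proposal x_from -> x_to.
        mean = x_from + h * grad_log_density(x_from)
        return -((x_to - mean) ** 2) / (4.0 * h)

    x, total, kept = x0, 0.0, 0
    for k in range(T):
        proposal = x + h * grad_log_density(x) + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)
        log_accept = (log_density(proposal) + log_q(x, proposal)
                      - log_density(x) - log_q(proposal, x))
        if rng.random() < math.exp(min(0.0, log_accept)):
            x = proposal          # accept; otherwise keep the previous state
        if k >= burn_in:
            total += x
            kept += 1
    return total / kept

# Toy target: standard normal, so the estimated mean should be close to 0.
estimate = mala_mean(lambda x: -0.5 * x * x, lambda x: -x,
                     x0=3.0, h=0.5, T=10000, burn_in=1000)
```

Dropping the accept/reject step (always setting the new state to the proposal) gives the unadjusted update of Algorithm A1. The sketch averages only post-burn-in iterates, in line with the simulation setup, whereas the algorithms as stated average over all T iterates.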

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The R codes used in the numerical experiments are available at: https://github.com/tienmt/blr_imc (accessed on 10 February 2023).

Conflicts of Interest

The author declares no conflict of interest.

Funding Statement

TTM is supported by the Norwegian Research Council grant number 309960 through the Centre for Geophysical Forecasting at NTNU.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

  • 1. Rosen D.V. Methodology and Applications of Statistics. Springer; Berlin/Heidelberg, Germany: 2021. Bilinear Regression with Rank Restrictions on the Mean and Dispersion Matrix; pp. 193–211.
  • 2. Von Rosen D. Bilinear regression analysis: An Introduction. Volume 220 Springer; Berlin/Heidelberg, Germany: 2018. Lecture Notes in Statistics.
  • 3. Potthoff R.F., Roy S. A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika. 1964;51:313–326. doi: 10.1093/biomet/51.3-4.313.
  • 4. Woolson R.F., Leeper J.D. Growth curve analysis of complete and incomplete longitudinal data. Commun. Stat.-Theory Methods. 1980;9:1491–1513. doi: 10.1080/03610928008827977.
  • 5. Kshirsagar A., Smith W. Growth Curves. Volume 145 CRC Press; Boca Raton, FL, USA: 1995.
  • 6. Jana S. Ph.D. Thesis. Mcmaster University; Hamilton, ON, Canada: 2017. Inference for Generalized Multivariate Analysis of Variance (GMANOVA) Models and High-Dimensional Extensions. Available online: http://hdl.handle.net/11375/22043 (accessed on 26 January 2023).
  • 7. Natarajan N., Dhillon I.S. Inductive matrix completion for predicting gene–disease associations. Bioinformatics. 2014;30:i60–i68. doi: 10.1093/bioinformatics/btu269.
  • 8. Zilber P., Nadler B. Inductive Matrix Completion: No Bad Local Minima and a Fast Algorithm; Proceedings of the 39th ICML; Baltimore, MD, USA. 17–23 July 2022; pp. 27671–27692. PMLR.
  • 9. Zhang W., Xu H., Li X., Gao Q., Wang L. DRIMC: An improved drug repositioning approach using Bayesian inductive matrix completion. Bioinformatics. 2020;36:2839–2847. doi: 10.1093/bioinformatics/btaa062.
  • 10. Hsieh C.J., Natarajan N., Dhillon I. PU learning for matrix completion; Proceedings of the International Conference on Machine Learning, PMLR; Lille, France. 7–9 July 2015; pp. 2445–2453.
  • 11. Jana S., Balakrishnan N., Hamid J.S. Bayesian growth curve model useful for high-dimensional longitudinal data. J. Appl. Stat. 2019;46:814–834. doi: 10.1080/02664763.2018.1517145.
  • 12. Knoblauch J., Jewson J., Damoulas T. An Optimization-centric View on Bayes’ Rule: Reviewing and Generalizing Variational Inference. J. Mach. Learn. Res. 2022;23:1–109.
  • 13. Bissiri P.G., Holmes C.C., Walker S.G. A general framework for updating belief distributions. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2016;78:1103–1130. doi: 10.1111/rssb.12158.
  • 14. Grünwald P., Van Ommen T. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Anal. 2017;12:1069–1103. doi: 10.1214/17-BA1085.
  • 15. McAllester D. Some PAC-Bayesian theorems; Proceedings of the Eleventh Annual Conference on Computational Learning Theory; Madison, WI, USA. 24–26 July 1998; New York, NY, USA: ACM; 1998. pp. 230–234.
  • 16. Shawe-Taylor J., Williamson R. A PAC analysis of a Bayes estimator; Proceedings of the Tenth Annual Conference on Computational Learning Theory; Nashville, TN, USA. 6–9 July 1997; New York, NY, USA: ACM; 1997. pp. 2–9.
  • 17. Catoni O. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics; Beachwood, OH, USA: 2007. p. xii+163. (IMS Lecture Notes–Monograph Series, 56).
  • 18. Guedj B. A primer on PAC-Bayesian learning. arXiv. 2019. arXiv:1901.05353.
  • 19. Alquier P. User-friendly introduction to PAC-Bayes bounds. arXiv. 2021. arXiv:2110.11216.
  • 20. Mai T.T., Alquier P. A Bayesian approach for noisy matrix completion: Optimal rate under general sampling distribution. Electron. J. Statist. 2015;9:823–841. doi: 10.1214/15-EJS1020.
  • 21. Cottet V., Alquier P. 1-Bit matrix completion: PAC-Bayesian analysis of a variational approximation. Mach. Learn. 2018;107:579–603. doi: 10.1007/s10994-017-5667-z.
  • 22. Mai T.T., Alquier P. Pseudo-Bayesian quantum tomography with rank-adaptation. J. Stat. Plan. Inference. 2017;184:62–76. doi: 10.1016/j.jspi.2016.11.003.
  • 23. Mai T.T., Alquier P. Optimal quasi-Bayesian reduced rank regression with incomplete response. arXiv. 2022. arXiv:2206.08619.
  • 24. Jain P., Dhillon I.S. Provable inductive matrix completion. arXiv. 2013. arXiv:1306.0626.
  • 25. Candès E.J., Plan Y. Matrix completion with noise. Proc. IEEE. 2010;98:925–936. doi: 10.1109/JPROC.2009.2035722.
  • 26. Koltchinskii V., Lounici K., Tsybakov A.B. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist. 2011;39:2302–2329. doi: 10.1214/11-AOS894.
  • 27. Foygel R., Shamir O., Srebro N., Salakhutdinov R. Learning with the weighted trace-norm under arbitrary sampling distributions; Proceedings of the Advances in Neural Information Processing Systems; Granada, Spain. 12–15 December 2011; pp. 2133–2141.
  • 28. Klopp O. Noisy low-rank matrix completion with general sampling distribution. Bernoulli. 2014;20:282–303. doi: 10.3150/12-BEJ486.
  • 29. Negahban S., Wainwright M.J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. J. Mach. Learn. Res. 2012;13:1665–1697.
  • 30. Dalalyan A.S., Tsybakov A.B. Sparse regression learning by aggregation and Langevin Monte-Carlo. J. Comput. Syst. Sci. 2012;78:1423–1443. doi: 10.1016/j.jcss.2011.12.023.
  • 31. Dalalyan A.S. Exponential weights in multivariate regression and a low-rankness favoring prior. Annales de l’Institut Henri Poincaré Probabilités et Statistiques. 2020;56:1465–1483. doi: 10.1214/19-AIHP1010.
  • 32. Anderson T.W. Estimating linear restrictions on regression coefficients for multivariate normal distributions. Ann. Math. Stat. 1951;22:327–351. doi: 10.1214/aoms/1177729580.
  • 33. Izenman A.J. Modern multivariate statistical techniques. Regres. Classif. Manifold Learn. 2008;10:978.
  • 34. Dalalyan A., Tsybakov A.B. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Mach. Learn. 2008;72:39–61. doi: 10.1007/s10994-008-5051-0.
  • 35. Catoni O. Statistical learning theory and stochastic optimization. In: Picard J., editor. Saint-Flour Summer School on Probability Theory 2001. Volume 1851. Springer; Berlin, Germany: 2004. p. viii+272. Lecture Notes in Mathematics.
  • 36. Alquier P., Ridgway J., Chopin N. On the properties of variational approximations of Gibbs posteriors. J. Mach. Learn. Res. 2016;17:8374–8414.
  • 37. Rigollet P., Tsybakov A.B. Sparse estimation by exponential weighting. Stat. Sci. 2012;27:558–575. doi: 10.1214/12-STS393.
  • 38. Dalalyan A.S., Grappin E., Paris Q. On the exponentially weighted aggregate with the Laplace prior. Ann. Stat. 2018;46:2452–2478. doi: 10.1214/17-AOS1626.
  • 39. Candes E.J., Wakin M.B., Boyd S.P. Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl. 2008;14:877–905. doi: 10.1007/s00041-008-9045-x.
  • 40. Yang L., Fang J., Duan H., Li H., Zeng B. Fast low-rank Bayesian matrix completion with hierarchical gaussian prior models. IEEE Trans. Signal Process. 2018;66:2804–2817. doi: 10.1109/TSP.2018.2816575.
  • 41. Luo C., Liang J., Li G., Wang F., Zhang C., Dey D.K., Chen K. Leveraging mixed and incomplete outcomes via reduced-rank modeling. J. Multivar. Anal. 2018;167:378–394. doi: 10.1016/j.jmva.2018.04.011.
  • 42. Hoeffding W. Probability Inequalities for Sums of Bounded Random Variables. J. Am. Stat. Assoc. 1963;58:13–30. doi: 10.1080/01621459.1963.10500830.
  • 43. Durmus A., Moulines E. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli. 2019;25:2854–2882. doi: 10.3150/18-BEJ1073.
  • 44. Roberts G.O., Stramer O. Langevin diffusions and Metropolis-Hastings algorithms. Methodol. Comput. Appl. Probab. 2002;4:337–357. doi: 10.1023/A:1023562417138.
  • 45. Roberts G.O., Rosenthal J.S. Optimal scaling of discrete approximations to Langevin diffusions. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1998;60:255–268. doi: 10.1111/1467-9868.00123.
  • 46. Dalalyan A.S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2017;79:651–676. doi: 10.1111/rssb.12183.
  • 47. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2022.
  • 48. Hastie T., Mazumder R. softImpute: Matrix Completion via Iterative Soft-Thresholded SVD, 2021. R Package Version 1.4-1. Available online: https://cran.r-project.org/package=softImpute (accessed on 26 January 2023).
  • 49. Massart P. Concentration Inequalities and Model Selection. Volume 1896. Springer; Berlin, Germany: 2007. p. xiv+337. Lecture Notes in Mathematics.
  • 50. Friedman J., Hastie T., Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 2010;33:1–22. doi: 10.18637/jss.v033.i01.
  • 51. Dalalyan A.S., Karagulyan A. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stoch. Process. Their Appl. 2019;129:5278–5311. doi: 10.1016/j.spa.2019.02.016.


Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
