Abstract
Matrix factorization methods, which include Factor analysis (FA) and Principal Components Analysis (PCA), are widely used for inferring and summarizing structure in multivariate data. Many such methods use a penalty or prior distribution to achieve sparse representations (“Sparse FA/PCA”), and a key question is how much sparsity to induce. Here we introduce a general Empirical Bayes approach to matrix factorization (EBMF), whose key feature is that it estimates the appropriate amount of sparsity by estimating prior distributions from the observed data. The approach is very flexible: it allows for a wide range of different prior families and allows that each component of the matrix factorization may exhibit a different amount of sparsity. The key to this flexibility is the use of a variational approximation, which we show effectively reduces fitting the EBMF model to solving a simpler problem, the so-called “normal means” problem. We demonstrate the benefits of EBMF with sparse priors through both numerical comparisons with competing methods and through analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues. In numerical comparisons EBMF often provides more accurate inferences than other methods. In the GTEx data, EBMF identifies interpretable structure that agrees with known relationships among human tissues. Software implementing our approach is available at https://github.com/stephenslab/flashr.
Keywords: empirical Bayes, matrix factorization, normal means, sparse prior, unimodal prior, variational approximation
1. Introduction
Matrix factorization methods are widely used for inferring and summarizing structure in multivariate data. In brief, these methods represent an observed $n \times p$ data matrix $Y$ as:
$Y = L F' + E,$ (1.1)
where $L$ is an $n \times K$ matrix, $F$ is a $p \times K$ matrix, and $E$ is an $n \times p$ matrix of residuals (whose entries we assume to be normally distributed, although the methods we develop can be generalized to other settings; see Section 6.4). Here we adopt the notation and terminology of factor analysis, and refer to $L$ as the "loadings" and $F$ as the "factors".
The model (1.1) has many potential applications. One range of applications arises from "matrix completion" problems (e.g. Fithian et al., 2018): methods that estimate $L$ and $F$ in (1.1) from a partially observed $Y$ provide a natural and effective way to fill in the missing entries. Another wide range of applications comes from the desire to summarize and understand the structure in a matrix $Y$: in (1.1) each row of $Y$ is approximated by a linear combination of underlying "factors" (the rows of $F'$), which—ideally—have some intuitive or scientific interpretation. For example, suppose $Y_{ij}$ represents the rating of user $i$ for movie $j$. Each factor might represent a genre of movie ("comedy", "drama", "romance", "horror" etc), and the ratings for user $i$ could be written as a linear combination of these factors, with the weights (loadings) representing how much individual $i$ likes that genre. Or, suppose $Y_{ij}$ represents the expression of gene $j$ in sample $i$. Each factor might represent a module of co-regulated genes, and the data for sample $i$ could be written as a linear combination of these factors, with the loadings representing how active each module is in each sample. Many other examples could be given across many fields, including psychology (Ford et al., 1986), econometrics (Bai and Ng, 2008), natural language processing (Bouchard et al., 2015), population genetics (Engelhardt and Stephens, 2010), and functional genomics (Stein-O'Brien et al., 2018).
The simplest approaches to estimating $L$ and/or $F$ in (1.1) are based on maximum likelihood or least squares. For example, Principal Components Analysis (PCA)—or, more precisely, truncated Singular Value Decomposition (SVD)—can be interpreted as fitting (1.1) by least squares, assuming that the columns of $L$ are orthogonal and the columns of $F$ are orthonormal (Eckart and Young, 1936). And classical factor analysis (FA) corresponds to maximum likelihood estimation of $F$, assuming that the elements of $L$ are independent standard normal and allowing different residual variances for each column of $Y$ (Rubin and Thayer, 1982). While these simple methods remain widely used, in the last two decades researchers have focused considerable attention on obtaining more accurate and/or more interpretable estimates, either by imposing additional constraints (e.g. non-negativity; Lee and Seung, 1999) or by regularization using a penalty term (e.g. Jolliffe et al., 2003; Witten et al., 2009; Mazumder et al., 2010; Hastie et al., 2015; Fithian et al., 2018), or a prior distribution (e.g. Bishop, 1999; Attias, 1999; Ghahramani and Beal, 2000; West, 2003). In particular, many authors have noted the benefits of sparsity assumptions on $L$ and/or $F$—particularly in applications where interpretability of the estimates is desired—and there now exists a wide range of methods that attempt to induce sparsity in these models (e.g. Sabatti and James, 2005; Zou et al., 2006; Pournara and Wernisch, 2007; Carvalho et al., 2008; Witten et al., 2009; Engelhardt and Stephens, 2010; Knowles and Ghahramani, 2011; Bhattacharya and Dunson, 2011; Mayrink et al., 2013; Yang et al., 2014; Gao et al., 2016; Hore et al., 2016; Ročková and George, 2016; Srivastava et al., 2017; Kaufmann and Schumacher, 2017; Frühwirth-Schnatter and Lopes, 2018; Zhao et al., 2018). Many of these methods induce sparsity in the loadings only, although some induce sparsity in both loadings and factors.
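As a concrete illustration of fitting (1.1) by least squares, the short sketch below simulates a small low-rank matrix plus noise and recovers the best rank-$K$ approximation by truncated SVD. The dimensions, rank and noise level are arbitrary illustrative choices, not taken from the paper.

```python
# A toy illustration of (1.1): simulate a low-rank matrix plus noise and fit it
# by truncated SVD, the least-squares solution discussed above.
import numpy as np

rng = np.random.default_rng(1)
n, p, K = 100, 80, 3
L_true = rng.normal(size=(n, K))
F_true = rng.normal(size=(p, K))
Y = L_true @ F_true.T + 0.5 * rng.normal(size=(n, p))

U, d, Vt = np.linalg.svd(Y, full_matrices=False)
Y_hat = (U[:, :K] * d[:K]) @ Vt[:K]   # best rank-K approximation (Eckart-Young)
print(np.linalg.norm(Y - Y_hat) / np.linalg.norm(Y))
```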
In any statistical problem involving sparsity, a key question is how strong the sparsity should be. In penalty-based methods this is controlled by the strength and form of the penalty, whereas in Bayesian methods it is controlled by the prior distributions. In this paper we take an Empirical Bayes approach to this problem, exploiting variational approximation methods (Blei et al., 2017) to obtain simple algorithms that jointly estimate the prior distributions for both loadings and factors, as well as the loadings and factors themselves.
Both EB and variational methods have been previously used for this problem (Bishop, 1999; Lim and Teh, 2007; Raiko et al., 2007; Stegle et al., 2010). However, most of this previous work has used simple normal prior distributions that do not induce sparsity. Variational methods that use sparsity-inducing priors include Girolami (2001), which uses a Laplace prior on the factors (no prior on the loadings which are treated as free parameters), Hochreiter et al. (2010) which extends this to Laplace priors on both factors and loadings, with fixed values of the Laplace prior parameters; Titsias and Lázaro-Gredilla (2011), which uses a sparse “spike-and-slab” (point normal) prior on the loadings (with the same prior on all loadings) and a normal prior on the factors; and Hore et al. (2016) which uses a spike-and-slab prior on one mode in a tensor decomposition. (While this work was in review further examples appeared, including Argelaguet et al. 2018, which uses normal priors on the loadings and point-normal on the factors.)
Our primary contribution here is to develop and implement a more general EB approach to matrix factorization (EBMF). This general approach allows for a wide range of potential sparsity-inducing prior distributions on both the loadings and the factors within a single algorithmic framework. We accomplish this by showing that, when using variational methods, fitting EBMF with any prior family can be reduced to repeatedly solving a much simpler problem—the "empirical Bayes normal means" (EBNM) problem—with the same prior family. This feature makes it easy to implement methods for any desired prior family—one simply has to implement a method to solve the corresponding normal means problem, and then plug this into our algorithm. This approach can work for both parametric families (e.g. normal, point-normal, Laplace, point-Laplace) and non-parametric families, including the "adaptive shrinkage" priors (unimodal and scale mixtures of normals) from Stephens (2017). It is also possible to accommodate non-negative constraints on $L$ and/or $F$ by using non-negative prior families. Even simple versions of our approach—e.g. using point-normal priors on both factors and loadings—provide more generality than most existing EBMF approaches and software.
A second contribution of our work is to highlight similarities and differences between EBMF and penalty-based methods for regularizing $L$ and/or $F$. Indeed, our algorithm for fitting EBMF has the same structure as commonly-used algorithms for penalty-based methods, with the prior distribution playing a role analogous to the penalty (see Remark 3 later). While the general correspondence between estimates from penalized methods and Bayesian posterior modes (MAP estimates) is well known, the connection here is different, because the EBMF approach is estimating a posterior mean, not a mode (indeed, with sparse priors the MAP estimates of $L$ and $F$ are not useful because they are trivially 0). A key difference between the EBMF approach and penalty-based methods is that the EBMF prior is estimated by solving an optimization problem, whereas in penalty-based methods the strength of the penalty is usually chosen by cross-validation. This difference makes it much easier for EBMF to allow for different levels of sparsity in every factor and every loading: in EBMF one simply uses a different prior for every factor and loading, whereas tuning a separate parameter for every factor and loading by CV becomes very cumbersome.
The final contribution is that we provide an R software package, flash (Factors and Loadings by Adaptive SHrinkage), implementing our flexible EBMF framework. We demonstrate the utility of these methods through both numerical comparisons with competing methods and through a scientific application: analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues. In numerical comparisons flash often provides more accurate inferences than other methods, while remaining computationally tractable for moderate-sized matrices (millions of entries). In the GTEx data, flash highlights both effects that are shared across many tissues ("dense" factors) and effects that are specific to a small number of tissues ("sparse" factors). These sparse factors often highlight similarities between tissues that are known to be biologically related, providing external support for the reliability of the results.
2. A General Empirical Bayes Matrix Factorization Model
We define the $K$-factor Empirical Bayes Matrix Factorization (EBMF) model as follows:
$Y = \sum_{k=1}^{K} l_k f_k' + E$ (2.1)
$l_{1k}, \dots, l_{nk} \sim_{\text{i.i.d.}} g_{l,k}, \qquad g_{l,k} \in \mathcal{G}_l$ (2.2)
$f_{1k}, \dots, f_{pk} \sim_{\text{i.i.d.}} g_{f,k}, \qquad g_{f,k} \in \mathcal{G}_f$ (2.3)
$E_{ij} \sim N(0, 1/\tau_{ij}).$ (2.4)
Here $Y$ is the observed $n \times p$ data matrix, $l_k$ is an $n$-vector (the $k$th set of "loadings"), $f_k$ is a $p$-vector (the $k$th "factor"), $\mathcal{G}_l$ and $\mathcal{G}_f$ are pre-specified (possibly non-parametric) families of distributions, $g_{l,k}$ and $g_{f,k}$ are unknown "prior" distributions that are to be estimated, $E$ is an $n \times p$ matrix of independent error terms, and $\tau$ is an unknown $n \times p$ matrix of precisions which is assumed to lie in some space $\mathcal{T}$. (This allows structure to be imposed on $\tau$, such as constant precision, $\tau_{ij} = \tau$, or column-specific precisions, $\tau_{ij} = \tau_j$, for example.) Our methods allow that some elements of $Y$ may be "missing", and can estimate the missing values (Section 4.1).
The term "Empirical Bayes" in EBMF means we fit (2.1)-(2.4) by obtaining point estimates for the priors $g_{l,k}, g_{f,k}$ (and the precisions $\tau$), and approximate the posterior distributions for the loadings and factors given those point estimates. This contrasts with a "fully Bayes" approach that, instead of obtaining point estimates for $g_{l,k}, g_{f,k}$, would integrate over uncertainty in these estimates. This would involve specifying prior distributions for $g_{l,k}, g_{f,k}$, as well as (perhaps substantial) additional computation. The EB approach has the advantage of simplicity, both conceptually and computationally, while enjoying many of the benefits of a fully Bayes approach. In particular it allows for sharing of information across elements of each loading/factor. For example, if the data suggest that a particular factor, $f_k$, is sparse, then this will be reflected in a sparse estimate of $g_{f,k}$, and subsequently strong shrinkage of the smaller elements of $f_k$ towards 0. Conversely, when the data suggest a non-sparse factor then the prior will be dense and the shrinkage less strong. By allowing different prior distributions for each factor and each loading, the model has the flexibility to adapt to any combination of sparse and dense loadings and factors. However, to fully capitalize on this flexibility one needs suitably flexible prior families $\mathcal{G}_l$ and $\mathcal{G}_f$, capable of capturing both sparse and dense factors. A key feature of our work is that it allows for very flexible prior families, including non-parametric families.
Some specific choices of the distributional families $\mathcal{G}_l$ and $\mathcal{G}_f$ correspond to models used in previous work. In particular, many previous papers have studied the case with normal priors, where $\mathcal{G}_l$ and $\mathcal{G}_f$ are both the family of zero-mean normal distributions (e.g. Bishop, 1999; Lim and Teh, 2007; Raiko et al., 2007; Nakajima and Sugiyama, 2011). This family is particularly simple, having a single hyper-parameter, the prior variance, to estimate for each factor. However, it does not induce sparsity on either $L$ or $F$; indeed, when the matrix $Y$ is fully observed, the estimates of $L$ and $F$ under a normal prior (when using a fully factored variational approximation) are simply scalings of the singular vectors from an SVD of $Y$ (Nakajima and Sugiyama, 2011; Nakajima et al., 2013). Our work here extends these previous approaches to a much wider range of prior families that do induce sparsity on $L$ and/or $F$.
We note that the EBMF model (2.1)-(2.4) differs in an important way from the sparse factor analysis (SFA) methods in Engelhardt and Stephens (2010), which use a type of Automatic Relevance Determination prior (e.g. Tipping, 2001; Wipf and Nagarajan, 2008) to induce sparsity on the loadings matrix. In particular, SFA estimates a separate hyperparameter for every element of the loadings matrix, with no sharing of information across elements of the same loading. In contrast, EBMF estimates a single shared prior distribution for elements of each loading, which, as noted above, allows for sharing of information across elements of each loading/factor.
3. Fitting the EBMF Model
To simplify exposition we begin with the case $K = 1$ ("rank 1"); see Section 3.6 for the extension to general $K$. To simplify notation we assume the families $\mathcal{G}_l$ and $\mathcal{G}_f$ are the same, so we can write $\mathcal{G} := \mathcal{G}_l = \mathcal{G}_f$. To further lighten notation, in the case $K = 1$ we use $l, f, g_l, g_f$ instead of $l_1, f_1, g_{l,1}, g_{f,1}$.
Fitting the EBMF model involves estimating all of $g_l$, $g_f$ and $\tau$. A standard EB approach would be to do this in two steps:
- Estimate $g_l$, $g_f$ and $\tau$ by maximizing the likelihood:
$p(Y \mid g_l, g_f, \tau) = \int p(Y \mid l, f, \tau)\, g_l(l)\, g_f(f)\, dl\, df$ (3.1)
over $g_l, g_f \in \mathcal{G}$ and $\tau \in \mathcal{T}$. (This optimum will typically not be unique because of identifiability issues; see Section 3.8.)
- Estimate $l$ and $f$ using their posterior distribution: $p(l, f \mid Y, \hat g_l, \hat g_f, \hat\tau)$.
However, both these steps are difficult, even for very simple choices of $\mathcal{G}$. Instead, following previous work (see Introduction for citations) we use variational approximations to approximate this approach. Although variational approximations are known to typically under-estimate uncertainty in posterior distributions, our focus here is on obtaining useful point estimates for $l$ and $f$; results shown later demonstrate that the variational approximation can perform well in this task.
3.1. The Variational Approximation
The variational approach—see Blei et al. (2017) for review—begins by writing the log of the likelihood (3.1) as:
$\log p(Y \mid g_l, g_f, \tau) = \log \int p(Y \mid l, f, \tau)\, g_l(l)\, g_f(f)\, dl\, df$ (3.2)
$= F(q, g_l, g_f, \tau) + D_{KL}\big(q \,\|\, p_{\text{post}}\big),$ (3.3)
where
$F(q, g_l, g_f, \tau) := \int q(l, f)\, \log \frac{p(Y, l, f \mid g_l, g_f, \tau)}{q(l, f)}\, dl\, df$ (3.4)
and
$D_{KL}\big(q \,\|\, p_{\text{post}}\big) := -\int q(l, f)\, \log \frac{p_{\text{post}}(l, f)}{q(l, f)}\, dl\, df$ (3.5)
is the Kullback–Leibler divergence from $q$ to $p_{\text{post}} := p(l, f \mid Y, g_l, g_f, \tau)$. This identity holds for any distribution $q$. Because $D_{KL}$ is non-negative, it follows that $F$ is a lower bound for the log-likelihood:
$F(q, g_l, g_f, \tau) \leq \log p(Y \mid g_l, g_f, \tau),$ (3.6)
with equality when $q = p_{\text{post}}$.
In other words,
$\log p(Y \mid g_l, g_f, \tau) = \max_q F(q, g_l, g_f, \tau),$ (3.7)
where the maximization is over all possible distributions $q$. Maximizing the likelihood (3.1) can thus be viewed as maximizing $F$ over $q, g_l, g_f$ and $\tau$. However, as noted above, this maximization is difficult. The variational approach simplifies the problem by maximizing $F$ but restricting the family of distributions for $q$. Specifically, the most common variational approach—and the one we consider here—restricts $q$ to the family of distributions that "fully-factorize":
$q(l, f) = \prod_{i=1}^{n} q_{l_i}(l_i) \prod_{j=1}^{p} q_{f_j}(f_j).$ (3.8)
The variational approach seeks to optimize $F(q, g_l, g_f, \tau)$ over $g_l, g_f, \tau$ and $q$, with the constraint that $q$ has the form (3.8). For such $q$ we can write $q = q_l q_f$, where $q_l(l) = \prod_i q_{l_i}(l_i)$ and $q_f(f) = \prod_j q_{f_j}(f_j)$, and we can consider the problem as maximizing $F(q_l, q_f, g_l, g_f, \tau)$.
3.2. Alternating Optimization
We optimize $F$ by alternating between optimizing over variables related to $l$ (namely $q_l$ and $g_l$), over variables related to $f$ (namely $q_f$ and $g_f$), and over $\tau$. Each of these steps is guaranteed to increase (or, more precisely, not decrease) $F$, and convergence can be assessed by (for example) stopping when these optimization steps yield a very small increase in $F$. Note that $F$ may be multi-modal, and there is no guarantee that the algorithm will converge to a global optimum. The approach is summarized in Algorithm 1.
The key steps in Algorithm 1 are the maximizations in Steps 4–6.
Step 4, the update of $\tau$, involves computing the expected squared residuals:
$\overline{R^2_{ij}} := E_q\big[(Y_{ij} - l_i f_j)^2\big]$ (3.9)
$= Y_{ij}^2 - 2\, Y_{ij}\, E_q(l_i)\, E_q(f_j) + E_q(l_i^2)\, E_q(f_j^2).$ (3.10)
This is straightforward provided the first and second moments of $q_l$ and $q_f$ are available (see Appendix A.1 for details).
Steps 5 and 6 are essentially identical except for switching the role of and . One of our key results is that each of these steps can be achieved by solving a simpler problem—the Empirical Bayes normal means (EBNM) problem. The next subsection (3.3) describes the EBNM problem, and the following subsection (3.4) details how this can be used to solve Steps 5 and 6.
3.3. The EBNM Problem
Suppose we have observations $x = (x_1, \dots, x_n)$ of underlying quantities $\theta = (\theta_1, \dots, \theta_n)$, observed with independent Gaussian errors with known standard deviations $s = (s_1, \dots, s_n)$. Suppose further that the elements of $\theta$ are assumed i.i.d. from some distribution, $g \in \mathcal{G}$. That is,
$x \mid \theta \sim N_n\big(\theta, \operatorname{diag}(s_1^2, \dots, s_n^2)\big)$ (3.11)
$\theta_1, \dots, \theta_n \sim_{\text{i.i.d.}} g, \qquad g \in \mathcal{G},$ (3.12)
where $N_n(\mu, \Sigma)$ denotes the $n$-dimensional normal distribution with mean $\mu$ and covariance matrix $\Sigma$.
By solving the EBNM problem we mean fitting the model (3.11)-(3.12) by the following two-step procedure:
- Estimate $g$ by maximum (marginal) likelihood:
$\hat g := \arg\max_{g \in \mathcal{G}} p(x \mid g, s) = \arg\max_{g \in \mathcal{G}} \prod_{j=1}^{n} \int p(x_j \mid \theta_j, s_j)\, g(\theta_j)\, d\theta_j.$ (3.13)
- Compute the posterior distribution for $\theta$ given $\hat g$,
$p(\theta \mid x, s, \hat g) = \prod_{j=1}^{n} p(\theta_j \mid x_j, s_j, \hat g).$ (3.14)
Later in this paper we will have need for the posterior first and second moments, so we define them here for convenience:
$\bar\theta_j := E(\theta_j \mid x, s, \hat g)$ (3.15)
$\overline{\theta^2_j} := E(\theta_j^2 \mid x, s, \hat g).$ (3.16)
Formally, this procedure defines a mapping (which depends on the family $\mathcal{G}$) from the known quantities $(x, s)$ to $(\hat g, \hat q)$, where $\hat g$ and $\hat q := p(\theta \mid x, s, \hat g)$ are given in (3.13) and (3.14). We use EBNM to denote this mapping:
$(\hat g, \hat q) = \text{EBNM}(x, s).$ (3.17)
Remark 1 Solving the EBNM problem is central to all our algorithms, so it is worth some study. A key point is that the EBNM problem provides an attractive and flexible way to induce shrinkage and/or sparsity in estimates of $\theta$. For example, if $\theta$ is truly sparse, with many elements at or near 0, then the estimate $\hat g$ will typically have considerable mass near 0, and the posterior means (3.15) will be "shrunk" strongly toward 0 compared with the original observations. In this sense solving the EBNM problem can be thought of as a model-based analogue of thresholding-based methods, with the advantage that by estimating $g$ from the data the EBNM approach automatically adapts to provide an appropriate level of shrinkage. These ideas have been used in wavelet denoising (Clyde and George, 2000; Johnstone et al., 2004; Johnstone and Silverman, 2005a; Xing et al., 2016) and false discovery rate estimation (Thomas et al., 1985; Stephens, 2017), for example. Here we apply them to matrix factorization problems.
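To make this concrete, here is a minimal sketch of an EBNM solver for one simple choice of $\mathcal{G}$, the point-normal family (a mixture of a point mass at zero and a zero-mean normal). The function name, the unconstrained parametrization and the optimizer are illustrative choices, not the interface of the ashr or ebnm packages.

```python
# Minimal sketch of an EBNM solver for the point-normal family
#   g = pi0 * delta_0 + (1 - pi0) * N(0, sigma2).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def ebnm_point_normal(x, s):
    x, s = np.asarray(x, float), np.asarray(s, float)

    def neg_loglik(params):
        # Unconstrained parametrization: logit(pi0), log(sigma2).
        pi0 = 1 / (1 + np.exp(-params[0]))
        sigma2 = np.exp(params[1])
        # Marginal density: x_j ~ pi0 N(0, s_j^2) + (1 - pi0) N(0, s_j^2 + sigma2).
        lik = pi0 * norm.pdf(x, 0, s) + (1 - pi0) * norm.pdf(x, 0, np.sqrt(s**2 + sigma2))
        return -np.sum(np.log(lik + 1e-300))

    fit = minimize(neg_loglik, x0=[0.0, np.log(np.var(x) + 1e-8)], method="Nelder-Mead")
    pi0 = 1 / (1 + np.exp(-fit.x[0]))
    sigma2 = np.exp(fit.x[1])

    # Posterior: with prob w_j, theta_j ~ N(mu_j, v_j); with prob 1 - w_j, theta_j = 0.
    p_slab = (1 - pi0) * norm.pdf(x, 0, np.sqrt(s**2 + sigma2))
    p_spike = pi0 * norm.pdf(x, 0, s)
    w = p_slab / (p_slab + p_spike)
    v = 1 / (1 / sigma2 + 1 / s**2)     # posterior variance given "slab"
    mu = v * x / s**2                   # posterior mean given "slab"
    return {"pi0": pi0, "sigma2": sigma2,
            "mean": w * mu,             # first moments (3.15)
            "mean2": w * (mu**2 + v)}   # second moments (3.16)
```

Estimating $(\pi_0, \sigma^2)$ by marginal likelihood is what lets the amount of shrinkage adapt to the data, as described in Remark 1.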
3.4. Connecting the EBMF and EBNM Problems
The EBNM problem is well studied, and can be solved reasonably easily for many choices of $\mathcal{G}$ (e.g. Johnstone and Silverman, 2005b; Koenker and Mizera, 2014a; Stephens, 2017). In Section 4 we give specific examples; for now our main point is that if one can solve the EBNM problem for a particular choice of $\mathcal{G}$ then it can be used to implement Steps 5 and 6 in Algorithm 1 for the corresponding EBMF problem. The following Proposition formalizes this for Step 5 of Algorithm 1; a similar proposition holds for Step 6 (see also Appendix A).
Proposition 2 Step 5 in Algorithm 1, the update of $(q_l, g_l)$, is solved by solving an EBNM problem. Specifically
$(\hat g_l, \hat q_l) = \text{EBNM}\big(\hat x(Y, \bar f, \overline{f^2}, \tau),\; \hat s(\overline{f^2}, \tau)\big),$ (3.18)
where the functions $\hat x$ and $\hat s$ are given by
$\hat x_i(Y, \bar f, \overline{f^2}, \tau) := \frac{\sum_j Y_{ij}\, \tau_{ij}\, \bar f_j}{\sum_j \tau_{ij}\, \overline{f^2_j}}$ (3.19)
$\hat s_i(\overline{f^2}, \tau) := \Big(\sum_j \tau_{ij}\, \overline{f^2_j}\Big)^{-1/2},$ (3.20)
and $\bar f, \overline{f^2}$ denote the vectors whose elements are the first and second moments of $f$ under $q_f$:
$\bar f_j := E_{q_f}(f_j)$ (3.21)
$\overline{f^2_j} := E_{q_f}(f_j^2).$ (3.22)
Proof See Appendix A. ■
For intuition into where the EBNM in Proposition 2 comes from, consider estimating $l$ in (2.1) with $f$ and $\tau$ known. The model then becomes $n$ independent regressions of the rows of $Y$ on $f$, and the maximum likelihood estimate for $l$ has elements:
$\hat l_i = \frac{\sum_j Y_{ij}\, \tau_{ij}\, f_j}{\sum_j \tau_{ij}\, f_j^2},$ (3.23)
with standard errors
$s_i = \Big(\sum_j \tau_{ij}\, f_j^2\Big)^{-1/2}.$ (3.24)
Further, it is easy to show that
$\hat l_i \mid l_i \sim N(l_i, s_i^2).$ (3.25)
Combining (3.25) with the prior
$l_1, \dots, l_n \sim_{\text{i.i.d.}} g_l, \qquad g_l \in \mathcal{G},$ (3.26)
yields an EBNM problem.
The EBNM in Proposition 2 is the same as the EBNM (3.25)-(3.26), but with the terms $f_j$ and $f_j^2$ replaced with their expectations under $q_f$. Thus, the update for $(q_l, g_l)$ in Algorithm 1, with $q_f$ fixed, is closely connected to solving the EBMF problem for "known $f$".
3.5. Streamlined Implementation Using First and Second Moments
Although Algorithm 1, as written, optimizes over the distributions $q_l$ and $q_f$, in practice each step requires only the first and second moments of these distributions. For example, the EBNM problem in Proposition 2 involves $\bar f$ and $\overline{f^2}$, and not $q_f$ itself. Consequently, we can simplify implementation by keeping track of only those moments. In particular, when solving the normal means problem in (3.17), we need only return the posterior first and second moments (3.15) and (3.16). This results in a streamlined and intuitive implementation, summarized in Algorithm 2.
Remark 3 Algorithm 2 has a very intuitive form: it has the flavor of an alternating least squares algorithm, which alternates between estimating $l$ given $f$ (Step 6) and $f$ given $l$ (Step 8), but with the addition of the ebnm step (Steps 7 and 9), which can be thought of as regularizing or shrinking the estimates: see Remark 1. This viewpoint highlights connections with related algorithms. For example, the (rank 1 version of the) SSVD algorithm from Yang et al. (2014) has a similar form, but uses a thresholding function in place of the ebnm function to induce shrinkage and/or sparsity.
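The following minimal sketch conveys the shape of these alternating moment updates. It reuses the illustrative ebnm_point_normal function from the sketch above as the EBNM solver and assumes, for simplicity, a constant precision $\tau_{ij} = \tau$ and a fully observed $Y$; it is a sketch of the idea, not the flashr implementation.

```python
# Sketch of the rank-1 moment updates (cf. Algorithm 2), constant precision.
# `ebnm` is any solver returning posterior "mean" and "mean2" (e.g. the
# ebnm_point_normal sketch above).
import numpy as np

def flash_rank1_sketch(Y, ebnm, n_iter=100):
    n, p = Y.shape
    # Initialize from the leading singular vectors (cf. Section 4.2).
    u, d, vt = np.linalg.svd(Y, full_matrices=False)
    El, Ef = u[:, 0] * d[0], vt[0]
    El2, Ef2 = El**2, Ef**2

    for _ in range(n_iter):
        # Update tau from the expected squared residuals (3.9)-(3.10).
        R2 = Y**2 - 2 * Y * np.outer(El, Ef) + np.outer(El2, Ef2)
        tau = (n * p) / R2.sum()
        if Ef2.sum() == 0:          # factor shrunk to zero (cf. Section 3.7)
            break
        # Loading update: EBNM with observations (3.19), standard errors (3.20).
        s_l = np.sqrt(1 / (tau * Ef2.sum()))
        x_l = (Y @ Ef) / Ef2.sum()
        res = ebnm(x_l, s_l * np.ones(n))
        El, El2 = res["mean"], res["mean2"]
        if El2.sum() == 0:          # loading shrunk to zero
            break
        # Factor update, by symmetry.
        s_f = np.sqrt(1 / (tau * El2.sum()))
        x_f = (Y.T @ El) / El2.sum()
        res = ebnm(x_f, s_f * np.ones(p))
        Ef, Ef2 = res["mean"], res["mean2"]
    return El, Ef
```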
3.6. The $K$-factor EBMF Model
It is straightforward to extend the variational approach to fit the general $K$-factor model (2.1)-(2.4). In brief, we introduce variational distributions $q_{l_k}, q_{f_k}$ for each $k = 1, \dots, K$, and then optimize the objective function $F$. Similar to the rank-1 model, this optimization can be done by iteratively updating parameters relating to a single loading or factor, keeping other parameters fixed. And again we simplify implementation by keeping track of only the first and second moments of the distributions $q_{l_k}$ and $q_{f_k}$, which we denote $\bar l_k, \overline{l^2_k}, \bar f_k, \overline{f^2_k}$. The updates to $(\bar l_k, \overline{l^2_k}, \bar f_k, \overline{f^2_k})$ are essentially identical to those for fitting the rank 1 model above, but with $Y$ replaced with the residuals obtained by removing the estimated effects of the other factors:
$R^{(k)}_{ij} := Y_{ij} - \sum_{k' \neq k} \bar l_{ik'}\, \bar f_{jk'}.$ (3.27)
Based on this approach we have implemented two algorithms for fitting the $K$-factor model. First, a simple "greedy" algorithm, which starts by fitting the rank 1 model, and then adds factors $k = 2, 3, \dots$ one at a time, optimizing over the new factor parameters before moving on to the next factor. Second, a "backfitting" algorithm (Breiman and Friedman, 1985), which iteratively refines the estimates for each factor given the estimates for the other factors. Both algorithms are detailed in Appendix A.
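As an illustration of the greedy strategy only (the backfitting refinement and the stopping details in Appendix A are omitted), the sketch below reuses flash_rank1_sketch and ebnm_point_normal from the earlier sketches, fitting each new factor to the residuals left by the previous ones.

```python
# Sketch of the greedy procedure: fit a rank-1 EBMF model to the current
# residuals, stop when the new factor is (numerically) zero.
import numpy as np

def flash_greedy_sketch(Y, K_max=10):
    R = Y.copy()
    loadings, factors = [], []
    for _ in range(K_max):
        El, Ef = flash_rank1_sketch(R, ebnm_point_normal)
        if np.allclose(np.outer(El, Ef), 0):
            break                      # new factor is zero: stop (Section 3.7)
        loadings.append(El)
        factors.append(Ef)
        R = R - np.outer(El, Ef)       # remove this factor's estimated effect
    if not loadings:
        return np.zeros((Y.shape[0], 0)), np.zeros((Y.shape[1], 0))
    return np.column_stack(loadings), np.column_stack(factors)
```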
3.7. Selecting $K$
An interesting feature of EB approaches to matrix factorization, noted by Bishop (1999), is that they automatically select the number of factors $K$. This is because the maximum likelihood solution for a prior $g_{l,k}$ (or $g_{f,k}$) is sometimes a point mass on 0 (provided $\mathcal{G}$ includes this distribution). Furthermore, the same is true of the solution to the variational approximation (see also Bishop, 1999; Stegle et al., 2012). This means that if $K$ is set sufficiently large then some loading/factor combinations will be optimized to be exactly 0. (Or, in the greedy approach, which adds one factor at a time, the algorithm will eventually add a factor that is exactly 0, at which point it terminates.)
Here we note that the variational approximation may be expected to result in conservative estimation (i.e. underestimation) of $K$ compared with the (intractable) use of maximum likelihood to estimate $K$. We base our argument on the simplest case: comparing $K = 1$ vs $K = 0$. Let $\delta_0$ denote the degenerate distribution with all its mass at 0. Note that the rank-1 factor model (2.1), with $g_l = \delta_0$ (or $g_f = \delta_0$), is essentially a "rank-0" model. Now note that the variational lower bound, $F$, is exactly equal to the log-likelihood when $g_l = \delta_0$ (or $g_f = \delta_0$). This is because if the prior is a point mass at 0 then the posterior is also a point mass, which trivially factorizes as a product of point masses, and so the variational family includes the true posterior in this case. Since $F$ is a lower bound to the log-likelihood we have the following simple lemma:
Lemma 4 If $F(q, g_l, g_f, \tau) \geq F(q', \delta_0, g_f', \tau')$ then $\log p(Y \mid g_l, g_f, \tau) \geq \log p(Y \mid \delta_0, g_f', \tau')$.
Proof
$\log p(Y \mid g_l, g_f, \tau) \geq F(q, g_l, g_f, \tau) \geq F(q', \delta_0, g_f', \tau') = \log p(Y \mid \delta_0, g_f', \tau').$ (3.28)
■
Thus, if the variational approximation favors the rank 1 model over the rank 0 model, then it is guaranteed that the likelihood would also favor the rank 1 model over the rank 0 model. In other words, compared with the likelihood, the variational approximation is conservative in terms of preferring the rank 1 model to the rank 0 model. This conservatism is a double-edged sword. On the one hand it means that if the variational approximation finds structure it should be taken seriously. On the other hand it means that the variational approximation could miss subtle structure.
In practice Algorithm 2 can converge to a local optimum of $F$ that is not as high as the trivial (rank 0) solution, $\bar l = \bar f = 0$. We can add a check for this at the end of Algorithm 2, and set $\bar l = 0$ and $\bar f = 0$ when this occurs.
3.8. Identifiability
In EBMF each loading and factor is identifiable, at best, only up to a multiplicative constant (provided $\mathcal{G}$ is a scale family). Specifically, scaling the prior distributions $g_l$ and $g_f$ by $c$ and $1/c$ respectively (for any $c > 0$) results in the same marginal likelihood, and also results in a corresponding scaling of the posterior distributions on the factors and loadings (e.g. it scales the posterior first moments by $c$ and $1/c$, and the second moments by $c^2$ and $1/c^2$). However, this non-identifiability is not generally a problem, and if necessary it could be dealt with by re-scaling factor estimates to have norm 1.
4. Software Implementation: flash
We have implemented Algorithms 2, 4 and 5 in an R package, flash ("factors and loadings via adaptive shrinkage"). These algorithms can fit the EBMF model for any choice of distributional family $\mathcal{G}$: the user must simply provide a function to solve the EBNM problem for these prior families.
One source of functions for solving the EBNM problem is the "adaptive shrinkage" (ashr) package, which implements methods from Stephens (2017). These methods solve the EBNM problem for several flexible choices of $\mathcal{G}$, including:
- the set of all scale mixtures of zero-centered normals;
- the set of all symmetric unimodal distributions, with mode at 0;
- the set of all unimodal distributions, with mode at 0;
- the set of all non-negative unimodal distributions, with mode at 0.
These methods are computationally stable and efficient, being based on convex optimization methods (Koenker and Mizera, 2014b) and analytic Bayesian posterior computations.
We have also implemented functions to solve the EBNM problem for additional choices of $\mathcal{G}$ in the package ebnm (https://github.com/stephenslab/ebnm). These include $\mathcal{G}$ being the "point-normal" family:
- the set of all distributions that are a mixture of a point mass at zero and a normal with mean 0.
This choice is less flexible than those in ashr, and involves non-convex optimizations, but can be faster.
Although in this paper we focus our examples on sparsity-inducing priors for both loadings and factors, we note that our software makes it easy to experiment with different choices, some of which represent novel methodologies. For example, constraining one of $\mathcal{G}_l, \mathcal{G}_f$ to be a family of non-negative distributions and leaving the other unconstrained yields an EB version of semi-non-negative matrix factorization (Ding et al., 2008), and we are aware of no existing EB implementations for this problem. Exploring the relative merits of these many possible options in different types of application will be an interesting direction for future work.
4.1. Missing Data
If some elements of $Y$ are missing, then this is easily dealt with. For example, the sums over $j$ in (3.19) and (3.20) are simply computed using only the $j$ for which $Y_{ij}$ is not missing. This corresponds to an assumption that the missing elements of $Y$ are "missing at random" (Rubin, 1976). In practice we implement this by setting $\tau_{ij} = 0$ whenever $Y_{ij}$ is missing (and filling in the missing entries of $Y$ to an arbitrary number). This allows the implementation to exploit standard fast matrix multiplication routines, which cannot handle missing data. If many data points are missing then it may be helpful to exploit sparse matrix routines.
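The short sketch below illustrates this device for the loading update, under the assumption that missing cells have already been given $\tau_{ij} = 0$ and an arbitrary filled-in value; the function and variable names are illustrative, and every row is assumed to have at least one observed entry.

```python
# Sketch of the loading update data (3.19)-(3.20) with an elementwise precision
# matrix in which missing cells have tau_ij = 0, as described above.
import numpy as np

def loading_update_data(Y_filled, tau, Ef, Ef2):
    denom = tau @ Ef2                     # sum_j tau_ij * E(f_j^2), per row i
    x = ((tau * Y_filled) @ Ef) / denom   # (3.19); missing cells drop out
    s = 1.0 / np.sqrt(denom)              # (3.20)
    return x, s
```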
4.2. Initialization
Both Algorithms 2 and 4 require a rank 1 initialization procedure, init. Here, we use the softImpute function from the package softImpute (Mazumder et al., 2010), with the penalty parameter set to zero, which essentially performs SVD when $Y$ is completely observed, but can also deal with missing values in $Y$.
The backfitting algorithm (Algorithm 5) also requires initialization. One option is to use the greedy algorithm to initialize, which we call “greedy+backfitting”.
5. Numerical Comparisons
We now compare our methods with several competing approaches. To keep these comparisons manageable in scope we focus attention on methods that aim to capture possible sparsity in $L$ and/or $F$. For EBMF we present results for two different shrinkage-oriented prior families $\mathcal{G}$: the scale mixtures of normals, and the point-normal family. We denote these flash and flash_pn respectively when we need to distinguish. In addition we consider Sparse Factor Analysis (SFA) (Engelhardt and Stephens, 2010), SFAmix (Gao et al., 2013), Nonparametric Bayesian Sparse Factor Analysis (NBSFA) (Knowles and Ghahramani, 2011), Penalized Matrix Decomposition (Witten et al., 2009) (PMD, implemented in the R package PMA), and Sparse SVD (Yang et al., 2014) (SSVD, implemented in package ssvd). Although the methods we compare against represent only a small fraction of the very large number of methods for this problem, they were chosen to represent a wide range of different approaches to inducing sparsity: SFA, SFAmix and NBSFA are three Bayesian approaches with quite different approaches to prior specification; PMD is based on a penalized likelihood with a penalty on factors and/or loadings; and SSVD is based on iterative thresholding of singular vectors. We also compare with softImpute (Mazumder et al., 2010), which does not explicitly model sparsity in $L$ and $F$, but fits a regularized low-rank matrix using a nuclear-norm penalty. Finally, for reference we also use standard (truncated) SVD.
All of the Bayesian methods (flash, SFA, SFAmix and NBSFA) are "self-tuning", at least to some extent, and we applied them here with default values. According to Yang et al. (2014) SSVD is robust to the choice of tuning parameters, so we also ran SSVD with its default values, using the robust option (method="method"). The softImpute method has a single tuning parameter ($\lambda$, which controls the nuclear norm penalty), and we chose this penalty by orthogonal cross-validation (OCV; Appendix B). The PMD method can use two tuning parameters (one for $l$ and one for $f$) to allow different sparsity levels in $l$ vs $f$. However, since tuning two parameters can be inconvenient it also has the option to use a single parameter for both $l$ and $f$. We used OCV to tune parameters in both cases, referring to the methods as PMD.cv2 (2 tuning parameters) and PMD.cv1 (1 tuning parameter).
5.1. Simple Simulations
5.1.1. A Single Factor Example
We simulated data under the single-factor model (2.1) ($K = 1$), with sparse loadings and a non-sparse factor:
(5.1) |
(5.2) |
where $\delta_0$ denotes a point mass on 0. We simulated using three different levels of sparsity on the loadings, varying the proportion of zero loadings. (We set the noise precision $\tau$ in these three cases to make each problem not too easy and not too hard.)
We applied all methods to this rank-1 problem, specifying the true value $K = 1$. (The NBSFA software does not provide the option to fix $K$, so it is omitted here.) We compare methods in their accuracy in estimating the true low-rank structure $l f'$ using the relative root mean squared error:
$\text{RRMSE}(\hat l, \hat f) := \sqrt{\frac{\sum_{i,j} (\hat l_i \hat f_j - l_i f_j)^2}{\sum_{i,j} (l_i f_j)^2}}.$ (5.3)
Despite the simplicity of this simulation, the methods vary greatly in performance (Figure 1). Both versions of flash consistently outperform all the other methods across all scenarios (although softImpute performs similarly in the non-sparse case). The next best performances come from softImpute (SI.cv), PMD.cv2 and SFA, whose relative performances depend on the scenario. All three consistently improve on, or do no worse than, SVD. PMD.cv1 performs similarly to SVD. The SFAmix method performs very variably, sometimes providing very poor estimates, possibly due to poor convergence of the MCMC algorithm (it is the only method here that uses MCMC). The SSVD method consistently performs worse than simple SVD, possibly because it is more adapted to both factors and loadings being sparse (and possibly because, following Yang et al. 2014, we did not use CV to tune its parameters). Inspection of individual results suggests that the poor performance of both SFAmix and SSVD is often due to over-shrinking of non-zero loadings to zero.
5.1.2. A Sparse Bi-cluster Example (Rank 3)
An important feature of our EBMF methods is that they estimate separate prior distributions for each factor and each loading, allowing them to adapt to any combination of sparsity in the factors and loadings. This flexibility is not easy to achieve in other ways. For example, methods that use cross-validation to tune penalty parameters are generally limited to one or two tuning parameters because of the computational difficulties of searching over a larger space.
To illustrate this flexibility we simulated data under the factor model (2.1) with $K = 3$, and:
(5.4) |
(5.5) |
(5.6) |
(5.7) |
(5.8) |
(5.9) |
with all other elements of $l_k$ and $f_k$ set to zero, for $k = 1, 2, 3$. This example has a sparse bi-cluster structure where distinct groups of samples are each loaded on only one factor (Figure 2a), and both the size of the groups and the number of variables in each factor vary.
We applied flash, softImpute, SSVD and PMD to this example. (We excluded SFA and SFAmix since these methods do not model sparsity in both factors and loadings.) The results (Figure 2) show that again flash consistently outperforms the other methods, and again the next best is softImpute. On this example both SSVD and PMD outperform SVD. Although SSVD and PMD perform similarly on average, their qualitative behavior is different: PMD insufficiently shrinks the 0 values, whereas SSVD shrinks the 0 values well but overshrinks some of the signal, essentially removing the smallest of the three loading/factor combinations (Figure 2b).
5.2. Missing Data Imputation for Real Data Sets
Here we compare methods in their ability to impute missing data using five real data sets. In each case we “hold out” (mask) some of the data points, and then apply the methods to obtain estimates of the missing values. The data sets are as follows:
MovieLens data, an (incomplete) matrix of user-movie ratings (integers from 1 to 5) (Harper and Konstan, 2016). Most users do not rate most movies, so the matrix is sparsely observed (94% missing), and contains about 100K observed ratings. We hold out a fraction of the observed entries and assess accuracy of methods in estimating these. We centered and scaled the ratings for each user before analysis.
GTEx eQTL summary data, a matrix of $z$ scores computed to test association of genetic variants (rows) with gene expression in different human tissues (columns). These data come from the Genotype Tissue Expression (GTEx) project (Consortium et al., 2015), which assessed the effects of thousands of "eQTLs" across 44 human tissues. (An eQTL is a genetic variant that is associated with expression of a gene.) To identify eQTLs, the project tested for association between expression and every nearby genetic variant, each test yielding a $z$ score. The data used here are the $z$ scores for the most significant genetic variant for each gene (the "top" eQTL). See Section 5.3 for more detailed analyses of these data.
Brain Tumor data, a matrix of gene expression measurements on 4 different types of brain tumor (included in the denoiseR package, Josse et al., 2018). We centered each column before analysis.
Presidential address data, a matrix of word counts from the inaugural addresses of 13 US presidents (1940–2009) (also included in the denoiseR package, Josse et al., 2018). Since both row and column means vary greatly we centered and scaled both rows and columns before analysis, using the biScale function from softImpute.
Breast cancer data, a matrix of gene expression measurements from Carvalho et al. (2008), which were used as an example in the paper introducing NBSFA (Knowles and Ghahramani, 2011). Following Knowles and Ghahramani (2011) we centered each column (gene) before analysis.
Among the methods considered above, only flash, PMD and softImpute can handle missing data. We add NBSFA (Knowles and Ghahramani, 2011) to these comparisons. To emphasize the importance of parameter tuning we include results for PMD and softImpute with default settings (denoted PMD, SI) as well as using cross-validation (PMD.cv1, SI.cv).
For these real data the appropriate value of $K$ is, of course, unknown. Both flash and NBSFA automatically estimate $K$. For PMD and softImpute we specified $K$ based on the values inferred by flash and NBSFA for each of the five data sets.
We applied each method to all 5 data sets, using 10-fold OCV (Appendix B) to mask data points for imputation, repeated 20 times (with different random number seeds) for each data set. We measure imputation accuracy using root mean squared error (RMSE):
$\text{RMSE} = \sqrt{\frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \big(Y_{ij} - \hat Y_{ij}\big)^2},$ (5.10)
where $\Omega$ is the set of indices of the held-out data points.
The results are shown in Figure 3. Although the ranking of methods varies among data sets, flash, PMD.cv1 and SI.cv perform similarly on average, and consistently outperform NBSFA, which in turn typically outperforms (untuned) PMD and unpenalized softImpute. These results highlight the importance of appropriate tuning for the penalized methods, and also the effectiveness of the EB method in flash to provide automatic tuning.
In these comparisons, as in the simulations, the two flash methods typically performed similarly. The exception is the GTEx data, where the scale mixture of normals performed worse. Detailed investigation revealed this to be due to a very small number of very large “outlier” imputed values, well outside the range of the observed data, which grossly inflated RMSE. These outliers were so extreme that it should be possible to implement a filter to avoid them. However, we did not do this here as it seems useful to highlight this unexpected behavior. (Note that this occurs only when data are missing, and even then only in one of the five data sets considered here.)
5.3. Sharing of Genetic Effects on Gene Expression Among Tissues
To illustrate flash in a scientific application, we applied it to the GTEx data described above, a matrix of $z$ scores, with $Y_{ij}$ reflecting the strength (and direction) of the effect of eQTL $i$ in tissue $j$. We applied flash using the greedy+backfitting algorithm (i.e. the backfitting algorithm, initialized using the greedy algorithm).
The flash results yielded 26 factors (Figure 4–5) which summarize the main patterns of eQTL sharing among tissues (and, conversely, the main patterns of tissue-specificity). For example, the first factor has approximately equal weight for every tissue, and reflects the fact that many eQTLs show similar effects across all 44 tissues. The second factor has strong effects only in the 10 brain tissues, from which we infer that some eQTLs show much stronger effects in brain tissues than other tissues.
Subsequent factors tend to be sparser, and many have a strong effect in only one tissue, capturing “tissue-specific” effects. For example, the 3rd factor shows a strong effect only in whole blood, and captures eQTLs that have much stronger effects in whole blood than other tissues. (Two tissues, “Lung” and “Spleen”, show very small effects in this factor but with the same sign as blood. This is intriguing since the lung has recently been found to make blood cells—see Lefrançais et al. 2017—and a key role of the spleen is storing of blood cells.) Similarly Factors 7, 11 and 14 capture effects specific to “Testis”, “Thyroid” and “Esophagus Mucosa” respectively.
A few other factors show strong effects in a small number of tissues that are known to be biologically related, providing support that the factors identified are scientifically meaningful. For example, factor 10 captures the two tissues related to the cerebellum, “Brain Cerebellar Hemisphere” and “Brain Cerebellum”. Factor 19 captures tissues related to female reproduction, “Ovary”, “Uterus” and “Vagina”. Factor 5 captures “Muscle Skeletal”, with small but concordant effects in the heart tissues (“Heart Atrial Appendage” and “Heart Left Ventricle”). Factor 4, captures the two skin tissues (“Skin Not Sun Exposed Suprapubic”, “Skin Sun Exposed Lower leg”) and also “Esophagus Mucosa”, possibly reflecting the sharing of squamous cells that are found in both the surface of the skin, and the lining of the digestive tract. In factor 24, “Colon Transverse” and “Small Intestine Terminal Ileum” show the strongest effects (and with same sign), reflecting some sharing of effects in these intestinal tissues. Among the 26 factors, only a few are difficult to interpret biologically (e.g. factor 8).
To highlight the benefits of sparsity, we contrast the flash results with those for softImpute, which was the best-performing method in the missing data assessments on these data, but which uses a nuclear norm penalty that does not explicitly reward sparse factors or loadings. The first eight softImpute factors are shown in Figure 6. The softImpute results—except for the first two factors—show little resemblance to the flash results, and in our view are harder to interpret.
5.4. Computational Demands
It is difficult to make general statements about the computational demands of our methods, because both the number of factors and the number of iterations per factor can vary considerably depending on the data. However, to give a specific example, running our current implementation of the greedy algorithm on the GTEx data (a 16,000 by 44 matrix) takes roughly 140–650 s (wall time), depending on the number of factors fitted (on a 2015 MacBook Air with a 2.2 GHz Intel Core i7 processor and 8 GB RAM). By comparison, a single run of softImpute without CV takes 2–3 s, so a naive implementation of 5-fold CV with 10 different tuning parameters and 10 different values of $K$ would take over 1000 s (although one could improve on this by use of warm starts, for example).
6. Discussion
Here we discuss some potential extensions or modifications of our work.
6.1. Orthogonality Constraint
Our formulation here does not require the factors or loadings to be orthogonal. In scientific applications we do not see any particular reason to expect underlying factors to be orthogonal. However, imposing such a constraint could have computational or mathematical advantages. Formally adding such a constraint to our objective function seems tricky, but it would be straightforward to modify our algorithms to include an orthogonalization step at each update. This would effectively result in an EB version of the SSVD algorithms in Yang et al. (2014), and it seems likely to be computationally faster than our current approach. One disadvantage of this approach is that it is unclear what optimization problem such an algorithm would solve (but the same is true of SSVD, and our algorithms have the advantage that they deal with missing data).
6.2. Non-negative Matrix Factorization
We focused here on the potential for EBMF to induce sparsity on loadings and factors. However, EBMF can also encode other assumptions. For example, to assume the loadings and factors are non-negative, simply restrict $\mathcal{G}_l$ and $\mathcal{G}_f$ to be families of non-negative-valued distributions, yielding "Empirical Bayes non-negative Matrix Factorization" (EBNMF). Indeed, the ashr software can already solve the EBNM problem for some such families, and so flash already implements EBNMF. In preliminary assessments we found that the greedy approach is problematic here: the non-negative constraint makes it harder for later factors to compensate for errors in earlier factors. However, it is straightforward to apply the backfitting algorithm to fit EBNMF, with initialization by any existing NMF method. The performance of this approach is an area for future investigation.
6.3. Tensor Factorization
It is also straightforward to extend EBMF to tensor factorization, specifically a CANDECOMP/PARAFAC decomposition (Kolda and Bader, 2009):
$Y_{ijt} = \sum_{k=1}^{K} l_{ik}\, f_{jk}\, w_{tk} + E_{ijt}$ (6.1)
$l_{1k}, \dots, l_{nk} \sim_{\text{i.i.d.}} g_{l,k}, \qquad g_{l,k} \in \mathcal{G}_l$ (6.2)
$f_{1k}, \dots, f_{pk} \sim_{\text{i.i.d.}} g_{f,k}, \qquad g_{f,k} \in \mathcal{G}_f$ (6.3)
$w_{1k}, \dots, w_{rk} \sim_{\text{i.i.d.}} g_{w,k}, \qquad g_{w,k} \in \mathcal{G}_w$ (6.4)
$E_{ijt} \sim N(0, 1/\tau_{ijt}).$ (6.5)
The variational approach is easily extended to this case (a generalization of methods in Hore et al., 2016), and updates that increase the objective function can be constructed by solving an EBNM problem, similar to EBMF. It seems likely that issues of convergence to local optima, and the need for good initializations, will need some attention to obtain good practical performance. However, results in Hore et al. (2016) are promising, and the automatic-tuning feature of EB methods seems particularly attractive here. For example, extending PMD to this case—allowing for different sparsity levels in $l$, $f$ and $w$—would require 3 penalty parameters even in the rank 1 case, making it difficult to tune by CV.
6.4. Non-Gaussian Errors
It is also possible to extend the variational approximations used here to fit non-Gaussian models, such as binomial data; see for example Jaakkola and Jordan (2000); Seeger and Bouchard (2012); Klami (2015). The extension of our EB methods using these ideas is detailed in Wang (2017).
Acknowledgments
We thank P. Carbonetto for computational assistance, and P. Carbonetto, D. Gerard, and A. Sarkar for helpful conversations and comments on a draft manuscript. Computing resources were provided by the University of Chicago Research Computing Center. This work was supported by NIH grant HG002585 and by a grant from the Gordon and Betty Moore Foundation (Grant GBMF #4559).
Appendix A. Variational EBMF with $K$ Factors
Here we describe in detail the variational approach to the $K$-factor model, including deriving the updates that we use to optimize the variational objective. (These derivations naturally include the $K = 1$ model as a special case, and our proof of Proposition 5 below includes Proposition 2 as a special case.)
Let $q = (q_{l_1}, \dots, q_{l_K}, q_{f_1}, \dots, q_{f_K})$ denote the (fully-factorized) variational distributions on the loadings/factors:
$q_{l_k}(l_k) = \prod_{i=1}^{n} q_{l_k, i}(l_{ik})$ (A.1)
$q_{f_k}(f_k) = \prod_{j=1}^{p} q_{f_k, j}(f_{jk}).$ (A.2)
The objective function is thus a function of $q_{l_1}, \dots, q_{l_K}, q_{f_1}, \dots, q_{f_K}$ and $g_{l,1}, \dots, g_{l,K}, g_{f,1}, \dots, g_{f,K}$, as well as the precision $\tau$:
$F(q, g, \tau) = E_q\big[\log p(Y \mid l, f, \tau)\big]$ (A.3)
$\quad + \sum_{k=1}^{K} E_q\left[\log \frac{g_{l,k}(l_k)}{q_{l_k}(l_k)}\right] + \sum_{k=1}^{K} E_q\left[\log \frac{g_{f,k}(f_k)}{q_{f_k}(f_k)}\right],$ (A.4)
where $g_{l,k}(l_k) := \prod_i g_{l,k}(l_{ik})$ and $g_{f,k}(f_k) := \prod_j g_{f,k}(f_{jk})$.
We optimize $F$ by iteratively updating parameters relating to $\tau$, or to a single loading $l_k$ or factor $f_k$, keeping other parameters fixed. We simplify implementation by keeping track of only the first and second moments of the distributions $q_{l_k}$ and $q_{f_k}$, which we denote $\bar l_k, \overline{l^2_k}, \bar f_k, \overline{f^2_k}$. We now describe each kind of update in turn.
A.1. Updates for Precision Parameters
Here we derive updates to optimize over the precision parameters $\tau$. Focusing on the parts of $F$ that depend on $\tau$ gives:
$F(q, g, \tau) = \sum_{i,j}\left[\frac{1}{2}\log\frac{\tau_{ij}}{2\pi} - \frac{\tau_{ij}}{2}\, E_q\Big(Y_{ij} - \sum_k l_{ik} f_{jk}\Big)^2\right] + \text{const}$ (A.5)
$= \sum_{i,j}\left[\frac{1}{2}\log\frac{\tau_{ij}}{2\pi} - \frac{\tau_{ij}}{2}\, \overline{R^2_{ij}}\right] + \text{const},$ (A.6)
where $\overline{R^2_{ij}}$ is defined by:
$\overline{R^2_{ij}} := E_q\Big[\Big(Y_{ij} - \sum_k l_{ik} f_{jk}\Big)^2\Big]$ (A.7)
$= \Big(Y_{ij} - \sum_k \bar l_{ik}\, \bar f_{jk}\Big)^2 - \sum_k \bar l_{ik}^2\, \bar f_{jk}^2 + \sum_k \overline{l^2_{ik}}\; \overline{f^2_{jk}}.$ (A.8)
If we constrain $\tau \in \mathcal{T}$ then we have
$\hat\tau = \arg\max_{\tau \in \mathcal{T}} \sum_{i,j}\left[\frac{1}{2}\log\tau_{ij} - \frac{\tau_{ij}}{2}\, \overline{R^2_{ij}}\right].$ (A.9)
For example, assuming constant precision ($\tau_{ij} = \tau$) yields:
$\hat\tau = \frac{np}{\sum_{i,j} \overline{R^2_{ij}}}.$ (A.10)
Assuming column-specific precisions ($\tau_{ij} = \tau_j$), which is the default in our software, yields:
$\hat\tau_j = \frac{n}{\sum_{i} \overline{R^2_{ij}}}.$ (A.11)
Other variance structures are considered in Appendix A.5 of Wang (2017).
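As an illustration, the sketch below computes the column-specific update (A.11) from the stored moments, using the variance decomposition in (A.7)-(A.8); the array shapes and function name are illustrative choices.

```python
# Sketch of the column-specific precision update (A.11) from stored moments.
# El, Ef have shape (n, K) and (p, K); El2, Ef2 hold the second moments.
import numpy as np

def update_tau_by_column(Y, El, El2, Ef, Ef2):
    fitted = El @ Ef.T
    # Expected squared residuals (A.7)-(A.8): squared residual of the fitted
    # means plus the variance contributed by each factor under q.
    R2 = (Y - fitted) ** 2 - (El**2) @ (Ef**2).T + El2 @ Ef2.T
    return Y.shape[0] / R2.sum(axis=0)    # tau_j = n / sum_i R2_ij
```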
A.2. Updating Loadings and Factors
The following Proposition, which generalizes Proposition 2 in the main text, shows how updates for loadings (and factors) for the $K$-factor EBMF model can be achieved by solving an EBNM problem.
Proposition 5 For the $K$-factor model, the update $\arg\max_{q_{l_k}, g_{l,k}} F(q, g, \tau)$ is solved by solving an EBNM problem. Specifically
$(\hat g_{l,k}, \hat q_{l_k}) = \text{EBNM}\Big(\hat x\big(R^{(k)}, \bar f_k, \overline{f^2_k}, \tau\big),\; \hat s\big(\overline{f^2_k}, \tau\big)\Big),$ (A.12)
where the functions $\hat x$ and $\hat s$ are given by (3.19) and (3.20), $\bar f_k, \overline{f^2_k}$ denote the vectors whose elements are the first and second moments of $f_k$ under $q_{f_k}$, and $R^{(k)}$ denotes the residual matrix (3.27).
Similarly, the update $\arg\max_{q_{f_k}, g_{f,k}} F(q, g, \tau)$ is solved by solving an EBNM problem. Specifically,
$(\hat g_{f,k}, \hat q_{f_k}) = \text{EBNM}\Big(\hat x'\big(R^{(k)}, \bar l_k, \overline{l^2_k}, \tau\big),\; \hat s'\big(\overline{l^2_k}, \tau\big)\Big),$ (A.13)
where the functions $\hat x'$ and $\hat s'$ are given by
$\hat x'_j\big(R^{(k)}, \bar l_k, \overline{l^2_k}, \tau\big) := \frac{\sum_i R^{(k)}_{ij}\, \tau_{ij}\, \bar l_{ik}}{\sum_i \tau_{ij}\, \overline{l^2_{ik}}}$ (A.14)
$\hat s'_j\big(\overline{l^2_k}, \tau\big) := \Big(\sum_i \tau_{ij}\, \overline{l^2_{ik}}\Big)^{-1/2}.$ (A.15)
A.2.1. A Lemma on the Normal Means Problem
To prove Proposition 5 we introduce a lemma that characterizes the solution of the normal means problem in terms of an objective that is closely related to the variational objective.
Recall that the EBNM model is:
$x = \theta + e, \qquad e \sim N_n\big(0, \operatorname{diag}(s_1^2, \dots, s_n^2)\big)$ (A.16)
$\theta_1, \dots, \theta_n \sim_{\text{i.i.d.}} g,$ (A.17)
where $g \in \mathcal{G}$.
Solving the EBNM problem involves estimating $g$ by maximum likelihood:
$\hat g := \arg\max_{g \in \mathcal{G}} \ell_{\text{NM}}(g; x, s),$ (A.18)
where
$\ell_{\text{NM}}(g; x, s) := \log p(x \mid g, s) = \sum_{j=1}^{n} \log \int p(x_j \mid \theta_j, s_j)\, g(\theta_j)\, d\theta_j.$ (A.19)
It also involves finding the posterior distributions:
$\hat q(\theta) := p(\theta \mid x, s, \hat g) = \prod_{j=1}^{n} p(\theta_j \mid x_j, s_j, \hat g).$ (A.20)
Lemma 6 Solving the EBNM problem also solves:
$(\hat g, \hat q) = \arg\max_{g \in \mathcal{G},\, q} F_{\text{NM}}(q, g; x, s),$ (A.21)
where
$F_{\text{NM}}(q, g; x, s) := E_q\big[\log p(x \mid \theta, s)\big] + E_q\left[\log \frac{g(\theta)}{q(\theta)}\right],$ (A.22)
with $g(\theta) := \prod_j g(\theta_j)$ and $q(\theta) := \prod_j q_j(\theta_j)$, and the maximization over $q$ taken over fully-factorized distributions. Because $\log p(x \mid \theta, s)$ is quadratic in $\theta$, $F_{\text{NM}}$ depends on $q$ only through the first and second moments $\bar\theta_j := E_{q_j}(\theta_j)$ and $\overline{\theta^2_j} := E_{q_j}(\theta_j^2)$.
Equivalently, (A.21)-(A.22) is solved by $\hat g$ in (A.18) and $\hat q$ in (A.20), with $\bar\theta_j$ and $\overline{\theta^2_j}$ equal to the corresponding posterior first and second moments (3.15)-(3.16).
Proof The log likelihood can be written, for any distribution $q(\theta)$, as
$\ell_{\text{NM}}(g; x, s) = \log p(x \mid g, s)$ (A.23)
$= E_q\big[\log p(x \mid g, s)\big]$ (A.24)
$= E_q\left[\log \frac{p(x, \theta \mid g, s)}{p(\theta \mid x, g, s)}\right]$ (A.25)
$= E_q\left[\log \frac{p(x, \theta \mid g, s)}{q(\theta)}\right] + E_q\left[\log \frac{q(\theta)}{p(\theta \mid x, g, s)}\right]$ (A.26)
$= F_{\text{NM}}(q, g; x, s) + D_{KL}\big(q \,\|\, \hat p\big),$ (A.27)
where
$F_{\text{NM}}(q, g; x, s) := E_q\left[\log \frac{p(x, \theta \mid g, s)}{q(\theta)}\right]$ (A.28)
and
$D_{KL}\big(q \,\|\, \hat p\big) := E_q\left[\log \frac{q(\theta)}{\hat p(\theta)}\right].$ (A.29)
Here $\hat p$ denotes the posterior distribution $p(\theta \mid x, g, s)$. This identity holds for any distribution $q$.
Rearranging (A.27) gives:
$F_{\text{NM}}(q, g; x, s) = \ell_{\text{NM}}(g; x, s) - D_{KL}\big(q \,\|\, \hat p\big).$ (A.30)
Since $D_{KL}(q \,\|\, \hat p) \geq 0$, with equality when $q = \hat p$, $F_{\text{NM}}$ is maximized over $q$ by setting $q = \hat p$. Further
$\max_q F_{\text{NM}}(q, g; x, s) = \ell_{\text{NM}}(g; x, s),$ (A.31)
so
$\max_{g \in \mathcal{G},\, q} F_{\text{NM}}(q, g; x, s) = \max_{g \in \mathcal{G}} \ell_{\text{NM}}(g; x, s),$ (A.32)
which is attained at $g = \hat g$ from (A.18) and $q = \hat q$ from (A.20).
It remains only to show that $F_{\text{NM}}$ has the form (A.22). Note that
$p(x, \theta \mid g, s) = p(x \mid \theta, s)\, g(\theta).$ (A.33)
Thus
$F_{\text{NM}}(q, g; x, s) = E_q\big[\log p(x \mid \theta, s)\big] + E_q\left[\log \frac{g(\theta)}{q(\theta)}\right].$ (A.34)
■
A.2.2. Proof of Proposition 5
We are now ready to prove Proposition 5.
Proof We prove the first part of the proposition since the proof for the second part is essentially the same.
The objective function (A.3)-(A.4), viewed as a function of $(q_{l_k}, g_{l,k})$ with everything else fixed, is:
$F(q, g, \tau) = c - \sum_{i,j} \frac{\tau_{ij}}{2}\left(-2\, R^{(k)}_{ij}\, \bar l_{ik}\, \bar f_{jk} + \overline{l^2_{ik}}\; \overline{f^2_{jk}}\right) + E_q\left[\log \frac{g_{l,k}(l_k)}{q_{l_k}(l_k)}\right]$ (A.35)
$= c' - \sum_i \frac{1}{2\hat s_i^2}\left(\overline{l^2_{ik}} - 2\, \hat x_i\, \bar l_{ik}\right) + E_q\left[\log \frac{g_{l,k}(l_k)}{q_{l_k}(l_k)}\right],$ (A.36)
where $c$ and $c'$ are constants with respect to $(q_{l_k}, g_{l,k})$ and
$\hat x_i := \frac{\sum_j R^{(k)}_{ij}\, \tau_{ij}\, \bar f_{jk}}{\sum_j \tau_{ij}\, \overline{f^2_{jk}}}$ (A.37)
$\hat s_i^2 := \Big(\sum_j \tau_{ij}\, \overline{f^2_{jk}}\Big)^{-1}.$ (A.38)
Based on Lemma 6, we can solve this optimization problem (A.36) by solving the EBNM problem with:
$x = \hat x\big(R^{(k)}, \bar f_k, \overline{f^2_k}, \tau\big)$ (A.39)
$s = \hat s\big(\overline{f^2_k}, \tau\big).$ (A.40)
■
A.3. Algorithms
Just as with the rank 1 EBMF model, the updates for the rank-$K$ model require only the first and second moments of the variational distributions $q_{l_k}, q_{f_k}$. Thus we implement the updates in algorithms that keep track of the first moments ($\bar l_k$ and $\bar f_k$) and second moments ($\overline{l^2_k}$ and $\overline{f^2_k}$), and the precision $\tau$.
Algorithm 3 implements a basic update for $\tau$, and for the parameters relating to a single factor $k$. Note that the latter updates are identical to the updates for fitting the single factor EBMF model, but with $Y$ replaced with the residuals $R^{(k)}$ obtained by removing the estimated effects of the other factors.
Based on these basic updates we implemented two algorithms for fitting the $K$-factor EBMF model: the greedy algorithm, and the backfitting algorithm, as follows.
A.3.1. Greedy Algorithm
The greedy algorithm is a forward procedure that, at the $k$th step, adds a new factor and loading by optimizing over $(q_{l_k}, g_{l,k}, q_{f_k}, g_{f,k})$ while keeping the distributions related to previous factors fixed. Essentially this involves fitting the single-factor model to the residuals obtained by removing the previous factors. The procedure stops adding factors when the estimated new factor (or loading) is identically zero. The algorithm is as follows:
A.3.2. Backfitting Algorithm
The backfitting algorithm iteratively refines a fit of factors and loadings, by updating them one at a time, at each update keeping the other loadings and factors fixed. The name comes from its connection with the backfitting algorithm in Breiman and Friedman (1985), specifically the fact that it involves iteratively re-fitting to residuals.
A.4. Objective Function Computation
The algorithms above all involve updates that will increase (or, at least, not decrease) the objective function . However, these updates do not require computing the objective function itself. In iterative algorithms it can be helpful to compute the objective function to monitor convergence (and as a check on implementation). In this subsection we describe how this can be done. In essence, this involves extending the solver of the EBNM problem to also return the value of the log-likelihood achieved in that problem (which is usually not difficult).
The objective function of the EBMF model is:
$F(q, g, \tau) = E_q\big[\log p(Y \mid l, f, \tau)\big] + \sum_{k=1}^{K} E_q\left[\log \frac{g_{l,k}(l_k)}{q_{l_k}(l_k)}\right] + \sum_{k=1}^{K} E_q\left[\log \frac{g_{f,k}(f_k)}{q_{f_k}(f_k)}\right].$ (A.41)
The calculation of the first term, $E_q[\log p(Y \mid l, f, \tau)]$, is straightforward, and the remaining terms can be calculated using the log-likelihood of the EBNM model, using the following Lemma 7.
Lemma 7 Suppose $(\hat g, \hat q)$ solves the EBNM problem with data $(x, s)$:
$(\hat g, \hat q) = \text{EBNM}(x, s),$ (A.42)
where $\hat q = (\hat q_1, \dots, \hat q_n)$ are the estimated posterior distributions of the normal means parameters $\theta$. Then
$E_{\hat q}\left[\log \frac{\hat g(\theta)}{\hat q(\theta)}\right] = \ell_{\text{NM}}(\hat g; x, s) - E_{\hat q}\big[\log p(x \mid \theta, s)\big],$ (A.43)
where $\ell_{\text{NM}}$ is the log of the likelihood for the normal means problem appearing in (A.27).
Proof We have from (A.28) and (A.33):
$F_{\text{NM}}(\hat q, \hat g; x, s) = E_{\hat q}\left[\log \frac{p(x, \theta \mid \hat g, s)}{\hat q(\theta)}\right]$ (A.44)
$= E_{\hat q}\left[\log \frac{p(x \mid \theta, s)\, \hat g(\theta)}{\hat q(\theta)}\right]$ (A.45)
$= E_{\hat q}\big[\log p(x \mid \theta, s)\big] + E_{\hat q}\left[\log \frac{\hat g(\theta)}{\hat q(\theta)}\right].$ (A.46)
And the result follows from noting that $F_{\text{NM}}(\hat q, \hat g; x, s) = \ell_{\text{NM}}(\hat g; x, s)$, since $\hat q$ is the exact posterior under $\hat g$ and so the KL term in (A.27) is zero. ■
A.5. Inference with Penalty Term
Conceivably, in some settings one might like to encourage solutions to the EBMF problem to be sparser than the maximum-likelihood estimates for the priors $g_{l,k}, g_{f,k}$ would produce. This could be done by extending the EBMF model to introduce a penalty term on these distributions, so that the maximum likelihood estimates are replaced by estimates maximizing a penalized likelihood. We are not advocating for this approach, but it is straightforward given existing machinery, and so we document it here for completeness.
Let $h_l(g_l)$ and $h_f(g_f)$ denote penalty terms on $g_l$ and $g_f$, so the penalized log-likelihood would be:
$\log p(Y \mid g_l, g_f, \tau) + h_l(g_l) + h_f(g_f) = F(q, g_l, g_f, \tau) + D_{KL}\big(q \,\|\, p_{\text{post}}\big) + h_l(g_l) + h_f(g_f),$ (A.47)
where $F$ and $D_{KL}$ are defined in (3.4) and (3.5). And the corresponding penalized variational objective is:
$F(q, g_l, g_f, \tau) + h_l(g_l) + h_f(g_f).$ (A.48)
It is straightforward to modify the algorithms above to maximize this penalized objective: simply modify the EBNM solvers to solve a corresponding penalized normal means problem. That is, instead of estimating the prior $g$ by maximum likelihood, the EBNM solver must now maximize the penalized log-likelihood:
$\hat g := \arg\max_{g \in \mathcal{G}} \big[\ell_{\text{NM}}(g; x, s) + h(g)\big],$ (A.49)
where $\ell_{\text{NM}}$ denotes the log-likelihood for the EBNM problem and $h$ the corresponding penalty. (The computation of the posterior distributions given $\hat g$ is unchanged.)
For example, the ashr software (Stephens, 2017) provides the option to include a penalty on $g$ to encourage overestimation of the size of the point mass on zero. This penalty was introduced to ensure conservative behavior in False Discovery Rate applications of the normal means problem. It is unclear whether such a penalty is desirable in the matrix factorization application. However, the above discussion shows that using this penalty (e.g. within the ebnm function used by the greedy or backfitting algorithms) can be thought of as solving a penalized version of the EBMF problem.
Appendix B. Orthogonal Cross Validation
Cross-validation assessments involve "holding out" (hiding) data from methods. Here we introduce a novel approach to selecting the data to be held out, which we call Orthogonal Cross Validation (OCV). Although not the main focus of our paper, we believe that OCV is a novel and appealing approach to selecting hold-out data for factor models, e.g. when using CV to select an appropriate dimension for dimension reduction methods, as in Owen and Wang (2016).
Generic $k$-fold CV involves randomly dividing the data matrix into $k$ parts and then, for each part, training methods on the other $k - 1$ parts before assessing error on that part, as in Algorithm 6.
The novel part of OCV is in how the "hold-out" pattern is chosen. We randomly divide the rows into $k$ sets and the columns into $k$ sets, combine these sets into $k$ mutually orthogonal collections of blocks, and then take all entries $Y_{ij}$ with the chosen row and column indices as a "hold-out" part.
To illustrate this scheme, we take 3-fold CV as an example. We randomly divide the rows into 3 sets $R_1, R_2, R_3$ and the columns into 3 sets $C_1, C_2, C_3$. The data matrix is thereby divided into 9 blocks (by row and column permutation), $Y^{(a,b)} := \{Y_{ij} : i \in R_a, j \in C_b\}$, and collections of blocks that share no row set and no column set are orthogonal to each other.
In OCV, each fold contains an equally balanced part of the data matrix and includes all the row and column indices. This ensures that all row indices $i$ and all column indices $j$ are included in each hold-out part. In 3-fold OCV, we have:
$\Omega_1 = Y^{(1,1)} \cup Y^{(2,2)} \cup Y^{(3,3)}$ (B.1)
$\Omega_2 = Y^{(1,2)} \cup Y^{(2,3)} \cup Y^{(3,1)}, \qquad \Omega_3 = Y^{(1,3)} \cup Y^{(2,1)} \cup Y^{(3,2)},$ (B.2)
where $\Omega_1, \Omega_2, \Omega_3$ are the three hold-out parts. We can see that, in each "hold-out" part, each row set $R_a$ and each column set $C_b$ shows up once and only once. In this sense the hold-out pattern is "balanced".
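A compact sketch of this masking scheme (illustrative function name; not taken from the softImpute or flashr packages) assigns cell $(i, j)$ to fold $(\text{rowgroup}(i) + \text{colgroup}(j)) \bmod k$, which reproduces the Latin-square pattern above.

```python
# Sketch of the orthogonal (Latin-square) hold-out pattern described above:
# cell (i, j) goes to fold (row_group[i] + col_group[j]) mod k.
import numpy as np

def ocv_folds(n, p, k, seed=0):
    rng = np.random.default_rng(seed)
    row_group = rng.permutation(np.arange(n) % k)   # k roughly equal row sets
    col_group = rng.permutation(np.arange(p) % k)   # k roughly equal column sets
    fold = (row_group[:, None] + col_group[None, :]) % k
    return [np.where(fold == d) for d in range(k)]  # hold-out indices per fold
```

Because every row group and every column group contributes exactly one block to each fold, every row and column index appears in every hold-out part, as required for the "balanced" property.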
Contributor Information
Wei Wang, Department of Statistics, University of Chicago, Chicago, IL, USA.
Matthew Stephens, Department of Statistics and Department of Human Genetics, University of Chicago, Chicago, IL, USA.
References
- Argelaguet Ricard, Velten Britta, Arnol Damien, Dietrich Sascha, Zenz Thorsten, John C Marioni Florian Buettner, Huber Wolfgang, and Stegle Oliver. Multi-omics factor analysis—a framework for unsupervised integration of multi-omics data sets. Molecular systems biology, 14(6):e8124, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Attias Hagai. Independent factor analysis. Neural Computation, 11(4):803–851, 1999. [DOI] [PubMed] [Google Scholar]
- Bai Jushan and Ng Serena. Large Dimensional Factor Analysis. Now Publishers Inc, 2008. [Google Scholar]
- Bhattacharya Anirban and Dunson David B. Sparse Bayesian infinite factor models. Biometrika, pages 291–306, 2011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bishop Christopher M.. Variational principal components. In Ninth International Conference on Artificial Neural Networks (Conf. Publ. No. 470), volume 1, pages 509–514. IET, 1999. [Google Scholar]
- Blei David M, Kucukelbir Alp, and McAuliffe Jon D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017. [Google Scholar]
- Bouchard Guillaume, Naradowsky Jason, Riedel Sebastian, Rocktäschel Tim, and Vlachos Andreas. Matrix and tensor factorization methods for natural language processing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing: Tutorial Abstracts, pages 16–18, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-5005. URL https://www.aclweb.org/anthology/P15-5005. [DOI] [Google Scholar]
- Breiman Leo and Friedman Jerome H.. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80(391):580–598, 1985. doi: 10.1080/01621459.1985.10478157. URL http://www.tandfonline.com/doi/abs/10.1080/01621459.1985.10478157. [DOI] [Google Scholar]
- Carvalho Carlos M., Chang Jeffrey, Lucas Joseph E., Nevins Joseph R., Wang Quanli, and West Mike. High-dimensional sparse factor modeling: Applications in gene expression genomics. Journal of the American Statistical Association, 103(484):1438–1456, 2008. doi: 10.1198/016214508000000869. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Clyde Merlise and George Edward I. Flexible empirical Bayes estimation for wavelets. Journal of the Royal Statistical Society Series B, 62(4):681–698, 2000. doi: 10.1111/1467-9868.00257. [DOI] [Google Scholar]
- GTEx Consortium et al. The genotype-tissue expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235):648–660, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ding Chris HQ, Li Tao, and Jordan Michael I. Convex and semi-nonnegative matrix factorizations. IEEE transactions on pattern analysis and machine intelligence, 32(1):45–55, 2008. [DOI] [PubMed] [Google Scholar]
- Eckart C and Young G. The approximation of one matrix by another of lower rank. Psychometrika, 1:211–218, 1936. [Google Scholar]
- Engelhardt Barbara E and Stephens Matthew. Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoS Genetics, 6(9): e1001117, sep 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fithian William, Mazumder Rahul, et al. Flexible low-rank statistical modeling with missing data and side information. Statistical Science, 33(2):238–260, 2018. [Google Scholar]
- Ford J Kevin, MacCallum Robert C, and Tait Marianne. The application of exploratory factor analysis in applied psychology: A critical review and analysis. Personnel Psychology, 39(2):291–314, 1986. [Google Scholar]
- Frühwirth-Schnatter Sylvia and Lopes Hedibert Freitas. Sparse Bayesian factor analysis when the number of factors is unknown. arXiv preprint arXiv:1804.04231, 2018. [Google Scholar]
- Gao Chuan, Brown Christopher D, and Engelhardt Barbara E. A latent factor model with a mixture of sparse and dense factors to model gene expression data with confounding effects. arXiv:1310.4792v1, 2013. [Google Scholar]
- Gao Chuan, McDowell Ian C., Zhao Shiwen, Brown Christopher D., and Engelhardt Barbara E. Context specific and differential gene co-expression networks via Bayesian biclustering. PLoS Computational Biology, 12(7):1–39, July 2016. doi: 10.1371/journal.pcbi.1004791. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ghahramani Zoubin and Beal Matthew J. Variational inference for Bayesian mixtures of factor analysers. In Advances in neural information processing systems, pages 449–455, 2000. [Google Scholar]
- Girolami Mark. A variational method for learning sparse and overcomplete representations. Neural Computation, 13(11):2517–2532, 2001. [DOI] [PubMed] [Google Scholar]
- Harper F Maxwell and Konstan Joseph A. The Movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems, 5(4):19, 2016. [Google Scholar]
- Hastie Trevor, Mazumder Rahul, Lee Jason D, and Zadeh Reza. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16(1):3367–3402, 2015. [PMC free article] [PubMed] [Google Scholar]
- Hochreiter Sepp, Bodenhofer Ulrich, Heusel Martin, Mayr Andreas, Mitterecker Andreas, Kasim Adetayo, Khamiakova Tatsiana, Van Sanden Suzy, Lin Dan, Talloen Willem, Bijnens Luc, Göhlmann Hinrich W. H., Shkedy Ziv, and Clevert Djork-Arné. FABIA: factor analysis for bicluster acquisition. Bioinformatics, 26(12):1520–1527, April 2010. doi: 10.1093/bioinformatics/btq227. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hore Victoria, Viñuela Ana, Buil Alfonso, Knight Julian, McCarthy Mark I, Small Kerrin, and Marchini Jonathan. Tensor decomposition for multiple-tissue gene expression experiments. Nature Genetics, 48(9):1094–1100, 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jaakkola Tommi S. and Jordan Michael I.. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–37, 2000. [Google Scholar]
- Johnstone Iain M. and Silverman Bernard W.. Empirical Bayes selection of wavelet thresholds. The Annals of Statistics, 33(4):1700–1752, 2005a. [Google Scholar]
- Johnstone Iain M and Silverman Bernard W. EbayesThresh: R and S-PLUS programs for empirical Bayes thresholding. Journal of Statistical Software, 12:1–38, 2005b. [Google Scholar]
- Johnstone Iain M, Silverman Bernard W, et al. Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics, 32(4):1594–1649, 2004. [Google Scholar]
- Jolliffe Ian T, Trendafilov Nickolay T, and Uddin Mudassir. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12(3): 531–547, 2003. [Google Scholar]
- Josse Julie, Sardy Sylvain, and Wager Stefan. denoiseR: A package for low rank matrix estimation, 2018.
- Kaufmann Sylvia and Schumacher Christian. Identifying relevant and irrelevant variables in sparse factor models. Journal of Applied Econometrics, 32(6):1123–1144, 2017. [Google Scholar]
- Klami Arto. Polya-gamma augmentations for factor models. In Asian Conference on Machine Learning, pages 112–128, 2015. [Google Scholar]
- Knowles David and Ghahramani Zoubin. Nonparametric Bayesian sparse factor models with application to gene expression modeling. Annals of Applied Statistics, 5(2B):1534–1552, 2011. doi: 10.1214/10-AOAS435. [DOI] [Google Scholar]
- Koenker Roger and Mizera Ivan. Convex optimization, shape constraints, compound decisions, and empirical Bayes rules. Journal of the American Statistical Association, 109 (506):674–685, 2014a. [Google Scholar]
- Koenker Roger and Mizera Ivan. Convex optimization in R. Journal of Statistical Software, 60(5):1–23, 2014b. doi: 10.18637/jss.v060.i05. [DOI] [Google Scholar]
- Kolda Tamara G and Bader Brett W. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009. [Google Scholar]
- Lee DD and Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999. doi: 10.1038/44565. [DOI] [PubMed] [Google Scholar]
- Lefrançais Emma, Ortiz-Muñoz Guadalupe, Caudrillier Axelle, Mallavia Beñat, Liu Fengchun, Sayah David M, Thornton Emily E, Headley Mark B, David Tovo, Coughlin Shaun R, Krummel Matthew F, Leavitt Andrew D, Passegué Emmanuelle, and Looney Mark R. The lung is a site of platelet biogenesis and a reservoir for haematopoietic progenitors. Nature, 544(7648):105–109, 2017. doi: 10.1038/nature21706. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lim Yew Jin and Teh Yee Whye. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD cup and workshop, volume 7, pages 15–21. Citeseer, 2007. [Google Scholar]
- Mayrink Vinicius Diniz, Lucas Joseph Edward, et al. Sparse latent factor models with interactions: Analysis of gene expression data. The Annals of Applied Statistics, 7(2): 799–822, 2013. [Google Scholar]
- Mazumder Rahul, Hastie Trevor, and Tibshirani Robert. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(Aug): 2287–2322, 2010. [PMC free article] [PubMed] [Google Scholar]
- Nakajima Shinichi and Sugiyama Masashi. Theoretical analysis of Bayesian matrix factorization. Journal of Machine Learning Research, 12:2583–2648, 2011. [Google Scholar]
- Nakajima Shinichi, Sugiyama Masashi, Babacan S Derin, and Tomioka Ryota. Global analytic solution of fully-observed variational bayesian matrix factorization. Journal of Machine Learning Research, 14(Jan):1–37, 2013. [Google Scholar]
- Owen Art B and Wang Jingshu. Bi-cross-validation for factor analysis. Statistical Science, 31(1):119–139, 2016. [Google Scholar]
- Pournara Iosifina and Wernisch Lorenz. Factor analysis for gene regulatory networks and transcription factor activity profiles. BMC Bioinformatics, 8:61, 2007. doi: 10.1186/1471-2105-8-61. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Raiko Tapani, Ilin Alexander, and Karhunen Juha. Principal component analysis for large scale problems with lots of missing values. In European Conference on Machine Learning, pages 691–698. Springer, 2007. [Google Scholar]
- Ročková Veronika and George Edward I. Fast Bayesian factor analysis via automatic rotations to sparsity. Journal of the American Statistical Association, 111(516):1608–1622, 2016. [Google Scholar]
- Rubin Donald B. Inference and missing data. Biometrika, 63(3):581–592, 1976. [Google Scholar]
- Rubin Donald B and Thayer Dorothy T. EM algorithms for ML factor analysis. Psychometrika, 47(1):69–76, 1982. [Google Scholar]
- Sabatti Chiara and James Gareth M. Bayesian sparse hidden components analysis for transcription regulation networks. Bioinformatics, 22(6):739–746, 2005. [DOI] [PubMed] [Google Scholar]
- Seeger Matthias and Bouchard Guillaume. Fast variational Bayesian inference for non-conjugate matrix factorization models. In Artificial Intelligence and Statistics, pages 1012–1018, 2012. [Google Scholar]
- Srivastava Sanvesh, Engelhardt Barbara E, and Dunson David B. Expandable factor analysis. Biometrika, 104(3):649–663, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stegle Oliver, Parts Leopold, Durbin Richard, and Winn John. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eqtl studies. PLoS Computational Biology, 6(5):e1000770, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stegle Oliver, Parts Leopold, Piipari Matias, Winn John, and Durbin Richard. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nature Protocols, 7(3):500–507, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stein-O’Brien Genevieve L, Arora Raman, Culhane Aedin C, Favorov Alexander V, Garmire Lana X, Greene Casey S, Goff Loyal A, Li Yifeng, Ngom Aloune, Ochs Michael F, et al. Enter the matrix: factorization uncovers knowledge from omics. Trends in Genetics, 34 (10):790–805, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stephens Matthew. False discovery rates: a new deal. Biostatistics, 18(2):275–294, Apr 2017. doi: 10.1093/biostatistics/kxw041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Thomas DC, Siemiatycki J, Dewar R, Robins J, Goldberg M, and Armstrong BG. The problem of multiple inference in studies designed to generate hypotheses. American Journal of Epidemiology, 122(6):1080–95, Dec 1985. [DOI] [PubMed] [Google Scholar]
- Tipping Michael E. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1(Jun):211–244, 2001. [Google Scholar]
- Titsias Michalis K. and Lázaro-Gredilla Miguel. Spike and slab variational inference for multi-task and multiple kernel learning. In Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, and Weinberger KQ, editors, Advances in Neural Information Processing Systems 24, pages 2339–2347. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4305-spike-and-slab-variational-inference-for-multi-task-and-multiple-kernel-learning.pdf. [Google Scholar]
- Wang Wei. Applications of Adaptive Shrinkage in Multiple Statistical Problems. PhD thesis, The University of Chicago, 2017. [Google Scholar]
- West Mike. Bayesian factor regression models in the “large p, small n” paradigm. Bayesian Statistics 7 - Proceedings of the Seventh Valencia International Meeting, pages 723–732, 2003. [Google Scholar]
- Wipf David P. and Nagarajan Srikantan S.. A new view of automatic relevance determination. In Platt JC, Koller D, Singer Y, and Roweis ST, editors, Advances in Neural Information Processing Systems 20, pages 1625–1632. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3372-a-new-view-of-automatic-relevance-determination.pdf. [Google Scholar]
- Witten Daniela M., Tibshirani Robert, and Hastie Trevor. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009. doi: 10.1093/biostatistics/kxp008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Xing Zhengrong, Carbonetto Peter, and Stephens Matthew. Smoothing via adaptive shrinkage (smash): denoising Poisson and heteroskedastic Gaussian signals. arXiv preprint arXiv:1605.07787, 2016. [Google Scholar]
- Yang Dan, Ma Zongming, and Buja Andreas. A sparse singular value decomposition method for high-dimensional data. Journal of Computational and Graphical Statistics, 23(4):923–942, 2014. [Google Scholar]
- Zhao Shiwen, Engelhardt Barbara E, Mukherjee Sayan, and Dunson David B. Fast moment estimation for generalized latent Dirichlet models. Journal of the American Statistical Association, pages 1–13, 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zou Hui, Hastie Trevor, and Tibshirani Robert. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, 2006. doi: 10.1198/106186006X113430. [DOI] [Google Scholar]