Summary
Factorization models express a statistical object of interest in terms of a collection of simpler objects. For example, a matrix or tensor can be expressed as a sum of rank-one components. In practice, however, it can be challenging to infer the relative impact of the different components, as well as the number of components. A popular idea is to include infinitely many components, with impact decreasing in the component index. This article is motivated by two limitations of existing methods: (1) lack of careful consideration of the within-component sparsity structure; and (2) no accommodation for grouped variables and other non-exchangeable structures. We propose a general class of infinite factorization models that addresses these limitations. Theoretical support is provided, practical gains are shown in simulation studies, and an ecology application focusing on modelling bird species occurrence is discussed.
Keywords: Adaptive Gibbs sampling, Bird species, Ecology, Factor analysis, High-dimensional data, Increasing shrinkage, Structured shrinkage
1. Introduction
Factorization models are used routinely to express matrices, tensors or other statistical objects in terms of simple components. The likelihood for data y under a general class of factorization models can be expressed as L(y; Λ, Ψ, Σ), with Λ = (Λh, h = 1, …, k) a p × k matrix, Λh = (λ1h, …, λph)T the hth column vector of Λ, Ψ and Σ additional parameters, and k a positive integer. This class includes Gaussian linear factor models (Roweis & Ghahramani, 1999), exponential family factor models (Jun & Tao, 2013), Gaussian copula factor models (Murray et al., 2013), latent factor linear mixed models (An et al., 2013), probabilistic matrix factorization (Mnih & Salakhutdinov, 2008), underlying Gaussian factor models for mixed scale data (Reich & Bandyopadhyay, 2010), and functional data factor models (Montagna et al., 2012). A fundamental problem is how to choose weights for the components and the number of components k. This article proposes a general class of Bayesian methods to address this problem.
Although there is a rich literature, selection of k is far from a solved problem. In unsupervised settings, it is common to fit the model for different choices of k and then choose the value with the best goodness-of-fit criterion. For likelihood-based models, the Bayesian information criterion is particularly popular. It is also common to use an informal elbow rule, selecting the smallest k such that the criterion improves only by a small amount in moving to k + 1. In specific contexts, formal model selection methods have been developed. For example, taking a Bayesian approach, one can choose a prior for k and attempt to approximate the posterior distribution of k using Markov chain Monte Carlo; see Lopes & West (2004) for linear factor models, Miller & Harrison (2018) for mixture models and Yang et al. (2018) for matrix factorization. Although such methods are conceptually appealing, computation can be prohibitive outside of specialized settings.
Due to these challenges, it has become popular to rely on over-fitted factorization models, which include more than enough components, with shrinkage priors adaptively removing unnecessary ones by shrinking their coefficients close to zero. Such approaches were proposed by Rousseau & Mengersen (2011) for mixture models and by Bhattacharya & Dunson (2011) for Gaussian linear factor models. The latter approach specifically assumes an increasing shrinkage prior on the columns of the factor loadings matrix Λ. Legramanti et al. (2020) recently modified this approach using a spike and slab structure (Mitchell & Beauchamp, 1988) that places increasing mass on the spike for later columns.
Although over-fitted factorizations are widely used, there are two key gaps in the literature. The first is the lack of a careful development of the shrinkage properties of increasing shrinkage priors (Durante, 2017). Outside of the factorization context, and mostly motivated by high-dimensional regression, there is a rich literature recommending specific desirable properties for shrinkage priors. These include high concentration at zero, to favour shrinkage of small coefficients, and heavy tails, to avoid over-shrinking large coefficients. Motivated by this thinking, popular shrinkage priors have been developed, including the Dirichlet–Laplace (Bhattacharya et al., 2015) and the horseshoe (Carvalho et al., 2010). Current increasing shrinkage priors, such as those of Bhattacharya & Dunson (2011), were not designed to have the desirable shrinkage properties of these priors. For this reason, ad hoc truncation combined with the horseshoe or Dirichlet–Laplace can outperform increasing shrinkage priors in some contexts; for example, this was the case in Ferrari & Dunson (2020).
A second gap in the literature on over-fitted factorization priors is the lack of structured shrinkage. The focus has been on priors for Λ that are exchangeable within columns, with the level of shrinkage increasing with the column index. However, it is common in practice to have meta covariates encoding features of the rows of Λ. For example, the rows may correspond to different genes in genomic applications or to species in ecology. There is a rich literature on incorporating gene ontology in statistical analyses of genomic data; refer, for example, to Thomas et al. (2009). In ecology it is common to include species traits in species distribution models (Ovaskainen & Abrego, 2020). Beyond the Bayesian literature, it is common to include structured penalties, the group lasso (Yuan & Lin, 2006) being a notable example.
Motivated by these deficiencies of current factorization priors, this article proposes a broad class of generalized infinite factorization priors, along with corresponding theory and algorithms for routine Bayesian implementation.
2. Generalized infinite factor models
2.1. Model specification
Suppose that an n × p data matrix y is available. In our motivating application, yij is a binary indicator of occurrence of bird species j (j = 1, …, p) in sample i (i = 1, …, n). Consider the following general class of models,
$$y_{ij} = t_j(z_{ij}), \qquad z_i = \Lambda\, \eta_i + \epsilon_i \qquad (i = 1, \ldots, n;\; j = 1, \ldots, p), \tag{1}$$
with Λ a p × k loadings matrix, ηi a k-dimensional factor vector with diagonal covariance matrix Ψ = diag(ψ11, …, ψkk), ϵi a p-dimensional error term independent of ηi, and tj a transformation linking the latent zij to the observed yij, for j = 1, …, p. We refer to this class as generalized factorization models. Class (1) includes most of the cases mentioned in Section 1. When ϵi and ηi are Gaussian random vectors and tj is the identity function, model (1) is a Gaussian linear factor model. With similar assumptions for ϵi and ηi, and assuming tj(zij) = Fj^{−1}{Φ(zij)}, with Φ the standard Gaussian cumulative distribution function and Fj the marginal distribution function of the jth variable, model (1) is a Gaussian copula factor model (Murray et al., 2013). Exponential family factor models (Jun & Tao, 2013), probabilistic matrix factorization (Mnih & Salakhutdinov, 2008) and underlying Gaussian models for mixed scale data (Reich & Bandyopadhyay, 2010) can be obtained by appropriately defining the elements in (1), whereas multivariate response regression models belong to this framework when ηi is known.
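As a concrete illustration, the following R sketch simulates from the Gaussian linear factor special case of (1), in which tj is the identity; it also checks that the implied covariance of zi equals ΛΨΛT + Σ, the matrix Ω discussed below. All dimensions and parameter values are illustrative assumptions, not settings used elsewhere in the paper.

```r
# Sketch: simulate n draws from model (1) in the Gaussian linear factor case.
set.seed(1)
n <- 5000; p <- 6; k <- 3
Lambda <- matrix(rnorm(p * k), p, k)      # p x k loadings matrix
Psi    <- diag(k)                         # diagonal factor covariance (pre-specified)
Sigma  <- diag(runif(p, 0.5, 1.5))        # diagonal error covariance
eta <- matrix(rnorm(n * k), n, k) %*% chol(Psi)   # factors eta_i
eps <- matrix(rnorm(n * p), n, p) %*% chol(Sigma) # errors eps_i
z <- eta %*% t(Lambda) + eps              # rows are z_i = Lambda eta_i + eps_i
y <- z                                    # t_j = identity; use y <- (z > 0) for binary data
Omega <- Lambda %*% Psi %*% t(Lambda) + Sigma
max(abs(cov(z) - Omega))                  # small for large n
```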
The matrix Ω = var(zi) can be expressed as Ω = ΛΨΛT + Σ, where Σ = var(ϵi). Following common practice in Bayesian factor analysis (Bhattacharya & Dunson, 2011), we avoid imposing identifiability constraints on Λ and assume Ψ is pre-specified. Our focus is on a new class of generalized infinite factor models induced through a novel class of priors for Λ that allows infinitely many factors, k = ∞. In particular, we let
$$\lambda_{jh} \mid \theta_{jh} \sim N(0, \theta_{jh}), \qquad \theta_{jh} = \tau_0\, \gamma_h\, \phi_{jh}, \qquad \tau_0 \sim f_{\tau_0}, \quad \gamma_h \sim f_{\gamma_h}, \quad \phi_{jh} \sim f_{\phi_j} \qquad (j = 1, \ldots, p;\; h = 1, \ldots, \infty), \tag{2}$$
where fτ0, fγh, and fϕj are supported on [0, ∞) with positive probability mass on (0, ∞). The local scales ϕjh, the column-specific scales γh, and the global scale τ0 are all independent a priori. We let N(0, 0) denote a degenerate distribution with all its mass at zero. Expression (2) induces a class of Gaussian scale-mixture shrinkage priors (Polson & Scott, 2010) for the loadings. Although we allow infinitely many columns in Λ, (2) induces a prior for Ω supported on the set of p × p positive semi-definite matrices under mild conditions reported in Proposition S1 in the Supplementary Material.
Unlike most of the existing literature on shrinkage priors, we define a non-exchangeable structure in which meta covariates x inform the sparsity structure of Λ. In our context, meta covariates provide information to distinguish the p different variables, as opposed to traditional covariates, which serve to distinguish the n subjects. Letting x denote a p × q matrix of such meta covariates, we choose fϕj not depending on the index h and such that
$$E(\phi_{jh} \mid \beta_h) = g\bigl(x_j^{T} \beta_h\bigr), \tag{3}$$
where g is a known smooth one-to-one differentiable link function, xj = (xj1, …, xjq)T denotes the jth row vector of x, and βh are coefficients controlling the impact of the meta covariates on the shrinkage of the factor loadings in the hth column of Λ.
To illustrate the usefulness of (3), consider the previously introduced ecological study and suppose the meta covariates encode group membership, xjm = 𝟙(κj = m) (m = 1, …, q), where κj ∈ {1, …, q} denotes the phylogenetic order of species j. Species of the same order may tend to have similarities that can be expressed in terms of a shared pattern of high or low loadings on the same latent factors. To illustrate this situation, we simulate a loadings matrix, displayed in Fig. 1, sampling from the prior introduced in Section 3, where pr(λjh = 0) > pr(ϕjh = 0) > 0; a code sketch mimicking this construction is given after Fig. 1. The loadings within each column are penalized based on the group structure identified by the q = 3 phylogenetic orders (Passeriformes, Charadriiformes, and Piciformes) of the p = 10 bird species considered. Our proposed prior allows for the possibility of such structure while not imposing it. In the bird ecology application, x can be defined to include not just the phylogenetic placement of each bird species but also species traits, such as size or diet (Tikhonov et al., 2020). Related meta covariates are widely available, both in other ecology applications (Miller et al., 2019) and in other fields such as genomics (Thomas et al., 2009).
Fig. 1:
Illustrative loadings matrix of an ecology application, where the rows refer to ten bird species belonging to three phylogenetic orders. White cells represent the elements of Λ equal to zero, while blue and red cells represent negative and positive values, respectively.
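To make the structure displayed in Fig. 1 concrete, the following sketch generates a small loadings matrix in the spirit of the prior introduced in Section 3, with the inclusion probability of each λjh driven through a logistic link by a group-valued meta covariate; the group sizes, coefficient scale and column decay below are illustrative assumptions.

```r
# Sketch: structured sparse loadings, with zeros clustering within the three groups.
set.seed(2)
p <- 10; k <- 4; q <- 3
order_j <- rep(1:3, c(4, 3, 3))            # phylogenetic order of each species
x <- model.matrix(~ factor(order_j) - 1)   # p x q meta covariate matrix (dummies)
beta  <- matrix(rnorm(q * k, 0, 2), q, k)  # coefficients beta_h, one column per factor
prob  <- plogis(x %*% beta)                # inclusion probabilities g(x_j' beta_h)
phi   <- matrix(rbinom(p * k, 1, prob), p, k)  # local scales with an atom at zero
gamma <- 0.8^(1:k)                         # column scales decreasing in h
Lambda <- phi * matrix(rnorm(p * k, 0, sqrt(rep(gamma, each = p))), p, k)
round(Lambda, 2)                           # rows in the same order share zero patterns
```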
2.2. Properties
In this section we present some properties motivating the shrinkage process in (2) and provide insight into prior elicitation. It is important to relate the choice of hyperparameters to the signal-to-noise ratio, expressed as the proportion of variance explained by the factors. Section S2.4 of the Supplementary Material provides a study of the posterior distribution of the proportion of variance explained; the posterior tends to be robust to the hyperparameter choice. Below we study key properties of our prior, including an increasing shrinkage property, the ability of the induced marginal prior to accommodate both sparse and large signals, and control of the multiplicity problem in sparse settings. Proofs are included in the Appendix and in Section S1 of the Supplementary Material. This theory illuminates the role of the hyperparameters; specific recommendations for hyperparameter choice in practice are given under the model settings of Section 3.1.
To formalize the increasing shrinkage property, we introduce the following definition.
Definition 1. Letting ΠΛ denote a shrinkage prior on Λ, we say that ΠΛ is a weakly increasing shrinkage prior if var(λj(h−1)) > var(λjh) for j = 1, …, p and h = 2, …, ∞, and that ΠΛ is a strongly increasing shrinkage prior if var(λs(h−1)) > var(λjh) for j, s ∈ {1, …, p} and h = 2, …, ∞.
Weakly increasing shrinkage corresponds to the prior variance decreasing across columns within each row of Λ, while strongly increasing shrinkage implies that the prior variance of any loading is larger than that of every element with a higher column index. In the following theorem, we show that the process in (2) induces weakly increasing shrinkage under a simple sufficient condition.
Theorem 1. Expression (2) is a weakly increasing shrinkage prior under Definition 1 if E(γh) > E(γh+1) for any h.
Increasing shrinkage priors favour a decreasing contribution of higher indexed columns of Λ to the covariance Ω. In addition to inducing a flexible shrinkage structure that allows different factors to have different sparsity structures in their loadings, this allows one to accurately approximate the likelihood L(y; Λ, Ψ, Σ) by L(y; ΛH, ΨH, Σ), with ΛH containing the first H columns of the infinite matrix Λ and ΨH the first H rows and columns of Ψ. To measure the truncation error induced by approximating Ω with ΩH = ΛHΨHΛHT + Σ, we use the trace of Ω. The trace is justified by the fact that the maximum error occurring in an element of Ω due to truncation always lies along the diagonal, and by the relation between differences of traces and the nuclear norm, routinely used to approximate low rank minimization problems (Liu & Vandenberghe, 2010). The following proposition provides conditions on prior (2) under which the under-estimation of Ω that occurs by truncating decreases exponentially fast as H increases.
Proposition 1. Let E(τ0) and E(ϕjh) be finite for j = 1, …, p and h = 1, …, ∞, and let E(γh) = a b^{h−1} with a > 0 and b ∈ (0, 1) for all h = 1, …, ∞. Let c > 0 be a sufficiently large number such that ψhh a E(τ0) E(ϕjh) ≤ c for j = 1, …, p and h = 1, …, ∞. If tr(Σ) > 0, then for any T ∈ (0, 1),

$$\mathrm{pr}\{\mathrm{tr}(\Omega_H)/\mathrm{tr}(\Omega) \leq T\} \;\leq\; \frac{p\, c\, b^{H}}{(1 - T)(1 - b)\, \mathrm{tr}(\Sigma)}.$$
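A small Monte Carlo experiment illustrates the exponential decay of the truncation error when E(γh) = a b^{h−1}; the gamma prior for γh and the unit local and global scales used below are assumptions made only for this check, not part of the proposed process.

```r
# Sketch: prior draws of tr(Omega_H)/tr(Omega) under geometrically decaying column scales.
set.seed(3)
p <- 20; k_max <- 50; a <- 1; b <- 0.5; H <- 5; n_sim <- 500
ratio <- replicate(n_sim, {
  gamma_h <- rgamma(k_max, shape = 1, rate = b^(1 - (1:k_max)) / a)  # E(gamma_h) = a b^(h-1)
  Lambda  <- matrix(rnorm(p * k_max, 0, sqrt(rep(gamma_h, each = p))), p, k_max)
  tr_Omega  <- sum(Lambda^2) + p           # tr(Lambda Lambda' + I_p), taking Sigma = I_p
  tr_OmegaH <- sum(Lambda[, 1:H]^2) + p    # trace after truncation at H columns
  tr_OmegaH / tr_Omega
})
mean(ratio)                                # close to one; the deficit shrinks geometrically in H
```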
The above increasing shrinkage properties can be satisfied by naive priors that over-shrink the elements of Λ. It is important to avoid such over-shrinkage and allow not only many elements that are ≈ 0 but also a small proportion of large coefficients. A similar motivation applies in the literature on shrinkage priors in regression (Carvalho et al., 2010). Borrowing from that literature, the marginal prior for λjh should be concentrated at zero to reduce mean square error by shrinking small coefficients to zero but with heavy tails to avoid over-shrinking the signal.
To quantify the prior concentration of (2) in an ϵ neighbourhood of zero, we can obtain
$$\mathrm{pr}(|\lambda_{jh}| < \epsilon) \;\geq\; 1 - \frac{E(\tau_0)\, E(\gamma_h)\, E(\phi_{jh})}{\epsilon^{2}}, \tag{4}$$
as a consequence of Markov's inequality. Common practice in local-global shrinkage priors is to choose E(τ0) small while assigning a heavy-tailed density to the local or column scales. In our case, (3) allows the bound in (4) to be regulated by the meta covariates x, while, under the condition in Theorem 1, decreasing E(γh) with the column index causes an increasing concentration near zero, since E(ϕjh) = E(ϕjl) for every h, l ∈ {1, …, ∞}. The means of the column and local scales control prior concentration near zero, while over-shrinkage can be ameliorated by choosing fϕj or fγh (h = 1, …, ∞) heavy tailed. The following proposition provides a condition on the prior that guarantees a heavy-tailed marginal distribution for λjh. A random variable has power law tails if its cumulative distribution function F satisfies 1 − F(t) ≥ c t^{−α} for constants c > 0, α > 0, and for any t > L with L sufficiently large.
Proposition 2. If at least one scale parameter among τ0, γh or ϕjh is characterized by a power law tail prior distribution, then the prior marginal distribution of λjh has power law tails.
An important consequence of the heavy tailed property is avoidance of over-shrinkage of large signals. This is often formalized via a tail robustness property (Carvalho et al., 2010). As an initial result, key to showing sufficient conditions for a type of local tail robustness, we provide the following Lemma on the derivative of the log prior in the limit as the value of λjh → ∞.
Lemma 1. If at least one scale parameter among τ0, γh or ϕjh has a prior with power law tails, for any possible prior distribution of βh, then for any finite truncation level H,

$$\lim_{\lambda_{jh} \to \infty} \frac{d}{d\lambda_{jh}} \log \pi\bigl(\lambda_{jh} \mid \Lambda_H^{(-jh)}\bigr) = 0,$$

where π(λjh ∣ ΛH(−jh)) is the conditional prior density of λjh given the other elements of ΛH.
The following definition introduces a type of local tail robustness property.
Definition 2. Consider model (1) with factors η known. Let π(λjh ∣ y, ΛH(−jh)) denote the posterior density of λjh, given the data, conditional on any possible value of the other elements of ΛH for any finite H, and let λ̂jh denote the conditional maximum likelihood estimate of λjh for any possible value of the other elements of ΛH. We say that the prior on λjh is tail robust if

$$\lim_{\hat{\lambda}_{jh} \to \infty} \bigl(\tilde{\lambda}_{jh} - \hat{\lambda}_{jh}\bigr) = 0,$$

where λ̃jh denotes the mode of π(λjh ∣ y, ΛH(−jh)).
For a given sample, λ̂jh is a fixed quantity; the above limit should be interpreted as describing what happens as the data support a larger and larger maximum likelihood estimate. In order for tail robustness to hold, we need the data to be sufficiently informative about the parameter λjh and the likelihood to be sufficiently regular; this is formalized as follows.
Assumption 1. Let L(y; Λ, η, Σ) denote the likelihood for data y conditionally on latent variables η, let ls(λ) denote the derivative of the log-likelihood with respect to λjh, and let îjh denote the negative of the second derivative of the log-likelihood with respect to λjh, evaluated at the conditional maximum likelihood estimate λ̂jh. Then ls(λ) is a continuous function for every λ ∈ ℝ and îjh > 0, where îjh is of order O(1) as λ̂jh → ∞.
This assumption can be verified for most of the cases mentioned in Section 1; for example, for Gaussian linear factor models îjh is of order O(1) with respect to λ̂jh.
Theorem 2. Under Assumption 1, if at least one scale parameter among τ0, γh or ϕjh is power law tail distributed for any possible prior distribution of βh, then the prior on λjh is tail robust under Definition 2.
As an additional desirable property, we would like to control for the multiplicity problem within each column λh of the loadings matrix, corresponding to increasing numbers of false signals as dimension p increases. This can be accomplished by imposing an asymptotically increasingly sparse property on the prior, which is defined as follows.
Definition 3. Let suppϵ(λh) = {j ∈ {1, …, p} : |λjh| > ϵ} and let |suppϵ(λh)| denote its cardinality. Let sp = o(p) be such that sp ≥ cs log(p) for some constant cs > 0. We say that the prior on Λ defined in (2) is an asymptotically increasingly sparse prior if, for any ϵ > 0 and h = 1, …, ∞,

$$\lim_{p \to \infty} \mathrm{pr}\{\, |\mathrm{supp}_{\epsilon}(\lambda_h)| > s_p \,\} = 0.$$
The quantity |suppϵ(λh)| represents an approximate measure of model size for continuous shrinkage priors and, conditionally on βh, γh, and τ0, it is a priori distributed as a sum of independent Bernoulli random variables Ber(ζϵjh), where ζϵjh = pr(|λjh| > ϵ ∣ βh, γh, τ0).
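The following sketch illustrates this Bernoulli-sum representation, comparing the realized size of the ϵ-support when E(ϕjh) is of order log(p)/p, the regime required by Theorem 3 below; the constants and the Bernoulli choice for ϕjh are illustrative assumptions.

```r
# Sketch: the eps-support of a column grows like log(p) when E(phi_jh) = O(log(p)/p).
set.seed(4)
eps <- 0.1; gamma_h <- 1; tau0 <- 1
for (p in c(16, 64, 256, 1024)) {
  nu  <- 2 * log(p) / p                    # target order for E(phi_jh)
  phi <- rbinom(p, 1, nu)                  # local scales with mass at zero
  lam <- rnorm(p, 0, sqrt(tau0 * gamma_h * phi))
  cat(p, sum(abs(lam) > eps), "\n")        # support size stays of order log(p), not p
}
```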
We now provide sufficient conditions for an asymptotically increasingly sparse prior, allowing regulation of the sparsity behaviour of the prior of the columns of Λ for increasing dimension p.
Theorem 3. Consider prior (2) with ϕjh (j = 1, …, p) a priori independent given βh. If E(ϕjh ∣ βh) ≤ νj(p) uniformly over βh, with νj(p) = O{log(p)/p} (j = 1, …, p), then the prior on Λ is asymptotically increasingly sparse under Definition 3.
The condition of the theorem is easily satisfied, for example, when g is the product of a bounded function and a suitable offset depending on p, as assumed in Section 3.1. The multiplicative gamma process (Bhattacharya & Dunson, 2011) and the cumulative shrinkage process (Legramanti et al., 2020) do not satisfy the sufficient conditions of Theorem 3 and, furthermore, the following lemma holds.
Lemma 2. The multiplicative gamma process prior (Bhattacharya & Dunson, 2011) and the cumulative shrinkage process prior (Legramanti et al., 2020) are not asymptotically increasingly sparse under Definition 3.
Although this Section has focused on properties of the prior, we find empirically that these properties tend to carry over to the posterior, as will be illustrated in the subsequent sections. For example, the posterior exhibits asymptotic increasing sparsity; refer to Table 2 of Section 4, which shows results for a novel process in our proposed class that is much more effective than current approaches in identifying the true sparsity structure, particularly when p is large.
Table 2:
Median and interquartile range of the mean classification error computed in 25 replications assuming Scenario b and several combinations of (p, k, s)
| (p, k, s) | MGP Q0.5 | MGP IQR | CUSP Q0.5 | CUSP IQR | SIS Q0.5 | SIS IQR |
|---|---|---|---|---|---|---|
| (16, 4, 0.6) | 1.06 | 0.16 | 0.38 | 0.01 | 0.24 | 0.09 |
| (32, 8, 0.4) | 0.70 | 0.07 | 0.48 | 0.08 | 0.16 | 0.09 |
| (64, 12, 0.3) | 0.61 | 0.07 | 0.58 | 0.01 | 0.09 | 0.06 |
| (128, 16, 0.2) | 0.49 | 0.03 | 0.52 | 0.08 | 0.04 | 0.01 |
MCE, mean classification error; MGP, multiplicative gamma process; CUSP, cumulative shrinkage process; SIS, structured increasing shrinkage process; Q0.5, median; IQR, interquartile range.
3. Structured increasing shrinkage process
3.1. Model specification
In this section we propose a structured increasing shrinkage process prior for generalized infinite factor models satisfying all the sufficient conditions in Propositions 1–2 and Theorems 2–3. Let Ga(a, b) denote the gamma distribution with mean a/b and variance a/b². Following the notation of Section 2.1, we specify
$$\lambda_{jh} \mid \theta_{jh} \sim N(0, \theta_{jh}), \qquad \theta_{jh} = \tau_0\, \gamma_h\, \phi_{jh}, \qquad \tau_0 = 1,$$
$$\phi_{jh} \mid \beta_h \sim \mathrm{Ber}\{g(x_j^{T} \beta_h)\}, \qquad \beta_h \sim N_q(0, \sigma_\beta^{2} I_q), \tag{5}$$
$$\gamma_h = \theta_h\, \rho_h, \qquad \theta_h^{-1} \sim \mathrm{Ga}(a_\theta, b_\theta), \qquad \rho_h \sim \mathrm{Ber}(1 - \pi_h),$$
where we assume the link g(x) = logit^{−1}(x) cp, with logit^{−1}(x) = e^x/(1 + e^x) and cp ∈ (0, 1) a possible offset. The parameter πh = pr(γh = 0) follows a stick-breaking construction,

$$\pi_h = \sum_{l=1}^{h} w_l, \qquad w_l = v_l \prod_{m=1}^{l-1} (1 - v_m), \qquad v_l \sim \mathrm{Be}(1, \alpha),$$
with Be(a, b) the beta distribution with mean a/(a + b), such that πh < πh+1 is guaranteed for any h = 1, …, ∞ and limh→∞ πh = 1 almost surely. The prior expected number of non-degenerate columns of Λ is α (Legramanti et al., 2020), suggesting setting α equal to the expected number of active factors. The prior specification is completed assuming Σ = diag(σ1², …, σp²) with σj^{−2} ~ Ga(aσ, bσ) for j = 1, …, p, consistently with the literature. The hyperparameters aσ, bσ can be chosen based on one's prior expectation of the signal-to-noise ratio, as σj²/ωjj is the contribution of the noise component to the total variance of the jth variable. A sensitivity study in Section S2.4 of the Supplementary Material, however, shows that posterior distributions tend to be robust to the specification of aσ, bσ. Regarding prior elicitation, we recommend setting bθ ≥ aθ to induce a high enough proportion of variance explained by the factor model. In Section S2.4 of the Supplementary Material we report empirical evidence of the effect of different prior parameters on this quantity.
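A minimal sketch of a draw from the hierarchy in (5), truncated at H columns for display, is given below; it reflects our reading of the specification above, and all hyperparameter values are illustrative.

```r
# Sketch: one prior draw of a truncated loadings matrix under the SIS process.
set.seed(5)
p <- 30; q <- 2; H <- 15
x <- cbind(1, rnorm(p))                    # p x q meta covariate matrix
alpha <- 5; a_theta <- 2; b_theta <- 2; sigma_beta <- 1
c_p <- min(1, 2 * exp(1) * log(p) / p)     # offset of the link, kept inside (0, 1)
v  <- rbeta(H, 1, alpha)                   # stick-breaking variables
pi_h <- 1 - cumprod(1 - v)                 # pi_h = pr(gamma_h = 0), increasing in h
rho  <- rbinom(H, 1, 1 - pi_h)             # indicators of non-degenerate columns
theta_h <- 1 / rgamma(H, a_theta, b_theta) # theta_h^{-1} ~ Ga(a_theta, b_theta)
gamma_h <- rho * theta_h                   # column scales
beta <- matrix(rnorm(q * H, 0, sigma_beta), q, H)
phi  <- matrix(rbinom(p * H, 1, plogis(x %*% beta) * c_p), p, H)
Lambda <- matrix(rnorm(p * H), p, H) * sqrt(phi * rep(gamma_h, each = p))
sum(colSums(Lambda != 0) > 0)              # active columns; about alpha a priori
```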
The above specification respects (2) and, consequently, the following corollary holds.
Corollary 1. The structured increasing shrinkage process defined in (5):

(i) is a strongly increasing shrinkage prior under Definition 1;

(ii) satisfies the conditions of Proposition 1 and, for any T ∈ (0, 1),

$$\mathrm{pr}\{\mathrm{tr}(\Omega_H)/\mathrm{tr}(\Omega) \leq T\} \;\leq\; \frac{p\, c\, b^{H}}{(1 - T)(1 - b)\, \mathrm{tr}(\Sigma)},$$

with b = α(1 + α)^{−1} and a = α(1 + α)^{−1} bθ/(aθ − 1).
We conducted a simulation study on the posterior distribution of tr(ΩH)/tr(Ω) for varying hyperparameters, and found that the results, reported in Section S2.4 of the Supplementary Material, are quite consistent with our prior truncation error bounds.
The prior concentration of the structured increasing shrinkage process in (5) follows from (4):

$$\mathrm{pr}(|\lambda_{jh}| < \epsilon) \;\geq\; 1 - \Bigl(\frac{\alpha}{1+\alpha}\Bigr)^{\!h} \frac{b_\theta}{a_\theta - 1}\, \frac{E\{g(x_j^{T} \beta_h)\}}{\epsilon^{2}}.$$
In addition, the inverse gamma prior on θh implies a power law tail distribution for γh, inducing robustness properties on λjh, as formalized by the following corollary of Proposition 2 and Theorem 2.
Corollary 2. Under the structured increasing shrinkage process defined in (5):

(i) the marginal prior distribution of λjh (j = 1, …, p; h = 1, 2, …) has power law tails;

(ii) under Assumption 1, the prior on λjh (j = 1, …, p; h = 1, 2, …) is tail robust under Definition 2.
Finally, it is important to assess the joint sparsity properties of the prior on each column of Λ. This is formalized in the following corollary of Theorem 3.
Corollary 3. If cp = O{log(p)/p}, the structured increasing shrinkage process defined in (5) is asymptotically increasingly sparse under Definition 3.
3.2. Posterior computations
Posterior inference is conducted via Markov chain Monte Carlo sampling. Following common practice in infinite factor models (Bhattacharya & Dunson, 2011; Legramanti et al., 2020; Schiavon & Canale, 2020), we use an adaptive Gibbs algorithm, which attempts to infer the best truncation level H while drawing from the posterior distribution of the parameters. The value of H is adapted only at some Gibbs iterations, by discarding redundant factors and, if no redundant factor is identified, by adding a new factor whose parameters are sampled from the prior distribution. Convergence of the Markov chain is guaranteed by satisfying the diminishing adaptation condition in Theorem 5 of Roberts & Rosenthal (2007): the probability that iteration t is an adaptive iteration is set to p(t) = exp(α0 + α1 t), where α0 and α1 are negative constants, so that the frequency of adaptation decreases over time.
The decomposition of γh into the two parameters ρh and θh allows one to identify the inactive columns of Λ, corresponding to the redundant factors, as those with ρh = 0, while Ha denotes the number of active columns of Λ. Consequently, at the adaptive iteration t + 1, the truncation level is set to H^{(t+1)} = Ha^{(t)} + 1 if Ha^{(t)} < H^{(t)}, and to H^{(t+1)} = H^{(t)} + 1 otherwise. Given H^{(t+1)}, the number of factors of the truncated model at iteration t + 1, the sampler draws the model parameters from the corresponding posterior full conditional distributions. The detailed steps of the adaptive Gibbs sampler for the structured increasing shrinkage prior in the case of Gaussian or binary data are reported in the Supplementary Material, together with trace plots of the posterior samples for some parameters of the model of Section 5 (see Section S3.2), which show good mixing.
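The adaptive move can be summarized by the simplified function below, where active flags the columns with ρh = 1; the function name, arguments and default constants are ours, and in the actual sampler this step is interleaved with the full conditional updates, with the corresponding columns of Λ added or removed.

```r
# Sketch: one adaptation of the truncation level H at Gibbs iteration t.
adapt_truncation <- function(t, H, active, alpha0 = -1, alpha1 = -5e-4) {
  if (runif(1) < exp(alpha0 + alpha1 * t)) {  # diminishing adaptation probability
    H_a <- sum(active)                        # number of active columns
    if (H_a < H) {
      H <- H_a + 1                            # drop redundant columns, keep one buffer
    } else {
      H <- H + 1                              # no redundancy: add a column from the prior
    }
  }
  H
}
```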
3.3. Identifiability and posterior summaries
Non-identifiability of the latent structure creates problems in interpretation of the results from Markov chain Monte Carlo samples. Indeed, both Λ and η are only identifiable up to an arbitrary rotation P with PPT = Ik. This is a well known problem in Bayesian factor models and there is a rich literature proposing post-processing algorithms that align posterior samples Λ(t), so that one can then obtain interpretable posterior summaries. Refer to McParland et al. (2014), Aßmann et al. (2016), and Roy et al. (2019) for alternative post-processing algorithms in related contexts.
Unfortunately, such post-hoc alignment algorithms destroy the structure we have carefully imposed on the loadings in terms of sparsity and dependence on meta covariates. Therefore, we propose a different solution to obtain a point estimate of Λ, based on finding a representative Monte Carlo draw Λ(t), consistently with the proposals of Dahl (2006) and Wade et al. (2018) in the context of Bayesian model-based clustering. Specifically, we summarize Λ and β = (β1, β2, …) through Λ(t*) and β(t*) sampled at the iteration t* characterized by the highest marginal posterior density function f(Λ, β, Σ | y), obtained by integrating out the scale parameters τ0, γh, ϕjh (j = 1, …, p; h = 1, …, ∞) and the latent factors ηi (i = 1, …, n) from the posterior density function. Formally, we select the iteration t* ∈ {1, …, T} such that

$$t^{*} = \operatorname*{arg\,max}_{t \in \{1, \ldots, T\}} f\bigl(\Lambda^{(t)}, \beta^{(t)}, \Sigma^{(t)} \mid y\bigr),$$
where t = 1, …, T indexes the posterior samples. Under the structured increasing shrinkage prior described in Section 3.1, these computations are straightforward. The matrices Λ(t*), β(t*), Σ(t*) are Monte Carlo approximations of the maximum a posteriori estimator, which corresponds to the Bayes estimator under the L∞ loss. Although one can argue that the L∞ loss is philosophically not ideal for continuous parameter problems, it is nonetheless an appealing pragmatic choice in our context and is broadly used in other sparse estimation contexts, as in the algorithm proposed by Ročková & George (2016), which similarly aims to recover a strongly sparse posterior mode of an over-parameterized factor model.
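In code, the selection of t* reduces to an argmax over stored evaluations of the log marginal posterior density; the sketch below assumes the draws are kept in lists and that a vector log_post with one entry per iteration has already been computed.

```r
# Sketch: pick the representative draw with highest log marginal posterior density.
select_draw <- function(Lambda_draws, beta_draws, log_post) {
  t_star <- which.max(log_post)            # t* = argmax_t log f(Lambda, beta, Sigma | y)
  list(Lambda = Lambda_draws[[t_star]],
       beta   = beta_draws[[t_star]],
       t_star = t_star)
}
```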
4. Simulation experiments
We assess the performance of our structured increasing shrinkage prior compared with current approaches (Bhattacharya & Dunson, 2011; Ročková & George, 2016; Legramanti et al., 2020) through a simulation study. We have a particular interest in inferring sparse and interpretable loadings matrices Λ, but also assess performance in estimating the induced covariance matrix Ω and the number of factors. We generate synthetic data from four scenarios based on different loadings structures. For each scenario we simulate R = 25 data sets, each with n = 250 observations yi (i = 1, …, n) drawn from a Gaussian factor model with true loadings matrix Λ0. In Scenario a, we assume a non-sparse Λ0, sampling the loadings λjh from a zero-mean Gaussian distribution and ordering them to obtain decreasing variance over the columns. To ensure that each element λjh represents a signal, we shift the loadings away from zero by a fixed positive constant. In Scenario b we remove the decreasing behaviour and introduce a random sparsity pattern characterized by an increasing number of zero entries over the column index. The loadings matrix for Scenario c is characterized by both the decreasing behaviour over the columns of Scenario a and the random sparsity structure of Scenario b. Finally, in Scenario d, while the decreasing behaviour is kept, we induce a sparsity pattern dependent on a categorical and two continuous meta covariates x0. Details are reported in Section S2.2 of the Supplementary Material.
For each scenario we consider four combinations of dimension and sparsity level of Λ0. We let (p, k, s) ∈ {(16, 4, 0.6), (32, 8, 0.4), (64, 12, 0.3), (128, 16, 0.2)}, where s is the proportion of non-zero entries of Λ0, with the exception of Scenario a where s = 1. In these settings the algorithm takes from 0.07 to 0.73 seconds of computational time per iteration, depending on the dimension p, for an R implementation on an Intel Core i5-6200U CPU laptop with 15.8 GB of RAM. To estimate the structured increasing shrinkage model, we set x equal to the p-variate column vector of 1s, σβ = 1 and, consistently with Corollary 3, cp = 2e log(p)/p. In Scenario d we also estimate and compare a correctly specified structured increasing shrinkage model with x = x0. For the method proposed by Ročková & George (2016), we set the hyperparameters as suggested by the authors, while for the remaining approaches, we follow the hyperparameter specification and factor selection guidelines in Section 4 of Schiavon & Canale (2020).
Scenario a is a worst case for the proposed method since there is no sparsity, no structure, and the elements of the loadings matrix are similar in magnitude. However, even in this case, structured increasing shrinkage performs essentially identically to the best competitor, as illustrated by the results in Table 1, where we report the median and interquartile range over the R replicates of the logarithm of the pseudo-marginal likelihood (Gelfand & Dey, 1994) and of the estimated posterior mean of the number of factors E(Ha | y). The results of Ročková & George (2016) are not reported as they are not competitive, as can be seen in Table S2 in the Supplementary Material.
Table 1:
Median and interquartile range of LPML and E(Ha | y) in 25 replications of Scenario a for different combinations of (p, k); Scenario a is a worst case for the proposed SIS method.
| Metric | (p, k) | MGP Q0.5 | MGP IQR | CUSP Q0.5 | CUSP IQR | SIS Q0.5 | SIS IQR |
|---|---|---|---|---|---|---|---|
| LPML | (16, 4) | −28.68 | 0.42 | −28.68 | 0.43 | −28.65 | 0.41 |
| LPML | (32, 8) | −60.08 | 0.45 | −60.09 | 0.45 | −60.07 | 0.49 |
| LPML | (64, 12) | −117.68 | 0.56 | −117.75 | 0.53 | −117.88 | 0.56 |
| LPML | (128, 16) | −225.04 | 1.04 | −225.13 | 1.04 | −228.76 | 1.47 |
| E(Ha ∣ y) | (16, 4) | 8.17 | 1.44 | 4.00 | 0.00 | 4.00 | 0.00 |
| E(Ha ∣ y) | (32, 8) | 10.68 | 0.33 | 8.00 | 0.00 | 8.00 | 0.00 |
| E(Ha ∣ y) | (64, 12) | 14.16 | 1.09 | 12.00 | 0.00 | 12.00 | 0.00 |
| E(Ha ∣ y) | (128, 16) | 17.03 | 0.47 | 16.00 | 0.00 | 18.00 | 0.02 |
LPML, logarithm of the pseudo-marginal likelihood; CUSP, cumulative shrinkage process; MGP, multiplicative gamma process; SIS, structured increasing shrinkage process; Q0.5, median; IQR, interquartile range.
Scenario b judges performance in detecting sparsity. The proposed approach shows better performance in the logarithm of the pseudo-marginal likelihood and in the mean squared error of the covariance matrix, particularly as sparsity increases, as displayed in Fig. 2. Consistently with Legramanti et al. (2020), the covariance mean squared error is estimated in each simulation by p^{−2} Σj Σl (ω̂jl − ωjl0)², where ωjl0 and ω̂jl are the elements jl of Ω0 and of the estimated posterior mean of Ω, respectively. The proposed approach allows exact zeros in the loadings, while the competitors require thresholding to infer sparsity. Following the thresholding approach described in Section S2.2 of the Supplementary Material, we evaluate performance in inferring the sparsity pattern via the mean classification error

$$\mathrm{MCE} = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{p\, k} \sum_{j=1}^{p} \sum_{h=1}^{k^{*(t)}} \bigl|\, \mathbb{1}(\lambda_{jh0} \neq 0) - \mathbb{1}(\lambda_{jh}^{(t)} \neq 0) \,\bigr|,$$
where k*(t) is the maximum between the true number of factors k and the number of columns of Λ(t), and λjh0 and λjh(t) are the elements jh of Λ0 and Λ(t), respectively. If the number of columns of Λ(t) or k is smaller than k*(t), we fix the higher indexed columns at zero, possibly leading to a mean classification error bigger than one. The results reported in Table 2 show that the proposed structured increasing shrinkage prior is much more effective in identifying sparsity in Λ, maintaining good performance even with large p and in strongly sparse contexts. Also, more accurate estimation of the number of factors is obtained, as reported in Table S1 in the Supplementary Material.
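For a single posterior draw, the classification error in our reconstruction above can be computed as in the following sketch, which zero-pads both loadings matrices to k* columns before comparing their zero patterns; averaging over draws gives the mean classification error.

```r
# Sketch: classification error between true and drawn zero patterns of the loadings.
ce_draw <- function(Lambda0, Lambda_t) {
  k <- ncol(Lambda0); p <- nrow(Lambda0)
  k_star <- max(k, ncol(Lambda_t))         # common number of columns
  pad <- function(A) cbind(A != 0, matrix(FALSE, p, k_star - ncol(A)))
  sum(pad(Lambda0) != pad(Lambda_t)) / (p * k)  # can exceed one if extra columns are wrong
}
```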
Fig. 2:
Boxplots of mean squared error of the covariance matrix of each model for different combinations of (p, k, s) in Scenario b. Cov. MSE, covariance mean squared error; CUSP, cumulative shrinkage process; MGP, multiplicative gamma process; SIS, structured increasing shrinkage process.
Similar comments apply in Scenarios c and d, reported in Fig. S2 in the Supplementary Material. The advantage of the structured increasing shrinkage model is only partially reduced in Scenario c for large p, in terms of the logarithm of the pseudo-marginal likelihood. In Scenario d, the use of meta covariates has a mild benefit in identifying the sparsity pattern. In lower signal-to-noise settings, meta covariates have a bigger impact, and they also aid interpretation, as illustrated in the next section. Additional details, tables, and plots for all scenarios are reported in Section S2.3 of the Supplementary Material.
5. Finnish bird co-occurrence application
We illustrate our approach by modelling co-occurrence of the fifty most common bird species in Finland (Lindström et al., 2015), focusing on data from 2014. The response y is an n × p binary matrix denoting occurrence of p = 50 species in n = 137 sampling areas. An n × c environmental covariate matrix w is available, including a five-level habitat type, spring temperature (mean temperature in April and May), and (spring temperature)², leading to c = 7. We consider a p × q meta covariate matrix x of species traits: the logarithm of typical body mass, migratory strategy (short-distance migrant, resident species, long-distance migrant), and a seven-level superfamily index. We model species presence or absence via a multivariate probit regression model,
$$y_{ij} = \mathbb{1}(z_{ij} > 0), \qquad z_{ij} = w_i^{T} \mu_j + \lambda_j^{T} \eta_i + \epsilon_{ij}, \qquad \epsilon_i \sim N_p(0, I_p), \tag{6}$$
where μj is a c-dimensional vector of coefficients characterizing the impact of the environmental covariates on the occurrence probability of species j, λj is the jth row vector of Λ, and the covariance in the latent vector zi is characterized through the factor model. To borrow information across species while incorporating species traits, we let
$$\mu_j \sim N_c\bigl(b\, x_j,\; \sigma_\mu^{2} I_c\bigr) \qquad (j = 1, \ldots, p), \tag{7}$$
where b is a c × q coefficient matrix whose column vectors bm (m = 1, …, q) are given independent Gaussian priors, bm ~ Nc(0, σb² Ic).
Model (6)–(7) is consistent with popular joint species distribution models (Ovaskainen et al., 2016; Tikhonov et al., 2017; Ovaskainen & Abrego, 2020), with current standard practice using a multiplicative gamma process for Λ. We compare this approach to an analysis that instead uses our proposed structured increasing shrinkage prior to allow the species traits x to impact Λ and hence the covariance structure across species. After standardizing w and x, we set α = 4, aθ = bθ = 2 and σμ = σb = 1. Posterior sampling is straightforward via a Gibbs sampler reported in Section S3.1 of the Supplementary Material.
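As an indication of why posterior sampling is straightforward, the sketch below shows the latent-variable update implied by the probit likelihood in (6): given the current linear predictor, each zij is drawn from a unit-variance Gaussian truncated to the positive half-line when yij = 1 and to the negative half-line otherwise, via the inverse cumulative distribution function. This is our stand-in for one full conditional step of the sampler in Section S3.1, not the authors' code.

```r
# Sketch: truncated normal draw of the latent z given binary y and its current mean.
update_z <- function(y, mean_z) {          # y, mean_z: n x p matrices
  u  <- matrix(runif(length(y)), nrow(y))
  lo <- matrix(pnorm(0, mean_z, 1), nrow(y))  # pr(z_ij < 0) at the current mean
  qnorm(ifelse(y == 1, lo + u * (1 - lo), u * lo), mean_z, 1)
}
```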
Figure S8 in the Supplementary Material displays the posterior means of μ and b. A first investigation shows large heterogeneity of the habitat type effects across different species. The matrix b shows that covariate effects tend not to depend on migratory strategy or body mass, with the exception of urban habitats, which tend to host more migratory birds.
The estimates of Λ and of the meta covariate coefficients β, obtained following the guidelines of Section 3.3, are displayed in Fig. 3. The loadings matrix is quite sparse, indicating that each latent factor impacts a small group of species. A positive sign of a loading means that high levels of the corresponding factor increase the probability of observing birds from that species. Lower elements of β(t*), represented with light cells in the right panel, induce higher shrinkage on the corresponding group of birds. To facilitate interpretation, we re-arrange the rows of Λ(t*) according to the most relevant species traits in terms of shrinkage, which are migratory strategy and body mass. The species influenced by the first factor are fairly homogeneous, characterized by short-distance or resident migratory strategies and a larger body mass. The strongly negative value of β(t*)42 suggests that heavier species of birds tend to have loadings close to zero on the second factor. This is also true for the third factor, which, in addition, does not impact short-distance migrants.
Fig. 3:
Posterior summaries Λ(t*) and β(t*) of the structured increasing shrinkage model; the rows of the left matrix refer to the 50 bird species, and the rows of the right matrix to the ten species traits. Light coloured cells of β(t*) induce shrinkage on the corresponding cells of Λ(t*).
Figure S9 in the Supplementary Material shows a spatial map of the sampling units coloured according to the values of the first and third latent factors. We can interpret these latent factors as unobserved environmental covariates. We find that the species traits included in our analysis only partially explain the loadings structure; this is as expected and provides motivation for the proposed approach. Sparsity in the loadings matrix helps interpretation: species may load on the same factor not just because they have similar traits, but also because they tend to favour similar habitats for reasons not captured by the measured traits.
The induced covariance matrix Ω = ΛΛT + Ip across species is of particular interest. We compare estimates of Ω under the multiplicative gamma process, estimated using the R package Hmsc (Tikhonov et al., 2020), and under our proposed structured increasing shrinkage model. Figure 4 reports the posterior mean of the correlation matrices under the two competing models. The network graph based on the posterior mean of the partial correlation matrices, reported in Fig. 5, reveals several communities of species under the proposed structured increasing shrinkage prior that are not evident under the multiplicative gamma process.
Fig. 4:
Posterior mean of the correlation matrices estimated by the structured increasing shrinkage model (on the left) and the multiplicative gamma process model (on the right).
Fig. 5:
Graphical representation based on the inverse of the posterior mean of the correlation matrices estimated by the structured increasing shrinkage model (on the left) and the multiplicative gamma process model (on the right). Edge thicknesses are proportional to the latent partial correlations between species; values below 0.025 are not reported. Nodes are positioned using a Fruchterman–Reingold force-directed algorithm.
We also find that the multiplicative gamma process provides a slightly worse fit to the data. The logarithm of the pseudo-marginal likelihood computed on the posterior samples of the structured increasing shrinkage model is −21.06, higher than the −21.36 achieved by the competing model. Using 4-fold cross-validation, we also compared the log-likelihood evaluated on the held-out data, with μ and Ω estimated by their posterior means in the training set. The mean of the log-likelihood was −22.62 under the structured increasing shrinkage prior and −23.22 under the multiplicative gamma process prior.
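For reference, the logarithm of the pseudo-marginal likelihood used in these comparisons can be computed from observation-level log-likelihood evaluations across posterior draws, each conditional predictive ordinate being a harmonic mean in the spirit of Gelfand & Dey (1994); the sketch below assumes a precomputed T × n matrix loglik and is our illustration rather than the exact computation used in the paper.

```r
# Sketch: log pseudo-marginal likelihood as the sum of log conditional predictive ordinates.
lpml <- function(loglik) {                 # loglik[t, i] = log-likelihood of obs i at draw t
  log_cpo <- -apply(-loglik, 2, function(l) log(mean(exp(l - max(l)))) + max(l))
  sum(log_cpo)                             # CPO_i = harmonic mean of likelihoods over draws
}
```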
Acknowledgement
The authors thank Daniele Durante, Sirio Legramanti, Otso Ovaskainen, and Gleb Tikhonov for useful comments on an early version of this manuscript. This project has received funding from the University of Padova under the STARS Grants programme, the United States National Institutes of Health under grant R01ES027498, and the European Research Council under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 856506).
Appendix. Lemmas and proofs
Proof of Theorem 1. The variance of λjh is

$$\mathrm{var}(\lambda_{jh}) = E\{\mathrm{var}(\lambda_{jh} \mid \theta_{jh})\} = E(\theta_{jh}) = E(\tau_0)\, E(\gamma_h)\, E(\phi_{jh}).$$

By construction, E(ϕjh) = E(ϕjl) for every h, l ∈ {1, …, ∞}. Then,

$$\mathrm{var}(\lambda_{j(h-1)}) - \mathrm{var}(\lambda_{jh}) = E(\tau_0)\, E(\phi_{jh})\, \{E(\gamma_{h-1}) - E(\gamma_h)\} > 0,$$

since the scale parameters are independent and the local scale ϕjh is equally distributed over the column index h. □
To prove Proposition 2 we need to introduce the following Lemma.
Lemma 3. Let u, v denote two real positive random variables. If at least one among (u | v) and (v | u) is power law tail distributed, then the product uv is power law tail distributed.
Proof. For a positive value w, we can write

$$\mathrm{pr}(uv > w) = \int_{0}^{\infty} \mathrm{pr}(u > w/v \mid v)\, f(v)\, dv,$$

where f(v) is the probability density function of v. If pr(u > t ∣ v) ≥ c t^{−α}, with c, α positive constants and t greater than a sufficiently large number L, then, for w > L,

$$\mathrm{pr}(uv > w) \;\geq\; \int_{0}^{w/L} c\, (w/v)^{-\alpha} f(v)\, dv \;=\; c\, w^{-\alpha} \int_{0}^{w/L} v^{\alpha} f(v)\, dv.$$

If E(v^α) = ∞, then pr(uv > w) > c w^{−α} = O(w^{−α}); otherwise pr(uv > w) ≥ ν(w) for w > L, with ν(w) a function of order O(w^{−α}) as w goes to infinity. This shows that the right tail of the distribution of the random variable uv follows a power law behaviour. □
Proof of Proposition 2. Consider the strictly positive random variables (τ0 ∣ τ0 > 0), (γh ∣ γh > 0), (ϕjh ∣ ϕjh > 0), and (θjh ∣ θjh > 0). Since (θjh ∣ θjh > 0) is equal to the product of independent positive random variables, Lemma 3 ensures that if at least one of those scale parameters follows a power law tail distribution, then (θjh ∣ θjh > 0) is power law tail distributed, so that pr(θjh > θ ∣ θjh > 0) ≥ c θ^{−α} for c, α positive constants and θ > L. Without loss of generality, we focus on the right tail of λjh. Let

$$\mathrm{pr}(\lambda_{jh} > \lambda) = \mathrm{pr}(\lambda_{jh} > \lambda \mid \theta_{jh} > 0)\, \mathrm{pr}(\theta_{jh} > 0) + \mathrm{pr}(\lambda_{jh} > \lambda \mid \theta_{jh} = 0)\, \mathrm{pr}(\theta_{jh} = 0). \tag{A1}$$

It is straightforward to observe that λjh marginally has a power law tail if and only if (λjh ∣ θjh > 0) is power law tail distributed and pr(θjh > 0) is strictly positive. Since pr(τ0 > 0) > 0, pr(γh > 0) > 0, and pr(ϕjh > 0) > 0, then pr(θjh > 0) > 0, given independence between the scale parameters. Focusing on the first term on the right-hand side of (A1), we have (λjh ∣ θjh, θjh > 0) ~ N(0, θjh)
and we want to prove that the marginal tail pr(λjh > λ ∣ θjh > 0) is sub-exponential as λ → ∞. Using the lower bound for the right tail of the standard Gaussian distribution of Abramowitz & Stegun (1948), 1 − Φ(x) ≥ {x/(1 + x²)} φ(x) for x > 0, the conditional tail pr(λjh > λ ∣ θjh) = 1 − Φ(λ θjh^{−1/2}) is strictly positive for every θjh > 0. Marginalizing over (θjh ∣ θjh > 0), we obtain

$$\mathrm{pr}(\lambda_{jh} > \lambda \mid \theta_{jh} > 0) = E\{G(\theta_{jh}) \mid \theta_{jh} > 0\}, \qquad G(\theta) = 1 - \Phi(\lambda\, \theta^{-1/2}),$$

where G is a monotonically increasing nonnegative function defined on the positive real line. Applying Markov's inequality, we have E{G(θjh) ∣ θjh > 0} ≥ G(ϵ) pr(θjh > ϵ ∣ θjh > 0), and letting ϵ = λ²,

$$\mathrm{pr}(\lambda_{jh} > \lambda \mid \theta_{jh} > 0) \;\geq\; \{1 - \Phi(1)\}\, \mathrm{pr}(\theta_{jh} > \lambda^{2} \mid \theta_{jh} > 0).$$

If pr(θjh > λ² ∣ θjh > 0) ≥ c λ^{−2α} for certain α, c positive constants and λ sufficiently large, then

$$\mathrm{pr}(\lambda_{jh} > \lambda \mid \theta_{jh} > 0) \;\geq\; \tilde{c}\, \lambda^{-\tilde{\alpha}},$$

where c̃ = {1 − Φ(1)} c and α̃ = 2α. By symmetry, pr(λjh < −λ ∣ θjh > 0) ≥ c̃ λ^{−α̃} for λ > L sufficiently large. It is sufficient that the marginal distribution of (θjh ∣ θjh > 0) has a power law right tail to guarantee that (λjh ∣ θjh > 0) has power law tails, and then that marginally λjh has power law tails. □
Proof of Theorem 2. The mode λ̃jh of the conditional posterior density of λjh is such that

$$l_s(\tilde{\lambda}_{jh};\, y) + \frac{d}{d\lambda_{jh}} \log \pi\bigl(\tilde{\lambda}_{jh} \mid \Lambda_H^{(-jh)}\bigr) = 0, \tag{A2}$$

where ls(λ; y) is the jhth element of the score function of the likelihood for the data y conditionally on the latent variables η, and π(λjh ∣ ΛH(−jh)) is the conditional prior density function of λjh. Given prior symmetry, without loss of generality, we focus on λ̂jh > 0. In a neighbourhood of the conditional maximum likelihood estimate λ̂jh of λjh, we can approximate the score function using a Taylor expansion, ls(λ; y) = −îjh (λ − λ̂jh) + ϵε, where îjh is the negative of the derivative of ls(λ; y) evaluated at λ̂jh and ϵε is an approximation error term. For λ̂jh large enough, Lemma 1 holds for every λ in such a neighbourhood, so that the derivative of the log prior density is arbitrarily close to zero there, leading to a lower bound on the left-hand side of (A2) of the form −îjh (λ − λ̂jh) + r(λ), with r a non-positive continuous function for every λ > 0 that vanishes as λ → ∞. Let ε be a function of λ̂jh such that ε → 0 as λ̂jh → ∞. Taking the limit as λ̂jh → ∞ of the lower bound evaluated at the mode, under Assumption 1 the curvature îjh remains of order O(1), which guarantees that the solution of (A2) satisfies λ̃jh − λ̂jh → 0, and hence that the prior on λjh is tail robust under Definition 2, which proves the theorem. □
Proof of Theorem 3. Since the local scales are independent conditionally on βh, we can apply Chernoff's method and obtain the upper bound

$$\mathrm{pr}\{|\mathrm{supp}_{\epsilon}(\lambda_h)| > s_p \mid \beta_h, \gamma_h, \tau_0\} \;\leq\; \exp(-t s_p) \prod_{j=1}^{p} E\{\exp(t B_{jh})\} \;\leq\; \exp\Bigl\{-t s_p + (e^{t} - 1) \sum_{j=1}^{p} \zeta_{\epsilon jh}\Bigr\}$$

for every t > 0, with Bjh ~ Ber(ζϵjh) and Σj ζϵjh a function of βh. Since E(ϕjh ∣ βh) is of order O{log(p)/p} by assumption and is bounded above with respect to βh, we can deduce, for p sufficiently large and for some constant cj > 0 that does not depend on βh, that ζϵjh ≤ cj log(p)/p, so that Σj ζϵjh ≤ c̄ log(p), with c̄ = maxj cj asymptotically of order O(1) with respect to p. Then, for any t > 0, the upper bound becomes

$$\mathrm{pr}\{|\mathrm{supp}_{\epsilon}(\lambda_h)| > s_p \mid \beta_h, \gamma_h, \tau_0\} \;\leq\; \exp\{-t s_p + (e^{t} - 1)\, \bar{c} \log(p)\}.$$

Let us choose t > 0 such that t cs − (e^{t} − 1) c̄ ≥ 1, which is possible for cs sufficiently large. Since sp ≥ cs log(p) for a certain cs > 0, we can then write

$$\exp\{-t s_p + (e^{t} - 1)\, \bar{c} \log(p)\} \;\leq\; \exp\{-\log(p)\} = \nu(p).$$

The upper bound does not depend on βh, so the same bound holds marginally over βh, γh and τ0, with ν(p) of order O(p^{−1}), which goes to zero. □
Footnotes
Supplementary material
Supplementary material available at Biometrika online includes the statement and proof of Proposition S1 and the proofs of Proposition 1, Lemmas 1–2, and Corollaries 1–3. The Gibbs sampling algorithm, settings, and additional results of the simulations and ecology data analysis are reported, including trace plots and a sensitivity analysis to varying hyperparameters.
Contributor Information
L. Schiavon, Department of Statistical Sciences, University of Padova, Via Cesare Battisti 241, 35121 Padova, Italy
A. Canale, Department of Statistical Sciences, University of Padova, Via Cesare Battisti 241, 35121 Padova, Italy
D. B. Dunson, Department of Statistical Science, Duke University, Durham, North Carolina 27708, U.S.A.
References
- Abramowitz M & Stegun IA (1948). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, vol. 55. US Government Printing Office.
- An X, Yang Q & Bentler PM (2013). A latent factor linear mixed model for high-dimensional longitudinal data analysis. Statistics in Medicine 32, 4229–4239.
- Aßmann C, Boysen-Hogrefe J & Pape M (2016). Bayesian analysis of static and dynamic factor models: an ex-post approach towards the rotation problem. Journal of Econometrics 192, 190–206.
- Bhattacharya A & Dunson DB (2011). Sparse Bayesian infinite factor models. Biometrika 98, 291–306.
- Bhattacharya A, Pati D, Pillai NS & Dunson DB (2015). Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association 110, 1479–1490.
- Carvalho CM, Polson NG & Scott JG (2010). The horseshoe estimator for sparse signals. Biometrika 97, 465–480.
- Dahl DB (2006). Model-based clustering for expression data via a Dirichlet process mixture model. Bayesian Inference for Gene Expression and Proteomics 4, 201–218.
- Durante D (2017). A note on the multiplicative gamma process. Statistics and Probability Letters 122, 198–204.
- Ferrari F & Dunson DB (2020). Bayesian factor analysis for inference on interactions. Journal of the American Statistical Association, 1–12.
- Gelfand AE & Dey DK (1994). Bayesian model choice: asymptotics and exact calculations. Journal of the Royal Statistical Society: Series B (Methodological) 56, 501–514.
- Jun L & Tao D (2013). Exponential family factors for Bayesian factor analysis. IEEE Transactions on Neural Networks and Learning Systems 24, 964–976.
- Legramanti S, Durante D & Dunson DB (2020). Bayesian cumulative shrinkage for infinite factorizations. Biometrika 107, 745–752.
- Lindström Å, Green M, Husby M, Kålås JA & Lehikoinen A (2015). Large-scale monitoring of waders on their boreal and arctic breeding grounds in northern Europe. Ardea 103, 3–15.
- Liu Z & Vandenberghe L (2010). Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications 31, 1235–1256.
- Lopes HF & West M (2004). Bayesian model assessment in factor analysis. Statistica Sinica 14, 41–67.
- McParland D, Gormley IC, McCormick TH, Clark SJ, Kabudula CW & Collinson MA (2014). Clustering South African households based on their asset status using latent variable models. The Annals of Applied Statistics 8, 747.
- Miller JE, Li D, LaForgia M & Harrison S (2019). Functional diversity is a passenger but not driver of drought-related plant diversity losses in annual grasslands. Journal of Ecology 107, 2033–2039.
- Miller JW & Harrison MT (2018). Mixture models with a prior on the number of components. Journal of the American Statistical Association 113, 340–356.
- Mitchell TJ & Beauchamp JJ (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association 83, 1023–1036.
- Mnih A & Salakhutdinov RR (2008). Probabilistic matrix factorization. In Advances in Neural Information Processing Systems.
- Montagna S, Tokdar ST, Neelon B & Dunson DB (2012). Bayesian latent factor regression for functional and longitudinal data. Biometrics 68, 1064–1073.
- Murray JS, Dunson DB, Carin L & Lucas JE (2013). Bayesian Gaussian copula factor models for mixed data. Journal of the American Statistical Association 108, 656–665.
- Ovaskainen O & Abrego N (2020). Joint Species Distribution Modelling: With Applications in R. Cambridge University Press.
- Ovaskainen O, Abrego N, Halme P & Dunson D (2016). Using latent variable models to identify large networks of species-to-species associations at different spatial scales. Methods in Ecology and Evolution 7, 549–555.
- Polson NG & Scott JG (2010). Shrink globally, act locally: Bayesian sparsity and regularization. Bayesian Statistics 9, 1–16.
- Reich BJ & Bandyopadhyay D (2010). A latent factor model for spatial data with informative missingness. The Annals of Applied Statistics 4, 439.
- Roberts GO & Rosenthal JS (2007). Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability 44, 458–475.
- Ročková V & George EI (2016). Fast Bayesian factor analysis via automatic rotations to sparsity. Journal of the American Statistical Association 111, 1608–1622.
- Rousseau J & Mengersen K (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73, 689–710.
- Roweis S & Ghahramani Z (1999). A unifying review of linear Gaussian models. Neural Computation 11, 305–345.
- Roy A, Schaich-Borg J & Dunson DB (2019). Bayesian time-aligned factor analysis of paired multivariate time series. arXiv preprint arXiv:1904.12103.
- Schiavon L & Canale A (2020). On the truncation criteria in infinite factor models. Stat 9, e298.
- Thomas DC, Conti DV, Baurley J, Nijhout F, Reed M & Ulrich CM (2009). Use of pathway information in molecular epidemiology. Human Genomics 4, 21.
- Tikhonov G, Abrego N, Dunson D & Ovaskainen O (2017). Using joint species distribution models for evaluating how species-to-species associations depend on the environmental context. Methods in Ecology and Evolution 8, 443–452.
- Tikhonov G, Opedal ØH, Abrego N, Lehikoinen A, de Jonge MM, Oksanen J & Ovaskainen O (2020). Joint species distribution modelling with the R-package Hmsc. Methods in Ecology and Evolution 11, 442–447.
- Wade S & Ghahramani Z (2018). Bayesian cluster analysis: point estimation and credible balls (with discussion). Bayesian Analysis 13, 559–626.
- Yang L, Fang J, Duan H, Li H & Zeng B (2018). Fast low-rank Bayesian matrix completion with hierarchical Gaussian prior models. IEEE Transactions on Signal Processing 66, 2804–2817.
- Yuan M & Lin Y (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68, 49–67.