Abstract
We focus on sparse modelling of high-dimensional covariance matrices using Bayesian latent factor models. We propose a multiplicative gamma process shrinkage prior on the factor loadings which allows introduction of infinitely many factors, with the loadings increasingly shrunk towards zero as the column index increases. We use our prior on a parameter-expanded loading matrix to avoid the order dependence typical in factor analysis models and develop an efficient Gibbs sampler that scales well as data dimensionality increases. The gain in efficiency is achieved by the joint conjugacy property of the proposed prior, which allows block updating of the loadings matrix. We propose an adaptive Gibbs sampler for automatically truncating the infinite loading matrix through selection of the number of important factors. Theoretical results are provided on the support of the prior and truncation approximation bounds. A fast algorithm is proposed to produce approximate Bayes estimates. Latent factor regression methods are developed for prediction and variable selection in applications with high-dimensional correlated predictors. Operating characteristics are assessed through simulation studies, and the approach is applied to predict survival times from gene expression data.
Keywords: Adaptive Gibbs sampling, Factor analysis, High-dimensional data, Multiplicative gamma process, Parameter expansion, Regularization, Shrinkage
1. Introduction
Factor models aim to explain the dependence structure among high-dimensional observations through a sparse decomposition of a p × p covariance matrix Ω as ΛΛT + Σ, where Λ is a p × k factor loadings matrix with k ≪ p and Σ is a p × p diagonal matrix with nonnegative diagonal entries. A popular approach to ensure identifiability of the loading elements is to constrain the loading matrix to be lower triangular with positive diagonal entries (Geweke & Zhou, 1996). Factor models have been traditionally applied in behavioural and social sciences, where the latent factors have a natural interpretation as certain unobserved psychological traits. A more recent approach (West, 2003; Carvalho et al., 2008) uses the above sparse characterization as a dimensionality reduction tool in large p and small n applications such as gene expression studies.
A Bayesian specification of the factor model (Arminger & Muthén, 1998; Song & Lee, 2001) commonly uses inverse gamma priors on the residual variances and normal and truncated normal priors on the off-diagonal and diagonal elements of the loadings matrix, respectively. Such choices lead to conditionally conjugate forms of the posterior distribution and enable posterior computation by a straightforward Gibbs sampler. However, it has been observed that these choices lead to poorly behaved Gibbs samplers with slow mixing when some of the outcomes are highly correlated. Posterior inference also tends to be sensitive to certain hyperparameters. To address these issues, Ghosh & Dunson (2009) use parameter expansion (Liu & Wu, 1999; Gelman, 2006) to induce a heavy-tailed default prior distribution on the loading elements and propose an efficient Gibbs sampler.
Inference on the number of factors in factor analysis models is both conceptually and computationally challenging. Some of the early papers in this direction (discussion paper by Polasek, 1997, University of Basel) involve computation of the marginal likelihoods under models with different numbers of factors. Lopes & West (2004) proposed a reversible jump Markov chain Monte Carlo algorithm to allow for uncertainty in the number of factors. Lee & Song (2002) developed a path sampling approach instead. A more recent method infers the number of factors by zeroing a subset of the loading elements using Bayesian variable selection priors (Lucas et al., 2006; Carvalho et al., 2008); see also the 2009 discussion paper from the University of Chicago Booth School of Business by Frühwirth-Schnatter and Lopes. Ando (2009) proposed an approach for calculating the exact marginal likelihood in Bayesian factor analysis with heavy-tailed priors. This method can be used for rapid estimation of the number of factors, but may be sensitive to subjectively chosen priors.
In this article we introduce a multiplicative gamma process shrinkage prior that allows introduction of infinitely many factors, with the loadings increasingly shrunk towards zero as the column index increases. The key to our approach lies in the fact that for purpose of prediction or inference on the covariance matrix, identifiability of the loadings is not necessary. In standard factor models, the identifiability constraints induce undesirable properties, such as a priori order dependence in the off-diagonal entries of the covariance matrix. Our proposed prior is placed on a parameter expanded factor loadings matrix, making the induced prior on the covariance matrix invariant to ordering of the data. The shrinkage prior allows us to adaptively select a truncation of the infinite loadings to one having finite columns, which facilitates the posterior computation while providing an accurate approximation to the infinite factor model.
2. Bayesian factor models
2.1. Model and prior specification
The generic form of a latent factor model is
yi = Ληi + ∊i (i = 1, …, n),  (1)
where yi is a p-dimensional continuous response, Λ is a p × k factor loadings matrix, ηi ∼ Nk(0, Ik) are latent factors and ∊i is an idiosyncratic error with covariance Σ = diag(σ12, …, σp2). We follow standard practice in normalizing the data prior to analysis and hence do not include an intercept term in (1). Each observation yi is assumed to have independent components given the factors, and dependence among the components is induced by marginalizing over the distribution of the factors, so marginally yi ∼ Np(0, Ω) with Ω = ΛΛT + Σ. In practical applications involving moderate to large p, the number of factors is typically much smaller than p, inducing a sparse characterization of the unknown covariance matrix Ω.
The above decomposition of Ω is not unique and there are actually infinitely many possibilities, since Λ1 = ΛP also satisfies the above condition for any semi-orthogonal matrix P (P PT = I). The usual lower triangular constraint for identifiability (Geweke & Zhou, 1996) induces order dependence among the responses, with the choice of the first k response variables being an important modelling decision (Carvalho et al., 2008). From a Bayesian perspective, one does not require identifiability of the loading elements for a wide class of applications including covariance matrix estimation, variable selection and prediction. The above fact has been exploited in our approach to define the prior on a parameter-expanded loadings matrix with redundant parameters, resulting in better computational properties while simplifying the theory.
Letting ΘΛ denote the collection of all matrices Λ with p rows and infinitely many columns such that ΛΛT is a p × p matrix with all entries finite, we have
∑h=1∞ λjh2 < ∞ (j = 1, …, p).  (2)
Using the Cauchy–Schwarz inequality, it is straightforward to show that all the entries of ΛΛT are finite if and only if the condition in (2) is satisfied. Let ΘΣ denote the set of p × p diagonal matrices with nonnegative entries and let Θ denote all p × p positive semi-definite matrices. Consider the function g : ΘΛ × ΘΣ → Θ corresponding to g(Λ, Σ) = ΛΛT + Σ.
Lemma 1. For any (Λ, Σ) ∈ ΘΛ × ΘΣ, we have g(Λ, Σ) ∈ Θ.
All proofs can be found in the Appendix. The image of ΘΛ × ΘΣ under g is the set {Ω : Ω = g(Λ, Σ), (Λ, Σ) ∈ ΘΛ × ΘΣ}. Letting g−1(Ω) ⊂ ΘΛ × ΘΣ denote the pre-image of Ω ∈ Θ, it is straightforward to show that the set g−1(Ω) contains at least one element for any Ω ∈ Θ, so that the image of ΘΛ × ΘΣ under g is the set Θ. For example, one element corresponds to (Λ, 0p), with Λ = (Ω1/2 : 0p×∞), Ω1/2 a Cholesky decomposition of Ω and 0p denoting a p × p matrix of zeros. Thus g is a continuous surjective function. However, g is not bijective, and in general the cardinality of g−1(Ω) is ∞. Lemma 2 states a regularity property of g, which is later used to prove sup-norm support of the proposed prior.
Lemma 2. Let (Λ0, Σ0) be an arbitrary element of ΘΛ × ΘΣ. For ∊ > 0, define the following ∊-ball around (Λ0, Σ0), B∊(Λ0, Σ0) = {(Λ, Σ) ∈ ΘΛ × ΘΣ : d2(Λ, Λ0) < ∊, d∞(Σ, Σ0) < ∊}, where d2(·, ·) denotes the L2 distance metric on ΘΛ,
d2(Λ, Λ0) = {∑j=1p ∑h=1∞ (λjh − λ0jh)2}1/2
for p × ∞ matrices Λ = (λjh), Λ0 = (λ0jh), and d∞(A, B) = max1⩽r,s⩽p |ars − brs| is the sup-norm metric for p × p matrices A = (ars), B = (brs). Then the image g{B∊(Λ0, Σ0)} is contained in a ball of sup-norm radius ∊* around Ω0 = g(Λ0, Σ0), with ∊* decreasing monotonically to zero as ∊ decreases to zero.
Observe that d2 is well defined and finite on ΘΛ by (2).
We adopt a Bayesian approach and choose independent priors supported on ΘΛ × ΘΣ, which in turn induces a prior on Ω ∈ Θ through the operator g. We place the usual inverse gamma priors on the diagonal elements of Σ. To define a prior supported on ΘΛ, we allow the entries of Λ to decrease in magnitude flexibly as the column index increases. The prior is defined on a parameter-expanded loading matrix without imposing any restriction on the loading elements. The introduction of the redundant parameters simplifies the theory and the induced prior has attractive properties including large support and order-independence. We use a shrinkage-type prior with the degree of shrinkage increasing across the column index as follows,
λjh | ϕjh, τh ∼ N(0, ϕjh−1τh−1), ϕjh ∼ Ga(ν/2, ν/2), τh = ∏l=1h δl, δ1 ∼ Ga(a1, 1), δl ∼ Ga(a2, 1) (l ⩾ 2), σj−2 ∼ Ga(aσ, bσ),  (3)
where the δl (l = 1, 2, …) are independent, τh is a global shrinkage parameter for the hth column and the ϕjhs are local shrinkage parameters for the elements in the hth column. The τhs are stochastically increasing under the restriction a2 > 1, which favours more shrinkage as the column index increases. If we only use the global shrinkage parameter, the prior has a tendency to over-shrink the nonzero loadings. In gene expression examples involving large p, it is often the case that a relatively small proportion of genes are within each pathway. In such applications, we would like to shrink a subset of the elements strongly towards zero while retaining the sparse signals. We refer to the induced prior on the space of covariance matrices as a multiplicative gamma process shrinkage prior.
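To make the construction concrete, the following sketch draws a finite truncation of Λ under the multiplicative gamma process prior and forms the induced covariance matrix. It is a minimal illustration assuming the parameterization in (3) with unit rate parameters on the δl; the function name, the default hyperparameter values and the use of numpy are our own illustrative choices rather than part of the original implementation.

```python
import numpy as np

def sample_mgps_loadings(p, H, a1=2.0, a2=3.0, nu=3.0, rng=None):
    """Draw a p x H truncation of the loadings under the shrinkage prior (3);
    hyperparameter defaults are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    delta = np.concatenate([rng.gamma(a1, 1.0, size=1),
                            rng.gamma(a2, 1.0, size=H - 1)])
    tau = np.cumprod(delta)                            # column-wise global precisions
    phi = rng.gamma(nu / 2.0, 2.0 / nu, size=(p, H))   # local precisions
    Lambda = rng.normal(scale=1.0 / np.sqrt(phi * tau), size=(p, H))
    return Lambda

# induced covariance under the truncation: Omega_H = Lambda Lambda' + Sigma
rng = np.random.default_rng(0)
p, H = 50, 20
Lambda = sample_mgps_loadings(p, H, rng=rng)
sigma2 = 1.0 / rng.gamma(1.0, 1.0 / 0.3, size=p)       # sigma_j^{-2} ~ Ga(1, 0.3)
Omega_H = Lambda @ Lambda.T + np.diag(sigma2)
```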
2.2. Properties of the shrinkage prior
Let ΠΛ ⊗ ΠΣ denote the prior on (Λ, Σ) defined in (3). We first need to make sure that our prior is well defined so that draws from the above prior are elements of ΘΛ × ΘΣ almost surely.
Proposition 1. If (Λ, Σ) ∼ ΠΛ ⊗ ΠΣ, then ΠΛ ⊗ ΠΣ (ΘΛ × ΘΣ) = 1.
For computational purposes, we would like to approximate the infinite loadings matrix with a finite matrix having few columns relative to the number of outcomes p. As justification, we obtain theoretical bounds on the truncation approximation error. Let (Λ, Σ) ∼ ΠΛ ⊗ ΠΣ and Ω = ΛΛT + Σ be the induced covariance matrix. We can approximate Ω by ΩH = ΛHΛHT + Σ, where ΛH denotes the matrix obtained by setting the columns of Λ from H + 1 onwards to zero, or equivalently discarding those higher indexed columns. The following theorem states that the prior probability of ΩH being arbitrarily close to Ω in an appropriate sense converges exponentially fast to 1 as H tends to ∞.
Theorem 1. If a2 > 2, then for any ∊ > 0,
ΠΛ ⊗ ΠΣ {(Λ, Σ) : d∞(ΩH, Ω) < ∊} > 1 − 6pbaH/{(1 − a)∊}
for H > log{6pb/∊(1 − a)}/ log(1/a), where a = E(δ2−1) and b = 3E(δ1−1), with δ1 and δ2 as in (3).
The proof of the theorem assumes ν = 3, which has been used as a default choice throughout, but the same argument holds for any ν > 2. Although the condition a2 > 2 is sufficient to ensure that a < 1 under the Ga(a2, 1) prior in (3), for any Ga(a2, b2) prior on δ2 the theorem remains valid as long as a = E(δ2−1) = b2/(a2 − 1) < 1, or equivalently a2 > 1 + b2.
Letting Π denote the induced prior on Θ, Π = (ΠΛ ⊗ ΠΣ) ○ g−1 so that for any Borel subset A of Θ, Π(A) = (ΠΛ ⊗ ΠΣ){g−1(A)}. Since g is a continuous and hence measurable map, Π is a well-defined probability measure on (Θ, 𝒜), with 𝒜 the Borel σ-algebra of subsets of Θ.
Proposition 2. If Ω0 is any p × p covariance matrix and B∊∞(Ω0) = {Ω ∈ Θ : d∞(Ω, Ω0) < ∊} is an ∊-neighbourhood of Ω0 under the sup-norm, then Π{B∊∞(Ω0)} > 0 for any ∊ > 0.
Proposition 2 shows that our proposed prior has large support, so places positive probability in arbitrarily small neighbourhoods around any covariance matrix. We use Proposition 2 to show weak consistency of the posterior distribution of Ω in Theorem 2. Denote by K(Ω0, Ω) the Kullback–Leibler divergence between Np(0, Ω0) and Np(0, Ω),
K(Ω0, Ω) = ½ [log{det(Ω)/det(Ω0)} + tr(Ω−1Ω0) − p].
Theorem 2. Fix Ω0 ∈ Θ. For any ∊ > 0, there exists ∊* > 0 such that
Π{Ω : K(Ω0, Ω) < ∊} ⩾ Π{B∊*∞(Ω0)} > 0,
which implies that the posterior distribution of Ω is weakly consistent.
The weak consistency of the posterior follows from the Schwartz (1965) theorem, since any Kullback–Leibler neighbourhood of the true density has positive probability using Proposition 2.
Another attractive property of our prior is that it is free of order dependence, so that the induced prior on Ω is invariant to permutations, with Ω having the same distribution as Ωπ, where Ωπ = (wπrπs) with π any permutation of {1, …, p} and Ω = (wrs). We have wrs = λrTλs for r ≠ s and wrr = λrTλr + σr2, where λj = (λj1, λj2, …)T denotes the jth row of Λ. Conditionally on τ = (τ1, τ2, …)T, the λrhs are independent with λrh | τh ∼ t3(0, τh−1). Since the marginal prior on λr is the same for every r, wrs has the same distribution as wr′s′ for any (r, s) ≠ (r′, s′) such that r ≠ s, r′ ≠ s′. The permutation invariance then follows from the fact that wrr and wr′r′ also have the same distribution for any 1 ⩽ r, r′ ⩽ p.
Although the distribution of wrs does not have a simple form, the first two moments of wrs (r ≠ s) can be obtained as
E(wrs) = 0,  E(wrs2) = 9 ∑h=1∞ E(τh−2).
Thus E(wrs2) is finite if E(δ1−2) is finite and E(δ2−2) < 1; in that case wrs has mean zero and finite variance. One way to ensure the above conditions is to let a1 > 2 and a2 > 3. Hence the induced prior on any of the off-diagonal entries of Ω has mean zero and the parameters a1, a2 dictate the existence of higher order moments. We place gamma priors on a1 and a2 to learn these key hyperparameters from the data.
3. Posterior computation
3.1. Gibbs sampler with a fixed truncation level
We propose a straightforward Gibbs sampler for posterior computation after truncating the loadings matrix to have k* ≪ p columns. An adaptive strategy for inference on the truncation level k* is described in § 3.2. The Gibbs sampler is computationally efficient and mixes rapidly as the shrinkage prior allows block updating of the loadings. The sampler cycles through the following steps.
- Step 1. If we denote the jth row of Λk* by λj = (λj1, …, λjk*)T, then the λjs have independent conditionally conjugate posteriors,
π(λj | −) = Nk*{(Dj−1 + σj−2ηTη)−1σj−2ηTy(j), (Dj−1 + σj−2ηTη)−1} (j = 1, …, p),
where Dj−1 = diag(ϕj1τ1, …, ϕjk*τk*), η = (η1, …, ηn)T, y(j) = (y1j, …, ynj)T and, given the other parameters, π(λj | −) denotes the conditional posterior of λj; a schematic implementation of this blocked update is sketched after the list of steps.
- Step 2. Sample σj−2, j = 1, …, p, from the conditionally independent posteriors
π(σj−2 | −) = Ga{aσ + n/2, bσ + ½ ∑i=1n (yij − λjTηi)2}.
- Step 3. Sample ηi, i = 1, …, n, from the conditionally independent posteriors
π(ηi | −) = Nk*{(Ik* + Λk*TΣ−1Λk*)−1Λk*TΣ−1yi, (Ik* + Λk*TΣ−1Λk*)−1}.
- Step 4. Sample ϕjh from
π(ϕjh | −) = Ga{(ν + 1)/2, (ν + τhλjh2)/2}.
- Step 5. Sample δ1 from
π(δ1 | −) = Ga{a1 + pk*/2, 1 + ½ ∑h=1k* τh(1) ∑j=1p ϕjhλjh2},
and for h ⩾ 2, sample δh from
π(δh | −) = Ga{a2 + p(k* − h + 1)/2, 1 + ½ ∑l=hk* τl(h) ∑j=1p ϕjlλjl2},
where τl(h) = ∏t=1, t≠hl δt for h = 1, …, k*.
- Step 6. Update a1 and a2 using a Metropolis–Hastings step within the Gibbs sampler.
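As an illustration of the block updating in Step 1, the sketch below draws all rows of the truncated loadings matrix from the conjugate normal full conditionals stated above. The function and argument names are hypothetical, and the row-by-row matrix inversion is written for clarity rather than speed.

```python
import numpy as np

def update_loadings_rows(Y, eta, sigma2, phi, tau, rng):
    """Blocked update of the rows of the truncated loadings matrix (Step 1),
    assuming the conjugate normal full conditional stated above."""
    n, p = Y.shape
    k = eta.shape[1]
    Lambda = np.empty((p, k))
    ete = eta.T @ eta                                   # eta' eta, shared across rows
    for j in range(p):
        prec = np.diag(phi[j] * tau) + ete / sigma2[j]  # D_j^{-1} + sigma_j^{-2} eta' eta
        cov = np.linalg.inv(prec)
        mean = cov @ (eta.T @ Y[:, j]) / sigma2[j]      # cov * sigma_j^{-2} eta' y^{(j)}
        Lambda[j] = rng.multivariate_normal(mean, cov)
    return Lambda
```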
3.2. Choosing the number of factors adaptively
In practical situations, we expect to have relatively few important factors compared with the dimension p of the outcomes. Our proposed model with an infinite number of factors obviates the need to pre-specify the number of factors, while the sparsity-favouring prior on the loadings ensures that the effective number of factors will be small when the truth is sparse. However, we need a computational strategy for choosing an appropriate level of truncation k*. We would like to strike a balance between missing important factors by choosing k* too small and wasting computation on an overly conservative truncation level. One can think of k* as the effective number of factors, beyond which the contribution of additional factors is negligible. Starting with a conservative guess k̃ of k*, the posterior samples of Λk̃ from the Gibbs sampler of § 3.1 contain information about the effective number of factors. At the tth iteration of the Gibbs sampler, let m(t) denote the number of columns in Λk̃ having all elements in a pre-specified small neighbourhood of zero. Intuitively, m(t) of the factors have a negligible contribution at the tth iteration. Usual shrinkage priors on the loadings exhibit the phenomenon of factor splitting, in which none of the columns have all loadings close to zero even when k̃ is chosen to be greater than the true number of factors. By shrinking increasingly in later columns, we avoid this problem. We define k*(t) = k̃ − m(t) to be the effective number of factors at iteration t.
The above approach has been shown to produce accurate estimates of the true effective number of factors k* in a number of simulation examples as long as k̃ ⩾ k*. However, in order to be assured that k̃ ⩾ k*, it is typically necessary to choose a very conservative bound in large p applications, which leads to wasted computational effort. Ideally, we would like to discard the redundant factors and continue the sampler with a reduced number of columns in the loadings. With this aim, we modify our sampler described above to an adaptive Gibbs sampler, which tunes the number of factors as the sampler progresses. The adaptations are designed to satisfy the diminishing adaptation condition in Theorem 5 of Roberts & Rosenthal (2007). To be specific, we adapt with probability p(t) = exp(α0 + α1t) at the tth iteration, with α0, α1 chosen so that adaptation occurs around every 10 iterations at the beginning of the chain but decreases in frequency exponentially fast. We generate a sequence ut of uniform random numbers between 0 and 1. At the tth iteration, if ut ⩽ p(t), we monitor the columns in the loadings having all elements within some pre-specified small neighbourhood of zero. If the number of such columns drops to zero, we add a column to the loadings and otherwise discard the redundant columns. The other parameters are also modified accordingly. When we add a factor, we sample parameters from the prior distribution to fill in additional columns, and otherwise retain parameters corresponding to the nonredundant columns.
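The adaptation rule can be sketched as follows, assuming the loadings are stored as a p × k̃(t) array. This is a minimal sketch of the mechanism described above; the resizing of the remaining column-specific parameters is only indicated in comments, since the exact bookkeeping is not spelled out here.

```python
import numpy as np

def adapt_truncation(t, Lambda, phi, alpha0=-1.0, alpha1=-5e-4,
                     tol=1e-4, rng=None):
    """One adaptation step of the adaptive Gibbs sampler (a sketch).
    With probability p(t) = exp(alpha0 + alpha1 * t), columns of Lambda whose
    entries all lie within tol of zero are discarded; if there are none,
    a column drawn from the prior would be appended (omitted here)."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.uniform() >= np.exp(alpha0 + alpha1 * t):
        return Lambda, phi                        # no adaptation this iteration
    redundant = np.all(np.abs(Lambda) < tol, axis=0)
    if redundant.any() and not redundant.all():
        keep = ~redundant                         # drop redundant factors
        Lambda, phi = Lambda[:, keep], phi[:, keep]
    # else: append a new column sampled from the prior, resizing the
    # remaining column-specific parameters (delta, tau) accordingly
    return Lambda, phi
```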
The most common approach for selecting the number of factors relies on fitting the factor model for different choices of k*, and then using the bic or another criterion for selection. This approach can be difficult to implement for large p, small n problems in which maximum likelihood estimates often do not exist, and the bic is not well justified for factor models even for small to moderate p. Lopes & West (2004) compared a number of alternatives, recommending a reversible jump Markov chain Monte Carlo approach that requires a preliminary run for each choice of the number of factors, so it is very computationally intensive. Path sampling faces similar computational hurdles in scaling up to large p. Stochastic search variable selection algorithms have been applied in large p settings, but performance is questionable given the need to update elements of the loadings matrix one at a time, leading to very slow mixing and convergence rates. In the econometrics literature on approximate factor models, there has been recent work (Bai & Ng, 2002; Amengual & Watson, 2007) on consistent estimation of the number of static and dynamic factors as the number of time series and observation times both increase to ∞ at a comparable rate; see also the discussion paper by Onatski (2005), Columbia University.
A significant advantage of our adaptive method is that a single run provides posterior samples of the parameters as well as information about the number of factors, with convergence of the chain guaranteed by the theory in Roberts & Rosenthal (2007). In addition, we save computation by discarding the unimportant factors. Letting k̃(t) denote the truncation level at iteration t and k*(t) = k̃(t) − m(t) denote the effective number of factors, we use the median or mode of {k*(t)} after burn-in as an estimate of k* with credible intervals quantifying uncertainty.
After a reasonable burn-in, Ω(t) = Λ(t)Λ(t)T + Σ(t) represent draws from the approximated marginal posterior distribution of Ω given y1, …, yn, where (Λ(t), Σ(t)) denote the posterior samples at the tth iteration. The posterior samples Ω(t) can be used for inference on Ω. We also propose a fast algorithm for calculating an approximate maximum a posteriori estimate of the covariance matrix, which is useful for arriving at a quick working estimate. Our proposed approach, in the spirit of the stochastic em algorithm (Celeux et al., 1996), replaces draws from the conditional posterior distributions of Λk̃(t), Σ and ϕ in Steps 1, 2 and 4 above by the respective conditional posterior modes.
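The sketch below illustrates the mode-based replacement for Steps 1 and 2, assuming the conjugate full conditional forms given in § 3.1; it is meant as an indicative outline of the stochastic em-type variant, with names and defaults chosen for illustration.

```python
import numpy as np

def mode_updates(Y, eta, phi, tau, sigma2, a_sigma=1.0, b_sigma=0.3):
    """Replace the draws in Steps 1 and 2 by the modes of the same full
    conditionals: the normal mean for each row of Lambda and the gamma
    mode (shape - 1)/rate for each sigma_j^{-2}."""
    n, p = Y.shape
    k = eta.shape[1]
    Lambda = np.empty((p, k))
    ete = eta.T @ eta
    for j in range(p):
        prec = np.diag(phi[j] * tau) + ete / sigma2[j]
        Lambda[j] = np.linalg.solve(prec, eta.T @ Y[:, j] / sigma2[j])
    resid2 = ((Y - eta @ Lambda.T) ** 2).sum(axis=0)    # sum_i (y_ij - lambda_j' eta_i)^2
    shape, rate = a_sigma + n / 2.0, b_sigma + resid2 / 2.0
    sigma2_new = rate / (shape - 1.0)                   # reciprocal of the gamma mode
    return Lambda, sigma2_new
```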
4. Simulation example
4.1. Factor selection and covariance matrix estimation
We consider a number of simulation examples to illustrate our approach and compare with competing methods. We simulated yi, i = 1, …, 200, from a p-dimensional normal distribution with zero mean and covariance matrix Ω = ΛΛT + Σ, where Λ is a p × k matrix and Σ is a p × p diagonal matrix. The diagonal elements of Σ−1 are drawn independently from a Ga(1, 0.25) distribution with mean 4. The number of nonzero elements in each column of Λ is chosen to decrease linearly from 2k to k + 1. We randomly allocate the location of the zeros in each column and simulate the nonzero elements independently from a normal distribution with mean 0 and variance 9.
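The data-generating mechanism just described can be sketched as follows; the rounding of the per-column counts of nonzero loadings and the function name are assumptions where the text leaves details unspecified.

```python
import numpy as np

def simulate_data(p, k, n=200, rng=None):
    """Generate data as in Section 4.1: Ga(1, 0.25) residual precisions,
    per-column nonzero counts decreasing linearly from 2k to k + 1,
    and nonzero loadings drawn from N(0, 9)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma2 = 1.0 / rng.gamma(1.0, 1.0 / 0.25, size=p)   # mean precision 4
    Lambda = np.zeros((p, k))
    counts = np.linspace(2 * k, k + 1, k).round().astype(int)
    for h, c in enumerate(counts):
        rows = rng.choice(p, size=c, replace=False)     # random support in column h
        Lambda[rows, h] = rng.normal(0.0, 3.0, size=c)  # sd 3, variance 9
    Omega = Lambda @ Lambda.T + np.diag(sigma2)
    Y = rng.multivariate_normal(np.zeros(p), Omega, size=n)
    return Y, Omega, Lambda, sigma2
```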
We choose three (p, k) combinations with moderate to large p, namely (100, 5), (500, 10) and (1000, 15). For each pair we consider 50 simulation replicates. We run the adaptive Gibbs sampler for 25 000 iterations with a burn-in of 5000, and collect every 5th sample to thin the chain. We use a default choice of 5 log(p) as the starting number of factors. The hyperparameters aσ and bσ for σj−2 in (3) are 1 and 0.3 respectively, while ν is 3. We place Ga(2, 1) priors on a1 and a2. We choose α0 and α1 in the adaptation probability p(t) as −1 and −5 × 10−4 respectively. We monitor the columns in the loadings having all elements less than 10−4 in magnitude and proceed by adapting the number of factors as in § 3.2. For the stochastic em algorithm, we choose a burn-in of 100 and monitor the estimated covariance matrix every 10 iterations, stopping the chain when the sup-norm distance between the current estimate and the estimate 10 iterations earlier falls within a small tolerance.
The average of the estimated number of factors across the replicates is 6.82, 10.00 and 14.40 corresponding to k = 5, 10 and 15 with empirical 95% intervals for the number of factors (5, 8), (9, 11) and (13, 16), respectively. The estimated covariance matrix in each case is close to the true value, with small mean square error, average and maximum absolute bias. We compare the estimation of the covariance matrix to a recent method by Bickel & Levina (2008) which bands the sample covariance matrix and proposes a resampling scheme for choosing the optimal banding parameter. The stochastic em algorithm was also used to arrive at an approximate maximum a posteriori estimate of the covariance matrix. We provide the summaries of the mean square error, average absolute bias and maximum absolute bias for the three methods across the replicates in Table 1. Based on Table 1, the proposed shrinkage approach does significantly better than the Bickel & Levina (2008) method. The stochastic em algorithm also performs well, especially for smaller values of p.
Table 1.
Comparative performance in covariance matrix estimation in the simulation study. The average, best and worst case performance across 50 simulation replicates in terms of mean square error (×102), average absolute bias (×102) and maximum absolute bias (×102) are tabulated for the different methods
true (p, k) | (100, 5) | | | (500, 10) | | | (1000, 15) | |
---|---|---|---|---|---|---|---|---|---|
method | mgps | Banding | map | mgps | Banding | map | mgps | Banding | map |
mse | |||||||||
mean | 0.2 | 1.3 | 0.2 | 0.10 | 0.4 | 0.10 | 0.10 | 0.3 | 0.10 |
min | 0.1 | 0.9 | 0.1 | 0.02 | 0.4 | 0.05 | 0.02 | 0.2 | 0.05 |
max | 0.3 | 1.6 | 0.3 | 0.20 | 0.5 | 0.30 | 0.4 | 0.5 | 0.30 |
average absolute bias | |||||||||
mean | 1.9 | 3.1 | 1.0 | 0.6 | 0.6 | 0.3 | 0.4 | 0.5 | 0.3 |
min | 1.3 | 2.5 | 0.6 | 0.4 | 0.6 | 0.2 | 0.2 | 0.4 | 0.2 |
max | 2.5 | 4.9 | 1.5 | 0.9 | 0.9 | 0.5 | 0.6 | 0.5 | 0.5 |
maximum absolute bias | |||||||||
mean | 50.9 | 111.0 | 44.8 | 95.4 | 117.8 | 97.7 | 115.0 | 115 | 108.0 |
min | 38.8 | 99.8 | 24.7 | 50.2 | 105.0 | 64.4 | 52.6 | 111 | 74.7 |
max | 74.1 | 131.0 | 105.0 | 152.0 | 131.0 | 162.0 | 242.0 | 240 | 221.0 |
mgps, posterior mean using our proposed multiplicative shrinkage prior; Banding, banding of the sample covariance matrix; map, approximate maximum a posteriori estimate under our proposed prior; mse, mean square error.
4.2. Latent factor regression
It is common in many application areas to have a massive-dimensional vector of candidate predictors, with many of the predictors being moderately to highly correlated. Modifications using penalized least squares methods have been studied extensively. The lasso (Tibshirani, 1996) and the elastic net (Zou & Hastie, 2005) are two of the most popular such methods. In order to select correlated batches of predictors simultaneously, one can potentially use Bayesian latent factor regression (Lucas et al., 2006; Carvalho et al., 2008).
Let yi = (zi, xiT)T, i = 1, …, n, where the xis are (p − 1)-dimensional predictors and the zis are responses. For ease of illustration, we assume the zis to be univariate, though extensions to multivariate cases are straightforward. Also assume that the predictors and response are all continuous. We can use standard data augmentation procedures otherwise. We jointly model the yis as in (1). Our objective is to predict the response zn+1 for a future subject based on the predictors xn+1 for that subject and y1, …, yn. The posterior predictive distribution of zn+1 | xn+1, y1, …, yn is
p(zn+1 | xn+1, y1, …, yn) = ∫ p(zn+1 | xn+1, Ω) dΠ(Ω | y1, …, yn).
For the simulation examples described in § 4.1, let zi = yi1 and xi = (yi2, …, yip)T. We randomly selected two locations in the first row of Λ and assigned values 1 and −1 to those locations, with the other elements in the first row set to zero. The remaining rows of the loadings were simulated as mentioned before. We used a randomly chosen training set of size 100 and held out the zis for the remaining 100 samples. The coverage of 95% predictive intervals averaged across the replicates was 0.95, 0.94 and 0.95 for the three cases, respectively. Table 2 compares the predictive performance with the lasso and elastic net. The proposed approach performs similarly to the lasso and elastic net, but has the advantage of quantifying predictive uncertainty.
Table 2.
Predictive performance in the simulation study. Average, best and worst case performance across 50 simulation replicates are reported for the different methods
true (p, k) | (100, 5) | | | (500, 10) | | | (1000, 15) | |
---|---|---|---|---|---|---|---|---|---|
method | mgps | Lasso | Elastic net | mgps | Lasso | Elastic net | mgps | Lasso | Elastic net |
mspe | |||||||||
mean | 0.63 | 0.55 | 0.55 | 0.41 | 0.38 | 0.38 | 0.95 | 0.87 | 0.88 |
min | 0.32 | 0.33 | 0.33 | 0.18 | 0.22 | 0.22 | 0.57 | 0.55 | 0.56 |
max | 0.89 | 0.79 | 0.78 | 0.86 | 0.57 | 0.56 | 1.48 | 1.44 | 1.44 |
aape | |||||||||
mean | 0.62 | 0.59 | 0.59 | 0.51 | 0.49 | 0.49 | 0.80 | 0.77 | 0.75 |
min | 0.47 | 0.47 | 0.47 | 0.33 | 0.38 | 0.37 | 0.60 | 0.59 | 0.59 |
max | 0.85 | 0.73 | 0.72 | 0.80 | 0.58 | 0.59 | 0.99 | 0.98 | 0.99 |
mape | |||||||||
mean | 2.19 | 2.07 | 2.07 | 1.71 | 1.66 | 1.68 | 2.54 | 2.48 | 2.48 |
min | 1.36 | 1.43 | 1.40 | 1.21 | 1.17 | 1.18 | 1.83 | 1.83 | 1.80 |
max | 3.15 | 2.91 | 2.89 | 2.95 | 2.70 | 2.63 | 3.27 | 3.07 | 3.07 |
mgps, our proposed multiplicative shrinkage prior; mspe, mean squared prediction error; aape, average absolute prediction error; mape, maximum absolute prediction error.
The joint Gaussian model implies that E(zi | xi) = xiTβ, with β = Ωxx−1Ωxz, where Ωxx and Ωxz denote the sub-blocks of Ω corresponding to var(xi) and cov(xi, zi), respectively. The elements of the (p − 1)-dimensional vector β can be considered as the true regression coefficients of z on x. Letting Ω(t) denote the posterior samples of Ω, β(t) = {Ωxx(t)}−1Ωxz(t) give samples from the posterior distribution of β. Since Ωxx(t) = Λx(t)Λx(t)T + Σx(t), where Λx(t) and Σx(t) are the appropriate sub-blocks of Λ(t) and Σ(t), one can use the Sherman–Morrison–Woodbury formula (Hager, 1989) to invert Ωxx(t) at each iteration of the Gibbs sampler, which only requires the inverse of a k*(t) × k*(t) matrix, leading to a many-fold speed up in large p settings.
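For a single posterior draw, the induced regression coefficients can be computed as sketched below, assuming the response occupies the first coordinate of yi as in § 4.1; the Woodbury identity is used so that only a k*(t) × k*(t) system needs to be solved.

```python
import numpy as np

def regression_coefficients(Lambda, sigma2):
    """Compute beta = Omega_xx^{-1} Omega_xz from one draw of (Lambda, Sigma),
    a sketch assuming the response is the first row of the loadings.
    Woodbury: Omega_xx^{-1} = S^{-1} - S^{-1} L (I_k + L' S^{-1} L)^{-1} L' S^{-1}
    with S = Sigma_x diagonal and L = Lambda_x, so only a k x k matrix is inverted."""
    lam_z, L = Lambda[0], Lambda[1:]             # response row, predictor rows
    s = sigma2[1:]                               # diagonal of Sigma_x
    omega_xz = L @ lam_z                         # cross-covariance Omega_xz
    Ls = L / s[:, None]                          # S^{-1} L
    core = np.eye(L.shape[1]) + L.T @ Ls         # I_k + L' S^{-1} L
    beta = omega_xz / s - Ls @ np.linalg.solve(core, Ls.T @ omega_xz)
    return beta
```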
As shown in Table 3, the estimate of β based on our method was close to the truth in each case, with small mean square error, average and maximum absolute bias. The coverage of 95% credible intervals for the elements of β was 0.96, 0.91 and 0.90 for the three cases, respectively.
Table 3.
Performance in estimating regression coefficients in the simulation study. We report the mean square error (×103), average absolute bias (×103) and maximum absolute bias (×103) averaged across 50 simulation replicates for the different methods
true (p, k) | (100, 5) | | | (500, 10) | | | (1000, 15) | |
---|---|---|---|---|---|---|---|---|---|
method | mgps | Lasso | Elastic net | mgps | Lasso | Elastic net | mgps | Lasso | Elastic net |
mse | 1.1 | 1.2 | 1.3 | 0.1 | 0.3 | 0.4 | 0.0 | 0.1 | 0.1 |
aab | 10.1 | 12.4 | 13.0 | 1.7 | 3.9 | 4.1 | 0.9 | 1.8 | 1.9 |
mab | 176.1 | 207.3 | 211.3 | 172.5 | 253.3 | 244.5 | 102.6 | 109.0 | 122.6 |
mgps, our proposed multiplicative shrinkage prior; mse, mean squared error; aab, average absolute bias; mab, maximum absolute bias.
The simulation examples were designed to induce correlation in groups of predictors, so that batches of predictors are included in the response model. The sparsity in the loadings ensures that many of the true regression coefficients are exactly equal to zero, with only a few important predictors. We propose a simple algorithm for variable selection in our framework based on thresholding the posterior mean of β. Let |β̂|(1) < ⋯ < |β̂|(p−1) denote the ordered absolute values of the posterior means for the p − 1 predictors, and let πj = h denote that the jth predictor is the hth smallest in magnitude. Then our thresholding approach sets βj = 0 for all j with πj ⩽ h̃, with h̃ chosen to minimize the mean squared prediction error. Table 4 shows the percentage of false positives and power compared with the lasso and elastic net.
Table 4.
Variable selection performance in the simulation study. Percentage of false positives and power in detecting the true signal reported across 50 simulation replicates (average, best and worst case) for the different methods
true (p, k) | (100, 5) | | | (500, 10) | | | (1000, 15) | |
---|---|---|---|---|---|---|---|---|---|
method | mgps | Lasso | Elastic net | mgps | Lasso | Elastic net | mgps | Lasso | Elastic net |
false positives (%) | |||||||||
mean | 0 | 9 | 7 | 0 | 4.0 | 3 | 0 | 3.0 | 2.0
min | 0 | 0 | 0 | 0 | 0.2 | 0 | 0 | 0.7 | 0.7
max | 0 | 26 | 25 | 0 | 14.0 | 14 | 0 | 8.0 | 10.0
power (%) | |||||||||
mean | 72 | 76 | 77 | 75 | 76 | 77 | 71 | 72 | 72 |
min | 68 | 72 | 74 | 73 | 75 | 76 | 70 | 71 | 71 |
max | 81 | 80 | 83 | 80 | 79 | 79 | 73 | 73 | 72 |
mgps, our proposed multiplicative shrinkage prior.
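The thresholding rule described above can be sketched as follows; the choice of data on which the mean squared prediction error is evaluated, and the cumulative zeroing implementation, are our own assumptions.

```python
import numpy as np

def threshold_beta(beta_hat, X, z):
    """Variable selection by thresholding the posterior mean of beta (a sketch):
    coefficients are zeroed in order of increasing magnitude and the threshold
    minimizing the mean squared prediction error on (X, z) is retained."""
    order = np.argsort(np.abs(beta_hat))         # smallest magnitude first
    best_beta = beta_hat
    best_mspe = np.mean((z - X @ beta_hat) ** 2)
    beta = beta_hat.copy()
    for j in order:
        beta = beta.copy()
        beta[j] = 0.0                            # zero out the next smallest coefficient
        mspe = np.mean((z - X @ beta) ** 2)
        if mspe < best_mspe:
            best_beta, best_mspe = beta, mspe
    return best_beta
```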
The three simulation examples took 2, 14 and 33 seconds per hundred iterations, respectively, in Matlab on an Intel Core 2 Duo machine. The analyses were repeated with different choices of hyperparameter values. We used ν = 3.5, 4, 5 and varied bσ between 0.1 and 0.5. We also used different multiples of log(p) between 3 and 10 for the initial number of factors. The results were robust, with the conclusions unchanged. We observed good mixing for the Gibbs sampler using both exploratory and diagnostic tests. The effective sample size averaged across the elements of β was 55%, 53% and 48% for the three cases, respectively, suggesting excellent computational efficiency.
The true loadings were not simulated from our proposed prior in any of the simulation examples. Although our prior on the loadings can concentrate in arbitrarily small neighbourhoods around zero, it does not allow any of the loading elements to be exactly zero. In the simulation study, many of the true loading elements were set equal to zero, and instead of shrinking the nonzero loadings with the column index, they were all drawn from the same N(0, 9) distribution. To assess robustness when the model is not applicable, we ran simulations with correlated factors and/or correlated idiosyncratic errors, with the errors drawn from an AR(1) process. The results were robust even in these cases; in particular, we always obtained predictive performance similar to that of the elastic net. The adaptive method for factor selection proved to be extremely robust with respect to the choice of the threshold. Although we used 10−4 as a default threshold, the conclusions were mostly unchanged even with a threshold as small as 10−9. Also, one can use either the median or the mode of the samples k*(t) as an estimate of the number of factors, as they gave the same answer on all occasions. The simulation study clearly highlights the merit of our method in a variety of applications, with much improved performance over competitors in terms of covariance matrix estimation, regression coefficient estimation and variable selection.
5. Diffuse large-B-cell lymphoma application
5.1. Background
Lymphoma is a cancer that occurs when lymphocytes, a type of white blood cell, grow abnormally. Diffuse large B-cell lymphoma is the most common lymphoma among adults and has a high mortality rate. Rosenwald et al. (2002) analysed biopsy samples from 240 patients with untreated diffuse large B-cell lymphoma and identified 17 genes predictive of survival after chemotherapy. Segal (2006) reanalysed the data using penalized methods. The patients in the study were followed up after collection of biopsy specimens, with a median follow-up of 2.8 years. For each patient, a potentially right-censored survival time is available along with 7399 features representing 4128 genes from the Lymphochip cDNA microarray. Rosenwald et al. (2002) divided the patients into a training set of 160 patients and a validation set of 80 patients to gauge predictive performance.
Rosenwald et al. (2002) used hierarchical clustering to identify four signature groups whose expressions were correlated with the survival times. They also identified a subset of 17 genes predictive of overall survival after chemotherapy. Gui & Li (2005), Segal (2006) and Ma & Huang (2007) analysed this data using penalized methods. In each case, the selected features mostly belonged to one of the four signature groups in Rosenwald et al. (2002), though the individual selected features varied across the methods.
5.2. Model and results
Our interest lies in simultaneously identifying an important subset of the features and obtaining a predictive model for the exact survival times. Let Ti denote the survival time for the ith patient and let xi denote the corresponding 7399-dimensional feature vector. There were 72 patients in the training set whose survival times were right-censored. Possibly due to rounding, there were some survival times equal to zero, so we added one unit to the survival times of all the patients. We took the logarithm of the shifted survival times and appended them to the xis to create a p-dimensional vector yi, where p = 7400 and zi = log(1 + Ti). We model the yis jointly as in § 4.2 after normalizing them. The joint Gaussian model implies an accelerated failure time model for the survival times, since the conditional mean of the log-shifted survival time zi given the predictors xi is linear in xi. Since the exact survival times are known for the uncensored subjects, the response was normalized with the mean and standard deviation of those subjects only, and an intercept for the response was added to the model. A normal prior with zero mean and variance one was placed on the intercept. The posterior computation proceeds exactly as in § 3, but an additional step is needed to impute the shifted log survival times for the censored subjects from a truncated normal distribution, truncated below by the transformed censoring time. We ran the adaptive Gibbs sampler for 25 000 iterations with 5000 burn-in and collected every fifth sample after burn-in to thin the chain. The estimated number of factors was 20, with a 95% credible interval of (19, 21).
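The additional imputation step for a right-censored subject can be sketched as follows, assuming scipy's truncated normal sampler; the conditional mean and standard deviation of the log survival time given the features would come from the current draw of Ω as in § 4.2, and the names here are illustrative.

```python
import numpy as np
from scipy.stats import truncnorm

def impute_censored_log_time(mean, sd, censor_log_time):
    """Draw a right-censored log survival time from its full conditional,
    a normal truncated below at the transformed censoring time; `mean` and
    `sd` denote the conditional mean and standard deviation of z given x
    under the current parameter draw (a sketch)."""
    a = (censor_log_time - mean) / sd    # standardized lower truncation point
    return truncnorm.rvs(a, np.inf, loc=mean, scale=sd)
```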
We thresholded the posterior mean of the regression coefficients as described in § 4.2 to perform variable selection. The thresholding approach selected 17 features, with all of the features belonging to three of the four signature groups mentioned in Rosenwald et al. (2002). The three signature groups were the germinal-centre B-cell signature, the major histocompatibility complex class II signature and the lymph-node signature, while no genes in the proliferation signature group were selected. The top features mentioned in Gui & Li (2005) and Segal (2006) also come from the same three signature groups. In Table 5, we provide a brief description of the top five genes selected using our approach.
Table 5.
Feature selection in the diffuse large-B-cell lymphoma data
Unique ID | GenBank ID | Signature | Description |
---|---|---|---|
24094 | AI476194 | lymph | CD63 antigen (melanoma 1 antigen) |
17048 | AA085368 | lymph | CD63 antigen (melanoma 1 antigen) |
29636 | NM005194 | lymph | CCAAT/enhancer binding protein (C/EBP), β
34818 | U83461 | lymph | solute carrier family 31 (copper transporters), member 2 |
24394 | AA729055 | MHC | major histocompatibility complex, class II, DR α |
Lymph, lymph-node signature; MHC, major histocompatibility complex; GenBank, National Institute of Health genetic sequence database.
Among the features selected by our approach, the ones with GenBank ID AA729055, AA805575 and X59812 also appear in Gui & Li (2005) and Segal (2006). Although standard penalization methods tend to select one of a correlated group of important predictors, our approach is designed to allow selection of highly correlated predictors into the same model. This is illustrated in Table 5, as the first two predictors have a correlation coefficient of 0.96. There are several groups of highly correlated predictors in the selected set of 17.
Segal (2006) obtained modest predictive accuracy using a variety of methods, and so advocated exercising care before making a prognosis based only on the gene expressions. Our analysis also suggested that the gene expression data explain only a small proportion of the variability in the survival times. The 95% predictive intervals for the survival times in the test sample were wide and contained the true survival times for the uncensored observations in all cases. The mean square prediction error and mean absolute prediction error for the uncensored observations were 1.31 and 0.89, while the corresponding values for the lasso trained with the uncensored observations in the training sample were 1.28 and 0.90. The proportion of times the predicted survival times for the censored observations exceeded the censoring time was 0.54. We also performed sensitivity analysis by varying ν, the initial values of a1 and a2 and the prior variance of the intercept. The conclusions were unchanged, with the same set of top 10 genes selected on all occasions.
Acknowledgments
This research was partially supported by a grant from the National Institute of Environmental Health Sciences of the National Institutes of Health, U.S.A. The authors would like to thank Mark Segal for sharing the diffuse large B-cell lymphoma data.
Appendix
Proofs
Proof of Lemma 1. It is enough to show that, for any Λ ∈ ΘΛ, ΛΛT is positive semi-definite. For any vector υ ∈ ℝp, υTΛΛTυ is finite since all elements of ΛΛT are finite. The proof is completed by observing that υTΛΛTυ = ‖ΛTυ‖2 ⩾ 0, where ‖ · ‖ denotes the Euclidean norm.
Proof of Lemma 2. Let Ω = (wrs) = g(Λ, Σ) and Ω0 = (w0rs) = g(Λ0, Σ0) for (Λ, Σ) ∈ B∊(Λ0, Σ0); clearly d∞(Ω, Ω0) = max1⩽r,s⩽p |wrs − w0rs|. For any 1 ⩽ r, s ⩽ p,
|wrs − w0rs| ⩽ ‖λr − λ0r‖ ‖λs‖ + ‖λ0r‖ ‖λs − λ0s‖ + d∞(Σ, Σ0) < (2M0 + 1)∊ + ∊2
by the Cauchy–Schwarz inequality, where λj and λ0j denote the jth rows of Λ and Λ0 and M0 = max1⩽j⩽p ‖λ0j‖. Thus d∞(Ω, Ω0) ⩽ ∊*, with ∊* = (2M0 + 1)∊ + ∊2.
Proof of Proposition 1. Clearly ΠΣ(ΘΣ) = 1, so it is enough to show ΠΛ(ΘΛ) = 1. The ϕjhs are independent of the δhs. Hence marginalizing over the ϕjhs yields λjh | τh ∼ t3(0, τh−1), where tν(μ, σ2) denotes the t distribution with ν degrees of freedom having location μ and scale σ2. By the Cauchy–Schwarz inequality,
|∑h=1∞ λjhλj′h| ⩽ (∑h=1∞ λjh2)1/2 (∑h=1∞ λj′h2)1/2
and thus
max1⩽j,j′⩽p |∑h=1∞ λjhλj′h| ⩽ max1⩽j⩽p ∑h=1∞ λjh2.
Hence all the elements of ΛΛT are bounded in absolute value by M, where M = max1⩽j⩽p Mj with Mj = ∑h=1∞ λjh2. Now,
E(Mj) = ∑h=1∞ E(3/τh) = 3E(δ1−1) ∑h=1∞ ah−1 = 3E(δ1−1)/(1 − a),
where a = E(δ2−1) = 1/(a2 − 1) and a < 1 if a2 > 2. Hence E(Mj) < ∞ for each j and thus M is finite almost surely. It follows that ΠΛ ⊗ ΠΣ (ΘΛ × ΘΣ) = 1.
Proof of Theorem 1. Write ΔH = Ω − ΩH = ΛΛT − ΛHΛHT. Clearly d∞(Ω, ΩH) = max1⩽r,s⩽p |dHrs|, where dHrs is the rsth entry of ΔH, so that dHrs = ∑h>H λrhλsh. An application of the Cauchy–Schwarz inequality as in the previous proof gives
|∑h>H λrhλsh| ⩽ (∑h>H λrh2)1/2 (∑h>H λsh2)1/2,
which implies d∞(Ω, ΩH) ⩽ max1⩽j⩽p ∑h>H λjh2. Now, for a fixed ∊ > 0,
where the equality in the second line follows from the fact that ∑h>H λjh2 (j = 1, …, p) are conditionally independent and identically distributed given δ = (δ1, δ2, …)T, and the subsequent two inequalities use Jensen's and Chebyshev's inequalities respectively. Now,
where a = E(δ2−1) = 1/(a2 − 1) < 1 if a2 > 2, and the third equality is a direct consequence of Fubini's theorem. Now use the inequality (1 − x/2) > exp(−x) for 0 < x ⩽ 1.5 to get
if H > log{2b/∊(1 − a)}/ log(1/a). Hence
for 6aH pb/{(1 − a)∊} < 1 or H > log{6pb/∊(1 − a)}/ log(1/a).
Proof of Proposition 2. Let Λ* be a p × k matrix (k ⩽ p) and Σ0 ∈ ΘΣ such that Λ*Λ*T + Σ0 = Ω0. Set Λ0 = (Λ* : 0p×∞); then (Λ0, Σ0) ∈ ΘΛ × ΘΣ, with g(Λ0, Σ0) = Ω0. Fix ∊ > 0 and choose ∊1 > 0 such that (2M0 + 1)∊1 + ∊12 < ∊, with M0 as in the proof of Lemma 2. By Lemma 2, g{B∊1(Λ0, Σ0)} ⊂ B∊∞(Ω0), and thus Π{B∊∞(Ω0)} ⩾ ΠΛ ⊗ ΠΣ{B∊1(Λ0, Σ0)} = ΠΛ{Λ : d2(Λ, Λ0) < ∊1} ΠΣ{Σ : d∞(Σ, Σ0) < ∊1}. Clearly, ΠΣ{Σ : d∞(Σ, Σ0) < ∊1} > 0, so it is enough to show ΠΛ{Λ : d2(Λ, Λ0) < ∊1} > 0. We have,
by the following Lemma.
Lemma 3. Fix 1 ⩽ j ⩽ p. For any ∊ > 0, almost surely.
Proof of Lemma 3. We have λ0jh = 0 for h > k. Thus, for any H ⩾ k,
By Theorem 1, as H → ∞, hence we can find H0 > k such that and thus almost surely. The proof is completed by observing that almost surely for any H < ∞.
Proof of Theorem 2. Fix ∊ > 0, Ω0 ∈ Θ. We have,
Let u0 = det Ω0 and find ∊1 > 0 such that |u − u0| < ∊1 implies |log u − log u0| < ∊. Since det(·) is a continuous function from Θ to ℝ, we can find ∊2 such that d∞(Ω0, Ω) < ∊2 implies |det(Ω0) − det(Ω)| < ∊1. Now tr(Ω−1Ω0) − p = ∑i=1p (λi − 1), where λ1 ⩽ … ⩽ λp are the eigenvalues of Ω−1Ω0. Since Ω and Ω0 are both positive definite,
where x is any p-dimensional vector with xTx = 1. For any x ∈ ℝp with xTx = 1,
Now
and
where λmin(Ω0) > 0 denotes the smallest eigenvalue of Ω0. Choose 0 < ∊3 < λmin(Ω0)/2p such that 2p2∊3/λmin(Ω0) < ∊. We have
for all Ω such that d∞(Ω0, Ω) < ∊3, since |xTΩ0x − xTΩx| < λmin(Ω0)/2 and thus xTΩx > λmin(Ω0)/2. Choose ∊* = min{∊2, ∊3}; then for d∞(Ω0, Ω) < ∊*, we have,
which proves Theorem 2.
References
- Amengual D, Watson M. Consistent estimation of the number of dynamic factors in a large N and T panel. J Bus Econ Statist. 2007;25:91–6.
- Ando T. Bayesian factor analysis with fat-tailed factors and its exact marginal likelihood. J Mult Anal. 2009;100:1717–26.
- Arminger G, Muthén B. A Bayesian approach to nonlinear latent variable models using the Gibbs sampler and the Metropolis–Hastings algorithm. Psychometrika. 1998;63:271–300.
- Bai J, Ng S. Determining the number of factors in approximate factor models. Econometrica. 2002;70:191–221.
- Bickel P, Levina E. Regularized estimation of large covariance matrices. Ann Statist. 2008;36:199–227.
- Carvalho C, Chang J, Lucas J, Nevins J, Wang Q, West M. High-dimensional sparse factor modeling: applications in gene expression genomics. J Am Statist Assoc. 2008;103:1438–56. doi: 10.1198/016214508000000869.
- Celeux G, Chauveau D, Diebolt J. Stochastic versions of the EM algorithm: an experimental study in the mixture case. J Statist Comp and Simul. 1996;55:287–314.
- Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 2006;1:515–34.
- Geweke J, Zhou G. Measuring the pricing error of the Arbitrage Pricing Theory. Rev. Finan. Studies. 1996;9:557–87.
- Ghosh J, Dunson D. Default prior distributions and efficient posterior computation in Bayesian factor analysis. J Comp Graph Statist. 2009;18:306–20. doi: 10.1198/jcgs.2009.07145.
- Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21:3001–8. doi: 10.1093/bioinformatics/bti422.
- Hager W. Updating the inverse of a matrix. SIAM Rev. 1989;31:221–39.
- Lee S, Song X. Bayesian selection on the number of factors in a factor analysis model. Behaviormetrika. 2002;29:23–39.
- Liu J, Wu Y. Parameter expansion for data augmentation. J Am Statist Assoc. 1999;94:1264–74.
- Lopes H, West M. Bayesian model assessment in factor analysis. Statist. Sinica. 2004;14:41–68.
- Lucas J, Carvalho C, Wang Q, Bild A, Nevins J, West M. Sparse statistical modelling in gene expression genomics. In: Müller P, Do K, Vannucci M, editors. Bayesian Inference for Gene Expression and Proteomics. Cambridge: Cambridge University Press; 2006. pp. 155–76.
- Ma S, Huang J. Additive risk survival model with microarray data. BMC Bioinformatics. 2007;8:192. doi: 10.1186/1471-2105-8-192.
- Roberts G, Rosenthal J. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J Appl Prob. 2007;44:458–475.
- Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Mueller-Hermelink HK, Smeland EB, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New Engl J Med. 2002;346:1937–47. doi: 10.1056/NEJMoa012914.
- Schwartz L. On Bayes procedures. Prob. Theory Rel. Fields. 1965;4:10–26.
- Segal M. Microarray gene expression data with linked survival phenotypes: diffuse large-B-cell lymphoma revisited. Biostatistics. 2006;7:268–85. doi: 10.1093/biostatistics/kxj006.
- Song X, Lee S. Bayesian estimation and test for factor analysis model with continuous and polytomous data in several populations. Br J Math Statist Psychol. 2001;54:237–63. doi: 10.1348/000711001159546.
- Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B. 1996;58:267–88.
- West M. Bayesian factor regression models in the “large p, small n” paradigm. Bayesian Statist. 2003;7:723–32.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. J. R. Statist. Soc. B. 2005;67:301–20.