Published in final edited form as: J Comput Graph Stat. 2012 Jan 1;18(2):306–320. doi: 10.1198/jcgs.2009.07145

Default Prior Distributions and Efficient Posterior Computation in Bayesian Factor Analysis

Joyee Ghosh 1, David B Dunson 2

Abstract

Factor analytic models are widely used in social sciences. These models have also proven useful for sparse modeling of the covariance structure in multidimensional data. Normal prior distributions for factor loadings and inverse gamma prior distributions for residual variances are a popular choice because of their conditionally conjugate form. However, such prior distributions require elicitation of many hyperparameters and tend to result in poorly behaved Gibbs samplers. In addition, one must choose an informative specification, as high variance prior distributions face problems due to impropriety of the posterior distribution. This article proposes a default, heavy-tailed prior distribution specification, which is induced through parameter expansion while facilitating efficient posterior computation. We also develop an approach to allow uncertainty in the number of factors. The methods are illustrated through simulated examples and epidemiology and toxicology applications. Data sets and computer code used in this article are available online.

Keywords: Bayes factor, Covariance structure, Latent variables, Parameter expansion, Selection of factors, Slow mixing

1. INTRODUCTION

Factor models have been traditionally used in behavioral sciences, where the study of latent factors such as anxiety and aggression arises naturally. These models also provide a flexible framework for modeling multivariate data in terms of a few unobserved latent factors. In recent years factor models have found their way into many application areas beyond the social sciences. For example, latent factor regression models have been used as a dimensionality reduction tool for modeling of sparse covariance structures in genomic applications (West 2003; Carvalho et al. 2008). In addition, structural equation models and other generalizations of factor analysis are increasingly used in epidemiological studies involving complex health outcomes and exposures (Sanchez et al. 2005). Factor analysis has also recently been applied to the reconstruction of gene regulatory networks and of unobserved activity profiles of transcription factors (Pournara and Wernisch 2007).

Improvements in Bayesian computation permit the routine implementation of latent factor models via Markov chain Monte Carlo (MCMC) algorithms. One typical choice of prior distribution for factor models is to use normal and inverse gamma prior distributions for factor loadings and residual variances, respectively. These choices are convenient, because they represent conditionally conjugate forms that lead to straightforward posterior computation by a Gibbs sampler (Arminger 1998; Rowe 1998; Song and Lee 2001).

Although conceptually straightforward, routine implementation of Bayesian factor analysis faces a number of major hurdles. First, it is difficult to elicit the hyperparameters needed in specifying the prior distribution. For example, these hyperparameters control the prior mean, variance, and covariance of the factor loadings. In many cases, prior to examination of the data, one may have limited knowledge of plausible values for these parameters, and it can be difficult to convert subject matter expertise into reasonable guesses for the factor loadings and residual variances. Prior elicitation is particularly important in factor analysis, because the posterior distribution is improper in the limiting case as the prior variance for the normal and inverse gamma components increases. In addition, as for other hierarchical models, use of a diffuse, but proper prior distribution does not solve this problem (Natarajan and McCulloch 1998). Even for informative prior distributions, Gibbs samplers are commonly very poorly behaved due to high posterior dependence among the parameters, leading to extremely slow mixing.

To simultaneously address the need for default prior distributions and dramatically more efficient and reliable algorithms for posterior computation, this article proposes a parameter expansion approach (Liu and Wu 1999). As noted by Gelman (2004), parameter expansion provides a useful approach for inducing new families of prior distributions. Gelman (2006) used this idea to propose a class of prior distributions for variance parameters in hierarchical models. Kinney and Dunson (2007) later expanded this class to allow dependent random effects in the context of developing Bayesian methods for random effects selection. Liu, Rubin, and Wu (1998) used parameter expansion to accelerate convergence of the EM algorithm, and applied this approach to a factor model. However, to our knowledge, parameter expansion has not yet been used to induce prior distributions and improve computational efficiency in Bayesian factor analysis.

As a robust prior distribution for the factor loadings, we use parameter expansion to induce t or folded-t prior distributions, depending on sign constraints. The Cauchy or half-Cauchy case can be used as a default in cases in which subject matter knowledge is limited. We propose an efficient parameter-expanded Gibbs sampler involving generating draws from standard conditionally conjugate distributions, followed by a postprocessing step to transform back to the inferential parameterization. This algorithm is shown to be dramatically more efficient than standard Gibbs samplers in several examples. In addition, we develop an approach to allow uncertainty in the number of factors, providing an alternative to methods proposed by Lopes and West (2004) and others.

Section 2 defines the model and parameter expansion approach when the number of factors is known. Section 3 presents a comparison of the traditional and the parameter-expanded Gibbs sampler, based on the results of a simulation study, when the number of factors is known. Section 4 extends this approach to allow an unknown number of factors. Section 5 contains two applications, one to data from a reproductive epidemiology study and the other to data from a toxicology study. Section 6 discusses the results.

2. BAYESIAN FACTOR MODELS

2.1 Model Specification and Standard Prior Distributions

The factor model is defined as follows:

y_i = \Lambda \eta_i + \epsilon_i, \qquad \epsilon_i \sim N_p(0, \Sigma), \qquad (2.1)

where Λ is a p × k matrix of factor loadings, η_i = (η_{i1}, . . . , η_{ik})′ ~ N_k(0, I_k) is a vector of standard normal latent factors, and ε_i is a residual with diagonal covariance matrix Σ = diag(σ_1^2, . . . , σ_p^2). The introduction of the latent factors, η_i, induces dependence, as the marginal distribution of y_i is N_p(0, Ω), with Ω = ΛΛ′ + Σ. In practice, the number of factors is small relative to the number of outcomes (k ≪ p). Small values of k relative to p lead to sparse models for Ω containing many fewer than p(p + 1)/2 parameters. For this reason, factor models provide a convenient and flexible framework for modeling of a covariance matrix, particularly in applications with moderate to large p.
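As a quick illustration of the covariance structure Ω = ΛΛ′ + Σ, the short R sketch below (our own illustration, not part of the paper's supplementary code) simulates data from model (2.1) using the loadings and residual variances of Simulation 3.1 and checks that the sample covariance approaches Ω; variable names and the large n are ours.

# Minimal sketch: simulate from y_i = Lambda eta_i + eps_i and compare the
# sample covariance with Omega = Lambda Lambda' + Sigma.
set.seed(1)
n <- 5000; p <- 7; k <- 1
Lambda <- matrix(c(0.995, 0.975, 0.949, 0.922, 0.894, 0.866, 0.837), p, k)
Sigma  <- diag(c(0.01, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30))
eta <- matrix(rnorm(n * k), n, k)                    # eta_i ~ N_k(0, I_k)
eps <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)    # eps_i ~ N_p(0, Sigma)
y   <- eta %*% t(Lambda) + eps                       # rows are y_i'
Omega <- Lambda %*% t(Lambda) + Sigma
max(abs(cov(y) - Omega))                             # small for large n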

For simplicity in exposition, we leave the intercept out of expression (2.1). The factor model (2.1) without further constraints is not identifiable. One can obtain an identical Ω by multiplying Λ by an orthonormal matrix P defined so that PP′ = I_k. Following a common convention to ensure identifiability (Geweke and Zhou 1996), we assume that Λ has a full-rank lower triangular structure. The number of free parameters in Λ, Σ is then q = p(k + 1) − k(k − 1)/2, and k must be chosen so that q ≤ p(p + 1)/2. Although we focus on the case in which the loadings matrix is lower triangular, our methods can be trivially adapted to cases in which structural zeros can be chosen in the loadings matrix based on prior knowledge. For example, the first p_1 measurements may be known to measure the first latent trait but not to measure the other latent traits. This would imply the constraint that λ_{jl} = 0, for j = 1, . . . , p_1 and l = 2, . . . , k.

To complete a Bayesian specification of model (2.1), the typical choice specifies truncated normal prior distributions for the diagonal elements of Λ, normal prior distributions for the lower triangular elements, and inverse gamma prior distributions for σ_1^2, . . . , σ_p^2. These choices are convenient, because they represent conditionally conjugate forms that lead to straightforward posterior computation by a Gibbs sampler (Arminger 1998; Rowe 1998; Song and Lee 2001). Unfortunately, this choice is subject to the problems mentioned in Section 1.

2.2 Inducing Prior Distributions Through Parameter Expansion

To induce a heavier tailed prior distribution on the factor loadings to allow specification of a default, proper prior distribution, we propose a parameter expansion (PX) approach. The basic PX idea involves introduction of a working model that is overparameterized. This working model is then related to the inferential model through a transformation. Generalizing the approach proposed by Gelman (2006) for prior distribution specification in simple ANOVA models, we define the following PX-factor model:

y_i = \Lambda^* \eta_i^* + \epsilon_i, \qquad \eta_i^* \sim N_k(0, \Psi), \qquad \epsilon_i \sim N_p(0, \Sigma), \qquad (2.2)

where Λ* is a p × k working factor loadings matrix having a lower triangular structure without constraints on the elements, η_i^* = (η_{i1}^*, . . . , η_{ik}^*)′ is a vector of working latent variables, Ψ = diag(ψ_1, . . . , ψ_k), and Σ is a diagonal covariance matrix defined as in (2.1). Note that model (2.2) is clearly overparameterized, having redundant parameters in the covariance structure. In particular, marginalizing out the latent variables, η_i^*, we obtain y_i ~ N_p(0, Λ*ΨΛ*′ + Σ). Clearly, the diagonal elements of Λ* and Ψ are redundant.

To relate the working model parameters in (2.2) to the inferential model parameters in (2.1), we use the following transformation:

\lambda_{jl} = s(\lambda_{ll}^*)\, \lambda_{jl}^*\, \psi_l^{1/2}, \qquad \eta_{il} = s(\lambda_{ll}^*)\, \psi_l^{-1/2}\, \eta_{il}^*, \qquad j = 1, \ldots, p,\; l = 1, \ldots, k, \qquad (2.3)

where s(x) = −1 for x < 0 and s(x) = 1 for x ≥ 0. Then, instead of specifying a prior distribution for Λ directly, we induce a prior distribution on Λ through a prior distribution for Λ*, Ψ. In particular, we let

\lambda_{jl}^* \overset{\mathrm{iid}}{\sim} N(0, 1), \quad j = 1, \ldots, p,\; l = 1, \ldots, \min(j, k), \qquad \lambda_{jl}^* \sim \delta_0, \quad j = 1, \ldots, k - 1,\; l = j + 1, \ldots, k, \qquad \psi_l^{-1} \overset{\mathrm{iid}}{\sim} \mathcal{G}(a_l, b_l), \quad l = 1, \ldots, k, \qquad (2.4)

where δ_0 is a measure concentrated at 0, and ℊ(a, b) denotes the gamma distribution with mean a/b and variance a/b^2.

The prior distribution is conditionally conjugate, leading to straightforward Gibbs sampling, as described in Section 2.3. In the special case in which k = 1 and λ_{j1} = λ, λ_{j1}^* = λ* for all j, the induced prior distribution on Λ reduces to the Gelman (2006) half-t prior distribution. In the general case, we obtain a t prior distribution for the off-diagonal elements of Λ and half-t prior distributions for the diagonal elements of Λ upon marginalizing out Λ* and Ψ. Note that we have induced prior dependence across the elements within a column of Λ. In particular, columns having higher ψ_l values will tend to have higher factor loadings, whereas columns with low ψ_l values tend to have low factor loadings. Such a dependence structure is quite reasonable because factors which tend to have higher precision will have their corresponding factor loadings inflated. On the other hand, loadings for factors with low precision will be automatically smaller. We have considered proper prior distributions for the working parameters in this article. Alternatively, improper prior distributions can be used, but then one would need to perform technical calculations to ensure that the chain corresponding to the inferential parameters has the desired posterior distribution as its stationary distribution (Meng and van Dyk 1999).
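To make the induced prior concrete, here is a small Monte Carlo sketch in R (ours, not from the paper's supplement) that draws from the PX prior (2.4) with a_l = b_l = 1/2 and applies (2.3); the resulting draws of an off-diagonal loading should match a standard Cauchy, and those of a diagonal loading a half-Cauchy.

# Minimal sketch: check the marginal prior induced on the loadings by (2.3)-(2.4).
set.seed(1)
nsim <- 1e5
lam_star_diag <- rnorm(nsim)                        # lambda*_{ll} ~ N(0, 1)
lam_star_off  <- rnorm(nsim)                        # lambda*_{jl} ~ N(0, 1), j > l
psi <- 1 / rgamma(nsim, shape = 1/2, rate = 1/2)    # psi_l^{-1} ~ G(1/2, 1/2)
lam_diag <- sign(lam_star_diag) * lam_star_diag * sqrt(psi)   # half-Cauchy draws
lam_off  <- sign(lam_star_diag) * lam_star_off  * sqrt(psi)   # Cauchy draws
rbind(quantile(lam_off, c(0.25, 0.5, 0.75)), qcauchy(c(0.25, 0.5, 0.75)))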

2.3 Parameter-Expanded Gibbs Sampler

After specifying the prior distribution one can then run an efficient, blocked Gibbs sampler for posterior computation in the PX-factor model. Note that the chains under the overparameterized model actually exhibit very poor mixing, reflecting the lack of identifiability. It is only after transforming back to the inferential model that we obtain improved mixing. We have found this algorithm to be highly efficient, in terms of rates of convergence and mixing, in a wide variety of simulated and real data examples. The conditional distributions are described below.

Model (2.2) can be written as y_{ij} = z_{ij}′λ_j^* + ε_{ij}, ε_{ij} ~ N(0, σ_j^2), where z_{ij} = (η_{i1}^*, . . . , η_{ik_j}^*)′, λ_j^* = (λ_{j1}^*, . . . , λ_{jk_j}^*)′ denotes the free elements of row j of Λ*, and k_j = min(j, k) is the number of free elements. Let π(λ_j^*) = N_{k_j}(λ_{0j}^*, Σ_{0λ_j^*}) denote the prior distribution for λ_j^*; the full conditional posterior distributions are as follows:

\pi(\lambda_j^* \mid \eta^*, \Psi, \Sigma, y) = N_{k_j}\!\left( \left(\Sigma_{0\lambda_j^*}^{-1} + \sigma_j^{-2} Z_j' Z_j\right)^{-1} \left(\Sigma_{0\lambda_j^*}^{-1}\lambda_{0j}^* + \sigma_j^{-2} Z_j' Y_j\right),\; \left(\Sigma_{0\lambda_j^*}^{-1} + \sigma_j^{-2} Z_j' Z_j\right)^{-1} \right),

where Zj = (z1j, . . . , znj)′ and Yj = (y1j, . . . , ynj)′. In addition, we have

\pi(\eta_i^* \mid \Lambda^*, \Sigma, \Psi, y) = N_k\!\left( \left(\Psi^{-1} + \Lambda^{*\prime} \Sigma^{-1} \Lambda^*\right)^{-1} \Lambda^{*\prime} \Sigma^{-1} y_i,\; \left(\Psi^{-1} + \Lambda^{*\prime} \Sigma^{-1} \Lambda^*\right)^{-1} \right),
\pi(\psi_l^{-1} \mid \eta^*, \Lambda^*, \Sigma, y) = \mathcal{G}\!\left(a_l + \frac{n}{2},\; b_l + \frac{1}{2}\sum_{i=1}^{n} \eta_{il}^{*2}\right),
\pi(\sigma_j^{-2} \mid \eta^*, \Lambda^*, \Psi, y) = \mathcal{G}\!\left(c_j + \frac{n}{2},\; d_j + \frac{1}{2}\sum_{i=1}^{n} \left(y_{ij} - z_{ij}'\lambda_j^*\right)^2\right),

where ℊ(a_l, b_l) is the prior distribution for ψ_l^{-1}, for l = 1, . . . , k, and ℊ(c_j, d_j) is the prior distribution for σ_j^{-2}, for j = 1, . . . , p.

Hence, the proposed PX Gibbs sampler cycles through simple steps for sampling from normal and gamma full conditional posterior distributions under the working model. After discarding a burn-in and collecting a large number of samples, we then simply apply the transformation in (2.3) to each of the samples as a postprocessing step that produces samples from the posterior distribution under the inferential parameterization. Convergence diagnostics and inferences then rely entirely on the samples after postprocessing, with the working model samples discarded.
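The R sketch below (our own illustration under the default hyperparameters of Section 3.1, not the authors' released code) puts the pieces together: it cycles through the full conditional distributions above, with N(0, I) priors for the free rows of Λ*, and applies the postprocessing transformation (2.3) to each retained draw.

# Minimal sketch of the PX Gibbs sampler of Section 2.3.
# y: n x p matrix of standardized data; k: number of factors;
# (a, b): gamma hyperparameters for psi_l^{-1}; (c, d): for sigma_j^{-2}.
px_gibbs <- function(y, k, niter = 25000, burn = 5000,
                     a = 1/2, b = 1/2, c = 1, d = 0.2) {
  n <- nrow(y); p <- ncol(y)
  Lam <- matrix(0, p, k); for (j in 1:p) Lam[j, 1:min(j, k)] <- 0.1
  psi <- rep(1, k); sig2 <- rep(1, p)
  keep <- vector("list", niter - burn)
  for (it in 1:niter) {
    # 1. working latent factors eta*_i | -
    V   <- solve(diag(1 / psi, k) + t(Lam) %*% (Lam / sig2))
    M   <- y %*% (Lam / sig2) %*% V                 # n x k conditional means
    eta <- M + matrix(rnorm(n * k), n, k) %*% chol(V)
    # 2. free elements of each row of Lambda* | -  (prior lambda*_j ~ N(0, I))
    for (j in 1:p) {
      kj <- min(j, k); Z <- eta[, 1:kj, drop = FALSE]
      Vj <- solve(diag(kj) + crossprod(Z) / sig2[j])
      mj <- Vj %*% crossprod(Z, y[, j]) / sig2[j]
      Lam[j, 1:kj] <- mj + t(chol(Vj)) %*% rnorm(kj)
    }
    # 3. psi_l^{-1} | -  and  sigma_j^{-2} | -
    psi  <- 1 / rgamma(k, a + n / 2, b + colSums(eta^2) / 2)
    res  <- y - eta %*% t(Lam)
    sig2 <- 1 / rgamma(p, c + n / 2, d + colSums(res^2) / 2)
    # 4. postprocessing step (2.3): transform back to the inferential model
    if (it > burn) {
      s <- sign(diag(Lam[1:k, 1:k, drop = FALSE]))
      keep[[it - burn]] <- list(Lambda = sweep(Lam, 2, s * sqrt(psi), "*"),
                                Sigma  = sig2)
      # posterior draws of Omega can be formed as Lambda %*% t(Lambda) + diag(Sigma)
    }
  }
  keep
}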

3. SIMULATION STUDY WHEN THE NUMBER OF FACTORS IS KNOWN

We look at two simulation examples to compare the performance of the traditional and our PX Gibbs sampler. We routinely normalize the data prior to analysis.

3.1 One Factor Model

In our first simulation, we consider one of the examples in Lopes and West (2004). Here p = 7, n = 100, the number of factors, k = 1, Λ = (0.995, 0.975, 0.949, 0.922, 0.894, 0.866, 0.837)′, and diag(Σ) = (0.01, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30). We repeat this simulation for 100 simulated datasets. To specify the prior distributions, for the traditional Gibbs sampler we choose N+(0, 1) (truncated to be positive) prior distributions for the diagonals and N(0, 1) prior distributions for the lower triangular elements of Λ, respectively. For the PX Gibbs sampler we induce half-Cauchy and Cauchy prior distributions, both with scale parameter 1, for the diagonals and lower triangular elements of Λ, respectively. To induce this prior distribution we take N(0, 1) prior distributions for the free elements of Λ* and ℊ(1/2, 1/2) prior distributions for the diagonal elements of Ψ. For the residual precisions σ_j^{-2} we take ℊ(1, 0.2) prior distributions for both samplers. This choice of hyperparameter values provides a modest degree of shrinkage toward a plausible range of values for the residual precision. For each simulated dataset, we run the Gibbs sampler for 25,000 iterations, discarding the first 5,000 iterations as burn-in.

The effective sample size (ESS) gives the size of an independent sample with the same variance as the MCMC sample under consideration (Robert and Casella 2004), and is hence a good measure of the mixing of the chain. We compare the ESS of the covariance matrix Ω across the 100 simulations. We find that the PX Gibbs sampler leads to a tremendous gain in ESS for all elements in Ω. This is evident from Figure 1. It is important to note that the target distributions of the MCMC algorithms for the typical prior distribution and the PX-induced prior distribution are not equivalent, so that in some sense the ESS values are not directly comparable. That said, our focus is on choosing a default prior distribution that has good properties in terms of posterior computation, and it is reasonable from this perspective to compare the ESS for the two different approaches. If substantive belief is available, allowing one to follow a subjective Bayes approach and choose informative prior distributions, our recommended PX approach can still be used, though one should be careful to choose the hyperparameters based on the prior distributions induced after transforming back to the inferential model (Imai and van Dyk 2005). We have attempted to choose prior distributions under the two approaches that are roughly comparable in terms of scale, avoiding use of diffuse, but proper prior distributions. Such prior distributions are expected to lead to very poor computational performance and unreliable estimates and inferences, because the limiting case corresponds to an improper posterior distribution. The use of induced heavy-tailed prior distributions after data standardization seems to provide a reasonable objective Bayes solution to the problem, which results in substantially improved computational performance.
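In R, this kind of ESS comparison can be carried out with the coda package; the sketch below (ours) assumes the post burn-in draws of the upper triangular elements of Ω from the two samplers are stored columnwise in matrices draws_px and draws_trad (names are our assumptions).

# Sketch of the mixing comparison: effective sample size per element of Omega.
library(coda)
ess_px   <- effectiveSize(mcmc(draws_px))
ess_trad <- effectiveSize(mcmc(draws_trad))
ess_px / ess_trad          # ratios of the kind plotted in Figures 1 and 2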

Figure 1. Comparison of the parameter expanded and traditional Gibbs sampler based on the ratio of effective sample size for upper triangular elements of Ω, plotted rowwise from left to right, over 100 simulated datasets for Simulation 3.1.

3.2 Three Factor Model

For our second simulation, we have p = 10, n = 100, and the number of factors, k = 3. To make the problem more difficult, we set some of the loadings as negative and introduce more noise in the data:

\Lambda' = \begin{pmatrix} 0.89 & 0.00 & 0.25 & 0.00 & 0.80 & 0.00 & 0.50 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.90 & 0.25 & 0.40 & 0.00 & 0.50 & 0.00 & 0.00 & -0.30 & -0.30 \\ 0.00 & 0.00 & 0.85 & 0.80 & 0.00 & 0.75 & 0.75 & 0.00 & 0.80 & 0.80 \end{pmatrix}, \qquad \operatorname{diag}(\Sigma) = (0.2079, 0.1900, 0.1525, 0.2000, 0.3600, 0.1875, 0.1875, 1.0000, 0.2700, 0.2700).

We carried out the simulations exactly as in the previous simulation study. From Figure 2 we find that the gain is not as dramatic as in the previous example. This is not surprising for the following reason.

Figure 2. Comparison of the parameter expanded and traditional Gibbs sampler based on the ratio of effective sample size for upper triangular elements of Ω, plotted rowwise from left to right, over 100 simulated datasets for Simulation 3.2.

The idea behind parameter expansion is to introduce auxiliary variables in the original model to form a parameter-expanded working model. Thus the working model has an extra set of parameters, and parameters under the original model are obtained by a predefined set of reduction functions operating on the extra parameters. The improvement in mixing occurs because of the extra randomness introduced via the auxiliary variables. Gelman (2006) also mentioned that parameter expansion diminishes dependence among parameters and hence leads to better mixing.

The off-diagonal elements in the covariance matrix Ω represent the correlations between the outcomes, as the data are standardized. When some of the outcomes are highly correlated, the corresponding elements in Ω are close to one. In such a scenario the likelihood is close to degenerate, and the posterior samples under the traditional Gibbs sampler are poorly behaved, exhibiting extremely high autocorrelation for those parameters. Thus in those cases there is a tremendous gain in using the PX approach. On the other hand, when the posterior correlation under the traditional Gibbs sampler is low to begin with, there is not much room for improvement using the PX approach, and hence we find a more modest gain.

It should be clarified that the modest improvement in the second example is not due to a fall-off in the PX Gibbs performance as dimension increases. Instead, the gain for the first example was attributable to small values for some of the residual variances, leading to very high posterior dependence in the factor loadings. The second example had more moderate residual variances. To verify that the method is scalable to higher dimensions, we repeated the simulation in a case taken from Lopes and West (2004) with p = 9, k = 3, and with some of the outcomes highly correlated, implying low residual variance. In this case, we again observed a tremendous gain in mixing. We note that the additional computation time needed for the PX approach over the traditional approach is negligible, so that it is reasonable to focus on computational efficiency for the same number of MCMC iterates.

4. BAYESIAN MODEL SELECTION FOR UNKNOWN NUMBER OF FACTORS

4.1 Path Sampling With Parameter Expansion

To allow an unknown number of factors k, we choose a multinomial prior distribution with Pr(k = h) = κ_h = 1/m, for h = 1, . . . , m. We then complete a Bayesian specification through prior distributions on the coefficients within each of the models in the list k ∈ {1, . . . , m}. This is accomplished by choosing a prior distribution for the coefficients in the m factor model having the form described in Section 2, with the prior distribution for Λ^{(h)} for any smaller model k = h obtained by marginalizing out the columns from (h + 1) to m. In this manner, we place a prior distribution on the coefficients in the largest model, while inducing prior distributions on the coefficients in each of the smaller models.

Bayesian selection of the number of factors relies on posterior model probabilities, Pr(k = h | y) = κ_h π(y | k = h) / Σ_{l=1}^{m} κ_l π(y | k = l), where the marginal likelihood under model k = h, π(y | k = h), is obtained by integrating the likelihood Π_i N_p(y_i; 0, Λ^{(k)}Λ^{(k)′} + Σ) across the prior distribution for the factor loadings Λ^{(k)} and residual variances Σ. We still need to consider the problem of estimating Pr(k = h | y), as the marginal likelihood is not available in closed form. Note that any posterior model probability can be expressed entirely in terms of the prior odds O[h : j] = κ_h/κ_j and Bayes factors BF[h : j] = π(y | k = h)/π(y | k = j) as follows:

\Pr(k = h \mid y) = \frac{O[h : j]\, \mathrm{BF}[h : j]}{\sum_{l=1}^{m} O[l : j]\, \mathrm{BF}[l : j]}. \qquad (4.1)

Lee and Song (2002) used the path sampling approach of Gelman and Meng (1998) for estimating log Bayes factors. They constructed a path using a scalar t ∈ [0, 1] to link two models M0 and M1. They used the same idea as outlined in an example in Gelman and Meng (1998) to construct their path. To compute the required integral they took a fixed set of grid points for t, t ∈ [0, 1] and then used numerical integration to approximate the integration over t.

Let M0 and M1 correspond to models with (h − 1) and h factors, respectively. The two models are linked by the path: Mt : yi = Λtηi + εi, Λt = (λ1, λ2, . . . , λ(h−1), tλh), where λi is the ith column of the loadings matrix. Thus t = 0 and t = 1 correspond to M0 and M1. Then for a set of fixed and ordered grid points, t(0) = 0 < t(1) < · · · < t(S) < t(S+1) = 1, we have

\widehat{\log}\!\left(\mathrm{BF}[h : (h-1)]\right) = \frac{1}{2} \sum_{s=0}^{S} \left(t_{(s+1)} - t_{(s)}\right)\left(\bar{U}_{(s+1)} + \bar{U}_{(s)}\right), \qquad (4.2)

where U(Λ, Σ, η, y, t) = Σ_{i=1}^{n} (y_i − Λ_t η_i)′ Σ^{-1} (0_{p×(h−1)}, λ_h) η_i and Ū_{(s)} is the average of {U(Λ^{(j)}, Σ^{(j)}, η^{(j)}, y, t_{(s)}), j = 1, 2, . . . , J} over the J MCMC samples from p(Λ, Σ, η | y, t_{(s)}).
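As a rough sketch of how (4.2) and (4.1) might be implemented in R (ours, not the supplementary code), suppose tgrid holds the ordered grid t_{(0)} = 0 < · · · < t_{(S+1)} = 1 and Ubar the corresponding averages Ū_{(s)}, each computed from a PX Gibbs run at that grid point after transforming back to the inferential parameterization; the log Bayes factor is a trapezoidal sum, and log Bayes factors against a common baseline model combine with prior model probabilities to give posterior model probabilities.

# Trapezoidal approximation (4.2) to the log Bayes factor along the path.
log_bf_path <- function(tgrid, Ubar) {
  sum(diff(tgrid) * (head(Ubar, -1) + tail(Ubar, -1))) / 2
}

# Posterior model probabilities (4.1) from log Bayes factors relative to a
# baseline model (log_bf = 0 for the baseline itself) and prior probabilities.
post_model_prob <- function(log_bf, prior = rep(1 / length(log_bf), length(log_bf))) {
  w <- log(prior) + log_bf
  exp(w - max(w)) / sum(exp(w - max(w)))   # numerically stable normalization
}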

An important challenge in Bayes model comparisons is sensitivity to the prior distribution. It is well known that Bayes factors tend to be sensitive to the prior distribution, motivating a rich literature on objective Bayes methods (Berger and Pericchi 1996). Lee and Song (2002) relied on highly informative prior distributions in implementing Bayesian model selection for factor analysis, an approach which is only reliable when substantial prior knowledge is available allowing one to concisely guess a narrow range of plausible values for all of the parameters in the model. Our expectation is that such knowledge is often lacking, motivating our use of default, heavy-tailed prior distributions, a strategy motivated by a desire for Bayesian robustness.

We modify their path sampling approach to allow use of our default PX-induced prior distributions. To do this we run our PX Gibbs sampler at each of the grid points, recover the parameters under the inferential model simply using (2.3), and use those draws to estimate the log Bayes factors as given in (4.2). We will refer to our approach as path sampling with parameter expansion (PS-PX). First, PS-PX eliminates the need to use strongly informative prior distributions. Second, model selection based on path sampling is computationally quite intensive; because PS-PX uses prior distributions with good mixing properties, one needs to run the chains for considerably fewer iterations, making the procedure much more efficient.

4.2 Simulation Study

Here we consider the same two sets of simulations as in Section 3, but now allowing the number of factors to be unknown. Let m denote the maximum number of factors in our list. For the simulation example considered in Section 3.1, we take m to be 3, which is also the maximum number of factors resulting in an identifiable model. We repeat the simulation for 100 simulated datasets and analyze them using PS-PX, taking 10 equispaced grid points in [0, 1] for t. We use the same prior distributions for the parameters within each model as in Section 3.1. The correct model is chosen 100/100 times. We also calculate the BIC for all the models in our list for each dataset based on the maximum likelihood estimates of Λ and Σ. The BIC also chooses the correct model 100/100 times.

For the simulation example in Section 3.2 the true model has three factors and the maximum number of factors resulting in an identifiable model is 6. But here we take m = 4. We recommend focusing on lists that do not have a large number of factors as sparseness is one of the main goals of factor models. Thus fitting models with as many factors as permitted given identifiability constraints goes against this motivation. We carry out the simulations exactly as in the previous simulation study. Here both PS-PX and the BIC choose the correct model 100/100 times again.
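The article does not spell out its BIC computation; one plausible way to reproduce this kind of check in R (an illustration under our own assumptions, with the data matrix y standardized as in Section 3, so that the correlation-matrix fit of factanal is appropriate) is to rebuild Ω̂ = Λ̂Λ̂′ + Σ̂ from the maximum likelihood fit and evaluate BIC = −2 log L + q log n, with q = p(k + 1) − k(k − 1)/2 as in Section 2.1.

# Sketch: BIC for a k-factor model from the ML fit (smaller is better).
bic_factor <- function(y, k) {
  n <- nrow(y); p <- ncol(y)
  fit  <- factanal(y, factors = k, rotation = "none")
  Lhat <- unclass(fit$loadings)                        # p x k ML loadings
  Omega <- tcrossprod(Lhat) + diag(fit$uniquenesses)   # Lambda Lambda' + Sigma
  S <- cov(y) * (n - 1) / n                            # ML covariance of standardized data
  loglik <- -n / 2 * (p * log(2 * pi) + determinant(Omega)$modulus +
                        sum(diag(solve(Omega, S))))
  q <- p * (k + 1) - k * (k - 1) / 2
  as.numeric(-2 * loglik + q * log(n))
}
sapply(1:3, function(k) bic_factor(y, k))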

5. APPLICATIONS

5.1 Male Fertility Study

We first illustrate the effect of our parameter-expanded Gibbs sampler on mixing when the number of factors is fixed. We have data from a reproductive epidemiology study. Here the underlying latent factor of interest is the latent sperm concentration of subjects. Concentration is defined as sperm count/semen volume. There are three outcome variables: concentration based on (i) an automated sperm count system, (ii) manual counting (technique 1), and (iii) manual counting (technique 2), respectively. We consider the log-transformed concentration as the assumption of normality is more appropriate on the log scale. Here the maximum number of latent factors that results in an identifiable model is 1. The model that we consider generalizes the factor analytic model to include covariates at the latent variable level, given as follows:

y_{ij} = \alpha_j + \lambda_j \eta_i + \epsilon_{ij}, \qquad \eta_i = \beta' x_i + \delta_i, \qquad \text{where } \epsilon_{ij} \sim N(0, \tau_j^{-1}),\; \delta_i \sim N(0, 1). \qquad (5.1)

Following the usual convention, we restrict the λj’s to be positive for sign identifiability. We denote the covariates by xi = (xi1, xi2, xi3)′. There are altogether three study sites:

x_{ij} = \begin{cases} 1, & \text{if the sample is from site } (j+1), \\ 0, & \text{otherwise,} \end{cases} \quad j = 1, 2, \qquad x_{i3} = \begin{cases} 1, & \text{if the time since last ejaculation} \geq 2, \\ 0, & \text{otherwise.} \end{cases}

The PX model is as follows:

y_{ij} = \alpha_j^* + \lambda_j^* \eta_i^* + \epsilon_{ij}, \qquad \eta_i^* = \mu^* + \beta^{*\prime} x_i + \delta_i^*, \qquad \text{where } \epsilon_{ij} \sim N(0, \tau_j^{-1}),\; \delta_i^* \sim N(0, \psi^{-1}). \qquad (5.2)

We have introduced a redundant intercept μ* in the second level of the model which improves the mixing tremendously. As noted by Gelfand et al. (1995), centering often leads to much better mixing. We relate the PX model parameters in (5.2) to those of the inferential model in (5.1) using the transformations

\alpha_j = \alpha_j^* + \lambda_j^* \mu^*, \qquad \lambda_j = S(\lambda_j^*)\, \lambda_j^*\, \psi^{-1/2}, \qquad \beta = \beta^* \psi^{1/2}, \qquad \eta_i = S(\lambda_j^*)\, \psi^{1/2} (\eta_i^* - \mu^*), \qquad \delta_i = \psi^{1/2} \delta_i^*.
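A compact R sketch of this postprocessing step (our reading of the transformation above, not the authors' code; argument names are ours, and the sign of the first working loading is used as the reference for the constraint λ_j > 0):

# Map one saved draw of the working parameters in (5.2) to the inferential
# parameters in (5.1); lam_star is the length-3 vector of working loadings.
to_inferential <- function(alpha_star, lam_star, mu_star, beta_star, eta_star, psi) {
  s <- sign(lam_star[1])                          # reference sign (an assumption)
  list(alpha  = alpha_star + lam_star * mu_star,  # alpha_j = alpha*_j + lambda*_j mu*
       lambda = s * lam_star / sqrt(psi),         # lambda_j = S(lambda*) lambda*_j psi^{-1/2}
       beta   = sqrt(psi) * beta_star,            # beta = beta* psi^{1/2}
       eta    = s * sqrt(psi) * (eta_star - mu_star))
}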

To specify the prior distribution for the traditional Gibbs sampler we choose α_j ~ N(0, 1), λ_j ~ N+(0, 1), τ_j ~ ℊ(1, 0.2) for j = 1, 2, 3, and β ~ N(0, 10 · I_3). For the PX Gibbs sampler we have α_j^* ~ N(0, 1), λ_j^* ~ N(0, 1), τ_j ~ ℊ(1, 0.2) for j = 1, 2, 3, μ* ~ N(0, 1), β* ~ N(0, 10 · I_3), and ψ ~ ℊ(1/2, 1/2). We run both samplers for 25,000 iterations, excluding the first 5,000 as burn-in, and then compare their performance based on convergence diagnostics such as trace plots and effective sample size (ESS).

It is evident from Figures 3 and 4 that the PX Gibbs sampler dramatically improves mixing. The ratios ESS(PX)/ESS(traditional) for the upper triangular elements of Ω = ΛΛ′ + Σ, starting from the first row and proceeding from left to right, are 162.44, 184.22, 148.47, 143.83, 144.00, 69.98. We use the Raftery and Lewis diagnostic to estimate the number of MCMC samples needed for a small Monte Carlo error in estimating 95% credible intervals for the elements of Ω. The required sample size is different for different elements and also varies depending on whether we are trying to estimate the 0.025 or 0.975 quantile for a particular element of Ω. If we take the maximum sample size over the two quantiles and all the elements, then it turns out that the traditional Gibbs sampler needs to be run for almost an hour, whereas the PX Gibbs sampler takes only a little more than a minute to achieve the same accuracy. We also examine posterior means and 95% credible intervals for the different parameters in the model. On the basis of the 95% credible intervals there does not seem to be any significant effect of covariates such as study center and abstinence time on sperm concentration.
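This run-length calculation can be reproduced with coda's raftery.diag; the sketch below (ours) assumes the post burn-in draws of the elements of Ω are stored columnwise in a matrix draws (name is an assumption) and uses the function's default accuracy settings.

# Sketch: required run lengths for the 0.025 and 0.975 quantiles of each element of Omega.
library(coda)
rl_lo <- raftery.diag(mcmc(draws), q = 0.025, r = 0.005, s = 0.95)
rl_hi <- raftery.diag(mcmc(draws), q = 0.975, r = 0.005, s = 0.95)
max(rl_lo$resmatrix[, "N"], rl_hi$resmatrix[, "N"])   # largest required run length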

Figure 3. Trace plots of factor loadings exhibiting poor mixing using the traditional Gibbs sampler in Application 5.1.

Figure 4. Trace plots of factor loadings exhibiting vastly improved mixing using the parameter expanded Gibbs sampler in Application 5.1.

5.2 Rodent Organ Weight Study

We next illustrate the approach to model selection using parameter expansion through application to organ weight data from a U.S. National Toxicology Program (NTP) 13-week study of Anthraquinone in female Fischer rats. Studies are routinely conducted with 60 animals randomized to approximately six dose groups. At the end of the study, animals are sacrificed and a necropsy is conducted, with overall body weight obtained along with weights for the heart, liver, lungs, kidneys (combined), and thymus. Although body and organ weights are clearly correlated, a challenge in the analysis of these data is the dimensionality of the covariance matrix. In particular, even assuming a constant covariance across dose groups, it is still necessary to estimate p(p+1)/2 = 21 covariance parameters using data from only n = 60 animals. Hence, routine analyses rely on univariate approaches applied separately to body weight and the different organ weights.

Here alternatively we can use a factor model to reduce dimensionality. To determine the appropriate number of factors, we implemented the model selection approach described in Section 4 in the same manner as in the simulations. Body weights were normalized within each dose group prior to analysis for purposes of studying the correlation structure. Here the maximum possible number of factors was m = 3.

The estimated probabilities for the one, two, and three factor models using PS-PX are 0.9209, 0.0714, and 0.0077, respectively. Here the BIC also chooses the one factor model. We also performed some sensitivity analysis by considering (i) t prior distributions with 4 df instead of the default Cauchy prior distributions for the factor loadings, (ii) ℊ(1, 0.4) prior distributions instead of ℊ(1, 0.2) for the precision parameters, and (iii) both (i) and (ii) together. The estimated probabilities under (i), (ii), and (iii) were {0.9412, 0.0544, 0.0044}, {0.9373, 0.0578, 0.0048}, and {0.9612, 0.0369, 0.0019}, respectively. This suggests that our approach is not very sensitive to the prior distribution. The posterior means of factor loadings under the one factor model corresponding to body, heart, liver, lungs, kidneys, and thymus are 0.88, 0.33, 0.52, 0.33, 0.70, and 0.42, respectively. Body weight and kidney weight are the two outcomes having the highest correlation with the latent factor in the one factor analysis. These results suggest that one can bypass the need to rely on univariate analyses for these data, by using a sparse one factor model instead.

6. DISCUSSION

In analyzing high-dimensional, or even moderate-dimensional, multivariate data, one is faced with the problem of estimating a large number of covariance parameters. The factor model provides a convenient dimensionality-reduction technique. However, the routine use of normal and gamma prior distributions has proven to be a bottleneck for posterior computation in these models because of the very slow convergence and mixing exhibited by the resulting Markov chains.

In this article we have proposed a default heavy-tailed prior distribution for factor analytic models, using a parameter expansion approach. Posterior computation can be implemented easily using a Gibbs sampler. This prior distribution leads to a considerable improvement in mixing of the Gibbs chain. Extension of this prior distribution to more general settings with mixed outcomes (Song and Lee 2007) or to factor regression models and structural equation models is straightforward. The computational gain in using the PX Gibbs sampler compared to the traditional one can be tremendous, especially if the outcomes are highly correlated. As evident from the Male Fertility Study data, one may need to run the traditional Gibbs sampler for almost an hour but the PX Gibbs sampler for less than two minutes to achieve the same level of accuracy in estimating 95% credible intervals. Hobert and Marchev (2008) provided theoretical support for PX Gibbs samplers. Note that their work is philosophically different from our method, because their goal is to use PX simply to accelerate MCMC mixing without modifying the prior distribution. We have also outlined a method based on path sampling using our default prior distribution, for computing posterior probabilities of models having different numbers of factors. Good performance of the method was demonstrated using simulated data.

Matlab 7.5.0 was used for coding the PX Gibbs sampler and the PS-PX procedure. The package coda in R 2.6.1 was used to compute the convergence diagnostics. The data used in the analyses and the code can be downloaded from the JCGS website.


ACKNOWLEDGMENTS

This research was supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences. The authors thank the editor Dr. David van Dyk, the associate editor, and the referees for helpful comments.


Supplemental materials for this article are available through the JCGS web page at http://www.amstat.org/publications.

SUPPLEMENTAL MATERIALS

Fertility Data: The dataset contains sperm concentration using three different techniques and some covariates. (fertilitydata.txt, text file)

Weight Data: The dataset contains the rodent organ weights. (weight.txt, text file)

Dose Data: The dataset contains the dose levels for the organ weight dataset. (dose.txt, text file)

Computer Code: This file contains the R and Matlab code to implement the PX-Gibbs sampler described in the paper. (px-code.tar, tar archive)

Code Documents: This file gives a full documentation of the code in px-code.tar and provides variable keys for the datasets. (px-code.pdf, pdf file)

Contributor Information

Joyee Ghosh, Email: jghosh@bios.unc.edu, Department of Biostatistics, The University of North Carolina, Chapel Hill, NC 27599.

David B. Dunson, Email: dunson@stat.duke.edu, Department of Statistical Science, Duke University, Durham, NC 27708-0251.

REFERENCES

1. Arminger G. A Bayesian Approach to Nonlinear Latent Variable Models Using the Gibbs Sampler and the Metropolis–Hastings Algorithm. Psychometrika. 1998;63:271–300.
2. Berger J, Pericchi L. The Intrinsic Bayes Factor for Model Selection and Prediction. Journal of the American Statistical Association. 1996;91:109–122.
3. Carvalho C, Chang J, Lucas J, Nevins J, Wang Q, West M. High-Dimensional Sparse Factor Modelling: Applications in Gene Expression Genomics. Journal of the American Statistical Association. 2008;103(484):1438–1456. doi: 10.1198/016214508000000869.
4. Gelfand AE, Sahu SK, Carlin BP. Efficient Parameterisations for Normal Linear Mixed Models. Biometrika. 1995;82:479–488.
5. Gelman A. Parameterization and Bayesian Modeling. Journal of the American Statistical Association. 2004;99:537–545.
6. Gelman A. Prior Distributions for Variance Parameters in Hierarchical Models. Bayesian Analysis. 2006;3:515–534.
7. Gelman A, Meng XL. Simulating Normalizing Constants: From Importance Sampling to Bridge Sampling to Path Sampling. Statistical Science. 1998;13:163–185.
8. Geweke J, Zhou G. Measuring the Pricing Error of the Arbitrage Pricing Theory. Review of Financial Studies. 1996;9:557–587.
9. Hobert JP, Marchev D. A Theoretical Comparison of the Data Augmentation, Marginal Augmentation and PX-DA Algorithms. Annals of Statistics. 2008;36:532–554.
10. Imai K, van Dyk DA. A Bayesian Analysis of the Multinomial Probit Model Using Marginal Data Augmentation. Journal of Econometrics. 2005;124:311–334.
11. Kinney S, Dunson DB. Fixed and Random Effects Selection in Linear and Logistic Models. Biometrics. 2007;63:690–698. doi: 10.1111/j.1541-0420.2007.00771.x.
12. Lee SY, Song XY. Bayesian Selection on the Number of Factors in a Factor Analysis Model. Behaviormetrika. 2002;29:23–40.
13. Liu C, Rubin DB, Wu YN. Parameter Expansion to Accelerate EM: The PX-EM Algorithm. Biometrika. 1998;85:755–770.
14. Liu JS, Wu YN. Parameter Expansion for Data Augmentation. Journal of the American Statistical Association. 1999;94:1264–1274.
15. Lopes HF, West M. Bayesian Model Assessment in Factor Analysis. Statistica Sinica. 2004;14:41–67.
16. Meng XL, van Dyk DA. Seeking Efficient Data Augmentation Schemes via Conditional and Marginal Augmentation. Biometrika. 1999;86:301–320.
17. Natarajan R, McCulloch CE. Gibbs Sampling With Diffuse Proper Priors: A Valid Approach to Data-Driven Inference? Journal of Computational and Graphical Statistics. 1998;7:267–277.
18. Pournara I, Wernisch L. Factor Analysis for Gene Regulatory Networks and Transcription Factor Activity Profiles. BMC Bioinformatics. 2007;8:61. doi: 10.1186/1471-2105-8-61.
19. Robert CP, Casella G. Monte Carlo Statistical Methods. New York: Springer-Verlag; 2004.
20. Rowe DB. Correlated Bayesian Factor Analysis. Ph.D. thesis, Dept. of Statistics, University of California, Riverside, CA; 1998.
21. Sanchez BN, Budtz-Jorgensen E, Ryan LM, Hu H. Structural Equation Models: A Review With Applications to Environmental Epidemiology. Journal of the American Statistical Association. 2005;100:1442–1455.
22. Song XY, Lee SY. Bayesian Estimation and Test for Factor Analysis Model With Continuous and Polytomous Data in Several Populations. British Journal of Mathematical & Statistical Psychology. 2001;54:237–263. doi: 10.1348/000711001159546.
23. Song XY, Lee SY. Bayesian Analysis of Latent Variable Models With Non-Ignorable Missing Outcomes From Exponential Family. Statistics in Medicine. 2007;26:681–693. doi: 10.1002/sim.2530.
24. West M. Bayesian Factor Regression Models in the Large p, Small n Paradigm. In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M, editors. Bayesian Statistics. Vol. 7. Oxford: Oxford University Press; 2003.
