Abstract
We focus on sparse modelling of high-dimensional covariance matrices using Bayesian latent factor models. We propose a multiplicative gamma process shrinkage prior on the factor loadings which allows introduction of infinitely many factors, with the loadings increasingly shrunk towards zero as the column index increases. We use our prior on a parameter-expanded loading matrix to avoid the order dependence typical in factor analysis models and develop an efficient Gibbs sampler that scales well as data dimensionality increases. The gain in efficiency is achieved by the joint conjugacy property of the proposed prior, which allows block updating of the loadings matrix. We propose an adaptive Gibbs sampler for automatically truncating the infinite loading matrix through selection of the number of important factors. Theoretical results are provided on the support of the prior and truncation approximation bounds. A fast algorithm is proposed to produce approximate Bayes estimates. Latent factor regression methods are developed for prediction and variable selection in applications with high-dimensional correlated predictors. Operating characteristics are assessed through simulation studies, and the approach is applied to predict survival times from gene expression data.
Keywords: Adaptive Gibbs sampling, Factor analysis, High-dimensional data, Multiplicative gamma process, Parameter expansion, Regularization, Shrinkage
1. Introduction
Factor models aim to explain the dependence structure among high-dimensional observations through a sparse decomposition of a p × p covariance matrix Ω as ΛΛT + Σ, where Λ is a p × k factor loadings matrix with k ≪ p and Σ is a p × p diagonal matrix with nonnegative diagonal entries. A popular approach to ensure identifiability of the loading elements is to constrain the loading matrix to be lower triangular with positive diagonal entries (Geweke & Zhou, 1996). Factor models have been traditionally applied in behavioural and social sciences, where the latent factors have a natural interpretation as certain unobserved psychological traits. A more recent approach (West, 2003; Carvalho et al., 2008) uses the above sparse characterization as a dimensionality reduction tool in large p and small n applications such as gene expression studies.
A Bayesian specification of the factor model (Arminger & Muthén, 1998; Song & Lee, 2001) commonly uses inverse gamma priors on the residual variances and normal and truncated normal priors on the off-diagonal and diagonal elements of the loadings matrix, respectively. Such choices lead to conditionally conjugate forms of the posterior distribution and enable posterior computation by a straightforward Gibbs sampler. However, it has been observed that these choices lead to poorly behaved Gibbs samplers with slow mixing when some of the outcomes are highly correlated. Posterior inference also tends to be sensitive to certain hyperparameters. To address these issues, Ghosh & Dunson (2009) use parameter expansion (Liu & Wu, 1999; Gelman, 2006) to induce a heavy-tailed default prior distribution on the loading elements and propose an efficient Gibbs sampler.
Inference on the number of factors in factor analysis models is both conceptually and computationally challenging. Some of the early papers in this direction (discussion paper by Polasek, 1997, University of Basel) involve computation of the marginal likelihoods under models with different numbers of factors. Lopes & West (2004) proposed a reversible jump Markov chain Monte Carlo algorithm to allow for uncertainty in the number of factors. Lee & Song (2002) developed a path sampling approach instead. A more recent method infers the number of factors by zeroing a subset of the loading elements using Bayesian variable selection priors (Lucas et al., 2006; Carvalho et al., 2008); see also the 2009 discussion paper from the University of Chicago Booth School of Business by Frühwirth-Schnatter and Lopes. Ando (2009) proposed an approach for calculating the exact marginal likelihood in Bayesian factor analysis with heavy-tailed priors. This method can be used for rapid estimation of the number of factors, but may be sensitive to subjectively chosen priors.
In this article we introduce a multiplicative gamma process shrinkage prior that allows introduction of infinitely many factors, with the loadings increasingly shrunk towards zero as the column index increases. The key to our approach lies in the fact that for purpose of prediction or inference on the covariance matrix, identifiability of the loadings is not necessary. In standard factor models, the identifiability constraints induce undesirable properties, such as a priori order dependence in the off-diagonal entries of the covariance matrix. Our proposed prior is placed on a parameter expanded factor loadings matrix, making the induced prior on the covariance matrix invariant to ordering of the data. The shrinkage prior allows us to adaptively select a truncation of the infinite loadings to one having finite columns, which facilitates the posterior computation while providing an accurate approximation to the infinite factor model.
2. Bayesian factor models
2.1. Model and prior specification
The generic form of a latent factor model is
yi = Ληi + ∊i (i = 1, …, n),  (1)
where yi is a p-dimensional continuous response, Λ is a p × k factor loadings matrix, ηi ∼ Nk(0, Ik) are latent factors and ∊i is an idiosyncratic error with covariance Σ = diag(σ12, …, σp2). We follow standard practice in normalizing the data prior to analysis and hence do not include an intercept term in (1). Each observation yi is assumed to have independent components given the factors, and dependence among the components is induced by marginalizing over the distribution of the factors, so marginally yi ∼ Np(0, Ω) with Ω = ΛΛT + Σ. In practical applications involving moderate to large p, the number of factors is typically much smaller than p, inducing a sparse characterization of the unknown covariance matrix Ω.
The above decomposition of Ω is not unique and there are actually infinitely many possibilities, since Λ1 = ΛP also satisfies the above condition for any semi-orthogonal matrix P (P PT = I). The usual lower triangular constraint for identifiability (Geweke & Zhou, 1996) induces order dependence among the responses, with the choice of the first k response variables being an important modelling decision (Carvalho et al., 2008). From a Bayesian perspective, one does not require identifiability of the loading elements for a wide class of applications including covariance matrix estimation, variable selection and prediction. The above fact has been exploited in our approach to define the prior on a parameter-expanded loadings matrix with redundant parameters, resulting in better computational properties while simplifying the theory.
Letting ΘΛ denote the collection of all matrices Λ with p rows and infinitely many columns such that ΛΛT is a p × p matrix with all entries finite, we have
∑h=1∞ λjh2 < ∞ (j = 1, …, p).  (2)
Using the Cauchy–Schwarz inequality, it is straightforward to show that all the entries of ΛΛT are finite if and only if the condition in (2) is satisfied. Let ΘΣ denote the set of p × p diagonal matrices with nonnegative entries and let Θ denote all p × p positive semi-definite matrices. Consider the function g : ΘΛ × ΘΣ → Θ corresponding to g(Λ, Σ) = ΛΛT + Σ.
Lemma 1. For any (Λ, Σ) ∈ ΘΛ × ΘΣ, we have g(Λ, Σ) ∈ Θ.
All proofs can be found in the Appendix. The image of ΘΛ × ΘΣ under g is the set {Ω : Ω = g(Λ, Σ), (Λ, Σ) ∈ ΘΛ × ΘΣ}. Letting g−1(Ω) ⊂ ΘΛ × ΘΣ denote the pre-image of Ω ∈ Θ, it is straightforward to show that the set g−1(Ω) contains at least one element for any Ω ∈ Θ, so that the image of ΘΛ × ΘΣ under g is the set Θ. For example, one element corresponds to (Λ, 0p), with Λ = (Ω1/2 : 0p×∞), Ω1/2 a Cholesky decomposition of Ω and 0p denoting a p × p matrix of zeros. Thus g is a continuous surjective function. However, g is not bijective, and in general the cardinality of g−1(Ω) is ∞. Lemma 2 states a regularity property of g, which is later used to prove sup-norm support of the proposed prior.
Lemma 2. Let (Λ0, Σ0) be an arbitrary element of ΘΛ × ΘΣ. For ∊ > 0, define the following ∊-ball around (Λ0, Σ0), B∊(Λ0, Σ0) = {(Λ, Σ) ∈ ΘΛ × ΘΣ : d2(Λ, Λ0) < ∊, d∞(Σ, Σ0) < ∊}, where d2(·, ·) denotes the L2 distance metric on ΘΛ,
d2(Λ, Λ0) = {∑j=1p ∑h=1∞ (λjh − λ0jh)2}1/2
for p × ∞ matrices Λ = (λjh), Λ0 = (λ0jh), and d∞(A, B) = max1⩽r,s⩽p |ars − brs| is the sup-norm metric for p × p matrices A = (ars), B = (brs). Then the image g{B∊(Λ0, Σ0)} is contained in a ball of sup-norm radius ∊* around Ω0 = g(Λ0, Σ0), with ∊* decreasing monotonically to zero as ∊ decreases to zero.
Observe that d2 is well defined and finite on ΘΛ by (2).
We adopt a Bayesian approach and choose independent priors supported on ΘΛ × ΘΣ, which in turn induces a prior on Ω ∈ Θ through the operator g. We place the usual inverse gamma priors on the diagonal elements of Σ. To define a prior supported on ΘΛ, we allow the entries of Λ to decrease in magnitude flexibly as the column index increases. The prior is defined on a parameter-expanded loading matrix without imposing any restriction on the loading elements. The introduction of the redundant parameters simplifies the theory and the induced prior has attractive properties including large support and order-independence. We use a shrinkage-type prior with the degree of shrinkage increasing across the column index as follows,
λjh | ϕjh, τh ∼ N(0, ϕjh−1τh−1), ϕjh ∼ Ga(ν/2, ν/2), τh = ∏l=1h δl, δ1 ∼ Ga(a1, 1), δl ∼ Ga(a2, 1) (l ⩾ 2), σj−2 ∼ Ga(aσ, bσ),  (3)
where the δl (l = 1, 2, …) are independent, τh is a global shrinkage parameter for the hth column and the ϕjhs are local shrinkage parameters for the elements in the hth column. The τhs are stochastically increasing under the restriction a2 > 1, which favours more shrinkage as the column index increases. If we only use the global shrinkage parameter, the prior has a tendency to over-shrink the nonzero loadings. In gene expression examples involving large p, it is often the case that a relatively small proportion of genes are within each pathway. In such applications, we would like to shrink a subset of the elements strongly towards zero while retaining the sparse signals. We refer to the induced prior on the space of covariance matrices as a multiplicative gamma process shrinkage prior.
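To make the construction concrete, the following sketch draws a finite truncation of Λ under the multiplicative gamma process prior and forms the induced covariance matrix. It is a minimal illustration assuming the parameterization in (3) with unit rate parameters on the δl; the function name, the default hyperparameter values and the use of numpy are our own illustrative choices rather than part of the original implementation.

```python
import numpy as np

def sample_mgps_loadings(p, H, a1=2.0, a2=3.0, nu=3.0, rng=None):
    """Draw a p x H truncation of the loadings under the shrinkage prior (3);
    hyperparameter defaults are illustrative."""
    rng = np.random.default_rng() if rng is None else rng
    delta = np.concatenate([rng.gamma(a1, 1.0, size=1),
                            rng.gamma(a2, 1.0, size=H - 1)])
    tau = np.cumprod(delta)                            # column-wise global precisions
    phi = rng.gamma(nu / 2.0, 2.0 / nu, size=(p, H))   # local precisions
    Lambda = rng.normal(scale=1.0 / np.sqrt(phi * tau), size=(p, H))
    return Lambda

# induced covariance under the truncation: Omega_H = Lambda Lambda' + Sigma
rng = np.random.default_rng(0)
p, H = 50, 20
Lambda = sample_mgps_loadings(p, H, rng=rng)
sigma2 = 1.0 / rng.gamma(1.0, 1.0 / 0.3, size=p)       # sigma_j^{-2} ~ Ga(1, 0.3)
Omega_H = Lambda @ Lambda.T + np.diag(sigma2)
```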
2.2. Properties of the shrinkage prior
Let ΠΛ ⊗ ΠΣ denote the prior on (Λ, Σ) defined in (3). We first need to make sure that our prior is well defined so that draws from the above prior are elements of ΘΛ × ΘΣ almost surely.
Proposition 1. If (Λ, Σ) ∼ ΠΛ ⊗ ΠΣ, then ΠΛ ⊗ ΠΣ (ΘΛ × ΘΣ) = 1.
For computational purposes, we would like to approximate the infinite loadings matrix with a finite matrix having few columns relative to the number of outcomes p. As justification, we obtain theoretical bounds on the truncation approximation error. Let (Λ, Σ) ∼ ΠΛ ⊗ ΠΣ and Ω = ΛΛT + Σ be the induced covariance matrix. We can approximate Ω by ΩH = ΛHΛHT + Σ, where ΛH denotes the matrix obtained by setting the columns of Λ from H + 1 onwards to zero, or equivalently discarding those higher indexed columns. The following theorem states that the prior probability of ΩH being arbitrarily close to Ω in an appropriate sense converges exponentially fast to 1 as H tends to ∞.
Theorem 1. If a2 > 2, then for any ∊ > 0,
ΠΛ ⊗ ΠΣ {(Λ, Σ) : d∞(ΩH, Ω) < ∊} > 1 − 6pbaH/{(1 − a)∊}
for H > log{6pb/∊(1 − a)}/ log(1/a), where a = E(δ2−1) and b = 3E(δ1−1), with δ1 and δ2 as in (3).
The proof of the theorem assumes ν = 3, which has been used as a default choice throughout, but the same argument holds for any ν > 2. Although the condition a2 > 2 is sufficient to ensure that a < 1 under the Ga(a2, 1) prior in (3), for any Ga(a2, b2) prior on δ2 the theorem remains valid as long as a = E(δ2−1) = b2/(a2 − 1) < 1, or equivalently a2 > 1 + b2.
Letting Π denote the induced prior on Θ, Π = (ΠΛ ⊗ ΠΣ) ○ g−1 so that for any Borel subset A of Θ, Π(A) = (ΠΛ ⊗ ΠΣ){g−1(A)}. Since g is a continuous and hence measurable map, Π is a well-defined probability measure on (Θ, 𝒜), with 𝒜 the Borel σ-algebra of subsets of Θ.
Proposition 2. If Ω0 is any p × p covariance matrix and B∊∞(Ω0) = {Ω ∈ Θ : d∞(Ω, Ω0) < ∊} is an ∊-neighbourhood of Ω0 under the sup-norm, then Π{B∊∞(Ω0)} > 0 for any ∊ > 0.
Proposition 2 shows that our proposed prior has large support, so places positive probability in arbitrarily small neighbourhoods around any covariance matrix. We use Proposition 2 to show weak consistency of the posterior distribution of Ω in Theorem 2. Denote by K(Ω0, Ω) the Kullback–Leibler divergence between Np(0, Ω0) and Np(0, Ω),
K(Ω0, Ω) = ½ [log{det(Ω)/det(Ω0)} + tr(Ω−1Ω0) − p].
Theorem 2. Fix Ω0 ∈ Θ. For any ∊ > 0, there exists ∊* > 0 such that
Π{Ω : K(Ω0, Ω) < ∊} ⩾ Π{B∊*∞(Ω0)} > 0,
which implies that the posterior distribution of Ω is weakly consistent.
The weak consistency of the posterior follows from the Schwartz (1965) theorem, since any Kullback–Leibler neighbourhood of the true density has positive probability using Proposition 2.
Another attractive property of our prior is that it is free of order dependence, so that the induced prior on Ω is invariant to permutations, with Ω having the same distribution as Ωπ, where Ωπ = (wπrπs) with π any permutation of {1, …, p} and Ω = (wrs). We have wrs = λrTλs for r ≠ s and wrr = λrTλr + σr2, where λj = (λj1, λj2, …)T denotes the jth row of Λ. Conditionally on τ = (τ1, τ2, …)T, the λrhs are independent with λrh | τh ∼ t3(0, τh−1). Since the marginal prior on λr is the same for every r, wrs has the same distribution as wr′s′ for any (r, s) ≠ (r′, s′) such that r ≠ s, r′ ≠ s′. The permutation invariance then follows from the fact that wrr and wr′r′ also have the same distribution for any 1 ⩽ r, r′ ⩽ p.
Although the distribution of wrs does not have a simple form, the first two moments of wrs (r ≠ s) can be obtained as
E(wrs) = 0,  E(wrs2) = 9 ∑h=1∞ E(τh−2).
Thus E(wrs2) is finite if E(δ1−2) is finite and E(δ2−2) < 1; in that case wrs has mean zero and finite variance. One way to ensure the above conditions is to let a1 > 2 and a2 > 3. Hence the induced prior on any of the off-diagonal entries of Ω has mean zero and the parameters a1, a2 dictate the existence of higher order moments. We place gamma priors on a1 and a2 to learn these key hyperparameters from the data.
3. Posterior computation
3.1. Gibbs sampler with a fixed truncation level
We propose a straightforward Gibbs sampler for posterior computation after truncating the loadings matrix to have k* ≪ p columns. An adaptive strategy for inference on the truncation level k* is described in § 3.2. The Gibbs sampler is computationally efficient and mixes rapidly as the shrinkage prior allows block updating of the loadings. The sampler cycles through the following steps.
- Step 1. If we denote the jth row of Λk* by λj = (λj1, …, λjk*)T, then the λjs have independent conditionally conjugate posteriors,
π(λj | −) = Nk*{(Dj−1 + σj−2ηTη)−1σj−2ηTy(j), (Dj−1 + σj−2ηTη)−1} (j = 1, …, p),
where Dj−1 = diag(ϕj1τ1, …, ϕjk*τk*), η = (η1, …, ηn)T, y(j) = (y1j, …, ynj)T and, given the other parameters, π(λj | −) denotes the conditional posterior of λj; a schematic implementation of this blocked update is sketched after the list of steps.
- Step 2. Sample σj−2, j = 1, …, p, from the conditionally independent posteriors
π(σj−2 | −) = Ga{aσ + n/2, bσ + ½ ∑i=1n (yij − λjTηi)2}.
- Step 3. Sample ηi, i = 1, …, n, from the conditionally independent posteriors
π(ηi | −) = Nk*{(Ik* + Λk*TΣ−1Λk*)−1Λk*TΣ−1yi, (Ik* + Λk*TΣ−1Λk*)−1}.
- Step 4. Sample ϕjh from
π(ϕjh | −) = Ga{(ν + 1)/2, (ν + τhλjh2)/2}.
- Step 5. Sample δ1 from
π(δ1 | −) = Ga{a1 + pk*/2, 1 + ½ ∑h=1k* τh(1) ∑j=1p ϕjhλjh2},
and for h ⩾ 2, sample δh from
π(δh | −) = Ga{a2 + p(k* − h + 1)/2, 1 + ½ ∑l=hk* τl(h) ∑j=1p ϕjlλjl2},
where τl(h) = ∏t=1, t≠hl δt for h = 1, …, k*.
- Step 6. Update a1 and a2 using a Metropolis–Hastings step within the Gibbs sampler.
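As an illustration of the block updating in Step 1, the sketch below draws all rows of the truncated loadings matrix from the conjugate normal full conditionals stated above. The function and argument names are hypothetical, and the row-by-row matrix inversion is written for clarity rather than speed.

```python
import numpy as np

def update_loadings_rows(Y, eta, sigma2, phi, tau, rng):
    """Blocked update of the rows of the truncated loadings matrix (Step 1),
    assuming the conjugate normal full conditional stated above."""
    n, p = Y.shape
    k = eta.shape[1]
    Lambda = np.empty((p, k))
    ete = eta.T @ eta                                   # eta' eta, shared across rows
    for j in range(p):
        prec = np.diag(phi[j] * tau) + ete / sigma2[j]  # D_j^{-1} + sigma_j^{-2} eta' eta
        cov = np.linalg.inv(prec)
        mean = cov @ (eta.T @ Y[:, j]) / sigma2[j]      # cov * sigma_j^{-2} eta' y^{(j)}
        Lambda[j] = rng.multivariate_normal(mean, cov)
    return Lambda
```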
3.2. Choosing the number of factors adaptively
In practical situations, we expect to have relatively few important factors compared with the dimension p of the outcomes. Our proposed model with an infinite number of factors obviates the need to pre-specify the number of factors, while the sparsity-favouring prior on the loadings ensures that the effective number of factors will be small when the truth is sparse. However, we need a computational strategy for choosing an appropriate level of truncation k*. We would like to strike a balance between missing important factors by choosing k* too small and wasting computation on an overly conservative truncation level. One can think of k* as the effective number of factors, beyond which the contribution of additional factors is negligible. Starting with a conservative guess k̃ of k*, the posterior samples of Λk̃ from the Gibbs sampler of § 3.1 contain information about the effective number of factors. At the tth iteration of the Gibbs sampler, let m(t) denote the number of columns in Λk̃ having all elements in a pre-specified small neighbourhood of zero. Intuitively, m(t) of the factors have a negligible contribution at the tth iteration. Usual shrinkage priors on the loadings exhibit the phenomenon of factor splitting, in which none of the columns have all loadings close to zero even when k̃ is chosen to be greater than the true number of factors. By shrinking increasingly in later columns, we avoid this problem. We define k*(t) = k̃ − m(t) to be the effective number of factors at iteration t.
The above approach has been shown to produce accurate estimates of the true effective number of factors k* in a number of simulation examples as long as k̃ ⩾ k*. However, in order to be assured that k̃ ⩾ k*, it is typically necessary to choose a very conservative bound in large p applications, which leads to wasted computational effort. Ideally, we would like to discard the redundant factors and continue the sampler with a reduced number of columns in the loadings. With this aim, we modify our sampler described above to an adaptive Gibbs sampler, which tunes the number of factors as the sampler progresses. The adaptations are designed to satisfy the diminishing adaptation condition in Theorem 5 of Roberts & Rosenthal (2007). To be specific, we adapt with probability p(t) = exp(α0 + α1t) at the tth iteration, with α0, α1 chosen so that adaptation occurs around every 10 iterations at the beginning of the chain but decreases in frequency exponentially fast. We generate a sequence ut of uniform random numbers between 0 and 1. At the tth iteration, if ut ⩽ p(t), we monitor the columns in the loadings having all elements within some pre-specified small neighbourhood of zero. If the number of such columns drops to zero, we add a column to the loadings and otherwise discard the redundant columns. The other parameters are also modified accordingly. When we add a factor, we sample parameters from the prior distribution to fill in additional columns, and otherwise retain parameters corresponding to the nonredundant columns.
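The adaptation rule can be sketched as follows, assuming the loadings are stored as a p × k̃(t) array. This is a minimal sketch of the mechanism described above; the resizing of the remaining column-specific parameters is only indicated in comments, since the exact bookkeeping is not spelled out here.

```python
import numpy as np

def adapt_truncation(t, Lambda, phi, alpha0=-1.0, alpha1=-5e-4,
                     tol=1e-4, rng=None):
    """One adaptation step of the adaptive Gibbs sampler (a sketch).
    With probability p(t) = exp(alpha0 + alpha1 * t), columns of Lambda whose
    entries all lie within tol of zero are discarded; if there are none,
    a column drawn from the prior would be appended (omitted here)."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.uniform() >= np.exp(alpha0 + alpha1 * t):
        return Lambda, phi                        # no adaptation this iteration
    redundant = np.all(np.abs(Lambda) < tol, axis=0)
    if redundant.any() and not redundant.all():
        keep = ~redundant                         # drop redundant factors
        Lambda, phi = Lambda[:, keep], phi[:, keep]
    # else: append a new column sampled from the prior, resizing the
    # remaining column-specific parameters (delta, tau) accordingly
    return Lambda, phi
```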
The most common approach for selecting the number of factors relies on fitting the factor model for different choices of k*, and then using the bic or another criterion for selection. This approach can be difficult to implement for large p, small n problems in which maximum likelihood estimates often do not exist, and the bic is not well justified for factor models even for small to moderate p. Lopes & West (2004) compared a number of alternatives, recommending a reversible jump Markov chain Monte Carlo approach that requires a preliminary run for each choice of the number of factors, so it is very computationally intensive. Path sampling faces similar computational hurdles in scaling up to large p. Stochastic search variable selection algorithms have been applied in large p settings, but performance is questionable given the need to update elements of the loadings matrix one at a time, leading to very slow mixing and convergence rates. In the econometrics literature on approximate factor models, there has been recent work (Bai & Ng, 2002; Amengual & Watson, 2007) on consistent estimation of the number of static and dynamic factors as the number of time series and observation times both increase to ∞ at a comparable rate; see also the discussion paper by Onatski (2005), Columbia University.
A significant advantage of our adaptive method is that a single run provides posterior samples of the parameters as well as information about the number of factors, with convergence of the chain guaranteed by the theory in Roberts & Rosenthal (2007). In addition, we save computation by discarding the unimportant factors. Letting k̃(t) denote the truncation level at iteration t and k*(t) = k̃(t) − m(t) denote the effective number of factors, we use the median or mode of {k*(t)} after burn-in as an estimate of k* with credible intervals quantifying uncertainty.
After a reasonable burn-in, Ω(t) = Λ(t)Λ(t)T + Σ(t) represent draws from the approximated marginal posterior distribution of Ω given y1, …, yn, where (Λ(t), Σ(t)) denote the posterior samples at the tth iteration. The posterior samples Ω(t) can be used for inference on Ω. We also propose a fast algorithm for calculating an approximate maximum a posteriori estimate of the covariance matrix, which is useful for arriving at a quick working estimate. Our proposed approach, in the spirit of the stochastic em algorithm (Celeux et al., 1996), replaces draws from the conditional posterior distributions of Λk̃(t), Σ and ϕ in Steps 1, 2 and 4 above by the respective conditional posterior modes.
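The sketch below illustrates the mode-based replacement for Steps 1 and 2, assuming the conjugate full conditional forms given in § 3.1; it is meant as an indicative outline of the stochastic em-type variant, with names and defaults chosen for illustration.

```python
import numpy as np

def mode_updates(Y, eta, phi, tau, sigma2, a_sigma=1.0, b_sigma=0.3):
    """Replace the draws in Steps 1 and 2 by the modes of the same full
    conditionals: the normal mean for each row of Lambda and the gamma
    mode (shape - 1)/rate for each sigma_j^{-2}."""
    n, p = Y.shape
    k = eta.shape[1]
    Lambda = np.empty((p, k))
    ete = eta.T @ eta
    for j in range(p):
        prec = np.diag(phi[j] * tau) + ete / sigma2[j]
        Lambda[j] = np.linalg.solve(prec, eta.T @ Y[:, j] / sigma2[j])
    resid2 = ((Y - eta @ Lambda.T) ** 2).sum(axis=0)    # sum_i (y_ij - lambda_j' eta_i)^2
    shape, rate = a_sigma + n / 2.0, b_sigma + resid2 / 2.0
    sigma2_new = rate / (shape - 1.0)                   # reciprocal of the gamma mode
    return Lambda, sigma2_new
```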
4. Simulation example
4.1. Factor selection and covariance matrix estimation
We consider a number of simulation examples to illustrate our approach and compare with competing methods. We simulated yi, i = 1, …, 200, from a p-dimensional normal distribution with zero mean and covariance matrix Ω = ΛΛT + Σ, where Λ is a p × k matrix and Σ is a p × p diagonal matrix. The diagonal elements of Σ−1 are drawn independently from a Ga(1, 0.25) distribution with mean 4. The number of nonzero elements in each column of Λ is chosen to decrease linearly from 2k to k + 1. We randomly allocate the location of the zeros in each column and simulate the nonzero elements independently from a normal distribution with mean 0 and variance 9.
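The data-generating mechanism just described can be sketched as follows; the rounding of the per-column counts of nonzero loadings and the function name are assumptions where the text leaves details unspecified.

```python
import numpy as np

def simulate_data(p, k, n=200, rng=None):
    """Generate data as in Section 4.1: Ga(1, 0.25) residual precisions,
    per-column nonzero counts decreasing linearly from 2k to k + 1,
    and nonzero loadings drawn from N(0, 9)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma2 = 1.0 / rng.gamma(1.0, 1.0 / 0.25, size=p)   # mean precision 4
    Lambda = np.zeros((p, k))
    counts = np.linspace(2 * k, k + 1, k).round().astype(int)
    for h, c in enumerate(counts):
        rows = rng.choice(p, size=c, replace=False)     # random support in column h
        Lambda[rows, h] = rng.normal(0.0, 3.0, size=c)  # sd 3, variance 9
    Omega = Lambda @ Lambda.T + np.diag(sigma2)
    Y = rng.multivariate_normal(np.zeros(p), Omega, size=n)
    return Y, Omega, Lambda, sigma2
```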
We choose three (p, k) combinations with moderate to large p, namely (100, 5), (500, 10) and (1000, 15). For each pair we consider 50 simulation replicates. We run the adaptive Gibbs sampler for 25 000 iterations with a burn-in of 5000, and collect every 5th sample to thin the chain. We use a default choice of 5 log(p) as the starting number of factors. The hyperparameters aσ and bσ for σj−2 in (3) are 1 and 0.3 respectively, while ν is 3. We place Ga(2, 1) priors on a1 and a2. We choose α0 and α1 in the adaptation probability p(t) as −1 and −5 × 10−4 respectively. We monitor the columns in the loadings having all elements less than 10−4 in magnitude and proceed by adapting the number of factors as in § 3.2. For the stochastic em algorithm, we choose a burn-in of 100 and monitor the estimated covariance matrix every 10 iterations, stopping the chain when the sup-norm distance between the current estimate and the estimate 10 iterations earlier falls within a small tolerance.
The average of the estimated number of factors across the replicates is 6.82, 10.00 and 14.40 corresponding to k = 5, 10 and 15 with empirical 95% intervals for the number of factors (5, 8), (9, 11) and (13, 16), respectively. The estimated covariance matrix in each case is close to the true value, with small mean square error, average and maximum absolute bias. We compare the estimation of the covariance matrix to a recent method by Bickel & Levina (2008) which bands the sample covariance matrix and proposes a resampling scheme for choosing the optimal banding parameter. The stochastic em algorithm was also used to arrive at an approximate maximum a posteriori estimate of the covariance matrix. We provide the summaries of the mean square error, average absolute bias and maximum absolute bias for the three methods across the replicates in Table 1. Based on Table 1, the proposed shrinkage approach does significantly better than the Bickel & Levina (2008) method. The stochastic em algorithm also performs well, especially for smaller values of p.
Table 1.
Comparative performance in covariance matrix estimation in the simulation study. The average, best and worst case performance across 50 simulation replicates in terms of mean square error (×102), average absolute bias (×102) and maximum absolute bias (×102) are tabulated for the different methods
true (p, k) | (100, 5) | | | (500, 10) | | | (1000, 15) | |
---|---|---|---|---|---|---|---|---|---|
method | mgps | Banding | map | mgps | Banding | map | mgps | Banding | map |
mse | |||||||||
mean | 0.2 | 1.3 | 0.2 | 0.10 | 0.4 | 0.10 | 0.10 | 0.3 | 0.10 |
min | 0.1 | 0.9 | 0.1 | 0.02 | 0.4 | 0.05 | 0.02 | 0.2 | 0.05 |
max | 0.3 | 1.6 | 0.3 | 0.20 | 0.5 | 0.30 | 0.4 | 0.5 | 0.30 |
average absolute bias | |||||||||
mean | 1.9 | 3.1 | 1.0 | 0.6 | 0.6 | 0.3 | 0.4 | 0.5 | 0.3 |
min | 1.3 | 2.5 | 0.6 | 0.4 | 0.6 | 0.2 | 0.2 | 0.4 | 0.2 |
max | 2.5 | 4.9 | 1.5 | 0.9 | 0.9 | 0.5 | 0.6 | 0.5 | 0.5 |
maximum absolute bias | |||||||||
mean | 50.9 | 111.0 | 44.8 | 95.4 | 117.8 | 97.7 | 115.0 | 115 | 108.0 |
min | 38.8 | 99.8 | 24.7 | 50.2 | 105.0 | 64.4 | 52.6 | 111 | 74.7 |
max | 74.1 | 131.0 | 105.0 | 152.0 | 131.0 | 162.0 | 242.0 | 240 | 221.0 |
mgps, posterior mean using our proposed multiplicative shrinkage prior; Banding, banding of the sample covariance matrix; map, approximate maximum a posteriori estimate under our proposed prior; mse, mean square error.
4.2. Latent factor regression
It is common in many application areas to have a massive-dimensional vector of candidate predictors, with many of the predictors being moderately to highly correlated. Modifications using penalized least squares methods have been studied extensively. The lasso (Tibshirani, 1996) and the elastic net (Zou & Hastie, 2005) are two of the most popular such methods. In order to select correlated batches of predictors simultaneously, one can potentially use Bayesian latent factor regression (Lucas et al., 2006; Carvalho et al., 2008).
Let yi = (zi, xiT)T, i = 1, …, n, where the xis are (p − 1)-dimensional predictors and the zis are responses. For ease of illustration, we assume the zis to be univariate, though extensions to multivariate cases are straightforward. Also assume that the predictors and response are all continuous. We can use standard data augmentation procedures otherwise. We jointly model the yis as in (1). Our objective is to predict the response zn+1 for a future subject based on the predictors xn+1 for that subject and y1, …, yn. The posterior predictive distribution of zn+1 | xn+1, y1, …, yn is
p(zn+1 | xn+1, y1, …, yn) = ∫ p(zn+1 | xn+1, Ω) dΠ(Ω | y1, …, yn).
For the simulation examples described in § 4.1, let zi = yi1 and xi = (yi2, …, yip)T. We randomly selected two locations in the first row of Λ and assigned values 1 and −1 to those locations, with the other elements in the first row set to zero. The remaining rows of the loadings were simulated as mentioned before. We used a randomly chosen training set of size 100 and held out the zis for the remaining 100 samples. The coverage of 95% predictive intervals averaged across the replicates was 0.95, 0.94 and 0.95 for the three cases, respectively. Table 2 compares the predictive performance with the lasso and elastic net. The proposed approach performs similarly to the lasso and elastic net, but has the advantage of quantifying predictive uncertainty.
Table 2.
Predictive performance in the simulation study. Average, best and worst case performance across 50 simulation replicates are reported for the different methods
true (p, k) | (100, 5) | | | (500, 10) | | | (1000, 15) | |
---|---|---|---|---|---|---|---|---|---|
method | mgps | Lasso | Elastic net | mgps | Lasso | Elastic net | mgps | Lasso | Elastic net |
mspe | |||||||||
mean | 0.63 | 0.55 | 0.55 | 0.41 | 0.38 | 0.38 | 0.95 | 0.87 | 0.88 |
min | 0.32 | 0.33 | 0.33 | 0.18 | 0.22 | 0.22 | 0.57 | 0.55 | 0.56 |
max | 0.89 | 0.79 | 0.78 | 0.86 | 0.57 | 0.56 | 1.48 | 1.44 | 1.44 |
aape | |||||||||
mean | 0.62 | 0.59 | 0.59 | 0.51 | 0.49 | 0.49 | 0.80 | 0.77 | 0.75 |
min | 0.47 | 0.47 | 0.47 | 0.33 | 0.38 | 0.37 | 0.60 | 0.59 | 0.59 |
max | 0.85 | 0.73 | 0.72 | 0.80 | 0.58 | 0.59 | 0.99 | 0.98 | 0.99 |
mape | |||||||||
mean | 2.19 | 2.07 | 2.07 | 1.71 | 1.66 | 1.68 | 2.54 | 2.48 | 2.48 |
min | 1.36 | 1.43 | 1.40 | 1.21 | 1.17 | 1.18 | 1.83 | 1.83 | 1.80 |
max | 3.15 | 2.91 | 2.89 | 2.95 | 2.70 | 2.63 | 3.27 | 3.07 | 3.07 |
mgps, our proposed multiplicative shrinkage prior; mspe, mean squared prediction error; aape, average absolute prediction error; mape, maximum absolute prediction error.
The joint Gaussian model implies that E(zi | xi) = xiTβ, with β = Ωxx−1Ωxz, where Ωxx and Ωxz denote the sub-blocks of Ω corresponding to var(xi) and cov(xi, zi), respectively. The elements of the (p − 1)-dimensional vector β can be considered as the true regression coefficients of z on x. Letting Ω(t) denote the posterior samples of Ω, β(t) = {Ωxx(t)}−1Ωxz(t) give samples from the posterior distribution of β. Since Ωxx(t) = Λx(t)Λx(t)T + Σx(t), where Λx(t) and Σx(t) are the appropriate sub-blocks of Λ(t) and Σ(t), one can use the Sherman–Morrison–Woodbury formula (Hager, 1989) to invert Ωxx(t) at each iteration of the Gibbs sampler, which only requires the inverse of a k*(t) × k*(t) matrix, leading to a many-fold speed up in large p settings.
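For a single posterior draw, the induced regression coefficients can be computed as sketched below, assuming the response occupies the first coordinate of yi as in § 4.1; the Woodbury identity is used so that only a k*(t) × k*(t) system needs to be solved.

```python
import numpy as np

def regression_coefficients(Lambda, sigma2):
    """Compute beta = Omega_xx^{-1} Omega_xz from one draw of (Lambda, Sigma),
    a sketch assuming the response is the first row of the loadings.
    Woodbury: Omega_xx^{-1} = S^{-1} - S^{-1} L (I_k + L' S^{-1} L)^{-1} L' S^{-1}
    with S = Sigma_x diagonal and L = Lambda_x, so only a k x k matrix is inverted."""
    lam_z, L = Lambda[0], Lambda[1:]             # response row, predictor rows
    s = sigma2[1:]                               # diagonal of Sigma_x
    omega_xz = L @ lam_z                         # cross-covariance Omega_xz
    Ls = L / s[:, None]                          # S^{-1} L
    core = np.eye(L.shape[1]) + L.T @ Ls         # I_k + L' S^{-1} L
    beta = omega_xz / s - Ls @ np.linalg.solve(core, Ls.T @ omega_xz)
    return beta
```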
As shown in Table 3, the estimate of β based on our method was close to the truth in each case, with small mean square error, average and maximum absolute bias. The coverage of 95% credible intervals for the elements of β was 0.96, 0.91 and 0.90 for the three cases, respectively.
Table 3.
Performance in estimating regression coefficients in the simulation study. We report the mean square error (×103), average absolute bias (×103) and maximum absolute bias (×103) averaged across 50 simulation replicates for the different methods
true (p, k) | (100, 5) | | | (500, 10) | | | (1000, 15) | |
---|---|---|---|---|---|---|---|---|---|
method | mgps | Lasso | Elastic net | mgps | Lasso | Elastic net | mgps | Lasso | Elastic net |
mse | 1.1 | 1.2 | 1.3 | 0.1 | 0.3 | 0.4 | 0.0 | 0.1 | 0.1 |
aab | 10.1 | 12.4 | 13.0 | 1.7 | 3.9 | 4.1 | 0.9 | 1.8 | 1.9 |
mab | 176.1 | 207.3 | 211.3 | 172.5 | 253.3 | 244.5 | 102.6 | 109.0 | 122.6 |
mgps, our proposed multiplicative shrinkage prior; mse, mean squared error; aab, average absolute bias; mab, maximum absolute bias.
The simulation examples were designed to induce correlation in groups of predictors, so that batches of predictors are included in the response model. The sparsity in the loadings ensures that many of the true regression coefficients are exactly equal to zero, with only a few important predictors. We propose a simple algorithm for variable selection in our framework based on thresholding the posterior mean of β. Let |β̂|(1) < ⋯ < |β̂|(p−1) denote the ordered absolute values of the posterior means for the p − 1 predictors, and let πj = h denote that the jth predictor is the hth smallest in magnitude. Then our thresholding approach sets βj = 0 for all j with πj ⩽ h̃, with h̃ chosen to minimize the mean squared prediction error. Table 4 shows the percentage of false positives and power compared with the lasso and elastic net.
Table 4.
Variable selection performance in the simulation study. Percentage of false positives and power in detecting the true signal reported across 50 simulation replicates (average, best and worst case) for the different methods
true (p, k) | (100, 5) | | | (500, 10) | | | (1000, 15) | |
---|---|---|---|---|---|---|---|---|---|
method | mgps | Lasso | Elastic net | mgps | Lasso | Elastic net | mgps | Lasso | Elastic net |
false positives (%) | |||||||||
mean | 0 | 9 | 7 | 0 | 4.0 | 3 | 0 | 3.0 | 2.0
min | 0 | 0 | 0 | 0 | 0.2 | 0 | 0 | 0.7 | 0.7
max | 0 | 26 | 25 | 0 | 14.0 | 14 | 0 | 8.0 | 10.0
power (%) | |||||||||
mean | 72 | 76 | 77 | 75 | 76 | 77 | 71 | 72 | 72 |
min | 68 | 72 | 74 | 73 | 75 | 76 | 70 | 71 | 71 |
max | 81 | 80 | 83 | 80 | 79 | 79 | 73 | 73 | 72 |
mgps, our proposed multiplicative shrinkage prior.
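The thresholding rule described above can be sketched as follows; the choice of data on which the mean squared prediction error is evaluated, and the cumulative zeroing implementation, are our own assumptions.

```python
import numpy as np

def threshold_beta(beta_hat, X, z):
    """Variable selection by thresholding the posterior mean of beta (a sketch):
    coefficients are zeroed in order of increasing magnitude and the threshold
    minimizing the mean squared prediction error on (X, z) is retained."""
    order = np.argsort(np.abs(beta_hat))         # smallest magnitude first
    best_beta = beta_hat
    best_mspe = np.mean((z - X @ beta_hat) ** 2)
    beta = beta_hat.copy()
    for j in order:
        beta = beta.copy()
        beta[j] = 0.0                            # zero out the next smallest coefficient
        mspe = np.mean((z - X @ beta) ** 2)
        if mspe < best_mspe:
            best_beta, best_mspe = beta, mspe
    return best_beta
```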
The three simulation examples took 2, 14 and 33 seconds per hundred iterations, respectively, in Matlab on an Intel Core 2 Duo machine. The analyses were repeated with different choices of hyperparameter values. We used ν = 3.5, 4, 5 and varied bσ between 0.1 and 0.5. We also used different multiples of log(p) between 3 and 10 for the initial number of factors. The results were robust, with the conclusions unchanged. We observed good mixing for the Gibbs sampler using both exploratory and diagnostic tests. The effective sample size averaged across the elements of β was 55%, 53% and 48% for the three cases, respectively, suggesting excellent computational efficiency.
The true loadings were not simulated from our proposed prior in any of the simulation examples. Although our prior on the loadings can concentrate in arbitrarily small neighbourhoods around zero, it does not allow any of the loading elements to be exactly zero. In the simulation study, many of the true loading elements were set equal to zero, and instead of shrinking the nonzero loadings with the column index, they were all drawn from the same N(0, 9) distribution. To assess robustness when the model is not applicable, we ran simulations with correlated factors and/or correlated idiosyncratic errors, with the errors drawn from an AR(1) process. The results were robust even in these cases; in particular, we always obtained predictive performance similar to that of the elastic net. The adaptive method for factor selection proved to be extremely robust with respect to the choice of the threshold. Although we used 10−4 as a default threshold, the conclusions were mostly unchanged even with a threshold as small as 10−9. Also, one can use either the median or the mode of the samples k*(t) as an estimate of the number of factors, as they gave the same answer on all occasions. The simulation study clearly highlights the merit of our method in a variety of applications, with much improved performance over competitors in terms of covariance matrix estimation, regression coefficient estimation and variable selection.
5. Diffuse large-B-cell lymphoma application
5.1. Background
Lymphoma is a cancer that occurs when lymphocytes, a type of white blood cell, grow abnormally. Diffuse large B-cell lymphoma is the most common lymphoma among adults and has a high mortality rate. Rosenwald et al. (2002) analysed biopsy samples from 240 patients with untreated diffuse large B-cell lymphoma and identified 17 genes predictive of survival after chemotherapy. Segal (2006) reanalysed the data using penalized methods. The patients in the study were followed up after collection of biopsy specimens, with a median follow-up of 2.8 years. For each patient, a potentially right-censored survival time is available along with 7399 features representing 4128 genes from the Lymphochip cDNA microarray. Rosenwald et al. (2002) divided the patients into a training set of 160 patients and a validation set of 80 patients to gauge predictive performance.
Rosenwald et al. (2002) used hierarchical clustering to identify four signature groups whose expressions were correlated with the survival times. They also identified a subset of 17 genes predictive of overall survival after chemotherapy. Gui & Li (2005), Segal (2006) and Ma & Huang (2007) analysed this data using penalized methods. In each case, the selected features mostly belonged to one of the four signature groups in Rosenwald et al. (2002), though the individual selected features varied across the methods.
5.2. Model and results
Our interest lies in simultaneously identifying an important subset of the features and obtaining a predictive model for the exact survival times. Let Ti denote the survival time for the ith patient and let xi denote the corresponding 7399-dimensional feature vector. There were 72 patients in the training set whose survival times were right-censored. Possibly due to rounding, there were some survival times equal to zero, so we added one unit to the survival times of all the patients. We took the logarithm of the shifted survival times and appended them to the xis to create a p-dimensional vector yi, where p = 7400 and zi = log(1 + Ti). We model the yis jointly as in § 4.2 after normalizing them. The joint Gaussian model implies an accelerated failure time model for the survival times, since the conditional mean of the log-shifted survival time zi given the predictors xi is linear in xi. Since the exact survival times are known for the uncensored subjects, the response was normalized with the mean and standard deviation of those subjects only, and an intercept for the response was added to the model. A normal prior with zero mean and variance one was placed on the intercept. The posterior computation proceeds exactly as in § 3, but an additional step is needed to impute the shifted log survival times for the censored subjects from a truncated normal distribution, truncated below by the transformed censoring time. We ran the adaptive Gibbs sampler for 25 000 iterations with 5000 burn-in and collected every fifth sample after burn-in to thin the chain. The estimated number of factors was 20, with a 95% credible interval of (19, 21).
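The additional imputation step for a right-censored subject can be sketched as follows, assuming scipy's truncated normal sampler; the conditional mean and standard deviation of the log survival time given the features would come from the current draw of Ω as in § 4.2, and the names here are illustrative.

```python
import numpy as np
from scipy.stats import truncnorm

def impute_censored_log_time(mean, sd, censor_log_time):
    """Draw a right-censored log survival time from its full conditional,
    a normal truncated below at the transformed censoring time; `mean` and
    `sd` denote the conditional mean and standard deviation of z given x
    under the current parameter draw (a sketch)."""
    a = (censor_log_time - mean) / sd    # standardized lower truncation point
    return truncnorm.rvs(a, np.inf, loc=mean, scale=sd)
```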
We thresholded the posterior mean of the regression coefficients as described in § 4.2 to perform variable selection. The thresholding approach selected 17 features, with all of the features belonging to three of the four signature groups mentioned in Rosenwald et al. (2002). The three signature groups were the germinal-centre B-cell signature, the major histocompatibility complex class II signature and the lymph-node signature, while no genes in the proliferation signature group were selected. The top features mentioned in Gui & Li (2005) and Segal (2006) also come from the same three signature groups. In Table 5, we provide a brief description of the top five genes selected using our approach.
Table 5.
Feature selection in the diffuse large-B-cell lymphoma data
Unique ID | GenBank ID | Signature | Description |
---|---|---|---|
24094 | AI476194 | lymph | CD63 antigen (melanoma 1 antigen) |
17048 | AA085368 | lymph | CD63 antigen (melanoma 1 antigen) |
29636 | NM005194 | lymph | CCAAT/enhancer binding protein (C/EBP), β
34818 | U83461 | lymph | solute carrier family 31 (copper transporters), member 2 |
24394 | AA729055 | MHC | major histocompatibility complex, class II, DR α |
Lymph, lymph-node signature; MHC, major histocompatibility complex; GenBank, National Institute of Health genetic sequence database.
Among the features selected by our approach, the ones with GenBank ID AA729055, AA805575 and X59812 also appear in Gui & Li (2005) and Segal (2006). Although standard penalization methods tend to select one of a correlated group of important predictors, our approach is designed to allow selection of highly correlated predictors into the same model. This is illustrated in Table 5, as the first two predictors have a correlation coefficient of 0.96. There are several groups of highly correlated predictors in the selected set of 17.
Segal (2006) obtained modest predictive accuracy using a variety of methods, and so advocated exercising care before making a prognosis based only on the gene expressions. Our analysis also suggested that the gene expression data explain only a small proportion of the variability in the survival times. The 95% predictive intervals for the survival times in the test sample were wide and contained the true survival times for the uncensored observations in all cases. The mean square prediction error and mean absolute prediction error for the uncensored observations were 1.31 and 0.89, while the corresponding values for the lasso trained with the uncensored observations in the training sample were 1.28 and 0.90. The proportion of times the predicted survival times for the censored observations exceeded the censoring time was 0.54. We also performed sensitivity analysis by varying ν, the initial values of a1 and a2 and the prior variance of the intercept. The conclusions were unchanged, with the same set of top 10 genes selected on all occasions.
Acknowledgments
This research was partially supported by a grant from the National Institute of Environmental Health Sciences of the National Institutes of Health, U.S.A. The authors would like to thank Mark Segal for sharing the diffuse large B-cell lymphoma data.
Appendix
Proofs
Proof of Lemma 1. It is enough to show that, for any Λ ∈ ΘΛ, ΛΛT is positive semi-definite. For any vector υ ∈ ℝp, υTΛΛTυ is finite since all elements of ΛΛT are finite. The proof is completed by observing that υTΛΛTυ = ‖ΛTυ‖2 ⩾ 0, where ‖ · ‖ denotes the Euclidean norm.
Proof of Lemma 2. Let Ω = (wrs) = g(Λ, Σ) and Ω0 = (w0rs) = g(Λ0, Σ0) for (Λ, Σ) ∈ B∊(Λ0, Σ0); clearly d∞(Ω, Ω0) = max1⩽r,s⩽p |wrs − w0rs|. For any 1 ⩽ r, s ⩽ p,
|wrs − w0rs| ⩽ ‖λr − λ0r‖ ‖λs‖ + ‖λ0r‖ ‖λs − λ0s‖ + d∞(Σ, Σ0) < (2M0 + 1)∊ + ∊2
by the Cauchy–Schwarz inequality, where λj and λ0j denote the jth rows of Λ and Λ0 and M0 = max1⩽j⩽p ‖λ0j‖. Thus d∞(Ω, Ω0) ⩽ ∊*, with ∊* = (2M0 + 1)∊ + ∊2.
Proof of Proposition 1. Clearly ΠΣ(ΘΣ) = 1, so it is enough to show ΠΛ(ΘΛ) = 1. The ϕjhs are independent of the δhs. Hence marginalizing over the ϕjhs yields λjh | τh ∼ t3(0, τh−1), where tν(μ, σ2) denotes the t distribution with ν degrees of freedom having location μ and scale σ2. By the Cauchy–Schwarz inequality,
|∑h=1∞ λjhλj′h| ⩽ (∑h=1∞ λjh2)1/2 (∑h=1∞ λj′h2)1/2
and thus
max1⩽j,j′⩽p |∑h=1∞ λjhλj′h| ⩽ max1⩽j⩽p ∑h=1∞ λjh2.
Hence all the elements of ΛΛT are bounded in absolute value by M, where M = max1⩽j⩽p Mj with Mj = ∑h=1∞ λjh2. Now,
E(Mj) = ∑h=1∞ E(3/τh) = 3E(δ1−1) ∑h=1∞ ah−1 = 3E(δ1−1)/(1 − a),
where a = E(δ2−1) = 1/(a2 − 1) and a < 1 if a2 > 2. Hence E(Mj) < ∞ for each j and thus M is finite almost surely. It follows that ΠΛ ⊗ ΠΣ (ΘΛ × ΘΣ) = 1.
Proof of Theorem 1. Write ΔH = Ω − ΩH = ΛΛT − ΛHΛHT. Clearly d∞(Ω, ΩH) = max1⩽r,s⩽p |dHrs|, where dHrs is the rsth entry of ΔH, so that dHrs = ∑h>H λrhλsh. An application of the Cauchy–Schwarz inequality as in the previous proof gives
|∑h>H λrhλsh| ⩽ (∑h>H λrh2)1/2 (∑h>H λsh2)1/2,
which implies d∞(Ω, ΩH) ⩽ max1⩽j⩽p ∑h>H λjh2. Now, for a fixed ∊ > 0,
where the equality in the second line follows from the fact that ∑h>H λjh2 (j = 1, …, p) are conditionally independent and identically distributed given δ = (δ1, δ2, …)T, and the subsequent two inequalities use Jensen's and Chebyshev's inequalities respectively. Now,
where a = E(δ2−1) = 1/(a2 − 1) < 1 if a2 > 2, and the third equality is a direct consequence of Fubini's theorem. Now use the inequality (1 − x/2) > exp(−x) for 0 < x ⩽ 1.5 to get
if H > log{2b/∊(1 − a)}/ log(1/a). Hence
for 6aH pb/{(1 − a)∊} < 1 or H > log{6pb/∊(1 − a)}/ log(1/a).
Proof of Proposition 2. Let Λ* be a p × k matrix (k ⩽ p) and Σ0 ∈ ΘΣ such that Λ*Λ*T + Σ0 = Ω0. Set Λ0 = (Λ* : 0p×∞); then (Λ0, Σ0) ∈ ΘΛ × ΘΣ, with g(Λ0, Σ0) = Ω0. Fix ∊ > 0 and choose ∊1 > 0 such that (2M0 + 1)∊1 + ∊12 < ∊, with M0 as in the proof of Lemma 2. By Lemma 2, g{B∊1(Λ0, Σ0)} ⊂ B∊∞(Ω0), and thus Π{B∊∞(Ω0)} ⩾ ΠΛ ⊗ ΠΣ{B∊1(Λ0, Σ0)} = ΠΛ{Λ : d2(Λ, Λ0) < ∊1} ΠΣ{Σ : d∞(Σ, Σ0) < ∊1}. Clearly, ΠΣ{Σ : d∞(Σ, Σ0) < ∊1} > 0, so it is enough to show ΠΛ{Λ : d2(Λ, Λ0) < ∊1} > 0. We have,
by the following Lemma.
Lemma 3. Fix 1 ⩽ j ⩽ p. For any ∊ > 0, almost surely.
Proof of Lemma 3. We have λ0jh = 0 for h > k. Thus, for any H ⩾ k,
By Theorem 1, as H → ∞, hence we can find H0 > k such that and thus almost surely. The proof is completed by observing that almost surely for any H < ∞.
Proof of Theorem 2. Fix ∊ > 0, Ω0 ∈ Θ. We have,
Let u0 = det Ω0 and find ∊1 > 0 such that |u − u0| < ∊1 implies |log u − log u0| < ∊. Since det(·) is a continuous function from Θ to ℝ, we can find ∊2 such that d∞(Ω0, Ω) < ∊2 implies |det(Ω0) − det(Ω)| < ∊1. Now tr(Ω−1Ω0) − p = ∑i=1p (λi − 1), where λ1 ⩽ … ⩽ λp are the eigenvalues of Ω−1Ω0. Since Ω and Ω0 are both positive definite,
where x is any p-dimensional vector with xTx = 1. For any x ∈ ℝp with xTx = 1,
Now
and
where λmin(Ω0) > 0 denotes the smallest eigenvalue of Ω0. Choose 0 < ∊3 < λmin(Ω0)/2p such that 2p2∊3/λmin(Ω0) < ∊. We have
for all Ω such that d∞(Ω0, Ω) < ∊3, since |xTΩ0x − xTΩx| < λmin(Ω0)/2 and thus xTΩx > λmin(Ω0)/2. Choose ∊* = min{∊2, ∊3}; then for d∞(Ω0, Ω) < ∊*, we have,
which proves Theorem 2.
References
- Amengual D, Watson M. Consistent estimation of the number of dynamic factors in a large N and T panel. J Bus Econ Statist. 2007;25:91–6.
- Ando T. Bayesian factor analysis with fat-tailed factors and its exact marginal likelihood. J Mult Anal. 2009;100:1717–26.
- Arminger G, Muthén B. A Bayesian approach to nonlinear latent variable models using the Gibbs sampler and the Metropolis–Hastings algorithm. Psychometrika. 1998;63:271–300.
- Bai J, Ng S. Determining the number of factors in approximate factor models. Econometrica. 2002;70:191–221.
- Bickel P, Levina E. Regularized estimation of large covariance matrices. Ann Statist. 2008;36:199–227.
- Carvalho C, Chang J, Lucas J, Nevins J, Wang Q, West M. High-dimensional sparse factor modeling: applications in gene expression genomics. J Am Statist Assoc. 2008;103:1438–56. doi: 10.1198/016214508000000869.
- Celeux G, Chauveau D, Diebolt J. Stochastic versions of the EM algorithm: an experimental study in the mixture case. J Statist Comp and Simul. 1996;55:287–314.
- Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 2006;1:515–34.
- Geweke J, Zhou G. Measuring the pricing error of the Arbitrage Pricing Theory. Rev. Finan. Studies. 1996;9:557–87.
- Ghosh J, Dunson D. Default prior distributions and efficient posterior computation in Bayesian factor analysis. J Comp Graph Statist. 2009;18:306–20. doi: 10.1198/jcgs.2009.07145.
- Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21:3001–8. doi: 10.1093/bioinformatics/bti422.
- Hager W. Updating the inverse of a matrix. SIAM Rev. 1989;31:221–39.
- Lee S, Song X. Bayesian selection on the number of factors in a factor analysis model. Behaviormetrika. 2002;29:23–39.
- Liu J, Wu Y. Parameter expansion for data augmentation. J Am Statist Assoc. 1999;94:1264–74.
- Lopes H, West M. Bayesian model assessment in factor analysis. Statist. Sinica. 2004;14:41–68.
- Lucas J, Carvalho C, Wang Q, Bild A, Nevins J, West M. Sparse statistical modelling in gene expression genomics. In: Müller P, Do K, Vannucci M, editors. Bayesian Inference for Gene Expression and Proteomics. Cambridge: Cambridge University Press; 2006. pp. 155–76.
- Ma S, Huang J. Additive risk survival model with microarray data. BMC Bioinformatics. 2007;8:192. doi: 10.1186/1471-2105-8-192.
- Roberts G, Rosenthal J. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. J Appl Prob. 2007;44:458–475.
- Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Mueller-Hermelink HK, Smeland EB, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New Engl J Med. 2002;346:1937–47. doi: 10.1056/NEJMoa012914.
- Schwartz L. On Bayes procedures. Prob. Theory Rel. Fields. 1965;4:10–26.
- Segal M. Microarray gene expression data with linked survival phenotypes: diffuse large-B-cell lymphoma revisited. Biostatistics. 2006;7:268–85. doi: 10.1093/biostatistics/kxj006.
- Song X, Lee S. Bayesian estimation and test for factor analysis model with continuous and polytomous data in several populations. Br J Math Statist Psychol. 2001;54:237–63. doi: 10.1348/000711001159546.
- Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B. 1996;58:267–88.
- West M. Bayesian factor regression models in the “large p, small n” paradigm. Bayesian Statist. 2003;7:723–32.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. J. R. Statist. Soc. B. 2005;67:301–20.