Author manuscript; available in PMC: 2014 Apr 23.
Published in final edited form as: Ann Appl Stat. 2012 Dec;7(4):2431–2457. doi: 10.1214/13-AOAS643

MULTI-WAY BLOCKMODELS FOR ANALYZING COORDINATED HIGH-DIMENSIONAL RESPONSES

Edoardo M Airoldi *, Xiaopei Wang , Xiaodong Lin
PMCID: PMC3935422  NIHMSID: NIHMS494268  PMID: 24587846

Abstract

We consider the problem of quantifying temporal coordination between multiple high-dimensional responses. We introduce a family of multi-way stochastic blockmodels suited to this problem, which avoids pre-processing steps such as binning and thresholding that are commonly adopted for this type of problem in biology. We develop two inference procedures based on collapsed Gibbs sampling and variational methods. We provide a thorough evaluation of the proposed methods on simulated data, in terms of membership and blockmodel estimation, out-of-sample prediction, and run-time. We also quantify the effects of censoring procedures such as binning and thresholding on the estimation tasks. We use these models to carry out an empirical analysis of the functional mechanisms driving the coordination between gene expression and metabolite concentrations during carbon and nitrogen starvation in S. cerevisiae.

1. Introduction

In recent years, the biology community at large has engaged in an effort to characterize coordinated mechanisms of cellular regulation, to enable a systems-level understanding of cellular functions. Reference databases, such as the yeast genome database (SGD), catalog the many regulatory roles of genes and proteins with links to the originating literature (Cherry et al., 1997; Kanehisa and Goto, 2000). Recent work spans approaches that leverage these databases to integrate genomic information across multiple studies and technologies about the same regulatory mechanism, e.g. transcription (Cope et al., 2004; Franks et al., 2012), as well as approaches to integrate genomic information across levels of regulation, e.g., epigenetic markers, chromatin modifications, transcription, and translation (Troyanskaya et al., 2003; Lu et al., 2009; Markowetz et al., 2009).

We consider the problem of quantifying temporal coordination between gene expression and metabolite concentrations in yeast (Brauer et al., 2006, 2008). More generally, we are interested in statistical methods to analyze multiple coordinated high-dimensional measurements about a system organism, where correlation among pairs of measurements is believed to indicate coordinated functional and regulatory roles. We develop methods for analyzing experiments on regulation dynamics that involve: (1) data collections about multiple stages of regulations (transcriptional and metabolic) that offer complementary views of the cellular response (to Nitrogen and Carbon starvation), quantified in terms of high-dimensional measurements; and (2) data collected according to a specific coordinated temporal design, whereby the experiments at different stages of regulation are conducted on cell-cultures with matching conditions (nutrient limitations, environmental stress and chemical compounds present) over time. Coordinated time-courses about complementary stages of regulation arguably provide the best opportunity to characterize coordinated regulation dynamics, quantitatively.

A popular approach to studying coordinated cellular responses in biology involves Bayesian networks (Bradley et al., 2009; Troyanskaya et al., 2003). This approach requires binning real-valued measurements into discrete categories. A deterministic alternative for exploring coordination is the cross-associations algorithm (Chakrabarti et al., 2004), which instead requires thresholding the matrix of correlations between pairs of genes and metabolites into binary on-off relations. While binning and thresholding are accepted data pre-processing steps in the computational biology literature, they raise serious statistical issues (Blocker and Meng, 2013). On the one hand, the lack of appropriate and principled alternatives, together with the sizable amount of data typical in a coordinated study of cellular responses, e.g., genome-wide expression and hundreds of metabolites, makes pre-processing necessary. These pre-processing steps reduce the computational burden of the analysis with Bayesian networks and cross-associations. On the other hand, these pre-processing steps are essentially censoring mechanisms that may compromise the patterns of variation and co-variation in the original data, when the discovery of such patterns, local and global, is the primary goal of the analysis (Turnbull, 1976; Vardi, 1985).

In this paper, we develop a family of blockmodels to analyze a correlation matrix among sets of temporally paired measurements on two distinct populations of objects. Our work extends a recent block modeling approach that leverages the notion of structural equivalence (Snijders and Nowicki, 1997; Nowicki and Snijders, 2001) to the analysis of coordinated measurements on two populations. For more details on blockmodels, see Goldenberg et al. (2009). Section 2 introduces two-way (and multi-way) stochastic blockmodels for a function of the high-dimensional responses, such as their correlation. These simple models explicitly allow different objects in the two (or more) populations to be associated with multiple blocks, say, of correlation, to different degrees, and do not require binning or thresholding. Estimation and inference using variational methods are outlined in Section 2.4. Details of variational inference are provided in Appendix A, while Appendix B provides details of an alternative MCMC inference. Section 3 develops a thorough evaluation of the proposed methods on simulated data, including a comparative evaluation of the MCMC and variational inference procedures in terms of: (1) membership and blockmodel matrix estimation, (2) out-of-sample prediction, and (3) run-time. We assess the effects of thresholding on inference in Section 3.7. In Section 4, we analyze two recently published collections of time course data to explore the functional mechanisms underlying the coordination of transcription and metabolism during carbon and nitrogen starvation in S. cerevisiae. We compare the results with published results on the same data obtained using binning and Bayesian networks, and with new results we obtain using thresholding and cross-associations.

2. Multi-way stochastic blockmodels

In this section we introduce multi-way stochastic blockmodels and the associated inference procedures. This family of models generalizes mixed-membership stochastic blockmodels for analyzing interactions within a single population (Airoldi et al., 2008) to interactions between two or more populations. Multi-way stochastic blockmodels enable the discovery of interactions between latent groups across different populations, and provide estimates of the group memberships for each subject. We develop two inference strategies: one based on collapsed Gibbs sampling (Liu, 1994), the other based on variational Expectation-Maximization (vEM) (Jordan et al., 1999; Airoldi, 2007).

2.1. Two-way blockmodels

Consider a two-way interaction table between two sets of nodes 𝒩1 and 𝒩2 of size N1 and N2, respectively. These two sets of nodes represent elements of two distinct populations. An observation Y (j, k), j = 1, …, N1, k = 1, …, N2, denotes the strength of the interaction between the jth element of 𝒩1 and the kth element of 𝒩2.

As a running example, we consider the coordinated time course data we analyze in Section 4. The data consist of N1 time series of gene expression levels and N2 time series of metabolite concentrations, before and after Nitrogen and Carbon starvation for a total of seven time points, in yeast (Brauer et al., 2006; Bradley et al., 2009). We posit a model for the N1 × N2 matrix of Fisher-transformed correlations of time courses for each gene-metabolite pair, or for any of its sub-matrices obtained by selecting subsets of genes and metabolites of special interest to biologists. The goal of the analysis is to reveal interactions between gene functions and metabolic pathways, operationally defined as sets of genes and sets of metabolites, respectively, with similar correlation patterns.

In the context of this application, we posit that each gene can participate in up to K1 functions, that is, latent row groups, and that each metabolite can participate in up to K2 metabolic pathways, that is, latent column groups.1 Latent Dirichlet vectors π⃗j and p⃗k capture the relative fractions of time gene j and metabolite k participate in the different cellular functions and pathways, or latent groups. The distribution of the correlation, or more generally interaction, Y (j, k), is then a function of the interactions among the latent groups, fully specified by a K1 × K2 matrix B, together with the latent memberships of the gene and metabolite involved. The data generating process, given α, β, B and σ, is as follows,

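$$\vec{\pi}_j \sim \mathrm{Dirichlet}(\alpha) \tag{2.1}$$
$$\vec{p}_k \sim \mathrm{Dirichlet}(\beta) \tag{2.2}$$
$$Y(j,k) \sim \mathrm{Normal}\left(\vec{\pi}_j^{\top} B\, \vec{p}_k,\ \sigma^2\right) \tag{2.3}$$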

where indices j = 1, …, N1 and k = 1, …, N2 run over genes and metabolites, respectively, vectors π⃗j and p⃗k are K1- and K2-dimensional, respectively, and elements of the blockmodel mean matrix Bgh ∈ ℝ.

While the observations Y (j, k) in the motivating application are Fisher-transformed correlations, real-valued with real-valued mean matrix B, the proposed models are more flexible. For instance, in Section 2.2 we develop a two-way blockmodel for binary observations, which is used in Section 3.7 to quantify the effects of censoring the data matrix Y.

For inference purposes, we consider an augmented data generating process, in which we introduce latent indicator vectors D⃗jk and E⃗jk that denote the single memberships of gene j and metabolite k for the correlation Y (j, k). The latent indicators {D, E} do not have a clear biological interpretation, but serve to improve computational tractability of the inference; they lead to optimization problems that have analytical solutions. The trade-offs of such a strategy have been explored elsewhere (e.g., see Airoldi et al., 2008). From a statistical perspective, introducing {D, E} amounts to a specific representation of the interactions in terms of random effects.
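The augmented data generating process can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: the function names, the example matrix `B` and the use of symmetric scalar Dirichlet parameters are hypothetical choices, not taken from the paper. It draws the membership vectors, then for each pair (j, k) draws the single-membership indicators and a Normal observation centered at the selected entry of B.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw a symmetric Dirichlet(alpha) vector by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw one group index, i.e. a Multinomial draw with n = 1."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def simulate_two_way(n1, n2, B, alpha, beta, sigma, seed=0):
    """Simulate Y(j, k) from the augmented two-way blockmodel (assumed form)."""
    rng = random.Random(seed)
    k1, k2 = len(B), len(B[0])
    pi = [sample_dirichlet(alpha, k1, rng) for _ in range(n1)]  # row memberships
    p = [sample_dirichlet(beta, k2, rng) for _ in range(n2)]    # column memberships
    Y = [[0.0] * n2 for _ in range(n1)]
    for j in range(n1):
        for k in range(n2):
            g = sample_index(pi[j], rng)  # position of the 1 in D_jk
            h = sample_index(p[k], rng)   # position of the 1 in E_jk
            Y[j][k] = rng.gauss(B[g][h], sigma)  # mean D_jk' B E_jk = B[g][h]
    return Y, pi, p

# Hypothetical 2 x 3 blockmodel used only for illustration.
B_example = [[0.5, -1.0, 1.5], [0.0, 0.8, -1.3]]
Y, pi, p = simulate_two_way(10, 15, B_example, alpha=0.2, beta=0.2, sigma=0.1)
```

Here `sigma` is a standard deviation, so `sigma=0.1` corresponds to a variance of 0.01.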

2.2. Extension to non-Gaussian responses

In the data generating process above, Y is generated from a Normal distribution and the blockmodel’s elements take real values. Extending the proposed model to other distributions to account for data Y that live in a different space is straightforward. And because of the hierarchical structure of the model, only a minor portion of the inference and estimation strategies detailed in Section 2.4 will need to be modified appropriately, as a consequence.

We will consider one such extension to binary observations Y (j, k), namely correlations after thresholding, in Section 3.7 to assess the effects of pre-processing on the accuracy in estimating the blockmodel. The data generating process in Section 2.1 is modified as follows. The blockmodel’s elements now take values in the unit interval, Bgh ∈ [0, 1], since they capture the probability that there is a correlation above threshold between members of any pair of blocks. For each pair (j, k), j = 1, …, N1, k = 1, …, N2, we sample the pairwise binary observation $Y(j,k) \sim \mathrm{Bernoulli}(\vec{D}_{jk}^{\top} B\, \vec{E}_{jk})$. Variational Bayes and MCMC inference also remain mostly unchanged. New updating equations for the elements of B will be needed; see Eq. 2.11 and B.4.

2.3. Extension to multi-way blockmodels

The two-way blockmodel introduced above can also be extended for analyzing multi-way interactions between three or more populations.

Consider a three-way interaction table Y (i1, i2, i3) observed on three populations 𝒩1, 𝒩2, 𝒩3, where i1 ∈ 𝒩1, i2 ∈ 𝒩2 and i3 ∈ 𝒩3. Assume that there are K1, K2 and K3 latent groups in 𝒩1, 𝒩2 and 𝒩3, respectively. We can treat the three-way interaction observed in Y as the result of three-way group interactions. Namely, Y (i1, i2, i3) can be fully characterized by B(g1, g2, g3), with items {i1, i2, i3} belonging to groups {g1, g2, g3}, respectively. Therefore, inference procedures for this three-way blockmodel can be developed in a similar fashion to those for the two-way blockmodel. Note that although the ideas for generalizations to higher-order tables remain the same, keeping track of indices during inference becomes tedious.

2.4. Parameter estimation and posterior inference

The main inference task is to estimate the matrix B and the mixed membership vectors π⃗ and p⃗. Given the observed data Y = Y (j, k), latent variable X={πj,pk,Djk,Ejk}, and the parameters Θ = {α, β, σ2, B}, the complete data likelihood p(Y, X|Θ) can be written as

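$$p(Y, X \mid \alpha, \beta, B, \sigma^2) = \prod_j p_1(\vec{\pi}_j \mid \alpha) \prod_k p_1(\vec{p}_k \mid \beta) \prod_{j,k} p_0\big(Y(j,k) \mid \vec{D}_{jk}, \vec{E}_{jk}, B, \sigma^2\big)\, p_2(\vec{D}_{jk} \mid \vec{\pi}_j)\, p_2(\vec{E}_{jk} \mid \vec{p}_k), \tag{2.4}$$

where p0 is a Normal distribution with mean $\mu = \vec{D}_{jk}^{\top} B\, \vec{E}_{jk}$ and variance σ2, p1 is a Dirichlet distribution, and p2 is a Multinomial distribution with n = 1. The posterior distribution of the latent variable X is

$$p(X \mid Y, \Theta) = \frac{p(Y, X \mid \Theta)}{p(Y \mid \Theta)}, \tag{2.5}$$

where the marginal distribution p(Y |Θ) has the following form:

$$p(Y \mid \Theta) = \int_X p(Y, X \mid \Theta)\, dX = \sum_{D} \sum_{E} \left\{ \int\!\!\int \prod_j p_1(\vec{\pi}_j \mid \alpha) \prod_k p_1(\vec{p}_k \mid \beta) \prod_{j,k} p_2(\vec{D}_{jk} \mid \vec{\pi}_j)\, p_2(\vec{E}_{jk} \mid \vec{p}_k)\, d\pi\, dp\ \prod_{j,k} p_0\big(Y(j,k) \mid \vec{D}_{jk}, \vec{E}_{jk}, B, \sigma^2\big) \right\}.$$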

There is no explicit solution to the maximization of p(Y |Θ). Therefore, we propose an iterative procedure based on variational Bayes for parameter estimation. For comparison, we also develop an MCMC scheme based on collapsed Gibbs sampling to carry out the desired statistical inference.

2.4.1. Variational Expectation-Maximization

To achieve variational inference, we introduce free variational parameters ν⃗j and ξk to approximate π⃗j and pk, free variational variables ϕ⃗jk and η⃗jk to approximate D⃗jk and E⃗jk, and latent distribution q(X) to approximate the true posterior distribution p(X|Y, Θ). By Jensen’s inequality, we have the following likelihood lower bound

$$\log p(Y \mid \Theta) \ \ge\ E_q[\log p(Y, X \mid \Theta)] - E_q[\log q(X)]. \tag{2.6}$$

A coordinate ascent algorithm can be applied to obtain a local maximizer of this lower bound, which results in the updates 2.7–2.11. Detailed derivations are given in Appendix A. The resulting variational EM algorithm is given in Algorithm 1.

Algorithm 1: The variational EM algorithm. The E steps 6 to 9 are also repeated until convergence to achieve the most stabilized mutual updates for the set of free parameters ϕ⃗, η⃗, ν⃗, ξ⃗.

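$$\phi_{jk,g} \propto \exp\Big(\psi(\nu_{j,g}) - \psi\big(\textstyle\sum_{g} \nu_{j,g}\big)\Big) \prod_{h} \left(\sigma^{-2} \cdot e^{-\frac{(Y(j,k) - B(g,h))^2}{\sigma^2}}\right)^{\frac{1}{2}\eta_{jk,h}}. \tag{2.7}$$
$$\eta_{jk,h} \propto \exp\Big(\psi(\xi_{k,h}) - \psi\big(\textstyle\sum_{h} \xi_{k,h}\big)\Big) \prod_{g} \left(\sigma^{-2} \cdot e^{-\frac{(Y(j,k) - B(g,h))^2}{\sigma^2}}\right)^{\frac{1}{2}\phi_{jk,g}}. \tag{2.8}$$
$$\nu_{j,g} = \sum_{k} \phi_{jk,g} + \alpha. \tag{2.9}$$
$$\xi_{k,h} = \sum_{j} \eta_{jk,h} + \beta. \tag{2.10}$$
$$B(g,h) = \frac{\sum_{j,k} \phi_{jk,g}\, \eta_{jk,h}\, Y(j,k)}{\sum_{j,k} \phi_{jk,g}\, \eta_{jk,h}}. \tag{2.11}$$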
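The updates 2.7–2.11 can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the order of updates within a sweep is an assumption (Algorithm 1 is not reproduced here), α and β are treated as scalars, and the small `digamma` helper is a standard series approximation included only to keep the sketch free of external dependencies. The updates for ϕ and η are computed in log space and then normalized, which is equivalent to the proportionality statements in 2.7 and 2.8.

```python
import math

def digamma(x):
    """Digamma for x > 0 via the recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def normalize_log(log_w):
    """Exponentiate and normalize log-weights without overflow."""
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def membership_params(phi, eta, alpha, beta):
    """Updates (2.9) and (2.10): Dirichlet parameters nu_j and xi_k."""
    n1, n2 = len(phi), len(phi[0])
    k1, k2 = len(phi[0][0]), len(eta[0][0])
    nu = [[alpha + sum(phi[j][k][g] for k in range(n2)) for g in range(k1)]
          for j in range(n1)]
    xi = [[beta + sum(eta[j][k][h] for j in range(n1)) for h in range(k2)]
          for k in range(n2)]
    return nu, xi

def vem_sweep(Y, B, phi, eta, alpha, beta, sigma2):
    """One sweep over (2.7)-(2.11); phi, eta and B are updated in place."""
    n1, n2, k1, k2 = len(Y), len(Y[0]), len(B), len(B[0])
    nu, xi = membership_params(phi, eta, alpha, beta)
    e_log_pi = [[digamma(v) - digamma(sum(row)) for v in row] for row in nu]
    e_log_p = [[digamma(v) - digamma(sum(row)) for v in row] for row in xi]
    for j in range(n1):
        for k in range(n2):
            # Log of the bracketed term in (2.7)/(2.8), including the 1/2 power.
            ll = [[-0.5 * (math.log(sigma2) + (Y[j][k] - B[g][h]) ** 2 / sigma2)
                   for h in range(k2)] for g in range(k1)]
            phi[j][k] = normalize_log(
                [e_log_pi[j][g] + sum(eta[j][k][h] * ll[g][h] for h in range(k2))
                 for g in range(k1)])
            eta[j][k] = normalize_log(
                [e_log_p[k][h] + sum(phi[j][k][g] * ll[g][h] for g in range(k1))
                 for h in range(k2)])
    cells = [(j, k) for j in range(n1) for k in range(n2)]
    for g in range(k1):
        for h in range(k2):
            weights = [phi[j][k][g] * eta[j][k][h] for j, k in cells]
            total = sum(weights)
            if total > 0.0:  # leave B(g, h) unchanged if the block is empty
                B[g][h] = sum(w * Y[j][k]
                              for w, (j, k) in zip(weights, cells)) / total
```

In use, one would initialize `phi`, `eta` and `B`, call `vem_sweep` repeatedly until the change in the parameters falls below a tolerance, and then call `membership_params` once more to obtain the final ν and ξ.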

3. Evaluating inference and effects of pre-processing

Here we use simulated data to compare the performance of variational and MCMC inference procedures for the two-way block model along multiple dimensions: estimation accuracy of mixed membership vectors, accuracy of predictions out-of-sample, estimation accuracy of the blockmodel interaction matrix B, and run-time. This extensive comparative evaluation provides a practical guideline for choosing the proper inference procedure in a real setting, especially when analyzing large tables. In addition, we quantify the effect of censoring on the inference in terms of estimation error.

3.1. Design of experiments

In the past decade, variational EM (vEM) has become a practical alternative to MCMC when dealing with large data sets, despite its lack of theoretical guarantees (Jordan et al., 1999; Airoldi, 2007; Wainwright and Jordan, 2008). The relative merits of vEM and MCMC have been established empirically for a number of models (e.g., see Blei and Jordan, 2006; Braun and McAuliffe, 2010). We designed simulations with the goal of exploring the trade-off between estimation accuracy and computational burden that vEM helps manage in the context of estimation and posterior inference with the proposed model.

Briefly, vEM is an optimization approach that involves no sampling and requires key choices about: (1) error tolerance for both the approximate E step and the M step, and (2) how to design multiple initializations and how many to use. MCMC is a sampling approach, which requires key choices about: (1) convergence criteria, (2) burn-in, (3) thinning to reduce autocorrelation, and (4) multiple chains. For the variational EM approach, we set the overall error tolerance at 1e-5, the maximum number of iterations for the variational E steps at 10, and we used 10 random initializations. For the MCMC approach, we investigated convergence using the Gelman-Rubin and Raftery-Lewis diagnostics for the median, and autocorrelation using trace plots and partial autocorrelation functions. Based on these studies, we chose to use 1,000 iterations for burn-in, 6,000 iterations, and a 10-to-1 thinning ratio, which results in 500 draws for each chain, and we used 10 chains. For both approaches, we use the true Dirichlet parameters α, β and the true variance σ2 = 0.01. Overall, this seems a fair comparison.

The data are generated using the procedure described in Section 2.1 with the following specifications. Each B(g, h) follows a Normal distribution, $B(g,h) \sim \mathrm{Normal}(\mu_B(g,h), \sigma_B^2(g,h))$, where $\mu_B(g,h) = 0$ and $\sigma_B^2(g,h) = 1$. Three sets of block sizes are considered: (K1, K2) = (2, 3), (4, 6) and (6, 9). The corresponding table sizes are (N1, N2) = (10, 15), (50, 75) and (100, 150), respectively. The Dirichlet parameters are set to α = β = 0.2 or α = β = 0.05. In all the experiments, we set σ2 = 0.01.

3.2. Mixed membership estimation

Here we evaluate the competing estimation procedures on recovering mixed membership vectors. We report results on the accuracy of the first and second largest membership components. It is well known that mixture models and mixed-membership models suffer from identifiability issues, that is, their likelihood is uniquely specified only up to permutations of the labels (Titterington et al., 1985). We evaluate the performance for a fixed permutation, obtained empirically by sorting the membership vectors for the vEM, and by using a standard Procrustes transform for the MCMC (Stephens, 2000). We note that vEM converged quickly to a (local) optimum, and was therefore considerably less affected by label switching than the collapsed Gibbs sampler. This is an advantage, especially given that the empirical vEM estimation error reported in Table 1 is comparable to that of the more principled MCMC sampler.

Table 1.

Comparison of row and column estimation accuracy for the highest membership (regular font in the original; rows labeled 1st here) and second highest membership (italic font in the original; rows labeled 2nd here) obtained with variational EM and MCMC. Standard errors are given in parentheses.

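| Method | (N1, N2) | α = β | Membership | K1 = 2, K2 = 3: row | K1 = 2, K2 = 3: column | K1 = 4, K2 = 6: row | K1 = 4, K2 = 6: column | K1 = 6, K2 = 9: row | K1 = 6, K2 = 9: column |
|---|---|---|---|---|---|---|---|---|---|
| vEM | (10,15) | 0.2 | 1st | 0.970(0.067) | 0.667(0.031) | 0.620(0.063) | 0.587(0.069) | 0.470(0.048) | 0.520(0.076) |
| vEM | (10,15) | 0.2 | 2nd | 0.970(0.067) | 0.522(0.075) | 0.233(0.152) | 0.179(0.077) | 0.210(0.129) | 0.060(0.058) |
| vEM | (10,15) | 0.05 | 1st | 0.980(0.042) | 0.967(0.085) | 0.870(0.125) | 0.807(0.066) | 0.780(0.063) | 0.567(0.085) |
| vEM | (10,15) | 0.05 | 2nd | 0.980(0.042) | 0.533(0.233) | 0.233(0.179) | 0.190(0.110) | 0.317(0.123) | 0.133(0.112) |
| vEM | (50,75) | 0.2 | 1st | 0.784(0.122) | 0.751(0.146) | 0.680(0.034) | 0.471(0.039) | 0.426(0.053) | 0.416(0.041) |
| vEM | (50,75) | 0.2 | 2nd | 0.784(0.122) | 0.694(0.130) | 0.304(0.097) | 0.175(0.033) | 0.194(0.054) | 0.136(0.031) |
| vEM | (50,75) | 0.05 | 1st | 0.980(0.000) | 0.849(0.074) | 0.620(0.104) | 0.575(0.058) | 0.634(0.046) | 0.483(0.053) |
| vEM | (50,75) | 0.05 | 2nd | 0.980(0.000) | 0.662(0.132) | 0.239(0.118) | 0.216(0.058) | 0.210(0.073) | 0.149(0.047) |
| vEM | (100,150) | 0.2 | 1st | 0.960(0.000) | 0.823(0.106) | 0.601(0.077) | 0.670(0.076) | 0.485(0.048) | 0.357(0.029) |
| vEM | (100,150) | 0.2 | 2nd | 0.960(0.000) | 0.612(0.247) | 0.261(0.055) | 0.237(0.063) | 0.194(0.029) | 0.137(0.022) |
| vEM | (100,150) | 0.05 | 1st | 0.946(0.092) | 0.743(0.132) | 0.769(0.055) | 0.707(0.057) | 0.553(0.084) | 0.479(0.052) |
| vEM | (100,150) | 0.05 | 2nd | 0.946(0.092) | 0.520(0.227) | 0.361(0.057) | 0.236(0.064) | 0.217(0.060) | 0.135(0.028) |
| MCMC | (10,15) | 0.2 | 1st | 0.922(0.148) | 0.730(0.102) | 0.678(0.015) | 0.665(0.012) | 0.669(0.008) | 0.521(0.007) |
| MCMC | (10,15) | 0.2 | 2nd | 0.922(0.148) | 0.504(0.167) | 0.306(0.053) | 0.204(0.031) | 0.207(0.011) | 0.157(0.004) |
| MCMC | (10,15) | 0.05 | 1st | 0.841(0.121) | 0.901(0.120) | 1.000(0.000) | 0.878(0.031) | 0.884(0.005) | 0.825(0.007) |
| MCMC | (10,15) | 0.05 | 2nd | 0.841(0.121) | 0.409(0.138) | 0.520(0.122) | 0.413(0.091) | 0.227(0.052) | 0.161(0.022) |
| MCMC | (50,75) | 0.2 | 1st | 0.871(0.121) | 0.671(0.097) | 0.711(0.095) | 0.659(0.084) | 0.682(0.106) | 0.562(0.051) |
| MCMC | (50,75) | 0.2 | 2nd | 0.871(0.121) | 0.437(0.186) | 0.380(0.039) | 0.300(0.065) | 0.301(0.093) | 0.231(0.026) |
| MCMC | (50,75) | 0.05 | 1st | 0.994(0.013) | 0.676(0.113) | 0.775(0.176) | 0.753(0.129) | 0.824(0.088) | 0.839(0.054) |
| MCMC | (50,75) | 0.05 | 2nd | 0.994(0.013) | 0.452(0.131) | 0.383(0.135) | 0.319(0.142) | 0.357(0.090) | 0.365(0.074) |
| MCMC | (100,150) | 0.2 | 1st | 0.971(0.032) | 0.653(0.150) | 0.682(0.119) | 0.633(0.083) | 0.735(0.069) | 0.614(0.078) |
| MCMC | (100,150) | 0.2 | 2nd | 0.968(0.034) | 0.420(0.223) | 0.332(0.080) | 0.255(0.054) | 0.310(0.074) | 0.235(0.059) |
| MCMC | (100,150) | 0.05 | 1st | 0.830(0.208) | 0.773(0.138) | 0.810(0.140) | 0.772(0.127) | 0.780(0.046) | 0.750(0.064) |
| MCMC | (100,150) | 0.05 | 2nd | 0.829(0.208) | 0.463(0.203) | 0.354(0.151) | 0.277(0.088) | 0.285(0.053) | 0.249(0.046) |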

To quantify accuracy, we identify the locations of the largest two components in the estimated vector of probabilities, π⃗j, and take those to be the first and second choice of group memberships for the jth row. These assignments are compared, via zero-one loss, with the true memberships: if there is a match, we record the accuracy as 1, otherwise 0. The reported row accuracy is the average over all the rows and the ten experiments. The column accuracy is defined in a similar fashion.
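The accuracy measure just described can be sketched as follows. The function names are hypothetical, and two assumptions are made that the text does not spell out: the true first and second memberships are taken to be the positions of the two largest components of the true membership vector, and the group labels of the estimate are assumed to have already been aligned with the true labels.

```python
def top_two(vec):
    """Return the indices of the largest and second largest components."""
    order = sorted(range(len(vec)), key=lambda g: vec[g], reverse=True)
    return order[0], order[1]

def membership_accuracy(estimated, truth, rank=0):
    """Average zero-one accuracy for the first (rank=0) or second (rank=1) membership."""
    hits = sum(
        1 for est, tru in zip(estimated, truth)
        if top_two(est)[rank] == top_two(tru)[rank]
    )
    return hits / len(estimated)

# Two rows with three groups each; values are made up for illustration.
estimated = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
truth = [[0.8, 0.15, 0.05], [0.5, 0.4, 0.1]]
first = membership_accuracy(estimated, truth, rank=0)
second = membership_accuracy(estimated, truth, rank=1)
```

In this toy example the first row matches on both ranks and the second row on neither, so both accuracies equal 0.5. Averaging this quantity over repeated experiments gives the kind of entries reported in Table 1.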

The results for the estimated first and second memberships are summarized in Table 1. The results for the first membership suggest that estimation is well behaved in the proposed model; the true membership can be recovered with a fairly high success rate under different experimental settings. As expected, the estimation accuracy decreases as the block size increases. The lowest pair reported in the table is 0.485 and 0.357 for K1 = 6 and K2 = 9, still much better than random assignment, where the accuracy would be 1/6 and 1/9, respectively. For the second membership, we only consider elements with an estimated second membership probability greater than a threshold. In this study, the thresholds are $\frac{1}{10 K_1}$ and $\frac{1}{10 K_2}$ for row and column memberships, respectively. It is clear that the variational Bayes approach performs much better than MCMC in estimating the second membership. One explanation may be that the second membership is more ambiguous than the first membership, requiring a large number of iterations for MCMC to converge.

Another factor that affects model performance is the choice of the Dirichlet parameters α and β. Judging from the table, the accuracy when α = β = 0.05 is generally higher than when α = β = 0.2. This result is reasonable since smaller values of α and β correspond to a higher likelihood of a dominating component, which is easier to identify than more ambiguous memberships.
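The link between small Dirichlet parameters and a dominating component can be checked with a quick simulation. The sketch below is illustrative only; the function name and the number of draws are arbitrary choices. It estimates the average size of the largest component of a symmetric Dirichlet vector, which approaches 1 as the parameter shrinks and approaches 1/K as it grows.

```python
import random

def mean_largest_component(alpha, k, n_draws=2000, seed=0):
    """Monte Carlo estimate of E[max_g pi_g] for pi ~ symmetric Dirichlet(alpha)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        # A Dirichlet draw is a vector of independent Gamma draws, normalized.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total += max(gammas) / sum(gammas)
    return total / n_draws

sharp = mean_largest_component(0.05, k=4)   # strongly dominated memberships
softer = mean_largest_component(0.2, k=4)   # less dominated memberships
diffuse = mean_largest_component(5.0, k=4)  # diffuse memberships
```

Comparing `sharp`, `softer` and `diffuse` shows the ordering discussed above: the smaller the parameter, the closer the largest component is to one.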

The membership accuracy obtained with variational Bayes is in line with that obtained with MCMC, and is even slightly better when the block size is small. Since variational inference is typically much more efficient than MCMC, the former method is preferred for practical analysis, especially for high-dimensional cases. We present run-time comparisons between these two approaches in Section 3.6.

3.3. Predictions out of sample

Predictive power is a useful criterion for evaluating statistical models. When some data are missing, is the model sufficiently flexible to provide correct inferences and to predict the missing values with high accuracy? To answer this question, we randomly select 2/3 of the rows and 2/3 of the columns of the table, whose intersections make up 4/9 of the entries. We set half of them (i.e., 2/9 of all entries) as missing (to avoid eliminating an entire row or column), and run the model on the remaining 7/9 of the entries. The first membership prediction accuracy is reported in Table 2. The accuracies are slightly lower than those estimated without missing values, but overall much better than the baseline probabilities 1/K1 and 1/K2. Furthermore, the prediction accuracy achieved by variational Bayes is comparable to or better than that obtained by MCMC. This result reinforces our belief that variational Bayes is a good inference approach for the proposed blockmodel.
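The hold-out design can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, and the half of the intersection cells set to missing is chosen uniformly at random, which the text does not specify.

```python
import random

def heldout_mask(n1, n2, seed=0):
    """Return the set of (row, column) cells treated as missing."""
    rng = random.Random(seed)
    rows = rng.sample(range(n1), round(2 * n1 / 3))  # 2/3 of the rows
    cols = rng.sample(range(n2), round(2 * n2 / 3))  # 2/3 of the columns
    cells = [(j, k) for j in rows for k in cols]     # about 4/9 of all entries
    # Hide half of the intersection cells, i.e. about 2/9 of the table.
    return set(rng.sample(cells, len(cells) // 2))

missing = heldout_mask(30, 45)
fraction_missing = len(missing) / (30 * 45)
```

Because only 2/3 of the columns (rows) are ever candidates for removal, every row (column) keeps at least one third of its entries observed, so no row or column is eliminated entirely.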

Table 2.

Comparison of row and column estimation accuracy between variational EM and MCMC, when 2/9 of the entries are missing. Standard errors are given in parentheses.

| Method | (N1, N2) | α = β | K1 = 2, K2 = 3: row | K1 = 2, K2 = 3: column | K1 = 4, K2 = 6: row | K1 = 4, K2 = 6: column | K1 = 6, K2 = 9: row | K1 = 6, K2 = 9: column |
|---|---|---|---|---|---|---|---|---|
| vEM | (10,15) | 0.2 | 0.780(0.148) | 0.600(0.094) | 0.610(0.110) | 0.507(0.118) | 0.520(0.063) | 0.547(0.103) |
| vEM | (10,15) | 0.05 | 0.900(0.067) | 0.853(0.117) | 0.730(0.125) | 0.547(0.108) | 0.700(0.094) | 0.613(0.042) |
| vEM | (50,75) | 0.2 | 0.664(0.067) | 0.615(0.134) | 0.452(0.081) | 0.383(0.039) | 0.366(0.034) | 0.335(0.037) |
| vEM | (50,75) | 0.05 | 0.930(0.034) | 0.843(0.077) | 0.570(0.135) | 0.564(0.074) | 0.504(0.076) | 0.444(0.040) |
| vEM | (100,150) | 0.2 | 0.786(0.091) | 0.672(0.128) | 0.472(0.124) | 0.362(0.055) | 0.313(0.049) | 0.326(0.059) |
| vEM | (100,150) | 0.05 | 0.751(0.194) | 0.749(0.136) | 0.656(0.102) | 0.503(0.091) | 0.397(0.068) | 0.373(0.056) |
| MCMC | (10,15) | 0.2 | 0.703(0.100) | 0.617(0.083) | 0.480(0.091) | 0.460(0.085) | 0.406(0.012) | 0.368(0.054) |
| MCMC | (10,15) | 0.05 | 0.770(0.145) | 0.726(0.115) | 0.540(0.161) | 0.446(0.073) | 0.454(0.101) | 0.456(0.070) |
| MCMC | (50,75) | 0.2 | 0.788(0.145) | 0.645(0.104) | 0.544(0.064) | 0.443(0.032) | 0.357(0.062) | 0.343(0.039) |
| MCMC | (50,75) | 0.05 | 0.809(0.194) | 0.647(0.098) | 0.606(0.072) | 0.567(0.068) | 0.473(0.054) | 0.479(0.074) |
| MCMC | (100,150) | 0.2 | 0.813(0.103) | 0.576(0.102) | 0.575(0.061) | 0.492(0.042) | 0.411(0.048) | 0.395(0.028) |
| MCMC | (100,150) | 0.05 | 0.867(0.150) | 0.834(0.111) | 0.639(0.051) | 0.524(0.040) | 0.514(0.092) | 0.497(0.030) |

3.4. Blockmodel matrix estimation

Here we compare variational Bayes and MCMC in terms of estimating the matrix B. The estimation error εB is defined as the 1-norm of the matrix |B − B̂|, where B̂ is the estimated matrix. The result for K1 = 2, K2 = 3 is shown in Table 3. Except for the case of α = β = 0.05 and N1 = 10, N2 = 15, variational Bayes performs comparably to or better than MCMC. The true B in this simulation study is

$$B = \begin{pmatrix} 0.5009 & 0.0687 & 1.5887 \\ 0.4148 & 0.8086 & 1.3112 \end{pmatrix}.$$

Table 3.

Comparison of the estimation error εB for B between variational Bayes (VB) and MCMC.

| Method | (10,15), α = β = 0.05 | (10,15), α = β = 0.2 | (50,75), α = β = 0.05 | (50,75), α = β = 0.2 | (100,150), α = β = 0.05 | (100,150), α = β = 0.2 |
|---|---|---|---|---|---|---|
| VB | 0.152(0.042) | 0.022(0.022) | 0.048(0.024) | 0.061(0.061) | 0.053(0.029) | 0.002(0.001) |
| MCMC | 0.019(0.006) | 0.027(0.047) | 0.110(0.058) | 0.045(0.066) | 0.134(0.065) | 0.105(0.058) |

3.5. Sensitivity to initialization and prior specifications

Here we analyzed the sensitivity of the inference to informative versus noninformative prior specifications, and to uniform versus random initialization of some constants in our model. The results show no significant sensitivity of the estimation error to these choices. This evidence supports our claim that inference is well-behaved and that identifiability is not an issue for the model we proposed, in practice, in the data regimes we considered.

In Algorithm 1 (vEM) and Algorithm 2 (MCMC), we initialized a subset of parameters (ϕ, η, ν, ξ in vEM and D, E in MCMC) uniformly. To assess the sensitivity of inference to this initialization strategy, we tested alternative versions of these algorithms in which we initialized these parameters at random, on the data set analyzed in Section 3.7. Briefly, in vEM, we initialized each ϕ⃗jk and η⃗jk with a random membership vector, then initialized ν⃗j, ξ⃗k using Eq. 2.9 and Eq. 2.10. The blockmodel B is initialized as in Algorithm 1. In MCMC, we initialized each D⃗jk, E⃗jk with a membership vector with a single positive entry assigned at random, computed D⃗j→·, E⃗·←k, Ygh, ngh accordingly from these initial values of D⃗ and E⃗, and then initialized p(Djk,g = 1, Ejk,h = 1) using Eq. B.1. The results of this experiment are shown in Table 4.

Table 4.

Comparison of the estimation error εB for B and of the highest membership accuracy between different initializations for variational Bayes and MCMC.

| init. | vEM: εB | vEM: row | vEM: column | MCMC: εB | MCMC: row | MCMC: column |
|---|---|---|---|---|---|---|
| random | 0.200(0.163) | 0.916(0.184) | 0.907(0.120) | 0.171(0.179) | 0.872(0.171) | 0.818(0.177) |
| uniform | 0.205(0.173) | 0.916(0.117) | 0.880(0.102) | 0.115(0.175) | 0.957(0.047) | 0.820(0.157) |

Other inputs to Algorithms 1 and 2 are the Dirichlet parameters α and β. A priori, α, β < 1 favor a single dominating membership component while α, β > 1 favor diffuse membership. In the analysis of real data, we expect few dominating memberships, so we typically set α = β equal to either 0.2 or 0.05 and assess the sensitivity of the resulting estimated memberships and other parameters. However, the question arises as to whether an alternative strategy that features informative priors is more useful than using noninformative priors, as we do. Using informative priors for the membership parameters might lead to improved inference, especially in the case of substantial non-identifiability.

To evaluate this issue, we generated a data set with informative priors α⃗ = (0.3, 0.7)′ for the rows and β⃗ = (0.6, 0.3, 0.10)′ for the columns. Then we fit the model with the vEM algorithm on this data set using both non-informative uniform priors (α = β = 0.05) and informative priors with the vectors α⃗, β⃗ set at the true values. The results are presented in Table 5 from which we see that the results are comparable. This justifies the simple choice of noninformative prior in our algorithms.

Table 5.

Comparison of vEM fits using informative and noninformative priors, in terms of estimation error εB and accuracy in estimating the highest membership component.

| Priors | εB | row | column |
|---|---|---|---|
| noninformative | 0.385(0.176) | 0.868(0.121) | 0.827(0.134) |
| informative | 0.203(0.121) | 0.788(0.157) | 0.870(0.091) |

3.6. Run-time comparison

As seen previously, variational Bayes performs as effectively as MCMC in parameter and membership estimation as well as held-out prediction accuracy. In the following, we present results on run-time comparison between these two approaches. Our goal is to quantify the magnitude of savings that variational Bayes can achieve while obtaining similar inferences to those obtained through MCMC.

We run each experiment 10 times and record the average log run-time. The plots are shown in Figure 1. Three table sizes are considered in this simulation: 10 × 15, 50 × 75 and 100 × 150. The figure shows that the run-time for MCMC is consistently several times longer than that of variational Bayes. For example, when the block sizes equal (6, 9) and the Dirichlet parameters equal 0.05, one experiment takes about 30 minutes to run for variational Bayes, and roughly 6 hours for MCMC. This trend continues as the table size increases, and the savings in computational cost can be much larger. These results suggest that variational Bayes should be preferred for analyzing large tables. Recently developed inference strategies based on spectral clustering (Rohe and Yu, 2012) and binary factor graphs (Azari and Airoldi, 2012) should also be considered.

Fig 1.

Log run-time for simulated data. Red lines represent variational Bayes and black lines represent MCMC via collapsed Gibbs. The x-axis is the number of elements in a table. For instance, 0.15(thousand) represents a 10 by 15 table with 150 elements.

3.7. Quantifying the effects of censoring

One of the issues in existing studies of coordinated cellular responses is the pre-processing of the original measurements. This kind of censoring reduces data utility and decreases estimation accuracy. The goal of this study is to quantify the effects of censoring by thresholding on the estimation of the blockmodel.

The data Y are generated from $Y(j,k) \sim \mathrm{Normal}(\vec{\pi}_j^{\top} B\, \vec{p}_k, \sigma^2)$. The domain of Y (j, k) is (−∞, +∞). We apply the Inverse Fisher Transformation (IFT), which maps Y (j, k) to ρ(j, k) so that its range is [−1, 1]. The censored data are defined as S(j, k) = 1(|ρ(j, k)| ≥ τ), where τ is the median, the mean, or 0.5. Clearly, S(j, k) ∈ {0, 1}.
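The censoring step can be sketched as follows. The function name is hypothetical, and one assumption is made that the text leaves open: when τ is the median or the mean, it is computed here over the absolute correlations |ρ(j, k)| of the whole table. The inverse of the Fisher transformation z = arctanh(ρ) is ρ = tanh(z).

```python
import math
import statistics

def censor(Y, tau="median"):
    """Map Y to correlations via the inverse Fisher transform, then threshold."""
    rho = [[math.tanh(y) for y in row] for row in Y]  # values in [-1, 1]
    magnitudes = [abs(r) for row in rho for r in row]
    if tau == "median":
        cutoff = statistics.median(magnitudes)  # assumed: median of |rho|
    elif tau == "mean":
        cutoff = statistics.mean(magnitudes)    # assumed: mean of |rho|
    else:
        cutoff = float(tau)                     # fixed threshold, e.g. 0.5
    S = [[1 if abs(r) >= cutoff else 0 for r in row] for row in rho]
    return rho, S, cutoff

# Toy table of Fisher-transformed correlations, made up for illustration.
Y_toy = [[0.0, 1.0], [-2.0, 0.1]]
rho_toy, S_fixed, _ = censor(Y_toy, tau=0.5)
_, S_median, cutoff_median = censor(Y_toy, tau="median")
```

The binary table `S` is what the Bernoulli blockmodel of Section 2.2 would be fit to, while the Normal blockmodel is fit to `Y` directly.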

The Normal blockmodel is applied to the original data Y (j, k) and the Bernoulli blockmodel described in Section 2.2 is applied to the censored data S(j, k). To make the comparison in the same scale, we define ρ̂(j, k) as the IFT of ϕjkηjk, where ϕ⃗jk, and η⃗jk are estimated from the Normal blockmodel. The estimation error is defined as

$\varepsilon = \frac{\sum_{(j,k)} |\rho(j,k) - \hat{\rho}(j,k)|}{N_1 \times N_2}.$

The estimation error for the censored experiment is computed in the same fashion, with ρ̂(j,k)=ϕjkηjk, where ϕ⃗jk and η⃗jk are estimated from the Bernoulli blockmodel, and with ρ(j, k) replaced by |ρ(j, k)|.
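The error ε is the mean absolute difference between the true and the fitted values over all N1 × N2 cells of the table. A minimal sketch follows; the function name `estimation_error` and the toy values are hypothetical, and the fitted table `rho_hat` is assumed to be supplied by whichever blockmodel was fit.

```python
def estimation_error(rho, rho_hat):
    """Mean absolute difference between two N1 x N2 tables."""
    n1, n2 = len(rho), len(rho[0])
    total = sum(abs(rho[j][k] - rho_hat[j][k])
                for j in range(n1) for k in range(n2))
    return total / (n1 * n2)

# Toy example: true and fitted correlations for a 2 x 2 table.
rho = [[0.5, -0.2], [0.1, 0.9]]
rho_hat = [[0.4, -0.1], [0.1, 0.7]]
err = estimation_error(rho, rho_hat)

# For the censored experiment the reference values are |rho(j, k)|.
abs_rho = [[abs(r) for r in row] for row in rho]
err_censored = estimation_error(abs_rho, [[0.6, 0.3], [0.2, 0.8]])
```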

We compare our model with a biclustering method popular in computational biology (Cheng and Church, 2000), fit to both the raw and censored correlations. We match each estimated bicluster to a true block and compute recall and precision in estimating absolute correlations above a threshold. Results are presented in Table 6, where the results obtained with the Cheng and Church method (BCCC) are optimized over a range of input parameter values. For completeness, we also add results obtained by applying hierarchical clustering to rows and columns independently, and with cross-association (Chakrabarti et al., 2004).

Table 6.

Comparison of estimation error on censored and non-censored data. Standard errors are in parentheses.

Data Method Bic. Error ε Recall Precision
raw ρij 2-way Normal 6 0.054(0.010) 0.841(0.169) 0.881(0.116)
hier. clustering 6 0.221(—) 0.967(—) 0.970(—)
Cheng & Church 2 0.367(—) 0.679(—)
|ρij| > ρ(0.5) 2-way Bernoulli 6 0.175(0.006) 0.518(0.017) 0.722(0.048)
hier. clustering 6 0.125(—) 0.700(—) 0.850(—)
Cheng & Church 2 0.232(—) 0.640(—)
cross-associations 4 0.667(—) 0.762(—)
|ρij| > ρ̄ 2-way Bernoulli 6 0.182(0.003) 0.528(0.014) 0.773(0.056)
hier. clustering 6 0.187(—) 0.500(—) 0.841(—)
Cheng & Church 3 0.237(—) 0.640(—)
cross-associations 6 0.667(—) 0.841(—)
|ρij| > 0.5 2-way Bernoulli 6 0.158(0.002) 0.528(0.022) 0.835(0.030)
hier. clustering 6 0.189(—) 0.500(—) 0.841(—)
Cheng & Church 3 0.239(—) 0.640(—)
cross-associations 8 0.613(—) 0.667(—)

The effects of censoring are clearly seen in Table 6. The estimation error increases about threefold when using the censored data with the Bernoulli blockmodel (from 0.054 to between 0.158 and 0.182). The effect of the thresholding parameter τ is small.

4. Analyzing transcriptional and metabolic coordination in response to starvation

Functions in a cell are executed by cascades of molecular events. Intuitively, proteins are the messengers, while metabolites and other small molecules are the messages. Measuring protein activity over time, directly, is difficult and expensive. An indication of the abundance of most proteins, however, can be inferred from the amount of the messenger RNA transcripts. These transcripts are copies of genes, and lead to the translation of proteins. This is especially true in yeast where alternatives to the transcription-translation hypothesis, such as alternative splicing, are not frequent. Metabolite concentrations add an essential perspective to the study of cascades of molecular events.

We conducted an integrated analysis of two data collections recently published: temporal profiles of metabolite concentrations (Brauer et al., 2006) and temporal profiles of gene expression (Bradley et al., 2009), both measured in Saccharomyces cerevisiae with matching sampling schemes.

An integrated analysis of the coordination between gene expression and metabolite concentrations may lead to the identification of sets of genes (i.e., the corresponding proteins) and metabolites that are functionally related, which will provide additional insights into regulatory mechanisms at multiple levels, and open avenues of inquiry. The identification and quantification of such coordinated regulatory behavior is the goal of our analysis.

The methodology in Section 2 allows us to identify genes and metabolites that show correlated responses to metabolic stress, namely starvation. To evaluate the biological significance of the results, we quantify to what extent correlated responses are associated with metabolic-related functions, and to what extent estimated models can be used to identify functionally related genes and metabolites out-of-sample.

4.1. Data and experimental design

The expression data consist of messenger RNA transcript levels measured using Agilent microarrays on cultures of S. cerevisiae before and after carbon starvation (glucose removal), and before and after nitrogen starvation (ammonium removal). Collection times were 0 minutes (before starvation) and 10, 30, 60, 120, 240, and 480 minutes after starvation. For more details about the data and the experimental protocol see Bradley et al. (2009). The metabolite concentration data were obtained using liquid chromatography-mass spectrometry before and after carbon starvation (glucose removal), and before and after nitrogen starvation (ammonium removal). Collection times were 0 minutes (before starvation) and 10, 30, 60, 120, 240, and 480 minutes after starvation. For more details about the data and the experimental protocol see Brauer et al. (2006).

The concentration of each metabolite and the transcript level of each gene at time point t are expressed as log2 ratios versus the corresponding measurements at the zero time point. Thus for each gene j we have a sequence Gjt, t = 1, …, 6, and for each metabolite k we have a sequence Mkt, t = 1, …, 6, representing the observations at the 6 time points after time 0. Complete temporal profiles are available for 5,039 genes and for 61 metabolites; 783 genes and 7 metabolites with missing data were not considered.

Using the temporal profiles, we can calculate the sample correlation coefficient of each gene and metabolite pair (j, k): $\rho(j,k) = \frac{\sum_{t=1}^{T} (G_{jt} - \bar{G}_j)(M_{kt} - \bar{M}_k)}{(T-1)\, S_G S_M}$, where $\bar{G}_j = \sum_t G_{jt}/T$ and $\bar{M}_k = \sum_t M_{kt}/T$ are the sample means, and $S_G$ and $S_M$ are the sample standard deviations. We then transform these correlations using the Fisher transformation $Z(j,k) = \frac{1}{2}\log\frac{1+\rho(j,k)}{1-\rho(j,k)}$. With the true correlation between genes and metabolites denoted as ρ0, Z(j, k) asymptotically follows a Normal distribution with mean $\mu_z = \frac{1}{2}\log\frac{1+\rho_0}{1-\rho_0}$ and standard error $1/\sqrt{T-3}$ (Fisher, 1915, 1921). Under the hypothesis that there is no correlation between genes and metabolites, we expect ρ0 = 0 and hence $\mu_z = 0$. These Fisher transformed quantities provide the input to our model.
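The computation of one input value Z(j, k) from a pair of temporal profiles can be sketched as follows. This is an illustrative sketch only; the function names and the two toy profiles are hypothetical, and a profile pair with correlation exactly ±1 would need separate handling because the Fisher transformation is unbounded there.

```python
import math

def sample_correlation(g, m):
    """Sample correlation between a gene profile g and a metabolite profile m."""
    T = len(g)
    g_bar = sum(g) / T
    m_bar = sum(m) / T
    # Sample standard deviations with the (T - 1) denominator.
    s_g = math.sqrt(sum((x - g_bar) ** 2 for x in g) / (T - 1))
    s_m = math.sqrt(sum((y - m_bar) ** 2 for y in m) / (T - 1))
    cross = sum((x - g_bar) * (y - m_bar) for x, y in zip(g, m))
    return cross / ((T - 1) * s_g * s_m)

def fisher_z(rho):
    """Fisher transformation of a correlation coefficient, for |rho| < 1."""
    return 0.5 * math.log((1 + rho) / (1 - rho))

# Hypothetical log2 ratios at the T = 6 time points after time 0.
gene = [0.2, 0.5, 0.9, 1.4, 1.6, 2.1]
metabolite = [0.1, 0.1, 0.6, 1.0, 1.5, 1.7]

rho = sample_correlation(gene, metabolite)
z = fisher_z(rho)
se = 1 / math.sqrt(len(gene) - 3)  # asymptotic standard error of z
```

With only six time points the standard error is 1/√3 ≈ 0.58, which gives a sense of how noisy each individual entry of the input table is.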

We do not expect the multi-way blockmodel assumption to hold for all the 5,039 genes. Instead, we provide separate joint analyses on subsets of genes and all 61 metabolites, for a number of gene lists of interest, which we expect to be involved in the cellular response to starvation. We consider gene lists that were obtained in studies exploring the environmental stress response (ESR), cellular proliferation, metabolism, and the cell cycle (Gasch et al., 2000; Tu et al., 2005; Brauer et al., 2008; Airoldi et al., 2009).

For all the experiments we rely on the variational Bayes (VB) implementation of our model because of its advantage in convergence speed, which is crucial when dealing with correlation tables involving hundreds of genes. We adopt the settings described in Section 3.6 for VB, with specific changes described as they become relevant. In the remainder of this section, with the exception of Section 4.5, we consistently set the number of metabolite blocks K2 = 4, since there are four metabolite classes, and we use informative priors for the memberships of each metabolite, depending on which class it is known to belong to. Specifically, each of the 61 metabolites belongs to one of the four classes: TCA, AA, GLY, BSI. If a metabolite is in class TCA, say Aconitate, its ξ⃗ vector will be initialized as ξ⃗ = [100 1 1 1], normalized to unit norm. By assuming a dominating component on the true index in the initial membership, the metabolites will mostly remain stably associated with their classes during VB inference.

For the optimal number of gene blocks, we select K1 by minimizing the Bayesian information criterion (BIC). The general BIC formula is −2 log L + k × log(n), in which k is the number of parameters, n is the number of observations, and L is the likelihood. For our model, the approximated BIC is

$-2 \log L + |B, \vec{\pi}, \vec{p}| \times \log |Y|$

where |B, π⃗, p⃗| is the number of parameters, which is approximately equal to K1 × K2, and |Y| is the number of entries in the table i.e. |Y | = N1 × N2.
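The selection of K1 by the approximated BIC can be sketched as below. The sketch takes the parameter count to be K1 × K2, as stated above, and assumes the fitted log-likelihood for each candidate K1 is already available. The function names and the log-likelihood values are hypothetical and serve only to show the mechanics.

```python
import math

def approx_bic(log_lik, K1, K2, N1, N2):
    """Approximated BIC with K1 * K2 parameters and N1 * N2 table entries."""
    n_params = K1 * K2
    return -2.0 * log_lik + n_params * math.log(N1 * N2)

def select_K1(log_liks, K2, N1, N2):
    """Return the candidate K1 with the smallest approximated BIC.

    log_liks maps each candidate K1 to the log-likelihood of the fitted model.
    """
    bics = {K1: approx_bic(ll, K1, K2, N1, N2) for K1, ll in log_liks.items()}
    best = min(bics, key=bics.get)
    return best, bics

# Hypothetical log-likelihoods for a 57 x 61 table with K2 = 4.
log_liks = {2: -5200.0, 4: -5050.0, 8: -5030.0}
best_K1, bics = select_K1(log_liks, K2=4, N1=57, N2=61)
```

In this toy case the gain in likelihood from K1 = 4 to K1 = 8 is too small to offset the penalty for the additional parameters, so K1 = 4 is selected.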

In the remainder of this section, we present selected results that showcase how data analysis via the multi-way blockmodel supports biological research.

4.2. Multifaceted functional evaluation of coordinated responses

Here we evaluate to what extent the proposed model is useful in revealing the genes’ multifaceted functional roles. We rely on the functional enrichment analysis using the Gene Ontology to evaluate the functional content of clusters of genes (Ashburner et al., 2000; Boyle et al., 2004).

One aspect of our model that distinguishes it from clustering and biclustering methods is the mixed membership assumption. That is, in our model, each gene can participate in multiple functions, as modelled via the gene-specific latent membership vectors π⃗. In practice, the mixed membership assumption lets us identify multiple levels of functional enrichment.

To illustrate this point, we consider 521 genes that were found to be strongly associated with metabolic activities, i.e., up-regulated in response to increasing growth rate, in previous studies (Brauer et al., 2008; Airoldi et al., 2009). We use the largest estimated memberships for each gene πgi, i = 1..K1, to assign genes g = 1..521 to metabolite classes j = 1..4. Then we perform functional analysis on the resulting sets of genes associated with each metabolite class.

More formally, we proceed as follows. First, the largest estimated membership is used to assign gene g to gene block i, according to $\hat{i}_g = \arg\max_{i=1..K_1} \pi_{gi}$. Then the largest estimated gene-block to metabolite-class association $|B_{ij}|$ is used to associate gene g with a metabolite class, according to

$\hat{j}_g = \arg\max_{j=1..4} |B_{\hat{i}_g, j}|.$

The collection of estimated gene-to-metabolite class associations, {ĵg, g = 1..521}, is used to partition genes into four sets, e.g. AA = {g s.t. ĵg = 1}. We perform functional enrichment analysis for each of these four sets. In addition, the mixed membership nature of the proposed multi-way blockmodel allows us to analyze second-order functional enrichment. We repeat the procedure above but we estimate îg using the second-largest membership in π⃗g.
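The two-step assignment, for both the largest and the second-largest membership, can be sketched as follows. The function name `assign_classes`, the 0-based class indices, and the toy estimates of π and B are hypothetical.

```python
def assign_classes(pi, B, rank=0):
    """Assign each gene to a metabolite class.

    pi   : N1 x K1 estimated gene memberships (one list per gene).
    B    : K1 x K2 estimated blockmodel.
    rank : 0 uses the largest membership, 1 the second-largest.
    Returns a list with one 0-based metabolite class index per gene.
    """
    assignments = []
    for pi_g in pi:
        # Gene blocks ordered by decreasing membership.
        order = sorted(range(len(pi_g)), key=lambda i: pi_g[i], reverse=True)
        i_hat = order[rank]
        # Metabolite class with the largest absolute association to block i_hat.
        j_hat = max(range(len(B[i_hat])), key=lambda j: abs(B[i_hat][j]))
        assignments.append(j_hat)
    return assignments

# Toy estimates: 2 genes, K1 = 3 gene blocks, K2 = 4 metabolite classes.
pi = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
B = [[0.9, 0.1, -0.2, 0.0],
     [0.0, -0.8, 0.1, 0.3],
     [0.2, 0.1, 0.0, -0.7]]

first = assign_classes(pi, B, rank=0)
second = assign_classes(pi, B, rank=1)
```

Because the absolute value of B is used, a strongly negative association counts as much as a strongly positive one when a gene block is matched to a metabolite class.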

The functional analysis results obtained for both first- and second-largest memberships are reported in Table 7. Interestingly, subsets of genes associated with the same metabolite class, for instance AA, are functionally enriched for multiple functions, to different degrees. For instance, genes use AA metabolites when performing translational activities in the nucleus, primarily; however, they use AA metabolites when performing polymerase-related activities on the polymerase II and III complexes, to a lesser extent. Similarly, genes use BSI metabolites for binding activities in the preribosome, primarily, and for ligase and transferase activities in the preribosome and the ribonucleoprotein complex, to a minor extent. The magnitude of the components of the relevant mixed membership vectors provides more information on the degree of involvement of the various gene blocks in these many activities. This type of multifaceted functional analysis is possible thanks to the mixed membership assumption encoded in the multi-way blockmodel.

Table 7.

Example functional evaluation. Gene Ontology terms associated with first- and second-largest membership scores for the Nitrogen starvation experiment.

Memb. Class Ontology Term description p-value
first AA component DNA-directed RNA polymerase I complex 9.8E-6
first AA component preribosome, small subunit precursor 0.00324
first AA function translation factor activity, nucleic acid binding 5.11E-14
first AA function translation initiation factor activity 1.64E-10
first AA function DNA-directed RNA polymerase activity 1.45E-6
first BSI component preribosome, large subunit precursor 0.00013
first BSI function GTP binding 0.03314
first BSI function guanyl ribonucleotide binding 0.03314

second AA component DNA-directed RNA polymerase III complex 6.76E-10
second AA component DNA-directed RNA polymerase II core complex 0.00322
second AA function RNA polymerase activity 1.27E-6
second AA function ATP-dependent RNA helicase activity 3.43E-6
second BSI component ribonucleoprotein complex 6.47E-12
second BSI component 90S preribosome 0.00124
second BSI function aminoacyl-tRNA ligase activity 0.0000817
second BSI function N-methyltransferase activity 0.0455

These results highlight the role of the mixed membership assumption in supporting a detailed multifaceted functional analysis, which is not possible with traditional methods.

4.3. Predicting functional annotations out-of-sample

Here we assess the goodness-of-fit of the proposed method on real data, in terms of out-of-sample predictions. We present results of an experiment in which we predict held-out functional annotations πgi. This analysis leverages informative priors on a subset of known functional annotations.

We consider 57 genes that were found to be strongly associated with cellular growth (Brauer et al., 2008; Airoldi et al., 2009), 760 genes that were found to be involved in the environmental response to stress (Gasch et al., 2000), and 19 genes that were found to be involved in metabolic cycling (Tu et al., 2005), in previous studies.

Good out-of-sample prediction performance will enable biologists to use this method to guide which functions they should be testing at the bench, speeding up the exploration of the functional landscape through statistical analysis of gene-metabolite associations.

To establish the ground truth for this experiment, we collected functional annotations for each gene in the same four lists as in Section 4.2; these annotations are held out and predicted using the multi-way blockmodel. Table 8 reports summary statistics of the functional annotations in each list of genes, obtained using the Gene Ontology term finder (SGD). Column two reports the total number of genes in each list. Column three reports the number of genes with one, some and no functional annotations. Columns 4–9 report the quantiles from the distribution of the number of functional annotations for the genes in each list. Column 10 reports the value of K1 we selected for fitting the blockmodel.

Table 8.

Statistics for the lists of genes. Column three reports the number of genes with one, some and no functional annotations. K1 is the number of gene blocks in the fitted blockmodel.

no. of genes no. of functional annotations
gene list total one/some/none min 25% 50% 75% max mean K1
growth rate 57 5/19/38 1 1.25 4 7 7 4.26 12
ESR induced 240 0/215/25 2 5 7 12 31 9.31 76
ESR repressed 520 1/503/17 1 10 19 22 31 16.93 78
metabolic cycle 19 4/14/5 1 1 6.5 10 20 7.29 25

To perform the second experiment, we held out the annotations for 50% of the genes with multiple functional annotations, and we also held out the annotation for 50% of the genes with a single functional annotation. When fitting the multi-way blockmodel, in addition to using informative priors for the memberships of each metabolite depending on which class it is known to belong to, as detailed in Section 4.1, we used informative priors for the functional annotations we did not hold out. For the held-out annotations, we used non-informative values for the hyper-parameters instead. For instance, suppose that the known vector of functional annotations for gene g is a⃗g =[1 0 0 1 1 0 0 0 0 1], and that ag(1) and ag(4) were to be held out in a particular replication, so that we have a⃗g =[NA 0 0 NA 1 0 0 0 0 1]. The prior for the functional annotation for that gene would be set at ξ⃗g = [1 1 1 1 100 1 1 1 1 100], normalized to unit norm. The rationale for this choice is to fit a multi-way blockmodel with known biological structure for those genes and metabolites that are used for parameter estimation, but agnostic about the biology we want to predict out-of-sample. We claimed success in each prediction if the imputed annotations, ξ̂g(k) = 1, corresponded to real held-out annotations, and if the imputed absences of annotations, ξ̂g(i) = 0, corresponded to absences of real held-out annotations. We repeated this procedure 10 times, for each of the four lists of genes.
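The construction of the prior vector from a partially held-out annotation vector can be sketched as follows. The sketch reproduces the worked example above, with `None` standing for NA. The function name `annotation_prior` is hypothetical, and the normalization (dividing by the sum, so that the weights add up to one) is an assumption, since the text only says "normalized to unit norm".

```python
def annotation_prior(annotations, strong=100.0, weak=1.0):
    """Build prior weights from a vector of annotations.

    annotations: list with 1 (known annotation), 0 (known absence),
                 or None (held out).
    Known annotations get the weight `strong`; everything else, including
    held-out entries, gets the non-informative weight `weak`.
    Returns the raw weights and a copy normalized to sum to one (assumption).
    """
    xi = [strong if a == 1 else weak for a in annotations]
    total = sum(xi)
    return xi, [x / total for x in xi]

# Annotation vector from the text, with a_g(1) and a_g(4) held out.
a_g = [None, 0, 0, None, 1, 0, 0, 0, 0, 1]
xi_raw, xi_norm = annotation_prior(a_g)
```

Held-out entries and known absences receive the same weight, so the prior carries no information that distinguishes a held-out annotation from a true absence.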

Table 9 reports the accuracy results, detailed by genes with single and multiple annotations, and evaluated separately for annotations (i.e., the 1s) and lack of annotations (i.e., the 0s). The baseline accuracy for predicting single annotations, using random guesses for each gene independently, ranges from 1/19 ≈ 5% for the metabolic cycle genes to 1/520 ≈ 0.2% for the ESR genes. The baseline accuracy is slightly higher for predicting multiple annotations, since predicting single annotations is a harder problem.

Table 9.

Out-of-sample predictions of functional annotations for the Nitrogen starvation experiment. Accuracy in recovering (single/multiple) annotations for four lists of genes.

single annotations multiple annotations
observed missing observed missing
gene list 0s 1s 0s 1s 0s 1s 0s 1s
growth rate 0.94 0.36 0.94 0.36 0.83 0.75 0.84 0.74
ESR induced 0.92 0.39 0.92 0.46
ESR repressed 0.99 0.00 0.84 0.43 0.85 0.50
metabolic cycle 0.97 0.25 0.96 0.10 0.78 0.65 0.80 0.63

For completeness, we also report the accuracy in predicting annotations that were known during model fitting, to get a sense of the goodness-of-fit from a substantive, biological perspective. In fact, if the model assumptions are accurate, we would expect accurate predictions for the known annotations. If the model is not accurate, or if the model provides too much shrinkage, we would expect lower accuracy on known annotations.

Overall, the blockmodel assumptions are substantiated by the results in Table 9. The model is useful for encoding biological information about single and multiple functional annotations. The out-of-sample prediction accuracy of the multi-way blockmodel is solid, and consistently much higher than the baseline. These results complement and confirm the out-of-sample prediction results we obtained in Section 3.3.

4.4. Coordinated and differential regulatory response to Nitrogen and Carbon starvation

Here we provide an illustration of how the multi-way blockmodel can be used to perform quantitative and qualitative analysis of coordinated regulation in response to Nitrogen starvation, and differential regulation in response to Carbon starvation. We perform this analysis for the same four lists of genes we considered in Section 4.2.

The quantitative analysis of coordinated regulation is based on the number of genes which are estimated to be associated with the various metabolite classes, in both the Nitrogen and the Carbon starvation experiments.

We used the same procedure described in Section 4.2 to estimate the metabolite classes associated with each gene, using the estimated largest and second-largest (gene-block) memberships. Table 10 reports the number of genes that were found to be associated with a primary metabolite class (largest membership) and with a secondary metabolite class (second-largest membership) for each of the four lists of genes we consider.

Table 10.

Quantitative evaluation of coordinated regulatory responses. Number of genes associated with the same metabolite class in both the Nitrogen and Carbon starvation experiments. The association is estimated using both largest and second-largest membership scores.

largest membership second-largest membership
gene list AA BSI GLY TCA AA BSI GLY TCA
growth rate 11 6 10 2 10 5 3 1
ESR induced 55 13 4 0 48 21 4 0
ESR repressed 128 22 1 17 109 52 0 27
metabolic cycle 2 3 1 0 2 3 1 0

About a fourth of the genes are found to be associated with a primary metabolite class. Despite the similarity in the patterns of primary and secondary associations, the gene sets involved in them are different. These results imply that another fourth of the genes are found to be associated with a secondary metabolite class. Overall, the blockmodel suggests a substantial amount of overlap between the coordinated regulatory responses to Nitrogen and Carbon starvation. A similar quantitative analysis could be conducted to highlight Nitrogen- and Carbon-specific coordinated regulatory responses.

The qualitative analysis of differential regulation is based on the functional enrichment analysis of those genes associated with a given metabolite class in the Nitrogen experiment, but associated with a different metabolite class in the Carbon experiment. For this analysis, we used the procedure above to estimate the metabolite classes associated with each gene, using the estimated largest memberships only, for the list of genes that were found to be ESR induced. The results of the functional analysis obtained for the largest memberships, using the Gene Ontology term finder, are reported in Table 11.

Table 11.

Functional evaluation of gene-metabolite associations that are differentially regulated in Nitrogen and Carbon. Gene Ontology terms for gene-metabolite associations unique to the Nitrogen starvation experiment. Association is computed using the largest memberships.

Class Ontology Term description p-value
AA function alcohol dehydrogenase (NADP+) activity 0.01775
AA function aldo-keto reductase (NADP) activity 0.01775
AA process vacuolar protein catabolic process 0.00026
AA process catabolic process 0.00808
BSI function peroxidase activity 0.0000568
BSI function antioxidant activity 0.0004
BSI function carbohydrate kinase activity 0.00083
BSI function glutathione peroxidase activity 0.0214
BSI process carbohydrate catabolic process 2.98E-7
BSI process cellular response to oxidative stress 0.0000131
BSI process trehalose metabolic process 0.0000203
BSI process alcohol catabolic process 0.0000231
BSI process glycoside metabolic process 0.000061
GLY function oxidoreductase activity 0.0000116
GLY process oxidation-reduction process 0.00097
GLY process cellular carbohydrate metabolic process 0.0011
GLY process carbohydrate metabolic process 0.00143
GLY process cellular aldehyde metabolic process 0.00351

These results highlight how the same set of genes (a proxy for proteins) may be using metabolites differently to execute a response to Nitrogen and to Carbon starvation. For instance, metabolites in the BSI class are used to process glycoside, alcohol, and trehalose, as part of antioxidant activities. Metabolites in the GLY class are used to execute oxidoreductase activities and metabolize aldehyde and carbohydrates. The magnitude of the components of the relevant mixed membership vectors provides more information on the degree of involvement of the various gene blocks in these many activities.

A similar qualitative analysis could be carried out to explore the functional landscape that is shared by the Nitrogen and Carbon coordinated regulatory responses to starvation.

4.5. Comparative analysis of raw and pre-processed data

Here we compare a blockmodel analysis of coordinated regulation with an analysis using cross-association (Chakrabarti et al., 2004), quantitatively, in terms of the number of gene-metabolite class associations found. We consider the four lists of genes above for this analysis.

Cross-association takes a binary table as input. We built such a genes-by-metabolites binary matrix Y by thresholding the corresponding matrix of correlations. We assign Y (j, k) = 1 whenever ρ(j, k) is above the 75-th percentile or below the 25-th percentile of the empirical correlation distribution.
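The thresholding rule can be sketched as follows. The percentile definition is an assumption: the sketch uses the "inclusive" interpolation method of `statistics.quantiles`, and other conventions may give slightly different cutoffs. The function name `binarize` and the toy correlation table are hypothetical.

```python
import statistics

def binarize(rho):
    """Set Y(j, k) = 1 when rho(j, k) lies above the 75th or below the 25th
    percentile of all entries of the correlation table, and 0 otherwise."""
    flat = [r for row in rho for r in row]
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(flat, n=4, method="inclusive")
    return [[1 if (r > q3 or r < q1) else 0 for r in row] for row in rho]

# Toy 3 x 3 genes-by-metabolites correlation table.
rho = [[-0.9, -0.4, 0.0],
       [0.1, 0.5, 0.8],
       [-0.1, 0.3, 0.95]]
Y = binarize(rho)
```

Note that strongly positive and strongly negative correlations are both mapped to 1, so the sign of the association is lost in the binary table.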

Cross-association provides a two-way blockmodel as output, in which K1 and K2 are estimated using a metric based on information gain. To make a valid comparison, we fit the stochastic multi-way blockmodel with the same number of gene and metabolite blocks.

An additional complication in this analysis is that the number of metabolite blocks can be different from four, for both cross-association and the stochastic blockmodel. We use non-informative priors on the metabolites memberships in the stochastic blockmodel. In addition, we developed a greedy matching procedure to associate metabolite blocks to metabolite classes, after inference. We proceeded as follows. Each metabolite was associated with a block using its largest (metabolite-block) membership. Each metabolite is associated with a known metabolite class. We assigned a metabolite class label to each metabolite block according to a simple majority rule.
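The matching of metabolite blocks to metabolite classes can be sketched as follows. The sketch interprets "simple majority rule" as taking the most frequent known class among the metabolites assigned to a block, which is an assumption; ties are broken by first occurrence. The function name `label_blocks` and the toy memberships are hypothetical.

```python
from collections import Counter

def label_blocks(memberships, classes):
    """Label each metabolite block with the most common known class.

    memberships: N2 x K2 estimated metabolite memberships.
    classes    : known class label for each metabolite.
    Returns a dict mapping block index to class label; blocks that receive
    no metabolite are absent from the result.
    """
    votes = {}
    for p_k, cls in zip(memberships, classes):
        # Assign the metabolite to the block with its largest membership.
        block = max(range(len(p_k)), key=lambda b: p_k[b])
        votes.setdefault(block, Counter())[cls] += 1
    return {block: counts.most_common(1)[0][0]
            for block, counts in votes.items()}

# Toy example: 5 metabolites, 2 estimated metabolite blocks.
memberships = [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3], [0.1, 0.9], [0.3, 0.7]]
classes = ["AA", "AA", "TCA", "GLY", "GLY"]
labels = label_blocks(memberships, classes)
```

With this rule several blocks may receive the same class label and some classes may not be represented at all, which is one reason why the number of columns in the comparison can differ from four.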

We used the same procedure described in Section 4.2 to estimate the metabolite classes associated with each gene, using the estimated largest and second-largest (gene-block) memberships.

Table 12 reports the number of Gene Ontology terms that were found to be associated with a primary metabolite class (largest membership), in the first four rows, and with a secondary metabolite class (second-largest membership), in the next four rows, for each of the four lists of genes we consider. The last four rows report the corresponding numbers obtained using cross-association. The multi-way stochastic blockmodel finds more primary associations than cross-association, 246 versus 194. In addition, if we consider the secondary associations, the blockmodel analysis uncovers 190 more associations. In fact, subsets of genes associated with the same primary and secondary metabolite class, for instance AA, are not overlapping by construction.

Table 12.

Quantitative evaluation of the Gene Ontology terms associated with the gene-metabolite class associations found. The table shows results for the multi-way blockmodel’s largest (1st) and second-largest (2nd) memberships, as well as for cross-association (CA).

AA BSI TCA
Memb. gene list BP CC MF BP CC MF BP CC MF total
1st growth rate 1 · · 1 6 5 · · · 13
1st ESR induced 33 7 11 2 1 2 16 · 5 77
1st ESR repressed · 45 26 32 23 5 · · · 131
1st metabolic cycle 13 6 6 · · · · · · 25

2nd growth rate 4 6 3 · · · · · · 13
2nd ESR induced · 1 6 · 16 14 · · 1 38
2nd ESR repressed · 47 21 · 34 11 · · · 113
2nd metabolic cycle · · · 12 6 8 · · · 26

CA growth rate · 6 2 2 · 2 · · · 12
CA ESR induced · · · · 21 19 · · · 40
CA ESR repressed · 47 20 · 30 20 · · · 117
CA metabolic cycle · · · 12 6 7 · · · 25

Overall, cross-association is not well suited for any analysis of biological correlations because of a number of shortcomings, including its reliance on binary input, and its lack of flexibility for incorporating prior biological information, e.g., the number of metabolite blocks. Our results show that the multi-way stochastic blockmodel outperforms cross-associations quantitatively, even when we do not make use of biological prior knowledge.

5. Concluding remarks

In order to analyze the temporal coordination between gene expression and metabolite concentrations in yeast cells, in response to starvation, we developed a family of multi-way stochastic blockmodels. These models extend the mixed membership stochastic blockmodel (Airoldi et al., 2008) to the case of two sets of measurements, and to the case of Gaussian and binary responses. We developed and compared various inference schemes for multi-way blockmodels, including Markov chain Monte Carlo and variational Bayes.

We further explored the impact of thresholding and binning on the analysis. These censoring mechanisms are often used as pre-processing steps. The transformed data are then amenable to the analysis of coordination using off-the-shelf methods, including Bayesian networks and popular blocking algorithms from the data mining literature (Bradley et al., 2009; Chakrabarti et al., 2004). The sensitivity analysis suggests that the impact of pre-processing steps that involve censoring is substantial, both from a quantitative perspective and in terms of its impact on biological discovery, in our case study.

Fig 2.

Illustration of the properties of MCMC inference for one component of B̂. From left to right: absolute ACF for the entire sample chain, absolute ACF for thinned samples after burn-in, trace for the last 1000 iterations.

Fig 3.

Log-likelihood of variational Bayes (all iterations before convergence) and MCMC (the first 100 iterations).

Acknowledgements

The authors thank David Madigan for suggesting extensions of the mixed-membership blockmodel, in the context of the analysis of adverse events. This work was partially supported by the National Science Foundation under grants no. DMS-1106980, no. IIS-1017967 and CAREER IIS-1149662, and by the National Institute of Health under grant no. R01 GM096193 all to Harvard University, and by a Taft fellowship to the University of Cincinnati. EMA is an Alfred P. Sloan Fellow.

Footnotes

1

We refer to gene functions and metabolic pathways as defined in the yeast genome database and the Kyoto encyclopedia of genes and genomes.

References

  1. SGD project. saccharomyces genome database. URL http://www.yeastgenome.org/. [Google Scholar]
  2. Airoldi EM. Getting started in probabilistic graphical models. PLoS Computational Biology. 2007;3(12):e252. doi: 10.1371/journal.pcbi.0030252. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Airoldi EM, Blei DM, Fienberg SE, Xing EP. Mixed membership stochastic blockmodels. Journal of Machine Learning Research. 2008;9:1981–2014. [PMC free article] [PubMed] [Google Scholar]
  4. Airoldi EM, Huttenhower C, Gresham D, Lu C, Caudy A, Dunham M, Broach JR, Botstein D, Troyanskaya OG. Predicting cellular growth from gene expression signatures. PLoS Computational Biology. 2009;5(1):e1000257. doi: 10.1371/journal.pcbi.1000257. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubinand GM, Sherlock G. Gene ontology: Tool for the unification of biology. The gene ontology consortium. Nature Genetics. 2000;25(1):25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Azari H, Airoldi EM. Graphlet decomposition of a weighted network. Journal of Machine Learning Research, W&CP (AI&Stat) 2012;22:54–63. [Google Scholar]
  7. Blei DM, Jordan MI. Variational inference for Dirichlet process mixtures. Bayesian Analysis. 2006;1(1):121–144. [Google Scholar]
  8. Blocker AW, Meng X-L. The perils of data pre-processing. Bernoulli. 2103 [Google Scholar]
  9. Bottou L. Large-scale machine learning with stochastic gradient descent. Compstat. 2010;volume 2010:177–186. [Google Scholar]
  10. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G. GO::TermFinder—open source software for accessing Gene Ontology terms associated with a list of genes. Bioinformatics. 2004;20(18):3710–3715. doi: 10.1093/bioinformatics/bth456. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Bradley PH, Brauer MJ, Rabinowitz JD, Troyanskaya OG. Coordinated concentration changes of transcripts and metabolites in Saccharomyces cerevisiae. PLoS Computational Biology. 2009;5(1):e1000270. doi: 10.1371/journal.pcbi.1000270. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Brauer MJ, Yuan J, Bennett BD, Lu W, Kimball E, Botstein D, Rabinowitz JD. Conservation of the metabolomic response to starvation across two divergent microbes. Proceedings of the National Academy of Sciences. 2006;103(51):19302–19307. doi: 10.1073/pnas.0609508103. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Brauer MJ, Huttenhower C, Airoldi EM, Rosenstein R, Matese JC, Gresham D, Boer VM, Troyanskaya OG, Botstein D. Coordination of growth rate, cell cycle, stress response and metabolic activity in yeast. Molecular Biology of the Cell. 2008;19:352–367. doi: 10.1091/mbc.E07-08-0779. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Braun M, McAuliffe JD. Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association. 2010;105(489):324–335. [Google Scholar]
  15. Chakrabarti D, Papadimitriou S, Modha D, Faloutsos C. Fully automatic cross-associations. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2004;10:79–88.
  16. Cheng Y, Church GM. Biclustering of expression data. Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology. 2000;8:93–103.
  17. Cherry JM, Ball C, Weng S, Juvik G, Schmidt R, Adler C, Dunn B, Dwight S, Riles L, Mortimer RK, Botstein D. Genetic and physical maps of Saccharomyces cerevisiae. Nature. 1997;387(Suppl 6632):67–73.
  18. Cope L, Zhong X, Garrett E, Parmigiani G. MergeMaid: R tools for merging and cross-study validation of gene expression data. Statistical Applications in Genetics and Molecular Biology. 2004;3(1):a29. doi: 10.2202/1544-6115.1046.
  19. Fisher RA. Frequency distribution of the values of the correlation coefficient in samples of an indefinitely large population. Biometrika. 1915;10(4):507–521.
  20. Fisher RA. On the probable error of a coefficient of correlation deduced from a small sample. Metron. 1921;1:3–32.
  21. Franks AM, Csárdi G, Drummond DA, Airoldi EM. Estimating a structured covariance matrix from multi-lab measurements in high-throughput biology. 2012 Mar; doi: 10.1080/01621459.2014.964404.
  22. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO. Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell. 2000;11:4241–4257. doi: 10.1091/mbc.11.12.4241.
  23. Goldenberg A, Zheng AX, Fienberg SE, Airoldi EM. Statistical models of networks. Foundations and Trends in Machine Learning. 2009;2(2):1–117.
  24. Jordan M, Ghahramani Z, Jaakkola T, Saul L. Introduction to variational methods for graphical models. Machine Learning. 1999;37:183–233.
  25. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research. 2000;28(1):27–30. doi: 10.1093/nar/28.1.27.
  26. Bottou L, Le Cun Y. Large scale online learning. Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference; MIT Press; 2004. p. 217.
  27. Liu JS. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association. 1994;89(427):958–966.
  28. Lu R, Markowetz F, Unwin RD, Leek JT, Airoldi EM, MacArthur BD, Lachmann A, Rozov R, Ma’ayan A, Boyer LA, Troyanskaya OG, Whetton AD, Lemischka IR. Systems-level dynamic analyses of fate change in murine embryonic stem cells. Nature. 2009;462:358–362. doi: 10.1038/nature08575.
  29. Markowetz F, Airoldi EM, Lemischka IR, Troyanskaya OG. Mapping dynamic histone acetylation patterns to gene expression in nanog-depleted murine embryonic stem cells. 2009; Manuscript. doi: 10.1371/journal.pcbi.1001034.
  30. Nowicki K, Snijders TAB. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association. 2001;96:1077–1087.
  31. Rohe K, Yu B. Co-clustering for directed graphs: The stochastic co-blockmodel and a spectral algorithm. 2012 Apr; arXiv 1204.2296.
  32. Snijders TAB, Nowicki K. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification. 1997;14:75–100.
  33. Stephens M. Bayesian analysis of mixtures with an unknown number of components—an alternative to reversible jump methods. Annals of Statistics. 2000;28:40–74.
  34. Titterington DM, Smith AFM, Makov UE. Statistical Analysis of Finite Mixture Distributions. Wiley & Sons; 1985.
  35. Toulis P, Rennie J, Airoldi EM. Fitting large scale GLMs with implicit updates. 2013 Feb; Manuscript.
  36. Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in S. cerevisiae). Proceedings of the National Academy of Sciences. 2003;100(19):10623–10628. doi: 10.1073/pnas.0832373100.
  37. Tu BP, Kudlicki A, Rowicka M, McKnight SL. Logic of the yeast metabolic cycle: Temporal compartmentalization of cellular processes. Science. 2005;310:1152–1158. doi: 10.1126/science.1120499.
  38. Turnbull BW. The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society, Series B. 1976;38(3):290–295.
  39. Vardi Y. Empirical distributions in selection bias models. The Annals of Statistics. 1985;13(1):178–203.
  40. Wainwright MJ, Jordan MI. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning. Now Press; 2008. To appear.
