Author manuscript; available in PMC: 2017 May 5.
Published in final edited form as: J Am Stat Assoc. 2016 May 5;111(513):73–92. doi: 10.1080/01621459.2015.1110523

Perturbation Detection Through Modeling of Gene Expression on a Latent Biological Pathway Network: A Bayesian hierarchical approach

Lisa M Pham 1, Luis Carvalho 1, Scott Schaus 1, Eric D Kolaczyk 1
PMCID: PMC5026418  NIHMSID: NIHMS778487  PMID: 27647944

Abstract

Cellular response to a perturbation is the result of a dynamic system of biological variables linked in a complex network. A major challenge in drug and disease studies is identifying the key factors of a biological network that are essential in determining the cell’s fate.

Here our goal is the identification of perturbed pathways from high-throughput gene expression data. We develop a three-level hierarchical model, where (i) the first level captures the relationship between gene expression and biological pathways using confirmatory factor analysis, (ii) the second level models the behavior within an underlying network of pathways induced by an unknown perturbation using a conditional autoregressive model, and (iii) the third level is a spike-and-slab prior on the perturbations. We then identify perturbations through posterior-based variable selection.

We illustrate our approach using gene transcription drug perturbation profiles from the DREAM7 drug sensitivity prediction challenge data set. Our proposed method identified regulatory pathways that are known to play a causative role and that were not readily resolved using gene set enrichment analysis or exploratory factor models. We also present simulation results assessing the performance of this model relative to a network-free variant and its robustness to inaccuracies in biological databases.

Keywords: Bayesian Factor Models, Confirmatory Factor Analysis, Conditional Autoregressive Models, MCMC, Network Biology, Drug Target Prediction, Microarray

1. INTRODUCTION

With the influx of high-throughput genomic data, understanding the biological mechanisms of action that cause changes in cellular homeostasis has become a reachable challenge in the fields of bioinformatics, computational biology, and statistics. High-throughput measurement techniques such as transcriptional profiling allow us to measure transcript levels across thousands of genes simultaneously. However, analyzing individual gene transcriptional profiles cannot, by itself, elucidate the biological mechanisms responsible for the changes observed in gene expression. Rather, gene transcription provides only one facet of a multifaceted system of biological variables that culminate to form a cellular response.

Mechanisms of action (MoA) that drive cellular dysregulation can arise in several biological contexts. Chemotherapeutic compounds, for instance, alter very distinct mechanisms, such as changes in the topology of DNA structure induced by specific topoisomerase inhibitors (e.g. Malik et al., 2006; Nakada et al., 2006), or inhibition of cellular motility mechanisms caused by myosin II inhibitor compounds (e.g. Allingham et al., 2005). In another example, cancer metastasis is also caused by aberrant pathways that disrupt normal cellular regulation, resulting in cancer proliferation. The identification of such key mechanisms is important because they can provide unique signatures not readily apparent from directly analyzing gene expression without added structure or biological context. Methods that leverage biological information by incorporating added structure or biology can be used as a diagnostic tool in a clinical context.

Our primary goal in this paper is to identify drug targets and mechanisms of action in drug perturbation experiments. That is, we aim to find pathways significantly related to each drug’s MoA, which can enhance both therapeutic benefit and assessment of efficacy. Moreover, understanding a drug’s MoA and biological pathway information can provide a better assessment of cellular response to drugs, possibly providing a therapeutic profile for each drug selection.

1.1 The drug target problem: an inverse problem

A cellular response is the culmination of interactions between genes/proteins. A single perturbed pathway, for instance, can cause a rippling effect across a global network of interactions, leaving behind a cascade of transcriptional dysregulation across the genome. Fig. 1 is a schematic illustration of how a drug perturbs a system of interacting pathways: the inhibition/activation of one or more proteins in a single pathway consequently alters several downstream pathways, leading to changes in gene expression. To better understand the primary targets of a perturbation source, we propose a method that formally filters out these regulatory dependencies between biological factors (pathways) to uncover the primary target underlying a perturbation response.

Figure 1. A representation of a drug perturbation to a pathway A and its downstream effects on associated pathways and gene expression.

From a mathematical perspective, the nature of the problem we address in this paper is not unlike that of ‘deconvolution’ in image processing, a similarity that has been noted by others in this area (e.g., ‘drug target deconvolution’ (Terstappen et al., 2007)). In the image processing version of this problem, an image, say f, is of interest but one has available only blurred and noisy measurements, say y = Kf + e. While denoising y can be relatively straightforward, it only leaves one with an estimate of the blurred image, Kf. In order to recover f itself, the effect of the blurring operator K must be inverted. However, even in the ideal case where K is known this inversion can be ill-posed and the recovery of f can be severely degraded by the corresponding inflation of the noise e. When K is unknown or only partially known, as is analogous to what we face in the drug target prediction problem, the degradation can be arbitrarily worse.

1.2 Identifying Pathway Targets in the DREAM 7 Drug Sensitivity Prediction Challenge Data Set

For our purposes of target pathway identification in drug perturbation experiments, we explore the NCI DREAM7 drug sensitivity prediction challenge dataset (Bansal et al., 2014), which is part of the Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge series (Marbach et al., 2012; Prill et al., 2011). To assess the performance of our method, we focus our attention on the DREAM7 drug sub-challenge 2 dataset (Bansal et al., 2014), which consists of microarray gene expression profiles from the LY3 cancer cell line. Exactly 14 drugs were tested at different concentrations and durations, and were compared to their mock control counterparts. These high quality, methodical, and carefully designed experiments serve well in testing methods that are designed to predict drug modes of action because the cellular effects of these drugs have been well studied, spanning a variety of mechanisms from DNA-damaging agents (e.g. etoposide (Nakada et al., 2006)) and cellular motility inhibitors (e.g. blebbistatin (Allingham et al., 2005)) to compounds that disrupt regulatory signaling mechanisms (e.g. geldanamycin (Neckers et al., 1999; Grenert et al., 1997)).

Differential gene expression analyses and other gene enrichment methods may provide insight into dysregulated genes or gene sets (e.g. biological pathways) resulting from a drug perturbation propagating through a system of interacting genes or proteins. However, the primary source of perturbation that can explain the global variation in gene expression is often difficult to discern from differential gene expression alone. For instance, DNA damaging agents that induce cell cycle arrest initiate a series of biological processes such as cell death pathways (apoptosis), protein degradation pathways (e.g. RNA degradation, ubiquitin mediated proteolysis), and possibly DNA-repair pathways. As a result, genes associated with these downstream pathways may be upregulated and consequently identified by differential gene analyses such as gene set enrichment analysis (GSEA, Subramanian et al., 2005). Rather than detecting the residual effects of such a perturbation, we aim to identify upstream pathways positioned to cause changes in gene expression. In fact, in the case of DNA damaging agents such as the drug camptothecin, we identified P53 signaling in the DREAM7 dataset whereas GSEA did not (see Section 6.4 for details); P53 signaling may be causally linked to cell cycle arrest induced by DNA damage (Jaks et al., 2001; Gupta et al., 1997; Wang et al., 2004).

Moreover, comparing drug profiles from two different exposure times, we show that the LY3 cancer cell line is more sensitive to certain drugs than to others. We also found drug-induced pathways that were consistently identified across varying conditions. Lastly, we found that drugs having similar mechanisms (e.g. DNA damaging agents) clustered together using profiles generated by our method.

1.3 Organization of this paper

In Section 2, we discuss related work. In Section 3, we describe the hierarchical model in detail including priors and model identification constraints. In Section 4, we outline the aims of posterior inference and the steps to our sampler. We assess the performance of our model compared to an exploratory factor analysis (EFA) model using simulated data sets in Section 5. We discuss our results after applying our method to a drug perturbation dataset (DREAM7, Bansal et al., 2014) and compare our method against an EFA model and gene set enrichment analysis (GSEA, Subramanian et al., 2005) in Section 6. Lastly, in Section 7, we summarize our method as well as our results.

2. PRIOR AND RELATED WORK

Currently, there are roughly three ways in which the task of identifying disease or drug targets has been approached. In the simplest approach, statistical methods are used to find differential expression across a binary phenotype (Dudoit et al., 2003; Storey and Tibshirani, 2003). The goal with this approach is to identify individual genes whose expression levels are associated with a trait relative to some control conditions. These methods are rooted in multiple testing and usually control either the family-wise error rate (Dudoit et al., 2003) or the false discovery rate (Dudoit et al., 2003; Storey and Tibshirani, 2003). Although relatively straightforward, these methods have two main drawbacks: they can only identify downstream transcriptional effects, since the goal is to detect differential expression and not possible sources of perturbations; and they operate on a gene-by-gene basis, so the phenotypes for each gene are compared in isolation instead of in coordination across gene sets.

The second approach, however, uses biological information to group genes with the goal of capturing coordinated expression changes within a gene set that might not have been detected if the analysis were conducted on individual genes. The aim, therefore, is to look for statistical significance in differential expression across entire gene sets. These gene sets represent a systems-level view of the cell, such as cellular pathways or transcription factor targets. The simplest method to test whether a gene set contains a significant number of transcriptionally altered genes is a hypergeometric test. This method and other similarly straightforward methods are reviewed in (Khatri and Draghici, 2005; Rivals et al., 2007). Subramanian et al. (2005) developed a method called gene set enrichment analysis (GSEA) with the same goal, but this method does not require a user to choose a threshold to separate differentially expressed genes from non-differentially expressed genes. Instead, GSEA measures how well a gene set clusters at the top or bottom of a ranked list of genes (e.g. ranked using fold ratios or p-values from two sample tests when comparing binary phenotypes). Extensions of GSEA and other threshold-free methods are described in (Jiang and Gentleman, 2007; Nam and Kim, 2008). GSEA is the most widely used gene set analysis tool to date; however, it does not use any gene network information.
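As a minimal sketch of the hypergeometric test mentioned above, the following computes the upper-tail probability of seeing at least a given number of differentially expressed (DE) genes in a gene set. The function name and all counts are hypothetical and chosen only for illustration; they do not come from the DREAM7 analysis.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_genes, n_de, set_size, n_overlap):
    """P(X >= n_overlap) when a gene set of size set_size is drawn without
    replacement from n_genes genes, of which n_de are differentially expressed."""
    total = comb(n_genes, set_size)
    upper = min(n_de, set_size)
    tail = sum(
        comb(n_de, x) * comb(n_genes - n_de, set_size - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 1000 genes on the array, 50 called DE,
# a pathway of 20 genes of which 5 are DE.
p_value = hypergeom_enrichment_pvalue(n_genes=1000, n_de=50, set_size=20, n_overlap=5)
```

Note that the test depends on the threshold used to call genes DE, which is the dependence that GSEA is designed to avoid.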

The third and final approach is network-based and uses information from cellular regulation or gene-protein interactions to provide a better understanding of the molecular mechanisms underlying a response. On an individual protein-gene level, many supervised methods have been proposed to predict interactions between drugs and proteins. For instance, using existing information of known protein-drug interactions, machine learning methods (Faulon et al., 2008; Yamanishi et al., 2008) have been used to predict interactions of individual proteins and drugs using a combination of protein sequence and chemical data. Network based approaches that identify gene sets over individual genes or proteins have also been proposed. Gu et al. (2010), for instance, use a two step method where in the first step they identify a set of differentially expressed (DE) genes that are then mapped to a biological network; in the second step, a clustering method is used to identify genes that form a connected subgraph of DE genes. Other techniques, including pathway-express (Draghici et al., 2007) and gene network enrichment analysis (Liu et al., 2010) have also used network topology. Pathway-express (Draghici et al., 2007) scores genes in a network using differential expression such that differentially expressed genes that are connected are scored higher than those that are not connected. Pre-defined gene sets are then represented using the scores of its genes in the network. Gene network enrichment analysis (Liu et al., 2010) first identifies a connected subnetwork of differentially expressed genes of a global protein-protein interaction network, and then identifies pre-defined gene sets, such as biological pathways, that are enriched with respect to the genes in a differentially expressed subnetwork.

Although these increasingly complex approaches serve well in identifying genes or gene sets that are differentially expressed, they are not designed to identify which of the genes or gene sets they report are primary targets of a specific response, and thus they do not prescribe a mechanism of action. To this end, it is essential to explicitly model the cascading effects of a perturbation by considering biological relationships in a gene network. Our proposed method is such a model-based approach: a factor model, called confirmatory factor analysis (CFA), with a conditional autoregressive (CAR) component designed to identify primary targets underlying gene expression. Spatial conditional autoregressive models have recently been explored in biology, specifically with functional networks (e.g. Wei and Pan, 2008). The motivation behind such models stems from the principle that genes that work together (i.e. genes that are functionally related) are usually co-expressed/inhibited, co-(de)regulated, or co-(de)activated.

Factor analysis models have also been used extensively in biology (Lucas et al., 2006; Carvalho et al., 2008; Lucas et al., 2010; Ma and Zhao, 2012). Ma and Zhao (2012) use a Bayesian factor analysis model to identify drug-pathway interactions by analyzing paired gene expression and drug sensitivity data and their relationship to common pathways, represented as latent factors. In their model, however, all factors (pathways) are independent. These network-free models are referred to as exploratory factor analysis (EFA) models. West et al. (Lucas et al., 2006; Carvalho et al., 2008) also use exploratory factor analysis models but with an aim to identify biomarkers using gene expression data. However, in (Lucas et al., 2006; Carvalho et al., 2008) the factors are not structurally informed by known biological gene sets and instead are completely data-driven. Lastly, EFA or traditional CFA models assume a zero prior mean for the latent factors. Conversely, our proposed method is a novel approach to perturbation target identification on a latent scale, which gives rise to non-trivial prior means on the latent factors. Taken together, our proposed method is a synthesis of several key principles from both confirmatory factor models and conditional autoregressive models that allow us to explain the variation observed in transcriptional data as a function of perturbations to a latent network of biological pathways.

Our proposed model was first motivated by the work of di Bernardo et al. (2005), where a mathematical model adapted from Hill-type transcription kinetics (Liao et al., 2003) was used to describe the relationships among gene transcripts in a cell. The model captures both internal regulatory influences among p gene transcripts and effects due to external perturbations:

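$$\log_{10}\!\left(\frac{\nu_i}{\nu_{ib}}\right) = \log_{10}\!\left(\frac{\mu_i}{\mu_{ib}}\right) + \sum_{j \neq i}^{p} \eta_{ij} \log_{10}\!\left(\frac{\nu_j}{\nu_{jb}}\right), \tag{1}$$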

where νi, for i = 1, …, p, represents the expression level of transcript i, μi represents the direct influence of the perturbation on transcript i, νib and μib represent a set of baseline values of νi and μi, respectively, and ηij represents the influence of transcript j on transcript i. This can be re-expressed as

$$\psi = B\psi + \phi \tag{2}$$

where ψ = (ψ1, …, ψp)′ with ψi = log10(νi/νib), and ϕ = (ϕ1, …, ϕp)′ with ϕi = [1/(1 − ηii)] log10(μi/μib), a scaled version of the log-relative direct influence of the perturbation. Here, B is a p × p matrix representing the network interaction effects among gene transcripts with Bii = 0 and, for i ≠ j, Bij = ηij/(1 − ηii).
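To illustrate why Eq. (2) poses an inverse problem, the sketch below solves ψ = Bψ + ϕ for a hypothetical three-transcript chain by fixed-point iteration. The matrix B and vector ϕ are invented for illustration, and the iteration is only one convenient way to solve the system (it converges when the spectral radius of B is below 1).

```python
def propagate(B, phi, n_iter=200):
    """Solve psi = B psi + phi by fixed-point iteration."""
    p = len(phi)
    psi = list(phi)
    for _ in range(n_iter):
        psi = [phi[i] + sum(B[i][j] * psi[j] for j in range(p)) for i in range(p)]
    return psi

# Hypothetical chain: transcript 0 influences 1, which influences 2.
B = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]
phi = [1.0, 0.0, 0.0]   # only transcript 0 is perturbed directly
psi = propagate(B, phi)
print(psi)
```

Although ϕ has a single non-zero entry, every entry of ψ is non-zero: the observed response is dense even when the perturbation is sparse, and recovering ϕ requires filtering out the effect of B.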

The work in (Cosgrove et al., 2008) is a statistical extension of the mathematical model in Eq. (2) in the form of

$$y = By + \phi + e \tag{3}$$

where y is now the p ×1 vector of measures of ψ. Here, each measured expression level yi is potentially influenced by other transcripts, where this influence is captured through B. The perturbations are additive components represented by ϕ.

Here, we further extend this model to factor analysis models which are frequently used to analyze high-dimensional gene expression data (Ma and Zhao, 2012; Lucas et al., 2010; Carvalho et al., 2008; Lucas et al., 2006). Specifically, we assume that gene expression is linked to an overall factor defined by biological pathways in the cell. Our proposed model can be described as a hierarchical model where the first (gene) level is a regression of gene expression on biological pathways, and the second (pathway) level is an auto-regressive model that models the behavior of a system of interacting biological pathways underlying external perturbation effects. We describe the proposed model in Section 3.

3. MODEL FRAMEWORK

We developed a statistical hierarchical model to uncover perturbations to biological pathways using high-throughput measurements of gene expression on individual genes. The data Y = {Ye, Yc} is a p × n matrix of transcriptional profiles of p genes across n varying conditions, and consists of a set of cases, Ye, and controls, Yc. To avoid using an intercept in our model, we subtract the mean of the control group Yc from the matrix Y. As a result of this normalization, the mean value of expression is zero if there is no difference between cases and controls. Our goal is to identify latent factors (e.g. biological pathways) that can explain the variation observed in the experimental data, Ye, relative to the control conditions, Yc.
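The control-centering step can be sketched as follows. This assumes that the control mean is taken separately for each gene (row-wise), which is one reading of the normalization described above; the function name and the small matrix are hypothetical.

```python
def center_on_controls(Y, control_cols):
    """Subtract each gene's mean over the control samples from every sample."""
    centered = []
    for row in Y:
        control_mean = sum(row[c] for c in control_cols) / len(control_cols)
        centered.append([value - control_mean for value in row])
    return centered

# Hypothetical 2-gene, 4-sample matrix; columns 2 and 3 are the controls.
Y = [[5.0, 7.0, 4.0, 6.0],
     [1.0, 1.0, 1.0, 1.0]]
Y_centered = center_on_controls(Y, control_cols=[2, 3])
```

After centering, a gene with no case-control difference (the second row here) has values of zero in every sample, so no intercept is needed.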

Our model is built on three hierarchical components:

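$$y_{ki} \mid \Lambda, \omega_i, \psi_k \overset{\text{ind}}{\sim} N(\Lambda_k \omega_i,\ \psi_k) \tag{4}$$
$$\omega_{ji} \mid \omega_{[-j]i}, \rho_{ji}, \sigma^2 \overset{\text{ind}}{\sim} N\Big(\sum_{j'=1,\, j' \neq j}^{q} B_{jj'}\,\omega_{j'i} + \rho_{ji},\ \sigma^2\Big) \tag{5}$$
$$\rho_{ji} \mid \theta_{ji}, \tau^2 \overset{\text{ind}}{\sim} N\big(0,\ \theta_{ji}\tau^2 + (1-\theta_{ji})\,v_0\tau^2\big) \tag{6}$$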

for i = 1, …, n, k = 1, …, p, j = 1, …, q, where n, p, and q represent the number of samples, genes, and pathways, respectively, and

  • yki is the observed gene expression level of gene k in sample i.

  • ωi is a q×1 vector of latent common factors representing inherent pathway effects for sample i.

  • Λ is a p × q factor loadings matrix in the CFA model, relating gene expression to pathway effects. Λk is the k-th row of Λ.

  • ρji is a random “external” perturbation effect for pathway j in sample i.

  • θji is an indicator for pathway j in sample i being perturbed.

We describe and motivate the various aspects of this model in more detail in the following subsections.

3.1 Modelling Biological Replicates

It is common in biological studies to have replicates. In this case we allow perturbations and latent factors to vary for each replicate, yielding the model

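$$y_{kir} \mid \Lambda, \omega_{ir}, \psi_k \overset{\text{ind}}{\sim} N(\Lambda_k \omega_{ir},\ \psi_k) \tag{4′}$$
$$\omega_{jir} \mid \omega_{[-j]ir}, \rho_{jir}, \sigma^2 \overset{\text{ind}}{\sim} N\Big(\sum_{j'=1,\, j' \neq j}^{q} B_{jj'}\,\omega_{j'ir} + \rho_{jir},\ \sigma^2\Big) \tag{5′}$$
$$\rho_{jir} \mid \theta_{ji}, \tau^2 \overset{\text{ind}}{\sim} N\big(0,\ \theta_{ji}\tau^2 + (1-\theta_{ji})\,v_0\tau^2\big) \tag{6′}$$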

for replicates r = 1, …, mi in each study. Note that while ρ and ω vary, we assume the same perturbation targets and gene-pathway loadings across replicates and so θ and Λ do not depend on index r. In the next sections we present each of these levels in detail, but we drop the replicate indices to simplify the notation.

3.2 Confirmatory Factor Level

In the first component, we focus on the likelihood of individual gene expression measurements. It has been shown that the transcriptional activity of a gene is, in part, regulated by pathways positioned upstream of gene regulation (e.g. Vogelstein and Kinzler, 2004). We describe this relationship between a gene and its pathways using a confirmatory factor analysis model, as shown in Eq. (4). Here, the gene expression measurements, Y, are regressed on a set of biologically structured latent factors ω (biological pathways).

Both the structure of the latent factors ω and the dependency between individual genes y and these latent factors are encoded, a priori, in the p × q loading factor matrix Λ. We integrate biological information into the model through Λ by constraining some of its elements to zero, so that the non-zero elements in each column of Λ correspond to the genes of a known canonical pathway. That is, a gene k loads onto a pathway j (λkj ≠ 0) if gene k (or its gene product) is a member of pathway j. If a gene (or its gene product) is not in a pathway, the corresponding loading factor is constrained to 0. Thus genes are regressed only on the pathways to which they belong.
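The zero pattern of Λ described above can be sketched as a 0/1 membership mask. The gene symbols and pathway memberships below are illustrative placeholders, not the KEGG gene sets used in the paper, and the function name is hypothetical.

```python
def loading_mask(genes, pathways):
    """Return a p x q 0/1 matrix; entry (k, j) is 1 when gene k is in pathway j.
    Loadings are free where the mask is 1 and fixed at 0 elsewhere."""
    return [[1 if gene in members else 0 for members in pathways.values()]
            for gene in genes]

genes = ["TP53", "MDM2", "CDK1", "ACTB"]      # illustrative gene symbols
pathways = {                                  # illustrative memberships only
    "p53_signaling": {"TP53", "MDM2"},
    "cell_cycle": {"TP53", "CDK1"},
}
mask = loading_mask(genes, pathways)
```

A gene that belongs to several pathways (the first row here) has several free loadings, while a gene in none of the listed pathways has an all-zero row.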

We use the following priors for the loading factors Λ:

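$$\lambda_{kj} \overset{\text{iid}}{\sim} \begin{cases} N(0,\ 0.1), & \text{if gene } k \text{ is in pathway } j \\ \delta_0(\cdot), & \text{otherwise} \end{cases}$$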

where δ0 is the Dirac delta function and we set the variance of Λ to be small to prevent the latent factors from shrinking towards zero in practice, and keep the pathway noise variance weakly informative.

We give the gene variances Ψ inverse gamma priors,

$$\psi_k \overset{\text{iid}}{\sim} IG(\zeta,\ \zeta - 1), \quad k = 1, \ldots, p,$$

where the scale ζ − 1 is chosen, given the shape ζ, so that the prior mean is 1. The hyper-parameter ζ then controls the prior strength on the gene noise variance parameters Ψ, which lets us avoid overfitting the model. This prior regularization is important when the data size is not large enough to fit complex hierarchical models: if the prior on each ψk is too flexible, the result can be instability, poor convergence, multiple local modes, and empirically non-identified models. The extent to which an inverse gamma prior is informative depends on the strength of the data and is translated as a variance tradeoff between the prior and the data in the conditional posterior distribution of ψk:

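$$\psi_k \mid \Omega, \Lambda, Y \sim IG\Big(\zeta + \frac{n}{2},\ \zeta - 1 + \frac{1}{2}\sum_{i=1}^{n} (Y_{ki} - \Lambda_k \omega_i)^2\Big). \tag{7}$$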

Since an inverse gamma distribution with shape α corresponds to a χ2 distribution with 2α degrees of freedom (Gelman, 2006), the prior provides 2ζ pseudo-observations and the conditional in (7) has 2ζ + n observations. We thus constrain the prior variances of Ψ by setting κ times data observations as prior pseudo-observations, ζ = κn/2, and so the conditional in (7) has n(κ + 1) degrees of freedom. In this manner, we fix κ < 1 so that the prior does not overpower the likelihood. In our implementation, we fixed κ to 0.5 to have a reasonably informed prior.
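A draw from the conditional in (7) with ζ = κn/2 can be sketched as below. The residuals Yki − Λk ωi are passed in directly, and the function name and example values are hypothetical; this is an illustration of the update, not the authors' implementation.

```python
import random

def sample_psi_k(residuals, kappa=0.5, rng=random):
    """Draw psi_k from IG(zeta + n/2, zeta - 1 + 0.5 * sum(residuals^2)),
    with zeta = kappa * n / 2 as in the text."""
    n = len(residuals)
    zeta = kappa * n / 2.0
    shape = zeta + n / 2.0                       # equals n * (kappa + 1) / 2
    scale = zeta - 1.0 + 0.5 * sum(r * r for r in residuals)
    if scale <= 0.0:
        raise ValueError("scale must be positive; increase n or kappa")
    # If G ~ Gamma(shape, rate=scale), then 1/G ~ IG(shape, scale).
    # random.gammavariate takes (shape, scale-of-gamma) = (shape, 1/rate).
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)

# Hypothetical residuals for one gene across n = 8 samples.
rng = random.Random(0)
draw = sample_psi_k([1.0] * 8, kappa=0.5, rng=rng)
```

With n = 8 and κ = 0.5 the prior contributes 2ζ = 4 pseudo-observations against 8 data observations, so the likelihood dominates the update, as intended by fixing κ < 1.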

3.3 Conditional Autoregressive Level

The second component of the model is designed to characterize a perturbed system of interacting pathways. The dependency between pathways is captured by a conditional autoregressive model (Cressie, 1993; de Oliveira, 2012) in Eq. (5). The reasoning is that, under normal stationary conditions, each pathway can be explained, on average, by other pathways in a pathway-pathway network through the coefficients B. However, when a pathway is targeted by some external perturbation source (e.g. from disease, or drug perturbation), then the expected mean of this pathway is shifted by an additional term ρ due to the perturbation. In the context of drug perturbations, one can effectively imagine that after a drug hits a specific pathway, this drug-induced effect propagates across the network, affecting many pathways by mere association, and altering cellular phenotype.

In the CAR level of the model, Eq. (5), the q × q design matrix B represents a system of pathway-pathway interactions. More specifically, let us initially set B := γW, where γ is referred to in the CAR literature as the spatial scaling parameter, and W is a symmetric zero-diagonal matrix. We briefly describe the pathway network W in Section 3.4 and provide a technical description in Appendix A.

If we define Φ := (I − γW)−1, where I represents the identity matrix of order q, the joint distribution of ω can be written as follows (Cressie, 1993):

$$\omega \mid \rho, \Phi, \sigma^2 \sim N(\Phi\rho,\ \sigma^2\Phi). \tag{8}$$

Thus, if γ = 0 the CAR model is reduced to an exploratory factor analysis model. However, to attain an identifiable model (please refer to Section 3.6 for details), we require that the factors have equal variance s2 and so Φ = s2R(γ), where R(γ) is the correlation induced by G(γ) := (I − γW)−1,

$$R(\gamma) = \mathrm{Diag}(G(\gamma))^{-\frac{1}{2}}\, G(\gamma)\, \mathrm{Diag}(G(\gamma))^{-\frac{1}{2}}. \tag{9}$$

In addition, to ensure that Φ is positive definite, we constrain γ to the interval (1/ι1, 1/ιq), where ι1 and ιq are the minimum and maximum eigenvalues of W, respectively. We note that if W is not positive definite then ι1 < 0 and so γ can take negative values which, in turn, induce negative entries in R(γ) (Wall, 2004). These negative correlations can represent a net inhibitory effect between genes in two pathways, an expected pattern if the pathways contain regulatory genes. The prior on γ is then uniform,

$$\gamma \sim \mathrm{Unif}\Big(\frac{1}{\iota_1} + \delta,\ \frac{1}{\iota_q} - \delta\Big),$$

where δ > 0 ensures that the maximum correlation in R(γ) is not near 1, which can lead to a non-identified model. We set δ = 0.005, which, in our experience, bounds the maximum correlation at roughly 0.90. Finally, we set a weakly informative prior on σ2, σ2 ~ IG(0.001, 0.001).
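The construction of R(γ) in Eq. (9) can be sketched for a small network as below. The three-pathway network (A linked to B and to C, with unit edge weights) and the value γ = 0.4 are assumptions for illustration; the eigenvalues of this particular W are −√2, 0 and √2, which are hard-coded rather than computed. The helper `invert` is a plain Gauss-Jordan routine written for this example.

```python
from math import sqrt

def invert(M):
    """Gauss-Jordan inverse of a small, well-conditioned square matrix."""
    n = len(M)
    A = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [row[n:] for row in A]

def car_correlation(W, gamma):
    """R(gamma) = Diag(G)^(-1/2) G Diag(G)^(-1/2), with G = (I - gamma W)^(-1)."""
    q = len(W)
    G = invert([[(1.0 if i == j else 0.0) - gamma * W[i][j] for j in range(q)]
                for i in range(q)])
    d = [sqrt(G[i][i]) for i in range(q)]
    return [[G[i][j] / (d[i] * d[j]) for j in range(q)] for i in range(q)]

W = [[0, 1, 1],          # pathway A linked to B and C (assumed unit weights)
     [1, 0, 0],
     [1, 0, 0]]
iota_min, iota_max = -sqrt(2.0), sqrt(2.0)   # extreme eigenvalues of this W
delta = 0.005
gamma_lo, gamma_hi = 1.0 / iota_min + delta, 1.0 / iota_max - delta
R = car_correlation(W, 0.4)                  # 0.4 lies inside (gamma_lo, gamma_hi)
```

In this sketch pathways B and C receive a positive correlation even though they are not directly linked, because both are connected to A; setting γ = 0 returns the identity matrix, which corresponds to the network-free case.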

3.4 Interaction Network in the CAR Model

The pathway interaction network lies at the core of our model. We specifically use a network whose nodes are pathways, constructed in a manner that implicitly links pathways by their common function in the cell. There are several ways of building a pathway interaction network. The most ad hoc approach is to define a link between two pathways whenever their gene members have a non-empty intersection. However, this rule overlooks interactions that may occur between non-overlapping genes of two pathways. An alternative approach is to use a protein-protein interaction (PPI) database to identify physical interactions between members of two pathways and to define the overall interaction link through an aggregate score. However, PPI data are often very noisy (von Mering et al., 2002; Reguly et al., 2006; Gandhi et al., 2006), because methods that achieve high coverage of the proteome inherently have a high false positive rate.

To better capture interactions between pathways, we used Gene Ontology’s (GO) biological processes (The Gene Ontology Consortium, 2000) to define functional links between pathways. The advantage of GO over a PPI database is that a GO gene set is manually curated and thus contains far fewer false positives. Functional networks, which have been used extensively in biology (Ideker et al., 2002; Chuang et al., 2007; Roguev et al., 2008), are generally motivated by the principle that genes that work together to accomplish a task in the cell are usually co-activated/inhibited or co-(de)regulated.

To construct the pathway-pathway interaction network W, we regard the set of pathways as a weighted network where the nodes represent canonical pathways and the edge weights in W reflect the degree of functional similarity between two pathways. As an example, in Fig. 1 the nodes would represent Pathways A, B, and C with a link connecting Pathways A to B and A to C. We then create W in a manner similar to an algorithm by Pham et al. (2011) using KEGG regulatory and signaling pathways (Kanehisa et al., 2006; Kanehisa and Goto, 2000), and GO biological processes (The Gene Ontology Consortium, 2000) obtained from the MSigDB collection (Subramanian et al., 2005). For a technical description of the network construction see Appendix A.

3.5 Spike-and-Slab Prior

Our goal of elucidating primary targets of a perturbation reduces to identifying perturbation effects after accounting for all pathway-pathway interactions (i.e. via network filtering) as described by a known biological network W. Importantly, we assume that relatively few pathways are primary targets of a perturbation. We use a spike-and-slab prior on ρ (George and McCulloch, 1997) and identify perturbations through posterior-based variable selection. Thus our parameters of interest are variables θji that indicate whether ρji represents a non-zero perturbation for pathway j in sample i, as in Eq. (6).

Because the scale of the perturbations is unknown, we choose a mixture of two normals (George and McCulloch, 1997), one approximating the spike and the other the slab, in such a way that is invariant to the scale of the perturbations. That is, we define the spike variance to be some fraction v0 of the slab variance (Braunstein et al., 2011) and let the slab variance be random. We fixed v0 = 0.01, meaning that a non-perturbed pathway has only one-hundredth of the variance of a perturbed pathway, while keeping the prior on τ2 weakly informative, similarly to the prior on σ2, τ2 ~ IG(0.001, 0.001).

Lastly, the prior probability of a pathway being perturbed is captured by a hyper-parameter α: θji ~ Bern(α), independently and identically distributed, for j = 1, …, q and i = 1, …, n. We expect α to be small to reflect the targeted effect of drug perturbations to a few pathways because, for instance, the FDA requires perturbation selectivity for drug approval. Thus, to induce sparsity we set α = 0.1 for the expectation of perturbed pathways. While we expect a priori that on average 10% of the pathways are selectively perturbed, we found in a prior sensitivity analysis that small changes in α do not significantly change our inference on perturbation detection.
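To make the role of the spike-and-slab mixture concrete, the sketch below evaluates P(θji = 1 | ρji, τ2) under Eq. (6) with α = 0.1 and v0 = 0.01. It corresponds to the plain (non-collapsed) conditional for θ given ρ; the function names and the value τ2 = 1 are hypothetical.

```python
from math import exp, log, pi

def normal_logpdf(x, variance):
    """Log density of N(0, variance) at x."""
    return -0.5 * (log(2.0 * pi * variance) + x * x / variance)

def inclusion_probability(rho, tau2, alpha=0.1, v0=0.01):
    """P(theta = 1 | rho, tau2) under the two-normal spike-and-slab mixture."""
    log_slab = log(alpha) + normal_logpdf(rho, tau2)             # theta = 1
    log_spike = log(1.0 - alpha) + normal_logpdf(rho, v0 * tau2) # theta = 0
    return 1.0 / (1.0 + exp(log_spike - log_slab))

p_small = inclusion_probability(rho=0.0, tau2=1.0)   # effect at zero: spike favoured
p_large = inclusion_probability(rho=3.0, tau2=1.0)   # effect of 3 slab s.d.: slab favoured
```

Because the spike variance is defined as the fraction v0 of the slab variance, the probability depends on ρ only through ρ2/τ2, which is the scale invariance described in the text.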

3.6 Model Identification

Model identification has always been a non-trivial task with factor models (see Chapters 4, 7, 8 in (Bollen, 1989)). In CFA models, constraints are usually applied to elements of both Φ and Λ. This can be done, for example, by fixing the factor variances and the first q rows of Λ to a particular structure (Lucas et al., 2006). These methods prevent non-trivial rotational or scale transformations of Λ and Φ by forcing the transformation matrix U to be the identity matrix. However, since in our model Φ := s2R, where R is a correlation matrix (Eq. (9)), Φ is already rotationally unique.

This model, however, is still not identified in a few additional ways. Firstly, for the i-th sample we have Yi | Λ, ωi, Ψ ~ N(Λωi, Ψ) and so, if we marginalize ωi we have the following conditional distribution on Yi:

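$$Y_i \mid \Lambda, \rho_i, \sigma^2, \gamma, \Psi \sim N\big(s^2 \Lambda R(\gamma) \rho_i,\ \Psi + \sigma^2 s^2 \Lambda R(\gamma) \Lambda'\big). \tag{10}$$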

Thus, if we re-scale σ̃2 = kσσ2, s̃2 = kss2, ρ̃i = kρρi, and Λ̃ = kΛΛ such that

$$k_s k_\Lambda k_\rho = 1 \quad \text{and} \quad k_\sigma k_s k_\Lambda^2 = 1$$

we obtain the same conditional distribution on Yi since its mean and variance remain unaltered, respectively. For example, re-scaling Λ, σ2 and ρi with any scalar a ≠ 0 such that Λ̃ = aΛ, σ̃2 = σ2/a2, and ρ̃i = ρi/a does not affect the distribution of Yi. To solve these forms of rescaling, we fix s2 = 1 as well as the prior variance of Λ to 0.1, a small value that avoids shrinking the scale on ω towards 0 in practice. Moreover, we note that since the prior scale of Ψ is centered at 1 and the prior on σ2 is weakly informative, σ2 controls, in effect, how the variance of Yi in (10) is partitioned into pathway effects via Λ and residual effects via Ψ.
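The rescaling example above (Λ̃ = aΛ, σ̃2 = σ2/a2, ρ̃i = ρi/a) can be checked numerically. The sketch below evaluates the mean and covariance in Eq. (10) for the simplest case of a single pathway (q = 1, so R = 1) and two genes; all numerical values and the function name are hypothetical.

```python
def marginal_moments(lam, rho, sigma2, psi, s2=1.0):
    """Mean and covariance of Y_i in Eq. (10) for one pathway (q = 1, R = 1).
    lam: loadings of the p genes; psi: their noise variances."""
    p = len(lam)
    mean = [s2 * lam[k] * rho for k in range(p)]
    cov = [[(psi[k] if k == m else 0.0) + sigma2 * s2 * lam[k] * lam[m]
            for m in range(p)] for k in range(p)]
    return mean, cov

lam, rho, sigma2, psi = [0.5, -0.25], 2.0, 4.0, [1.0, 1.0]
a = 2.0
original = marginal_moments(lam, rho, sigma2, psi)
rescaled = marginal_moments([a * l for l in lam], rho / a, sigma2 / a ** 2, psi)
```

Both calls return the same mean and covariance, so the data cannot distinguish the two parameter sets; fixing s2 = 1 and the prior variance of Λ removes this freedom.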

Finally, there is an additional case that can lead to a non-identified model, arising from the latent factor correlation matrix R. In the simplest case, suppose γ ≠ 0 and W is a connected network (i.e., R can only be arranged as a single block matrix). Under these conditions, there exist exactly two solutions, related by

$$\tilde\Lambda = -\Lambda, \quad \tilde\omega = -\omega, \quad \tilde\rho = -\rho.$$

This is simply a sign flip across all pathways, which does not affect the underlying factor covariance structure Φ. In fact, if R is rearranged into a block diagonal matrix, then each set of pathways corresponding to a block in R can be flipped in this manner without affecting Θ, yielding the same joint posterior density. In other words, if R is rearranged into a block diagonal matrix with b blocks, then there exist exactly $2^b$ solutions. Therefore, perturbations are identified up to their signs; that is, we can infer whether two pathways are perturbed in the same direction, but whether the perturbation is positive or negative is arbitrary. Moreover, since our goal is to infer perturbations, we are more interested in the magnitude of a perturbation (whether or not it is close to zero) than in its sign.
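To make the count of sign-flip solutions concrete, the following sketch counts the blocks b as the connected components of the pathway network W and reports $2^b$. It assumes, as the connected-network case in the text suggests, that the blocks of R coincide with the connected components of W when γ ≠ 0; the function name `count_blocks` and the toy 4-pathway network are hypothetical.

```python
from collections import deque


def count_blocks(weights):
    """Number of connected components of an undirected weighted network.

    `weights` is a square list of lists; a non-zero entry means an edge.
    """
    q = len(weights)
    seen = [False] * q
    blocks = 0
    for start in range(q):
        if seen[start]:
            continue
        blocks += 1
        seen[start] = True
        queue = deque([start])
        while queue:  # breadth-first search over one component
            u = queue.popleft()
            for v in range(q):
                if not seen[v] and (weights[u][v] != 0 or weights[v][u] != 0):
                    seen[v] = True
                    queue.append(v)
    return blocks


# Toy network: pathways {0, 1} and {2, 3} form two separate blocks.
W = [
    [0.0, 0.4, 0.0, 0.0],
    [0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.7, 0.0],
]
b = count_blocks(W)
n_sign_solutions = 2 ** b  # each block can be flipped independently
```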

4. POSTERIOR INFERENCE

Our primary goal is the posterior inference on Θ = {θji}, which is used to identify perturbation targets. We obtained posterior estimates of our model parameters via a partially collapsed hybrid Gibbs sampler (van Dyk and Park, 2008) with an adaptive Metropolis step (Gelman et al., 2004). We call the sampler “partially collapsed” because we sample some of the parameters (namely, ρ and θ) from a marginalized conditional posterior. We found that this partially collapsed sampler moves into regions of high posterior mass faster.

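First, the steps for an ordinary, non-collapsed Gibbs sampler (dropping iteration indices and irrelevant conditional parameters) are:

$$[\Lambda \mid \omega, \psi, Y],\; [\rho \mid \theta, \Phi, \omega, \sigma^2, \tau^2]^{*},\; [\theta \mid \rho, \tau^2]^{*},\; [\gamma \mid \omega, \rho, \Phi, \sigma^2],\; [\omega \mid \Lambda, \rho, \psi, \Phi, \sigma^2, Y],\; [\psi \mid \Lambda, \omega, Y],\; [\sigma^2 \mid \omega, \rho, \Phi],\; [\tau^2 \mid \rho, \theta].$$

Note that, even though $\Phi = s^2 R(\gamma)$ is a matrix that depends on γ, we use Φ and γ interchangeably to make the conditional distributions clearer. Now, to improve the rate of convergence, we marginalize the two steps starred above: (i) instead of sampling from $[\rho \mid \theta, \Phi, \omega, \sigma^2, \tau^2]$, we integrate out ω from $[\rho, \omega \mid \Lambda, \theta, \Phi, \psi, \sigma^2, \tau^2, Y]$; and (ii) instead of sampling from $[\theta \mid \rho, \tau^2]$, we marginalize ρ from $[\theta, \rho \mid \omega, \Phi, \tau^2, \sigma^2]$. The collapsed sampler then has the updated steps

$$[\rho \mid \Lambda, \theta, \Phi, \psi, \sigma^2, \tau^2, Y] \quad\text{and}\quad [\theta \mid \omega, \Phi, \tau^2, \sigma^2].$$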

We note that, as van Dyk and Park (2008) pointed out, marginalizing does not alter the stationary distribution of the full posterior nor the compatibility of the conditional distributions. In the next sections we provide details on each of these steps.

4.1 Initializing MCMC chains

We implemented a tempering algorithm at the beginning of our MCMC chains to find reasonable starting points, so that the chains were less likely to get stuck at a local mode. These local modes are caused, in part, by each likelihood in the CFA level, Eq. (4), being close to non-identifiable up to the mean Λω. To temper, we run our sampler as usual but, for Ψ, Λ, and Ω, we temper the likelihood Y | Ψ, Λ, Ω when obtaining their conditional posteriors. We do the same for the likelihood of Y | ρ when sampling ρ. At hot temperatures, the likelihood of Y is flatter, allowing the chains to move more freely around the space. We slowly cool the temperature to T = 1 to recover the target posterior, at which point the true sampler begins.
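As a rough illustration of the tempering idea, the sketch below raises a Gaussian likelihood to the power 1/T (equivalently, divides its log by T) and builds a cooling ladder that ends at T = 1. The text does not state the cooling schedule or the starting temperature, so the geometric ladder, the function names, and the starting value of 16 are assumptions made only for this example.

```python
import math


def tempered_normal_loglik(y, mean, var, temperature):
    """Log of a N(mean, var) likelihood raised to the power 1/temperature.

    At temperature > 1 the log-likelihood is flattened; temperature = 1
    recovers the untempered likelihood.
    """
    loglik = -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)
    return loglik / temperature


def cooling_schedule(t_start, n_steps):
    """Geometric ladder from t_start down to exactly 1.0 (assumed shape)."""
    if n_steps == 1:
        return [1.0]
    return [t_start ** (1.0 - k / (n_steps - 1)) for k in range(n_steps)]


ladder = cooling_schedule(16.0, 5)  # hot to cold, ending at T = 1
hot = tempered_normal_loglik(3.0, 0.0, 1.0, ladder[0])
cold = tempered_normal_loglik(3.0, 0.0, 1.0, ladder[-1])
```

In a sampler, each conditional posterior that involves the likelihood of Y would use the tempered version while the temperature walks down the ladder, and ordinary sampling would start once T = 1 is reached.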

4.2 Sampling Λ

Recall that we have constrained some elements in Λ to 0 such that genes are regressed only on pathways to which they belong. That is, if gene k is not in pathway j, then we constrain the element λkj to 0. Consider the k-th row of Λ, which we denote as Λk, and suppose some of the elements in Λk are constrained to zero. Let ck be the corresponding 1 × q row vector such that

$$c_{kj} := I(\text{gene } k \in \text{pathway } j),$$

that is, $c_{kj} = 0$ if $\lambda_{kj}$ is a structural zero, and so $\lambda_{kj} \overset{iid}{\sim} N(0, c_{kj}\,0.1)$. If $r_k = \sum_j c_{kj}$ is the number of pathways containing gene k and $\Lambda_k$ is reduced to the 1 × $r_k$ row vector that contains its unknown parameters, then the prior on $\Lambda_k$ is $N(0, H_k)$, where $H_k = 0.1\,I$ and I is the identity matrix of order $r_k$.

Similarly, let $\Omega_k$ be the $r_k$ × n sub-matrix of Ω obtained by deleting, for j = 1, …, q, all the rows corresponding to $c_{kj} = 0$, and let $Y_k$ be the 1 × n vector of observations for gene k. Then, from a well-known result of Lindley and Smith (1972), the conditional posterior distribution for $\Lambda_k$, k = 1, …, p, is

$$\Lambda_k \mid Y_k, \Omega_k, \psi_k \sim N(A_k^{-1} a_k,\; A_k^{-1}),$$

where $A_k = \psi_k^{-1}\Omega_k\Omega_k^\top + H_k^{-1}$ and $a_k = \psi_k^{-1}\Omega_k Y_{k,\cdot}^\top$.
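For a gene that belongs to a single pathway ($r_k = 1$), the update above reduces to scalar arithmetic, which makes the roles of $A_k$ and $a_k$ easy to see. The sketch below covers only that special case; the function names and the toy data are illustrative, and the prior variance 0.1 is the value stated in the text.

```python
import random


def loading_posterior(y, omega, psi, prior_var=0.1):
    """Conditional posterior mean and variance of one loading when r_k = 1.

    y     : expression of gene k across the n samples
    omega : latent activity of the gene's single pathway across the samples
    psi   : residual variance of gene k
    """
    a_prec = sum(w * w for w in omega) / psi + 1.0 / prior_var  # A_k
    a_lin = sum(w * yi for w, yi in zip(omega, y)) / psi        # a_k
    return a_lin / a_prec, 1.0 / a_prec


def draw_loading(y, omega, psi, rng, prior_var=0.1):
    """One Gibbs draw of the loading from N(A_k^{-1} a_k, A_k^{-1})."""
    mean, var = loading_posterior(y, omega, psi, prior_var)
    return rng.gauss(mean, var ** 0.5)


# Toy data (hypothetical): three samples, one pathway.
omega = [1.0, -1.0, 2.0]
y = [0.5, -0.5, 1.0]
mean, var = loading_posterior(y, omega, psi=1.0)
sample = draw_loading(y, omega, psi=1.0, rng=random.Random(0))
```

The general case replaces the sums by the matrix products $\Omega_k\Omega_k^\top$ and $\Omega_k Y_{k,\cdot}^\top$ and the division by a linear solve.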

4.3 Sampling ρ

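We define $\Sigma(\theta_i)$ as a diagonal matrix whose j-th diagonal element is $\Sigma(\theta_i)_{jj} = \tau^2[\theta_{ji} + v_0(1 - \theta_{ji})]$. Then, again exploiting the result of Lindley and Smith (1972) on the marginalized conditional distribution of $Y_i$ after integrating out $\omega_i$ in (10), we sample $\rho_i$ from the following conditional posterior distribution:

$$\rho_i \mid \theta_i, \Lambda, \Psi, Y_i \sim N(A_i^{-1} a_i,\; A_i^{-1}),$$

where

$$A_i = \Phi\Lambda^\top(\sigma^2\Lambda\Phi\Lambda^\top + \Psi)^{-1}\Lambda\Phi + \Sigma(\theta_i)^{-1}, \qquad a_i = \Phi\Lambda^\top(\sigma^2\Lambda\Phi\Lambda^\top + \Psi)^{-1}Y_i.$$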

4.4 Sampling θ

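As in the previous section, we compute the marginalized conditional posterior of θ after integrating out ρ. For computational efficiency, we store the marginal variance $V(\theta_i^{(t)})$ of $\omega_i$ conditional on $\theta_i$ at each iteration t of our sampler, where index i runs over samples. That is,

$$V(\theta_i) = \sigma^2\Phi + \Phi\,\Sigma(\theta_i)\,\Phi.$$

To simplify the notation, we drop the sample index i on all parameters, including θ.

At this step we sample iteratively from $\theta_j \mid \theta_{[-j]}, \omega, \Phi, \tau^2$ for j = 1, …, q. There are two cases, $\theta_j^{(t)} = 0$ and $\theta_j^{(t)} = 1$, for which we define $V_0 = V(\theta_j^{(t)} = 0, \theta_{[-j]}^{(t)})$ and $V_1 = V(\theta_j^{(t)} = 1, \theta_{[-j]}^{(t)})$. Let $\phi_j$ denote the j-th column of Φ. If $\theta_j^{(t)} = 0$, then we define $\delta_{j0} = 1 + (\tau^2 - v_0\tau^2)\,\phi_j^\top V_0^{-1}\phi_j$ and $\Delta_{j0} = \frac{\tau^2 - v_0\tau^2}{\delta_{j0}}(V_0^{-1}\phi_j)(V_0^{-1}\phi_j)^\top$ to obtain:

$$\operatorname{logit} P(\theta_j^{(t+1)} = 1 \mid \theta_{[-j]}, \omega, \Phi) = -\tfrac{1}{2}\log\delta_{j0} + \tfrac{1}{2}\,\omega^\top\Delta_{j0}\,\omega + \operatorname{logit}(\alpha).$$

If $\theta_j^{(t)} = 1$, we define $\delta_{j1} = 1 - (\tau^2 - v_0\tau^2)\,\phi_j^\top V_1^{-1}\phi_j$ and $\Delta_{j1} = \frac{\tau^2 - v_0\tau^2}{\delta_{j1}}(V_1^{-1}\phi_j)(V_1^{-1}\phi_j)^\top$ in a similar way to obtain:

$$\operatorname{logit} P(\theta_j^{(t+1)} = 1 \mid \theta_{[-j]}, \omega, \Phi) = \tfrac{1}{2}\log\delta_{j1} + \tfrac{1}{2}\,\omega^\top\Delta_{j1}\,\omega + \operatorname{logit}(\alpha).$$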

For the full derivation of these posterior probabilities, please refer to Appendix B.
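As a brief sketch of why the two cases take this form (Appendix B remains the authoritative derivation), note that switching $\theta_j$ changes only the j-th diagonal entry of Σ(θ), so $V_1$ and $V_0$ differ by a rank-one term. The shorthand $c = \tau^2 - v_0\tau^2$ is introduced here for brevity and is not notation from the paper. For the case $\theta_j^{(t)} = 0$, the matrix determinant lemma and the Sherman-Morrison formula give:

```latex
\begin{align*}
V_1 &= V_0 + c\,\phi_j\phi_j^\top, \qquad c = \tau^2 - v_0\tau^2, \\
|V_1| &= |V_0|\,\bigl(1 + c\,\phi_j^\top V_0^{-1}\phi_j\bigr) = |V_0|\,\delta_{j0}, \\
V_1^{-1} &= V_0^{-1} - \frac{c}{\delta_{j0}}\,(V_0^{-1}\phi_j)(V_0^{-1}\phi_j)^\top
          = V_0^{-1} - \Delta_{j0}, \\
\log\frac{N(\omega;\,0,\,V_1)}{N(\omega;\,0,\,V_0)}
    &= -\tfrac{1}{2}\log\delta_{j0} + \tfrac{1}{2}\,\omega^\top\Delta_{j0}\,\omega .
\end{align*}
```

Adding the prior log-odds logit(α) to the last line yields the posterior logit for that case. The case $\theta_j^{(t)} = 1$ follows in the same way from $V_0 = V_1 - c\,\phi_j\phi_j^\top$, which flips the sign in $\delta_{j1}$ and in the log-determinant term. Working with rank-one updates avoids recomputing a full determinant and inverse for every pathway.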

4.5 Sampling γ

As with most spatial models, the scaling parameter γ lacks conjugacy. We used a random walk adaptive Metropolis algorithm such that the proposal density is normal and centered at the current sample, $\gamma^{(t)}$, with variance $\xi^2$. This variance is tuned to adjust the acceptance rate and is fixed after burn-in.

We propose $\gamma^* \sim N(\gamma^{(t)}, \xi^2)$. Let $\Phi^* = s^2 R(\gamma^*)$ with the correlation computed as in Eq. (9). Then, if

$$l(\Phi) := -\frac{n}{2}\log|\Phi| - \frac{1}{2}\sum_{i=1}^{n}(\omega_i - \Phi\rho_i)^\top(\sigma^2\Phi)^{-1}(\omega_i - \Phi\rho_i),$$

the acceptance ratio of the Metropolis step is just $r(\gamma^*, \gamma^{(t)}) = \exp\{l(\Phi^*) - l(\Phi^{(t)})\}$, since we have a flat prior on γ.
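The accept/reject logic of this step can be sketched as follows. The function `metropolis_step` is a generic random-walk Metropolis update written for illustration; `toy_log_target` is a hypothetical stand-in for $l(\Phi(\gamma))$, which in the actual model requires building R(γ) from Eq. (9). The adaptive tuning of ξ during burn-in is not shown.

```python
import math
import random


def metropolis_step(gamma_t, log_target, xi, rng):
    """One random-walk Metropolis update with a N(gamma_t, xi^2) proposal.

    Under a flat prior the acceptance ratio is exp(l(proposal) - l(current)),
    evaluated here on the log scale for numerical stability.
    """
    proposal = rng.gauss(gamma_t, xi)
    log_ratio = log_target(proposal) - log_target(gamma_t)
    # 1 - random() lies in (0, 1], so the logarithm is always defined.
    if log_ratio >= 0 or math.log(1.0 - rng.random()) < log_ratio:
        return proposal, True
    return gamma_t, False


def toy_log_target(g):
    """Hypothetical stand-in for l(Phi(gamma)): a Gaussian bump at 0.3."""
    return -0.5 * ((g - 0.3) / 0.1) ** 2


rng = random.Random(0)
gamma, n_accept = 0.3, 0
for _ in range(1000):
    gamma, accepted = metropolis_step(gamma, toy_log_target, xi=0.1, rng=rng)
    n_accept += accepted
acceptance_rate = n_accept / 1000
```

An adaptive variant would monitor `acceptance_rate` during burn-in and enlarge or shrink `xi` accordingly before freezing it.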

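4.6 Sampling ω, ψ, σ², and τ²

We sampled ω according to the linear conditional posterior:

$$\omega_i \mid Y_i, \Lambda, \Psi, \rho_i, \gamma \overset{ind}{\sim} N(A_i^{-1} a_i,\; A_i^{-1}),$$

where

$$A_i = \Lambda^\top\Psi^{-1}\Lambda + \sigma^{-2}\Phi^{-1}, \qquad a_i = \Lambda^\top\Psi^{-1}Y_i + \sigma^{-2}\rho_i.$$

The conditional distributions of the variance parameters ψ and $\sigma^2$ follow from conjugacy:

$$\psi_k \mid \omega, \Lambda, Y \sim IG\Big(\zeta + \frac{n}{2},\; \zeta - 1 + \frac{1}{2}\sum_{i=1}^{n}(Y_{ki} - \Lambda_{k,\cdot}\,\omega_i)^2\Big),$$

$$\sigma^2 \mid \omega, \Phi, \rho \sim IG\Big(0.001 + \frac{qn}{2},\; 0.001 + \frac{1}{2}\sum_{i=1}^{n}(\omega_i - \Phi\rho_i)^\top\Phi^{-1}(\omega_i - \Phi\rho_i)\Big),$$

where ζ = n/4 according to the discussion in Section 3.2.

The conditional distribution of $\tau^2$ is also conjugate; if $v_{ij} = 1$ when $\theta_{ij} = 1$ and $v_{ij} = v_0$ when $\theta_{ij} = 0$, then

$$\tau^2 \mid \rho, \theta \sim IG\Big(0.001 + \frac{qn}{2},\; 0.001 + \sum_{i=1}^{q}\sum_{j=1}^{n}\frac{\rho_{ij}^2}{2 v_{ij}}\Big).$$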

4.7 Posterior Inference

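We infer perturbations based on the centroid estimator of θ for some threshold t (Carvalho and Lawrence, 2008):

$$\hat\theta_{ji}(t) := I\big(P(\theta_{ji} = 1 \mid Y) > t\big). \tag{11}$$

We choose the threshold t by controlling a Bayesian false discovery rate (BFDR), which we define for an estimator as

$$\mathrm{BFDR}(\hat\theta(t)) := E_{\theta \mid Y}\left[\frac{\sum_{i=1}^{n}\sum_{j=1}^{q}\hat\theta_{ji}(1 - \theta_{ji})}{\sum_{i=1}^{n}\sum_{j=1}^{q}\hat\theta_{ji}}\right] = \frac{\sum_{i=1}^{n}\sum_{j=1}^{q}\hat\theta_{ji}\big(1 - P(\theta_{ji} = 1 \mid Y)\big)}{\sum_{i=1}^{n}\sum_{j=1}^{q}\hat\theta_{ji}}. \tag{12}$$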
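The thresholding rule can be sketched in a few lines. The code below computes the BFDR of Eq. (12) from a flat list of posterior probabilities and scans candidate thresholds for the smallest one that keeps the BFDR at or below a target level. The scan over observed probabilities, the convention that an empty selection has BFDR 0, and the toy probabilities are assumptions made for this illustration; the text does not describe how the search over t was carried out.

```python
def bfdr(post_probs, t):
    """Bayesian FDR of Eq. (12) for the centroid estimator at threshold t.

    post_probs holds P(theta_ji = 1 | Y) for every pathway/sample pair.
    """
    selected = [p for p in post_probs if p > t]
    if not selected:
        return 0.0  # assumed convention: no discoveries, no false discoveries
    return sum(1.0 - p for p in selected) / len(selected)


def smallest_threshold(post_probs, level=0.05):
    """Smallest candidate threshold whose BFDR does not exceed `level`."""
    for t in [0.0] + sorted(set(post_probs)):
        if bfdr(post_probs, t) <= level:
            return t
    return 1.0  # not reached: the largest probability always selects nothing


# Toy posterior probabilities (hypothetical).
probs = [0.99, 0.97, 0.90, 0.60, 0.10, 0.02]
t_star = smallest_threshold(probs, level=0.05)
calls = [p > t_star for p in probs]  # centroid estimator of Eq. (11)
```

Raising t drops the least certain calls first, so the BFDR can only decrease as t grows, which is why a simple upward scan finds the least stringent admissible threshold.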

5. SIMULATION STUDY

We conducted a simulation study to (i) assess the impact of including biological network information into a factor model, and (ii) test the robustness of our model to inaccuracies in the biological databases used to construct pathways and the pathway-pathway interaction network.

5.1 Assessing the impact of incorporating biological network information

We tested the CFA-CAR model under various signal-to-noise ratios (SNRs) and compared its performance with an exploratory factor analysis (EFA) model. This comparison model can be obtained from our original CFA-CAR model by letting the pathway-pathway network be empty. That is, an EFA model is a factor model such that the factor covariance matrix, Φ, is diagonal, which is equivalent to setting γ = 0 in the CFA-CAR model. In effect, this constraint removes the network filtering effect of the CAR model. The EFA model can thus be written as:

$$y_{ki} \mid \Lambda, \omega_i, \psi_k \overset{ind}{\sim} N(\Lambda_k\omega_i,\; \psi_k) \tag{13}$$

$$\omega_{ji} \mid \omega_{[-j]i}, \rho_{ji}, \sigma^2 \overset{ind}{\sim} N(\rho_{ji},\; \sigma^2) \tag{14}$$

$$\rho_{ji} \mid \theta_{ji}, \tau^2 \overset{ind}{\sim} N\big(0,\; \theta_{ji}\tau^2 + (1 - \theta_{ji})v_0\tau^2\big) \tag{15}$$

We simulated toy-scale networks with graph densities similar to the original KEGG pathway network. For each network, we randomly selected q = 10 pathways from our set of KEGG pathways. In this paper, we describe the results obtained from one of these networks. We obtained the pathway-pathway weighted networks by taking the subnetwork from our original pathway-pathway network induced by the ten randomly selected pathways. The number of distinct genes in the union of the selected pathways was 878. We used the real gene to pathway membership to create the mask of Λ. We sampled the non-zero loading factors in Λ from a Gaussian with zero mean and variance 0.1. The spatial scaling parameter γ was relatively high to emphasize non-trivial links in the network W. Each analysis was run using R on a Linux 8-core 3.6GHz CPU machine with 8Gb of memory and took, on average, 55 minutes for 8000 iterations.

We define the SNR in each simulated data set using the SNR of a perturbed pathway (which is the same for any perturbed case). If pathway k was perturbed with effect ρ, we define SNR = ρ/σ; if we further fix the pathway noise variance ($\sigma^2$) to 1, then SNR = ρ. In each case sample, we simulated an experiment where only a single pathway was perturbed, with 5 “biological” replicates per experiment (perturbation). Therefore, if we perturbed pathway k in sample i ∈ {1, 2, 3, 4, 5} of experiment j, we set $\rho_{kij} = m$, where m is the signal-to-noise ratio in the data set, and $\rho_{k'ij} = 0$ for k′ ≠ k. Across all data sets, we fixed the gene variances (Ψ) to 1.

For this network, we created six data sets of different signal to noise ratio in ten simulation replicates. In each data set we fixed SNR to the following values: 0.50, 1.50, 2.50, 3.50, 4.50, and 5.50. These values were chosen based on the distribution of the SNRs in the NCI-DREAM drug sensitivity data sets (see Section 6 and Fig. 7). For each of these data sets, we sampled 50 cases and 50 controls.

Figure 7. Beanplots with bean averages of the marginal posterior distributions of γ, σ², τ², and the SNR across each drug for the IC20 at 12 hours and IC20 at 24 hours data sets.

When running either model, we conditioned the perturbations of all replicates, $\rho_{ij}$, where j indexes the experiment and i indexes the experimental replicate, on the same indicator vector $\theta_j$. For the controls, we fixed the corresponding indicators to zero. We ran both our CFA-CAR model and the EFA model on each data set. Fig. 2 shows a series of ROC curves across data sets of different signal-to-noise ratios. Both models performed similarly under low SNRs; however, the curves begin to separate as the signal in the data sets increases. This is further emphasized by the AUC boxplots in Fig. 3. The CFA-CAR model consistently exhibits a low false positive rate across varying signal-to-noise ratios, whereas the false positive rate of the EFA model increases with the signal-to-noise ratio. At SNR = 4.5 the AUC of the EFA model begins to decrease dramatically. This was expected because, given a strong enough perturbation, the EFA model confuses a perturbation with its residual downstream effects. That is, pathways associated with the perturbed pathway are wrongly identified by the EFA model.

Figure 2. Mean ROC curves with standard error bars for the CFA-CAR and EFA models under different signal-to-noise ratios. The k-th set of error bars represents the mean and standard deviations in the k-th FPR and TPR across ten data replicates for each method. Each sub-figure corresponds to a different signal-to-noise ratio in the data. The solid line represents an expected ROC curve under random guessing. Note that we truncated the x-axis at 0.42 because there were no additional plot points until (x = 1, y = 1).

Figure 3. Boxplot of the AUCs of the ROC plots in Fig. 2.

5.2 Assessing robustness to inaccuracies in the pathway database

We performed a sensitivity analysis on the CFA-CAR model to test how possible inaccuracies in KEGG can affect the performance of CFA-CAR relative to an EFA model. To this end, we define the mask of Λ, $\Lambda^{(mask)}$, as the p × q binary matrix (for p genes and q pathways) where $\lambda_{jk}^{(mask)} = 1$ when gene j is in pathway k and $\lambda_{jk}^{(mask)} = 0$ otherwise. With misannotated genes, the inaccuracies encoded in $\Lambda^{(mask)}$ would affect both the CFA-CAR and EFA models. However, inaccuracies in the pathway network, W, would affect only the CFA-CAR model.

In this simulation, we tested six levels of gene set perturbations in the set of pathways. We chose 10 real pathways and simulated 10 data sets using the real network induced by these 10 pathways with a fixed SNR of 3.5 (following the same approach as the data simulations in Section 5.1).

To assess how LFIA performs under different degrees of database inaccuracy, we ran the LFIA model using a different weighted pathway-pathway network W and a different mask for Λ, where we randomly perturbed the assignment of genes to pathways for these false networks. We reassigned x percent of randomly selected genes to randomly selected pathways. Then, we re-created the pathway-pathway network W from the new Pathway-GO bipartite graph. Similarly, the mask of Λ is generated from the new gene to pathway assignment. For each percentage of genes tested, we generated a new W network and Λ(mask) for each simulated data set.

We perturbed x = 1%, 2%, 4%, 8%, 16%, 32% and 64% of all genes in our set of ten pathways. For each of these perturbations, we ran ten replicates where for each replicate we simulated a new “pathway” database and ran CFA-CAR on the original data set, using the newly constructed Λ(mask) and W.

To assess the changes in W and $\Lambda^{(mask)}$ relative to the true W and true $\Lambda^{(mask)}$, we computed the Frobenius norm of the difference between the real and false networks, scaled by the Frobenius norm of the real network (see Figure 4). On a logarithmic scale, the changes increase almost linearly with the percentage of perturbed genes, which doubles at each iteration.
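The relative error described above is straightforward to compute. The sketch below applies it to a toy 3 × 2 membership mask in which one gene loses one pathway assignment; the function names and the toy matrices are illustrative only.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a list-of-lists matrix."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def relative_error(true_m, false_m):
    """||true - false||_F / ||true||_F for two matrices of equal shape."""
    diff = [[t - f for t, f in zip(t_row, f_row)]
            for t_row, f_row in zip(true_m, false_m)]
    return frobenius_norm(diff) / frobenius_norm(true_m)


# Toy masks (hypothetical): rows are genes, columns are pathways.
true_mask = [[1, 0], [1, 1], [0, 1]]
false_mask = [[1, 0], [0, 1], [0, 1]]  # gene 2 dropped from pathway 1
err = relative_error(true_mask, false_mask)
```

The same function applies unchanged to the weighted network W, since it only relies on entrywise differences.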

Figure 4. Boxplots of the relative error across different percentages of database inaccuracy for W (left), the mask of Λ (middle), and the posterior means of Λ (right).

Fig. 5 shows a series of ROC curves, and Fig. 6 shows the corresponding AUC boxplots, for both the CFA-CAR and EFA models across various levels of gene-to-pathway reassignment. We begin to see a small departure from the original ROC curve (where the network and loading factor matrices were not randomized) when 8% of the genes were randomly reassigned to different pathways. At this level, the specificity of the CFA-CAR model begins to decrease. However, even at 32% randomness, the specificity and sensitivity of the model are still significantly higher than would be expected by random chance alone. We conclude that the CFA-CAR model is rather robust to possible inaccuracies in the biological databases, performing as well as or better than an EFA version of this model.

Figure 5. Mean ROC curves with standard error bars for the CFA-CAR and EFA models under different percentages of randomly reassigned genes to pathways. The k-th set of error bars represents the mean and standard deviations in the k-th FPR and TPR across ten data replicates of W and Λ. The standard error bars on the EFA model represent standard errors across ten replicates of Λ. Each sub-figure corresponds to a different percentage of randomly reassigned genes. The solid line represents an expected ROC curve under random guessing.

Figure 6. Boxplot of the AUCs of the ROC plots in Fig. 5.

6. CASE STUDY: DRUG TARGET PREDICTION IN DREAM 7 DRUG SENSITIVITY CHALLENGE DATA

We applied the CFA-CAR model to the NCI-DREAM drug sensitivity challenge data set. As a comparison, we also implemented an EFA model (see Section 5 for a description of the EFA model) and gene set enrichment analysis (GSEA) (Subramanian et al., 2005). The core algorithm of GSEA is a threshold-free variant of a hypergeometric test; however, unlike the CFA-CAR model, it does not use network interactions.

In the NCI-DREAM drug sensitivity data set, 14 compounds (see Table 1) were tested at various concentrations and exposure times on the LY3 (lymphoma) cancer cell line. We focused our attention on two data sets with different drug exposure times. The first data set consisted of drug perturbations at a concentration of IC20 at 12 hours. The second data set consisted of drug perturbations at the same concentration but with an exposure of 24 hours. The IC20 concentration is a measure of how much of a drug is needed to inhibit a given biological process by 20%. These data sets were compared to mock control data sets in which the cells were not exposed to any drug for the 12- or 24-hour duration. There were eight mock replicates and each distinct experiment included three replicates, yielding a total of 50 samples per data set (Data Set 1: IC20 at 12 hours vs. mock at 12 hours; Data Set 2: IC20 at 24 hours vs. mock at 24 hours). Each analysis was run using R on a Linux 8-core 3.6 GHz CPU machine with 8 GB of memory and took, on average, 8 hours for 8000 iterations.

Table 1. Compounds tested on the LY3 cancer cell line in the NCI-DREAM data challenge, the abbreviations used in the figures, and a description of each drug and its known mechanisms of action.

| Drug | Abbrev. | Description |
| --- | --- | --- |
| Aclacinomycin A | ACLA-A | Inhibits the degradation of ubiquitinated proteins, affecting the proteasome (Figueiredo-Pereira M et al., 1996) |
| Blebbistatin | BLEBB | Inhibits the myosin II protein, affecting cellular motility pathways (Allingham et al., 2005) |
| Camptothecin | CPT | DNA-damaging agent targeting DNA topoisomerase I, affecting cell cycle and P53-signaling (Gupta et al., 1997; Jaks et al., 2001; Wang et al., 2004) |
| Cycloheximide | CHX | Inhibits protein biosynthesis pathways (Obrig et al., 1971) |
| Doxorubicin Hydrochloride | DOX | DNA-damaging agent targeting DNA topoisomerase II (Pommier et al., 2012) that causes cell cycle arrest (Ling et al., 1996). Implicated in a P53-dependent mechanism of action (Ling et al., 1996; Zhou et al., 2002; Kurz et al., 2004) |
| Etoposide | ETP | DNA-damaging agent that causes DNA strand breaks, leading to apoptosis and errors in DNA synthesis (Pommier et al., 2012). Involved in P53-dependent mechanisms of action (Karpinich et al., 2002; Grandela et al., 2007) |
| Geldanamycin | GA | Disrupts regulatory signaling mechanisms via HSP90 inhibition (Grenert et al., 1997; Neckers et al., 1999) |
| H-7, Dihydrochloride | DHCL | Protein kinase C inhibitor (Hidaka et al., 1984) |
| Methotrexate | MTX | Inhibits DNA synthesis via inhibition of purine and thymine synthesis (Goodsell, 1999) |
| Mitomycin C | MMC | DNA-damaging agent that is a potent DNA crosslinker (M, 1995) |
| Monastrol | MNS | Inhibits ATPase activity of the kinesin (Maliga et al., 2002) |
| Rapamycin | RPM | Targets the mTOR protein (Alqurashi et al., 2013), affecting the mTOR signaling pathway |
| Trichostatin A | TSA | Inhibits the histone deacetylase (HDAC) families of enzymes, inhibiting DNA transcription pathways (Vanhaecke et al., 2004) |
| Vincristine | VCR | Causes cell cycle arrest by inhibiting the assembly of microtubule structures during mitosis (Rao et al., 2012) |

When running the CFA-CAR and EFA models, we conditioned the perturbations of all replicates, $\rho_{1j}, \rho_{2j}, \rho_{3j}$, where j indexes the experiment, on the same indicator vector $\theta_j$. In the control cases, we fixed the corresponding indicators θ to 0. For each data set, we ran two chains of 4000 iterations each, with a burn-in of 2000 samples. We assessed convergence using the Gelman-Rubin statistic on the absolute values of all the continuous parameters. Using the same hardware and software configuration as the simulation studies in Section 5.1, these iterations took roughly 230 minutes to run.

Fig. 7 summarizes the posterior distributions of the model parameters γ, $\sigma^2$, $\tau^2$, and the signal-to-noise ratios (SNRs) across the different drugs for each data set (IC20 at 12 hours and IC20 at 24 hours). We define the SNR of a pathway j in a sample i as the mean ratio, across replicates, of the absolute perturbation ρ to either σ, if the pathway is perturbed, or $\sigma v_0$ otherwise; that is, $\mathrm{SNR}_{ji} = \frac{1}{3}\sum_{r=1}^{3}\big(\theta_{ji}\,|\rho_{jir}|/\sigma + (1 - \theta_{ji})\,|\rho_{jir}|/(\sigma v_0)\big)$, where r indexes the replicate of a drug experiment (with 3 replicates per drug). To assess the SNR in a drug case, we pool the signal-to-noise ratios across all pathways.

For GSEA, we called a pathway “dysregulated” if the false discovery rate (FDR) adjusted p-value is less than 5%. For the CFA-CAR and EFA models we identify perturbations based on the centroid estimator of θ in Eq. (11) where the threshold t was chosen to control BFDR at 5% (see Eq. (12)). Restricting the BFDR at 5%, we used a perturbation threshold of 0.72 for IC20 at 12 hours and 0.50 for IC20 at 24 hours for the CFA-CAR model. For the EFA model, we used thresholds 0.80 and 0.70 for IC20 at 12 hours and IC20 at 24 hours, respectively.

6.1 Assessing model fit

We assessed model fit by first standardizing the data with respect to its marginalized likelihood given Θ. If, for sample i,

$$Y_i \mid \Lambda, \Phi, \theta_i, \sigma^2, \Psi \sim N\big(0,\; \Lambda\Phi\,\Sigma(\theta_i)\,\Phi\Lambda^\top + \sigma^2\Lambda\Phi\Lambda^\top + \Psi\big),$$

then, if $C_i$ denotes the Cholesky factor of the above marginalized covariance of $Y_i$, we have that $Y_i \sim N(0, C_i C_i^\top)$ and so $\tilde Y_i := C_i^{-1}Y_i \sim N(0, I_p)$, where $I_p$ is the p-th order identity matrix. Thus, to assess model fit we take posterior mean estimates $\hat C_i$ of $C_i$, consider standardized gene expressions $\tilde Y_i = \hat C_i^{-1}Y_i$ for each sample i = 1, …, n, and then check two assumptions: zero-mean gene expressions and normality.
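The standardization step amounts to a Cholesky factorization followed by a triangular solve. The sketch below implements both for a small dense matrix in plain Python; the function names and the 2 × 2 covariance are illustrative, and in practice the covariance would be the posterior mean estimate described in the text.

```python
import math


def cholesky(a):
    """Lower-triangular C with C C^T = a, for symmetric positive-definite a."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                c[i][j] = math.sqrt(a[i][i] - s)
            else:
                c[i][j] = (a[i][j] - s) / c[j][j]
    return c


def standardize(c, y):
    """Solve C z = y by forward substitution, i.e. z = C^{-1} y."""
    z = []
    for i, yi in enumerate(y):
        z.append((yi - sum(c[i][k] * z[k] for k in range(i))) / c[i][i])
    return z


# Toy covariance and observation (hypothetical values).
cov = [[4.0, 2.0], [2.0, 3.0]]
C = cholesky(cov)
z = standardize(C, [2.0, 1.0 + math.sqrt(2.0)])
```

Solving the triangular system avoids forming $C_i^{-1}$ explicitly, which matters when p is the number of genes rather than 2.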

Figure 8 summarizes our assessment, with samples from IC20 at 12 and 24 hours in the top and bottom panels, respectively. In the first sub-figure on the left, we plot pooled standardized gene expressions over both samples and genes to assess the zero-mean assumption. As can be seen from the smoother fit, this assumption is well met, but samples might have slightly different variance scales. For this reason, and since we are focused on gene expressions, we assess normality via average standardized gene expressions over samples in the last two sub-figures. We found that the IC20 at 24 hours data has larger departures from normality in the tails than the IC20 at 12 hours data. This departure at 24 hours is specifically due to two of the drugs, H-7 Dihydrochloride (DHCL) and Mitomycin C (MMC), which showed a drastic change in variability between IC20 at 12 hours and IC20 at 24 hours. This effect is explained in further detail in Section 6.5.

Figure 8. Top and bottom panels show model fit assessment for the IC20 at 12 hours and IC20 at 24 hours data sets, respectively. Left: scatterplot of standardized data with zero expectation and smoother (lowess) fit shown by a red line and green curve, respectively. Middle: histogram, density, and rug plot of the standardized data set averaged over samples. Right: quantile-quantile normal plot of standardized data averaged over samples.

6.2 Assessing control of false positives via cross-validation

To verify the accuracy of our method, we tested our model against the control (mock) data by leaving three mock samples (microarrays) out of the control group and treating them as replicates of an experiment in which the perturbations are unknown. Because each drug experiment consisted of three replicates, we chose to treat three control samples as replicates of a single condition to maintain the same power. We applied this leave-three-out cross-validation 56 times to each data set to account for every possible combination of the 8 control samples. The aim here is to verify that the perturbed pathways identified in the case samples are not similarly identified against control samples.
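The count of 56 follows from choosing 3 of the 8 mock samples. A minimal sketch of how the splits could be enumerated is given below; the integer sample identifiers are placeholders, and the model fit for each split is not shown.

```python
from itertools import combinations

mock_ids = list(range(8))  # placeholder identifiers for the 8 mock samples

splits = []
for held_out in combinations(mock_ids, 3):
    # The three held-out mocks are treated as replicates of a pseudo-experiment;
    # the remaining five stay in the control group.
    remaining = [m for m in mock_ids if m not in held_out]
    splits.append((held_out, remaining))

n_splits = len(splits)
```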

We assessed pathway perturbations in the “case” group using the estimated posterior mean probabilities of a pathway being perturbed. We applied this leave-three-out cross-validation procedure to both the CFA-CAR and EFA models. In every leave-three-out cross-validation test, with either model, we did not identify any pathways as being perturbed in the mock samples (Fig. 9).

Figure 9. Boxplots representing the posterior means of θ for each of the 76 pathways in the control samples in the IC20 12 hours data set (top) and the IC20 24 hours data set (bottom) in the leave-three-out cross-validation using the CFA-CAR model (red) and the EFA model (blue).

6.3 Comparing EFA and CFA-CAR model results

Overall, we found that for both data sets the EFA model was less sensitive than the CFA-CAR model, finding only a subset of the mechanistically relevant pathways identified by the CFA-CAR model. Because the EFA model identified fewer pathways linked to drug mechanisms, we chose to focus on the comparison with GSEA in the following subsections. We note, however, that the smaller number of relevant pathways found by the EFA model relative to the CFA-CAR model is due to the BFDR control forcing more stringent thresholds on the posterior means of θ for the EFA model. Results obtained from the EFA model can be found in Appendix C.

6.4 IC20 concentration at 12 hours

For the CFA-CAR model results, we ran a hierarchical bi-clustering analysis on the posterior means Θ̂ of θ. More specifically, we independently clustered samples (drugs) and pathways using complete linkage, with Euclidean distances as the dissimilarity metric. The rows and columns are reordered to match the clustering hierarchy, as can be seen in Fig. 10A. Fig. 10B shows the results from a similar analysis on adjusted p-values obtained from GSEA for comparison.

Figure 10. A. Heatmap of the CFA-CAR posterior means of Θ for IC20 at 12 hours. Green cells indicate high posterior probabilities (above the x = 0.72 threshold or, equivalently, less than 5% BFDR). B. Heatmap of (1 − FDR-adjusted p-values from) GSEA for IC20 at 12 hours. Green cells indicate small FDR values (below a 5% threshold). In both heatmaps, x-axes indicate drug experiments (samples) and y-axes indicate pathways. Each axis is clustered hierarchically using complete linkage and Euclidean distances as dissimilarities.

At IC20 at 12 hours, the pathway most commonly identified across drugs by the CFA-CAR model was P53-signaling: it was identified for etoposide (ETP), mitomycin-C (MMC), camptothecin (CPT), and doxorubicin (DOX). Interestingly, all four of these compounds are DNA-damaging agents and were clustered into two groups (ETP/MMC and CPT/DOX) (Fig. 10), whereas some of the compounds that affect protein function unrelated to DNA damage, such as geldanamycin (GA) and rapamycin (RPM), formed their own cluster. P53 is an important tumor suppressor that can regulate various cellular processes including apoptosis, cell cycle, and DNA repair; furthermore, it is mutated in over 50% of human cancers (reviewed in Ling and Wei-Guo, 2006). This loss of P53-signaling can lead to cancer growth and propagation. In many cases, P53-signaling is induced as a response to specific DNA-damaging agents (Gupta et al., 1997), playing key roles in cell cycle arrest and apoptosis. All four drugs have been shown to play significant roles in activating P53-dependent mechanisms [ETP: (Karpinich et al., 2002; Grandela et al., 2007), MMC: (Abbas et al., 2002; Fritsche et al., 1993; Verweij and Pinedo, 1990), CPT: (Jaks et al., 2001; Gupta et al., 1997; Wang et al., 2004), DOX: (Kurz et al., 2004; Ling et al., 1996; Zhou et al., 2002)]. Furthermore, for some of these compounds, including CPT and DOX, cell cycle arrest at specific checkpoints occurs through the induction of a P53-signaling mechanism. For instance, P53-signaling was shown to play a critical role in the G1 checkpoint of the cell cycle under CPT-induced DNA damage (Jaks et al., 2001; Gupta et al., 1997). DOX is another chemotherapeutic DNA-damaging drug that causes cell cycle arrest in the G2/M phase (Ling et al., 1996). Furthermore, DOX was shown to cause an accumulation of P53 that led to a depletion of cells in the G2/M phase and apoptosis (Zhou et al., 2002).

GSEA did not identify P53-signaling or cell cycle for any of the 14 compounds. Moreover, the connection between the set of pathways identified by GSEA and some of the drugs’ primary MoAs was not readily apparent. Both GSEA and the CFA-CAR model identified DNA repair pathways for rapamycin (RPM), with GSEA identifying mismatch repair and CFA-CAR identifying base excision repair, although the connection between these repair pathways and RPM’s MoA is unclear.

6.5 IC20 concentration at 24 hours

At longer exposure times, both methods picked up more pathways (see Fig. 11A and 11B for heatmaps of the CFA-CAR and GSEA results, respectively). This is expected from a biological point of view because, as the time of exposure increases, the cell needs to respond to potentially more complex effects caused by the loss of cell homeostasis. Therefore, cells activate more pathways that, for example, help in cell survival, increase energy output, or signal cell demise or death. With the CFA-CAR model, we found that at IC20 concentrations at 24 hours of exposure, H-7 Dihydrochloride (DHCL) and Mitomycin C (MMC) were overly perturbed (i.e., many pathways had a significantly high posterior mean probability of having a non-zero perturbation). Base excision repair was again identified for RPM but was also identified for several other drugs, such as monastrol (MNS), cycloheximide (CHX), trichostatin-A (TSA), DOX, and ETP. As mentioned earlier, the connection between DNA repair pathways and these compounds is unclear. Moreover, GSEA identified various DNA damage repair or DNA-related pathways at an FDR level of 5%. Such pathways include base excision repair, homologous recombination, mismatch repair, and DNA replication.

Figure 11. A. Heatmap of the CFA-CAR posterior means of Θ for IC20 at 24 hours. Green cells indicate high posterior probabilities (above the x = 0.50 threshold or, equivalently, less than 5% BFDR). B. Heatmap of (1 − FDR-adjusted p-values from) GSEA for IC20 at 24 hours. Green cells indicate small FDR values (below a 5% threshold). In both heatmaps, x-axes indicate drug experiments (samples) and y-axes indicate pathways. Each axis is clustered hierarchically using complete linkage and Euclidean distances as dissimilarities.

P53-signaling remains one of the most perturbed pathways among several of the DNA-damaging or inhibiting compounds at the longer exposure time, including ETP, CPT, and DOX, all of which clustered together. P53-signaling was also identified for methotrexate (MTX); this was not detected at the shorter exposure time, but MTX is strongly linked to P53 induction (Krause et al., 2002; Huang et al., 2011). Krause et al. (2002) showed that treating HepG2 cells with methotrexate increases levels of P53. Huang et al. (2011) also demonstrated that methotrexate can induce apoptosis through the induction of the P53-targeted genes DR5, P21, Puma, and Noxa. GSEA also identified P53-signaling for CHX, an inhibitor of protein biosynthesis, at the longer exposure time; this was missed by CFA-CAR. Cell cycle was identified for DOX and CPT (as at the shorter exposure time), but under these stronger experimental conditions, cell cycle was further identified for ETP and RPM as well.

7. DISCUSSION

Identifying mechanisms of action from transcriptional data alone is challenging. In disease phenotypes, differences in gene expression cannot, by themselves, elucidate aberrant causal pathways. Drug perturbations are even more difficult because perturbations often occur directly at the proteomic level rather than at the gene level. In many cases, transcriptional responses are the residual aftermath of a perturbation rather than being representative of the direct target.

GSEA and other gene set methods attempt to boost the signal on transcriptional data by focusing on entire gene sets with the assumption that related genes may contribute a greater signal than individual genes alone. Furthermore, network-based methods use cellular regulation or gene/protein interaction data to better understand the MoAs underlying a transcriptional response.

Here, we link transcriptional data to unobserved mechanisms of action using confirmatory factor analysis in conjunction with a conditional autoregressive model. In the CFA level of the model we link gene expression to the biological pathways that contain their gene products. In the CAR model, we seek to explain the variation observed in these latent pathways by a pathway-pathway network induced by “external” perturbations.

The core goal of this model is to uncover these perturbation targets. We have shown through simulation and real data that applying a network filtering method to expression data is a significant improvement over non-network approaches. In simulation, we show that the CFA-CAR model exhibits high specificity across various signal-to-noise ratios, unlike the network-free EFA model, which has a higher false positive rate as the signal in the data set increases. Furthermore, we have shown that the CFA-CAR model is very robust to inaccuracies in the databases used to create the pathways and the pathway-pathway interaction network.

Moreover, our method was able to identify pathways significantly related to some of the drugs’ MoAs in the NCI DREAM 7 data sets. Specifically, we were able to identify signaling/regulatory pathways that play a causative role for many of the DNA-damaging compounds. These pathways were generally not identified using gene set enrichment analysis, and were identified less often with EFA. Some of the known targets were missed by both the CFA-CAR model and GSEA across both data sets. For instance, rapamycin targets the mTOR protein (Alqurashi et al., 2013) but mTOR signaling was not identified. In another example, blebbistatin inhibits the myosin II protein (Allingham et al., 2005), and we expected to see some of the cellular motility pathways, such as the regulation of cytoskeleton pathway, but these were not identified. Moreover, some of the results we found with both methods were also unclear (e.g., identifying DNA repair pathways across many of these compounds).

While our model arguably yields better results than other approaches in Section 6, and the simulation study in Section 5 shows that these results are fairly robust to gene-pathway network annotation inaccuracies, we recognize that other, possibly more severe, types of misspecification can occur and impact reproducibility. A thorough investigation of possible sources of model misspecification is outside the scope of this work, but we facilitate such future endeavors by providing our software as supplemental material.

There are several natural directions for extension of this model. For instance, other forms of biological variables such as transcription factors or metabolic pathways could be used as latent factors. In our more recent work, we have been focusing on transcription factors (TFs) as latent factors, and we have created a network connecting two TFs based on the set of shared targets. In this application, two TFs are linked if they jointly regulate a non-empty intersection of genes, that is, the biological function used here is co-regulation.
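As a rough illustration of the co-regulation rule described above (two TFs are linked when their target sets intersect), the following Python sketch builds such an edge list. The function name `coregulation_edges` and the toy TF and gene labels are hypothetical and used only for illustration; they do not come from the authors' software.

```python
from itertools import combinations

def coregulation_edges(tf_targets):
    """Return {(tf_a, tf_b): shared_targets} for TF pairs with shared targets.

    tf_targets maps a TF name to a collection of its target genes.
    Two TFs are linked only if the intersection of targets is non-empty.
    """
    edges = {}
    for a, b in combinations(sorted(tf_targets), 2):
        shared = set(tf_targets[a]) & set(tf_targets[b])
        if shared:
            edges[(a, b)] = shared
    return edges

# Toy input (hypothetical labels).
tf_targets = {
    "TF_A": {"g1", "g2"},
    "TF_B": {"g2", "g3"},
    "TF_C": {"g4"},
}
edges = coregulation_edges(tf_targets)
```

Here only `TF_A` and `TF_B` share a target, so they form the single edge; `TF_C` remains isolated. Storing the shared targets alongside each edge leaves room for weighting edges by overlap size, which is one possible design choice among several.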

Another extension of this model would be to integrate other forms of high-throughput data in conjunction with gene expression. A temporal component could also be incorporated to accommodate temporal data and define a dynamic model. All these extensions, however, are at the cost of increased model complexity, which can greatly affect the fit and convergence of the model.

Supplementary Material

Acknowledgments

This research was supported, in part, by NIH grant GM078987. EK is partially supported by AFOSR grant 12RSL042. LC was supported by NSF grant DMS-1107067. LP is supported by NIH/NIDDK grant R01 DK078616.

APPENDIX A. NETWORK CONSTRUCTION

Our network W was created in a similar manner to Pham et al. (2011). We constructed a weighted network of pathways W where the nodes represent canonical pathways and the edges represent functional links. We describe the details of the construction of W below.

First, we classified genes using two sources of biological information: KEGG regulatory/signaling pathways (Kanehisa et al., 2006; Kanehisa and Goto, 2000), and GO biological processes (The Gene Ontology Consortium, 2000) obtained from the MSigDB collection (Subramanian et al., 2005) (version 3.0). We removed disease-specific pathways and any metabolic pathways, focusing our attention on a regulatory/signaling network. We also removed 4 pathways that did not represent a single pathway but a collection of signaling molecules that are themselves part of other pathways (e.g., Cell Adhesion Molecules and Cytokine-Cytokine Receptor Interactions). In the end, we had 76 distinct regulatory/signaling KEGG pathways containing a total of 3636 genes.

To define functional links between two pathways, we constructed a bipartite network, where the two sets of nodes represent KEGG pathways and GO functions, respectively. A pathway P and a GO term G were linked together in this bipartite network if the intersection of P and G (as gene sets) is non-empty. Furthermore, the edge was weighted by the Jaccard index between G and P, measuring the relative overlap between function and pathway. Edges with Jaccard index smaller than 3% were removed to reduce the possibility of false positives in the database. We can represent this bipartite graph as an incidence matrix M where the rows represent KEGG pathways and the columns represent GO terms.

Our final network of pathways was obtained by translating the information from this two-mode network onto a one-mode network formed by the KEGG node set alone. The projection of M onto a single-mode network A was obtained by $A = MM^\top$. As a result, two pathways are linked in this network if and only if they share at least one biological process. Furthermore, edges between pathways that heavily contribute to the same GO terms will be heavier than those between pathways that contribute less to the same GO terms. Finally, we standardized A in the following manner:

$$W_{ij} = \frac{A_{ij}}{\sqrt{A_{ii}A_{jj}}}\,(1-\delta_{ij}),$$

where δ is the Kronecker delta.
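The construction in this appendix can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' software: the function names `jaccard` and `pathway_network`, the toy gene sets, and the treatment of pathways left without any GO link are assumptions made here. The standardization is taken to be division by the square root of A_ii A_jj, with the diagonal set to zero.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard index between two gene collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pathway_network(pathways, go_terms, min_jaccard=0.03):
    """Build a standardized pathway-pathway weight matrix W.

    pathways, go_terms: dicts mapping names to gene collections.
    min_jaccard: bipartite edges with a smaller Jaccard index are dropped.
    """
    p_names = list(pathways)
    g_names = list(go_terms)
    # Weighted incidence matrix M: rows are pathways, columns are GO terms.
    M = np.array(
        [[jaccard(pathways[p], go_terms[g]) for g in g_names] for p in p_names],
        dtype=float,
    )
    M[M < min_jaccard] = 0.0
    # One-mode projection onto the pathway node set.
    A = M @ M.T
    d = np.sqrt(np.diag(A))
    d[d == 0] = 1.0  # assumption: pathways with no GO links stay isolated
    W = A / np.outer(d, d)
    np.fill_diagonal(W, 0.0)  # the (1 - delta_ij) factor removes self-loops
    return p_names, W

# Toy gene sets (hypothetical labels).
pathways = {"P1": {"a", "b", "c"}, "P2": {"b", "c", "d"}, "P3": {"x", "y"}}
go_terms = {"G1": {"a", "b", "c", "d"}, "G2": {"x", "y", "z"}}
names, W = pathway_network(pathways, go_terms)
```

In this toy input, P1 and P2 overlap with the same GO term and so receive a positive weight, while P3 shares no GO term with either and stays disconnected from them.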

APPENDIX B. MARGINALIZED CONDITIONAL POSTERIOR DISTRIBUTION OF θ

In this section, we derive the marginalized posterior of θi for sample i. After integrating out ρi from Eq. (8) we have that

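$$\omega_i \mid \theta_i, \Phi \overset{\text{ind}}{\sim} N\big(0, V(\theta_i)\big),$$

where $V(\theta_i) = \sigma^2\Phi + \Phi\,\Sigma(\theta_i)\,\Phi^\top$ and the index $i$ runs over samples. To simplify the notation, we set $\tau_0^2 := v_0\tau^2$ and drop the indices, that is, $\omega \mid \theta, \Phi \sim N(0, V(\theta))$.

Since $\theta_j \overset{\text{iid}}{\sim} \text{Bern}(\alpha)$ from the prior and with $\Sigma(\theta)$ similarly defined as in Section 4.3, we have that

$$V(\theta) = \sigma^2\Phi + \tau_0^2\,\Phi\Phi^\top + (\tau^2-\tau_0^2)\,\Phi\,\text{Diag}(\theta)\,\Phi^\top = \sigma^2\Phi + \tau_0^2\,\Phi\Phi^\top + (\tau^2-\tau_0^2)\sum_{j=1}^{q}\theta_j\,\phi_j\phi_j^\top,$$

where $\phi_j$ is the j-th column of Φ.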

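We want

$$\mathbb{P}(\theta_j = 1 \mid \theta_{[-j]}, \omega, \Phi) = \frac{\mathbb{P}(\theta_j = 1, \theta_{[-j]}, \omega, \Phi)}{\sum_{b\in\{0,1\}} \mathbb{P}(\theta_j = b, \theta_{[-j]}, \omega, \Phi)}.$$

By the distributions above,

$$\mathbb{P}(\theta_j, \theta_{[-j]}, \omega, \Phi) \propto \prod_{k} \alpha^{\theta_k}(1-\alpha)^{1-\theta_k}\,(2\pi)^{-q/2}\,|V(\theta)|^{-1/2}\exp\left\{-\tfrac{1}{2}\,\omega^\top V(\theta)^{-1}\omega\right\}.$$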

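Now, we define $V_0 = V(\theta_j = 0, \theta_{[-j]})$ and $V_1 = V(\theta_j = 1, \theta_{[-j]})$. Then,

$$V_1 = \sigma^2\Phi + \tau_0^2\,\Phi\Phi^\top + (\tau^2-\tau_0^2)\,\phi_j\phi_j^\top + (\tau^2-\tau_0^2)\sum_{k\neq j}\theta_k\,\phi_k\phi_k^\top = V_0 + (\tau^2-\tau_0^2)\,\phi_j\phi_j^\top,$$

and so, by the Sherman-Morrison formula,

$$V_1^{-1} = \big(V_0 + (\tau^2-\tau_0^2)\,\phi_j\phi_j^\top\big)^{-1} = V_0^{-1} - \frac{(\tau^2-\tau_0^2)\,V_0^{-1}\phi_j\phi_j^\top V_0^{-1}}{1 + (\tau^2-\tau_0^2)\,\phi_j^\top V_0^{-1}\phi_j}.$$

Moreover, by the matrix determinant lemma,

$$|V_1| = \big(1 + (\tau^2-\tau_0^2)\,\phi_j^\top V_0^{-1}\phi_j\big)\,|V_0|.$$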
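The two rank-one identities used above (Sherman-Morrison for the inverse and the matrix determinant lemma for the determinant) are easy to check numerically. The sketch below does so with NumPy on arbitrary stand-ins: the matrix `V0`, the vector `phi`, and the scalar `c` (playing the role of τ² − τ0²) are randomly generated or chosen for illustration, not taken from the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)
q = 5

# Arbitrary symmetric positive definite stand-in for V0.
B = rng.normal(size=(q, q))
V0 = B @ B.T + q * np.eye(q)
phi = rng.normal(size=q)  # stand-in for the column phi_j
c = 0.7                   # stand-in for tau^2 - tau0^2

V0_inv = np.linalg.inv(V0)
u = V0_inv @ phi                 # V0^{-1} phi_j
delta = 1.0 + c * (phi @ u)      # 1 + c * phi_j' V0^{-1} phi_j

# Rank-one update of the inverse (V0 symmetric, so V0^{-1} phi phi' V0^{-1} = u u').
V1_inv_sm = V0_inv - (c / delta) * np.outer(u, u)

# Direct computation for comparison.
V1 = V0 + c * np.outer(phi, phi)

assert np.allclose(V1_inv_sm, np.linalg.inv(V1))
assert np.isclose(np.linalg.det(V1), delta * np.linalg.det(V0))
```

The practical point is cost: the rank-one update needs only a matrix-vector product and an outer product, rather than a fresh inversion of a q × q matrix each time a single indicator changes.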

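Using these relations we can proceed to sample $\theta_j^{(t+1)}$ based on $\theta^{(t)}$ from the current iteration t; we keep the inverse of V(θ) and only update it when θj is flipped. We have

$$\mathbb{P}(\theta_j^{(t+1)} = 1 \mid \theta_{[-j]}, \omega, \Phi) = \frac{|V_1|^{-1/2}\exp\left\{-\frac{\omega^\top V_1^{-1}\omega}{2}\right\}\alpha}{|V_1|^{-1/2}\exp\left\{-\frac{\omega^\top V_1^{-1}\omega}{2}\right\}\alpha + |V_0|^{-1/2}\exp\left\{-\frac{\omega^\top V_0^{-1}\omega}{2}\right\}(1-\alpha)},$$

or, taking logits,

$$\text{logit}\,\mathbb{P}(\theta_j^{(t+1)} = 1 \mid \theta_{[-j]}, \omega, \Phi) = -\frac{1}{2}\log\frac{|V_1|}{|V_0|} - \frac{1}{2}\,\omega^\top\big(V_1^{-1} - V_0^{-1}\big)\omega + \text{logit}(\alpha). \qquad \text{(A.1)}$$

There are two cases: when $\theta_j^{(t)} = 0$ and when $\theta_j^{(t)} = 1$. In the first case we already have $V_0^{-1}$, and so, by defining $\delta_{j0} = 1 + (\tau^2-\tau_0^2)\,\phi_j^\top V_0^{-1}\phi_j$ and $\Delta_{j0} = \frac{\tau^2-\tau_0^2}{\delta_{j0}}\,(V_0^{-1}\phi_j)(V_0^{-1}\phi_j)^\top$ we can apply Eq. (A.1) to obtain

$$\text{logit}\,\mathbb{P}(\theta_j^{(t+1)} = 1 \mid \theta_{[-j]}, \omega, \Phi) = -\frac{1}{2}\log\delta_{j0} + \frac{1}{2}\,\omega^\top\Delta_{j0}\,\omega + \text{logit}(\alpha).$$

In the sampler, we update $(V(\theta)^{-1})^{(t+1)} = (V(\theta)^{-1})^{(t)} - \Delta_{j0}$ if $\theta_j^{(t+1)} = 1$, that is, when θj is flipped.

If $\theta_j^{(t)} = 1$ we define $\delta_{j1} = 1 - (\tau^2-\tau_0^2)\,\phi_j^\top V_1^{-1}\phi_j$ and $\Delta_{j1} = \frac{\tau^2-\tau_0^2}{\delta_{j1}}\,(V_1^{-1}\phi_j)(V_1^{-1}\phi_j)^\top$ in a similar way to obtain:

$$\text{logit}\,\mathbb{P}(\theta_j^{(t+1)} = 1 \mid \theta_{[-j]}, \omega, \Phi) = \frac{1}{2}\log\delta_{j1} + \frac{1}{2}\,\omega^\top\Delta_{j1}\,\omega + \text{logit}(\alpha),$$

and update $(V(\theta)^{-1})^{(t+1)} = (V(\theta)^{-1})^{(t)} + \Delta_{j1}$ if θj gets flipped to $\theta_j^{(t+1)} = 0$.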
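The two-case update derived in this appendix can be sketched as a single sweep over the indicators. The code below is an illustrative reading of the derivation, not the authors' implementation: the names `gibbs_sweep_theta` and `build_V`, the randomly generated symmetric positive definite stand-in for Φ, and all numeric hyperparameter values are assumptions. It presumes τ² > τ0², so that both δ quantities are positive.

```python
import numpy as np

def build_V(theta, Phi, sigma2, tau2, tau02):
    """V(theta) = sigma^2 Phi + Phi Sigma(theta) Phi', computed directly."""
    Sigma = tau02 * np.eye(theta.size) + (tau2 - tau02) * np.diag(theta)
    return sigma2 * Phi + Phi @ Sigma @ Phi.T

def gibbs_sweep_theta(theta, V_inv, omega, Phi, alpha, tau2, tau02, rng):
    """One sweep over theta_1..theta_q; theta and V_inv are modified in place."""
    c = tau2 - tau02
    logit_alpha = np.log(alpha) - np.log1p(-alpha)
    for j in range(theta.size):
        phi_j = Phi[:, j]
        u = V_inv @ phi_j            # V(theta)^{-1} phi_j for the current theta
        quad = (omega @ u) ** 2      # omega' u u' omega
        if theta[j] == 0:
            delta = 1.0 + c * (phi_j @ u)    # delta_j0
            logit = -0.5 * np.log(delta) + 0.5 * (c / delta) * quad + logit_alpha
        else:
            delta = 1.0 - c * (phi_j @ u)    # delta_j1
            logit = 0.5 * np.log(delta) + 0.5 * (c / delta) * quad + logit_alpha
        prob = 1.0 / (1.0 + np.exp(-np.clip(logit, -700.0, 700.0)))
        new = int(rng.random() < prob)
        if new != theta[j]:
            # Rank-one correction: subtract on a 0 -> 1 flip, add on a 1 -> 0 flip.
            Delta = (c / delta) * np.outer(u, u)
            V_inv += -Delta if new == 1 else Delta
            theta[j] = new
    return theta, V_inv

# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(1)
q = 6
B = rng.normal(size=(q, q))
Phi = B @ B.T / q + np.eye(q)        # symmetric positive definite stand-in
sigma2, tau2, tau02, alpha = 1.0, 4.0, 0.04, 0.2
true_theta = np.array([1, 0, 0, 0, 0, 0])
omega = rng.multivariate_normal(np.zeros(q), build_V(true_theta, Phi, sigma2, tau2, tau02))

theta = np.zeros(q, dtype=int)
V_inv = np.linalg.inv(build_V(theta, Phi, sigma2, tau2, tau02))
for _ in range(20):
    theta, V_inv = gibbs_sweep_theta(theta, V_inv, omega, Phi, alpha, tau2, tau02, rng)
```

After any number of sweeps, the incrementally maintained `V_inv` should agree with a direct inversion of `build_V(theta, ...)` up to rounding error, which gives a simple sanity check on the bookkeeping. In long runs one might periodically recompute the inverse from scratch to limit accumulated floating-point drift; that is a practical suggestion rather than something stated in the text.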

APPENDIX C. EFA RESULTS ON THE DREAM 7 DATA SETS

Fig. 12 shows heatmaps of the results obtained by the EFA model for the IC20 at 12 hours and IC20 at 24 hours exposure times, respectively.

Figure 12.

A. Heatmap of the EFA posterior means of Θ for IC20 at 12 hours. Green cells indicate high posterior probabilities (above the x = 0.80 threshold or equivalently, less than 5% BFDR). B. Heatmap of the EFA posterior means of Θ for IC20 at 24 hours. Green cells indicate high posterior probabilities (above the x = 0.70 threshold or equivalently, less than 5% BFDR). In both heatmaps, x-axes indicate drug experiments (samples) and y-axes indicate pathways. Each axis is clustered hierarchically using complete linkage and Euclidean distances as dissimilarities.

At IC20 12 hours, the EFA model identified fewer pathways than the CFA-CAR model, identifying P53-signaling for CPT (also identified by CFA-CAR in addition to other DNA-damaging agents ETP, MMC, and DOX). At IC20 24 hours, the EFA and CFA-CAR models both identified P53-signaling for the same drugs (MMC, DOX, MTX, and CPT), as well as cell cycle for ETP, a pathway mechanistically linked to p53-signaling. However, CFA-CAR further identified cell cycle for CPT, DOX, and RPM. Moreover, both EFA and CFA-CAR found MMC and DHCL overly perturbed at IC20 24 hours.

References

  1. Abbas T, Olivier M, Lopez J, Houser S, Xiao G, Suresh-Kumar G, Tomasz M, Bargonetti J. Differential activation of p53 by the various adducts of mitomycin C. J Biol Chem. 2002;277:40513–40519. doi: 10.1074/jbc.M205495200.
  2. Allingham JS, Smith R, Rayment I. The structural basis of blebbistatin inhibition and specificity for myosin II. Nature Structural & Molecular Biology. 2005;12:378–379. doi: 10.1038/nsmb908.
  3. Alqurashi N, Hashimi SM, Wei MQ. Chemical Inhibitors and microRNAs (miRNA) Targeting the Mammalian Target of Rapamycin (mTOR) Pathway: Potential for Novel Anticancer Therapeutics. Int J Mol Sci. 2013;14:3874–3900. doi: 10.3390/ijms14023874.
  4. Bansal M, et al. 2014 “(To be determined),” (To be determined), to be submitted.
  5. Bollen KA. Structural Equations with Latent Variables. New Jersey: Wiley-Interscience; 1989.
  6. Braunstein A, McShane B, Piette J, Jensen S. A Hierarchical Bayesian Variable Selection Approach to Major League Baseball Hitting Metrics. JQAS. 2011;7
  7. Carvalho C, Chang J, Lucas J, Wang Q, Nevins J, West M. High-dimensional sparse factor modelling: Applications in gene expression genomics. J Am Stat Assoc. 2008;103:1438–1456. doi: 10.1198/016214508000000869.
  8. Carvalho LE, Lawrence CE. Centroid estimation in discrete high-dimensional spaces with applications in biology. Proceedings of the National Academy of Sciences. 2008;105:3209–3214. doi: 10.1073/pnas.0712329105.
  9. Chuang HY, Lee E, Liu YT, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Molecular Systems Biology. 2007;3:140. doi: 10.1038/msb4100180.
  10. Cosgrove EJ, Yingchun Z, Gardner TS, Kolaczyk ED. Predicting gene targets of perturbations via network-based filtering of mRNA expression compendia. Bioinformatics. 2008;24:2482–2490. doi: 10.1093/bioinformatics/btn476.
  11. Cressie NAC. Statistics for Spatial Data. New York: Wiley; 1993.
  12. de Oliveira V. Bayesian Analysis of Conditional Autoregressive Models. Ann Inst Stat Math. 2012;64:107–133.
  13. di Bernardo D, Thompson M, Gardner T, et al. Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nature Biotechnology. 2005;23:377–383. doi: 10.1038/nbt1075.
  14. Draghici S, Khatri P, Tarca AL, Amin K, Done A, Voichita C, Georgescu C, Romero R. A systems biology approach for pathway level analysis. Genome Res. 2007;17:1537–1545. doi: 10.1101/gr.6202607.
  15. Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Stat Science. 2003;18:71–103.
  16. Faulon JL, Misra M, Martin S, Sale K, Sapra R. Genome scale enzyme-metabolite and drug-target interaction predictions using the signature molecular descriptor. Bioinformatics. 2008;24:225–233. doi: 10.1093/bioinformatics/btm580.
  17. Figueiredo-Pereira ME, Chen WE, Li J, Johdo O. The antitumor drug aclacinomycin A, which inhibits the degradation of ubiquitinated proteins, shows selectivity for the chymotrypsin-like activity of the bovine pituitary 20 S proteasome. Journal of Biological Chemistry. 1996;271:16455–9. doi: 10.1074/jbc.271.28.16455.
  18. Fritsche M, Haessler C, Brandner G. Induction of nuclear accumulation of the tumor-suppressor protein p53 by DNA damaging agents. Oncogene. 1993;8:307–318.
  19. Gandhi T, Zhong J, Mathivanan S, et al. Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nature genetics. 2006;38:285–293. doi: 10.1038/ng1747.
  20. Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Analysis. 2006;1:515–533.
  21. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. New York: Chapman & Hall/CRC; 2004.
  22. George EI, McCulloch RE. Approaches for Bayesian variable selection. Statistica Sinica. 1997;7:339–373.
  23. Goodsell DS. The Molecular Perspective: Methotrexate. The Oncologist. 1999;4:340–341.
  24. Grandela C, Pera MF, Grimmond SM, Kolle G, Wolvetang EJ. p53 is required for etoposide-induced apoptosis of human embryonic stem cells. Stem Cell Res. 2007;1:116–128. doi: 10.1016/j.scr.2007.10.003.
  25. Grenert JP, et al. The amino-terminal domain of heat shock protein 90 (hsp90) that binds geldanamycin is an ATP/ADP switch domain that regulates hsp90 conformation. J Biol Chem. 1997;272:23843–23850. doi: 10.1074/jbc.272.38.23843.
  26. Gu J, Chen Y, Li S, Li Y. Identification of responsive gene modules by network-based gene clustering and extending: application to inflammation and angiogenesis. BMC Sys Bio. 2010;4:47. doi: 10.1186/1752-0509-4-47.
  27. Gupta M, Fan S, Zhan Q, Kohn KW, O’Connor PM, Pommier Y. Inactivation of p53 increases the cytotoxicity of camptothecin in human colon HCT116 and breast MCF-7 cancer cells. Clin Cancer Res. 1997;3:1653–1660.
  28. Hidaka, et al. Isoquinolinesulfonamides, novel and potent inhibitors of cyclic nucleotide dependent protein kinase and protein kinase C. Biochemistry. 1984;23:5036. doi: 10.1021/bi00316a032.
  29. Huang WY, Yang PM, Chang YF, Marquez VE, Chen CC. Methotrexate induces apoptosis through p53/p21-dependent pathway and increases E-cadherin expression through downregulation of HDAC/EZH2. Biochem Pharmacol. 2011;81:510–517. doi: 10.1016/j.bcp.2010.11.014.
  30. Ideker T, Ozier O, Schwikowski B, Siegel AF. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics. 2002;18(Suppl 1):S233–40. doi: 10.1093/bioinformatics/18.suppl_1.s233.
  31. Jaks V, Jers A, Kristjuhan A, Maimets T. p53 protein accumulation in addition to the transactivation activity is required for p53-dependent cell cycle arrest after treatment of cells with camptothecin. Oncogene. 2001;20:1212–1219. doi: 10.1038/sj.onc.1204232.
  32. Jiang Z, Gentleman R. Extensions to gene set enrichment. Bioinformatics. 2007;23:306–313. doi: 10.1093/bioinformatics/btl599.
  33. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27.
  34. Kanehisa M, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006;34:354–357. doi: 10.1093/nar/gkj102.
  35. Karpinich NO, Tafani M, Rothman RJ, Russo MA, Farber JL. The course of etoposide-induced apoptosis from damage to DNA and p53 activation to mitochondrial release of cytochrome c. J Biol Chem. 2002;277:16547–16552. doi: 10.1074/jbc.M110629200.
  36. Khatri P, Draghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005;21:3587–3595. doi: 10.1093/bioinformatics/bti565.
  37. Krause K, Wasner M, Reinhard W, et al. The tumour suppressor protein p53 can repress transcription of cyclin B. Nucl Acids Res. 2002;22:4410–4441. doi: 10.1093/nar/28.22.4410.
  38. Kurz EU, Douglas P, Lees-Miller SP. Doxorubicin Activates ATM-dependent Phosphorylation of Multiple Downstream Targets in Part through the Generation of Reactive Oxygen Species. J Biol Chem. 2004;279:53272–53281. doi: 10.1074/jbc.M406879200.
  39. Liao J, Boscolo R, Yang Y, et al. Network component analysis: reconstruction of regulatory signals in biological systems. Proceedings of the National Academy of Sciences. 2003;100:15522–15527. doi: 10.1073/pnas.2136632100.
  40. Lindley DV, Smith AF. Bayes estimates for the linear model. Journal of the Royal Statistical Society. Series B (Methodological) 1972:1–41.
  41. Ling B, Wei-Guo Z. p53: Structure, Function and Therapeutic Applications. J Cancer Mol. 2006;2:141–153.
  42. Ling YH, el Naggar AK, Priebe W, Perez-Soler R. Cell cycle-dependent cytotoxicity, G2/M phase arrest, and disruption of p34cdc2/cyclin B1 activity induced by doxorubicin in synchronized P388 cells. Mol Pharmacol. 1996;49:832–841.
  43. Liu M, et al. Network-Based Analysis of Type 2 Diabetes. PLoS Genet. 2010;3:e96. doi: 10.1371/journal.pgen.0030096.
  44. Lucas J, Carvalho CM, Wang Q, Bild A, Nevins JR, West M. Sparse statistical modelling in gene expression genomics. Bayesian Inference for Gene Expression and Proteomics. 2006;14:155–176.
  45. Lucas JE, Kung HN, Chi JT. Latent Factor Analysis to Discover Pathway-Associated Putative Segmental Aneuploidies in Human Cancers. PLoS Comput Biol. 2010;6:e1000920. doi: 10.1371/journal.pcbi.1000920.
  46. MT Mitomycin C: small, fast and deadly (but very selective) Chemical Biology. 1995;2:575–579. doi: 10.1016/1074-5521(95)90120-5.
  47. Ma H, Zhao H. iFad: an integrative factor analysis model for drug-pathway association inference. Bioinformatics. 2012;28:1911–1918. doi: 10.1093/bioinformatics/bts285.
  48. Maliga Z, Kapoor TM, JMT Evidence that monastrol is an allosteric inhibitor of the mitotic kinesin Eg5. Chemical Biology. 2002;9:989–996. doi: 10.1016/s1074-5521(02)00212-0.
  49. Malik M, Nitiss KC, Enriques-Rios V, Nitiss JL. Roles of nonhomologous end-joining pathways in surviving topoisomerase II-mediated DNA damage. Mol Cancer Ther. 2006;5:1405. doi: 10.1158/1535-7163.MCT-05-0263.
  50. Marbach D, CCJ, Küffner R, NMV, Prill R, Camacho D, Allison K, Kellis M, Collins JJ, Stolovitzky G Consortium TD. Wisdom of crowds for robust gene network inference. Nature Methods. 2012;9:796–804. doi: 10.1038/nmeth.2016.
  51. Nakada S, Katsuki Y, Imoto I, Yokoyama T, Nagasawa M, Inazawa J, Mizutani S. Early G2/M checkpoint failure as a molecular mechanism underlying etoposide-induced chromosomal aberrations. J Clin Invest. 2006;116:80–89. doi: 10.1172/JCI25716.
  52. Nam D, Kim SY. Gene-set approach for expression pattern analysis. Brief Bioinform. 2008;9:189–197. doi: 10.1093/bib/bbn001.
  53. Neckers L, Schulte TW, Mimnaugh E. Geldanamycin as a potential anti-cancer agent: its molecular target and biochemical activity. Invest New Drugs. 1999;17:361–373. doi: 10.1023/a:1006382320697.
  54. Obrig TG, Culp WJ, McKeehan WL, Hardesty B. The mechanism by which cycloheximide and related glutarimide antibiotics inhibit peptide synthesis on reticulocyte ribosomes. Journal of Biological Chemistry. 1971;246:174–181.
  55. Pham L, Christadore L, Schaus S, Kolaczyk ED. Network-based prediction for sources of transcriptional dysregulation via latent pathway identification analysis. Proc Nat Acad Sci. 2011;108:13347–13352. doi: 10.1073/pnas.1100891108.
  56. Pommier Y, Leo E, Zhang H, Marchand C. DNA topoisomerases and their poisoning by anticancer and antibacterial drugs. Chemical Biology. 2012;17:421–433. doi: 10.1016/j.chembiol.2010.04.012.
  57. Prill RJ, Saez-Rodriguez J, Alexopoulos LG, Sorger PK, Stolovitzky G. Crowdsourcing Network Inference: The DREAM4 Predictive Signaling Network Challenge. Science. 2011;4:mr7. doi: 10.1126/scisignal.2002212.
  58. Rao CV, Kurkjian CD, Yamada HY. Mitosis-targeting natural products for cancer prevention and therapy. Curr Drug Targets. 2012;13:1820–1830. doi: 10.2174/138945012804545533.
  59. Reguly T, Breitkreutz A, Boucher L, et al. Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae. Journal of Biology. 2006;5:11. doi: 10.1186/jbiol36.
  60. Rivals I, Personnaz L, Taing L, et al. Enrichment or depletion of a GO category within a class of genes: which test? Bioinformatics. 2007;23:401–407. doi: 10.1093/bioinformatics/btl633.
  61. Roguev A, Bandyopadhyay S, et al. Conservation and rewiring of functional modules revealed by an epistasis map in fission yeast. Science. 2008;322:405–410. doi: 10.1126/science.1162609.
  62. Storey JD, Tibshirani R. Statistical significance for genome-wide studies. Proc Natl Acad Sci. 2003;100:9440. doi: 10.1073/pnas.1530509100.
  63. Subramanian A, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102.
  64. Terstappen GC, Schlüpen C, Raggiaschi R, Gaviraghi G. Target deconvolution strategies in drug discovery. Nature Reviews. 2007;6:891–903. doi: 10.1038/nrd2410.
  65. The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genet. 2000;25:25–29. doi: 10.1038/75556.
  66. van Dyk DA, Park T. Partially collapsed Gibbs samplers: Theory and methods. J Am Stat Assoc. 2008;482:790–796.
  67. Vanhaecke T, Papeleu P, Elaut G, Rogiers V. Trichostatin A-like hydroxamate histone deacetylase inhibitors as therapeutic agents: toxicological point of view. Curr Med Chem. 2004;11:1629–1643. doi: 10.2174/0929867043365099.
  68. Verweij J, Pinedo HM. Mitomycin C: mechanism of action, usefulness and limitations. Anticancer Drugs. 1990;1:5–13.
  69. Vogelstein B, Kinzler KW. Cancer genes and the pathways they control. Nature Medicine. 2004;10:789–799. doi: 10.1038/nm1087.
  70. von Mering C, Krause R, Snel B, Cornell M, et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002;417:6887. doi: 10.1038/nature750.
  71. Wall MM. A close look at the spatial structure implied by the CAR and SAR models. Journal of Statistical Planning and Inference. 2004;121:311–324.
  72. Wang S, Konorev E, Kotamraju S, Joseph J, Kalivendi S, Kalyanaraman B. Doxorubicin Induces Apoptosis in Normal and Tumor Cells via Distinctly Different Mechanisms. J Biol Chem. 2004;279:25535–25543. doi: 10.1074/jbc.M400944200.
  73. Wei P, Pan W. Incorporating gene networks into statistical tests for genomic data via a spatially correlated mixture model. Bioinformatics. 2008;24:404–411. doi: 10.1093/bioinformatics/btm612.
  74. Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics. 2008;24:232–240. doi: 10.1093/bioinformatics/btn162.
  75. Zhou M, Gu L, Li F, Zhu Y, Woods WG, Findley HW. DNA damage induces a novel p53-survivin signaling pathway regulating cell cycle and apoptosis in acute lymphoblastic leukemia cells. Mol Pharmacol. 2002;303:124–31. doi: 10.1124/jpet.102.037192.
