Bioinformatics. 2012 May 10;28(14):1911–1918. doi: 10.1093/bioinformatics/bts285

iFad: an integrative factor analysis model for drug-pathway association inference

Haisu Ma 1, Hongyu Zhao 2,*
PMCID: PMC3389771  PMID: 22581178

Abstract

Motivation: Pathway-based drug discovery considers the therapeutic effects of compounds in the global physiological environment. This approach has been gaining popularity in recent years because the target pathways and mechanism of action for many compounds are still unknown, and there are also some unexpected off-target effects. Therefore, the inference of drug-pathway associations is a crucial step to fully realize the potential of system-based pharmacological research. Transcriptome data offer valuable information on drug-pathway targets because the pathway activities may be reflected through gene expression levels. Hence, it is of great interest to jointly analyze the drug sensitivity and gene expression data from the same set of samples to investigate the gene-pathway–drug-pathway associations.

Results: We have developed iFad, a Bayesian sparse factor analysis model to jointly analyze the paired gene expression and drug sensitivity datasets measured across the same panel of samples. The model enables direct incorporation of prior knowledge regarding gene-pathway and/or drug-pathway associations to aid the discovery of new association relationships. We use a collapsed Gibbs sampling algorithm for inference. Satisfactory performance of the proposed model was found for both simulated datasets and real data collected on the NCI-60 cell lines. Our results suggest that iFad is a promising approach for the identification of drug targets. This model also provides a general statistical framework for pathway-based integrative analysis of other types of -omics data.

Availability: The R package ‘iFad’ and real NCI-60 dataset used are available at http://bioinformatics.med.yale.edu/group/.

Contact: hongyu.zhao@yale.edu

Supplementary Information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Identification of drug targets, the gene products that bind to specific therapeutic molecules, is essential for understanding drugs' mechanisms of action and possible side effects, and for maximizing treatment efficacy and minimizing drug toxicity. The traditional pharmaceutical research and development process is deeply rooted in the ‘one target-one drug’ mindset, which tries to interfere with the pathological process by blocking an important molecular player (e.g. a specific enzyme) using a compound. Unfortunately, most drug candidates identified through high-throughput screening based on this philosophy have failed due to either poor efficacy or serious side effects (Schadt et al., 2009).

High hopes have been placed on using systems biology for drug discovery through the application of novel computational methods to high-throughput genomics and proteomics data. In contrast to the traditional ‘one target-one drug’ perspective that ignores the intricate interactions among genes/proteins, systems biology approaches consider the drug effects in the global physiological environment. They may be more effective in drug discovery because it has become apparent that most common diseases result from system-level malfunctions rather than problems with individual genes (Pujol et al., 2010). In practice, drug combination or poly-pharmacy has been more commonly used in recent years to modulate multiple drug targets.

Existing computational methods for drug-target identification can be generally categorized into three classes. The first class of methods uses the classical gene expression profiling strategies to infer drug targets. For example, drug targets can be identified by comparing the mRNA responses in human cell lines induced by drugs with unknown mechanisms with a set of well understood drugs. The connectivity map project (Lamb et al., 2006) used a non-parametric rank-based pattern matching strategy based on the Kolmogorov–Smirnov statistic, which requires extensive data mining procedures. Another study (Kutalik et al., 2008) proposed a bi-clustering method, named iterative signature algorithm (ISA), to search for ‘co-modules’ representing gene–drug associations.

The second class of methods aims to integrate various types of biological data, including knowledge about bioactive molecules and their protein targets (e.g. sequences, structures and molecular mechanisms) along with the phenotypic effects of drug treatment, to deduce the probability of two proteins being bound by the same ligand (Czodrowski et al., 2009; Ecker et al., 2008; Nigsch et al., 2009). Drug docking studies have also been incorporated to measure the binding probabilities based on 3D structural complementarities (Boyce et al., 2009; Irwin et al., 2009; Kolb et al., 2009; Zavodszky and Kuhn, 2005). One disadvantage of these methods is that they only evaluate the likelihood of presumed drug-target pairs and make no inference of unknown drug targets. Campillos et al. (2008) proposed a probabilistic network to calculate the probability of two drugs sharing the same target by comparing their clinical side effects. However, this method provides limited information on the drug action mechanisms.

The third class of methods analyzes the global patterns of drug–protein interactions (Kuhn et al., 2008; Yeh et al., 2006; Yildirim et al., 2007). It is known that established gene–drug binary associations form a dense network that exhibits local clustering of similar drug types. Such networks can enhance our understanding of the physiological effects of various drugs in terms of the molecular pathways and disease categories involved. However, these ‘guilt-by-association’ methods are too coarse-grained to make quantitative inference about the effect of drugs at the system level.

In this article, we develop a coherent statistical framework to jointly analyze drug sensitivity and gene expression patterns from the same set of cell lines to infer the target pathways of drugs. Much information on these two types of data has been accumulated in the literature (Bussey et al., 2006; Ikediobi et al., 2006; Shankavaram et al., 2007; Sharma et al., 2010; Shoemaker, 2006), and if appropriately analyzed, these data may be informative for inferring drug targets. We adopt a latent factor analysis approach, where each latent factor corresponds to the activity of a specific pathway. Note that factor analysis models have been proposed for the reconstruction of gene regulatory networks and the inference of transcription factor activity profiles (Gharib et al., 2006; Meng et al., 2011; Yeh et al., 2009; Yu and Li, 2005). A previous review article (Pournara and Wernisch, 2007) compared the performance of five different factor analysis algorithms. However, these models were developed for the analysis of a single data type, i.e. gene expression data. In contrast, our proposed Bayesian sparse factor analysis model, iFad (integrative factor analysis for drug-pathway association inference), is the first effort to bring together two distinct data types in a unified framework to identify drug-target pathways. We propose and implement a modified collapsed Gibbs sampling algorithm for model inference. Our approach can also easily incorporate known pathway information to infer the ‘many-to-many’ correspondence between the two types of data. Some unique features of our model are (1) joint analysis of distinct data types; (2) a Bayesian framework to integrate prior pathway knowledge and (3) explicit consideration of the sparse nature of the drug-target pathways. Both simulation studies and applications to real NCI-60 datasets show that iFad is a promising approach for drug-target inference.

The rest of the article is organized as follows. We detail the modeling assumptions and statistical inferential procedure in Section 2. Simulations and real data analysis are described in Section 3. We conclude the article in Section 4.

2 METHODS

2.1 Model description

This section describes the statistical framework of our proposed Bayesian sparse factor analysis model, iFad, and the statistical inference of the model parameters. As discussed above, iFad aims to analyze paired gene expression data and drug sensitivity data generated from the same set of samples. We denote the gene expression dataset by matrix Y1, with dimension G1 by J, where G1 is the number of genes and J is the sample size. The drug sensitivity dataset is denoted by matrix Y2, with dimension G2 by J, where G2 is the number of drugs. Drug sensitivity is usually quantified as ‘GI50’ values, the concentrations required to inhibit growth by 50% (Staunton et al., 2001). Matrices Y1 and Y2 are normalized (scaled to mean 0 and SD 1 for each gene/drug) before analysis.

iFad links the two matrices through the activity levels of K biological pathways (e.g. KEGG pathways), which are latent factors in our model. The rationale here is that pathway activities influence both gene expression levels and the sensitivity to drugs targeting these pathways. We assume that there is some prior knowledge about the gene-pathway and drug-pathway association relationships, represented by two binary matrices L1 and L2, with dimensions G1 by K and G2 by K, respectively, where L1[g, k]=1 (or L2[g, k]=1) indicates that the gth gene (or drug) is known to be associated with the kth pathway. This information can be retrieved from various pathway databases with different degrees of sensitivity and specificity (Bader et al., 2006; Kanehisa et al., 2010). iFad assumes that both matrices Y1 and Y2 are related to the common underlying pathway activity matrix X (with dimension K by J) through the following linear models:

\[
Y_1 = W_1 X + \Sigma_1, \qquad Y_2 = W_2 X + \Sigma_2
\]

Matrices W1 and W2 are the factor loading matrices describing the regulatory direction (positive or negative) and strength of the pathway activities on the gene expression levels Y1 and drug sensitivity Y2. The latent factor activity matrix X is shared between the two feature spaces, namely gene expression data and drug sensitivity data. Each entry in matrix X is assumed to follow a standard normal distribution. Σ1 and Σ2 represent the noise term added to gene expression or drug sensitivity, with mean 0 and diagonal covariance matrices Ψ1 and Ψ2. The precision τg1 (for the g1th gene) and τg2 (for the g2th drug) are modeled using a Gamma prior with shape parameters α1, α2 and rate parameters β1, β2.
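The linear models described above can be illustrated by a small simulation. The following is a minimal sketch in Python (standard library only), not code from the iFad R package; the function name `simulate_ifad`, the toy loadings and the noise variances are all hypothetical. It assumes entries of X are standard normal and that each gene/drug has its own noise variance (the diagonal of Ψ1 or Ψ2).

```python
import random

def simulate_ifad(W1, W2, J, psi1, psi2, seed=0):
    """Draw X ~ N(0, 1) entrywise, then Y_i = W_i X + noise.

    psi1[g] / psi2[g] are per-row noise variances (diagonal of Psi_1 / Psi_2).
    All names here are illustrative, not taken from the iFad package.
    """
    rng = random.Random(seed)
    K = len(W1[0])
    # Shared latent pathway activity matrix X (K by J).
    X = [[rng.gauss(0.0, 1.0) for _ in range(J)] for _ in range(K)]

    def observe(W, psi):
        Y = []
        for g, row in enumerate(W):
            sd = psi[g] ** 0.5
            Y.append([
                sum(row[k] * X[k][j] for k in range(K)) + rng.gauss(0.0, sd)
                for j in range(J)
            ])
        return Y

    return X, observe(W1, psi1), observe(W2, psi2)

# Toy example: 3 genes, 2 drugs, 2 pathways, 4 samples (made-up values).
W1 = [[1.0, 0.0], [0.0, -0.5], [0.8, 0.3]]
W2 = [[0.0, 1.2], [-0.7, 0.0]]
X, Y1, Y2 = simulate_ifad(W1, W2, J=4, psi1=[0.1, 0.1, 0.1], psi2=[0.2, 0.2])
```

Because both Y1 and Y2 are generated from the same X, a pathway that loads on a gene and on a drug induces correlation between that gene's expression and that drug's sensitivity across samples.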

In order to use the prior knowledge on the gene-pathway and drug-pathway associations (matrices L1 and L2), we use the spike-and-slab mixture prior (West, 2003) for the factor loading matrices W1 and W2. Although there exist other forms of sparsity-inducing priors (Pournara and Wernisch, 2007), the spike-and-slab prior has the advantage of easy incorporation of prior information on the connectivity structure of the loading matrix. For both W1 and W2, we put the following prior on each entry:

\[
W_{g,k} \sim (1 - \pi_{g,k})\,\delta_0(W_{g,k}) + \pi_{g,k}\,N(W_{g,k} \mid 0, \tau_w^{-1})
\]

where δ0 is the unit point mass at zero (the Dirac delta function) and πg,k denotes the prior probability that Wg,k is non-zero. If Wg,k is non-zero, it is assumed to follow a normal distribution with mean 0 and precision τw. The precision τw can be either set to a constant or assumed to follow a Gamma prior with parameters (αw, βw). Usually, an auxiliary indicator variable Zg,k is used to enable the calculation of posterior probabilities (as Z1, W1, π1, L1 and Z2, W2, π2, L2 have very similar formats except for the subscript, we list only the general formula here):

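\[
P(W_{g,k} \mid Z_{g,k}) = (1 - Z_{g,k})\,\delta_0(W_{g,k}) + Z_{g,k}\,N(W_{g,k} \mid 0, \tau_w^{-1})
\]
\[
P(Z_{g,k} = 1) = \pi_{g,k}, \qquad
\pi_{g,k} =
\begin{cases}
\eta_0 & \text{if } L_{g,k} = 0 \\
1 - \eta_1 & \text{if } L_{g,k} = 1
\end{cases}
\]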

In this way, prior link matrices L1 and L2 are used to induce the sparsity structure of the factor loading matrices W1 and W2 in a flexible way, with the strength of guidance tuned conveniently by user-specified parameters η0 and η1. Under this setting, we can derive the prior probability of different components of the model, as well as the complete joint posterior probability (see Supplementary Materials for details).
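The role of η0 and η1 can be sketched as follows. This is an illustrative Python example, not the package implementation, and it assumes that η0 is the prior probability of a non-zero loading where L = 0 and η1 is the prior probability of a zero loading where L = 1; this reading is consistent with the densities of L and Z reported in Tables 2 and 3. The function names are hypothetical.

```python
import random

def prior_inclusion(L, eta0, eta1):
    """pi[g][k] = 1 - eta1 where L[g][k] == 1, else eta0 (assumed reading)."""
    return [[(1.0 - eta1) if l == 1 else eta0 for l in row] for row in L]

def sample_spike_slab(pi, tau_w=1.0, seed=0):
    """Sample Z ~ Bernoulli(pi); W is N(0, 1/tau_w) where Z = 1 and 0 elsewhere."""
    rng = random.Random(seed)
    sd = tau_w ** -0.5
    Z = [[1 if rng.random() < p else 0 for p in row] for row in pi]
    W = [[rng.gauss(0.0, sd) if z else 0.0 for z in row] for row in Z]
    return Z, W

def expected_density(density_L, eta0, eta1):
    """Expected proportion of non-zero entries in Z given the density of L."""
    return density_L * (1.0 - eta1) + (1.0 - density_L) * eta0

L = [[1, 0, 0], [0, 1, 0]]
pi = prior_inclusion(L, eta0=0.2, eta1=0.1)
Z, W = sample_spike_slab(pi)
# With density(L) = 0.1 and eta0 = eta1 = 0.2 the expected density of Z is
# 0.1 * 0.8 + 0.9 * 0.2 = 0.26, close to the 0.25 listed in Table 3.
d = expected_density(0.1, 0.2, 0.2)
```

Setting η0 = η1 = 0 makes Z identical to L, which corresponds to treating the prior association matrix as fully trusted.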

It is worth noting that during the simulation studies in Section 3.1, both matrices Z1 and Z2 are unknown and are the target of inference. In contrast, for real data analysis (the NCI-60 dataset) in Section 3.2, since prior information about gene-pathway association structure is available and fairly accurate, the major interest lies in the inference of matrix Z2, the drug-pathway association relationships.

2.2 Inference algorithm

There are many parameters to estimate for iFad. Gibbs sampling is a widely used technique to approximate the joint distribution through re-sampling. However, a standard Gibbs sampler may mix poorly because of the dependence between matrices W and Z. Therefore, we used a modified collapsed Gibbs sampling algorithm for model inference as outlined below. Detailed derivations of the posterior conditional distributions are provided in Supplementary Materials. At the end of each sampling iteration, we add a local permutation step (Sharp et al., 2010) to address the problem of label-switching, which is also described in Supplementary Materials. We have implemented the above algorithm as the R package ‘iFad’, which is publicly available on CRAN.

[Algorithm outline (image in the original article): modified collapsed Gibbs sampling algorithm for iFad]

3 RESULTS

We first tested the performance of iFad using simulated datasets, and then applied the method to real NCI-60 datasets to infer unknown drug-pathway associations.

3.1 Simulation study

In order to assess the performance of our proposed model, we first simulated a series of datasets to investigate the effects of different model parameter settings, as well as the various dataset properties, including sample size and noise level, among other factors.

3.1.1 Data simulation for model parameter selection

We tested eight different settings for the Gamma density parameters related to the precision of the gene/drug noise term (τg1, τg2) and the factor loading matrix (τw1, τw2), as shown in Table 1.

Table 1.

Model parameter settings considered in the simulations

Setting   αg    βg      αw     βw
1         0.7   0.3     0.7    0.3
2         0.7   0.3     σw = 1
3         1     0.1     0.7    0.3
4         1     0.1     σw = 1
5         1     0.01    0.7    0.3
6         1     0.01    σw = 1
7         1     0.005   0.7    0.3
8         1     0.005   σw = 1

αg and βg are the shape and rate parameters of the Gamma prior put on the precision of the noise term for both matrices Y1 and Y2; αw and βw are Gamma parameters for the precision of the non-zero elements of the factor loading matrices W1 and W2.

Different Gamma parameters represent different prior beliefs regarding the distribution of the standard deviation of the noise term/the factor loadings. Figure 1 shows the histogram of 10 000 standard deviation values randomly generated from the Gamma density with the four parameter combinations tested (only the values smaller than 3 are plotted here). We first simulated two sets of data to compare the eight combinations of parameter settings, with each dataset consisting of four matrices Y1, Y2, π1 and π2, as shown in Table 2. For both datasets, we used αg=1, βg=0.01 and σw=1. For Gibbs sampling, we set the number of iterations to 30 000, with the first half discarded as the burn-in period [this was chosen based on the Markov chain Monte Carlo (MCMC) trace plot]. To reduce the effect of auto-correlation between adjacent iterations and the data storage burden, we only recorded the Gibbs sampling results every 10th iteration. We ran five independent chains for Set 1 and three independent chains for Set 2, to check the consistency among multiple independent runs.
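The burn-in and thinning scheme described above determines how many draws enter the posterior summaries. The sketch below is a Python illustration with hypothetical function names; the exact offset at which thinning starts is an assumption, since the text only states that half of the 30 000 iterations are discarded and every 10th draw is recorded.

```python
def retained_iterations(total, burn_in, thin):
    """1-based iteration numbers kept after burn-in, keeping every `thin`-th draw."""
    return [t for t in range(burn_in + 1, total + 1) if (t - burn_in) % thin == 0]

def posterior_mean(samples):
    """Entrywise mean over retained draws of a 0/1 indicator matrix Z."""
    n = len(samples)
    rows, cols = len(samples[0]), len(samples[0][0])
    return [[sum(s[g][k] for s in samples) / n for k in range(cols)]
            for g in range(rows)]

kept = retained_iterations(total=30000, burn_in=15000, thin=10)
print(len(kept))  # 1500 retained draws per chain

# Two toy draws of a 1-by-2 indicator matrix: posterior means are 0.5 and 0.0.
pm = posterior_mean([[[1, 0]], [[0, 0]]])
```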

Fig. 1.


Histograms of the standard deviations sampled from the Gamma densities with different parameter settings. Note that the Gamma prior is put on the precision τ and SD = 1/sqrt(τ)
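The relationship between the Gamma prior on the precision and the implied standard deviation can be reproduced approximately as follows. This is a minimal Python sketch, not the code used for Figure 1; `sample_sds` is a hypothetical name, and the seed and sample size are arbitrary. Note that Python's `random.gammavariate` is parameterized by shape and scale, so the rate β enters as 1/β.

```python
import random

def sample_sds(alpha, beta, n, seed=0):
    """Draw tau ~ Gamma(shape=alpha, rate=beta) and return SD = 1/sqrt(tau)."""
    rng = random.Random(seed)
    # gammavariate(shape, scale): scale is the reciprocal of the rate.
    return [rng.gammavariate(alpha, 1.0 / beta) ** -0.5 for _ in range(n)]

# Setting with alpha_g = 1, beta_g = 0.01 (settings 5 and 6 in Table 1).
sds = sample_sds(alpha=1.0, beta=0.01, n=10000)
share_below_3 = sum(sd < 3 for sd in sds) / len(sds)
```

With α = 1 and β = 0.01 the prior mean of the precision is α/β = 100, so most sampled standard deviations are small, which expresses a prior belief in low noise.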

Table 2.

Data simulation for choosing model parameters

Set   K    G1    G2   J    η0, η1 (π1)   η0, η1 (π2)   Density L1   Density L2   Density Z1   Density Z2
1     18   50    50   20   0.2, 0.2      0.3, 0.1      0.316        0.167        0.394        0.38
2     15   100   50   20   0.2, 0.2      0.35, 0.05    0.05         0.0067       0.235        0.353

We used the area under the receiver-operating characteristic (ROC) curve (AUC) to assess the inference performance of iFad. For the retained Gibbs samples after the burn-in period, we calculated the mean of each entry of matrices Z1 and Z2, and obtained the ROC curve and AUC values by varying the cutoff applied to these posterior means and comparing the results with the true matrices Z1 and Z2. The ‘ROCR’ package was used for this analysis.
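For illustration, the AUC of posterior means against a known indicator matrix can be computed directly. The sketch below is a plain-Python stand-in for what the ‘ROCR’ package provides in R; it uses the rank-sum (Mann-Whitney) identity rather than tracing the ROC curve, and the toy matrices are made up.

```python
def auc(scores, labels):
    """AUC via the rank-sum identity; ties between a positive and a negative count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative entry")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def flatten(M):
    """Flatten a matrix stored as nested lists into one list, row by row."""
    return [v for row in M for v in row]

Z_true = [[1, 0, 0], [0, 1, 1]]                 # true non-zero pattern
Z_post = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.3]]     # posterior means of Z
result = auc(flatten(Z_post), flatten(Z_true))
print(result)  # 8 of 9 positive-negative pairs are ranked correctly
```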

For general factor analysis models, there is a scale identifiability problem associated with the loading matrix W and factor matrix X, as after integrating out the latent factors, the complete density of the observed data matrix Y is a normal distribution with covariance matrix WΣxW′+Ψ (Pournara and Wernisch, 2007). In order to avoid this issue, for data simulation, we set matrix Σx to the identity matrix. We then tested eight combinations of Gamma parameters (αg, βg, αw, βw; Table 1) on two simulated datasets (as described in Table 2). We then asked how the result would be affected if the η0 and η1 used for inference are different from those used in simulation. Hence, we chose η0=η1=0.2 for matrix L1, and η0=0.15, η1=0.05 for matrix L2, and tested the inference algorithm again on dataset 2, which is denoted as ‘Set 3’. The results are shown in Figure 2. The red lines are the proportion of entries in matrix Z that have the same value as in matrix L, representing the accuracy of prior knowledge.
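The scale identifiability issue mentioned above can be made explicit for a single sample (one column y of Y). The derivation below is a standard factor-analysis argument written in the notation of this section, with D an arbitrary invertible diagonal matrix introduced here for illustration.

```latex
\begin{aligned}
y &= Wx + \varepsilon, \qquad x \sim N(0, \Sigma_x), \qquad \varepsilon \sim N(0, \Psi), \\
\operatorname{Cov}(y) &= W \Sigma_x W^{\top} + \Psi, \\
(WD)\,\bigl(D^{-1} \Sigma_x D^{-1}\bigr)\,(WD)^{\top} &= W \Sigma_x W^{\top}
\quad \text{for any invertible diagonal } D .
\end{aligned}
```

Rescaling the loadings by D and the factors by its inverse therefore leaves the distribution of the observed data unchanged, and fixing Σx to the identity matrix removes this freedom.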

Fig. 2.


AUC result of eight parameter settings for two simulated datasets. Red line is the percentage of original matching between matrices L and Z

Comparing the AUC of the eight parameter settings, it can be seen that σw=1 usually gives more consistent results among independent chains than putting a Gamma prior on τw. Nevertheless, αg and βg are clearly the major factors here. Parameter setting 6 (αg=1, βg=0.01 and σw=1) gives the best result in this comparison and is used for the remaining analysis in this article. For matrix L2 in Set 3, η0=0.35 during simulation but 0.15 during the Gibbs sampling. We checked the overlap of the inferred non-zero entries of matrix L2 between Sets 2 and 3 (by using cutoff 0.5 to dichotomize the posterior mean for each entry). For all the eight parameter settings, the non-zero entries inferred using η0=0.15 are almost always a subset of those inferred using η0=0.35 (as shown in Supplementary Fig. S1).
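The overlap check described above amounts to dichotomizing two posterior-mean matrices and comparing the resulting sets of positions. A minimal Python sketch with made-up numbers follows; the function names are hypothetical, and whether the cutoff is applied as a strict inequality is an assumption.

```python
def nonzero_entries(post_mean, cutoff=0.5):
    """Set of (row, column) positions whose posterior mean exceeds the cutoff."""
    return {(g, k) for g, row in enumerate(post_mean)
            for k, p in enumerate(row) if p > cutoff}

def fraction_contained(small, large):
    """Share of positions in `small` also found in `large` (1.0 means subset)."""
    return 1.0 if not small else len(small & large) / len(small)

# Toy posterior means from two runs that differ only in the eta0 used for inference.
post_eta_015 = [[0.7, 0.2], [0.1, 0.6]]
post_eta_035 = [[0.9, 0.6], [0.3, 0.8]]
a = nonzero_entries(post_eta_015)
b = nonzero_entries(post_eta_035)
frac = fraction_contained(a, b)
```

A value of 1.0 for `frac` corresponds to the situation reported in the text, where the entries called with the smaller η0 are contained in those called with the larger η0.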

3.1.2 Data simulation with various patterns for model performance evaluation

After determining the appropriate model parameters, we explored how different dataset properties (e.g. sample size, confidence in the prior link matrix L, density of matrix L/Z and noise level) may influence the performance and robustness of the iFad model. Therefore, we simulated five other groups of datasets to investigate the effects of η0 and η1, density of the connectivity matrix, imbalanced dimension between the two feature spaces (G1 ≠ G2), signal-to-noise ratio (SNR) and sample size (Table 3). Regarding the simulation, matrices L1 and L2 are randomly generated with specified density (proportion of non-zero entries). Matrices Z1 and Z2 are simulated based on L1 and L2 with Bernoulli probability specified by η0 and η1. The density of Z shown in Table 3 is the average value of Z1 and Z2. For the SNR (Group 4), the variance of each noise term (τg^−1) was calculated as τg^−1 = Var(WX[g,])/SNR = K/SNR. Three independent chains were run for each dataset in Groups 1–4, with total iterations = 30 000 and the first half as burn-in. Gibbs samples were recorded every 10th iteration. AUC results are shown in Figure 3. For Group 5, the chain usually converges more slowly as the sample size increases, so we tried total iterations = 10 000 (burn-in = 8000), 60 000 (burn-in = 40 000) and 100 000 (burn-in = 70 000). The AUC is plotted in Figure 4.
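The noise variance implied by a target SNR can be computed as sketched below. This Python example is illustrative only; the function names are hypothetical, and the use of the population variance for Var(WX[g,]) is an assumption. The second function applies the shortcut K/SNR quoted in the text.

```python
from statistics import pvariance

def noise_variance(signal_row, snr):
    """Noise variance tau_g^{-1} = Var(signal_row) / SNR for one row of WX."""
    return pvariance(signal_row) / snr

def noise_variance_from_k(K, snr):
    """Shortcut from the text: Var(WX[g,]) is taken to be K, giving K / SNR."""
    return K / snr

# Noise variances for K = 20 at the SNR values of Group 4 in Table 3.
group4 = {snr: noise_variance_from_k(20, snr) for snr in (2.5, 5, 10, 100, 500, 1000)}
```

For K = 20 this gives a noise variance of 8 at SNR = 2.5 and 0.02 at SNR = 1000, which spans the range over which the AUC comparison in Group 4 is made.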

Table 3.

Data simulations with different properties

Set   K    G1 and G2   J    η0 and η1   Density L   Density Z   αg, βg
Group 1: the effect of different η0 and η1
1     20   100         15   0.2         0.1         0.25        1, 0.01
2     20   100         15   0.4         0.1         0.41        1, 0.01
3     20   100         15   0.3         0.1         0.34        1, 0.01
4     Same data as Set 3, but used η0 and η1 = 0.2 for Gibbs sampling
5     Same data as Set 3, but used η0 and η1 = 0.4 for Gibbs sampling
Group 2: the effect of density of matrices L and Z
1     20   100         15   0.2         0.01        0.21        1, 0.01
2     20   100         15   0.2         0.1         0.25        1, 0.01
3     20   100         15   0.2         0.3         0.38        1, 0.01
4     20   100         50   0.2         0.5         0.51        1, 0.01
5     20   100         50   0.2         0.7         0.61        1, 0.01
Group 3: the effect of imbalanced datasets
1     20   100         15   0.2         0.1         0.25        1, 0.01
2     20   125, 75     15   0.2         0.1         0.26        1, 0.01
3     20   150, 50     15   0.2         0.1         0.26        1, 0.01
Group 4: the effect of SNR (the last column gives the SNR instead of αg, βg)
Set   K    G1 and G2   J    η0 and η1   Density L   Density Z   SNR
1     20   100         15   0.2         0.1         0.257       2.5
2     20   100         15   0.2         0.1         0.260       5
3     20   100         15   0.2         0.1         0.257       10
4     20   100         15   0.2         0.1         0.271       100
5     20   100         15   0.2         0.1         0.247       500
6     20   100         15   0.2         0.1         0.247       1000
Group 5: the effect of sample size
Set   K    G1 and G2   J    η0 and η1   Density L   Density Z   αg, βg
1     10   50          10   0.25        0.05        0.262       1, 0.01
2     10   50          30   0.25        0.05        0.251       1, 0.01
3     10   50          50   0.25        0.05        0.247       1, 0.01
4     10   50          70   0.25        0.05        0.244       1, 0.01

Fig. 3.


AUC result of simulated data, Groups 1–4

Fig. 4.


AUC result of simulated data, Group 5

From these five groups of comparisons, it can be observed that

  1. When prior information about the connectivity structure matrices Z1 and Z2 is fairly accurate, for example, when η0 and η1≤0.3, the AUC statistics are very good (Group 1, Sets 1 and 3), even if the η0 and η1 used for the Gibbs sampling algorithm deviated from the true values (Group 1, Sets 4 and 5). However, when η0 and η1 are large (Group 1, Set 2), the inference results are not satisfactory.

  2. iFad performs best when the density of matrices L and Z is around 0.1–0.3 (Group 2, Sets 2 and 3); too sparse (Group 2, Set 1) or too dense (Group 2, Sets 4 and 5) connectivity structure can hamper the inference result.

  3. Imbalanced dimension of the two datasets does not influence the inference result of iFad. All datasets in Group 3 achieved very good AUC statistics.

  4. The SNR has an important effect. A large SNR is desired for iFad to perform well (compare Group 4, Sets 1–3 with Sets 4–6).

  5. When sample size J is smaller than the total number of latent factors K, iFad cannot make accurate inference no matter how long the chain is run (Group 5, Set 1). Nevertheless, J=30 seems to be adequate for datasets with K=10 and G1=G2=50 (Group 5, Set 2). Increasing sample size can improve the performance of iFad (compare Sets 3 and 4 with Set 2) but requires more iterations of the Gibbs sampling.

3.2 Application of iFad: analysis of NCI-60 datasets for drug-pathway association discovery

We then applied iFad to the joint analysis of gene expression and drug sensitivity profiles of the NCI-60 cell lines. The NCI-60 project represents a comprehensive resource for various types of ‘Omics’ characterization of 60 human cancer cell lines with nine different tissue types, including RNA expression, DNA fingerprinting, DNA methylation, sequence mutation, as well as treatment response to >100 000 compounds. The gene expression and drug sensitivity data were downloaded from the CellMiner database (Shankavaram et al., 2009), with URL http://discover.nci.nih.gov/cellminer. We used ‘RNA: Affy HG-U133 (A,B)’ (44 000 probeset 2-chip set, Guanine Cytosine Robust Multi-Array Analysis (GCRMA) normalization) and ‘Drug: A4463’ for analysis.

3.2.1 Gene data preprocessing

We only used the HG-U133A chip and converted probe expression to gene expression by taking the average of the probes mapped to the same gene, resulting in a total of 12 980 genes measured across 59 cell lines (expression data of the cell line ‘LC:NCI_H23’ was unavailable). The expression data were then standardized so that for each gene, mean = 0 and SD = 1 across the 59 cell lines. As we are mainly interested in the analysis of drug response-related genes, we only kept genes that are included in either of the following two lists: first, 766 cancer-related genes (Chen et al., 2008); second, 8919 genes from the Integrated Druggable Genome Database Project (Hopkins and Groom, 2002; Russ and Lampel, 2005), downloaded from http://www.sophicalliance.com/. After this filtering, 6958 genes were retained.
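The two preprocessing steps described above, averaging probes per gene and standardizing each gene across cell lines, can be sketched as follows. This Python example is illustrative only; the probe and gene identifiers are invented, the function names are hypothetical, and the use of the sample standard deviation (n − 1 denominator, as in R's `scale`) is an assumption because the text does not specify it.

```python
from collections import defaultdict
from statistics import mean, stdev

def probes_to_genes(probe_expr, probe_to_gene):
    """Average probe-level profiles mapped to the same gene.

    probe_expr: {probe_id: [value per cell line]}; unmapped probes are dropped.
    """
    grouped = defaultdict(list)
    for probe, values in probe_expr.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            grouped[gene].append(values)
    return {g: [mean(col) for col in zip(*rows)] for g, rows in grouped.items()}

def standardize(values):
    """Scale one gene (or drug) profile to mean 0 and SD 1 across cell lines."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical probe data for three cell lines.
probe_expr = {"p1": [1.0, 3.0, 5.0], "p2": [3.0, 5.0, 7.0], "p3": [0.0, 1.0, 2.0]}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
genes = probes_to_genes(probe_expr, probe_to_gene)
scaled = {g: standardize(v) for g, v in genes.items()}
```

The same `standardize` step applies to the drug sensitivity profiles, which are scaled per drug in the same way.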

3.2.2 Drug data preprocessing

The drug data are the -log10(GI50) values from the Sulforhodamine assay for 4463 molecules (also known as the standard agents) that have known 2D structures and have been tested at least twice. Higher values equate to higher sensitivity of cell lines. The data are also scaled so that mean = 0 and SD = 1 for each drug. Among these 4463 molecules, we only kept the 101 drugs annotated in the CancerResource database (Ahmed et al., 2011). Little information is available about the targets or mechanisms of action for the other drugs.
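The direction of the -log10(GI50) scale can be checked with a short example. The sketch below is illustrative Python; the function name is hypothetical and the concentrations are assumed to be in molar units.

```python
import math

def neg_log10_gi50(gi50_molar):
    """-log10 of a GI50 concentration (assumed molar); larger means more sensitive."""
    return -math.log10(gi50_molar)

sensitive = neg_log10_gi50(1e-8)   # growth inhibited at 10 nM
resistant = neg_log10_gi50(1e-4)   # growth inhibited only at 100 uM
```

A cell line that is inhibited at a lower concentration receives the larger value, which matches the statement that higher values equate to higher sensitivity.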

3.2.3 Pathway association information

Gene-pathway and drug-pathway association data were retrieved from the KEGG MEDICUS database (Kanehisa et al., 2010). The link is http://www.genome.jp/kegg/catalog/pathway_dd.html. We compiled a list of 58 pathways that are either known to be related to cancer or have drug targets. Among the 6958 genes selected in Section 3.2.1, 1863 genes are covered by these 58 pathways and constitute the final list of genes in our real data analysis. Therefore, matrix L1 is a binary one with dimension 1863 × 58, and the dimension of matrix L2 is 101 × 58.

Our research objective here is to infer unknown drug-pathway associations to help better understand the mechanism of action of less well-studied compounds. We treated the pathway activity levels as latent factors in the iFad model, the gene expression data as matrix Y1 and drug sensitivity as matrix Y2. We compiled a list of 58 pathways (Supplementary Table S1), 1863 genes and 101 drugs for analysis, as described earlier. Gene expression data are available for 59 cell lines, representing nine different cancer types (Supplementary Table S2). Since there are only two cell lines from prostate cancer, we excluded this panel, with eight cancer types remaining for study. iFad was applied to each cancer type separately, instead of using all 57 cell lines together, in order to avoid potential problems arising from severe cell type heterogeneity (we checked several well-known gene–drug correlations and found that the correlation coefficient is usually much more significant when calculated using cell lines of the same type, rather than all the 57 cell lines).

For the NCI-60 analysis, the total number of iterations was set to 100 000 and burn-in = 70 000. For model parameters, we set η0=η1=0 for matrix π1, because of high confidence in the gene-pathway association information from KEGG. For matrix π2, we set η1=0 and tried η0=0.05, 0.1, 0.15, 0.2, 0.25, in order to infer unknown drug-pathway associations for various densities of matrix Z2. Based on prior knowledge from the KEGG database, matrix L1 has a density of 3.95%, whereas L2 has a density of 0.51%. Figure 5 shows the distribution of the number of genes/drugs associated with each pathway for matrix L1/L2.

Fig. 5.


Histogram of the number of genes or drugs known a priori to be associated with the 58 KEGG pathways

We compared the distribution of the posterior means of the entries of matrix Z2, inferred using different values of η0, as shown in the histograms of Supplementary Figure S2 and the quantile plots in Supplementary Figure S3. As expected, with η0 increasing, the distribution of the posterior mean of Z2 shifts to the right. For η0=0.05, the number of newly inferred non-zero entries in matrix Z2 is shown in Table 4 based on various cutoffs.

Table 4.

Number of newly inferred non-zero entries in matrix Z2 using η0=0.05 and various posterior probability cutoff values

Cutoff 0.2 0.25 0.3 0.35 0.4 0.45 0.5
BR 248 181 129 99 73 48 39
CNS 203 119 86 59 39 26 21
CO 276 209 167 148 121 98 81
LC 321 238 192 157 123 102 94
LE 200 122 86 66 53 41 33
ME 339 280 204 169 139 118 99
OV 261 191 154 118 96 78 63
RE 287 208 172 142 119 101 81
Union 1685 1282 1016 838 684 563 476
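Counts of the kind shown in Table 4 can be obtained by thresholding the posterior means of Z2 per cell line panel and excluding positions already present in L2. The Python sketch below uses made-up numbers and hypothetical function names; it assumes that "Union" counts distinct drug-pathway positions across panels and that the cutoff is inclusive.

```python
def newly_inferred(post_mean, L, cutoff):
    """Positions with posterior mean >= cutoff that are zero in the prior matrix L."""
    return {(g, k) for g, row in enumerate(post_mean)
            for k, p in enumerate(row) if p >= cutoff and L[g][k] == 0}

def count_by_panel(post_by_panel, L, cutoff):
    """Number of newly inferred entries per panel, plus the size of their union."""
    per_panel = {name: newly_inferred(pm, L, cutoff)
                 for name, pm in post_by_panel.items()}
    counts = {name: len(s) for name, s in per_panel.items()}
    counts["Union"] = len(set().union(*per_panel.values()))
    return counts

# Toy example: 2 drugs by 2 pathways, two panels, one known association in L2.
L2 = [[0, 1], [0, 0]]
post_by_panel = {
    "ME": [[0.35, 0.90], [0.10, 0.25]],
    "LC": [[0.40, 0.10], [0.31, 0.20]],
}
counts = count_by_panel(post_by_panel, L2, cutoff=0.3)
```

Because the same position can be called in several panels, the union is smaller than the sum of the per-panel counts, as is also the case in Table 4.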

We took a further look at the results obtained from cutoff = 0.3, because the posterior means of non-zero entries of matrix Z2 can reach around 0.5 (for more details, see Supplementary Table S3) at this cutoff. Figure 6 shows the association pattern in the heatmap. It can be seen that the drug-pathway interaction pattern exhibits strong cell type specificity, demonstrating the importance of conducting the analysis on separate cell line groups rather than on the entire NCI-60 panel. Supplementary Figure S4 shows the total number of drug-pathway associations in barplots. Generally speaking, more novel drug-pathway associations were discovered for the ‘non-small cell lung cancer’ and ‘melanoma’ cell lines, whereas fewer new associations were inferred for the ‘leukemia’, central nervous system (‘CNS’) and ‘breast cancer’ cell lines. One possible explanation is the difference in sample size: there are nine cell lines for ‘non-small cell lung cancer’ and ‘melanoma’, but only six cell lines for ‘leukemia’, ‘CNS’ and ‘breast cancer’.

Fig. 6.


Heatmaps showing the inferred drug-pathway association patterns. The upper panel shows the posterior mean for each entry of matrix Z2. The lower panel is the dichotomized value using cutoff = 0.3. Rows correspond to drugs and columns correspond to pathways

We then checked whether the newly inferred drug-pathway associations are supported by biological knowledge, based on the CancerResource database (Ahmed et al., 2011) and PubMed. We chose the CancerResource database as a reference because of its comprehensiveness: it integrates drug-target information from several well-known databases, including CTD (Davis et al., 2009), PharmGKB (Hernandez-Boussard et al., 2008), TTD (Zhu et al., 2011) and DrugBank (Wishart et al., 2008), as well as its own literature mining. It is worth noting that the current catalog of drug-target information is still far from complete, and the absence of specific drug-pathway associations in one database does not exclude the possibility that the interaction actually exists. Herein, we checked the inferred drug-pathway interactions using cutoff = 0.9 (with the complete list provided in Supplementary Table S4) and found that many of these associations can be validated by the CancerResource database (Table 5). For the unconfirmed associations, we checked the CTD database and found several additional validations. For example, in the ‘colon cancer’ cell line panel, the drug ‘daunorubicin’ is associated with the pathway ‘endometrial cancer’ with a posterior probability of 0.9547. This association is not documented in the CancerResource database, but can be confirmed by CTD. Although some drug-pathway associations are significant in more than one cell line panel (e.g. ‘doxorubicin’ acts on the ‘thyroid cancer pathway’ in both renal and ovarian cancer cell lines), most associations are still context-specific. For instance, ‘chlorambucil’ is associated with the ‘melanoma pathway’ mainly in melanoma cell lines, with a high posterior probability of 0.947. In contrast, this probability is much smaller in the other cell line panels (Figure 7).

Table 5.

Drug-pathway associations inferred using iFad which have been confirmed by the CancerResource database (cutoff = 0.9)

KEGG pathway                      Drug               Posterior probability   Cell line panel
Glutathione metabolism            Vincristine        0.9987                  LC
ErbB signaling pathway            Mitoxantrone       0.9937                  LC
Thyroid cancer                    Doxorubicin        0.991                   RE
Glutathione metabolism            6-Mercaptopurine   0.9867                  RE
Bladder cancer                    Tamoxifen          0.982                   LC
VEGF signaling pathway            Carmustine         0.9803                  RE
Thyroid cancer                    Doxorubicin        0.9713                  OV
ErbB signaling pathway            Camptothecin       0.9583                  LC
Bladder cancer                    Edelfosine         0.958                   LC
Bladder cancer                    Chlorambucil       0.9473                  RE
Melanoma                          Chlorambucil       0.947                   ME
VEGF signaling pathway            6-Mercaptopurine   0.932                   ME
Bladder cancer                    Geldanamycin       0.9273                  CO
Thyroid cancer                    Dactinomycin       0.9263                  OV
Apoptosis                         Thymidine          0.923                   CO
Cell cycle                        Tiazofurin         0.919                   LC
Drug metabolism - other enzymes   Daunorubicin       0.9157                  LC
VEGF signaling pathway            Lomustine          0.9103                  RE
Focal adhesion                    Geldanamycin       0.91                    BR
Endometrial cancer                Doxorubicin        0.909                   CO
VEGF signaling pathway            Quinacrine         0.9023                  RE
Base excision repair              Decitabine         0.9017                  LC

BR, breast; CNS, central nervous system; CO, colon; LC, non-small cell lung cancer; LE, leukemia; ME, melanoma; OV, ovarian; RE, renal.

Fig. 7.

Cell line specificity of the ‘chlorambucil’—‘melanoma pathway’ association

We further investigated the loading matrices W1 and W2 for the ‘melanoma pathway’. Because the sign of a factor and of its loadings may flip between Gibbs sampling iterations, we calculated the posterior mean of the absolute value of each entry of W1 and W2 after the burn-in period. Figure 8 shows heatmaps of the estimated loadings of the factor ‘melanoma pathway’ on its 63 associated genes (left) and on the 101 drugs (right). When the analysis is applied to the ME panel, the factor ‘melanoma pathway’ has significant loadings on several drugs; these drug-pathway associations are much less evident when the analysis is performed on the other cell line panels.
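The reason for averaging absolute values can be seen in a small Python sketch. It is an illustration only, not code from the iFad R package: the function names are hypothetical, and the toy chain (one burn-in draw followed by four draws of a 2 x 1 loading matrix whose sign alternates) is made up for the example.

```python
def posterior_mean(samples, burn_in, transform=lambda x: x):
    """Entry-wise mean of transform(W) over the draws kept after burn-in.

    `samples` is a list of matrices (lists of rows), one per Gibbs iteration.
    """
    kept = samples[burn_in:]
    if not kept:
        raise ValueError("burn_in discards every sample")
    n_rows, n_cols = len(kept[0]), len(kept[0][0])
    return [
        [sum(transform(w[i][j]) for w in kept) / len(kept) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def posterior_mean_abs(samples, burn_in):
    """Posterior mean of |W|, which is unaffected by sign flips of a factor."""
    return posterior_mean(samples, burn_in, transform=abs)


# Made-up toy chain: the first draw is burn-in, then the factor's sign alternates.
TOY_CHAIN = [
    [[9.0], [9.0]],
    [[1.0], [-0.5]],
    [[-1.0], [0.5]],
    [[1.0], [-0.5]],
    [[-1.0], [0.5]],
]

plain = posterior_mean(TOY_CHAIN, burn_in=1)
absolute = posterior_mean_abs(TOY_CHAIN, burn_in=1)
print("plain mean:", plain)
print("mean of absolute values:", absolute)
```

In this toy chain the plain posterior mean cancels to zero even though every retained draw has non-zero loadings, while the mean of absolute values recovers their magnitude. The price is that the direction of each loading is no longer available from the summary.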

Fig. 8.

Posterior mean of the absolute value of matrices W1 and W2 (only showing the result corresponding to the ‘melanoma pathway’) for the NCI-60 data analysis, plotted by each cell line panel

4 DISCUSSION

Drug-target identification is an important problem in translational bioinformatics, as well as a crucial step in the early stage of drug discovery and development. Although many high-throughput technologies exist for molecular phenotype profiling, performing knowledge-based, informative data integration remains a major challenge. In this article, we have proposed a Bayesian sparse factor analysis model, iFad, for the joint analysis of gene expression and drug sensitivity profiles measured on the same set of cell lines. The aim is to identify the target biological pathways of drugs whose mechanism of action is unclear. The model allows natural incorporation of prior knowledge about the connectivity structure of biological pathways (e.g. KEGG pathways), and simultaneously relates the underlying pathway activity to both gene expression levels and drug response. Because of this sparse formulation, the sample size needed to achieve satisfactory inference results can be much smaller than the number of features in either dataset. We demonstrated the performance of iFad first using simulation and then on the NCI-60 datasets. The real data analysis shows that our method is able to identify many cancer type-specific drug-pathway associations. One direction of great interest for future study is speeding up the computation, since MCMC methods are usually time-consuming when applied to high-dimensional inference.

Joint modeling of expression profiles and drug-related data represents an increasingly important and popular trend. Besides the bi-clustering method ISA mentioned in Section 1 (Kutalik et al., 2008), another seminal work in this field (Chang et al., 2005) used Bayesian networks to model gene–drug dependency, also on the NCI-60 data. Because of the computational constraints of Bayesian network models, extensive feature selection was performed before network inference. A more recent work (Chen et al., 2009) developed a linear regression model that integrates genotype and gene expression data generated under drug-free conditions in yeast segregants to predict the response to various drugs. From a statistical point of view, joint analysis of paired datasets can be achieved using a number of techniques, such as canonical correlation, bipartite graph inference and model-based clustering. As more types of high-throughput datasets become available for the same panel of samples, novel statistical methods for knowledge-guided combined analysis are greatly needed.

Funding: National Institutes of Health (GM59507 to H.Z. and R21-GM084008 to Ning Sun).

Conflict of Interest: None declared.

Supplementary Material

Supplementary Data

REFERENCES

  1. Ahmed J., et al. CancerResource: a comprehensive database of cancer-relevant proteins and compound interactions supported by experimental knowledge. Nucleic Acids Res. 2011;39:D960–D967. doi: 10.1093/nar/gkq910.
  2. Bader G.D., et al. Pathguide: a pathway resource list. Nucleic Acids Res. 2006;34:D504–D506. doi: 10.1093/nar/gkj126.
  3. Boyce S.E., et al. Predicting ligand binding affinity with alchemical free energy methods in a polar model binding site. J. Mol. Biol. 2009;394:747–763. doi: 10.1016/j.jmb.2009.09.049.
  4. Bussey K.J., et al. Integrating data on DNA copy number with gene expression levels and drug sensitivities in the NCI-60 cell line panel. Mol. Cancer Ther. 2006;5:853–867. doi: 10.1158/1535-7163.MCT-05-0155.
  5. Campillos M., et al. Drug target identification using side-effect similarity. Science. 2008;321:263–266. doi: 10.1126/science.1158140.
  6. Chang J.H., et al. Bayesian network learning with feature abstraction for gene-drug dependency analysis. J. Bioinform. Comput. Biol. 2005;3:61–77. doi: 10.1142/s0219720005000874.
  7. Chen B.J., et al. Harnessing gene expression to identify the genetic basis of drug resistance. Mol. Syst. Biol. 2009;5:310. doi: 10.1038/msb.2009.69.
  8. Chen J., et al. Genomic profiling of 766 cancer-related genes in archived esophageal normal and carcinoma tissues. Int. J. Cancer. 2008;122:2249–2254. doi: 10.1002/ijc.23397.
  9. Czodrowski P., et al. Computational approaches to predict drug metabolism. Expert Opin. Drug Metab. Toxicol. 2009;5:15–27. doi: 10.1517/17425250802568009.
  10. Davis A.P., et al. Comparative toxicogenomics database: a knowledgebase and discovery tool for chemical-gene-disease networks. Nucleic Acids Res. 2009;37:D786–D792. doi: 10.1093/nar/gkn580.
  11. Ecker G.F., et al. Computational models for prediction of interactions with ABC-transporters. Drug Discov. Today. 2008;13:311–317. doi: 10.1016/j.drudis.2007.12.012.
  12. Gharib S.A., et al. Computational identification of key biological modules and transcription factors in acute lung injury. Am. J. Respir. Crit. Care Med. 2006;173:653–658. doi: 10.1164/rccm.200509-1473OC.
  13. Hernandez-Boussard T., et al. The pharmacogenetics and pharmacogenomics knowledge base: accentuating the knowledge. Nucleic Acids Res. 2008;36:D913–D918. doi: 10.1093/nar/gkm1009.
  14. Hopkins A.L., Groom C.R. The druggable genome. Nat. Rev. Drug Discov. 2002;1:727–730. doi: 10.1038/nrd892.
  15. Ikediobi O.N., et al. Mutation analysis of 24 known cancer genes in the NCI-60 cell line set. Mol. Cancer Ther. 2006;5:2606–2612. doi: 10.1158/1535-7163.MCT-06-0433.
  16. Irwin J.J., et al. Automated docking screens: a feasibility study. J. Med. Chem. 2009;52:5712–5720. doi: 10.1021/jm9006966.
  17. Kanehisa M., et al. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res. 2010;38:D355–D360. doi: 10.1093/nar/gkp896.
  18. Kolb P., et al. Docking and chemoinformatic screens for new ligands and targets. Curr. Opin. Biotechnol. 2009;20:429–436. doi: 10.1016/j.copbio.2009.08.003.
  19. Kuhn M., et al. STITCH: interaction networks of chemicals and proteins. Nucleic Acids Res. 2008;36:D684–D688. doi: 10.1093/nar/gkm795.
  20. Kutalik Z., et al. A modular approach for integrative analysis of large-scale gene-expression and drug-response data. Nat. Biotechnol. 2008;26:531–539. doi: 10.1038/nbt1397.
  21. Lamb J., et al. The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science. 2006;313:1929–1935. doi: 10.1126/science.1132939.
  22. Meng J., et al. Bayesian non-negative factor analysis for reconstructing transcription factor mediated regulatory networks. Proteome Sci. 2011;9(Suppl 1):S9. doi: 10.1186/1477-5956-9-S1-S9.
  23. Nigsch F., et al. Computational toxicology: an overview of the sources of data and of modelling methods. Expert Opin. Drug Metab. Toxicol. 2009;5:1–14. doi: 10.1517/17425250802660467.
  24. Pournara I., Wernisch L. Factor analysis for gene regulatory networks and transcription factor activity profiles. BMC Bioinformatics. 2007;8:61. doi: 10.1186/1471-2105-8-61.
  25. Pujol A., et al. Unveiling the role of network and systems biology in drug discovery. Trends Pharmacol. Sci. 2010;31:115–123. doi: 10.1016/j.tips.2009.11.006.
  26. Russ A.P., Lampel S. The druggable genome: an update. Drug Discov. Today. 2005;10:1607–1610. doi: 10.1016/S1359-6446(05)03666-4.
  27. Schadt E.E., et al. A network view of disease and compound screening. Nat. Rev. Drug Discov. 2009;8:286–295. doi: 10.1038/nrd2826.
  28. Shankavaram U.T., et al. Transcript and protein expression profiles of the NCI-60 cancer cell panel: an integromic microarray study. Mol. Cancer Ther. 2007;6:820–832. doi: 10.1158/1535-7163.MCT-06-0650.
  29. Shankavaram U.T., et al. CellMiner: a relational database and query tool for the NCI-60 cancer cell lines. BMC Genomics. 2009;10:277. doi: 10.1186/1471-2164-10-277.
  30. Sharma S.V., et al. Cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents. Nat. Rev. Cancer. 2010;10:241–253. doi: 10.1038/nrc2820.
  31. Sharp K., et al. A comparison of inference in sparse factor analysis. Submitted to the J. Mach. Learn. Res. 2010 September.
  32. Shoemaker R.H. The NCI60 human tumour cell line anticancer drug screen. Nat. Rev. Cancer. 2006;6:813–823. doi: 10.1038/nrc1951.
  33. Staunton J.E., et al. Chemosensitivity prediction by transcriptional profiling. Proc. Natl Acad. Sci. USA. 2001;98:10787–10792. doi: 10.1073/pnas.191368598.
  34. West M. Bayesian factor regression models in the “large p small n” paradigm. In: Bernardo J.M., et al., editors. Bayesian Statistics. Vol. 7. Oxford, UK: Oxford University Press; 2003. pp. 733–742.
  35. Wishart D.S., et al. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008;36:D901–D906. doi: 10.1093/nar/gkm958.
  36. Yeh H.Y., et al. Identifying significant genetic regulatory networks in the prostate cancer from microarray data based on transcription factor analysis and conditional independency. BMC Med. Genomics. 2009;2:70. doi: 10.1186/1755-8794-2-70.
  37. Yeh P., et al. Functional classification of drugs by properties of their pairwise interactions. Nat. Genet. 2006;38:489–494. doi: 10.1038/ng1755.
  38. Yildirim M.A., et al. Drug-target network. Nat. Biotechnol. 2007;25:1119–1126. doi: 10.1038/nbt1338.
  39. Yu T., Li K.C. Inference of transcriptional regulatory network by two-stage constrained space factor analysis. Bioinformatics. 2005;21:4033–4038. doi: 10.1093/bioinformatics/bti656.
  40. Zavodszky M.I., Kuhn L.A. Side-chain flexibility in protein-ligand binding: the minimal rotation hypothesis. Protein Sci. 2005;14:1104–1114. doi: 10.1110/ps.041153605.
  41. Zhu F., et al. Therapeutic target database update 2012: a resource for facilitating target-oriented drug discovery. Nucleic Acids Res. 2011;40:D1128–D1136. doi: 10.1093/nar/gkr797.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
