Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2017 Jan 5;33(9):1362–1369. doi: 10.1093/bioinformatics/btw836

Modeling gene-wise dependencies improves the identification of drug response biomarkers in cancer studies

Olga Nikolova 1,, Russell Moser 2, Christopher Kemp 2, Mehmet Gönen 1,3, Adam A Margolin 1,
Editor: Bonnie Berger
PMCID: PMC5860077  PMID: 28082455

Abstract

Motivation

In recent years, vast advances in biomedical technologies and comprehensive sequencing have revealed the genomic landscape of common forms of human cancer in unprecedented detail. The broad heterogeneity of the disease calls for rapid development of personalized therapies. Translating the readily available genomic data into useful knowledge that can be applied in the clinic remains a challenge. Computational methods are needed to aid these efforts by robustly analyzing genome-scale data from distinct experimental platforms for prioritization of targets and treatments.

Results

We propose a novel, biologically motivated, Bayesian multitask approach, which explicitly models gene-centric dependencies across multiple and distinct genomic platforms. We introduce a gene-wise prior and present a fully Bayesian formulation of a group factor analysis model. In supervised prediction applications, our multitask approach leverages similarities in response profiles of groups of drugs that are more likely to be related to true biological signal, which leads to more robust performance and improved generalization ability. We evaluate the performance of our method on molecularly characterized collections of cell lines profiled against two compound panels, namely the Cancer Cell Line Encyclopedia and the Cancer Therapeutics Response Portal. We demonstrate that accounting for the gene-centric dependencies enables leveraging information from multi-omic input data and improves prediction and feature selection performance. We further demonstrate the applicability of our method in an unsupervised dimensionality reduction application by inferring genes essential to tumorigenesis in the pancreatic ductal adenocarcinoma and lung adenocarcinoma patient cohorts from The Cancer Genome Atlas.

Availability and Implementation

The code for this work is available at https://github.com/olganikolova/gbgfa

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

A fundamental problem in computational cancer biology is to predict drug response in a new sample from its genomic make up and further identify a subset of genes most predictive of drug response (signature). From an application stand point, this problem is important because it can help identify subpopulations of patients that are responsive to a particular therapy and suggest novel treatment targets and pathways for co-treatment with existing regimens. Cancer is a vastly heterogeneous disease. Studies in solid tumors have shown that heterogeneity in response to therapies is partially determined by molecular differences between tumors (Collisson et al., 2011). Stratification of drugs to tumor subtypes in which they are selectively effective is shown to improve treatment outcomes (Druker et al., 2001; Lynch et al., 2004; Slamon et al., 2001). Thus, ability to predict therapeutic efficacy from molecular data is critical.

Recent advances in biomedical and sequencing technologies have revealed the genomic landscape of common forms of human cancer in unprecedented detail. Of the genes that are able to promote or ‘drive’ tumorigenesis when altered, for most cancers, it is believed that there exist a small number of ‘mountains’ (genes altered at high frequencies across the population), and a much larger number of ‘hills’ (much less frequently altered genes). For example, in pancreatic ductal adenocarcinoma, alteration frequencies decrease at nearly exponential rate for the top five most frequently mutated genes from 96 to 8% (Biankin et al., 2013). While mountains are relatively easy to statistically identify their number (140) is believed to be reaching saturation (Vogelstein et al., 2013), and identifying the more prevalent driver events altered at lower frequencies is much more challenging. To make matters worse, the rate of mutation in normal cells is indicated to vary by 100-fold genome-wide (Michaelson et al., 2012). This challenging mutational profile of the cancer genome calls for approaches that can integrate data from different genomic platforms to more reliably prioritize driver genes with lower alteration frequencies.

1.1 Previous work

Various computational methods have been applied to tackle both the prediction (predict drug responses from genomic data) and variable selection (identify genomic features associated with drug sensitivity or resistance) questions. The use of multivariate linear regression, based on least squares formulation, has been extensive because it is readily amendable to regularization, which accounts for model complexity and introduces sparsity (Tibshirani, 1996; Zou and Hastie, 2003). While regularized regression methods like Elastic Net have been widely used in cancer studies (Barretina et al., 2013; Garnett et al., 2013), they do not allow for flexible integration of genomic features coming from distinct platforms, possibly with very different underlying distributions (e.g. copy number variation and RNA-sequencing count data). Canonical correlation analysis (CCA) is a technique that allows for such data integration. CCA finds a lower dimensional representation of two sets of multidimensional variables in which they are maximally correlated by solving a generalized eigenvalue problem (Hotteling, 1936; Ray et al., 2014; Sun et al., 2011; Tucker, 1958). Further, regularized CCA extensions have also been proposed (Sun et al., 2011; Witten et al., 2009), but these approaches remain limited to two input sets (views). To overcome this limitation, the multiple CCA model has been proposed (Witten and Tibshirani, 2009). A fully Bayesian formulation of group factor analysis (BGFA) has also been proposed, which extends beyond two views (Klami et al., 2013) and simultaneously learns the structure of the model. However, in all of the above described methods, measurements of the same gene’s genomic aberrations (e.g. gene expression and copy number) are assumed to be independent from one another. A handful of methods attempt to explicitly model such dependencies by introducing very specific regulatory assumptions (Vaske et al., 2010; Wang et al., 2013). For example, in PARADIGM, a directed causal relationship is postulated to encode the central dogma as well as biological pathway information (Vaske et al., 2010). Alternatively, Wang et al. (2013) adopted a strong regulatory assumption for the role of methylation and their approach is not applicable to studies where methylation data are not available.

1.2 Contributions

We investigate the applicability of the BGFA approach to the cancer biology problems that we set out to solve. We further propose a novel, biologically driven re-parametrization of BGFA, which explicitly models gene-centric dependencies in multi-omic readouts collected for the same genes. Our modified model favors genes that carry information in multiple measurement platforms, where we will more likely see a real biological effect. We extend the fully Bayesian formulation presented in the work of Klami et al. (2013) by introducing a gene-wise prior into a group factor analysis model that learns a lower-dimensional latent representation of the data via shared and private components across input datasets (views). We derive a variational approximation inference algorithm that casts and solves multiple learning tasks simultaneously as one, a property also known as multitask. The multitask property of our approach enables us to leverage similarities in the response profiles of groups of drugs that are more likely to correspond to true biological effects as opposed to experimental noise and off-target effects, which improves generalization ability of the results. Multitask approaches have been shown to improve predictive performance as compared to their single-task counterparts (Gönen and Margolin, 2014; Tsoumakas et al., 2010).

2 Materials

To evaluate the proposed model, we use four different cancer datasets. We first analyze genomically characterized cell lines profiled against two panels of drugs in the Cancer Cell Line Encyclopedia (CCLE) (Barretina et al., 2013) and the Cancer Therapeutics Response Portal (CTRP) (Seashore-Ludlow et al., 2015) studies. We then consider genomically characterized patient samples from the The Cancer Genome Atlas (TCGA) pancreatic ductal adenocarcinoma (PAAD) and lung adenocarcinoma (LUAD) cohorts (Hoadley et al., 2014).

2.1 Cell lines for comparative performance analysis

2.1.1 CCLE

Hybrid-capture sequencing mutation data for 1651 genes and Affymetrix messenger RNA (mRNA) expression data for 18 897 genes were collected in 504 cell lines profiled with 24 compounds (Barretina et al., 2013). Filtering missing values by rows (cell lines) and columns (drugs) in the response matrix and by columns (genes) in feature matrices is necessary to produce complete (fully observed) data. This procedure yielded 18 969 unique genes in 267 cell lines. Drug response in this dataset is given by the half-maximal inhibitory concentration (IC50), which indicates the drug concentration needed to achieve 50% cell death.

2.1.2 CTRP

An expanded set of 886 CCLE cell lines were profiled with a total of 545 compounds (Seashore-Ludlow et al., 2015). Even though overall only 20% of the response matrix data was missing, filtering the response matrix by rows or columns resulted in an empty set. To generate a complete response matrix, we considered all cell lines for which at most T% of compound data was missing, where T varied from 20 to 10 in increments of 0.1 (Supplementary Fig. S1). In this study, we chose to analyze performance using a matrix of 409 cell lines × 98 drugs (T=15.3%). Combining hybrid-capture mutation and mRNA gene expression data with the response matrix, after filtering missing data, yielded 18 969 genes and 409 cell lines. Drug response in this dataset is reported using the area under the drug dose–response curve (AUC).

2.2 Cell lines for feature type analysis

In addition to the data described in Section 2.1, CCLE copy number variation and OncoMap mutation were obtained and analyzed as described above (Barretina et al., 2013). Ten combinations of the four feature matrices were considered as input datasets in addition to the single feature type views.

2.3 Compounds with known targets and biomarkers

A list of CCLE compounds with known biomarkers of drug response was compiled from previously published studies (Supplementary Table 1). The biomarkers for many of the CTRP compounds are not known and we considered their targets as a proxy instead (Supplementary Table 2). This list was obtained from the CTRP portal (Seashore-Ludlow et al., 2015).

2.4 Patient data

2.4.1 PAAD

PAAD RNA-seq transcript data (178 samples × 20 501 genes) and mutation data (144 samples × 11 520 genes) were obtained from the TCGA data portal (https://tcga-data.nci.nih.gov/docs/publications/tcga/, date accessed: March 15, 2015). Combining these data and filtering missing values yielded 21 132 unique genes and 132 patients.

2.4.2 LUAD

Platform corrected LUAD RNA-seq transcript data (412 samples × 20 501 genes) and mutation data (236 samples × 15 637 genes) was obtained from Synapse (https://www.synapse.org/#!Synapse:syn395683) as part of the Pan-Cancer 12 TCGA data (Hoadley et al., 2014). Combining these data as described above yielded 20 608 unique genes and 165 patients.

3 Methods

3.1 Notation and preliminaries

We consider M different sets of measurements X1,X2,,XM which we will interchangeably call ‘views’. For each view Xm with Dm features and N observations, we assume an independently and identically distributed sample Xm={xm,n,dmXm|m[1M],n[1N],dm[1,Dm]}, where m, n and dm index the views, observations and features, respectively. The different views usually correspond to measurements from either different technical platforms or quantifications of different biological characteristics, such as gene expression, copy number variation, mutation or cell viability in response to drug treatment. Views can include overlapping sets of features (genes). A feature-wise (gene-wise) prior γG×1 is defined over the set of G unique features across all views. The learning algorithm proceeds to generate the data points of all M views by projecting their latent representations in a shared space YK×N with K hidden components using view-specific projection matrices WmDm×K and the feature-wise prior γ (Fig. 1). We use N(μ,Σ) to denote the normal distribution with mean vector μ and covariance matrix Σ and G(α,β) to denote the Gamma distribution with shape α and scale β hyperparameters. Finally, we use · to denote the expectation.

Fig. 1.

Fig. 1.

Algorithm workflow illustrating a matrix representation of GBGFA and dimensionality of each variable and prior, where h and e denote mutation and gene expression, respectively

3.2 Gene-wise prior Bayesian group factor analysis (GBGFA) model

Figure 2 shows the graphical representation of GBGFA. Its joint likelihood is given by:

p(W1,,WM,X1,,XM,Y,γ,τ)=p(γ)m=1M(p(Wm|γ)p(τm))n=1N(p(yn)p(xm,n|Wm,yn,τm)),

where

Fig. 2.

Fig. 2.

Graphical representation of GBGFA model

xm,n|Wm,yn,τmN(xm,n; Wmyn,τm1I)(m,n),wm,dm|γgN(wm,dm; 0,γg1I)(m,dm),γgG(γg; αγ,βγ)g,τmG(τm; ατ,βτ)m,ynN(yn; 0,I)n.

3.2.1 Inference

We present a deterministic variational approximation inference algorithm, which leads to closed-form update equations due to the conjugate priors. We approximate the full posterior using the following distribution and its factorization:

p(W1,,WM,Y,γ,τ|X1,,XM)q(W1,,WM,Y,γ,τ)=q(W1)q(WM)q(Y)q(γ)q(τ)

where each factor is defined to be from the family of its respective full conditional distribution. Recall that G denotes the total number of unique features across all views. Then, the approximating distribution factorizes as follows:

q(τ)=m=1MG(τm; aτm,bτm),q(γ)=g=1GG(γg; aγg,bγg),q(Y)=n=1NN(yn; 0,I),q(Wm)=dm=1DmN(wm,dm; μwm,dm,Σwm,dm).

Minimizing the Kullback–Leibler divergence between the posterior p(W1,,WM,Y,γ,τ|X1,,XM) and the approximate posterior q(W1,,WM,Y,γ,τ) distributions is equivalent to maximizing the following lower bound derived using Jensen’s inequality:

logp(Y|X1,,XM)logp(W1,,WM,Y,γ,τ|X1,,XM)logq(W1,,WM,Y,γ,τ)=(q).

The algorithm optimizes the lower bound (q) by updating each term of the bound until convergence. Let VgM be the number of views in which the feature with index g in γ appears. Let also ξ(g,i) be a many-to-one mapping function which returns the row dm in projection matrix Wi of the feature with index g in γ. Then the approximate posterior distribution for the feature-wise prior is given as follows:

aγg=αγ+VgMK,bγg=βγ+i=1Vg12(wξ(g,i)wξ(g,i)+tr(Σwξ(g,i))),

where γg=aγg/bγg. The noise precision parameters update is given by:

aτm=ατ+NDm2,bτm=βτ+12dm=1Dmn=1Nxm,dm,n2+12tr[WmWmYY]tr[WmYXm].

The latent variable’s approximate posterior can be updated as follows:

Y=m=1MτmΣYWmXm,ΣY=(I+m=1MτmWmWm)1,

where

WmWm=dm=1Dm(wm,dmwm,dm+Σwm,dm).

For each pair (m,dm), let ϕ(m,dm) denote a many-to-one mapping function, which returns the index g in the prior vector γ of the feature in row dm of the projection matrix Wm. Then, projections are updated as follows:

wm,dm=xm,dmYΣwm,dmτm,Σwm,dm=(γϕ(m,dm)I+τmYY)1,

and Wm=[μwm,1,μwm,2,,μwm,Dm]. The lower bound is given in the Supplementary Materials.

3.2.2 Prediction

Suppose that we wish to predict views Xi1,Xi2,,XiL, where il{1M},L<M given the views Xj1,Xj2,,XjML where {Xi} and {Xj} are mutually exclusive and {i1,i2,iL}{j1,j2,,jML}{1,2,,M}. That is, we are now interested in the predictive distribution p(Xi1,Xi2,,XiL|Xj1,Xj2,,XjML). This distribution cannot be computed in closed form but can be derived using expectations from the variational approximation:

Xi1,Xi2,,XiL|Xj1,Xj2,,XjML=i=i1iLWiYq(Wi1,Wi2,,WiL)q(Y|Wj1,Wj2,,WjML)=i=i1iLWij=j1jMLτjΣjWjXj

where Σj=(I+τjWjWj)1.

3.3 Comparison algorithms

We compare the performance of our proposed model to Elastic Net and BGFA. Elastic Net is a regularized linear regression with L1 and L2 penalties, which does continuous shrinkage, automatic variable selection and is able to select groups of correlated variables (Zou and Hastie, 2003). BGFA is a Bayesian formulation of generic factor analysis model with a specific group-wise sparsity prior, which generalizes the CCA model originally introduced by Hotteling (1936) and limited to exactly two views. This model is also known as ‘inter-battery factor analysis’ (Browne, 1979). Comprehensive overview of both methods is given in the Supplementary Materials.

4 Experimental results

We evaluate the predictive performance of our proposed model in cell lines profiled with the CCLE and CTRP drug panels. We perform feature selection to analyze the model’s ability to recapitulate known drug targets and response biomarkers. We compare the performance of GBGFA with that of Elastic Net (Zou and Hastie, 2003) and BGFA (Klami et al., 2013). Finally, we illustrate our model’s applicability in an unsupervised setting by applying GBGFA to TCGA PAAD and LUAD patient cohorts.

4.1 Predictive performance evaluation

To evaluate the predictive accuracy of the proposed model, we perform 5-fold cross-validation using data from CCLE and CTRP. For each dataset, we partition the data into five mutually exclusive subsets, each containing a nearly equal number of cell lines. Each of the five data subsets is used for testing of the model trained on the remaining four subsets. To achieve robust measurements, we report the average result of 10 random fold splits for each dataset. We calculate the proportion of variance explained (POV) to compare the three methods, i.e. POV=1NRMSE,

NRMSE=(vv^)(vv^)(v11v/N)(v11v/N), where NRMSE is the normalized root mean square error, and v and v^ denote the observed and predicted values. The results of these experiments are summarized in Figure 3. The top and bottom rows correspond to the CCLE and the CTRP panels, respectively. Columns 1 through 3 illustrate pairwise method comparisons. Average values of 10 random data splits into training and testing sets for 5-fold cross-validation are reported. The boxplots illustrate the average proportion of variance explained for all drugs and the dot plots show results for individual drugs. Paired t-test P-value is also reported for each comparison.

Fig. 3.

Fig. 3.

Cross-validation results of drug response prediction. The top and bottom rows correspond to the CCLE and the CTRP panels, respectively. Columns 1 through 3 present the pairwise comparison of BGFA and Elastic Net, GBGFA and Elastic Net, and GBGFA and BGFA, respectively. The boxplots illustrate the average POV over all drugs and the dot-plots show POV for individual drugs

We use the glmnet package to run Elastic Net (Friedman et al., 2010) and the CCAGFA package to run BGFA (Klami et al., 2013). When training Elastic Net, a tuning grid is used to optimize the α and λ parameters such that tested values are from {105,104,,100} for α and {e5,e4,,e0} for λ. For GBGFA and BGFA, the best model is picked of 50 random restarts with hyperparameters αγ=βγ=ατ=βτ=1e-14 and K = 50.

4.2 Prediction as a function of input data size

We further evaluate how decreasing the number of observations affects predictive performance. This question is important because most clinical trials have limited size patient cohorts, and this experiment allows us to investigate method performance in clinically relevant scenarios. For this purpose, for each of the 10 random splits into five partitions, we subset the training set in each experiment by selecting at random 75, 50 and 25% of the input cell lines. The results of these experiments are shown in Supplementary Figures S2 and S3 for CCLE and CTRP, respectively.

4.3 Prediction as a function of input feature type

We analyze the relative importance of each molecular feature type in the accuracy of predictions derived from GBGFA. For each of the two drug panels, CCLE and CTRP, and each random data split, we considered the input feature types gene expression (e), copy number (c), mutation measured by OncoMap (o), and mutation measured by hybrid-capture sequencing (h), and their respective combinations, with the corresponding POV as response. We fitted a random effects model using the lme4 R package (Bates et al., 2015) and calculated the percent variance explained by each factor and the residual. All non-zero factors are reported in Figure 4.

Fig. 4.

Fig. 4.

Analysis of predictive performance as a function of input data feature type

4.4 Feature selection performance evaluation

We evaluate our model’s ability to recapitulate known drug biomarkers (CCLE) and drug targets (CTRP) on a subset of compounds for which the targets or biomarkers are known (Supplementary Tables 1 and 2). For each drug panel, we learn models using combinations of input data comprised of gene expression (e), copy number (c), OncoMap mutation (o), and hybrid-capture sequencing mutation (h) data. We then perform feature selection by computing gene-centric scores.

In BGFA and GBGFA, for each drug i and view m, gene scores si,m are computed as the product of the view-specific projection matrix and the drug response vector si,m=Wmwi,dr and standardized. An overall gene score for each drug i and gene g is computed by optimizing the absolute value of view-specific standardized gene scores si,g*=argmaxm|si,m,g|. In Elastic Net, no standardization within views is required as different views are concatenated into a single input matrix and normalized together in the pre-processing step. To ensure fair comparison with BGFA and GBGFA in terms of total number of genes ranked, we similarly compute a gene-centric score by optimizing the absolute value of all available feature type weights per gene.

The computed gene scores are used to rank genes in decreasing order of importance (the first rank being the most relevant). We then consider how well each method ranks the biomarker or target of each compound across all input sets. Figure 5 summarizes the results: each bar corresponds to the performance of a given method on a given input dataset. Bar height indicates the number of drugs for which a given method scored better than all others. A total of 16 CCLE and 80 CTRP drugs had known biomarkers and targets, respectively.

Fig. 5.

Fig. 5.

Feature selection results for CCLE compounds (top) with known biomarkers and CTRP compounds (bottom) with known targets

4.5 Unsupervised learning in TCGA patient cohorts

We performed an exploratory analysis of TCGA PAAD and LUAD patient cohorts to illustrate the applicability of our model in an unsupervised scenario. For each cohort we train GBGFA on the mutation and gene expression data for all patients. We consider the overlapping genes between the mutation and expression sets (N  > 11 000) and compute gene-centric scores sg for each gene g as the sum over hidden components of the product of expression and mutation weights, i.e. sg=k=1K=200m=12wm,g,k, where K is the number of hidden components of the learned model. We use the scores to rank genes and consider the top and bottom 100 genes (<1%). We filter the selected genes by comparing their expression levels. A gene was deemed dysregulated if it was over- or under-expressed in samples carrying a mutation in that gene, as compared to the wild type samples (P <  0.05).

We performed functional annotation and pathway enrichment analysis of the identified genes using the gProfileR package version 0.5.3 with FDR as the multiple test correction method (Reimand et al., 2011), which integrates data from the Gene Ontology (Ashburner et al., 2000), Corum (Ruepp et al., 2009), KEGG (Kanehisa et al., 2013), and REACTOME (Croft et al., 2013) databases. These results are summarized in Supplementary Tables 3 and 4. The prioritized and dysregulated genes are shown in Figure 6.

Fig. 6.

Fig. 6.

Circos plots of TCGA data analysis results. Predicted essential genes are shown after learning models using gene expression (inner track) and mutation (outer track) data

5 Discussion

Predictive performance: We compared the predictive performance of GBGFA, BGFA and Elastic Net using 5-fold cross-validation. We reported the average POV on results from 10 random splits of the data into training and testing sets. The results are summarized in Figure 3 and the mean NRMSE and variance for each drug and test case are shown in Supplementary Tables 5 and 6. This comparison revealed that in CCLE, BGFA performed significantly better than Elastic Net (P = 0.0012); GBGFA improved upon Elastic Net even further (P = 0.0006). In CTRP, GBGFA performed significantly better than both Elastic Net and BGFA (P <1e-5). Our observations indicate that the integration of a gene-wise prior improved predictive performance as compared to Elastic Net and BGFA.

Predictive performance as a function of input data size: We next analyzed how decreasing the size of training input data affects predictive performance by observing 75, 50 and 25% of the complete input data for each random split. This analysis revealed that both GBGFA and BGFA tend to do better than Elastic Net as the input data size decreases (Supplementary Figs. S2 and S3). In CCLE, GBGFA achieved lower prediction error for more drugs than BGFA in the 100, 75 and 25% cases and tied in the 50% case (Supplementary Fig. S2, bottom row). In CTRP, GBGFA performed statistically better than BGFA (P3e-5) in all four cases (Supplementary Fig. S3, bottom row). Thus, we show consistent predictive performance improvement even when the sample size decreases.

Predictive performance as a function of input feature type: Finally, we analyzed predictive performance as a function of the input data feature type (Fig. 4). This analysis revealed that, consistent with previous studies, the majority of the variance, namely 84.5% in CCLE and 90.13% in CTRP, was explained by gene expression (Jang et al., 2014; Yuan et al., 2014). However, our results show that additional 15.3% in CCLE and 8.53% of the variance were captured by the combination of gene expression and copy number; in CTRP, a cumulative 9.62% of variance was explained by combinations of input feature matrices. In both CCLE and CTRP, <1% was attributed to the residual. We thus believe that the integration of the gene-wise prior allows us to leverage information in the combinations of molecular feature types that may otherwise not be used.

Feature selection: We compared the ability to recapitulate known drug targets and biomarkers of GBGFA, BGFA and Elastic Net in the same two drug panels, CCLE and CTRP. In CCLE, biomarkers for 16 of the total 24 drugs were compiled (Supplementary Table 1), while in CTRP, we used targets as a proxy to biomarkers instead. Of the 98 drugs considered, 80 had known targets (Supplementary Table 2). The three methods were compared on each input dataset and for each compound (Fig. 5). Our results indicate that GBGFA ranked CCLE biomarkers better than both other methods in 9 out of 14 cases, as compared to 1 for BGFA and none for Elastic Net. In the remaining four cases, we observed one tie with Elastic Net and three ties with BGFA. When ranking CTRP targets, GBGFA outranked BGFA and Elastic Net, respectively, by 11:3:0 with no observed ties. The overall relatively better performance of the two Bayesian methods as compared to Elastic Net in CTRP could be due to the multitask property that both GBGFA and BGFA employ and the ability to borrow information across compounds with similar response profiles could become more advantageous in larger drug panels (Gönen and Margolin, 2014). The overall superior performance of GBGFA over BGFA illustrates that the gene-wise prior captures relevant biological information and aids the model’s ability to better rank essential genes for drug response biomarkers.

Unsupervised learning: We applied our model in an unsupervised scenario to prioritize genes essential for the tumorigenesis of PAAD and LUAD in TCGA patient data. Figure 6 illustrates the predicted essential genes, based on gene expression and mutation data, in their respective physical locations on the chromosome. While somatic mutations in the KRAS oncogene and tumor suppressors, TP53, CDKN2A, SMAD4 are signature genomic events in PAAD, very little is known about additional genomic events that occur at lower frequencies and how these genes might reveal important pathways that impinge on pancreatic cancer progression. Our analysis identified the PBAF (Polybromo- and BAF containing complex) chromatin-remodeling complex, chromosomal organization and modifying enzymes (KDM3B, HIST1H2BB, PRMT1, NCOR2, SMARCC2, ARID2 and BRD8), a component of the polycomb repressive complex (RING1) and folate metabolism (DHFR) as important genes in the epigenetic etiology of PAAD (Locasale, 2013; Sanchez et al., 2012; Satijn and Otte, 1999; Wilson and Roberts, 2011 ). MicroRNAs are key regulators of oncogenesis and analysis of prioritized genes, which were mutated in at least three patients, identified a potentially important role for miR-30d, reduced expression of which is associated with poor prognosis in PAAD (Jamieson et al., 2012). In addition, high serum levels of IL-1B are associated with poor prognosis and can predict the efficacy of gemcitabine in patients with advanced pancreatic cancer (Mitsunaga et al., 2013).

When applied to the LUAD cohort, this analysis identified known tumor suppressors (SMAD4 and SMARCA4) (Medina et al., 2008; Orvis et al., 2014). Mutations and/or loss of heterozygosity (LOH) in these genes are common somatic events in the progression of LUAD which can also dictate response to therapeutics (Haeger et al., 2016). Massively parallel sequencing efforts on LUAD have recently identified RBM10 as a recurrent somatic mutation in non-small cell lung carcinoma (NSCLC) (Imielinski et al., 2012). RBPJ is implicated in the NOTCH mediated generation and maintenance of Kras (G12V)-driven NSCLC (Maraver et al., 2012), and hypermethylation of the GATA5 promoter correlates with clinicopathological parameters in lung cancer (Guo et al., 2004).

6 Conclusions and future work

We developed a novel Bayesian formulation of group factor analysis model with gene-centric prior to explicitly model the dependencies between genomic alterations of the same gene. Future work will extend our model to integrate prior information on functional (e.g. pathway) relationships between genes, as incorporation of such information has been demonstrated to be beneficial for drug sensitivity prediction (Costello et al., 2014; Jang et al., 2015). We are exploring modeling information from molecular pathway databases via a hierarchical prior over genes. We are also working on model extensions to enable learning in the presence of missing values.

Our current results show that the GBGFA model enables leveraging information from combinations of genomic data which improves the predictive performance and feature selection as compared to Elastic Net and BGFA. It also identifies biologically relevant genes for both PAAD and LUAD and should prove to be broadly applicable as an analytical tool for the covariate structure of biological data with high dimensionality and complexity.

Supplementary Material

Supplementary Data

Acknowledgements

This work was partially supported by National Institute of Health/National Cancer Institute grants U54CA149237 and U01CA176303.

Funding

The work of Mehmet Gönen was supported by the Turkish Academy of Sciences in the framework of the Young Scientist Award Program

Conflict of Interest: none declared.

References

  1. Ashburner M. et al. (2000) Gene Ontology: tool for the unification of biology. Nature Genetics, 25, 25–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Barretina J. et al. (2013) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483, 603–307. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bates D. et al. (2015) lme4: linear mixed-effects models using Eigen and S4, 2014. R package version.
  4. Biankin A.V. et al. (2013) Pancreatic cancer genomes reveal aberrations in axon guidance pathway genes. Nature, 491, 399–405. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Browne M.W. (1979) The maximum-likelihood solution in inter-battery factor analysis. Brit. J. Math. Stat. Psychol., 37, 75–86. [Google Scholar]
  6. Collisson E.A. et al. (2011) Subtypes of pancreatic ductal adenocarcinoma and their differing responses to therapy. Nat. Med., 17, 500–503. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Costello J.C. et al. (2014) A community effort to assess and improve drug sensitivity prediction algorithms. Nat. Biotechnol., 32, 1202–1212. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Croft D. et al. (2013) The Reactome pathway knowledgebase. Nucleic Acids Res., 42, D472–D477. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Druker B.J. et al. (2001) Efficacy and safety of a specific inhibitor of the BCR-ABL tyrosine kinase in chronic myeloid leukemia. New Eng J Med., 344, 1031–1037. [DOI] [PubMed] [Google Scholar]
  10. Friedman J. et al. (2010) Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33, 1–22. [PMC free article] [PubMed] [Google Scholar]
  11. Garnett M.J. et al. (2013) Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature, 483, 570–575. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Gönen M., Margolin A.A. (2014) Drug susceptibility prediction against a panel of drugs using kernelized Bayesian multitask learning. Bioinformatics, 30, i556–i563. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Guo M. et al. (2004) Hypermethylation of the GATA genes in lung cancer. Clin. Cancer Res., 10, 7917–7924. [DOI] [PubMed] [Google Scholar]
  14. Haeger S.M. et al. (2016) Smad4 loss promotes lung cancer formation but increases sensitivity to DNA topoisomerase inhibitors. Oncogene, 35, 577–586. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Hoadley K.A. et al. (2014) Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell, 158, 929–944. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Hotteling H. (1936) Relations between two sets of variables. Biometrica, 28, 312–377. [Google Scholar]
  17. Imielinski M. et al. (2012) Mapping the hallmarks of lung adenocarcinoma with massively parallel sequencing. Cell, 150, 1107–1120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Jamieson N.B. et al. (2012) MicroRNA molecular profiles associated with diagnosis, clinicopathologic criteria, and overall survival in patients with resectable pancreatic ductal adenocarcinoma. Clin. Cancer Res., 18, 534–545. [DOI] [PubMed] [Google Scholar]
  19. Jang I.S. et al. (2014) Systematic assessment of analytical methods for drug sensitivity prediction from cancer cell line data. Pac Symp Biocomput, 63–74 [PMC free article] [PubMed] [Google Scholar]
  20. Jang I.S. et al. (2015) Stepwise group sparse regression (SGSR): gene-set-based pharmacogenomic predictive models with stepwise selection of functional priors. Pacific Symp. Biocomput., 32–43. [PMC free article] [PubMed] [Google Scholar]
  21. Kanehisa M. et al. (2013) Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res., 42, D199–D205. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Klami A. et al. (2013) Bayesian canonical correlation analysis. J. Mach. Learn. Res., 14, 965–1003. [Google Scholar]
  23. Locasale J.W. (2013) Serine, glycine and one-carbon units: cancer metabolism in full circle. Nat Rev Cancer, Advance Online Publication, 572–583. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Lynch T.J. et al. (2004) Activating mutations in the epidermal growth factor receptor underlying responsiveness of nonsmall-cell lung cancer to gefitinib. New Eng. J. Med., 350, 2129–2139. [DOI] [PubMed] [Google Scholar]
  25. Maraver A. et al. (2012) Therapeutic effect of γ-secretase inhibition in KrasG12V-driven non-small cell lung carcinoma by derepression of DUSP1 and inhibition of ERK. Cancer Cell, 22, 222–234. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Medina P.P. et al. (2008) Frequent BRG1/SMARCA4-inactivating mutations in human lung cancer cell lines. Hum. Mutat., 29, 617–622. [DOI] [PubMed] [Google Scholar]
  27. Michaelson J.J. et al. (2012) Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Cell, 151, 1431–1442. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Mitsunaga S. et al. (2013) Serum levels of IL-6 and IL-1 can predict the efficacy of gemcitabine in patients with advanced pancreatic cancer. Br. J. Cancer, 108, 2063–2069. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Orvis T. et al. (2014) BRG1/SMARCA4 inactivation promotes non-small cell lung cancer aggressiveness by altering chromatin organization. Cancer Res., 74, 6486–6498. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Ray P. et al. (2014) Bayesian joint analysis of heterogeneous genomics data. Bioinformatics, 30, 1370–1376. [DOI] [PubMed] [Google Scholar]
  31. Reimand J. et al. (2011) g:Profiler—a web server for functional interpretation of gene lists (2011 update). Nucleic Acids Res., 39, W307–W315. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Ruepp A. et al. (2009) CORUM: the comprehensive resource of mammalian protein complexes–2009. Nucleic Acids Res., 38, D497–D501. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Sanchez G.V. et al. (2012) Is dietary fat, vitamin D, or folate associated with pancreatic cancer? Mol. Carcinog., 51, 119–127. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Satijn D.P., Otte A.P. (1999) RING1 interacts with multiple Polycomb-group proteins and displays tumorigenic activity. Mol. Cell. Biol., 19, 57–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Seashore-Ludlow B. et al. (2015) Harnessing connectivity in a large-scale small-molecule sensitivity dataset. Cancer Discov., 5, 1210–1223. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Slamon D.J. et al. (2001) Use of chemotherapy plus a monoclonal antibody against HER2 for metastatic breast cancer that overexpresses HER2. New Eng. J. Med., 344, 783–792. [DOI] [PubMed] [Google Scholar]
  37. Sun L. et al. (2011) Canonical correlation analysis for multilabel classification: a least-squares formulation, extensions, and analysis. IEEE Trans. Pattern Anal. Mach. Intell., 33, 194–200. [DOI] [PubMed] [Google Scholar]
  38. Tibshirani R. (1996) Regression shrinkage and selection via the Lasso on JSTOR. J. Roy. Stat. Soc. Ser. B, 58, 267–288. [Google Scholar]
  39. Tsoumakas G. et al. (2010) Mining multi-label data In: Maimon O., Rokach L. (eds), Data Mining and Knowledge Discovery Handbook. Springer, USA, pp. 667–685. [Google Scholar]
  40. Tucker L.R. (1958) An inter-battery method of factor analysis. Psychometrika, 23, 111–136. [Google Scholar]
  41. Vaske C.J. et al. (2010) Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26, i237–i245. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Vogelstein B. et al. (2013) Cancer genome landscapes. Science, 339, 1546–1558. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Wang W. et al. (2013) iBAG: integrative Bayesian analysis of high-dimensional multiplatform genomics data. Bioinformatics, 29, 149–159. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Wilson B.G., Roberts C.W.M. (2011) SWI/SNF nucleosome remodellers and cancer. Nat. Rev. Cancer, 11, 481–492. [DOI] [PubMed] [Google Scholar]
  45. Witten D.M. et al. (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10, 515–534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Witten D.M., Tibshirani R.J. (2009) Extensions of sparse canonical correlation analysis with applications to genomic data. Stat. Appl. Genet. Mol. Biol., 8, 1–27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Yuan Y. et al. (2014) Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nat. Biotechnol., 32, 644–652. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Zou H., Hastie T. (2003) Regularization and variable selection via the elastic net. J. Roy. Stat. Soc.: Ser. B (Statistical Methodology), 67, 301–320. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES