Author manuscript; available in PMC: 2017 Mar 8.
Published in final edited form as: Conf Proc IEEE Eng Med Biol Soc. 2015;2015:6514–6518. doi: 10.1109/EMBC.2015.7319885

Non-Linear Bayesian Framework to Determine the Transcriptional Effects of Cancer-Associated Genomic Aberrations

Abolfazl Razi 1, Nilanjana Banerjee 2, Nevenka Dimitrova 2, Vinay Varadan 1
PMCID: PMC5341149  NIHMSID: NIHMS827711  PMID: 26737785

Abstract

While the tumorigenic effects of specific recurrent mutations in known cancer driver genes are well characterized, not much is known about the functional relevance of the vast majority of recurrent mutations observed across cancers. Prior studies have attempted to identify functional genomic aberrations by integrating multi-omics measurements in cancer samples with community-curated biological pathway networks. However, the majority of these approaches overlook the following biological considerations: i) signaling pathway networks are highly tissue-specific and their regulatory interactions differ across tissue types; ii) regulatory factors exhibit heterogeneous influence on downstream gene transcription; iii) epigenetic and genomic alterations exhibit nonlinear impact on gene transcription.

In order to accommodate these biological effects, we propose a hybrid Bayesian method to learn tissue-specific pairwise influence models amongst genes and to predict a gene's expression level as a nonlinear-function of its epigenetic and regulatory influences. We employ a novel tree-based depth-penalization mechanism in order to capture the higher regulatory impact of closer neighbors in the regulatory network. Using a breast cancer multi-omics dataset (N=1190), we show that our proposed method has superior prediction power over optimization-based regression models, with the additional advantage of revealing gene deregulations potentially driven by somatic mutations.

I. Introduction

Large-scale profiling studies of multiple cancers have revealed a plethora of genomic aberrations whose functional significance in driving the respective cancers remains largely unknown. This has resulted in the need for biologically savvy computational approaches [1] that can integrate multiple genomic measurements of cancer tissues to identify functional genomic aberrations that underlie cancer development and progression.

One computational approach involves assessment of an enrichment score that captures the deviation from random representation of a predefined set of genes within a ranked gene list associated with the cancer phenotype [2]. While this approach identifies functional genomic alterations, it does not explicitly incorporate regulatory relationships amongst genes. Subsequent approaches [3-5] also incorporated these regulatory networks to capture signaling programs that may be driving cancer progression. These methods integrate well-curated biological pathway databases with genomic measurements into a unified modeling framework to estimate activity levels of network nodes associated with tissue-level phenotypes. Such network-based inference of pathway-level activities has also been used to evaluate whether specific mutations are functionally deregulating pathways [6]. These approaches mainly rely on the following assumptions: a) the pathway networks accurately and fully capture the cellular mechanisms across tissue types, b) the influence of regulatory parent nodes on downstream gene expression is equal, and c) the relation between gene expression and epigenetic information is linear. These assumptions underestimate the complexity and heterogeneity inherent in the biology of cancer.

We develop an integrative model that incorporates multi-omics measurements, including RNAseq-based gene expression, array-based DNA methylation (epigenetic) and SNP-array based somatic copy-number alterations (sCNA), and biological pathway network information to build a gene-gene regulatory influence network. Briefly, a non-linear Bayesian model is learned to predict the expression level of any given gene using its own sCNA and methylation data along with upstream regulatory influences inferred from biological pathway networks. The learned model is then used to identify genes whose measured expression levels show significant and abnormal deviations from the predictions, thus allowing for the discovery of somatic mutations that functionally alter gene regulation.

II. Method Overview

The proposed algorithm consists of several sequential steps to identify and report potential somatic aberrations driving deregulated genes. The first step is to build a tree for each gene that captures the relationship of the gene's expression levels with its own genomic (e.g. copy-number) and epigenetic (e.g. DNA methylation) status as well as its upstream transcriptional regulators (e.g. gene families and protein complexes). The gene of interest resides in the root node and the leaves of the tree represent all of the genes that potentially regulate its transcription either directly or indirectly through intermediate signaling partners. In the second step, we train a non-linear function to predict the expression level of the gene of interest by incorporating the molecular measurements associated with the leaves. The parameters of the non-linear function are estimated using a Bayesian inference method incorporating a novel depth penalization mechanism to capture the potentially stronger regulatory impact of nodes closer to the root node in the tree. The third and final step calculates relative inconsistency scores between the predicted and observed expression levels for each gene and reports deregulated genes. A subsequent analysis identifies the potential drivers of the gene deregulations arising from somatic mutations targeting the gene or its upstream transcriptional regulators.

Regulatory Tree Construction

In order to identify the genes that regulate the target gene's expression level, we build a tree for each target gene by integrating information from well-curated pathway databases, including NCI-PID, Biocarta, and Reactome, similar to the PARADIGM framework [3].

The network consists of multiple node types and regulatory interactions (Fig. 1). To develop the regulatory tree for each gene, we start with a specific target gene and traverse upstream along the pathway networks, capturing the regulatory genes (referred to as regulators hereafter for brevity) along with their depth, defined as the number of links to the root node (Fig. 1). We use a depth-first traversal algorithm, collecting all nodes based on the following rules:

  • i) At depth = 1, we exclude the post-transcriptional modifiers of the root node.

  • ii) When we include a gene as a regulator and pass to the next level of regulators, we exclude transcriptional regulators of that gene, since their effects are already captured by the gene's expression level.

  • iii) We terminate traversing a branch once we reach a predefined maximum depth level dmax or an abstract process node, since the nodes connected via abstract nodes may not have real pairwise causal impacts.

  • iv) If we reach a leaf node via multiple disjoint paths, the shortest path is considered.

  • v) We also include the gene of interest's epigenetic (DNA methylation) and somatic copy-number measurements as two additional nodes directly connected to the root node.

Fig. 1.

An example of a tree (dmax = 3) generated using the regulatory interactions derived from pathway databases for a sample gene PPP3CA. Node types are denoted as genes (ovals), protein complexes (rectangles), gene families (pentagons) and abstract concepts (hexagons). The edges are colored according to their regulatory function: protein activation (red), transcriptional regulation (blue), component of protein complex (black) and gene family member (grey). The root node's epigenetic and sCNA measurements (rounded rectangles), considered as additional regulatory parents, are connected by green arrows. The nodes chosen as regulators are marked along with their depths.

Thus, the tree construction algorithm reports the genes whose expression levels regulate the root gene's expression along with their corresponding depth parameters.
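The traversal rules above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the graph encoding (`parents`, `node_type`), the edge labels (`"transcription"`, `"activation"`, `"component"`), the function name `collect_regulators` and the toy node names are all hypothetical, and a breadth-first traversal is used in place of the depth-first traversal described in the text because it yields the shortest depth of each node directly (rule iv).

```python
from collections import deque

def collect_regulators(parents, node_type, root, d_max):
    """Return {regulator: depth} for `root` (hypothetical helper, not the authors' code).

    parents[node] lists (upstream_node, edge_label) pairs; node_type maps each
    node to "gene", "complex", "family" or "abstract" (assumed encoding).
    """
    depth = {root: 0}
    regulators = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        d = depth[node]
        if d == d_max:                       # rule iii: stop at the maximum depth
            continue
        for parent, edge in parents.get(node, []):
            # rule i: skip post-transcriptional modifiers of the root
            if node == root and edge == "activation":
                continue
            # rule ii: a regulator gene's own transcriptional regulators are skipped
            if node != root and node_type[node] == "gene" and edge == "transcription":
                continue
            if node_type[parent] == "abstract":  # rule iii: stop at abstract nodes
                continue
            if parent in depth:              # rule iv: first BFS visit is the shortest path
                continue
            depth[parent] = d + 1
            if node_type[parent] == "gene":
                regulators[parent] = d + 1
            queue.append(parent)
    # rule v: the root's own methylation and sCNA act as depth-1 parents
    regulators[f"{root}:methylation"] = 1
    regulators[f"{root}:sCNA"] = 1
    return regulators

# Toy network with made-up node names
toy_parents = {
    "T": [("TF1", "transcription"), ("KIN", "activation")],
    "TF1": [("TF2", "transcription"), ("CPLX", "activation"), ("PROC", "activation")],
    "CPLX": [("A", "component"), ("B", "component")],
}
toy_types = {"T": "gene", "TF1": "gene", "TF2": "gene", "KIN": "gene",
             "CPLX": "complex", "A": "gene", "B": "gene", "PROC": "abstract"}
toy_regulators = collect_regulators(toy_parents, toy_types, "T", d_max=3)
print(toy_regulators)
```

In this toy network, `KIN` is dropped by rule i, `TF2` by rule ii and `PROC` by rule iii, while the complex `CPLX` is traversed but not itself reported as a regulator.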

A. Learning Gene Regulatory Functions

Here we train a function that predicts the target gene expression level based on measurements of the nodes in the regulatory network. We use the following regression model for gene g using a training set of n normal and cancer samples:

$$
y_g = \mu_g 1_n + X_g \beta_g + \epsilon, \qquad \epsilon \sim N(0, \sigma_g^2 I_n) \tag{1}
$$

where $X_g = [X_g^S, X_g^P]$ is an n × p data matrix composed of two parts: $X_g^S$ (the target gene's DNA methylation and sCNA measurements) and $X_g^P$ (the expression levels of the P regulators identified in Step 1). $1_n$ is an all-ones column vector of size n, $\beta_g$ is the p × 1 vector of regression coefficients and $\epsilon$ is zero-mean Gaussian model noise with covariance matrix $\sigma_g^2 I_n$. $y_g$ represents the target gene's expression levels across the n training samples. Finally, $\mu_g$ is the expected value of the target gene's expression level. We omit the subscript g hereafter for notational convenience.

Because the pathway networks are not tissue specific and may include links that are absent or only loosely connected in a specific tissue type, the model parameters $\beta_i$ are expected to be sparse. The second consideration is the possibly non-linear relationship between the measurements. Therefore, we apply non-linear transformations to the data prior to developing our regression model. We used a centered sigmoid function $f_1(x, c) = \frac{e^{x-c}}{1 + e^{x-c}}$ to capture the sensitivity around the mean value and a soft-thresholding function $f_2(x, c) = \mathrm{sign}(x)\left(\sqrt{x^2 + c^2} - c\right)$ to account for the impact of extreme values. We applied the element-wise non-linear extension $X \rightarrow \Psi(X) = [X^S, f_1(X^S), f_2(X^S), X^P]$ only to $X^S$, hence increasing the number of predictors compared to the number of regulators. Notably, if the actual underlying function is linear, the coefficients of the nonlinear terms tend to zero in the proposed model.
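As a rough illustration of the non-linear extension, the following NumPy sketch builds $\Psi(X)$ from the two blocks of predictors. It assumes that the centered sigmoid is a logistic curve centered at c and that the soft-thresholding function is $\mathrm{sign}(x)(\sqrt{x^2 + c^2} - c)$; the function names and the way the constants `c1` and `c2` are chosen are hypothetical and not taken from the paper.

```python
import numpy as np

def centered_sigmoid(x, c):
    # Assumed form: logistic curve centered at c, most sensitive near x = c
    return 1.0 / (1.0 + np.exp(-(x - c)))

def soft_threshold(x, c):
    # Smooth soft-threshold: close to 0 for |x| << c, close to |x| - c for |x| >> c
    return np.sign(x) * (np.sqrt(x ** 2 + c ** 2) - c)

def expand_predictors(X_S, X_P, c1, c2):
    """Build Psi(X) = [X_S, f1(X_S), f2(X_S), X_P]; only X_S is expanded."""
    return np.hstack([X_S, centered_sigmoid(X_S, c1), soft_threshold(X_S, c2), X_P])

# Toy data: 4 samples, 2 self-measurements (methylation, sCNA), 3 regulators
rng = np.random.default_rng(0)
X_S = rng.standard_normal((4, 2))
X_P = rng.standard_normal((4, 3))
Psi = expand_predictors(X_S, X_P, c1=X_S.mean(axis=0), c2=0.5)
print(Psi.shape)
```

With two self-measurements and three regulators, the expanded matrix has 3 × 2 + 3 = 9 columns, which is the sense in which the number of predictors exceeds the number of regulators.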

Thirdly, to account for closer regulators having a greater impact on the target gene expression, we extended the Bayesian lasso scheme to include a depth penalization mechanism in addition to the non-linear terms.

We used the following prior construction:

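$$
\begin{aligned}
X &\rightarrow \Psi(X) \\
y \mid X, K, \beta, \sigma^2 &\sim N\big(\Psi(X)\beta,\ \sigma^2 I_n\big), \qquad K = \mathrm{diag}([k_1^2, \ldots, k_p^2]) \\
\beta \mid \sigma^2, \tau_1^2, \ldots, \tau_p^2 &\sim N_p\big(0_p,\ \sigma^2 D_\tau K^{-1}\big), \qquad D_\tau = \mathrm{diag}([\tau_1^2, \ldots, \tau_p^2]) \\
k_1^2, k_2^2, \ldots, k_p^2 &\sim \prod_{j=1}^{p} \frac{(a_k / d_j^2)^{a_k}}{\Gamma(a_k)}\, (k_j^2)^{(a_k - 1)}\, e^{-a_k k_j^2 / d_j^2}, \qquad j = 1, 2, 3, \ldots \\
\tau_1^2, \ldots, \tau_p^2 &\sim \prod_{j=1}^{p} \frac{\lambda^2}{2}\, e^{-\lambda^2 \tau_j^2 / 2} \\
\pi(\sigma^2) &= \frac{b_{\sigma^2}^{\,a_{\sigma^2}}}{\Gamma(a_{\sigma^2})}\, (\sigma^2)^{-a_{\sigma^2} - 1}\, e^{-b_{\sigma^2} / \sigma^2}
\end{aligned}
\tag{2}
$$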

The proposed Bayesian hierarchical generative model is preferable to optimization-based sparse regression models such as LASSO, RIDGE and ELASTIC NET, since it provides a full posterior distribution of the model parameters. Moreover, prior knowledge such as depth information can easily be incorporated into the model. While this leads to additional computation costs for sampling, the sampling occurs only once, during the training phase.

In the above formulation, the model parameters $\beta$ are conditionally normally distributed around zero, with variances controlled by three sets of hyper-parameters ($\mathrm{var}(\beta_i) = \sigma^2 \tau_i^2 / k_i^2$), where $\sigma^2$ controls the global shrinkage, $\tau_i^2$ accounts for the local shrinkage using the exponential prior and $k_i$ enforces the link depth impact. To provide more flexibility and a closed-form posterior, we assign a Gamma prior $k_i^2 \sim \mathrm{Gamma}(a_{k_i}, b_{k_i})$ such that the standard deviation of $\beta_i$ is inversely proportional to the corresponding link depth $d_i$ (i.e. $E[k_i^2] = a_{k_i} / b_{k_i} = d_i^2 / c$, where c is a normalizing term to ensure $\frac{1}{p}\,\mathrm{tr}(K^{-1}) = 1$, which is obtained by setting $c = p / \sum_{i=1}^{p} d_i^{-2}$). Therefore, we only have one free hyper-parameter $a_{k_i}$ for the prior distribution of $k_i$, and the second parameter is obtained automatically from $b_{k_i} = a_{k_i} c / d_i^2$. We note that $\mathrm{var}(k_i^2) = a_{k_i} / b_{k_i}^2 = d_i^4 / (c^2 a_{k_i})$. Setting $a_{k_i}$ to small values gives a higher variance for $k_i$ and hence a less informative prior, while large values of $a_{k_i}$ give a low variance, reflecting high certainty about the network topology and the fact that node pairs with shorter paths are associated with higher influences on one another. In this case, the gamma distribution approaches a Gaussian distribution concentrated around its mean. We choose the large value of $a_{k_i} = 10$ to emphasize the significance of the underlying biological network.
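The arithmetic linking the link depths to the Gamma hyper-parameters can be checked numerically. The sketch below assumes the reading $E[k_i^2] = d_i^2 / c$ with $c = p / \sum_i d_i^{-2}$, so that the prior-mean values of $1 / k_i^2$ average to one; the function name `depth_prior` and the example depths are hypothetical.

```python
import numpy as np

def depth_prior(depths, a_k=10.0):
    """Gamma(shape=a_k, rate=b_k) hyper-parameters for k_i^2 given link depths."""
    d = np.asarray(depths, dtype=float)
    p = d.size
    c = p / np.sum(d ** -2.0)      # normalizing term (assumed form)
    b_k = a_k * c / d ** 2         # rate parameter per regulator
    mean_k2 = a_k / b_k            # equals d^2 / c
    var_k2 = a_k / b_k ** 2        # equals d^4 / (c^2 a_k)
    return c, b_k, mean_k2, var_k2

c, b_k, mean_k2, var_k2 = depth_prior([1, 2, 3], a_k=10.0)
print(c, mean_k2, var_k2)
```

Deeper regulators receive a larger prior mean for $k_i^2$ and therefore a smaller prior variance for $\beta_i$, which is how the depth penalization enters the model. Increasing `a_k` shrinks `var_k2` without changing `mean_k2`.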

Using the conjugate priors in (2) and applying Bayes' rule results in the following closed-form conditional distributions for the model parameters, as in [7]; the details of the derivations are omitted for brevity.

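$$
\begin{aligned}
\beta \mid S \backslash \beta &\sim N\big(A^{-1} \Psi(X)^T y,\ \sigma^2 A^{-1}\big), \qquad A = \Psi(X)^T \Psi(X) + K D_\tau^{-1} \\
\gamma_i = \frac{1}{\tau_i^2} \,\Big|\, S \backslash D_\tau &\sim N^{-1}\!\Big(\frac{\lambda \sigma}{\beta_i},\ \lambda^2\Big)\, I(\gamma_i > 0) \\
k_i^2 \mid S \backslash K &\sim GA\Big(a_k + \frac{1}{2},\ \frac{a_k}{d_i^2} + \frac{\beta_i^2}{2 \sigma^2 \tau_i^2}\Big) \\
\sigma^2 \mid S \backslash \sigma^2 &\sim GA^{-1}\Big(a_{\sigma^2} + \frac{n + p}{2},\ b_{\sigma^2} + \frac{1}{2} (y - \Psi(X)\beta)^T (y - \Psi(X)\beta) + \frac{1}{2} \beta^T D_\tau^{-1} K \beta\Big)
\end{aligned}
\tag{3}
$$

where $S = \{\mu, \beta, \Psi(X), y, \sigma^2, \tau_1^2, \ldots, \tau_p^2\}$ is the set of all variables and '\' is the exclusion operator. $N(\cdot)$, $N^{-1}(\cdot)$, $GA(\cdot)$ and $GA^{-1}(\cdot)$ denote the Gaussian, inverse Gaussian, Gamma and inverse Gamma distributions, respectively.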

There are several ways to set the regularization parameter $\lambda$, including cross validation or expectation maximization. In this work, we used a gamma prior $\lambda \sim GA(a_\lambda, b_\lambda)$ and included it as an additional step in the Gibbs sampler.
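To make the sampling scheme concrete, here is a minimal NumPy sketch of a Gibbs sampler that cycles through the conditional distributions in (3). It is a sketch under several assumptions rather than the authors' implementation: y is taken to be centered so that the intercept is dropped, $\lambda$ is held fixed instead of being sampled from its gamma prior, the inverse-Gaussian mean uses $|\beta_i|$ (floored at a small constant) so that it stays positive, and the function name and default hyper-parameter values are hypothetical.

```python
import numpy as np

def gibbs_depth_lasso(Psi, y, depths, a_k=10.0, lam=1.0, a=1.0, b=1.0,
                      n_iter=400, seed=0):
    """Hypothetical sketch: returns draws of beta (n_iter x p) and sigma^2 (n_iter)."""
    rng = np.random.default_rng(seed)
    n, p = Psi.shape
    d2 = np.asarray(depths, dtype=float) ** 2
    PtP, Pty = Psi.T @ Psi, Psi.T @ y
    tau2, k2, sigma2 = np.ones(p), d2.copy(), 1.0
    beta_draws = np.empty((n_iter, p))
    sigma2_draws = np.empty(n_iter)
    for t in range(n_iter):
        # beta | rest ~ N(A^{-1} Psi^T y, sigma^2 A^{-1}), A = Psi^T Psi + K D_tau^{-1}
        A = PtP + np.diag(k2 / tau2)
        L = np.linalg.cholesky(A)                     # A = L L^T
        mean = np.linalg.solve(A, Pty)
        noise = np.linalg.solve(L.T, rng.standard_normal(p))
        beta = mean + np.sqrt(sigma2) * noise         # covariance sigma^2 A^{-1}
        # gamma_i = 1/tau_i^2 | rest ~ inverse Gaussian(mean, shape = lam^2)
        ig_mean = lam * np.sqrt(sigma2) / np.maximum(np.abs(beta), 1e-6)
        gamma = np.maximum(rng.wald(ig_mean, lam ** 2), 1e-12)
        tau2 = 1.0 / gamma
        # k_i^2 | rest ~ Gamma(shape a_k + 1/2, rate a_k/d_i^2 + beta_i^2/(2 sigma^2 tau_i^2))
        rate = a_k / d2 + beta ** 2 / (2.0 * sigma2 * tau2)
        k2 = rng.gamma(a_k + 0.5, 1.0 / rate)         # NumPy uses scale = 1/rate
        # sigma^2 | rest ~ inverse Gamma(a + (n + p)/2, scale)
        resid = y - Psi @ beta
        scale = b + 0.5 * resid @ resid + 0.5 * np.sum(beta ** 2 * k2 / tau2)
        sigma2 = scale / rng.gamma(a + 0.5 * (n + p)) # scale / Gamma(shape, 1)
        beta_draws[t], sigma2_draws[t] = beta, sigma2
    return beta_draws, sigma2_draws
```

On synthetic data with a few non-zero coefficients, the posterior mean of the draws after a burn-in period should sit close to the generating coefficients, while the retained draws provide the full posterior needed for the later inconsistency analysis.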

B. Inconsistency Analysis

Comparison of the observed gene expression measurement $y_g^o$ with the predicted value $y_g^p$ (the maximum a posteriori, MAP, estimate) for a given cancer sample determines the level of inconsistency for gene g.

We note that the predictive distribution for the RNA expression of gene g for each new test sample, $y_g^{new}$, is obtained by marginalizing out the model parameters from the conditional posterior distribution for a given input $x_g^{new}$:

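$$
\begin{aligned}
f(y_g^{new} \mid x_g^{new}) &= \int f(y^{new} \mid x^{new}, \beta, \sigma^2)\, f(\beta, \sigma^2 \mid y, X)\, d\beta\, d\sigma^2 \approx \frac{1}{N} \sum_{i=1}^{N} f\big(y_g^{new} \mid x_g^{new}, \beta_g^{(i)}, \sigma_g^{(i)}\big) \\
y_g^p &= \arg\max_{y_g^{new}} f(y_g^{new} \mid x_g^{new})
\end{aligned}
\tag{4}
$$

where $\beta_g^{(i)}$ and $\sigma_g^{(i)}$ are the samples of the model parameters for gene g obtained from the N iterations of Gibbs sampling.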

Consequently, the Z-score and the equivalent Log Likelihood of the consistency level of gene g in the new sample is obtained using the following equations:

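$$
\begin{aligned}
z_g^{new} &= \frac{y_g^o - y_g^p}{\sigma_{y_g^{new} \mid x_g^{new}}} \\
L_c^{new} &= \log f(y_g^{new} \mid x_g^{new}) = \mathrm{cnst} - \log\big(\sigma^2_{y_g^{new} \mid x_g^{new}}\big) - \frac{(z_g^{new})^2}{2}
\end{aligned}
$$

where the mean $\mu_{y_g^{new} \mid x_g^{new}}$ and variance $\sigma^2_{y_g^{new} \mid x_g^{new}}$ are provided by the Gibbs sampler.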
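The Monte Carlo approximation in (4) and the resulting Z-score can be sketched as follows. This is an illustrative sketch under assumptions: the predictive density is treated as an equal-weight mixture of Gaussians over the Gibbs draws, the MAP prediction is found by a simple grid search (the source does not state how the maximization is carried out), and the predictive variance is computed as the mixture variance. All function names are hypothetical.

```python
import numpy as np

def predictive_summary(x_new, beta_draws, sigma2_draws, grid_size=2001):
    """Return (MAP prediction, predictive mean, predictive variance) for one sample."""
    means = beta_draws @ x_new                 # one component mean per Gibbs draw
    sds = np.sqrt(sigma2_draws)
    pred_mean = means.mean()
    pred_var = sigma2_draws.mean() + means.var()   # variance of the Gaussian mixture
    grid = np.linspace(means.min() - 4 * sds.max(),
                       means.max() + 4 * sds.max(), grid_size)
    # Average of the N Gaussian densities evaluated on the grid
    comp = np.exp(-0.5 * ((grid[:, None] - means[None, :]) / sds[None, :]) ** 2)
    dens = np.mean(comp / (np.sqrt(2.0 * np.pi) * sds[None, :]), axis=1)
    y_map = grid[np.argmax(dens)]
    return y_map, pred_mean, pred_var

def inconsistency_z(y_obs, y_pred, pred_var):
    """Z-score of the observed expression relative to the prediction."""
    return (y_obs - y_pred) / np.sqrt(pred_var)

# Degenerate toy case: every draw is identical, so the mixture is a single Gaussian
toy_beta = np.tile([1.0, 2.0], (50, 1))
toy_sigma2 = np.full(50, 0.25)
y_map, mu, var = predictive_summary(np.array([1.0, 1.0]), toy_beta, toy_sigma2)
z = inconsistency_z(4.0, y_map, var)
print(y_map, mu, var, z)
```

A large absolute Z-score flags a sample whose observed expression is poorly explained by the gene's own sCNA and methylation and by its upstream regulators.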

III. Results

In this section, we first provide prediction results for sample genes that have valid regulatory connections in the pathway network and are known to be highly associated with cancer. RNA-seq based gene expression, DNA Methylation using the Illumina Infinium Methylation assays and sCNA profiles using Affymetrix SNP arrays were obtained using The Cancer Genome Atlas portal for a breast cancer dataset, containing 111 normal and 1079 cancer samples.

We compared the results of our proposed Bayesian method with state-of-the-art optimization-based sparse regression models, including LASSO, RIDGE and Elastic-Net regressions with solutions based on coordinate descent [8]. The Mean Square Error (MSE) ratio, $\frac{(y^o - y^p)^2}{(y^p)^2}$, and the State Error Rate (SER), obtained by mapping the observed and predicted values to three states (low, neutral and overexpressed), are presented across frameworks in Table I. The results for all models are derived from a test dataset independent of the training set.
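The two evaluation metrics can be computed along the following lines. This sketch makes assumptions that the text does not spell out: the squared terms of the MSE ratio are summed over the test samples, and the three expression states are obtained with two fixed cut-offs (`low`, `high`) whose values here are purely illustrative.

```python
import numpy as np

def mse_ratio(y_obs, y_pred):
    # Squared prediction error relative to the squared predictions (summed over samples)
    return np.sum((y_obs - y_pred) ** 2) / np.sum(y_pred ** 2)

def to_state(y, low, high):
    # -1 = low, 0 = neutral, +1 = overexpressed (cut-offs are assumptions)
    return np.where(y < low, -1, np.where(y > high, 1, 0))

def state_error_rate(y_obs, y_pred, low, high):
    # Fraction of samples whose predicted state differs from the observed state
    return np.mean(to_state(y_obs, low, high) != to_state(y_pred, low, high))

y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 4.0, 5.0])
print(mse_ratio(y_obs, y_pred), state_error_rate(y_obs, y_pred, low=1.5, high=3.5))
```

In this toy example the squared errors sum to 2 and the squared predictions to 46, and one of the four samples changes state, so the state error rate is 0.25.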

Table I.

Prediction accuracy for the proposed method in comparison with the benchmark optimization-based sparse regression models

| Method | Normal samples, MSE | Normal samples, SER | Cancer samples, MSE | Cancer samples, SER |
|---|---|---|---|---|
| LSE | 0.4028 | 0.1156 | 0.6102 | 0.2774 |
| Lasso | 0.332 | 0.0638 | 0.4867 | 0.1481 |
| Ridge | 0.3987 | 0.0848 | 0.5415 | 0.1997 |
| Elastic-NET (0.5) | 0.3469 | 0.0758 | 0.493 | 0.1667 |
| PROPOSED | 0.2797 | 0.0534 | 0.4688 | 0.1406 |

From Table I, we see that the proposed method outperforms the state-of-the-art sparse regression models, with the additional advantage of providing the full posterior distribution for the gene expression level required for subsequent inconsistency analysis. Another observation from Table I is that all models show higher predictability on the test set of normal samples, despite the fact that the number of cancer samples used for model training is larger than the number of normal samples. This suggests that the functional states of gene expression in normal tissues are more consistent with their upstream regulatory networks than in cancer tissues. We further highlight the utility of our proposed method using two genes (ERBB2 and PTEN) of high import in breast cancer.

ERBB2 is highly expressed in a subset of breast cancers due to sCNAs. Our model appropriately captures this non-linear effect (Fig. 2), by automatically assigning high values for the coefficient associated with the soft-thresholding function of sCNA, reflecting the fact that variations around zero in sCNA values correspond to measurement noise.

Fig. 2.

The observed and predicted relationship between sCNA and gene expression for ERBB2 in normal and cancer samples. The coefficient corresponding to $f_2(\mathrm{sCNA})$ dominates the other predictors of the model for ERBB2, with $\beta_2 / \sum_{j=1}^{p} \beta_j \approx 0.82$.

On the other hand, inactivation of the gene PTEN is functionally important in breast cancer due to its essential role in down-regulating the PI3K pathway, a key mechanism of resistance to anti-HER2 therapy. Fig. 3 shows a subset of breast cancers with significantly lower observed PTEN expression levels, as predicted by our model's integration of sCNA and regulatory networks. It is also notable that some cancer samples show significant inconsistency with the predictions. We hypothesize that these inconsistencies are likely associated with somatic mutations affecting either PTEN or its regulatory network. We therefore count all non-silent mutations affecting either PTEN or its regulators for each of the cancer samples, scaled by their absolute inconsistency levels. To apply the same concept of depth penalization, we weight the count of mutations by $\alpha^{d_{i,g}}$, where $0 < \alpha < 1$ is an arbitrary penalization factor and $d_{i,g}$ is the depth of the regulatory gene i relative to the target gene g = PTEN. In general, the functional impact of mutations in gene h on the expression of gene g, denoted by $f_g(h)$, is calculated as:

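$$
f_g(h) = \frac{\sum_{j=1}^{n} 1(h \in M_j \cap P_g)\, \alpha^{d_{h,g}}\, |z_g^j|}{\sum_{l \in P_g} \sum_{j=1}^{n} 1(l \in M_j \cap P_g)\, \alpha^{d_{l,g}}\, |z_g^j|}
$$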

where $P_g$ is the set of regulatory ancestor genes of gene g, $M_j$ is the set of genes mutated in sample j, $z_g^j$ is the inconsistency score of gene g in sample j and 1(.) is the indicator function. The denominator normalizes the scores so that $\sum_{h \in P_g} f_g(h) = 1$.
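The impact score can be illustrated with a short Python sketch. It assumes that the target gene itself is included in the ancestor set with depth 0 (so that its own mutations carry full weight) and that the absolute value of the inconsistency score is used as the per-sample weight; the function name, gene labels and numbers are made up for illustration.

```python
def mutation_impact(regulator_depth, mutated_sets, z_scores, alpha=0.5):
    """Normalized impact f_g(h) for every gene h in the ancestor set P_g.

    regulator_depth: {gene: depth to target g}   (P_g with d_{h,g}; assumed encoding)
    mutated_sets:    list of sets M_j, one per sample
    z_scores:        list of inconsistency scores z_g^j, one per sample
    """
    raw = {}
    for gene, depth in regulator_depth.items():
        # Depth-penalized, |z|-weighted count of samples in which `gene` is mutated
        raw[gene] = sum(alpha ** depth * abs(z)
                        for M_j, z in zip(mutated_sets, z_scores) if gene in M_j)
    total = sum(raw.values())
    return {gene: (value / total if total > 0 else 0.0) for gene, value in raw.items()}

# Toy example with hypothetical gene labels: target G and two upstream regulators
depths = {"G": 0, "R1": 1, "R2": 2}
samples = [{"G"}, {"R1", "X"}, {"R2"}, set()]
z = [2.0, -1.0, 4.0, 3.0]
impact = mutation_impact(depths, samples, z, alpha=0.5)
print(impact)
```

Here the raw scores are 2.0 for `G`, 0.5 for `R1` and 1.0 for `R2`, so after normalization the three impacts sum to one. Gene `X` is ignored because it is not in the ancestor set, and the fourth sample contributes nothing because it carries no relevant mutation.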

Fig. 3.

Predicted versus observed expression levels of PTEN. Cancer samples show widespread inconsistency as compared to normal samples.

The functional impact of somatic mutations on the deregulation of gene PTEN is depicted in Fig. 4, revealing that the inconsistencies in PTEN expression are highly associated with mutations in TP53, PTEN, PIK3CA, MAP3K1 and MAP2K4. The higher impact of TP53 mutations versus PIK3CA is particularly interesting given that PIK3CA is mutated more often than TP53 (387 samples versus 333 samples respectively). We observe that MAP3K1 and MAP2K4 mutations, previously shown to be associated with luminal breast cancers [9], impact PTEN inactivation, thus providing an intriguing nexus between these genes in driving a key subtype of breast cancers. We also calculate the relative impact of protein-truncating and other non-synonymous mutations after normalizing to their absolute counts on the inconsistency score for PTEN. The model determines that the two kinds of mutations have similar impact when they affect any of the regulatory genes of PTEN while the protein-truncating mutations in PTEN have an outsize impact on its deregulation, consistent with nonsense-mediated decay of PTEN mRNA. These findings highlight the capability of our modeling framework to capture the expected impact of somatic mutations in a gene on its own expression level, while also enabling the discovery of the functional effects of mutations in upstream regulatory genes.

Fig. 4.

The impact of somatic mutations in the upstream regulatory subnetwork of PTEN on its gene expression inconsistency. The depth penalization parameter is set to α = ½. The bars show the relative degree of association of mutations in genes (horizontal axis) with the level of inconsistency between the observed PTEN gene expression level and its predicted value. The impact is divided between protein-truncating mutations (blue) and other missense mutations (yellow), normalized by their counts in the respective genes.

IV. Conclusions

We have developed a novel Bayesian approach that integrates multi-omics data with prior biological knowledge derived from pathway annotation databases. Our approach captures the non-linear and heterogeneous influence of both upstream regulatory genes as well as epigenetic alterations on target gene expression levels. Furthermore, the inconsistency score estimated by our model quantifies the impact of somatic mutations potentially driving deregulation of gene expression in cancer samples. Our framework provides a new toolkit for cancer biologists to identify novel driver mutations in cancer through the integrative analysis of multi-omics cancer profiles.

Acknowledgment

The results shown here are based on data generated by the TCGA Research Network: http://cancergenome.nih.gov/, which was made available for public use in accordance with the TCGA policies governing human subject data. No additional human subject data or animal models were generated or used in this study.

References

  • 1. Varadan V, et al. The Integration of Biological Pathway Knowledge in Cancer Genomics: A review of existing computational approaches. IEEE Sig. Proc. Magazine. 2012;29(1):35–50.
  • 2. Subramanian A, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. National Academy of Sciences. 2005;102(43):15545–15550. doi: 10.1073/pnas.0506580102.
  • 3. Vaske CJ, et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics. 2010;26(12):i237–45. doi: 10.1093/bioinformatics/btq182.
  • 4. Greenblum SI, et al. The PathOlogist: an automated tool for pathway-centric analysis. BMC Bioinform. 2011;12:133. doi: 10.1186/1471-2105-12-133.
  • 5. Tarca AL, et al. A novel signaling pathway impact analysis. Bioinformatics. 2009;25(1):75–82. doi: 10.1093/bioinformatics/btn577.
  • 6. Ng S, et al. PARADIGM-SHIFT predicts the function of mutations in multiple cancers using pathway impact analysis. Bioinformatics. 2012;28(18):i640–i646. doi: 10.1093/bioinformatics/bts402.
  • 7. Park T, Casella G. The Bayesian lasso. J. American Statistical Association. 2008;103(482):681–686.
  • 8. Friedman JH, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J. Statistical Software. 2010 Feb;33(1):1–22.
  • 9. Ellis MJ, et al. Whole-genome analysis informs breast cancer response to aromatase inhibition. Nature. 2012;486:353–360. doi: 10.1038/nature11143.
