Abstract
While individual studies have demonstrated that mRNA expressions are affected by copy number aberrations and microRNAs, their integrative analysis has largely been ignored. In this article, we use recently developed high-dimensional regression techniques to perform the integrative analysis of such data in the context of Glioblastoma Multiforme (GBM). We find that copy numbers are more potent regulators of mRNA levels than microRNAs. We also infer the mRNA expression network after adjusting for the effects of microRNAs and copy numbers. Our association analysis demonstrates that the expression levels of the genes IRS1 and GRB2 are strongly associated with the underlying variation in copy numbers, but we fail to detect significant associations with microRNA levels.
Keywords: Bayesian modeling, glioblastoma, graphical models, high-dimensional data analysis, integrative analysis
I. Introduction
With advances in high-throughput genomic technologies using array and sequencing-based approaches, it is now possible to collect millions of markers across the entire genomic landscape at various levels such as mRNA, copy numbers and microRNA, the interrelations among which provide key insights into the disease etiology of cancer. Such data have been collected extensively under The Cancer Genome Atlas (TCGA) project (http://tcga-data.nci.nih.gov/tcga). A key objective of many recent studies is to investigate the inter-dependence between the various molecular markers, e.g., the effects of microRNA expression and/or copy number profiles on gene expression (mRNA) ([1], [2]). A common statistical framework to analyze such dependence is via high-dimensional regression models, where mRNA markers can be framed as “responses” and microRNA and copy number as “predictors”. A fundamental feature of these regressions is that the number of predictors (p) and the number of responses (q) far exceed the sample size (n). Further complicating the analysis in a realistic setting, the responses often exhibit complex correlation patterns among themselves due to regulatory mechanisms ([3]), causing standard data analysis techniques to fail, even those designed specifically for high dimensions, e.g., the lasso. Therefore, we need a technique that scales to high dimensions while accounting for correlation structures among the responses. Moreover, while individual studies have demonstrated the effects of copy numbers and/or microRNA on mRNA expressions, it is not well understood which of the two is the more potent regulator of mRNA expression levels. An answer lies in an integrative analysis of the effects of copy number aberrations and microRNA data on mRNA expressions, which we undertake in this article.
Specifically, we are concerned with two main scientific questions: (1) to delineate the joint effects of copy number and microRNA data on mRNA expression levels and (2) to infer the resulting gene regulatory networks and associations accounting for these effects.
In this article we extend a recently developed high-dimensional sparse Bayesian regression technique, originally proposed in the context of eQTL analysis ([4]), to multiple sets of predictors in the context of an integrative genomic analysis. This technique is particularly relevant here since the data share many similar characteristics. Both p and q far exceed n and the responses (mRNA) are known to be correlated due to the presence of complex gene interactions. Thus, sparsity needs to be introduced in variable selection as well as in the selection of the most significant edges of the gene interaction networks. However, the present analysis differs from the previous study, which focused on a single eQTL data set (SNPs as predictors, mRNAs as responses), whereas here we are concerned with different genomic data (mRNA, microRNA and copy numbers). Also, no data integration was performed in the previous study.
We illustrate our methods on a TCGA-based GBM data set. In single-platform analyses, we identify many of the copy number aberrations and microRNAs previously implicated in GBM as a part of our study, lending credence to our method. However, our key finding is that in an integrative analysis, copy numbers are a more potent regulator of mRNA expressions than microRNAs. Our sparse regression model is particularly well-suited to uncover this relationship and we find that more copy number aberrations are significantly associated with mRNA expression levels than microRNAs. We also perform an estimation of the covariate-adjusted interaction network for mRNAs.
II. Integrative analysis of high-dimensional multi-platform GBM data
A. Notation and Terminology
We frame our problem as a regression where copy numbers and microRNA expressions are respectively p1 and p2 dimensional predictors of q dimensional mRNA expression levels, each of these collected for n GBM patients. We denote the respective n×p1 and n×p2 matrices of predictors by X(1) and X(2) and the n × q response matrix as Y.
B. Hierarchical Bayesian Model For GBM data
We consider the following general model:
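$$Y = X^{(1)}_{\gamma_1} B^{(1)}_{\gamma_1, G} + X^{(2)}_{\gamma_2} B^{(2)}_{\gamma_2, G} + \varepsilon, \qquad (1)$$

where $Y$ is an $n \times q$ matrix of standardized gene expression data for $n$ individuals, for the same $q$ genes; $X^{(i)}_{\gamma_i}$ is an $n \times p_{\gamma_i}$ matrix of predictors encoding genetic markers; and $B^{(i)}_{\gamma_i, G}$ is a $p_{\gamma_i} \times q$ matrix of regression coefficients, for $i = 1, 2$. Let $p_i$ be the total number of predictors of type $i$, of which only a sparse subset of length $p_{\gamma_i}$ is present in the model. Thus we need a vector of indicators $\gamma_i = (\gamma_{i1}, \ldots, \gamma_{i p_i})$. Here $\gamma_{ij}$ is 1 if the $j$th predictor of type $i$ is present in the model and 0 otherwise. Therefore, $p_{\gamma_i} = \sum_{j=1}^{p_i} \gamma_{ij}$.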
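We further assume that $\varepsilon$ follows a zero mean matrix-variate normal distribution [5], denoted as $MN_{n \times q}(0, I_n, \Sigma_G)$, where $0$ is an $n \times q$ matrix of zeroes and $\Sigma_G$ is the $q \times q$ covariance matrix of the $q$ possibly correlated responses. We denote by $G$ a symmetric matrix of indicators that control the sparsity in the response inverse covariance matrix $\Sigma_G^{-1}$. The key here is to do a sparse estimation of both $\gamma$ and $G$, inferring respectively a sparse set of regulatory predictors (copy number or microRNA) and a sparse mRNA interaction network. Define the integrated matrices $X_\gamma = [X^{(1)}_{\gamma_1}, X^{(2)}_{\gamma_2}]$, of dimension $n \times p_\gamma$ with $p_\gamma = p_{\gamma_1} + p_{\gamma_2}$, and $B_{\gamma, G}$, the $p_\gamma \times q$ matrix obtained by stacking $B^{(1)}_{\gamma_1, G}$ on top of $B^{(2)}_{\gamma_2, G}$, so that (1) reads $Y = X_\gamma B_{\gamma, G} + \varepsilon$. The complete hierarchical model is now specified as follows:

$$
\begin{aligned}
(Y - X_\gamma B_{\gamma, G}) \mid B_{\gamma, G}, \Sigma_G &\sim MN_{n \times q}(0, I_n, \Sigma_G), \\
B_{\gamma, G} \mid \gamma, \Sigma_G &\sim MN_{p_\gamma \times q}(0, c I_{p_\gamma}, \Sigma_G), \\
\Sigma_G \mid G &\sim \mathrm{HIW}_G(b, d I_q), \\
\gamma_j &\overset{iid}{\sim} \mathrm{Ber}(w_\gamma), \quad j = 1, \ldots, p_1 + p_2, \\
G_k &\overset{iid}{\sim} \mathrm{Ber}(w_G), \quad k = 1, \ldots, q(q-1)/2, \\
w_\gamma, w_G &\sim U(0, 1),
\end{aligned}
$$

where HIW (·) stands for the hyper-inverse Wishart distribution, which is a conjugate prior on the covariance matrix for Gaussian graphical models [6]; $b$, $c$, $d$ are fixed, positive hyper-parameters; and $w_\gamma$ and $w_G$ are prior weights that control the sparsity in $\gamma$ and $G$ respectively. Ber (·) and U(·) denote a Bernoulli and a Uniform random variable. With a slight abuse of notation, $\gamma_j$ denotes the indicator for the $j$th element of the stacked vector $\gamma = (\gamma_1, \gamma_2)$, and $G_k$ denotes the indicator for the $k$th off-diagonal edge in the lower triangular part of the adjacency matrix of the symmetric graph $G$.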
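To make the role of the indicators concrete, the following minimal Python sketch selects the active columns of each platform and column-binds them into a single integrated design matrix. The function names `select_columns` and `integrate` and the toy values are hypothetical illustrations and are not taken from the original analysis.

```python
def select_columns(X, gamma):
    """Keep the columns of X (a list of rows) whose indicator in gamma is 1."""
    if any(len(row) != len(gamma) for row in X):
        raise ValueError("each row of X needs one entry per indicator")
    return [[v for v, g in zip(row, gamma) if g == 1] for row in X]


def integrate(X1, gamma1, X2, gamma2):
    """Column-bind the selected copy number and microRNA predictors."""
    if len(X1) != len(X2):
        raise ValueError("both platforms must cover the same n samples")
    S1 = select_columns(X1, gamma1)
    S2 = select_columns(X2, gamma2)
    return [r1 + r2 for r1, r2 in zip(S1, S2)]


# Toy data (hypothetical): n = 3 samples, p1 = 3 copy numbers, p2 = 2 microRNAs.
X1 = [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3], [2.1, 2.2, 2.3]]
X2 = [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
gamma1 = [1, 0, 1]  # two copy number predictors are in the model
gamma2 = [0, 1]     # one microRNA predictor is in the model

X_gamma = integrate(X1, gamma1, X2, gamma2)
p_gamma = sum(gamma1) + sum(gamma2)  # number of active predictors overall
```

In this toy case the integrated matrix has three columns, two from the copy number platform followed by one from the microRNA platform, matching the sum of the indicators.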
For this model, a collapsed Gibbs sampler was designed by [4] that integrates out Bγ,G and ΣG, enabling efficient inference of γ and G. In particular, the distribution of the data conditional only on γ and G can be shown to follow a hyper-matrix t density. Collapsing helps the sampler mix quickly. After the MCMC chains for γ and G have stabilized, it is also possible to simulate Bγ,G and ΣG conditional on posterior summary statistics γ̂ and Ĝ in order to perform an association analysis between the predictors and responses. Controlling the false discovery rate at a given level, one can infer from γ̂ the predictors (microRNA, copy numbers, or the integrated microRNA and copy numbers) that are globally significant; from Ĝ the significant edges in the mRNA interaction network; and from the simulated Bγ,G the significant associations between the predictors and the responses.
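The text does not state which false discovery rate procedure is used. As one possible illustration, the sketch below applies a common Bayesian rule: rank the posterior probabilities, treat one minus each probability as the chance of a false discovery, and keep the largest top-ranked set whose average false-discovery chance stays at or below the target level. The function name `bayesian_fdr_select` and the example probabilities are hypothetical.

```python
def bayesian_fdr_select(post_probs, alpha):
    """Indices of the largest top-ranked set with estimated FDR <= alpha.

    The estimated FDR of a set is the mean of (1 - posterior probability)
    over its members. This rule is an assumption made for illustration.
    """
    order = sorted(range(len(post_probs)),
                   key=lambda i: post_probs[i], reverse=True)
    selected = []
    cumulative_error = 0.0
    for i in order:
        cumulative_error += 1.0 - post_probs[i]
        # The running mean is non-decreasing, so stop at the first violation.
        if cumulative_error / (len(selected) + 1) > alpha:
            break
        selected.append(i)
    return sorted(selected)


# Hypothetical posterior probabilities for five predictors (or edges).
post_probs = [0.99, 0.20, 0.95, 0.60, 0.05]
significant = bayesian_fdr_select(post_probs, alpha=0.10)
```

The same function can be applied to posterior edge probabilities for G, since it only needs a list of probabilities and a target level.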
C. Integrative Analysis of GBM data
We illustrate our method using a TCGA-based GBM data set where we have, for each of the 233 (= n) subjects diagnosed with GBM, matched copy numbers (p1 = 533), microRNA (p2 = 536) and mRNA expression levels (q = 49). We focus our analysis on the mRNA/genes selected from core pathways implicated in GBM, such as the receptor tyrosine kinase (RTK), phosphatidylinositol-3-OH kinase (PI3K) and retinoblastoma (RB) pathways ([7]). We normalize each of the data sets (mRNA, microRNA and copy number) by subtracting the mean and dividing by the standard deviation, to enable interpretation on the same scale. We then work with three different cases: (a) CN model, where we regress mRNA expressions on copy number aberrations; (b) microRNA model, where we regress mRNA expressions on microRNA; and (c) CN+microRNA model, where we regress mRNA expressions (additively) on both copy number and microRNA. The analysis uses 300,000 MCMC iterations with 50,000 samples as burn-in. For thinning, we keep every 10th sample and discard the rest. We verify that this results in good mixing of the sampler. For each model, we infer the posterior probabilities for γ and for G, and the association levels (regression coefficients) Bγ,G, which we plot respectively in Figures 1, 2 and 3.
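The preprocessing and thinning described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the sample standard deviation (n − 1 denominator) is used, and the 50,000 burn-in iterations are assumed to be counted within the 300,000 total; neither detail is specified in the text. The function names are hypothetical.

```python
import statistics


def standardize_columns(rows):
    """Subtract each column's mean and divide by its standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # Sample SD (n - 1 denominator); this choice is an assumption.
    sds = [statistics.stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def retained_iterations(total, burn_in, thin):
    """Iteration indices kept after dropping burn-in and thinning."""
    return range(burn_in, total, thin)


# Toy matrix (hypothetical): 3 samples, 2 markers on very different scales.
Z = standardize_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Settings from the text; burn-in assumed to be part of the 300,000 total.
kept = retained_iterations(total=300_000, burn_in=50_000, thin=10)
n_kept = len(kept)
```

Under these assumptions, 25,000 draws remain for posterior summaries, and both toy columns map to the same standardized values despite their different original scales.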
Figure 1.

From top: Plot of the posterior probabilities of the predictors for (a) CN model, (b) microRNA model and (c) CN+microRNA model. In the bottom panel, copy numbers are in blue and microRNAs are in red.
Figure 2.

From top: Plot of the posterior probabilities of the edges in the mRNA interaction network for (a) CN model, (b) microRNA model and (c) CN+microRNA model.
Figure 3.

From top: Plot of association of the mRNAs for (a) CN model, (b) microRNA model and (c) CN+microRNA model. The expression levels of IRS1 show significant up-regulation and those of GRB2 show significant down-regulation when copy numbers are used as predictors. We fail to detect any significant association with the microRNAs.
A major finding is that in an integrative analysis, copy numbers dominate the set of significant predictors. Comparing the top and the middle panels of Figure 1, it is apparent that in single-platform analyses copy numbers are far stronger predictors of mRNA expressions than microRNAs. For example, using a threshold of 0.4 on the posterior probability of inclusion, we see that 12 copy numbers (circled in red) fall above this threshold, whereas the highest posterior probability among all the microRNAs was 0.17. Perhaps more interesting is the bottom panel, where we perform an integrative analysis of microRNA and copy number data and use both as predictors in our multivariate regression. In the bottom panel, posterior probabilities for copy numbers are plotted in blue and those for microRNAs are plotted in red. Even visually, it is clear that copy numbers dominate microRNAs in terms of associations and predictive power.
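The thresholding step can be illustrated with a short Python sketch. It assumes that the posterior probability of inclusion for each predictor is estimated as the fraction of retained MCMC draws in which its indicator equals 1, which is the usual Monte Carlo estimate but is not spelled out in the text. The function names and the toy draws are hypothetical.

```python
def inclusion_probabilities(gamma_draws):
    """Column-wise mean of 0/1 indicator draws, one row per retained sample."""
    n_draws = len(gamma_draws)
    return [sum(col) / n_draws for col in zip(*gamma_draws)]


def above_threshold(probs, threshold):
    """Indices of predictors whose probability is strictly above threshold."""
    return [j for j, p in enumerate(probs) if p > threshold]


# Toy example (hypothetical): 5 retained draws of 4 inclusion indicators.
gamma_draws = [
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
probs = inclusion_probabilities(gamma_draws)
selected = above_threshold(probs, threshold=0.4)
```

A strict inequality is used here to mirror the phrase "fall above this threshold"; a predictor sitting exactly at 0.4 is therefore not selected.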
Figure 2 presents the posterior probabilities for the edges of the response covariance matrix inferred for the three analyses performed. In each case we detect similar patterns of interaction among the genes. This is perhaps not surprising: although the estimated covariance depends on the predictors used, it is not expected to change significantly. This is consistent with previous findings in this area ([8]). As this previous work demonstrated, the real benefit of the simultaneous estimation of γ and G lies in an improved estimation of Bγ,G. Figure 3 shows the association pattern between the mRNAs and the predictors. Copy numbers appear to be strongly associated with the expression levels of two genes, IRS1 and GRB2 (estimated regression coefficients away from zero). Whereas IRS1 appears to be up-regulated (estimated positive regression coefficient), GRB2 appears to be down-regulated (estimated negative regression coefficient). These two genes have previously been known to interact in the context of GBM. IRS1 is a member of the RTK/PI3K signaling pathway, GRB2 is a member of the RTK/RAS signaling pathway, and these two pathways share connections ([7]). Our analysis indicates that IRS1 and GRB2 could be of interest for further functional validation studies of their involvement in GBM development and progression.
Another of our major findings is that the significant copy numbers appear to be serially correlated, as seen in the top panel of Figure 1, where several red dots lie close to each other. Even though our method does not account for spatial correlation among the predictors, copy number data are well known to be spatially correlated ([9]) and we see this effect manifest. The effect persists whether only copy numbers are used as predictors or both copy numbers and microRNAs are used. Also, from Figure 3, the serial correlation effect in copy numbers is evident, since long stretches of copy numbers appear to be significantly associated with both IRS1 and GRB2.
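One simple way to quantify the clustering described above is the lag-1 autocorrelation of the selection indicators taken in genomic order. This diagnostic is not part of the original analysis; the sketch below is offered only as an illustration, and the function name and toy sequences are hypothetical.

```python
def lag1_autocorrelation(values):
    """Lag-1 sample autocorrelation of a sequence ordered along the genome."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant sequence")
    num = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return num / denom


# Toy indicator sequences (hypothetical): 1 marks a selected copy number.
clustered = [0, 0, 1, 1, 1, 0, 0, 0]    # selections sit next to each other
alternating = [0, 1, 0, 1, 0, 1, 0, 1]  # selections never adjacent

r_clustered = lag1_autocorrelation(clustered)
r_alternating = lag1_autocorrelation(alternating)
```

A positive value, as for the clustered sequence, indicates that selected markers tend to be adjacent, which is the pattern one would expect from spatially correlated copy number data.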
III. Conclusions and Future Work
This article presents an integrative analysis of data from multiple genomic platforms to investigate their interdependencies. We illustrate our methods using the effects of microRNA and copy numbers on mRNA expressions. It is also possible that microRNAs and copy numbers interact among themselves and provisions should be made to account for such interactions. Furthermore, it is very likely that nonlinear or non-Gaussian associations are present in the data and provisions should be made in the model to handle these effects.
Our analysis indicates that the expression levels of two genes, IRS1 and GRB2, are significantly associated with the underlying copy number aberrations. Since these two genes are members of two different pathways that are known to interact in GBM, our findings indicate the pathways may themselves share connections that need to be further explored via validation experiments.
The technique presented in this article is quite general, in the sense that theoretically it can handle any type of predictors and responses in the presence of complex interactions (e.g., categorical such as SNPs or continuous such as various expression levels), both of dimensions orders of magnitude higher than the available sample size. The technique also allows for integration of data at both the predictor and response levels. While we have integrated two sources of data at the predictor level (copy numbers and microRNA), this can easily be extended to handle more. We believe the integrative analysis presented in this article will shed new light on the genomic interactions involved in GBM and stimulate the analysis of multiple-predictor, multiple-response genomic data from multiple platforms in other cancers.
Acknowledgments
VB’s research was partially supported by NIH grant R01 CA160736.
Contributor Information
Anindya Bhadra, Email: bhadra@purdue.edu, Department of Statistics, Purdue University.
Veerabhadran Baladandayuthapani, Email: veera@mdanderson.org, Department of Biostatistics, The University of Texas MD Anderson Cancer Center.
References
- 1. Dong H, Luo L, Hong S, et al. Integrated analysis of mutations, miRNA and mRNA expression in glioblastoma. BMC Systems Biology. 2010;4(1):163. doi: 10.1186/1752-0509-4-163.
- 2. Hodgson JG, Yeh R-F, Ray A, et al. Comparative analyses of gene copy number and mRNA expression in glioblastoma multiforme tumors and xenografts. Neuro Oncol. 2009;11(5):477–487. doi: 10.1215/15228517-2008-113.
- 3. Jornsten R, Abenius T, Kling T, et al. Network modeling of the transcriptional effects of copy number aberrations in glioblastoma. Mol Syst Biol. 2011 Apr;7:486. doi: 10.1038/msb.2011.17.
- 4. Bhadra A, Mallick BK. Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics. 2013;69:447–457. doi: 10.1111/biom.12021.
- 5. Dawid AP. Some matrix-variate distribution theory: Notational considerations and a Bayesian application. Biometrika. 1981;68(1):265–274.
- 6. Dawid AP, Lauritzen SL. Hyper Markov laws in the statistical analysis of decomposable graphical models. Annals of Statistics. 1993;21(3):1272–1317.
- 7. TCGA. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455. doi: 10.1038/nature07385.
- 8. Yin J, Li H. A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Annals of Applied Statistics. 2011;5:2630–2650. doi: 10.1214/11-AOAS494.
- 9. Baladandayuthapani V, Ji Y, Talluri R, et al. Bayesian random segmentation models to identify shared copy number aberrations for array CGH data. Journal of the American Statistical Association. 2010;105(492). doi: 10.1198/jasa.2010.ap09250.
