Abstract
Long non-coding RNAs (lncRNAs) are a large and diverse class of transcribed RNAs that have been shown to play a significant role in cancer development. In this study, we apply an integrative modeling framework that combines DNA copy number variation (CNV), lncRNA expression, and downstream target protein expression to predict patient survival in breast cancer. We develop a 3-stage model combining a mechanistic model (lncRNA regressed on CNV and target proteins regressed on lncRNA) and a clinical model (survival regressed on estimated effects from the mechanistic models). Using lncRNAs (such as HOTAIR and MALAT1) along with their CNV, target protein expressions, and survival outcomes from The Cancer Genome Atlas (TCGA) database, we show that the mean squared prediction error and the integrated Brier score (IBS) are both lower for the proposed 3-stage integrated model than for the 2-stage model. Therefore, the integrative model has better predictive ability than the 2-stage model, which does not consider target protein information.
Keywords: Long noncoding RNA, breast cancer, integrative modeling, survival model, TCGA
Introduction
Several lines of evidence highlight the emerging impact of long noncoding RNAs (lncRNAs) in cancer progression.1-4 The aim of this study is to assess the predictive capability of some oncogenic lncRNAs for tumor progression and prognosis in breast cancer.
Breast cancer is the most common malignancy and the leading cause of cancer death in women. By focusing on a single type of genetic alteration such as copy number variation (CNV), scientists have identified significant genes that may contribute to cancer progression.5-8 Because cancer is complex, its study should incorporate data from multiple platforms, ranging from the genes, transcripts, and proteins found in cancer cells9 to whole biological systems, represented by molecular pathways and cell populations.10 Integration in which multiple levels of omics data (ie, CNV, methylation, and gene expression) are gathered from the same subjects and analyzed together is known as vertical integration.10-12
In this study, we introduce a simple way to integrate multiple omics data and show that accounting for lncRNAs and their targets improves survival prediction in breast cancer. We consider genomic platforms such as CNV and mRNA expression, a proteomic platform such as protein expression, and a phenotype such as patient survival. This study focuses only on the lncRNA expressions from The Cancer Genome Atlas (TCGA) breast cancer data. We consider the target protein expressions as proteomics data.
An Integrative Model
We consider a 3-stage model here. Suppose that n is the number of patients, p is the number of lncRNAs, and L is the number of CNV expressions.
The mechanistic model for each lncRNA can be expressed as
(1) |
where is the level of gene expression for gene , and is of dimension ; is part of the expression that is attributed to the lth CNV; is the other (remaining) part of the gene expression which is not regulated by CNV and is of dimension ; and is the regression coefficient vector.
Next, the downstream target protein of each specific lncRNA was identified from PubMed articles, TCGA RNA-Seq database, and other extensive analyses such as differential expression analysis. The mechanistic model for each protein (for every lncRNA) can be expressed as
(2) |
where and represents the “other” part of the protein expression that is not regulated by lncRNA. and are the regression coefficients corresponding to the CNV expressions and the error part from equation (1), respectively.
The clinical component models the effect of the mechanistic parts of the genes on a clinical outcome of interest and can be written as
(3) |
where is the survival outcome, is the error term, and and are the usual regression coefficients corresponding to lncRNA and the estimated error part from equation (2), respectively.
The variable represents the vectorized downstream gene effects attributed to protein expressions and is estimated from the second-stage mechanistic model. Therefore, the clinical component additively models the effects of all the gene expressions and their components—derived from different sources (gene expression, CNV) in a unified manner.
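For orientation, the three stages described above can be written in one possible notation. This is a sketch only: the symbols $G_j$ (expression of lncRNA $j$), $C_l$ (the $l$th CNV measurement), $O_j$ (the part of $G_j$ not explained by CNV), $P_{jk}$ (the $k$th target protein of lncRNA $j$), $R_{jk}$ (the part of $P_{jk}$ not explained by the lncRNA), $\hat{Z}$ (the stacked second-stage estimates), and all Greek coefficients are assumptions introduced here and may differ from the authors' own notation.

```latex
\begin{align}
G_j &= \sum_{l=1}^{L} C_l\,\beta_{jl} + O_j, \qquad j = 1,\dots,p, \tag{1}\\
P_{jk} &= \hat{G}^{C}_j\,\gamma_{1jk} + \hat{O}_j\,\gamma_{2jk} + R_{jk},
  \qquad \hat{G}^{C}_j = \sum_{l=1}^{L} C_l\,\hat{\beta}_{jl}, \tag{2}\\
Y &= \sum_{j=1}^{p} G_j\,\alpha_j + \hat{Z}\,\theta + \varepsilon. \tag{3}
\end{align}
```

Under this reading, each stage passes its fitted part and its residual part on to the next stage, so the clinical model in (3) sees the CNV-driven, lncRNA-driven, and protein-specific contributions as separate additive terms.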
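Assuming normally distributed errors and an untransformed continuous outcome gives rise to the usual linear model, whereas we obtain the log-normal accelerated failure time (AFT) model when the outcome is the logarithm of the survival time and the errors are normally distributed.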
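The staged regressions can be illustrated with ordinary least squares on synthetic data. The sketch below is not the authors' implementation: the data are simulated, there is a single lncRNA and a single target protein, the outcome is uncensored, and the helper `ols_split` and all variable names are hypothetical. It reports in-sample error only, which is simpler than the out-of-sample prediction error discussed in this article.

```python
import numpy as np

def ols_split(X, y):
    """Least-squares fit of y on X plus an intercept; returns (fitted, residual)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    return fitted, y - fitted

# Synthetic data (assumed sizes and effect strengths, for illustration only).
rng = np.random.default_rng(0)
n, n_cnv = 200, 3
cnv = rng.normal(size=(n, n_cnv))                              # CNV measurements
lnc = cnv @ np.array([0.8, -0.5, 0.3]) + rng.normal(size=n)    # one lncRNA
prot = 0.7 * lnc + rng.normal(size=n)                          # one target protein
y = 1.0 + 0.5 * lnc + 0.9 * prot + rng.normal(scale=0.5, size=n)  # uncensored outcome

# Stage 1: split the lncRNA into a CNV-driven part and a remaining part.
lnc_cnv, lnc_other = ols_split(cnv, lnc)
stage1_parts = np.column_stack([lnc_cnv, lnc_other])

# Stage 2: split the protein into an lncRNA-driven part and a remaining part.
# The fitted part is a linear combination of the stage-1 columns, so only the
# residual carries new information into stage 3.
prot_fit, prot_other = ols_split(stage1_parts, prot)

# Stage 3: clinical model without (2-stage) and with (3-stage) the protein part.
_, resid2 = ols_split(stage1_parts, y)
_, resid3 = ols_split(np.column_stack([stage1_parts, prot_other]), y)
mse2 = float(np.mean(resid2 ** 2))
mse3 = float(np.mean(resid3 ** 2))
print(f"in-sample MSE, 2-stage: {mse2:.3f}; 3-stage: {mse3:.3f}")
```

Because the 2-stage design is nested in the 3-stage design, the in-sample error of the larger model can never be higher; a fair comparison of predictive ability needs held-out data or a score such as the IBS.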
In the presence of right censoring, we observe the tuple $(t_i, \delta_i)$, where $t_i = \min(T_i, C_i)$ is the observed time and $\delta_i = 1$ if the event is observed (death in this case), and 0 otherwise, with $T_i$ being the survival time and $C_i$ being the censoring time. Standard statistical software can be used to fit the log-normal AFT model and the other linear regression models.
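To make the role of censoring concrete, the following sketch writes out the log-likelihood that a log-normal AFT fit maximizes: observed deaths contribute the log-normal density, and censored patients contribute the probability of surviving past their censoring time. The function names and the argument layout are assumptions for illustration; the sketch uses only the Python standard library and does not perform the optimization itself.

```python
import math

def norm_logpdf(z):
    """Log density of the standard normal distribution at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def norm_logsf(z):
    """Log upper-tail probability of the standard normal distribution.
    Sketch only: underflows for very large z."""
    return math.log(0.5 * math.erfc(z / math.sqrt(2.0)))

def lognormal_aft_loglik(times, events, linpred, sigma):
    """Log-likelihood of right-censored data under
    log T_i = linpred_i + sigma * eps_i with eps_i ~ N(0, 1).

    times   : observed times (event or censoring), all > 0
    events  : 1 if the event was observed, 0 if right censored
    linpred : linear predictor for each subject
    sigma   : scale parameter, > 0
    """
    total = 0.0
    for t, d, mu in zip(times, events, linpred):
        z = (math.log(t) - mu) / sigma
        if d == 1:
            # Density of T at t: phi(z) / (sigma * t).
            total += norm_logpdf(z) - math.log(sigma) - math.log(t)
        else:
            # Probability that the survival time exceeds t.
            total += norm_logsf(z)
    return total

# Hypothetical toy data: two deaths and two censored patients.
print(lognormal_aft_loglik([2.0, 5.0, 3.0, 8.0], [1, 0, 1, 0], [1.0, 1.5, 1.0, 1.5], 0.8))
```

In practice this function would be passed, with its sign flipped, to a numerical optimizer over the regression coefficients and the scale parameter.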
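To quantify the prediction accuracy, we consider a standard comparative predictive measure, the Brier score (BS),13 which uses the predicted survival times:

$$\mathrm{BS}(t) = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{\hat{S}(t \mid x_i)^2\, I(t_i \le t,\ \delta_i = 1)}{\hat{G}(t_i)} + \frac{\{1-\hat{S}(t \mid x_i)\}^2\, I(t_i > t)}{\hat{G}(t)}\right],$$

where $t_i$ and $\delta_i$ are the observed time and the event indicator of patient $i$, $x_i$ are the corresponding covariates, $\hat{G}$ denotes the Kaplan-Meier estimate of the censoring distribution, which is based on the observations $(t_i, 1-\delta_i)$, and $\hat{S}$ stands for the estimated survival function. As the mathematical form suggests, BS provides a numerical comparison between the observed and estimated survival functions. The Brier score is defined for each time point and hence can be integrated over the entire time range to obtain the IBS, $\mathrm{IBS} = \frac{1}{\max(t_i)}\int_0^{\max(t_i)} \mathrm{BS}(t)\,dt$. Models with smaller scores are preferred. We compute the integrated Brier score (IBS) using the ipred package.14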
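The censoring-weighted Brier score and its integral can be sketched in a few lines of standard-library Python. This is an illustrative reimplementation in the spirit of Graf et al,13 not the ipred code: the function names, the callable `surv_fn(i, t)` returning the predicted survival probability of patient `i` at time `t`, and the trapezoidal integration over a user-supplied grid are all assumptions made for the sketch.

```python
def km_censoring(times, events):
    """Kaplan-Meier estimate G(t) of the censoring distribution,
    obtained by treating censorings (event == 0) as the 'events'."""
    steps = []  # (time, value of G from that time onward)
    g = 1.0
    for s in sorted(set(times)):
        at_risk = sum(1 for t in times if t >= s)
        n_cens = sum(1 for t, d in zip(times, events) if t == s and d == 0)
        g *= 1.0 - n_cens / at_risk
        steps.append((s, g))

    def G(t):
        value = 1.0
        for s, gs in steps:
            if s > t:
                break
            value = gs
        return value

    return G

def brier_score(t, times, events, surv_at_t):
    """Censoring-weighted Brier score at time t.
    surv_at_t[i] is the predicted probability that patient i survives past t."""
    G = km_censoring(times, events)
    total = 0.0
    for ti, di, s in zip(times, events, surv_at_t):
        if ti <= t and di == 1:
            total += s ** 2 / G(ti)          # died before t: truth is 0
        elif ti > t:
            total += (1.0 - s) ** 2 / G(t)   # still alive at t: truth is 1
        # Patients censored before t contribute nothing directly.
    return total / len(times)

def integrated_brier_score(grid, times, events, surv_fn):
    """Trapezoidal integral of BS(t) over an increasing grid, divided by its span."""
    n = len(times)
    bs = [brier_score(t, times, events, [surv_fn(i, t) for i in range(n)]) for t in grid]
    area = sum(0.5 * (bs[k] + bs[k + 1]) * (grid[k + 1] - grid[k])
               for k in range(len(grid) - 1))
    return area / (grid[-1] - grid[0])

# Hypothetical toy data: four patients, the second one censored.
times, events = [1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1]
print(brier_score(2.5, times, events, [0.5, 0.5, 0.5, 0.5]))
```

A grid running from 0 to the largest observed time reproduces the usual normalization by the maximum follow-up time. An uninformative prediction of 0.5 for every patient scores 0.25 at every time point, which gives a useful reference level when reading IBS values.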
In addition, we compute the mean squared prediction error (MSPE) by comparing the observed data with their posterior predicted values.
From the TCGA database, we consider the information of 222 breast tumor samples with their survival data. We observe that at least 82% of the data are right censored.
Along with the clinical observations, we also collected measurements of 12 lncRNA expressions (Table 1). Among those, we found the CNV information available for 9 genes (or lncRNAs). We also consider 64 target protein expressions for these genes.
Table 1.
Gene | Function |
---|---|
BCAR4 a | Oncogenic, promotes invasion and metastasis15 |
BCYRN1 | Oncogenic, promotes tumor progression16 |
GAS5 a | Tumor suppressor17 |
H19 a | Oncogenic, promotes proliferation and metastasis18 |
HOTAIR a | Oncogenic, promotes EMT, proliferation, and metastasis19 |
MALAT1 a | Oncogenic, promotes proliferation, invasion, and migration20 |
MEG3 a | Tumor suppressor, induces accumulation of p5321 |
PVT1 a | Oncogenic, promotes tumor progression22 |
SOX2OT | Oncogenic, promotes tumor growth and metastasis23 |
SRA1 a | Oncogenic24 |
UCA1 a | Oncogenic, promotes cell growth, suppresses the tumor suppressor p2725 |
XIST | Tumor suppressor26 |
Abbreviation: EMT, epithelial-mesenchymal transition.
a Copy number variation information is available. Among these lncRNAs, SRA1 transcribes both long noncoding and protein-coding RNAs, which are produced by alternative splicing.
We apply the integrative modeling to these data and obtain the results shown in Table 2. We notice that the mean squared prediction error and IBS are both lower for the proposed 3-stage model than for the 2-stage model, which omits the protein expressions from the analysis.
Table 2.
Models | MSPE | IBS |
---|---|---|
2-stage | 1.903 | 0.488 |
3-stage | 1.196 | 0.395 |
Abbreviations: IBS, integrated Brier score; MSPE, mean squared prediction error; TCGA, The Cancer Genome Atlas.
In this article, we have shown that survival prediction improves markedly when the expression of the lncRNAs' target proteins is taken into account: the MSPE drops from 1.903 to 1.196 and the IBS from 0.488 to 0.395 (Table 2). To this end, we have developed a simple yet integrative modeling strategy that borrows strength from all 3 platforms, namely DNA CNV, mRNA expression of the long noncoding genes, and their target protein expression, to predict the survival of the subjects. We have shown that this integrated model outperforms its closest competitor, the 2-stage model.
Acknowledgments
The authors thank the editor and the reviewers for their helpful suggestions which substantially improved this paper.
Footnotes
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: T.R.S. was supported through the NIH T32 Training grant (PI: Dr Raymond J Carroll); A.K.M. and B.K.M. were supported through NIH R01CA194391 (PI: Dr B.K.M.).
Declaration of conflicting interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions: TRS, AKM, and BKM designed the study. AKM and YN collected and analyzed the data. TRS and AKM wrote the manuscript.
ORCID iD: Tapasree Roy Sarkar https://orcid.org/0000-0001-8022-6760
References
- 1. Prensner JR, Chinnaiyan AM. The emergence of lncRNAs in cancer biology. Cancer Discov. 2011;1:391-407.
- 2. Zhang S, Wang J, Ghoshal T, et al. lncRNA gene signatures for prediction of breast cancer intrinsic subtypes and prognosis. Genes. 2018;9:65.
- 3. Sun M, Wu D, Zhou K, et al. An eight-lncRNA signature predicts survival of breast cancer patients: a comprehensive study based on weighted gene co-expression network analysis and competing endogenous RNA network. Breast Cancer Res Treat. 2019;175:59-75.
- 4. The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490:61-70.
- 5. Fan B, Dachrut S, Coral H, et al. Integration of DNA copy number alterations and transcriptional expression analysis in human gastric cancer. PLoS One. 2012;7:e29824.
- 6. Bass AJ, Watanabe H, Mermel CH, et al. SOX2 is an amplified lineage-survival oncogene in lung and esophageal squamous cell carcinomas. Nat Genet. 2009;41:1238-1242.
- 7. Nanjundan M, Nakayama Y, Cheng KW, et al. Amplification of MDS1/EVI1 and EVI1, located in the 3q26.2 amplicon, is associated with favorable patient prognosis in ovarian cancer. Cancer Res. 2007;67:3074-3084.
- 8. Scott KL, Kabbarah O, Liang MC, et al. GOLPH3 modulates mTOR signalling and rapamycin sensitivity in cancer. Nature. 2009;459:1085-1090.
- 9. Wang W, Baladandayuthapani V, Morris JS, et al. iBAG: integrative Bayesian analysis of high-dimensional multiplatform genomics data. Bioinformatics. 2012;29:149-159.
- 10. Chu SH, Huang YT. Integrated genomic analysis of biological gene sets with applications in lung cancer prognosis. BMC Bioinformatics. 2017;18:336.
- 11. Wu MC, Kraft P, Epstein MP, et al. Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet. 2010;86:929-942.
- 12. Liu L, Lei J, Sanders SJ, et al. DAWN: a framework to identify autism genes and subnetworks using gene expression and genetics. Mol Autism. 2014;5:22.
- 13. Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18:2529-2545.
- 14. Peters A, Hothorn T. Ipred: improved predictors (R package version 0.9-6). https://CRAN.R-project.org/package=ipred. Updated 2017.
- 15. Xing Z, Park PK, Lin C, Yang L. LncRNA BCAR4 wires up signaling transduction in breast cancer. RNA Biol. 2015;12:681-689.
- 16. Ren H, Yang X, Yang Y, et al. Upregulation of LncRNA BCYRN1 promotes tumor progression and enhances EpCAM expression in gastric carcinoma. Oncotarget. 2018;9:4851-4861.
- 17. Li S, Zhou J, Wang Z, Wang P, Gao X, Wang Y. Long noncoding RNA GAS5 suppresses triple negative breast cancer progression through inhibition of proliferation and invasion by competitively binding miR-196a-5p. Biomed Pharmacother. 2018;104:451-457.
- 18. Barsyte-Lovejoy D, Lau SK, Boutros PC, et al. The c-Myc oncogene directly induces the H19 noncoding RNA by allele-specific binding to potentiate tumorigenesis. Cancer Res. 2006;66:5330-5337.
- 19. Zhang H, Cai K, Wang J, et al. MiR-7, inhibited indirectly by lincRNA HOTAIR, directly inhibits SETDB1 and reverses the EMT of breast cancer stem cells by downregulating the STAT3 pathway. Stem Cells. 2014;32:2858-2868.
- 20. Xu S, Sui S, Zhang J, et al. Downregulation of long noncoding RNA MALAT1 induces epithelial-to-mesenchymal transition via the PI3K-AKT pathway in breast cancer. Int J Clin Exp Pathol. 2015;8:4881-4891.
- 21. Sun L, Li Y, Yang B. Downregulated long non-coding RNA MEG3 in breast cancer regulates proliferation, migration and invasion by depending on p53’s transcriptional activity. Biochem Biophys Res Commun. 2016;478:323-329.
- 22. Tang Y, He Y, Zhang P, et al. LncRNAs regulate the cytoskeleton and related Rho/ROCK signaling in cancer metastasis. Mol Cancer. 2018;17:77.
- 23. Shi XM, Teng F. Up-regulation of long non-coding RNA Sox2ot promotes hepatocellular carcinoma cell metastasis and correlates with poor prognosis. Int J Clin Exp Pathol. 2015;8:4008-4014.
- 24. Leygue E, Dotzlaw H, Watson PH, et al. Expression of the steroid receptor RNA activator in human breast tumors. Cancer Research. 1999;59:4190-4193.
- 25. Huang J, Zhou N, Watabe K, et al. Long non-coding RNA UCA1 promotes breast tumor growth by suppression of p27 (Kip1). Cell Death Dis. 2015;5:e1008.
- 26. Huang YS, Chang CC, Lee SS, Jou YS, Shih HM. Xist reduction in breast cancer upregulates AKT phosphorylation via HDAC3-mediated repression of PHLPP1 expression. Oncotarget. 2016;7:43256-43266.