Abstract
Summary
Cancer Gene and Pathway Explorer (CGPE) was developed to guide biological and clinical researchers, especially those with limited informatics and programming skills, in performing preliminary cancer-related biomedical research using transcriptional data and publications. CGPE provides three user-friendly online analytical and visualization modules that require no local deployment. The GenePub HotIndex applies natural language processing, statistics and association discovery to gene-specific PubMed publications, reporting gene-specific research trends, cancer type correlations, top related genes and a WordCloud of the publication profile. The OnlineGSEA enables Gene Set Enrichment Analysis (GSEA) and result visualization through an easy-to-follow interface for public or in-house transcriptional datasets, integrating the GSEA algorithm with preprocessed public TCGA and GEO datasets. The preprocessed datasets support gene set analysis with appropriate pathway alterations and gene signatures. The CellLine Search presents evidence-based guidance for cell line selection by combining information on cell line dependency, gene expression and pathway activity maps, which is valuable knowledge to have before conducting gene-related experiments. In summary, the CGPE webserver is a user-friendly, visual, intuitive and informative bioinformatics tool that allows biomedical researchers to perform efficient analyses and preliminary studies on in-house and publicly available bioinformatics data.
Availability and implementation
The webserver is freely available online at https://cgpe.soic.iupui.edu.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Massive genomic data offer advantages for phenotype marker identification, gene pattern discovery and pathway analysis (Ciriello et al., 2015; Liu et al., 2016). Interpreting the large numbers of genes and the vast volume of omics data is challenging. For more than ten years, researchers have developed databases and webservers, such as cBioPortal, UCSC Xena and GEPIA (Gao et al., 2013; Goldman et al., 2020; Tang et al., 2017), that use microarray and next-generation sequencing data to reveal alterations in cancer genomes. These tools enable biomedical researchers to rapidly and intuitively translate large-scale genomics data into biological insights.
However, for many experimental biologists and biomedical researchers, these existing tools leave several needs unmet. First, gene publication trends can help biological and biomedical researchers build the foundation of their future research. There are millions of publications, yet there is a lack of work on gene-specific research trends. Second, pathway alterations are essential in studying gene functions and mechanisms. Nevertheless, it is not easy to discover the pathway alterations induced by one or more genes with existing tools. Third, there are no integrated tools to identify the optimal cell lines for biological experiments, since the characteristics of cancer cell lines are currently dispersed across multiple platforms and resources. More than 1000 cancer cell lines are now available, so understanding their nature is an arduous yet imperative task. Using improper cell lines in experiments may waste resources and produce misleading results because of inter-cell-line heterogeneity (Mullard, 2018).
This project has developed a user-friendly online tool, the Cancer Gene and Pathway Explorer (CGPE), which integrates source data from PubMed, the Cancer Cell Line Encyclopedia (CCLE) (Barretina et al., 2012), DepMap (Meyers et al., 2017), TCGA and GEO gene expression data (Barrett et al., 2007) and in-house transcriptional datasets. CGPE offers an interactive and customizable analytical portal that addresses the above unmet needs with the following three modules: (i) Gene HotIndex for gene-specific PubMed-based research trend analysis, (ii) OnlineGSEA for gene signature and pathway alteration analysis and (iii) CellLine Search for optimal cancer cell line identification.
2 Webserver functionalities
2.1 The gene HotIndex module
The Gene HotIndex employs natural language processing to mine and categorize gene-related publications from the PubMed database. It provides an overall view of the current research status of the query gene. A user only needs to input the gene name of interest; acceptable gene names include HUGO gene symbols (e.g. STAT3) and ENSEMBL IDs (e.g. ENSG00000168610). The result page first shows basic information about the query gene, followed by a visualization panel with bar charts and an information box that present the number of publications and the top gene-related cancer types. In the interactive interface, moving the mouse over a cancer type shows detailed information on that cancer type and its publications. Clicking on a bar opens a separate PubMed page displaying the source publications. The WordCloud plot shows the keywords and research topics based on the query gene’s publication profile. A lollipop plot displays the frequencies of co-studied genes, obtained by mining gene-gene co-occurrence in publications. The data processing methods for the PubMed data are explained in the Supplementary Materials.
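The co-studied gene frequencies can be illustrated with a minimal sketch. It assumes that each publication has already been reduced to the set of gene symbols it mentions; the function name `co_studied_genes` and the toy publication sets are hypothetical, and the actual CGPE processing pipeline is the one described in the Supplementary Materials, not this code.

```python
from collections import Counter

def co_studied_genes(query, publications, top_n=5):
    """Count genes mentioned in the same publication as `query`.

    `publications` is an iterable of gene-symbol collections, one per
    publication (an assumed, already-extracted representation).
    """
    counts = Counter()
    for genes in publications:
        genes = set(genes)  # count each gene at most once per publication
        if query in genes:
            counts.update(genes - {query})
    return counts.most_common(top_n)

# Toy input: four hypothetical publications and the genes they mention.
pubs = [
    {"STAT3", "JAK2", "IL6"},
    {"STAT3", "JAK2"},
    {"TP53", "MDM2"},
    {"STAT3", "IL6", "JAK2"},
]
top = co_studied_genes("STAT3", pubs)
for gene, n in top:
    print(f"{gene}\t{n}")
```

Each (gene, count) pair corresponds to one stick of a lollipop plot; publications that do not mention the query gene contribute nothing.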
2.2 The OnlineGSEA module
The OnlineGSEA is a web-based implementation of the Gene Set Enrichment Analysis (GSEA) tool (Subramanian et al., 2005). It helps researchers interpret the mechanism of a gene or a gene set related to cancer cell development using thousands of integrated and preprocessed publicly available cancer patient samples from the GEO and TCGA databases. It also provides a portal for users to upload their in-house data for GSEA. CGPE currently incorporates three cancer types for public use: breast cancer, colorectal cancer and ovarian cancer.
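For orientation, the running-sum statistic at the core of GSEA can be sketched in a few lines. This is the unweighted form (weighting exponent p = 0) of the enrichment score from Subramanian et al. (2005), written as a standalone illustration; it is not the code that runs on the CGPE server, it omits the permutation-based significance testing, and the function name and toy gene labels are assumptions.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted GSEA enrichment score (Kolmogorov-Smirnov-like running sum).

    Walk down the ranked gene list, stepping up for genes in `gene_set`
    and down otherwise; return the largest deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Toy ranked list, ordered from most up-regulated to most down-regulated.
ranked = ["A", "B", "C", "D", "E", "F"]
es_top = enrichment_score(ranked, {"A", "B"})     # set concentrated at the top
es_bottom = enrichment_score(ranked, {"E", "F"})  # set concentrated at the bottom
print(es_top, es_bottom)
```

A positive score indicates that the gene set is concentrated near the top of the ranking, and a negative score that it is concentrated near the bottom.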
The OnlineGSEA can investigate the functions of a single gene or a set of genes. For a single gene, users first select one preprocessed public dataset or upload their own data. After a query gene is entered, GSEA runs with its default settings, which the user can also customize. The patient samples in the chosen dataset are divided into two groups according to the query gene’s expression value, using the median as the cutoff. The underlying GSEA algorithm then detects the pathways enriched between these two groups of samples. For a set of genes, the OnlineGSEA provides two methods to divide samples into two groups. One uses the Agglomerative Hierarchical Clustering (AHC) algorithm based on the genes’ expression values (Oyelade et al., 2016). The other applies Overlap By Gene Expression (OBGE), which allows users to define the control and test groups based on gene expression levels. Users specify whether samples in the control or test group should have high or low expression, with the median expression value as the cutoff.
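The two median-based grouping strategies can be sketched as follows. This is an illustrative reading rather than the server's code: the function names are hypothetical, samples whose value equals the median are assigned to the low group (the text does not say how ties are handled), and the OBGE step is assumed to keep only the samples that satisfy every per-gene high/low rule.

```python
from statistics import median

def split_by_median(values):
    """Split samples into (high, low) sets around the median expression.

    `values` maps sample ID -> expression of one gene. Ties with the
    median go to the low group (an assumption).
    """
    cutoff = median(values.values())
    high = {s for s, v in values.items() if v > cutoff}
    low = set(values) - high
    return high, low

def overlap_group(expr, rules):
    """Samples meeting every rule, e.g. {"STAT3": "high", "JAK2": "low"}.

    `expr` maps gene -> {sample ID -> expression}. The intersection across
    genes is our assumed interpretation of "overlap" in OBGE.
    """
    group = None
    for gene, direction in rules.items():
        high, low = split_by_median(expr[gene])
        chosen = high if direction == "high" else low
        group = chosen if group is None else group & chosen
    return group or set()

# Toy expression values for four hypothetical samples.
expr = {
    "STAT3": {"S1": 2.0, "S2": 8.5, "S3": 5.1, "S4": 9.3},
    "JAK2": {"S1": 1.0, "S2": 7.0, "S3": 6.0, "S4": 2.0},
}
high, low = split_by_median(expr["STAT3"])  # single-gene case
test_group = overlap_group(expr, {"STAT3": "high", "JAK2": "high"})
control_group = overlap_group(expr, {"STAT3": "low", "JAK2": "low"})
print(sorted(high), sorted(low), sorted(test_group), sorted(control_group))
```

Under this reading, samples that match neither the test nor the control definition (S3 and S4 in the toy data) are left out of the comparison.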
After the onlineGSEA analysis is submitted online, a unique analysis ID will be generated for the user. The analytical results can be retrieved or shared anytime online using the unique analysis ID on the Result Viewer page. The online server will display the information of the enriched pathways with horizontal bar charts and the plots of the top 8 enrichments on the web. Users can also download the entire GSEA analytical result from the Result Viewer page.
2.3 The CellLine search module
The CellLine Search aims to give biomedical researchers a convenient tool for identifying suitable cancer cell lines for cancer-related experiments. It integrates gene-specific cancer cell line information from several data resources, including CCLE and DepMap, with interactive visualizations of the search and analytical results.
To use this module, a user needs to select the cancer type and pathway database (KEGG or REACTOME) and input the gene name (HUGO Symbol or ENSEMBL ID). The search result page provides information for cell line selection with intuitive and interactive visualizations, which consist of four sections:
(i) The general information about the query gene, such as gene aliases, map location, exon count, etc.;
(ii) The inter-related dual-bar charts, which interactively demonstrate the query gene’s mRNA expression level and gene-cell line dependency scores across all cell lines;
(iii) The publication profile of the query gene on cancer cell lines;
(iv) The pathway activity heatmap inferred from Gene Set Variation Analysis (GSVA) (Hänzelmann et al., 2013).
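As an illustration of how expression and dependency might be read together when shortlisting cell lines, the sketch below filters out lines with low expression of the query gene and sorts the rest by dependency score. It assumes the DepMap convention that a more negative score means the cell line depends more strongly on the gene; the record layout, the `min_expression` threshold, the cell line names and all values are made up for the example, and CGPE itself presents these data as interactive charts rather than as a ranked list.

```python
def rank_cell_lines(records, min_expression=0.0):
    """Keep cell lines expressing the gene, most dependent first.

    Each record is a dict with keys "cell_line", "expression" and
    "dependency" (a hypothetical layout). More negative dependency is
    treated as stronger dependence on the gene.
    """
    eligible = [r for r in records if r["expression"] >= min_expression]
    return sorted(eligible, key=lambda r: r["dependency"])

# Placeholder cell lines with invented values.
records = [
    {"cell_line": "LINE_A", "expression": 6.2, "dependency": -0.9},
    {"cell_line": "LINE_B", "expression": 0.4, "dependency": -1.2},
    {"cell_line": "LINE_C", "expression": 5.1, "dependency": -0.1},
]
ranked = rank_cell_lines(records, min_expression=1.0)
names = [r["cell_line"] for r in ranked]
print(names)
```

LINE_B has the strongest dependency score but is excluded because it barely expresses the gene, which is the kind of trade-off the dual-bar charts are meant to make visible.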
3 Discussion
By focusing on facilitating preliminary cancer research with bioinformatics tools, CGPE integrates data and analytical tools, including PubMed, GEO, TCGA, DepMap and CCLE. The CGPE webserver provides experimental biologists with a user-friendly exploratory tool with interactive visualization. The three CGPE modules cover PubMed publication trends, gene set enrichment analysis for inferring the function of a human gene (or gene set) from public datasets, and cell line search based on target genes. These three primary CGPE functions are logically related and cover three essential tasks in cancer-related preliminary research. The online CGPE can ease biomedical researchers' effort to collect, process and analyze publicly available data during the initial research phase. It can complement other powerful bioinformatics tools.
Funding
This work was partially supported by the IUSM Alzheimer's Disease Drug Discovery Center (NIH grant # 1U54AG065181-01).
Conflict of Interest: none declared.
Supplementary Material
Contributor Information
Jiannan Liu, Department of BioHealth Informatics, Indiana University School of Informatics and Computing, Indianapolis, IN 46202, USA.
Chuanpeng Dong, Department of BioHealth Informatics, Indiana University School of Informatics and Computing, Indianapolis, IN 46202, USA.
Yunlong Liu, Department of Medical and Molecular Genetics, IU School of Medicine, Indianapolis, IN 46202, USA.
Huanmei Wu, Department of BioHealth Informatics, Indiana University School of Informatics and Computing, Indianapolis, IN 46202, USA; Temple University College of Public Health, Philadelphia, PA 19122, USA.
References
- Barretina J. et al. (2012) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483, 603–607.
- Barrett T. et al. (2007) NCBI GEO: mining tens of millions of expression profiles—database and tools update. Nucleic Acids Res., 35, D760–D765.
- Ciriello G. et al. (2015) Comprehensive molecular portraits of invasive lobular breast cancer. Cell, 163, 506–519.
- Gao J. et al. (2013) Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal, 6, pl1–pl1.
- Goldman M.J. et al. (2020) Visualizing and interpreting cancer genomics data via the Xena platform. Nat. Biotechnol., 38, 675–678.
- Hänzelmann S. et al. (2013) GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics, 14, 7.
- Liu C. et al. (2016) Personalised pathway analysis reveals association between DNA repair pathway dysregulation and chromosomal instability in sporadic breast cancer. Mol. Oncol., 10, 179–193.
- Meyers R.M. et al. (2017) Computational correction of copy number effect improves specificity of CRISPR–Cas9 essentiality screens in cancer cells. Nat. Genet., 49, 1779–1784.
- Mullard A. (2018) Can you trust your cancer cell lines? Nat. Rev. Drug Disc., 17, 613–614.
- Oyelade J. et al. (2016) Clustering algorithms: their application to gene expression data. Bioinf. Biol. Insights, 10, BBI.S38316.
- Subramanian A. et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA, 102, 15545–15550.
- Tang Z. et al. (2017) GEPIA: a web server for cancer and normal gene expression profiling and interactive analyses. Nucleic Acids Res., 45, W98–W102.