Abstract
Esophageal cancer is the seventh most prevalent and the sixth most lethal cancer. Esophageal squamous cell carcinoma (ESCC) is the major esophageal cancer subtype and accounts for 87 % of all cases. However, its molecular mechanism remains unclear. Here, we present an integrated database for ESCC called ESCCdb, which integrates 56 datasets and published studies drawn from the GEO, Xena, and SRA databases and from related publications. It lets users explore a gene of interest through multiple graphical and interactive views with one click. The results comprise expression changes across 20 datasets, copy number alterations in 11 datasets, somatic mutations from 12 papers, related drugs derived from DGIdb, related pathways, and gene correlations. ESCCdb enables direct cross-dataset comparison of a gene’s mutations, expression, and copy number changes, which allows users to easily assess the gene’s alterations in ESCC. Furthermore, survival analysis, drug-gene relationships, and results from whole-genome CRISPR/Cas9 screening can help users determine clinical relevance, derive functional inferences, and identify potential drugs. Notably, ESCCdb also enables exploration of the correlation structure and identification of potential key regulators of a process. Finally, we identified 789 consistently differentially expressed genes and summarized recurrently mutated genes and genes affected by significant copy number alterations. These genes may be stable biomarkers or important players in ESCC development. ESCCdb fills the gap between massive omics data and users’ needs for integrated analysis and can promote basic and clinical ESCC research. The database is freely accessible at http://cailab.labshare.cn/ESCCdb.
Keywords: Esophageal squamous cell carcinoma, Multi-omics, Webserver, Transcription factor, Consistently differential expressed genes
1. Introduction
Esophageal cancer is a highly prevalent cancer type with an estimated 572,000 new cases and 509,000 deaths worldwide in 2018, accounting for 3.2 % of all cancer cases and 5.3 % of all cancer deaths, respectively [1]. As the major histological type, esophageal squamous cell carcinoma (ESCC) accounts for about 87 % of all esophageal cancers. It is a critical health-threatening disease owing to its low survival rate; the five-year survival rate is about 20 %, falling to even lower than 5 % in some low- and middle-income countries [2]. Therefore, there is an urgent need to study the etiology, mechanisms, prognostics, and treatment options of ESCC.
Recently, whole-genome and whole-exome sequencing identified somatic mutations and copy number changes that usually disrupt several cancer-related pathways in ESCC. These include the p53 signaling, PI3K/AKT, RTK-Ras, cell cycle, Wnt, and Notch pathways [2]. Some of the candidates identified in these studies have been confirmed to be involved in the progression of ESCC. For instance, ZNF750 is frequently mutated in ESCC and has recently been characterized as a lineage-specific tumor suppressor in ESCC [3]. However, the molecular mechanisms of ESCC still need to be explored in detail.
In recent years, rapid development in high-throughput sequencing technologies has generated an enormous amount of biological data, which is distributed across databases such as The Cancer Genome Atlas (TCGA), Sequence Read Archive (SRA), and Gene Expression Omnibus (GEO). Several web servers based on TCGA, such as GEPIA (http://gepia.cancer-pku.cn/) [4], Xena (http://xena.ucsc.edu/) [5], and cBioPortal (https://www.cbioportal.org/) [6], provide advanced functionality for visualization and analysis of the TCGA datasets. However, there are only 96 ESCC cases in TCGA and only 227 cases in cBioPortal. Moreover, many ESCC-related datasets in the GEO database are not processed according to standard protocols. The commercial Oncomine database [7] only includes a few expression chip and copy number variation (CNV) datasets for ESCC (fewer than 10 datasets). GEO provides the GEO2R [8] tool to aid users in group comparisons and in identifying differentially expressed genes (DEGs). However, DEG analysis usually yields hundreds to thousands of DEGs, which makes comparisons across multiple datasets difficult for most experimental researchers. In addition, owing to the random nature of biological processes and the introduction of inadvertent bias during experiments, the DEGs called from a single dataset may not be a stable signature for ESCC. The large number of ESCC datasets now available in the GEO database provides an alternative: identifying stable DEGs, highly correlated genes, and potential regulators by cross-dataset comparison.
To facilitate effective ESCC research, we present a comprehensive database called ESCCdb that integrates gene/miRNA expression, copy number changes, and somatic mutations of ESCC. The genes in the database are also linked to gene-drug relationships from DGIdb [9], results of cancer driver screening [10], cell-fitness-related genes from CRISPR-based genome-wide screening [11], and DNA-damage-resistance genes [12]. It also provides a workflow to explore the correlation structure of gene expression data and identify potential key regulators of correlated genes. ESCCdb can help experimental and clinical researchers determine whether a gene is of interest, thus informing the usefulness of further labor-intensive experiments. Furthermore, exploring gene correlations in the system may aid researchers in finding potential key regulators of a biological process. ESCCdb will thus promote research on the molecular mechanisms underlying ESCC and provide significant insights into this lethal cancer.
2. Materials and methods
2.1. Gene and miRNA expression data processing
We downloaded raw gene and miRNA chip data for ESCC from the GEO database (https://www.ncbi.nlm.nih.gov/geo) (Supplementary Table S1). For Affymetrix expression chips, we used the ReadAffy function of the affy package [13] to read in raw data. For Agilent expression chips, we used the read.maimages function of the limma package [14]. We evaluated chip quality by MA plot, relative log expression (RLE) plot, and normalized unscaled standard errors (NUSE) plot. Low-quality chips with a skewed MA plot or a large deviation in RLE or NUSE were removed. The raw data were then subjected to background correction, followed by within-array normalization and between-array normalization by the RMA method in the limma package. For one-channel chips, within-array normalization was not performed. Principal components analysis was used to check for and remove intermingled samples from the tumor and normal groups. Probe annotations were mostly obtained from GEO annotation. However, GSE53622 and GSE53624 did not provide gene annotations, so a re-annotation was used [15]. Probes with low expression were filtered out by the nsFilter function from the genefilter package where applicable. For genes with multiple mapped probes, the probe with the highest expression level was retained. The limma package was used to perform DEG analysis between tumor and normal samples. For RNAseq data, raw fastq files were downloaded from the SRA database (https://www.ncbi.nlm.nih.gov/sra). FastQC was used to evaluate the quality of the fastq files. Gene expression was quantified using Salmon-SMEM in the RNAcocktail [16] workflow. The quantification data were subsequently imported into R by the tximport package [17], and DEG analysis was finally carried out by DESeq2 [18].
For the normalized RSEM expression values of the TCGA-ESCA dataset from the Xena database (https://xenabrowser.net/datapages/), the values were rounded to the nearest integer and DEGs were identified using DESeq2.
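The probe-collapsing rule described above (one probe retained per gene) can be illustrated with a minimal Python sketch. It assumes that "highest expression level" means the highest mean expression across samples; the text does not specify the summary statistic, and the function and variable names are hypothetical.

```python
from statistics import mean


def collapse_probes(expr, probe_to_gene):
    """Keep, for each gene, the probe with the highest mean expression.

    expr: dict probe_id -> list of normalized expression values (one per sample)
    probe_to_gene: dict probe_id -> gene symbol (unannotated probes are absent)
    Assumption: probes are ranked by their mean across samples.
    """
    best = {}  # gene -> (mean expression, probe_id)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # skip probes without a gene annotation
        score = mean(values)
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, probe)
    # Return one expression vector per gene, taken from the retained probe
    return {gene: expr[probe] for gene, (_, probe) in best.items()}
```

A median or maximum could be substituted for the mean if that better matches a given platform; the selection logic is otherwise unchanged.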
2.2. High-throughput pathway inference analysis
Based on normalized gene expression data from 13 expression chip datasets, we used HiPathia to infer Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway activation/deactivation based on signal transduction along sub-paths [19]. The resulting interactive pathway views were then incorporated into the ESCCdb webpage. The webpage can further generate static plots if the client browser supports the “import” property.
2.3. Survival analysis
Three expression datasets (GSE53622, GSE53624, and TCGA) provide survival information for patients. We grouped the patients into a high-expression group (≥ 3rd quartile) and a low-expression group (≤ 1st quartile) and performed survival analysis using the survival R package.
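As an illustration of the grouping step, the following Python sketch splits patients at the first and third quartiles and computes a Kaplan-Meier estimate for one group. It is a plain-Python stand-in rather than the survival R package used here, it assumes the cut-offs are quartiles computed with the "inclusive" method, and it omits the significance test between groups. All names are hypothetical.

```python
from statistics import quantiles


def split_by_quartile(expression):
    """Return (high, low) patient ID lists.

    expression: dict patient_id -> expression value of the query gene
    high: value >= 3rd quartile; low: value <= 1st quartile.
    """
    q1, _, q3 = quantiles(list(expression.values()), n=4, method="inclusive")
    high = [p for p, v in expression.items() if v >= q3]
    low = [p for p, v in expression.items() if v <= q1]
    return high, low


def kaplan_meier(times, events):
    """Kaplan-Meier curve as a list of (time, survival probability).

    times: follow-up time per patient
    events: 1 if death was observed, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect all patients whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

Patients between the two quartiles are left out of both groups, which matches the high versus low comparison described above.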
2.4. Copy number data processing
We downloaded raw data for several SNP array or arrayCGH platforms from the GEO database (Supplementary Table S1). For platforms such as the Affymetrix SNP5, Mapping 500k, and Mapping 250k arrays, we used the aroma.affymetrix package [20] to call copy number changes. For other platforms, we used the R package snapCGH to call copy number changes. The resulting circular binary segmentation results were converted to hg19 coordinates with the liftOver utility. GISTIC2 [21] was used to summarize copy number changes for each dataset and identify gene-level copy number alterations. Normalized gene-level copy number results from GISTIC2 for the TCGA-ESCA dataset were obtained from the Xena database.
2.5. Mutation data processing
We collected gene mutation information mainly from the supplementary tables of several related publications (Supplementary Table S1). The mutations were then annotated by ANNOVAR [22]. Mutation accumulation was inferred based on single nucleotide variations using the mutation set enrichment analysis method [23]. Domains and protein features were extracted from UniProtKB and presented as colored blocks along the protein. For multi-region sequencing, the mutations were summarized at the patient level, i.e., mutations detected in different regions from the same patient were merged in the plot. The original mutation details were presented in the corresponding tables.
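The patient-level summary for multi-region sequencing can be sketched as follows. This is a minimal illustration with hypothetical record fields, not the pipeline used to build the database; the 5 % default mirrors the threshold used on the recurrent mutations page.

```python
from collections import defaultdict


def patient_level_mutations(records):
    """Merge multi-region calls so each patient contributes a mutation once.

    records: iterable of (patient_id, region_id, gene, variant) tuples
    Returns: dict gene -> set of patient IDs with at least one mutation
    """
    merged = set()
    for patient, _region, gene, variant in records:
        # The region is dropped, so the same variant seen in two regions
        # of one tumor collapses to a single patient-level entry
        merged.add((patient, gene, variant))
    carriers = defaultdict(set)
    for patient, gene, _variant in merged:
        carriers[gene].add(patient)
    return dict(carriers)


def recurrent_genes(carriers, n_patients, min_fraction=0.05):
    """Genes mutated in more than min_fraction of patients."""
    return sorted(
        gene for gene, patients in carriers.items()
        if len(patients) / n_patients > min_fraction
    )
```

Counting patients rather than sequenced regions keeps heavily sampled tumors from inflating a gene's mutation frequency.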
2.6. KEGG, reactome and transcription factor enrichment analysis
For highly correlated genes, we provide an enrichment analysis module that can carry out enrichment analysis for KEGG or Reactome terms. This analysis was carried out using the clusterProfiler package [24]. Transcription factor binding sites (TFBS) on gene promoter (−1000,+100 range) were collected from the gene transcription regulation database (GTRD) [25]. For each transcription factor (TF), the enrichment score (ES) and significance (p) were calculated using a bootstrap method as follows:
There are N genes in a query. To calculate the ES and p-value, we randomly selected a subset of N genes from all genes covered by GTRD and repeated this selection 1000 times. In the two equations, B is the number of binding sites for the TF on the query gene promoters, while b_i is the number of binding sites for the TF on the i-th random set of genes. We consider a TF enriched when ES > 2 and p < 0.05.
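The resampling procedure can be sketched in Python. The two equations are not reproduced in this text, so the sketch assumes one plausible reading: ES is the ratio of B to the mean of the b_i, and p is the fraction of random sets with b_i ≥ B. Both definitions are assumptions made for illustration, as are the function name, the input format, and the fixed random seed.

```python
import random
from statistics import mean


def tfbs_enrichment(query_genes, site_counts, n_iter=1000, seed=0):
    """Resampling-based enrichment of one TF's binding sites.

    query_genes: list of N gene symbols
    site_counts: dict gene -> number of binding sites of the TF in the
                 promoter, for every gene in the background
    Assumed definitions (not confirmed by the text):
        ES = B / mean(b_i)
        p  = (number of random sets with b_i >= B) / n_iter
    """
    rng = random.Random(seed)
    universe = sorted(site_counts)
    n = len(query_genes)
    observed = sum(site_counts.get(g, 0) for g in query_genes)  # B
    null = []
    for _ in range(n_iter):
        sample = rng.sample(universe, n)  # random set of N genes
        null.append(sum(site_counts[g] for g in sample))  # b_i
    mean_null = mean(null)
    es = observed / mean_null if mean_null > 0 else float("inf")
    p = sum(b >= observed for b in null) / n_iter
    return es, p
```

Under these assumptions, a TF would be reported as enriched when the returned values satisfy ES > 2 and p < 0.05.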
2.7. Database content
ESCCdb is a comprehensive database for ESCC that integrates gene expression (21 datasets, 1013 samples), miRNA expression (9 datasets, 682 samples), copy number changes (11 datasets, 961 samples), survival data (3 datasets, 448 samples), single nucleotide variations and indels (12 papers, 1421 samples), several high-throughput CRISPR screens (3 papers), and drug-gene interactions from DGIdb. We uniformly processed these data and stored them in a MySQL database for rapid access (Fig. 1 and Table S1).
For gene expression and miRNA expression datasets, we obtained normalized expression values of genes and miRNAs and carried out differential expression analysis. In addition, we computed gene-gene correlations in 13 gene expression chip datasets and miRNA-gene correlations from two pairs of miRNA and gene expression datasets. To give insight into pathway activity changes in ESCC, high-throughput pathway inference analysis was also carried out in the 13 chip datasets using the hiPathia package [19]. For arrayCGH and SNP arrays, we obtained normalized log2 ratios and further processed them with GISTIC2 to obtain gene-level copy number estimates. For expression data with survival information, Kaplan-Meier analysis was applied to study the relationship between gene expression and overall survival. For somatic mutations, ANNOVAR was used to annotate the mutations. In addition, mutation distributions along each transcript were generated and mapped onto the encoded protein, along with the protein's domain/feature annotations.
2.8. Database implementation
ESCCdb is implemented with HTML5, PHP, jQuery, and Ajax. MySQL is used as the backend database. R and Python scripts were used to perform backend analysis. JavaScript libraries, including Highcharts and InCHlib, were used to support interactive visualization and presentation.
To facilitate quick queries, most of the supporting data were stored in a relational MySQL database. These include the normalized expression data, copy number changes for individual genes, normalized miRNA expression data, statistics of DEG analysis, Pearson’s correlation coefficients (PCC) of paired genes calculated from 13 expression chip datasets, gene mutations and corresponding annotations, and protein structure annotations extracted from the UniProtKB database (https://www.uniprot.org/).
3. Results and discussion
To aid the interpretation of the various data in ESCCdb, an accompanying webserver was developed. It enables the user to browse the whole gene set in the “Browse” interface, and to search for a specific gene/miRNA or carry out a promoter scan on a set of genes in the “Search” interface. The results pages present rich information with multiple graphs to aid users in assessing the importance of the gene. The following sections introduce these functionalities in detail.
3.1. Search a gene in one click
ESCCdb was designed to provide rich information from multi-omics data for a gene in a way that enables cross-dataset comparison. In the “Search” interface, the user can search for a gene (we use AURKA as an example) with one click. The results page presents several types of information related to the gene. These include:
a) Basic information and the associated pathway(s) for the query gene. This field presents basic annotations, related drugs, and involvement in cancer, DNA toxicity, and cell fitness in large-scale screens. Clicking the pathway “view” button redirects users to a new page showing hiPathia pathway plots for 13 pre-processed datasets (Fig. 2A). AURKA, which is involved in the oocyte meiosis pathway (hsa04114), is over-expressed in most datasets, suggesting that the oocyte meiosis pathway may be activated in ESCC.
b) Expression changes of the query gene in multiple datasets (Fig. 2B). The row is marked in red if the gene is significantly over-expressed (fold change > 2 and adjusted p < 0.05) or in green if it is significantly under-expressed (fold change < −2 and adjusted p < 0.05). AURKA was consistently over-expressed in most datasets, which suggests that it may be a stable biomarker for ESCC. Clicking the “Show” button displays a boxplot showing the expression distribution in the “Tumor” and “Control” groups.
c) Survival plots in the three datasets stratified by gene expression (lower quartile vs. upper quartile). For AURKA, one of the datasets (GSE53624) suggested that higher gene expression is correlated with a lower survival rate (Fig. 2C).
d) Copy number changes of the query gene in multiple datasets. Here we observed that our test gene AURKA underwent a copy number increase in about 38 % of cancer samples. Clicking the “Show” button displays a barplot showing the log2 ratios for each tumor sample (Fig. 2D).
e) Representative protein structure encoded by the query gene and the distribution of its mutations. In this plot, we find several mutations reported for our query gene AURKA. We also find that the deleterious mutations predicted by SIFT and PolyPhen2 were mainly located in the protein kinase domain (Fig. 3A). To view proteins encoded by other isoforms, the user can click the hyperlink “here” under the graph. The results page displays the plot and a detailed annotation of the mutations as a table.
f) The last part of the results page is the gene expression correlation table, which shows the most highly correlated genes ranked by mean PCC (Fig. 3B). Users can further investigate the correlated genes by clicking the hyperlink “here” under the table. This redirects the user to a new page that enables users to filter the correlated genes based on mean PCC, the number of supporting datasets (PCC > 0.3), and the number of datasets with PCC > 0.5. Detailed use of this feature is discussed in the next section.
Based on the information for our query gene, we can readily see that AURKA is consistently upregulated in ESCC, possibly owing to copy number changes; that it may be of prognostic value for overall survival; that knocking out this gene leads to altered tolerance to DNA toxicity; and that some drugs are available to inhibit this gene. These observations suggest that AURKA is a good prognostic marker and a possible drug target for ESCC.
3.2. Correlation and enrichment analysis for potential regulators
One of the main features of ESCCdb is correlation analysis and the subsequent enrichment analysis. It was specifically designed to help users take advantage of the correlation structure and identify potential master transcription factors for highly correlated genes. As an example, we analyzed tyrosine kinase 7 (PTK7), which has been implicated as an oncogene in ESCC [26]. We found that PTK7 was over-expressed in most of the datasets (Fig. 4A). Several genes were highly correlated with PTK7. Users can apply custom filters to further restrict the genes used in subsequent analysis, for example by setting a higher threshold for mean PCC and/or the number of supporting datasets. Upon clicking the “Show Heatmap” dropdown menu and choosing one dataset, the top 200 genes that are highly correlated with PTK7 can be visualized as an interactive heatmap (partial heatmap in Fig. 4B). Users can further manually restrict the gene set in the heatmap and use these genes (up to 200 genes) to perform enrichment analysis as explained in the following paragraph.
KEGG pathway and Reactome term enrichment analysis can be applied to highly correlated genes. For PTK7 and the top 200 correlated genes, we found enriched KEGG pathways that included DNA amplification and cell cycle, among others (Fig. 4C). As cell cycle control is an important process for cancer initiation and progression, we used the enriched cell cycle genes as a query for the promoter scan module to identify enriched transcription factor binding sites (TFBS) on their promoters. In the results, we found that several TFs had enriched binding sites. Among them, FOXM1 binds to all the query gene promoters and had an average PCC of 0.827 with the query genes (Fig. 5A-B). FOXM1 is a known oncogene that is involved in cell cycle and proliferation [27]. Using the same strategy, we further found that MYBL2 is a potential regulator of FOXM1. Interestingly, MYBL2 and FOXM1 are both involved in cell senescence, and there is evidence suggesting that they are co-regulated at the transcription level [28]. These results suggest that our TFBS enrichment analysis is capable of identifying potential regulators of highly correlated genes.
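The filtering criteria for correlated genes (mean PCC, number of supporting datasets with PCC > 0.3, and number of datasets with PCC > 0.5) can be expressed as a short Python sketch. The data layout, function names, and default filter values are hypothetical; only the 0.3 and 0.5 cut-offs come from the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)  # assumes neither vector is constant


def summarize_correlation(query, gene, datasets):
    """Summarize the correlation of `gene` with `query` across datasets.

    datasets: list of dict gene -> expression values, with samples in the
              same order for every gene within one dataset
    """
    pccs = [
        pearson(d[query], d[gene])
        for d in datasets
        if query in d and gene in d  # skip datasets lacking either gene
    ]
    return {
        "mean_pcc": sum(pccs) / len(pccs),
        "n_support": sum(r > 0.3 for r in pccs),  # supporting datasets
        "n_strong": sum(r > 0.5 for r in pccs),   # datasets with PCC > 0.5
    }


def passes_filter(summary, min_mean_pcc=0.5, min_support=2):
    """Hypothetical user-defined filter on the summary statistics."""
    return (summary["mean_pcc"] >= min_mean_pcc
            and summary["n_support"] >= min_support)
```

Requiring support from several datasets, rather than a high correlation in one, is what makes the resulting gene set robust to dataset-specific artifacts.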
3.3. Summary pages: Consensus DEGs, significant CNVs and recurrent mutations
To guide users to potentially important genes, we also constructed the following three summary pages:
The Consensus DEGs page maintains a list of 789 stable DEGs that are consistently changed in > 50 % of the transcriptome datasets. These included 365 consistently over-expressed genes, which are enriched in several known cancer-related pathways, including cell cycle, DNA replication, ECM-receptor interaction, p53 signaling, and cellular senescence (Fig. 6A, Supplementary Figure 1A). The stable DEGs also included 424 consistently under-expressed genes, whose enriched pathways include several metabolic and immune-related pathways, such as fatty acid metabolism, beta-alanine metabolism, drug metabolism-cytochrome P450, biosynthesis of unsaturated fatty acids, arachidonic acid metabolism, and the neutrophil degranulation pathway (Fig. 6B, Supplementary Figure 1B). The consistent up-regulation of cancer-related pathways may suggest an essential role of these pathways in ESCC development. Conversely, the down-regulation of some metabolic and immune pathways suggests the selective nature of metabolism and a possibly altered immune state in ESCC.
The Significant CNVs page presents the results of GISTIC2 analysis for the CNV datasets. These include the significant peaks and the affected genes in individual peaks (Supplementary Figure 1C). We identified 33 peaks with significant amplification. The genes in these peaks are enriched in collagen degradation pathways (Fig. 6C). Sixty-three peaks were significant deletions. The genes in these peaks are enriched in the digestion of dietary carbohydrate and olfactory signaling pathways (Fig. 6D).
The Recurrent mutations page records the genes mutated in more than 5 % of ESCC samples (Fig. 7A). These include many known ESCC-related genes such as TP53, CDKN2A, EP300, NOTCH1, KMT2D, FAT1, FAT3, PIK3CA, NFE2L2, CSMD1, and DNAH5. In addition, the Recurrent mutations page provides KEGG and Reactome pathway enrichment analysis of genes mutated in more than 1 % of ESCC samples (Fig. 7B-C). These genes are enriched in the focal adhesion, ECM-interaction, ABC transporters, circadian entrainment, calcium signaling, and thyroid hormone signaling pathways. The full list of enriched pathways and the genes in individual pathways are available on the webpage.
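The consensus DEG criterion can be sketched as follows. The per-dataset thresholds (signed fold change beyond ±2 and adjusted p < 0.05) and the > 50 % cut-off come from the text; the data layout is hypothetical, and the sketch assumes the fraction is taken over the datasets in which the gene was measured, which the text does not state.

```python
def classify(fold_change, adj_p, fc_cutoff=2.0, p_cutoff=0.05):
    """Call a gene 'up', 'down', or 'ns' in one dataset.

    fold_change is signed, so under-expression appears as a value below
    -fc_cutoff, as in the expression table of the results page.
    """
    if adj_p < p_cutoff and fold_change > fc_cutoff:
        return "up"
    if adj_p < p_cutoff and fold_change < -fc_cutoff:
        return "down"
    return "ns"


def consensus_degs(results, min_fraction=0.5):
    """Genes changed in the same direction in > min_fraction of datasets.

    results: dict gene -> list of (fold_change, adj_p), one entry per
             dataset in which the gene was measured (assumed denominator)
    Returns: dict gene -> 'up' or 'down'
    """
    consensus = {}
    for gene, stats in results.items():
        if not stats:
            continue  # gene not measured anywhere
        calls = [classify(fc, p) for fc, p in stats]
        for direction in ("up", "down"):
            if calls.count(direction) / len(calls) > min_fraction:
                consensus[gene] = direction
    return consensus
```

Because the cut-off is strictly greater than one half, a gene can never be called both consistently up and consistently down.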
4. Discussion
Multi-omics data are important resources that can promote molecular studies and facilitate the identification of biomarkers. Owing to stochastic fluctuations in cellular regulation and the introduction of bias during data preparation, a single experiment is prone to false discoveries. Thus, a platform that integrates and compares multiple datasets in parallel can identify consistent changes in ESCC progression. With this in mind, we first integrated multiple ESCC-related datasets, including 1695 cases with gene/miRNA expression, 961 cases with copy number changes, 1421 cases with somatic mutations, and 448 cases with clinical data. These data were annotated at the gene level and visualized with multiple graphs. Gene-drug relationships were extracted from DGIdb [9]. Several genome-wide CRISPR/Cas9-based screening studies were also included to inform users, with experimental evidence, of the possible role of a query gene in cell fitness and resistance to DNA toxicity [11], [12]. These annotations enable researchers to estimate the importance of a gene in ESCC progression. Some other databases, such as GEPIA [4] and Xena [5], can also analyze ESCC samples, but their data are derived from TCGA datasets, which contain only 96 ESCC samples (Supplementary Table 2). The cBioPortal database [6] has 227 ESCC samples and is limited to copy number analysis and survival analysis. The CCGD-ESCC database contains 2022 ESCC cases with genome-wide association study data, 675 cases with SNVs/indels from whole-genome or whole-exome sequencing, and 94 cases with expression data and clinical data [29]. However, CCGD-ESCC lacks graphical presentation of these data and in-depth analysis modules. ESCC ATLAS [30], which is the most similar to ESCCdb, supports queries of gene details, somatic mutations, CNVs, methylation, histone modification, related miRNAs, and GO terms.
Compared with ESCC ATLAS, ESCCdb offers the following advantages: a) much richer visual presentation of gene expression, CNVs, mutations, and correlations; b) clinical data for survival analysis; c) pathway enrichment analysis for correlated genes or a set of genes; d) transcription factor binding site enrichment analysis. We hope these advantages can further promote ESCC research.
The stable DEGs identified across multiple datasets are candidate biomarkers for ESCC diagnosis and/or prognosis. The over-expressed DEGs are enriched in many known cancer-related pathways, such as cell cycle, ECM-receptor interaction, and cellular senescence. Unexpectedly, the under-expressed DEGs are enriched in several metabolic pathways, some of which are related to the immune response. For example, arachidonic acid metabolism produces many metabolites that function in immune responses [31]. Immune cell infiltration is a well-known feature of cancer. Interestingly, neutrophil chemoattractants such as CXCL1, CXCL3, and CXCL8 were consistently over-expressed in ESCC, which may suggest the accumulation of neutrophils. However, interleukin 17 (encoded by IL17A, IL17B, IL17C, IL17D, and IL17F), which can induce antitumor immunity in ESCC [32], was not found to be consistently expressed in ESCC. The neutrophil degranulation pathway is also enriched in consistently under-expressed DEGs. Taken together, these observations suggest the accumulation of dysfunctional neutrophils in the ESCC tumor microenvironment.
Correlation patterns are an important feature of gene expression and have been explored in several cancer types [33], [34]. Therefore, the correlation patterns between each pair of genes, as well as between gene-miRNA pairs, were also constructed for ESCC. To further explore the correlation patterns, we designed an enrichment module to identify enriched TFBS in the promoters of a set of highly correlated genes, such as highly correlated genes involved in the same process. This approach may identify potential regulators of a process. In addition, the interactive, visual filter feature enables users to fine-tune the set of correlated genes and remove unwanted genes before TFBS enrichment analysis. Taken together, ESCCdb provides a one-click online resource to evaluate the importance of a gene in ESCC progression and its clinical implications. The TFBS enrichment module provides an additional perspective on the possible regulation of the genes.
To provide better services for ESCC researchers, we will continue to collect ESCC-related data and update ESCCdb. We will include information related to DNA methylation and gene expression in exosomes (GSE104926) and also explore the relationships between expression and genomic changes. As TFs can form complexes that coordinately promote gene expression, we will update our TFBS enrichment module to support enrichment analysis of multiple TF footprints in the promoters of a gene set. Thus, ESCCdb can be a very useful tool for ESCC researchers in biomarker discovery, candidate identification, regulator exploration, and identification of drug targets. This database will contribute to both basic research and clinical studies.
5. Conclusions
ESCCdb is an important resource for ESCC studies. It incorporates multiple omics data and several in-depth analysis modules, including DEG analysis, CNV analysis, mutational profiling, survival analysis, and enrichment analysis for pathways as well as transcription factors. ESCCdb integrates and visualizes these omics data for ESCC from more than 4000 samples in more than 50 datasets and published studies. It enables cross-dataset comparison in one click to assess how consistently a single gene is altered. Based on comparisons among 20 transcriptome datasets, we identified 789 consistent DEGs that may serve as biomarkers for ESCC diagnosis. Recurrently mutated genes and genes in significant amplifications or deletions are also available in summary pages. The annotations from whole-genome CRISPR/Cas9 screening and gene-drug relationships provide further information on the functional aspects and potential clinical relevance of a gene. Notably, the unique transcription factor binding site enrichment analysis module, which is integrated with the correlation patterns of gene expression, can serve as a mining tool for identifying potential regulatory transcription factors for a biological process or a set of genes. This will promote mechanistic studies of ESCC. Taken together, ESCCdb is an important database as well as a useful analysis platform for ESCC research.
Funding
This work was supported by the National Natural Science Foundation of China (32100441), the Innovative Spark Project of Sichuan University (2018SCUH0016), and Sichuan University Postdoctoral Interdisciplinary Innovation Fund (0020404153020).
CRediT authorship contribution statement
CW, GW, YG, MW, HL, KW, YW, and YH participated in data collections and workflow construction. JY, LB, LD and HC performed the analysis, constructed the database, and designed the web pages. JY and HC co-wrote the manuscript. JY and ZX designed and coordinated the project. All authors reviewed and approved the final manuscript.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
We would like to thank Ying Wu for her help in data collection.
Footnotes
Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2023.03.026.
Contributor Information
Jian Yang, Email: yangjian89@scu.edu.cn.
Haoyang Cai, Email: haoyang.cai@scu.edu.cn.
Zhixiong Xiao, Email: jimzx@scu.edu.cn.
Data availability
ESCCdb is freely available at http://cailab.labshare.cn/ESCCdb.
References
- 1.Bray F., Ferlay J., Soerjomataram I., et al. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68(6):394–424. doi: 10.3322/caac.21492. [DOI] [PubMed] [Google Scholar]
- 2.Reza Malekzadeh Christian C., Abnet, Dawsey Sanford M. In: World Cancer Report 2020: Cancer research for cancer prevention. Christopher P., Wild Elisabete, Weiderpass, Stewart Bernard W., editors. International Agency for Research on Cancer; Lyon, France: 2020. Oesophageal cancer: a tale of two malignancies; pp. 323–332. [Google Scholar]
- 3.Wang X., Jia Y., Deng H., et al. Intratumoral heterogeneity of esophageal squamous cell carcinoma and its clinical significance. Pathol Res Pract. 2019;215(2):308–314. doi: 10.1016/j.prp.2018.11.019. [DOI] [PubMed] [Google Scholar]
- 4.Tang Z., Li C., Kang B., et al. GEPIA: a web server for cancer and normal gene expression profiling and interactive analyses. Nucleic Acids Res. 2017;45(W1):W98–W102. doi: 10.1093/nar/gkx247. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Goldman M.J., Craft B., Hastie M., et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol. 2020;38(6):675–678. doi: 10.1038/s41587-020-0546-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Gao J., Aksoy B.A., Dogrusoz U., et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal. 2013;6(269):pl1. doi: 10.1126/scisignal.2004088. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Rhodes D.R., Yu J., Shanker K., et al. ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia. 2004;6(1):1–6. doi: 10.1016/s1476-5586(04)80047-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Barrett T., Wilhite S.E., Ledoux P., et al. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. (Database Issue) 2013;41 doi: 10.1093/nar/gks1193. D991-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Cotto K.C., Wagner A.H., Feng Y.Y., et al. DGIdb 3.0: a redesign and expansion of the drug-gene interaction database. Nucleic Acids Res. 2018;46(D1):D1068–D1073. doi: 10.1093/nar/gkx1143. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Bailey M.H., Tokheim C., Porta-Pardo E., et al. Comprehensive characterization of cancer driver genes and mutations. Cell. 2018;173(2):371–385. doi: 10.1016/j.cell.2018.02.060. e18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Behan F.M., Iorio F., Picco G., et al. Prioritization of cancer therapeutic targets using CRISPR-Cas9 screens. Nature. 2019;568(7753):511–516. doi: 10.1038/s41586-019-1103-9.
- 12. Olivieri M., Cho T., Alvarez-Quilon A., et al. A genetic map of the response to DNA damage in human cells. Cell. 2020;182(2):481–496.e21. doi: 10.1016/j.cell.2020.05.040.
- 13. Gautier L., Cope L., Bolstad B.M., et al. affy--analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004;20(3):307–315. doi: 10.1093/bioinformatics/btg405.
- 14. Ritchie M.E., Phipson B., Wu D., et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7). doi: 10.1093/nar/gkv007.
- 15. Guo J.C., Li C.Q., Wang Q.Y., et al. Protein-coding genes combined with long non-coding RNAs predict prognosis in esophageal squamous cell carcinoma patients as a novel clinical multi-dimensional signature. Mol Biosyst. 2016;12(11):3467–3477. doi: 10.1039/c6mb00585c.
- 16. Sahraeian S.M.E., Mohiyuddin M., Sebra R., et al. Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis. Nat Commun. 2017;8(1):59. doi: 10.1038/s41467-017-00050-4.
- 17. Soneson C., Love M.I., Robinson M.D. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res. 2015;4:1521. doi: 10.12688/f1000research.7563.1.
- 18. Love M.I., Huber W., Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15(12):550. doi: 10.1186/s13059-014-0550-8.
- 19. Pena-Chilet M., Esteban-Medina M., Falco M.M., et al. Using mechanistic models for the clinical interpretation of complex genomic variation. Sci Rep. 2019;9(1):18937. doi: 10.1038/s41598-019-55454-7.
- 20. Bengtsson H., Irizarry R., Carvalho B., et al. Estimation and assessment of raw copy numbers at the single locus level. Bioinformatics. 2008;24(6):759–767. doi: 10.1093/bioinformatics/btn016.
- 21. Mermel C.H., Schumacher S.E., Hill B., et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 2011;12(4):R41. doi: 10.1186/gb-2011-12-4-r41.
- 22. Wang K., Li M., Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38(16). doi: 10.1093/nar/gkq603.
- 23. Jia P., Wang Q., Chen Q., et al. MSEA: detection and quantification of mutation hotspots through mutation set enrichment analysis. Genome Biol. 2014;15(10):489. doi: 10.1186/s13059-014-0489-9.
- 24. Yu G., Wang L.G., Han Y., et al. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16(5):284–287. doi: 10.1089/omi.2011.0118.
- 25. Yevshin I., Sharipov R., Kolmykov S., et al. GTRD: a database on gene transcription regulation-2019 update. Nucleic Acids Res. 2019;47(D1):D100–D105. doi: 10.1093/nar/gky1128.
- 26. Liu K., Song G., Zhang X., et al. PTK7 is a novel oncogenic target for esophageal squamous cell carcinoma. World J Surg Oncol. 2017;15(1):105. doi: 10.1186/s12957-017-1172-x.
- 27. Liao G.B., Li X.Z., Zeng S., et al. Regulation of the master regulator FOXM1 in cancer. Cell Commun Signal. 2018;16(1):57. doi: 10.1186/s12964-018-0266-6.
- 28. Mowla S.N., Lam E.W., Jat P.S. Cellular senescence and aging: the role of B-MYB. Aging Cell. 2014;13(5):773–779. doi: 10.1111/acel.12242.
- 29. Peng L., Cheng S., Lin Y., et al. CCGD-ESCC: a comprehensive database for genetic variants associated with esophageal squamous cell carcinoma in Chinese population. Genom Proteom Bioinforma. 2018;16(4):262–268. doi: 10.1016/j.gpb.2018.03.005.
- 30. Tungekar A., Mandarthi S., Mandaviya P.R., et al. ESCC ATLAS: a population wide compendium of biomarkers for Esophageal Squamous Cell Carcinoma. Sci Rep. 2018;8(1):12715. doi: 10.1038/s41598-018-30579-3.
- 31. Hanna V.S., Hafez E.A.A. Synopsis of arachidonic acid metabolism: a review. J Adv Res. 2018;11:23–32. doi: 10.1016/j.jare.2018.03.005.
- 32. Chen C.L., Wang Y., Huang C.Y., et al. IL-17 induces antitumor immunity by promoting beneficial neutrophil recruitment and activation in esophageal squamous cell carcinoma. Oncoimmunology. 2017;7(1). doi: 10.1080/2162402X.2017.1373234.
- 33. Lian Q., Wang S., Zhang G., et al. HCCDB: a database of hepatocellular carcinoma expression atlas. Genom Proteom Bioinforma. 2018;16(4):269–275. doi: 10.1016/j.gpb.2018.07.003.
- 34. Yang Y., Sui Y., Xie B., et al. GliomaDB: a web server for integrating glioma omics data and interactive analysis. Genom Proteom Bioinforma. 2019;17(4):465–471. doi: 10.1016/j.gpb.2018.03.008.
Data Availability Statement
ESCCdb is freely available at http://cailab.labshare.cn/ESCCdb.