Abstract
Microbial communities are abundant in the human body, and dysbiosis has been reported to correlate with many diseases, including various cancers. Most studies focus on the gut microbiome, while the bacteria that reside in tumor microenvironments remain unclear. Previous studies have derived bacterial expression profiles from RNA-seq, whole-genome sequencing, and whole-exome sequencing data in The Cancer Genome Atlas (TCGA). However, small-RNA sequencing data have rarely been used. Using TCGA miRNA sequencing data, we evaluated bacterial abundance in 32 types of cancer. To uncover the bacteria involved in cancer, we applied an analytical process that aligns reads unmapped to the human genome to bacterial references, and developed the BIC database for the transcriptional landscape of bacteria in cancer. BIC provides cancer-associated bacterial information, including the relative abundance of bacteria, bacterial diversity, associations with clinical relevance, the co-expression network of bacteria and human genes, and their associated biological functions. These results can complement previously published databases. Users can easily download the result plots and tables, or download the bacterial abundance matrix for further analyses. In summary, BIC provides information on cancer microenvironments related to microbial communities. BIC is available at: http://bic.jhlab.tw/.
INTRODUCTION
The human microbiota is abundant, varies within our bodies, and differs across body sites (1,2). A human body is estimated to harbor more than thirty trillion bacterial cells, similar to the number of human cells (3). Host–microbiome interactions affect multiple physiological processes and disease susceptibilities. The human microbiota plays an important role in human health, including the maintenance of homeostasis and the regulation of immunity and inflammation (4,5). Most microbial studies focus on the gut microbiome and related diseases, such as inflammatory bowel disease (IBD), depression, and anxiety (6). Furthermore, studies have shown that microbial composition is altered in, and associated with, cancer (7,8).
While many studies focus on the gut microbiome derived from patients’ stool (9–11), the bacteria that reside in tumor microenvironments remain unclear. Dohlman et al. and Poore et al. derived bacterial expression profiles from RNA-seq, whole-genome sequencing (WGS), and whole-exome sequencing (WXS) data in The Cancer Genome Atlas (TCGA) (12,13). However, small-RNA sequencing data have rarely been used. We previously developed an analytical approach that uses small-RNA sequencing data from colorectal cancer (CRC) tissue samples to study the cancer-associated microbiome in CRC, and observed results similar to those of other studies using 16S rDNA sequencing (14).
There are certain benefits to using miRNA-seq rather than WGS, WXS, or RNA-seq. First, small RNAs (sRNAs) have been found to play regulatory roles in both bacteria and bacterial infectious diseases (15,16). Unlike the DNA captured by WGS and WXS, sRNAs are transcribed and functional in either bacteria or hosts. Second, only a small fraction of total RNA in bacteria is polyadenylated, and only transiently (17,18). In many RNA-seq studies, RNAs are extracted and reverse-transcribed into cDNA by priming on poly-A tails, so most bacterial RNAs, which lack poly-A tails, are filtered out of the RNA-seq data. miRNA-seq, which is processed without poly-A selection, may therefore identify bacteria that are not found by RNA-seq.
Using TCGA miRNA sequencing data, we evaluated tissue-resident bacterial abundance in 32 types of cancer. We used sRNAnalyzer to align reads that did not map to the human genome to bacterial references, and merged the results at each taxonomic rank for the 32 cancer types (14,19). The bacterial relative abundance and sample diversity were compared across different cancer types. We parsed all the data and developed the BIC database for the transcriptional landscape of bacteria in cancer. BIC provides the following information: (i) relative abundance of bacteria, (ii) bacterial diversity, (iii) bacterial composition, (iv) clinical relevance, (v) bacterial co-abundance network, (vi) bacteria-correlated human gene expression network and (vii) bacteria-associated biological function (Figure 1). Users can easily query and browse the analysis plots and result tables, or download the bacterial abundance matrices for further analyses.
DATA COLLECTION
The TCGA miRNA-seq BAM files were retrieved from the NCI Genomic Data Commons (GDC) using the GDC Data Transfer Tool (20). Human RNA expression profiles (EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv) and clinical annotations, including tumor stage, race, survival events, and survival times (TCGA-CDR-Supplemental Table S1.xlsx), were downloaded from the Supplemental Data in PanCanAtlas Publications (https://gdc.cancer.gov/about-data/publications/pancanatlas). The gene symbols in the RNA expression profiles were renamed according to org.Hs.eg.db (version 3.6.0) (21). Only samples from primary tumors and their adjacent normal tissues were used. We acquired the biospecimen information using TCGAbiolinks (version 2.17.3) (22).
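The restriction to primary tumors and adjacent normal tissues can be illustrated with a short sketch. It assumes the standard TCGA barcode layout, in which the first two characters of the fourth hyphen-separated field give the sample type (01 for primary solid tumor, 11 for solid tissue normal). The text does not state whether BIC selected samples from barcodes or from the TCGAbiolinks biospecimen tables, so the function name, the group labels, and the example barcodes below are hypothetical.

```python
# Sample-type codes from the TCGA barcode convention.
# Only the two groups named in the text are kept; everything else is dropped.
SAMPLE_TYPE_GROUPS = {
    "01": "tumor",   # primary solid tumor
    "11": "normal",  # solid tissue normal (adjacent normal)
}


def sample_group(barcode):
    """Return 'tumor', 'normal', or None for a TCGA sample or aliquot barcode.

    Hypothetical helper: expects a barcode such as
    'TCGA-XX-XXXX-01A-...' where the fourth field starts with the
    two-digit sample-type code.
    """
    fields = barcode.split("-")
    if len(fields) < 4 or len(fields[3]) < 2:
        return None  # too short to carry a sample-type code
    return SAMPLE_TYPE_GROUPS.get(fields[3][:2])


# Made-up barcodes used only to exercise the helper.
example_barcodes = [
    "TCGA-AA-0001-01A-01T-0001-13",  # primary tumor
    "TCGA-AA-0001-11A-01T-0001-13",  # adjacent normal
    "TCGA-AA-0002-06A-01T-0001-13",  # metastatic, excluded
]
kept = {b: sample_group(b) for b in example_barcodes if sample_group(b)}
```

Cancer types whose primary samples carry a different code (for example, blood-derived tumors) would need their own entry in the mapping.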
DATA PROCESSING AND INTEGRATION
Bacterial relative abundance matrices
We used SAMtools (version 1.3.1) to extract the unmapped reads from the human miRNA-seq BAM files and stored them in FASTQ format (23). The sRNAnalyzer scripts (‘preprocess.pl’, ‘align.pl’, ‘desProfile.pl’ and ‘taxProfile.pl’) were used for read preprocessing, alignment, and taxonomy annotation (19). We set the minimum read length to 20 nucleotides and mapped the reads to multiple references, allowing no mismatches in order to maximize alignment accuracy. The references used for alignment were those provided by sRNAnalyzer, including CDS and DNA of bacteria, nt_bacteria, and microbiomes. After taxonomy annotation by the sRNAnalyzer scripts, we reassigned reads that mapped to multiple species to their common higher-level taxon and generated read count matrices at different taxonomic levels (14). The read counts remaining after each data processing step are summarized in Supplementary Figure S1. We identified 1617 genera, 303 families, 126 orders, 56 classes, and 47 phyla from 10 362 samples (9709 patients) across 32 cancer types. Because the count matrices were sparse, we applied the geometric mean of pairwise ratios (GMPR), a robust normalization method for zero-inflated data, to produce normalized count tables (24). To keep all 10 362 samples, the GMPR intersection numbers (the minimum number of taxa shared by a pair of samples) for the phylum, class, order, family, and genus count tables were set to 3, 5, 5, 6 and 5, respectively. Normalized count matrices were transformed into relative abundance matrices. The relative bacterial abundance at each taxonomy level was used for all subsequent analyses in BIC. An overview of these processes is shown in Figure 2. Detailed bacterial references and processing scripts are available on GitHub.
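The GMPR step described above can be sketched in a few lines. This is a minimal illustration of the idea, not the published GMPR implementation and not the BIC code: for each sample, the median count ratio to every other sample is taken over the taxa that are nonzero in both, and the size factor is the geometric mean of those medians. The function names, the min_shared parameter (standing in for the intersection number), and the toy counts are assumptions made for this example, as is the choice to skip the comparison of a sample with itself.

```python
import math
from statistics import median


def gmpr_size_factors(counts, min_shared=1):
    """Return one GMPR-style size factor per sample.

    counts: list of samples, each a list of raw taxon counts in the same order.
    min_shared: minimum number of taxa that must be nonzero in both samples
        of a pair for that pair to contribute (the "intersection number").
    A sample with no usable partner gets None.
    """
    factors = []
    for i, sample in enumerate(counts):
        log_medians = []
        for j, other in enumerate(counts):
            if i == j:
                continue  # sketch choice: skip the trivial self-comparison
            # Ratios are taken only over taxa observed in both samples.
            ratios = [a / b for a, b in zip(sample, other) if a > 0 and b > 0]
            if len(ratios) >= min_shared:
                log_medians.append(math.log(median(ratios)))
        if log_medians:
            # Geometric mean of the pairwise median ratios.
            factors.append(math.exp(sum(log_medians) / len(log_medians)))
        else:
            factors.append(None)
    return factors


def normalize(counts, factors):
    """Divide each sample's counts by its size factor."""
    return [[c / f for c in row] for row, f in zip(counts, factors)]


# Toy table: 3 samples x 4 taxa (values are made up).
toy_counts = [
    [10, 0, 20, 30],
    [20, 5, 40, 60],
    [5, 0, 10, 0],
]
toy_factors = gmpr_size_factors(toy_counts, min_shared=2)
toy_normalized = normalize(toy_counts, toy_factors)
```

Raising min_shared makes each pairwise median more stable but can leave very sparse samples without a size factor, which is consistent with the low intersection numbers chosen in the text to retain every sample.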
Precomputed analysis data and database construction
Based on bacterial relative abundances, we calculated the bacterial diversity at each taxonomy level for each cancer type. The vegan package (version 2.5-7, https://CRAN.R-project.org/package=vegan) was used to calculate the Shannon, Gini-Simpson, and inverse Simpson indices (25). Bacteria with a prevalence (nonzero count) of ≥20% in an individual cancer type were used to analyze the co-abundance of bacteria, the correlation with host gene expression, and the associated biological function. We applied SparCC, a method designed for compositional data, to calculate bacterial co-abundance relationships and establish the co-abundance networks for individual cancer types (26,27). The function sparccboot in SpiecEasi (version 1.1.0) was used to obtain SparCC correlation coefficients and empirical p-values for bacterial co-abundance with 10 000 bootstrap replicates (28). Spearman correlation coefficients (SCC) between bacterial abundance and human gene expression were calculated using the samples shared by the bacterial and tissue transcriptome data. Only human genes measured with nonzero counts in ≥20% of the samples were considered. To correct for the effect of sample size, we applied Fisher's z-transformation to the SCC. To reveal the biological processes in which the queried bacteria may be involved, we performed gene set enrichment analysis with fgsea (version 1.12.0) on the bacteria-correlated human genes ranked in descending order of the corrected z-score. The gene sets of biological processes annotated by Gene Ontology (GO, c5.go.bp.v7.2) (29), KEGG (c2.cp.kegg.v7.5.1) (30) and Reactome (c2.cp.reactome.v7.5.1) (31) were downloaded from the Molecular Signatures Database (32,33). These analyses were performed with R scripts (version 3.6.0) (34), and all the precomputed analysis data are stored in PostgreSQL (version 13.3). The tables deposited in PostgreSQL are shown in Figure 3.
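Two of the calculations above are simple enough to write out. The sketch below gives the three diversity indices as they are usually defined (Shannon with the natural logarithm, Gini-Simpson as one minus the sum of squared proportions, and inverse Simpson as its reciprocal form), followed by a sample-size-aware Fisher z-score. The scaling of atanh(r) by the square root of n − 3 is one common form of the correction and is an assumption here, since the text does not give the exact formula used in BIC. All function names and input values are illustrative.

```python
import math


def to_proportions(abundances):
    """Rescale non-negative abundances so that they sum to 1."""
    total = sum(abundances)
    return [a / total for a in abundances]


def shannon(abundances):
    """Shannon index H = -sum(p * ln p), ignoring zero proportions."""
    p = to_proportions(abundances)
    return -sum(x * math.log(x) for x in p if x > 0)


def gini_simpson(abundances):
    """Gini-Simpson index: 1 - sum(p^2)."""
    p = to_proportions(abundances)
    return 1.0 - sum(x * x for x in p)


def inverse_simpson(abundances):
    """Inverse Simpson index: 1 / sum(p^2)."""
    p = to_proportions(abundances)
    return 1.0 / sum(x * x for x in p)


def fisher_z_score(r, n):
    """Fisher z-transform of a correlation r, scaled by sqrt(n - 3).

    Assumed form of the sample-size correction; requires |r| < 1 and n > 3.
    """
    return math.atanh(r) * math.sqrt(n - 3)


# Made-up example: three taxa in one sample, and the same correlation
# observed in a small and a large cohort.
toy_sample = [2, 1, 1]
toy_diversity = (
    shannon(toy_sample),
    gini_simpson(toy_sample),
    inverse_simpson(toy_sample),
)
z_small = fisher_z_score(0.3, 50)
z_large = fisher_z_score(0.3, 500)
```

With this scaling, the same correlation coefficient yields a larger score in the larger cohort, so genes ranked by the score are comparable across cancer types with different numbers of samples.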
Web application framework
The BIC web application framework (Supplementary Figure S2) was constructed using Python (version 3.6.8) (35) and Django (version 3.2.3, https://djangoproject.com). The clinical relevance analyses are performed within Django, including overall survival analysis and comparisons of bacterial abundance between groups, such as tumor (T) versus adjacent normal (N), tumor stages, and races. The survival analysis was implemented using lifelines (version 0.26.3) (36). Statistical P-values for the group comparisons (Kruskal–Wallis and Wilcoxon rank-sum tests) were calculated with kruskal and ranksums in scipy (version 1.5.4) (37). Plots were produced by bokeh (version 2.3.3, http://www.bokeh.pydata.org).
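For readers unfamiliar with the two-group comparison, the following dependency-free sketch shows the large-sample normal approximation behind a Wilcoxon rank-sum test: the pooled values are ranked (ties receive average ranks), the rank sum of the first group is standardized, and a two-sided P-value is read from the normal distribution. It is an illustration only. BIC calls the scipy functions named above, and the helper names and abundance values here are hypothetical.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def rank_sum_test(x, y):
    """Return (z, two-sided p) for the rank-sum comparison of x and y.

    Uses the normal approximation without tie or continuity correction.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - expected) / sd
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Made-up relative abundances for adjacent normal (N) and tumor (T) samples.
normal_abundance = [1.0, 2.0, 3.0]
tumor_abundance = [4.0, 5.0, 6.0]
z_stat, p_value = rank_sum_test(normal_abundance, tumor_abundance)
```

A negative z here means that the first group (normal) tends to have lower abundance than the second (tumor). Comparisons across more than two groups, such as tumor stages, call for the Kruskal–Wallis test instead, which is not shown.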
USER INTERFACE AND USE CASES
Figure 4 shows the user interface and all the analyses provided by BIC. Modules I and II enable users to query the bacterial relative abundance and the diversity indices or evenness at the selected taxonomy level across all cancer types. Modules III to VII allow users to explore the bacterial composition, clinical relevance, co-abundance, correlated human gene expression, and inferred biological processes of the queried bacteria at a specified taxonomy level in the selected cancer type. Users can easily save the output plots and tables for their queried analyses.
Figure 5 illustrates an example of how users can investigate the genus Fusobacterium in cancer. For CRC and head and neck cancer, Fusobacterium is known to be associated with cancer progression (38,39). With the Bacterial abundance module, users can query Fusobacterium at the genus level (Figure 5A) and observe that the relative abundances of Fusobacterium are remarkably high in COAD (colon adenocarcinoma), READ (rectum adenocarcinoma), and HNSC (head and neck squamous cell carcinoma) (Figure 5B, C). Furthermore, the Clinical relevance module shows that Fusobacterium is more abundant in tumor than in adjacent normal tissues in these three types of cancer (Figure 5D–F). In the Bacteria-human gene network module, CXCL8 is the top gene positively correlated with Fusobacterium in COAD (Figure 5G). CXCL8 has been found to play an important role in CRC (40,41). With the Bacteria-associated biological function module, users can view the most significant KEGG pathways correlated with the abundance of Fusobacterium in COAD (Figure 5H). Among many cancer-related pathways, the NOD-like receptor signaling pathway has previously been reported to be related to the onset of CRC (40,41).
CONCLUSION
We have developed a user-friendly database, BIC, for bacterial profiles derived from TCGA miRNA-seq data in 32 types of cancer. BIC allows comparisons of the relative abundance and diversity of bacteria across different types of cancer. BIC also provides the bacterial composition, clinical relevance, co-abundance network, correlated human gene expression network, and associated biological functions for each type of cancer. With its comprehensive characterization of bacteria in the tissues of different cancers, BIC can facilitate the exploration of bacterial functions and mechanisms in tumor microenvironments. We believe that our database will be a valuable resource for understanding the interactions between humans and microbes in cancer formation.
DATA AVAILABILITY
BIC is freely accessible at: http://bic.jhlab.tw/. The entire BIC data collection can be downloaded from the website. The source codes of BIC data processing, database construction, and web application are available at GitHub https://github.com/Kai-Pu/BIC_production.
ACKNOWLEDGEMENTS
The authors thank Yue-Hua Tu for advice and discussion of the BIC web framework. We thank the National Center for High-performance Computing (NCHC) for providing computational and storage resources.
Contributor Information
Kai-Pu Chen, Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei 106, Taiwan.
Chia-Lang Hsu, Department of Medical Research, National Taiwan University Hospital, Taipei 100, Taiwan.
Yen-Jen Oyang, Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei 106, Taiwan.
Hsuan-Cheng Huang, Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei 112, Taiwan.
Hsueh-Fen Juan, Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei 106, Taiwan; Department of Life Science, National Taiwan University, Taipei 106, Taiwan; Center for Computational and Systems Biology, National Taiwan University, Taipei 106, Taiwan.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
Ministry of Science and Technology, Taiwan [MOST 109-2221-E-002-161-MY3, 109-2320-B-002-017-MY3, 109-2221-E-010-011-MY3]; Ministry of Education (the Higher Education Sprout Project) [NTU-110L8808, NTU-CC-109L104702-2]. Funding for open access charge: Ministry of Science and Technology, Taiwan.
Conflict of interest statement. None declared.
REFERENCES
- 1. Costello E.K., Lauber C.L., Hamady M., Fierer N., Gordon J.I., Knight R. Bacterial community variation in human body habitats across space and time. Science. 2009; 326:1694–1697.
- 2. Cho I., Blaser M.J. The human microbiome: at the interface of health and disease. Nat. Rev. Genet. 2012; 13:260–270.
- 3. Sender R., Fuchs S., Milo R. Revised estimates for the number of human and bacteria cells in the body. PLoS Biol. 2016; 14:e1002533.
- 4. Belkaid Y., Hand T.W. Role of the microbiota in immunity and inflammation. Cell. 2014; 157:121–141.
- 5. Lee J.Y., Tsolis R.M., Baumler A.J. The microbiome and gut homeostasis. Science. 2022; 377:eabp9960.
- 6. Foster J.A., McVey Neufeld K.A. Gut-brain axis: how the microbiome influences anxiety and depression. Trends Neurosci. 2013; 36:305–312.
- 7. Arthur J.C., Perez-Chanona E., Muhlbauer M., Tomkovich S., Uronis J.M., Fan T.J., Campbell B.J., Abujamel T., Dogan B., Rogers A.B. et al. Intestinal inflammation targets cancer-inducing activity of the microbiota. Science. 2012; 338:120–123.
- 8. Urbaniak C., Gloor G.B., Brackstone M., Scott L., Tangney M., Reid G. The microbiota of breast tissue and its association with breast cancer. Appl. Environ. Microbiol. 2016; 82:5039–5048.
- 9. Qin J., Li Y., Cai Z., Li S., Zhu J., Zhang F., Liang S., Zhang W., Guan Y., Shen D. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature. 2012; 490:55–60.
- 10. Amato K.R., Arrieta M.C., Azad M.B., Bailey M.T., Broussard J.L., Bruggeling C.E., Claud E.C., Costello E.K., Davenport E.R., Dutilh B.E. et al. The human gut microbiome and health inequities. Proc. Natl. Acad. Sci. U.S.A. 2021; 118:e2017947118.
- 11. Henrick B.M., Rodriguez L., Lakshmikanth T., Pou C., Henckel E., Arzoomand A., Olin A., Wang J., Mikes J., Tan Z. et al. Bifidobacteria-mediated immune system imprinting early in life. Cell. 2021; 184:3884–3898.
- 12. Dohlman A.B., Arguijo Mendoza D., Ding S., Gao M., Dressman H., Iliev I.D., Lipkin S.M., Shen X. The cancer microbiome atlas: a pan-cancer comparative analysis to distinguish tissue-resident microbiota from contaminants. Cell Host Microbe. 2021; 29:281–298.
- 13. Poore G.D., Kopylova E., Zhu Q., Carpenter C., Fraraccio S., Wandro S., Kosciolek T., Janssen S., Metcalf J., Song S.J. et al. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature. 2020; 579:567–574.
- 14. Lee W.H., Chen K.P., Wang K., Huang H.C., Juan H.F. Characterizing the cancer-associated microbiome with small RNA sequencing data. Biochem. Biophys. Res. Commun. 2020; 522:776–782.
- 15. Storz G., Vogel J., Wassarman K.M. Regulation by small RNAs in bacteria: expanding frontiers. Mol. Cell. 2011; 43:880–891.
- 16. Gonzalez Plaza J.J. Small RNAs as fundamental players in the transference of information during bacterial infectious diseases. Front. Mol. Biosci. 2020; 7:101.
- 17. Sarkar N. Polyadenylation of mRNA in bacteria. Microbiology (Reading). 1996; 142:3125–3133.
- 18. Hajnsdorf E., Kaberdin V.R. RNA polyadenylation and its consequences in prokaryotes. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2018; 373:20180166.
- 19. Wu X., Kim T.K., Baxter D., Scherler K., Gordon A., Fong O., Etheridge A., Galas D.J., Wang K. sRNAnalyzer-a flexible and customizable small RNA sequencing data analysis pipeline. Nucleic Acids Res. 2017; 45:12140–12151.
- 20. Grossman R.L., Heath A.P., Ferretti V., Varmus H.E., Lowy D.R., Kibbe W.A., Staudt L.M. Toward a shared vision for cancer genomic data. N. Engl. J. Med. 2016; 375:1109–1112.
- 21. Carlson M. 2019; org.Hs.eg.db: Genome wide annotation for Human. R package version 3.10.10.
- 22. Colaprico A., Silva T.C., Olsen C., Garofano L., Cava C., Garolini D., Sabedot T.S., Malta T.M., Pagnotta S.M., Castiglioni I. et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016; 44:e71.
- 23. Danecek P., Bonfield J.K., Liddle J., Marshall J., Ohan V., Pollard M.O., Whitwham A., Keane T., McCarthy S.A., Davies R.M. et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021; 10:giab008.
- 24. Chen L., Reeve J., Zhang L., Huang S., Wang X., Chen J. GMPR: a robust normalization method for zero-inflated count data with application to microbiome sequencing data. PeerJ. 2018; 6:e4600.
- 25. Oksanen J., Guillaume Blanchet J., Friendly M., Kindt R., Legendre P., McGlinn D., Minchin P.R., O’Hara R.B., Simpson G.L., Solymos P. et al. 2020; vegan: Community Ecology Package. R package version 2.5-7.
- 26. Chen L., Collij V., Jaeger M., van den Munckhof I.C.L., Vich Vila A., Kurilshikov A., Gacesa R., Sinha T., Oosting M., Joosten L.A.B. et al. Gut microbial co-abundance networks show specificity in inflammatory bowel disease and obesity. Nat. Commun. 2020; 11:4018.
- 27. Friedman J., Alm E.J. Inferring correlation networks from genomic survey data. PLoS Comput. Biol. 2012; 8:e1002687.
- 28. Kurtz Z.D., Müller C.L., Miraldi E.R., Littman D.R., Blaser M.J., Bonneau R.A. 2021; SpiecEasi: Sparse Inverse Covariance for Ecological Statistical Inference. R package version 1.1.0.
- 29. Gene Ontology Consortium. The gene ontology resource: enriching a GOld mine. Nucleic Acids Res. 2021; 49:D325–D334.
- 30. Kanehisa M., Furumichi M., Sato Y., Ishiguro-Watanabe M., Tanabe M. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res. 2021; 49:D545–D551.
- 31. Gillespie M., Jassal B., Stephan R., Milacic M., Rothfels K., Senff-Ribeiro A., Griss J., Sevilla C., Matthews L., Gong C. et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022; 50:D687–D692.
- 32. Sergushichev A.A. An algorithm for fast preranked gene set enrichment analysis using cumulative statistic calculation. 2016; bioRxiv doi: 10.1101/060012, 20 June 2016, preprint: not peer reviewed.
- 33. Subramanian A., Tamayo P., Mootha V.K., Mukherjee S., Ebert B.L., Gillette M.A., Paulovich A., Pomeroy S.L., Golub T.R., Lander E.S. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 2005; 102:15545–15550.
- 34. R Core Team. R: A Language and Environment for Statistical Computing. 2019; Vienna, Austria: R Foundation for Statistical Computing.
- 35. Van Rossum G., Fred L. Python Reference Manual. 1995; Centrum voor Wiskunde en Informatica Amsterdam.
- 36. Davidson-Pilon C. lifelines: survival analysis in python. J. Open Source Softw. 2019; 4:1317.
- 37. Virtanen P., Gommers R., Oliphant T.E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J. et al. SciPy 1.0: fundamental algorithms for scientific computing in python. Nat. Methods. 2020; 17:261–272.
- 38. Wu J., Li Q., Fu X. Fusobacterium nucleatum contributes to the carcinogenesis of colorectal cancer by inducing inflammation and suppressing host immunity. Transl Oncol. 2019; 12:846–851.
- 39. Bronzato J.D., Bomfim R.A., Edwards D.H., Crouch D., Hector M.P., Gomes B. Detection of fusobacterium in oral and head and neck cancer samples: a systematic review and meta-analysis. Arch. Oral. Biol. 2020; 112:104669.
- 40. Bie Y., Ge W., Yang Z., Cheng X., Zhao Z., Li S., Wang W., Wang Y., Zhao X., Yin Z. et al. The crucial role of CXCL8 and its receptors in colorectal liver metastasis. Dis. Markers. 2019; 2019:8023460. [Retracted]
- 41. Velloso F.J., Trombetta-Lima M., Anschau V., Sogayar M.C., Correa R.G. NOD-like receptors: major players (and targets) in the interface between innate immunity and cancer. Biosci. Rep. 2019; 39:BSR20181709.