Author manuscript; available in PMC: 2023 Aug 8.
Published in final edited form as: Cancer Cell. 2022 Aug 8;40(8):809–811. doi: 10.1016/j.ccell.2022.07.008

Decoding tumor microenvironment through artificial tumor transcriptomes

Liqing Tian 1, Jinghui Zhang 1,*
PMCID: PMC9680037  NIHMSID: NIHMS1847283  PMID: 35944501

Abstract

In this issue of Cancer Cell, Zaitsev et al. present a machine learning-based approach, trained on millions of artificial transcriptomes with admixed cell populations, for reconstructing the tumor microenvironment (TME). The high accuracy of this approach, demonstrated through extensive validation, enables systematic investigation of the TME in both research and clinical settings.


Understanding the tumor microenvironment (TME) is critical for exploring the therapeutic potential and rational design of immunotherapy (Wei et al., 2018). Many computational methods have been developed to deconvolute transcriptome sequencing (RNA-seq) data generated from bulk tumor samples, and these have greatly advanced our understanding of TME cellular composition in recent years (Thorsson et al., 2018). Existing methods were primarily designed to model cell type-specific gene expression profiles (Newman et al., 2015) or to perform deep learning using single-cell RNA-seq (scRNA-seq) of the same tissue type (Newman et al., 2019; Menden et al., 2020). Therefore, they are not optimized for capturing rare or hierarchical TME subpopulations and have limited applicability to samples that lack scRNA-seq or flow cytometry data.

In their current work (Zaitsev et al., 2022), Zaitsev and colleagues addressed these challenges with an innovative approach that uses millions of artificial transcriptomes to train their machine learning model, Kassandra, for deconvoluting the TME in bulk samples (Figure 1). The artificial transcriptomes were constructed from a highly curated database (the Kassandra database) of >9,400 purified RNA-seq samples generated from cancer cell lines and sorted blood, immune, and stromal cell populations. The resource comprises 9,041 curated samples from public databases and 348 RNA-seq samples of FACS-sorted blood and cancer tissues generated in this study to ensure representation of rare cell types, such as T helper cells in a steady or active state. Altogether, 51 unique populations are represented in this resource, including 18 TME cell types and 41 populations present in blood.

Figure 1. Reconstructing tumor and blood microenvironment using Kassandra.


Zaitsev et al. curated and harmonized RNA-seq data from diverse sorted cell populations representing 51 unique cell types. By admixing different cell populations in silico, millions of artificial transcriptomes were created to match the profile of real patient tumor or blood samples. The data were used to train Kassandra, a stepwise machine learning algorithm designed to accurately estimate the cellular composition of the TME, including hierarchical populations. A clinical application of Kassandra showed the potential of predicting PD-L1 immunohistochemistry (IHC) level and immunotherapy response in bladder cancer patients by deconvoluting PD1+ CD8+ T cells from bulk RNA-seq data.

Artificial transcriptomes were created by randomly selecting cell types from the Kassandra database and mixing them in proportions resembling real tumor tissues or blood samples. The process also took into consideration technical noise and aberrant gene expression in tumor cells. This resulted in 18 million and 8 million artificial transcriptomes for training the Kassandra-Tumor and Kassandra-Blood models, respectively. The artificial transcriptomes were split into training plus validation (84%) and holdout (16%) data sets so that the performance of Kassandra could be ascertained in the holdout data set. LightGBM (Light Gradient Boosting Machine), a machine learning algorithm based on decision trees, was used to build the Kassandra models, and performance was assessed by Pearson correlation of predicted versus true cell populations in the holdout data set. The high correlation observed across 14 TME cell types (r = 0.92 to 0.99) predicted by Kassandra, in contrast to the weaker correlations (r < 0.70) obtained by the other 12 publicly available methods, demonstrates the consistent and superior performance of Kassandra. The authors also evaluated the limit of detection (LOD) by creating holdout artificial transcriptomes with low total read counts. The average LOD was approximately 0.5–1%; a higher LOD was noted for NK cells, which share the expression of multiple genes with CD8+ T cells.
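The admixing and holdout evaluation described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three toy expression profiles, the gene count, the multiplicative noise model, and the names `PROFILES`, `random_fractions`, `admix`, and `pearson` are hypothetical and are not taken from the Kassandra code. In place of a trained LightGBM model, the sketch uses a single marker-gene value as a crude stand-in score, so it shows only the data flow (mix, split 84%/16%, correlate with the true fraction), not the authors' method.

```python
import math
import random

# Hypothetical toy expression profiles (arbitrary units) for three cell types
# over four genes; real profiles would come from sorted-cell RNA-seq samples.
PROFILES = {
    "tumor": [50.0, 5.0, 1.0, 20.0],
    "t_cell": [2.0, 80.0, 10.0, 5.0],
    "b_cell": [1.0, 10.0, 90.0, 5.0],
}


def random_fractions(cell_types, rng):
    """Draw random mixing proportions that sum to 1."""
    weights = [rng.random() for _ in cell_types]
    total = sum(weights)
    return {ct: w / total for ct, w in zip(cell_types, weights)}


def admix(profiles, fractions, rng, noise_sd=0.05):
    """Build one artificial transcriptome as a weighted sum of profiles.

    Multiplicative Gaussian noise is an assumed, simplified stand-in for
    technical variation.
    """
    n_genes = len(next(iter(profiles.values())))
    mixed = [0.0] * n_genes
    for cell_type, frac in fractions.items():
        for g, value in enumerate(profiles[cell_type]):
            mixed[g] += frac * value
    return [max(0.0, x * (1.0 + rng.gauss(0.0, noise_sd))) for x in mixed]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(0)
cell_types = list(PROFILES)

# Each record pairs the true fractions with the simulated bulk expression.
dataset = []
for _ in range(100):
    fractions = random_fractions(cell_types, rng)
    dataset.append((fractions, admix(PROFILES, fractions, rng)))

# 84% training plus validation, 16% holdout, as in the split described above.
n_train = round(0.84 * len(dataset))
train, holdout = dataset[:n_train], dataset[n_train:]

# Stand-in "prediction": expression of gene index 1, which is dominated by
# the toy T-cell profile. A real pipeline would use a trained model here.
true_t = [fractions["t_cell"] for fractions, _ in holdout]
score_t = [expression[1] for _, expression in holdout]
r_holdout = pearson(true_t, score_t)
print(f"holdout Pearson r for T cells: {r_holdout:.3f}")
```

Because the true fractions are known by construction, any predictor can be scored on the holdout set in the same way, which is what makes artificial mixtures useful for benchmarking.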

Having established the benchmark of Kassandra based on the artificial transcriptomes, the authors carried out further validation by applying Kassandra to bulk tumor and normal tissue RNA-seq samples generated by The Cancer Genome Atlas project (TCGA; https://www.cancer.gov/tcga) and GTEx (GTEx Consortium, 2013), respectively. The deconvoluted cell populations detected in different types of tumor and normal tissues matched existing knowledge in the scientific literature. More importantly, orthogonal validation was performed using H&E slides for 4,035 TCGA tumor samples across 13 different cancer types. This allowed for a direct comparison of histologically defined percentages of tumor-infiltrating lymphocytes (TILs) and macrophages (Saltz et al., 2018) with deconvoluted cell populations. Kassandra achieved an impressive correlation of >0.7 for 10 of the 13 cancer types, surpassing the performance of 6 additional algorithms. Kassandra also outperformed the other methods by achieving the highest concordance with tumor purity and with expression of T-cell receptor (TCR) and IgH/L (B-cell receptor, BCR) based on RNA-seq read analysis. Similarly, robust performance of Kassandra was demonstrated by high correlation with orthogonal data sets in the following samples: 1) FACS-sorted cell populations, including 11 novel cell types, from 45 blood samples; 2) pseudobulk RNA-seq datasets built from 9 independent peripheral blood mononuclear cell (PBMC) scRNA-seq datasets; 3) pseudobulk RNA-seq datasets built from scRNA-seq data derived from 6 tumor types; 4) CyTOF analysis of T and B cell, neutrophil, macrophage, Treg cell, NK cell, endothelial cell, and fibroblast populations in non-small cell lung carcinoma (NSCLC) and clear cell renal cell carcinoma (ccRCC) tumors; and 5) multiplex immunofluorescence images (MxIF, 20 markers) of ccRCC samples with cell segmentation and typing.
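For readers unfamiliar with pseudobulk validation, the following sketch shows the general idea under stated assumptions: per-cell counts from an annotated scRNA-seq data set are summed into one bulk-like profile, and the cell-type labels give the reference composition against which a deconvolution result can be compared. The toy cells, the three-gene layout, and the function names are hypothetical, and the commentary does not describe how the authors constructed their pseudobulk samples. Note also that fractions by cell number, as computed here, can differ from fractions by RNA content.

```python
from collections import Counter

# Hypothetical annotated single cells: (cell-type label, raw counts per gene).
CELLS = [
    ("t_cell", [0, 12, 1]),
    ("t_cell", [1, 9, 0]),
    ("b_cell", [0, 2, 15]),
    ("tumor", [20, 1, 0]),
]


def pseudobulk(cells):
    """Sum raw counts gene by gene across all cells."""
    n_genes = len(cells[0][1])
    totals = [0] * n_genes
    for _, counts in cells:
        for g, count in enumerate(counts):
            totals[g] += count
    return totals


def true_fractions(cells):
    """Reference composition as the fraction of cells carrying each label."""
    label_counts = Counter(label for label, _ in cells)
    n_cells = len(cells)
    return {label: k / n_cells for label, k in label_counts.items()}


def counts_per_million(totals):
    """Scale a count vector to counts per million (CPM)."""
    depth = sum(totals)
    return [1e6 * t / depth for t in totals]


bulk = pseudobulk(CELLS)
print(bulk)                  # [21, 24, 16]
print(true_fractions(CELLS)) # {'t_cell': 0.5, 'b_cell': 0.25, 'tumor': 0.25}
```

The normalized profile from `counts_per_million(bulk)` would be the input to a deconvolution method, and its output would be compared with `true_fractions(CELLS)`.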

Beyond these validation statistics, the authors demonstrated the potential clinical utility of Kassandra analysis of bulk RNA-seq samples by reconstructing the TME to predict response to immunotherapy in three cancer types (bladder, gastric, and ccRCC). Among the immune cell types uncovered by Kassandra, PD1+ CD8+ T cells correlated significantly with PD-L1 immune cell IHC levels, while the ratio of PD1+ CD8+ T cells to all T cells was significantly associated with immunotherapy response independent of TME and PD-L1 expression values (Figure 1, bottom). A second application, which showed an association between deconvoluted immune composition and age in archived blood samples, provided another perspective on the utility of Kassandra.

Through in-depth curation, rigorous analyses, and exhaustive validation, Zaitsev et al. showed that Kassandra achieves high accuracy in reconstructing immune cell composition from bulk RNA-seq samples generated in silico as well as from samples of cancer patients or healthy donors. The results presented in this study show that reverse engineering of the tissue transcriptome is possible through the use of RNA-seq data generated from sorted cells, an approach that may have broad application for reconstructing the TME. The availability of the source code (https://github.com/BostonGene/Kassandra) and the training data will accelerate the broad dissemination of this unique approach.

Acknowledgments

The authors were supported in part by the American Lebanese Syrian Associated Charities (ALSAC) and by National Cancer Institute (NCI) grant R01CA216391 to J.Z.

Footnotes

Declaration of interests

The authors declare no competing interests.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  1. GTEx Consortium (2013). The Genotype-Tissue Expression (GTEx) project. Nat Genet 45, 580–585. 10.1038/ng.2653.
  2. Menden K, Marouf M, Oller S, Dalmia A, Magruder DS, Kloiber K, Heutink P, and Bonn S (2020). Deep learning-based cell composition analysis from tissue expression profiles. Sci Adv 6, eaba2619. 10.1126/sciadv.aba2619.
  3. Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, Hoang CD, Diehn M, and Alizadeh AA (2015). Robust enumeration of cell subsets from tissue expression profiles. Nat Methods 12, 453–457. 10.1038/nmeth.3337.
  4. Newman AM, Steen CB, Liu CL, Gentles AJ, Chaudhuri AA, Scherer F, Khodadoust MS, Esfahani MS, Luca BA, Steiner D, et al. (2019). Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat Biotechnol 37, 773–782. 10.1038/s41587-019-0114-2.
  5. Saltz J, Gupta R, Hou L, Kurc T, Singh P, Nguyen V, Samaras D, Shroyer KR, Zhao T, Batiste R, et al. (2018). Spatial Organization and Molecular Correlation of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images. Cell Rep 23, 181–193.e7. 10.1016/j.celrep.2018.03.086.
  6. Thorsson V, Gibbs DL, Brown SD, Wolf D, Bortone DS, Ou Yang TH, Porta-Pardo E, Gao GF, Plaisier CL, Eddy JA, et al. (2018). The Immune Landscape of Cancer. Immunity 48, 812–830.e14. 10.1016/j.immuni.2018.03.023.
  7. Wei SC, Duffy CR, and Allison JP (2018). Fundamental Mechanisms of Immune Checkpoint Blockade Therapy. Cancer Discov 8, 1069–1086. 10.1158/2159-8290.CD-18-0367.
  8. Zaitsev A, Chelushkin M, Dyikanov D, Cheremushkin I, Shpak B, Nomie K, Zyrin V, Nuzhdina K, Lozinsky Y, Zotova A, et al. (2022). Precise reconstruction of the tumor microenvironment using bulk RNA-seq and a unique machine learning algorithm trained on artificial transcriptomes. Cancer Cell.
