Abstract
The spatial organization of different types of cells in tumor tissues reveals important information about the tumor microenvironment (TME). In order to facilitate the study of cellular spatial organization and interactions, we developed Histology-based Digital (HD)-Staining, a deep learning-based computation model, to segment the nuclei of tumor, stroma, lymphocyte, macrophage, karyorrhexis and red blood cells from standard Hematoxylin and Eosin (H&E)-stained pathology images in lung adenocarcinoma (ADC). Using this tool, we identified and classified cell nuclei and extracted 48 cell spatial organization-related features that characterize the TME. Using these features, we developed a prognostic model from the National Lung Screening Trial dataset, and independently validated the model in The Cancer Genome Atlas (TCGA) lung ADC dataset, in which the predicted high-risk group showed significantly worse survival than the low-risk group (p = 0.001), with a hazard ratio of 2.23 [1.37–3.65] after adjusting for clinical variables. Furthermore, the image-derived TME features significantly correlated with the gene expression of biological pathways. For example, transcriptional activation of both the T-cell receptor (TCR) and programmed cell death protein 1 (PD1) pathways positively correlated with the density of detected lymphocytes in tumor tissues, while expression of the extracellular matrix organization pathway positively correlated with the density of stromal cells. In summary, we demonstrate that the spatial organization of different cell types is predictive of patient survival and associated with the gene expression of biological pathways.
Keywords: Tumor microenvironment, Lung Cancer, Digital Pathology, Pathology images
INTRODUCTION
With the advance of technology, Hematoxylin and Eosin (H&E)-stained tissue slide scanning has become a routine clinical procedure, which produces pathology images that capture histological details in high resolution. Pathology images of tumor tissues contain not only essential information for tumor grade and subtype classifications(1), but also information on the tumor microenvironment (TME), such as the spatial organization of different types of cells. Cell spatial organization reveals cell growth patterns and the spatial interactions among different types of cells, which provide important insights into tumor progression and metastasis. A recent study by Cheng et al(2) developed an algorithm to segment the cell nucleus and extract the topological features for TME analysis in renal cell carcinoma (RCC), which improved the understanding of cell spatial organization and patient outcome in RCC. However, this study used an unsupervised approach that assigns an identified cell nucleus to one of the clusters without a clear definition, which hampered interpretation of the results. Recent studies(3,4) showed that the spatial organization and architecture of tumor-infiltrating lymphocytes (TIL) play important roles in the TME. However, these studies focused solely on recognizing lymphocytes and ignored other types of cells, which greatly limited exploration of the interactions among different types of cells. In this study, we developed a deep learning-based algorithm to examine standard H&E pathology images to automatically segment and classify different types of cell nuclei. This algorithm can be used as a tool to “computationally stain” different types of cell nuclei, in order to facilitate pathologists in examining tissue images and researchers in studying the TME from these standard clinical materials.
The major cell types in a malignant tissue include tumor cells, stromal cells, lymphocytes, and macrophages. Stromal cells are mainly connective tissue cells such as fibroblasts and pericytes. The interactions between tumor cells and stromal cells play an important role in cancer progression(5–7) and metastasis inhibition(8). Since the cell boundaries of tumor cells and stromal cells are often unclear in standard H&E stained lung cancer pathology images, we segmented and classified cell nuclei instead of whole cells. TIL are mainly white blood cells that have migrated into a tumor region. They are a mix of different types of cells, in which T cells are the most abundant population. The spatial organization of TIL has been associated with patient outcome and molecular profiles in multiple tumor types(9–12). Macrophages are inflammatory cells, and inflammation in tumor niches has been reported as a prognostic marker and correlated with tumor progression(13,14). Other tissues and cellular structures existing in the TME include blood vessels(15,16) and necrosis. In this study, blood cells and karyorrhexis are segmented to represent blood vessels and necrosis, respectively, in order to quantify blood vessels and necrosis and study their interactions with tumor cells, stromal cells, lymphocytes and macrophages.
In this study, we developed a deep learning algorithm, Histology-based Digital (HD)-Staining, using the Mask Regional Convolutional Neural Network (Mask R-CNN) architecture(17). We trained this HD-Staining model using pathology images of lung adenocarcinoma (ADC) patients from the National Lung Screening Trial (NLST) study with the nuclei of tumor cells, stromal cells, lymphocytes, macrophages, blood cells and karyorrhexis manually labeled by expert pathologists. Through the training step, the HD-Staining model automatically learned to identify different nuclei based on a wide range of feature maps, including color, size, and texture within the neighborhood area. The model accuracy was validated in a different set of images. From the identified cell types and cell spatial locations, we derived cell spatial organization features to characterize the TME. In our analysis, we found that these TME-related image features were significantly associated with patient overall survival. Based on these image features, a prognostic model for lung ADC patients was developed from the NLST dataset. This model was independently validated in the pathology image data from The Cancer Genome Atlas (TCGA) lung ADC (LUAD) dataset, in which the predicted high-risk group showed significantly worse survival than the low-risk group (p = 0.001), with a hazard ratio of 2.23 [1.37–3.65] after adjusting for clinical variables.
Furthermore, the image-derived TME features significantly correlated with the gene expression of biological pathways. For example, transcriptional activation of both the T-cell receptor (TCR) and programmed cell death protein 1 (PD1) pathways positively correlated with the density of detected lymphocytes in tumor tissues, while expression of the extracellular matrix organization pathway positively correlated with the density of stromal cells.
In this study, we developed an HD-Staining algorithm for nuclei segmentation and cell classification as a tool to study the tumor morphological microenvironment using tissue pathology images in lung ADC. In order to facilitate usage of this deep learning tool, a user-friendly web portal has been developed and can be accessed at http://lce.biohpc.swmed.edu/maskrcnn/analysis.php. Although this tool was developed in lung ADC pathology images, our results showed that the HD-Staining method can be adapted and applied in head and neck cancer, breast cancer and lung squamous cell carcinoma pathology image datasets. The web portal also provides functions to facilitate researchers in adapting the HD-Staining model for other types of cancers.
MATERIALS AND METHODS
Dataset
Pathology images that support the findings of this study are available online in NLST (https://biometry.nci.nih.gov/cdas/nlst/) and The Cancer Genome Atlas Lung Adenocarcinoma (TCGA-LUAD, https://wiki.cancerimagingarchive.net/display/Public/TCGA-LUAD). mRNA expression data for the TCGA dataset are available online at http://firebrowse.org. The H&E-stained pathology images together with the corresponding clinical data were obtained from the NLST and TCGA Lung ADC cohorts: 208 40X pathology images for 135 lung ADC patients were acquired from the NLST dataset, and 431 40X pathology images for 372 lung ADC patients were acquired from the TCGA LUAD dataset (there could be multiple pathology images for a single patient). To refine our analysis within TME, a specialized lung cancer pathologist, Dr. Lin Yang, labeled the tumor Region of Interest (ROI) for each of the pathology images (Figure 1). Another lung cancer pathologist, Dr. Adi Gazdar, confirmed the labelling. Dr. Shirley Yan, another lung cancer pathologist, annotated the lung ADC histology subtypes for the NLST dataset. Clinical characteristics of the patients in this study are summarized in Supplemental Table 1.
Nuclei segmentation using HD-Staining
Training, validation, and testing sets preparation
In order to construct the training set for the HD-Staining algorithm, 127 image patches (500 × 500 pixels) from 39 pathological ROIs (Figure 1) were extracted from the NLST dataset. In these patches, different types of cell nuclei were labeled. All the pixels with tumor nuclei, stromal nuclei, lymphocyte nuclei, macrophage nuclei, red blood cells, and karyorrhexis were labeled according to their categories and all the remaining pixels were considered “other”. These labels, also collectively called the “mask”, were then used as the ground truth to train the HD-Staining model. The labeled images were randomly divided into training, validation, and testing sets. To ensure independence among these datasets, image patches from the same ROI were assigned together. More than 12,000 cell nuclei were included in the training set (tumor nuclei 24.1%, stromal nuclei 23.9%, lymphocytes 29.5%, red blood cells 5.8%, macrophages 1.5%, karyorrhexis 15.2%), while 1227 and 1086 nuclei were included in the validation and testing sets, respectively.
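The ROI-level assignment described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the patch and ROI identifiers, and the 80/10/10 proportions are all assumptions, since the text does not state the split ratios.

```python
import random
from collections import defaultdict


def split_patches_by_roi(patch_to_roi, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly assign whole ROIs to training, validation and testing sets.

    patch_to_roi: dict mapping a patch ID to the ROI ID it was cut from.
    fractions: hypothetical split proportions (not given in the text).
    """
    roi_to_patches = defaultdict(list)
    for patch_id, roi_id in patch_to_roi.items():
        roi_to_patches[roi_id].append(patch_id)

    roi_ids = sorted(roi_to_patches)
    random.Random(seed).shuffle(roi_ids)

    n_train = round(fractions[0] * len(roi_ids))
    n_val = round(fractions[1] * len(roi_ids))
    groups = {
        "train": roi_ids[:n_train],
        "validation": roi_ids[n_train:n_train + n_val],
        "test": roi_ids[n_train + n_val:],
    }
    # Every patch follows its ROI, so no ROI is shared between sets.
    return {name: sorted(p for roi in rois for p in roi_to_patches[roi])
            for name, rois in groups.items()}


# Toy input: 10 made-up ROIs with 3 patches each.
patch_to_roi = {f"roi{r}_patch{p}": f"roi{r}" for r in range(10) for p in range(3)}
splits = split_patches_by_roi(patch_to_roi)
```

Splitting at the ROI level rather than the patch level keeps neighboring patches of the same tissue region from appearing in both the training and the evaluation data.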
Training process
A deep learning model was developed using the Mask-RCNN architecture. We used the Keras implementation by Waleed Abdulla (https://github.com/matterport/Mask_RCNN, last accessed on Oct. 1, 2019). While the native structure of Mask-RCNN was kept, we adapted the implementation for pathological image analysis by customizing the data loader, the image augmenter, and image centering and scaling. The model pre-trained on the COCO dataset (http://github.com/matterport/Mask_RCNN/releases/download/v2.1/mask_rcnn_balloon.h5) was fine-tuned on our training dataset from the NLST study. Images were standardized (centered and scaled to have zero mean and unit variance) for each RGB channel. To increase generalizability and avoid bias from different H&E staining conditions, we performed extensive augmentations on the image patches. In particular, random projective transformations were applied to images and their corresponding masks, and each image channel was randomly shifted using a linear transformation(18). For the training process, the batch size was set to 2, the learning rate was set to 0.01 and decreased to 0.001 after 500 epochs, the momentum was set to 0.9, and the maximum number of epochs to train was set to 1000. In the validation set, the model trained at the 707th epoch reached the lowest loss. This model was selected and used in the following analysis to avoid overfitting. Python (version 3.5.2) and Python libraries (Keras, version 2.1.5; openslide-python, version 1.1.1; tensorflow-gpu, version 1.8.0) were used(19).
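Three of the training choices above (the step learning-rate schedule, selection of the epoch with the lowest validation loss, and per-channel standardization) can be illustrated in plain Python. This is a sketch under assumptions, not the Mask-RCNN training code: the function names are hypothetical, epochs are assumed to be counted from 1, the loss values are made up, and standardization is shown for a single channel of a single image because the text does not say whether the statistics were computed per image or per dataset.

```python
def learning_rate(epoch, initial=0.01, reduced=0.001, switch_epoch=500):
    """Step schedule: `initial` for the first `switch_epoch` epochs, then `reduced`.

    Epochs are counted from 1 here; that indexing is an assumption.
    """
    return initial if epoch <= switch_epoch else reduced


def best_epoch(validation_losses):
    """Return the 1-based epoch whose validation loss is lowest."""
    lowest = min(range(len(validation_losses)), key=validation_losses.__getitem__)
    return lowest + 1


def standardize_channel(values):
    """Center and scale one color channel to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5 or 1.0  # avoid dividing by zero on a flat channel
    return [(v - mean) / std for v in values]


# Made-up validation losses for five epochs.
validation_losses = [0.9, 0.6, 0.4, 0.5, 0.7]
selected = best_epoch(validation_losses)

# Made-up intensities for one channel of a tiny image.
red_channel = standardize_channel([10, 20, 30])
```

Keeping the checkpoint with the lowest validation loss, rather than the last one, is the early-stopping style safeguard against overfitting that the paragraph describes.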
Segmentation performance evaluation
Since the HD-Staining model simultaneously segments and classifies cell nuclei, three criteria were used to evaluate the segmentation performance in the validation and testing datasets, respectively. First, detection coverage was calculated as the ratio between the detected nuclei and the total ground truth nuclei. Each ground truth nucleus was matched to the segmented nucleus that generated the maximum Intersection over Union (IoU). If the IoU for a ground truth nucleus was > 0.5, this nucleus was labeled “matched”; otherwise it was labeled “unmatched”. Second, nuclei classification accuracy was determined for the matched nuclei by comparing the predicted nucleus type with the ground truth. Third, segmentation accuracy was evaluated by the IoUs, which were calculated for each detected nucleus and averaged in different nuclei categories.
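The three criteria can be sketched on toy data as follows. The code is illustrative only: masks are represented as sets of (row, column) pixels, the helper names are hypothetical, and grouping the mean IoU by the ground truth category is an assumption about how the per-category averages were formed.

```python
from collections import defaultdict


def rectangle(r0, c0, r1, c1):
    """Toy mask: the set of pixels in rows r0..r1-1 and columns c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


def iou(mask_a, mask_b):
    """Intersection over Union of two masks given as sets of pixels."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def evaluate(ground_truth, predictions, threshold=0.5):
    """ground_truth and predictions are lists of (mask, label) tuples."""
    ious_by_label = defaultdict(list)
    n_matched = n_correct = 0
    for gt_mask, gt_label in ground_truth:
        # Pair each ground truth nucleus with the prediction of maximum IoU.
        best_iou, best_label = 0.0, None
        for pred_mask, pred_label in predictions:
            score = iou(gt_mask, pred_mask)
            if score > best_iou:
                best_iou, best_label = score, pred_label
        if best_iou > threshold:  # "matched"; otherwise "unmatched"
            n_matched += 1
            n_correct += best_label == gt_label
            ious_by_label[gt_label].append(best_iou)
    return {
        "coverage": n_matched / len(ground_truth),
        "accuracy": n_correct / n_matched if n_matched else 0.0,
        "mean_iou": {k: sum(v) / len(v) for k, v in ious_by_label.items()},
    }


ground_truth = [
    (rectangle(0, 0, 4, 4), "tumor"),
    (rectangle(10, 10, 14, 14), "lymphocyte"),
    (rectangle(20, 20, 24, 24), "stroma"),  # nothing overlaps this nucleus
]
predictions = [
    (rectangle(0, 0, 4, 4), "tumor"),       # perfect overlap, right class
    (rectangle(10, 11, 14, 15), "tumor"),   # shifted by one column, wrong class
]
result = evaluate(ground_truth, predictions)
```

In this toy case two of the three ground truth nuclei are matched, and one of the two matched nuclei carries the correct class.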
Image feature extraction to describe nuclei composition and organization
In order to make the HD-Staining model more computationally efficient while retaining a good representation of each ROI, instead of applying the model to the whole slide, 100 image patches (1024 × 1024 pixels) were randomly sampled and analyzed for each pathologist-labeled ROI (Supplemental Figure S1A). These 100 image patches provided good coverage of each ROI (Supplemental Figure S1B). Nuclei were then segmented and classified through the HD-Staining model developed from this study (Figure 1). In order to characterize the spatial organization of cells using a graph, we calculated the centroids of nuclei and used them as vertices to construct a Delaunay triangle graph for each image patch. The Delaunay triangle graph connects nuclei into a graph, and the number of connections and the average length (i.e. spatial distance) between two types of nuclei summarize the spatial organization of different types of cells. Since 6 nucleus categories were included in this study, the edges of the graph were classified into 21 categories [i.e. 6 × (6+1)/2 = 21] according to their vertex pairs. For each image patch, the number of connections (i.e. edges) was counted for each edge category (yielding 21 image features), the lengths of the connections were averaged for each edge category (yielding another 21 image features), and the density of each type of nucleus was calculated (yielding 6 image features). In total, 48 image features were extracted. The image features were averaged across the 100 patches for each ROI in the pathology image. When 2 or more pathology slides were available for 1 patient, the features from the slides were averaged for each patient. Thus, in total 48 image features were extracted for each patient, in both the NLST and TCGA datasets.
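The 48 patch-level features can be illustrated with a small self-contained sketch. Several things here are assumptions rather than the authors' implementation: the triangulation is a brute-force empty-circumcircle test that is only practical for a handful of points and ignores degenerate (co-circular) cases, whereas a real pipeline would more likely call a library routine such as `scipy.spatial.Delaunay`; the mean length of an edge category with no edges is set to 0; density is expressed as nuclei per pixel of patch area; and all names and coordinates are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def delaunay_edges(points, eps=1e-9):
    """Brute-force Delaunay edges: keep a triangle if its circumcircle is empty."""
    edges = set()
    for i, j, k in combinations(range(len(points)), 3):
        (ax, ay), (bx, by), (cx, cy) = points[i], points[j], points[k]
        d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
        if abs(d) < eps:  # collinear points have no circumcircle
            continue
        a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
        ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
        uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
        radius = math.hypot(ax - ux, ay - uy)
        if all(math.hypot(px - ux, py - uy) >= radius - eps
               for m, (px, py) in enumerate(points) if m not in (i, j, k)):
            edges.update({(i, j), (i, k), (j, k)})
    return edges


def patch_features(points, labels, categories, patch_area):
    """Edge counts and mean edge lengths per category pair, plus densities."""
    pairs = [(a, b) for n, a in enumerate(categories) for b in categories[n:]]
    lengths = defaultdict(list)
    for i, j in delaunay_edges(points):
        pair = tuple(sorted((labels[i], labels[j]), key=categories.index))
        lengths[pair].append(math.hypot(points[i][0] - points[j][0],
                                        points[i][1] - points[j][1]))
    features = {}
    for pair in pairs:
        name = "-".join(pair)
        features["count:" + name] = len(lengths[pair])
        # Assumption: a category pair with no edges gets mean length 0.
        features["mean_length:" + name] = (
            sum(lengths[pair]) / len(lengths[pair]) if lengths[pair] else 0.0)
    for category in categories:
        features["density:" + category] = labels.count(category) / patch_area
    return features


categories = ["tumor", "stroma", "lymphocyte", "macrophage",
              "red blood cell", "karyorrhexis"]
# Toy centroids: a triangle of nuclei with one nucleus inside it.
points = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0), (2.0, 1.0)]
labels = ["tumor", "tumor", "stroma", "lymphocyte"]
features = patch_features(points, labels, categories, patch_area=1024 * 1024)
```

With six categories the sketch returns 21 edge counts, 21 mean edge lengths and 6 densities, matching the 48 features per patch described above; averaging over patches and slides would follow as a separate step.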
Prognostic model development and validation
Overall survival, defined as the time from the date of diagnosis to death or last contact, was used as the response variable for survival analyses. A CoxPH (Cox Proportional Hazard) prognostic model for overall survival was developed in the lung ADC patients in the NLST dataset and independently validated in the TCGA LUAD dataset. An elastic net penalty was used to avoid overfitting, and 22 features were selected in the final CoxPH model (Supplemental Table 3). Given the 22 image-derived TME features for each patient, the prognostic model calculates a risk score for the patient by summing the products of the features and their corresponding coefficients, with a higher risk score indicating worse prognosis. The median risk score was used as the cutoff. The patients with a risk score higher than the median risk score were predicted as the high-risk group, and otherwise as the low-risk group. The survival curves of the predicted high- and low-risk groups were estimated using the Kaplan-Meier method. The survival differences between predicted high- and low-risk groups were compared using a log-rank test. A multivariate Cox proportional hazard model was used to determine the prognostic value of predicted risk groups (using image-derived TME features) after adjusting for other clinical characteristics, including age, gender, smoking status, and stage. R software, version 3.4.2, and R packages (survival, version 2.41–3; glmnet, version 2.0–13; spatstat, version 1.55–1) were used(20,21). The results were considered significant if the two-tailed p value <0.05.
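The risk score and the median split can be written out in a few lines. The sketch below is in Python for illustration only (the analysis itself was run in R), and everything concrete in it is hypothetical: the two feature names, the coefficient values and the patient IDs are made up, and the cutoff is assumed to be the median of whatever set of scores is passed in.

```python
from statistics import median


def risk_score(features, coefficients):
    """Linear predictor: sum of feature values times their Cox coefficients."""
    return sum(coefficients[name] * features[name] for name in coefficients)


def assign_risk_groups(scores):
    """Label a patient 'high' if the score exceeds the median, else 'low'."""
    cutoff = median(scores.values())
    return {pid: "high" if s > cutoff else "low" for pid, s in scores.items()}


# Made-up coefficients for two features (the final model used 22).
coefficients = {"density:karyorrhexis": 0.8, "density:stroma": -0.5}
patients = {
    "P1": {"density:karyorrhexis": 1.0, "density:stroma": 2.0},
    "P2": {"density:karyorrhexis": 2.0, "density:stroma": 1.0},
    "P3": {"density:karyorrhexis": 0.5, "density:stroma": 0.5},
    "P4": {"density:karyorrhexis": 0.0, "density:stroma": 1.0},
}
scores = {pid: risk_score(f, coefficients) for pid, f in patients.items()}
groups = assign_risk_groups(scores)
```

A positive coefficient raises the score as the feature increases, so in this toy example more karyorrhexis pushes a patient toward the high-risk group and more stroma toward the low-risk group.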
Association analysis between image features and gene expression of biological pathways.
Gene expression data of 372 patients from the TCGA LUAD dataset were downloaded and preprocessed: the genes whose mRNA expression levels were 0 in more than 20% of patient samples were removed. The correlation between mRNA expression levels and image-derived TME features was evaluated using Spearman rank correlation. Gene set enrichment analysis (GSEA) was performed for each TME feature. All gene sets from the Reactome database were used(22). For multiple testing correction, Benjamini-Hochberg (BH)-adjusted p values were used to detect significantly enriched gene sets. Gene sets with BH-adjusted two-tailed p values < 0.05 were regarded as significantly enriched. R packages Hmisc (version 4.1–1), fgsea (version 1.4.1), and gplots (version 3.0.1) were used(23).
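The preprocessing and statistics around the enrichment analysis (the zero-expression filter, Spearman rank correlation, and Benjamini-Hochberg adjustment) can be sketched in plain Python. This does not reproduce GSEA itself, which was run with the R package fgsea; the gene names, expression values and p values below are made up, and the helper names are hypothetical.

```python
def filter_genes(expression, max_zero_fraction=0.2):
    """Drop genes whose expression is 0 in more than 20% of samples."""
    return {gene: values for gene, values in expression.items()
            if sum(v == 0 for v in values) / len(values) <= max_zero_fraction}


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for idx in order[start:end + 1]:
            result[idx] = (start + end) / 2 + 1
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=p_values.__getitem__, reverse=True)
    adjusted, running_min = [0.0] * n, 1.0
    for position, idx in enumerate(order):
        rank = n - position  # rank of this p value in ascending order
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up expression values for five samples and one image feature.
expression = {
    "GENE_A": [0, 0, 0, 1, 2],   # zero in 60% of samples, removed
    "GENE_B": [1, 2, 3, 4, 5],
    "GENE_C": [5, 4, 0, 2, 1],   # zero in 20% of samples, kept
}
lymphocyte_density = [0.1, 0.2, 0.3, 0.4, 0.5]
kept = filter_genes(expression)
correlations = {g: spearman(v, lymphocyte_density) for g, v in kept.items()}
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.5])
```

Ranking genes by their correlation with an image feature gives the ordered gene list that an enrichment method then tests against each gene set, and the Benjamini-Hochberg step controls the false discovery rate across the many gene sets tested.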
RESULTS
HD-Staining simultaneously and accurately classifies and segments cell nuclei
The developed HD-Staining model segments and classifies individual nuclei at the same time (Supplemental Figure S2 A, B & C). Figure 2A demonstrates some of the segmentation and classification results. In total, the segmented cell nuclei were classified into six categories: tumor cell, stromal cell, lymphocyte, macrophage, red blood cell, and karyorrhexis, and all the remaining structures or spaces were considered background. Different nuclei were colored according to the predicted categories (Figure 2A). For detected objects, the overall classification accuracy was 85% in both the validation set and the testing set, while the accuracy for tumor nuclei was 88% in the validation set and 90% in the testing set (Figure 2B). The stability of classification accuracy was further evaluated and validated in the 5 major LUAD subtypes: lepidic, papillary, acinar, micropapillary, and solid (Supplemental Figure S3 A–E)(24). It is noteworthy that the developed HD-Staining model can be applied to the entire digital pathology image to generate a cell spatial organization map across the whole slide, where tumor region and lymphocyte infiltration areas are clearly illustrated (Figure 3), hinting at its potential ability to assist pathologic diagnosis.
To further validate the cell classification accuracy, nuclei detection results on H&E images were compared with immunohistochemistry (IHC)-stained images from consecutively cut slides of the same sample (Supplemental Figure S4, S5, and S6). From these consecutively cut slides of the same tumor tissue, we observed a consistent pattern between the lymphocytes predicted by the HD-Staining model and IHC staining for CD3, a marker for T cells. Similarly, we observed consistency between the predicted macrophages and IHC staining for CD68, a marker for cells in the monocyte lineage. These comparisons with IHC staining validated the cell classification accuracy of the HD-Staining model.
Prognostic value of nuclei composition and organization in the TME
A Delaunay triangle graph was constructed for each image patch(2) to extract topological features from nuclei spatial organization to characterize the TME (Supplemental Figure S7). The nuclei and edges were counted and edge lengths averaged to yield 48 image features (see Methods Section). Supplemental Table 2 summarizes the TME features that significantly correlated with survival outcome in univariate analysis. It shows that higher karyorrhexis density, more karyorrhexis-karyorrhexis connections and more karyorrhexis-red blood cell connections were associated with worse survival outcome, which was expected as these features indicate a higher rate of tumor necrosis. Furthermore, higher stromal nuclei density and more stromal-stromal connections were associated with better survival outcome, which agrees with current knowledge that more stromal tissue corresponds to better prognosis.
A prognostic model based on the image features was developed in the NLST dataset and then independently validated in the TCGA LUAD dataset. Figure 4 shows the survival curves of the predicted high and low-risk groups in the TCGA cohort, where the patients in the predicted high-risk group show significantly worse survival than those in the predicted low-risk group (log-rank test, p value = 0.0011). Furthermore, the risk group defined by the TME features serves as an independent prognostic factor (high- vs. low-risk, Hazard Ratio = 2.23, 95% Confidence Interval = 1.37–3.65, and p value=0.0013), after adjusting for clinical variables, including age, gender, smoking status, and stage (Table 1).
Table 1.
TCGA dataset (n=371) | HR (95% CI) | p value |
---|---|---|
High- vs. low-risk | 2.23 (1.37 – 3.65) | 0.0013 |
Age (year) | 1.02 (0.99 – 1.05) | 0.097 |
Male vs. female | 0.90 (0.55 – 1.49) | 0.70 |
Smoker vs. non-smoker | 1.09 (0.66 – 1.81) | 0.73 |
Stage | | |
Stage I | ref | - |
Stage II | 2.36 (1.28 – 5.30) | 0.0029 |
Stage III | 4.59 (2.44 – 8.63) | <0.001 |
Stage IV | 3.30 (1.30 – 8.41) | 0.012 |
CI, confidence interval; HR, hazard ratio.
Association between image features and transcriptional activity of biological pathways
GSEA was performed to identify the biological pathways whose mRNA expression profiles significantly correlated with image-derived TME features in the TCGA dataset. Figure 5 and Supplemental Figure S8 show examples of these biological pathways. For example, the transcriptional activation of both the T-cell receptor (TCR) and programmed cell death protein 1 (PD1) pathways positively correlated with lymphocyte density in the tumor tissue (Figure 5A). This observation is consistent with previous reports that genes involved in the TCR and PD1 pathways are expressed in immune cells(25,26). In addition, expression of the extracellular matrix organization gene set, for which fibroblasts act as an important source(27), positively correlated with stromal cell density in tumor tissue (Supplemental Figure S8A). In a negative control experiment where we randomly shuffled the patient IDs and repeated the same analysis, such correlation was no longer observed (Supplemental Figure S9).
Furthermore, GSEA showed that the cell cycle pathway was significantly enriched with genes whose expression levels correlated with both the tumor nuclei density (Figure 5B) and karyorrhexis density in tumor tissue (Supplemental Figure S8B). To look into the relationship between tumor cell density and the gene expression of the cell cycle pathway, we grouped and sorted the patients in the TCGA LUAD cohort according to their tumor nuclei density. Figure 5C shows, for each patient group in the TCGA LUAD dataset, the average expression levels of the genes within the cell cycle pathway whose expression levels significantly (p value <0.001) correlated with tumor nuclei density. Positive correlations between gene expression and tumor nuclei density can be observed for most of the cell cycle-related genes, except for one gene, POLD4, which showed an inverse trend. Most of the genes in the cell cycle pathway have higher expression in tumors with higher tumor nuclei density (which may indicate a higher tumor grade), while POLD4 shows the opposite pattern. This pattern of POLD4 compared with other genes in the cell cycle gene set is consistent with a previous study of lung cancer(28): while most cell cycle genes were upregulated in lung cancer, POLD4 was usually downregulated.
Webserver for publicly accessible pathological image segmentation model
In order to facilitate usage of the HD-Staining model developed in this study, we also developed an online tool (http://lce.biohpc.swmed.edu/maskrcnn/analysis.php) for this deep learning-based nuclei segmentation and classification model (Figure 6). This tool requires only a pathology image (or a patch from the image) as the input (Figure 6A). Each uploaded input image will be assigned a job ID (Figure 6B). The segmentation results will be automatically displayed and the spatial coordinates of each nucleus can be downloaded as an Excel table (Figure 6C). In order to assist researchers in using this tool to study TME-related features for other cancer types, we also provide a function to automatically generate a mask for other cancer types. The newly generated segmentation mask can greatly reduce the manual work of creating the training sets for other cancer types, and thus accelerate the development of applications for pathology image analysis.
DISCUSSION
In this study, we developed a deep learning-based analysis tool to study the TME using standard H&E stained pathology images. This tool successfully visualized and quantified the spatial organization of tumor cells, stromal cells, lymphocytes, inflammatory cells, red blood cells and karyorrhexis in the tumor tissues of lung ADC patients. The topological features of cell spatial organization were used to characterize the TME. Our results showed that these features were associated with patient survival outcome and the gene expression of biological pathways. From these image-derived TME features, we developed a prognostic model for lung ADC patients and independently validated it in another lung ADC patient cohort. The prognostic model predicts patient survival independent of other clinical variables in the validation cohort.
Several previous studies have tried to analyze the TME and discovered prognostic image features. However, these studies involved time-consuming hand-labeling by pathologists(29–31). In contrast, we developed a fully automated and objective nuclei segmentation and classification strategy. In addition, this deep learning-based method enables the segmentation of nuclei within a whole slide image, including small biopsy samples. Since the number of cells in a whole slide image could be tremendous (~2,000,000 on average), manually labelling all of them is impractical. Thus, this deep-learning method empowers quantification of the TME across the whole slide image. Furthermore, although developed in lung ADC, this method can be easily generalized to other cancer types by retraining the model using the tools provided by our web portal.
In pathology image analysis, three-dimensional tissue structures are captured as two-dimensional images, and the cell nuclei may “touch” and “overlap” each other in the resulting images. This is one of the major challenges for nuclei segmentation in pathology image analysis. In this study, we developed an HD-Staining model to segment and classify different types of cell nuclei in order to study the spatial interactions among different cell types and tissue structures. Compared with other image segmentation algorithms, the HD-Staining method has several advantages. First, it segments and classifies nuclei at the same time, while traditional nuclei segmentation algorithms relying on color deconvolution cannot classify cell types(2,32). Second, by using extensive color augmentation during the training process, it adapts to different staining conditions, which makes the algorithm more robust and allows us to avoid the time-consuming color normalization steps(33). Third, compared with traditional statistical approaches, the deep learning-based approach does not require handcrafted feature extraction, can be highly parallelized, and saves time. With GPU-aided computation, processing (classifying or segmenting) a 1000-by-1000-pixel image usually takes less than one second for HD-Staining, much faster than non-deep learning-based image segmentation methods(34). Fourth, compared with other popular semantic image segmentation neural networks such as Fully Convolutional Network (FCN), SegNet, and Deeplab(35–37) that classify each pixel, HD-Staining is intrinsically an instance segmentation algorithm that detects an object bounding box first and assigns pixels as foreground or background within this bounding box(17). In summary, HD-Staining provides a new solution to segmenting closely clustered nuclei in tissue pathology images.
The associations between the extracted TME features and patient prognosis were evaluated in this study. Karyorrhexis, a representative of necrosis, has been reported as an aggressive tumor phenotype in lung cancer(38). Consistently, the density of karyorrhectic cells and the number of karyorrhexis-karyorrhexis edges were negative prognostic factors in this study. On the other hand, the density of stromal cells and the number of stromal cell-stromal cell edges were positive prognostic factors, which is consistent with a recent report on lung ADC patients(8). These consistencies indicate the validity of this MaskRCNN-based deep neural network and the potential of using cell organization features as novel biomarkers for clinical outcomes. Currently, there are many lung cancer prognosis models using clinical data, such as nomogram models(39,40). In this study, we showed that the image-based model has the predictive power to separate patients with significant survival differences after adjusting for clinical variables. This suggests that a prognostic model combining clinical variables with a pathology image-based risk score could predict patient prognosis better than clinical variables alone. In addition, many genomic signatures have been developed in recent years for lung cancer patient prognosis(41). Compared with genomic signatures, the pathology image-based prediction model can be more easily integrated into current clinical practice, as pathology image-based diagnosis is a part of routine clinical procedures. Thus, modeling using imaging data has practical advantages in the clinic. Furthermore, tumor genomic profiles and pathology images provide complementary information and characterization of the tumor. Therefore, integrating pathology images, genomic data and clinical data is potentially powerful for better characterization of tumors and prediction of patient outcomes.
Gene expression patterns have been widely used to study the underlying biological mechanisms of different tumor types and subtypes(42,43); moreover, genes with abnormal expression could become potential therapeutic targets of cancers(44,45). However, traditional transcriptome profiling is usually done in bulk tumor(42,46), which contains multiple cell types, such as stromal cells and lymphocytes, in addition to tumor cells. This bulk tumor-based sequencing could blur or diminish the mRNA expression changes arising from a single cell type or from different cell compositions in the TME. Currently, the relationship between the transcription activities of biological pathways and the TME remains unclear. In this study, the image-derived TME features show interesting correlations with the transcriptional activities of biological pathways. For example, gene expression levels of the TCR and PD1 pathways positively correlated with the density of lymphocytes detected from tumor tissues. As genes involved in the TCR and PD1 pathways are expressed in immune cells(25,26), such correlation illustrates the contribution of lymphocytes to bulk tumor transcriptome profiling and thus validates the accuracy of both image-based nuclei detection and genetic sequencing of bulk tumor. This indicates the image-derived TME features may be used to study or predict immunotherapy response, since several promising cancer immunotherapies rely on activation of tumor-infiltrated immune cells and blocking immune checkpoint pathways(26,47). In addition, gene expression of the extracellular matrix organization pathway is associated with the density of stromal cells in tumor tissues. Since traditional transcriptome sequencing is done in bulk tumor, accurate cell composition derived from pathology images could help to improve the evaluation of gene expression for each individual cell type.
Moreover, the correlation between image features and transcriptional patterns of biological pathways hints at the potential usage of image features to study tumor bioprocesses, including cell cycle and metabolism status.
There are some limitations to this HD-Staining model. First, information on the individual nuclei, such as nucleus shape and size, was not considered since this study focused on nuclei organization. Morphological and intensity features of nuclei have been reported as prognostic factors, which can be automatically extracted using this nuclei segmentation algorithm(48). Second, some special structures, such as bronchi and cartilage, were not included in this algorithm. This study handled this problem by avoiding such structures during ROI annotation. However, a more comprehensive training set would be desirable for whole slide analysis. Moreover, to distinguish different functional subtypes of lymphocytes, stromal cells(49), and macrophages(50), tissue slides which are sequentially stained with H&E and IHC of specific markers would be needed as a training set. Distinguishing functional cell subtypes would further illustrate which subtypes play predominant roles in patient prognosis. Third, to represent whole tumor heterogeneity, more than one slide, such as tissue microarray, should be collected and analyzed per tumor.
Supplementary Material
Significance:
Findings present a deep learning-based analysis tool to study the tumor microenvironment in pathology images and demonstrate that the cell spatial organization is predictive of patient survival and is associated with gene expression.
Acknowledgements
This work was partially supported by the National Institutes of Health [5R01CA152301, P50CA70907, 1R01GM115473, and 1R01CA172211], and the Cancer Prevention and Research Institute of Texas [RP190107 and RP180805].
We thank the late Dr. Adi Gazdar for his critical input and discussion throughout this project, and for confirming the annotation of the pathology images. We also thank Jessie Norris for helping us to edit this manuscript, and Dr. Justin Bishop for support of the SPORE Pathology Core.
Footnotes
Conflict of Interests: The authors declare no potential conflicts of interest.
References
- 1. Travis WD, Brambilla E, Nicholson AG, Yatabe Y, Austin JHM, Beasley MB, et al. The 2015 World Health Organization Classification of Lung Tumors: Impact of Genetic, Clinical and Radiologic Advances Since the 2004 Classification. J Thorac Oncol 2015;10:1243–60
- 2. Cheng J, Mo X, Wang X, Parwani A, Feng Q, Huang K. Identification of topological features in renal tumor microenvironment associated with patient survival. Bioinformatics 2018;34:1024–30
- 3. Corredor G, Wang X, Zhou Y, Lu C, Fu P, Syrigos K, et al. Spatial Architecture and Arrangement of Tumor-Infiltrating Lymphocytes for Predicting Likelihood of Recurrence in Early-Stage Non-Small Cell Lung Cancer. Clin Cancer Res 2019;25:1526–34
- 4. Saltz J, Gupta R, Hou L, Kurc T, Singh P, Nguyen V, et al. Spatial Organization and Molecular Correlation of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images. Cell Rep 2018;23:181–93 e7
- 5. Nakamura H, Ichikawa T, Nakasone S, Miyoshi T, Sugano M, Kojima M, et al. Abundant tumor promoting stromal cells in lung adenocarcinoma with hypoxic regions. Lung Cancer 2018;115:56–63
- 6. Bremnes RM, Donnem T, Al-Saad S, Al-Shibli K, Andersen S, Sirera R, et al. The role of tumor stroma in cancer progression and prognosis: emphasis on carcinoma-associated fibroblasts and non-small cell lung cancer. J Thorac Oncol 2011;6:209–17
- 7. Pietras K, Ostman A. Hallmarks of cancer: interactions with the tumor stroma. Exp Cell Res 2010;316:1324–31
- 8. Ichikawa T, Aokage K, Sugano M, Miyoshi T, Kojima M, Fujii S, et al. The ratio of cancer cells to stroma within the invasive area is a histologic prognostic parameter of lung adenocarcinoma. Lung Cancer 2018
- 9. Gooden MJ, de Bock GH, Leffers N, Daemen T, Nijman HW. The prognostic influence of tumour-infiltrating lymphocytes in cancer: a systematic review with meta-analysis. Br J Cancer 2011;105:93–103
- 10. Miyashita M, Sasano H, Tamaki K, Hirakawa H, Takahashi Y, Nakagawa S, et al. Prognostic significance of tumor-infiltrating CD8+ and FOXP3+ lymphocytes in residual tumors and alterations in these parameters after neoadjuvant chemotherapy in triple-negative breast cancer: a retrospective multicenter study. Breast Cancer Res 2015;17:124.
- 11. Huh JW, Lee JH, Kim HR. Prognostic significance of tumor-infiltrating lymphocytes for patients with colorectal cancer. Arch Surg 2012;147:366–72
- 12. Brambilla E, Le Teuff G, Marguet S, Lantuejoul S, Dunant A, Graziano S, et al. Prognostic Effect of Tumor Lymphocytic Infiltration in Resectable Non-Small-Cell Lung Cancer. J Clin Oncol 2016;34:1223–30
- 13. Proctor MJ, Morrison DS, Talwar D, Balmer SM, O’Reilly DSJ, Foulis AK, et al. An inflammation-based prognostic score (mGPS) predicts cancer survival independent of tumour site: a Glasgow Inflammation Outcome Study. Brit J Cancer 2011;104:726–34
- 14. Jafri SH, Shi RH, Mills G. Advance lung cancer inflammation index (ALI) at diagnosis is a prognostic marker in patients with metastatic non-small cell lung cancer (NSCLC): a retrospective review. Bmc Cancer 2013;13
- 15. Matsuyama K, Chiba Y, Sasaki M, Tanaka H, Muraoka R, Tanigawa N. Tumor angiogenesis as a prognostic marker in operable non-small cell lung cancer. Annals of Thoracic Surgery 1998;65:1405–9
- 16. Fontanini G, Bigini D, Vignati S, Basolo F, Mussi A, Lucchi M, et al. Microvessel Count Predicts Metastatic Disease and Survival in Non-Small-Cell Lung-Cancer. J Pathol 1995;177:57–63
- 17. He KM, Gkioxari G, Dollar P, Girshick R. Mask R-CNN. Ieee I Conf Comp Vis 2017:2980–8
- 18. Wang SD, Yang DH, Rang RC, Zhan XW, Xiao GH. Pathology Image Analysis Using Segmentation Deep Learning Algorithms. Am J Pathol 2019;189:1686–98
- 19. Chollet F. 2015. Keras. <https://keras.io>.
- 20. Therneau T. A Package for Survival Analysis in S 2015.
- 21. Team RC. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2016.
- 22. Fabregat A, Sidiropoulos K, Garapati P, Gillespie M, Hausmann K, Haw R, et al. The Reactome pathway Knowledgebase. Nucleic Acids Res 2016;44:D481–7
- 23. Sergushichev A. An algorithm for fast preranked gene set enrichment analysis using cumulative statistic calculation. BioRxiv 2016:060012
- 24. Travis WD, Brambilla E, Noguchi M, Nicholson AG, Geisinger KR, Yatabe Y, et al. International Association for the Study of Lung Cancer/American Thoracic Society/European Respiratory Society International Multidisciplinary Classification of Lung Adenocarcinoma. Journal of Thoracic Oncology 2011;6:244–85
- 25. Cantrell DA. T cell antigen receptor signal transduction pathways. Cancer Surv 1996;27:165–75
- 26. Pardoll DM. The blockade of immune checkpoints in cancer immunotherapy. Nat Rev Cancer 2012;12:252–64
- 27. Kalluri R, Zeisberg M. Fibroblasts in cancer. Nat Rev Cancer 2006;6:392–401
- 28. Huang QM, Tomida S, Masuda Y, Arima C, Cao K, Kasahara TA, et al. Regulation of DNA Polymerase POLD4 Influences Genomic Instability in Lung Cancer. Cancer Research 2010;70:8407–16
- 29. Beck AH, Sangoi AR, Leung S, Marinelli RJ, Nielsen TO, van de Vijver MJ, et al. Systematic analysis of breast cancer morphology uncovers stromal features associated with survival. Sci Transl Med 2011;3:108ra13
- 30. Wang C, Pecot T, Zynger DL, Machiraju R, Shapiro CL, Huang K. Identifying survival associated morphological features of triple negative breast cancer using multiple datasets. J Am Med Inform Assoc 2013;20:680–7
- 31. Yuan Y, Failmezger H, Rueda OM, Ali HR, Graf S, Chin SF, et al. Quantitative image analysis of cellular heterogeneity in breast tumors complements genomic profiling. Sci Transl Med 2012;4:157ra43
- 32. Phoulady HA, Goldgof DB, Hall LO, Mouton PR. Nucleus Segmentation in Histology Images with Hierarchical Multilevel Thresholding. Medical Imaging 2016: Digital Pathology 2016;9791
- 33. Alsubaie N, Trahearn N, Raza SEA, Snead D, Rajpoot NM. Stain Deconvolution Using Statistical Analysis of Multi-Resolution Stain Colour Representation. Plos One 2017;12
- 34. Cai WL, Chen SC, Zhang DQ. Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation. Pattern Recogn 2007;40:825–38
- 35. Shelhamer E, Long J, Darrell T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans Pattern Anal Mach Intell 2017;39:640–51
- 36. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans Pattern Anal Mach Intell 2018;40:834–48
- 37. Badrinarayanan V, Kendall A, Cipolla R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans Pattern Anal Mach Intell 2017;39:2481–95
- 38. Swinson DEB, Jones JL, Richardson D, Cox G, Edwards JG, O’Byrne KJ. Tumour necrosis is an independent prognostic marker in non-small cell lung cancer: correlation with biological variables. Lung Cancer 2002;37:235–40
- 39. Wang T, Lu R, Lai S, Schiller JH, Zhou FL, Ci B, et al. Development and Validation of a Nomogram Prognostic Model for Patients With Advanced Non-Small-Cell Lung Cancer. Cancer Inform 2019;18:1176935119837547
- 40. Hoang T, Xu R, Schiller JH, Bonomi P, Johnson DH. Clinical model to predict survival in chemonaive patients with advanced non-small-cell lung cancer treated with third-generation chemotherapy regimens based on eastern cooperative oncology group data. J Clin Oncol 2005;23:175–83
- 41. Tang H, Wang S, Xiao G, Schiller J, Papadimitrakopoulou V, Minna J, et al. Comprehensive evaluation of published gene expression prognostic signatures for biomarker-based lung cancer clinical studies. Ann Oncol 2017;28:733–40
- 42. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, et al. Multiclass cancer diagnosis using tumor gene expression signatures. P Natl Acad Sci USA 2001;98:15149–54
- 43. Sorlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. P Natl Acad Sci USA 2001;98:10869–74
- 44. Vermeulen K, Van Bockstaele DR, Berneman ZN. The cell cycle: a review of regulation, deregulation and therapeutic targets in cancer. Cell Prolif 2003;36:131–49
- 45. Semenza GL. Targeting HIF-1 for cancer therapy. Nat Rev Cancer 2003;3:721–32
- 46. Li HP, Courtois ET, Sengupta D, Tan Y, Chen KH, Goh JJL, et al. Reference component analysis of single-cell transcriptomes elucidates cellular heterogeneity in human colorectal tumors (vol 50, pg 1754, 2018). Nat Genet 2018;50:1754-
- 47. Topalian SL, Drake CG, Pardoll DM. Targeting the PD-1/B7-H1(PD-L1) pathway to activate anti-tumor immunity. Curr Opin Immunol 2012;24:207–12
- 48. Diamond DA, Berry SJ, Umbricht C, Jewett HJ, Coffey DS. Computerized Image-Analysis of Nuclear Shape as a Prognostic Factor for Prostatic-Cancer. Prostate 1982;3:321–32
- 49. Eyden B. The myofibroblast: phenotypic characterization as a prerequisite to understanding its functions in translational medicine. J Cell Mol Med 2008;12:22–37
- 50. Gordon S, Pluddemann A. Tissue macrophages: heterogeneity and functions. BMC Biol 2017;15:53.