Summary
This study introduces TG-ME, a computational framework that integrates transformer and graph variational autoencoder (GraphVAE) models to dissect tumoral niches using spatial transcriptomics data and morphological images. TG-ME effectively identifies and characterizes niches in benchmark datasets and a high-resolution NSCLC dataset. The pipeline consists of several stages: normalization, spatial information integration, morphological feature extraction, gene expression quantification, single-cell expression characterization, and tumor niche characterization. TG-ME leverages deep learning techniques to achieve robust clustering and profiling of niches across cancer stages. It can potentially provide insights into the spatial organization of tumor microenvironments (TME), highlighting specific niche compositions and their molecular changes along cancer progression. TG-ME is a promising tool for guiding personalized treatment strategies by uncovering microenvironmental signatures associated with disease prognosis and therapeutic outcomes.
Subject areas: Microenvironment, Cancer, Transcriptomics, Artificial intelligence
Highlights
• TG-ME integrates transformer and GraphVAE for cancer niche analysis
• Achieves superior clustering and profiling of tumor microenvironments
• Leverages multimodal data for comprehensive TME analysis
• Enables precise identification of tumor microenvironment composition
Introduction
In cancer research, dissecting the interplay between tumor cells and their surrounding tumor microenvironment (TME) is pivotal for unraveling the mechanisms governing progression, metastasis, and responses to treatment.1,2 The TME is interpreted as a spatial region of cancer tissue that comprises tumor cells, multiple immune cells, fibroblasts, endothelial cells, and other cell types together with the extracellular matrix. Serving as a repository for both upstream stimuli to resident cells and downstream cellular products, the TME facilitates closed-loop inter- and intra-cellular regulations3 and influences tumor progression,3 metastasis,1 tumorigenesis, and treatment resistance, underscoring its importance in cancer research. Previous studies have shown that understanding the TME in cancer holds great promise for identifying therapeutic targets and developing personalized treatment strategies,4 which motivates a deeper characterization of the TME for cancer prognosis and treatment.
Given its heterogeneous nature, unraveling the cellular and niche composition, spatial cellular localization, and gene expression profiles within the TME is a crucial yet historically challenging task.5 To address this challenge, spatial transcriptomics (ST) technologies have attracted significant research efforts to enable the visualization and analysis of gene expression profiles in their spatial context in tissues.6,7,8,9,10,11,12 Common experimental methods for ST include immunohistochemistry (IHC)13,14 and fluorescent in situ hybridization.10,15,16,17,18,19,20,21 Among current platforms, 10X Genomics Visium adopts in situ capturing, a spot-based technology that offers whole-transcriptome data at a resolution of 55 μm in diameter, accommodating 1–10 cells per spot depending on the tissue type and cell size.6,9,11,12 On the other hand, the NanoString CosMx spatial molecular imager (SMI) uses in situ hybridization technology with a reported resolution of 50 nm. The first-generation NanoString SMI provides the spatial locations of about 980 genes at sub-cellular resolution and includes morphological images, enabling the characterization of transcription profiles within individual cells.10 NanoString anticipates unveiling an upgraded SMI technology capable of whole-genome examination at the single-cell level in 2025, while 10X Genomics, with its Visium HD technology, aims to profile the whole transcriptome at single-cell resolution in the next few years. These new ST technologies call for innovative tools capable of analyzing and interpreting high-dimensional multimodal datasets. Given the complexity of the spatial transcriptomics data and morphological images involved in the study of tumoral niches, advanced AI techniques such as the one proposed in this framework are a reasonable approach, implemented as a pipeline that uses the most suitable tool at every stage.
These stages include cell typing prediction,22,23,24 niche prediction,2,10,25,26,27,28,29,30 and identification of cross-talks between specific cell types and the TME.30,31
In the past, several deep learning (DL)27,32,33,34,35 and statistical methods2,10,12,25,26,28,29,30 have been developed to better understand the TME, revealing spatial niches and heterogeneity in cancer. Some existing methods identify spatial niches from transcription patterns alone,26,28,29,30 while others consider both the transcription and the spatial location of cells or spots2,10,12,25,32,35 without considering tissue morphology. Other methods additionally incorporate morphology. For example, StSME includes a normalization method that incorporates spatial locations, morphology images, and gene expressions through imputation to adjust gene expression values between spots.36 First, a spatial distance matrix between spots is calculated. Then, gene expression correlation is calculated, and the top 3 correlated spots are selected for imputation. The morphological similarity between a pair of spots is determined from image features obtained through a convolutional neural network. A final weight is calculated by multiplying the gene expression and morphological similarities, and it is further normalized by the total number of selected spots. The normalized gene expression of a spot is its raw gene expression plus the sum of the raw gene expressions of the neighboring spots multiplied by the similarity weights. Louvain clustering is applied to the normalized gene expression levels to identify the spatial niches. StSME reaches an adjusted Rand index (ARI) value of 0.43 in the analysis of the dorsolateral prefrontal cortex (DLPFC) benchmark dataset.20
SpaGCN deploys a graph convolutional network (GCN) approach where each node in the graph is a spot and each weighted edge between two nodes represents the closeness of two spots based on the location of a spot and similarity of RGB color from the histology image of a spot.34 The SpaGCN uses top principal components from gene expression profiles as input, aggregates inputs with weighted edges in the graph, and identifies spatial niches in the tissue. It reaches an ARI value of 0.52 in the analysis of DLPFC benchmark dataset.20
DeepST first implements spatial data augmentation by calculating similarity weights across a spot of interest and its neighbors with gene expression levels, morphological similarity, and spatial location of spots.27 DeepST further implements a denoising linear autoencoder and a graph autoencoder to identify spatial niches, where the graph was built from the spatial locations of the spots. It reaches an ARI value of 0.59 in the analysis of the DLPFC benchmark dataset.20
STAGATE integrates spatial information and gene expression profiles through a graph attention auto-encoder that learns the low-dimensional latent embeddings to identify spatial niches.33 The input to STAGATE includes an adjacency matrix computed from the spatial locations of spots, and a pre-clustering of the gene expression conducted by the Louvain algorithm on the principal-component analysis (PCA) embeddings of the gene expression. It reaches an ARI value of 0.6 in the analysis of the DLPFC benchmark dataset.20
SEDR utilizes a deep autoencoder coupled with a masked self-supervised mechanism to build the low-dimensional representations of gene expressions, which are simultaneously embedded with the spatial information through a variational graph autoencoder (VGA), while not integrating morphological information.35 The model is trained to reconstruct the gene expression matrix from the latent representation, obtained by concatenating the embeddings from the VGA and the deep autoencoder. The input to SEDR includes the gene expression matrix and the spatial coordinates of the spots, while the output is the latent representation, further clustered to identify spatial niches in the tissue. It reaches an ARI value of 0.68 in the analysis of the DLPFC benchmark dataset.20 Also Liu et al. proposed a graph deep learning spatial clustering approach to integrate gene expression profiles and spatial positions, followed by a Bayesian Gaussian mixture model to identify spatial niches.20,32
Studying the cellular and niche composition of the TME has been challenging due to the limitations of existing methods in processing complex multimodal datasets. To address this challenge, we propose TG-ME, a framework that combines a transformer and a graph autoencoder to identify microenvironments. Our pipeline identifies tumoral niches in a tissue slide using a transformer model to process single-cell gene expression profiles, a convolutional neural network (CNN) to process cell characteristics such as size from morphological images, and a graph autoencoder to identify tumoral niches. We tested TG-ME using two benchmark datasets, the DLPFC and the 10X Visium benchmark dataset of breast cancer (invasive ductal carcinoma). We then applied TG-ME to a non-small cell lung cancer (NSCLC) dataset of samples at different stages of cancer.
Results
TG-ME pipeline demonstrates competitive performance in benchmark datasets
To validate TG-ME, we processed a single slice of healthy tissue of human dorsolateral prefrontal cortex (DLPFC) and two breast cancer (BC) benchmark datasets.20,37 For the DLPFC, TG-ME identified 7 niches in the healthy tissue: 5 layers, the white matter (WM), and a region called NA (not a layer or WM) (Figure 1B). We compared the TG-ME results with the annotation by a pathologist and found that seven out of eight layers were also identified by TG-ME (Figure 1A). This result demonstrates the consistency of TG-ME in predicting tissue niches (Figures 1A and 1B). To quantify TG-ME performance, we calculated the ARI and normalized mutual information (NMI) scores and found that TG-ME had the highest ARI score (0.54) and NMI score (0.66) compared with the other methods, STAGATE, SEDR, SpaGCN, and the Leiden algorithm (Figures 1C–1G). We processed the remaining 11 samples and included their scores in Table S2 (Figure S1). The results, based on all 12 samples, reveal an average ARI score of 0.57 and an average NMI score of 0.60. Overall, the results demonstrate that TG-ME agrees more closely with the pathologist’s annotations than the compared methods.
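The ARI used for this comparison can be illustrated with a minimal sketch. It assumes the standard pair-counting definition of the adjusted Rand index; the function name `adjusted_rand_index` and the toy labels are illustrative and are not taken from the TG-ME code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between an annotation and a predicted clustering."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same spots")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare

    # Number of spots shared by each (annotated niche, predicted niche) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g., one cluster in both
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


# Toy example: cluster names differ, but the partition is identical.
annotation = ["L1", "L1", "WM", "WM"]
prediction = ["c2", "c2", "c0", "c0"]
score = adjusted_rand_index(annotation, prediction)
```

Because the index depends only on which spots are grouped together, predicted cluster labels do not need to be matched to the pathologist's layer names before scoring.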
Figure 1.
Quantitative evaluation of multiple methods in the DLPFC dataset
(A) Pathologist annotation.
(B) TG-ME prediction.
(C) STAGATE prediction.
(D) SEDR prediction.
(E) SpaGCN prediction.
(F) Leiden algorithm prediction.
(G) Comparison of ARI and NMI scores (ranging from 0 to 1) across different methods.
To further test the TG-ME pipeline, we processed a public 10X Visium benchmark dataset of BC (invasive ductal carcinoma) and compared the results with those of other methods (Figure 2). We calculated the ARI and NMI scores and again observed that TG-ME had the highest ARI score (0.54) and NMI score (0.44) compared with STAGATE, SEDR, SpaGCN, and the Leiden algorithm (Figure 2). To investigate the performance of TG-ME in recognizing tumor niches, we calculated the precision, recall, and F1 scores for each of the layers predicted by TG-ME. The highest precision scores were 0.79 and 0.83, for the tumor and invasive niches, respectively (Table S3), higher than those of other methods,27,35 showing that TG-ME excelled in predicting tumor niches. TG-ME successfully captured the four different niches previously annotated by a pathologist, while the rest of the methods identified only three out of four niches. These findings demonstrate TG-ME’s ability to provide a more detailed analysis of the tumoral niches compared with previous algorithms.27,33,34,35
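A minimal sketch of the per-niche precision, recall, and F1 computation is given below. It assumes that each predicted cluster has already been assigned the name of an annotated niche (how this matching is done is not described here), and the function name `per_class_scores` and the toy labels are illustrative.

```python
def per_class_scores(labels_true, labels_pred):
    """Return {niche: (precision, recall, f1)} for every niche label."""
    pairs = list(zip(labels_true, labels_pred))
    scores = {}
    for niche in sorted(set(labels_true) | set(labels_pred)):
        tp = sum(t == niche and p == niche for t, p in pairs)
        fp = sum(t != niche and p == niche for t, p in pairs)
        fn = sum(t == niche and p != niche for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[niche] = (precision, recall, f1)
    return scores


# Toy example with two niches and one misassigned spot.
annotation = ["tumor", "tumor", "tumor", "stroma", "stroma"]
prediction = ["tumor", "tumor", "stroma", "stroma", "stroma"]
niche_scores = per_class_scores(annotation, prediction)
```

In this toy case the tumor niche has perfect precision but loses recall, because one annotated tumor spot was predicted as stroma.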
Figure 2.
Quantitative evaluation of multiple methods in the breast cancer dataset
(A) Pathologist annotation.
(B) TG-ME prediction.
(C) STAGATE prediction.
(D) SEDR prediction.
(E) SpaGCN prediction.
(F) Leiden algorithm prediction.
(G) Comparison of ARI and NMI scores (ranging from 0 to 1) across different methods.
We applied TG-ME to a third benchmark dataset, specifically the 10X Visium breast cancer ductal carcinoma in situ (DCIS) dataset.37 To assess TG-ME’s performance, we once again calculated the ARI and NMI scores, achieving values of 0.54 and 0.44, respectively (Figure 3B). As before, TG-ME outperformed other methods in these metrics (Figure 3). To further evaluate TG-ME’s ability to detect tissue niches, we calculated precision, recall, and F1 scores for the layers predicted by the algorithm. TG-ME demonstrated high precision in identifying the DCIS #1 niche, with a score of 0.94, and the stromal niche, with a score of 0.96 (Table S4), outperforming other methods.27,35 While precision for DCIS #2 was lower at 0.78, TG-ME successfully identified all other niches previously annotated by a pathologist, with the exception of the myoepithelial/stromal/immune niche, which none of the evaluated methods captured. Overall, these results underscore TG-ME’s capability to offer a more detailed and accurate analysis of tumoral niches than existing algorithms.27,33,34,35
Figure 3.
Quantitative evaluation of multiple methods in the breast cancer dataset
(A) Pathologist annotation.
(B) TG-ME prediction.
(C) STAGATE prediction.
(D) SEDR prediction.
(E) SpaGCN prediction.
(F) Leiden algorithm prediction.
(G) Comparison of ARI and NMI scores (ranging from 0 to 1) across different methods.
Ablation study highlights the importance of TG-ME normalization
One of the key contributions of TG-ME is its normalization process, which integrates multiple data types: gene expression, morphological images, spatial location, and cell composition (cell type). To assess the importance of each component, we conducted an ablation test, systematically removing one data type at a time to observe its effect on the ARI and NMI scores. For this analysis, we used the benchmark dataset.37
First, we implemented the TG-ME pipeline using the full dataset, achieving an ARI score of 0.55 and an NMI score of 0.67 (Figure 4B). Then, we removed the gene expression data and computed the TG-ME pipeline. We observed a significant drop in performance with the ARI score decreasing to 0.21 and the NMI score to 0.36 (Figure 4C). This underscores the crucial role of gene expression in accurately identifying tissue microenvironments, as its removal resulted in poorly defined and low-resolution niches (Figure 4C).
Figure 4.
Ablation scores of the multimodal data for TG-ME normalization
(A) Pathologist annotation.
(B) TG-ME prediction with all the data.
(C) Ablation of the gene expression.
(D) Ablation of the morphology images.
(E) Ablation of the spatial location.
(F) Ablation of the cell composition.
Next, we evaluated the impact of removing morphological image data. The ARI and NMI scores decreased to 0.47 and 0.63, respectively (Figure 4D). Without this information, the algorithm struggled to differentiate stromal niches, indicating the importance of morphological images for niche identification (Figure 4D).
We then excluded spatial location data from the TG-ME normalization and again observed a decline in performance, with the ARI and NMI scores dropping to 0.47 and 0.63, respectively (Figure 4E). The absence of spatial data led to the misclassification of stromal niches as adipocytes and mixed niches as DCIS #2 (Figure 4E).
Finally, we removed the cell composition data. While the ARI and NMI scores decreased to 0.49 and 0.65, respectively (Figure 4F), the drop was less pronounced compared to the other data types. However, this still resulted in a loss of resolution, as the algorithm failed to identify the left mixed niche and struggled with stromal niche identification.
Overall, these results highlight the importance of TG-ME’s multimodal normalization approach, demonstrating that integrating all available data leads to more precise identification of tissue microenvironments.
Ablation study highlights the critical role of transformer and GraphVAE modules
The TG-ME model consists of two key components: the transformer model and the graph variational autoencoder (GraphVAE) (Figures 5 and 6, STAR Methods). The transformer model captures gene-gene interactions within individual cells, while GraphVAE integrates spatial interactions across cells. To assess the contribution of each module, we conducted an ablation study using the benchmark dataset.37
Figure 5.
Major steps of the TG-ME pipeline
(A) Pre-processing of gene expression.
(B) Division into cells.
(C) Location based adjacency matrix.
(D) Gene expression and cell type composition.
(E) Morphological feature extraction.
(F) Normalized expression.
(G) TG-ME model.
(H) Identification of niches from clustering.
(I) Interpretation and spatial mapping.
(J) TME composition across regions.
(K) Severity signatures across regions.
Figure 6.
TG-ME model overview
(A) The transformer model.
(B) The graph variational autoencoder model.
First, we evaluated the full TG-ME model, incorporating both the transformer and GraphVAE modules (Figure 7B). By comparing the predicted niches with pathologist annotations (Figure 7A), we observed an ARI score of 0.55 and an NMI score of 0.67.
Figure 7.
Ablation scores of the TG-ME model
(A) Pathologist annotation.
(B) TG-ME prediction with both transformer and GraphVAE modules.
(C) Ablation of the GraphVAE module.
(D) Ablation of the transformer module.
(E) Ablation of both GraphVAE and transformer modules, replaced by a fully connected network.
Next, we removed the GraphVAE module, retaining only the transformer module. Upon re-training and evaluation, the ARI and NMI scores dropped to 0.47 and 0.63, respectively (Figure 7C), underscoring the significance of spatial interactions in identifying tissue niches.
We then ablated the transformer module, leaving only GraphVAE. The retrained model showed a decrease in the ARI and NMI scores to 0.49 and 0.63, respectively (Figure 7D), highlighting the importance of gene-gene interactions for niche identification in spatial transcriptomics data.
Finally, we replaced both modules with a fully connected network. This resulted in further reductions, with ARI and NMI scores declining to 0.46 and 0.61, respectively (Figure 7E). These findings emphasize the complementary roles of the transformer and GraphVAE modules, demonstrating that both are crucial for accurately identifying tissue niches in spatial transcriptomics data.
Application of TG-ME to a high-resolution spatial transcriptomics non-small cell lung cancer dataset
The main target of TG-ME is the identification of tumor niches, defined as regions within the TME that contain a distinctive cell-type composition with a specific spatial distribution. The first stage of the TG-ME pipeline consists of quality control (QC) (Figure S2).38,39 At this stage, low-quality cells are filtered out using three criteria: the number of features (gene coverage), the number of counts (count depth), and the percentage of mitochondrial counts per cell (cell activity). Our goal was to ensure that only high-quality cells were included in the analysis while minimizing the risk of introducing artifacts or noise from low-quality cells. We selected the 10% mitochondrial gene expression threshold based on established practices in single-cell RNA sequencing analysis.40 High mitochondrial content is often a marker of stressed or dying cells, which can reduce the reliability of downstream analyses by introducing biological noise that does not reflect true cellular states. This threshold is widely used in the field to filter out such cells.
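The three QC criteria can be sketched as a simple per-cell filter. Only the 10% mitochondrial threshold comes from the text; the `CellQC` record, the `min_features` and `min_counts` defaults, and the choice to treat the 10% limit as inclusive are hypothetical placeholders for illustration.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    cell_id: str
    n_features: int   # number of genes detected in the cell
    n_counts: int     # total transcript counts in the cell
    mito_counts: int  # counts assigned to mitochondrial genes


def passes_qc(cell, min_features=10, min_counts=20, max_mito_fraction=0.10):
    """Apply the three criteria; min_features/min_counts are assumed values."""
    if cell.n_counts == 0:
        return False
    mito_fraction = cell.mito_counts / cell.n_counts
    return (
        cell.n_features >= min_features
        and cell.n_counts >= min_counts
        and mito_fraction <= max_mito_fraction
    )


def filter_cells(cells, **thresholds):
    """Split cells into (kept, removed) so removed cells can be inspected."""
    kept = [c for c in cells if passes_qc(c, **thresholds)]
    removed = [c for c in cells if not passes_qc(c, **thresholds)]
    return kept, removed


cells = [
    CellQC("c1", n_features=120, n_counts=400, mito_counts=12),  # 3% mito
    CellQC("c2", n_features=80, n_counts=200, mito_counts=50),   # 25% mito
    CellQC("c3", n_features=4, n_counts=9, mito_counts=0),       # too sparse
]
kept, removed = filter_cells(cells)
```

Returning the removed cells alongside the kept ones makes it straightforward to report the quality of the discarded cells, as is done for the NSCLC samples.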
Additionally, we adopted a strategy to filter as few cells as possible to maintain the spatial integrity of the transcriptomics data. Given that spatial transcriptomics aims to map gene expression across a tissue, removing too many cells can create gaps or “holes” in the spatial data, which could distort the overall biological interpretation of the TME. Therefore, we carefully balanced the need to exclude poor-quality cells while preserving the structure of the spatial transcriptomic sample.
To evaluate TG-ME on a high-resolution ST dataset, we processed four samples from the NSCLC dataset10 (Table 1). After filtering, we obtained a total of 385,187 cells and 980 genes (Table S5), with 98,002 cells belonging to sample 1, 138,095 cells to sample 2, 71,153 cells to sample 3, and 77,937 cells to sample 4. Table S5 shows the results after filtering, and Table S6 shows the quality of the removed cells.
Table 1.
Patient’s demographic information
| Tissue | Sex | Age | Histological diagnosis | Stage | Metastases | TNM | Percentage of tumor content in sample |
|---|---|---|---|---|---|---|---|
| Sample 1 | F | 63 | Adenocarcinoma | IIIA | 2/19 lymph nodes | T2aN2M0 | 19% |
| Sample 2 | F | 65 | Adenocarcinoma | IIIA | 3/9 lymph nodes | T3N1M0 | 70% |
| Sample 3 | F | 62 | Adenocarcinoma | IIIA | 0/14 lymph nodes | T4N0M0 | 26% |
| Sample 4 | M | 77 | Adenocarcinoma | IIB | 3/12 lymph nodes | T3N0M0 | 34% |
TG-ME integration stage highlights consistency in clustering across samples
We evaluated the consistency of clustering across samples after integration (see STAR Methods). To this end, we visualized the distribution of cells using Uniform Manifold Approximation and Projection (UMAP) (Figure 8). We observed that tumor cells colocalize in the same cluster after processing and are clearly separated from the rest of the clusters (Figure 8), suggesting consistency in the integration stage of TG-ME. To quantify the clustering performance, we calculated the Calinski and Harabasz score and the Silhouette score (Table 2). We obtained a Calinski and Harabasz score of 61,500.48 on the integrated data, which indicates that the clusters are dense and well separated.41 For Silhouette, we obtained a score of 0.148 on the integrated data. This is important because sample 3 exhibited a negative Silhouette score before integration, suggesting that the method was able to correct and integrate this sample (Table 2). We also observed a correspondence between clustering scores and TNM classification,42 suggesting that an increase in heterogeneity is related to a decrease in clustering quality, as reflected in the score metrics (Table 1, Figure S3).
Figure 8.
Examination of cell typing of each patient in UMAP and selected gene markers
(A) Integrated UMAP visualization of four samples.
(B) UMAP visualization of the individual samples.
(C) Cell type marker genes.
Table 2.
Clustering performance metrics
| Metric | Sample 1 | Sample 3 | Sample 2 | Sample 4 | Integrated samples |
|---|---|---|---|---|---|
| Silhouette score | 0.294 | −0.055 | 0.245 | 0.177 | 0.148 |
| Calinski and Harabasz score | 41821.02 | 5721.4 | 11824.92 | 22141.89 | 61500.48 |
To confirm the consistency of the clusters after integration, we performed differential gene expression analysis to identify gene markers (STAR Methods) that can be used to assign cell types. We observed that the identified markers correspond to the previously assigned cell types (STAR Methods, Figure 8C). To verify that each marker gene belongs to the identified cell type, we conducted a thorough literature review to validate its association. Overall, we observed clear marker genes for each cell type, despite some expected overlap between different T cell and macrophage-related subtypes.
Finally, to assess the effectiveness of our integration in removing batch and technical effects while conserving biological variance, we computed the scIB batch correction metrics, including the average silhouette width (ASW) and principal component regression.43 An ASW score of 1 indicates perfect batch mixing, whereas a score of 0 reflects strongly separated batches. We obtained a score of 0.91, demonstrating successful batch mixing in our integrated data. This suggests that our integration method effectively mitigated batch effects while maintaining the underlying biological signal.
Integration of spatial information with input data improves clustering performance
The proposed integration approach processes one cell at a time, referred to as the “cell of interest” (CoI). The cells in proximity to the CoI are designated as adjacent cells (CAs). To identify the CAs, we computed an adjacency matrix (STAR Methods, Figure 5C). On average, approximately 14 neighbors per cell were obtained. Next, a similarity weight value for each gene is calculated between the CoI and its CAs based on gene expression, cell type, and morphology features (STAR Methods). The similarity values are then used to normalize the gene expression of the CoI (STAR Methods). The pipeline also extracts the morphological images for each of the cells, under the assumption that cells of diverse types and functions often display distinct morphological characteristics44 (STAR Methods, Figure 5B). These images were generated by staining the cells with a nuclear dye (DAPI) and morphology markers such as a membrane marker (CD298), the epithelial cell marker pancytokeratin (PanCK), and a T cell marker (CD3).10 We obtained one image for each cell, with a total of 98,002 for sample 1; 138,095 for sample 2; 71,153 for sample 3; and 77,937 for sample 4. Images are uniformly sized at 224 by 224 pixels. To extract meaningful features from these images, we used a pre-trained CNN, specifically ResNet50.45 This choice was motivated by the network’s capability to derive representations from images by transforming them into feature vectors (Figure 5E, STAR Methods). The resulting features, initially comprising 2,048 dimensions, were subsequently reduced to 50 dimensions through PCA.
Next, we visualized the clustering of the four NSCLC samples. We observed a higher cluster separation for all samples after applying TG-ME integration (Figure 9). For example, in sample 1, prior to normalization (Figure 9A), the clusters of T CD4 naive, endothelial, fibroblast, and B cells were grouped together at the center of the UMAP. After TG-ME integration (Figure 9B), these clusters are distinctly separated, resulting in unique clusters per cell type.
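The neighbor-weighted normalization described above can be illustrated with a simplified sketch. The exact TG-ME formulas are given in the STAR Methods and are not reproduced here, so every detail below is an assumption: neighbors are found with a fixed radius, a single weight per neighbor (rather than per gene) is the product of expression and morphology cosine similarities, and a hypothetical `type_mismatch` factor down-weights neighbors of a different cell type.

```python
from math import dist, sqrt


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbors_within(coords, radius):
    """Adjacency list: for each cell, indices of other cells within radius."""
    return [
        [j for j, q in enumerate(coords) if j != i and dist(p, q) <= radius]
        for i, p in enumerate(coords)
    ]


def normalize_expression(expr, morph, cell_types, coords, radius,
                         type_mismatch=0.5):
    """Add similarity-weighted neighbor expression to each cell of interest."""
    adjacency = neighbors_within(coords, radius)
    normalized = []
    for i, neighbors in enumerate(adjacency):
        row = list(expr[i])  # start from the raw expression of the CoI
        for j in neighbors:
            # Assumed weight: expression similarity times morphology similarity.
            weight = max(cosine(expr[i], expr[j]), 0.0)
            weight *= max(cosine(morph[i], morph[j]), 0.0)
            if cell_types[i] != cell_types[j]:
                weight *= type_mismatch
            for g, value in enumerate(expr[j]):
                row[g] += weight * value / len(neighbors)
        normalized.append(row)
    return normalized


# Toy example: two adjacent T cells and one isolated B cell.
coords = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
expr = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
morph = [[1.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
cell_types = ["T", "T", "B"]
adjacency = neighbors_within(coords, radius=2.0)
normalized = normalize_expression(expr, morph, cell_types, coords, radius=2.0)
```

The brute-force neighbor search is quadratic in the number of cells and is only suitable for illustration; for samples with roughly 100,000 cells, a spatial index such as a k-d tree would be needed.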
Figure 9.
TG-ME improves the separation of cell types
(A) Sample 1 with gene expression only.
(B) Sample 1 with TG-ME integration.
(C) Sample 4 with gene expression only.
(D) Sample 4 with TG-ME integration.
(E) Sample 3 before TG-ME integration.
(F) Sample 3 after TG-ME integration, where the tumor cells separate into two clusters, and the rest of the clusters become well defined.
(G) Sample 2 before TG-ME integration.
(H) Sample 2 after TG-ME integration, where the tumor cells separate into two clusters, and the rest become well-defined.
In the case of sample 4, we observed that the macrophage and epithelial clusters were connected to the fibroblast, Treg, and T memory cells (CD4 and CD8) clusters (Figure 9C). Again, after TG-ME integration these clusters completely separate (Figure 9D).
In the case of sample 3, which is more severe (Table 1), before TG-ME integration, most of the cell types overlap in the center of the UMAP (Figure 9E) and then after integration, there is a clear separation of fibroblast and endothelial clusters (Figure 9F). We also observed that tumor cells separate into two clusters (Figure 9F). One of those clusters is made of tumor cells and is completely separated from the rest of the clusters, while the other, smaller, is mixed with epithelial cells, which may represent epithelial cells becoming dysplastic and transforming into malignant cells as previously described.46
For sample 2, tumor cells overlap with neutrophils before TG-ME integration (Figure 9G); after integration, the tumor cluster splits into two groups. One is entirely separated from the rest of the clusters, while the other is located next to the neutrophil cluster. We also found that a cluster containing fibroblasts and T CD4 memory cells was completely separated from the rest of the clusters. This agrees with previous studies showing that tumor-associated fibroblasts correlate with T cells in the TME.47,48
Additionally, we observed appreciable differences in the immune clusters. For low severity samples, the immune clusters are clearly separated, but as severity increases, the immune cells tend to cluster together. For example, in sample 1 (Figure 9B), there are separated clusters of B cells and neutrophils. In sample 4 (Figure 9D), there is a separate cluster of macrophages, and we observe a cluster of Treg and T CD8 memory cells connected to the T CD4 memory cluster, where all of them are well defined. In these two samples, we observe the mast cells cluster separated from the rest of the clusters, which for samples 2 and 3 are overlapping with other cell types including tumor cells. Interestingly, several studies have identified that mast cells play roles in normal and abnormal processes in NSCLC, where they can promote tumor growth in solid tumors.49,50
To quantify the improvement in clustering after TG-ME integration, we computed the Silhouette51 and Calinski and Harabasz scores41 (Table 3) and compared them before and after TG-ME integration. The Silhouette score ranges from -1 to 1, where -1 means samples have been assigned to the wrong cluster, and 1 means perfect clustering.51 The Calinski and Harabasz score is a variance ratio criterion that measures cluster quality by evaluating the separation between clusters and the compactness within a cluster; a higher score indicates better-quality clustering.41 For all 4 samples, we observed that both the Silhouette and the Calinski and Harabasz scores increased after TG-ME processing (Table 3). For example, for sample 1, the Silhouette score with gene expression alone was 0.244 and increased to 0.698 after integration, while the Calinski and Harabasz score increased from 25,389.93 to 211,032.91. Overall, TG-ME effectively integrates gene expression, morphology features, cell type, and neighborhood similarity.
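Both metrics can be illustrated with a short sketch based on their standard definitions with Euclidean distance. The embedding on which the scores in Table 3 were computed is not specified here, so the 2D toy points, the function names, and the helper `_group` are purely illustrative.

```python
from math import dist


def _group(points, labels):
    """Collect points by cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points."""
    clusters = _group(points, labels)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for key, other in clusters.items() if key != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def calinski_harabasz_score(points, labels):
    """Ratio of between-cluster to within-cluster dispersion."""
    clusters = _group(points, labels)
    n, k = len(points), len(clusters)
    if not 1 < k < n:
        raise ValueError("need 2 <= number of clusters < number of points")

    def centroid(pts):
        return [sum(axis) / len(pts) for axis in zip(*pts)]

    overall = centroid(points)
    between, within = 0.0, 0.0
    for pts in clusters.values():
        center = centroid(pts)
        between += len(pts) * dist(center, overall) ** 2
        within += sum(dist(p, center) ** 2 for p in pts)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


# Toy example: two tight, well-separated groups of cells.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
ch = calinski_harabasz_score(points, [0, 0, 1, 1])
```

With the correct grouping the Silhouette score is close to 1, whereas the mismatched grouping yields a negative value, mirroring the interpretation given above.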
Table 3.
Clustering metrics
| Sample | Gene expression silhouette score | TG-ME silhouette score | Fold change | Gene expression Calinski and Harabasz score | TG-ME Calinski and Harabasz score | Fold change |
|---|---|---|---|---|---|---|
| 1 | 0.244 | 0.698 | 2.86 | 25389.93 | 211032.91 | 8.31 |
| 2 | 0.137 | 0.293 | 2.13 | 3658.87 | 13560.01 | 3.70 |
| 3 | −0.018 | 0.14 | −7.77 | 5316.66 | 17807.32 | 3.35 |
| 4 | 0.133 | 0.533 | 4.01 | 18716.93 | 118918.82 | 6.35 |
The TG-ME transformer model can successfully contextualize gene-gene interactions of individual cells
To model the internal expression context within a cell, we implemented a transformer model that contextualizes a single cell (Figures 5 and 6). The transformer model in TG-ME is designed to capture the gene-gene interactions within individual cells by focusing on the gene expression profiles. It employs attention mechanisms to calculate the relevance of specific genes within a cell, helping to highlight key gene interactions involved in the microenvironment. We used the cell type provided for each cell as a label and trained the transformer model to predict cell types, with the intention of determining the gene-gene interactions within a cell. To this end, we divided the dataset by cell type and then assigned 70% of the samples to the training set, 10% to the validation set, and 20% to the testing set. We used the TG-ME integrated gene expression of the 980 genes as input features (STAR Methods). Subsequently, we evaluated the performance by calculating the precision, recall, F1-score, ARI, and NMI scores (Table 4). The precision, recall, and F1-score were greater than 90% for all the samples (Table 4), suggesting that the TG-ME model can accurately determine the gene-gene interactions. We observed that as the severity of the samples increases (Table 1), the precision, recall, and F1-score decrease, suggesting that gene interactions within a cell are dysregulated. To quantify this, we calculated the ARI score and observed that for the least severe samples (samples 1 and 4) (Table 1), the ARI score was ∼0.97 (Table 4). Previous studies categorize ARI ≥0.90 as excellent recovery.52,53,54 For sample 2, the ARI score was 0.92 (Table 4), still considered excellent classification, whereas for sample 3, the most severe sample (Table 1), the ARI score was 0.89 (Table 4), considered good recovery using previously defined criteria.52,53,54 We again observed a decrease in the ARI score as severity increases.
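The 70%/10%/20% split by cell type can be sketched as a stratified split. This is an illustrative sketch under stated assumptions: the function name `stratified_split`, the fixed seed, and the rounding rule for small cell types are not taken from the TG-ME code.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Return (train, val, test) index lists, split separately per cell type."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for index, label in enumerate(labels):
        by_type[label].append(index)

    train, val, test = [], [], []
    for indices in by_type.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder, about 20%
    return train, val, test


# Toy example: 100 tumor cells and 50 fibroblasts.
labels = ["tumor"] * 100 + ["fibroblast"] * 50
train, val, test = stratified_split(labels)
```

Splitting within each cell type keeps rare cell types represented in all three sets, which matters when class sizes differ widely across samples.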
Table 4.
Performance evaluation of TG-ME transformer model on 4 NSCLC samples
| Sample | Precision | Recall | F1-score | ARI | NMI |
|---|---|---|---|---|---|
| 1 | 98.4% | 98.4% | 98.2% | 0.978 | 0.962 |
| 2 | 91.9% | 92.2% | 90.5% | 0.925 | 0.762 |
| 3 | 92.2% | 92.4% | 91.6% | 0.894 | 0.830 |
| 4 | 97.1% | 97.3% | 97.0% | 0.971 | 0.936 |
Finally, we computed the NMI score, a common metric for evaluating the alignment between predicted and real cell types. It is often used in clustering but is also applicable for comparing label consistency,55 where a value of 0 means no mutual information between the predicted and the real cell type labels, and a value of 1 means perfect correlation. The least severe samples have a score greater than 0.90 (Table 4), while for the most severe sample, the score is 0.76, suggesting again that as severity increases, identifying the gene-gene interactions becomes more complex, since gene expression can be modified by cancer cells. To further study the source of the misclassifications, we computed the confusion matrix for each sample (Tables S7–S10). Overall, we observe that as cancer severity increases, the number of tumor cells also increases, and hence the number of cells belonging to the rest of the cell types decreases. Structural cells, including tumor cells, fibroblasts, and endothelial cells; myeloid cells, such as neutrophils and macrophages; and plasmablasts presented a low number of mispredictions (Tables S7–S10). The mispredicted cells were mostly immune cells (Tables S7–S10).
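For readers who want to reproduce the agreement scores, the following minimal sketch computes ARI and NMI from two label lists using only the contingency counts. It assumes NMI is normalized by the arithmetic mean of the two entropies (one common convention; the normalization used in the study is not stated), and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(true_labels, pred_labels):
    """ARI from the contingency table of two labelings."""
    n = len(true_labels)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_info(true_labels, pred_labels):
    """NMI normalized by the arithmetic mean of the label entropies (assumption)."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    rows, cols = Counter(true_labels), Counter(pred_labels)
    mi = sum(c / n * log(c * n / (rows[t] * cols[p])) for (t, p), c in joint.items())
    h_true = -sum(c / n * log(c / n) for c in rows.values())
    h_pred = -sum(c / n * log(c / n) for c in cols.values())
    mean_entropy = 0.5 * (h_true + h_pred)
    return mi / mean_entropy if mean_entropy > 0 else 1.0


# Toy example (hypothetical): one T cell is mispredicted as a B cell.
true_types = ["tumor", "tumor", "T cell", "T cell", "B cell", "B cell"]
pred_types = ["tumor", "tumor", "T cell", "B cell", "B cell", "B cell"]
ari = adjusted_rand_index(true_types, pred_types)
nmi = normalized_mutual_info(true_types, pred_types)
```

Both scores equal 1 for identical labelings and fall as mispredictions accumulate, which is the behavior the severity comparison relies on.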
TG-ME can dissect different microenvironments related to tumor severity
To comprehensively analyze TMEs across various stages of NSCLC, we applied the TG-ME pipeline to four samples, each representing a different TNM classification stage. The TG-ME model consists of two key components: a transformer model for detecting gene-gene interactions and a GraphVAE for integrating spatial information (Figures 5 and 6, STAR Methods). The transformer model identifies similarities between genes expressed within individual cells, while the GraphVAE takes these gene similarities further by incorporating spatial neighborhood information through an adjacency matrix. This adjacency matrix is built based on the assumption that cells within the same tumor niche will not only express similar genes but will also be close in space.
The integration of these two modules allows TG-ME to dissect tumor niches with a high level of granularity. For example, TG-ME identified 17 distinct niches in sample 1, 10 niches in sample 4, and 8 niches in both samples 2 and 3 (Figure 10A–10D). The number of detected niches varies according to the cell composition and stage of each tumor. For subsequent analysis, we decided to focus on investigating the identified tumor-specific niches to gain insights into their biological relevance.
Detailed characterization of tumoral niches
In sample 1, TG-ME detected five distinct tumor niches and a tumor-stroma border (Figure 10A). To further understand the differences between these tumoral niches, we performed differential gene expression (DGE) analysis (STAR Methods) followed by enrichment analysis using the Hallmark database, which catalogs gene sets representing specific biological states.56,57 Each niche exhibited enrichment in distinct biological pathways, underscoring the molecular heterogeneity within the tumor. Detailed p values for each pathway enrichment are provided in Table 5. For instance:
(1) Tumor-hypoxia niche: Characterized by enrichment in hypoxia-related pathways (Table 5). This niche forms when tumor cells develop faster than the vasculature, leading to oxygen deficit.58 Hypoxia is a well-known driver of tumor progression and therapeutic resistance.58,59,60 The presence of this niche is associated with poor clinical prognosis.
(2) Tumor-EMT niche: Enriched in pathways related to epithelial-mesenchymal transition (EMT) (Table 5). This niche promotes cancer cell invasion and metastasis.61
(3) Tumor-apoptosis niche: This niche is enriched for apoptosis-related genes (Table 5), reflecting programmed cell death mechanisms to control cell proliferation.62 Interestingly, apoptosis within tumors can sometimes enhance tumor growth.63,64
(4) Tumor-interferon alpha niche: Enriched in pathways related to the interferon alpha response (IFN-alpha) (Table 5), this niche highlights a cytokine pathway that mediates immune responses to tumors.65
(5) Tumor-TNF niche: This niche is enriched in TNF-alpha signaling pathway via the nuclear factor kappa B (NF-kB) pathway (Table 5). The tumor necrosis factor (TNF) is a cytokine implicated in immune homeostasis, inflammation, and host defense, capable of inducing various cellular responses including apoptosis, necrosis, angiogenesis, immune cell activation, differentiation, and cell migration. These processes are important in immune surveillance and tumor progression.66
Figure 10.
Tumor microenvironment identified by TG-ME
(A) TME (left) and TME cellular composition (right) for sample 1.
(B) TME (left) and TME cellular composition (right) for sample 4.
(C) TME (left) and TME cellular composition (right) for sample 3.
(D) TME (left) and TME cellular composition (right) for sample 2.
Table 5.
Tumor niches identified by TG-ME
| Sample | Tumor niche | Pathway | p value |
|---|---|---|---|
| 1 | 1 | Hypoxia | 4.97E-07 |
| 1 | 2 | Epithelial mesenchymal transition | 5.29E-17 |
| 1 | 3 | Apoptosis | 1.12E-08 |
| 1 | 4 | Interferon alpha response | 4.02E-07 |
| 1 | 5 | TNF-alpha signaling pathways | 1.77E-05 |
| 4 | 1 | Interferon gamma response | 1.00E-02 |
| 4 | 2 | Hypoxia | 2.64E-06 |
| 3 | 1 | Apoptosis | 3.80E-03 |
| 2 | 1 | E2F targets pathway | 1.00E-02 |
| 2 | 2 | Unidentified | NA |
Additionally, at the tumor-stroma border, we observed enrichment in both the TNF-alpha signaling pathway via the NF-kB pathway (p value 4.5E-5), and the antimicrobial humoral response (GO:0019730), the latter representing an antigen-specific adaptive immune response that directly destroys antigen-expressing target cells.67
Sample 4 displayed two tumor niches and a tumor-stroma border (Figure 10B), including:
(1) Tumor-IFN gamma niche: This niche is enriched in the interferon gamma response (IFN-gamma) pathway (Table 5), suggesting a robust anti-tumor immune response involving inflammation and apoptosis.68
(2) Tumor-hypoxia niche: This niche is enriched in the hypoxia pathway, also identified in sample 1 (Table 5), suggesting similarities between these two niches.
Sample 3, a more severe case, revealed only one tumor niche (Figure 10C): Tumor-apoptosis (p value 3.8E-3), also identified for one of the niches in sample 1 (Table 5). We also identified a tumor-stroma border that displayed enrichment in the coagulation pathway and the MAPK cascade pathway (GO:0045657), both pivotal in promoting tumor growth, invasiveness, and metastasis.69,70,71
Sample 2, the most severe sample, contained two tumor niches (Figure 10D):
(1) Tumor-E2F: This niche is enriched in the E2F targets pathway, which is associated with cell cycle regulation, proliferation, differentiation, and stress responses (Table 5).72
(2) Tumor-unidentified: The second niche, while not presenting any pathway enrichment, contained both tumor and neutrophil cells (Figure 6D), indicating potential interactions between the tumor and the immune system. We called this niche tumor-neutrophil.
Additionally, the tumor-stroma border in sample 2 demonstrated pathways similar to those in sample 1, including the TNF-alpha signaling via NF-kB pathway (p value 1.8E-) and the antimicrobial humoral response (GO:0019730) (p value 1.6E-4).
Across all four samples, we observed that as the severity of the tumor increases, the percentage of certain niches, such as tumor-apoptosis and tumor-E2F also increases. For example, the tumor-apoptosis niche increased from 2.03% in sample 1 (least severe) to 8.18% in sample 2 (second most severe) (Table 6). Additionally, the Tumor-E2F niche dominated in sample 3, accounting for 34.65% of the total tumor composition, suggesting a highly proliferative and aggressive tumor state (Table 6).
Table 6.
TG-ME identified tumoral niches
| Sample | Niche | Percentage of cells from niche |
|---|---|---|
| Sample 1 | Tumor-hypoxia | 6.70% |
| Sample 1 | Tumor-EMT | 2.85% |
| Sample 1 | Tumor-Apoptosis | 2.03% |
| Sample 1 | Tumor-IFN alpha | 0.93% |
| Sample 1 | Tumor-TNF | 0.12% |
| Sample 4 | Tumor-hypoxia | 6.50% |
| Sample 4 | Tumor-IFN Gamma | 10.60% |
| Sample 2 | Tumor-Apoptosis | 8.18% |
| Sample 3 | Tumor-E2F | 34.65% |
| Sample 3 | Tumor-neutrophil | 21.14% |
In conclusion, TG-ME provides a detailed dissection of tumoral niches, revealing diverse molecular signatures that reflect both severity and heterogeneity of NSCLC. The attention scores obtained from the transformer model are crucial for understanding how specific genes and pathways contribute to the TME dynamics and disease progression. By identifying these key interactions, the model can disclose biological mechanisms underlying cancer, such as hypoxia-related gene activity, immune response modulation, or changes associated with EMT. These findings highlight the value of niche-specific insights for understanding cancer progression and offer potential pathways for therapeutic intervention.
Biological implications of tumoral niches in NSCLC
Tumor-hypoxia niche
TG-ME identified two niches enriched in the hypoxia pathway, one in sample 1 and the other in sample 4. Hypoxia occurs when rapid tumor growth outpaces the vascular supply, leading to an oxygen deficit. This activates hypoxia-inducible factors (HIFs), which promote angiogenesis, metabolic reprogramming, and immune suppression.58,59 In NSCLC, hypoxia is known to increase tumor aggressiveness, metastasis, resistance to therapy, and tumor recurrence.59,73
Additionally, non-immune cells, such as endothelial cells, play a crucial role in responding to hypoxia by promoting the formation of new blood vessels (angiogenesis) to improve oxygen supply to the tumor.74 This process, which is mediated by vascular endothelial growth factor (VEGF), not only supports tumor growth but also facilitates tumor metastasis.75 On the other hand, cancer-associated fibroblasts (CAFs) can also be activated under hypoxic conditions and contribute to extracellular matrix remodeling, further enhancing tumor invasiveness.76
In our findings, the presence of the tumor-hypoxia niches across multiple NSCLC samples suggests that hypoxia-related pathways may be involved in tumor progression. This highlights potential therapeutic opportunities, for example targeting HIF pathways or VEGF signaling to interrupt the pro-tumorigenic effects of hypoxia in NSCLC patients.
Tumor-EMT niche
EMT is a key process in cancer metastasis.77 During EMT, epithelial cells lose their adhesion properties and apical-basal polarity, gaining mesenchymal characteristics that allow them to migrate and invade tissues.78 Factors such as growth factor stimulation and adhesion to type I collagen (Col-I) induce EMT in cancer cells, which are often activated in response to external stimuli in the TME.78 In NSCLC, EMT has been linked to poor prognosis and resistance to therapy, particularly in advanced stages.79
Tumor-apoptosis niche
The tumor-apoptosis niche is characterized by the enrichment of genes related to programmed cell death, or apoptosis. While the apoptosis pathway is typically a tumor-suppressive mechanism that eliminates tumoral cells, certain conditions within the TME may also allow cancer cells to evade apoptosis, promoting survival and growth.80 Additionally, apoptosis can enhance tumor progression by releasing inflammatory signals that remodel the TME to foster tumor development.64
Specifically, in NSCLC, apoptosis can promote cancer by enhancing proliferation. Apoptosis supports proliferating cells by removing the tumor cells that respond to treatment, leading to tumor resistance. In NSCLC therapy, it is therefore crucial to consider regulatory mechanisms other than apoptosis.81
Tumor-IFN alpha and gamma niches
The tumor-IFN alpha and gamma niches are enriched in pathways related to interferon signaling, which plays a crucial role in anti-tumor immunity.82 IFN-alpha and gamma have the ability to induce the expression of genes that promote immune responses against tumor cells, inhibiting tumor proliferation and enhancing the presentation of tumor antigens to the immune system.83
In NSCLC, interferon signaling is essential for immune response within the TME, particularly in the interaction between tumor cells and immune cells. The presence of these niches in our analysis reflects the ongoing battle between the tumor cells and the immune system, with interferon pathways playing a key role in this interaction.
Discussion
In this work, we have designed, implemented, and applied TG-ME, a method for automatic tumor niche classification. The proposed approach resolves gene expression at subcellular resolution, uses a network to infer gene expression changes resulting from cell interactions, and then assigns each cell to a specific niche. TG-ME achieves significantly higher ARI and NMI scores compared to previous methods.
We validated the strengths of this approach on the human dorsolateral prefrontal cortex and breast cancer benchmark datasets. The recent advancements in high-resolution spatial transcriptomics technologies have significantly increased the demand for sophisticated computational frameworks like TG-ME to analyze and extract meaningful insights from these complex datasets. The TG-ME approach consists of normalizing the gene expression of each cell under two main assumptions. First, TG-ME considers that the internal gene-gene interactions within a cell are reflected in its gene expression profile. Second, TG-ME considers that the cell-to-cell interactions within a tumor neighborhood are reflected in the gene expression profile of that spatial neighborhood. These are crucial aspects that the TG-ME framework leverages to gain insights into the complex cellular dynamics within the tumor niches.
One of the main advantages of normalizing the gene expression is the ability to comprehensively profile individual cells. The gene expression data provide insights into the molecular activities of the cells, mainly into the genes that are actively transcribed and potentially contributing to tumor progression.84,85 Simultaneously, the inclusion of morphology features offers a structural perspective, elucidating the shape, size, and, in some instances, the type of cells.10,86 This comprehensive profiling provides a more detailed and context-rich representation of the cellular behavior within the TME. Cell type information, on the other hand, incorporated into our multimodal data representation, helps to understand the distinct contributions of different populations to the TME.87 It allows the identification of cell type-specific gene expression changes and their potential implications in the TME.
In addition, the implemented model, inspired by Zhang T.-H. et al.,88 produces one-to-one representations. This means it exhibits input-output consistency, aiming to preserve the biological context. Therefore, for every gene in a cell, the model generates a representation that considers its relationship with the remaining genes. These representations are called attention values, and each attention layer generates an attention value for each gene as a function of the output of the previous attention layer. Integrating the information extracted from the morphology images with the gene expression information, to determine how similar cells are in terms of morphology, improves the robustness of discerning the similarity between two cells and of identifying cells transformed by cancer. Future studies can build on the TG-ME model owing to its modularity.
Limitations of the study
We observed that the precision, recall, and F1-score decrease as the severity of the sample increases, which highlights the challenges in distinguishing between cell types as cancer progresses.
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to the lead contact, Mario Flores (mario.flores@utsa.edu).
Materials availability
This study did not generate new unique reagents.
Data and code availability
• This paper analyzes existing, publicly available data, accessible at https://doi.org/10.1038/s41593-020-00787-0, https://support.10xgenomics.com/spatial-gene-expression/datasets/1.0.0/V1_Breast_Cancer_Block_A_Section_1, Gene Expression Omnibus (GEO) through accession number GSE243275, and https://doi.org/10.1038/s41587-022-01483-z. These data are publicly available as of the date of publication.
• All original code has been deposited at GitHub: https://github.com/Karladanielap/TG-ME/ and is publicly available as of the date of publication.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Acknowledgments
This article’s publication costs were supported partially by the National Institutes of Health (3R01CA124332-13A1S1 to M.A.F.), the National Science Foundation (2051113 to Y.J.), the National Institutes of Health (P30 CA054174 to Y.C.), Cancer Prevention and Research Institute of Texas Core (RP220662 to Y.C.), National Institutes of Health (R01 CA284554 to S.G.).
Author contributions
K.P. and M.F. were responsible for data acquisition, analysis, and interpretation, and drafted and critically revised the manuscript. Y.-F.J. and M.F. contributed to the conceptualization, data acquisition, analysis, and interpretation, and also drafted and critically revised the manuscript. Y.C., S.G., and Y.H. provided resources, contributed to writing, and approved the final version of the manuscript. All authors have read and approved the final manuscript.
Declaration of interests
The authors declare no competing interests.
STAR★Methods
Key resources table
Method details
Pre-processing of gene expressions of the NSCLC dataset
The raw gene expression matrix was processed using Seurat v4,94 where cells with a mitochondrial percentage higher than 10% were removed, keeping the cells with the highest quality.40 All 980 genes included in the dataset were used, since the number of features and the total counts indicated good data quality, as shown in Figure S2. More information about the quality control process can be found in Methods S1. Details of the removed cells are shown in Table S6.
Normalization, scaling, and selection of highly variable genes were performed with SCTransform, a regularized negative binomial regression-based model available in Seurat.89 Principal component analysis (PCA) was conducted on the normalized data for dimensionality reduction, retaining the top 30 components. The data were visualized with uniform manifold approximation and projection (UMAP) computed over the first 30 principal components, with 5 neighboring points and a minimum distance of 0.001. The nearest neighbors were identified over the 30 principal components, and clusters were computed using the Leiden algorithm with a resolution of 0.8.
Integration of different samples
The gene expression matrices of the 4 samples were integrated by Seurat v4 using the SCTransformed counts (Figure 4A). PCA was conducted on integrated data, where the top 30 Principal Components (PC) were selected for UMAP visualization.
Image stitching
The morphology images were further stitched into one image in Python v3.9.7 with the package stitch2d v1.1,95 through StructuredMosaic, which arranges the images into a grid. The number of images to be stitched together in each line is determined by the size of the samples. The origin was always set to the lower left corner, the stitching direction was defined to be horizontal, and the selected stitching pattern was raster.
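The grid layout described above (origin at the lower left, horizontal direction, raster pattern) can be illustrated with plain NumPy. This sketch is not the stitch2d API; it assumes equally sized, non-overlapping tiles and skips any tile alignment, and the name `stitch_raster` and the toy tiles are hypothetical.

```python
import numpy as np


def stitch_raster(tiles, n_cols):
    """Arrange equally sized tiles row by row, with the first tile at the lower left."""
    rows = [np.hstack(tiles[i:i + n_cols]) for i in range(0, len(tiles), n_cols)]
    # The first row of tiles is the bottom of the mosaic, so stack rows in reverse.
    return np.vstack(rows[::-1])


# Six toy tiles (hypothetical), each filled with its own index for easy inspection.
tiles = [np.full((2, 3), k) for k in range(6)]
mosaic = stitch_raster(tiles, n_cols=3)
```

With three tiles per line, tiles 0 to 2 form the bottom row and tiles 3 to 5 the row above it.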
Extraction of cell morphology image
The dataset also provides the center (X-Y coordinates), the width, and the height of each cell. The width or height of the largest cell is used as the size template for image cropping, so each cell is cropped to a square image centered on the cell. For a smaller cell, the image can include more than one cell, but the cell of interest is always located at the center of the image, which provides information not only about the cell of interest but also about its context.
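A minimal sketch of this cropping step is shown below. It assumes a single-channel image, integer pixel coordinates, and zero padding for cells close to the image border (the border handling is not described in the text); `crop_cell` and the toy values are hypothetical.

```python
import numpy as np


def crop_cell(image, center_x, center_y, size):
    """Return a size-by-size patch of a 2D image centered on (center_x, center_y)."""
    half = size // 2
    padded = np.pad(image, ((half, half), (half, half)), mode="constant")
    # After padding, original pixel (y, x) sits at (y + half, x + half),
    # so the patch centered on the cell starts at row center_y and column center_x.
    top, left = int(center_y), int(center_x)
    return padded[top:top + size, left:left + size]


image = np.arange(100).reshape(10, 10)
cell_widths, cell_heights = [3, 5], [4, 5]                # toy cell sizes
patch_size = max(max(cell_widths), max(cell_heights))     # largest cell sets the template
patch = crop_cell(image, center_x=4, center_y=6, size=patch_size)
corner_patch = crop_cell(image, center_x=0, center_y=0, size=patch_size)
```

Using one template size for all cells gives the downstream CNN inputs of identical shape, at the cost of including neighboring cells in the patches of small cells.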
Extraction of morphology features
The morphology images of each cell are augmented, including normalization, rotation, sharpness, and adjusting. The adjusted morphology images were processed by a pre-trained CNN, ResNet50, to extract the features.45 The extracted features of a cell were represented as a 2,048-dimensional vector. PCA is performed over the vector of each cell to further reduce the dimensions, and the top 50 principal components are extracted as the morphological features () of each cell.
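The dimensionality reduction step can be sketched with a singular value decomposition. The example below uses random numbers in place of real ResNet50 features, so only the shapes (2,048 inputs, 50 outputs) follow the text; `pca_reduce` and the number of cells are hypothetical.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project row vectors onto their top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
cnn_features = rng.normal(size=(200, 2048))   # stand-in for per-cell CNN features
morph_features = pca_reduce(cnn_features, n_components=50)
component_variance = morph_features.var(axis=0)
```

Each cell ends up with a 50-dimensional morphology vector whose components are ordered by explained variance.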
Processing of cell location-adjacency matrix
A ball tree graph is built to calculate the distances between spots by constructing a cell-cell spatial relationship graph, which is used to construct the adjacency matrix (γ). The cell-cell adjacency matrix is built over the top 12 nearest neighbors: when two cells n and j are neighbors, γnj = γjn = 1, and γnj = γjn = 0 otherwise. The adjacency matrix contains self-loops. The resulting matrix is a square matrix of size N by N, where N represents the number of cells. N varies across samples (Table 2).
The ball tree graph is a hierarchical data structure used for organizing points in a multi-dimensional space. The graph-structured representation contains the information of the normalized adjacency matrix, the adjacency labels, and the normalization term.
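The construction of γ can be sketched as follows. For brevity, the sketch replaces the ball tree with brute-force pairwise distances, which gives the same neighbors on small inputs but does not scale to full samples; it also assumes that symmetry is obtained by taking the union of the neighbor relations. The name `knn_adjacency` and the random coordinates are hypothetical.

```python
import numpy as np


def knn_adjacency(coords, k=12):
    """Symmetric binary k-nearest-neighbor adjacency matrix with self-loops."""
    n = coords.shape[0]
    diffs = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    # For distinct coordinates, each row lists the cell itself first (distance 0),
    # followed by its k nearest neighbors.
    nearest = np.argsort(dist, axis=1)[:, :k + 1]
    gamma = np.zeros((n, n), dtype=int)
    gamma[np.repeat(np.arange(n), k + 1), nearest.ravel()] = 1
    gamma = np.maximum(gamma, gamma.T)   # enforce gamma_nj = gamma_jn
    np.fill_diagonal(gamma, 1)           # self-loops
    return gamma


rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(50, 2))   # toy cell centers
gamma = knn_adjacency(coords, k=12)
```

Because nearest-neighbor relations are not mutual in general, some cells end up with more than 12 neighbors after symmetrization.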
Selection of similar cells in the neighborhood using morphological and location information
Cellular morphological features (), cell types (), and neighborhood () of a cell are used to screen similar cells in a region based on cosine similarity. The number of cell types in the dataset is used for one-hot encoding of the cell type feature (). The adjacency matrix previously computed indicates whether two cells are neighbors (γnj = 1) or not (γnj = 0). Then, the similarity weights (K) between a cell n and its neighboring cell j are calculated according to the vectors and as in Equation 1
| (Equation 1) |
Where,
| (Equation 2) |
TG-ME normalization
If cells are similar in morphology, type, and locations, cosine similarity of expression levels of 980 genes between cell n, Gn, and Gj from its neighboring cell j, , is calculated.
The TG-ME pipeline normalizes the gene expression levels in a cell if the cell shares similar expression, morphology, location, and type with its neighbors, suggesting a logical AND operation between the similarity weights and , obtaining a final similarity weight (Equation 3).
| (Equation 3) |
The gene expression of a cell of interest (n) will be normalized accounting for the similarity weights (S) as follows:
| (Equation 4) |
Where is the normalized gene expression of cell n, S is the similarity weight obtained before, and m is the total number of neighbors.
TG-ME normalization characterizes the similarity of cells deploying morphology, cell type, gene expression, and spatial location information.
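To make the idea of similarity-weighted normalization concrete, the sketch below shows one possible form. It is an assumption, not a reproduction of Equations 1 to 4: it takes a precomputed binary context weight (1 when two cells are neighbors with similar morphology and type), multiplies it by the cosine similarity of the expression vectors, and adds the weighted mean of the neighbors to the cell of interest. All names and values are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two expression vectors (0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def normalize_expression(expr, context_weight, n):
    """Assumed form: add the similarity-weighted mean of the neighbors to cell n."""
    neighbors = [j for j in range(expr.shape[0]) if j != n and context_weight[n, j] > 0]
    if not neighbors:
        return expr[n].copy()
    weights = np.array([
        context_weight[n, j] * cosine_similarity(expr[n], expr[j])
        for j in neighbors
    ])
    return expr[n] + (weights[:, None] * expr[neighbors]).sum(axis=0) / len(neighbors)


# Three toy cells with three genes; cell 0 has cells 1 and 2 as similar neighbors.
expr = np.array([[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0]])
context_weight = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 1]])
normalized_cell0 = normalize_expression(expr, context_weight, n=0)
```

In this toy case, cell 1 has the same expression direction as cell 0 and contributes fully, while cell 2 is orthogonal and contributes nothing.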
TG-ME model for TME dissection
The normalized gene expression matrix representing the total number of cells, and N is the number of genes in the dataset, serves as the input to the TG-ME model including a Transformer model, followed by a Graph Variational Autoencoder (GraphVAE), as illustrated in Figure 8.
Transformer
The transformer model consists of three attention layers, each featuring three attention heads (Figure 8A). Each attention layer generates an attention vector , where b represents each layer, encapsulating the TG-ME normalized gene expression vector .
The self-attention module, the cornerstone of the transformer architecture, generates attention values for each element in the input sequence. It conducts two primary operations: computing the query , key and value vectors for each element, where the corresponding weights () are learned during training. These weights are initially set to random values before training commences and are then updated iteratively during the training process. The optimizer then adjusts these weights to minimize the loss. This process repeats for each epoch until the model achieves satisfactory performance, with the weights being fine-tuned to capture intricate patterns and relationships within the data.
For gene in cell , the query of this gene in the first attention layer is computed by multiplying the query weight of the gene () by its gene expression in cell , as
| (Equation 5) |
| (Equation 6) |
| (Equation 7) |
Each gene in a cell will have its own query, key, and value.
The second set of operations includes the calculation of the attention weight vector () as a probability distribution representing the similarity between and , to obtain the relationship between gene j and the rest of genes:
| (Equation 8) |
Where represents the attention weight between the gene j and gene l. Softmax calculates the probability distribution of the similarity between and .
Subsequently, a representation of the Query gene () is obtained for each self-attention module by multiplying the transpose of by the vector of values, also excluding the gene of interest .
| (Equation 9) |
For each cell , the output of the attention layer is a vector of length 980, containing all the representations for the query genes .
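The attention computation described above can be sketched for a single head as follows. The sketch assumes one scalar query, key, and value weight per gene and an unscaled softmax, and it excludes each gene from its own attention distribution as the text indicates; the exact form of Equations 5 to 9 may differ. The weights are random stand-ins for learned parameters, and `gene_self_attention` is a hypothetical name.

```python
import numpy as np


def gene_self_attention(expression, w_q, w_k, w_v):
    """Single-head attention over the genes of one cell."""
    q = w_q * expression   # one scalar query per gene
    k = w_k * expression   # one scalar key per gene
    v = w_v * expression   # one scalar value per gene
    scores = np.outer(q, k)                        # scores[j, l] compares gene j with gene l
    np.fill_diagonal(scores, -np.inf)              # exclude the gene of interest itself
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over the other genes
    return weights @ v, weights


rng = np.random.default_rng(0)
n_genes = 980
expression = rng.random(n_genes)   # stand-in for a normalized expression vector
w_q, w_k, w_v = (rng.normal(size=n_genes) for _ in range(3))
representation, attention = gene_self_attention(expression, w_q, w_k, w_v)
```

The output has one representation per gene, so the input and output lengths match, which is the one-to-one property the model relies on.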
The proposed model contains 3 attention heads . The representations from each head are linearly combined to obtain one final set of representations , where represents each layer.
| (Equation 10) |
The summation represents the residual connection, the weights of head H, and an by matrix, where is the number of heads, and is the number of genes that contains the representations for each head. This computation will be repeated for each of the layers, wherein the first layer is the TG-ME normalized gene expression for cell , , while in the 2nd and 3rd layers is the output of the previous layer . LayerNorm is the layer normalization function,96 which calculates normalization statistics directly from the combined inputs to the neurons within a hidden layer.
Finally, a classification layer is included to classify between different cell types using the Softmax function. The model is further trained with a dropout rate of 0.7 and no activation function, and the final set of representations for each cell generated by the last attention layer is extracted because it represents the relationship among genes which are further used as an input for the GraphVAE, as shown in Figures 7 and 8.
Graph Variational Autoencoder (GraphVAE)
GraphVAE integrates Graph Convolutional Networks (GCNs) with a Variational Autoencoder (VAE). GraphVAE aims to acquire latent features embedded in location information denoted by graph γ and the attention output from the transformer .
The encoder contains 3 GCN layers, as presented in Figure 8B. The first GCN layer convolutes and , with a rectified linear unit (ReLU) activation function,
| (Equation 11) |
| (Equation 12) |
where is the symmetrically normalized adjacency matrix, and D is the degree matrix.
The second layer computes the mean μ of the latent representation, and the third layer computes the log variance log σ2,
| (Equation 13) |
| (Equation 14) |
where , , and are the weights for each layer updated during the training process with Adam optimizer. The output of the encoder consists of the reparameterizations, where noise is introduced, and the final latent representation is computed as
| (Equation 15) |
with .
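The encoder can be sketched in NumPy as below. Since the symbols of Equations 11 to 15 are not reproduced here, the sketch follows the standard variational graph autoencoder formulation (symmetric normalization, one shared GCN layer, two output layers, Gaussian reparameterization), which is an assumption about the exact implementation. The layer sizes of 32 and 16, the ring graph, and all variable names are illustrative.

```python
import numpy as np


def normalize_adjacency(gamma):
    """Symmetric normalization D^{-1/2} gamma D^{-1/2}; self-loops keep degrees positive."""
    d_inv_sqrt = 1.0 / np.sqrt(gamma.sum(axis=1))
    return gamma * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def encode(features, gamma, w0, w_mu, w_logvar, rng):
    """GCN encoder returning a sampled latent representation, its mean, and log variance."""
    a_norm = normalize_adjacency(gamma)
    hidden = np.maximum(a_norm @ features @ w0, 0.0)   # first GCN layer with ReLU
    mu = a_norm @ hidden @ w_mu                        # mean of the latent representation
    log_var = a_norm @ hidden @ w_logvar               # log variance
    eps = rng.standard_normal(mu.shape)                # noise drawn from N(0, 1)
    z = mu + np.exp(0.5 * log_var) * eps               # reparameterization
    return z, mu, log_var


rng = np.random.default_rng(0)
n_cells, n_features = 5, 8
gamma = np.eye(n_cells)
for i in range(n_cells):   # toy ring graph plus self-loops
    gamma[i, (i + 1) % n_cells] = gamma[(i + 1) % n_cells, i] = 1
features = rng.normal(size=(n_cells, n_features))
w0 = rng.normal(size=(n_features, 32))
w_mu = rng.normal(size=(32, 16))
w_logvar = rng.normal(size=(32, 16)) * 0.01
z, mu, log_var = encode(features, gamma, w0, w_mu, w_logvar, rng)
```

Sampling through `mu + sigma * eps` keeps the noise separate from the learned parameters, which is what allows gradients to flow through the sampling step during training.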
The GraphVAE model also has a decoder that reconstructs the input , denoted as ξ. It is made of a fully connected layer with the ReLU activation function, followed by a second fully connected layer. The convolutional hidden layers have sizes of 16 and 32.
| (Equation 16) |
| (Equation 17) |
where are the bias vectors and are the weights updated during training.
The model is designed to consider two different angles, the first one is the transformer model, which learns the relationships across genes , the second one is the GraphVAE, where contains location-based signatures integrating gene expression and morphological information. and are concatenated and used for downstream analysis.
Hyperparameters of the TG-ME model
The model was trained for 100 epochs (with early stopping) with the batch size of 128. Back propagation and the Adam optimizer with a learning rate of 0.01 and a weight decay of 0.01 were applied. A validation set was used during training. The transformer model was trained with a dropout rate of 0.7, while GraphVAE was trained with a dropout rate of 0.1.
The negative log likelihood loss was calculated for the class prediction using the Transformer model, and a mean squared error loss between the decoded features from the GraphVAE and the input to the GraphVAE (, γ) was calculated. The two losses were combined as a weighted sum (0.8 for the Transformer and 0.2 for the GCN autoencoder) to calculate a final loss.
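The weighted combination of the two losses can be written out as a short sketch. Only the weights 0.8 and 0.2 come from the text; the function names and the toy inputs are hypothetical, and the losses are averaged over cells and features as an assumption.

```python
import numpy as np


def nll_loss(log_probs, targets):
    """Mean negative log likelihood of the true class under predicted log probabilities."""
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def mse_loss(reconstruction, original):
    """Mean squared error between reconstructed and original features."""
    return float(((reconstruction - original) ** 2).mean())


def total_loss(log_probs, targets, reconstruction, original, w_cls=0.8, w_rec=0.2):
    """Weighted sum of the classification and reconstruction losses."""
    return w_cls * nll_loss(log_probs, targets) + w_rec * mse_loss(reconstruction, original)


# Toy batch: two cells, two classes, and a two-feature reconstruction of one cell.
log_probs = np.log(np.array([[0.5, 0.5], [0.25, 0.75]]))
targets = np.array([0, 1])
reconstruction = np.array([[1.0, 2.0]])
original = np.array([[1.0, 4.0]])
loss = total_loss(log_probs, targets, reconstruction, original)
```

Weighting the classification term more heavily prioritizes cell type prediction while still constraining the latent space to reconstruct the graph input.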
To address potential overfitting, we incorporated several strategies in TG-ME’s architecture, including early stopping, dropout layers (0.7 for the Transformer model and 0.1 for the GraphVAE), and regularization techniques like weight decay. These mechanisms help to prevent overfitting by reducing the model’s complexity and improving generalization. Additionally, we used cross-validation, and a separate validation set to fine-tune hyperparameters and evaluate performances across diverse samples.
Spatial Transcriptomics data can have high dimensionality; to further avoid overfitting, we reduced feature space using dimensionality reduction techniques (PCA) before feeding the data into the model, which helps to mitigate the curse of dimensionality and limits the risk of overfitting.
The model takes 134.13 s to train for 100 epochs on an NVIDIA GeForce RTX 3070 Ti GPU, which has 8 GB of total memory. The GPU utilizes NVIDIA driver version 560.94 and supports CUDA version 12.6. In contrast, the model trains in 30.09 s for 100 epochs in a high-performance computing (HPC) environment using a Tesla V100S-PCIE-32GB GPU, equipped with 32 GB of total memory. This GPU is running NVIDIA driver version 545.23.08 and supports CUDA version 12.3.
Tumor microenvironment identification from clustering
The concatenated location-based signatures, and are further clustered with the Louvain algorithm in the Scanpy Python package90 to identify cells with similar representations. If both, and are similar, cells will be assigned the same cluster. At the end, each cell is assigned one cluster, which is considered a niche of the TME.
Each of the identified clusters is assigned a name according to its cell type composition and spatial location for interpretation. The niches are mapped back to single-cell resolution by assigning the corresponding niche to each of the cells.
Evaluation
Two public benchmark datasets, the LIBD human dorsolateral prefrontal cortex dataset20 and a public 10X Visium benchmark dataset of breast cancer (invasive ductal carcinoma), were processed by the proposed TG-ME pipeline. The adjusted Rand index (ARI) and normalized mutual information (NMI) scores were calculated to evaluate the performance of the proposed pipeline in identifying microenvironment niches. The results were compared with the niches identified by the pathologist (annotation), stLearn, and Leiden clustering.
Severity signatures across tumor regions in NSCLC
With the validated TG-ME model, the NSCLC dataset was further analyzed to examine the severity signatures across tumor regions.
The TG-ME-normalized gene expression of cells in the identified tumor and Tumor-Stroma Border niches was integrated across samples by batch normalization using Seurat V4.89 Differential gene expression (DGE) analysis was then performed using the MAST method, with a log fold change threshold of 0.25 and a minimum detection fraction of 0.1 (genes expressed in at least 10% of cells). Enrichment analysis of the identified differentially expressed genes was performed with the enrichR package56 in R and the Hallmark57 database to gain insight into the severity pathways.
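The two DGE thresholds can be read as a pre-filter applied before the statistical test. The sketch below assumes they behave like the `logfc.threshold` and `min.pct` arguments of Seurat's `FindMarkers` (a gene is kept if it is detected in at least the minimum fraction of cells in either group and its average log fold change is large enough in absolute value). The study itself used Seurat in R; this Python version, its function name, and the gene records are purely illustrative.

```python
def passes_prefilter(avg_logfc, pct_group1, pct_group2,
                     logfc_threshold=0.25, min_pct=0.1):
    """Return True if a gene would be kept for testing under the two thresholds."""
    detected = max(pct_group1, pct_group2) >= min_pct   # expressed in >= 10% of cells in either group
    changed = abs(avg_logfc) >= logfc_threshold         # sufficiently large average log fold change
    return detected and changed


# Made-up summary statistics: (gene, average log fold change, detection fraction per group).
genes = [
    ("GENE_A", 0.80, 0.45, 0.20),   # kept
    ("GENE_B", 0.10, 0.60, 0.55),   # dropped: fold change too small
    ("GENE_C", -1.20, 0.05, 0.02),  # dropped: detected in too few cells
    ("GENE_D", -0.30, 0.08, 0.12),  # kept: passes in the second group
]

kept = [name for name, lfc, p1, p2 in genes if passes_prefilter(lfc, p1, p2)]
```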
Quantification and statistical analysis
Pathway enrichment analysis was conducted using the R package enrichR 3.2, and significance was assessed based on adjusted p-values derived from Fisher’s exact test. The statistical details are provided in Table 5 and in the results section "TG-ME can dissect different microenvironments related to tumor severity." Statistical significance was defined as a p-value < 0.05.
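To make the significance criterion concrete, the sketch below shows a right-tailed Fisher's exact test for gene set overlap followed by a Benjamini-Hochberg adjustment. It is a standard-library illustration, not the enrichR code path; the use of Benjamini-Hochberg for the adjustment, the background size, and all counts are assumptions for the example.

```python
from math import comb


def fisher_enrichment_p(overlap, n_query, n_set, n_background):
    """Right-tailed Fisher's exact test: P(X >= overlap) under the hypergeometric null."""
    total = comb(n_background, n_query)
    upper = min(n_query, n_set)
    tail = sum(
        comb(n_set, k) * comb(n_background - n_set, n_query - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: 200 query genes, a 20,000-gene background,
# and three gene sets given as (overlap, gene set size).
gene_sets = [(15, 200), (4, 150), (2, 180)]
raw_p = [fisher_enrichment_p(k, 200, size, 20000) for k, size in gene_sets]
adj_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adj_p]
```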
Published: March 13, 2025
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.isci.2025.112214.
References
- 1. Lorusso G., Rüegg C. The tumor microenvironment and its contribution to tumor evolution toward metastasis. Histochem. Cell Biol. 2008;130:1091–1103. doi: 10.1007/s00418-008-0530-8.
- 2. Ospina O.E., Wilson C.M., Soupir A.C., Berglund A., Smalley I., Tsai K.Y., Fridley B.L. spatialGE: quantification and visualization of the tumor microenvironment heterogeneity using spatial transcriptomics. Bioinformatics. 2022;38:2645–2647. doi: 10.1093/bioinformatics/btac145.
- 3. Kobayashi H., Enomoto A., Woods S.L., Burt A.D., Takahashi M., Worthley D.L. Cancer-associated fibroblasts in gastrointestinal cancer. Nat. Rev. Gastroenterol. Hepatol. 2019;16:282–295. doi: 10.1038/s41575-019-0115-0.
- 4. Mittal V., El Rayes T., Narula N., McGraw T.E., Altorki N.K., Barcellos-Hoff M.H. The microenvironment of lung cancer and therapeutic implications. Adv. Exp. Med. Biol. 2016;890:75–110. doi: 10.1007/978-3-319-24932-2_5.
- 5. Wang N., Li X., Wang R., Ding Z. Spatial transcriptomics and proteomics technologies for deconvoluting the tumor microenvironment. Biotechnol. J. 2021;16 doi: 10.1002/biot.202100041.
- 6. Fan Z., Luo Y., Lu H., Wang T., Feng Y., Zhao W., Kim P., Zhou X. SPASCER: spatial transcriptomics annotation at single-cell resolution. Nucleic Acids Res. 2023;51:D1138–D1149. doi: 10.1093/nar/gkac889.
- 7. Park J., Kim J., Lewy T., Rice C.M., Elemento O., Rendeiro A.F., Mason C.E. Spatial omics technologies at multimodal and single cell/subcellular level. Genome Biol. 2022;23:256. doi: 10.1186/s13059-022-02824-6.
- 8. Burgess D.J. Spatial transcriptomics coming of age. Nat. Rev. Genet. 2019;20:317. doi: 10.1038/s41576-019-0129-z.
- 9. Rao A., Barkley D., França G.S., Yanai I. Exploring tissue architecture using spatial transcriptomics. Nature. 2021;596:211–220. doi: 10.1038/s41586-021-03634-9.
- 10. He S., Bhatt R., Brown C., Brown E.A., Buhr D.L., Chantranuvatana K., Danaher P., Dunaway D., Garrison R.G., Geiss G., et al. High-plex imaging of RNA and proteins at subcellular resolution in fixed tissue by spatial molecular imaging. Nat. Biotechnol. 2022;40:1794–1806. doi: 10.1038/s41587-022-01483-z.
- 11. Saiselet M., Rodrigues-Vitória J., Tourneur A., Craciun L., Spinette A., Larsimont D., Andry G., Lundeberg J., Maenhaut C., Detours V. Transcriptional output, cell-type densities, and normalization in spatial transcriptomics. J. Mol. Cell Biol. 2020;12:906–908. doi: 10.1093/jmcb/mjaa028.
- 12. Zhao E., Stone M.R., Ren X., Guenthoer J., Smythe K.S., Pulliam T., Williams S.R., Uytingco C.R., Taylor S.E.B., Nghiem P., et al. Spatial transcriptomics at subspot resolution with BayesSpace. Nat. Biotechnol. 2021;39:1375–1384. doi: 10.1038/s41587-021-00935-2.
- 13. Bray M.-A., Singh S., Han H., Davis C.T., Borgeson B., Hartland C., Kost-Alimova M., Gustafsdottir S.M., Gibson C.C., Carpenter A.E. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nat. Protoc. 2016;11:1757–1774. doi: 10.1038/nprot.2016.105.
- 14. Swanson P.E. Foundations of immunohistochemistry. Am. J. Clin. Pathol. 1988;90:333–339. doi: 10.1093/ajcp/90.3.333.
- 15. Raj A., Van Den Bogaard P., Rifkin S.A., Van Oudenaarden A., Tyagi S. Imaging individual mRNA molecules using multiple singly labeled probes. Nat. Methods. 2008;5:877–879. doi: 10.1038/nmeth.1253.
- 16. Lubeck E., Cai L. Single-cell systems biology by super-resolution imaging and combinatorial labeling. Nat. Methods. 2012;9:743–748. doi: 10.1038/nmeth.2069.
- 17. Femino A.M., Fay F.S., Fogarty K., Singer R.H. Visualization of single RNA transcripts in situ. Science. 1998;280:585–590. doi: 10.1126/science.280.5363.585.
- 18. Wang F., Flanagan J., Su N., Wang L.-C., Bui S., Nielson A., Wu X., Vo H.T., Ma X.J., Luo Y. RNAscope: a novel in situ RNA analysis platform for formalin-fixed, paraffin-embedded tissues. J. Mol. Diagn. 2012;14:22–29. doi: 10.1016/j.jmoldx.2011.08.002.
- 19. Lubeck E., Coskun A.F., Zhiyentayev T., Ahmad M., Cai L. Single-cell in situ RNA profiling by sequential hybridization. Nat. Methods. 2014;11:360–361. doi: 10.1038/nmeth.2892.
- 20. Maynard K.R., Collado-Torres L., Weber L.M., Uytingco C., Barry B.K., Williams S.R., Catallini J.L., 2nd, Tran M.N., Besich Z., Tippani M., et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat. Neurosci. 2021;24:425–436. doi: 10.1038/s41593-020-00787-0.
- 21. Chen K.H., Boettiger A.N., Moffitt J.R., Wang S., Zhuang X. Spatially resolved, highly multiplexed RNA profiling in single cells. Science. 2015;348 doi: 10.1126/science.aaa6090.
- 22. Kleshchevnikov V., Shmatko A., Dann E., Aivazidis A., King H.W., Li T., Elmentaite R., Lomakin A., Kedlian V., Gayoso A., et al. Cell2location maps fine-grained cell types in spatial transcriptomics. Nat. Biotechnol. 2022;40:661–671. doi: 10.1038/s41587-021-01139-4.
- 23. Cable D.M., Murray E., Zou L.S., Goeva A., Macosko E.Z., Chen F., Irizarry R.A. Robust decomposition of cell type mixtures in spatial transcriptomics. Nat. Biotechnol. 2022;40:517–526. doi: 10.1038/s41587-021-00830-w.
- 24. Bae S., Na K.J., Koh J., Lee D.S., Choi H., Kim Y.T. CellDART: cell type inference by domain adaptation of single-cell and spatial transcriptomic data. Nucleic Acids Res. 2022;50:e57. doi: 10.1093/nar/gkac084.
- 25. Biswas A., Ghaddar B., Riedlinger G., De S. Inference on spatial heterogeneity in tumor microenvironment using spatial transcriptomics data. Comput. Syst. Oncol. 2022;2 doi: 10.1002/cso2.1043.
- 26. Dhainaut M., Rose S.A., Akturk G., Wroblewska A., Nielsen S.R., Park E.S., Buckup M., Roudko V., Pia L., Sweeney R., et al. Spatial CRISPR genomics identifies regulators of the tumor microenvironment. Cell. 2022;185:1223–1239.e20. doi: 10.1016/j.cell.2022.02.015.
- 27. Xu C., Jin X., Wei S., Wang P., Luo M., Xu Z., Yang W., Cai Y., Xiao L., Lin X., et al. DeepST: identifying spatial domains in spatial transcriptomics by deep learning. Nucleic Acids Res. 2022;50 doi: 10.1093/nar/gkac901.
- 28. Zhang A.W., O’Flanagan C., Chavez E.A., Lim J.L.P., Ceglia N., McPherson A., Wiens M., Walters P., Chan T., Hewitson B., et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat. Methods. 2019;16:1007–1015. doi: 10.1038/s41592-019-0529-1.
- 29. Hirz T., Mei S., Sarkar H., Kfoury Y., Wu S., Verhoeven B.M., Subtelny A.O., Zlatev D.V., Wszolek M.W., Salari K., et al. Dissecting the immune suppressive human prostate tumor microenvironment via integrated single-cell and spatial transcriptomic analyses. Nat. Commun. 2023;14:663. doi: 10.1038/s41467-023-36325-2.
- 30. Peng Z., Ye M., Ding H., Feng Z., Hu K. Spatial transcriptomics atlas reveals the crosstalk between cancer-associated fibroblasts and tumor microenvironment components in colorectal cancer. J. Transl. Med. 2022;20:302. doi: 10.1186/s12967-022-03510-8.
- 31. Jerby-Arnon L., Regev A. DIALOGUE maps multicellular programs in tissue from single-cell or spatial transcriptomics data. Nat. Biotechnol. 2022;40:1467–1477. doi: 10.1038/s41587-022-01288-0.
- 32. Liu T., Fang Z.-Y., Li X., Zhang L.-N., Cao D.-S., Yin M.-Z. Graph deep learning enabled spatial domains identification for spatial transcriptomics. Brief. Bioinform. 2023;24 doi: 10.1093/bib/bbad146.
- 33. Dong K., Zhang S. Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder. Nat. Commun. 2022;13:1739. doi: 10.1038/s41467-022-29439-6.
- 34. Hu J., Li X., Coleman K., Schroeder A., Ma N., Irwin D.J., Lee E.B., Shinohara R.T., Li M. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat. Methods. 2021;18:1342–1351. doi: 10.1038/s41592-021-01255-8.
- 35. Xu H., Fu H., Long Y., Ang K.S., Sethi R., Chong K., Li M., Uddamvathanak R., Lee H.K., Ling J., et al. Unsupervised spatially embedded deep representation of spatial transcriptomics. Genome Med. 2024;16:12. doi: 10.1186/s13073-024-01283-x.
- 36. Pham D., Tan X., Balderson B., Xu J., Grice L.F., Yoon S., Willis E.F., Tran M., Lam P.Y., Raghubar A., et al. Robust mapping of spatiotemporal trajectories and cell–cell interactions in healthy and diseased tissues. Nat. Commun. 2023;14:7739. doi: 10.1038/s41467-023-43120-6.
- 37. Janesick A., Shelansky R., Gottscho A.D., Wagner F., Williams S.R., Rouault M., Beliakoff G., Morrison C.A., Oliveira M.F., Sicherman J.T., et al. High resolution mapping of the tumor microenvironment using integrated single-cell, spatial and in situ analysis. Nat. Commun. 2023;14:8353. doi: 10.1038/s41467-023-43458-x.
- 38. Flores M., Liu Z., Zhang T., Hasib M.M., Chiu Y.-C., Ye Z., Paniagua K., Jo S., Zhang J., Gao S.J., et al. Deep learning tackles single-cell analysis—a survey of deep learning for scRNA-seq analysis. Brief. Bioinform. 2022;23 doi: 10.1093/bib/bbab531.
- 39. Flores M.A., Paniagua K., Huang W., Ramirez R., Falcon L., Liu A., Chen Y., Huang Y., Jin Y. Characterizing Macrophages Diversity in COVID-19 Patients Using Deep Learning. Genes. 2022;13:2264. doi: 10.3390/genes13122264.
- 40. Osorio D., Cai J.J. Systematic determination of the mitochondrial proportion in human and mice tissues for single-cell RNA-sequencing data quality control. Bioinformatics. 2021;37:963–967. doi: 10.1093/bioinformatics/btaa751.
- 41. Caliński T., Harabasz J. A dendrite method for cluster analysis. Commun. Stat. Theor. Methods. 1974;3:1–27.
- 42. Sobin L.H., Gospodarowicz M.K., Wittekind C. TNM Classification of Malignant Tumours. John Wiley & Sons; 2011.
- 43. Luecken M.D., Büttner M., Chaichoompu K., Danese A., Interlandi M., Müller M.F., Strobl D.C., Zappia L., Dugas M., Colomé-Tatché M., et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods. 2022;19:41–50. doi: 10.1038/s41592-021-01336-8.
- 44. Yu H., Lim K.P., Xiong S., Tan L.P., Shim W. Functional morphometric analysis in cellular behaviors: shape and size matter. Adv. Healthc. Mater. 2013;2:1188–1197. doi: 10.1002/adhm.201300053.
- 45. He K., Zhang X., Ren S., Sun J. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
- 46. Denlinger C.E., Ikonomidis J.S., Reed C.E., Spinale F.G. Epithelial to mesenchymal transition: the doorway to metastasis in human lung cancers. J. Thorac. Cardiovasc. Surg. 2010;140:505–513. doi: 10.1016/j.jtcvs.2010.02.061.
- 47. Nazareth M.R., Broderick L., Simpson-Abelson M.R., Kelleher R.J., Yokota S.J., Bankert R.B. Characterization of human lung tumor-associated fibroblasts and their ability to modulate the activation of tumor-associated T cells. J. Immunol. 2007;178:5552–5562. doi: 10.4049/jimmunol.178.9.5552.
- 48. Zheng X., Jiang K., Xiao W., Zeng D., Peng W., Bai J., Chen X., Li P., Zhang L., Zheng X., et al. CD8+ T cell/cancer-associated fibroblast ratio stratifies prognostic and predictive responses to immunotherapy across multiple cancer types. Front. Immunol. 2022;13 doi: 10.3389/fimmu.2022.974265.
- 49. Longo V., Catino A., Montrone M., Galetta D., Ribatti D. Controversial role of mast cells in NSCLC tumor progression and angiogenesis. Thorac. Cancer. 2022;13:2929–2934. doi: 10.1111/1759-7714.14654.
- 50. Shemesh R., Laufer-Geva S., Gorzalczany Y., Anoze A., Sagi-Eisenberg R., Peled N., Roisman L.C. The interaction of mast cells with membranes from lung cancer cells induces the release of extracellular vesicles with a unique miRNA signature. Sci. Rep. 2023;13 doi: 10.1038/s41598-023-48435-4.
- 51. Rousseeuw P.J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987;20:53–65.
- 52. Gonzalez-Castillo J., Hoy C.W., Handwerker D.A., Robinson M.E., Buchanan L.C., Saad Z.S., Bandettini P.A. Tracking ongoing cognition in individuals using brief, whole-brain functional connectivity patterns. Proc. Natl. Acad. Sci. USA. 2015;112:8762–8767. doi: 10.1073/pnas.1501242112.
- 53. Thornton-Wells T.A., Moore J.H., Haines J.L. Dissecting trait heterogeneity: a comparison of three clustering methods applied to genotypic data. BMC Bioinf. 2006;7:204–218. doi: 10.1186/1471-2105-7-204.
- 54. Tueller S., Lubke G. Evaluation of structural equation mixture models: Parameter estimates and correct class assignment. Struct. Equ. Modeling. 2010;17:165–192. doi: 10.1080/10705511003659318.
- 55. Guérin J., Gibaru O., Thiery S., Nyiri E. CNN features are also great at unsupervised classification. Preprint at arXiv. 2017. doi: 10.48550/arXiv.1707.01700.
- 56. Kuleshov M.V., Jones M.R., Rouillard A.D., Fernandez N.F., Duan Q., Wang Z., Koplev S., Jenkins S.L., Jagodnik K.M., Lachmann A., et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44:W90–W97. doi: 10.1093/nar/gkw377.
- 57. Liberzon A., Birger C., Thorvaldsdóttir H., Ghandi M., Mesirov J.P., Tamayo P. The molecular signatures database hallmark gene set collection. Cell Syst. 2015;1:417–425. doi: 10.1016/j.cels.2015.12.004.
- 58. Brahimi-Horn M.C., Chiche J., Pouysségur J. Hypoxia and cancer. J. Mol. Med. 2007;85:1301–1307. doi: 10.1007/s00109-007-0281-3.
- 59. Wilson W.R., Hay M.P. Targeting hypoxia in cancer therapy. Nat. Rev. Cancer. 2011;11:393–410. doi: 10.1038/nrc3064.
- 60. Challapalli A., Carroll L., Aboagye E.O. Molecular mechanisms of hypoxia in cancer. Clin. Transl. Imaging. 2017;5:225–253. doi: 10.1007/s40336-017-0231-1.
- 61. Ribatti D., Tamma R., Annese T. Epithelial-mesenchymal transition in cancer: a historical overview. Transl. Oncol. 2020;13 doi: 10.1016/j.tranon.2020.100773.
- 62. Debatin K.-M. Apoptosis pathways in cancer and cancer therapy. Cancer Immunol. Immunother. 2004;53:153–159. doi: 10.1007/s00262-003-0474-8.
- 63. Park W.-Y., Gray J.M., Holewinski R.J., Andresson T., So J.Y., Carmona-Rivera C., Hollander M.C., Yang H.H., Lee M., Kaplan M.J., et al. Apoptosis-induced nuclear expulsion in tumor cells drives S100a4-mediated metastatic outgrowth through the RAGE pathway. Nat. Cancer. 2023;4:419–435. doi: 10.1038/s43018-023-00524-z.
- 64. Morana O., Wood W., Gregory C.D. The apoptosis paradox in cancer. Int. J. Mol. Sci. 2022;23:1328. doi: 10.3390/ijms23031328.
- 65. Vidal P. Interferon α in cancer immunoediting: From elimination to escape. Scand. J. Immunol. 2020;91 doi: 10.1111/sji.12863.
- 66. Wajant H. The Role of TNF in Cancer. Death Receptors and Cognate Ligands in Cancer. 2009:1–15. doi: 10.1007/400_2008_26.
- 67. Reuschenbach M., von Knebel Doeberitz M., Wentzensen N. A systematic review of humoral immune responses against tumor antigens. Cancer Immunol. Immunother. 2009;58:1535–1544. doi: 10.1007/s00262-009-0733-4.
- 68. Zaidi M.R. The interferon-gamma paradox in cancer. J. Interferon Cytokine Res. 2019;39:30–38. doi: 10.1089/jir.2018.0087.
- 69. Seitz R., Heidtmann H.-H., Wolf M., Immel A., Egbring R. Prognostic impact of an activation of coagulation in lung cancer. Ann. Oncol. 1997;8:781–784. doi: 10.1023/a:1008240918434.
- 70. Lima L.G., Monteiro R.Q. Activation of blood coagulation in cancer: implications for tumour progression. Biosci. Rep. 2013;33 doi: 10.1042/BSR20130057.
- 71. Guo Y.J., Pan W.W., Liu S.B., Shen Z.F., Xu Y., Hu L.L. ERK/MAPK signalling pathway and tumorigenesis. Exp. Ther. Med. 2020;19:1997–2007. doi: 10.3892/etm.2020.8454.
- 72. Chen H.-Z., Tsai S.-Y., Leone G. Emerging roles of E2Fs in cancer: an exit from cell cycle control. Nat. Rev. Cancer. 2009;9:785–797. doi: 10.1038/nrc2696.
- 73. Liang S., Galluzzo P., Sobol A., Skucha S., Rambo B., Bocchetta M. Multimodality Approaches to Treat Hypoxic Non–Small Cell Lung Cancer (NSCLC) Microenvironment. Genes Cancer. 2012;3:141–151. doi: 10.1177/1947601912457025.
- 74. Krock B.L., Skuli N., Simon M.C. Hypoxia-induced angiogenesis: good and evil. Genes Cancer. 2011;2:1117–1133. doi: 10.1177/1947601911423654.
- 75. Carmeliet P. Angiogenesis in life, disease and medicine. Nature. 2005;438:932–936. doi: 10.1038/nature04478.
- 76. Kalluri R. The biology and function of fibroblasts in cancer. Nat. Rev. Cancer. 2016;16:582–598. doi: 10.1038/nrc.2016.73.
- 77. Voulgari A., Pintzas A. Epithelial–mesenchymal transition in cancer metastasis: mechanisms, markers and strategies to overcome drug resistance in the clinic. Biochim. Biophys. Acta. 2009;1796:75–90. doi: 10.1016/j.bbcan.2009.03.002.
- 78. Fujisaki H., Futaki S. Epithelial–Mesenchymal Transition Induced in Cancer Cells by Adhesion to Type I Collagen. Int. J. Mol. Sci. 2022;24:198. doi: 10.3390/ijms24010198.
- 79. Chae Y.K., Chang S., Ko T., Anker J., Agte S., Iams W., Choi W.M., Lee K., Cruz M. Epithelial-mesenchymal transition (EMT) signature is inversely associated with T-cell infiltration in non-small cell lung cancer (NSCLC). Sci. Rep. 2018;8:2918. doi: 10.1038/s41598-018-21061-1.
- 80. Dandoti S. Mechanisms adopted by cancer cells to escape apoptosis–A review. Biocell. 2021;45:863–884.
- 81. Qin P., Li Q., Zu Q., Dong R., Qi Y. Natural products targeting autophagy and apoptosis in NSCLC: a novel therapeutic strategy. Front. Oncol. 2024;14 doi: 10.3389/fonc.2024.1379698.
- 82. Fenton S.E., Saleiro D., Platanias L.C. Type I and II interferons in the anti-tumor immune response. Cancers. 2021;13:1037. doi: 10.3390/cancers13051037.
- 83. Jorgovanovic D., Song M., Wang L., Zhang Y. Roles of IFN-γ in tumor progression and regression: a review. Biomark. Res. 2020;8:49. doi: 10.1186/s40364-020-00228-x.
- 84. Casarrubios M., Provencio M., Nadal E., Insa A., del Rosario García-Campelo M., Lázaro-Quintela M., Dómine M., Majem M., Rodriguez-Abreu D., Martinez-Marti A., et al. Tumor microenvironment gene expression profiles associated to complete pathological response and disease progression in resectable NSCLC patients treated with neoadjuvant chemoimmunotherapy. J. Immunother. Cancer. 2022;10 doi: 10.1136/jitc-2022-005320.
- 85. Giannos P., Kechagias K.S., Gal A. Identification of prognostic gene biomarkers in non-small cell lung cancer progression by integrated bioinformatics analysis. Biology. 2021;10:1200. doi: 10.3390/biology10111200.
- 86. Soille P. Morphological Image Analysis: Principles and Applications. Vol. 2. Springer; 1999.
- 87. Cui Zhou D., Jayasinghe R.G., Chen S., Herndon J.M., Iglesia M.D., Navale P., Wendl M.C., Caravan W., Sato K., Storrs E., et al. Spatially restricted drivers and transitional cell populations cooperate with the microenvironment in untreated and chemo-resistant pancreatic cancer. Nat. Genet. 2022;54:1390–1405. doi: 10.1038/s41588-022-01157-1.
- 88. Zhang T.-H., Hasib M.M., Chiu Y.-C., Han Z.-F., Jin Y.-F., Flores M., Chen Y., Huang Y. Transformer for Gene Expression Modeling (T-GEM): An Interpretable Deep Learning Model for Gene Expression-Based Phenotype Predictions. Cancers. 2022;14:4763. doi: 10.3390/cancers14194763.
- 89. Hafemeister C., Satija R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 2019;20:296. doi: 10.1186/s13059-019-1874-1.
- 90. Wolf F.A., Angerer P., Theis F.J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:15. doi: 10.1186/s13059-017-1382-0.
- 91. Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Antiga L. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019;32:8026–8037.
- 92. Harris C.R., Millman K.J., Van Der Walt S.J., Gommers R., Virtanen P., Cournapeau D., Wieser E., Taylor J., Berg S., Smith N.J., et al. Array programming with NumPy. Nature. 2020;585:357–362. doi: 10.1038/s41586-020-2649-2.
- 93. Virshup I., Rybakov S., Theis F.J., Angerer P., Wolf F.A. anndata: Annotated data. Preprint at bioRxiv. 2021. doi: 10.1101/2021.12.16.473007.
- 94. Hao Y., Hao S., Andersen-Nissen E., Mauck W.M., Zheng S., Butler A., Lee M.J., Wilk A.J., Darby C., Zager M., et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184:3573–3587.e29. doi: 10.1016/j.cell.2021.04.048.
- 95. Mansur A. Stitch2D. 2023. https://github.com/adamancer/stitch2d
- 96. Ba J.L., Kiros J.R., Hinton G.E. Layer normalization. Preprint at arXiv. 2016. doi: 10.48550/arXiv.1607.06450.
Data Availability Statement
- This paper analyzes existing data accessible at https://doi.org/10.1038/s41593-020-00787-0, https://support.10xgenomics.com/spatial-gene-expression/datasets/1.0.0/V1_Breast_Cancer_Block_A_Section_1, Gene Expression Omnibus (GEO) through accession number GSE243275, and https://doi.org/10.1038/s41587-022-01483-z. These data are publicly available as of the date of publication.
- All original code has been deposited at GitHub: https://github.com/Karladanielap/TG-ME/ and is publicly available as of the date of publication.
- Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.










