This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2025 Oct 6:2025.10.03.680242. [Version 1] doi: 10.1101/2025.10.03.680242

SpaGene: A Deep Adversarial Framework for Spatial Gene Imputation

Aishwarya Budhkar 1, Juhyung Ha 1, Qianqian Song 2,*, Jing Su 3,*, Xuhong Zhang 1,*
PMCID: PMC12632518  PMID: 41278680

Abstract

Integrating transcriptome-wide single-cell gene expression data with spatial context significantly enhances our understanding of tissue biology, cellular interactions, and disease progression. Although single-cell RNA sequencing (scRNA-seq) provides high-resolution gene expression data, it lacks crucial spatial context, whereas spatial transcriptomics techniques offer spatial resolution but are limited in the transcriptomic coverage. To address these limitations, integrating scRNA-seq and spatial transcriptomics data is essential. We introduce SpaGene, a novel deep learning framework designed to integrate scRNA-seq data and spatial transcriptomics data. SpaGene consists of two encoder-decoder pairs combined with two translators and two discriminators to effectively impute missing gene expressions within spatial transcriptomics datasets. We benchmarked SpaGene against existing state-of-the-art methods across diverse datasets. Across the datasets, SpaGene achieved an average 33% higher Pearson correlation coefficient (PCC), 21% higher Structural similarity index (SSIM), and 6.6% lower Root mean squared error (RMSE) compared to the existing approaches, highlighting its capability to reliably impute missing genes and provide comprehensive transcriptomics profiles. Application of our model to lung tumor tissue revealed immune cell enrichment at tumor boundaries, restricted myeloid cell trafficking in adjacent normal regions, and microenvironmental-driven pathways linked to immune neighborhoods. These results provide novel insight into immune exclusion and tumor-immune interactions that drive tumor progression, highlighting potential avenues for therapeutic development. Thus, SpaGene extends the power of spatial transcriptomics by delivering spatially resolved, enhanced transcriptome data that enable deeper biological understanding.

Keywords: single-cell RNA sequencing, single-cell spatial transcriptomics, cross-modal translation, adversarial learning, trajectory inference, tumor microenvironment

INTRODUCTION

Recent advances in spatial transcriptomics (ST) techniques now make it possible to capture gene expression at the single-cell level while retaining spatial information about cells. For example, using positional barcodes, detected RNA transcripts can be mapped back to their tissue region to retain spatial information1. Understanding transcriptional profiles at the cellular level helps in several ways: it reveals similarities and differences within cell populations, which helps elucidate cellular heterogeneity; it identifies cell development pathways; and it enables the study of rare cell populations such as tumor cells2, 3. The availability of commercial platforms like the Vizgen MERSCOPE platform4 and the NanoString CosMX Spatial Molecular Imager (SMI) platform5 has enabled researchers and clinicians to access high-resolution ST data, facilitating the discovery of novel biological insights3. Spatial information, along with gene expression profiles, helps biologists understand complex cellular relationships and the resulting biological phenomena6. For example, the NanoString platform enables spatial in situ detection of mRNA and proteins at the cellular and subcellular levels using formalin-fixed paraffin-embedded (FFPE) and fresh frozen (FF) tissue samples5. Another imaging-based technique, MERSCOPE, uses Multiplexed Error Robust Fluorescence In Situ Hybridization (MERFISH) to capture the spatial distribution of RNA molecules at the single-cell level. However, different ST techniques have their own limitations, such as less accurate capture of gene expression or limited spatial resolution. For example, imaging-based techniques such as MERFISH7 and osmFISH8 provide single-cell resolution with high accuracy, but they are limited to measuring gene expression for hundreds to a few thousand genes. 
Sequencing-based techniques like Slide-seq9 and 10x Visium10 can detect thousands of genes, but at lower spatial resolution than single-cell techniques and with lower capture efficiency. Similarly, the NanoString CosMX SMI platform5 can profile thousands of genes, but the detection accuracy is low due to limitations of the technology.

Given that current ST technologies face notable limitations, there is a pressing need for computational strategies to enhance the quality and resolution of ST data. Prior to the development of the ST technology, single-cell RNA sequencing (scRNA-seq) emerged as a powerful tool for dissecting cellular heterogeneity and tracing cell lineages2, 3. However, while scRNA-seq (SC) offers detailed molecular profiles, it lacks spatial information, making it difficult to reconstruct the tissue architecture and cell-cell interactions within complex biological systems1. When combined with ST, SC serves as a valuable complementary modality, boosting the accuracy of spatially resolved transcriptomic analyses within individual tissue sections. Several computational methods have been proposed to integrate SC with ST for gene expression imputation to overcome ST’s inherent limitations. For example, SpaGE (Spatial Gene Enhancement)11 employs the PRECISE12 domain adaptation algorithm to align datasets. After alignment, the k-nearest neighbor algorithm13 (k-NN) is used to assign gene expression to spatially unmeasured locations using a weighted average of neighbors with positive cosine similarity. However, SpaGE faces limitations in handling complex datasets due to its reliance on k-NN and the PRECISE12 algorithm, which uses linear dimensionality reduction. gimVI (Gene imputation with Variational Inference)14 utilizes a deep generative model to integrate ST and SC datasets. The method learns a shared latent space of the input datasets and then uses posterior inference for gene imputation. gimVI can struggle to capture nuanced biological variations due to its limitations in latent space complexity and variability modeling. Tangram15 integrates ST and SC data by learning a probabilistic assignment of cells to spatial spots and uses the mapping to impute gene expression at cellular resolution. 
However, its effectiveness depends on shared gene expression patterns between datasets, which can limit performance in highly heterogeneous or novel biological contexts.

In this work, we introduce SpaGene, a novel deep learning method for predicting unmeasured genes in ST data by leveraging information from SC data. SpaGene uses an encoder-decoder architecture supplemented by dedicated translator and discriminator modules to impute missing gene expression within ST data. First, the encoder-decoder modules project each dataset into a low-dimensional latent space and reconstruct it back to the original dimension, learning to capture the distinctive characteristics of each dataset while preserving biologically relevant variation. Then, translators learn mappings between the latent spaces, capturing the features the datasets share and how they relate. Discriminator modules further refine these mappings by encouraging the translated features to resemble the target data distribution. By leveraging the learned features from both datasets, SpaGene imputes expression for unmeasured genes in ST data, enhancing the analytical power of ST. This enables more comprehensive downstream analyses, such as spatial cellular interaction analysis, pathway enrichment analysis, and microenvironment profiling, advancing our ability to extract biological insights from spatial data.

RESULTS

Overview of the SpaGene model

The primary function of the model is to enhance the limited ST data by expanding measured gene sets, significantly enriching the underlying biological information and, thus, facilitating novel biological insights through downstream analysis and interpretation. Figure 1a illustrates the overall objective of SpaGene. SpaGene leverages reference SC data to predict unmeasured genes in ST datasets, resulting in spatial profiles enriched with imputed genes. As shown in Figure 1b, the model comprises three components: encoder-decoder modules, translators, and discriminators. The model uses separate encoder-decoder networks for ST and SC datasets to capture and encode the unique dataset-specific features, mapping each dataset to compact, low-dimensional latent representations. This step is crucial to effectively capture essential characteristics of complex biological datasets and mitigate noise. The translators and discriminators then learn intricate non-linear mappings between the latent spaces of ST and SC datasets. Translators are trained to convert representations from one dataset to another, while discriminators guide the translators to generate realistic outputs. This adversarial framework encourages translated outputs that are biologically plausible and closely resemble the target data distributions. The discriminators enforce stringent quality control on translated representations, helping translators to iteratively refine their outputs towards increased biological relevance. Figure 1c further elaborates on the domain translation process, where ST and SC data are projected into their latent representations and translated across domains. This bidirectional translation using the learned low-dimensional latent space representation helps the model learn shared features between the two datasets. As depicted in Figure 1d during inference, ST data is projected into the learned latent space and translated into the SC domain. 
Finally, the gene expression is reconstructed, enhancing the ST data by expanding the measured gene expression. Thus, leveraging both shared and unique features of both datasets, SpaGene learns to translate ST data into the SC domain. Through adversarial training, SpaGene ensures accurate alignment and imputation.

Figure 1. Overview of the SpaGene model.

(a) SpaGene takes ST data and reference SC data as input to impute unmeasured genes in ST data. (b) SpaGene consists of two encoder-decoder pairs, one each for ST and SC data, to learn the unique features of the datasets, two translators to map the latent representations between datasets, and two discriminators that guide the translators to generate realistic, biologically meaningful outputs. (c) Data are first embedded into a low-dimensional latent representation, followed by translation to the target dataset and back to the source to learn the shared features across datasets. (d) Missing gene expression in ST is imputed by translating the latent representation to the SC domain and decoding it to reconstruct the gene expression.

SpaGene outperforms existing methods across diverse datasets

To assess SpaGene’s ability to impute missing gene expression, we benchmarked it against leading data imputation methods, gimVI14, SpaGE11, and Tangram15, across seven diverse dataset pairs: MERFISH7_Moffitt7, NanoString5_GSE16, osmFISH8_AllenSSp17, osmFISH_AllenVISp18, osmFISH_Zeisel19, seqFISH20_AllenVISp, and STARmap21_AllenVISp, detailed in the Methods section. Performance was evaluated using three metrics: Pearson correlation coefficient22 (PCC), structural similarity index22 (SSIM), and root mean square error22 (RMSE). Across all seven benchmarks, SpaGene consistently outperforms SpaGE, gimVI, and Tangram by capturing non-linear relationships and reconstructing smoother, more accurate gene expression patterns.
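As a rough illustration of the three metrics, the sketch below computes PCC, RMSE, and a single-window SSIM for one gene's measured and imputed expression vectors. It is a minimal sketch rather than the authors' evaluation code: the function names are hypothetical, and the SSIM variant (global statistics on vectors rescaled to [0, 1], with the conventional constants 0.01² and 0.03²) is an assumption, since the text does not specify how SSIM was computed.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when either vector is constant
    return cov / math.sqrt(vx * vy)

def rmse(x, y):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM on vectors rescaled to [0, 1] (assumed variant)."""
    def rescale(v):
        lo, hi = min(v), max(v)
        return [0.0] * len(v) if hi == lo else [(a - lo) / (hi - lo) for a in v]

    x, y = rescale(x), rescale(y)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

# Toy example: measured vs. imputed expression of one gene across five cells.
measured = [0.0, 1.0, 2.0, 3.0, 4.0]
imputed = [0.1, 0.9, 2.2, 2.8, 4.1]
print(pearson(measured, imputed), global_ssim(measured, imputed), rmse(measured, imputed))
```

Higher PCC and SSIM and lower RMSE indicate better agreement. In practice these would typically be computed per gene with NumPy or SciPy and then averaged.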

Figure 2a shows the PCC results for each method across the seven datasets. SpaGene yields stronger agreement with ground-truth spatial measurements. For example, on the imaging-based MERFISH_Moffitt dataset pair, SpaGene achieved an average PCC of 0.3966, outperforming SpaGE (PCC = 0.3447), gimVI (PCC = 0.2418), and Tangram (PCC = 0.2727). Similarly, on the Nanostring_GSE dataset pair, SpaGene achieved an average PCC of 0.2456, 67.1% higher than SpaGE (PCC = 0.1469), 48.7% higher than gimVI (PCC = 0.1652), and 35.2% higher than Tangram (PCC = 0.1817). Thus, SpaGene produces stronger alignment between predicted and measured expression profiles of individual genes, reflecting its capacity to learn complex, non-linear mappings that faithfully reconstruct individual gene expression in spatial context. Figure 2b shows that SpaGene achieves higher SSIM scores than other methods across all seven datasets. This demonstrates that it produces a more precise recovery of spatial features, such as localized hotspots, than competing methods. For example, for the osmFISH_AllenVISp dataset pair, SpaGene achieved an average SSIM of 0.3821, better than SpaGE (average SSIM = 0.2268), gimVI (average SSIM = 0.3235), and Tangram (average SSIM = 0.2528). Similarly, for the osmFISH_Zeisel dataset pair, SpaGene achieved an average SSIM of 0.4127, 30.6% higher than SpaGE (SSIM = 0.3159), 37.9% higher than gimVI (SSIM = 0.2992), and 20.1% higher than Tangram (SSIM = 0.3434). These results show that SpaGene not only captures gene expression more accurately but also better preserves spatial architectures underlying tissue organization. Figure 2c presents RMSE values across all seven datasets. SpaGene consistently reduces the average discrepancy between predicted and measured gene expression intensities across all datasets. 
For example, for the seqFISH_AllenVISp dataset pair, SpaGene achieved an average RMSE of 1.1613, better than SpaGE (RMSE = 1.2435), gimVI (RMSE = 1.2570), and Tangram (RMSE = 1.2235). Similarly, for the STARmap_AllenVISp dataset pair, SpaGene achieved an average RMSE of 1.2496, lower than SpaGE (average RMSE = 1.3083), gimVI (average RMSE = 1.2763), and Tangram (average RMSE = 1.2773). Lower reconstruction error highlights SpaGene’s ability to minimize systematic and random deviations, yielding intensity predictions that closely align with observed data.

Figure 2. Performance evaluation across datasets.

(a) Box plot of PCC scores across seven datasets for each method. (b) Box plot of SSIM scores across seven datasets for each method. (c) Box plot of RMSE scores across seven datasets for each method. (d) Box plot of average PCC, SSIM, and RMSE scores for the seven datasets for each method.

Across all seven datasets (Figure 2d), SpaGene achieved an average PCC of 0.2782, which is 36% higher than SpaGE (PCC = 0.2045), 34% higher than gimVI (PCC = 0.2076), and 30.7% higher than Tangram (PCC = 0.2129). Its average SSIM of 0.3902 was 23.6% higher than SpaGE (SSIM = 0.3158), 16.8% higher than gimVI (SSIM = 0.3342), and 22.7% higher than Tangram (SSIM = 0.3180). Its average RMSE of 1.1898 was 10.1% lower than SpaGE (RMSE = 1.3238), 4.97% lower than gimVI (RMSE = 1.2520), and 4.6% lower than Tangram (RMSE = 1.2472). Collectively, these results demonstrate SpaGene’s robust performance across diverse datasets, ranging from large-scale imaging-based datasets like MERFISH, which profile thousands of cells with large gene panels, to targeted osmFISH assays, where spatial measurements are limited to a small gene panel. Thus, the model’s architecture can handle both high- and low-dimensional data. This adaptability suggests that SpaGene’s underlying mechanisms do not depend on a particular technology or gene count but exploit underlying patterns in spatial and single-cell data to generate reliable imputations. Overall, the SpaGene framework produces more accurate, visually coherent, and quantitatively reliable gene imputations that reflect true underlying biological mechanisms.
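The reported percentages are consistent with a simple relative change of SpaGene's average score against each baseline's average. The short sketch below reproduces a few of them from the rounded values quoted in the text. The helper name is hypothetical, and the formula is an assumption, because the text does not state how the percentages were derived.

```python
def relative_change_pct(ours, baseline):
    """Relative change of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline

# Average scores across the seven dataset pairs, as quoted in the text.
spagene_pcc, spagene_rmse = 0.2782, 1.1898
baseline_pcc = {"SpaGE": 0.2045, "gimVI": 0.2076, "Tangram": 0.2129}
baseline_rmse = {"SpaGE": 1.3238, "gimVI": 1.2520, "Tangram": 1.2472}

for name in baseline_pcc:
    pcc_gain = relative_change_pct(spagene_pcc, baseline_pcc[name])
    rmse_change = relative_change_pct(spagene_rmse, baseline_rmse[name])
    print(f"{name}: PCC {pcc_gain:+.1f}%, RMSE {rmse_change:+.1f}%")
```

A positive value means SpaGene's score is higher than the baseline, so for RMSE a negative value is the improvement.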

SpaGene demonstrates superior performance on the NanoString Lung9 rep1 dataset

We applied SpaGene to integrate the NanoString Lung9 rep1 dataset with reference SC data to predict unmeasured spatial gene expression patterns, enriching the ST data. To rigorously evaluate the predictive performance of SpaGene, we conducted a five-fold cross-validation, detailed in the Methods section. In each fold, a subset of genes was held out, and the remaining genes were used to impute the spatial expression of the omitted genes. We compared our model’s performance with SpaGE, gimVI, and Tangram using PCC between measured spatial gene expression and predicted values. Figure 3a shows the PCC results for the NanoString Lung9 rep1 tissue sample, highlighting the superiority of SpaGene. SpaGene achieved an average PCC of 0.2456, significantly higher than SpaGE (average PCC = 0.1470), gimVI (average PCC = 0.1652), and Tangram (average PCC = 0.1817). We further conducted gene-level comparisons to provide detailed insights into the method performance. Figures 3b-d show scatter plots comparing gene-wise PCC values between SpaGene and each competing method. We observed that the majority of the data points lie above the y = x line, demonstrating that SpaGene performs better than the competitors.
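The gene-wise cross-validation described above can be sketched as follows: the genes shared by the ST and SC data are split into five disjoint groups, and each group in turn is hidden from the model and then imputed. This is a minimal sketch under assumptions; the gene names, the random seed, and the shuffling strategy are hypothetical, and the exact procedure is the one given in the Methods section.

```python
import random

def gene_folds(genes, n_folds=5, seed=0):
    """Shuffle genes and split them into n_folds disjoint, near-equal groups."""
    shuffled = list(genes)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

# Hypothetical set of genes measured in both the ST and SC datasets.
shared_genes = [f"gene_{i}" for i in range(12)]

for fold, held_out in enumerate(gene_folds(shared_genes)):
    held = set(held_out)
    train_genes = [g for g in shared_genes if g not in held]
    # An imputation model would be fit using train_genes only, then its
    # predictions for held_out would be compared with the measured ST values.
    print(f"fold {fold}: train on {len(train_genes)} genes, impute {sorted(held)}")
```

Because every gene is held out exactly once, each gene receives one out-of-sample prediction that can be scored against its measured spatial expression.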

Figure 3. Performance evaluation on the NanoString Lung9 rep1 sample.

(a) Box plot of PCC for each method. (b) Scatter plot of PCC values for each imputed gene between SpaGene versus SpaGE. (c) Scatter plot of PCC values for each imputed gene between SpaGene versus gimVI. (d) Scatter plot of PCC values for each imputed gene between SpaGene versus Tangram. (e) Spatial patterns of measured and imputed genes STMN1, NDRG1, and CD163 for each method.

Furthermore, we visually assessed the spatial expression patterns of selected imputed genes with complex spatial patterns. Figure 3e illustrates measured and imputed spatial patterns of STMN1, NDRG1, and CD163 genes in the NanoString Lung9 rep1 sample. Compared with SpaGE, gimVI, and Tangram, SpaGene consistently produced more spatially coherent patterns that closely matched the measured patterns. For example, SpaGene accurately captured the intricate spatial heterogeneity of NDRG1 and CD163, clearly outperforming the competitors. SpaGE produced a noisy pattern for STMN1 and failed to capture the complex spatial distribution for NDRG1 and CD163. gimVI overly smoothed the expression pattern for STMN1, and Tangram failed to reproduce the spatial expression pattern for NDRG1. SpaGene's accurate reconstruction of these distinct spatial expression patterns suggests reliable performance.

SpaGene demonstrates superior performance on the STARmap dataset

Similarly, we evaluated the performance of SpaGene on the STARmap_AllenVISp dataset in predicting unmeasured spatial gene expression patterns. To assess the predictive accuracy, we used five-fold cross-validation and quantified the performance using the PCC metric. Figure 4a summarizes the PCC results for the STARmap data, where SpaGene achieved an average PCC of 0.2108, outperforming SpaGE (PCC = 0.1386), gimVI (PCC = 0.1798), and Tangram (PCC = 0.1786). This marked improvement highlights the strong predictive power of SpaGene. To examine gene-level performance, Figures 4b-4d present scatter plots comparing SpaGene’s PCC values with those of the other methods. For every competitor, more data points lie above the y = x line than below it, which demonstrates that SpaGene yields higher PCCs for a large proportion of genes.

Figure 4. Performance evaluation on the STARmap data.

(a) Box plot of PCC for each method. (b) Scatter plot of PCC values for each imputed gene between SpaGene versus SpaGE. (c) Scatter plot of PCC values for each imputed gene between SpaGene versus gimVI. (d) Scatter plot of PCC values for each imputed gene between SpaGene versus Tangram. (e) Spatial patterns of measured and imputed CAMK2N1, PLP1, and ITM2A genes for each method.

In addition to quantitative performance, we evaluated the spatial fidelity of imputed gene expression patterns. Figure 4e displays spatial patterns of three representative imputed genes, CAMK2N1, PLP1, and ITM2A, by showing the ground truth STARmap measurements alongside predicted expression patterns for SpaGene, SpaGE, gimVI, and Tangram. We observed that gene expression imputed by SpaGene more closely replicates the measured spatial distributions compared to the competitors. For example, for CAMK2N1, SpaGene captures spatial heterogeneity of the expression pattern better than gimVI, which performs reasonably well but suppresses some expression patterns. SpaGE and Tangram lose the spatial heterogeneity of the expression pattern in some regions. For PLP1, SpaGene closely matches the measured spatial pattern, Tangram captures the pattern reasonably well, whereas gimVI and SpaGE fail to do that. For ITM2A, SpaGene shows a consistent spatial pattern with measured expression, while others cannot accurately reproduce the pattern. Together, these visual and quantitative results establish that SpaGene delivers more accurate and biologically realistic spatial imputation compared to existing methods.

SpaGene leverages imputed genes to uncover novel biological insights

We investigated how the local immune microenvironment influences tumor cell states in the NanoString Lung9 rep1 sample using imputed gene expression profiles with spatial neighborhood-based analysis. First, we defined spatial niches using Seurat v523 based on each cell’s local neighborhood composition using k-nearest neighbors. Five major niches in the microenvironment were identified: tumor, myeloid, fibroblast, neutrophil, and lymphocyte. Figure 5a displays the spatial plot of cells with cell types and identified spatial niches that capture functionally distinct regions within the tissue.
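To make the idea of a cell's local neighborhood composition concrete, the sketch below computes, for each cell, the fraction of each cell type among its k nearest spatial neighbors. It is a simplified, brute-force illustration and not the Seurat v5 routine used in the analysis: the coordinates, labels, and value of k are hypothetical, and the subsequent step of grouping cells with similar compositions into niches is omitted.

```python
import math
from collections import Counter

def neighborhood_composition(coords, labels, k=2):
    """For each cell, the fraction of each cell type among its k nearest neighbors."""
    cell_types = sorted(set(labels))
    compositions = []
    for i, point in enumerate(coords):
        # Brute-force Euclidean distances to every other cell.
        dists = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(coords)
            if j != i
        )
        nearest = dists[:k]
        counts = Counter(labels[j] for _, j in nearest)
        compositions.append({t: counts[t] / len(nearest) for t in cell_types})
    return compositions

# Hypothetical toy tissue: a small tumor/lymphocyte cluster and a myeloid cluster.
coords = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ["tumor", "tumor", "lymphocyte", "myeloid", "myeloid", "myeloid"]

for label, comp in zip(labels, neighborhood_composition(coords, labels)):
    print(label, comp)
```

The same per-cell proportions are the kind of quantity that can be used to group tumor niche cells by their lymphocyte or myeloid neighbor proportion. For real datasets a spatial index such as a k-d tree would replace the quadratic distance loop.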

Figure 5. Downstream analysis using imputed data.

(a) Spatial niche analysis. Spatial plot of cells colored by cell type, Spatial plot of cells colored by spatial niche: tumor, myeloid, fibroblast, neutrophil, and lymphocyte. (b) Pseudotime and immune neighbor association among tumor niche cells. UMAP projections for tumor niche cells colored by the inferred pseudotime (ptime), Ridge density plots showing distribution of tumor niche cells along pseudotime, grouped based on their lymphocyte and myeloid cell neighbor proportion (none (0), low (>0–0.2), moderate (0.2–0.6), high (0.6–1.0)). (c) Immune neighborhood for tumor niche cells. Spatial plot of cells colored by proportion of lymphocyte in neighborhood (left) and proportion of myeloid in neighborhood (right) for tumor niche cells with immune cells of interest highlighted in red and other cells in gray. Representative fields of view (FOVs) for lymphocyte (FOVs 2, 13, 17) and myeloid (FOVs 2, 7, 19) neighbor proportions for tumor niche cells colored from high (yellow) to low (purple), immune cells in red, and other cells in grey. (d) Pathway enrichment comparisons. Scatter plots of pathway enrichment significance (–log10 adj p-value) detected using imputed versus raw expression data for lymphocyte (left two panels) and myeloid (right two panels): imputed original (only measured genes in raw data) versus raw (first column) and imputed all (both measured genes in raw data and newly imputed genes) versus raw (second column). Points in scatter plots denote pathways colored by annotation – Immune (red), Migration/Attachment (blue), Other (gray), and the dashed line indicates the y=x line.

To study tumor cell progression, we applied SpaTrack24 to infer pseudo-temporal trajectories within tumor niche cells. Figure 5b presents UMAP embeddings of tumor niche cells colored by inferred pseudotime and ridge density plots depicting distributions for these cells along pseudotime, grouped by their lymphocyte and myeloid cell neighbor proportion (none (0), low (>0–0.2), moderate (0.2–0.6), high (0.6–1.0)). The ridge density plots reveal that tumor niche cells at later pseudotime are enriched in immune-rich neighborhoods. These patterns suggest that immune infiltration is associated with tumor state transitions25, 26. To spatially validate these patterns, we visualize the spatial plot of tumor niche cells in Figure 5c. Tumor niche cells are colored by their lymphocyte and myeloid neighbor proportion, with immune cells of interest highlighted in red and all other cells shown in gray. Lymphocyte-rich neighborhoods are seen to be localized near tumor boundaries, while myeloid-rich neighborhoods are predominantly found in supporting tissue around the tumor. Representative fields of view (FOVs) illustrate immune enrichment at the tumor-normal interface and show restricted deeper immune entry into tumor regions, with myeloid cells accumulating in the normal tissue around the tumor.

Next, we investigate molecular pathways associated with immune infiltration. Local immune cell abundance can drive transcriptional changes in neighboring cells27, 28. Therefore, we used the proportion of local immune neighbors for tumor niche cells to select genes for pathway analysis to capture both cell-specific and microenvironmentally regulated pathways that might not be captured by cell-type information alone. Specifically, we computed PCC between gene expression and local immune neighbor proportions (lymphocyte and myeloid) for each tumor niche cell. Genes with moderate, biologically meaningful correlations were selected independently from three data subsets: raw gene expressions, imputed gene expressions with genes present in raw data, and all imputed gene expressions. Using genes selected from each data subset, we performed pathway enrichment using Reactome and Gene Ontology (GO) gene sets from the MSigDB database collection29, 30. We identified significant pathways and combined the top 20 enriched pathways for each data subset, and manually annotated them as Immune, Cell Migration/Attachment related, or Other. Figure 5d shows scatter plots comparing enrichment significance (–log10 adj p-value) between raw and imputed data. We observed that using imputed expression increased the sensitivity for the detection of Immune and Cell Migration/Attachment related pathways compared with raw expression. Moreover, inclusion of newly imputed genes increased the significance of detected pathways compared with using imputed expression only for genes already measured in raw data. This demonstrates that imputed expression uncovers microenvironment-driven biological processes missed in raw data and improves the pathway detection sensitivity.

These results demonstrate how imputation enhances ST data, offering insights into tumor trajectory mapping, spatial immune tumor organization, and improved detection of biological pathways, in turn advancing our understanding of the complex tumor microenvironment.

DISCUSSION

Spatial single-cell transcriptomics data plays an important role in understanding complex tissue structures and functions by revealing spatially resolved gene expression at single-cell resolution. However, only a limited subset of genes is typically captured, constraining comprehensive biological interpretation. Accurate gene imputation using single-cell reference data can address this limitation and aid in advancing spatial biology, enabling novel insights into cellular heterogeneity, interactions, and underlying tissue mechanisms. SpaGene addresses this critical need, serving as an advanced computational approach to enhance ST datasets, empowering researchers to gain novel biological insights.

The primary objective of SpaGene is to enhance the ST data by expanding gene coverage beyond the assayed set through effective integration with reference SC data. SpaGene imputes expression for unmeasured genes in ST data, significantly enhancing the biological signal and thereby facilitating downstream analysis to discover novel insights. The framework consists of two encoder-decoder networks, two translators, and two discriminators. Separate encoders learn robust latent space representations from ST and SC data. Two translators are trained to translate data between domains. To enhance ST data, its latent representation is translated into the SC domain by a dedicated translator module. The translated representation is then used to reconstruct comprehensive gene expression profiles using the SC decoder module. SpaGene’s superior performance stems from its adversarial framework that effectively captures shared information by learning non-linear mappings between source and target domains. By learning and leveraging complex data distributions, SpaGene improves the accuracy and reliability of gene imputation in spatial datasets, yielding comprehensive biologically meaningful transcriptome profiles. A key advantage of SpaGene is its ability to effectively capture complex biological signals and model the non-linear characteristics of omics datasets. This capacity is particularly beneficial for understanding spatial heterogeneity, intercellular communication, and functional cellular states. Thus, SpaGene stands out as a valuable tool for elucidating biological complexity at the spatial level.

Biologically, SpaGene provides a more comprehensive view of the tumor immune microenvironment using the expanded spatial transcriptomic profile, revealing cellular interactions and biological pathways underrepresented in raw data. In the NanoString Lung9 rep1 sample, the enriched transcriptome enabled detection of spatial patterns such as lymphocytes clustering at tumor boundaries and myeloid cells blocked at the boundary between tumor and normal region. The expanded gene coverage highlighted associations between the immune neighborhood and tumor cell state transitions, offering insights into how immune pressure influences tumor evolution. Moreover, the model improves detection of microenvironment-driven pathways, particularly immune and cell migration/ attachment pathways, relative to the raw data. These findings indicate that SpaGene utilizes the expanded transcriptome to provide insights into tumor evolution and to improve the identification of biologically relevant pathways that can be used to develop future therapeutic strategies.

While SpaGene has demonstrated superior performance in integrating ST and SC datasets, several directions remain for future research and development. One promising area of improvement involves the integration of additional data modalities such as imaging, proteomics, epigenomics, or metabolomics. Incorporating such multi-modal datasets would allow a more holistic understanding of tissue morphology and underlying biological processes, and could aid in the discovery of novel pathways and cellular interactions. Another essential avenue for future development is enhancing model interpretability through advanced explainability techniques31. Implementing attention frameworks32, integrated gradients33, layer-wise relevance propagation34, and related approaches could increase trust in model outputs. Greater transparency into how predictions are made could lead to the discovery of novel biomarkers and deeper insights into cellular mechanisms31, enabling more effective translation of computational findings into biological knowledge. Overall, SpaGene is an advanced tool for spatial transcriptomics, significantly improving gene imputation and integration capabilities to drive forward research in spatial biology.

MATERIALS AND METHODS

Data processing

The following ST and SC dataset pairs were used for performance evaluation: MERFISH_Moffitt, NanoString_GSE, osmFISH_AllenSSp, osmFISH_AllenVISp, osmFISH_Zeisel, seqFISH_AllenVISp, and STARmap_AllenVISp. For both ST and SC datasets, genes with density less than 0.05 and cells with density less than 0.1 were filtered out to reduce noise and computational load. Next, 2000 highly variable genes were selected using Scanpy35 v3. To reduce the influence of extreme values, values exceeding the mean plus two standard deviations for each feature were clipped to that threshold. Finally, we applied square root normalization, replacing each expression value with its square root, to reduce data skewness.
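The filtering, clipping, and square-root steps can be sketched in a few lines. This is an illustrative sketch under several assumptions that the text does not spell out: "density" is taken to mean the fraction of non-zero entries, genes are filtered before cells, the population standard deviation is used, and the highly variable gene selection step is omitted. The function name and toy matrix are hypothetical.

```python
import math
import statistics

def preprocess(matrix, gene_min_density=0.05, cell_min_density=0.1):
    """matrix: list of cells, each a list of non-negative expression values per gene.

    Returns (processed rows, indices of the retained genes).
    """
    n_cells, n_genes = len(matrix), len(matrix[0])
    # Keep genes that are non-zero in at least gene_min_density of the cells.
    keep_genes = [
        g for g in range(n_genes)
        if sum(row[g] > 0 for row in matrix) / n_cells >= gene_min_density
    ]
    if not keep_genes:
        return [], []
    rows = [[row[g] for g in keep_genes] for row in matrix]
    # Keep cells with at least cell_min_density non-zero values among kept genes.
    rows = [r for r in rows if sum(v > 0 for v in r) / len(r) >= cell_min_density]
    for g in range(len(keep_genes)):
        column = [r[g] for r in rows]
        # Clip values above mean + 2 SD to that cap, then take the square root.
        cap = statistics.fmean(column) + 2 * statistics.pstdev(column)
        for r in rows:
            r[g] = math.sqrt(min(r[g], cap))
    return rows, keep_genes

# Toy matrix: gene 1 is never detected, the last cell is empty, and one cell
# carries an extreme value for gene 0 that gets clipped before the square root.
toy = [[1, 0, 4]] * 9 + [[100, 0, 4], [0, 0, 0]]
processed, kept = preprocess(toy)
print(kept, len(processed), processed[0], processed[-1])
```

Clipping before the square root limits the leverage of a few very bright cells, and the square root then compresses the remaining right tail of the distribution.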

The SpaGene model

Our model consists of two Encoder-Decoder pairs, two Translators, and two Discriminators. Each Encoder-Decoder pair follows the AutoEncoder36 framework, which uses an encoder function to reduce the original data features into a low-dimensional feature space, and a decoder function to reconstruct the data from that low-dimensional feature space. The encoder is a multi-layer fully connected neural network with ReLU37 non-linear activations that compresses the given gene expression into a latent space representation. The decoder is a neural network that reconstructs the original gene expression from the latent space representation through successive nonlinear transformations. Translators facilitate domain adaptation by translating the latent space representation between the ST and SC domains. Discriminators act as binary classifiers for distinguishing whether a latent space representation originates from the source domain or was generated by a translator. The trained discriminator provides adversarial loss38 that guides the translators to generate more realistic outputs.

Encoder-Decoder

SpaGene consists of two Encoder-Decoder modules, one for the ST and one for the SC domain. Each module comprises an encoder $E$ that maps an input expression vector $x$ to a low-dimensional latent representation $z$ and a decoder $D$ that maps $z$ back to a reconstruction $\hat{x}$. When applied to the ST and SC domains, the encoders produce $z_{ST} = E_{ST}(x_{ST})$ and $z_{SC} = E_{SC}(x_{SC})$, and the decoders reconstruct $\hat{x}_{ST} = D_{ST}(z_{ST})$ and $\hat{x}_{SC} = D_{SC}(z_{SC})$. To train the Encoder-Decoder, the reconstruction error is minimized as follows:

$$\arg\min_{E, D} \mathrm{MSE}(D(E(x)), x)$$

where MSE(x,y) stands for the mean-squared-error loss.

Both domain-specific modules are trained independently to minimize reconstruction error. Once the Encoder-Decoders are trained, their weights are frozen.
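A forward pass through one Encoder-Decoder pair and its reconstruction objective can be sketched as below. The layer sizes, the random weights, and the helper names are hypothetical, and no gradient step is shown; the sketch only illustrates which quantity is minimized over the encoder and decoder.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def init_mlp(sizes, rng):
    """Random weights for a fully connected network with the given layer sizes."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """Apply each layer, with ReLU after every layer except the last."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = relu(x)
    return x

def mse(a, b):
    """Mean-squared error over all entries."""
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
n_genes, latent_dim = 50, 8                          # hypothetical sizes
encoder = init_mlp([n_genes, 32, latent_dim], rng)   # E: x -> z
decoder = init_mlp([latent_dim, 32, n_genes], rng)   # D: z -> x_hat

x = rng.random((16, n_genes))     # 16 toy cells
z = forward(x, encoder)           # latent representation
x_hat = forward(z, decoder)       # reconstruction
recon_loss = mse(x_hat, x)        # quantity minimized over E and D
```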

Translator

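The model comprises two directional translators: $T_{ST \to SC}$, which maps the ST latent representation $z_{ST}$ to the SC latent space as $z_{ST \to SC}$, and $T_{SC \to ST}$, which maps the SC latent representation $z_{SC}$ to the ST latent space as $z_{SC \to ST}$. Here,

$$z_{ST \to SC} = T_{ST \to SC}(z_{ST}), \quad z_{SC \to ST} = T_{SC \to ST}(z_{SC})$$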

The objective of these translators is to generate realistic latent representations in the translated domains.

Discriminator

We employ two discriminators: $C_{ST}$, which classifies whether a latent representation comes from real ST data or was generated by the translator $T_{SC \to ST}$, and $C_{SC}$, which distinguishes real SC embeddings from those produced by the translator $T_{ST \to SC}$. Discriminators provide feedback that guides translators to generate more realistic representations through adversarial learning.

Loss function

Multiple loss functions are employed, including cycle loss39, identity (ID) loss40, CORAL (CORrelation ALignment) loss41, MMD loss42, and GAN loss40.

Cycle loss:

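This loss ensures that translating a latent representation to the other domain and back to the original domain preserves the original gene expression. For the ST domain, the MSE between the original gene expression $x_{ST}$ and the cycled gene expression $x_{ST}^{cyc} = D_{ST}(T_{SC \to ST}(T_{ST \to SC}(z_{ST})))$ is minimized:

$$\mathrm{MSE}(x_{ST}^{cyc}, x_{ST})$$

where MSE(x,y) is the mean-squared-error loss function. The SC domain is treated symmetrically. Thus, the cycle loss is defined as:

$$\mathcal{L}_{cyc} = \mathrm{MSE}\left(D_{ST}(T_{SC \to ST}(T_{ST \to SC}(z_{ST}))), x_{ST}\right) + \mathrm{MSE}\left(D_{SC}(T_{ST \to SC}(T_{SC \to ST}(z_{SC}))), x_{SC}\right)$$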
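The cycle loss can be sketched in NumPy with the translators and decoders passed in as plain callables. Everything in the toy check is a hypothetical stand-in: linear maps replace the trained networks, and both latent spaces are assumed to share one dimension. When the two translators are exact inverses and the decoders reconstruct perfectly, the loss should vanish.

```python
import numpy as np

def mse(a, b):
    """Mean-squared error over all entries."""
    return float(np.mean((a - b) ** 2))

def cycle_loss(x_st, x_sc, z_st, z_sc, t_st2sc, t_sc2st, d_st, d_sc):
    """Translate each latent code to the other domain and back, decode,
    and compare with the original expression."""
    x_st_cyc = d_st(t_sc2st(t_st2sc(z_st)))
    x_sc_cyc = d_sc(t_st2sc(t_sc2st(z_sc)))
    return mse(x_st_cyc, x_st) + mse(x_sc_cyc, x_sc)

# Toy check with linear stand-ins (hypothetical, not trained networks).
rng = np.random.default_rng(0)
latent, g_st, g_sc = 4, 6, 10
a = np.eye(latent) + 0.1 * rng.normal(size=(latent, latent))
a_inv = np.linalg.inv(a)
w_st = rng.normal(size=(latent, g_st))
w_sc = rng.normal(size=(latent, g_sc))
z_st = rng.normal(size=(20, latent))
z_sc = rng.normal(size=(30, latent))
x_st, x_sc = z_st @ w_st, z_sc @ w_sc   # decoders reconstruct perfectly

t_st2sc = lambda z: z @ a
t_sc2st = lambda z: z @ a_inv
d_st = lambda z: z @ w_st
d_sc = lambda z: z @ w_sc

# Inverse translators: the round trip returns the original latent code.
loss_inverse = cycle_loss(x_st, x_sc, z_st, z_sc, t_st2sc, t_sc2st, d_st, d_sc)
# Mismatched translators: the round trip does not return to the start.
loss_mismatch = cycle_loss(x_st, x_sc, z_st, z_sc, t_st2sc, t_st2sc, d_st, d_sc)
```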

For identity (ID) loss, Pearson correlation between the original gene expression and the translated gene expression is used. Since the dimensions of the translated gene expression and original gene expression are different, backpropagation is only applied for shared genes that exist in both domains. By minimizing this loss, the translator learns to generate gene expressions that align closely with the target domain values. Thus, in our model, the ID loss is defined as:

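$$\mathcal{L}_{ID} = \left(1 - \rho\left(D_{SC}(z_{ST \to SC}), x_{ST}\right)\right) + \left(1 - \rho\left(D_{ST}(z_{SC \to ST}), x_{SC}\right)\right)$$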

where ρ is the Pearson correlation computed over the shared gene set.
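A sketch of a correlation-based ID loss restricted to shared genes is given below. It assumes the loss takes the form of one minus the Pearson correlation per direction, and that the correlation is computed over the flattened cells-by-shared-genes submatrix; how the correlation is aggregated across cells and genes is not specified in the text. The index lists and array names are hypothetical.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two arrays, computed over all entries."""
    a = np.ravel(a) - np.mean(a)
    b = np.ravel(b) - np.mean(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def id_loss(x_st, x_sc, x_st2sc, x_sc2st, shared_st, shared_sc):
    """x_st2sc: ST cells decoded in SC gene space (cells_st x genes_sc).
    x_sc2st: SC cells decoded in ST gene space (cells_sc x genes_st).
    shared_st / shared_sc: column indices of the shared genes in each domain,
    listed in the same gene order."""
    term_st = 1.0 - pearson(x_st2sc[:, shared_sc], x_st[:, shared_st])
    term_sc = 1.0 - pearson(x_sc2st[:, shared_st], x_sc[:, shared_sc])
    return term_st + term_sc

rng = np.random.default_rng(0)
x_st = rng.random((5, 3))     # ST: 5 cells, 3 genes
x_sc = rng.random((6, 5))     # SC: 6 cells, 5 genes
shared_st = [0, 1, 2]         # shared genes as ST columns
shared_sc = [1, 3, 4]         # the same genes as SC columns

# A translation that is perfect on the shared genes gives a loss near zero.
x_st2sc = rng.random((5, 5))
x_st2sc[:, shared_sc] = x_st[:, shared_st]
x_sc2st = x_sc[:, shared_sc].copy()
loss_perfect = id_loss(x_st, x_sc, x_st2sc, x_sc2st, shared_st, shared_sc)
```

Only the shared columns enter the loss, so the values decoded for non-shared genes do not affect it.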

CORAL loss:

This loss aligns the covariance between the ST and SC domains. The objective is to improve domain alignment in the latent space. CORAL loss minimizes the squared Frobenius norm of the difference between the covariance matrices of the two domains, as follows:

$$\mathcal{L}_{CORAL} = \frac{1}{4d^2}\left\|\mathrm{Cov}(z_{SC \to ST}) - \mathrm{Cov}(z_{ST})\right\|_F^2 + \frac{1}{4d^2}\left\|\mathrm{Cov}(z_{ST \to SC}) - \mathrm{Cov}(z_{SC})\right\|_F^2$$

where $\|\cdot\|_F$ is the Frobenius norm, $\mathrm{Cov}(\cdot)$ denotes covariance, and $d$ represents the number of features in the matrices to be aligned.
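The CORAL term can be sketched in NumPy as below. One assumption to flag: `np.cov` uses the unbiased (N - 1) normalization by default, and the text does not say which normalization the authors use. The function names are hypothetical.

```python
import numpy as np

def coral_term(z_a, z_b):
    """Squared Frobenius distance between feature covariances, scaled by 1/(4 d^2).

    z_a, z_b: (samples x d) latent matrices; sample counts may differ.
    """
    d = z_a.shape[1]
    cov_a = np.cov(z_a, rowvar=False)   # d x d covariance of the features
    cov_b = np.cov(z_b, rowvar=False)
    return float(np.sum((cov_a - cov_b) ** 2) / (4.0 * d * d))

def coral_loss(z_sc2st, z_st, z_st2sc, z_sc):
    """Sum of the two directional terms."""
    return coral_term(z_sc2st, z_st) + coral_term(z_st2sc, z_sc)

rng = np.random.default_rng(0)
z = rng.normal(size=(100, 4))
same = coral_term(z, z)          # identical covariances
scaled = coral_term(z, 2.0 * z)  # covariance of 2z is four times that of z
```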

MMD (Maximum mean discrepancy) loss:

This loss measures the distributional distance between latent space representations of the ST and SC domains. By minimizing the MMD loss, the translator aligns the distributions of the two domains in latent space to reduce domain shifts. The MMD loss for two distributions $P$, $Q$ is given by:

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{a, a' \sim P}[K(a, a')] + \mathbb{E}_{b, b' \sim Q}[K(b, b')] - 2\,\mathbb{E}_{a \sim P, b \sim Q}[K(a, b)]$$

where $\mathbb{E}_{a, a' \sim P}[K(a, a')]$ is the expectation of the kernel function between two independent draws $a$, $a'$ from $P$, $\mathbb{E}_{b, b' \sim Q}[K(b, b')]$ is the expectation of the kernel function between two independent draws $b$, $b'$ from $Q$, $\mathbb{E}_{a \sim P, b \sim Q}[K(a, b)]$ is the expectation of the kernel function between independent draws from $P$ and $Q$, and $K$ is a Gaussian kernel. Thus, the MMD loss is defined as:

$$\mathcal{L}_{MMD} = \mathrm{MMD}^2(z_{SC \to ST}, z_{ST}) + \mathrm{MMD}^2(z_{ST \to SC}, z_{SC})$$
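A sample-based estimate of the squared MMD with a Gaussian kernel can be sketched as follows. The sketch uses the simple biased estimator, which averages over all pairs including each sample paired with itself, and a single fixed bandwidth `sigma`; both are assumptions, since the estimator and bandwidth used in SpaGene are not stated here.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Pairwise Gaussian kernel matrix between rows of a and rows of b."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def mmd2(p, q, sigma=1.0):
    """Biased estimate of MMD^2 between samples p and q (rows are samples)."""
    return float(gaussian_kernel(p, p, sigma).mean()
                 + gaussian_kernel(q, q, sigma).mean()
                 - 2.0 * gaussian_kernel(p, q, sigma).mean())

def mmd_loss(z_sc2st, z_st, z_st2sc, z_sc, sigma=1.0):
    """Sum of the two directional terms."""
    return mmd2(z_sc2st, z_st, sigma) + mmd2(z_st2sc, z_sc, sigma)

rng = np.random.default_rng(0)
p = rng.normal(size=(100, 4))
q = p + 3.0                    # same shape, shifted distribution
same_dist = mmd2(p, p)         # identical samples
shifted = mmd2(p, q)           # clearly separated samples
```

Identical samples give an estimate of zero, while the shifted sample gives a clearly positive value.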

GAN loss:

This loss introduces an adversarial objective where domain-specific discriminators are trained to classify whether the latent space representations originate from the source domain or are generated by the translator. Each translator acts as the generator, mapping representations from one domain to the other with the goal of producing latent representations that are indistinguishable from real ones.

The translator is trained to fool the discriminator by generating representations $x_{fake}$ that are indistinguishable from latent representations originating from the source domain $x_{real}$, with the loss defined as:

$$\mathcal{L}_{GAN} = -\mathbb{E}_{x_{fake}}[C(x_{fake})]$$

Specifically, for two translators the GAN loss is given as:

$$\mathcal{L}_{GAN} = -\mathbb{E}_{z_{SC \to ST}}[C_{ST}(z_{SC \to ST})] - \mathbb{E}_{z_{ST \to SC}}[C_{SC}(z_{ST \to SC})]$$

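The discriminators are trained using hinge loss, enforcing a margin separator to distinguish between representations from the source domain and those generated by a translator as follows:

$$\mathcal{L}_{disc} = \mathbb{E}_{x_{real}}[\max(0, 1 - C(x_{real}))] + \mathbb{E}_{x_{fake}}[\max(0, 1 + C(x_{fake}))]$$

Thus, for our model, the discriminator loss is defined as:

$$\mathcal{L}_{disc} = \mathbb{E}_{z_{ST}}[\max(0, 1 - C_{ST}(z_{ST}))] + \mathbb{E}_{z_{SC \to ST}}[\max(0, 1 + C_{ST}(z_{SC \to ST}))] + \mathbb{E}_{z_{SC}}[\max(0, 1 - C_{SC}(z_{SC}))] + \mathbb{E}_{z_{ST \to SC}}[\max(0, 1 + C_{SC}(z_{ST \to SC}))]$$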
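The two adversarial objectives can be sketched on raw discriminator scores as below. The sketch assumes the standard hinge formulation, in which the discriminator is penalized when real scores fall below +1 or fake scores rise above -1, and the translator minimizes the negative mean score of its outputs. The score values are made up for illustration.

```python
import numpy as np

def discriminator_hinge_loss(c_real, c_fake):
    """Hinge loss for one discriminator, given its scores on real and fake codes."""
    real_term = np.mean(np.maximum(0.0, 1.0 - c_real))
    fake_term = np.mean(np.maximum(0.0, 1.0 + c_fake))
    return float(real_term + fake_term)

def translator_gan_loss(c_fake):
    """Translator (generator) loss: push discriminator scores on fakes upward."""
    return float(-np.mean(c_fake))

# Hypothetical scores from one discriminator on two real and two fake codes.
c_real = np.array([2.0, 0.5])    # 2.0 is beyond the margin, 0.5 is inside it
c_fake = np.array([-2.0, 0.0])   # -2.0 is beyond the margin, 0.0 is inside it

d_loss = discriminator_hinge_loss(c_real, c_fake)
g_loss = translator_gan_loss(c_fake)
```

For the full model, each function would be evaluated once per domain and the two results summed.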

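The translator loss combines ID loss, cycle loss, CORAL loss, MMD loss, and GAN loss as follows:

$$\mathcal{L}_{trans} = \lambda_{ID}\mathcal{L}_{ID} + \lambda_{cyc}\mathcal{L}_{cyc} + \lambda_{CORAL}\mathcal{L}_{CORAL} + \lambda_{MMD}\mathcal{L}_{MMD} + \lambda_{GAN}\mathcal{L}_{GAN}$$

where $\lambda_{ID}$, $\lambda_{cyc}$, $\lambda_{CORAL}$, $\lambda_{MMD}$, and $\lambda_{GAN}$ are chosen using hyperparameter tuning.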

Model Inference

Gene expression from the ST domain $x_{ST}$ is passed through the trained encoder $E_{ST}$ to generate the latent space representation $z_{ST}$. This representation is then translated into the SC domain as $z_{ST \to SC}$ using the trained translator $T_{ST \to SC}$. Finally, the translated latent representation is passed through the trained decoder of the SC domain $D_{SC}$ to generate the gene expression translated from the ST to the SC domain, $\hat{x}_{SC} = D_{SC}(z_{ST \to SC})$.
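The inference path is a composition of three trained components, which can be sketched as below. The function `impute` and the linear stand-ins are hypothetical; in practice the three callables would be the trained ST encoder, ST-to-SC translator, and SC decoder.

```python
import numpy as np

def impute(x_st, encoder_st, translator_st2sc, decoder_sc):
    """Encode ST expression, translate the latent code to the SC domain, decode."""
    z_st = encoder_st(x_st)              # latent representation of ST cells
    z_st2sc = translator_st2sc(z_st)     # translated into the SC latent space
    return decoder_sc(z_st2sc)           # expression over the SC gene set

# Toy linear stand-ins for the trained networks (hypothetical sizes).
rng = np.random.default_rng(0)
n_cells, g_st, g_sc, latent = 12, 30, 200, 8
w_enc = rng.normal(size=(g_st, latent))
w_tr = rng.normal(size=(latent, latent))
w_dec = rng.normal(size=(latent, g_sc))

x_st = rng.random((n_cells, g_st))
x_sc_hat = impute(x_st,
                  lambda x: x @ w_enc,
                  lambda z: z @ w_tr,
                  lambda z: z @ w_dec)
```

The output has one row per ST cell and one column per SC gene, which is how genes missing from the ST panel obtain predicted values.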

Experimental setup

The training is divided into two stages: In the first stage, the Encoder-Decoder networks are trained to efficiently encode and reconstruct gene expressions, ensuring faithful compression and reconstruction of input data. In the second stage, translators are trained to translate the latent representations from ST to SC. Multiple loss functions are utilized to ensure accurate and smooth domain translation including cycle loss, identity (ID) loss, CORAL loss, MMD loss, and GAN loss.

For cross-validated inference, the model is trained to predict a subset of genes held-out in each fold for evaluation. This is done by first encoding ST gene expression into a low-dimensional latent representation, translating that latent space to the SC domain, and finally decoding it to reconstruct full gene expression including genes not used during training in that fold. Through this process, the SpaGene model achieves its goal: translating gene expression data in the ST domain into the SC domain, thus enhancing the ST data and enabling more meaningful downstream analysis.

We evaluate SpaGene on seven ST and SC dataset pairs. For each fold in cross-validation, the model is trained on a subset of genes and evaluated on the held-out gene set. During each fold, the inference to translate held-out gene expression from ST to SC is done after training all components of the model including the two encoder-decoder pairs, two translators and two discriminators.

For each fold, the model is trained on four-fifths of the shared genes and tested on the remaining fifth, and this process is repeated five times. Each dataset is partitioned based on the set of shared genes between the ST and SC domains (MERFISH Moffitt dataset pair: 141, NanoString GSE dataset pair: 488, osmFISH AllenSSp dataset pair: 26, osmFISH AllenVISp dataset pair: 26, osmFISH Zeisel dataset pair: 32, seqFISH AllenVISp dataset pair: 411, STARmap AllenVISp dataset pair: 242). In each round of cross-validation, four gene folds are used for integration while the remaining fold is used for evaluation. Model performance was evaluated by comparing the measured and predicted gene expression profiles on the held-out genes.
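The gene-wise five-fold split can be sketched as follows. The random shuffling and the seed are assumptions, since the text does not say how genes are assigned to folds; `gene_folds` and the gene names are hypothetical. The example uses 26 shared genes, matching the osmFISH AllenSSp pair.

```python
import numpy as np

def gene_folds(shared_genes, n_folds=5, seed=0):
    """Yield (train_genes, held_out_genes) pairs over a shuffled gene list."""
    rng = np.random.default_rng(seed)
    genes = np.array(shared_genes)
    perm = rng.permutation(len(genes))
    for held_out_idx in np.array_split(perm, n_folds):
        mask = np.ones(len(genes), dtype=bool)
        mask[held_out_idx] = False
        yield genes[mask].tolist(), genes[held_out_idx].tolist()

shared = [f"gene{i}" for i in range(26)]   # placeholder gene names
folds = list(gene_folds(shared))
fold_sizes = sorted(len(held_out) for _, held_out in folds)
```

Each gene is held out exactly once, so every shared gene receives one out-of-fold prediction.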

Evaluation metrics

Let $N$ be the number of cells and $G$ the number of genes in the ST dataset, $x_g$ the measured expression for gene $g$, and $\hat{x}_g$ the predicted expression for gene $g$. We used the following metrics to assess the performance of our method:

PCC:

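The Pearson correlation coefficient22 between the expression predicted by each method and the ground truth ST expression is computed for each gene $g$ as follows:

$$\mathrm{PCC}(g) = \frac{\mathrm{Cov}(x_g, \hat{x}_g)}{\sigma_{x_g}\,\sigma_{\hat{x}_g}}$$

where $\mathrm{Cov}(x_g, \hat{x}_g) = \frac{1}{N}\sum_{i=1}^{N}(x_{g,i} - \mu_g)(\hat{x}_{g,i} - \hat{\mu}_g)$, $\mu_g = \frac{1}{N}\sum_{i=1}^{N} x_{g,i}$, $\hat{\mu}_g = \frac{1}{N}\sum_{i=1}^{N} \hat{x}_{g,i}$, and $\sigma_{x_g}$, $\sigma_{\hat{x}_g}$ are the standard deviations of $x_g$ and $\hat{x}_g$.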
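The per-gene PCC can be computed for all genes at once, as sketched below with the 1/N normalization from the formula. The function name is hypothetical, and constant genes (zero standard deviation) are not handled and would produce a division by zero.

```python
import numpy as np

def pcc_per_gene(x, x_hat):
    """Pearson correlation per gene; x and x_hat are (cells x genes) arrays."""
    xc = x - x.mean(axis=0)
    hc = x_hat - x_hat.mean(axis=0)
    cov = (xc * hc).mean(axis=0)                     # 1/N covariance per gene
    return cov / (x.std(axis=0) * x_hat.std(axis=0))

rng = np.random.default_rng(0)
x = rng.random((50, 4))
pcc_linear = pcc_per_gene(x, 2.0 * x + 1.0)   # positive linear map of x
pcc_flipped = pcc_per_gene(x, -x)             # sign-flipped copy of x
```

A positive linear transform of the measured values gives a PCC of 1 for every gene, and a sign flip gives -1, which shows that PCC ignores scale and offset.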

SSIM (Structural similarity index):

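The structural similarity index22 measures the similarity between measured and predicted gene expression and is computed as follows:

$$\mathrm{SSIM}(g) = \frac{(2\mu_g\hat{\mu}_g + C_1)(2\,\mathrm{Cov}(x_g, \hat{x}_g) + C_2)}{(\mu_g^2 + \hat{\mu}_g^2 + C_1)(\sigma_{x_g}^2 + \sigma_{\hat{x}_g}^2 + C_2)}$$

where $C_1 = 0.01$ and $C_2 = 0.03$.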
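A direct NumPy transcription of the SSIM formula is sketched below. No rescaling of the inputs is applied; whether expression values are rescaled before computing SSIM is not stated in this section, so that choice is left to the caller. The function name is hypothetical.

```python
import numpy as np

def ssim_per_gene(x, x_hat, c1=0.01, c2=0.03):
    """SSIM per gene; x and x_hat are (cells x genes) arrays."""
    mu, mu_hat = x.mean(axis=0), x_hat.mean(axis=0)
    var, var_hat = x.var(axis=0), x_hat.var(axis=0)      # 1/N variances
    cov = ((x - mu) * (x_hat - mu_hat)).mean(axis=0)     # 1/N covariance
    numerator = (2.0 * mu * mu_hat + c1) * (2.0 * cov + c2)
    denominator = (mu ** 2 + mu_hat ** 2 + c1) * (var + var_hat + c2)
    return numerator / denominator

rng = np.random.default_rng(0)
x = rng.random((50, 4))
ssim_identical = ssim_per_gene(x, x)                          # perfect prediction
ssim_noisy = ssim_per_gene(x, x + rng.normal(0.0, 0.5, x.shape))
```

A perfect prediction gives an SSIM of 1 for every gene, because the numerator and denominator coincide.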

RMSE (Root mean square error):

The root mean square error measures the error22 between measured and predicted gene expression and is computed as follows:

$$\mathrm{RMSE}(g) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{g,i} - \hat{x}_{g,i}\right)^2}$$

The average of PCC, RMSE and SSIM across held-out genes is reported for performance evaluation. Higher PCC and SSIM and lower RMSE indicate better prediction accuracy.
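The per-gene RMSE and its average over held-out genes can be sketched as follows, using a tiny hand-checkable example. The function name and values are hypothetical.

```python
import numpy as np

def rmse_per_gene(x, x_hat):
    """Root mean square error per gene; x and x_hat are (cells x genes) arrays."""
    return np.sqrt(((x - x_hat) ** 2).mean(axis=0))

# Two cells, two held-out genes.
x = np.array([[0.0, 0.0],
              [0.0, 0.0]])
x_hat = np.array([[3.0, 1.0],
                  [4.0, 1.0]])

per_gene = rmse_per_gene(x, x_hat)   # gene 0: sqrt((9 + 16) / 2), gene 1: 1
reported = per_gene.mean()           # average across held-out genes
```

The same averaging across held-out genes would apply to the per-gene PCC and SSIM values.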

FUNDING

Q.S. is supported by the National Institute of General Medical Sciences of the National Institutes of Health (R35GM151089). J.S., A.B. and J.H. are financially supported by the National Library of Medicine of the National Institutes of Health (R01LM013771). J.S. and A.B. are also financially supported by the National Cancer Institute of the National Institutes of Health (P30CA082709-25S1). J.S. is also supported by the Indiana University Precision Health Initiative and the Indiana University Melvin and Bren Simon Comprehensive Cancer Center Support Grant from the National Cancer Institute (P30CA082709).

Biographies

Aishwarya Budhkar is a Ph.D. Candidate in the Department of Computer Science, Indiana University Bloomington, IN, USA. Her research focuses on developing novel artificial intelligence methods in bioinformatics.

Juhyung Ha is a Ph.D. Candidate in the Department of Computer Science, Indiana University Bloomington, IN, USA. His research focuses on developing novel artificial intelligence methods in interdisciplinary science.

Qianqian Song is an Assistant Professor in the Department of Health Outcomes & Biomedical Informatics at the University of Florida College of Medicine, FL, USA. Her research focuses on developing advanced computational and AI models to decipher disease mechanisms and identify novel therapeutic targets.

Jing Su is an Associate Professor in the Department of Biostatistics and Health Data Science, Indiana University School of Medicine, IN, USA. His research focuses on graph artificial intelligence and machine learning in biomedical informatics and precision health.

Xuhong Zhang is an Assistant Professor in the Department of Computer Science, Indiana University Bloomington. Her research focuses on computer science and bioinformatics.

Footnotes

CONFLICT OF INTEREST

The authors have no competing interests to declare.

DATA AVAILABILITY

The GSE dataset can be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131907. The NanoString CosMx SMI dataset can be downloaded from https://nanostring.com/products/cosmx-spatial-molecular-imager/ffpe-dataset/. The MERFISH Moffitt, osmFISH AllenSSp, osmFISH AllenVISp, osmFISH Zeisel, seqFISH AllenVISp, and STARmap AllenVISp datasets can be downloaded from the public repository https://zenodo.org/records/3967291.

CODE AVAILABILITY

The SpaGene method is provided as an open-source Python package on GitHub: https://github.com/asbudhkar/SpaGene.

REFERENCES

  • 1.Piwecka M., Rajewsky N. & Rybak-Wolf A. Single-cell and spatial transcriptomics: deciphering brain complexity in health and disease. Nature Reviews Neurology 19, 346–362 (2023).
  • 2.Lim J. et al. Advances in single-cell omics and multiomics for high-resolution molecular profiling. Experimental & Molecular Medicine 56, 515–526 (2024).
  • 3.Haque A., Engel J., Teichmann S.A. & Lönnberg T. A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome medicine 9, 1–12 (2017).
  • 4.Reel P.S., Reel S., Pearson E., Trucco E. & Jefferson E. Using machine learning approaches for multi-omics data analysis: A review. Biotechnol Adv 49, 107739 (2021).
  • 5.He S. et al. High-plex imaging of RNA and proteins at subcellular resolution in fixed tissue by spatial molecular imaging. Nature Biotechnology 40, 1794–1806 (2022).
  • 6.Armingol E., Officer A., Harismendy O. & Lewis N.E. Deciphering cell–cell interactions and communication from gene expression. Nature Reviews Genetics 22, 71–88 (2021).
  • 7.Moffitt J.R. et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362, eaau5324 (2018).
  • 8.Codeluppi S. et al. Spatial organization of the somatosensory cortex revealed by osmFISH. Nature methods 15, 932–935 (2018).
  • 9.Rodriques S.G. et al. Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).
  • 10.Maynard K.R. et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nature Neuroscience 24, 425–436 (2021).
  • 11.Abdelaal T., Mourragui S., Mahfouz A. & Reinders M.J. SpaGE: spatial gene enhancement using scRNA-seq. Nucleic acids research 48, e107–e107 (2020).
  • 12.Mourragui S., Loog M., Van De Wiel M.A., Reinders M.J. & Wessels L.F. PRECISE: a domain adaptation approach to transfer predictors of drug response from pre-clinical models to tumors. Bioinformatics 35, i510–i519 (2019).
  • 13.Steinbach M. & Tan P.-N. in The top ten algorithms in data mining 165–176 (Chapman and Hall/CRC, 2009).
  • 14.Lopez R. et al. A joint model of unpaired data from scRNA-seq and spatial transcriptomics for imputing missing gene expression measurements. arXiv preprint arXiv:1905.02269 (2019).
  • 15.Biancalani T. et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nature methods 18, 1352–1362 (2021).
  • 16.Kim N. et al. Single-cell RNA sequencing demonstrates the molecular and cellular reprogramming of metastatic lung adenocarcinoma. Nature communications 11, 2285 (2020).
  • 17.Chatterjee S. et al. Nontoxic, double-deletion-mutant rabies viral vectors for retrograde targeting of projection neurons. Nature neuroscience 21, 638–646 (2018).
  • 18.Tasic B. et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature 563, 72–78 (2018).
  • 19.Zeisel A. et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142 (2015).
  • 20.Eng C.-H.L. et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH+. Nature 568, 235–239 (2019).
  • 21.Wang X. et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 361, eaat5691 (2018).
  • 22.Li B. et al. Benchmarking spatial and single-cell transcriptomics integration methods for transcript distribution prediction and cell type deconvolution. Nature methods 19, 662–670 (2022).
  • 23.Satija R., Farrell J.A., Gennert D., Schier A.F. & Regev A. Spatial reconstruction of single-cell gene expression data. Nature biotechnology 33, 495–502 (2015).
  • 24.Shen X. et al. Inferring cell trajectories of spatial transcriptomics via optimal transport analysis. Cell Systems 16 (2025).
  • 25.Wang T., Chen Z., Wang W., Wang H. & Li S. Single-cell and spatial transcriptomic analysis reveals tumor cell heterogeneity and underlying molecular program in colorectal cancer. Frontiers in Immunology 16, 1556386 (2025).
  • 26.Jing S. y. et al. Quantifying and interpreting biologically meaningful spatial signatures within tumor microenvironments. npj Precision Oncology 9, 68 (2025).
  • 27.Sun Y. et al. Spatial transcriptomics reveals macrophage domestication by epithelial cells promotes immunotherapy resistance in small cell lung cancer. npj Precision Oncology 9, 252 (2025).
  • 28.Dong Z.-R. et al. Spatial resolved transcriptomics reveals distinct cross-talk between cancer cells and tumor-associated macrophages in intrahepatic cholangiocarcinoma. Biomarker Research 12, 100 (2024).
  • 29.Liberzon A. et al. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Systems 1, 417–425 (2015).
  • 30.Subramanian A. et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102, 15545–15550 (2005).
  • 31.Budhkar A., Song Q., Su J. & Zhang X. Demystifying the black box: A survey on explainable artificial intelligence (XAI) in bioinformatics. Computational and Structural Biotechnology Journal 27, 346–359 (2025).
  • 32.Raghavan K. Attention guided grad-CAM: an improved explainable artificial intelligence model for infrared breast cancer detection. Multimedia Tools and Applications, 1–28 (2023).
  • 33.Sundararajan M., Taly A. & Yan Q. in International conference on machine learning 3319–3328 (PMLR, 2017).
  • 34.Böhle M., Eitel F., Weygandt M. & Ritter K. Layer-wise relevance propagation for explaining deep neural network decisions in MRI-based Alzheimer’s disease classification. Frontiers in aging neuroscience 11, 194 (2019).
  • 35.Wolf F.A., Angerer P. & Theis F.J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biology 19 (2018).
  • 36.Zhai J., Zhang S., Chen J. & He Q. in 2018 IEEE international conference on systems, man, and cybernetics (SMC) 415–419 (IEEE, 2018).
  • 37.Agarap A.F. Deep learning using rectified linear units. arXiv preprint arXiv:1803.08375 (2018).
  • 38.Arjovsky M., Chintala S. & Bottou L. in International conference on machine learning 214–223 (PMLR, 2017).
  • 39.Zhu J.-Y., Park T., Isola P. & Efros A.A. in Proceedings of the IEEE international conference on computer vision 2223–2232 (2017).
  • 40.Pan Z. et al. Loss functions of generative adversarial networks (GANs): Opportunities and challenges. IEEE Transactions on Emerging Topics in Computational Intelligence 4, 500–522 (2020).
  • 41.Sun B., Feng J. & Saenko K. in Proceedings of the AAAI conference on artificial intelligence, Vol. 30 (2016).
  • 42.Wang W., Sun Y. & Halgamuge S. Improving MMD-GAN training with repulsive loss function. arXiv preprint arXiv:1812.09916 (2018).


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
