Communications Biology. 2024 Oct 5;7:1271. doi: 10.1038/s42003-024-06964-2

Imputing spatial transcriptomics through gene network constructed from protein language model

Yuansong Zeng 1,2,3, Yujie Song 2, Chengyang Zhang 2, Haoxuan Li 1, Yongkang Zhao 1, Weijiang Yu 2, Shiqi Zhang 1, Hongyu Zhang 1, Zhiming Dai 2, Yuedong Yang 2
PMCID: PMC11455941  PMID: 39369061

Abstract

Image-based spatial transcriptomic sequencing technologies have enabled the measurement of gene expression at single-cell resolution, but with a limited number of genes. Current computational approaches attempt to overcome these limitations by imputing missing genes, but face challenges regarding prediction accuracy and identification of cell populations due to the neglect of gene-gene relationships. In this context, we present stImpute, a method to impute spatial transcriptomics according to reference scRNA-seq data based on the gene network constructed from the protein language model ESM-2. Specifically, stImpute employs an autoencoder to create gene expression embeddings for both spatial transcriptomics and scRNA-seq data, which are used to identify the nearest neighboring cells between scRNA-seq and spatial transcriptomics datasets. According to the neighbored cells, the gene expressions of spatial transcriptomics cells are imputed through a graph neural network, where nodes are genes, and edges are based on cosine similarity between the ESM-2 embeddings of the gene-encoding proteins. The gene prediction uncertainty is further measured through a deep learning model. stImpute was shown to consistently outperform state-of-the-art methods across multiple datasets concerning imputation and clustering. stImpute also demonstrates robustness in producing consistent results that are insensitive to model parameters.

Subject terms: Bioinformatics, Biological models


stImpute imputes missing genes in spatial transcriptomics using scRNA-seq data and a gene network constructed from the protein language model ESM-2, achieving high accuracy and robust clustering through a graph neural network and deep learning based uncertainty measurement.

Introduction

Recent advancements in spatially resolved transcriptomic sequencing technologies enable the simultaneous measurement of cellular gene expression and its positional context, which is key to understanding complex tissue organization1. However, imaging-based spatial transcriptomic (ST) technologies such as SeqFISH+2, osmFISH3, STARmap4, and MERFISH5 measure only a limited number of genes from the entire transcriptome. This restricted gene coverage limits a comprehensive understanding of the molecular landscape of the tissue or biological process. Several computational methods attempt to address this challenge by imputing spatial gene expression from single-cell RNA sequencing (scRNA-seq) data. However, this task is challenging because scRNA-seq data and spatial transcriptomic data follow different distributions.

Early integration methods predict spatial gene expression by projecting both scRNA-seq and ST data into a common latent space and then employing the KNN algorithm for gene imputation. For instance, Harmony6 iteratively applies maximum diversity clustering and mixture-model-based linear batch correction to project scRNA-seq and ST data into a shared latent space. Harmony subsequently uses the KNN algorithm to impute gene expression of ST data based on their nearest neighbors in the scRNA-seq dataset. LIGER7 and Seurat8 take a similar strategy. However, these methods rely on linear models, namely Non-negative Matrix Factorization (NMF) in LIGER, Canonical Correlation Analysis (CCA) in Seurat, and Principal Component Analysis (PCA) in Harmony, which may struggle to capture non-linear gene relations. Furthermore, these methods lack specific optimizations for spatial gene expression prediction, potentially yielding sub-optimal results.

To further improve the performance of spatial gene expression prediction, several methods are tailored for gene prediction. gimVI9 employs a hierarchical Bayesian model with deep neural networks to establish a shared latent space between scRNA-seq and ST data. It models count data using either negative binomial (NB) or zero-inflated NB (ZINB)10 for imputing missing gene expressions. SpaGE11 is a robust and interpretable machine-learning approach for predicting unmeasured genes, which incorporates domain adaptation to rectify sensitivity differences in transcript detection between scRNA-seq and spatial transcriptomics data. SpaGE uses the KNN algorithm to predict missing spatial gene expression. stPlus12 takes similar strategies while applying an autoencoder to obtain the joint embedding space. Tangram13 opts not to learn a common latent space. Instead, it focuses on maximizing spatial correlation across genes between scRNA-seq and ST data by learning a mapping function. Tangram demonstrates decent performance in gene imputation and spatial deconvolution tasks. Though these methods achieved commendable gene prediction results, they did not consider the imputation reliability for each predicted gene.

To predict reliable genes, TransImpLR14 introduces a linear framework to predict reliable spatial genes through uncertainty estimation and spatial regularization. TransImpLR enables reliable imputation mainly relying on a post-hoc uncertainty prediction model. Though TransImpLR achieves state-of-the-art results on several datasets, there are still some drawbacks. First, it uses all cells within the scRNA-seq reference to predict the gene expression for each cell in the spatial data, which makes it challenging to predict the cell-specific gene expression. Second, TransImpLR is a linear model, which makes it difficult to capture the non-linear relationships in the scRNA-seq and spatial data. Most importantly, TransImpLR ignores gene-to-gene relationships in gene prediction, which have shown improved performance in previous work15,16.

In fact, gene relationships can be discerned through the similarity of their gene-encoding proteins, which enables unraveling the complexities in gene functionality and interactions. Proteins play a crucial role in understanding gene functions across various biological processes, serving as direct expressions of genes. Analyzing the structures and functions of gene-coding proteins provides deeper insights into the underlying mechanisms of gene regulation, expression, and interaction. Recently, the protein language model ESM-217, trained on large-scale protein datasets, has emerged as an efficient tool for embedding numerical representations of gene-coding proteins. These informative representations of proteins can further facilitate the calculation of protein similarities.

Here, we propose a reliable spatial imputation method stImpute, which predicts authentic spatial transcriptomics data from scRNA-seq reference by considering gene-to-gene relationships. stImpute constructs gene-similar relationships based on the cosine similarity between the ESM-217 embeddings of the gene-encoding proteins. The gene relationships are then fed into a graph neural network (GNN) to predict spatial gene expression from the nearest neighboring cells within scRNA-seq data. The nearest neighboring cells in scRNA-seq data for each ST cell are identified using joint embeddings, which are created by an autoencoder that processes both spatial and scRNA-seq data. Uniquely, stImpute possesses the ability to reliably identify genes for imputation. stImpute consistently outperforms competing methods across multiple datasets regarding gene prediction and cell clustering. In addition, the spatial genes predicted by stImpute preserve spatial patterns.

Methods

stImpute is a method for accurately predicting the gene expressions of ST data from scRNA-seq reference data. stImpute consists of an autoencoder and a graph neural network (Fig. 1). The inputs of stImpute are the reference scRNA-seq data, denoted as $X^r \in \mathbb{R}^{c_r \times (p+q)}$, and the target spatial transcriptomics data, denoted as $X^t \in \mathbb{R}^{c_t \times p}$, where $c_r$ and $c_t$ are the numbers of cells in the scRNA-seq and ST data, p is the number of genes shared between the two datasets, and q is the number of genes unique to the scRNA-seq data. stImpute is introduced in detail in the following sections.

Fig. 1. The model architecture of stImpute.

A stImpute first embeds the scRNA-seq data and spatial transcriptomics (ST) data into a joint latent space through the autoencoder. B Based on the latent representation z, stImpute matches each ST cell to its k nearest scRNA-seq cells. C stImpute constructs the gene relation graph based on cosine similarity between the ESM-2 embeddings of the gene-encoding proteins. The gene graph is then fed into the graph neural network to predict gene expression for each ST cell from its k nearest scRNA-seq cells. D The expression of each gene in ST inferred by stImpute is further fed into the MLP model to predict its reliability score. E The downstream validation of the proposed stImpute model.

Autoencoder for joint embeddings learning

To match each ST cell to its k most similar scRNA-seq cells, we apply an autoencoder (AE) to project the gene expression of the scRNA-seq data $X^r \in \mathbb{R}^{c_r \times (p+q)}$ and the ST data $X^t \in \mathbb{R}^{c_t \times p}$ into a shared latent space. Note that we train the AE model using only the ST data with the shared p genes. Concretely, we use the encoder of the AE to encode the gene expression matrix of the ST data into fixed-size vector representations. At the l-th layer of the encoder, the output $H^{(l)}$ is calculated as follows:

$$H^{(l)} = \phi\left(W_e^{(l)} H^{(l-1)} + b_e^{(l)}\right) \tag{1}$$

where $\phi$ is the ReLU activation function, and $W_e^{(l)}$ and $b_e^{(l)}$ are the weight matrix and bias parameters, respectively. We set $H^{(0)} = X^t$. We take the output of the last encoder layer as the latent representation z, which is then fed into the decoder of the AE to reconstruct the input matrix. The output of the m-th layer of the decoder is computed as follows:

$$H^{(m)} = \phi\left(W_d^{(m)} H^{(m-1)} + b_d^{(m)}\right) \tag{2}$$

where $\phi$ is the ReLU activation function, and $W_d^{(m)}$ and $b_d^{(m)}$ are the weight matrix and bias parameters, respectively. The output of the last decoder layer is the reconstructed data $\tilde{X}^t$. All parameters of the AE model are optimized with the following loss function:

$$L_{ae} = \frac{1}{c_t}\sum_{i=1}^{c_t}\sum_{j=1}^{p}\left(X_{ij}^{t} - \tilde{X}_{ij}^{t}\right)^2 \tag{3}$$

After the AE model is trained, we use it to project both the ST and the scRNA-seq data into the common latent space. Based on the latent representations, we match each ST cell to its k most similar scRNA-seq cells by cosine distance. These neighboring cells are then used for cell-specific spatial gene prediction.
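The neighbor search described above can be illustrated with a minimal sketch. It assumes the latent vectors have already been produced by the trained encoder; the function names `cosine_distance` and `k_nearest_reference_cells` and the toy vectors are hypothetical and are not taken from the stImpute code base.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def k_nearest_reference_cells(z_st, z_ref, k):
    """For each ST latent vector, return the indices of the k closest reference vectors."""
    neighbors = []
    for z in z_st:
        order = sorted(range(len(z_ref)), key=lambda r: cosine_distance(z, z_ref[r]))
        neighbors.append(order[:k])
    return neighbors

# Toy latent vectors (illustrative values only).
z_ref = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]  # scRNA-seq cells
z_st = [[1.0, 0.05], [0.1, 1.0]]                            # ST cells
nn = k_nearest_reference_cells(z_st, z_ref, k=2)
print(nn)
```

This brute-force search compares every ST cell with every reference cell, so it is only meant to show the logic; a dataset with more than a million cells would likely need a vectorized or approximate nearest-neighbor search.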

Graph neural network for predicting gene expression

The gene expression of each ST cell is predicted from its k most similar scRNA-seq cells through a graph neural network (GNN), with genes as the nodes and gene-gene relationships as the adjacency matrix. In this study, we calculate gene relationships through gene-encoding proteins, which helps unravel the complexities of gene functionality and gene interactions. Concretely, we fetch the protein sequences of all genes from protein databases (such as UniProt) and then extract the protein embeddings ($X^{pemb} \in \mathbb{R}^{2560 \times e}$) through the protein language model ESM-2. The term e denotes the number of genes identifiable within the protein databases. We then obtain gene relationships by computing the cosine similarity between genes $g_i$ and $g_j$ in the $X^{pemb}$ matrix as follows:

$$s_{ij}^{protein} = \frac{X_{:,i}^{pemb} \cdot X_{:,j}^{pemb}}{\left\|X_{:,i}^{pemb}\right\| \left\|X_{:,j}^{pemb}\right\|} \tag{4}$$

where $X_{:,i}^{pemb}$ and $X_{:,j}^{pemb}$ denote columns i and j of the $X^{pemb}$ matrix. All gene relationships are collected in the similarity matrix $S^{protein} \in \mathbb{R}^{e \times e}$.

For the small fraction of genes (about 3% in this study) that could not be found in the protein database, we inferred their relationships through cosine similarity computed on the original gene expression. Specifically, using the original gene expression ($X^r \in \mathbb{R}^{c_r \times (p+q)}$), we computed the cosine similarity between $g_i$ and $g_j$ in the $X^r$ matrix using the dot product as follows:

$$s_{ij}^{gene} = \frac{X_{:,i}^{r} \cdot X_{:,j}^{r}}{\left\|X_{:,i}^{r}\right\| \left\|X_{:,j}^{r}\right\|} \tag{5}$$

where $X_{:,i}^{r}$ and $X_{:,j}^{r}$ represent columns i and j of the $X^r$ matrix. The gene-to-gene relationships are captured in the similarity matrix $S^{gene} \in \mathbb{R}^{(p+q) \times (p+q)}$.

We then combined the similarity relationships of genes without protein embeddings with those of genes with protein embeddings as follows:

$$A = S^{gene} \,\|\, S^{protein} \tag{6}$$

where the symbol $\|$ represents the replacement of the gene relationships in $S^{gene}$ with those in $S^{protein}$. The matrix A is subsequently used as the adjacency matrix in the graph neural network.
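As a rough illustration of Eqs. (4) to (6), the sketch below computes expression-based similarities for every gene pair and replaces them with protein-based similarities wherever both genes have a protein embedding. That reading of the replacement rule is an assumption, as are the names `cosine_similarity` and `build_adjacency` and the toy gene labels and vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def build_adjacency(expr_by_gene, protein_emb_by_gene):
    """Combine the two similarity sources into one gene-by-gene matrix.

    expr_by_gene: gene -> expression vector over reference cells (all p + q genes).
    protein_emb_by_gene: gene -> protein embedding (only genes found in the database).
    Assumption: an entry comes from the protein embeddings when both genes have one,
    and from the expression vectors otherwise.
    """
    genes = list(expr_by_gene)
    adjacency = {g: {} for g in genes}
    for gi in genes:
        for gj in genes:
            if gi in protein_emb_by_gene and gj in protein_emb_by_gene:
                sim = cosine_similarity(protein_emb_by_gene[gi], protein_emb_by_gene[gj])
            else:
                sim = cosine_similarity(expr_by_gene[gi], expr_by_gene[gj])
            adjacency[gi][gj] = sim
    return adjacency

# Toy inputs: gene "C" has no protein embedding (illustrative values only).
expr = {"A": [1.0, 0.0, 1.0], "B": [1.0, 0.0, 1.0], "C": [0.0, 1.0, 0.0]}
emb = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
adj = build_adjacency(expr, emb)
```

In this toy case genes A and B have identical expression but orthogonal embeddings, so the protein-based value overrides the expression-based one for that pair.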

Subsequently, the classical GNN framework GraphSAGE18 is applied to predict gene expression for each ST cell, where genes are the nodes and the gene relationships serve as the adjacency matrix A. Information propagation between genes through the GraphSAGE aggregation layer can be formulated as follows:

$$h_{N(v)}^{l} = \mathrm{AGG}\left(\left\{h_{u}^{l-1}, \forall u \in N(v)\right\}\right) \tag{7}$$

$$h_{v}^{l} = \theta\left(W^{t} \cdot \left[h_{v}^{l-1} \,\|\, h_{N(v)}^{l}\right]\right) \tag{8}$$

where N(v) is the set of one-hop gene neighbors of gene v, and AGG is the mean aggregator. A Multilayer Perceptron (MLP) projector following GraphSAGE is applied to predict the final expression of each gene in the ST cell. Concretely, the expression of gene v in the ST cell is obtained as follows:

$$g_v = \delta\left(\mathrm{MLP}\left(h_v^{l}\right)\right) \tag{9}$$

where $g_v \in \mathbb{R}^{1}$ and $\delta$ is the activation function. With this procedure, we obtain the expressions of all (p + q) genes for each ST cell, and hence the predicted gene expression of all ST cells, denoted as $\hat{X}^t \in \mathbb{R}^{c_t \times (p+q)}$.
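The aggregation step in Eqs. (7) and (8) can be sketched in plain Python. This is a generic GraphSAGE-style mean-aggregation layer under stated assumptions, not the stImpute implementation: the activation is taken to be ReLU, the update concatenates a node's own features with the neighbor mean, and the names `mean_aggregate`, `sage_layer`, the toy graph, and the weights are all hypothetical.

```python
def mean_aggregate(features, neighbors, dim):
    """Eq. (7) with a mean aggregator: average the feature vectors of one-hop neighbors."""
    if not neighbors:
        return [0.0] * dim  # assumption: isolated nodes receive a zero message
    return [sum(features[u][d] for u in neighbors) / len(neighbors) for d in range(dim)]

def sage_layer(features, graph, weight):
    """Eq. (8): apply a linear map and ReLU to [own features ; neighbor mean]."""
    dim = len(next(iter(features.values())))
    updated = {}
    for v, h_v in features.items():
        h_n = mean_aggregate(features, graph.get(v, []), dim)
        concat = h_v + h_n  # list concatenation, length 2 * dim
        updated[v] = [max(0.0, sum(w * x for w, x in zip(row, concat))) for row in weight]
    return updated

# Toy gene graph with two features per gene (illustrative values only).
features = {"g1": [1.0, 2.0], "g2": [3.0, 0.0], "g3": [5.0, 4.0]}
graph = {"g1": ["g2", "g3"], "g2": ["g1"], "g3": []}
weight = [
    [1.0, 0.0, 0.0, 0.0],  # output 0 copies the node's first own feature
    [0.0, 0.0, 1.0, 1.0],  # output 1 sums the two neighbor-mean features
]
updated = sage_layer(features, graph, weight)
print(updated)
```

In practice the weights would be learned and the layer would be followed by the MLP projector of Eq. (9); the fixed weights here only make the arithmetic easy to follow.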

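Finally, we use the observed gene expression of the ST cells, $X^t \in \mathbb{R}^{c_t \times p}$ with the shared p genes, to supervise the predicted gene expression of the ST cells through a Mean Squared Error (MSE) loss. In addition, we calculate a similarity loss at the cell level and the gene level to assist in optimizing the GNN model as follows:

$$\begin{aligned}
L_{total} &= L_{sim} + L_{res} \\
L_{sim} &= \frac{1}{c_t}\sum_{i=1}^{c_t}\left(1 - \cos\left(\left(\hat{X}^t - \hat{X}^t_{mean}\right)_{i,:}, \left(X^t - X^t_{mean}\right)_{i,:}\right)\right) + \frac{1}{N_p}\sum_{j=1}^{N_p}\left(1 - \cos\left(\left(\hat{X}^t - \hat{X}^t_{mean}\right)_{:,j}, \left(X^t - X^t_{mean}\right)_{:,j}\right)\right) \\
L_{res} &= \left\|X^t - \hat{X}^t\right\|_2^2
\end{aligned} \tag{10}$$

where $X^t_{i,:}$ and $X^t_{:,j}$ index the i-th row and j-th column of the ST matrix $X^t$, respectively. $c_t$ and $N_p$ are the total numbers of cells and shared genes in the ST data, respectively.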
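To make the structure of Eq. (10) concrete, the following sketch evaluates the loss on small matrices restricted to the shared genes. The text does not specify how the mean terms are computed, so the sketch assumes a per-gene (column) mean; that choice, together with the names `total_loss`, `_center_by_gene`, and `_cosine`, is an assumption for illustration only.

```python
import math

def _cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def _center_by_gene(matrix):
    """Subtract each gene's (column's) mean. ASSUMPTION: the paper's mean term is per gene."""
    n_cells = len(matrix)
    means = [sum(row[j] for row in matrix) / n_cells for j in range(len(matrix[0]))]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def total_loss(x_obs, x_pred):
    """L_total = L_sim (cell-level + gene-level cosine terms) + L_res (squared error)."""
    obs_c, pred_c = _center_by_gene(x_obs), _center_by_gene(x_pred)
    n_cells, n_genes = len(x_obs), len(x_obs[0])
    cell_term = sum(1.0 - _cosine(pred_c[i], obs_c[i]) for i in range(n_cells)) / n_cells
    gene_term = sum(
        1.0 - _cosine([r[j] for r in pred_c], [r[j] for r in obs_c])
        for j in range(n_genes)
    ) / n_genes
    l_res = sum((o - p) ** 2 for ro, rp in zip(x_obs, x_pred) for o, p in zip(ro, rp))
    return cell_term + gene_term + l_res

# Rows are cells, columns are shared genes (illustrative values only).
x_obs = [[1.0, 2.0], [3.0, 5.0]]
x_shift = [[2.0, 3.0], [4.0, 6.0]]  # every entry shifted by +1
loss_same = total_loss(x_obs, x_obs)
loss_shift = total_loss(x_obs, x_shift)
```

A constant shift leaves the centered matrices unchanged, so in this example only the residual term is non-zero. This shows how the similarity term rewards matching patterns while the residual term penalizes mismatched magnitudes.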

The iterative training procedure

Inspired by the iterative optimization strategy in previous work19, we jointly optimize the AE model and the GNN model in stImpute to obtain robust performance. Specifically, (1) we first train the AE model until it converges to obtain the common representations of the ST and scRNA-seq data. Based on these representations, stImpute finds the k nearest scRNA-seq cells for each ST cell. (2) The GNN model is then trained to predict spatial gene expression for each ST cell from its corresponding k neighboring scRNA-seq cells. (3) After obtaining the predicted gene expression of the ST data, we fix the parameters of the GNN and continue to train the AE model. Note that, in this step, we optimize the AE model with the following MSE loss:

$$L_{re\_ae} = \frac{1}{c_t}\sum_{i=1}^{c_t}\sum_{j=1}^{p}\left\|X_{ij}^{t} - \left(\tilde{X}_{ij}^{t} + \hat{X}_{ij}^{t}\right)\right\|_2^2 \tag{11}$$

where $\hat{X}_{ij}^{t}$ and $\tilde{X}_{ij}^{t}$ are the gene expression predicted by the GNN model and the gene expression reconstructed by the AE model, respectively. We iterate the above steps until stImpute converges. After stImpute is trained, the final gene expression of each ST cell is obtained as follows:

$$\hat{X}^{final} = \alpha \hat{X}^{t} + (1 - \alpha)\frac{1}{k}\sum_{i}^{k} X_{i}^{r} \tag{12}$$

where $\hat{X}^{t}$ and $X^{r}$ are the gene expression predicted by the GNN from the scRNA-seq data and the gene expression of the scRNA-seq data, respectively. $\alpha$ is the hyper-parameter that controls the relative contributions of $\hat{X}^{t}$ and $X^{r}$.
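Eq. (12) is a weighted average of the GNN prediction and the mean profile of the k neighboring scRNA-seq cells. A minimal sketch for a single ST cell follows; the function name `blend_prediction` and the toy values are hypothetical.

```python
def blend_prediction(gnn_pred, neighbor_profiles, alpha):
    """Eq. (12) for one ST cell: alpha * GNN prediction + (1 - alpha) * neighbor mean."""
    k = len(neighbor_profiles)
    neighbor_mean = [
        sum(cell[g] for cell in neighbor_profiles) / k for g in range(len(gnn_pred))
    ]
    return [alpha * p + (1.0 - alpha) * m for p, m in zip(gnn_pred, neighbor_mean)]

# One ST cell with two genes and k = 2 neighboring reference cells (illustrative values).
gnn_pred = [2.0, 0.0]
neighbor_profiles = [[0.0, 4.0], [0.0, 0.0]]
final = blend_prediction(gnn_pred, neighbor_profiles, alpha=0.5)
print(final)
```

With alpha = 1 the result is the GNN prediction alone, and with alpha = 0 it reduces to plain k-nearest-neighbor averaging.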

Reliable gene prediction

To assess the reliability of genes predicted by stImpute, we proposed a post hoc reliability prediction model. Specifically, we employed an MLP model to determine the reliability score of each predicted gene. The rationale for using an MLP model is its ability to capture complex relationships between input features (predicted gene expression) and target outcomes (reliability scores), which are quantified by the Cosine Similarity Score (CSS) between predicted and observed gene expressions. The CSS is a robust measure of the similarity between predicted and actual gene expressions, making it a suitable choice for evaluating reliability. Concretely, we partitioned the spatial gene expressions predicted by stImpute into five folds, with four folds serving as training data to optimize the MLP model and one fold designated as test data. Each gene from the training set was fed into the MLP model to predict its reliability score. The MLP model was optimized by minimizing the Mean Squared Error (MSE) loss (Eq. (13)) between the predicted reliability scores and the CSS computed from the predicted gene expression (the model input) and the observed gene expression. This optimization encourages the MLP model to reflect the reliability of gene predictions. Once the MLP model was optimized, reliability scores for each predicted gene in the test data were obtained. Subsequently, we iteratively computed the predicted reliability scores of all genes in all cells through cross-validation experiments.

$$L_{reliable} = \left\|\mathrm{MLP}\left(g_{pred}\right) - \mathrm{CSS}\left(g_{obs}, g_{pred}\right)\right\|_2^2 \tag{13}$$

where $g_{obs}$ and $g_{pred}$ are the observed expression and the predicted expression of gene v, respectively.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Experimental Setup

Dataset and pre-processing

All small datasets were from methods SpaGE, TransImpLR, and Spatial-ID20 as illustrated in Table 1. The large tonsil data was downloaded from https://www.10xgenomics.com/datasets/human-tonsil-data-xenium-human-multi-tissue-and-cancer-panel-1-standard. These dataset pairs exhibit diverse levels of gene detection sensitivity, sample sizes, and the number of spatially measured genes. We selected the top 2000 highly variable genes for the Dia and WT datasets to evaluate all methods, given the large number of genes in their original matrices. Notably, the cell population labels in the osmFISH and MERFISH datasets are available, whereas those in the other datasets cannot be obtained. We followed the data preprocessing protocol outlined in SpaGE. Specifically, the dataset underwent normalization, involving the division of counts within each cell by the total number of transcripts. This result was then scaled by the median number of transcripts per cell and log-transformed with a pseudo-count.
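The normalization steps described above (division by the per-cell total, scaling by the median total, and a log transform with a pseudo-count) can be sketched as follows. The pseudo-count value is not stated in the text, so the sketch assumes 1 (that is, log(1 + x)); the function name `normalize_counts` and the toy counts are hypothetical.

```python
import math
from statistics import median

def normalize_counts(counts):
    """Library-size normalize each cell, rescale by the median total, then log-transform.

    counts: list of cells, each a list of raw transcript counts per gene.
    ASSUMPTION: the pseudo-count is 1, so the transform is log(1 + x).
    """
    totals = [sum(cell) for cell in counts]
    scale = median([t for t in totals if t > 0])  # median transcripts per non-empty cell
    normalized = []
    for cell, total in zip(counts, totals):
        if total == 0:
            normalized.append([0.0] * len(cell))  # leave empty cells as zeros
            continue
        normalized.append([math.log1p(c / total * scale) for c in cell])
    return normalized

# Three cells, two genes (illustrative counts only).
raw = [[1, 1], [2, 2], [0, 6]]
norm = normalize_counts(raw)
```

Here the per-cell totals are 2, 4, and 6, so every cell is rescaled to the median total of 4 before the log transform.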

Table 1. The statistics of six paired spatial transcriptomics and scRNA-seq data

| Dataset pair | Spatial: # of cells | Spatial: # of genes | Spatial: data sparsity | Spatial: tissue | scRNA-seq: # of cells | scRNA-seq: # of genes | scRNA-seq: data sparsity | scRNA-seq: tissue |
|---|---|---|---|---|---|---|---|---|
| STARmap_AllenVISp | 1549 | 1020 | 79% | VISc | 14249 | 34617 | 74.70% | VISc |
| osmFISH_Zeisel | 3405 | 33 | 29.70% | SMSc | 1691 | 15075 | 78.90% | SMSc |
| osmFISH_AllenVISp | 3405 | 33 | 29.70% | SMSc | 14249 | 34617 | 74.70% | VISc |
| osmFISH_AllenSSp | 3405 | 33 | 29.70% | SMSc | 5577 | 30527 | 69.80% | SMSc |
| MERFISH_Moffit | 64373 | 155 | 60.60% | POR | 31299 | 18646 | 85.60% | POR |
| MouseBrain | 2471 | 119 | 75.94% | Brain | 40733 | 16488 | 85.38% | Brain |
| Dia_GSE112393 | 33441 | 24105 | 73.59% | SPG | 34633 | 37241 | 94.47% | SPG |
| WT_GSE112393 | 33059 | 24105 | 73.97% | SPG | 34633 | 37241 | 94.47% | SPG |
| Xenium_GSM5051495 | 1349620 | 377 | 91.62% | Tonsil | 26717 | 33551 | 98.25% | Tonsil |

VISc Visual cortex, SMSc Somatosensory cortex, POR Pre-optic region, SPG mouse spermatogenesis, WT wild-type mice, Dia leptin-deficient diabetic mice.

Baseline

We compared stImpute with recent approaches, including Tangram, gimVI, SpaGE, stPlus, NovoSpaRc, Uniport, and TransImpLR, each run with its default parameters. Data processing procedures, such as normalization and scaling, were likewise carried out by the source code of each method.

Evaluation Metrics

We evaluated the performance of our method using metrics in two ways. Initially, we followed the TransImpLR and Tangram approaches, calculating the Cosine Similarity Score (CSS) and Mean Squared Error (MSE) between the predicted and observed spatial gene expressions. The CSS provides a direct reflection of the correlation between predicted and observed gene expressions, whereas the MSE serves as an indicator of gene value recovery ability. Consequently, a lower MSE value and a higher CSS value signify superior prediction performance.
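For reference, the two imputation metrics can be computed as in the sketch below, using the standard definitions of cosine similarity and mean squared error. The function names `css` and `mse` and the toy vectors are hypothetical; each vector could hold either one gene across cells (gene level) or one cell across genes (cell level).

```python
import math

def css(pred, obs):
    """Cosine Similarity Score between a predicted and an observed expression vector."""
    dot = sum(p * o for p, o in zip(pred, obs))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_o = math.sqrt(sum(o * o for o in obs))
    if norm_p == 0.0 or norm_o == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_p * norm_o)

def mse(pred, obs):
    """Mean Squared Error between a predicted and an observed expression vector."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred)

# Toy vectors (illustrative values only).
pred, obs = [3.0, 4.0], [4.0, 3.0]
score, error = css(pred, obs), mse(pred, obs)
print(score, error)
```

CSS is insensitive to the overall scale of the prediction while MSE is not, which is why the two metrics are reported together.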

Next, we assessed the performance in identifying cell populations using three clustering metrics, namely Adjusted Rand Index (ARI)21, Fowlkes-Mallows index (FMI)22, and Completeness score (Comp)23. The detailed definition can be found in Supplementary Note 1. A higher score on these cell clustering metrics indicates better performance in cell population identification, consequently leading to better predictions in spatial transcriptomics.
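Two of the clustering metrics can be written compactly from pair counts. The sketch below follows the standard textbook definitions of ARI and FMI, which may differ in presentation from Supplementary Note 1; the function names and the toy label vectors are hypothetical.

```python
from collections import Counter
from math import comb, sqrt

def _pair_counts(true_labels, pred_labels):
    """Count pairs of cells co-clustered in both labelings, in truth, and in prediction."""
    both = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    in_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    in_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    return both, in_true, in_pred

def adjusted_rand_index(true_labels, pred_labels):
    """ARI: pair agreement corrected for the agreement expected by chance."""
    both, in_true, in_pred = _pair_counts(true_labels, pred_labels)
    expected = in_true * in_pred / comb(len(true_labels), 2)
    max_index = (in_true + in_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (both - expected) / (max_index - expected)

def fowlkes_mallows_index(true_labels, pred_labels):
    """FMI: geometric mean of pairwise precision and recall."""
    both, in_true, in_pred = _pair_counts(true_labels, pred_labels)
    if in_true == 0 or in_pred == 0:
        return 0.0
    return both / sqrt(in_true * in_pred)

truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]  # same partition, different cluster names
crossed = [0, 1, 0, 1]    # every true pair is split
ari_same = adjusted_rand_index(truth, relabeled)
ari_crossed = adjusted_rand_index(truth, crossed)
fmi_same = fowlkes_mallows_index(truth, relabeled)
```

Both metrics depend only on which cells are grouped together, so renaming the clusters leaves the scores unchanged.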

Implementation Details

stImpute was implemented in Python using PyTorch. To optimize the AE model, we used the Adam optimizer with a learning rate of 0.01, running for 30 epochs across all datasets. We followed stPlus and selected the 50 (k = 50) closest scRNA-seq cells for each ST cell. The graph neural network had two layers, with both input and hidden dimensions set to 50. The GNN model was optimized using the Adam optimizer with a learning rate of 0.01, and the number of training epochs was set to 10 for small ST data (fewer than 10,000 cells) and 32 for other ST datasets. The number of iterative training rounds of stImpute was set to 2. The hyperparameter α was set to 0.7 for small data and 0.5 for other datasets. Additionally, the Leiden algorithm24 was used to cluster cells. The experiments presented in this paper were performed on a system running Ubuntu 18.04.7 LTS, equipped with an Intel® Core™ i7-8700K CPU @ 3.70 GHz, 256 GB of RAM, and a pair of NVIDIA GeForce RTX 4090 graphics cards.

Results

The assessment of gene imputation and reliable gene inference

To assess the performance of our approach, we conducted 5-fold cross-validation experiments across all datasets. In each dataset, scRNA-seq data served as the reference, and spatial transcriptomics data served as the target. Shared genes between the reference and target of each paired dataset were randomly partitioned into five folds. Prediction of gene expression in each fold was conducted using models trained on genes from the remaining four folds. The final gene expression encompassed predictions from each fold. Given the difficulty in determining the best parameters for every method and dataset, we opted to use default parameters for our comparisons. Using default parameters enables a more comparable evaluation of each method’s relative performance across different datasets. We quantitatively evaluated each method by computing the Mean Squared Error (MSE) value and Cosine Similarity Score (CSS) between predicted and observed gene expression. A lower MSE value and a higher CSS value signify superior prediction performance.
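The gene-wise cross-validation split can be sketched as below. The random seed, the round-robin assignment after shuffling, and the names `gene_folds` and `gene_0` to `gene_32` are assumptions for illustration; the text only states that shared genes are randomly partitioned into five folds.

```python
import random

def gene_folds(genes, n_folds=5, seed=0):
    """Randomly partition the shared genes into n_folds disjoint folds."""
    shuffled = list(genes)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

# 33 placeholder gene names, matching the number of genes in the osmFISH panel.
shared_genes = [f"gene_{i}" for i in range(33)]
folds = gene_folds(shared_genes)

for held_out in folds:
    held_out_set = set(held_out)
    train_genes = [g for g in shared_genes if g not in held_out_set]
    # A model would be trained on train_genes and asked to predict held_out here.
    assert len(train_genes) + len(held_out) == len(shared_genes)
```

Every gene is held out exactly once, so concatenating the per-fold predictions yields a predicted value for each shared gene that was never seen during its own training run.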

We first evaluated our method at the cell level to assess whether crucial cellular characteristics are preserved in the predicted spatial profiles. As shown in Fig. 2a, c, our method achieved the best performance, with an average MSE of 0.45 and a CSS of 0.68. Compared to the second-ranked method, stPlus, our method demonstrated significant improvements, yielding a 1.5 reduction in MSE and a 4.5% increase in CSS. These improvements are likely due to the efficient capture of gene relationships by the graph neural network. Although stPlus achieved higher CSS results than gimVI, it performed worse than gimVI in terms of MSE. SpaGE and TransImpLR delivered comparable outcomes in terms of average CSS, possibly because both are linear methods. Tangram exhibited the highest MSE, which may be attributed to its tendency to over-predict the expression of individual genes. The lower performance of Tangram could be because it is a versatile tool for analyzing spatial transcriptomics data but is not specifically optimized for gene imputation. stImpute consistently outperformed Uniport and NovoSpaRc in terms of average CSS across all datasets, with performance increases of 7.8% and 14%, respectively. We next tested each method at the gene level (Fig. 2b, d), which provides a direct reflection of the correlation between predicted and measured spatial profiles. Our method consistently outperformed the second-ranked method, TransImpLR. Specifically, compared to TransImpLR, our method exhibited a 34% improvement and a 3.4% increase in terms of average MSE and CSS values, respectively. The detailed information is listed in Supplementary Table S1. We noticed that all methods performed poorly on the STARmap_AllenVISp dataset, which has a large number of genes. The relatively low performance on this dataset might be caused by its inconsistency with other datasets, since performance remains low even when only the 100 most highly variable genes are selected with Scanpy (Supplementary Fig. S1).

Fig. 2. The performance of the stImpute when compared to other baseline methods evaluated across six paired datasets.

This evaluation was conducted using two distinct metrics: the Cosine Similarity Score (CSS), assessed both at the cellular level (a) and the gene level (b), and the Mean Squared Error (MSE), similarly analyzed at the cellular (c) and gene (d) levels. This analysis includes n = 8 biologically independent paired datasets. e The CSS performance between two groups of genes: one with predicted reliability scores exceeding the median value observed across the entire gene set, and the other comprising all genes.

To evaluate the robustness of our method, we evaluated all methods using the new metric Pearson correlation coefficient (PCC). As shown in Supplementary Fig. S2, our method is similar to stPlus in terms of PCC and outperforms other methods by 2.8% at the cell level. A similar trend was observed at the gene level. Additionally, we conducted experiments using 5-fold cross-validation by partitioning the cells and holding out a specific set of genes for evaluation. As shown in Supplementary Fig. S3, when 50% of genes were reserved, our method maintained strong performance, achieving an average CSS of 61.1% at the cell level, with only a slight decrease from the original result. Remarkably, performance was consistent with the original results when 80% of genes were reserved. We also assessed the influence of pre-processing steps by substituting the steps used in our method with those from Tangram and the classic tool Scran. As shown in Supplementary Fig. S4, our original pre-processing steps yielded the best results, likely due to our model’s optimization for normalized data. Scran’s pre-processing outperformed Tangram’s normalization steps. To ensure a fair comparison, we aligned the pre-processing steps of all competing methods with ours. Despite slight variations in their average performance, our method consistently outperformed the others.

We further evaluated our method on a large-scale dataset comprising 1.3 million cells from tonsil tissue, profiled using 10x Xenium technology. As shown in Supplementary Fig. S5, our method successfully processed this dataset and achieved a CSS of 0.53 at the cell level, 7% higher than the second-ranked method, stPlus. All methods performed worse at the gene level due to the large cell count. Tangram, SpaGE, and NovoSpaRc were excluded from the results because they failed to run on this large dataset (Tangram ran out of GPU memory, SpaGE generated NaN values, and NovoSpaRc raised a KeyError). In addition, we conducted a comprehensive evaluation of computational and memory costs across datasets with varying cell counts sampled from the tonsil data containing 1.3 million cells. As shown in Supplementary Fig. S6, our method's speed was comparable to that of the deep learning-based methods (gimVI and stPlus). Although our method was slower than TransImpLR, a linear regression-based method with lower accuracy, its speed was still acceptable. In terms of memory usage, most methods, including stImpute, had modest memory costs, requiring less than 14 GB when processing the data with approximately 1.3 million cells.

Although we achieved state-of-the-art performance, there remains potential for enhancing overall accuracy. This is primarily due to differences between scRNA-seq and ST datasets. To address this, we implemented a deep learning-based model for assessing the confidence of each predicted gene, as detailed in the Method section. Concretely, we first obtained the reliable scores for each predicted gene through the MLP model as described in the method section. Then, the top 50% genes with the highest reliable scores were selected to evaluate the performance of the reliable prediction. Significantly, as indicated in Fig. 2e and Supplementary Table S2, by concentrating on the half-gene set with higher confidence scores, we elevated the average CSS from 0.67 to 0.74, a notable enhancement relative to using the full-gene set. These results suggest that our approach is a reliable indicator of imputation confidence. Applying the same strategy to other methods like SpaGE and stPlus yielded comparable improvements (Supplementary Table S2). To further confirm the robustness of our MLP-based approach, we added random noise to the predicted gene expressions and then predicted their reliability scores. As shown in Supplementary Fig. S7, the results indicated that the gene expressions with added noise obtained significantly lower average reliability scores on all datasets, decreasing by 14.5% compared to the originally predicted gene expressions. These results demonstrate that our model can effectively learn the variability and patterns in the predicted gene expressions and accurately predict their reliability scores.
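The selection of the higher-confidence half of the genes can be sketched as a median split on the predicted reliability scores, in line with the description of Fig. 2e. The function name `reliable_genes`, the placeholder gene names, and the scores are hypothetical.

```python
from statistics import median

def reliable_genes(scores):
    """Return the genes whose predicted reliability score exceeds the median score."""
    cutoff = median(scores.values())
    return [gene for gene, score in scores.items() if score > cutoff]

# Placeholder reliability scores (illustrative values only).
scores = {"gene_a": 0.9, "gene_b": 0.2, "gene_c": 0.7, "gene_d": 0.4}
selected = reliable_genes(scores)
print(selected)
```

Downstream metrics such as CSS can then be recomputed on `selected` alone and compared with the value obtained on the full gene set.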

Clustering results

A crucial step in ST analyses involves the characterization of cell types. This is typically achieved using clustering methods that group cells based on the similarity of their gene expression profiles. In this study, we used the Leiden algorithm with default parameters to cluster cells based on the gene expression predicted by each method. The Leiden algorithm is known for its ability to identify high-quality, well-separated clusters in complex and large scRNA-seq datasets, which is crucial for accurately distinguishing different cell types and states. Previous studies commonly recommend the Leiden algorithm for cell clustering25–28. The clustering results were evaluated by the ARI, FMI, and Comp metrics. These metrics evaluate performance from different perspectives, and each has its limitations. ARI may not perform well with imbalanced class sizes, FMI can be sensitive to minor changes in clustering results, and the Completeness score may be biased in noisy datasets. Therefore, we applied all three clustering metrics together to test our method.

We did not evaluate the methods on the STARmap_AllenVISp and MouseBrain data since these datasets do not contain ground-truth cell types. As shown in Fig. 3a, b, our method achieved the best performance regarding ARI and FMI, which were 5% and 6% higher than those of the second-best method, stPlus, respectively. SpaGE and TransImpLR obtained similar clustering results and were better than Tangram. This is likely because Tangram is a versatile tool for aligning scRNA-seq and spatial transcriptomic data and is not specifically optimized for the task of spatial gene expression prediction. These results not only suggest that our method is capable of predicting spatially unmeasured gene expression but also demonstrate that the predicted data can preserve cell heterogeneity. A similar trend was observed with the Comp metric (Supplementary Fig. S8). The detailed information is shown in Supplementary Table S3.

Fig. 3. The clustering results of each method when the gene expression of ST cells predicted by each method is used to cluster ST cells.

The cells were grouped using the Leiden clustering algorithm, and the clustering results were evaluated through two distinct metrics: ARI in (a) and FMI in (b). This analysis includes n = 4 biologically independent datasets. Additionally, (c) presents UMAP-generated visualizations of the predicted gene expression on the case data MERFISH, where each distinct color represents the corresponding true cell type labels.

We further visualized the predicted gene expression data using the Uniform Manifold Approximation and Projection (UMAP). As depicted in Fig. 3c, our approach, along with TransImpLR, successfully differentiated the majority of cell types. In contrast, methods such as Tangram and gimVI showed a tendency to blend certain cell types. In summary, the gene expression patterns predicted by each method were largely preserved, with the majority of cells within each type clustering together, while distinct cell types were effectively segregated.

Instead of 5-fold cross-validation, we further used the p shared genes to train the models and predict the expression of the q genes unique to the scRNA-seq data. Because ground truth for these q genes is lacking in the ST data, we validated the gene expression predictions using a clustering strategy. Owing to the high dimensionality of the q genes, we applied PCA to obtain a low-dimensional representation for clustering. We also evaluated the clustering performance on the 50 most highly variable genes among the q genes, selected with the Scanpy package (ref. 29). As shown in Supplementary Fig. S9, our method consistently outperformed competing methods in terms of clustering results, with ARI increases of 1.5% and 3.75% over the second-ranked method, stPlus, in the two settings, respectively. These results indicate that the spatial expression of genes unique to scRNA-seq data, as predicted by our method, can deepen our understanding of cell heterogeneity.
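
The idea of restricting clustering to the most variable genes can be sketched as follows. This is a deliberately simplified stand-in that ranks genes by raw variance across cells; Scanpy's own highly-variable-gene selection uses normalized dispersion and was not reimplemented here. The function name `top_variable_genes` and the toy matrix are hypothetical.

```python
from statistics import pvariance


def top_variable_genes(expression, gene_names, k=50):
    """Rank genes by variance across cells and return the k most variable names.

    expression: list of per-cell rows, each a list with one value per gene.
    """
    if not expression or len(expression[0]) != len(gene_names):
        raise ValueError("expression columns must match gene_names")
    columns = list(zip(*expression))  # one tuple of values per gene
    variances = [pvariance(col) for col in columns]
    ranked = sorted(range(len(gene_names)), key=lambda g: variances[g], reverse=True)
    return [gene_names[g] for g in ranked[:k]]


# Toy matrix: 3 cells x 3 genes; gene_a is constant, gene_c varies the most.
toy = [[1, 0, 5],
       [1, 2, 1],
       [1, 4, 9]]
selected = top_variable_genes(toy, ["gene_a", "gene_b", "gene_c"], k=2)
```

The selected columns would then be passed to the same clustering and scoring steps as the full predicted matrix.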

Co-expression network analysis

To further validate the biological relevance of the gene expression predicted by our model, we constructed gene co-expression networks from the predicted data and analyzed the formation of biologically meaningful modules. Specifically, we employed the K-nearest neighbors algorithm to cluster functionally related genes based on predicted gene expression. Using mouse brain data as an example (Supplementary Fig. S10), we identified a cohort of genes with high similarity scores, including Pax3 and Pax7. Previous studies have shown that Pax3 and Pax7 interact genetically and physically with Meis2 during dorsal midbrain development (ref. 30). Additionally, we identified a group of genes, such as Irx3 and Irx5, which have been demonstrated in earlier studies to be novel regulatory factors in postnatal hypothalamic neurogenesis (ref. 31). To evaluate the performance of our method on the q predicted genes, we used the osmFISH_Zeisel data to conduct a co-expression analysis. As shown in Supplementary Fig. S11, we identified a cohort of genes with high similarity scores, including C1qb and C1qa. Previous studies have demonstrated that genes in the CP, C1qa, C1qb, C1ra, and C4b are among the most enriched in adult mouse brains (ref. 32). Similarly, we identified a group of genes, such as Lyz2, Ccl24, and Pf4. Prior research on the brain has shown that Ccl24 and Pf4 are primarily expressed in Lyve1+ BAM cells (ref. 33).
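
A minimal sketch of such a nearest-neighbor co-expression lookup is given below. It assumes cosine similarity between per-gene expression vectors as the similarity score, which the text does not specify; the function names, the value of k, and the toy profiles are likewise hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coexpression_neighbors(profiles, k=2):
    """Map each gene to its k most similar genes by cosine similarity.

    profiles: dict of gene name -> predicted expression across the same cells.
    """
    neighbors = {}
    for gene, vec in profiles.items():
        scored = [(cosine(vec, other), name)
                  for name, other in profiles.items() if name != gene]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        neighbors[gene] = [name for _, name in scored[:k]]
    return neighbors


# Toy profiles over four cells: g1/g2 co-vary, as do g3/g4.
toy_profiles = {
    "g1": [1, 2, 3, 0],
    "g2": [2, 4, 6, 0],
    "g3": [0, 0, 1, 5],
    "g4": [0, 1, 0, 4],
}
nearest = coexpression_neighbors(toy_profiles, k=1)
```

Genes that repeatedly appear in each other's neighbor lists form the candidate modules that are then compared against known biology.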

Visualization for spatial gene patterns

To further confirm that our method preserves biological meaning, we examined the spatial patterns of reconstructed genes on the osmFISH data (Fig. 4). We used heatmaps to visualize spatial gene expression patterns; in these heatmaps, each point represents the expression level of a gene in a specific region, and the color intensity indicates the level of expression. The visualizations effectively displayed the spatial distribution of gene expression. We first examined genes known to exhibit spatial patterns, such as Rorb and Slc32a1. For these genes, our method consistently recovered spatial patterns similar to those of the measured genes. In contrast, competing methods struggled to recover the correct spatial patterns, often over- or under-predicting expression. For instance, for the gene Rorb (Fig. 4a), our method, stPlus, and SpaGE showed spatial patterns similar to the measured gene expression, whereas Tangram exhibited a higher-contrast expression pattern and TransImpLR and gimVI showed lower-contrast patterns. Additionally, our method achieved the best performance in terms of the CSS and MSE metrics. We found that the imputation by stImpute for Rorb indicates additional expression along the top curve. We therefore compared the predicted spatial expression pattern with known annotations for Rorb (ref. 8). That study showed that while Rorb is predominantly enriched in the middle region, there is some biological basis for its expression in other regions.

Fig. 4. The stImpute method demonstrated accurate predictions of case gene expression levels, which were found to be in concordance with spatial gene expression patterns observed through osmFISH data.

a For the case gene Rorb, the expression visualization is presented as predicted by various methods, accompanied by the corresponding CSS and MSE values for the gene. b Similarly, the expression visualization for the case gene Slc32a1 is depicted, and the associated CSS and MSE values are also provided for comparative analysis.

For the gene Slc32a1 (Fig. 4b), most competing methods tended to predict higher (stPlus, SpaGE, and Tangram) or lower (TransImpLR and gimVI) gene expression, while our method accurately captured the spatial gene patterns and obtained the best performance regarding the CSS and MSE metrics. In summary, these findings affirm that genes predicted by our approach retain and accurately represent biological meanings. A similar trend could be found for the case genes Aldoc and Gfap shown in Supplementary Fig. S12.
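
To make the distinction between pattern agreement and magnitude agreement concrete, the sketch below computes both scores for a single gene. It assumes that CSS denotes the cosine similarity between the measured and predicted expression vectors of a gene and that MSE is the plain mean squared error; any normalization applied before scoring in this study is not reproduced, and the function names and toy values are hypothetical.

```python
from math import sqrt


def css(measured, predicted):
    """Cosine similarity between measured and predicted expression of one gene."""
    dot = sum(m * p for m, p in zip(measured, predicted))
    norm = sqrt(sum(m * m for m in measured)) * sqrt(sum(p * p for p in predicted))
    return dot / norm if norm else 0.0


def mse(measured, predicted):
    """Mean squared error between measured and predicted expression of one gene."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)


# Toy gene: the prediction has the right spatial pattern but doubled magnitude.
measured = [1.0, 0.0, 2.0, 0.0]
over_predicted = [2.0, 0.0, 4.0, 0.0]
pattern_score = css(measured, over_predicted)   # close to 1: same pattern
magnitude_error = mse(measured, over_predicted)  # 1.25: penalizes over-expression
```

An over-expressed prediction of this kind keeps a high CSS but a poor MSE, so reporting both metrics separates errors in spatial pattern from errors in expression level.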

The importance of each module

To explore the impact of each component in stImpute, we conducted ablation experiments across all datasets. As depicted in Fig. 5a and Supplementary Table S4, omitting the GNN module resulted in a substantial decrease of 4% and 5% in the average CSS and MSE values at the gene level, respectively, highlighting the ability of the GNN module to capture gene relationships through information propagation among similar genes in the graph. Excluding the ESM-2 encoder led to a reduction of 2% and 3% in the CSS and MSE, respectively, showing that the gene relation network constructed from ESM-2 is pivotal to the prediction results. In conclusion, these modules work together to improve gene prediction for spatial data.

Fig. 5. Ablation studies and parameter sensitivity analysis.

a The results of the ablation study quantified by CSS and MSE metrics, which delineate the impact of individual components on the overall performance; and b an evaluation of the parameter robustness within the stImpute algorithm on the osmFISH_Zeisel data, demonstrating its resilience to variations in input parameters.

We further conducted a loss ablation on all datasets. As shown in Supplementary Fig. S13, the combined use of both loss components yielded the best performance. Removing the Lrec loss component resulted in a decrease of 20% in average CSS and 36% in MSE. Removing the Lsim loss component resulted in a 3% decrease in average MSE, while the average CSS remained unchanged. To evaluate the influence of explicitly adding an alignment mechanism to the co-embedding training, we fed the autoencoder with the scRNA-seq and spatial data simultaneously and aligned them with the classical domain adaptation algorithm Maximum Mean Discrepancy (MMD) to address the domain differences. As shown in Supplementary Fig. S14, co-embedding with MMD did not improve the average CSS. The domain adaptation mechanism likely struggled to capture the inherent differences between ST and scRNA-seq data, leading to overcorrection of these datasets.
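
For readers unfamiliar with MMD, the quantity being minimized in such an alignment can be sketched as below. This is the standard biased estimator of the squared MMD; the Gaussian (RBF) kernel, the bandwidth parameter `gamma`, and the toy embeddings are assumptions made for illustration and are not taken from the stImpute implementation.

```python
from math import exp


def rbf(x, y, gamma):
    """Gaussian kernel exp(-gamma * ||x - y||^2) between two embedding vectors."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two sets of embeddings."""
    if not xs or not ys:
        raise ValueError("both sample sets must be non-empty")

    def mean_kernel(a, b):
        return sum(rbf(p, q, gamma) for p in a for q in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy embeddings: the "spatial" set is either identical to or far from the reference.
reference = [[0.0, 0.0], [1.0, 0.0]]
shifted = [[10.0, 10.0], [11.0, 10.0]]
same_domain = mmd_squared(reference, reference)  # ~0 when distributions match
cross_domain = mmd_squared(reference, shifted)   # clearly positive when they differ
```

Adding this term to the training loss pulls the two embedding distributions together, which is also how it can overcorrect when the domains differ for genuine biological reasons.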

Sensitivity analysis

We empirically selected the hyperparameters of our model by balancing performance and computational efficiency. To enhance the reliability of our results, we conducted a sensitivity analysis on the seed, epoch, learning rate (lr), layer size, and dimension of the AE module, as well as the number of layers, lr, number of neighbors, and dimension in the GNN module. As shown in Fig. 5b and Supplementary Fig. S15, the results demonstrate that our method is robust across a reasonable range of hyperparameter values, consistently outperforming baseline methods.

To ensure a fair comparison, we conducted additional clustering experiments with varying resolution values to assess their impact on the number of clusters generated. As shown in Supplementary Fig. S16, all methods exhibited consistently low performance across different resolution values. Similar results were observed when clustering real sequenced gene expression data. The low clustering performance may be attributed to intrinsic properties of the datasets, such as high noise levels or complex cell-type heterogeneity.

Discussion

Image-based spatial transcriptomic sequencing technologies have enabled the measurement of gene expression at single-cell resolution but with a limited number of genes. Current computational approaches attempt to overcome these limitations by imputing missing genes, but face challenges regarding prediction accuracy and identification of cell populations due to the neglect of gene-gene relationships. Here, we present stImpute, a method to impute spatial transcriptomics according to reference scRNA-seq data based on the gene network constructed from the protein language model ESM-2. The gene prediction uncertainty is further measured through a deep learning model. stImpute is the first approach to impute spatial genes while considering gene-to-gene relationships through a graph neural network. More importantly, stImpute bridges the gap between genomic data and functional proteomics, which holds significant potential for advancements in genetic research. stImpute was shown to consistently outperform state-of-the-art methods across multiple datasets in both imputation and clustering. Because some of the datasets used in this study did not provide ground-truth cell types, we did not evaluate clustering performance on them. However, we reported other metrics, including CSS and MSE, for these datasets and performed a co-expression analysis for the unlabeled dataset; the predicted gene expression preserved the biological signal, as shown in Supplementary Fig. S10. Notably, the resource requirements of our method increase with cell count, reaching a memory usage of 13.8 GB and a runtime of 961 minutes for 1.3 million cells. These results indicate that our method can be extended to less powerful systems.

Beyond the inherent advantages of our approach, we outline several strategies that could further improve stImpute. Firstly, beyond the gene relationships captured by embeddings of the gene-encoded proteins, other relationships, such as those found in functional pathways, could potentially enhance the gene imputation process. We plan to explore incorporating this biological information in future work. Secondly, we acknowledge that keeping the autoencoder and GNN modules separate may not be optimal. In future work, we will explore integrating these modules to improve performance. Thirdly, we followed standard normalization steps during dataset preprocessing. However, since the choice of normalization is critical, we will explore which preprocessing methods are most suitable for different datasets in the future.

Statistics and reproducibility

In this research, we made use of datasets that are accessible to the public. We did not use any statistical techniques to set the sample sizes beforehand; instead, we adhered to the sample sizes that have been reported in earlier studies (ref. 11). After conducting a thorough quality check, we included all the data in our analysis, ensuring no data points were left out. It is important to note that our experiments were not randomized.

Supplementary information

Reporting summary (71.1KB, pdf)

Acknowledgements

This study has been supported by the National Key R&D Program of China (2022YFF1203100), the National Natural Science Foundation of China (T2394502), the Research and Development Project of Pazhou Lab (Huangpu) (2023K0606), the Postdoctoral Fellowship Program of CPSF (GZC20233321), the Fundamental Research Funds for the Central Universities (2024IAIS-QN020), and the National Natural Science Foundation of China Youth Program (62402071).

Author contributions

Y.Y. conceived and supervised the project. Y.Z. and Y.S. developed and implemented the stImpute algorithm. Y.Y., H.Z., and Y.Z. validated the methods and wrote the paper. Z.D. and C.Z. conducted the biological analysis. H.L., Y.Z., W.Y., and S.Z. discussed and performed the rebuttal experiments. All authors read and approved the final paper.

Peer review

Peer review information

Communications Biology thanks Nguyen Quoc Khanh Le and the other, anonymous, reviewers for their contribution to the peer review of this work. Primary Handling Editors: Aylin Bircan and Laura Rodriguez Perez.

Data availability

This research did not generate any new data; instead, it utilized publicly available datasets, as detailed previously (see Table 1 for specifics). The datasets STARmap_AllenVISp, osmFISH_Zeisel, osmFISH_AllenVISp, osmFISH_AllenSSp, and MERFISH_Moffitt were obtained from the public repository at 10.5281/zenodo.3967290 (ref. 34). The MouseBrain dataset was downloaded from ref. 35. The large tonsil Xenium dataset was acquired from https://www.10xgenomics.com/datasets/human-tonsil-data-xenium-human-multi-tissue-and-cancer-panel-1-standard. The GSM5051495 dataset is available through the GEO accession number GSM5051495. The Mouse_spermatogenesis datasets (Dia and WT) were downloaded from https://www.dropbox.com/s/ygzpj0d0oh67br0/Testis_Slideseq_Data.zip?dl=0. Finally, the GSE112393 dataset is accessible via the GEO accession number GSE112393.

Code availability

All source codes used in our experiments have been deposited at https://github.com/cquzys/stImpute. A Zenodo version is also available at 10.5281/zenodo.13823043 (ref. 36).

Competing interests

The authors declare no competing interests. Yuedong Yang is an Editorial Board Member for Communications Biology, but was not involved in the editorial review of, nor the decision to publish this article.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors jointly supervised this work: Zhiming Dai, Yuedong Yang.

Contributor Information

Zhiming Dai, Email: daizhim@mail.sysu.edu.cn.

Yuedong Yang, Email: yangyd25@mail.sysu.edu.cn.

Supplementary information

The online version contains supplementary material available at 10.1038/s42003-024-06964-2.

References

  • 1. Ji, A. L. et al. Multimodal analysis of composition and spatial architecture in human squamous cell carcinoma. Cell 182, 497–514 (2020).
  • 2. Eng, C.-H. L. et al. Transcriptome-scale super-resolved imaging in tissues by rna seqfish+. Nature 568, 235–239 (2019).
  • 3. Codeluppi, S. et al. Spatial organization of the somatosensory cortex revealed by osmfish. Nat. Methods 15, 932–935 (2018).
  • 4. Wang, X. et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 361, eaat5691 (2018).
  • 5. Moffitt, J. R. et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362, eaau5324 (2018).
  • 6. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with harmony. Nat. Methods 16, 1289–1296 (2019).
  • 7. Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
  • 8. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
  • 9. Lopez, R. et al. A joint model of unpaired data from scrna-seq and spatial transcriptomics for imputing missing gene expression measurements. arXiv preprint arXiv:1905.02269 (2019).
  • 10. Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell rna-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
  • 11. Abdelaal, T., Mourragui, S., Mahfouz, A. & Reinders, M. J. Spage: spatial gene enhancement using scrna-seq. Nucleic acids Res. 48, e107 (2020).
  • 12. Shengquan, C., Boheng, Z., Xiaoyang, C., Xuegong, Z. & Rui, J. stplus: a reference-based method for the accurate enhancement of spatial transcriptomics. Bioinformatics 37, i299–i307 (2021).
  • 13. Biancalani, T. et al. Deep learning and alignment of spatially resolved single-cell transcriptomes with tangram. Nat. methods 18, 1352–1362 (2021).
  • 14. Qiao, C. & Huang, Y. Reliable imputation of spatial transcriptome with uncertainty estimation and spatial regularization. Available at SSRN 4544286 (2023).
  • 15. Rao, J., Zhou, X., Lu, Y., Zhao, H. & Yang, Y. Imputing single-cell rna-seq data by combining graph convolution and autoencoder neural networks. Iscience 24 (2021).
  • 16. Roohani, Y., Huang, K. & Leskovec, J. Predicting transcriptional outcomes of novel multigene perturbations with gears. Nature Biotechnology 1–9 (2023).
  • 17. Lin, Z. et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv (2022).
  • 18. Hamilton, W. L., Ying, Z. & Leskovec, J. Inductive representation learning on large graphs. In Neural Information Processing Systems https://api.semanticscholar.org/CorpusID:4755450 (2017).
  • 19. Zhang, W. et al. Adapgl: An adaptive graph learning algorithm for traffic prediction based on spatiotemporal neural networks. Transportation Res. Part C: Emerg. Technol. 139, 103659 (2022).
  • 20. Shen, R. et al. Spatial-id: a cell typing method for spatially resolved transcriptomics via transfer learning and spatial embedding. Nat. Commun. 13, 7640 (2022).
  • 21. Rand, W. M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971).
  • 22. Fowlkes, E. B. & Mallows, C. L. A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 78, 553–569 (1983).
  • 23. Nguyen, X. V., Epps, J. & Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010).
  • 24. Traag, V. A., Waltman, L. & van Eck, N. J. From louvain to leiden: guaranteeing well-connected communities. Scientific Reports 9 (2018).
  • 25. Heumos, L. et al. Best practices for single-cell analysis across modalities. Nat. Rev. Genet. 24, 550–572 (2023).
  • 26. Cui, H. et al. scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nature Methods 1–11 (2024).
  • 27. Camargo, A. P. et al. Identification of mobile genetic elements with genomad. Nature Biotechnology 1–10 (2023).
  • 28. Yisimayi, A. et al. Repeated omicron exposures override ancestral sars-cov-2 immune imprinting. Nature 625, 148–156 (2024).
  • 29. Wolf, F. A., Angerer, P. & Theis, F. J. Scanpy: large-scale single-cell gene expression data analysis. Genome Biol. 19, 1–5 (2018).
  • 30. Agoston, Z., Li, N., Haslinger, A., Wizenmann, A. & Schulte, D. Genetic and physical interaction of meis2, pax3 and pax7 during dorsal midbrain development. BMC developmental Biol. 12, 1–12 (2012).
  • 31. Dou, Z., Son, J. E. & Hui, C.-c. Irx3 and irx5-novel regulatory factors of postnatal hypothalamic neurogenesis. Front. Neurosci. 15, 763856 (2021).
  • 32. Zhang, Y. et al. Spatial and temporal profiling of the complement system uncovered novel functions of the alternative complement pathway in brain development. bioRxiv 2023–11 (2023).
  • 33. Kim, J.-S. et al. A binary cre transgenic approach dissects microglia and cns border-associated macrophages. Immunity 54, 176–190 (2021).
  • 34. Abdelaal, T., Mourragui, S., Mahfouz, A. & Reinders, M. J. T. Starmap_allenvisp, osmfish_zeisel, osmfish_allenvisp, osmfish_allenssp, and merfish_moffitt-spatial gene enhancement using scrna-seq [data set]. Zenodo 10.5281/zenodo.3967291 (2020).
  • 35. Abdelaal, T. et al. Mousebrain-spatial inference of rna velocity at the single-cell resolution [data set]. Zenodo (2024).
  • 36. Zeng, Y. et al. Imputing spatial transcriptomics through gene network constructed from protein language model. 10.5281/zenodo.13823043 (2024).


Articles from Communications Biology are provided here courtesy of Nature Publishing Group
