Abstract
Decoding spatial transcriptomes from single-cell RNA sequencing (scRNA-seq) data has become a fundamental technique for understanding multicellular systems; however, existing computational methods lack both accuracy and biological interpretability due to their model-free frameworks. Here, we introduce Perler, a model-based method to integrate scRNA-seq data with reference in situ hybridization (ISH) data. To calibrate differences between these datasets, we develop a biologically interpretable model that uses generative linear mapping based on a Gaussian mixture model fitted with the expectation–maximization algorithm. Perler accurately predicts the spatial gene expression of Drosophila embryos, zebrafish embryos, mammalian liver, and mouse visual cortex from scRNA-seq data. Furthermore, the reconstructed transcriptomes do not overfit the ISH data and preserve the timing information of the scRNA-seq data. These results demonstrate the generalizability of Perler for dataset integration, thereby providing a biologically interpretable framework for accurate reconstruction of spatial transcriptomes in any multicellular system.
Subject terms: Transcriptomics, Machine learning, Gene expression
Single cell RNA-seq loses spatial information of gene expression in multicellular systems because tissue must be dissociated. Here, the authors show the spatial gene expression profiles can be both accurately and robustly reconstructed by a new computational method using a generative linear mapping, Perler.
Introduction
Genes are heterogeneously expressed in multicellular systems, and their spatial profiles are tightly linked to biological functions. In developing embryos, spatial gene-expression patterns are responsible for coordinated cell behavior (e.g., differentiation and deformation) that regulates morphogenesis1. In addition, within organ tissues, cells at different locations play different roles in organ function based on their gene-expression patterns2. Thus, identification of spatial genome-wide gene-expression profiles is key to understanding the functions of various multicellular systems. In situ hybridization (ISH) has been widely used to visualize spatial profiles of gene expression; however, application of this method is generally limited to only small numbers of genes. By contrast, the single-cell RNA-sequencing (scRNA-seq) method developed during the previous decade has enabled measurement of genome-wide gene-expression profiles in tissues at the single-cell level3. However, this method requires tissue dissociation, which leads to loss of spatial information for the original cells.
To compensate for the lost spatial information, new computational approaches have emerged (Seurat (v.1)4, DistMap5, Achim et al.6, Halpern et al.7), enabling reconstruction of genome-wide spatial expression profiles from scRNA-seq data by integrating existing ISH data as a spatial reference map in silico. However, these methods require either binarization of gene-expression data8, which leads to unsatisfactory accuracy, or tissue-specific modeling, which makes them difficult to apply to other systems. Recently, the seminal methods Seurat (v.3)9 and Liger10,11 were developed to address gene-expression data as continuous variables in a non-tissue-specific manner. These methods match the distributions of ISH and scRNA-seq data points by using dimensionality reduction [e.g., canonical correlation analysis (CCA)12 and integrative nonnegative matrix factorization (iNMF)13], followed by mapping the scRNA-seq data points to the nearest ISH data points, according to Euclidean distance using nearest-neighbor (NN) methods (e.g., k-NN14 and mutual NN15). However, a major issue is that the flexibility of these methods allows mapping of ISH data to scRNA-seq data without any model of the underlying scRNA-seq data structure. Specifically, these methods do not account for differences in the gene-expression noise associated with each gene. Given this model-free property, these methods are dependent upon nonlinear NN mapping, which innately causes overfitting to the reference ISH data.
To address these issues, we propose a model-based computational method for probabilistic embryo reconstruction by linear evaluation of scRNA-seq (Perler), which reconstructs spatial gene-expression profiles via generative linear modeling in a biologically interpretable framework. Perler addresses gene-expression profiles as continuous variables and models generative linear mapping from ISH data points into the scRNA-seq space. To estimate parameters of the linear mapping, we developed a method based on the expectation–maximization (EM) algorithm14. Using the estimated parameters, we also propose an optimization method to infer spatial information of scRNA-seq data within a tissue sample. We applied this method to existing Drosophila scRNA-seq data5 and successfully reconstructed spatial gene-expression profiles in Drosophila early embryos that were more accurate than those generated using another spatial reconstruction method (DistMap5). In addition, we showed that Perler can reconstruct a spatial gene-expression pattern that could not be fully predicted using previous methods, including Seurat (v.3), Liger, and DistMap. Further analysis revealed that Perler was able to preserve the timing information of the scRNA-seq data without overfitting to the reference ISH data. Furthermore, we demonstrated that this method accurately predicted spatial gene-expression profiles in early zebrafish embryos4, the mammalian liver7, and the mouse visual cortex16,17. These findings demonstrate Perler as a robust, generalized framework for predicting spatial transcriptomes from any type of ISH data for any multicellular system without overfitting to the reference.
Results
Framework of spatial reconstruction in Perler
Perler is a computational method for model-based prediction of spatial genome-wide expression profiles from scRNA-seq data that works by referencing spatial gene-expression profiles measured by ISH (Fig. 1a). In general, scRNA-seq data have higher dimensionality (on the order of ~10,000 genes), but do not contain information on spatial coordinates in tissues. By contrast, reference ISH data contain expression information for D genes in each cell or tissue subregion, with these referred to as landmark genes (e.g., D = 84 in Drosophila melanogaster early embryos) and tagged with spatial coordinates in tissues.
The Perler procedure involves two steps. The first step estimates a generative linear model-based mapping function that transforms ISH data into the scRNA-seq space, thereby enabling calculation of pairwise distances between ISH data and scRNA-seq data (Fig. 1b). The second step reconstructs spatial gene-expression profiles according to the weighted mean of scRNA-seq data, which is optimized by the mapping function estimated in the first step (Fig. 1c).
The first step considers gene-specific differences between scRNA-seq and ISH measurements. For example, we assume that some genes are more or less sensitive to ISH or scRNA-seq, and subject to high or low background signals in the associated data. We account for gene-specific noise intensity, because gene expression fluctuates over time in a gene-specific manner18. These differences in sensitivity, background signals, and noise intensity can be expressed by linear mapping:
$$y_i = a_i h_i + b_i + c_i \xi_i$$
where yi and hi denote the expression levels of landmark gene i measured by scRNA-seq and ISH, respectively; ai, bi, and ci are constant parameters of gene i and interpreted as the sensitivity coefficient, background signal, and noise intensity, respectively; and ξi indicates standard Gaussian noise. Note that ai, bi, and ci are different for each gene, and that these parameter values are unknown. To estimate this linear mapping from the data, we developed a generative model in which scRNA-seq data points are generated/derived from each cell in the tissue, whose expression is measured by ISH (see “Methods”). We then derived a parameter estimation procedure based on the EM algorithm (see “Methods”). Using the estimated parameters, a gene-expression vector for each cell in a given tissue sample measured by ISH can be mapped to the scRNA-seq space, thereby allowing evaluation of pairwise distances between ISH and scRNA-seq data.
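The linear observation model can be illustrated with a minimal simulation. The sketch below draws synthetic scRNA-seq values for one landmark gene from ISH values and then recovers the three parameters by ordinary least squares. It assumes, unlike the real setting, that each scRNA-seq value is already paired with its ISH value; in Perler this pairing is unknown, which is why the EM algorithm is needed. The function names and the numerical values are hypothetical.

```python
import numpy as np

def simulate_scrnaseq(h, a, b, c, rng):
    """Draw y = a*h + b + c*xi for one landmark gene, with xi ~ N(0, 1)."""
    return a * h + b + c * rng.standard_normal(h.shape)

def fit_paired(h, y):
    """Least-squares estimate of (a, b, c) when every y is paired with its h.

    This pairing is an assumption made for illustration only.
    """
    a_hat, b_hat = np.polyfit(h, y, deg=1)   # slope, then intercept
    resid = y - (a_hat * h + b_hat)
    c_hat = resid.std(ddof=2)                # two fitted parameters
    return a_hat, b_hat, c_hat

rng = np.random.default_rng(0)
h = rng.uniform(0.0, 1.0, size=5000)         # synthetic ISH intensities
y = simulate_scrnaseq(h, a=2.0, b=0.5, c=0.3, rng=rng)
a_hat, b_hat, c_hat = fit_paired(h, y)
print(a_hat, b_hat, c_hat)
```

With paired data the sensitivity, background, and noise intensity are each identifiable from a simple regression, which is the interpretation the model assigns to them.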
The second step reconstructs the spatial gene-expression profile in tissue from scRNA-seq data. We estimated gene expression of each cell in a tissue sample according to the weighted mean of all scRNA-seq data points, where the weights were determined by the pairwise distances between cells in tissue samples measured using ISH and scRNA-seq data points (Fig. 1c). Weights were evaluated by Mahalanobis distance, which accounts for the reliability of each gene depending on its noise intensity (Fig. 1d, Eqs. (22–25)). For the best prediction, we optimized the hyperparameters of the weighting function to ensure that the predicted and referenced landmark gene-expression profiles were well-correlated by cross-validation (CV; Fig. 1e and see “Methods”). We then predicted non-landmark gene expression using the optimized weighting function. Data were preprocessed before the first step. Some landmark genes redundantly exhibit similar spatial expression patterns, which can lead to biased parameter estimation and cause a loss of mapping ability. To reduce redundancy in scRNA-seq and ISH data, we performed dimensionality reduction using partial least squares correlation analysis (PLSC)19 (see “Methods”). Each factor in the reduced dimension can be interpreted as a “metagene”, which is representative among a highly correlated gene cluster, with its coordinate corresponding to the expression level of the metagene. In Perler, we regarded the metagene-expression level i (i.e., factor i) in the scRNA-seq and ISH spaces as yi and hi in the equation above (see “Methods”).
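The weighted-mean reconstruction in the second step can be sketched as follows. The exact weighting function and its hyperparameters (Eqs. (22–25)) are not reproduced here; this sketch assumes a softmax over negative squared Mahalanobis distances with a single hypothetical width parameter `tau`, and all function names are illustrative.

```python
import numpy as np

def mahalanobis_sq(r_mapped, x, sigma):
    """Squared Mahalanobis distance with a diagonal covariance.

    r_mapped: (K, M) ISH cells mapped into the scRNA-seq space
    x:        (N, M) scRNA-seq cells
    sigma:    (M,)   noise intensity per metagene
    """
    diff = (r_mapped[:, None, :] - x[None, :, :]) / sigma
    return np.sum(diff ** 2, axis=2)              # shape (K, N)

def reconstruct(r_mapped, x, sigma, expr, tau=1.0):
    """Weighted mean of scRNA-seq expression for each tissue cell.

    The softmax weighting and `tau` are assumptions, not the published form.
    """
    logits = -mahalanobis_sq(r_mapped, x, sigma) / (2.0 * tau)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)             # each row sums to 1
    return w @ expr, w

# Two tissue cells and six scRNA-seq cells, three near each tissue cell.
r_mapped = np.array([[0.0, 0.0], [10.0, 0.0]])
x = np.array([[0.1, 0.0], [-0.1, 0.2], [0.0, -0.1],
              [10.1, 0.0], [9.9, 0.1], [10.0, -0.2]])
sigma = np.array([1.0, 1.0])
expr = np.array([[1.0], [1.0], [1.0], [5.0], [5.0], [5.0]])  # one gene
recon, weights = reconstruct(r_mapped, x, sigma, expr)
print(recon.ravel())
```

Dividing by `sigma` before squaring is what lets low-noise metagenes dominate the distance, so that noisier metagenes contribute less to the weights.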
Model-based mapping between scRNA-seq and ISH data
Previously, Karaiskos et al.5 measured gene expression in individual cells dissociated from early D. melanogaster embryos at developmental stage 6 by scRNA-seq, followed by development of a computational method (DistMap) to reconstruct the spatial gene-expression profile of the embryos from the scRNA-seq data. As reference data, they used a spatial gene-expression atlas provided by the Berkeley Drosophila Transcription Network Project (BDTNP)20,21, in which the expression of 84 landmark genes was quantitatively measured by fluorescence in situ hybridization (FISH) at single-cell resolution at developmental stage 5.
In the present study, we applied Perler to the same scRNA-seq dataset and used the 84 landmark genes from the BDTNP atlas as the spatial reference map. We then predicted the spatial gene-expression profiles for 8840 non-landmark genes. To compare Perler results with those of DistMap, we used the same normalization methods for the scRNA-seq dataset as the previous study5. For preprocessing, we manually extracted 60 metagenes as nonredundant clusters of the landmark genes by dimensionality reduction, because the expression of some landmark genes was correlated in both the ISH and the scRNA-seq data (Supplementary Fig. 1a, b). Then, Perler estimated the parameters of the linear mapping by integrating the scRNA-seq data with the ISH data (Supplementary Fig. 2a–c). The ISH data points mapped by the linear mapping were distributed consistently with the scRNA-seq data points (Fig. 2a and Supplementary Fig. 2d). We also confirmed that the linear mapping properly calibrated the difference between ISH and scRNA-seq data on each metagene level (Supplementary Fig. 3).
Perler can predict the origin of a scRNA-seq data point in a tissue by computing a posterior probability that a scRNA-seq data point was generated from each cell in the tissue sample. We found that the scRNA-seq data points were specifically assigned to cells in a small region (a few cell diameters) of the tissue (Fig. 2c, d and Supplementary Fig. 2e). We also evaluated the performance of other methods on this task (Supplementary Fig. 4), showing that Perler had superior performance to DistMap and equivalent performance to Liger and Seurat v.3.
Based on linear mapping, Perler reconstructed gene-expression profiles from scRNA-seq data. The reconstructed and referenced gene-expression profiles were well-correlated following optimization of the hyperparameters (Fig. 2b and Supplementary Fig. 5). The reconstruction accuracy of Perler (average correlation coefficient (aCC) = 0.83) was significantly higher than that of Seurat v.3 (aCC = 0.61), Liger (aCC = 0.61), and DistMap (aCC = 0.56; Fig. 2e, f and Supplementary Fig. 6).
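As a small illustration of the accuracy metric, the sketch below computes an average correlation coefficient between predicted and reference expression. It assumes that aCC is the mean over genes of the Pearson correlation taken across cells; the function name and toy arrays are hypothetical.

```python
import numpy as np

def average_correlation(pred, ref):
    """Mean over genes of the Pearson correlation across cells.

    pred, ref: arrays of shape (K cells, D genes).
    """
    p = pred - pred.mean(axis=0)
    r = ref - ref.mean(axis=0)
    num = (p * r).sum(axis=0)
    den = np.sqrt((p ** 2).sum(axis=0) * (r ** 2).sum(axis=0))
    return float(np.mean(num / den))

# Gene 0 is predicted up to scale and offset (correlation +1);
# gene 1 is predicted with inverted sign (correlation -1).
ref = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]])
pred = np.column_stack([2.0 * ref[:, 0] + 1.0, -ref[:, 1]])
acc = average_correlation(pred, ref)
print(acc)
```

Because the correlation is invariant to scale and offset, this metric scores the spatial pattern rather than absolute expression levels.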
We also evaluated the predictive performance of Perler by conducting leave-one-gene-out cross-validation (LOOCV) to confirm whether gene expression can be predicted following removal of the landmark gene of interest from the ISH data prior to training (Supplementary Figs. 7 and 8). The predictive accuracy of Perler (aCC = 0.59) was significantly higher than that of Seurat v.3 (aCC = 0.55), Liger (aCC = 0.51), and DistMap (aCC = 0.44; Supplementary Fig. 7). However, we noticed that, for some genes, predictive accuracy was lower than reconstruction accuracy (Supplementary Fig. 9a), indicating that genes can be classified as well-predicted or poorly predicted. To clarify the difference between these two classes of genes (Supplementary Table 1), we examined the correlated data structure among landmark genes (Supplementary Fig. 9b–d) and found that poorly predicted genes had different expression patterns between ISH and scRNA-seq (Supplementary Fig. 9d), suggesting that the loss of prediction accuracy was primarily caused by different correlated data structures between ISH and scRNA-seq. Further, we showed that the predictive accuracy of each well-predicted gene was not affected by eliminating the poorly predicted genes (aCC = 0.64 before and 0.63 after gene elimination), indicating that Perler robustly predicted the spatial gene-expression pattern irrespective of the poorly predicted genes.
These results demonstrated that Perler accurately reconstructed and predicted the spatial expression profiles of the landmark genes and was capable of doing this via simple linear mapping.
Validity and robustness of Perler
Next, to analyze the validity of Perler, we examined the effect of dimensionality reduction on reconstruction and predictive accuracy. We found that the high reconstruction accuracy (aCC > 0.78) was maintained regardless of the presence of the dimensionality reduction, whereas the predictive accuracy was significantly improved by dimensionality reduction (Supplementary Fig. 10a, b), indicating that dimensionality reduction was an important factor for avoiding overfitting to ISH data. We also examined the effect of optimizing hyperparameters and found that it significantly improved Perler performance (Supplementary Fig. 11a).
Moreover, we analyzed the robustness of Perler against downsampling of the landmark gene set. First, we evaluated the reconstruction performance under random selection of different numbers of landmark genes (Supplementary Fig. 12a). We found that reconstruction accuracy increased with the number of landmark genes, and that only 30 landmark genes were required to achieve performance comparable to that of Liger and Seurat v.3 using the full landmark gene set (aCC > 0.6). Second, we evaluated the resolution of the predicted origin of a scRNA-seq data point (Supplementary Fig. 13a), and found that the confidence for origin prediction of a scRNA-seq data point to a small region increased with the number of landmark genes, and that only 40 landmark genes were required for a scRNA-seq data point to be predicted to a small region (four-cell radius) with sufficient confidence (>0.5). To further demonstrate the robustness of Perler, we conducted tenfold CV, for which the folds were extracted from the Drosophila scRNA-seq data points22. In this CV scheme, the performance of methods was evaluated using scoring metrics that were previously used in the DREAM Single-Cell Transcriptomics challenge22. We found that Perler predicted the origin of the scRNA-seq data points more robustly than the other methods (Liger and Seurat v.3), although the top-ranked methods developed in the DREAM challenge had better performance than Perler (Supplementary Tables 2–4). Note that the metrics used in this DREAM challenge were designed assuming that DistMap prediction was ground truth; that is, a high score on these metrics indicates that a method performs similarly to DistMap.
Taken together, we conclude that, in terms of reconstructing gene-expression profiles and predicting the original location of scRNA-seq data points, Perler exhibits robustness against downsampling of the gene set and under tenfold CV of scRNA-seq data points.
Prediction of non-landmark genes
In addition to the landmark genes, Perler successfully predicted the spatial expression profiles of non-landmark genes along both anterior–posterior (A–P) and dorsoventral (D–V) axes (Fig. 3a, b). Furthermore, we evaluated the predicted spatial profile of 310 spatially restricted genes proposed by Bageritz et al.23 (Supplementary Figs. 14 and 15) and found that Perler was able to uncover the unknown spatial gene-expression pattern. Notably, we observed that spatial patterns predicted by Seurat (v.3), Liger, and DistMap were incomplete. For example, the predicted stripes disappeared in the ventral part of embryos (e.g., abd-A and Ubx in Fig. 3a), whereas this issue was not observed with Perler, which accurately predicted the stripe pattern, even in the ventral part of embryos.
Prediction of 14-stripe patterns of segment-polarity genes
We then presented the spatial predictions of “segment-polarity” genes, which are expressed in a 14-stripe pattern consistent with the parasegments that subdivide the trunk (main body) region of embryos (Fig. 3b)24–28. Although the BDTNP reference does not contain information concerning the genes expressed in the 14-stripe pattern, we found that Perler accurately predicted the spatial expression patterns of these segment-polarity genes, including engrailed (en), wingless (wg), hedgehog (hh), and midline (mid) (Fig. 3b)24–28. By contrast, all of the previous methods exhibited issues regarding prediction of the 14-stripe patterns. The predicted patterns demonstrated that DistMap and Seurat (v.3) were unable to predict any 14-stripe patterns, and that Liger partially predicted 14-stripe patterns, although the ventral part of each stripe was missing (Fig. 3b). These results suggested that Perler more accurately revealed the spatial gene-expression patterns of non-landmark genes.
We further analyzed the details of the gene-expression profiles of the segment-polarity genes within each parasegment. Each parasegment shows a four-cell width and is delimited by periodic expression of pair-rule genes and segment-polarity genes at the single-cell width resolution at stage 629,30 (Fig. 4a). First, we confirmed that the reconstructed patterns of ftz, eve, and odd were consistent with experimental results (Fig. 4b). In addition, the predicted stripes of wg were identified adjacent to the predicted stripes of en, and the predicted stripes of en were identified adjacent to the reconstructed stripes of odd (Fig. 4b, c). These results were consistent with experimental results29,30, strongly supporting the ability of Perler to reveal differences in spatial gene expression at single-cell resolution.
Preservation of timing information of scRNA-seq data
We then investigated the effect of timing differences between scRNA-seq (stage 6) and FISH (stage 5) experiments. Although most gene-expression patterns at stage 6 are the same as those at stage 5, several “pair-rule” genes (odd, prd, slp1, and run) exhibit stripe-doubling from the 7- to the 14-stripe expression patterns during stages 5 and 629 (Fig. 5a). Accordingly, the scRNA-seq data should intrinsically contain information for the 14-stripe expression pattern. Therefore, we determined whether Perler could reconstruct the 14-stripe pattern from the stage 6 scRNA-seq data.
In our reconstruction, ftz, eve, and h showed a 7-stripe pattern, which was consistent with the previous report25 showing that these genes do not exhibit stripe-doubling during stages 5 and 6 (Fig. 5b). For odd, prd, and slp1, which exhibit stripe-doubling, Perler reconstructions resulted in 14-stripe patterns (Fig. 5c). In addition, reconstruction of run resulted in a partial stripe-doubling pattern, where the third stripe from the posterior of the embryo was split into two stripes (Fig. 5c), surprisingly suggesting that Perler detected the ongoing transition from a 7-stripe to a 14-stripe pattern. These results showed that Perler was able to reconstruct embryos according to the timing of the scRNA-seq experiment. By contrast, Seurat (v.3) and DistMap reconstructed every pair-rule gene as 7-stripe patterns (Fig. 5b, c). Moreover, Liger reconstructed odd, prd, and slp1 as broad primary seven stripes with weak secondary seven stripes, which were so obscure that it was difficult to distinguish 14 stripes, and reconstructed run as a 7-stripe pattern (Fig. 5c). These results indicated that previous methods reconstructed embryos according to the timing of FISH experiments rather than that of scRNA-seq experiments. Taken together, these findings showed that Perler successfully reconstructed spatial gene-expression profiles according to the timing of scRNA-seq experiments (stage 6), regardless of the timing of FISH experiments (stage 5), while all other methods reconstructed those at the timing of FISH experiments. We concluded that Perler does not overfit to ISH data and robustly preserves timing information in scRNA-seq data.
Application to other datasets
To evaluate the applicability of Perler to other datasets, we tested it on three published datasets. First, we applied Perler to the zebrafish embryo datasets (Supplementary Figs. 1, 10–13, 16, and 17), in which the spatial reference map was binarized based on traditional measurement by ISH4 (Fig. 6a and see “Methods”). LOOCV demonstrated that Perler predicted spatial gene-expression profiles with accuracy comparable to that of Seurat (v.1)4 (median receiver operating characteristic (ROC) score = 0.97), even using the binary spatial reference map (Fig. 6a–c).
We then applied Perler to mammalian liver datasets (Supplementary Figs. 11, 18, and 19), in which the spatial reference map was measured by single-molecule (sm)FISH7 (Fig. 6d and see “Methods”). LOOCV showed that the predictive accuracy (aCC = 0.87) was sufficiently high (Fig. 6e), and that Perler successfully predicted both monotonic and non-monotonic gene-expression gradients (Fig. 6f)7.
Finally, we applied Perler to adult mouse visual cortex datasets (Supplementary Figs. 1, 11–13, 20, and 21), in which the single-cell resolution ISH data for 1020 genes were measured by recent in situ technology (STARmap16), and scRNA-seq data for 14,739 cells were available from the Allen Brain Atlas17. CV revealed that Perler predicted the spatial expression patterns of genes according to both layer-specific expression and cell-type-specific expression in brain cortex (Fig. 6g). We also applied Perler to another scRNA-seq dataset (194,027 cells; Drop-viz31), and found that the spatial gene-expression patterns predicted using the Drop-viz dataset were consistent with those predicted using the Allen Brain Atlas dataset (Supplementary Figs. 22–24). These results suggest that Perler is applicable for prediction using high-dimensional spatial reference maps.
Taken together, the findings support Perler as a powerful tool for predicting spatial gene-expression profiles in any multicellular system with general applicability to any type of ISH data (e.g., binary or continuous, low to high dimension, and single-cell to tissue-level resolution).
Discussion
In this study, we developed a model-based computational method (Perler) that predicts genome-wide spatial transcriptomes. Perler sequentially conducted a two-step computation, with the first step mapping ISH data points to the scRNA-seq space according to the generative linear model by EM algorithm (Fig. 1b), and the second step optimizing the weighting function used to predict spatial transcriptomes according to weighted scRNA-seq data points (Fig. 1c, d). Using a dataset for early Drosophila embryos, we demonstrated that Perler accurately reconstructed and predicted genome-wide spatial transcriptomes with robustness (Figs. 2–5). Moreover, we showed that in any multicellular system, Perler displayed broad applicability to any type of ISH data (Fig. 6).
We propose that Perler offers three innovative features. First, Perler can calibrate the difference between scRNA-seq and ISH measurement properties. To express this difference, we applied a “linear mapping model” assuming biologically interpretable constraints that expression levels are linearly correlated between ISH and scRNA-seq measurements with gene-specific sensitivity, background signals, and noise intensity, as in Eq. (1). Second, Perler can reliably reconstruct gene-expression patterns in a noise-resistant manner. Specifically, Perler can evaluate the extent to which each gene is reliable for reconstruction depending on the noise intensity (Fig. 1d), by using Mahalanobis pairwise distances (Eq. (23), related to Fig. 1c). As a result, more reliable genes with low noise make a larger contribution to the weights for the reconstruction, whereas less reliable genes with high noise make a smaller contribution. It should be stressed that such quantitative evaluation of gene reliability is possible only with a method using a generative model. Third, the model-based linear mapping used in Perler is beneficial in terms of the performance for gene-expression pattern reconstruction. To ensure generalized performance, we introduced generative linear modeling with biologically interpretable constraints and statistically reasonable distances. This model-based characteristic of Perler differs from Seurat (v.3)9 and Liger10, both based on model-free mapping between ISH and scRNA-seq data (e.g., CCA and NMF methods). Their model-free mapping addresses gene expression as continuous variables with applicability to any kind of multicellular system; however, these methods freely map ISH data to scRNA-seq data without any assumptions (i.e., they do not account for latent relationships between the two datasets). We showed that model-based Perler significantly improved reconstruction/prediction accuracy compared with other methods (Liger, Seurat v.3, and DistMap; Fig. 2e, f).
Further, in a demonstration using Drosophila data, Perler was found to preserve the timing information of scRNA-seq data and robustly reconstruct the spatial gene-expression patterns of the pair-rule genes, whereas this kind of robustness was not observed in the other, model-free methods (Liger, Seurat v.3, and DistMap; Fig. 5). For example, by focusing on the stripe-doubling of pair-rule genes in Drosophila, Perler successfully reconstructed 14-stripe patterns at a single-cell resolution, while Seurat v.3 and Liger were unable to effectively reconstruct these patterns, and overfit to the timing of ISH experiments (Fig. 5). We believe that these results highlight the importance of using model-based prediction of spatial gene-expression patterns. Additional characteristic features of Perler are summarized in Supplementary Table 5.
It is worth mentioning a recent method called novoSpaRc32. This method proposed a new concept for predicting spatial expression patterns using the physical information of cells in tissue, which enables these predictions with little or no information regarding ISH gene-expression patterns. However, in practice, its predictive ability using Drosophila scRNA-seq data is unsatisfactory at single-cell resolution; therefore, this concept of using cellular information remains challenging. As a focus of future study, it would be interesting to extend our generative model to introduce prior knowledge of physical information.
We demonstrated that Perler can integrate two distinct datasets of RNA-expression profiles, while also avoiding overfitting to the reference. These features suggest that Perler could be a suitable theoretical framework for integrating not only two RNA-expression datasets, but also two single-cell datasets with different modalities, such as chromatin accessibility measured by a single-cell assay for transposase-accessible chromatin using sequencing, and DNA methylation measured by chromatin immunoprecipitation sequencing. Particularly in terms of multi-omics analysis, where datasets from two different modalities do not exactly match and are often sampled from different individuals and using different time intervals33,34, Perler can potentially help integrate different types of single-cell genomics data. Thus, Perler provides a powerful and generalized framework for revealing the heterogeneity of multicellular systems.
Methods
We developed a method to reconstruct spatial gene-expression profiles from an scRNA-seq dataset via comparison with a spatial reference map measured by ISH-based methods. In the spatial reference map, landmark gene-expression vectors (D genes; e.g., D = 84 in early D. melanogaster embryos) are available for all cells, whose locations in the tissue are known. The landmark gene-expression vector of cell k is represented as hk = (hk,1, hk,2,…, hk,D)T, where cells are indexed by k (k ∈ {1,2,…, K}), and K is the total number of cells in the tissue of interest. By contrast, in an scRNA-seq dataset, genome-wide expression vectors (D′ genes; e.g., D′ = 8924 in early D. melanogaster embryos) lack information regarding cell location in tissue. The genome-wide expression vector of cell n is represented as yn = (yn,1, yn,2,…, yn,D′)T, where cells are indexed by n (n ∈ {1,2,…, N}), and N is the total number of cells used for scRNA-seq measurement.
Observation model
We modeled the difference between scRNA-seq and ISH measurements as
$$y_i = a_i h_i + b_i + c_i \xi_i \tag{1}$$
where yi and hi indicate expression levels of landmark gene i measured by scRNA-seq and ISH experiments, respectively; ξi indicates Gaussian noise with zero mean and unit variance; and ai, bi, and ci are constant parameters for gene i, which are interpreted as scale difference amplification rates, background signals, and noise intensities, respectively.
We reduced the dimensionality of the genes to change Eq. (1) to
$$x_j = a_j r_j + b_j + c_j \xi_j \tag{2}$$
where xj and rj indicate expression levels of metagene j for scRNA-seq and ISH in the lower dimensional space, j ∈ {1, 2,…, M}; and M indicates the number of metagenes. In vector–matrix representation, Eq. (2) is written as:
$$x = A r + b + C \xi \tag{3}$$
where x = (x1, x2, …, xM)T, r = (r1, r2, …, rM)T, A = diag(a1, a2, …, aM), b = (b1, b2, …, bM)T, C = diag(c1, c2, …, cM), and ξ = (ξ1, ξ2, …, ξM)T.
Metagene representation in lower dimensional space
The dimensionalities of both scRNA-seq and reference data were reduced by PLSC analysis19. PLSC can extract the correlated coordinates from both datasets. In PLSC analysis, the cross-correlation matrix of scRNA-seq and ISH data is first calculated as:
$$W = Y^{T} H \tag{4}$$
where Y and H indicate a D × N scRNA-seq data matrix with D landmark genes and N cells, and a D × K ISH data matrix with D landmark genes and K cells, respectively. W is then subjected to singular value decomposition as:
$$W = U \Delta V^{T} \tag{5}$$
where U, Δ, and V indicate the N × M left singular vector matrix, the M × M diagonal matrix of singular values, and the K × M right singular vector matrix, respectively, with M representing the reduced dimension (i.e., the number of metagenes). In this study, the metagene vectors for scRNA-seq (xn) and the reference data (rk) were, respectively, calculated by:
6 |
7 |
where un and vk indicate the nth row vector of U and the kth row vector of V, respectively.
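The PLSC step can be sketched with a truncated singular value decomposition. The sketch assumes that the cross-correlation matrix is the N × K product of the transposed scRNA-seq matrix with the ISH matrix, and that the rows of the two singular vector matrices are used directly as metagene coordinates; any centering or scaling applied in the original equations is not reproduced. The function name `plsc_metagenes` and the random test matrices are hypothetical.

```python
import numpy as np

def plsc_metagenes(Y, H, M):
    """Shared low-dimensional coordinates for scRNA-seq and ISH cells.

    Y: (D, N) scRNA-seq landmark-gene matrix
    H: (D, K) ISH landmark-gene matrix
    M: number of metagenes to keep
    Returns x (N, M), r (K, M), and the M leading singular values.
    """
    W = Y.T @ H                                   # (N, K), assumed form
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :M], Vt[:M, :].T, s[:M]

rng = np.random.default_rng(1)
Y = rng.random((5, 8))                            # D = 5 genes, N = 8 cells
H = rng.random((5, 6))                            # D = 5 genes, K = 6 cells
x, r, s = plsc_metagenes(Y, H, M=5)
print(x.shape, r.shape)
```

Since W has rank at most D, keeping M = D components reproduces W exactly here; choosing a smaller M discards the directions along which the two datasets covary least.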
A Gaussian mixture model (GMM) for scRNA-seq observation
We used Eq. (3) to transform ISH observations into scRNA-seq observations. To infer from which cells in the tissue the scRNA-seq observations originated, we developed a generative model for the metagene-expression vectors of scRNA-seq data, x, which was expressed by a K-component GMM:
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma) \tag{8}$$
where
$$\boldsymbol{\mu}_k = A\mathbf{r}_k + \mathbf{b}, \qquad \Sigma = CC^{\mathrm{T}} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_M^2) \tag{9}$$
(σj = cj), N(x|μ, Σ) indicates a multivariate Gaussian distribution with mean μ and variance-covariance matrix Σ, and πk is the probability that x originated from cell k in the tissue. Note that A, b, and Σ are unknown parameters that need to be estimated.
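The log-likelihood function of this GMM is given by:
$$\ln p(X \mid \theta) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n \mid A\mathbf{r}_k + \mathbf{b}, \Sigma) \right\} \tag{10}$$
where X = {x1, x2, …, xN} is the set of scRNA-seq metagene vectors, θ = {π, A, b, Σ} indicates the set of parameters, and π = (π1, π2, …, πK)T.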
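For reference, the log likelihood of a GMM whose component means are a_i r_ki + b_i and whose covariance is diagonal can be evaluated stably in log space. The sketch below assumes that structure; the function name and argument layout are illustrative and are not taken from the Perler code base.

```python
import numpy as np
from scipy.special import logsumexp

def gmm_log_likelihood(X, R, a, b, sigma, pi):
    """Log likelihood of X (N x M) under a K-component GMM.

    Component k has mean a * R[k] + b and covariance diag(sigma**2);
    R is K x M, and a, b, sigma are length-M vectors.
    """
    mu = a * R + b                                    # K x M component means
    z = (X[:, None, :] - mu[None, :, :]) / sigma      # N x K x M standardized residuals
    log_gauss = (-0.5 * np.sum(z**2, axis=2)
                 - np.sum(np.log(sigma))
                 - 0.5 * X.shape[1] * np.log(2 * np.pi))
    # Sum over components inside the log, then over data points.
    return float(np.sum(logsumexp(log_gauss + np.log(pi), axis=1)))
```

Working with `logsumexp` avoids underflow when a data point lies far from most component means, which is typical when K is in the thousands.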
EM algorithm (the first step in Perler)
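To estimate the unknown parameters (π, A, b, and Σ), we maximized the log-likelihood function using the EM algorithm. In the E step, based on the current parameter values, we calculated the responsibility, which represents the posterior probability that scRNA-seq vector xn was derived from cell k in the tissue, as:
$$\gamma_{nk} = \frac{\pi_k\, \mathcal{N}(\mathbf{x}_n \mid A\mathbf{r}_k + \mathbf{b}, \Sigma)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x}_n \mid A\mathbf{r}_j + \mathbf{b}, \Sigma)} \tag{11}$$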
In the M step, we optimized the parameter values to maximize the log-likelihood function given the current responsibilities. The parameter values were updated as follows:
12 |
13 |
14 |
15 |
where
16 |
17 |
18 |
19 |
The detailed derivation of these equations is presented in a later subsection. The E and M steps were iterated until the log-likelihood function converged, after which the estimated parameters Â and b̂ were used to map the ISH metagene vectors as:
$$\hat{\mathbf{r}}_k = \hat{A}\mathbf{r}_k + \hat{\mathbf{b}} \tag{20}$$
describing the mapped metagene-expression vector of cell k measured by ISH. Note that r̂k is the metagene-expression vector in the scRNA-seq space.
In Perler, updating πk is optional. Ideally, πk should be proportional to the number of cells within region k. In our study, πk was fixed as 1/K for Drosophila, zebrafish, and mouse cortex data as the tissue was equally divided in the ISH data, whereas πk values were fixed to the area of each zone in the mammalian liver data. Note, fixing πk accelerated the convergence of the EM algorithm compared with optimizing πk (Supplementary Figs. 25 and 26).
For the initialization of parameter values, we selected the values of ai and bi such that the mean and variance of the transformed reference, ai rki + bi, matched those of xni for each metagene i, and set each ci to the standard deviation of xni.
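The EM procedure described above can be sketched as follows. This is a minimal sketch, not the Perler implementation: the M-step formulas are the weighted least-squares solutions obtained by setting the derivatives of the expected complete-data log likelihood to zero under diagonal A and Σ, and they may differ in form from the published update equations. πk is held fixed at 1/K, and the function names are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp

def e_step(X, R, a, b, sigma, pi):
    """Responsibilities gamma[n, k] and the current log likelihood."""
    z = (X[:, None, :] - (a * R + b)[None, :, :]) / sigma
    log_p = (-0.5 * np.sum(z**2, axis=2) - np.sum(np.log(sigma))
             - 0.5 * X.shape[1] * np.log(2 * np.pi) + np.log(pi))
    log_norm = logsumexp(log_p, axis=1, keepdims=True)
    return np.exp(log_p - log_norm), float(log_norm.sum())

def m_step(X, R, gamma):
    """Closed-form updates of a, b, sigma with pi held fixed."""
    N = X.shape[0]
    Nk = gamma.sum(axis=0)                       # effective count per cell k
    x_bar = X.mean(axis=0)                       # valid because sum_k gamma[n, k] = 1
    r_bar = Nk @ R / N                           # responsibility-weighted mean of r
    xr = np.einsum("nk,ni,ki->i", gamma, X, R) / N
    rr = Nk @ R**2 / N
    a = (xr - x_bar * r_bar) / (rr - r_bar**2)   # weighted regression slope
    b = x_bar - a * r_bar                        # weighted regression intercept
    resid2 = (X[:, None, :] - (a * R + b)[None, :, :])**2
    sigma = np.sqrt(np.einsum("nk,nki->i", gamma, resid2) / N)
    return a, b, sigma

def fit(X, R, n_iter=50):
    """Run EM for a fixed number of iterations; returns parameters and log likelihoods."""
    K = R.shape[0]
    pi = np.full(K, 1.0 / K)                     # fixed mixing proportions
    a = X.std(axis=0) / R.std(axis=0)            # initial scale: match variances
    b = X.mean(axis=0) - a * R.mean(axis=0)      # initial offset: match means
    sigma = X.std(axis=0)                        # initial noise intensity
    history = []
    for _ in range(n_iter):
        gamma, ll = e_step(X, R, a, b, sigma, pi)
        history.append(ll)
        a, b, sigma = m_step(X, R, gamma)
    return a, b, sigma, history
```

A fixed iteration count is used here for brevity; in practice one would stop when successive values in `history` differ by less than a tolerance.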
Spatial reconstruction (the second step in Perler)
We reconstructed/predicted the gene-expression vector by weighted averaging all scRNA-seq data points as
21 |
where yn indicates the nth scRNA-seq data point (D-component vector). wnk is calculated by
22 |
where α, β, and δ are positive constants. Note that δ in the numerator and denominator of Eq. (22) are canceled out. Dnk indicates Mahalanobis distance between scRNA-seq data point xn and cell k:
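$$D_{nk} = \sqrt{(\mathbf{x}_n - \hat{A}\mathbf{r}_k - \hat{\mathbf{b}})^{\mathrm{T}}\, \hat{\Sigma}^{-1}\, (\mathbf{x}_n - \hat{A}\mathbf{r}_k - \hat{\mathbf{b}})} \tag{23}$$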
If α = 1/2 and β = 0, wnk is exactly the posterior probability that scRNA-seq data point xn is generated by cell k. Note that Eq. (21) has a similar structure to the Nadaraya–Watson model14. Values of α and β are determined by CV.
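The weighted-averaging idea can be illustrated with the sketch below, which covers only the β = 0 case. It rests on several assumptions that are not taken from the source: πk is uniform, the per-data-point scores are a softmax of −α D² over cells (which equals the posterior probability when α = 1/2), and the average for each cell is normalized over scRNA-seq data points. The function names are hypothetical, and the sketch does not reproduce Eq. (22).

```python
import numpy as np
from scipy.special import softmax

def mahalanobis_sq(X, R_mapped, sigma):
    """Squared Mahalanobis distances D[n, k]**2 for covariance diag(sigma**2)."""
    z = (X[:, None, :] - R_mapped[None, :, :]) / sigma
    return np.sum(z**2, axis=2)

def reconstruct(Y_cells, X, R_mapped, sigma, alpha=0.5):
    """Weighted average of scRNA-seq profiles for every cell k (beta = 0 only).

    Y_cells: N x D expression matrix, one row per scRNA-seq data point.
    X: N x M metagene vectors; R_mapped: K x M mapped reference vectors.
    """
    scores = softmax(-alpha * mahalanobis_sq(X, R_mapped, sigma), axis=1)  # N x K
    weights = scores / scores.sum(axis=0, keepdims=True)   # normalize over n
    return weights.T @ Y_cells                             # K x D prediction
```

Larger values of `alpha` concentrate each cell's weight on its nearest scRNA-seq data points, whereas smaller values average over broader neighborhoods.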
Weight sensitivity to the small perturbation of metagene
We calculated the derivative of the weight with respect to each metagene-expression level of an scRNA-seq data point as:
24 |
25 |
These equations show that as xni moves away from the mapped reference level of cell k, weights decrease at a rate inversely proportional to the estimated noise of metagene i. This relationship indicates that small changes in unreliable genes with high noise levels have little effect on the weight, whereas small changes in reliable genes with low noise levels have a large effect on the weight. Therefore, Perler can reconstruct gene-expression profiles in a noise-resistant manner by accounting for the reliability of each gene through weight determination.
Hyperparameter optimization
We optimized the hyperparameters α and β of the weighting function by LOOCV, in order to fit the predicted gene expression to the referenced gene expression measured by ISH. To this end, we removed one of the landmark genes from the ISH data and used this dataset to predict the spatial gene-expression profile of the removed landmark gene with the fixed hyperparameters in Perler. This LOO prediction was repeated for every landmark gene. We then quantitatively evaluated the predictive performance of these hyperparameters according to the mutual information existing between the predicted expression and referenced expression of all landmark genes:
$$J(\alpha, \beta) = -\frac{1}{2} \sum_{i=1}^{D} \ln\left(1 - \rho_i(\alpha, \beta)^2\right) \tag{26}$$
where J is the approximated mutual information between the predicted and referenced gene expression. ρi(α, β) indicates the Pearson’s correlation coefficient between the predicted spatial expression pattern of each landmark gene i and its reference ISH data as:
27 |
where
28 |
29 |
The derivation of J is described in a later subsection. Here, we optimized α and β by grid search to maximize the mutual information, J. We then used the optimized hyperparameters to predict the spatial profiles of non-landmark genes (Fig. 3 and Supplementary Fig. 5). To evaluate the predictive performance of Perler (Fig. 3), we removed each landmark gene from the mutual information and re-optimized the hyperparameters. This re-optimization was repeated for every landmark gene. Note that for the zebrafish embryo data, we used the ROC score instead of the correlation coefficient, because only binary ISH data were available. In addition, for the mouse visual cortex data, we conducted tenfold CV because of the massive computational cost of LOOCV for the large number of landmark genes (1020 genes).
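The grid search over α and β can be sketched as below. The objective is the approximated mutual information, computed as −½ Σi ln(1 − ρi²) from per-gene Pearson correlations. `predict_loo` is a hypothetical user-supplied callable that returns the D × K matrix of leave-one-out predictions for a given (α, β); it stands in for the prediction step and is not part of any published API.

```python
import numpy as np

def approx_mutual_information(pred, ref):
    """J = -1/2 * sum_i ln(1 - rho_i**2) over landmark genes.

    pred, ref: D x K arrays (landmark genes x cells or regions).
    """
    pc = pred - pred.mean(axis=1, keepdims=True)
    rc = ref - ref.mean(axis=1, keepdims=True)
    rho = np.sum(pc * rc, axis=1) / np.sqrt(np.sum(pc**2, axis=1) * np.sum(rc**2, axis=1))
    return float(-0.5 * np.sum(np.log1p(-rho**2)))

def grid_search(predict_loo, ref, alphas, betas):
    """Return the (alpha, beta) pair that maximizes J, plus all scores."""
    scores = {(al, be): approx_mutual_information(predict_loo(al, be), ref)
              for al in alphas for be in betas}
    return max(scores, key=scores.get), scores
```

Because J grows without bound as any ρi approaches ±1, genes that are predicted almost perfectly dominate the objective; this follows directly from the logarithm in the formula.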
Data acquisition and preprocessing
For D. melanogaster reconstruction, we used scRNA-seq and ISH data from the Drosophila Virtual Expression eXplorer (DVEX; https://shiny.mdc-berlin.de/DVEX/)5, which were originally used for DistMap5. In these datasets, the number of scRNA-seq data points is 1297, whereas the number of cells to be estimated in the embryos is 3039. The expressed mRNA counts in this scRNA-seq dataset were already log normalized according to the total number of unique molecular identifiers for each cell. For each gene, we subtracted the average expression from the scRNA-seq data. In addition, the ISH data were log-scaled, and the average expression of each gene was subtracted, as for the scRNA-seq data.
For reconstruction of the early zebrafish embryos, we acquired the public scRNA-seq and ISH data from the Satija Lab homepage (https://satijalab.org/)4, with these data originally used by Seurat (v.1)4. In these data, the number of scRNA-seq data points is 851, whereas the number of subregions to be estimated in the embryos is 64. Note that the ISH data were binary. Similar to the Drosophila data, we log-scaled both the scRNA-seq and ISH datasets and subtracted the average expression of each gene.
For reconstruction of the mammalian liver, we used scRNA-seq and smFISH data provided by Halpern et al.7. In these data, the number of scRNA-seq data points is 1415, whereas the number of zones to be estimated in the tissue is 9. Because multiple samples were provided in the smFISH data, we calculated their average at each tissue location for Perler, followed by log-scaling both the scRNA-seq and smFISH data and subtracting the average expression of each gene.
For reconstruction of the mouse visual cortex, we used scRNA-seq data provided by the Allen Institute for Brain Science17 and Drop-seq data provided by Saunders et al.31. For ISH data, we used smFISH data provided by Wang et al.16, which were originally used for Seurat (v.3)9. The numbers of data points in the scRNA-seq and Drop-seq datasets are 14,739 and 194,027, respectively, whereas the number of cells to be estimated in the cortex is 1549. We log-scaled both the scRNA-seq and smFISH data, and subtracted the average expression of each gene.
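The shared preprocessing step (log-scaling followed by per-gene centering) can be sketched as follows. The pseudocount of 1 and the function name `log_center` are assumptions made for illustration; the text does not state which logarithm offset was used.

```python
import numpy as np

def log_center(counts, pseudocount=1.0):
    """Log-scale a genes x cells matrix and subtract each gene's mean."""
    logged = np.log(counts + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

# Tiny illustrative matrix: 2 genes x 2 cells.
example = log_center(np.array([[0.0, 9.0], [3.0, 3.0]]))
```

After this step every gene has zero mean across cells, so the background term b in the linear mapping absorbs only the offset between the two centered datasets.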
Data analysis
We used the newly developed method, Perler, for data analysis. Perler was deposited at GitHub (see “Code availability”). Perler is built on Python 3.8.3.
All other software used in this study is publicly available: Numpy==1.19.5 (https://numpy.org/) for calculation; Scipy==1.6.1 (https://www.scipy.org/) for calculation; joblib==1.0.0 (https://joblib.readthedocs.io/en/latest/index.html) for calculation; scikit-learn==0.24.1 (https://scikit-learn.org/) for traditional machine learning (e.g., PCA and NN method); Matplotlib==3.3.3 (https://matplotlib.org/) for data visualization; Seaborn==0.11.1 (https://seaborn.pydata.org/) for data visualization; and pandas==1.2.1 (https://pandas.pydata.org/) for reading data frames. For the visualization of zebrafish embryos, we used “zf.insitu.vec.lateral” function of Seurat (v. ≥ 1.2).
Data visualization
For D. melanogaster, we visualized the reconstructed gene-expression profile at single-cell resolution by using the three-dimensional coordinates of all cells from DVEX (https://shiny.mdc-berlin.de/DVEX/)5. Because the embryo is bilaterally symmetric, we mapped the reconstructed spatial gene-expression levels of the 3039 cells in the right-half embryo. Following the previous study5, we then mirrored the spatial gene-expression levels of the right-half cells to the remaining cells in the left-half embryo. In the case of the early zebrafish embryos, we visualized the reconstructed gene expression using the “zf.insitu.vec.lateral” function of Seurat (v. ≥ 1.2)4. In the case of the mammalian liver, we visualized the reconstructed gene expression as a heatmap. In the case of the mouse visual cortex, we visualized the reconstructed gene expression at single-cell resolution, using the two-dimensional coordinates of all cells within cortical slices provided by Wang et al.16.
Derivation of the EM algorithm
The goal of the EM algorithm is to maximize the likelihood function p(X | θ) with respect to θ, where X = {x1, x2, …, xN} and θ = {π, A, b, Σ}. The generative model of scRNA-seq data point x with latent variables z is formulated as follows. The probability distribution of z is:
$$p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k} \tag{30}$$
where z is a vector in a one-of-K representation that shows from which cell/region in the tissue an scRNA-seq sample originated; zk is the kth element of z; K is the number of elements of the latent variable z, equal to the number of cells in the tissue; and πk is the probability that zk = 1. The probability distribution of x conditioned on z is:
$$p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid A\mathbf{r}_k + \mathbf{b}, \Sigma)^{z_k} \tag{31}$$
where N(x|μ, Σ) indicates a Gaussian distribution with mean μ and variance Σ; A is the M × M diagonal matrix and b is the M-element vector in Eq. (3); and rk indicates the M-element vector describing the metagene-expression level in cell k. The joint probability distribution of x and z is:
$$p(\mathbf{x}, \mathbf{z}) = \prod_{k=1}^{K} \left[ \pi_k\, \mathcal{N}(\mathbf{x} \mid A\mathbf{r}_k + \mathbf{b}, \Sigma) \right]^{z_k} \tag{32}$$
Note that marginalizing this joint distribution over z yields Eq. (8). The likelihood function for the complete dataset {X, Z} is given as:
$$p(X, Z \mid \theta) = \prod_{n=1}^{N} \prod_{k=1}^{K} \left[ \pi_k\, \mathcal{N}(\mathbf{x}_n \mid A\mathbf{r}_k + \mathbf{b}, \Sigma) \right]^{z_{nk}} \tag{33}$$
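where Z = {z1, z2, …, zN}. Therefore, the expectation of its log-likelihood function over the posterior distribution P(Z|X, θ(old)) becomes:
$$Q(\theta, \theta^{(\mathrm{old})}) = \mathbb{E}_{Z \mid X, \theta^{(\mathrm{old})}} \left[ \ln p(X, Z \mid \theta) \right] \tag{34}$$
$$Q(\theta, \theta^{(\mathrm{old})}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid A\mathbf{r}_k + \mathbf{b}, \Sigma) \right\} \tag{35}$$
where γnk is the expectation of znk over P(Z|X, θ(old)) given as:
$$\gamma_{nk} = \mathbb{E}[z_{nk}] = P(z_{nk} = 1 \mid \mathbf{x}_n, \theta^{(\mathrm{old})}) \tag{36}$$
According to Bayes’ theorem:
$$P(z_{nk} = 1 \mid \mathbf{x}_n) = \frac{P(z_{nk} = 1)\, p(\mathbf{x}_n \mid z_{nk} = 1)}{\sum_{j=1}^{K} P(z_{nj} = 1)\, p(\mathbf{x}_n \mid z_{nj} = 1)} \tag{37}$$
where P(znk = 1|xn) becomes Eq. (11).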
In the E step, γnk is calculated based on the current parameter values θ(old). In the M step, we update the parameter values θ by maximizing the Q-function as:
$$\theta^{(\mathrm{new})} = \underset{\theta}{\arg\max}\; Q(\theta, \theta^{(\mathrm{old})}) \tag{38}$$
The maximization of Q(θ, θ(old)) with respect to A, b, and Σ is achieved by setting ∂Q/∂A = 0, ∂Q/∂b = 0, and ∂Q/∂Σ = 0, leading to Eqs. (13–19). π is updated by introducing a Lagrange multiplier to enforce the constraint Σk πk = 1, leading to Eq. (12).
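As a worked sketch of these stationarity conditions, the block below solves them per metagene i for diagonal A and Σ. The auxiliary symbols N_k, \bar{x}_i, and \bar{r}_i are introduced here for illustration, so the grouping of terms may differ from that used in Eqs. (12–19).

```latex
% Q restricted to metagene i (terms independent of a_i, b_i, sigma_i dropped):
%   Q_i = -\sum_{n,k} \gamma_{nk} \left[ \tfrac{1}{2}\ln \sigma_i^2
%         + \frac{(x_{ni} - a_i r_{ki} - b_i)^2}{2\sigma_i^2} \right]
\begin{align*}
N_k &= \sum_{n=1}^{N} \gamma_{nk}, \qquad
\bar{x}_i = \frac{1}{N} \sum_{n=1}^{N} x_{ni}, \qquad
\bar{r}_i = \frac{1}{N} \sum_{k=1}^{K} N_k\, r_{ki}, \\
\frac{\partial Q}{\partial b_i} = 0 \;&\Rightarrow\;
b_i = \bar{x}_i - a_i \bar{r}_i, \\
\frac{\partial Q}{\partial a_i} = 0 \;&\Rightarrow\;
a_i = \frac{\sum_{n,k} \gamma_{nk}\, x_{ni}\, r_{ki} - N \bar{x}_i \bar{r}_i}
           {\sum_{k} N_k\, r_{ki}^2 - N \bar{r}_i^2}, \\
\frac{\partial Q}{\partial \sigma_i^2} = 0 \;&\Rightarrow\;
\sigma_i^2 = \frac{1}{N} \sum_{n,k} \gamma_{nk}\, (x_{ni} - a_i r_{ki} - b_i)^2, \\
\text{Lagrange multiplier on } \textstyle\sum_k \pi_k = 1 \;&\Rightarrow\;
\pi_k = \frac{N_k}{N}.
\end{align*}
```

The expressions for a_i and b_i are those of a weighted linear regression of x_ni on r_ki with weights γnk, which uses the fact that Σk γnk = 1 for every n.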
Derivation of mutual information
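We derived Eq. (26) by approximating the following mutual information between the reconstructed spatial expression pattern of the landmark genes and their reference map:
$$I(\hat{\mathbf{h}}; \mathbf{h}) = \iint p(\hat{\mathbf{h}}, \mathbf{h}) \ln \frac{p(\hat{\mathbf{h}}, \mathbf{h})}{p(\hat{\mathbf{h}})\, p(\mathbf{h})}\, d\hat{\mathbf{h}}\, d\mathbf{h} \tag{39}$$
where ĥ = (ĥ1, …, ĥD)T and h = (h1, …, hD)T; ĥi and hi indicate random variables representing the predicted and referenced expression levels of landmark gene i, respectively; and p(ĥ, h) indicates the joint probability distribution of ĥ and h. Here, we assumed that the spatial expression patterns of landmark genes are independent of one another, which leads to:
$$I(\hat{\mathbf{h}}; \mathbf{h}) = \sum_{i=1}^{D} I(\hat{h}_i; h_i) \tag{40}$$
We calculated I(ĥi; hi) by assuming p(ĥi, hi) to be a bivariate Gaussian distribution and obtained:
$$I(\hat{h}_i; h_i) = -\frac{1}{2} \ln\left(1 - \rho_i^2\right) \tag{41}$$
where ρi denotes the Pearson’s correlation coefficient between ĥi and hi.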
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Acknowledgements
This study was supported in part by the Cooperative Study Program of Exploratory Research Center on Life and Living Systems (ExCELLS; program Nos. 18-201, 19-102, and 19-202 to H.N.); Moonshot R&D–MILLENNIA Program (Grant No.: JPMJMS2024-9) by JST; a Grant-in-Aid for Young Scientists (B) (19H04776 and 21H03541 to H.N.), a Grant-in-Aid for Scientific Research (B) (17KT0021 to T.K.), and a JSPS research fellowship for young scientist (to S.S.) from the Japan Society for the Promotion of Science (JSPS); the Naito Foundation (to T.K.); and the Keihanshin Consortium for Fostering the Next Generation of Global Leaders in Research (K-CONNEX) established by the program of Building of Consortia for the Development of Human Resources in Science and Technology, MEXT (to T.K.).
Author contributions
H.N., S.S., and T.K. conceived the project. Y.O., H.N., and K.N. developed the method, Y.O. implemented the software, and Y.O. and S.S. analyzed data. Y.O. and H.N. wrote the manuscript with input from all authors.
Data availability
This study constituted a reanalysis of existing data. The detailed data are available at the following sites: Drosophila embryo datasets from Drosophila Virtual Expression eXplorer (DVEX); Zebrafish embryo datasets from Satija Lab homepage (https://satijalab.org/); mammalian liver datasets from Gene Expression Omnibus (GEO) with the accession code GSE84498; STARmap data from STARmap Resources website (https://www.starmapresources.com/; mouse cortex); SMART-seq2 data from Cell Types Database website of the Allen Institute for Brain Science (http://celltypes.brain-map.org/api/v2/well_known_file_download/694413985; mouse cortex); and Drop-seq data from Drop-viz website (http://dropviz.org/; mouse cortex). Source data are provided with this paper.
Code availability
Perler was developed in Python 3.8 and is available on GitHub (https://github.com/yasokochi/Perler)35. The minimal usage of Perler is provided in Supplementary Table 6, and the selected parameters in the manuscript are provided in Supplementary Table 7. The running time and memory usage on a MacBook Pro (2.3 GHz 8-Core Intel Core i9, 64 GB) are also provided in Supplementary Table 8.
Competing interests
The authors declare no competing interests.
Footnotes
Peer review information Nature Communications thanks Pablo Meyer, Xianwen Ren and other, anonymous, reviewers for their contributions to the peer review of this work. Peer review reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Yasushi Okochi, Shunta Sakaguchi.
Supplementary information
The online version contains supplementary material available at 10.1038/s41467-021-24014-x.
References
- 1. Gilmour D, Rembold M, Leptin M. From morphogen to morphogenesis and back. Nature. 2017;541:311–320. doi: 10.1038/nature21348.
- 2. Halpern KB, et al. Paired-cell sequencing enables spatial gene expression mapping of liver endothelial cells. Nat. Biotechnol. 2018;36:962. doi: 10.1038/nbt.4231.
- 3. Tang F, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods. 2009;6:377–382. doi: 10.1038/nmeth.1315.
- 4. Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 2015;33:495–502. doi: 10.1038/nbt.3192.
- 5. Karaiskos N, et al. The Drosophila embryo at single-cell transcriptome resolution. Science. 2017;358:194–199. doi: 10.1126/science.aan3235.
- 6. Achim K, et al. High-throughput spatial mapping of single-cell RNA-seq data to tissue of origin. Nat. Biotechnol. 2015;33:503–509. doi: 10.1038/nbt.3209.
- 7. Halpern KB, et al. Single-cell spatial reconstruction reveals global division of labour in the mammalian liver. Nature. 2017;542:352–356.
- 8. Faridani OR, Sandberg R. Putting cells in their place. Nat. Biotechnol. 2015;33:490–491. doi: 10.1038/nbt.3219.
- 9. Stuart T, et al. Comprehensive integration of single-cell data. Cell. 2019;177:1888–1902.e21.
- 10. Welch JD, et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell. 2019;177:1873–1887.e17. doi: 10.1016/j.cell.2019.05.006.
- 11. Liu J, et al. Jointly defining cell types from multiple single-cell datasets using LIGER. Nat. Protoc. 2020;15:3632–3662. doi: 10.1038/s41596-020-0391-8.
- 12. Hardoon DR, Szedmak S, Shawe-Taylor J. Canonical correlation analysis: an overview with application to learning methods. Neural Comput. 2004;16:2639–2664. doi: 10.1162/0899766042321814.
- 13. Yang Z, Michailidis G. A non-negative matrix factorization method for detecting modules in heterogeneous omics multi-modal data. Bioinformatics. 2016;32:1–8. doi: 10.1093/bioinformatics/btw326.
- 14. Bishop CM. Pattern Recognition and Machine Learning. Springer; 2006.
- 15. Haghverdi L, Lun ATL, Morgan MD, Marioni JC. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 2018;36:421–427. doi: 10.1038/nbt.4091.
- 16. Wang X, et al. Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science. 2018;361:eaat5691.
- 17. Tasic B, et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 2016;19:335–346. doi: 10.1038/nn.4216.
- 18. Arias AM, Hayward P. Filtering transcriptional noise during development: concepts and mechanisms. Nat. Rev. Genet. 2006;7:34–44. doi: 10.1038/nrg1750.
- 19. Abdi H, Williams LJ. Partial least squares methods: partial least squares correlation and partial least square regression. Methods Mol. Biol. 2013;930:549–579.
- 20. BDTNP. Berkeley Drosophila Transcription Network Project. http://bdtnp.lbl.gov/Fly-Net (2009).
- 21. Luengo Hendriks CL, et al. Three-dimensional morphology and gene expression in the Drosophila blastoderm at cellular resolution I: data acquisition pipeline. Genome Biol. 2006;7:R123. doi: 10.1186/gb-2006-7-12-r123.
- 22. Tanevski J, et al. Gene selection for optimal prediction of cell position in tissues from single-cell transcriptomics data. Life Sci. Alliance. 2020;3:1–13. doi: 10.26508/lsa.202000867.
- 23. Bageritz J, et al. Gene expression atlas of a developing tissue by single cell expression correlation analysis. Nat. Methods. 2019;16:750–756. doi: 10.1038/s41592-019-0492-x.
- 24. Tabata T, Eaton S, Kornberg TB. The Drosophila hedgehog gene is expressed specifically in posterior compartment cells and is a target of engrailed regulation. Genes Dev. 1992;6:2635–2645. doi: 10.1101/gad.6.12b.2635.
- 25. BDGP. Patterns of gene expression in Drosophila embryogenesis. https://insitu.fruitfly.org/cgi-bin/ex/insitu.pl (2005).
- 26. Hammonds AS, et al. Spatial expression of transcription factors in Drosophila embryonic organ development. Genome Biol. 2013;14:1–15. doi: 10.1186/gb-2013-14-12-r140.
- 27. Tomancak P, et al. Global analysis of patterns of gene expression during Drosophila embryogenesis. Genome Biol. 2007;8:1–24. doi: 10.1186/gb-2007-8-7-r145.
- 28. Tomancak P, et al. Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biol. 2002;3:1–14. doi: 10.1186/gb-2002-3-12-research0088.
- 29. Clark E, Akam M. Odd-paired controls frequency doubling in Drosophila segmentation by altering the pair-rule gene regulatory network. Elife. 2016;5:1–42. doi: 10.7554/eLife.18215.
- 30. Clark E. Dynamic patterning by the Drosophila pair-rule network reconciles long-germ and short-germ segmentation. PLoS Biol. 2017;15:e2002439.
- 31. Saunders A, et al. Molecular diversity and specializations among the cells of the adult mouse brain. Cell. 2018;174:1015–1030.e16. doi: 10.1016/j.cell.2018.07.028.
- 32. Nitzan M, Karaiskos N, Friedman N, Rajewsky N. Gene expression cartography. Nature. 2019;576:132–137.
- 33. Stuart T, Satija R. Integrative single-cell analysis. Nat. Rev. Genet. 2019;20:257–272. doi: 10.1038/s41576-019-0093-7.
- 34. Rood JE, et al. Toward a common coordinate framework for the human body. Cell. 2019;179:1455–1467. doi: 10.1016/j.cell.2019.11.019.
- 35. Okochi Y, Sakaguchi S, Nakae K, Kondo T, Naoki H. yasokochi/Perler: first release of Perler. 2021. doi: 10.5281/ZENODO.4770427.