Abstract
In spatially resolved transcriptomics, Stereo-seq facilitates the analysis of large tissues at the single-cell level, offering subcellular resolution and centimeter-level field-of-view. Our previous work on StereoCell introduced a one-stop software using cell nuclei staining images and statistical methods to generate high-confidence single-cell spatial gene expression profiles for Stereo-seq data. With advancements allowing the acquisition of cell boundary information, such as cell membrane/wall staining images, we updated our software to a new version, STCellbin. Using cell nuclei staining images, STCellbin aligns cell membrane/wall staining images with spatial gene expression maps. Advanced cell segmentation ensures the detection of accurate cell boundaries, leading to more reliable single-cell spatial gene expression profiles. We verified that STCellbin can be applied to mouse liver (cell membranes) and Arabidopsis seed (cell walls) datasets, outperforming other methods. The improved capability of capturing single-cell gene expression profiles results in a deeper understanding of the contribution of single-cell phenotypes to tissue biology.
Availability & Implementation
The source code of STCellbin is available at https://github.com/STOmics/STCellbin.
Statement of need
Spatially resolved single-cell transcriptomics enables the generation of comprehensive molecular maps that provide insights into the spatial distribution of molecules within individual cells constituting tissues. This groundbreaking technology offers insights into the location and function of cells across diverse tissues, advancing our understanding of organ development [1], tumor heterogeneity [2], cancer evolution [3], and other biological mechanisms. Resolution and field-of-view are critical parameters in spatial transcriptomics. Specifically, a high resolution provides detailed molecular information at the single-cell level, and a large field-of-view facilitates the creation of complete 3D maps, capturing biological functions at the organ level. Stereo-seq simultaneously achieves subcellular resolution and a centimeter-level field-of-view, providing the technical foundation for obtaining comprehensive spatial gene expression profiles of whole tissues at the single-cell level [4]. Our previous work introduced StereoCell, a one-stop software for obtaining single-cell spatial gene expression profiles with a high signal-to-noise ratio from Stereo-seq data [5]. StereoCell takes the cell nuclei staining image tiles and their corresponding spatial gene expression data as input, and performs image stitching, image registration, tissue segmentation, cell nuclei segmentation, and molecule labeling. Notably, Stereo-seq uses cell nuclei staining images; however, cell nuclei staining images and cell boundary staining images (based on cell membrane/wall staining) differ significantly in their ability to capture robust and precise cell-specific gene expression profiles.
Despite the widespread use of spatial techniques, such as MERFISH [6], CosMx [7], and Xenium [8], several of these techniques struggle to provide accurate cell boundary information, as they rely on cell nuclei staining images generated using stains such as 4′,6-diamidino-2-phenylindole (DAPI). Hematoxylin-eosin and single-stranded DNA (ssDNA) fluorescence nuclei staining images are also commonly used and readily obtainable. The updated Stereo-seq technology incorporates a procedure for simultaneous cell membrane/wall and cell nuclei staining by adding multiplex immunofluorescence (mIF) and calcofluor white (CFW) staining [9, 10], enabling the automatic acquisition of more accurate cell boundary information and, consequently, more reliable single-cell spatial gene expression profiles.
Here, we updated StereoCell to a new version: STCellbin. The new version retains key steps from StereoCell, such as image stitching, tissue segmentation, and molecule labeling. Additionally, it incorporates improved image registration and cell segmentation steps. Notably, the “track line”, a crossed linear marker embedded on the Stereo-seq chip, is key to the image registration step of StereoCell [5]. As the cell membrane/wall staining images lack the “track line” information, the cell nuclei staining images are used to align the cell membrane/wall staining images with the spatial gene expression maps, thereby obtaining registered cell boundary information in the cell segmentation step. Based on the cell boundary information, STCellbin directly assigns the molecules to their corresponding cells, obtaining single-cell spatial gene expression profiles. We applied STCellbin to mouse liver (cell membrane) and Arabidopsis seed (cell wall) datasets and confirmed the accuracy of the cell segmentation provided by the software. This update offers a comprehensive workflow to obtain reliable single-cell spatial gene expression profiles based on cell membrane/wall information. Hence, STCellbin provides support and guidance, particularly for scientific investigations based on Stereo-seq data.
Implementation
Overview of STCellbin
The process of STCellbin includes image stitching, image registration, cell segmentation, and molecule labeling (Figure 1). Input into STCellbin includes Stereo-seq spatial gene expression data, alongside cell nuclei and cell membrane/wall staining image tiles. The stitched cell nuclei and cell membrane/wall staining images are obtained using the MFWS algorithm [5]. These two stitched staining images are registered using a Fast Fourier Transform (FFT) algorithm [11]. The spatial gene expression data is transformed into a map, which is then registered with a stitched cell nuclei staining image based on “track line” information. Thus, the registration of the gene expression map and cell membrane/wall staining image is implemented. Cell segmentation is performed on the registered cell membrane/wall staining image using the adjusted Cellpose 2.0 tool [12] to obtain the cell mask. Molecules are then assigned to their corresponding cells based on the cell mask, thus generating the single-cell spatial gene expression profile. The tissue segmentation step based on Bi-Directional ConvLSTM U-Net [13] is set as optional, and it can be used to generate a tissue mask to assist in filtering out impurities outside the tissue.
Image stitching
The image stitching step in STCellbin is consistent with the one in StereoCell. The MFWS algorithm [5] leverages FFT [11] to compute offsets between adjacent tiles featuring overlapping areas. This enables the stitching of these tiles, and the process is extended iteratively to encompass all tiles in the dataset. The relative error, absolute error, and computational efficiency of MFWS were assessed in our previous work [5].
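The idea of computing an offset between two overlapping tiles with FFT can be illustrated with generic phase correlation. The sketch below is not the MFWS implementation; the function name `fft_offset` and the assumption of a purely integer, translation-only shift between equally sized arrays are ours.

```python
import numpy as np

def fft_offset(reference, moving):
    """Estimate the integer (row, col) shift d such that moving ~ roll(reference, d)."""
    f_ref = np.fft.fft2(reference)
    f_mov = np.fft.fft2(moving)
    # Normalized cross-power spectrum: keeps only the phase difference.
    cross_power = f_mov * np.conj(f_ref)
    cross_power /= np.abs(cross_power) + 1e-12
    correlation = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Peaks past the midpoint correspond to negative shifts (FFT wrap-around).
    shift = [p if p <= s // 2 else p - s for p, s in zip(peak, correlation.shape)]
    return tuple(int(v) for v in shift)

# Toy usage: a synthetic tile shifted circularly by a known amount.
rng = np.random.default_rng(0)
tile = rng.random((64, 64))
shifted_tile = np.roll(tile, shift=(3, -5), axis=(0, 1))
estimated = fft_offset(tile, shifted_tile)
```

In a stitching setting, the two inputs would be the overlapping strips of adjacent tiles, and the estimated shift would be accumulated across neighbors to place every tile.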
Image registration
The registration of STCellbin includes three stages. The first stage is the registration of the stitched cell nuclei and stitched cell membrane/wall staining images. These two staining images have similar sizes and no significant difference in rotation because the chip does not move when they are photographed. The key to this registration is to calculate their offset. The size of the cell membrane/wall staining image is adjusted to match that of the cell nuclei staining image through cutting and zero-padding (Figure 2A). The two staining images are mean-based subsampled [14] (Figure 2B). The offset of the subsampled images is calculated through FFT [11], similarly to MFWS [5] (Figure 2C). Then, the calculated offset is restored to the scale of the original images (Figure 2D). Thus, these two staining images can be registered.
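The four operations of this first stage (size matching, mean-based subsampling, FFT offset estimation, and rescaling of the offset) can be sketched as follows. This is a minimal illustration under our own assumptions, not the STCellbin code: the helper names, the top-left anchoring used for cropping and zero-padding, the subsampling factor of 4, and the use of a zero-mean circular cross-correlation are all hypothetical choices.

```python
import numpy as np

def match_size(image, target_shape):
    """Crop or zero-pad `image` (anchored at the top-left corner) to `target_shape`."""
    out = np.zeros(target_shape, dtype=image.dtype)
    rows = min(image.shape[0], target_shape[0])
    cols = min(image.shape[1], target_shape[1])
    out[:rows, :cols] = image[:rows, :cols]
    return out

def mean_subsample(image, factor):
    """Downscale by averaging non-overlapping factor x factor blocks."""
    rows = image.shape[0] // factor * factor
    cols = image.shape[1] // factor * factor
    blocks = image[:rows, :cols].reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))

def estimate_offset(nuclei, boundary, factor=4):
    """Return the (row, col) offset d, in full-resolution pixels, with boundary ~ roll(nuclei, d)."""
    boundary = match_size(boundary, nuclei.shape)
    small_nuclei = mean_subsample(nuclei.astype(float), factor)
    small_boundary = mean_subsample(boundary.astype(float), factor)
    # Remove the mean so the correlation peak is not dominated by overall brightness.
    small_nuclei -= small_nuclei.mean()
    small_boundary -= small_boundary.mean()
    spectrum = np.fft.fft2(small_boundary) * np.conj(np.fft.fft2(small_nuclei))
    correlation = np.fft.ifft2(spectrum).real
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    shift = [p if p <= s // 2 else p - s for p, s in zip(peak, correlation.shape)]
    # Restore the offset to the scale of the original images.
    return tuple(int(v) * factor for v in shift)

# Toy usage with a synthetic image pair that differs only by a known shift.
rng = np.random.default_rng(1)
nuclei_img = rng.random((128, 128))
boundary_img = np.roll(nuclei_img, shift=(8, -12), axis=(0, 1))
offset = estimate_offset(nuclei_img, boundary_img, factor=4)
```

Subsampling reduces the cost of the FFT on centimeter-scale images; the trade-off is that the recovered offset is only resolved to a multiple of the subsampling factor.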
The second stage is the registration of the stitched nuclei staining image and spatial gene expression map. This registration is the same as in StereoCell [5]. The spatial gene expression data is transformed into a map. The stitched cell nuclei staining image is registered with the map based on “track line” information, involving scaling, rotating, flipping, and translating on the stitched cell nuclei staining image.
The third stage is the registration of the stitched cell membrane/wall staining image and the spatial gene expression map. Since the cell nuclei and cell membrane/wall staining images have been registered in the first stage, the same operations of the second stage, including scaling, rotating, flipping, and translating, are applied to the cell membrane/wall staining image (Figure 2E). Then, the cell membrane/wall staining image and spatial gene expression map can be registered. Moreover, when utilizing staining images produced with a multi-channel microscope, STCellbin can omit the registration among these images. STCellbin can also process the case of multiple mIF staining images captured from identical tissues using the same microscope when there is only a difference in offsets among these images.
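One way to picture the third stage is as a composition of coordinate transforms: the offset from the first stage followed by the scaling, rotating, flipping, and translating operations from the second stage. The sketch below uses 3×3 homogeneous matrices on (x, y) points; the helper names and the numeric values are hypothetical and only illustrate the composition, not how STCellbin stores or applies its transforms.

```python
import numpy as np

def translation(dx, dy):
    return np.array([[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]])

def scaling(factor):
    return np.diag([factor, factor, 1.0])

def rotation(degrees):
    t = np.deg2rad(degrees)
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t), np.cos(t), 0.0],
                     [0.0, 0.0, 1.0]])

def horizontal_flip(width):
    # Maps x to (width - 1) - x and leaves y unchanged.
    return np.array([[-1.0, 0.0, width - 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

def compose(*steps):
    """Combine transforms that are applied in the order given (left to right)."""
    total = np.eye(3)
    for step in steps:
        total = step @ total
    return total

# Stage 1 (hypothetical values): boundary image -> nuclei image, a pure offset.
stage1 = translation(5, -3)
# Stage 2 (hypothetical values): nuclei image -> expression map.
stage2 = compose(scaling(2.0), translation(1, 1))
# Stage 3 reuses both: boundary image -> expression map.
boundary_to_map = compose(stage1, stage2)
mapped_point = boundary_to_map @ np.array([10.0, 20.0, 1.0])
```

Because the second-stage operations are reused unchanged, no “track line” detection is needed on the cell membrane/wall staining image itself.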
Cell segmentation
The cell segmentation step of STCellbin uses Cellpose 2.0 [12] with some adjustments. The model architecture of Cellpose 2.0 and its weight files “cyto2” are downloaded. However, due to the large size of the staining images derived from Stereo-seq data, Cellpose 2.0 cannot be executed smoothly using normal hardware configurations. To address this issue, the staining images are cropped into multiple tiles with overlapping areas to perform cell segmentation and record the coordinates of these tiles. The overlapping areas prevent cells at the border of the tiles from being cropped. For optimal results, segmentations with different values of the cell diameter are performed independently, and the segmentation yielding the highest total cell area is retained. Next, all the segmented tiles are assembled into the final segmented result according to the recorded coordinates. Moreover, when selecting the tissue segmentation option, an additional step involves applying a filter to the cell mask using the tissue mask, resulting in a refined segmented output.
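The tiling strategy described above can be sketched as follows. This is an illustration under stated assumptions rather than the STCellbin implementation: `segment_fn` is a hypothetical callable standing in for a Cellpose 2.0 call, the tile and overlap sizes are arbitrary, the diameter with the largest cell area is chosen per tile, and overlapping regions are merged with a logical OR.

```python
import numpy as np

def tile_coordinates(height, width, tile=2048, overlap=256):
    """Yield (row0, row1, col0, col1) windows covering the image with overlapping borders."""
    step = tile - overlap
    for r0 in range(0, max(height - overlap, 1), step):
        for c0 in range(0, max(width - overlap, 1), step):
            yield r0, min(r0 + tile, height), c0, min(c0 + tile, width)

def segment_large_image(image, segment_fn, diameters, tile=2048, overlap=256):
    """Segment tile by tile and assemble a binary cell mask at the recorded coordinates."""
    mask = np.zeros(image.shape, dtype=bool)
    for r0, r1, c0, c1 in tile_coordinates(image.shape[0], image.shape[1], tile, overlap):
        crop = image[r0:r1, c0:c1]
        # Run the segmentation once per candidate diameter (assumed independent runs).
        candidates = [segment_fn(crop, d) > 0 for d in diameters]
        # Keep the candidate with the highest total cell area.
        best = max(candidates, key=lambda m: m.sum())
        mask[r0:r1, c0:c1] |= best
    return mask

# Toy usage: a threshold stands in for the real segmentation model.
def toy_segment(crop, diameter):
    return (crop > diameter).astype(np.uint8)

rng = np.random.default_rng(2)
toy_image = rng.random((50, 70))
toy_mask = segment_large_image(toy_image, toy_segment, diameters=[0.2, 0.8], tile=32, overlap=8)
```

A production version would also need to reconcile instance labels of cells that span two tiles; the binary OR used here sidesteps that question for brevity.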
Molecule labeling
The molecule labeling of STCellbin is the same as in StereoCell, in principle. StereoCell assigns molecules within the cell nuclei to the cell by using the cell nuclei mask, and then assigns molecules outside the cell nuclei to the cells with the highest probability density using a Gaussian Mixture Model [15]. Conversely, STCellbin directly assigns molecules to the cells based on the cell mask, while assigning molecules outside the cell is optional. This decision was driven by the observation that cell membranes/walls are usually tightly packed, so that few molecules appear outside the cells, and the assignment of these molecules can be time-consuming. Thus, we generally do not recommend this option, and users can choose to employ it based on particular requirements.
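Direct assignment from a cell mask amounts to a label lookup at each molecule position. The sketch below assumes a labeled mask (0 for background, 1 to N for cells) and integer molecule coordinates already registered to that mask; the function name and array layout are hypothetical, and the optional Gaussian Mixture Model step for molecules outside cells is not shown.

```python
import numpy as np

def cell_by_gene_counts(cell_labels, rows, cols, gene_ids, n_genes):
    """Build a cell x gene count matrix by looking up each molecule in the label mask."""
    cell_of_molecule = cell_labels[rows, cols]
    inside = cell_of_molecule > 0
    n_cells = int(cell_labels.max())
    counts = np.zeros((n_cells, n_genes), dtype=np.int64)
    # np.add.at accumulates correctly when the same (cell, gene) pair repeats.
    np.add.at(counts, (cell_of_molecule[inside] - 1, gene_ids[inside]), 1)
    unassigned = int((~inside).sum())
    return counts, unassigned

# Toy usage: two square cells and four molecules, one of which falls on background.
labels = np.zeros((4, 4), dtype=np.int64)
labels[0:2, 0:2] = 1
labels[2:4, 2:4] = 2
mol_rows = np.array([0, 1, 3, 0])
mol_cols = np.array([0, 1, 3, 3])
mol_genes = np.array([0, 0, 1, 2])
profile, n_unassigned = cell_by_gene_counts(labels, mol_rows, mol_cols, mol_genes, n_genes=3)
```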
Results
Datasets and computing resource
We selected two datasets acquired via Stereo-seq technology [4]. One was a mouse liver dataset, a tissue that offers cell boundary information via cell membranes, as in all mammalian tissues. The other dataset was derived from seeds of the plant Arabidopsis, a tissue that provides cell boundary information based on rigid cell walls. More details of the two datasets are shown in Table 1.
Table 1.
Detail | Mouse liver dataset | Arabidopsis seed dataset |
---|---|---|
Data source | A slice of liver | Slices of multiple seeds |
Cell nuclei dye | DAPI | ssDNA |
Cell membrane/wall dye | mIF | CFW |
Number of molecules | 16,177,288 | 62,884,637 |
The experiment for image segmentation was implemented on the STOmics cloud platform [16] with these settings: 32 CPUs, 32 GB memory, and “ALL” resource type. An exception was the watershed method [17], which was implemented using ImageJ on a computer with a 16-core CPU and 16 GB of RAM. The experiment for downstream analysis was implemented on a server with a 40-core CPU, 128 GB of RAM, and a GPU with 24 GB of memory.
Evaluation criteria for cell segmentation performance
In a cell mask image, the gray value of a pixel is set to 255 in the cell area and 0 in the background. True positive (TP, the number of pixels with gray value of 255 in both ground truth and segmented result), true negative (TN, the number of pixels with gray value of 0 in both ground truth and segmented result), false positive (FP, the number of pixels with gray value of 0 in ground truth and 255 in segmented result) and false negative (FN, the number of pixels with gray value of 255 in ground truth and 0 in segmented result) are calculated. The number of cells segmented by a method is $n_s$. For each segmented cell (cell $i$), there should be a corresponding area in the ground truth (area $i$), where $i$ is the cell index ($i = 1, 2, \ldots, n_s$). The intersection over union metric (IoU) [18] is set as:
$$\mathrm{IoU}_i = \frac{ao_i}{au_i} \tag{1}$$
where $ao_i$ is the overlap area between cell $i$ and area $i$, and $au_i$ is the union area of cell $i$ and area $i$. Then the precision (Pre), recall (Rec), F1 score (F1_s), Dice coefficient (Dc), and average Jaccard index (Avg_J) are calculated as:
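$$\mathrm{Pre} = \frac{TP}{TP + FP} \tag{2}$$
$$\mathrm{Rec} = \frac{TP}{TP + FN} \tag{3}$$
$$\mathrm{F1\_s} = \frac{2 \cdot \mathrm{Pre} \cdot \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}} \tag{4}$$
$$\mathrm{Dc} = \frac{2\,TP}{2\,TP + FP + FN} \tag{5}$$
$$\mathrm{Avg\_J} = \frac{1}{n_s} \sum_{i=1}^{n_s} \mathrm{IoU}_i \tag{6}$$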
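As a worked illustration, the pixel-level quantities TP, TN, FP, and FN and the scores derived from them can be computed from two binary masks as below. The sketch uses the standard confusion-matrix definitions of precision, recall, F1 score, and Dice coefficient, which we assume correspond to the criteria above; the function names are hypothetical, and the per-cell matching needed for the average Jaccard index is reduced to a single-region IoU helper.

```python
import numpy as np

def pixel_metrics(ground_truth, segmented):
    """Pixel-level scores from two masks where any value > 0 marks cell area."""
    truth = np.asarray(ground_truth) > 0
    pred = np.asarray(segmented) > 0
    tp = int(np.sum(truth & pred))
    tn = int(np.sum(~truth & ~pred))
    fp = int(np.sum(~truth & pred))
    fn = int(np.sum(truth & ~pred))
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    dice = 2 * tp / (2 * tp + fp + fn)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "Pre": pre, "Rec": rec, "F1_s": f1, "Dc": dice}

def region_iou(cell_region, truth_region):
    """IoU of one segmented cell and its corresponding ground-truth area."""
    cell = np.asarray(cell_region) > 0
    area = np.asarray(truth_region) > 0
    return np.sum(cell & area) / np.sum(cell | area)

# Toy usage on a 1 x 4 image with gray values 0 and 255.
truth_mask = np.array([[255, 255, 0, 0]])
pred_mask = np.array([[255, 0, 255, 0]])
scores = pixel_metrics(truth_mask, pred_mask)
iou = region_iou(pred_mask, truth_mask)
```

The sketch does not guard against empty masks; a denominator of zero would need explicit handling in practice.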
Process and evaluation of downstream analysis
The generated single-cell spatial gene expression profiles were input into Stereopy (v0.6.0) [19]. Cells with fewer than ten expressed genes, fewer than three expression counts, and more than 3% mitochondrial genes were removed; genes present in fewer than three cells were also removed. After normalization, the differentially expressed genes were summarized using Principal Component Analysis to reduce the data dimensionality. Specifically, the number of features was reduced to 10. The Leiden algorithm [20] was used for clustering, and the Uniform Manifold Approximation and Projection (UMAP) algorithm (RRID:SCR_018217) [21] was used to obtain 2D data projections. The Silhouette coefficient (Sc) and Moran’s I (MI) were used to evaluate the effect of clustering and the spatial autocorrelation of each cluster, respectively. Sc is calculated as:
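$$Sc = \frac{b_j - a_j}{\max(a_j, b_j)} \tag{7}$$
where $a_j$ is the average distance between the $j$-th sample and other samples in its cluster, and $b_j$ is the average distance between the $j$-th sample and the samples in other clusters. MI is calculated as:
$$MI = \frac{n}{W_0} \cdot \frac{\sum_{k=1}^{n} \sum_{l=1}^{n} \omega_{k,l} (y_k - \bar{y})(y_l - \bar{y})}{\sum_{k=1}^{n} (y_k - \bar{y})^2} \tag{8}$$
where $n$ is the number of clusters, $y_k$ and $y_l$ are the attribute values of the $k$-th and $l$-th clusters, respectively, $\bar{y}$ is the mean of all cluster attributes, $\omega_{k,l}$ is the spatial weight between the $k$-th and $l$-th clusters, and $W_0$ is the aggregation of all spatial weights as:
$$W_0 = \sum_{k=1}^{n} \sum_{l=1}^{n} \omega_{k,l} \tag{9}$$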
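Both indicators are simple to evaluate numerically. The sketch below follows the standard textbook forms of the per-sample silhouette value and of Moran’s I, which we assume match the definitions above; the function names, the toy attribute values, and the neighbor-based weight matrix are hypothetical.

```python
import numpy as np

def silhouette_value(a_j, b_j):
    """Per-sample silhouette: (b_j - a_j) / max(a_j, b_j)."""
    return (b_j - a_j) / max(a_j, b_j)

def morans_i(values, weights):
    """Moran's I for attribute values y and a square matrix of spatial weights."""
    y = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = y.size
    deviation = y - y.mean()
    w0 = w.sum()  # aggregation of all spatial weights
    return (n / w0) * (deviation @ w @ deviation) / (deviation @ deviation)

# Toy usage: four units on a line, with weight 1 between direct neighbors.
attribute = [1.0, 2.0, 3.0, 4.0]
neighbors = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]])
mi = morans_i(attribute, neighbors)
sc = silhouette_value(1.0, 3.0)
```

Values of Moran’s I near 1 indicate that similar attribute values sit next to each other, values near 0 indicate no spatial structure, and negative values indicate alternation.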
STCellbin more accurately segments cells based on cell membrane/wall staining images
We cropped two areas with higher image quality from the two datasets and designed their ground truths based on the manual markup of the cells according to the cell membranes/walls in the staining images. The cell segmentation method of STCellbin was compared with the original Cellpose [18], the state-of-the-art method DeepCell [22], and a traditional watershed method [17].
Using the mouse liver dataset, STCellbin effectively identified cell membranes for segmentation, yielding cell masks that demonstrated acceptable agreement with the staining image and ground truth (Figure 3A, upper). Among all cell mask images, STCellbin provided the best description of the cell boundaries, outperforming other methods, which missed quite a few cells (Figure 3A, lower). We observed a similar trend using the Arabidopsis seed dataset, showing that STCellbin can also effectively identify cell walls for segmentation (Figure 3B). Compared with other methods, STCellbin obtained higher values across most indicators on these two datasets (Figure 3C). The comparison with the original Cellpose validated the effectiveness of STCellbin in adjusting segmentation. While DeepCell is a powerful method, it is primarily designed for segmenting cell nuclei, which involves identifying highlighted areas in the nuclei staining images. This strategy is unsuitable for cell membrane/wall staining images, resulting in less desirable results. Similarly, the traditional watershed method performs poorly on cell membrane/wall staining images. In summary, STCellbin’s cell segmentation emerged as the most practical and effective method.
STCellbin generates more reliable single-cell spatial gene expression profiles for downstream analysis
Currently, there is a lack of image-based one-stop software like STCellbin for Stereo-seq data. Therefore, we compared STCellbin with Baysor (v0.6.2) [23], a tool that generates the spatial gene expression profile without relying on images. However, Baysor could not output results on the complete mouse liver and Arabidopsis seed datasets within an acceptable time or with the given computational resources. We therefore ran Baysor on a smaller Arabidopsis seed dataset, which was the cropped area used in the cell segmentation experiment and contained data for two complete seeds.
The cell area, number of unique genes per cell, and number of gene counts per cell were statistically calculated from the results of STCellbin (Figure 4A). The clustering results of STCellbin were obtained utilizing the generated single-cell spatial gene expression profiles. The clusters of cells were spatially mapped within the tissue (Figure 4B, left-hand side for each tissue), allowing for the observation of their specific positions. From the UMAPs, it was apparent that the different cell types were effectively distinguished (Figure 4B, right-hand side for each tissue). The spatial location of the different cell types positively influenced a series of downstream analyses, such as cellular annotation in less well-studied tissues.
The cells were clustered into seven clusters on the profile of STCellbin, and 14 clusters on the profile of Baysor (Figure 4C, the first subfigure from the left). We observed that the number of cells segmented by Baysor was significantly higher than that segmented by STCellbin, and it did not align with the cell count observed in the ground truth image. This fact could account for the higher number of clusters produced by Baysor. The Sc and MI obtained by both STCellbin and Baysor were not satisfactory (Figure 4C, the second and third subfigures from the left), possibly due to the limited information from a small dataset. Nevertheless, the values from STCellbin were higher than those from Baysor. Moreover, STCellbin demonstrated significant advantages in terms of computing resource usage and running time (Figure 4C, the fourth and fifth subfigures from the left), which also explains why Baysor was unable to process the complete mouse liver and Arabidopsis seed datasets.
It should be noted that the stitching and registration steps of STCellbin could not be performed on the cropped dataset. Hence, the corresponding computational resource usage and running time could not be recorded. Thus, the resource usage and time of STCellbin for comparison were obtained on the complete Arabidopsis seed dataset. Specifically, STCellbin was able to process a larger dataset with fewer computational resources and less time compared to Baysor. Overall, STCellbin is a more reliable method, particularly for analyzing high-resolution and large-field-of-view spatial transcriptomic data.
Discussion
Accurate identification of cell boundaries is crucial in generating single-cell resolution in spatial omics applications. Building upon previous work in StereoCell, which uses cell nuclei staining images to generate single-cell spatial gene expression profiles, this STCellbin update extends the capability to automatically process Stereo-seq cell membrane/wall staining images for identifying cell boundaries, thereby facilitating downstream analyses. We also showcased a few examples of the performance of cell membrane/wall segmentation in STCellbin. Currently, the tools for cell nuclei and cell membrane/wall segmentation can be independently executed, allowing users to choose the most suitable solution for their specific applications. In future work, these two techniques could be combined by training a deep learning model compatible with any staining image type, thereby achieving more accurate results.
Availability of source code and requirements
Project name: STCellbin
Project home page: https://github.com/STOmics/STCellbin
Operating system(s): Platform independent
Programming language: Python
Other requirements: Python 3.8
License: MIT License
RRID: SCR_024438
Acknowledgements
We thank China National GeneBank for providing technical support.
Funding Statement
This work was supported by the National Key R&D Program of China (2022YFC3400400).
Data availability
The data that support the findings of this study have been deposited into the Spatial Transcript Omics DataBase (STOmics DB) of the China National GeneBank DataBase (CNGBdb), with accession number STT0000048. A backup for the data is also provided at the GitHub link of STCellbin [24]. Archival snapshots of the code are also available from Software Heritage (Figure 5) [25].
Abbreviations
FFT, Fast Fourier Transform; MI, Moran’s I; mIF, multiplex immunofluorescence; Sc, Silhouette coefficient; UMAP, Uniform Manifold Approximation and Projection.
Declarations
Ethics approval and consent to participate
The authors declare that ethical approval was not required for this type of research.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
Conceptualization: BZ and ML. Project administration and supervision: SB and XX. Software implementation: ZD, HQ, KS, and HL. Data collection and processing: QK, XF, and LC. Validation: QK and ZD. Project coordination: BZ and ML. Manuscript writing and figure generation: BZ, ML, and QK. Manuscript review: ML, SF, YZ, YL and SB.
References
- 1. Fang S, Chen B, Zhang Y et al. Computational approaches and challenges in spatial transcriptomics. Genom. Proteom. Bioinform., 2023; 21: 24–47. doi: 10.1016/j.gpb.2022.10.001.
- 2. Lu T, Ang CE, Zhuang X. Spatially resolved epigenomic profiling of single cells in complex tissues. Cell, 2022; 185: 4448–4464. doi: 10.1016/j.cell.2023.04.006.
- 3. Erickson A, He M, Berglund E et al. Spatially resolved clonal copy number alterations in benign and malignant tissue. Nature, 2022; 608: 360–367. doi: 10.1038/s41586-022-05023-2.
- 4. Chen A, Liao S, Cheng M et al. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell, 2022; 185: 1777–1792. doi: 10.1016/j.cell.2022.04.003.
- 5. Li M, Liu H, Li M et al. StereoCell enables highly accurate single-cell segmentation for spatial transcriptomics. bioRxiv. 2023; 10.1101/2023.02.28.530414.
- 6. Chen KH, Boettiger AN, Moffitt JR et al. Spatially resolved, highly multiplexed RNA profiling in single cells. Science, 2015; 348: aaa6090. doi: 10.1126/science.aaa6090.
- 7. He S, Bhatt R, Brown C et al. High-plex imaging of RNA and proteins at subcellular resolution in fixed tissue by spatial molecular imaging. Nat. Biotechnol., 2022; 40: 1794–1806. doi: 10.1038/s41587-022-01483-z.
- 8. Janesick A, Shelansky R, Gottscho AD et al. High resolution mapping of the tumor microenvironment using integrated single-cell, spatial and in situ analysis. Nat. Commun., 2023; 14(1): 8353. doi: 10.1038/s41467-023-43458-x.
- 9. Liao S, Heng Y, Liu W et al. Integrated spatial transcriptomic and proteomic analysis of fresh frozen tissue based on stereo-seq. bioRxiv. 2023; 10.1101/2023.04.28.538364.
- 10. STOmics Documentation. https://en.stomics.tech/.
- 11. Duhamel P, Vetterli M. Fast fourier transforms: A tutorial review and a state of the art. Signal Process., 1990; 19(4): 259–299.
- 12. Pachitariu M, Stringer C. Cellpose 2.0: how to train your own model. Nat. Methods, 2022; 19: 1634–1641. doi: 10.1038/s41592-022-01663-4.
- 13. Azad R, Asadi-Aghbolaghi M, Fathy M et al. Bi-Directional ConvLSTM U-Net with densley connected convolutions. In: IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). 2019. pp. 406–415, doi: 10.1109/ICCVW.2019.00052.
- 14. Levina A, Priesemann V. Subsampling scaling. Nat. Commun., 2017; 8: 15140. doi: 10.1038/ncomms15140.
- 15. Reynolds D. Gaussian mixture models. In: Li SZ, Jain A (eds), Encyclopaedia of Biometrics. vol. 741, Boston, MA: Springer, 2009; pp. 659–663.
- 16. STOmics Cloud. https://cloud.stomics.tech/.
- 17. Wen T, Tong B, Liu Y et al. Review of research on the instance segmentation of cell images. Comput. Meth. Prog. Bio., 2022; 227: 107211. doi: 10.1016/j.cmpb.2022.107211.
- 18. Stringer C, Wang T, Michaelos M et al. Cellpose: a generalist algorithm for cellular segmentation. Nat. Methods, 2021; 18: 100–106. doi: 10.1038/s41592-020-01018-x.
- 19. Fang S, Xu M, Cao L et al. Stereopy: modeling comparative and spatiotemporal cellular heterogeneity via multi-sample spatial transcriptomics. bioRxiv. 2023; 10.1101/2023.12.04.569485.
- 20. Traag VA, Waltman L, Van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep., 2019; 9: 5233. doi: 10.1038/s41598-019-41695-z.
- 21. Becht E, McInnes L, Healy J et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol., 2019; 37: 38–44. doi: 10.1038/nbt.4314.
- 22. Greenwald NF, Miller G, Moen E et al. Whole-cell segmentation of tissue images with human-level performance using large-scale data annotation and deep learning. Nat. Biotechnol., 2022; 40: 555–565.
- 23. Petukhov V, Xu RJ, Soldatov RA et al. Cell segmentation in imaging-based spatial transcriptomics. Nat. Biotechnol., 2022; 40: 345–354. doi: 10.1038/s41587-021-01044-w.
- 24. STCellbin GitHub. 2023; https://github.com/STOmics/STCellbin.
- 25. Zhang B, Li M, Kang Q et al. STCellbin (Version 1) [Computer software]. Software Heritage. 2024; https://archive.softwareheritage.org/swh:1:dir:f7963c6d274ef64e392923ce7405f39fb23dea5a;origin=https://github.com/STOmics/STCellbin;visit=swh:1:snp:dd0bfc7b4fb0791789cf43b2c50742d25c8bc00e;anchor=swh:1:rev:09e89551499f980c4ffb4df9fb73712d93830fd0.