SAW: an efficient and accurate data analysis workflow for Stereo-seq spatial transcriptomics

Chun Gong; Shengkang Li; Leying Wang; Fuxiang Zhao; Shuangsang Fang; Dong Yuan; Zijian Zhao; Qiqi He; Mei Li; Weiqing Liu; Zhaoxun Li; Hongqing Xie; Sha Liao; Ao Chen; Yong Zhang; Yuxiang Li; Xun Xu

GigaByte. 2024 Feb 20;2024:gigabyte111. doi: 10.46471/gigabyte.111

PMCID: PMC10905255 PMID: 38434930

Abstract

The basic analysis steps of spatial transcriptomics require obtaining gene expression information for both spatial locations and cells. Existing tools for these analyses run into performance problems on large datasets: spatial localization and RNA genome alignment are computationally intensive, and memory usage becomes excessive for large chips. These problems limit the applicability and efficiency of the analysis. Here, we present the Stereo-seq Analysis Workflow (SAW), a high-performance and accurate spatial transcriptomics data analysis workflow for the Stereo-seq technology developed at BGI. SAW includes mRNA spatial position reconstruction, genome alignment, gene expression matrix generation, and clustering. The workflow outputs files in a universal format for subsequent personalized analysis. On test data of ∼1 billion reads from a 1 × 1 cm chip, the entire analysis takes ∼148 min, 1.8 times faster than the unoptimized workflow.

Statement of need

Stereo-seq of BGI STOmics [1] is a panoramic spatial transcriptome technology that achieves ultra-high throughput and ultra-high precision. By capturing mRNA in tissues with the Stereo-seq chip and restoring it to its spatial location, in situ sequencing of tissues is achieved, laying the foundation for a deeper understanding of the relationship between gene expression, morphology, and the local environment of cells.

Because of its ultra-high throughput and ultra-high precision, Stereo-seq generates a large amount of data, which poses a challenge for data analysis and calls for efficient analysis tools. In addition, accurate spatial positioning is an essential part of the analysis, because every subsequent step builds on it.

In spatial transcriptomics data analyses, large amounts of data can lead to performance issues in the traditional analysis process. Firstly, the alignment of mRNA sequences, whether using STAR [2] (RRID:SCR_004463) or other software, is too slow for data of this scale. On the s1 (1 cm × 1 cm) chip, this step can account for 70% of the processing time. Secondly, coordinate ID (CID) mapping is an essential step, and its accuracy affects the efficiency of spatial positioning. In this step, CIDs and coordinates need to be held in memory for real-time query and spatial positioning of reads. For large chips, such as S6 (6 cm × 6 cm), the number of spatial coordinate points can reach 15 billion, and the data structure that stores the correspondence between CIDs and coordinates occupies a lot of memory. Moreover, querying such a large table can be slow, and fault-tolerant matching increases the computational complexity and time consumption further. Finally, on large chips, operations on matrices of the same size as the chip also become bottlenecks because of excessive memory usage and slow speed. These problems need to be solved through high-performance computing technologies.

We developed the Stereo-seq Analysis Workflow (SAW), a standard analysis pipeline for Stereo-seq data. Taking FASTQ [3, 4] files as input, SAW performs mRNA spatial location restoration, filtering, mRNA genome alignment, gene region annotation, molecule identity (MID) correction, expression matrix generation, tissue region extraction, clustering, saturation analysis, and report generation to obtain the gene expression and spatial information of tissues. SAW therefore provides the complete set of basic analyses required for spatial transcriptomics data.

Implementation

Processing and parallelization of spatial information in large chips

Spatial transcriptomics localizes sequencing data by tagging each spatial position, and the reads captured there, with a 25 bp CID sequence. Reads are then returned to their original spatial position by matching the CID sequence on each read. However, because the sequences produced by current sequencers are not 100% accurate, some error tolerance is required when matching the CID sequence. The current error tolerance strategy is to replace each base of the CID sequence in turn with each of the other three bases (a sequence comprises four bases: A, G, C, and T) and then perform the matching.
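The one-substitution tolerance strategy can be sketched as follows. This is a minimal illustration rather than SAW's implementation: the dictionary `mask`, the function names, and the rule of rejecting ambiguous matches are assumptions, and 4-base CIDs stand in for the real 25 bp sequences.

```python
BASES = "ACGT"

def one_mismatch_variants(cid):
    """Yield every sequence that differs from `cid` at exactly one position."""
    for i, original in enumerate(cid):
        for base in BASES:
            if base != original:
                yield cid[:i] + base + cid[i + 1:]

def locate(read_cid, mask):
    """Return the (x, y) coordinate for a read CID, or None if it cannot be placed.

    `mask` maps each chip CID to its coordinate. An exact match wins; otherwise
    the read is placed only if its single-substitution variants hit exactly one
    coordinate (rejecting ambiguous hits is an assumption of this sketch).
    """
    if read_cid in mask:
        return mask[read_cid]
    hits = {mask[v] for v in one_mismatch_variants(read_cid) if v in mask}
    return hits.pop() if len(hits) == 1 else None

# Toy mask with 4-base CIDs (hypothetical data).
mask = {"ACGT": (0, 0), "TTTT": (0, 1), "TTTA": (1, 1)}
print(locate("ACGT", mask))  # exact match
print(locate("ACGA", mask))  # one substitution away from ACGT only
print(locate("TTTC", mask))  # one substitution away from both TTTT and TTTA
print(len(list(one_mismatch_variants("A" * 25))))  # 3 x 25 candidates for a 25 bp CID
```

Each 25 bp CID has 75 single-substitution variants, so a tolerant lookup costs up to 75 extra queries per unmatched read, which is one reason the query table size matters.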

Due to the high resolution and large field of view of Stereo-seq, the number of spatial coordinate points is very large. For example, for the S6 chip (6 cm × 6 cm), the number of spatial coordinate points can reach as many as 15 billion. Simply storing the corresponding relationship between each coordinate point and the CID sequence in a data structure can consume more than 600 GB of memory, and the query speed can be very slow, seriously affecting the applicability and analysis efficiency of standard analyses.

Therefore, we split the mask file, which records the correspondence between CID sequences and spatial coordinates, into parts. The FASTQ files are split according to the same rule. For example, to split the mask file into four parts, the first base of the CID sequence can be used as the classification criterion, giving one part each for CIDs starting with A, C, G, and T. If 16 parts are needed, the first two bases can be used. If the number of parts is not a power of 4, such as ten, a modulo operation can be used. Each mask part is paired with the FASTQ part of the same category, CID mapping is performed within each pair, and the results are merged when needed. This step solves the above memory problem and improves the parallelization of data processing (Figure 1).
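A sketch of the splitting rule, under stated assumptions: the 2-bit base encoding, the function names, and the choice to apply the modulo to the whole encoded CID are illustrative and are not taken from the SAW source.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(cid):
    """Pack a CID into an integer, two bits per base (A=0, C=1, G=2, T=3)."""
    value = 0
    for base in cid:
        value = value * 4 + CODE[base]
    return value

def partition_of(cid, n_parts):
    """Return the partition index (0 .. n_parts - 1) for a CID.

    If n_parts is a power of 4, the leading bases decide the partition;
    otherwise the encoded CID is reduced modulo n_parts.
    """
    k, size = 0, 1
    while size < n_parts:
        size *= 4
        k += 1
    if size == n_parts:
        return encode(cid[:k])
    return encode(cid) % n_parts

def split_by_cid(records, n_parts, cid_of):
    """Distribute records over n_parts lists using the same rule for mask and reads."""
    parts = [[] for _ in range(n_parts)]
    for record in records:
        parts[partition_of(cid_of(record), n_parts)].append(record)
    return parts

# Toy data with 4-base CIDs (hypothetical).
mask = [("ACGT", (0, 0)), ("CGTA", (0, 1)), ("GTAC", (1, 0)), ("TACG", (1, 1))]
reads = ["CGTA", "ACGT", "TACG", "ACGT"]
mask_parts = split_by_cid(mask, 4, cid_of=lambda record: record[0])
read_parts = split_by_cid(reads, 4, cid_of=lambda cid: cid)
for mask_part, read_part in zip(mask_parts, read_parts):
    print(mask_part, read_part)  # each pair can be mapped independently
```

A read whose sequencing error falls in the bases used for partitioning lands in a different partition from its true CID. The text does not describe how that case is handled, and this sketch ignores it.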

Figure 1. — Stereo-seq data analysis workflow.

Rapid alignment of genomes

Reads that have been successfully assigned a spatial location pass through several filters, which check whether the MID contains N bases, whether the MID is poly-A, the MID quality value, and whether the mRNA contains poly-A. Reads that pass these filters undergo genome alignment, which outputs a BAM [5] file containing the alignment information. Commonly used RNA alignment tools include STAR, HISAT2 [7] (RRID:SCR_015530), and TopHat2 [6] (RRID:SCR_013035). Among them, STAR is known for its high unique alignment rate and relatively fast speed, but it still cannot meet the requirements of sizeable spatial omics datasets. Therefore, we made a series of optimizations, including efficient multi-threaded input-output (IO) models, single instruction multiple data (SIMD) instructions, a higher L2 cache hit rate and other micro-architectural optimizations, redesigned application-level data processing algorithms, and FM-index technology in the maximum matching prefix search, ultimately accelerating alignment twofold.
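The four read filters named at the start of this subsection could be expressed as a single predicate. This is only a sketch: the text gives no thresholds, so `min_qual`, `max_low_qual`, and `polya_run` and their values are placeholders, as are the function and parameter names.

```python
def passes_filters(mid, mid_quals, mrna, min_qual=10, max_low_qual=2, polya_run=15):
    """Return True if a read survives the four filters named in the text.

    mid        MID sequence of the read
    mid_quals  per-base quality scores of the MID (integers)
    mrna       mRNA sequence of the read
    The three thresholds are placeholders chosen for this sketch.
    """
    if "N" in mid:                        # MID contains an undetermined base
        return False
    if set(mid) == {"A"}:                 # MID consists only of A (poly-A)
        return False
    if sum(q < min_qual for q in mid_quals) > max_low_qual:
        return False                      # too many low-quality MID bases
    if "A" * polya_run in mrna:           # mRNA contains a long poly-A stretch
        return False
    return True

good = passes_filters("ACGTACGTAC", [30] * 10, "ACGT" * 10)
print(good)
```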

Gene expression matrix generation

Gene expression quantification in STOmics is performed by the count tool of the analysis software. Count annotates uniquely mapped reads from the alignment results using the reference gene annotation file (GFF/GTF [8, 9]) of the corresponding species, and then corrects and deduplicates MIDs, producing processed BAM files together with the gene annotation and the MID correction and deduplication results.
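The counting that this step ends with (keep exonic and intronic reads, drop reads on the opposite strand, then count unique MIDs per coordinate and gene, as listed under "Expression matrix" later in this section) can be sketched as follows. The record field names and the `mid_map` argument are assumptions made for illustration.

```python
from collections import defaultdict

def expression_matrix(reads, mid_map=None):
    """Count unique MIDs for every (x, y, gene).

    reads    iterable of dicts with keys x, y, gene, mid, region, same_strand
             (field names are assumptions of this sketch)
    mid_map  optional {(x, y, gene): {old_mid: new_mid}} from MID correction
    """
    mid_map = mid_map or {}
    mids = defaultdict(set)
    for read in reads:
        if read["region"] not in ("EXONIC", "INTRONIC"):
            continue                                  # keep exonic/intronic reads only
        if not read["same_strand"]:
            continue                                  # drop reads opposite to the gene strand
        key = (read["x"], read["y"], read["gene"])    # group by coordinate and gene
        corrected = mid_map.get(key, {}).get(read["mid"], read["mid"])
        mids[key].add(corrected)                      # duplicates collapse in the set
    return {key: len(unique) for key, unique in mids.items()}

def rec(x, y, gene, mid, region, same_strand=True):
    return dict(x=x, y=y, gene=gene, mid=mid, region=region, same_strand=same_strand)

reads = [
    rec(1, 1, "G1", "AAA", "EXONIC"),
    rec(1, 1, "G1", "AAA", "EXONIC"),          # duplicate of the first read
    rec(1, 1, "G1", "AAT", "INTRONIC"),
    rec(1, 1, "G2", "CCC", "INTERGENIC"),      # removed: intergenic
    rec(2, 1, "G1", "GGG", "EXONIC", False),   # removed: opposite strand
    rec(2, 1, "G1", "GGA", "EXONIC"),
]
print(expression_matrix(reads))
print(expression_matrix(reads, {(1, 1, "G1"): {"AAT": "AAA"}}))
```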

Gene region annotation process

(1)
For each read, search for overlaps with the gene intervals in the annotation file, determine the gene name and strand from the annotation, classify the read as EXONIC, INTRONIC, or INTERGENIC, and count how many reads belong to each type.
(2)
Parse the CIGAR information. Obtain the entire length of the read and the starting position and length of each align block.
(3)
Search for all overlapping genes.
(4)
Determine which gene to choose. For each gene:
- (a)
  For each align block and each transcript of the current gene, calculate exoncnt and introncnt from the length of the overlap with exon and non-exon regions.
- (b)
  Accumulate the counts over all align blocks to obtain the optimal exoncnt/introncnt. If the exoncnt is greater than or equal to 50% of the read length, mark the read as EXONIC. Otherwise, if the introncnt is greater than or equal to 50% of the read length, mark it as INTRONIC. Otherwise, mark it as INTERGENIC.
- (c)
  Choose the most reliable gene from multiple genes.
  - (i)
    First, obtain a list of genes with the best annotation results (priority is given to EXONIC>INTRONIC>INTERGENIC).
  - (ii)
    From these genes, select the gene with the largest overlap as the annotation result.
  - (iii)
    If multiple genes have the same overlap length, select one of them by a fixed rule: choose the gene with the smallest start and end.
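The annotation steps above can be sketched in a few functions. Several details are assumptions made for this illustration: coordinates are treated as half-open intervals, the read length is taken as the sum of the aligned blocks, the "optimal" exon count is the maximum over transcripts, the overlap used for ranking is the overlap with the gene span, and the gene dictionaries are a hypothetical data layout.

```python
import re

RANK = {"EXONIC": 0, "INTRONIC": 1, "INTERGENIC": 2}

def align_blocks(pos, cigar):
    """Split an alignment into reference blocks [start, end) using its CIGAR string."""
    blocks, ref = [], pos
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=X":            # aligned bases: emit a block
            blocks.append((ref, ref + length))
            ref += length
        elif op in "DN":           # deletion or skipped region: advance on the reference
            ref += length
    return blocks

def overlap(a, b):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def annotate_gene(blocks, gene):
    """Return (label, overlap with the gene span) for one read and one gene."""
    read_len = sum(end - start for start, end in blocks)
    in_gene = sum(overlap(b, (gene["start"], gene["end"])) for b in blocks)
    exon_cnt = max(
        sum(overlap(b, exon) for b in blocks for exon in exons)
        for exons in gene["transcripts"]
    )
    intron_cnt = in_gene - exon_cnt
    if 2 * exon_cnt >= read_len:
        return "EXONIC", in_gene
    if 2 * intron_cnt >= read_len:
        return "INTRONIC", in_gene
    return "INTERGENIC", in_gene

def choose_gene(blocks, genes):
    """Pick the best gene: label priority, then overlap, then smallest start and end."""
    best = None
    for gene in genes:
        label, ov = annotate_gene(blocks, gene)
        if ov == 0:
            continue
        key = (RANK[label], -ov, gene["start"], gene["end"])
        if best is None or key < best[0]:
            best = (key, gene["name"], label)
    return (best[1], best[2]) if best else (None, "INTERGENIC")

genes = [
    {"name": "A", "start": 100, "end": 1000, "transcripts": [[(100, 200), (500, 600)]]},
    {"name": "B", "start": 150, "end": 400, "transcripts": [[(150, 400)]]},
]
print(choose_gene(align_blocks(180, "20M300N30M"), genes))  # spliced read over A's exons
print(choose_gene(align_blocks(250, "50M"), genes))         # intron of A, exon of B
```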

MID correction

The following process uses the Hamming distance to correct erroneous MIDs caused by sequencing errors:

(1)
Data preparation. A nested map of the form {cidgene: {mid: cnt}} stores, for each CID and gene combination, the read count of each MID.
(2)
Correction.
- (a)
  Set parameters: minimum number of MID types, mismatch tolerance, and MID length. Defaults: 5, 1, and 10.
- (b)
  Correction. For each group of data in the nested map:
  - (i)
    Check the number of MID types and continue processing only if it is greater than or equal to the threshold.
  - (ii)
    Sort the MIDs in descending order of cnt, obtaining a list of the form [(mid1, cnt1), (mid2, cnt2), …].
  - (iii)
    Traverse the sorted list in reverse order, starting with the MID with the smallest cnt, and calculate the number of mismatched bases against the other MIDs. If this number is within the tolerance threshold, correct the current MID to the MID with the larger cnt and transfer the cnt of the current MID to the corrected MID.
  - (iv)
    Obtain the nested map after correction in the form of {cidgene: {oldmid: newmid}}.
Expression matrix
- (1)
  Select reads annotated to EXON or INTRON.
- (2)
  Filter out reads whose strand is opposite to that of the annotated gene.
- (3)
  Group by coordinate, gene, and MID in order.
- (4)
  Count the number of unique MIDs for each coordinate and gene; these counts form the expression matrix.
(3)
Example

Given a fault tolerance threshold of 1, assume that the sorted MIDs and their counts are as follows:

{
"AAA": 5,
"GGA": 4,
"AGA": 3,
"AAT": 2,
"GGG": 1,
"CCC": 1
}

Correction process:

Compare "CCC" with each of "AAA" through "GGG": every mismatch count is greater than the threshold of 1, so "CCC" is left unchanged.

Compare "GGG" with each of "AAA" through "AAT": the mismatch count with "GGA" is 1. Update the counts of both and record the correspondence before and after correction.

Compare "AAT" with each of "AAA" through "AGA": the mismatch count with "AAA" is 1. Update the counts of both and record the correspondence before and after correction.

Compare "AGA" with "AAA" and "GGA": the mismatch count with "AAA" is 1. Update the counts of both and record the correspondence before and after correction.

Compare "GGA" with "AAA": the mismatch count is greater than the threshold of 1, so "GGA" is left unchanged.

Update the original data to:

{
"AAA": 10,
"GGA": 5,
"AGA": 0,
"AAT": 0,
"GGG": 0,
"CCC": 1
}

Save the mapping relationship before and after correction:

{
"AGA": "AAA",
"AAT": "AAA",
"GGG": "GGA"
}
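The MID correction procedure and its worked example can be reproduced with a short function. This is a sketch of the described steps for a single CID and gene group, not the SAW source: the function names are illustrative, and the rule that a MID is merged into the first (largest-count) MID within tolerance is inferred from the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length MIDs differ."""
    return sum(x != y for x, y in zip(a, b))

def correct_mids(mid_counts, min_types=5, max_mismatch=1):
    """Correct the MIDs of one CID and gene group.

    Returns (corrected counts, {old_mid: new_mid}). Groups with fewer than
    min_types distinct MIDs are returned unchanged.
    """
    counts = dict(mid_counts)
    mapping = {}
    if len(counts) < min_types:
        return counts, mapping
    # Descending by count; ties keep their input order (sorted() is stable).
    ranked = sorted(counts, key=lambda mid: -counts[mid])
    # Walk from the rarest MID upwards, comparing only with higher-ranked MIDs.
    for i in range(len(ranked) - 1, 0, -1):
        current = ranked[i]
        for target in ranked[:i]:
            if hamming(current, target) <= max_mismatch:
                counts[target] += counts[current]   # transfer the count
                counts[current] = 0
                mapping[current] = target
                break
    return counts, mapping

group = {"AAA": 5, "GGA": 4, "AGA": 3, "AAT": 2, "GGG": 1, "CCC": 1}
corrected, mapping = correct_mids(group)
print(corrected)
print(mapping)
```

Run on the example group, the function yields the updated counts and the mapping shown in the worked example. Because higher-ranked MIDs are visited last, a target never has its count moved away before a rarer MID is merged into it.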

Extracting data of tissue coverage area

Data from the tissue coverage area are extracted based on the tissue outline. The tissueCut tool combines two deep learning models and a traditional image processing method in an end-to-end tissue region extraction algorithm that is compatible with two types of images (tissue microscopic images and gene expression heatmaps). The deep learning method uses the BiSeNet [10, 11] network to train two lightweight real-time segmentation models, which are used to extract tissue regions in microscopic images and gene expression heatmaps. The traditional image processing algorithm extracts tissue regions mainly from the grayscale values of the image. The algorithm proceeds as follows:

(1)
Different image pre-processing is applied depending on the image type and the algorithm used.
(2)
Tissue regions are extracted by the deep learning or the traditional algorithm.
(3)
The result is post-processed to filter noise and obtain the final tissue region. The tissue outline is then extracted from the tissue region, and the data at the coordinates inside the outline are obtained.

Clustering

The clustering process for identifying heterogeneity and similarity among cells in tissue regions uses spatial information and gene expression levels. The clustering process involves three steps:

(1)
Data preprocessing. This step involves filtering, normalization, and standardization of the data. The purpose of filtering is to remove cells with too few genes. Normalization and standardization aim to transform the data into the same scale and eliminate the adverse effects of outliers.
(2)
Feature selection. Principal component analysis and UMAP [12] (RRID:SCR_018217) are used for feature selection and dimensionality reduction. The most representative genes are selected from all gene expression values for the subsequent clustering analysis.
(3)
Clustering analysis. An unsupervised clustering is performed using the Leiden [13] algorithm, a graph-based clustering method. First, a neighborhood graph is constructed based on the similarity between cells, where each cell is considered a node and the connections between nodes represent their similarity. Each node is initially considered a separate cluster, and the modularity of the entire graph is calculated. Then, adjacent nodes are iteratively merged to improve modularity. In each iteration, the modularity of merging each node with its neighboring nodes is calculated, and the merging method with the highest modularity is selected. When the iterative process converges, the final clustering result is reached.
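The quantity improved in each iteration is the modularity of the partition. As a minimal illustration, the function below evaluates the standard modularity of an unweighted, undirected graph, Q = Σ_c [L_c/m − (d_c/2m)²], where m is the number of edges, L_c the number of edges inside community c, and d_c the total degree of its nodes. The toy graph and the function name are assumptions, and this shows only the objective, not the Leiden algorithm itself.

```python
def modularity(edges, community):
    """Modularity of a partition of an unweighted, undirected graph.

    edges      list of (u, v) pairs
    community  dict mapping each node to its community label
    """
    m = len(edges)
    internal = {}   # edges with both ends in the same community
    degree = {}     # summed node degree per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())

# Two triangles joined by a single edge (toy neighborhood graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
singletons = {node: node for node in range(6)}        # every node is its own cluster
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "all" for node in range(6)}

q_single = modularity(edges, singletons)
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(q_single, q_two, q_one)
```

Starting from singleton clusters, merging nodes into the two triangles raises the modularity, while merging everything into one cluster lowers it again, which is why the iteration stops at the two-cluster partition in this toy case.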

Saturation analysis

Preparation

The saturation value is calculated as 1 − (unique reads/total reads). First, 5% of the unique bin200 coordinates are sampled and restored to bin1 coordinates. Then, the sampled bin1 coordinates that lie under the tissue are used to filter the data, a list of (x, y, gene, MID) entries is constructed, and all count values are accumulated to obtain the number of annotated reads.

Saturation calculation

Shuffle the previous list and process it in order at the sampling fractions {0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}. For each sampling point, calculate the total reads, the saturation value, and the median number of genes under bin1 and bin200, respectively, and output the statistical result file.

Saturation value. Calculate the unique reads (i.e., gene and MID are both unique) and the total reads under all bins, and apply the formula 1 − (unique reads/total reads). The algorithms for bin1 and bin200 are the same.

Median number of genes. Calculate the number of unique genes under each bin and take the median. The algorithms for bin1 and bin200 are the same.

Unique reads. The unique reads of bin1 are all the unique reads (i.e., gene and MID are both unique) under all bin1; the unique reads of bin200 are all the unique reads under all bin200 (x, y coordinates of bin1, gene, and MID are all unique).
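The saturation calculation can be sketched as follows. The sketch assumes one (x, y, gene, MID) tuple per read, defines a unique read as a unique (bin, gene, MID) combination, and uses a fixed random seed. These choices and all names are illustrative, and the initial 5% coordinate sampling is omitted.

```python
import random
from collections import defaultdict
from statistics import median

FRACTIONS = (0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)

def saturation_curve(reads, bin_size=1, seed=0):
    """Return one row of statistics per sampling fraction.

    reads     iterable of (x, y, gene, mid) tuples, one per annotated read
    bin_size  1 for bin1, 200 for bin200
    """
    reads = list(reads)
    random.Random(seed).shuffle(reads)
    rows = []
    for fraction in FRACTIONS:
        sample = reads[: round(len(reads) * fraction)]
        if not sample:
            continue
        unique = set()              # unique (bin, gene, MID) combinations
        genes = defaultdict(set)    # genes seen in each bin
        for x, y, gene, mid in sample:
            bin_xy = (x // bin_size, y // bin_size)
            unique.add((bin_xy, gene, mid))
            genes[bin_xy].add(gene)
        rows.append({
            "fraction": fraction,
            "total_reads": len(sample),
            "saturation": 1 - len(unique) / len(sample),
            "median_genes": median(len(g) for g in genes.values()),
        })
    return rows

# Toy data: 10 reads, 3 unique (coordinate, gene, MID) combinations.
reads = [(0, 0, "G1", "AAA")] * 6 + [(0, 0, "G1", "AAC")] * 2 + [(1, 0, "G2", "AAA")] * 2
bin1_rows = saturation_curve(reads, bin_size=1)
bin200_rows = saturation_curve(reads, bin_size=200)
print(bin1_rows[-1])
print(bin200_rows[-1])
```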

Memory issues in large chip scenarios

In large chip scenarios, in addition to the mask file, the large IO files that store gene expression information and the matrix calculations can cause excessive memory consumption. To solve this problem, we used a series of optimization techniques, including batch processing of large IO files and partial matrix calculations, pre-calculating sizes to avoid dynamically expanding data structures, and designing compact custom data types with smaller memory overhead, tailored to the characteristics of the data. These techniques enabled large-chip data to be processed successfully on machines with an ordinary amount of memory (256 GB).

Examples

mRNA spatial position restoration, filtering, and genome alignment statistics

Taking the SS200000135TLD1 data (https://github.com/BGIResearch/SAW/tree/main/testdata) as an example, we executed mRNA spatial position restoration, filtering, and genome alignment in sequence, and obtained the statistics shown in Figure 2. Of the ∼1 billion reads, 818 million (78.8%; each percentage is relative to the previous step) could be assigned to spatial positions. After filtering, 763 million reads (93.3%) remained. After alignment to the genome, we obtained 641 million (84.0%) uniquely mapped reads.

Figure 2. — Summary of spatial position restoration, filtering, and genome alignment.

Gene expression spatial distribution map

The BAM file generated after genome alignment was annotated and MID-corrected to produce expression information and statistical results. The expression information was stored in HDF5 format and could be visualized (Figure 3). The statistical results provided the numbers of reads annotated to exonic, intronic, and intergenic regions. A total of 481 million reads were annotated in the exonic region (Table 1).

Figure 3. — A demo of the spatial visualization of gene expression information.

Table 1.

Demo of the reads annotation statistics.

Type	Number
Exonic	495,050,694
Intronic	50,296,627
Intergenic	113,402,471
Transcriptome	545,347,321
Antisense	112,611,349

Spatial clustering

Gene expression profiles were calculated for each bin200 position (each bin aggregates a 200 × 200 grid of points), and spatial clustering was then performed (Figure 4). This produced 21 clusters, which roughly aligned with the cell clustering results.

Figure 4. — A demo of the spatial clustering for a mouse brain dataset.

Saturation analysis

Our saturation analysis showed that the median sequencing depth and number of gene species tended to saturate, while the number of unique reads did not yet saturate (Figure 5).

Figure 5. — A demo of reads saturation statistics.

Testing

Each tool in the pipeline was optimized through a series of high-performance computing techniques. We conducted performance tests on three samples to evaluate changes in runtime and memory. Data1 and data2 are both s1 chips (1 cm × 1 cm) with around 1 billion reads, while data3 is a large chip (2 cm × 3 cm). After optimization, the runtime on data1 decreased from 263.1 to 148.1 min, a 1.8× speed-up, and the mapping time decreased from 175.0 to 106.9 min. On data2, the runtime decreased from 227.9 to 127.3 min, also a 1.8× speed-up, and the mapping time decreased from 158.0 to 83.7 min (Table 2). In terms of memory, the peak memory of tissueCut on data3 decreased from well over 83.5 GB to 33.5 GB after optimization.

Table 2.

The elapsed time used by the analysis workflow before and after optimization.

Dataset	Elapsed time before optimization (min)	Elapsed time after optimization (min)
Data1 (s1 chip, ∼1 billion reads)	263.1	148.1
Data2 (s1 chip, ∼1 billion reads)	227.9	127.3
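
The speed-up factors quoted in this section follow directly from the reported timings. A short check, using only the numbers given in the text (the dictionary keys are labels chosen for this sketch):

```python
# (before, after) elapsed times in minutes, as reported in the Testing section.
timings = {
    "data1 total": (263.1, 148.1),
    "data1 mapping": (175.0, 106.9),
    "data2 total": (227.9, 127.3),
    "data2 mapping": (158.0, 83.7),
}

def speedup(before, after):
    """Ratio of the runtime before optimization to the runtime after it."""
    return before / after

for name, (before, after) in timings.items():
    print(f"{name}: {speedup(before, after):.2f}x")
```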

Future directions

The CID mapping rate determines the amount of data that enters the subsequent analysis. With our test sample, 78.8% of reads were mapped successfully. Because of sequencing errors and the limitations of the matching algorithm, approximately 20% of reads could not be mapped. In the future, more accurate algorithms (such as those that consider the base quality values at mismatches between mask CIDs and FASTQ CIDs, together with spatial features) or deep learning models may further improve the accuracy of the pipeline.

Availability of source code and requirements

Project name: SAW
Project home page: https://github.com/BGIResearch/saw_tools (tools source code), https://github.com/STOmics/SAW (script and docker)
Operating system(s): Linux
Programming language: C++, Python
Other requirements: Python >=3.8
License: GNU General Public License version 3
RRID:SCR_025001

Acknowledgements

The authors would like to acknowledge STOmics Cloud (https://cloud.stomics.tech), China National GeneBank, for supplying software analysis.

Funding Statement

This research was supported by BGI and the National Key R&D Program of China (2022YFC3400400).

Data availability

The data supporting this study’s findings have been deposited into the CNGB Sequence Archive (CNSA) [14] of the China National GeneBank DataBase (CNGBdb) [15] with the accession number CNP0004437. The raw sequencing data is available in the SRA via BioProject: PRJNA1036005. The test data is in GitHub [16], and additional supporting data is in the GigaDB repository [17]. Archival snapshots of the code are also available in Software Heritage (Figure 6) [18].

Figure 6. — Software Heritage archive of the code [18]. https://archive.softwareheritage.org/browse/embed/swh:1:dir:50985db9d2ec5f14d4d5dbdebf76f8bcfc6f7d29;origin=https://github.com/STOmics/SAW;visit=swh:1:snp:6834a6db518ff7d000ce6ffea0cca9956a8347d2;anchor=swh:1:rev:411bab897e0d4642715f1f7b60780b545fb21d12/

Abbreviations

CID, coordinate ID; IO, input-output; MID, molecule identity; SAW, Stereo-Seq Analysis Workflow; STOmics, Spatial-Temporal Omics.

Declarations

Ethics approval and consent to participate

The authors declare that ethical approval was not required for this type of research.

Competing interests

The authors are all employees of BGI.

Funding

National Key R&D Program of China (2022YFC3400400).

References

1. STOmics®. BGI-Research, Shenzhen, China. 2020; https://en.stomics.tech/.
2. Dobin A, Davis CA, Schlesinger F et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 2013; 29(1): 15–21. doi: 10.1093/bioinformatics/bts635.
3. Cock PJA, Fields CJ, Goto N et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res., 2009; 38(6): 1767–1771. doi: 10.1093/nar/gkp1137.
4. Archives SR, Sra T, Nucleotide I et al. File Format Guide 1. Published online 2009: 1–11. https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/. Accessed 21 May 2021.
5. Sequence Alignment/Map Format Specification. 2021; https://github.com/samtools/hts-specs. Accessed 21 May 2021.
6. Kim D, Pertea G, Trapnell C et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol., 2013; 14: R36. doi: 10.1186/gb-2013-14-4-r36.
7. Kim D, Langmead B, Salzberg S. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods, 2015; 12: 357–360. doi: 10.1038/nmeth.3317.
8. Ensembl. GFF/GTF File Format. Published 2020; http://www.ensembl.org/info/website/upload/gff.html?redirect=no. Accessed 27 May 2021.
9. GFF2 - GMOD. http://gmod.org/wiki/GFF2. Accessed 27 May 2021.
10. Yu C, Wang J, Peng C et al. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In: Computer Vision – ECCV 2018. LNCS, vol. 11217, Cham: Springer, 2018; pp. 334–349. doi: 10.1007/978-3-030-01261-8_20.
11. Yu C, Gao C, Wang J et al. BiSeNet V2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis., 2021; 129: 3051–3068. Published online 5 April 2020. doi: 10.1007/s11263-021-01515-2.
12. McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv. Published online 2018; doi: 10.48550/arXiv.1802.03426. Accessed 9 July 2021.
13. Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep., 2019; 9: 5233. doi: 10.1038/s41598-019-41695-z.
14. Guo XQ, Chen FZ, Gao F et al. CNSA: a data repository for archiving omics data. Database (Oxford), 2020; 2020: baaa055. doi: 10.1093/database/baaa055.
15. Chen FZ, You LJ, Yang F et al. CNGBdb: China National GeneBank DataBase. Hereditas, 2020; 42(08): 799–809. doi: 10.16288/j.yczz.20-080.
16. SAW GitHub Test Data. https://github.com/STOmics/SAW/tree/main/Test_Data.
17. Gong C, Li S, Wang L et al. Supporting data for “SAW: an efficient and accurate data analysis workflow for Stereo-seq spatial transcriptomics”. GigaScience Database, 2023; doi: 10.5524/102440.
18. Gong C, Li S, Wang L et al. SAW: Stereo-seq Analysis Workflow (Version 1). [Computer software]. Software Heritage. 2023; https://archive.softwareheritage.org/browse/directory/50985db9d2ec5f14d4d5dbdebf76f8bcfc6f7d29/?origin_url=https://github.com/STOmics/SAW&revision=411bab897e0d4642715f1f7b60780b545fb21d12&snapshot=6834a6db518ff7d000ce6ffea0cca9956a8347d2.

Article Submission

Chun Gong

Assign Handling Editor

Editor: Scott Edmunds

Editor Assess MS

Editor: Hongfang Zhang

Curator Assess MS

Editor: Yannan Fan

Review MS

Editor: Yanjie Wei

Reviewer name and names of any other individuals who aided in the review	Yanjie Wei
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published manuscript. (If no, please inform the editor that you cannot review this manuscript.)	Yes
Is the language of sufficient quality?	Yes
Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?	Yes
Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code?	Yes
As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?	Yes
Is the code executable?	Yes
Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?	Yes
Is the documentation provided clear and user friendly?	Yes
Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?	Yes
Have any claims of performance been sufficiently tested and compared to other commonly-used packages?	Yes
Are there (ideally real world) examples demonstrating use of the software?	Yes
Recommendation	Minor Revisions

Review MS

Editor: Zexuan Zhu

Reviewer name and names of any other individuals who aided in the review	Zexuan Zhu
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published manuscript. (If no, please inform the editor that you cannot review this manuscript.)	Yes
Is the language of sufficient quality?	Yes
Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is?	Yes
Is the source code available, and has an appropriate Open Source Initiative license (https://opensource.org/licenses) been assigned to the code?	Yes
As Open Source Software are there guidelines on how to contribute, report issues or seek support on the code?	Yes
Is the code executable?	Yes
Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined?	Yes
Is the documentation provided clear and user friendly?	Yes
Is there enough clear information in the documentation to install, run and test this tool, including information on where to seek help if required?	Yes
Is there a clearly-stated list of dependencies, and is the core functionality of the software documented to a satisfactory level?	Yes
Have any claims of performance been sufficiently tested and compared to other commonly-used packages?	Yes
Is test data available, either included with the submission or openly available via cited third party sources (e.g. accession numbers, data DOIs)?	Yes
Are there (ideally real world) examples demonstrating use of the software?	Yes
Is automated testing used or are there manual steps described so that the functionality of the software can be verified?	Yes
Any Additional Overall Comments to the Author	It would be helpful if some examples can be provided to illustrate the key steps, e.g., the gene region annotation process and MID correction. Some information of the references is missing. Please carefully check the format of the references.
Recommendation	Minor Revisions

Editor Decision

Editor: Hongfang Zhang

Minor Revision

Chun Gong

Assess Revision

Editor: Hongfang Zhang

Final Data Preparation

Editor: Mary-Ann Tuli

Editor Decision

Editor: Hongfang Zhang

Accept

Editor: Scott Edmunds

Editor’s Assessment

One limiting factor in the adoption of spatial omics research is the availability of workflow systems for data preprocessing; to address this, the authors developed the SAW tool to process Stereo-seq data. The analysis steps of spatial transcriptomics involve obtaining gene expression information from space and cells. Existing tools face issues with large datasets, such as computationally intensive spatial localization and RNA alignment, and excessive memory usage. These issues affect the applicability and efficiency of the process. To address this, the paper presents a high-performance open-source workflow called SAW for Stereo-seq. It includes mRNA position reconstruction, genome alignment, matrix generation, clustering, and result file generation for personalized analysis. During review, the authors added examples of MID correction to the article to make the process easier to understand. In the future, more accurate algorithms or deep learning models may further improve the accuracy of this pipeline.

Export to Production

Editor: Scott Edmunds

Associated Data

Data Availability Statement

1. STOmics®. BGI-Research, Shenzhen, China. 2020; https://en.stomics.tech/.

2. Dobin A, Davis CA, Schlesinger F et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 2013; 29(1): 15–21. doi: 10.1093/bioinformatics/bts635.

3. Cock PJA, Fields CJ, Goto N et al. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res., 2009; 38(6): 1767–1771. doi: 10.1093/nar/gkp1137.

4. Archives SR, Sra T, Nucleotide I et al. File Format Guide 1. Published online 2009: 1–11. https://www.ncbi.nlm.nih.gov/sra/docs/submitformats/. Accessed 21 May 2021.

5. Sequence Alignment/Map Format Specification. 2021; https://github.com/samtools/hts-specs. Accessed 21 May 2021.

6. Kim D, Pertea G, Trapnell C et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol., 2013; 14: R36. doi: 10.1186/gb-2013-14-4-r36.

7. Kim D, Langmead B, Salzberg S. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods, 2015; 12: 357–360. doi: 10.1038/nmeth.3317.

8. Ensembl. GFF/GTF File Format. Published 2020; http://www.ensembl.org/info/website/upload/gff.html?redirect=no. Accessed 27 May 2021.

9. GFF2 - GMOD. http://gmod.org/wiki/GFF2. Accessed 27 May 2021.

10. Yu C, Wang J, Peng C et al. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In: Computer Vision – ECCV 2018. LNCS, vol. 11217, Cham: Springer, 2018; pp. 334–349. doi: 10.1007/978-3-030-01261-8_20.

11. Yu C, Gao C, Wang J et al. BiSeNet V2: Bilateral network with guided aggregation for real-time semantic segmentation. Int. J. Comput. Vis., 2021; 129: 3051–3068. Published online 5 April 2020. doi: 10.1007/s11263-021-01515-2.

12. McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv. Published online 2018; doi: 10.48550/arXiv.1802.03426. Accessed 9 July 2021.

13. Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep., 2019; 9: 5233. doi: 10.1038/s41598-019-41695-z.

14. Guo XQ, Chen FZ, Gao F et al. CNSA: a data repository for archiving omics data. Database (Oxford), 2020; 2020: baaa055. doi: 10.1093/database/baaa055.

15. Chen FZ, You LJ, Yang F et al. CNGBdb: China National GeneBank DataBase. Hereditas, 2020; 42(08): 799–809. doi: 10.16288/j.yczz.20-080.

16. SAW GitHub Test Data. https://github.com/STOmics/SAW/tree/main/Test_Data.

17. Gong C, Li S, Wang L et al. Supporting data for “SAW: an efficient and accurate data analysis workflow for Stereo-seq spatial transcriptomics”. GigaScience Database, 2023; doi: 10.5524/102440.

18. Gong C, Li S, Wang L et al. SAW: Stereo-seq Analysis Workflow (Version 1). [Computer software]. Software Heritage. 2023; https://archive.softwareheritage.org/browse/directory/50985db9d2ec5f14d4d5dbdebf76f8bcfc6f7d29/?origin_url=https://github.com/STOmics/SAW&revision=411bab897e0d4642715f1f7b60780b545fb21d12&snapshot=6834a6db518ff7d000ce6ffea0cca9956a8347d2.
