Deciphering the principles and mechanisms by which gene activity orchestrates complex cellular arrangements in multicellular organisms has far-reaching implications for research in the life sciences. Recent technological advancements in next-generation sequencing-based and imaging-based approaches have established the potential of spatial transcriptomics to measure expression levels of all or most genes systematically throughout tissue space, and have been adopted to generate biological insight in neuroscience, development, plant biology, and a range of diseases including cancer. Similar to datasets made possible by genomic sequencing and population health surveys, the large-scale atlases generated by this technology lend themselves to exploratory data analysis for hypothesis generation. Here, we review spatial transcriptomic technologies and describe the repertoire of operations available for paths of analysis of the resulting data. Spatial transcriptomics can also be deployed for hypothesis testing using experimental designs comparing timepoints or conditions - including genetic or environmental perturbations. Finally, spatial transcriptomic data is naturally amenable to integration with other data modalities providing an expandable framework for insight into tissue organization.
Many of the notable discoveries in the life sciences followed from the recognition that cellular organization within tissues is intimately linked to biological function. In developmental biology, central topics such as symmetry-breaking between daughter cells and cell fate decisions are based on spatial relationships between cells1. In clinical settings, histopathology is often used as a conclusive diagnostic, precisely because diseases are characterized by abnormal spatial organization within tissues2. Infectious and inflammatory processes can drastically change the cellular organization of tissues3. These discoveries were supported by methods in molecular biology - including in situ hybridization4 (ISH) and immunohistochemistry5 - that provided the ability to visualize biological processes more directly by mapping DNA, RNA and protein within tissues. However, these methods limit analysis to at most a handful of genes or proteins at a time.
The ‘omics revolution has profoundly changed our ability to characterize cells. Instead of a few RNA or protein markers, new methods assay the full genome, transcriptome or proteome in cells6-9. This has led to the discovery of novel cell types and cell states and provided a more detailed understanding of biological processes in health and disease10-12. Until recently however, these high-throughput techniques could not be applied in situ, resulting in the loss of information about spatial relationships among the catalogued populations of cells. To circumvent this limitation, early methods such as TomoSeq performed transcriptomics on serial slices to reconstruct a spatial axis13-16. Similarly, microdissection was used to manually isolate specific regions for scRNA-Seq, thus obtaining spatially-resolved information17-23. Nanostring GeoMX digital spatial profiling was developed to capture targeted transcripts in manually selected regions of interest24. To reconstruct spatial relationships between neighboring cells, creative methods rely on partial tissue dissociation25, including ProximID25-27, PICseq27 and ClumpSeq28. In another approach, targeted mapping of a subset of genes can be used to ‘anchor’ single-cell RNA-Seq data29-33.
While these approaches enabled the reconstruction of tissue organization, they also highlighted the need for whole transcriptome, spatially-resolved methods. Over the past decade, technologies have emerged that bridge the gap between traditional approaches that retain spatial information (such as IF, ISH) and new methodologies with the ability to concurrently query the entire transcriptome in individual cells. The inception of this new approach of ‘spatial transcriptomics’ has facilitated novel discoveries in diverse fields from neuroscience to development to cancer. Here, we review common spatial transcriptomic technologies, discuss the principles of exploration of the data generated by these methods, examine the utility of spatial transcriptomics in different experimental designs, and highlight the promise of the technology for biological insights through integration with other modalities.
Spatial transcriptomics technologies
While spatial transcriptomic technologies vary widely in both the number of genes that can be probed and the size of tissue that can be assayed (Box 1), we focus here on technologies that allow transcriptome-level measurements across a tissue region. Spatial transcriptomics technologies are primarily categorized34,35 as (1) Next-Generation Sequencing (NGS)-based, encoding positional information onto transcripts before next-generation sequencing, and (2) imaging-based approaches, comprising in situ sequencing-based methods, where transcripts are amplified and sequenced in the tissue, and in situ hybridization-based methods, where imaging probes are sequentially hybridized in the tissue36-40 (Figure 1a-c). This classification is not always clear-cut, and methods may incorporate elements from both categories. These diverse technologies can be seen as converging upon a gene expression matrix (Figure 1d) capturing the transcriptome at every spot (i.e. a pixel, a cell, or a group of cells).
Box 1: Considerations for selecting a spatial transcriptomic method.
Gene throughput:
NGS-based methods are unbiased as they capture all polyadenylated transcripts, and are therefore well suited for exploring a new system. In contrast, ISH- and most ISS-based methods (with the exception of FISSEQ70 and ExSeq69,70) are targeted and require a priori knowledge of the genes of interest. Nonetheless, the throughput of these methods has increased in recent years, reaching 10,000 genes166,170. Targeted spatial transcriptomic methods can also be used in conjunction with single-cell RNA-Seq, where genes of interest have already been identified and can then be located more precisely51,63. In addition, probes for non-polyadenylated transcripts can be designed to query for other RNAs such as mature microRNAs and tRNAs170.
Sequence information:
In NGS-based and ISS-based methods, the cDNA sequence itself is a readout, enabling the detection of splice isoforms56,58,171 as well as single nucleotide variants and point mutations60. When integrated with the gene expression matrix, this information can assist with reconstructing a time-course - using RNA Velocity53 or lineage tracing172.
Sensitivity:
ISH-based methods are highly sensitive, recently reaching 80% detection efficiency relative to the gold standard smFISH170. Sensitivity of the NGS-based methods is significantly lower and remains inferior to scRNA-Seq, but is rapidly improving to ~100 unique transcripts per square μm54,58,59,173. There is generally a tradeoff between sensitivity and gene throughput, as seen in the higher sensitivity of targeted ISS-based methods64 relative to the unbiased methods70.
Resolution:
The resolution of in situ methods is limited only by the optical diffraction limit, and with expansion microscopy has reached ~100 nm80,170. These methods are therefore well-suited to questions concerning sub-cellular organization. NGS-based methods are limited by the diameter of spots, but their resolution has rapidly increased since the original method41, recently reaching ~1 μm58,59.
Area size:
The in situ methods can span a wide range of sizes, although there is a tradeoff between tissue size and imaging time73. The NGS-based methods on the other hand are standardized, with arrays on the order of ten square mm (6 for Visium), which may be inappropriate for smaller or larger samples.
Accessibility:
While these technologies are extremely powerful, there are obstacles to their widespread adoption, including access to single molecule imaging for in situ methods, as well as manufacturing for the capture arrays. Commercialization has facilitated access in some cases.
Fig. 1 ∣. The technologies of spatial transcriptomics provide a gene expression matrix.
a, NGS-based spatial transcriptomic methods barcode transcripts according to their location in a lattice of spots. b, In situ sequencing approaches directly read out the transcript sequence within the tissue. c, In situ hybridization methods detect target sequences by hybridization of complementary fluorescent probes. d, The product of spatial transcriptomics is the gene expression matrix - where the rows and columns correspond to genes and locations.
Next-Generation Sequencing (NGS)-based approaches:
NGS-based approaches stem from the conceptual innovations of single-cell RNA-Seq methodologies and are contingent on the addition of a spatial barcode prior to library preparation35 (Figure 1a). In 2016, Stahl et al. reported the first NGS-based method for spatial transcriptomics that enabled the capture of whole transcriptomes from tissue sections41. The central innovation was to capture poly-adenylated RNA on spatially-barcoded microarray slides prior to reverse transcription, ensuring that each transcript could be mapped back to its original spot using the unique positional molecular barcode. With each slide consisting of just over a thousand spots (100 μm in diameter, 200 μm center-to-center), large tissue areas could be investigated in an unbiased manner without selecting a region or, importantly, a set of gene targets42,43. The method was first demonstrated on the mouse olfactory bulb41, and has since been employed by several other groups44-47. 10x Genomics recently released an improved version of the technology called Visium, with increased resolution (55 μm in diameter, 100 μm center-to-center) and sensitivity (>10k transcripts per spot)48. Many different fields have adopted this technology, including neuroscience49, cancer biology47,50 and developmental biology51.
Slide-Seq, another NGS-based technology, uses randomly barcoded beads deposited onto a slide for mRNA capture52. Here, the position of each random barcode is obtained by in situ indexing. This method has achieved high-resolution (10μm) and recently increased sensitivity (500 transcripts per bead), which was found to be twice that of Visium for the same surface area53. In parallel, high-definition spatial transcriptomics (HDST) also improved the resolution by replacing the glass slide with beads deposited in wells, similar to Slide-Seq54. More recently, the DBiT-Seq55 method has adopted microfluidics to apply polyT barcodes to the tissue section, while Stereo-seq uses randomly barcoded DNA nanoballs deposited in an array pattern to achieve nanoscale resolution56,57. Seq-Scope has achieved subcellular resolution spatial barcoding and can be used to visualize nuclear and cytoplasmic transcripts58. An innovative approach was adopted in Pixel-Seq where a polony-derived gel oligo array was used for RNA capture resulting in up to ~200 fold increase in resolution in comparison to existing methods59.
Common to all NGS-based methods, the spatially-barcoded RNAs are collected and processed for sequencing. The barcode of each read is used to map the spatial position, while the rest of the sequencing read is mapped to the genome to identify the transcript of origin, collectively generating a gene expression matrix.
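As a minimal sketch of this last step, the following Python example collapses reads into a gene-by-spot count table. The barcode whitelist, the reads, and the use of a unique molecular identifier (UMI) to collapse duplicate reads are assumptions made for illustration; they do not describe the pipeline of any specific platform.

```python
from collections import defaultdict

# Hypothetical barcode whitelist: spatial barcode -> (x, y) position on the array.
barcode_to_xy = {"AAAC": (0, 0), "AAAG": (0, 1), "CCGT": (1, 0)}

# Hypothetical reads: (spatial barcode, UMI, gene the rest of the read aligned to).
reads = [
    ("AAAC", "u1", "Pcp4"),
    ("AAAC", "u1", "Pcp4"),  # duplicate read of the same molecule
    ("AAAC", "u2", "Pcp4"),
    ("AAAG", "u3", "Gfap"),
    ("CCGT", "u4", "Pcp4"),
    ("TTTT", "u5", "Gfap"),  # barcode not on the array, discarded
]


def build_expression_matrix(reads, barcode_to_xy):
    """Return {gene: {(x, y): number of unique molecules}}."""
    seen = set()
    matrix = defaultdict(lambda: defaultdict(int))
    for barcode, umi, gene in reads:
        if barcode not in barcode_to_xy:
            continue  # cannot be placed in tissue space
        key = (barcode, umi, gene)
        if key in seen:
            continue  # count each molecule once
        seen.add(key)
        matrix[gene][barcode_to_xy[barcode]] += 1
    return {gene: dict(spots) for gene, spots in matrix.items()}


matrix = build_expression_matrix(reads, barcode_to_xy)
```

Storing the matrix sparsely, with entries only for observed gene and spot combinations, reflects the fact that most genes are not detected in most spots.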
Imaging-based approaches:
Two main types of imaging-based approaches to spatial transcriptomics have been introduced: in situ sequencing- and in situ hybridization-based methods. In situ sequencing (ISS)-based methods directly read out the sequences of transcripts within the tissue. Specifically, the RNA is reverse transcribed, amplified by rolling circle amplification, and sequenced (Figure 1b). Ke et al.60 first implemented this approach by deploying targeted probes for the reverse transcription, followed by sequencing-by-ligation; the method has been used to study ~50 targeted genes in cancer60,61, tuberculosis62, and brain development63. Building upon this approach, STARMap incorporated advances in hydrogel chemistry, improved padlock and primer design, and devised an error-robust sequencing-by-ligation method, and was thus able to profile thousands of genes in the mouse cortex64. Other methods using sequencing-by-synthesis - as in BaristaSeq65 and Barseq66 - or sequencing-by-hybridization - as in HybISS67 - have led to increased read lengths, enabling higher throughput and cellular barcoding. Furthermore, in situ sequencing has been combined with cDNA extraction for NGS68,69, highlighting the difficulty in classifying spatial transcriptomic methods as either NGS- or imaging-based. In situ sequencing also has the potential for untargeted profiling, as demonstrated by FISSEQ70. Although the untargeted amplification can lead to optical crowding and lower sensitivity, the recently-developed ExSeq demonstrated that expansion microscopy can be used to perform untargeted in situ sequencing in tissues69.
In situ hybridization (ISH)-based methods are the second category of imaging-based methods and build on in situ hybridization technologies, whereby a target sequence is detected by hybridization of a complementary fluorescent probe (Figure 1c). Although these methods were initially limited in the number of distinguishable transcripts, sequential rounds of hybridization and imaging71 combined with barcoding have enabled substantial multiplexing. In MERFISH, successive rounds of hybridization are imaged to detect the presence or absence of fluorescently labeled probes. The serial images are then decoded using the error-robust barcode associated with each transcript identity72-74. MERFISH has been used at a wide range of scales, from transcript location within individual cells75 to tissue-level spatial transcriptomics such as on the hypothalamic preoptic region76. Another strategy to increase the number of distinguishable transcripts is the combination of colors into pseudocolors, as done in SeqFISH77,78. Similar to MERFISH, this method can be applied to investigate intracellular organization79 as well as to generate large maps, for example of the hippocampus78. Both methods have improved considerably in the last few years, and are now able to detect ~10,000 genes at sub-cellular resolution75,80. Ongoing efforts in the community aim to improve the sensitivity and scale of these methods34,81,82.
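The decoding of presence/absence patterns against an error-robust codebook can be sketched as follows. The 6-bit codebook is hypothetical and was chosen so that any two codes differ in four positions. This allows a single misread round to be corrected and double errors to be rejected. Published MERFISH codebooks are longer and are not reproduced here.

```python
# Hypothetical codebook: one bit per imaging round (1 = spot visible in that round).
# Any two codes differ in 4 positions, so one flipped bit is still unambiguous.
codebook = {
    "Gad1": "110100",
    "Slc17a7": "011010",
    "Gfap": "101001",
    "Aqp4": "000111",
}


def hamming(a, b):
    """Number of rounds in which two on/off patterns disagree."""
    return sum(x != y for x, y in zip(a, b))


def decode(observed, codebook, max_errors=1):
    """Return the gene whose code is closest to the observed pattern,
    or None if even the closest code is more than max_errors bits away."""
    best_gene, best_dist = None, None
    for gene, code in codebook.items():
        dist = hamming(observed, code)
        if best_dist is None or dist < best_dist:
            best_gene, best_dist = gene, dist
    return best_gene if best_dist <= max_errors else None


exact = decode("110100", codebook)      # matches the Gad1 code exactly
corrected = decode("110000", codebook)  # one round missed, still closest to Gad1
rejected = decode("100000", codebook)   # two bits away from the nearest codes
```

The `max_errors` threshold is an illustrative parameter; stricter or looser thresholds trade detection efficiency against misidentification.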
For both ISS- and ISH-based methods, the image is processed to generate the gene expression matrix. To obtain a cell-level matrix, the image is segmented, either manually on small areas, or systematically using a computational approach. Watershed algorithms use DAPI-stained nuclei as seeds and identify cell borders as regions with low RNA density83. These borders may not correspond to true physical boundaries, but rather to the limit between cells; nevertheless, they accomplish the task of assigning each mRNA to a cell. Alternatively, the data analysis can begin at the level of individual pixels, and incorporate the gene expression data to delineate cells84-86.
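To illustrate the goal of segmentation, namely assigning each detected mRNA to a cell, the sketch below uses a deliberately simplified stand-in: each transcript is assigned to the nearest nucleus centroid within a distance cutoff. This is not a watershed algorithm and ignores RNA density. The coordinates, gene names, and the `max_dist` cutoff are hypothetical.

```python
import math

# Hypothetical nucleus centroids (e.g. from DAPI) and transcript detections.
nuclei = {"cell_1": (2.0, 2.0), "cell_2": (8.0, 2.0)}
transcripts = [
    ("Gad1", 1.0, 2.5),
    ("Gad1", 7.5, 1.0),
    ("Gfap", 4.9, 2.0),
    ("Gfap", 20.0, 20.0),  # far from every nucleus, left unassigned
]


def assign_to_cells(transcripts, nuclei, max_dist=5.0):
    """Build a cell-by-gene count table by nearest-nucleus assignment."""
    counts = {cell: {} for cell in nuclei}
    for gene, x, y in transcripts:
        # Find the closest nucleus centroid to this transcript.
        cell, dist = min(
            ((name, math.dist((x, y), xy)) for name, xy in nuclei.items()),
            key=lambda pair: pair[1],
        )
        if dist <= max_dist:
            counts[cell][gene] = counts[cell].get(gene, 0) + 1
    return counts


cell_counts = assign_to_cells(transcripts, nuclei)
```

Nearest-centroid assignment implicitly draws straight borders midway between nuclei. Seeded watershed replaces these with borders that follow the local signal.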
Spatial transcriptomics insights into development, physiology and disease
Since spatial transcriptomic technologies provide an unbiased picture of spatial composition, they have been used to generate tissue atlases, which provide a valuable resource as reference maps. The use of spatial transcriptomics to generate spatial atlases of the nervous system is of particular note: ST-based approaches have established detailed maps of the entire mouse brain49, or of specific regions: visual cortex64, primary motor cortex87, middle temporal gyrus67, hypothalamic pre-optic region76, hippocampus69,78, and cerebellum88. Maynard et al. identified spatial patterns of known schizophrenia- and autism-related genes in their analysis of the dorsolateral prefrontal cortex, that led to proposed mechanisms of genetic susceptibility to schizophrenia89. Spatial transcriptomics was also used to identify genes and pathways in eight inflorescence tissue domains of A. thaliana90.
In developmental biology, time-resolved spatial transcriptomics atlases have been useful to elucidate the spatial dynamics of heart development51, spermatogenesis91 and intestinal development92. Similarly, a comprehensive study of the human endometrium during the proliferative and secretory phases of the menstrual cycle identified a role for WNT and Notch signaling in regulating differentiation towards ciliated or secretory epithelial cells93. In order to serve as effective resources for the research community, these atlases have been the focus of coordinated community efforts and are supported by the Human Cell Atlas project94 and the Allen Institute for Brain Science95.
Beyond normal development and physiology, spatial transcriptomics is well-positioned to study tissue disorganization in disease. Most prominently, spatial transcriptomics has enabled the identification of the mechanisms at play in cancer, where the tissue structure underlying normal physiological function is altered44,50,96-99. With the increasing recognition of the importance of the tumor microenvironment, spatial transcriptomics has been used to address its relationship to cancer cells adopting different states45,46,69,98. In particular, spatial transcriptomics enables the study of the molecular features across the cancer and normal tissue boundaries. For example, an immunomodulatory cancer cell state was revealed in skin squamous cell carcinoma47. Spatial transcriptomics has also provided insights on the mechanisms of tissue dysregulation in neurodegenerative disorders - including Alzheimer’s disease100,101 and amyotrophic lateral sclerosis102, infectious and inflammatory processes - such as leprosy103, influenza104 and sepsis105, and rheumatological diseases - including rheumatoid and spondyloarthritis106,107.
Spatial transcriptomics-enabled exploratory data analysis
The spatial transcriptomic technologies result in a gene expression matrix, which can be analyzed both to test existing hypotheses and to generate new observations through exploratory analysis. Given the complexity and high dimensionality of a spatial transcriptomic dataset, novel insights can arise from adopting a mindset open to finding unexpected relationships by data analysis. In this exploratory mode of data analysis - championed by John Tukey108 - the result of one analysis guides the choice of the next, analogous to the way in which the result of a bench experiment guides the design of the next experiment. This is not to say that prior knowledge and hypotheses are ignored; rather that they are used to interpret results and direct the analyses. Thus, there is no predefined protocol in exploratory data analysis - no set pipeline for how to study a spatial transcriptomic dataset. Instead, there is a particular logic for how the data can be examined and a recognition of possible outcomes with each analysis109,110.
Analyzing spatial transcriptomic data often requires the exclusion of low-quality data and initial transformations on the gene expression matrix to increase the signal-to-noise ratio, which can be performed using analysis packages such as Giotto111, Seurat112,113, STUtility114, and STLearn115. The total number of transcripts detected in a spot provides a first indication of the technical and biological attributes of the data. A relatively low number of transcripts per spot may indicate a technical artifact, such as insufficient permeabilization in certain regions, or a difference in cell density in the case of NGS-based methods. Alternatively, variation can arise from biological sources, such as differences in transcriptional activity between cell types, or the presence of dying or necrotic cells, and this signal may confound downstream analyses. Smoothing algorithms can be applied to the data to increase sensitivity and to remove unwanted sources of technical and biological variation. Based on the premise that information can be shared between neighboring spots, averaging gene expression between physically adjacent spots in a moving window along the spatial coordinates can reduce noise47. To compare the expression of a gene across spots, transcriptomes are often normalized by dividing by the total number of transcripts (TPM) or using regularized negative binomial regression116. Similarly, comparisons across genes are aided by scaling the data to have the same mean and variance across spots (z-score).
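The three transformations described above (normalization by the spot total, z-scoring of each gene across spots, and moving-window smoothing) can be sketched in plain Python. The counts, coordinates, scale factor, and window radius are hypothetical. Dedicated packages such as those cited above implement more refined versions.

```python
import math
import statistics

# Hypothetical data: rows are genes, columns are four spots along a line.
counts = {"GeneA": [10, 20, 0, 30], "GeneB": [90, 180, 100, 70]}
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]


def normalize_by_total(counts, scale=1_000_000):
    """Divide each spot by its total transcript count, then rescale."""
    n_spots = len(next(iter(counts.values())))
    totals = [sum(counts[g][j] for g in counts) for j in range(n_spots)]
    return {
        g: [scale * v / t if t else 0.0 for v, t in zip(vals, totals)]
        for g, vals in counts.items()
    }


def zscore(values):
    """Center to mean 0 and scale to unit variance across spots."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]


def smooth(values, coords, radius=1.0):
    """Average each spot with all spots within `radius` (itself included)."""
    out = []
    for xi, yi in coords:
        window = [
            v for v, (xj, yj) in zip(values, coords)
            if math.hypot(xi - xj, yi - yj) <= radius
        ]
        out.append(sum(window) / len(window))
    return out


norm = normalize_by_total(counts)
z = zscore(norm["GeneA"])
smoothed = smooth(norm["GeneA"], coords)
```

In this toy example, spots 0 and 1 differ two-fold in raw GeneA counts but are identical after normalization, because the second spot simply captured twice as many transcripts overall.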
The normalized gene expression matrix provides the basis for initial observations at the level of individual genes or spots (Figure 1). Revealing structure in the data, such as cell type properties or coherent gene modules, requires further processing of the matrix. We distinguish five classes of operations that have been used to study spatial transcriptomic data, though more operations will undoubtedly be devised (Figure 2a). While applying any one operation to the data may not immediately lead to insight, using the operators serially based on the interpretation of the results at each stage can generate a ‘path’ to a result (Figure 2b).
Fig. 2 ∣. Exploratory data analysis using spatial transcriptomic datasets.
a. Schematic of exploratory data analysis operations of spatial transcriptomic datasets. Characterize: Depicted are spots characterized to be composed of proportions of cell type α, β and γ and gene sets annotated with functional terms. Cluster: Clusters of spots are shown in a lower dimensional space and mapped onto the tissue, and co-expressed gene sets are shown within a gene-gene correlation matrix. Select: A subset of spots can be selected based on histological information, or a subset of spatially variable genes may be selected for analysis. Relate: The relationship between gene sets found to have a spatial overlap as well as adjacent spots and clusters can be examined using the relate operation. Score: Spots scored for gene set expression generate a spatial pattern, while gene profiles can be obtained by summarizing the expression of a subset of spots. b. Operational paths for analysis. Composition: Spots are scored for cell type-specific gene expression profiles from scRNA-Seq data and characterized to identify the composition of the tissue region. Co-localization: Co-varying genes are identified by clustering and spots are scored for the expression of these gene sets to identify a pattern of overlapping spatial expression. A co-localization is described by relating the distance between these spots. Communication: Transcriptionally similar spots are identified by clustering and characterized according to their resident cell types. A subset of receptor and ligand pairs are selected for analysis. Receptors and ligands expressed in cell type α and cell type β, respectively, suggest a relationship between them.
The clustering operation reveals structure in the data, most basically defining sets of spots with similar transcriptomes or, orthogonally, identifying genes with similar expression patterns across the spots. Similarity between spots can be calculated directly between transcriptomes using correlation or Euclidean distance, or after dimensionality reduction such as PCA, tSNE and UMAP117,118. These similarities are then used to cluster spots, for example using k-means, Louvain or hierarchical clustering119. These clusters may correspond to distinct regions or cell types in the tissue of study, which can then be annotated (see ‘Characterize’). In a study on gingivitis, spots clustered according to whether they were epithelial, connective or inflammatory120. Clustering methods were also used to describe the tissue composition on sections of the plant A. thaliana, revealing four groups of spots corresponding to stem, meristematic area, flower reproductive organs, and sepals and petals90.
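As a minimal illustration of clustering spots by Euclidean distance between transcriptomes, the sketch below runs k-means (Lloyd's algorithm) on a toy matrix. The expression values and the choice of initial centers are hypothetical. Real analyses typically cluster after dimensionality reduction and use library implementations.

```python
import math

# Hypothetical normalized expression of three genes in five spots.
spots = [(9, 1, 0), (8, 2, 1), (1, 9, 0), (0, 8, 2), (10, 0, 1)]


def kmeans(points, k, init, n_iter=20):
    """Lloyd's algorithm: alternate nearest-center assignment and center update.

    `init` lists the indices of the points used as starting centers.
    """
    centers = [list(points[i]) for i in init]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each spot joins its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


labels, centers = kmeans(spots, k=2, init=(0, 2))
```

The resulting labels can then be mapped back onto the spot coordinates to ask whether transcriptionally similar spots are also spatially contiguous.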
Gene clustering, using the same approach, can identify co-expressed gene modules corresponding to a cell type or cell state111. In spatial transcriptomic data from the cerebellum for example, clustering of genes identified two modules of spatially correlated genes in Purkinje cells52. Methods to cluster genes and spots simultaneously have also been used, including Non-negative Matrix Factorization (NMF)121,122 or factor analysis96, where the gene expression matrix is factorized to reveal the underlying structure in spot clusters and gene modules. In prostate tumor samples, this revealed sets of spots and genes corresponding to cancer, stroma, and inflammation96. Currently, clustering methods focusing on the specific features of spatial transcriptomics are being developed, such as BayesSpace123.
Typical spatial transcriptomic datasets contain more biological information than can be meaningfully interpreted by any single analysis. Therefore, it is usually appropriate to select a region of interest, for example a specific layer in the brain52,53, or the interface between tumor and microenvironment87,124. Orthogonally, one may focus the analysis on context-specific genes, either chosen a priori from biological knowledge - most notably in imaging-based methods which do not yet cover the whole transcriptome - or chosen from the dataset itself - by identifying highly variable genes for example. Gene selection methods abound, and those tailored to spatial transcriptomic data attempt to identify genes with high variance and whose expression is not random across the tissue. Genes can be scored according to their spatial autocorrelation (using Moran’s I or Geary’s C)125, neighbor enrichment (for example, in BinSpect)111 or entropy (for example, in Haystack)126. Trendsceek127 uses a marked point process approach128 and is able to identify hotspots, streaks, and gradients of expression. SpatialDE decomposes a given gene’s expression variability into spatial and nonspatial components using Gaussian process regression129, and a similar approach was extended in SPARK130. Cancer-specific metabolic vulnerabilities were thus characterized by identifying spatially variable genes in prostate cancer97.
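Spatial autocorrelation scoring can be made concrete with Moran's I, computed here with binary weights (1 for spots within a given radius, 0 otherwise). The 4 x 4 grid, the radius, and the two toy expression patterns are assumptions chosen for illustration.

```python
import math


def morans_i(values, coords, radius=1.0):
    """Moran's I with binary neighbor weights.

    I = (N / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2,
    where w_ij = 1 if spots i and j are distinct and within `radius`.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= radius:
                cross += dev[i] * dev[j]
                weight_sum += 1.0
    denom = sum(d * d for d in dev)
    if weight_sum == 0 or denom == 0:
        return 0.0  # no neighbors or constant expression
    return (n / weight_sum) * (cross / denom)


# Hypothetical 4 x 4 array of spots.
grid = [(x, y) for x in range(4) for y in range(4)]
gradient = [float(x) for x, y in grid]             # smooth left-to-right trend
checkerboard = [float((x + y) % 2) for x, y in grid]  # neighbors always differ

i_gradient = morans_i(gradient, grid)
i_checker = morans_i(checkerboard, grid)
```

A gene following the gradient pattern scores positively (neighbors resemble each other), whereas the checkerboard pattern scores negatively. Values near zero indicate no spatial structure.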
While the genes and spots are the primary data observations of spatial transcriptomics, the underlying biology is such that genes are co-expressed as modules, and that spot transcriptomes reflect a finite set of cell types and states. This is the premise of the scoring function, which is used to summarize a cluster of similar spots as a single gene expression profile, or - orthogonally - a coherent set of genes as a single pattern. Summarizing in this way can identify functional properties - for example, a stress response state or infiltrating macrophages that are spatially organized within a tumor - which might not be detectable when analyzing spots or genes individually. Scoring can be done simply by averaging the values of the set, or by scoring the expression relative to a null model as implemented in the Seurat workflow113. In the brain for example, Moffitt et al. generated average cell type expression profiles to compare spatial transcriptomics and scRNA-Seq clusters76. In melanoma, spots were scored according to their expression of previously established gene sets corresponding to cancer cell states45 or to Gene Ontology terms124.
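Both directions of the scoring operation can be sketched by simple averaging. The expression values and the gene set below are hypothetical. Scoring relative to a null model, as mentioned above, is not shown.

```python
# Hypothetical normalized expression: gene -> values across three spots.
expr = {
    "Hspa1a": [0.1, 2.0, 1.8],
    "Fos": [0.0, 1.5, 2.2],
    "Actb": [5.0, 5.1, 4.9],
}


def score_gene_set(expr, gene_set):
    """Summarize a gene set as one spatial pattern: mean expression per spot."""
    genes = [g for g in gene_set if g in expr]
    n_spots = len(next(iter(expr.values())))
    return [sum(expr[g][j] for g in genes) / len(genes) for j in range(n_spots)]


def cluster_profile(expr, spot_indices):
    """Summarize a cluster of spots as one expression profile: mean per gene."""
    return {
        g: sum(vals[j] for j in spot_indices) / len(spot_indices)
        for g, vals in expr.items()
    }


stress_score = score_gene_set(expr, ["Hspa1a", "Fos"])  # one value per spot
profile = cluster_profile(expr, [1, 2])                  # one value per gene
```

Here the set score separates the first spot from the other two more clearly than either gene alone, which is the motivation for summarizing coherent gene sets.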
The objects identified by operations on spatial transcriptomic data - clusters of spots and sets of genes - must be characterized for biological understanding and interpretation. For this, integration with other data sources and with other prior knowledge is essential. A cluster of spots may be characterized manually when it matches a histological region, as was done in MERFISH to annotate individual cell types in the brain76 and in pancreatic cancer samples to annotate normal and malignant regions of the tumor46. A cluster may also be annotated indirectly by identifying a set of marker genes and characterizing those. Specifically, a gene set can be characterized by quantifying its overlap with an annotated gene set. This is the basis of the Multimodal Intersection Analysis (MIA) introduced by Moncada et al.46, and of Gene Set Enrichment Analysis (GSEA) which queries for enrichment with functional groups obtained from Gene Ontology, KEGG, Hallmarks, and other databases131-133.
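One common way to quantify the overlap between a cluster's marker genes and an annotated gene set is a hypergeometric test, sketched below. The gene names and set sizes are hypothetical, and the exact statistics used by the cited methods may differ (GSEA, for instance, operates on ranked gene lists).

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_annotated, n_markers, n_overlap):
    """P(overlap >= n_overlap) when n_markers genes are drawn at random
    from a universe containing n_annotated annotated genes."""
    total = comb(n_universe, n_markers)
    upper = min(n_annotated, n_markers)
    tail = sum(
        comb(n_annotated, i) * comb(n_universe - n_annotated, n_markers - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20 measured genes, 5 annotated to a cell type,
# 5 cluster markers of which 4 fall in the annotated set.
universe = {f"g{i}" for i in range(20)}
annotated = {"g0", "g1", "g2", "g3", "g4"}
markers = {"g0", "g1", "g2", "g3", "g10"}

p_value = hypergeom_enrichment_p(
    len(universe), len(annotated), len(markers), len(markers & annotated)
)
```

A small p-value indicates that the marker set overlaps the annotated set more than expected by chance. The choice of gene universe strongly affects the result and should match the genes actually measured.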
Because NGS-based spatial transcriptomics is not at single-cell resolution, much attention has been given to the problem of inferring the cell type composition of each spot (deconvolution), which is an important step in building detailed organ atlases51,93. Most methods achieve this by integrating single-cell data, either generated from the same sample (paired) or from a similar sample or database (unpaired). This integration helps to overcome the limitations of both single-cell RNA-Seq, which lacks spatial information, and NGS-based spatial transcriptomics, which is not at single-cell resolution. The SPOTLight method uses non-negative linear regression on the spatial transcriptomic data using the NMF factors derived from single-cell data to infer spot cell type composition134. Similarly, NMF regression (NMFreg) is used in Slide-Seq52. Probability-based methods such as Stereoscope135, Cell2location136, and RSTG137, as well as graph-based138 and deep-learning-based methods such as Tangram139, have been introduced. In Stereoscope135, the cell type parameters are estimated by maximum likelihood on the single-cell data and then used to estimate the composition of each spot. Cell2location136 is similar to Stereoscope, but additionally attempts to infer the absolute number of cells per spot. DSTG138 uses single-cell data to construct pseudo-spots, and then links real and pseudo-spots in a graph of nearest neighbors. The spatialDWLS method borrows from methodologies previously used for bulk RNA-Seq deconvolution and applies cell type enrichment followed by a dampened weighted least squares method to determine spot composition140.
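The core idea shared by regression-based deconvolution, modeling a spot as a non-negative mixture of reference cell type profiles, can be sketched with a generic non-negative least squares fit solved by projected gradient descent. This is not the algorithm of any of the methods named above. The signature matrix, the spot, and the iteration count are hypothetical.

```python
# Hypothetical reference profiles from single-cell data.
# Rows are genes; columns are cell types (neuron, astrocyte).
signatures = [
    [10.0, 0.0],  # neuron marker
    [0.0, 10.0],  # astrocyte marker
    [5.0, 5.0],   # gene expressed in both
]
# Hypothetical spot built as 70% neuron + 30% astrocyte.
spot = [7.0, 3.0, 5.0]


def nnls_projected_gradient(S, y, n_iter=500):
    """Minimize 0.5 * ||S w - y||^2 subject to w >= 0."""
    n_genes, n_types = len(S), len(S[0])
    # Step size 1 / ||S||_F^2 is no larger than 1 / lambda_max(S^T S).
    step = 1.0 / sum(v * v for row in S for v in row)
    w = [0.0] * n_types
    for _ in range(n_iter):
        resid = [
            sum(S[g][k] * w[k] for k in range(n_types)) - y[g]
            for g in range(n_genes)
        ]
        grad = [
            sum(S[g][k] * resid[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, w[k] - step * grad[k]) for k in range(n_types)]
    return w


weights = nnls_projected_gradient(signatures, spot)
proportions = [w / sum(weights) for w in weights]
```

Rescaling the fitted weights to sum to one yields estimated proportions. Estimating absolute cell numbers per spot requires additional modeling.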
ST methods with sub-cellular resolution face the inverse problem consisting of grouping spots into organelles or cells. Seq-Scope makes use of transcript annotations as spliced, unspliced, or mitochondrial to define regions within cells58. Recent approaches have been developed that use the local density of each RNA species to assign a cell type to each spot84,85. Pci-Seq uses probabilistic cell typing and is able to identify cell types more efficiently in larger tissue areas23,86. FICT, another method, integrates expression and neighbourhood information to assign cell types141. In the case of imaging-based methods, each DAPI-stained nucleus can be classified as a cell type according to its distance from marker gene RNAs86.
Given its systematic nature, spatial transcriptomics is well suited to identifying similarities, differences and relationships between populations of genes and tissue regions. Clusters of spots can be related by querying for expressed genes, spatial overlap, and developmental or functional relationships. For example, Stickels et al.53 identified genes that are differentially expressed between the proximal neuropil and the soma within the hippocampus using the different spots as replicates. Creative ways to relate the transcriptomes of clusters of spots are borrowed from those originally developed for scRNA-Seq. RNA Velocity142 makes use of the unspliced transcripts to infer how spots are related to each other in time, and was applied in the cortex to map the dynamics of neuro-development53. RNA-Seq-based copy-number variation inference identifies chromosomal aneuploidies, which can be used to distinguish malignant from non-malignant spots, and also to identify distinct subclones143,144. When two sets of spots are spatially adjacent, potential modes of interaction145 between the cells can be proposed by examining their paired receptors and ligands111 using known databases such as CellPhoneDB47,93,146 or NicheNet147.
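The receptor-ligand reasoning for spatially adjacent clusters can be sketched as a simple filter over a list of pairs. The pair list, cluster names, expression values, threshold, and product score are illustrative assumptions. This does not reproduce the statistics of the cited databases and tools.

```python
# Hypothetical ligand-receptor pairs; a real analysis would take these
# from a curated resource.
lr_pairs = [("Tgfb1", "Tgfbr2"), ("Cxcl12", "Cxcr4"), ("Wnt5a", "Fzd5")]

# Hypothetical mean expression profiles of two spatially adjacent clusters.
cluster_expr = {
    "tumor": {"Tgfb1": 3.2, "Cxcl12": 0.1, "Wnt5a": 2.0},
    "stroma": {"Tgfbr2": 2.5, "Cxcr4": 4.0, "Fzd5": 0.0},
}


def candidate_interactions(sender, receiver, lr_pairs, min_expr=1.0):
    """Keep pairs whose ligand is expressed in the sender cluster and whose
    receptor is expressed in the receiver cluster, ranked by their product."""
    hits = []
    for ligand, receptor in lr_pairs:
        lig = sender.get(ligand, 0.0)
        rec = receiver.get(receptor, 0.0)
        if lig >= min_expr and rec >= min_expr:
            hits.append((ligand, receptor, lig * rec))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


hits = candidate_interactions(
    cluster_expr["tumor"], cluster_expr["stroma"], lr_pairs
)
```

Co-expression of a ligand and its receptor in neighboring clusters only proposes an interaction. As with other exploratory observations, it requires independent validation.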
ST for hypothesis generation and testing
ST atlases of healthy or diseased tissues naturally lend themselves to unbiased exploration and hypothesis generation51,93. Even spatial transcriptomic datasets designed to study a specific biological process, such as time-course studies or perturbation experiments, can be explored to reveal unexpected changes and formulate new hypotheses101 (Figure 3). Thousands of spots or genes may be studied together, thereby exploiting the high dimensionality of the dataset to yield robust biological inferences. These observations - the presence of a cell type, a pattern of gene expression, or the co-localization of two cell states - may lead to a novel testable hypothesis. They should also be validated independently, for example by immunofluorescence46 or in situ hybridization76 (Figure 3a).
Fig. 3 | Hypothesis generation and testing using spatial transcriptomics.
a, Spatial transcriptomics can be used for hypothesis generation in various experimental contexts. Examples of spatial transcriptomic datasets include normal tissue (atlas), a developmental or disease time-course, and perturbation experiments (genetic, drug or infection). Following data collection, exploratory data analysis may generate observations - requiring validation - that lead to a hypothesis. b, Spatial transcriptomics for hypothesis testing. A well-powered experimental design that uses spatial transcriptomics can test formulated hypotheses. These can be further tested using clinical data, in vivo or in vitro models.
Alternatively, spatial transcriptomic data can be incorporated into a classical hypothesis-driven experimental design, whereby a sufficiently powered experiment is leveraged to test a well-defined prediction. Indeed, as spatial transcriptomic technology becomes more accessible, it is poised for use as a routine assay, on par with flow cytometry or RNA sequencing. Guided by experimental design, spatial transcriptomics can corroborate or falsify a hypothesis when used as a readout in a perturbation or time-course experiment. Each sample can be summarized by an individual datapoint and compared across replicates and conditions, which requires collecting enough samples to ensure statistical rigor and power. Studies may perform spatial transcriptomics on several sections from the same sample to account for technical variability, or on multiple biological replicates per condition to account for biological variability. The hypothesis can further be tested in model systems, in vitro or in vivo, or in clinical data (Figure 3b).
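The idea of reducing each sample to a single datapoint and comparing conditions can be sketched as follows. This is a minimal illustration under stated assumptions, not a recommended pipeline: the summary statistic (fraction of spots above a threshold), the exact permutation test, and the toy values are all hypothetical choices made for the example.

```python
from itertools import combinations
from statistics import mean


def sample_summary(spot_values, threshold=1.0):
    """Summarize one sample as the fraction of its spots above a threshold."""
    return sum(v > threshold for v in spot_values) / len(spot_values)


def exact_permutation_test(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    total = extreme = 0
    # Enumerate every way of assigning n_a of the samples to the first group.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical marker expression per spot; each inner list is one biological replicate.
control = [[0, 0, 2, 0], [0, 2, 0, 0], [0, 0, 0, 0], [2, 0, 0, 0]]
perturbed = [[2, 2, 0, 2], [2, 2, 2, 0], [2, 2, 2, 2], [0, 2, 2, 0]]

control_summary = [sample_summary(s) for s in control]
perturbed_summary = [sample_summary(s) for s in perturbed]
p_value = exact_permutation_test(control_summary, perturbed_summary)
print(control_summary, perturbed_summary, p_value)
```

The number of replicates bounds the smallest achievable p-value in such a test: with four samples per condition there are 70 possible group assignments, so a perfectly separated result gives p = 2/70, whereas with three samples per condition the minimum is 2/20 = 0.1. This is one concrete reason why the number of biological replicates, rather than the number of spots, determines statistical power in this design.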
Integration of spatial transcriptomics with other modalities
As the resolution and sensitivity of spatial transcriptomic technologies improve, integration with other data modalities can provide an opportunity for better tissue characterization using ST. While currently underutilized, the tissue image itself can be used to extract high-resolution information, especially when combined with the vast knowledge acquired by the field of histopathology to manually identify and annotate regions2. In particular, morphological features detected in the tissue such as cell shape or nucleus size can be directly incorporated in the analysis. In stLearn, spots with similar features are identified and spatial smoothing is improved by averaging across spots that are not only physically close but also similar in composition115. Another study improved the resolution of spatial transcriptomics gene expression data by fusing it with high-resolution histology image data148. Deep learning has also been used to predict cell type annotations from gene expression and histology, outperforming annotations predicted from either modality alone149. With the increase in transcriptomic data available for training, machine learning algorithms have also been used to predict gene expression from histopathology images150,151. Rather than relying on pre-defined morphological features, these algorithms improve their performance by decomposing the full image into “tiles”. Integration of spatial transcriptomics with such machine learning approaches may improve the interpretability of histopathology and its use in clinical decision-making to guide treatment and inform prognosis.
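The neighbor-averaging idea described above for stLearn can be illustrated with a small sketch. It is not stLearn's API or its exact algorithm: the function `smooth_expression`, the use of cosine similarity between morphological feature vectors as the weight, and the fixed neighborhood radius are assumptions chosen to show the principle that a spot borrows signal only from neighbors that are both physically close and morphologically similar.

```python
from math import dist, sqrt


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def smooth_expression(coords, morph, expr, radius):
    """Average each spot with nearby spots, weighted by morphological similarity."""
    smoothed = []
    for i, ci in enumerate(coords):
        weighted_sum = expr[i]  # the spot itself always has weight 1
        weight_total = 1.0
        for j, cj in enumerate(coords):
            if j != i and dist(ci, cj) <= radius:
                w = max(0.0, cosine(morph[i], morph[j]))  # ignore negative similarity
                weighted_sum += w * expr[j]
                weight_total += w
        smoothed.append(weighted_sum / weight_total)
    return smoothed


# Hypothetical toy data: three spots in a row; the third looks different in the image.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
morph = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # e.g. features extracted from image tiles
expr = [10.0, 0.0, 100.0]                      # one gene's expression per spot

smoothed = smooth_expression(coords, morph, expr, radius=1.0)
print(smoothed)
```

In this toy case the first two spots share morphology and are averaged together, while the third spot is adjacent to the second but morphologically distinct, so its value is left unchanged and does not leak into its neighbor.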
At subcellular resolution, the spatial organization of chromatin may provide clues into the regulation of gene expression in various contexts. DNA seqFISH integrated with RNA seqFISH and multiplexed immunofluorescence revealed that active gene loci are located on the surface of nuclear bodies and zone interfaces in embryonic stem cells152. Integrating spatial transcriptomic datasets with high-throughput imaging of the genome in situ and the spatial distribution of histone marks within a tissue will be extremely valuable153-155. Recently, spatial mapping of genome organization with concurrent DNA sequencing within intact tissue has been made feasible156. This suggests that the goal of combining spatial genome sequencing with in situ transcriptomic profiling may be within reach, deepening our understanding of how genome organization and function are encoded155.
Augmenting gene expression data with a complementary modality such as protein co-detection can also shed light on processes that spatial transcriptomics does not capture, such as post-translational modification and sub-cellular localization of proteins and their dysregulation in disease. Targeted protein co-detection performed alongside spatial transcriptomics can be achieved using immunostaining on the same tissue section, as enabled by Visium48. A novel imaging cytometry-based approach was used to simultaneously detect transcripts and proteins in breast cancer tissue samples157. DBiT-Seq allows for the co-mapping of mRNA and proteins in the tissue using antibody-derived DNA tags, as is done in CITE-Seq158. High-throughput spatial methods for protein detection such as MIBI, CODEX and t-cyCIF, as well as mass spectrometry and barcode-based approaches, provide an unparalleled snapshot of the proteome within the tissue section159-164. Technological advances that allow the integration of these high-throughput proteomics methods with spatial transcriptomics will tremendously improve our ability to study tissue complexity.
The spatial transcriptomics field is growing at an exponential pace, with daily releases of technologies and datasets. The challenges faced by current spatial transcriptomic methods - including the limits to resolution and sensitivity, as well as throughput and accessibility - are being rapidly overcome. Spatial transcriptomics methods are being made compatible with paraffin-embedded tissues, opening the door to retrospective analyses of samples collected over decades in biobanks48,70,165,166. With future innovations, it may be possible to systematically assay larger tissue areas for the reconstruction of 3D organ- or organism-level atlases, and to visualize transcriptome-wide gene expression changes as they unfold over time. In addition to overcoming these technological challenges, future work will require the development of new computational tools and creative analytical thinking. Together, these will enable data exploration to identify ‘spatial patterns’ - a central feature of spatial transcriptomic datasets - and reveal insights into the underlying biology.
As we speculate about the future milestones of the field, the Human Genome Project may serve as a useful parallel. The initial draft of the human genome was published in 2001 (refs. 167,168) and provided a reference to study the sources and consequences of genetic variation. However, the function and regulation of the different regions of the genome are still under active investigation. In spatial transcriptomics, future projects may similarly benefit from a reference from which to study distinct conditions. However, mapping the expression level of every gene in space will only be the first step to elucidating organizing principles of tissue biology. It is the coupling of these high-resolution cellular atlases with hypothesis-free inquiries that will enable new insight, and reveal the salient features of tissue architecture in physiology and disease.
A key challenge for the field will be to iteratively build a model of how multicellular spatial patterns emerge from cell-level properties. Independent of spatial transcriptomic technologies, implementing a simple principle - that each cell is overall most similar to its neighbors - was sufficient to recover complex spatial patterns in the Drosophila embryo169. Building on this idea, the exploration of spatial transcriptomic datasets will enable us to uncover the fundamental principles that guide our modeling of tissue-level spatial organization and will facilitate the study of the mechanistic basis of these patterns and their consequences. These deeper biological insights will extend the level of understanding from simple tissues to more complex structures, including developing organisms and diseased tissues, bringing us closer to conquering the spatial frontier.
We thank Felicia Kuperwaser, Andrew Pountain, Bo Xia, and other members of the Yanai lab, as well as Mark Phillips for critical reading and feedback. We thank the students of the exploratory data analysis course at NYU Langone Medical Center. IY was supported by grants from the NIH (R01AI143290) and the Lowenstein Foundation, and DB was supported by the NIH (F30CA257400).
Competing interests: The authors declare that they have no competing interests.
