scGHOST: Identifying single-cell 3D genome subcompartments

Kyle Xiong; Ruochi Zhang; Jian Ma

doi:10.1038/s41592-024-02230-9

Author manuscript; available in PMC: 2024 May 25.

Published in final edited form as: Nat Methods. 2024 Apr 8;21(5):814–822. doi: 10.1038/s41592-024-02230-9

PMCID: PMC11127718 NIHMSID: NIHMS1992165 PMID: 38589516

Abstract

Single-cell Hi-C (scHi-C) technologies allow for probing of genome-wide cell-to-cell variability in 3D genome organization from individual cells. Computational methods have been developed to reveal single-cell 3D genome features based on scHi-C, including A/B compartments, topologically-associating domains, and chromatin loops. However, no method exists for annotating single-cell subcompartments, important for understanding chromosome spatial localization in single cells. Here, we present scGhost, a single-cell subcompartment annotation method using graph embedding with constrained random walk sampling. Applications of scGhost to scHi-C data and contact maps derived from single-cell 3D genome imaging demonstrate reliable identification of single-cell subcompartments, offering insights into cell-to-cell variability of nuclear subcompartments. Using scHi-C data from complex tissues, scGhost identifies cell type-specific or allele-specific subcompartments linked to gene transcription across various cell types and developmental stages, suggesting functional implications of single-cell subcompartments. scGhost is an effective method for annotating single-cell 3D genome subcompartments in a broad range of biological contexts.

Introduction

The development of high-throughput three-dimensional (3D) whole-genome mapping methods, such as Hi-C [1], has advanced our understanding of multiscale 3D genome features, including A/B compartments [1], subcompartments [2, 3], topologically associating domains (TADs) [4, 5], and chromatin loops [2]. These 3D genome features are intertwined with important genome functions, such as gene transcription and DNA replication [6, 7]. A major challenge in studying 3D genome structure and function lies in uncovering the 3D genome features and their variability at single-cell resolution [8]. Emerging single-cell 3D genome mapping technologies, particularly single-cell Hi-C (scHi-C), have enabled the analysis of higher-order chromatin structure in individual cells [9–15]. Recently, several computational methods have been developed to address the analysis challenges posed by the high data sparsity of scHi-C, enhancing overall data quality [16–18] and characterizing multiscale 3D genome features and their heterogeneity at single-cell resolution [17, 19–21].

Although computational methods for calling A/B compartments [13, 17], TADs [17], and chromatin loops [21] from scHi-C data exist, none yet reveal single-cell 3D genome subcompartments. The 3D genome exhibits critical subcompartment patterns that the binary A/B compartment definitions do not capture [22]. For bulk Hi-C data, Rao et al. [2] generated high-coverage Hi-C to refine A/B compartments into five major subcompartments (A1, A2, B1, B2, and B3), which show distinct correlations with various epigenomic features. More recent methods have been developed to identify subcompartments based on bulk Hi-C with low to moderate coverage [3, 23, 24]. Collectively, these studies show that subcompartment annotations advance the binary A/B compartment definitions with strong stratification of functional genomic signals. However, applying these methods developed for bulk Hi-C to scHi-C data remains infeasible, primarily due to two challenges: (1) scHi-C data require specialized methods to mitigate their high sparsity and noise when identifying specific 3D genome features, such as subcompartments; (2) bulk subcompartment annotation typically requires Hi-C datasets with at least 50 million trans-reads [3], whereas nearly all scHi-C datasets lack sufficient coverage to reveal the interchromosomal chromatin interactions needed for subcompartment annotation at a pseudo-bulk level, let alone at the single-cell level. These issues underscore the need for new methods to annotate 3D genome subcompartments from scHi-C data.

Here, we introduce scGhost (single-cell graph-based Hi-C organization and segmentation toolkit), a computational method for genome-wide subcompartment annotation in individual cells using scHi-C data. scGhost leverages data imputed from our recent Higashi algorithm [17]. It employs graph-embedding neural networks with a constrained random walk sampling strategy for partitioning scHi-C contact maps into subcompartment annotations. By applying scGhost to scHi-C data in several cell lines and single-cell 3D genome imaging data, we demonstrate its ability to reveal single-cell subcompartments, providing insights into the functional implications of chromatin spatial localization in individual cells. Moreover, scGhost uncovers cell type-specific or allele-specific links between subcompartments and gene transcription in human prefrontal cortex, developing mouse brains, and developing mouse embryos.

Results

Overall design of scGhost

scGhost annotates subcompartments in scHi-C datasets and views scHi-C contact maps as graphs, where genomic loci are vertices in the graph and are connected through edge weights defined by Hi-C contact frequencies among loci. scGhost employs a unique random sampling procedure that filters noise in imputed scHi-C data, represents (embeds) each genomic locus (graph vertex) in single cells as a continuous-valued vector, and uses unsupervised learning to partition single-cell genomes and identify single-cell 3D genome subcompartments (see Fig. 1a for an overview).

Figure 1: a. Schematic of the scGhost workflow. b. Higashi embeddings are used to identify cells that exhibit the most similarity in the single-cell embedding space. c. Random walks create sparse graphs that portray the most crucial connections among genomic loci. d. Walks are aggregated and fed into a graph embedding model, which generates embeddings for each genomic locus. These embeddings are subsequently clustered and compared to derive a final set of single-cell subcompartment annotations comparable across chromosomes and cells.

The input of scGhost includes (1) imputed scHi-C contact maps of a cell; and (2) scHi-C embeddings (e.g., via Higashi [17]). scGhost identifies k-nearest neighbors (k-NN) for each cell based on single-cell embeddings. Starting with cell embeddings, scGhost calculates genomic locus embeddings tailored for downstream clustering of single-cell subcompartments. Our framework then proceeds with clustering using the newly derived scGhost embeddings, producing discrete annotations for each genomic locus in each individual cell. These annotations highlight cell-to-cell variability of single-cell subcompartments and facilitate cell type-specific genome structure-function analysis.
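The neighbor-identification step can be illustrated with a minimal sketch. It assumes Euclidean distance between cell embeddings and a brute-force search; the function name `k_nearest_cells` and the toy embeddings are hypothetical and are not taken from the scGhost code.

```python
import math

def k_nearest_cells(embeddings, k):
    """Return, for each cell, the indices of its k nearest cells (Euclidean)."""
    neighbors = []
    for i, e_i in enumerate(embeddings):
        # Distance from cell i to every other cell, excluding itself.
        dists = [(math.dist(e_i, e_j), j)
                 for j, e_j in enumerate(embeddings) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

# Toy 2-D embeddings for four cells (hypothetical values).
cells = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2)]
nearest = k_nearest_cells(cells, k=1)
```

In this toy example the first two cells are each other's nearest neighbor, as are the last two. Real embeddings have many more dimensions, and an approximate nearest-neighbor index would likely be preferable for large datasets.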

scGhost comprises four main components. (1) A sampling procedure motivated by node2vec [25], which uses second-order random walks to sample neighborhoods in graph networks. Our sampling process estimates the most reliable genomic interactions in a cell using imputed scHi-C contact maps and the contact maps from neighboring cells in the embedding space (Fig. 1b), producing a sparse, undirected, and weighted graph containing only the strongest Hi-C contacts (Fig. 1c). (2) A graph node embedding procedure, treating genomic loci as nodes in a graph and using neural networks akin to those in word embedding frameworks [26]. This step embeds each genomic locus in scHi-C maps and aggregates random walks across multiple cells, connecting spatially proximal loci obscured by noise in scHi-C data. The embeddings are used in the subsequent clustering step for annotating single-cell subcompartments. (3) A clustering process unique to scHi-C data that ensures clusters in different chromosomes correspond to the same set of genome-wide subcompartments. (4) An alignment procedure that makes single-cell subcompartment annotations comparable in all single cells (Fig. 1d).
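To make the idea of second-order random walks concrete, the sketch below samples one node2vec-style walk on a small weighted graph. It follows the general node2vec bias scheme (return parameter `p`, in-out parameter `q`) and does not reproduce the specific constraints scGhost applies; the graph, parameter values, and function name are illustrative assumptions.

```python
import random

def second_order_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Sample one node2vec-style walk on a weighted graph.

    graph: dict mapping node -> {neighbor: edge weight}.
    p: return parameter; q: in-out parameter (node2vec conventions).
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = list(graph[cur])
        if not nbrs:
            break
        if len(walk) == 1:
            # First step: sample proportionally to edge weight.
            weights = [graph[cur][n] for n in nbrs]
        else:
            prev = walk[-2]
            weights = []
            for n in nbrs:
                w = graph[cur][n]
                if n == prev:
                    w /= p  # stepping back to the previous locus
                elif n not in graph[prev]:
                    w /= q  # moving away from the previous locus
                weights.append(w)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Toy contact graph over four loci (hypothetical weights).
graph = {
    0: {1: 3.0, 2: 1.0},
    1: {0: 3.0, 2: 2.0},
    2: {0: 1.0, 1: 2.0, 3: 0.5},
    3: {2: 0.5},
}
walk = second_order_walk(graph, start=0, length=6, p=2.0, q=0.5,
                         rng=random.Random(0))
```

Edges that are traversed often across many such walks can then be retained to form a sparse graph, which is the intuition behind keeping only the strongest contacts.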

Additionally, we combine single-cell subcompartments to approximate bulk-level subcompartments, referred to as “pseudo-bulk subcompartments” (details in Supplementary Note on pseudo-bulk and bulk subcompartment computation).

Subcompartments from GM12878 scHi-C data

We applied scGhost to the scHi-C data of GM12878 at 500kb resolution [12]. Fig. 2a shows the median intra- and inter-subcompartment observed-over-expected (O/E) contact frequencies across all cells. O/E contact frequencies are calculated for the Higashi-imputed contact maps of individual cells, and the median O/E frequency is then determined across all cells. Our analysis revealed that pairs of genomic regions with the same single-cell subcompartment label – scA1, scA2, scB1, scB2, and scB3 – preferentially interact. Intra-cluster interactions are significantly more frequent than inter-cluster interactions, with a one-sided p < 5.53 × 10⁻¹³⁴ (see Supplementary Note). These interaction patterns among single-cell subcompartments suggest that scGhost possesses the sensitivity to categorize scHi-C contact maps into distinct subcompartments with unique contact patterns.
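As an illustration of the O/E computation, the sketch below normalizes a contact map by the mean contact frequency at each genomic distance and then takes the median O/E for every pair of subcompartment labels. The distance-based expectation is a common convention assumed here; the exact procedure is described in the Supplementary Note. The toy matrix and function names are hypothetical.

```python
from statistics import mean, median

def observed_over_expected(contacts):
    """Divide each contact by the mean contact at the same genomic distance."""
    n = len(contacts)
    # Expected contact at distance d is the mean of the d-th diagonal.
    expected = [mean(contacts[i][i + d] for i in range(n - d)) for d in range(n)]
    oe = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            e = expected[abs(i - j)]
            if e > 0:
                oe[i][j] = contacts[i][j] / e
    return oe

def median_oe_by_label_pair(oe, labels):
    """Median O/E over locus pairs, grouped by their (sorted) label pair."""
    groups = {}
    n = len(oe)
    for i in range(n):
        for j in range(i + 1, n):
            key = tuple(sorted((labels[i], labels[j])))
            groups.setdefault(key, []).append(oe[i][j])
    return {key: median(vals) for key, vals in groups.items()}

# Toy symmetric contact map for four loci (hypothetical values).
contacts = [
    [10, 8, 2, 1],
    [8, 10, 4, 2],
    [2, 4, 10, 8],
    [1, 2, 8, 10],
]
labels = ["scA1", "scA1", "scB3", "scB3"]
medians = median_oe_by_label_pair(observed_over_expected(contacts), labels)
```

In this toy example the same-label pairs have a higher median O/E than the mixed-label pairs, mirroring the preferential intra-subcompartment interactions described above.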

Figure 2: a. Median observed-over-expected (O/E) contact frequencies among single-cell subcompartments. P-values were computed with one-sided t-tests. b. Example regions, marked by arrows and highlighted areas, where single-cell subcompartments correlate more strongly than bulk Hi-C subcompartments [2] with contact patterns in pseudo-bulk (top contact map) and bulk (bottom contact map) Hi-C matrices. c. Distribution of H3K27me3 across GM12878 genomic regions with variable and stable single-cell subcompartment annotations (one-sided t-test p = 5.28 × 10⁻³⁷, N = 5290 genomic loci). d. Boxplots displaying the distribution of Higashi single-cell A/B scores in single-cell subcompartments (left) of all cells and in population subcompartments (right). One-sided t-tests were computed with 100 random samples of N = 750 scA/B scores in both single-cell and population subcompartments. In the boxplots in c and d, the middle line represents the median, the lower and upper lines correspond to the first and third quartiles, and the upper and lower whiskers extend to values no further than 1.5 × IQR. e. Enrichment of epigenomic signals in pseudo-bulk subcompartments.

Next, we sought to demonstrate that scGhost subcompartments enhance the overall genome segmentation compared to bulk Hi-C subcompartments. We found pseudo-bulk subcompartment annotations align more closely with changes in both bulk and pseudo-bulk Hi-C interaction patterns than bulk Hi-C subcompartment annotations (Fig. 2b). Single-cell subcompartments were aggregated over all cells for each genomic locus by computing their frequencies across cells. In Fig. 2b, we show an example at the population level where fluctuations in contact patterns in the B3 subcompartment clearly show shifts in compartmentalization (highlighted regions indicated by the black arrows in Fig. 2b). While bulk Hi-C annotations remained as B3, scGhost placed them in more refined, active subcompartments, reflecting contact pattern variation more accurately. Furthermore, we observed a change from bulk-level A2 to B3, not supported by Hi-C contact patterns, indicated by the third highlighted area in the left-most example in Fig. 2b. By contrast, our aggregated single-cell subcompartments did not show these abrupt but unexplained changes. On a genome-wide level (Supplementary Fig. 1), pseudo-bulk subcompartments correspond to reduced variance of scHi-C contact frequencies. Therefore, at the population level, single-cell annotations from scGhost improve the segmentation from the original bulk Hi-C annotations [2], despite being distributed differently (Supplementary Fig. 2).
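The aggregation of single-cell annotations into per-locus frequencies can be sketched as follows. Picking the most frequent label as a population-level call is an assumption made for illustration only; the actual pseudo-bulk computation is detailed in the Supplementary Note. All names and toy annotations are hypothetical.

```python
from collections import Counter

def subcompartment_frequencies(annotations):
    """annotations[c][l] is the label of locus l in cell c."""
    n_cells = len(annotations)
    n_loci = len(annotations[0])
    freqs = []
    for locus in range(n_loci):
        counts = Counter(cell[locus] for cell in annotations)
        freqs.append({lab: cnt / n_cells for lab, cnt in counts.items()})
    return freqs

def majority_label(freq):
    """Most frequent label at one locus; ties broken alphabetically."""
    return max(sorted(freq), key=freq.get)

# Toy annotations: four cells, three loci (hypothetical).
cells = [
    ["scA1", "scB1", "scB3"],
    ["scA1", "scA2", "scB3"],
    ["scA1", "scB1", "scB3"],
    ["scA2", "scB2", "scB3"],
]
freqs = subcompartment_frequencies(cells)
calls = [majority_label(f) for f in freqs]
```

Here the first locus is scA1 in three of four cells, the second is mixed, and the third is uniformly scB3, so the frequency profile already hints at which loci are stable and which are variable.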

We then utilized single-cell subcompartment annotations to better understand the single-cell variability of genomic loci associated with facultative heterochromatin at the population level. These loci, largely annotated as B1 from bulk Hi-C, interact with genomic loci in both the A and B compartments [2]. We divided genomic loci of GM12878 cells into variable and stable single-cell subcompartments (Supplementary Note) and assessed the enrichment of H3K27me3, a histone mark enriched in B1 and indicative of facultative heterochromatin at the bulk level [2, 27]. In Fig. 2c, we observed significantly higher H3K27me3 enrichment over control in variable genomic regions than in stable ones in the cell population (p = 5.28 × 10⁻³⁷), supporting the earlier observation that H3K27me3-repressed genes can exhibit heterogeneity of expression [28]. This suggests that loci with enriched H3K27me3 are more likely to consist of variable single-cell subcompartments, which are more frequently associated with population-level B1 (Supplementary Fig. 2) that is also correlated with variable single-cell annotations (Supplementary Fig. 3).

While we have shown that scGhost can enhance single-cell genome segmentation compared to bulk Hi-C subcompartments, it is necessary to reconcile the discrepancy between single-cell subcompartments and bulk Hi-C subcompartments (Supplementary Fig. 2). We analyzed single-cell A and B (denoted scA/B) compartment scores corresponding to single-cell and bulk Hi-C subcompartments (Fig. 2d), calculating A and B compartment scores for each cell using Higashi-imputed scHi-C maps (Supplementary Note). The distributions of scA/B scores in the single-cell subcompartments are all distinct from one another. In contrast, scA/B scores are not distributed differently between the A2 and B1 subcompartments from bulk Hi-C, with a mean p = 0.086 (see Supplementary Note for details on p-value calculation).

We found that single-cell subcompartment annotations consistently stratify epigenomic signals such as histone modifications and replication timing. We computed pseudo-bulk subcompartments using our GM12878 single-cell annotations (Supplementary Note) and constructed an epigenomic mark enrichment profile of our pseudo-bulk subcompartments (Fig. 2e). The pattern of histone mark enrichment and early-late replication timing when transitioning from scA1 to scB3 mirrors observations in bulk Hi-C subcompartments [2, 3]. In a population of single cells, scGhost subcompartments align with active/repressive histone marks and early-late replication timing.

Single-cell subcompartment association with transcriptional variability

We next evaluated single-cell subcompartment annotations in WTC11 human iPS cells [29] at 500 kb resolution. We applied scGhost to identify spatially variable and stable genomic loci in terms of subcompartment states across individual cells. For each genomic locus, we calculated the percentage of cells in each subcompartment. These percentages were then used to compute a scalar variability value for each locus (Supplementary Note).
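One way to turn per-locus subcompartment percentages into a single variability value is a normalized Shannon entropy, sketched below. This is an illustrative stand-in, not the definition used in the paper, which is given in the Supplementary Note; the function name and toy fractions are hypothetical.

```python
import math

def variability(fractions):
    """Normalized Shannon entropy of label fractions at one locus.

    Returns 0.0 when every cell shares one label and 1.0 when the
    labels are uniformly distributed. (Assumed measure, for illustration.)
    """
    entropy = sum(-f * math.log(f) for f in fractions if f > 0)
    return entropy / math.log(len(fractions))

# Fractions of cells in scA1, scA2, scB1, scB2, scB3 (hypothetical).
stable_locus = [1.0, 0.0, 0.0, 0.0, 0.0]
variable_locus = [0.2, 0.2, 0.2, 0.2, 0.2]
mixed_locus = [0.7, 0.2, 0.1, 0.0, 0.0]

scores = [variability(f) for f in (stable_locus, mixed_locus, variable_locus)]
```

Under this measure, loci dominated by one subcompartment score near 0 and loci with a near-uniform label distribution score near 1, which matches the qualitative notion of stable versus variable loci.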

For each genomic locus of WTC11 cells, categorized as variable or stable, we determined the fraction of cells in each subcompartment (Fig. 3a, blue). Stable subcompartment regions are primarily annotated as scA1 and scB3, indicating their spatial positions are largely conserved across cells. By contrast, regions with variable spatial positions across cells are marked by a relatively uniform distribution of single-cell subcompartments (Fig. 3a, orange).

Figure 3: a. Proportion of single-cell subcompartments in stable (blue) and variable (orange) chromatin states across all chromosomes. b. Relative frequency of single-cell subcompartments in 10 Mb regions flanking transcription start sites (TSS) of genes stably (left) and variably (right) transcribed (defined based on the residual variance of sctransform normalized WTC11 scRNA-seq). c. Single-cell gene expression is more variable in single-cell subcompartment boundaries and variable subcompartment annotation regions. One-sided t-test p = 3.79 × 10⁻⁹ for 2628 and 2625 boundary and non-boundary associated loci, respectively. One-sided t-test p = 2.60 × 10⁻² for 1870 and 1902 variable and stable loci, respectively. d. Epigenomic signal enrichment of pseudo-bulk subcompartments in chr21, as revealed by the IMR90 3D genome MERFISH dataset. e. Co-assayed gene transcription activity from MERFISH data [31] in single-cell subcompartments genome-wide (top) and in regions overlapping with A compartment annotated by bulk Hi-C (bottom). Top: N = 37,777 genomic loci with gene transcription. Bottom: N = 23,769 A compartment loci with gene transcription. In the boxplots in c and e, the middle line represents the median, the lower and upper lines correspond to the first and third quartiles, and the upper and lower whiskers extend to values no further than 1.5 × IQR.

We next examined the connection between the variability of subcompartments and gene transcription activities, categorizing genomic loci by stable and variable gene transcription based on scRNA-seq. Transcription variability was quantified using the residual variance in scRNA-seq for all genes with non-zero expression in WTC11 [30], which was further binarized into stable and variable transcription activity based on the 50th percentile of residual variance. Near the transcription start sites (TSS) of genes with stable transcription activity, genomic loci are mostly in the scA1 and scA2 single-cell subcompartments (Fig. 3b, left), whereas variable transcription activity is more frequently found in scA1, scA2, and scB1 (Fig. 3b, right). Both stable and variable transcription activity are least likely in scB3.

When analyzing stable and variable transcription activity alongside stable and variable subcompartment annotations, we found that stable transcription in stable subcompartments mostly occurs in scA1, while in variable subcompartments, it is marked by scA1, scB1, and scB2 (Supplementary Fig. 4a,b). Variable transcription in stable subcompartments is also highly linked to scA1, and in variable subcompartments, to scA2, scB1, and scB2 (Supplementary Fig. 4c,d). Notably, variable genes in stable subcompartments have an increased likelihood of being in scB3 (Fig. 3b). Interestingly, at genomic loci with variable subcompartments, TSS of genes with either stable or variable transcription have a high likelihood of being in scB2, suggesting that the inactive scB2 subcompartment can be transiently active. Moreover, scA2 appears to only be associated with gene transcription at loci with variable subcompartments. These findings underscore the distinct properties each single-cell subcompartment holds concerning transcription variability.

We then aimed to show the connection between subcompartment variability and transcription activity across the cell population. We defined genomic loci with variable and stable subcompartmentalization as regions in the top and bottom 50th percentile of subcompartment variability, respectively (Supplementary Note for details on computing variability). We observed that regions undergoing changes of subcompartmentalization across cells show enrichment of genes with more variable transcription activity (one-sided t-test p = 2.60 × 10⁻²; Fig. 3c, right), with the difference becoming more pronounced when more extreme variability thresholds are used (Supplementary Fig. 5). Moreover, regions at subcompartment boundaries (Supplementary Note) are associated with higher transcriptional variability (Fig. 3c, left; with one-sided t-test p = 3.79 × 10⁻⁹). These subcompartment boundaries also tend to co-localize with TAD-like domain boundaries, as indicated by local minima in aggregated single-cell insulation scores (Supplementary Fig. 6). This highlights how scGhost unveils single-cell subcompartment boundaries and structurally variable regions that are both associated with high transcriptional variability.

Subcompartments from 3D genome imaging data

We next sought to show that scGhost can also effectively identify single-cell subcompartments from 3D genome imaging data. Applying scGhost on the single-cell contact maps of chr21 at 100kb resolution from MERFISH imaging in IMR90 cells [31], we compared our subcompartment annotations directly with gene transcription, also imaged in Su et al. [31] based on nascent RNA transcripts of over 1000 genes in the same individual cells. We focused on chr21 as it is the only chromosome in the dataset that includes co-assayed single-cell gene transcription activity.

We transformed the chromatin tracing imaging data into proximity maps, analogous to Hi-C contact maps, by calculating Euclidean distance maps for each cell and inverting the map (Supplementary Note). We then obtained clusters of genomic segments by using observed-over-expected proximity maps as inputs. Clusters were then matched to IMR90 population Hi-C subcompartments defined in Sniper [3] for single-cell subcompartment annotation. Because we only used one chromosome, we defined subcompartments using an alternative clustering method (Supplementary Note).
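The conversion from traced 3D coordinates to a proximity map can be sketched as below. The inversion is assumed here to be a simple reciprocal of the Euclidean distance with a small constant `eps` for numerical safety; the exact transformation is given in the Supplementary Note. Coordinates and names are hypothetical, and missing positions, which occur in real tracing data, are not handled.

```python
import math

def proximity_map(coords, eps=1e-6):
    """Reciprocal-distance proximity between traced loci in one cell.

    coords: list of (x, y, z) positions, one per genomic segment.
    The diagonal is left at 0.0 (no self-proximity).
    """
    n = len(coords)
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                prox[i][j] = 1.0 / (math.dist(coords[i], coords[j]) + eps)
    return prox

# Toy coordinates for three segments along one axis (hypothetical units).
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
prox = proximity_map(coords)
```

Segments that are close in space receive high proximity values, so the resulting matrix can be treated like a contact map in downstream O/E normalization and clustering.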

The chr21 pseudo-bulk subcompartments identified by scGhost correspond to the expected enrichment of epigenomic features, further confirming scGhost’s reliability (Fig. 3d). In particular, the epigenomic enrichment is consistent with findings from our prior work [3]. scA1 and scA2 largely correlate with active histone modification and earlier replication timing, whereas scB1, scB2, and scB3 show less active histone modification and later replication timing, with scB1 notably enriched in H3K27me3 and replicating earlier compared to scB2 and scB3. Histone mark enrichments in single-cell subcompartments, derived from imaging data, mirror those from the GM12878 scHi-C data.

Next, we compared subcompartment annotations with the number of co-assayed actively transcribed genes in each cell. We found that scA1 and scA2 co-occur with more frequently transcribing genes (Fig. 3e, top). Moreover, scB1, scB2, and scB3 annotations that co-occur with the population-level A compartment correspond to less frequent gene transcription (Fig. 3e, bottom), suggesting that scGhost correctly recognizes less active subcompartments in population-level A compartment regions. By comparing single-cell subcompartments to transcription activity, we also found that the relationship between single-cell subcompartment and transcription variability was similar to what we observed in Supplementary Fig. 4 (Supplementary Fig. 7). Furthermore, actively transcribing genes are linked to lower single-cell subcompartment variability (Supplementary Fig. 21), demonstrating scGhost’s ability to capture subcompartment annotations in individual cells that directly reflect their transcriptional activity in the same cell. Additionally, the variability of single-cell subcompartments from imaging data is consistent with the variability observed in the WTC11 dataset and is connected to active gene transcription.

Although scGhost is primarily intended for scHi-C data, these results show that the method effectively annotates subcompartments from single-cell 3D genome imaging data, highlighting its broad applicability. Moreover, in the imaging dataset, single-cell subcompartment comparisons to bulk epigenomic marks align with bulk Hi-C findings.

Cell type-specific subcompartments in the human prefrontal cortex

We next aimed to demonstrate that scGhost can reveal subcompartments and their association with genome functions from scHi-C data in tissues with diverse cell types, a key challenge in 3D genome analysis. We applied scGhost to scHi-C data from the human prefrontal cortex (PFC) [14] and assessed whether subcompartments exhibit cell type specificity and whether these annotations correspond with cell type-specific gene expression. Importantly, the cell type labels were derived from jointly profiled single-cell methylation, independent of chromatin contacts.

First, we evaluated scGhost’s ability to capture cell type-specific Hi-C contact patterns in the PFC dataset. We used the single-cell genome-wide subcompartment annotations from scGhost as embeddings for each cell (termed “scGhost embeddings”). Compared to Higashi scA/B compartments, scGhost enhances the separation of prefrontal cortex cell types. We plotted the UMAP visualization of scGhost embeddings and Higashi scA/B scores across all PFC cells (Fig. 4a). We found that inhibitory neurons (Vip, Sst, Pvalb, Ndnf) and excitatory neurons (L2/3, L4, L5, L6) tend to cluster together when using Higashi scA/B, whereas with the scGhost subcompartments, inhibitory neurons are clearly separated from excitatory neurons. Moreover, subcompartment annotations within the same cell type show greater similarity than those across different cell types (Supplementary Fig. 8), further demonstrating scGhost’s ability to capture cell type-specific subcompartments.

Figure 4: a. UMAP visualization comparing scGhost embeddings with Higashi scA/B embeddings in differentiating prefrontal cortex subtypes. b. Average single-cell subcompartment scores of marker genes in microglia, neurons, and oligodendrocytes. c. Comparison of genomic loci with cell type-specific marker genes in a given PFC cell type and those same loci in other cell types. “Exc” denotes all excitatory neurons, while “Pvalb”, “Vip”, and “Sst” are inhibitory neurons. d. UMAP visualization of the Tan et al. developing mouse brain Dip-C dataset, with each dot representing a haploid colored by its parent-of-origin genotype. See also Supplementary Fig. 12. e. Similarity distributions of subcompartment annotations between two alleles across developmental stages. f. Comparison of similarity distributions of subcompartment annotations between two alleles and between two haploids of the same cell type. g. Overlap heatmaps showcasing regions with known imprinted genomic regions and allele-specific subcompartment annotations.

Additionally, using a random forest classifier to classify PFC cell types, scGhost achieves accuracy comparable to using full Higashi embeddings (Supplementary Fig. 9). However, a distinct advantage of scGhost annotations is their direct correspondence to specific genome loci. Therefore, applying classifiers that calculate feature importance (e.g., random forest) to predict cell types using scGhost annotations can highlight genomic loci key to distinguishing cell types (see Supplementary Fig. 10 for an example between L2/3 and L4).

To show that more active single-cell subcompartments in the PFC dataset contain cell type-specific marker genes, we assigned scores from 0 to 4 to the five subcompartments, with lower scores indicating more active subcompartments. We analyzed the subcompartment scores for the 500 marker genes (see Methods for marker gene identification) with the highest fold change in each cell type. The scatter plots in Fig. 4b depict the average single-cell subcompartment scores for the marker genes of specific cell types. Notably, clusters representing microglia, neurons, and oligodendrocytes generally exhibit lower scores, suggesting that marker genes specific to each cell type tend to be located in more active subcompartments.
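The scoring scheme can be illustrated with a short sketch. It assumes the ordering scA1 = 0 through scB3 = 4, consistent with lower scores marking more active subcompartments; the dictionary, function name, and toy labels are hypothetical.

```python
from statistics import mean

# Assumed ordering: lower score means more active subcompartment.
SCORE = {"scA1": 0, "scA2": 1, "scB1": 2, "scB2": 3, "scB3": 4}

def average_subcompartment_score(labels):
    """Mean score over the loci that harbor a cell type's marker genes."""
    return mean(SCORE[label] for label in labels)

# Labels at four marker-gene loci in one cell (hypothetical).
marker_loci = ["scA1", "scA1", "scA2", "scB1"]
other_loci = ["scB1", "scB2", "scB3", "scB3"]

marker_score = average_subcompartment_score(marker_loci)
other_score = average_subcompartment_score(other_loci)
```

A lower average at marker-gene loci than elsewhere would indicate that those genes sit in more active subcompartments in that cell.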

Next, we trained a random forest classifier for each cell type to distinguish it from the rest of the population. From these models, we identified 250 genomic loci per cell type with the highest feature importance. We then found marker genes specific to each PFC cell type that also co-localized with the most important loci and compiled the subcompartments across all cells in the population within 2.5 Mb flanking regions of each marker gene. Similar to Fig. 4b, we assigned subcompartment scores to these flanking regions of marker genes across all cells, with lower scores for active and higher scores for inactive subcompartments. For each cell type, we calculated the average subcompartment score in the flanking regions of each marker gene. We found that cell type-specific marker genes reside in loci that are significantly more active than those same loci in other cell types (Fig. 4c and Supplementary Fig. 11; with p < 1.44 × 10⁻¹⁰). This pattern is consistent even in neuronal cell types, which are typically more challenging to distinguish using subcompartment annotations. Additionally, loci with marker genes show a significant association with single-cell subcompartment boundaries (Supplementary Fig. 11; with p = 1.30 × 10⁻¹²).

Allele-specific single-cell subcompartments

Recently developed Dip-C techniques have facilitated characterizing haplotype-resolved chromatin contact maps via genotype phasing and imputation, allowing us to explore the 3D genome at allele-specific resolution. To showcase scGhost’s capabilities, we applied it to the developing mouse brain Dip-C dataset to identify allele-specific single-cell subcompartments.

Our initial step segregated each diploid cell, consisting of 19 autosomal chromosome pairs, into two pseudo-haploid cells, distinguished by phasing and each containing 19 intra-chromosomal contact maps. We then applied Higashi to learn embeddings and generate imputation results. We found that the haploids tend to cluster by both cell type and genotype (Fig. 4d and Supplementary Fig. 12a). This dataset includes samples from initial and reciprocal crosses, suggesting that the subclusters are driven by genotype rather than by confounding factors (Supplementary Fig. 12c). Most cell types have two individual clusters, one for each parent-of-origin genotype, except neonatal neurons, where no separate clusters are formed (dashed lines in Fig. 4d). We then assessed the similarities between the two alleles and noted decreasing similarity scores across developmental stages, highlighting the role of allele-specific 3D genome structures (Fig. 4e).

We next explored whether two haploids of different genotypes would exhibit more similar subcompartments if they are alleles from the same cell, by comparing subcompartment annotation similarities between two haploids from the same cell against two random haploids of the same cell type across both genotypes. Our analysis showed that subcompartment similarities between alleles from the same cell are significantly greater than those between random haploids of the same cell type (Fig. 4f).
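A simple way to quantify the similarity between two haploid annotation tracks is the fraction of loci with identical labels, sketched below. This agreement measure is an assumption chosen for illustration; the similarity score actually used may differ. Names and toy tracks are hypothetical.

```python
def annotation_similarity(track_a, track_b):
    """Fraction of loci at which two annotation tracks carry the same label."""
    if len(track_a) != len(track_b):
        raise ValueError("tracks must cover the same loci")
    matches = sum(a == b for a, b in zip(track_a, track_b))
    return matches / len(track_a)

# Toy tracks for the two alleles of one cell and for an unrelated
# haploid of the same cell type (hypothetical labels).
maternal = ["scA1", "scA1", "scB1", "scB3"]
paternal = ["scA1", "scA2", "scB1", "scB3"]
unrelated = ["scA2", "scA2", "scB2", "scB3"]

same_cell = annotation_similarity(maternal, paternal)
random_pair = annotation_similarity(maternal, unrelated)
```

Comparing the distribution of `same_cell` values against `random_pair` values over many cells corresponds to the comparison described above.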

We also investigated the functional implications of differences in subcompartments between alleles. As discussed in Tan et al. [13], a parent-of-origin-specific 3D genome structure may underlie genomic imprinting, leading to allele-specific expression. Our findings indicate that regions around imprinted genes [32] are more likely to exhibit allele-specific subcompartments (Fig. 4g). The overlap between imprinted genes and allele-specific subcompartments is significant under a hypergeometric test (p < 1.18 × 10⁻⁴, Supplementary Fig. 13).
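For readers unfamiliar with the hypergeometric overlap test, the sketch below computes the upper-tail probability of observing at least `k` overlapping loci. The counts in the example are made up for illustration and are unrelated to the values in the study; the function name is hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    population: total loci; successes: loci near imprinted genes;
    draws: loci with allele-specific subcompartments; k: observed overlap.
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / total

# Toy counts: 20 loci, 5 imprinted, 4 allele-specific, 3 in both.
p_value = hypergeom_upper_tail(k=3, population=20, successes=5, draws=4)
```

With these toy counts the probability is 155/4845, about 0.032. The computation uses exact integer binomial coefficients, so it stays accurate for small tails, although a statistics library would be more convenient at genome scale.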

Single-cell subcompartments and gene expression dynamics

The advent of single-cell profiling methods that jointly assess 3D genome structure and the transcriptome [33, 34] presents new opportunities to understand how multiscale 3D genome architectures influence transcriptional activity at the single-cell resolution. As a proof-of-concept, we applied scGhost to the HiRES dataset from developing mouse embryos [33]. We first used Higashi in its co-assay mode (RNA+Hi-C) to learn cell embeddings (see Fig. 5a for UMAP visualization) and to impute sparse scHi-C contact maps. The learned embeddings accurately reflect cell type labels from the original study [33] and, through Higashi’s co-assay mode, distinguish cell types by features beyond cell cycle variations (Supplementary Fig. 14).

Figure 5: a. UMAP visualization of Higashi embeddings for the developing mouse embryos in the HiRES dataset [33], where Higashi uses its co-assay mode incorporating both scRNA-seq and scHi-C data. b. Change in the percentage of scA1 subcompartment at the transcription start site (TSS) of genes that are transcribing versus silent. The log2 ratio of these genes is rank-sorted and displayed. c. Inferred developmental stages for the neuronal development trajectory within the HiRES dataset. d. Variation in subcompartment and gene expression for marker genes at each developmental stage along the trajectory. For each set of marker genes, the subcompartment annotations (0–4 representing B3 to A1) and gene expression are averaged for each single cell. e. UMAP visualization of marker genes for the developmental stage E8.5 to E9.5. The k-NN graph for UMAP is based on normalized trajectories of A1 subcompartment percentage and average gene expression along the developmental trajectory. Louvain clustering on this k-NN graph leads to five clusters. f. Visualization of the five gene clusters identified in (e), categorized as Prime I, Prime II, Decoupled, Uncorrelated, and Synchronized based on their trajectory relationships and patterns. g. An example gene from the Prime II cluster, illustrating how an increase in the percentages of cells with the scA1 single-cell subcompartment (red dashed line, at E9.5) primes the upregulation of gene expression (black dashed line, at Ex05). The lineplots in (f) and (g) are presented as mean values ± 1.96 s.e.m. (95% confidence interval).

We examined the link between single-cell subcompartments and transcription in a cell type-agnostic manner. Gene transcriptional activity was defined by UMI thresholding: for a given gene, cells with UMI ≥ 10 were considered actively transcribing, while cells with 0 UMI were considered silent. After aggregating scGhost subcompartment annotations for the corresponding cell groups, we calculated the log2 fold change of the percentage of scA1 annotations between the two groups (Fig. 5b). We found that for 81.5% of the genes, cells actively transcribing these genes were more likely to exhibit scA1 annotations at the TSS. This pattern aligns with analyses based on scA/B values, which show smaller fold changes (Supplementary Fig. 15), suggesting that single-cell subcompartment analysis is more sensitive in identifying compartmental shifts.
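The comparison can be illustrated for a single gene with a minimal Python sketch. The function name, the input layout (one UMI count and one scA1 flag per cell), and the small pseudocount that guards against division by zero are assumptions made for illustration and are not taken from the scGhost code.

```python
import math

def sca1_log2_fold_change(umi_counts, is_sca1, active_min_umi=10, pseudocount=1e-6):
    """log2 ratio of the scA1 fraction at a gene's TSS: transcribing vs. silent cells.

    umi_counts: per-cell UMI counts for one gene.
    is_sca1:    per-cell booleans, True if the TSS bin is annotated scA1.
    pseudocount is an assumption of this sketch, not a parameter from the paper.
    """
    active = [a for u, a in zip(umi_counts, is_sca1) if u >= active_min_umi]
    silent = [a for u, a in zip(umi_counts, is_sca1) if u == 0]
    if not active or not silent:
        return None  # the gene cannot be compared without both cell groups
    frac_active = sum(active) / len(active)
    frac_silent = sum(silent) / len(silent)
    return math.log2((frac_active + pseudocount) / (frac_silent + pseudocount))

# Toy data for one gene across eight cells (hypothetical values).
umi = [12, 15, 0, 0, 3, 20, 0, 0]
sca1 = [True, True, False, True, True, False, False, False]
lfc = sca1_log2_fold_change(umi, sca1)
```

Cells with intermediate counts (here the cell with 3 UMIs) fall in neither group and are ignored, mirroring the thresholds stated above.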

We next focused on the neuronal development trajectory in this dataset (dashed outlines in Fig. 5a), tracking the progression from epiblast cells to neural tube formations, early neurons, and non-neuronal cell types such as oligodendrocytes (Fig. 5c). In Fig. 5d, we visualized the relationship between scGhost subcompartments and transcriptional activity for the marker genes of each developmental stage along the trajectory. Generally, genes robustly expressed at a specific stage are associated with more active single-cell subcompartments, although not all genes show synchronized changes in subcompartment and transcription.

Motivated by these dynamic changes in subcompartment annotation and gene expression, we focused on marker genes of the critical E8.5 to E9.5 stages. For each gene, we collected the trajectories of subcompartment and expression over developmental stages (both min-max normalized to the range 0–1) and clustered the marker genes accordingly (Fig. 5e). Using Louvain clustering, we identified five gene clusters with distinct subcompartment and expression dynamics (Fig. 5f). Notably, around 50% of genes show subcompartment switches preceding upregulation in gene expression, while 14% exhibit synchronized changes. The representative gene Akt3 from the Prime II cluster illustrates this pattern: a higher proportion of cells carry scA1 at the TSS before transcription peaks (Fig. 5g and Supplementary Fig. 16).
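The min-max normalization of the two trajectories can be sketched as follows. The per-stage values are hypothetical, and the helper that reports the first stage at which a normalized trajectory exceeds 0.5 is only an illustrative way to see which signal rises first; it is not the clustering procedure used in the paper.

```python
def min_max_normalize(values):
    """Rescale a trajectory to the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # flat trajectory: no dynamics to rescale
    return [(v - lo) / (hi - lo) for v in values]

def first_stage_above(normalized, threshold=0.5):
    """Index of the first stage whose normalized value exceeds the threshold."""
    for idx, v in enumerate(normalized):
        if v > threshold:
            return idx
    return None

# Hypothetical per-stage averages for one gene (not data from the paper).
a1_pct = [0.10, 0.15, 0.40, 0.55, 0.50]   # fraction of cells with scA1 at the TSS
expr = [0.2, 0.2, 0.3, 1.5, 2.2]          # average expression

a1_norm = min_max_normalize(a1_pct)
expr_norm = min_max_normalize(expr)
a1_rise = first_stage_above(a1_norm)
expr_rise = first_stage_above(expr_norm)
```

In this toy example the scA1 trajectory crosses the midpoint one stage before expression does, which is the kind of ordering the Prime clusters describe.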

Discussion

In this work, we developed scGhost, the first computational method for annotating single-cell 3D genome subcompartments from scHi-C data. It effectively annotates subcompartments across various cell lines and complex tissues, showing cell type-specific or even allele-specific subcompartments and linking them to gene expression. scGhost extends our understanding of the connection between 3D genome architecture and gene transcription at single-cell resolution, providing a fresh perspective on gene regulation through subcompartments.

There are a few possible improvements for scGhost. First, the method could be improved to generate embeddings directly comparable across chromosomes, reducing the need for a workaround such as our approximation of single-cell inter-chromosomal Hi-C maps. Second, integrating additional single-cell epigenomic features and functional genomic data could further improve analyses of cell type-specific 3D genome organization, crucial for charting a more comprehensive nuclear structure-function landscape in complex tissues. Third, the discrepancy between single-cell and bulk subcompartments should be further investigated. While we are confident that single-cell annotations are specific to single-cell biological properties and offer improvements over bulk subcompartments, additional work is needed to understand the mechanisms underlying cell-to-cell subcompartment variability that can help reconcile single-cell and population-average discrepancies. Specifically, the observed cell-to-cell variability could arise from various factors, such as cell type, cellular states, biological processes (e.g., cell cycle), intrinsic dynamics, and bias factors (e.g., read depths, technology differences). Like other single-cell analysis methods, scGhost aims to minimize technical bias and emphasize cell-to-cell variability in 3D genome features, thereby highlighting cellular states through lower-dimensional representations. Further work is required to disentangle this variability in a more interpretable manner. Lastly, the computational efficiency of scGhost could be improved, enabling its application to scHi-C datasets at much higher resolutions. Despite potential improvements, scGhost is an effective method for identifying single-cell subcompartments, with the potential to provide crucial insights into higher-order genome organization and function.

Methods

Constrained random walk sampling for scGhost graph embedding

Due to the noise present in imputed scHi-C contact maps, we aim to filter out contacts from scHi-C that are less informative regarding spatial interactions among genomic loci. To achieve this, we utilize random walk sampling in the contact maps with a constrained random walk space. Genomic loci pairs from the random walk sampling take the format of sparse, undirected, weighted graphs for each chromosome of each cell. These sparse graph representations are then fed into a graph embedding neural network to assign genomic loci embeddings, which we subsequently use for annotating subcompartments.

To improve the performance of the subsequent graph embedding model (see Supplementary Fig. 17 for a real data evaluation), we identify the most similar cells to each cell in the dataset (Fig. 1b). For each single cell n ∈ [1, N] where N is the total number of cells, we calculate the Euclidean distances between the scHi-C embeddings of n and all other cells. The k_n cells with the smallest distances are considered the nearest cell neighbors of n. We then calculate the Pearson correlation maps of the neighboring cells, $R_{n}^{m} = Pearson (M_{n}^{m})$ , where m ∈ [1, k_n] and $M_{n}^{m}$ is the scHi-C matrix of neighbor m of cell n. Following the strategy in Higashi [17], we chose k_n = 5. Random walk sampling is then performed on all $R_{n}^{m}$ for m ∈ [1, k_n]. The embeddings and imputed contact maps used as input here are from Higashi [17], but scGhost is designed to be compatible with other scHi-C embedding and imputation methods (see Supplementary Note for details).
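The neighbor selection step can be illustrated with a short Python sketch. It assumes the cell embeddings are available as plain lists of floats; the function name and the toy two-dimensional embeddings are hypothetical, and the sketch uses a brute-force search rather than whatever indexing the scGhost code may use.

```python
import math

def nearest_cells(embeddings, n, k=5):
    """Indices of the k cells closest to cell n in Euclidean distance."""
    dists = [
        (math.dist(embeddings[n], emb), idx)
        for idx, emb in enumerate(embeddings)
        if idx != n  # a cell is not its own neighbor
    ]
    dists.sort()
    return [idx for _, idx in dists[:k]]

# Toy 2-D embeddings for six cells (real embeddings would have more dimensions).
cells = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.2], [4.9, 5.1], [10.0, 10.0]]
neighbors_of_0 = nearest_cells(cells, 0, k=2)
neighbors_of_2 = nearest_cells(cells, 2, k=1)
```

The Pearson correlation map of each returned neighbor would then be computed and sampled alongside that of cell n.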

Given the scHi-C contact map M_n for cell n ∈ [1, N], and maps $M_{n}^{m}$ for the cells most similar to n (measured by scHi-C embeddings), we first calculate the Pearson correlation maps of the contact maps, $R_{n}, R_{n}^{1}, \dots, R_{n}^{k_{n}}$ to reduce noise in the contact map. To filter out genomic loci that are less informative of spatial interactions, we next carry out constrained random walks in cell n and its neighbors at a given locus i ∈ [1, L] where L is the number of bins on a chromosome at a given resolution (Fig. 1b,c). For each locus, we perform w random walks where each random walk consists of a first and second order walk to improve the performance of the downstream graph embedding model [35]. We conduct first-order walks to sample pairs between locus i and its nearest neighbors, loci that share the most frequent and infrequent scHi-C contacts with i. Second-order walks then sample second-order neighbors – loci that share the most frequent and infrequent contacts with the first-order neighbors of locus i. The novelty in this approach lies in selecting the first-order and second-order neighbors only from subsets of loci with the most informative scHi-C contacts with i, resulting in clearer spatial connections among chromatin regions unique to different subcompartments than those offered by scHi-C maps.

First-order random walks

In each first-order walk, we first individually sort the rows of the Pearson correlation matrix R_n of cell n in descending order and track the indices in each row that sort the array (Fig. 1c). We compile the entries in the first t percentile of each sorted row and probabilistically select one genomic locus from these higher-valued entries in each row of R_n. We then set the selected locus $x_{i, 1}^{+}$ as a first-order positive sample of locus i. Repeating the random selection for the bottom t percentile, we set the selected locus $x_{i, 1}^{-}$ as a first-order negative sample. As a result of the first-order walk, $(i, x_{i, 1}^{+})$ and $(i, x_{i, 1}^{-})$ are positively and negatively connected first-order neighbor pairs, respectively.
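A first-order walk for one locus can be sketched in Python as below. This is a minimal illustration, not the scGhost implementation: the text says loci are selected probabilistically without specifying the distribution, so uniform sampling is assumed here, and excluding locus i from its own candidates is a further assumption.

```python
import random

def first_order_samples(corr_row, i, t=0.25, rng=random):
    """Draw one positive and one negative first-order partner for locus i.

    corr_row: row i of the Pearson correlation matrix R_n.
    t:        fraction of loci kept at each extreme of the sorted row.
    Uniform sampling and the exclusion of i itself are assumptions of this sketch.
    """
    order = sorted((j for j in range(len(corr_row)) if j != i),
                   key=lambda j: corr_row[j], reverse=True)
    n_keep = max(1, int(t * len(order)))
    positive = rng.choice(order[:n_keep])    # among the most correlated loci
    negative = rng.choice(order[-n_keep:])   # among the least correlated loci
    return positive, negative

# Toy correlation row for locus 0 on a nine-bin chromosome.
row = [1.0, 0.9, 0.8, 0.1, -0.2, -0.7, -0.9, 0.3, 0.0]
pos, neg = first_order_samples(row, 0, t=0.25, rng=random.Random(0))
```

With t = 0.25 and eight candidate loci, the positive sample comes from the two highest entries and the negative sample from the two lowest.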

Second-order random walks

scGhost also links loci that may not be strongly connected in the original scHi-C contact map due to noise but share strongly connected first-order neighbors, using second-order random walks. Having defined $x_{i, 1}^{+}$ and $x_{i, 1}^{-}$ , we sort the $x_{i, 1}^{+}$ -th and $x_{i, 1}^{-}$ -th rows of R_n in descending order, compile the indices that sort the rows, and compile entries in the top and bottom t percentiles of those rows. We then define the second-order positive sample $x_{i, 2}^{+}$ by probabilistically selecting from entries in the top t percentile of the $x_{i, 1}^{+}$ -th row and repeat this step on the $x_{i, 1}^{-}$ -th row to define the second-order negative sample $x_{i, 2}^{-}$ . As a result of the second-order walk, $(i, x_{i, 2}^{+})$ and $(i, x_{i, 2}^{-})$ are positively and negatively connected second-order neighbor pairs, respectively.
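A second-order walk can be sketched in the same spirit. The sketch follows one reading of the paragraph above, in which both second-order samples are drawn from the top t percentile of the row of the corresponding first-order neighbor, so that the negative sample is a locus strongly correlated with a locus that is anti-correlated with i. Uniform sampling, the excluded indices, and all names are assumptions made for illustration.

```python
import random

def pick_from_top(row, exclude, t, rng):
    """Uniformly pick one index from the top-t fraction of a correlation row."""
    order = sorted((j for j in range(len(row)) if j not in exclude),
                   key=lambda j: row[j], reverse=True)
    n_keep = max(1, int(t * len(order)))
    return rng.choice(order[:n_keep])

def second_order_samples(R, i, x_pos, x_neg, t=0.25, rng=random):
    """Second-order partners of locus i, reached through its first-order samples."""
    pos2 = pick_from_top(R[x_pos], {i, x_pos}, t, rng)  # neighbor of the positive sample
    neg2 = pick_from_top(R[x_neg], {i, x_neg}, t, rng)  # neighbor of the negative sample
    return pos2, neg2

# Toy matrix with two groups of loci: correlated within a group, anti-correlated across.
groups = [0, 0, 0, 0, 1, 1, 1, 1]
R = [[1.0 if a == b else (0.9 if groups[a] == groups[b] else -0.9)
      for b in range(8)] for a in range(8)]

# Locus 0 with first-order positive sample 1 and negative sample 5.
pos2, neg2 = second_order_samples(R, 0, 1, 5, t=0.25, rng=random.Random(0))
```

In this toy matrix the second-order positive sample stays in the group of locus 0 and the second-order negative sample lands in the opposite group.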

Composite set of all random walks

Next, we assemble first and second-order neighbors into a format akin to sparse graphs to be inputs for a graph embedding model and subsequent subcompartment annotations (Fig. 1c). The input is represented as a set of pairs between i and $x_{i, 1}^{+}, x_{i, 1}^{-}, x_{i, 2}^{+}$ , and $x_{i, 2}^{-}$ for all loci i. We assign labels to each pair by annotating positive pairs $(i, x_{i, 1}^{+})$ and $(i, x_{i, 2}^{+})$ as 1 and negative pairs $(i, x_{i, 1}^{-})$ and $(i, x_{i, 2}^{-})$ as −1. We repeat sampling of first-order and second-order walks for the Pearson correlation matrices of the k neighbors of cell n at locus i and define $x_{i, j, o}^{p}$ for j ∈ [1, k], p ∈ {+, −} denoting positive or negative samples, and o ∈ {1, 2} denoting first or second-order neighbors of locus i. Pairs $(i, x_{i, o}^{p})$ and $(i, x_{i, j, o}^{p})$ , ∀ j ∈ [1, k], p ∈ {+, −}, o ∈ {1, 2} are appended to a set S containing all pairs and labels across all loci and random walks. After iterating through all loci, we filter S to only keep unique pair and label combinations.
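The assembly of S can be illustrated with a small Python sketch. The dictionary layout used for one random walk is hypothetical; using a set makes the final filtering to unique pair and label combinations implicit.

```python
def build_pair_set(walks):
    """Collect unique (locus, partner, label) triples from sampled walks.

    walks: iterable of dicts with keys 'i', 'pos1', 'neg1', 'pos2', 'neg2'
           (a hypothetical record layout for one random walk at locus i).
    """
    S = set()
    for w in walks:
        i = w["i"]
        S.add((i, w["pos1"], 1))    # first-order positive pair
        S.add((i, w["pos2"], 1))    # second-order positive pair
        S.add((i, w["neg1"], -1))   # first-order negative pair
        S.add((i, w["neg2"], -1))   # second-order negative pair
    return S

# Two walks at locus 0 that share most of their samples.
walks = [
    {"i": 0, "pos1": 1, "neg1": 5, "pos2": 2, "neg2": 4},
    {"i": 0, "pos1": 1, "neg1": 6, "pos2": 2, "neg2": 4},
]
S = build_pair_set(walks)
```

Walks sampled from the neighboring cells would be passed through the same function, so that repeated pairs across cells collapse into one entry.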

Random walk parameters

For all instances of scGhost, we used w = 50 because we found that more than 50 random walks yielded little to no improvement in the training of the scGhost neural network. For all experiments, we set t to equal 25% of the number of nodes in each chromosome such that for each genomic locus, scGhost only performs random walks among nodes in the top and bottom 25% of contact frequencies.

Calibrating the labels of the random walk samples

Subcompartments within the same major A/B compartment often feature similar Hi-C contact patterns but exhibit different interaction frequencies. These subtle differences are poorly captured by graph embedding if we discretely label random walk samples. Therefore, we convert discrete labels of the random walks into continuous-valued labels, allowing scGhost to better differentiate random walks specific to different subcompartments. We calibrate these labels by replacing the discrete label of each contact pair with its observed-over-expected Pearson correlation value. We normalize the labels by dividing positive-valued labels by the q-th percentile of all positive labels and dividing negative-valued labels by the (100−q)-th percentile of all negative labels. This step is crucial to mitigate cross-chromosome bias caused by Pearson correlation distributions varying across different chromosomes. Since Pearson correlation matrices often contain values of 0 and 1, normalizing by the 100-th and 0-th percentile values would suppress the mean value of our labels undesirably. In all instances of scGhost, we used q = 97.5, which neither over-suppresses nor over-saturates label values.
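The normalization step can be sketched as follows. The linear-interpolation percentile and the clipping of calibrated labels to [−1, 1] are assumptions of this sketch (clipping is one way to keep labels comparable to cosine similarities); the text specifies only the division by the q-th and (100−q)-th percentiles.

```python
def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an ascending list, q in [0, 100]."""
    if len(sorted_vals) == 1:
        return sorted_vals[0]
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def calibrate_labels(values, q=97.5):
    """Scale positive and negative correlation labels by their own percentiles."""
    pos = sorted(v for v in values if v > 0)
    neg = sorted(v for v in values if v < 0)
    pos_scale = percentile(pos, q) if pos else 1.0
    neg_scale = abs(percentile(neg, 100.0 - q)) if neg else 1.0
    out = []
    for v in values:
        if v > 0:
            out.append(min(v / pos_scale, 1.0))    # clipping is an assumption
        elif v < 0:
            out.append(max(v / neg_scale, -1.0))   # clipping is an assumption
        else:
            out.append(0.0)
    return out

# Toy correlation values for five sampled pairs.
raw = [0.2, 0.4, 0.8, -0.1, -0.5]
calibrated = calibrate_labels(raw)
```

Because positive and negative labels are scaled separately, a chromosome with a compressed correlation range still yields labels that span most of the interval.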

The scGhost graph embedding model

We developed a graph embedding model to be used in conjunction with constrained random walks and label calibration, designed to embed noisy scHi-C contact maps (Fig. 1d). Random walk-based graph embedding methods [25, 36] have been applied to bulk Hi-C data in the past [24]. While variants of random walks have been employed in scHi-C analysis methods [16, 18], scGhost uniquely utilizes this technique to learn embeddings of genomic loci, as opposed to single cells. Our graph embedding model generates embeddings for each genomic bin (typically 500kb in size) and aims to reconstruct cosine similarity scores between positively or negatively correlated genomic bins from these embeddings. The model consists of two neural network layers: one hidden layer between the input and the embedding output, and the embedding output layer.

The hidden layer in scGhost contains an n_h × N_a weight matrix $W_{h} \in ℝ^{n_{h} \times N_{a}}$ where N_a is the number of ungapped 500kb bins along a given chromosome a, and n_h is the dimensionality of the hidden layer. The output of the hidden layer pertaining to the i-th bin in the scHi-C contact map is ${\vec{z}}_{h, i} = W_{h} {\vec{x}}_{i} + b_{h}$ , where ${\vec{x}}_{i}$ is an N_a-dimensional one-hot encoded vector (the i-th entry is 1, and all other entries are 0), and b_h is the learnable bias term. ${\vec{z}}_{h, i}$ is then fed forward into the rectified linear unit activation function, which sets all negative inputs to zero.

The embedding output layer in scGhost contains an n_out × n_h matrix $W_{out} \in ℝ^{n_{out} \times n_{h}}$ and the embedding output of the i-th 500kb bin is ${\hat{z}}_{i} = W_{out} {\vec{z}}_{h, i} + b_{out}$ , where b_out is the bias term of the embedding output layer and n_out is the output dimension.

The models are fit by inputting two different bins simultaneously, embedding both regions and computing the cosine similarity between them. For the i-th and j-th bins in the scHi-C contact map, cosine similarity between i and j, denoted by ρ_ij, is calculated as follows:

$\rho_{ij} = \frac{{\hat{z}}_{i} \cdot {\hat{z}}_{j}}{\| {\hat{z}}_{i} \| \, \| {\hat{z}}_{j} \|}$

(1)

For the pair of regions i and j, the loss is l_ij = (ρ_ij − y_ij)², where y_ij is the correlation between i and j as described in the scHi-C contact maps. Intuitively, the model maximizes the cosine similarity among loci that interact frequently and minimizes similarity among loci that interact infrequently. We optimize the parameters of our models using the Adam optimizer [37].
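The forward pass and loss for one pair of bins can be written out in plain Python as a minimal sketch. Layer sizes are kept tiny for readability (the paper uses a 256-dimensional hidden layer and 128-dimensional embeddings), the Gaussian weight initialization and the 0.01 bias initialization are arbitrary choices of this sketch, and training with Adam is omitted.

```python
import math
import random

def make_layer(n_out, n_in, rng):
    """Random weight matrix and bias vector (initialization is an arbitrary choice)."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.01] * n_out
    return W, b

def embed(i, hidden, output):
    """Embedding of bin i: one-hot input, ReLU hidden layer, linear output layer."""
    W_h, b_h = hidden
    W_out, b_out = output
    # With a one-hot input, W_h x_i is simply column i of W_h.
    z_h = [max(0.0, row[i] + b) for row, b in zip(W_h, b_h)]
    return [sum(w * z for w, z in zip(row, z_h)) + b for row, b in zip(W_out, b_out)]

def cosine(u, v):
    """Cosine similarity between two embedding vectors (Eq. 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pair_loss(i, j, y_ij, hidden, output):
    """Squared error between the cosine similarity and the calibrated label."""
    rho = cosine(embed(i, hidden, output), embed(j, hidden, output))
    return (rho - y_ij) ** 2

rng = random.Random(0)
hidden = make_layer(8, 6, rng)    # n_h = 8 hidden units, N_a = 6 bins
output = make_layer(4, 8, rng)    # n_out = 4 embedding dimensions
loss_01 = pair_loss(0, 1, 1.0, hidden, output)
loss_self = pair_loss(2, 2, 1.0, hidden, output)
```

Since the cosine similarity lies in [−1, 1], the loss for labels in the same range is bounded by 4, and a bin paired with itself and labeled 1 has a loss of essentially zero.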

Constructing estimated inter-chromosomal scHi-C contacts

The neural network computes embeddings independently for each chromosome. As such, clustering on a single chromosome would return clusters that might not be comparable across chromosomes. Therefore, we developed a method to approximate inter-chromosomal contacts for each single cell, facilitating cross-chromosome comparison of subcompartment annotations.

For each cell, we compute the Pearson correlation matrix of the N_a 128-dimensional embeddings of each chromosome, where N_a is the number of genomic loci (500kb in size) in a given chromosome a. We then calculate the first principal component (PC1) of the correlation matrix of each chromosome, calibrate the PC1 values such that higher values correspond to the A compartment, defined using CpG density, and divide the PC1 vectors into quantiles. Rows in the upper and bottom v-th percentiles of each chromosome a ∈ 𝒞, where 𝒞 is the set of chromosomes, are averaged column-wise into N_a × 1-dimensional vectors.

For each pair of chromosomes, we compute the outer product of the top-percentile vectors and subtract from it the outer product of the bottom-percentile vectors:

$M_{ab} = q_{a}^{\prime} \otimes q_{b}^{\prime} - q_{a}^{\prime\prime} \otimes q_{b}^{\prime\prime}$

(2)

where M_ab is the result from the outer products between chromosomes a and b (a, b ∈ 𝒞), $q_{a}^{\prime}$ and $q_{b}^{\prime}$ are the aggregated top percentile vectors from chromosomes a and b, respectively, and $q_{a}^{\prime\prime}$ and $q_{b}^{\prime\prime}$ are the aggregated bottom percentile vectors. $q_{a}^{\prime}$ and $q_{a}^{\prime\prime}$ are N_a × 1-dimensional while $q_{b}^{\prime}$ and $q_{b}^{\prime\prime}$ are N_b × 1-dimensional. Eq. 2 therefore returns an N_a × N_b-dimensional inter-chromosomal matrix between chromosomes a and b.
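Eq. 2 can be illustrated with a few lines of Python. The sketch takes the aggregated percentile vectors as given and follows the equation as printed; the function names and the toy vectors (N_a = 3, N_b = 2) are hypothetical.

```python
def outer(u, v):
    """Outer product of two vectors as a list of rows."""
    return [[a * b for b in v] for a in u]

def estimated_inter_block(q_top_a, q_bot_a, q_top_b, q_bot_b):
    """N_a x N_b block: outer product of top vectors minus outer product of bottom vectors."""
    top = outer(q_top_a, q_top_b)
    bot = outer(q_bot_a, q_bot_b)
    return [[t - s for t, s in zip(top_row, bot_row)]
            for top_row, bot_row in zip(top, bot)]

# Toy aggregated vectors for chromosome a (three bins) and chromosome b (two bins).
q_top_a, q_bot_a = [0.8, 0.6, -0.7], [-0.8, -0.5, 0.9]
q_top_b, q_bot_b = [0.9, -0.6], [-0.7, 0.8]
M_ab = estimated_inter_block(q_top_a, q_bot_a, q_top_b, q_bot_b)
```

The resulting block has one row per bin of chromosome a and one column per bin of chromosome b, matching the N_a × N_b shape stated above.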

M_ab is then quantile-normalized to have similar value ranges across different chromosome pairs. We set a, b ∈ 𝒞 and concatenate M_ab, ∀b ∈ [1, n_chrom] horizontally:

$M_{a} = \begin{bmatrix} M_{a1} & M_{a2} & \dots & M_{a n_{chrom}} \end{bmatrix}$

(3)

Finally, we concatenate M_a, ∀a ∈{1, …,n_chrom} vertically:

$M_{inter} = \begin{bmatrix} M_{1} \\ \vdots \\ M_{n_{chrom}} \end{bmatrix}$

(4)

where M_inter is the estimated inter-chromosomal matrix.

We created the estimated inter-chromosomal contact map using odd-numbered chromosomes along the rows and even-numbered chromosomes along the columns, similar to the selections in previous annotation methods [2, 3]. Using this approach, we found that the estimated inter-chromosomal contact maps are positively correlated with population-level inter-chromosomal contact maps based on bulk Hi-C (Supplementary Fig. 18).

Clustering the scGhost embeddings

To annotate subcompartments in each cell, we begin by determining the optimal number of clusters in the genome. We applied the Bayesian Information Criterion (BIC) heuristic on the estimated inter-chromosomal scHi-C contact map (see previous subsection). However, we have noted in scGhost and population-level datasets, such as those in Rao et al. [2], that continually increasing the number of clusters almost invariably leads to a monotonically decreasing BIC. While it is possible to define a higher number of subcompartments, we found diminishing returns as BIC decreases with an increase in cluster numbers. We therefore used the Kneedle algorithm [38, 39] (Supplementary Fig. 19) to determine the knee point of the heuristic curve, which represents the point of diminishing returns. The value on the horizontal axis at this knee point is then considered the optimal number of clusters. Using Kneedle, we identified k = 5 as the optimal stopping point, beyond which further increasing the number of clusters results in much less improvement in BIC.
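The idea of a knee point on a decreasing BIC curve can be illustrated with a simplified stand-in for Kneedle. The sketch below normalizes both axes and returns the point lying farthest below the straight line joining the curve's endpoints; it is not the kneed implementation cited above, and the BIC values are hypothetical numbers chosen only to show the shape of such a curve.

```python
def knee_point(ks, scores):
    """Return the k at which a decreasing, convex score curve bends most sharply."""
    k0, k1 = ks[0], ks[-1]
    s0, s1 = scores[0], scores[-1]
    best_k, best_gap = ks[0], float("-inf")
    for k, s in zip(ks, scores):
        x = (k - k0) / (k1 - k0)    # cluster count rescaled to [0, 1]
        y = (s - s1) / (s0 - s1)    # score rescaled: 1 at the first point, 0 at the last
        gap = (1.0 - x) - y         # distance of the curve below the endpoint chord
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k

# Hypothetical BIC values for k = 2..10 clusters (not results from the paper).
ks = list(range(2, 11))
bic = [1000, 800, 650, 540, 520, 505, 495, 488, 482]
k_opt = knee_point(ks, bic)
```

For this toy curve the largest gap occurs at k = 5, after which each additional cluster lowers the score only slightly.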

Following the determination of an optimal k value, we applied Gaussian HMM clustering on the approximated inter-chromosomal scHi-C contact maps of each cell. As different subsets of the genome are present along the rows and columns of the inter-chromosomal matrices, we ran two separate clustering instances: one on the previously defined M_inter and one on its transpose, $M_{inter}^{T}$ . To ensure comparability of subcompartment annotations across individual cells, we sort the Gaussian HMM clustering results of each cell using PC1 values derived from Higashi. Clusters are sorted according to the mean PC1 values of all loci in each cluster. As a result, scA1 and scA2 correspond to higher mean PC1 values while scB1, scB2, and scB3 correspond to lower mean values.
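The relabeling of clusters by mean PC1 can be sketched as follows. The function name and the toy inputs are hypothetical; the name order follows the 0–4 (B3 to A1) convention used in Fig. 5d.

```python
def relabel_by_pc1(cluster_ids, pc1):
    """Renumber clusters so that 0 has the lowest mean PC1 and the last has the highest."""
    sums, counts = {}, {}
    for c, v in zip(cluster_ids, pc1):
        sums[c] = sums.get(c, 0.0) + v
        counts[c] = counts.get(c, 0) + 1
    ordered = sorted(sums, key=lambda c: sums[c] / counts[c])
    rank = {c: r for r, c in enumerate(ordered)}
    return [rank[c] for c in cluster_ids]

NAMES = ["scB3", "scB2", "scB1", "scA2", "scA1"]  # index 0 to 4

# Arbitrary HMM state ids and PC1 values for seven loci of one cell.
raw_ids = [3, 3, 0, 1, 4, 2, 0]
pc1 = [0.9, 0.8, -0.9, -0.4, 0.1, 0.5, -0.7]
sorted_ids = relabel_by_pc1(raw_ids, pc1)
annotations = [NAMES[c] for c in sorted_ids]
```

Because the arbitrary state ids returned by a clustering run are replaced by their PC1 rank, the same name refers to a comparable state in every cell.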

scGhost parameter selection and performance

Quantile selections for estimating single-cell inter-chromosomal contacts

We chose to use the upper and lower 20th percentiles to approximate the inter-chromosomal scHi-C contact map of each cell. We defined an optimal quantile as one that corresponds to (1) a high overlap between pseudo-bulk and bulk-level A and B compartments; and (2) the inflection point at which narrower quantiles offer diminishing returns in the overlap. As shown in Supplementary Fig. 20, we found that the upper and lower 20th percentiles best satisfy these conditions.

Parameter selection for the graph embedding model

scGhost generates 128-dimensional embeddings for each genomic locus in every cell. This choice is primarily to ensure comparability with the Higashi embeddings, which also have 128 dimensions. Moreover, the scGhost embedding model includes a hidden layer with an output of 256 dimensions. We chose this dimensionality as it balances the input dimensionality of most chromosomes and the output dimensionality.

Runtime and performance

The scGhost framework generates a very large number of sampled random walk pairs. Although each random walk is quickly generated and subsequently processed in the neural network by a GPU, the bulk of the dataset processing occurs in a single CPU thread. scGhost requires approximately 4 hours to run a dataset of 4,238 cells at a 500Kb resolution on a CUDA-enabled GPU and 16-core CPU. We found that the runtime of scGhost scales linearly with the number of cells in the dataset.

Supplementary Material

Supplement

NIHMS1992165-supplement-Supplement.pdf (11.2MB, pdf)

Acknowledgements

This work was supported in part by the National Institutes of Health Common Fund 4D Nucleome Program grant UM1HG011593 (J.M.), National Institutes of Health Common Fund Cellular Senescence Network Program grant UG3CA268202 (J.M.), National Institutes of Health grants R01HG007352 (J.M.) and R01HG012303 (J.M.). J.M. was additionally supported by a Guggenheim Fellowship from the John Simon Guggenheim Memorial Foundation, a Google Research Collabs Award, and a Single-Cell Biology Data Insights award from the Chan Zuckerberg Initiative. R.Z. was additionally supported by funding from the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard.

Footnotes

Competing Interests

The authors declare no competing interests.

Code Availability

The source code of scGhost can be accessed at: https://github.com/ma-compbio/scGHOST, which has also been deposited via Zenodo (https://doi.org/10.5281/zenodo.10141210 [45]).

Data Availability

In this work, we used several public datasets.

scHi-C data for the GM12878 cell line [12] were downloaded from the 4DN Data Portal [29, 40] (4DNES4D5MWEZ, 4DNESUE2NSGS, and 4DNESTVIP977) in fastq format and were processed into contact maps at 500Kb resolution using the recommended processing pipeline (https://github.com/VRam142/combinatorialHiC) of the data source.
The scHi-C dataset of the human prefrontal cortex [14] was downloaded from the Gene Expression Omnibus (GEO): GSE130711 in contact pairs format, which was then transformed into contact maps at 500Kb resolution. The WTC11 scHi-C dataset was downloaded from the 4DN data portal [29] (accession IDs 4DNESF829JOW and 4DNESJQ4RXY5). All scHi-C datasets were imputed with Higashi [17] (https://github.com/ma-compbio/Higashi) with default parameters.
The Dip-C developing mouse brain dataset [13] was downloaded from GEO: GSE162511.
The HiRES developing mouse embryos dataset [33] was downloaded from GEO: GSE223917.
We downloaded the following ENCODE datasets: ENCFF167NBF, ENCFF171MDW, ENCFF803DJF, ENCFF776OVW, ENCFF001GNK, ENCFF001GNN, ENCFF001GOA, ENCFF001GNX, ENCFF001GNT, ENCFF001GNR, ENCFF001GRA, ENCFF001GRD, ENCFF001GRQ, ENCFF001GRM, ENCFF001GRJ, ENCFF001GRG, ENCFF834HNV, ENCFF066MEE, ENCFF366BVS, ENCFF050ZTH, ENCFF519FHW
The imaging dataset [31] was obtained from Zenodo (https://doi.org/10.5281/zenodo.3928890).
We also downloaded the scRNA-seq of multiple cortical areas of the human brain from the Allen Brain Map [41, 42]. The marker genes for the cell types astrocyte (Astro), oligodendrocyte (ODC), oligodendrocyte progenitor cell (OPC), endothelial cell (Endo), microglia (MG), and neurons were identified using Seurat [43, 44] with the default parameters. For each cell type, the background was chosen as the rest of the cell types. When identifying marker genes for neuron subtypes, the background was chosen as the rest of the neuron cells. The genes were then ranked by the log fold-change value between a specific cell type and the background.

References

[1] Lieberman-Aiden E et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
[2] Rao SS et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–80 (2014).
[3] Xiong K & Ma J Revealing Hi-C subcompartments by imputing inter-chromosomal chromatin interactions. Nature Communications 10 (2019).
[4] Dixon JR et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376 (2012).
[5] Nora EP et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 381 (2012).
[6] Zheng H & Xie W The role of 3D genome organization in development and cell differentiation. Nature Reviews Molecular Cell Biology 20, 535–550 (2019).
[7] Marchal C, Sima J & Gilbert DM Control of DNA replication timing in the 3D genome. Nature Reviews Molecular Cell Biology 20, 721–737 (2019).
[8] Misteli T The self-organizing genome: Principles of genome architecture and function. Cell 183, 28–45 (2020).
[9] Ramani V et al. Massively multiplex single-cell Hi-C. Nature Methods 14, 263 (2017).
[10] Nagano T et al. Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature 547, 61 (2017).
[11] Tan L, Xing D, Chang C-H, Li H & Xie XS Three-dimensional genome structures of single diploid human cells. Science 361, 924–928 (2018).
[12] Kim H-J et al. Capturing cell type-specific chromatin compartment patterns by applying topic modeling to single-cell Hi-C data. PLoS Computational Biology 16, e1008173 (2020).
[13] Tan L et al. Changes in genome architecture and transcriptional dynamics progress independently of sensory experience during post-natal brain development. Cell 184, 741–758 (2021).
[14] Lee D-S et al. Simultaneous profiling of 3D genome structure and DNA methylation in single human cells. Nature Methods 16, 999–1006 (2019).
[15] Liu H et al. DNA methylation atlas of the mouse brain at single-cell resolution. Nature 598, 120–128 (2021).
[16] Zhou J et al. Robust single-cell Hi-C clustering by convolution-and random-walk–based imputation. Proceedings of the National Academy of Sciences 116, 14011–14018 (2019).
[17] Zhang R, Zhou T & Ma J Multiscale and integrative single-cell Hi-C analysis with higashi. Nature Biotechnology 40, 254–261 (2022).
[18] Zhang R, Zhou T & Ma J Ultrafast and interpretable single-cell 3D genome analysis with Fast-Higashi. Cell Systems 13, 798–807 (2022).
[19] Zhang Y et al. Computational methods for analysing multiscale 3D genome organization. Nature Reviews Genetics (2023).
[20] Zhou T, Zhang R & Ma J The 3D genome structure of single cells. Annual Review of Biomedical Data Science 4 (2021).
[21] Yu M et al. SnapHiC: a computational pipeline to identify chromatin loops from single-cell Hi-C data. Nature Methods 18, 1056–1059 (2021).
[22] Belmont AS Nuclear compartments: an incomplete primer to nuclear compartments, bodies, and genome organization relative to nuclear architecture. Cold Spring Harbor Perspectives in Biology 14, a041268 (2022).
[23] Liu Y et al. Systematic inference and comparison of multi-scale chromatin sub-compartments connects spatial organization to cell phenotypes. Nature Communications 12, 2439 (2021).
[24] Ashoor H et al. Graph embedding and unsupervised learning predict genomic sub-compartments from hic chromatin interaction data. Nature Communications 11, 1173 (2020).
[25] Grover A & Leskovec J node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 855–864 (2016).
[26] Mikolov T, Chen K, Corrado G & Dean J Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[27] Trojer P & Reinberg D Facultative heterochromatin: is there a distinctive molecular signature? Molecular Cell 28, 1–13 (2007).
[28] Zhu C et al. Joint profiling of histone modifications and transcriptome in single cells from mouse brain. Nature Methods 18, 283–292 (2021).
[29] Reiff SB et al. The 4D Nucleome data portal as a resource for searching and visualizing curated nucleomics data. Nature Communications 13, 1–11 (2022).
[30] Friedman CE et al. Single-cell transcriptomic analysis of cardiac differentiation from human PSCs reveals HOPX-dependent cardiomyocyte maturation. Cell Stem Cell 23, 586–598 (2018).
[31] Su J-H, Zheng P, Kinrot SS, Bintu B & Zhuang X Genome-scale imaging of the 3D organization and transcriptional activity of chromatin. Cell 182, 1641–1659 (2020).
[32] Perez JD et al. Quantitative and functional interrogation of parent-of-origin allelic expression biases in the brain. elife 4, e07860 (2015).
[33] Liu Z et al. Linking genome structures to functions by simultaneous single-cell Hi-C and RNA-seq. Science 380, 1070–1076 (2023).
[34] Zhou T et al. Concurrent profiling of multiscale 3D genome organization and gene expression in single mammalian cells. bioRxiv 2023–07 (2023).
[35] Tang J et al. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, 1067–1077 (2015).
[36] Perozzi B, Al-Rfou R & Skiena S Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 701–710 (2014).
[37] Kingma DP & Ba J Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[38] Satopaa V, Albrecht J, Irwin D & Raghavan B Finding a “kneedle” in a haystack: Detecting knee points in system behavior. In 2011 31st International Conference on Distributed Computing Systems Workshops, 166–171 (IEEE, 2011).
[39] Arvai K kneed (2020). URL https://github.com/arvkevi/kneed.
[40] Dekker J et al. The 4D nucleome project. Nature 549, 219–226 (2017).
[41] Hawrylycz MJ et al. An anatomically comprehensive atlas of the adult human brain transcriptome. Nature 489, 391–399 (2012).
[42] Hodge RD et al. Conserved cell types with divergent features in human versus mouse cortex. Nature 573, 61–68 (2019).
[43] Butler A, Hoffman P, Smibert P, Papalexi E & Satija R Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature Biotechnology 36, 411–420 (2018).
[44] Stuart T et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).
[45] Xiong K, Zhang R & Ma J scGHOST (2023). https://zenodo.org/doi/10.5281/zenodo.10116434.

Associated Data

Supplementary Materials

Supplement

NIHMS1992165-supplement-Supplement.pdf (11.2 MB, PDF)

Data Availability Statement

In this work, we used several public datasets.

scHi-C data for the GM12878 cell line [12] were downloaded from the 4DN Data Portal [29, 40] (4DNES4D5MWEZ, 4DNESUE2NSGS, and 4DNESTVIP977) in FASTQ format and processed into contact maps at 500 Kb resolution using the processing pipeline recommended by the data source (https://github.com/VRam142/combinatorialHiC).
The scHi-C dataset of the human prefrontal cortex [14] was downloaded from the Gene Expression Omnibus (GEO: GSE130711) in contact-pairs format and then converted into contact maps at 500 Kb resolution. The WTC11 scHi-C dataset was downloaded from the 4DN Data Portal [29] (accession IDs 4DNESF829JOW and 4DNESJQ4RXY5). All scHi-C datasets were imputed with Higashi [17] (https://github.com/ma-compbio/Higashi) using default parameters.
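The conversion from contact pairs to 500 Kb contact maps can be illustrated with a minimal sketch. This is not the pipeline used in this work: the function name `bin_contact_pairs`, the toy coordinates, and the restriction to a single chromosome with 0-based positions are all assumptions made for illustration.

```python
import math


def bin_contact_pairs(pairs, chrom_length, resolution=500_000):
    """Bin intra-chromosomal contact pairs into a symmetric contact map.

    pairs: iterable of (pos1, pos2) base-pair positions on one chromosome
           (hypothetical input format; positions assumed 0-based and
           strictly smaller than chrom_length).
    """
    n_bins = math.ceil(chrom_length / resolution)
    contact_map = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // resolution, pos2 // resolution
        contact_map[i][j] += 1
        if i != j:
            # Mirror off-diagonal contacts so the map stays symmetric.
            contact_map[j][i] += 1
    return contact_map


# Toy example: a hypothetical 2 Mb chromosome with four contacts.
toy_pairs = [
    (100_000, 600_000),
    (120_000, 700_000),
    (1_600_000, 1_900_000),
    (50_000, 1_950_000),
]
toy_map = bin_contact_pairs(toy_pairs, chrom_length=2_000_000)
```

At 500 Kb resolution the toy chromosome has four bins, and the first two pairs fall into the same bin pair, so that entry accumulates a count of two.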
The Dip-C developing mouse brain dataset [13] was downloaded from GEO: GSE162511.
The HiRES developing mouse embryos dataset [33] was downloaded from GEO: GSE223917.
We downloaded the following ENCODE datasets: ENCFF167NBF, ENCFF171MDW, ENCFF803DJF, ENCFF776OVW, ENCFF001GNK, ENCFF001GNN, ENCFF001GOA, ENCFF001GNX, ENCFF001GNT, ENCFF001GNR, ENCFF001GRA, ENCFF001GRD, ENCFF001GRQ, ENCFF001GRM, ENCFF001GRJ, ENCFF001GRG, ENCFF834HNV, ENCFF066MEE, ENCFF366BVS, ENCFF050ZTH, and ENCFF519FHW.
The imaging dataset [31] was obtained from Zenodo (https://doi.org/10.5281/zenodo.3928890).
We also downloaded scRNA-seq data from multiple cortical areas of the human brain from the Allen Brain Map [41, 42]. Marker genes for the cell types astrocyte (Astro), oligodendrocyte (ODC), oligodendrocyte progenitor cell (OPC), endothelial cell (Endo), microglia (MG), and neuron were identified using Seurat [43, 44] with default parameters. For each cell type, all remaining cell types served as the background. When identifying marker genes for neuron subtypes, the remaining neurons served as the background. The genes were then ranked by the log fold-change between the specific cell type and its background.
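The ranking step described above can be sketched as follows. This is a simplified illustration, not Seurat's implementation: the function name `rank_genes_by_log_fold_change`, the use of log2 of mean expression, the pseudocount, and the toy gene names and values are all assumptions.

```python
import math


def rank_genes_by_log_fold_change(expression, cell_types, target, pseudocount=1.0):
    """Rank genes by log2 fold-change of a target cell type versus the rest.

    expression: dict mapping gene name -> list of expression values, one per cell.
    cell_types: list of cell-type labels, aligned with the expression lists.
    The background is every cell whose label differs from `target`.
    """
    in_target = [label == target for label in cell_types]
    n_target = sum(in_target)
    n_background = len(cell_types) - n_target
    if n_target == 0 or n_background == 0:
        raise ValueError("target and background must each contain at least one cell")

    scores = {}
    for gene, values in expression.items():
        target_mean = sum(v for v, t in zip(values, in_target) if t) / n_target
        background_mean = sum(v for v, t in zip(values, in_target) if not t) / n_background
        # The pseudocount (an assumption here) avoids division by zero.
        scores[gene] = math.log2(
            (target_mean + pseudocount) / (background_mean + pseudocount)
        )
    # Highest fold-change first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data with hypothetical gene names and values.
toy_expression = {
    "GENE_A": [9.0, 7.0, 1.0, 1.0],
    "GENE_B": [2.0, 2.0, 2.0, 2.0],
    "GENE_C": [0.0, 0.0, 6.0, 8.0],
}
toy_cell_types = ["Astro", "Astro", "ODC", "MG"]
ranked = rank_genes_by_log_fold_change(toy_expression, toy_cell_types, target="Astro")
```

For neuron subtypes, the same function would be called on neuron cells only, so that the background contains the remaining neurons rather than all other cell types.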
