Abstract
Motivation
Topologically associating domains (TADs) are fundamental building blocks of the 3D genome. TAD-like domains in single cells are regarded as the underlying genesis of TADs discovered in bulk cells. Understanding the organization of TAD-like domains helps to gain deeper insight into their regulatory functions. Unfortunately, it remains a challenge to identify TAD-like domains on single-cell Hi-C data due to its ultra-sparsity.
Results
We propose scKTLD, an in silico tool for the identification of TAD-like domains on single-cell Hi-C data. It takes the Hi-C contact matrix as the adjacency matrix of a graph, embeds the graph structures into a low-dimensional space with the help of sparse matrix factorization followed by spectral propagation, and identifies the TAD-like domains using kernel-based changepoint detection in the embedding space. The results show that scKTLD is superior to the other methods on sparse contact matrices, including downsampled bulk Hi-C data as well as simulated and experimental single-cell Hi-C data. Besides, we demonstrated the conservation of TAD-like domain boundaries at the single-cell level, alongside heterogeneity within and across cell types, and found that the boundaries with higher frequency across single cells are more enriched for architectural proteins and chromatin marks and preferentially occur at TAD boundaries in bulk cells, especially at those with higher hierarchical levels.
Availability and implementation
scKTLD is freely available at https://github.com/lhqxinghun/scKTLD.
1 Introduction
Chromatin architecture is nonrandom: it is organized into hierarchical spatial structures and plays an important role in multiple cellular processes, such as cell differentiation, gene regulation, epigenetic organization, and DNA replication (Dekker and Mirny 2016, Franke et al. 2016, Dekker et al. 2017, Marchal et al. 2019). High-throughput chromatin conformation capture (Hi-C) technology investigates chromatin architecture by counting the interaction frequencies (IFs) between chromatin loci at a genome-wide scale (Lieberman-Aiden et al. 2009). With Hi-C, topologically associating domains (TADs) have been observed as structural blocks of chromatin loci that show a high degree of self-interaction (Dixon et al. 2012, Hou et al. 2012, Nora et al. 2012, Rao et al. 2014) and that serve to confine genomic activity within their boundaries and restrict activity across their boundaries (Le Dily et al. 2014, Dixon et al. 2016). The boundaries of TADs are reported to be conserved across cell types and even species, and to be demarcated by CTCF, housekeeping genes, cohesin complexes, and histone marks (Dixon et al. 2012, Bonev and Cavalli 2016). The disruption of TADs may result in gene misregulation and severe diseases, such as developmental disorders and cancers (Lupiáñez et al. 2015, Flavahan et al. 2016, Hnisz et al. 2016, Lupiáñez et al. 2016). Further studies show that TADs are organized into a hierarchy with different structural levels (Weinreb and Raphael 2016): the deeper a TAD sits within the hierarchy, the higher its level, and CTCF enrichment and gene expression are more pronounced near the boundaries of TADs with higher levels (An et al. 2019, Cresswell et al. 2020, Liu et al. 2022).
Initially, the exploration of TADs and their hierarchy was conducted mainly on Hi-C contact matrices of bulk cells, which are mixtures of thousands to millions of cells under different conditions. With the arrival of the single-cell era, single-cell Hi-C technology allows the preparation of contact matrices of individual cells and the investigation of TAD-like domains on them, providing a means to gain deeper insight into these chromatin domains at the single-cell level (Nagano et al. 2013, Nagano et al. 2017). For a short period of time, the low density of chromatin contacts in individual cells and the cell-to-cell variability led to a debate over whether TADs exist in single cells, until an imaging technology with kilobase and nanometer-scale resolution was proposed to trace chromatin organization, revealing TAD-like structures at the single-cell level (Bintu et al. 2018). Thus, an in silico method for identifying boundaries of TAD-like domains on single-cell Hi-C contact matrices seems necessary, although it remains a challenge to determine these boundaries due to the great sparsity and cell-to-cell variation of single-cell Hi-C data.
To our knowledge, several types of computational methods have been proposed to identify TADs on bulk Hi-C data, such as (i) 1D statistic-based Directional index (DI) (Dixon et al. 2012), Insulation score (IS) (Crane et al. 2015), and TopDom (Shin et al. 2016); (ii) probabilistic model-based rGMAP (Yu et al. 2017), TADtree (Weinreb and Raphael 2016), and TADbit (Serra et al. 2017); (iii) graph-based Spectral (Chen et al. 2016), 3DNetMod (Norton et al. 2018), SuperTAD (Zhang et al. 2021), GRiNCH (Lee and Roy 2021), and deDoc (Li et al. 2018); as well as (iv) clustering-based ClusterTAD (Oluwadare and Cheng 2017), SpectralTAD (Cresswell et al. 2020), and TADpole (Soler-Vila et al. 2020). Comparative analyses have shown that most of these methods can run on downsampled bulk Hi-C data (Dali and Blanchette 2017, Forcato et al. 2017, Zufferey et al. 2018, Li et al. 2020), but when these data are further downsampled to be as sparse as single-cell Hi-C data, almost all the methods become dysfunctional, demonstrating that they cannot be directly used on Hi-C data at the single-cell level (Li et al. 2020). Currently, a few dedicated methods for identification of TAD-like domains on single-cell Hi-C data have been designed, including scHiCluster (Zhou et al. 2019), Higashi (Zhang et al. 2022), and deTOKI (Li et al. 2021). In the former two, random walk and hypergraph representation learning, respectively, are used to impute the single-cell Hi-C contact matrix, and the TAD-like domains on it are called using TopDom and IS, respectively. In the last one, multiple rounds of nonnegative matrix factorization are carried out on the single-cell Hi-C contact matrix, followed by clustering to produce a consensus map, and the bins with minimal cluster rate (CR) are determined as the boundaries of TAD-like domains.
These emerging methods seek TAD-like domains on the ultra-sparse Hi-C contact matrix at the single-cell level, but they underutilize the global information far from the diagonal or pay insufficient attention to running efficiency, which matters more and more given the rapid increase in the number of single cells and the continuous improvement of resolution.
In this study, we propose scKTLD, an in silico method for identification of TAD-like domains on single-cell Hi-C data. It treats the symmetric single-cell Hi-C contact matrix as the adjacency matrix of a graph, embeds the graph into a low-dimensional space with the help of sparse matrix factorization followed by spectral propagation, and identifies the TAD-like domains in the embedding space using kernel-based changepoint detection. Beyond the existing methods, scKTLD addresses the following two issues. One is the combination of sparse matrix factorization and spectral propagation, so that the local smoothing and global community structures of the graph can be incorporated into the embeddings. The other is the detection of changepoints in the embedding space with a kernel model optimized by pruned exact linear time (PELT) (Killick et al. 2012), which allows the domain boundaries to be identified efficiently. The results show that the embeddings produced by scKTLD can reconstruct the Hi-C map with enhanced TAD-like structures, and that the TAD-like domains can be identified more accurately and efficiently in most cases, compared with seven other methods (deTOKI, deDoc, scHiCluster, TopDom, SpectralTAD, GRiNCH, and Higashi), based on downsampled bulk Hi-C data as well as simulated and experimental single-cell Hi-C data. Moreover, with the help of scKTLD, we observe that the TAD-like domain boundaries in single cells exhibit conservation alongside heterogeneity within and across cell types, that the bins identified as TAD-like domain boundaries with higher frequency may be more enriched for architectural proteins and chromatin marks, including CTCF, Rad21, Smc3, and H3K4me3, and that the TAD-like domain boundaries in single cells preferentially occur at TAD boundaries in bulk cells, especially at those with higher hierarchical levels.
2 Materials and methods
2.1 Datasets
Experimental Hi-C data at the bulk level as well as simulated and experimental Hi-C data at the single-cell level are all used in this article. For experimental bulk Hi-C data, 10 replicates of GM12878 and one replicate of K562 were derived from Rao’s bulk Hi-C experiment (Supplementary Table S1). In detail, the .hic files were downloaded from the Juicer data archive at https://bcm.app.box.com/v/aidenlab, and the contact matrices were extracted from these files via the dump command provided in juicer_tools (Durand et al. 2016). For simulated single-cell Hi-C data, two cell types were considered, with each type consisting of 100 single cells. Briefly, a 3D physical model was established for each individual cell using the strategy proposed by Li et al. (2021) with the help of Integrative Modeling Platform (Bau and Marti-Renom 2012). Then for each model, a total of four single-cell Hi-C contact matrices were generated by weighted sampling of genomic loci at random, three of which are contact matrices prepared at different thresholds of contacting distance (500, 750, and 1000), and the remaining one is a reference contact matrix with ground-truth TAD-like domains (Supplementary Table S2). For experimental single-cell Hi-C data, the datasets for 17 GM12878 cells and 18 PBMC cells from Tan’s experiment (Tan et al. 2018) were downloaded from Gene Expression Omnibus (GEO) under accession number GSE80006 (Supplementary Table S3), and the datasets for three types of mouse cells, including 22 ZygMs, 18 ZygPs, and 70 Oocytes from Flyamer’s experiment, were downloaded from GEO under accession number GSE117876, with cells having <50k contacts filtered out (Supplementary Table S3). To make a fair comparison, the Hi-C data involved were binned to the same resolution for all the competing methods.
Beyond the Hi-C data above, the ChIP-seq data for architectural proteins and histone marks, including CTCF, Rad21, Smc3, and H3K4me3, were also downloaded from ENCODE consortium (Dunham et al. 2012) to investigate the biological relevance of TAD-like domains (Supplementary Table S4). The downloaded ChIP-seq data were in bigWig format, and the program ‘bigWigAverageOverBed’ was used to segment the signals into bins of 10 kb for downstream processing and visualization.
2.2 Overview of scKTLD
Taking advantage of the natural graph-like attributes of single-cell Hi-C contact matrix, scKTLD treats TAD-like domain calling as a problem of changepoint detection in the embedding space from a weighted graph, in which each bin of contact matrix is treated as a node and the IFs between bins are the weights of corresponding edges. As shown in Fig. 1, scKTLD takes a single-cell Hi-C contact matrix as input, and outputs the identified TAD-like domains as well as a reconstructed contact matrix. It consists of three major steps. Step one, graph embedding. A symmetric single-cell Hi-C contact matrix is treated as the adjacency matrix of a weighted graph, the graph is then embedded into a low-dimensional space by combining a sparse matrix factorization with spectral propagation, while trying to preserve its graph properties, including the local smoothing and global community structures. Step two, changepoint detection. A dynamic programming-based algorithm called PELT is used to search for the optimal changepoints in the embedding space by minimizing a kernel cost function within a linear time consumption, and the changepoints are considered as the boundaries of TAD-like domains. Step three, results visualization. The identified TAD-like domains are outlined on the heatmap of input original single-cell Hi-C contact matrix or of reconstructed contact matrix in which the TAD-like domains are much clearer.
Figure 1.
Overview of scKTLD. scKTLD takes a single-cell Hi-C contact matrix as input and outputs the TAD-like domains as well as the reconstructed contact matrix. It consists of three major steps. In the first step, scKTLD regards the contact matrix as the adjacency matrix for a weighted graph and embeds it into low-dimensional vectors via sparse matrix factorization. Then the vectors are further enhanced by propagation in the spectrally modulated network, trying to preserve the local smoothing and global community structures of the graph in the embedding space. In the second step, the changepoints in the embedding space are detected by minimizing a kernel cost function using PELT algorithm within a linear time consumption, and these changepoints are determined as the boundaries of TAD-like domains. Finally, the identified TAD-like domains are outlined on the heatmap of original single-cell Hi-C contact matrix or of contact matrix reconstructed with the embeddings according to the proximities between them.
2.3 Graph embedding
2.3.1 Sparse matrix factorization
A single-cell Hi-C contact matrix can be interpreted as the adjacency matrix of a weighted graph, and the sparse matrix factorization aims to obtain the initial distributional similarity-based embeddings of the graph in an efficient way (Zhang et al. 2019). Specifically, the graph defined by a contact matrix can be represented as G = (V, E, A), where V indicates the node set, E denotes the edge set, and A is the adjacency matrix. Let n = |V| stand for the number of nodes, and D denote a diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$. Then an n-by-n proximity matrix M can be constructed with each entry calculated as $M_{ij} = r_i^{\top} c_j$, where $r_i$ and $c_j$ refer to the embedding vectors of the node $v_i$ and context node $v_j$, respectively. Besides, negative samples are taken into account to avoid trivial solutions, thus M can be defined as:
$$M_{ij} = \begin{cases} \ln p_{ij} - \ln\left(\lambda P_{D,j}\right), & (v_i, v_j) \in E \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
where $p_{ij} = A_{ij}/D_{ii}$, $P_{D,j}$ is the negative samples associated with context $v_j$, and λ is a negative sample ratio. Subsequently, the proximity matrix M is factorized as $M \approx U_d \Sigma_d V_d^{\top}$ using truncated singular value decomposition (tSVD), and the initial embeddings can be obtained via $R_d = U_d \Sigma_d^{1/2}$, in which each row corresponds to a node. In order to improve the efficiency of matrix factorization, the factorization is performed via a sparse randomized tSVD (Halko et al. 2011). Compared to the traditional tSVD, it applies random matrix theory to transform the tSVD of an n-by-n matrix M into the tSVD of a smaller d-by-n matrix $H = Q^{\top} M$, where Q is a matrix composed of d orthonormal columns, reducing the time complexity of matrix factorization from $O(n^2 d)$ to $O(n d^2 + |E|)$, where |E| is the number of edges in the whole graph. Taking the contact matrix for chromosome 1 of a GM12878 cell (cell #1) in Tan’s dataset at 50 kb resolution as an example, M is 4986 × 4986, while H is only 128 × 4986, and |E| is 62 888. Here d is much smaller than n, and |E| is also small due to the ultra-sparsity of the single-cell Hi-C contact matrix; as a result, the sparse randomized tSVD accelerates the matrix factorization considerably.
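The factorization step can be illustrated with a small dense sketch. It is not the scKTLD implementation: the function names, the default `lam` and `alpha` values, and the choice of negative-sample weight (context column sums raised to `alpha`, then normalized) are assumptions made for illustration, and a real run would use sparse matrices. The randomized tSVD follows the range-finder idea of Halko et al. (2011), working on the small matrix H = QᵀM.

```python
import numpy as np

def proximity_matrix(A, lam=1.0, alpha=0.75):
    """Log-ratio proximity defined on observed edges only; all other entries stay 0."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0                      # isolated bins: avoid division by zero
    P = A / deg[:, None]                     # p_ij = A_ij / D_ii
    neg = P.sum(axis=0) ** alpha             # assumed negative-sample weight per context bin
    neg = neg / neg.sum()
    M = np.zeros_like(P)
    i, j = np.nonzero(P)
    M[i, j] = np.log(P[i, j]) - np.log(lam * neg[j])
    return M

def randomized_tsvd(M, d, oversample=10, n_iter=2, seed=0):
    """Rank-d SVD of M obtained through the small matrix H = Q^T M."""
    rng = np.random.default_rng(seed)
    k = min(d + oversample, min(M.shape))
    Q, _ = np.linalg.qr(M @ rng.standard_normal((M.shape[1], k)))
    for _ in range(n_iter):                  # power iterations sharpen the subspace
        Q, _ = np.linalg.qr(M.T @ Q)
        Q, _ = np.linalg.qr(M @ Q)
    H = Q.T @ M                              # k-by-n, much smaller than M when k << n
    Uh, s, Vt = np.linalg.svd(H, full_matrices=False)
    return (Q @ Uh)[:, :d], s[:d], Vt[:d]

# Toy contact matrix: two 6-bin domains with uniform contacts inside each domain.
A = np.kron(np.eye(2), np.ones((6, 6)))
M = proximity_matrix(A)
U, s, _ = randomized_tsvd(M, d=2)
R = U * np.sqrt(s)                           # initial embeddings, one row per bin
```

In this toy case the bins of one domain receive identical embedding rows, which is the property the later changepoint step relies on.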
2.3.2 Spectral propagation
The initial embeddings above are enhanced by propagation in a spectrally modulated network to incorporate the global and local structures of the graph. Assume that the graph G has a normalized Laplacian matrix $L = I_n - D^{-1}A$, where $I_n$ is the identity matrix. L can be decomposed as $L = U \Lambda U^{-1}$, where $\Lambda$ is a diagonal matrix of eigenvalues, and U is a square matrix composed of the eigenvectors of L. Based on this, the propagation procedure can be described as follows:
$$R_d \leftarrow D^{-1}A\,(I_n - \tilde{L})\,R_d \tag{2}$$
where $\tilde{L} = U g(\Lambda) U^{-1}$ is a modulated Laplacian, with g indicating a spectral modulator, and $D^{-1}A\,(I_n - \tilde{L})$ is the resultant modulated network. The spectral modulator can be regarded as a band-pass filter that passes eigenvalues within a certain range and attenuates those that are too large or too small, which leads to the amplification of the global clustering and local smoothing according to the high-order Cheeger’s inequality (Bandeira et al. 2013, Lee et al. 2014). Although the procedure provides a concise form to propagate the embeddings in the spectrally modulated network, the calculation of $\tilde{L}$ is still computationally expensive due to the explicit eigen-decomposition of L for large graphs. To circumvent this, a truncated Chebyshev expansion is used, so that $\tilde{L}$ can be approximated by an iterative polynomial operation (Supplementary Methods).
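For a small graph the propagation can be written with an explicit eigen-decomposition, which makes the role of the modulator visible. The sketch below is a dense toy version under stated assumptions: the update is taken to be R ← D⁻¹A(I − L̃)R with L̃ = U g(Λ) U⁻¹, the Gaussian-shaped band-pass g and its parameters `mu` and `theta` are borrowed from the ProNE formulation of Zhang et al. (2019) rather than confirmed for scKTLD, and the Chebyshev approximation used for large graphs is not shown.

```python
import numpy as np

def modulated_laplacian(A, g):
    """Return U g(Lambda) U^{-1} for L = I - D^{-1} A (A symmetric, all degrees > 0)."""
    deg = A.sum(axis=1)
    d_sqrt = np.sqrt(deg)
    # L is similar to the symmetric Laplacian I - D^{-1/2} A D^{-1/2}, so eigh applies.
    L_sym = np.eye(len(A)) - A / np.outer(d_sqrt, d_sqrt)
    lam, W = np.linalg.eigh(L_sym)
    U = W / d_sqrt[:, None]                  # U = D^{-1/2} W
    U_inv = W.T * d_sqrt[None, :]            # U^{-1} = W^T D^{1/2}
    return (U * g(lam)[None, :]) @ U_inv

def spectral_propagate(A, R, mu=0.2, theta=0.5):
    """One propagation step through the band-pass-modulated network."""
    band_pass = lambda lam: np.exp(-0.5 * ((lam - mu) ** 2 - 1.0) * theta)
    L_mod = modulated_laplacian(A, band_pass)
    P = A / A.sum(axis=1, keepdims=True)     # D^{-1} A
    return P @ (np.eye(len(A)) - L_mod) @ R

rng = np.random.default_rng(0)
A = np.kron(np.eye(2), np.ones((5, 5))) + 0.1   # two dense blocks plus weak background
R0 = rng.standard_normal((10, 3))               # stand-in for the initial embeddings
R1 = spectral_propagate(A, R0)
```

Passing the identity function as `g` recovers the unmodulated Laplacian, which is a convenient sanity check on the decomposition.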
2.4 Changepoint detection
2.4.1 Kernel cost function
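Given a candidate segmentation of the sequential embeddings, a kernel cost function is used to estimate the sum of deviations from the piecewise mean in the reproducing kernel Hilbert space. The intuition behind the cost function is that the embeddings within the same TAD-like domain should have a similar distribution and be piecewise stationary in the transformed space. For clarity, let $y_{1..n} = \{y_1, \ldots, y_n\}$ denote the sequential embeddings, and $\tau = \{t_1, \ldots, t_m\}$ denote an ordered set of m changepoints, where $t_0$ and $t_{m+1}$ are artificially defined as 0 and n, respectively. Then the cost function can be given by:
$$\min_{\tau} \sum_{k=0}^{m} \sum_{t=t_k+1}^{t_{k+1}} \left\lVert \phi(y_t) - \bar{\mu}_{t_k..t_{k+1}} \right\rVert_{\mathcal{H}}^{2} + \beta m \tag{3}$$
where $\phi$ is a mapping function from $\mathbb{R}^d$ to the Hilbert space $\mathcal{H}$, which is implicitly defined as $\phi(y_t) = k(y_t, \cdot)$, $k(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$ indicates the commonly used Gaussian kernel function with bandwidth parameter γ, $\bar{\mu}_{t_k..t_{k+1}}$ is the empirical mean of the mapped segment $\{\phi(y_t)\}_{t=t_k+1}^{t_{k+1}}$, $\lVert \cdot \rVert_{\mathcal{H}}$ denotes the norm in $\mathcal{H}$, and β is a penalty constant to control the trade-off between segmentation complexity and goodness-of-fit. Considering that it may be inconvenient to directly calculate $\phi(y_t)$, the cost function can be simplified via the kernel trick during the calculation (Harchaoui and Cappé 2007):
$$\min_{\tau} \sum_{k=0}^{m} \left[ \sum_{t=t_k+1}^{t_{k+1}} k(y_t, y_t) - \frac{1}{t_{k+1} - t_k} \sum_{s,t=t_k+1}^{t_{k+1}} k(y_s, y_t) \right] + \beta m \tag{4}$$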
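The kernel trick can be checked numerically: the cost of one segment needs only the Gram matrix of that segment, never the mapped points themselves. The sketch below uses hypothetical helper names and a half-open index convention (rows a to b-1); it verifies the identity with a linear kernel, for which the explicit sum of squared deviations is computable, and then evaluates the same expression with a Gaussian kernel.

```python
import numpy as np

def gaussian_gram(Y, gamma=1.0):
    """Gram matrix of the Gaussian kernel exp(-gamma * ||x - y||^2)."""
    sq = np.sum(Y ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0)
    return np.exp(-gamma * d2)

def segment_cost(K, a, b):
    """Sum of squared deviations from the segment mean in feature space, rows a..b-1."""
    block = K[a:b, a:b]
    return np.trace(block) - block.sum() / (b - a)

rng = np.random.default_rng(0)
Y = rng.standard_normal((12, 4))
seg = Y[3:9]
explicit = np.sum((seg - seg.mean(axis=0)) ** 2)     # computable for the linear kernel
via_trick = segment_cost(Y @ Y.T, 3, 9)              # same quantity from the Gram matrix
rbf_cost = segment_cost(gaussian_gram(Y), 3, 9)
flat_cost = segment_cost(gaussian_gram(np.ones((5, 4))), 0, 5)   # identical points
```

A segment of identical embeddings has zero cost, which matches the intuition that bins inside one domain should look stationary in the embedding space.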
2.4.2 PELT optimization
The changepoints are determined by minimizing the cost function over all possible numbers and locations of changepoints with the help of a dynamic programming-based algorithm called PELT (Killick et al. 2012). This solution is chosen for two main reasons. One is exactness. Unlike approximate search methods, such as binary segmentation and bottom-up segmentation, PELT finds the exact global minimum of the cost function by solving the problem recursively when the penalty is linear (Truong et al. 2020). The other is efficiency. The algorithm considers the observations sequentially, and uses an explicit pruning rule to discard impossible locations from the potential changepoints in the next iteration, resulting in a significantly reduced computational cost that is linear in the number of data points (Supplementary Methods). These two advantages allow changepoints to be found accurately and efficiently.
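The recursion and the pruning rule can be sketched in a few lines. This is a minimal illustration with a Gaussian-kernel segment cost, not the code shipped with scKTLD: the function names, the `gamma` and `beta` values, and the toy embeddings are assumptions. Segment costs are read from prefix sums of the Gram matrix so that each evaluation takes constant time.

```python
import numpy as np

def make_kernel_cost(Y, gamma=1.0):
    """Return cost(a, b) for rows a..b-1, using prefix sums of the Gaussian Gram matrix."""
    sq = np.sum(Y ** 2, axis=1)
    K = np.exp(-gamma * np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0))
    S = np.zeros((len(Y) + 1, len(Y) + 1))
    S[1:, 1:] = K.cumsum(axis=0).cumsum(axis=1)       # 2D prefix sums of K
    diag = np.concatenate([[0.0], np.cumsum(np.diag(K))])
    def cost(a, b):
        block = S[b, b] - S[a, b] - S[b, a] + S[a, a]
        return diag[b] - diag[a] - block / (b - a)
    return cost

def pelt(cost, n, beta):
    """Exact penalised segmentation; returns the sorted changepoints (excluding 0 and n)."""
    F = np.full(n + 1, np.inf)       # F[t]: best penalised cost of the first t points
    F[0] = -beta
    prev = np.zeros(n + 1, dtype=int)
    candidates = [0]
    for t in range(1, n + 1):
        vals = np.array([F[s] + cost(s, t) + beta for s in candidates])
        k = int(np.argmin(vals))
        F[t], prev[t] = vals[k], candidates[k]
        # Pruning: a start s with F[s] + cost(s, t) > F[t] can never be optimal later.
        candidates = [s for s, v in zip(candidates, vals) if v - beta <= F[t]]
        candidates.append(t)
    bounds, t = [], n
    while t > 0:                     # backtrack through the stored last changepoints
        t = int(prev[t])
        if t > 0:
            bounds.append(t)
    return sorted(bounds)

rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(0.0, 0.05, (20, 2)), rng.normal(5.0, 0.05, (20, 2))])
boundaries = pelt(make_kernel_cost(Y), n=len(Y), beta=2.0)
```

On these two well-separated toy domains the only boundary recovered is the one between bins 20 and 21. Larger values of `beta` yield fewer, coarser domains.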
2.5 Reconstruction of contact matrix
To illustrate how well the local smoothing and global community structures underlying the contact matrix, especially TAD-like structures, can be preserved or even enhanced by the graph embedding procedure in scKTLD, a new artificial contact matrix is reconstructed with the embeddings produced by this procedure. In detail, a proximity matrix S is first calculated, where $S_{ij} = r_i^{\top} r_j$ is the inner product between the two embeddings for the ith and jth nodes in the graph. Then the values of the elements in this proximity matrix are scaled to 0–1 using min–max normalization, and the normalized matrix is regarded as the reconstructed contact matrix.
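The reconstruction amounts to two array operations. The sketch below assumes the embeddings are stored as an n-by-d array with one row per bin; the function name and the handling of a constant matrix are illustrative choices.

```python
import numpy as np

def reconstruct_contact_matrix(R):
    """Inner-product proximity between embeddings, min-max scaled to [0, 1]."""
    S = R @ R.T
    lo, hi = S.min(), S.max()
    if hi == lo:                     # degenerate case: all proximities are equal
        return np.zeros_like(S)
    return (S - lo) / (hi - lo)

rng = np.random.default_rng(0)
R = rng.standard_normal((8, 3))      # stand-in for the embeddings of 8 bins
recon = reconstruct_contact_matrix(R)
```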
2.6 Evaluation of TAD-like domains
Given two sets of TAD-like domains $U = \{U_1, \ldots, U_c\}$ and $V = \{V_1, \ldots, V_s\}$ on an n-by-n Hi-C contact matrix, where c and s are the total numbers of domains in U and V, respectively, let $P(i) = |U_i|/n$, $P'(j) = |V_j|/n$, $P(i,j) = |U_i \cap V_j|/n$, and $F_{ij} = |U_i \cap V_j|$, in which $|\cdot|$ indicates the set size. The similarity between the two sets of TAD-like domains can be assessed via the following two metrics.
2.6.1 Adjusted mutual information
Mutual information (MI) is usually used to estimate how much information is shared between two variables, which can be defined as $MI(U,V) = \sum_{i=1}^{c} \sum_{j=1}^{s} P(i,j) \log \frac{P(i,j)}{P(i)\,P'(j)}$. As a variation, adjusted mutual information (AMI) can be used to evaluate the similarity between two partitions of a set. AMI is given by Li et al. (2021):
$$AMI(U,V) = \frac{MI(U,V) - E\{MI(U,V)\}}{\max\{H(U), H(V)\} - E\{MI(U,V)\}} \tag{5}$$
where H denotes the Shannon entropy, and E denotes the mathematical expectation.
2.6.2 Measure of concordance
A measure of concordance (MoC) is usually used to evaluate clustering assignments. To score the similarity between two sets of TAD-like domains, the bins within the same TAD-like domain are regarded as the elements within the same cluster; thus, the MoC between two sets of TAD-like domains can be defined as (Zufferey et al. 2018):
$$MoC(U,V) = \begin{cases} 1, & c = s = 1 \\ \dfrac{1}{\sqrt{cs} - 1} \left( \sum_{i=1}^{c} \sum_{j=1}^{s} \dfrac{F_{ij}^{2}}{|U_i|\,|V_j|} - 1 \right), & \text{otherwise} \end{cases} \tag{6}$$
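As a worked illustration, MoC can be computed directly from domain intervals. The sketch follows the definition of Zufferey et al. (2018) with bins as the unit of overlap; representing each domain as a half-open `(start, end)` pair of bin indices is an assumption of this example.

```python
import math

def moc(U, V):
    """Measure of concordance between two lists of (start, end) half-open bin intervals."""
    c, s = len(U), len(V)
    if c == 1 and s == 1:
        return 1.0
    total = 0.0
    for u0, u1 in U:
        for v0, v1 in V:
            overlap = max(0, min(u1, v1) - max(u0, v0))   # shared bins of the two domains
            total += overlap ** 2 / ((u1 - u0) * (v1 - v0))
    return (total - 1.0) / (math.sqrt(c * s) - 1.0)

same = moc([(0, 5), (5, 10)], [(0, 5), (5, 10)])      # identical partitions
merged = moc([(0, 5), (5, 10)], [(0, 10)])            # one set merges both domains
shifted = moc([(0, 4), (4, 10)], [(0, 6), (6, 10)])   # boundary moved by two bins
```

Identical domain sets score 1, a single merged domain scores 0 against a two-domain set, and a shifted boundary falls in between (4/9 in this case).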
2.6.3 TAD-adjR2
In addition to the similarity between TAD-like domains, another new metric named TAD-adjR2 has been proposed by An et al. (2019). With the help of this metric, the accuracy of domain assembly can be investigated by checking how much of the variation in the IFs of the Hi-C contact matrix can be explained by the classification of TAD-like domains. For each genomic distance, TAD-adjR2 is defined as follows:
$$\text{TAD-adj}R^{2} = 1 - \frac{\sum_{i=1}^{b} (Y_i - \hat{Y}_i)^{2} / (b - p - 1)}{\sum_{i=1}^{b} (Y_i - \bar{Y})^{2} / (b - 1)} \tag{7}$$
where $Y_i$ is the IF for the ith bin, b indicates the number of bins that have the same genomic distance as this bin, and p denotes the number of called TAD-like domains whose sizes are no smaller than the distance. If the ith bin is within a TAD-like domain, $\hat{Y}_i$ is the average IF within the TAD-like domain at the given genomic distance; otherwise, $\hat{Y}_i$ indicates the average IF in the gap region at the given distance. $\bar{Y}$ denotes the overall mean IF across all the bins at the given genomic distance.
2.7 Embedding and clustering of simulated cells
To investigate how well different types of simulated cells can be distinguished with the TAD-like domains called on them, these cells are embedded into a 2D space using a dimensionality reduction algorithm named multidimensional scaling (MDS) (Kruskal and Wish 1978), where the similarity between cells is scored by MoC and AMI between their TAD-like domains. In the implementation, MDS is carried out with the MDS class in the Python module sklearn.manifold. Besides, to examine to what extent the embeddings can represent different types of cells in the embedding space, k-means clustering is performed on the embeddings of simulated cells, and the clustering results are evaluated via the adjusted Rand index (ARI).
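The embedding and clustering procedure can be sketched with scikit-learn. The similarity matrix below is made up for illustration (0.8 within a cell type, 0.3 across types), and converting similarity to dissimilarity as one minus the score is an assumption, since the text does not state how the scores are passed to MDS. The `dissimilarity="precomputed"` argument is the long-standing spelling and may be named differently in newer scikit-learn releases.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import MDS
from sklearn.metrics import adjusted_rand_score

# Hypothetical cell-by-cell similarity (e.g. MoC between called domains), 6 cells per type.
labels_true = np.repeat([0, 1], 6)
same_type = labels_true[:, None] == labels_true[None, :]
similarity = np.where(same_type, 0.8, 0.3)
np.fill_diagonal(similarity, 1.0)

dissimilarity = 1.0 - similarity                     # assumed conversion to a distance-like score
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)            # one 2D point per cell

labels_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
ari = adjusted_rand_score(labels_true, labels_pred)  # 1.0 means a perfect split by type
```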
2.8 Downsampling of bulk Hi-C data
To mimic the ultra-sparsity of single-cell Hi-C data, the replicates of bulk Hi-C contact matrices of GM12878 (Supplementary Table S1) are deeply downsampled with the help of the strategy proposed by Yardımcı et al. (2019), so that each replicate can be as sparse as single-cell Hi-C data. In the implementation, a downsampling ratio Rd is first calculated as the number of target contacts over the total number of contacts for a given replicate, and the contact matrices for all chromosomes in this replicate are downsampled with this ratio. Then the contact matrix is converted into a set of pairwise individual interactions without considering the zero-valued elements, leaving a pairwise interaction vector of length N, where N is the sum of the individual elements of the contact matrix. Afterwards, a given number of interactions (N*Rd) are randomly sampled from this vector without replacement using a uniform sampling procedure. Finally, the chosen interactions are re-binned into a new contact matrix with a fixed number of contacts. In this paper, each replicate is downsampled to contain only 0.350 million contacts, which is close to the median value of 0.339 million contacts per cell in Flyamer’s single-cell Hi-C experiment.
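The sampling procedure can be sketched for a single chromosome as follows. The function name is hypothetical, and the sketch counts each contact once by working on the upper triangle of a symmetric integer matrix, which is an assumption about how N is tallied.

```python
import numpy as np

def downsample_contacts(mat, target, seed=0):
    """Uniformly sample `target` individual contacts without replacement and re-bin them."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(mat)
    counts = mat[iu].astype(np.int64)
    if target >= counts.sum():
        return mat.copy()
    # One entry per individual interaction; zero-valued pairs contribute nothing.
    pairs = np.repeat(np.arange(counts.size), counts)
    kept = rng.choice(pairs, size=target, replace=False)
    new_counts = np.bincount(kept, minlength=counts.size)
    out = np.zeros_like(mat)
    out[iu] = new_counts
    return out + np.triu(out, 1).T                   # restore symmetry

rng = np.random.default_rng(1)
B = rng.integers(0, 5, size=(8, 8))
full = np.triu(B) + np.triu(B, 1).T                  # toy symmetric contact matrix
N = int(np.triu(full).sum())
Rd = 0.25                                            # downsampling ratio: target over total
sparse = downsample_contacts(full, target=int(N * Rd))
```

Expanding the matrix into one entry per contact mirrors the description above; for genome-scale data, a multivariate hypergeometric draw over the bin-pair counts gives the same distribution without the large intermediate vector.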
3 Results
3.1 Results on downsampled bulk Hi-C data at single-cell level
To investigate the ability of scKTLD to identify TAD-like domains on ultra-sparse single-cell Hi-C data, the contact matrices for the GM12878 cell line from Rao’s bulk Hi-C experiment (Rao et al. 2014) were deeply downsampled, so that each replicate contains only 0.350M contacts, which is close to the median value of 0.339M contacts per cell in Flyamer’s single-cell Hi-C experiment. Then, on both the full and downsampled bulk Hi-C data, scKTLD was compared with seven other TAD callers: deTOKI, deDoc, scHiCluster, TopDom, SpectralTAD, GRiNCH, and Higashi. It is worth noting that, among these methods, deTOKI, scHiCluster, and Higashi are devoted to the identification of TAD-like domains on single-cell Hi-C data, while the remaining four are designed for bulk Hi-C data and are reported to be applicable to lower sequencing depths (Cresswell et al. 2020, Lee and Roy 2021, Liu et al. 2022). This comparison was conducted in the following three ways. First, the similarity between the TADs called on bulk Hi-C data before and after downsampling. The similarity is quantified with MoC and AMI by treating the bins in a TAD as the elements in a cluster or as a subset in a partition. scKTLD reaches both the highest MoC and the highest AMI (median MoC = 0.69 and median AMI = 0.88) (Fig. 2a). The results are also supported by the visualized TADs marked on the heatmaps of contact matrices (Supplementary Fig. S1). Second, the similarity between the TADs called on downsampled bulk Hi-C data across different resolutions. In this case, scKTLD still has the highest similarity (median MoC = 0.63 and median AMI = 0.87) (Fig. 2b), followed by Higashi and deTOKI, indicating that the TADs called by these three methods show higher consistency across resolutions. Finally, the enrichment of architectural proteins and histone marks, including CTCF, Rad21, Smc3, and H3K4me3, at TAD boundaries.
The average ChIP-seq signals per bin within the 500 kb up-stream and down-stream flanking regions of each TAD boundary were calculated and normalized by subtracting the average background signals in the up-stream −500 kb to −100 kb and the down-stream 100–500 kb regions. As shown in Fig. 2c, an enrichment of CTCF can be observed for all the methods on the full bulk Hi-C data, with scKTLD, SpectralTAD, and TopDom having the higher peaks. However, once the methods are run on the downsampled bulk Hi-C data, the results change greatly for most of them (Fig. 2d). The enrichment decreases dramatically for deTOKI and SpectralTAD, and even more so for deDoc and TopDom, leaving scKTLD with the most apparent CTCF peak at the boundaries called on the downsampled bulk Hi-C data. The enrichments of Rad21, Smc3, and H3K4me3 for these methods are given in Supplementary Figs S2 and S3, and the comparison shows that scKTLD remains competitive, especially on the bulk Hi-C data downsampled to the single-cell level.
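The normalization described above can be sketched on a per-bin signal track. The example assumes 10 kb bins, so that 500 kb corresponds to 50 bins and the background region to offsets of at least 10 bins on either side; the function name and the decision to skip boundaries too close to the chromosome ends are illustrative.

```python
import numpy as np

def boundary_profile(signal, boundaries, flank=50, inner=10):
    """Mean signal around boundaries minus the mean background in the outer flanks."""
    offsets = np.arange(-flank, flank + 1)
    rows = [signal[b - flank:b + flank + 1] for b in boundaries
            if b - flank >= 0 and b + flank < len(signal)]   # skip boundaries near the ends
    if not rows:
        raise ValueError("no boundary has complete flanking regions")
    profile = np.mean(rows, axis=0)
    background = profile[np.abs(offsets) >= inner].mean()    # -500..-100 kb and 100..500 kb
    return offsets, profile - background

# Toy track: flat baseline of 1 with a peak of height 5 exactly at three boundaries.
signal = np.ones(1000)
bounds = [200, 500, 800]
signal[bounds] += 5.0
offsets, enrichment = boundary_profile(signal, bounds)
```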
Figure 2.
Results on downsampled bulk Hi-C data at the single-cell level. TADs were called by scKTLD and the other seven methods, including deTOKI, deDoc, scHiCluster, TopDom, SpectralTAD, GRiNCH, and Higashi, on both the full and deeply downsampled bulk Hi-C contact matrices for chromosome 1 of the GM12878 cell line from Rao’s bulk Hi-C experiment, and the similarity between these TADs in different cases was compared by means of MoC and AMI. (a) Similarity between the TADs called on bulk Hi-C data before and after downsampling at 50 kb resolution. (b) Similarity between the TADs called on the downsampled bulk Hi-C data across different resolutions (50 kb versus 25 kb). Beyond the two similarities, the enrichment of CTCF at TAD boundaries was also compared. (c) The normalized average CTCF signal per bin within 500 kb up-stream and down-stream flanking regions of each TAD boundary called on the full bulk Hi-C data. It should be noted that the curves for scHiCluster and TopDom are completely coincident since their TAD calling algorithm is exactly the same on the full bulk Hi-C data. (d) The normalized average CTCF signal per bin within 500 kb up-stream and down-stream flanking regions of each TAD boundary called on the downsampled bulk Hi-C data. In addition, the advantage that scKTLD derives from the graph embedding step when dealing with ultra-sparse Hi-C data is illustrated. (e) The left panel shows an intuitive comparison between the full and downsampled bulk Hi-C contact matrices (GSM1551550_HIC001, 41M–49M) at 50 kb resolution, while the right panel is a comparison between the full bulk Hi-C contact matrix and the corresponding reconstructed contact matrix. The IFs in these matrices are scaled to 0–1 using min–max normalization. (f) The scatter plot of the IFs in the full bulk Hi-C contact matrix versus those in the reconstructed contact matrix. PCC refers to the Pearson correlation coefficient.
It is believed that the superiority of our scKTLD while dealing with downsampled bulk Hi-C data is mainly derived from its graph embedding step. As shown in the left panel of Fig. 2e, the TAD structures near the diagonal of the downsampled bulk Hi-C contact matrix are much more difficult to distinguish compared with those of the full bulk Hi-C contact matrix, which may frustrate the other methods that only make use of the local features near the diagonal. Different from these methods, scKTLD seeks to find TADs in the embedding space where the structures of the whole graph can be preserved. To clarify this point further, a new contact matrix was reconstructed with the embeddings given by our graph embedding step on the downsampled bulk Hi-C data, and a comparison between the full bulk Hi-C contact matrix and the corresponding reconstructed contact matrix was made (Fig. 2e, right panel). As we can see, the TAD structures in the reconstructed contact matrix, especially near the diagonal, show a considerable similarity with those in the original bulk Hi-C contact matrix, and the IFs in the reconstructed contact matrix are also highly correlated with those in the full bulk Hi-C contact matrix (PCC = 0.73, Fig. 2f). These indicate that our graph embedding step has the ability to capture and enhance the key block structures in the ultra-sparse Hi-C data.
3.2 Results on simulated single-cell Hi-C data
The simulated single-cell Hi-C data for two cell types were prepared with the help of the strategy proposed by Li et al. (2021) (Supplementary Methods). In the preparation, the thresholds of contacting distance were separately set to three different values (500, 750, and 1000) to simulate multiple different experimental conditions. These simulated data were fed into scKTLD and the other seven methods to investigate their performance. First, the accuracy of the TAD-like domains. The called TAD-like domains were compared with ground-truth TADs using MoC and AMI to examine their accuracy. As shown in Fig. 3a, the median MoC and AMI for scKTLD are the highest at most distance thresholds. This suggests that scKTLD can identify TAD-like domains more accurately on the simulated single-cell Hi-C data in most cases. Second, the compactness of the called TAD-like domains. TADs are observed as structural blocks of chromatin loci and show a high degree of self-interaction; that is to say, the loci pairs within domains are closer in distance than those outside. In light of that, the compactness of TAD-like domains has been taken into account by the 3D physical models in the simulation. Thus, the fold change of the average spatial distance between loci in adjacent TAD-like domains to that within TAD-like domains was calculated (Fig. 3b). As expected, scKTLD always achieves the highest fold change regardless of distance thresholds, indicating that the TAD-like domains called by our method are more compact in space. Besides, the TAD-like domains given by scKTLD are marked on the heatmaps of simulated single-cell Hi-C contact matrices, and these domains are generally in line with the visualized 3D physical model for the corresponding chromosome regions in the simulation (Supplementary Fig. S4). Finally, the specificity of the called TAD-like domains across cells.
Considering that the chromatin structures across simulated cells, especially across the two different types of simulated cells, are different, we examined the specificity of the TAD-like domains called on individual cells and whether, and to what extent, the two types of cells could be distinguished with these TAD-like domains. Thus, the simulated cells were embedded by MDS, where the MoC and AMI between TAD-like domains were used to score the similarity between cells (Fig. 3c and Supplementary Fig. S5). As expected, the two types of simulated cells can be separated in the embedding space with the TAD-like domains called on them, and the separation appears clearer with those called by scKTLD. Furthermore, the embeddings were clustered using k-means (Fig. 3d), and scKTLD achieves the highest ARI. This shows that the simulated single cells can be better distinguished with the TAD-like domains called by scKTLD.
Figure 3.
Results on simulated single-cell Hi-C data. TAD-like domains were called by scKTLD and the other seven methods, including deTOKI, deDoc, scHiCluster, TopDom, SpectralTAD, GRiNCH, and Higashi, on two types of simulated single-cell Hi-C contact matrices at 50 kb resolution, and the performance of these methods was compared in terms of accuracy and compactness of the called TAD-like domains. (a) Accuracy of TAD-like domains, which is scored by the similarity between the TAD-like domains called on simulated single-cell Hi-C data and those preset on reference Hi-C data. (b) Compactness of TAD-like domains, which is scored by the fold change of the average spatial distance between loci in adjacent TAD-like domains to that within TAD-like domains. To investigate the specificity of the TAD-like domains across cells, the TAD-like domains were called on the two types of simulated single-cell Hi-C data at the distance threshold of 750, and the cells were then embedded by MDS, where the MoC and AMI between the TAD-like domains called on these cells were used to score the similarity between cells. (c) Scatter plot of simulated single cells in the embedding space. Besides, these cells were further clustered in the embedding space using k-means, and the ARI for different methods was compared. (d) Barplot of ARI for the clusters of the two types of cells.
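The compactness score in panel (b) can be sketched as follows, assuming that each bin has a 3D coordinate taken from the simulated physical model and that domains are half-open intervals of bin indices. The function name `compactness_fold_change` and the toy coordinates are hypothetical; how the authors aggregate distances in detail is not stated in this section.

```python
from itertools import combinations, product
from math import dist


def compactness_fold_change(coords, domains):
    """Mean distance between loci of adjacent domains divided by
    the mean distance between loci inside the same domain."""
    within = []
    for start, end in domains:
        within += [dist(coords[i], coords[j])
                   for i, j in combinations(range(start, end), 2)]
    between = []
    for (s1, e1), (s2, e2) in zip(domains, domains[1:]):
        between += [dist(coords[i], coords[j])
                    for i, j in product(range(s1, e1), range(s2, e2))]
    return (sum(between) / len(between)) / (sum(within) / len(within))


# Toy model: two compact clusters of three bins each, about 10 units apart.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0),
          (10, 0, 0), (11, 0, 0), (10, 1, 0)]
good_call = [(0, 3), (3, 6)]   # boundaries match the clusters
poor_call = [(0, 2), (2, 6)]   # boundary shifted by one bin

fc_good = compactness_fold_change(coords, good_call)
fc_poor = compactness_fold_change(coords, poor_call)
print(f"good call: {fc_good:.2f}, poor call: {fc_poor:.2f}")
```

A larger fold change means that loci inside a called domain sit closer together than loci in neighbouring domains, so a misplaced boundary lowers the score.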
3.3 Results on experimental single-cell Hi-C data
Beyond the downsampled bulk Hi-C data and simulated single-cell Hi-C data, TAD-like domains were also called on single-cell Hi-C data from Tan’s and Flyamer’s experiments. Unlike the simulated single-cell Hi-C data, the ground-truth TAD-like domains on experimental data are unknown, and there is no gold standard to score the accuracy of the identified TAD-like domains. Thus, we assessed the quality of TAD-like domains in the following three ways. The first is the distribution of the IFs within TAD-like domains versus that between adjacent TAD-like domains. As shown in Fig. 4a and Supplementary Fig. S6, the IFs within TAD-like domains are higher than those between adjacent TAD-like domains for all the methods except for GRiNCH on two datasets, ZygMs and ZygPs, and scKTLD has the highest value within TAD-like domains as well as the lowest value between adjacent TAD-like domains in most cases, indicating that the block structure of the TAD-like domains called by scKTLD is clearer. The second is the accuracy of domain assembly. If TAD-like domains are accurately called, one would expect a high proportion of the variation in the IFs to be explained by the classification of TAD-like domains, and a metric named TAD-adjR2 has been used to quantify this proportion (An et al. 2019). Accordingly, the TAD-adjR2 for different methods within 1 Mb genomic distance is shown (Fig. 4b and Supplementary Fig. S7). Generally, scKTLD has a higher TAD-adjR2, especially in the range from 0.25 to 0.75 Mb. The third is the relationship between TAD-like domains on single-cell Hi-C data and TADs on bulk Hi-C data. The frequencies at which the bins are identified as the boundaries of TAD-like domains across different single cells were counted and compared with the TAD boundaries called on bulk Hi-C data.
The frequency of boundaries is greater than zero at most genomic bins for single cells, which is in line with the cell-to-cell variation of TAD-like domains. Interestingly, these TAD-like domain boundaries are not completely random; the boundaries on single-cell Hi-C data tend to be located at bins that are also TAD boundaries on bulk Hi-C data (Fig. 4c). This observation is in agreement with the results obtained by an imaging technology for chromatin organization tracing with kilobase and nanometer-scale resolution (Bintu et al. 2018). Furthermore, to score the reproducibility between the boundaries called on single-cell and bulk Hi-C data, the bins at the peaks of the boundary frequency profile were regarded as the consensus boundaries for single cells, and the similarity between these consensus boundaries and the TAD boundaries called on bulk Hi-C data was scored using MoC and AMI. scKTLD and Higashi reach higher MoC and AMI than the other methods (Fig. 4d), indicating that the boundaries given by them have better reproducibility across single-cell and bulk Hi-C data.
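The boundary frequency profile and its peaks can be sketched as below. The section does not state how peaks are picked, so the local-maximum rule and the `min_freq` cutoff used here are assumptions, as are the function names and the toy boundary lists.

```python
def boundary_frequency(boundaries_per_cell, n_bins):
    """Fraction of cells in which each bin is called as a boundary."""
    counts = [0] * n_bins
    for boundaries in boundaries_per_cell:
        for b in set(boundaries):
            counts[b] += 1
    n_cells = len(boundaries_per_cell)
    return [c / n_cells for c in counts]


def consensus_peaks(freq, min_freq=0.0):
    """Bins whose frequency is a local maximum (assumed peak rule)."""
    peaks = []
    for i, f in enumerate(freq):
        left = freq[i - 1] if i > 0 else 0.0
        right = freq[i + 1] if i + 1 < len(freq) else 0.0
        if f > min_freq and f >= left and f > right:
            peaks.append(i)
    return peaks


# Toy example: four cells whose boundaries jitter around bins 2 and 6.
cells = [[2, 6], [2, 7], [3, 6], [2, 6]]
freq = boundary_frequency(cells, n_bins=10)
peaks = consensus_peaks(freq)
print(freq)
print(peaks)
```

The resulting peak bins play the role of consensus boundaries, which can then be compared with the bulk TAD boundaries using MoC or AMI.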
Figure 4.
Results on experimental single-cell Hi-C data. The TAD-like domains were called by scKTLD and the other seven methods, including deTOKI, deDoc, scHiCluster, TopDom, SpectralTAD, GRiNCH, and Higashi, on single-cell Hi-C contact matrices for chromosome 1 of 17 GM12878 single cells from Tan’s dataset at 50 kb resolution. (a) Distribution of the IFs within TAD-like domains versus that between adjacent TAD-like domains. (b) Accuracy of domain assembly. The variation in the IFs stratified by genomic distance can be explained by the classification of TAD-like domains; the proportion of this variation within 1 Mb of genomic distance is calculated (measured by TAD-adjR2) to quantify the accuracy of TAD assembly. In addition, the corresponding bulk Hi-C contact matrices for chromosome 1 of the GM12878 cell line (GSM1551550_HIC001) from Rao’s dataset at 50 kb resolution were also involved to investigate the relationship between TAD-like domains on single-cell Hi-C data and TADs on bulk Hi-C data. (c) Profile of the frequencies at which the bins are identified as the boundaries of TAD-like domains on single-cell Hi-C data (110–120 Mb); the peaks on this profile and the boundaries of TADs called on the corresponding bulk Hi-C data are marked. (d) Similarity between the bins as the peaks of TAD-like domain boundaries on single-cell Hi-C data and the TAD boundaries called on the corresponding bulk Hi-C data; the similarity is scored using MoC and AMI.
3.4 Conservation of TAD-like domain boundaries between cells
It has been reported that TAD boundaries on bulk Hi-C data are conserved across replicates, and even across cell lines (Dixon et al. 2012, Durand et al. 2016). In contrast, single-cell Hi-C data show cell-to-cell heterogeneity, and the extent to which TAD-like domains are conserved across single cells is not yet clear. Given the ability of scKTLD to handle ultra-sparse single-cell Hi-C contact matrices and identify TAD-like domains on them, we used it to examine the number of TAD-like domain boundaries shared between single cells from Tan’s experiment, including 17 GM12878 cells and 18 PBMC cells. As shown in Fig. 5a, a considerable number of TAD-like domain boundaries are shared between cells within and across cell types, and the number of shared boundaries within GM12878 and PBMC cells is greater than that between them, which is supported by the P values (1.2e−3 for GM12878 and 2.6e−4 for PBMC, both below 0.05). This shows that cells of the same type share more consensus TAD-like domain boundaries than cells of different types. To gain a deeper understanding of the conservation and heterogeneity of TAD-like domains in single cells within and across cell types, we took a specific genomic region on chromosome 1 as an example and show the heatmaps of merged single-cell Hi-C contact matrices, the corresponding profiles of insulation score (Crane et al. 2015), and the positions of TAD-like domain boundaries called by scKTLD for each single cell (Supplementary Fig. S8). The TAD-like domain boundaries are heterogeneous in single cells, regardless of cell type. Interestingly, this heterogeneity appears to vary across genomic locations.
For the regions marked with blue boxes, the boundaries seem to fluctuate around certain putative anchor points, and the fluctuations do not show obvious differences between cells of different types, indicating that these domain boundaries are conserved across cell types. In contrast, for the regions marked with orange boxes, the boundaries that appear in GM12878 cells seem to shift or disappear in PBMC cells, which is also supported by the profile of the insulation score. This suggests that these domain boundaries are conserved mainly within cell types. Overall, these observations may explain the conservation and heterogeneity of TAD-like domain boundaries in single cells. Despite the high heterogeneity of TAD-like domain boundaries among cells, the cells maintain the stability of their basic domain architectures while performing their own functions.
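The counting behind the shared-boundary comparison can be sketched as follows. This assumes that two cells share a boundary when both call exactly the same bin; whether the authors allow a positional tolerance is not stated here. The function names and the toy boundary lists are hypothetical.

```python
from itertools import combinations, product


def n_shared(boundaries_a, boundaries_b):
    """Number of boundary bins called in both cells (exact bin match)."""
    return len(set(boundaries_a) & set(boundaries_b))


def shared_boundary_counts(cells_a, cells_b):
    """Shared-boundary counts for all cell pairs within type A,
    within type B, and across the two types."""
    within_a = [n_shared(x, y) for x, y in combinations(cells_a, 2)]
    within_b = [n_shared(x, y) for x, y in combinations(cells_b, 2)]
    across = [n_shared(x, y) for x, y in product(cells_a, cells_b)]
    return within_a, within_b, across


# Toy example with three cells per type (bin indices of called boundaries).
type_a = [[10, 20, 30, 40], [10, 20, 30, 45], [10, 21, 30, 40]]
type_b = [[10, 25, 35, 50], [10, 25, 35, 55], [11, 25, 35, 50]]

within_a, within_b, across = shared_boundary_counts(type_a, type_b)
print(within_a, within_b, across)
```

The three lists correspond to the groups compared in a boxplot such as Fig. 5a. A one-sided rank-based test on them, for instance `scipy.stats.mannwhitneyu(within_a, across, alternative="greater")`, would be one way to obtain P values of the kind reported above, although the exact test configuration used by the authors is an assumption here.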
Figure 5.
Biological relevance of TAD-like domains called by scKTLD and tuning of two main hyperparameters. (a) Boxplot of the number of shared TAD-like domain boundaries within each of the two cell types and between them. The P values were calculated via a one-sided Wilcoxon test. (b) Curve of the enrichment of architectural proteins and histone marks at TAD-like domain boundaries versus the frequencies at which the bins are identified as boundaries; the enrichment is scored by a fold change of the average ChIP-seq signals at TAD-like domain boundaries to that at the bins not identified as boundaries in any cell. (c) Boxplot of the frequencies at which the bins are identified as TAD-like domain boundaries on single-cell Hi-C data; these frequencies are divided into three groups using the hierarchical levels of corresponding bins on bulk Hi-C data; control indicates the bins not identified as TAD boundaries on bulk Hi-C data; and the P values were also calculated via a one-sided Wilcoxon test. In addition, based on the full and downsampled bulk Hi-C data of the GM12878 cell line from Rao’s experiment, the tuning of two main hyperparameters was examined. (d) Boxplot of the Pearson correlation coefficient between the original full bulk Hi-C contact matrix and the corresponding artificial contact matrix reconstructed by the embeddings with different dimensions; the embeddings were obtained from the downsampled Hi-C contact matrix. (e) Curve of the average TAD-adjR2 (mean ± SD) within 1 Mb genomic distance versus the penalty constant parameter in changepoint detection. The average TAD-adjR2 is calculated on a downsampled Hi-C contact matrix. All the domains were called on contact matrices at 50 kb resolution.
3.5 Biological relevance of TAD-like domains
To investigate the biological relevance of the TAD-like domains called by scKTLD on single-cell Hi-C data, one cell was randomly chosen from Tan’s GM12878 dataset, and the average ChIP-seq signals, including CTCF, Rad21, Smc3, and H3K4me3, per bin within the 500 kb upstream and downstream flanking regions of each TAD-like domain boundary were calculated (Supplementary Fig. S9). As expected, the boundaries are enriched for these architectural proteins and histone marks, which is consistent with the results given by the imaging technology for tracing chromatin organization at the single-cell level, where the TAD-like domain boundaries are more likely to appear at CTCF and Rad21 peaks (Bintu et al. 2018, Zhou et al. 2019). Herein, the enrichment is scored by a fold change of the average ChIP-seq signals at the boundaries to that at the bins not identified as boundaries in any cell, and the relationship between the enrichment and the frequencies at which the bins are identified as boundaries across different single cells was examined (Fig. 5b). Generally, this fold change increases with the frequency; the higher the frequency, the greater the fold change. That is, the bins identified as TAD-like domain boundaries with higher frequency tend to be more enriched for these architectural proteins and chromatin marks. This observation encouraged us to further explore the biological meaning underlying the enrichment. Given reports that higher enrichment of architectural proteins and histone marks is usually accompanied by higher levels of hierarchical TADs on bulk Hi-C data (Wang et al. 2017, An et al. 2019, Cresswell et al. 2020, Liu et al. 2022), we speculated that there might be an association between TAD-like domains on single-cell Hi-C data and hierarchical TADs on bulk Hi-C data.
To check this point, the hierarchical TADs were called with a recently developed tool named TADfit (Liu et al. 2022), which is a multivariate regression model for the identification of hierarchical TADs on replicate Hi-C data. The hierarchical level of a bin was assigned using the terminology given by An et al. (2019), where the boundary level of a bin is defined as the maximum number of hierarchical TADs that use a boundary on either its left or right side. Then, for the bins identified as TAD boundaries with different hierarchical levels on bulk Hi-C data, the frequencies at which these bins are identified as TAD-like domain boundaries across different single cells were also counted. As shown in Fig. 5c, the TAD boundaries on bulk Hi-C data are more frequently identified as TAD-like domain boundaries on single-cell Hi-C data, and this trend is more obvious for the TAD boundaries with higher hierarchical levels. In other words, TAD-like domain boundaries in single cells preferentially occur at TAD boundaries in bulk cells, especially at those with higher hierarchical levels.
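The enrichment score described above can be sketched as a simple ratio of means. The sketch assumes a per-bin ChIP-seq signal track and a per-bin count of the cells in which that bin is called as a boundary, and it groups bins by their exact count; whether the authors group by exact or cumulative frequency is not stated here, so that choice, the function name, and the toy values are assumptions.

```python
def enrichment_by_frequency(signal, boundary_counts):
    """Fold change of the mean ChIP-seq signal at boundary bins, grouped by
    how many cells call the bin, over bins never called in any cell."""
    background = [s for s, c in zip(signal, boundary_counts) if c == 0]
    background_mean = sum(background) / len(background)
    fold_change = {}
    for count in sorted(set(boundary_counts) - {0}):
        values = [s for s, c in zip(signal, boundary_counts) if c == count]
        fold_change[count] = (sum(values) / len(values)) / background_mean
    return fold_change


# Toy track of 8 bins: signal rises with the number of cells calling the bin.
chip_signal = [1.0, 1.0, 2.0, 1.0, 4.0, 1.0, 1.0, 8.0]
cells_calling = [0, 0, 1, 0, 2, 0, 0, 3]

enrichment = enrichment_by_frequency(chip_signal, cells_calling)
print(enrichment)
```

Plotting the fold change against the count gives a curve of the same kind as Fig. 5b, where a rising trend indicates that frequently called boundaries carry more signal.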
3.6 Runtime and memory consumption
The efficiency of TAD-like domain identification is also important, since a single-cell Hi-C experiment often produces a large number of individual cells. Thus, the runtime and memory consumption of scKTLD were compared with those of the other seven methods on Tan’s and Flyamer’s single-cell Hi-C datasets on a computing platform running Ubuntu 18.04 LTS with an Intel(R) Xeon(R) E5-2609 @ 1.70 GHz CPU and an NVIDIA GeForce RTX 3090 GPU. scHiCluster and deTOKI were executed with 4, 8, and 12 threads since they are designed to support multi-threaded parallel computing. As shown in Supplementary Table S5, scKTLD runs faster and consumes less memory than most of the methods, with the exception of TopDom and SpectralTAD. As for the multi-threaded scHiCluster and deTOKI, as the number of threads increases, the two methods trade more memory for a shorter runtime, showing an obvious acceleration effect with multi-threading. However, even when the number of threads is increased to 12 (limited by our computing platform), they still run much slower than single-threaded scKTLD. Therefore, scKTLD is competitive in runtime and memory consumption and outperforms deTOKI, scHiCluster, and Higashi, which are designed for single-cell Hi-C data. This competitiveness may be explained by the following three points: (i) randomized tSVD (Halko et al. 2011) avoids the full singular value decomposition of the large proximity matrix in the initialization of graph embedding; (ii) Chebyshev expansion avoids the eigen-decomposition of the large Laplacian matrix during the spectral propagation; and (iii) PELT finds the optimal changepoints with a linear computational cost.
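To illustrate why a Chebyshev expansion removes the need for eigen-decomposition, the sketch below applies a spectral filter f(L) to a vector using only matrix-vector products. It is a generic textbook construction written in plain Python, not the filter or code used in scKTLD: the filter `exp(-x)`, the spectral bound `[0, 4]`, the expansion order, and all function names are assumptions chosen for a small, checkable example.

```python
from math import cos, exp, pi


def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def chebyshev_coefficients(func, lo, hi, order):
    """Chebyshev coefficients of func on [lo, hi] from Chebyshev nodes."""
    n = order + 1
    coeffs = []
    for k in range(n):
        total = 0.0
        for j in range(n):
            theta = pi * (j + 0.5) / n
            x = 0.5 * (hi - lo) * cos(theta) + 0.5 * (hi + lo)
            total += func(x) * cos(k * theta)
        coeffs.append(2.0 * total / n)
    return coeffs


def chebyshev_apply(matrix, vector, func, lo, hi, order=12):
    """Approximate func(matrix) @ vector for a symmetric matrix whose
    eigenvalues lie in [lo, hi], without any eigen-decomposition."""
    coeffs = chebyshev_coefficients(func, lo, hi, order)

    def scaled(v):
        # Apply the matrix rescaled so that its spectrum lies in [-1, 1].
        mv = matvec(matrix, v)
        return [(2.0 * a - (hi + lo) * b) / (hi - lo) for a, b in zip(mv, v)]

    t_prev = list(vector)        # T_0(M) v
    t_curr = scaled(vector)      # T_1(M) v
    result = [0.5 * coeffs[0] * a + coeffs[1] * b
              for a, b in zip(t_prev, t_curr)]
    for k in range(2, order + 1):
        # Three-term recurrence: T_k = 2 M T_{k-1} - T_{k-2}
        t_next = [2.0 * a - b for a, b in zip(scaled(t_curr), t_prev)]
        result = [r + coeffs[k] * t for r, t in zip(result, t_next)]
        t_prev, t_curr = t_curr, t_next
    return result


# Laplacian of a 3-node path graph; (1, 0, -1) is an eigenvector with
# eigenvalue 1, so exp(-L) applied to it should scale it by exp(-1).
laplacian = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
smoothed = chebyshev_apply(laplacian, [1.0, 0.0, -1.0],
                           lambda x: exp(-x), 0.0, 4.0)
print(smoothed)
```

Each additional term costs one matrix-vector product, so with a sparse Laplacian the cost grows with the number of nonzero entries rather than with the cube of the matrix size.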
4 Discussion
There are two main hyperparameters that need to be tuned in scKTLD. One is the dimension of the embeddings, and the other is the penalty constant β in changepoint detection. In this paper, the dimension is chosen according to the correlation between the original full bulk Hi-C contact matrix and the artificial contact matrix reconstructed with the embeddings obtained from the downsampled one. The Pearson correlation coefficient (PCC) increases as the dimension goes up from 8 to 128, indicating that more details in genomic structures can be learned in the graph embedding procedure. However, once the dimension exceeds 128, the PCC begins to decrease due to the involvement of more noise (Fig. 5d and Supplementary Fig. S10). Thus, a dimension of 128 was regarded as optimal. In addition, the penalty constant β controls the trade-off between segmentation complexity and goodness-of-fit in the changepoint detection procedure. When β is set to a smaller value, more TAD-like domains with small sizes are captured, and when β is set to a larger value, more small domains tend to be discarded. Herein, the value of β is optimized so that the average TAD-adjR2 is maximized across different genomic distances; thus, a value of 1.42 was chosen in this paper (Fig. 5e and Supplementary Fig. S11).
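The effect of the penalty constant β can be seen in a minimal PELT sketch. It is not the scKTLD implementation: it works on a one-dimensional toy signal rather than on embeddings, and it uses the within-segment squared error as the segment cost, which is what a kernel cost reduces to under a linear kernel (the kernel actually used by the authors is not stated in this section). The function name `pelt` and the β values are illustrative.

```python
def pelt(signal, beta):
    """Penalized changepoint detection with PELT pruning.

    Minimizes the sum of segment costs plus beta per segment, where the cost
    of a segment is its squared deviation from the segment mean.
    Returns the sorted start indices of all segments after the first.
    """
    n = len(signal)
    # Prefix sums give each segment cost in constant time.
    s1, s2 = [0.0], [0.0]
    for x in signal:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(a, b):
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = [0.0] * (n + 1)
    best[0] = -beta
    previous = [0] * (n + 1)
    candidates = [0]
    for t in range(1, n + 1):
        scores = [best[s] + cost(s, t) + beta for s in candidates]
        best[t] = min(scores)
        previous[t] = candidates[scores.index(best[t])]
        # Pruning: drop starts that can never be optimal for later t.
        candidates = [s for s, v in zip(candidates, scores)
                      if v - beta <= best[t]]
        candidates.append(t)

    changepoints = []
    t = n
    while previous[t] > 0:
        t = previous[t]
        changepoints.append(t)
    return sorted(changepoints)


# Toy signal: a small 3-bin bump inside a flat region, then a large shift.
toy_signal = [0.0] * 10 + [1.0] * 3 + [0.0] * 10 + [5.0] * 10
small_beta = pelt(toy_signal, beta=0.5)
large_beta = pelt(toy_signal, beta=2.0)
print(small_beta, large_beta)
```

With the smaller penalty the 3-bin bump is kept as its own segment, while the larger penalty absorbs it into the surrounding region, mirroring the trade-off described for small domains.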
Matrix balancing is widely used for the normalization of bulk Hi-C data, and our previous benchmark demonstrated that ICE and KR normalization improve the reproducibility between TADs called on bulk Hi-C replicates (Lyu et al. 2020). We therefore explored how these normalizations affect the identification of TAD-like domains on ultra-sparse Hi-C data by scKTLD. First, we compared the similarity of TADs identified by scKTLD on the full and downsampled bulk Hi-C data preprocessed with ICE normalization, KR normalization, or no normalization. As shown in Supplementary Fig. S12, the similarity between the TADs identified on the full and downsampled bulk Hi-C data normalized by ICE or KR decreases compared to that without normalization, especially for the AMI metric. Second, we compared the accuracy of domain assembly using the metric TAD-adjR2 on Tan’s experimental single-cell Hi-C data with the same preprocessing options. As shown in Supplementary Fig. S13, the TAD-adjR2 also shows a slight decrease on the normalized single-cell Hi-C data for both ICE and KR normalization. These results suggest that TAD-like domain identification benefits little from matrix balancing on ultra-sparse single-cell Hi-C data. This might be related to the fact that matrix balancing methods tend to delete ultra-sparse rows and columns, resulting in the deletion of a substantial portion of the contact matrix in the case of single-cell Hi-C. Therefore, they may not work well under the condition of ultra-sparsity (Li et al. 2020).
Although we have shown the ability of scKTLD to capture and recover TAD-like domains from ultra-sparse single-cell Hi-C data, it is not clear whether the reconstructed matrix can recover finer structures in single cells, such as long-range interactions. To investigate this, the computational tool HiCCUPS (Rao et al. 2014) was used to identify the interactions on Hi-C contact matrices, and the interactions between loci that are more than 1.5 Mb apart in genomic coordinates were selected as long-range interactions. The analysis was carried out in two aspects. First, we called long-range interactions on the full bulk Hi-C contact matrices for chromosome 1 of the GM12878 cell line in Rao’s dataset and compared them with those called on the reconstructed Hi-C contact matrices with different embedding dimensions. The comparison was conducted in terms of frequency and intensity, characterized by the number of long-range interactions and the fold enrichment score over the donut background, respectively. As shown in Supplementary Fig. S14, many more long-range interactions are called on the reconstructed contact matrices, and their fold enrichment scores are lower than those on the full bulk Hi-C contact matrices, regardless of the embedding dimension. This shows that the reconstructed contact matrices still have room for improvement in recovering the long-range interactions to the quality of full bulk Hi-C. Second, for the long-range interactions called on the full bulk Hi-C contact matrix, we visualized them by aggregate peak analysis (APA) on the reconstructed matrices and made an intuitive comparison with the APA on the full bulk Hi-C. As shown in Supplementary Fig. S15, a significant aggregate enrichment of interactions can be seen at the peak center of the full bulk Hi-C contact matrix, and the enrichment can be weakly observed on the reconstructed ones with higher embedding dimensions.
This indicates that our graph embedding procedure has a limited ability to capture long-range interactions. Overall, the embeddings produced by scKTLD may be more suitable for recovering and enhancing the chromatin structures at the TAD level, and methods that can reliably recover fine structures, such as long-range interactions, are still needed to further reveal the chromatin architectures and their functions in single cells.
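The two steps of this analysis, selecting long-range interactions and aggregating the contact matrix around them, can be sketched as follows. The sketch assumes that interactions are given as (row bin, column bin) pairs in the upper triangle and summarizes the aggregate with a peak-to-lower-left ratio, one of the scores commonly reported with APA. The window size, corner size, function names, and toy matrix are assumptions, and the sketch does not reproduce HiCCUPS itself.

```python
def select_long_range(peaks, resolution, min_distance=1_500_000):
    """Keep interactions whose anchors are more than min_distance apart."""
    return [(i, j) for i, j in peaks if abs(j - i) * resolution > min_distance]


def aggregate_peak_analysis(matrix, peaks, half_width=2, corner=1):
    """Sum the windows centred on each peak and return the aggregate window
    together with the ratio of its centre to its lower-left corner mean."""
    n = len(matrix)
    size = 2 * half_width + 1
    aggregate = [[0.0] * size for _ in range(size)]
    for i, j in peaks:
        # Skip peaks whose window would fall outside the matrix.
        if min(i, j) - half_width < 0 or max(i, j) + half_width >= n:
            continue
        for di in range(size):
            for dj in range(size):
                aggregate[di][dj] += matrix[i - half_width + di][j - half_width + dj]
    centre = aggregate[half_width][half_width]
    lower_left = [aggregate[r][c]
                  for r in range(size - corner, size) for c in range(corner)]
    return aggregate, centre / (sum(lower_left) / len(lower_left))


# Toy 60 x 60 matrix at 50 kb resolution with a flat background of 1.
toy_matrix = [[1.0] * 60 for _ in range(60)]
toy_matrix[5][40] = 5.0    # 35 bins apart, 1.75 Mb
toy_matrix[10][50] = 5.0   # 40 bins apart, 2.0 Mb
toy_matrix[20][25] = 9.0   # 5 bins apart, not long-range

called = [(5, 40), (10, 50), (20, 25)]
long_range = select_long_range(called, resolution=50_000)
aggregate, p2ll = aggregate_peak_analysis(toy_matrix, long_range)
print(long_range, p2ll)
```

A ratio well above 1 indicates that contacts are focally enriched at the called positions, whereas a reconstructed matrix that blurs long-range interactions would give a ratio closer to 1.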
Contributor Information
Erhu Liu, School of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an 710055, China.
Hongqiang Lyu, School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, Xi'an 710049, China.
Yuan Liu, School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, Xi'an 710049, China.
Laiyi Fu, School of Automation Science and Engineering, Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, Xi'an 710049, China.
Xiaoliang Cheng, Department of Pharmacy, The First Affiliated Hospital of Xi’an Jiaotong University, Xi'an 710061, China.
Xiaoran Yin, Department of Oncology, The Second Affiliated Hospital of Xi’an Jiaotong University, Xi'an 710004, China.
Supplementary data
Supplementary data is available at Bioinformatics Online.
Conflict of interest
None declared.
Funding
This work was financially supported by the Natural Science Foundation of Shaanxi Province under Grant 2024JC-YBMS-783 and the Fundamental Research Funds for the Central Universities under Grant xzy012022087.
Data availability
Our method, scKTLD, has been implemented in a Python package and is available at https://github.com/lhqxinghun/scKTLD.
References
- An L, Yang T, Yang J. et al. OnTAD: hierarchical domain structure reveals the divergence of activity among TADs and boundaries. Genome Biol 2019;20:282.
- Bandeira AS, Singer A, Spielman DA. A Cheeger inequality for the graph connection Laplacian. SIAM J Matrix Anal Appl 2013;34:1611–30.
- Baù D, Marti-Renom MA. Genome structure determination via 3C-based data integration by the integrative modeling platform. Methods 2012;58:300–6.
- Bintu B, Mateo LJ, Su J-H. et al. Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells. Science 2018;362:eaau1783.
- Bonev B, Cavalli G. Organization and function of the 3D genome. Nat Rev Genet 2016;17:661–78.
- Chen J, Hero AO III, Rajapakse I. Spectral identification of topological domains. Bioinformatics 2016;32:2151–8.
- Crane E, Bian Q, McCord RP. et al. Condensin-driven remodelling of X chromosome topology during dosage compensation. Nature 2015;523:240–4.
- Cresswell KG, Stansfield JC, Dozmorov MG. SpectralTAD: an R package for defining a hierarchy of topologically associated domains using spectral clustering. BMC Bioinf 2020;21:1–19.
- Dali R, Blanchette M. A critical assessment of topologically associating domain prediction tools. Nucleic Acids Res 2017;45:2994–3005.
- Dekker J, Belmont AS, Guttman M. et al.; 4D Nucleome Network. The 4D nucleome project. Nature 2017;549:219–26.
- Dekker J, Mirny L. The 3D genome as moderator of chromosomal communication. Cell 2016;164:1110–21.
- Dixon JR, Gorkin DU, Ren B. Chromatin domains: the unit of chromosome organization. Mol Cell 2016;62:668–80.
- Dixon JR, Selvaraj S, Yue F. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 2012;485:376–80.
- Durand NC, Shamim MS, Machol I. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst 2016;3:95–8.
- ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 2012;489:57–74.
- Flavahan WA, Drier Y, Liau BB. et al. Insulator dysfunction and oncogene activation in IDH mutant gliomas. Nature 2016;529:110–4.
- Forcato M, Nicoletti C, Pal K. et al. Comparison of computational methods for Hi-C data analysis. Nat Methods 2017;14:679–85.
- Franke M, Ibrahim DM, Andrey G. et al. Formation of new chromatin domains determines pathogenicity of genomic duplications. Nature 2016;538:265–9.
- Halko N, Martinsson P-G, Tropp JA. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev 2011;53:217–88.
- Harchaoui Z, Cappé O. Retrospective multiple change-point estimation with kernels. In: 2007 IEEE/SP 14th Workshop on Statistical Signal Processing, Madison, WI, USA. IEEE, 2007, 768–72.
- Hnisz D, Weintraub AS, Day DS. et al. Activation of proto-oncogenes by disruption of chromosome neighborhoods. Science 2016;351:1454–8.
- Hou C, Li L, Qin ZS. et al. Gene density, transcription, and insulators contribute to the partition of the Drosophila genome into physical domains. Mol Cell 2012;48:471–84.
- Killick R, Fearnhead P, Eckley IA. Optimal detection of changepoints with a linear computational cost. J Am Stat Assoc 2012;107:1590–8.
- Kruskal JB, Wish M. Multidimensional Scaling. Beverly Hills, CA: Sage Publications, 1978.
- Le Dily F, Baù D, Pohl A. et al. Distinct structural transitions of chromatin topological domains correlate with coordinated hormone-induced gene regulation. Genes Dev 2014;28:2151–62.
- Lee D-I, Roy S. GRiNCH: simultaneous smoothing and detection of topological units of genome organization from sparse chromatin contact count matrices with matrix factorization. Genome Biol 2021;22:164.
- Lee JR, Gharan SO, Trevisan L. Multiway spectral partitioning and higher-order Cheeger inequalities. J ACM 2014;61:1–30.
- Li A, Yin X, Xu B. et al. Decoding topologically associating domains with ultra-low resolution Hi-C data by graph structural entropy. Nat Commun 2018;9:3265.
- Li X, An Z, Zhang Z. Comparison of computational methods for 3D genome analysis at single-cell Hi-C level. Methods 2020;181–182:52–61.
- Li X, Zeng G, Li A. et al. DeTOKI identifies and characterizes the dynamics of chromatin TAD-like domains in a single cell. Genome Biol 2021;22:217–26.
- Lieberman-Aiden E. et al. Comprehensive mapping of long range interactions reveals folding principles of the human genome. Science 2009;326:289–93.
- Liu E, Lyu H, Peng Q. et al. TADfit is a multivariate linear regression model for profiling hierarchical chromatin domains on replicate Hi-C data. Commun Biol 2022;5:608–11.
- Liu K, Li H-D, Li Y. et al. A comparison of topologically associating domain callers based on Hi-C data. IEEE/ACM Trans Comput Biol Bioinform 2023;20:15–29.
- Lupiáñez DG, Kraft K, Heinrich V. et al. Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell 2015;161:1012–25.
- Lupiáñez DG, Spielmann M, Mundlos S. Breaking TADs: how alterations of chromatin domains result in disease. Trends Genet 2016;32:225–37.
- Lyu H, Liu E, Wu Z. Comparison of normalization methods for Hi-C data. BioTechniques 2020;68:56–64.
- Marchal C, Sima J, Gilbert DM. Control of DNA replication timing in the 3D genome. Nat Rev Mol Cell Biol 2019;20:721–37.
- Nagano T, Lubling Y, Stevens TJ. et al. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature 2013;502:59–64.
- Nagano T, Lubling Y, Várnai C. et al. Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature 2017;547:61–7.
- Nora EP, Lajoie BR, Schulz EG. et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 2012;485:381–5.
- Norton HK, Emerson DJ, Huang H. et al. Detecting hierarchical genome folding with network modularity. Nat Methods 2018;15:119–22.
- Oluwadare O, Cheng J. ClusterTAD: an unsupervised machine learning approach to detecting topologically associated domains of chromosomes from Hi-C data. BMC Bioinf 2017;18:480.
- Rao SSP, Huntley MH, Durand NC. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 2014;159:1665–80.
- Serra F, Baù D, Goodstadt M. et al. Automatic analysis and 3D-modelling of Hi-C data using TADbit reveals structural features of the fly chromatin colors. PLoS Comput Biol 2017;13:e1005665.
- Shin H, Shi Y, Dai C. et al. TopDom: an efficient and deterministic method for identifying topological domains in genomes. Nucleic Acids Res 2016;44:e70.
- Soler-Vila P, Cuscó P, Farabella I. et al. Hierarchical chromatin organization detected by TADpole. Nucleic Acids Res 2020;48:e39.
- Tan L, Xing D, Chang C-H. et al. Three-dimensional genome structures of single diploid human cells. Science 2018;361:924–8.
- Truong C, Oudre L, Vayatis N. Selective review of offline change point detection methods. Signal Process 2020;167:107299.
- Wang X, Cui W, Peng C. HiTAD: detecting the structural and functional hierarchies of topologically associating domains from chromatin interactions. Nucleic Acids Res 2017;45:e163.
- Weinreb C, Raphael BJ. Identification of hierarchical chromatin domains. Bioinformatics 2016;32:1601–9.
- Yardımcı GG, Ozadam H, Sauria MEG. et al. Measuring the reproducibility and quality of Hi-C data. Genome Biol 2019;20:57.
- Yu W, He B, Tan K. Identifying topologically associating domains and subdomains by Gaussian mixture model and proportion test. Nat Commun 2017;8:535.
- Zhang J, Dong Y, Wang Y. et al. ProNE: fast and scalable network representation learning. In: IJCAI, 2019, 4278–84.
- Zhang R, Zhou T, Ma J. Multiscale and integrative single-cell Hi-C analysis with Higashi. Nat Biotechnol 2022;40:254–61.
- Zhang YW, Wang MB, Li SC. SuperTAD: robust detection of hierarchical topologically associated domains with optimized structural information. Genome Biol 2021;22:1–20.
- Zhou J, Ma J, Chen Y. et al. Robust single-cell Hi-C clustering by convolution- and random-walk–based imputation. Proc Natl Acad Sci USA 2019;116:14011–8.
- Zufferey M, Tavernari D, Oricchio E. et al. Comparison of computational methods for the identification of topologically associating domains. Genome Biol 2018;19:217–8.