Significance
Chromosomes are compactly folded in nuclei, and their specific 3D structures play a role in the regulation of gene expression. While cell type specificity of gene regulation has been revealed through transcriptomic and epigenomic assays, comprehensive analysis of genome conformation patterns in different cell types is still lacking. Single-cell approaches have facilitated our understanding of cell type heterogeneity, and profiling chromosome architecture at the single-cell level has been achieved using Hi-C. However, unbiased and efficient computational methods are needed to distinguish different cell types utilizing these data. Here, we describe scHiCluster, a computational framework to study cell type-specific chromosome structural patterns. We demonstrate that scHiCluster allows clustering of single cells with high accuracy and identifies their local chromosome interaction domains.
Keywords: single cell, Hi-C, 3D chromosome structure, random walk
Abstract
Three-dimensional genome structure plays a pivotal role in gene regulation and cellular function. Single-cell analysis of genome architecture has been achieved using imaging and chromatin conformation capture methods such as Hi-C. To study variation in chromosome structure between different cell types, computational approaches are needed that can utilize sparse and heterogeneous single-cell Hi-C data. However, few methods exist that are able to accurately and efficiently cluster such data into constituent cell types. Here, we describe scHiCluster, a single-cell clustering algorithm for Hi-C contact matrices that is based on imputations using linear convolution and random walk. Using both simulated and real single-cell Hi-C data as benchmarks, scHiCluster significantly improves clustering accuracy when applied to low coverage datasets compared with existing methods. After imputation by scHiCluster, topologically associating domain (TAD)-like structures (TLSs) can be identified within single cells, and their consensus boundaries were enriched at the TAD boundaries observed in bulk cell Hi-C samples. In summary, scHiCluster facilitates visualization and comparison of single-cell 3D genomes.
In recent years, there has been a rapid increase in the development of single-cell transcriptomic and epigenomic assays (1), including single-cell/nucleus RNA sequencing (RNA-seq) (2), assay for transposase-accessible chromatin using sequencing (ATAC-seq) (3, 4), bisulfite sequencing (5), and Hi-C (6–11). Such powerful techniques allow the study of unique patterns of molecular features that distinguish each cell type. Computational methods have been developed to identify different cell types in heterogeneous cell populations based on various molecular features such as transcriptome (12, 13), methylome (14), and open chromatin (15–17). However, unbiased and efficient algorithms for single-cell clustering based on 3D chromosome structures are limited. In previous studies, cells have been organized by their contact decay profiles, which is useful for distinguishing different stages of the cell cycle (9). However, separating different cell types at the same cell cycle stage is still challenging. Principal-component analysis (PCA) performed on both intrachromosomal and interchromosomal reads was unable to completely distinguish between four cancer cell lines (7). Tan et al. (11) showed that annotated features in bulk Hi-C data could be used to separate single-cell Hi-C data into corresponding cell types. However, this approach would be limited to features identified in the few tissues or cell lines with published Hi-C data, and may be difficult to generalize to unprofiled cell types. Several methods have been developed to examine the reproducibility of bulk Hi-C data, which mainly focus on computing different types of similarity scores between contact matrices (18–21). These methods have been benchmarked by Yardimci et al. (22), and HiCRep was found to perform the best when generalized to single-cell Hi-C data. An embedding method for single-cell Hi-C data based on HiCRep has been specifically designed for capturing structural dynamics of the cell cycle state (23). However, cell cycle state is continuous in nature, and this approach has not been explicitly tested for clustering; it therefore remains unclear how well it would perform for cell type identification from single-cell Hi-C data.
Clustering of single cells based on Hi-C data faces three main challenges. 1) Intrinsic variability. 3D chromosome structures are highly dynamic in both space and time. Imaging-based technologies have suggested a large degree of heterogeneity of chromosome positioning and spatial distances between loci even within a population of the same cell type (24–27). How this fluctuation between cells of the same cell type compares to fluctuations between different cell types remains unclear. 2) Data sparsity. Single-cell Hi-C data are sparser than most other types of single-cell data. State-of-the-art single-cell DNA assays typically cover only 5–10% of the linear genome. Since Hi-C data are represented as 2D contact matrices, this level of sensitivity means that only 0.25–1% of all contacts are captured. 3) Coverage heterogeneity. Genome coverage often varies over a wide range among the cells of a single-cell Hi-C experiment. We find that this bias often acts as the leading factor driving clustering results, and it is difficult to eliminate systematically. For example, this bias could be alleviated by removing the first principal component (PC1) before clustering and visualization. However, PC1 is not guaranteed to represent only cell coverage in these experiments as it may also contain information related to other biological variables (SI Appendix, Fig. S1 A and B).
To address these challenges, we developed a computational framework, scHiCluster, to cluster single-cell Hi-C contact matrices. To overcome the sparsity problem, we performed two steps of imputation on the chromosome contact matrices to better capture the topological structures. To solve the heterogeneity problem, we selected only the top-ranked interactions after imputation, which proved sufficient to represent the underlying data structure. This framework significantly improved clustering performance on low-coverage datasets and facilitated the visualization and comparison of chromosome interactions among single cells.
Results
Overview of scHiCluster.
As shown in Fig. 1, scHiCluster consists of four major steps. In the first step, every element of the contact matrix is replaced by the weighted average of itself and its surrounding elements, in a type of linear convolution. Then a random walk (with restart) algorithm (28) is applied to smooth the signal to further capture both the local and global information of the contact maps. In particular, the convolution step only allows the information to pass among the linear genome neighbors, while the subsequent random-walk step aids information sharing among the network neighbors. To alleviate the bias introduced by uneven sequence coverage, we only keep the top 20% interactions after the imputation (SI Appendix, Fig. S1 C and D). Finally, we project the processed contact matrices onto a shared low-dimensional space, so that the topological structure of the 3D chromosome contacts can be compared between cells and used for further clustering and visualization.
scHiCluster Improves Clustering Performance on Simulated Data.
To explore the combinatorial effects of different levels of coverage and resolution, we first applied our algorithm to a set of simulated single-cell Hi-C data. We noticed that direct sampling from the Hi-C contact matrices of bulk cells leads to relatively lower sparsity and heterogeneity (SI Appendix, Fig. S2), which often yields more accurate clustering results compared with real single-cell data. The real data concentrated more on specific loci in each cell, and the individual loci were different between different cells (SI Appendix, Fig. S2A). On the contrary, the simulated cells from bulk data often had more evenly distributed contacts (SI Appendix, Fig. S2B). Therefore, we controlled the sparsity of each simulated contact matrix and added noise to the contact–distance curves to better mimic the sparsity and noise of real data (Methods). As shown in SI Appendix, Fig. S2G, when considering the first two principal components (PCs), the simulated cells generated were indistinguishable from real single cells of the same cell type.
In our simulation, we performed downsampling from bulk Hi-C experimental data from two studies. Rao et al. (29) examined seven human cell types (GM12878, IMR90, HMEC, NHEK, K562, HUVEC, and KBM7), while Bonev et al. (30) examined three mouse cell types [embryonic stem cells (ESCs), neural progenitor cells (NPCs), and cortical neurons (CNs)]. We downsampled each dataset to 500 k, 250 k, 100 k, 50 k, 25 k, 10 k, and 5 k contacts, respectively, and used 1-Mbp and 200-kbp resolution contact maps to test our algorithm. At each coverage level and resolution, we generated 30 simulated cells for each cell type. We evaluated the ability of scHiCluster compared with PCA to recover the correct cell type in an unsupervised way. The adjusted Rand index (ARI) was used to measure the accuracy of clustering. As shown in Fig. 2 and SI Appendix, Fig. S4, in both datasets, scHiCluster consistently performed better than PCA. The performance of scHiCluster began to degrade with fewer than 25 k contacts, and the method failed to remove the coverage bias at 5 k contacts (SI Appendix, Fig. S5C), which led to a complete loss of clustering ability. We also found that 1-Mbp resolution performed better than 200 kbp (SI Appendix, Fig. S6 C and D), suggesting that lower sparsity (lower resolution) may be sufficient to distinguish cell types. Thus, we used 1-Mbp resolution in all subsequent experiments.
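The ARI used above compares a predicted clustering with the known cell-type labels while correcting for chance agreement. The following is a minimal sketch of the standard pair-counting formula, written with the Python standard library only; the function name `adjusted_rand_index` is illustrative and is not part of the scHiCluster package.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Number of cell pairs that share a label in both, in the first, in the second.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    in_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    in_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = in_true * in_pred / comb(n, 2)  # agreement expected by chance
    maximum = (in_true + in_pred) / 2
    if maximum == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (both - expected) / (maximum - expected)
```

A value of 1 means the clusters match the cell types exactly (up to relabeling), and values near 0 mean the agreement is no better than chance.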
scHiCluster Has Superior Performance on Published Single-Cell Hi-C Data.
Next, we evaluated our analysis framework using authentic single-cell Hi-C datasets. Thus far, there have been three published studies focusing on single-cell chromosome structures with analyses of multiple cell types. Ramani et al. (7) used a combinatorial indexing protocol to generate single-cell Hi-C libraries from thousands of cells for four human cell lines (HeLa, HAP1, GM12878, and K562). The number of contacts captured in each cell ranged from 5.2 k to 102.7 k (median, 10.0 k). Flyamer et al. (10) performed whole-genome amplification after ligation and detected 6.6 k to 1.1 m contacts per cell (median, 97.3 k) in mouse zygotes and oocytes. Tan et al. (11) developed an optimized protocol also using whole-genome amplification and obtained data with a median coverage of 513.0 k contacts. Since the last benchmark dataset (Tan) had relatively high coverage, either simple PCA (SI Appendix, Fig. S7) or chromosome compartment score (11) easily allowed cell types to be distinguished. Due to cost considerations, it is still challenging to achieve such depth of genome coverage. Therefore, we focused on the first two datasets with lower coverage (Ramani and Flyamer) to test the utility of our computational framework.
We compared our algorithm with four baseline methods: PCA, HiCRep+MDS (23), the eigenvector method, and the decay profile method (9) (Methods). Besides the methods used in published works, we included the eigenvector method since the chromosome compartments are considered to be cell type specific based on the bulk Hi-C experiments, and the first eigenvector of the contact matrix is widely used to represent these compartment features (29, 31, 32). scHiCluster outperformed the baseline methods on both datasets in terms of better visualization (Fig. 3 A and B) and improved ARI (Fig. 3 C and D). In the mouse dataset (Flyamer), scHiCluster made a significant distinction among all three cell types (Fig. 3A); while in the human dataset (Ramani), the algorithm separated K562 and HAP1 better in the first two PC dimensions (Fig. 3B). The performance of scHiCluster is robust to the choice of parameters (SI Appendix, Fig. S8). It is also worth commenting on the scalability of each method. Since HiCRep is designed specifically for two-sample comparison rather than multiple samples, generating the similarity matrix using HiCRep involves many repetitive computations, which required 8 h (Flyamer) and 4.5 d (Ramani), whereas scHiCluster and the other methods took ∼30 s (Flyamer) and 60 s (Ramani) (SI Appendix, Fig. S9). Additionally, we carried out the same experiments on each chromosome separately and noticed that almost every chromosome showed improved separation on the mouse dataset (Fig. 3E), while only one chromosome showed significant improvement on the human dataset (Fig. 3F). These results may suggest that to separate cells using global chromosome structure differences (e.g., oocytes and zygotes), the information provided by a single chromosome might be sufficient, but to distinguish more complex cell types, a combination of different chromosomes or a more careful feature selection is necessary.
We also visualized the weights of each element in the contact matrices when computing the final PCs (whitening matrices). In general, the weights for PC1 were uniformly distributed parallel to the diagonal (SI Appendix, Fig. S10A), which suggested that PC1 captures the information of the contact–distance curve and might correspond to the variance resulting from cell cycle or other relevant biological effects (9). This is corroborated by the observation that cells with greater PC1 values tended to have a higher frequency of short-range contacts, while cells with smaller PC1 values tended to have a higher frequency of long-range contacts (SI Appendix, Fig. S10B). On the contrary, the weights for computing PC2 showed region specificity (SI Appendix, Fig. S10A), which may indicate its correlation with compartment strength. These findings also explain why the oocytes and zygotes in Flyamer et al. (10) are dominantly separated by PC1 (Fig. 3A), where the contact–distance curves differ between cell types; meanwhile, in Ramani et al. (7), PC2 achieved a better partition of the cancer cell lines (Fig. 3B), but PC1 separated a cluster of cells likely in M-phase (SI Appendix, Fig. S11 A and B). We further examined the ability of scHiCluster to capture stages of the cell cycle by embedding the Nagano et al. (9) dataset, which contains 1,992 mouse ESCs across different stages of the cell cycle. As shown in SI Appendix, Fig. S11D, the cell cycle information is generally well preserved.
Next, we wanted to evaluate the contribution of each step to the final clustering performance. For the three major steps of the pipeline, we tested all possible combinations of one or two steps of the three. More specifically, we compared our framework with PCA (with none of the steps), DS_PCA (downsampling to uniform coverage), CONV (convolution only), RW (random walk only), CONV_TOP (convolution and select top elements), RW_TOP (random walk and select top elements), and CONV_RW (convolution and random walk). Notably, for the whole scHiCluster framework including all of the three steps, we used K-means for 10 PCs to assign the cluster labels. However, to fully exploit the potential of the baseline methods, we compared all of the different combinations of clustering methods and numbers of PCs, and identified the parameters generating the most accurate results. From SI Appendix, Fig. S12, we concluded that all three steps are necessary to achieve the current visualization (SI Appendix, Fig. S12 A and B) and clustering accuracy (SI Appendix, Fig. S12 C and D). The necessity of these steps is more evident when using the mouse dataset.
scHiCluster Allows Visualization of Structural Difference in Single Cells.
The most popular method to interpret and validate identified cell clusters in single-cell experiments is to analyze known marker genes. Gene expression is directly measured in single-cell RNA-seq data, and promoter or gene-body ATAC-seq signals or cytosine methylation ratios can also be used to infer the cluster-specific genes in single-cell open chromatin and methylome data. Similarly, in single-cell Hi-C data, the differential chromosome interactions could serve as cell-type markers. With the single-cell Hi-C data, imputed contact matrices from every single cluster can be merged, where we observed square patterns along the diagonal that are visually similar to the topologically associating domains (TADs) identified in bulk Hi-C experiments. However, since the existence of TADs remains unclear in single cells, and accurate identification of the structures was limited by data sparsity, we refer to this pattern as TAD-like structures (TLSs) hereafter. Thus, differential TLSs could be applied to characterize different cell types. For instance, as demonstrated in Fig. 4 and SI Appendix, Fig. S13, a TLS at chr9:133.6M-134.2M is observed in 9 of 10 K562 cells but in only 2 of the GM12878 cells. This structural difference is concordant with the bulk Hi-C data from the same cell lines. Gene expression and H3K4me1 signals that mark active enhancers are also higher in K562 within this TLS.
Structural differences are also observed near differentially expressed genes between the two cell types, including CXCR4 and ZBTB11. CXCR4 is a chemokine receptor that enhances cell adhesion and is highly expressed in noncancer cells (GM12878) compared with cancer cells (K562) (33). With scHiCluster imputation, a TLS surrounding CXCR4 was detected in 6 of 10 GM12878 cells but only 2 of 10 K562 cells (SI Appendix, Fig. S14 and Methods). Intriguingly, an H3K4me1 peak was detected in bulk GM12878 but not K562 at the other boundary of the TLS, which may indicate a potential interaction between the gene and its enhancer. Similarly, a TLS whose boundary is located at ZBTB11 was observed in more GM12878 cells than K562 cells (SI Appendix, Fig. S15). Consistently, more H3K4me1 peaks within this TLS were also detected in the bulk GM12878 sample.
Next, we examined whether the imputation based on scHiCluster could facilitate the systematic identification of TLSs in both simulated and real single cells. We first leveraged the Bonev et al. data for bulk ESC and NPC, and downsampled them to 1 M, 500 k, 250 k, and 100 k contacts per cell. We applied scHiCluster to the contact matrices and then ran TopDom (34) to detect TLSs in every single cell. A TAD in NPC that splits into two TADs in ESC was selected to test the performance of TLS-calling (Fig. 5A). The visualization of single-cell TLSs was significantly improved after scHiCluster smoothing (Fig. 5B), and the alternative boundary was captured in more cells (Fig. 5C). Next, we applied scHiCluster to analyze single-cell Hi-C data from Nagano et al. (9). The dataset was sequenced with high coverage and enabled us to statistically analyze the dynamics of TAD locations within single cells. We identified TLSs with TopDom in contact matrices smoothed by scHiCluster at 40-kbp resolution and observed that, on average, 46% of the TLS boundaries in each single cell covered 53% of the boundaries identified in bulk cell data (SI Appendix, Fig. S16 A and B). Next, for each bin, we counted the number of cells in which the bin was determined to be a TLS boundary. We observed a nonzero probability for almost all bins to be a TLS boundary in single cells, and these probabilities peaked at the CTCF binding sites and at the TAD boundaries described in bulk Hi-C (Fig. 5D), which is in agreement with the conclusions of a recent imaging study (35). This signal was significantly enhanced after convolution and random walk (Fig. 5D and SI Appendix, Fig. S16C), which further highlighted the potential application of scHiCluster to study single-cell chromosome structure.
Our imputation method also helps visualize the signature of chromatin structures within a specific cell type. Sox2 is a classic marker gene of ESCs, and the chromosome structure around this gene is unique to ESCs (30). Specifically, Sox2 is located at the upstream boundary of a large TAD in NPCs (SI Appendix, Fig. S17B), which is split into two smaller TADs in ESCs (SI Appendix, Fig. S17A). Stevens et al. (8) carried out Hi-C analysis of eight single haploid mouse ESCs. A median of 49.4 k long-range intrachromosomal contacts was detected (21.0 k to 78.0 k). Although this study provided superior coverage among current single-cell Hi-C experiments, the limited number of cells examined made it difficult to observe the interaction pattern surrounding Sox2 even when contact matrices from all cells are merged (SI Appendix, Fig. S17C). However, after imputation using the scHiCluster framework, the TLS boundaries downstream of Sox2 are observed in four of the eight cells (SI Appendix, Fig. S17E). Merging the imputed matrices reveals the known domain splitting pattern near Sox2 (SI Appendix, Fig. S17D). A similar interaction pattern is also observed near another ESC marker, Zfp42 (SI Appendix, Fig. S18).
Discussion
To advance our understanding of the role of genome structure in cell type-specific gene regulation, new computational tools are needed for exploration of single-cell Hi-C data. We describe a computational approach for cell type clustering, scHiCluster, that requires only sparse single-cell Hi-C contact data. In the scHiCluster framework, the chromosome interactions are considered as a network. The contact information is first averaged in the linear genome. A random walk is then used to propagate the smoothed interaction throughout the graph and further reduce the sparsity of the single-cell contact matrices. scHiCluster performed significantly better than existing methods in clustering single-cell data into constituent cell types and facilitated identification of local chromosome interaction domains.
A major challenge in clustering single-cell Hi-C data is the sparsity of the contact matrices. Our results demonstrate that scHiCluster is robust to sparse contact matrices when there are at least 5 k contacts detected per cell (Methods). scHiCluster takes advantage of both a linear smoothing and a random-walk step to handle these sparse data. Similar methods have been utilized for smoothing bulk Hi-C data, including HiCRep, which took the average of genome neighbors before computing the correlation of two Hi-C matrices (18), and GenomeDISCO, which provided a network representation of Hi-C matrices and used random walk to smooth it (19). Liu et al. (23) systematically evaluated these methods for single-cell Hi-C data embedding. However, since they used a cell similarity matrix that is embedded by multidimensional scaling (MDS), the data are generally continuous under their low-dimensional representation and are unable to present explicit clusters for each cell type. Our scHiCluster framework combines the advantage of both HiCRep and GenomeDISCO and provides a flexible pipeline to resolve the clustering of Hi-C data, where some components (e.g., embedding) can be further tuned and improved when the algorithm is applied to more specific and challenging situations such as tissues with greater cell type complexity.
Published single-cell Hi-C datasets have employed cell lines that contain relatively large 3D genomic structural differences, simplifying the cell clustering problem. In practice, heterogeneous tissues with more closely related cell types, such as brain tissue, might pose a much greater challenge than cell lines. For cell clustering using complex tissues, further improvements in the clustering algorithm and feature selection are necessary. For instance, hierarchical clustering could be applied to identify the coarse cell types using megabase-scale resolution, followed by dividing cell types into finer scale (subtypes) using matrices of a smaller bin size. An alternative approach would be to simultaneously profile 3D genome architecture along with other “omic” information in the same cell, such as jointly profiling chromatin conformation and DNA methylation (36, 37). While such single-cell multiomic data modalities may provide the information content necessary to deconvolute cell types while preserving 3D structural information (38), they can also be more costly to perform, and more technically challenging to carry out.
We noted that the smoothing and random walk steps aid in visualization of chromosome contact maps in single cells. Such visualization can facilitate analysis of the variability in features of 3D genome organization between cells. Previous studies using bulk cell lines have reported the existence of several 3D structural features: megabase-level A/B compartments, submegabase-level TADs, and kilobase-level loops (29, 31, 39, 40). In our study, visualizing the smoothed scHiCluster results revealed the existence of TLSs in specific cells. The boundaries of these structures were variable between cells. However, the boundaries shared between TLSs in individual cells corresponded to TAD boundaries identified in bulk Hi-C studies. These results support recent imaging studies (35), which suggested that TLSs exist in single cells and that their boundaries in individual cells are variable but nonrandom.
Methods
Data Processing.
For Ramani et al. (7), interaction pairs and cell quality files of combinatorial single-cell Hi-C libraries ML1 and ML3 were downloaded from GSE84920. Interaction pairs for Flyamer et al. (10), Stevens et al. (8), and Tan et al. (11) were downloaded from GSE80006, GSE80280, and GSE117876, respectively. Interaction pairs for diploid ESC cultured with 2i in Nagano et al. (9) were accessed from https://bitbucket.org/tanaylab/schic2/src/default/. Given a chromosome of length $L$ and a resolution $r$, the chromosome is divided into $n = \lceil L/r \rceil$ nonoverlapping bins. Hi-C data are represented as an $n \times n$ contact matrix $A$, where $A_{ij}$ denotes the number of read-pairs supporting the interaction between the $i$th and $j$th bins of the genome. For each dataset, contact matrices were generated at 40-kbp and 1-Mbp resolutions for each chromosome and each cell. Total contacts of the cell were counted as the nondiagonal interaction pairs in intrachromosomal matrices. As quality control, we ruled out the cells with fewer than 5 k contacts. Also, for a single chromosome whose length is $x$ Mb, we required the number of contacts to be greater than $x$, to avoid chromosomes with too few contacts. We only kept cells where all chromosomes satisfied this criterion. The number of cells remaining after each quality control step for each cell type is shown in SI Appendix, Table S1. Generally, we suggest applying scHiCluster only to cells that pass these quality controls.
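As an illustration of the binning described above, the sketch below turns the two genomic positions of each intrachromosomal read pair into a symmetric contact matrix and counts the nondiagonal contacts. It assumes NumPy is available and that positions are 0-based base-pair coordinates smaller than the chromosome length; the function names `contact_matrix` and `nondiagonal_contacts` are hypothetical and are not taken from the scHiCluster package.

```python
import numpy as np


def contact_matrix(pos1, pos2, chrom_length, resolution):
    """Bin intrachromosomal read pairs into a symmetric contact matrix."""
    n_bins = -(-chrom_length // resolution)  # ceiling division
    bin1 = np.asarray(pos1) // resolution
    bin2 = np.asarray(pos2) // resolution
    mat = np.zeros((n_bins, n_bins), dtype=np.int64)
    # np.add.at accumulates correctly when the same bin pair occurs repeatedly.
    np.add.at(mat, (bin1, bin2), 1)
    # Mirror off-diagonal contacts so the matrix is symmetric;
    # contacts inside a single bin are counted once on the diagonal.
    off_diagonal = bin1 != bin2
    np.add.at(mat, (bin2[off_diagonal], bin1[off_diagonal]), 1)
    return mat


def nondiagonal_contacts(mat):
    """Number of contacts that join two different bins."""
    return int(np.triu(mat, k=1).sum())
```

Summing `nondiagonal_contacts` over the chromosomes of a cell gives the total that would be compared against the 5 k quality-control cutoff.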
Simulations.
Rationale.
First, we used the single-cell Hi-C dataset from Stevens et al. (8) to test the similarity between real single-cell data and pseudo–single-cell data simulated by downsampling. The eight single-cell contact matrices of chromosome 1 are shown in SI Appendix, Fig. S2A. We merged the data from these eight single cells to generate a pseudobulk dataset, and then generated a simulated single-cell dataset by downsampling from the pseudobulk dataset. We constrained the number of sampled contacts to equal the number of contacts observed for each real single cell. However, a side effect of this operation was that the sparsity and heterogeneity of the simulated data were much lower than those observed for real single-cell data (SI Appendix, Fig. S2B). Therefore, we limited the sparsity when performing the downsampling. After controlling the sparsity of the contact matrices, we used PCA to visualize the simulated cell data together with the real single-cell data and found that the lower heterogeneity of the simulated data was still observed in the first two PCs. Specifically, we observed variation of cells in PC1, which is highly correlated with the coverage of these cells, while only the real single-cell data, and not the simulated cell data, showed variation in PC2 (SI Appendix, Fig. S2 D and E). To address this problem, we added random noise during the simulation to amplify the heterogeneity of the contact decay curves among the single cells. The combination of these two steps enabled the simulation to generate cells that had high sparsity (SI Appendix, Fig. S2C) and were indistinguishable from the real single cells (SI Appendix, Fig. S2G).
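The plain downsampling step discussed above, before any sparsity control or added noise, can be sketched as multinomial sampling from a bulk or pseudobulk matrix. This is only an illustration under stated assumptions: NumPy is available, only off-diagonal contacts are sampled, and the function name `downsample_contacts` is hypothetical.

```python
import numpy as np


def downsample_contacts(bulk, n_contacts, rng):
    """Draw n_contacts off-diagonal contacts with probability proportional to bulk counts."""
    bulk = np.asarray(bulk, dtype=float)
    upper = np.triu_indices(bulk.shape[0], k=1)  # each bin pair once
    weights = bulk[upper]
    if weights.sum() <= 0:
        raise ValueError("bulk matrix has no off-diagonal contacts")
    counts = rng.multinomial(n_contacts, weights / weights.sum())
    sim = np.zeros(bulk.shape, dtype=np.int64)
    sim[upper] = counts
    return sim + sim.T  # symmetric simulated single-cell matrix
```

Because every nonzero bulk entry can receive contacts, matrices produced this way tend to be less sparse than real single cells with the same number of contacts, which is the side effect described in the text.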
Bulk Hi-C data.
We downsampled bulk Hi-C data to simulate datasets with sparsity and heterogeneity similar to those of single cells. Bulk MAPQ30 contact matrices were extracted from Juicebox at 100-kbp resolution for the datasets of Rao et al. (29) and Bonev et al. (30). Contact matrices for each cell type at 200-kbp and 1-Mbp resolution were calculated by merging bins in the 100-kbp resolution matrices.
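Merging bins to obtain a coarser resolution amounts to summing square blocks of the finer matrix. A minimal sketch, assuming NumPy and zero-padding of a final incomplete block; the function name `merge_bins` is hypothetical.

```python
import numpy as np


def merge_bins(mat, factor):
    """Sum factor x factor blocks of a contact matrix to coarsen its resolution."""
    mat = np.asarray(mat)
    n = mat.shape[0]
    n_coarse = -(-n // factor)  # ceiling division
    # Pad with zeros so the size is a multiple of the merging factor.
    padded = np.zeros((n_coarse * factor, n_coarse * factor), dtype=mat.dtype)
    padded[:n, :n] = mat
    return padded.reshape(n_coarse, factor, n_coarse, factor).sum(axis=(1, 3))
```

Under this scheme a factor of 2 converts a 100-kbp matrix to 200 kbp, and a factor of 10 converts it to 1 Mbp.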
Normalization.
SQRTVC (square-root vanilla coverage) normalization was applied to the bulk contact matrices to deal with the coverage bias along the genome. The normalized contact matrix $\hat{A}$ is computed by the following:
$$\hat{A} = D^{-1/2} A D^{-1/2}, \tag{1}$$
where $D$ is a diagonal matrix in which each element $D_{ii}$ is the sum of the $i$th row of $A$.
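Square-root vanilla-coverage normalization divides each entry of the contact matrix by the square root of the product of its row sum and its column sum. A short NumPy sketch follows; leaving empty rows at zero is an assumption made here to avoid division by zero, and the function name `sqrtvc_normalize` is hypothetical.

```python
import numpy as np


def sqrtvc_normalize(mat):
    """Divide each entry by sqrt(row sum of i) * sqrt(row sum of j)."""
    mat = np.asarray(mat, dtype=float)
    row_sum = mat.sum(axis=1)
    # Bins without any contact keep a scaling factor of zero.
    inv_sqrt = np.zeros_like(row_sum)
    covered = row_sum > 0
    inv_sqrt[covered] = 1.0 / np.sqrt(row_sum[covered])
    return mat * inv_sqrt[:, None] * inv_sqrt[None, :]
```

For a symmetric input the output stays symmetric, since rows and columns are scaled by the same factors.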
Sparsity controlling.
We further controlled the sparsity during sampling to make the simulated data more similar to the real data. Leveraging the Ramani et al. (7) and Flyamer et al. (10) datasets, we fit a linear relationship between the total contacts $c$ and the sparsity $s$ of a cell at log scale (SI Appendix, Fig. S3):
$$\log s = a \log c + b, \tag{2}$$
where $a$ and $b$ are the fitted slope and intercept.
To generate a simulated dataset with a given median contact count, we drew a uniformly distributed random value for each simulated single cell and used it to set the total number of contacts of that cell. The sparsity of the cell was then computed from Eq. 2. The sampled contacts were randomly assigned to different chromosomes based on the contact numbers of each chromosome in the corresponding cell type in the bulk dataset.
Adding random noise.
We added noise to the contact frequency through the contact–distance curve, which describes how the values in the contact matrices change with their distance to the diagonal. More specifically, we generated a random vector $R$ of length $n$, where $n$ is the number of bins of the contact matrix. The values in $R$ follow a uniform distribution whose range is set by the noise level. The normalized bulk contact matrix $\hat{A}$ was then rescaled linearly by $R$ to give a noisy representation $E$. Finally, based on $E$, we sampled positions to be nonzero candidates according to Eq. 2, and distributed the simulated contacts to these positions.
scHiCluster.
Convolution-based imputation.
Imputation techniques are widely adopted in single-cell RNA-seq data to improve the data quality based on the structure of the data itself. For scHiCluster, the first step is to integrate the interaction information from the genomic neighbors to impute the interaction at each position. A missing value in the contact matrix could be due to experimental limitations of material dropout, rather than the absence of an interaction. Since the genome is linearly connected, our hypothesis is that the interaction partners of one bin may also be close to its neighboring bins. Thus, we used a convolution step to infer these missing values. Specifically, given a window size of $w$, we applied a filter $F$ of size $m \times m$, where $m = 2w + 1$, to scan the contact matrix $A$ of size $n \times n$. The elements of the imputed matrix $B$ are computed by the following:
$$B_{ij} = \sum_{p,q} F_{pq}\, A_{i-w-1+p,\; j-w-1+q}, \tag{3}$$
where $1 \le p, q \le m$. In this work, all of the filters are set to be all-one matrices, which is equivalent to taking the average of the genomic neighbors. However, the filters could be tuned to incorporate different weights for elements during imputation. For instance, the elements located further from the imputed elements could be assigned smaller weights. The window size was set to 1 for 1-Mbp resolution maps.
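With an all-one filter, the convolution step reduces to averaging each entry with its neighbors in a square window. The NumPy sketch below illustrates this for a window size `w`; zero-padding at the matrix edges is an assumption made here (the text does not state how edges are handled), and the function name `neighbor_smooth` is hypothetical.

```python
import numpy as np


def neighbor_smooth(mat, w=1):
    """Average each entry over its (2w+1) x (2w+1) neighborhood, zero-padded at the edges."""
    mat = np.asarray(mat, dtype=float)
    n_rows, n_cols = mat.shape
    padded = np.pad(mat, w, mode="constant")
    out = np.zeros_like(mat)
    # Summing shifted copies is the same as applying an all-one filter.
    for dp in range(2 * w + 1):
        for dq in range(2 * w + 1):
            out += padded[dp:dp + n_rows, dq:dq + n_cols]
    return out / (2 * w + 1) ** 2
```

With `w=1`, a single contact is spread evenly over the 3 x 3 block of bin pairs centered on it, which is how a neighboring bin pair with no observed read can obtain a nonzero value.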
Random-walk–based imputation.
Random walk with restarts (RWR) is widely used to capture the topological structure of a network (28, 41). The random-walk process helps to infer the global structure of the network, and the restart step provides information on local network structures. What Hi-C data fundamentally describe is the relationship between two genomic bins, which can be considered as a network where nodes are the genomic bins and edges are their interactions. Unlike the convolution step, which takes information from the neighbors on the linear genome, the random-walk step considers the signal from neighbors with experimentally measured interactions. The imputed matrix $B$ defined in Eq. 3 is first normalized by its row sum:
$$C_{ij} = \frac{B_{ij}}{\sum_{j'} B_{ij'}} \tag{4}$$
We use $Q_t$ to represent the matrix after the $t$th iteration of random walk with restart. The random walk starts from the identity matrix, $Q_0 = I$, and $Q_t$ is computed recursively by the following:
$$Q_t = (1 - p)\, Q_{t-1} C + p I \tag{5}$$
where $p$ is a scalar representing the restart probability, which balances the information between global and local network structures. The random walk with restart was performed until $Q_t$ converged. Each element $Q_{ij}$ of the matrix after convergence signifies the probability that the random walk reaches the $j$th node when starting from the $i$th node. The number of iterations until convergence ranged from 8 to 21 in the Flyamer et al. (10) dataset, with a mean of 15.5, and from 10 to 22 in the Ramani et al. (7) dataset, with a mean of 15.3.
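The row normalization and recursion in Eqs. 4 and 5 can be sketched as follows. This is an illustrative sketch, not the packaged implementation: the function name, the restart probability of 0.5, the tolerance of 1e-6, the Frobenius norm used to measure change, and the guard for empty rows are all assumptions.

```python
import numpy as np

def random_walk_with_restart(B, restart_prob=0.5, tol=1e-6, max_iter=100):
    """Iterate Q <- (1 - p) * Q @ C + p * I until successive iterates agree."""
    B = np.asarray(B, dtype=float)
    n = B.shape[0]
    row_sums = B.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # assumption: leave empty rows as zeros
    C = B / row_sums               # row-normalized transition matrix
    identity = np.eye(n)
    Q = identity.copy()            # the walk starts from the identity matrix
    for t in range(1, max_iter + 1):
        Q_next = (1 - restart_prob) * Q @ C + restart_prob * identity
        delta = np.linalg.norm(Q_next - Q)  # Frobenius norm of the change
        Q = Q_next
        if delta <= tol:
            break
    return Q, t

# Toy imputed matrix for three bins.
B = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])
Q, n_iter = random_walk_with_restart(B)
```

Because every row of the transition matrix sums to 1, every row of `Q` also sums to 1 at each iteration, which is consistent with reading its entries as visiting probabilities.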
Embedding and clustering.
Since the coverage of the matrices from each cell is different, the sparsity and scale of the matrices after random walk are also distinct. Thus, after random walk, a threshold was chosen to convert the real-valued matrix $Q$ into a binary matrix $Q_b$. The threshold was set to the 80th percentile of $Q$ for all of the analyses, and its impact is discussed in SI Appendix, Fig. S4. This is a crucial step since it allows us to select the most conserved and reliable interactions in each cell. Then the matrix $Q_b$ is reshaped to $1 \times n^2$, and the matrices from $N$ different cells were concatenated into an $N \times n^2$ matrix. In the last step, PCA was used to project the matrix into a low-dimensional space and produce the embedding of the cells. Each chromosome was embedded separately; the embeddings of all chromosomes were then concatenated, and another PCA was applied to derive the final embedding. The whitening matrices for the two steps of PCA were multiplied, and the product, representing the weight of each element in the contact matrices for computing each PC, was visualized in SI Appendix, Fig. S10. The first two PCs were plotted for visualizing the cells, and the first 10 PCs were used for K-means++ clustering. Since we know the cell-type labels of the datasets used in the manuscript, the number of clusters is based on the number of predefined cell types in the corresponding dataset. In cases where the cluster number is unknown, the number of clusters is a user-defined parameter in the scHiCluster package. Since scHiCluster also returns the embedding, the user can apply other clustering algorithms that do not require a predefined number of clusters to the embedding.
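The binarization and two-step PCA described above can be sketched with NumPy alone. The helper names `binarize_top` and `pca_embed`, the strict comparison against the percentile, the toy dimensions, and the random stand-in matrices are assumptions; this sketch also uses plain SVD-based PCA without whitening.

```python
import numpy as np

def binarize_top(Q, percentile=80):
    """Keep entries above the given percentile of Q as 1, all others as 0."""
    threshold = np.percentile(Q, percentile)
    return (Q > threshold).astype(float)

def pca_embed(X, n_components):
    """Project the rows of X onto the top principal components via SVD."""
    centered = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

rng = np.random.default_rng(0)
n_cells, n_bins, n_chroms = 6, 5, 2
per_chrom = []
for _ in range(n_chroms):
    # Random matrices stand in for the random-walk output of each cell.
    rows = [binarize_top(rng.random((n_bins, n_bins))).ravel()
            for _ in range(n_cells)]
    X = np.vstack(rows)  # one row per cell, n_bins**2 columns
    per_chrom.append(pca_embed(X, n_components=3))

# Concatenate the per-chromosome embeddings and apply PCA a second time.
final = pca_embed(np.hstack(per_chrom), n_components=2)
```

The rows of `final` could then be passed to any clustering routine, for example a K-means++ implementation such as scikit-learn's `KMeans`.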
ARI.
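The ARI was used to compare the similarity between the true labels of the cell types and the results of the clustering algorithm. ARI is defined based on the confusion matrix $(n_{ij})$, where $n_{ij}$ is the number of cells that are labeled as the $i$th cell type and assigned to the $j$th cluster by the algorithm:
$$\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{N}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{N}{2}} \tag{6}$$
where $a_i$ and $b_j$ are the sum of the $i$th row and the sum of the $j$th column of the confusion matrix, respectively, and $N$ is the total number of cells.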
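For reference, the standard adjusted Rand index can be computed directly from label pairs, as in the sketch below. The function name and the handling of the degenerate case (returning 1.0 when both labelings place every cell in a single group) are choices made for this sketch.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, cluster_labels):
    """Adjusted Rand index from the confusion counts of two labelings."""
    n = len(true_labels)
    cell_counts = Counter(zip(true_labels, cluster_labels))  # confusion matrix
    row_counts = Counter(true_labels)                        # row sums
    col_counts = Counter(cluster_labels)                     # column sums
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in row_counts.values())
    sum_cols = sum(comb(c, 2) for c in col_counts.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, a choice made in this sketch
    return (sum_cells - expected) / (max_index - expected)

# Cluster names do not matter, only the grouping of cells.
perfect = adjusted_rand_index([0, 0, 1, 1], ["b", "b", "a", "a"])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A clustering that matches the true cell types up to renaming gives an ARI of 1, while a grouping that agrees less than expected by chance gives a negative value.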
Baseline Methods.
PCA.
The raw contact matrices of each cell were log2 transformed and reshaped to $1 \times n^2$. The matrices from $N$ different cells were concatenated into an $N \times n^2$ matrix. The matrix for each chromosome was PCA transformed, the resulting embeddings were concatenated, and another PCA was applied to derive the final embedding with all chromosomes.
HiCRep+MDS.
HiCRep 1.6.0 was installed from Bioconductor. For each chromosome, the raw contact matrix of each cell at 1-Mb bin size was log2 transformed and smoothed with a window size of 1. The stratum-adjusted correlation coefficient (SCC) was computed between each pair of smoothed matrices. The median SCC across all chromosomes was transformed to a Euclidean distance by Eq. 7:
$$d = \sqrt{2 - 2 \times \mathrm{SCC}} \tag{7}$$
The Euclidean distance matrix was then embedded into two dimensions with MDS.
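A small sketch of this baseline's final step is given below. Two assumptions are made: the correlation-to-distance conversion is taken to be the common form sqrt(2 - 2 × SCC), and classical (Torgerson) MDS is swapped in because the text does not say which MDS variant was used. The function names and the toy SCC values are illustrative only.

```python
import numpy as np

def scc_to_distance(scc):
    """Assumed conversion of a correlation-like similarity to a distance."""
    scc = np.asarray(scc, dtype=float)
    return np.sqrt(np.clip(2.0 - 2.0 * scc, 0.0, None))

def classical_mds(D, n_components=2):
    """Classical MDS: eigendecomposition of the double-centered squared distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(G)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)  # drop negative eigenvalues
    return eigvecs[:, order] * np.sqrt(top_vals)

# Toy median-SCC matrix for three cells (made-up values).
scc = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
D = scc_to_distance(scc)
coords = classical_mds(D, n_components=2)
```

Cells with a high pairwise SCC end up close together in the two-dimensional coordinates, and cells with a low SCC end up far apart.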
Eigenvector.
The raw contact matrix of each cell was log2 transformed to . The distance-normalized matrix of each cell was computed by the following:
[8] |
Then PCA was performed on the correlation matrix of and the PC1 was kept as features of the cell. We computed the mean CpG content of the bins with positive and negative features, respectively, and reversed the features if the negative features corresponded to higher CpG content. The features from different cells were concatenated into a matrix and PCA transformed.
Decay.
The raw contact matrix of each cell was log2 transformed to . The feature vector of each cell was computed by the following
[9] |
which represents the proportion of contacts at each genomic distance. The features from different cells were concatenated into a matrix and PCA transformed.
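One way to compute a per-cell decay profile of this kind is sketched below. The exact form of Eq. 9 is not reproduced here, so the details are assumptions: a pseudocount of 1 before the log2 transform, inclusion of the main diagonal, and normalization by the total over all distances. The function name `decay_profile` is hypothetical.

```python
import numpy as np

def decay_profile(contact):
    """Fraction of log2-transformed contact signal at each bin distance."""
    A = np.log2(np.asarray(contact, dtype=float) + 1.0)  # pseudocount assumed
    n = A.shape[0]
    # Sum of the d-th upper diagonal gives the signal at distance d.
    per_distance = np.array([np.trace(A, offset=d) for d in range(n)])
    total = per_distance.sum()
    return per_distance / total if total > 0 else per_distance

# Toy 3-bin contact matrix with counts that decay away from the diagonal.
counts = np.array([[4, 2, 1],
                   [2, 4, 2],
                   [1, 2, 4]])
profile = decay_profile(counts)
```

Each cell yields one vector of length equal to the number of bins, so stacking the vectors from all cells gives the matrix that is passed to PCA.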
Identification of TLSs/TADs.
In Fig. 5C and SI Appendix, Fig. S16, all TLSs/TADs were computed by TopDom with a window size of 5. TADs in bulk ESCs and NPCs were identified at 10-kbp resolution. The cells with more than 100 k nondiagonal contacts at 40-kb resolution were included in Fig. 5D (1,007 in total). For a given TAD identified in bulk Hi-C data whose boundaries are and , we decided whether a TLS in a single cell between and is corresponding to the TAD by whether and satisfied and or not.
In SI Appendix, Figs. S13–S15, S17, and S18, we did not call TLS directly in single cells due to the low coverage of the dataset. Instead, the differential TLSs between cell types were found by browsing the bulk Hi-C data of the corresponding cell types and finding the TADs that are obviously different between those cell types. Then we counted in how many single cells the similar interactions within the TADs were also detected. We defined a cell having a TLS similar to the TAD between and if its contact matrix satisfied , where is the indicator function.
Acknowledgments
We thank Dr. Jie Liu for specifying the details in his algorithm. J.R.E. is an investigator of the Howard Hughes Medical Institute. Research reported in this publication was supported by the National Human Genome Research Institute of NIH under Award R21 HG009274.
Footnotes
Conflict of interest statement: J.R.E., J.R.D., and A.K. are coauthors on a 2015 research article. J.R.E. and A.K. are coauthors on a 2017 correspondence article.
Data deposition: scHiCluster has been deposited on GitHub (https://github.com/zhoujt1994/scHiCluster).
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1901423116/-/DCSupplemental.
References
- 1. Tanay A., Regev A., Scaling single-cell genomics from phenomenology to mechanism. Nature 541, 331–338 (2017).
- 2. Ramsköld D., et al., Full-length mRNA-seq from single-cell levels of RNA and individual circulating tumor cells. Nat. Biotechnol. 30, 777–782 (2012).
- 3. Cusanovich D. A., et al., Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
- 4. Buenrostro J. D., et al., Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
- 5. Luo C., et al., Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 357, 600–604 (2017).
- 6. Nagano T., et al., Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature 502, 59–64 (2013).
- 7. Ramani V., et al., Massively multiplex single-cell Hi-C. Nat. Methods 14, 263–266 (2017).
- 8. Stevens T. J., et al., 3D structures of individual mammalian genomes studied by single-cell Hi-C. Nature 544, 59–64 (2017).
- 9. Nagano T., et al., Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature 547, 61–67 (2017).
- 10. Flyamer I. M., et al., Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-to-zygote transition. Nature 544, 110–114 (2017).
- 11. Tan L., Xing D., Chang C.-H., Li H., Xie X. S., Three-dimensional genome structures of single diploid human cells. Science 361, 924–928 (2018).
- 12. Levine J. H., et al., Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell 162, 184–197 (2015).
- 13. Macosko E. Z., et al., Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
- 14. Luo C., et al., Robust single-cell DNA methylome profiling with snmC-seq2. Nat. Commun. 9, 3824 (2018).
- 15. Schep A. N., Wu B., Buenrostro J. D., Greenleaf W. J., chromVAR: Inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods 14, 975–978 (2017).
- 16. Cusanovich D. A., et al., The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555, 538–542 (2018).
- 17. Preissl S., et al., Single-nucleus analysis of accessible chromatin in developing mouse forebrain reveals cell-type-specific transcriptional regulation. Nat. Neurosci. 21, 432–439 (2018).
- 18. Yang T., et al., HiCRep: Assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res. 27, 1939–1949 (2017).
- 19. Ursu O., et al., GenomeDISCO: A concordance score for chromosome conformation capture experiments using random walks on contact map graphs. Bioinformatics 34, 2701–2707 (2018).
- 20. Yan K.-K., Yardimci G. G., Yan C., Noble W. S., Gerstein M., HiC-spector: A matrix library for spectral and reproducibility analysis of Hi-C contact maps. Bioinformatics 33, 2199–2201 (2017).
- 21. Sauria M. E. G., Taylor J., QuASAR: Quality assessment of spatial arrangement reproducibility in Hi-C data. bioRxiv:10.1101/204438 (14 November 2017).
- 22. Yardimci G. G., et al., Measuring the reproducibility and quality of Hi-C data. Genome Biol. 20, 57 (2019).
- 23. Liu J., Lin D., Yardimci G. G., Noble W. S., Unsupervised embedding of single-cell Hi-C data. Bioinformatics 34, i96–i104 (2018).
- 24. Kind J., et al., Single-cell dynamics of genome-nuclear lamina interactions. Cell 153, 178–192 (2013).
- 25. Shachar S., Voss T. C., Pegoraro G., Sciascia N., Misteli T., Identification of gene positioning factors using high-throughput imaging mapping. Cell 162, 911–923 (2015).
- 26. Kind J., et al., Genome-wide maps of nuclear lamina interactions in single human cells. Cell 163, 134–147 (2015).
- 27. Wang S., et al., Spatial organization of chromatin domains and compartments in single chromosomes. Science 353, 598–602 (2016).
- 28. Pan J.-Y., Yang H.-J., Faloutsos C., Duygulu P., "Automatic multimedia cross-modal correlation discovery" in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04 (ACM, New York, 2004), pp 653–658.
- 29. Rao S. S. P., et al., A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
- 30. Bonev B., et al., Multiscale 3D genome rewiring during mouse neural development. Cell 171, 557–572.e24 (2017).
- 31. Lieberman-Aiden E., et al., Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
- 32. Dixon J. R., et al., Chromatin architecture reorganization during stem cell differentiation. Nature 518, 331–336 (2015).
- 33. Burger J. A., Bürkle A., The CXCR4 chemokine receptor in acute and chronic leukaemia: A marrow homing receptor and potential therapeutic target. Br. J. Haematol. 137, 288–296 (2007).
- 34. Shin H., et al., TopDom: An efficient and deterministic method for identifying topological domains in genomes. Nucleic Acids Res. 44, e70 (2016).
- 35. Bintu B., et al., Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells. Science 362, eaau1783 (2018).
- 36. Li G., et al., Simultaneous profiling of DNA methylation and chromatin architecture in mixed populations and in single cells. bioRxiv:10.1101/470963 (15 November 2018).
- 37. Lee D.-S., et al., Single-cell multi-omic profiling of chromatin conformation and DNA methylome. bioRxiv:10.1101/503235 (26 December 2018).
- 38. Kelsey G., Stegle O., Reik W., Single-cell epigenomics: Recording the past and predicting the future. Science 358, 69–75 (2017).
- 39. Dixon J. R., et al., Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
- 40. Nora E. P., et al., Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 381–385 (2012).
- 41. Cowen L., Ideker T., Raphael B. J., Sharan R., Network propagation: A universal amplifier of genetic associations. Nat. Rev. Genet. 18, 551–562 (2017).