Skip to main content
PLOS Computational Biology logoLink to PLOS Computational Biology
. 2021 Mar 16;17(3):e1008834. doi: 10.1371/journal.pcbi.1008834

A unified framework for inferring the multi-scale organization of chromatin domains from Hi-C

Ji Hyun Bak 1,¤,#, Min Hyeok Kim 1,#, Lei Liu 1,2, Changbong Hyeon 1,3,*
Editor: Alexandre V Morozov4
PMCID: PMC7997044  PMID: 33724986

Abstract

Chromosomes are giant chain molecules organized into an ensemble of three-dimensional structures characterized with its genomic state and the corresponding biological functions. Despite the strong cell-to-cell heterogeneity, the cell-type specific pattern demonstrated in high-throughput chromosome conformation capture (Hi-C) data hints at a valuable link between structure and function, which makes inference of chromatin domains (CDs) from the pattern of Hi-C a central problem in genome research. Here we present a unified method for analyzing Hi-C data to determine spatial organization of CDs over multiple genomic scales. By applying statistical physics-based clustering analysis to a polymer physics model of the chromosome, our method identifies the CDs that best represent the global pattern of correlation manifested in Hi-C. The multi-scale intra-chromosomal structures compared across different cell types uncover the principles underlying the multi-scale organization of chromatin chain: (i) Sub-TADs, TADs, and meta-TADs constitute a robust hierarchical structure. (ii) The assemblies of compartments and TAD-based domains are governed by different organizational principles. (iii) Sub-TADs are the common building blocks of chromosome architecture. Our physically principled interpretation and analysis of Hi-C not only offer an accurate and quantitative view of multi-scale chromatin organization but also help decipher its connections with genome function.

Author summary

An array of square blocks and checkerboard patterns characteristic to Hi-C data reflects the multi-scale organization of the chromatin chain. Deciphering the structures of chromatin domains from Hi-C and associating them with genome function are open problems of great importance in genome research. However, most existing methods are specialized in finding domains at different scales, making it difficult to integrate the solutions. Here we develop a unified framework for modeling and inferring domain structures over multiple scales, based on a physical model of the chromosome that reflects its nature as a three-dimensional object. Our method efficiently explores the space of domain solutions at different genomic scales, and systematically infers the chromatin domains over multiple scales from Hi-C data by employing a single tuning parameter. Our principled interpretation of Hi-C not only offers a quantitative view of multi-scale chromatin organization but also helps understand its connections with genome function.

Introduction

The spatial organization of chromatin inside cell nuclei has a profound impact on the function of the genome [1]. Chromosome conformation capture (3C) and its derivatives, which are used to identify chromatin contacts through the proximity ligation techniques [2, 3], take center stage in chromosome research [1, 4]. In particular, high-throughput chromosome conformation capture (Hi-C) technique quantifies all pairwise interactions between the segments of the chromatin. Square-block and checkerboard patterns that appear in Hi-C data provide glimpses into the organization of chromatin chains.

Despite a fundamental limitation of Hi-C that the data are in practice determined from a population of cells with strong cell-to-cell heterogeneity, the cell-type specificity and even the pathological states [5, 6] can clearly be discerned between different Hi-C maps. Given that Hi-C pattern changes with the transcription activity and along the phase of cell cycle [716], accurate characterization of chromatin domains (CDs) from Hi-C data is of great importance in advancing our quantitative understanding to the functional roles of chromosome structure in gene regulation.

Different types of CDs have been identified at different genomic scales. Inside cell nuclei each chromosome made of O(102) Mb DNA is segregated into its own territory (Fig 1A) [17]. At the scale of O(10) Mb, alternating blocks of active and inactive chromatin are phase-separated into two megabase sized aggregates, called A- and B-compartments [1821] (Fig 1B). Topologically associated domains (TADs), detected at O(101)O(1) Mb [2225], are considered the basic functional unit of chromatin organization and gene regulation because of their well-conserved domain boundaries across cell/tissue types [17, 20, 26, 27]. It was suggested that the proximal TADs in genomic neighborhood aggregate into a higher-order structural domain termed “meta-TAD” [8]. At smaller genomic scale, each TAD is further split into sub-TADs that display more localized contacts [19, 2831] (Fig 1C).

Fig 1. The hierarchical organization of interphase chromosome and Hi-C map.

Fig 1

(A) Chromosome territories in the cell nucleus, which are manifested as the higher intra-chromosomal counts in the Hi-C map. (B) Alternating blocks of active and inactive chromatins, segregated into A- and B-compartments, give rise to the checkerboard pattern on Hi-C. (C) Sub-megabase to megabase sized chromatin folds into TADs. Adjacent TADs are merged to meta-TAD [8], and individual TAD is further decomposed into sub-TADs [19, 2831].

The current knowledge of CDs is based on a number of computational methods for Hi-C data analysis [18, 19, 22, 32]. Although each method made a unique and important contribution to the field, most are limited to finding CDs at specific scales under specific conditions. Lack of a systematic way of integrating or comparing these methods and their CD solutions engenders two issues. First, depending on the method being used, the identified CDs display significant variations, with the average size of a TAD varying from 100 kb to 2 Mb; yet there is no shared way to determine which size should be preferred. Second, it is difficult to formulate a comparative analysis between CDs found at different scales. To make things worse, many algorithms require that Hi-C data be pre-processed in specific ways, for example down-sampling to a coarser resolution for targeting CDs at larger scale [18, 19, 22]. Progress was made by methods that consider hierarchical CDs [33, 34]; however, these method fall short of providing a common interpretable framework, because they are restricted to local pattern recognitions, as in many earlier methods [18, 19, 22, 32]. To avoid the arbitrariness of focusing on the features at specific scales, a better approach will be to start from a physically principled notion that chromosomes are a hierarchically structured three dimensional object made of a long polymer [8, 16, 3541].

Here we present a unified framework for characterizing CDs over multiple scales, incorporating an interpretable model of the chromosome based on polymer physics, and a clustering method based on statistical physics and Bayesian inference. For convenience, we will divide our framework to the two parts: pre-processing and inference. For pre-processing, Hi-C data is interpreted as a pairwise contact probability matrix, resulting from a Gaussian polymer network whose inter-monomer distances are restrained harmonically, which allows us to transform a raw Hi-C data into the correlation matrix. For the inference step, we formulate a statistical model of pairwise correlation with hidden domains, and set up an optimization problem to find the CD solution that can best explain the Hi-C-derived correlation matrix. The model flexibly explores the space of CD solutions at different genomic scales, and finds a family of solutions parameterized by a single parameter λ. We present a single method pipeline, Multi-CD, that combines the pre-processing and inference algorithms. Our method can be applied to any raw Hi-C data to analyze the multi-scale domain organization.

Results

Our primary contribution is a method for characterizing the chromatin domains (CDs) at multiple scale in a unified framework.

Correlation-based model for chromatin domains (CDs)

Our approach is based on an assumption that the chromosome is organized into discrete chromatin domains (CDs), such that the position (and consequently, the activity) of two genomic loci within the same CD is more strongly correlated than across different CDs.

The pattern of pairwise correlation between different genomic loci is captured by Hi-C. However, most standard analysis approaches do not explicitly address how the information in Hi-C should be interpreted in a statistically rigorous way. Here we start by constructing a bridge between the observed data (Hi-C counts) and the key statistical quantity in our model (the pairwise correlation matrix).

We use a Gaussian polymer network model to describe the long-range pairwise interaction of genomic segments in a chromosome. Use of the Gaussian polymer network model was motivated by an observation, from fluorescence measurements, that the spatial distances between pairs of chromatin segments are well described by the gaussian distribution [21, 4244] (see S1 Fig). This observation suggests that, despite the presence of cell-to-cell variation in a population of cells [39, 4551], we can still approximate the chromosome with a gaussian polymer network whose configuration fluctuates around a local basin of mechanical equilibrium [16, 41, 5255], for the purpose of modeling the pairwise distances. See S1 Appendix for more discussion.

The Gaussian polymer network provides a flexible physical model for diverse patterns of correlation, because we allow each pair of segments to interact with a harmonic potential with an independent “spring constant”—the matrix of these spring constants can be translated to the pairwise correlation matrix between the segments. At the same time, the model is tractable enough to allow us to write down the pairwise distance distribution given the pairwise correlation, and consequently the pairwise contact probability. Assuming that Hi-C is essentially a manifestation of the pairwise contact probabilities, we can construct a chain of relationships that connects the Hi-C data matrix to the pairwise correlation matrix (Fig 2A). See Methods for details.

Fig 2. An overview of the Multi-CD method.

Fig 2

(A) We first pre-process Hi-C data to extract a correlation matrix C. Given C, we infer the chromatin domain (CD) solutions s at multiple scales by varying a single parameter λ. At each λ, the best domain solution is found through simulated annealing, in which the effective temperature T is gradually decreased (inner blue box). A complete Multi-CD algorithm involves repeating the process for different values of λ (outer red box), to obtain a family of solutions. (B) An example of the normalized Hi-C matrix, (C) and the correlation matrix that results from pre-processing. Shown is the full chromosome 10 in GM12878 (50-kb Hi-C), which is used as an example dataset throughout the paper. (D) Simplified schematic for the resulting family of domain solutions, {sλ}, at varying parameter λ. Each s is a vector of domain indices; line breaks illustrate domain boundaries. These solutions are not meant to be the optimal solutions for the shown data, but they illustrate how the typical domain scale increases with λ.

Now we have a pairwise correlation matrix, C, that summarizes the experimental data. We assume that the correlation pattern in C was generated from a hidden model, which is characterized by a relatively small number of chromatin domains (CDs). Suppose that there are N genomic segments, and that each segment i ∈ {1, 2, ⋯, N} belongs to one of K domains, indexed by si ∈ {1, 2, ⋯, K}. The vector s = (s1, s2, …, sN) can be called the domain solution in this model, because it summarizes how the genome is organized into distinct CDs. We are assuming that pairs of segments in the same CD will be more highly correlated (corresponding elements in C have larger values), which are manifested in higher counts in the Hi-C data. Adapting the group model [5658] from statistical mechanics, we formulate the pairwise correlation between two genomic segments (i, j) to have two contributions: the intra-domain correlation of the si-th CD that is present only when si = sj, and the domain-independent correlation between the two segments (see Methods).

The goal of our method is to identify the domain solution s that best represents the pattern in the correlation matrix C. To evaluate how well a domain solution s explains the correlation pattern in the data, we calculate the log likelihood L(C;s) that the observed correlation C was drawn from an underlying set of domains s. We also impose an additional preference to more parsimonious solutions; instead of explicitly fixing the number of domains, K, we penalize solutions with more fragmented domains, in terms of the generalized number of domains K(s). We combine the two measures into a single family of objective functions, parameterized by λ, and look for solution s^ that maximizes:

s^λ=maxs[L(C;s)λK(s)] (1)

(see Methods for the mathematical details and the optimization procedure). The free parameter λ (≥ 0) controls how strongly the problem prefers parsimonious solutions: with a larger λ, the solution s^ for the optimization problem would tend to have fewer numbers of domains. When N is fixed, we can also say that a larger λ prefers solutions with domains that are larger in the genomic scale. Solving at multiple values of λ (Fig 2B), therefore, would reveal the multi-scale domain structure in data.

There is a mathematical analogy between the form of the objective function in Eq 1, and the grand canonical ensemble in statistical physics (see Methods for details). The analogy to the well-studied physical formulation provides a useful conceptual framework, and justifies the use of efficient inference procedures such as simulated annealing (see Methods). We also note that, in this view, the parameter λ has a physical interpretation as the effective chemical potential of a domain in the solution, which is associated with the creation of a new domain or the merging of two domains into one.

Putting together, we present Multi-CD, a unified framework for modeling, inferring and interpreting the hidden chromatin domains (CDs) from Hi-C data.

Discovery of CDs at multiple scales

Now we show how our method can be used to characterize the multi-scale structure of the chromosome and generate new insights. We applied Multi-CD to a sample subset of Hi-C data from a commonly used human lymphocyte cell line, GM12878, at 50-kb resolution. After the transformation of the raw Hi-C (Fig 3A) into a correlation matrix (Fig 3B), Multi-CD was employed to infer a family of CD solutions that vary with λ (Fig 3C). We also applied Multi-CD to four other cell lines, HUVEC, NHEK, K562, and KBM7 (Fig 3D), and analyzed the CD solutions for all five cell lines. All results shown in the main text consider chromosome 10. See See S2 Fig for similar results from three other chromosomes (chr4, 11, 19).

Fig 3. Multi-scale chromatin domain solutions for various cell types.

Fig 3

(A) A subset of 50-kb resolution Hi-C data, covering a 10-Mb genomic region of chr10 in GM12878. (B) The cross-correlation matrix Cij for the corresponding subset. (C) Multi-CD applied to the correlation matrix in B. Domain solutions determined at 4 different values of λ = 0; 10; 30; 50. (D) Hi-C data from the same chromosome (chr10) in four other cell lines: HUVEC, NHEK, K562, and KBM7. Same subset as in A. (E-G) Characteristics of the domain solutions determined for all five cell lines in A and D: (E) the average domain size, 〈n〉 (F) the index of dispersion in the domain size, D(=σn2/n) (G) the normalized mutual information, nMI. (H-I) Comparison of domain solutions across cell types. (H) Average cell-to-cell similarity of domain solutions, in terms of Pearson correlations, at varying λ. (I) Domain solutions obtained at λ = 10 for 5 different cell types. See S3 Fig for solutions at λ = 0 and λ = 40. (J) Similarity between domain solutions at different λ’s, shown for GM12878. See S4 Fig for corresponding results for the other four cell lines. (K) RNA-seq signals from the five cell lines (colored hairy lines), on top of the TAD solutions (filled boxes), in a genomic interval that contains the regulatory elements associated with a gene APBB1IP. APBB1IP is transcriptionally active only in two cell lines, GM12878 and KBM7, where the regulatory elements are fully enclosed in the same TAD. See S5 Fig for additional examples.

We observed several general features from the families of CD solutions in these cell lines:

  • (i) The average domain size 〈n〉 always increased monotonically with λ (Fig 3E), as expected from our construction of the prior.

  • (ii) The domain sizes were relatively homogeneous in the small-domain regime (small λ), but became heterogeneous after a cross-over point (Fig 3F). To quantify this, we defined the index of dispersion for the domain sizes, i.e., the variance-to-mean ratio D=σn2/n. If the domains were generated by randomly selecting the boundaries along the genome, D = 1; a smaller D < 1 indicates that domain sizes are more homogeneous than random. A larger D > 1 means that the domain sizes are heterogeneous. The crossover points arise at λcr ≈ 30 − 40 (〈ncr ≈ 1.6 Mb) for GM12878, HUVEC, and NHEK; and λcr ≈ 60 − 70 (〈ncr ≈ 2.2 Mb) for K562 and KBM7. We observed that the onset of heterogeneity was related to the appearance of non-local domains (Fig 3C).

  • (iii) We quantified the goodness of each CD solution by comparing its corresponding binary matrix against the Hi-C data in terms of the normalized mutual information (nMI; see Methods). There is a scale, λ*, at which the diagonal block pattern manifested in Hi-C data is most accurately captured. In Fig 3G, the best solution was found at λ* ≈ 30 for GM12878, HUVEC, and NHEK; the λ*’s were identified at larger values for K562 and KBM7. As an interesting side note, K562 and KBM7 belong to immortalized leukemia cell lines, whereas the other three cell types are normal cells; the different statistical property of Hi-C patterns manifested in λ* may hint at a link between the pathological state and a coarser organization of the chromosome.

  • (iv) The CD solutions inferred by Multi-CD, especially the families of local CDs, appeared to be conserved across different cell types (Fig 3H and 3I). We quantified the extent of domain conservation in terms of the Pearson correlation (Methods), averaged over all pairs of different cell types. Domain conservation was strong for smaller domains at λ ≤ 30 (〈n〉 ≲ 1.5 Mb), with the strongest conservation at λ = 10 (Fig 3H). The CDs at λ = 10 are shown in Fig 3I for five different cell types.

  • (v) Finally, we quantified the similarity between pairs of CD solutions obtained at different scales, again using the similarity measure based on Pearson correlation. In the case of GM12878, the family of CD solutions is divided into two regimes; the smaller-scale CD solutions from a range of 10 ≤ λ ≤ 40 are correlated among themselves, and the larger-scale CD solutions from λ > 40 as well. CD solutions below and above λ ≈ 40 are not correlated with each other (Fig 3J; also see S4 Fig for the other cell lines).

The division boundary in (v) is found at a λ value in the similar range with the best-clustering scale λ*, and the crossover λcr from local/homogeneous to non-local/heterogeneous CDs (compare S4 Fig to Fig 3F and 3G). Hereafter we will refer to the two regimes as the family of local CDs with homogeneous size distribution (λ ≲ λ*), and the family of non-local CDs with heterogenous size distribution (λ ≳ λ*).

TAD-like organizations in the family of local CDs

We identified at least three important scales in the family of local CDs. First of all, there is a scale at which domain conservation is maximized across different cell types (λ = 10). This observation is consistent with the widely accepted notion that TADs are the most well-conserved, common organizational and functional unit of chromosomes, across different cell types [27, 59]. Thus, for the example from human chromosome 10, we identify the CDs found at this scale λ = 10 as the TADs. The average domain size at λ = 10 is 〈n〉 ≈ 0.9 Mb, which agrees with the typical size of TADs as suggested by previous studies [22, 23].

The goodness of clustering, on the other hand, is maximized at a larger scale, λ ≈ 30 (〈n〉 ≈ 1.5 Mb) for GM12878. The CDs at this scale turn out to be aggregates of multiple TADs in the genomic neighborhood, from visual inspection (see Fig 3C), or as quantified in terms of a nestedness score (Methods). We therefore identify these CDs as the “meta-TADs”, a higher-order structure of TADs, adopting the term of Ref. [8]. In contrast to a previous analysis that extended the range of meta-TADs to the entire chromosome [8], we use the term meta-TAD exclusively for the larger-scale local CDs, distinguishing them from the non-local structures (i.e., compartments, discussed below). We note, however, that the terminologies of TADs and the meta-TADs are still not definitive—a recently proposed algorithm based on structural entropy minimization [60] found that the “best” solutions were found at ∼ 2 Mb domains, which is consistent with our findings, although these domains were called the TADs in Ref. [60].

Finally, a trivial but special scale is λ = 0, where no additional preference for coarser CDs is imposed. The CDs at this scale are supposed to best explain the local correlation pattern that is reflected in the strong Hi-C signals near the diagonal. These smaller CDs are almost completely nested in the TADs and the meta-TADs; we can therefore call them the sub-TADs. We also confirm that the sub-TAD solutions are not limited by the resolution of the Hi-C data; sub-TADs are robustly reproduced from a finer, 5-kb Hi-C (S6 Fig).

The first three panels in Fig 3C shows three representative TAD-like CD solutions at λ ≤ λ*: sub-TADs (λ = 0; smallest CDs), TADs (λ = 10, strongest domain conservation), and meta-TADs (λ = λ* = 30, largest nMI). The nested structure is reminiscent of the hierarchically crumpled structure of chromatin chains [37].

Chromatin organization and its link to gene expression

The CD solutions from Multi-CD can provide important insights into the link between chromatin organization and gene expression. To demonstrate this, we overlaid the RNA-seq profiles on the TAD solutions, identified for the corresponding subset of the chr10 of five cell lines (GM12878, HUVEC, NHEK, K562, KBM7) (Fig 3K). At around 26.8 Mb position of this chromosome, we found a gene APBB1IP, which is transcriptionally active in GM12878 and KBM7 but not in HUVEC, NHEK and K562. Consulting the GeneHancer database [61], we identified the regulatory elements for this gene (enhancers and promoters) in the interval between 26.65 and 27.15 Mb. Notably, our Multi-CD solutions show that for the case of APBB1IP gene the interval associated with the regulatory elements is fully enclosed in the same TAD in GM12878 and KBM7, whereas it is split into different TADs in the other three cell lines (Fig 3K). This is an important demonstration of how the 3D chromosome structure regulates the gene expression level, which is also consistent with the understanding that TADs are the functional units of the genome [7, 8, 17, 20]. More generally, the observation suggests that gene expression depends on the spatial organization of the genome, captured by our TAD solutions, which constrains the interaction between the gene and the regulatory elements (also see S5 Fig for additional examples of TAD structure-dependent cell type specific gene expression).

Compartments as the best domain solution that coexists with TAD-like domains

Looking at the correlation pattern in Hi-C data (for example Fig 2B and 2C), one can hardly miss the prominent global structure with a characteristic checkerboard pattern demonstrated in the off-diagonal part of the matrix. These highly non-local, super-Mb-sized domains are generally defined as the compartments in the chromosome organization [27]. Given the large scale of the compartments, compartment-like solutions were initially anticipated in the large-λ limit of Multi-CD. However, a naïve application of Multi-CD by increasing λ failed in identifying the compartments; some non-local CDs were found (Fig 3C), but the characteristic pattern of compartments was not obtained by merely increasing the value of λ.

We hypothesized that compartments correspond to a secondary CD solution that coexists with the primary solution (in this case the TAD-like domains, as already identified), rather than belonging to the same family of solutions. As described in Methods, such secondary solution can be inferred by applying an extended version of Multi-CD, effectively taking out the contribution of the primary solution from the correlation matrix C. We consider a simplified version of this problem, and remove from C a diagonal band of width 2 Mb, corresponding to the known size of meta-TADs (Fig 4A). Along with our prior knowledge of the primary solution, this approximation has the advantage that the secondary inference (for compartments) can be performed independently from the outcome of the primary inference (for the TAD-like domains). In this case, the algorithm successfully captures the non-local correlations, and identifies two large compartments with alternating patterns (Fig 4B). The correspondence is clearer when the indices of segments are re-ordered (Fig 4C). Because the larger CD (k = 1) shows a greater number of contacts (Fig 4D), it can be associated with the B-compartment, which is usually more compact; k = 2 is associated with the A-compartment. Further validation of the two compartments will be presented below, through comparisons with epigenetic markers.

Fig 4. Domain solutions for compartments.

Fig 4

(A) Input correlation data for compartment identification. The 2-Mb diagonal band was removed. (B) Lower triangle: CDs obtained at λ = 90, based on the diagonal-band-removed data, which we identify as the compartments. Upper triangle: the CO/E matrix shown for comparison. (C) Same pair of data, after re-ordering to collect the two largest CDs in our solution, k = 1 and k = 2 (lower triangle). The CO/E is simultaneously reordered to show a clear separation of correlation patterns (upper triangle). (D) Intra-domain contact profiles for the two CDs k = 1 and k = 2. The domain solution k = 1 is locally more compact, with more contacts at short genomic distances. We therefore identify k = 1 as the B-compartment, and k = 2 as the A-compartment. (E) nMI between CDs at varying λ and CO/E, showing a plateau in the range 70 ≤ λ ≤ 100. CDs inferred by Multi-CD show consistently higher nMI, compared to sub-compartments (dashed line) and compartments (dotted line) from a previous method [19].

To compare the goodness of our compartments with existing methods, we calculated the nMI against the CO/E matrix, the conventional form for compartment identification (Methods) [18]. We find that Multi-CD outperforms GaussianHMM [19], a widely accepted benchmark algorithm, in capturing the large-scale structures in Hi-C (Fig 4E).

Multi-scale, hierarchical organization of CDs

Now that we identified four classes of CD solutions, namely sub-TADs, TADs, meta-TADs and compartments, we examined their relationships. Note that these CDs were obtained independently at the respective λ values, not through a hierarchical merging. Sub-TADs or TADs are almost always nested inside a meta-TAD, and TADs inside a meta-TAD, whereas there are mismatches between the TAD-like domains and the compartments (Fig 5). We quantified this relationship in terms of a nestedness score h, such that h = 0 indicates the chance level and h = 1 a perfect nestedness (Methods), along with a visual comparison of each pair of CD solutions (Fig 5A). This analysis confirms that there exists an appreciable amount of hierarchy between any pair of TAD-like domains (sub-TADs, TADs, and meta-TADs). On the other hand, the hierarchical links between the TAD-like domains and compartments are much weaker, which is again consistent with the recent reports that TADs and compartments are organized by different mechanisms [62, 63].

Fig 5. Hierarchical organization of CD families.

Fig 5

(A) Hierarchical structure of CDs are highlighted with the domain solutions for sub-TADs (red), TADs (green), meta-TADs (blue) and compartments (black). Shown for chr10 of GM12878. Each square panel overlays a pair of CD solutions; number above the panel reports the nestedness score. Inset: a reprint of the nestedness scores in a tetrahedral visualization with the four representative CD solutions. (B) A schematic diagram of inferred hierarchical relations between sub-TADs, TADs, meta-TADs and compartments, based on our calculation of nestedness scores.

Although the nestedness score between sub-TADs and compartments (nestedness score h = 0.67) is not so large as those among the pairs of TAD-based domains, it is still greater than those between TADs and compartments (h = 0.55) or between meta-TADs and compartments (h = 0.43). Thus, sub-TAD can be considered a common building block of the chromatin architecture (see Fig 5B).

Validation of domain solutions

The CD solutions from Multi-CD are in good agreement with the results of several existing methods that specialize in particular domain scales. Specifically, our CDs correspond to the previously proposed sub-TADs [19] at λ = 0, to the TADs [22] at λ ≈ 10, and to the compartments [19] at λ ≈ 90 (S7 Fig). When assessed in terms of the nMI, Multi-CD outperforms the corresponding alternatives (ArrowHead [19], DomainCaller [22], GaussianHMM [19] for sub-TADs, TADs, meta-TADs) at the respective scales (Fig 6A).

Fig 6. Validation of CD solutions from Multi-CD.

Fig 6

(A) In terms of the normalized mutual information between the CD solutions and the input data, Multi-CD outperforms ArrowHead, DomainCaller and GaussianHMM at the corresponding scales (sub-TAD, TAD and compartment). (B) The correlation function χ(d) between CTCF signals and the domain boundaries. Shown for sub-TADs and TADs, obtained from Multi-CD (left); from ArrowHead and DomainCaller (right). (C) Genome-wide, locus-dependent replication signal. Top panel shows the A- (blue) and B- (red) compartments inferred by Multi-CD. Bottom panels show the replication signals in six different phases in the cell cycle, shaded in matching colors for the two compartments. (D) Pearson correlation between the replication signals and the two compartments A (filled blue) and B (open red).

We also compared our CD solutions with several biomarkers that are known to be correlated with the spatial organization of the genome [64]. All results shown here are for chr10 of GM12878.

First, we calculated how much the boundaries of our sub-TAD and TAD solutions are correlated with the CTCF signals, which are known to be linked to TAD boundaries [22, 23] (Fig 6B). We quantified this in terms of a correlation function, χ(d), where d is the genomic distance between a domain boundary and each CTCF signal (Methods). Compared with those of ArrowHead and DomainCaller, the correlation function calculated for the results from Multi-CD shows a similar enrichment of CTCF signals at domain boundaries (high peak of correlation at d ≈ 0), along with better precisions (fast decay of correlation as d increases) (see Fig 6B). Specifically, when fitted to exponential decays, the correlation lengths are 34 kb (λ = 0) and 143 kb (λ = 10) for Multi-CD, compared to ≳ 900 kb for the two previous methods (Fig 6B).

Next, we compared our compartment solutions (CDs at λ = 90, shown in Fig 4B) with the replication timing profiles (Repli-Seq), which are known to correlate differently with the A- and B- compartments [9, 65]. Our inferred compartments exhibit the anticipated patterns of replication timing (Fig 6C); the A-compartment shows an activation of replication signals in the early-phases (G1, S1, S2) and a repression in the later phases (S3, S4, G2), whereas the B-compartment shows an opposite trend. There is a clear anti-correlation between the replication patterns in the two compartments along the replication cycle (Fig 6C), which is also quantified in terms of the Pearson correlation (Fig 6D). Comparison to other epigenetic markers, such as the pattern of histone modifications, further confirms the association of our CD solutions with the A/B-compartments (S8 Fig).

Discussion

Multi-CD is a unified framework for Hi-C data analysis and a principled interpretation of Hi-C from the viewpoint of polymer and statistical physics, enabling identification of CDs at various genomic scales. As a computational algorithm, Multi-CD includes two core steps: the pre-processing of raw Hi-C data into a correlation matrix, and the inference of chromatin domain (CD) solutions from the correlation matrix.

The pre-processing, based on a model of the chromosome as a gaussian polymer network, allows us to make a physically justifiable interpretation of the Hi-C data. This fundamentally differentiates our approach from other methods in which the target patterns need to be selected empirically. Moreover, the polymer physics-based transformation obviates the need for the arbitrary use of a nonlinear (most often logarithmic) scaling to enhance the correlation patterns in the Hi-C data.

The inference problem is rigorously formulated based on a statistical mechanical model of the pairwise correlation matrix in the presence of hidden correlated domains. We note that in the current model, domains are defined very generally in terms of a many-to-one grouping of the chromatin segments. As a result, our method deals with non-local CDs (occupying non-consecutive locations on the genome) as naturally as it does with the local CDs (consecutive intervals between two boundaries on the genome). This differentiates Multi-CD from previous methods that focus on local features in Hi-C, such as CD boundaries or loops enriched with higher contact frequencies.

As an important feature, Multi-CD offers a one-method-fits-all framework to identify CDs at multiple scales, tuning a single parameter λ to control the preference to coarser solutions. Because the resulting family of CD solutions share the same formulation, it is possible to make quantitative comparisons between CD solutions at different scales. Moreover, Multi-CD can find CDs across a wide range of scales without having to adjust or down-sample the Hi-C data to match the scale of CDs to be identified, which is an important improvement over many existing methods.

Applying Multi-CD to Hi-C data from five human cell lines, our analysis revealed special scales at which the CD solutions are particularly interesting: sub-TADs (λ = 0), TADs (λ = 10, where domain conservation was strongest), and meta-TADs (λ = 30, where the correlation pattern was best captured). At larger scales, we found that compartments (λ = 90) emerge as a secondary solution that can be inferred after explaining away the contributions from the TAD-like solutions. We confirmed that Multi-CD successfully reproduces, or even outperforms, the existing methods to identify CDs at the specific scales. Importantly, Multi-CD achieves this performance through a single unified algorithm, which not only identifies the specific CD solutions accurately, but also allows a comparative analysis of the multi-scale family of solutions.

In particular, we characterized the hierarchical organization of the chromatin by quantifying the similarity and the nestedness between CD solutions at two different scales. We showed that the characteristics of CD solutions shared by the local, TAD-like domains do not precisely hold together in the non-local, compartment-like domains. This finding is consistent with the recent studies which report that compartments and TADs are formed by different mechanisms of motor-driven active loop extrusion and microphase separation, and that they do not necessarily have a hierarchical relationship [63, 6668]. It is also consistent with our modeling assumptions that compartments are the secondary solution, whereas the TAD-like domains belong to the family of primary solutions. The ability of our model to appropriately describe the secondary solution, in addition to a family of primary solutions, further highlights the strength of our framework. Meanwhile, the sub-TADs are nested in each of the other three solutions, including the compartments (Fig 5), indicating that sub-TADs are the fundamental building blocks of the higher-order CD organization. We note that this is not a trivial consequence of the finite data resolution; the sub-TADs are robustly recovered when Multi-CD was applied to Hi-C data at a 10-fold finer resolution of 5kb (S6 Fig).

While there are methods that report hierarchical CDs [33, 34], Multi-CD makes significant advances both algorithmically and conceptually. Multi-CD can detect non-local domains with better flexibility instead of finding a set of intervals. Multi-CD also avoids the high false-negative rate that is typical of the previous method (e.g., TADtree [33]) that focuses on the nested domain structure (S9 Fig). Further, employing an appropriate prior to explore the solution space effectively, Multi-CD can avoid the problem encountered in Armatus [34] which skips detection of domains in some part of Hi-C data while its single scale parameter is varied (S9 Fig).

Multi-CD is a method of great flexibility that can be readily applied to analyze any dataset that exhibits pairwise correlation patterns. However, two cautionary remarks are in place for more careful interpretation of the results. (i) In general, the relevant values of λ depend on the resolution of the input Hi-C dataset, as well as on the cell type. While λ is a useful parameter that allows comparative analysis, its specific value does not carry any biological significance. Although we referred to a specific CD solution by the corresponding value of λ in the current analysis (Fig 3), the lesson should not be that TADs, for instance, always correspond to the particular value of λ; instead, TADs should be identified as the most conserved CD solutions across cell types after scanning a range of λ’s. (ii) Multi-CD is agnostic about whether the collected data is homogeneous or heterogeneous. Application of Multi-CD to single-cell Hi-C data, and the subsequent interpretation of the result, would be straightforward; however, if the input Hi-C data were an outcome of a mixture of heterogeneous subpopulations, the solution from Multi-CD would correspond to their superposition. This is a fundamental issue inherent to any Hi-C data analysis method. Nevertheless, the population-averaged pattern manifest in Hi-C carries a rich set of information that is specific to the cell type. The need for interpretable inference methods that can extract valuable insights into the spatial organization of the genome, including ours, is still high.

To recapitulate, in order to glean genome function from Hi-C data that varies with the genomic state [1215], a computationally accurate method for characterizing the domain organization is of vital importance. Multi-CD is a physically principled method that identifies the multi-scale structure of CDs, by solving a family of optimization problems with a single tuning parameter. We find the resulting CD solutions in excellent match with biological data such as CTCF binding sites and replication timing signal. Our framework enables quantitative analyses of CD structures identified across multiple genomic scales and various cell types, offering general physical insights into the chromatin organization inside cell nuclei.

Methods

Interpretation of Hi-C data

We first describe our physical interpretation of Hi-C. We construct a polymer network model of the chromosome, and assume that Hi-C is essentially a sampling procedure for the contact probability between pairs of segments in the network.

Gaussian polymer network model for the chromosome

For a polymer chain whose long-range pairwise interactions are restrained via harmonic potentials with varying stiffness, the probability density function of the distance between a pair of segments i and j is written in the following form:

P(rij;γij)=4πγij3/2rij2exp(γijrij2), (2)

where γij amounts to the “stiffness” or the “spring constant” of the harmonic potential, up to a factor of the energy unit kBT. (Note that this T is the physical temperature, which should not be confused with the effective temperature for the simulated annealing in the inference procedure). This γij is directly related to the positional covariance determined by the topology of polymer network [16, 69]. More specifically, we describe the spatial positions of the polymer segments using a gaussian distribution with zero mean and and a covariance matrix Σ, with elements (Σ)ij = σij = 〈δriδrj〉. Then it follows that the distance rij = |rirj| between two different monomers i and j (ij) can be described in the form of a weighted gaussian function (Eq 2) where the variance (2γij)−1 is associated with the covariance matrix elements as

γij1=2(σii+σjj2σij). (3)

The contact probability between the two segments, pij, is the probability that their pairwise distance rij is below a cutoff distance rc. The contact probability is then calculated as pij=0rcdrP(r;γij). Once we obtain the contact probability pij from the Hi-C data (more details below), this model allows us to reversely solve for the correlation matrix C. Specifically, there is a one-to-one mapping from the contact probability pij to the stiffness parameter γij, which allows one to determine the covariance matrix {σij}, and consequently the cross-correlation matrix, (C)ij=σij/σiiσjj with −1 ≤ (C)ij ≤ 1 for ij and (C)ii = 1 (correlation of the self-interaction is unity) (more details below). In summary, the following transformation from pij to (C)ij is conceived (see also Fig 2):

pijγijσij(C)ij. (4)

Normalization and contact probability

Here we describe how the Hi-C data can be interpreted as a set of contact probabilities for pairs of genomic segments, pij. Typically, a Hi-C matrix have widely varying row-sums; for example, the net count of the i-th segment in the experiment is much larger than the net count of the j-th segment. To marginalize out this site-wise variation and only focus on the differential strengths of pairwise interactions, the raw Hi-C matrix Mraw is normalized to have uniform row and column sums. This is achieved using the Knight-Ruiz (KR) algorithm [70], which finds a vector v = (v1, ⋯, vN) for calculating (Mij = vivj(Mraw)ij, such that each row (column) in M sums to 1.

We assume that the normalized Hi-C signal is proportional to the contact probability: (M)ijpij. Note that pij is the probability that the two segments i and j are within a contact distance, and the rows of the contact probability matrix (P)ij = pij is not required to sum to 1. Because the proportionality constant is unknown a priori, however, we have a free parameter to choose. We do this by fixing the average nearest-neighbor contact probability, p¯1=pi,i+1. We expect the p¯1 to be relatively close to 1, assuming that nearest-neighbor contact is likely; but not exactly 1, because there are variations among the nearest-neighbor Hi-C signal. In this work we chose p¯1=0.9. The resulting contact probability matrix P is given as (P)ij=min(1,p^ij), with p^ij=(p¯1/μ)(M)ij, where μ = 〈Mi,i+1〉 is the Hi-C signals averaged over the nearest-neighbors. At p¯1=0.9, in our case, the fraction of over-saturated elements (p^ij>1) was sufficiently small.

Building the correlation matrix from Hi-C

Here we continue to build the connection between the contact probability and the correlation matrices. The contact probabilities can be calculated from the distribution of pairwise distances (Eq 2), by saying that two segments i and j are in contact when their distance rij is below a cutoff, rc. In other words, we write

pij=0rcP(rij;γij)drij=erf(γij1/2rc)2rcγijπeγijrc2, (5)

where erf(x)=2π0xdtet2. Because this pij is a monotonically increasing function of γij, the value of γij is uniquely determined for each pij. It is straightforward to perform this inverse mapping numerically; see our publicly available code for a simple Matlab implementation of the procedure.

Once we have the γij’s, we can reconstruct the covariance matrix {σij} using Eq 3. Depending on the Hi-C pipeline, the Hi-C-derived contact matrix, pij, may or may not contain the diagonal entries, which reflect the contact frequency within each genomic region; however, regardless of the value of pii, the correlation of self-interaction is unity by definition ((C)ii = 1). Note that although the value of γij depends on the choice of rc, its effect is only to scale the γij’s as γijrc2γij, and consequently the σij’s.

Finally, we normalize the covariance matrix to build the correlation matrix C:

(C)ij=σijσiiσjj=σijσc=114σcγij. (6)

where we assume a uniform variance σii = σjj = σc along the diagonal, so as to set the overall intensity of C. Here, we chose the value of σc as the median of 1/4γij, i.e., σc = median(1/4γij). This choice of σc was motivated to ensure that the resulting C has balanced fractions of positive and negative correlation, making the most of the value range [−1, 1] such that the global correlation pattern is clearly visible. This also cancels out the scaling effect of rc in σij, so that the choice of rc does not affect the ultimate construction of the correlation matrix C.

Generative model for the correlation matrix

We now consider a generative model for the cross-correlation matrix C that displays correlated domains. Specifically, we adapt a statistical mechanical formalism known as the group model [5658], which is used to cluster correlated domains in a given correlation data.

The group model

Let us assume that each genomic segment i ∈ {1, 2, ⋯, N} belongs to a chromatin domain si. Then the vector s = (s1, s2, …, sN) can be called the domain solution for the N segments. For example, a state s = (1, 1, 1, 2, 2, 3) describes a structure where the 6 genomic segments are clustered into 3 domains. Indexing of the domains is arbitrary. If there are K distinct domains in the solution, we can always index the domains such that si ∈ {1, 2, ⋯, K}.

We also assume that the cross-correlation matrix C can be represented by the correlation between a set of hidden variables {xi} where xi denotes the “genomic state” of the i-th chromatin segment. Without loss of generality, we only consider the case where xi has zero mean and unit variance. Adapting the formalism in Refs. [56, 57], we assume that xi obeys the following stochastic equation

xi=gsi1+gsiηsi+11+gsiϵi (7)

where ηsi and ϵi are two independent and identically distributed (i.i.d) random variables with ηsi,ϵiN(0,1), that are linked to the domain (si) and the individual segment (i) respectively. The clustering strength parameter gsi(0) is associated with each domain si, such that a larger gsi indicates a stronger contribution from the domain-dependent variable ηsi. The cross-correlation between two segments i and j is written as

xixj=gsi1+gsiδsisj+11+gsiδij. (8)

In light of Eq 8, the first term of Eq 7 on the right hand side contributes to intra-CD correlation of the si-th CD; the second term of Eq 7 corresponds to the domain-independent noise. With a larger gsi, the domain si becomes more clearly discernible from other domains (sjsi).

We are interested in the inference problem in which the model correlation matrix 〈xi xj〉 is fitted against the data correlation matrix C,

(C)ijxixj, (9)

to find the best domain solution si along with the corresponding clustering strength parameter gsi.

Incorporating the secondary group to the group model

The group model assumes that the correlation pattern in C is generated from a single level of group structure. However, there may be more complex situations in which the correlation pattern is hierarchical, not fully described by a single group. Here we expand the model to include up to the secondary group, whose contribution is, by construction, weaker than the primary group.

Suppose that the underlying grouping was bivariate, i ↦ (si, ui), such that each genomic locus i simultaneously belongs to a primary group si and a secondary group ui. Generalizing Eq 7, we assume a linear model

xi=ϵi+gsiηsi+huiξui1+gsi+hui (10)

where ξui and hui are, respectively, the random variable and the grouping strength parameter for the secondary group ui. If we further assume that si and ui are statistically independent, the pairwise correlation between two loci i and j can be written as

xixj=δij+gsiδsi,sj+huiδui,uj1+gsi+hui; (11)

the contributions from different groups are additive. We can rearrange Eq 11 to write

1+gsi+hui1+hui(xixjgsiδsi,sj1+gsi+hui)=δij+huiδui,uj1+hui, (12)

such that the right-hand side becomes the single-group model. The left-hand side of this expression is a normalized residual of the correlation, which we will call (Cres)ij. If we have already inferred the primary group si and the corresponding strength g˜si without considering the secondary group, the correspondence to this two-group model (due to normalization) is given as gsi=(1+hui)g˜si. Substituting this and replacing the model correlation 〈xi xj〉 with the data (C)ij simplify the left-hand side of Eq 12 to

(Cres)ij=(1+g˜si)(C)ijg˜siδsi,sj. (13)

Note that Cres is independent of the unknown secondary solution u (and h). Now u is the solution of a modified single-group problem, using this residual correlation Cres as the input data.

To infer the secondary group solution, therefore, one can simply repeat the inference procedure for the single-group problem after obtaining the primary group solution.

Formulating the inference problem

Our goal is to identify the domain solution s = (s1, s2, ⋯, sN) that best represents the pattern in the correlation matrix C, with an appropriate set of strength parameters g = (g1, g2, ⋯, gK), where K is the number of distinct domains in the solution. Using the group model described above, we can calculate the likelihood p(C|s, g) that the observed correlation matrix C was drawn from an underlying set of domains s with strengths g [58] (See S2 Appendix for derivation). The log-likelihood, log p(C|s, g), is written as a sum over all domains in the solution s:

logp(C|s,g)=12k=1K[(1+gk)(nkgkck1+gknk)nklog(1+gk)+log(1+gknk)], (14)

where nk=i=1Nδsi,k is the size of domain k, and ck=i,j=1NCijδsi,kδsj,k is the sum of all intra-domain correlation elements. The log-likelihood in Eq 14 is maximized at g^k=(cknk)/(nk2ck) for each k, allowing us to consider the reduced likelihood p(C|s)maxgp(C|s,g)=p(C|s,g^).

For convenience, we write the likelihood function as p(C|s)exp(E(s|C)/T) to resemble a Boltzmann distribution, where E corresponds to the energy or to L(C;s) introduced in Eq 1, and T is the effective temperature that will be used later in the simulated annealing. At g=g^, E(s|C) reads

E(s|C)=12k=1K[logcknk+(nk1)lognk2cknk2nk]. (15)

The problem of finding the maximum likelihood solution s is equivalent to finding the energy-minimizing s for E(s|C).

Besides evaluating how well a domain solution s explains the correlation pattern in the data, we also want to impose an additional preference to more parsimonious solutions. Instead of explicitly fixing the number of domains, K, we modify the optimization problem using the method of Lagrange multiplier:

minsHλ(s|C),Hλ(s|C)=E(s|C)+λK(s), (16)

where K is a generalized number of domains. Specifically, we define K(s) as

K(s)=exp(k=1Kpklogpk),pk=nkN, (17)

such that logK(s) is the entropy of s. In particular, this quantity reduces to K(s)=K in the regime where the domain sizes are uniform. The Lagrange multiplier λ(≥ 0) is a parameter that controls how strongly the problem prefers parsimonious solutions: with a larger λ, the solution s* for the optimization problem would tend to have fewer domains. Equivalently, a larger λ prefers a larger-scale domain solution. Solving at multiple values of λ, therefore, may reveal the multi-scale domain structure in data.

In parallel to the statistical physics problem of a grand-canonical ensemble, H corresponds to the effective Hamiltonian of the system. In this view, λ amounts to the negative chemical potential for adding extra domains to the solution. From the Bayesian viewpoint, on the other hand, our formulation is equivalent to considering a prior distribution of the form p(s)exp(λK(s)/T), and a posterior distribution p(s|C)p(C|s)p(s)exp(Hλ(s|C)/T). The maximum a posteriori inference is equivalent to solving the minimization problem for Hλ(s|C).

Solving the inference problem

We solve the optimization problem at a fixed value of λ, to find the minimizer s* for Hλ(s|C). As the s-space is expected to be high-dimensional and is likely characterized with multiple local minima, we use simulated annealing [71], in which the sampled distribution is narrowed down as T is gradually decreased. At each value of T, we use a Markov chain Monte Carlo sampling method to approximate the posterior distribution p(s|C)exp(Hλ(s|C)/T). See S3 Appendix for details.

We repeat the procedure to determine the best domain solutions at each different value of λ. This yields a family of multi-scale domain solutions, {sλ}. We note that there is no a priori notion of an optimal λ at this point; instead, certain sλ’s may be more interesting or relevant, depending on the specific dataset and context. In Results, we applied our method to example Hi-C datasets, and identified and discussed the existence of such interesting solutions.

Analysis and evaluation of domain solutions

Similarity between two CD solutions

To measure the extent of similarity between two CD solutions s and s′, we evaluate the Pearson correlation. The binary matrices B and B′ that represent the two CD solutions, are defined such that the matrix element are all 1’s within the same CD and 0 otherwise. i.e., (B)ij=Bij=δsisj. The similarity between B and B′ is quantified using the Pearson correlation

ρ=δBδB(δB)2(δB)2, (18)

where δBδB=(BijB¯)(BijB¯)ij, and (δB)2=(BijB¯)2ij. The average 〈⋅〉ij runs over all distinct pairs.

Normalized mutual information

We use the mutual information to evaluate how well a CD solution s captures the visible patterns in the pairwise correlation data. We consider the binary grouping matrix (B)ij=Bij=δsi,sj for the CD solution of interest, and compare it to the input data matrix (A)ij = Aij. In this study, either log10 M or CO/E was used for A. Treating the matrix elements aA and bB as two random variables, we construct the joint distribution

p(A,B)=δAij,aδBij,bij (19)

where 〈⋅〉ij is an average over all distinct pairs. The Kronecker delta for the continuous variable a is defined in a discretized fashion: that is, δAij,a=1 if Aij ∈ [a, a + Δa) and 0 otherwise, where Δa(= [max{Aij} − min{Aij}]/100) is used for discretization into 100 bins. Then we can calculate the mutual information,

I(A;B)=aAbBp(a,b)log[p(a,b)p(a)p(b)], (20)

and the normalized mutual information (nMI),

nMI(A;B)=I(A;B)H(A)·H(B), (21)

where H(X) = −∑xX p(x) log p(x) is the marginal entropy.

Nestedness of CD solutions

Here we define a measure to quantify the nestedness between two CD solutions, s (assumed to have smaller domains on average) and s′ (larger domains). The idea is the following: s is perfectly nested in s′ if, whenever two sites belong to a same domain in s, they also belong to a same domain in s′. For each domain ks, we consider the best overlap of this domain k on the other solution s′:

h1(ks)=maxksvk,knk (22)

where vk,k is the number of overlapping sites between two domains ks and k′ ∈ s′, and nk is the size of domain k. The highest score h1(ks′) = 1 is obtained when domain k is fully included in one of the domains in s′. The null hypothesis corresponds to where the domains in s and s′ are completely uncorrelated, in which case h1 only reflects the overlap “by chance”. The chance level h¯1(ks) is calculated by making nk random draws from s′; we averaged over 100 independent trials. We normalize the score as

h^1(ks)=h1h¯11h¯1, (23)

such that h^1(ks)=0 indicates the chance level, and h^1(ks)=1 means a perfect nestedness. Finally, we define the nestedness score h(ss′) for the entire CD solution as a weighted average:

h(ss)=ksh^1(ks)·nkN. (24)

The observed/expected matrix and its Pearson correlation matrix

The observed/expected (O/E) matrix was used to account for the genomic distance-dependent contact number due to random polymer interactions in chromosome [18]. Each pair (i, j) in O/E matrix is calculated by taking the count number Mij (observed number) and dividing it by average contacts within the same genomic distance d = |ij| (expected number). Since the expected number could be noisy, one smooths it out by increasing the window size (see refs. [18, 19] for further details). In this study, we used the expected number obtained from [19]. The Pearson correlation matrix of the O/E (CO/E) represents the overall contact pattern through pairwise correlation coefficients between segments.

Correlation between CTCF signal and domain boundaries

The validity of domain boundaries, determined from various CD-identification methods including Multi-CD, is assessed in terms of their correlation with the CTCF signal. Suppose that the CTCF signal at genomic segment i is given as ϕCTCF(i). Then, we can consider an overlap function between ϕCTCF(i) and a CD-boundary indicating function ψDB(i), where ψDB(i) = 1 if the i-th segment is precisely at the domain boundary; ψDB(i) = 0, otherwise. We evaluated a distance-dependent, normalized overlap function χ(d), defined as

χ(d)=δϕCTCF(i+d)ψDB(i)iψDB, (25)

where δϕCTCF = ϕCTCF − 〈ϕCTCF〉. If the domain boundaries determined from Multi-CD is well correlated with TAD-capturing CTCF signal, a sharply peaked and large amplitude overlap function (χ(d)) is expected at d = 0.

Correlation between epigenetic marks and compartments

We calculate the correlation of our compartment solutions with the epigenetic marks. Given a compartment solution s with two large domains A and B, we consider two binary vectors q(A) and q(B), where qi(A)=+1 if the i-th segment belongs to compartment A, and q(A) = −1 otherwise. For a set of epigenetic marks measured across the genome is represented with h, where its component hi denotes the value at the i-th genomic segment, the correlation between the solutions of compartment A and B and the epigenetic marks can be evaluated using the Pearson correlations as:

cA=(q(A)·h)|q(A)||h|,cB=(q(B)·h)|q(B)||h|. (26)

Data acquisition

All data used in the paper were obtained from publicly available repositories.

Hi-C data

We used Hi-C data from [19], available at NCBI GEO database [72] (https://www.ncbi.nlm.nih.gov/geo/), accession GSE63525. Analysis was performed using the intrachromosomal contact matrices files for GSE63525- celltype, where celltype is replaced by one of the five cell type identifiers (GM12878-primary, HUVEC, NHEK, K562, and KBM7).

Genomics data

Information about genes on the human chromosome was obtained from the Known Gene table in the UCSC Genome Browser [73], and their annotated regulatory elements from the GeneHancer [61] interaction table, both accessed through the UCSC Table Browser [74] (https://genome.ucsc.edu/cgi-bin/hgTables). All data were consistent with the human genome assembly GRCh37 (hg19). Information about the APBB1IP gene was specifically obtained from the GeneCards database [75] (https://www.genecards.org).

Biological markers

The domain solutions from Multi-CD were compared with known biological markers. We obtained these data mostly from the ENCODE project [76]; the NCBI GEO [72] accession numbers are provided below. We used the enrichment data of the transcriptional repressor CTCF measured in a ChiP-seq assay from GEO accession GSM749704 (narrowpeak file). We binned the CTCF assay at 50-kb resolution, to match the Hi-C format. If there are multiple signal enrichments in a single bin, we took the average value. Because each CTCF signal has a finite width, there are occasional cases where a signal ranges across two bins; in those cases we evenly divided the signal strength into the two bins. The Repli-seq signals in the six phases G1, S1, S2, S3, S4, and G2 were obtained from GEO accession GSM923451, and the 11 histone mark signals from accession GSE29611. The Repli-seq and histone mark signals were averaged over 50-kb windows along the genome to construct the replication timing profiles. The RNA-seq data for the four cell lines GM12878, HUVEC, NHEK and K562 were obtained from GEO accession GSE33480. RNA-seq for the cell line KBM7 were separately obtained from [77] (https://opendata.cemm.at/barlowlab/2015_Kornienko_et_al/hg19/AK_KBM7_2_WT_SN.F.bw).

Code availability

The Matlab software package and associated documentation are available online (https://github.com/multi-cd).

Supporting information

S1 Appendix. Gaussian polymer network for modeling chromosomes.

(PDF)

S2 Appendix. Derivation of the likelihood function.

(PDF)

S3 Appendix. The inference algorithm.

(PDF)

S1 Fig. Distance distributions of segment pairs are described by Gaussian.

(A) Gaussian probability distribution plotting P(rij) with different values of γij (Eq 2). The shaded area in different colors represents the corresponding values of contact probabilities, (Eq 5 at rc = 1) (B) Distance distributions between one TAD (TAD17) and other TADs on Chr21 in human IMR90 cells measured with FISH. This figure was adapted from Fig 3A in [21]. (C) Distance distributions between three FISH probes on the X chromosome of male Drosophila embryos. The experimental data were digitized from Fig 3B in [42]. Their best fits to Eq 2 are plotted with solid lines. (D) Distance distributions between five pairs of FISH probes on chr1 in fibroblast cells. The experimental data (histograms) were digitized from Fig 4B in [43]. The fits using Eq 2 are plotted with solid lines. (E) Distance distributions between seven pairs of FISH probes in the Tsix/Xist region on the X chromosome of mouse ESC. The experimental data (black lines) were digitized from Fig 2F in [44], and their corresponding fits are shown in red.

(PDF)

S2 Fig. Chromatin domain solutions for chromosomes 4, 10, 11 and 19.

Extension of Fig 3. (A) Relative sizes of chromosomes considered, aligned at the centromeres. The gray shade in each chromosome indicates the 10-Mb interval for which we show the Hi-C data in the next panels. (B-E) Hi-C data for the corresponding 10-Mb genomic intervals of (B) chr4, (C) chr10, (D) chr11, and (E) chr19, for the five different cell lines respectively. All the panels for chr10 are reprints of Fig 3 in the main text. (F-I) Statistics of the domain solutions for chr4, chr10, chr11, and chr19. The five cell lines are color coded as indicated at the top of (B). (F) Mean domain size 〈n〉 as a function of λ. (G) The index of dispersion D(=σn2/n) of domain sizes. (H) The goodness of domain solutions, measured in terms of the normalized mutual information with respect to Hi-C data (log10 M). (I) The similarity of domain solutions across the five different cell types, measured by the Pearson correlation between binarized contact matrices. For each chromosome, arrows indicate the likely TAD scale (highest cell-to-cell similarity) and the likely meta-TAD scale (where the nMI is high and the index of dispersion D starts to diverge).

(PDF)

S3 Fig. Examples of Multi-CD domain solutions at different scales.

Extension of Fig 3. Shown are the domain solutions obtained from Multi-CD for the five different cell lines (GM12878, HUVEC, NHEK, K562, KBM7), at (A) λ = 0 and (B) λ = 40.

(PDF)

S4 Fig. Scale-to-scale similarity of domain solutions for different cell lines.

Extension of Fig 3. We calculate the similarity between domain solutions at different λ in terms of Pearson correlation. The calculation was performed for chromosome 10 from five different cell lines.

(PDF)

S5 Fig. Link between chromatin organization and gene expression.

Extension of Fig 3K (gene APBB1IP, reproduced in the left panel), with two additional genes SVIL (middle panel) and TSPAN15 (right panel). Top five rows: Colored blocks indicate the TAD solutions for the five cell lines, as identified from our method. Colored hairy lines show the RNA-seq signal for the respective cell lines. Bottom two rows: Black horizontal bar and the gray shade mark the known position range of the gene. Finally, positions of all known regulatory elements for the gene are shown, as annotated in the GeneHancer database [61].

(PDF)

S6 Fig. Identification of sub-TAD boundaries at 5-kb resolution.

(A) The optimum cluster size, best describing 5-kb resolution Hi-C map in terms of nMI, is determined at 〈n〉 = 0.35 Mb, which is consistent with the sub-TAD size determined from 50-kb resolution Hi-C at λ = 0. (B-C) Comparison between Multi-CD solutions at different resolutions of the input Hi-C data, that point to the robustness of sub-TAD boundaries regardless of Hi-C resolution. (B) The best CD solution (corresponding to λ = λ* in panel (A)) for the 5-kb resolution Hi-C data in the 120-124 Mb region of the genome. (C) Solution for the same genomic interval from 50-kb Hi-C, determined at λ = 0. The two CD solutions are effectively identical, which supports our interpretation of sub-TAD as the unit of hierarchical chromosome organization.

(PDF)

S7 Fig. Comparison of domain solutions from Multi-CD and other methods at specific scales.

Comparison between domain solutions obtained by three popular algorithms (ArrowHead, DomainCaller, GaussianHMM) (right column) and those by Multi-CD (left column), applied to 50-kb resolution Hi-C data. Three subsets from the same Hi-C data (log10 M), with different magnification (5, 10, and 40 Mb from top to bottom), are given in the middle column. ArrowHead algorithm [19] was used for identifying the domain structures of sub-TADs, DomainCaller [22] for TADs, and Gaussian Hidden Markov Model (GaussianHMM) [19] for compartments. Multi-CD use λ = 0, 10, 90, as the parameter values for identifying sub-TADs, TADs, and compartments, respectively.

(PDF)

S8 Fig. Comparison of histone marks and compartments.

Extension of Fig 6C and 6D, which make comparison between the CD solutions for A/B-compartments by Multi-CD and epigenetic marks. The upper part with Repli-Seq signals is a reprint from the main text figure. The lower part shows histone marks on the corresponding genomic range. Majority of the histone marks are correlated with the A-compartment. The values of Pearson correlation between Repli-Seq signal or histone marks and A/B-compartment are given on the right.

(PDF)

S9 Fig. Comparison to existing algorithms for identifying domains at multiple scales.

(A, B) Normalized mutual information between domain solutions at multiple scales, from Multi-CD, Armatus [34] and TADtree [33] respectively, and the log10 of KR-normalized Hi-C matrix for chr10 of the cell line GM12878. The scale of a domain solution s is measured in two ways, in terms of (A) the effective number of clusters, K(s)=exp(k=1K(nk/N)log(nk/N)), where nk=i=1Nδsi,k is the domain size; and (B) the total area of 1’s in the corresponding binary contact matrix, (area)=i,j=1NBij where Bij=δsi,sj. All domain solutions from TADtree and Armatus were obtained using the respective default parameter settings. (C-F) Visual comparison of domains found by (C, D) TADtree and Multi-CD, and (E, F) Armatus and Multi-CD, at matching scales in terms of the average domain size. Domain solutions are shown in the upper triangle, colored by red (intra-domain) and white (extra-domain) for effective visualization. The lower triangle plots the corresponding subset of the Hi-C data (KR-normalized and in log10). Refer to the original papers [33, 34] for the definitions of the respective control parameters α (TADtree) and γ (Armatus).

(PDF)

Acknowledgments

We thank Roger Oria Fernandez for feedback on the code. We thank the Center for Advanced Computation in KIAS for providing computing resources.

Data Availability

All codes are available from https://github.com/multi-cd. All data used in the paper were obtained from publicly available repositories. The following data were obtained through the NCBI GEO database: Hi-C data, accession GSE63525; CTCF ChiP-seq, GSM749704; Repli-seq, GSM923451; histone signals, GSE29611; RNA-seq, GSE33480. RNA-seq data for the KBM7 cell were obtained from https://opendata.cemm.at/barlowlab. Information about genes (Known Genes table) and the associated regulatory elements (GeneHancer interaction table) were obtained from UCSC Genome Browser. The GeneCards database was consulted for individual gene information.

Funding Statement

This work was supported in part by a KIAS Individual Grant at Korea Institute for Advanced Study (No. CG035003 to C.H.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Misteli T. Beyond the sequence: cellular organization of genome function. Cell. 2007;128(4):787–800. 10.1016/j.cell.2007.01.028 [DOI] [PubMed] [Google Scholar]
  • 2. Davies JO, Oudelaar AM, Higgs DR, Hughes JR. How best to identify chromosomal interactions: a comparison of approaches. Nat Methods. 2017;14(2):125. 10.1038/nmeth.4146 [DOI] [PubMed] [Google Scholar]
  • 3. Dekker J, Rippe K, Dekker M, Kleckner N. Capturing chromosome conformation. Science. 2002;295(5558):1306–1311. 10.1126/science.1067799 [DOI] [PubMed] [Google Scholar]
  • 4. Bickmore WA, van Steensel B. Genome architecture: domain organization of interphase chromosomes. Cell. 2013;152(6):1270–1284. 10.1016/j.cell.2013.02.001 [DOI] [PubMed] [Google Scholar]
  • 5. Franke M, Ibrahim DM, Andrey G, Schwarzer W, Heinrich V, Schöpflin R, et al. Formation of new chromatin domains determines pathogenicity of genomic duplications. Nature. 2016;538(7624):265–269. 10.1038/nature19800 [DOI] [PubMed] [Google Scholar]
  • 6. Baca SC, Prandi D, Lawrence MS, Mosquera JM, Romanel A, Drier Y, et al. Punctuated evolution of prostate cancer genomes. Cell. 2013;153(3):666–677. 10.1016/j.cell.2013.03.021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Lanctôt C, Cheutin T, Cremer M, Cavalli G, Cremer T. Dynamic genome architecture in the nuclear space: regulation of gene expression in three dimensions. Nat Rev Genetics. 2007;8(2):104. 10.1038/nrg2041 [DOI] [PubMed] [Google Scholar]
  • 8. Fraser J, Ferrai C, Chiariello AM, Schueler M, Rito T, Laudanno G, et al. Hierarchical folding and reorganization of chromosomes are linked to transcriptional changes in cellular differentiation. Mol Sys Biol. 2015;11(12):852. 10.15252/msb.20156492 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Ryba T, Hiratani I, Lu J, Itoh M, Kulik M, Zhang J, et al. Evolutionarily conserved replication timing profiles predict long-range chromatin interactions and distinguish closely related cell types. Genome Res. 2010;20(6):761–770. 10.1101/gr.099655.109 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Hiratani I, Ryba T, Itoh M, Yokochi T, Schwaiger M, Chang CW, et al. Global reorganization of replication domains during embryonic stem cell differentiation. PLoS Biol. 2008;6(10):e245. 10.1371/journal.pbio.0060245 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Cavalli G, Misteli T. Functional implications of genome topology. Nat Struct Mol Biol. 2013;20(3):290–299. 10.1038/nsmb.2474 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Dixon JR, Jung I, Selvaraj S, Shen Y, Antosiewicz-Bourget JE, Lee AY, et al. Chromatin architecture reorganization during stem cell differentiation. Nature. 2015;518:331. 10.1038/nature14222 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Krijger PHL, De Laat W. Regulation of disease-associated gene expression in the 3D genome. Nat Rev Mol Cell Biol. 2016;17(12):771. 10.1038/nrm.2016.138 [DOI] [PubMed] [Google Scholar]
  • 14. Dixon JR, Xu J, Dileep V, Zhan Y, Song F, Le VT, et al. Integrative detection and analysis of structural variation in cancer genomes. Nat Genetics. 2018;50(10):1388. 10.1038/s41588-018-0195-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Shao Y, Lu N, Wu Z, Cai C, Wang S, Zhang LL, et al. Creating a functional single-chromosome yeast. Nature. 2018;560(7718):331. 10.1038/s41586-018-0382-x [DOI] [PubMed] [Google Scholar]
  • 16. Liu L, Kim MH, Hyeon C. Heterogeneous Loop Model to Infer 3D Chromosome Structures from Hi-C. Biophys J. 2019;117:613–625. 10.1016/j.bpj.2019.06.032 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Cremer T, Cremer M. Chromosome Territories. Cold Spring Harbor Perspectives in Biology. 2010;2(3). 10.1101/cshperspect.a003889 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Lieberman-Aiden E, Van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326(5950):289–293. 10.1126/science.1181369 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Rao SSP, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, et al. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell. 2014;159(7):1665—1680. 10.1016/j.cell.2014.11.021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Sexton T, Cavalli G. The Role of Chromosome Domains in Shaping the Functional Genome. Cell. 2015;160(6):1049—1059. 10.1016/j.cell.2015.02.040 [DOI] [PubMed] [Google Scholar]
  • 21. Wang S, Su JH, Beliveau BJ, Bintu B, Moffitt JR, Wu Ct, et al. Spatial organization of chromatin domains and compartments in single chromosomes. Science. 2016;353(6299):598–602. 10.1126/science.aaf8084 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485(7398):376–380. 10.1038/nature11082 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Nora EP, Lajoie BR, Schulz EG, Giorgetti L, Okamoto I, Servant N, et al. Spatial partitioning of the regulatory landscape of the X-inactivation center. Nature. 2012;485(7398):381. 10.1038/nature11049 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Sexton T, Yaffe E, Kenigsberg E, Bantignies F, Leblanc B, Hoichman M, et al. Three-Dimensional Folding and Functional Organization Principles of the Drosophila Genome. Cell. 2012;(3):458–472. 10.1016/j.cell.2012.01.010 [DOI] [PubMed] [Google Scholar]
  • 25. Pope BD, Ryba T, Dileep V, Yue F, Wu W, Denas O, et al. Topologically associating domains are stable units of replication-timing regulation. Nature. 2014;515(7527):402. 10.1038/nature13986 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Dekker J, Heard E. Structural and functional diversity of Topologically Associating Domains. FEBS Letters. 2015;589(20PartA):2877–2884. 10.1016/j.febslet.2015.08.044 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Dixon JR, Gorkin DU, Ren B. Chromatin domains: the unit of chromosome organization. Mol Cell. 2016;62(5):668–680. 10.1016/j.molcel.2016.05.018 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Phillips-Cremins JE, Sauria ME, Sanyal A, Gerasimova TI, Lajoie BR, Bell JS, et al. Architectural protein subclasses shape 3D organization of genomes during lineage commitment. Cell. 2013;153(6):1281–1295. 10.1016/j.cell.2013.04.053 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Jin F, Li Y, Dixon JR, Selvaraj S, Ye Z, Lee AY, et al. A high-resolution map of three-dimensional chromatin interactome in human cells. Nature. 2013;503(7475):290. 10.1038/nature12644 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Rocha PP, Raviram R, Bonneau R, Skok JA. Breaking TADs: insights into hierarchical genome organization. Epigenomics. 2015;7(4):523–526. 10.2217/epi.15.25 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Wang Q, Sun Q, Czajkowsky DM, Shao Z. Sub-kb Hi-C in D. melanogaster reveals conserved characteristics of TADs between insect and mammalian cells. Nature communications. 2018;9(1):188. 10.1038/s41467-017-02526-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Shin H, Shi Y, Dai C, Tjong H, Gong K, Alber F, et al. TopDom: an efficient and deterministic method for identifying topological domains in genomes. Nucleic Acids Res. 2015;44(7):e70–e70. 10.1093/nar/gkv1505 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Weinreb C, Raphael BJ. Identification of hierarchical chromatin domains. Bioinformatics. 2016;32(11):1601–1609. 10.1093/bioinformatics/btv485 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Filippova D, Patro R, Duggal G, Kingsford C. Identification of alternative topological domains in chromatin. Algorithms for Molecular Biology. 2014;9(1):14. 10.1186/1748-7188-9-14 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Mateos-Langerak J, Bohn M, de Leeuw W, Giromus O, Manders EM, Verschure PJ, et al. Spatially confined folding of chromatin in the interphase nucleus. Proc Natl Acad Sci U S A. 2009;106(10):3812–3817. 10.1073/pnas.0809501106 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Barbieri M, Chotalia M, Fraser J, Lavitas LM, Dostie J, Pombo A, et al. Complexity of chromatin folding is captured by the strings and binders switch model. Proc Natl Acad Sci U S A. 2012;109(40):16173–16178. 10.1073/pnas.1204799109 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Halverson JD, Smrek J, Kremer K, Grosberg AY. From a melt of rings to chromosome territories: the role of topological constraints in genome folding. Rep Prog Phys. 2014;77:022601. 10.1088/0034-4885/77/2/022601 [DOI] [PubMed] [Google Scholar]
  • 38. Brackley CA, Brown JM, Waithe D, Babbs C, Davies J, Hughes JR, et al. Predicting the three-dimensional folding of cis-regulatory regions in mammalian genomes using bioinformatic data and polymer models. Genome Biol. 2016;17(1):59. 10.1186/s13059-016-0909-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Shi G, Liu L, Hyeon C, Thirumalai D. Interphase Human Chromosome Exhibits Out of Equilibrium Glassy Dynamics. Nat Commun. 2018;9:3161. 10.1038/s41467-018-05606-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Liu L, Shi G, Thirumalai D, Hyeon C. Chain organization of human interphase chromosome determines the spatiotemporal dynamics of chromatin loci. PLoS Comp Biol. 2018;14(12):e1006617. 10.1371/journal.pcbi.1006617 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Liu L, Hyeon C. Revisiting the organization of Polycomb-repressed domains: 3D chromatin models from Hi-C compared with super-resolution imaging. Nucleic Acids Research. 2020; p. gkaa932. 10.1093/nar/gkaa932 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Szabo Q, Jost D, Chang JM, Cattoni DI, Papadopoulos GL, Bonev B, et al. TADs are 3D structural units of higher-order chromosome organization in Drosophila. Science advances. 2018;4(2):eaar8082. 10.1126/sciadv.aar8082 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Finn EH, Pegoraro G, Shachar S, Misteli T. Comparative analysis of 2D and 3D distance measurements to study spatial genome organization. Methods. 2017;123:47–55. 10.1016/j.ymeth.2017.01.007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Giorgetti L, Galupa R, Nora EP, Piolot T, Lam F, Dekker J, et al. Predictive polymer modeling reveals coupled fluctuations in chromosome conformation and transcription. Cell. 2014;157(4):950–963. 10.1016/j.cell.2014.03.025 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Kang H, Yoon YG, Thirumalai D, Hyeon C. Confinement-induced glassy dynamics in a model for chromosome organization. Phys Rev Lett. 2015;115:198102. 10.1103/PhysRevLett.115.198102 [DOI] [PubMed] [Google Scholar]
  • 46. Shi G, Thirumalai D. Conformational heterogeneity in human interphase chromosome organization reconciles the FISH and Hi-C paradox. Nature Commun. 2019;10(1):1–10. 10.1038/s41467-019-11897-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Bintu B, Mateo LJ, Su JH, Sinnott-Armstrong NA, Parker M, Kinrot S, et al. Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells. Science. 2018;362:eaau1783. 10.1126/science.aau1783 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Nir G, Farabella I, Estrada CP, Ebeling CG, Beliveau BJ, Sasaki HM, et al. Walking along chromosomes with super-resolution imaging, contact maps, and integrative modeling. PLoS genetics. 2018;14(12):e1007872. 10.1371/journal.pgen.1007872 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Nagano T, Lubling Y, Stevens TJ, Schoenfelder S, Yaffe E, Dean W, et al. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature. 2013;502(7469):59–64. 10.1038/nature12593 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Stevens TJ, Lando D, Basu S, Atkinson LP, Cao Y, Lee SF, et al. 3D structures of individual mammalian genomes studied by single-cell Hi-C. Nature. 2017;544(7648):59–64. 10.1038/nature21429 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Finn EH, Pegoraro G, Brandão HB, Valton AL, Oomen ME, Dekker J, et al. Extensive heterogeneity and intrinsic variation in spatial genome organization. Cell. 2019;176(6):1502–1515. 10.1016/j.cell.2019.01.020 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Sauerwald N, Zhang S, Kingsford C, Bahar I. Chromosomal dynamics predicted by an elastic network model explains genome-wide accessibility and long-range couplings. Nucleic acids research. 2017;45(7):3663–3673. 10.1093/nar/gkx172 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Zhang S, Chen F, Bahar I. Differences in the intrinsic spatial dynamics of the chromatin contribute to cell differentiation. Nucleic Acids Research. 2020;48:1131–1145. 10.1093/nar/gkz1102 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. Le Treut G, Képès F, Orland H. A Polymer Model for the Quantitative Reconstruction of Chromosome Architecture from HiC and GAM Data. Biophys J. 2018;115(12):2286–2294. 10.1016/j.bpj.2018.10.032 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Shinkai S, Nakagawa M, Sugawara T, Togashi Y, Ochiai H, Nakato R, et al. PHi-C: deciphering Hi-C data into polymer dynamics. NAR Genomics and Bioinfo. 2020;2(2):lqaa020. 10.1093/nargab/lqaa020 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Laloux L, Cizeau P, Bouchaud JP, Potters M. Noise dressing of financial correlation matrices. Phys Rev Lett. 1999;83(7):1467. 10.1103/PhysRevLett.83.1467 [DOI] [Google Scholar]
  • 57. Noh JD. Model for correlations in stock markets. Phys Rev E. 2000;61(5):5981. 10.1103/PhysRevE.61.5981 [DOI] [PubMed] [Google Scholar]
  • 58. Giada L, Marsili M. Data clustering and noise undressing of correlation matrices. Phys Rev E. 2001;63(6):061101. 10.1103/PhysRevE.63.061101 [DOI] [PubMed] [Google Scholar]
  • 59. Vietri Rudan M, Barrington C, Henderson S, Ernst C, Odom D, Tanay A, et al. Comparative Hi-C Reveals that CTCF Underlies Evolution of Chromosomal Domain Architecture. Cell Reports. 2015;10(8):1297—1309. 10.1016/j.celrep.2015.02.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Li A, Yin X, Xu B, Wang D, Han J, Wei Y, et al. Decoding topologically associating domains with ultra-low resolution Hi-C data by graph structural entropy. Nature Commun. 2018;9(1):3265. 10.1038/s41467-018-05691-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Fishilevich S, Nudel R, Rappaport N, Hadar R, Plaschkes I, Iny Stein T, et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database. 2017;2017:1–17. 10.1093/database/bax028 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. Nora EP, Goloborodko A, Valton AL, Gibcus JH, Uebersohn A, Abdennur N, et al. Targeted degradation of CTCF decouples local insulation of chromosome domains from genomic compartmentalization. Cell. 2017;169(5):930–944. 10.1016/j.cell.2017.05.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Schwarzer W, Abdennur N, Goloborodko A, Pekowska A, Fudenberg G, Loe-Mie Y, et al. Two independent modes of chromatin organization revealed by cohesin removal. Nature. 2017;551(7678):51. 10.1038/nature24281 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Misteli T. Beyond the sequence: cellular organization of genome function. Cell. 2007;128(4):787–800. 10.1016/j.cell.2007.01.028 [DOI] [PubMed] [Google Scholar]
  • 65. Hansen RS, Thomas S, Sandstrom R, Canfield TK, Thurman RE, Weaver M, et al. Sequencing newly replicated DNA reveals widespread plasticity in human replication timing. Proc Natl Acad Sci U S A. 2010;107(1):139–144. 10.1073/pnas.0912402107 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66. Fudenberg G, Imakaev M, Lu C, Goloborodko A, Abdennur N, Mirny LA. Formation of chromosomal domains by loop extrusion. Cell Reports. 2016;15(9):2038–2049. 10.1016/j.celrep.2016.04.085 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Gassler J, Brandão HB, Imakaev M, Flyamer IM, Ladstätter S, Bickmore WA, et al. A mechanism of cohesin-dependent loop extrusion organizes zygotic genome architecture. EMBO J. 2017;36(24):3600–3618. 10.15252/embj.201798083 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Nuebler J, Fudenberg G, Imakaev M, Abdennur N, Mirny LA. Chromatin organization by an interplay of loop extrusion and compartmental segregation. Proc Natl Acad Sci U S A. 2018;115:E6697–E6706. 10.1073/pnas.1717730115 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Bohn M, Heermann DW, van Driel R. Random loop model for long polymers. Phys Rev E. 2007;76(5):051805. 10.1103/PhysRevE.76.051805 [DOI] [PubMed] [Google Scholar]
  • 70. Knight PA, Ruiz D. A fast algorithm for matrix balancing. IMA J Numer Anal. 2013;33(3):1029–1047. 10.1093/imanum/drs019 [DOI] [Google Scholar]
  • 71. Kirkpatrick S, Gelatt D, Vecchi MP. Optimization by Simulated Annealing. Science. 1983;220:671–680. 10.1126/science.220.4598.671 [DOI] [PubMed] [Google Scholar]
  • 72. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Research. 2013;41(D1):D991–D995. 10.1093/nar/gks1193 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, et al. The Human Genome Browser at UCSC. Genome Research. 2002;12(6):996–1006. 10.1101/gr.229102 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Research. 2004;32(suppl_1):D493–D496. 10.1093/nar/gkh103 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75. Stelzer G, Rosen N, Plaschkes I, Zimmerman S, Twik M, Fishilevich S, et al. The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses. Current Protocols in Bioinformatics. 2016;54(1):1.30.1–1.30.33. 10.1002/cpbi.5 [DOI] [PubMed] [Google Scholar]
  • 76. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57. 10.1038/nature11247 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77. Kornienko AE, Vlatkovic I, Neesen J, Barlow DP, Pauler FM. A human haploid gene trap collection to study lncRNAs with unusual RNA biology. RNA Biology. 2016;13(2):196–220. 10.1080/15476286.2015.1110676 [DOI] [PMC free article] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008834.r001

Decision Letter 0

Alexandre V Morozov, Jian Ma

25 Nov 2020

Dear Prof. Hyeon,

Thank you very much for submitting your manuscript "A unified framework for inferring the multi-scale organization of chromatin domains from Hi-C" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Alexandre V. Morozov, Ph.D.

Associate Editor

PLOS Computational Biology

Jian Ma

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors developed a unified method for analyzing Hi-C data and determine multi-scale organization of chromatin chain by changing the parameter λ. The method was applied to several cell lines and obtained valuable results that are consistent with the experimental observations. The methods and the results are quite interesting. I believe it is suitable for publication after the revision. There are some issues need to be addressed:

1. The proposed method and the model are fully depended on the probability, which is obtained from the normalized Hi-C data. There are several normalization methods for Hi-C data, including ICE, KR, VC, etc. The authors should talk about the influence of the normalization algorithm on the results of the Sub-TADs, TADs, meta-TADs, and the compartments.

2. The authors claimed that the CD solutions correspond to the compartments at about λ=90. How to determine this value? If λ changes to 80 or 100, the distribution of the compartments stays unchanged or changes significantly? In addition, the results of large λ values (70 to 100) are not shown in Fig. 3.

3. The analysis of the link between chromatin organization and gene expression is very interesting and meaningful. The authors mentioned that the domain conservation was strongest at λ=10. Is this value for the results of GM12878 chr10, or the other four cell lines have the same value? In addition, can the authors add the analysis of two other gene expressions that support their results (Fig. 3k)?

4. Only part of the short arm of chr10 is shown in this study. Did the authors construct the model for the whole chr10 and calculate the properties, or for only a part of chr10? The component of the model is not so clearly described in this study. I think the results of parameter λ will change a lot for different cells, chromosomes, and regions (Fig. S2).

Reviewer #2: The authors developed a Hi-C data analysis pipeline, called Multi-CD, including two steps: pre-processing and inference based on a polymer physics approach and a statistical physics sampling approach. Interestingly, they formulated the inference problem with a single parameter λ relating to multi-scale chromatin domains and tried to provide a unified framework into the hierarchical chromatin organization. Significantly, the idea of combining the correlation matrix and group modeling is very original and innovative in the field of 3D genome physical biology. As shown in Figure 6, the outcomes by this method are consistent with other reported methods. Besides, the hierarchical organization of chromatin domain families in Figure 5 clearly shows that their unified framework works well. However, this manuscript includes little biological insights. The reviewer suggests that this methodological framework is worth publishing as not a research paper but a method one. Before a final assessment, the authors need to address the following comments in a revised manuscript.

Major:

1) at l399: γij has a physical unit [m^-2]. Therefore, stiffness or the spring constant is not the correct expression.

2) at l441: It is not clear how to solve Equation (4) from pij to γij in this method, although the uniqueness might be confirmed.

3) at l444 and Equation (5): the assumption σii = σjj = σc = median(1/4γij) is too arbitrary. The reviewer guesses that the authors could not derive the one-to-one connection between γij and σij.

According to a rigorous theory of the Gaussian polymer network [55 and https://doi.org/10.1016/j.csbj.2020.08.014], the Kirchhoff or Laplacian matrix is positive semidefinite, and the smallest eigenvalue is 0 with the eigenvector proportional to (1, 1, ..., 1). Therefore, an inverse matrix does not exist. On the other hand, γij^-1 = 2(σii + σjj - 2σij) at l405 is a rigorous relationship. Besides, the matrix Σ=(σij) satisfies Σ (1, 1, ..., 1)^T = 0. Using this relation, we can derive (γij^{-1}) (1, 1, ..., 1)^T = 2 A (σ11, σ22, ..., σNN)^T, where the diagonal elements of the matrix A are N+1 and the non-diagonal elements are 1. The inverse matrix of A is \\frac{1}{2N^2} B, where the diagonal elements of the matrix B are 2N-1 and the non-diagonal elements are -1. Therefore, we can derive the diagonal elements (σ11, σ22, ..., σNN) only from the matrix (γij^{-1}) and solve the equation γij^-1 = 2(σii + σjj - 2σij) without the assumption σii = σjj = σc = median(1/4γij).

4) Although the reviewer cannot verify the Matlab codes, information regarding the computational calculation time to obtain an optimal solution for a parameter λ must be useful for PLOS CB readers.

5) In terms of bioinformatics and computational approaches, the parameter λ would be an excellent measure to identify multi-scale chromatin domains. Also, the authors have discussed the interpretation of λ as the negative chemical potential. However, if the authors are responsible for the meaning of λ in terms of physics, they should discuss and propose a way to find a physical meaning of λ by experiments.

Minor:

1) typo at l76: the presence "of" of cell-to-cell ... Remove "of."

2) at l396: Equation (2) does not represent the distance distribution but the probability density function of the distance.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions, please see http://journals.plos.org/compbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008834.r003

Decision Letter 1

Alexandre V Morozov, Jian Ma

8 Feb 2021

Dear Prof. Hyeon,

Thank you very much for submitting your manuscript "A unified framework for inferring the multi-scale organization of chromatin domains from Hi-C" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. 

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Alexandre V. Morozov, Ph.D.

Associate Editor

PLOS Computational Biology

Jian Ma

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I believe that the authors have already addressed all my concerns. Therefore I have no further comments and recommend its acceptance.

Reviewer #2: The authors gave answers to my comments and improved the quality of the manuscript significantly. However, some of the points they addressed are still not clear. I believe, after addressing these concerns properly, this manuscript can be accepted to PLOS Computational Biology.

Comment 1 to the reply R2-3:

I understood that the authors used the tar formatted Hi-C matrix data, and the diagonal elements were missing in the format.

But, if one uses the hic formatted data, the diagonal elements are available. Moreover, invalid pairs, including self-ligation, are eliminated in a Hi-C analysis pipeline from FASTQ data. Therefore, the diagonal element p_ii indicates not self-ligation counts but the contact frequency within the i-th genomic region.

I am looking forward to your future work with improving these parts.

Comment 2 to the reply R2-5:

The authors should add a discussion on the physical implication of λ.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Computational Biology data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-materials-and-methods

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008834.r005

Decision Letter 2

Alexandre V Morozov, Jian Ma

23 Feb 2021

Dear Prof. Hyeon,

We are pleased to inform you that your manuscript 'A unified framework for inferring the multi-scale organization of chromatin domains from Hi-C' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Alexandre V. Morozov, Ph.D.

Associate Editor

PLOS Computational Biology

Jian Ma

Deputy Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008834.r006

Acceptance letter

Alexandre V Morozov, Jian Ma

10 Mar 2021

PCOMPBIOL-D-20-01936R2

A unified framework for inferring the multi-scale organization of chromatin domains from Hi-C

Dear Dr Hyeon,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Alice Ellingham

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Gaussian polymer network for modeling chromosomes.

    (PDF)

    S2 Appendix. Derivation of the likelihood function.

    (PDF)

    S3 Appendix. The inference algorithm.

    (PDF)

    S1 Fig. Distance distributions of segment pairs are described by Gaussian.

    (A) Gaussian probability distribution plotting P(rij) with different values of γij (Eq 2). The shaded area in different colors represents the corresponding values of contact probabilities, (Eq 5 at rc = 1) (B) Distance distributions between one TAD (TAD17) and other TADs on Chr21 in human IMR90 cells measured with FISH. This figure was adapted from Fig 3A in [21]. (C) Distance distributions between three FISH probes on the X chromosome of male Drosophila embryos. The experimental data were digitized from Fig 3B in [42]. Their best fits to Eq 2 are plotted with solid lines. (D) Distance distributions between five pairs of FISH probes on chr1 in fibroblast cells. The experimental data (histograms) were digitized from Fig 4B in [43]. The fits using Eq 2 are plotted with solid lines. (E) Distance distributions between seven pairs of FISH probes in the Tsix/Xist region on the X chromosome of mouse ESC. The experimental data (black lines) were digitized from Fig 2F in [44], and their corresponding fits are shown in red.

    (PDF)

    S2 Fig. Chromatin domain solutions for chromosomes 4, 10, 11 and 19.

    Extension of Fig 3. (A) Relative sizes of chromosomes considered, aligned at the centromeres. The gray shade in each chromosome indicates the 10-Mb interval for which we show the Hi-C data in the next panels. (B-E) Hi-C data for the corresponding 10-Mb genomic intervals of (B) chr4, (C) chr10, (D) chr11, and (E) chr19, for the five different cell lines respectively. All the panels for chr10 are reprints of Fig 3 in the main text. (F-I) Statistics of the domain solutions for chr4, chr10, chr11, and chr19. The five cell lines are color coded as indicated at the top of (B). (F) Mean domain size 〈n〉 as a function of λ. (G) The index of dispersion D(=σn2/n) of domain sizes. (H) The goodness of domain solutions, measured in terms of the normalized mutual information with respect to Hi-C data (log10 M). (I) The similarity of domain solutions across the five different cell types, measured by the Pearson correlation between binarized contact matrices. For each chromosome, arrows indicate the likely TAD scale (highest cell-to-cell similarity) and the likely meta-TAD scale (where the nMI is high and the index of dispersion D starts to diverge).

    (PDF)

    S3 Fig. Examples of Multi-CD domain solutions at different scales.

    Extension of Fig 3. Shown are the domain solutions obtained from Multi-CD for the five different cell lines (GM12878, HUVEC, NHEK, K562, KBM7), at (A) λ = 0 and (B) λ = 40.

    (PDF)

    S4 Fig. Scale-to-scale similarity of domain solutions for different cell lines.

    Extension of Fig 3. We calculate the similarity between domain solutions at different λ in terms of Pearson correlation. The calculation was performed for chromosome 10 from five different cell lines.

    (PDF)

    S5 Fig. Link between chromatin organization and gene expression.

    Extension of Fig 3K (gene APBB1IP, reproduced in the left panel), with two additional genes SVIL (middle panel) and TSPAN15 (right panel). Top five rows: Colored blocks indicate the TAD solutions for the five cell lines, as identified from our method. Colored hairy lines show the RNA-seq signal for the respective cell lines. Bottom two rows: Black horizontal bar and the gray shade mark the known position range of the gene. Finally, positions of all known regulatory elements for the gene are shown, as annotated in the GeneHancer database [61].

    (PDF)

    S6 Fig. Identification of sub-TAD boundaries at 5-kb resolution.

    (A) The optimum cluster size, best describing 5-kb resolution Hi-C map in terms of nMI, is determined at 〈n〉 = 0.35 Mb, which is consistent with the sub-TAD size determined from 50-kb resolution Hi-C at λ = 0. (B-C) Comparison between Multi-CD solutions at different resolutions of the input Hi-C data, that point to the robustness of sub-TAD boundaries regardless of Hi-C resolution. (B) The best CD solution (corresponding to λ = λ* in panel (A)) for the 5-kb resolution Hi-C data in the 120-124 Mb region of the genome. (C) Solution for the same genomic interval from 50-kb Hi-C, determined at λ = 0. The two CD solutions are effectively identical, which supports our interpretation of sub-TAD as the unit of hierarchical chromosome organization.

    (PDF)

    S7 Fig. Comparison of domain solutions from Multi-CD and other methods at specific scales.

    Comparison between domain solutions obtained by three popular algorithms (ArrowHead, DomainCaller, GaussianHMM) (right column) and those by Multi-CD (left column), applied to 50-kb resolution Hi-C data. Three subsets from the same Hi-C data (log10 M), with different magnification (5, 10, and 40 Mb from top to bottom), are given in the middle column. ArrowHead algorithm [19] was used for identifying the domain structures of sub-TADs, DomainCaller [22] for TADs, and Gaussian Hidden Markov Model (GaussianHMM) [19] for compartments. Multi-CD use λ = 0, 10, 90, as the parameter values for identifying sub-TADs, TADs, and compartments, respectively.

    (PDF)

    S8 Fig. Comparison of histone marks and compartments.

    Extension of Fig 6C and 6D, which make comparison between the CD solutions for A/B-compartments by Multi-CD and epigenetic marks. The upper part with Repli-Seq signals is a reprint from the main text figure. The lower part shows histone marks on the corresponding genomic range. Majority of the histone marks are correlated with the A-compartment. The values of Pearson correlation between Repli-Seq signal or histone marks and A/B-compartment are given on the right.

    (PDF)

    S9 Fig. Comparison to existing algorithms for identifying domains at multiple scales.

    (A, B) Normalized mutual information between domain solutions at multiple scales, from Multi-CD, Armatus [34] and TADtree [33] respectively, and the log10 of KR-normalized Hi-C matrix for chr10 of the cell line GM12878. The scale of a domain solution s is measured in two ways, in terms of (A) the effective number of clusters, K(s)=exp(k=1K(nk/N)log(nk/N)), where nk=i=1Nδsi,k is the domain size; and (B) the total area of 1’s in the corresponding binary contact matrix, (area)=i,j=1NBij where Bij=δsi,sj. All domain solutions from TADtree and Armatus were obtained using the respective default parameter settings. (C-F) Visual comparison of domains found by (C, D) TADtree and Multi-CD, and (E, F) Armatus and Multi-CD, at matching scales in terms of the average domain size. Domain solutions are shown in the upper triangle, colored by red (intra-domain) and white (extra-domain) for effective visualization. The lower triangle plots the corresponding subset of the Hi-C data (KR-normalized and in log10). Refer to the original papers [33, 34] for the definitions of the respective control parameters α (TADtree) and γ (Armatus).

    (PDF)

    Attachment

    Submitted filename: PLOS_HiC_reply_letter.pdf

    Attachment

    Submitted filename: Response.docx

    Data Availability Statement

    All codes are available from https://github.com/multi-cd. All data used in the paper were obtained from publicly available repositories. The following data were obtained through the NCBI GEO database: Hi-C data, accession GSE63525; CTCF ChiP-seq, GSM749704; Repli-seq, GSM923451; histone signals, GSE29611; RNA-seq, GSE33480. RNA-seq data for the KBM7 cell were obtained from https://opendata.cemm.at/barlowlab. Information about genes (Known Genes table) and the associated regulatory elements (GeneHancer interaction table) were obtained from UCSC Genome Browser. The GeneCards database was consulted for individual gene information.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS

    RESOURCES