Skip to main content
Proceedings of the National Academy of Sciences of the United States of America logoLink to Proceedings of the National Academy of Sciences of the United States of America
. 2020 Mar 19;117(14):7824–7830. doi: 10.1073/pnas.1910364117

Bayesian inference of chromatin structure ensembles from population-averaged contact data

Simeon Carstens a,b,c, Michael Nilges c, Michael Habeck a,d,1
PMCID: PMC7148566  PMID: 32193349

Significance

High-throughput chromosome conformation capture (5C, Hi-C) experiments deliver fascinating insights into 3D genome organization. They are typically performed on a large population of cells and therefore yield only average numbers of genomic contacts. Nevertheless, population Hi-C data are often interpreted in terms of a single genomic structure, which ignores all cell-to-cell variability. We propose a probabilistic, statistically rigorous method to infer chromatin structure ensembles from population Hi-C and similar data that takes the ensemble nature of the data explicitly into account and allows us to infer the number of structures required to explain the data.

Keywords: chromosome conformation capture, genome structure modeling, Bayesian inference, model comparison

Abstract

Mounting experimental evidence suggests a role for the spatial organization of chromatin in crucial processes of the cell nucleus such as transcription regulation. Chromosome conformation capture techniques allow us to characterize chromatin structure by mapping contacts between chromosomal loci on a genome-wide scale. The most widespread modality is to measure contact frequencies averaged over a population of cells. Single-cell variants exist, but suffer from low contact numbers and have not yet gained the same resolution as population methods. While intriguing biological insights have already been garnered from ensemble-averaged data, information about three-dimensional (3D) genome organization in the underlying individual cells remains largely obscured because the contact maps show only an average over a huge population of cells. Moreover, computational methods for structure modeling of chromatin have mostly focused on fitting a single consensus structure, thereby ignoring any cell-to-cell variability in the model itself. Here, we propose a fully Bayesian method to infer ensembles of chromatin structures and to determine the optimal number of states in a principled, objective way. We illustrate our approach on simulated data and compute multistate models of chromatin from chromosome conformation capture carbon copy (5C) data. Comparison with independent data suggests that the inferred ensembles represent the underlying sample population faithfully. Harnessing the rich information contained in multistate models, we investigate cell-to-cell variability of chromatin organization into topologically associating domains, thus highlighting the ability of our approach to deliver insights into chromatin organization of great biological relevance.


In recent years, methods based on chromosome conformation capture (3C) (1) have emerged as a powerful tool to investigate the three-dimensional (3D) organization of genomes on previously inaccessible length scales. Due to experimental advances and decreasing sequencing costs, genome-wide 3C methods such as high-throughput chromosome conformation capture (Hi-C) and Hi-C variants (3, 4) have yielded contact maps with resolutions of up to 1 kb. High-resolution contact maps have provided fascinating insights into the organization of chromatin domains and compartments of various sizes and their role in gene regulation (2, 47).

Although Hi-C can be performed in single cells (812), much richer data are usually obtained by analyzing populations of cells at the price of a more difficult interpretation of the data: Contact maps obtained in population Hi-C experiments show an average over millions of cells. Given that single-cell Hi-C experiments revealed significant structural variability among cell nuclei (8), it remains difficult to assess the information content and degree of structural heterogeneity reflected by population Hi-C data. A prime example highlighting this problem is topologically associating domains (TADs). TADs (see ref. 13 for a recent review) are linear segments of chromatin that span between 0.1 and 1 Mb. TADs have first been identified in Hi-C (5, 7) and chromosome conformation capture carbon copy (5C) (6) matrices where they manifest themselves as square-shaped blocks along the diagonal that show an elevated number of internal contacts. While TADs have emerged as fundamental units of gene regulation in many species (13), their boundary regions are often not clearly defined and it is uncertain to which extent TADs exist in individual cells rather than only as an averaging effect in cell populations.

Much effort has gone into the development of chromatin structure determination approaches, in which a single consensus structure is calculated (14). In light of recent results on cell-to-cell variability and considering experimental limitations and biases, consensus structure approaches are fundamentally limited, if not flawed (15). Several approaches taking into account the heterogeneity of cell populations have been explored. Matrix deconvolution has been used to unmix contact frequency matrices without explicitly modeling 3D structures (16, 17). While some structural information is already apparent in contact matrices, it is useful to find an approximate 3D representation giving rise to the data. For example, a 3D chromosome model allows the detection of three-way interactions not accessible in 3C data (18). Furthermore, data from other sources such as imaging can be included in the modeling process. A second approach is thus to calculate an ensemble of structures that reflect the structural heterogeneity of chromosomes more realistically than consensus structure approaches and reproduce the contact map only on average. This can be achieved in two different ways: First, one can optimize a set of structures generated from a polymer model in the presence of restraints enforcing that they reproduce the experimental data on average (3, 1921). A second possibility is to adjust the parameters of a polymer model such that by simulating the polymer model an ensemble of structures is obtained that reproduces the experimental data on average (2224).

In this article, we extend the scope of our previous work on Bayesian chromosome structure inference from single-cell Hi-C data (25) to population-averaged contact data, combining the explicit modeling of chromatin structure populations with a rigorous probabilistic interpretation of the data. We leverage the full power of Bayesian inference to compute multistate models of chromosome structures, which together reproduce the population contact data.

Results

Bayesian Inference of Chromatin Structure Ensembles.

Our approach is based on the inferential structure determination (ISD) (ref. 26 and SI Appendix) framework. ISD infers a structural model x along with unknown modeling parameters α from experimental data D and background information I. The inference problem is solved by sampling from the joint posterior distribution for x and α, conditioned on D and I, which can be expressed using Bayes’ theorem as

Pr(x,α|D,I)=Pr(D|x,α,I)Pr(x,α|I)Pr(D|I). [1]

Here it is assumed that a single structure x can explain the data D, as is the case for, e.g., proteins with a stable and unique fold. Sampling from Pr(x,α|D,I) results in a statistically well-defined ensemble of structures with associated estimates of nuisance parameters α, each being assigned a probability given by the posterior distribution.

To accommodate the fact that most 3C experiments are performed on populations of molecules with possibly very different conformations due to cell-to-cell heterogeneity and dynamics, we extend the scope of the ISD approach to the inference of multistate models X=(x1,,xn) consisting of n structures xk, each represented by a polymer model. The central quantity in this extended ISD framework is the posterior distribution

Pr(X,α|n,D,I)=Pr(D|X,α,n,I)Pr(X,α|n,I)Pr(D|n,I), [2]

sampling from which results in a hyperensemble, that is, a representative set of multistate models Xi with associated nuisance parameters αi, in which each pair (Xi,αi) has a probability given by Eq. 2. In the case of a single state (n=1), we recover the original ISD formulation and infer a consensus structure.

At the center of a Bayesian approach is a generative model for the data Pr(D|X,α,n,I) given a set X of n structures. To reflect the data density appropriately and allow for sufficiently fast computation, we represent a genomic region by a beads-on-a-string model and bin the input contact matrix such that each bin corresponds to one bead. The (coarse-grained) data D={fij} then comprise, for a given pair of beads (i,j), all experimentally accessible contact frequencies between loci mapping to the bins represented by beads i and j. Our forward model computes a contact matrix for each of the n states in X and averages these to obtain an idealized contact frequency matrix f^ij (Fig. 1A). We model deviations from idealized contact frequencies due to experimental noise, forward model inaccuracy, and, in the case of Hi-C data, data normalization with Poisson distributions whose rates are given by the idealized count data.

Fig. 1.

Fig. 1.

Main components of the proposed chromatin structure inference approach. (A) Forward model to calculate idealized contact frequency matrices from structures that are part of a multistate model. The contact matrices arising from each structure are summed together and multiplied with a scaling factor to yield the back-calculated contact frequency matrix on the right. (B) Bayesian model comparison illustrated for a polynomial curve-fitting example. Shown are the evidence (blue) and the residual mismatch between model and data (red) for polynomials of increasing degree. Data points and best-fitting models are shown below the x axis.

Proper modeling of prior information about model parameters is an essential ingredient of any Bayesian inference approach. We model the physical interactions between beads within a single structure xk similar to our model for single-cell data (25). Further details on our choice of prior distributions are explained in SI Appendix.

Bayesian Choice of the Size of the Structure Ensemble.

The number of states n is a crucial parameter in all modeling approaches that compute ensembles of chromatin structures (3, 19, 20). Existing multistate modeling approaches choose n in an ad hoc manner. For example, in the Genomic Organization Reconstructor Based on Conformational Energy and Manifold Learning (GEM) software (20) n defaults to 4 and differs largely from n=10,000 used by Alber and coworkers (19).

From a Bayesian perspective, the number of states n is a model parameter that should be inferred from the data. Depending on the true heterogeneity of the chromosome structures, n will adopt smaller or larger values. If n is too small, we underfit the contact data, resulting in highly strained structure models. If n is too large, the structural states are only loosely defined and potentially overfit the contact data. Therefore, a simple goodness-of-fit criterion is not sufficient to select n. We have to balance goodness of fit against model complexity. Fig. 1B illustrates this point for polynomial curve fitting. The degree of the polynomial determines the flexibility of the fitting function and therefore corresponds to the number of states n. For data generated from a parabola, the goodness of fit reaches an optimum if we use polynomials of degree two and higher, eventually resulting in overfitting: The fitted curve closely follows each data point.

A unique feature of Bayesian data analysis is that it offers an objective, albeit often computationally challenging framework for model comparison (27). Bayesian model comparison ranks alternative n-state models according to their model evidences Pr(D|n,I); the ratio Pr(D|n1,I)/Pr(D|n2,I) is the Bayes factor for preferring n1 over n2 states. The model evidence is the goodness of fit with the data obtained when averaging over all structures of the n-state model weighted by the prior. The evidence has a built-in penalty for model complexity. In the curve-fitting example (Fig. 1B), Pr(D|n,I) selects the correct degree of the polynomial (n=2), although polynomials of degree >2 achieve a better fit. This is because with increasing degree the number of curves that do not fit the data also increases, which is reflected by a decrease in the model evidence.

Proof of Concept: Reconstructing Coarse-Grained Protein Structures from Simulated Average Contacts.

As a proof of concept, we first apply our inference approach to contact data simulated from two protein structures. We consider beads-on-a-string models of two protein domains, the B1 immunoglobin-binding domain of streptococcal protein G (Protein Data Bank [PDB] identifier 1PGA) and the SH3 domain of human Fyn (PDB identifier 1SHF). Only positions of Cα atoms were taken into account, and the SH3 domain was truncated by three C-terminal residues to match the length of 1PGA. This results in two very different conformations of a chain of 56 beads (Fig. 2 A and B). We generated simulated contact frequencies and used ISD to infer representative multistate models. We simulated the posterior distributions for n=1,2,3,4,5,10 states. Model comparison shows that the model with n=2 states is strongly preferred, as indicated by Pr(D|n=2,I)/Pr(D|n=1,I)>10307 and Pr(D|n=2,I)/Pr(D|n=3,I)>10160 (Fig. 2C). It is reassuring that while the evidence decreases for n>2, models with an even number of states are preferred over models with a similar, but uneven number of states. The reason for this is that for even n, half of the states adopt the conformation of the B1 domain and the other half adopt the conformation of the SH3 domain, thus having no superfluous parameters (SI Appendix, Fig. S2). For uneven n, on the other hand, there is at least one superfluous state (SI Appendix, Figs. S3 and S4), contributing uninformative parameters which are penalized in the evidence. We find that, for n>2, the average scale factor α decreases monotonically with an increasing number of states (Fig. 2F), in agreement with our intuition that α compensates for the smaller number of simulated states compared to typical numbers of counts in the data.

Fig. 2.

Fig. 2.

Reconstruction of protein structures from simulated data. (A and B) Reference structures of the SH3 domain (A) and the B1 domain of protein G (B). (C and D) “Sausage representation” of reconstructed folds (n=2 states). Tube thickness indicates the local precision of bead positions. (E) Model evidences for different numbers of states. (F) Average values of sampled scale factor.

Fig. 2 D and E shows the two reconstructed states from the n=2 simulation which deviate from the ground truth by a root-mean-square deviation (rmsd) of bead positions of (1.46±0.11) Å for the B1 domain and (1.59±0.21) Å for the SH3 domain. ISD thus recovers both conformations with high accuracy.

Inferring Chromatin Structure Populations from 5C Data.

We now apply our extended ISD approach to a 5C (28) dataset covering the X-inactivation center in mouse embryonic stem cells (mESCs) (6). The available contact counts are already corrected for common biases and we use them as is. We restrict our analysis to contacts between primer sites within the two consecutive TADs harboring the Tsix and Xist promoters, respectively, and their boundary region. Similar to previous work on the same data (22), we model the corresponding 920-kb region by M beads, each representing a genomic length of (920/M)kb. To keep the computational efforts for sufficient Markov chain Monte Carlo (MCMC) sampling reasonable, we chose a rather coarse resolution with one bead representing 15 kb, resulting in a total of 62 beads. Simulating the ISD posterior distribution for n=1,5,10,15,20,25,30,35,40, and 100 states and calculating model evidences, we find that the multistate model with n=30 states is preferred by the data (Fig. 3A). This result establishes that Bayesian model comparison is indeed able to determine a unique, optimal number of states from population-averaged contact frequencies. Low-resolution models provide only limited biological insight; therefore we increased the number of beads to M=308 and computed multistate models at 3-kb resolution. Our model for the chromatin polymer then coincides with the model chosen in previous work (22) on the same dataset.

Fig. 3.

Fig. 3.

Determination of the optimal number of states by Bayesian model comparison. (A) Model evidences and average likelihoods for various numbers of states at 15-kb resolution. Inset shows a close-up of the log model evidence for models using more than 10 states. (B) Scatter plot showing the correlation between back-calculated and experimental data at 3-kb resolution. (C) Representative sample from an n=30 simulation at 3-kb resolution. Tsix TAD is colored red and the Xist TAD is in blue.

In the following, we analyze high-resolution chromatin structure ensembles obtained from a posterior simulation for n=30. Of three simulations with identical parameters (but different, random initial state), we chose the one with the highest mean posterior probability. While the optimal number of states for low- and high-resolution models does not necessarily have to match, for n=30, the back-calculated data reproduce the experimental data very well, with most mismatches stemming from short-range contacts contributing little structural information (Fig. 3B). Determining the optimal number of states at the 3-kb resolution is currently not feasible, because accurate and exhaustive sampling of very high-dimensional posterior distributions is required for calculating good estimates of the evidence. This is outside the scope of our current implementation and this contribution.

Visual inspection of sample populations (Fig. 3C) reveals a large structural variability which cannot be captured by only a few, let alone a single structure. This heterogeneity is already apparent in the low-resolution models (SI Appendix, Fig. S5).

Bayesian Multistate Models Agree with Independent Experimental Data.

Before further analyzing our models, we validate the inferred multistate chromatin models with previously measured fluorescence in situ hybridization (FISH) (29, 30) data on loci in the Tsix TAD (22). In FISH experiments, fluorescent probes are designed that bind selectively to parts of a nucleic acid with high sequence complementarity. Imaging these probes thus allows the localization of a single locus within a cell. Fig. 4A shows that experimental distances are in excellent agreement with distances obtained from the simulated structures, confirming that our approach yields structures matching data from experimental cell populations not only on average, but also correctly reproducing the spread of the population. Further validation of our modeling approach comes from considering data obtained from female mESCs before and 48 h after onset of differentiation (6). In this cell line, transcription of the Xist TAD is up-regulated while transcription levels of the Tsix TAD remain the same (6). After again choosing n=30, we simulated the posterior densities with contact data before and after the onset of differentiation. To monitor structural differences between both populations, we use the radius of gyration which measures the spatial extension of a structure. The distributions of the radii of gyration of both TADs reveal that before differentiation (Fig. 4B), the Xist TAD assumes more densely packed conformations as opposed to after differentiation (Fig. 4C), while the packing of the Tsix TAD stays, on average, the same. This agrees with the notion that to be highly transcribed, chromatin has to be loosely packed to provide easy access for the transcription machinery. We furthermore compare the structures calculated at 15-kb resolution with structures calculated from recent single-cell Hi-C data (10) at the same resolution. While structures calculated with the approach presented in this work are more extended, other general structural features agree reasonably well (SI Appendix).

Fig. 4.

Fig. 4.

(A–C) Validation of multistate chromatin models inferred from 5C data on male (A) and female (B and C) mESCs. (A) Pairwise distances between several loci in the Tsix TAD as measured by DNA FISH and in inferred multistate models. Names of genes overlapping the probes are shown. (B and C) Distribution of the radius of gyration for data obtained before onset of differentiation (B) and 48 h later (C). Dashed lines indicate sample means.

Chromatin Structure Ensembles Reveal Cell-to-Cell Heterogeneity and TAD Intermingling.

We now consider the heterogeneity within explicit structure populations. Fig. 5A shows that the radius of gyration varies within a certain range within a structure population, but population-wide averages are comparable across posterior samples. This indicates that our MCMC algorithm converged and draws structure populations with similar characteristics.

Fig. 5.

Fig. 5.

Heterogeneity of chromatin structure ensembles. (A) Radius of gyration across representative structure populations (gray circles) and averaged over structure populations (black line) as a function of simulation progress. (B) Distribution of TAD boundaries detected from contact matrices obtained from single-structure population members (dark gray) and back-calculated from full populations (light gray). CTCF binding sites are shown as vertical black lines. Bin populations are on a logarithmic scale. (C) Overlap between densities fitted to both TADs as measured by the correlation coefficient between the mass densities obtained by applying a Gaussian kernel with a bandwidth of one bead diameter. The green reference distribution (“no data”) was obtained by generating models from the prior with an additional radius of gyration bias that ensures that the level of compaction of the prior ensemble matches the degree of compaction observed in the posterior ensemble (“full data”). The orange histogram (“no inter contacts”) was obtained with data lacking inter-TAD contacts.

A unique advantage of analyzing the 5C data with a multistate model is that it allows us to gain insight into the question of to which extent TAD formation is a feature of single structures or whether it is an averaging phenomenon. To study this question, we use a simple heuristic to determine the position of the TAD boundary from a structure by finding the best split of the model contact matrix such that the contact density within both parts achieves the highest value. Fig. 5B suggests that, assuming the formation of two TADs with unknown and possibly variable position, the TAD boundary position determined from single-structure contact matrices can vary significantly within a structure population. If the same TAD boundary detection heuristic is applied to contact frequency matrices obtained by summing single population members’ contact matrices, however, the distribution of TAD boundary positions is sharply peaked at a value of 97. Visual inspection of the 5C contact matrix suggests that a TAD boundary at bead 97 is more appropriate than the value of 109 reported in earlier work using the same data binning (22) (SI Appendix, Fig. S6). We also compare the inferred TAD boundary positions with the location of CTCF binding sites (31). CTCF is a key player in TAD organization and often demarcates TAD boundaries (32). Several CTCF binding sites are located within the range of inferred TAD boundary positions as measured from back-calculated population matrices. We also observe that several CTCF binding sites coincide with distinct peaks in the TAD boundary position distribution derived from single-contact matrices. These results suggest that TAD boundary positions derived both from structure populations and from single structures are indeed biologically relevant results.

To investigate possible TAD intermingling, we convert the structure of each TAD into a mass density by using a Gaussian blur kernel and measure the correlation coefficient between the two densities for each structure (see SI Appendix, Fig. S7 for examples). Fig. 5C shows that, compared to structures calculated from data lacking inter-TAD contacts and structures calculated without any data, considerable TAD intermingling occurs, indicated by a higher overlap between the TAD densities.

Comparison with Other Multistate Modeling Approaches.

We compare the structures estimated by our extended ISD approach to the results of two previous approaches, which also model 3C data by multistate models. We modified the publicly available code for the Population-Based Genome Structure Modeling Package (PGS) (33), the software used by Alber and coworkers (19), to work with the genomic region considered in this work. Calculations for three different numbers of states show good agreement of the resulting structure populations with the aforementioned FISH data (22) for some loci pairs, while for others the distance distributions differ significantly (SI Appendix, Fig. S8). On average, the structure populations calculated with PGS show similar radii of gyration to those of our structures, but the radius of gyration distributions of the PGS structure populations are much broader. It is noteworthy that for structure populations calculated using PGS, neither FISH loci distance nor radius of gyration histograms vary significantly with the number of states, which span two orders of magnitude. Distances between most FISH loci are significantly different in both ISD and PGS structure populations compared to a simulation of the null model without any data given by the polymer model used in our ISD approach (SI Appendix, Fig. S8). We also calculated structure populations using GEM (20), but found that, irrespective of the number of states, which we varied between 4 and 100, there is virtually no heterogeneity in the populations (SI Appendix, Figs. S9–S11).

Discussion

The calculation of chromatin structure ensembles is reminiscent of the determination of protein structure ensembles. The most common approaches for Bayesian modeling of protein ensembles are reweighting and replica averaging (34). Reweighting methods assign weights to members of a structure library (3538). Replica-averaging techniques (3941) infer a finite number of conformers simultaneously from the experimental data. Our approach can be seen as a replica-averaging strategy. However, whereas in replica averaging the number of states is held fixed, our goal is to determine the number of structures directly from the data. Therefore, our method can be viewed as a rigorous minimal-ensemble method based on the model evidence.

Using both simulated and 5C data, we showed that the proposed Bayesian approach to chromatin structure inference not only is able to construct multistate models from contact data that agree well with independent biological observations, but also allows us to address crucial questions about the cell-to-cell variability of chromatin architecture. We demonstrated this by using multistate models to uncover significant heterogeneity in the organization of two consecutive TADs, with respect to the location of the boundary region separating the TADs as well as the overlap between them. This structural heterogeneity of TAD organization agrees with previous results on the same dataset obtained by simulating from a polymer model, whose parameters were optimized such that the input data could be reproduced (22). We note that, in comparison, our structures are somewhat different: While in our models, structures are rather compact, most structures shown in ref. 22 seem very extended with an elongated region. Furthermore, in the models inferred with ISD, both TADs tend to have larger radii of gyration (SI Appendix, Fig. S12). Different results on structural diversity of intra-TAD organization have been obtained in an earlier modeling-based attempt based on a coarse-mixture model (42), where it was found that the 3D structure of 71% and 82% of reconstructed TADs (depending on the dataset) corresponded to the consensus structure, which is in contrast to the highly heterogeneous ensembles we obtained. Cell-to-cell variability of TADs has also been addressed on the basis of single-cell Hi-C contact matrices with at best unclear, if not contradicting, results (8, 11). In structures calculated from single-cell Hi-C data (9), TADs have been shown to vary widely in terms of compaction between individual cells. Independent from Hi-C, imaging methods have recently been used to investigate intercell variability of chromatin structure: Multiplexed superresolution FISH imaging (43) showed that TAD-like domain structures with strong physical segregation exist in single cells, but the location of domain boundaries is highly variable. The latter observation agrees with our results as shown in Fig. 5B, while in our structures, physical segregation between the Tsix and Xist TADs seems less pronounced (Fig. 5C). This is evident by comparing the overlap of both TADs observed in the posterior ensemble and a reference ensemble (prior with additional compaction). Although the overlap tends to be higher in the reference ensemble (0.23±0.07), there is still substantial overlap of both TADs in the posterior (0.19±0.07), indicating that both TADs are not fully segregated (as is the case in the posterior based on data missing the inter-TAD contacts: 0.03±0.02). Another recent study based on high-throughput FISH (44) found that TAD structure is highly variable between single cells and that TADs frequently overlap in space.

Our method comes with an objective measure for the number of states required to accurately model population-based data, a feature previous approaches (3, 19, 20) are missing. The number of states used in previous work varies from very low numbers of states (n=4) (20) to very large populations (n=104) (3, 19). Bayesian model comparison can, provided sufficiently exhaustive MCMC sampling, uniquely and objectively determine this number. While our method includes consensus structure calculation as a special case, it clearly demonstrates that this kind of data is much more appropriately modeled using a larger number of structures. At the same time, the Bayesian model comparison approach employed in this work helps to avoid overfitting the data. It is also important to note that our multistate model-based approach is not limited to 5C/Hi-C data. Instead, any kind of pairwise contact frequency data could be used. This allows chromatin structure modeling from data obtained from recent ligation-free alternatives to 3C based methods (45, 46).

Our work opens up several avenues of further research. A more efficient implementation using molecular modeling packages such as OpenMM (47), possibly leveraging the power of graphics processing units (GPUs), would increase the feasible range of both resolutions and system size. For example, we estimate that modeling the mouse X chromosome at a resolution of 40 kb (4,250 beads) involves a 100- to 200-fold increase in computation time. In the current implementation, the time required to evaluate the likelihood scales quadratically with the number of beads. With an implementation improving this bottleneck, computation of full chromosome structures should be feasible. More detailed structural prior information encoding, e.g., a priori contact probabilities between loci based on epigenetic marks (48, 49) and more detailed forward and error models for the data would likely result in more accurate 3D models. For example, a fully probabilistic treatment of biases in Hi-C data could be included in the generative model so that raw, unnormalized Hi-C data can be used as an input. Eventually, with these improvements, the model comparison approach implemented in this work could rise from merely helping to set an important modeling parameter to a measure of cell-to-cell variability with the potential to answer a plethora of important biological questions. Finally, we note that our approach is not limited to 3C data, but could also be useful for other kinds of contact data in which structural heterogeneity can be suspected, for example cross-linking studies of protein complexes.

Materials and Methods

Statistical Model for 5C Data.

In our generative model, we compute a contact matrix with entries f^ijk for each state xk such that f^ij=k=1nf^ijk. Ideally, two beads i and j are in contact if their distance dijk is smaller than some cutoff dc resulting in binary-valued f^ijk. Because we use a gradient-based MCMC sampling algorithm, we require the posterior probability to be differentiable and thus approximate a hard contact with a flipped sigmoidal function s(d)=(γd(1+γ2d2)12+1)/2, where γ controls the steepness of the smoothed contact such that for large γ the hard contact criterion is recovered: limγs(d)=1θ(d), where θ is the step function. The total number of contacts between two beads kf^ijk is not directly comparable to the experimental contact matrix, because the number of states n does not match the (unknown) number of molecules analyzed in the experiment. We therefore need to scale the calculated contacts with a positive factor α, which we treat as a nuisance parameter:

f^ij=αk=1nf^ijk=αk=1ns(dcdijk).

We model the discrepancy between experimental and back-calculated data using a Poisson distribution with rate f^ij. For a single contact, the likelihood is then given by

Pr(fij|X,α)[f^ij(X,α)]fijef^ij(X,α).

Assuming that the single measurements are independent, the likelihood of the complete dataset is the product over all contacts:

Pr(D|X,α,I)expj>iMf^ij(X,α)j>iM[f^ij(X,α)]fij.

It is important to note that we could have chosen a different error model. In principle, Bayesian model comparison allows us to determine the error model that is most compatible with the data. However, we restrict this study to a Poisson error model, because the Poisson distribution is discrete and has been used previously to model Hi-C data.

Bayesian Model Comparison.

The posterior probability of a particular choice of n is

Pr(n|D,I)=Pr(D|n,I)Pr(n|I)Pr(D|I).

Assuming that a priori all n are equally likely, the ratio of two choices for the number of states n and n is

Onn=Pr(n|D,I)Pr(n|D,I)=dXdαPr(D|X,α,n,I)Pr(X,α|n,I)dXdαPr(D|X,α,n,I)Pr(X,α|n,I). [3]

The odds ratio Onn is called the Bayes factor and tells us which model to prefer: For Onn>1, an n-state model describes the data better; while Onn<1 indicates a preference for a model based on n states. The evaluation of the model evidence involves calculating two possibly intractable and high-dimensional integrals, which in all but the most trivial cases have to be approximated numerically. As several of the occurring distributions are of a nonstandard form, we use histogram reweighting (50, 51) to approximate the evidences.

Code Availability.

We implemented our method in a freely available Python package (http://bitbucket.org/simeon_carstens/ensemble_hic).

Supplementary Material

Supplementary File

Acknowledgments

We thank Luca Giorgetti for sharing DNA-FISH data. S.C. and M.N. acknowledge funding from the Agence National de la Recherche (ANR-11-MONU-0020). S.C. thanks Christian Griesinger for funding and support. M.H. acknowledges funding by the German Research Foundation under Grant SFB 860, TP B09 as well as funding by the Carl Zeiss Foundation.

Footnotes

The authors declare no competing interest.

This article is a PNAS Direct Submission.

Data deposition: Code for our method has been deposited in a freely available Python package (http://bitbucket.org/simeon_carstens/ensemble_hic).

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1910364117/-/DCSupplemental.

References

  • 1.Dekker J., Rippe K., Dekker M., Kleckner N., Capturing chromosome conformation. Science 295, 1306–1311 (2002). [DOI] [PubMed] [Google Scholar]
  • 2.Lieberman-Aiden E., et al. , Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Kalhor R., Tjong H., Jayathilaka N., Alber F., Chen L., Genome architectures revealed by tethered chromosome conformation capture and population-based modeling. Nat. Biotechnol. 30, 90–98 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Rao S. S. P., et al. , A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Dixon J. R., et al. , Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Nora E. P., et al. , Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 381–385 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Sexton T., et al. , Three-dimensional folding and functional organization principles of the Drosophila genome. Cell 148, 458–472 (2012). [DOI] [PubMed] [Google Scholar]
  • 8.Nagano T., et al. , Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature 502, 59–64 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Stevens T. J., et al. , 3D structure of individual mammalian genomes studied by single cell Hi-C. Nature 544, 59–64 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Nagano T., et al. , Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature 547, 61–67 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Flyamer I. M., et al. , Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-to-zygote transition. Nature 544, 110–114 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Ramani V., et al. , Massively multiplex single-cell Hi-C. Nat. Methods 14, 263–266 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Szabo Q., Bantignies F., Cavalli G., Principles of genome folding into topologically associating domains. Sci. Adv. 5, eaaw1668 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Imakaev M. V., Fudenberg G., Mirny L. A., Modeling chromosomes: Beyond pretty pictures. FEBS Lett. 589, 3031–3036 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.O’Sullivan J. M., Hendy M. D., Pichugina T., Wake G. C., Langowski J., The statistical-mechanics of chromosome conformation capture. Nucleus 4, 390–398 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Junier I., Spill Y. G., Marti-Renom M. A., Beato M., le Dily F., On the demultiplexing of chromosome capture conformation data. FEBS Lett. 589, 3005–3013 (2015). [DOI] [PubMed] [Google Scholar]
  • 17.Sefer E., Duggal G., Kingsford C., Deconvolution of ensemble chromatin interaction data reveals the latent mixing structures in cell subpopulations. J. Comput. Biol. 23, 425–438 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Gürsoy G., Xu Y., Kenter A. L., Liang J., Computational construction of 3D chromatin ensembles and prediction of functional interactions of alpha-globin locus from 5C data. Nucleic Acids Res. 45, 11547–11558 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Tjong H., et al. , Population-based 3D genome structure analysis reveals driving forces in spatial genome organization. Proc. Natl. Acad. Sci. U.S.A. 113, E1663–E1672 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Zhu G., et al. , Reconstructing spatial organizations of chromosomes through manifold learning. Nucleic Acids Res. 46, e50 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Wang S., Xu J., Zeng J.. Inferential modeling of 3D chromatin structure. Nucleic Acids Res. 43, e54 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Giorgetti L., et al. , Predictive polymer modeling reveals coupled fluctuations in chromosome conformation and transcription. Cell 157, 950–963 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Zhang B., Wolynes P. G., Topology, structures, and energy landscapes of human chromosomes. Proc. Natl. Acad. Sci. U.S.A. 112, 6062–6067 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Di Pierro M., Zhang B., Aiden E. L., Wolynes P. G., Onuchic J. N., Transferable model for chromosome architecture. Proc. Natl. Acad. Sci. U.S.A. 113, 12168–12173 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Carstens S., Nilges M., Habeck M., Inferential structure determination of chromosomes from single-cell Hi-C data. PLoS Comput. Biol. 12, 1–33 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Rieping W., Habeck M., Nilges M., Inferential structure determination. Science 309, 303–306 (2005). [DOI] [PubMed] [Google Scholar]
  • 27.Wasserman L., Bayesian model selection and model averaging. J. Math. Psychol. 44, 92–107 (2000). [DOI] [PubMed] [Google Scholar]
  • 28.Dostie J., et al. , Chromosome conformation capture carbon copy (5C): A massively parallel solution for mapping interactions between genomic elements. Genome Res. 16, 1299–1309 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Langer-Safer P. R., Levine M., Ward D. C., Immunological method for mapping genes on Drosophila polytene chromosomes. Proc. Natl. Acad. Sci. U.S.A. 79, 4381–4385 (1982). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Cui C., Shu W., Li P., Fluorescence in situ hybridization: Cell-based genetic diagnostic and research applications. Front. Cell Dev. Biol. 4, 89 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Yue F., et al. , A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Ghirlando R., Felsenfeld G., CTCF: Making the right connections. Genes Dev. 30, 881–891 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Hua N., et al. , Producing genome structure populations with the dynamic and automated PGS software. Nat. Protoc. 13, 915–926 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Bonomi M., Heller G. T., Camilloni C., Vendruscolo M., Principles of protein structural ensemble determination. Curr. Opin. Struct. Biol. 42, 106–116 (2017). [DOI] [PubMed] [Google Scholar]
  • 35.Fisher C. K., Huang A., Stultz C. M., Modeling intrinsically disordered proteins with Bayesian statistics. J. Am. Chem. Soc. 132, 14919–14927 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Rozycki B., Kim Y. C., Hummer G., SAXS ensemble refinement of ESCRT-III CHMP3 conformational transitions. Structure 19, 109–116 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Hummer G., Kofinger J., Bayesian ensemble refinement by replica simulations and reweighting. J. Chem. Phys. 143, 243150 (2015). [DOI] [PubMed] [Google Scholar]
  • 38.Potrzebowski W., Trewhella J., Andre I., Bayesian inference of protein conformational ensembles from limited structural data. PLoS Comput. Biol. 14, e1006641 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Roux B., Weare J., On the statistical equivalence of restrained-ensemble simulations with the maximum entropy method. J. Chem. Phys. 138, 084107 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Cavalli A., Camilloni C., Vendruscolo M., Molecular dynamics simulations with replica-averaged structural restraints generate structural ensembles according to the maximum entropy principle. J. Chem. Phys. 138, 094112 (2013). [DOI] [PubMed] [Google Scholar]
  • 41.Bonomi M., Camilloni C., Cavalli A., Vendruscolo M.. Metainference: A Bayesian inference method for heterogeneous systems. Sci. Adv. 2, e1501177 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Hu M., et al. , Bayesian inference of spatial organizations of chromosomes. PLoS Comput. Biol. 9, e1002893 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Bintu B., et al. , Super-resolution chromatin tracing reveals domains and cooperative interactions in single cells. Science 362, eaau1783 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Finn E. H., et al. , Extensive heterogeneity and intrinsic variation in spatial genome organization. Cell 176, 1502–1515.e10 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Beagrie R. A., et al. , Complex multi-enhancer contacts captured by genome architecture mapping. Nature 543, 519–524 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Quinodoz S. A., et al. , Higher-order inter-chromosomal hubs shape 3d genome organization in the nucleus. Cell 174, 744–757.e24 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Eastman P., et al. , OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLoS Comput. Biol. 13, 1–17 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Barbieri M., et al. , Complexity of chromatin folding is captured by the strings and binders switch model. Proc. Natl. Acad. Sci. U.S.A. 109, 16173–16178 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Di Pierro M., Cheng R. R., Lieberman Aiden E., Wolynes P. G., Onuchic J. N.. De novo prediction of human chromosome structures: Epigenetic marking patterns encode genome architecture. Proc. Natl. Acad. Sci. U.S.A. 114, 12126–12131 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Ferrenberg A. M., Swendsen R. H., New Monte Carlo technique for studying phase transitions. Phys. Rev. Lett. 61, 2635–2638 (1988). [DOI] [PubMed] [Google Scholar]
  • 51.Habeck M., “Evaluation of marginal likelihoods via the density of states” in Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, Lawrence N. D., Girolami M., Eds. (Proceedings of Machine Learning Research, 2012), pp. 486–494. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary File

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences

RESOURCES