Nucleic Acids Research. 2020 Oct 19;48(21):e123. doi: 10.1093/nar/gkaa872

ASHIC: hierarchical Bayesian modeling of diploid chromatin contacts and structures

Tiantian Ye, Wenxiu Ma
PMCID: PMC7708071  PMID: 33074315

Abstract

The recently developed Hi-C technique has been widely applied to map genome-wide chromatin interactions. However, current methods for analyzing diploid Hi-C data cannot fully distinguish between homologous chromosomes. Consequently, the existing diploid Hi-C analyses are based on sparse and inaccurate allele-specific contact matrices, which might lead to incorrect modeling of diploid genome architecture. Here we present ASHIC, a hierarchical Bayesian framework to model allele-specific chromatin organizations in diploid genomes. We developed two models under the Bayesian framework: the Poisson-multinomial (ASHIC-PM) model and the zero-inflated Poisson-multinomial (ASHIC-ZIPM) model. The proposed ASHIC methods impute allele-specific contact maps from diploid Hi-C data and simultaneously infer allelic 3D structures. Through simulation studies, we demonstrated that ASHIC methods outperformed existing approaches, especially under low coverage and low SNP density conditions. Additionally, in the analyses of diploid Hi-C datasets in mouse and human, our ASHIC-ZIPM method produced fine-resolution diploid chromatin maps and 3D structures and provided insights into the allelic chromatin organizations and functions. To summarize, our work provides a statistically rigorous framework for investigating fine-scale allele-specific chromatin conformations. The ASHIC software is publicly available at https://github.com/wmalab/ASHIC.

INTRODUCTION

The three-dimensional (3D) organization of chromatin in the nucleus plays an essential role in gene regulation (1). The recently developed chromosome conformation capture coupled with high-throughput sequencing (Hi-C) technique (2–4) and its variants (5–7) have been widely applied to map genome-wide chromatin interactions and to elucidate the principles of spatial genome architecture. The Hi-C experiment yields a genome-wide chromatin contact matrix; each entry (i, j) in the matrix represents the contact frequency between two loci i and j in the genome. The mapping and subsequent analyses of genome-wide Hi-C contact matrices in various organisms have demonstrated that gene expression is tightly regulated by chromatin interactions at multiple scales, ranging from active/inactive chromosomal compartments and sub-compartments (2,6), to topologically associating domains (TADs) (8), and fine-scale chromatin loops (5,6).

One hindrance of current Hi-C data analysis is the lack of allele-specific modeling for diploid genomes. Most mammalian genomes are diploid: the genome contains two copies of each chromosome, one maternal and one paternal. Hence, a chromatin contact observed between two genomic loci in the reference (haploid) genome may correspond to four distinct yet indistinguishable chromatin interactions in the diploid genome. For example, a chromatin contact mapped to a locus pair (i, j) on the same chromosome in the reference genome could be either an intra-chromosomal contact (mi, mj) on the maternal allele, or an intra-chromosomal contact (pi, pj) on the paternal allele, or one of the inter-homologous contacts (mi, pj) or (pi, mj) (Figure 1A). However, the majority of existing Hi-C analyses on diploid genomes do not distinguish between homologous chromosomes. As a result, current analyses are based on an aggregated contact matrix generated with mixed signals of maternal and paternal chromatin contacts, which could result in the false identification of significant chromatin interactions and an inaccurate understanding of the diploid genome architecture. Therefore, statistical methods for rigorous and accurate modeling of diploid Hi-C data are needed to facilitate elucidation of the mechanisms of chromatin organization and gene regulation.

Figure 1.


Overview of allele-specific modeling of diploid Hi-C data. (A) Diploid contact (i, j) is a combination of four distinct allele-specific contacts (mi, mj), (mi, pj), (pi, mj) and (pi, pj). (B) Reconstruction of allele-specific diploid contact matrix. (C) Observed allele-specific contacts between bins i and j can be decomposed into observed allele-certain contacts Inline graphic and observed allele-ambiguous contacts Inline graphic. We aim to decompose Inline graphic, infer the hidden contacts Inline graphic, and impute the true allele-specific contacts Inline graphic. (D) Illustration of the hierarchical Bayesian ASHIC-ZIPM model.

Recently, several methods have been developed to obtain allele-specific chromatin contact matrices and/or allelic 3D structures from diploid Hi-C data (6,9–14). These methods use heterozygous single nucleotide polymorphisms (SNPs) to identify the allele identity of chromatin interactions. Specifically, a Hi-C contact is a mate pair with two read ends representing the two interacting chromatin fragments. If a read end overlaps with SNPs for which the allele identity can be determined, we term it an allele-certain read. For example, a read containing maternal-specific SNP(s) is assigned to the maternal allele; similarly a read containing paternal-specific SNP(s) is assigned to the paternal allele. In addition, reads without SNPs are allele-ambiguous reads. Based on the allele identity of the paired ends, we can then categorize diploid Hi-C contacts into three groups: both-end allele-certain contacts, one-end allele-ambiguous contacts, and both-end allele-ambiguous contacts.

Without a statistically rigorous allele inference method, many previous studies applied either an ‘allele-certain’ or a ‘mate-rescue’ strategy to reconstruct the allele-specific contact maps in diploid genomes. In the allele-certain approach, only both-end allele-certain contacts are used (6,11). However, the both-end allele-certain contacts only account for a small portion of the total chromatin contacts (Supplementary Table S1). For example, in the patski (BL6×Spretus) cell line, whose SNP density is approximately 1 per 75 bp, the proportion of both-end allele-certain contacts in a typical Hi-C dataset is about 35.6%. In contrast, in the human GM12878 cell line, whose SNP density is ∼1 per 1700 bp, the both-end allele-certain proportion drops to 0.14%. Consequently, the diploid contact matrices obtained by such an allele-certain approach are often sparse and of low resolution.

To overcome the low-coverage issue of the allele-certain approach, several diploid Hi-C studies adopted a straightforward mate-rescue strategy to infer the allele identity of one-end allele-ambiguous contacts, i.e., the allele-ambiguous end of such a contact is assigned to the same allele as its mate-end (10,12,15). This mate-rescue method attempts to recover one-end allele-ambiguous contacts, which account for approximately 5.7% (in the case of GM12878 cells) to 43.3% (in the case of patski cells) of the total contacts (Supplementary Table S1). However, the mate-rescue approach assumes that all one-end allele-ambiguous contacts are intra-chromosomal. Such a false assumption would lead to inaccurate contact maps, especially in the regions where inter-chromosomal interactions are observed across chromosomal territories.
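The three contact categories and the two heuristic strategies described above can be sketched in a few lines. This is a minimal illustration rather than code from any of the cited tools; the allele labels (`"m"`, `"p"`, and `None` for an allele-ambiguous read end) and all function names are hypothetical.

```python
from collections import Counter

def categorize(allele_i, allele_j):
    """Classify a contact by how many of its two read ends carry an allele label."""
    known = (allele_i is not None) + (allele_j is not None)
    return {2: "both-end allele-certain",
            1: "one-end allele-ambiguous",
            0: "both-end allele-ambiguous"}[known]

def allele_certain(allele_i, allele_j):
    """Allele-certain strategy: keep a contact only if both ends are phased."""
    if allele_i is None or allele_j is None:
        return None
    return (allele_i, allele_j)

def mate_rescue(allele_i, allele_j):
    """Mate-rescue strategy: copy the known allele onto an ambiguous mate.

    A rescued contact is therefore always called intra-chromosomal.
    """
    if allele_i is None and allele_j is None:
        return None  # both-end allele-ambiguous contacts stay unused
    a = allele_i if allele_i is not None else allele_j
    b = allele_j if allele_j is not None else allele_i
    return (a, b)

# Hypothetical toy contacts: (allele of end i, allele of end j)
contacts = [("m", "m"), ("m", "p"), ("m", None), (None, "p"), (None, None)]
print(Counter(categorize(a, b) for a, b in contacts))
print([allele_certain(a, b) for a, b in contacts])
print([mate_rescue(a, b) for a, b in contacts])
```

In this toy input, the contact `("m", None)` could in truth be (m, p), but mate-rescue can only ever report it as (m, m), which is the bias discussed above.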

Since the mate-rescue method fails to infer inter-chromosomal interactions from one-end allele-ambiguous contacts, Tan et al. (13) proposed an iterative two-stage imputation algorithm, Dip-C, for modeling single-cell diploid Hi-C data. In the first imputation stage, one-end allele-ambiguous contacts are phased using an ad hoc voting procedure by their neighborhood on the contact matrix. In the second imputation stage, the assignment of allele-ambiguous contacts is refined by the 3D structures. The Dip-C method can be viewed as an advanced mate-rescue method, as it leverages additional information from both contact matrices and 3D structures to infer allele-ambiguous contacts. However, the Dip-C method is specifically designed for single-cell Hi-C data and therefore may not adapt well to bulk Hi-C data. Moreover, Dip-C uses a deterministic voting strategy to assign allele-ambiguous contacts, which does not provide a probabilistic model of all possible allele origins.

One common drawback of the allele-certain and mate-rescue methods is that they do not utilize both-end allele-ambiguous contacts, which represent a substantial proportion of the total diploid contacts, ranging from 21.1% (patski) to 94.1% (GM12878) (Supplementary Table S1). Inferring the allele identity of both-end allele-ambiguous contacts remains a significant challenge. To date, few methods have been developed to address this problem. The Dip-C method (13) attempts to impute only inter-chromosomal rather than intra-chromosomal both-end allele-ambiguous contacts. Thus, it does not produce a fully imputed diploid contact map. In addition, our previously proposed Poisson-Gamma model (9) imputes both one-end and both-end allele-ambiguous contacts, and estimates the diploid contact matrices by an iterative expectation-maximization (EM) algorithm. However, the Poisson-Gamma method does not predict 3D structures nor use the structures to assist the assignment of allele-ambiguous contacts. As a result, it might not work robustly in fine-resolution analyses. Furthermore, Cauer et al. (14) developed diploid-PASTIS, an extension of the PASTIS model (16), to infer the diploid chromatin structures. Diploid Hi-C contacts are modeled as Poisson variables, and the optimal diploid structures are solved by maximizing the likelihood function with additional structural constraints. The diploid-PASTIS method is specifically designed to model diploid 3D structures, but does not infer allele-ambiguous contacts to impute diploid contact matrices.

To tackle the aforementioned challenges, we developed a hierarchical Bayesian framework for allele-specific diploid Hi-C modeling, named ASHIC. Briefly, allele-specific contact counts are modeled as Poisson-multinomial random variables (referred to as the ASHIC-PM model) and diploid contact matrices and 3D structures are estimated via an EM algorithm. In addition, to overcome the sparsity issue of diploid Hi-C contact maps, we proposed a zero-inflated version of the ASHIC-PM method, namely the zero-inflated Poisson-multinomial model (in short, ASHIC-ZIPM). Both ASHIC models can completely dissect all diploid Hi-C contacts into allele-specific contact maps, while simultaneously reconstructing 3D homologous chromosomal structures. To the best of our knowledge, our ASHIC methods are the first methods that fully impute all allele-ambiguous contacts and infer both the diploid contact matrices and allelic 3D structures.

We thoroughly evaluated our methods through a series of simulation studies and demonstrated that our ASHIC methods outperformed existing allele-certain and mate-rescue approaches in various settings of sequencing coverage, SNP density, and homologous structural similarity. We also applied the ASHIC-ZIPM method to two published diploid Hi-C datasets (6,15). First, using the mouse patski data (15), we successfully confirmed that the predicted diploid contact maps and 3D structures of the homologous X chromosomes exhibited distinct chromatin conformations, where the inactive X demonstrated the bipartite superdomains (9). Furthermore, we studied fine-scale chromatin organizations of the imprinted H19/Igf2 region at 10 kb resolution and revealed distinct parental-specific chromatin interactions anchored at H19 and Igf2. With the fully imputed diploid contact matrices, we uncovered a maternal-specific sub-TAD organization at the H19/Igf2 imprinting region. Second, using the human GM12878 data (6), we further confirmed the maternal-specific sub-TAD structure and parental-specific chromatin interactions at the human H19/IGF2 imprinting locus. Our ASHIC-imputed allele-specific contact maps were consistent with the previously published chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) results (17).

MATERIALS AND METHODS

Notations of allele-specific chromatin contacts in diploid genomes

Let m and p denote a homologous chromosomal pair with the same length n in a diploid genome. To construct the diploid Hi-C contact frequency matrix, we partition the chromosomes into fixed-size non-overlapping bins and count chromatin contacts observed between each bin pair. In the diploid setting, chromatin contacts between the bins i and j can result from four distinct events: (mi, mj), (pi, pj), (mi, pj), or (pi, mj), where (mi, mj) and (pi, pj) are intra-chromosomal contacts on chromosome m and p, respectively, and (mi, pj) and (pi, mj) are inter-chromosomal contacts between the homologous chromosomes (Figure 1A). Therefore, the aggregated contact frequency Tij between the bins i and j can be calculated as follows: Inline graphic, where Inline graphic is the unknown true allele-specific contact frequency between ηi (bin i on chromosome η) and θj (bin j on chromosome θ) that we aim to estimate, η, θ ∈ {m, p}, 1 ≤ i, j ≤ n (Figure 1B).

Using heterozygous SNPs, we can classify single-end reads into three categories: reads containing allele-m-specific SNPs, reads containing allele-p-specific SNPs, and reads containing no SNPs. We refer to the first two categories as allele-certain reads and to the last category as allele-ambiguous reads. Furthermore, since Hi-C contacts are paired-end reads, each end of the mated pair can either be allele-certain or allele-ambiguous. Let Inline graphic indicate the frequency of both-end allele-certain contacts between the bins ηi and θj. In addition, we specify Inline graphic to be the contact frequency between ηi and θj where the allele identity of ηi is known but the allele identity of θj is unknown. In other words, one end of the Hi-C contact is from θj; however, the read does not overlap with any SNPs. Therefore the allele identity of θj remains unknown. Similarly, we use Inline graphic when the allele identity of ηi is unknown and Inline graphic when the allele identity of both ends is unknown. Hence, the true allele-specific contact frequency Inline graphic equals the sum of the following four components:

[Equation (1)]

In diploid Hi-C data, we cannot directly observe Inline graphic and Inline graphic since the read end mapped to the bin j is allele-ambiguous and hence, it cannot be distinguished between mj and pj. As a result, the observed Hi-C contacts contain the following types of allele-ambiguous contacts:

[Equation (2)]

where x indicates that the allele identity of a read end is unknown. We refer to Inline graphic and Inline graphic as one-end allele-ambiguous contacts and Inline graphic as both-end allele-ambiguous contacts (Figure 1C).

In summary, we define Inline graphic as the observed allele-specific contact frequencies, Inline graphic as the observed allele-ambiguous contact frequencies (LHS in Equation 2), and Inline graphic as the unobserved (hidden) allele-specific contact frequencies (RHS in Equation 2). Our goal is to decompose Inline graphic and infer Inline graphic in order to impute the true allele-specific frequencies Inline graphic by Equation (1) (Figure 1C).
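Because the typeset symbols are not reproduced in this text, the relations described in words above can be restated in an assumed notation. Here t denotes a true allele-specific count, n an observed count, z a hidden count, a tilde marks a read end whose allele is unobserved, and x marks an allele-ambiguous end; these symbol choices are placeholders for illustration and may differ from those in the published Equations (1) and (2).

```latex
\begin{align}
T_{ij} &= \sum_{\eta,\theta \in \{m,p\}} t^{\eta\theta}_{ij}, \\
t^{\eta\theta}_{ij} &= n^{\eta\theta}_{ij}
  + z^{\eta\tilde{\theta}}_{ij}
  + z^{\tilde{\eta}\theta}_{ij}
  + z^{\tilde{\eta}\tilde{\theta}}_{ij}, \\
n^{\eta x}_{ij} &= z^{\eta\tilde{m}}_{ij} + z^{\eta\tilde{p}}_{ij}, \qquad
n^{x\theta}_{ij} = z^{\tilde{m}\theta}_{ij} + z^{\tilde{p}\theta}_{ij}, \qquad
n^{xx}_{ij} = \sum_{\eta,\theta \in \{m,p\}} z^{\tilde{\eta}\tilde{\theta}}_{ij}.
\end{align}
```

The second line corresponds to the four-component sum of Equation (1); the third line corresponds to Equation (2), with the observed allele-ambiguous counts on the left and the hidden allele-specific counts on the right.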

Hierarchical Bayesian modeling of diploid chromatin contact maps and 3D structures

To model diploid Hi-C data, we propose a hierarchical Bayesian modeling framework for imputing the allele-specific chromatin contacts and reconstructing the allelic 3D structures. Specifically, we model the generation of allele-specific contacts with either a Poisson-multinomial process (the ASHIC-PM model) or a zero-inflated Poisson-multinomial process (the ASHIC-ZIPM model) for the inference of diploid contact matrices and 3D structures. The ASHIC-ZIPM model is a zero-inflated version of ASHIC-PM and it explicitly accounts for the excessive zeros observed in Hi-C contact matrices.

Here, we describe ASHIC-ZIPM, the zero-inflated version of our hierarchical Bayesian model and the corresponding EM algorithm for model fitting. The details for the ASHIC-PM model are available in Supplementary Methods.

Modeling true allele-specific contact frequencies from diploid 3D structures

We adopt the coarse-grained polymer model (18) to represent the chromosomal structures. Each bin in the genome is represented as a bead in the 3D space, and each chromosome can be viewed as a chain of beads. Specifically, we let Inline graphic and Inline graphic denote the 3D coordinates of the homologous chromosomes m and p, respectively, where Inline graphic. Let Inline graphic and Inline graphic be the 3D coordinates of beads ηi and θj, respectively, where η, θ ∈ {m, p}. According to polymer physics (16,19), the contact frequency Inline graphic between ηi and θj is inversely correlated with their spatial distance Inline graphic, following a power-law decay function. That is, Inline graphic, where α < 0 is the exponent of the distance-decay function, and Inline graphic is the Euclidean distance between beads ηi and θj.

Similar to the PASTIS method (16), we model the true allele-specific contact frequency Inline graphic as a Poisson random variable, Inline graphic, where β is a scaling factor (the ASHIC-PM model, Supplementary Methods). Furthermore, to account for the excessive zeros in Hi-C contact matrices, we propose to use a zero-inflated Poisson (ZIP) distribution to model the contact counts (the ASHIC-ZIPM model) (Figure 1D).
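As a small worked example of the distance-decay relationship, the sketch below computes a Poisson mean of the form β·d^α for two beads. The function name is hypothetical, and the defaults α = −3 and β = 1 are simply the values this article reports fixing for real data; the snippet is an illustration, not part of the ASHIC software.

```python
import math

def expected_contact(x, y, alpha=-3.0, beta=1.0):
    """Poisson mean beta * d**alpha for two beads at 3D coordinates x and y."""
    d = math.dist(x, y)  # Euclidean distance between the two beads
    return beta * d ** alpha

origin = (0.0, 0.0, 0.0)
near = expected_contact(origin, (1.0, 0.0, 0.0))  # beads one unit apart
far = expected_contact(origin, (2.0, 0.0, 0.0))   # beads two units apart
print(near, far, near / far)
```

With α = −3, doubling the distance reduces the expected contact count eightfold, which is why a negative exponent encodes the decay of contacts with spatial separation.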

In the ASHIC-ZIPM model, we assume that Inline graphic follows a ZIP distribution.

[Equation (3)]

Different from the ASHIC-PM model, here we introduce Inline graphic, a latent binary variable to indicate whether Inline graphic is generated from the Poisson state Inline graphic or the missing state Inline graphic.

Furthermore, we assume that Inline graphic follows a Bernoulli prior with a success probability Inline graphic.

[Equation (4)]

For intra-chromosomal contacts (where η = θ), Inline graphic is a function of the corresponding genomic distance. For inter-chromosomal contacts (η ≠ θ), Inline graphic is set to a constant.

In other words, the true allele-specific contact frequency Inline graphic is a mixture of two states. In the Poisson state (with probability Inline graphic), Inline graphic follows a Poisson distribution; whereas in the missing state (with probability Inline graphic), Inline graphic. The Inline graphic parameter acts as a weight between the Poisson and missing states. Note that when all latent variables Inline graphic are equal to 1, the ASHIC-ZIPM model reduces to the ASHIC-PM model. Hence, ASHIC-PM is a special case of ASHIC-ZIPM.
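The two-state mixture can be made concrete with a short sketch of a zero-inflated Poisson probability mass function and sampler. The names `lam` (Poisson mean) and `gamma` (probability of the Poisson state) are assumed stand-ins for the symbols lost from this text, and the parameter values are arbitrary.

```python
import math
import random

def zip_pmf(k, lam, gamma):
    """P(t = k) when t ~ Poisson(lam) with probability gamma and t = 0 otherwise."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return gamma * poisson + (1.0 - gamma) * (1.0 if k == 0 else 0.0)

def sample_zip(lam, gamma, rng):
    """Draw one count: first the latent state, then (if Poisson) the count."""
    if rng.random() >= gamma:
        return 0  # missing state: the count is forced to zero
    # Knuth's multiplication method for a Poisson draw (adequate for small lam)
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(0)
draws = [sample_zip(2.0, 0.7, rng) for _ in range(20000)]
zero_fraction = draws.count(0) / len(draws)
print(zero_fraction, zip_pmf(0, 2.0, 0.7))
```

Setting `gamma = 1.0` removes the missing state and recovers the ordinary Poisson distribution, mirroring the statement that ASHIC-PM is a special case of ASHIC-ZIPM.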

Modeling allele-identifiable probability and allele-ambiguous contact counts

As discussed above, we cannot directly observe the allele identity of all diploid Hi-C contacts. We use qi to denote the allele-identifiable probability of bin i in the genome, i.e. if a single-end read is mapped to bin i, the probability that the read overlaps with SNP(s) (and therefore can be distinguished between alleles m and p) is qi. Consequently, assuming that bins i and j are independent, the probabilities that a paired-end contact between the bins i and j is both-end allele-certain (qij), one-end allele-ambiguous at bin i (Inline graphic), one-end allele-ambiguous at bin j (Inline graphic), and both-end allele-ambiguous (Inline graphic) can be calculated as follows:

[Equation (5)]
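Under the stated independence assumption, the four probabilities are products of q and (1 − q) terms for the two bins. The sketch below spells this out; the dictionary keys are hypothetical labels for the four categories, and the example values of q are arbitrary.

```python
def identifiability_probs(q_i, q_j):
    """Probabilities of the four allele-identifiability outcomes for a contact (i, j).

    q_i and q_j are the probabilities that a read end in bin i or bin j
    overlaps a SNP; the two ends are treated as independent.
    """
    return {
        "certain": q_i * q_j,                      # both ends overlap a SNP
        "ambiguous_at_i": (1.0 - q_i) * q_j,       # only the end in bin j is phased
        "ambiguous_at_j": q_i * (1.0 - q_j),       # only the end in bin i is phased
        "both_ambiguous": (1.0 - q_i) * (1.0 - q_j),
    }

probs = identifiability_probs(0.5, 0.2)
print(probs, sum(probs.values()))
```

The four outcomes are exhaustive and mutually exclusive, so the probabilities always sum to one.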

Recall in Equation (1), the true allele-specific contact frequency Inline graphic can be expressed as the sum of one observed allele-certain contact frequency Inline graphic and three unobserved hidden allele-specific contact frequencies Inline graphic. We assume that the decomposed allele-specific contact frequencies follow a multinomial distribution.

[Equation (6)]

Based on the above assumptions, we derive that Inline graphic given Inline graphic are mutually conditionally independent ZIP random variables. Consequently, we demonstrate that the observed allele-ambiguous contact frequencies Inline graphic, Inline graphic, and Inline graphic are ZIP random variables and mutually conditionally independent given Inline graphic (Supplementary Methods).

[Equation (7)]

where Inline graphic or Inline graphic, i.e., Inline graphic if Inline graphic or Inline graphic. Similarly, we denote Inline graphic, Inline graphic.
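The independence result rests on a standard property of the Poisson distribution: if a Poisson count is split multinomially, the resulting category counts are independent Poisson variables whose means are the original mean multiplied by the category probabilities. The simulation below checks the means empirically for the Poisson state only (it ignores zero inflation); the category labels, the mean of 4 and the probabilities are arbitrary illustrative choices.

```python
import math
import random

def poisson_draw(lam, rng):
    """Knuth's multiplication method for a Poisson draw (adequate for small lam)."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(1)
lam = 4.0
split = {"certain": 0.1, "ambiguous_at_i": 0.1,
         "ambiguous_at_j": 0.4, "both_ambiguous": 0.4}
labels = list(split)
weights = [split[name] for name in labels]

trials = 20000
totals = dict.fromkeys(labels, 0)
for _ in range(trials):
    t = poisson_draw(lam, rng)                      # true allele-specific count
    for name in rng.choices(labels, weights=weights, k=t):
        totals[name] += 1                           # multinomial split of t

means = {name: totals[name] / trials for name in labels}
for name in labels:
    print(name, round(means[name], 3), "expected", lam * split[name])
```

Each empirical mean should lie close to `lam * split[name]`, as the thinning property predicts.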

Incorporating bias factors

Real Hi-C data contains various types of systematic biases. Similar to the ICE method (20), we assume that the bias of observing contacts between bins ηi and θj can be factorized as the product of the bias factors Inline graphic and Inline graphic of the two bins, respectively. Hence, we can incorporate bias factors into the ASHIC-ZIPM model as follows:

[Equation (8)]

Our ASHIC software provides an option to estimate the bias factors from real diploid Hi-C data and to incorporate them into our model (Supplementary Methods).

Estimating allele-specific chromatin structures and contact frequencies via EM algorithm

We design an EM algorithm to simultaneously infer 3D structures and estimate model parameters. The EM algorithm for the ASHIC-ZIPM model is briefly outlined below. Details of the EM algorithms for both ASHIC-PM and ASHIC-ZIPM are available in Supplementary Methods.

In the ASHIC-ZIPM model, the parameter space contains the homologous chromosome structures Inline graphic and Inline graphic, the distance-decay exponent α, the scaling factor β, the hyperparameter Inline graphic (for the Bernoulli prior distribution of the Poisson state latent variables Inline graphic), and the allele-identifiable probabilities Inline graphic, 1 ≤ k ≤ n. Note that in Equation (3), the ZIP parameter Inline graphic is a function of α, β, Inline graphic and Inline graphic. Here, we fix α and β in order to obtain a unique solution for Inline graphic and Inline graphic. Specifically, we use the true (simulated) value of α in simulations and set α = −3 in real data. We set β = 1 in both cases.

From the diploid Hi-C data, we can directly observe the allele-certain contacts Inline graphic and the allele-ambiguous contacts Inline graphic. The unobserved latent variables include the hidden allele-specific contacts Inline graphic and the Poisson state latent variables Inline graphic. The goal of the EM algorithm is to find the maximum likelihood estimate (MLE) of the model parameters, reconstruct the allele-specific 3D structures Inline graphic and Inline graphic, and impute the true allele-specific contacts Inline graphic.

The complete likelihood of the observed data Inline graphic, Inline graphic and the unobserved latent variables Inline graphic, Inline graphic is

[Equation (9)]

To solve the MLE of the marginal likelihood of observed data Inline graphic, Inline graphic, we propose an EM algorithm which applies the following two steps iteratively:

  • Expectation step (E-step):

    Inline graphic

  • Maximization step (M-step):

    Inline graphic

The pseudocode and detailed steps of the EM algorithm are available in Supplementary Methods. In particular, while estimating the homologous structures Inline graphic and Inline graphic, we develop an inter-homologous optimization strategy. Briefly, we first estimate Inline graphic and Inline graphic separately, then estimate the relative position between the two homologs to improve the final structure prediction (Supplementary Methods). The 3D coordinates of each initial structure were randomly sampled from a unit cube. We first applied the multidimensional scaling method (16,21) to obtain a draft structure, and then used the draft structure as the starting point for the ASHIC models.
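To give a flavor of the expectation step, the sketch below shows how an observed allele-ambiguous count could be divided among its candidate allele-specific origins under a plain Poisson model. It relies on the standard fact that independent Poisson counts, conditioned on their sum, follow a multinomial distribution with probabilities proportional to their means. This is a deliberate simplification: it ignores the zero-inflation latent states and bias factors, the function name and rate values are hypothetical, and the actual E-step is the one given in Supplementary Methods.

```python
def expected_hidden_counts(n_ambiguous, rates):
    """Split an observed ambiguous count among candidate origins.

    rates maps each candidate origin (e.g. "mm", "mp") to its Poisson mean.
    The allele-identifiability factor is shared by all candidates of the
    same ambiguous contact type, so it cancels in the proportions.
    """
    total = sum(rates.values())
    if total == 0:
        return {origin: 0.0 for origin in rates}
    return {origin: n_ambiguous * rate / total for origin, rate in rates.items()}

# One-end allele-ambiguous: the end in bin i is maternal, the end in bin j is unknown.
one_end = expected_hidden_counts(10, {"mm": 3.0, "mp": 1.0})
# Both-end allele-ambiguous: all four origins are candidates.
both_end = expected_hidden_counts(8, {"mm": 3.0, "mp": 1.0, "pm": 1.0, "pp": 3.0})
print(one_end)
print(both_end)
```

Unlike mate-rescue, which would assign all ten one-end ambiguous contacts to (m, m), this proportional split leaves a share for the inter-homologous origin whenever the current structures place the two loci close together.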

We implement both ASHIC-PM and ASHIC-ZIPM methods in Python. The software can be accessed publicly at https://github.com/wmalab/ASHIC.

Simulation settings

Simulating allele-specific chromatin contacts from homologous X chromosome structures

First, we considered the homologous X chromosomes as the ground truth and simulated diploid Hi-C datasets as described below. We assumed that the allele-specific chromatin contact frequencies follow the ASHIC-ZIPM model. The true model parameters αm, αp, β, Inline graphic and Inline graphic were estimated from two published datasets on human GM12878 cells: the predicted X chromosome structures from single-cell Hi-C data by Tan et al. (13), and the allele-specific contact matrices from in situ bulk Hi-C data by Rao et al. (6), both at 100 kb resolution (Supplementary Methods, Supplementary Table S2).

At the default setting, we generated 10 simulated allele-specific Hi-C datasets with the scale factor Inline graphic and the average allele-identifiable probability Inline graphic. Subsequently, we kept other parameters fixed and generated 10 additional datasets for each of the decreased β values (Inline graphic, Inline graphic, and Inline graphic), and another 10 datasets for each of the decreased Inline graphic values (0.25, 0.1 and 0.05). In total, 70 diploid Hi-C datasets were generated in this simulation study. For each simulated dataset, we ran 10 random initializations and chose the result with the highest observed log-likelihood for performance evaluation and subsequent analyses.

Simulating allele-specific chromatin contacts from identical chromosome structures

To study the effect of structural differences on the performance of our methods, we deployed a challenging simulation setting where we simulated diploid Hi-C datasets using two identical chromosome structures. Briefly, we duplicated the paternal (inactive) X chromosome structure Inline graphic predicted by Tan et al. (13) as the pseudo-maternal structure. Then we used the two identical chromatin structures as the ground truth and simulated diploid Hi-C datasets in a similar manner as previously described.

The relative position of these two identical structures was determined by a reversed structural superposition procedure. Using the original homologous structures Inline graphic and Inline graphic, we calculated the optimal translation vector Inline graphic and rotation matrix Inline graphic using the Kabsch algorithm (22), such that the root-mean-square deviation (RMSD) between Inline graphic and Inline graphic was minimized. Then we duplicated Inline graphic and reversed the superposition of Inline graphic by Inline graphic. The resulting identical structures Inline graphic and Inline graphic served as the pseudo-homologous chromosome structures in which the relative position between Inline graphic and Inline graphic remained approximately the same as the original homologous pair Inline graphic and Inline graphic.
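The superposition and its reversal can be sketched with NumPy as follows. This is a generic implementation of the Kabsch algorithm applied to synthetic coordinates, not the authors' code; the direction of the superposition (maternal onto paternal), the helper names and the toy structures are assumptions made for illustration.

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R and translation v minimizing the RMSD between P @ R.T + v and Q."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)                 # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    v = cQ - R @ cP
    return R, v

def superpose(X, R, v):
    """Apply the rigid transform to row-vector coordinates."""
    return X @ R.T + v

def reverse_superposition(X, R, v):
    """Undo the rigid transform (R is orthogonal, so its inverse is R.T)."""
    return (X - v) @ R

def rmsd(A, B):
    return float(np.sqrt(np.mean(np.sum((A - B) ** 2, axis=1))))

# Synthetic stand-ins for the two homologous structures (50 beads each).
rng = np.random.default_rng(0)
X_m = rng.normal(size=(50, 3))
theta = 0.3
R0 = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
X_p = superpose(X_m, R0, np.array([1.0, -2.0, 0.5])) + rng.normal(scale=0.01, size=(50, 3))

R, v = kabsch(X_m, X_p)                           # superpose X_m onto X_p
pseudo_m = reverse_superposition(X_p.copy(), R, v)  # copy of X_p placed where X_m sits
print(rmsd(superpose(X_m, R, v), X_p), rmsd(pseudo_m, X_m))
```

The reversed copy has exactly the internal geometry of the paternal structure but occupies roughly the position and orientation of the maternal one, which is the property the simulation setting requires.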

Real Hi-C data processing and analysis

We used two diploid Hi-C datasets in our study (Supplementary Table S2). First, allelic mapping results of the wild-type patski Hi-C dataset published by Bonora et al. (15) were downloaded from GEO (GSE107282). Second, the raw sequencing reads of the GM12878 Hi-C dataset published by Rao et al. (6) were downloaded from GEO (GSE63525) and the allele-specific mapping was performed using HiC-Pro (23). Briefly, HiC-Pro aligned reads to a masked reference genome in which all SNP sites were N-masked. Reads overlapping SNP sites were then assigned to either the maternal or the paternal allele based on the nucleotide at the SNP position. Reads that did not overlap with any SNPs were labeled as allele-ambiguous. Reads with conflicting allele assignments or an unexpected allele at SNP sites were discarded. For each genomic region, we ran 20 random initializations with the ASHIC-ZIPM model and chose the one with the highest likelihood for subsequent analyses.
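The read-level assignment rules described above amount to a small decision procedure, sketched below. This illustrates the stated logic only and is not HiC-Pro's implementation; the input representation (one tuple per overlapped SNP) and the return labels are hypothetical.

```python
def assign_allele(snp_observations):
    """Assign a read end to an allele from the SNPs it overlaps.

    snp_observations: list of (read_base, maternal_base, paternal_base),
    one entry per heterozygous SNP covered by the read.
    Returns "m", "p", "ambiguous" (no SNP covered) or "discard".
    """
    if not snp_observations:
        return "ambiguous"            # no SNP overlap: allele cannot be determined
    calls = set()
    for read_base, maternal_base, paternal_base in snp_observations:
        if read_base == maternal_base:
            calls.add("m")
        elif read_base == paternal_base:
            calls.add("p")
        else:
            return "discard"          # unexpected allele at a SNP site
    if len(calls) > 1:
        return "discard"              # SNPs on the same read disagree
    return calls.pop()

print(assign_allele([("A", "A", "G")]))                    # maternal
print(assign_allele([("G", "A", "G"), ("T", "C", "T")]))   # paternal at both SNPs
print(assign_allele([]))                                   # no SNP covered
print(assign_allele([("A", "A", "G"), ("T", "C", "T")]))   # conflicting SNPs
```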

Convergence and running time

The EM algorithm is considered to have converged when the relative increase in log-likelihood between two consecutive iterations is <10⁻⁴. We tested our ASHIC software using a single core on an Intel E5-2683v4 processor with 8 GB memory allocation. In a typical simulation setting, two X chromosomes were partitioned into 3000+ bins at 100 kb resolution. With the default sequencing coverage (Inline graphic) and SNP density (Inline graphic) setting, both ASHIC-ZIPM and ASHIC-PM converged within 20 iterations (2 h). Lower coverage or lower SNP density requires more iterations. For example, when Inline graphic was reduced to 0.05, the EM algorithm of ASHIC-ZIPM took about 50 iterations (8 h) to converge. When β decreased to Inline graphic, the EM algorithm of ASHIC-ZIPM required ∼90 iterations (20 h) to converge.
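A stopping rule of this kind can be written as a one-line check. The sketch below assumes that the relative increase is normalized by the absolute value of the previous log-likelihood, which is one common convention; the exact normalization used by the software is not stated here, and the function name is hypothetical.

```python
def has_converged(prev_loglik, curr_loglik, tol=1e-4):
    """True when the relative log-likelihood increase falls below tol."""
    relative_increase = (curr_loglik - prev_loglik) / abs(prev_loglik)
    return relative_increase < tol

print(has_converged(-1000.0, -990.0))    # 1% improvement: keep iterating
print(has_converged(-1000.0, -999.95))   # 0.005% improvement: stop
```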

Evaluation metrics

We used the following evaluation metrics in the simulation studies: the recovery rate (RR) for measuring the proportion of diploid Hi-C contacts recovered by each method, the imputation error rate (IER) and the stratum adjusted correlation coefficient (SCC) (24) for measuring the accuracy of imputed diploid contact matrices, the distance error rate (DER) and homologous distance error rate (HDER) for measuring the accuracy of predicted allelic 3D structures. Additionally, we calculated the recall, precision and F1 score to evaluate the allele-specific chromatin interactions identified from imputed diploid contact matrices. In the real data analysis of mouse X chromosomes, we used the bipartite index (BI) (9) to measure the bipartition organization of the inactive X chromosome, and calculated the radius of gyration (Rg) to measure the compactness of both X chromosomes. The detailed definitions of these evaluation metrics can be found in Supplementary Methods.

RESULTS

Simulation studies on homologous X chromosomes

Default simulation setting

We first evaluated the performance of the proposed ASHIC methods on simulated diploid Hi-C datasets of the homologous X chromosomes in human GM12878 cells. Of the two X chromosomes, the active X chromosome (denoted as Xa) is the maternal copy and the inactive X chromosome (denoted as Xi) is the paternal copy. We considered the 3D structures of Xa and Xi published by Tan et al. (13) as the ground truth and generated 10 simulated diploid Hi-C datasets at 100-kb resolution (see Materials and Methods). Each simulated dataset contained two intra-chromosomal contact matrices, one for Xa and one for Xi, as well as one inter-chromosomal contact matrix between Xa and Xi.

We compared our ASHIC-ZIPM and ASHIC-PM methods with two commonly used approaches for analyzing diploid Hi-C data. The first approach is the allele-certain method that uses only both-end allele-certain contacts (6,11). The second approach is the mate-rescue method that combines both-end allele-certain contacts with one-end allele-ambiguous contacts by assigning the allele-ambiguous read-end to the same allele as the allele-certain mate-end (10,12,15).

To evaluate the imputation of diploid Hi-C contact maps, we first calculated the proportion of allele-specific contacts recovered by each method (Supplementary Table S3). At the default sequencing coverage (Inline graphic) and SNP density (Inline graphic) setting, the allele-certain and mate-rescue approaches recovered markedly smaller proportions of diploid chromatin contacts (25.65% and 75.55%, respectively) than the ASHIC-ZIPM and ASHIC-PM methods, which recovered all one-end and both-end allele-ambiguous reads and thereby achieved a 100% recovery rate.

Next, we sought to assess the accuracy of the imputed allele-specific contact matrices. Recent studies have demonstrated that the genomic distance dependence and sequencing depth have confounding effects on measuring the similarity between Hi-C contact matrices (24). To account for these confounding factors, we computed the stratum adjusted correlation coefficient (SCC) using the HiCRep package (24) to measure the similarity between the imputed contact matrices and true matrices (Figure 2A). We observed that the imputed diploid matrices obtained by ASHIC-ZIPM and ASHIC-PM had near-perfect SCC values of 0.9997 and 0.9996, respectively; whereas mate-rescue and allele-certain methods demonstrated lower SCC values of 0.9733 and 0.8100, respectively. ASHIC-ZIPM showed significantly higher SCC values than ASHIC-PM (P-value = 2.53 × 10⁻³, one-sided paired Wilcoxon signed-rank test). In addition, ASHIC-ZIPM performed significantly better than the allele-certain and mate-rescue methods (P-values = 2.53 × 10⁻³, one-sided paired Wilcoxon signed-rank tests). Note that P = 2.53 × 10⁻³ is the smallest possible P-value given the sample size.
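The floor on the P-value can be checked by hand. With 10 paired datasets, the most extreme outcome of a signed-rank test is that all 10 differences favor the same method, giving a rank sum of 55. The sketch below evaluates this outcome under a normal approximation without continuity correction, which reproduces the reported value; that this approximation was the one used is an assumption, as the text does not state how the test was computed.

```python
import math

n = 10                                         # number of paired simulated datasets
w_max = n * (n + 1) / 2                        # largest possible signed-rank sum
mean = n * (n + 1) / 4                         # mean of the statistic under the null
sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24) # standard deviation under the null
z = (w_max - mean) / sd
p_normal = 0.5 * math.erfc(z / math.sqrt(2))   # one-sided upper-tail probability
p_exact = 1 / 2 ** n                           # exact null probability of this outcome

print(f"z = {z:.3f}, normal-approximation P = {p_normal:.2e}, exact P = {p_exact:.2e}")
```

The exact one-sided probability of the same outcome is 1/1024, so the value obtained depends on whether an exact or an approximate null distribution is used.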

Figure 2.


Evaluation on simulated homologous X chromosome (Xa/Xi) data. (A) Stratum-adjusted correlation coefficients (SCCs) and (B) Pearson’s correlation coefficients (PCCs) between the imputed diploid contact matrices and the true contact matrices. The PCC curves are smoothed using locally weighted regression (LOESS). (C) Distance error rates between the predicted allelic 3D structures and the true structures. (D) F1 scores of the identified allele-specific chromatin interactions.

The SCC statistic is a weighted average of the Pearson’s correlation coefficients (PCCs) across different genomic distances (24). To break down the effect of genomic distance, we computed the PCCs between the imputed contact matrices and the true matrices at different genomic distances (Figure 2B). As expected, the PCC values decreased as the genomic distance increased for all four methods. We observed that the ASHIC-ZIPM and ASHIC-PM methods demonstrated similar PCC values across all genomic distances. In addition, the ASHIC-ZIPM and ASHIC-PM methods outperformed the allele-certain and mate-rescue approaches by a large margin, especially at large genomic distances. Taken together, the SCC and PCC results showed that our ASHIC methods can accurately impute allele-specific contact matrices, with an imputation accuracy exceeding that of the allele-certain and mate-rescue approaches, especially for long-range contacts.
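The idea of a distance-stratified correlation can be sketched as follows. This is a simplified illustration, not the HiCRep implementation: it omits HiCRep's smoothing and variance stabilization, and weights each stratum by its size times the product of the two standard deviations. The function names are hypothetical.

```python
import math

def diagonal(mat, k):
    """Entries (i, i + k) of a square matrix: one genomic-distance stratum."""
    return [mat[i][i + k] for i in range(len(mat) - k)]

def stratum_stats(x, y):
    """Return (Pearson r, weight) for one stratum, or None if r is undefined."""
    n = len(x)
    if n < 2:
        return None
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    # The weight sqrt(sxx * syy) equals n * sd_x * sd_y (population sd).
    return sxy / math.sqrt(sxx * syy), math.sqrt(sxx * syy)

def weighted_average_pcc(a, b, max_k):
    """Weighted average of per-stratum PCCs over distances 1..max_k (in bins)."""
    num = den = 0.0
    for k in range(1, max_k + 1):
        stats = stratum_stats(diagonal(a, k), diagonal(b, k))
        if stats is None:
            continue  # skip strata where the correlation is undefined
        r, w = stats
        num += w * r
        den += w
    return num / den if den else None

# Toy matrices: "imputed" is a rescaled copy of "truth", so every stratum has r = 1.
truth = [[(i + 1) * (j + 1) / (1 + abs(i - j)) for j in range(6)] for i in range(6)]
imputed = [[2 * v for v in row] for row in truth]
score = weighted_average_pcc(truth, imputed, max_k=4)
```

Reporting the first element returned by `stratum_stats` for each `k` separately gives the per-distance PCC curve, while the weighted average collapses it into a single score.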

In addition to imputing diploid Hi-C contact matrices, the ASHIC-ZIPM and ASHIC-PM methods also predict allele-specific 3D structures. To evaluate the accuracy of the predicted allelic structures, we calculated the distance error rates between the predicted structures and the ground truth (Figure 2C). We observed that ASHIC-ZIPM yielded significantly lower distance error rates, and thus more accurate allelic 3D structures, than ASHIC-PM (P-value = 2.53 × 10−3, one-sided paired Wilcoxon signed-rank test).

Furthermore, we investigated whether the imputed diploid contact matrices can facilitate the detection of allele-specific chromatin interactions. First, we called significant interactions using the Fit-Hi-C package (25) on the true diploid contact matrices. We subsequently defined the maternal-specific interactions as the interactions that were called only from the true maternal matrix but not from the paternal matrix. The paternal-specific interactions were defined accordingly. The final set of true allele-specific interactions was defined as the union of both monoallelic sets, which contained 9061.5 interactions on average (Supplementary Table S5). Following the same procedure, we then identified the allele-specific interactions from the imputed diploid contact matrices resulting from the four methods, separately. We evaluated the identified allele-specific interactions from each method using three metrics: precision, recall, and their harmonic mean F1 score (Figure 2D, Supplementary Figure S1). ASHIC-ZIPM and ASHIC-PM maintained the highest F1 scores of 0.9867 and 0.9853, respectively. In addition, ASHIC-ZIPM significantly outperformed mate-rescue (F1 = 0.8940) and allele-certain (F1 = 0.6024) in terms of the F1 scores (P-values = 2.53 × 10−3, one-sided paired Wilcoxon signed-rank tests). The low F1 scores of the mate-rescue and allele-certain methods were primarily contributed by their low recall rates (Supplementary Figure S1), which was a result of their low recovery rates of allele-ambiguous contacts (Supplementary Table S3).
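The set construction and scoring described above can be sketched with plain Python sets. Interactions are represented here as pairs of bin indices, and each allele-specific interaction is tagged with its allele so that a maternal-specific call cannot match a paternal-specific one; this representation and the toy inputs are illustrative assumptions.

```python
def allele_specific(maternal, paternal):
    """Union of maternal-only and paternal-only interactions, tagged by allele."""
    mat_only = {("maternal", pair) for pair in maternal - paternal}
    pat_only = {("paternal", pair) for pair in paternal - maternal}
    return mat_only | pat_only

def precision_recall_f1(predicted, truth):
    """Precision, recall and their harmonic mean (F1) for two sets of calls."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Toy significant-interaction calls on true and imputed matrices.
true_set = allele_specific({(1, 5), (2, 8), (3, 9)}, {(3, 9), (4, 7)})
pred_set = allele_specific({(1, 5), (3, 9)}, {(3, 9), (4, 7), (6, 10)})
precision, recall, f1 = precision_recall_f1(pred_set, true_set)
```

In this toy case the interaction (3, 9) is called on both alleles and is therefore excluded from both allele-specific sets.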

Collectively, our comparisons have demonstrated that the proposed ASHIC-ZIPM and ASHIC-PM methods outperformed the existing mate-rescue and allele-certain approaches with respect to the recovery rate of allele-ambiguous contacts, the accuracy of imputed diploid contact matrices and predicted allelic 3D structures, and the ability to facilitate the detection of allele-specific chromatin loops. In addition, ASHIC-ZIPM demonstrated a better performance overall than that of ASHIC-PM, especially in the prediction of allelic 3D structures. To further evaluate the performance of these methods under different circumstances, we conducted a series of additional simulation experiments by adjusting three major factors: sequencing coverage, SNP density, and homologous structural similarity.

Performance on low sequencing coverage data

The sequencing coverage of Hi-C contact matrices is a major factor that can affect the performance of the diploid Hi-C methods. An observed zero entry in the Hi-C contact matrix can be either a ‘true’ zero as a result of no physical contact between the pair of chromatin fragments, or a ‘missing’ zero as a result of insufficient sequencing coverage. Lower sequencing depth of Hi-C experiments yields lower-coverage, sparser contact matrices that contain excessive ‘missing’ zeros. As a result, it becomes more challenging to distinguish the ‘true’ zeros from the ‘missing’ zeros.

While generating the simulation datasets, the scale factor β controls the coverage of simulated contact matrices. We estimated Inline graphic from the published Hi-C data by Rao et al. (6) (Supplementary Methods). At the default Inline graphic setting, the simulated Hi-C map contained about 4.9 million contacts from the homologous X chromosomes. To evaluate the performance of our methods on lower-coverage data, we fixed the SNP density Inline graphic and gradually decreased the value of β from Inline graphic to Inline graphic, Inline graphic, and Inline graphic, resulting in 2.5, 1.0 and 0.5 million contacts, respectively. We then repeated the assessments of the ASHIC-ZIPM, ASHIC-PM, mate-rescue, and allele-certain methods with these low-coverage simulation datasets.

As shown in Figure 3A, ASHIC-ZIPM and ASHIC-PM maintained the highest SCC values across all coverage levels. When the sequencing coverage decreased from Inline graphic to Inline graphic, the SCC values for both methods dropped by only 0.28%. In contrast, as the sequencing coverage was lowered, the SCC values for mate-rescue and allele-certain decreased noticeably, by 1.80% and 10.38%, respectively. These results suggested that our ASHIC methods can robustly and accurately infer allele-specific contact matrices under low sequencing coverage conditions.

Figure 3.


ASHIC-ZIPM accurately imputes diploid contact maps and 3D structures on low-coverage Xa/Xi simulation data. (A) SCCs between the imputed diploid contact matrices and the true contact matrices, (B) Distance error rates between the predicted allelic 3D structures and the true structures, and (C) F1 scores of the identified allele-specific chromatin interactions at different sequencing coverage β levels.

Additionally, we observed that ASHIC-ZIPM produced more accurate 3D structures with smaller distance error rates than those produced by ASHIC-PM across all sequencing coverage levels (Figure 3B). The improvements of the distance error rates were significant at coverage levels Inline graphic, Inline graphic and Inline graphic (P-values = 2.53 × 10−3, 2.53 × 10−3, 6.26 × 10−3, respectively, one-sided paired Wilcoxon signed-rank tests).

When the sequencing coverage decreased from Inline graphic to Inline graphic, the true set of allele-specific interactions decreased from 9061.5 to 2136.4 interactions (Supplementary Table S5, Supplementary Methods). As shown in Figure 3C, when the coverage decreased from Inline graphic to Inline graphic, the ability of the allele-certain method to detect allele-specific interactions was highly impacted as its F1 scores dropped by 35.17% from 0.6024 to 0.3906. The decrease of F1 score for mate-rescue was less severe, about Inline graphic from 0.8940 to 0.8144. The ASHIC methods consistently delivered robust results against coverage changes (ASHIC-ZIPM: Inline graphic, ASHIC-PM: Inline graphic), and maintained high F1 score even at the lowest Inline graphic level (ASHIC-ZIPM: 0.9743, ASHIC-PM: 0.9740). The decay in F1 scores for the allele-certain and mate-rescue methods was primarily contributed by their low recall rates (Supplementary Figure S1).

Taken together, our results demonstrated that the ASHIC methods significantly outperformed other methods in low sequencing coverage conditions, resulting in more accurately imputed matrices and benefiting the detection of allele-specific interactions on low-coverage data. In particular, we observed that ASHIC-ZIPM had better performance than ASHIC-PM under low coverage conditions. This is because, in our ASHIC-ZIPM model, the Poisson state probabilities Inline graphic act as weights between the ‘true’ and ‘missing’ zeros. When the sequencing coverage was lowered, the observed diploid matrices contained additional ‘missing’ zeros. The zero-inflated model explicitly adjusted the estimation of Inline graphic to model these ‘missing’ zeros, thereby achieving better model fitting results. Consistent with our expectations, the estimated values of Inline graphic became smaller as coverage decreased, which demonstrated the model's ability to account for the additional ‘missing’ zeros (Supplementary Figure S2).
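The role of the Poisson state probability can be illustrated with a generic zero-inflated Poisson distribution. The sketch below is a textbook ZIP, not the full ASHIC-ZIPM likelihood; the names `lam` (Poisson rate) and `pi` (probability of the Poisson state) are placeholders chosen here. Under this reading, a zero drawn from the Poisson state plays the role of a ‘true’ zero and a zero from the other state plays the role of a ‘missing’ zero.

```python
import math

def zip_pmf(k, lam, pi):
    """P(count = k) under a generic zero-inflated Poisson.

    With probability pi the count is Poisson(lam); otherwise it is forced to zero.
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return (1 - pi) + pi * poisson
    return pi * poisson

def prob_zero_from_poisson(lam, pi):
    """Posterior probability that an observed zero came from the Poisson state."""
    poisson_zero = pi * math.exp(-lam)
    return poisson_zero / ((1 - pi) + poisson_zero)

# A smaller pi shifts more of the observed zeros to the zero-inflation state.
high_pi = prob_zero_from_poisson(lam=2.0, pi=0.9)
low_pi = prob_zero_from_poisson(lam=2.0, pi=0.5)
```

Lowering `pi` while holding `lam` fixed raises the overall probability of a zero, which is the qualitative behaviour described above for low-coverage data.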

Performance on low SNP density data

In addition to the sequencing coverage, the SNP density is another major factor affecting the performance of the diploid Hi-C methods. The SNP density varies across different species and cell lines. For example, the F1 mouse cross (BL6×Spretus) has a relatively high SNP density of approximately 1 SNP per 75 bp. On average, a 70-bp sequence read has a 60% chance of overlapping with one or more SNPs and thus being allele-identifiable. In contrast, the GM12878 cell line has a low SNP density of about 1 SNP per 1700 bp, which corresponds to an average allele-identifiable probability of 0.04 (Supplementary Table S1). To evaluate the performance of our methods on low-SNP-density data, we fixed the coverage level at Inline graphic and then gradually decreased Inline graphic, the average allele-identifiable probability, from 0.5, which mimics the BL6×Spretus cross, to 0.25, 0.1 and 0.05, where the smallest value mimics the GM12878 cells.
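The relation between SNP density and allele-identifiable probability quoted above can be approximated with a back-of-the-envelope calculation. The sketch assumes that each base is a SNP independently with probability one over the SNP spacing, which is a simplification introduced here and not necessarily how Supplementary Table S1 was derived.

```python
def allele_identifiable_prob(read_len, bp_per_snp):
    """Probability that a read of read_len bp overlaps at least one SNP.

    Assumes each base is independently a SNP with probability 1 / bp_per_snp.
    """
    return 1 - (1 - 1 / bp_per_snp) ** read_len

p_mouse = allele_identifiable_prob(70, 75)     # BL6 x Spretus-like density
p_human = allele_identifiable_prob(70, 1700)   # GM12878-like density
```

Under this assumption the two values come out near 0.6 and 0.04, in line with the figures given in the text.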

When the SNP density was low, fewer both-end allele-certain contacts but more one-end allele-ambiguous and both-end allele-ambiguous contacts were observed. Consequently, as the average allele-identifiable probability Inline graphic decreased from 0.5 to 0.05, the recovery rates dropped dramatically from 25.65% to 0.25% for allele-certain and from 75.55% to 9.82% for mate-rescue (Supplementary Table S3). In contrast, our ASHIC methods were able to recover all allele-ambiguous reads at the lowest Inline graphic setting. Among the recovered contacts, 15.95% for ASHIC-ZIPM and 17.60% for ASHIC-PM were incorrectly imputed (Supplementary Table S4).
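The order of magnitude of these recovery rates follows from the allele-identifiable probability alone. The sketch below is a rough expectation, assuming that the two read ends are allele-identifiable independently with the same probability `p` (a hypothetical parameter name); the simulated rates in Supplementary Table S3 need not match it exactly.

```python
def expected_recovery(p):
    """Expected usable fraction of read pairs for the two baseline approaches.

    Assumes both ends are allele-identifiable independently with probability p.
    """
    allele_certain = p * p              # both ends must be allele-certain
    mate_rescue = 1 - (1 - p) ** 2      # at least one end must be allele-certain
    return allele_certain, mate_rescue

high_density = expected_recovery(0.5)    # approximately (0.25, 0.75)
low_density = expected_recovery(0.05)    # approximately (0.0025, 0.0975)
```

These expectations are close to the reported 25.65% and 75.55% at a probability of 0.5, and 0.25% and 9.82% at 0.05.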

Consistent with the high recovery rates and low imputation error rates, the SCC values also demonstrated robust and accurate imputation of diploid contact matrices by the ASHIC methods at low SNP density settings (Figure 4A). When the average allele-identifiable probability Inline graphic decreased from 0.5 to 0.05, the SCC values dropped significantly from 0.8100 to 0.3959 for allele-certain and from 0.9733 to 0.8719 for mate-rescue, respectively. In contrast, the SCC values remained high at 0.9941 and 0.9922 for ASHIC-ZIPM and ASHIC-PM, respectively, at the lowest Inline graphic setting. Moreover, ASHIC-ZIPM significantly outperformed ASHIC-PM at the lowest SNP density level (P-value = 8.30 × 10−3, one-sided paired Wilcoxon signed-rank test). The difference between our ASHIC methods and other methods was also observed on the PCC plot at the lowest SNP density, particularly for long genomic distances (Supplementary Figure S3). Furthermore, when comparing the predicted allelic 3D structures with the ground truth, ASHIC-ZIPM outperformed ASHIC-PM significantly at all SNP density levels (P-values = 2.53 × 10−3, 4.67 × 10−3, 3.46 × 10−3, 2.53 × 10−3, for Inline graphic, respectively, one-sided paired Wilcoxon signed-rank tests) (Figure 4B).

Figure 4.


ASHIC-ZIPM accurately imputes diploid contact maps and 3D structures on low-SNP-density Xa/Xi simulation data. (A) SCCs between the imputed diploid contact matrices and the true contact matrices, (B) Distance error rates between the predicted allelic 3D structures and the true structures, and (C) F1 scores of the identified allele-specific chromatin interactions at different SNP density Inline graphic levels.

Next, we asked whether the ability to detect allele-specific chromatin interactions was impacted by low SNP density levels. Adjusting the average allele-identifiable probability did not affect the underlying true diploid contact matrices. As a result, the true set of allele-specific interactions remained the same at different SNP density settings (Supplementary Table S5, Inline graphic). As shown in Figure 4C, low SNP density severely impacted the allele-certain and mate-rescue methods. The F1 score of allele-certain dropped from 0.6024 to 0.0039, recovering only 17.8 out of 9061.5 true allele-specific interactions. Similarly, the F1 score of mate-rescue dropped from 0.8940 to 0.3666. In contrast, as the SNP density was lowered, the F1 scores of our methods decreased only slightly, by 3.62% for ASHIC-ZIPM and 4.26% for ASHIC-PM. In addition, our ASHIC methods outperformed the other methods by a notable margin. We observed that decreasing SNP density increased the margin between ASHIC-ZIPM and other methods. Taken together, our results demonstrated that the ASHIC-ZIPM method significantly outperformed other methods and remained highly robust in low SNP density situations.

Simulation studies on identical chromosomal structures

In the aforementioned simulation settings, we took the homologous X chromosomes in GM12878 cells as the ground truth, where Xa and Xi have drastically dissimilar structures. Unlike the X chromosomes, homologous autosomes often have similar 3D shapes. Imputing diploid Hi-C contact matrices and allelic structures from homologs with similar structures is a more challenging problem than doing so from homologs with different structures. To evaluate our methods in such a situation, we duplicated the paternal/Xi structure as the pseudo-maternal structure to build an identical homologous structure pair (see Methods). We then generated simulation datasets and evaluated our methods at different coverage and SNP density settings, as previously described.

Performance on low sequencing coverage data

As demonstrated in previous homologous structure simulations, our ASHIC methods maintained high accuracy of imputed diploid contact matrices at low sequencing coverage settings (Figure 5A). The SCC values were all above 0.9949 for ASHIC-ZIPM and above 0.9938 for ASHIC-PM at various sequencing coverage levels. On the other hand, the SCC values of mate-rescue demonstrated a minor decline from 0.9778 to 0.9664 when the coverage decreased from Inline graphic to Inline graphic. The allele-certain method was the most impacted, as its SCC values declined by 7.46% from 0.8362 to 0.7738 when the coverage level dropped from Inline graphic to Inline graphic.

Figure 5.


ASHIC-ZIPM accurately imputes diploid contact maps and 3D structures on low-coverage identical-homolog simulation data. (A) SCCs between the imputed diploid contact matrices and the true contact matrices, (B) Distance error rates between the predicted allelic 3D structures and the true structures, (C) Homologous distance error rates between the predicted maternal and paternal 3D structures and (D) F1 scores of the identified bi-allelic interactions at various sequencing coverage β levels.

We then evaluated the accuracy of the allelic 3D structures predicted by our ASHIC methods. Overall, ASHIC-ZIPM generated more accurate structures with smaller distance error rates than the ones predicted by ASHIC-PM across all coverage levels (Figure 5B). The improvements were significant at Inline graphic, Inline graphic, and Inline graphic levels (P-values = 2.53 × 10−3, 1.42 × 10−2, 2.53 × 10−3, one-sided paired Wilcoxon signed-rank tests).

In addition to comparing the predicted allelic structures against the ground truth structures, we further calculated the homologous distance error rate between the predicted maternal and paternal structures (Figure 5C). For both ASHIC-ZIPM and ASHIC-PM methods, the average homologous distance error rates were smaller than 0.08, suggesting that both models produced homologous structures with very similar shapes. Furthermore, the ASHIC-ZIPM model had significantly lower homologous distance error rates than ASHIC-PM, at sequencing coverage Inline graphic and Inline graphic levels (P-values = 2.53 × 10−3, one-sided paired Wilcoxon signed-rank tests). These results further confirmed that ASHIC-ZIPM predicted more accurate allelic 3D structures than the structures predicted by ASHIC-PM.

Next, we investigated the effects of low sequencing coverage on the detection of chromatin interactions when the homologous structures were identical. Similar to the case of different homologous structures, we applied Fit-Hi-C (25) to call significant interactions on the two allele-specific contact matrices separately. Given that the two ground truth homologous structures were identical, we defined the true interaction set as the bi-allelic interactions shared by both maternal and paternal chromosomes (Supplementary Methods). When the coverage dropped from Inline graphic to Inline graphic, the number of interactions in the true set decreased by 81.48% from 4103.1 to 759.9 (Supplementary Table S6). As shown in Figure 5D, the allele-certain method was the most impacted by the sequencing coverage changes, with its F1 score decreasing by 31.23% from 0.6127 to 0.4214 as the coverage dropped from Inline graphic to Inline graphic. The F1 score of mate-rescue decreased to a lesser extent, by 7.37% from 0.9075 to 0.8406. In contrast, our ASHIC-ZIPM and ASHIC-PM methods demonstrated consistently high F1 scores of 0.9351 and 0.9296, respectively, even under the lowest coverage Inline graphic setting.
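The percentage changes quoted in this section are relative decreases with respect to the value at the higher coverage. A small helper (the function name is illustrative) makes the arithmetic explicit for two of the figures above.

```python
def relative_decrease(before, after):
    """Percentage decrease of `after` relative to `before`."""
    return 100 * (before - after) / before

true_set_drop = relative_decrease(4103.1, 759.9)       # size of the true interaction set
mate_rescue_drop = relative_decrease(0.9075, 0.8406)   # mate-rescue F1 score
```

Because the reported values are rounded averages, recomputing a percentage from them can differ from the published figure in the last digit.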

Performance on low SNP density data

As the SNP density was lowered, we observed an overall decreasing trend in the SCC values for all four methods (Figure 6A). The allele-certain and mate-rescue methods were greatly impacted by the low SNP density. When the average allele-identifiable probability Inline graphic decreased from 0.5 to 0.05, the SCC values dropped significantly by 46.52% from 0.8362 to 0.4472 for allele-certain and by 9.16% from 0.9778 to 0.8883 for mate-rescue. Again, our ASHIC methods maintained robustly high accuracy of the imputed contact matrices; the SCC values decreased by only 0.45% from 0.9996 to 0.9950 for ASHIC-ZIPM and by 2.38% from 0.9988 to 0.9750 for ASHIC-PM when Inline graphic decreased from 0.5 to 0.05. The visible difference between ASHIC-ZIPM and ASHIC-PM at the lowest SNP density level Inline graphic was also supported by the PCC measures, where ASHIC-ZIPM outperformed ASHIC-PM by a visibly large margin in PCCs within a genomic distance of 100 Mb (Supplementary Figure S3).

Figure 6.


ASHIC-ZIPM accurately imputes diploid contact maps and 3D structures on low-SNP-density identical-homolog simulation data. (A) SCCs between the imputed diploid contact matrices and the true contact matrices, (B) distance error rates between the predicted allelic 3D structures and the true structures, (C) homologous distance error rates between the predicted maternal and paternal 3D structures and (D) F1 scores of the identified bi-allelic interactions at different SNP density Inline graphic levels.

In terms of structural accuracy, ASHIC-ZIPM also outperformed ASHIC-PM with significantly smaller distance error rates across all SNP density levels (P-values = 2.53 × 10−3, one-sided paired Wilcoxon signed-rank tests) (Figure 6B). Furthermore, the allelic structures predicted by ASHIC-ZIPM demonstrated significantly smaller homologous distance error rates than the ones predicted by ASHIC-PM (P-values = 2.53 × 10−3, at all four Inline graphic levels, one-sided paired Wilcoxon signed-rank tests) (Figure 6C).

In addition to achieving the highest imputation accuracy of the diploid contact matrices and 3D structures, ASHIC-ZIPM also demonstrated the best performance with respect to the detection of bi-allelic chromatin interactions under low SNP density conditions (Figure 6D). When the average allele-identifiable probability Inline graphic decreased from 0.5 to 0.05, the F1 values dropped by 99.43% for allele-certain, 62.49% for mate-rescue, and 8.32% for ASHIC-PM. The ASHIC-ZIPM model showed the smallest decline in F1 scores, merely 1.70% from 0.9814 to 0.9647. Moreover, we observed that ASHIC-ZIPM significantly outperformed all other methods by a large margin across all SNP density levels (P-values = 2.53 × 10−3, one-sided paired Wilcoxon signed-rank tests).

Taken together, we demonstrated that our ASHIC methods significantly outperformed the allele-certain and mate-rescue methods under low SNP density conditions when the homologous structures have identical shapes. In addition, ASHIC-ZIPM evidently outperformed the ASHIC-PM model by a large margin, especially at the lowest SNP density level.

ASHIC reconstructs the bipartite structure of the mouse inactive X chromosome

The X chromosomes in mammalian females are a representative example of homologous structural difference. Whereas males have only one X chromosome, females have two. To compensate for the dosage imbalance of X-linked genes between females and males, one X chromosome in female cells is randomly silenced through the X chromosome inactivation (XCI) mechanism (26). To study the structural differences between the active X (Xa) and inactive X (Xi) chromosomes, we applied ASHIC-ZIPM to a published diploid Hi-C dataset generated from wild-type patski (BL6×Spretus) cells (15). The patski cell line has completely skewed XCI such that the maternal BL6 X is always inactive while the paternal Spretus X is always active. Several Hi-C studies conducted on the patski cells have demonstrated that the maternal Xi and paternal Xa chromosomes exhibit distinct morphology and chromatin contact profiles (9,15). Specifically, Xi shows a clear bipartite structure, where the entire chromosome is densely packed into two superdomains. The hinge region between the two superdomains contains the macrosatellite repeat locus Dxz4 and represents a nucleolus-associated domain (6,9,11,15).

To study the bipartite organization of Xi, we applied our ASHIC-ZIPM model to the patski Hi-C data and reconstructed the diploid contact maps and 3D structures of Xa and Xi at various resolutions (500, 100 and 50 kb). As shown in Figure 7A, the contact map of Xa demonstrated a clear plaid pattern representing the alternating A/B compartments. In contrast, Xi was clearly separated into two superdomains by a hinge region containing Dxz4. We observed frequent intra-superdomain contacts but sparse inter-superdomain contacts on Xi. In addition, we calculated the bipartite index (BI) (9) (Supplementary Methods) for both X chromosomes (Figure 7B). At all three resolutions, we observed an evident BI peak at the hinge region (Dxz4) on Xi, confirming the existence of bipartite organization on Xi. In contrast, the BI values were rather flat across the entire Xa, indicating the absence of bipartite structure. These observations demonstrated that our ASHIC-ZIPM method can produce robust and consistent diploid contact maps across different resolutions.

Figure 7.


Bipartite organization of the inactive X chromosome in mouse patski cells. (A) ASHIC-ZIPM-imputed allele-specific Hi-C contact matrices of Xi and Xa are shown at 500 kb resolution. The Xi shows a bipartite structure of two superdomains connected by a hinge region (Dxz4), indicated by an arrow. Gray strips indicate low mappability regions. (B) Chromosome-wise bipartite index (BI) values for Xi (brown) and Xa (blue) at 500 kb (left), 100 kb (middle) and 50 kb (right) resolutions. The Xi curve shows an evident peak at the hinge region (yellow). (C) The Xi structures predicted by ASHIC-ZIPM at 500, 100 and 50 kb resolutions. The first superdomain (centromeric region) is shown in orange, and the second superdomain (distal region) is shown in brown. The hinge region (Dxz4) is marked by a yellow ball. The 3D structures are interpolated and smoothed by the Akima interpolator in SciPy. (D) Box plots of the radius of gyration for the Xi (brown) and Xa (blue) structures at 500, 100 and 50 kb resolutions.

In addition to the existence of two superdomains in the Xi contact map, we also observed that the predicted Xi structures preserved the bipartite conformation across all three resolutions (Figure 7C). The two superdomains were clearly separated in space, as each superdomain occupied half of the sphere and there were minimal interactions between them. In addition, the hinge region (Dxz4) connecting the two superdomains was located towards the periphery of the Xi structure, which is consistent with previous DNA-FISH results (9). While the previously published Xa and Xi structures were at 1 Mb (9) and 500 kb (14) resolutions, our method produced chromosomal structures at 50 kb resolution and successfully confirmed the bipartite organization of Xi.

With regard to the overall morphology of the chromosomal structures, we observed that Xi exhibited a more condensed structure than Xa, which is consistent with the fact that Xi is almost entirely silenced. In particular, we calculated the radius of gyration (Rg, Supplementary Methods) (27) to measure the compactness of the X chromosomes (Figure 7D). Across all three resolutions, Xi consistently showed a significantly lower Rg value than Xa, indicating that Xi was more tightly packed (P-values = 4.43 × 10−5, one-sided paired Wilcoxon signed-rank tests).
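For reference, the standard equal-mass radius of gyration of a set of 3D points can be computed as below. This is the textbook definition (root mean squared distance of the points from their centroid); the exact formulation used in the Supplementary Methods may differ, and the function name is illustrative.

```python
import math

def radius_of_gyration(coords):
    """Root mean squared distance of 3D points from their centroid (equal masses)."""
    n = len(coords)
    centroid = [sum(point[d] for point in coords) / n for d in range(3)]
    mean_sq_dist = sum(math.dist(point, centroid) ** 2 for point in coords) / n
    return math.sqrt(mean_sq_dist)

# Toy structures: the same four-bead chain at two different scales.
extended = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
compact = [(x / 2, y / 2, z / 2) for x, y, z in extended]
rg_extended = radius_of_gyration(extended)
rg_compact = radius_of_gyration(compact)
```

A more tightly packed structure yields a smaller value, which is the sense in which a lower Rg indicates greater compactness.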

To assess the reproducibility of the inferred allelic contact maps and 3D structures, we randomly split the X chromosome data into two pseudo-replicates and performed ASHIC-ZIPM analysis on each one separately. At 500 kb resolution, the imputed allelic contact matrices were highly similar with SCC values of 0.9632 (Xi) and 0.9691 (Xa) between the two pseudo-replicates (Supplementary Figure S4). Additionally, the allelic 3D structures estimated from the pseudo-replicates were well aligned with similar global architecture. Moreover, similar results at 100 kb resolution further confirmed the reproducibility of the ASHIC method (Supplementary Figure S5).

Collectively, the results obtained on the patski Hi-C data demonstrated that our ASHIC-ZIPM method can accurately and robustly detect distinct allele-specific chromatin organizations of Xa and Xi at fine resolution.

ASHIC reveals differential allele-specific chromatin organization at the mouse H19/Igf2 imprinting region

Imprinting is an epigenetic mechanism that causes a subset of genes to be expressed exclusively from one allele in diploid cells. The expression of imprinted genes is controlled by parental-specific epigenetic modifications, such as DNA methylation, at the imprinting control regions. One well-studied example is the H19/Igf2 imprinting region. In the mouse genome, the paternally expressed Igf2 gene is located approximately 80 kb upstream (telomeric side) from the long non-coding RNA H19 that is expressed only on the maternal allele. These two genes demonstrate opposite allele-specific expression yet share a common set of enhancers located downstream of H19 (29–31). It has been shown that the parent-specific expression pattern of H19 and Igf2 is controlled by the H19 differentially methylated region (H19-DMR) located 2 kb upstream from H19 (32). The H19-DMR is methylated only on the paternal allele, and therefore exhibits methylation-sensitive CCCTC-binding factor (CTCF) binding. On the maternal allele, the unmethylated H19-DMR recruits CTCF, thereby blocking the interactions between the enhancers and Igf2. As a result, Igf2 remains unexpressed, while H19 can still access the enhancers and thus is activated. On the paternal allele, in contrast, the methylated H19-DMR inhibits CTCF binding. Consequently, Igf2 can access the enhancers and is activated, while the H19 silencing is likely caused by spreading of methylation from H19-DMR (33).

It has been widely speculated that CTCF attains enhancer-blocking insulation function via the formation of chromatin loops (34). Using diploid Hi-C contact maps of human GM12878 cells at 25 kb resolution, Rao et al. (6) examined the H19/IGF2 imprinting region and identified parental-specific chromatin loops between the H19/IGF2 cluster and a distal region which was referred to as the H19/Igf2 Distal Anchor Domain (HIDAD). The HIDAD-H19 loop was present exclusively on maternal allele; in contrast, the HIDAD-IGF2 loop appeared only on the paternal allele. Additionally, Llères et al. (35) performed a diploid 4C-seq study on the mouse ESCs and showed that H19-DMR interacted significantly more with the mouse homologue of HIDAD (mHIDAD) on maternal allele compared to the interactions on the paternal allele. They subsequently performed 3D DNA-FISH experiments and confirmed that the distances between mHIDAD and H19 were significantly shorter on the maternal allele than the distances on the paternal allele.

Although the aforementioned 4C-seq study (35) and several other 3C studies (36–38) have been conducted in the H19/Igf2 imprinting region, diploid Hi-C studies are still restricted to a rather coarse resolution due to the limitations of low SNP density and insufficient sequencing coverage. To bridge this gap and provide a holistic view of chromatin structures on the imprinted H19/Igf2 region, we applied our ASHIC-ZIPM method to the published diploid Hi-C data in mouse patski cells (15), and generated fine-scale allele-specific contact maps and 3D structures of a 5-Mb region (chr7: 140–145 Mb) around the H19/Igf2 imprinting region at 10 kb resolution.

First, we constructed a differential contact map using log-fold-change values between the imputed maternal and paternal contacts (Figure 8A). Along with the contact map, we also visualized the allelic CTCF ChIP-seq data (15). Consistent with previous studies (39,40), we observed clear maternal-specific CTCF binding at the H19-DMR locus. Additionally, a few bi-allelic CTCF binding clusters were observed at mHIDAD, near the Syt8 and Lsp1 genes, and at the telomeric side of Igf2. As shown in Figure 8A, the contacts between mHIDAD and H19 were enriched on the maternal allele (box 1), whereas the contacts between mHIDAD and Igf2 were enriched on the paternal allele (box 2). In addition to the contacts between mHIDAD and H19/Igf2, H19 and Igf2 demonstrated differential contact preferences for the bi-allelic CTCF clusters near Syt8 and Lsp1 (boxes 3 and 4). To further characterize the parental-specific chromatin interactions, we identified chromatin loops with genomic distances of 30–500 kb from the imputed allelic contact maps using Fit-Hi-C (25) with a strict FDR threshold (q-value < 10⁻⁵). The identified chromatin loops were mostly anchored to the CTCF binding clusters (Figure 8A). We further categorized these chromatin loops into bi-allelic loops that were shared between the two alleles and monoallelic loops that were either maternal-specific or paternal-specific. Consistent with the differential contact map, chromatin loops anchored at H19 and Igf2 were primarily parental-specific. We observed a distinct pattern of maternal-specific chromatin loops between mHIDAD and H19 and paternal-specific chromatin loops between mHIDAD and Igf2. Besides mHIDAD, the region containing bi-allelic CTCF binding clusters near the Syt8 and Lsp1 genes also demonstrated parental-specific chromatin interactions with H19 and Igf2. Specifically, these CTCF clusters interacted preferentially with H19 on the maternal allele and with Igf2 on the paternal allele. These observations are consistent with the previous 4C-seq results in mouse ESCs (35).
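As an illustration of the two steps described above, the differential map and the loop categories could be computed roughly as follows. This is a minimal sketch rather than the pipeline used in this study: the function names, the pseudocount, the normalization by each allele's total, and the rule that a loop counts as shared when the same bin pair is significant on both alleles are all assumptions. The q-values are taken as given, for example from a loop caller such as Fit-Hi-C.

```python
import math


def log_fold_change(maternal, paternal, pseudocount=1.0):
    """Log2 ratio of maternal to paternal contacts for every bin pair.

    Each matrix is scaled to sum to 1 first, so a global bias towards one
    allele does not shift every ratio. The pseudocount is a hypothetical
    choice that keeps empty bin pairs finite.
    """
    n_pairs = sum(len(row) for row in maternal)
    mat_total = sum(map(sum, maternal)) + pseudocount * n_pairs
    pat_total = sum(map(sum, paternal)) + pseudocount * n_pairs
    result = []
    for mat_row, pat_row in zip(maternal, paternal):
        row = []
        for m, p in zip(mat_row, pat_row):
            m_norm = (m + pseudocount) / mat_total
            p_norm = (p + pseudocount) / pat_total
            row.append(math.log2(m_norm / p_norm))
        result.append(row)
    return result


def categorize_loops(maternal_q, paternal_q, bin_size=10_000,
                     min_dist=30_000, max_dist=500_000, q_cutoff=1e-5):
    """Split significant loops into maternal, paternal and bi-allelic sets.

    maternal_q and paternal_q map (bin_i, bin_j) to a q-value.
    """
    def significant(qvalues):
        return {
            pair for pair, q in qvalues.items()
            if q < q_cutoff
            and min_dist <= abs(pair[1] - pair[0]) * bin_size <= max_dist
        }

    mat, pat = significant(maternal_q), significant(paternal_q)
    return {"maternal": mat - pat, "paternal": pat - mat, "bi-allelic": mat & pat}
```

Positive values in the first function indicate maternal enrichment and negative values paternal enrichment, matching the red versus blue convention of Figure 8A.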

Figure 8.

Allele-specific chromatin organizations of the H19/Igf2 imprinting region in mouse patski cells. (A) Differential contact map between the ASHIC-ZIPM-imputed maternal and paternal contacts at 10 kb resolution. Contact counts are normalized separately on each allele to account for the potential mapping bias towards the reference genome. The red versus blue color key indicates maternal versus paternal enrichment. Four allelically enriched chromatin interacting regions are labeled in boxes 1–4. Maternal-specific CTCF peak (pink) and bi-allelic CTCF binding clusters (yellow) are highlighted. Chromatin loops are called using Fit-Hi-C (25) and categorized into maternal-specific (red), paternal-specific (blue) and bi-allelic (gray). Only loops anchored at H19 or Igf2 are displayed. (B) Allelic 3D structures of the H19/Igf2 imprinting region predicted by ASHIC-ZIPM at 10 kb resolution. The maternal (red) and paternal (blue) structures are overall similar, but the relative spatial positions of mHIDAD (blue), H19 (yellow), and Igf2 (orange) are evidently different. (C) Box plots of pairwise Euclidean distances between H19-Igf2 (left), mHIDAD-H19 (middle), and mHIDAD-Igf2 (right). (D) Allelic Hi-C contact maps at 10 kb resolution (top panel: maternal allele, red color key; bottom panel: paternal allele, blue color key). Maternal-specific (red), paternal-specific (blue), and bi-allelic (gray) chromatin loops are called using Fit-Hi-C (25). A local minimum of the insulation score (IS) is marked by an asterisk. Positive and negative directionality index (DI) values (8) are shown in red and blue, respectively. (Sub-)TAD domains derived from IS and DI measures are labeled as triangles on the contact maps, and dashed lines indicate (sub-)TAD boundaries. Panels (A) and (D) are drawn using pyGenomeTracks (28).

Besides the differential contact map, we also examined the allele-specific chromatin conformations using the predicted allelic 3D structures (Figure 8B). The overall chromatin organization of the H19/Igf2 imprinting region appeared to be similar between the two alleles. However, the relative spatial positions of mHIDAD, H19, and Igf2 demonstrated parental-specific differences. From the 3D structures, we observed that mHIDAD was spatially close to H19 on the maternal allele, presumably forming a chromatin loop. In addition, we observed that Igf2 was much closer to mHIDAD on the paternal structure than on the maternal structure.

For the quantitative comparison, we calculated the pairwise Euclidean distances among mHIDAD, H19, and Igf2 on the maternal and paternal structures predicted by ASHIC-ZIPM from 20 random initializations. As shown in Figure 8C, the distance between mHIDAD and H19 was significantly smaller on the maternal structure than on the paternal structure (P-value = 4.43 × 10⁻⁵, one-sided Wilcoxon paired signed-rank test), which is consistent with the previous DNA-FISH data (35). In contrast, the distance between mHIDAD and Igf2 was significantly larger on the maternal allele (P-value = 4.43 × 10⁻⁵, one-sided Wilcoxon paired signed-rank test), which is consistent with the observation of a paternal-specific HIDAD-IGF2 loop in human GM12878 cells (6). No significant difference in the distance between H19 and Igf2 was detected on our predicted allelic structures. These observations demonstrated that our method can stably predict fine-scale 3D structures that reflect the distinct parental-specific chromatin conformations.
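The comparison above can be sketched in a few lines. The code below is an illustrative sketch only, assuming that each predicted structure is a list of (x, y, z) coordinates indexed by bin and that the paired test uses the normal approximation without continuity or tie-variance correction; the function names are hypothetical, and the test settings actually used in this study may differ.

```python
import math


def locus_distance(structure, i, j):
    """Euclidean distance between bins i and j of one 3D structure."""
    return math.dist(structure[i], structure[j])


def paired_wilcoxon_less(x, y):
    """One-sided Wilcoxon signed-rank test of H1: x tends to be smaller than y.

    Normal approximation; zero differences are dropped and tied absolute
    differences receive their average rank. Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda k: abs(diffs[k]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)
    return w_plus, p_value
```

Here `x` would hold the maternal distances and `y` the paternal distances from the same 20 initializations. When all 20 paired differences have the same sign, this approximation returns a P-value of about 4.43 × 10⁻⁵, the smallest value it can produce for n = 20. That coincides with the values reported above, although the exact test settings are not stated in this section.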

ASHIC-imputed diploid contact maps uncover the maternal-specific sub-TAD organization at the mouse H19/Igf2 locus

In addition to the formation of chromatin loops, CTCF also participates in the establishment of higher-order chromatin structures such as topologically associating domains (TADs). TADs are sub-megabase genomic regions containing frequent local chromatin interactions, whereas TAD boundaries result in physical insulation between neighboring domains (8). It has been observed that CTCF binding sites are often enriched at TAD boundaries and play an important role in TAD formation (6,8). Since the genome is organized in a hierarchical manner, smaller domains called sub-TADs are often observed within the large TADs. Unlike TADs, which are mostly invariant between cell types, sub-TADs are more variable and play a pivotal role in mediating cell-type-specific gene regulation (41,42). Based on the presence of monoallelic CTCF binding at H19-DMR, Llères et al. (35) proposed a novel parental-specific sub-TAD model for the regulation of imprinting at the H19/Igf2 locus. Supported by allelic 4C-seq and DNA-FISH data, they speculated that several bi-allelic CTCF binding sites form a first layer of TAD on both alleles. In addition, the maternal-specific CTCF binding around H19-DMR hijacks the first layer of TAD and consequently creates an additional layer of sub-TAD on the maternal allele.

To verify this hypothesis, we calculated the insulation score (IS) (43) and directionality index (DI) (8) using TADtool (44) to search for possible (sub-)TAD boundaries around the H19/Igf2 imprinting region. Overall, we observed similar IS values on both alleles, except at the H19-DMR locus (Figure 8D, Supplementary Figure S6). Specifically, we observed a local minimum of IS values at H19-DMR only on the maternal allele, indicating the potential presence of a sub-TAD boundary at H19-DMR. Consistently, the DI values suggested a similar (sub-)TAD pattern (Figure 8D). We observed strong positive DIs at mHIDAD on both alleles, indicating that mHIDAD is highly biased towards interacting with its downstream loci and serves as a starting position of a TAD. On the other hand, the telomeric-side flanking region of Igf2 demonstrated negative DIs on both alleles, indicating a likely ending boundary of a TAD. Furthermore, a negative DI region around H19-DMR appeared only on the maternal allele, suggesting that H19-DMR has a higher tendency to interact with its upstream loci, possibly indicating an ending position for a maternal-specific sub-TAD.
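To make the two boundary measures concrete, the following sketch computes simplified versions of both from a dense contact matrix. It is not the TADtool implementation: the window sizes, the exclusion of the central bin from the insulation square, and the function names are assumptions made for illustration. The DI follows the signed chi-square form of Dixon et al. (8), with A the upstream contacts, B the downstream contacts and E their average.

```python
import math


def insulation_score(matrix, window):
    """Log2 of the mean contact in a square straddling each bin, over its average.

    Bins closer than `window` to either end (or with an empty square) get None.
    Local minima suggest boundaries.
    """
    n = len(matrix)
    if n <= 2 * window:
        raise ValueError("matrix must have more than 2 * window bins")
    raw = [None] * n
    for i in range(window, n - window):
        total = sum(matrix[r][c]
                    for r in range(i - window, i)           # upstream rows
                    for c in range(i + 1, i + window + 1))  # downstream columns
        raw[i] = total / (window * window)
    valid = [v for v in raw if v is not None]
    mean = sum(valid) / len(valid)
    if mean == 0:
        raise ValueError("no contacts inside any insulation square")
    return [None if v is None or v == 0 else math.log2(v / mean) for v in raw]


def directionality_index(matrix, window):
    """Signed chi-square statistic of downstream (B) versus upstream (A) contacts."""
    n = len(matrix)
    di = [0.0] * n
    for i in range(n):
        a = sum(matrix[i][max(0, i - window):i])   # upstream contacts
        b = sum(matrix[i][i + 1:i + 1 + window])   # downstream contacts
        e = (a + b) / 2
        if e == 0 or a == b:
            continue                               # no directional bias
        sign = 1.0 if b > a else -1.0
        di[i] = sign * ((a - e) ** 2 / e + (b - e) ** 2 / e)
    return di
```

On a toy matrix with two adjacent domains, the IS dips at the junction, the last bin of the first domain has a negative DI, and the first bin of the second domain has a positive DI, which is the pattern used above to read domain ends and starts.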

Both the IS and DI measurements suggested that H19/Igf2 is embedded within a TAD demarcated by two main boundaries: one near mHIDAD and the other at the telomeric side of Igf2. The locations of the two boundaries were in good agreement between the two alleles. However, the (sub-)TAD organization within this TAD region showed drastic parental-specific differences. Specifically, we observed a sub-TAD boundary at the H19-DMR locus exclusively on the maternal allele. The TAD and sub-TAD boundaries mentioned above were all located at CTCF binding clusters. We further examined the allelic chromatin loops within this imprinting region (Figure 8D). On the maternal allele, chromatin loops were mostly confined to the mHIDAD-H19 sub-TAD, whereas on the paternal allele, we observed several chromatin loops connecting the centromeric side of H19-DMR with Igf2, indicating the absence of insulation at H19-DMR. These observations of allelic chromatin loops are consistent with the parental-specific (sub-)TAD structures.

Taken together, these results supported the hypothesis that the maternal-specific CTCF binding at H19-DMR forms a chromatin loop with the CTCF binding sites at mHIDAD. This mHIDAD-H19 loop creates an additional layer of sub-TAD inside the original mHIDAD-Igf2 TAD. The maternal-specific mHIDAD-H19 sub-TAD organization mediates the insulation between the centromeric side of H19-DMR and Igf2, thereby leading to the silencing of Igf2 on the maternal allele.

ASHIC-imputed allelic chromatin contacts are consistent with phased ChIA-PET loops in human GM12878 cells

Besides Hi-C, ChIA-PET is another popular technique for detecting genome-wide chromatin interactions (45). ChIA-PET incorporates chromatin immunoprecipitation-based enrichment and focuses on the mapping of chromatin interactions mediated by a specific protein of interest. Applying an advanced long-read ChIA-PET strategy, Tang et al. (17) comprehensively mapped the functional chromatin interactions mediated by CTCF and RNA polymerase II (RNAPII) with haplotype specificity in human cell lines. To further assess our method, we applied ASHIC-ZIPM to the published Hi-C data in human GM12878 cells (6), and compared the imputed allelic chromatin maps with the phased ChIA-PET data published by Tang et al. (17).

We first looked at a 4-Mb region (chr11: 1–5 Mb) around the H19/IGF2 imprinting locus and generated allelic contact maps and structures at 10 kb resolution (Figure 9A). Compared to the 25-kb-resolution mate-rescued Hi-C maps reported by Rao et al. (6), our ASHIC-imputed allelic contact maps showed much higher coverage and finer interaction patterns. Similar to the mouse H19/Igf2 region, the human H19/IGF2 imprinting region also exhibited a maternal-specific sub-TAD organization. The sub-TAD boundary was located at H19-DMR and was enriched with maternal-specific CTCF binding. In addition, we observed maternal-specific chromatin contacts between H19-DMR and several loci (including HIDAD) at the telomeric side (red boxes), which corresponded closely to the maternal-biased ChIA-PET loops mediated by CTCF. On the paternal allele, we observed enriched chromatin contacts between IGF2 and the aforementioned telomeric-side loci (blue boxes), which was consistent with our observations for the mouse Igf2 homolog. We did not observe the corresponding paternal-biased CTCF ChIA-PET loops, probably due to the absence of SNPs at the IGF2 locus.

Figure 9.

Allele-specific Hi-C chromatin maps and ChIA-PET loops in human GM12878 cells. ASHIC-imputed allelic contact maps are shown at 10 kb resolution (top panel: maternal allele, red color key; bottom panel: paternal allele, blue color key). Phased ChIA-PET loops and SNPs with haplotype-biased ChIA-PET bindings are obtained from Tang et al. (17). (A) H19/IGF2 imprinting region. Maternal-enriched and paternal-enriched chromatin interacting regions are labeled in red and blue boxes, respectively. Vertical dashed lines indicate (sub-)TAD boundaries. (B) Allelic long-range enhancer-promoter interactions at LOC374443, CLEC2D and CLECL1 genes. Blue box indicates the paternal-enriched chromatin interacting region. The distal enhancer associated with paternal-biased RNAPII-mediated ChIA-PET loops is highlighted in blue. Both panels are drawn using pyGenomeTracks (28).

In addition to CTCF-mediated parental-specific chromatin loops, our approach also revealed RNAPII-mediated allelic chromatin interactions. For example, we studied another 4-Mb region (chr12: 8–12 Mb) containing the LOC374443, CLEC2D and CLECL1 multi-gene complex. Previously, Tang et al. (17) discovered paternally biased RNAPII-mediated interactions between this paternally expressed multi-gene complex and its distal enhancer (300 kb apart). Consistently, our ASHIC-imputed allelic contact maps showed paternal-enriched long-range contacts (blue box) between the distal enhancer and the promoters of the three genes, as shown in Figure 9B.

Collectively, these results demonstrated that our ASHIC method is capable of imputing diploid chromatin maps in low-SNP-density cells such as GM12878, and that the ASHIC-imputed allelic contacts correspond closely to the phased ChIA-PET data.

DISCUSSION

In this work, we proposed a hierarchical Bayesian framework for imputing allele-specific contacts and reconstructing allelic 3D structures from diploid Hi-C data. We developed two models under this Bayesian framework: ASHIC-PM and ASHIC-ZIPM. To the best of our knowledge, our ASHIC methods are the first methods that produce fully decomposed diploid Hi-C contact matrices as well as the allelic 3D structures.

Unlike the existing allele-certain and mate-rescue approaches, our ASHIC methods utilize all diploid Hi-C contacts, including both-end allele-ambiguous contacts. As a result, the ASHIC methods outperformed the allele-certain and mate-rescue methods, producing more accurate diploid matrices and structures and facilitating better detection of allele-specific chromatin interactions. We also conducted a series of simulation experiments and evaluated how the performance of our methods was impacted by various factors, including sequencing coverage, SNP density, and homologous structural similarity. Overall, our models significantly outperformed other methods, especially under low sequencing coverage and low SNP density conditions. The ability of the ASHIC methods to infer allele-ambiguous contacts in low-SNP-density settings is critical for analyses in diploid human cells such as GM12878, where the existing mate-rescue method (6) was only able to rescue 5.86% of total diploid contacts (Supplementary Table S1).

In our simulation studies, we did not compare the ASHIC methods with the recently published Dip-C method by Tan et al. (13), as their method was specifically designed for single-cell Hi-C data. Another reason was that Dip-C does not impute intra-chromosomal both-end allele-ambiguous contacts. Therefore, we expect that its performance would be close to that of the mate-rescue method. In addition, our earlier Poisson-Gamma model (9) imputes diploid contact counts based on genomic distances rather than the spatial distances derived from 3D structures, and therefore is not computationally stable on fine-resolution (such as 100 kb) or low-coverage Hi-C data. Lastly, the newly developed diploid-PASTIS method by Cauer et al. (14) predicts only the allelic 3D structures rather than the diploid contact matrices. Therefore, we did not evaluate the diploid-PASTIS method in our simulations, as most of our evaluation metrics were based on imputed contact matrices.

The main advantage of the ASHIC-ZIPM model over the ASHIC-PM model is that ASHIC-ZIPM explicitly accounts for the excessive zeros in Hi-C matrices by modeling the probability that each observed zero count is a ‘true’ zero or a ‘missing’ zero. As a result, we observed that the ASHIC-ZIPM model consistently outperformed the ASHIC-PM model in all simulation settings. While the performance of the two models was often similar, the improvements of ASHIC-ZIPM over ASHIC-PM became more evident when the SNP density decreased. In addition, the differences between the ASHIC-ZIPM and ASHIC-PM models were particularly noticeable under the more challenging simulation setting of identical homologous structures. This is because only a few allele-certain contacts were observed when the SNP density was low. The ASHIC-PM model uses the allele-certain contacts to initialize the EM algorithm and treats all zeros as ‘true’ zeros, thereby producing less optimal results. In contrast, ASHIC-ZIPM explicitly adjusts the weights between ‘true’ and ‘missing’ zeros, thereby achieving more accurate models.
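The distinction between the two kinds of zeros can be illustrated with a generic zero-inflated Poisson (ZIP) mixture. The sketch below is not the ASHIC-ZIPM implementation: the symbols `lam` (Poisson mean of the contact count) and `pi_missing` (mixing weight of the zero state) and both function names are assumptions, and the parametrization in the Methods may differ. It only shows how a posterior weight for a ‘missing’ zero follows from Bayes' rule.

```python
import math


def zip_pmf(k, lam, pi_missing):
    """Probability of observing count k under a zero-inflated Poisson mixture."""
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    zero_state = pi_missing if k == 0 else 0.0
    return zero_state + (1.0 - pi_missing) * poisson


def prob_missing_given_zero(lam, pi_missing):
    """Posterior probability that an observed zero came from the zero state.

    The complement is the probability of a Poisson ('true') zero.
    """
    true_zero = (1.0 - pi_missing) * math.exp(-lam)
    return pi_missing / (pi_missing + true_zero)
```

For example, with `pi_missing = 0.3`, an observed zero at a bin pair expected to carry five contacts is attributed to the zero state with probability above 0.98, whereas at a bin pair expected to carry 0.1 contacts that probability is only about 0.32. A plain Poisson model has no such weight and must explain every zero through a small mean.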

Hi-C contact counts could be over-dispersed; thus, a Negative Binomial (NB) model may provide a better fit than a Poisson model. However, our ASHIC models leverage two convenient properties of the Poisson distribution: the outcomes from a Poisson-multinomial hierarchical model are Poisson variables, and the sum of independent Poisson variables is also a Poisson variable (Supplementary Methods). If we adopted an NB model, we would no longer have such a neat and tractable hierarchical model, and as a result the model fitting would become computationally expensive. In addition, we would like to point out that the ZIP model can account for over-dispersion to some extent by fitting a mixture of a Poisson state and the zero (missing) state. Furthermore, the ASHIC methods use the spatial distance rather than the genomic distance between the contacting pair to determine the Poisson or ZIP parameter, and therefore could be less impacted by over-dispersion.
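The two Poisson properties, and the extra variance of a ZIP mixture, can be checked numerically. The following sketch is only a sanity check under stated assumptions: it uses the two-category (binomial) case of the multinomial split, truncates the infinite sum at a hypothetical `n_max`, and all function names are illustrative.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of count k, computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def split_pmf(k, lam, p, n_max=150):
    """P(component = k) when a Poisson(lam) total is split with probability p."""
    return sum(
        poisson_pmf(n, lam) * math.comb(n, k) * p ** k * (1 - p) ** (n - k)
        for n in range(k, n_max)
    )


def sum_pmf(k, lam1, lam2):
    """P(X + Y = k) for independent X ~ Poisson(lam1) and Y ~ Poisson(lam2)."""
    return sum(poisson_pmf(j, lam1) * poisson_pmf(k - j, lam2) for j in range(k + 1))


def zip_mean_var(lam, pi_missing):
    """Mean and variance of a zero-inflated Poisson with zero-state weight pi_missing."""
    mean = (1 - pi_missing) * lam
    var = mean * (1 + pi_missing * lam)
    return mean, var


# Splitting a Poisson(6) total with probability 0.25 gives a Poisson(1.5) part.
print(split_pmf(3, 6.0, 0.25), poisson_pmf(3, 1.5))
# Adding Poisson(1.5) and Poisson(4.5) counts gives a Poisson(6) total.
print(sum_pmf(4, 1.5, 4.5), poisson_pmf(4, 6.0))
```

Each printed pair should agree up to floating-point error. `zip_mean_var` returns a variance larger than the mean whenever `pi_missing` is positive, which is the sense in which a ZIP mixture absorbs some over-dispersion.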

We demonstrated the applications of our ASHIC-ZIPM method in mouse patski cells and in human GM12878 cells. Previous studies predicted allelic X chromosome structures at 1 Mb (9) and 500 kb (14) resolutions. In contrast, our method utilized all diploid contacts and produced finer-scale allelic structures of the entire X chromosomes at 50-kb resolution. Our results further confirmed the existence of the bipartite structure of Xi. The ability to impute all allele-ambiguous contacts is particularly important when zooming into local imprinting regions. Since imprinting regions are often small, fine-resolution allelic contact maps and 3D structures are required for an in-depth study. With our ASHIC-ZIPM model, we produced the first 10-kb-resolution diploid Hi-C contact maps of the mouse H19/Igf2 imprinting region, and revealed the existence of the maternal-specific sub-TAD organization at H19-DMR. This sub-TAD formation creates insulation between H19 and Igf2 that likely prevents the activation of Igf2 on the maternal allele. Our study of the human H19/IGF2 imprinting region further confirmed this parental-specific chromatin organization. Furthermore, the ASHIC-imputed diploid Hi-C maps offered an informative view of the (sub-)TAD organization of the imprinting region, whereas the previous 4C-seq study (35) was restricted to only a few anchor regions.

Our ASHIC methods currently have a few limitations. First, our methods provide chromosome-wide rather than genome-wide modeling of diploid Hi-C data. One possible future extension is to build a genome-wide model by incorporating an additional estimation step in the EM algorithm to model the relative positions of multiple homologous chromosomes. We could further parallelize the optimization procedures for each homologous chromosome pair to speed up the genome-wide modeling. Second, our model is specifically designed for diploid genomes. Extending the model to polyploid or aneuploid genomes remains a challenging problem. Lastly, the computational efficiency of our EM algorithm, especially the structure estimation step, could be further improved. One possible solution is to adopt an iterative modeling strategy similar to (13,27), starting with coarse-resolution modeling and gradually refining the structures to finer resolutions.

ACKNOWLEDGEMENTS

The authors would like to thank Weixin Yao, Tao Jiang and Thomas Girke for the discussion and helpful suggestions, Yijun Ruan for providing the phased ChIA-PET data, and the referees for the constructive and valuable suggestions.

Contributor Information

Tiantian Ye, Genetics, Genomics and Bioinformatics Program.

Wenxiu Ma, Department of Statistics, University of California Riverside, Riverside, CA 92521, USA.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

National Science Foundation [DBI-1751317]; Regents’ Faculty Fellowship from the University of California Riverside. Funding for open access charge: NSF [DBI-1751317].

Conflict of interest statement. None declared.

REFERENCES

1. Dekker J. Gene regulation in the third dimension. Science. 2008; 319:1793–1794.
2. Lieberman-Aiden E., van Berkum N.L., Williams L., Imakaev M., Ragoczy T., Telling A., Amit I., Lajoie B.R., Sabo P.J., Dorschner M.O. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009; 326:289–293.
3. Duan Z., Andronescu M., Schutz K., McIlwain S., Kim Y.J., Lee C., Shendure J., Fields S., Blau C.A., Noble W.S. A three-dimensional model of the yeast genome. Nature. 2010; 465:363–367.
4. Kalhor R., Tjong H., Jayathilaka N., Alber F., Chen L. Genome architectures revealed by tethered chromosome conformation capture and population-based modeling. Nat. Biotechnol. 2012; 30:90–98.
5. Ma W., Ay F., Lee C., Gulsoy G., Deng X., Cook S., Hesson J., Cavanaugh C., Ware C.B., Krumm A. et al. Fine-scale chromatin interaction maps reveal the cis-regulatory landscape of lincRNA genes in human cells. Nat. Methods. 2015; 12:71–78.
6. Rao S.S.P., Huntley M.H., Durand N., Neva C., Stamenova E.K., Bochkov I.D., Robinson J.T., Sanborn A.L., Machol I., Omer A.D. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014; 159:1665–1680.
7. Ramani V., Cusanovich D.A., Hause R.J., Ma W., Qiu R., Deng X., Blau C.A., Disteche C.M., Noble W.S., Shendure J. et al. Mapping 3D genome architecture through in situ DNase Hi-C. Nat. Protoc. 2016; 11:2104–2121.
8. Dixon J.R., Selvaraj S., Yue F., Kim A., Li Y., Shen Y., Hu M., Liu J.S., Ren B. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012; 485:376–380.
9. Deng X., Ma W., Ramani V., Hill A., Yang F., Ay F., Berletch J.B., Blau C.A., Shendure J., Duan Z., Noble W.S. et al. Bipartite structure of the inactive mouse X chromosome. Genome Biol. 2015; 16:152.
10. Giorgetti L., Lajoie B.R., Carter A.C., Attia M., Zhan Y., Xu J., Chen C.J., Kaplan N., Chang H.Y., Heard E. et al. Structural organization of the inactive X chromosome in the mouse. Nature. 2016; 535:575–579.
11. Darrow E.M., Huntley M.H., Dudchenko O., Stamenova E.K., Durand N.C., Sun Z., Huang S.C., Sanborn A.L., Machol I., Shamim M. et al. Deletion of DXZ4 on the human inactive X chromosome alters higher-order genome architecture. Proc. Natl. Acad. Sci. U.S.A. 2016; 113:E4504–E4512.
12. Du Z., Zheng H., Huang B., Ma R., Wu J., Zhang X., He J., Xiang Y., Wang Q., Li Y. et al. Allelic reprogramming of 3D chromatin architecture during early mammalian development. Nature. 2017; 547:232–235.
13. Tan L., Xing D., Chang C.H., Li H., Xie X.S. Three-dimensional genome structures of single diploid human cells. Science. 2018; 361:924–928.
14. Cauer A.G., Yardimci G., Vert J.P., Varoquaux N., Noble W.S. Inferring diploid 3D chromatin structures from Hi-C data. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). 2019; 143:Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik; 11:1–11:13.
15. Bonora G., Deng X., Fang H., Ramani V., Qiu R., Berletch J.B., Filippova G.N., Duan Z., Shendure J., Noble W.S. et al. Orientation-dependent Dxz4 contacts shape the 3D structure of the inactive X chromosome. Nat. Commun. 2018; 9:1445.
16. Varoquaux N., Ay F., Noble W.S., Vert J.P. A statistical approach for inferring the 3D structure of the genome. Bioinformatics. 2014; 30:i26–i33.
17. Tang Z., Luo O.J., Li X., Zheng M., Zhu J.J., Szalaj P., Trzaskoma P., Magalska A., Wlodarczyk J., Ruszczycki B. et al. CTCF-mediated human 3D genome architecture reveals chromatin topology for transcription. Cell. 2015; 163:1611–1627.
18. Wang S., Xu J., Zeng J. Inferential modeling of 3D chromatin structure. Nucleic Acids Res. 2015; 43:e54.
19. Dekker J., Marti-Renom M.A., Mirny L.A. Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat. Rev. Genet. 2013; 14:390–403.
20. Imakaev M., Fudenberg G., McCord R.P., Naumova N., Goloborodko A., Lajoie B.R., Dekker J., Mirny L.A. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods. 2012; 9:999.
21. Kruskal J.B., Wish M. Multidimensional scaling. 1978; Sage.
22. Kabsch W. A solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A. 1976; 32:922–923.
23. Servant N., Varoquaux N., Lajoie B.R., Viara E., Chen C.-J., Vert J.-P., Heard E., Dekker J., Barillot E. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol. 2015; 16:259.
24. Yang T., Zhang F., Yardımcı G.G., Song F., Hardison R.C., Noble W.S., Yue F., Li Q. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res. 2017; 27:1939–1949.
25. Ay F., Bailey T.L., Noble W.S. Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome Res. 2014; 24:999–1011.
26. Lyon M.F. Gene action in the X-chromosome of the mouse (Mus musculus L.). Nature. 1961; 190:372–373.
27. Stevens T.J., Lando D., Basu S., Atkinson L.P., Cao Y., Lee S.F., Leeb M., Wohlfahrt K.J., Boucher W., O’Shaughnessy-Kirwan A. et al. 3D structures of individual mammalian genomes studied by single-cell Hi-C. Nature. 2017; 544:59–64.
28. Ramírez F., Bhardwaj V., Arrigoni L., Lam K.C., Grüning B.A., Villaveces J., Habermann B., Akhtar A., Manke T. High-resolution TADs reveal DNA sequences underlying genome organization in flies. Nat. Commun. 2018; 9:1–15.
29. Yoo-Warren H., Pachnis V., Ingram R.S., Tilghman S.M. Two regulatory domains flank the mouse H19 gene. Mol. Cell Biol. 1988; 8:4707–4715.
30. Leighton P.A., Saam J.R., Ingram R.S., Stewart C.L., Tilghman S.M. An enhancer deletion affects both H19 and Igf2 expression. Genes Dev. 1995; 9:2079–2089.
31. Ishihara K., Hatano N., Furuumi H., Kato R., Iwaki T., Miura K., Jinno Y., Sasaki H. Comparative genomic sequencing identifies novel tissue-specific enhancers and sequence elements for methylation-sensitive factors implicated in Igf2/H19 imprinting. Genome Res. 2000; 10:664–671.
32. Thorvaldsen J.L., Duran K.L., Bartolomei M.S. Deletion of the H19 differentially methylated domain results in loss of imprinted expression of H19 and Igf2. Genes Dev. 1998; 12:3693–3702.
33. Barlow D.P., Bartolomei M.S. Genomic imprinting in mammals. CSH Perspect. Biol. 2014; 6:a018382.
34. Merkenschlager M., Odom D.T. CTCF and cohesin: linking gene regulatory elements with their targets. Cell. 2013; 152:1285–1297.
35. Llères D., Moindrot B., Pathak R., Piras V., Matelot M., Pignard B., Marchand A., Poncelet M., Perrin A., Tellier V. et al. CTCF modulates allele-specific sub-TAD organization and imprinted gene activity at the mouse Dlk1-Dio3 and Igf2-H19 domains. Genome Biol. 2019; 20:272.
36. Murrell A., Heeson S., Reik W. Interaction between differentially methylated regions partitions the imprinted genes Igf2 and H19 into parent-specific chromatin loops. Nat. Genet. 2004; 36:889–893.
37. Court F., Baniol M., Hagege H., Petit J.S., Lelay-Taha M.N., Carbonell F., Weber M., Cathala G., Forne T. Long-range chromatin interactions at the mouse Igf2/H19 locus reveal a novel paternally expressed long non-coding RNA. Nucleic Acids Res. 2011; 39:5893–5906.
38. Kurukuti S., Tiwari V.K., Tavoosidana G., Pugacheva E., Murrell A., Zhao Z., Lobanenkov V., Reik W., Ohlsson R. CTCF binding at the H19 imprinting control region mediates maternally inherited higher-order chromatin conformation to restrict enhancer access to Igf2. Proc. Natl. Acad. Sci. U.S.A. 2006; 103:10684–10689.
39. Bell A.C., Felsenfeld G. Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene. Nature. 2000; 405:482–485.
40. Hark A.T., Schoenherr C.J., Katz D.J., Ingram R.S., Levorse J.M., Tilghman S.M. CTCF mediates methylation-sensitive enhancer-blocking activity at the H19/Igf2 locus. Nature. 2000; 405:486–489.
41. Phillips-Cremins J.E., Sauria M.E.G., Sanyal A., Gerasimova T.I., Lajoie B., Bell J.S.K., Ong C.T., Hookway T.A., Guo C., Sun Y. et al. Architectural protein subclasses shape 3D organization of genomes during lineage commitment. Cell. 2013; 153:1281–1295.
42. Cubeñas-Potts C., Corces V.G. Topologically associating domains: an invariant framework or a dynamic scaffold? Nucleus. 2015; 6:430–434.
43. Crane E., Bian Q., McCord R.P., Lajoie B.R., Wheeler B.S., Ralston E.J., Uzawa S., Dekker J., Meyer B.J. Condensin-driven remodelling of X chromosome topology during dosage compensation. Nature. 2015; 523:240–244.
44. Kruse K., Hug C.B., Hernández-Rodríguez B., Vaquerizas J.M. TADtool: visual parameter identification for TAD-calling algorithms. Bioinformatics. 2016; 32:3190–3192.
45. Fullwood M.J., Liu M.H., Pan Y.F., Liu J., Xu H., Mohamed Y.B., Orlov Y.L., Velkov S., Ho A., Mei P.H. et al. An oestrogen-receptor-α-bound human chromatin interactome. Nature. 2009; 462:58–64.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
