Author manuscript; available in PMC: 2021 May 20.
Published in final edited form as: Cell Syst. 2020 May 20;10(5):397–407.e5. doi: 10.1016/j.cels.2020.04.004

MATCHA: Probing multi-way chromatin interaction with hypergraph representation learning

Ruochi Zhang 1, Jian Ma 1,*
PMCID: PMC7299183  NIHMSID: NIHMS1588572  PMID: 32550271

SUMMARY

Recent advances in ligation-free, genome-wide chromatin interaction mapping such as SPRITE and ChIA-Drop have enabled the identification of simultaneous interactions involving multiple genomic loci within the same nuclei, which are informative to delineate higher-order genome organization and gene regulation mechanisms at single-nucleus resolution. Unfortunately, computational methods for analyzing multi-way chromatin interaction data are significantly underexplored. Here we develop an algorithm, called MATCHA, based on hypergraph representation learning where multi-way chromatin interactions are represented as hyperedges. Applications to SPRITE and ChIA-Drop data suggest that MATCHA is effective to denoise the data and make de novo predictions, which greatly enhances the data quality for analyzing the properties of multi-way chromatin interactions. MATCHA provides a promising framework to significantly improve the analysis of multi-way chromatin interaction data and has the potential to offer unique insights into higher-order chromosome organization and function. MATCHA is freely available for download here: https://github.com/ma-compbio/MATCHA.

eTOC paragraph

A new computational framework called MATCHA facilitates the analysis of multi-way chromatin interaction data to delineate unique principles of 3D genome structure and function.

INTRODUCTION

Interphase chromosomes in higher eukaryotic cells are folded and packaged in the nucleus, leading to higher-order chromatin interactions in three-dimensional (3D) space that are key to understanding gene regulation and cell function (Kumaran et al., 2008; Bonev and Cavalli, 2016). Advances in high-throughput mapping of nuclear genome organization by capturing pairwise interactions between proximal genomic loci, e.g., Hi-C (Lieberman-Aiden et al., 2009; Rao et al., 2014) and ChIA-PET (Fullwood and Ruan, 2009; Tang et al., 2015), have enabled genome-wide characterization of 3D genome features, including loops (Rao et al., 2014; Tang et al., 2015), topologically associating domains (Dixon et al., 2012; Nora et al., 2012), A/B compartments (Lieberman-Aiden et al., 2009) and subcompartments (Rao et al., 2014; Xiong and Ma, 2019). However, one limitation of the proximity ligation based methods is that they capture interactions between genomic loci that are in close proximity to directly ligate and are unable to reveal chromatin interactions that are beyond the distance of direct ligation (Beagrie et al., 2017; Quinodoz et al., 2018). More importantly, most widely available Hi-C and ChIA-PET data only measure pairwise interactions and cannot delineate multiple chromatin loci that interact simultaneously in the same nucleus (Kempfer and Pombo, 2019).

Very recently, new technologies such as SPRITE (Quinodoz et al., 2018) and ChIA-Drop (Zheng et al., 2019) have been developed to capture simultaneous interactions among multiple genomic loci within individual nuclei. Based on the SPRITE data, Quinodoz et al. (2018) reported that the inter-chromosomal chromatin interactions can be partitioned into distinct active and inactive hubs. ChIA-Drop, on the other hand, allows the detection of multi-way chromatin interactions mediated by specific proteins (Zheng et al., 2019). For example, from RNA PolII ChIA-Drop data, potential co-regulated genes can be characterized. These recently developed methods have demonstrated the unique properties of the 3D genome architecture that can only be manifested by multi-way chromatin interactions (Beagrie et al., 2017; Quinodoz et al., 2018; Zheng et al., 2019).

However, there are a number of unsolved challenges in analyzing multi-way chromatin interaction data. First, existing methods for analyzing SPRITE and ChIA-Drop data typically make strong assumptions that lead to loss of information from the original data. For example, Peakachu (Salameh et al., 2019) extracts chromatin interactions from multi-way interaction data by simply decomposing clusters into a pairwise contact matrix so that methods developed for Hi-C and ChIA-PET can be applied directly, which causes a dramatic loss of higher-order contact information. Recently, MIA-Sig (Kim et al., 2019) was developed to remove noise and call significant chromatin complexes in ChIA-Drop data based on the assumption that the loci of a true multi-way interaction should be more evenly spaced along the genome and that a genomic locus far from the other loci within the droplet is likely contamination. This assumption, however, may introduce errors and has yet to be further evaluated and confirmed (Kim et al., 2019). Second, the observed frequencies for clusters with a larger number of loci are much lower than for smaller ones, making it increasingly difficult to reliably denoise the data based on frequencies only. For example, in Quinodoz et al. (2018), combinations of 1Mb genomic bins (referred to as k-mers) were required to be observed in at least 5 SPRITE clusters (occurrence frequency ≥ 5) to be further considered. However, when analyzing multi-way interactions with larger size and higher resolution, the cut-off for frequencies of larger k-mers would be much harder to determine because of their low occurrence. Third, the number of multi-way interaction datasets and their quality remain limited, but no reliable predictive method is currently available to either denoise the data or to make de novo multi-way interaction predictions. Indeed, in order to fully realize the potential of the emerging multi-way chromatin interaction data, it is imperative to have new algorithms that can extract important patterns by addressing the aforementioned challenges.

Here we develop a new generic computational framework, called MATCHA (Multi-wAy inTeracting CHromatin Analysis), for the analysis of multi-way chromatin interaction data (see the concept of MATCHA in Figure 1). We consider the 3D genome as a graph where nodes are genomic bins and edges connecting bins represent chromatin interactions between bins. When an interaction involves more than two nodes, it is referred to as a hyperedge and the graph containing hyperedges is called a hypergraph (Berge, 1984; Zhou et al., 2007). In other words, we model each multi-way interaction as a hyperedge. MATCHA takes multi-way chromatin interaction data as input and extracts patterns from the corresponding hypergraph. The patterns are represented as embedding vectors for each genomic bin that reflect the properties of 3D chromatin structures. The model can further predict the probability for a group of genomic bins having a simultaneous interaction, for either denoising the dataset or making de novo predictions. Taken together, MATCHA is a new computational method based on hypergraph representation learning for the analysis of multi-way chromatin interaction data that can provide new insights into nuclear genome structure and function.

Figure 1: Illustration of the concept of MATCHA.

MATCHA takes a multi-way chromatin interaction dataset (such as SPRITE or ChIA-Drop) as input and constructs a hypergraph where genomic bins are nodes and simultaneous interactions involving multiple loci are hyperedges. The constructed hypergraph passes through a hypergraph representation learning method that can produce embeddings for nodes and accurately predict hyperedges. The trained model can be used for applications such as multi-way interaction identification, denoised contact matrix generation, and integration of other functional genomic signals.

RESULTS

Overview of the MATCHA algorithm

Figure 2A illustrates the workflow of MATCHA for the analysis of multi-way chromatin interaction data. There are four main components: (1) Constructing hypergraphs (the formal definition can be found in STAR Methods) based on the multi-way chromatin interaction data where non-overlapping genomic bins are defined as nodes and bins in the same multi-way interaction are connected by a hyperedge. (2) Generating node features for the hypergraph based on the decomposed pairwise contact matrix from multi-way interaction data. The decomposed pairwise contact matrix passes through the Mix-n-Match autoencoder (STAR Methods). (3) Generating labeled data for the training of the hypergraph representation learning model. Within the dataset, positive samples are defined as existing hyperedges while negative samples are unobserved ones. We generate negative samples through an efficient and biologically meaningful negative sampling strategy. (4) Training our hypergraph representation learning model Hyper-SAGNN (Zhang et al., 2020), which takes both labeled data and node features as input (Figure 2B). Details of each component are described in STAR Methods.
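Component (1) can be illustrated with a minimal sketch. It assumes each cluster is given as a list of (chromosome, position) reads and that a hyperedge is the set of non-overlapping bins hit by one whole cluster; the function names and the toy coordinates are hypothetical, and the actual processing is the one defined in STAR Methods.

```python
from collections import Counter

def bin_cluster(reads, bin_size):
    """Map the (chrom, position) reads of one cluster to a sorted tuple of unique bins."""
    bins = {(chrom, pos // bin_size) for chrom, pos in reads}
    return tuple(sorted(bins))

def build_hyperedges(clusters, bin_size, min_size=2):
    """Count how often each set of bins is observed together as one cluster."""
    counts = Counter()
    for reads in clusters:
        edge = bin_cluster(reads, bin_size)
        if len(edge) >= min_size:  # a cluster that collapses to one bin is not an interaction
            counts[edge] += 1
    return counts

# Toy input (made-up coordinates): three clusters at 1Mb resolution.
clusters = [
    [("chr1", 1_200_000), ("chr1", 5_700_000), ("chr2", 300_000)],
    [("chr1", 1_900_000), ("chr1", 5_100_000), ("chr2", 999_999)],
    [("chr1", 1_000_000), ("chr1", 1_500_000)],  # both reads fall in the same bin
]
hyperedges = build_hyperedges(clusters, bin_size=1_000_000)
```

In this toy input the first two clusters map to the same three bins and therefore yield one hyperedge with occurrence frequency 2, while the third cluster is discarded.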

Figure 2: Overview of the MATCHA algorithm.

(A) Workflow of MATCHA. Observed clusters from multi-way chromatin interaction data such as SPRITE and ChIA-Drop are defined as positive samples (hyperedges) and the unobserved groups of nodes are sampled as negative ones. Observed clusters are decomposed into a pairwise contact matrix to generate node features by going through an autoencoder. Both labeled data and node features are used to train Hyper-SAGNN (Zhang et al., 2020) to model the hypergraph constructed from multi-way chromatin interaction data. (B) Hyper-SAGNN architecture (details in Zhang et al. (2020)). The input of the model contains groups of nodes with node features, (x_1, x_2, ..., x_k). The tuple passes through fully-connected layers and multi-head attention layers to generate static and dynamic node embeddings, respectively. Then a pseudo-Euclidean distance is calculated for each pair of static/dynamic embeddings, which is turned into a probability score between 0 and 1. The final probability score indicating whether this group of nodes (1, 2, ..., k) would form a hyperedge is calculated by average pooling.

We remark that in the last step we use our recently developed model, Hyper-SAGNN (Zhang et al., 2020), instead of other previously developed graph representation learning methods for several reasons. The hyperedge prediction problem (the formal definition can be found in STAR Methods) is equivalent to learning the function p that takes tuples of node features (x_1, ..., x_k) as input and produces the probability of these nodes forming a hyperedge. Since the number of genomic bins involved in a multi-way interaction varies, the constructed hypergraph would be a non-uniform hypergraph. Moreover, there is no intrinsic order in each hyperedge; in other words, the function p should satisfy p(x_1, ..., x_k) = p(shuffle(x_1, ..., x_k)), where shuffle() represents random shuffling of the order of the nodes involved in the hyperedge. None of the graph representation learning methods for pairwise interactions (such as DeepWalk (Perozzi et al., 2014) and node2vec (Grover and Leskovec, 2016)) or previously developed hyperedge prediction methods (such as DHNE (Tu et al., 2018) and HEBE (Gui et al., 2016)) can handle our situation because they either cannot model higher-order information within hyperedges or require fixed-sized and ordered input. We recently developed Hyper-SAGNN (Zhang et al., 2020) for the general hypergraph representation learning problem and it is applicable to homogeneous and heterogeneous hypergraphs with variable hyperedge sizes. We demonstrated in Zhang et al. (2020) that Hyper-SAGNN achieves state-of-the-art performance in multiple applications.
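The two requirements on p (variable input size and order invariance) can be made concrete with a toy scorer. This is not Hyper-SAGNN: the learned layers and attention are replaced here by a hand-written comparison of each node with the mean of the other nodes, and the names score_tuple, weight, and bias are hypothetical. The sketch only shows why scoring each node against its context and then average pooling gives a function that accepts any k ≥ 2 and ignores node order.

```python
import math
import random

def score_tuple(features, weight=1.0, bias=0.0):
    """Toy permutation-invariant scorer for a group of k >= 2 node feature vectors.

    Each node is compared with the mean of the other nodes (a crude stand-in
    for a context-dependent embedding). The squared distance is mapped to
    (0, 1) with a logistic function so that closer nodes score higher, and
    the per-node scores are average-pooled.
    """
    k = len(features)
    dim = len(features[0])
    totals = [sum(x[d] for x in features) for d in range(dim)]
    node_scores = []
    for x in features:
        context = [(totals[d] - x[d]) / (k - 1) for d in range(dim)]
        sq_dist = sum((x[d] - context[d]) ** 2 for d in range(dim))
        node_scores.append(1.0 / (1.0 + math.exp(weight * sq_dist + bias)))
    return sum(node_scores) / k  # average pooling over the nodes

triplet = [[0.1, 0.2], [0.4, 0.0], [0.3, 0.9]]
shuffled = triplet[:]
random.Random(0).shuffle(shuffled)
# Same score up to floating-point rounding, regardless of node order.
assert math.isclose(score_tuple(triplet), score_tuple(shuffled))
```

The same function can be called on a group of four or five nodes without any change, which is the property that fixed-size, ordered-input models lack.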

MATCHA accurately predicts multi-way chromatin interactions with different sizes

We first briefly describe the data we used in this work and the construction of hypergraphs. The SPRITE data are from the GM12878 lymphoblastoid human cell line (Quinodoz et al., 2018). The RNAPII enriched ChIA-Drop data are from Drosophila S2 cells (Zheng et al., 2019). Unless otherwise stated, we used 1Mb resolution for the SPRITE data and 5kb resolution for the RNAPII ChIA-Drop data to build the hypergraphs. The details of data processing are described in STAR Methods. In Table S1, we list the number of different sized hyperedges for both datasets grouped by the occurrence frequency. As expected, the number of hyperedges decreases with larger hyperedge size and larger occurrence frequency cut-off. To balance the number of hyperedges over different sizes in SPRITE data, we selected hyperedges that have occurrence frequency ≥ 8 for hyperedges of size 3, ≥ 3 for hyperedges of size 4, and ≥ 2 for hyperedges of size 5. This resulted in around 700k hyperedges of size 3, 800k hyperedges of size 4, and 700k hyperedges of size 5. Note that using a more stringent standard to define hyperedges would lead to higher accuracy, but this also leads to fewer training samples and potential bias because hyperedges with larger occurrence frequency tend to have smaller genomic distance. A comprehensive analysis of the effect of different cut-offs for defining hyperedges can be found in STAR Methods. For the RNAPII ChIA-Drop data, all hyperedges are “intra” as the inter-chromosomal reads were filtered in the original ChIA-Drop pipeline. We found that the remaining number of hyperedges decreases more dramatically for RNAPII ChIA-Drop data compared with SPRITE (see Table S1 and STAR Methods). We therefore used 2 as the cut-off for hyperedges of all sizes.
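The size-dependent cut-offs for SPRITE can be sketched as follows. The sketch assumes that every size-k combination of bins within a cluster counts as one occurrence of that k-mer, which follows the k-mer convention of Quinodoz et al. (2018) mentioned in the Introduction; whether MATCHA counts occurrences in exactly this way is specified in STAR Methods, and the names count_kmers and select_hyperedges are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Cut-offs quoted in the text for SPRITE: hyperedge size -> minimum occurrence frequency.
SIZE_CUTOFFS = {3: 8, 4: 3, 5: 2}

def count_kmers(binned_clusters, sizes):
    """Count, for each size in `sizes`, how many clusters contain each combination of bins."""
    counts = Counter()
    for cluster in binned_clusters:
        bins = sorted(set(cluster))
        for k in sizes:
            for kmer in combinations(bins, k):
                counts[kmer] += 1
    return counts

def select_hyperedges(counts, cutoffs):
    """Keep the k-mers whose occurrence frequency reaches the cut-off for their size."""
    return {kmer for kmer, n in counts.items()
            if len(kmer) in cutoffs and n >= cutoffs[len(kmer)]}

# Toy clusters given as lists of bin indices (made-up data).
binned_clusters = [[1, 2, 3, 4]] * 3 + [[1, 2, 3]] * 5
counts = count_kmers(binned_clusters, sizes=SIZE_CUTOFFS)
selected = select_hyperedges(counts, SIZE_CUTOFFS)
```

Here the triplet (1, 2, 3) is seen in 8 clusters and passes the size-3 cut-off, the 4-mer (1, 2, 3, 4) is seen 3 times and passes the size-4 cut-off, and the remaining triplets (seen 3 times each) are not used as positive samples. Note that enumerating all combinations grows quickly with cluster size, so a real pipeline would need to bound the cluster sizes it expands.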

We evaluated MATCHA based on the hypergraph constructed from the SPRITE data with 80% of the hyperedges as the training samples and 20% of them as the testing samples. We quantified the performance by AUROC (area under the receiver operating characteristic) and AUPR (area under the precision-recall curve) scores on the testing dataset. Both metrics were calculated for hyperedges with different sizes separately to provide a comprehensive assessment. The performance is shown in Table S2. We found that MATCHA is able to make accurate predictions across hyperedges with different sizes. We observed lower AUROC and AUPR score for hyperedges of size 5, which may be caused by the lower occurrence frequency cut-off for larger hyperedges. We also performed the evaluation based on the hypergraph constructed from RNAPII ChIA-Drop data. Our method again achieves strong performance in terms of the prediction of hyperedges. However, we observed a different trend of performance versus the size of hyperedge compared with the results from SPRITE, which could be due to the uniform cut-off we used for RNAPII ChIA-Drop data whereas we used a lower cut-off for larger hyperedges for the SPRITE data. Overall, this evaluation suggests that MATCHA is able to accurately predict hyperedges solely based on a fraction of the hyperedges derived from multi-way chromatin interaction data.
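The per-size evaluation can be sketched in plain Python. The sketch computes AUROC only, using its pairwise-ranking definition (the probability that a positive sample is scored above a negative one, with ties counted as one half); AUPR is omitted for brevity. The tuple layout (nodes, label, score) and the function names are assumptions made for illustration, and the scores are made up.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def auroc_by_size(samples):
    """Group test samples by hyperedge size and compute AUROC within each group."""
    grouped = {}
    for nodes, label, score in samples:
        pos, neg = grouped.setdefault(len(nodes), ([], []))
        (pos if label == 1 else neg).append(score)
    # Sizes lacking either positives or negatives are skipped (AUROC undefined).
    return {k: auroc(pos, neg) for k, (pos, neg) in grouped.items() if pos and neg}

# Toy test set: (nodes, label, predicted probability).
samples = [
    ((1, 2, 3), 1, 0.9), ((1, 2, 4), 1, 0.6),
    ((1, 5, 9), 0, 0.7), ((2, 6, 8), 0, 0.2),
    ((1, 2, 3, 4), 1, 0.8), ((1, 5, 7, 9), 0, 0.3),
]
per_size = auroc_by_size(samples)
```

The quadratic pairwise loop is only suitable for small examples; for test sets of the size reported here a rank-based implementation from a standard library would be the practical choice.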

MATCHA can denoise multi-way chromatin interaction data

Next, we sought to ask if MATCHA can be used to assess whether the clusters with occurrence below the cut-off are real interactions. As a proof-of-principle, we first predicted the probability of triplets (3-way interactions) from the occurrence frequency category of 2, 3–4, and 5–7 in Table S1. Note that here we used triplets with occurrence frequency ≥ 8 as training data. We evaluated the predicted probabilities by comparing with the Hi-C contact matrix. The rationale of this evaluation is that if a certain group of genomic loci simultaneously interact frequently in the cell population, it is expected to see that Hi-C can capture pairwise interactions between each pair of genomic bins.

We first decomposed all the triplets in the training data (triplets with occurrence frequency ≥ 8) into pairwise edges and identified the corresponding entry in the Hi-C contact matrix. We then calculated the average value of these entries for intra-chromosomal interactions and inter-chromosomal interactions, respectively. These two averaged values were then used to binarize the Hi-C contact matrix and build the Hi-C graph. All the triplets were grouped by the predicted probability score where each group is further categorized by the number of Hi-C edges within the triplets (see Figure 3A where we also include the positive samples used to train the model as a reference). As shown in Figure 3A, triplets assigned with higher probability scores are typically more enriched with Hi-C edges. For triplets with probability scores > 0.9, there are more than 40% with all three pairwise interactions supported by Hi-C edges.
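The binarization and edge counting described above can be sketched as follows. Several details are assumptions made for illustration: the Hi-C matrix is a dictionary keyed by sorted bin pairs, pairs shared by several training triplets contribute once per triplet to the averages, and an entry equal to the threshold counts as an edge. The function names and toy values are hypothetical.

```python
from itertools import combinations
from statistics import mean

def hic_thresholds(positive_triplets, hic, chrom_of):
    """Average Hi-C value over the pairs of the training triplets, split into intra/inter."""
    intra, inter = [], []
    for triplet in positive_triplets:
        for a, b in combinations(sorted(triplet), 2):
            value = hic.get((a, b), 0.0)
            (intra if chrom_of[a] == chrom_of[b] else inter).append(value)
    # Fall back to infinity (no edge can pass) when a category has no pairs.
    intra_cut = mean(intra) if intra else float("inf")
    inter_cut = mean(inter) if inter else float("inf")
    return intra_cut, inter_cut

def count_hic_edges(triplet, hic, chrom_of, intra_cut, inter_cut):
    """Number of pairs in the triplet (0 to 3) that are edges in the binarized Hi-C graph."""
    n_edges = 0
    for a, b in combinations(sorted(triplet), 2):
        cut = intra_cut if chrom_of[a] == chrom_of[b] else inter_cut
        if hic.get((a, b), 0.0) >= cut:
            n_edges += 1
    return n_edges

# Toy data: bins 0-2 on chr1, bin 3 on chr2; values are made up.
chrom_of = {0: "chr1", 1: "chr1", 2: "chr1", 3: "chr2"}
hic = {(0, 1): 10.0, (0, 2): 4.0, (1, 2): 6.0, (0, 3): 2.0, (1, 3): 1.0}
positives = [(0, 1, 2), (0, 1, 3)]
intra_cut, inter_cut = hic_thresholds(positives, hic, chrom_of)
edges_a = count_hic_edges((0, 1, 2), hic, chrom_of, intra_cut, inter_cut)
edges_b = count_hic_edges((0, 1, 3), hic, chrom_of, intra_cut, inter_cut)
```

Grouping triplets by predicted probability and tabulating these counts within each group yields the kind of distribution summarized in Figure 3A.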

Figure 3: Evaluation of MATCHA’s performance in denoising multi-way chromatin interaction data.

(A) Distribution of the number of Hi-C edges between the nodes from the triplets using the SPRITE data. The triplets that are observed in SPRITE clusters with frequency between 2–7 are grouped by the predicted probability score. The triplet group that is used in the training as positive samples (observed in SPRITE clusters at least 8 times) is marked as “positive”. (B) The number of triplets in the RNAPII ChIA-Drop data that either have occurrence frequency larger than 2 or have probability score greater than or equal to the listed threshold. (C) Distribution of the number of RNAPII ChIA-PET edges between the nodes from the triplets observed in RNAPII ChIA-Drop data. (D) Distribution of the number of RNAPII ChIA-PET edges between the nodes from the triplets that are unobserved in RNAPII ChIA-Drop data but have 1D genomic distance within the range of 1D genomic distances in the positive triplets. (E) Heatmap comparison of Hi-C versus the original SPRITE (left) and the denoised SPRITE by MATCHA (right). The heatmap is for chromosome 1 in the GM12878 human cell line at 100kb resolution. (F) Similarity measurement of Hi-C versus the original SPRITE and the denoised SPRITE by MATCHA. The measurement includes the stratum adjusted correlation coefficient (SCC) with different maximum distance, Pearson correlation score, and Spearman correlation score. “SCC - k” stands for the SCC that considers the first k diagonals of the contact matrix. (G) Heatmap comparison of the original ChIA-Drop versus the ChIA-Drop denoised by MIA-Sig (left) and MATCHA (right). The heatmap is for chromosome 1 in the Drosophila S2 cell line at 20kb resolution. (H) Similarity measurement of Hi-C versus the ChIA-Drop denoised by MIA-Sig and MATCHA. The measurement includes the SCC with different maximum distance, Pearson correlation score, and Spearman correlation score. “SCC - k” stands for the SCC that considers the first k diagonals of the contact matrix. See also Table S1 and Table S2.

We then assessed the effectiveness of our method for denoising the RNAPII ChIA-Drop data. Similar to the denoising evaluation on the SPRITE data, we trained the model on hyperedges with occurrence frequency ≥ 2 as positive samples and then predicted probability scores for all the observed triplets. To evaluate the reliability of the predictions, we counted the number of in situ RNAPII ChIA-PET loops (data from Zheng et al. (2019)) within the triplets. We specifically compared the triplets filtered by the predicted probability score versus the triplets filtered by occurrence frequencies. As shown in Figures 3B and 3C, we found that by using our method for denoising, we can identify more triplets with more ChIA-PET support than by simply using occurrence frequency. In addition, although a higher probability cut-off would lead to fewer triplets, it would also result in a higher percentage of triplets that have at least 2 ChIA-PET edges between pairs of genomic bins, further suggesting the improved quality of the predicted triplets.

Together, these results demonstrate the great potential of MATCHA as a denoising method. By training the model on a relatively small set of reliable hyperedges, MATCHA can more reliably remove false-positive hyperedges by predicting the probability score for the nodes whose occurrence frequencies are non-zero but not high enough to be confidently assigned as positive samples. As compared to filtering by occurrence frequency, MATCHA has a more principled framework to identify reliable hyperedges.

MATCHA makes de novo predictions of multi-way chromatin interaction

We then asked whether MATCHA is able to make de novo predictions, i.e., to predict new hyperedges that are not observed in the original data. Because the hypergraph constructed from RNAPII ChIA-Drop only contains intra-chromosomal hyperedges and has a maximum 1D genomic distance for interactions, it is possible to practically enumerate all the combinations of triplets. We specifically excluded the observed triplets from this triplet set and obtained probability scores based on MATCHA. We again evaluated the reliability of the predictions by comparing to RNAPII ChIA-PET data. In Figure 3D, we found that the triplets with higher probability scores again are more enriched with ChIA-PET edges compared with all potential triplets. Based on the support from ChIA-PET edges, these triplets with high probability scores predicted by MATCHA could potentially be real multi-way interactions (i.e., false negatives). Note that the fraction of triplets with only 1 ChIA-PET edge is larger than those with 2 or 3 ChIA-PET edges except for the triplets that have the predicted probability greater than 0.95, which was likely caused by the limited coverage of the RNAPII ChIA-PET data. Specifically, for the data used in this work, there are more than 2M clusters identified from the RNAPII ChIA-Drop data but only around 200K edges from the RNAPII ChIA-PET data. However, the tendency that triplets with higher predicted probability scores obtain more support from RNAPII ChIA-PET edges demonstrates the potential of MATCHA in making de novo predictions from existing data. Based on the model trained from the observed multi-way interactions, MATCHA is able to make de novo predictions of hyperedges, which could help reveal potential multi-way chromatin interactions undetected in the original data, further showing the advantage of MATCHA as a predictive model compared with other denoising approaches.

MATCHA improves overall data quality of SPRITE and ChIA-Drop

To further assess the performance of MATCHA for denoising and de novo prediction, we used MATCHA to generate a denoised pairwise contact map and compared it to either the original contact map or the ones from other data sources with higher coverage. We first evaluated MATCHA on the SPRITE data from GM12878. To achieve a more detailed comparison, we changed the resolution from 1Mb to 100Kb while the other processing procedures remained the same. After training the model, all pairs of genomic bins were used as input for MATCHA to predict the probabilities of being pairwise interactions, i.e., a “probability map”. We then calculated the element-wise product of the “probability map” and the original contact map, which becomes the denoised contact map. To show the impact of denoising, we compared the original and the denoised SPRITE contact maps to the Hi-C contact map in GM12878. Figure 3E shows the heatmap comparison of the original and the denoised SPRITE versus Hi-C on chromosome 1, respectively. We found that the denoised SPRITE contact map produced by MATCHA, while preserving similar near-diagonal structures, contains much less noise and clearer patterns for long-range interactions. These off-diagonal patterns based on MATCHA also correspond well with the Hi-C contact map. Detailed quantification of the similarity between the original and the denoised SPRITE versus Hi-C is shown in Figure 3F, where we utilized the stratum adjusted correlation coefficient (SCC; used in HiCRep (Yang et al., 2017)) with different maximum distance, Pearson correlation score, and Spearman correlation score. The denoised contact map achieves higher scores than the original SPRITE contact map for all metrics of similarity to the Hi-C contact map. In particular, the advantage is more pronounced with metrics that take into account long-range interactions (Figure 3F).
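The denoising step itself is a single element-wise product, sketched below with nested lists so that it has no dependencies. The function name and the toy values are hypothetical; in practice the probability map would come from the trained model scoring every pair of bins.

```python
def denoise_contact_map(contact, prob):
    """Element-wise product of the original contact map and the predicted probability map."""
    if len(contact) != len(prob) or any(len(c) != len(p) for c, p in zip(contact, prob)):
        raise ValueError("contact map and probability map must have the same shape")
    return [[c * p for c, p in zip(row_c, row_p)]
            for row_c, row_p in zip(contact, prob)]

# Toy 3x3 symmetric maps (made-up values).
contact = [[5, 2, 1],
           [2, 6, 3],
           [1, 3, 4]]
prob = [[1.0, 0.5, 0.0],
        [0.5, 1.0, 1.0],
        [0.0, 1.0, 1.0]]
denoised = denoise_contact_map(contact, prob)
```

Contacts with a low predicted probability are shrunk toward zero while contacts the model considers likely are left almost unchanged, which is consistent with the reduced off-diagonal noise described for Figure 3E.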

Next, we generated the denoised contact map for the ChIA-Drop data at 20kb resolution. Here we used the ChIA-Drop data without ChIP enrichment (data from Zheng et al. (2019)) to compare with the Hi-C data for Drosophila S2R+ (Szabo et al., 2018). Specifically, we compared the denoising results with a recently published ChIA-Drop denoising method, MIA-Sig (Kim et al., 2019). Figure 3G shows the heatmap comparison of the original ChIA-Drop contact map versus the denoised contact maps by MIA-Sig and MATCHA, respectively. Similar to what we observed from the SPRITE data, as compared to the original contact map, both MATCHA and MIA-Sig make the chromatin interaction features clearer. In particular, both MIA-Sig and MATCHA remove many long-range interactions present in the original ChIA-Drop data. However, the contact map produced by MATCHA yields even clearer patterns, especially for long-range contacts. We also used various metrics to assess the similarity between the contact map of Hi-C and the maps of denoised ChIA-Drop. As shown in Figure 3H, MATCHA consistently achieves higher similarity scores with Hi-C compared with the original contact map and the contact map produced by MIA-Sig. We also observed that, for the similarity scores that take long-range interactions into account (SCC-250, SCC-500), MIA-Sig performs even worse than the original ChIA-Drop contact matrix, whereas MATCHA consistently outperforms the original ChIA-Drop contact matrix.

Taken together, these results further demonstrate that, by more reliably identifying multi-way chromatin interactions, MATCHA improves the overall data quality of SPRITE and ChIA-Drop.

MATCHA can distinguish multi-way interactions from pairwise interaction cliques

The hyperedges defined from multi-way interaction data and the cliques (groups of nodes where all pairwise edges are present) defined from pairwise interaction data are drastically different. In particular, a hyperedge represents simultaneous interaction among chromatin loci in a single nucleus whereas a clique simply represents the presence of all pairwise interactions from a group of chromatin loci in a cell population. In other words, cliques do not have single-nucleus resolution. To test if MATCHA can effectively differentiate between hyperedges and cliques, we generated cliques based on Hi-C edges (referred to as Hi-C cliques) and required that the cliques satisfy the constraint of 1D genomic distance greater than 5Mb (same as the SPRITE data; see STAR Methods). We provided these cliques to the trained model and evaluated the results by comparing to SPRITE. To prevent the model from “cheating” by memorizing all training data to achieve better performance, we reduced the number of training samples from 80% to 20% of the hyperedges, which were excluded in the following evaluation. After training the model, we predicted the probability score for the Hi-C cliques and grouped them based on the score. As shown in Figure 4A, for Hi-C clique groups with higher probability scores, the percentage of cliques that are supported by SPRITE data is significantly higher than for the groups with a lower score. Moreover, we also observed that the cliques that occur 3–7 times in SPRITE data, which were not included in the training data, are more enriched in groups with higher probability scores as well. This analysis suggests the ability of MATCHA to distinguish potential multi-way interactions from cliques based on population Hi-C data.

Figure 4: MATCHA distinguishes multi-way interactions from pairwise interaction cliques.

(A) Distribution of the occurrence frequencies in the SPRITE data for the Hi-C cliques in each predicted probability group. (B) Overlapping between the Hi-C cliques with different predicted probability scores and the scHi-C cliques. The bar indicates the percentage of the total number of Hi-C cliques and scHi-C cliques in each group. (C) An example where the triplets predicted by MATCHA are not defined as positive samples based on occurrence frequency in SPRITE. The heatmap at the top shows the Hi-C O/E contact matrix. For the triplets track, each line represents a triplet either predicted by MATCHA or from the SPRITE data. Repli-seq signals as well as Hi-C subcompartments and super-enhancer annotations for GM12878 are also shown. Note that the left and middle anchor regions each have 5 genes that are regulated by the same 12 transcription factors, respectively. All 12 transcription factors are confirmed by ChIP-seq data to show that they bind to the upstream of transcriptional start sites (within 3kbp) of the genes from the left and middle anchor regions and also bind to the super-enhancer in the right anchor. A zoom-in view of the related histone modifications and transcription factor ChIP-seq signals are included to show one of the co-regulation examples (bottom panel). See also Figure S1.

We further assessed if these Hi-C cliques with high probability scores indeed interact in the same nuclei by comparing with single cell Hi-C (scHi-C) data (Ramani et al., 2017). For the contact matrix of each single cell, we created an edge for each non-zero entry and then collected all cliques that satisfy the 1D genomic distance constraint (referred to as scHi-C cliques). These scHi-C cliques indicate that all pairwise interactions between these three bins happen in the same nucleus. In Figure 4B, we found that of all the scHi-C cliques that overlap with the Hi-C cliques, about 70% are in the group with a probability score greater than 0.9, which makes up less than 20% of the Hi-C cliques. On the other hand, the Hi-C cliques with probability scores below 0.3 account for more than 70% of the Hi-C cliques but only overlap with less than 20% of the scHi-C cliques. These results suggest that among all the Hi-C cliques, those that received higher probability scores from MATCHA are enriched with multi-way interactions while the rest correspond to combinations of pairwise interactions within a cell population.
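Collecting size-3 cliques from a set of pairwise edges, and intersecting the population-level cliques with those found in one cell, can be sketched as follows. For simplicity the sketch assumes bins are integer indices on a single chromosome and interprets the 1D distance constraint as every pair of bins in the clique being more than min_separation bins apart; the exact constraint is the one given in STAR Methods, and the function name and toy edges are hypothetical.

```python
from itertools import combinations

def find_triangles(edges, min_separation=0):
    """Return all 3-cliques whose bins are pairwise more than `min_separation` bins apart."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-contacts
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    triangles = set()
    for a, b in edges:
        if a == b:
            continue
        # Any common neighbor of a and b closes a triangle.
        for c in neighbors[a] & neighbors[b]:
            tri = tuple(sorted((a, b, c)))
            if all(y - x > min_separation for x, y in combinations(tri, 2)):
                triangles.add(tri)
    return triangles

# Toy edges on one chromosome at 1Mb bins (made-up data).
population_edges = [(0, 10), (10, 20), (0, 20), (0, 3), (3, 10), (20, 30)]
single_cell_edges = [(0, 10), (10, 20), (0, 20)]

hic_cliques = find_triangles(population_edges, min_separation=5)
schic_cliques = find_triangles(single_cell_edges, min_separation=5)
shared = hic_cliques & schic_cliques
```

In the toy population graph the triangle (0, 3, 10) exists but is dropped because bins 0 and 3 are too close, so only (0, 10, 20) remains, and it is also found in the single cell. Repeating the single-cell step for every cell and grouping the shared cliques by predicted probability gives the kind of overlap summarized in Figure 4B.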

In Figure 4C, we show an example of a Hi-C clique that receives a high probability score from MATCHA and overlaps with the scHi-C clique but was not defined as a positive sample based on the occurrence frequency from SPRITE. The three regions predicted by MATCHA to form a triplet hyperedge are close to A1 and A2 active subcompartments (Rao et al., 2014). We refer to these three regions as the left, middle, and right anchor regions in Figure 4C. Based on the super-enhancer annotations for GM12878 (Hnisz et al., 2013), one super-enhancer is at the right anchor region, which also belongs to the A2 subcompartment. The Repli-seq data indicate that the left and the middle anchor regions and the bin that contains the super-enhancer from the right anchor region are early replicated, which suggests that these regions are located more towards the nuclear interior. By comparing the expressed genes of the left and middle regions with the transcriptional regulatory network (Marbach et al., 2016), we found that the expressed genes in these two regions share a number of regulating transcription factors (TFs). We further assessed the binding of TFs to the upstream region of the transcriptional start site (TSS; within 3kbp) and the super-enhancer region with the regulatory region dataset from ReMap 2020 (Chèneby et al., 2020) (which combines annotations from public sources such as ENCODE (Consortium et al., 2012) and the Roadmap Epigenomics Project (Bernstein et al., 2010)). We found that 5 genes in the left anchor region and 5 genes in the middle anchor region share 12 TFs in total. An illustration of the regulatory network of the genes and TFs is shown in Figure S1. We also provide a zoom-in view of the related ChIP-seq signals for one of the 16 TF-sharing gene pairs in Figure 4C. Specifically, gene DNAJC16 from the left anchor region and gene TTC4 from the right anchor region share 4 TFs: NR3C1, IKZF2, EGR1, and CTCF. ChIP-seq signals of NR3C1 are not available in ENCODE.
As shown in Figure 4C, all related TFs bind to the upstream region of the TSS and also have enriched peaks within the annotated super-enhancer. The ChIP-seq signals of H3K27ac and H3K4me1 further suggest that the annotated super-enhancer region indeed has enhancer-like histone profiles. Therefore, this potential interaction among a super-enhancer, transcriptional factors, and target genes may reflect a higher-order module of chromatin interaction and transcriptional regulation (Stadhouders et al., 2019; Tian et al., 2020). We further analyzed the triplets from SPRITE with relatively low occurrence frequencies (i.e., not positive samples) and found four (each observed only twice) that are close to this triplet (+/−2Mb for each anchor point), supporting the MATCHA prediction.

Collectively, these results demonstrate that the higher-order relationships in the multi-way chromatin interaction data are more than the combinations of pairwise interactions and that MATCHA is able to model and predict multi-way contacts properly, leading to potential new insights into the interplay between transcription regulation and 3D genome organization.

Embeddings produced by MATCHA reflect 3D genome function and spatial localization

To further demonstrate that MATCHA reliably extracts chromatin interaction patterns from the constructed hypergraph, we analyzed the embeddings produced by MATCHA trained from the SPRITE data. We first demonstrated the impact of the Mix-n-Match autoencoder (see STAR Methods) by replacing it with the standard paired autoencoder. The model reached similar performance in predicting hyperedges, as expected. We then visualized the learned embeddings by projecting them to two-dimensional space with PCA. Each data point in Figure 5A represents one 1Mb genomic bin with colors indicating the chromosome to which it belongs. We found that the embeddings of genomic bins based on the standard autoencoder form clusters according to the chromosome, making it impossible to make meaningful comparisons. On the other hand, we visualized the embeddings from the Mix-n-Match autoencoder with the same coloring scheme and found that bins from different chromosomes are more comparable (Figure S2B).

Figure 5: Evaluations for the learned embeddings of the genomic bins.

(A) Visualization of the embeddings without the Mix-n-Match autoencoder. The embeddings are projected to two dimensions by PCA. Data points (genomic bins) are colored based on the chromosome they are from. Without the Mix-n-Match setting, genomic bins from the same chromosome are clustered. (B) Correlation of the predicted Repli-seq value using the learned embedding vectors versus the true Repli-seq value. (C) Visualization of the embeddings with the Mix-n-Match autoencoder. Data points are colored based on Hi-C subcompartment annotation. See also Figure S2.

We then evaluated the embeddings from MATCHA with the Mix-n-Match autoencoder by predicting the DNA replication timing based on Repli-seq signals. We binned the two-fraction Repli-seq signals to 1Mb resolution and applied genome-wide z-score normalization. A linear regression model was trained with odd-numbered chromosomes and tested on the even-numbered chromosomes. The predicted value versus true signal value is shown in Figure 5B, where a strong correlation can be observed (Pearson correlation = 0.85), suggesting that the MATCHA embeddings reflect the replication timing program, which is a fundamentally important genome function. Because the embeddings are enriched with the information extracted from the multi-way chromatin interaction data, even simple models like linear regression can reach a high correlation score. We tried more complicated regression models such as random forest regression and found the correlation score was similar to that achieved by linear regression.

We further asked if these embeddings capture Hi-C subcompartments, which are spatial genome segregation patterns in the cell nucleus (Rao et al., 2014). Since the original subcompartment annotation is at 100Kb resolution, we converted it to 1Mb resolution by a “voting scheme”. Specifically, each 1Mb bin has 10 labels from the 100Kb subcompartment annotations. When more than half of the labels belong to the same group, that 1Mb bin is labeled as the corresponding subcompartment. Otherwise, the bin is removed from the subsequent analysis. We also excluded the very small subcompartment B4, which only exists on chromosome 19. More than 95% of the bins received subcompartment labels. We again projected the embeddings to two-dimensional space using PCA and visualized the projected embedding vectors with annotations of subcompartments (Figure 5C). We found that overall the genomic bins belonging to the same subcompartment are clearly clustered together. Since the subcompartment annotations reflect the spatial segregation pattern of the genome with a gradual change, as expected, there is no clear separation between clusters. We also visualized the embeddings from the standard autoencoder with Hi-C subcompartment annotations (Figure S2A) and did not observe a correlation with subcompartments. This again demonstrates that by including the Mix-n-Match autoencoder, the bins are no longer clustered into groups based on the chromosome that they belong to. Additionally, we quantitatively evaluated the consistency between the embedding vectors and the subcompartments by training a logistic regression model to classify the genomic bins based on the embedding vectors. We used half of the genomic bins as training data and made sure that the bins from the same chromosome appear either all in the training data or all in the testing data. For the testing set, a simple logistic regression model can make accurate predictions with a micro-F1 (accuracy) score of 0.82 and a macro-F1 score of 0.80.
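
The voting scheme above can be written in a few lines. The sketch below is an illustration under assumptions rather than the code used in the paper: the function name `vote_1mb_label`, the use of `None` for unlabeled sub-bins, and the handling of B4 through an `excluded` argument are all hypothetical choices.

```python
from collections import Counter

def vote_1mb_label(labels_100kb, excluded=("B4",)):
    """Majority vote over the 100Kb labels that fall in one 1Mb bin.

    Returns the winning label, or None when no label covers more than
    half of the sub-bins or when the winner is an excluded label.
    """
    counts = Counter(label for label in labels_100kb if label is not None)
    if not counts:
        return None
    label, n = counts.most_common(1)[0]
    # "More than half" is a strict majority of all sub-bins in the 1Mb bin.
    if 2 * n <= len(labels_100kb) or label in excluded:
        return None
    return label

print(vote_1mb_label(["A1"] * 6 + ["A2"] * 4))  # A1
print(vote_1mb_label(["A1"] * 5 + ["B1"] * 5))  # None
```

Bins for which the function returns `None` correspond to the bins that are dropped before visualization and classification.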

These results demonstrate that the Mix-n-Match autoencoder scheme in MATCHA is effective in forming genome-wide embeddings, which successfully capture genome structure and function based on multi-way interaction data.

DISCUSSION

Recent advances of ligation-free, genome-wide chromatin interaction mapping methods such as SPRITE (Quinodoz et al., 2018) and ChIA-Drop (Zheng et al., 2019) provide new perspectives on 3D genome and function by revealing multi-way contacts within the same nuclei (Kempfer and Pombo, 2019). However, computational methods that can fully utilize the potential of such multi-way chromatin interaction data remain underdeveloped. In this work, we developed MATCHA, a new multi-way chromatin interaction analysis framework based on hypergraph representation learning. We demonstrated that MATCHA can effectively extract multi-way chromatin interactions. Specifically, the method is able to make accurate predictions for multi-way interacting genomic loci to denoise the original data. We also showed the effectiveness of MATCHA by comparing with additional datasets such as Hi-C and ChIA-PET as well as its potential of identifying new multi-way interactions missed by the original data.

MATCHA has several algorithmic novelties: (1) To our knowledge, this is the first method that analyzes multi-way chromatin interaction data based on hypergraph representation learning. MATCHA has strong promise in capturing the embeddings of multi-way interactions and can also be used to denoise input data. (2) We incorporated our recently developed Hyper-SAGNN (Zhang et al., 2020), a hypergraph representation learning paradigm, into MATCHA. Specifically, we designed a novel feature generation method and a biologically motivated negative sampling approach to make the model better suited for multi-way chromatin interactions. (3) We also enhanced the scalability of MATCHA with an efficient Bloom filter data structure that allows accurate and efficient negative sampling.

MATCHA can be further improved to better characterize higher-order interactions among different components in the nucleus. Although we mainly focused on analyzing the multi-way interaction among different genomic loci (i.e. a homogeneous hypergraph as the nodes are all genomic bins), MATCHA can be extended to incorporate other constituents in the cell nucleus in addition to chromatin (e.g., proteins and RNAs) as a heterogeneous hypergraph based on emerging datasets such as RNA-DNA SPRITE (Quinodoz et al., 2018). Our hypergraph representation learning method Hyper-SAGNN can effectively learn the embeddings for heterogeneous hypergraphs (Zhang et al., 2020). In addition, we chose to train MATCHA using multi-way interaction data only in this work, but the model can easily include other functional genomic signals as features of the corresponding genomic bins. This would in principle extend the existing work on predicting pairwise chromatin loops based on functional genomic data (Huang et al., 2015; Whalen et al., 2016; Zhang et al., 2018; Kai et al., 2018; Zhang et al., 2019). Finally, the multi-way chromatin interactions extracted by MATCHA could also be the foundation to connect transcriptional regulation and 3D genome organization (Stadhouders et al., 2019; Kim and Shendure, 2019; Tian et al., 2020). Indeed, data from GAM (Beagrie et al., 2017) and SPRITE (Quinodoz et al., 2018) revealed the abundance of three-way interactions involving super-enhancer and active genes. Promoters may also act as enhancers to regulate other genes (Li et al., 2012; Engreitz et al., 2016). The recent data based on ChIA-Drop (Zheng et al., 2019) also uncovered three-way interactions among promoters that have imbalanced gene expression levels in which the promoters with lower transcription level might act as enhancers. These observations further suggest the importance of investigating chromatin interactions in a non-pairwise manner. 
Taken together, we believe that MATCHA provides an effective algorithmic framework for the modeling and analysis of multi-way chromatin interaction data with the potential to advance our understanding of the nuclear organization.

STAR METHODS

RESOURCE AVAILABILITY

Lead Contact:

Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, Jian Ma (jianma@cs.cmu.edu).

Materials Availability:

This study did not generate new materials or reagents.

Data and Code Availability:

The source code of MATCHA can be accessed at: https://github.com/ma-compbio/MATCHA. This study did not generate new datasets.

METHOD DETAILS

Definitions of hypergraph and the hyperedge prediction problem

Hypergraph

A hypergraph is defined as $G=(V,E)$, where $V=\{n_1,\ldots,n_N\}$ represents the set of nodes in the graph, and $E=\{e_i=(n_1^{(i)},\ldots,n_k^{(i)})\}$ represents the set of hyperedges. Any hyperedge $e$ connects two or more nodes ($|e|\ge 2$). If all the hyperedges in a hypergraph contain the exact same number of nodes, i.e., $|e_i|=k,\ \forall e_i\in E$, it is referred to as a $k$-uniform hypergraph.
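
As a small illustration of this definition, a hypergraph can be stored as a node set plus a collection of node subsets. The representation and the helper name `is_k_uniform` are assumptions made for this sketch only.

```python
def is_k_uniform(hyperedges, k):
    """True when every hyperedge contains exactly k nodes."""
    return all(len(e) == k for e in hyperedges)

nodes = set(range(6))
hyperedges = [frozenset({0, 2, 5}), frozenset({1, 3, 4}), frozenset({0, 1, 2, 3})]

# Every hyperedge connects at least two nodes drawn from V.
assert all(len(e) >= 2 and e <= nodes for e in hyperedges)

print(is_k_uniform(hyperedges, 3))      # False
print(is_k_uniform(hyperedges[:2], 3))  # True
```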

The hyperedge prediction problem

The hyperedge prediction problem aims to learn a function $p$ that can predict the probability of a group of nodes $(n_1,n_2,\ldots,n_k)$ forming a hyperedge (Zhang et al., 2020):

$$p(n_1,n_2,\ldots,n_k)=\begin{cases}\ge s, & \text{if } (n_1,n_2,\ldots,n_k)\in E\\ < s, & \text{if } (n_1,n_2,\ldots,n_k)\notin E\end{cases} \tag{1}$$

where $s$, typically chosen as 0.5, is the threshold to binarize the continuous probability score into a label indicating the existence of the corresponding hyperedge. As $n_i$ is just the id for the corresponding node, to make the problem numerically tractable, it is natural to assume that the features of nodes $X=\{x_1,\ldots,x_k\}$ are known. This makes it possible to rewrite the function as:

$$p(n_1,n_2,\ldots,n_k)\triangleq p(f(x_1),f(x_2),\ldots,f(x_k)) \tag{2}$$

where the transformation of features $f(x_i)$ can be considered as the embedding vector for the node $n_i$.

Data processing

The GM12878 DNA SPRITE cluster files on hg38 (Quinodoz et al., 2018) were downloaded from the 4DN data portal: https://data.4dnucleome.org. We downloaded the processed RNAPII enriched ChIA-Drop data and ChIA-Drop data of Drosophila S2 cell from GSE109355 (Zheng et al., 2019). When creating the contact matrix for the SPRITE data, we used the same procedure in Quinodoz et al. (2018) for balancing the weight of each pairwise interaction with the size of the original SPRITE cluster. The contact matrix was further normalized by matrix balancing. For the ChIA-Drop data, we did not perform further normalization after decomposition.

The GM12878 in-situ Hi-C data on hg38 were downloaded from the 4DN data portal. We used KR matrix balancing for normalization. The single-cell Hi-C (scHi-C) data of GM12878 were downloaded from GSE84920 (Ramani et al., 2017). For the scHi-C data, the aligned paired-end reads were converted to hg38 using UCSC liftOver (Hinrichs et al., 2006). The alignment was then binned to produce the contact matrix for every single cell. To reduce the sparsity of the contact matrices, we applied linear convolution to each matrix with a window size of 1 and step size of 1. Two-fraction Repli-seq data were also obtained from the 4DN data portal. The processed in-situ Hi-C data for Drosophila S2R+ cells were downloaded from GSE99104 (Szabo et al., 2018).

Constructing hypergraphs based on multi-way chromatin interaction data

For the SPRITE data, nodes in the graph represent non-overlapping 1Mb genomic bins (bins at the centromere are discarded). When certain genomic bins share multiple SPRITE clusters, we connect the corresponding nodes with a hyperedge. Note that we chose to focus on hyperedges of size 3 to 5 because they are more abundant in the dataset and thus provide enough samples for model training. Specifically, we decomposed SPRITE clusters with size less than or equal to 25 into subsets of the corresponding size and counted the frequency of each combination of genomic bins (referred to as k-mers in the SPRITE paper (Quinodoz et al., 2018)). To reduce the number of k-mers to be considered, we focused on relatively distal interactions by requiring the minimum intra-chromosomal genomic distance to be > 5Mb. We further used different occurrence frequency cut-offs for hyperedges of different sizes to balance the number of hyperedges across sizes. Note that in this work we did not include SPRITE clusters with size larger than 25, in order to reduce the processing time for decomposing clusters into hyperedges.
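
The decomposition of clusters into k-mers can be sketched as below. This is a simplified illustration under assumptions: a cluster is taken to be a list of `(chromosome, bin index)` pairs at 1Mb resolution, cluster size is measured after removing duplicate bins, and a k-mer becomes a hyperedge when its count reaches a `cutoff` value. The names `count_kmers` and `select_hyperedges` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_kmers(clusters, k, max_cluster_size=25, min_dist=5):
    """Count every k-bin combination that co-occurs within a cluster.

    A combination is kept only if every same-chromosome pair in it is
    more than `min_dist` bins apart (the distal-interaction constraint).
    """
    counts = Counter()
    for cluster in clusters:
        bins = sorted(set(cluster))
        if len(bins) > max_cluster_size:
            continue  # large clusters are skipped to limit run time
        for combo in combinations(bins, k):
            distal = all(
                a[0] != b[0] or abs(a[1] - b[1]) > min_dist
                for a, b in combinations(combo, 2)
            )
            if distal:
                counts[combo] += 1
    return counts

def select_hyperedges(counts, cutoff):
    """Keep k-mers observed at least `cutoff` times (assumed rule)."""
    return {combo for combo, n in counts.items() if n >= cutoff}

clusters = [
    [("chr1", 0), ("chr1", 10), ("chr2", 3)],
    [("chr1", 0), ("chr1", 10), ("chr2", 3), ("chr1", 12)],
    [("chr1", 0), ("chr1", 2), ("chr2", 3)],
]
counts = count_kmers(clusters, k=3)
print(select_hyperedges(counts, cutoff=2))
```

Because the number of combinations grows quickly with cluster size, the size limit of 25 keeps the enumeration tractable, which matches the motivation given in the text.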

For the RNAPII ChIA-Drop data, a similar data processing procedure was applied at a resolution of 5Kb to generate hyperedges of size 3 to 5. Compared to SPRITE, k-mers in ChIA-Drop usually have smaller occurrence frequencies; we therefore chose to use a uniform cut-off of 2 to define hyperedges and did not constrain the genomic distance.

Labeled data generation

The Hyper-SAGNN model in MATCHA (see later section) was trained to be a binary classifier for the existence of the hyperedge among a group of nodes (Figure 2A). The positive samples are the observed hyperedges in the constructed hypergraph while the negative samples are groups of nodes unobserved as the hyperedges. We can in principle generate random combinations of genomic bins and consider them as negative samples; however, this would oversimplify the prediction task because most of the randomly generated negative samples can be identified by simple metrics, e.g., whether it contains inter-chromosomal interactions or the 1D genomic distance. We therefore designed the following new procedure to generate negative samples by changing a small fraction of the nodes in the positive samples. For each positive sample of size k, the number of nodes to be altered n is sampled from a zero truncated binomial distribution with parameter k and a hyperparameter p, which is equivalent to altering each node in the positive samples independently with probability p while making sure at least one node would change, i.e.,

$$n\sim\mathrm{ZeroTruncatedBinomial}(k,p) \tag{3}$$
$$P(n=x)=\frac{\binom{k}{x}p^{x}(1-p)^{k-x}}{1-(1-p)^{k}} \tag{4}$$

A smaller p leads to a smaller average difference between positive and negative samples, producing more difficult negative samples. Here we chose p = 0.5, which leads to an expected difference of 1.7 to 2.5 nodes between positive and negative samples for hyperedges of size 3 to 5. We then randomly selected n nodes and changed each of them. We required that the changed node is from the same chromosome to make the problem harder. We further made sure that this sample satisfies the genomic distance constraint. Specifically, for SPRITE, the minimum intra-chromosomal 1D distance within a group of nodes should be larger than 5 bins (same as the positive samples). For ChIA-Drop, we ensured that the changed node is within ± 20 bins to make the distance profiles of positive and negative samples similar to each other. Note that we did not use the genomic distances as features in MATCHA, but this feature can be easily incorporated into our model to potentially further improve performance. Finally, we assessed if this modified sample happens to be the same as any positive sample. If that happened, we repeated the above process until we obtained a negative sample satisfying all conditions (see an efficient approach we implemented for this work in a later section). For each batch of positive samples, we generated 3 times as many negative samples. The negative samples were generated dynamically for each training and evaluation epoch instead of being generated beforehand. This approach allows a more accurate characterization of the negative samples and prevents potential over-fitting.
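
The sampling procedure for the SPRITE setting can be sketched as follows. This is a minimal illustration under assumptions, not the released implementation: nodes are `(chromosome, bin index)` pairs, replacement bins are drawn uniformly from the same chromosome, `known` is any container supporting `in` that holds the positive samples as sorted tuples, and all function names are hypothetical. The zero-truncated binomial of Equations 3 and 4 is drawn by rejection, and `expected_num_altered` evaluates its mean $kp/(1-(1-p)^k)$.

```python
import random
from itertools import combinations

def sample_num_altered(k, p, rng):
    """Draw n from a zero-truncated Binomial(k, p) by rejecting n = 0."""
    while True:
        n = sum(rng.random() < p for _ in range(k))
        if n > 0:
            return n

def expected_num_altered(k, p):
    """Mean of the zero-truncated binomial distribution."""
    return k * p / (1.0 - (1.0 - p) ** k)

def far_enough(nodes, min_dist):
    """Every same-chromosome pair must be more than min_dist bins apart."""
    return all(a[0] != b[0] or abs(a[1] - b[1]) > min_dist
               for a, b in combinations(nodes, 2))

def negative_sample(positive, bins_per_chrom, known, rng,
                    p=0.5, min_dist=5, max_tries=1000):
    """Alter n nodes of a positive sample, keeping each on its chromosome."""
    k = len(positive)
    for _ in range(max_tries):
        nodes = list(positive)
        for idx in rng.sample(range(k), sample_num_altered(k, p, rng)):
            chrom = nodes[idx][0]
            nodes[idx] = (chrom, rng.randrange(bins_per_chrom[chrom]))
        candidate = tuple(sorted(nodes))
        # Reject candidates that break the distance constraint or that
        # coincide with a known positive (including the original sample).
        if far_enough(candidate, min_dist) and candidate not in known:
            return candidate
    raise RuntimeError("no valid negative sample found")

rng = random.Random(0)
positive = (("chr1", 0), ("chr1", 10), ("chr1", 20))
negative = negative_sample(positive, {"chr1": 50}, {positive}, rng)
print(negative, expected_num_altered(3, 0.5))
```

Since the `known` argument only needs membership queries, the same function works with a plain set or with a probabilistic structure such as a Bloom filter.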

A new Mix-n-Match autoencoder for node feature generation

Here we describe our approach to generate the features for the nodes in the hypergraph, i.e., $X=\{x_1,\ldots,x_N\}$. Although we can use functional genomic signals on the genomic bins (such as ChIP-seq for histone marks or transcription factors) as the node features, to demonstrate the generalization ability of this approach to cell types where these signals are inadequate, we developed a framework that generates features based on the hypergraph only (Figure 2A).

We first decompose each hyperedge in the training data into pairwise interactions and create a corresponding adjacency matrix $A$. The $i$-th row of $A$, denoted by $a_i$, shows the neighborhood structure of the node $n_i$, which then passes through an autoencoder-like neural network to produce $x_i=\mathrm{Encoder}(a_i)$. A decoder with symmetric structure is applied to reconstruct $a_i$ from $x_i$. The corresponding mean-squared reconstruction error is added to the final loss as a regularization term. The same strategy has been used in previous graph/hypergraph representation learning methods (Tu et al., 2018; Wang et al., 2016) and also in our recent work Hyper-SAGNN (Zhang et al., 2020), i.e.,

$$L_{\mathrm{recon}}=\frac{1}{N}\sum_{i=1}^{N}\left\lVert \mathrm{Decoder}(\mathrm{Encoder}(a_i))-a_i\right\rVert^{2} \tag{5}$$

Note that, although this approach decomposes each hyperedge into pairwise interactions, the contact matrix is passed through the encoder which makes non-linear transformation of the input to be used as the node features for the predictions of higher order interactions. This significantly differs from the earlier work that decomposes the hyperedges into a contact matrix and studies the contact matrix directly. However, for networks constructed by chromatin interaction data, this encoder based approach does have a shortcoming. Data including Hi-C and SPRITE that have both intra-chromosomal and inter-chromosomal interactions usually contain more intra-chromosomal interactions. In other words, for each row in the genome-wide contact matrix, a small fraction of the columns (intra) receive more weight while a large fraction of the columns (inter) are much sparser and noisier. If we use the corresponding row in the genome-wide contact matrix as the feature, nodes from the same chromosome would have more similar features as compared to nodes from different chromosomes. Although this would not have a negative impact on hyperedge prediction, as nodes from different chromosomes indeed have very different spatial neighborhood structure in the graph, it may not be appropriate for analyzing the embedding vectors, as the model learned for one chromosome cannot generalize to the other ones.

We therefore designed a new method called “Mix-n-Match autoencoder”. A similar structure has been proposed in the computer vision field for image translation (Wang et al., 2018). For a genome with $n$ chromosomes, we denote $C_{ij}$ as the part of the genome-wide contact matrix corresponding to interactions between chromosomes $i$ and $j$, and $N_i$ as the number of bins in chromosome $i$. We train $n$ encoders and $n$ decoders, where the $i$-th encoder $\mathrm{Encoder}_i$ takes a vector of size $N_i$ and produces a hidden vector of size $d_h$. $\mathrm{Decoder}_i$ works the other way around, taking an input of size $d_h$. However, instead of making the encoder and decoder work in a paired manner (as described above), they are randomly paired (as the name ‘Mix-n-Match’ suggests). Specifically, for a node $k$ from chromosome $i$, the $k$-th row in the intra-chromosomal contact matrix $C_{ii}$ (denoted as $C_{ii}(k)$) is taken as the input for $\mathrm{Encoder}_i$ to produce the feature $x_k$. Then a random chromosome $j$, $j\ne i$, is selected with the corresponding decoder $\mathrm{Decoder}_j$ to reconstruct the $k$-th row in the inter-chromosomal contact matrix $C_{ij}$ from the input $x_k$. The new reconstruction loss for node $k$ is therefore defined as:

$$L_{\mathrm{recon}}(k\mid i,j)=\left\lVert \mathrm{Decoder}_j(\mathrm{Encoder}_i(C_{ii}(k)))-C_{ij}(k)\right\rVert_2^2 \tag{6}$$

By adding this reconstruction loss to the final loss term, the model makes the embeddings for nodes from different chromosomes more comparable, as the same $\mathrm{Decoder}_j$ is applied to nodes from all the other chromosomes to reconstruct inter-chromosomal interactions with chromosome $j$.
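
The random pairing in Equation 6 can be illustrated with a toy forward pass. The sketch below makes strong simplifying assumptions: each encoder and decoder is a single linear map stored as a list of rows (the paper uses multi-layer networks), the contact blocks are random toy values, and the names `linear`, `random_matrix`, and `mix_n_match_loss` are hypothetical.

```python
import random

def linear(weights, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def random_matrix(n_rows, n_cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_cols)]
            for _ in range(n_rows)]

def mix_n_match_loss(k, i, contact, encoders, decoders, rng):
    """Squared error of Equation 6 for node k on chromosome i.

    contact[(i, j)] holds the block C_ij as a list of rows.
    Returns the loss and the randomly chosen chromosome j.
    """
    x_k = linear(encoders[i], contact[(i, i)][k])            # Encoder_i(C_ii(k))
    j = rng.choice([c for c in sorted(decoders) if c != i])  # random j != i
    recon = linear(decoders[j], x_k)                         # Decoder_j(x_k)
    target = contact[(i, j)][k]                              # C_ij(k)
    return sum((r - t) ** 2 for r, t in zip(recon, target)), j

rng = random.Random(0)
n_bins = {0: 4, 1: 3, 2: 5}   # N_i for three toy chromosomes
d_h = 2                       # hidden size
contact = {(i, j): random_matrix(n_bins[i], n_bins[j], rng)
           for i in n_bins for j in n_bins}
encoders = {i: random_matrix(d_h, n_bins[i], rng) for i in n_bins}
decoders = {j: random_matrix(n_bins[j], d_h, rng) for j in n_bins}

loss, j = mix_n_match_loss(1, 0, contact, encoders, decoders, rng)
print(j, loss)
```

Because every $\mathrm{Decoder}_j$ receives hidden vectors produced by encoders of all other chromosomes, the hidden spaces are pushed towards a shared coordinate system, which is the effect described in the text.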

The Hyper-SAGNN architecture for hypergraph representation learning

The detailed description of this part of the method can be found in our recent work Zhang et al. (2020). The structure of the neural network for Hyper-SAGNN is shown in Figure 2B. The input to the model can be represented as tuples, i.e., $(x_1,x_2,\ldots,x_k)$. Each tuple first passes through a position-wise feed-forward network to produce $(s_1,s_2,\ldots,s_k)$, where $s_i=f(x_i)$ and $f$ represents the transformation applied by the neural network to $x_i$. We refer to each $s_i$ as the static embedding for node $i$ since it remains the same for node $i$ independent of the given tuple. The input also passes through another transformation to produce a new set of node embedding vectors $(d_1,d_2,\ldots,d_k)$, where $d_i=g(x_i\mid(x_1,x_2,\ldots,x_k))$. We refer to each $d_i$ as the dynamic embedding because it depends on all the node features within this tuple. The transformation to produce dynamic embeddings is the multi-head self-attention layer (Vaswani et al., 2017) (see below).

Given a group of nodes $(x_1,x_2,\ldots,x_k)$ and weight matrices $W_Q, W_K, W_V$ to be trained that represent the linear transformation of features before applying the scaled dot-product attention (Vaswani et al., 2017), the attention coefficients that indicate the pairwise importance of nodes are computed. These coefficients are then normalized through softmax to produce the final pairwise importance score:

$$e_{ij}=(W_Q^{T}x_i)^{T}(W_K^{T}x_j),\quad 1\le i,j\le k \tag{7}$$
$$\alpha_{ij}=\frac{\exp(e_{ij})}{\sum_{1\le l\le k}\exp(e_{il})} \tag{8}$$

The dynamic embeddings are defined as the weighted sum of linearly transformed features with a non-linear activation function:

$$d_i=\tanh\left(\sum_{1\le j\le k,\ j\ne i}\alpha_{ij}W_V^{T}x_j\right) \tag{9}$$

We further calculate the Hadamard power (element-wise power) of the difference of the corresponding static/dynamic pair for each node, which is subsequently passed through a one-layer neural network with sigmoid as the activation function to produce a probability score $p_i$. Finally, all the outputs $p_i\in[0,1]$ are averaged to produce the final result $p$, i.e.,

$$p=\frac{1}{k}\sum_{i=1}^{k}p_i=\frac{1}{k}\sum_{i=1}^{k}\sigma\left(W_o^{T}\left((d_i-s_i)^{\circ 2}\right)+b\right) \tag{10}$$

The Hyper-SAGNN model is trained to minimize the binary cross-entropy loss. The training procedure is terminated when it reaches a predefined number of training epochs or when the performance stops improving on a separate validation set.
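
A forward pass through Equations 7 to 10 can be sketched with plain Python lists. This is a simplified illustration, not the Hyper-SAGNN implementation: it uses a single attention head, takes the static embedding to be $\tanh(W_S^{T}x_i)$ (the weight matrix `WS` and this choice of $f$ are assumptions), and fills all weights with random toy values.

```python
import math
import random

def matvec_t(W, x):
    """Compute W^T x for W stored as a list of rows (d_in x d_out)."""
    return [sum(W[r][c] * x[r] for r in range(len(x)))
            for c in range(len(W[0]))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hyperedge_score(xs, WQ, WK, WV, WS, w_o, b):
    """Probability that the nodes with features xs form a hyperedge."""
    k = len(xs)
    q = [matvec_t(WQ, x) for x in xs]
    key = [matvec_t(WK, x) for x in xs]
    val = [matvec_t(WV, x) for x in xs]
    static = [[math.tanh(t) for t in matvec_t(WS, x)] for x in xs]
    probs = []
    for i in range(k):
        e = [dot(q[i], key[j]) for j in range(k)]           # Eq. 7
        shift = max(e)                                      # numerical stability
        z = [math.exp(t - shift) for t in e]
        alpha = [t / sum(z) for t in z]                     # Eq. 8
        dyn = [math.tanh(sum(alpha[j] * val[j][c]
                             for j in range(k) if j != i))
               for c in range(len(val[0]))]                 # Eq. 9
        sq = [(d - s) ** 2 for d, s in zip(dyn, static[i])]  # Hadamard power
        probs.append(1.0 / (1.0 + math.exp(-(dot(w_o, sq) + b))))
    return sum(probs) / k                                   # Eq. 10

rng = random.Random(1)
d_in, d_out = 4, 3

def rand_mat(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

WQ, WK, WV, WS = (rand_mat(d_in, d_out) for _ in range(4))
w_o = [rng.uniform(-1, 1) for _ in range(d_out)]
xs = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(3)]
print(hyperedge_score(xs, WQ, WK, WV, WS, w_o, b=0.0))
```

One property worth noting is that the score does not depend on the order of the nodes in the tuple, since both the attention sums and the final average run over the whole group.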

Optimizing memory consumption via probabilistic data structure

One important practical issue in the negative sampling process (mentioned above) is that we ensure the generated negative samples do not overlap with known positive samples. Although one might argue that this process is unnecessary, as the probability for a randomly generated negative sample being the same as a positive sample is small, this probability can increase greatly due to our stringent negative sampling strategy. Indeed, we found that the average number of trials for generating a non-overlapping negative sample is larger than 2.5 with the maximum trial number being more than 300. A similar technique has been developed in previous hypergraph representation learning methods (Tu et al., 2018) where a dictionary is maintained to keep the record of all positive samples. However, maintaining a dictionary in memory consumes enormous resources and significantly increases runtime for building it. This problem becomes even more significant when the number of hyperedges exceeds 2 million for SPRITE (as compared to 100K in the datasets that previous methods studied (Tu et al., 2018)).

To reduce the memory consumption while maintaining an acceptable query runtime, we utilized the Bloom filter (Bloom, 1970) to keep track of the observed hyperedges. The Bloom filter is a type of data structure that returns whether an element is a member of a set and has several advantages. First, it is highly memory-efficient at the cost of producing potential false positives. However, this will have little impact for our purpose if we assume the false positive samples from the Bloom filter are distributed relatively evenly in the negative samples. Here, we control the error rate to be less than $10^{-3}$ by setting the number of hash functions and the size of the bit array. Second, it has constant runtime for the adding operation, which leads to a shorter wait time before the actual training process. In the extreme case, when it is not possible to maintain the Bloom filter in memory, we use the memory-mapped Bloom filter, which allows us to keep the data structure on the hard disk and query it there (Debnath et al., 2011). Finally, the query process is still efficient as compared to searching algorithms over the training data.

In our implementation, before the labeled data generation step, we built a Bloom filter that stores all the positive samples and “potential” hyperedges. For SPRITE, these are k-mers that have occurrence frequency greater than 2 but smaller than the chosen cut-off. For ChIA-Drop, these are k-mers that have occurrence frequency equal to 1. These samples cannot be classified into either positive or negative samples based on the current data, so these are excluded in the performance evaluation. By incorporating the Bloom filter into the implementation, our method is able to deal with large datasets or hyperedges of larger size efficiently. This would greatly enhance the scalability in practice.
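
To make the data structure concrete, the following is a minimal in-memory Bloom filter written with the Python standard library. It is a sketch under assumptions and not the implementation used in MATCHA: the class name, the SHA-256 based double hashing, and the use of `repr` to serialize hyperedges are illustrative choices. The bit-array size $m$ and the number of hash functions are set with the standard formulas $m=-n\ln\varepsilon/(\ln 2)^2$ and $(m/n)\ln 2$ for $n$ expected items and target error rate $\varepsilon$.

```python
import hashlib
import math

class BloomFilter:
    """Set membership with false positives but no false negatives."""

    def __init__(self, capacity, error_rate=1e-3):
        # Standard sizing for `capacity` items at the target error rate.
        self.m = max(1, math.ceil(-capacity * math.log(error_rate)
                                  / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(repr(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))

seen = BloomFilter(capacity=1000, error_rate=1e-3)
for hyperedge in [(3, 17, 42), (5, 9, 120, 300), (8, 60, 75)]:
    seen.add(tuple(sorted(hyperedge)))

print((3, 17, 42) in seen)   # True
print(len(seen.bits), seen.k)
```

Storing hyperedges as sorted tuples makes the lookup independent of node order. A candidate negative sample is rejected whenever `candidate in seen` is true, so an occasional false positive only discards a valid negative sample and triggers a resample.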

Parameter setting of MATCHA

Here we discuss the selection of parameters used in MATCHA. These would include the cut-off for hyperedges to define the positive samples, the structure of the Mix-n-Match autoencoder, and the parameters for Hyper-SAGNN.

In general, a higher occurrence frequency of a multi-way interaction should indicate a higher probability of it being a true interaction, while a lower occurrence frequency would increase its probability of being a false positive. In this work, we chose the cut-offs as mentioned in the main text to: (1) make sure there are enough hyperedges left for the training of Hyper-SAGNN; (2) make sure the selected hyperedges are confident enough to be used as positive training samples; (3) balance the number of hyperedges across different sizes. However, we show in a later section that when different cut-offs are used, MATCHA is still able to identify reliable hyperedges.

For the structure of the Mix-n-Match autoencoders, we make each encoder consist of 2 hidden layers each with 64 neurons. Each decoder has a symmetric structure to the corresponding encoder. For the standard autoencoder, the structure is chosen to be the same as the Mix-n-Match autoencoder.

For the Hyper-SAGNN, the size of embedding is set to be 64 (the same as the dimension of the output from Mix-n-Match autoencoder). The number of heads in the multi-head attention layer is set to be 8. In Zhang et al. (2020), we discussed two potential variants of Hyper-SAGNN with slightly different structures and demonstrated that the current framework of Hyper-SAGNN achieved higher performance or converged faster than the two potential variants over multiple tasks on multiple benchmark datasets.

For the training of Hyper-SAGNN, we used the Adam optimizer with learning rate 1e-3. Each training batch contains 96 positive samples, with 3 times as many negative samples. We terminated the training procedure when it reached the maximum number of epochs (50) or when the performance on a separate validation set no longer improved.

We did all the evaluations on an 8-core machine with 1 NVIDIA GeForce 1080 GPU card. The running time for constructing hypergraphs is within 1 hour for the SPRITE data and a few minutes for the ChIA-Drop data. The time used to build the Bloom filter that keeps track of all positive samples is within a few minutes. The negative sample generation happens during the training process of Hyper-SAGNN, which for both the SPRITE and ChIA-Drop data can be finished within 2 hours.

QUANTIFICATION AND STATISTICAL ANALYSIS

MATCHA is robust to the cut-off for defining positive samples

We performed additional analysis to evaluate our previous claim that the lower prediction performance for hyperedges of larger size in SPRITE is due to the use of a lower occurrence frequency cut-off. To be specific, we group hyperedges by their occurrence frequency into 4 categories: 2, 3–4, 5–7 and ≥ 8 (referred to as group I, II, III, IV, respectively). We train MATCHA and evaluate the hyperedge prediction performance on each group, respectively. The training and testing samples are non-overlapping to avoid potential bias. Within each occurrence frequency group, the number of hyperedges is dominated by those of smaller size. For instance, within group I, there would be about 90 times more hyperedges of size 3 than hyperedges of size 5. The model may not perform well on larger hyperedges mainly because, during most training epochs, the training samples do not contain larger hyperedges. To resolve this issue, for each epoch, we dynamically balance the number of hyperedges of different sizes by sampling an equal number of hyperedges of each size.
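
The per-epoch balancing can be sketched as below. This is an illustration under assumptions: the function name `balanced_epoch` and the `per_size` parameter are hypothetical, and sizes with too few hyperedges are oversampled with replacement, which the text does not specify.

```python
import random
from collections import Counter, defaultdict

def balanced_epoch(hyperedges, per_size, rng):
    """Draw `per_size` hyperedges of every size for one training epoch."""
    by_size = defaultdict(list)
    for edge in hyperedges:
        by_size[len(edge)].append(edge)
    batch = []
    for size in sorted(by_size):
        pool = by_size[size]
        if len(pool) >= per_size:
            batch.extend(rng.sample(pool, per_size))
        else:
            # Assumed fallback: oversample rare sizes with replacement.
            batch.extend(rng.choices(pool, k=per_size))
    rng.shuffle(batch)
    return batch

rng = random.Random(0)
edges = [(i, i + 10, i + 20) for i in range(9)]           # nine size-3 edges
edges += [(0, 10, 20, 30, 40), (1, 11, 21, 31, 41)]       # two size-5 edges
epoch = balanced_epoch(edges, per_size=4, rng=rng)
print(Counter(len(e) for e in epoch))
```

Resampling at every epoch means that the abundant small hyperedges are still all seen over the course of training, while each individual epoch remains balanced.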

As shown in Figure S3, for each heatmap, the rows that are closer to the bottom usually have higher AUROC and AUPR scores. This indicates that hyperedges with higher occurrence frequencies consistently have higher testing AUROC and AUPR scores across different hyperedge sizes and different groups of training samples. It confirms our observation of the relationship between the prediction performance and the occurrence frequency cut-off. It also suggests that hyperedges with lower occurrence frequencies are more likely to be false positives, i.e., the corresponding groups of data contain more label noise. However, even when using data with label noise to train the model, MATCHA can still achieve good prediction performance on hyperedges with higher occurrence frequencies (i.e., hyperedges that are likely to be real interactions). Specifically, when using hyperedges from group I as training data, the model can still make accurate predictions of hyperedges from group IV with AUROC higher than 0.9 across all sizes. This demonstrates the robustness of MATCHA to the cut-off values used for defining hyperedges.

In addition, we observed that the best prediction performance on a certain group is often achieved when using that specific group as training data (Figure S3). This could be related to the negative correlation between interaction frequencies and genomic distance. Similar to pairwise interactions, multi-way interactions with larger genomic distances occur less frequently. When training the model only with hyperedges that have extremely large occurrence frequencies, the model may have a bias against hyperedges with larger genomic distances. It is worth noting that we require both training and testing samples to have a genomic distance larger than 5 bins. Thus, even when the model may be affected by this potential bias, MATCHA still makes meaningful predictions.

Taken together, these results demonstrate that MATCHA is robust to the cut-off values chosen for defining hyperedges. Even when noisy data are used as training samples, it can still make accurate predictions on hyperedges with higher occurrence frequencies. Including hyperedges from lower occurrence frequency groups could improve the prediction performance of the model on hyperedges from those groups. However, this would lead to an imbalance among hyperedges of different sizes and from different occurrence frequency groups, which requires an extra balancing procedure during training. Another drawback of including them as training samples is that label noise directly perturbs the estimated gradients during training, which decreases the convergence rate.

KEY RESOURCES TABLE

REAGENT or RESOURCE | SOURCE | IDENTIFIER

Deposited Data
DNA SPRITE of GM12878 cell line | Quinodoz et al. (2018) | GEO: GSE114242
In-situ Hi-C of GM12878 cell line | Rao et al. (2014) | GEO: GSE63525
Single-cell Hi-C of GM12878 cell line | Ramani et al. (2017) | GEO: GSE84920
Repli-seq of GM12878 cell line | 4DN Data Portal | 4DNESO83H9ZI, 4DNESDQ9PZOX
ChIA-Drop of Drosophila S2 cell line | Zheng et al. (2019) | GEO: GSE109355
RNAPII ChIA-Drop of Drosophila S2 cell line | Zheng et al. (2019) | GEO: GSE109355
In-situ Hi-C of Drosophila S2R+ cell line | Szabo et al. (2018) | GEO: GSE99104

Experimental Models: Cell Lines
Homo sapiens: lymphoblastoid cell line GM12878 | Rao et al. (2014) | Cat# GM12878; RRID: CVCL_7526
Drosophila S2 cell line | Zheng et al. (2019) | N/A
Drosophila S2R+ cell line | Szabo et al. (2018) | N/A

Software and Algorithms
MATCHA | This paper | https://github.com/ma-compbio/MATCHA

PRIMER

Advances in high-throughput mapping of 3D genome organization have enabled genome-wide characterization of chromatin interactions. However, most proximity-ligation-based mapping approaches for pairwise chromatin interactions, such as Hi-C, cannot capture multi-way interactions, which are informative for delineating higher-order genome organization and gene regulation mechanisms at single-nucleus resolution. Recently developed ligation-free, genome-wide chromatin interaction mapping methods such as SPRITE and ChIA-Drop are able to reveal higher-order chromatin organization by capturing simultaneous interactions among multiple genomic loci within the same nuclei. However, computational methods that can fully utilize such multi-way chromatin interaction data remain underdeveloped. Existing analysis approaches typically make strong assumptions that lead to loss of information from the original data. Here we develop a new computational framework, called MATCHA, for the analysis of multi-way chromatin interaction data (see the concept in Figure 1). Specifically, using hypergraph representation learning, MATCHA represents genomic bins as nodes and multi-way chromatin interactions as hyperedges. We apply MATCHA to SPRITE and ChIA-Drop data and demonstrate that MATCHA can effectively denoise the data and make de novo predictions of multi-way chromatin interactions, reducing the potential false positives and false negatives in the original data. We also show that MATCHA is able to distinguish between a multi-way interaction in a single nucleus and a combination of pairwise interactions in a cell population. In addition, the embeddings generated by MATCHA for genomic bins reflect 3D genome spatial localization and function. MATCHA is a new framework that can significantly enhance the analysis of multi-way chromatin interaction data, delineating unique principles of higher-order chromosome organization and function.

Highlights

  • MATCHA models multi-way chromatin interactions in a single nucleus as hyperedges

  • It effectively enhances the data quality of SPRITE and ChIA-Drop

  • It distinguishes multi-way interactions from pairwise interaction cliques

  • MATCHA’s embeddings for genomic bins reflect 3D genome structure and function

ACKNOWLEDGEMENTS

This work was supported in part by the National Institutes of Health Common Fund 4D Nucleome Program grant U54DK107965 (J.M.) and the National Institutes of Health grant R01HG007352 (J.M.). The authors would like to thank Ben Chidester, Zhijun Duan, Minji Kim, and Yijun Ruan for helpful discussions.

DECLARATION OF INTERESTS

The authors declare no competing interests.

References

  1. Beagrie RA, Scialdone A, Schueler M, Kraemer DC, Chotalia M, Xie SQ, Barbieri M, de Santiago I, Lavitas L-M, Branco MR et al. (2017). Complex multi-enhancer contacts captured by genome architecture mapping. Nature 543, 519.
  2. Berge C (1984). Hypergraphs: combinatorics of finite sets, vol. 45. Elsevier.
  3. Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR et al. (2010). The NIH roadmap epigenomics mapping consortium. Nature Biotechnology 28, 1045–1048.
  4. Bloom BH (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13, 422–426.
  5. Bonev B and Cavalli G (2016). Organization and function of the 3D genome. Nature Reviews Genetics 17, 661.
  6. Chèneby J, Ménétrier Z, Mestdagh M, Rosnet T, Douida A, Rhalloussi W, Bergon A, Lopez F and Ballester B (2020). ReMap 2020: a database of regulatory regions from an integrative analysis of Human and Arabidopsis DNA-binding sequencing experiments. Nucleic Acids Research 48, D180–D188.
  7. Consortium EP et al. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74.
  8. Debnath B, Sengupta S, Li J, Lilja DJ and Du DH (2011). BloomFlash: Bloom filter on flash-based storage. In 2011 31st International Conference on Distributed Computing Systems pp. 635–644, IEEE.
  9. Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS and Ren B (2012). Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376.
  10. Engreitz JM, Haines JE, Perez EM, Munson G, Chen J, Kane M, McDonel PE, Guttman M and Lander ES (2016). Local regulation of gene expression by lncRNA promoters, transcription and splicing. Nature 539, 452–455.
  11. Fullwood MJ and Ruan Y (2009). ChIP-based methods for the identification of long-range chromatin interactions. Journal of Cellular Biochemistry 107, 30–39.
  12. Grover A and Leskovec J (2016). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp. 855–864, ACM.
  13. Gui H, Liu J, Tao F, Jiang M, Norick B and Han J (2016). Large-scale embedding learning in heterogeneous event data. In 2016 IEEE 16th International Conference on Data Mining (ICDM) pp. 907–912, IEEE.
  14. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F et al. (2006). The UCSC genome browser database: update 2006. Nucleic Acids Research 34, D590–D598.
  15. Hnisz D, Abraham BJ, Lee TI, Lau A, Saint-André V, Sigova AA, Hoke HA and Young RA (2013). Super-enhancers in the control of cell identity and disease. Cell 155, 934–947.
  16. Huang J, Marco E, Pinello L and Yuan G-C (2015). Predicting chromatin organization using histone marks. Genome Biology 16, 162.
  17. Kai Y, Andricovich J, Zeng Z, Zhu J, Tzatsos A and Peng W (2018). Predicting CTCF-mediated chromatin interactions by integrating genomic and epigenomic features. Nature Communications 9, 4221.
  18. Kempfer R and Pombo A (2019). Methods for mapping 3D chromosome architecture. Nature Reviews Genetics doi: 10.1038/s41576-019-0195-2.
  19. Kim M, Zheng M, Tian SZ, Lee B, Chuang JH and Ruan Y (2019). MIA-Sig: multiplex chromatin interaction analysis by signal processing and statistical algorithms. Genome Biology 20, 251.
  20. Kim S and Shendure J (2019). Mechanisms of Interplay between Transcription Factors and the 3D Genome. Molecular Cell 76, 306–319.
  21. Kumaran RI, Thakar R and Spector DL (2008). Chromatin dynamics and gene positioning. Cell 132, 929–934.
  22. Li G, Ruan X, Auerbach RK, Sandhu KS, Zheng M, Wang P, Poh HM, Goh Y, Lim J, Zhang J et al. (2012). Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell 148, 84–98.
  23. Lieberman-Aiden E, Van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293.
  24. Marbach D, Lamparter D, Quon G, Kellis M, Kutalik Z and Bergmann S (2016). Tissue-specific regulatory circuits reveal variable modular perturbations across complex diseases. Nature Methods 13, 366–370.
  25. Nora EP, Lajoie BR, Schulz EG, Giorgetti L, Okamoto I, Servant N, Piolot T, van Berkum NL, Meisig J, Sedat J et al. (2012). Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 381.
  26. Perozzi B, Al-Rfou R and Skiena S (2014). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp. 701–710, ACM.
  27. Quinodoz SA, Ollikainen N, Tabak B, Palla A, Schmidt JM, Detmar E, Lai MM, Shishkin AA, Bhat P, Takei Y et al. (2018). Higher-order inter-chromosomal hubs shape 3D genome organization in the nucleus. Cell 174, 744–757.
  28. Ramani V, Deng X, Qiu R, Gunderson KL, Steemers FJ, Disteche CM, Noble WS, Duan Z and Shendure J (2017). Massively multiplex single-cell Hi-C. Nature Methods 14, 263.
  29. Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES et al. (2014). A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680.
  30. Salameh TJ, Wang X, Song F, Zhang B, Wright SM, Khunsriraksakul C and Yue F (2019). A supervised learning framework for chromatin loop detection in genome-wide contact maps. bioRxiv 739698.
  31. Stadhouders R, Filion GJ and Graf T (2019). Transcription factors and 3D genome conformation in cell-fate decisions. Nature 569, 345–354.
  32. Szabo Q, Jost D, Chang J-M, Cattoni DI, Papadopoulos GL, Bonev B, Sexton T, Gurgo J, Jacquier C, Nollmann M et al. (2018). TADs are 3D structural units of higher-order chromosome organization in Drosophila. Science Advances 4, eaar8082.
  33. Tang Z, Luo OJ, Li X, Zheng M, Zhu JJ, Szalaj P, Trzaskoma P, Magalska A, Wlodarczyk J, Ruszczycki B et al. (2015). CTCF-mediated human 3D genome architecture reveals chromatin topology for transcription. Cell 163, 1611–1627.
  34. Tian D, Zhang R, Zhang Y, Zhu X and Ma J (2020). MOCHI enables discovery of heterogeneous interactome modules in 3D nucleome. Genome Research 30, 227–238.
  35. Tu K, Cui P, Wang X, Wang F and Zhu W (2018). Structural deep embedding for hyper-networks. In Thirty-Second AAAI Conference on Artificial Intelligence.
  36. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł and Polosukhin I (2017). Attention is all you need. In Advances in Neural Information Processing Systems pp. 5998–6008.
  37. Wang D, Cui P and Zhu W (2016). Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp. 1225–1234, ACM.
  38. Wang Y, van de Weijer J and Herranz L (2018). Mix and match networks: encoder-decoder alignment for zero-pair image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp. 5467–5476.
  39. Whalen S, Truty RM and Pollard KS (2016). Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nature Genetics 48, 488.
  40. Xiong K and Ma J (2019). Revealing Hi-C subcompartments by imputing inter-chromosomal chromatin interactions. Nature Communications 10.
  41. Yang T, Zhang F, Yardımcı GG, Song F, Hardison RC, Noble WS, Yue F and Li Q (2017). HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Research 27, 1939–1949.
  42. Zhang R, Wang Y, Yang Y, Zhang Y and Ma J (2018). Predicting CTCF-mediated chromatin loops using CTCF-MP. Bioinformatics 34, i133–i141.
  43. Zhang R, Zou Y and Ma J (2020). Hyper-SAGNN: a self-attention based graph neural network for hypergraphs. In International Conference on Learning Representations (ICLR).
  44. Zhang S, Chasman D, Knaack S and Roy S (2019). In silico prediction of high-resolution Hi-C interaction matrices. Nature Communications 10, 1–18.
  45. Zheng M, Tian SZ, Capurso D, Kim M, Maurya R, Lee B, Piecuch E, Gong L, Zhu JJ, Li Z et al. (2019). Multiplex chromatin interactions with single-molecule precision. Nature 566, 558.
  46. Zhou D, Huang J and Schölkopf B (2007). Learning with hypergraphs: Clustering, classification, and embedding. In Advances in Neural Information Processing Systems pp. 1601–1608.

Data Availability Statement

The source code of MATCHA can be accessed at: https://github.com/ma-compbio/MATCHA. This study did not generate new datasets.
