This is a preprint.

It has not yet been peer reviewed by a journal.

The National Library of Medicine is running a pilot to include preprints that result from research funded by NIH in PMC and PubMed.

[Preprint]. 2024 Mar 4:arXiv:2403.03231v1. [Version 1]

Machine and deep learning methods for predicting 3D genome organization

Brydon P G Wall 1, My Nguyen 2, J Chuck Harrell 3,4,5, Mikhail G Dozmorov 2,3,*
PMCID: PMC10942493  PMID: 38495565

Abstract

Three-Dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, Topologically Associating Domains (TADs), and A/B compartments play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even with single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology, tools, and low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNAse-seq, etc.), DNA sequencing information (k-mers, Transcription Factor Binding Site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, TAD boundaries) and analyze their pros and cons. We also point out obstacles of computational prediction of 3D interactions and suggest future research directions.

Keywords: Hi-C, enhancer-promoter interactions, chromatin, loops, TADs, machine learning, deep learning, software

Introduction

Historically, biological characterization of a genome began with the investigation of individual genomic elements, such as genes, Transcription Factor Binding Sites (TFBSs), histone modifications, CpG sites, and DNA methylation [1], collectively referred to as the 1D genome organization. The advent of chromosome conformation capture sequencing technologies [2-4] revealed a distinct, complementary aspect of genome organization: its Three-Dimensional (3D) structure. These technologies have revealed important 3D spatial constructs such as chromosomal territories, A/B compartments [5], topologically associating domains (TADs) [6,7], and, at the most local scale, chromatin loops and enhancer-promoter interactions (EPIs) [8-10].

TADs are self-interacting structural domains within a genome formed through loop extrusion. The cohesin complex is first loaded onto DNA, then extrudes the DNA until it reaches a barrier, usually with the convergent orientation of CTCF and several other DNA-binding proteins [11,12,8,13,14]. Loops and EPIs frequently occur within TADs and are manifested as “point-to-point” interactions facilitating regulatory processes. Emerging evidence has linked chromatin interactions to critical roles in cell dynamics, such as gene regulation and cell differentiation. An example of this is the mechanism by which distal enhancers come into contact with promoters to initiate transcription [15-17]. Consequently, the disruption of the higher-order chromatin structures has been shown to lead to developmental diseases [18], cancer [19-22], and other disorders [23,24]. Although subsets of 3D structures are tissue-invariant [6,8,7], many are cell-type-specific [5,25], representing potential targets for therapeutic intervention. Therefore, defining precise locations of 3D chromatin structures is a top priority in understanding the biology of genome regulation in health and disease.

Hi-C, a high-throughput chromatin conformation capture technology, allows for the detection of chromatin interactions on a genome-wide scale. However, Hi-C technology has its limitations in the analysis of 3D chromatin structures. A typical Hi-C experiment requires billions of reads, more than 20 times the amount of sequencing of a typical RNA-seq experiment [8]. This leads to a prohibitively high cost for this use case. Also, cross-linked DNA fragments are digested with restriction enzymes cutting at predefined sites. Enzymes recognizing 6 bp sequences cut the DNA into fragments with an average size of about 4 Kb, which is suboptimal. Even with 4-cutter restriction enzymes theoretically cutting at every 256 bp, the uneven distribution of cutting sites across a genome makes the resolution variable between individual fragments [26]. A more common approach is to bin a genome into regions of equal size (typically, 5Kb - 100Kb) and count the number of interactions between regions. Because increasing the resolution of Hi-C data requires a quadratic increase in the total sequencing depth [27], high-resolution Hi-C data (1 Kb [28,8], or even 750 bp [29]) remains rare. Furthermore, the lack of systematic collections of high-resolution Hi-C data is compounded by a wide variety of chromatin conformation capture protocols. This further hinders the detection and comparison of higher-order chromatin structures.
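The arithmetic behind these resolution limits can be sketched in a few lines of Python. This is a back-of-the-envelope illustration, not a model of any specific protocol: it assumes random DNA with equiprobable bases, a round 3 Gb genome size, and that the sequencing depth needed scales with the number of bin pairs to be covered. The function names are hypothetical.

```python
def expected_fragment_length(site_length):
    """Expected spacing (bp) of a restriction site of the given length,
    assuming random DNA with equiprobable bases (a simplification)."""
    return 4 ** site_length


def n_bin_pairs(genome_size, bin_size):
    """Number of unordered bin pairs (self-pairs included) at a given bin size."""
    n_bins = -(-genome_size // bin_size)  # ceiling division
    return n_bins * (n_bins + 1) // 2


GENOME_SIZE = 3_000_000_000  # assumed round figure for a human-sized genome

six_cutter_bp = expected_fragment_length(6)   # about 4 Kb
four_cutter_bp = expected_fragment_length(4)  # 256 bp

# A 10-fold finer bin size yields roughly 100-fold more bin pairs to cover.
pair_ratio = n_bin_pairs(GENOME_SIZE, 10_000) / n_bin_pairs(GENOME_SIZE, 100_000)
```

Under these assumptions, moving from 100 Kb to 10 Kb bins multiplies the number of bin pairs by close to 100, which is the quadratic growth referred to above.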

Numerous studies have demonstrated the association of higher-order chromatin structures with genomic annotations and sequence features [25,30,31,9,32], reviewed in [33-36]. Genome annotation data, that is, regions annotated as carrying functional/regulatory potential or having a biological property, are collectively referred to hereafter as epigenomic data [37]. A/B compartments were found to correspond to transcriptionally active or inactive chromatin regions, enrichment or depletion in DNA accessibility signal, gene density, and replication timing [5,38]. Boundaries of TADs were found to be enriched in CTCF (considering the convergent orientation of CTCF binding motifs) and other architectural proteins of cohesin and mediator complex (RAD21, SMC3, and others) [39,40,14,41], marks of transcriptionally active chromatin (DNAse hypersensitive sites, H3K4me3, H3K27ac, and H3K36me3 histone modifications) [42-44,7], and housekeeping genes [45,6]. Similarly, smaller chromatin loops were found to be mediated by CTCF and cohesin via a loop extrusion mechanism, where CTCF binds to a specific and non-palindromic motif in a convergent orientation at two sites acting as loop anchors [11,46,6,14,47]. These relationships suggest that genomic annotations and sequence features may be used to predict the location of higher-order chromatin structures.

In contrast to limited 3D data, the amount of 1D genome annotation data, also known as epigenomic data, has been growing exponentially. ENCODE, NIH Roadmap Epigenomics, FANTOM5, BLUEPRINT, and other members of the International Human Epigenome Consortium (IHEC) [48] have been actively cataloging functional and regulatory genome annotation datasets. These datasets include cell-type and tissue-specific histone modifications, DNAse I hypersensitive sites, DNA methylation, and TFBSs. These data are typically available at a resolution much higher than 3D data (tens of bases in contrast to tens of kilobases) which provides opportunities to refine the location of higher-order chromatin structures.

This review focuses on the machine learning methods and tools for predicting higher-order chromatin structures using genomic annotations, DNA sequences, and other genomic properties (Figure 1). We first describe the machine learning framework and considerations needed for predicting chromatin interactions. We then describe the chronological development of tools for EPIs, chromatin interactions/loops, and boundaries of TADs. Finally, we describe the advantages and disadvantages of each tool.

Figure 1.

Timeline of chromatin interaction prediction tools.

Results

Machine learning framework for genomic predictions

The most common approach for genomic predictions involves binary classification, such as distinguishing between functional and nonfunctional EPIs, or between boundaries and non-boundaries of loops and TADs. Binary classification involves construction of predictor and response variables, also known as independent (explanatory) and dependent variables, respectively. A genome is binned into discrete regions and the response variable is constructed by labeling those regions as being a promoter, enhancer, or TAD/loop boundary (1), or not (0). The predictor variables are constructed from genome annotation data associated with binned regions. Various types of annotations can be used as predictors: DNA sequence/TFBS motifs, evolutionary conservation, overlap with TFBSs, histone modifications, or open chromatin regions. Each of these data types requires a different computational approach. For sequence-based features, one-hot encoding (DNA representation as a binary vector with four dimensions) or k-mer counts can be used. For genomic annotations, the average or sum of signal within a region or the extent of overlap between a region and an annotation can be utilized [49]. A special case includes modeling pairs of regions (such as EPIs or two boundaries flanking a TAD) for which distance between these regions may be used as a predictor variable [30]. The proper construction of predictor and response variables is essential for genomic machine learning.
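The predictor constructions named above can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed tools; the function names, the A/C/G/T channel order, the all-zero encoding of ambiguous bases, and the half-open interval convention are assumptions made for illustration.

```python
from collections import Counter

BASES = "ACGT"  # assumed channel order for the one-hot vectors


def one_hot(seq):
    """Encode each base as a 4-dimensional binary vector.
    Bases outside A/C/G/T (e.g. N) become all zeros (an assumed convention)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def kmer_counts(seq, k):
    """Count overlapping k-mers in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def overlap_fraction(bin_start, bin_end, ann_start, ann_end):
    """Fraction of the half-open bin [bin_start, bin_end) covered by one annotation."""
    overlap = max(0, min(bin_end, ann_end) - max(bin_start, ann_start))
    return overlap / (bin_end - bin_start)


encoded = one_hot("ACGN")
dimers = kmer_counts("ACGAC", 2)
covered = overlap_fraction(0, 1000, 750, 2000)
```

In this sketch a 1 Kb bin overlapped by an annotation starting at position 750 receives an overlap feature of 0.25, while the same bin could alternatively be summarized by the mean or sum of a signal track.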

Genomics machine learning frequently suffers from several pitfalls leading to biased or overly optimistic results (reviewed in [50]). Class imbalance, defined as an unequal number of positive and negative examples, is one such pitfall. With a class imbalance, algorithms tend to focus on the class represented by the largest number of examples [51]. Methods to account for class imbalance include Random Over-Sampling (ROS), Random Under-Sampling (RUS), and combinations of the two (reviewed in [52]). ROS aims to balance class distribution through random replication of minority class examples. RUS performs random elimination of majority class examples. Another popular technique is the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE forms new minority class examples by interpolation from K-Nearest Neighbors (KNN) of randomly selected minority class examples. Alternatively, class imbalance can be addressed by proportionately up-weighting the misclassification cost for the minority class [53]. While some machine learning methods, such as Support Vector Machines (SVMs), can provide relatively robust classification results when applied to imbalanced data sets [54], deep neural networks are not immune to class imbalance, and methods like SMOTE can help [55,51]. Ultimately, the goal of data-level methods is to minimize class imbalance while making the data directly compatible with machine learning algorithms and performance evaluation metrics.
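The three resampling strategies can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not a reference implementation: examples are tuples of numeric features, distances are Euclidean, and the function names are hypothetical. In practice an established library would normally be used.

```python
import random


def random_under_sample(majority, minority, rng):
    """RUS: randomly drop majority examples until both classes have equal size."""
    return rng.sample(majority, len(minority)), list(minority)


def random_over_sample(majority, minority, rng):
    """ROS: randomly replicate minority examples until both classes have equal size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority), list(minority) + extra


def smote_like(minority, n_new, k, rng):
    """SMOTE-style sketch: interpolate between a randomly chosen minority example
    and one of its k nearest minority neighbours (squared Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        partner = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + t * (p - b) for b, p in zip(base, partner)))
    return synthetic


rng = random.Random(0)
majority = [(float(i), 0.0) for i in range(20)]
minority = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
maj_rus, min_rus = random_under_sample(majority, minority, rng)
maj_ros, min_ros = random_over_sample(majority, minority, rng)
new_points = smote_like(minority, n_new=16, k=2, rng=rng)
```

RUS discards information from the majority class, ROS repeats existing minority examples verbatim, and the SMOTE-style variant creates new points that lie on line segments between existing minority examples.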

If class imbalance is not accounted for, common performance evaluation metrics may report overly optimistic performance. For instance, if the majority class has a 99% prevalence, an accuracy of 99% (or, equivalently, 1% error rate) can easily be achieved by always predicting the majority class. Such baseline methods can be notoriously difficult to beat [56]. The commonly used Area Under the Receiver Operating Characteristic curve (AUROC) metric [57] may demonstrate good performance, but the number of false positives could also be quite high because of the size of the majority class. Instead, Area Under Precision-Recall Curve (AUPRC) is better suited for performance evaluation as it assesses the fraction of true positives among positive predictions [58,59]. Alternatively, complementary evaluation metrics, including combined metrics like F-measure, a harmonic mean of precision and recall [60], and Matthews Correlation Coefficient (MCC), a correlation coefficient utilizing all proportions of the predicted outcome [61], should be explored when working with imbalanced classes [62].
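The 99% prevalence example can be made concrete with a short Python sketch that computes accuracy, precision, recall, F1, and MCC from a confusion matrix. The function name and the convention of returning 0 when a ratio is undefined are assumptions for illustration.

```python
import math


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 and MCC for 0/1 labels.
    Undefined ratios (zero denominators) are reported as 0.0 (an assumed convention)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# 1% positives, 99% negatives; the baseline always predicts the majority class.
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000
baseline = classification_metrics(y_true, always_negative)
perfect = classification_metrics(y_true, y_true)
```

The majority-class baseline reaches 99% accuracy while its recall, F1, and MCC are all zero, which is why accuracy alone is uninformative in this setting.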

Multiple types of genome annotation data and multiple ways of constructing predictors are frequently used. Various validation techniques coupled with feature selection algorithms are frequently utilized to avoid overfitting. K-fold cross-validation randomly splits the data into k equal parts and uses k - 1 parts for training with the remaining part for validation. Leave-one-out cross-validation iteratively uses all but one data point for training with the remaining single data point for validation. Schreiber, J. et al. have described three cross-validation strategies, as well as their advantages and downsides [56]. The cross-chromosome and cross-cell-line validation techniques, when used on their own, may lead to over-optimistic model performance as the average associations become memorized. In contrast, the hybrid approach minimizes the memorization effect by validating the model on unseen genomic locations (cross-chromosome) and in different data (cross-cell-line) [56]. Improper segregation of training and test datasets may also lead to class imbalance and inflation of performance metrics [63]. We describe the aforementioned considerations for each tool where relevant.
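The difference between a random k-fold split and a cross-chromosome split can be sketched as follows. This is an illustrative simplification with hypothetical function names; it assumes each example carries a chromosome label and does not cover the cross-cell-line or hybrid strategies.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_chromosome_split(chromosomes, held_out):
    """Return (train, test) index lists so that no example from a held-out
    chromosome is seen during training."""
    held_out = set(held_out)
    train = [i for i, c in enumerate(chromosomes) if c not in held_out]
    test = [i for i, c in enumerate(chromosomes) if c in held_out]
    return train, test


folds = k_fold_indices(n_samples=10, k=5)
# Each fold serves once as the validation set; the other k - 1 folds are used for training.
validation = folds[0]
training = [i for fold in folds[1:] for i in fold]

labels = ["chr1", "chr1", "chr2", "chr3"]
train_idx, test_idx = cross_chromosome_split(labels, held_out=["chr2"])
```

A random k-fold split can place neighbouring, highly similar regions in both training and validation folds, whereas holding out whole chromosomes guarantees that validation regions come from unseen genomic locations.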

Using genome annotation data for prediction

It has long been noted that different genomic annotations tend to co-localize, suggesting shared functional properties of a given region. Hidden Markov Model (HMM)-based methods have been successfully applied for segmentation of chromatin into distinct functional states. The HMM-SA method developed by Won, KJ. et al. [64] predicts human promoters and enhancers using signal from several histone modification marks shown to have complementary information for prediction. It was trained on RNAPII, TAF1, and p300 ChIP-seq data and could successfully predict promoters and enhancers in different cell lines. ChromHMM [65] and Segway [66] apply an HMM and its extension, a Dynamic Bayesian Network (DBN), respectively, for learning chromatin states characterized by similar patterns of genomic annotations such as transcriptionally active or repressed regions, candidate enhancers, and promoters. The follow-up efforts annotate a genome into 20 annotation-based chromatin states (IDEAS [67]) and 100 conservation-based states (ConsHMM [68]).

Early methods demonstrating association between genome annotations and regulatory regions included prediction of promoters and enhancers from epigenomic data. CSI-ANN was among the first to use artificial neural networks to predict enhancers [69]. Using six histone modification marks, the method implements two data transformation strategies, average signal and energy, for feature representation. Although the method uses a five-fold cross-validation scheme, the network is intentionally trained on an imbalanced dataset with the number of negative examples being 10X larger than the number of positive examples. The empirically tuned time-delay neural network outperforms correlation-based and HMM-based methods for enhancer prediction as quantified by sensitivity and Positive Predictive Value (PPV). It is then applied to enhancer prediction in CD4+ cells, and is validated using p300 binding peaks, sequence conservation, and enrichment in TFBSs. CSI-ANN demonstrates that data transformation (energy transformation of epigenomic signal) may enhance predictive power of machine learning algorithms.

Machine learning methods, such as SVMs, are also adapted to predict enhancers from histone modification signal. ChromaGenSVM is an SVM method tuned with a genetic algorithm to select optimal parameters [70]. Using signal from six histone modification marks, it outperforms previous methods in predicting experimentally validated enhancers judged by the F1 score and AUROC metrics in cross-validation settings. This study also uses an imbalanced dataset with respect to the number of negative examples. Expectedly, H3K4me3 was the most predictive mark of enhancers, with H3K4ac and H2BK5ac marks adding to the predictive power.

RFECS [71] uses only genome annotation data to predict enhancers with a Random Forest (RF) algorithm. Trained on p300 sites using 5-fold cross-validation, the model identifies H3K4me1, H3K4me3, H3K4me2 marks as most predictive. Interestingly, the authors intentionally used imbalanced classes (1:7 to 1:9) arguing that when measured by AUROC, such models better reflect the observed excess of the negative class. They demonstrated that the model trained in one cell type can be used in another cell type, albeit at reduced validation and increased misclassification rates.

Besides genome annotation data, DNA sequence properties, such as k-mers, are also used to predict enhancers from DNA sequence [72]. The SVM model was trained on experimental enhancers using balanced, 50X imbalanced, and 100X imbalanced classes. Both AUROC and AUPRC were used for model evaluation in 3-fold cross-validation settings. The SVM model shows good performance in predicting enhancers vs. background sequence as well as distinguishing enhancers between tissues with the selected k-mers corresponding to TFBS motifs. CLARE [73] is another method for predicting active regulatory elements in a DNA sequence based on the presence of TFBS motifs. It uses Least Absolute Shrinkage and Selection Operator (LASSO) penalized regression to select a set of motifs best predicting active elements, as compared with matched-in-length and GC content random sequence. Although it uses sequence information only, the model achieved AUROC 0.83 in distinguishing forebrain enhancers from background sequences. Ultimately, gkm-SVM [74] and its improved version, LS-GKM [75], demonstrated good AUROC and AUPRC performance metrics.

Combining genome annotation data and DNA sequence information has also been explored for regulatory element prediction. EnhancerFinder [76] uses binding motifs together with evolutionary conservation and epigenomic data to predict enhancers using the SVM-based multiple kernel learning method. First, it distinguishes enhancers from non-enhancers, and second, predicts their activity in a given tissue. Trained on a balanced set of 711 experimentally validated VISTA enhancers and the same number of matched random regions, EnhancerFinder demonstrates that multiple datasets provide complementary information improving prediction accuracy compared to using a single predictor. Interestingly, both evolutionary conservation and DNA motifs were among the least predictive features, strengthening the notion that epigenomic data is better suited for enhancer prediction.

Numerous other methods have been developed for predicting TFBSs, DNA methylation status, and gene expression level from genome annotation data (reviewed in [77,78]). The success of using genome annotation data for predicting genomic features on a linear genome (1D predictions) prompted the development of methods for predicting higher-order chromatin interactions.

Enhancer-promoter interactions prediction

Enhancer-promoter interactions (EPIs) represent an easily understood functional link between distal genomic regions (enhancers) and gene promoters. An active EPI is thought to facilitate gene expression by recruiting transcriptional co-factors. Biological interpretability of EPIs prompted a plethora of tools being developed for their prediction (Table 1). Although not implemented as a tool, PreSTIGE was among the first methods to use epigenomic data for EPI prediction [99]. Considering cell type-specific H3K4me1 signal (a marker of enhancers), the method linked these regions with genes expressed specifically in that cell type. The method restricts its search to enhancers and genes located within domains marked by CTCF binding. The authors’ estimation of the False Discovery Rate (FDR) using experimental 3C, 5C and ChIA-PET data demonstrated only ~40% of genes were regulated by the nearest enhancers with the majority of enhancers being located within 100Kb of transcription start sites.

Table 1.

Enhancer-promoter interaction prediction tools.

Tool | Algorithm | URL | Language | Reference
Epigenomic data-based methods
IM-PET | RF | http://tanlab4generegulation.org/IM-PET.html | Perl | [79]
EpiTensor | SVD | http://wanglab.ucsd.edu/star/EpiTensor/ | Matlab | [80]
RIPPLE | RF and group LASSO | http://pages.discovery.wisc.edu/~sroy/ripple/queryg.php | Matlab | [81]
PETModule | RF | https://hulab.ucf.edu/research/projects/PETModule/ | Python | [82]
TargetFinder | Boosted trees, multiple techniques | https://github.com/shwhalen/targetfinder | Python | [30]
JEME | RF, LASSO | https://github.com/yiplabcuhk/JEME | Java | [83]
EPIP | Decision trees | https://github.com/amlantalukder/EPIP | Python | [84]
3DPredictor | GBM | https://github.com/labdevgen/3DPredictor | Python | [85]
LoopPredictor | RF, GBM | https://github.com/bioinfomaticsCSU/LoopPredictor | Python | [86]
DNA sequence-based methods
PEP | GBM | https://github.com/ma-compbio/PEP | Python | [87]
EPIANN | Attention-based DNN | https://github.com/wgmao/EPIANN | Python | [88]
SPEID | CNN+RNN | https://github.com/ma-compbio/SPEID | Python | [89]
SIMCNN | CNN | https://github.com/zzUMN/Combine-CNN-Enhancer-and-Promoters | Python | [90]
EP2vec | GBM | https://github.com/wanwenzeng/ep2vec | Python | [91]
DeepTACT | CNN-based | https://github.com/liwenran/DeepTACT | Python | [92]
SEPT | CNN+LSTM | https://github.com/NWPU-903PR/SEPT | Python | [93]
EPIsHilbert | CNN+Hilbert encoding | https://github.com/zmyqx/E-HilbertEPIs | Python | [94]
EPI-DLMH | CNN | https://github.com/Xzenglab/EPI-DLMH | Python | [95]
EPIHC | CNN+feed-forward+communicative module | https://github.com/BioMedicalBigDataMiningLab/EPIHC | Python | [96]
Hybrid methods
CENTRE | GBM | https://github.com/slrvv/CENTRE | R | [97]
DeepPHiC | CNN+shared feature extractor+classifier | https://github.com/lichen-lab/DeepPHiC | Python | [98]

CNN - Convolutional Neural Network, GBM - Gradient Boosting Machines, LSTM - Long Short-Term Memory, RF - Random Forest, RNN - Recurrent Neural Network, SVD - Singular Value Decomposition.

He, B. et al. developed the IM-PET method to predict EPIs [79]. The IM-PET method estimates enhancer activity using the precomputed histone and transcription factor signal scores together with FPKM gene expression between enhancers and promoters. The method additionally includes both coevolution and distance between enhancers and promoters with distance proving to be the most predictive. IM-PET was trained with a RF model on a balanced dataset using ChIA-PET-validated positive EPIs. The model outperformed a baseline nearest promoter metric, SVM and logistic regression, and PreSTIGE. Five-fold cross-validation, AUROC, and F1 scores were used to assess performance. The authors predicted EPIs in 12 cell lines and reported that one enhancer targets about three promoters on average.

EpiTensor [80] predicts cell type-specific EPIs using singular value decomposition of a tensor composed of signal measures of 16 histone marks, DNase Hypersensitive Sites (DHSs), and RNA-seq signals in five cell types. It decomposes the tensor into three subspaces (cell-type, assay, and genomic locus subspaces), focusing on the “locus” eigenvectors. The genomic locus subspace captures epigenomic patterns of interacting distal regions and demonstrates the importance of H3K4me1, H3K4me3, H3K27ac, and DHSs in defining EPI hotspots. Although it does not consider the class imbalance problem, the method outperforms the correlation-based baseline as measured by AUROC. EpiTensor has enabled prediction of EPIs at a 200bp resolution.

Roy, S. et al. developed RIPPLE [81], a method using a combination of RFs and group LASSO to predict enhancer-promoter interactions in multiple cell lines. It uses 3C chromatin interaction data with 23 epigenomic datasets (8 histone marks, 15 TFBSs) common to five cell lines. The method addresses class imbalance with RUS controlling for distance, but the model performs well even on imbalanced data as judged by AUPRC. CTCF, SMC3, RAD21, DNase I, expression, H3K27ac, H3K27me3, H3K36me3, H3K4me2, H3K4me3, H3K79me2, H3K9ac, and RNA PolII were the most predictive, and the model trained in one cell line could predict EPIs in another.

Zhao, C. et al. developed PETModule [82], an RF-based approach for predicting EPIs. It uses conservation, distance, enhancer-promoter gene ontology correlation, and a novel sequence motif module feature. PETModule’s feature selection identifies distance as the most predictive feature. Compared with IM-PET and PreSTIGE, the method has better recall, precision, AUROC, and F1 score when trained on balanced data and validated on experimental and manually curated interactions. The model was trained on human datasets and performs well in mouse datasets. The authors confirmed that the majority of enhancers are distant and should be considered within 2Mb regions from transcription start sites.

The popular TargetFinder [30] method uses boosted trees to predict EPIs using DNase-seq, DNA methylation, TF ChIP-seq, histone marks, CAGE, and gene expression data. Similar to IM-PET, the authors identified the window feature (the intervening genomic interval between the enhancer and the promoter) as the most predictive. Interestingly, the best performance with an F1-score of approximately 0.2 was only achieved by gradient boosting with a high number (~4000) of trees but not with a linear SVM. The follow-up investigation of these results demonstrated that a class imbalance in the test set with respect to the number of enhancers overlapping a promoter leads to overly optimistic prediction performance. Consequently, cross-validation on a sorted chromosome list (training on one and testing on the following chromosome) demonstrates nearly random prediction performance [63].

JEME [100] uses a two-step process to define enhancer-promoter interactions. First, functional enhancers are identified using a LASSO regression of gene expression on signal from histone modifications and DHSs within 1Mb across samples. Second, to predict the regulating enhancers in a particular sample, it uses an RF model constructed on sample-specific error terms and features, including distance. RUS is used to address class imbalance in cross-validation settings, and performance is assessed using AUPRC. The authors show that the method outperforms IM-PET, PreSTIGE, RIPPLE, and TargetFinder, and they predicted EPIs for 935 cell and tissue types.

Enhancer-Promoter Interaction Prediction (EPIP) [84] was pioneering in combining genomic (distance, conservation synteny score, and correlation of epigenomic signals in enhancers and promoters) and epigenomic data (DNAse, TFBS, and histone ChIP-seq). It is an ensemble learning method using 200 weak learners (decision trees) and an AdaBoost-inspired algorithm to reweight the input data after each learner. The positive and negative classes were created from FANTOM enhancers, subsetted by cell type-specific ChromHMM enhancer annotations, and loops detected from Hi-C data. The model was trained on both balanced and imbalanced data based on the observation that the balanced model had a high sensitivity, and the imbalanced model had a high specificity when tested on the training data by 10-fold cross-validation. The authors show the model performed well in cells with incomplete or missing data and outperformed TargetFinder and RIPPLE as judged by AUROC, AUPRC, and F1 scores.
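The idea of reweighting the input data after each weak learner can be illustrated with one round of the classic AdaBoost update. This is the textbook algorithm for labels in {-1, +1}, shown only as a sketch of the general principle; it is not EPIP's own AdaBoost-inspired variant, and the function name is hypothetical.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One round of classic AdaBoost reweighting for labels in {-1, +1}.
    Returns the normalized new weights and the learner weight alpha."""
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified examples (t * p == -1) are up-weighted, correct ones down-weighted.
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(updated)
    return [w / norm for w in updated], alpha


weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]  # the last example is misclassified
new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
```

After the update, the single misclassified example carries half of the total weight, so the next weak learner is pushed to classify it correctly.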

Belokopytova, P. S. et al. [85] developed 3DPredictor, a gradient boosting model for predicting EPIs using CTCF binding loci and orientation, gene expression, and distance between interacting loci. The authors showed the inferior performance of TargetFinder was due to the incorrect design of training and validation datasets and developed a training strategy using balanced classes with a cross-chromosomal training-validation approach and multiple evaluation metrics to overcome its limitations. Instead of predicting qualitative EPIs, they predict the quantitative strength of EPIs. They showed the model trained on one cell type can successfully predict EPIs in another cell type.

Tang, L. et al. developed LoopPredictor [86], a two-component ensemble machine learning model to predict EPIs and chromatin loops. The model architecture includes an Anchor type predictor implemented with a hybrid RF and a Confidence predictor using Gradient Boosted Regression Trees. The model is trained on H3K27ac and YY1 HiChIP data using TFBSs, histone modifications, chromatin accessibility, gene expression, methylation, and transcription start sites as features. The authors showed that the model outperforms the TargetFinder and 3DPredictor tools as well as four baseline classifiers (Linear Support Vector Classification (SVC), Logistic Regression, K-Neighbors, and RF) as judged by the F1 score, AUROC, and AUPRC in cross-validation settings. Trained on human data, the model performed well in predicting EPIs across other organisms. The authors demonstrated that multiple genomic feature types, rather than a single one, were needed for accurate predictions. The continued demonstration of the superiority of the RF technique over other methods [101] positions it as a robust machine learning approach for EPI predictions utilizing epigenomic data.

It has also been demonstrated that tools can predict EPIs using only DNA sequence. PEP [87] uses a machine learning model and Gradient Tree Boosting [102] based only on features from DNA sequences of enhancer and promoter regions. It has two variants, PEP-Motif and PEP-Word. PEP-Motif only uses motif enrichment features for known TFBS motifs. PEP-Word uses word embeddings, a term from natural language processing, that allow representing (arbitrary-length) sentences from a discrete vocabulary as fixed-length numerical vectors while retaining semantic meaning in the embedding space. The authors addressed class imbalance by reweighting and using multiple performance metrics (AUROC, AUPRC, Precision, Recall, F1, MCC). The authors’ results show that it is possible to achieve comparable results using only sequence-based features to predict EPIs.

Similarly, EPIANN [88], an attention-based neural network model, was developed for predicting EPIs with only DNA sequence. The network structure includes three functional blocks: an attention mechanism, an interaction quantification block, and a multi-task learning block. EPIANN was trained on TargetFinder’s data and accurately predicted EPI events. It also reveals pairwise attention scores, which uncover over-represented TFBSs and TF-pair interactions associated with enhancer function. Comparative analysis with TargetFinder and PEP shows comparable performance across cell lines, with EPIANN outperforming PEP based on AUROC scores while demonstrating similar performance to TargetFinder in other scenarios. Attention-based analysis provides insight into the importance of DNA sequence features and correlates well with known genome annotations like CTCF and EGR family members. Attention-based analysis also identifies novel TFs driving EPI formation like NRF1 for which ChIP-seq data is not available. The model’s multi-task learning architecture ultimately enhances generalizable feature learning, while motif analysis and DNase-seq footprinting improve biological interpretation.

The authors of PEP also developed SPEID [89], a deep neural network framework predicting EPIs from DNA sequence only. Its architecture combines convolution, activation, and max-pool layers with a recurrent neural network (RNN) layer and a dense layer. Class imbalance was addressed using a “data augmentation” method similar to oversampling. Several metrics were used to benchmark performance (AUROC, AUPRC, F1) and the method outperformed PEP and TargetFinder. Zhuang, Z. et al. [90] proposed a simplified version of the deep neural network used by SPEID, without the RNN layer (SIMCNN). They demonstrated the network’s performance to be comparable to or better than that of SPEID, TargetFinder, EPIANN, and PEP. Similarly trained and evaluated, their method performed well in cell type-specific settings but less optimally across cell lines.

EP2vec [91] uses sequence embedding features of enhancers and promoters to train a Gradient Boosting Regression Trees algorithm on experimentally validated EPI pairs. EP2vec uses balanced classes, with the same number of positive and negative examples, and is evaluated using F1, AUROC, and AUPRC in 10-fold cross-validation settings. Although it outperformed the epigenomic data-based TargetFinder and SVM predictors, the addition of epigenomic data slightly improved performance, suggesting that sequence and epigenomic features are complementary.

The DeepTACT neural network was developed to predict EPIs and promoter-promoter interactions from DNA sequence and chromatin accessibility (DNAse-seq) data [92]. It contains three modules: a sequence module using one-hot encoded sequence information, an openness module using chromatin accessibility signal, and an integration module that includes a bidirectional Long Short-Term Memory (LSTM) network and an attention layer. The authors showed that DeepTACT’s use of two data types improved performance as measured by AUROC and AUPRC. They justified the use of imbalanced data by the presence of multi-way interactions. DeepTACT’s data augmentation and bootstrapping strategy was designed to improve the stability of network parameters trained on experimental Promoter-Capture Hi-C and ChIA-PET data. Compared with SPEID and Rambutan, DeepTACT detects interactions at finer resolution, is more biologically interpretable, and discovers hub promoters active across cell types. Similar to EP2vec, this method demonstrates that DNA sequence properties complement epigenomic features for functional element predictions.
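One-hot encoding, the sequence representation mentioned for DeepTACT and several other tools in this review, can be illustrated with a minimal sketch. The channel order (A, C, G, T), the all-zero encoding of ambiguous bases, and the function name `one_hot` are assumptions made for illustration and are not taken from any specific tool.

```python
BASES = "ACGT"  # channel order is an arbitrary choice for this sketch


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (one row per base).

    Bases outside A/C/G/T (for example N) are encoded as all zeros.
    """
    index = {base: i for i, base in enumerate(BASES)}
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


encoded = one_hot("ACGTN")
for base, row in zip("ACGTN", encoded):
    print(base, row)
```

The resulting length-by-4 matrix is the typical input to the convolutional layers described above.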

Jing, M. et al. developed SEPT [93], a deep learning method using DNA sequence information for EPI prediction. It extracts sequence features of enhancers and promoters from DNA sequence using two CNN layers and one LSTM layer. Then, it introduces the gradient reversal layer to reduce the cell line specific features and prioritize enhancer-promoter specific features. SEPT simultaneously trains two classifiers of EPIs and domain-specific (cell type specific) discriminators. Trained on cell lines with known EPIs, the authors showed that SEPT performed well in new cell lines and outperformed LS-SVM, SPEID, and RIPPLE according to AUROC, AUPRC, and F1 scores.

Zhang, M. et al. introduced EPIsHilbert [94], a convolutional neural network (CNN) utilizing Hilbert curve encoding to predict EPIs. Hilbert curve encoding of DNA sequence improved the model’s performance by preserving the spatial positioning relationship between an enhancer and a promoter. The authors addressed class imbalance using both over- and undersampling techniques and used F1, AUROC, and AUPRC metrics for performance evaluation. Their approach outperformed SPEID and SIMCNN using different cell line data. This work introduces additional approaches for EPI prediction, such as transfer learning between cell lines and visualization of sequence features.
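The idea behind Hilbert curve encoding is that positions close together on the linear sequence stay close together on a 2D grid. The sketch below uses the standard index-to-coordinate Hilbert mapping to place bases on a grid; it is a generic illustration, and the function names, the `.` padding symbol, and the single-base layout are assumptions rather than the EPIsHilbert implementation.

```python
def d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order by 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate (and flip, if needed) the current quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_grid(seq, order, pad="."):
    """Lay a sequence out on a square grid following the Hilbert curve."""
    n = 1 << order
    if len(seq) > n * n:
        raise ValueError("sequence does not fit on the grid")
    grid = [[pad for _ in range(n)] for _ in range(n)]
    for d, base in enumerate(seq):
        x, y = d2xy(order, d)
        grid[y][x] = base
    return grid


for row in hilbert_grid("ACGTACGTACGTACGT", order=2):
    print("".join(row))
```

Because consecutive indices always map to neighbouring cells, a 2D convolution over such a grid sees linearly adjacent bases as spatial neighbours.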

Min, X. et al. developed EPI-DLMH [95], a two-layer CNN for predicting EPIs from only DNA sequence. EPI-DLMH contains a bidirectional gated recurrent unit to capture long-range dependencies and an attention mechanism to prioritize the most important features. It also uses k-mer sequence representation and embedding with dna2vec to reduce dimensionality. The authors also introduced a matching heuristic to capture more interaction information between promoters and enhancers. Using balanced cell type-specific datasets and several performance evaluation metrics (AUROC, AUPRC, F1), their method outperformed SPEID and EPIANN. This approach demonstrates that additional DNA sequence information captured by matching heuristics can improve EPI prediction.
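The k-mer representation that precedes an embedding step such as dna2vec can be sketched as follows. The values of `k` and `stride`, the function names, and the integer vocabulary are illustrative assumptions; the embedding lookup itself is not shown.

```python
def kmer_tokens(seq, k=6, stride=1):
    """Split a DNA sequence into overlapping k-mers (the 'words' fed to an embedding)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens):
    """Assign a stable integer id to every distinct k-mer (sorted for reproducibility)."""
    return {kmer: idx for idx, kmer in enumerate(sorted(set(tokens)))}


tokens = kmer_tokens("ACGTAC", k=3)
vocab = build_vocab(tokens)
token_ids = [vocab[t] for t in tokens]
print(tokens)
print(token_ids)
```

Each integer id would then index a row of a pre-trained embedding matrix, replacing a sparse one-hot vector with a dense low-dimensional one.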

The EPIHC [96] deep neural network is another method combining sequence-derived features and genomic annotations. It extracts sequence features from enhancers and promoters using a CNN. It then introduces a communicative learning module to capture communicative information between enhancers and promoters. The authors investigated the effect of class imbalance on the model’s performance in 5-fold cross-validation settings using AUROC and AUPRC. Using benchmarking data from TargetFinder, the model was trained on data from multiple cell lines and then applied to a target cell line. EPIHC outperforms three other neural network-based methods: SIMCNN, SPEID, and EPIVAN. This paper demonstrated that hybrid data (both sequence and genome annotation data) and model usage (neural networks connected via a communicative learning module) improve EPI prediction.

Combining epigenetic data with other data modalities, such as gene expression, was shown to be beneficial for EPI predictions. Cell-specific ENhancer Target pREdiction (CENTRE) [97] aims to predict cell type-specific EPIs using only gene expression and ChIP-seq data (H3K27ac, H3K4me1, and H3K4me3). It trains the XGBoost algorithm on the ENCODE registry of candidate cis-regulatory elements with enhancer-like signatures and GENCODE transcription start sites. It also utilizes the Benchmark of candidate ENhancer-Gene Interactions (BENGI) dataset for validation to mitigate bias stemming from dependencies between training and test datasets. The authors used 12 cross-validation groups divided by chromosome and evaluated performance with F1-scores due to the highly imbalanced dataset. The model outperformed TargetFinder and, similar to IM-PET, PETModule, and other methods, identified enhancer-promoter distance as the most predictive feature.
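Cross-validation groups divided by chromosome keep all examples from one chromosome in the same fold, so that neighbouring, correlated enhancer-promoter pairs never appear on both sides of a split. The sketch below shows one way to build such groups; the greedy size-balancing rule and the function names are assumptions for illustration and do not reproduce how CENTRE assigned its 12 groups.

```python
from collections import Counter


def chromosome_folds(chroms, n_folds):
    """Assign whole chromosomes to folds, largest first, always into the smallest fold."""
    counts = Counter(chroms)
    fold_sizes = [0] * n_folds
    assignment = {}
    for chrom, size in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        fold = fold_sizes.index(min(fold_sizes))
        assignment[chrom] = fold
        fold_sizes[fold] += size
    return assignment


def split(chroms, assignment, test_fold):
    """Return (train_indices, test_indices) for one held-out fold."""
    train = [i for i, c in enumerate(chroms) if assignment[c] != test_fold]
    test = [i for i, c in enumerate(chroms) if assignment[c] == test_fold]
    return train, test


# Toy data: the chromosome label of each candidate enhancer-promoter pair
chroms = ["chr1"] * 4 + ["chr2"] * 3 + ["chr3"] * 2 + ["chr4"]
assignment = chromosome_folds(chroms, n_folds=2)
train_idx, test_idx = split(chroms, assignment, test_fold=0)
print(assignment)
```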

DeepPHiC [98] is a multi-modal deep learning model for EPI and promoter-promoter interaction prediction. It incorporates one-hot encoded DNA sequence, epigenetic signals (H3K4me1, H3K4me3, H3K27ac), anchor distance, evolutionary, and DNA structural features. The model’s architecture, based on DenseNet, comprises two modules: a feature extractor (two dense blocks and a CNN) and a classifier. It was trained on promoter-centered interactions from the 3D-genome Interaction Viewer and database using an unbalanced dataset (5x negative examples). DeepPHiC outperforms SPEID, DeepMILO, ChINN, and DeepTACT as evaluated by AUROC and AUPRC. This approach demonstrates superiority in predicting EPIs and promoter-promoter interactions across multiple cell types using a combination of genomic and epigenomic features with distance between anchors.

Chromatin interaction prediction

Besides canonical EPIs, predicting significant chromatin interactions using genome annotation and/or DNA sequence data has also attracted significant attention (Table 2). Similar to EPIs, these interactions are thought to facilitate long-range gene expression regulation and are frequently referred to as chromatin loops, interaction hubs, or significant chromatin interactions. Huang, J. et al. 2015 [103] developed HubPredictor, which predicts frequently interacting genomic loci (termed “hubs”) and TAD boundaries. They employed a Bayesian Additive Regression Trees (BART) model that used signal from CTCF binding sites and nine cell-type-specific histone modification marks. The model was trained on balanced data (RUS), evaluated using cross-validation and AUROC, and performed well across cell lines. Histone marks (H3K4me1) were the most predictive of chromatin hubs across datasets and cell types. The addition of sequence-based features, such as conservation and GC content, did not improve predictions. TAD boundaries were predicted by CTCF peaks and negative H3K4me1 peaks. Alongside this work, Dixon, J. R. et al. [25] predicted differential chromatin interactions using a RF algorithm. It was trained on the signal measures of six histone modification marks, CTCF, and DNAse hypersensitive sites (DHSs) on balanced data using RUS. They similarly found that changes in H3K4me1 signal were the most predictive, followed by DHSs.
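Class balancing by RUS, taken here to mean random under-sampling of the majority class, can be sketched in a few lines. The function name, the fixed seed, and the assumption that negatives are the majority class are illustrative choices.

```python
import random


def random_undersample(positives, negatives, seed=0):
    """Keep all positives and draw an equally sized random subset of negatives.

    Assumes negatives are the majority class; if they are not, both lists
    are returned unchanged.
    """
    if len(negatives) <= len(positives):
        return list(positives), list(negatives)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return list(positives), rng.sample(negatives, len(positives))


positives = list(range(5))          # e.g. indices of interacting bin pairs
negatives = list(range(100, 150))   # e.g. indices of non-interacting bin pairs
pos_kept, neg_kept = random_undersample(positives, negatives)
print(len(pos_kept), len(neg_kept))
```

Under-sampling discards data, so it is usually applied to the training set only, with the test set left at its natural class ratio.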

Table 2.

Chromatin interaction prediction tools.

Tool | Algorithm | URL | Language | Reference
Epigenomic data-based methods
HubPredictor | Bayesian Additive Regression Trees | https://github.com/huangjialiangcn/HubPredictor | R | [103]
CITD | Wavelet deconvolution, non-linear transformation | https://cb.utdallas.edu/CITD/index.htm | NA | [104]
3DEpiLoop | RF | https://bitbucket.org/4dnucleome/3depiloop | R | [105]
Lollipop | RF | https://github.com/ykai16/Lollipop | Python | [106]
HiC-Reg | RF regression | https://github.com/Roy-lab/HiC-Reg | C | [107]
L-HiC-Reg | RF regression | https://github.com/Roy-lab/Roadmap_RegulatoryVariation | C | [108]
sevenC | Logistic regression | https://bioconductor.org/packages/release/bioc/html/sevenC.html | R | [109]
Epiphany | CNN+LSTM | https://github.com/arnavmdas/epiphany | Python | [110]
X-SCNN | Siamese CNN | https://github.com/ernstlab/X-CNN | Python | [111]
Epigenomic data-based methods
EPCOT | Encoder-decoder | https://github.com/liu-bioinfo-lab/EPCOT | Python | [112]
DNA sequence-based methods
Samarth | SVM | https://bioinf.mpi-inf.mpg.de/publications/samarth/ | Matlab | [113]
Akita | CNN | https://github.com/calico/basenji/tree/tf2_hic/manuscripts/akita | Python | [114]
deepC | CNN | https://github.com/rschwess/deepC | Python | [115]
ChINN | CNN | https://github.com/caofan/chinn | Python | [116]
Orca | Encoder-decoder | https://github.com/jzhoulab/orca | Python | [117]
HiCDiffusion | Encoder-decoder, CNN, transformer | https://github.com/SFGLab/HiCDiffusion | Python | [118]
Hybrid methods
CTCF-MP | Boosted Tree Classifier | https://github.com/ma-compbio/CTCF-MP | Python | [119]
Rambutan | CNN | https://github.com/jmschrei/rambutan | Python | [120]
CCIP | RF | https://github.com/GaoLabXDU/CCIP | Python | [121]
IChrom-Deep | Transformer | https://github.com/HaoWuLab-Bioinformatics/IChrom-Deep | Python | [122]
FusNet | Fusion network | https://github.com/CSUBioGroup/FusNet | Python | [123]
C.Origami | CNN, transformer | https://github.com/tanjimin/C.Origami | Python | [124]
Hi-C data-based methods
Peakachu | Random Forest | https://github.com/tariks/peakachu | Python | [125]
MATCHA | Graph embedding+NN | https://github.com/ma-compbio/MATCHA | Python | [126]
Be-1DCNN | Bagging of 1D CNN models | https://github.com/HaoWuLab-Bioinformatics/Be1DCNN | Python | [127]
DeepChIA-PET | CNN | https://github.com/zwang-bioinformatics/DeepChIA-PET/ | Python | [128]

Chen, Y. et al. [104] developed CITD, a method to predict genome-wide chromatin interaction matrices using 1D histone modification data. The authors utilized the observation that histone modification signals in nearby genomic bins are correlated, and that this correlation follows a power-law decrease with increasing distance between bins. They developed a five-step procedure that includes wavelet decomposition of the original Hi-C matrix, nonlinear power-law transformation of the resulting coefficient matrices, and wavelet deconstruction of the predicted interaction matrix. Both cross-chromosome and cross-cell-line validation showed high correlation of predicted and experimental matrices, as well as similar TADs and EPIs.

Al Bkhetan and Plewczynski [105] developed 3DEpiLoop to predict two types of chromatin loops: CTCF-mediated and RNAPII-mediated. They trained a RF model in genome-wide cross-validation settings and estimated the performance of their model with both AUROC and experimentally obtained ChIA-PET data as a gold standard. In their tests, the RF technique outperformed AdaBoost classification trees, neural networks, SVM, and Stochastic Gradient Boosting. This study was among the first to use the distance between genome annotations and loop anchors. The authors showed that CTCF-mediated loops were best predicted by TFBSs and RNAPII-mediated loops by histone marks, indicating that different loop types have different annotation signatures.

A similar approach was taken by Kai, Y. et al. [106], who developed Lollipop to predict CTCF-mediated interactions using a RF technique. Besides TFBS signal and orientation, they used conservation scores and the distance between loop anchors (loop length). Trained on experimental ChIA-PET data, the model was evaluated in cross-validation settings using both AUROC and AUPRC. Similar to previous models, the authors identified CTCF, RAD21, and loop length as the most predictive features. Models trained in one cell line performed well in another cell line and also identified many more loops than experimentally detected. This work demonstrated that genome annotations can guide the discovery of novel loops that cannot be detected by current 3C technologies.

Zhang, S. et al. developed HiC-Reg [107], a RF regression-based approach for predicting genome-wide chromatin interactions from epigenomic data. The authors experimented with various feature encoding approaches to predict interactions between two regions. These approaches include concatenating epigenomic signals at both regions, including signal between regions, and incorporating epigenomic signals from multiple cell lines. Using the model trained on the distance between interacting regions as a baseline, they demonstrated improved performance when including signal between regions and across cell lines according to the AUROC scores. They used cross-chromosomal validation within and between cell lines, with an expected drop in performance when using chromosomes and cell lines other than those used for training. The authors demonstrated the biological relevance of their predictions using distance-stratified Pearson Correlation Coefficients (PCCs) between the original and predicted interactions. They detected similar regions forming loops, bidirectional CTCF binding in those loops, and overlapping TAD boundaries. They also demonstrated that their model successfully predicts experimentally supported interactions at the HBA1 and PAPPA gene promoters. An extension of this model, L-HiC-Reg, performs local modeling and prediction of chromosomal interactions, and additionally evaluates networks of significant gene interactions and their disease-associated variants [108].
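A distance-stratified PCC compares observed and predicted contacts separately for each genomic distance, which prevents the strong distance-dependent decay of Hi-C signal from inflating the correlation. The following is a minimal sketch on dense square matrices stored as lists of lists; the function names and the toy data are assumptions, and this is not the HiC-Reg evaluation code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def distance_stratified_pcc(observed, predicted, max_offset):
    """Correlate the two matrices one diagonal (bin distance) at a time."""
    n = len(observed)
    scores = {}
    for d in range(1, max_offset + 1):
        obs = [observed[i][i + d] for i in range(n - d)]
        pred = [predicted[i][i + d] for i in range(n - d)]
        if len(obs) >= 2:  # a correlation needs at least two points
            scores[d] = pearson(obs, pred)
    return scores


# Toy example: the prediction is an exact linear rescaling of the observation
n = 6
observed = [[(i + j) % 4 + 1.0 / (1 + abs(i - j)) for j in range(n)] for i in range(n)]
predicted = [[2.0 * v + 1.0 for v in row] for row in observed]
scores = distance_stratified_pcc(observed, predicted, max_offset=3)
print(scores)
```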

The sevenC R package by Ibn-Salem & Andrade-Navarro [109] investigated the hypothesis that a functional chromatin loop interaction should contain highly correlated CTCF ChIP-seq signal. Consequently, a model trained on such correlation measures may be used to predict functional chromatin loops. They considered every pair of CTCF motifs in a convergent orientation within a 1Mb window as potential looping interactions. For each pair, they measured the correlation of ChIP-seq signal within a 1000bp window around each motif. Using the correlation coefficient, distance between motifs, orientation, and motif significance scores, they built a logistic regression model using high-resolution Hi-C and ChIA-PET data as a gold standard. They acknowledged the presence of class imbalance, although they did not directly account for it, and they used multiple performance evaluation metrics (AUROC, AUPRC, F1-score). Their approach revealed comparable predictive power using CTCF, RAD21, and ZNF143 factors, known architectural proteins of chromatin loops [8,129], as well as novel factors TRIM22, RUNX3, and BHLHE40. Interestingly, DNAse-seq performed similarly to TF ChIP-seq signal; however, histone modification signal (H3K4me1, H3K4me3, H3K27me3, H3K27ac) was not predictive. Their combined model, trained on the average signal of the 10 most predictive TFs, can be used to predict cell type-specific loops.
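The two feature-building steps described above, enumerating convergent motif pairs within 1Mb and correlating the signal around the two motifs, can be sketched as follows. The input layout (a list of `(position, strand)` tuples and per-motif signal vectors), the function names, and the toy numbers are assumptions for illustration; sevenC itself is an R package with a different interface.

```python
import math


def convergent_pairs(motifs, max_dist=1_000_000):
    """Return index pairs (i, j): a '+' motif upstream of a '-' motif within max_dist."""
    order = sorted(range(len(motifs)), key=lambda i: motifs[i][0])
    pairs = []
    for a, i in enumerate(order):
        pos_i, strand_i = motifs[i]
        if strand_i != "+":
            continue
        for j in order[a + 1:]:
            pos_j, strand_j = motifs[j]
            if pos_j - pos_i > max_dist:
                break  # motifs are sorted, so all later ones are too far as well
            if strand_j == "-":
                pairs.append((i, j))
    return pairs


def pearson(xs, ys):
    """Pearson correlation of two equal-length signal vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else float("nan")


motifs = [(100_000, "+"), (600_000, "-"), (1_500_000, "-"), (700_000, "+")]
pairs = convergent_pairs(motifs)

# Hypothetical binned ChIP-seq coverage around two motifs of one candidate pair
signal_a = [1, 2, 5, 2, 1]
signal_b = [2, 4, 10, 4, 2]
signal_corr = pearson(signal_a, signal_b)
print(pairs, signal_corr)
```

The correlation, together with motif distance and motif scores, would then form one feature row for a classifier such as logistic regression.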

Deep neural network architectures were also explored for chromatin interaction predictions. Epiphany [110] was among the first neural networks to predict the Hi-C contact map from five epigenomic tracks (DNAse, CTCF, H3K27ac, H3K27me3, and H3K4me3). As an architecture, it uses 1D convolutional layers and a bidirectional LSTM. The authors further used a generator network to extract information and make predictions and a discriminator to use adversarial loss to predict realistic Hi-C maps. AUROC and AUPRC were used for performance evaluation. The authors demonstrated that CTCF was critically important for accurate Hi-C map prediction and its removal breaks the model.

EpiMCI [130], although not implemented as a tool, uses a dual-channel hypergraph neural network for reconstructing multi-way chromatin interactions from epigenomic signals. The model represents the data as a hypergraph, learns the vertex embedding and predicts multi-way interactions as a classification task. It was developed for high-throughput Pore-C technology and applied to GM12878 and K562 cells at 1Mb distance. It uses balanced classes, cross-validation, and six evaluation metrics. It marginally outperformed MATCHA, described below, in predicting multi-way chromatin interactions. Besides prediction, embeddings can be used to define A/B compartments and denoise Pore-C data.

EPCOT [112] is a deep learning framework (an encoder-decoder architecture) for predicting cell-type specific chromatin interactions, as well as epigenomic data, gene expression, and enhancer activity from chromatin accessibility only (DNAse-seq, ATAC-seq). It utilizes a pre-training and fine-tuning approach, where a model is first trained on cell-specific chromatin accessibility profiles (one-hot encoded accessibility sequences) and then transferred to downstream tasks. EPCOT’s performance is rigorously validated through several experiments, including cross-chromosome and cross-cell type prediction analyses, where it consistently outperforms baseline models, such as Avocado for epigenomic data prediction, as evidenced by AUROC and AUPRC. Additionally, the authors demonstrate the model’s ability to accurately predict TF binding patterns and its generalizability across different cell types.

Jaroszewicz & Ernst developed X-SCNN [111], a siamese CNN (two subnetworks with shared parameters) that leverages signal from TFBSs, histone marks, and DNAse data (average signal) to predict chromatin interactions at high resolution (100bp). They used HiCCUPS-called interactions to train the model and RUS to balance the data, and benchmarked performance using AUROC, AUPRC, and chromosome-specific validation.

Besides genome annotation data, a polymer physics model [131] has been proposed to simulate chromatin interactions at high resolution. The model uses 15 chromatin states learned from histone modification profiles, CTCF motifs and their orientation, and employs molecular dynamics simulation based on an energy function to simulate an ensemble of high-resolution chromatin structures. Using a variety of similarity metrics (PCC, stratum-adjusted correlation, contact enrichment, and TAD boundary matching score), the model demonstrates the feasibility of resolving small chromatin loops, TADs, and long-range EPIs compatible with Hi-C and imaging data.

Similar to EPI predictions, DNA sequence properties have also been explored for chromatin interaction predictions. Nikumbh & Pfeifer developed Samarth [113], a tool for predicting long-range chromatin interactions using sequence information only. They used SVM with string kernel predictors using the oligomer distance histogram representations (histogram of distances between short (3–5bp) oligomers in the sequence) and trained their model on significantly interacting vs. non-interacting loci. The model accounts for class imbalance using upweighting of misclassification cost and uses AUROC for performance evaluation. This work demonstrated that short tandem repeats are potentially important for distinguishing interaction patterns. Although the model’s performance was relatively modest in predicting cell type-specific interactions, this work was among the first to demonstrate that sequence alone provides information for long-range chromatin interaction predictions.
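An oligomer distance histogram counts, for every ordered pair of short oligomers and every distance, how often the second oligomer starts that many bases after the first. The sketch below builds this as a sparse counter; the parameter values, the inclusion of distance 0, and the function name are assumptions for illustration and may differ from the representation used in Samarth.

```python
from collections import Counter


def oligomer_distance_histogram(seq, k=3, max_dist=10):
    """Count (oligomer_a, oligomer_b, distance) triples in a sequence.

    distance is the offset between the start positions of the two k-mers,
    from 0 (the same occurrence) up to max_dist.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    hist = Counter()
    for i, first in enumerate(kmers):
        for d in range(max_dist + 1):
            j = i + d
            if j >= len(kmers):
                break
            hist[(first, kmers[j], d)] += 1
    return hist


hist = oligomer_distance_histogram("ACGACG", k=3, max_dist=3)
print(hist[("ACG", "ACG", 3)])
```

A repeat such as ACG followed by ACG three bases later shows up as a count at distance 3, which illustrates how tandem repeats can stand out in this feature space.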

Fudenberg, G. et al. developed Akita [114], a CNN built using the Basenji architecture [132] that processes 1Mb DNA sequences and predicts chromatin interactions at 2Kb resolution. Akita was trained on high-resolution human Micro-C data and accounts for distance-dependent decay of interactions in chromatin interaction data. A conceptually similar CNN, deepC [115], similarly predicts chromatin interactions from DNA sequence, but uses epigenomic features to pre-train and initialize the network. It is trained on cell-type-specific Hi-C datasets validated using high-resolution Capture-C interactions, and it was shown to outperform HiC-Reg. These tools allow for understanding the effect of mutations (in silico saturation mutagenesis), understanding structural variants, interpreting expression quantitative trait loci (eQTLs), and predicting chromatin interactions in different species.

Cao, F. et al. developed ChINN [116], a CNN for predicting chromatin interactions from DNA sequence. The authors first showed that interactions can be predicted from both functional genomic data and distance between interacting regions. The CNN was then applied to forward and reverse complement sequences of interaction anchors and trained on CTCF loops, RNA PolII-mediated loops, and Hi-C data. The model was trained on a 1:5 imbalanced dataset and evaluated using AUPRC. The authors showed that convergent CTCF orientation is an important predictor, while other motifs complement predictive power.
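Presenting both the forward and the reverse complement sequence of an anchor lets a sequence model see strand-dependent motifs, such as oriented CTCF sites, in either orientation. A minimal sketch follows; the function names are hypothetical and the handling of N is an assumption.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N is kept as N)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def both_strands(seq):
    """Return the (forward, reverse complement) pair for one anchor sequence."""
    return seq.upper(), reverse_complement(seq)


forward, reverse = both_strands("AACGT")
print(forward, reverse)
```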

Orca [117], a deep-learning model, predicts 3D genome architecture directly from genomic sequence data. Its architecture includes a hierarchical sequence encoder and a multilevel cascading decoder designed to provide a ‘zooming’ series of predictions at multiple scales. Performance evaluations show strong correlations between predicted and experimental interactions, with validation techniques such as AUROC scores confirming the model’s efficacy. Orca accurately predicts diverse genome interaction mechanisms, including those mediated by CTCF and Polycomb, and exhibits concordance with experimental observations of chromatin interactions marked by histone modifications like H3K4me3, H3K27ac, and H3K4me1. Similar to other models for predicting chromatin interactions from DNA sequence, Orca enables in silico screens to probe sequence-based mechanisms of genome organization.

Recent deep learning approaches have also been utilized to predict chromatin interactions. The HiCDiffusion model [118] was developed to improve the resolution of Hi-C matrices generated from DNA sequence data. By incorporating an encoder-decoder architecture and a diffusion model, HiCDiffusion aims to reduce artificial blurring and enhance the fidelity of predictions. The encoder-decoder architecture includes residual convolutions and transformer encoders, with transfer learning utilized to pre-train the encoder-decoder architecture. Evaluation involved comprehensive data processing, training, testing, and validation procedures (Fréchet inception distance scores, used for quantifying the realism and diversity of generated images, correlation with the original Hi-C maps), with comparisons to C.Origami showcasing superior performance in both sequence-only and epigenetics-enhanced scenarios. These results underscore the potential of modern deep learning architectures for improving predictions of chromatin interactions solely from DNA sequence data.

A combination of genomic annotations and sequence features was also explored for chromatin interaction predictions. Zhang, R. et al. [119] developed CTCF-MP, a boosted tree classifier for predicting chromatin loops using CTCF motifs, distance between them, conservation, and cell type-specific ChIP-seq and DNAse-seq data. The CTCF motif sequences plus flanking regions were initially processed using word2vec and a deep autoencoder to compress the 200-dimensional space to 32 dimensions, and the learned features were used for prediction. Imbalanced data performed well in their settings, with AUROC and AUPRC used for performance evaluation in cross-validation settings. The model trained on one cell type can be used to predict chromatin loops in other cell types, although the best performance was achieved when predicting common loops.

Schreiber, J. et al. developed Rambutan [120], a deep convolutional network for predicting significant chromatin interactions from one-hot encoded DNA sequence and DNAse Hypersensitive sites (signal, logFC over control). The network is trained on significant interactions defined by Fit-Hi-C (q-value <= 1e-6), DNAse signal, and binarized genomic distance. It uses balanced classes, AUROC, and AUPRC for performance evaluation. Predictions correlated with Insulation Score and replication timing. Rambutan is trained at a 1–5Kb resolution, but theoretically can predict at a finer resolution.

Network/graph analysis methods were also applied for chromatin interaction detection. Wang, W. et al. developed CTCF-mediated Chromatin Interaction Prediction (CCIP) [121], a tool for predicting CTCF-mediated convergent loops and tandem loops with transitivity. Transitivity is defined from the network of multiple CTCF-interacting regions, convergent and tandem. In addition to CTCF and RAD21 binding sites, anchor and in-between features (as in TargetFinder), and directional CTCF motif one-hot encoding, the model incorporated the graph connecting probability (GCP) score, which proved to be the most important predictive feature. A RF was trained on a balanced dataset and evaluated in cross-validation settings with AUROC, AUPRC, and other metrics. The authors showed that CCIP outperforms Lollipop and CTCF-MP and that transitive loops can explain the formation of tandem loops.

Farré, P. et al. [133] demonstrated the use of a deep neural network (dense architecture) to predict chromatin interactions using sequences of TFBSs (ChIP-seq) and transcriptionally active/inactive (RNA-seq) data. They also demonstrated that chromatin interactions themselves can be used to predict TFBSs. Subsequent research utilized a CNN architecture to predict chromatin interactions from DNA sequence only.

The IChrom-Deep [122] neural network combines the use of sequence features and genomic features. It was implemented using a novel attention-based deep learning model containing a sequence module and a genomic module. The genomic module processes genomic features, conservation scores, CTCF motifs, and the distance between chromatin bins. The model was trained on a balanced dataset and evaluated using four evaluation metrics (AUPRC, Accuracy, MCC, F1) alongside a cross-validation strategy. The authors showed that IChrom-Deep outperforms three models (TargetFinder, XGBoost, and SGDC) and demonstrated the importance of the distance between interacting regions, CTCF (convergent orientation), and the cohesin complex members (RAD21, SMC3). This tool demonstrates that synergistic use of sequence and genomic features can improve chromatin interaction predictions.

FusNet [123] is a 3-layer fusion neural network designed for predicting chromatin loops utilizing genome sequence information, distance between anchors, open chromatin, and ChIP-seq data. The feature extraction layer employs a CNN for dimensionality reduction of one-hot encoded DNA sequences. The predictor layer integrates Light Gradient Boosting Machine, eXtreme Gradient Boosting, and KNN models. The fusion layer integrates the predictions from each basic model as new features for model training and prediction of the final probability of loop formation. The fusion layer improves prediction performance, especially in cross-cell type loop prediction, as compared with the ChINN, Peakachu, and DeepYY1 methods using AUROC and AUPRC. FusNet demonstrates high consistency with Hi-C data and associates predicted loops with regulatory functions and disease-related mechanisms. Additionally, FusNet’s prediction accuracy is supported by Aggregate Peak Analysis and aligns well with known TADs and EPIs. Permutation importance analysis highlights the significance of sequence features, particularly anchor sequences. FusNet was also applied to identify potential target genes of pathogenic single nucleotide polymorphisms (SNPs).

Similarly, C.Origami [124] uses DNA sequence, CTCF-binding, and ATAC-seq signals to predict genome organization comparable to high-quality Hi-C experiments. It introduces a novel multimodal deep neural network architecture for cell type-specific prediction of Hi-C maps in 2Mb windows. One-hot-encoded DNA sequence and feature signals are processed in parallel by 12 1D CNN layers followed by a transformer with eight self-attention blocks. The following 2D convolutional layers reconstruct the predicted Hi-C map. To evaluate the performance of C.Origami and other comparable models (Akita, deepC, Orca), the study uses insulation score correlation, observed/expected Hi-C map correlation, mean squared error (MSE), distance-stratified correlation, and AUROC. C.Origami and similar models facilitate in silico genetic perturbation studies, enabling efficient exploration of causal relationships in chromatin organization.

Hi-C data itself may contain sufficient information for chromatin interaction predictions. Salameh, T. J. et al. developed Peakachu [125], a RF model to predict chromatin loops (represented as pixels of high intensity) using Hi-C contact matrices only. They designed a strategy to represent each loop as a collection of intensities and ranks within an 11×11 pixel window around each loop. Using loops detected by H3K27ac HiChIP and CTCF ChIA-PET as positive examples with the same number of negative examples, they achieved robust performance measured by MCC. Compared with HiCCUPS, Peakachu detected more experimentally validated loops. The model shows robust performance when varying sequencing depth across cell lines and can be applied to data obtained by other technologies such as Micro-C and SPRITE.
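Representing a candidate loop pixel by the intensities and ranks in its 11×11 neighbourhood can be sketched as below. The matrix layout (a dense list of lists), the ordinal ranking with ties broken by position, and the function name are assumptions for illustration and not Peakachu's own feature code.

```python
def window_features(matrix, i, j, half_width=5):
    """Flatten the window centred on pixel (i, j) into intensities followed by ranks.

    With half_width=5 the window is 11 x 11, giving 121 intensities + 121 ranks.
    Rank 1 is the weakest pixel in the window; ties are broken by position.
    """
    n = len(matrix)
    if not (half_width <= i < n - half_width and half_width <= j < n - half_width):
        raise ValueError("window extends beyond the matrix")
    values = [
        matrix[r][c]
        for r in range(i - half_width, i + half_width + 1)
        for c in range(j - half_width, j + half_width + 1)
    ]
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return values + ranks


# Toy contact matrix with one bright pixel at (7, 7)
size = 15
matrix = [[r * size + c for c in range(size)] for r in range(size)]
matrix[7][7] = 1000
features = window_features(matrix, 7, 7)
print(len(features))
```

The rank half of the vector makes the features less sensitive to overall sequencing depth than raw intensities alone.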

MATCHA [126] uses structural information derived from Hi-C data and graph embedding followed by Hyper-SAGNN (a self-attention based graph neural network for hypergraphs) analysis. Multi-way chromatin interactions are represented as hyperedges that can be predicted by Hyper-SAGNN. Applied to SPRITE and ChIA-Drop data from GM12878 cells at both 1Mb and 100Kb resolutions, the method provides a wealth of information about the genomic properties of multi-way interactions. This method was the first to predict multi-way chromatin interactions.

The Be-1DCNN model [127] employs a bagging ensemble of ten one-dimensional CNNs (each containing three convolutional layers and dropout layers) to improve prediction accuracy of chromatin loops. It is trained on high-resolution Hi-C data using 22 chromosomes for training and one for testing. Comparisons with other models, such as Gaussian Naïve Bayes, Perceptron, KNN, Decision Tree, and Peakachu, showed Be-1DCNN’s superior performance in predicting chromatin loops as evaluated on experimental ChIA-PET and HiChIP data. The model was trained on balanced datasets (RUS-like sampling of negative examples), and its effectiveness was validated using various metrics such as accuracy, MCC, and area under the curve. This method demonstrated that combining traditional machine learning approaches, such as bagging, with deep learning architectures improves chromatin loop prediction.

Combining Hi-C and genomic annotation data was shown to be beneficial for chromatin interaction predictions. Liu, T. et al. presented DeepChIA-PET [128], a CNN (40 dilated residual convolutional blocks) for predicting ChIA-PET-defined chromatin interactions from Hi-C and ChIP-seq data. The model outperforms Peakachu when trained on Hi-C data only using three evaluation metrics (average precision, AUROC, AUPRC) with CTCF ChIA-PET representing ground truth. The model trained on one cell line (GM12878) can be applied to others (e.g., HeLa). The authors demonstrated that performance improves when including ChIP-seq data, underscoring the importance of epigenomic data for chromatin interaction prediction.

TAD boundary prediction

Topologically Associating Domains (TADs) represent a higher-order level of chromatin interactions [145]. They represent kilobase-to-megabase size regions on the linear genome that are highly self-interacting [6]. TADs have also been reported to constrain enhancer-promoter communication [15,16] and might be related to genome stability [146]. Boundaries of TADs were found to be enriched in architectural factors such as CTCF, RAD21, SMC3, YY1, and ZNF143 [6,8,147,148,49], and boundary strength correlates with their occupancy [149,150]. Furthermore, distinct patterns of histone modifications [6] and other regulatory marks [42] have also been shown to be present at boundaries. These observations strongly suggest that genome annotation data may be used for TAD boundary prediction [137] (Table 3).

Table 3. TAD boundary prediction tools.

Tool | Method | URL | Language | Reference
Epigenomic data-based methods
cdBEST | Enrichment | http://e-portal.ccmb.res.in/e-space/rakeshmishra/cdBEST.html | Web, Perl | [134]
HiCfeat | Penalized multiple logistic regression | https://cran.r-project.org/web/packages/HiCfeat/index.html | R | [135]
HiCblock | Generalized linear model | https://cran.r-project.org/src/contrib/Archive/HiCblock/ | R | [136]
nTDP | Generalization of Conditional Random Field | https://www.cs.cmu.edu/~ckingsf/research/ntdp/ | | [137]
TAD-Lactuca | RF, NN | https://github.com/LoopGan/TAD-Lactuca | Python | [138]
preciseTAD | RF | https://bioconductor.org/packages/preciseTAD/ | R | [49]
DeepMILO | CNN & RNN | https://github.com/tuantrieu/DeepMILO | Python | [139]
PredTAD | GBM | https://github.com/jchyr-sbmi/PredTAD/ | R | [140]
DNA sequence-based methods
TADBoundaryDetector | CNN+LSTM | https://github.com/lincshunter/TADBoundaryDectector | Python | [141]
Deep-loop | CNN | https://github.com/linDing-group/Deep-loop | Python | [142]
Epigenomic data-based methods
CLNN-loop | CNN+LSTM | https://github.com/HaoWuLab-Bioinformatics/CLNN-loop | Python | [143]
Hybrid methods
pTADS | RF | https://github.com/chrom3DEpi/pTADS | R | [144]

The cdBEST tool was among the first to demonstrate the use of TFBS motifs for boundary prediction in 12 Drosophila species [134]. Trained on experimentally validated boundaries, cdBEST scans a genome using a 750bp window in 10bp increments to calculate fold-enrichment values. The well-known transcription factors, such as BEAF and dCTCF, were among the most predictive motifs. cdBEST defines five types of boundaries, and further demonstrates their association with gene expression differences. A subsequent work demonstrated superior performance of cdBEST over k-means clustering and ChromHMM segmentation and developed a RF model for predicting novel boundaries using modENCODE ChIP-seq data and protein binding motifs [151]. Importantly, this later study used RUS and whole-genome cross-validation, shaping the standard framework for genomics machine learning.
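The window-scanning step described above can be illustrated with a minimal Python sketch. It assumes motif hits are available as a list of genomic positions and defines fold enrichment as the hit density inside the window divided by the genome-wide hit density. That definition, the function name `window_fold_enrichment`, and the toy positions are illustrative assumptions, not the cdBEST implementation.

```python
from bisect import bisect_left

def window_fold_enrichment(hits, genome_len, window=750, step=10):
    """Return (window_start, fold_enrichment) pairs for a sliding window.

    Fold enrichment is an assumed definition: hit density inside the
    window divided by the genome-wide hit density.
    """
    hits = sorted(hits)
    background = len(hits) / genome_len  # hits per bp, genome-wide
    if background == 0:
        return []
    results = []
    for start in range(0, genome_len - window + 1, step):
        # Count hits with start <= position < start + window.
        n = bisect_left(hits, start + window) - bisect_left(hits, start)
        results.append((start, (n / window) / background))
    return results

# Toy motif positions (made up for illustration).
toy_hits = [100, 120, 150, 400, 5000]
scores = window_fold_enrichment(toy_hits, genome_len=10_000)
best_start, best_fe = max(scores, key=lambda pair: pair[1])
```

Windows with high fold enrichment would then be candidate boundary regions; any threshold on the score would need to be calibrated against experimentally validated boundaries.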

Mourad, R. et al. [135] developed the HiCfeat R package that implements multiple logistic regression for TAD boundary prediction. This model was among the first to use the percent of overlap with TFBSs, instead of binary overlaps. The model was trained on TAD boundaries identified by the HiSeg TAD caller (imbalanced data) and evaluated in cross-validation settings using AUROC. The model uses L1-norm (LASSO) penalization to select the most predictive features. As expected, human CTCF, Cohesin, ZNF143, and Polycomb group proteins were positive predictors, and P300, RXRA, BCL11A, and ELK1 were negative predictors. For Drosophila, BEAF and CP190 were the strongest predictors, followed by dCTCF, which was enriched but not strongly predictive. This work demonstrated that predictive modeling using genome annotations works well across organisms.
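The difference between binary and percent overlap can be shown with a short Python sketch. The function name `overlap_features`, the half-open interval convention, and the toy coordinates are assumptions made for illustration; this is not code from the HiCfeat package.

```python
def overlap_features(region, sites):
    """Return (binary_overlap, percent_overlap) of `region` with `sites`.

    `region` is a half-open (start, end) interval; `sites` is a list of
    half-open intervals assumed not to overlap one another.
    """
    r_start, r_end = region
    covered = 0
    for s_start, s_end in sites:
        # Length of the intersection, or 0 when the intervals are disjoint.
        covered += max(0, min(r_end, s_end) - max(r_start, s_start))
    percent = 100.0 * covered / (r_end - r_start)
    return int(covered > 0), percent

# Toy boundary region and binding sites (made up for illustration).
boundary = (1000, 2000)
ctcf_sites = [(900, 1100), (1500, 1550), (3000, 3200)]
binary, percent = overlap_features(boundary, ctcf_sites)
```

The binary feature only records that some overlap exists, whereas the percent feature distinguishes a region that is barely touched by a site from one that is mostly covered.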

In a follow-up work, Mourad, R. et al. [136] introduced a generalized linear model (HiCblock) that allows for estimation of blocking effect of architectural proteins. In addition to genome annotations, the model relies on Hi-C data to model the strength of interactions between loci separated by architectural proteins. The model also requires a distance parameter that specifies the distance between interacting loci. It accounts for both DNA and Hi-C-specific biases, such as GC content, mappability, and fragment length. This work, and the previous work, did not account for class imbalance, relying on 10-fold cross-validation and AUROC to select the best model. Applied to Drosophila and Human Hi-C data, the model reveals known and new findings about the blocking effect of architectural proteins at different distance scales. The authors of this work demonstrate how the integration of genome annotation (one-dimensional) and Hi-C (two-dimensional) data can model the insulating effect of TAD boundary proteins.

Hong and Kim [42] have developed a Position-Specific Linear Model (PSLM) for TAD boundary prediction using overlap count with histone marks, TFBSs, and DNAse hypersensitive sites. The model was trained on TAD boundaries identified by the Directionality Index [6] with the balanced set of negative examples and evaluated using 5-fold cross-validation and AUROC and F1-score metrics. The Population greedy search algorithm (PGSA) was used to select the combination of most predictive features. The authors used the model to identify CTCF, SMC3, RAD21, ZNF143, YY1, DNAse, and H3K36me3 as the most predictive features and distinguished between common and cell type-specific TAD boundaries. Although not implemented as a software tool, this work further demonstrates the importance of different types of genome annotations for predicting TAD boundaries.
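Because AUROC and F1-score recur as evaluation metrics throughout this section, a minimal Python sketch of both may be useful. It computes AUROC as the fraction of positive-negative pairs in which the positive example receives the higher score (ties count as one half). The function names, the 0.5 decision threshold, and the toy labels and scores are illustrative assumptions.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_score(labels, predictions):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels and predictions."""
    tp = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data: 1 = boundary, 0 = non-boundary (made up for illustration).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
predictions = [int(s >= 0.5) for s in scores]  # assumed threshold of 0.5
```

AUROC is threshold-free, while F1 depends on the chosen cutoff, which is one reason the two metrics are often reported together.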

Ramírez, F. et al. [152] used several standard classification methods (linear, logistic regression, Gradient Boosting Machine (GBM), and RF) to predict TAD boundaries using the binding affinity of TFBS motifs and DNAse hypersensitive sites. Trained on high-resolution data (0.5Kb Drosophila Kc167 cells Hi-C data) with TAD boundaries predicted using a modified TopDom approach, the models reach >70% sensitivity and specificity in 10-fold cross-validation settings. Feature importance ranking further identifies DNAse as the most important predictor, followed by the expected motifs Beaf32, M1BP, and Ctcf. The authors classified TAD boundaries as promoter-associated or non-promoter-associated, and linked gene expression variability with promoter orientation. They demonstrated that machine learning is able to predict TAD boundaries missed among those detected from Hi-C data only, thus improving the resolution of TAD boundary detection. They also provide an interactive resource, Chorogenome, based on HiCExplorer visualization, to explore TAD structure and epigenomic annotations in Drosophila, Mouse, and Human cell lines.

Sefer & Kingsford developed nTDP [137], a non-parametric approach based on a generalization of Conditional Random Field to predict TAD and interior boundaries using histone modification marks. The class imbalance problem was addressed by reweighting positive and negative examples. The model uses cross-validation and normalized variation of information for performance assessment. The authors defined signatures of four and six histone marks that together, but not individually, accurately predict TAD boundaries. The final trained model can be used to predict TAD boundaries across species and cell types.

TAD-Lactuca [138] predicts TAD boundaries using eight histone marks, CTCF, and sequence information (k-mers). Epigenomic and sequence information were shown to make complementary contributions to prediction accuracy. Two machine learning algorithms, RF and a 4-layer Multi-Layer Perceptron, were tested on a balanced set of positive and negative TAD boundary examples. Using AUROC and AUPRC measures in 5-fold cross-validation settings, the method was shown to outperform HubPredictor and PGSA, with results consistent across resolutions.

preciseTAD [49] was the first tool to predict TAD boundaries at base-level resolution. It uses a RF model to learn TAD-genome annotation associations at the resolution of Hi-C data (detected using Arrowhead [8] or Peakachu [125]) and applies its learned associations to predict the probability of each base being a boundary. After evaluating the performance of chromatin states (histone modification marks, DNAse hypersensitive sites, and TFBSs), the authors found that TFBSs (CTCF, RAD21, SMC3, and ZNF143) were the most predictive. They evaluated four methods for measuring the association between a region and a genomic annotation: signal (sum or average), overlap count, overlap percent, and distance to the region and found the latter to be best performing. For cell lines lacking some TFBS data, data imputed with Avocado [153] has been shown to be a good substitute. Among methods for addressing class imbalance, RUS was shown to perform best. Bases with probability of being a boundary equal to 1 are clustered into preciseTAD boundary regions, and boundary points are identified using Partition Around Medoids. Evaluated in cross-chromosome and cross-cell-line settings, preciseTAD demonstrates improvements to CTCF/RAD21/SMC3 signal around the predicted boundary points. Implemented as an R package, it provides coordinates of predicted boundaries for 60 cell lines.
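The distance-based feature type can be sketched in a few lines of Python. The example computes, for each genomic bin midpoint, the distance in base pairs to the closest annotation site. The function name `distance_to_nearest`, the use of site centers, and the toy coordinates are assumptions for illustration, and any transformation the preciseTAD package applies to the distance is not reproduced here.

```python
from bisect import bisect_left

def distance_to_nearest(position, site_centers):
    """Distance (bp) from `position` to the closest value in sorted `site_centers`."""
    if not site_centers:
        raise ValueError("no annotation sites given")
    i = bisect_left(site_centers, position)
    # Only the neighbors immediately left and right of the insertion
    # point can be the nearest site.
    candidates = site_centers[max(0, i - 1): i + 1]
    return min(abs(position - c) for c in candidates)

# Toy data (made up for illustration); site centers must be sorted.
ctcf_centers = [1_000, 5_000, 12_000]
bin_midpoints = [500, 4_000, 5_000, 9_000]
features = [distance_to_nearest(m, ctcf_centers) for m in bin_midpoints]
```

Unlike overlap counts, a distance feature stays informative for bins that contain no site at all, which may partly explain why it performed best in the comparison above.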

Deep neural network architectures have also been applied to TAD boundary predictions. Rozenwald, M. et al. [154] compared linear regression, with and without penalization, against an RNN and an LSTM for predicting TAD boundaries in the Drosophila genome using histone and TFBS marks. The RNN + LSTM captures the sequential order of the genome and the distance between genomic annotations. It was shown to outperform all types of linear regression using the weighted mean square error as a performance metric.

DeepMILO [139] uses two types of neural networks, CNN and RNN, to learn sequence features of true TAD boundaries and predict the effect of genomic variants on the probability of boundary formation. True TAD boundaries are defined as having both CTCF and Rad21 signals, considering the convergent direction of CTCF motifs. Evaluated by AUROC, the model outperforms word2vec and boosted trees in distinguishing true vs. non-boundaries.

An interesting development was introduced with PredTAD [140], a tool for predicting differential TAD boundaries. The tool employs a GBM trained on a combination of genomic and epigenomic features. Genomic features include chromosomal number, location, distance to centromere, gene length, and density. Epigenomic features include histone modifications, DNA binding proteins, and DNA accessibility. PredTAD demonstrates high accuracy in predicting gained, lost, and conserved TAD boundaries when comparing breast cancer cell lines MCF10A, MCF7, and T47D. Interestingly, chromosome number emerges as a significant predictive feature alongside well-known factors like CTCF, RAD21, and SMC3. The model's performance remains robust even without chromosome information, indicating its versatility. Furthermore, PredTAD outperforms other existing prediction methods such as HubPredictor, PGSA, and TAD-Lactuca. The tool's ability to predict TAD boundary alterations provides insights into gene regulation, signaling pathway activation, and disease states, and highlights its potential for understanding and targeting 3D chromatin remodeling in breast cancer therapy.

Henderson, J. et al. developed TADBoundaryDetector [141], a deep learning approach for predicting TAD boundaries in fruit flies using high-resolution Hi-C data. After hyperparameter optimization, a model comprising three convolutional layers followed by an LSTM layer achieved an accuracy of 96%. The model was validated with 10-fold cross-validation and assessed using accuracy, MCC, precision, recall, and F1 score. The study also identified several sequence motifs enriched in TAD boundaries. Comparison with traditional, feature-based models showed that the deep learning approach performed better at TAD boundary prediction. Overall, the findings underscore the potential of deep learning for deciphering genomic architecture and the regulatory elements governing chromatin organization.

Lv, H. et al. developed Deep-loop [142], a sequence-based CNN model for predicting CTCF-mediated chromatin loops. This model used K-tuple Nucleotide Frequency Component (KNFC), Nucleotide Pair Spectrum Encoding (NPSE), Position Conservation and Position Scoring Function (PCPSF) and natural vector. The authors showed that KNFC, NPSE and PCPSF have strong predictive power across different types of chromatin interaction pairs, while PCPSF is the most informative descriptor to discriminate between true chromatin loops and chromatin non-loops. The model outperforms four other common classification algorithms (SVM, KNN, Naïve Bayes, and LSTM) evaluated by AUROC.

CLNN-loop [143] combines CNN and LSTM (CNN layer, max pooling layer, bidirectional LSTM layer, dropout layer, dense layer, and output layer) for CTCF-mediated chromatin loop prediction using 31 sequence-based features designed with Deep-loop (5 are most predictive, including CTCF motif and sequence conservation). The authors used five sequence encoding schemes to encode DNA sequences: Reverse Complement K-mer (RCKmer), Combination Position Scoring Function (CPSF), NPSE, Combination Position-Specific Tri-Nucleotide Propensity based on single-strand (CPSTNPss), and Combination Position-Specific Tri-Nucleotide Propensity based on double-strand (CPSTNPds). The model performs similarly to CTCF-MP and Deep-loop (evaluated by AUROC) and performs well across cell types.

A combination of genome annotation and sequence-based data has also been explored for TAD boundary prediction. Wang, Y. et al. developed pTADS [144], a tool for predicting TAD boundaries and their strength (boundary score) using epigenetic and sequence profile-based signals. Sequence-based features include DNA shape which is calculated by the DNAshape program. Minor groove width, propeller twist, roll, and helix twist are transformed into statistical values like mean, RMSE, median, minimum, maximum, standard deviation, and population deviation. In addition to DNA shape, DNA properties and TFBS motif occurrences are also used. Epigenetic features that are used include chromatin states, TFBSs, histones, replication timing, and nucleosome positioning. The model utilizes a RF method followed by variable selection. Then the selected features are used to train the XGBoost-based model. The authors showed that this approach outperformed other baseline predictors (Ada, GBM, SVMLinear, KNN, NNet, NB, RPART) using 10-fold cross-validation and multiple evaluation metrics including AUROC. LASSO was used to prioritize feature importance. Models that combined TF and histone signals were shown to be the best performing, while known CTCF, Insulator, RAD21, H3K20me1 and PML were enriched at TAD boundaries. Notably, DNA shape and motif occurrences were shown to be least predictive. Different models, trained on TADs detected by three different TAD callers, were shown to robustly prioritize the same features and identify similar boundaries. The authors also demonstrated that pTADS, trained on data from a single cell line, performs well in other cell lines.
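The conversion of a per-base DNA shape profile into fixed-length summary statistics can be sketched with the Python standard library. The function name `summarize_profile` and the toy minor groove width values are illustrative assumptions; the sketch covers only a subset of the statistics listed above and is not the pTADS code.

```python
import statistics

def summarize_profile(values):
    """Collapse a per-base profile into fixed-length summary statistics."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "sd": statistics.stdev(values),       # sample standard deviation
        "pop_sd": statistics.pstdev(values),  # population standard deviation
    }

# Made-up per-base minor groove width values for one candidate region.
toy_minor_groove_width = [4.0, 5.0, 6.0, 5.0]
summary = summarize_profile(toy_minor_groove_width)
```

Summaries of this kind give every region the same number of features regardless of its length, which is what a RF or boosted-tree model requires as input.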

A/B compartment prediction

Initial observations that genome annotations may be associated with 3D chromatin interactions came from the landmark 2009 paper of Lieberman-Aiden, E. et al. The authors demonstrated that chromatin interactions exhibit one of two long-range contact patterns, dubbed the A and B compartments, suggesting the presence of two spatial neighborhoods. These compartments are typically identified with a principal component analysis (PCA) on the observed over expected count correlation matrices, where the positive/negative values of the first eigenvector are associated with A/B compartments, respectively [5]. The A compartment shows enrichment in expressed genes, DNAse hypersensitive sites, H3K36me3 and H3K27me3 histone modifications. The B compartment shows the opposite enrichment patterns [155–157]. A/B compartments partition a genome into roughly similar proportions of active and inactive regions, thus alleviating the class imbalance problem. These compartments are further refined into the five primary Hi-C subcompartments (A1, A2, B1, B2, and B3, plus B4 on chr19) [8], and further into the 10 SPIN states that have distinct associations with genomic annotations [158]. The A/B compartment classification is frequently used to characterize changes in a genome’s activity between conditions [159,160].
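The eigenvector-based idea can be illustrated with a simplified Python sketch. It correlates the rows of a toy observed/expected (O/E) matrix and extracts the leading eigenvector of the correlation matrix by power iteration, then splits bins by sign. Taking the eigenvector of the correlation matrix directly is a simplification of a full PCA. The function names and the toy matrix are assumptions, and deciding which sign corresponds to the A compartment requires an external track (for example gene density) that is not modeled here.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(rows):
    """Row-by-row Pearson correlation matrix."""
    return [[pearson(a, b) for b in rows] for a in rows]

def leading_eigenvector(matrix, n_iter=100):
    """Unit-length leading eigenvector of a symmetric matrix (power iteration)."""
    n = len(matrix)
    v = [1.0] + [0.0] * (n - 1)  # arbitrary start vector
    for _ in range(n_iter):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy O/E matrix (made up): bins 0-1 and bins 2-3 interact preferentially.
toy_oe = [
    [2.0, 1.8, 0.5, 0.4],
    [1.8, 2.0, 0.6, 0.5],
    [0.5, 0.6, 2.0, 1.7],
    [0.4, 0.5, 1.7, 2.0],
]
ev1 = leading_eigenvector(correlation_matrix(toy_oe))
groups = [1 if x > 0 else -1 for x in ev1]  # sign defines the two groups
```

In this toy case the first two bins fall into one group and the last two into the other, mirroring the plaid pattern that motivates the two-compartment model.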

Although A/B compartments can be directly predicted from Hi-C data, high-resolution cell type-specific Hi-C data is not always available. This prompted methods development for A/B compartment predictions using broadly available genomic annotations. Fortin and Hansen [161] demonstrated that cell type-specific A/B compartments can be reliably detected from methylation and open chromatin data. Using an approach similar to the original A/B detection methodology (eigenvector analysis of the methylation correlation matrix), they showed the first eigenvector generally corresponds well to the first eigenvector obtained from the observed-to-expected normalized genome contact matrix. These observations link methylation and chromatin accessibility with the formation of higher-order chromatin structures and were consistent across different platforms, e.g., when using Illumina 27K and 450K methylation data, DNAse hypersensitive sites, single-cell ATAC-seq, and whole-genome bisulfite sequencing.

Moore, B. L. et al. developed 3DGenome, a RF model to predict A/B compartments from co-localization with 22 TFBSs and 10 histone marks [162]. This model predicts eigenvectors used to segment a genome into A/B compartments. It outperforms regression-based approaches in cross-cell-line evaluation settings, according to the correlation with the Hi-C derived eigenvectors and AUROC. The authors showed that boundaries of the predicted A/B compartments were enriched in CTCF, YY1, H2A.Z, H3K4me2 and other marks and that these enrichments were cell type-specific.

Di Pierro, M. et al. developed MEGABASE, a neural network that uses ChIP-seq genome annotation data (discretized signal of 84 TFBSs and 11 histone marks) to predict A1, A2, B1, B2, B3, B4 subcompartments [163]. These predictions were subsequently used as an input to an energy landscape model for chromatin organization (MiChroM) to predict the 3D structure of individual chromosomes. The model further demonstrated that genome annotations “encode sufficient information to determine the global architecture of chromosomes and that de novo structure prediction for whole genomes may be increasingly possible”.

Notably, Hi-C data itself contains sufficient information to predict A1, A2, B1, B2, B3, B4 subcompartments. Xiong & Ma developed SNIPER [164], a deep neural network consisting of a 9-layer denoising autoencoder and a multi-layer perceptron to predict subcompartments. Trained on Hi-C data from the GM12878 cell line, it outperforms the genome annotation-based predictions of a Gaussian HMM and of MEGABASE. SNIPER-derived predictions correlate well with histone mark signal, replication timing, RNA-seq, and TSA-seq which further supports the association of higher-order chromatin structures with genome annotation data.

Discussion

Besides the canonical EPIs, chromatin loops, TAD boundaries, and A/B compartment predictions, other higher-order chromatin structures and their association with genome annotation data remain to be explored. These include, but are not limited to, differential TAD/loop boundaries, hierarchical TAD boundaries (sub-TADs nested within larger TADs) [165–169], ultra-high resolution 3D genomic features [157], and multi-way chromatin interactions [126,130].

The importance of differential TAD/loop boundaries lies in the fact that 3D genome organization varies between cell types [5,8], across different stages of the cell cycle [170–172], and even within homogeneous populations of synchronized cells [172]. Approximately 60–80% of TADs remain invariant across cell types (constitutive) [173], while the others correspond to flexible structures (facultative) [174,6]. Differential TAD/loop boundaries can be detected by a simple overlap analysis or in a statistically principled way [175–179]. In addition, other differential structures, such as differential A/B compartments [159] and differential stripes [180], can be better defined. Despite the biological relevance and interpretability of such 3D changes, methods for predicting them, and consequently the knowledge of associated genomic features, remain undeveloped and represent a highly promising direction for future investigation.
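The simple overlap analysis mentioned above can be sketched in Python as follows. Boundaries from two conditions are matched when they lie within a tolerance of each other and are then labeled conserved, lost, or gained. The function name `classify_boundaries`, the one-bin tolerance, and the toy boundary positions are illustrative assumptions rather than the procedure of any cited tool.

```python
def classify_boundaries(cond_a, cond_b, tolerance=1):
    """Label boundaries as conserved, lost, or gained between two conditions.

    Boundaries are bin indices; two boundaries match when they lie within
    `tolerance` bins of each other (an assumed matching rule).
    """
    def has_match(boundary, others):
        return any(abs(boundary - o) <= tolerance for o in others)

    conserved = [b for b in cond_a if has_match(b, cond_b)]
    lost = [b for b in cond_a if not has_match(b, cond_b)]
    gained = [b for b in cond_b if not has_match(b, cond_a)]
    return {"conserved": conserved, "lost": lost, "gained": gained}

# Toy boundary bin indices for two conditions (made up for illustration).
result = classify_boundaries([10, 25, 40, 70], [11, 40, 55])
```

An overlap rule like this ignores boundary strength and sampling noise, which is why the statistically principled alternatives cited above may be preferable when replicates are available.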

Different types of TAD/loop boundaries are similarly not considered by prediction methods. Not all TAD boundaries are created equal [181] and TAD boundary strength correlates with occupancy of architectural proteins [149,150]. TADs can form hierarchies, with sub-TADs nested within larger TADs [165,166,8,167,168]. With the advent of ultra-high-resolution technologies such as Micro-C (nucleosome-resolution) [182], we anticipate the development of methods to predict the strength and hierarchy of TAD/loop boundaries.

Prediction of multi-way chromatin interactions has been hindered by limited technology and data availability. Several recently developed technologies started to close this gap. Genome Architecture Mapping (GAM) [183] quantifies chromatin contacts by sequencing DNA from a set of ultrathin nuclear sections at random orientations. Trac-looping [184] captures multiscale contacts by inserting a transposon linker between interacting regions. DNA SPRITE [185] follows a split-pool procedure to assign unique barcodes to individual complexes, with read pairs sharing identical barcodes capturing multi-way chromatin interactions. As the multi-way chromatin interaction data and standards for its representation become more developed, we anticipate the development of methods predicting multi-way interactions.

The utility of DNA sequence features for chromatin interaction predictions remains to be explored [186]. Besides one-hot encoding and k-mer representations of DNA sequence, recent tools introduce additional approaches to feature extraction. The iLearnPlus [187] web-based tool includes 147 unique feature sets capturing various properties of DNA/RNA/protein sequences, as well as 21 machine-learning algorithms with 7 deep-learning approaches for their extraction, clustering, normalization, and predictor construction. It remains to be seen how informative such features are for predicting TAD/loop boundaries or other chromatin interaction patterns.

Another promising avenue for improving predictions of 3D organization is by using gene expression data. It has been shown that distal methylation, open chromatin regions, TFBSs, and histone modifications can predict gene expression [188–190]. Conversely, incorporating gene expression in chromatin network analysis can improve active enhancer prediction [191].

The pervasive problem with many chromatin interaction prediction methods remains the lack of consensus on “ground truth” or “gold standard” for reference chromatin interaction data. Some studies use databases such as FANTOM, while others rely on external tools that define chromatin interactions from Hi-C maps. A more biologically justifiable approach is to use experimentally obtained data such as proximity-ligation ChIA-PET [192], PLAC-Seq [193], HiChIP [194], Capture-C [195], and/or Capture Hi-C [196]. Recently developed high-throughput imaging approaches such as STORM [197] and HiFISH [198] can directly measure spatial distances at the single-cell level. To increase the robustness and interpretability of chromatin interaction prediction methods, viable methods must define ground truth chromatin interactions with imaging approaches.

Summary

In this review, we summarize the use of machine learning methods to predict various 3D features of genomes. These methods draw on data from different sources such as DNA sequence features, genome annotation data, histone modifications, and DNAse hypersensitivity sites. We cover methods for predicting promoter-enhancer interactions, chromatin looping interactions, TAD boundaries, and A/B compartments. We discuss the challenges associated with this type of analysis including class imbalance and the need for careful validation. We also highlight the importance of using appropriate metrics to assess performance. Overall, we provide a comprehensive overview of the current state-of-the-art machine learning tools to gain insights into the 3D organization and function of genomes.

Funding

This work was supported in part by the George and Lavinia Blick Research Scholarship to MD and the NIH/NCI (R01CA246182, R21CA273779) grants to JCH.

Footnotes

Conflict of Interest. None.

References

  • 1.Maston GA, Evans SK, Green MR: Transcriptional regulatory elements in the human genome. Annu Rev Genomics Hum Genet 2006, 7:29–59 10.1146/annurev.genom.7.080505.115623. [DOI] [PubMed] [Google Scholar]
  • 2.Dekker J, Rippe K, Dekker M, Kleckner N: Capturing chromosome conformation. Science 2002, 295:1306–11 10.1126/science.1067799. [DOI] [PubMed] [Google Scholar]
  • 3.Denker A, Laat W de: The second decade of 3C technologies: Detailed insights into nuclear organization. Genes Dev 2016, 30:1357–82 10.1101/gad.281964.116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Wit E de, Laat W de: A decade of 3C technologies: Insights into nuclear organization. Genes Dev 2012, 26:11–24 10.1101/gad.179804.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Lieberman-Aiden E, Berkum NL van, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES, Dekker: Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009, 326:289–93 10.1126/science.1181369. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS, Ren B: Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 2012, 485:376–80 10.1038/nature11082. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Nora EP, Lajoie BR, Schulz EG, Giorgetti L, Okamoto I, Servant N, Piolot T, Berkum NL van, Meisig J, Sedat J, Gribnau J, Barillot E, Blüthgen N, Dekker J, Heard E: Spatial partitioning of the regulatory landscape of the x-inactivation centre. Nature 2012, 485:381–5 10.1038/nature11049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Rao SSP, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES, Aiden EL: A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 2014, 159:1665–80 10.1016/j.cell.2014.11.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Bonev B, Cavalli G: Organization and function of the 3D genome. Nat Rev Genet 2016, 17:661–678 10.1038/nrg.2016.112. [DOI] [PubMed] [Google Scholar]
  • 10.Li G, Ruan X, Auerbach RK, Sandhu KS, Zheng M, Wang P, Poh HM, Goh Y, Lim J, Zhang J, Sim HS, Peh SQ, Mulawadi FH, Ong CT, Orlov YL, Hong S, Zhang Z, Landt S, Raha D, Euskirchen G, Wei C-L, Ge W, Wang H, Davis C, Fisher-Aylor KI, Mortazavi A, Gerstein M, Gingeras T, Wold B, Sun Y, et al. : Extensive promoter-centered chromatin interactions provide a topological basis for transcription regulation. Cell 2012, 148:84–98 10.1016/j.cell.2011.12.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Fudenberg G, Imakaev M, Lu C, Goloborodko A, Abdennur N, Mirny LA: Formation of chromosomal domains by loop extrusion. Cell Rep 2016, 15:2038–49 10.1016/j.celrep.2016.04.085. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Nora EP, Goloborodko A, Valton A-L, Gibcus JH, Uebersohn A, Abdennur N, Dekker J, Mirny LA, Bruneau BG: Targeted degradation of CTCF decouples local insulation of chromosome domains from genomic compartmentalization. Cell 2017, 169:930–944.e22 10.1016/j.cell.2017.05.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Rao SSP, Huang S-C, Glenn St Hilaire B, Engreitz JM, Perez EM, Kieffer-Kwon K-R, Sanborn AL, Johnstone SE, Bascom GD, Bochkov ID, Huang X, Shamim MS, Shin J, Turner D, Ye Z, Omer AD, Robinson JT, Schlick T, Bernstein BE, Casellas R, Lander ES, Aiden EL: Cohesin loss eliminates all loop domains. Cell 2017, 171:305–320.e2410.1016/j.cell.2017.09.026Available: 10.1016/j.cell.2017.09.026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Tang Z, Luo OJ, Li X, Zheng M, Zhu JJ, Szalaj P, Trzaskoma P, Magalska A, Wlodarczyk J, Ruszczycki B, Michalski P, Piecuch E, Wang P, Wang D, Tian SZ, Penrad-Mobayed M, Sachs LM, Ruan X, Wei C-L, Liu ET, Wilczynski GM, Plewczynski D, Li G, Ruan Y: CTCF-mediated human 3D genome architecture reveals chromatin topology for transcription. Cell 2015, 163:1611–27 10.1016/j.cell.2015.11.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Lupianez DG, Kraft K, Heinrich V, Krawitz P, Brancati F, Klopocki E, Horn D, Kayserili H, Opitz JM, Laxova R, Santos-Simarro F, Gilbert-Dussardier B, Wittler L, Borschiwer M, Haas SA, Osterwalder M, Franke M, Timmermann B, Hecht J, Spielmann M, Visel A, Mundlos S: Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions. Cell 2015, 161:1012–1025 10.1016/j.cell.2015.04.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Franke M, Ibrahim DM, Andrey G, Schwarzer W, Heinrich V, Schöpflin R, Kraft K, Kempfer R, Jerković I, Chan W-L, Spielmann M, Timmermann B, Wittler L, Kurth I, Cambiaso P, Zuffardi O, Houge G, Lambie L, Brancati F, Pombo A, Vingron M, Spitz F, Mundlos S: Formation of new chromatin domains determines pathogenicity of genomic duplications. Nature 2016, 538:265–269 10.1038/nature19800. [DOI] [PubMed] [Google Scholar]
  • 17.Gibcus JH, Dekker J: The hierarchy of the 3D genome. Mol Cell 2013, 49:773–82 10.1016/j.molcel.2013.02.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Lupianez DG, Spielmann M, Mundlos S: Breaking TADs: How alterations of chromatin domains result in disease. Trends Genet 2016, 32:225–37 10.1016/j.tig.2016.01.003. [DOI] [PubMed] [Google Scholar]
  • 19.Taberlay PC, Achinger-Kawecka J, Lun ATL, Buske FA, Sabir K, Gould CM, Zotenko E, Bert SA, Giles KA, Bauer DC, Smyth GK, Stirzaker C, O’Donoghue SI, Clark SJ: Three-dimensional disorganization of the cancer genome occurs coincident with long-range genetic and epigenetic alterations. Genome Res 2016, 26:719–31 10.1101/gr.201517.115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Hnisz D, Weintraub AS, Day DS, Valton A-L, Bak RO, Li CH, Goldmann J, Lajoie BR, Fan ZP, Sigova AA, Reddy J, Borges-Rivera D, Lee TI, Jaenisch R, Porteus MH, Dekker J, Young RA: Activation of proto-oncogenes by disruption of chromosome neighborhoods. Science 2016, 351:1454–8 10.1126/science.aad9024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Spielmann M, Lupianez DG, Mundlos S: Structural variation in the 3D genome. Nat Rev Genet 2018, 10.1038/s41576-018-0007-0. [DOI] [PubMed] [Google Scholar]
  • 22.Dixon JR, Xu J, Dileep V, Zhan Y, Song F, Le VT, Yardımcı GG, Chakraborty A, Bann DV, Wang Y, Clark R, Zhang L, Yang H, Liu T, Iyyanki S, An L, Pool C, Sasaki T, Rivera-Mulia JC, Ozadam H, Lajoie BR, Kaul R, Buckley M, Lee K, Diegel M, Pezic D, Ernst C, Hadjur S, Odom DT, Stamatoyannopoulos JA, et al. : Integrative detection and analysis of structural variation in cancer genomes. Nat Genet 2018, 50:1388–1398 10.1038/s41588-018-0195-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Sun JH, Zhou L, Emerson DJ, Phyo SA, Titus KR, Gong W, Gilgenast TG, Beagan JA, Davidson BL, Tassone F, Phillips-Cremins JE: Disease-associated short tandem repeats co-localize with chromatin domain boundaries. Cell 2018, 175:224–238.e15 10.1016/j.cell.2018.08.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Meaburn KJ, Cabuy E, Bonne G, Levy N, Morris GE, Novelli G, Kill IR, Bridger JM: Primary laminopathy fibroblasts display altered genome organization and apoptosis. Aging Cell 2007, 6:139–53 10.1111/j.1474-9726.2007.00270.x. [DOI] [PubMed] [Google Scholar]
  • 25.Dixon JR, Jung I, Selvaraj S, Shen Y, Antosiewicz-Bourget JE, Lee AY, Ye Z, Kim A, Rajagopal N, Xie W, Diao Y, Liang J, Zhao H, Lobanenkov VV, Ecker JR, Thomson JA, Ren B: Chromatin architecture reorganization during stem cell differentiation. Nature 2015, 518:331–6 10.1038/nature14222. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Belaghzal H, Dekker J, Gibcus JH: Hi-C 2.0: An optimized Hi-C procedure for high-resolution genome-wide mapping of chromosome conformation. Methods 2017, 123:56–65 10.1016/j.ymeth.2017.04.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Schmitt AD, Hu M, Ren B: Genome-wide mapping and analysis of chromosome architecture. Nat Rev Mol Cell Biol 2016, 17:743–755 10.1038/nrm.2016.104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Jin F, Li Y, Dixon JR, Selvaraj S, Ye Z, Lee AY, Yen C-A, Schmitt AD, Espinoza CA, Ren B: A high-resolution map of the three-dimensional chromatin interactome in human cells. Nature 2013, 503:290–4 10.1038/nature12644. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Bonev B, Mendelson Cohen N, Szabo Q, Fritsch L, Papadopoulos GL, Lubling Y, Xu X, Lv X, Hugnot J-P, Tanay A, Cavalli G: Multiscale 3D genome rewiring during mouse neural development. Cell 2017, 171:557–572.e24 10.1016/j.cell.2017.09.043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Whalen S, Truty RM, Pollard KS: Enhancer-promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat Genet 2016, 48:488–96 10.1038/ng.3539. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Dekker J, Mirny L: The 3D genome as moderator of chromosomal communication. Cell 2016, 164:1110–21 10.1016/j.cell.2016.02.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Wang S, Su J-H, Beliveau BJ, Bintu B, Moffitt JR, Wu C, Zhuang X: Spatial organization of chromatin domains and compartments in single chromosomes. Science 2016, 353:598–602 10.1126/science.aaf8084. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Tao H, Li H, Xu K, Hong H, Jiang S, Du G, Wang J, Sun Y, Huang X, Ding Y, Li F, Zheng X, Chen H, Bo X: Computational methods for the prediction of chromatin interaction and organization using sequence and epigenomic profiles. Brief Bioinform 2021, 22 10.1093/bib/bbaa405. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Belokopytova P, Fishman V: Predicting genome architecture: Challenges and solutions. Front Genet 2020, 11:617202. 10.3389/fgene.2020.617202. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Xu H, Zhang S, Yi X, Plewczynski D, Li MJ: Exploring 3D chromatin contacts in gene regulation: The evolution of approaches for the identification of functional enhancer-promoter interaction. Comput Struct Biotechnol J 2020, 18:558–570 10.1016/j.csbj.2020.02.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Phan LT, Oh C, He T, Manavalan B: A comprehensive revisit of the machine-learning tools developed for the identification of enhancers in the human genome. Proteomics 2023, 23:e2200409. 10.1002/pmic.202200409. [DOI] [PubMed] [Google Scholar]
  • 37.Berger SL, Kouzarides T, Shiekhattar R, Shilatifard A: An operational definition of epigenetics. Genes Dev 2009, 23:781–3 10.1101/gad.1787609. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Zhang Y, McCord RP, Ho Y-J, Lajoie BR, Hildebrand DG, Simon AC, Becker MS, Alt FW, Dekker J: Spatial organization of the mouse genome and its role in recurrent chromosomal translocations. Cell 2012, 148:908–21 10.1016/j.cell.2012.02.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Chen H, Tian Y, Shu W, Bo X, Wang S: Comprehensive identification and annotation of cell type-specific and ubiquitous CTCF-binding sites in the human genome. PLoS One 2012, 7:e41374 10.1371/journal.pone.0041374. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Splinter E, Heath H, Kooren J, Palstra R-J, Klous P, Grosveld F, Galjart N, Laat W de: CTCF mediates long-range chromatin looping and local histone modification in the beta-globin locus. Genes Dev 2006, 20:2349–54 10.1101/gad.399506. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Phillips JE, Corces VG: CTCF: Master weaver of the genome. Cell 2009, 137:1194–211 10.1016/j.cell.2009.06.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Hong S, Kim D: Computational characterization of chromatin domain boundary-associated genomic elements. Nucleic Acids Res 2017, 45:10403–10414 10.1093/nar/gkx738. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Ciabrelli F, Cavalli G: Chromatin-driven behavior of topologically associating domains. Journal of Molecular Biology 2015, 427:608–625 10.1016/j.jmb.2014.09.013. [DOI] [PubMed] [Google Scholar]
  • 44.Filippova D, Patro R, Duggal G, Kingsford C: Identification of alternative topological domains in chromatin. Algorithms Mol Biol 2014, 9:14. 10.1186/1748-7188-9-14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Eisenberg E, Levanon EY: Human housekeeping genes, revisited. Trends Genet 2013, 29:569–74 10.1016/j.tig.2013.05.010. [DOI] [PubMed] [Google Scholar]
  • 46.Sanborn AL, Rao SSP, Huang S-C, Durand NC, Huntley MH, Jewett AI, Bochkov ID, Chinnappan D, Cutkosky A, Li J, Geeting KP, Gnirke A, Melnikov A, McKenna D, Stamenova EK, Lander ES, Aiden EL: Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proc Natl Acad Sci U S A 2015, 112:E6456–65 10.1073/pnas.1518552112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Xu C, Corces VG: Towards a predictive model of chromatin 3D organization. Semin Cell Dev Biol 2016, 57:24–30 10.1016/j.semcdb.2015.11.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Stunnenberg HG, International Human Epigenome Consortium, Hirst M: The international human epigenome consortium: A blueprint for scientific collaboration and discovery. Cell 2016, 167:1145–1149 10.1016/j.cell.2016.11.007. [DOI] [PubMed] [Google Scholar]
  • 49.Stilianoudakis SC, Dozmorov MG: preciseTAD: A machine learning framework for precise 3D domain boundary prediction at base-level resolution. bioRxiv 2020, 2020.09.03.282186 10.1101/2020.09.03.282186. Available: http://biorxiv.org/content/early/2020/09/29/2020.09.03.282186.abstract. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Whalen S, Schreiber J, Noble WS, Pollard KS: Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet 2022, 23:169–181 10.1038/s41576-021-00434-9. [DOI] [PubMed] [Google Scholar]
  • 51.Buda M, Maki A, Mazurowski MA: A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks 2018, 106:249–259. [DOI] [PubMed] [Google Scholar]
  • 52.He H, Garcia EA: Learning from imbalanced data. IEEE Transactions on knowledge and data engineering 2009, 21:1263–1284. [Google Scholar]
  • 53.Elkan C: The foundations of cost-sensitive learning. In International joint conference on artificial intelligence Lawrence Erlbaum Associates Ltd; Vol. 17 2001:973–978. [Google Scholar]
  • 54.Japkowicz N, Stephen S: The class imbalance problem: A systematic study. Intelligent Data Analysis 2002:429–449. [Google Scholar]
  • 55.Bugnon LA, Yones C, Milone DH, Stegmayer G: Deep neural architectures for highly imbalanced data in bioinformatics. IEEE Transactions on Neural Networks and Learning Systems 2019:1–11 10.1109/TNNLS.2019.2914471. [DOI] [PubMed] [Google Scholar]
  • 56.Schreiber J, Singh R, Bilmes J, Noble WS: A pitfall for machine learning methods aiming to predict across cell types. Genome Biol 2020, 21:282. 10.1186/s13059-020-02177-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Hanley JA, McNeil BJ: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143:29–36 10.1148/radiology.143.1.7063747. [DOI] [PubMed] [Google Scholar]
  • 58.Saito T, Rehmsmeier M: The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 2015, 10:e0118432. 10.1371/journal.pone.0118432. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Davis J, Goadrich M: The relationship between precision-recall and ROC curves. In Proceedings of the 23rd international conference on machine learning 2006:233–240. [Google Scholar]
  • 60.Goutte C, Gaussier E: A probabilistic interpretation of precision, recall and f-score, with implication for evaluation. In European conference on information retrieval Springer; 2005:345–359. [Google Scholar]
  • 61.Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction algorithms for classification: An overview. Bioinformatics 2000, 16:412–24 10.1093/bioinformatics/16.5.412. [DOI] [PubMed] [Google Scholar]
  • 62.Seliya N, Khoshgoftaar TM, Hulse JV: A study on the relationships of classifier performance metrics. In 2009 21st IEEE international conference on tools with artificial intelligence 2009:59–66. [Google Scholar]
  • 63.Xi W, Beer MA: Local epigenomic state cannot discriminate interacting and non-interacting enhancer-promoter pairs with high accuracy. PLoS Comput Biol 2018, 14:e1006625. 10.1371/journal.pcbi.1006625. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Won K-J, Chepelev I, Ren B, Wang W: Prediction of regulatory elements in mammalian genomes using chromatin signatures. BMC Bioinformatics 2008, 9:547. 10.1186/1471-2105-9-547. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Ernst J, Kellis M: ChromHMM: Automating chromatin-state discovery and characterization. Nat Methods 2012, 9:215–6 10.1038/nmeth.1906. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Noble WS: Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods 2012, 9:473–6 10.1038/nmeth.1937. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Zhang Y, Hardison RC: Accurate and reproducible functional maps in 127 human cell types via 2D genome segmentation. Nucleic Acids Res 2017, 45:9823–9836 10.1093/nar/gkx659. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Arneson A, Ernst J: Systematic discovery of conservation states for single-nucleotide annotation of the human genome. Commun Biol 2019, 2:248. 10.1038/s42003-019-0488-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Firpi HA, Ucar D, Tan K: Discover regulatory DNA elements using chromatin signatures and artificial neural network. Bioinformatics 2010, 26:1579–86 10.1093/bioinformatics/btq248. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Fernández M, Miranda-Saavedra D: Genome-wide enhancer prediction from epigenetic signatures using genetic algorithm-optimized support vector machines. Nucleic Acids Res 2012, 40:e77. 10.1093/nar/gks149. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Rajagopal N, Xie W, Li Y, Wagner U, Wang W, Stamatoyannopoulos J, Ernst J, Kellis M, Ren B: RFECS: A random-forest based algorithm for enhancer identification from chromatin state. PLoS Comput Biol 2013, 9:e1002968. 10.1371/journal.pcbi.1002968. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Lee D, Karchin R, Beer MA: Discriminative prediction of mammalian enhancers from DNA sequence. Genome Res 2011, 21:2167–80 10.1101/gr.121905.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Taher L, Narlikar L, Ovcharenko I: CLARE: Cracking the LAnguage of regulatory elements. Bioinformatics 2012, 28:581–3 10.1093/bioinformatics/btr704. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Ghandi M, Lee D, Mohammad-Noori M, Beer MA: Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput Biol 2014, 10:e1003711. 10.1371/journal.pcbi.1003711. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Lee D: LS-GKM: A new gkm-SVM for large-scale datasets. Bioinformatics 2016, 32:2196–8 10.1093/bioinformatics/btw142. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Erwin GD, Oksenberg N, Truty RM, Kostka D, Murphy KK, Ahituv N, Pollard KS, Capra JA: Integrating diverse datasets improves developmental enhancer prediction. PLoS Comput Biol 2014, 10:e1003677. 10.1371/journal.pcbi.1003677. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Xu T, Zheng X, Li B, Jin P, Qin Z, Wu H: A comprehensive review of computational prediction of genome-wide features. Brief Bioinform 2018, 10.1093/bib/bby110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Dozmorov MG: Epigenomic annotation-based interpretation of genomic data: From enrichment analysis to machine learning. Bioinformatics 2017, 33:3323–3330 10.1093/bioinformatics/btx414. [DOI] [PubMed] [Google Scholar]
  • 79.He B, Chen C, Teng L, Tan K: Global view of enhancer-promoter interactome in human cells. Proc Natl Acad Sci U S A 2014, 111:E2191–9 10.1073/pnas.1320308111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Zhu Y, Chen Z, Zhang K, Wang M, Medovoy D, Whitaker JW, Ding B, Li N, Zheng L, Wang W: Constructing 3D interaction maps from 1D epigenomes. Nat Commun 2016, 7:10812. 10.1038/ncomms10812. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Roy S, Siahpirani AF, Chasman D, Knaack S, Ay F, Stewart R, Wilson M, Sridharan R: A predictive modeling approach for cell line-specific long-range regulatory interactions. Nucleic Acids Res 2015, 43:8694–712 10.1093/nar/gkv865. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Zhao C, Li X, Hu H: PETModule: A motif module based approach for enhancer target gene prediction. Sci Rep 2016, 6:30043. 10.1038/srep30043. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Cao J, Packer JS, Ramani V, Cusanovich DA, Huynh C, Daza R, Qiu X, Lee C, Furlan SN, Steemers FJ, Adey A, Waterston RH, Trapnell C, Shendure J: Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 2017, 357:661–667 10.1126/science.aam8940. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Talukder A, Saadat S, Li X, Hu H: EPIP: A novel approach for condition-specific enhancer-promoter interaction prediction. Bioinformatics 2019, 35:3877–3883 10.1093/bioinformatics/btz641. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Belokopytova PS, Nuriddinov MA, Mozheiko EA, Fishman D, Fishman V: Quantitative prediction of enhancer-promoter interactions. Genome Res 2019, 10.1101/gr.249367.119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Tang L, Hill MC, Wang J, Wang J, Martin JF, Li M: Predicting unrecognized enhancer-mediated genome topology by an ensemble machine learning model. Genome Res 2020, 30:1835–1845 10.1101/gr.264606.120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87.Yang Y, Zhang R, Singh S, Ma J: Exploiting sequence-based features for predicting enhancer-promoter interactions. Bioinformatics 2017, 33:i252–i260 10.1093/bioinformatics/btx257. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88.Mao W, Kostka D, Chikina M: Modeling enhancer-promoter interactions with attention-based neural networks. bioRxiv 2017, 219667 10.1101/219667. Available: http://biorxiv.org/content/early/2017/11/14/219667.abstract. [Google Scholar]
  • 89.Singh S, Yang Y, Póczos B, Ma J: Predicting enhancer-promoter interaction from genomic sequence with deep neural networks. Quantitative Biology 2019, 7:122–137 10.1007/s40484-019-0154-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Zhuang Z, Shen X, Pan W: A simple convolutional neural network for prediction of enhancer-promoter interactions with DNA sequence data. Bioinformatics 2019, 35:2899–2906 10.1093/bioinformatics/bty1050. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91.Zeng W, Wu M, Jiang R: Prediction of enhancer-promoter interactions via natural language processing. BMC Genomics 2018, 19:84. 10.1186/s12864-018-4459-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Li W, Wong WH, Jiang R: DeepTACT: Predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res 2019, 47:e60. 10.1093/nar/gkz167. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93.Jing F, Zhang S-W, Zhang S: Prediction of enhancer-promoter interactions using the cross-cell type information and domain adversarial neural network. BMC Bioinformatics 2020, 21:507. 10.1186/s12859-020-03844-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Zhang M, Hu Y, Zhu M: EPIsHilbert: Prediction of enhancer-promoter interactions via Hilbert curve encoding and transfer learning. Genes (Basel) 2021, 12 10.3390/genes12091385. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95.Min X, Ye C, Liu X, Zeng X: Predicting enhancer-promoter interactions by deep learning and matching heuristic. Brief Bioinform 2021, 22 10.1093/bib/bbaa254. [DOI] [PubMed] [Google Scholar]
  • 96.Liu S, Xu X, Yang Z, Zhao X, Liu S, Zhang W: EPIHC: Improving enhancer-promoter interaction prediction by using hybrid features and communicative learning. IEEE/ACM Trans Comput Biol Bioinform 2022, 19:3435–3443 10.1109/TCBB.2021.3109488. [DOI] [PubMed] [Google Scholar]
  • 97.Rapakoulia T, Lopez Ruiz De Vargas S, Omgba PA, Laupert V, Ulitsky I, Vingron M: CENTRE: A gradient boosting algorithm for cell-type-specific ENhancer-target pREdiction. Bioinformatics 2023, 39 10.1093/bioinformatics/btad687. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 98.Agarwal A, Chen L: DeepPHiC: Predicting promoter-centered chromatin interactions using a novel deep learning approach. Bioinformatics 2023, 39 10.1093/bioinformatics/btac801. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Corradin O, Saiakhova A, Akhtar-Zaidi B, Myeroff L, Willis J, Cowper-Sal·lari R, Lupien M, Markowitz S, Scacheri PC: Combinatorial effects of multiple enhancer variants in linkage disequilibrium dictate levels of gene expression to confer susceptibility to common traits. Genome Res 2014, 24:1–13 10.1101/gr.164079.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Cao Q, Anyansi C, Hu X, Xu L, Xiong L, Tang W, Mok MTS, Cheng C, Fan X, Gerstein M, Cheng ASL, Yip KY: Reconstruction of enhancer-target networks in 935 samples of human primary cells, tissues and cell lines. Nat Genet 2017, 49:1428–1436 10.1038/ng.3950. [DOI] [PubMed] [Google Scholar]
  • 101.Zheng L, Liu L, Zhu W, Ding Y, Wu F: Predicting enhancer-promoter interaction based on epigenomic signals. Front Genet 2023, 14:1133775. 10.3389/fgene.2023.1133775. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Friedman JH: Greedy function approximation: A gradient boosting machine. Ann Statist 2001, 29:1189–1232 10.1214/aos/1013203451. Available: https://projecteuclid.org:443/euclid.aos/1013203451. [Google Scholar]
  • 103.Huang J, Marco E, Pinello L, Yuan G-C: Predicting chromatin organization using histone marks. Genome Biol 2015, 16:162. 10.1186/s13059-015-0740-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.Chen Y, Wang Y, Xuan Z, Chen M, Zhang MQ: De novo deciphering three-dimensional chromatin interaction and topological domains by wavelet transformation of epigenetic profiles. Nucleic Acids Res 2016, 44:e106. 10.1093/nar/gkw225. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105.Al Bkhetan Z, Plewczynski D: Three-dimensional epigenome statistical model: Genome-wide chromatin looping prediction. Sci Rep 2018, 8:5217. 10.1038/s41598-018-23276-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 106.Kai Y, Andricovich J, Zeng Z, Zhu J, Tzatsos A, Peng W: Predicting CTCF-mediated chromatin interactions by integrating genomic and epigenomic features. Nat Commun 2018, 9:4221. 10.1038/s41467-018-06664-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107.Zhang S, Chasman D, Knaack S, Roy S: In silico prediction of high-resolution Hi-C interaction matrices. Nat Commun 2019, 10:5449. 10.1038/s41467-019-13423-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108.Baur B, Shin J, Schreiber J, Zhang S, Zhang Y, Manjunath M, Song JS, Stafford Noble W, Roy S: Leveraging epigenomes and three-dimensional genome organization for interpreting regulatory variation. PLoS Comput Biol 2023, 19:e1011286. 10.1371/journal.pcbi.1011286. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109.Ibn-Salem J, Andrade-Navarro MA: 7C: Computational chromosome conformation capture by correlation of ChIP-seq at CTCF motifs. BMC Genomics 2019, 20:777. 10.1186/s12864-019-6088-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110.Yang R, Das A, Gao VR, Karbalayghareh A, Noble WS, Bilmes JA, Leslie CS: Epiphany: Predicting Hi-C contact maps from 1D epigenomic signals. Genome Biol 2023, 24:134. 10.1186/s13059-023-02934-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 111.Jaroszewicz A, Ernst J: An integrative approach for fine-mapping chromatin interactions. Bioinformatics 2019, 10.1093/bioinformatics/btz843. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 112.Zhang Z, Feng F, Qiu Y, Liu J: A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome. Nucleic Acids Res 2023, 51:5931–5947 10.1093/nar/gkad436. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 113.Nikumbh S, Pfeifer N: Genetic sequence-based prediction of long-range chromatin interactions suggests a potential role of short tandem repeat sequences in genome organization. BMC Bioinformatics 2017, 18:218. 10.1186/s12859-017-1624-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 114.Fudenberg G, Kelley DR, Pollard KS: Predicting 3D genome folding from DNA sequence. bioRxiv 2019, 800060 10.1101/800060. Available: http://biorxiv.org/content/early/2019/10/10/800060.abstract. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 115.Schwessinger R, Gosden M, Downes D, Brown R, Telenius J, Teh YW, Lunter G, Hughes JR: DeepC: Predicting chromatin interactions using megabase scaled deep neural networks and transfer learning. bioRxiv 2019, 724005 10.1101/724005. Available: http://biorxiv.org/content/early/2019/08/04/724005.abstract. [Google Scholar]
  • 116.Cao F, Zhang Y, Cai Y, Animesh S, Zhang Y, Akincilar SC, Loh YP, Li X, Chng WJ, Tergaonkar V, Kwoh CK, Fullwood MJ: Chromatin interaction neural network (ChINN): A machine learning-based method for predicting chromatin interactions from DNA sequences. Genome Biol 2021, 22:226. 10.1186/s13059-021-02453-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 117.Zhou J: Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nat Genet 2022, 54:725–734 10.1038/s41588-022-01065-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 118.Chiliński M, Plewczynski D: HiCDiffusion - diffusion-enhanced, transformer-based prediction of chromatin interactions from DNA sequences. bioRxiv 2024, 2024.02.01.578389 10.1101/2024.02.01.578389. Available: http://biorxiv.org/content/early/2024/02/05/2024.02.01.578389.abstract. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 119.Zhang R, Wang Y, Yang Y, Zhang Y, Ma J: Predicting CTCF-mediated chromatin loops using CTCF-MP. Bioinformatics 2018, 34:i133–i141 10.1093/bioinformatics/bty248. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 120.Schreiber J, Libbrecht M, Bilmes J, Noble WS: Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture. bioRxiv 2018, 103614 10.1101/103614. Available: http://biorxiv.org/content/early/2018/07/15/103614.abstract. [Google Scholar]
  • 121.Wang W, Gao L, Ye Y, Gao Y: CCIP: Predicting CTCF-mediated chromatin loops with transitivity. Bioinformatics 2021, 37:4635–4642 10.1093/bioinformatics/btab534. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 122.Zhang P, Wu H: IChrom-deep: An attention-based deep learning model for identifying chromatin interactions. IEEE J Biomed Health Inform 2023, 27:4559–4568 10.1109/JBHI.2023.3292299. [DOI] [PubMed] [Google Scholar]
  • 123.Tang L, Huang W, Hill MC, Ellinor PT, Li M: Fusion neural network (FusNet) for predicting protein-mediated loops. bioRxiv 2023, 2023.06.24.546360 10.1101/2023.06.24.546360. Available: http://biorxiv.org/content/early/2023/06/26/2023.06.24.546360.abstract. [Google Scholar]
  • 124.Tan J, Shenker-Tauris N, Rodriguez-Hernaez J, Wang E, Sakellaropoulos T, Boccalatte F, Thandapani P, Skok J, Aifantis I, Fenyö D, Xia B, Tsirigos A: Cell-type-specific prediction of 3D chromatin organization enables high-throughput in silico genetic screening. Nat Biotechnol 2023, 41:1140–1150 10.1038/s41587-022-01612-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 125.Salameh TJ, Wang X, Song F, Zhang B, Wright SM, Khunsriraksakul C, Yue F: A supervised learning framework for chromatin loop detection in genome-wide contact maps. bioRxiv 2019, 739698 10.1101/739698. Available: http://biorxiv.org/content/early/2019/08/20/739698.abstract. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 126.Zhang R, Ma J: MATCHA: Probing multi-way chromatin interaction with hypergraph representation learning. Cell Syst 2020, 10:397–407.e5 10.1016/j.cels.2020.04.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 127.Wu H, Zhou B, Zhou H, Zhang P, Wang M: Be-1DCNN: A neural network model for chromatin loop prediction based on bagging ensemble learning. Brief Funct Genomics 2023, 22:475–484 10.1093/bfgp/elad015. [DOI] [PubMed] [Google Scholar]
  • 128.Liu T, Wang Z: DeepChIA-PET: Accurately predicting ChIA-PET from Hi-C and ChIP-seq with deep dilated networks. PLoS Comput Biol 2023, 19:e1011307. 10.1371/journal.pcbi.1011307. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 129.Bailey SD, Zhang X, Desai K, Aid M, Corradin O, Cowper-Sal·lari R, Akhtar-Zaidi B, Scacheri PC, Haibe-Kains B, Lupien M: ZNF143 provides sequence specificity to secure chromatin interactions at gene promoters. Nature Communications 2015, 6 10.1038/ncomms7186. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 130.Xu J, Zhang P, Sun W, Zhang J, Zhang W, Hou C, Li L: EpiMCI: Predicting multi-way chromatin interactions from epigenomic signals. Biology (Basel) 2023, 12 10.3390/biology12091203. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 131.Qi Y, Zhang B: Predicting three-dimensional genome organization with chromatin states. PLoS Comput Biol 2019, 15:e1007024. 10.1371/journal.pcbi.1007024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 132.Kelley DR, Reshef YA, Bileschi M, Belanger D, McLean CY, Snoek J: Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res 2018, 28:739–750 10.1101/gr.227819.117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 133.Farré P, Heurteau A, Cuvier O, Emberly E: Dense neural networks for predicting chromatin conformation. BMC Bioinformatics 2018, 19:372. 10.1186/s12859-018-2286-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 134.Srinivasan A, Mishra RK: Chromatin domain boundary element search tool for Drosophila. Nucleic Acids Res 2012, 40:4385–95 10.1093/nar/gks045. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 135.Mourad R, Cuvier O: Computational identification of genomic features that influence 3D chromatin domain formation. PLoS Comput Biol 2016, 12:e1004908 10.1371/journal.pcbi.1004908. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 136.Mourad R, Cuvier O: TAD-free analysis of architectural proteins and insulators. Nucleic Acids Res 2018, 46:e27. 10.1093/nar/gkx1246. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 137.Sefer E, Kingsford C: Semi-nonparametric modeling of topological domain formation from epigenetic data. Algorithms Mol Biol 2019, 14:4 10.1186/s13015-019-0142-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 138.Gan W, Luo J, Li YZ, Guo JL, Zhu M, Li ML: A computational method to predict topologically associating domain boundaries combining histone marks and sequence information. BMC Genomics 2019, 20:980 10.1186/s12864-019-6303-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 139.Trieu T, Khurana E: A deep learning approach to predict the impact of non-coding sequence variants on 3D chromatin structure. bioRxiv 2019, 516849 10.1101/516849. Available: http://biorxiv.org/content/early/2019/01/10/516849.1.abstract. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 140.Chyr J, Zhang Z, Chen X, Zhou X: PredTAD: A machine learning framework that models 3D chromatin organization alterations leading to oncogene dysregulation in breast cancer cell lines. Comput Struct Biotechnol J 2021, 19:2870–2880 10.1016/j.csbj.2021.05.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 141.Henderson J, Ly V, Olichwier S, Chainani P, Liu Y, Soibam B: Accurate prediction of boundaries of high resolution topologically associated domains (TADs) in fruit flies using deep learning. Nucleic Acids Res 2019, 47:e78 10.1093/nar/gkz315. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 142.Lv H, Dao F-Y, Zulfiqar H, Su W, Ding H, Liu L, Lin H: A sequence-based deep learning approach to predict CTCF-mediated chromatin loop. Brief Bioinform 2021, 22 10.1093/bib/bbab031. [DOI] [PubMed] [Google Scholar]
  • 143.Zhang P, Wu Y, Zhou H, Zhou B, Zhang H, Wu H: CLNN-loop: A deep learning model to predict CTCF-mediated chromatin loops in the different cell lines and CTCF-binding sites (CBS) pair types. Bioinformatics 2022, 38:4497–4504 10.1093/bioinformatics/btac575. [DOI] [PubMed] [Google Scholar]
  • 144.Wang Y, Liu Y, Xu Q, Xu Y, Cao K, Deng N, Wang R, Zhang X, Zheng R, Li G, Fang Y: TAD boundary and strength prediction by integrating sequence and epigenetic profile information. Brief Bioinform 2021, 10.1093/bib/bbab139. [DOI] [PubMed] [Google Scholar]
  • 145.Dekker J, Heard E: Structural and functional diversity of topologically associating domains. FEBS Lett 2015, 589:2877–84 10.1016/j.febslet.2015.08.044. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 146.Canela A, Maman Y, Jung S, Wong N, Callen E, Day A, Kieffer-Kwon K-R, Pekowska A, Zhang H, Rao SSP, Huang S-C, Mckinnon PJ, Aplan PD, Pommier Y, Aiden EL, Casellas R, Nussenzweig A: Genome organization drives chromosome fragility. Cell 2017, 170:507–521.e18 10.1016/j.cell.2017.06.034. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 147.Phillips-Cremins JE, Sauria MEG, Sanyal A, Gerasimova TI, Lajoie BR, Bell JSK, Ong C-T, Hookway TA, Guo C, Sun Y, Bland MJ, Wagstaff W, Dalton S, McDevitt TC, Sen R, Dekker J, Taylor J, Corces VG: Architectural protein subclasses shape 3D organization of genomes during lineage commitment. Cell 2013, 153:1281–1295 10.1016/j.cell.2013.04.053. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 148.Corces MR, Corces VG: The three-dimensional cancer genome. Curr Opin Genet Dev 2016, 36:1–7 10.1016/j.gde.2016.01.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 149.Van Bortle K, Nichols MH, Li L, Ong C-T, Takenaka N, Qin ZS, Corces VG: Insulator function and topological domain border strength scale with architectural protein occupancy. Genome Biol 2014, 15:R82. 10.1186/gb-2014-15-5-r82. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 150.Fudenberg G, Pollard KS: Chromatin features constrain structural variation across evolutionary timescales. Proc Natl Acad Sci U S A 2019, 116:2175–2180 10.1073/pnas.1808631116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 151.Bednarz P, Wilczyński B: Supervised learning method for predicting chromatin boundary associated insulator elements. J Bioinform Comput Biol 2014, 12:1442006. 10.1142/S0219720014420062. [DOI] [PubMed] [Google Scholar]
  • 152.Ramírez F, Bhardwaj V, Arrigoni L, Lam KC, Grüning BA, Villaveces J, Habermann B, Akhtar A, Manke T: High-resolution TADs reveal DNA sequences underlying genome organization in flies. Nat Commun 2018, 9:189. 10.1038/s41467-017-02525-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 153.Schreiber J, Durham T, Bilmes J, Noble WS: Avocado: A multi-scale deep tensor factorization method learns a latent representation of the human epigenome. Genome Biol 2020, 21:81 10.1186/s13059-020-01977-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 154.Rozenwald M, Khrameeva E, Sapunov G, Gelfand M: Prediction of 3D chromatin structure using recurrent neural networks. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2018:2488. [Google Scholar]
  • 155.Naumova N, Dekker J: Integrating one-dimensional and three-dimensional maps of genomes. J Cell Sci 2010, 123:1979–88 10.1242/jcs.051631. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 156.Hou C, Li L, Qin ZS, Corces VG: Gene density, transcription, and insulators contribute to the partition of the Drosophila genome into physical domains. Mol Cell 2012, 48:471–84 10.1016/j.molcel.2012.08.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 157.Harris HL, Gu H, Olshansky M, Wang A, Farabella I, Eliaz Y, Kalluchi A, Krishna A, Jacobs M, Cauer G, Pham M, Rao SSP, Dudchenko O, Omer A, Mohajeri K, Kim S, Nichols MH, Davis ES, Gkountaroulis D, Udupa D, Aiden AP, Corces VG, Phanstiel DH, Noble WS, Nir G, Di Pierro M, Seo J-S, Talkowski ME, Aiden EL, Rowley MJ: Chromatin alternates between A and B compartments at kilobase scale for subgenic organization. Nat Commun 2023, 14:3303. 10.1038/s41467-023-38429-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 158.Wang Y, Zhang Y, Zhang R, Schaik, Zhang, Sasaki, Peric-Hupkes D, Chen Y, Gilbert DM, Steensel B van, et al.: SPIN reveals genome-wide landscape of nuclear compartmentalization. bioRxiv 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 159.Chakraborty A, Wang J, Ay F: dcHiC: Differential compartment analysis of Hi-C datasets. 2022. [DOI] [PMC free article] [PubMed]
  • 160.Dozmorov MG, Marshall MA, Rashid NS, Grible JM, Valentine A, Olex AL, Murthy K, Chakraborty A, Reyna J, Figueroa DS, Hinojosa-Gonzalez L, Da-Inn Lee E, Baur BA, Roy S, Ay F, Harrell JC: Rewiring of the 3D genome during acquisition of carboplatin resistance in a triple-negative breast cancer patient-derived xenograft. Sci Rep 2023, 13:5420. 10.1038/s41598-023-32568-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 161.Fortin J-P, Hansen KD: Reconstructing A/B compartments as revealed by Hi-C using long-range correlations in epigenetic data. Genome Biol 2015, 16:180. 10.1186/s13059-015-0741-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 162.Moore BL, Aitken S, Semple CA: Integrative modeling reveals the principles of multi-scale chromatin boundary formation in human nuclear organization. Genome Biol 2015, 16:110. 10.1186/s13059-015-0661-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 163.Di Pierro M, Cheng RR, Lieberman Aiden E, Wolynes PG, Onuchic JN: De novo prediction of human chromosome structures: Epigenetic marking patterns encode genome architecture. Proc Natl Acad Sci U S A 2017, 114:12126–12131 10.1073/pnas.1714980114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 164.Xiong K, Ma J: Revealing Hi-C subcompartments by imputing inter-chromosomal chromatin interactions. Nat Commun 2019, 10:5069. 10.1038/s41467-019-12954-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 165.Weinreb C, Raphael BJ: Identification of hierarchical chromatin domains. Bioinformatics 2016, 32:1601–9 10.1093/bioinformatics/btv485. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 166.Yu W, He B, Tan K: Identifying topologically associating domains and subdomains by Gaussian mixture model and proportion test. Nat Commun 2017, 8:535. 10.1038/s41467-017-00478-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 167.Norton HK, Emerson DJ, Huang H, Kim J, Titus KR, Gu S, Bassett DS, Phillips-Cremins JE: Detecting hierarchical genome folding with network modularity. Nat Methods 2018, 15:119–122 10.1038/nmeth.4560. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 168.Haddad N, Vaillant C, Jost D: IC-finder: Inferring robustly the hierarchical organization of chromatin folding. Nucleic Acids Res 2017, 45:e81. 10.1093/nar/gkx036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 169.Cresswell KG, Stansfield JC, Dozmorov MG: SpectralTAD: An R package for defining a hierarchy of topologically associated domains using spectral clustering. BMC Bioinformatics 2020, 21:319. 10.1186/s12859-020-03652-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 170.Gibcus JH, Samejima K, Goloborodko A, Samejima I, Naumova N, Nuebler J, Kanemaki MT, Xie L, Paulson JR, Earnshaw WC, Mirny LA, Dekker J: A pathway for mitotic chromosome formation. Science 2018, 359 10.1126/science.aao6135. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 171.Naumova N, Imakaev M, Fudenberg G, Zhan Y, Lajoie BR, Mirny LA, Dekker J: Organization of the mitotic chromosome. Science 2013, 342:948–53 10.1126/science.1236083. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 172.Nagano T, Lubling Y, Várnai C, Dudley C, Leung W, Baran Y, Mendelson Cohen N, Wingett S, Fraser P, Tanay A: Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature 2017, 547:61–67 10.1038/nature23001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 173.Schmitt AD, Hu M, Jung I, Xu Z, Qiu Y, Tan CL, Li Y, Lin S, Lin Y, Barr CL, Ren B: A compendium of chromatin contact maps reveals spatially active regions in the human genome. Cell Rep 2016, 17:2042–2059 10.1016/j.celrep.2016.10.061. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 174.Sexton T, Cavalli G: The role of chromosome domains in shaping the functional genome. Cell 2015, 160:1049–59 10.1016/j.cell.2015.02.040. [DOI] [PubMed] [Google Scholar]
  • 175.Cresswell KG, Dozmorov MG: TADCompare: An R package for differential and temporal analysis of topologically associated domains. Front Genet 2020, 11:158. 10.3389/fgene.2020.00158. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 176.Sauerwald N, Kingsford C: Quantifying the similarity of topological domains across normal and cancer human cell types. Bioinformatics 2018, 34:i475–i483 10.1093/bioinformatics/bty265. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 177.Zaborowski R, Wilczynski B: DiffTAD: Detecting differential contact frequency in topologically associating domains Hi-C experiments between conditions. 2016, 10.1101/093625. [DOI] [Google Scholar]
  • 178.Lee D-I, Roy S: Examining dynamics of three-dimensional genome organization with multi-task matrix factorization. bioRxiv 2023, 2023.08.25.554883. 10.1101/2023.08.25.554883. Available: http://biorxiv.org/content/early/2023/08/27/2023.08.25.554883.abstract. [Google Scholar]
  • 179.Hua D, Gu M, Zhang X, Du Y, Xie H, Qi L, Du X, Bai Z, Zhu X, Tian D: DiffDomain enables identification of structurally reorganized topologically associating domains. Nat Commun 2024, 15:502. 10.1038/s41467-024-44782-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 180.Gupta K, Wang G, Zhang S, Gao X, Zheng R, Zhang Y, Meng Q, Zhang L, Cao Q, Chen K: StripeDiff: Model-based algorithm for differential analysis of chromatin stripe. Sci Adv 2022, 8:eabk2246. 10.1126/sciadv.abk2246. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 181.Krietenstein N, Abraham S, Venev SV, Abdennur N, Gibcus J, Hsieh T-HS, Parsi KM, Yang L, Maehr R, Mirny LA, Dekker J, Rando OJ: Ultrastructural details of mammalian chromosome architecture. Mol Cell 2020, 78:554–565.e7 10.1016/j.molcel.2020.03.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 182.Hsieh T-HS, Weiner A, Lajoie B, Dekker J, Friedman N, Rando OJ: Mapping nucleosome resolution chromosome folding in yeast by Micro-C. Cell 2015, 162:108–19 10.1016/j.cell.2015.05.048. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 183.Beagrie RA, Scialdone A, Schueler M, Kraemer DCA, Chotalia M, Xie SQ, Barbieri M, Santiago I de, Lavitas L-M, Branco MR, Fraser J, Dostie J, Game L, Dillon N, Edwards PAW, Nicodemi M, Pombo A: Complex multi-enhancer contacts captured by genome architecture mapping. Nature 2017, 543:519–524 10.1038/nature21411. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 184.Lai B, Tang Q, Jin W, Hu G, Wangsa D, Cui K, Stanton BZ, Ren G, Ding Y, Zhao M, Liu S, Song J, Ried T, Zhao K: Trac-looping measures genome structure and chromatin accessibility. Nat Methods 2018, 15:741–747 10.1038/s41592-018-0107-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 185.Quinodoz SA, Ollikainen N, Tabak B, Palla A, Schmidt JM, Detmar E, Lai MM, Shishkin AA, Bhat P, Takei Y, Trinh V, Aznauryan E, Russell P, Cheng C, Jovanovic M, Chow A, Cai L, McDonel P, Garber M, Guttman M: Higher-order inter-chromosomal hubs shape 3D genome organization in the nucleus. Cell 2018, 174:744–757.e24 10.1016/j.cell.2018.05.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 186.Alakuş TB: A novel repetition frequency-based DNA encoding scheme to predict human and mouse DNA enhancers with deep learning. Biomimetics (Basel) 2023, 8 10.3390/biomimetics8020218. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 187.Chen Z, Zhao P, Li F, Marquez-Lago TT, Leier A, Revote J, Zhu Y, Powell DR, Akutsu T, Webb GI, Chou K-C, Smith AI, Daly RJ, Li J, Song J: iLearn: An integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform 2020, 21:1047–1057 10.1093/bib/bbz041. [DOI] [PubMed] [Google Scholar]
  • 188.Li Y, Ju F, Chen Z, Qu Y, Xia H, He L, Wu L, Zhu J, Shao B, Deng P: CREaTor: Zero-shot cis-regulatory pattern modeling with attention mechanisms. Genome Biol 2023, 24:266. 10.1186/s13059-023-03103-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 189.Kim S, Park HJ, Cui X, Zhi D: Collective effects of long-range DNA methylations predict gene expressions and estimate phenotypes in cancer. Sci Rep 2020, 10:3920. 10.1038/s41598-020-60845-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 190.Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, Assael Y, Jumper J, Kohli P, Kelley DR: Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 2021, 18:1196–1203 10.1038/s41592-021-01252-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 191.Heer M, Giudice L, Mengoni C, Giugno R, Rico D: Esearch3D: Propagating gene expression in chromatin networks to illuminate active enhancers. Nucleic Acids Res 2023, 51:e55. 10.1093/nar/gkad229. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 192.Fullwood MJ, Liu MH, Pan YF, Liu J, Xu H, Mohamed YB, Orlov YL, Velkov S, Ho A, Mei PH, Chew EGY, Huang PYH, Welboren W-J, Han Y, Ooi HS, Ariyaratne PN, Vega VB, Luo Y, Tan PY, Choy PY, Wansa KDSA, Zhao B, Lim KS, Lesow SC, Yow JS, Joseph R, Li H, Desai KV, Thomsen JS, Lee YK, et al.: An oestrogen-receptor-alpha-bound human chromatin interactome. Nature 2009, 462:58–64 10.1038/nature08497. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 193.Fang R, Yu M, Li G, Chee S, Liu T, Schmitt AD, Ren B: Mapping of long-range chromatin interactions by proximity ligation-assisted ChIP-seq. Cell Res 2016, 26:1345–1348 10.1038/cr.2016.137. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 194.Mumbach MR, Rubin AJ, Flynn RA, Dai C, Khavari PA, Greenleaf WJ, Chang HY: HiChIP: Efficient and sensitive analysis of protein-directed genome architecture. Nat Methods 2016, 13:919–922 10.1038/nmeth.3999. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 195.Hughes JR, Roberts N, McGowan S, Hay D, Giannoulatou E, Lynch M, De Gobbi M, Taylor S, Gibbons R, Higgs DR: Analysis of hundreds of cis-regulatory landscapes at high resolution in a single, high-throughput experiment. Nat Genet 2014, 46:205–12 10.1038/ng.2871. [DOI] [PubMed] [Google Scholar]
  • 196.Mifsud B, Tavares-Cadete F, Young AN, Sugar R, Schoenfelder S, Ferreira L, Wingett SW, Andrews S, Grey W, Ewels PA, Herman B, Happe S, Higgs A, LeProust E, Follows GA, Fraser P, Luscombe NM, Osborne CS: Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. Nat Genet 2015, 47:598–606 10.1038/ng.3286. [DOI] [PubMed] [Google Scholar]
  • 197.Xu J, Ma H, Jin J, Uttam S, Fu R, Huang Y, Liu Y: Super-resolution imaging of higher-order chromatin structures at different epigenomic states in single mammalian cells. Cell Rep 2018, 24:873–882 10.1016/j.celrep.2018.06.085. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 198.Shachar S, Pegoraro G, Misteli T: HIPMap: A high-throughput imaging method for mapping spatial gene positions. Cold Spring Harb Symp Quant Biol 2015, 80:73–81 10.1101/sqb.2015.80.027417. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from ArXiv are provided here courtesy of arXiv
