Abstract
With the development of sequencing technology and the dramatic drop in sequencing cost, the functions of noncoding genes are being characterized in a wide variety of fields (e.g. biomedicine). Enhancers are noncoding DNA elements with vital transcription regulation functions. Tens of thousands of enhancers have been identified in the human genome; however, the location, function, target genes and regulatory mechanisms of most enhancers have not been elucidated thus far. As high-throughput sequencing techniques have leapt forwards, omics approaches have been extensively employed in enhancer research. Multidimensional genomic data integration enables the full exploration of the data and provides novel perspectives for screening, identification and characterization of the function and regulatory mechanisms of unknown enhancers. However, multidimensional genomic data are still difficult to integrate genome wide due to complex varieties, massive amounts, high rarity, etc. To facilitate the appropriate methods for studying enhancers with high efficacy, we delineate the principles, data processing modes and progress of various omics approaches to study enhancers and summarize the applications of traditional machine learning and deep learning in multi-omics integration in the enhancer field. In addition, the challenges encountered during the integration of multiple omics data are addressed. Overall, this review provides a comprehensive foundation for enhancer analysis.
Keywords: enhancer, multi-omics, high-throughput data analysis, data integration, machine learning
INTRODUCTION
The maintenance of transcriptional homeostasis is crucial to the development and growth of living things [1]. Transcriptional homeostasis is dependent on the interactions between transcription factors (TFs) and cis-regulatory elements (e.g. promoters and enhancers) [2]. In the 1980s, enhancers were first discovered in simian virus 40 (SV40) [3]. Subsequently, researchers have gradually characterized different types of enhancers, and various techniques have been developed to predict and study the function of enhancers (Figure 1). In 2004, Benjamin predicted enhancers in the Drosophila genome based on sequence conservation, initiating the application of bioinformatics methods in enhancer research [4]. The eRNA and Super-enhancer were discovered in 2010 and 2013, respectively, further enriching the understanding of enhancers [5, 6]. In 2013, the introduction of CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)-Cas9 technology greatly accelerated the validation of enhancer functions [7]. Since 2016, with the rapid development of the machine learning field, scientists have gradually adopted neural network technology into enhancer research and have developed numerous models and softwares to study enhancers [8, 9]. The development of omics technology has brought a breakthrough in enhancer research. The dissection of the underlying transcriptional regulation of enhancers provides new insights into the complexity of transcriptional regulation.
Figure 1.
Timeline of enhancer research.
Currently, omics approaches for enhancer research focus on four key questions (Figure 2). (i) How are enhancers identified? (ii) What induces the changes in enhancer activity? (iii) How do enhancers interact with their targets in the complicated 3D structure of the genome? (iv) What regulates the production and function of eRNA (enhancer RNA)? Since multiple biological processes are involved in the above four questions, genomic sequencing data provide opportunities to study such complicated issues by genomic sequencing data integration.
Figure 2.
Challenges in the study of enhancers by omics methods.
Enhancer-associated sequencing technologies can be divided into four categories: genomics, epigenomics, transcriptomics and gene-editing technology (Figure 3). Genomics focuses on gene sequences and genomic structures of enhancers. Specifically, the former explores enhancers through genome variation-phenotype correlation and gene sequence conservation, while the latter identifies potential enhancers and target genes under the 3D structure of the genome [10–13]. Epigenomics identifies enhancers from the perspectives of chromatin spatial information, DNA interaction and modification, and RNA secondary structure [14–16]. Since active enhancers are transcribed into eRNAs, the transcriptome is widely applied to characterize enhancers and enhancer-target pairs based on expression correlation [5, 17–19]. STARR-seq (self-transcribingactive regulatory region sequencing) is specifically designed to evaluate enhancer activity [20, 21]. CRISPR gene editing technology has been applied prevalently to knock out/down genes or enhancers. In addition, CRISPR gene editing technology has been developed to conduct large-scale parallel screening of enhancers followed by sequencing [22].
Figure 3.
Applications of different omics methods and CRISPR gene editing technology in enhancer research.
The centermost circle represents the transcriptional regulation by enhancers on the target genes; the inner circle presents a variety of techniques, particularly sequencing; the middle layer presents molecular information acquired from each technique and the outermost displays the classification. Genomic approaches are shown in blue, epigenomic approaches are represented by pink, transcriptomic approaches are shown in yellow and gene editing technology is represented by green.
Although single omics data mining can initially screen out enhancers or genes correlated with a specific disease or phenotype, single omics analysis is subject to significant limitations, e.g. inadequate interpretation of data in a single dimension and insufficient depletion of signal noise [23]. Multi-omics analysis has the advantage of diversity and is systematic, making it more conducive to clarifying the underlying mechanisms of enhancers [24–27]. The integration of multi-omic data provides more reliable results and dramatically reduces the false-positive rate [28]. Over the past decade, various multi-omics analysis methods have been developed. However, there are still many challenges, such as the accuracy variation among different omics data, missing values, and computational and storage costs [29]. This review will discuss the data characteristics of different omics in enhancer research, the methods of multi-omics data analysis and the challenges in multi-omics research.
THE APPLICATION OF DIFFERENT OMICS DATA IN ENHANCER RESEARCH
The widespread application of sequencing technologies has provided a wealth of molecular information in the enhancer field (Table 1). To better illustrate the sources and applications of different types of molecular information, we categorized omics approaches from the perspective of the research subject: genomics, epigenomics, transcriptomics and CRISPR editing technologies (Figure 3).
Table 1.
Key questions about enhancers are addressed by different molecular information
| Omics methods | Molecular information | Identification | Activity | Structure | eRNA |
|---|---|---|---|---|---|
| Genomics | DNA sequence | √ | √ | ||
| Genomics | 3D structure | √ | √ | √ | |
| Epigenomics | Chromatin accessibility | √ | √ | ||
| Epigenomics | DNA methylation | √ | √ | ||
| Epigenomics | DNA–Protein interaction | √ | √ | ||
| Epigenomics | DNA–DNA interaction | √ | √ | √ | |
| Epigenomics | DNA–RNA interaction | √ | √ | √ | |
| Epigenomics | RNA secondary structure | √ | √ | √ | |
| Transcriptomics | Gene expression | √ | √ | ||
| Transcriptomics | Enhancer expression | √ | √ | √ | |
| Transcriptomics | Enhancer activity | √ | |||
| Gene editing technology | Gene/enhancer activation | √ | √ | √ |
Genomics
Driven by the progress of sequencing technology and the decline in sequencing cost, large-scale population genome sequencing has been initiated in many countries, and the amount of data has grown exponentially [30]. SNPs (single nucleotide polymorphisms), SVs (structural variations), CNVs (copy number variations), InDels (insertions–deletions) and other molecular information can be obtained using WGS (whole genome sequencing), WES (whole exon sequencing), WGRS (whole genome resequencing) and other genomic sequencing techniques [31]. Researchers have developed GWAS (genome-wide association study) analysis and eQTL (expression quantitative trait loci) analysis methods to study the relationship between SNPs or CNVs and phenotypes. The GWAS method can analyze millions of SNPs in the genome simultaneously, which has the advantages of high efficiency and wide coverage [32]. eQTL is used to study the relationship between gene expression level and genotype [33]. Young group summarized the results of 1675 GWAS and found 5303 SNPs associated with various diseases. The majority of SNPs are in noncoding regions (93%), and among these, 64% of the loci are enriched in enhancer regions [34]. However, the GWAS method has limitations, such as the inability to identify complex traits, the inability to assess rare genetic variants and the uncertainty of gene-phenotype associations [32]. Compared with GWAS analysis, eQTL is advantageous in exploring gene expression regulation mechanisms and gene-phenotype associations. Since eQTL information can determine genetic variants associated with gene expression levels, it can more accurately identify potential enhancer elements [35, 36]. By introducing eQTLs from the 1000 Genomes Project, Chen et al. identified 65 pairs of cancer-specific enhancer genes [36]. Chignon et al. conducted a colocalization analysis of enhancer–promoter locations with tissue eQTL locations associated with genetic coronary artery disease, evaluating the importance of genetic variability in the disease [35]. In both studies, the approach to integrate enhancers with eQTLs was a location-based approach [35, 36].
Transcription regulation is closely correlated with the 3D conformation of chromatin, which is an alternative perspective for studying enhancers [37]. Hi-C (high-throughput chromosome conformation capture) has been the most extensively performed approach for 3D genome sequencing at the genome-wide level, with the advantages of wide coverage, high accuracy and more complete sequence positioning [38]. Since Hi-C data provide comprehensive information on chromatin interactions, they are used to determine the binding of enhancers to target genes in physical space [37]. The Hi-C derivative technologies include ChIA-PET (chromatin interaction analysis based on paired-end-tag sequencing) [39], HiChIP (in situ Hi-C followed by chromatin immunoprecipitation) [40] and PLAC-seq (proximity ligation-assisted ChIP-seq) [41]. These techniques can detect specific protein-mediated chromatin loops at high resolution, which are also used in enhancer analysis. However, for most tissues and cell lines, the Hi-C and Hi-C derivative technologies have the disadvantage of insufficient resolution [42]. The exponential growth in data volume and depth brings new analytical challenges as well [43].
Epigenomics
Epigenetics (e.g. DNA methylation, RNA modification, RNA secondary structure, histone modifications, etc.) refers to a type of regulatory mechanism on phenotypic properties by regulating gene transcription or translation processes without changing the DNA sequence [44, 45]. In enhancer studies, epigenomics methods can be divided into four categories according to the differences in research objects: chromatin accessibility, DNA modification, DNA interaction and RNA interaction (Figure 3).
Chromatin accessibility
The open state of eukaryotic chromatin is considered as a prerequisite for transcription. Four sequencing techniques have been developed to identify chromatin regions in the open state: DNase-seq (DNaseI sequencing), MNase-seq (micrococcal nuclease digestion and sequencing), FAIRE-seq (formaldehyde-assisted isolation of regulatory elements) and ATAC-seq (assay for targeting accessible chromatin with high-throughput sequencing) [46–48] (Figure 3). DNase-seq and MNase-seq are both genome sequencing techniques based on enzymatic digestion to determine chromatin accessibility. DNase-seq combines nonspecific endonuclease DNase I (Deoxyribonuclease I) to obtain DNA sequences between nucleosomes, whereas MNase-seq obtains DNA sequences wrapped around nucleosomes using micrococcal nuclease (MNase). Consistently, these techniques are used to identify active enhancers by high chromatin accessibility. However, both DNase I and MNase enzymes have sequence preferences, resulting in uneven signal distribution and false-negatives [49–52]. FAIRE-seq uses the difference in the solubility between DNA with or without nucleosome wrapping in phenol and chloroform. The DNA in nucleosome-free regions is determined by sequencing the DNA in the aqueous phase. FAIRE-seq overcomes the sequence preference of MNase and DNase I, but the low signal-to-noise ratio and the high background signal make FAIRE-seq data difficult to interpret [47, 53]. As the main sequencing technology for open chromatin so far, ATAC-seq employs the modified Tn5 transposase to randomly insert designed DNA sequences with adapter sequences into the open chromosomal regions. Fragmentation by Tn5 transposase and ligation with adapters are performed simultaneously, such that the sequencing library preparation process is notably simplified [54]. ATAC-seq has good repeatability, strong consistency and significant signals, and as few as 500 cells are needed, although mitochondrial contamination is inevitable [55]. Innovations in single-cell genomic technologies make it possible to map regulomes in individual cells. The single-cell ATAC-seq (scATAC-seq) and single-cell DNase-seq (scDNase-seq) are two technologies for analyzing open chromatin in single cells. By adding barcode sequences to each cell, scientists are able to examine heterogeneous samples at cellular resolution [56]. In enhancer studies, chromatin accessibility analysis methods are often employed to screen potential active enhancers. For example, Chen and Liang hypothesized a negative correlation between enhancer activity and the strength of nucleosome binding. To validate the hypothesis, they integrated enhancer position information and MNase-seq data from 29 different tissues/cell types and observed a reduction in nucleosome signals on the eRNA loci compared with the flanking sequences across all 29 tissue types. Integrating these findings with RNA-seq data to determine the eRNA expression level, they identified ~200 000 new eRNA loci [57]. Through in-depth analysis of single-cell RNA sequencing (scRNA-seq) and scATAC-seq data from mouse embryonic spinal cord, an enhancer regulatory network algorithm, called eNET, successfully identified enhancers crucial to the development of spinal cord neurons [58].
DNA modification
In most cancer types, the proportion of DNA methylation in the enhancer region is negatively correlated with its activity [59]. Yu’s group developed Guide Positioning Sequencing technology. By harnessing the 3′ → 5′ exonuclease and 5′ → 3′ polymerase activities of T4 DNA polymerase, methylcytosines were introduced into the 3′ end of each DNA fragment. Following bisulfite treatment, the 3′ read of each DNA fragment serves as a guide to determine the DNA methylation status of the paired 5′ read [60]. The approach improves the efficiency and accuracy of the mapping rate, and there is no sequence preference in methylation detection [60]. By comparing the changes in DNA methylation and H3K27ac histone modification between normal liver and two liver cancer cell lines (97 L and LM3), they discovered that the DNA methylation levels were increased, the H3K27ac peaks were lost in 5 liver-specific enhancer regions, and the expression of target genes was silenced in liver cancer cells. Therefore, they concluded that aberrant DNA methylation pattern in enhancer regions may alter the activity of enhancers, resulting in alterations in the expression of target genes.
DNA interaction
The interactions between DNA and other molecules, such as proteins, DNA and RNA, modify the structure or binding affinity of DNA. These processes can also result in functional alterations in enhancers. Given the crucial role of DNA interactions in biological processes, a multitude of sequencing technologies have been devised for their study. Based on the different interaction partners, we will introduce various omics technologies from three perspectives: DNA–protein interactions, DNA–DNA interactions and DNA–RNA interactions.
DNA–protein interactions
To characterize the interaction between DNA and protein, ChIP-seq (Chromatin Immunoprecipitation sequencing), FiTAc-seq (fixed-tissue ChIP-seq for H3K27ac profiling) and CUT&Tag (cleavage under targets and tagmentation) have been prevalently performed [14–16, 61–64]. ChIP-seq, first developed in 2007, has become one of the most prevalent methods for identifying the binding sites of TFs on DNA and DNA interacting with certain histone modifications to study epigenetic mechanisms [65]. However, the prolonged exposure of clinical specimens to formalin results in excessive chemical cross-linking, which limits the isolation of soluble chromatin. Therefore, the signal intensity of ChIP-seq analysis for FFPE (formalin fixation and paraffin embedding) samples is low, and the resolution is poor [63]. Therefore, FiT-seq and FiTAc-seq were developed to obtain high-quality information on the signal distribution of H3K4me1 and H3K27ac for FFPE samples [63, 64]. Another disadvantage of ChIP-seq is the low peak signal, high background noise and sometimes uneven distribution of target DNA fragments [66]. To address the issues with ChIP-seq, in 2019, Kaya-Okur et al. developed CUT&Tag technology, which requires fewer cells, with a minimum of 60 cells, and which has library construction steps that are simplified by removing the steps of formaldehyde crosslinking and ultrasonic interruption [67]. CUT&Tag has the advantages of lower background noise, higher reading accuracy and better data repeatability [68]. In general, there are two main directions to study enhancers using sequencing data: to screen active enhancers by detecting enriched histone modifications (H3K4me1 and H3K27ac) and to characterize enhancer-target pairs by binding to certain transcription activators or coactivators [69, 70].
DNA–DNA interactions
Extrachromosomal circular DNA (eccDNA) is a circular and double-stranded molecule in the nucleus that is independent of chromosomal DNA (chrDNA). These eccDNAs can vary greatly in size, ranging from tens to millions of base pairs [71]. eccDNA interferes with the replication and expression of genes by interacting with chrDNA [71]. In addition, eccDNA functions as a mobile enhancer to increase the transcription of genome-wide target genes [72]. At present, canonical DNA sequencing can indirectly predict eccDNA through sequence information, while Circle-seq is specifically designed to detect eccDNA [73].
DNA–RNA interactions
Increasing evidence suggests that nascent RNAs mediate the chromosomal interaction between promoters and enhancers several mega-bases away in linear distance. GRID-seq (global RNA interactions with DNA by deep sequencing) is a technique for unbiased detection of DNA–RNA interactions at the genome scale [74]. GRID-seq is complementary to Hi-C in studying 3D chromatin architecture [75, 76]. However, GRID-seq requires rather deep sequencing to generate a robust contact map, which limits its application [74, 75].
RNA–RNA interactions
RNA molecules in the cell nucleus form secondary structures via intramolecular base pairing to exert their biological functions. For example, eRNA and promoter upstream antisense RNAs (also known as promoter upstream transcripts, PROMPTs) form enhancer–promoter loops to activate transcription [77]. Thus, deciphering the higher-order structure of RNA is crucial for understanding the underlying mechanisms [77, 78]. RIC-seq can accurately capture the secondary structure of RNA and identify RNA–RNA interactions through chimeric sequences. In HeLa cells, 31 genes were predicted as target genes of 7 enhancers by RIC-seq (RNA in situ conformation sequencing). Locked nucleic acid and antisense oligonucleotides were used to knock down the 7 enhancers, and the expression of 27 predicted target genes was decreased accordingly. Therefore, the prediction accuracy for target genes for enhancers was >85% based on RIC-seq. RIC-seq will be helpful to study the regulatory role of eRNA in promoter activity.
Transcriptomics
The transcriptome refers to the collection of all RNAs transcribed in a specific tissue or cell at a certain stage [79]. The most extensively employed transcriptome detection methods comprise microarray, RNA-seq, scRNA-seq, spatial transcriptome sequencing and other derivative techniques [17]. RNA-seq, the most prevalent transcriptome sequencing technology, represents low background noise, accurate quantification and higher resolution of differentially expressed genes, and has a much lower limit of detection than a standard whole genome microarray [18, 19]. However, in model organisms, microarrays are reliable and more cost effective than RNA-seq [80]. To address the different cell states within a sample, single-cell transcriptomics was developed in 2009. Nowadays, many single-cell transcriptome platforms have emerged, such as 10X Genomics, BD Rhapsody, Fluidigm C1, etc. Among these platforms, the 10X Genomics single-cell transcript platform is the most commonly used due to its high-throughput and efficiency in capturing 100–80 000 cells (per chip) [81]. The scRNA-seq and spatial transcriptome sequencing endow expression information with high accuracy and specificity at single-cell resolution, whereas the steep price and the complexities in data analysis hinder their prevalence [82, 83]. In addition to conventional RNA-seq, STARR-seq has been applied to detect enhancer activity [84]. STARR-seq is a massively parallel reporter assay that identifies transcriptional enhancers based on their activity across the genome and quantitatively assesses their activity [84].
Transcriptome sequencing (RNA-sequencing) has been performed to study the expression and genomic alterations of enhancers. First, the transcriptome provides expression information for both target genes and enhancers. Compared with mRNAs, eRNAs have the characteristics of instability and low expression level, and most eRNAs do not contain polyA tails [85, 86]. Most RNA-seq studies utilize oligo-dT enrichment to capture polyA-tailed RNAs, which results in low detection efficiency for eRNAs. scRNA-seq and spatial transcriptomics sequencing, with relatively low depth, have not yet been performed to obtain eRNA expression. In addition to RNA-seq, GRO-seq (global nuclear run-on sequencing), PRO-seq (precision nuclear run-on sequencing), CAGE-seq (cap analysis of gene expression by deep sequencing) and other RNA-seq-derived techniques have been employed to capture eRNAs [87–91].
CRISPR gene editing technology
The integration of gene editing techniques and second-generation sequencing technology implements genome-wide parallel screenings for enhancers regulating a specific phenotype. CRISPR/Cas9 technology has been applied in enhancer screening, functional verification and target gene identification [22]. Various CRISPR-derived techniques for high-throughput screening of enhancers have been developed, such as CRISPRi-FlowFISH. CRISPRi-FlowFISH integrates CRISPRi with RNA fluorescence in situ hybridization (FISH) technology. The main principle is that gRNA guides KRAB-dCas9 to bind to a specific nucleotide sequence and inhibit the transcription of the sequence 200–500 bp near the gRNA. Subsequently, RNA FISH has been used to quantitatively label single cells based on the expression level of a gene of interest. When an enhancer is targeted by gRNA, CRISPRi-FlowFISH can quantify the effect of the enhancer on the target gene(s) [92]. Furthermore, Perturb-seq (also referred to as CRISP-seq and CROP-seq) integrates multiplexed CRISPR-mediated gene inactivation with scRNA-seq to comprehensively evaluate gene expression phenotypes for each perturbation. By designing sgRNAs (single guide RNAs) for enhancers, Perturb-seq enables simultaneous quantitative measurement of enhancer expression in many cells and a wealth of phenotypic information and greatly improves screening efficiency [93].
ENHANCER DATABASE
With the continuous growth of genomic data and experimental results, the enhancer databases have become an essential resource to study enhancers efficiently. Currently, there are more than a dozen databases that are widely used (Table 2). VISTA enhancer browser, DiseaseEnhancers and ENdb are three databases collecting enhancers experimentally validated [94–96]. So far, the VISTA enhancer browser (https://enhancer.lbl.gov/) has collected 1699 human or mouse noncoding elements with enhancer activity assessed in transgenic mice [94]. The DiseaseEnhancer (https://github.com/shijianasdf/DiseaseEnhancer/tree/master) database has collected 1059 experimentally validated disease-related enhancers from 167 human diseases based on literature [95]. And the ENdb (https://bio.liclab.net/ENdb/index.php) database is a manually curated enhancer database for human and mouse from 1590 published literatures, with 713 experimentally validated enhancers and their related information, including target genes, TFs, diseases and functions [96]. Cancer-specific enhancers are one of the hot topics in enhancer research. CancerEnD (https://webs.iiitd.edu.in/raghava/cancerend/) has conducted on 18 different cancer types by RNA expression data from TCGA, providing 8599 enhancers of 8063 cancer samples [97]. CenhANCER (http://cenhancer.chenzxlab.cn/) has collected H3K27ac ChIP-seq data from 49 cancer types, and predicts >57 million enhancers [98]. The TCEA database (https://bioinformatics.mdanderson.org/Supplements/Super_Enhancer/TCEA_website/) has collected TCGA and GTEx RNA-seq data and provides the downloadable eRNA expression data that has been calculated [57]. In addition to cancer-specific enhancers, tissue-specific and disease-related enhancer research is another key issue in the enhancer field. The GeneHancer (http://www.genecards.org/) database uses an integration algorithm to eliminate redundancy and identifies >434 000 tissue-specific enhancers from multiple data sources [25]. Mutations on the DNA sequences of enhancers may cause diseases by affecting target gene expression. HACER (http://bioinfo.vanderbilt.edu/AE/HACER/) utilizes GWAS information on disease-related genetic variation sites to link enhancers to diseases [99]. In addition to enhancers in human, enhancer research in mouse and other mammals gains increasing attention, as well. Fantom5 (https://fantom.gsc.riken.jp/5/) and RAEdb (http://www.computationalbiology.cn/RAEdb/index.php) have predicted different enhancers in humans and mouse, respectively, through CAGE-seq and STARR-seq methods [91, 100]. EnhancerAtlas2.0 (http://www.enhanceratlas.org/) collected data from 12 different tissue samples and predicted enhancers for 9 different mammalian species, greatly expanding the scope of enhancer research [101]. Based on the evolutionary conservation on enhancer between species, studies on enhancers and enhancer-gene interactions were performed in other model organisms. scEnhancer (http://enhanceratlas.net/scenhancer/), the first database to annotate enhancers at the single-cell level, covering 14 527 776 enhancers from 1196 906 single cells in human, mouse and Drosophila.
Table 2.
Comparison of commonly used enhancer databases
| Database | Species | Enhancers | eRNA | Specificity | Experimental result |
|---|---|---|---|---|---|
| CancerEnD | Human | 168 464 | No | Cancer | 0 |
| CenhANCER | Human | >57 000 000 | No | Cancer | 0 |
| DiseaseEnhancer | Human | 1059 | No | Disease | 1059 |
| ENdb | Human/Mouse | 713 | No | Disease | 713 |
| EnhancerAtlas2.0 | 9 species | 13 494 603 | No | None | 0 |
| Fantom5 | Human/Mouse | 65 359 | Yes | None | 0 |
| GeneHancer | Human | 434 139 | Yes | None | 0 |
| HACER | Human | 1676 284 | No | Disease | 0 |
| RAEdb | Human/Mouse | >500 000 | No | None | 0 |
| scEnhancer | 3 species | 14 527 776 | No | None | 0 |
| TCEA | Human | >300 000 | Yes | Cancer | 0 |
| VISTA | Human/Mouse | 3321 | No | None | 1699 |
MULTI-OMICS INTEGRATION METHOD
In recent years, the development of mathematics, statistics and computational science has laid the foundation for the integration of multi-omics analysis. At present, multi-omics integration methods can be divided into two categories based on whether neural networks are used: traditional machine learning models, which have the advantages of strong interpretability of algorithms and lower requirements for computing resources; and deep learning models using neural networks, which can capture complex relationships in data due to their powerful nonlinear fitting capabilities (Figure 4) [102–104]. The key factors in determining the two methods include data volume, computational resources and feature numbers. Neural networks require a large volume of data (at least thousands of samples) to avoid overfitting and abundant computational resources (hardware, software, etc.) [105]. Compared with traditional machine learning, one advantage of neural networks is that they do not require a large amount of manual labeling, and only simple data preprocessing is required for computation [106]. When choosing a method, researchers should weigh the characteristics of the issue to resolve.
Figure 4.
Classification of multi-omics integration methods in enhancer research.
Traditional machine learning models
Traditional machine learning models are algorithms that use statistics, linear algebra and optimization algorithms to extract information from existing data to build predictive models. The classical machine learning methods, such as logistic regression, random forests and naive Bayes, are used to predict and classify unknown data. Based on whether manually annotated labels are required for data, it can be divided into three types of learning: unsupervised, semi-supervised and supervised (Table 3).
Table 3.
Model classification for the prediction of enhancer
| Traditional machine learning | Algorithms | Tool name | Model |
|---|---|---|---|
| Unsupervised | Distance | ABC | Distance-based |
| Unsupervised | Correlation | ELMER | Pearson correlation |
| Unsupervised | Correlation | CISMAPPER | Pearson correlation |
| Semi-supervised | Regression | McEnhancer | Logistic regression |
| Semi-supervised | Classifier | DPHM | Bayesian model |
| Supervised | Regression | JEME | Regression-based methods |
| Supervised | Regression | FENRIR | Elastic net logistic regression |
| Supervised | Classifier | FOCS | Linear regression |
| Supervised | Classifier | IM-PET | Random forest |
| Supervised | Classifier | PETModule | Random forest |
| Supervised | Classifier | RIPPLE | Random forests |
| Supervised | Classifier | TargetFinder | Gradient tree boosting |
Unsupervised learning
Unsupervised learning is an analytical approach that eliminates the need for prelabeled training data. The main objective of unsupervised learning is to unveil hidden patterns and establish new connections between variables within a dataset [107]. In the study of enhancers, unsupervised learning methods can be divided into distance-based methods and correlation-based methods [108].
The earliest method used to predict enhancer target genes was the distance-based method, which relies on the genomic proximity between enhancers and genes. This approach assumes that enhancers tend to regulate nearby genes in the genome [109, 110]. However, the accuracy is not high, the variation range is large and the false discovery rate (FDR) is ~40–73% [111]. Even when RNA expression data have been used to screen the enhancer’s regulatory gene, its accuracy has remained low, with FDR values ranging from 53 to 77% [112]. Furthermore, the distance-based method overlooks distal regulatory interactions and the situation in which multiple enhancers target the same promoter [5]. Therefore, the distance-based method is generally used as a baseline [109]. For example, the ABC model predicts cell type-specific enhancer-target pairs based on the distance between enhancers and genes, the frequency of chromosomal contact between enhancers and promoters (by Hi-C data analysis), and enhancer activity (by DNase-seq and H3K27ac ChIP-seq) [92].
Developed from distance-based methods, correlation-based methods combine the correlation of features (e.g. histone modifications, DHS signals of enhancers and promoters, and gene transcription levels) to increase the prediction accuracy, such as ELMER and CisMapper [113–116]. ELMER identifies transcriptional targets by correlating methylation-affected enhancers with the expression of nearby genes. A nonparametric U test was used to examine the correlation degree of enhancer methylation and expression data (RNA-seq) with 10 genes upstream and 10 genes downstream of each enhancer, and all enhancer-gene pairs with P < 0.001 were retained [115]. CisMapper predicts enhancer-target pairs by calculating the Pearson correlation coefficient between the log of gene expression and the log of the histone signal at the TF-binding site within 500 kb upstream of the gene TSS [116]. CisMapper is more accurate than simple distance-based methods, with an average accuracy improvement of 2.7 times [116].
Semi-supervised and supervised learning
Semi-supervised learning uses algorithms that cover both unlabeled and labeled data for training, which is preferred when there is not enough labeled dataset available for supervised learning [117]. Compared with supervised learning, semi-supervised learning can reduce overfitting and improve the robustness of the model [117]. Supervised learning depends on high confidence positive and negative labeled training datasets (enhancers and non-enhancers, respectively). The model is usually trained to maximize the distinction between case and control sets [118]. Dependent on the algorithm applied in the model, semi-supervised and supervised learning can be divided into regression-based methods and classifier-training [109].
Regression-based methods (e.g. McEnhancer, JEME, FENRIR and FOCS) integrate enhancer and promoter features or gene expression to identify the regulatory relationship between enhancers and target genes [119–122]. McEnhancer uses a semi-supervised logistic regression model to calculate the probability of TFs binding to promoters and enhancers, and predicts the binding strength between genes and enhancers, with a prediction accuracy of 73–98% [122]. The merged regulation by multiple enhancers is considered in JEME, and sample-specific information is integrated as well to predict gene regulatory networks [121]. FENRIR integrates thousands of different epigenetic and functional genomics datasets to infer tissue-specific functional relationships between enhancers in 140 different human tissues and cell types [119]. FOCS is a statistical framework that utilizes eRNA as a marker of enhancer activity and determines enhancer–promoter interactions correlated with transcriptional activity based on information about chromatin epigenetic modifications [120].
The classifier training method uses experimentally identified enhancer–promoter interactions as the gold standard set. By learning the sequence and epigenetic modification features of the standard, a classifier can be trained to predict whether a given enhancer–promoter pair has an interaction or not [109]. DPHM, as a semi-supervised Bayesian model, predicts target genes of 47 enhancers in mice using Nkx2-5 ChIP-seq data [123]. IM-PET tool, using the random forest classifier algorithm, predicts the association between enhancers and promoters by collecting a large amount of enhancer feature data (epigenetic modification data, TF expression data, enhancer conservation data, etc.) [124]. The tool has high predictive accuracy, with an FDR reduced to ~1%, and the predicted distance between enhancers and target genes is also extended to 2 Mb [124]. PETModule, RIPPLE, TargetFinder, EAGLE and EPIP are algorithms that adopt supervised learning methods to predict enhancer-target gene interactions. Although they use different classification features, they all present good prediction performance on different datasets [112, 125–128] (Table 3).
Deep learning models
Since 2016, scientists have gradually begun to use neural network technology to study enhancers [8, 9]. Many studies have shown that neural networks have significant advantages in enhancer research, such as being able to predict across different cell types, thereby reducing computational and time costs [129, 130]. Convolutional neural networks (CNNs) have become the widely used algorithm in enhancer research, and various models such as DNABERT [131], iEnhancer-GAN [132], GC-MERGE [133], GraphReg [134], EPIVAN [130] and DeepTACT [135] have been proposed and optimized (Figure 3). DNABERT, as a novel pre-trained bidirectional encoder representation, can reveal the potential associations between different cis-DNA by learning DNA sequence information [131]. iEnhancer-GAN integrate word embeddings and sequence generation adversarial networks to predict the binding strength of enhancer-target gene interactions [132]. DeepTACT applies a bootstrapping deep-learning model to integrate genome sequence and chromatin accessibility data to predict enhancer–promoter interactions [135]. GC-MERGE is a graph-based deep learning framework that decodes Hi-C map through graph convolutional networks to capture the potential genomic spatial structure. It models the epigenetic modification signals and DNA sequence information to predict the target genes regulated by distant enhancers [133]. GraphReg model uses CNN layers to learn 1D features of enhancer-target gene (epigenomic data, genomic DNA sequence, etc.), and then constructs different enhancer-target genes into a whole through iterative methods on 3D genomic maps (such as HiChIP, Hi-C, etc.) by using graph attention networks (GAT) [134]. Compared with linear CNN models (such as Epi-CNN, Seq-CNN, etc.), the GraphReg model requires less sample size and has higher accuracy in prediction [134]. Graph-based methods (GC-MERGE and GraphReg) have advantages in handling complex relationships, robustness and data utilization compared with traditional machine learning methods. Compared with linear CNN models, graph-based methods have the advantages of strong interpretability, fast calculation efficiency and high accuracy in predicting long-distance enhancer-target genes [133, 134]. With more and more researchers focusing on graph theory and deep learning techniques in bioinformatics, graph-based methods will provide more powerful tools for analyzing enhancer-target gene networks [136].
In addition to CNN, architectures based on deep neural networks (DNN) are used to learn enhancer features as well. For example, EP-DNN uses p300 binding sites as markers for enhancers, and TSS and random non-DHS sites as markers for non-enhancers to perform training. The prediction accuracy of EP-DNN is 91.6%, exceeding the accuracy of DEEP-ENCODE (85.3%) and RFECS (85.5%) [129]. ES-ARCNN is a computational model for predicting the enhancers strength. To train ES-ARCNN, researchers applied two data augmentation tricks (i.e. reverse complement and shift) to improve the model’s predictive performance [137]. Enformer, as a developed enhancer prediction model based on the transformer, can integrate information of remote interactions in the genome (up to 100 kb away) [138]. By calculating the contribution scores of gene input gradients and attention weights, Enformer can identify the enhancer sequences that are most predictive of specific gene expression [138]. Although deep learning has outperformed many traditional computer methods in enhancer prediction applications, the problems of overparameterization and limited model performance still exist, and its interpretability lags behind traditional statistical methods. Continuous development of new deep learning methods is expected to achieve elegant applications in enhancer research.
CHALLENGES IN MULTI-OMICS APPROACHES
In recent years, there has been an increasing number of studies on enhancers by multi-omics approaches. However, there are still challenges in the application of multi-omics approaches to study enhancers, either due to a lack of sufficient attention or limited solutions. We have summarized the five major challenges in enhancer research, and provided some possible methods to overcome these challenges (Figure 5).
Figure 5.
Challenges and resolution strategies for multi-omics integration methods in enhancer research.
The accuracy variation between different omics data
Multi-omics data from different sources are often heterogeneous, with divergence in signal-to-noise ratios and significant differences in accuracy [139]. For instance, genome sequencing has a higher coverage than RNA-seq; transcriptomics and ChIP-seq use different quantification methods (the former uses RPM or count values, while the latter quantifies based on peak areas), resulting in different data ranges and distributions [140]. Currently, increasing the number of samples and improving experimental design can improve the statistical power of different omics analyses. However, according to MultiPower software, in the estimation of sample size required to achieve specific statistical power in different omics, DNA-seq and ChIP-seq require close to more than double the sample size of RNA-seq samples to achieve the same statistical effect [141]. Therefore, it is inefficient and expensive to improve accuracy only by increasing sequencing samples. Instead, one can consider balancing sample sizes through undersampling [141]. In addition, it can also evaluate the performance of machine learning by using standardized metrics to choose the optimal sample size, or by adopting techniques (such as regularization, bagging, cross-validation) to balance bias-variance trade-offs [142, 143].
Missing value imputation in multi-omics data
Data may be missing due to experimental random errors or inherent technical defects (e.g. low coverage in repetitive regions) during sequencing [144]. Consequently, some unmatched data have to be excluded during data integration, limiting the power for detection in the genome. Surprisingly, the problem of processing missing values is often treated as a data preprocessing step, and some scientists do not believe that it will have any impact on the outcomes of subsequent statistical analyses. Instead, the distribution characteristics of the multi-omics data should be reassessed in the analysis process, and sensitivity analysis should be performed to assess the impact of missing value inputs on the downstream analysis [29]. Imputation methods have the potential to correct missing values by leveraging the correlations within omics data and utilizing partially measured data from other omics datasets to impute missing values. MOFA analyzes the latent space across omics types to impute missing samples, and MultiBaC creates a multivariate predictive model of the incomplete omics types as a function of a shared omics modality [145, 146]. However, these two methods can create data structures that violate the assumption of independence and subsequently lead to unreliable analysis [29, 147, 148]. Therefore, the missing values across different data resources affirm the reliability and applicability of multi-omics analysis, and a better solution is urgently needed. Liew et al. compared 19 different missing value completion algorithms and found that the choice of algorithm should be assessed from an application-driven viewpoint, and validation of the imputation data is an important step in evaluating the performance of any input algorithm [149]. There is no one optimal imputation algorithm for all type of data, so it is necessary to choose an appropriate imputation algorithm according to the characteristics of the data [149].
Evaluation of model performance
Currently, the computational models for integrating multi-omics data in biology possess various characteristics (e.g. accuracy, speed, complexity and computational cost), and it is crucial to select the most suitable algorithm for multi-omics analysis [130]. Some supervised approaches are subjected to overfitting from inappropriate cross-validation policies, while certain approaches are limited by training label uncertainties [150]. In supervised learning, an incorrect label definition can lead to inaccurate prediction results. Consequently, it is essential to enlist biological expertise to properly define the labels [107]. On the other hand, models, such as the ABC model and eNet model, predict the functions of thousands of enhancers. However, most of these enhancers have not been experimentally validated, making it difficult to determine the accuracy of model prediction. Performance metrics used commonly for this purpose include the F1 score (Harmonic Mean of Precision and Recall), the area under the receiver operating characteristic curve and the area under the precision recall curve. But due to the diversity of the principles and standard definitions of prediction, it is difficult to systematically evaluate the performance of all available computational methods [109].
Interpretability of multi-omics approaches
Interpretability is about the extent to which a cause and effect can be observed within a system [151]. Factors affect the interpretability of a model, including data, model architecture and algorithms [152, 153]. To improve the interpretability of a model integrating the multi-omics data, several approaches have been developed from different perspectives of the affecting factors mentioned above. (i) Human-labeled data can improve the interpretability of a model. For example, GenNet improves the interpretability of genotype data by constructing explainable neural networks that use prior biological knowledge to label the data [154]. (ii) Simplifying the network architecture can increase the interpretation [153]. For example, ExplaiNN uses a large series of simple neural networks, each of which learns different TF binding profiles. As a result, it becomes easier to understand the prediction results of each TF, thereby improving the overall interpretability [153]. The HEAP model uses the weights of the first convolutional layer to capture important enhancer features to build a deep network model [152]. Adopting explainable artificial intelligence (XAI) architecture is another approach [155]. With this kind of architecture, researchers can have a clearer understanding of the degree of causal relationship between input signals and output results. Therefore, some teams have utilized XAI to identify enhancers based on the epigenetic feature signals of different histone groups, and discovered the connection between the enrichment of different histone modifications and the activity of enhancers [155]. (iii) From the algorithmic perspective, using interpretable algorithms (such as clustering, SHAP, etc.) can improve the interpretability of a model [152, 153]. In summary, although there are many methods available currently to improve the interpretability of models, overly pursuing interpretability may lead to a decline in model performance [156]. The existing interpretability methods often target specific model architectures or data types only [29]. Therefore, new strategies need to be continuously developed to improve the explanation ability of models.
Computational and storage costs
Multi-omics analysis incurs costs for computation and data storage [157]. Most integrated algorithms require high computational power and considerable storage capacity to store logs, results and analyses [140]. How to store multi-omics datasets to facilitate the reuse of existing research datasets is another challenge. High-performance computing infrastructure, cloud computing solutions and advanced statistical methods are all effective ways to reduce computing and storage costs. The general principle is FAIR, which stands for findable, accessible, interoperable and reusable [158]. However, many multi-omics data storage platforms (such as Figshare, Zenodo or Lifebit) do not support data retrieval and query [29]. Numerous computing models have been deployed on specialized graphics processing units and cloud computing platforms over the past few years (such as Microsoft Azure [159]), which is one of the ways to address the issues mentioned above [160, 161].
The five yellow rectangles in the figure represent the five major challenges encountered in enhancer research, while the blue squares represent different resolution strategies to address these challenges.
CONCLUSION
Enhancers are crucial regulatory elements in gene transcription, and the application of omics techniques speeds up the elucidation of the role and mechanisms of enhancers in gene regulation. In this review, we first summarized the current issues encountered in enhancer research. Next, we discussed the application and limitations of four types of omics technologies (genomics, epigenomics, transcriptomics and CRISPR gene editing). With the increasing availability of larger, high-quality datasets paralleled by the development of new omics technologies, the demand for ideas and methods for multi-omics analysis will continue to grow. Using machine learning to integrate and analyze high-dimensional and multi-omics data can effectively improve the accuracy of the enhancer prediction model. Furthermore, novel algorithms can be utilized to extract new information from existing data. However, the application of omics technology in the field of enhancer research is still challenging. Despite the widespread heterogeneity and divergent quality of multi-omics data, both data quality and quantity are being improved with the increasing application of sequencing techniques. In addition, statistical methods are continuously evolving to address the current challenges. In summary, we believe that with the development of omics technologies and statistics, multi-omics techniques will have greater value in enhancer research.
Key Points
At present, there are four basic problems in enhancer research: identification, activity, structure and eRNA.
Genomics, epigenomics, transcriptomics and CRISPR-gene editing technology have been widely used in enhancer research.
Multi-omics integration methods in enhancer research are divided into traditional machine learning and deep learning methods.
ACKNOWLEDGEMENTS
We deeply thank Yong Zhang and Weiwei Zhai for their comments and helpful suggestions during the manuscript preparation.
Qilin Wang is a PhD candidate at Beihang University, China. His research interests are bioinformatics, machine learning and data mining.
Junyou Zhang is a PhD candidate at Beihang University, China. His research interests focus on cancer genomics.
Zhaoshuo Liu is a master candidate at Beihang University, China. His research interests are data mining, computational biology and user modeling.
Yingying Duan is a master candidate at Beihang University, China. Her research interests focus on cancer genomics.
Chunyan Li is an associate professor at Beihang University, China. Her research interests are functional genomics on cancer and osteoporosis.
Contributor Information
Qilin Wang, School of Engineering Medicine, Beihang University, Beijing 100191, China; School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China.
Junyou Zhang, School of Engineering Medicine, Beihang University, Beijing 100191, China; School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China.
Zhaoshuo Liu, School of Engineering Medicine, Beihang University, Beijing 100191, China; School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China.
Yingying Duan, School of Engineering Medicine, Beihang University, Beijing 100191, China; School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China.
Chunyan Li, School of Engineering Medicine, Beihang University, Beijing 100191, China; School of Biological Science and Medical Engineering, Beihang University, Beijing 100191, China; Key Laboratory of Big Data-Based Precision Medicine (Ministry of Industry and Information Technology), Beihang University, Beijing 100191, China; Beijing Advanced Innovation Center for Big Data-Based Precision Medicine, Beihang University, Beijing 100191, China.
FUNDING
This work was supported by grants from the National Natural Science Foundation of China (32270610, 31801094 and 82072499 to C.L.) and the Fundamental Research Funds for the Central Universities (YWF-21-BJ-J-T105 to C.L.).
ETHICS APPROVAL AND CONSENT TO PARTICIPATE
Not applicable.
CONSENT FOR PUBLICATION
Not applicable.
References
- 1. Ohler U, Wassarman DA. Promoting developmental transcription. Development 2010;137:15–26. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Peng Y, Zhang Y. Enhancer and super-enhancer: positive regulators in gene transcription. Animal Model Exp Med 2018;1:169–79. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Thomas HF, Buecker C. What is an enhancer? Bioessays 2023;45:e2300044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Berman BP, Pfeiffer BD, Laverty TR, et al. Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol 2004;5:R61. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Ye R, Cao C, Xue Y. Enhancer RNA: biogenesis, function, and regulation. Essays Biochem 2020;64:883–94. [DOI] [PubMed] [Google Scholar]
- 6. Agrawal P, Rao S. Super-enhancers and CTCF in early embryonic cell fate decisions. Front Cell Dev Biol 2021;9:653669. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Corradin O, Scacheri PC. Enhancer variants: evaluating functions in common disease. Genome Med 2014;6:85. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Liu F, Li H, Ren C, et al. PEDLA: predicting enhancers with a deep learning-based algorithmic framework. Sci Rep 2016;6:28517. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Yang B, Liu F, Ren C, et al. BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone. Bioinformatics 2017;33:1930–6. [DOI] [PubMed] [Google Scholar]
- 10. Bosse Y, Amos CI. A decade of GWAS results in lung cancer. Cancer Epidemiol Biomarkers Prev 2018;27:363–79. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Wainberg M, Sinnott-Armstrong N, Mancuso N, et al. Opportunities and challenges for transcriptome-wide association studies. Nat Genet 2019;51:592–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Cano-Gamez E, Trynka G. From GWAS to function: using functional genomics to identify the mechanisms underlying complex diseases. Front Genet 2020;11:424. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Dozmorov MG, Tyc KM, Sheffield NC, et al. Chromatin conformation capture (Hi-C) sequencing of patient-derived xenografts: analysis guidelines. Gigascience 2021;10:10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Nakato R, Sakata T. Methods for ChIP-seq analysis: a practical workflow and advanced applications. Methods 2021;187:44–53. [DOI] [PubMed] [Google Scholar]
- 15. Song L, Crawford GE. DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb Protoc 2010; 2010(2):pdb prot5384. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Ocampo J, Cui F, Zhurkin VB, Clark DJ. The proto-chromatosome: a fundamental subunit of chromatin? Nucleus 2016;7:382–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. van Dijk EL, Jaszczyszyn Y, Naquin D, Thermes C. The third revolution in sequencing technology. Trends Genet 2018;34:666–81. [DOI] [PubMed] [Google Scholar]
- 18. Orgaz JL, Crosas-Molist E, Sadok A, et al. Myosin II reactivation and cytoskeletal remodeling as a hallmark and a vulnerability in melanoma therapy resistance. Cancer Cell 2020;37:85–103.e9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biol 2016;17:13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Neumayr C, Pagani M, Stark A, et al. STARR-seq and UMI-STARR-seq: assessing enhancer activities for genome-wide-, high-, and low-complexity candidate libraries. Curr Protoc Mol Biol 2019;128:e105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Tian W, Huang X, Ouyang X. Genome-wide prediction of activating regulatory elements in rice by combining STARR-seq with FACS. Plant Biotechnol J 2022;20:2284–97. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Lu T, Yang B, Wang R, et al. Xenotransplantation: current status in preclinical research. Front Immunol 2019;10:3060. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Subramanian I, Verma S, Kumar S, et al. Multi-omics data integration, interpretation, and its application. Bioinform Biol Insights 2020;14:1177932219899051. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Kleftogiannis D, Kalnis P, Bajic VB. Progress and challenges in bioinformatics approaches for enhancer identification. Brief Bioinform 2016;17:967–79. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Fishilevich S, Nudel R, Rappaport N, et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database (Oxford) 2017;2017:bax028. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Tsai A, Alves MR, Crocker J. Multi-enhancer transcriptional hubs confer phenotypic robustness. Elife 2019;8:e45325. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Ribeiro DM, Rubinacci S, Ramisch A, et al. The molecular basis, genetic control and pleiotropic effects of local gene co-expression. Nat Commun 2021;12:4842. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Wörheide MA, Krumsiek J, Kastenmüller G, et al. Multi-omics integration in biomedical research – a metabolomics-centric review. Anal Chim Acta 2021;1141:144–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Tarazona S, Arzalluz-Luque A, Conesa A. Undisclosed, unmet and neglected challenges in multi-omics studies. Nat Comput Sci 2021;1:395–402. [DOI] [PubMed] [Google Scholar]
- 30. Investigators GPP, Smedley D, Smith KR, et al. 100,000 genomes pilot on rare-disease diagnosis in health care - preliminary report. N Engl J Med 2021;385:1868–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Nakagawa H, Fujita M. Whole genome sequencing analysis for cancer genomics and precision medicine. Cancer Sci 2018;109:513–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Tam V, Patel N, Turcotte M, et al. Benefits and limitations of genome-wide association studies. Nat Rev Genet 2019;20:467–84. [DOI] [PubMed] [Google Scholar]
- 33. Gilad Y, Rifkin SA, Pritchard JK. Revealing the architecture of gene regulation: the promise of eQTL studies. Trends Genet 2008;24:408–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Hnisz D, Abraham BJ, Lee TI, et al. Super-enhancers in the control of cell identity and disease. Cell 2013;155:934–47. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Chignon A, Mathieu S, Rufiange A, et al. Enhancer promoter interactome and Mendelian randomization identify network of druggable vascular genes in coronary artery disease. Hum Genomics 2022;16:8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Chen H, Li C, Peng X, et al. A pan-cancer analysis of enhancer expression in nearly 9000 patient samples. Cell 2018;173:386–399.e312. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37. Mohanta TK, Mishra AK, Al-Harrasi A. The 3D genome: from structure to function. Int J Mol Sci 2021;22:11585. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Lafontaine DL, Yang L, Dekker J, et al. Hi-C 3.0: improved protocol for genome-wide chromosome conformation capture. Curr Protoc 2021;1:e198. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Vardaxis I, Drablos F, Rye MB, et al. MACPET: model-based analysis for ChIA-PET. Biostatistics 2020;21:625–39. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Mumbach MR, Rubin AJ, Flynn RA, et al. HiChIP: efficient and sensitive analysis of protein-directed genome architecture. Nat Methods 2016;13:919–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41. Rosen JD, Yang Y, Abnousi A, et al. HPRep: quantifying reproducibility in HiChIP and PLAC-Seq datasets. Curr Issues Mol Biol 2021;43:1156–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Zhang Y, An L, Xu J, et al. Enhancing Hi-C data resolution with deep convolutional neural network HiCPlus. Nat Commun 2018;9:750. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Rao SS, Huntley MH, Durand NC, et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 2014;159:1665–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Wang KC, Chang HY. Epigenomics: technologies and applications. Circ Res 2018;122:1191–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Wilson PC, Ledru N, Humphreys BD. Epigenomics and the kidney. Curr Opin Nephrol Hypertens 2020;29:280–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Klemm SL, Shipony Z, Greenleaf WJ. Chromatin accessibility and the regulatory epigenome. Nat Rev Genet 2019;20:207–20. [DOI] [PubMed] [Google Scholar]
- 47. Song L, Zhang Z, Grasfeder LL, et al. Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res 2011;21:1757–67. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Buenrostro JD, Wu B, Chang HY, et al. ATAC-seq: a method for assaying chromatin accessibility genome-wide. Curr Protoc Mol Biol 2015;109:21.29.1–29. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Chen A, Chen D, Chen Y. Advances of DNase-seq for mapping active gene regulatory elements across the genome in animals. Gene 2018;667:83–94. [DOI] [PubMed] [Google Scholar]
- 50. Liu Y, Fu L, Kaufmann K, et al. A practical guide for DNase-seq data analysis: from data management to common applications. Brief Bioinform 2019;20:1865–77. [DOI] [PubMed] [Google Scholar]
- 51. Kong S, Lu Y, Tan S, et al. Nucleosome-omics: a perspective on the epigenetic code and 3D genome landscape. Genes (Basel) 2022;13:1114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52. Chereji RV, Bryson TD, Henikoff S. Quantitative MNase-seq accurately maps nucleosome occupancy levels. Genome Biol 2019;20:198. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53. Seuter S, Neme A, Carlberg C. Monitoring genome-wide chromatin accessibility by formaldehyde-assisted isolation of regulatory elements sequencing (FAIRE-seq). Epigenetics Methods 2020;353–69. [Google Scholar]
- 54. Buenrostro JD, Giresi PG, Zaba LC, et al. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 2013;10:1213–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55. Jia G, Preussner J, Chen X, et al. Single cell RNA-seq and ATAC-seq analysis of cardiac progenitor cell transition states and lineage settlement. Nat Commun 2018;9:4877. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Ji Z, Zhou W, Hou W, et al. Single-cell ATAC-seq signal extraction and enhancement with SCATE. Genome Biol 2020;21:161. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57. Chen H, Liang H. A high-resolution map of human enhancer RNA loci characterizes super-enhancer activities in cancer. Cancer Cell 2020;38:701–715.e705. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58. Hong D, Lin H, Liu L, et al. Complexity of enhancer networks predicts cell identity and disease genes revealed by single-cell multi-omics analysis. Brief Bioinform 2023;24(1):bbac508. [DOI] [PubMed] [Google Scholar]
- 59. Clermont P-L, Parolia A, Liu1 HH, et al. DNA methylation at enhancer regions: novel avenues for epigenetic biomarker development. IMR Press. 2016;21(2):430–46. [DOI] [PubMed] [Google Scholar]
- 60. Li J, Li Y, Li W, et al. Guide positioning sequencing identifies aberrant DNA methylation patterns that alter cell identity and tumor-immune surveillance networks. Genome Res 2019;29:270–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61. Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 2009;10:669–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62. Nakao K, Miyaaki H, Ichikawa T. Antitumor function of microRNA-122 against hepatocellular carcinoma. J Gastroenterol 2014;49:589–93. [DOI] [PubMed] [Google Scholar]
- 63. Cejas P, Li L, O'Neill NK, et al. Chromatin immunoprecipitation from fixed clinical tissues reveals tumor-specific enhancer profiles. Nat Med 2016;22:685–91. [DOI] [PubMed] [Google Scholar]
- 64. Font-Tello A, Kesten N, Xie Y, et al. FiTAc-seq: fixed-tissue ChIP-seq for H3K27ac profiling and super-enhancer analysis of FFPE tissues. Nat Protoc 2020;15:2503–18. [DOI] [PubMed] [Google Scholar]
- 65. Pareek CS, Smoczynski R, Tretyn A. Sequencing technologies and genome sequencing. J Appl Genet 2011;52:413–35. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66. Mundade R, Ozer HG, Wei H, et al. Role of ChIP-seq in the discovery of transcription factor binding sites, differential gene regulation mechanism, epigenetic marks and beyond. Cell Cycle 2014;13:2847–52. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67. Kaya-Okur HS, Wu SJ, Codomo CA, et al. CUT&Tag for efficient epigenomic profiling of small samples and single cells. Nat Commun 2019;10:1930. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68. Kaya-Okur HS, Janssens DH, Henikoff JG, et al. Efficient low-cost chromatin profiling with CUT&Tag. Nat Protoc 2020;15:3264–83. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69. Li QL, Lin X, Yu YL, et al. Genome-wide profiling in colorectal cancer identifies PHF19 and TBC1D16 as oncogenic super enhancers. Nat Commun 2021;12:6407. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70. Cheung K, Barter MJ, Falk J, et al. Histone ChIP-Seq identifies differential enhancer usage during chondrogenesis as critical for defining cell-type specificity. FASEB J 2020;34:5317–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71. Zuo S, Yi Y, Wang C, et al. Extrachromosomal circular DNA (eccDNA): from chaos to function. Front Cell Dev Biol 2021;9:792555. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72. Zhu Y, Gujar AD, Wong CH, et al. Oncogenic extrachromosomal DNA functions as mobile enhancers to globally amplify chromosomal transcription. Cancer Cell 2021;39:694–707.e697. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73. Møller HD. Circle-Seq: isolation and sequencing of chromosome-derived circular DNA elements in cells. Methods Mol Biol 2020;2119:165–81. [DOI] [PubMed] [Google Scholar]
- 74. Zhou B, Li X, Luo D, et al. GRID-seq for comprehensive analysis of global RNA-chromatin interactions. Nat Protoc 2019;14:2036–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75. Li X, Zhou B, Chen L, et al. GRID-seq reveals the global RNA-chromatin interactome. Nat Biotechnol 2017;35:940–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76. Li J, Xiang Y, Zhang L, et al. Enhancer-promoter interaction maps provide insights into skeletal muscle-related traits in pig genome. BMC Biol 2022;20:136. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77. Cai Z, Cao C, Ji L, et al. RIC-seq for global in situ profiling of RNA-RNA spatial interactions. Nature 2020;582:432–7. [DOI] [PubMed] [Google Scholar]
- 78. Kim TK, Shiekhattar R. Architectural and functional commonalities between enhancers and promoters. Cell 2015;162:948–59. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009;10:57–63. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80. Mantione KJ, Kream RM, Kuzelova H, et al. Comparing bioinformatic gene expression profiling methods: microarray and RNA-Seq. Med Sci Monit Basic Res 2014;20:138–42. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81. Jovic D, Liang X, Zeng H, et al. Single-cell RNA sequencing technologies and applications: a brief overview. Clin Transl Med 2022;12:e694. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82. Saliba AE, Westermann AJ, Gorski SA, et al. Single-cell RNA-seq: advances and future challenges. Nucleic Acids Res 2014;42:8845–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83. Moses L, Pachter L. Museum of spatial transcriptomics. Nat Methods 2022;19:534–46. [DOI] [PubMed] [Google Scholar]
- 84. Muerdter F, Boryn LM, Arnold CD. STARR-seq - principles and applications. Genomics 2015;106:145–50. [DOI] [PubMed] [Google Scholar]
- 85. Goldstein I, Hager GL. Dynamic enhancer function in the chromatin context. Wiley Interdiscip Rev Syst Biol Med 2018;10(1):10.1002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86. Andersson R, Refsing Andersen P, Valen E, et al. Nuclear stability and transcriptional directionality separate functionally distinct RNA species. Nat Commun 2014;5:5336. [DOI] [PubMed] [Google Scholar]
- 87. Lee JH, Xiong F, Li W. Enhancer RNAs in cancer: regulation, mechanisms and therapeutic potential. RNA Biol 2020;17:1550–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88. Hah N, Kraus WL. Hormone-regulated transcriptomes: lessons learned from estrogen signaling pathways in breast cancer cells. Mol Cell Endocrinol 2014;382:652–64. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89. Murakawa Y, Yoshihara M, Kawaji H, et al. Enhanced identification of transcriptional enhancers provides mechanistic insights into diseases. Trends Genet 2016;32:76–88. [DOI] [PubMed] [Google Scholar]
- 90. Consortium F, the RP, CLST, et al. A promoter-level mammalian expression atlas. Nature 2014;507:462–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 91. Andersson R, Gebhard C, Miguel-Escalada I, et al. An atlas of active enhancers across human cell types and tissues. Nature 2014;507:455–61. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92. Fulco CP, Nasser J, Jones TR, et al. Activity-by-contact model of enhancer-promoter regulation from thousands of CRISPR perturbations. Nat Genet 2019;51:1664–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93. Dixit A, Parnas O, Li B, et al. Perturb-Seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell 2016;167:1853–1866.e1817. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 94. Visel A, Minovitsky S, Dubchak I, et al. VISTA enhancer browser--a database of tissue-specific human enhancers. Nucleic Acids Res 2007;35:D88–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 95. Zhang G, Shi J, Zhu S, et al. DiseaseEnhancer: a resource of human disease-associated enhancer catalog. Nucleic Acids Res 2017;46:D78–84. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96. Bai X, Shi S, Ai B, et al. ENdb: a manually curated database of experimentally supported enhancers for human and mouse. Nucleic Acids Res 2019;48:D51–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 97. Kumar R, Lathwal A, Kumar V, et al. CancerEnD: a database of cancer associated enhancers. Genomics 2020;112:3696–702. [DOI] [PubMed] [Google Scholar]
- 98. Luo Z-H, Shi M-W, Zhang Y, et al. CenhANCER: a comprehensive cancer enhancer database for primary tissues and cell lines. Database 2023;2023:baad022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99. Wang J, Dai X, Berry LD, et al. HACER: an atlas of human active enhancers to interpret regulatory variants. Nucleic Acids Res 2019;47:D106–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 100. Cai Z, Cui Y, Tan Z, et al. RAEdb: a database of enhancers identified by high-throughput reporter assays. Database (Oxford) 2019;2019:bay140. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 101. Gao T, Qian J. EnhancerAtlas 2.0: an updated resource with enhancer annotation in 586 tissue/cell types across nine species. Nucleic Acids Res 2020;48:D58–64. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102. Tang L, Hill MC, Wang J, et al. Predicting unrecognized enhancer-mediated genome topology by an ensemble machine learning model. Genome Res 2020;30:1835–45. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 103. Cai Z, Poulos RC, Liu J, et al. Machine learning for multi-omics data integration in cancer. iScience 2022;25:103798. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 104. Chen Z, Zhang J, Liu J, et al. DECODE: a deep-learning framework for condensing enhancers and refining boundaries with large-scale functional assays. Bioinformatics 2021;37:i280–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 105. Miotto R, Wang F, Wang S, et al. Deep learning for healthcare: review, opportunities and challenges. Brief Bioinform 2018;19:1236–46. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 106. Ahmad F, Mahmood A, Muhmood T. Machine learning-integrated omics for the risk and safety assessment of nanomaterials. Biomater Sci 2021;9:1598–608. [DOI] [PubMed] [Google Scholar]
- 107. Correa-Aguila R, Alonso-Pupo N, Hernández-Rodríguez EW. Multi-omics data integration approaches for precision oncology. Mol Omics 2022;18:469–79. [DOI] [PubMed] [Google Scholar]
- 108. Xu H, Zhang S, Yi X, et al. Exploring 3D chromatin contacts in gene regulation: the evolution of approaches for the identification of functional enhancer-promoter interaction. Comput Struct Biotechnol J 2020;18:558–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 109. Tao H, Li H, Xu K, et al. Computational methods for the prediction of chromatin interaction and organization using sequence and epigenomic profiles. Brief Bioinform 2021;22(5):bbaa405. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 110. Popay TM, Dixon JR. Coming full circle: on the origin and evolution of the looping model for enhancer-promoter communication. J Biol Chem 2022;298:102117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 111. Malin J, Aniba MR, Hannenhalli S. Enhancer networks revealed by correlated DNAse hypersensitivity states of enhancers. Nucleic Acids Res 2013;41:6828–38. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 112. Whalen S, Truty RM, Pollard KS. Enhancer-promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat Genet 2016;48:488–96. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 113. Ernst J, Kheradpour P, Mikkelsen TS, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 2011;473:43–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 114. Corces MR, Granja JM, Shams S, et al. The chromatin accessibility landscape of primary human cancers. Science 2018;362:eaav1898. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 115. Yao L, Shen H, Laird PW, et al. Inferring regulatory element landscapes and transcription factor networks from cancer methylomes. Genome Biol 2015;16:105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 116. O'Connor T, Bodén M, Bailey TL. CisMapper: predicting regulatory interactions from transcription factor ChIP-seq data. Nucleic Acids Res 2017;45:e19. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 117. Huska MR, Ramisch A, Vingron M, et al. Predicting enhancers using a small subset of high confidence examples and co-training. PeerJ Preprints 2016;4:e2407v1. [Google Scholar]
- 118. Greene CS, Tan J, Ung M, et al. Big data bioinformatics. J Cell Physiol 2014;229:1896–900. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 119. Chen X, Zhou J, Zhang R, et al. Tissue-specific enhancer functional networks for associating distal regulatory regions to disease. Cell Systems 2021;12:353–362.e356. [DOI] [PubMed] [Google Scholar]
- 120. Hait TA, Amar D, Shamir R, et al. FOCS: a novel method for analyzing enhancer and gene activity patterns infers an extensive enhancer-promoter map. Genome Biol 2018;19:56. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 121. Cao Q, Anyansi C, Hu X, et al. Reconstruction of enhancer-target networks in 935 samples of human primary cells, tissues and cell lines. Nat Genet 2017;49:1428–36. [DOI] [PubMed] [Google Scholar]
- 122. Hafez D, Karabacak A, Krueger S, et al. McEnhancer: predicting gene expression via semi-supervised assignment of enhancers to target genes. Genome Biol 2017;18:199. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 123. Mehdi TF, Singh G, Mitchell JA, et al. Variational infinite heterogeneous mixture model for semi-supervised clustering of heart enhancers. Bioinformatics 2019;35:3232–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 124. He B, Chen C, Teng L, et al. Global view of enhancer-promoter interactome in human cells. Proc Natl Acad Sci U S A 2014;111:E2191–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 125. Zhao C, Li X, Hu H. PETModule: a motif module based approach for enhancer target gene prediction. Sci Rep 2016;6:30043. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 126. Roy S, Siahpirani AF, Chasman D, et al. A predictive modeling approach for cell line-specific long-range regulatory interactions. Nucleic Acids Res 2015;43:8694–712. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 127. Gao T, Qian J. EAGLE: an algorithm that utilizes a small number of genomic features to predict tissue/cell type-specific enhancer-gene interactions. PLoS Comput Biol 2019;15:e1007436. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 128. Talukder A, Saadat S, Li X, et al. EPIP: a novel approach for condition-specific enhancer-promoter interaction prediction. Bioinformatics 2019;35:3877–83. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 129. Kim SG, Harwani M, Grama A, et al. EP-DNN: a deep neural network-based global enhancer prediction algorithm. Sci Rep 2016;6:38433. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 130. Hong Z, Zeng X, Wei L, et al. Identifying enhancer-promoter interactions with neural network based on pre-trained DNA vectors and attention mechanism. Bioinformatics 2020;36:1037–43. [DOI] [PubMed] [Google Scholar]
- 131. Ji Y, Zhou Z, Liu H, et al. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 2021;37:2112–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 132. Yang R, Wu F, Zhang C, et al. iEnhancer-GAN: a deep learning framework in combination with word embedding and sequence generative adversarial net to identify enhancers and their strength. Int J Mol Sci 2021;22(7):3589. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 133. Bigness J, Loinaz X, Patel S, et al. Integrating long-range regulatory interactions to predict gene expression using graph convolutional networks. J Comput Biol 2022;29:409–24. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 134. Zhao M, Ma L, Jia X, et al. GraphReg: dynamical point cloud registration with geometry-aware graph signal processing. IEEE Trans Image Process 2022;31:7449–64. [DOI] [PubMed] [Google Scholar]
- 135. Li W, Wong WH, Jiang R. DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res 2019;47:e60. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 136. Xiao S, Lin H, Wang C, et al. Graph neural networks with multiple prior knowledge for multi-omics data analysis. IEEE J Biomed Health Inform 2023;27:4591–600. [DOI] [PubMed] [Google Scholar]
- 137. Zhang T-H, Flores M, Huang Y. ES-ARCNN: predicting enhancer strength by using data augmentation and residual convolutional neural network. Anal Biochem 2021;618:114120. [DOI] [PubMed] [Google Scholar]
- 138. Avsec Ž, Agarwal V, Visentin D, et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 2021;18:1196–203. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 139. Bersanelli M, Mosca E, Remondini D, et al. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformatics 2016;17(Suppl 2):15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 140. Reel PS, Reel S, Pearson E, et al. Using machine learning approaches for multi-omics data analysis: a review. Biotechnol Adv 2021;49:107739. [DOI] [PubMed] [Google Scholar]
- 141. Tarazona S, Balzano-Nogueira L, Gómez-Cabrero D, et al. Harmonization of quality metrics and power calculation in multi-omic studies. Nat Commun 2020;11:3092. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 142. Jeni LA, Cohn JF, Torre FDL. Facing imbalanced data--recommendations for the use of performance metrics. In: 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. 2013, p. 245–51. [DOI] [PMC free article] [PubMed]
- 143. Siebert U, Rochau U, Claxton K. When is enough evidence enough? - Using systematic decision analysis and value-of-information analysis to determine the need for further evidence. Z Evid Fortbild Qual Gesundhwes 2013;107:575–84. [DOI] [PubMed] [Google Scholar]
- 144. Chen L, Liu P, Evans TC, Jr, et al. DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification. Science 2017;355:752–6. [DOI] [PubMed] [Google Scholar]
- 145. Argelaguet R, Velten B, Arnol D, et al. Multi-omics factor analysis—a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol 2018;14:e8124. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 146. Ugidos M, Tarazona S, Prats-Montalbán JM, et al. MultiBaC: a strategy to remove batch effects between different omic data types. Stat Methods Med Res 2020;29:2851–64. [DOI] [PubMed] [Google Scholar]
- 147. Voillet V, Besse P, Liaubet L, et al. Handling missing rows in multi-omics data integration: multiple imputation in multiple factor analysis framework. BMC Bioinformatics 2016;17:402. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 148. Conesa A, Beck S. Making multi-omics data accessible to researchers. Sci Data 2019;6:251. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 149. Liew AW-C, Law N-F, Yan H. Missing value imputation for gene expression data: computational techniques to recover missing data from available information. Brief Bioinform 2010;12:498–513. [DOI] [PubMed] [Google Scholar]
- 150. McCabe SD, Lin DY, Love MI. Consistency and overfitting of multi-omics methods on experimental data. Brief Bioinform 2020;21:1277–84. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 151. Lipton ZC. The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery. Queue 2018;16:31–57. [Google Scholar]
- 152. Liu Y, Wang Z, Yuan H, et al. HEAP: a task adaptive-based explainable deep learning framework for enhancer activity prediction. Brief Bioinform 2023;24(5):bbad286. [DOI] [PubMed] [Google Scholar]
- 153. Smith GD, Ching WH, Cornejo-Páramo P, et al. Decoding enhancer complexity with machine learning and high-throughput discovery. Genome Biol 2023;24:116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 154. van Hilten A, Kushner SA, Kayser M, et al. GenNet framework: interpretable deep learning for predicting phenotypes from genetic data. Commun Biol 2021;4:1094. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 155. Wolfe JC, Mikheeva LA, Hagras H, et al. An explainable artificial intelligence approach for decoding the enhancer histone modifications code and identification of novel enhancers in Drosophila. Genome Biol 2021;22:308. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 156. McDermid JA, Jia Y, Porter Z, et al. Artificial intelligence explainability: the technical and ethical dimensions. Philos Trans A Math Phys Eng Sci 2021;379:20200363. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 157. Herrmann M, Probst P, Hornung R, et al. Large-scale benchmark study of survival prediction methods using multi-omics data. Brief Bioinform 2021;22:bbaa167. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 158. Caspi R, Billington R, Fulcher CA, et al. The MetaCyc database of metabolic pathways and enzymes. Nucleic Acids Res 2018;46:D633–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 159. Copeland M, Soh J, Puca A, et al. Microsoft Azure and cloud computing. In: Copeland M, Soh J, Puca A et al. (eds). Microsoft Azure: Planning, Deploying, and Managing Your Data Center in the Cloud. Berkeley, CA: Apress, 2015, 3–26. [Google Scholar]
- 160. Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw 2015;61:85–117. [DOI] [PubMed] [Google Scholar]
- 161. Armbrust M, Fox A, Griffith R, et al. A view of cloud computing. Commun ACM 2010;53:50–8. [Google Scholar]





