Abstract
Transcriptional enhancers—unlike promoters—are challenging to identify computationally due to their variable distance and orientation relative to target genes. The scarcity of experimentally confirmed enhancers often limits the training of robust machine-learning models for enhancer prediction. We present EnhancerMatcher, a convolutional neural network-based tool that identifies cell-type-specific enhancers using only two confirmed enhancers as references. Trained on putative enhancers from the CATlas Project and control sequences from the human genome, EnhancerMatcher classifies sequences in triplets: two known enhancers from a common cell type and a third sequence evaluated for enhancer activity. Unlike existing methods, EnhancerMatcher enables classification across all cell types while preserving specificity through two reference enhancers. It achieved 90% accuracy, 92% recall, and 87% specificity on human test data. Furthermore, EnhancerMatcher demonstrated strong cross-species generalization, effectively recognizing mouse enhancers using its human-trained model, and exhibited consistent performance across diverse cell types regardless of their data representation size. EnhancerMatcher extracts features directly from raw sequences and provides interpretability through class activation maps, making it a powerful, versatile, and generalizable tool for enhancer discovery and regulatory sequence analysis.
Introduction
The spatio-temporal patterns of gene expression as well as expression levels are controlled by genetic and epigenetic factors. Transcriptional enhancers are genetic elements that play an important role in gene regulation. Unlike promoters, enhancer locations and orientations are unconstrained with respect to the transcription start sites of their target genes [1]. Enhancers function by recruiting transcription factors, co-activators, and/or co-repressors interacting with their target promoters via looping [2, 3]. Active enhancers are frequently marked by specific histone modifications [4–7] and are often transcribed into eRNA [8].
The human genome is estimated to include about 400 000 enhancers [9]. Enhancers have clinical importance; Karnuta and Scacheri surveyed links of mutations and variants in enhancers to diseases and susceptibility to diseases [10]. Mutations in enhancers are associated with aniridia [11, 12], split-hand syndrome [13], craniosynostosis [14], “disorders of sex development” [15], and cancer [16], among other disorders. Variants in enhancers are associated with increased risks of melanoma [17], prostate cancer [18], obesity [19], and Alzheimer’s disease [20]. Therefore, the study of enhancers is crucial to understanding gene regulation and genetic bases of disease.
Several empirical and computational approaches have been developed for locating enhancers. Tomoyasu and Halfon group empirical methods into: (i) reporter assays [21–23], (ii) genome-wide reporter assays [24–26], (iii) chromatin profiling [27], (iv) CRISPR/Cas9-based approaches [28], and (v) antibody-based approaches [29]. Suryamohan and Halfon classify computational approaches into the following categories [30]: (i) approaches utilizing sequence conservation scores [31–33], (ii) tools searching for clusters of transcription factor binding sites [34–36], and (iii) methods relying on supervised learning [37–44].
Deep learning models like DeepSEA [45] and Basset [46] have emerged as powerful tools, predicting chromatin features (e.g., DNase I hypersensitivity, transcription factor binding) or accessibility directly from raw DNA sequence using convolutional neural networks. Despite their strong performance, these methods typically demand extensive training data and focus on classifying the absolute regulatory activity of individual sequences. Earlier sequence-based classifiers such as gkm-SVM and deltaSVM [47, 48] demonstrated that k-mer–based features are also effective for enhancer and variant prediction, while multi-epigenomic integration approaches improved accuracy by incorporating histone modifications, transcription factor binding, and chromatin accessibility. Benchmarking studies utilizing massively parallel reporter assays (MPRAs) [49] and blind evaluations such as the Critical Assessment of Genome Interpretation (CAGI) challenge [50] have revealed both strengths and limitations of enhancer prediction, highlighting challenges in generalization across cell types. In parallel, conservation-oriented methods such as LECIF [51], EpiAlignment [52], and DeepGCF [53] integrate sequence and functional genomic data to quantify enhancer conservation across species.
Available computational tools suffer from inconsistent performance on different tissues and cell types [40]. These tools are likely to perform well on tissues with large sets of known enhancers while performing poorly on tissues with small sets of known enhancers. The field critically lacks an accurate tool for predicting whether a small number of sequences share similar enhancer activity. Current machine-learning methods for enhancer discovery require large training sets of functionally similar enhancers, whereas a pairwise or triplet-wise comparison method would require just one or two.
We propose a novel, intelligent, computational tool for calculating a metric measuring functional similarity between three sequences, two of which are known enhancers active in at least one common cell type. To begin, we define enhancer similarity to mean functional similarity such that similar enhancers regulate gene expression in the same cell type (at a similar time during development or in response to a similar stimulus).
Our proposed tool—EnhancerMatcher—utilizes deep artificial neural networks [54] in measuring an enhancer–enhancer similarity metric. Although multiple tools that use deep networks for locating cell-type-specific enhancers have been proposed [41–44], in this work, we formulate the problem in a novel way and propose a new approach for training such networks. Traditionally, bioinformaticians use deep networks (and other supervised learning approaches) to answer the following question: Does a given sequence have the same cell-type-specific enhancer activity as known enhancers comprising a training set? Here, we focus on a related, yet different, question: How similar are the enhancer activities of three sequences? The input to our new tool is two sequences that act as enhancers in at least one same cell type and a third sequence of unknown enhancer activity; the output is a score in the 0–1 range. For example, assume that a scientist has two heart-specific enhancers. The scientist is studying another sequence, and she wants to know if this sequence is also a heart-specific enhancer. EnhancerMatcher performs a comparison and generates a score that indicates the degree of similarity between the unknown sequence and the two known sequences in terms of enhancer activity in the heart.
An important application of EnhancerMatcher involves identifying mutations in enhancers associated with disease. Many genetic disorders have been linked to mutations in non-coding enhancer regions that alter gene regulation in specific tissues [55]. If a mutation disrupts enhancer activity in a disease-relevant cell type, the mutated sequence would no longer exhibit functional similarity to known enhancers in that cell type. By comparing the mutated enhancer to reference enhancers, EnhancerMatcher can highlight such loss-of-function effects, thereby assisting in the prioritization of disease-associated variants.
When EnhancerMatcher analyzes a DNA sequence, class activation maps (CAMs) identify the subsequences that most influence the network’s decision. These highlighted subsequences provide a visual guide, allowing researchers to pinpoint biologically significant regions and focus on the most relevant subsequences associated with enhancer activity.
Materials and methods
Overview
We developed EnhancerMatcher, a computational tool for assessing the similarity among three sequences with respect to their enhancer activities. The tool takes three sequences in FASTA format; the first and the second sequences must be enhancers active in the same cell type(s). EnhancerMatcher is tasked with recognizing whether the third sequence is an enhancer active in at least one cell type where the other two sequences are active enhancers. The tool outputs a number between 0 and 1; the closer the number to 1, the higher the probability that the third sequence has similar enhancer activity to the first and the second sequences. EnhancerMatcher is a deep convolutional neural network trained and validated on putative human enhancers of the CATlas (Cis-element Atlas) Project [56]. We refer to the CATlas putative enhancers hereafter simply as “enhancers.”
It is important to note that our tool is trained to recognize similarity in enhancer activity among three sequences. In contrast, a traditional tool is trained to recognize whether or not a single sequence exhibits enhancer activity in a particular cell type. Figure 1 diagrams EnhancerMatcher versus a traditional approach. To illustrate, EnhancerMatcher takes three sequences; it outputs a score between 0 and 1. Suppose that the first two sequences are enhancers active in fibroblasts. Scores <0.5 imply that the third sequence does not have enhancer activity in fibroblasts or is not an enhancer at all. Scores of 0.5 and higher imply that the third sequence is an enhancer active in fibroblasts. In contrast, a conventional tool—trained on a large number of fibroblast enhancers—takes a single sequence as its input. It outputs a score <0.5 if the input sequence does not have enhancer activity in fibroblasts; otherwise, it outputs a score between 0.5 and 1, indicating that this sequence has enhancer activity in fibroblasts. Note that our tool does not require a large number of fibroblast enhancers to be available during training.
Figure 1.
Comparison between a traditional classifier and EnhancerMatcher. (a) A traditional classifier is trained on enhancers from a single cell type (e.g. fibroblast). It predicts whether a given sequence is an enhancer or a non-enhancer of the same cell type on which it was trained. To achieve reliable performance, the classifier requires training on a large number of enhancers specific to that cell type. (b) EnhancerMatcher is trained using sequence triplets. The first and second sequences in each triplet are enhancers active in the same cell type, while the third sequence is either an enhancer active in the same cell type(s) as the first two sequences or a negative sequence. A negative sequence can be either: (i) an enhancer active in cell types where the first and second enhancers are inactive, (ii) a shuffled version of the first enhancer, or (iii) a genomic sequence randomly sampled from the human genome. EnhancerMatcher is trained on a large set of sequence triplets collected from all known cell types, including those with only a small number of known enhancers. During inference, EnhancerMatcher compares the third sequence to the first two and classifies it as either a similar enhancer or a negative sequence. Note that EnhancerMatcher is trained on 222 cell types; we show here only six cell types for simplicity of presentation.
EnhancerMatcher is not intended to determine whether a sequence is an enhancer in isolation. Instead, it estimates whether a third sequence shares cell-type-specific enhancer activity with two known reference enhancers. Negative examples in our training include both non-enhancers and enhancers active in unrelated cell types. We recognize the importance of annotating enhancers in general and have recently created another tool for across-species and across-cell-type enhancer discovery [57]. Next, we describe the enhancers used for training EnhancerMatcher.
Enhancer dataset
We utilized the CATlas dataset [56] in training, validating, and testing EnhancerMatcher. This dataset includes “1.2 million candidate cis-regulatory elements” (based on snATAC-seq) active in 222 human cell types. All of the sequences are 400 bp long.
Data preparation
The CATlas dataset was further processed. Sequences not active in any cell type were removed, as were sequences overlapping promoters, coding regions, or insulators. Promoters (1000-bp-long sequences centered on transcription start sites) and coding regions were obtained from RefSeq [58] and GENCODE [59]. Insulators were obtained from the ENCODE Project [9]. Locations of promoters, coding regions, and insulators were downloaded from the UCSC Table Browser. The processed dataset includes 895 414 elements.
The processed CATlas dataset is our primary source of enhancers. Next, we describe how we assembled additional datasets to serve as sources of non-enhancer sequences, i.e. control sequences. Both enhancer and control datasets are needed to construct positive and negative sequence triplets.
Control datasets
We generated five datasets, one of which is a shuffled version of the CATlas enhancers, and the other four are randomly sampled from the human genome (assembly HG38) to serve as negative (likely non-enhancer) sequences. These datasets are referred to collectively as the control datasets.
To shuffle a CATlas enhancer, the enhancer is divided into short k-mers (k is between 1 and 6 and is selected randomly); next, these k-mers are shuffled. We generated two shuffled versions of each enhancer, totaling 1 790 828 sequences.
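As an illustration, the shuffling step can be sketched in Python as follows. This is a minimal sketch, not the published implementation; the function name is ours, and we assume a single k is drawn per sequence (the text above does not specify whether k is redrawn per fragment):

```python
import random

def shuffle_enhancer(sequence, rng):
    """Shuffle a sequence by cutting it into k-mers (k drawn at random
    from 1..6) and permuting their order. Base composition is preserved,
    while longer-range sequence structure is destroyed."""
    k = rng.randint(1, 6)
    kmers = [sequence[i:i + k] for i in range(0, len(sequence), k)]
    rng.shuffle(kmers)
    return "".join(kmers)
```

Because whole k-mers are permuted, the shuffled sequence retains the original length and nucleotide counts, which makes it a composition-matched negative.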
While assembling the random datasets, we excluded any regions from (i) the original CATlas dataset, (ii) the CATlas brain dataset [60], and (iii) the DNase hypersensitive sites identified by the ENCODE project [9]. For each dataset, we randomly sampled two sequences per enhancer from the human genome.
A combination of three criteria—(i) length, (ii) GC-content, and (iii) exclusion of repeats—was considered when generating these datasets. According to the length criterion, a valid random sequence from the human genome must have the same length as an enhancer. According to the GC-content criterion, a valid random sequence must be within 3% of the GC-content of an enhancer. We sampled two datasets from the genome excluding repeats (the length-no-repeats and length-GC-no-repeats datasets) and two datasets from the genome including repeats (the length and length-GC datasets).
Repeats comprise about 50% of the human genome. When we sampled control sequences from the human genome while keeping repeats, our control sequences might include 50% repeats on average. This high percentage of repeats may incorrectly push a classifier to learn the properties of repeats—not those of enhancers with similar activities. For this reason, we generated two datasets from the whole genome and two datasets from the non-repetitive regions of the human genome. Repeats were delineated by Red [61].
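The length and GC-content criteria can be sketched as a simple acceptance check for a candidate control sequence. The function names and the boolean interface are our assumptions; repeat exclusion is not shown here because, as noted above, it is handled by masking the genome with Red before sampling:

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def is_valid_control(candidate, enhancer, match_gc, gc_tolerance=0.03):
    """Accept a randomly sampled genomic sequence only if it matches the
    enhancer's length and, when match_gc is True, falls within 3% of the
    enhancer's GC-content."""
    if len(candidate) != len(enhancer):
        return False
    if match_gc and abs(gc_content(candidate) - gc_content(enhancer)) > gc_tolerance:
        return False
    return True
```

Running the sampler with `match_gc` on or off yields the length-GC and length flavors of the control datasets, respectively.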
Now, the assembly of the enhancer dataset and the control datasets is complete. We are ready to describe how such data were utilized in generating similar and dissimilar sequence triplets.
Assembling triplets
Earlier, we conducted an evaluative study on image data, utilizing multiple metric learning approaches trained on image pairs or triplets [62]. A key finding from this study was that training with triplets yielded superior performance compared to using pairs. This enhanced performance stems from the fact that triplets provide information on the relative distance between similar and dissimilar elements, whereas pairs only offer insights into absolute distances. To illustrate this concept, consider a pairwise comparison. If you were asked to assess the similarity between a pen and a pencil, some might argue they are quite different (e.g. ink versus graphite, permanent versus erasable). Now, imagine you are given three objects: a pen, a pencil, and an eraser. If asked which of the two—the pencil or the eraser—is more similar to the pen, most would naturally choose the pencil, since both are writing instruments, differentiating them from the eraser’s function. This kind of relative comparison, explicitly defining a “more similar” and “less similar” example to a reference object, is the basis of triplet learning. In contrast, pairwise comparisons (pen versus pencil, pen versus eraser) would require setting a fixed similarity threshold for each pair independently, which can be harder to learn and generalize across diverse examples. By explicitly modeling these relative relationships, triplet-based approaches like EnhancerMatcher are better suited for distinguishing subtle functional similarities in biological sequences.
To construct our dataset, we assemble both similar and dissimilar triplets. A similar triplet consists of three enhancers active in at least one same cell type. A dissimilar triplet consists of two enhancers active in at least one same cell type and a negative sequence. A negative sequence can be drawn from the enhancer dataset or from one of the five control datasets. If the enhancer dataset is being used, a negative is an enhancer inactive in every cell type shared by the first two enhancers. If a control dataset is being used, a negative is any sequence from the control dataset. There is also a composite dataset, where a negative is chosen with equal chance from the enhancer dataset (an enhancer inactive in any cell type where the first two enhancers are active) or from one of the five control datasets. In sum, a similar triplet consists of three enhancers that are active in at least one common cell type, whereas a dissimilar triplet consists of two enhancers with common activity and a third sequence with different enhancer activity or no enhancer activity at all.
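The triplet-assembly rules can be sketched with toy data structures. This is an illustrative sketch under our own assumptions: `activity` maps each enhancer to its set of active cell types, and `controls` stands in for the control datasets:

```python
import random

def assemble_triplet(anchor, activity, controls, similar, rng):
    """Build one triplet. The second sequence shares at least one cell
    type with the anchor; the third is either a similarly active
    enhancer (similar=True) or a negative sequence (similar=False)."""
    anchor_types = activity[anchor]
    partners = [s for s in activity if s != anchor and activity[s] & anchor_types]
    second = rng.choice(partners)
    shared = anchor_types & activity[second]
    if similar:
        # Third sequence: another enhancer active in a shared cell type.
        pool = [s for s in activity
                if s not in (anchor, second) and activity[s] & shared]
    else:
        # Third sequence: an enhancer inactive in all shared cell types,
        # or a control (shuffled or randomly sampled) sequence.
        pool = [s for s in activity if not activity[s] & shared] + list(controls)
    return anchor, second, rng.choice(pool)
```

With four enhancers where A, B, and C share a cell type and D does not, a similar triplet anchored at A always completes with two of {B, C}, while a dissimilar triplet completes with D or a control sequence.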
Most deep neural networks require a large number of training examples, so we need to generate a huge number of sequence triplets for training EnhancerMatcher successfully.
Data augmentation
A deep neural network consists of a large number of parameters; therefore, a large number of samples are needed for training such a large network. Our enhancer data are limited; however, data-augmentation strategies can be applied. We apply two strategies for augmenting the data. First, we generate different sequence triplets at each epoch (an iteration through an entire dataset). Second, a sequence in a triplet may be represented as is, i.e. the forward orientation, or by its reverse complement; a representation—forward or reverse complement—is chosen randomly.
Millions of sequence triplets can be generated using different sequence combinations. Each sequence in one of the enhancer datasets (training, validation, or testing datasets) serves as an anchor, i.e. the first sequence of a triplet. Anchor sequences do not change from epoch to epoch; however, a network sees new triplets in each epoch. For example, suppose there are five sequences named A, B, C, D, and E. Further, suppose that A, B, and C are all enhancers active in one common cell type, while D and E have no enhancer activity. At each epoch, each sequence serves as an anchor; the second and the third sequences are selected randomly from sequences that satisfy the similarity or the dissimilarity criteria. For example, when sequence A is the anchor of dissimilar triplets, the combination A, B, and D may be observed in one epoch, the combination A, C, and D in another epoch, and the combination A, B, and E in a third epoch. When sequence A is the anchor of similar triplets, the combination A, B, and C and the combination A, C, and B may be observed at different epochs. Note that the anchor remains constant, but it is a part of different triplets at different epochs. Through this augmentation technique, a network is able to see a huge number of different similar and dissimilar triplets.
The second data-augmentation strategy leverages the equivalence between sequences and their reverse complements. Each sequence in a triplet has a 50% chance of being represented as is and a 50% chance of being represented as its reverse complement, further augmenting the generated triplets.
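This second augmentation strategy is straightforward to sketch. The function names are ours; the 50% chance is applied to each sequence of a triplet independently:

```python
import random

# Translation table for Watson-Crick complements; N maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def augment_triplet(triplet, rng):
    """Independently replace each sequence of a triplet by its reverse
    complement with probability 0.5."""
    return tuple(s if rng.random() < 0.5 else reverse_complement(s)
                 for s in triplet)
```

Because an enhancer and its reverse complement represent the same double-stranded element, this augmentation doubles the effective representation of every sequence without altering its biological meaning.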
Once we have the generated triplets, we need to ensure that we can validate our network during training and test it to confirm that it has learned how to compare the enhancer activities of three sequences. To achieve this, it’s essential to train, validate, and test the network on three independent datasets.
Three partitions
First, we divided our entire collection of enhancers into three mutually exclusive sets: training (626 789; 70%), validation (179 082; 20%), and testing (89 541; 10%). Then, from these partitioned enhancer sets, we assembled our triplets. This means that for any given triplet, its anchor, positive, and negative enhancers (i.e. similar and dissimilar) all originate exclusively from the same partition. For example, an anchor enhancer from the training set will only be grouped with similar and dissimilar enhancers also from the training set. This rigorous approach ensures that both the individual sequences and the resulting triplets are entirely unique to each partition, preventing any data leakage across the training, validation, and testing datasets. Each partition is balanced, including an equal number of similar and dissimilar triplets. As deep neural networks require a large amount of data to be properly trained, we used 70% of our data for training. Another 20% (the validation dataset) was set aside to check for overfitting (where the performance on training data outstrips the performance on validation data) and to stop training a model as soon as overfitting occurs. The performance on the validation dataset guided the process of optimizing a model's parameters, e.g. the number of convolutional filters or the number of hidden neurons of a dense layer. The testing dataset was utilized at the very end of all validation and optimization experiments, i.e. we did not evaluate any model on the testing set until all the experiments were completed. Evaluating the final model on the testing set is a true blind test.
These sequences must be converted into a numerical representation before being processed by a neural network. The following section outlines this format.
Sequence representation
Each of the nucleotides (A, C, G, T, and any uncertain bases) is first assigned a unique integer identifier. We then use an embedding layer, a neural network component that maps the integer identifier for each nucleotide to a single real number. The idea is that nucleotides with similar functional properties in the context of enhancer activity should be mapped to real numbers that are close to each other, while dissimilar nucleotides are mapped to numbers further apart. The specific notion of “similarity” among nucleotides is not predefined; it is learned by the network during training, based on the task of comparing enhancer activities. The embedding layer is implemented using the standard Keras Embedding layer from the TensorFlow 2.13 framework, with the embedding dimension set to 1, so that each nucleotide’s integer identifier is transformed into a single learned numerical value optimized for the task of identifying similar enhancer activity.
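The encoding and embedding steps can be sketched as follows. The integer assignments and the `encode` helper are our own illustrative choices; only the use of a Keras Embedding layer with an embedding dimension of 1 comes from the description above:

```python
import numpy as np
import tensorflow as tf

# Integer identifiers for the four bases; id 4 covers any uncertain base.
BASE_TO_ID = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seq):
    """Map a DNA string to an array of integer identifiers."""
    return np.array([BASE_TO_ID.get(b, 4) for b in seq.upper()], dtype=np.int32)

# output_dim=1: each nucleotide id is mapped to a single learned real number.
embedding = tf.keras.layers.Embedding(input_dim=5, output_dim=1)

ids = encode("ACGTN")                 # shape: (5,)
values = embedding(ids[np.newaxis])   # shape: (1, 5, 1)
```

During training, the five scalar embedding values are adjusted by backpropagation along with the rest of the network's weights.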
We now describe the different neural network architectures we utilized in classifying similar and dissimilar triplets.
Baseline classifier
We built a deep network to predict whether (i) the third sequence of a triplet is an enhancer that is active where the first two enhancers are active or (ii) the third sequence has different enhancer activity than the first two sequences or is a non-enhancer.
A sequence triplet is represented as a 1 × 400 × 3 tensor. The last dimension of this tensor (i.e. 3) is referred to as the channel dimension. A separable-convolutional layer processes each channel of an input independently, first learning patterns within each channel and then combining the information, rather than analyzing all channels together as a single unit, as a traditional convolutional layer does. Common transcription factor binding sites in enhancers active in the same cell type are not at the same locations in the three sequences; for this reason, a separable-convolutional layer is more suitable than a regular convolutional layer when applied to sequence triplets directly. Our baseline classifier takes advantage of the separable-convolution technique.
In Fig. 2, we outline the architecture of the baseline classifier. It consists of an embedding layer followed by four (determined experimentally) separable-convolutional blocks, each of which includes a separable-convolutional layer with a scaled exponential linear unit activation function, a batch normalization layer, and a max-pooling layer (the final block has a global max-pooling layer). The number of convolutional filters doubles as the network gets deeper. For example, if the first block utilizes 4 filters, then the second, third, and fourth blocks utilize 8, 16, and 32 filters. The last block is followed by a dense layer of a large number of neurons (experimentally determined) with a scaled exponential linear unit activation function. A batch normalization layer follows the dense layer. The last layer is the output layer, which is a dense layer of one neuron with a sigmoid activation function.
Figure 2.

The baseline separable-convolutional classifier. The network takes a triplet as input, where the classifier determines whether the third sequence is an enhancer active in the same cell type(s) as the first two sequences or not. The triplet is passed through an embedding layer, followed by four 1D separable-convolutional blocks. Each block consists of a separable-convolutional layer, a batch normalization layer, and a max-pooling layer. The output of the final block is then sent through two dense layers, with a batch normalization layer in between. The final dense layer outputs a value between 0 and 1, indicating how similar the third sequence is to the first and second sequences in terms of enhancer activity.
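The baseline architecture can be sketched in Keras as follows. This is a simplified sketch, not the published implementation: the embedding layer is omitted (the input is assumed to be the already-embedded 400 × 3 triplet), and `dense_units` is a placeholder because the dense-layer size was determined experimentally, as noted above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_baseline(seq_len=400, first_filters=4, dense_units=64):
    """Four separable-convolutional blocks (SELU activation, batch
    normalization, max pooling; global max pooling in the final block)
    with filter counts doubling per block, followed by a dense layer,
    batch normalization, and a sigmoid output neuron."""
    inputs = layers.Input(shape=(seq_len, 3))  # embedded triplet: 3 channels
    x = inputs
    for i in range(4):
        x = layers.SeparableConv1D(first_filters * 2 ** i, 3,
                                   padding="same", activation="selu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.GlobalMaxPooling1D()(x) if i == 3 else layers.MaxPooling1D(2)(x)
    x = layers.Dense(dense_units, activation="selu")(x)
    x = layers.BatchNormalization()(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)
```

With `first_filters=4`, the four blocks use 4, 8, 16, and 32 filters, matching the doubling pattern described above.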
Up to this point, we have described our data and the baseline classifier, which can classify sequence triplets as similar or dissimilar with regard to their enhancer activities. We now introduce a network architecture that is more advanced than the baseline network. This network is the core of our tool.
EnhancerMatcher
The core of EnhancerMatcher is a deep convolutional neural network (Fig. 3). Our network consists of two sub-networks trained together: the convolutional sub-network and the classifier sub-network. The convolutional sub-network processes one sequence at a time. It consists of an embedding layer followed by four convolutional blocks. Each convolutional block contains two one-dimensional convolutional layers with a filter size of 3, followed by a batch normalization layer. Both convolutional layers use a Rectified Linear Unit (ReLU) activation function. The second convolutional layer in each block has a stride of two, which serves as a technique for dimensionality reduction. The first convolutional block uses 64 filters, followed by 128, 256, and 256 filters in the subsequent blocks, respectively. Finally, the output of the fourth block is flattened into a one-dimensional feature map.
Figure 3.

The network architecture of EnhancerMatcher. This network takes a three-sequence input, which we call triplets. In a triplet, the first two sequences are enhancers active in at least one shared cell type, while the third sequence is classified as either similar or dissimilar to the first two based on enhancer activity. Each sequence is passed through an embedding layer, followed by four 1D convolutional blocks. Each block consists of two 1D convolutional layers and a batch normalization layer. The output of the last block is flattened. The output of a convolutional layer is called a feature map. The three flattened feature maps are sent through an attention layer, followed by a concatenation layer that merges the output of the attention layer with the three original flattened feature maps. Finally, the output of the concatenation layer passes through three dense layers, with two batch normalization layers in between. The final dense layer outputs a value between 0 and 1, indicating how similar the third sequence is to the first and second sequences in terms of enhancer activity.
The classifier sub-network processes the three flattened feature maps. Its first layer is an attention layer that is applied to the three flattened feature maps. The second layer is a concatenation layer, which appends the output of the attention layer to the original three feature maps. This combined vector is then passed through two dense layers of 20 neurons each, with each dense layer followed by a batch normalization layer and a ReLU activation function. The final output layer is a dense layer with a single neuron and a sigmoid activation function, which produces a similarity score between 0 and 1.
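The two sub-networks can be sketched in Keras as follows. This sketch makes two assumptions beyond the description above: the convolutional sub-network's weights are shared across the three sequences, and the attention layer is implemented as dot-product self-attention over the three stacked feature maps (the exact attention mechanism is not specified above):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_subnetwork(seq_len):
    """Convolutional sub-network: embedding, then four blocks of two
    Conv1D layers (filter size 3, ReLU; the second with stride 2) plus
    batch normalization; filter counts 64, 128, 256, 256."""
    inputs = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(input_dim=5, output_dim=1)(inputs)
    for filters in (64, 128, 256, 256):
        x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv1D(filters, 3, strides=2, padding="same",
                          activation="relu")(x)
        x = layers.BatchNormalization()(x)
    return tf.keras.Model(inputs, layers.Flatten()(x))

def build_enhancer_matcher(seq_len=400):
    """Classifier sub-network: attention over the three flattened
    feature maps, concatenation with the originals, two dense layers of
    20 neurons (each followed by batch normalization and ReLU), and a
    sigmoid output neuron."""
    seqs = [layers.Input(shape=(seq_len,), dtype="int32") for _ in range(3)]
    tower = conv_subnetwork(seq_len)          # assumption: shared weights
    feats = [tower(s) for s in seqs]
    stacked = layers.Lambda(lambda t: tf.stack(t, axis=1))(feats)
    attended = layers.Attention()([stacked, stacked])  # assumption: self-attention
    x = layers.Concatenate()([layers.Flatten()(attended)] + feats)
    for _ in range(2):
        x = layers.Dense(20)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return tf.keras.Model(seqs, layers.Dense(1, activation="sigmoid")(x))
```

The four stride-2 convolutions reduce a 400-bp sequence to 25 positions, so each flattened feature map has 25 × 256 = 6400 values.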
Another advantage of EnhancerMatcher is its ability to highlight key subsequences that contribute most to a sequence’s regulatory activity. This important feature is explained in the following section.
Class activation map
A CAM highlights the most important parts of a network’s input that influence its output. CAM acts like a heat map, showing which areas EnhancerMatcher “pays attention to” when making a decision. These highlighted subsequences are likely to have important biological information related to the regulatory function of a sequence. The main computational idea behind CAM is that the output of a convolutional network can be viewed as a function of the feature maps learned by a series of convolutional blocks. CAM revolves around calculating the partial derivatives of the output of a network with respect to the output of the last convolutional block with the goal of finding regions in the processed input that maximize the network’s output [63].
To compute CAM, we extract the output from the final convolutional layer for a given input triplet. Using TensorFlow’s gradient tracking, we calculate the gradient of the network’s output score with respect to these feature maps. These gradients are globally averaged across the entire length of each feature map to yield channel-wise importance weights. These weights are then multiplied with their corresponding feature maps to produce a weighted activation map. We apply a ReLU activation and reduce the result along the channel dimension (e.g. by max pooling) to obtain a 1D heatmap over positions representing consecutive subsequences in each original sequence. This heatmap is linearly interpolated to the original sequence length and min-max normalized.
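The steps above can be sketched as a gradient-based CAM routine. This is an illustrative sketch (the function name and argument layout are ours); it follows the sequence of operations described above: gradient extraction, global averaging, weighting, ReLU, channel-wise max pooling, interpolation, and min-max normalization:

```python
import numpy as np
import tensorflow as tf

def grad_cam_1d(model, conv_layer_name, inputs, seq_len):
    """Gradient-weighted class activation map for a 1D convolutional
    model; returns a normalized heatmap over seq_len positions."""
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(inputs)
        score = preds[:, 0]
    grads = tape.gradient(score, conv_out)
    # Channel-wise importance weights: average gradients over positions.
    weights = tf.reduce_mean(grads, axis=1)
    weighted = conv_out * weights[:, tf.newaxis, :]
    # ReLU, then reduce over channels by max pooling.
    cam = tf.reduce_max(tf.nn.relu(weighted), axis=-1).numpy()[0]
    # Interpolate to the original sequence length and min-max normalize.
    heat = np.interp(np.linspace(0.0, cam.shape[0] - 1, seq_len),
                     np.arange(cam.shape[0]), cam)
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
```

Positions with values near 1 in the returned heatmap mark the subsequences that most increased the network's similarity score.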
To validate the results of the CAM technique, we obtained position weight matrices representing all transcription factor binding sites from JASPAR [64]. We then used FIMO from the MEME Suite [65] to scan the subsequences highlighted by CAM for the occurrence of these binding sites.
Our approach leverages state-of-the-art deep networks, allowing it to learn sequence patterns directly without relying on manual feature engineering. To demonstrate its advantage, we compare it to traditional techniques that depend on engineered features. We describe these traditional networks in the following section.
Engineered features
We compared the performance of EnhancerMatcher to that of two traditional networks based on processing k-mer histograms of sequence triplets or 26 statistics calculated on these histograms. The following 26 statistics were chosen: “Manhattan, Euclidean, …, Chebyshev, Hamming, Minkowski, Cosine, Correlation, Bray Curtis, Squared chord, Hellinger, Conditional KL divergence, K divergence, Jeffrey divergence, Jensen-Shannon divergence, Revised relative entropy, Intersection, Kulczynski 1, Kulczynski 2, Covariance, Harmonic mean, Similarity ratio, Markov, SimMM, …, and …” [66]. The histogram data are represented as a three-channel matrix, each channel of which includes a histogram representing one of the three sequences. The statistical data are represented as a two-channel matrix: the first channel includes the 26 statistics calculated on the histograms of the first and second sequences, and the second channel includes the 26 statistics calculated on the histograms of the first and third sequences. We created two networks, each of which passes its input through a dense projection layer that transforms the multi-channel data into single-channel data, followed by two dense layers that classify the input as a similar or dissimilar triplet.
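To make the engineered features concrete, here is a minimal sketch of a normalized k-mer histogram and three of the listed statistics (Manhattan, Euclidean, and Cosine distance); the function names are ours, not the paper's:

```python
import itertools
import numpy as np

def kmer_histogram(seq, k=3):
    """Normalized k-mer frequency histogram of a DNA sequence."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    hist = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:          # skip k-mers containing ambiguous bases
            hist[j] += 1
    total = hist.sum()
    return hist / total if total > 0 else hist

def histogram_statistics(h1, h2):
    """Three of the 26 pairwise statistics: Manhattan, Euclidean, Cosine."""
    manhattan = np.abs(h1 - h2).sum()
    euclidean = np.sqrt(((h1 - h2) ** 2).sum())
    cosine = 1.0 - h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2))
    return manhattan, euclidean, cosine
```

For a triplet, the three histograms would be stacked as channels, while the statistics of the (first, second) and (first, third) pairs would form the two-channel input described above.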
Traditional multi-label classifier
Another traditional approach we tested uses a multi-label classifier that takes a single sequence as input and outputs a vector of 1’s and 0’s indicating, for each of 222 cell types, whether the sequence is active in that cell type. The input sequence is passed through an embedding layer and then through four convolutional blocks. Each block consists of two one-dimensional convolutional layers followed by a batch normalization layer. The output is then flattened and passed through two dense layers, with a batch normalization layer in between, before the final output layer.
So far, we have described our datasets, a baseline network, the core network of EnhancerMatcher, two networks utilizing engineered features, and a traditional multi-label network. Next, we discuss the criteria used to evaluate EnhancerMatcher and the related approaches.
Evaluation criteria
We designed the datasets to be balanced, i.e., each dataset contains an equal number of similar and dissimilar triplets. Accuracy, recall, and specificity are three evaluation criteria that are suitable for evaluating a classifier on balanced datasets. Accuracy is the percentage of correctly predicted triplets, whether similar or dissimilar. Recall is the percentage of similar triplets that are correctly predicted. Specificity is the percentage of dissimilar triplets that are correctly predicted.
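The three criteria reduce to counts from a binary confusion matrix; a minimal sketch (the function name is ours, not part of the published tool):

```python
def triplet_metrics(y_true, y_pred):
    """Accuracy, recall, and specificity for the binary triplet task.

    y_true, y_pred: sequences of 1 (similar triplet) and 0 (dissimilar triplet).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    accuracy = (tp + tn) / len(y_true)   # all correctly predicted triplets
    recall = tp / pos                    # similar triplets correctly predicted
    specificity = tn / neg               # dissimilar triplets correctly predicted
    return accuracy, recall, specificity
```

On a balanced dataset, a degenerate classifier that always answers "similar" would score 100% recall but only 50% accuracy and 0% specificity, which is why all three criteria are reported together.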
With the criteria explained, we now discuss the performance of the various networks we just described.
Results and discussion
Baseline network
We started with a deep network consisting of multiple depthwise separable convolutional layers [67]. This network is a simple sequential model and is easy to construct. It processes the three sequences of a triplet together. Table 1 shows the performance of this network on the training and validation datasets, reported as accuracy, recall, and specificity (recall that each dataset is balanced, containing equal numbers of similar and dissimilar triplets). The recall of this network is slightly higher than its specificity: on the training dataset, it achieved 84% recall and 76% specificity, and on the validation dataset, 80% recall and 74% specificity. Overall, the baseline network achieved 80% training accuracy and 77% validation accuracy. These results show that sequence triplets with similar enhancer activities can be differentiated from those with dissimilar enhancer activities or with no enhancer activity. Because this network utilizes sequence information only, these results demonstrate that enhancers have distinguishable features at the sequence level.
Table 1.
Performance of multiple enhancer classifiers on different datasets
| Dataset | Training accuracy | Training recall | Training specificity | Validation accuracy | Validation recall | Validation specificity |
|---|---|---|---|---|---|---|
| Baseline network | ||||||
| Overall | 79.86 | 83.80 | 75.93 | 77.38 | 80.38 | 74.38 |
| Shuffled | 87.16 | 83.47 | 90.85 | 85.61 | 80.40 | 90.83 |
| Length | 82.43 | 83.36 | 81.50 | 80.50 | 80.01 | 80.99 |
| Length-no-repeats | 82.29 | 83.74 | 80.84 | 79.89 | 80.06 | 79.72 |
| Length-GC | 81.02 | 83.64 | 78.40 | 78.62 | 80.05 | 77.20 |
| Length-GC-no-repeats | 78.27 | 83.61 | 72.93 | 75.95 | 80.33 | 71.57 |
| Dissimilar enhancers | 65.38 | 83.48 | 47.27 | 63.25 | 80.46 | 46.03 |
| EnhancerMatcher | ||||||
| Overall | 88.80 | 91.42 | 86.17 | 85.55 | 86.61 | 84.50 |
| Shuffled | 94.50 | 91.77 | 97.24 | 91.47 | 86.15 | 96.79 |
| Length | 92.17 | 91.42 | 92.92 | 89.24 | 86.15 | 92.32 |
| Length-no-repeats | 92.18 | 91.61 | 92.74 | 88.92 | 86.52 | 91.30 |
| Length-GC | 92.11 | 91.80 | 92.43 | 88.55 | 86.17 | 90.92 |
| Length-GC-no-repeats | 90.87 | 91.64 | 90.10 | 87.25 | 86.19 | 88.31 |
| Dissimilar enhancers | 70.36 | 91.26 | 49.46 | 66.79 | 86.21 | 47.38 |
| Specialized EnhancerMatcher | ||||||
| Overall | 73.44 | 69.65 | 77.24 | 71.52 | 65.87 | 77.17 |
| Shuffled | 67.23 | 70.03 | 64.43 | 65.64 | 65.45 | 65.84 |
| Length | 77.32 | 71.30 | 83.35 | 74.03 | 65.53 | 82.53 |
| Length-no-repeats | 75.35 | 70.03 | 80.68 | 73.02 | 65.84 | 80.19 |
| Length-GC | 76.94 | 71.38 | 82.49 | 73.76 | 65.53 | 82.00 |
| Length-GC-no-repeats | 74.20 | 70.02 | 78.39 | 71.66 | 65.33 | 78.00 |
| Dissimilar enhancers | 74.59 | 70.79 | 78.40 | 69.63 | 65.44 | 73.81 |
| Combined EnhancerMatcher | ||||||
| Overall | 82.71 | 69.20 | 96.22 | 77.98 | 61.98 | 93.97 |
| Shuffled | 84.39 | 70.13 | 98.65 | 80.55 | 62.77 | 98.33 |
| Length | 82.15 | 66.52 | 97.77 | 79.94 | 62.76 | 97.12 |
| Length-no-repeats | 83.05 | 69.69 | 96.41 | 78.84 | 61.26 | 96.42 |
| Length-GC | 81.36 | 65.97 | 96.74 | 79.75 | 62.83 | 96.66 |
| Length-GC-no-repeats | 81.32 | 66.37 | 96.27 | 78.18 | 61.08 | 95.27 |
| Dissimilar enhancers | 72.23 | 64.12 | 80.34 | 70.59 | 61.21 | 79.97 |
We present accuracy, recall, and specificity scores obtained during training and validation across multiple datasets. The baseline network’s performance is compared to EnhancerMatcher, its specialized version, and a combined classifier. The baseline network classifies enhancer triplets using separable convolutional layers. EnhancerMatcher, in contrast, employs two sub-networks: a convolutional sub-network and a classifier sub-network. The specialized version of EnhancerMatcher was trained on the similar triplets and on only those dissimilar triplets in which the third enhancer is inactive in the cell types where the first and second enhancers are active. None of the randomly sampled or shuffled control datasets was used in training the specialized EnhancerMatcher. The combined EnhancerMatcher integrates both EnhancerMatcher and its specialized version. A triplet is classified as similar only if both classifiers agree; otherwise, it is classified as dissimilar.
Next, the baseline network was evaluated on the six negative datasets separately (Table 1). We increased the complexity of the datasets by controlling for length, GC content, or exclusion of repeats. These evaluations show a trend: as the datasets incorporate more features of enhancers, the classification task becomes more difficult, yet still achievable. On the shuffled dataset, the baseline network obtained a validation accuracy score of 86%, which progressively decreased to 81% on the Length dataset, 80% on the Length-no-repeats dataset, 79% on the Length-GC dataset, 76% on the Length-GC-no-repeats dataset, and 63% on the dissimilar enhancers dataset. The baseline classifier obtained the lowest accuracy score on the dissimilar enhancers, demonstrating once again that enhancers possess special features at the sequence level.
EnhancerMatcher
The baseline network was able to recognize distinguishing features of similar and dissimilar enhancers at the sequence level. Constructing a classifier that processes one sequence at a time (in contrast to processing three sequences together) should yield better results by taking advantage of advances in metric learning and representation learning. We call this classifier EnhancerMatcher. Overall, EnhancerMatcher achieved a training accuracy of 89% and a validation accuracy of 85%, a relative improvement of roughly 11% over the baseline network’s training and validation accuracy scores. EnhancerMatcher obtained 91% training recall and 86% validation recall; its specificity was 86% on the training and 84% on the validation datasets. Relative to the baseline network, this model improved validation recall by about 6 percentage points and validation specificity by about 10.
Next, we studied the performance of EnhancerMatcher on the six negative datasets. The same trend observed for the baseline network held for EnhancerMatcher: the easiest negative dataset is the shuffled sequences (91% validation accuracy) and the most difficult is the dissimilar enhancers (67% validation accuracy). Although the results improved relative to the baseline network, EnhancerMatcher still struggled on the dissimilar enhancers, with a validation accuracy of 67%, recall of 86%, and specificity of 47%. These results led us to train a specialized EnhancerMatcher using dissimilar enhancers as the only controls. The specialized EnhancerMatcher classifies triplets of similar enhancers and triplets of dissimilar enhancers with higher accuracy (70% versus 67%), lower recall (65% versus 86%), and much higher specificity (74% versus 47%) than the EnhancerMatcher trained on all control datasets. Even though the specialized network’s ability to recognize similar enhancer triplets is modest, its ability to recognize dissimilar enhancers is promising.
Combined EnhancerMatcher
We value classifiers with high specificity because sequences with enhancer activities represent a small fraction of the whole genome. With this criterion in mind, we combined the predictions of the two EnhancerMatcher models. A triplet is classified as similar if the two networks agree (i.e. the two networks output scores greater than 0.5); otherwise, the triplet is classified as dissimilar. Table 1 shows the results of the combined classifier, which consists of the two EnhancerMatcher models. This approach resulted in a classifier with reasonable accuracy of 78%, modest recall of 62%, and much improved specificity of 94%, allowing for identifying similar enhancers among a plethora of irrelevant sequences.
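The agreement rule is simply a logical AND over the two networks’ output scores; a minimal sketch with illustrative names:

```python
def combined_prediction(score_general, score_specialized, threshold=0.5):
    """Classify a triplet as similar (1) only if both models agree,
    i.e. both output scores exceed the threshold; otherwise dissimilar (0)."""
    return int(score_general > threshold and score_specialized > threshold)
```

Because a triplet must pass both classifiers to be called similar, false positives from either model are filtered out, which is exactly why specificity rises (to 94%) at the cost of recall.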
EnhancerMatcher properly handles sequence orientation
One of the challenges in locating enhancers is that they are not restricted with respect to the strand orientation of their target genes. We conducted two experiments to test EnhancerMatcher’s ability to identify enhancers presented in the forward and the reverse-complement orientations. During training, each sequence has a 50% chance of being presented on the forward strand or on the reverse-complement strand (one of our data-augmentation strategies). In the first experiment, the network trained on these randomly mixed orientations was evaluated on all sequences in the forward orientation; we then repeated the evaluation with all sequences in the reverse-complement orientation. As expected, the validation accuracy scores for the three cases (forward, reverse complement, and randomly mixed) were very similar (85%, 85%, and 86%). Validation recall was 86% for the forward orientation, 86% for the reverse-complement orientation, and 87% for the randomly mixed orientations, and validation specificity was 84% in all three cases. These consistent results across all orientations indicate that (i) the data-augmentation technique used during training is effective and (ii) the network classifies similar and dissimilar enhancer triplets successfully regardless of strand orientation.
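The orientation augmentation can be sketched as follows; the helper names are ours, not the paper’s:

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def augment_orientation(seq, rng=random):
    """With probability 0.5, present the sequence on the reverse-complement strand."""
    return reverse_complement(seq) if rng.random() < 0.5 else seq
```

Applied independently to each sequence of a triplet at training time, this exposes the network to all combinations of strand orientations.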
EnhancerMatcher generalizes well across all cell types
EnhancerMatcher was trained using data from 222 cell types found in the CATlas database. However, the number of sequences associated with each cell type is highly imbalanced, with some cell types having many known enhancers and others only a few. To evaluate how well EnhancerMatcher generalizes across both well-represented and poorly represented cell types, we performed cell-type-specific testing.
We began by taking our comprehensive test triplet dataset and systematically decomposing each similar triplet into its corresponding cell type(s). Since a single triplet can be active in multiple cell types, each triplet was assigned to all applicable cell types. This allowed us to quantify the number of test triplets available per cell type. We applied the same process to the dissimilar triplets, ensuring that the first two sequences in each triplet were active in the same cell type, while the third sequence belonged to a different cell type. For each cell type, we then constructed a balanced dataset containing equal numbers of similar and dissimilar triplets. Finally, we evaluated EnhancerMatcher’s performance independently on each cell-type-specific dataset. This enabled us to measure accuracy, recall, and specificity for each cell type and assess how generalization varies with respect to the number of available triplets per cell type.
As depicted in Fig. 4, we evaluated EnhancerMatcher separately on each cell type and plotted accuracy (a), recall (b), and specificity (c) against the number of available triplets for that cell type. These figures display results for the LGR (Length-GC) control; we repeated this analysis for all other control types, namely Enhancer, Shuffled, LR (Length), LNR (Length-no-repeats), and LGNR (Length-GC-no-repeats), and observed consistent trends (data not shown).
Figure 4.
Performance of EnhancerMatcher across individual cell types. Each data point shows the evaluation of EnhancerMatcher on a single cell type, using a balanced dataset of positive and negative triplets specific to that cell type. (a) Accuracy, (b) recall, and (c) specificity are plotted against the number of available triplets per cell type. The red dashed line in each subplot marks the overall performance on the complete test set with the LGR control. Cell types with fewer triplets exhibit greater variance, reflecting random fluctuation in small samples; as the triplet count increases, the metrics converge and stabilize. Crucially, the plots show no upward trend, which would be expected if EnhancerMatcher were biased toward cell types with a large number of known enhancers.
Visually, the plots demonstrate that EnhancerMatcher generalizes effectively across the full spectrum of cell types, from those with very few triplets to those with thousands. The performance metrics (accuracy, recall, and specificity) show a broad scatter of points with no discernible upward or downward trend as the number of triplets increases. To quantify this observation, we computed the Pearson correlation coefficient between cell-type size (number of triplets) and each evaluation metric. For accuracy, the coefficients across all controls ranged from −0.0923 (Enhancer control) to 0.0916 (LGR control). For recall, the range was −0.0149 (LR control) to 0.0494 (LGNR control), and for specificity, it ranged from −0.2005 (Enhancer control) to 0.2111 (LGR control). Across all controls, these correlations remain close to zero, indicating no appreciable dependence of EnhancerMatcher’s performance on the number of available triplets for a given cell type. Together with the visual distribution in the plots, this analysis supports the conclusion that EnhancerMatcher generalizes well to cell types with both rare and abundant enhancers, making it broadly applicable regardless of existing data representation.
EnhancerMatcher can assess similarity of mouse enhancers
To test whether our human-trained triplet network generalizes across species, we evaluated EnhancerMatcher on mouse enhancers from six cell types that have corresponding human cell types in our data set. These were: adipocyte, B-cell, CD4+ T cell, CD8+ T cell, NK cell, and T cell. We obtained these sequences from the ENCODE project [68], and we processed them in the same way we did with the human data by removing sequences overlapping with promoters, exons, or insulators. We extended any sequence that was between 25 and 400 bp long to 400 bp. Next, we conducted two experiments. In the first, we assessed the similarity of one mouse enhancer to two human enhancers. In the second, we assessed the similarity of one mouse enhancer to two mouse enhancers.
In the first experiment, for each mouse cell type, we first randomly sampled 21 human enhancer pairs from the corresponding human cell type. We then formed test triplets by joining a human enhancer pair with a mouse enhancer of the same cell type. A given mouse region was called similar whenever at least 11 out of the 21 predictions were positive. The network’s recall on the mouse enhancers was 64% (adipocyte), 62% (B cell), 53% (CD4), 67% (CD8), 63% (NK cell), and 63% (T cell). These results show that EnhancerMatcher, although trained purely on human data, retains sufficient cell-type specificity to recognize mouse enhancers.
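The calling scheme is a majority vote over the 21 pairings; a minimal sketch with illustrative names:

```python
def call_by_majority(pair_scores, min_positive=11, threshold=0.5):
    """Call a mouse region 'similar' if at least min_positive of the
    per-pair network scores (here, 21 human enhancer pairs) are positive."""
    positives = sum(1 for score in pair_scores if score > threshold)
    return positives >= min_positive
```

Requiring 11 of 21 positive votes makes the call robust to any single unrepresentative human enhancer pair.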
In the second experiment, we constructed 10 000 positive triplets for each of the six mouse cell types, where each triplet consisted of three enhancers active in the same cell type. We then paired these with 10 000 negative triplets by sampling two enhancers active in the same cell type and one control sequence. These control sequences were generated using the same types as our human controls (LR, LNR, LGR, LGNR, and Shuffled), but specifically from mouse sequences. As summarized in Table 2, EnhancerMatcher achieved 72%–80% accuracy, 54%–72% recall, and 89%–91% specificity across the six cell types tested. CD4+ T cells proved the most challenging, showing 72% accuracy and 54% recall. Conversely, CD8+ T cells exhibited the strongest performance with 80% accuracy and 72% recall. Importantly, every cell type achieved at least 72% accuracy and 89% specificity. These results demonstrate that even without human-derived enhancers, EnhancerMatcher can successfully assess the similarity of mouse enhancers.
Table 2.
Performance of EnhancerMatcher on similar and dissimilar mouse triplets
| Cell type | Accuracy | Recall | Specificity |
|---|---|---|---|
| Adipocyte | 78.00 | 66.00 | 91.00 |
| B-cell | 76.00 | 63.00 | 89.00 |
| CD4+ T cell | 72.00 | 54.00 | 90.00 |
| CD8+ T cell | 80.00 | 72.00 | 89.00 |
| NK cell | 78.00 | 66.00 | 90.00 |
| T cell | 78.00 | 69.00 | 90.00 |
To evaluate EnhancerMatcher’s performance on mouse data, we constructed triplets exclusively from mouse enhancers. Specifically, for each of six distinct mouse cell types, we sampled enhancers to form 10 000 positive triplets, where each triplet consisted of three enhancers active within the same cell type. We also created negative triplets by sampling two enhancers active in the same cell type and one control sequence from the mouse data, utilizing various control types: LR, LNR, LGR, LGNR, and Shuffled mouse enhancers.
Collectively, these results demonstrate that EnhancerMatcher performs robustly in assessing enhancer similarity and, in particular, generalizes across species. The first experiment’s recall rates (53%–67%) on mouse enhancers with a human-trained model indicate that EnhancerMatcher learns cell-type-specific enhancer characteristics that are conserved across species. The second experiment reinforces this: on purely mouse data, EnhancerMatcher consistently achieved high accuracy (72%–80%) and high specificity (89%–91%), reliably distinguishing similar from dissimilar enhancer activities. The consistent performance across diverse mouse cell types, coupled with its ability to generalize from human training data, underscores EnhancerMatcher’s potential as a versatile computational tool for annotating enhancers in newly sequenced mammalian and other vertebrate genomes.
EnhancerMatcher can highlight important subsequences
Through the application of CAMs, we can identify the most important regions within a sequence triplet. Our network considers these subsequences to be the most influential in determining whether a sequence triplet represents three similar enhancers. To illustrate this feature of EnhancerMatcher, we provide an example. We highlight the key subsequences that led EnhancerMatcher to classify three enhancers—chr13:23986592-23986991, chr11:35781443-35781842, and chr1:170929765-170930164—as similar enhancers. These enhancers are active in the following cell types: Pericyte Muscularis and Fetal Fibro Splenic. Figure 5 presents the CAMs generated by EnhancerMatcher. Each map represents the contribution of specific subsequences within each enhancer to the network’s classification decision. Warmer colors, such as red shades, indicate regions of high importance, while cooler colors, such as blue shades, correspond to regions of lower importance. The activation map for the first enhancer (chr13:23986592-23986991) highlights two key regions: one of moderate importance in the second quarter and another of high importance in the third quarter. The map for the second enhancer (chr11:35781443-35781842) reveals three important regions, ranging from moderate to high importance, spanning the first, second, and fourth quarters of the sequence. The map for the third enhancer (chr1:170929765-170930164) identifies a single important region located between the second and third quarters. This visualization provides valuable insight for life scientists studying the regulatory activity of these sequences. The highlighted subsequences likely contain common transcription factor binding sites. To validate this hypothesis, we conducted the following experiment.
Figure 5.
CAMs of a triplet sequence. Each class activation map is visualized as a heat map, where dark red regions indicate the most influential positions contributing to the network’s decision. These highlighted sub-regions are expected to be of biological significance. In this example, the key subsequence for the first sequence lies between base pairs 210 and 230. For the second sequence, the important regions are between base pairs 90–190 and 350–380. For the third sequence, the region of importance spans base pairs 155 to 190.
Important regions identified by the network contain relevant biological information
To validate the usefulness of the information provided by CAMs, we used the per-nucleotide scores produced by CAMs, where 0 represents the lowest signal and 1 the highest. We tested thresholds from 0.0 to 0.5 for identifying important regions; a region with an average CAM score above a given threshold (e.g. 0.3) was considered an important region. We hypothesized that these important regions include transcription factor binding sites and that the number of common binding sites in similar triplets is higher than in dissimilar and random triplets. Recall that the first and second sequences of a dissimilar triplet are enhancers with similar activity. Any three sequences sampled randomly from a control dataset constitute a random triplet.
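One simple way to realize this criterion is to take contiguous stretches whose per-nucleotide scores all exceed the threshold, which guarantees that the region’s average also exceeds it; a sketch with illustrative names:

```python
import numpy as np

def important_regions(cam_scores, threshold=0.3, min_len=1):
    """Contiguous stretches of CAM scores above `threshold`, returned as
    (start, end) half-open intervals. Because every position in a region
    exceeds the threshold, the region's average exceeds it as well."""
    above = np.asarray(cam_scores) > threshold
    regions, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                         # open a new region
        elif not flag and start is not None:
            if i - start >= min_len:
                regions.append((start, i))    # close the region
            start = None
    if start is not None and len(above) - start >= min_len:
        regions.append((start, len(above)))   # region runs to sequence end
    return regions
```

The extracted intervals are the subsequences that were subsequently scanned with FIMO for transcription factor binding sites.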
To test this hypothesis, we designed experiments in which we identified important regions in a sequence triplet and then counted the number of common binding sites (at least two sites) in these important regions. For each control dataset, we counted common binding sites in 1000 similar triplets, 1000 dissimilar triplets, and 1000 random triplets. First, we analyzed important regions identified by EnhancerMatcher on the six control datasets. Then, on the dissimilar enhancer dataset, we also analyzed important regions identified by the specialized version, because the specialized tool performed better than the general one on this set. The results are shown in Table 3.
Table 3.
Results of the experiment identifying common binding sites in important regions across CAM score thresholds (0.0–0.5)
| Threshold | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
|---|---|---|---|---|---|---|
| Length | ||||||
| Similar | 679 | 563 | 362 | 193 | 91 | 37 |
| Random | 578 | 490 | 291 | 160 | 61 | 22 |
| Dissimilar | 561 | 496 | 345 | 167 | 81 | 34 |
| P-Value(similar and random) | 3.34E−11 | 2.22E−06 | 7.52E−07 | 0.0031 | 0.0001 | 0.0019 |
| P-Value(similar and dissimilar) | 1.51E−14 | 1.27E−05 | 0.1363 | 0.0166 | 0.1361 | 0.3235 |
| Length-no-repeats | ||||||
| Similar | 661 | 538 | 361 | 195 | 93 | 32 |
| Random | 448 | 362 | 227 | 110 | 41 | 17 |
| Dissimilar | 489 | 434 | 261 | 164 | 63 | 16 |
| P-Value(similar and random) | 6.01E−42 | 7.55E−30 | 6.78E−22 | 2.60E−15 | 5.57E−13 | 0.0007 |
| P-Value(similar and dissimilar) | 4.10E−28 | 2.55E−11 | 2.16E−12 | 0.0053 | 0.0001 | 0.0002 |
| Length-GC | ||||||
| Similar | 632 | 519 | 351 | 200 | 85 | 31 |
| Random | 611 | 477 | 319 | 146 | 70 | 26 |
| Dissimilar | 571 | 506 | 296 | 169 | 60 | 29 |
| P-Value(similar and random) | 0.0915 | 0.0043 | 0.0169 | 2.20E−06 | 0.0391 | 0.1837 |
| P-Value(similar and dissimilar) | 4.98E−05 | 0.2146 | 9.98E−05 | 0.0058 | 0.00096 | 0.3784 |
| Length-GC-no-repeats | ||||||
| Similar | 640 | 520 | 329 | 166 | 64 | 20 |
| Random | 441 | 319 | 190 | 91 | 33 | 13 |
| Dissimilar | 558 | 493 | 329 | 155 | 74 | 28 |
| P-Value(similar and random) | 9.07E−37 | 1.62E−39 | 1.33E−25 | 4.79E−14 | 6.82E−07 | 0.0416 |
| P-Value(similar and dissimilar) | 8.30E−08 | 0.0469 | 0.5119 | 0.1790 | 0.8998 | 0.9544 |
| Shuffled | ||||||
| Similar | 649 | 554 | 374 | 205 | 94 | 31 |
| Random | 234 | 159 | 90 | 35 | 17 | 3 |
| Dissimilar | 549 | 489 | 333 | 209 | 84 | 36 |
| P-Value(similar and random) | 8.63E−171 | 1.00E−179 | 6.78E−132 | 1.24E−92 | 9.93E−40 | 2.81E−21 |
| P-Value(similar and dissimilar) | 8.50E−11 | 2.23E−05 | 0.0035 | 0.6343 | 0.1399 | 0.8241 |
| Dissimilar enhancers | ||||||
| Similar | 529 | 366 | 224 | 193 | 115 | 41 |
| Random | 572 | 454 | 281 | 128 | 62 | 21 |
| Dissimilar | 559 | 507 | 323 | 202 | 78 | 32 |
| P-Value(similar and random) | 0.9972 | 1.00 | 0.99998 | 4.33E−09 | 2.49E−10 | 5.89E−05 |
| P-Value(similar and dissimilar) | 0.9738 | 1.00 | 1.00 | 0.7718 | 2.44E−05 | 0.0674 |
| Dissimilar enhancers—the performance of the specialized version | ||||||
| Similar | 677 | 521 | 360 | 210 | 104 | 28 |
| Random | 594 | 450 | 286 | 143 | 56 | 18 |
| Dissimilar | 591 | 471 | 285 | 131 | 41 | 21 |
| P-Value(similar and random) | 3.68E−08 | 3.94E−06 | 2.41E−07 | 6.22E−09 | 1.84E−09 | 0.0164 |
| P-Value(similar and dissimilar) | 1.26E−08 | 0.00087 | 1.65E−07 | 3.23E−12 | 1.63E−17 | 0.0803 |
Note that a threshold of 0.0 means an entire sequence is considered an important region. The table displays the number of common binding sites detected in similar, random (any three sequences sampled from a control dataset), and dissimilar sequence triplets (where the first and second sequences are similar enhancers, but the third sequence may be a random genomic sequence, a shuffled version of the first enhancer, or an enhancer that is inactive in the cell types where the first and second enhancers are active). Additionally, P-values (calculated using a one-tailed binomial test) assess the statistical significance of the enrichment of common binding sites in important regions of similar triplets compared to those found in random or dissimilar triplets. P-values <0.05 indicate a statistically significant enrichment of common binding sites (at least two) in important regions of similar triplets compared to another group (random or dissimilar).
First, let us exclude the dissimilar enhancer set from our analysis. Important regions identified in similar triplets were significantly more enriched with common binding sites than those found in random triplets in 28 out of 30 experiments, and more enriched than those found in dissimilar triplets in 18 out of 30 experiments. Now consider the important regions identified by the general EnhancerMatcher and by the specialized version on the dissimilar enhancer set. As expected, important regions identified by the general EnhancerMatcher were not consistently enriched with common binding sites relative to random and dissimilar triplets (significant enrichments in only 3 out of 6 and 1 out of 6 experiments, respectively). In contrast, important regions identified by the specialized version of our tool were significantly more enriched with common binding sites in similar triplets than in dissimilar and random triplets (11 out of 12 experiments). These experiments confirmed our hypothesis that important regions identified by CAMs are biologically relevant.
In-silico knockout experiments
The purpose of these experiments is to evaluate the role of important regions identified by CAMs in model predictions. We extracted the important regions of a triplet using CAMs. These regions contributed to the network’s decision to classify the triplets as similar or dissimilar. We hypothesized that these important regions are necessary for enhancer activity. To validate this hypothesis, we performed a knockout experiment for CAM scores ranging from 0.2 to 0.5. In each experiment, an important region found in the third sequence (a similar enhancer) was randomly shuffled. For simplicity, we selected triplets where the third sequence contained only one important region. Additionally, we ensured that the length of the region was between 25 and 100 bp; if the important region was too long, shuffling it could completely disrupt the sequence. We selected this range to focus on the effects of shuffling an important region.
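The shuffling step can be sketched as follows (an illustrative helper, not the authors’ code); the base composition of the region is preserved while its order, and thus any motifs it contains, is destroyed:

```python
import random

def knockout_region(seq, start, end, rng=None):
    """Shuffle the nucleotides of seq[start:end], leaving the flanks
    intact: an in-silico 'knockout' of an important region."""
    rng = rng or random.Random()
    region = list(seq[start:end])
    rng.shuffle(region)
    return seq[:start] + "".join(region) + seq[end:]
```

The modified third sequence is then reassembled with the two unchanged reference enhancers, and the triplet is re-scored by EnhancerMatcher.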
For the sequences that met this criterion, we shuffled the important region and reclassified the modified triplets using EnhancerMatcher. We then compared the network’s predictions with those for the original triplets. The results are shown in Table 4. For the thresholds of 0.2, 0.3, 0.4, and 0.5, the percentages of similar triplets that were reclassified as dissimilar after knocking out the important region of the third sequence were 50%, 48%, 48%, and 46%, respectively. This outcome is consistent with our hypothesis: disrupting the important regions substantially changed the network’s predictions.
Table 4.
Knockout experiment results after shuffling important regions versus shuffling equally sized regions outside important regions in the third sequence of similar triplets
| Threshold | Samples | Important | Others | P-value |
|---|---|---|---|---|
| 0.2 | 482 | 244 (50%) | 104 (22%) | 1.45E−044 |
| 0.3 | 1681 | 815 (48%) | 454 (27%) | 3.67E−078 |
| 0.4 | 3386 | 1626 (48%) | 892 (26%) | 8.71E−160 |
| 0.5 | 5969 | 2770 (46%) | 1875 (31%) | 5.96E−129 |
Important regions were identified using different CAM score thresholds (0.2–0.5). Shuffling important regions led to the reclassification of similar triplets as dissimilar in substantially more samples than shuffling other non-overlapping, equally sized regions of the third sequence (46%–50% versus 22%–31%). P-values were calculated using the binomial test. Numbers listed under “Important” represent the count of similar triplets that were reclassified as dissimilar after shuffling the important regions of the third sequences. Numbers listed under “Others” represent the count of similar triplets that were reclassified as dissimilar after shuffling regions outside the important regions in the third sequences.
As a control, we repeated the same experiments, but instead of shuffling the important region of the third similar enhancer, we shuffled another non-overlapping region of the same size. Effectively, this experiment provided a baseline estimate of how many true positives became negatives when shuffling an arbitrary region within a similar enhancer. To assess the significance of the important regions, we calculated a P-value using the binomial test. We counted the triplets that were reclassified as dissimilar after shuffling the important region, and likewise after shuffling another non-overlapping region of the same size. Using these counts and the corresponding P-values, we found that shuffling important regions reclassified similar triplets as dissimilar substantially more often than shuffling another region of the same size, across all thresholds (50%, 48%, 48%, and 46% versus 22%, 27%, 26%, and 31%). These results confirm our hypothesis that the important regions identified by CAMs include biologically relevant information that influences EnhancerMatcher’s decision.
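The significance computation can be illustrated with the counts from Table 4. The exact null model is not spelled out in this section, so the sketch below makes one plausible assumption: the control reclassification rate (104/482 at threshold 0.2) serves as the null probability, and we ask how likely 244 or more reclassifications would be by chance. The tail probability is summed exactly in log space to avoid underflow:

```python
from math import lgamma, log, exp

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly via log-space pmf terms."""
    logp, logq = log(p), log(1.0 - p)
    total = 0.0
    for i in range(k, n + 1):
        logpmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                  + i * logp + (n - i) * logq)
        total += exp(logpmf)
    return total

# Threshold 0.2 in Table 4: 244 of 482 triplets flipped after shuffling the
# important region, versus 104 of 482 in the control (assumed null rate).
pval = binom_sf(244, 482, 104 / 482)
```

Under this construction the P-value is vanishingly small, consistent in spirit with the extreme values reported in Table 4, though the authors’ exact test setup may differ.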
Case study
As an example, we conducted the knockout experiment on a true positive triplet (Fig. 6a). First, we shuffled the important region of the third sequence (Fig. 6b), which caused the network to reclassify the triplet as dissimilar; the modified triplet’s CAM highlights regions that differ from those highlighted in the original triplet. Next, we kept the important region intact and instead shuffled a non-overlapping region of the same length (Fig. 6c). In this case, the network still classified the triplet as similar, and the modified triplet’s CAM is almost identical to that of the original triplet despite the shuffled region. This example highlights the critical role of the CAM in identifying the regions most relevant for assessing enhancer similarity.
Figure 6.
An example knockout experiment. (a) The CAM of a true positive similar triplet. (b) The CAM of the triplet after shuffling the important region within the third sequence. (c) The CAM of the triplet after shuffling a non-overlapping region with the same length as the important region of the third sequence.
EnhancerMatcher performs well according to shuffled K-fold cross-validation
While our original approach partitions the enhancer sequences into training, validation, and testing sets before constructing triplets—thereby preventing any overlap of individual sequences across partitions—we also explored an alternative strategy to further validate the robustness of our method. Specifically, we conducted a K-fold cross-validation experiment using the CATlas enhancer dataset. We split the enhancer sequences into three equal partitions and generated triplets within each fold. For each fold, two partitions were used for training and the remaining one for validation, resulting in three model configurations. This ensured that no triplet used during training was reused in validation, avoiding direct overlap. We repeated this process three times, each time shuffling the sequences before partitioning them, yielding a total of nine trained models.
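The partitioning scheme described above can be sketched as follows; the triplet-construction step itself is omitted, and the function names are our own:

```python
import random

def shuffled_kfold(sequences, k=3, seed=0):
    """Shuffle the sequences, then split them into k disjoint, near-equal partitions."""
    pool = list(sequences)
    random.Random(seed).shuffle(pool)
    return [pool[i::k] for i in range(k)]

def folds_for_training(partitions):
    """Yield (train_sequences, validation_sequences) pairs, one per fold."""
    for i, val in enumerate(partitions):
        train = [s for j, part in enumerate(partitions) if j != i for s in part]
        yield train, val
```

Triplets are then generated separately within the training and validation sets, so no triplet crosses a partition boundary; repeating the whole procedure with three different shuffles yields the nine trained models described here.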
Across these models, the validation accuracy ranged from 81% to 87%, recall from 78% to 87%, and specificity from 75% to 84%. The average validation accuracy across the three iterations was 83%, with an average recall of 84% and an average specificity of 81%; the standard deviations were 3%, 4%, and 3%, respectively. These results are comparable to those of our original method (86% accuracy, 87% recall, and 85% specificity). We repeated these tests with shuffled four-fold and five-fold partitions, which likewise yielded results comparable to both the three-fold models and the original network.
These findings confirm that our original pipeline is reliable and not subject to any data leakage. Additionally, the original approach allows for the use of a dedicated testing set for true blind evaluation, which is not feasible in the K-fold method due to the need for ample sequences to form triplets for each fold. Moreover, the original method results in a single comprehensive model, while the shuffled three-fold approach produced nine models, requiring a consensus method for downstream decisions and complicating predictions. These results confirm once again the success of our training strategy.
EnhancerMatcher outperforms the traditional approach
Our EnhancerMatcher is unique in that no contemporary tool can directly assess the similarity of three sequences with respect to their enhancer activities. This task can, however, be accomplished indirectly using a conventional classifier. Here, we compare the performance of EnhancerMatcher to that of a conventional multi-label classifier. We utilized our training and validation datasets to develop the multi-label classifier. The classifier consists of an embedding layer followed by four convolutional blocks, each of which includes two convolutional layers and a batch-normalization layer; the final convolutional block is followed by two dense layers with a batch-normalization layer in between. The multi-label classifier takes a single sequence as input and outputs a list of 1s and 0s across 222 cell types, indicating the presence or absence of enhancer activity in the corresponding cell types. To compare the multi-label classifier to EnhancerMatcher, we passed each of the three sequences individually through the classifier. A triplet is classified as similar if the multi-label classifier outputs 1s for at least one common cell type across all three sequences. The multi-label classifier did not perform well; its recall and specificity were 0.1% and 100%, respectively. We initially thought that this poor performance was due to the large size of the control dataset relative to the enhancer dataset. In an attempt to improve performance, we retrained the classifier using only enhancer data, yet the output remained unchanged. We then modified the classifier’s architecture by increasing the number of neurons in the first dense layer and adding more convolutional blocks, but the classifier’s performance did not improve. These results demonstrate the clear advantages of EnhancerMatcher over the traditional approach.
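The indirect triplet decision described above reduces to a simple check on three 222-dimensional binary prediction vectors; a minimal sketch (the multi-label classifier itself is omitted):

```python
def triplet_similar(p1, p2, p3):
    """True if some cell type is predicted active (1) in all three sequences.

    p1, p2, p3 are per-cell-type binary prediction vectors, e.g. of length 222,
    as produced by passing each sequence through the multi-label classifier.
    """
    return any(a and b and c for a, b, c in zip(p1, p2, p3))
```

Because all three predictions must agree on at least one cell type, a classifier that rarely outputs 1s almost never declares a triplet similar, which matches the observed 0.1% recall and 100% specificity.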
EnhancerMatcher outperforms statistical and histogram-based approaches
To compare EnhancerMatcher (which automatically extracts features from raw sequences) with traditional methods (which rely on manually engineered features), we generated k-mer histograms from the training, validation, and testing datasets. We then calculated 26 pairwise statistics on the histograms of the first and second sequences, as well as the first and third sequences. Subsequently, we constructed two triplet classifiers: one processes the triplet of histograms directly, and the other processes the two lists of 26 pairwise statistics. Each network was trained on the combination of all control datasets to assess overall performance and was subsequently evaluated separately on each control dataset. For blind testing, the networks were evaluated on testing data that the models had never seen before.
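The manually engineered features can be illustrated as follows. The choice of k = 4 and of cosine similarity as the example pairwise statistic are our own assumptions; this section does not enumerate the 26 statistics the classifiers used:

```python
from collections import Counter
from itertools import product
from math import sqrt

def kmer_histogram(seq, k=4):
    """Normalized k-mer frequency vector over all 4**k possible A/C/G/T k-mers."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts.values()), 1)
    return [counts["".join(km)] / total for km in product("ACGT", repeat=k)]

def cosine_similarity(u, v):
    """One example of a pairwise statistic computed on two k-mer histograms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Statistics such as this would be computed for the (first, second) and (first, third) histogram pairs and fed to the statistical triplet classifier, whereas the histogram-based classifier consumes the three histograms directly.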
The results are summarized in Table 5. Overall, EnhancerMatcher outperformed both the histogram-based and statistical triplet classifiers, achieving a high test accuracy of 90%, compared to 63% and 74%. EnhancerMatcher also achieved superior test recall (92%) in contrast to 59% and 73% for the other networks, and demonstrated the best specificity (87%) compared to 67% and 74%. Even when evaluated on the challenging dissimilar enhancer dataset, EnhancerMatcher consistently outperformed the statistical and histogram-based classifiers. On the dissimilar enhancer dataset, EnhancerMatcher achieved superior test accuracy, recall, and specificity (accuracy: 68% versus 56% and 60%; recall: 85% versus 67% and 63%; specificity: 52% versus 44% and 57%). These results demonstrate that the features learned automatically by EnhancerMatcher are more informative than the traditional manually engineered features, such as k-mer histograms and pairwise statistics, which have dominated the field for a long time. The consistent high accuracy, recall, and specificity across all datasets highlight EnhancerMatcher’s ability to accurately evaluate the similarity of three sequences with respect to their enhancer activity.
Table 5.
The performance of EnhancerMatcher, a histogram-based triplet classifier, and statistical triplet classifier
| Dataset | Accuracy | Recall | Specificity |
|---|---|---|---|
| EnhancerMatcher | | | |
| Overall | 89.51 | 92.27 | 86.74 |
| Shuffled | 91.31 | 84.36 | 97.82 |
| Length | 88.92 | 84.72 | 93.12 |
| Length-no-repeats | 88.70 | 84.91 | 92.49 |
| Length-GC | 88.43 | 85.02 | 91.84 |
| Length-GC-no-repeats | 87.35 | 84.95 | 89.74 |
| Dissimilar enhancers | 68.44 | 84.92 | 51.96 |
| Statistical triplet classifier | | | |
| Overall | 62.93 | 59.29 | 66.56 |
| Shuffled | 77.34 | 57.42 | 97.27 |
| Length | 63.83 | 75.78 | 51.90 |
| Length-no-repeats | 62.61 | 41.53 | 83.67 |
| Length-GC | 58.96 | 55.92 | 61.96 |
| Length-GC-no-repeats | 59.19 | 57.91 | 60.44 |
| Dissimilar enhancers | 55.67 | 67.19 | 44.13 |
| Histogram-based triplet classifier | | | |
| Overall | 73.79 | 73.17 | 74.41 |
| Shuffled | 92.91 | 87.95 | 97.87 |
| Length | 75.18 | 73.37 | 76.98 |
| Length-no-repeats | 73.67 | 73.65 | 73.69 |
| Length-GC | 72.68 | 70.82 | 74.54 |
| Length-GC-no-repeats | 68.54 | 70.41 | 66.67 |
| Dissimilar enhancers | 59.75 | 62.83 | 56.68 |
The three networks were evaluated on the testing datasets.
Conclusion
We developed EnhancerMatcher, a novel computational tool designed for detecting cell-type-specific enhancer activity by functionally comparing a sequence to two other reference sequences. EnhancerMatcher utilizes a deep convolutional neural network that processes triplets of sequences: the first two sequences must be active in at least one common cell type, and the objective is to determine whether the third sequence functions as an enhancer with similar activity in at least one of those cell types.
We trained and initially evaluated EnhancerMatcher using the comprehensive human CATlas dataset, complemented by shuffled copies of CATlas sequences and randomly sampled genomic regions. On this human test dataset, EnhancerMatcher achieved high performance metrics with 90% accuracy, 92% recall, and 87% specificity, unequivocally demonstrating its effectiveness in comparing sequences based on enhancer activity.
EnhancerMatcher proved to generalize exceptionally well across all 222 cell types within the CATlas database, regardless of their representation (i.e. the number of known enhancers available for that cell type). Our cell-type-specific evaluations demonstrated no significant correlation between performance metrics (accuracy, recall, and specificity) and the number of available triplets per cell type. This crucial finding confirms that EnhancerMatcher’s effectiveness is not limited to well-represented cell types but extends reliably to those with sparse data, significantly broadening its practical utility.
Beyond its performance on human data, EnhancerMatcher exhibits remarkable generalization capabilities. We rigorously evaluated its performance on mouse enhancers to assess its cross-species applicability. In a first experiment, the human-trained EnhancerMatcher demonstrated its ability to recognize mouse enhancers when paired with human references, achieving recall rates ranging from 53% to 67% across diverse mouse cell types. This indicates that EnhancerMatcher learns fundamental, conserved enhancer features that transcend species boundaries. Further within-species testing on mouse enhancers showed strong results, with accuracy between 72% and 80%, recall between 54% and 72%, and particularly high specificity of 89%–91%. This robust performance on mouse data, especially when trained solely on human data, underscores EnhancerMatcher’s potential as a versatile tool for annotating enhancers in newly sequenced mammalian and other vertebrate genomes.
EnhancerMatcher offers several distinct advantages over conventional approaches for enhancer analysis. First, its design allows for the evaluation of enhancer activity across all cell types, while retaining precise cell-type specificity controlled by the selection of reference enhancers. Second, unlike traditional classifiers that often rely on manually engineered features (e.g. histogram statistics), EnhancerMatcher automatically extracts relevant sequence features through its deep learning architecture. Third, EnhancerMatcher incorporates CAMs, an explainable AI technique, to highlight specific subsequences that are crucial for determining enhancer activity similarity, thereby offering biological interpretability.
Overall, EnhancerMatcher provides a novel, powerful, and robust addition to enhancer discovery and characterization methods, significantly advancing our ability to interpret the complex regulatory code governing gene expression across diverse cell types and even across species.
Acknowledgements
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author contributions: Luis M. Solis (Data curation [lead], Formal analysis [lead], Investigation [lead], Methodology [lead], Validation [lead], Visualization [lead], Writing – original draft [lead]), William L. Melendez (Formal analysis [supporting], Software [supporting], Validation [supporting], Visualization [supporting], Writing – original draft [supporting]), Shantanu H. Fuke (Formal analysis [supporting], Software [supporting], Validation [supporting], Visualization [supporting], Writing – original draft [supporting]), Sayantan Paul (Formal analysis [supporting], Software [supporting], Validation [supporting], Visualization [supporting], Writing – original draft [supporting]), Anthony B. Garza (Data curation [supporting], Formal analysis [supporting], Software [supporting], Supervision [supporting], Validation [supporting], Visualization [supporting], Writing – original draft [supporting]), Rolando Garcia (Data curation [supporting], Formal analysis [supporting], Software [supporting], Supervision [supporting], Validation [supporting], Visualization [supporting], Writing – original draft [supporting]), Marc S. Halfon (Conceptualization [equal], Data curation [equal], Formal analysis [equal], Funding acquisition [equal], Investigation [equal], Methodology [equal], Software [supporting], Validation [equal], Visualization [equal], Writing – review & editing [equal]), and Hani Z. Girgis (Conceptualization [lead], Data curation [equal], Formal analysis [equal], Funding acquisition [lead], Investigation [lead], Methodology [lead], Project administration [lead], Software [equal], Supervision [lead], Validation [equal], Visualization [equal], Writing – original draft [equal]).
Contributor Information
Luis M Solis, The Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX 78363, United States.
William L Melendez, The Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX 78363, United States.
Shantanu H Fuke, The Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX 78363, United States.
Sayantan Paul, The Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX 78363, United States.
Anthony B Garza, The Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX 78363, United States.
Rolando Garcia, The Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX 78363, United States.
Marc S Halfon, Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, United States.
Hani Z Girgis, The Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX 78363, United States.
Conflict of interest
None declared.
Funding
Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under award number R21HG011507.
Data availability
The source code and the trained networks can be obtained from https://github.com/BioinformaticsToolsmith/EnhancerMatcher, and our code is available for download on Zenodo at https://zenodo.org/records/16117400.
References
- 1. Banerji J, Rusconi S, Schaffner W. Expression of a β-globin gene is enhanced by remote SV40 DNA sequences. Cell. 1981;27:299–308. 10.1016/0092-8674(81)90413-X.
- 2. Amano T, Sagai T, Tanabe H et al. Chromosomal dynamics at the Shh locus: limb bud-specific differential regulation of competence and active transcription. Dev Cell. 2009;16:47–57. 10.1016/j.devcel.2008.11.011.
- 3. Shlyueva D, Stampfel G, Stark A. Transcriptional enhancers: from properties to genome-wide predictions. Nat Rev Genet. 2014;15:272–86. 10.1038/nrg3682.
- 4. Ernst J, Kellis M. ChromHMM: automating chromatin-state discovery and characterization. Nat Methods. 2012;9:215–16. 10.1038/nmeth.1906.
- 5. Wang J, Lunyak VV, Jordan IK. Chromatin signature discovery via histone modification profile alignments. Nucleic Acids Res. 2012;40:10642–56. 10.1093/nar/gks848.
- 6. Park SH, Lee SM, Kim YJ et al. ChARM: discovery of combinatorial chromatin modification patterns in hepatitis B virus X-transformed mouse liver cancer using association rule mining. BMC Bioinformatics. 2016;7:1307.
- 7. Girgis HZ, Velasco A, Reyes ZE. HebbPlot: an intelligent tool for learning and visualizing chromatin mark signatures. BMC Bioinformatics. 2018;19:310. 10.1186/s12859-018-2312-1.
- 8. Kim TK, Hemberg M, Gray JM et al. Widespread transcription at neuronal activity-regulated enhancers. Nature. 2010;465:182–7. 10.1038/nature09033.
- 9. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. 10.1038/nature11247.
- 10. Karnuta JM, Scacheri PC. Enhancers: bridging the gap between gene control and human disease. Hum Mol Genet. 2018;27:R219–27. 10.1093/hmg/ddy167.
- 11. Fantes J, Redeker B, Breen M et al. Aniridia-associated cytogenetic rearrangements suggest that a position effect may cause the mutant phenotype. Hum Mol Genet. 1995;4:415–22. 10.1093/hmg/4.3.415.
- 12. Bhatia S, Bengani H, Fish M et al. Disruption of autoregulatory feedback by a mutation in a remote, ultraconserved PAX6 enhancer causes aniridia. Am J Hum Genet. 2013;93:1126–34. 10.1016/j.ajhg.2013.10.028.
- 13. Lango Allen H, Caswell R, Xie W et al. Next generation sequencing of chromosomal rearrangements in patients with split-hand/split-foot malformation provides evidence for DYNC1I1 exonic enhancers of DLX5/6 expression in humans. J Med Genet. 2014;51:264–7. 10.1136/jmedgenet-2013-102142.
- 14. Will AJ, Cova G, Osterwalder M et al. Composition and dosage of a multipartite enhancer cluster control developmental expression of Ihh (Indian hedgehog). Nat Genet. 2017;49:1539–45. 10.1038/ng.3939.
- 15. Erickson RP, Yatsenko SA, Larson K et al. A case of agonadism, skeletal malformations, bicuspid aortic valve, and delayed development with a 16p13.3 duplication including GNG13 and SOX8 upstream enhancers: are either, both or neither involved in the phenotype? Mol Syndromol. 2010;1:185–91. 10.1159/000321957.
- 16. Zhang X, Choi PS, Francis JM et al. Identification of focally amplified lineage-specific super-enhancers in human epithelial cancers. Nat Genet. 2016;48:176–82. 10.1038/ng.3470.
- 17. Choi J, Xu M, Makowski MM et al. A common intronic variant of PARP1 confers melanoma risk and mediates melanocyte growth via regulation of MITF. Nat Genet. 2017;49:1326–35. 10.1038/ng.3927.
- 18. Huang Q, Whitington T, Gao P et al. A prostate cancer susceptibility allele at 6q22 increases RFX6 expression by modulating HOXB13 chromatin binding. Nat Genet. 2014;46:126–35. 10.1038/ng.2862.
- 19. Smemo S, Tena JJ, Kim KH et al. Obesity-associated variants within FTO form long-range functional connections with IRX3. Nature. 2014;507:371–5. 10.1038/nature13138.
- 20. Gjoneska E, Pfenning AR, Mathys H et al. Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease. Nature. 2015;518:365–9. 10.1038/nature14252.
- 21. Hiromi Y, Gehring WJ. Regulation and function of the Drosophila segmentation gene fushi tarazu. Cell. 1987;43:603–13. 10.1016/0092-8674(85)90232-6.
- 22. Goto T, Macdonald P, Maniatis T. Early and late periodic patterns of even skipped expression are controlled by distinct regulatory elements that respond to different spatial cues. Cell. 1989;57:413–22. 10.1016/0092-8674(89)90916-1.
- 23. Harding K, Hoey T, Warrior R et al. Autoregulatory and gap gene response elements of the even-skipped promoter of Drosophila. EMBO J. 1989;8:1205–12. 10.1002/j.1460-2075.1989.tb03493.x.
- 24. Jenett A, Rubin GM, Ngo TTB et al. A GAL4-driver line resource for Drosophila neurobiology. Cell Rep. 2012;2:991–1001. 10.1016/j.celrep.2012.09.011.
- 25. Arnold CD, Gerlach D, Stelzer C et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science. 2013;339:1074–7. 10.1126/science.1232542.
- 26. Tokusumi T, Tokusumi Y, Brahier MS et al. Screening and analysis of Janelia FlyLight Project enhancer-Gal4 strains identifies multiple gene enhancers active during hematopoiesis in normal and wasp-challenged Drosophila larvae. G3 (Bethesda). 2017;7:437–48. 10.1534/g3.116.034439.
- 27. Klemm SL, Shipony Z, Greenleaf WJ. Chromatin accessibility and the regulatory epigenome. Nat Rev Genet. 2019;20:207–20. 10.1038/s41576-018-0089-8.
- 28. Catarino RR, Stark A. Assessing sufficiency and necessity of enhancer activities for gene expression and the mechanisms of transcription activation. Genes Dev. 2018;32:202–23. 10.1101/gad.310367.117.
- 29. Visel A, Blow MJ, Li Z et al. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature. 2009;457:854–8. 10.1038/nature07730.
- 30. Suryamohan K, Halfon MS. Identifying transcriptional cis-regulatory modules in animal genomes. Wiley Interdiscip Rev Dev Biol. 2015;4:59–84. 10.1002/wdev.168.
- 31. Bergman CM, Pfeiffer BD, Rincón-Limas DE et al. Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome. Genome Biol. 2002;3:research0086.1. 10.1186/gb-2002-3-12-research0086.
- 32. Richards S, Liu Y, Bettencourt BR et al. Comparative genome sequencing of Drosophila pseudoobscura: chromosomal, gene, and cis-element evolution. Genome Res. 2005;15:1–18. 10.1101/gr.3059305.
- 33. Sosinsky A, Honig B, Mann RS et al. Discovering transcriptional regulatory regions in Drosophila by a nonalignment method for phylogenetic footprinting. Proc Natl Acad Sci USA. 2007;104:6305–10. 10.1073/pnas.0701614104.
- 34. Berman BP, Nibu Y, Pfeiffer BD et al. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome. Proc Natl Acad Sci USA. 2002;99:757–62. 10.1073/pnas.231608898.
- 35. Halfon MS, Grad Y, Church GM et al. Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model. Genome Res. 2002;12:1019–28. 10.1101/gr.228902.
- 36. Girgis HZ, Ovcharenko I. Predicting tissue specific cis-regulatory modules in the human genome using pairs of co-occurring motifs. BMC Bioinformatics. 2012;13:25. 10.1186/1471-2105-13-25.
- 37. Kazemian M, Zhu Q, Halfon MS et al. Improved accuracy of supervised CRM discovery with interpolated Markov models and cross-species comparison. Nucleic Acids Res. 2011;39:9463–72. 10.1093/nar/gkr621.
- 38. Rajagopal N, Xie W, Li Y et al. RFECS: a random-forest based algorithm for enhancer identification from chromatin state. PLoS Comput Biol. 2013;9:e1002968. 10.1371/journal.pcbi.1002968.
- 39. Visel A, Taher L, Girgis HZ et al. A high-resolution enhancer atlas of the developing telencephalon. Cell. 2013;152:895–908. 10.1016/j.cell.2012.12.041.
- 40. Liu F, Li H, Ren C et al. PEDLA: predicting enhancers with a deep learning-based algorithmic framework. Sci Rep. 2016;6:28517. 10.1038/srep28517.
- 41. Min X, Zeng W, Chen S et al. Predicting enhancers with deep convolutional neural networks. BMC Bioinformatics. 2017;18:478. 10.1186/s12859-017-1878-3.
- 42. Yang B, Liu F, Ren C et al. BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone. Bioinformatics. 2017;33:1930–6. 10.1093/bioinformatics/btx105.
- 43. Chen L, Fish AE, Capra JA. Prediction of gene regulatory enhancers across species reveals evolutionarily conserved sequence properties. PLoS Comput Biol. 2018;14:e1006484. 10.1371/journal.pcbi.1006484.
- 44. Li Y, Shi W, Wasserman WW. Genome-wide prediction of cis-regulatory regions using supervised deep learning methods. BMC Bioinformatics. 2018;19:202. 10.1186/s12859-018-2187-1.
- 45. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12:931–4. 10.1038/nmeth.3547.
- 46. Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26:990–9. 10.1101/gr.200535.115.
- 47. Ghandi M, Lee D, Mohammad-Noori M et al. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput Biol. 2014;10:1–15. 10.1371/journal.pcbi.1003711.
- 48. Lee D, Gorkin DU, Baker M et al. A method to predict the impact of regulatory variants from DNA sequence. Nat Genet. 2015;47:955–61. 10.1038/ng.3331.
- 49. Kreimer A, Zeng H, Edwards MD et al. Predicting gene expression in massively parallel reporter assays: a comparative study. Hum Mutat. 2017;38:1240–50. 10.1002/humu.23197.
- 50. Jain S, Bakolitsa C, Brenner SE et al. CAGI, the Critical Assessment of Genome Interpretation, establishes progress and prospects for computational genetic variant interpretation methods. Genome Biol. 2024;25:53. 10.1186/s13059-023-03113-6.
- 51. Kwon SB, Ernst J. Learning a genome-wide score of human–mouse conservation at the functional genomics level. Nat Commun. 2021;12:2495. 10.1038/s41467-021-22653-8.
- 52. Lu J, Cao X, Zhong S. EpiAlignment: alignment with both DNA sequence and epigenomic data. Nucleic Acids Res. 2019;47:W11–9. 10.1093/nar/gkz426.
- 53. Li J, Zhao T, Guan D et al. Learning functional conservation between human and pig to decipher evolutionary mechanisms underlying gene expression and complex traits. Cell Genom. 2023;3:100390. 10.1016/j.xgen.2023.100390.
- 54. Goodfellow I, Bengio Y, Courville A. Deep Learning. Cambridge, MA: MIT Press, 2016.
- 55. Ko JY, Oh S, Yoo KH. Functional enhancers as master regulators of tissue-specific gene regulation and cancer development. Mol Cells. 2017;40:169–77.
- 56. Zhang K, Hocker JD, Miller M et al. A single-cell atlas of chromatin accessibility in the human genome. Cell. 2021;184:5985–6001. 10.1016/j.cell.2021.10.024.
- 57. Solis LM, Sterling-Lentsch G, Halfon MS et al. EnhancerDetector: enhancer discovery from human to fly via interpretable deep learning. bioRxiv, 10.1101/2025.05.28.656532, 11 April 2025, preprint: not peer reviewed.
- 58. O’Leary NA, Wright MW, Brister JR et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2015;44:D733–45. 10.1093/nar/gkv1189.
- 59. Frankish A, Carbonell-Sala S, Diekhans M et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 2022;51:D942–9. 10.1093/nar/gkac1071.
- 60. Li YE, Preissl S, Miller M et al. A comparative atlas of single-cell chromatin accessibility in the human brain. Science. 2023;382:eadf7044. 10.1126/science.adf7044.
- 61. Girgis HZ. Red: an intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale. BMC Bioinformatics. 2015;16:227. 10.1186/s12859-015-0654-5.
- 62. Garza AB, Garcia R, Halfon MS et al. Evaluation of metric and representation learning approaches: effects of representations driven by relative distance on the performance. In: IEEE International Conference on Intelligent Methods, Systems, and Applications (IMSA). Giza, Egypt, 2023.
- 63. Selvaraju RR, Cogswell M, Das A et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proc IEEE Int Conf Comput Vis. Venice, Italy, 2017, 618–26.
- 64. Rauluseviciute I, Riudavets-Puig R, Blanc-Mathieu R et al. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2023;52:D174–82. 10.1093/nar/gkad1059.
- 65. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27:1017–18. 10.1093/bioinformatics/btr064.
- 66. Girgis HZ, James BT, Luczak BB. Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models. NAR Genom Bioinform. 2021;3:lqab001. 10.1093/nargab/lqab001.
- 67. Chollet F. Xception: deep learning with depthwise separable convolutions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA, USA: IEEE Computer Society, 2017, 1800–7. 10.1109/CVPR.2017.195.
- 68. Luo Y, Hitz BC, Gabdank I et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 2020;48:D882–9. 10.1093/nar/gkz1062.