Significance
Cis-regulatory sequences regulate the expression of nearby genes. Recently, remarkable advances in our ability to predict the function of cis-regulatory sequences have been achieved by using convolutional neural networks (CNNs). However, the difficulty of interpreting these CNNs has hindered the translation of these advances into a better understanding of the cis-regulatory code. The main difficulty stems from the problem of multifaceted CNN neurons, each of which can recognize multiple types of patterns. Visualizing a multifaceted neuron in one image results in a mixed pattern that is hard to interpret. To address this difficulty, we propose an algorithm capable of revealing the individual facets of deep neurons, which enables better understanding of the cis-regulatory codes embedded in state-of-the-art CNNs.
Keywords: cis-regulatory grammar, motif combination, deep neural network, model interpretation, multifaceted neuron
Abstract
Discovering DNA regulatory sequence motifs and their relative positions is vital to understanding the mechanisms of gene expression regulation. Although deep convolutional neural networks (CNNs) have achieved great success in predicting cis-regulatory elements, the discovery of motifs and their combinatorial patterns from these CNN models has remained difficult. We show that the main difficulty is due to the problem of multifaceted neurons, which respond to multiple types of sequence patterns. Since existing interpretation methods were mainly designed to visualize the class of sequences that can activate a neuron, the resulting visualization corresponds to a mixture of patterns, which is usually difficult to interpret without resolving the mixed patterns. We propose the NeuronMotif algorithm to interpret such neurons. Given any convolutional neuron (CN) in the network, NeuronMotif first generates a large sample of sequences capable of activating the CN, which typically exhibits a mixture of patterns. The sequences are then "demixed" in a layer-wise manner by backward clustering of the feature maps produced at the involved convolutional layers. NeuronMotif outputs the sequence motifs and the syntax rules governing their combinations, depicted as position weight matrices organized in tree structures. Compared to existing methods, the motifs found by NeuronMotif have more matches to known motifs in the JASPAR database. The higher-order patterns uncovered for deep CNs are supported by the literature and by ATAC-seq footprinting. Overall, NeuronMotif enables the deciphering of cis-regulatory codes from deep CNs and enhances the utility of CNNs in genome interpretation.
We regard the cis-regulatory code as a special language. In a natural language, the basic meaningful unit is the lexeme, and several different word forms may correspond to the same lexeme. Likewise, several cis-regulatory DNA sequences may correspond to a sequence motif recognized by the same transcription factor (TF) (Fig. 1A). The correspondences between TFs and their motifs are specified by a motif glossary. The motifs can be combined into cis-regulatory modules (CRMs) according to the motif syntax, which specifies the set of involved TFs and the likely arrangements of their binding positions relative to each other (Fig. 1B). Hard or soft syntax rules specify fixed or flexible distances between elements in the module. In general, modules can be further combined into more complex (higher-order) modules in a hierarchical manner. To understand the cis-regulatory code, it is necessary to identify the motifs and to clarify the syntax rules that govern the combinations of motifs and modules in the hierarchy.
Fig. 1.
CNN model interpretation for genomics. (A) Understanding cis-regulatory DNA sequences by considering them as a special language. The basic meaningful unit in a language is the lexeme, and words with different word forms can share one lexeme. Both English and DNA follow this definition. For example, humans use word forms of "eat" with different affixes to express the meaning of EAT. Similarly, an 8-mer DNA sequence matched by the TF ZEB1 motif can be bound by ZEB1, which is the meaning, or function, of the DNA sequence. In a DNA word form, the conserved letters form the root, the less conserved letters are the affixes, and the nonconserved letters are the spacers. (B) Cis-regulatory grammar includes motif patterns and their higher-order combinatorial logic. The motif patterns form the motif glossary; the motif combinatorial logic forms the motif syntax. The arrangement and the fixed/flexible gaps between adjacent motifs depict hard/soft syntax. (C) DNA sequences in the genome have been annotated by different kinds of profiles (e.g., ChIP-seq and ATAC-seq). A CNN is well suited to learning motifs and motif combinations from these sequences, as evidenced by its excellent performance on prediction tasks. In the example shown, the sequences are matched by the motif combinatorial logic in B and are used to train the CNN model. The task of this work is to interpret the motif glossary and syntax from the CNN model.
Traditionally, motifs are identified experimentally by systematic evolution of ligands by exponential enrichment (SELEX) sequencing (1) or computationally by motif discovery analysis (2, 3) of sequences from genomic regions bound by specific TFs (4) or bearing specific epigenetic features (4–6). These efforts have produced a glossary of motifs for up to 705 human TFs in the JASPAR database (7). In contrast, while module discovery methods have been developed and applied in focused studies (8–10), there has been limited progress on the comprehensive discovery of the higher-order logic of the cis-regulatory code. To fill this gap, here we introduce a method to extract motifs, modules, and their combinatorial syntax rules from the parameters of convolutional neural network (CNN) models well trained to predict a comprehensive set of genomic features.
The development of CNN-based predictive models (11, 12) was a major recent advance in computational genomics. The models were trained on a massive amount of data on the context-specific functional features of genomic regions (Fig. 1C). For example, the DeepSEA model was trained on genome-wide measurements (in various cellular contexts) of 919 functional features, including binding affinity of TFs, histone modification marks, and chromatin accessibility. Given the sequence of a genomic region, the models can predict its functional features in different cellular contexts. The initial models, which used wide kernels (8 bp in DeepSEA) and shallow networks (three layers in DeepSEA), already achieved excellent predictive performance. Subsequent works further improved the performance by using deeper networks with narrow kernels (13, 14). The high predictive power of such a model suggests that it has implicitly learned key aspects of the underlying regulatory code. Indeed, from the parameters of a learned CNN, it was possible to use standard model interpretation tools such as TF-MoDISco (15) to extract many sequence motifs that are well matched to known motifs in databases such as JASPAR [e.g., 71 motifs were found in ref. 16]. However, these standard interpretation tools are less effective at extracting complex regulatory patterns such as higher-order modules. The purpose of this work is to provide a method to extract these higher-order aspects of the regulatory code from the parameters of the CNN.
The main difficulty in interpreting a deep CNN is that many neurons, especially those in the deep layers, are multifaceted. A neuron is multifaceted if it responds to multiple, distinct categories of patterns, a phenomenon shown to be widespread in both human neurons (17) and CNN neurons (18–20). For a neuron activated by input sequences containing a specific motif, the motif is not always located at the same position in the input sequences (21). Existing interpretation tools, such as those reviewed in the next paragraph, typically produce a visualization that corresponds to the mixture of patterns rather than visualizations of the individual component patterns (18, 20). The visualization of the mixture is typically hard to interpret, whereas the visualizations of the individual components can be directly interpreted in terms of position weight matrices (PWMs) that reflect the sequence preferences of TFs.
Current methods for neural network interpretation include attribution map-based methods (AMBMs) and sequence alignment-based methods (SABMs) (SI Appendix, Figs. S1–S4 and Table S1). AMBMs such as DeepLIFT (22), saliency maps (23), and DeepResolve (24) use gradient-backpropagation computation to obtain an importance score (IS) for each position in the input sequences and then visualize the ISs of the sequences (21, 25). Some tools, such as TF-MoDISco (15), further attempt to align the subsequences based on their ISs to estimate the motif PWM (25). The second type of method, the SABM, follows the traditional approach to PWM estimation by stacking a set of input sequences that strongly activate the convolutional neuron (CN) and then visualizing the position-specific base preferences of the stacked sequences. Unfortunately, as shown below by the toy example in the Results section and Fig. 2 (see also refs. 12 and 26–29), these existing AMBMs and SABMs cannot handle multifaceted neurons, and their usefulness is restricted to shallow neurons because most deep neurons are multifaceted (21).
Fig. 2.
A toy CNN to show the problem of multifaceted neurons and the basic idea of NeuronMotif. (A) A two-convolutional-layer CNN trained on 8 bp sequence samples, including random negative samples and positive samples drawn from two ZEB1 motifs shifted by 1 bp. (B) The results of the existing methods, NeuronMotif, and the ground truth for the output CN. (C) The gradient-based IS of a single positive sequence $s$ may be multifaceted. The curved surface of the output CN is $y = f(x_1, x_2)$, where $x_1/x_2$ denotes the vector of input features important for pattern 1/2 but unimportant for pattern 2/1; the remaining features, not shown on the curved surface, are collected in $x_r$. The curved surface is composed of two peaks, each enriched in sequences with the same motif pattern. The gradient at the input sequence $s$ points to neither of the peaks: the gradients we want are $g_1$ and $g_2$, the directions toward the two peaks, but the gradient we obtain is the mixture $g_1 + g_2$. (D) Comparison of the forward propagation of two sequences with the same motif shifted by 1 bp. The feature maps for the two input sequences in each layer of the CNN are not identical until the max-pooling operation. The columns and rows of a feature map or filter matrix indicate the position and channel, respectively. (E) Based on the samples, NeuronMotif clusters the feature maps at the max-pooling input to distinguish samples with the same motif at different positions. (F) NeuronMotif clusters the input feature maps of the max-pooling operations in the CNN model backward and recursively.
To handle multifaceted neurons, we propose the NeuronMotif algorithm, an enhanced version of the SABM that can convert the model parameters of a CNN into motifs represented by PWMs and higher-order modules depicted by tree structures (Fig. 1B). We demonstrate that the pooling operation is the main cause of the problem of multifaceted neurons in widely used CNN architectures such as DeepSEA (11) and Basset (12). We developed a method to dissect the mixed facets of a CN by forward sampling of high-activation sequences, followed by backward clustering of the input feature maps of max-pooling operations, in a recursive, layer-wise manner. In this way, each facet captured by a deep CN is represented by the sequences in one cluster, which can be converted into combinations of PWMs. Importantly, by extracting information-rich segments from the demixed sequences during the layer-wise demixing process, we can obtain a hierarchical organization of the motifs and modules, which provides a natural interpretation of the multifaceted neuron. Because of its ability to demix the multiple facets of a deep neuron and to extract and visualize higher-order modules, NeuronMotif is a useful tool for deciphering the cis-regulatory code from deep CNN models trained on large-scale genomic data.
Results
Max-Pooling Gives Rise to Multifaceted Neurons.
To demonstrate the problem of multifaceted neurons, we sampled 8 base pair (bp) sequences from two ZEB1 motifs with positions shifted by 1 bp. We regard these sequences as positive data and randomly generated sequences as negative data, and we used these labeled sequences to train a toy two-layer CNN classifier (Fig. 2A). After obtaining the trained model, we applied the existing model interpretation methods to see whether they could discover the ZEB1 motif pattern from the parameters of the CNN. SABMs used in recent works (28–30), such as those of Kelley et al. (12) and Alipanahi et al. (27), stack all sequences with activations above a given threshold (lines 1 to 2 in Fig. 2B). The sequences are not aligned according to the motif pattern, so the result corresponds to a mixture of two ZEB1 motifs shifted by 1 bp. The state-of-the-art AMBMs (25, 30) compute the IS of each nucleotide in a positive sequence through the gradient of the output with respect to the input [saliency map (23)] or by other gradient-like methods [DeepLIFT (22)] to reveal motif-like patterns (lines 3 and 5 in Fig. 2B). Since the output function of the two-facet CN has two maxima, the gradient directions of many sequences tend to be a mixture of the two directions pointing toward the two maxima (Fig. 2C and SI Appendix). Thus, a gradient-like method gives the mixed IS of the two shifted motif-like patterns for each of these sequences. Furthermore, neither stacking these pattern mixtures directly (line 4 in Fig. 2B) nor aligning them with tools such as TF-MoDISco (15) (line 6 in Fig. 2B) can recover the true motif pattern.
Taking this toy CNN as an example, we further studied the mixing mechanism by layer-wise comparison of the feature maps of two positive sequences with the ZEB1 motif located at different positions (i.e., shifted by 1 bp) (Fig. 2D and SI Appendix). This CNN is composed of two convolutional layers and one max-pooling layer. The input is the one-hot code of the DNA sequence (4 channels by 8 bp). There are three filters (4 channels by 5 bp) in convolutional layer 1 (L1), each of which can be viewed as a PWM (21) that scans the input sequence and quantifies the matching level by producing a value at each position. The feature maps of convolutional L1 for the two sequences are quite different (Fig. 2D). However, when the max-pooling operation takes the maximal value for each channel in each contiguous bin of size two along the positional axis, the output feature maps of the max-pooling operation for the two sequences become almost identical (Fig. 2D). In convolutional layer 2 (L2), there is one filter (3 channels by 2 positions). This filter scans the feature map of the max-pooling output, which generates similar activation values of the output CN for the two sequences (Fig. 2D). Hence, this output neuron is a multifaceted neuron that responds to two different patterns with motifs at positions shifted relative to each other. The max-pooling operation is the main cause of multifaceted neurons.
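As a concrete illustration, a toy architecture of this form can be built and probed in a few lines of Keras. This is a minimal sketch under stated assumptions: the layer sizes follow the description above, the weights here are untrained (the paper's toy model is trained on labeled samples), and the ZEB1-like core sequence is only illustrative.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Toy architecture from Fig. 2A: conv1 (3 filters, width 5) -> max-pool (size 2) -> conv2 (1 filter, width 2).
inp = layers.Input(shape=(8, 4))                      # one-hot 8-bp sequence
conv1 = layers.Conv1D(3, 5, activation="relu", name="conv1")(inp)
pool1 = layers.MaxPooling1D(pool_size=2, name="pool1")(conv1)
out = layers.Conv1D(1, 2, activation="relu", name="conv2")(pool1)
model = tf.keras.Model(inp, out)

def one_hot(seq):
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        x[i, idx[base]] = 1.0
    return x[None, ...]                               # add batch dimension

# Two inputs carrying the same E-box-like core ("CACCTG") shifted by 1 bp.
s1, s2 = one_hot("CACCTGAT"), one_hot("ACACCTGA")

# Probe the feature maps before and after max-pooling.
probe = tf.keras.Model(inp, [conv1, pool1])
c1, p1 = probe.predict(s1)
c2, p2 = probe.predict(s2)
print(np.allclose(c1, c2))   # conv1 maps differ: the motif sits at different positions
print(np.allclose(p1, p2))   # after training, the pooled maps can coincide, hiding the shift
```

In a trained model of this form, the second comparison can return True even though the first returns False, which is exactly the information loss that makes the output CN multifaceted.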
Overview of the NeuronMotif Algorithm.
To interpret a target CN that may be multifaceted, we can try to partition a large set of sequences recognized by the CN into subsets (clusters), each of which corresponds to one facet. A difficulty in implementing this approach is that most combinatorial motif patterns in genomic sequences are weak (i.e., they lead to relatively low activation of the CN), which makes it hard to extract cis-regulatory signals (SI Appendix, Fig. S5). To increase the signal-to-noise ratio, we directly sample sequences with high activation based on the parameters of the CNN (see Materials and Methods for details). These sequences are then partitioned by clustering analysis of the input feature maps they produce for the max-pooling operations in the substructure of the target CN. Both the sampling of sequences and the clustering of feature maps are performed recursively, in the sense that the computation at any layer makes use of the results of computation in the preceding layer (Materials and Methods). In the toy model shown in Fig. 2D, there is one max-pooling operation of size 2, and the output CN recognizes one freely shifting motif (ZEB1) without soft syntax. NeuronMotif flattened the feature maps of the max-pooling input for all the positive sequences and used k-means (k = 2) to cluster the feature maps into two clusters (Fig. 2E and lines 7 to 8 in Fig. 2B). The corresponding sequences of the feature maps in a cluster share the motif pattern at the same position, so they can be stacked directly. Generally, if there is more than one max-pooling operation in the CNN, we can perform this step recursively backward from the deeper layers to the first layer. For example, for a trained three-convolutional-layer CNN with two max-pooling operations of size 2, NeuronMotif clusters the input feature maps of the second max-pooling operation into two clusters and then further clusters the input feature maps of the first max-pooling operation within one of the existing clusters into two new subclusters (Fig. 2F). The corresponding sequences in one cluster can be stacked directly into a PWM of the motif. More generally, if a multifaceted CN occurs anywhere in the substructure of the CN, or if the CN recognizes combinatorial patterns with several freely shifting motifs, the different patterns mixed in the positive sequence samples can be demixed backward and layer-wise (Materials and Methods; see the reason for the backward strategy in SI Appendix, Fig. S6). Since a PWM can represent combinatorial motifs (31), we call the PWM of the sequences in a cluster a "CN CRM" (Figs. 1B and 2F). NeuronMotif segments the PWM of the CN CRM based on the continuous regions with high information content and discards the remaining regions with low information content (see Materials and Methods for details). Finally, NeuronMotif summarizes the combinatorial logic of the motifs in a tree structure (Figs. 1B and 5).
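The backward clustering step can be sketched compactly. The following is an illustrative implementation, not the released NeuronMotif code: it assumes a Keras model, uses scikit-learn's k-means on the flattened pre-pooling feature maps, and stacks each resulting cluster into a PPM.

```python
import numpy as np
import tensorflow as tf
from sklearn.cluster import KMeans

def demix_by_pooling_input(model, pool_layer_name, onehot_seqs, k):
    """Cluster activating sequences by the feature maps entering a max-pooling layer."""
    pool_in = model.get_layer(pool_layer_name).input      # tensor feeding the pooling op
    probe = tf.keras.Model(model.input, pool_in)
    fmaps = probe.predict(onehot_seqs)                    # (n, positions, channels)
    flat = fmaps.reshape(len(onehot_seqs), -1)            # flatten position x channel
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(flat)
    return [onehot_seqs[labels == i] for i in range(k)]

def stack_to_ppm(cluster):
    """Stack one-hot sequences (n, L, 4) of one cluster into a PPM."""
    counts = cluster.sum(axis=0)
    return counts / counts.sum(axis=1, keepdims=True)
```

Applied recursively from the deepest pooling layer backward, each final cluster contains sequences sharing one facet, so stacking them yields a directly interpretable PWM.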
Fig. 5.
The heuristic approach to obtaining motif combinatorial logic. The schema of the process includes five steps. (I) Rank the quality of CN CRMs; those with low quality are removed (shown with high opacity). (II) Select the high-quality CN CRMs and build a motif dictionary after removing duplicates. (III) Annotate segments with the motif dictionary; each segment is replaced by a motif in the dictionary (blue or green regions). (IV) Summarize the gap sizes of motif pairs in the CN CRMs. (V) Build the motif syntax tree and the truth table of the leaf nodes. The gap size ranges (bp) are marked at branch nodes. Each row of the truth table shows the motif combination occurring in one of the CN CRMs; in each row, a motif node marked true (T)/false (F) is included in/excluded from the combination.
The number of clusters in a clustering step is typically equal to the max-pooling size but can be higher in some cases, such as when the CRM involves two motifs with a flexible gap size between them. Thus, we determine the number of clusters adaptively based on the quality of the clusters. NeuronMotif uses the ratio $r = f(s^{*})/a_{k}$ to measure the relative quality of mixture decoupling for each PWM, where $f(s^{*})$ is the activation value of the consensus sequence $s^{*}$ (the sequence with the maximum probability under the PWM model) and $a_{k}$ is the maximum activation value of the sequences in the cluster. If $r$ is 0.95 or smaller, the CN CRM's pattern is less likely to be recognized by the CN because the sequences in the cluster still consist of multifaceted patterns; hence, a larger k in k-means clustering is required. Based on this metric, NeuronMotif can automatically increase the number of clusters to match the number of facets so that the PWM quality meets the requirement (Materials and Methods and SI Appendix, Figs. S7–S10).
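A sketch of this indicator and the adaptive growth of k, under the notation used here ($f(s^*)$, $a_k$, and the helper names are ours; `activation` is an assumed callable evaluating the CN substructure):

```python
import numpy as np

BASES = np.eye(4, dtype=np.float32)

def decoupling_ratio(ppm, cluster_onehots, activation):
    """r = f(consensus of PPM) / max activation of sequences in the cluster."""
    consensus = BASES[ppm.argmax(axis=1)]            # most probable base at each position
    a_star = float(activation(consensus[None, ...])[0])
    a_k = float(np.max(activation(cluster_onehots)))
    return a_star / a_k

def adaptive_k(clusters_for, activation, k0, r_min=0.95, k_max=16):
    """Grow k until the top cluster's PPM is sufficiently decoupled (r > r_min)."""
    k = k0
    while True:
        clusters = clusters_for(k)                   # e.g., the k-means demixing above
        ppms = [c.mean(axis=0) for c in clusters]
        ratios = [decoupling_ratio(p, c, activation) for p, c in zip(ppms, clusters)]
        if max(ratios) > r_min or k >= k_max:
            return k, clusters
        k *= 2                                       # one more pattern group doubles the facets here
```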
NeuronMotif Can Reveal Motif Patterns Discovered by Deep CNs.
To evaluate the performance of the method for motif discovery, we first applied NeuronMotif to annotating two well-known models, DeepSEA (11) and Basset (12), both of which are DNA-sequence-based CNN models with three general convolutional layers. We applied NeuronMotif to the CNs in each layer, setting the number of clusters in a clustering step to the max-pooling size. The evaluation shows that many sequence clusters for generating CN CRMs are sufficiently decoupled ($r > 0.95$; SI Appendix, Figs. S11, S12C, and S13), which implies that each of these CNs recognizes one motif or one motif combination with fixed gaps (SI Appendix, Figs. S7–S10). In the Basset model, the first- and second-layer max-pooling sizes are 3 and 4, so the numbers of shifted patterns are 3 and 12 for a CN in L2 and L3, respectively. For the two instances, CN 3 in L2 and CN 3 in L3, shown in Fig. 3 A and B, all adjacent rows are shifted by 1 bp and are highly consistent. However, the state-of-the-art SABMs [Kelley et al. (12), Alipanahi et al. (27)] and AMBMs [saliency map (23), DeepLIFT (22), DeepResolve (24)] cannot deal with mixed patterns, which leads to much lower information content and very noisy patterns in the PWM or the IS of the top activated sequence. For each CN, we matched the discovered motifs to the JASPAR (32) database by Tomtom (33) and evaluated the similarity between motifs by q values (Materials and Methods). In deep layers, the top motifs of each CN interpreted by NeuronMotif were more similar to the motifs in JASPAR (one-sided t test, Fig. 3D), and NeuronMotif found the most JASPAR motifs in L2 (102 versus 62 in L1, q value < 0.001, Fig. 3C). Compared to the 8 longer motifs found only in L1, NeuronMotif found 39 additional longer JASPAR motifs from L2. We further compared with TF-MoDISco, which aligns the ISs of DeepLIFT to find motifs. TF-MoDISco found motif PWMs for 69 of the 200 CNs in L2 of Basset. Among them, the results of 9 CNs could be matched to 28 JASPAR motifs (q value < 0.001). In contrast, NeuronMotif found motifs for all CNs, and the results of 89 CNs could be matched to 103 JASPAR motifs (q value < 0.001). The motifs found by NeuronMotif are also more similar to the JASPAR motifs (one-sided t test, Fig. 3D). Similar results were observed when comparing with the CIS-BP motif database (34) (SI Appendix, Fig. S14).
Fig. 3.
NeuronMotif is used to interpret the Basset model. (A and B) Motifs of CN 3 in L2/L3 of the Basset model decoupled by NeuronMotif (rows 2 to 4/3 to 14). The 3/12 clusters' sequence logos of the same size (51/132 bp) are aligned with 1 bp offsets. The interpretation results of the existing methods are shown in rows 5 to 12/15 to 22. Tomtom outputs the P values of the motif comparisons. (A) The motifs are matched to the JASPAR motif NFIB using Tomtom (row 1). (B) The motifs are matched to the CTCF core motif in JASPAR using Tomtom (row 1); they are also matched to the motifs for the other modules of CTCF proposed by Soochit et al. (row 2). (C and D) Analysis of the Tomtom q values. (C) The numbers of motifs discovered (q value < 0.001) from the CNs in L1, L2, and L3 of the Basset model using NeuronMotif and state-of-the-art interpretation methods for the corresponding layers. (D) The performance of three interpretation methods applied to three layers of the Basset model. The distribution of the top 100 CNs' top motifs matched to JASPAR motifs is shown with boxes.
NeuronMotif Is Applicable to Different CNN Architectures.
Building deeper neural networks has been shown to improve performance (35) but at the cost of being harder to interpret. To show that NeuronMotif can interpret deeper models, we built deep (10-convolutional-layer) models and trained them on the Basset dataset (BD-10 model) and the DeepSEA dataset (DD-10 model) (Materials and Methods and SI Appendix, Figs. S12E and S15). Their prediction performances are significantly better than those of Basset and DeepSEA, respectively (Materials and Methods, SI Appendix, Figs. S12 F and G, S16, and S17). We applied NeuronMotif to layer 10 (L10) of BD-10 and DD-10. For most L10 CNs, the mixed patterns could be sufficiently ($r > 0.95$) decoupled when NeuronMotif assumes that each CN recognizes two motifs or motif combinations with a flexible gap (Materials and Methods, Fig. 4A, and SI Appendix, Fig. S18). With the help of NeuronMotif, 157 JASPAR motifs could be found from the deep CNs in L10 of BD-10 (see visualized motifs in SI Appendix, Figs. S19 and S20), and the similarity to the motifs in the JASPAR database is significantly improved (Fig. 4B). Among the 512 CNs in L10 of BD-10, we found CRMs with hard or soft syntax rules for 369 CNs, each composed of motifs chosen from these JASPAR motifs. Compared with MEME [a typical SABM (2)] discovering motifs from activated genome sequences, the motifs discovered by NeuronMotif are significantly more similar to JASPAR motifs (SI Appendix, Fig. S21). Interestingly, NeuronMotif found two different basic leucine zipper (bZIP) motifs from one CN, which demonstrates the flexible binding between the two halves of the bZIP domain (SI Appendix, Fig. S22).
Fig. 4.
(A) Comparison of the performance of NeuronMotif applied to L3 of Basset and L10 of the BD-10 model. (B) Distribution of the indicator $r$ for the decoupled CN CRMs of L10 in the BD-10 model. (C) Comparison of the performance of NeuronMotif and Kelley et al. applied to L1 and L2 of the Basset model (ReLU) and a Basset-like model (exponential activation function). (A and C) The Venn plots show the numbers of motifs discovered (q value < 0.001) from the CNs in different layers of each model using the corresponding interpretation methods. The distribution of the top 100 CNs' top motifs matched to JASPAR motifs is shown with boxes.
To interpret deep CNs, Koo et al. (36) trained a Basset-like model by replacing the ReLU activation function with an exponential function in the first layer. Consistent with Koo's result, using the exponential activation function yields more (147 versus 62) and better-matched JASPAR motifs from the CNN model, while applying NeuronMotif to this modified CNN further improves the similarity (one-sided t test) and finds even more motifs (157 versus 147) (Materials and Methods, Fig. 4C). Hence, NeuronMotif is a generally applicable method that can be used to interpret CNNs of diverse architectures.
NeuronMotif Provides a Way to Obtain Motif Syntax from Deep CNs.
To summarize each CN's motif syntax and the relationship of syntax rules across different layers, we developed a method to extract motif arrangements and gap sizes for the CN CRMs in different layers of the substructure of a CN of interest and to depict their combinatorial logic as a tree-structured syntax rule (Fig. 5; see Materials and Methods for details). The method consists of five steps: I) rank the quality of the CN CRMs of a CN according to $r$, $a_k/a_{\max}$, and information content; II) for all high-quality CN CRMs, use their motif segments to build a motif dictionary, removing duplicated motif segments found by Tomtom; III) match the segments of each CN CRM to the motifs in the dictionary using Tomtom, and align the segments across CN CRMs; IV) compute the gap sizes between adjacent motif segments in each CN CRM; V) use the dictionary motifs corresponding to the aligned segments as leaf nodes to build a tree structure. NeuronMotif iteratively connects the two motif or motif-combination branches with the smallest flexibility and size of gap by creating branch nodes. We marked each branch node with gap size ranges and marked each leaf node with the truth table of motif combination occurrence in the CN CRMs to show the gap sizes, arrangement, and motif combinatorial logic. In shallower layers, the rules are usually very simple and contain only one motif. The rules in deeper layers extend those of the previous layers by widening gap size ranges or by combining more or different motifs learned in previous layers. To show the relationship between the rules of different layers, we used Tomtom to locate the motif/motif combination of each tree node in the layer where it first occurs. If the motif/motif combination does not occur in the previous layers, it is newly generated in the current layer; otherwise, it only changes the gap size ranges of CRMs from previous layers. Here, we take CN 1130 in L10 of the DD-10 model as an example to briefly depict the algorithm in Fig. 5. In addition, we show the results for CN 1130 and CN 254 in Fig. 6 A and B, respectively.
Fig. 6.
Verification of CN CRMs and their syntax. (A and B) Motif syntax represented by CN 1130/254 of L10 in the DD-10 model. The layers in the CNN that have learned a motif or motif combination are shown at the corresponding tree node. (A) The motif segments of CN CRMs are matched to the CTCF motifs in the study by Soochit et al. (B) The motif segments of CN CRMs are matched to the bZIP and NFI motifs in JASPAR. (C and D) ATAC-seq footprinting in five cell lines or tissues for the motifs in A and B. The footprinting is extended 500 bp upstream and downstream from the CN motif-matched midpoint in the top 3,000 sequences. The cut-site counts at each position are normalized by the total cut-site counts within a 1,000 bp window. The expression level (transcripts per million, TPM) of the CTCF/NFI family (NFIA, NFIB, NFIC, and NFIX) in the RNA-seq data is shown for each cell line/tissue at the Right of C and D. (E) CTCF motif-matched counts at every relative position in the top 3,000 sequences. For a given position, a matched motif is counted only if its midpoint is at that position. The number of each combinatorial motif with valid gap sizes is shown at the Bottom. (F) Similar to E, NFIX motif-matched counts at every relative position and the relations of the NFIXs (P < 0.05).
Motif Combination Found by NeuronMotif Is Supported by Chromatin Openness Profiles.
Many of the motif syntaxes that NeuronMotif discovered are supported by the literature. For example, CN 1130 in L10 of DD-10 represents a soft CTCF homodimer with an approximately 58 bp interval, which plays important roles in the transcriptional processes of cancer and germ cell development (37) (Fig. 6A). The zinc finger clusters ZF4-7 and ZF8-11 of the 11 ZFs in CTCF can bind to each of the discovered hard-syntax motif heterodimers (38) (Fig. 6A), a conserved hard motif syntax that also occurs in the Basset model (Fig. 3B).
As a more comprehensive validation of the motif syntaxes discovered by NeuronMotif, we performed footprinting analysis with the assay for transposase-accessible chromatin with sequencing (ATAC-seq) (39, 40). We collected ATAC-seq data of five cell types or tissues from the GEO database (5, 41–44) (SI Appendix). For each CN, we aligned the corresponding Tn5 transposase cutting frequencies of the top 3,000 genomic sequences (144 bp) in the DD-10 dataset and extended the footprinting region to 1,000 bp in total. ATAC-seq uses Tn5 transposase to cut DNA into fragments (5); if TFs or other molecules bind to DNA, the cutting frequency is affected (39), and we expect the resulting footprint to have features consistent with the CN CRM of that CN. For example, in the footprints of CN 1130 (Fig. 6C), there are two valleys and three peaks, consistent with the motif syntax summarized by NeuronMotif: the two valleys with lower cutting frequency correspond to the binding sites of the CTCF homodimer, and the peaks with higher cutting frequency correspond to the gap and flanking regions of the CTCF homodimer. Additionally, we found that some motif syntaxes are cell type specific. For example, in the footprints of CN 254 (Fig. 6D), the pattern is significant only in prostate tissue and the LNCaP cell line, which may be correlated with the expression levels of the TFs that can bind to the motifs in the syntax. The motif syntax of CN 254 consists of an NFI motif homotrimer and a bZIP motif homodimer; correspondingly, we found much higher gene expression levels of the NFI family in prostate tissue and the LNCaP cell line than in the other three cell lines (45–47) (Fig. 6D and SI Appendix, Tables S2–S4). To further confirm that the motifs occur in the sequences, we used motifmatchr (48) to find motif segments (P < 0.05) and calculated the distribution of motif-matched positions for both the CTCF and NFIX motifs (Fig. 6 E and F and SI Appendix). Most CNs have their own footprints in different cell types or tissues (see the NeuronMotif website for other CNs). However, these patterns cannot be found by the general approach of searching for motifs or combinations (SI Appendix, Figs. S5 and S23–S27). In addition, we found that the CN CRMs are significantly more conserved than motif combinations with random gap sizes (SI Appendix, Figs. S28–S30). All these results suggest that NeuronMotif provides a unique way to discover motif combination rules in the genome.
Discussion
In summary, we presented NeuronMotif as a method to dissect the cis-regulatory grammar. Our approach is based on layer-wise demixing of the activating sequences of deep neurons in CNN models with high predictive power for epigenomic features. We showed that the max-pooling-convolutional structure is a main cause of multifaceted neurons in the CNN. We designed a recursive algorithm to sample sequences with high activation on the neurons and to cluster their associated feature maps. In this way, we are able to demix the multiple patterns/facets represented in these sequences. The performance of NeuronMotif was evaluated on the original DeepSEA and Basset models, as well as on several models with deeper architectures or different activation functions. Many of the uncovered motifs and motif combinations are supported by the literature, ATAC-seq data, and RNA-seq data. These results suggest that our method can improve the interpretation of CNN models and advance our understanding of complex cis-regulatory rules.
In addition to interpreting cis-regulatory grammar from CNNs, NeuronMotif may be extended to interpreting CNNs arising in other fields, because those CNNs follow a similar principle and share a similar mechanism of multifaceted CNs. For example, in a CNN based on amino acid sequences of proteins (29), NeuronMotif was able to resolve the shifting patterns learned by a CN (SI Appendix, Fig. S31). Furthermore, the NeuronMotif interpretation of each CN may be beneficial for improving deep learning models. Since understanding the features learned by deep neurons is a key challenge (21, 49), we believe NeuronMotif will be generally useful for deep learning with CNN models.
Materials and Methods
The Substructure of a CN in a CNN.
In this work, the substructure of a CN in a CNN is defined as the structure including the model weights and the neuron connections between the relevant input features (receptive field) and the CN. Mathematically, the CN substructure can be represented by a function $y = f(x)$, where $x$ is the feature vector of the receptive field and $y$ is the activation of the CN. The substructure of each CN in a CNN can be interpreted alone (SI Appendix, Fig. S32). We take the toy model as an example to demonstrate the substructure of a CN in a CNN (Fig. 2D and SI Appendix, Fig. S33). In convolutional L1/L2, there are three/one filters, each of which has its own substructure (see SI Appendix, Fig. S33 for example substructures). Within each channel, the substructures of the CNs are duplicated by scanning across the sequence, so only one CN per channel needs to be studied. The substructure of a CN in a deeper layer represents the motif combination obtained by detecting the activated CNs in the previous layer (SI Appendix, Fig. S33).
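In a framework such as Keras, a CN's substructure can be probed by truncating the model at that CN's layer and indexing one channel. A minimal sketch under stated assumptions (the layer name is a placeholder for whatever the model defines):

```python
import tensorflow as tf

def cn_substructure(model, layer_name, channel):
    """Return f: receptive-field input -> activation of one CN (channel at position 0)."""
    layer_out = model.get_layer(layer_name).output        # (batch, positions, channels)
    probe = tf.keras.Model(model.input, layer_out)
    def f(x):
        # Weight sharing duplicates the substructure across positions,
        # so probing one CN per channel suffices.
        return probe.predict(x)[:, 0, channel]
    return f
```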
PWM Estimation.
To estimate a PWM, one usually first estimates the position probability matrix (PPM); a PPM can be equivalently converted to a PWM. For convenience, all intermediate and final results are stored as PPMs. Given a set of sequences (of equal length) with high activation on a CN, the PPM is estimated by computing the position-specific relative frequencies of the four bases. If the set is sufficiently demixed, in the sense that it cannot be further decomposed into two or more different clusters, then the PPM may serve as a good model for a single motif or for a module of motifs with fixed relative positioning.
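A minimal sketch of this estimation, plus the standard log-odds conversion from PPM to PWM (a uniform background is assumed here):

```python
import numpy as np

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def estimate_ppm(seqs):
    """Position-specific relative frequencies of the four bases over equal-length sequences."""
    counts = np.zeros((len(seqs[0]), 4))
    for s in seqs:
        for j, base in enumerate(s):
            counts[j, IDX[base]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def ppm_to_pwm(ppm, background=0.25, eps=1e-9):
    """Equivalent log-odds representation (PWM) of a PPM."""
    return np.log2((ppm + eps) / background)

# Example: a well-demixed cluster yields a sharp column at each conserved position.
print(estimate_ppm(["CACCTG", "CACCTG", "CACGTG"]))
```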
Demixing Multifaceted CNs.
In NeuronMotif, a motif or a motif combination with fixed gaps (hard syntax) in the CN input sequence is simply regarded as a single motif pattern group. Usually, a shallow CN recognizes one motif pattern group. Correspondingly, the maximum number of facets (CN CRMs) recognized by a CN is $\prod_{l=1}^{L} m_l$, where $L$ is the number of layers in the substructure of the CN and $m_l$ is the pooling size of the max-pooling operation applied to convolutional layer $l$. However, a deeper CN can recognize CN CRMs with two or more motif pattern groups with flexible gaps, each of which can shift freely in the input sequences. Correspondingly, the maximum number of facets (CN CRMs) is $(\prod_{l=1}^{L} m_l)^{P}$, where $P$ is the number of motif pattern groups. Since $P$ cannot be known in advance, NeuronMotif sets $P = 1$ as the default value; at the end of this section, we state how NeuronMotif automatically increases $P$ to ensure that the CN CRMs are sufficiently decoupled. Due to the exponential number of facets, the samples may not be sufficient for each facet, so NeuronMotif avoids overclustering by not further splitting any subsample set with no more than 50 samples (SI Appendix, Fig. S34). To demonstrate the detailed operations of the algorithm, we extend the toy model in Fig. 2D: the number of filters in convolutional L2 is changed from one to three, and the input sequence size is changed from 8 bp to 12 bp (SI Appendix, Fig. S33). Given the CNN model weights, the NeuronMotif program executes as described below (see SI Appendix, Tables S5–S7 for algorithm pseudocode).
Symbol definitions: $N^{(l)}_{c,p}$, the CN located at channel $c$ and position $p$ of layer $l$ (also written simply as $N$); $f$, the output function of the substructure of $N^{(l)}_{c,p}$ or $N$; $B$, the number of activation bins (see SI Appendix, Figs. S35–S37 for the default choice and its rationale); $\mathrm{PPM}_k$ and $a_k$, the PPM generated by sequence cluster $k$ of $N$ and the maximum activation value of the sequences in cluster $k$, respectively; $a_{\max} = \max_k a_k$; $D$, the sample set of $N$; and $M$ and $S$, the numbers of samples and seed sequences, respectively.
In convolutional L1, NeuronMotif interprets the substructures of 3 CNs representing the 3 types of CNs. For each CN substructure, there are 3 steps to estimate the PPM of the CN CRM. 1) Sampling positive sequences. NeuronMotif enumerates all 5-bp sequences that activate the CN (5 bp is the size of the receptive field) and calculates their CN activation values $f(s)$. Since all positive sequences can be enumerated, initializing and optimizing sequences for sampling is not necessary. 2) Backward demixing. Since there are no max-pooling operations before L1, the demixing step is skipped and all sequences belong to one cluster. 3) Estimating the PPM of the cluster as the CN CRM. The interval $[0, a_{\max}]$ is split into $B$ bins, and the sequences are grouped according to the bin in which their activation value falls. The sequences in each bin $b$ are stacked into a PPM ($\mathrm{PPM}_b$), and the average of the activation values in the bin is taken as $\bar{a}_b$. The CN PPM is estimated by taking the average of the bin PPMs weighted by the average activation values: $\mathrm{PPM} = \sum_b \bar{a}_b\,\mathrm{PPM}_b \,/\, \sum_b \bar{a}_b$. When the execution is done, NeuronMotif outputs the 3 CNs' CN CRMs and the 3 CNs' maximum activation values.
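The bin-weighted estimate above can be written directly in numpy. A sketch under the notation of this section ($B$ bins over $[0, a_{\max}]$; function and variable names are illustrative):

```python
import numpy as np

def binned_ppm(onehots, activations, n_bins):
    """PPM = sum_b mean_act(b) * PPM_b / sum_b mean_act(b), with bins over [0, a_max]."""
    a = np.asarray(activations, dtype=float)
    edges = np.linspace(0.0, a.max(), n_bins + 1)
    bin_ids = np.clip(np.digitize(a, edges) - 1, 0, n_bins - 1)
    num, den = np.zeros(onehots.shape[1:]), 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue                                  # empty bins contribute nothing
        ppm_b = onehots[mask].mean(axis=0)            # stack the sequences in bin b
        w = a[mask].mean()                            # average activation of the bin
        num, den = num + w * ppm_b, den + w
    return num / den
```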
In convolutional L2, NeuronMotif interprets 3 CN substructures representing the 3 types of CNs, based on the CN CRM (PPM) results from L1. Similar to L1, for each CN substructure in L2, there are also 3 steps to estimate the CN CRMs (PPMs). We take one of them, $N^{(2)}_{3,2}$ (position 2 in channel 3 of L2), as an example (SI Appendix, Fig. S33). The substructure of $N^{(2)}_{3,2}$ is the same as the CNN structure in Fig. 2D.
1) Sample at most $M$ 8-bp sequences (the receptive field of a layer-2 CN) (see SI Appendix, Figs. S33 and S38 and Table S5 for the algorithm description). The number of samples is limited by the memory and computing capacity of the machine. The initial sample set is empty ($D = \varnothing$). The details are given below.
a) Seed sequence initialization. NeuronMotif generates $S$ 8-bp seed sequences randomly as the seed sequence set $D_{\mathrm{seed}}$. To optimize these seed sequences efficiently, for each seed sequence, NeuronMotif pastes in a sequence randomly generated from one of the CN CRM PPMs found in the previous convolutional layer (L1 if the CN is $N^{(2)}_{3,2}$). To select the CN CRM and its pasting position in a seed sequence, NeuronMotif randomly selects a CN $N^{(1)}_{c,p}$ in the previous convolutional layer (L1) from the substructure of $N^{(2)}_{3,2}$, where $c \in \{1, 2, 3\}$ and $p \in \{1, \ldots, 4\}$. The probability of a CN being selected is proportional to its maximum contribution to the activation of $N^{(2)}_{3,2}$. The maximum contribution score of $N^{(1)}_{c,p}$ is its maximum activation value multiplied by the corresponding weight of filter 3 in convolutional L2, indexed by position $q$ and channel $c$. Due to the size-2 max-pooling operation, each pair of CNs at adjacent positions in the same channel of L1, which share the same maximum activation value, is mapped to one weight value in filter 3 of convolutional L2. Based on the selected CN, NeuronMotif randomly selects one of its CN CRMs (here, only one PPM can be chosen for a CN in L1; there can be more than one PPM for a CN in a deeper layer) and pastes a sequence randomly generated from this PPM into the receptive field of the selected CN within the seed sequence.
b) Seed sequence optimization. NeuronMotif optimizes the sequences in $D_{\mathrm{seed}}$ based on the gradient and a genetic algorithm (a condensed sketch of one generation is given after this list). For each sequence, NeuronMotif calculates the gradient matrix $G$ and the probability matrix $Q$:

$$G_{i,j} = \frac{\partial y}{\partial X_{i,j}}, \qquad Q_{i,j} = \frac{\exp(G_{i,j})}{\sum_{j'} \exp(G_{i,j'})},$$

where $y$ and $X$ are the output and input of the substructure, respectively, $i = 1, 2, \ldots, 8$ is the position index, and $j$ is the base index. To optimize a sequence based on the gradient, for each position $i$, NeuronMotif generates the nucleotide of the optimized sequence by randomly selecting A/C/G/T according to the probabilities $Q_{i,j}$. Based on the gradient-optimized sequences, NeuronMotif generates $S$ new sequences ($D'_{\mathrm{seed}}$) for the next generation. The top sequences in $D_{\mathrm{seed}}$ (10% of all sequences) with the highest activation values are reserved in $D'_{\mathrm{seed}}$. The reserved sequences are shifted circularly by 1 bp (the max-pooling size is 2 here), and the shifted sequences are added (20% of all sequences) to $D'_{\mathrm{seed}}$. The remaining sequences in $D'_{\mathrm{seed}}$ (70% of all sequences) are generated by roulette wheel selection based on the activation value and a cross-over operation. In the cross-over operation, the selected sequences in $D_{\mathrm{seed}}$ are grouped into pairs randomly; for each pair, NeuronMotif randomly selects a position in the sequence and swaps the subsequences of the two sequences starting from the selected position. Finally, these new sequences are added to $D'_{\mathrm{seed}}$. In each generation, the sequences are optimized and updated in this way ($D_{\mathrm{seed}} \leftarrow D'_{\mathrm{seed}}$).
c) Sample sequences during optimization. In each generation, when the optimization is done, NeuronMotif updates the sample set by merging it with the new positive sequences: $D \leftarrow D \cup D^{+}$, where $D^{+}$ is the set of positive sequences ($f(s) > 0$) in the new generation. NeuronMotif calculates the CN activation values of the sequences in $D$ and splits the interval $[0, a_{\max}]$ into $B$ bins. If the number of sequences in a bin is greater than $M/B$, then NeuronMotif randomly keeps $M/B$ of the sequences in the bin and removes the remaining sequences.
d) Repeat b) and c) until the average activation value of the seed sequences in $D_{\mathrm{seed}}$ does not increase for 10 iterations or the number of iterations exceeds 200. NeuronMotif outputs up to $M$ sequences ($|D| \le M$).
2) Perform backward, recursive demixing (see SI Appendix, Table S6 for the algorithm description). From the deepest layer (L2 if the CN is $N^{(2)}_{3,2}$) of the substructure of the CN in question down to the shallowest layer (L1), NeuronMotif detects whether the CN is the result of a max-pooling operation. If so, it calculates the input feature maps (to the max-pooling) for the sequences in $D$ generated for $N^{(2)}_{3,2}$ and clusters the flattened feature maps into 2 clusters ($2^P$ clusters for $P$ motif pattern groups) using k-means ($k = 2^P$, Euclidean distance). NeuronMotif outputs the 2 ($2^P$) clusters of sequences. Note that the demixing stops when the recursion reaches the shallow layers where there is no max-pooling anymore.
3) Estimate the PPMs of the CN CRMs. Each cluster of sequences represents one CN CRM. For each cluster $k$, NeuronMotif splits the interval $[0, a_k]$ into $B$ bins, stacks the sequences in each bin $b$ into $\mathrm{PPM}_{k,b}$, and takes the average of the activation values in the bin as $\bar{a}_{k,b}$, where $a_k$ is the maximum activation value of the sequences in the cluster. Then, NeuronMotif calculates the PPM of the cluster by taking the average of the bin PPMs weighted by the average activation values: $\mathrm{PPM}_k = \sum_b \bar{a}_{k,b}\,\mathrm{PPM}_{k,b} \,/\, \sum_b \bar{a}_{k,b}$. Here, NeuronMotif interprets $N^{(2)}_{3,2}$ as 2 CN CRMs ($\mathrm{PPM}_1$ and $\mathrm{PPM}_2$), with corresponding maximum activation values $a_1$ and $a_2$.
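The sketch below condenses one generation of the seed-sequence optimization in step 1b. It is illustrative rather than the released implementation: `activation` and `gradient` stand for assumed callables evaluating the CN substructure and its input gradient, a large population is assumed, and the softmax mutation is one way to realize "selecting A/C/G/T based on the probability matrix $Q$".

```python
import numpy as np

rng = np.random.default_rng(0)

def ga_generation(seeds, activation, gradient):
    """One generation: gradient-guided mutation, then 10% elite / 20% shifts / 70% crossover."""
    n, L, _ = seeds.shape
    G = gradient(seeds)                                    # (n, L, 4): d activation / d input
    Q = np.exp(G) / np.exp(G).sum(axis=2, keepdims=True)   # position-wise base probabilities
    mutated = np.zeros_like(seeds)
    for m in range(n):
        for j in range(L):
            mutated[m, j, rng.choice(4, p=Q[m, j])] = 1.0  # resample each base from Q
    a = np.asarray(activation(mutated), dtype=float)
    elite = mutated[np.argsort(-a)[: max(1, n // 10)]]     # keep the top 10%
    shifted = np.concatenate([np.roll(elite, 1, axis=1),   # circular 1-bp shifts (20%)
                              np.roll(elite, -1, axis=1)])[: max(1, n // 5)]
    # Roulette-wheel selection plus one-point crossover for the remaining ~70%.
    p = a - a.min() + 1e-9
    p /= p.sum()
    children = []
    while len(elite) + len(shifted) + len(children) < n:
        i, j = rng.choice(n, size=2, p=p)
        cut = rng.integers(1, L)
        children.append(np.concatenate([mutated[i, :cut], mutated[j, cut:]]))
    parts = [elite, shifted]
    if children:
        parts.append(np.asarray(children))
    return np.concatenate(parts)
```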
Similarly, NeuronMotif generates the CN CRMs for the first and second types of CNs in L3; together, these constitute the output of NeuronMotif for L3. Before ending step (3) in the program, NeuronMotif applies t tests to the indicators ($r$) of each CN's top CN CRM. If the average of $r$ is significantly greater than 0.95 (P < 0.1), then NeuronMotif stops. Otherwise, NeuronMotif sets $P \leftarrow P + 1$ and reruns steps (2) and (3).
Generally, in the deeper layers, the process is similar to that for L2. NeuronMotif samples sequences based on the previous layer’s CN CRMs and clusters the sequences layer-wise and backward until there are no more max-pooling operations. NeuronMotif outputs the PPMs of CN CRMs and maximum activation values for each CN in deeper layers.
Theory for Estimating the PPM of a CN CRM for a Facet of a CN through Importance Sampling.
We assume that a CN is activated when the relevant TF or TF complex binds (i.e., the binding event $B = 1$ corresponds to activation of the CN) (see SI Appendix, Figs. S39–S42 for motivation). If a sequence $s$ is sampled from a prior distribution $P(S)$ and we know that it activates the CN, the distribution of $s$ is given by the posterior distribution of $S$ given $B = 1$. We define the (generalized) CRM PPM for the CN as $\Theta = E[S \mid B = 1]$. Specifically, $S$ contains binary random variables $S_{i,j}$, and each column of $S$ follows a categorical distribution over the four bases ($\sum_j S_{i,j} = 1$). Then, by Bayes' theorem, we have

$$P(S = s \mid B = 1) = \frac{P(B = 1 \mid S = s)\,P(S = s)}{P(B = 1)}.$$

Therefore, the expectation is

$$\Theta = E[S \mid B = 1] = \frac{1}{P(B = 1)} \sum_{s} s\,P(B = 1 \mid S = s)\,P(S = s),$$

where $P(B = 1)$ is a constant. We used the sampled sequences to estimate $\Theta$ via importance sampling with weights

$$w(s) \propto P(B = 1 \mid S = s).$$

We defined $P(S)$ implicitly. Let $A$ denote the random variable of activation for the CN. We set $P(B = 1 \mid S = s) \propto f(s)$. Then, the prior distribution was defined implicitly by requiring the activation $A$ of the sampled sequences to be approximately uniformly distributed over $[0, a_{\max}]$. Defining $P(S)$ in this way ensures that the same number of sequences is sampled at different levels of binding probability and prevents the sample set from being overwhelmed by sequences with low activation values (see SI Appendix, Figs. S33 and S38 and Table S5 for the algorithm description). We sampled $M$ sequences from this implicit prior; the sample set was $D = \{s_1, \ldots, s_M\}$. Then, $\Theta$ could be estimated by

$$\hat{\Theta} = \frac{\sum_{m = 1}^{M} f(s_m)\,s_m}{\sum_{m = 1}^{M} f(s_m)},$$
where $f(s_m)$ is the activation value of sequence $s_m$. The estimation above assumes that the relative positions of the CRM's motifs in the input sequences are the same and that all of those CRMs share the same motif or motif combination. However, this assumption does not hold for most CNs, especially CNs in deeper layers, for which this estimate is a mixture of different instances of CRMs. The random variable matrix $S$ can then be considered a mixture model. Hence, we need to find the latent variable $Z$ and estimate

$$\Theta_z = E[S \mid B = 1, Z = z],$$

where $Z$ is discrete and $z \in \{1, 2, \ldots, K\}$ indexes the facets. We can control this latent variable by splitting the dataset $D$ into subsets $D_1, \ldots, D_K$, each of which can be considered a positive sequence set sampled from one CRM PPM. Stacking these sequence sets respectively generates $\hat{\Theta}_1, \ldots, \hat{\Theta}_K$, which are the real combinatorial motif patterns that the CN has learned (see SI Appendix, Table S5 for more algorithm description).
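In code, the weighted estimator and its demixed version are short. A sketch in numpy (the names are ours; `f_vals` holds the activation values of the sampled one-hot sequences):

```python
import numpy as np

def weighted_ppm(onehots, f_vals):
    """Theta_hat = sum_m f(s_m) * s_m / sum_m f(s_m) over sampled one-hot sequences (M, L, 4)."""
    f_vals = np.asarray(f_vals, dtype=float)
    return np.tensordot(f_vals, onehots, axes=(0, 0)) / f_vals.sum()

def demixed_ppms(onehots, f_vals, labels):
    """Split D into subsets D_z by latent labels, then estimate Theta_z within each subset."""
    f_vals = np.asarray(f_vals)
    return {z: weighted_ppm(onehots[labels == z], f_vals[labels == z])
            for z in np.unique(labels)}
```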
Segmentation of CN CRM.
Since the PPM of a CN CRM is not reliable when the number of sequences is very small, we first applied Laplace smoothing to the PPMs of the CN CRMs. The smoothed PPM ($\mathrm{PPM}_{s}$) is obtained by

$$\mathrm{PPM}_{s} = \frac{n \cdot \mathrm{PPM} + \beta \cdot E}{n + \beta},$$

where $n$ is the number of sequences that generate the PPM, $E$ is the matrix in which all elements are 0.25, and $\beta$ is the smoothing parameter, which was fixed throughout this work.
Based on $\mathrm{PPM}_{s}$, NeuronMotif calculates the information content (IC) of each nucleotide position $j$:

$$\mathrm{IC}_j = 2 + \sum_{b \in \{A, C, G, T\}} p_{j,b} \log_2 p_{j,b},$$

where $p_{j,b}$ is the element of $\mathrm{PPM}_{s}$ at position $j$ and base $b$. We used a 3 bp window to smooth the IC by taking the average of the information contents within the window. Contiguous runs of nucleotide positions with smoothed IC above 1 are considered motif regions, and these regions are extracted as motif segments.
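A sketch combining the smoothing and the IC-based segmentation described above (the smoothing weight `beta` is a placeholder; the paper fixes its own value):

```python
import numpy as np

def smooth_ppm(ppm, n_seqs, beta):
    """Laplace smoothing toward the uniform matrix E (all entries 0.25)."""
    E = np.full_like(ppm, 0.25)
    return (n_seqs * ppm + beta * E) / (n_seqs + beta)

def motif_segments(ppm, window=3, threshold=1.0):
    """IC_j = 2 + sum_b p_jb log2 p_jb; 3-bp moving average; segments where smoothed IC > 1."""
    ic = 2.0 + np.sum(ppm * np.log2(ppm + 1e-12), axis=1)
    ic_smooth = np.convolve(ic, np.ones(window) / window, mode="same")
    above = ic_smooth > threshold
    segments, start = [], None
    for j, flag in enumerate(above):
        if flag and start is None:
            start = j
        if not flag and start is not None:
            segments.append((start, j))               # half-open interval [start, j)
            start = None
    if start is not None:
        segments.append((start, len(above)))
    return segments
```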
Rank the Quality of CN CRMs for a CN.
The quality score of a CN CRM is computed from $\overline{\mathrm{IC}}_{\mathrm{motif}}$ and $\overline{\mathrm{IC}}_{\mathrm{rest}}$, the average information contents of the nucleotide positions in motif regions and in the remaining regions, respectively. A CN CRM with a higher score usually has better quality. To evaluate the quality of the sampled sequences that generate the PWM, we used the ratio $a_k/a_{\max}$ to estimate the relative coverage of important sequences, where $a_{\max}$ is the maximum activation value of the CN, estimated by the maximum activation value over all sampled sequences. A higher ratio means that the patterns of the sequences in the cluster are preferred by the CN, so the PWM is more important.
Build Motif Dictionary and Annotate CN CRM.
We gradually added the motif segments of CN CRMs into an initially empty motif dictionary: (a) use the motif dictionary to annotate the segments in all CN CRMs by Tomtom; (b) select the top-quality CN CRMs whose motif segments are not all annotated, and add the unannotated segments of these CN CRMs to the motif dictionary; (c) repeat (a) and (b) until the number of CN CRMs whose motif segments are all annotated increases by less than 3. To remove duplicated motifs from the motif dictionary, we used Tomtom to compute the q value of every motif segment pair in the dictionary; for each pair with a q value below the significance threshold, we kept only the longer segment. Then, we used Tomtom and the duplicate-removed motif dictionary to annotate the segments in all CN CRMs. The motif set of the CN CRM with the most motifs is used as a template to align the segments across CN CRMs. If there are multiple ways to align a CN CRM, we selected the one with the smallest sum of squares of the distances between the aligned motifs' relative positions in the two CN CRMs.
Tree-Structured Syntax.
We used a binary tree to represent the syntax rule. In the tree structure, leaf nodes are the motifs in the dictionary. A branch node connects two branches, which represent motifs (leaf nodes) or motif combinations (branch nodes), into a new motif combination at a spacing of specific gap sizes (order: left_branch-gap-right_branch).
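One way to encode such a tree (an illustrative data structure, not the released output format; the gap values are placeholders):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SyntaxNode:
    motif: Optional[str] = None                  # leaf: a motif name from the dictionary
    left: Optional["SyntaxNode"] = None          # branch: left motif or combination
    right: Optional["SyntaxNode"] = None         # branch: right motif or combination
    gap: Optional[Tuple[int, int]] = None        # branch: (min, max) gap size in bp

# Illustrative: a soft homodimer, i.e., two CTCF leaves joined by a flexible gap range.
ctcf_dimer = SyntaxNode(left=SyntaxNode(motif="CTCF"),
                        right=SyntaxNode(motif="CTCF"),
                        gap=(55, 61))
```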
Show the Relationship between the Rules of Different Layers.
The subtree corresponding to each tree node represents a motif combination from the motif dictionary of the CN. A motif combination may first be learned by a CN in a previous layer or in the current layer. We use Tomtom to map the motifs in the combination to each CN's dictionary (q value < 1e-4). If the motifs all belong to the dictionary of one CN, we link to that CN in the HTML visualization result. We also label the tree node with the layer in which the motif combination first occurs.
Algorithm Implementation.
NeuronMotif was implemented in Python. The current version of NeuronMotif can only be applied to CNNs implemented in TensorFlow or Keras. See SI Appendix, Tables S5–S7 for implementation details.
Prediction Performance Comparison.
For each prediction target, we calculated the area under the precision-recall curve (AUPRC). To compare and test the performance difference between models, we assumed that if the performance of two models is the same, the difference in the AUPRC values for the same prediction target is a random variable with zero mean. We performed a one-sided t test for each pair of models being compared. The statistical analysis was implemented in R.
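The paper runs this test in R; an equivalent paired one-sided t test in Python looks as follows (the AUPRC arrays are placeholders):

```python
import numpy as np
from scipy import stats

# AUPRC of two models on the same prediction targets (placeholder values).
auprc_a = np.array([0.61, 0.55, 0.72, 0.48])
auprc_b = np.array([0.58, 0.51, 0.70, 0.47])

# H0: the paired AUPRC differences have zero mean; H1: model A is better (one-sided).
res = stats.ttest_rel(auprc_a, auprc_b, alternative="greater")
print(res.statistic, res.pvalue)
```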
Comparison of Motif Discovery Performance.
We used the JASPAR database to evaluate the performance of different interpretation methods applied to one layer of a CNN model. We matched the discovered motifs to the motifs in the JASPAR database through Tomtom. For each discovered motif, we obtained the q value of the best-matched motif in JASPAR. The performance is evaluated in two ways. On the one hand, we evaluate the similarity between the discovered motifs and JASPAR motifs. Each CN in the layer was marked by its top-matched JASPAR motif. We selected the top 100 CNs with the smallest q values and showed the distribution with boxes and jittered points; a distribution with larger $-\log_{10}(q\ \mathrm{value})$ indicates higher similarity. On the other hand, we evaluate the number of distinct JASPAR motifs found. We counted the JASPAR motifs found by each method (q value < 0.001); a larger number indicates better performance. Since many JASPAR motifs are similar or even duplicated, we used Tomtom to evaluate the similarity of each pair of them. If a pair's q value was below the significance threshold, we considered them one motif and kept only one of them in the final result.
ATAC-seq Data Processing and Footprinting Analysis.
The raw ATAC-seq data were processed with the esATAC package (50). We used R to count the cutting frequencies on the genome sequences and visualized the footprinting with the ggplot2 package (SI Appendix).
Supplementary Material
Appendix 01 (PDF)
Acknowledgments
We thank Z. Duren and H. Fang for valuable suggestions on motif discovery and related biological issues. We thank J. Xin for providing some integrated datasets. We thank H. Wang and J. Li for testing the NeuronMotif software. The works of Z.W., L.W., and X.W. were supported by the National Natural Science Foundation of China (No. 62250007, 62225307, 61721003), the National Key R&D Program of China (No. 2020YFA0906900, 2020AAA0105200), and a grant from Institute Guo Qiang, Tsinghua University (2021GQG1023). The works of W.H.W. and S.M. were supported by NIH grants HG010359 and HG007735.
Author contributions
Z.W., W.H.W., and X.W. designed research; Z.W., K.H., R.J., X.Z., Y.L., W.H.W., and X.W. performed research; Z.W. contributed new reagents/analytic tools; Z.W., W.H.W., and X.W. analyzed data; W.H.W. and X.W. supervised research; and Z.W., K.H., L.W., S.M., W.H.W., and X.W. wrote the paper.
Competing interests
The authors declare no competing interest.
Footnotes
Reviewers: J.J.L., University of California Los Angeles; and O.G.T., Princeton University.
Contributor Information
Wing H. Wong, Email: whwong@stanford.edu.
Xiaowo Wang, Email: xwwang@tsinghua.edu.cn.
Data, Materials, and Software Availability
The NeuronMotif code/results are available at https://github.com/XWangLabTHU/NeuronMotif/. The reproducible demos are available at Code Ocean (https://doi.org/10.24433/CO.8395715.v1). Previously published data were used for this work (DeepSEA training data (11): http://deepsea.princeton.edu/media/code/deepsea_train_bundle.v0.9.tar.gz; http://deepsea.princeton.edu/media/code/allTFs.pos.bed.tar.gz; Basset training data (12): https://github.com/davek44/Basset; https://www.dropbox.com/s/h1cqokbr8vjj5wc/encode_roadmap.bed.gz; https://www.dropbox.com/s/8g3kc0ai9ir5d15/encode_roadmap_act.txt.gz; JASPAR database (32): https://jaspar.genereg.net/download/data/2020/CORE/JASPAR2020_CORE_vertebrates_non-redundant_pfms_meme.txt. RNA-seq data from ENCODE ((46, 47): ENCFF345SHY, ENCFF873VWU, ENCFF174OMR, ENCFF395XDK, ENCFF910OBU, ENCFF003XKT, ENCFF928NYA, ENCFF042VIA, and ENCFF485OGN. ATAC-seq from Gene Expression Omnibus (GEO): GSM1155957 (5), GSM2264819 (41), GSM2902637 (42), GSM3632983 (43), GSM3320984 (44). RNA-seq data from GEO: GSM1902621, GSM1902622, and GSM1902623 in GSE73784 (https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE73784%26format=file%26file=GSE73784%5FLNCaP%2ETPM%2Eexpression%2Edata%2Etsv%2Egz) (45).
Supporting Information
References
- 1. Jolma A., et al., DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).
- 2. Bailey T. L., Elkan C., Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36 (1994).
- 3. Thijs G., et al., "A Gibbs sampling method to detect over-represented motifs in the upstream regions of co-expressed genes" in Proceedings of the 5th Annual International Conference on Computational Biology (Association for Computing Machinery, New York, NY, 2001), pp. 305–312.
- 4. Park P. J., ChIP-seq: Advantages and challenges of a maturing technology. Nat. Rev. Genet. 10, 669–680 (2009).
- 5. Buenrostro J. D., Giresi P. G., Zaba L. C., Chang H. Y., Greenleaf W. J., Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213–1218 (2013).
- 6. Mumbach M. R., et al., HiChIP: Efficient and sensitive analysis of protein-directed genome architecture. Nat. Methods 13, 919–922 (2016).
- 7. Castro-Mondragon J. A., et al., JASPAR 2022: The 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 50, D165–D173 (2022).
- 8. Zhou Q., Wong W. H., Coupling hidden Markov models for the discovery of cis-regulatory modules in multiple species. Ann. Appl. Stat. 1, 36–65 (2007).
- 9. Johnson D. S., et al., De novo discovery of a tissue-specific gene regulatory module in a chordate. Genome Res. 15, 1315–1324 (2005).
- 10. Zhou Q., Wong W. H., CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc. Natl. Acad. Sci. U.S.A. 101, 12114–12119 (2004).
- 11. Zhou J., Troyanskaya O. G., Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934 (2015).
- 12. Kelley D. R., Snoek J., Rinn J. L., Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
- 13. Kelley D. R., et al., Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
- 14. Chen K. M., Cofer E. M., Zhou J., Troyanskaya O. G., Selene: A PyTorch-based deep learning library for sequence data. Nat. Methods 16, 315–318 (2019).
- 15. Shrikumar A., et al., Technical note on transcription factor motif discovery from importance scores (TF-MoDISco) version 0.5.1.1. arXiv [Preprint] (2018). https://doi.org/10.48550/arXiv.1811.00416 (Accessed 19 November 2020).
- 16. Nair S., Kim D. S., Perricone J., Kundaje A., Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics 35, i108–i116 (2019).
- 17. Quiroga R. Q., Reddy L., Kreiman G., Koch C., Fried I., Invariant visual representation by single neurons in the human brain. Nature 435, 1102–1107 (2005).
- 18. Goh G., et al., Multimodal neurons in artificial neural networks. Distill 6, e30 (2021).
- 19. Nguyen A., Yosinski J., Clune J., Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. arXiv [Preprint] (2016). https://doi.org/10.48550/arXiv.1602.03616 (Accessed 15 November 2020).
- 20. Olah C., Mordvintsev A., Schubert L., Feature visualization. Distill 2, e7 (2017).
- 21. Eraslan G., Avsec Z., Gagneur J., Theis F. J., Deep learning: New computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
- 22. Shrikumar A., Greenside P., Kundaje A., "Learning important features through propagating activation differences" in International Conference on Machine Learning (Proceedings of Machine Learning Research, Cambridge, MA, 2017), pp. 3145–3153.
- 23. Simonyan K., Vedaldi A., Zisserman A., Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv [Preprint] (2013). https://doi.org/10.48550/arXiv.1312.6034 (Accessed 16 November 2020).
- 24. Liu G., Zeng H., Gifford D. K., Visualizing complex feature interactions and feature sharing in genomic deep neural networks. BMC Bioinformatics 20, 401 (2019).
- 25. Avsec Z., et al., Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 53, 354–366 (2021).
- 26. Israeli J., Kundaje A., How to train your DragoNN (2016). https://www.genome.gov/sites/default/files/Multimedia/Slides/ENCODE2016-ResearchAppsUsers/Anshul_slides.pdf (Accessed 18 September 2022).
- 27. Alipanahi B., Delong A., Weirauch M. T., Frey B. J., Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
- 28. Bogard N., Linder J., Rosenberg A. B., Seelig G., A deep neural network for predicting and engineering alternative polyadenylation. Cell 178, 91–106.e23 (2019).
- 29. Sidhom J. W., Larman H. B., Pardoll D. M., Baras A. S., DeepTCR is a deep learning framework for revealing sequence concepts within T-cell repertoires. Nat. Commun. 12, 1605 (2021).
- 30. Koo P. K., Ploenzke M., Improving representations of genomic sequence motifs in convolutional networks with exponential activations. Nat. Mach. Intell. 3, 258–266 (2021).
- 31. Jolma A., et al., DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature 527, 384–388 (2015).
- 32. Fornes O., et al., JASPAR 2020: Update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 48, D87–D92 (2020).
- 33. Gupta S., Stamatoyannopoulos J. A., Bailey T. L., Noble W. S., Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).
- 34. Weirauch M. T., et al., Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443 (2014).
- 35. Russakovsky O., et al., ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
- 36. Koo P. K., Eddy S. R., Representation learning of genomic sequence motifs with convolutional neural networks. PLoS Comput. Biol. 15, e1007560 (2019).
- 37. Pugacheva E. M., et al., Comparative analyses of CTCF and BORIS occupancies uncover two distinct classes of CTCF binding genomic regions. Genome Biol. 16, 161 (2015).
- 38. Soochit W., et al., CTCF chromatin residence time controls three-dimensional genome organization, gene expression and DNA methylation in pluripotent cells. Nat. Cell Biol. 23, 881–893 (2021).
- 39. Bentsen M., et al., ATAC-seq footprinting unravels kinetics of transcription factor binding during zygotic genome activation. Nat. Commun. 11, 4267 (2020).
- 40. Li Z., et al., Identification of transcription factor binding sites using ATAC-seq. Genome Biol. 20, 45 (2019).
- 41. Liu Q., et al., Genome-wide temporal profiling of transcriptome and open chromatin of early cardiomyocyte differentiation derived from hiPSCs and hESCs. Circ. Res. 121, 376–391 (2017).
- 42. Calviello A. K., Hirsekorn A., Wurmus R., Yusuf D., Ohler U., Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling. Genome Biol. 20, 42 (2019).
- 43. Zhang Z. D., et al., Loss of CHD1 promotes heterogeneous mechanisms of resistance to AR-targeted therapy via chromatin dysregulation. Cancer Cell 37, 584–598.e11 (2020).
- 44. Park J. W., et al., Reprogramming normal human epithelial tissues to a common, lethal neuroendocrine cancer lineage. Science 362, 91–95 (2018).
- 45. Taberlay P. C., et al., Three-dimensional disorganization of the cancer genome occurs coincident with long-range genetic and epigenetic alterations. Genome Res. 26, 719–731 (2016).
- 46. ENCODE Project Consortium, An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
- 47. Davis C. A., et al., The encyclopedia of DNA elements (ENCODE): Data portal update. Nucleic Acids Res. 46, D794–D801 (2018).
- 48. Schep A. N., Wu B. J., Buenrostro J. D., Greenleaf W. J., chromVAR: Inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods 14, 975–978 (2017).
- 49. de Almeida B. P., Reiter F., Pagani M., Stark A., DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat. Genet. 54, 613–624 (2022).
- 50. Wei Z., Zhang W., Fang H., Li Y., Wang X., esATAC: An easy-to-use systematic pipeline for ATAC-seq data analysis. Bioinformatics 34, 2664–2665 (2018).