Predicting gene regulatory regions with a convolutional neural network for processing double-strand genome sequence information

Koh Onimaru; Osamu Nishimura; Shigehiro Kuraku

doi:10.1371/journal.pone.0235748

. 2020 Jul 23;15(7):e0235748. doi: 10.1371/journal.pone.0235748

Predicting gene regulatory regions with a convolutional neural network for processing double-strand genome sequence information

Koh Onimaru ^1,^*, Osamu Nishimura ¹, Shigehiro Kuraku ¹

Editor: Vladimir Makarenkov²

PMCID: PMC7377372 PMID: 32701977

Abstract

With advances in sequencing technology, a vast amount of genomic sequence information has become available. However, annotating biological functions particularly of non-protein-coding regions in genome sequences without experiments is still a challenging task. Recently deep learning–based methods were shown to have the ability to predict gene regulatory regions from genome sequences, promising to aid the interpretation of genomic sequence data. Here, we report an improvement of the prediction accuracy for gene regulatory regions by using the design of convolution layers that efficiently process genomic sequence information, and developed a software, DeepGMAP, to train and compare different deep learning–based models (https://github.com/koonimaru/DeepGMAP). First, we demonstrate that our convolution layers, termed forward- and reverse-sequence scan (FRSS) layers, integrate both forward and reverse strand information, and enhance the power to predict gene regulatory regions. Second, we assessed previous studies and identified problems associated with data structures that caused overfitting. Finally, we introduce visualization methods to examine what the program learned. Together, our FRSS layers improve the prediction accuracy for gene regulatory regions.

Introduction

In the last decade, advances in DNA sequencing technology have dramatically increased the amount of genome sequence data derived from diverse species [1] as well as from individual humans [2]. The next demanding challenge is the deeper understanding of how genome sequences encode phenotypes and how functional information can be extracted [3]. Such sequence-based understanding would ultimately enable the prediction of phenotypes based on genome sequence information, i.e., genotype-phenotype mapping. The syntax of protein-coding genes is well understood, e.g., certain phenotypic consequences are predictable (such as nonsense mutations), yet the basic rules for non-coding sequences have not been established. Several projects including ENCODE [4,5], ROADMAP [6], and FANTOM [7] have accumulated epigenomic data to annotate the characteristics of non-coding sequences, but a comprehensive understanding of the association between epigenomic signatures and sequence information has yet to be attained.

Recently, this challenge has been addressed with deep learning–based methods [8–11]. Deep learning is a subfield of machine-learning methods, and has been applied to a variety of problems such as image classification and speech recognition [12]. For the application of deep learning to genomics, convolutional neural networks (CNN) or recurrent neural networks (RNN) are trained with input genomic sequences that are labeled with epigenomic data to predict functional non-coding sequences. Once trained, these types of classifiers can infer the functional effect of mutations in non-coding genomic regions from individual genome sequences. However, even though these deep learning–based classifiers have outperformed other methods, such as the support vector machine [13], the prediction accuracy is still far from satisfactory. In addition, as it is generally known [14], it remains elusive as to what deep learning–based models actually "learn" and what kinds of "understanding" underlies their predictions. In the present study, we first describe a CNN-based classifier that can outperform state-of-art models. Second, we identified problems associated with data structures that caused overfitting. Finally, we introduce methods to visualize trained models, potentially revealing the general syntax underlying regulatory sequences.

Results

To improve the accuracy of predicting regulatory sequences, we devised a deep learning–based method with three main features: a) integrating information from forward and reverse DNA sequences; b) simplifying the data structure; c) quality control to filter out low-quality data from a training dataset.

Training design and benchmarking

As training data, we downloaded several alignment files containing data from chromatin accessibility assays and chromatin immunoprecipitation-sequencing (ChIP-seq) experiments from the ENCODE project, and regions enriched with reads were determined as peaks by MACS2 peak caller [15]. These data were used to mark genome sequences. We divided each of the mouse and human genome sequences into 1000-basepair (bp) windows and converted the four letters (i.e., A, C, G, T) into one-hot vectors (four dimensional zero-one vectors). To denote epigenomic marks, we assigned each window as 1 if it overlapped a signal positive region or 0 otherwise (signal negative) (see Methods for details). To reduce the number of potential artifacts caused by window boundaries, we also added windows that were shifted by 500 bp toward the 3′ side (Fig 1A; we refer to this window structure as 1-kbp window/0.5-kbp stride). These data were used to train CNN models (see S1 Fig for the overall scheme).

Fig 1 — (A) Schematic representation of how to assign an epigenomic signal into a binary vector. A peak region (light blue bar) and the summit of a peak (red rectangle) are determined from epigenomic data and assigned as "1" to overlapping genomic regions (blue-filled rectangle), otherwise "0" (blue, empty rectangle). DeepGMAP, this study; DeepSEA, DanQ, previous studies. (B) The architecture of the FRSS layers. An input sequence is scanned by the first convolution kernels, and the kernels are rotated in parallel. After implementing the relu operation, pooling, and the second convolution, the two parallel outputs are combined through summation. (C) Illustration of the 180-degree rotation of kernels. Each variable in an M × N kernel is written as w_m,n,h,k ([m] = {0,…,M−1};[n] = {0,…,N−1};[h] = {0,…,H−1};[k] = {0,…,K−1}; H and K are the numbers of the channels and kernels, respectively). w′_{m′,n′,h,k} is a variable after the 180-degree rotation, and equal to w_{(M−1−m),(N−1−n),h,k} (also see Methods). In this example, a 9 × 4 kernel is trained to detect a sequence, AGCTAAAG (left). The geometric rotation of the kernel (right) results in the reverse complement of the original sequence. (D) A visual comparison between DNase-seq peaks and predictions of the conv4-FRSS and DeepSEA models. Black boxes indicate 1 kb-windows that overlap with peaks. Blue and purple boxes indicate peak regions predicted by the indicated models. Note that DeepSEA predictions contains many false positives. (E) The total number of false positives predicted by the DeepSEA model (purple) and the conv4-FRSS model (blue). The averages were calculated from predictions by three independently trained models.

To find an effective architecture of neural networks, we first compared the network architecture of published models such as CNN-based models (DeepSEA [9] and Basset [16]) and a CNN-RNN hybrid (DanQ [10]; see S1 Table for details of the models). Because these models were implemented in different programming languages, we re-implemented them in our code (referred to as the DeepSEA-type model and so on) to examine only the effect of network architectures. In addition, because the aim of this benchmark is to find an architecture that can be efficiently trained with limited data (all epigenomic data are fundamentally limited by the genome size and the binding frequency of transcription factors), and the number of training epochs was constrained to one, which means that models were trained with the full dataset only once. Moreover, because full training of models sometimes take days or weeks, we simplified the training task in this benchmarking as follows: only a subset of mouse DNase-seq data was used (the data IDs are listed in S2 Table). As a result, the training dataset for this benchmark is composed of 16747 mini-batches (each mini-batch contains 100 data). For evaluation, we used the entire chromosome 2 sequence, which was excluded from training datasets, and calculated the prediction accuracy of models using two scores, namely the area under the receiver-operation curve (AUROC) and the area under the precision-recall curve (AUPRC). Because AUROC is not a proper criterion if the majority of the data is negatively labeled, we considered AUPRC as a more important criterion [17]. As a result, we found that the DeepSEA-type architecture performed better than the Basset-type and DanQ-type models at least in this training condition (Table 1).

Table 1. A performance comparison between models.

	129_ES_E14		forebrain E11.5		forelimb E11.5		mean		time^b
	AUROC	AUPRC	AUROC	AUPRC	AUROC	AUPRC	AUROC	AUPRC
DeepSEA-type	0.9234	0.6038	0.9141	0.6380	0.9315	0.5734	0.9230	0.6051	0.1773
Basset-type	0.9085	0.5796	0.9107	0.6280	0.9175	0.5476	0.9123	0.5851	0.0847
DanQ-type	0.9057	0.5785	0.9076	0.6214	0.9046	0.5164	0.9059	0.5721	0.4586
DanQ-type block^a	0.9053	0.5699	0.9063	0.6168	0.9030	0.5031	0.9049	0.5633	0.4832
conv4	0.9300	0.6234	0.9146	0.6426	0.9351	0.5884	0.9266	0.6181	0.2548
conv3-FRSS	0.9364	0.6447	0.9171	0.6496	0.9421	0.6098	0.9319	0.6347	0.3636
conv4-FRSS	0.9414	0.6571	0.9178	0.6520	0.9476	0.6280	0.9356	0.6457	0.3729
conv4-FRSS+	0.9418	0.6631	0.9183	0.6549	0.9475	0.6347	0.9359	0.6509	0.5802
conv4-nonFRSS	0.9331	0.6424	0.9143	0.6407	0.9388	0.6008	0.9287	0.6280	0.3760
DanQ-FRSS	0.9365	0.6523	0.9162	0.6485	0.9453	0.6222	0.9327	0.6410	0.7905

Open in a new tab

^aTensorflow has several LSTM cells, and this model uses LSTMBlockCell, and the above uses LSTMCell. ^bthe mean time (second) of 20 minibatch updates.

Reading forward and reverse-complementary sequences improve prediction accuracy

Based on the above result, we next explored whether better models could be attained by modifying the DeepSEA architecture. First, as deeper convolutional networks are known to perform better [18], we inserted an additional convolution layer between the third convolution layer and the first fully connected layer of the DeepSEA-type model (with a reduced kernel number in some layers and smaller pooling patches). As expected, the four-layer convolutions achieved slightly better accuracy (conv4 in Table 1). We next focused on one of the fundamental characteristics of genome sequences—the forward and reverse strands. In previous studies, the information on reverse-strand sequences was ignored or used as independent data. However, for example, transcription factors with leucine-zipper or helix-loop-helix domains bind both strands through dimerization, and such binding thus results in palindromic binding motifs [19,20]. Inspired by this fact, we devised a set of layers that can integrate information from the both strands (Fig 1B and Methods). We arranged the one-hot vectors that represent AGCT in a symmetric manner so that a 180-degree rotation of the one-hot vectors results in a reverse complementary sequence (Fig 1C). Thus, reverse-sequence information can be processed by rotating the kernels 180 degrees. The hidden layers derived from the two strands are combined by element-wise addition after the second convolution. We termed this architecture FRSS (forward- and reverse-sequence scan) layers. The replacement of the first two convolutional layers in the DeepSEA-type model with FRSS (conv3-FRSS in Table 1) improved the accuracy of prediction more than the simple addition of a convolution layer (conv4). Moreover, a four-layer convolution model with FRSS (conv4-FRSS) achieved higher AUPRC scores than the other models (Table 1, and Fig 1D for a visual comparison). Although the magnitude of increase in AUPRC seemed subtle, the Fig 1D clearly showed a significant decrease in false positives. This visual intuition was validated by the total counts of false positives; the number of false positives by the DeepSEA-type model varied among replicates and was higher than those by the conv4-FRSS model (Fig 1E). Raising the kernel numbers of each convolution layer also slightly improved the performance (conv4-FRSS+ in Table 1) but also increased computation time (the right most column in Table 1). To confirm that the effect of FRSS resulted not from merely increase in the kenerl number but from the rotation of kernels, we replaced the rotated kernels with independently trainable kernels in the conv4-FRSS (named as "conv4-nonFRSS" in Table 1). The replacement resulted in a poorer performance than the conv4-FRSS model, indicating that the rotated kernels improve training efficiency. We also found that the FRSS layers increased the learning efficiency of the DanQ model by a level comparable with conv4-FRSS (DanQ-FRSS in Table 1). These results were consistently reproduced by repeated training and testing (S3 Table), indicating that the benchmark is robust against random initialization. Together, these results suggest that the FRSS layers enhance the predictive power of deep learning–based models.

The length of strides affect training efficiency

Next, we examined the structure and quality of training data using the conv4-FRSS model. In certain previous studies, epigenomic signals were distributed into 200-bp windows of genome sequences, and 400-bp extra-sequences were added to the left and right sides of each window [9] (referred to as "DeepSEA-type data" in this study; Fig 1A). For comparison, we trained conv4-FRSS with the DeepSEA-type data structure. Training with the DeepSEA-type data yielded a higher AUROC score but a lower AUPRC score (DeepSEA-type in Table 2). In addition, we tested several window and stride sizes and found that models can be trained most efficiently with a 1-kbp window and 0.3-kbp stride data. These results indicated that data-structure design is a critical factor for training models efficiently and that data augmentation using small strides, which was implemented in several studies [21,22], is not an optimal strategy for training models.

Table 2. A performance comparison between data structures.

Data structure	chromosome 2 (true test)		chromosome 1 (overfitting test)
Data structure	AUROC	AUPRC	AUROC	AUPRC
DeepSEA-type	0.9623	0.4841	0.9694	0.4741
1 kb window/0.5 kb stride^a	0.9356	0.6457	0.9453	0.6293
1 kb window/0.4 kb stride	0.9352	0.6528	0.9457	0.6386
1 kb window/0.3 kb stride	0.9389	0.6570	0.9515	0.6618
1 kb window/0.2 kb stride	0.9397	0.6530	0.9581	0.6962
1 kb window/0.1 kb stride	0.9368	0.6396	0.9689	0.7757

Open in a new tab

The mean values of the three mouse DNase-seq data are shown. The columns under chromosome 2 are the results of true tests (the higher is better). The columns under chromosome 1 are overfitting tests (if the scores are higher than those of true tests, that is the sign of overfitting). ^aThe same data with the mean values of conv4-FRSS in Table 1.

conv4-FRSS outperforms previous models

To validate the performance of the FRSS layers, we extended the list of training data. First, we newly trained conv4-FRSS with the same set of 125 human DNase-seq data used by Zhou & Troyanskaya (2015) [9] (see S2 Table for the lists of data). We found that conv4-FRSS significantly outperformed DeepSEA and DanQ (Fig 2A and 2C and S4 Table; the scores of DeepSEA and DanQ were obtained from their publications; https://media.nature.com/original/nature-assets/nmeth/journal/v12/n10/extref/nmeth.3547-S3.xlsx for DeepSEA and auc.txt of https://github.com/uci-cbcl/DanQ for DanQ). We also trained conv4-FRSS with 365 and 93 of human and mouse DNase-seq peaks, respectively ("hg38" and "mm10" in the Fig 2A; see S2 Table for the lists of data). As shown in the Fig 2A and S4 Table, we noticed that training with mouse data yielded higher performance than with human data. We also focused on CTCF ChIP-seq data because (a) CTCF has a general function in genome organization, (b) ENCODE releases plenty of CTCF data for both mice and humans, and (c) as evident for the AUPRC scores of DeepSEA (S2 Fig; reanalyzed data from Zhou & Troyanskaya, 2015 [9]), the prediction accuracy is highly dependent on the target of ChIP-seq, and that of CTCF is the highest among transcription factors. As a result, the conv4-FRSS model was efficiently trained with both human and mouse CTCF data, and this model yielded a median AUPRC score of >0.70 (Fig 2B and 2D; the scores of DeepSEA and DanQ were obtained from their publications). Together, these results validate the performance of the FRSS.

Fig 2 — (A, B) Comparison of AUPRCs of DNase-seq data (A) and CTCF data (B), respectively. The scores of DeepSEA and DanQ are obtained from their original studies. conv4-FRSS. hg38 and mm10 are the scores of conv4-FRSS with newly created training data using datasets obtained from the ENCODE project. filtered hg38 and mm10 are the prediction scores of conv4-FRSS trained with high-cFRiP datasets. p, two-sided Mann–Whitney–Wilcoxon test. Boxes, the lower to upper quantile values of the data; orange lines, the median; whiskers, the range of the data (Q1 –IQR × 1.5 and Q3 + IQR × 1.5 for the lower and upper bounds, respectively); flier points, those past the end of the whiskers. (C, D) Scatter plots of AUPRCs of DeepSEA (x axis; scores are obtained from its original data) and conv4-FRSS (y axis).

The quality control of training data

In the above training with extended dataset, we suspected that several classes with low AUPRCs were attributable to poor data quality. To address this possibility, we calculated a known quality index, FRiP (fraction of reads in peaks [23]), of mouse CTCF data and found that classes with low AUPRCs tended to have low FRiPs (top panels in the S3 Fig), with some exceptions (arrows in the S4 Fig and S5 Table). Because FRiP is calculated based on reads in peaks per total reads, values may be inflated when the read number of the source data is too low. To correct such inflation, we multiplied FRiP by the fraction of genomic regions covered by at least one read in a genome. As shown in the S3 Fig, the corrected FRiPs (cFRiPs) could distinguish high and low AUPRC data more clearly. In addition, cFRiP yielded a stronger correlation with the total peak numbers of data than the uncorrected FRiP (bottom panels in the S3 Fig). Although the other datasets showed weaker relations between FRiP and AUPRC, data with optimal cutoff values of cFRiP increased the average AUPRC score except for the human CTCF data (filtered hg38 and mm10 data in the Fig 2A and 2B, and S4 Fig). These results suggested two points. First, the correction method reasonably represents the true quality of the data. Second, the linear correlation between peak numbers and cFRiPs implies that data quality is not saturated enough to detect all true peaks (and hence, the possibility remains that the model learns sample quality rather than cell-specific patterns). Therefore, whereas the ENCODE project seems to consider ≥1% FRiP as good data [23], our results suggest that higher stringency during quality control is required for the precise annotation of genomes.

The utility of conv4-FRSS

Because the conv4-FRSS model showed a good accuracy for CTCF binding site prediction, we further examined the utility of this model. A comparison between the predictions of the model and CTCF motif regions detected by a motif scanning tool, FIMO [24] (FIMO in the Fig 3A) revealed that the model can predict CTCF binding regions that do not contain the canonical CTCF motif. In addition, we also used the class saliency extraction method [25] to identify informative sequences at the single-nucleotide level, which produce data similar to ChIP-seq signals (influential value in the Fig 3A and S5 Fig). This method is equivalent to the in silico saturation mutagenesis approach used previously [8,9] but is more computationally efficient because it does not require mutation-by-mutation evaluation (see Methods for details). Furthermore, using the dog genome (CanFam3.1), we performed a cross-species prediction of CTCF binding regions with the model trained solely with mouse data. As shown in the Fig 3B, the predictions of the model matched with the real CTCF signals of dog liver [26] and detected species-specific differences (compare red rectangles in the Fig 3), indicating the generality of the model. Based on all these data, we concluded that conv4-FRSS can be applied to diverse data.

Fig 3 — (A) Comparison between predictions by conv4-FRSS, experimaental data and motifs detected by FIMO in mouse chromosome 2 sequence. predictions, CTCF binding regions predicted by conv4-FRSS (prediction values ≥0.2 are shown, and the bluer is more probable); influential value, values calculated with the class saliency extraction method; mouse liver CTCF, alignment data from ENCFF627DYN; FIMO, CTCF sites that were detected by FIMO (black boxes, p ≤ 1e-4; gray boxes, 1e-4 < p ≤ 1e-3). (B) Comparison between predictions by conv4-FRSS and experimaental data in the sequence of dog chromosome 36. dog liver CTCF, alignment data from ERR022304; red rectangles, a CTCF peak that is present in the mouse genome (solid rectangle) but not the dog genome (dashed rectangle).

Visualization of what deep-learning models learn about genome sequences

To obtain insights into how the models predict regulatory sequences, we visualized conv4-FRSS trained with the subset of mouse DNase-seq data. First, as has been done in previous studies [8,10], we analyzed individual kernels of the first layer, which directly interact with DNA sequences and thus represent DNA motifs that are important for classifying regulatory sequences. We converted the weight variables to probability matrices by using a softmax-like operation and found that established models have indeed learned many known transcription factor binding motifs, such as those bound by CTCF, SOX9, OCT4 and KLF4 (Fig 4A, S6 Fig and see S6 Table for full comparison between the kernels and known motifs). Next, we applied the activation maximization method [25,27,28], in which we look for DNA sequences that maximize neuron activities in the final layer of a trained model by training the sequence itself (see Methods for formal mathematical expressions). First, we trained a sequence to activate the 129 ES E14–specific neuron in the last output layer. As with image-recognition studies [25,28], this method generated randomly repeated sequences that probably capture the 129 ES E14 class-specific traits (Fig 4B). Using the motif comparison software Tomtom [29], we found that the generated sequence contained the motifs of several reprogramming-related transcription factors, such as OCT4/SOX2 and KLF4, which are important for the pluripotency of embryonic stem cells [30] (the full detected motifs are listed in S7 Table). For comparison, we also trained a sequence to activate all of the three classes of the neurons, resulting in a CTCF motif–rich sequence (Fig 4C). This result is consistent with the general role of CTCF as an enhancer looping factor [31]. In addition, the generated sequence also contained small motifs next to the well-known 20-bp motif (dashed boxes in the Fig 4C, which resembled the other part of the alternative long CTCF motif (33/34 bp in length, including an additional GGNANTGCA or TGCANTNCC sequence) [26] (the full detected motifs are listed in S8 Table). Thus, the detection of motifs that are longer than the kernel size constitutes one of the advantages of this activation maximization method. Taken together, the activation maximization method helps understand how deep-learning models predict enhancer sequences.

Fig 4 — (A) Examples of known motifs (top) that match with kernels. (B, C) Sequences trained to activate the 129 ES E14–specific neuron in the last layer (B) and all neurons in the last layer (C), respectively. Boxes in the left panel, OCT4/SOX2 binding motifs; dashed boxes in the left panel, KLF4 binding motifs; boxes in the right panel, CTCF binding motifs; dashed boxes, motifs similar to the second part of the alternative longer CTCF motif.

Discussions

Our results demonstrate that a new convolution architecture, FRSS, efficiently discerns patterns of regulatory DNA sequences. Furthermore, the analyses of data quality and structure revealed that training efficiencies highly depend on the means by which the source data are filtered and processed. Although the visualization methods are still in development, these methods will provide researchers with clues for understanding the syntax of gene regulatory sequences.

We designed FRSS to process the information for forward and reverse sequences simultaneously, and we demonstrated the efficacy of FRSS for this purpose. This is the first to show that a new CNN architecture specialized for processing DNA sequences outperforms existing models. However, there are many ways for designing layers for this purpose. For example, instead of pair-wise summation in the last layer of FRSS, one can concatenate the output tensors similar to bidirectional RNN [32]. In addition, during finishing this study, we found a study that implemented a CNN model with a similar idea but a different architecture, although their model seemed to suffer from a run-time memory problem, and they did not clearly compare the performance of their model with that of other models [33]. We choose the current FRSS architecture because its computational cost is smaller than that of other configurations. However, better architectures may be possible that retain low computational cost yet offer greater prediction power. Overall, our study indicates that neural net designs that take the nature of genomic molecules into account will yield high accuracy for genomic information processing. We released DeepGMAP (https://github.com/koonimaru/DeepGMAP), a software that allows users to train models and predict regulatory sequences in de novo genome sequences.

Materials and methods

Dataset and processing

The sequences for the human (hg19, hg38) and mouse (mm10) genomes were downloaded from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/). Mitochondrial DNA sequences were excluded from analyses. The genomic sequences were divided into 1-kbp windows with a variety of strides as described in Results. The filtered alignment files for epigenomic data were downloaded from the ENCODE website (https://www.encodeproject.org/; see S2 Table for details). A peak caller, MACS2 version 2.1.1.20160309, was used to determine signal peak regions with the following options: "callpeak—call-summits -t <target bam file> -c <control bam file> -f <BAM or BAMPE> -g <hs or mm> -q 0.01" for ChIP-seq, and "callpeak—call-summits -q 0.01 –nomodel–shift -100 –extsize 200 -t <target bam file> -f <BAM or BAMPE> -g <hs or mm>" for DNase-seq. Using the outputs of MACS2, bedtools [34], and our codes, genomic regions were designated as positive if windows were found to overlap with the summits of peaks (Fig 1A). DNase-seq data with < 10,000 peaks were excluded from training data because they showed an apparent poor quality. For comparative analyses in Fig 2A and 2B, peak files were downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/ (these peak files were generated by ENCODE using hg19). These data were used to mark genome sequences. For cross-species comparison, CanFam 3.1 was downloaded from NCBI (https://www.ncbi.nlm.nih.gov/genome/?term=dog), and the short reads for CTCF ChIP-seq data (ERR022304) were downloaded from EMBL-EBI ArrayExpress (https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-437/).

We divided each of the mouse and human genome sequences into 1000-basepair (bp) windows and converted the four letters into one-hot vectors (four dimensional zero-one vectors). Namely, the DNA symbols, A, G, C, T and N were converted into one-hot vectors: (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), and (0, 0, 0, 0), respectively. Therefore, one training sample became a 1000×4 tensor. To reduce the number of potential artifacts caused by window boundaries, we also added windows that were shifted by 500 bp toward the 3′ side (Fig 1A). With a mini-batch number of 100 for the stochastic gradient descent and channel dimension 1 (a dimension for colors if the input was image data), the final shape of the input tensor was 100×1000×4×1. To denote epigenomic marks, we assigned each window as 1 if it overlapped a signal positive region or 0 otherwise, which became a label tensor of size mini-batch number×class number [i.e., 100 label vectors, and each label looks like (0, 0, 0, 1, 0, 1, …., 0); see the S1 Fig for visual illustration]. We generated several training datasets with different reference genome sequences (for example, we used the mm10 genome sequence for the mouse DNase-seq peaks that we determined with MACS2 and the hg19 genome sequence for the peak files downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/).

Design of models and implementation

Tensorflow r1.8 for python (https://www.tensorflow.org/) with CUDA Driver v9.0 and cuDNN v7.0.5 was used as the main machine-learning library for model implementation. The Tensorflow library was compiled from the source codes with the bazel compiler (https://bazel.build/; with options: "-c opt—copt = -mavx—copt = -mavx2—copt = -mfma—copt = -mfpmath = both—copt = -msse4.2—config = cuda -k //tensorflow/tools/pip_package:build_pip_package"). For convolutions, we used a module, tensorflow.nn.conv2d, with options: "strides = [1, 1, 1, 1], padding = VALID". For the Basset model, because batch normalization did not yield any positive effects, we removed this operation. For RNN layers, we used the long short-term memory cells [35] (tensorflow.nn.rnn_cell.LSTMCell or tensorflow.contrib.rnn.LSTMBlockCell without the peephole option) and tensorflow.nn.bidirectional_dynamic_rnn for output calculations. The operations and hyperparameters of models are listed in S1 Table.

For the FRSS layers, convolution kernels were rotated 180 degrees to extract reverse complementary information in the first and second layers in parallel with normal convolutions as follows. Let W^l a weight tensor of size M×N×H×K (M, kernel height; N, kernel width; H, input channel number; K, output channel number) in the l th convolution layer, defined as ${[W^{l}]}_{m, n, h, k} = w_{m, n, h, k}^{l}$ ([m] = {0,…,M−1};[n] = {0,…,N−1};[h] = {0,…,H−1};[k] = {0,…,K−1}). The rotated tensor $W_{r c}^{l}$ can be written as ${[W_{r c}^{l}]}_{m, n, h, k} = w_{(M - 1 - m), (N - 1 - n), h, k}^{l}$ (see the Fig 1C for a visual illustration for the rotation). Then, the first part of the FRSS layers is

\begin{matrix} X^{1} = p o o l (r e l u (c o n v (W^{0}, X^{0}))), \\ X_{r c}^{1} = p o o l (r e l u (c o n v (W_{r c}^{0}, X^{0}))), \end{matrix}

where X^l is an input/output tensor of size B×S×N×H (B, mini batch number; S, sequence length; N, input width; H, input channel number) in the l th convolution layer, $X_{r c}^{l}$ is the output of convolution with the rotated tensor, conv is the convolution operation, relu is the rectified linear unit, and pool is the max pooling operation of size 2×1 and stride 2. The second part of the FRSS layers is

X^{2} = p o o l (r e l u (c o n v (W^{1}, X^{1})) + r e l u (c o n v (W_{r c}^{1}, X^{1}))) .

Note that the variables of $W_{r c}^{l}$ are not independent training targets but rather are the copies of those of W^l.

Training and testing models

Stochastic gradient–based optimization was used for training models. For human data, the chromosome 8 and 9 were excluded from training data. For mouse data, the chromosome 2 was excluded. The excluded sequences were used to test trained models. hg19 genome was used for comparisons with previously reported AUPRCs (Fig 2A and 2B). A training dataset was subdivided into mini-batches containing 100 sequences. In the quick benchmark, because the fraction of positively labeled genomic windows was too small, negative samples were randomly downsampled to 25%. To accelerate the learning rate and compensate for the limited data size, each mini-batch was further divided into two sub-mini-batches, and updates were made alternately twice per sub-mini-batch (i.e., four updates per mini-batch). The loss function we used for training our models and the DeepSEA model is

\frac{1}{n} \sum_{k = 0}^{n - 1} (y_{k} l o g ({\hat{y}}_{k}) + (1 - y_{k}) l o g (1 - {\hat{y}}_{k})) + λ_{1} {‖ W ‖}_{2}^{2} + λ_{2} {‖ H ‖}_{1},

where y_k is a label of class k, ${\hat{y}}_{k}$ is a prediction after the sigmoid operation, ${‖ W ‖}_{2}^{2}$ is the L2 regularization of all variables, ‖H‖₁ is the L1 regularization of predictions before the sigmoid operation. λ₁ and λ₂ were set as 5×10⁻⁷ and1×10⁻⁸, respectively according to [9]. For the remaining models, the loss function without the regularization terms was used. The optimization algorithms RMSprop [36] and Adam [37] were used to train the DanQ model and the others, respectively. To monitor training accuracy, models were tested by every mini-batch before updating variables and then evaluated based on the F1 score ( $\frac{2 \cdot p r e c i s i o n \cdot r e c a l l}{p r e c i s i o n + r e c a l l}$ ) to assess temporal accuracy. When a mean of F1 scores of the last three tests exceeded a certain threshold (0.75), models were tested by three mini-batch sets that had been randomly excluded from a training dataset and saved when a higher F1 score was attained relative to the previous test.

NVIDIA TITAN X (Pascal; 3584 CUDA Cores; total amount of global memory, 12190 Mbytes; GPU Max Clock rate, 1531 MHz) was used for GPU-computing. Intel Xeon Processor E5-2640 v4 (2.40 GHz) was used for CPU-computing (see S1 Note for the full specification of our machine). Total training time was dependent on the number of classes, the amount of training data, and model architecture. For example, conv4-FRSS took three hours to train with the subset of mm10 DNase-seq data (three classes) and 11 hours with hg38 DNase-seq (365 classes). The bottleneck was, in part, to send tensor data from the GPU to CPU for evaluation of the temporal accuracy of the training models. In previous studies, training time was reported as 85 hours in the Basset paper with NVIDIA Tesla K20m and 164 classes, and as approximately 15 days in the DanQ paper with NVIDIA Titan Z and 919 classes. However, a fair comparison of training time with previous studies is difficult owing to differences in the machines, languages, and datasets.

To test trained models, the chromosome 2 sequence for mice and chromosome 8 and 9 sequences for humans were divided into the same window size and stride with those of the training data. AUROC and AUPRC were calculated with functions in the scikit-learn library version 0.19.1 (http://scikit-learn.org/stable/). We compared the accuracy between a model trained with the full training data and one saved during training, and adopted the one with higher AUPRCs. AUPRCs of DeepSEA and DanQ in the Fig 2A and 2B were obtained from S3 Table of the DeepSEA paper (https://media.nature.com/original/nature-assets/nmeth/journal/v12/n10/extref/nmeth.3547-S3.xlsx) and auc.txt of the DanQ repository (https://github.com/uci-cbcl/DanQ) and are listed in our S4 Table. Predictions presented in the Figs 1D and 3 were visualized using the integrative genomics viewer [38].

The class saliency extraction method

To evaluate informative sequences at the single-nucleotide level (Fig 3 and S5 Fig), the class saliency extraction method was performed as previously described [25] with slight modifications. Let S_i(X⁰) be the score of the class i before sigmoidal operation, computed by the conv4-FRSS model for an arbitrary DNA sequence X⁰ of size 1000×4×1. In other words, S_i(X⁰) is a function that includes all operations such as FRSS, convolution and fully-connected layers except the last sigmoidal operation. The derivative of S_i(X⁰) with respect to X⁰ is

ω = {\frac{\partial S_{i} (X^{0})}{\partial X^{0}} |}_{X_{0}^{0}},

where $X_{0}^{0}$ is a given DNA sequence. The derivative ω is a tensor of size 1000×4×1, and each element corresponds to each element of X⁰. The values shown in the Fig 3A and S5 Fig (orange) are ∑_n|ω_s,n,h| (s = 0, …, 999; n = 0,…, 3; h = 0). To remove irrelevant values, the figures only show sequences with S_i(X⁰)≥−2.0. These values represent the magnitude of the effect on the class score when nucleotide substitutions occur. Therefore, these values may be termed as influential values.

FRiP calculation and correction

To calculate FRiPs, initially, mitochondrial and black-listed regions (https://sites.google.com/site/anshulkundaje/projects/blacklists) were removed from peak files. A module, "countReadsPerBin.CountReadsPerBin" in deeptools [39], was used to count reads in peaks, and these read counts were then divided by total reads. As a correction for read number differences, FRiP was multiplied by the fraction of genome regions covered by at least one read.

Visualizing the inside of a trained model

Variables in each kernel were converted to a probability matrix by a softmax-like operation. Tomtom [29] version 4.12.0 and MEME motif database version 12.15 (http://meme-suite.org/meme-software/Databases/motifs/motif_databases.12.15.tgz) were used to identify closely related known transcription factor binding motifs. For visualizing kernels in the first layer of FRSS, the probability matrices were scaled by information content. Motif logos except Tomtom outputs were generated by our customized codes using the cairocffi library (https://cairocffi.readthedocs.io/en/latest/).

The activation maximization method

Let ${\hat{y}}_{i} (X^{0})$ be the score of the class i (i = 0,1,2), computed by the conv4-FRSS model for a sequence X⁰. To find a sequence that is specific to class i, we sought to optimize the following problem:

a r g \underset{X^{0}}{m a x} \frac{{\hat{y}}_{i} (X^{0})}{\sum_{k \neq i} {\hat{y}}_{k} (X^{0})} - λ {‖ X^{0} ‖}_{2}^{2},

where λ = 5.0×10⁻³. For the common sequence (right panel in the Fig 4), the problem was:

a r g \underset{X^{0}}{m a x} \prod {\hat{y}}_{i} (X^{0}) - λ {‖ X^{0} ‖}_{2}^{2} .

These formulae differ slightly from those reported previously [25] because our model used the sigmoid function for the last output, whereas the others used softmax. The optimizer Adam was used to optimize the target sequence. The initialization of variables of X⁰ was set to have a mean of 0.02 and stddev of 0.02, and this was a critical factor for this optimization problem. Training with different initialization conditions sometimes did not converge, suggesting that the optimization landscape is rugged and there is a possibility that the results shown in the Fig 4 are not the optimal solution. The generated sequences were visualized with our customized codes using the cairocffi library. Tomtom was used to find motifs in the generated sequences.

Supporting information

S1 Fig. Illustration of how a convolutional neural network reads DNA sequences.

(A) One-hot vectors that represent a DNA sequence. A 1-kbp DNA sequence is converted into a 1000×4 matrix. (B) A general convolutional network for sequence classifications. Only the first layer is shown, as an example. In the first layer, 9×4 kernels "scan" a sequence by striding base pair-by-base pair. Each "scan" is actually an element-wise multiplication of the values of a kernel (wmnhk) with values of 0 or 1 in a part of the sequence, yielding a scalar value represented by the black-to-white colored boxes shown below. By striding, scalar values are concatenated to a vector of size 992 (1000–9 + 1). Because the layer has multiple kernels (320 in this case), the output of the layer is 320 vectors. Maxpooling chooses maximum elements in each window. In this case, because the window size is 2, the total length of the sequence is halved by pooling. After fully connecting the layers, sigmoid operations yield output values between 0 and 1, which is compared with the class labels of "0" and "1".

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S2 Fig. AUPRCs of the DeepSEA model.

Boxplots of AUPRCs sorted by factors. The data were derived from nmeth.3547-S3.xlsx of Zhou & Troyanskaya (2015). Boxes, the lower-to-upper quantile values for the data; orange lines, the median; whiskers, the range of the data. Flier points, those past the end of the whiskers. Note that the prediction of CTCF binding sites (red rectangle) had the best performance among transcription factors.

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S3 Fig. Scatter plots for FRiP and corrected FRiP with AUPRCs (top panels) and peak numbers (bottom panels).

AUPRCs are the prediction accuracy of conv4-FRSS trained with mm10 CTCF dataset. Black arrows indicate data with low AUPRCs but relatively high FRiP scores. The quality of these data is reasonably estimated with cFRiP. The validity of cFRiP is also supported by the better correlation with peak numbers. corr., Spearman’s correlation.

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S4 Fig. The average of AUPRCs with various cutoff values of corrected FRiP.

(A, B) Plots of average AUPRCs for human and mouse DNase-seq data (A and B) and CTCF data (C and D) as a function of cutoff values of corrected FRiPs. Note that increase in cutoff values of the corrected FRiPs results in an increase of the average AUPRCs, at least to some extent, except as shown in c. Red line, cutoff values for the filtered data in Fig 2A and 2B.

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S5 Fig. Evaluation of informative sequences at the single-nucleotide level.

(A) CTCF binding distribution of mouse liver E14.5, as an example. Influential value, the values calculated by the class saliency extraction method; Prediction, CTCF binding sites predicted with conv4-FRSS; CTCF signal, CTCF ChIP-seq signals derived from ENCFF844ZSH; red rectangle, a region magnified in b. Note that the distribution of influential values is similar to that of the CTCF signal although the model is not directly trained with the signal data. (B) Comparison of influential values and CTCF binding motif. The CTCF binding motifs were detected with FIMO. Note that nucleotides with high influential value (orange) are aligned with the high information content sites in the CTCF motif (arrows).

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S6 Fig. All kernels of conv4-FRSS trained by a subset of the mouse DNase-seq data.

Note that several kernels contain little information and thus appear empty. These kernels may suggest either that the kernel number was sufficiently set or that training was insufficient to optimize all kernels.

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S1 Note

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S1 Table

(XLSX)

Click here for additional data file.^{(7.9KB, xlsx)}

S2 Table

(XLSX)

Click here for additional data file.^{(25.9KB, xlsx)}

S3 Table

(XLSX)

Click here for additional data file.^{(5.7KB, xlsx)}

S4 Table

(XLSX)

Click here for additional data file.^{(52.9KB, xlsx)}

S5 Table

(XLSX)

Click here for additional data file.^{(9.5KB, xlsx)}

S6 Table

(XLSX)

Click here for additional data file.^{(524.6KB, xlsx)}

S7 Table

(XLSX)

Click here for additional data file.^{(557.7KB, xlsx)}

S8 Table

(XLSX)

Click here for additional data file.^{(119.8KB, xlsx)}

Acknowledgments

We thank Dr. Yuichiro Hara for critical comments and Dr. Mitsutaka Kadota for discussions concerning unpublished data.

Data Availability

All codes used in this paper are available at https://github.com/koonimaru/DeepGMAP. The IDs for data downloaded from the ENCODE website are listed in S2 Table. The data generated and/or analyzed in the current study are available in the figshare repository, https://doi.org/10.6084/m9.figshare.6728348.

Funding Statement

This work was supported in part by JSPS KAKENHI grant number 17K15132 to KO, a Special Postdoctoral Researcher Program of RIKEN to KO, and a research grant from MEXT to the RIKEN Center for Life Science Technologies and RIKEN Center for Biosystems Dynamics Research.

References

1.Meadows JRS, Lindblad-Toh K. Dissecting evolution and disease using comparative vertebrate genomics. Nat Rev Genet. 2017;18: 624–636. 10.1038/nrg.2017.51 [DOI] [PubMed] [Google Scholar]
2.Consortium 1000 Genomes Project, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491: 56–65. 10.1038/nature11632 [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Kratochwil CF, Meyer A. Closing the genotype-phenotype gap: Emerging technologies for evolutionary genetics in ecological model vertebrate systems. BioEssays. 2015;37: 213–226. 10.1002/bies.201400142 [DOI] [PubMed] [Google Scholar]
4.Yue F, Cheng Y, Breschi A, Vierstra J, Wu W, Ryba T, et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature. 2014;515: 355–364. 10.1038/nature13992 [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, Snyder M. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489: 57–74. 10.1038/nature11247 [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, et al. The NIH roadmap epigenomics mapping consortium. Nat. Biotechnol. 2010. pp. 1045–1048. 10.1038/nbt1010-1045 [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507: 455–461. 10.1038/nature12787 [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Kelley DR, Snoek J, Rinn JL. Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26: 990–999. 10.1101/gr.200535.115 [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods. 2015;12: 931–934. 10.1038/nmeth.3547 [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Quang D, Xie X. DanQ: A hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016;44: e107 10.1093/nar/gkw226 [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33: 831–838. 10.1038/nbt.3300 [DOI] [PubMed] [Google Scholar]
12.Lecun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521: 436–444. 10.1038/nature14539 [DOI] [PubMed] [Google Scholar]
13.Ghandi M, Lee D, Mohammad-Noori M, Beer MA. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput Biol. 2014;10 10.1371/journal.pcbi.1003711 [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Alain G, Bengio Y. Understanding intermediate layers using linear classifier probes. Preprint at http://arxiv.org/abs/1610.01644. 2016. Available: http://arxiv.org/abs/1610.01644 [Google Scholar]
15.Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9 10.1186/gb-2008-9-9-r137 [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Kelley DR, Snoek J, Rinn JL, Phelan CM, Kuchenbaecker KB, Tyrer JP, et al. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26: 990–999. 10.1101/gr.200535.115 [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. Proc ICML. 2006. pp. 233–240. 10.1145/1143844.1143874 [DOI] [Google Scholar]
18.Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2015;7–12: 1–9. 10.1109/CVPR.2015.7298594 [DOI] [Google Scholar]
19.Glover JNM, Harrison SC. Crystal structure of the heterodimeric bZIP transcription factor c-Fos–c-Jun bound to DNA. Nature. 1995. pp. 257–261. 10.1038/373257a0 [DOI] [PubMed] [Google Scholar]
20.Shimizu T, Toumoto A, Ihara K, Shimizu M, Kyogoku Y, Ogawa N, et al. Crystal structure of PHO4 bHLH domain-DNA complex: Flanking base recognition. EMBO J. 1997;16: 4689–4697. 10.1093/emboj/16.15.4689 [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Singh S, Yang Y, Poczos B, Ma J. Predicting enhancer-promoter interaction from genomic sequence with deep neural networks. Preprint at https://www.biorxiv.org/content/early/2016/11/02/085241. bioRxiv. 2016. 10.1101/085241 [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Min X, Zeng W, Chen S, Chen N, Chen T, Jiang R. Predicting enhancers with deep convolutional neural networks. BMC Bioinformatics. 2017;18 10.1186/s12859-017-1878-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012. pp. 1813–1831. 10.1101/gr.136184.111 [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Grant CE, Bailey TL, Noble WS. FIMO: Scanning for occurrences of a given motif. Bioinformatics. 2011;27: 1017–1018. 10.1093/bioinformatics/btr064 [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. Preprint at http://arxiv.org/abs/1312.6034. arXiv.org. 2013. Available: http://arxiv.org/abs/1312.6034 [Google Scholar]
26.Schmidt D, Schwalie PC, Wilson MD, Ballester B, Gonalves Â, Kutter C, et al. Waves of retrotransposon expansion remodel genome organization and CTCF binding in multiple mammalian lineages. Cell. 2012;148: 335–348. 10.1016/j.cell.2011.11.058 [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Erhan D, Bengio Y, Courville A, Vincent P. Visualizing higher-layer features of a deep network. Bernoulli. 2009; 1–13. Available: http://igva2012.wikispaces.asu.edu/file/view/Erhan+2009+Visualizing+higher+layer+features+of+a+deep+network.pdf [Google Scholar]
28.Yosinski J, Clune J, Nguyen A, Fuchs T, Lipson H. Understanding Neural Networks Through Deep Visualization. Preprint at http://arxiv.org/abs/1506.06579. 2015. Available: http://arxiv.org/abs/1506.06579 [Google Scholar]
29.Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8 10.1186/gb-2007-8-2-r24 [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Takahashi K, Yamanaka S. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell. 2006;126: 663–676. 10.1016/j.cell.2006.07.024 [DOI] [PubMed] [Google Scholar]
31.Ong CT, Corces VG. CTCF: An architectural protein bridging genome topology and function. Nat. Rev. Genet. 2014. pp. 234–246. 10.1038/nrg3663 [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process. 1997;45: 2673–2681. 10.1109/78.650093 [DOI] [Google Scholar]
33.Shrikumar A, Greenside P, Kundaje A, Science C. Reverse-complement parameter sharing improves deep learning models for genomics. Preprint at https://www.biorxiv.org/content/early/2017/01/27/103663. bioRxiv. 2017. 10.1101/103812 [DOI] [Google Scholar]
34.Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26: 841–842. 10.1093/bioinformatics/btq033 [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Comput. 1997;9: 1735–1780. 10.1162/neco.1997.9.8.1735 [DOI] [PubMed] [Google Scholar]
36.Hinton GE, Srivastava N, Swersky K. Lecture 6e- rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA Neural Networks Mach Learn. 2012; 26–31. Available: https://goo.gl/RsQeis [Google Scholar]
37.Kingma DP, Ba JL. Adam: a Method for Stochastic Optimization. Int Conf Learn Represent 2015. 2015; 1–15. http://doi.acm.org.ezproxy.lib.ucf.edu/10.1145/1830483.1830503 [Google Scholar]
38.Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nat. Biotechnol. 2011. pp. 24–26. 10.1038/nbt.1754 [DOI] [PMC free article] [PubMed] [Google Scholar]
39.Ramírez F, Ryan DP, Grüning B, Bhardwaj V, Kilpert F, Richter AS, et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 2016;44: W160–W165. 10.1093/nar/gkw257 [DOI] [PMC free article] [PubMed] [Google Scholar]

PLoS One. doi: 10.1371/journal.pone.0235748.r001

Decision Letter 0

Jiajie Peng

16 Mar 2020

PONE-D-19-33090

Predicting gene regulatory regions with a convolutional neural network for processing double-strand genome sequence information

PLOS ONE

Dear Onimaru,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Apr 30 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Jiajie Peng

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.plosone.org/attachments/PLOSOne_formatting_sample_main_body.pdf and http://www.plosone.org/attachments/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Annotating biological functions in genome sequences without experiments is still a challenging task. In this manuscript, a deep-learning-based method was proposed to predict gene regulatory regions efficiently.

I have the following major concerns.

Fig 1 is blurry.

Line 71, page 4, references are missing for DeepSEA and Basset.

Line 274, page 16, what are the meanings of labels “0” and “1”?

Line 293, page 17, more explanations are needed for the rotated kernel. Some examples of the kernel may help understand this concept.

Line 298, page 17, what is the max pooling size?

Line 263, page 15, “Data with < 10,000 peaks were excluded from training data.” Why is that?

Line 264, page 15, “peak files were downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/.” Are these peak files matching with hg38? What about the peak files for mouse genomes?

Line 313, page 18, “λ1 = 5×10–7, and λ2 = 1×10–8” More explanations are needed for setting these two parameters.

Line 346, page 19, what is the formula for Si(X0)?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Jul 23;15(7):e0235748. doi: 10.1371/journal.pone.0235748.r002

Author response to Decision Letter 0

3 Apr 2020

We thank the reviewer for his/her valuable comments. Indeed, we agree that the points raised were unclear, and we have therefore used this input to make improvements to the manuscript.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

======Comments======

I have the following major concerns.

Fig 1 is blurry.

=======Response======

We apologize for the problem. Because all figures were created with a same resolution, we have no idea what went wrong with Figure 1. At least, the resolution of Figures in the submitted manuscript was fine on my computer. We have double-checked the resolution of all figures. As the next submission will contain the figures as individual tiff files, the resolution will probably be better than the first submission.

======Comments======

Line 71, page 4, references are missing for DeepSEA and Basset.

=======Response======

We have added references.

======Comments======

Line 274, page 16, what are the meanings of labels “0” and “1”?

=======Response======

We agree that the explanation for generating training data was not clear enough. We assigned each window as 1 if an input sequence overlapped a signal positive region or 0 otherwise. We have improved the explanation for generating the training data (lines 287–298).

======Comments======

Line 293, page 17, more explanations are needed for the rotated kernel. Some examples of the kernel may help understand this concept.

=======Response======

We agree that we should have make the explanation for the rotated kernel easier to understand because this is the key concept of this paper. There was a visual explanation for this rotation in Figure 1C. Now, we have modified Figure 1C to be more consistent with the mathematical notation and cited Figure 1C in the sentence (Figure 1C and line 323).

======Comments======

Line 298, page 17, what is the max pooling size?

=======Response======

We have added the size of the max pooling (line 328).

======Comments======

Line 263, page 15, “Data with < 10,000 peaks were excluded from training data.” Why is that?

=======Response======

We apologize for the poor explanation. First of all, one thing that we did not mention here was that this filtering was applied only for DNase-seq data. We filtered out DNase-seq data with a few peaks because they usually showed a poor quality and disrupted the training. We have added this explanation in lines 278–279.

======Comments======

Line 264, page 15, “peak files were downloaded from http://hgdownload.cse.ucsc.” Are these peak files matching with hg38? What about the peak files for mouse genomes?

=======Response======

We agree that the explanation for generating training data was not clear enough. The reference of the peak files downloaded from the UCSC is hg19. These data are used for comparing the performance of conv4-frss, DeepSEA and DanQ, because DeepSEA and DanQ used these downloaded peaks for training (DeepSEA, DandQ and conv4-FRSS in Figure 2A and B). Separately, to utilize newly added data by the ENCODE project, which uses the mm10 and hg38 genomes, we created our own training datasets for other analyses such as Table 1 and “hg38” and “mm10” in Figure 2A and B). We have modified the manuscript to make that clear (lines 281–282 and lines 298–301).

======Comments======

Line 313, page 18, “λ1 = 5×10–7, and λ2 = 1×10–8” More explanations are needed for setting these two parameters.

=======Response======

These two parameters are set according to a previous paper (Zhou et al 2015). We have added this explanation to the manuscript (line 348).

======Comments======

Line 346, page 19, what is the formula for Si(X0)?

=======Response======

The formula for Si(X0) is too complex to write here, because it is composed of all operations including FRSS, convolution and fully-connected layers. We have added this explanation to the manuscript (lines 386–388).

Attachment

Submitted filename: K_Onimaru_rebuttal_letter.docx

Click here for additional data file.^{(7KB, docx)}

PLoS One. doi: 10.1371/journal.pone.0235748.r003

Decision Letter 1

Vladimir Makarenkov

23 Jun 2020

Predicting gene regulatory regions with a convolutional neural network for processing double-strand genome sequence information

PONE-D-19-33090R1

Dear Dr. Onimaru,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Vladimir Makarenkov

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: (No Response)

Reviewer #2: The authors have carefully addressed given comments raised in the previous round of review by other reviewers. Hence, I think this version can be accepted in this round as is.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS One. doi: 10.1371/journal.pone.0235748.r004

Acceptance letter

Vladimir Makarenkov

1 Jul 2020

PONE-D-19-33090R1

Predicting gene regulatory regions with a convolutional neural network for processing double-strand genome sequence information

Dear Dr. Onimaru:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Vladimir Makarenkov

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Fig. Illustration of how a convolutional neural network reads DNA sequences.

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S2 Fig. AUPRCs of the DeepSEA model.

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S3 Fig. Scatter plots for FRiP and corrected FRiP with AUPRCs (top panels) and peak numbers (bottom panels).

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S4 Fig. The average of AUPRCs with various cutoff values of corrected FRiP.

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S5 Fig. Evaluation of informative sequences at the single-nucleotide level.

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S6 Fig. All kernels of conv4-FRSS trained by a subset of the mouse DNase-seq data.

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S1 Note

(PDF)

Click here for additional data file.^{(2.8MB, pdf)}

S1 Table

(XLSX)

Click here for additional data file.^{(7.9KB, xlsx)}

S2 Table

(XLSX)

Click here for additional data file.^{(25.9KB, xlsx)}

S3 Table

(XLSX)

Click here for additional data file.^{(5.7KB, xlsx)}

S4 Table

(XLSX)

Click here for additional data file.^{(52.9KB, xlsx)}

S5 Table

(XLSX)

Click here for additional data file.^{(9.5KB, xlsx)}

S6 Table

(XLSX)

Click here for additional data file.^{(524.6KB, xlsx)}

S7 Table

(XLSX)

Click here for additional data file.^{(557.7KB, xlsx)}

S8 Table

(XLSX)

Click here for additional data file.^{(119.8KB, xlsx)}

Attachment

Submitted filename: K_Onimaru_rebuttal_letter.docx

Click here for additional data file.^{(7KB, docx)}

Data Availability Statement

[pone.0235748.ref001] 1.Meadows JRS, Lindblad-Toh K. Dissecting evolution and disease using comparative vertebrate genomics. Nat Rev Genet. 2017;18: 624–636. 10.1038/nrg.2017.51 [DOI] [PubMed] [Google Scholar]

[pone.0235748.ref002] 2.Consortium 1000 Genomes Project, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, et al. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491: 56–65. 10.1038/nature11632 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref003] 3.Kratochwil CF, Meyer A. Closing the genotype-phenotype gap: Emerging technologies for evolutionary genetics in ecological model vertebrate systems. BioEssays. 2015;37: 213–226. 10.1002/bies.201400142 [DOI] [PubMed] [Google Scholar]

[pone.0235748.ref004] 4.Yue F, Cheng Y, Breschi A, Vierstra J, Wu W, Ryba T, et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature. 2014;515: 355–364. 10.1038/nature13992 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref005] 5.Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, Snyder M. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489: 57–74. 10.1038/nature11247 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref006] 6.Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, et al. The NIH roadmap epigenomics mapping consortium. Nat. Biotechnol. 2010. pp. 1045–1048. 10.1038/nbt1010-1045 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref007] 7.Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507: 455–461. 10.1038/nature12787 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref008] 8.Kelley DR, Snoek J, Rinn JL. Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26: 990–999. 10.1101/gr.200535.115 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref009] 9.Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods. 2015;12: 931–934. 10.1038/nmeth.3547 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref010] 10.Quang D, Xie X. DanQ: A hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016;44: e107 10.1093/nar/gkw226 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref011] 11.Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33: 831–838. 10.1038/nbt.3300 [DOI] [PubMed] [Google Scholar]

[pone.0235748.ref012] 12.Lecun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521: 436–444. 10.1038/nature14539 [DOI] [PubMed] [Google Scholar]

[pone.0235748.ref013] 13.Ghandi M, Lee D, Mohammad-Noori M, Beer MA. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput Biol. 2014;10 10.1371/journal.pcbi.1003711 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref014] 14.Alain G, Bengio Y. Understanding intermediate layers using linear classifier probes. Preprint at http://arxiv.org/abs/1610.01644. 2016. Available: http://arxiv.org/abs/1610.01644 [Google Scholar]

[pone.0235748.ref015] 15.Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9 10.1186/gb-2008-9-9-r137 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref016] 16.Kelley DR, Snoek J, Rinn JL, Phelan CM, Kuchenbaecker KB, Tyrer JP, et al. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26: 990–999. 10.1101/gr.200535.115 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref017] 17.Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. Proc ICML. 2006. pp. 233–240. 10.1145/1143844.1143874 [DOI] [Google Scholar]

[pone.0235748.ref018] 18.Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, et al. Going deeper with convolutions. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2015;7–12: 1–9. 10.1109/CVPR.2015.7298594 [DOI] [Google Scholar]

[pone.0235748.ref019] 19.Glover JNM, Harrison SC. Crystal structure of the heterodimeric bZIP transcription factor c-Fos–c-Jun bound to DNA. Nature. 1995. pp. 257–261. 10.1038/373257a0 [DOI] [PubMed] [Google Scholar]

[pone.0235748.ref020] 20.Shimizu T, Toumoto A, Ihara K, Shimizu M, Kyogoku Y, Ogawa N, et al. Crystal structure of PHO4 bHLH domain-DNA complex: Flanking base recognition. EMBO J. 1997;16: 4689–4697. 10.1093/emboj/16.15.4689 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref021] 21.Singh S, Yang Y, Poczos B, Ma J. Predicting enhancer-promoter interaction from genomic sequence with deep neural networks. Preprint at https://www.biorxiv.org/content/early/2016/11/02/085241. bioRxiv. 2016. 10.1101/085241 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref022] 22.Min X, Zeng W, Chen S, Chen N, Chen T, Jiang R. Predicting enhancers with deep convolutional neural networks. BMC Bioinformatics. 2017;18 10.1186/s12859-017-1878-3 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref023] 23.Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012. pp. 1813–1831. 10.1101/gr.136184.111 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref024] 24.Grant CE, Bailey TL, Noble WS. FIMO: Scanning for occurrences of a given motif. Bioinformatics. 2011;27: 1017–1018. 10.1093/bioinformatics/btr064 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref025] 25.Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. Preprint at http://arxiv.org/abs/1312.6034. arXiv.org. 2013. Available: http://arxiv.org/abs/1312.6034 [Google Scholar]

[pone.0235748.ref026] 26.Schmidt D, Schwalie PC, Wilson MD, Ballester B, Gonalves Â, Kutter C, et al. Waves of retrotransposon expansion remodel genome organization and CTCF binding in multiple mammalian lineages. Cell. 2012;148: 335–348. 10.1016/j.cell.2011.11.058 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref027] 27.Erhan D, Bengio Y, Courville A, Vincent P. Visualizing higher-layer features of a deep network. Bernoulli. 2009; 1–13. Available: http://igva2012.wikispaces.asu.edu/file/view/Erhan+2009+Visualizing+higher+layer+features+of+a+deep+network.pdf [Google Scholar]

[pone.0235748.ref028] 28.Yosinski J, Clune J, Nguyen A, Fuchs T, Lipson H. Understanding Neural Networks Through Deep Visualization. Preprint at http://arxiv.org/abs/1506.06579. 2015. Available: http://arxiv.org/abs/1506.06579 [Google Scholar]

[pone.0235748.ref029] 29.Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8 10.1186/gb-2007-8-2-r24 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref030] 30.Takahashi K, Yamanaka S. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell. 2006;126: 663–676. 10.1016/j.cell.2006.07.024 [DOI] [PubMed] [Google Scholar]

[pone.0235748.ref031] 31.Ong CT, Corces VG. CTCF: An architectural protein bridging genome topology and function. Nat. Rev. Genet. 2014. pp. 234–246. 10.1038/nrg3663 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref032] 32.Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process. 1997;45: 2673–2681. 10.1109/78.650093 [DOI] [Google Scholar]

[pone.0235748.ref033] 33.Shrikumar A, Greenside P, Kundaje A, Science C. Reverse-complement parameter sharing improves deep learning models for genomics. Preprint at https://www.biorxiv.org/content/early/2017/01/27/103663. bioRxiv. 2017. 10.1101/103812 [DOI] [Google Scholar]

[pone.0235748.ref034] 34.Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26: 841–842. 10.1093/bioinformatics/btq033 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref035] 35.Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Comput. 1997;9: 1735–1780. 10.1162/neco.1997.9.8.1735 [DOI] [PubMed] [Google Scholar]

[pone.0235748.ref036] 36.Hinton GE, Srivastava N, Swersky K. Lecture 6e- rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA Neural Networks Mach Learn. 2012; 26–31. Available: https://goo.gl/RsQeis [Google Scholar]

[pone.0235748.ref037] 37.Kingma DP, Ba JL. Adam: a Method for Stochastic Optimization. Int Conf Learn Represent 2015. 2015; 1–15. http://doi.acm.org.ezproxy.lib.ucf.edu/10.1145/1830483.1830503 [Google Scholar]

[pone.0235748.ref038] 38.Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nat. Biotechnol. 2011. pp. 24–26. 10.1038/nbt.1754 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0235748.ref039] 39.Ramírez F, Ryan DP, Grüning B, Bhardwaj V, Kilpert F, Richter AS, et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 2016;44: W160–W165. 10.1093/nar/gkw257 [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Predicting gene regulatory regions with a convolutional neural network for processing double-strand genome sequence information

Koh Onimaru

Osamu Nishimura

Shigehiro Kuraku

Roles

Abstract

Introduction

Results

Training design and benchmarking

Fig 1. The Forward- and Reverse-Sequence Scan (FRSS) layers and data structures.

Table 1. A performance comparison between models.

Reading forward and reverse-complementary sequences improve prediction accuracy

The length of strides affect training efficiency

Table 2. A performance comparison between data structures.

conv4-FRSS outperforms previous models

Fig 2. Performance analyses of models with extended datasets.

The quality control of training data

The utility of conv4-FRSS

Fig 3. An example of CTCF binding site predictions on the HoxD cluster of mice and dogs.

Visualization of what deep-learning models learn about genome sequences

Fig 4. Visualization of trained kernels and results of the activation maximization method.

Discussions

Materials and methods

Dataset and processing

Design of models and implementation

Training and testing models

The class saliency extraction method

FRiP calculation and correction

Visualizing the inside of a trained model

The activation maximization method

Supporting information

Acknowledgments

Data Availability

Funding Statement

References

Decision Letter 0

Jiajie Peng

Roles

Author response to Decision Letter 0

Decision Letter 1

Vladimir Makarenkov

Roles

Acceptance letter

Vladimir Makarenkov

Roles

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases