The impact of different negative training data on regulatory sequence predictions

S3 Fig. Recall values for regulatory sequence prediction on validation sets of models trained on genomic background sequences.

Each model was trained on a DHS (positive) training dataset and a genomic background (negative) training dataset and tested on a chromosome 21 hold-out validation set. Recall was calculated as a measure of model performance. For each classifier three different negative training sets are compared where the tolerances of differences in GC content composition (t_GC) is varied. Each model was trained on data derived from one cell line. Bars represent the mean of multiple cell lines and technical replicates (n = 7 for gkm-SVM, n = 70 for CNNs: 10 replicates per cell line) while error bars represent the standard deviation.

(PDF)

Click here for additional data file.^{(5.5KB, pdf)}

S4 Fig. AUPRC values or regulatory sequence prediction on validation sets of models trained on genomic background sequences.

Each model was trained on a DHS (positive) training dataset and a genomic background (negative) training dataset and tested on a chromosome 21 hold-out validation set. Area under precision recall curve (AUPRC) was calculated as a measure of model performance. For each classifier three different negative training sets are compared where the tolerances of differences in GC content composition (t_GC) is varied. Each model was trained on data derived from one cell line. Bars represent the mean of multiple cell lines and technical replicates (n = 7 for gkm-SVM, n = 70 for CNNs: 10 replicates per cell line) while error bars represent the standard deviation.

(PDF)

Click here for additional data file.^{(5.5KB, pdf)}

S5 Fig. Recall values for regulatory sequence prediction on validation sets of models trained on shuffled sequences.

Each model was trained on a DHS (positive) training dataset and a k-mer shuffled (negative) training dataset and tested on a chromosome 21 hold-out validation set. Recall was calculated as a measure of model performance. For each classifier seven different negative training sets are compared where the size of preserved k-mers during shuffling is varied. Each model was trained on data derived from one cell line. Bars represent the mean of multiple cell lines and technical replicates (n = 7 for gkm-SVM, n = 70 for CNNs: 10 replicates per cell line) while error bars represent the standard deviation.

(PDF)

S6 Fig. AUPRC values for regulatory sequence prediction on validation sets of models trained on shuffled sequences.

Each model was trained on a DHS (positive) training dataset and a k-mer shuffled (negative) training dataset and tested on a chromosome 21 hold-out validation set. Area under precision recall curve (AUPRC) was calculated as a measure of model performance. For each classifier seven different negative training sets are compared where the size of preserved k-mers during shuffling is varied. Each model was trained on data derived from one cell line. Bars represent the mean of multiple cell lines and technical replicates (n = 7 for gkm-SVM, n = 70 for CNNs: 10 replicates per cell line) while error bars represent the standard deviation.

(PDF)

S7 Fig. Number of transcription factor binding motifs in training sequences.

Known human transcription factor binding site (TFBS) motifs were matched in training sequences of different datasets from different cell lines (n = 7). Bars represent the mean value, error bars the standard deviation.

(PDF)

Click here for additional data file.^{(5.6KB, pdf)}

S8 Fig. Recall values for regulatory sequence prediction.

Models were trained on sequences of DHS regions (positive) with corresponding sets of negative sequences and tested on a chromosome 8 hold-out test set. For each classifier two different negative training sets are compared; sequences were either chosen from genomic background (t_GC = 0.02) or generated by shuffling positive sequences and preserving k-mer counts (k = 2). Recall was calculated to compare model performance. Seven models were trained on data derived for specific cell lines, bars represent the mean and error bars the standard deviations across models. Pairwise comparisons were performed with Wilcoxon signed-rank tests and asterisks represent significance levels (*p<0.05, **p<0.01, ***p<0.001).

(PDF)

Click here for additional data file.^{(5.3KB, pdf)}

S9 Fig. AUPRC values for regulatory sequence prediction on test sets.

Each model was trained on a DHS (positive) training dataset and a set of neutral sequences (negative) and tested on a chromosome 8 hold-out test set. Recall was calculated as a measure of model performance. For each classifier two different negative training sets are compared. Sequences were either chosen from genomic background (t_GC = 0.02) or generated by shuffling positive sequences and preserving k-mer counts (k = 2). Each model was trained on data derived from one cell line. Bars represent the mean of multiple cell lines (n = 7) while error bars represent standard deviations. Pairwise comparisons were performed with Wilcoxon signed-rank test and asterisks represent significance levels (*p<0.05, **p<0.01, ***p<0.001).

(PDF)

Click here for additional data file.^{(5.3KB, pdf)}

S10 Fig. Genomic frequency of 8-mers in different classes of the test sets and the first convolutional layer of the CNN models.

Exemplary for all cell-types, the figure shows results for HeLa-S3. Genomic frequency of 8-mers was extracted across all major human chromosomes and Z-Score transformed (i.e. mean-centered and standard deviation normalized to one). Panel (A) shows the genomic frequency of 8-mers in the test sets split out as DHS sites (orange, positive class), negative genomic background sequences (shades of red, from low to high) and different negative k-mer shuffles (shades of blue, from low to high). Smaller k-mer shuffles contain more rare genomic 8-mers. Panel (B) shows the distribution of the genomic 8-mer frequency for the top 100 sequences for each of 128 kernels in the first convolutional layer for 2conv2norm (left) and 4conv2pool4norm (right) architectures.

(PDF)

Click here for additional data file.^{(323.9KB, pdf)}

S11 Fig

Pairwise sequence overlap in (a) training, (b) validation and (c) test sets. We determined the overlap of merged peak sets across experiments in the same cell-type and across cell-types. For peaks to be considered overlapping between datasets, we required a 70% overlap in their coordinate ranges. We calculated pairwise overlap as number of overlapping peaks divided by the number of peaks in the union of both data sets. Datasets are named according to S1 Table.

(PDF)

Click here for additional data file.^{(10.5KB, pdf)}

S12 Fig. HeLa-S3 model performance for tissue-specific regulatory sequence prediction on validation sets of models trained on genomic background sequences.

Models were trained on DHS sequences (positive) active in HeLa-S3 cells and neutral sequences from genomic background (negative) with varied GC content tolerance (t_GC). Models were tested on DHS sequences specifically active in HeLa-S3 (positive) and DHS sequences active only in one or multiple other cell lines (A549, HepG2, K562, MCF-7) (negative). (A) and (B) show ROC and PR curves for 2conv2norm models, (C) and (D) show ROC and PR curves for 4conv2pool4norm models, (E) and (F) show ROC and PR curves for gkm-SVM models. Corresponding AUROC and AUPRC values are included.

(PDF)

Click here for additional data file.^{(235.3KB, pdf)}

S13 Fig. HeLa-S3 model performance for tissue-specific regulatory sequence prediction on validation sets of models trained on shuffled sequences.

Models were trained on DHS sequences (positive) active in HeLa-S3 cells and neutral sequences from genomic background (negative) with varied size of preserved k-mers. Models were tested on DHS sequences specifically active in HeLa-S3 (positive) and DHS sequences active only in one or multiple other cell lines (A549, HepG2, K562, MCF-7) (negative). (A) and (B) show ROC and PR curves for 2conv2norm models, (C) and (D) show ROC and PR curves for 4conv2pool4norm models, (E) and (F) show ROC and PR curves for gkm-SVM models. Corresponding AUROC and AUPRC values are included.

(PDF)

Click here for additional data file.^{(476KB, pdf)}

S14 Fig. HepG2 model performance for enhancer activity prediction of models trained on genomic background sequences.

Models were trained on HepG2 DHS sequences (positive) and genomic background sequences (negative), where different genomic background sets result from a variation of the GC content tolerance (t_GC). Models were tested on enhancer activity readouts in HepG2 cells [25]. Spearman rank correlation of predicted scores and log2 RNA/DNA ratios was used to evaluate model performance. Error bars represent 95% confidence intervals.

(PDF)

Click here for additional data file.^{(5.7KB, pdf)}

S15 Fig. HepG2 model performance for enhancer activity prediction of models trained on shuffled sequences.

Models were trained on HepG2 DHS sequences (positive) and genomic background sequences (negative), where different genomic background sets result from a variation of the size of preserved k-mers. Models were tested on enhancer activity readouts in HepG2 cells [25]. Spearman rank correlation of predicted scores and log2 RNA/DNA ratios was used to evaluate model performance. Error bars represent 95% confidence intervals.

(PDF)

Click here for additional data file.^{(6.2KB, pdf)}

S16 Fig. Model performance for enhancer activity prediction of A549, HeLa-S3 and MCF-7 models.

Models were trained either on DHS sequences active in A549, HeLa-S3 or MCF-7 cells (positive) and neutral sequences (negative), where different negative sets are composed of genomic background (t_GC = 0.1) or shuffled (k = 3) sequences. Models were tested on activity readouts of enhancer sequences in HepG2 cells [25]. Spearman rank correlation of predicted scores and log2 RNA/DNA ratios was used to evaluate model performance. For 2conv2norm and 4conv2pool4norm bars represent the median of multiple replicates (n = 10) while error bars represent 1^st and 3^rd quartiles. The dashed black line represents a reference value (Spearman’s ρ = 0.276) achieved previously [25].

(PDF)

S17 Fig. Distribution of GC content in sequences of HepG2 training datasets.

The distribution of the sequences’ GC contents in a dataset of active DHS regions in HepG2, three corresponding genomic background datasets with varied GC content tolerance (t_GC) and a set of random 300 bp sequences from the genome is shown.

(PDF)

Click here for additional data file.^{(19.9KB, pdf)}

S1 Table. Overview of DNase-seq datasets.

The number of DHS sequences is given after merging replicates and exclusion of alternative haplotypes, unlocalized genomic contigs and sequences containing non-ATCG bases. The datasets were split up into training, validation (chromosome 21) and test (chromosome 8) sets. The number of samples in these sets are given in the respective columns. Experiment and Replicate IDs are referring to ENCODE accessions.

(PDF)

Click here for additional data file.^{(8.8KB, pdf)}

S2 Table. Overview of tissue-specific validation and test sets.

Tissue-specific positive samples are DHS sequences of one cell line not overlapping with DHS sequences of the other cell lines. In contrast, negative samples are DHS sequences of other cell lines not overlapping with the first cell line. For A549, one dataset was chosen (B, named according to S1 Table). For MCF-7 one dataset was chosen (B, named according to S1 Table). The number of DHS sequences is given after exclusion of alternative haplotypes, unlocalized genomic contigs and sequences containing non-ATCG bases. The validation and test sets contain sequences located on chromosome 21 and 8, respectively.

(PDF)

Click here for additional data file.^{(6.9KB, pdf)}

S3 Table. Layer properties of 4conv2pool4norm network.

The column named ‘Size’ provides the convolutional kernel size, the max-pooling window size, the relative dropout size and the dense layer size depending on information given in column ‘Layer type’.

(PDF)

Click here for additional data file.^{(110.1KB, pdf)}

S4 Table. Layer properties of 2conv2norm network.

(PDF)

Click here for additional data file.^{(109.2KB, pdf)}

S5 Table. 2conv2norm recall for regulatory sequence prediction for different cell lines.

Ten CNN models of the 2conv2norm architecture were trained each on DHS datasets (positive) and corresponding negative sets of k-mer shuffled sequences (k = 2, k = 7) or genomic background sequences (t_GC = 0.02) for A549 or MCF-7 cells. A549 and MCF-7 cell lines are represented in our data with two training datasets each, which are labeled as A and B, respectively. Model performance was evaluated based on recall for hold-out sets (chromosome 8). The table summarizes mean and standard deviation across ten trained models. There are seven different hold-out sets derived from different cell lines and we assess model generalization across cell-types. Datasets are named according to S1 Table. Respective results for the gkm-SVM models are available Table 1, results for CNN models of 4conv2pool4norm architecture are available in S6 Table.

(PDF)

Click here for additional data file.^{(170.6KB, pdf)}

S6 Table. 4conv2pool4norm recall for regulatory sequence prediction for different cell lines.

Ten CNN models of the 4conv2pool4norm architecture were trained each on DHS datasets (positive) and corresponding negative sets of k-mer shuffled sequences (k = 2, k = 7) or genomic background sequences (t_GC = 0.02) for A549 or MCF-7 cells. A549 and MCF-7 cell lines are represented in our data with two training datasets each, which are labeled as A and B, respectively. Model performance was evaluated based on recall for hold-out sets (chromosome 8). The table summarizes mean and standard deviation across ten trained models. There are seven different hold-out sets derived from different cell lines and we assess model generalization across cell-types. Datasets are named according to S1 Table. Respective results for the gkm-SVM models are available Table 1, results for CNN models of 2conv2norm architecture are available in S5 Table.

(PDF)

Click here for additional data file.^{(170.9KB, pdf)}

S7 Table. AUROC values for tissue-specific regulatory sequence prediction on validation sets.

Models were trained on DHS sequences (positive) with corresponding sets of negative sequences and tested on a set of tissue-specific chromosome 21 test set. For each classifier two different negative training sets are compared; sequences were either chosen from genomic background (t_GC = 0.1) or generated by shuffling positive sequences and preserving k-mer counts (k = 7). AUROC value was calculated to compare model performance.

(PDF)

Click here for additional data file.^{(8.2KB, pdf)}

S8 Table. AUPRC values for tissue-specific regulatory sequence prediction on validation sets.

Models were trained on DHS sequences (positive) with corresponding sets of negative sequences and tested on a set of tissue-specific chromosome 21 test set. For each classifier two different negative training sets are compared; sequences were either chosen from genomic background (t_GC = 0.1) or generated by shuffling positive sequences and preserving k-mer counts (k = 7). AUPRC value was calculated to compare model performance.

(PDF)

Click here for additional data file.^{(8.7KB, pdf)}

Acknowledgments

We thank current and previous members of the Kircher group for helpful discussions and suggestions. Specifically, we would also like to acknowledge input from Giorgio Valentini and his lab at Università degli Studi di Milano, as well as Dirk Walther at the University of Potsdam. Computation has been performed on the HPC for Research cluster of the Berlin Institute of Health.

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

The author(s) received no specific funding for this work.

References

1.Gupta RM, Hadaya J, Trehan A, Zekavat SM, Roselli C, Klarin D, et al. A genetic variant associated with five vascular diseases is a distal regulator of Endothelin-1 gene expression. Cell. 2017;170: 522–533. 10.1016/j.cell.2017.06.049 [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Jostins L, Ripke S, Weersma RK, Duerr RH, McGovern DP, Hui KY, et al. Host–microbe interactions have shaped the genetic architecture of inflammatory bowel disease. Nature. 2012;491: 119–124. 10.1038/nature11582 [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Vinagre J, Almeida A, Pópulo H, Batista R, Lyra J, Pinto V, et al. Frequency of TERT promoter mutations in human cancers. Nature Communications. 2013;4: 2185 10.1038/ncomms3185 [DOI] [PubMed] [Google Scholar]
4.Gasperini M, Tome JM, Shendure J. Towards a comprehensive catalogue of validated and target-linked human enhancers. Nature Reviews Genetics. 2020; 1–19. 10.1038/s41576-019-0192-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, et al. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface. 2018;15 10.1098/rsif.2017.0387 [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Michael AK, Grand RS, Isbel L, Cavadini S, Kozicka Z, Kempf G, et al. Mechanisms of OCT4-SOX2 motif readout on nucleosomes. Science. 2020;368: 1460–1465. 10.1126/science.abb0074 [DOI] [PubMed] [Google Scholar]
7.Lerner J, Gomez-Garcia PA, McCarthy RL, Liu Z, Lakadamyali M, Zaret KS. Two-Parameter Mobility Assessments Discriminate Diverse Regulatory Factor Behaviors in Chromatin. Mol Cell. 2020;79: 677–688.e6. 10.1016/j.molcel.2020.05.036 [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Elkon R, Agami R. Characterization of noncoding regulatory DNA in the human genome. Nature Biotechnology. 2017;35: 732–746. 10.1038/nbt.3863 [DOI] [PubMed] [Google Scholar]
9.The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489: 57–74. 10.1038/nature11247 [DOI] [PMC free article] [PubMed] [Google Scholar]
10.ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLOS Biology. 2011;9: e1001046 10.1371/journal.pbio.1001046 [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Liu Y, Fu L, Kaufmann K, Chen D, Chen M. A practical guide for DNase-seq data analysis: from data management to common applications. Briefings in Bioinformatics. 2018; bby057. [DOI] [PubMed] [Google Scholar]
12.Song L, Crawford GE. DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harbor Protocols. 2010;2010: pdb.prot5384. 10.1101/pdb.prot5384 [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Boeva V. Analysis of genomic sequence motifs for deciphering transcription factor binding and transcriptional regulation in eukaryotic cells. Frontiers in Genetics. 2016;7: 24 10.3389/fgene.2016.00024 [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Samee MdAH Bruneau BG, Pollard KS. A de novo shape motif discovery algorithm reveals preferences of transcription factors for DNA shape beyond sequence motifs. Cell Systems. 2019;8: 27–42. 10.1016/j.cels.2018.12.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Tillo D, Hughes TR. G+C content dominates intrinsic nucleosome occupancy. BMC Bioinformatics. 2009;10: 442 10.1186/1471-2105-10-442 [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Beer MA. Predicting enhancer activity and variant impact using gkm-SVM. Human Mutation. 2017;38: 1251–1258. 10.1002/humu.23185 [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Ghandi M, Mohammad-Noori M, Ghareghani N, Lee D, Garraway L, Beer MA. gkmSVM: an R package for gapped-kmer SVM. Bioinformatics. 2016;32: 2205–2207. 10.1093/bioinformatics/btw203 [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, et al. A method to predict the impact of regulatory variants from DNA sequence. Nature Genetics. 2015;47: 955–961. 10.1038/ng.3331 [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Wang M, Tai C, E W, Wei L. DeFine: deep convolutional neural networks accurately quantify intensities of transcription factor-DNA binding and facilitate evaluation of functional non-coding variants. Nucleic Acids Res. 10.1093/nar/gky215 [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Meth. 2015;12: 931–934. 10.1038/nmeth.3547 [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet. 2019;51: 12–18. 10.1038/s41588-018-0295-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology. 2015;33: 831–838. 10.1038/nbt.3300 [DOI] [PubMed] [Google Scholar]
23.Gesell T, Washietl S. Dinucleotide controlled null models for comparative RNA gene prediction. BMC Bioinformatics. 2008;9: 248 10.1186/1471-2105-9-248 [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Reid J, Wernisch L. STEME: A robust, accurate motif finder for large data sets. PLOS ONE. 2014;9: e90735 10.1371/journal.pone.0090735 [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Inoue F, Kircher M, Martin B, Cooper GM, Witten DM, McManus MT, et al. A systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. Genome Res. 2017;27: 38–52. 10.1101/gr.212092.116 [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Pagès H, Aboyoun P, Gentleman R, DebRoy S. Biostrings: Efficient manipulation of biological strings. Bioconductor version: Release (3.11); 2020. 10.18129/B9.bioc.Biostrings [DOI] [Google Scholar]
27.The Bioconductor Dev Team. BSgenome.Hsapiens.UCSC.hg38.masked. Bioconductor; 2017. 10.18129/B9.BIOC.BSGENOME.HSAPIENS.UCSC.HG38.MASKED [DOI] [Google Scholar]
28.Jiang M, Anderson J, Gillespie J, Mayne M. uShuffle: a useful tool for shuffling biological sequences while preserving the k-let counts. BMC Bioinformatics. 2008;9: 192 10.1186/1471-2105-9-192 [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Zeng H, Hashimoto T, Kang DD, Gifford DK. GERV: a statistical method for generative evaluation of regulatory variants for transcription factor binding. Bioinformatics. 2016;32: 490–496. 10.1093/bioinformatics/btv565 [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Zhou T, Shen N, Yang L, Abe N, Horton J, Mann RS, et al. Quantitative modeling of transcription factor binding specificities using DNA shape. Proc Natl Acad Sci USA. 2015;112: 4654–4659. 10.1073/pnas.1422023112 [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Shen Z, Bao W, Huang D-S. Recurrent Neural Network for Predicting Transcription Factor Binding Sites. Sci Rep. 2018;8: 15270 10.1038/s41598-018-33321-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Arvey A, Agius P, Noble WS, Leslie C. Sequence and chromatin determinants of cell-type-specific transcription factor binding. Genome Res. 2012;22: 1723–1734. 10.1101/gr.127712.111 [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Lee D. LS-GKM: a new gkm-SVM for large-scale datasets. Bioinformatics. 2016;32: 2196–2198. 10.1093/bioinformatics/btw142 [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Min X, Zeng W, Chen S, Chen N, Chen T, Jiang R. Predicting enhancers with deep convolutional neural networks. BMC Bioinformatics. 2017;18: 478 10.1186/s12859-017-1878-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv. 2016; 1603.04467.
36.Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv. 2014; 1412.6980.
37.Reddi SJ, Kale S, Kumar S. On the Convergence of Adam and Beyond. International Conference on Learning Representations. 2018.
38.Davis J, Goadrich M. The relationship between precision-recall and ROC curves. Proceedings of the 23rd international conference on Machine learning—ICML ‘06. 2006; 233–240.
39.Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE. 2015;10: e0118432 10.1371/journal.pone.0118432 [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Grau J, Grosse I, Keilwagen J. PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R. Bioinformatics. 2015;31: 2595–2597. 10.1093/bioinformatics/btv153 [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Keilwagen J, Grosse I, Grau J. Area under precision-recall curves for weighted and unweighted data. PLOS ONE. 2014;9: e92209 10.1371/journal.pone.0092209 [DOI] [PMC free article] [PubMed] [Google Scholar]
42.Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12: 77 10.1186/1471-2105-12-77 [DOI] [PMC free article] [PubMed] [Google Scholar]
43.Khan A, Fornes O, Stigliani A, Gheorghe M, Castro-Mondragon JA, van der Lee R, et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Research. 2018;46: D260–D266. 10.1093/nar/gkx1126 [DOI] [PMC free article] [PubMed] [Google Scholar]
44.Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27: 1017–1018. 10.1093/bioinformatics/btr064 [DOI] [PMC free article] [PubMed] [Google Scholar]
45.Charif D, Lobry J. SeqinR 1.0–2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. Structural approaches to sequence evolution: Molecules, networks, populations, Biological and Medical Physics Biomedical Engeneering Springer Verlag; 2007; 207–232. [Google Scholar]
46.Smith RP, Taher L, Patwardhan RP, Kim MJ, Inoue F, Shendure J, et al. Massively parallel decoding of mammalian regulatory sequences supports a flexible organizational model. Nature Genetics. 2013;45: 1021–1028. 10.1038/ng.2713 [DOI] [PMC free article] [PubMed] [Google Scholar]
47.Fenouil R, Cauchy P, Koch F, Descostes N, Cabeza JZ, Innocenti C, et al. CpG islands and GC content dictate nucleosome depletion in a transcription-independent manner at mammalian promoters. Genome Res. 2012;22: 2399–2408. 10.1101/gr.138776.112 [DOI] [PMC free article] [PubMed] [Google Scholar]
48.Lecellier C-H, Wasserman WW, Mathelier A. Human Enhancers Harboring Specific Sequence Composition, Activity, and Genome Organization Are Linked to the Immune Response. Genetics. 2018;209: 1055–1071. 10.1534/genetics.118.301116 [DOI] [PMC free article] [PubMed] [Google Scholar]
49.Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet. 2007;39: 311–318. 10.1038/ng1966 [DOI] [PubMed] [Google Scholar]
50.Nguyen TA, Jones RD, Snavely AR, Pfenning AR, Kirchner R, Hemberg M, et al. High-throughput functional comparison of promoter and enhancer activities. Genome Res. 2016;26: 1023–1033. 10.1101/gr.204834.116 [DOI] [PMC free article] [PubMed] [Google Scholar]
51.Partridge EC, Chhetri SB, Prokop JW, Ramaker RC, Jansen CS, Goh S-T, et al. Occupancy maps of 208 chromatin-associated proteins in one human cell type. Nature. 2020;583: 720–728. 10.1038/s41586-020-2023-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
52.Andersson R, Sandelin A, Danko CG. A unified architecture of transcriptional regulatory elements. Trends Genet. 2015;31: 426–433. 10.1016/j.tig.2015.05.007 [DOI] [PubMed] [Google Scholar]
53.Andersson R, Sandelin A. Determinants of enhancer and promoter activities of regulatory elements. Nat Rev Genet. 2020;21: 71–87. 10.1038/s41576-019-0173-8 [DOI] [PubMed] [Google Scholar]
54.Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, et al. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007;448: 553–560. 10.1038/nature06008 [DOI] [PMC free article] [PubMed] [Google Scholar]
55.Mendenhall EM, Koche RP, Truong T, Zhou VW, Issac B, Chi AS, et al. GC-rich sequence elements recruit PRC2 in mammalian ES cells. PLoS Genet. 2010;6: e1001244 10.1371/journal.pgen.1001244 [DOI] [PMC free article] [PubMed] [Google Scholar]
56.Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507: 455–461. 10.1038/nature12787 [DOI] [PMC free article] [PubMed] [Google Scholar]
57.Kowalczyk MS, Hughes JR, Garrick D, Lynch MD, Sharpe JA, Sloane-Stanley JA, et al. Intragenic enhancers act as alternative promoters. Mol Cell. 2012;45: 447–458. 10.1016/j.molcel.2011.12.021 [DOI] [PubMed] [Google Scholar]
58.Dao LTM, Galindo-Albarrán AO, Castro-Mondragon JA, Andrieu-Soler C, Medina-Rivera A, Souaid C, et al. Genome-wide characterization of mammalian promoters with distal enhancer functions. Nat Genet. 2017;49: 1073–1081. 10.1038/ng.3884 [DOI] [PubMed] [Google Scholar]
59.Chen L, Fish AE, Capra JA. Prediction of gene regulatory enhancers across species reveals evolutionarily conserved sequence properties. PLOS Computational Biology. 2018;14: e1006484 10.1371/journal.pcbi.1006484 [DOI] [PMC free article] [PubMed] [Google Scholar]
60.Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26: 990–999. 10.1101/gr.200535.115 [DOI] [PMC free article] [PubMed] [Google Scholar]
61.Movva R, Greenside P, Marinov GK, Nair S, Shrikumar A, Kundaje A. Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays. PLOS ONE. 2019;14: e0218073 10.1371/journal.pone.0218073 [DOI] [PMC free article] [PubMed] [Google Scholar]
62.Shrikumar A, Prakash E, Kundaje A. GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs. Bioinformatics. 2019;35: i173–i182. 10.1093/bioinformatics/btz322 [DOI] [PMC free article] [PubMed] [Google Scholar]

PLoS One. doi: 10.1371/journal.pone.0237412.r001

Decision Letter 0

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

18 Aug 2020

PONE-D-20-22794

The impact of different negative training data on regulatory sequence predictions

PLOS ONE

Dear Dr. Kircher,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

As you can appreciate from the attached reports, both reviewers agreed that this is an important study, and that is was generally well conducted, but they have made suggestions for improving certain aspects. In particular, we feel that it would be important to address the major points raised by reviewer #2 regarding the mixing of promoters and enhancers in the analyses, and performing cross-cell line tests using a genomic background negative set.

Please submit your revised manuscript by Oct 02 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Miguel Branco

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2.Thank you for stating the following in the Acknowledgments Section of your manuscript:

[This work was supported by the Berlin Institute of Health and Charité – Universitätsmedizin

Berlin. The funder had no involvement in study design; in the collection, analysis and

interpretation of data; in the writing of the report; and in the decision to submit the article for

publication.]

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

[The author(s) received no specific funding for this work.]

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Krutzfeldt et al. investigate 3 machine learning models (2 CNN and one gkm-SVM) on predicting accessible regions and enhancers in human cell lines from DNA sequence alone. In particular, the authors focused on selection of different negative sequences and how that affects the model performs. The authors are able to identify with high accuracy accessible regions of the genome. Nevertheless, their results are relatively poor for cell specific accessible regions and enhancers, which is expected for DNA sequence only models. Furthermore, the authors show that using one model to select the negative set (genomic background) provides better and robust results in most test scenarios. The paper is well written and presents important results. There are some points that authors would need to address before I could recommend this paper for publication:

1. An explanation of CNNs and gkm-SVMs could help readers that are unfamiliar with these machine learning algorithms.

2. Line 66-74, the authors mention that there is a strong link between TF binding and DNA accessibility. While this would be the case for many TFs, it is not generally true (see https://www.sciencedirect.com/science/article/abs/pii/S1097276520303579?via%3Dihub, https://science.sciencemag.org/content/368/6498/1460 or https://www.biorxiv.org/content/10.1101/666446v1.abstract). I think the authors should discuss about these different types of mechanisms.

3. Line 75-85, the DNA sequence can predict TF binding sites, but many TFs bind in a concentration dependent manner (e.g. https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1001290, https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003571

or https://academic.oup.com/nar/article/43/1/84/2903035).

4. “The tolerance for differences in repeat ratio and relative sequence length were set to 0, but the tolerance for differences in GC content was varied for different training datasets (tGC={0.02, 0.05, 0.1}).” – The authors need to explain why this was varied.

5. “The parameter k which indicates the size of the preserved k-mers was varied for different datasets (k=[1,7])” – Again, it the authors need to explain why this was varied.

6. For liver enhancer activity data, have the enhancers from Inoue et al. (2017) been cross validated by another method? For example with STARR-seq from https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1473-6 .

7. “We evaluated models on their respective hold-out and additionally the models trained on shuffled data on hold-out using genomic background sequences as negative sets” – This makes sense after reading and re-reading it a few times but it isn’t immediately clear. I’d recommend finding a way to re-word it.

8. It wasn’t particularly clear exactly what information the authors use as an input. In Min et al. (2016) they use a 300bp sequence as an input. As they are using DeepEnhancer and a smaller network based on the DeepEnhancer architecture I assumed this was the same, but then I wondered how you dealt with merging of positive bins. In the materials and methods section you state that positive DNA-seq datasets were created where: “Multiple technical replicates were merged into one file per experiment, combining overlapping (minimum of 1 bp) or adjacent sequences into a single spanning sequence”. When two regions are merged, is a new centre is calculated between the two, then the input region is classed as 300bp around the new centre?

9. Do you treat a peak within 150bp of the centre of another peak as a separate instance? This peak is not overlapping and is not adjacent, but it would contain some of the same sequence.

10. Citations for some Biocoductor packages are missing. Cite the recommended papers.

11. Line 233: “First models were tested on validation sets to identify best parameters for generating the negative training set based only on recall measures.”. Maybe I misunderstood something, but shouldn’t the training be done on the training set not the validation set?

12. “However, models trained on highly shuffled data perform significantly better than models trained on genomic background data; potentially the result of an improper evaluation on varying compositions of the validation sets using different negative data” – Have the authors compared the compositions of the validation sets to test this?

13. When mentioning Figure S10, A and B are mentioned in the text but not C. C is the Z-score of the larger CNN (which looks much the same as B) but it’s just worth pointing out.

14. Is there generalisation data for the CNNs as well as the Gkm-SVM models? (as in Table 1)

15. For S5 and S6 probably shouldn’t be mentioned until after it is explained why k=7 is used. After the previous section explaining that k=2 was close to optimal it seems strange to jump to the new figure without the explanation that comes further down the paragraph.

16. It isn’t particularly clear how models were trained on enhancers. Are there annotated datasets for those cell lines?

17. As a general comment, it is easier to read Figure S1 rather than S1 Figure.

18. Line 305: I would reference https://doi.org/10.1016/j.tig.2009.08.003 about TF motif size in eukaryotic genomes, which could explain why 7-Kmer would overlap with many TF motifs.

19. K-mer 1 is the best, but authors use K-mer 2. I think the authors should include K-mer 1 in their manuscript in order to allow the reader to judge the performance of the model.

20. Figure 1 needs better explanation. It took me a while to understand the difference between top and bottom panels and red/blue colours.

21. Lines 338-348: needs to be explained better. It was difficult to follow.

22. In Figure S10A, k-mer shuffle plots several blue lines; they look similar with DHS and genomic background except for one outlier. Can authors clarify this?

23. Table 1: The authors need to show a Venn diagram with the overlap of DHS between the different cell lines, so the reader can judge these results. The reason the model generalises so well might be because the DHS in the different cells overlap significantly.

24. Lines 397-404: The authors need to discuss that the reason the model doesn’t perform so well on cell specific data is because while DNA sequence is the same, there something else that controls tissue specific regions. Their method is better at predicting constitutive DHS and enhancers and will suffer when predicting tissue specific enhancer.

25. Authors should include a graph with the run time of the different models to allow quantitative assessment of the model.

26. It is not clear why for different cell types different optimal K-mers are used. Can the authors explain this?

27. Lines 431-437: I assumed that the data is available only for HepG2, but what data is used for K562, A549 and HeLa?

28. The performance to predict enhancers is relatively poor, but within the results of other papers. The authors should make this clear.

Reviewer #2: In the manuscript by Krutzfeldt et al., the authors assess the impact of different negative training sets on regulatory sequence prediction. Three models (one gkm-SVM and two CNN) are trained using DNase hypersensitive sites as positive set and either genomic background or shuffled sequences as negative set. Effects of the choice of background are compared across three tasks: hold-out prediction regulatory sequences, cell type specific hold-out prediction and prediction of enhancer activity (correlation with experimental data). The authors find that shuffled sequences contain rare/artificial 8-mers/motifs which are used by the learners which makes models perform worse on biologically more relevant tasks.

This is a wonderful example of how the definition of ‘background’ (i. e. the baseline you compare whatever you are interested in against) has a large influence on the results – which is a common theme in bioinformatics. Unfortunately, the question which background to choose is often not given enough thought. Therefore, studies like this are very relevant and can be eye opening. With machine learning methods becoming more accessible and more widely used information on what to watch out for is pertinent. The manuscript contains relevant data, the rationale is well explained and methods are sufficiently described. It is written in an easy to follow style.

Major points for consideration

It seems likely that looking at all DHS sites together results in a mix of functional biological elements, presumably dominated by enhancers and promoters. These are likely defined by different sequence features and we could envision a scenario in which promoter activity might be strongly influenced by presence/absence of a TFBS (strong sequence feature), while enhancer activity might be more dependent on histone modifications (weak sequence feature). Thus, it seems possible that the method of negative region definition (shuffling or genomic background) may impact differently on different functional elements. If this were the case, it would affect interpretation of the results at several points throughout the manuscript. The authors mention one example in the context of GC matching (p18, lines 503-516, Figure S16): The distribution of GC content of the DHS sites is clearly bimodal. It seems likely that the less CG rich peak mainly contains enhancers and non-CGI promoters while the GC rich peak is dominated by CGI promoters. Sequence features learned by a model would probably be very different if learned on those three subgroups and could be affected quite differently by choice of negative set. I am wondering if a very crude distinction of DHS sites into promoter distal (putative enhancers) and CGI and non-CGI promoter proximal would result in substantially improved models and how these would be affected by different negative data.

P 15 line 358, Table 1: In previous paragraphs, the authors have convincingly shown that shuffled 2-mers contain rare/artificial motifs and that the learners are affected by these. Here, these models based on shuffled regions are used to compare performance across cell types. However, since part of the learning is based on the distinction between biological and artificial, performance of a model trained on one cell type is expected to be high in another (we assume that sequence composition doesn’t change between cell lines). How does model performance generalise across cell lines when genomic background is used as negative sets? Does this still support the idea that organismal rather than tissue specific regulatory features are predicted?

The point that models are strongly influenced by differences between biological and artificial sequences is really driven home by the data for tissue-specific regulatory region prediction: models trained on highly shuffled sequences are only marginally better than random (or not at all if k=1!). The comparison above therefore also needs to include data for higher k-mers (7-mers to be able to compare with the figures on tissue specific region prediction). I consider the comparison across cell-types for models trained on genomic background and 7-mer shuffles essential.

P16 line 306: Quantitative enhancer activity prediction: As mentioned above the DHS sites are a mix of functional elements, presumably mainly enhancer and promoter sequences. Therefore, the model will learn both enhancer and promoter sequence features, however, the experimental readout is only enhancer activity. This is bound to impact negatively on correlation, so maybe it’s not so surprising that the correlation is so low. The differences in overall correlation when using different parameters (flexibility in GC content and k-mers) are very small and often the Spearman’s rank correlation hovers around 0.1. This is such low correlation and such small changes, that I would be very reluctant to assign meaning to parameters different to before performing better in this task. This should be made clearer in the text. Moreover, it would be much more informative to plot the actual value pairs rather than just the value for the overall correlation. It might even reveal subgroups of value pairs, some with better correlation than others. An idea might be to colour promoter proximal/distal or CGI/non-CGI data points differently to see if patterns emerge.

Why were chromosome 21 and 8 chosen as validation and test sets, and why was this preferred over randomly selected positive and negative regions? Genomes are known to show chromosomal bias for genomic features and potential differences between training and test/validation chromosomes should be assessed, e. g. sequence composition, density of positive and negative regions, GC content, density of repetitive elements etc.

Minor points

P 13, line 257, Frequency distribution of 8-mers. While all other genomic sequences are based on GRCh38, this frequency analysis uses GRCh37 – presumably because it was reused from a previous study. Since these are both mature genome builds, it is probably fair to assume that the distribution doesn’t change significantly, but do the authors have any information on this? Also, the models include the sex chromosomes, while the frequency analysis doesn’t. Does this matter?

P 14 line 314 “potentially the result of an improper evaluation on varying compositions of the validation sets using different negative data.” I don’t understand the meaning of this sentence.

While explanations are in the text, I found Figure 1 quite confusing and clearer labels would help. The legend title should be something like “model trained on” as pink indicates that the model was trained on genomic background and blue indicates that the model was trained on shuffled sequences. Likewise, the labels for top and bottom panels should include more information, e. g. “negative regions in the test set: genomic background”.

P 14 lines 338 to 348 and Figure S10A: In my mind, this figure contains a key message and one could think about moving it into the main figures. It should be indicated which blue curve belongs to which k-mer and a number/percentage of how many 8-mers were excluded (because they weren’t present in the genome) should be given.

P16 line 431: While surely obvious to the authors, I initially found it confusing that the model from the previous publication performed so much better in the experimental activity prediction. It would probably help the reader to very briefly recap that the model from publication 23 was trained on a corresponding set of ChIPped regions rather than DHS sites.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Dr Radu Zabet

Reviewer #2: Yes: Christel Krueger

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Dec 1;15(12):e0237412. doi: 10.1371/journal.pone.0237412.r002

Author response to Decision Letter 0

5 Oct 2020

We provide our responses as a separate PDF of the submission.

Attachment

Submitted filename: reviewer_responses.pdf

Click here for additional data file.^{(467.5KB, pdf)}

PLoS One. doi: 10.1371/journal.pone.0237412.r003

Decision Letter 1

27 Oct 2020

PONE-D-20-22794R1

The impact of different negative training data on regulatory sequence predictions

PLOS ONE

Dear Dr. Kircher,

Thank you for submitting your manuscript to PLOS ONE. Some minor points were raised during the review process that we feel would be important to address. Therefore, we invite you to submit a revised version of the manuscript.

Specifically, given the amount of discussion triggered by the reviewers' comments on the possible impact of not distinguishing between promoters and enhancers, we feel that it would be important for the authors to more extensively present their rationale and views on this matter (during results interpretation and/or in the discussion section). You may also choose to address additional comments from the reviewers.

Please submit your revised manuscript by Dec 11 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Miguel Branco

Academic Editor

PLOS ONE

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: (No Response)

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: The authors addressed majority of the points I raised.

1. Point 3 and 4. We are happy with the changes made by the authors. Nevertheless, the suggested papers to be citated were the minimum and we were expecting the authors to cite those papers we suggested and others.

2. Point 11, the authors explained that part, but did not change the text. Readers might have the same question and I think it is important to be clear.

3. Point 18. The authors mentioned that they do not think the suggested reference is appropriate, but did not provide any explanation for that or any alternative paper.

Reviewer #2: Major Point 1

R: It seems likely that looking at all DHS sites together results in a mix of functional biological elements, presumably dominated by enhancers and promoters. These are likely defined by different sequence features and we could envision a scenario in which promoter activity might be strongly influenced by presence/absence of a TFBS (strong sequence feature), while enhancer activity might be more dependent on histone modifications (weak sequence feature).

A: We appreciate the reviewers' comment and totally agree that these differences may or are even likely to exist. We want to highlight though, that our manuscript focuses on a relative comparison and that we are not attempting to create a superior model for the prediction of open chromatin regions, enhancers or promoters. We use the prediction of open chromatin as a mere example of a relevant prediction task. As mentioned also to the other reviewer, we have not performed an extensive hyperparameter search, nor did we explore multi-task or multi-modal models which are expected to show an increased performance for the mentioned prediction tasks.

R: I completely appreciate that the manuscripts main focus is not on creating the best possible model for open chromatin prediction. The fact remains though that it is possible that background and parameter choice has a different influence depending on what it is that one is trying to predict – and open chromatin regions are clearly a mixed bag. While it may be beyond the scope of this study to look at subgroups of open chromatin regions (this could for example have been done on the first task only to shine some light on how much influence this might have), I find it disappointing that the authors chose to not even discuss this question.

Major point 2 has been addressed.

Other points:

R: P16 line 306: Quantitative enhancer activity prediction: As mentioned above the DHS sites are a mix of functional elements, presumably mainly enhancer and promoter sequences. Therefore, the model will learn both enhancer and promoter sequence features, however, the experimental readout is only enhancer activity. This is bound to impact negatively on correlation, so maybe it’s not so surprising that the correlation is so low. The differences in overall correlation when using different parameters (flexibility in GC content and k-mers) are very small and often the Spearman’s rank correlation hovers around 0.1. This is such low correlation and such small changes, that I would be very reluctant to assign meaning to parameters different to before performing better in this task. This should be made clearer in the text. Moreover, it would be much more informative to plot the actual value pairs rather than just the value for the overall correlation. It might even reveal subgroups of value pairs, some with better correlation than others. An idea might be to colour promoter proximal/distal or CGI/non-CGI data points differently to see if patterns emerge.

A: We assume that the reviewer refers to p18, lines 406ff. We agree with the reviewer that there is a general agreement that Pearson correlations below 0.5 (R2 of 0.25) are not considered predictive models. This is based on a "variance explained by a linear model" argument. However, this does not mean that significant (Spearman) correlations below this value can not be interpreted, or that differences in correlation values have no meaning. Due to the simplicity of how correlations are calculated, they tend to be rather stable in these contexts and the relative performance of models can be evaluated using rank correlations (Spearman rather than Pearson). You can see the low variance observed in calculating Spearman correlation between 10 model training instances in Fig 3. We also like to point out that the publication we are comparing to reports the Spearman correlation of 0.276 (R2 of 0.076). As it is possible to calculate confidence intervals for these values, we have added these confidence levels to Figures S14 and S15 (renamed from S13 and S14 due to the insertion of a supplementary figure). You can see that confidence intervals are different from 0 for most of the model parameters tested and that obtained correlation coefficients, despite being small, are significantly different between certain parameter choices.

R: I agree with the authors that a correlation between prediction and experimental readout does not have to be linear to be useful – I have no problem with using rank correlation here. But I do agree with Reviewer 1 that the correlation is poor (significance is easily reached here because of the numbers) and that this is not made sufficiently clear in the text. Instead, as I mentioned in my last comment, a lot of emphasis is placed on the minute effects of different parameters on the (very) weak correlation. I am fully aware that enhancer predictions based on sequence alone are generally not great, and that making the best possible prediction is not the aim of this study. Again, I find it disappointing that the authors missed the opportunity to include more information on the relationship between prediction and experimental readout – for example like the scatterplot of (ranked) prediction vs experimental readout I suggested previously.

R: Something that I noted when re-reading the manuscript is that at two points in the text relating to Figure 4 it is suggested that gkm-SVM models were trained up to a maximum set size of 350 K, however, the last data point is at 200 K. This should be corrected.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Christel Krueger

PLoS One. 2020 Dec 1;15(12):e0237412. doi: 10.1371/journal.pone.0237412.r004

Author response to Decision Letter 1

10 Nov 2020

With this revision, we have added a section discussing the mixed nature of DNase hypersensitive (DHS) sites in terms of enhancer and promoter composition. This new section discusses classes and subclasses among regulatory regions and that enhancers and promoters generally share many features. We eventually conclude that rather than splitting up data, Deep Neural Network models with specific architectures (i.e. a combination of shared and separate layers) may offer some advantage when modeling DHS data and should be explored in future work. In addition, we argue why our results (rare motifs being learned from shuffled backgrounds and insufficient GC matching) are universal with respect to promoter and enhancer subsets of DHS sites. In response to a discrepancy noticed by reviewer 2, we added a data point previously missing from Figure 4. We attach our detailed responses as a separate document to this submission.

Attachment

Submitted filename: reviewer_responses.pdf

Click here for additional data file.^{(569.9KB, pdf)}

PLoS One. doi: 10.1371/journal.pone.0237412.r005

Decision Letter 2

12 Nov 2020

The impact of different negative training data on regulatory sequence predictions

PONE-D-20-22794R2

Dear Dr. Kircher,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Miguel Branco

Academic Editor

PLOS ONE

PLoS One. doi: 10.1371/journal.pone.0237412.r006

Acceptance letter

17 Nov 2020

PONE-D-20-22794R2

The impact of different negative training data on regulatory sequence predictions

Dear Dr. Kircher:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Miguel Branco

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Fig. Estimated loss on the training and validation sets over training epochs for 2conv2norm models.

(PDF)

Click here for additional data file.^{(7.1KB, pdf)}

S2 Fig. Estimated loss on the training and validation sets over training epochs for 4conv2pool4norm models.

(PDF)

S3 Fig. Recall values for regulatory sequence prediction on validation sets of models trained on genomic background sequences.

(PDF)

Click here for additional data file.^{(5.5KB, pdf)}

S4 Fig. AUPRC values or regulatory sequence prediction on validation sets of models trained on genomic background sequences.

(PDF)

Click here for additional data file.^{(5.5KB, pdf)}

S5 Fig. Recall values for regulatory sequence prediction on validation sets of models trained on shuffled sequences.

(PDF)

S6 Fig. AUPRC values for regulatory sequence prediction on validation sets of models trained on shuffled sequences.

(PDF)

S7 Fig. Number of transcription factor binding motifs in training sequences.

(PDF)

Click here for additional data file.^{(5.6KB, pdf)}

S8 Fig. Recall values for regulatory sequence prediction.

(PDF)

Click here for additional data file.^{(5.3KB, pdf)}

S9 Fig. AUPRC values for regulatory sequence prediction on test sets.

(PDF)

Click here for additional data file.^{(5.3KB, pdf)}

S10 Fig. Genomic frequency of 8-mers in different classes of the test sets and the first convolutional layer of the CNN models.

(PDF)

Click here for additional data file.^{(323.9KB, pdf)}

S11 Fig

(PDF)

Click here for additional data file.^{(10.5KB, pdf)}

S12 Fig. HeLa-S3 model performance for tissue-specific regulatory sequence prediction on validation sets of models trained on genomic background sequences.

(PDF)

Click here for additional data file.^{(235.3KB, pdf)}

S13 Fig. HeLa-S3 model performance for tissue-specific regulatory sequence prediction on validation sets of models trained on shuffled sequences.

(PDF)

Click here for additional data file.^{(476KB, pdf)}

S14 Fig. HepG2 model performance for enhancer activity prediction of models trained on genomic background sequences.

(PDF)

Click here for additional data file.^{(5.7KB, pdf)}

S15 Fig. HepG2 model performance for enhancer activity prediction of models trained on shuffled sequences.

(PDF)

Click here for additional data file.^{(6.2KB, pdf)}

S16 Fig. Model performance for enhancer activity prediction of A549, HeLa-S3 and MCF-7 models.

(PDF)