Skip to main content
BMC Bioinformatics logoLink to BMC Bioinformatics
. 2015 May 1;16:140. doi: 10.1186/s12859-015-0573-5

Optimally choosing PWM motif databases and sequence scanning approaches based on ChIP-seq data

Michal Dabrowski 1, Norbert Dojer 2,, Izabella Krystkowiak 3, Bozena Kaminska 3, Bartek Wilczynski 2
PMCID: PMC4436866  PMID: 25927199

Abstract

Background

For many years now, binding preferences of Transcription Factors have been described by so called motifs, usually mathematically defined by position weight matrices or similar models, for the purpose of predicting potential binding sites. However, despite the availability of thousands of motif models in public and commercial databases, a researcher who wants to use them is left with many competing methods of identifying potential binding sites in a genome of interest and there is little published information regarding the optimality of different choices. Thanks to the availability of large number of different motif models as well as a number of experimental datasets describing actual binding of TFs in hundreds of TF-ChIP-seq pairs, we set out to perform a comprehensive analysis of this matter.

Results

We focus on the task of identifying potential transcription factor binding sites in the human genome. Firstly, we provide a comprehensive comparison of the coverage and quality of models available in different databases, showing that the public databases have comparable TFs coverage and better motif performance than commercial databases. Secondly, we compare different motif scanners showing that, regardless of the database used, the tools developed by the scientific community outperform the commercial tools. Thirdly, we calculate for each motif a detection threshold optimizing the accuracy of prediction. Finally, we provide an in-depth comparison of different methods of choosing thresholds for all motifs a priori. Surprisingly, we show that selecting a common false-positive rate gives results that are the least biased by the information content of the motif and therefore most uniformly accurate.

Conclusion

We provide a guide for researchers working with transcription factor motifs. It is supplemented with detailed results of the analysis and the benchmark datasets at http://bioputer.mimuw.edu.pl/papers/motifs/.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0573-5) contains supplementary material, which is available to authorized users.

Keywords: Transcription factor binding, Binding site, Sequence motif, Motif database

Background

Transcription regulation is one of the key processes that allow cells to react to environmental cues and differentiate. It would not be possible without the specific interactions between DNA-binding proteins called transcription factors (TFs) and specific target regulatory elements.

While TFs exhibit some specificity in choosing their target binding sites, this specificity is imperfect i.e. instead of binding a single specific DNA sequence a typical TF recognizes a number of similar DNA fragments with varying affinity. A pattern describing these fragments is usually called a binding motif, and mathematically defined with position weight matrices for almost 30 years now [1]. Importantly, the position weight matrix model makes a number of simplifying assumptions to make the model useful in practice: in particular it assumes independence between columns and additivity of the column scores. As the cost of this simplification, there comes a great advantage of being able to easily score any DNA sequence of the specified length with a log-odds measure, usually interpreted as a rough analog of the free energy of TF-DNA binding.

There has been a number of suggested extensions of the PWM model including Bayesian Markovian models [2] and so-called PBM-motifs [3] using mixtures of simpler models to alleviate the problems associated with the simplifying assumptions of the PWM model. More recently, there are also published approaches that combine the traditional PWM models with additional constraints on the physical properties of the DNA strand to be bound by a Transcription factor [4]. However, while these approaches can be useful in a few situations, it has been shown that in the vast majority of real biological cases, the benefits of the simple PWM model outweigh the potential of the more complex models to give us a slightly better description of the bindind site [5].

Given this data and the fact that PWM models are used for three decades in practically unchanged form, attests to their tremendous applicability. However, in order to use such a model to determine potential binding sites of a TF with a known motif, one needs to choose a log-odds threshold to separate between scores high enough to facilitate binding event and the non-specific sequences. Choosing such a threshold value might be guided by a natural interpretation of the log-odds score. However, already the authors of the early motif scanning tool PATSER [6] have realized that, due to the dependency of the log-odds distribution on the information content (IC) of a motif [1], the log-odds thresholds for different motifs should be different. Later, Rahmann et al. showed in [7] that there are more meaningful ways of selecting the TF binding threshold controlling for type I, type II errors or for a certain balance between them.

At the time, the number of TF motifs was limited and there were very few large scale datasets to serve as a golden standard of TF binding. Since then, the community has accumulated hundreds TF motifs in several databases [8-16] and the large-scale ChIP-seq projects such as ENCODE [17] have provided us with dozens of ChIP-seq datasets for different TFs. This wealth of data allows us now to revisit the different ways of selecting the threshold and put it in the context of different motif databases with respect to their coverage and accuracy in order to find the optimal choices for practical applications.

Results

Comparison of motif scanners

Comparsion of performance of motif databases should preferably be performed with the same motif scanning program, to separate the effect of the database from effects of different scanning programs. However, the two commercial motif databases: Transfac (Biobase) and MatBase (Genomatix); are each provided with a dedicated scanning program: Match [18] and MatInspector [19], respectively, using proprietary thresholds files. Therefore, as a preliminary step, we compared the performance of these dedicated programs to the performance of two motif scanners available in the public domain: matrix-scan [20] and Bio.Motif [21,22]. Both public scanners use background model-derived thresholds.

As the comparison metrics we used specificity and sensitivity, with ChIP-seq peaks for human TFs from Ensembl v.60 funcgen as the positive sets, and either third exons or genomics flanks of ChIP-seq peaks as the negative sets. The specificity and sensitivity were computed for every TF-motif pair, i.e. a pair of a TF and a related TFBS motif, and then averaged for a given database and scanning program.

Each commercial scanner was compared to the two public scanners using all motifs from its respective proprietary motif database. That is, we compared MatInspector to matrix-scan and Bio.Motif using 210 motifs from MatBase (v.8.3), representing 37 funcgen TFs; while Match was compared to the same two scanners using 106 motifs from Transfac (2010.3), representing 33 funcgen TFs. In this way, each commercial scanner could use its proprietary thresholds file. We also repeated the analysis on 32 funcgen TFs common to both motif databases, which gave nearly identical results (data not shown). For MatInspector/MatBase, in addition to individual motifs (matrices) we also used Genomatix-defined motif families (matrix families) [23].

As a summary of each scanner’s performance, we present the average specificity and sensitivity over all TF-motif pairs, together with their standard deviations. These values are plotted in Figure 1. To assess trade-offs between specificity and sensitivity, we used balanced accuracy (BA), which is their average.

Figure 1.

Figure 1

Comparison of performance of the dedicated commercial and public scanners. Shown are the average specificity and sensitivity +/−SD, for each tested database/scanner. MatIspector (A, C) or Match (B, D) were each separately compared to both matrix-scan and Bio.Motif; with either 3-rd exons (A, B) or flanks of the ChIP-seq peaks (C, D) used as the negative datasets. The color encodes scanners: matrix-scan (red), Bio.Motif (magenta), Match (green), MatInspector (blue). Stright lines through the points of average performance are the lines of equal balanced accuracy. Gray ovals in A, D mark the performance obtained with Genomatix motif families.

The results for individual motifs when using exons as the negative set were straightforward: the two public scanners achieved higher sensitivity then the commercial scanners, while maintaining the same uniformly high specificity (Figure 1A, B). With flanks of ChIP-seq peaks as the negative sets, the situation was similar for MatInspector used with individual motifs (Figure 1C), whereas for Match, as compared to the public scanners, there was some trade-off between specificity and sensitivity (Figure 1D), with BA higher for the two public scanners. On both types of negative sets, the use of motif families resulted in a large increase of sensitivity and balanced accuracy (Figure 1A,C), accompanied by a marked decrease of specificity.

We conclude that on individual motifs the performance of the two public scanners as measured by balanced accuracy was clearly better than of either of the two commercial scanners. When the proprietary motif families were used, MatInspectors achieved nearly the same balanced accuracy as the public scanners. However, the use of motif families resulted in a large decrease of specificity, which we consider undesirable in whole-genome applications.

For each database and type of negative set, we computed BAs for each pair of a TF and a related motif, and then we tested if BAs are significantly different between the scanners. The results (Table 1) indicated that the use of either public scanner resulted in significantly higher BA than of the respective dedicated scanner. Out of the two public scanners, we decided to use Bio.Motif for further work, because of its palette of threshold selection methods.

Table 1.

Significance of differences between balanced accuracies for public and commercial motif scanners

Motif database Hypothesis tested P-value for
negative dataset:
Exon 3 Peaks’ flanks
Transfac match vs matrix-scan 5.04e-41 1.13e-15
match vs Bio.Motif 3.37e-32 1.44e-12
MatBase MatInspector vs matrix-scan 3.98e-83 9.88e-43
MatInspector vs Bio.Motif 3.50e-67 8.74e-36
MatBase – families MatInspector vs matrix-scan 0.00011 0.00565
MatInspector vs Bio.Motif 0.01066 0.03784

Entries contain p-values of the Wilcoxon rank test for the null hypothesis that the BA for the 1st scanner is not lower than the BA for the 2nd scanner.

Comparison of databases coverage

In addition to the three long-established motif databases (Transfac, MatBase, and Jaspar), a number of new motif databases have recently been published (Table 2). These include: HOCOMOCO [14], SwissRegulon [15] and HT-SELEX [13]. The Jaspar database has recently been expanded and updated [16]. We included these new databases, alongside the current versions of the three long-established ones, into the analysis of the databases coverage, and into more in-depth, threshold independent analysis of databases performance.

Table 2.

Motif databases compared in the current study

Database Number of Status Link Publication
matrices
(vertebrate)
HOCOMOCO v.9 (2013) 426 public http://autosome.ru/HOCOMOCO/ [14]
Jaspar vertebrates (2014) 821 public http://jaspar.genereg.net/ [16]
HT-SELEX (2013) 820 public http://www.sciencedirect.com/science/ article/pii/S0092867412014961?via=ihub [13]
SwissRegulon (2013) 190 public http://swissregulon.unibas.ch [15]
TRANSFAC Professional 2013.1 1435 commercial http://www.biobase-international.com [9]
MatBase v.9.0 (2012) 907 commercial http://www.genomatix.de/ [23]

We first set up to compare the number of TFs represented by each database. While this is in principle a straightforward task, some care is needed due to orthology and changing gene symbols issues. Here we define the database coverage as the number of represented TFs in the human species, identified by their Entrez Gene ID. The number of distinct TFs was counted for each database (Figure 2A), for the union of all the public databases, and for the intersections between this union, Transfac, and MatBase (Figure 2B). The total number of human TFs (distinct Entrez Gene ID) assigned to the union of the public databases (710) was greater than the number of TFs assigned to the vertebrate section of Transfac Professional (551), but smaller than the number of human TF represented in the MatBase (802), with majority (493) TFs represented in each of the of the three sets (union of the public databases, Transfac, MatBase), with 65 TFs unique to the public databases, 18 TFs unique to Transfac, and 151 TFs unique to MatBase. The precise coverages change continuously between the releases, but it is notable that, at the time-point of release of the analyzed versions, the cumulative TF coverage in the public domain exceeded that of Transfac (but not MatBase), even though none of the public databases achieved this alone. In addition to the above comparisons for all human TFs, we performed a similar analysis for the subset of the 81 human TFs represented by ChIP-seq data in funcgen v.71. Out of 81 TFs represented in funcgen, 60 were represented in the union of the public databases, 59 in Transfac, and 63 in MatBase, with 53 TFs represented in each of the three sets (Figure 2C).

Figure 2.

Figure 2

Comparsion of coverage of human TFs by motif databases.A. The numbers of distinct genes (Entrez Gene ID) assigned to all the vertebrate motifs from the indicated databases. For MatBase the number of TFs as provided by Genomatix is represented. B. The Venn diagram showing the overlap between human TF genes represented in the union of all the public databases and in the Transfac database. C. Similar as in B, but for human 81 human TFs represented in Ensembl 71 funcgen is based on MatBase v.9.0.

Comparison of databases quality

Every motif matrix defines a log-likelihood function that discriminates between true and false binding sites. Motif scanners use matrices supplied with thresholds separating positive and negative predictions.

In order to characterize the overall quality of motif matrices, we analyzed their behavior for the whole spectrum of possible threshold values. To this aim we generated their receiver operating characteristic (ROC) curves and calculated area under the curve (AUC) for each ROC. Expected AUC value for random scoring function is 0.5 and AUC for function perfectly discriminating true and false predictions is 1.

We computed ROC and AUC for all motifs in all databases with respect to 4 datasets of negative sequences: third exons, genomic flanks of ChIP-seq peaks, random sequences with dinucleotide composition the same as in ChIP-seq peaks, and sequences generated by 3rd order Markov chains learned on peak sequences (for details see Methods section). Below we focus our analysis on AUC statistics; full results, including all ROC curves, are presented in Supplementary Materials at the authors’ website http://bioputer.mimuw.edu.pl/papers/motifs/.

Figure 3 presents AUC distributions of motif databases (for motifs having many related TFs the one that gives the highest AUC was selected). The distributions depend on the choice of the negative dataset. For example, 3rd exons are probably the least contaminated by accidental motif occurrences and yield highest AUCs (medians are around 0.8). On the other hand, the lowest AUCs are obtained for 3rd order Markov chains (medians between 0.6 and 0.7), because the high order of a Markov chain increases the chance of generating longer fragments of original sequences.

Figure 3.

Figure 3

AUC distributions in motif databases. Consecutive plots present distributions of AUC calculated with respect to various negative datasets, as indicated by plots’ titles. For each motif the best related TF was selected.

However, if one focuses on comparing AUC distributions between databases, all negative datasets yield similar picture. In all cases the highest and the lowest AUC belong to Hocomoco and MatBase motifs, respectively. The union of public databases always perform visibly better than both MatBase and Transfac. Choosing the best transcription factor for a motif favors databases having one motif for each TF over databases offering for each TF a large family of motifs. The second strategy is preferred by commercial databases (e.g. in the case of MatBase 63 TFs are represented by 377 motifs), which may to some degree explain their performance in this evaluation.

Therefore, we complement it with the analysis of AUC for 53 transcription factors represented in the funcgen ChIP-seq data and in each of the following datasets: MatBase, Transfac and the union of public databases. For each transcription factor and each database, the motif that gives the highest AUC is selected. Table 3 presents results for the negative dataset consisting of ChIP-seq peaks’ flanks (see Additional file 1 for analogous results for other negative datasets). As expected, the highest number of best motifs belongs to MatBase (19), but Jaspar has almost as many of best motifs (i.e. 16). The lowest numbers of best motifs have SwissRegulon (1) and Transfac (4). For over half of the TFs (30 out of 53) the best motif belongs to one of the public databases. AUC for motifs corresponding to the same TF are usually similar, but in some cases, for example ZBTB33, the differences are tremendous.

Table 3.

Comparison of optimal motifs for TFs common to MatBase, Transfac and the union of public databases

ENCODE/funcgen MatBase Transfac Prof. Jaspar vert. Hocomoco HT-Selex SwissRegulon
TF name AUC matrix name AUC matrix name AUC matrix name AUC matrix name AUC matrix name AUC matrix name
ATF3 0.710 V$CREB.02 0.693 M00981 0.694 M00015
Ap2alpha 0.809 V$AP2.02 0.807 M01045 0.816 MA0003.2 0.777 M00004 0.779 selex292 0.754 TFAP2A,C.p2
Ap2gamma 0.799 V$AP2.02 0.765 M00470 0.792 MA0524.1 0.781 M00006 0.784 selex298 0.751 TFAP2A,C.p2
BHLHE40 0.947 V$BHLHB2.01 0.799 M01034 0.885 MA0464.1 0.955 M00022 0.958 selex316 0.917 ARNT_ARNT2_BHLHB2_
MAX_MYC_USF1.p2
CTCF 0.929 V$CTCF.04 0.931 M01259 0.942 MA0139.1 0.940 M00045 0.922 selex2 0.934 CTCF.p2
Cfos 0.751 V$AP1.01 0.759 M00517 0.744 MA0476.1 0.760 M00093 0.742 FOS_FOSB,L1_JUNB,D.p2
Cjun 0.814 V$AP1.01 0.820 M00925 0.799 MA0099.1 0.827 M00183 0.570 JUN.p2
Cmyc 0.700 V$CMYC.01 0.685 M00322 0.702 MA0147.1 0.690 M00216 0.659 ARNT_ARNT2_BHLHB2_
MAX_MYC_USF1.p2
E2F1 0.802 V$E2F3.01 0.764 M00938 0.753 MA0024.2 0.694 M00052 0.723 selex750 0.674 E2F1..5.p2
E2F4 0.682 V$E2F4.01 0.678 M00920 0.689 MA0470.1 0.650 M00055 0.502 selex753 0.654 E2F1..5.p2
E2F6 0.698 V$E2F4.01 0.435 M01252 0.653 MA0471.1 0.681 M00057
EBF 0.740 V$EBF1.01 0.736 M01871 0.721 MA0154.2 0.746 M00037 0.751 selex79 0.692 EBF1.p2
ELF1 0.862 V$ELK1.03 0.835 M02053 0.800 MA0473.1 0.832 M00065 0.836 selex81 0.797 ELF1,2,4.p2
ETS1 0.765 V$ELK3.01 0.753 M02063 0.680 MA0098.1 0.708 M00082 0.765 selex100 0.683 ETS1,2.p2
Egr1 0.831 V$EGR1.01 0.848 M01972 0.822 PB0010.1 0.844 M00060 0.853 selex3 0.841 EGR1..3.p2
FOSL1 0.885 V$AP1.02 0.890 M00517 0.890 MA0477.1 0.865 M00091 0.881 FOS_FOSB,L1_JUNB,D.p2
FOSL2 0.877 V$AP1.01 0.870 M00925 0.865 MA0478.1 0.885 M00092 0.852 FOSL2.p2
FOXA1 0.763 V$FREAC4.01 0.826 M01261 0.832 MA0148.3 0.808 M00094
FOXA2 0.759 V$FREAC4.01 0.735 M02014 0.834 MA0047.2 0.816 M00095 0.834 FOXA2.p3
Gabp 0.873 V$ELK1.03 0.867 M02074 0.879 MA0062.2 0.876 M00116 0.871 selex116 0.870 ELK1,4_GABPA,B1.p3
Gata1 0.711 V$GATA5.01 0.685 M00203 0.683 MA0035.3 0.697 M00117 0.479 GATA1..3.p2
Gata2 0.853 V$GATA2.03 0.834 M00789 0.834 MA0036.2 0.843 M00118 0.538 GATA1..3.p2
HNF4A 0.801 V$HNF4.01 0.838 M02220 0.847 MA0114.2 0.850 M00147 0.837 selex673 0.809 HNF4A_NR2F1,2.p2
HNF4G 0.864 V$HNF4.01 0.811 M00764 0.898 MA0484.1 0.788 M00148
IRF4 0.669 V$ISRE.01 0.641 M00772 0.648 PB0034.1 0.603 M00174 0.665 selex148
Junb 0.912 V$AP1.01 0.912 M00925 0.920 MA0490.1 0.911 M00181 0.900 FOS_FOSB,L1_JUNB,D.p2
Jund 0.820 V$AP1.01 0.823 M00925 0.817 MA0491.1 0.827 M00182 0.805 FOS_FOSB,L1_JUNB,D.p2
MEF2A 0.643 V$MEF2.02 0.650 M00231 0.653 MA0052.2 0.616 M00204 0.615 selex156 0.604 MEF2A,B,C,D.p2
MEF2C 0.721 V$MEF2.02 0.682 M00941 0.719 MA0497.1 0.664 M00205 0.669 MEF2A,B,C,D.p2
Max 0.738 V$CMYC.01 0.703 M00322 0.700 PB0043.1 0.720 M00199 0.730 selex326 0.711 ARNT_ARNT2_BHLHB2_
MAX_MYC_USF1.p2
NFKB 0.896 V$NFKAPPAB65.02 0.891 M00774 0.878 MA0105.3 0.872 M00235 0.776 selex189 0.861 NFKB1_REL_RELA.p2
NR4A1 0.512 V$NBRE.01 0.492 M01217 0.542 M00259
Nanog 0.560 V$HOXA2.01 0.631 M01247 0.556 M00221 0.630 NANOGmouse.p2
Nfe2 0.855 V$NFE2.01 0.846 M00037 0.877 MA0501.1 0.882 M00231 0.771 selex392 0.835 NFE2.p2
Nrf1 0.951 V$NRF1.01 0.969 M00652 0.963 MA0506.1 0.973 M00264 0.977 selex194 0.968 NRF1.p2
Nrsf 0.838 V$NRSF.02 0.879 M01256 0.850 MA0138.2 0.854 M00316 0.847 REST.p3
POU2F2 0.513 V$OCT1.02 0.498 M00210 0.481 MA0507.1 0.504 M00290 0.503 selex232 0.503 POU2F1..3.p2
POU5F1 0.868 V$OCT3_4.02 0.857 M01125 0.881 MA0142.1 0.874 M00294 0.857 POU5F1_SOX2dimer.p2
PU1 0.932 V$SPI1.05 0.884 M01203 0.914 MA0080.3 0.922 M00350 0.860 selex123 0.884 SPI1.p2
Pax5 0.613 V$PAX5.01 0.613 M00143 0.713 MA0014.2 0.729 M00274 0.768 selex200 0.606 PAX5.p2
Pbx3 0.739 V$PBX1_MEIS1.01 0.546 M00998 0.758 M00280
RXRA 0.714 V$PPARG.03 0.608 M02272 0.707 MA0065.1 0.693 M00326 0.731 selex710 0.720 RXRG_dimer.p3
SP1 0.559 V$SP1.03 0.552 M00932 0.561 MA0079.3 0.555 M00346 0.547 selex29 0.551 SP1.p2
SP2 0.711 V$SP4.01 0.719 M01783 0.726 MA0516.1 0.676 M00347
Srf 0.681 V$SRF.05 0.693 M00186 0.661 MA0083.1 0.657 M00355 0.657 selex159 0.656 SRF.p3
Tcf12 0.723 V$ASCL2.01 0.679 M00698 0.712 MA0521.1 0.703 M00152 0.574 TAL1_TCF3,4,12.p2
Tr4 0.601 V$HNF4.01 0.644 M01776 0.652 MA0504.1 0.623 M00256 0.611 selex676
USF1 0.947 V$USF1.02 0.936 M00121 0.903 MA0093.2 0.945 M00396 0.935 selex352 0.932 ARNT_ARNT2_BHLHB2_
MAX_MYC_USF1.p2
Yy1 0.778 V$YY1.03 0.723 M02044 0.713 MA0095.2 0.735 M00394 0.756 selex33 0.657 YY1.p2
ZBTB33 0.489 V$KAISO.01 0.517 M01119 0.881 MA0527.1 0.749 M00184
ZBTB7A 0.699 V$ZF9.01 0.682 M01100 0.640 M00404 0.632 selex37
ZEB1 0.766 V$ZEB1.01 0.689 M00414 0.752 MA0103.2 0.686 M00409 0.734 ZEB1.p2
Znf263 0.685 V$ZNF263.01 0.762 M01587 0.653 MA0528.1

For each TF the motif with the highest AUC from each database is presented. The best motifs from all databases and the corresponding AUC are bolded (note that for FOXA2 motifs from Jaspar and SwissRegulon are both optimal). AUC are calculated with respect to negative sequences composed of flanks of ChIP-seq peaks.

Selection of log-odds thresholds

As was mentioned previously, motif matrices define log-likelihood functions that should be supplied with thresholds separating positive and negative predictions. Selection of such a threshold determines the balance between prediction sensitivity and specificity. Usually, a reasonable solution is to maximize balanced accuracy, i.e. the mean of sensitivity and specificity.

Therefore we calculated maximal balanced accuracies and corresponding thresholds for all transcription factors represented in our benchmark dataset and related motifs. Results are presented in Additional file 2.

We also decided to examine generic threshold selection approaches, i.e. methods setting the threshold on the basis the motif matrix only. They have considerable advantages – simplicity (calculations are much easier) and wide applicability (no benchmark dataset is required).

Probably the simplest generic approach is to set a common log-likelihood threshold for all motifs. Unfortunately, log-likelihood distributions substantially vary across motifs and the same threshold value may result in underestimating occurrences for one motif and overestimating them for the other. Therefore several score distribution based approaches for threshold selection were proposed. In the current study we analyze 3 representative methods for threshold selection implemented in the Bio.Motif package: FPR, FNR and balanced (see 3 for details).

It should be noted that all these approaches (including pure log-likelihood score) are parameterized and in each case the parameter enables the user to select any sensitivity-specificity configuration obtainable for a given motif. In particular, each method allows optimizing balanced accuracy for individual motifs. Therefore the point in which the approaches differ lies in the ability to select the threshold consistently across motifs.

Figure 4 presents the relation between balanced accuracy (calculated with respect to negative sequences composed of flanks of ChIP-seq peaks; for other negative datasets see Additional file 1)

Figure 4.

Figure 4

Balanced accuracies for various approaches to threshold selection. Top row: balanced accuracy vs threshold parameter. Colors represent motif information content: from blue (low), through green and yellow to beige (high). Vertical black lines indicate optimal thresholds, black circles indicate corresponding average balanced accuracies. Bottom row shows how (sub-)optimal parameter values of a motif (X-axis) depends on its information content. For each motif, a circle represents parameter value yielding maximal balanced accuracy and a horizontal line represents a parameter range, for which BA is at least 95% of the maximum. Colors represent motif AUC: from green (low), through yellow to red (high). Balanced accuracies are calculated with respect to negative sequences composed of flanks of ChIP-seq peaks.

and the parameter of the selection method for all motifs in our benchmarking dataset. In order to avoid noise introduced by poor motifs, we restricted our attention to matrices having AUC>0.6. Extreme parameters yield extreme values of sensitivity and specificity (one equal to 0 and the other to 1), resulting in the balanced accuracy equal to 0.5. Therefore BA is maximized for intermediate parameters for every motif. The parameter values giving the maximum of average BA across all motifs is indicated by vertical black lines and listed in Table 4. The highest average BA is obtained with FPR approach, since its BA profiles are most consistent – the bulk of motifs gain near-maximal BA for α∈[10−4;10−3](see 3 section for the definition of α). This is even more evident when motifs with lower AUC are excluded (see Figure 5). On the other hand, for the rest of approaches the location of BA peak seems to be strongly correlated with the information content (IC) of a motif (see Figure 4, bottom). This observation suggests that the prediction accuracy may be improved by using threshold parameters dependent on IC and other motif features (such method still might be used for any motif, without the need of a benchmark dataset).

Table 4.

Average balanced accuracy for optimal threshold parameters

Threshold selection Average BA for parameter: Optimal constant parameter for negative dataset:
method
Optimal constant Fitted with linear model Peaks’ flanks 3rd exons Shuffled peaks 3rd order MCs
Log-likelihood 0.6704 0.6724 8.9 8 9 9.9
FPR 0.6749 0.6757 0.00025 0.00041 0.00021 0.00013
FNR 0.6285 0.6641 0.26 0.18 0.27 0.32
Balanced 0.6664 0.6743 794 457 977 2089

Only motifs with AUC>0.6 are considered. Balanced accuracies are calculated with respect to negative sequences composed of flanks of ChIP-seq peaks. 2nd column contains average balanced accuracies for thresholds in 4th column. 3rd column contains average balanced accuracies for thresholds linearly depending on the motif features: information content, log-likelihood value range, log-likelihood standard deviation.

Figure 5.

Figure 5

Balanced accuracy versus the FPR threshold for various AUC > 0.6, AUC > 0.7, AUC > 0.8, AUC > 0.9, cutoffs. Colors etc. as on Figure 4, top row.

Therefore we selected several characteristics of the log-likelihood distribution under the background model (i.e. the theoretical distribution of the log-likelihood function of sequences randomly generated from the background model). Namely, we considered expected value (which equals to the motif IC), standard deviation, minimum and maximum. We fitted a linear regression model describing optimal threshold parameters by these characteristics.

The averages of BA (calculated with respect to negative sequences composed of flanks of ChIP-seq peaks) for constant parameters and parameters predicted from motif features are compared in Table 4. As expected, linear model visibly improved average BAs for methods with worse result for constant parameters, narrowing the range of average BAs to interval [0.6641;0.6757]. However, the average BA 0.6749 for constant FPR threshold was not reached by any of the other methods. To have the proper perspective to assess these differences, the reader should be aware of the distance between these values and the upper bound of the average BA for considered motifs. The ideal method would select for each motif a threshold yielding its optimal balanced accuracy. Consequently, it would reach a result equal to the average of BA maxima, which is 0.6949 for our dataset. Thus setting individual thresholds for each motif may improve the average BA by 0.02, when compared to thresholds globally set with FPR parameter α=0.0003.

Estimated optimal parameters obviously depend on the choice of the negative dataset (see the last 4 columns of Table 4), but the dispersions are relatively small: the log-likelihood range is less 2 and the ratios between the extremes for other parameters (the logarithmic scale is more suitable here) are ∼3. Moreover, the variability of optimal parameters is consistently explained by the characteristics of negative datasets - the lower are AUC for a dataset, the sharper is the parameter.

Discussion and conclusion

We have aimed to analyze the coverage and accuracy of different motif databases as well as the optimality of different approaches to motif scanning.

As one of the key advantages of the public resources is the possibility of freely combining them, we included into our analysis as an important category the union of the public databases.

In respect of the coverage of human TFs the two commercial databases maintain their lead as compared to each of the public databases separately (Figure 2A). However, the union of the public databases achieves coverage comparable to that of the commercial databases (Figure 2B).

In respect of motifs’ performance as judged by the distributions of the AUC, all the compared databases produce broadly similar results, with the median AUC higher for the union of the public databases than for either commercial one (Figure 3). At the single database level, Hocomoco was consistently the best performer for all choices of the negative datasets. Relative performance of some of the databases, in particular Jaspar and SwissRegulon, depended on the choice of the negative dataset. Jaspar performed better on peaks’ flanks and negative datasets generated from Markov models, while SwissRegulon performed better on third exons.

As another measure of motifs’ performance, we identified which database supplied the best-performing motif (Table 3) for every TF represented across all the databases. In more than half of the cases, the best-performing motif originated from the public databases. At the single database level, the largest number of best motifs was supplied by MatBase, but with Jaspar and Hocomoco at the second and third position supplying together more best motifs than MatBase.

In conclusion, public databases together match the coverage and the quality of their commercial counterparts, with no cost and limits of use imposed by the latter. On the other hand, an overhead associated with combining several public resources (unifying formats, installing updates, etc.) and differences in content other than the PWMs, make the choice public vs commercial not always obvious.

Given a sizeable number of TF motifs available only in one of the databases, a researcher going to predict binding of a particular TF needs a guide that helps to find the most appropriate public database. Our findings provide such a guide for a number of human TFs supported by reliable experimental data. Moreover, the usability of such guides may increase in a near future, when one can expect that the contents of several public databases will be available from a single server.

In terms of the performance of different scanning tools, publicly available mature software packages have exceeded the commercial tools in accuracy. While this might be at least in part a result of the commercial providers reluctance to change the behavior of the tools their customers depend on, we would advise researchers to use the publicly available tools.

With respect to the choice of the optimal threshold for multiple motifs, the method based on controlling the false positive rate is clearly the one least biased by the information content of the motif and therefore the most consistent between motifs.

In conclusion, the recent increase in availability of both sequence motifs and binding data have given us the opportunity to assess different motif databases and scanning methods for predicting potential TF binding sites. The results allow us to give some particular recommendations (such as the choice of the best motif and corresponding optimal threshold for a given TF) as well as general conclusions (superiority of FPR measure and public scanners) for users of these databases. We believe that our findings will prove to be useful also for hybrid methods [24,25], which use other data to improve motif-based predictions of binding sites.

Methods

Parsing and annotation of motif databases

Database distribution files downloaded from their providers websites (Table 1) were parsed using custom-made scripts to yield matrices in the common format, as well as matrices’ annotation to transcription factors. For the public databases, the available identifiers were the TF gene symbols or UniProt identifiers. For the purpose of comparison and mapping to the ChIP-seq datasets, these identifiers were mapped to human Entrez Gene ID, using the db2db function of the bioDBNet webservice [26] http://biodbnet.abcc.ncifcrf.gov/webServices/bioDBnet.wsdl. For the commercial databases the Entrez gene IDs were directly available. Genomatix uses the concept of matrix family [23], here referred to as ‘motif family’, i.e. family of the motifs so similar that they are predicted to bind a common set of TF orthologs in related species (e.g. all vertebrates). In Additional file 3 we provide a comprehensive mapping table for each motif database, for the commercial databases limited to the motifs used in this work.

Preparation of positive and negative TF-binding datasets

The positive and the negative sequences were obtained from the Nencki Genomics Database - NGD [27] http://www.nencki-genomics.org and were based on Ensembl [28] funcgen. For the comparison of the scanner performance we used TF ChIP-seq data for 44 TFs + CTCF from funcgen v.60 as the positive set. For the comparison of databases performance we used ChIP-seq data for 80 TFs + CTCF from funcgen v.71 as the positive set. We considered 4 negative datasets:

  • all human third exons, excluding exons with UTRs (common for all TFs),

  • sequences flanking ChIP-seq peaks for particular TFs (each peak shifted by its length + 40 nt along the chromosome),

  • random sequences of length and dinucleotide composition following ChIP-seq peaks for particular TFs, generated by BiasAway [29],

  • random sequences of length following ChIP-seq peaks for particular TFs, generated by 3rd order Markov chains learned on these peaks.

All sequences are available in Supplementary Materials at the authors’ website http://bioputer.mimuw.edu.pl/papers/motifs/.

The following quality measures were used in the comparison:

  • Sensitivity – the proportion of ChIP-seq peaks that contain at least one predicted binding site,

  • Specificity – the proportion of fragments in the negative dataset that contain no predicted binding site,

  • Balanced accuracy – the average of sensitivity and specificity.

Mapping of TFs to gene identifiers

Some TFs were represented by more than one ChIP-seq dataset and many TFs were mapped to several motifs from each database (for the details see Supplementary Materials). The annotation of the TF ChIP-seq datasets in Ensembl funcgen v.71 was downloaded from the Ensembl website (http://apr2013.archive.ensembl.org/Homo_sapiens/Experiment?db=core;ex=project-ENCODE-). The Ensembl gene IDs of the features in the classes: ‘Transcription factor’ and ‘Insulator’ (for CTCF) were mapped to Entrez Gene ID using the aforementioned db2db function of the bioDBNet webservice. For one entry (ZEB1), its Ensembl gene ID was corrected (to ENSG00000148516) prior to the mapping.

Motif scanning

The two commercial motif databases: Transfac Professional (Biobase) and MatBase (Genomatix) are distributed with dedicated motif scanning programs: Match [18] and MatInspector [19,30], respectively, and proprietary thresholds files, aimed at controlling the false positive rate. For matrix-scan we choose 1st order Markov chain background model learned on human gene upstream no-orf sequences and for Bio.Motif we used the uniform background model. We choose 0.0001 chance of Type I error as the threshold for both public scanners, because it yielded similar numbers of genome-wide matches to Match or MatInspector when run with the two respective motif databases.

Computation of intersections

For the comparison of scanners, we computed genome-wide intersections between ChIP-seq peaks and motifs using stored procedures of the NGD database [27] http://www.nencki-genomics.org. The NGD database stores the TFs ChIP-seq data imported from funcgen and the results of genome-wide motif scanning. The intersections, specificity and sensitivity were computed separately for each dataset-motif pair, then averaged twice: first for each TF-motif pair, then for a given scanner/database. For the in-depth comparison of performance we used an extended set of motif databases: Jaspar (2014) vertebrates, Transfac Professional 2013.1, MatBase 9.0, HOCOMOCO v.9, HT-SELEX, SwissRegulon (2013). This time we scanned only the positive and the negative sequences with the Bio.Motif scanner. As the average length of human third exons (153 +/− 304) was smaller than the average size of the ChIP-seq peaks (403 +/− 172), before the scanning we added 100-nt flanks of either side of exons 3 and use these flanked exons as the negative set.

Threshold selection methods

Given a motif matrix M, the log-likelihood of a sequence w of corresponding length is given by

LM(w)=logP(w|M)P(w|B)

where P(w|M) is the probability of observing w given the motif model and P(w|B) is the probability of observing w given the background model. Given a threshold t M, all sequences w satisfying L M(w)>t M are classified as M-occurrences.

Since log-likelihood distributions substantially vary across motifs, there were proposed approaches for threshold selection based on the shape of these distributions. Some representative methods are implemented in the Bio.Motif package:

  • FPR approach aims at restricting the number of false positive motif occurrences. For assumed type I error level α, t M is chosen to satisfy P(L M(w)>t M|B)=α.

  • FNR approach restricts the number of false negatives. In this method t M satisfies P(L M(w)<t M|M)=β for assumed type II error level β.

  • Balanced approach constrains the proportion between the levels of both errors, i.e. threshold t M satisfies P(L M(w)<t M|M)=γ·P(L M(w)>t M|B) for assumed parameter γ. Setting γ to the inverse of the expected frequency of motif occurrences results in roughly the same number of false positive and false negative binding site predictions.

Acknowledgements

This work was supported by: Polish National Science Centre grant No. 2013/09/B/NZ2/03170, Polish Ministry of Science and Higher Education grant No. N N519 652740 and Polish National Center for Research and Development grant No. ERA-NET-NEURON/10/2013.

Additional files

Additional file 1 (4MB, pdf)

Supplementary tables and figures with results for alternative negative datasets.

Additional file 2 (419.5KB, xls)

Performance summary for all TF-motif pairs and all negative datasets.

Additional file 3 (389.5KB, xls)

TF-motif mappings for all databases.

Footnotes

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

MD, ND and BW designed the study and drafted the manuscript. IK, MD and ND prepared positive and negative TF-binding datasets. MD and BK analyzed databases coverage and compared motif scanners. BW performed scanning of positive and negative sequences with all motif databases. ND evaluated databases quality, computed optimal log-odds thresholds and compared threshold selection methods. All authors read and approved the final manuscript.

Contributor Information

Michal Dabrowski, Email: m.dabrowski@nencki.gov.pl.

Norbert Dojer, Email: dojer@mimuw.edu.pl.

Izabella Krystkowiak, Email: i.krystkowiak@nencki.gov.pl.

Bozena Kaminska, Email: b.kaminska@nencki.gov.pl.

Bartek Wilczynski, Email: bartek@mimuw.edu.pl.

References

  • 1.Schneider TD, Stormo GD, Gold L, Ehrenfeucht A. Information content of binding sites on nucleotide sequences. J Mol Biol. 1986;188(3):415–31. doi: 10.1016/0022-2836(86)90165-8. [DOI] [PubMed] [Google Scholar]
  • 2.Xing EP, Jordan MI, Karp RM, Russell S. A hierarchical bayesian markovian model for motifs in biopolymer sequences. In: Becker S, Thrun S, Obermayer K, editors. Advances in Neural Information Processing Systems 15. Vancouver Canada: MIT Press; 2003. [Google Scholar]
  • 3.Zhao Y, Ruan S, Pandey M, Stormo GD. Improved models for transcription factor binding site identification using nonindependent interactions. Genetics. 2012;191(3):781–90. doi: 10.1534/genetics.112.138685. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Yang L, Zhou T, Dror I, Mathelier A, Wasserman WW, Gordân R, et al. TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res. 2014;42(Database issue):148–55. doi: 10.1093/nar/gkt1087. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Zhao Y, Stormo GD. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat Biotech. 2011;29(6):480–3. doi: 10.1038/nbt.1893. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Hertz GZ, Stormo GD. Identifying dna and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999;15(7):563–77. doi: 10.1093/bioinformatics/15.7.563. [DOI] [PubMed] [Google Scholar]
  • 7.Rahmann S, Muller T, Vingron M. On the power of profiles for transcription factor binding site detection. Stat Appl Genet Mol Biol. 2003;2(1):1032. doi: 10.2202/1544-6115.1032. [DOI] [PubMed] [Google Scholar]
  • 8.Xie X, Lu J, Kulbokas EJ, Golub TR, Mootha V, Lindblad-Toh K, et al. Systematic discovery of regulatory motifs in human promoters and 3’ UTRs by comparison of several mammals. Nature. 2005;434(7031):338–45. doi: 10.1038/nature03441. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Matys V, Kel-Margoulis OV, Fricke E, Liebich I, Land S, et al. Transfac and its module transcompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006;34(Database issue):108–10. doi: 10.1093/nar/gkj143. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Berger MF, Badis G, Gehrke AR, Talukder S, Philippakis AA, Peña-Castillo L, et al. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell. 2008;133(7):1266–76. doi: 10.1016/j.cell.2008.05.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, et al. Diversity and complexity in DNA recognition by transcription factors. Science. 2009;324(5935):1720–23. doi: 10.1126/science.1162327. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, et al. Jaspar 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 2010;38(Database issue):105–10. doi: 10.1093/nar/gkp950. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, et al. Dna-binding specificities of human transcription factors. Cell. 2013;152(1-2):327–39. doi: 10.1016/j.cell.2012.12.009. [DOI] [PubMed] [Google Scholar]
  • 14.Kulakovskiy IV, Medvedeva YA, Schaefer U, Kasianov AS, Vorontsov IE, Bajic VB, et al. Hocomoco: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res. 2013;41(Database issue):195–202. doi: 10.1093/nar/gks1089. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Pachkov M, Balwierz PJ, Arnold P, Ozonov E, van Nimwegen E. Swissregulon, a database of genome-wide annotations of regulatory sites: recent updates. Nucleic Acids Res. 2013;41(Database issue):214–20. doi: 10.1093/nar/gks1145. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Mathelier A, Zhao X, Zhang AW, Parcy F, Worsley-Hunt R, Arenillas DJ, et al. JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res. 2014;42(Database issue):142–7. doi: 10.1093/nar/gkt997. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.ENCODE Project Consortium. Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. doi: 10.1038/nature11247. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Kel AE, Gössling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E, MATCH: a tool for searching transcription factor binding sites in DNA sequences Nucleic Acids Res. 2003;31(13):3576–579. doi: 10.1093/nar/gkg585. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, et al. MatInspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics. 2005;21(13):2933–42. doi: 10.1093/bioinformatics/bti473. [DOI] [PubMed] [Google Scholar]
  • 20.Turatsinze JV, Thomas-Chollier M, Defrance M, van Helden J. Using rsat to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nat Protoc. 2008;3(10):1578–88. doi: 10.1038/nprot.2008.97. [DOI] [PubMed] [Google Scholar]
  • 21.Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3. doi: 10.1093/bioinformatics/btp163. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Wilczynski B, Dojer N, Patelak M, Tiuryn J. Finding evolutionarily conserved cis-regulatory modules with a universal set of motifs. BMC Bioinformatics. 2009;10:82. doi: 10.1186/1471-2105-10-82. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Cartharius K, Frech K, Grote K, Klocke B, Haltmeier M, et al. Matinspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics. 2005;21(13):2933–42. doi: 10.1093/bioinformatics/bti473. [DOI] [PubMed] [Google Scholar]
  • 24.Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y, Pritchard JK. Accurate inference of transcription factor binding from dna sequence and chromatin accessibility data. Genome Res. 2011;21(3):447–55. doi: 10.1101/gr.112623.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Kaplan T, Li XY, Sabo PJ, Thomas S, Stamatoyannopoulos JA, Biggin MD, et al. Quantitative models of the mechanisms that control genome-wide patterns of transcription factor binding during early drosophila development. PLoS Genet. 2011;7(2):1001290. doi: 10.1371/journal.pgen.1001290. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Mudunuri U, Che A, Yi M, Stephens RM. bioDBnet: the biological database network. Bioinformatics. 2009;25(4):555–6. doi: 10.1093/bioinformatics/btn654. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Krystkowiak I, Lenart J, Debski K, Kuterba P, Petas M, Kaminska B, et al. Nencki genomics database–ensembl funcgen enhanced with intersections, user data and genome-wide TFBS motifs. Database (Oxford) 2013;2013:069. doi: 10.1093/database/bat069. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, et al. Ensembl 2013. Nucleic Acids Res. 2013;41(Database issue):48–55. doi: 10.1093/nar/gks1236. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Worsley Hunt R, Mathelier A, del Peso L, Wasserman W. Improving analysis of transcription factor binding sites within chip-seq data based on topological motif enrichment. BMC Genomics. 2014;15(1):472. doi: 10.1186/1471-2164-15-472. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Quandt K, Frech K, Karas H, Wingender E, Werner T. MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 1995;23(23):4878–884. doi: 10.1093/nar/23.23.4878. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from BMC Bioinformatics are provided here courtesy of BMC

RESOURCES