A systematic, large-scale comparison of transcription factor binding site models

Daniela Hombach; Jana Marie Schwarz; Peter N Robinson; Markus Schuelke; Dominik Seelow

doi:10.1186/s12864-016-2729-8

. 2016 May 21;17:388. doi: 10.1186/s12864-016-2729-8

A systematic, large-scale comparison of transcription factor binding site models

Daniela Hombach ^1,², Jana Marie Schwarz ^1,², Peter N Robinson ³, Markus Schuelke ^1,², Dominik Seelow ^1,^2,^4,^✉

PMCID: PMC4875604 PMID: 27209209

Abstract

Background

The modelling of gene regulation is a major challenge in biomedical research. This process is dominated by transcription factors (TFs) and mutations in their binding sites (TFBSs) may cause the misregulation of genes, eventually leading to disease. The consequences of DNA variants on TF binding are modelled in silico using binding matrices, but it remains unclear whether these are capable of accurately representing in vivo binding.

In this study, we present a systematic comparison of binding models for 82 human TFs from three freely available sources: JASPAR matrices, HT-SELEX-generated models and matrices derived from protein binding microarrays (PBMs). We determined their ability to detect experimentally verified “real” in vivo TFBSs derived from ENCODE ChIP-seq data. As negative controls we chose random downstream exonic sequences, which are unlikely to harbour TFBS. All models were assessed by receiver operating characteristics (ROC) analysis.

Results

While the area-under-curve was low for most of the tested models with only 47 % reaching a score of 0.7 or higher, we noticed strong differences between the various position-specific scoring matrices with JASPAR and HT-SELEX models showing higher success rates than PBM-derived models. In addition, we found that while TFBS sequences showed a higher degree of conservation than randomly chosen sequences, there was a high variability between individual TFBSs.

Conclusions

Our results show that only few of the matrix-based models used to predict potential TFBS are able to reliably detect experimentally confirmed TFBS.

We compiled our findings in a freely accessible web application called ePOSSUM (http:/mutationtaster.charite.de/ePOSSUM/) which uses a Bayes classifier to assess the impact of genetic alterations on TF binding in user-defined sequences. Additionally, ePOSSUM provides information on the reliability of the prediction using our test set of experimentally confirmed binding sites.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-016-2729-8) contains supplementary material, which is available to authorized users.

Keywords: Transcription factor binding sites, TFBS prediction, PSSM, Genetic variation

Background

Transcription factors (TFs) are major players in the regulation of gene activity. They bind to DNA target sequences known as transcription factor binding sites (TFBSs) in order to up- or down-regulate transcriptional activity. As DNA binding is pivotal to correct TF function, mutations in TFBSs can lead to genetic disease [1, 2].

However, despite the rapid growth in the number of mutations known to cause Mendelian disorders, the number of confirmed mutations in regulatory sequences is still low, presumably due to difficulties in experimentally verifying their functional relevance as the cause of a given disease.

With each deep sequencing run a large number of DNA variants are found, many of which may be considered as potentially causing Mendelian disorders, even after bioinformatic filtering. Hence the main challenge is currently not the generation, but the interpretation of the data. While considerable progress has been made at the in silico analysis of alterations in the protein coding sequences (e.g. MutationTaster [3], SIFT [4], PolyPhen-2 [5]), these prediction tools address the role of variants within potential regulatory sequences only indirectly, e.g. via their evolutionary conservation, or the simple fact that a variant occurred somewhere within a confirmed regulatory element. We thus set out to examine and compare the relevance of methods for computational TFBS recognition as a means to assess the disease-causing potential of genetic variants.

Currently, the state-of-the art method for functionally representing TFBSs is via transcription factor binding models that are described by matrices, such as position-specific scoring matrices (PSSMs), also called position weight matrices (PWMs). These PSSMs are normalised representations of the position-specific log-likelihoods of a nucleotide’s probability to occur at each location in an observed sequence [6]. PSSMs can be used to computationally detect putative TFBSs in regulatory regions by establishing a match score, which determines how well a given sequence matches with a TF binding model.

A manually curated census in the year 2009 counted 1391 TFs in humans [7]. This number has grown since then—estimates for the total number of human TFs currently range between 2000 and 3000 [7, 8]. However, only a fraction of these factors have been studied intensely enough to generate reliable binding models; the ENCODE (Encyclopaedia of DNA Elements) project, for example, offers one of the largest collections of experimentally verified binding instances, but only covers 161 different TFs [9].

Moreover, various other repositories of TF binding models exist that have been obtained by different methods, but still fall short to represent the predicted TF landscape in humans. The largest of these repositories, TRANSFAC [10], is not freely available and has therefore not been included into this survey. The human Protein-DNA Interactome (hPDI) [11] contains matrices for 493 human TFs generated through Protein-Binding Microarrays (PBMs), whereas a recent publication [12] determined over 200 matrices for 151 human TFs via an HT-SELEX experiment. HT-SELEX enables the identification of DNA target strands through a large library of artificially generated oligonucleotides, which are presented to a target protein. Bound sequences are amplified by PCR, followed by several rounds of selection processes to determine the best-bound sequences. It is not surprising that binding models derived from different experimental methods can show highly different success rates in modelling TF-DNA binding. However, for a long time it was impossible to compare the quality of binding models that had been obtained by various methods due to the lack of sufficient experimental data.

Therefore, only recently a direct comparison between TF models derived from PBM and HT-SELEX-experiments has been published [13]. This study found models generated through both experimental strategies to be quite similar for in vitro binding, whereas for in vivo prediction HT-SELEX-derived models achieved better results.

Here we present a larger study in which we set out to compare a total of 179 PSSMs linked to 82 different TFs that derived from the following three sources: HT-SELEX, PBM, and the manually curated binding models provided by JASPAR [14]. Almost all of the 58 JASPAR matrices (representing 56 TFs) used in our study resulted from ChIP-seq experiments, only two, aimed at NFIC and RELA, were generated by HT-SELEX experiments. We tested all models for their capability to recognise ChIP-seq confirmed functional (‘in vivo’) TFBSs from the ENCODE project [15]. Due to the presence of several TFs in more than one binding model source, we were able to directly compare matrices linked to 26 different TFs from JASPAR and HT-SELEX, thereby enabling a quality assessment of binding models obtained through different experimental methods. For this study, we created two test sets, one comprising all suitable TFBSs from ENCODE and another one which includes only those TFBSs with a high binding score (‘high-confidence set’). We used the same number of length-matched exonic sequences as a TFBS-depleted control set.

This study is to our knowledge the largest comparison of TF binding models, testing 179 binding models (PSSMs) on over 2 x 2,000,000 target sequences in total. Hence, our study provides insight into the current quality of computational TFBS predictions and the findings generated in this study may help to evaluate the feasibility of whole-genome TFBS prediction studies.

Results and discussion

Binding predictions

We tested the PSSMs mentioned above and used ChIP-seq derived sequences binding to TFs as positive controls. Results were assessed with ROC analysis (see Methods for details).

The various binding models from the multiple sources yielded different results in many cases: While only 3 of the 19 PBM-derived models from hPDI (16 %) reached an AUC score of ≥0.7, this was the case for 46 % of the HT-SELEX and 60 % of the JASPAR-models. The average AUC values generated by the different binding model sources were highly divergent, ranging from 0.53 (hPDI) to 0.72 (JASPAR).

When tested on the ‘high-confidence’ dataset, 4 out of 15 models (26.6 %) from hPDI yielded AUC scores of ≥0.7, which resulted in an average AUC score of 0.57. The HT-SELEX-models reached AUC values ≥0.7 in 70 % of the cases (average AUC of all PSSMs: 0.76), whereas the JASPAR-matrices generated an average AUC score of 0.83. An overview of these findings is depicted in Fig. 1.

Fig. 1 — Average AUC scores and representative ROC plots for different binding model sources. a Average AUC scores generated for the different binding model sources. b ROC plot for TFAP2C. c ROC plots for TFAP2A for the entire ENCODE test set (*left*) and the high confidence set (*right*). d Underlying TF binding models for TFAP2A

We were able to directly compare the prediction quality of models for 28 different TFs that were represented in both the JASPAR and the HT-SELEX data (26 for the high confidence test set). We found that in general, the manually curated matrices from JASPAR enabled a slightly better distinction between positive and negative sequences (average AUC 0.74 versus 0.70). In the ‘high-confidence’ data set, these values increased for both data model sources: The HT-SELEX models generated an average AUC of 0.80, whereas for JASPAR this value was 0.84 (Figs. 1a and 2).

Fig. 2 — Direct comparison of binding models generated by different methods. Depicted are AUC scores for TFs stored in both JASPAR (manually collected curated models) and HT-SELEX. AUC scores were generated using ROCR. If multiple binding models were available for one TF, we depict the average AUC value

These findings can help to estimate the utility of TFBS models from different sources, enabling researchers to better circumvent the high false-positive rate, a problem which is well known for in silico TF binding analyses [16], and thus to make the most of the available data. In general, we found manually curated models provided by JASPAR and HT-SELEX-derived matrices to be much more successful in recognising in vivo TFBSs than PBM derived models.

Previously, the success of TF binding prediction has been found to be similar for the majority of matrices, with degenerate matrices yielding the most accurate predictions [17]. Hence researchers currently often use TF binding models from various sources indiscriminately. The notion that TF binding models obtained through different experimental methods can be variable has just recently entered the scientific discourse in a novel publication [13]. There, the authors compared PBM- and HT-SELEX-derived models and found both models to be reliable with a slight advantage of HT-SELEX models in the prediction of in vivo TFBSs. Our study on a larger test set (up to several thousand binding events per TF with direct comparison of models from different sources for a subset of the tested TFs) showed JASPAR to be the most reliable source for TFBS matrices in this data set.

To check if the higher accuracy of JASPAR might be the result of an overfitting of the JASPAR models to the ENCODE data, we manually checked for an overlap between the tested matrices and the ENCODE data set. For each PSSM, we studied the accompanying publication and found that this was not the case for any of the binding models included in our study. Hence we can safely assume that the current JASPAR models are indeed the most successful ones to recognise in vivo TFBSs.

In this study, we decided to use exonic sequences as a negative test set. Because the 5’ regions of genes occasionally contain functional binding sites [18, 19], we decided to categorically exclude the first exon of each transcript from our analyses. Thus, the percentage of functional TFBSs in the negative test set should be sufficiently decreased in order to allow computational predictions. We preferred this approach to alternative analyses, such as the usage of artificially generated random DNA sequences without TFBSs, as they might be very different from naturally occurring sequences and are also likely to increase error rates [20]. Furthermore, the usage of random exonic sequences as a negative set closely mimics the real-life research situation, therefore allowing an important insight into the limitations of current in silico TFBS analysis methods.

We found that the usage of only high-confidence ENCODE data leads to an increased reliability in the computerised recognition of TFBSs. However, this filtering step also introduces a potential drawback: Strong TF binding does not necessarily implicate strong TF function, as has been shown for D. melanogaster [21], where low-affinity binding sites can be of great importance for normal gene activation. Moreover, a researcher looking for a functional binding site in silico does not necessarily have any previous knowledge on the binding strength. Therefore, the decision to admit only high-confidence data into analyses – which is common practice in TF binding research [13, 17, 22]—in fact excludes a large amount of relevant information and should therefore be taken with care.

Conservation analyses

In a real-life research situation, TFBS predictions are usually supplemented by additional analyses such as the determination of sequence conservation. We therefore further tested whether the ENCODE TFBSs (positive set) show a higher degree of conservation than the genetic background. In detail, we measured phyloP [23] and phastCons [24] scores for each position of the ENCODE TFBSs and of negative test cases (length-matched random sequences from chromosomes 1 and 2). For each TF, we then calculated the average scores of the highest conservation as well as of the average conservation score reported for each sequence. As we did not know where exactly in the ENCODE sequence the actual TFBS was located, we chose to focus on the highest conservation scores, assuming that functional TFBSs would show a high degree of conservation. However, we found that, while there was a higher degree of conservation for the ENCODE sequences, the variability in the tested sequences was very high. Moreover, we did not find a link between strong TF binding and sequence conservation because the mean maximum phyloP and phastCons scores for the high-confidence test set were not significantly higher than for all ENCODE sequences. We hence conclude that, while this may sometimes be the case, a functional binding site does not necessarily have to be highly conserved. As a representative example, Fig. 3 shows example images of conservation readings for ZBTB33 and BCL11A.

Fig. 3 — Representative plots for conservation analyses. We determined the maximum phastCons (a) and phyloP (b) scores in each experimentally confirmed binding site of BCL11A (*left panel*) and ZBTB33 (*right panel*) and calculated the averages of the maximum scores

Our findings indicate that while there seems to be a certain degree of increased conservation for ChIP-seq verified TFBSs in comparison to random genetic sequences (Fig. 3), the variability between individual sites is too high to draw any reliable conclusions. These results support previous findings, which suggest that functional TFBSs can, but do not have to be evolutionarily conserved with regard to their primary nucleotide sequence [25–27]. More precisely, some researchers argue that TFBSs may fluctuate quite rapidly between species. For instance, it has been found in a comparative genomics study that the likelihood of a TFBS to be conserved between S. cerevisiae and its two closest relatives (S. paradoxus and S. mikatae) lies below 5 % for a majority of the sites [26]. In addition, rapid evolution of combinatorial transcription networks was found for the transcriptional regulator MCM1 in closely related fungi species [28]. Moreover, a simulation of the evolution of TFBSs by introducing local point mutations showed TFBSs and combinations of binding sites to evolve quickly [29]. Considering the fact that TFBSs are known to be relatively variable, even within the same species, it would be surprising if all functional binding sites were conserved across larger evolutionary distances [27, 30, 31]. It has also been shown in comparative studies between various species that the functionality of TFBSs can still be preserved despite underlying sequence changes [32, 33]. The three-dimensional structure of the DNA at a TFBS, including the histones, is likely to play an important role for TF binding, a notion which is slowly being acknowledged for the analysis of TFBSs [34]. As such three-dimensional structures can be well conserved while the underlying linear sequence may evolve rapidly, sequence conservation scores might not be of major importance for the assessment of TFBSs. In our study, we found that the degree of variation between individual TFBSs was too high to infer TFBSs conservation in general. Hence, measurements such as genomic location or three-dimensional conservation might be better suited for the assessment of the validity of a TFBS.

Regulatory SNPs (rSNPs)

We compiled a set of 52 rSNPs known to directly impact TF binding or gene expression (included as Additional file 1). rSNPs are single-nucleotide polymorphisms occurring in regulatory regions such as promoters or enhancers. They are likely to be of functional significance as they do affect transcription initiation, elongation, and translational characteristics of mRNA but are more difficult to study than alterations in coding regions and are hence under-represented in the literature [35–37]. We only included alterations in this test set which have been shown to directly interfere with a TFBS’s function in in vitro or in vivo experiments. In this test set, we found differences between wild-type and mutant variants with respect to the evaluated maximum binding scores in 76 % of all observations, but they rarely exceeded 20 %. For a subset of the rSNPs, which linked to 15 different TFs (represented by 58 PSSMs from JASPAR, hPDI, and HT-SELEX), we were able to directly compare the computational predictions to the experimental findings. If solely considering maximum binding score changes, we found agreement between computational and experimental results in less than 50 % (48.1 %) of the tested matrices. However, when analysing the data with ePOSSUM (see below), we obtained a correct prediction in 62 % of the cases. Interestingly, the different binding models from the various sources again yielded quite divergent results, thus confirming strong variability in the quality of the tested matrices (see Additional file 1).

We are aware that the rSNP test set used in this study was not large enough to provide comprehensive insight. This is due to the fact that experimental data on rSNPs in TFBSs are not readily available. Hence, we decided to make the best of the available data and to provide an initial glimpse into the reliability of the state-of-the-art methods for TFBS prediction.

ePOSSUM

Our study demonstrates that many of the existing PSSMs offer only limited reliability, while others do indeed allow reasonable predictions. In order to make our findings readily available to the scientific community, we created software called ePOSSUM that predicts the effect of genetic variation on TF binding using the matrices included in our study. It also includes a measure of significance of the assessments to indicate how reliable a prediction by a certain PSSM would be.

ePOSSUM is a web-based application that can be freely accessed at http://mutationtaster.charite.de/ePOSSUM/. Users can either enter genetic variants in the common VCF format or wild-type sequences along with the variant sequences. In the first case, the position of the variant is known and ePOSSUM scans the ENCODE ChIP-seq data used in our study for known TFBSs around this position. The existence of TFBSs at the position under study is included in the output. In total, ePOSSUM offers predictions for 81 TFs which are based on 171 different PSSMs. To speed up analysis, users can choose which TFs to include into the search for binding sites. If they submit a genomic position, they can limit the search to those TFs that are known to bind at this position.

Unlike similar applications such as motifbreakR [38], ePOSSUM does not merely subject sequences to different PSSMs, but uses their binding scores as input for a Bayes classifier. More precisely, this classifier uses the quotient of the maximum binding scores calculated for wild-type and variant sequence to determine whether the genetic alteration leads to the (predicted) gain or loss of a TFBS. As most of these quotients do not show a normal distribution, we use the R module klaR [39] for the classifier, which considers the kernel-density, i.e. the real distribution of values, to generate predictions.

Due to the poor performance of many PSSMs in discriminating real TFBSs from exonic sequences as indicated by similar maximum binding scores for both, the Bayes probability for both outcomes (TF binding versus no TF binding) is often far from 100 %. We therefore include the negative or positive predictive value (NPV/PPV) for each matrix for the given Bayes probability as a measure of the reliability of the classifier’s prediction. This value is also used to colour-code the likelihood of increased/new or decreased/lost TF binding. Darker colours represent more likely changes in TF binding. ePOSSUM uses all available PSSMs for a TF to generate a final assessment—here, only PPVs/NPVs of 70 % or above are considered. The assessment also includes whether or not a binding site for the TF was reported by ENCODE at the respective genomic location.

Different Bayes classifiers were trained for each PSSM with the quotients of all possible pairs of positive versus negative cases for the respective TF using all available sequences from our ENCODE high-confidence set as positive and exonic sequences as negative cases. The number of pairs was deliberately limited to 1,000,000 per matrix, even if more pairs were possible.

ePOSSUM is limited to human TFs and, if genomic positions were used, to genome version GRCh37. A weakness lies in the recognition of known TFBSs—as each TFBS positive ChIP-seq snippet from ENCODE comprises several hundred bases, we do not know whether or not the “real” TFBS is truly located within the fragment screened in the analysis.

Conclusions

The assessment of variation in regulatory sequences remains a substantial challenge for the research community. While in previous decades genetic analyses focused on protein-coding regions, due to the availability of deep sequencing the search for regulatory mutations has meanwhile become a research focus. The properties of TF binding have been investigated for a long time and therefore offer a myriad of data for computational analyses. However, the utility of TF binding models for large-scale TFBS recognition has not been extensively investigated yet.

Therefore, this study is an attempt at comparing the accuracy of TF matrices obtained by various experimental methods in predicting functional TFBSs on a large scale. Thus, it enables an analysis of the quality and relevance of current TF binding prediction data.

Problems in predicting TFBSs in silico include high false-positive rates [16] high variability [30, 40], and insufficient knowledge of the exact in vivo binding sites [17, 41]. Current TF binding models are usually based on mere one-dimensional sequence parameters, which poses a major limitation.

To tackle these problems, researchers have to wisely choose data sources as well as TFBS predictors. Taken together, our analysis and the newly created and free-to-use software ePOSSUM provide previously unknown quality measures for computational TFBS predictions thus enabling researchers to extract and use the most meaningful of the available data.

Methods

TF binding models

TF binding matrices generated via three different experimental methodologies were obtained from the R package MotifDb [42]. This tool contains a large and up-to-date collection of freely available TF binding models. We used human PSSMs from some of the largest and most frequently used ‘open source’ TFBS data sources: hPDI (19 matrices), JASPAR (58 matrices linked with 56 different transcription factors), and HT-SELEX studies (102 matrices linked with 47 different factors). For hPDI and JASPAR, usually one matrix was available per TF, whereas the HT-SELEX data reported up to six variants per TFs. The tested PSSMs were highly variable, ranging from approximately 5 to 19 bp length, with hPDI-matrices being generally shorter than models obtained from the other two sources. In this study, in order to obtain comparable results, we only evaluated monomeric binding models, despite the fact that the latest issue of JASPAR also contains a number of homodimeric models. An overview of the binding models can be found in Fig. 4 and in the Additional file 2.

Fig. 4 — Overview of tested TFs for the entire data set (a) and the high-confidence data (b). ENCODE: Entire set of ENCODE TFBSs (2012 freeze). High confidence set: TFs reaching at least 80 % of the maximum possible binding score (published by ENCODE) in at least 100 binding instances. Please note that the intersections are not drawn to scale

Experimentally confirmed TFBSs (positive cases)

The positive test set consisted of ChIP-seq confirmed TF binding events, which were obtained from the ENCODE project [9] (March 2012 Freeze), as provided by the UCSC [43]. This data includes TF binding instances found in ChIP-seq experiments that had been performed on various human cell lines. For each binding event, the genomic location is reported together with a binding score indicating the strength of the protein-DNA interaction.

High-confidence positive test set

In the first step of the experiment, all reported ChIP-seq instances were incorporated into the workflow. In a second approach, only sequences reaching at least 80 % of the maximum possible binding score published by ENCODE were included into a ‘high-confidence’ test set. Of these cases, we only tested TFs for which at least 100 binding instances were reported by ENCODE to reach this threshold. Hence, 15, 44, and 55 TFs remained for hPDI, HT-SELEX and JASPAR, respectively.

TFBS-depleted DNA (negative test set)

For the negative test set, we used exonic sequences (GRCh37) which we obtained via Ensembl-Biomart [44]. For each sequence of the positive test set, we randomly selected an exonic sequence of equal or greater length to the corresponding TFBS sequence; in case of longer exonic sequences, these were shortened randomly to match the TFBS sequence. We routinely omitted the first exon of each transcript in order to decrease the chance of erroneously including TFBSs located downstream of the transcription start site [18, 19].