Abstract
Background: Understanding the location and cell-type specific binding of Transcription Factors (TFs) is important in the study of gene regulation. Computational prediction of TF binding sites is challenging, because TFs often bind only to short DNA motifs and cell-type specific co-factors may work together with the same TF to determine binding. Here, we consider the problem of learning a general model for the prediction of TF binding using DNase1-seq data and TF motif descriptions in the form of position specific energy matrices (PSEMs).
Methods: We use TF ChIP-seq data as a gold standard for model training and evaluation. Our contribution is a novel ensemble learning approach using random forest classifiers. In the context of the ENCODE-DREAM in vivo TF binding site prediction challenge, we consider different learning setups.
Results: Our results indicate that the ensemble learning approach is able to better generalize across tissues and cell-types compared to individual tissue-specific classifiers or a classifier applied to the data aggregated across tissues. Furthermore, we show that incorporating DNase1-seq peaks is essential to reduce the false positive rate of TF binding predictions compared to considering the raw DNase1 signal.
Conclusions: Analysis of important features reveals that the models preferentially select motifs of other TFs that are close interaction partners in existing protein-protein interaction networks. Code generated in the scope of this project is available on GitHub: https://github.com/SchulzLab/TFAnalysis (DOI: 10.5281/zenodo.1409697).
Keywords: ENCODE-DREAM in vivo Transcription Factor binding site prediction challenge, Transcription Factors, Chromatin accessibility, Ensemble learning, Indirect-binding, TF-complexes, DNase1-seq
Introduction
Transcription Factors (TFs) are key players of transcriptional regulation. They are indispensable for establishing and maintaining cellular identity and are involved in several diseases 1. TFs bind to the DNA at distinct positions, mostly in accessible chromatin regions 2, and regulate transcription by recruiting additional proteins. TFs can alter chromatin organization or, for example, recruit an RNA polymerase to initiate transcription 1. Hence, to understand the function of TFs, it is vital to identify the genomic location of TF binding sites (TFBS). As TFs regulate distinct genes in distinct tissues, these binding sites are tissue-specific 2.
Nowadays, the most prevalent and widely used method to experimentally determine TFBS is through ChIP-seq experiments, which can be used to generate genome-wide, tissue-specific maps of in-vivo TF binding. However, ChIP-seq experiments are expensive, experimentally challenging, and require an antibody for the target TF. To overcome these limitations, a number of computational methods have been developed to pinpoint TFBS. Most of these methods are based on position weight matrices (PWMs) describing the sequence preference of TFs 3. PWMs indicate, for each position of a TF binding motif independently, which nucleotide is most likely to occur. Unfortunately, screening the entire genome using a PWM results in too many false positive predictions. Therefore, numerous methods have been proposed to reduce the prediction error by combining PWMs with epigenetic data reflecting chromatin accessibility, such as DNase1-seq, ATAC-seq, or histone modifications. Also, additional features such as nucleotide composition, DNA shape, or sequence conservation can be incorporated into the predictions. Including these additional data sets and information improved the TF binding predictions considerably 4–11. A non-exhaustive overview is provided in 12. While PWM based models are still the most common means to assess the likelihood of a TF binding to genomic sequences, more elaborate approaches such as SLIM models, which capture nucleotide dependencies, have been successfully used as well 13. Recently, deep learning methods have been used to learn TF binding specificities de novo from large-scale data sets comprising not only ChIP-seq but also SELEX and protein binding microarray (PBM) data 14.
The ENCODE-DREAM in vivo Transcription Factor binding site prediction challenge 15 aims to systematically compare various approaches to TFBS prediction in a controlled setup, with the additional complexity of applying the classifiers to tissues/cell types that were not used for model training. The challenge organizers provide TF ChIP-seq data for 31 TFs, accompanied by RNA-seq and DNase1-seq data in 12 different tissues. Using labels deduced from the TF ChIP-seq data, predictive models for TF binding should be learned and then applied to a set of hold-out chromosomes in an unseen tissue. Predictions are computed in bins covering the entire target chromosomes. The main challenge paper will provide a detailed explanation of the challenge setup and a comparison across all competing methods. This article is a companion paper to the main ENCODE-DREAM challenge paper, in which we describe our contribution to the challenge, delineate the motivation for our work, and provide an independent evaluation of our ideas to achieve generalizability across tissues.
We developed an ensemble learning approach using random forest (RF) classifiers, extending the work of Liu et al. 11. Tissue-specific cofactor information was shown to be relevant for accurately modeling TF binding 11, 16. Thus, we designed our approach to aggregate tissue-specific cofactor data, via an ensemble step, into a generalizable model. Briefly, we compute TF affinities with TRAP 17 for 557 PWMs in DNase-hypersensitive sites (DHSs) identified with JAMM 18. TF affinities computed by TRAP are inferred from a biophysical model. In contrast to a simple binary classification of sites, e.g. with FIMO 19, these scores can also capture low-affinity binding sites, which were shown to be biologically relevant 20, 21. Here, we show that our ensemble models generalize well between tissues and that they exhibit better classification performance than tissue-specific RF classifiers. Furthermore, we illustrate that only a small subset of TF features is sufficient to predict tissue-specific TFBSs and show that these TFs are often known co-factors/interaction partners of the target TF.
Methods
Data
Within the scope of the challenge, participants were provided with ChIP-seq data for 31 TFs, as well as DNase1-seq data and gene expression estimates obtained from RNA-seq for 13 tissues. Of the 31 available TFs, 12 were used to assess model performance in the final round of the challenge. Hence, we also focus on these 12 TFs in this article: CTCF, E2F1, EGR1, FOXA1, FOXA2, GABPA, HNF4A, JUND, MAX, NANOG, REST, and TAF1. The number of binding sites per TF and tissue is shown in Table 1. Note that we exclude ambiguous sites from consideration in this study. We refer to the challenge website for a detailed overview of the provided data 15. The challenge required that predictions be made in bins of 200 bp, shifted by 50 bp, spanning the whole genome.
Table 1. Number of bins labeled as bound per transcription factor (TF) and tissue, deduced from TF ChIP-seq data.
TF | Number of bound sites per tissue |
---|---|
CTCF | 179,672 (A549), 271,097 (H1-hESC), 206,336 (HeLa-S3), 208,868 (HepG2), 170,208 (IMR-90), 215,238 (K562), 305,547 (MCF-7) |
E2F1 | 93,117 (GM12878), 55,391 (HeLa-S3) |
EGR1 | 72,595 (GM12878), 52,733 (H1-hESC), 175,994 (HCT116), 58,793 (MCF-7) |
FOXA1 | 256,632 (HepG2) |
FOXA2 | 374,750 (HepG2) |
GABPA | 26,467 (GM12878), 51,666 (H1-hESC), 31,202 (HeLa-S3), 60,552 (HepG2), 109,423 (MCF-7), 78,403 (SK-N-SH) |
HNF4A | 106,308 (HepG2) |
JUND | 203,665 (HCT116), 179,999 (HeLa-S3), 183,558 (HepG2), 193,814 (K562), 92,905 (MCF-7), 222,013 (SK-N-SH) |
MAX | 301,615 (A549), 98,327 (GM12878), 224,379 (H1-hESC), 321,501 (HCT116), 211,590 (HeLa-S3), 317,579 (HepG2), 318,318 (K562), 250,775 (SK-N-SH) |
NANOG | 32,918 (H1-hESC) |
REST | 71,251 (H1-hESC), 47,654 (HeLa-S3), 67,453 (HepG2), 59,640 (MCF-7), 48,946 (Panc1), 94,082 (SK-N-SH) |
TAF1 | 87,109 (GM12878), 185,027 (H1-hESC), 93,824 (HeLa-S3), 110,385 (K562), 83,276 (SK-N-SH) |
Data preprocessing and feature generation
In order to obtain datasets per tissue and per TF that could be handled in terms of memory consumption and processing time, and also to cope with the large imbalance between the numbers of bound and unbound sites, we randomly sampled as many negative (unbound) sites from the provided ChIP-seq tsv files as there were true binding sites per TF. The ChIP-seq labels contained in the balanced, down-sampled tsv files are used as the response for training the RF models.
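The following minimal sketch illustrates this balancing step in R; the file layout, column names, and label codes ("B" = bound, "U" = unbound, "A" = ambiguous) are assumptions about the challenge tsv format rather than a copy of our exact scripts.

```r
# Minimal sketch of the class-balancing step; file layout and label codes are assumptions.
balance_labels <- function(label_tsv, tissue) {
  labels  <- read.delim(label_tsv)                 # columns: chr, start, stop, one label column per tissue
  bound   <- labels[labels[[tissue]] == "B", ]     # bins labeled as bound
  unbound <- labels[labels[[tissue]] == "U", ]     # unbound bins; ambiguous ("A") bins are dropped
  set.seed(42)
  # draw as many unbound bins as there are bound bins
  rbind(bound, unbound[sample(nrow(unbound), nrow(bound)), ])
}

# e.g. train_bins <- balance_labels("GABPA.train.labels.tsv", "HepG2")   # hypothetical file name
```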
Throughout the course of the challenge, we used two distinct ways to generate features for the RF classifiers: (1) with and (2) without considering DHSs. In neither approach did we use the provided RNA-seq data, nor did we compute DNA shape features. Generally, we computed TF binding affinities with TRAP 17 for 557 distinct TFs using the default parameter settings. The position specific energy matrices (PSEMs) used in this computation were converted from position weight matrices (PWMs) obtained from JASPAR 3, UniPROBE 22, and HOCOMOCO 23. The code to perform the conversion and to run TRAP is available on GitHub.
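For illustration, the sketch below shows how a PWM can be converted into a PSEM and how a TRAP-style affinity of a region could be computed. The mismatch-energy formulation and the default parameters (lambda = 0.7, ln R0 = 0.584 * motif_length - 5.66) follow Roider et al. 17; whether our pipeline uses exactly these defaults is an assumption, and the actual conversion code is on GitHub.

```r
# Illustrative sketch of the PWM-to-PSEM conversion and a TRAP-style affinity (see Roider et al. 17).
pwm_to_psem <- function(pwm, lambda = 0.7) {
  # pwm: 4 x W matrix of base probabilities (rows A, C, G, T), pseudocounts already added
  consensus <- apply(pwm, 2, max)
  # mismatch energy per base and position: E_i(b) = (1 / lambda) * ln(p_i(b_max) / p_i(b))
  -log(sweep(pwm, 2, consensus, "/")) / lambda
}

trap_affinity <- function(psem, seq_codes, lnR0 = 0.584 * ncol(psem) - 5.66) {
  # seq_codes: integer vector (1 = A, 2 = C, 3 = G, 4 = T) for one strand of a region;
  # the reverse strand would be treated analogously and is omitted here
  W <- ncol(psem)
  windows <- seq_len(length(seq_codes) - W + 1)
  p_bound <- sapply(windows, function(s) {
    E <- sum(psem[cbind(seq_codes[s:(s + W - 1)], seq_len(W))])   # energy of this window
    x <- exp(lnR0 - E)
    x / (1 + x)                                                   # per-window binding probability
  })
  sum(p_bound)                                                    # TRAP affinity = sum over windows
}
```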
We compared two approaches to generate features for the classifier from DNase1-seq data. In the first approach, shown in Figure 1a, we compute tissue-specific DHSs using the peak caller JAMM 18 (version 1.0.7.2) and merge the peak calls using the bedtools merge 24 command (bedtools version 2.25.0). Next, TF affinities are calculated within the identified DHSs using TRAP, and the median DNase1 signal per DHS is computed from the provided bigwig files. The computed data is intersected, using a left outer join with bedtools, with the binned genome structure required for training (using the bins contained in the tsv files mentioned above) and testing (using the provided bed file containing all test regions).
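The following rough sketch outlines the command-line steps of this setup; the file names are placeholders and the exact JAMM/bedtools invocations may differ from the ones in our repository.

```r
# Rough sketch of the command-line steps in feature setup (1); file names are placeholders.
run <- function(cmd) { message(cmd); system(cmd) }

# merge the JAMM peak calls into one set of DHSs (input must be coordinate-sorted)
run("bedtools merge -i jamm_peaks.sorted.bed > dhs.merged.bed")

# TF affinities per DHS are then computed with TRAP on the DHS sequences (external step)

# median DNase1-seq signal per DHS, assuming the bigwig signal was converted to bedGraph first
run("bedtools map -a dhs.merged.bed -b dnase.signal.bedGraph -c 4 -o median > dhs.signal.bed")

# left outer join of the DHS features with the 200 bp bins used for training and testing
run("bedtools intersect -a bins.bed -b dhs.features.bed -loj > bins.annotated.bed")
```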
The second approach for computing features is depicted in Figure 1b. Here, we do not use information on DHSs; instead, we compute TF binding affinities and the DNase1-seq signal per bin. To account for variability between biological and technical replicates, we calculate the median DNase1 coverage across the replicates using the bedtools coverage command. Overall, the features for a single bin are composed of the TF affinities in that bin and the DNase1 signal in the bin itself together with its left and right neighboring bins.
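A small sketch of how such a bin-wise feature matrix could be assembled is shown below; object and column names are illustrative, not the ones used in our scripts.

```r
# Sketch of a bin-based feature matrix (setup 2); names are illustrative.
build_bin_features <- function(affinities, dnase) {
  # affinities: data.frame with one row per 200 bp bin and one TRAP affinity column per PWM
  # dnase:      numeric vector with the median DNase1 coverage of each bin, in genomic order
  n <- nrow(affinities)
  data.frame(
    affinities,
    dnase_left   = c(NA, dnase[-n]),    # DNase1 signal of the left neighboring bin
    dnase_center = dnase,               # DNase1 signal of the bin itself
    dnase_right  = c(dnase[-1], NA)     # DNase1 signal of the right neighboring bin
  )
}
```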
Ensemble random forest classifier
The Random Forest models, implemented using the randomForest R-package 25 (version 4.6-12), are trained on either of the feature setups explained in the previous section. Training the RF models can be seen as a two-step approach that is independent of the feature setup. Throughout model training, the balance between the bound and unbound classes is maintained to avoid over-fitting of the RF classifiers and to ensure an unbiased evaluation of model performance. For fitting the RF classifiers we used 4,500 trees and at most 30,000 positive and negative, i.e. bound and unbound, samples. This restriction is imposed by limitations of the randomForest R-package. As illustrated in Figure 2a, for a given target TF, we first learn tissue- and TF-specific RF classifiers using all available features from the input matrices T_i ∈ R^(n×557), i ∈ {1, ..., m}, where n is the number of bins forming the training set and m denotes the number of training tissues for the target TF:

RF_i = RandomForest(T_i, Binding(T_i)), i ∈ {1, ..., m},
where Binding(T_i) is a vector of length n holding the binding labels for the target TF in tissue i, and RandomForest(·, ·) generates the RF model trained on the features and labels provided by the first and second arguments, respectively. An example of the input matrix T_i and the response vector Binding(T_i) is shown in Figure 2b. In the second step, to focus only on essential regulators (c.f. Figure 3a), we shrink the feature space to the union F of the top 20 regulators taken over all tissue- and TF-specific RF classifiers, ranking the predictors according to their Gini index (Figure 2c):

F = TopFeatures(RF_1) ∪ ... ∪ TopFeatures(RF_m),
where TopFeatures(RF_j) denotes the top 20 features of RF_j, and Subset(·, ·) generates the reduced feature matrix based on the union F of the top TFs. In the following, we refer to a training data set comprised of only one tissue as a single-tissue case and to a training data set composed of multiple tissues as a multi-tissue case. Considering the single-tissue case, we train an RF model RF_S on the reduced feature space and use this as the final model for the respective target TF:

RF_S = RandomForest(Subset(T_1, F), Binding(T_1)).
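The snippet below sketches how such a ranking can be obtained from the randomForest package, which reports the mean decrease in Gini impurity as a feature importance measure; the list of tissue-specific models (rf_list) is hypothetical.

```r
# Sketch of the Gini-index-based feature ranking used to pick the top 20 predictors per model.
library(randomForest)

top_features <- function(rf, k = 20) {
  gini <- importance(rf, type = 2)[, 1]            # MeanDecreaseGini per feature
  names(sort(gini, decreasing = TRUE))[seq_len(k)]
}

# union of the top features over all tissue-specific models (rf_list is a hypothetical list)
# reduced_features <- Reduce(union, lapply(rf_list, top_features))
```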
In the multi-tissue scenario, we retrain tissue-specific RF models RF_i^S, i ∈ {1, ..., m}, on the reduced feature space and apply each of them across all available training tissues:

P_{i,j} = Prediction(RF_i^S, Subset(T_j, F)), i, j ∈ {1, ..., m},
where Prediction(RF_i^S, Subset(T_j, F)) returns the predictions made by RF_i^S when applied to the reduced feature matrix of tissue j. These predictions are combined in a new feature matrix P, in which column i holds the predictions of RF_i^S stacked over all training tissues j, and P is used as input to train an ensemble RF, RF_E. Note that P thus contains the predictions of all tissue-specific RF models on all available training tissues (Figure 2d):

RF_E = RandomForest(P, [Binding(T_1), ..., Binding(T_m)]).
By design, the ensemble model incorporates the tissue-specific RF classifiers in a non-linear way to better generalize across all provided training tissues. An example matrix that is used to obtain predictions from an ensemble RF is shown in Figure 2e.
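The following sketch summarizes the two-step training for the multi-tissue case; helper and object names are hypothetical, and only the number of trees and the per-class sample cap follow the description above.

```r
# Illustrative sketch of the two-step ensemble training (multi-tissue case); names are hypothetical.
library(randomForest)

train_ensemble <- function(features, labels, top = NULL, ntree = 4500, max_n = 30000) {
  # features: list of per-tissue feature matrices; labels: list of per-tissue factors
  #           with levels "Bound"/"Unbound" (already balanced, see Methods)
  m <- length(features)
  if (!is.null(top)) features <- lapply(features, function(x) x[, top, drop = FALSE])

  # step 1: tissue-specific RF classifiers, capping the training set at ~max_n bins per class
  rfs <- lapply(seq_len(m), function(i) {
    n   <- nrow(features[[i]])
    idx <- if (n > 2 * max_n) sample(n, 2 * max_n) else seq_len(n)
    randomForest(features[[i]][idx, , drop = FALSE], labels[[i]][idx], ntree = ntree)
  })

  # step 2: cross-tissue predictions of the tissue-specific models become the ensemble features
  ens_x <- do.call(rbind, lapply(seq_len(m), function(j)
    sapply(rfs, function(rf) predict(rf, features[[j]], type = "prob")[, "Bound"])))
  ens_y <- factor(unlist(lapply(labels, as.character)))
  list(tissue_models = rfs,
       ensemble = randomForest(as.data.frame(ens_x), ens_y, ntree = ntree))
}
```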
Performance assessment
We used two different ways to assess model performance: (1) While fitting the RF classifiers, we measure the out-of-bag (OOB) error, which is defined as the mean prediction error for each training sample i using only trees that were not trained on sample i. The OOB error is computed separately for the Bound and Unbound classes:

OOB_Bound = FN / (TP + FN), OOB_Unbound = FP / (TN + FP),

where TP denotes the sites correctly predicted as bound, TN denotes the sites correctly predicted as unbound, and FP and FN represent sites incorrectly predicted as bound and unbound, respectively. Note that, because we use balanced data for training the RF classifiers, the OOB error is computed on a balanced data set.
Additionally, we compute (2) the misclassification rate for the Bound and Unbound cases on a subset of the test data that was used by the challenge organizers. The test data is composed of three hold-out chromosomes which were not used for training: chr1, chr8, and chr21. Furthermore, TF binding is predicted in an unseen tissue, i.e. a tissue that was not used for training. An overview of the test data is provided in Table 2. Note that, in contrast to the training data, the test data is not balanced, i.e. the Unbound class is larger than the Bound class. Therefore, to avoid misinterpretation of model performance, it is essential to compute the error for both classes separately.
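A small helper mirroring these per-class error rates could look as follows, assuming prediction and truth are encoded as factors with levels "Bound" and "Unbound".

```r
# Per-class misclassification rates as used throughout this article.
class_errors <- function(truth, pred) {
  tp <- sum(truth == "Bound"   & pred == "Bound")
  fn <- sum(truth == "Bound"   & pred == "Unbound")
  tn <- sum(truth == "Unbound" & pred == "Unbound")
  fp <- sum(truth == "Unbound" & pred == "Bound")
  c(Bound   = fn / (tp + fn),    # error on the Bound class
    Unbound = fp / (tn + fp))    # error on the Unbound class
}
```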
Table 2. Test data used in this article, shown per transcription factor (TF) and tissue.
TF | Tissues |
---|---|
CTCF | PC-3, induced pluripotent stem cell |
E2F1 | K562 |
EGR1 | liver |
GABPA | liver |
JUND | liver |
MAX | liver |
REST | liver |
TAF1 | liver |
Protein-protein-interaction score
We obtained a customized protein-protein interaction (PPI) probability matrix R as described previously 26, which is derived from a random walk analysis on a protein-protein association network based on STRING 27 (version 9.05). An entry R_{i,j} represents the probability that protein i interacts with protein j. Note that R is not symmetric by construction, i.e. R_{i,j} ≠ R_{j,i}. To generate a score describing how likely it is that a subset of proteins P contained in R interacts with a distinct TF t, we define the PPI score S_{t,P} as a function of the interaction probabilities R_{t,p} and the Gini index values GI(p) obtained from the RF model corresponding to t, for all p ∈ P. Thus, the smaller the value of S_{t,P}, the more likely it is that the regulators in P interact with TF t.
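One plausible realization of such a score, assuming it is a Gini-index-weighted average of the non-interaction probabilities 1 − R_{t,p}, is sketched below; the exact weighting used in our analysis may differ, and the background comparison against random TF subsets (see Results) is indicated only schematically.

```r
# One plausible realization of the PPI score; the exact form used in the analysis may differ.
ppi_score <- function(R, t, P, gini) {
  # R:    PPI probability matrix with proteins as row/column names, R[t, p] = P(t interacts with p)
  # t:    target TF; P: selected predictor TFs; gini: named Gini index values from the RF model
  w <- gini[P] / sum(gini[P])          # importance weights of the selected predictors
  sum(w * (1 - R[t, P]))               # small score = predictors are likely interactors of t
}

# background: scores of randomly drawn TF subsets of the same size (uniform weights assumed)
# random_scores <- replicate(100, ppi_score(R, t, sample(colnames(R), length(P)),
#                                           setNames(rep(1, ncol(R)), colnames(R))))
```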
Results
In this section, we first show that shrinking the feature space to those TFs essential for training does not affect model accuracy. Next, we demonstrate the benefits of ensemble learning and how its accuracy depends on the number of training tissues. We further investigate the top TFs selected by the RF models and find known interaction partners with favorable (i.e. small) PPI scores. Finally, we compare the two feature design schemes described in the Methods section and explore their influence on model performance. If not stated otherwise, all figures presented in the following are based on feature setup (1), including DHSs.
Reducing the feature space to a small subset does not affect classification performance
Because a sparse feature space simplifies model interpretation, we reduced the feature space to contain only a few essential features. As explained above, we determined sets of top features using the Gini index, resulting in TF- and tissue-specific sets containing either the top 10 or top 20 features. As shown in Figure 3a, the difference in OOB error between the feature set comprised of the top 20 features and the full feature space is only marginal, whereas the difference increases when only the top 10 features are considered. Therefore, we decided to use a reduced feature space that consists of the top 20 features per model. The results indicate that the most important feature across all TFs is the DNase1-seq signal within the DHSs for feature setup (1). Similarly, in feature setup (2), the DNase1-seq signal within the bins is found to be more important than the TF features.
Ensemble learning improves model accuracy
According to the OOB error shown in Figure 3b, the ensemble RF classifiers outperform the tissue-specific models in all cases for both the Bound and Unbound classes, emphasizing the improved capability of the ensemble model to generalize across tissues. Additionally, we computed the misclassification rate on all test tissues that are linked to multiple training tissues (Figure 3c). Again, we note that the ensemble RF classifiers outperform the tissue-specific classifiers by several orders of magnitude in all Unbound instances and in most Bound cases. Overall, these results suggest that ensemble learning is a promising approach to deal with the tissue-specificity of TF binding.
Increasing the number of training tissues improves prediction accuracy
Although the results in Figure 3b and 3c suggest that the ensemble methods perform well, it remains unclear what influence the number of training tissues has on the performance of an RF. To elucidate this, we performed permutation experiments, learning multiple RF models using all possible combinations of training tissues available for a distinct TF. As this is a computationally demanding task, we performed it for only three, arbitrarily selected, TFs: MAX, TEAD4, and E2F6. Figure 4a illustrates that the OOB error declines as the number of training tissues increases. Hence, we conclude that the ability of an ensemble RF to generalize across tissues improves with a larger number of training tissues.
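The sketch below illustrates how such a combination experiment can be set up; the tissue names and the train_ensemble() routine (see the Methods sketch) are purely illustrative.

```r
# Sketch of the training-tissue permutation experiment: enumerate every combination of at least
# two training tissues and train one ensemble model per combination.
tissues <- c("A549", "GM12878", "H1-hESC", "HCT116", "HeLa-S3")   # example tissue set for one TF
combos  <- unlist(lapply(2:length(tissues),
                         function(k) combn(tissues, k, simplify = FALSE)),
                  recursive = FALSE)
# one ensemble model and OOB error per tissue combination (hypothetical helper and inputs)
# oob_by_combo <- sapply(combos, function(cs) train_ensemble(feature_list[cs], label_list[cs])$oob)
```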
However, it remains to be shown whether the improved accuracy obtained from the ensemble RF classifiers is in fact due to the ensemble learning itself. To test this, we designed another learning setup in which all tissue-specific data sets were aggregated into one. In other words, we pooled the training data for one TF across all available tissues into a single data set. We then used this pooled data set to train a new RF model. As depicted in Figure 4b, the true ensemble models perform considerably better than the models learned on the pooled training data. This shows that the ensemble technique is better suited to capture tissue-specific information than simple data aggregation.
Predictors selected by the RF classifiers are associated with the target TF
As stated before, we hypothesized that the top predictors selected by the RF classifiers represent regulators that either form protein complexes with the target TF, via direct or indirect binding, or bind directly to DNA in close proximity to the target TF. To investigate this hypothesis, we computed a PPI score S_{t,P} (see Methods) for the selected predictors P per TF t and compared it against scores computed for randomly sampled sets of TFs (based on 100 randomly drawn TF subsets). The PPI score S_{t,P} for TF t is small if t is likely to interact with the factors included in the selected predictor set P. In contrast, the score is high if t is not likely to interact with the factors in P. As shown in Figure 5a, except for three TFs (MAX, TAF1, ZNF143), the PPI scores of the TFs selected by the RF are better (i.e. smaller) than the scores for the randomly selected sets. This indicates that the RF classifiers select features representing regulators that are more likely to interact with the target TF, either through direct or indirect contacts.
Figure 5b provides an example of a PPI network centered on the TF MAFK. The network was obtained from the STRING database 27, using the settings highest confidence and no more than 10 interactors. The top features selected by the RF classifiers contain all known regulatory proteins in this network, except for NFE2L2, shown in red. Among these TFs are MAFK itself, MAFF, MAFG, and NFE2 (highlighted in green). The strong interactions among the small MAF proteins 28 as well as their dimerization with NFE2 29 have been reported in the literature before.
Interaction partners shown in gray cannot be identified by our approach, as they are either proteins without regulatory function or proteins for which no PWM is available.
Feature design influences the FP and FN predictions
In the conference round of the challenge, we used feature setup (1), which is based on DNase1 Hypersensitive Sites (DHSs), while in the final round we switched to setup (2), which is purely bin-based. This transition had a strong effect on our performance as assessed by the challenge organizers. While we improved the recall of our predictions by switching from (1) to (2), the precision decreased. In Figure 6, we show the misclassification rates for the Bound and Unbound classes depending on the feature design, assessed on test data. The bin-based models (2) outperform the peak-based models in the Bound case, whereas the peak-based models show superior performance in the Unbound case. The poor performance of the bin-based models in the Unbound case is probably driven by the strong dependence of the RF classifiers on the DNase1-seq signal. In contrast, models based on DHSs perform well in the Unbound case, because the search space for TFBSs is limited to DHSs. This increases the precision of the predictions but at the same time lowers the recall, which is reflected by the high misclassification rate in the Bound case.
Discussion and Conclusion
Here, we introduced an RF-based ensemble learning approach to predict TFBS in vivo. In this article, we did not compare our approach to competitors in the challenge, as this is done in the main challenge paper. Instead, we show the benefits of ensemble learning in a multi-tissue setting and that modeling cofactors is beneficial for the classification.
We show on both test and training data that the ensemble strategy is able to generalize better across tissues than models trained on only a single tissue (Figure 3). Also, the accuracy of the ensemble classifiers increases with an increasing number of available training tissues (Figure 4a). We further illustrate that simply using all available training data to learn one RF does not provide results as accurate as an ensemble model (Figure 4b). In this study, we decided to use RF classifiers because they provide accurate, non-linear classification in reasonable time. Alternative classification approaches, such as logistic regression or support vector machines, could have been used as well.
RF classifiers have also been proposed recently, independently from the challenge 11, as an adequate method to predict TF binding. Although the authors of 11 perform cross cell-type predictions, i.e. they predict TF binding in a tissue on which the RF was not trained, they do not use ensemble models as proposed here. However, they did show that it is beneficial for the prediction of a distinct target TF to consider further TFs as predictors, in addition to the target TF itself. This is in agreement with our findings. As shown in Figure 3a, a small subset of features is sufficient to reach classification performance similar to that of the full feature space. We found that most of these selected TFs are known interaction partners of the target TF (see Figure 5). This is also supported by a recent study illustrating that most TFs bind in dense clusters around genes, suggesting widespread interaction among them 30.
Only for three TFs could we not find that the selected predictors lead to a better PPI score than a randomly chosen set. We note that for two of those three, TAF1 and MAX, the performance of the ensemble RF classifiers improved only marginally, or not at all, compared to the tissue-specific classifiers. This suggests that our model does not account for the true interaction partners of these TFs. Indeed, an inspection of the STRING database for TAF1 revealed that only TAF1 itself and TBP are among the top 20 regulators that are included in our PWM collection. For the remaining interaction partners, mostly TFs of the TAF family, no binding motif is available in the public repositories; thus they are not included in our PWM collection and can therefore not be used by the RF classifiers. Similarly, for MAX, only 5 out of 20 high-confidence interaction partners are included in our PWM collection. Specifically, no PWM is available for 6 TFs interacting with MAX, while the remaining interacting proteins are not categorized as TFs. Overall, our approach benefits from data availability (Figure 4a). If only a few of a TF's co-factors are represented in our PWM collection, it will be harder to adequately model the co-factor binding behavior of that TF across tissues. Also, the more diverse the co-factor landscape of a TF is between tissues, the harder it will be to learn a general model. Another crucial aspect in this respect is the quality of the PWMs. During the challenge, we realized that the selection of PWMs is crucial for model performance, and that it is necessary to compare PWMs obtained from different sources to ensure that the one with the highest information content is used. Nevertheless, instead of using a more recent method to model TF motifs, we stuck to PWMs because (1) they are the most common way to describe the sequence specificity of TFs, (2) they are available for a large number of TFs, and (3) they can be interpreted easily.
Switching the feature design for the RF classifiers from (1) DHS-based to (2) bin-based showed that DHSs are indispensable for reducing the false positive rate of TFBS predictions (Figure 6). Using only bins, without DHS information, we could improve the recall of TFBS predictions, but only at the cost of poor precision. The explanation for this behavior is the difference in the size of the genomic search space between the two feature setups. The bin-based models have a low misclassification rate in the Bound case, because they consider the whole genome without neglecting any sites beforehand, thus improving recall. However, our observations suggest that considering only the raw signal does not sufficiently correct for false positive sites, as opposed to using DHSs, which yield an improved misclassification rate in the Unbound case compared to the raw signal.
In general, both training and evaluating TFBS prediction methods is challenging due to the class imbalance, i.e. there are many more Unbound (negative) than Bound (positive) sites in the genome. This requires both (a) training approaches that avoid over-fitting towards one of the two classes and (b) evaluation strategies accounting for this issue. Here, we show misclassification rates separately for the positive and negative classes to avoid a bias caused by the dominant Unbound case.
We note that our current investigation is not aimed at constructing a genome-wide classifier, a setting in which the Unbound case is the most abundant. To achieve that, the highly unbalanced training data situation would need to be taken into account, for instance in the loss function of the classifier. Aside from these technical aspects, we show that modeling cofactors is helpful to predict TFBS and that ensemble learning is a promising technique to generalize information across tissues.
Data availability
The raw data used in this study is available online at Synapse: https://www.synapse.org/#!Synapse:syn6112317.
Software availability
Code generated as part of this analysis is available on GitHub: https://github.com/SchulzLab/TFAnalysis
Archived code at the time of publication: http://doi.org/10.5281/zenodo.1409697 31
License: MIT
Acknowledgements
We thank everyone involved in organizing the ENCODE-DREAM in vivo Transcription Factor binding site prediction challenge and are grateful for the opportunity to share this article. The PPI scoring matrix used in this study was kindly provided by Sebastian Köhler.
Funding Statement
This work was supported by the Cluster of Excellence on Multimodal Computing and Interaction (DFG) [EXC248].
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 1; peer review: 2 approved with reservations]
References
- 1. Vaquerizas JM, Kummerfeld SK, Teichmann SA, et al.: A census of human transcription factors: function, expression and evolution. Nat Rev Genet. 2009;10(4):252–263. doi: 10.1038/nrg2538
- 2. Natarajan A, Yardimci GG, Sheffield NC, et al.: Predicting cell-type-specific gene expression from regions of open chromatin. Genome Res. 2012;22(9):1711–1722. doi: 10.1101/gr.135129.111
- 3. Mathelier A, Fornes O, Arenillas DJ, et al.: JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2016;44(D1):D110–115. doi: 10.1093/nar/gkv1176
- 4. Pique-Regi R, Degner JF, Pai AA, et al.: Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 2011;21(3):447–455. doi: 10.1101/gr.112623.110
- 5. Luo K, Hartemink AJ: Using DNase digestion data to accurately identify transcription factor binding sites. Pac Symp Biocomput. 2013;80–91. doi: 10.1142/9789814447973_0009
- 6. Gusmao EG, Dieterich C, Zenke M, et al.: Detection of active transcription factor binding sites with the combination of DNase hypersensitivity and histone modifications. Bioinformatics. 2014;30(22):3143–3151. doi: 10.1093/bioinformatics/btu519
- 7. Kähärä J, Lähdesmäki H: BinDNase: a discriminatory approach for transcription factor binding prediction using DNase I hypersensitivity data. Bioinformatics. 2015;31(17):2852–2859. doi: 10.1093/bioinformatics/btv294
- 8. Yardımcı GG, Frank CL, Crawford GE, et al.: Explicit DNase sequence bias modeling enables high-resolution transcription factor footprint detection. Nucleic Acids Res. 2014;42(19):11865–11878. doi: 10.1093/nar/gku810
- 9. Cuellar-Partida G, Buske FA, McLeay RC, et al.: Epigenetic priors for identifying active transcription factor binding sites. Bioinformatics. 2012;28(1):56–62. doi: 10.1093/bioinformatics/btr614
- 10. O'Connor TR, Bailey TL: Creating and validating cis-regulatory maps of tissue-specific gene expression regulation. Nucleic Acids Res. 2014;42(17):11000–11010. doi: 10.1093/nar/gku801
- 11. Liu S, Zibetti C, Wan J, et al.: Assessing the model transferability for prediction of transcription factor binding sites based on chromatin accessibility. BMC Bioinformatics. 2017;18(1):355. doi: 10.1186/s12859-017-1769-7
- 12. Jayaram N, Usvyat D, R Martin AC: Evaluating tools for transcription factor binding site prediction. BMC Bioinformatics. 2016. doi: 10.1186/s12859-016-1298-9
- 13. Keilwagen J, Grau J: Varying levels of complexity in transcription factor binding motifs. Nucleic Acids Res. 2015;43(18):e119. doi: 10.1093/nar/gkv577
- 14. Alipanahi B, Delong A, Weirauch MT, et al.: Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33(8):831–838. doi: 10.1038/nbt.3300
- 15. ENCODE-DREAM in vivo transcription factor binding site prediction challenge. 2017; Accessed: 2018-02-03.
- 16. Waardenberg AJ, Homan B, Mohamed S, et al.: Prediction and validation of protein-protein interactors from genome-wide DNA-binding data using a knowledge-based machine-learning approach. Open Biol. 2016;6(9):160183. doi: 10.1098/rsob.160183
- 17. Roider HG, Kanhere A, Manke T, et al.: Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics. 2007;23(2):134–141. doi: 10.1093/bioinformatics/btl565
- 18. Ibrahim MM, Lacadie SA, Ohler U: JAMM: a peak finder for joint analysis of NGS replicates. Bioinformatics. 2015;31(1):48–55. doi: 10.1093/bioinformatics/btu568
- 19. Grant CE, Bailey TL, Noble WS: FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27(7):1017–1018. doi: 10.1093/bioinformatics/btr064
- 20. Tanay A: Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res. 2006;16(8):962–972. doi: 10.1101/gr.5113606
- 21. Crocker J, Abe N, Rinaldi L, et al.: Low affinity binding site clusters confer hox specificity and regulatory robustness. Cell. 2015;160(1–2):191–203. doi: 10.1016/j.cell.2014.11.041
- 22. Hume MA, Barrera LA, Gisselbrecht SS, et al.: UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res. 2015;43(Database issue):D117–122. doi: 10.1093/nar/gku1045
- 23. Kulakovskiy IV, Vorontsov IE, Yevshin IS, et al.: HOCOMOCO: expansion and enhancement of the collection of transcription factor binding sites models. Nucleic Acids Res. 2016;44(D1):D116–125. doi: 10.1093/nar/gkv1249
- 24. Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26(6):841–842. doi: 10.1093/bioinformatics/btq033
- 25. Liaw A, Wiener M: Classification and regression by randomForest. R News. 2002;2(3):18–22.
- 26. Köhler S, Bauer S, Horn D, et al.: Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet. 2008;82(4):949–958. doi: 10.1016/j.ajhg.2008.02.013
- 27. Szklarczyk D, Morris JH, Cook H, et al.: The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 2017;45(D1):D362–D368. doi: 10.1093/nar/gkw937
- 28. Kannan MB, Solovieva V, Blank V: The small MAF transcription factors MAFF, MAFG and MAFK: current knowledge and perspectives. Biochim Biophys Acta. 2012;1823(10):1841–1846. doi: 10.1016/j.bbamcr.2012.06.012
- 29. Igarashi K, Kataoka K, Itoh K, et al.: Regulation of transcription by dimerization of erythroid factor NF-E2 p45 with small Maf proteins. Nature. 1994;367(6463):568–572. doi: 10.1038/367568a0
- 30. Yan J, Enge M, Whitington T, et al.: Transcription factor binding in human cells occurs in dense clusters formed around cohesin anchor sites. Cell. 2013;154(4):801–813. doi: 10.1016/j.cell.2013.07.034
- 31. SchulzLab, Schmidt F: Florian411/TFAnalysis: Release for F1000 article (Version 1.0). Zenodo. 2018. doi: 10.5281/zenodo.1409697