2018 Nov 16;21(1):120–134. doi: 10.1093/bib/bby110

Table 1.

Prediction methods of protein binding sites

Publication | Features | Method | Software | Description
Whitington et al. [13] | Histone | PWM scan followed by filtering on chromatin features | N/A | Scan the genome using the motif PWM to detect candidate binding sites. Sites are then filtered by several genetic and epigenetic criteria using ad hoc thresholds.
He et al. [14] | Nucleosome-resolution histone | Dynamics of nucleosome occupancy and motif | N/A | Use differential H3K4me2 ChIP-seq signals to measure nucleosome positioning, followed by motif analysis to predict TF binding dynamics.
Won et al. [17] | Histone | HMM | Chromia (http://wanglab.ucsd.edu/star/job.php?s=chromia) | Use sequence counts in fixed-size bins from a number of histone ChIP-seq data sets as inputs to a three-state HMM.
Ramsey et al. [16] | Histone acetylation ChIP-seq, nucleosome occupancy, DNA sequence | Weighted sum of scores | RamseyHAc2010 (http://magnet.systemsbiology.net/hac/) | Take 100 bp intervals of transcript-proximal regions as candidate regions. Sequence and epigenetic features are used as predictors. The weighted sum of thresholded predictor values is treated as the prediction score.
Pique-Regi et al. [21] | DNase-seq, DNA sequence | Two-component mixture model, expectation-maximization (EM) algorithm | CENTIPEDE (http://centipede.uchicago.edu/) | Compute a prior for binding from sequence-based features. Binned read counts from DNase-seq are assumed to come from a mixture model, which is fitted by an EM algorithm. Posterior probabilities of binding are obtained.
Cuellar-Partida et al. [26] | DNase-seq, histone | Epigenetic data as prior, use motif to predict | FIMO, part of MEME (http://meme-suite.org/doc/fimo.html) | Prior probabilities of binding are derived from epigenetic data. Posterior probabilities are computed from the prior and motif scores based on a Bayesian model.
Arvey et al. [9] | Histone, DNase-seq | SVM | N/A | Two SVMs are trained based on dinucleotide mismatch k-mer features of 100 bp regions and read counts from 100 bp bins.
Ji et al. [18] | Nine histones | Principal component analysis (PCA)-type unsupervised learning | dPCA (http://www.biostat.jhsph.edu/dpca/) | Binned read counts from several ChIP-seq data sets are obtained. Differences in counts between two conditions are decomposed using PCA. The motif sites with large PC1 scores are deemed differential binding sites.
Gusmao et al. [20] | DNase-seq, histone | HMM | HINT (http://www.regulatory-genomics.org/hint/introduction/) | Genome-wide read counts in 5000 bp bins are obtained and normalized. An eight-state HMM (with one of the states being the protein-binding footprint) with multivariate outcome is fitted.
Sung et al. [19] | DNase-seq, Motif (4-mer) | Tests based on counts | Dnase2TF (https://sourceforge.net/projects/dnase2tfr/) | Identify DHSs from DNase-seq data. For each DHS, look for 'dips' within the peaks. The dips are then combined with motifs to define binding sites.
Yardimci et al. [22] | DNase-seq, bias adjustment | Two-component mixture model | FootprintMixture (https://ohlerlab.mdc-berlin.de/software/FootprintMixture_109/) | A two-component mixture model is used to predict binding sites. Binned read counts around candidate sites are modeled by a factor-specific multinomial distribution.
Sherwood et al. [23] | DNase-seq (magnitude + shape around motif match sites) | Expectation propagation | PIQ (http://piq.csail.mit.edu/) | Scan the genome using the motif PWM to detect candidate binding sites. Binned read counts from DNase-seq are analyzed to build TF-binding profiles (shape and magnitude). The posterior probability of binding is computed using the expectation propagation method.
Xu et al. [32] | Methylation, genomic features | Random forest + two-component mixture model for methylation | Methylphet (https://github.com/benliemory/Methylphet) | Compute methylation scores based on a mixture model, then use the scores with other features as predictors in a random forest to predict binding sites.
Alipanahi et al. [34] | DNA sequence (using binding array data as target) | CNN | DeepBind (http://tools.genes.toronto.edu/deepbind/) | Use a deep learning framework to model the relationships between DNA sequence patterns and TF binding sites. A CNN is used to capture sequence patterns. The trained model can be applied to a new sequence with variations to estimate risks by predicting changes in TF binding affinity.
Quach and Furey [27] | DNase-seq (profile: mean and slope, centered at motif) | SVM | DeFCoM (https://bitbucket.org/bryancquach/defcom) | Find candidate regions based on motifs. Binned read counts from DNase-seq around the candidate regions are obtained for varying-sized bins. Read counts are used as predictors in an SVM.
Jankowski et al. [24] | DNase-seq (shape, on motif-matched sites) | Two-component mixture model, EM algorithm | Romulus (https://github.com/ajank/Romulus) | Similar to CENTIPEDE, except for the binning strategy: 20 bp bins are used outside the motif site and single-bp bins within the motif site. For non-binding sites, all positions are put in a single bin.
Liu et al. [28] | DNase-seq (footprint score defined by counts) + genomic features | Random forest | BPAC (http://bioinfo.wilmer.jhu.edu/BPAC/) | Motif PWM scores and different types of read count features are used as predictors to train a random forest for prediction.
Ma et al. [33] | DNA sequence, shape kernel | Support vector regression | Sequence-shape (https://bitbucket.org/wenxiu/sequence-shape.git) | Use kernel functions to model both DNA sequence and shape features simultaneously, then perform kernel-based regression and classification to predict TF–DNA interaction. It shows that incorporating DNA shape information can improve prediction accuracy.
Kuang et al. [29] | Histone, DNase-seq, on known motif-matched sites | Random forest | DynaMO (https://github.com/spo111/DynaMO) | Use motif sites as candidate regions. Binned read counts around the motif sites are used as predictors in a random forest for prediction.
Chen et al. [25] | DNase-seq, ATAC-seq and DNA sequence features | Three-component mixture model; sparse logistic regression; weighted least squares | Mocap (https://github.com/xc406/Mocap) | Take motif sites as candidate regions. A three-component mixture model is used for the read counts, and sparse logistic regression is applied to a number of features. The cross-sample method uses weighted least squares to minimize a loss function.
Quang et al. (unpublished) | DNase-seq | CNN + RNN | FactorNet (https://github.com/uci-cbcl/FactorNet) | Use a number of features to train a deep learning model (DanQ CNN-RNN hybrid architecture). Features include genome sequences and annotations, gene expression and DNase-seq data.
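
Several methods in Table 1 (for example Whitington et al. [13] and Sherwood et al. [23]) begin by scanning the genome with a motif PWM to find candidate binding sites. The following is a minimal sketch of that scanning step only, not the code of any listed tool. The toy 4-bp motif, the uniform background of 0.25, the threshold and the function names `to_log_odds` and `scan` are all assumptions made for illustration, and only the forward strand is scanned.

```python
import math

# Hypothetical toy position probability matrix for a 4-bp motif (assumption).
# Each entry gives P(A), P(C), P(G), P(T) at one motif position.
TOY_PPM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def to_log_odds(ppm, background=BACKGROUND):
    """Convert base probabilities to log2 odds against the background."""
    return [{base: math.log2(p / background) for base, p in column.items()}
            for column in ppm]


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for forward-strand windows scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in "ACGT" for base in window):
            continue
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


PWM = to_log_odds(TOY_PPM)
hits = scan("TTACGTACGA", PWM, threshold=4.0)
for start, window, score in hits:
    print(start, window, round(score, 2))
```

In the methods listed above, candidate sites found this way are then filtered or rescored with chromatin or DNase-seq features; a real scan would also cover the reverse complement strand.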
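
The two-component mixture models in Table 1 (CENTIPEDE, FootprintMixture, Romulus) treat read counts around candidate sites as coming from either a bound or an unbound component, fit the mixture by EM and report posterior probabilities of binding. The sketch below illustrates only that general idea under a strong simplification: each site is reduced to a single total read count and each component is a Poisson distribution, which is an assumption of this example and not the likelihood used by any of those tools. The function names and the toy counts are hypothetical.

```python
import math


def poisson_logpmf(k, lam):
    """Log probability of count k under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def posterior_bound(k, pi, lam_unbound, lam_bound):
    """Posterior probability that a site with count k is in the bound component."""
    log_b = math.log(pi) + poisson_logpmf(k, lam_bound)
    log_u = math.log(1.0 - pi) + poisson_logpmf(k, lam_unbound)
    m = max(log_b, log_u)  # subtract the max for numerical stability
    return math.exp(log_b - m) / (math.exp(log_b - m) + math.exp(log_u - m))


def fit_two_poisson_mixture(counts, n_iter=100):
    """Fit a two-component Poisson mixture by EM (simplified illustration)."""
    mean = sum(counts) / len(counts)
    lam_unbound = max(0.5 * mean, 1e-3)  # crude initial values (assumption)
    lam_bound = max(1.5 * mean, 1e-3)
    pi = 0.5
    for _ in range(n_iter):
        # E-step: responsibility of the bound component for each site.
        resp = [posterior_bound(k, pi, lam_unbound, lam_bound) for k in counts]
        # M-step: update mixing weight and the two Poisson means.
        total = sum(resp)
        pi = min(max(total / len(counts), 1e-6), 1.0 - 1e-6)
        lam_bound = max(
            sum(r * k for r, k in zip(resp, counts)) / max(total, 1e-12), 1e-6)
        lam_unbound = max(
            sum((1.0 - r) * k for r, k in zip(resp, counts))
            / max(len(counts) - total, 1e-12), 1e-6)
    posteriors = [posterior_bound(k, pi, lam_unbound, lam_bound) for k in counts]
    return pi, lam_unbound, lam_bound, posteriors


# Toy read counts at 13 hypothetical candidate sites (made-up numbers).
counts = [0, 1, 2, 1, 0, 2, 1, 30, 28, 35, 1, 0, 33]
pi, lam_unbound, lam_bound, posteriors = fit_two_poisson_mixture(counts)
```

The tools in Table 1 differ from this sketch in their count models, their binning of reads around the motif and, for CENTIPEDE, in using sequence-based features to set the prior.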