. 2018 Nov 16;21(1):120–134. doi: 10.1093/bib/bby110

Table 1.

Prediction methods of protein binding sites

Publication	Features	Method	Software	Description
Whitington et al. [13]	Histone	PWM scan followed by filtering on chromatin features	N/A	Scan the genome using a motif PWM to detect candidate binding sites. Sites are then filtered by several genetic and epigenetic criteria using ad hoc thresholds.
He et al. [14]	Nucleosome-resolution histone	Dynamics of nucleosome occupancy and motif	N/A	Use differential H3K4me2 ChIP-seq signals to measure nucleosome positioning, followed by motif analysis to predict TF binding dynamics.
Won et al. [17]	Histone	HMM	Chromia (http://wanglab.ucsd.edu/star/job.php?s=chromia)	Use sequence counts of fixed-sized bins for a number of histone ChIP-seq data sets as inputs for a three-state HMM.
Ramsey et al. [16]	Histone acetylation ChIP-seq, nucleosome occupancy, DNA sequence	Weighted sum of scores	RamseyHAc2010 (http://magnet.systemsbiology.net/hac/)	Take 100 bp intervals of transcript-proximal regions as candidate regions. Sequence and epigenetic features are used as predictors. A weighted sum of the thresholded predictor values is treated as the prediction score.
Pique-Regi et al. [21]	DNase-seq, DNA sequence	Two-component mixture model, expectation-maximization (EM) algorithm	CENTIPEDE (http://centipede.uchicago.edu/)	Compute prior for binding from sequence-based features. Binned read counts from DNase-seq are assumed to be from a mixture model, which is fitted by an EM algorithm. Posterior probabilities of binding are obtained.
Cuellar-Partida et al. [26]	DNase-seq, histone	Epigenetic data as prior, use motif to predict	FIMO, part of MEME (http://meme-suite.org/doc/fimo.html)	Prior probabilities of binding are derived from epigenetic data. Posterior probabilities are computed from prior and motif scores based on a Bayesian model.
Arvey et al. [9]	Histone, DNase-seq	SVM	N/A	Two SVMs are trained based on dinucleotide mismatch k-mer features of 100 bp regions and read counts from 100 bp bins.
Ji et al. [18]	Nine histones	Principal component analysis (PCA)-type unsupervised learning	dPCA (http://www.biostat.jhsph.edu/dpca/)	Binned read counts from several ChIP-seq data sets are obtained. Differences in counts between two conditions are decomposed using PCA. The motif sites with large PC1 scores are deemed differential binding sites.
Gusmao et al. [20]	DNase-seq, histone	HMM	HINT (http://www.regulatory-genomics.org/hint/introduction/)	Genome-wide read counts in 5000 bp bins are obtained and normalized. Eight-state HMM (with one of the states being the protein-binding footprint) with multivariate outcome is fitted.
Sung et al. [19]	DNase-seq, Motif (4-mer)	Tests based on counts	Dnase2TF (https://sourceforge.net/projects/dnase2tfr/)	Identify DNase I hypersensitive sites (DHSs) from DNase-seq data. For each DHS, look for 'dips' within the peaks. The dips are then combined with motifs to define binding sites.
Yardimci et al. [22]	DNase-seq, bias adjustment	Two-component mixture model	FootprintMixture (https://ohlerlab.mdc-berlin.de/software/FootprintMixture_109/)	Two-component mixture model is used to predict binding sites. Binned read counts around candidate sites are modeled by factor-specific multinomial distribution.
Sherwood et al. [23]	DNase-seq (magnitude + shape around motif match sites)	Expectation propagation	PIQ (http://piq.csail.mit.edu/)	Scan the genome using motif PWM to detect candidate binding sites. Binned read counts from DNase-seq are analyzed to build TF-binding profiles (shapes and magnitude). Posterior probability of binding is computed using expectation propagation method.
Xu et al. [32]	Methylation, genomic features	Random forest + two-component mixture model for methylation	Methylphet (https://github.com/benliemory/Methylphet)	Compute methylation scores based on a mixture model, then use the scores together with other features as predictors in a random forest to predict binding sites.
Alipanahi et al. [34]	DNA sequence (using binding array data as target)	CNN	DeepBind (http://tools.genes.toronto.edu/deepbind/)	Use a deep learning framework to model the relationships between DNA sequence patterns and TF binding sites. CNN is used to capture sequence patterns. Trained model can be applied to a new sequence with variations to estimate risks by predicting changes in TF binding affinity.
Quach and Furey [27]	DNase-seq (profile: mean and slope, centered at motif)	SVM	DeFCoM (https://bitbucket.org/bryancquach/defcom)	Find candidate regions based on motifs. Binned read counts from DNase-seq around the candidate regions are obtained for varying-sized bins. Read counts are used as predictors in an SVM.
Jankowski et al. [24]	DNase-seq (shape, on motif-matched sites)	Two-component mixture model, EM algorithm	Romulus (https://github.com/ajank/Romulus)	Similar to CENTIPEDE, except for the binning strategy. They use 20 bp bins outside the motif site and single-bp bins within the motif site. For non-binding sites, they put all the positions in a single bin.
Liu et al. [28]	DNase-seq (footprint score defined by counts) + genomic features	Random forest	BPAC (http://bioinfo.wilmer.jhu.edu/BPAC/)	Motif PWM scores and different types of read count features are used as predictors to train a random forest for prediction.
Ma et al. [33]	DNA sequence, shape kernel	Support vector regression	Sequence-shape (https://bitbucket.org/wenxiu/sequence-shape.git)	Use kernel functions to model both DNA sequence and shape features simultaneously. It then performs kernel-based regression and classification to predict TF–DNA interaction. It shows that incorporation of DNA shape information can improve prediction accuracy.
Kuang et al. [29]	Histone, DNase-seq, on known motif-matched sites	Random forest	DynaMO (https://github.com/spo111/DynaMO)	Use motif sites as candidate regions. Binned read counts around the motif sites are used as predictors in random forest for prediction.
Chen et al. [25]	DNase-seq, ATAC-seq and DNA sequence features	Three-component mixture model; sparse logistic regression; weighted least squares	Mocap (https://github.com/xc406/Mocap)	Take motif sites as candidate regions. A three-component mixture model is fitted to the read counts, and sparse logistic regression is applied to a number of features. The cross-sample method uses weighted least squares to minimize a loss function.
Quang et al. (unpublished)	DNase-seq	CNN + RNN	FactorNet (https://github.com/uci-cbcl/FactorNet)	Use a number of features to train a deep learning model (DanQ CNN-RNN hybrid architecture). Features include genome sequences and annotations, gene expression and DNase-seq data.
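
Several methods in Table 1 (for example, Whitington et al. [13] and Sherwood et al. [23]) begin by scanning the genome with a motif PWM to obtain candidate binding sites. The following is a minimal sketch of that first step only, not the implementation used by any of the listed tools. The PWM values, the uniform background frequency, the score threshold and the function names `log_odds` and `scan` are all hypothetical and chosen for illustration.

```python
import math

# Hypothetical 4-position PWM (per-base probabilities at each motif position).
# These values are illustrative and are not taken from any cited publication.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

# Assumed uniform background base frequency.
BACKGROUND = 0.25


def log_odds(window):
    """Sum of per-position log2(PWM probability / background) for one window."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(PWM, window))


def scan(sequence, threshold):
    """Return (start, window, score) for every window scoring at or above threshold."""
    width = len(PWM)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in "ACGT" for base in window):
            continue
        score = log_odds(window)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


if __name__ == "__main__":
    for start, window, score in scan("TTACGTAA", threshold=4.0):
        print(start, window, round(score, 2))
```

In the methods summarized above, candidate sites found this way are not final predictions: they are subsequently filtered or rescored using chromatin data such as histone ChIP-seq or DNase-seq read counts. A full scan would also typically cover the reverse strand, which this sketch omits.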