Whitington et al. [13]
Histone
PWM scan followed by filtering on chromatin features
N/A
Scan the genome using motif PWM to detect candidate binding sites. Sites are then filtered by several genetic and epigenetic criteria using ad hoc thresholds.
He et al. [14] |
Nucleosome-resolution histone |
Dynamics of nucleosome occupancy and motif |
N/A |
Use differential H3K4me2 ChIP-seq signals to measure nucleosome positioning, followed by motif analysis to predict TF binding dynamics. |
Won et al. [17] |
Histone |
HMM |
Chromia (http://wanglab.ucsd.edu/star/job.php?s=chromia) |
Use sequence counts in fixed-size bins from a number of histone ChIP-seq data sets as inputs to a three-state HMM.
Ramsey et al. [16] |
Histone acetylation ChIP-seq, nucleosome occupancy, DNA sequence |
Weighted sum of scores |
RamseyHAc2010 (http://magnet.systemsbiology.net/hac/) |
Take 100 bp intervals of transcript-proximal regions as candidate regions. Sequence and epigenetic features are used as predictors. A weighted sum of thresholded predictor values is used as the prediction score.
Pique-Regi et al. [21] |
DNase-seq, DNA sequence |
Two-component mixture model, expectation-maximization (EM) algorithm |
CENTIPEDE (http://centipede.uchicago.edu/) |
Compute a prior probability of binding from sequence-based features. Binned read counts from DNase-seq are assumed to come from a mixture model, which is fitted by an EM algorithm. Posterior probabilities of binding are then obtained.
Cuellar-Partida et al. [26] |
DNase-seq, histone |
Epigenetic data as prior, use motif to predict |
FIMO, part of MEME (http://meme-suite.org/doc/fimo.html) |
Prior probabilities of binding are derived from epigenetic data. Posterior probabilities are computed from prior and motif scores based on a Bayesian model. |
Arvey et al. [9] |
Histone, DNase-seq |
SVM |
N/A |
Two SVMs are trained based on dinucleotide mismatch k-mer features of 100 bp regions and read counts from 100 bp bins. |
Ji et al. [18] |
Nine histones |
Principal component analysis (PCA)-type unsupervised learning |
dPCA (http://www.biostat.jhsph.edu/dpca/) |
Binned read counts from several ChIP-seq data sets are obtained. Differences in counts between two conditions are decomposed using PCA. The motif sites with large PC1 scores are deemed differential binding sites. |
Gusmao et al. [20] |
DNase-seq, histone |
HMM |
HINT (http://www.regulatory-genomics.org/hint/introduction/) |
Genome-wide read counts in 5000 bp bins are obtained and normalized. An eight-state HMM with multivariate outcome, in which one state represents the protein-binding footprint, is fitted.
Sung et al. [19] |
DNase-seq, Motif (4-mer) |
Tests based on counts |
DNase2TF (https://sourceforge.net/projects/dnase2tfr/)
Identify DNase I hypersensitive sites (DHSs) from DNase-seq data. For each DHS, look for 'dips' within the peaks. The dips are then combined with motifs to define binding sites.
Yardimci et al. [22] |
DNase-seq, bias adjustment |
Two-component mixture model |
FootprintMixture (https://ohlerlab.mdc-berlin.de/software/FootprintMixture_109/) |
A two-component mixture model is used to predict binding sites. Binned read counts around candidate sites are modeled by a factor-specific multinomial distribution.
Sherwood et al. [23] |
DNase-seq (magnitude + shape around motif match sites) |
Expectation propagation |
PIQ (http://piq.csail.mit.edu/) |
Scan the genome using motif PWM to detect candidate binding sites. Binned read counts from DNase-seq are analyzed to build TF-binding profiles (shape and magnitude). The posterior probability of binding is computed using expectation propagation.
Xu et al. [32] |
Methylation, genomic features |
Random forest + two-component mixture model for methylation |
Methylphet (https://github.com/benliemory/Methylphet) |
Compute methylation scores based on a mixture model, then use the scores together with other features as predictors in a random forest to predict binding sites.
Alipanahi et al. [34] |
DNA sequence (using binding array data as target) |
CNN |
DeepBind (http://tools.genes.toronto.edu/deepbind/) |
Use a deep learning framework to model the relationships between DNA sequence patterns and TF binding sites. A CNN is used to capture sequence patterns. The trained model can be applied to a new sequence containing variants to estimate their risk by predicting changes in TF binding affinity.
Quach and Furey [27] |
DNase-seq (profile: mean and slope, centered at motif) |
SVM |
DeFCoM (https://bitbucket.org/bryancquach/defcom) |
Find candidate regions based on motifs. Binned read counts from DNase-seq around the candidate regions are obtained for varying-sized bins. Read counts are used as predictors in an SVM. |
Jankowski et al. [24] |
DNase-seq (shape, on motif-matched sites) |
Two-component mixture model, EM algorithm |
Romulus (https://github.com/ajank/Romulus) |
Similar to CENTIPEDE, except for the binning strategy. They use 20 bp bins outside the motif site and single-bp bins within the motif site. For non-binding sites, they put all the positions in a single bin. |
Liu et al. [28] |
DNase-seq (footprint score defined by counts) + genomic features |
Random forest |
BPAC (http://bioinfo.wilmer.jhu.edu/BPAC/) |
Motif PWM scores and different types of read count features are used as predictors to train a random forest for prediction. |
Ma et al. [33] |
DNA sequence, shape kernel |
Support vector regression |
Sequence-shape (https://bitbucket.org/wenxiu/sequence-shape.git) |
Use kernel functions to model DNA sequence and shape features simultaneously, then perform kernel-based regression and classification to predict TF–DNA interaction. The authors show that incorporating DNA shape information can improve prediction accuracy.
Kuang et al. [29] |
Histone, DNase-seq, on known motif-matched sites |
Random forest |
DynaMO (https://github.com/spo111/DynaMO) |
Use motif sites as candidate regions. Binned read counts around the motif sites are used as predictors in a random forest for prediction.
Chen et al. [25] |
DNase-seq, ATAC-seq and DNA sequence features |
Three-component mixture model; sparse logistic regression; weighted least squares
Mocap (https://github.com/xc406/Mocap) |
Take motif sites as candidate regions. A three-component mixture model is used for the read counts, and sparse logistic regression is applied to a number of features. The cross-sample method uses weighted least squares to minimize a loss function.
Quang et al. (unpublished) |
DNase-seq |
CNN + RNN |
FactorNet (https://github.com/uci-cbcl/FactorNet) |
Use a number of features to train a deep learning model (DanQ CNN-RNN hybrid architecture). Features include genome sequences and annotations, gene expression and DNase-seq data. |
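Several of the methods listed above (CENTIPEDE, FootprintMixture, Romulus) rest on a two-component mixture model, fitted by EM in the case of CENTIPEDE and Romulus. As a rough illustration only, the sketch below fits a mixture of two Poisson distributions to total read counts at candidate sites and returns a posterior probability of binding for each site. It does not reproduce the model of any listed tool: the Poisson likelihood, the function name fit_mixture, the starting values and the toy counts are all assumptions made for this example.

```python
import math


def poisson_logpmf(k, lam):
    """Log probability of observing count k under Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def fit_mixture(counts, n_iter=200, tol=1e-9):
    """Fit a two-component Poisson mixture (unbound / bound) by EM.

    Hypothetical helper for illustration. Returns
    (pi_bound, lam_unbound, lam_bound, posteriors), where posteriors[i]
    is the estimated P(bound | counts[i]).
    """
    n = len(counts)
    mean = sum(counts) / n
    # Assumed starting values: one rate below and one above the overall mean.
    lam_u, lam_b = 0.5 * mean + 1e-3, 1.5 * mean + 1e-3
    pi = 0.5
    post = [0.5] * n
    prev_ll = -math.inf

    for _ in range(n_iter):
        # E-step: posterior probability that each site is bound.
        ll = 0.0
        for i, k in enumerate(counts):
            log_b = math.log(pi) + poisson_logpmf(k, lam_b)
            log_u = math.log(1.0 - pi) + poisson_logpmf(k, lam_u)
            m = max(log_b, log_u)  # log-sum-exp for numerical stability
            log_norm = m + math.log(math.exp(log_b - m) + math.exp(log_u - m))
            post[i] = math.exp(log_b - log_norm)
            ll += log_norm

        # M-step: update the mixing weight and the two Poisson rates.
        w_b = sum(post)
        w_u = n - w_b
        pi = min(max(w_b / n, 1e-6), 1.0 - 1e-6)
        lam_b = max(sum(p * k for p, k in zip(post, counts)) / max(w_b, 1e-12), 1e-6)
        lam_u = max(sum((1.0 - p) * k for p, k in zip(post, counts)) / max(w_u, 1e-12), 1e-6)

        # Stop once the log-likelihood no longer improves.
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return pi, lam_u, lam_b, post


# Toy read counts at 15 candidate motif sites (made-up numbers):
# the first ten resemble background, the last five resemble bound sites.
counts = [2, 3, 1, 4, 2, 3, 2, 1, 3, 2, 18, 22, 19, 25, 21]
pi_bound, lam_u, lam_b, posteriors = fit_mixture(counts)

print(f"estimated fraction bound: {pi_bound:.3f}")
print(f"unbound rate: {lam_u:.2f}, bound rate: {lam_b:.2f}")
for k, p in zip(counts, posteriors):
    print(k, round(p, 4))
```

In the listed tools the mixing weight is typically replaced by a site-specific prior derived from sequence features, and the likelihood models the spatial profile of reads around the motif rather than a single total count.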