PLOS Computational Biology. 2022 Sep 12;18(9):e1009921. doi: 10.1371/journal.pcbi.1009921

TAMC: A deep-learning approach to predict motif-centric transcriptional factor binding activity based on ATAC-seq profile

Tianqi Yang 1,2,*, Ricardo Henao 3,4,*
Editor: Saurabh Sinha 5
PMCID: PMC9499209  PMID: 36094959

Abstract

Determining transcriptional factor binding sites (TFBSs) is critical for understanding the molecular mechanisms regulating gene expression in different biological conditions. Biological assays designed to directly map TFBSs require large sample sizes and intensive resources. As an alternative, the ATAC-seq assay is simple to conduct and provides genomic cleavage profiles that contain rich information for imputing TFBSs indirectly. Previous footprint-based tools are inherently limited by the accuracy of their bias correction algorithms and the efficiency of their feature extraction models. Here we introduce TAMC (Transcriptional factor binding prediction from ATAC-seq profile at Motif-predicted binding sites using Convolutional neural networks), a deep-learning approach for predicting motif-centric TF binding activity from paired-end ATAC-seq data. TAMC does not require bias correction during signal processing. By leveraging a one-dimensional convolutional neural network (1D-CNN) model, TAMC makes predictions based on both footprint and non-footprint features at binding sites for each TF and outperforms existing footprinting tools in TFBS prediction, particularly for ATAC-seq data with limited sequencing depth.

Author summary

Applications of deep learning models are rapidly gaining popularity in recent biological studies because of their efficiency in analyzing non-linear patterns from feature-rich data. In this study, we developed a deep learning method to predict transcription factor binding sites based on chromatin accessibility profiles. Compared to previous methods using scoring functions and classical machine learning algorithms, our method forgoes the need for bias correction during signal processing and significantly increases the efficiency in extracting features at transcription factor binding sites. In addition, we showed that our method outperforms previous methods, particularly for chromatin accessibility data with shallow sequencing depth. We applied our method to predict changes in binding sites of a transcription factor, CTCF, during early embryonic development based on bulk chromatin accessibility profiles. We then discuss the potential application of our method to transcription factor binding site prediction using single-cell chromatin accessibility profiles, as well as possible strategies to further improve the performance of our method in the future.


This is a PLOS Computational Biology Methods paper.

Introduction

Transcription factors (TFs) are proteins that bind to conserved genomic sequence motifs and have functions in regulating gene expression [1]. Determining TF binding sites (TFBSs) is essential for deciphering the molecular mechanisms regulating gene expression across different biological conditions. Biological assays, such as ChIP-seq [2] and CUT&RUN [3], have been used as the standard experimental methods for mapping genome-wide interactions between TFs and chromatin. However, these experiments are resource-intensive and can measure only one TF at one time, which largely limits their applications in many situations. To address these limitations, computational methods, discussed below, have been proposed to impute TFBSs.

Traditional computational TFBS prediction methods scan DNA sequence with position weight matrices (PWMs) of TF binding motifs to predict TFBSs [4,5], yet these methods suffer from high false-positive rates [6]. Recent studies have shown that more than 90% of TF binding events take place at open chromatin regions [7] that can be mapped by enzymatic cleavage assays such as DNase-I sequencing (DNase-seq) [8] and Assay for Transposase Accessible Chromatin sequencing (ATAC-seq) [7]. Notably, bound TFs hinder the activity of cutting enzymes and leave footprint sites characterized by lower cleavage signal frequency compared to surrounding regions [9]. Therefore, TF-bound and unbound sites can theoretically be distinguished by their footprint pattern.

Several computational methods have been developed to investigate footprint patterns in chromatin cleavage profiles [10–19]. As chromatin cleavage events generated by cutting enzymes (e.g., DNase-I used in DNase-seq and Tn5 transposase used in ATAC-seq) are biased towards different sequences, previous tools initially designed for DNase-seq data often give poor predictions using ATAC-seq data [16,20,21]. To date, HINT-ATAC [16] and TOBIAS [17] are two representative footprinting tools specifically designed for ATAC-seq data, which has become the dominant data type for chromatin accessibility profiling because of the simplicity of the assay itself. TOBIAS uses a simple footprint score (FPS) metric to characterize the footprint pattern at single-base resolution, while HINT-ATAC uses a semi-supervised hidden Markov model (HMM) to predict footprint sites directly. Both tools significantly increase the accuracy in classifying bound/unbound sites from ATAC-seq data. Their limitation, however, is that they both require complex bias correction during signal processing because their models/algorithms are highly dependent on a measurable footprint pattern to make predictions.

More recently, deep-learning models are rapidly gaining popularity in biological studies because of their efficiency in analyzing complex (non-linear) patterns from feature-rich data. Here, we introduced a new TFBS prediction tool named TAMC (Transcriptional factor binding prediction from ATAC-seq profile at Motif-predicted binding sites using Convolutional neural networks). TAMC takes advantage of signal processing strategies in HINT-ATAC and TOBIAS except for the bias correction step to produce input signals that are then used to extract features of TFBSs using a 1D-convolutional neural network (1D-CNN) model (Fig 1A). By evaluating TAMC with different input signal configurations, we showed that TAMC does not require bias correction during signal processing and captures both footprint and non-footprint features of TFBSs efficiently. Importantly, TAMC models pretrained with multiple deeply sequenced ATAC-seq datasets significantly outperform HINT-ATAC and TOBIAS in TFBS prediction especially when using ATAC-seq data with limited sequencing depth. In our study, we have applied TAMC in predicting changes in binding sites of a specific TF, CTCF, during human zygotic genome activation (ZGA) using bulk ATAC-seq data. We also believe that TAMC is a competitive alternative method for TFBS prediction using single-cell ATAC-seq data.

Fig 1. TAMC model and input signal.

Fig 1

(A) Architecture of TAMC framework: three convolutional layers with 3 learnable filters of kernel sizes k = 3, 5, and 7, and ReLU activations, followed by max pooling, concatenation, and prediction of binding probability via two fully connected layers with sigmoid activation. (B) Default TAMC input signal structure. >1Nr, ATAC-seq read fragments larger than one nucleosome size; ≤1Nr, ATAC-seq read fragments equal to or smaller than one nucleosome size. (C) Representative tracks show the strategy of labeling bound and unbound CTCF MPBSs in the GM12878 cell type at the FRMD4B gene locus. MPBS, motif-predicted binding sites.

Results

TAMC overview

TAMC takes as input signal a combination of footprint profile and genomic cleavage profile (signals and slopes) within 500bp of the center of potential TF binding sites predicted by their binding sequence motifs, or motif-predicted binding sites (MPBSs) (Fig 1B). The footprint scores are calculated using the TOBIAS footprint score metric, and the genomic cleavage signals and slopes are calculated and normalized using HINT-ATAC input signal processing scripts with modifications at the bias correction step. The genomic cleavage signals and slopes are further separated into 8 channels by strand and size of ATAC-seq reads. As a result, each input signal has 9 channels in total: 1 channel for the footprint profile and 8 channels for the cleavage profiles (Fig 1B). The 9-channel input signals are fed into a 1D-CNN module and the convolution is performed with filter kernels moving at 1 base position (bp) per step in the direction from -500bp to +500bp from the motif center. As the length of binding sites varies between TFs, we utilized kernels with different sizes (k = 3, 5 and 7) to extract features for small and large binding sites (Fig 1A). For each kernel size, we repeated the convolution three times, and the three resulting feature maps were max-pooled to extract the most salient feature at each bp within the input region (Fig 1A). The max-pooled feature maps originating from the different kernel sizes are then concatenated before being fed to the fully connected layers and a final sigmoid activation function to predict TF binding probability (Fig 1A).
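The forward pass described above can be sketched in plain Python. This is only an illustrative sketch, not the published implementation: the function names, the zero "same" padding, the hidden layer width, the ReLU on the hidden layer and the random weights are all assumptions made here for illustration, and a short 40-position toy window is used instead of 1000 positions to keep the example fast.

```python
import math
import random

def conv1d_same(signal, kernel, bias=0.0):
    """Slide one multi-channel filter over the signal (stride 1), then apply ReLU.

    signal: list of channels, each a list of length L.
    kernel: list of channels, each a list of odd length k.
    Zero 'same' padding is an assumption of this sketch.
    """
    n_pos, half = len(signal[0]), len(kernel[0]) // 2
    out = []
    for pos in range(n_pos):
        total = bias
        for chan, weights in zip(signal, kernel):
            for j, w in enumerate(weights):
                idx = pos + j - half
                if 0 <= idx < n_pos:
                    total += w * chan[idx]
        out.append(max(0.0, total))  # ReLU activation
    return out

def dense(vec, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, biases)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tamc_like_forward(signal, params):
    """Return a binding probability for one 9-channel input window."""
    features = []
    for filters in params["conv"]:  # one group of filters per kernel size
        maps = [conv1d_same(signal, f) for f in filters]
        # Element-wise max across the feature maps of this kernel size.
        features.extend(max(vals) for vals in zip(*maps))
    hidden = [max(0.0, h) for h in dense(features, *params["fc1"])]
    return sigmoid(dense(hidden, *params["fc2"])[0])

def random_params(n_channels, length, kernel_sizes=(3, 5, 7),
                  n_filters=3, hidden=8, seed=0):
    """Random (untrained) weights; hidden=8 is a hypothetical layer width."""
    rng = random.Random(seed)
    def rand(n):
        return [rng.uniform(-0.1, 0.1) for _ in range(n)]
    conv = [[[rand(k) for _ in range(n_channels)] for _ in range(n_filters)]
            for k in kernel_sizes]
    n_feat = length * len(kernel_sizes)
    return {"conv": conv,
            "fc1": ([rand(n_feat) for _ in range(hidden)], rand(hidden)),
            "fc2": ([rand(hidden)], rand(1))}

# Toy input: 9 channels x 40 positions of made-up values.
rng = random.Random(1)
toy_signal = [[rng.random() for _ in range(40)] for _ in range(9)]
toy_params = random_params(n_channels=9, length=40)
probability = tamc_like_forward(toy_signal, toy_params)
```

With untrained weights the output is meaningless as a prediction; the point is only the data flow of three kernel sizes, three filters each, max pooling across filters, concatenation and a sigmoid output.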

The TAMC framework was trained and tested using published paired-end ATAC-seq data of three human cell types (GM12878, HepG2 and K562) (S1 Table). ChIP-seq data obtained from the same cell type as the ATAC-seq data were used for labeling MPBSs as bound or unbound (Fig 1C). Because of the high false positive rate of TF motifs [6], a large proportion of MPBSs are not actually bound by TFs (S1A Fig). To avoid biases due to training and evaluation with a disproportionally large unbound category, we randomly subsampled the bound and unbound MPBSs to have equal proportions for each TF before forming the training, validation, and testing datasets (S1B Fig). The resulting balanced datasets are large enough (more than 5000 labeled MPBSs for most TFs) to enable reliable model training and performance metrics robust to the random subsampling. The trained TAMC models were tested under two types of prediction settings: intra-data prediction (same ATAC-seq data for training and testing) and cross-data prediction (different ATAC-seq datasets for training and testing). In total, we trained and tested TAMC for 47 TFs with published ChIP-seq data for all three cell types in the ENCODE database (S2 Table). Area under the receiver operating characteristic curve (AUROC) was used to measure the trained models’ performance in classifying bound/unbound MPBSs.
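For readers unfamiliar with the metric, AUROC can be computed from ranks as the probability that a randomly chosen bound site receives a higher score than a randomly chosen unbound site, with ties counted as one half. The sketch below is a generic rank-based implementation under that definition; the function name and the toy labels and scores are made up for illustration and do not come from the study.

```python
def auroc(labels, scores):
    """Rank-based AUROC for binary labels (1 = bound, 0 = unbound)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores and give each the average 1-based rank.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one bound and one unbound site")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Made-up example: three of the four bound/unbound pairs are ordered correctly.
example = auroc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect separation of bound and unbound MPBSs.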

TAMC outperforms existing methods

To evaluate TAMC performance, we compared TAMC predictions with predictions generated using two representative footprinting tools, namely, TOBIAS and HINT-ATAC. For HINT-ATAC, we utilized published models that were pre-trained on GM12878 ATAC-seq data for testing [16]. TOBIAS has fixed parameters in its FPS metric and therefore does not require model training before testing [17]. We showed that TAMC gave the best classification of bound and unbound binding sites for most of the 47 TFs under both intra- and cross-data settings (Fig 2A). We further showed that TAMC models trained using multiple (GM12878 and HepG2) ATAC-seq datasets gave better cross-data predictions than models trained using a single (GM12878 or HepG2) ATAC-seq dataset (Fig 2B). To check the influence of sequencing depth on TAMC performance, we downsized the training and testing ATAC-seq datasets to different total numbers of high-quality non-mitochondrial aligned reads. The results showed that higher sequencing depth of the training datasets gave better cross-data classification performance for TAMC (Fig 2C). Compared to TOBIAS and HINT-ATAC, TAMC gave the best cross-data classification when the sequencing depth of the testing datasets was low, while for testing data with high sequencing depth, TOBIAS gave better classification than TAMC for certain TFs (Figs 2D and S2A). Interestingly, the classification performance curves of TOBIAS and TAMC always crossed near the point where the training and testing ATAC-seq datasets had similar sequencing depth (Figs 2D and S2B–S2C). Together, these results suggest that TAMC outperforms TOBIAS and HINT-ATAC in cross-data TFBS prediction as long as the training ATAC-seq datasets were sequenced at higher depth than the testing ATAC-seq datasets.

Fig 2. TAMC outperforms existing methods.

Fig 2

(A) Heat maps compare intra-data (left) and cross-data (right) classification performance of TAMC with TOBIAS and HINT-ATAC. Test data are indicated above each heatmap, and TAMC models were trained using GM12878 ATAC-seq data with 150M sequencing depth in both heatmaps. (B) Line graph compares cross-data classification performance of TAMC models trained on ATAC-seq data of GM12878, HepG2 and multiple (GM12878 and HepG2) cell lines. (C) Line graph compares cross-data classification performance of TAMC models trained on multiple (GM12878 and HepG2) ATAC-seq data with 150M, 100M and 50M sequencing depth. (D) Line graph compares cross-data classification performance of TAMC (multiple, 150M) with TOBIAS and HINT-ATAC. The models within each line graph were ranked from 1 to 3 by performance for each TF. The higher the AUROC, the lower the rank number. The points in the line graphs show the average AUROC ranks for 47 TFs for each method/model and the error bars represent the standard error of the mean (SEM). Differences between the performance of TAMC (multiple, 150M) and the other methods/models were examined using the Friedman-Nemenyi test and significant differences are labeled (* p < 0.05; ** p < 0.01; *** p < 0.001). Complete statistical test results for B, C and D are provided in S3 Table. The cell type and sequencing depth of the ATAC-seq data used to train TAMC models are indicated within the parentheses. HINT-ATAC models pre-trained on GM12878 ATAC-seq data were used for all testing, while TOBIAS does not require model training before testing. M denotes million high-quality non-mitochondrial aligned reads.
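The rank-based summary used in these comparisons (rank the methods per TF by AUROC, then average the ranks over TFs and report the SEM) can be sketched as follows. The function names and the AUROC values are hypothetical and only illustrate the calculation; sharing the average rank among ties is an assumption of this sketch, and the Friedman-Nemenyi test itself is not reproduced here.

```python
import math
import statistics

def ranks_for_tf(aurocs):
    """Rank methods for one TF: the highest AUROC gets rank 1.

    Tied methods share the average of the ranks they occupy (an assumption).
    """
    ranks = {}
    for method, value in aurocs.items():
        higher = sum(1 for v in aurocs.values() if v > value)
        tied = sum(1 for v in aurocs.values() if v == value)
        ranks[method] = higher + (tied + 1) / 2
    return ranks

def average_ranks(table):
    """table: {tf: {method: auroc}} -> {method: (mean rank, SEM)}."""
    per_method = {}
    for aurocs in table.values():
        for method, rank in ranks_for_tf(aurocs).items():
            per_method.setdefault(method, []).append(rank)
    summary = {}
    for method, ranks in per_method.items():
        sem = statistics.stdev(ranks) / math.sqrt(len(ranks)) if len(ranks) > 1 else 0.0
        summary[method] = (statistics.mean(ranks), sem)
    return summary

# Made-up AUROC values for two hypothetical TFs, for illustration only.
toy_table = {
    "TF_A": {"TAMC": 0.90, "TOBIAS": 0.80, "HINT-ATAC": 0.70},
    "TF_B": {"TAMC": 0.85, "TOBIAS": 0.90, "HINT-ATAC": 0.60},
}
toy_summary = average_ranks(toy_table)
```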

TAMC does not require bias correction during input signal processing

Most existing footprinting tools, including TOBIAS and HINT-ATAC, require bias correction during cleavage signal processing to uncover measurable footprint patterns for bound/unbound MPBS classification. However, none of the reported bias correction algorithms can faithfully uncover footprints for all TFs [17]. This makes bias correction a critical step that limits the performance of existing footprint-based methods. To check whether TAMC prediction is affected by bias correction during input signal processing, we compared the classification performance of TAMC models using four different combinations of bias-corrected and uncorrected signals as inputs. Our results showed that the performance of the four TAMC models using uncorrected, partially corrected and fully corrected input signals did not differ significantly from each other (Fig 3). In addition, all four TAMC models gave better intra-data classification performance than TOBIAS and HINT-ATAC, which use bias-corrected signals as optimal inputs (Fig 3). This implies that TAMC models learn to account for Tn5 cutting bias internally and therefore do not require manual bias correction during input signal processing. Based on these results, we set non-bias-corrected input signals as the default input format for TAMC.

Fig 3. TAMC performance is independent of bias correction.

Fig 3

Bar graphs compare intra-data classification performance of TAMC models using four different combinations of bias-corrected and uncorrected inputs together with TOBIAS and HINT-ATAC in GM12878 and HepG2 cells. The performance of models within each graph were ranked from 1 to 6 for each TF. The higher AUROC is, the lower the rank number. Average AUROC ranks for 47 TFs for each method/model were shown and the error bars represent SEM. Difference between the performance of default TAMC and the other models were examined using Friedman-Nemenyi test and p-values for significant differences are indicated. Complete statistic test results are provided in S3 Table. bc, bias corrected; nbc, non-bias corrected; ns, non-significant.

TAMC generates TF-specific models

Both TOBIAS and HINT-ATAC assume that the footprints left by all TFs have the same pattern across the genome: TOBIAS applies the same metric to calculate the footprint scores at all genomic sites, and HINT-ATAC uses the model trained for EGR1 to make predictions for all TFs. However, it has recently been reported that the footprint patterns of different TFs are highly heterogeneous [10]. To check whether TAMC can capture the heterogeneity in binding features of different TFs, we compared the intra-data prediction performance of TAMC using models trained for the same TF (intra-TF) or a different TF (cross-TF) as the testing TF (Figs 4 and S4). We found that TAMC gives better intra-TF predictions than cross-TF predictions in general (Fig 4). Among the 47 analyzed TFs, 6 TFs (ZNF143, NR2F1, CTCF, MXI1, NFIC and MEF2A) always require TAMC models trained using the same TF for best binding site prediction (Fig 4). In particular, the model trained for CTCF showed extremely high specificity for CTCF binding site prediction (Fig 4). These results suggest that TAMC can capture TF-to-TF differences at their binding sites and generate TF-specific models to improve prediction accuracy.

Fig 4. TAMC generates TF-specific models.

Fig 4

Heat map compares intra-TF and cross-TF performance of TAMC using GM12878 ATAC-seq data. For each testing TF, the prediction performance of different TAMC models was scaled using the formula: scaled AUROC = log2(AUROC/intra-TF AUROC). The majority of the obtained scaled AUROC values fall within the range from -0.1 to 0.1. Positive scaled AUROC values mean better cross-TF predictions than intra-TF predictions, while negative scaled AUROC values mean better intra-TF predictions than cross-TF predictions. The rank of the intra-TF prediction performance for each testing TF is labeled in the plot. Raw AUROC data are shown in S3 Fig.
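As a small worked example of the scaling formula in this caption, the snippet below evaluates scaled AUROC = log2(AUROC/intra-TF AUROC) for made-up AUROC values; the function name and the numbers are hypothetical.

```python
import math

def scaled_auroc(auroc, intra_tf_auroc):
    """log2 ratio of a model's AUROC to the intra-TF model's AUROC."""
    return math.log2(auroc / intra_tf_auroc)

# Hypothetical values: a cross-TF model reaching 0.85 where the intra-TF model reaches 0.90.
cross_tf_example = scaled_auroc(0.85, 0.90)   # negative: intra-TF model is better
intra_tf_example = scaled_auroc(0.90, 0.90)   # zero by construction
```

A cross-TF model that performs slightly worse than the intra-TF model therefore yields a small negative value, consistent with most values falling between -0.1 and 0.1.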

TAMC captures TF-specific binding features by deep learning

To further explore how TAMC exceeds TOBIAS and HINT-ATAC in detecting TFBSs, we tested TAMC with four variant input structures that lack footprint score (Variant 1), cleavage profile (Variant 2), read strand (Variant 3) and fragment size (Variant 4) information, respectively (Fig 5A). By comparing TAMC models using Variant 1 and Variant 2 input structures with HINT-ATAC and TOBIAS, respectively, we showed that the 1D-CNN model in TAMC makes better predictions than the classical model (e.g., HMM in HINT-ATAC) or non-model metrics (e.g., FPS metric in TOBIAS) even when using the same input signals (Fig 5B). Therefore, the high efficiency of the 1D-CNN model in capturing complex and subtle features plays an important role in improving the TFBS detection ability of TAMC. At the same time, by comparing the performance of default and variant TAMC models, we can determine which kinds of information in the input signals are important for TAMC to make predictions. Our results showed that TAMC performance was drastically compromised for all 47 TFs when the footprint score profile was removed from the input (Fig 5B). In addition, loss of cleavage profiles completely impaired TAMC prediction accuracy for more than half of the TFs (Fig 5B). Therefore, both the footprint and cleavage profiles in the inputs provide important information for TAMC to make predictions.

Fig 5. TAMC captures both footprint and non-footprint features of TFBSs by deep learning.

Fig 5

(A) Schematic of variant TAMC input structures that lack footprint score (Variant 1), cleavage profile (Variant 2), ATAC-seq read strand (Variant 3) and ATAC-seq read fragment size (Variant 4) information, respectively. (B) Bar plot and heat map compares intra-data classification performance of TAMC using default and variant input structures together with TOBIAS and HINT-ATAC in GM12878 cells. TFs that require footprint score, cleavage profile, strand or size information within the input signal or the 1D-CNN model for better prediction are indicated in the side bar. (C) Bar plot compares intra-data classification performance of TAMC models using inputs with different fragment size separations for the cleavage profile in GM12878 cells. (D) Bar plot compares intra-data prediction performance of TAMC models using inputs generated from different region length surrounding MPBSs in GM12878 cells. The models are ranked by their performance within each plot. The higher AUROC is, the lower rank number is given. Average AUROC ranks for 47 TFs for each model were shown in the bar plots (B, C, and D) and the error bars represent SEM. Difference between the performance of default TAMC and the other models were examined using Friedman-Nemenyi test and p-values for selected comparisons are indicated. Complete statistic test results are provided in S3 Table.

Interestingly, we identified several TFs that require information provided by the cleavage profile, including read strand (e.g., CTCF) and/or read fragment size (e.g., EGR1) information, for better binding site prediction (Fig 5B). Consistent with this finding, we detected a strand-specific pattern in metagene cleavage profiles at CTCF binding sites (S4 Fig), which can only be captured by TAMC using inputs generated from strand-separated signals. On the other hand, we detected a small footprint pattern at EGR1 binding sites only in cleavage profiles generated from read fragments smaller than one nucleosome size (S4 Fig). Therefore, merging cleavage profiles of small and large read fragments might dilute the footprint pattern and reduce the information available for EGR1 binding detection. These results suggest that TAMC generates TF-specific models by recognizing TF-specific binding features from ATAC-seq data. In addition, the strand-specific cleavage pattern around CTCF binding sites might reflect its dimerized binding mechanism [22], while the shallow footprint pattern at EGR1 binding sites could reflect its transient activity at most of its binding sites [23]. Therefore, although the overall performance of TAMC was not significantly affected by strand and/or fragment size separation (Fig 5B and 5C), we expect that the performance of TAMC in predicting binding sites of specific TFs could be further optimized by adjusting these two factors based on the biological mechanisms behind their binding events. Notably, we found that shortening the input regions significantly reduced TAMC performance (Fig 5D). This indicates that TAMC captures not only cleavage features at TFBSs but also surrounding cleavage features that could be affected by binding of co-factors.

TAMC predicts changes in constitutive CTCF binding sites during ZGA

Our results have revealed that TAMC greatly improved CTCF binding site prediction compared to TOBIAS and HINT-ATAC. Studies have shown increased CTCF expression and key roles of CTCF in establishing topologically associating domains (TADs) during human zygotic genome activation (ZGA) [24] (Fig 6A). Because of the limited number and accessibility of human zygotic cells, it is difficult to map CTCF binding sites directly by ChIP-seq experiments. Here we used TAMC to predict CTCF binding sites based on published ATAC-seq data of 2-cell (pre-ZGA) and 8-cell (undergoing ZGA) human embryos [25]. For this prediction, we utilized TAMC models trained on multiple ATAC-seq datasets (GM12878 and HepG2, sequencing depth = 150 million high-quality non-mitochondrial aligned reads), which were shown to have the best cross-data prediction performance in our study. We predicted sites with binding probability > 0.5 as CTCF binding sites, and sites with binding probability > 0.95 as constitutive CTCF binding sites. As expected, we detected a footprint pattern in metagene cleavage profiles only at predicted CTCF binding sites but not at the predicted non-CTCF binding sites (S5A Fig). In addition, the predicted constitutive CTCF binding sites exhibited a deep footprint pattern as a result of persistent TF binding events [23] (S5A Fig). These results demonstrate the feasibility of predicting CTCF binding sites in human embryonic cells with TAMC.

Fig 6. TAMC predicts changes in constitutive CTCF binding sites during ZGA.

Fig 6

(A) Schematic shows that CTCF knockdown prevents TAD formation during human ZGA. (B) Venn diagram shows the overlap between predicted constitutive CTCF binding sites in 2-cell and 8-cell embryos. (C) Metagene plots show aggregated cleavage profiles of predicted 2-cell specific (Group A), common (Group B) and 8-cell specific (Group C) constitutive CTCF binding sites in 2-cell and 8-cell embryos.

We next repeated the prediction experiments three times and used the commonly predicted binding sites for the following analyses. We found that the numbers of all CTCF binding sites are comparable (less than 5% difference) in 2-cell and 8-cell embryos (S5B Fig). In contrast, there is a drastic (around 40%) increase in the number of constitutive CTCF binding sites at the 8-cell stage compared to the 2-cell stage (Figs 6B and S6B). In addition, not all constitutive CTCF binding sites in the 2-cell embryo are maintained when the embryo enters ZGA. Based on our prediction, 1589 (around 40%) of the constitutive CTCF binding sites in 2-cell embryos are lost or start to show compromised CTCF binding activity at the 8-cell embryonic stage (Fig 6B and 6C). These results suggest that chromatin structure reorganization during ZGA is associated with both an increase in the number and changes in the location of constitutive CTCF binding sites. Constitutively bound CTCF sites have been shown to maintain cell-type specific 3D chromatin architecture in somatic cells [26]. Therefore, the constitutive CTCF binding sites predicted by TAMC in our study could be used as candidate target sites for future studies of chromatin structure at specific regions during early human embryonic development.
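The site-calling logic in this section (probability thresholds of 0.5 and 0.95, keeping only sites called in all three repeated runs, then comparing stages) can be sketched with simple set operations. The function names, site identifiers and probabilities below are made up for illustration and are not taken from the study's data.

```python
def call_sites(probabilities, threshold):
    """Sites whose predicted binding probability is strictly above the threshold."""
    return {site for site, p in probabilities.items() if p > threshold}

def consensus_sites(runs, threshold):
    """Sites called in every repeated prediction run (at least one run expected)."""
    return set.intersection(*(call_sites(run, threshold) for run in runs))

def compare_stages(sites_2cell, sites_8cell):
    """Split sites into 2-cell specific, common and 8-cell specific groups."""
    return {
        "2-cell specific": sites_2cell - sites_8cell,
        "common": sites_2cell & sites_8cell,
        "8-cell specific": sites_8cell - sites_2cell,
    }

# Hypothetical probabilities from three repeated runs per stage.
runs_2cell = [{"s1": 0.99, "s2": 0.97, "s3": 0.60},
              {"s1": 0.98, "s2": 0.96, "s3": 0.55},
              {"s1": 0.97, "s2": 0.90, "s3": 0.70}]
runs_8cell = [{"s1": 0.99, "s2": 0.40, "s3": 0.99},
              {"s1": 0.96, "s2": 0.45, "s3": 0.98},
              {"s1": 0.98, "s2": 0.30, "s3": 0.97}]

bound_2cell = consensus_sites(runs_2cell, 0.5)          # all CTCF binding sites
constitutive_2cell = consensus_sites(runs_2cell, 0.95)  # constitutive sites
constitutive_8cell = consensus_sites(runs_8cell, 0.95)
groups = compare_stages(constitutive_2cell, constitutive_8cell)
```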

Discussion

Predicting binding sites of transcription factors is important for understanding gene expression mechanisms. In this study, we introduced a new tool named TAMC to predict TF binding dynamics at MPBSs using paired-end ATAC-seq data. As summarized in Table 1, TAMC has several advantages compared to previous tools. First, TAMC does not require bias correction during signal processing, which makes signal processing easier and avoids further artificial bias caused by the bias correction algorithm. Second, TAMC combines different configurations of processed ATAC-seq signals within its input and therefore retains not only footprint but also non-footprint information, such as strand and size information, for later prediction use. Finally, TAMC uses a 1D-CNN to analyze input signals, which is, based on our results, more efficient in feature capturing than classical models (e.g., HMM in HINT-ATAC) or non-model-based metrics (e.g., FPS metric in TOBIAS). Together, these advantages allow TAMC to utilize precise binding features of each TF and thus improve its performance in classifying bound/unbound MPBSs.

Table 1. Summary of features of TAMC and existing footprinting tools for ATAC-seq data.

Tools | TAMC | TOBIAS | HINT-ATAC
Year of publication | This paper | 2020 | 2019
Input | Footprint score + Cleavage profile | Footprint score | Cleavage profile
Bias correction | Not required | Required | Required
Model | 1D-CNN | FPS metric | HMM
Model training | Required | Not required | Required
Features utilized for TFBS prediction | Footprint + Non-footprint | Footprint | Footprint + Non-footprint
TF-specificity | Yes | No | No
Sequencing depth | Training > testing | No requirement | No requirement
Performance | Intra-data (+++), Cross-data (+++) | (++) | Intra-data (+), Cross-data (+)

High (+++), medium (++) and low (+) evaluation of TFBS predicting performance.

Most previous studies use training and testing datasets derived from the same ATAC-seq data for evaluating the tools’ performance in classifying bound/unbound MPBSs. However, in real applications, the cell type and sequencing depth of the ATAC-seq data used for prediction usually differ from those of the available training data. To mimic real applications, we evaluated TAMC performance under both intra-data and cross-data settings. We showed that TAMC outperforms existing methods as long as the training datasets have higher sequencing depth than the testing datasets. In real studies, bulk ATAC-seq assays are usually conducted at a sequencing depth of 100 to 200M reads and can provide 20 to 50M high-quality aligned reads after removing duplicated and mitochondrial DNA reads. In our study, the TAMC models are trained using ATAC-seq data with 150 million high-quality aligned reads. Therefore, the trained TAMC models generated in this study are suitable for TFBS prediction using most bulk ATAC-seq data generated in real studies. Recently, single-cell ATAC-seq (scATAC-seq) has become increasingly popular because it can analyze multiple cell types at one time. One major obstacle for TFBS prediction using scATAC-seq data is its low sequencing depth for an individual cell or a specific cell population. As our TAMC models significantly exceeded existing methods in TFBS prediction using ATAC-seq data with low sequencing depth, TAMC could be further applied to analyzing TF binding dynamics using scATAC-seq data in the future.

Among the 47 TFs analyzed in our study, only a small number of TFs (e.g., ATF3 and TBP) did not show better binding site prediction by TAMC compared to existing methods (Fig 2A). We noted that these TFs often have a very low number of bound MPBSs (S1 Fig), which could result in insufficient data for model training. One possible way to improve TAMC’s performance for these TFs is to combine labeled MPBSs from different cell types for training. Importantly, our results revealed that, in addition to increasing the amount of data, combining training datasets of multiple cell types can also prevent cell type-specific overfitting, which is important for making cross-data predictions in real applications (Fig 2B). Therefore, while the trained TAMC models generated in our study already outperformed TOBIAS and HINT-ATAC for most TFs, more robust and generalized models could be obtained by combining deeply sequenced ATAC-seq data of more cell types for training.

Finally, our results showed that ~30% of TFs achieve better performance when using models trained on other TFs compared to intra-TF models (Fig 4). We noted that several TFs with low-ranked intra-TF models (e.g., MAX, ELK1 and MAFK) showed higher background noise in their ChIP-seq data than TFs with top-ranked intra-TF models (e.g., CTCF, NR2F1 and EGR1) (Figs 4 and S6). High ChIP-seq background noise is often caused by poor specificity of the antibody used in the ChIP experiment. It is unfavorable to ChIP-seq peak summit calling and MPBS labeling, which will then cause the trained model to capture inaccurate binding information. A potential solution is to separate the training into two steps: the first step uses MPBSs of all TFs to train the model to capture binding features common to all TFs (such as high chromatin accessibility and the appearance of a footprint pattern), while the second step applies transfer learning by continuing to train the model using MPBSs of a specific TF to capture TF-specific binding features (such as the pattern and surrounding chromatin environment of the footprint). For TFs that have high noise in their ChIP-seq data and show compromised performance after TF-specific training, we can predict their binding sites based only on the common binding features using the model obtained in the first step.

Methods

ATAC-seq and ChIP-seq data processing

ATAC-seq and ChIP-seq data (S1 and S2 Tables) were obtained from the UCSC ENCODE portal (https://www.genome.ucsc.edu/ENCODE). Raw ATAC-seq and ChIP-seq fastq files were trimmed using TrimGalore [27] and aligned with Bowtie2 (v2.3.5.1, --qc-filter --very-sensitive) [28] to the reference human genome (hg38). Reads with alignment quality lower than 30 or reads aligned to mitochondrial DNA were removed using Samtools (v1.10) [29]. Duplicated reads were removed using the MarkDuplicates tool of Picard (v2.0.1; http://broadinstitute.github.io/picard/). The aligned bam files of the GM12878 and HepG2 ATAC-seq datasets were further downsized to 150, 100 and 50 million total reads before being used for training TAMC models. The aligned bam file for the K562 ATAC-seq data was also downsized to 300, 200, 150, 100, 50, 25, 10 and 5 million aligned reads for testing trained TAMC models under different sequencing depths. Both ATAC-seq and ChIP-seq peak regions and summits were called using MACS2 (v2.2.7.1, --nomodel --nolambda --keep-dup auto --call-summits) [30].

MPBS labeling and sampling

Binding motifs for 47 TFs were all obtained from the JASPAR CORE 2020 database (https://jaspar.genereg.net/). For TFs with redundant motifs, only the latest version of the motif was used. MPBSs for each TF across the hg38 genome were mapped using MOODS with a p-value threshold of 0.0001 [4]. Only MPBSs located within open chromatin regions (ATAC-seq peak regions) were used for further analyses. To label the TF-bound and unbound status of each MPBS, ChIP-seq data obtained from the same cell type as the ATAC-seq data were used. MPBSs located outside ChIP-seq peak regions were labeled as unbound sites. For MPBSs located within each ChIP-seq peak region, only the MPBS located closest to and within 50bp of the highest ChIP-seq summit within that peak was kept and labeled as TF-bound in later analyses. If a TF had more unbound MPBSs than bound MPBSs, the unbound MPBSs were randomly sampled down to the same number as the bound MPBSs for further analyses, and vice versa.
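The labeling and equalization rules above can be written as a short Python sketch. It assumes a single chromosome, MPBSs represented by their center coordinates, half-open peak intervals, and "within 50bp" interpreted as a distance of at most 50; the function names `label_mpbs` and `balance` are hypothetical and do not come from the TAMC code base.

```python
import random

def label_mpbs(mpbs_centers, chip_peaks, max_dist=50):
    """Label MPBSs on one chromosome.

    mpbs_centers: list of MPBS center positions (int).
    chip_peaks: list of (start, end, summit), where summit is the highest
                ChIP-seq summit of that peak.
    Returns (bound, unbound). MPBSs inside a peak that are not selected
    as the bound site are dropped from both lists.
    """
    bound = []
    in_any_peak = set()
    for start, end, summit in chip_peaks:
        inside = [p for p in mpbs_centers if start <= p < end]
        in_any_peak.update(inside)
        near = [p for p in inside if abs(p - summit) <= max_dist]
        if near:
            # Keep only the MPBS closest to the summit.
            bound.append(min(near, key=lambda p: abs(p - summit)))
    unbound = [p for p in mpbs_centers if p not in in_any_peak]
    return bound, unbound

def balance(bound, unbound, seed=0):
    """Randomly downsample the larger class to the size of the smaller one."""
    rng = random.Random(seed)
    n = min(len(bound), len(unbound))
    return rng.sample(bound, n), rng.sample(unbound, n)
```

The linear scan over all MPBSs for every peak keeps the sketch short; a genome-scale implementation would more likely use an interval index or sorted coordinates.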

For GM12878 and HepG2 cells, 70% of labeled MPBSs were used for model training, 20% were used for validation during training, and the remaining 10% were used for testing the trained models. Equal numbers of bound and unbound MPBSs for each TF were randomly sampled into the training, validation and testing datasets. For K562 cells, all labeled MPBSs were used to prepare testing input signals.
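A 70/20/10 split that preserves equal numbers of bound and unbound sites in every subset could look like the sketch below. The function name `stratified_split`, the rounding of subset sizes and the seed are assumptions made for illustration, not details taken from the TAMC implementation.

```python
import random

def stratified_split(bound, unbound, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split bound and unbound MPBSs separately so each subset stays balanced.

    Returns a dict with 'train', 'val' and 'test' lists of (site, label)
    pairs, where label 1 = bound and 0 = unbound.
    """
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, sites in ((1, bound), (0, unbound)):
        sites = list(sites)
        rng.shuffle(sites)
        n_train = int(round(fractions[0] * len(sites)))
        n_val = int(round(fractions[1] * len(sites)))
        parts = (sites[:n_train],
                 sites[n_train:n_train + n_val],
                 sites[n_train + n_val:])  # remainder goes to the test set
        for name, part in zip(splits, parts):
            splits[name].extend((s, label) for s in part)
    return splits
```

Because the two classes are split independently with the same fractions, an input that has already been equalized yields balanced training, validation and testing sets.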

Input signal processing

To prepare input signals for TAMC, the ATAC-seq data were processed using the ATACorrect and FootprintScores tools in the TOBIAS package to generate footprint scores at single-base resolution [17]. Footprint scores within 500bp from the center of each MPBS were made into a 1x1000 footprint feature vector. At the same time, the ATAC-seq data were separated into 4 files by strand and fragment size, and each file was then used for counting genomic cleavage signals and calculating slopes of cleavage signals within 500bp from each MPBS center, following the signal processing steps in the HINT-ATAC package [16]. The resulting 4 channels of genomic cleavage signals and 4 channels of cleavage slopes were combined into an 8x1000 cleavage feature vector. The default TAMC input signal was generated by concatenating the 1x1000 footprint feature vector and the 8x1000 cleavage feature vector to form the 9x1000 input feature vector. In addition, the 1x1000 footprint feature vector and the 8x1000 cleavage feature vector were each used on their own as two variant TAMC input structures, lacking cleavage profiles and footprint information, respectively. To make the 5x1000 variant input feature vectors lacking size or strand information, the 1x1000 footprint feature vector was concatenated with a 4x1000 cleavage feature vector generated using ATAC-seq data separated only by strand or only by fragment size, respectively.
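The assembly of the 9x1000 input can be sketched in plain Python as follows. The sketch assumes that per-base footprint scores and the four cleavage tracks are already available as lists indexed by genomic position, pads with zeros at chromosome edges, and uses a simple first difference as a stand-in for the slope calculation; the actual HINT-ATAC slope computation is not reproduced here, and all function names are hypothetical.

```python
def window(track, center, flank=500):
    """Return the 2*flank values around `center`, zero-padded at the edges."""
    return [track[i] if 0 <= i < len(track) else 0.0
            for i in range(center - flank, center + flank)]

def slope(signal):
    """First difference, used only as a placeholder for a slope feature."""
    return [0.0] + [b - a for a, b in zip(signal, signal[1:])]

def build_input(footprint_track, cleavage_tracks, center, flank=500):
    """Stack 1 footprint, 4 cleavage and 4 slope channels into 9 rows."""
    footprint = [window(footprint_track, center, flank)]        # 1 x 1000
    cuts = [window(t, center, flank) for t in cleavage_tracks]  # 4 x 1000
    slopes = [slope(c) for c in cuts]                           # 4 x 1000
    return footprint + cuts + slopes                            # 9 x 1000
```

Under the same assumptions, passing only two cleavage tracks (separated by strand only or by fragment size only) gives the 5x1000 variants, and using `footprint` or `cuts + slopes` alone gives the 1x1000 and 8x1000 variants.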

Training

The TAMC model was trained using input signals generated from ATAC-seq data of GM12878 and/or HepG2 cells. The Python package PyTorch [31] was used for generating and training the model (Fig 1A). The model was trained using the Adam algorithm with a minibatch size of 35, an epoch size of 10 and a maximum iteration number of 100. The datasets were randomized between epochs of training. Validation loss was evaluated at the end of each training iteration, and the trained models with the three lowest validation losses were saved as the best models.
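The selection of the three best models can be illustrated without any deep learning code. The sketch below only assumes that one validation loss is recorded per training iteration; the function name `best_checkpoints` is hypothetical, and saving or loading model weights is left out.

```python
import heapq

def best_checkpoints(val_losses, k=3):
    """Return (loss, iteration) pairs for the k lowest validation losses.

    val_losses: validation losses in iteration order (iteration 1 first).
    """
    indexed = ((loss, it) for it, loss in enumerate(val_losses, start=1))
    return heapq.nsmallest(k, indexed)
```

In a training loop, the iterations returned here would identify which saved model states to keep for the ensemble used at testing time.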

Testing

The trained TAMC models were tested using ATAC-seq data of GM12878, HepG2 or K562 cells. For each TF, the AUROC value was calculated based on performance in classifying bound/unbound MPBSs using the average of the binding probabilities predicted by the 3 best models.
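As an illustration of this evaluation, the sketch below averages per-site probabilities from several models and computes AUROC through its rank interpretation, i.e. the probability that a randomly chosen bound site scores higher than a randomly chosen unbound site, with ties counted as one half. The function names are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def average_probabilities(per_model_probs):
    """Average site-wise probabilities across models.

    per_model_probs: list of equal-length lists, one list per model.
    """
    return [sum(p) / len(p) for p in zip(*per_model_probs)]

def auroc(labels, scores):
    """AUROC as P(score of bound site > score of unbound site)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both bound and unbound sites")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A typical call under these assumptions would be `auroc(labels, average_probabilities([probs_model1, probs_model2, probs_model3]))`.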

The performance of TAMC was compared to that of TOBIAS and HINT-ATAC. TOBIAS does not require model training before testing, and its AUROC values were calculated based on the ability of the average footprint scores within each MPBS to classify bound/unbound sites. To test HINT-ATAC performance, we used the published model trained on EGR1 using GM12878 ATAC-seq data. To calculate AUROC values for HINT-ATAC prediction results, all MPBSs were first ranked by the PWM scores generated by MOODS and the maximum PWM score was saved. Footprint regions were then identified using the footprinting tool of the rgt-hint package [16]. MPBSs overlapping footprint regions were assumed to be bound and their PWM scores were updated by adding the maximum PWM score. The other MPBSs were regarded as unbound and their PWM scores were kept unchanged. AUROC was calculated based on the ability of the final PWM scores to classify bound/unbound MPBSs.
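The score adjustment used for the HINT-ATAC comparison can be expressed in a few lines. The sketch assumes that PWM scores and footprint overlap flags are already available as parallel lists; the function name `hint_style_scores` is hypothetical.

```python
def hint_style_scores(pwm_scores, overlaps_footprint):
    """Add the maximum PWM score to every MPBS that overlaps a footprint.

    pwm_scores: PWM score of each MPBS.
    overlaps_footprint: booleans, True if the MPBS overlaps a footprint region.
    """
    top = max(pwm_scores)
    return [s + top if hit else s
            for s, hit in zip(pwm_scores, overlaps_footprint)]
```

When all PWM scores are non-negative, this shift places every footprint-overlapping MPBS at or above every non-overlapping one, while the PWM score still orders sites within each group.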

For each TAMC model, the experiments (including MPBS sampling, training, and testing) were repeated three times. The performance of TOBIAS and HINT-ATAC was also tested three times using the testing MPBSs generated in each experimental replicate. Raw AUROC values were plotted as heat maps (Figs 2A and S3) or directly provided in S4 Table. For each testing TF, AUROC values generated by different methods/models were first ranked within each experiment; the mean AUROC rank across the three experimental replicates was then used for comparison.
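The ranking procedure for one TF can be sketched as below. Rank 1 is given to the highest AUROC; assigning tied values the average of their ranks is an assumption, since tie handling is not specified in the text, and the function names are hypothetical.

```python
def rank_descending(values):
    """Rank 1 = highest value; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mean_ranks(replicates):
    """replicates: list of {method: AUROC} dicts, one per experiment."""
    methods = list(replicates[0])
    totals = dict.fromkeys(methods, 0.0)
    for rep in replicates:
        ranks = rank_descending([rep[m] for m in methods])
        for m, r in zip(methods, ranks):
            totals[m] += r
    return {m: totals[m] / len(replicates) for m in methods}
```

Applying `mean_ranks` to each TF and then averaging over TFs would give the kind of average AUROC rank described here, under the stated tie-handling assumption.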

Prediction

To predict CTCF binding sites during ZGA, we first downsized ATAC-seq data of 2-cell and 8-cell embryos to the same sequencing depth of 40,000,000 high-quality non-mitochondrial aligned reads. ATAC peaks were called as described above, and ATAC peaks of these two cell types were merged as common open chromatin regions. We then selected CTCF MPBSs located within common open chromatin regions for later TAMC prediction. The prediction was repeated three times using the three sets of models obtained from the three training experiments mentioned above. Predicted binding probabilities at all CTCF MPBSs in the three experiments are provided in S5 Table. CTCF MPBSs with predicted binding probability larger than 0.5 were regarded as CTCF binding sites. CTCF MPBSs with predicted binding probability larger than 0.95 were regarded as constitutive CTCF binding sites.
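The thresholding step, together with the requirement stated in the S5 Fig legend that sites be predicted in all three experiments, can be sketched as follows. The site identifiers, the dictionary layout and the function names are assumptions made for illustration.

```python
def classify_sites(probabilities, bound_cutoff=0.5, constitutive_cutoff=0.95):
    """Split sites by predicted binding probability.

    probabilities: {site_id: predicted binding probability}.
    Returns (bound, constitutive) sets; constitutive is a subset of bound.
    """
    bound = {s for s, p in probabilities.items() if p > bound_cutoff}
    constitutive = {s for s, p in probabilities.items() if p > constitutive_cutoff}
    return bound, constitutive

def consensus(calls_per_experiment):
    """Keep only sites called in every experiment."""
    return set.intersection(*calls_per_experiment)
```

Calling `consensus` on the three per-experiment `bound` sets would then give the sites predicted as bound in all three experiments.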

Supporting information

S1 Fig

Number of labeled MPBSs for 47 TFs before (A) and after (B) equalization of bound and unbound sites in GM12878 cells (supplementary to Figs 1 and 2). TFs that did not show better intra-data classification using TAMC than using TOBIAS or HINT-ATAC are highlighted in red.

(TIF)

S2 Fig. TAMC performance is affected by sequencing depth of training data (supplementary to Fig 2).

(A) Line graphs compare CTCF, SRF, E2F4 and NFIC binding site prediction performance of trained TAMC models with that of TOBIAS and HINT-ATAC. The points represent mean AUROC ± SEM (n = 3 experimental replicates). P-values were calculated using a two-sided unpaired Student’s t-test. (B-C) Line graphs compare average cross-data classification performance of TAMC models trained on multiple (GM12878 and HepG2) ATAC-seq datasets with 100M (B) and 50M (C) sequencing depth with that of TOBIAS and HINT-ATAC. The performance of the models within each line graph was ranked from 1 to 3 for each TF. The higher the AUROC, the lower the rank number. The points show the average AUROC ranks over 47 TFs for each method and the error bars represent SEM. P-values were calculated using the Friedman-Nemenyi test. The cell type and sequencing depth of the ATAC-seq data used to train TAMC models are indicated in parentheses. HINT-ATAC models pre-trained on GM12878 ATAC-seq data were used for all testing, while TOBIAS does not require model training before testing. M denotes million high-quality non-mitochondrial aligned reads. Complete statistical test results and raw AUROC data for all figures are provided in S3 and S4 Tables, respectively. * p < 0.05, ** p < 0.01, *** p < 0.001.

(TIF)

S3 Fig. Heatmap shows intra-TF and cross-TF classification performance (AUROC) of TAMC using GM12878 ATAC-seq data (supplementary to Fig 4).

(TIF)

S4 Fig. Metagene plots show aggregated cleavage profiles within ±500bp regions of labeled CTCF and EGR1 MPBSs in GM12878 cells.

Cleavage signals were processed without manual bias correction. Three plots were made for each class of MPBSs using all ATAC-seq reads, ATAC-seq reads with maximum fragment size at 1Nr, and ATAC-seq reads with minimum fragment size more than 1Nr. Nr, nucleosome size. MPBS, motif-predicted binding site.

(TIF)

S5 Fig. Predicted CTCF binding sites in human embryonic cells.

(A) Metagene plots show aggregated cleavage profiles at predicted unbound, bound, and constitutively bound CTCF MPBSs in 8-cell embryos. (B) Venn diagrams show the number of predicted bound/constitutively bound CTCF MPBSs in 2-cell and 8-cell embryos from 3 experiments. CTCF binding sites predicted in all three experiments are regarded as true binding sites.

(TIF)

S6 Fig. Representative genome tracks show ChIP-seq signals of TFs with top-ranked intra-TF model (CTCF, NR2F1 and EGR1) and low-ranked intra-TF model (MAX, ELK1 and MAFK).

(TIF)

S1 Table. List of ATAC-seq datasets.

(XLSX)

S2 Table. List of TF motifs and ChIP-seq datasets.

(XLSX)

S3 Table. Complete results of significance tests.

(XLSX)

S4 Table. Raw AUROC values for evaluation of different models and methods.

(XLSX)

S5 Table. Raw predicted binding probabilities for imputing CTCF binding sites in human embryos.

(XLSX)

Acknowledgments

We thank Dr. Lawrence Carin for providing suggestions for this project and Shelley Rusincovitch for organizing the Duke Data Science Plus (+DS) program. For projects involving non-sensitive data, +DS is supported by Duke Research Computing for the use of the Duke Compute Cluster for high-throughput computation. The Duke Data Commons storage is supported by the National Institutes of Health (1S10OD018164-01).

Data Availability

Raw data accession numbers and derived data supporting the findings of this study are listed within the manuscript and its Supporting Information files. Code for model training and testing is openly available in the TAMC GitHub repository at https://github.com/tianqiyy/TAMC.git.

Funding Statement

The authors received no specific funding for this work.

References

1. Spitz F, Furlong EEM. Transcription factors: from enhancer binding to developmental control. Nat Rev Genet. 2012;13(9):613–26. doi: 10.1038/nrg3207 WOS:000308064000009.
2. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007;316(5830):1497–502. doi: 10.1126/science.1141319 WOS:000247066400049.
3. Skene PJ, Henikoff S. An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites. Elife. 2017;6. doi: 10.7554/eLife.21856 WOS:000394261700001.
4. Korhonen J, Martinmaki P, Pizzi C, Rastas P, Ukkonen E. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics. 2009;25(23):3181–2. doi: 10.1093/bioinformatics/btp554 WOS:000272080800020.
5. Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27(7):1017–8. doi: 10.1093/bioinformatics/btr064 WOS:000289162000022.
6. Van Loo P, Marynen P. Computational methods for the detection of cis-regulatory modules. Brief Bioinform. 2009;10(5):509–24. doi: 10.1093/bib/bbp025 WOS:000269017300004.
7. Klemm SL, Shipony Z, Greenleaf WJ. Chromatin accessibility and the regulatory epigenome. Nat Rev Genet. 2019;20(4):207–20. doi: 10.1038/s41576-018-0089-8 WOS:000461361300006.
8. Galas DJ, Schmitz A. DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res. 1978;5(9):3157–70. Epub 1978/09/01. doi: 10.1093/nar/5.9.3157 PubMed Central PMCID: PMC342238.
9. Hesselberth JR, Chen XY, Zhang ZH, Sabo PJ, Sandstrom R, Reynolds AP, et al. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nature Methods. 2009;6(4):283–9. doi: 10.1038/nmeth.1313 WOS:000264738800018.
10. Quach B, Furey TS. DeFCoM: analysis and modeling of transcription factor binding sites using a motif-centric genomic footprinter. Bioinformatics. 2017;33(7):956–63. doi: 10.1093/bioinformatics/btw740 WOS:000400984700002.
11. Kahara J, Lahdesmaki H. BinDNase: a discriminatory approach for transcription factor binding prediction using DNase I hypersensitivity data. Bioinformatics. 2015;31(17):2852–9. doi: 10.1093/bioinformatics/btv294 WOS:000361395700013.
12. Raj A, Shim H, Gilad Y, Pritchard JK, Stephens M. msCentipede: Modeling Heterogeneity across Genomic Sites and Replicates Improves Accuracy in the Inference of Transcription Factor Binding. Plos One. 2015;10(9):e0138030. doi: 10.1371/journal.pone.0138030 WOS:000361800700028.
13. Kang D, Sherwood R, Barkal A, Hashimoto T, Engstrom L, Gifford D. DNase-capture reveals differential transcription factor binding modalities. Plos One. 2017;12(12). doi: 10.1371/journal.pone.0187046 WOS:000419033400003.
14. Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y, Pritchard JK. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 2011;21(3):447–55. doi: 10.1101/gr.112623.110 WOS:000287841100010.
15. Piper J, Assi SA, Cauchy P, Ladroue C, Cockerill PN, Bonifer C, et al. Wellington-bootstrap: differential DNase-seq footprinting identifies cell-type determining transcription factors. Bmc Genomics. 2015;16. doi: 10.1186/s12864-015-2081-4 WOS:000365286200003.
16. Li ZJ, Schulz MH, Look T, Begemann M, Zenke M, Costa IG. Identification of transcription factor binding sites using ATAC-seq. Genome Biol. 2019;20. doi: 10.1186/s13059-019-1642-2 WOS:000459894000001.
17. Bentsen M, Goymann P, Schultheis H, Klee K, Petrova A, Wiegandt R, et al. ATAC-seq footprinting unravels kinetics of transcription factor binding during zygotic genome activation. Nat Commun. 2020;11(1). doi: 10.1038/s41467-020-18035-1 WOS:000567552400005.
18. Ouyang NX, Boyle AP. TRACE: transcription factor footprinting using chromatin accessibility data and DNA sequence. Genome Res. 2020;30(7):1040–6. doi: 10.1101/gr.258228.119 WOS:000554900100009.
19. Neph S, Vierstra J, Stergachis AB, Reynolds AP, Haugen E, Vernot B, et al. An expansive human regulatory lexicon encoded in transcription factor footprints. Nature. 2012;489(7414):83–90. doi: 10.1038/nature11212 WOS:000308347000041.
20. Vierstra J, Stamatoyannopoulos JA. Genomic footprinting. Nat Methods. 2016;13(3):213–21. Epub 2016/02/26. doi: 10.1038/nmeth.3768.
21. Calviello AK, Hirsekorn A, Wurmus R, Yusuf D, Ohler U. Reproducible inference of transcription factor footprints in ATAC-seq and DNase-seq datasets using protocol-specific bias modeling. Genome Biol. 2019;20. doi: 10.1186/s13059-019-1654-y WOS:000459406400007.
22. Yin M, Wang J, Wang M, Li X, Zhang M, Wu Q, et al. Molecular mechanism of directional CTCF recognition of a diverse range of genomic sites. Cell Res. 2017;27(11):1365–77. Epub 2017/10/28. doi: 10.1038/cr.2017.131; PubMed Central PMCID: PMC5674162.
23. Sung MH, Guertin MJ, Baek S, Hager GL. DNase Footprint Signatures Are Dictated by Factor Dynamics and DNA Sequence. Mol Cell. 2014;56(2):275–85. doi: 10.1016/j.molcel.2014.08.016 WOS:000344484600009.
24. Chen XP, Ke YW, Wu KL, Zhao H, Sun YY, Gao L, et al. Key role for CTCF in establishing chromatin structure in human embryos. Nature. 2019;576(7786):306-+. doi: 10.1038/s41586-019-1812-0 WOS:000502792400062.
25. Wu JY, Xu JW, Liu BF, Yao GD, Wang PZ, Lin ZL, et al. Chromatin analysis in human early development reveals epigenetic transition during ZGA. Nature. 2018;557(7704):256-+. doi: 10.1038/s41586-018-0080-8 WOS:000431775100053.
26. Khoury A, Achinger-Kawecka J, Bert SA, Smith GC, French HJ, Luu PL, et al. Constitutively bound CTCF sites maintain 3D chromatin architecture and long-range epigenetically regulated domains. Nat Commun. 2020;11(1). doi: 10.1038/s41467-019-13753-7 WOS:000511960800001.
27. Wu ZP, Wang X, Zhang XG. Using non-uniform read distribution models to improve isoform expression inference in RNA-Seq. Bioinformatics. 2011;27(4):502–8. doi: 10.1093/bioinformatics/btq696 WOS:000287246000009.
28. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012;9(4):357–U54. doi: 10.1038/nmeth.1923 WOS:000302218500017.
29. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078–9. doi: 10.1093/bioinformatics/btp352 WOS:000268808600014.
30. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9). doi: 10.1186/gb-2008-9-9-r137 WOS:000260586900015.
31. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Adv Neur In. 2019;32. WOS:000534424308009.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009921.r001

Decision Letter 0

Ilya Ioshikhes, Saurabh Sinha

19 Apr 2022

Dear Yang,

Thank you very much for submitting your manuscript "TAMC: A deep-learning approach to predict motif-centric transcriptional factor binding activity based on ATAC-seq profile" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Saurabh Sinha

Guest Editor

PLOS Computational Biology

Ilya Ioshikhes

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The paper describes an interesting deep-learning approach to predict transcription factor binding sites using motif information, ATAC-seq profiles, and ChIP-seq information. While it seems a worthwhile approach, I feel some aspects would benefit from greater clarity.

Major:

Results, page 6, "genomic cleavage signals and slopes are generated following HINT-ATAC input signal processing strategy..." is the strategy reimplemented, or is HINT-ATAC used directly?

page 7: "we compared TAMC predictions with predictions generated using previous footprinting tools including TOBIAS and HINT-ATAC". These are the only two previous tools compared. The sentence should be reworded.

page 8, "compared to TOBIAS and HINT-ATAC, TAMC gave best cross-cell prediction..." however, table 1 suggests that the former two tools cannot be used for cross-data. This should be clarified.

page 12, figure 4: Why are the AUC values scaled between -1 and 1? This seems an unusual procedure and makes it impossible to compare rows. Why not plot the raw AUC on a 0-1 scale as usual? This should be clarified.

General: a bit more detail in the description of the deep learning architecture would be desirable.

General: It would be good to include error bars wherever possible, ie figures 2b, 2c, 3.

Minor:

Abstract: "TAMC captures both footprint and non-footprint features..." do the authors mean "utilizes" (not "captures")?

Author summary: avoid acronyms here, like CNN, TFBS, ATAC-seq, or explain them, since this is for non-specialists

Reviewer #2: Review of TAMC: A deep-learning approach to predict motif-centric transcriptional factor binding activity based on ATAC-seq profile

In this paper, the authors introduced a new computational method, TAMC, predicting TF-binding activity in open chromatin regions from ATAC-seq data. The authors showed that TAMC outperformed two existing methods, TOBIAS and HINT-ATAC, by combining them using a deep-learning approach without the correction of sequence bias on cut sites (e.g., Tn5 sequence preference). However, it is unclear how the convolutional network structure of TAMC is configured to train the dependency of local TF footprints and cleavage signals reflecting TF-binding biochemistry. It is also unclear whether TAMC can be generalized to other data sets. These points are discussed in more detail below:

Major points:

1) The convolutional layer architecture of TAMC shown in Figure 1A does not seem to integrate neighboring base-pair signals within a specific feature. Instead, it aggregates signals across neighboring features at the same position. In other words, the current architecture seems to randomly convolute footprinting and cleavage site features, which are then max-pooled without considering neighboring base pair signals. Since the order of the features is arbitrary, I am struggling to understand how the convolutional layer learns the biology of transcriptional binding sites. What do three different convolutional layers with kernel sizes (k=3, 5, 7) and color (red, green, blue) mean from a biological perspective?

2) It is also unclear why such a long region (1,000 bp) is needed for modeling. A footprinting event is local and can be captured within 50bp for almost all TFs. Is it really required to have 1,000bp to achieve high accuracy? What happens if it is reduced to 500bp or even 100bp? What is the optimal length?

3) When training and testing the model of TAMC, the authors used only three cell lines (GM12878, HepG2, and K562), for which many ChIP-seq datasets are available in the ENCODE database. However, no such ChIP-seq data sets are available for most human tissues, and it is unclear if the model can be extended to these human tissues. The authors need to demonstrate the generalizability of the model by analyzing bulk/single-cell ATAC-seq data sets from human tissues.

4) In Figure 5B, the authors primarily used CTCF and EGR to demonstrate that TAMC can capture footprint features of TF-binding sites. However, a more systematic analysis is required to thoroughly test if the differential prediction performance across different model structures is specific to TFs. For example, the expression and the size of TFs could be associated with differences in the prediction performance across the model variants. This can potentially be used to improve the prediction performance further.

5) The authors showed that TAMC achieved robust performance regardless of the bias correction, implying that the TAMC model also learned features associated with bias and corrected them internally. If this is the case, I expect the prediction results to be concordant between the bias-corrected and uncorrected inputs. Please demonstrate this.

6) Direct and quantitative comparison of AUC scores is required throughout the manuscript. In Figure 2A, it is unclear how significantly AUCs are different between TAMC and TOBIAS/HINT-ATAC. The distribution of AUC scores should be compared among the three methods, and the statistical significance of the differences should be assessed. The average AUC rank is not a good metric, either. Please consider changing it to something more direct in Figures 2B/C, 3, and 5B. In Figure 4, what is the normalization method to scale AUC (-0.1~0.1) by row? It is also unclear why the AUC should be scaled in this analysis.

Minor points:

1) AUC is the abbreviation of Area Under Curve, and it is not ROC-curve specific. Please mention this is the AUC of ROC in the method section.

2) MPBS is not a commonly used acronym. Please consider not using it.

3) Typos:

• “stand” in line 12 of page 13 should be “strand.”

• “sing” in line 6 of page 19 should be “using.”

Reviewer #3: This paper proposes a CNN model to predict TF binding site (TFBS) using footprint and non-footprint (cleavage signal and slope) features derived from ATAC-seq data. Their TF-specific CNN model aggregates the footprint scores at base resolution and cleavage profiles using 1-D convolutional filters in the region (1000 bps) around the center of the motif predicted binding sites. It is shown that the model performance is independent of correcting for the sequence-specific cleavage bias present in the ATAC-seq data; a property that does not exist in the other tools for TFBS prediction using ATAC-seq profiles.

I have the following questions,

1) Please explain why the cutoff of 1 nucleosome size was used for the ATAC-seq reads (Figure 1B)? Also, explain why adding more intervals for the read size might be redundant, e.g., < 1Nr, 1Nr < - < 3 Nr, >3Nr, given that the inclusion of read size information affects the performance (Figure 5B)?

2) Please explain the choice of K562 cell line as the target dataset and GM12878 and HepG2 as the source datasets in the cross-data experiments (Figure 2B,C).

3) In Figure 2C, which dataset(s) were used as the source dataset for TOBIAS and HINT-ATAC?

4) Average AUC rank over the TFs is used to compare the performance of different models or training settings. Please provide a way for testing/explaining the significance of delta X in average AUC rank, i.e., is a decay of 0.1 in average AUC rank significant?

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Rahul Siddharthan

Reviewer #2: No

Reviewer #3: Yes: Saba Ghaffari

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009921.r003

Decision Letter 1

Ilya Ioshikhes, Saurabh Sinha

26 Jul 2022

Dear Yang,

Thank you very much for submitting your manuscript "TAMC: A deep-learning approach to predict motif-centric transcriptional factor binding activity based on ATAC-seq profile" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

While Reviewers 1 and 3 are now fully satisfied with the manuscript, Reviewer 2 has minor concerns about the revision. Please address these to the extent possible.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Saurabh Sinha

Guest Editor

PLOS Computational Biology

Ilya Ioshikhes

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]


Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I find the manuscript greatly improved and the responses to all reviewers thorough and clear.

Two minor points:

* "including" implies a non-exhaustive list ("two representative footprinting tools including TOBIAS and HINT-ATAC"). I would suggest "namely", i.e., "two representative footprinting tools, namely, TOBIAS..."

* "matric" (multiple places) should be "metric", I think. E.g., "Tobias applies the same matric...", "FPS matric", etc.

Reviewer #2: The authors addressed most of my concerns. The revised Figure 1A makes it much easier to understand the CNN structure used in this study. However, I still have some questions:

1) I am still trying to understand why the model requires longer input regions to achieve the best performance. I am wondering if there is a positional bias between bound and unbound MPBSs. For example, bound MPBSs may be more centrally positioned with respect to ATAC-seq peak summits, while unbound MPBSs are more peripheral, which may introduce systematic ATAC-seq signal bias. If this is the case, the difference in ATAC-seq read intensity in the 1kb region centered at MPBS may be enough to distinguish unbound MPBS from bound MPBS. Please demonstrate that there is no positional bias. If the bias exists, the authors should control it when making the balanced training set when subsampling the unbound MPBSs, and repeat the analysis.

2) The raw AUROC for all experiments is quite helpful to better understand the results. As such, I would suggest that the authors directly compare AUROCs across different models using 2~3 representative TFs.

3) After a better understanding of “scaled AUC” in Figure 4, I find it quite interesting that ~30% of TFs achieve better performance when using models trained on other TFs. Although I agree with the authors that intra-TF prediction is generally better than cross-TF prediction for most TFs, I am also curious if there are any common properties shared by these “unusual” TFs.
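The positional-bias control requested in point 1 could be sketched as follows. This is a minimal illustration and not the authors' analysis: it assumes a single chromosome, integer coordinates for MPBS centers and ATAC-seq peak summits, and a hypothetical bin width of 50 bp. The function names and the toy coordinates are invented for the example. The idea is to measure each MPBS's distance to the nearest peak summit and then subsample unbound MPBSs so that their distance distribution matches that of the bound MPBSs.

```python
import bisect
import random
from collections import defaultdict


def distance_to_nearest_summit(position, sorted_summits):
    """Distance from an MPBS center to the closest peak summit (summits must be sorted)."""
    i = bisect.bisect_left(sorted_summits, position)
    # Only the summits immediately left and right of the position can be nearest.
    candidates = sorted_summits[max(i - 1, 0): i + 1]
    return min(abs(position - s) for s in candidates)


def match_unbound_to_bound(bound, unbound, summits, bin_size=50, seed=0):
    """Subsample unbound sites so each distance bin holds at most as many as the bound set."""
    summits = sorted(summits)
    rng = random.Random(seed)

    # Group unbound sites by their binned distance to the nearest summit.
    unbound_by_bin = defaultdict(list)
    for pos in unbound:
        unbound_by_bin[distance_to_nearest_summit(pos, summits) // bin_size].append(pos)

    # Count bound sites per distance bin.
    bound_counts = defaultdict(int)
    for pos in bound:
        bound_counts[distance_to_nearest_summit(pos, summits) // bin_size] += 1

    # Draw, per bin, as many unbound sites as there are bound sites (or all available).
    matched = []
    for b, n in bound_counts.items():
        pool = unbound_by_bin.get(b, [])
        matched.extend(rng.sample(pool, min(n, len(pool))))
    return matched


# Toy coordinates, invented for illustration only.
summits = [1000, 5000]
bound = [1010, 4990, 1020]
unbound = [1030, 1400, 5300, 4960, 5045, 1200]
matched = match_unbound_to_bound(bound, unbound, summits)
print(sorted(matched))
```

In this toy case all three bound sites lie within 50 bp of a summit, so only the three unbound sites in the same distance bin are retained. If a model trained on such a matched set still separates bound from unbound sites, the separation cannot be attributed to summit proximity alone.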

Reviewer #3: All of my questions have been answered to my satisfaction.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.


References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009921.r005

Decision Letter 2

Ilya Ioshikhes, Saurabh Sinha

24 Aug 2022

Dear Yang,

We are pleased to inform you that your manuscript 'TAMC: A deep-learning approach to predict motif-centric transcriptional factor binding activity based on ATAC-seq profile' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Saurabh Sinha

Guest Editor

PLOS Computational Biology

Ilya Ioshikhes

Section Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009921.r006

Acceptance letter

Ilya Ioshikhes, Saurabh Sinha

8 Sep 2022

PCOMPBIOL-D-22-00225R2

TAMC: A deep-learning approach to predict motif-centric transcriptional factor binding activity based on ATAC-seq profile

Dear Dr Yang,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig

    Number of labeled MPBSs for 47 TFs before (A) and after (B) equalization of bound and unbound sites in GM12878 cells (supplementary to Figs 1 and 2). TFs that did not show better intra-data classification with TAMC than with TOBIAS or HINT-ATAC are highlighted in red.

    (TIF)

    S2 Fig. TAMC performance is affected by sequencing depth of training data (supplementary to Fig 2).

    (A) Line graphs compare the performance of trained TAMC models with TOBIAS and HINT-ATAC in predicting CTCF, SRF, E2F4 and NFIC binding sites. Points represent mean AUROC ± SEM (n = 3 experimental replicates). P-values were calculated using a two-sided unpaired Student’s t-test. (B-C) Line graphs compare the average cross-data classification performance of TAMC models trained on multiple (GM12878 and HepG2) ATAC-seq datasets at 100M (B) and 50M (C) sequencing depth with TOBIAS and HINT-ATAC. Within each line graph, the models were ranked from 1 to 3 for each TF; the higher the AUROC, the lower the rank number. Points show the average AUROC rank across 47 TFs for each method, and error bars represent SEM. P-values were calculated using the Friedman-Nemenyi test. The cell type and sequencing depth of the ATAC-seq data used to train TAMC models are indicated in parentheses. HINT-ATAC models pre-trained on GM12878 ATAC-seq data were used for all testing, while TOBIAS does not require model training before testing. M denotes million high-quality non-mitochondrial aligned reads. Complete statistical test results and raw AUROC data for all figures are provided in S3 and S4 Tables, respectively. * p < 0.05, ** p < 0.01, *** p < 0.001.

    (TIF)
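The per-TF ranking described in the S2 Fig caption can be illustrated with a short sketch. It is not the authors' code: the AUROC values below are invented placeholders (the real values are in S4 Table), the function names are hypothetical, and only the Friedman chi-square statistic without tie correction is computed, not the Nemenyi post-hoc test reported in the caption.

```python
from statistics import mean


def rank_descending(scores):
    """Rank scores so the highest gets rank 1; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the tied run
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def average_ranks(auroc_by_tf, methods):
    """Average each method's within-TF rank over all TFs."""
    per_tf = [rank_descending([row[m] for m in methods]) for row in auroc_by_tf.values()]
    return {m: mean(r[i] for r in per_tf) for i, m in enumerate(methods)}


def friedman_statistic(avg_ranks, n_blocks):
    """Friedman chi-square from average ranks (no tie correction)."""
    k = len(avg_ranks)
    centre = (k + 1) / 2
    return 12 * n_blocks / (k * (k + 1)) * sum((r - centre) ** 2 for r in avg_ranks.values())


# Placeholder AUROCs for three hypothetical TFs; not taken from the paper.
methods = ["TAMC", "TOBIAS", "HINT-ATAC"]
auroc = {
    "TF_A": {"TAMC": 0.90, "TOBIAS": 0.80, "HINT-ATAC": 0.70},
    "TF_B": {"TAMC": 0.85, "TOBIAS": 0.75, "HINT-ATAC": 0.80},
    "TF_C": {"TAMC": 0.70, "TOBIAS": 0.72, "HINT-ATAC": 0.60},
}
avg = average_ranks(auroc, methods)
chi2 = friedman_statistic(avg, n_blocks=len(auroc))
print(avg, chi2)
```

Ranking within each TF before averaging makes the comparison insensitive to how hard a given TF is to predict overall, which is why the caption reports average ranks rather than average AUROCs.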

    S3 Fig. Heatmap shows intra-TF and cross-TF classification performance (AUROC) of TAMC using GM12878 ATAC-seq data (supplementary to Fig 4).

    (TIF)

    S4 Fig. Metagene plots show aggregated cleavage profiles within ±500bp regions of labeled CTCF and EGF1 MPBSs in GM12878 cells.

    Cleavage signals were processed without manual bias correction. Three plots were made for each class of MPBSs: one using all ATAC-seq reads, one using ATAC-seq reads with a fragment size of at most 1Nr, and one using ATAC-seq reads with a fragment size greater than 1Nr. Nr, nucleosome size. MPBS, motif-predicted binding site.

    (TIF)

    S5 Fig. Predicted CTCF binding sites in human embryonic cells.

    (A) Metagene plots show aggregated cleavage profiles at predicted unbound, bound, and constitutively bound CTCF MPBSs in 8-cell embryos. (B) Venn diagrams show the number of predicted bound/constitutively bound CTCF MPBSs in 2-cell and 8-cell embryos from 3 experiments. CTCF binding sites predicted in all three experiments are regarded as true binding sites.

    (TIF)

    S6 Fig. Representative genome tracks show ChIP-seq signals of TFs with top-ranked intra-TF models (CTCF, NR2F1 and EGR1) and low-ranked intra-TF models (MAX, ELK1 and MAFK).

    (TIF)

    S1 Table. List of ATAC-seq datasets.

    (XLSX)

    S2 Table. List of TF motifs and ChIP-seq datasets.

    (XLSX)

    S3 Table. Complete results of significance tests.

    (XLSX)

    S4 Table. Raw AUROC values for evaluation of different models and methods.

    (XLSX)

    S5 Table. Raw predicted binding probabilities for imputing CTCF binding sites in human embryos.

    (XLSX)

    Attachment

    Submitted filename: Response to reviewer comments.docx

    Attachment

    Submitted filename: Responses to review comments.docx

    Data Availability Statement

    Raw data accession numbers and derived data supporting the findings of this study are listed within the manuscript and its Supporting Information files. Code for model training and testing is openly available in the TAMC GitHub repository at https://github.com/tianqiyy/TAMC.git.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
