Advanced Science
. 2025 May 24;12(30):e03135. doi: 10.1002/advs.202503135

deepTFBS: Improving within‐ and Cross‐Species Prediction of Transcription Factor Binding Using Deep Multi‐Task and Transfer Learning

Jingjing Zhai 1,2, Yuzhou Zhang 1, Chujun Zhang 1,2, Xiaotong Yin 3, Minggui Song 1,2, Chenglong Tang 3, Pengjun Ding 1,2, Zenglin Li 1,2, Chuang Ma 1,2
PMCID: PMC12376555  PMID: 40411397

Abstract

The precise prediction of transcription factor binding sites (TFBSs) is crucial in understanding gene regulation. In this study, we present deepTFBS, a comprehensive deep learning (DL) framework that builds a robust DNA language model of TF binding grammar for accurately predicting TFBSs within and across plant species. Taking advantage of multi‐task DL and transfer learning, deepTFBS is capable of leveraging the knowledge learned from large‐scale TF binding profiles to enhance the prediction of TFBSs under small‐sample training and cross‐species prediction tasks. When tested using available information on 359 Arabidopsis TFs, deepTFBS outperformed previously described prediction strategies, including the position weight matrix, deepSEA, and DanQ, with a 244.49%, 49.15%, and 23.32% improvement in the area under the precision‐recall curve (PRAUC), respectively. Further cross‐species prediction of TFBSs in wheat showed that deepTFBS yielded a significant PRAUC improvement of 30.6% over these three baseline models. deepTFBS can also utilize information from gene conservation and binding motifs, enabling efficient TFBS prediction in species where experimental data availability is limited. A case study focusing on the WUSCHEL (WUS) transcription factor illustrated the potential use of deepTFBS in cross‐species applications, in our example between Arabidopsis and wheat. deepTFBS is publicly available at https://github.com/cma2015/deepTFBS.

Keywords: bioinformatics, cross‐species prediction, deep learning, machine learning, transcriptional regulatory network


DeepTFBS leverages deep learning to predict transcription factor binding sites across species, integrating multi‐task and transfer learning approaches to improve performance in data‐scarce scenarios. This study demonstrates enhanced accuracy in intra‐ and cross‐species prediction, revealing conserved regulatory patterns and functional variants. The findings underpin a practical framework for efficient genome‐wide analysis of transcriptional regulation in plants.


1. Introduction

The complex regulation of gene expression, governed by cis‐regulatory elements (CREs) and trans‐acting factors in eukaryotes, enables cells to achieve precise responses during different developmental stages or under various environmental conditions. Transcription factors (TFs) are key trans‐factors that regulate gene expression by recognizing specific DNA sequences, the TF binding sites (TFBSs).[ 1 , 2 ] Various high‐throughput experimental methods have been developed to determine the DNA sequence specificity of TFs. These include the chromatin immunoprecipitation (ChIP) assay with sequencing (ChIP‐Seq) in vivo[ 3 ] and DNA affinity purification sequencing (DAP‐Seq) in vitro.[ 4 , 5 ] Utilizing these techniques, genome‐wide TFBS profiling experiments have been conducted in multiple plant species, including Arabidopsis thaliana,[ 4 , 6 ] Zea mays,[ 7 , 8 ] Triticum aestivum,[ 9 ] and Triticum urartu.[ 10 ] These efforts generated valuable resources and provided considerable insight into the complex TF‐target regulation involved in various biological processes in plants. However, such work is costly, labor‐intensive, and time‐consuming, limiting the number of species in which TFBSs can be experimentally mapped. Therefore, there is an urgent need for computational methods that identify the syntax of TF‐DNA interactions from available resources and apply it to other organisms, where the availability of experimental datasets is limited.

Numerous computational algorithms have been developed to identify the binding preferences of TFs within a single species. The most commonly used method is the position weight matrix (PWM), which uses independent positional statistics of the four nucleotides (A, C, G, and T) to predict TFBSs by scanning the DNA sequences of interest. However, PWM ignores the weak signals and contextual information of TF binding patterns, often leading to false positive binding site predictions and adversely affecting downstream analyses.[ 11 ] In contrast, machine learning (ML) approaches, particularly recently developed deep learning (DL) algorithms, have emerged as powerful alternatives. These advanced computational methods can effectively learn complex patterns, including long‐distance nucleotide dependencies, by leveraging large‐scale training data.[ 12 , 13 , 14 ] ML/DL algorithms have led to improvements in TFBS prediction accuracy using DNA sequences alone.[ 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 ] Beyond sequence data, these approaches have also successfully incorporated various DNA structural features, including DNA shape,[ 22 , 24 , 25 ] and chemical and structural properties.[ 26 ] Among the DL‐based methods, DeepBind[ 15 ] first utilized a convolutional neural network (CNN) to model DNA and RNA binding specificities, achieving superior prediction accuracy with both in vitro and in vivo data. Subsequently, more sophisticated DL models were developed for predicting DNA regulatory elements in human, such as DeepSEA,[ 17 ] DanQ,[ 18 ] and BPNet.[ 23 ] While DeepSEA utilizes a CNN architecture to learn regulatory patterns from DNA sequences, DanQ implements a hybrid approach combining CNN with recurrent neural networks to enhance prediction capabilities. BPNet was specifically designed to predict base‐resolution ChIP‐nexus profiles and provide insights into TF binding patterns and their interactions.
More recently, transformer‐based models such as BERT‐TFBS[ 21 ] have emerged, leveraging the power of attention mechanisms to capture complex dependencies in DNA sequences and improve TFBS prediction accuracy. Despite these advances in DL‐based approaches, several key challenges remain to be addressed.

First, DL‐based methods typically rely on substantial amounts of training data to achieve their accuracy. When these methods are applied in prediction tasks where experimentally reported binding data is limited, their prediction accuracy decreases markedly.[ 25 , 27 ] In addition, the existing DL‐based methods do not always perform well when used in cross‐species TFBS prediction, even when the TFs are highly conserved or represent direct orthologs. This limitation is partly due to the high turnover of TF binding sites.[ 28 ] In human and mouse, domain‐adaptive neural networks[ 29 ] and adversarial training[ 30 ] have been explored to enhance the cross‐species prediction of binding sites. However, few studies have explored effective ML/DL strategies for cross‐species TFBS prediction in plants. Considering the scarcity of experimental TF binding data in many plant species, developing algorithms that can be trained on limited datasets remains of great importance, and achieving accurate cross‐species prediction is an especially valuable goal.

This study introduces deepTFBS, a DL framework designed for the precise identification of TFBSs within and across species. By leveraging multi‐task and transfer learning, deepTFBS effectively overcomes the limitations associated with small training sample sets in building predictive models. In addition, for cross‐species predictions, we propose the integration of orthologous targets and PWM information to enhance accuracy. The utility of deepTFBS was validated using extensive publicly accessible TF binding profile data in Arabidopsis and wheat. Moreover, using the WUS TF as a test case, we experimentally confirmed several predicted binding events in wheat, using yeast one‐hybrid assays, further demonstrating the accuracy of TFBS predictions provided by deepTFBS.

2. Results

2.1. Designing the deepTFBS Framework

deepTFBS is a DL framework specifically designed to accurately predict whether a given input sequence contains a binding site for a specific transcription factor, incorporating multi‐task (deepTFBS‐MT) and transfer learning (deepTFBS‐TL) strategies. In training phase I, the multi‐task DL model deepTFBS‐MT is constructed by simultaneously learning multiple TF binding preferences (Figure  1A). deepTFBS‐MT is pre‐trained on binding data from 359 Arabidopsis TFs. It processes input sequences of 1000 base pairs (bp), encoded using one‐hot encoding (see Experimental Section). These encoded sequences are fed into the deepTFBS backbone, which constitutes the core network architecture of the framework (Figure 1B). The model outputs the binding probabilities for all 359 TFs via a fully connected layer comprising 359 neurons, each corresponding to a specific TF (Figure 1A). The pre‐trained deepTFBS‐MT offers two key advantages. First, by fitting a single model with all available TF binding profiles, deepTFBS‐MT leverages a large dataset to train an accurate, robust, and generalizable DL model. Second, it capitalizes on the strengths of multi‐task learning to deliver more precise and unbiased TFBS predictions.
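As a concrete illustration of the encoding step described above, the following sketch (in Python/NumPy, with an assumed A/C/G/T channel order, since the text specifies only "one‐hot encoding") converts a 1000‐bp sequence into the (1000, 4) matrix consumed by the backbone:

```python
import numpy as np

# Channel order A, C, G, T is an assumption; the paper only states "one-hot".
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASE_INDEX.get(base)
        if j is not None:          # ambiguous bases (e.g. N) stay all-zero
            mat[i, j] = 1.0
    return mat

# A 1000-bp input becomes a (1000, 4) tensor; the multi-task head then maps
# the learned representation to 359 independent binding probabilities.
x = one_hot("ACGT" * 250)          # toy 1000-bp sequence
print(x.shape)
```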

Figure 1.

Figure 1

Illustration of deepTFBS framework. A) The three variants of the deepTFBS framework. The model predicts the probability that a 1000‐bp input sequence contains a binding site for each TF (multi‐task) or a specific TF (transfer learning). B) The deepTFBS network structure backbone. C) Train, validation and test data splits in deepTFBS‐MT.

However, training deepTFBS‐MT requires large‐scale TF binding profiles, which are not always available for many TFs or non‐model species. To address this limitation, we developed deepTFBS‐TL, a transfer learning (TL)‐based method that builds upon the pre‐trained deepTFBS‐MT model (Figure 1A). While deepTFBS‐MT is a multi‐task model that simultaneously predicts the binding probabilities for 359 TFs, deepTFBS‐TL is a TF‐specific binary classification model. For each TF, a separate deepTFBS‐TL model is fine‐tuned using the shared sequence representations learned by deepTFBS‐MT. This means that deepTFBS‐TL consists of 359 individually fine‐tuned models, each focused on a single TF. Each deepTFBS‐TL model is initialized with pretrained weights of deepTFBS‐MT, with the final output layer replaced by a single sigmoid function for binary classification. This design narrows the parameter search space and enables effective model training even with limited TF‐specific binding data. The transfer learning technology used is particularly advantageous for model training with small datasets. For comparison, we also implemented deepTFBS‐ST, which trains a similar binary classification model but without transfer learning (starting from randomly initialized weights). This serves as a baseline to evaluate the benefits of transfer learning in deepTFBS‐TL (Figure 1A).
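The weight‐initialization scheme behind deepTFBS‐TL can be sketched framework‐agnostically. The layer names, shapes, and dictionary representation below are illustrative assumptions, not the actual implementation; the point is simply that backbone parameters are copied from deepTFBS‐MT while the 359‐way head is replaced by a fresh single‐output head:

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 128   # hypothetical width of the backbone's final representation

# Stand-in for pre-trained deepTFBS-MT parameters: shared backbone layers
# plus a 359-way multi-task output head (names/shapes are illustrative).
pretrained = {
    "backbone.conv": rng.normal(size=(64, 4, 15)),
    "backbone.bilstm": rng.normal(size=(HIDDEN, HIDDEN)),
    "head": rng.normal(size=(HIDDEN, 359)),
}

def init_transfer_model(weights):
    """Copy every backbone layer unchanged, but swap the multi-task head for
    a freshly initialized single-output head feeding one sigmoid unit."""
    model = {k: v.copy() for k, v in weights.items() if k != "head"}
    model["head"] = rng.normal(scale=0.01, size=(HIDDEN, 1))
    return model

tl_model = init_transfer_model(pretrained)
print(tl_model["head"].shape)   # one output unit: TF-specific binary classifier
```

A deepTFBS‐ST baseline would instead draw all entries of the dictionary at random, which is the only difference the comparison isolates.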

Both deepTFBS‐MT and deepTFBS‐TL rely on a combination of convolutional layers, bidirectional long short‐term memory (BiLSTM) layers, and a single‐head attention (SHA) layer (deepTFBS backbone; Figure 1B). First, the deepTFBS backbone uses a convolutional layer to scan one‐hot encoded input sequences, enabling the recognition of TF binding motifs. To enhance the effectiveness of the convolutional layer and to avoid overfitting, batch normalization is applied to re‐center and re‐scale feature maps. Subsequently, three stacked convolutional layers are used to extract higher‐level feature representations. To address issues associated with training very deep neural networks, deepTFBS uses a residual block with a shortcut connection, an approach previously used in computer vision[ 31 ] and genomic signal prediction tasks.[ 32 ] The use of the residual block mitigates issues associated with vanishing gradients and performance degradation. Following the extraction of local sequence features by the convolutional layer and the residual block, a bidirectional LSTM layer is employed to capture local and global dependencies within DNA sequences. This step enables deepTFBS to extract comprehensive contextual information, further enhancing the representation of features relevant to TF binding. To detect relevant long‐range dependencies and identify other features critical for TF binding, deepTFBS incorporates a SHA layer, which allows the model to focus on relevant information. Finally, a fully connected layer is utilized to make predictions by integrating the features identified via convolution, BiLSTM, and self‐attention.
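Of these components, the single‐head attention layer is the least standard; a minimal NumPy sketch of scaled dot‐product attention over BiLSTM output states (all dimensions and projection matrices are hypothetical, for illustration only) is:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(h, wq, wk, wv):
    """Scaled dot-product attention over BiLSTM states h of shape
    (timesteps, features): each position attends to every other position."""
    q, k, v = h @ wq, h @ wk, h @ wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (T, T), rows sum to 1
    return weights @ v                                  # context-weighted features

rng = np.random.default_rng(1)
T, F = 250, 64                       # hypothetical sequence length / feature width
h = rng.normal(size=(T, F))
wq, wk, wv = (rng.normal(scale=0.1, size=(F, F)) for _ in range(3))
out = single_head_attention(h, wq, wk, wv)
print(out.shape)
```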

2.2. deepTFBS‐MT Shows Superior Performance in Predicting TFBSs in Arabidopsis

We first assessed the performance of deepTFBS‐MT by training it with 359 Arabidopsis TF binding profiles (Table S1, Supporting Information), which were compiled from previously published studies.[ 4 , 6 , 33 , 34 , 35 ] To ensure consistent peak regions across the dataset, we employed a standardized computational pipeline (Figure S1, Supporting Information) to reprocess raw reads. Predicting TFBSs is inherently challenging since TF binding regions constitute only a minor fraction of the entire genome.[ 4 , 8 ] To generate training datasets for deepTFBS‐MT, we partitioned the Arabidopsis genome into 595710 non‐overlapping 200‐bp genomic windows, extending these windows to 1000 bp and retaining the 295955 windows supported by ≥5 TFs for model training and evaluation (see Experimental Section). To avoid data leakage, we held out all sequences from chromosome 4 as an independent test set and used chromosomes 1–3 and 5 for training, reserving 5000 randomly sampled windows for validation (Figure 1C). In this context, “positive” samples represented sequences containing binding site(s) for a specific TF, while “negative” samples were sequences with binding sites associated only with some of the remaining 358 TFs.
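The windowing scheme can be sketched as follows; the coordinate convention and edge handling are assumptions, since the text only specifies non‐overlapping 200‐bp windows extended to 1000 bp:

```python
def make_windows(chrom_len, window=200, extended=1000):
    """Tile a chromosome into non-overlapping `window`-bp tiles, then extend
    each symmetrically to `extended` bp; windows running off either end of
    the chromosome are dropped (an assumed edge-handling choice)."""
    flank = (extended - window) // 2
    out = []
    for start in range(0, chrom_len - window + 1, window):
        ext_start = start - flank
        ext_end = start + window + flank
        if ext_start >= 0 and ext_end <= chrom_len:
            out.append((ext_start, ext_end))
    return out

wins = make_windows(10_000)          # toy 10-kb "chromosome"
print(len(wins), wins[0])            # every retained window spans 1000 bp
```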

Using the hold‐out testing dataset, containing 54866 sequences, deepTFBS‐MT showed an impressive performance with a median AUROC (area under the receiver operating characteristic curve) of 0.964 (Figure  2A and Table S2, Supporting Information). However, considering the extreme imbalance between positive and negative samples within the training, validation, and testing datasets, the AUROC metric can potentially over‐estimate model performance, a common limitation in imbalanced classification problems in machine learning.[ 36 ] In contrast, PRAUC (area under the precision‐recall curve) does not consider true negatives and is therefore more suitable for assessing model performance on highly imbalanced datasets where positive samples represent a minority. When deepTFBS‐MT was evaluated by calculating PRAUC using the same test dataset, we observed a median PRAUC value of 0.629 (Figure 2B). A total of 124 TFs exhibited relatively low PRAUC values (less than 0.5), which was largely attributable to the low positive‐to‐negative ratio in the testing dataset (Figure S2A, Supporting Information). For example, RGA (AT2G01570), which has the lowest PRAUC of 0.0093, is represented by only 222 binding sites among a large number of negatives in the testing set (Figure 2C). Notably, when the negative samples for RGA were downsampled to balance the dataset, the PRAUC increased significantly (Figure 2C), indicating that the low PRAUC reflects data imbalance rather than poor model performance. Similarly, for another TF, ATNAC6 (AT5G39610), with a PRAUC of 0.5031 and 1082 binding sites, we observed a remarkable improvement in PRAUC upon downsampling negative samples (Figure 2D). Furthermore, we observed that performance showed a TF‐dependent trend (Figure S2B, Supporting Information): families such as AP2‐EREBP, bZIP, WRKY, and Dof generally showed higher PRAUC values, while others like NAC, C2H2, and MYB had lower performance. This variation is partly attributable to the number of genome‐wide binding sites available for each family: TF families with more abundant binding data tend to yield better predictive performance due to richer training signals. These findings highlight the capability of deepTFBS‐MT in predicting TF binding, even when analyzing TFs with sparse binding data.
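The sensitivity of PRAUC, but not AUROC, to class imbalance can be reproduced with synthetic scores. The sketch below (toy Gaussian scores, not the paper's data) compares both metrics at 1:1 and 1:100 negative‐to‐positive ratios, using rank‐based AUROC and average precision as the PRAUC estimate:

```python
import numpy as np

def auroc(y, s):
    """Rank-based AUROC: probability a random positive outscores a random negative."""
    pos, neg = s[y == 1], s[y == 0]
    return (pos[:, None] > neg[None, :]).mean()

def prauc(y, s):
    """Average precision: mean precision evaluated at each true positive."""
    y_sorted = y[np.argsort(-s)]
    precision = np.cumsum(y_sorted) / np.arange(1, len(y_sorted) + 1)
    return precision[y_sorted == 1].mean()

rng = np.random.default_rng(2)
pos_scores = rng.normal(2.0, 1.0, 200)       # scores for true binding windows
results = {}
for n_neg in (200, 20_000):                  # 1:1 vs 1:100 negative-to-positive
    y = np.concatenate([np.ones(200), np.zeros(n_neg)]).astype(int)
    s = np.concatenate([pos_scores, rng.normal(0.0, 1.0, n_neg)])
    results[n_neg] = (auroc(y, s), prauc(y, s))
    print(n_neg, round(results[n_neg][0], 3), round(results[n_neg][1], 3))
```

AUROC stays essentially unchanged when negatives are multiplied a hundredfold, while average precision collapses, which is exactly the RGA/ATNAC6 behavior discussed above.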

Figure 2.

Figure 2

Performance evaluation of deepTFBS‐MT in predicting TF binding sites in Arabidopsis. A) Receiver operating characteristic (ROC) curves and B) precision‐recall (PR) curves for predictions across 359 TFs in Arabidopsis. Each curve represents one TF. C) Precision‐recall curves for RGA (AT2G01570) at different negative‐to‐positive (N/P) ratios. The model's performance decreases significantly as class imbalance increases, with PRAUC dropping from 0.677 (N/P = 1) to 0.023 (N/P = 100). Dotted lines represent individual runs; solid lines show the average. D) Precision‐recall curves for ATNAC6 (AT5G39610) across varying N/P ratios, demonstrating a similar trend of declining performance with increasing class imbalance. PRAUC decreases from 0.955 (N/P = 1) to 0.542 (N/P = 40). E–L) Performance comparison of deepTFBS‐MT with existing methods. Scatter plots showing the comparison of area under the ROC curve (AUC) between deepTFBS‐MT and E) PWM, G) deepSEA, I) DanQ, and K) BPNet body. Area under the PR curve (PRAUC) comparisons between deepTFBS‐MT and F) PWM, H) deepSEA, J) DanQ, and L) BPNet body are also shown. Each point represents one TF, with points above the diagonal indicating superior performance by deepTFBS‐MT.

Next, we compared the performance of deepTFBS‐MT with the previously reported PWM‐based method to demonstrate the predictive power of the new model. This approach scores genomic sequences with a matrix representing the log likelihood of each nucleotide being present in a given TF binding motif. In this analysis, we scored each DNA sequence from the Arabidopsis test dataset (see Experimental Section) to ensure a fair comparison between the two approaches. Using the AUROC and PRAUC metrics, we found that deepTFBS‐MT demonstrated superior performance for 350 out of the 359 TFs (Figure 2E,F). Notably, compared to PWM, deepTFBS‐MT improved the median PRAUC from 0.189 to 0.629 (Table S2, Supporting Information). Besides the PWM‐based method, we also benchmarked the performance of deepTFBS against three other DL methods: DeepSEA,[ 17 ] DanQ,[ 18 ] and BPNet.[ 23 ] Both DeepSEA and DanQ were originally designed as classification models to predict regulatory elements (e.g., TF binding sites, histone marks, etc.) in the human genome, making them directly comparable to deepTFBS‐MT after re‐training on the Arabidopsis TF‐binding datasets. BPNet, in contrast, was originally designed as a regression model for base‐pair resolution prediction of TF binding intensity using ChIP‐nexus data. We adapted its core architecture, the BPNet body, which consists of 11 stacked convolutional layers, to perform our multi‐task classification while maintaining its powerful feature extraction capability. Compared to DeepSEA, deepTFBS‐MT showed better performance in identifying the binding sites of 341 out of the 359 TFs when assessed by AUC, and better performance for 323 TFs when the PRAUC measurement was used. On average, deepTFBS‐MT achieved a 3.84% AUC and a 22.14% PRAUC performance gain (Figure 2G,H). Similarly, when compared with DanQ, deepTFBS‐MT outperformed DanQ for 268 TFs when using AUC and for 246 TFs with the PRAUC metric (Figure 2I,J). When compared to the adapted BPNet architecture, deepTFBS‐MT demonstrated consistently better performance across both AUC and PRAUC metrics (Figure 2K,L), indicating that our hybrid architecture of CNN, BiLSTM, and attention layers is more effective for predicting TF binding patterns. To further validate the performance of deepTFBS‐MT, we performed leave‐one‐chromosome‐out (LOCO) cross‐validation, where each chromosome was iteratively used as the test set. Across five LOCO folds, deepTFBS‑MT achieved a median AUC of 0.963 and a PRAUC of 0.621, outperforming DeepSEA (0.922/0.482), DanQ (0.958/0.599), and the adapted BPNet (0.804/0.171) (Figure S3, Supporting Information). These results demonstrate that deepTFBS‑MT consistently exceeds the performance of other supervised deep‑learning models.
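For reference, the PWM baseline amounts to sliding a log‐likelihood matrix along the sequence and recording the best‐scoring window. A minimal sketch with a made‐up 4‐bp motif and a uniform background (all numbers are illustrative, not a real PWM from the benchmark):

```python
import numpy as np

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pwm_scan(seq, pwm):
    """Score every window of len(pwm) with a log-likelihood PWM and return
    the best score together with its offset in the sequence."""
    L = pwm.shape[0]
    best_score, best_pos = -np.inf, -1
    for i in range(len(seq) - L + 1):
        score = sum(pwm[j, BASE[seq[i + j]]] for j in range(L))
        if score > best_score:
            best_score, best_pos = score, i
    return best_score, best_pos

# Toy motif "GTCG": 85% preferred base per position, log2 odds vs uniform 0.25.
freq = np.full((4, 4), 0.05)
for j, b in enumerate("GTCG"):
    freq[j, BASE[b]] = 0.85
pwm = np.log2(freq / 0.25)

score, pos = pwm_scan("AAAAGTCGAAAA", pwm)
print(round(score, 2), pos)
```

Because each position is scored independently, such a scan cannot capture the contextual and long‐range dependencies that the DL models above learn, which is the gap the benchmark quantifies.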

To investigate the contributions of individual components to the model's performance, we conducted a systematic ablation study by training six model variants: 1) CNN only, 2) LSTM only, 3) single‐head attention (SHA) only, 4) CNN+LSTM, 5) CNN+SHA, and 6) LSTM+SHA. As shown in Figure  3 , the full architecture of CNN+LSTM+SHA, which underpins deepTFBS, achieved the highest median PRAUC, highlighting the synergistic effect of these components. In contrast, the LSTM‐only architecture showed the poorest performance, underscoring the importance of CNN for extracting local patterns. While the improvement from CNN+LSTM (PRAUC: 0.602) to CNN+LSTM+SHA (PRAUC: 0.629) was modest, it demonstrates that SHA adds value, albeit incrementally. These results demonstrate the efficacy of the deepTFBS‐MT architecture in accurately predicting TFBSs in Arabidopsis, providing a robust framework for transcription factor binding prediction.

Figure 3.

Figure 3

Ablation study demonstrating the contribution of different architectural components in deepTFBS‐MT. A) Area under ROC curve (AUC) and B) area under precision‐recall curve (PRAUC) for different model variants. Seven architectures were evaluated: LSTM only, single‐head attention (SHA) only, LSTM+SHA, CNN only, CNN+SHA, CNN+LSTM, and the complete deepTFBS architecture (CNN+LSTM+SHA). Violin plots show the performance distribution across all TFs, with box plots indicating the median and quartiles. The complete architecture (CNN+LSTM+SHA) achieved the highest median performance, while LSTM‐only showed the lowest.

2.3. deepTFBS‐MT has Interpretation Ability by Identifying TF Binding Preferences

To better understand the factors influencing deepTFBS‐MT's performance, we applied Integrated Gradients (IG),[ 37 , 38 ] a gradient‐based attribution method, to visualize the importance of individual features identified by the model (see Experimental Section). IG averages the gradients of the model's output along a path from a baseline to the input, quantifying the importance of each nucleotide. A high absolute gradient for a feature indicates its greater significance in the prediction process. Applying this method to sequences in the test dataset, we identified putative motifs for the top ten TFs predicted by deepTFBS‐MT. Notably, seven of these ten motifs were also recorded in the Plant Cistrome Database.[ 4 ] Based on PRAUC metrics, the highest‐ranked TF was CBF4, known for its role in drought stress.[ 39 ] The consensus motif identified by deepTFBS‐MT for CBF4 was [TC]GTCGG, which closely matches the motif recorded in the Plant Cistrome Database ([ATC][TC]GTCGG[CT]) (Figure  4A). Similarly, AT3G16280 and CBF2, both members of the AP2‐EREBP family, displayed nearly identical motifs, suggesting shared TF binding preferences among these TFs. For the remaining TFs—EPR1, GBF3, ATHB25, and AP2—the motifs identified by deepTFBS‐MT closely aligned with those in the Plant Cistrome Database (Figure 4A). These results indicate that deepTFBS‐MT has the capability of learning the binding preferences of different TFs.
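The IG computation itself can be illustrated on a toy differentiable model. The sketch below uses a single sigmoid unit in place of the deepTFBS architecture (purely for illustration) and checks the completeness axiom: the attributions should sum to f(input) − f(baseline):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def integrated_gradients(x, baseline, w, steps=200):
    """Riemann-sum (midpoint) approximation of IG for the toy model
    f(x) = sigmoid(w.x): attribution_i = (x_i - b_i) * mean path gradient."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.zeros_like(x)
    for a in alphas:
        p = sigmoid(w @ (baseline + a * (x - baseline)))
        grads += p * (1.0 - p) * w          # analytic gradient of sigmoid(w.x)
    return (x - baseline) * grads / steps

rng = np.random.default_rng(3)
w = rng.normal(size=8)
x = rng.normal(size=8)                      # e.g. a flattened one-hot fragment
b = np.zeros(8)                             # all-zero baseline "sequence"
attr = integrated_gradients(x, b, w)
gap = sigmoid(w @ x) - sigmoid(w @ b)
print(attr.sum(), gap)
```

In deepTFBS‐MT the same average‐gradient quantity is computed per nucleotide channel, so stacking the attributions position by position yields the sequence‐logo‐style motifs shown in Figure 4A.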

Figure 4.

Figure 4

Interpretation of deepTFBS‐MT predictions and analysis of regulatory variants. A) Comparison of binding motifs identified by deepTFBS‐MT (left) and those documented in PlantCistromeDB (right) for the top‐performing TFs. Sequence logos show the position‐specific nucleotide preferences, with height indicating information content. B) Schematic illustration of the regulatory variant effect prediction pipeline. The process includes input sequence pairs (reference and alternate alleles), prediction of binding probabilities for 359 TFs using deepTFBS‐MT, and calculation of an aggregate effect score based on differential binding predictions. C) Distribution of effect scores across all analyzed SNPs from the 1001 Genomes Project (n = 10706842), shown on a log10 scale. D) Enrichment analysis of rare (MAF ≤ 0.1%) versus common (MAF ≥ 5%) variants among SNPs with high (top 0.1%) and low (bottom 0.1%) effect scores. The odds ratio of 1.87 indicates significant enrichment of rare variants among high‐impact SNPs, as determined by Fisher's exact test. E) Odds ratio plots showing the enrichment of predicted high‐impact variants in conserved non‐coding sequences (CNS, blue) and expression quantitative trait loci (eQTL, orange) across different effect score thresholds.

2.4. deepTFBS‐MT Predicts Putative Functional Regulatory Variants

Given the efficacy of deepTFBS in understanding TF binding grammar, we next investigated whether it could be applied to identify candidate functional regulatory variants, which are known to play a crucial role in complex phenotypes.[ 40 , 41 ] To assess the regulatory impact of individual SNPs, we analyzed sequence pairs comprising both reference and alternate alleles, each centered within its flanking regions. These sequences were input into the deepTFBS‐MT model to predict the binding probabilities of 359 TFs for each allele. The regulatory impact of each SNP was then quantified by computing the absolute differences in predicted binding scores between reference and alternate alleles. We defined an aggregate “effect score” for each SNP by summing these differential binding probabilities across all 359 TFs (Figure 4B).
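The effect‐score computation can be sketched end to end with a stand‐in predictor (a fixed random projection, purely illustrative, in place of the actual deepTFBS‐MT network):

```python
import numpy as np

N_TF = 359
W = np.random.default_rng(4).normal(scale=0.05, size=(4, N_TF))  # toy "model"

def one_hot(seq):
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    m = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        m[i, idx[b]] = 1.0
    return m

def predict_binding(onehot):
    """Stand-in for deepTFBS-MT: maps a (L, 4) one-hot sequence to 359
    per-TF binding probabilities (a fixed random projection, not the model)."""
    return 1.0 / (1.0 + np.exp(-(onehot.sum(axis=0) @ W)))

def effect_score(ref, alt):
    """Aggregate SNP effect: sum over all 359 TFs of |P_bind(ref) - P_bind(alt)|."""
    return float(np.abs(predict_binding(ref) - predict_binding(alt)).sum())

ref = one_hot("ACGTACGTAC")
alt = one_hot("ACGTTCGTAC")   # single A->T substitution
print(round(effect_score(ref, alt), 3))
```

Identical alleles score exactly zero, and any allele that shifts the predicted binding of many TFs accumulates a large aggregate score, which is the ranking signal used below.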

We applied this analytical approach to 10706842 SNPs from naturally occurring accessions in the 1001 Genomes Project.[ 42 ] As expected, the majority of these SNPs had very low effect scores (Figure 4C). However, when focusing on the top 10% of SNPs with the highest effect scores, we found that over 66% were located in either promoter (upstream 1000 bp) or distal intergenic regions, suggesting a potential role in gene regulation (Figure S4A, Supporting Information). To further validate deepTFBS‐MT's ability to identify putative functional regulatory variants, we assessed the enrichment of rare alleles (minor allele frequency [MAF] < 0.001) in highly impactful regulatory variants (top 0.1% of SNPs with the highest effect scores) compared to those with low impact (bottom 0.1%). Since functional variants tend to have lower frequencies in populations due to selective constraints,[ 43 ] we hypothesized an enrichment of rare alleles in highly impactful variants. Consistent with this hypothesis, we observed an enrichment of rare alleles with an odds ratio of 1.87 (Figure 4D; p‐value < 2.2e‐16 by Fisher's exact test). Additionally, we observed that as effect scores increased, the average MAF decreased (Figure S4B, Supporting Information), indicating that highly impactful regulatory variants tend to be rarer in the population. Beyond allele frequency, we assessed the enrichment of these variants in conserved non‐coding sequences (CNS).[ 44 ] As expected, highly impactful variants showed higher odds ratios for enrichment in CNS, further supporting their potential functional relevance (Figure 4E; Table S3, Supporting Information). These findings demonstrate deepTFBS‐MT's ability to prioritize regulatory variants likely to play key roles in gene regulation.
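The enrichment statistics reduce to a 2×2 table. A self‐contained sketch of the odds ratio and the one‐sided Fisher (hypergeometric tail) p‐value, using toy counts rather than the paper's data:

```python
from math import comb

def odds_ratio(a, b, c, d):
    """2x2 table: a = rare & high-effect, b = common & high-effect,
    c = rare & low-effect, d = common & low-effect."""
    return (a * d) / (b * c)

def fisher_one_sided_p(a, b, c, d):
    """P(X >= a) under the hypergeometric null: the enrichment tail of
    Fisher's exact test for a 2x2 table with fixed margins."""
    row1, col1, n = a + b, a + c, a + b + c + d
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)
    return p

# Toy counts (not the paper's data): rare alleles over-represented among
# high-effect SNPs.
a, b, c, d = 60, 140, 30, 170
orat = odds_ratio(a, b, c, d)
pval = fisher_one_sided_p(a, b, c, d)
print(round(orat, 2), pval)
```

With the paper's genome‐scale counts one would use `scipy.stats.fisher_exact` instead of the explicit sum, but the quantity being computed is the same.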

When examining the highly impactful regulatory variants, we noticed two interesting examples. The first is the SNP chr2:6321143 (T/A), located upstream of the PR1 gene, which is associated with leaf yellowing disease.[ 45 ] Population transcriptome data from the 1001 Genomes Project have demonstrated that this SNP affects the expression of the PR1 gene (Figure S4C, Supporting Information). According to deepTFBS‐MT predictions, the “A” allele reduced the binding affinity of multiple WRKY family TFs, including WRKY22 (AT4G01250), WRKY27 (AT5G52830), WRKY8 (AT5G46350), and WRKY15 (AT2G23320) (Figure S4D, Supporting Information). This supports previous findings that WRKY TFs interact with the PR1 gene promoter, synergistically stimulating PR1 expression in response to salicylic acid.[ 46 ] The second example is SNP chr3:3059896 (A/C), associated with flowering time. According to deepTFBS‐MT predictions (Figure S4E, Supporting Information), this SNP disrupts the binding motif of ATSPL9, a TF known to affect flowering in Arabidopsis. Taken together, these examples demonstrate how deepTFBS‐MT predictions can be integrated with GWAS results to offer mechanistic insights into how specific SNPs influence gene expression and phenotypic traits, aiding the identification of causal mutations associated with complex traits.

2.5. deepTFBS‐TL Shows Enhanced TFBS Prediction Through Transfer Learning Both within and Across Plant Species

Although deepTFBS‐MT showed impressive performance in predicting TFBSs in Arabidopsis, it relied heavily on large TF binding datasets. Such extensive datasets are often unavailable for other plant species, especially crops. To address this limitation, we developed deepTFBS‐TL (Figure 1A), a transfer learning approach that reuses the sequence patterns learned by deepTFBS‐MT from large‐scale TF binding profiles. By employing transfer learning, deepTFBS‐TL initializes with the pre‐trained weights from deepTFBS‐MT, narrowing the parameter search space. As a result, deepTFBS‐TL is expected to perform well even with fewer training observations. deepTFBS‐TL is designed as a binary classifier. It uses the same positive samples as deepTFBS‐MT and generates negative samples by selecting 10 random sets, each matching the number of positives. For each TF, we trained 10 deepTFBS‐TL models using the positive samples and the corresponding negative sets. For a fair comparison with deepTFBS‐MT, we evaluated deepTFBS‐TL on the same imbalanced test dataset. deepTFBS‐TL improved performance for 347 out of 359 TFs, achieving an average PRAUC improvement of 50.5% compared to deepTFBS‐MT (Figure  5A and Table S4, Supporting Information).
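The balanced negative‐set construction described above can be sketched as follows; the positive count and the candidate pool of non‐binding windows are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
n_pos = 1_200                              # hypothetical positives for one TF
candidate_negatives = np.arange(50_000)    # indices of candidate non-binding windows

# Draw 10 independent negative sets, each matching the positive count, as
# used to train the 10 deepTFBS-TL replicate models per TF.
negative_sets = [
    rng.choice(candidate_negatives, size=n_pos, replace=False)
    for _ in range(10)
]
print(len(negative_sets), negative_sets[0].shape)
```

Sampling without replacement within each set keeps every replicate balanced at a 1:1 ratio, while the 10 independent draws average out any bias from a particular choice of negatives.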

Figure 5.

Figure 5

Performance comparison of transfer learning strategies in deepTFBS. A) Scatter plot comparing PRAUC values between deepTFBS‐MT and deepTFBS‐TL for 359 Arabidopsis TFs. Points above the diagonal line indicate improved performance with transfer learning. B) Performance comparison between deepTFBS‐ST (trained from scratch) and deepTFBS‐TL (with transfer learning) for Arabidopsis TFs. Blue points highlight TFs with fewer than 1000 binding sites, demonstrating the particular advantage of transfer learning for TFs with limited training data. C) Cross‐species prediction performance comparison between deepTFBS‐ST and deepTFBS‐TL for 110 wheat TFs. deepTFBS‐TL outperformed deepTFBS‐ST in 109 out of 110 cases (indicated in red), demonstrating the effectiveness of transfer learning in cross‐species applications. D,E) Prediction performance in D) Arabidopsis and E) wheat using AUC (left) and PRAUC (right) metrics for PDLLMs, AgroNT, and deepTFBS‐TL. F) Inference speed (sequences per second) comparison.

To quantify the impact of transfer learning, we compared deepTFBS‐TL, which leverages the pre‐trained weights from deepTFBS‐MT, to deepTFBS‐ST, a binary classifier trained from scratch without transfer learning. Testing on the same datasets showed that transfer learning led to superior binding site prediction for 339 out of 359 TFs, resulting in an average PRAUC improvement of 160% (Figure 5B). We also examined how deepTFBS‐TL performs when dealing with TFs that have fewer peak regions. Specifically, we examined the 153 TFs with fewer than 1000 peak regions (see Experimental Section) and trained both models—deepTFBS‐TL with transfer learning and deepTFBS‐ST without it—on these TFs. The results demonstrated that deepTFBS‐TL significantly improved binding site identification for 135 out of 153 TFs, achieving an average PRAUC improvement of 16.9% (Figure 5B). However, analyzing the performance across different TFs revealed a slight negative correlation (R = −0.13, P‐value = 0.0024) between PRAUC improvement and the number of binding sites used in training (Figure S5A, Supporting Information), emphasizing that transfer learning is especially beneficial for models trained on smaller sample sizes.

Next, we assessed the cross‐species predictive capabilities of deepTFBS‐TL using curated binding profiles for 110 TFs in wheat (experimental section). Starting from the pre‐trained deepTFBS‐MT, we trained 110 deepTFBS‐TL and deepTFBS‐ST binary classifiers specifically for wheat. When evaluated on the held‐out wheat test set (Figure 1C), deepTFBS‐TL outperformed deepTFBS‐ST for 109 out of 110 TFs (Figure 5C; Table S5, Supporting Information). We also observed that deepTFBS‐TL showed slightly greater improvement for TFs with fewer binding sites (Figure S5B, Supporting Information), suggesting that transfer learning is beneficial in cases with limited training data, even across species.

Recent deep learning efforts in plant genomics have introduced large pre‐trained DNA language models based on transformer architectures,[ 47 ] such as AgroNT[ 48 ] and PDLLMs.[ 49 ] AgroNT was pre‐trained on genomic sequences from 48 plant species, while PDLLMs comprise a group of models pre‐trained on genomic sequences from 14 species. To directly compare our transfer learning approach (deepTFBS‐TL) with these models, we fine‐tuned AgroNT and PDLLMs on the same Arabidopsis training set (chromosomes 1–3 and 5) using the authors' recommended hyperparameters (see the experimental section) and evaluated their performance on the hold‐out chromosome 4.

deepTFBS‑TL achieved a higher median AUROC (0.985) and PRAUC (0.759), compared with 0.967/0.646 for AgroNT and 0.957/0.602 for PDLLMs (Figure 5D). Notably, deepTFBS‑MT performed comparably to AgroNT and outperformed PDLLMs (Figure S6, Supporting Information). We also evaluated cross‐species performance on the same wheat test set. In this setting, deepTFBS‑TL achieved a median AUROC of 0.976 and PRAUC of 0.257, lower than fine‐tuned AgroNT (0.993/0.378) and PDLLMs (0.979/0.335) (Figure 5E). This difference may be partially attributed to the fact that both AgroNT and PDLLMs were pre‐trained on wheat genome sequences, whereas deepTFBS‑MT was trained exclusively on Arabidopsis data. Nonetheless, deepTFBS‑TL still delivered competitive performance while offering substantial advantages in computational efficiency. Specifically, the inference speed of deepTFBS‐TL was ≈94‐fold faster than AgroNT and ≈30‐fold faster than PDLLMs (Figure 5F; Table S6, Supporting Information), making it a practical choice for large‐scale, cross‐species TFBS prediction tasks. Overall, these results underscore the effectiveness of deepTFBS‐TL, demonstrating that transfer learning significantly enhances TFBS prediction performance, especially in scenarios with limited training data.

2.6. Application of deepTFBS to Study the WUS Regulatory Network in Arabidopsis, Maize, and Wheat

Given the robust performance of deepTFBS in both intra‐ and inter‐species predictions, we applied this framework, as a test case, to explore the WUSCHEL (WUS) regulatory network. WUS is a TF crucial for stem cell maintenance and floral development.[ 50 ] We generated WUS DAP‐Seq data in Arabidopsis and maize, analyzing the results with a consistent data processing pipeline (Figure S1, Supporting Information). In these experiments, two biological replicates generated consistent peaks, and analysis of the genomic distribution of TFBSs revealed an enrichment of binding sites in intergenic regions (Figures S7 and S8, Supporting Information). In our deepTFBS‐TL tests, the WUS‐target classifiers derived from Arabidopsis and maize showed PRAUC values of 0.651 and 0.671, respectively (Figure 6A).

Figure 6.

Figure 6

Application of deepTFBS to study WUS regulatory networks across plant species. A) Precision‐recall curves showing the performance of deepTFBS‐TL in predicting WUS binding sites in Arabidopsis (blue, PRAUC = 0.651) and maize (green, PRAUC = 0.671). B) Workflow for predicting WUS binding sites in wheat. The process begins with a deepTFBS‐TL model trained on Arabidopsis data, followed by genome‐wide prediction in wheat, filtering using PWM scores and orthologous support, and retraining for wheat‐specific predictions. C) Experimental validation of predicted WUS binding sites using yeast one‐hybrid (Y1H) assays. Seven representative positive interactions are shown, with blue colonies indicating positive binding events. The empty vector (pB42AD) serves as a negative control. D) Evolutionary conservation analysis of WUS binding sites. Left: phylogenetic tree showing divergence times (MYA) between Arabidopsis, wheat, and maize. Right: horizontal bars showing the number of predicted WUS binding sites in each species and their conservation patterns (colored segments represent different conservation categories).

To generate genome‐wide TFBS data in scenarios where DAP‐Seq data for a specific TF is unavailable in a plant species, we developed the process summarized in Figure 6B. Using wheat as an example, we initially applied deepTFBS‐TL, trained on Arabidopsis data, to predict TFBSs in the wheat genome. We then employed FIMO[ 51 ] to scan the identified genomic regions and determine whether the candidate regions were supported by Arabidopsis PWM data. Furthermore, orthologous targets from Arabidopsis were used to substantiate the validity of the predictions. Sites supported by either PWM or orthologous target evidence were then used as "positives" for training deepTFBS‐TL classifiers. To verify cross‐species prediction validity, we randomly selected 14 high‐confidence targets identified in wheat for testing with yeast one‐hybrid (Y1H) assays, experimentally confirming the binding of seven of the 14 predicted targets and thereby demonstrating the reliability of our prediction method (Figure 6C). In contrast, when we randomly selected 16 predicted targets based solely on PWM, only two were identified as positives (Figure S9 and Table S7, Supporting Information).

To assess the evolutionary conservation of WUS targets across Arabidopsis, maize, and wheat, we developed species‐specific deepTFBS‐TL models to predict genome‐wide DNA binding targets. Comparative analyses of the predicted TFBSs showed a moderate level of conservation, with 760 targets in the Arabidopsis genome conserved across all three species (Figure 6D and Tables S8–S10, Supporting Information). Notably, several experimentally validated WUS targets were among these conserved targets, including TPR1 and TPR2, which are essential for embryonic patterning and auxin response,[ 50 , 52 ] and ATAVP1, which is involved in various developmental processes.[ 50 , 53 ] This finding validates our computational approach and underscores the biological significance of the conserved binding sites. Gene Ontology analysis of the 760 conserved targets highlighted their roles in diverse biological processes, including ion transport and metabolism, pollen development, and cellular pH regulation (Figure S10 and Table S10, Supporting Information), reflecting the multifaceted functions of WUS in plant development.

2.7. deepTFBS is Available as an Open Source Framework and a Web Server

To facilitate its broad application and provide enhanced functionality for plant biologists, we made the source code for deepTFBS publicly available on GitHub (https://github.com/cma2015/deepTFBS). In addition, to mitigate potential inconsistencies arising from dependencies in a DL framework, a Docker image was created and made publicly accessible (https://hub.docker.com/r/malab/deeptfbs). Furthermore, to accommodate experimental biologists lacking programming expertise, a web‐based interface was developed to facilitate the prediction of TFBSs in any given DNA sequence. The web server comprises two primary components: a database and a predictor. Within the database, users can query and retrieve predicted binding sites for 512 Arabidopsis and 110 wheat TFs (Figure S11, Supporting Information). The TFBS prediction feature, in turn, accepts any DNA sequence as input and generates a map of predicted TFBSs via an interactive graphical interface that highlights positive predicted binding sites. Each online prediction request is assigned a unique job identifier, allowing users to monitor the status of their requests. For large‐scale prediction requirements, however, we highly recommend using the deepTFBS Docker image to ensure scalability (https://hub.docker.com/r/malab/deeptfbs).

3. Discussion

The accurate identification of TF binding sites is essential for elucidating intricate transcriptional regulatory network features. We introduce deepTFBS, a framework that improves upon traditional PWM‐based methods by capturing long‐range regulatory interactions, while addressing the limitations of current DL approaches through transfer learning. The framework's core architecture combines stacked CNN layers, a BiLSTM layer, and a self‐attention layer to process DNA sequences (Figure 1). Compared to existing DL methods such as DeepSEA and DanQ, the deepTFBS framework first leverages multi‐task learning (deepTFBS‐MT) to simultaneously predict binding specificities of multiple TFs in Arabidopsis, optimally utilizing large datasets and capturing shared information among TFs. To address scenarios with limited data availability, we then developed deepTFBS‐TL. By employing transfer learning strategies, this approach effectively tackles the sample size constraints and enables accurate TFBS predictions across different plant species, making the framework versatile and practical in broader applications.

Compared to current pre‐trained DNA language models, such as AgroNT[ 48 ] and PDLLMs,[ 49 ] the deepTFBS framework stands out for its remarkable efficiency. With only 2.7 million parameters, dramatically fewer than AgroNT's 1 billion and PDLLMs' 96 million, deepTFBS achieves strong predictive performance while being significantly more computationally efficient. In Arabidopsis, deepTFBS‑MT outperforms PDLLMs (Figure S6A,B, Supporting Information) and closely matches the performance of AgroNT (Figure S6C,D, Supporting Information), while deepTFBS‑TL surpasses both fine‑tuned AgroNT and PDLLMs across AUROC and PRAUC metrics (Figure 5D). Although deepTFBS‑TL shows slightly lower performance than AgroNT and PDLLMs on the wheat test set (Figure 5E), likely due to the inclusion of wheat genome sequences in the pre‐training of those models, it offers substantial speed advantages, with inference running ≈94× faster than AgroNT and ≈30× faster than PDLLMs (Figure 5F; Table S6, Supporting Information). This lightweight design makes deepTFBS particularly well‐suited for genome‐wide TFBS scanning on modest hardware and enables seamless integration into existing analysis pipelines without the computational burden of large language models.

deepTFBS shows potential to identify putative functional regulatory variants, which may contribute to complex phenotypes. By analyzing sequence pairs of reference and alternate alleles for over 10 million SNPs from the 1001 Genomes Project, we found that deepTFBS‐MT can quantify the regulatory impact of genetic variants through an aggregate effect score across 359 TFs. Several observations suggest the biological relevance of these predictions: highly impactful variants are enriched in promoter and distal intergenic regions, show higher frequencies of rare alleles (OR = 1.87), and are preferentially located in conserved non‐coding sequences. Examination of specific cases, such as the SNPs near the PR1 gene and those associated with flowering time, suggests how deepTFBS‐MT predictions could help understand the relationship between genetic variation and phenotypic traits. The integration of these predictions with GWAS results may provide insights into potential regulatory mechanisms underlying complex traits.

The application of deepTFBS to investigate WUS regulatory networks across Arabidopsis, maize, and wheat underscores its effectiveness in comparative regulatory studies. By utilizing DAP‐Seq data from Arabidopsis and maize, we achieved respectable prediction accuracies, with PRAUC values of 0.651 and 0.671, respectively. Our integrated approach for wheat, which combines deepTFBS‐TL predictions with PWM and orthologous information, yielded encouraging results in Y1H validation experiments (Figure 6). The discovery of 760 conserved WUS targets across these three species, including previously validated targets such as TPR1, TPR2, and ATAVP1, highlights deepTFBS's potential to reveal evolutionarily conserved regulatory relationships. This study suggests that deepTFBS could serve as a valuable tool for exploring transcriptional regulation across plant species, especially in contexts where experimental binding data are sparse.

Overall, deepTFBS offers a robust and adaptable framework for exploring gene regulation in plants, particularly through its ability to integrate large‐scale TF binding profiles and provide accurate predictions across species. The framework should be especially valuable for studying transcriptional regulation in non‐model plants where binding data are limited. To facilitate the application of deepTFBS, we provide the source code, a web server, a Docker image, and comprehensive user documentation to the scientific community (https://github.com/cma2015/deepTFBS).

4. Experimental Section

Arabidopsis and Wheat TF Binding Data

Arabidopsis TF binding data, including ChIP‐Seq and DAP‐Seq results, were collated from five different sources: 1) the PlantTFDB database,[ 33 ] which provides ChIP‐Seq datasets for 9 TFs; 2) PlantCistromeDB,[ 4 ] containing DAP‐Seq datasets for 327 TFs and providing a valuable resource for plant cistrome data; 3) PCSD,[ 34 ] from which six TFs and their corresponding binding data were derived; 4) a study describing ChIP‐Seq datasets for 21 TFs that explored the role of ABA in plant growth, development, and stress responses;[ 6 ] and 5) DAP‐Seq datasets for 3 TFs aimed at characterizing the TF binding landscape in Arabidopsis.[ 54 ] For wheat, DAP‐Seq datasets for 110 TFs were obtained from a recent study,[ 9 ] providing valuable insights into the regulatory networks of this economically important crop species.

Illumina Library Preparation and WUS DAP‐Seq in Arabidopsis and Maize

To generate DAP‐Seq data for the WUS transcription factor, we conducted in‐house experiments in both Arabidopsis and maize using the following protocol. All plant seeds were surface‐sterilized in a 10% sodium hypochlorite (NaClO) solution. Arabidopsis seeds were then germinated on half‐strength Murashige‐Skoog (1/2 MS) medium supplemented with 10% sucrose and 1% agar (w/v). Plants were cultivated in a controlled growth chamber with a 16 h light/8 h dark photoperiod, maintaining a temperature of 22 °C during the light period with an illumination intensity of 80–90 µmol m−2 s−1, and 18 °C in the dark phase. Wheat (Chinese Spring) seeds were germinated and grown on half‐strength Hoagland's liquid medium under long‐day conditions (16 h light and 8 h dark) at 22 °C with 50% relative humidity and a photon flux density of 300 µmol m−2 s−1. Maize (B73) seeds were germinated in a culture room set to a 12 h photoperiod, with an illuminance of 8000 lux and a constant temperature of 28 ± 1 °C. Genomic DNA was extracted from whole seedlings of Arabidopsis and maize, and 1 µg of this DNA was uniformly fragmented to an average size of 200 bp using the Bioruptor Plus sonicator (Diagenode). Following fragmentation, DNA ends underwent end repair, and an A‐tail was added using the NEXTflex Rapid DNA‐Seq Kit (Revvity). Next, adapters were ligated to the A‐tailed DNA fragments, preparing them as libraries for downstream DAP‐Seq analysis. The DAP‐Seq experiments were conducted in two independent replicates. The raw DAP‐Seq data have been deposited in CNCB (https://www.cncb.ac.cn) under project PRJCA033158.

Bioinformatics Processing of DAP‐Seq and ChIP‐Seq Datasets

Raw Arabidopsis DAP‐Seq or ChIP‐Seq datasets were downloaded from NCBI's SRA database (https://www.ncbi.nlm.nih.gov/sra), and quality control was performed using FastQC (v0.11.5). Low‐quality reads and adaptor sequences were trimmed using fastp (v0.19.5).[ 55 ] Bowtie (v1.1.2)[ 56 ] was used to align clean sequencing reads to the TAIR10 Arabidopsis reference genome with default parameters. Enriched regions were identified using MACS2 (v2.1.1)[ 57 ] with default parameters, except for the "‐g" parameter, which was set to the genome size (1.25e8 for Arabidopsis). The same pipeline was followed for the custom WUS DAP‐Seq datasets. Genome‐wide peak visualization and annotation were performed using the R package ChIPseeker.[ 58 ]

Sample Generation and Feature Encoding

To prepare the input for training deepTFBS, the Arabidopsis (TAIR10) and wheat (IWGSC v1.0) genomes were divided into non‐overlapping 200‐bp bins. For Arabidopsis, each bin was assigned a 359‐dimensional label reflecting its overlap with known TF binding peak regions: a label of "1" was assigned if the overlapping region exceeded 100 bp; otherwise, a "0" label was given. A similar procedure was followed for the wheat genome using its 110 TF binding profiles. Subsequently, for each 200‐bp bin, 400 bp of upstream and downstream flanking sequence was extracted from the corresponding genome, resulting in a total sequence length of 1000 bp. One‐hot encoding was then used to convert each DNA sequence into a matrix with 1000 rows and four columns, with each column representing one of the four nucleotides as follows:

A = [1, 0, 0, 0] (1)
C = [0, 1, 0, 0] (2)
G = [0, 0, 1, 0] (3)
T = [0, 0, 0, 1] (4)

All ambiguous bases were encoded as N = [0, 0, 0, 0].
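This encoding scheme can be sketched in a few lines of Python; the function name and the use of NumPy are illustrative assumptions, not part of the published implementation:

```python
import numpy as np

# Row order follows Equations (1)-(4): A, C, G, T.
BASE_TO_COL = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot matrix.

    Ambiguous bases (e.g., N) are left as all-zero rows, matching the
    N = [0, 0, 0, 0] convention described above.
    """
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        col = BASE_TO_COL.get(base)  # None for N or other ambiguity codes
        if col is not None:
            mat[pos, col] = 1.0
    return mat
```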

To evaluate the performance of deepTFBS while avoiding overfitting, the Arabidopsis and wheat datasets were each divided into three subsets: a training set, a validation set, and a testing set. Specifically, for Arabidopsis, all sequences from chromosome 4 were held out as the independent test set, while the remaining chromosomes (1–3 and 5) were used for training and validation. For wheat, chromosomes 7A, 7B, and 7D were used as the hold‐out test set, with the remaining chromosomes used for training and validation. The training datasets were used to train deepTFBS, the validation datasets to tune hyperparameters, and the test datasets to evaluate model performance.
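The chromosome hold-out scheme can be illustrated as follows; this is a simplified sketch in which the bin representation and the validation fraction are our assumptions:

```python
def split_by_chromosome(bins, test_chroms, val_fraction=0.1):
    """Partition genomic bins into train/val/test by chromosome hold-out.

    `bins` is a list of dicts with a "chrom" key. Chromosomes listed in
    `test_chroms` (e.g., {"Chr4"} for Arabidopsis, or {"7A", "7B", "7D"}
    for wheat) form the independent test set; the remaining bins are
    split into training and validation subsets.
    """
    test = [b for b in bins if b["chrom"] in test_chroms]
    rest = [b for b in bins if b["chrom"] not in test_chroms]
    n_val = int(len(rest) * val_fraction)
    return rest[n_val:], rest[:n_val], test  # train, val, test
```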

The deepTFBS Architecture and Training

The backbone of deepTFBS: The backbone of deepTFBS is a hybrid network structure that combines a residual neural network with a BiLSTM. The initial layer is a convolutional layer that applies 1D convolution operations with a specified number of kernels (weight matrices). The outputs of these operations are then processed through the rectified linear activation function (ReLU), which sets values below 0 to 0. The second part of deepTFBS consists of a residual neural network, which is used to avoid vanishing gradients. In the following step, deepTFBS employs a BiLSTM to scan for long‐range sequence interactions, thereby capturing more comprehensive sequence features. Self‐attention layers are then used to focus directly on critical features across the entire sequence. The architecture concludes with two fully connected layers: the first incorporates dropout to mitigate overfitting by randomly omitting some features, while the number of neurons in the final output layer corresponds to the number of TFs being tested.

deepTFBS‐MT pretraining: deepTFBS‐MT was pretrained on 359 Arabidopsis TF binding datasets, which are extremely imbalanced. The output layer contains 359 neurons corresponding to the Arabidopsis TFs. The model was optimized using the Adam optimizer with a batch size of 128, a binary cross‐entropy loss function, and early stopping (patience = 10, maximum 200 epochs).
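A minimal Keras sketch of this hybrid backbone and multi-task setup is given below; the kernel widths, unit counts, pooling size, and dropout rate are illustrative assumptions, not the published hyperparameters:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_deeptfbs_backbone(seq_len=1000, n_tfs=359):
    """Sketch of the CNN + residual + BiLSTM + self-attention backbone."""
    inp = layers.Input(shape=(seq_len, 4))  # one-hot encoded DNA
    # Initial 1D convolution with ReLU activation
    x = layers.Conv1D(64, 11, padding="same", activation="relu")(inp)
    # One residual block (illustrative depth) to mitigate vanishing gradients
    shortcut = x
    y = layers.Conv1D(64, 11, padding="same", activation="relu")(x)
    y = layers.Conv1D(64, 11, padding="same")(y)
    x = layers.Activation("relu")(layers.Add()([shortcut, y]))
    x = layers.MaxPooling1D(4)(x)
    # BiLSTM for long-range interactions, then self-attention over the sequence
    x = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(x)
    x = layers.Attention()([x, x])  # self-attention: query = value
    x = layers.GlobalMaxPooling1D()(x)
    # Two fully connected layers; dropout on the first, one neuron per TF
    x = layers.Dropout(0.5)(layers.Dense(128, activation="relu")(x))
    out = layers.Dense(n_tfs, activation="sigmoid")(x)
    model = tf.keras.Model(inp, out)
    # Multi-task pretraining setup: Adam + binary cross-entropy
    # (batch size 128 and early stopping would be passed to model.fit)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```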

Fine‐Tuning deepTFBS‐TL with Limited Experimental Binding Data

The training process of deepTFBS‐TL differs from deepTFBS‐MT: while deepTFBS‐MT is trained jointly across 359 TFs, deepTFBS‐TL fine‐tunes the model individually for each TF using balanced datasets, reusing the pretrained weights from MT except for the output layer. To be specific, for each TF, all positive binding sites were maintained while randomly sampling an equal number of negative sequences to create 10 different balanced datasets. These datasets share the same positive examples but have different negative sets, enabling the training of 10 independent models per TF. deepTFBS‐TL initializes each model using the pretrained weights from deepTFBS‐MT, replacing only the final output layer of deepTFBS‐MT with a single neuron for binary classification. This ensemble approach with multiple negative sets helps capture binding site variability while reducing bias from negative sequence selection. As a control, deepTFBS‐ST follows the identical training procedure but initializes all weights randomly instead of using pretrained weights.
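The balanced-ensemble construction described above can be sketched as follows; the function name and data representation are illustrative:

```python
import random

def make_balanced_datasets(positives, negative_pool, n_models=10, seed=0):
    """Create `n_models` balanced datasets sharing the same positives.

    Each dataset pairs all positive binding sites with an equally sized
    random sample from the negative pool, so the per-TF models differ
    only in their negative sets, reducing negative-selection bias.
    """
    rng = random.Random(seed)
    datasets = []
    for _ in range(n_models):
        negatives = rng.sample(negative_pool, len(positives))
        datasets.append({"pos": list(positives), "neg": negatives})
    return datasets
```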

Fine‐tuning deepTFBS‐TL without any experimental binding data: For species with no available DAP‑Seq or ChIP‑Seq data, deepTFBS‑TL is fine‑tuned using only computationally derived evidence (Figure 6B). First, the Arabidopsis‑pretrained deepTFBS‑TL model is used to scan the target genome (e.g., wheat) and generate an initial set of candidate binding sites. Next, these candidates are filtered by two criteria: 1) motif scores computed with Arabidopsis‑derived PWMs using FIMO, and 2) support from orthologous gene annotations. Sites that satisfy either criterion are designated as high‑confidence and serve as pseudo‑labeled positives to retrain a species‑adapted deepTFBS‑TL model. Finally, the adapted model's accuracy is assessed by comparing its predictions to known regulated genes or available functional validation data.
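The filtering step can be illustrated with a simplified sketch; the site identifiers, score representation, and PWM threshold are illustrative assumptions:

```python
def select_pseudo_positives(candidates, pwm_scores, ortholog_supported,
                            pwm_threshold=0.8):
    """Keep candidate sites supported by either filtering criterion.

    `candidates` is an iterable of site IDs predicted by the
    Arabidopsis-pretrained model; a site becomes a high-confidence
    pseudo-positive if its PWM motif score passes the threshold OR it
    overlaps an orthologous Arabidopsis target.
    """
    keep = set()
    for site in candidates:
        if pwm_scores.get(site, 0.0) >= pwm_threshold or site in ortholog_supported:
            keep.add(site)
    return keep
```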

All models were implemented using TensorFlow (v2.6) on servers with dual NVIDIA GeForce 3090Ti GPUs (24 GB memory; 10752 CUDA cores).

Model Evaluation

ROC: The receiver operating characteristic (ROC) curve is a widely used tool for evaluating the performance of binary classification models. The ROC curve is generated by plotting the true positive rate on the y‐axis against the false positive rate on the x‐axis at varied thresholds. The AUC value is then used to quantitatively assess the prediction accuracy of the binary model; it ranges from 0 to 1, with a higher AUC value indicating better prediction accuracy.

PRAUC: The PRAUC metric represents the area under the precision‐recall curve, constructed by plotting precision (y‐axis) against recall (x‐axis) at varied thresholds. Similar to the AUC, the PRAUC value provides a quantitative measure of the accuracy of a binary model; it also ranges from 0 to 1, with higher values denoting superior prediction accuracy.
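Both metrics can be computed from first principles, as in the following sketch; for brevity it ignores tied scores, which a full implementation would handle:

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive is ranked above a randomly chosen negative."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(i + 1 for i, (_, y) in enumerate(pairs) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def pr_auc(labels, scores):
    """Area under the precision-recall curve (average-precision form)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    area = 0.0
    for i in order:
        if labels[i] == 1:
            tp += 1
            area += tp / (tp + fp)  # precision gained at each recall step
        else:
            fp += 1
    return area / sum(labels)
```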

Comparison with PWM, DeepSEA, and DanQ

To benchmark the performance of deepTFBS, a traditional PWM‐based method and two DL methods (DeepSEA and DanQ) were used for comparison.

PWM‐based method: Given a sequence S of length L and a PWM matrix $X_{i,j}$, where $i \in \{A, C, G, T\}$ and $j = 1, 2, \ldots, n$ denotes the position of the corresponding nucleotide, the PWM can be represented as follows:

$$X = \begin{bmatrix} X_{A,1} & X_{A,2} & X_{A,3} & \cdots & X_{A,j} & \cdots & X_{A,n} \\ X_{C,1} & X_{C,2} & X_{C,3} & \cdots & X_{C,j} & \cdots & X_{C,n} \\ X_{G,1} & X_{G,2} & X_{G,3} & \cdots & X_{G,j} & \cdots & X_{G,n} \\ X_{T,1} & X_{T,2} & X_{T,3} & \cdots & X_{T,j} & \cdots & X_{T,n} \end{bmatrix} \quad (5)$$

Each sequence can then be split into m k‐mers, and each k‐mer can be scored as

$$\text{score} = \lg \prod_{\substack{i \in \{A,C,G,T\} \\ j = 1}}^{j = k} \frac{X_{i,j}}{0.25} \quad (6)$$
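Equation (6) can be implemented directly; in this sketch the PWM is represented as a mapping from nucleotide to per-position probabilities, and the 0.25 background corresponds to equal nucleotide frequencies:

```python
import math

def pwm_score(kmer, pwm, background=0.25):
    """Score a k-mer against a PWM as in Eq. (6): the base-10 logarithm
    of the product, over positions, of the PWM probability of the
    observed base divided by the background frequency."""
    prod = 1.0
    for j, base in enumerate(kmer):
        prod *= pwm[base][j] / background
    return math.log10(prod)
```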

DeepSEA: DeepSEA[ 17 ] is a multi‐label classification model based on a CNN, originally used for predicting functional noncoding DNA variants in humans. Because DeepSEA could not be directly utilized to predict TFBSs in plant species, it was re‐trained on the Arabidopsis TFBS datasets using the network structure and hyperparameters outlined in the original publication.

DanQ: Like DeepSEA, DanQ[ 18 ] is a multi‐label classification model, combining a CNN with an LSTM, that was initially trained on the same data as DeepSEA. DanQ was re‐trained in Keras using the same training, validation, and testing datasets as deepTFBS, ensuring the consistency of performance comparisons.

Comparison with Pre‑Trained DNA Language Models

To compare our framework with recent state‐of‐the‐art DNA language models, we evaluated AgroNT[ 48 ] and PDLLMs[ 49 ] on the same training and testing datasets as deepTFBS. Both models were fine‐tuned using the authors' recommended procedures. Specifically, for AgroNT, we applied the LoRA (Low‐Rank Adaptation) fine‐tuning strategy,[ 59 ] as recommended in the original paper. LoRA is a parameter‐efficient method that adapts pre‐trained transformer models by injecting trainable low‐rank matrices into existing weights, allowing effective fine‐tuning with minimal additional computational cost. For PDLLMs, we selected the "zhangtaolab/plant‐dnamamba‐BPE" model (https://huggingface.co/zhangtaolab/plant‐dnamamba‐BPE), as it showed the best performance in chromatin state prediction tasks according to the original benchmark. Both models were evaluated on the held‐out chromosome 4 and the cross‐species wheat test set using the same AUROC and PRAUC metrics applied to deepTFBS.

Model Interpretation by Integrated Gradients and Consensus Motif Identification

Gradient‐based attribution methods, such as integrated gradients (IG), are widely used to explore the contribution of input features to the output of deep neural networks.[ 37 ] IG calculates the gradients of an output neuron with respect to the input features, measuring how the output changes with small perturbations of the input. These methods assign an attribution value to each input feature, representing its contribution to the output. When applying this method to deepTFBS, the target neurons represent individual TFs, and IG was employed to compute the average gradient of the output, effectively mitigating the saturation problem that arises when calculating gradients at the input layer alone. By calculating the attribution value for each position in the input sequence and normalizing the results, it becomes possible to visualize the influence of each nucleotide on the prediction: higher attribution values indicate greater importance of the nucleotide at the corresponding position within the evaluated TFBS. IG was implemented via DeepExplain (https://github.com/marcoancona/DeepExplain).

For each TF, attribution values were calculated with the IG method for test set sequences whose predicted scores exceeded 0.9. These attribution values quantify the importance of each position within the input sequence to the model's prediction. For a sequence of length L, the top k motifs were identified based on the magnitude of the attribution values, generating a comprehensive set of candidate motifs for each TF. To determine the consensus motif, the attribution values were averaged within each window, and the highest‐scoring motif was selected while its neighbors were removed to avoid redundancy. The UMAP algorithm was used to embed the obtained motifs into a lower‐dimensional space, capturing the underlying structure and similarity between the motifs, and the DBSCAN algorithm was then applied to cluster the embedded motifs and generate the consensus motif.
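The window-averaging and redundancy-removal step can be illustrated with a simplified sketch; the window size, number of motifs, and suppression rule are our assumptions, and the subsequent UMAP/DBSCAN clustering is not shown:

```python
import numpy as np

def top_window_motifs(attributions, window=8, k=3):
    """Select up to k high-attribution windows, suppressing neighbors.

    Windowed mean attributions are ranked; each selected window masks
    out overlapping neighbors so the returned motif start positions are
    non-redundant (a simple non-maximum suppression).
    """
    scores = np.convolve(attributions, np.ones(window) / window, mode="valid")
    available = np.ones(len(scores), dtype=bool)
    picks = []
    for _ in range(k):
        masked = np.where(available, scores, -np.inf)
        best = int(np.argmax(masked))
        if not np.isfinite(masked[best]):
            break  # nothing left to select
        picks.append(best)  # start position of the motif window
        lo, hi = max(0, best - window + 1), min(len(scores), best + window)
        available[lo:hi] = False
    return picks
```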

Yeast One‐Hybrid Assays

To conduct the yeast one‐hybrid assays, the coding sequences of the TaWUS TFs were cloned individually into the pB42AD expression vector. Predicted gene promoter sequences potentially bound by AtWUS and TaWUS were inserted into separate pLacZi vectors. The recombinant pB42AD and pLacZi vectors containing the promoter sequences were co‐transformed into the yeast strain EGY48A, with empty pB42AD and pLacZi vectors serving as negative controls. After selection on SD/−Trp/−Ura plates, individual clones were transferred to SD/−Trp/−Ura medium containing X‐gal and cultivated at 28 °C for 3 d.

Conflict of Interest

The authors declare no conflict of interest.

Author Contributions

J.Z. and Y.Z. contributed equally to this work. C.M. conceived the project. J.Z. implemented the algorithm; J.Z., C.Z., M.S., and P.D. collected all the datasets and performed all the analyses. X.Y. and C.T. performed validation experiments. J.Z., C.M., Y.Z., and Z.L. interpreted data. J.Z., C.M., and Y.Z. wrote the article. All authors have read and approved the final manuscript.

Supporting information

Supporting Information

Acknowledgements

This work was supported by the National Natural Science Foundation of China (32170681 [CM]), the Projects of Youth Technology New Star of Shaanxi Province (2017KJXX‐67 [CM]), and the Sub‐project of the National Key Research and Development Program (2024YFD1201301‐2 [CM]). The authors thank the High‐Performance Computing (HPC) platform of Northwest A&F University for providing computing resources, Prof. Jifeng Ning of the College of Information Engineering, Northwest A&F University, for discussions on the model design, and members of the C. Ma Lab for their constructive feedback on this work.

Zhai J., Zhang Y., Zhang C., Yin X., Song M., Tang C., Ding P., Li Z., Ma C., deepTFBS: Improving within‐ and Cross‐Species Prediction of Transcription Factor Binding Using Deep Multi‐Task and Transfer Learning. Adv. Sci. 2025, 12, e03135. 10.1002/advs.202503135

Data Availability Statement

The data that support the findings of this study are available in the supplementary material of this article.

References

  • 1. Lambert S. A., Jolma A., Campitelli L. F., Das P. K., Yin Y., Albu M., Chen X., Taipale J., Hughes T. R., Weirauch M. T., Cell 2018, 172, 650. [DOI] [PubMed] [Google Scholar]
  • 2. Strader L., Weijers D., Wagner D., Curr. Opin. Plant Biol. 2022, 65, 102136. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Consortium E. P., Nature 2012, 489, 57. [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supporting Information

Data Availability Statement

The data that support the findings of this study are available in the supplementary material of this article.


Articles from Advanced Science are provided here courtesy of Wiley
