Abstract
Background
Promoters, as essential cis-regulatory elements in prokaryotes, govern gene expression by mediating RNA polymerase binding through core motifs and long-range regulatory interactions, playing a pivotal role in cell metabolism and environmental adaptation. Hence, accurate identification of prokaryotic promoters is vital for understanding their biological functions. However, the existing tools for predicting prokaryotic promoters are mainly concentrated on individual model organisms, and their prediction accuracy needs to be further improved. To address these gaps, we develop iPro-MP, a transformer-based prokaryotic promoter prediction framework that we systematically evaluate across 23 phylogenetically diverse species, including both model and non-model organisms.
Results
iPro-MP utilizes a multi-head attention mechanism to capture contextual information in DNA sequences and effectively learn their hidden patterns. Cross-species prediction demonstrates the necessity of constructing species-specific models. In a series of experiments, iPro-MP performs strongly, with the AUC exceeding 0.9 in 18 of 23 species.
Conclusions
Our approach to predicting prokaryotic promoters, iPro-MP, outperforms existing tools, especially for non-model organisms. For the convenience of other researchers, the source code and datasets of iPro-MP are freely available at https://github.com/Jackie-Suv/iPro-MP.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13059-025-03819-9.
Keywords: Prokaryotic promoters, Deep learning, Large language model
Background
Gene transcription is a fundamental biological process that is tightly regulated by promoter regions [1–3]. Promoters initiate transcription by guiding RNA polymerase binding and play a central role in the regulation of gene expression [4–6]. In prokaryotes, promoters serve as key cis-regulatory elements that define transcription start sites (TSSs), thereby modulating gene expression levels [7–9]. Accurate identification of promoters is therefore essential for understanding gene regulatory mechanisms and cellular responses to environmental changes. However, despite their critical biological functions, promoter recognition in genomic sequences remains a challenging task, particularly in non-model organisms.
In recent years, the rapid development of next-generation sequencing (NGS) technologies has significantly advanced multi-omics research, such as genomics and transcriptomics [10]. Among these, differential RNA sequencing (dRNA-seq) [11] enables high-resolution, genome-wide mapping of TSSs by selectively sequencing primary transcripts with a 5′ triphosphate end [12]. This technique has been successfully applied to several prokaryotic species, including Helicobacter pylori (H. pylori) [13], Methylorubrum [14], Haloferax volcanii (H. volcanii) [15], and Streptomyces tsukubaensis (S. tsukubaensis) [16], leading to the construction of several prokaryotic promoter databases, such as RegulonDB [17], DBTBS [18], Pro54DB [19], and PPD [20].
However, with the exponential growth of sequencing data and the increasing cost and labor demands of experimental validation, traditional experimental methods are no longer sufficient to meet the large-scale needs of promoter identification. This has driven the development of efficient computational approaches for prokaryotic promoter prediction. Over the past decade, a variety of machine learning models have been proposed based on DNA sequence features [5, 21–25]. For instance, Lai et al. [26] proposed iProEP, a support vector machine (SVM)-based model [27, 28] that integrates pseudo k-tuple nucleotide composition with a position-correlation scoring function, achieving accuracies of 95.2% and 93.1% for Escherichia coli (E. coli) and Bacillus subtilis (B. subtilis), respectively. Zhang et al. [29] introduced MULTiPly, a two-layer predictor capable of identifying E. coli promoters and their subtypes, with an accuracy of 86.9%. Subsequently, PromoterLCNN, a convolutional neural network (CNN)-based model, further improved the prediction accuracy to 88.6% [30]. In 2022, iPro-WAEL [31], a weighted average ensemble learning model, was developed to support promoter prediction in multiple prokaryotic species. Most recently, Du et al. [32] developed Prompt, a voting-based strategy to predict promoters of 16 prokaryotes. Although these computational models have yielded promising results, they are mainly limited to a few well-studied model organisms such as E. coli and B. subtilis, with limited applicability to a broader range of prokaryotic species. Hence, there is an urgent need for a powerful predictor capable of recognizing the promoters of multiple prokaryotic species.
To address this limitation, we developed iPro-MP, a DNABERT-based deep learning model tailored for promoter prediction across 23 prokaryotes. By leveraging a bidirectional self-attention mechanism, iPro-MP effectively captures both local sequence motifs and global contextual relationships in DNA sequences. Through a series of comparative experiments and evolutionary analyses, iPro-MP exhibited competitive performance compared to other existing tools for predicting prokaryotic promoters. To facilitate further research and application, we have made the source code and datasets publicly available at https://github.com/Jackie-Suv/iPro-MP.
Results
iPro-MP exhibits excellent performance and robustness in multi-species promoter prediction
To evaluate the predictive capability and generalizability of iPro-MP, we conducted comprehensive experiments across 23 representative prokaryotic species. First, a comparative analysis of different k-mer sizes demonstrated that the 6-mer representation yielded the best performance (Additional file 1: Fig. S1 and Additional file 2: Table S1). This indicates that a larger k-mer provides the model with richer sequence semantics, thereby enhancing its ability to effectively capture promoter-specific features. Second, several cross-validation strategies were evaluated, including 5-fold, 10-fold, and repeated 5-fold cross-validation (Additional file 1: Fig. S2 and Additional file 2: Table S2). The results showed minimal performance differences among these methods, whereas 5-fold cross-validation exhibited higher computational efficiency. Based on these findings, the 6-mer encoding scheme and 5-fold cross-validation were adopted as the final configuration for model training and evaluation.
As shown in Fig. 1a–d and Additional file 2: Table S3, iPro-MP achieved consistently high scores across four evaluation metrics using 5-fold cross-validation—Acc, AUC, area under the precision-recall curve (AUPRC), and MCC. Specifically, AUC values exceeded 0.9 in 17 species (73.9%), ranged from 0.8 to 0.9 in 5 species (21.7%), and fell below 0.8 in only one case (4.3%) (Fig. 1e). Similar trends were observed for AUPRC, indicating reliable model behavior even under data imbalance. While MCC values exhibited slightly greater variability, the majority still remained within a high-performance range, reflecting the model’s stability across species with different promoter sequence characteristics.
Fig. 1.
Comprehensive performance metrics evaluated using fivefold cross-validation: a Acc, b AUC, c AUPRC, d MCC, and e ROC curve. Numbers 1–23 correspond to the 23 prokaryotic species listed in Table 3
The robustness of iPro-MP was further validated on independent testing sets (Fig. 2 and Additional file 2: Table S4). Encouragingly, the model maintained high predictive performance, with 18 species (78.3%) achieving AUC > 0.9, four species (17.4%) between 0.8 and 0.9, and only one species (4.3%) below 0.8 (Fig. 2e). These results demonstrate that iPro-MP exhibits minimal performance degradation from training to independent testing, highlighting its strong generalization capability. Notably, the model performed well not only on common model organisms such as E. coli and B. subtilis, but also on phylogenetically distant or compositionally diverse species such as H. volcanii and S. tsukubaensis.
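As an illustration of how the AUC values above can be read, the following minimal sketch computes AUC as the probability that a randomly chosen promoter receives a higher score than a randomly chosen non-promoter, and then assigns the three bands used in the text. The function names `roc_auc` and `auc_band` are hypothetical, the pairwise formulation is the standard rank-based definition rather than the authors' evaluation code, and the handling of values exactly on a band boundary is an assumption.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (promoter, non-promoter) pairs ranked correctly.

    Ties between a promoter score and a non-promoter score count as half.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score per class")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def auc_band(auc):
    """Assign an AUC value to one of the three bands reported in the text.

    Assumption: a value of exactly 0.9 falls into the middle band.
    """
    if auc > 0.9:
        return "> 0.9"
    if auc >= 0.8:
        return "0.8-0.9"
    return "< 0.8"
```

The quadratic pairwise loop is chosen for readability; for test sets of several thousand sequences, a rank-based implementation from a standard library would be preferable.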
Fig. 2.
Comprehensive performance metrics evaluated using independent testing sets: a Acc, b AUC, c AUPRC, d MCC, e ROC curve, and f PR curve. Numbers 1–23 correspond to the 23 prokaryotic species listed in Table 3
It should be noted that despite the relative imbalance in the dataset, iPro-MP consistently achieved strong recall and F1 scores—averaging 0.824 and 0.816 under 5-fold cross-validation, and 0.834 and 0.832 on the independent test set, respectively. These results suggest that the model effectively identifies true promoter sequences while maintaining a favorable trade-off between precision and sensitivity, highlighting the model’s robustness in handling imbalanced datasets and its ability to generalize across diverse species.
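For readers less familiar with these metrics, the sketch below shows how Acc, recall, precision, F1, and MCC follow from the four confusion-matrix counts, using their standard textbook definitions. It is not taken from the iPro-MP source code; the function name `classification_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
import math


def classification_metrics(y_true, y_pred):
    """Compute threshold-based metrics for binary labels (1 = promoter, 0 = non-promoter)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    acc = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four counts, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": acc, "recall": recall, "precision": precision,
            "f1": f1, "mcc": mcc}
```

With a 1:2 ratio of positives to negatives, a classifier that labels everything as non-promoter would still reach an accuracy of about 0.667 but an MCC of 0, which is why MCC and F1 are reported alongside accuracy.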
The strong and consistent performance of iPro-MP across evolutionarily diverse species can be attributed to its underlying architecture, which is based on DNABERT. Unlike traditional machine learning approaches that rely on handcrafted features, DNABERT employs a multi-head self-attention mechanism capable of capturing both local and long-range dependencies within DNA sequences. This enables the model to learn complex regulatory signals and latent motif structures directly from the raw genomic sequence. Such contextual modeling allows iPro-MP to remain sensitive to global sequence patterns, enhancing its ability to recognize functionally relevant promoter elements—even in species that were not used during training. As a result, iPro-MP effectively generalizes promoter recognition beyond species boundaries, uncovering both conserved and species-specific regulatory elements with high fidelity.
iPro-MP reveals the species-specificity at the sequential level
To further evaluate the transferability of promoter recognition across species, we conducted a comprehensive cross-species prediction experiment. Specifically, for each of the 23 prokaryotic species, we trained a species-specific model using its own promoter data and tested its predictive performance on independent datasets from all other species. The resulting Acc, AUC, AUPRC, and MCC values are visualized in the form of heatmaps in Fig. 3a–d, respectively.
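The protocol described above (train one model per species, then score it on every species' independent test set) can be sketched as a nested loop that fills a source-by-target matrix. In the sketch below, `cross_species_matrix` captures only that protocol; `toy_fit` and `toy_accuracy` are hypothetical stand-ins for fine-tuning and evaluating iPro-MP, and the sequences are made up purely so that the example runs.

```python
def kmers(seq, k=6):
    """Return the set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_fit(samples):
    """Hypothetical stand-in for training: keep k-mers seen only in promoters."""
    pos, neg = set(), set()
    for seq, label in samples:
        (pos if label == 1 else neg).update(kmers(seq))
    return pos - neg


def toy_accuracy(model, samples):
    """Predict 'promoter' when a sequence shares any k-mer with the model."""
    hits = sum(int(bool(kmers(seq) & model)) == label for seq, label in samples)
    return hits / len(samples)


def cross_species_matrix(train_sets, test_sets, fit, score):
    """matrix[src][dst] is the score of the model trained on src, tested on dst."""
    models = {name: fit(data) for name, data in train_sets.items()}
    return {src: {dst: score(model, test_sets[dst]) for dst in test_sets}
            for src, model in models.items()}


# Tiny made-up sequences, for illustration only (label 1 = promoter).
train_sets = {
    "A": [("TATAATGC", 1), ("GGGGGGGG", 0)],
    "B": [("TTGACAGC", 1), ("CCCCCCCC", 0)],
}
test_sets = {
    "A": [("CTATAATG", 1), ("GGGGGGGC", 0)],
    "B": [("ATTGACAG", 1), ("CCCCCCCA", 0)],
}
matrix = cross_species_matrix(train_sets, test_sets, toy_fit, toy_accuracy)
```

The diagonal entries `matrix[s][s]` correspond to within-species performance, and the off-diagonal entries to cross-species transfer, which is the layout of a heatmap with training species on one axis and testing species on the other.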
Fig. 3.
The heat map shows the values of a Acc, b AUC, c AUPRC, and d MCC in cross-species promoter validation. Numbers 1–23 correspond to the 23 prokaryotic species listed in Table 3
As expected, the highest performance for each model generally occurred when the training and testing species were the same (diagonal elements), confirming the reliability of the species-specific models. However, a number of off-diagonal elements also exhibited high prediction performance, suggesting that the promoter sequence features learned from certain species can generalize to others. Notably, models trained on Campylobacter jejuni strains (C. jejuni RM1221, C. jejuni subsp. jejuni 81116, C. jejuni subsp. jejuni 81–176, and C. jejuni subsp. jejuni NCTC 11168) [33, 34] achieved high accuracy when predicting each other’s promoter sequences. This observation is consistent with their close phylogenetic relationships as shown in the evolutionary tree (Fig. 4a), and their shared promoter motif patterns (Fig. 4b) revealed by the Two Sample Logos [35]. As shown in the motif logos adjacent to the tree, these strains possess highly similar sequence elements in their −10 and −35 regions, indicating conserved cis-regulatory architecture.
Fig. 4.
The evolutionary relationships and the conserved sequences in the promoters. a Phylogenetic tree of 23 species combined with significant motifs. The nucleotide distribution and preference in the promoters of b 21 bacteria and c 2 archaea compared to non-promoters using Two Sample Logos tool
Similar cross-species transferability was observed between other phylogenetically related species, such as C. diphtheriae NCTC 13129 and C. glutamicum ATCC 13032, as well as between S. aureus subsp. aureus MW2 and S. epidermidis ATCC 12228, which also share highly similar promoter sequence signatures (Fig. 4b). These results indicate that phylogenetic proximity and sequence motif conservation play crucial roles in enabling effective promoter recognition across species boundaries.
In contrast, models trained on archaeal species such as H. volcanii DS2 and T. kodakarensis KOD1 performed poorly when applied to bacterial species, with accuracy values consistently below 0.7. This is likely due to fundamental differences in transcriptional regulatory mechanisms between archaea and bacteria [36–38]. As shown in Fig. 4c, archaeal promoters lack the clearly enriched −10 and −35 motifs typically found in bacterial promoters and instead exhibit distinct sequence preferences potentially related to the TATA-binding protein (TBP) and transcription factor B (TFB) binding sites [39]. In addition, we found that there was no difference in performance between GC-rich species and AT-rich species (Additional file 1: Fig. S3). These observations further emphasize the importance of constructing species-specific or clade-specific models for accurate promoter identification, particularly when dealing with evolutionarily distant organisms.
Overall, most models performed unsatisfactorily when predicting the promoters of other species. This result provides compelling evidence of strong species specificity in prokaryotic promoters. Species-specific prediction models are therefore not merely beneficial but essential, because they can improve prediction accuracy by exploiting species-specific data and knowledge.
iPro-MP learns discriminative representations that clearly separate promoters from non-promoters
To evaluate the internal feature representations learned by iPro-MP, we applied t-distributed stochastic neighbor embedding (t-SNE) [40] to project the high-dimensional embedding vectors of promoter and non-promoter sequences into a two-dimensional space. As shown in Fig. 5a, the visualized distributions reveal a clear separation between the two classes in most of the 23 species. Promoter sequences (yellow) consistently form tight and coherent clusters, whereas non-promoter sequences (purple) are more broadly scattered, indicating greater diversity in background genomic regions. This class-specific clustering suggests that iPro-MP effectively captures regulatory features that are shared among promoter sequences, such as conserved sequence motifs or positional patterns.
Fig. 5.
t-SNE visualization of a both promoters and non-promoters, and b only promoters in a two-dimensional feature space based on the token embeddings from the final hidden layer. c Distribution of attention weights for promoter sequences. Each row represents a species, and each column corresponds to the position of the first nucleotide of a 6-mer token relative to the TSS (position 0)
Furthermore, the separation observed across phylogenetically distant species demonstrates that the learned representations are not merely species-specific, but reflect generalizable biological properties of prokaryotic promoters. In some species, particularly those with well-defined −10 and −35 motifs, the promoter and non-promoter groups are almost completely separable. In contrast, species with weaker motif signals or atypical GC content exhibit more overlap between the classes, consistent with the known challenges in promoter annotation in these genomes. In addition, promoters from all species were embedded in the same space (Fig. 5b). Promoters from different species generally form distinct clusters, with several phylogenetically related species, e.g., C. jejuni strains, showing closely positioned or partially overlapping clusters, while others, such as C. glutamicum and S. elongatus, occupy well-separated regions. This result highlights the species-specific nature of promoter sequence patterns, yet also suggests that iPro-MP captures shared sequence features.
To further enhance the interpretability of our model, the attention weights learned by iPro-MP were extracted and visualized (Fig. 5c). Notably, in most species, the attention is strongly concentrated around position −10, which aligns well with the known biological architecture of bacterial promoters. However, in the case of archaea (species 11 and 21), the attention weights are primarily concentrated around position −26, which corresponds to the location of the TATA-box-like core promoter element that is evolutionarily conserved among archaeal species [39]. In a few species, additional attention is observed near the TSS, reflecting subtle species-specific variations in promoter architecture. These patterns not only highlight the evolutionary conservation of core regulatory motifs but also underscore the model’s ability to capture subtle differences across species. In summary, the high-attention 6-mers concentrated upstream of the TSS validate that iPro-MP leverages functional promoter motifs, thereby offering both predictive power and biological insight. This feature-level visualization underscores the model’s interpretability and highlights its ability to generalize across diverse prokaryotic lineages. These results provide strong evidence that iPro-MP learns biologically meaningful and transferable representations, which contribute to its robust performance in promoter classification across diverse prokaryotic taxa.
iPro-MP outperforms classical and deep learning baselines across species
To comprehensively assess the predictive capacity of iPro-MP, we benchmarked it against four baseline classifiers—random forest (RF) [41], eXtreme Gradient Boosting (XGBoost) [42], logistic regression (LR) [43], and long short-term memory networks (LSTM) [44]—using promoter classification tasks in 23 diverse prokaryotic species. All models were trained using 5-fold cross-validation and optimized hyperparameters (Additional file 2: Table S5), and performance was evaluated using Acc, AUC, AUPRC, and MCC (Additional file 2: Table S6).
Across all four metrics, iPro-MP consistently outperformed the competing methods (Fig. 6a–d). Specifically, iPro-MP achieved the highest average values across species for all metrics (Table 1): Acc (mean = 0.890), AUC (mean = 0.935), AUPRC (mean = 0.903), and MCC (mean = 0.752). In contrast, the second-best performing model (RF) achieved Acc = 0.839, AUC = 0.882, AUPRC = 0.847, and MCC = 0.688, indicating a substantial performance gap. Next is XGBoost, whose performance is close to that of RF. However, methods like LR and LSTM lagged behind, especially in MCC and AUPRC, suggesting that they struggled to maintain class balance and precision under imbalanced data distributions.
Fig. 6.
Performance comparison of different classification algorithms in identifying promoters through fivefold cross-validation in terms of a Acc, b AUC, c AUPRC and d MCC. Numbers 1–23 correspond to the 23 prokaryotic species listed in Table 3
Table 1.
Comparison of the average prediction performance of iPro-MP and other methods using the independent testing sets
| Methods | Acc | AUC | AUPRC | MCC |
|---|---|---|---|---|
| iPro-MP | 0.890 | 0.935 | 0.903 | 0.752 |
| RF | 0.839 | 0.882 | 0.847 | 0.688 |
| XGBoost | 0.833 | 0.876 | 0.842 | 0.675 |
| LR | 0.803 | 0.843 | 0.790 | 0.612 |
| LSTM | 0.806 | 0.850 | 0.792 | 0.621 |
The performance gap is especially pronounced in challenging species. For example, in H. volcanii DS2—an archaeon with distinct transcriptional regulatory elements—iPro-MP reached an MCC of 0.876, while RF and LR only achieved 0.820 and 0.747, respectively. Similarly, in T. kodakarensis KOD1, iPro-MP achieved an MCC of 0.853, significantly outperforming RF (0.824) and XGBoost (0.793). Even in well-characterized species like E. coli, iPro-MP outperformed all alternatives in AUC (0.878 vs. 0.689–0.823) and AUPRC (0.807 vs. 0.596–0.737).
These results demonstrate that transformer-based models like iPro-MP are better equipped to capture the complex sequence patterns and regulatory logic in prokaryotic promoters. Unlike conventional methods that rely on fixed features or local dependencies, iPro-MP benefits from the global context modeling capability of DNABERT, enabling more accurate promoter recognition across a wide evolutionary range.
iPro-MP outperforms existing tools in multi-species promoter prediction
To validate the effectiveness and robustness of iPro-MP, we compared its performance with three state-of-the-art tools for prokaryotic promoter prediction: Prompt [32], PromoterLCNN [30], and iPro-WAEL [31]. These methods were selected based on the availability of webservers or open-source implementations, allowing for reproducible benchmarking on the same 23 independent testing sets used in this study.
As shown in Fig. 7 and Additional file 2: Table S7, iPro-MP consistently outperformed the competing methods across most species. Specifically, iPro-MP achieved the highest Acc (mean = 0.890), AUC (mean = 0.935), AUPRC (mean = 0.903), MCC (mean = 0.752), and F1 score (mean = 0.832) among all compared methods (Table 2). Moreover, when examined at the species level, iPro-MP achieves the best performance in 21 out of 23 species, further confirming its robustness and cross-species generalizability. Importantly, although iPro-MP utilizes a transformer-based architecture, it maintains a competitive runtime (average 29.09 s) and memory usage (around 3.6 GB), outperforming all other tools on both counts. To explore alternative validation methods in the benchmarking process, we repeated the training procedures of the other three tools using 5-fold cross-validation, while strictly adhering to the official training and evaluation protocols provided by each tool (Additional file 1: Fig. S4 and Additional file 2: Table S8). iPro-MP exhibited the most robust and consistently top-performing results in all scenarios. These results collectively highlight the practical advantages of iPro-MP in terms of both predictive accuracy and computational efficiency.
Fig. 7.
Performance comparison of iPro-MP with existing tools based on the independent testing sets in terms of a Acc and b MCC. Numbers 1–23 correspond to the 23 prokaryotic species listed in Table 3
Table 2.
Comparison of the average prediction performance of iPro-MP and existing tools using the independent testing sets
| Tools | Acc | AUC | AUPRC | MCC | F1 | Runtime (s) |
|---|---|---|---|---|---|---|
| iPro-MP | 0.890 | 0.935 | 0.903 | 0.752 | 0.832 | 29.09 |
| Prompt | 0.788 | 0.835 | 0.798 | 0.611 | 0.737 | 34.48 |
| PromoterLCNN | 0.755 | 0.793 | 0.767 | 0.565 | 0.706 | 35.32 |
| iPro-WAEL | 0.798 | 0.838 | 0.811 | 0.606 | 0.747 | 34.40 |
While there were a few exceptions where other methods performed slightly better—e.g., PromoterLCNN achieved higher accuracy in E. coli (0.832 vs. 0.819) and iPro-WAEL outperformed iPro-MP in P. riograndensis SBR5 (0.802 vs. 0.793)—iPro-MP still demonstrated significantly superior performance in MCC, indicating better overall classification reliability and robustness, particularly under class imbalance. These results highlight iPro-MP’s strong generalization capability and its advantage in learning universal promoter features across phylogenetically diverse prokaryotes, offering a substantial improvement over previous methods.
Discussion
In this study, we developed iPro-MP, the first DNABERT-based framework for promoter prediction across multiple prokaryotic species. This model addresses key limitations of existing methods, including narrow species coverage and suboptimal predictive performance, by leveraging a transformer architecture with multi-head self-attention to learn both local and long-range dependencies in DNA sequences.
Through rigorous benchmarking on 23 prokaryotic species using 5-fold cross-validation and independent testing sets, iPro-MP consistently outperformed existing tools in terms of Acc and MCC. Importantly, our cross-species prediction analysis highlighted both the potential and limitations of model generalization, underscoring the value of species-specific modeling in certain phylogenetic contexts. Moreover, t-SNE visualization and motif conservation analysis demonstrated that iPro-MP captures biologically meaningful regulatory features, even in non-model organisms.
The satisfactory performance of iPro-MP can be attributed to several key design choices. First, the use of DNABERT enables the model to capture complex contextual patterns in DNA sequences. This is particularly effective for modeling promoter regions, which often contain dispersed regulatory motifs and variable structural elements. Second, the training dataset spans 23 prokaryotic species with diverse genomic characteristics, improving the model’s ability to generalize across phylogenetic clades. Third, the integration of both CDS and intergenic sequences as negative samples forces the model to learn more refined boundaries between promoters and non-promoters, beyond simply identifying open reading frames. Finally, the lightweight classification head with GELU activation, dropout regularization, and layer normalization contributes to model stability and generalization.
Beyond promoter prediction, iPro-MP has potential applications in fields such as synthetic biology and functional genomics. Recent advances in deep learning have enabled the generation of synthetic promoters with high expression strength [45–47], yet current generative models are limited by the lack of training data for non-model organisms. By enabling large-scale and accurate identification of native promoters across diverse prokaryotic genomes, iPro-MP can help expand promoter datasets for underexplored species. This, in turn, may support the development of more generalizable generative frameworks for synthetic promoter design, particularly in industrially or medically relevant prokaryotes.
Although iPro-MP achieves strong performance overall, prediction accuracy remains suboptimal for a few species, particularly those with weak promoter motifs or atypical regulatory architectures, e.g., P. putida and S. meliloti. This highlights the need for further investigation into species-specific promoter patterns and possibly incorporating epigenetic or structural features. In addition, recent advances in bioinformatics have demonstrated the strong problem-solving potential of universal models across a variety of biological prediction tasks, from protein function annotation to cross-species regulatory element detection. Inspired by this trend, we aim to extend our work by developing a universal promoter prediction framework capable of generalizing across a broader range of prokaryotic species. Such a model would reduce the dependency on organism-specific fine-tuning and enable scalable application to newly sequenced or poorly annotated species.
Conclusion
Overall, iPro-MP represents a powerful and generalizable tool for prokaryotic promoter annotation. By enhancing our ability to decode regulatory regions across diverse microbial genomes, this framework holds promise for advancing research in microbial gene regulation, synthetic biology, microbial ecology, and the study of pathogenic mechanisms. Future efforts may further improve its utility through integration with multi-omics data and domain-specific fine-tuning strategies.
Methods
The main workflow of iPro-MP is illustrated in Fig. 8, including three modules:
Data collection and processing module: experimentally validated promoter sequences were curated from the Prokaryotic Promoter Database (PPD). Redundant sequences were removed using CD-HIT with a similarity threshold of 0.8. Negative samples were generated from long coding sequences (CDS) and convergent intergenic regions using a sliding window strategy. Finally, the benchmark dataset was divided into a training set and an independent testing set.
Classification module: all sequences in the training set were tokenized into overlapping 6-mers and subsequently used to fine-tune the DNABERT model. The final prediction was generated through two fully connected layers.
Evaluation module: performance was evaluated by 5-fold cross-validation and independent testing, complemented by feature analysis, cross-species prediction, phylogenetic analysis, and comparative tests.
Fig. 8.
The flowchart of iPro-MP. a Promoter data collection and processing module, where redundancy is removed from all data before they are divided into training and testing sets. b Classification module, which captures contextual information hidden in the DNA sequence based on DNABERT to make predictions. c Evaluation module, which assesses the performance of iPro-MP through a series of comparisons
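The overlapping 6-mer tokenization used in the classification module can be sketched as follows. This is a minimal illustration, assuming a stride of 1 and an 81-bp input whose first nucleotide sits at position −60 relative to the TSS; the function names `kmer_tokenize` and `token_positions` are hypothetical, and the special tokens that a BERT-style tokenizer adds around the sequence are deliberately left out.

```python
def kmer_tokenize(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers with a stride of 1."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def token_positions(sequence, k=6, first_position=-60):
    """Pair each k-mer with the position of its first nucleotide.

    Assumption: positions are counted relative to the TSS (position 0),
    and the sequence starts at `first_position`.
    """
    return [(first_position + i, token)
            for i, token in enumerate(kmer_tokenize(sequence, k))]
```

An 81-bp sequence yields 81 − 6 + 1 = 76 tokens under this scheme, with the first token starting at −60 and the last at +15.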
Data collection and preprocessing
In our previous work, we constructed the prokaryotic promoter database (PPD) [20], which includes experimentally validated promoter data from 63 prokaryotic species. The PPD provides a robust foundation for this study. To ensure sufficient training data and minimize biases associated with small sample sizes, positive samples were selected according to the following rules: (1) Species with more than 1,000 experimentally validated promoters were included from the PPD. Notably, although B. subtilis has fewer than 1,000 promoters, it is widely used in promoter prediction research and is therefore included in the benchmark dataset; (2) For each species, promoter sequences were filtered using the CD-HIT tool [48] with a sequence identity threshold of 0.8 to remove redundancy. As a result, a total of 107,286 promoter sequences (81 bp in length, [−60:20] where 0 represents the TSS) were collected from 23 prokaryotic organisms using the uniform preprocessing and labeling strategy.
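The 81-bp promoter window, spanning positions −60 to +20 with the TSS at position 0, can be cut from a genome sequence as sketched below. This is an illustration under stated assumptions rather than the authors' preprocessing code: the TSS is given as a 0-based index on the forward strand, minus-strand promoters are returned as the reverse complement, windows that run past the sequence ends are skipped, and circular genomes are not handled. The function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def promoter_window(genome, tss, strand="+", upstream=60, downstream=20):
    """Extract the [-upstream, +downstream] window around a TSS.

    `tss` is a 0-based index on the forward strand (an assumption).
    Returns None when the window does not fit inside the sequence.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream + 1
    else:
        # On the minus strand, upstream lies at higher forward coordinates.
        start, end = tss - downstream, tss + upstream + 1
    if start < 0 or end > len(genome):
        return None
    window = genome[start:end]
    return window if strand == "+" else reverse_complement(window)
```

With the default arguments the window length is 60 + 1 + 20 = 81 bp, and the TSS nucleotide sits at index 60 of the returned string.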
To generate a reliable negative sample dataset, we downloaded the whole genome sequences and corresponding annotation files of the 23 prokaryotic organisms from the GenBank database. Negative samples were derived from two genomic regions unlikely to contain promoters: (1) coding sequence (CDS) regions longer than 2000 bp, and (2) convergent intergenic regions longer than 81 bp. From these regions, we generated 81 bp subsequences using a sliding window method with a step size of 1 [49]. Subsequently, redundant sequences were removed using the CD-HIT program with a 0.8 threshold (Additional file 2: Table S9). Finally, the same number of sequences as positive samples was randomly selected from each of the two types of negative samples. This resulted in a final dataset with a 1:2 ratio of positive to negative samples (Table 3). After all the above steps, the benchmark dataset was randomly divided into training and independent testing sets in a ratio of 4:1 (Fig. 8a).
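The negative-sample construction and the 4:1 split can be sketched as below. This is a simplified illustration, not the authors' pipeline: the CD-HIT redundancy-removal step is omitted, the split is applied separately to each of the three groups (an assumption, chosen because it reproduces the per-class counts in Table 3), and all function names and the `seed` parameter are hypothetical.

```python
import random


def sliding_windows(region, width=81, step=1):
    """Cut a region into fixed-width subsequences with a sliding window."""
    return [region[i:i + width] for i in range(0, len(region) - width + 1, step)]


def split_4_to_1(items, rng):
    """Shuffle and split into training and testing parts at a 4:1 ratio."""
    items = list(items)
    rng.shuffle(items)
    n_train = (4 * len(items) + 2) // 5  # round(0.8 * n) in integer arithmetic
    return items[:n_train], items[n_train:]


def build_dataset(positives, cds_windows, intergenic_windows, seed=0):
    """Draw as many CDS and intergenic negatives as positives (1:2 ratio), then split."""
    rng = random.Random(seed)
    n = len(positives)
    groups = [
        (positives, 1),
        (rng.sample(cds_windows, n), 0),
        (rng.sample(intergenic_windows, n), 0),
    ]
    train, test = [], []
    for sequences, label in groups:
        tr, te = split_4_to_1(sequences, rng)
        train += [(s, label) for s in tr]
        test += [(s, label) for s in te]
    return train, test
```

For example, the 1111 non-redundant promoters of A. baumannii give (4 × 1111 + 2) // 5 = 889 training and 222 testing sequences under this rounding, matching the first row of Table 3.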
Table 3.
The benchmark dataset of promoters from 23 prokaryotic species
| Species | GenBank accession number | CD-HIT (positive): before | CD-HIT (positive): after | Training: positive | Training: CDS | Training: intergenic | Testing: positive | Testing: CDS | Testing: intergenic |
|---|---|---|---|---|---|---|---|---|---|
| 1. Acinetobacter baumannii ATCC 17978 | CP000521.1 | 1540 | 1111 | 889 | 889 | 889 | 222 | 222 | 222 |
| 2. Bradyrhizobium japonicum USDA 110 | NC_004463.1 | 15,933 | 14,779 | 11,823 | 11,823 | 11,823 | 2956 | 2956 | 2956 |
| 3. Burkholderia cenocepacia J2315 | | 10,831 | 9663 | 7730 | 7730 | 7730 | 1933 | 1933 | 1933 |
| 4. Campylobacter jejuni RM1221 | NC_003912.7 | 2166 | 1995 | 1596 | 1596 | 1596 | 399 | 399 | 399 |
| 5. Campylobacter jejuni subsp. jejuni 81116 | NC_009839.1 | 1942 | 1790 | 1432 | 1432 | 1432 | 358 | 358 | 358 |
| 6. Campylobacter jejuni subsp. jejuni 81–176 | NC_008787.1 | 2129 | 1944 | 1555 | 1555 | 1555 | 389 | 389 | 389 |
| 7. Campylobacter jejuni subsp. jejuni NCTC 11168 | NC_002163.1 | 1905 | 1745 | 1396 | 1396 | 1396 | 349 | 349 | 349 |
| 8. Corynebacterium diphtheriae NCTC 13129 | NC_002935.2 | 1656 | 1518 | 1214 | 1214 | 1214 | 304 | 304 | 304 |
| 9. Corynebacterium glutamicum ATCC 13032 | BX927147.1 | 3581 | 3031 | 2425 | 2425 | 2425 | 606 | 606 | 606 |
| 10. Escherichia coli str. K-12 substr. MG1655 | NC_000913.3 | 8616 | 6230 | 4984 | 4984 | 4984 | 1246 | 1246 | 1246 |
| 11. Haloferax volcanii DS2 | NC_013967.1 | 4749 | 4394 | 3515 | 3515 | 3515 | 879 | 879 | 879 |
| 12. Helicobacter pylori strain 26695 | NC_000915.1 | 2233 | 2062 | 1650 | 1650 | 1650 | 412 | 412 | 412 |
| 13. Nostoc sp. PCC7120 | NC_003272.1 | 13,705 | 12,560 | 10,048 | 10,048 | 10,048 | 2512 | 2512 | 2512 |
| 14. Paenibacillus riograndensis SBR5 | NZ_LN831776.1 | 2351 | 2276 | 1821 | 1821 | 1821 | 455 | 455 | 455 |
| 15. Pseudomonas putida KT2440 | NC_002947.3 | 7938 | 6881 | 5505 | 5505 | 5505 | 1376 | 1376 | 1376 |
| 16. Shigella flexneri 5a str. M90T | NZ_CP037923.1 | 14,051 | 9760 | 7808 | 7808 | 7808 | 1952 | 1952 | 1952 |
| 17. Sinorhizobium meliloti 1021 | NC_003047.1 | 17,003 | 14,482 | 11,586 | 11,586 | 11,586 | 2896 | 2896 | 2896 |
| 18. Staphylococcus aureus subsp. aureus MW2 | NC_003923.1 | 2821 | 1969 | 1575 | 1575 | 1575 | 394 | 394 | 394 |
| 19. Staphylococcus epidermidis ATCC 12228 | NC_004461.1 | 2207 | 1647 | 1318 | 1318 | 1318 | 329 | 329 | 329 |
| 20. Synechococcus elongatus PCC 7942 | CP000100.1 | 1473 | 1410 | 1128 | 1128 | 1128 | 282 | 282 | 282 |
| 21. Thermococcus kodakarensis KOD1 | NC_006624.1 | 2720 | 2413 | 1930 | 1930 | 1930 | 483 | 483 | 483 |
| 22. Xanthomonas campestris pv. campestris B100 | AM920689.1 | 3067 | 2963 | 2370 | 2370 | 2370 | 593 | 593 | 593 |
| 23. Bacillus subtilis subsp. subtilis str. 168 | NC_000964.3 | 691 | 663 | 530 | 530 | 530 | 133 | 133 | 133 |
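As a quick consistency check on Table 3, the training and testing counts can be recomputed from the post-CD-HIT positive counts under a 4:1 split. The rounding rule used here (training size rounded to the nearest integer) is an assumption; it is not stated in the text, although it reproduces the rows shown.

```python
def split_counts(n_sequences):
    """Training/testing sizes for a 4:1 split (training rounded to nearest)."""
    n_train = (4 * n_sequences + 2) // 5  # integer form of round(0.8 * n)
    return n_train, n_sequences - n_train


# (after CD-HIT, training positives, testing positives) copied from Table 3
table_rows = {
    "Acinetobacter baumannii ATCC 17978": (1111, 889, 222),
    "Escherichia coli str. K-12 substr. MG1655": (6230, 4984, 1246),
    "Haloferax volcanii DS2": (4394, 3515, 879),
    "Bacillus subtilis subsp. subtilis str. 168": (663, 530, 133),
}

for species, (n_after, n_train, n_test) in table_rows.items():
    assert split_counts(n_after) == (n_train, n_test), species
```

Because the same number of CDS and intergenic negatives is drawn per species, the same counts apply to each of the three sample columns.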
DNABERT model
Bidirectional Encoder Representations from Transformers (BERT) is a pre-trained language representation model that retains the encoder part of Transformers [50], including the multi-head attention mechanism and the position-wise fully connected feed-forward neural network, while also expanding the function of bidirectional learning. It employs the Masked Language Model (MLM) as a fundamental component of its pre-training strategy, enabling the model to develop a deep bidirectional understanding of language. The relevant formulas are as follows:
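$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \tag{1}$$
where
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{2}$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{3}$$
In the above formulas, $Q$, $K$, and $V$ represent the query vector, key vector, and value vector, respectively. $\mathrm{head}_i$ refers to the $i$-th attention layer, $d_k$ is the dimension of the key vector, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learned projection matrices.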
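For readers who prefer code to notation, the scaled dot-product attention computed inside a single attention head, softmax(QK^T / sqrt(d_k)) V, can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the DNABERT implementation: the function names are hypothetical, matrices are lists of row vectors, and the learned projections and the concatenation across heads are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q, K, V are lists of row vectors (one row per token).
    """
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output


# With identical keys, every query attends uniformly, so each output
# row is the plain average of the value vectors.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

Multi-head attention repeats this computation with a separate projection of Q, K, and V per head and concatenates the results.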
DNABERT leverages BERT's bidirectional attention mechanism to learn both local and global contextual relationships within DNA sequences [51]. By pre-training on large-scale DNA datasets, DNABERT is capable of capturing complex patterns and dependencies in the genome, thereby providing robust feature representations for downstream tasks such as gene prediction [52, 53], motif discovery [54, 55], and regulatory element identification [56, 57]. DNABERT employs a sophisticated tokenization strategy that transforms DNA sequences into k-mer tokens, thereby enabling the application of transformer-based architectures to genomic data (Fig. 8b). A k-mer represents a contiguous subsequence of length k nucleotides extracted from the DNA sequence, analogous to word tokens in natural language processing (NLP). In this study, we fine-tuned the DNABERT model using the benchmark dataset of 23 species and applied the fine-tuned model for cross-species validation.
Fine-tuning of DNABERT model
In this study, a fine-tuning approach was applied to the pre-trained DNABERT model for the classification of DNA sequences with high accuracy. To encode input sequences for DNABERT, a k-mer tokenization strategy with k values of 3, 4, 5, and 6 was employed, enabling the model to capture sequence patterns at multiple resolutions.
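The k-mer tokenization can be illustrated with a short sketch. It assumes overlapping k-mers taken with a stride of 1, as in the original DNABERT tokenizer; the function name is hypothetical and the special tokens added by the model's tokenizer are left out.

```python
def kmer_tokens(sequence, k=6):
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmer_tokens("ACGTAC", k=3))  # ['ACG', 'CGT', 'GTA', 'TAC']

# Number of tokens produced from an 81 bp input for each k that was tried.
example = "ACGT" * 20 + "A"  # 81 bp toy sequence
for k in (3, 4, 5, 6):
    print(k, len(kmer_tokens(example, k)))
```

A sequence of length L yields L - k + 1 tokens, so larger k gives slightly fewer tokens drawn from a much larger vocabulary.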
To ensure reproducibility, all random number generators were initialized with a fixed seed, and PyTorch's cuDNN backend was configured to accelerate the training and inference process. The recognition model combined the pre-trained model with additional fully connected layers, layer normalization, dropout, and GELU activation functions to enhance model capacity and prevent overfitting. Specifically, during fine-tuning we used the Adam optimizer with a learning rate of 1e-5 and a dropout rate of 0.3, and trained the model for up to 100 epochs with a batch size of 64. An early stopping strategy was applied with a patience of five epochs and a minimum delta of 1e-4 in validation AUC.
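The early stopping rule can be sketched independently of any deep learning framework. The class and function names below are hypothetical, and the exact comparison used in the released code may differ; only the patience of five epochs and the minimum delta of 1e-4 on validation AUC are taken from the text.

```python
class EarlyStopping:
    """Stop when validation AUC has not improved by min_delta for `patience` epochs."""

    def __init__(self, patience=5, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_auc):
        """Record one epoch; return True when training should stop."""
        if val_auc > self.best + self.min_delta:
            self.best = val_auc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def stopping_epoch(val_aucs, patience=5, min_delta=1e-4):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    stopper = EarlyStopping(patience, min_delta)
    for epoch, auc in enumerate(val_aucs, start=1):
        if stopper.step(auc):
            return epoch
    return None


# Toy validation curve: the best AUC (0.90) is reached at epoch 3 and the
# next five epochs fail to beat it by more than 1e-4, so training stops at 8.
history = [0.80, 0.85, 0.90, 0.90005, 0.90, 0.899, 0.90, 0.90, 0.91]
print(stopping_epoch(history))
```

In a real training loop, `step` would be called once per epoch with the AUC measured on the validation fold, and the weights from the best epoch would typically be restored.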
Performance evaluation
To compare the quality of computational predictors, a set of metrics (Fig. 8c), including sensitivity (Sn), specificity (Sp), overall accuracy (Acc), and Matthews correlation coefficient (MCC), was used to evaluate prediction performance [58–63].
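$$\begin{cases} Sn = \dfrac{TP}{TP + FN} \\[2ex] Sp = \dfrac{TN}{TN + FP} \\[2ex] Acc = \dfrac{TP + TN}{TP + FN + TN + FP} \\[2ex] MCC = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \end{cases} \tag{4}$$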
where TP, FN, TN, and FP represent the number of true positives, false negatives, true negatives, and false positives, respectively.
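The four threshold-dependent metrics follow directly from the confusion-matrix counts. The sketch below uses their standard definitions; the function name and the convention of returning an MCC of 0 when the denominator is zero are assumptions made for illustration.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute Sn, Sp, Acc, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Toy counts with the 1:2 positive-to-negative ratio of the benchmark.
print(classification_metrics(tp=90, fn=10, tn=180, fp=20))
```

With imbalanced classes such as the 1:2 ratio used here, MCC is usually more informative than accuracy because it accounts for all four cells of the confusion matrix.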
Moreover, receiver operating characteristic (ROC) curves offer an intuitive way to compare models, because they show relative performance across different decision thresholds. The area under the ROC curve (AUC) serves as a critical performance metric for evaluating binary classification models [64, 65], quantifying their inherent ability to discriminate between positive and negative classes. The higher the AUC, the better the performance. In addition, to save computing time, we used 5-fold cross-validation to train the model and the independent testing set to evaluate our model and other tools.
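AUC and the 5-fold partition can both be sketched without external libraries. The AUC below uses the rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half), which is equivalent to the area under the ROC curve. The function names and the seed are illustrative assumptions, and the pairwise loop is quadratic, so it is suitable only for small examples.

```python
import random


def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train_indices, validation_indices) pairs over shuffled data."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, folds[i]))
    return splits


# Three of the four positive-negative pairs are ranked correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75

for train_idx, val_idx in k_fold_indices(10):
    print(len(train_idx), len(val_idx))
```

Each sample appears in exactly one validation fold, so every training sequence contributes to one validation estimate across the five rounds.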
Supplementary Information
Additional file 1: Figure S1. Performance comparison of DNABERT models trained with different k-mer sizes. Figure S2. Performance comparison of iPro-MP under three cross-validation strategies. Figure S3. Performance comparison between AT-rich and GC-rich species across five evaluation metrics using the Mann-Whitney U test. Figure S4. Performance comparison of iPro-MP and other tools across 23 prokaryotic species under 5-fold cross-validation.
Additional file 2: Table S1: Performance comparison of DNABERT models trained with different k-mer sizes. Table S2: Performance comparison of iPro-MP under three cross-validation strategies. Table S3: Comprehensive performance metrics evaluated using 5-fold cross-validation. Table S4: Comprehensive performance metrics evaluated using independent testing sets. Table S5: Optimal parameter settings for different algorithms. Table S6: Comparison of different algorithms for identifying promoters in 23 species. Table S7: Comparison of iPro-MP and existing tools for identifying promoters in 23 species using independent testing sets. Table S8: Performance comparison of iPro-MP and existing tools under 5-fold cross-validation. Table S9: Summary of negative samples before and after CD-HIT filtering for each species
Acknowledgements
Not applicable.
Peer review information
Jiangning Song and Tim Sands were the primary editors of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.
Authors’ contributions
Hao Lin and Hao Lyu conceived and designed the experiments. Wei Su, Yuhe Yang, Yafei Zhao, Shishi Yuan, Xueqin Xie, Yuduo Hao, Hongqi Zhang and Dongxin Ye performed the analysis and wrote the paper. All authors read and approved the final manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (62172078, 62402089) and China Postdoctoral Science Foundation (2023TQ0047, GZC20230380).
Data availability
The promoter sequences for 23 species were obtained from the Prokaryotic Promoter Database (PPD) (https://lin-group.cn/database/ppd/) [20]. The genome files and annotation files for 23 species were obtained from GenBank under the accession number in Table 3. The benchmark datasets and source codes for the iPro-MP framework are freely available at the GitHub repository (https://github.com/Jackie-Suv/iPro-MP) [66] under MIT license, which includes all necessary materials for local deployment, including species-specific training data, and detailed instructions for both prediction and model retraining. Meanwhile, we also provided the pretrained models at Zenodo (10.5281/zenodo.15180139) [67].
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Hao Lyu, Email: hao.lyu@uestc.edu.cn.
Hao Lin, Email: hlin@uestc.edu.cn.
References
- 1. Boeger H, Bushnell DA, Davis R, Griesenbeck J, Lorch Y, Strattan JS, et al. Structural basis of eukaryotic gene transcription. FEBS Lett. 2005;579:899–903.
- 2. Cramer P, Bushnell DA, Kornberg RD. Structural basis of transcription: RNA polymerase II at 2.8 Ångstrom resolution. Science. 2001;292:1863–76.
- 3. Gao Y, Cui YX, Fox T, Lin SQ, Wang HB, de Val N, Zhou ZH, Yang W. Structures and operating principles of the replisome. Science. 2019;363:eaav7003.
- 4. Matteucci M, Papini G, Ciofini E, Barile L, Lionetti V. Epigenetic regulation of myocardial homeostasis, self-regeneration and senescence. Curr Drug Targets. 2015;16:827–42.
- 5. Xiao X, Hu ZH, Luo ZT, Xu ZC. Ipsi(2l)-edl: a two-layer predictor for identifying promoters and their types based on ensemble deep learning. Curr Bioinform. 2024;19:327–40.
- 6. Shujaat M, Yoo S, Tayara H, Chong KT. Iprom-yeast: prediction tool for yeast promoters based on ML stacking. Curr Bioinform. 2024;19:162–73.
- 7. Gao F, Ye FZ, Zhang BW, Cronin N, Buck M, Zhang XD. Structural basis of σ54 displacement and promoter escape in bacterial transcription. Proc Natl Acad Sci U S A. 2024. 10.1073/pnas.2309670120.
- 8. Glyde R, Ye FZ, Jovanovic M, Kotta-Loizou I, Buck M, Zhang XD. Structures of bacterial RNA polymerase complexes reveal the mechanism of DNA loading and transcription initiation. Mol Cell. 2018;70:1111.
- 9. Wei L, He W, Malik A, Su R, Cui L, Manavalan B. Computational prediction and interpretation of cell-specific replication origin sites from multiple eukaryotes by exploiting stacking framework. Brief Bioinform. 2020. 10.1093/bib/bbaa275.
- 10. Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform. 2024;24:bbad320.
- 11. Sharma CM, Hoffmann S, Darfeuille F, Reignier J, Findeiss S, Sittka A, et al. The primary transcriptome of the major human pathogen Helicobacter pylori. Nature. 2010;464:250–5.
- 12. Sharma CM, Vogel J. Differential RNA-seq: the approach behind and the biological insight gained. Curr Opin Microbiol. 2014;19:97–105.
- 13. Bischler T, Tan HS, Nieselt K, Sharma CM. Differential RNA-seq (dRNA-seq) for annotation of transcriptional start sites and small RNAs in Helicobacter pylori. Methods. 2015;86:89–101.
- 14. Maucourt B, Roche D, Chaignaud P, Vuilleumier S, Bringel F. Genome-wide transcription start sites mapping in methylorubrum grown with dichloromethane and methanol. Microorganisms. 2022;10.
- 15. Babski J, Haas KA, Nather-Schindler D, Pfeiffer F, Forstner KU, Hammelmann M, et al. Genome-wide identification of transcriptional start sites in the haloarchaeon Haloferax volcanii based on differential RNA-seq (dRNA-seq). BMC Genomics. 2016;17:629.
- 16. Bauer JS, Fillinger S, Forstner K, Herbig A, Jones AC, Flinspach K, et al. dRNA-seq transcriptional profiling of the FK506 biosynthetic gene cluster in Streptomyces tsukubaensis NRRL18488 and general analysis of the transcriptome. RNA Biol. 2017;14:1617–26.
- 17. Salgado H, Gama-Castro S, Lara P, Mejia-Almonte C, Alarcon-Carranza G, Lopez-Almazo AG, et al. RegulonDB v12.0: a comprehensive resource of transcriptional regulation in E. coli K-12. Nucleic Acids Res. 2024;52(D1):D255–64.
- 18. Ishii T, Yoshida K, Terai G, Fujita Y, Nakai K. DBTBS: a database of Bacillus subtilis promoters and transcription factors. Nucleic Acids Res. 2001;29:278–80.
- 19. Liang ZY, Lai HY, Yang H, Zhang CJ, Yang H, Wei HH, et al. Pro54DB: a database for experimentally verified sigma-54 promoters. Bioinformatics. 2017;33:467–9.
- 20. Su W, Liu ML, Yang YH, Wang JS, Li SH, Lv H, et al. PPD: a manually curated database for experimentally verified prokaryotic promoters. J Mol Biol. 2021;433:166860.
- 21. Qiao J, Jin J, Yu H, Wei L. Towards retraining-free RNA modification prediction with incremental learning. Inform Sci. 2024:120105.
- 22. Xie H, Ding Y, Qian Y, Tiwari P, Guo F. Structured sparse regularization based random vector functional link networks for DNA N4-methylcytosine sites prediction. Expert Syst Appl. 2024. 10.1016/j.eswa.2023.121157.
- 23. Wang L, Ding Y, Tiwari P, Xu J, Lu W, Muhammad K, et al. A deep multiple kernel learning-based higher-order fuzzy inference system for identifying DNA N4-methylcytosine sites. Inf Sci. 2023;630:40–52.
- 24. Liu B, Gao X, Zhang H. BioSeq-analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches. Nucleic Acids Res. 2019;47:e127.
- 25. Li H, Liu B. Bioseq-diabolo: biological sequence similarity analysis using diabolo. PLoS Comput Biol. 2023;19:e1011214.
- 26. Lai HY, Zhang ZY, Su ZD, Su W, Ding H, Chen W, et al. Iproep: a computational predictor for predicting promoter. Mol Ther. 2019;17:337–46.
- 27. Wang Y, Zhai Y, Ding Y, Zou Q. SBSM-Pro: support bio-sequence machine for proteins. Sci China Inf Sci. 2024;67:212106.
- 28. Meher PK, Hati S, Sahu TK, Pradhan U, Gupta A, Rath SN. SVM-root: identification of root-associated proteins in plants by employing the support vector machine with sequence-derived features. Curr Bioinform. 2024;19(1):91–102.
- 29. Zhang M, Li FY, Marquez-Lago TT, Leier A, Fan C, Kwoh CK, et al. Multiply: a novel multi-layer predictor for discovering general and specific types of promoters. Bioinformatics. 2019;35:2957–65.
- 30. Hernández D, Jara N, Araya M, Durán RE, Buil-Aranda C. PromoterLCNN: a light CNN-based promoter prediction and classification model. Genes. 2022. 10.3390/genes13071126.
- 31. Zhang PY, Zhang HM, Wu H. Ipro-WAEL: a comprehensive and robust framework for identifying promoters in multiple species. Nucleic Acids Res. 2022;50:10278–89.
- 32. Du QM, Guo YX, Zhang JP, Lu FP, Peng C, Zhou CC. Predicting promoters in multiple prokaryotes with prompt. Interdiscip Sci Comput Life Sci. 2024;16:814–28.
- 33. Young KT, Davis LM, DiRita VJ. Campylobacter jejuni: molecular biology and pathogenesis. Nat Rev Microbiol. 2007;5:665–79.
- 34. Dugar G, Herbig A, Forstner KU, Heidrich N, Reinhardt R, Nieselt K, Sharma CM. High-resolution transcriptome maps reveal strain-specific regulatory features of multiple Campylobacter jejuni isolates. PLoS Genet. 2013;9.
- 35. Vacic V, Iakoucheva LM, Radivojac P. Two sample logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics. 2006;22:1536–7.
- 36. Pohlschroder M, Schulze S. Haloferax volcanii. Trends Microbiol. 2019;27:86–7.
- 37. Wenck BR, Vickerman RL, Burkhart BW, Santangelo TJ. Archaeal histone-based chromatin structures regulate transcription elongation rates. Commun Biol. 2024;7.
- 38. Jun SH, Reichlen MJ, Tajiri M, Murakami KS. Archaeal RNA polymerase and transcription regulation. Crit Rev Biochem Mol Biol. 2011;46:27–40.
- 39. Kramm K, Endesfelder U, Grohmann D. A single-molecule view of archaeal transcription. J Mol Biol. 2019;431:4116–31.
- 40. Wängberg T, Tyrcha J, Li CB. Shape-aware stochastic neighbor embedding for robust data visualisations. BMC Bioinformatics. 2022;23(1):477.
- 41. Lv H, Zhang ZM, Li SH, Tan JX, Chen W, Lin H. Evaluation of different computational methods on 5-methylcytosine sites identification. Brief Bioinform. 2020;21:982–95.
- 42. Shin H. XGBoost regression of the most significant photoplethysmogram features for assessing vascular aging. IEEE J Biomed Health Inform. 2022;26:3354–61.
- 43. Wang L, Park HJ, Dasari S, Wang SQ, Kocher JP, Li W. CPAT: coding-potential assessment tool using an alignment-free logistic regression model. Nucleic Acids Res. 2013. 10.1093/nar/gkt006.
- 44. Liang SR, Zhao YX, Jin JR, Qiao JB, Wang D, Wang Y, et al. Rm-LR: a long-range-based deep learning model for predicting multiple types of RNA modifications. Comput Biol Med. 2023. 10.1016/j.compbiomed.2023.107238.
- 45. Wang Y, Wang HC, Wei L, Li SL, Liu LY, Wang XW. Synthetic promoter design in Escherichia coli based on a deep generative network. Nucleic Acids Res. 2020;48:6403–12.
- 46. Seo E, Choi YN, Shin YR, Kim D, Lee JW. Design of synthetic promoters for cyanobacteria with generative deep-learning model. Nucleic Acids Res. 2023;51:7071–82.
- 47. Zhang PC, Wang HC, Xu HW, Wei L, Liu LY, Hu ZR, Wang XW. Deep flanking sequence engineering for efficient promoter design using DeepSEED. Nat Commun. 2023;14(1):6309.
- 48. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–9.
- 49. Lin Y, Sun ML, Zhang JJ, Li MY, Yang KL, Wu CY, et al. Computational identification of promoters in Klebsiella aerogenes by using support vector machine. Front Microbiol. 2023;14:1200678.
- 50. Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv e-prints. 2018, arXiv:1810.04805.
- 51. Ji YR, Zhou ZH, Liu H, Davuluri R. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics. 2021;37:2112–20.
- 52. Kabir A, Bhattarai M, Peterson S, Najman-Licht Y, Rasmussen KO, Shehu A, Bishop AR, Alexandrov B, Usheva A. DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors. Nucleic Acids Res. 2024;52(19):e91.
- 53. Zhang JB, Zhao L, Wang W, Zhang Q, Wang XT, Xing DF, et al. Large language model for horizontal transfer of resistance gene: from resistance gene prevalence detection to plasmid conjugation rate evaluation. Sci Total Environ. 2024. 10.1016/j.scitotenv.2024.172466.
- 54. Xie GB, Yu Y, Lin ZY, Chen RB, Xie JH, Liu ZG. 4 mC site recognition algorithm based on pruned pre-trained DNABert-Pruning model and fused artificial feature encoding. Anal Biochem. 2024. 10.1016/j.ab.2024.115492.
- 55. Zhang ZY, Zhang Z, Ye XC, Sakurai T, Lin H. A BERT-based model for the prediction of lncRNA subcellular localization in Homo sapiens. Int J Biol Macromol. 2024. 10.1016/j.ijbiomac.2024.130659.
- 56. Song T, Song HA, Pan ZY, Gao Y, Dai HH, Wang X. Deepdualenhancer: a dual-feature input DNABert based deep learning method for enhancer recognition. Int J Mol Sci. 2024. 10.3390/ijms252111744.
- 57. Jin J, Yu Y, Wang R, Zeng X, Pang C, Jiang Y, et al. iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations. Genome Biol. 2022;23:1–23.
- 58. Su W, Xie XQ, Liu XW, Gao D, Ma CY, Zulfiqar H, et al. iRNA-ac4c: a novel computational method for effectively detecting N4-acetylcytidine sites in human mRNA. Int J Biol Macromol. 2023;227:1174–81.
- 59. Su QS, Phan L, Pham NT, Wei LY, Manavalan B. MST-m6A: a novel multi-scale transformer-based framework for accurate prediction of m6A modification sites across diverse cellular contexts. J Mol Biol. 2025. 10.1016/j.jmb.2024.168856.
- 60. Lv H, Dao FY, Zulfiqar H, Su W, Ding H, Liu L, Lin H. A sequence-based deep learning approach to predict CTCF-mediated chromatin loop. Brief Bioinform. 2021;22(5):bbab031.
- 61. Chen L, Li Y, Ma Y, Gao L, Yu L. Multiscale graph equivariant diffusion model for 3D molecule design. Sci Adv. 2025;11:eadv0778.
- 62. Zhu H, Hao H, Yu L. Identification of microbe–disease signed associations via multi-scale variational graph autoencoder based on signed message propagation. BMC Biol. 2024;22:172.
- 63. Guo X, Huang Z, Ju F, Zhao C, Yu L. Highly accurate estimation of cell type abundance in bulk tissues based on single-cell reference and domain adaptive matching. Adv Sci. 2024;11:2306329.
- 64. Chen YJ, Wang JC, Zou Q, Niu MT, Ding YJ, Song JN, et al. Drugdagt: a dual-attention graph transformer with contrastive learning improves drug-drug interaction prediction. BMC Biol. 2024. 10.1186/s12915-024-02030-9.
- 65. Yang YH, Ma CY, Gao D, Liu XW, Yuan SS, Ding H. I2OM: toward a better prediction of 2’-O-methylation in human RNA. Int J Biol Macromol. 2023. 10.1016/j.ijbiomac.2023.124247.
- 66. Su W, Yang Y, Zhao Y, Yuan S, Xie X, Hao Y, Zhang H, Ye D, Lyu H, Lin H. iPro-MP: a BERT-based model to predict multiple prokaryotic promoters. Github. 2025. https://github.com/Jackie-Suv/iPro-MP.
- 67. Su W, Yang Y, Zhao Y, Yuan S, Xie X, Hao Y, Zhang H, Ye D, Lyu H, Lin H. iPro-MP: a BERT-based model to predict multiple prokaryotic promoters. 2025. Zenodo. 10.5281/zenodo.15180139.








