BacTermFinder: a comprehensive and general bacterial terminator finder using a CNN ensemble

Seyed Mohammad Amin Taheri Ghahfarokhi; Lourdes Peña-Castillo

doi:10.1093/nargab/lqaf016

. 2025 Mar 8;7(1):lqaf016. doi: 10.1093/nargab/lqaf016

BacTermFinder: a comprehensive and general bacterial terminator finder using a CNN ensemble

Seyed Mohammad Amin Taheri Ghahfarokhi ¹, Lourdes Peña-Castillo ^2,^3,^✉

PMCID: PMC11890068 PMID: 40060369

Abstract

A terminator is a DNA region that ends the transcription process. Currently, multiple computational tools are available for predicting bacterial terminators. However, these methods are specialized for certain bacteria or terminator type (i.e. intrinsic or factor-dependent). In this work, we developed BacTermFinder using an ensemble of convolutional neural networks (CNNs) receiving as input four different representations of terminator sequences. To develop BacTermFinder, we collected roughly 41 000 bacterial terminators (intrinsic and factor-dependent) of 22 species with varying GC-content (from 28% to 71%) from published studies that used RNA-seq technologies. We evaluated BacTermFinder’s performance on terminators of five bacterial species (not used for training BacTermFinder) and two archaeal species. BacTermFinder’s performance was compared with that of four other bacterial terminator prediction tools. Based on our results, BacTermFinder outperforms all other four approaches in terms of average recall without increasing the number of false positives. Moreover, BacTermFinder identifies both types of terminators (intrinsic and factor-dependent) and generalizes to archaeal terminators. Additionally, we visualized the saliency map of the CNNs to gain insights on terminator motif per species. BacTermFinder is publicly available at https://github.com/BioinformaticsLabAtMUN/BacTermFinder.

Introduction

Transcription starts at the promoter region and ends at the terminator region. A terminator is a DNA segment that indicates the end of a gene or operon [1]. The termination of transcription is a process crucial for the accurate synthesis of RNA. In prokaryotes, termination can occur through either factor-dependent or factor-independent mechanisms, with the latter known as intrinsic termination. Intrinsic terminators are a DNA region rich in cytosine (C) and guanine (G) nucleotides followed by a poly-thymine (T) sequence. In the canonical terminator structure, the RNA created from the CG-rich region binds with itself, forming a hairpin structure that causes the RNA polymerase to stall. The weak base pairing between the adenine (A) nucleotides of the DNA template and the uridines of the RNA transcript let the transcript to detach from the template thus, terminating the transcription [2]. However, alternative intrinsic terminator structures have been identified in many bacteria [3]. While intrinsic termination requires only cis-acting elements on nascent RNAs, factor-dependent termination depends on both cis-acting RNA elements and a protein such as Rho (ρ), which mediates termination via three distinct mechanisms [4].

Identifying terminators is crucial for understanding operon structure and transcriptional regulation. As both terminator types rely, at least partially, on cis-acting elements on the DNA template, they can be predicted from genomic sequence. This can be done via computational identification. Genome-wide terminator prediction via computational approaches is more affordable and efficient than via wet lab approaches. Although terminator predictions should be considered putative terminators or terminator-like regions requiring further validation. Since the development of TransTermHP [5], which is arguably the most cited terminator prediction tool, there have been several computational tools developed for bacterial terminator prediction (summarized in Table 1). However, as one can see in Table 1, these tools have either (i) focused on few (one to three) bacterial species (mostly Escherichia coli and Bacillus subtilis), (ii) used a relatively small number of terminators for generating their model, (iii) predicted a single terminator type (factor-dependent or intrinsic), or (iv) a combination of the above. Thus, there is still the need for a species- and terminator type-independent tool for predicting bacterial terminators. The availability of genome-wide transcription termination sites (TTSs, also called in other studies 3′ termini or 3′ ends) identified by RNA-seq technologies such as Term-seq [6], SEnd-seq [7], SMRT-cappable [8], RendSeq [9], RNATag-seq [10], and dRNA-seq [11] in several bacterial species opens the door to generate a species-agnostic machine learning-based model using a large number (i.e. thousands) of terminator sequences of a wide range of bacterial species. Here, we generated such a method by (i) gathering a large collection of TTSs from published studies, (ii) exploring thousands of features to represent (encode) terminator sequences, (iii) generating and assessing 11 different machine learning models to identify bacterial terminators, and (iv) comparatively assessing the performance of our best model (BacTermFinder) with the performance of four other bacterial terminator prediction methods (namely, TermNN [12], iTerm-PseKNC [13], RhoTermPredict [14], and TransTermHP [5]). Our results show that BacTermFinder can detect intrinsic and factor-dependent terminators and even archaeal terminators at a higher recall rate than current tools.

Table 1.

Summary of tools for predicting bacterial terminators

Methods	Year	Terminator type	Method	Software available?	# of terminators	# of species
InterPin [42]	2023	Intrinsic	SM	*	N/A	N/A
TermNN [12]	2022	Intrinsic	DL	Yes	1175	2
ITT prediction [49]	2021	Intrinsic	Statistical	No	137	1
iterb-PPse [50]	2020	Both	ML	No	928	2
iTerm-PseKNC [13]	2019	Both	ML	Yes	852	1
RhoTermPredict [14]	2019	Factor-dep.	DP	Yes	1298	3
OPLS-DA [51]	2018	Factor-dep.	DA	Yes	104	2
PASIFIC [52]	2017	Intrinsic	SM	Yes	330	89
RNIE [53]	2011	Intrinsic	CM	Yes	1062	2
TransTermHP [5]	2007	Intrinsic	DP	Yes	N/A	N/A

Open in a new tab

# of terminators indicates how many terminator sequences were used in the study. The # of species indicates how many bacterial species were considered. CM = covariance model, DA = discriminant analysis, DL = deep learning, DP = Dynamic programming, ML = machine learning, and SM = Structure matching. * Database with predictions available.

Materials and methods

Terminator sequences collection

We searched the National Center for Biotechnology Information (NCBI) PubMed [15] and Gene Expression Omnibus database [16] using as keywords Term-seq, SEnd-seq, SMRT-cappable, Rend-seq, RNATag-seq, and dRNA-seq to find published studies, which experimentally identified bacterial TTSs. For each identified study, using an IPython Notebook [17] and pandas data manipulation library [18], we stored in a BED file the genomic location of identified TTSs (provided in the supplementary material of the corresponding publication). With BEDTools’ [19] slopeBed and FastaFromBed commands, we extracted the genomic sequences corresponding to 100 nt flanking the TTSs (50 nt on either side) into a FASTA file. In each operation, strandedness was taken into account. We chose 50 nt on either side of the TTSs as a previous study [20] found intrinsic terminator regions in E. coli as long as 65 nt. Thus, a sequence of 100 nt long would be enough to contain this region and allow for some variation. Terminators present in plasmids were disregarded. We also collected archaeal TTSs during this process and kept these for the comparative assessment. Additionally, we collected TTSs available in the following databases: RegulonDB [21], DBTBS [22], and BSGatlasDB [23]. As we collected multiple data sets for some of the bacterial species, we needed to remove duplicated sequences from our data. To achieve this, we merged genomic locations if they had at least a 60% genomic location overlap. When the merged sequence was >100 nt, we symmetrically removed the extra nucleotides from the beginning and the end of the sequence. Additionally, sequences containing ambiguous characters (such as N, Y, etc) were removed. As some of the studies we collected data from have used different genome assemblies for the same bacterium, after getting the terminator sequences, we removed duplicated sequences (i.e. sequences 100% identical). The bacterial species with their number of non-redundant terminators used for training and comparative assessment are listed in Tables 2 and 3, respectively. Details about each study are listed in Supplementary Table S1.

Table 2.

Number of non-redundant terminators (No. Term.) per genome accession used for training

Data source	Species	Phylum	Genome	No.	GC	Avg.
			Accession	Term.	(%)	Pre.
[6, 54]	Bacillus subtilis 168	Bacillota	AL009126.3	1800	42.9	0.88
[55–57],
[23, 58, 59]	Bacillus subtilis 168	Bacillota	NC_000964.3	4872	42.9	0.72
[57]	Caulobacter vibrioides NA1000	Pseudomonadota	CP001340.1	341	66.2	0.93
[60]	Clostridioides difficile 630	Bacillota	CP010905.2	1646	28.6	0.84
[61]	Dickeya dadantii 3937	Pseudomonadota	NC_014500.1	1786	55.5	0.49
[62]	Escherichia coli BW25113	Pseudomonadota	CP009273.1	1095	50.1	0.81
[7, 57, 58, 63],
[8, 37, 64]	E. coli str. K-12 substr. MG1655	Pseudomonadota	NC_000913.3	4139	50.1	0.62
[65]	Pseudomonas aeruginosa PAO1	Pseudomonadota	NC_002516.2	805	65.6	0.89
[66]	Staphylococcus aureus JKD6009	Bacillota	LR027876.1	978	32.4	0.81
[67]	Staphylococcus aureus NCTC 8325	Bacillota	NC_007795.1	566	32.4	0.90
[68]	Streptococcus pneumoniae D39V	Bacillota	CP027540.1	747	39.1	0.94
[69]	Streptococcus pneumoniae TIGR4	Bacillota	NC_003028.3	1810	39.1	0.67
[70, 71]	Streptomyces avermitilis MA-4680	Actinomycetota	BA000030.4	2006	69.7	0.63
[71, 38]	Streptomyces clavuligerus ATCC27064	Actinomycetota	CP027858.1	1583	71.6	0.62
[71]	Streptomyces coelicolor M145	Actinomycetota	NC_003888.3	1308	71.1	0.73
[71, 72]	Streptomyces griseus NBRC13350	Actinomycetota	NC_010572.1	2724	71.2	0.76
[71, 73]	Streptomyces lividans TK24	Actinomycetota	CP009124.1	1999	71.2	0.59
[71]	Streptomyces tsukubaensis NBRC108819	Actinomycetota	CP020700.1	1283	70.8	0.55
[57]	Vibrio natriegens ATCC 14048	Pseudomonadota	CP009977.1	905	44.7	0.97
[57]	Vibrio natriegens ATCC 14048	Pseudomonadota	CP009978.1	257	44.1	0.90
[74]	Vibrio parahaemolyticus RIMD 2210633	Pseudomonadota	NC_004603.1	1852	44.7	0.67
[75]	Zymomonas mobilis ZM4 = ATCC 31821	Pseudomonadota	CP023715.1	2040	45.7	0.54

Open in a new tab

There is a total of 36 542 terminator sequences. Avg. Pre. shows the average precision obtained during cross-validation as in Fig. 5.

Table 3.

Number of non-redundant terminators (No. Term.) per species used for comparative assessment for a total of 4626 and 3581 bacterial and archaeal terminators, respectively

Data source	Species	Phylum	Genome	No.	GC	Seq.
			Accession	Term.	(%)	Tech.
Bacteria
[34]	Mycobacterium tuberculosis H37Rv	Actinomycetota	AL123456.3	2202	64.7	Term-seq
[76]	Streptococcus agalactiae NEM316	Bacillota	NC_004368.1	655	35.1	dRNA-seq
[71]	Streptomyces gardneri ATCC 15439	Actinomycetota	CP059991.1	870	70.7	Term-seq
[77]	Synechocystis PCC 6803	Cyanobacteriota	NC_000911.1	553	47.0	Term-seq
[78]	Synechocystis PCC 7338	Cyanobacteriota	CP054306.1	346	47.1	Term-seq
Archaea
[79]	Haloferax volcanii DS2	Methanobacteriota	NC_013967	1227	65.7	Term-seq
[80]	Methanococcus maripaludis S2	Methanobacteriota	NC_005791	2354	32.6	Term-seq

Open in a new tab

Confirming the sequence length for terminator identification

To confirm that a sequence pattern was within the 100 nt extracted, we used relative nucleotide frequency graphs to visualize whether a distinct pattern was present within this region of interest (ROI). The relative log2 nucleotide frequency F_{i, j} for a specific position j and nucleotide i (i ∈ {A, T, C, G}) was calculated by Inline graphic where N_i,j is the total count of nucleotide i at position j and M is the total number of sequences. The log2 ratio of the relative nucleotide frequency in the ROI for each position j and nucleotide i was obtained by subtracting from F_{i, j} the log2 of the nucleotide i-content in the corresponding genome.

Non-terminator sequences generation

To train machine learning-based models, we needed instances of non-terminators of the same length and from the same genome as the terminator-containing sequences. To generate these negative examples, we used BEDTools’ shuffle command to randomly generate genomic coordinates different from the terminator regions. These random genomic regions are taken from anywhere in the genome and, thus, include intergenic and intragenic sequences. We allowed a maximum sequence overlap between positive and negative sequences of 20 nt. A ratio of 1–10 positive to negative was used as it was shown in [24] that there should be more negative than positive instances to lower the false-positive rate during genome scan. Furthermore, the terminator detection problem has a natural imbalance between terminator and non-terminator sequences, as the estimated number of terminators in a bacterial genome is relatively tiny compared to the number of all possible non-terminator sequences of the same length. Henceforth, we refer to our complete data set (including terminator and non-terminator sequences) as BacTermData.

Feature engineering and selection

BacTermData contains DNA sequences consisting of ATCG characters. However, machine learning methods expect to receive a numerical representation of the sequences as input. There are many approaches to numerically representing sequences, such as one-hot (binary) encoding, k-mer frequencies, etc. As it is not feasible to determine a priori which representation would generate the ‘best-performing’ machine learning model, one needs to try out several distinct representations. Here, the best-performing model refers to a model that maximizes a specific performance metric, such as the F1-score or area under the precision-recall curve (AUPRC). There are several libraries or software (e.g. MathFeature [25], iLearnPlus [26], and RepDNA [27]) to generate features from DNA sequences. Here we decided to use iLearnPlus to represent the sequences in BacTermData, as it can generate a wide variety of feature sets. Using iLearnPlus, we computed 6208 features (belonging to 28 unique feature sets) per sequence. A feature set contains features created by one feature generation method (e.g. one-hot encoding). After generating the features, we applied feature selection methods to identify informative features.

To measure the importance of the features, we used two methods: SHAP [28] and Gini measure of light gradient-boosting machine (LGBM) [29]. We used an iterative algorithm to drop features: (i) train an LGBM model with all remaining features and calculate feature importance values using SHAP and Gini measure, (ii) remove features in the bottom 20% after sorting the features based on their importance as per both feature importance methods, (iii) repeat step (i) and (ii) until the average of the AUPRC obtained during 10-fold cross-validation decreases. We observed a decrease in the AUPRC after reaching 1694 features, down from 6208 features (Fig. 1). We called these 1694 features (belonging to 22 feature sets) the 22-sets features. Supplementary Table S2 describes the 1694 selected features.

Figure 1. — Average precision (an estimate of AUPRC) as a function of the number of features included in the training set. Each dot indicates an iteration of the algorithm used to remove non-informative features. The shaded area is the standard deviation of 10 cross-validation folds. The vertical dash line indicates the point where a drop of AUPRC occurred.

Generating the 1694 features with ILearnPlus is computationally intensive [roughly, it processes 30 sequences/second in a high-performance computing (HPC) cluster]. To reduce the computational requirements, we selected the complete six feature sets comprising 60% of the 1694 features (Supplementary Table S2) and called the corresponding features the six-sets features. We also kept two feature sets (ENAC and one-hot/binary encoding) as they preserve the spatial location of patterns in the ROI.

Machine learning approaches

We utilized convolutional neural networks (CNNs) and fully connected neural networks (FCNNs) as they have been shown to perform well in various domains [30]. We also used a boosting method because of boosting methods’ ability to handle tabular data better than other techniques [31]. Table 4 describes the ML approaches considered. Table 5 describes the CNNs architecture.

Table 4.

Machine learning approaches considered

#	ML method	Features	Description
1	CNN	PS2, ENAC, OH, six-sets features	Fusion^a
2	CNN	OH, six-sets features	Fusion
3	CNN	PS2, ENAC, OH, six-sets features	Append^b
4	CNN	PS2	Single^c
5	CNN	OH	Single
6	CNN	ENAC	Single
7	CNN	NCP	Single
8	CNN	PS2, ENAC, OH, NCP	Ensemble^d
9	FCNN	Six-sets features	FCNN trained on six-sets features
10	LGBM	Six-sets features	A LGBM trained on six-sets features
11	LGBM	22-sets features	A LGBM trained on 22-sets features

Open in a new tab

Feature sets are described in Supplementary Table S2.

^a Train one CNN for each of the listed feature sets followed by fully-connected layer(s).

^b Train a CNN receiving as input the concatenated listed feature sets followed by fully-connected layer(s).

^c Train a CNN receiving as input the listed feature set followed by fully-connected layer(s).

^d Train independent CNNs followed by fully-connected layer(s) on each listed feature set and the combined prediction is their average output.

Table 5.

CNN architecture description

Layer type	Hyperparameter	Value
Conv1D	Filters	64
	Kernel_size	10
	Activation	PReLU
AveragePooling1D	Pool_size	2
Conv1D	Filters	64
	Kernel_size	10
	Activation	PReLU
AveragePooling1D	Pool_size	2
BatchNormalization	–	–
Conv1D	Filters	64
	Kernel_size	10
	Activation	PReLU
Dropout	Rate	0.1
Flatten	–	–
Dense (dropout)	Units & (rate)	500 (0.3)
	Activation	PReLU
Dense (dropout)	Units & (rate)	600 (0.3)
	Activation	PReLU
Dense (dropout)	Units & (rate)	600 (0.3)
	Activation	PReLU
Dense (dropout)	Units & (rate)	200 (0.4)
	Activation	PReLU
Dense (dropout)	Units & (rate)	200 (0.4)
	Activation	PReLU
Dense	Units	1
	Activation	Sigmoid

Open in a new tab

The architecture of the FCNN is a dense layer containing 400 units and, using the ReLU activation function, is applied to the categorical input. This is followed by a dropout layer with a rate of 0.3 to reduce overfitting. This pattern of a dense layer with 400 units and ReLU activation followed by a dropout layer with a rate of 0.3 is repeated six times more. The output from these repeated layers is then fed into a dense layer with 200 units and parametric ReLU activation, followed by a dropout layer with a rate of 0.4. This configuration is repeated once more before the final output layer, which consists of a single unit with a sigmoid activation function to provide a binary classification output.

Model generation and assessment

We used stratified Monte Carlo cross-validation (SMCCV) [32] (also called repeated random subsampling) to select the optimal hyperparameters of the neural network models. Monte Carlo cross-validation is a method that randomly determines the training and validation set in different iterations. The positive-to-negative data ratio is maintained in each iteration as it is stratified. We used SMCCV with 100 folds and 10 iterations to find the optimal hyperparameters for the models. With 100 folds, 99% of the data is for training and 1% for testing on each fold. Neural networks were implemented in TensorFlow version 2.8.0. We used randomized cross-validation with 50 iterations to find the best hyperparameters for the LGBM model (Table 6). We used the Python LGBM implementation version 3.3.3.

Table 6.

Best hyperparameters for LightGBM model

Hyperparameter	Value
Boosting_type	Goss
Colsample_bytree	0.8
Learning_rate	0.1
Max_depth	7
Min_child_weight	1
n_estimators	3000
Num_leaves	64
Objective	Cross_entropy_lambda
Random_state	42
Reg_lambda	30
Unbalance	True
n_jobs	−1

Open in a new tab

We used SMCCV to assess the models’ performance, and calculated average precision, recall, and F-score per species. We calculated score thresholds for classification that maximize the F0.5, F1, and F2 scores. Average precision is calculated as AP = ∑_n(R_n − R_{n − 1})P_n where R_n and P_n are the precision and recall at the nth threshold. Recall is the proportion of actual positive instances correctly identified by a model ( Inline graphic ), while precision measures the proportion of predicted positive instances that are actually true positives () where TP is the number of true positives, FN is the number of false negatives, and FP is the number of false positives. F-score is calculated as where β can be 0.5, 1, or 2 for the corresponding F-Score.

To support the reproducibility of the machine learning method of this study, the machine learning summary table is included in the supplementary material as per DOME recommendations [33].

Results and discussion

Clearly visible pattern in ROI in BacTermData

We visualized the log2 ratio of the relative nucleotide frequency across the ROI for each study data included in BacTermData and confirmed the presence of a clear pattern around the middle of the ROI in all the data. Figure 2 shows the aggregated pattern around the TTS (middle of the sequence) in the terminator sequences in BacTermData. This indicates that there is indeed a pattern in BacTermData, which machine learning approaches could learn to identify terminator sequences. BacTermData consists of 41 168 bacterial terminator sequences (36 542 used for training) of 25 bacterial strains belonging to four phyla: Pseudomonadota (formerly Proteobacteria), Bacillota (formerly Firmicutes), Actinomycetota (formerly Actinobacteria), and Cyanobacteriota. To test the generalization capabilities of our final model, the terminator sequences of five bacterial strains (Table 3), including the only two belonging to the phylum Cyanobacteriota, were left out of training and used only for the comparative assessment.

Figure 2. — Log2 ratio of the relative nucleotide frequency of all terminator sequences in BacTermData.

Selecting a best performing model: BacTermFinder

We assessed various ML approaches and sequence representation combinations (Table 4). This assessment involved training and evaluating these combinations on the data shown in Table 2 using SMCCV. The SMCCV results are presented in Table 7. The LGBM model trained on the 22-sets features achieved performance comparable to that of single CNNs, however, computing the 22-sets features is resource-intensive and time-consuming. During training the models, we noticed that the single CNN models were performing well based on average precision and thus, we decided to build an ensemble with these four models by simply averaging their outputs. This approach consisting of an ensemble of the four single CNNs (Fig. 3) achieved an average precision of 0.7080 ± 0.0248, outperforming all the other approaches considered (Table 7). We selected this as our final model and called it BacTermFinder. Figure 4 shows the precision-recall curve (PRC) and receiver operating characteristic curve (ROC) of BacTermFinder over all SMCCV iterations. Clearly, BacTermFinder’s performance is well above a random classifier’s performance (dashed lines in Fig. 4).

Table 7.

SMCCV average precision ± standard deviation

#	Approach	Average precision ± S.D.
1	CNN_fusion	0.5955 ± 0.0232
2	CNN_fusion	0.6553 ± 0.0252
3	CNN_append	0.5949 ± 0.0257
4	CNN_single	0.6738 ± 0.0253
5	CNN_single	0.6775 ± 0.0214
6	CNN_single	0.6544 ± 0.0284
7	CNN_single	0.6730 ± 0.0243
8	CNN_ensemble	0.7080 ± 0.0248
9	FCNN	0.5012 ± 0.0319
10	LGBM	0.5933 ± 0.0269
11	LGBM	0.6476 ± 0.0217

Open in a new tab

The highest SMCCV average precision is highlighted in bold. See Table 4 for a description of the ML approaches.

Figure 3. — Depiction of BacTermFinder architecture.

Figure 4. — PRC in blue [starting at coordinates (0,1)] and ROC curve in orange [starting at coordinates (0,0)] of BacTermFinder's performance on the aggregated results of SMCCV iterations. Dashed lines show the performance of a random classifier.

Effect of GC content and phylum on terminator identification

It has been shown that there is a correlation between GC content and prevalence of certain terminator shapes in bacteria [3]. To explore whether there is an effect of GC content or phylum on terminator identification, we decided to visualize the average precision per bacterium versus their GC content and coloured it by the phylum (Fig. 5). A linear regression line fitted to this data indicates a moderate relationship between GC content and average precision: As the GC content increases, the performance decreases (Spearman correlation value of −0.46, P-value .031). For a 10% increase in GC content, the average precision decreases by 0.038. Thus, BacTermFinder tends to achieve higher average precision in bacteria with lower GC content. Bacteria with high GC content tend to have more factor-dependent terminators [34] and less intrinsic terminators with the canonical shape [3]. Factor-dependent terminators do not always have the hairpin structure of intrinsic terminators [4], and thus, their sequence motif might be weaker. Additionally, the Rho utilization (rut) site is 60–90 nt upstream of the terminator [35], and thus, often outside our ROI. These issues might partially explain why BacTermFinder’s performance is lower on high GC bacteria; however, this pattern is not observed within phyla, and BacTermFinder’s performance is more consistent (with less variation) for Bacillota and Actinomycetota than for Pseudomonadota (Fig. 5). All the bacteria in the phylum Actinomycetota is from the Streptomyces genus (Table 2) and that might explain why BacTermFinder’s performance is similar for these bacteria; however, this is not the case for the phylum Bacillota. Other factors, in addition to GC content, probably affect terminator identification between and within phyla and further investigation is needed.

Figure 5. — Average precision versus GC content of BacTermFinder SMCCV results per bacterial strain. Bacterial strains are coloured based on their phylum with Bacillota in the GC % between 30 and 43%, Pseudomonadota between 44 and 66%, and Actinomycetota above 66%.

Assessing BacTermFinder’s predictions on independent validation data

To confirm that BacTermFinder had learned terminator-like sequence patterns present in prokaryotic organisms not included in the training data, we tested BacTermFinder on an independent set of bacterial and archaeal terminators. For this assessment, we used the experimentally determined terminators described in Table 3 as the reference set of true terminators and compare them against BacTermFinder’s genome-wide terminator predictions for these organisms. We used BacTermFinder’s predictions with a probability threshold of at least 0.30 (this is the probability threshold that maximized the F1 score during cross-validation) and merged those predictions three or less nucleotides apart. BacTermFinder predictions used are available in BacTermFinder’s GitHub repository under the folder ‘data/BacTermFinder predictions validation set’.

First we looked at whether predicted terminators have a log2 ratio of the relative nucleotide frequency similar to that of experimentally identified terminators. Log2 ratios are provided in the Supplementary File 2. To estimate the similarity between these distributions, we calculated Kendall’s Tau between both frequency distributions per nucleotide. As it can be seen from Fig. 6 for Streptococcus agalactiae, the log2 ratio of the relative nucleotide frequency of predicted terminators is similar to that of experimentally identified terminators. The saliency map (Fig. 6C) shows the contribution of each nucleotide per position in BacTermFinder’s predictions. Negative numbers reduce the probability of predicting a terminator, while positive numbers increase the probability of predicting a terminator. Figures 7 and 8 show the same visual analysis for the archaea Haloferax volcanii and the cyanobacterium Synechocystis PCC 6803. Figures for the the other four organisms can be seen in Supplementary Figs S1–S4. The average Kendall’s Tau coefficient value per nucleotide across the seven organisms in the validation data are: A 0.42 ± 0.08, T 0.35 ± 0.17, C 0.29 ± 0.06, and G 0.30 ± 0.18. This suggests that BacTermFinder predicted terminators have similar sequence patterns to those present in experimentally determined terminators. The lowest Kendall’s Tau correlation is observed in Methanococcus maripaludis followed by Mycobacterium tuberculosis; while the highest correlation is observed in Streptococcus agalactiae followed by Synechocystis PCC 6803.

Figure 6. — Visualization of *Streptococcus agalactiae* experimentally identified terminators (A) versus BacTermFinder genome-wide predictions (B) on the Y-axis is the log2 ratio and on the X-axis is the nucleotide position. Kendall’s Tau coefficient values per nucleotide are A 0.56 (P-value 9.41 × 10⁻¹⁷), T 0.62 (P-value 4.88 × 10⁻²⁰), C 0.26 (P-value 1.04 × 10⁻⁴), and G 0.38 (P-value 1.84 × 10⁻⁸). (A) Log2 ratio of the relative nucleotide frequency of experimentally determined terminators. (B) Log2 ratio of the relative nucleotide frequency of genome-wide predicted terminators. (C) Saliency map generated using using [81].

Figure 7. — Visualization of H. *volcanii* DS2 experimentally identified terminators (A) versus BacTermFinder genome-wide predictions (B) on the Y-axis is the log2 ratio and on the X-axis is the nucleotide position. Kendall’s Tau coefficient values per nucleotide are A 0.43 (P-value 2.57 × 10⁻¹⁰), T 0.26 (P-value 1.2 × 10⁻⁴), C 0.39 (P-value 1.04 × 10⁻⁸), and G 0.32 (P-value 2.09 × 10⁻⁶). (A) Log2 ratio of the relative nucleotide frequency of experimentally determined terminators. (B) Log2 ratio of the relative nucleotide frequency of genome-wide predicted terminators. (C) Saliency map generated using [81].

Figure 8. — Visualization of *Synechocystis* PCC 6803 experimentally identified terminators (A) versus BacTermFinder genome-wide predictions (B) on the Y-axis is the log2 ratio and on the X-axis is the nucleotide position. Kendall’s Tau coefficient values per nucleotide are A 0.43 (P-value 1.82 × 10⁻¹⁰), T 0.47 (P-value 4.98 × 10⁻¹²), C 0.29 (P-value 1.70 × 10⁻⁵), and G 0.36 (P-value 7.64 × 10⁻⁸). (A) Log2 ratio of the relative nucleotide frequency of experimentally determined terminators. (B) Log2 ratio of the relative nucleotide frequency of genome-wide predicted terminators. (C) Saliency map generated using [81].

As previously done by Bar A et al. [36], we classified BacTermFinder predicted terminators as primary (≤200-nt downstream of an annotated stop codon) and internal (intragenic). On average 16.09 ± 2.16% of BacTermFinder’s predictions are primary and 48.67 ± 5.35% are internal. On average, 23.73 ± 10.60% of TTSs detected by sequencing methods are located within genes [36–38]. Even though, most BacTermFinder predictions are nearby annotated stop codons (Supplementary Fig. S5), the proportion of intragenic predictions among BacTermFinder’s predictions is on the upper range of those experimentally detected by sequencing methods. In addition to false positive predictions, these intragenic predictions might be due to other reasons; for example, intragenic intrinsic terminators have been observed in many bacteria where they might act as attenuators and terminator-like structures may have other functions [39].

We constructed average profiles of structural and physical characteristics of the predicted terminators using the values collected for RNA dinucleotides in the DiProDB database [40] (accessed on 9th December 2024) and compared them against the average profiles of same-length randomly selected sequences from the corresponding genome (control sequences). Predicted terminators showed a clearly distinct average profile from that of the control sequences in all RNA characteristics available in DiProDB. Figure 9 shows the RNA free energy [41] profiles of BacTermFinder predicted terminators for the seven organism in our validation data. This indicates that predicted terminator sequences have a clearly distinct structural characteristics than control sequences.

Figure 9. — RNA free energy average profiles of predicted terminator sequences (orange, jagged line) and control sequences (gray, horizontal line). Dinucleotide free energy values were obtained from the DiProDB database. AL123456.3 = *Mycobacterium tuberculosis* H37Rv, CP054306.1 = *Synechocystis* PCC 7338, CP059991.1 = *Streptomyces gardneri* ATCC 15439, NC_000911.1 = *Synechocystis* PCC 6803, NC_004368.1 = *Streptococcus agalactiae* NEM316, NC_005791 = *Methanococcus maripaludis* S2, NC_013967 = H. *volcanii* DS2.

Comparative assessment for genome-wide terminator prediction

We compared BacTermFinder’s performance with that of TermNN [12], iTerm-PseKNC [13], RhoTermPredict [14], and TransTermHP [5] on terminators of five bacterial species (Table 3) not used for generating our model. We decided to include TermNN, RhoTermPredict, and iTerm-PseKNC in the comparative assessment because they are the most recently developed tools available for predicting intrinsic, factor-dependent, and both types of terminators, respectively (Table 1). We included TransTermHP because, as mentioned earlier, is the most cited tool for bacterial terminator prediction. All programs were used to obtain genome-wide terminator predictions. Additionally, we downloaded (on 9th December 2024) genome-wide intrinsic terminator predictions from the InterPin database [42] for Synechocystis sp. PCC 6803 and Streptococcus agalactiae NEM316. The InterPin database predictions for Mycobacterium tuberculosis H37Rv and Streptomyces gardneri ATCC 15439 were done in a different genome assembly than the one used in this study. There were not predictions available for Synechocystis sp. PCC 7338 in the InterPin database.

Statistical significance of the difference in recall between methods was estimated using the Friedman test which is a non-parametric test recommended for comparison of more than two classifiers over multiple datasets [43]. To find out which models differ in terms of recall, we used several post-hoc pair-wise tests as recommended in [44], namely Quade, Miller, Nemenyi, and Siegel post-hoc tests. All statistical tests were carried out in R using the packages PMCMRplus [45] and scmamp [46].

As the experimentally identified terminators by sequencing methods in our validation data are not exhaustive (i.e. there are actual terminators missing due to lack of expression of some transcripts in the growth conditions sampled for sequencing), absence of a experimentally-detected terminator in a specific region does not prove that there is not a terminator in that region. Considering all predicted terminators absent in the validation data as false positives would overestimate the false positive rate of the tools. Thus, we evaluated predictive performance using recall (also called sensitivity or true positive rate) at 10 sequence overlapping thresholds between the predicted terminators and the actual terminators (Table 8). That is, we first considered as correct predictions those predictions with at least 10% overlap with an actual terminator, then considered those with at least 20% overlap with an actual terminator as correct, and so on until considering only those with 100% overlap with an actual terminator as correct. In this way, we evaluated how accurately the terminator location is predicted by the tools. To do this, we used BEDtools’ intersect command and calculated recall at each overlap threshold.

Table 8.

Average recall over 10 overlap thresholds ± standard deviation for four other terminator prediction tools and BacTermFinder

Bacterium	RhoTermPredict	ITerm-PseKNC	TransTermHP	TermNN	BacTermFinder
Streptomyces gardneri ATCC 15439	0.25 ± 0.13	0.01 ± 0.004	0.17 ± 0.10	0.37 ± 0.22	0.60 ± 0.22
CP059991.1, GC = 70.7%
Mycobacterium tuberculosis H37Rv	0.24 ± 0.16	0.18 ± 0.11	0.003 ± 0.002	0.14 ± 0.09	0.23 ± 0.12
AL123456.3, GC = 64.7%
Synechocystis sp. PCC 7338	0.10 ± 0.04	0.12 ± 0.05	0.28 ± 0.15	0.51 ± 0.26	0.64 ± 0.22
CP054306.1, GC = 47.1%
Synechocystis sp. PCC 6803	0.12 ± 0.08	0.10 ± 0.05	0.22 ± 0.10	0.48 ± 0.24	0.54 ± 0.18
NC_000911.1, GC = 47.0%
Streptococcus agalactiae NEM316	0.02 ± 0.006	0.24 ± 0.13	0.56 ± 0.27	0.77 ± 0.32	0.82 ± 0.30
NC_004368.1, GC = 35.1%
Overall mean recall	0.15 ± 0.10	0.13 ± 0.09	0.25 ± 0.20	0.46 ± 0.23	0.57 ± 0.21

Open in a new tab

The corresponding genome accession and GC content (%) is provided below each bacterium name. The highest recall per row is highlighted in bold.

We included in the comparative assessment bacteria with different GC content. Bacteria with high GC content tend to have a larger proportion of factor-dependent terminators [34]. For example, Mycobacterium tuberculosis is reported to have a large proportion (up to 54%) of factor-dependent terminators [34]. As RhoTermPredict was designed to predict factor-dependent terminators, we hypothesized it would perform better on bacteria with high GC content than on bacteria with low GC content. On the other hand, we hypothesized that termNN and TransTermHP would perform better on bacteria with low GC content than on bacteria with high GC content, as these two tools were designed to identify intrinsic terminators. Our results (Table 8) indeed support these hypotheses. RhoTermPredict achieved its highest recall (roughly 24%) on Mycobacterium tuberculosis and Streptomyces gardneri, which are the two organisms with the highest proportion of factor-dependent terminators. BacTermFinder’s recall on these two organisms (i.e. Mycobacterium tuberculosis and Streptomyces gardneri) is comparable to or better than that of RhoTermPredict. This indicates that BacTermFinder can indeed predict some factor-dependent terminators. All tools but RhoTermPredict achieved their highest recall on Streptococcus agalactiae (lowest GC content). InterPin’s recall for Synechocystis sp. PCC 6803 is 0.28 ± 0.19 and for Streptococcus agalactiae NEM316 is 0.49 ± 0.23. InterPin’s recall is comparable to that of TransTermHP (Table 8). This is unsurprising as both tools (InterPin and TransTermHP) predict terminators by seeking sequences that match the canonical hairpin intrinsic terminator structure. Tools’ recall varied between species with an overall recall range of [0.03, 0.82]. BacTermFinder outperformed ITerm-PseKNC, TermNN, and TransTermHP in terms of mean recall across various overlap thresholds in all the species in the validation set, and it achieved the highest mean recall overall (Table 8). This result suggests that BacTermFinder is able to find both types of terminators (intrinsic and factor-dependent) and can generalize to phyla not seen during training (e.g. Cyanobacteriota).

The Friedman test (P-value <2.2 × 10⁻¹⁶) indicated that the average recall rank obtained by some of the classifiers is significantly different from the mean rank expected under the null hypothesis. All post-hoc pair-wise tests indicated that BacTermFinder’s recall rankings were significantly better than those of all other tools, and that termNN’s recall rankings were significantly better than those of iTerm-PseKNC, TransTermHP, and RhoTermPredict (Supplementary Fig. S6 shows the critical difference plot and results per post-hoc test are shown in Supplementary Table S3).

As we have an incomplete annotation of all terminators in any given bacterial genome, it is hard to estimate the genome-wide false positive rate of terminator prediction tools since a prediction might indeed be correct even though a terminator has not yet been determined experimentally in that location. However, the number of predicted terminators per gene can provide a rough estimate of the number of false positives. Figure 10 shows the distribution of number of terminators predicted per gene by TermNN, RhoTermPredict, and BacTermFinder across the five bacteria in our independent validation data. BacTermFinder displays less variation in the number of predicted terminators per gene, and predicts, on average, 6.62 ± 1.18 terminators per gene, while TermNN and RhoTermPredict predict 8.89 ± 3.52 and 4.30 ± 3.20, respectively. RhoTermPredict predicts less terminators per gene; however, its recall is substantially lower than BacTermFinder’s for four out of five bacteria in our independent validation data. The results provided in Fig. 10 and Table 8 indicate that BacTermFinder’s false positive rate is lower than that of TermNN (the second best tool) while achieving a higher recall rate.

Figure 10. — Distribution of the number of predicted terminators per gene across the five bacterial species in our independent validation data. The horizontal line inside each box indicates the median value, and the bottom and top of each box indicate the 25 and 75 percentile, respectively.

We used TermNN and BacTermFinder, which are the two tools with the highest recall as per Table 8, to find archaeal terminators. BacTermFinder outperforms TermNN in terms of recall in predicting archaeal terminators (Table 9). This indicates that BacTermFinder can generalize to identify some archaeal terminators.

Table 9.

Average recall over 10 overlap thresholds on two archaeal species of the two best performing terminator prediction tools (TermNN and BacTermFinder) as per the results shown in Table 8

Archaea	TermNN	BacTermFinder
Haloferax volcanii	0.15 ± 0.11	0.32 ± 0.20
NC_013967, GC = 65.7%
Methanococcus maripaludis	0.40 ± 0.22	0.57 ± 0.25
NC_00579, GC = 32.6%

Open in a new tab

The highest recall per Archaea is highlighted in bold.

BacTermFinder can predict the location of terminators accurately

We assessed how closely our predictions aligned with actual terminator locations. To do this, we visualized the recall rate as a function of sequence overlap of the predicted terminators with the actual terminators (Fig. 11). We hypothesized that we would observe a declining trend as we moved towards stricter overlap thresholds, which is indeed the case. We compared our overlap versus recall with that of TermNN. Our results show that (i) on every overlap threshold, BacTermFinder outperforms or is comparable to TermNN, and (ii) BacTermFinder’s recall drops at stricter overlaps than TermNN’s recall. The latter indicates that BacTermFinder can find the location of terminators more accurately than TermNN. However, BacTermFinder’s recall sharply decreases at overlap thresholds >0.8, which suggests the potential need for a nucleotide-wise segmentation approach to achieve accuracy at the nucleotide level.

Figure 11. — Recall as a function of percentage sequence overlap between predicted and actual terminators. All sequences are 100 nt long. The dotted lines are TermNN, and the solid lines are BacTermFinder. Each colour represents a bacterium in our independent validation data. The area under the curves is shown in the legend.

Additionally, we assessed the accuracy of the probability of being a terminator outputted by TermNN and BacTermFinder. To do this, we sorted their predictions based on their estimated probabilities of a sequence being a terminator, and selected the top n% to calculate recall at ten different percentage overlaps with actual terminators. Subsequently, we calculated the mean recall and standard deviation across the percentage overlap levels. We visualized mean recall versus top n predictions for a bacterium with high GC content (Streptomyces gardneri, Fig. 12A), a bacterium with approximately equiprobable nucleotide distribution (Synechocystis PCC 7338, Fig. 12B) and a bacterium with low GC content (Streptomyces gardneri, Fig. 12C). For these three bacteria, BacTermFinder achieves higher recall rates at any top n% predictions than TermNN. BacTermFinder has a wider margin over TermNN as the GC content increases. BacTermFinder has also less variation in recall across different overlap thresholds (shaded area in Fig. 12) than TermNN. For Streptomyces gardneri, the worst recall level of BacTermFinder is similar to or higher than the best recall level of TermNN across all top n% of predictions. With BacTermFinder’s 10% most confident predictions on any of these three bacteria, one is able to identify close to 40% of their known terminators.

Figure 12. — Average recall (solid lines) versus top n% most confident predictions for BacTermFinder (blue, top line) and TernNN (orange, bottom line) for (A) *Streptomyces gardneri*, (B) *Synechocystis* PCC 7338, and (C) *Streptococcus agalactiae*. The shaded area indicates standard deviation.

Low agreement between terminator prediction tools

To see the agreement among RhoTermPredict, TermNN, and BacTermFinder, we generated a Venn diagram for two bacteria in the extremes of the GC content axis in our validation data. That is Mycobacterium tuberculosis with 64.69% GC content and Streptococcus agalactiae with 35.12% GC content. We chose Mycobacterium tuberculosis as it has a large proportion of factor-dependent terminators [34]. For Mycobacterium tuberculosis, the highest agreement (14% of the total predicted terminators) is between BacTermFinder and RhoTermPredict (Fig. 13, intersection areas in the center and top-center). For Streptococcus agalactiae, a low GC bacteria, the highest agreement (92.8% of the total predicted terminators) is between BacTermFinder and TermNN (Fig. 14, intersection areas in the center and left-center). This suggests that there is lower agreement among tools to predict factor-dependent terminators (Fig. 13) than intrinsic terminators (Fig. 14). BacTermFinder predicts a similar number of terminators in Mycobacterium tuberculosis and Streptococcus agalactiae as RhoTermPredict and TermNN, respectively. This suggests that BacTermFinder is as effective in identifying factor-dependent and intrinsic terminators as tools specialized on each terminator type. The number of terminators predicted by all three tools is quite low for both bacteria: 23 (or 1.9%) and 9 (or 1.4%) for Mycobacterium tuberculosis and Streptococcus agalactiae, respectively.

Figure 13. — Agreement among 1220 predicted terminators (with at least 50% sequence overlap with an actual terminator) of BacTermFinder, RhoTermPredict, and TermNN for *Mycobacterium tuberculosis*.

Figure 14. — Agreement among 642 predicted terminators (with at least 50% sequence overlap with an actual terminator) of BacTermFinder, RhoTermPredict, and TermNN for *Streptococcus agalactiae*.

Case study: the Rhodobacter capsulatus SB1003 puc operon

As a case study, we looked specifically at terminator predictions made by BacTermFinder, TermNN, RhoTermPredict, and InterPin for the puc operon of Rhodobacter capsulatus SB 1003 (chromosome sequence RefSeq ID: NC_014034.1). R. capsulatus has a GC content of 66.5%, is from the phylum Pseudomonadota and has both types of termination (intrinsic and Rho-dependent). All predictions on the reverse strand were discarded for this analysis. We chose this operon because its transcription initiation, attenuation, and termination have been previously studied using gene fusions, high-resolution RNA 5’ -end mappings by primer extensions, and RNA blot hybridization by LeBlanc et al. [47]. Additionally, R. capsulatus was not included in BacTermFinder training data. LeBlanc et al.’s model of puc RNA regulation indicated that there are two termination sites in this operon: one located immediately downstream of pucE and another between pucA and pucC (circles in Fig. 15). We deemed predictions overlapping the vertical rectangles in Fig. 15 as predicting the corresponding experimentally-determined termination site. Only BacTermFinder and InterPin predicted both termination sites identified by LeBlanc et al. (i.e. their recall is 100%). RhoTermPredict and TermNN only predicted the site directly downstream of pucA (i.e. their recall is 50%). The web version of ITerm-PseKNC failed to predict any termination site in this region. BacTermFinder and RhoTermPredict have a prediction in common in the middle of pucDE. LeBlanc et al. identified several RNA 5’ ends in the middle of pucDE and concluded that there is a promoter internal to pucD. Thus, BacTermFinder and RhoTermPredict are likely both mistakenly identifying this promoter as a terminator. Interestingly, BacTermFinder and RhoTermPredict predicted a pucB-internal terminator, which is immediately downstream of a small RNA (sRNA) predicted with high confidence (probability of being a sRNA of 0.99) by Grüll et al. [48] (rectangle in the third track of Fig. 15). Based on these results, if only the two terminator sites identified by LeBlanc et al. are considered active termination sites, then the precision of the tools is 100%, 40%, 20%, and 11% for InterPin, BacTermFinder, TermNN, and RhoTermPredict, respectively. However, if one assumes that the terminator internal to pucB is bona fide, then the recall of InterPin and BacTermFinder would be 66% and 100%, respectively, while the precision of RhoTermPredict and BacTermFinder would increase to 22% and 60%, respectively. From the perspective of using predictions to guide further research, the predicted pucB-internal termination site downstream of the predicted sRNA is a promising lead. This case study suggests that BacTermFinder might be more sensitive for identifying non-canonical termination sites than InterPin; while, keeping its false positive rate lower than that of other tools.

Figure 15. — *R. capsulatus puc* operon. Circles in the first track indicate attenuation and termination sites experimentally identified by LeBlanc *et al.* [47]. NCBI annotated coding sequences and stop codons are shown as rectangles and thin lines in the first and second track, respectively. A sRNA predicted with high-confidence in [48] is shown as a rectangle in the third track. The bottom tracks shown terminator predictions of BacTermFinder (fourth track, dark green rectangles), TermNN (fifth track, magenta rectangles), RhoTermPredict (sixth track, orange rectangles), and InterPin (bottom track, gray rectangles). Predictions within the two large vertical rectangles overlap with the experimentally identified attenuation/termination sites. The number below BacTermFinder’s predictions is the probability of being a terminator calculated by BacTermFinder. Image was partially created using Integrative Genomics Viewer (IGV) [82].

Conclusions

In this work, we have collected, to our knowledge, the largest data set of bacterial terminators identified by sequencing technologies. This comprehensive dataset includes terminators from 25 bacterial strains with a wide range of GC content identified by various sequencing technologies. We expect these data to be valuable for further developments in bacterial terminator prediction. Additionally, we have developed BacTermFinder, a general model for finding bacterial terminators. Future work to improve BacTermFinder includes (i) increasing the ROI to include the rut site; (ii) adding archaeal terminators into the training data; and (iii) determining the terminator type, as BacTermFinder can find both terminator types but does not indicate which type is found.

BacTermFinder outperforms in terms of recall other existing tools in an independent validation data set. BacTermFinder’s average recall over five bacterial species is 0.57 ± 0.21, and TermNN’s (the second-best tool) recall is 0.46 ± 0.23. This increase in recall is achieved by BacTermFinder while predicting, on average, two terminators less per gene than TermNN. BacTermFinder recall in all prokaryotic data (bacterial and archaeal) is 0.53 ± 0.20, and TermNN’s is 0.40 ± 0.22. Furthermore, BacTermFinder can identify both types of terminators as good as or better than tools specialized on that specific terminator type. BacTermFinder and BacTermData are available at:

https://github.com/BioinformaticsLabAtMUN/BacTermFinder.

Supplementary Material

lqaf016_Supplemental_Files

lqaf016_supplemental_files.zip^{(2.1MB, zip)}

Acknowledgements

This research was partly enabled by computing infrastructure provided by Acenet (ace-net.ca) and the Digital Research Alliance of Canada (alliancecan.ca).

Author contributions: L.P-C. funding acquisition, conceptualization, supervision, formal analysis, visualization and writing. S.M.A.T.G. data curation, investigation, software, formal analysis, visualization and writing.

Contributor Information

Seyed Mohammad Amin Taheri Ghahfarokhi, Department of Computer Science, Memorial University of Newfoundland, St. John’s, Newfoundland A1B 3X5, Canada.

Lourdes Peña-Castillo, Department of Computer Science, Memorial University of Newfoundland, St. John’s, Newfoundland A1B 3X5, Canada; Department of Biology, Memorial University of Newfoundland, St. John’s, Newfoundland A1B 3X9, Canada.

Supplementary data

Supplementary data is available at NAR Genomics & Bioinformatics online.

Conflict of interest

None declared.

Funding

This work was supported by funds from an Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant (2019-05247) to L.P-C. and a graduate fellowship from Memorial University School of Graduate Studies to S.M.A.T.G.

Data availability

Software and data are available at https://github.com/BioinformaticsLabAtMUN/BacTermFinder and https://doi.org/10.5281/zenodo.14498242.

References

1. Ray-Soni A, Bellecourt MJ, Landick R Mechanisms of bacterial transcription termination: all good things must end. Annu Rev Biochem. 2016; 85:319–47. 10.1146/ANNUREV-BIOCHEM-060815-014844. [DOI] [PubMed] [Google Scholar]
2. Santangelo TJ, Artsimovitch I Termination and antitermination: RNA polymerase runs a stop sign. Nat Rev Microbiol. 2011; 9:319–29. 10.1038/nrmicro2560. [DOI] [PMC free article] [PubMed] [Google Scholar]
3. Mitra A, Angamuthu K, Jayashree HV et al. Occurrence, divergence and evolution of intrinsic terminators across eubacteria. Genomics. 2009; 94:110–6. 10.1016/j.ygeno.2009.04.004. [DOI] [PubMed] [Google Scholar]
4. Song E, Uhm H, Munasingha PR et al. Rho-dependent transcription termination proceeds via three routes. Nat Commun. 2022; 13:1663. 10.1038/s41467-022-29321-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
5. Kingsford CL, Ayanbule K, Salzberg SL Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake. Genome Biol. 2007; 8:R22. 10.1186/GB-2007-8-2-R22/FIGURES/5. [DOI] [PMC free article] [PubMed] [Google Scholar]
6. Dar D, Shamir M, Mellin JR et al. Term-seq reveals abundant ribo-regulation of antibiotics resistance in bacteria. Science. 2016; 352:aad9822. 10.1126/science.aad9822. [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Ju X, Li D, Liu S Full-length RNA profiling reveals pervasive bidirectional transcription terminators in bacteria. Nat Microbiol. 2019; 4:1907–1918. 10.1038/S41564-019-0500-Z. [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Yan B, Boitano M, Clark TA et al. SMRT-Cappable-seq reveals complex operon variants in bacteria. Nat Commun. 2018; 9:3676. 10.1038/S41467-018-05997-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Lalanne JB, Taggart JC, Guo MS et al. Evolutionary convergence of pathway-specific enzyme expression stoichiometry. Cell. 2018; 173:749–61. 10.1016/j.cell.2018.03.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Shishkin AA, Giannoukos G, Kucukural A et al. Simultaneous generation of many RNA-seq libraries in a single reaction. Nat Methods. 2015; 12:323. 10.1038/NMETH.3313. [DOI] [PMC free article] [PubMed] [Google Scholar]
11. Sharma CM, Vogel J Differential RNA-seq: the approach behind and the biological insight gained. Curr Opin Microbiol. 2014; 19:97–105. 10.1016/J.MIB.2014.06.010. [DOI] [PubMed] [Google Scholar]
12. Brandenburg VB, Narberhaus F, Mosig A Inverse folding based pre-training for the reliable identification of intrinsic transcription terminators. PLoS Comput Biol. 2022; 18:e1010240. 10.1371/JOURNAL.PCBI.1010240. [DOI] [PMC free article] [PubMed] [Google Scholar]
13. Feng CQ, Zhang ZY, Zhu XJ et al. iTerm-PseKNC: a sequence-based tool for predicting bacterial transcriptional terminators. Bioinformatics. 2019; 35:1469–77. 10.1093/bioinformatics/bty827. [DOI] [PubMed] [Google Scholar]
14. Di Salvo M, Puccio S, Peano C et al. RhoTermPredict: an algorithm for predicting Rho-dependent transcription terminators based on Escherichia coli, Bacillus subtilis and Salmonella enterica databases. BMC Bioinformatics. 2019; 20:117. 10.1186/s12859-019-2704-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
15. NCBI National Center for Biotechnology Information (NCBI) PubMed 1988–2023. https://pubmed.ncbi.nlm.nih.gov/10 November 2023, date last accessed.
16. NCBI National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) 1999–2023. https://www.ncbi.nlm.nih.gov/geo/10 November 2023, date last accessed.
17. Kluyver T, Ragan-Kelley B, Pérez F et al.. Loizides F, Schmidt B Jupyter Notebooks—a publishing format for reproducible computational workflows. Positioning and Power in Academic Publishing: Players, Agents and Agendas. 2016; Amsterdam: IOS Press; 87–90. [Google Scholar]
18. The pandas development team pandas-dev/pandas: Pandas. 2020; Zenodo; 10.5281/zenodo.3509134. [DOI] [Google Scholar]
19. Quinlan AR, Hall IM BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010; 26:841–2. 10.1093/BIOINFORMATICS/BTQ033. [DOI] [PMC free article] [PubMed] [Google Scholar]
20. Lesnik EA, Sampath R, Levene HB et al. Prediction of rho-independent transcriptional terminators in Escherichia coli. Nucleic Acids Res. 2001; 29:3583–94. 10.1093/nar/29.17.3583. [DOI] [PMC free article] [PubMed] [Google Scholar]
21. Gama-Castro S, Salgado H, Santos-Zavaleta A et al. RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Res. 2016; 44:D133–43. 10.1093/NAR/GKV1156. [DOI] [PMC free article] [PubMed] [Google Scholar]
22. Ishii T, Yoshida KI, Terai G et al. DBTBS: a database of Bacillus subtilis promoters and transcription factors. Nucleic Acids Res. 2001; 29:278–80. 10.1093/NAR/29.1.278. [DOI] [PMC free article] [PubMed] [Google Scholar]
23. Geissler AS, Anthon C, Alkan F et al. BSGatlas: a unified Bacillus subtilis genome and transcriptome annotation atlas with enhanced information access. Microb Genom. 2021; 7:524. 10.1099/MGEN.0.000524. [DOI] [PMC free article] [PubMed] [Google Scholar]
24. Chevez-Guardado R, Peña-Castillo L Promotech: a general tool for bacterial promoter recognition. Genome Biol. 2021; 22:318. 10.1186/S13059-021-02514-9/TABLES/10. [DOI] [PMC free article] [PubMed] [Google Scholar]
25. Bonidia RP, Domingues DS, Sanches DS et al. MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors. Brief Bioinform. 2022; 23:bbab434. 10.1093/BIB/BBAB434. [DOI] [PMC free article] [PubMed] [Google Scholar]
26. Chen Z, Zhao P, Li C et al. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res. 2021; 49:e60. 10.1093/NAR/GKAB122. [DOI] [PMC free article] [PubMed] [Google Scholar]
27. Liu B, Liu F, Fang L et al. repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics. 2015; 31:1307–9. 10.1093/BIOINFORMATICS/BTU820. [DOI] [PubMed] [Google Scholar]
28. Lundberg SM, Erion G, Chen H et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020; 2:2522–5839. [DOI] [PMC free article] [PubMed] [Google Scholar]
29. Ke G, Meng Q, Finley T et al.. Guyon I, Luxburg UV, Bengio S et al. LightGBM: a highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems. 2017; 30:New York, USA: Curran Associates, Inc. [Google Scholar]
30. LeCun Y, Bengio Y, Hinton G Deep Learning. Nature. 2015; 521:436–44. 10.1038/nature14539. [DOI] [PubMed] [Google Scholar]
31. Grinsztajn L, Oyallon E, Varoquaux G Why do tree-based models still outperform deep learning on typical tabular data?. arXiv18 July 2022, preprint: not peer reviewedhttps://arxiv.org/abs/2207.08815.
32. Xu QS, Liang YZ Monte Carlo cross validation. Chemom Intell Lab Syst. 2001; 56:1–11. 10.1016/S0169-7439(00)00122-2. [DOI] [Google Scholar]
33. Walsh I, Fishman D, Garcia-Gasulla D et al. DOME: recommendations for supervised machine learning validation in biology. Nat Methods. 2021; 18:1122–7. 10.1038/s41592-021-01205-4. [DOI] [PubMed] [Google Scholar]
34. D’Halluin A, Polgar P, Kipkorir T et al. Premature termination of transcription is shaped by Rho and translated uORFS in Mycobacterium tuberculosis. iScience. 2023; 26:106465. 10.1016/j.isci.2023.106465. [DOI] [PMC free article] [PubMed] [Google Scholar]
35. Banerjee S, Chalissery J, Bandey I et al. Rho-dependent transcription termination: more questions than answers. J Microbiol. 2006; 44:11–22. [PMC free article] [PubMed] [Google Scholar]
36. Bar A, Argaman L, Eldar M et al. TRS: a method for determining transcript termini from RNAtag-seq sequencing data. Nat Commun. 2023; 14:7843. 10.1038/s41467-023-43534-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
37. Adams PP, Baniulyte G, Esnault C et al. Regulatory roles of Escherichia coli 5′ UTR and ORF-internal RNAs detected by 3′ end mapping. eLife. 2021; 10:e62438. 10.7554/ELIFE.62438. [DOI] [PMC free article] [PubMed] [Google Scholar]
38. Hwang S, Lee N, Choe D et al. Elucidating the regulatory elements for transcription termination and posttranscriptional processing in the Streptomyces clavuligerus genome. mSystems. 2021; 6:e01013-20. 10.1128/MSYSTEMS.01013-20. [DOI] [PMC free article] [PubMed] [Google Scholar]
39. Peters JM, Vangeloff AD, Landick R Bacterial transcription terminators: the RNA 3′-end chronicles. J Mol Biol. 2011; 412:793–813. 10.1016/j.jmb.2011.03.036. [DOI] [PMC free article] [PubMed] [Google Scholar]
40. Friedel M, Nikolajewa S, Sühnel J et al. DiProDB: a database for dinucleotide properties. Nucleic Acids Res. 2009; 37:D37–40. 10.1093/nar/gkn597. [DOI] [PMC free article] [PubMed] [Google Scholar]
41. Xia T, SantaLucia J Jr, Burkard ME et al. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson–Crick base pairs. Biochemistry. 1998; 37:14719–35. 10.1021/bi9809425. [DOI] [PubMed] [Google Scholar]
42. Gupta S, Padmashali N, Pal D INTERPIN: a repository for intrinsic transcription termination hairpins in bacteria. Biochimie. 2023; 214(Pt B):228–36. 10.1016/j.biochi.2023.07.018. [DOI] [PubMed] [Google Scholar]
43. Demšar J Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006; 7:1–30. [Google Scholar]
44. García S, Fernández A, Luengo J et al. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inform Sci. 2010; 180:2044–64. 10.1016/j.ins.2009.12.010. [DOI] [Google Scholar]
45. Pohlert T PMCMRplus: calculate pairwise multiple comparisons of mean rank sums extended. 2024; R package version 1.9.12https://cran.r-project.org/package=PMCMRplus.
46. Calvo B, Santafe G scmamp: statistical comparison of multiple algorithms in multiple problems. The R Journal. 2016; 8:248–56. [Google Scholar]
47. LeBlanc H, Lang AS, Beatty JT Transcript cleavage, attenuation, and an internal promoter in the Rhodobacter capsulatus puc operon. J Bacteriol. 1999; 181:4955–60. 10.1128/JB.181.16.4955-4960.1999. [DOI] [PMC free article] [PubMed] [Google Scholar]
48. Grüll MP, Peña-Castillo L, Mulligan ME et al. Genome-wide identification and characterization of small RNAs in Rhodobacter capsulatus and identification of small RNAs affected by loss of the response regulator CtrA. RNA Biol. 2017; 14:914–25. 10.1080/15476286.2017.1306175. [DOI] [PMC free article] [PubMed] [Google Scholar]
49. Gupta S, Pal D Clusters of hairpins induce intrinsic transcription termination in bacteria. Sci Rep. 2021; 11:16194. 10.1038/s41598-021-95435-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
50. Fan Y, Wang W, Zhu Q iterb-PPse: identification of transcriptional terminators in bacterial by incorporating nucleotide properties into PseKNC. PloS One. 2020; 15:e0228479. 10.1371/journal.pone.0228479. [DOI] [PMC free article] [PubMed] [Google Scholar]
51. Nadiras C, Eveno E, Schwartz A et al. A multivariate prediction model for Rho-dependent termination of transcription. Nucleic Acids Res. 2018; 46:8245–60. 10.1093/nar/gky563. [DOI] [PMC free article] [PubMed] [Google Scholar]
52. Millman A, Dar D, Shamir M et al. Computational prediction of regulatory, premature transcription termination in bacteria. Nucleic Acids Res. 2017; 45:886–93. 10.1093/nar/gkw749. [DOI] [PMC free article] [PubMed] [Google Scholar]
53. Gardner PP, Barquist L, Bateman A et al. RNIE: genome-wide prediction of bacterial intrinsic terminators. Nucleic Acids Res. 2011; 39:5845–52. 10.1093/nar/gkr168. [DOI] [PMC free article] [PubMed] [Google Scholar]
54. Dar D, Prasse D, Schmitz RA et al. Widespread formation of alternative 3′ UTR isoforms via transcription termination in archaea. Nat Microbiol. 2016; 1:16143. 10.1038/nmicrobiol.2016.143. [DOI] [PubMed] [Google Scholar]
55. Takada H, Mandell ZF, Yakhnin H et al. Expression of Bacillus subtilis ABCF antibiotic resistance factor VmlR is regulated by RNA polymerase pausing, transcription attenuation, translation attenuation and (p)ppGpp. Nucleic Acids Res. 2022; 50:6174–89. 10.1093/NAR/GKAC497. [DOI] [PMC free article] [PubMed] [Google Scholar]
56. Mandell ZF, Oshiro RT, Yakhnin AV et al. NusG is an intrinsic transcription termination factor that stimulates motility and coordinates gene expression with NusA. eLife. 2021; 10:e61880. 10.7554/ELIFE.61880. [DOI] [PMC free article] [PubMed] [Google Scholar]
57. Lalanne JB, Taggart JC, Guo MS et al. Evolutionary convergence of pathway-specific enzyme expression stoichiometry. Cell. 2018; 173:749–61. 10.1016/J.CELL.2018.03.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
58. Johnson GE, Lalanne JB, Peters ML et al. Functionally uncoupled transcription-translation in Bacillus subtilis. Nature. 2020; 585:124–8. 10.1038/S41586-020-2638-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
59. Hoon MJLD, Makita Y, Nakai K et al. Prediction of transcriptional terminators in Bacillus subtilis and related species. PLoS Comput Biol. 2005; 1:e25. 10.1371/JOURNAL.PCBI.0010025. [DOI] [PMC free article] [PubMed] [Google Scholar]
60. Fuchs M, Lamm-Schmidt V, Sulzer J et al. An RNA-centric global view of Clostridioides difficile reveals broad activity of Hfq in a clinically important gram-positive bacterium. Proc Natl Acad Sci USA. 2021; 118:e2103579118. 10.1073/PNAS.2103579118/-/DCSUPPLEMENTAL. [DOI] [PMC free article] [PubMed] [Google Scholar]
61. Forquet R, Jiang X, Nasser W et al. Mapping the complex transcriptional landscape of the phytopathogenic bacterium Dickeya dadantii. mBio. 2022; 13:e0052422. 10.1128/MBIO.00524-22. [DOI] [PMC free article] [PubMed] [Google Scholar]
62. Dar D, Sorek R High-resolution RNA 3’-ends mapping of bacterial Rho-dependent transcripts. Nucleic Acids Res. 2018; 46:6797–805. 10.1093/NAR/GKY274. [DOI] [PMC free article] [PubMed] [Google Scholar]
63. Santos-Zavaleta A, Salgado H, Gama-Castro S et al. RegulonDB v 10.5: tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K-12. Nucleic Acids Res. 2019; 47:D212–20. 10.1093/NAR/GKY1077. [DOI] [PMC free article] [PubMed] [Google Scholar]
64. Choe D, Kim K, Kang M et al. Synthetic 3′-UTR valves for optimal metabolic flux control in Escherichia coli. Nucleic Acids Res. 2022; 50:4171–86. 10.1093/NAR/GKAC206. [DOI] [PMC free article] [PubMed] [Google Scholar]
65. Thomason MK, Voichek M, Dar D et al. A rhlI 5′ UTR-derived sRNA regulates RhlR-dependent quorum sensing in Pseudomonas aeruginosa. mBio. 2019; 10:e02253-19. 10.1128/MBIO.02253-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
66. Mediati DG, Wong JL, Gao W et al. RNase III-CLASH of multi-drug resistant Staphylococcus aureus reveals a regulatory mRNA 3′UTR required for intermediate vancomycin resistance. Nat Commun. 2022; 13:3558. 10.1038/S41467-022-31177-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
67. Bastet L, Bustos-Sanmamed P, Catalan-Moreno A et al. Regulation of heterogenous lexA expression in Staphylococcus aureus by an antisense RNA originating from transcriptional read-through upon natural mispairings in the sbrB intrinsic terminator. Int J Mol Sci. 2022; 23:576. 10.3390/IJMS23010576/S1. [DOI] [PMC free article] [PubMed] [Google Scholar]
68. Slager J, Aprianto R, Veening JW Deep genome annotation of the opportunistic human pathogen Streptococcus pneumoniae D39. Nucleic Acids Res. 2018; 46:9971–89. 10.1093/NAR/GKY725. [DOI] [PMC free article] [PubMed] [Google Scholar]
69. Warrier I, Ram-Mohan N, Zhu Z et al. The transcriptional landscape of Streptococcus pneumoniae TIGR4 reveals a complex operon architecture and abundant riboregulation critical for growth and virulence. PLoS Pathog. 2018; 14:e1007461. 10.1371/JOURNAL.PPAT.1007461. [DOI] [PMC free article] [PubMed] [Google Scholar]
70. Lee Y, Lee N, Hwang S et al. Genome-scale analysis of genetic regulatory elements in Streptomyces avermitilis MA-4680 using transcript boundary information. BMC Genomics. 2022; 23:68. 10.1186/S12864-022-08314-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
71. Lee Y, Lee N, Hwang S et al. Genome-scale determination of 5′ and 3′ boundaries of RNA transcripts in Streptomyces genomes. Sci Data. 2020; 7:436. 10.1038/S41597-020-00775-W. [DOI] [PMC free article] [PubMed] [Google Scholar]
72. Hwang S, Lee N, Choe D et al. System-level analysis of transcriptional and translational regulatory elements in Streptomyces griseus. Front Bioeng Biotechnol. 2022; 10:844200. 10.3389/FBIOE.2022.844200/FULL. [DOI] [PMC free article] [PubMed] [Google Scholar]
73. Lee Y, Lee N, Jeong Y et al. The transcription unit architecture of Streptomyces lividans TK24. Front Microbiol. 2019; 10:2074. 10.3389/FMICB.2019.02074/FULL. [DOI] [PMC free article] [PubMed] [Google Scholar]
74. kadi MA, Ishii E, Truong DT et al. Direct RNA sequencing unfolds the complex transcriptome of Vibrio parahaemolyticus. mSystems. 2021; 6:e0099621. 10.1128/MSYSTEMS.00996-21. [DOI] [PMC free article] [PubMed] [Google Scholar]
75. Vera JM, Ghosh IN, Zhang Y et al. Genome-scale transcription-translation mapping reveals features of Zymomonas mobilis transcription units and promoters. mSystems. 2020; 5:e00250-20. 10.1128/MSYSTEMS.00250-20. [DOI] [PMC free article] [PubMed] [Google Scholar]
76. Rosinski-Chupin I, Sauvage E, Sismeiro O et al. Single nucleotide resolution RNA-seq uncovers new regulatory mechanisms in the opportunistic pathogen Streptococcus agalactiae. BMC Genomics. 2015; 16:419. 10.1186/S12864-015-1583-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
77. Cho SH, Jeong Y, Hong SJ et al. Different regulatory modes of Synechocystis sp. PCC 6803 in response to photosynthesis inhibitory conditions. mSystems. 2021; 6:e0094321. 10.1128/MSYSTEMS.00943-21. [DOI] [PMC free article] [PubMed] [Google Scholar]
78. Jeong Y, Hong SJ, Cho SH et al. Multi-omic analyses reveal habitat adaptation of marine cyanobacterium Synechocystis sp. PCC 7338. Front Microbiol. 2021; 12:667450. 10.3389/FMICB.2021.667450/FULL. [DOI] [PMC free article] [PubMed] [Google Scholar]
79. Berkemer SJ, Maier LK, Amman F et al. Identification of RNA 3′ ends and termination sites in Haloferax volcanii. RNA Biol. 2020; 17:663–76. 10.1080/15476286.2020.1723328. [DOI] [PMC free article] [PubMed] [Google Scholar]
80. Li J, Yue L, Li Z et al. aCPSF1 cooperates with terminator U-tract to dictate archaeal transcription termination efficacy. eLife. 2021; 10:e70464. 10.7554/ELIFE.70464. [DOI] [PMC free article] [PubMed] [Google Scholar]
81. Majdandzic A, Rajesh C, Koo PK Correcting gradient-based interpretations of deep neural networks for genomics. Genome Biol. 2023; 24:109. 10.1186/s13059-023-02956-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
82. Robinson JT, Thorvaldsdóttir H, Winckler W et al. Integrative genomics viewer. Nat Biotechnol. 2011; 29:24–6. 10.1038/nbt.1754. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

lqaf016_Supplemental_Files

lqaf016_supplemental_files.zip^{(2.1MB, zip)}

Data Availability Statement

Software and data are available at https://github.com/BioinformaticsLabAtMUN/BacTermFinder and https://doi.org/10.5281/zenodo.14498242.

[B1] 1. Ray-Soni A, Bellecourt MJ, Landick R Mechanisms of bacterial transcription termination: all good things must end. Annu Rev Biochem. 2016; 85:319–47. 10.1146/ANNUREV-BIOCHEM-060815-014844. [DOI] [PubMed] [Google Scholar]

[B2] 2. Santangelo TJ, Artsimovitch I Termination and antitermination: RNA polymerase runs a stop sign. Nat Rev Microbiol. 2011; 9:319–29. 10.1038/nrmicro2560. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B3] 3. Mitra A, Angamuthu K, Jayashree HV et al. Occurrence, divergence and evolution of intrinsic terminators across eubacteria. Genomics. 2009; 94:110–6. 10.1016/j.ygeno.2009.04.004. [DOI] [PubMed] [Google Scholar]

[B4] 4. Song E, Uhm H, Munasingha PR et al. Rho-dependent transcription termination proceeds via three routes. Nat Commun. 2022; 13:1663. 10.1038/s41467-022-29321-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B5] 5. Kingsford CL, Ayanbule K, Salzberg SL Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake. Genome Biol. 2007; 8:R22. 10.1186/GB-2007-8-2-R22/FIGURES/5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B6] 6. Dar D, Shamir M, Mellin JR et al. Term-seq reveals abundant ribo-regulation of antibiotics resistance in bacteria. Science. 2016; 352:aad9822. 10.1126/science.aad9822. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B7] 7. Ju X, Li D, Liu S Full-length RNA profiling reveals pervasive bidirectional transcription terminators in bacteria. Nat Microbiol. 2019; 4:1907–1918. 10.1038/S41564-019-0500-Z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B8] 8. Yan B, Boitano M, Clark TA et al. SMRT-Cappable-seq reveals complex operon variants in bacteria. Nat Commun. 2018; 9:3676. 10.1038/S41467-018-05997-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B9] 9. Lalanne JB, Taggart JC, Guo MS et al. Evolutionary convergence of pathway-specific enzyme expression stoichiometry. Cell. 2018; 173:749–61. 10.1016/j.cell.2018.03.007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B10] 10. Shishkin AA, Giannoukos G, Kucukural A et al. Simultaneous generation of many RNA-seq libraries in a single reaction. Nat Methods. 2015; 12:323. 10.1038/NMETH.3313. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B11] 11. Sharma CM, Vogel J Differential RNA-seq: the approach behind and the biological insight gained. Curr Opin Microbiol. 2014; 19:97–105. 10.1016/J.MIB.2014.06.010. [DOI] [PubMed] [Google Scholar]

[B12] 12. Brandenburg VB, Narberhaus F, Mosig A Inverse folding based pre-training for the reliable identification of intrinsic transcription terminators. PLoS Comput Biol. 2022; 18:e1010240. 10.1371/JOURNAL.PCBI.1010240. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B13] 13. Feng CQ, Zhang ZY, Zhu XJ et al. iTerm-PseKNC: a sequence-based tool for predicting bacterial transcriptional terminators. Bioinformatics. 2019; 35:1469–77. 10.1093/bioinformatics/bty827. [DOI] [PubMed] [Google Scholar]

[B14] 14. Di Salvo M, Puccio S, Peano C et al. RhoTermPredict: an algorithm for predicting Rho-dependent transcription terminators based on Escherichia coli, Bacillus subtilis and Salmonella enterica databases. BMC Bioinformatics. 2019; 20:117. 10.1186/s12859-019-2704-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B15] 15. NCBI National Center for Biotechnology Information (NCBI) PubMed 1988–2023. https://pubmed.ncbi.nlm.nih.gov/10 November 2023, date last accessed.

[B16] 16. NCBI National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) 1999–2023. https://www.ncbi.nlm.nih.gov/geo/10 November 2023, date last accessed.

[B17] 17. Kluyver T, Ragan-Kelley B, Pérez F et al.. Loizides F, Schmidt B Jupyter Notebooks—a publishing format for reproducible computational workflows. Positioning and Power in Academic Publishing: Players, Agents and Agendas. 2016; Amsterdam: IOS Press; 87–90. [Google Scholar]

[B18] 18. The pandas development team pandas-dev/pandas: Pandas. 2020; Zenodo; 10.5281/zenodo.3509134. [DOI] [Google Scholar]

[B19] 19. Quinlan AR, Hall IM BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010; 26:841–2. 10.1093/BIOINFORMATICS/BTQ033. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B20] 20. Lesnik EA, Sampath R, Levene HB et al. Prediction of rho-independent transcriptional terminators in Escherichia coli. Nucleic Acids Res. 2001; 29:3583–94. 10.1093/nar/29.17.3583. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B21] 21. Gama-Castro S, Salgado H, Santos-Zavaleta A et al. RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Res. 2016; 44:D133–43. 10.1093/NAR/GKV1156. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B22] 22. Ishii T, Yoshida KI, Terai G et al. DBTBS: a database of Bacillus subtilis promoters and transcription factors. Nucleic Acids Res. 2001; 29:278–80. 10.1093/NAR/29.1.278. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B23] 23. Geissler AS, Anthon C, Alkan F et al. BSGatlas: a unified Bacillus subtilis genome and transcriptome annotation atlas with enhanced information access. Microb Genom. 2021; 7:524. 10.1099/MGEN.0.000524. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B24] 24. Chevez-Guardado R, Peña-Castillo L Promotech: a general tool for bacterial promoter recognition. Genome Biol. 2021; 22:318. 10.1186/S13059-021-02514-9/TABLES/10. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B25] 25. Bonidia RP, Domingues DS, Sanches DS et al. MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors. Brief Bioinform. 2022; 23:bbab434. 10.1093/BIB/BBAB434. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B26] 26. Chen Z, Zhao P, Li C et al. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res. 2021; 49:e60. 10.1093/NAR/GKAB122. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B27] 27. Liu B, Liu F, Fang L et al. repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics. 2015; 31:1307–9. 10.1093/BIOINFORMATICS/BTU820. [DOI] [PubMed] [Google Scholar]

[B28] 28. Lundberg SM, Erion G, Chen H et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell. 2020; 2:2522–5839. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B29] 29. Ke G, Meng Q, Finley T et al.. Guyon I, Luxburg UV, Bengio S et al. LightGBM: a highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems. 2017; 30:New York, USA: Curran Associates, Inc. [Google Scholar]

[B30] 30. LeCun Y, Bengio Y, Hinton G Deep Learning. Nature. 2015; 521:436–44. 10.1038/nature14539. [DOI] [PubMed] [Google Scholar]

[B31] 31. Grinsztajn L, Oyallon E, Varoquaux G Why do tree-based models still outperform deep learning on typical tabular data?. arXiv18 July 2022, preprint: not peer reviewedhttps://arxiv.org/abs/2207.08815.

[B32] 32. Xu QS, Liang YZ Monte Carlo cross validation. Chemom Intell Lab Syst. 2001; 56:1–11. 10.1016/S0169-7439(00)00122-2. [DOI] [Google Scholar]

[B33] 33. Walsh I, Fishman D, Garcia-Gasulla D et al. DOME: recommendations for supervised machine learning validation in biology. Nat Methods. 2021; 18:1122–7. 10.1038/s41592-021-01205-4. [DOI] [PubMed] [Google Scholar]

[B34] 34. D’Halluin A, Polgar P, Kipkorir T et al. Premature termination of transcription is shaped by Rho and translated uORFS in Mycobacterium tuberculosis. iScience. 2023; 26:106465. 10.1016/j.isci.2023.106465. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B35] 35. Banerjee S, Chalissery J, Bandey I et al. Rho-dependent transcription termination: more questions than answers. J Microbiol. 2006; 44:11–22. [PMC free article] [PubMed] [Google Scholar]

[B36] 36. Bar A, Argaman L, Eldar M et al. TRS: a method for determining transcript termini from RNAtag-seq sequencing data. Nat Commun. 2023; 14:7843. 10.1038/s41467-023-43534-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B37] 37. Adams PP, Baniulyte G, Esnault C et al. Regulatory roles of Escherichia coli 5′ UTR and ORF-internal RNAs detected by 3′ end mapping. eLife. 2021; 10:e62438. 10.7554/ELIFE.62438. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B38] 38. Hwang S, Lee N, Choe D et al. Elucidating the regulatory elements for transcription termination and posttranscriptional processing in the Streptomyces clavuligerus genome. mSystems. 2021; 6:e01013-20. 10.1128/MSYSTEMS.01013-20. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B39] 39. Peters JM, Vangeloff AD, Landick R Bacterial transcription terminators: the RNA 3′-end chronicles. J Mol Biol. 2011; 412:793–813. 10.1016/j.jmb.2011.03.036. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B40] 40. Friedel M, Nikolajewa S, Sühnel J et al. DiProDB: a database for dinucleotide properties. Nucleic Acids Res. 2009; 37:D37–40. 10.1093/nar/gkn597. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B41] 41. Xia T, SantaLucia J Jr, Burkard ME et al. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson–Crick base pairs. Biochemistry. 1998; 37:14719–35. 10.1021/bi9809425. [DOI] [PubMed] [Google Scholar]

[B42] 42. Gupta S, Padmashali N, Pal D INTERPIN: a repository for intrinsic transcription termination hairpins in bacteria. Biochimie. 2023; 214(Pt B):228–36. 10.1016/j.biochi.2023.07.018. [DOI] [PubMed] [Google Scholar]

[B43] 43. Demšar J Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006; 7:1–30. [Google Scholar]

[B44] 44. García S, Fernández A, Luengo J et al. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inform Sci. 2010; 180:2044–64. 10.1016/j.ins.2009.12.010. [DOI] [Google Scholar]

[B45] 45. Pohlert T PMCMRplus: calculate pairwise multiple comparisons of mean rank sums extended. 2024; R package version 1.9.12https://cran.r-project.org/package=PMCMRplus.

[B46] 46. Calvo B, Santafe G scmamp: statistical comparison of multiple algorithms in multiple problems. The R Journal. 2016; 8:248–56. [Google Scholar]

[B47] 47. LeBlanc H, Lang AS, Beatty JT Transcript cleavage, attenuation, and an internal promoter in the Rhodobacter capsulatus puc operon. J Bacteriol. 1999; 181:4955–60. 10.1128/JB.181.16.4955-4960.1999. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B48] 48. Grüll MP, Peña-Castillo L, Mulligan ME et al. Genome-wide identification and characterization of small RNAs in Rhodobacter capsulatus and identification of small RNAs affected by loss of the response regulator CtrA. RNA Biol. 2017; 14:914–25. 10.1080/15476286.2017.1306175. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B49] 49. Gupta S, Pal D Clusters of hairpins induce intrinsic transcription termination in bacteria. Sci Rep. 2021; 11:16194. 10.1038/s41598-021-95435-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B50] 50. Fan Y, Wang W, Zhu Q iterb-PPse: identification of transcriptional terminators in bacterial by incorporating nucleotide properties into PseKNC. PloS One. 2020; 15:e0228479. 10.1371/journal.pone.0228479. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B51] 51. Nadiras C, Eveno E, Schwartz A et al. A multivariate prediction model for Rho-dependent termination of transcription. Nucleic Acids Res. 2018; 46:8245–60. 10.1093/nar/gky563. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B52] 52. Millman A, Dar D, Shamir M et al. Computational prediction of regulatory, premature transcription termination in bacteria. Nucleic Acids Res. 2017; 45:886–93. 10.1093/nar/gkw749. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B53] 53. Gardner PP, Barquist L, Bateman A et al. RNIE: genome-wide prediction of bacterial intrinsic terminators. Nucleic Acids Res. 2011; 39:5845–52. 10.1093/nar/gkr168. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B54] 54. Dar D, Prasse D, Schmitz RA et al. Widespread formation of alternative 3′ UTR isoforms via transcription termination in archaea. Nat Microbiol. 2016; 1:16143. 10.1038/nmicrobiol.2016.143. [DOI] [PubMed] [Google Scholar]

[B55] 55. Takada H, Mandell ZF, Yakhnin H et al. Expression of Bacillus subtilis ABCF antibiotic resistance factor VmlR is regulated by RNA polymerase pausing, transcription attenuation, translation attenuation and (p)ppGpp. Nucleic Acids Res. 2022; 50:6174–89. 10.1093/NAR/GKAC497. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B56] 56. Mandell ZF, Oshiro RT, Yakhnin AV et al. NusG is an intrinsic transcription termination factor that stimulates motility and coordinates gene expression with NusA. eLife. 2021; 10:e61880. 10.7554/ELIFE.61880. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B57] 57. Lalanne JB, Taggart JC, Guo MS et al. Evolutionary convergence of pathway-specific enzyme expression stoichiometry. Cell. 2018; 173:749–61. 10.1016/J.CELL.2018.03.007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B58] 58. Johnson GE, Lalanne JB, Peters ML et al. Functionally uncoupled transcription-translation in Bacillus subtilis. Nature. 2020; 585:124–8. 10.1038/S41586-020-2638-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B59] 59. Hoon MJLD, Makita Y, Nakai K et al. Prediction of transcriptional terminators in Bacillus subtilis and related species. PLoS Comput Biol. 2005; 1:e25. 10.1371/JOURNAL.PCBI.0010025. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B60] 60. Fuchs M, Lamm-Schmidt V, Sulzer J et al. An RNA-centric global view of Clostridioides difficile reveals broad activity of Hfq in a clinically important gram-positive bacterium. Proc Natl Acad Sci USA. 2021; 118:e2103579118. 10.1073/PNAS.2103579118/-/DCSUPPLEMENTAL. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B61] 61. Forquet R, Jiang X, Nasser W et al. Mapping the complex transcriptional landscape of the phytopathogenic bacterium Dickeya dadantii. mBio. 2022; 13:e0052422. 10.1128/MBIO.00524-22. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B62] 62. Dar D, Sorek R High-resolution RNA 3’-ends mapping of bacterial Rho-dependent transcripts. Nucleic Acids Res. 2018; 46:6797–805. 10.1093/NAR/GKY274. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B63] 63. Santos-Zavaleta A, Salgado H, Gama-Castro S et al. RegulonDB v 10.5: tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K-12. Nucleic Acids Res. 2019; 47:D212–20. 10.1093/NAR/GKY1077. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B64] 64. Choe D, Kim K, Kang M et al. Synthetic 3′-UTR valves for optimal metabolic flux control in Escherichia coli. Nucleic Acids Res. 2022; 50:4171–86. 10.1093/NAR/GKAC206. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B65] 65. Thomason MK, Voichek M, Dar D et al. A rhlI 5′ UTR-derived sRNA regulates RhlR-dependent quorum sensing in Pseudomonas aeruginosa. mBio. 2019; 10:e02253-19. 10.1128/MBIO.02253-19. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B66] 66. Mediati DG, Wong JL, Gao W et al. RNase III-CLASH of multi-drug resistant Staphylococcus aureus reveals a regulatory mRNA 3′UTR required for intermediate vancomycin resistance. Nat Commun. 2022; 13:3558. 10.1038/S41467-022-31177-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B67] 67. Bastet L, Bustos-Sanmamed P, Catalan-Moreno A et al. Regulation of heterogenous lexA expression in Staphylococcus aureus by an antisense RNA originating from transcriptional read-through upon natural mispairings in the sbrB intrinsic terminator. Int J Mol Sci. 2022; 23:576. 10.3390/IJMS23010576/S1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B68] 68. Slager J, Aprianto R, Veening JW Deep genome annotation of the opportunistic human pathogen Streptococcus pneumoniae D39. Nucleic Acids Res. 2018; 46:9971–89. 10.1093/NAR/GKY725. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B69] 69. Warrier I, Ram-Mohan N, Zhu Z et al. The transcriptional landscape of Streptococcus pneumoniae TIGR4 reveals a complex operon architecture and abundant riboregulation critical for growth and virulence. PLoS Pathog. 2018; 14:e1007461. 10.1371/JOURNAL.PPAT.1007461. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B70] 70. Lee Y, Lee N, Hwang S et al. Genome-scale analysis of genetic regulatory elements in Streptomyces avermitilis MA-4680 using transcript boundary information. BMC Genomics. 2022; 23:68. 10.1186/S12864-022-08314-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B71] 71. Lee Y, Lee N, Hwang S et al. Genome-scale determination of 5′ and 3′ boundaries of RNA transcripts in Streptomyces genomes. Sci Data. 2020; 7:436. 10.1038/S41597-020-00775-W. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B72] 72. Hwang S, Lee N, Choe D et al. System-level analysis of transcriptional and translational regulatory elements in Streptomyces griseus. Front Bioeng Biotechnol. 2022; 10:844200. 10.3389/FBIOE.2022.844200/FULL. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B73] 73. Lee Y, Lee N, Jeong Y et al. The transcription unit architecture of Streptomyces lividans TK24. Front Microbiol. 2019; 10:2074. 10.3389/FMICB.2019.02074/FULL. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B74] 74. kadi MA, Ishii E, Truong DT et al. Direct RNA sequencing unfolds the complex transcriptome of Vibrio parahaemolyticus. mSystems. 2021; 6:e0099621. 10.1128/MSYSTEMS.00996-21. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B75] 75. Vera JM, Ghosh IN, Zhang Y et al. Genome-scale transcription-translation mapping reveals features of Zymomonas mobilis transcription units and promoters. mSystems. 2020; 5:e00250-20. 10.1128/MSYSTEMS.00250-20. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B76] 76. Rosinski-Chupin I, Sauvage E, Sismeiro O et al. Single nucleotide resolution RNA-seq uncovers new regulatory mechanisms in the opportunistic pathogen Streptococcus agalactiae. BMC Genomics. 2015; 16:419. 10.1186/S12864-015-1583-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B77] 77. Cho SH, Jeong Y, Hong SJ et al. Different regulatory modes of Synechocystis sp. PCC 6803 in response to photosynthesis inhibitory conditions. mSystems. 2021; 6:e0094321. 10.1128/MSYSTEMS.00943-21. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B78] 78. Jeong Y, Hong SJ, Cho SH et al. Multi-omic analyses reveal habitat adaptation of marine cyanobacterium Synechocystis sp. PCC 7338. Front Microbiol. 2021; 12:667450. 10.3389/FMICB.2021.667450/FULL. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B79] 79. Berkemer SJ, Maier LK, Amman F et al. Identification of RNA 3′ ends and termination sites in Haloferax volcanii. RNA Biol. 2020; 17:663–76. 10.1080/15476286.2020.1723328. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B80] 80. Li J, Yue L, Li Z et al. aCPSF1 cooperates with terminator U-tract to dictate archaeal transcription termination efficacy. eLife. 2021; 10:e70464. 10.7554/ELIFE.70464. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B81] 81. Majdandzic A, Rajesh C, Koo PK Correcting gradient-based interpretations of deep neural networks for genomics. Genome Biol. 2023; 24:109. 10.1186/s13059-023-02956-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B82] 82. Robinson JT, Thorvaldsdóttir H, Winckler W et al. Integrative genomics viewer. Nat Biotechnol. 2011; 29:24–6. 10.1038/nbt.1754. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

BacTermFinder: a comprehensive and general bacterial terminator finder using a CNN ensemble

Seyed Mohammad Amin Taheri Ghahfarokhi

Lourdes Peña-Castillo

Abstract

Introduction

Table 1.

Materials and methods

Terminator sequences collection

Table 2.

Table 3.

Confirming the sequence length for terminator identification

Non-terminator sequences generation

Feature engineering and selection

Figure 1.

Machine learning approaches

Table 4.

Table 5.

Model generation and assessment

Table 6.

Results and discussion

Clearly visible pattern in ROI in BacTermData

Figure 2.

Selecting a best performing model: BacTermFinder

Table 7.

Figure 3.

Figure 4.

Effect of GC content and phylum on terminator identification

Figure 5.

Assessing BacTermFinder’s predictions on independent validation data

Figure 6.

Figure 7.

Figure 8.

Figure 9.

Comparative assessment for genome-wide terminator prediction

Table 8.

Figure 10.

Table 9.

BacTermFinder can predict the location of terminators accurately

Figure 11.

Figure 12.

Low agreement between terminator prediction tools

Figure 13.

Figure 14.

Case study: the Rhodobacter capsulatus SB1003 puc operon

Figure 15.

Conclusions

Supplementary Material

Acknowledgements

Contributor Information

Supplementary data

Conflict of interest

Funding

Data availability

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases