Molecular Therapy. Nucleic Acids
. 2019 Aug 14;18:80–87. doi: 10.1016/j.omtn.2019.08.008

iPromoter-2L2.0: Identifying Promoters and Their Types by Combining Smoothing Cutting Window Algorithm and Sequence-Based Features

Bin Liu 1,2, Kai Li 3
PMCID: PMC6796744  PMID: 31536883

Abstract

Promoters are short regions at specific locations of DNA sequences that play key roles in directing gene transcription. They can be grouped into six types (σ24, σ28, σ32, σ38, σ54, and σ70). Recently, a predictor called “iPromoter-2L” was constructed to predict promoters and their six types, the first approach able to predict all six types of promoters. However, its predictive quality still needs to be further improved for real-world applications. In this study, we proposed the smoothing cutting window algorithm to find the window fragments of DNA sequences based on conservation scores so as to capture the sequence patterns of promoters. For each window fragment, discriminative features were extracted by using kmer and PseKNC. Combined with support vector machines (SVMs), different predictors were constructed and then clustered into several groups based on their distances. Finally, a new predictor called iPromoter-2L2.0 was constructed to identify promoters and their six types; it was developed by ensemble learning based on the key predictors selected from the cluster groups. The results showed that iPromoter-2L2.0 outperformed other existing methods for both promoter prediction and identification of the six types, indicating that iPromoter-2L2.0 will be helpful for genomics analysis.

Keywords: promoter, smoothing cutting window algorithm, ensemble learning

Introduction

A promoter is a DNA fragment at a specific location that can be recognized and bound by RNA polymerase to initiate transcription. In bacteria, the RNA polymerase contains five subunits (2α, β, β′, ω) and an extra σ factor.1, 2 The σ factors can be labeled as σ24, σ28, σ32, σ38, σ54, and σ70 according to their molecular weights. Different σ factors direct the RNA polymerase to bind different promoter regions, which affects the consequent activation of genes. σ24 and σ32 participate in the heat-shock response, σ28 participates in flagellar gene expression during normal growth, σ54 participates in nitrogen metabolism, and σ70, called the primary σ factor, is in charge of the transcription of most genes in growing cells.2, 3, 4

Because wet experiments for identifying the types of promoters are expensive, several predictors have been developed to identify promoters based on DNA sequence information; for example, iPro54-PseKNC5 based on the PseKNC6 was constructed to identify promoters. A position-correlation scoring function (PCSF)7 and a Bayes profile8 were proposed to identify promoters. By combining the variable window technique with the regular Z-curve method,9, 10, 11 the “variable-window Z-curve” was proposed to detect promoters. These methods were discussed in a recent study.12

Recently, iPromoter-2L12 was proposed, the first predictor able to predict promoters and their aforementioned six different types. This predictor employed the multi-window-based PseKNC approach to capture the sequence patterns of promoters. However, it is extremely hard for this predictor to find the optimized sequence windows by using the flexible-sliding-window approach to extract the discriminative features, preventing further performance improvement. To overcome these shortcomings, in this study we proposed the smoothing cutting window (SCW) algorithm to divide the DNA sequences into fragment windows based on the conservation scores, and we ensembled different predictors based on various sequence-based features to further improve the predictive performance.

Results and Discussion

Comparison with Other Existing Methods

Table 1 shows the results (Equation 24) generated by iPromoter-2L2.0 via 5-fold cross-validation on the benchmark dataset. The corresponding rates obtained by the existing methods are also given in Table 1. For the second-layer prediction, only iPromoter-2L and iPromoter-2L2.0 are able to predict the promoter types.

Table 1.

A Comparison of iPromoter-2L2.0 with Other Predictors for Identifying Promoters (the First Layer) and Their Types (the Second Layer) via the 5-fold Cross-Validation on the Same Benchmark Dataset

Method Acc (%) MCC Sn (%) Sp (%)
First Layer

PCSFa 74.81 0.4980 78.92 70.70
vw Z-curvea 80.28 0.6098 77.76 82.80
Stabilitya 78.04 0.5615 76.61 79.48
iPro54a 80.45 0.6100 77.76 83.15
iPromoter-2L1.0a 81.68 0.6343 79.20 84.16
iPromoter-2L2.0b 84.98 0.6998 84.13 85.84

Second Layer

iPromoter-2L1.0a
σ24 promoter 93.50 0.7338 72.52 96.93
σ28 promoter 96.82 0.5708 42.54 99.49
σ32 promoter 94.41 0.6524 52.58 99.14
σ38 promoter 94.69 0.2962 15.34 99.48
σ54 promoter 94.04 0.6459 53.19 99.57
σ70 promoter 80.66 0.6056 95.34 59.35
iPromoter-2L2.0b
σ24 promoter 94.62 0.8053 81.82 97.22
σ28 promoter 97.94 0.7561 71.64 99.23
σ32 promoter 95.38 0.7361 71.82 98.05
σ38 promoter 94.58 0.2242 7.36 99.85
σ54 promoter 98.11 0.6714 59.57 99.42
σ70 promoter 85.94 0.7109 95.22 72.47

See Equation 24. Acc, accuracy; Sn, sensitivity; Sp, specificity.

a. The results reported in Liu et al.12
b. The predictor proposed in this study.

From Table 1 we can see the following: (1) for the first-layer prediction, iPromoter-2L2.0 outperformed all the other methods in terms of all four performance measures (cf. Equation 24); (2) for the second-layer prediction, iPromoter-2L2.0 outperformed iPromoter-2L for the prediction of σ24 promoters, σ28 promoters, σ32 promoters, σ54 promoters, and σ70 promoters in terms of accuracy (Acc) and Matthews correlation coefficient (MCC), and its performance is comparable with that of iPromoter-2L for the prediction of σ38 promoters. The reason for the performance improvement over iPromoter-2L is that iPromoter-2L2.0 is based on the SCW algorithm, which is able to more accurately extract the sequence features that discriminate the promoters and their types.

It can be anticipated that the proposed SCW algorithm would have many potential applications, such as enhancer prediction, DNA replication origin prediction, etc.

Web Server and Its User Guide

We established a web server for iPromoter-2L2.0 to help readers use the proposed method by following the steps below.

  • Step 1. Click the hyperlink http://bliulab.net/iPromoter-2L2.0/ to access the homepage as shown in Figure 1. An introduction to the web server is given in the Read Me.

  • Step 2. Copy/paste or type the query DNA sequences into the input box at the center of Figure 1 or upload the data by the Browse button.

  • Step 3. Click on the Submit button—you will see the predicted results. If using the example sequences for the prediction, you will see the following results: (1) both the first and the second query sequences are non-promoters; (2) the third query sequence is a σ70 promoter.

  • Step 4. On the results page, the predicted results can be downloaded by clicking the Download button.

Figure 1.


A Screenshot of the Homepage of the Web Server for iPromoter-2L2.0

iPromoter-2L2.0 can be accessed at http://bliulab.net/iPromoter-2L2.0/.

Materials and Methods

Benchmark Dataset

To facilitate performance comparison, we employed the dataset S12 to construct the predictor and evaluate the various methods. The dataset can be formulated as12

S = S+ ∪ S−
S+ = S+(σ24) ∪ S+(σ28) ∪ S+(σ32) ∪ S+(σ38) ∪ S+(σ54) ∪ S+(σ70), (Equation 1)

where “∪” indicates the “union” in set theory; S+ indicates promoter samples; S− indicates non-promoter samples; and S+(σ24), S+(σ28), S+(σ32), S+(σ38), S+(σ54), and S+(σ70) indicate the six kinds of promoters. Specifically, the benchmark dataset S consists of 5,920 samples, half of which are promoters and the other half non-promoters. S+(σ24) contains 484 samples; S+(σ28) contains 134 samples; S+(σ32) contains 291 samples; S+(σ38) contains 163 samples; S+(σ54) contains 94 samples; and S+(σ70) contains 1,694 samples.

Sample Formulation

In this study, the DNA sequence samples were divided into several fragment windows by using the proposed SCW algorithm, and then for each fragment window, a sliding window approach was used to extract the sequence features by using kmer13 and PseKNC.6, 14, 15

SCW Algorithm

Previous studies showed that the distributions of conservation scores of promoters and non-promoters are obviously different.12 Here, we proposed the SCW algorithm to incorporate these sequence patterns into the predictor so as to improve the predictive performance.

A DNA sample is represented as

D = N1N2⋯Ni⋯N81, (Equation 2)

where Ni denotes the nucleotide at sequence position i, which can be one of the following four nucleotides, i.e.,

Ni ∈ {A (adenine), C (cytosine), G (guanine), T (thymine)}, (Equation 3)

where “∈” refers to “member of,” a symbol in set theory.

To reflect the conservation score distribution patterns along D, it was split into S + 1 fragments covering [1, τ1 − 1], [τ1, τ2 − 1], …, [τS, L] by the cutting points τj (j = 1, 2, …, S), where S is the total number of cutting points. The fragments can be represented as

ρ1 = N1N2⋯Nτ1−1
ρ2 = Nτ1Nτ1+1⋯Nτ2−1
⋮
ρS+1 = NτSNτS+1⋯NL. (Equation 4)

The cutting point τj is defined as follows:

τj =
  φ1, if φ1 > α and φ2 − φ1 > α
  φm, if 1 < m < Z and φm − φm−1 > α and φm+1 − φm > α
  φZ, if L − φZ > α and φZ − φZ−1 > α
  not a cutting point, otherwise, (Equation 5)

where α is a distance threshold, which was set as 8 in this study; φ is a candidate cutting point; and Z is the total number of candidate cutting points. For a given sequence position i, φ is defined as

φm =
  i, if SSDi < SSDi−1 and SSDi < SSDi+1 and 1 < i < L
  1, if SSDi < SSDi+1 and i = 1
  L, if SSDi < SSDi−1 and i = L
  i is not a candidate cutting point, otherwise, (Equation 6)

where SSDi represents the smooth standard deviation of the average conservation score (CS) of sequence position i, which can be calculated by

SSDi =
  (1/5) Σk=i−2…i+2 SDk, if 2 < i < L − 1
  (1/(i+2)) Σk=1…i+2 SDk, if i = 1, 2
  (1/(L−i+3)) Σk=i−2…L SDk, if i = L − 1, L, (Equation 7)

where k is the sequence position and SDk is the standard deviation of the average CS at the k-th sequence position, which can be calculated by

SDk = √((1/Y) Σy=1…Y (εky − μ)²), (Equation 8)

where Y represents the number of labels, which is equal to 2 for the first layer and 6 for the second layer; εky denotes the average CS of the y-th class samples at the k-th sequence position, which can be calculated by the approach introduced in Schneider and Stephens;16 and μ is the average CS over all labels at the k-th position.

The conservation profiles and the standard deviations of promoters and non-promoters are shown in Figure 2A, and the conservation profile and the standard deviation of each promoter type are shown in Figure 3A. The smoothed standard deviation curves are shown in Figures 2B and 3B. The DNA sequences were divided into several fragments by SCW, as shown in Figures 2C and 3C. The pseudo-code of the SCW algorithm is shown in Box 1.

Figure 2.


A Flowchart Shows the Steps of the Proposed Smoothing Cutting Window Algorithm for the First-Layer Prediction

The standard deviations shown in (A) are converted into the smooth standard deviations as shown in (B), based on which the DNA sequences are divided into several fragments, as shown in (C).

Figure 3.


A Flowchart Shows the Process of the Proposed Smoothing Cutting Window Algorithm for the Second-Layer Prediction

The SDs shown in (A) are converted into the smooth SDs as shown in (B), based on which the DNA sequences are divided into several fragments, as shown in (C).

Box 1. Algorithm: Smoothing Cutting Window.
  • Parameters: sequence length L, number of labels Y

  • Input: DNA sequence in Equation 2

  • Output: cutting points τ1,τ2,...τs

  • For y = 1 to Y do

  • For i = 1 to L do

  • Calculate conservation score εiy

  • End for

  • End for

  • For i = 1 to L do

  • Calculate SSDi by Equation 7

  • End for

  • Calculate cutting points τ1,τ2,...τs by Equations 5 and 6 and SSD

  • Return τ1,τ2,...τs

After the process shown in Box 1, each DNA sequence in S (cf. Equation 1) was divided into four fragments ([1, 28], [29, 44], [45, 56], [57, 81]), and each DNA sequence in S+ (cf. Equation 1) was divided into four fragments ([1, 17], [18, 41], [42, 56], [57, 81]). Then for each fragment, the sliding-window approach was used to extract the features.
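Box 1 can be turned into a short runnable sketch. The version below is a minimal NumPy rendering under our own assumptions: the per-class average conservation scores are taken as a ready-made (Y, L) array, and the function name and synthetic input are illustrative rather than the paper's code.

```python
import numpy as np

def scw_cutting_points(scores, alpha=8):
    """Sketch of the SCW algorithm (Box 1; Equations 5-8).

    scores: array of shape (Y, L) holding the per-class average
    conservation score at each of the L sequence positions.
    Returns the 1-based cutting points tau.
    """
    Y, L = scores.shape
    mu = scores.mean(axis=0)                         # average CS per position
    sd = np.sqrt(((scores - mu) ** 2).mean(axis=0))  # Equation 8

    # Equation 7: smooth SD with a centred 5-position window,
    # truncated at the two ends of the sequence.
    ssd = np.empty(L)
    for i in range(L):
        lo, hi = max(0, i - 2), min(L, i + 3)
        ssd[i] = sd[lo:hi].mean()

    # Equation 6: candidate cutting points are local minima of SSD.
    phi = [i + 1 for i in range(L)
           if (i == 0 or ssd[i] < ssd[i - 1])
           and (i == L - 1 or ssd[i] < ssd[i + 1])]

    # Equation 5: keep candidates farther than alpha from their
    # neighbouring candidates and from both sequence ends.
    tau = []
    for m, p in enumerate(phi):
        left = phi[m - 1] if m > 0 else 0
        right = phi[m + 1] if m < len(phi) - 1 else L
        if p - left > alpha and right - p > alpha:
            tau.append(p)
    return tau
```

On a synthetic two-class profile whose smoothed SD dips at two well-separated positions, the function returns exactly those positions as cutting points.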

A sliding window can be expressed by [ξ,δ], where ξ is the width of the window and δ is the step of sliding window. For each fragment obtained, the number of the segments produced by [ξ,δ] along the fragment sequence is given by12

η = INT[(|ρi| − ξ + δ)/δ], (Equation 9)

where “INT” is an “integer-cutting operator” (i.e., floor) and |ρi| denotes the length of the i-th fragment. For example, assuming |ρi| = 29, ξ = 6, and δ = 1 in Equation 9, we obtain η = 24; that is, the sliding window [6, 1] produces 24 DNA segments on a fragment of length 29.
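Equation 9 and the segment extraction can be sketched as follows; the function name is ours, chosen for illustration.

```python
def sliding_segments(fragment, xi, delta):
    """Slide a window of width xi with step delta along a fragment,
    yielding the eta = INT[(|rho| - xi + delta)/delta] segments
    of Equation 9."""
    eta = (len(fragment) - xi + delta) // delta   # integer-cutting (floor)
    return [fragment[i * delta : i * delta + xi] for i in range(eta)]
```

With a fragment of length 29 and the window [6, 1], this reproduces the worked example of 24 segments.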

kmer

kmer13 is a simple and effective method to extract the information in the DNA sequence. By using kmer, the DNA sequence fragment ρ (cf. Equation 4) can be represented as

ρ = [f1(kmer)  f2(kmer)  ⋯  fi(kmer)  ⋯  f4^k(kmer)]^T, (Equation 10)

where fi(kmer) (i = 1, 2, …, 4^k) is the frequency of the i-th k-mer (k neighboring nucleotides) in the fragment ρ, and T represents the transpose operator. For example, Equation 10 becomes the following 4-mer vector when k = 4:

ρ = [f(AAAA)  f(AAAC)  ⋯  f(TTTT)]^T = [f1(4mer)  f2(4mer)  ⋯  f256(4mer)]^T. (Equation 11)
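As a concrete rendering of Equation 10, a minimal k-mer frequency extractor (the function name is illustrative) enumerates all 4^k k-mers in lexicographic order and normalizes their counts:

```python
from itertools import product

def kmer_vector(fragment, k):
    """Frequency vector of Equation 10: the 4**k k-mer frequencies
    of the fragment, in lexicographic order (AA..A, AA..C, ...)."""
    counts = {''.join(p): 0 for p in product('ACGT', repeat=k)}
    total = len(fragment) - k + 1          # number of k-mers in the fragment
    for i in range(total):
        counts[fragment[i:i + k]] += 1
    return [counts[m] / total for m in sorted(counts)]
```

The resulting vector has 4^k components summing to 1.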

PseKNC

The PseKNC6 incorporates the short-range sequence information, the long-range sequence information, and the physicochemical properties of the dinucleotides,6 which can formulate the DNA sequence fragment ρ of Equation 4 as

ρ = [f1(PseKNC)  f2(PseKNC)  ⋯  f4^k(PseKNC)  f4^k+1(PseKNC)  ⋯  f4^k+λ(PseKNC)]^T. (Equation 12)

PseKNC6 has three parameters: k, λ (the number of sequence correlations considered17), and w (the weight factor). Each of the parameters has been clearly defined in the original paper6 and a comprehensive review.18

The kmer and PseKNC features can be easily generated by some existing tools, such as Pse-in-One19 and PseKNC-General.14
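For intuition, the structure of Equation 12 can be sketched as a Type-I PseKNC computation. Everything below is a simplified stand-in, not the published implementation: the dinucleotide property table `PROPS` uses one made-up value per dinucleotide, whereas the real PseKNC uses published physicochemical properties, and the function names are ours.

```python
from itertools import product

def _kmer_freq(seq, k):
    # normalised k-mer frequencies (first 4**k components of Equation 12)
    counts = {''.join(p): 0 for p in product('ACGT', repeat=k)}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    n = len(seq) - k + 1
    return [counts[m] / n for m in sorted(counts)]

def pseknc_vector(seq, k, lam, w, props):
    """Type-I PseKNC sketch: 4**k k-mer frequencies followed by
    lam long-range correlation factors, all sharing one denominator."""
    def corr(a, b):  # mean squared property difference of two dinucleotides
        pa, pb = props[a], props[b]
        return sum((x - y) ** 2 for x, y in zip(pa, pb)) / len(pa)

    L = len(seq)
    f = _kmer_freq(seq, k)
    thetas = [sum(corr(seq[i:i + 2], seq[i + j:i + j + 2])
                  for i in range(L - j - 1)) / (L - j - 1)
              for j in range(1, lam + 1)]          # lambda correlation tiers
    denom = sum(f) + w * sum(thetas)
    return [x / denom for x in f] + [w * t / denom for t in thetas]

# Stand-in property table: one fabricated value per dinucleotide.
dinucs = [''.join(p) for p in product('ACGT', repeat=2)]
PROPS = {d: (i / 16.0,) for i, d in enumerate(dinucs)}
```

The output has 4^k + λ components and, by construction, sums to 1.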

Operation Engine

Support vector machines (SVMs) have been successfully applied to several bioinformatics problems (B.L., C. L., and K. Yan, unpublished data).20, 21, 22, 23, 24 In this study, we employed SVMs to build the predictor, using the SVM with the radial basis function (RBF) kernel in the Scikit-learn package.25 The SVM has two parameters: C (regularization) and γ (kernel width).

Accordingly, when combining the sliding-window approach and the SVM based on kmer or PseKNC, there are a total of (2 + 2 + 1) = 5 or (2 + 2 + 3) = 7 parameters, respectively. The values of C and γ will be given later.
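One elementary classifier then amounts to fitting an RBF-kernel SVC on a feature matrix. The snippet below is a sketch: the random X and y stand in for real feature vectors and labels, and the C and γ values are placeholders rather than the tuned settings reported in Tables 2 and 3. `probability=True` is needed so that the class probabilities used later in Equation 19 are available.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((100, 16))            # e.g. 2-mer feature vectors (stand-in)
y = rng.integers(0, 2, 100)          # promoter / non-promoter labels (stand-in)

clf = SVC(C=8.0, gamma=0.0156, kernel='rbf', probability=True)
clf.fit(X, y)
proba = clf.predict_proba(X)         # per-class probabilities for Equation 19
```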

For the sliding window with

5 ≤ ξ ≤ 9 with step gap = 1
1 ≤ δ ≤ 2 with step gap = 1, (Equation 13)

For the kmer approach with

k=1,2,3, (Equation 14)

30 elementary classifiers can be developed, denoted by

C(i),(i=1,2,,30). (Equation 15)

For the PseKNC approach with

1 ≤ k ≤ 4 with step gap = 1
2 ≤ λ ≤ ξ − k with step gap = 3
w = 0.5, (Equation 16)

46 elementary classifiers can be developed, denoted by

C(i),(i=31,32,,76). (Equation 17)

Therefore, we have a total of 30 + 46 = 76 elementary classifiers.

Ensemble Learning

Inspired by previous studies,13, 26, 27, 28, 29, 30, 31, 32 a series of individual predictors can be combined into an ensemble predictor with better prediction quality by using a voting system.

When developing an ensemble learning model, there are two fundamental issues: the selection of the individual classifiers with low correlation from the elementary classifiers and the construction of an ensemble classifier by fusing the selected classifiers. In this study, we employed the affinity propagation (AP) clustering algorithm33 to cluster the elementary classifiers based on the distance among classifiers. For each cluster, one key classifier was selected.

In order to measure the complementarity of different elementary classifiers, the distance between any two elementary classifiers C(i) and C(j) was measured by the following equation:

Distance(C(i), C(j)) = (1/m) Σk=1…m (dik Δ djk), (Equation 18)

where m is the number of training samples, dik is the classification probability of classifier C(i) on the k-th sample, and dik Δ djk is calculated by

dik Δ djk =
  √((1/Y) Σy=1…Y (diky − djky)²), if C(i) and C(j) have different predictions on the k-th sample
  0, otherwise, (Equation 19)

where Y represents the number of labels; Y was set as 2 and 6 for promoter identification and promoter type prediction, respectively. diky represents the probability of C(i) predicting the k-th sample as category y. By using Equations 18 and 19, the distance between any two elementary classifiers can be accurately measured. The range of Distance(C(i), C(j)) is from 0 to 1, where 1 indicates that the predictive results of the two classifiers are completely complementary and 0 means that their results are identical. The elementary classifiers were then grouped into different clusters by using the AP clustering algorithm.33
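Equations 18 and 19 and the clustering step can be sketched as follows. The function names are ours; note that scikit-learn's `AffinityPropagation` expects a similarity matrix, so the pairwise distances are negated before clustering.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def classifier_distance(pred_i, pred_j, proba_i, proba_j):
    """Equations 18 and 19: mean disagreement between two classifiers.
    pred_*: (m,) hard predictions; proba_*: (m, Y) class probabilities."""
    differ = np.asarray(pred_i) != np.asarray(pred_j)       # Eq. 19 gate
    delta = np.sqrt(((np.asarray(proba_i) - np.asarray(proba_j)) ** 2)
                    .mean(axis=1))                          # Eq. 19 RMS term
    return float((delta * differ).mean())                   # Eq. 18 average

def cluster_classifiers(dist):
    """AP clustering of the elementary classifiers over a precomputed
    distance matrix; AP takes similarities, hence the negation."""
    ap = AffinityPropagation(affinity='precomputed', random_state=0)
    return ap.fit_predict(-np.asarray(dist))
```

Identical classifiers have distance 0; one key classifier per AP cluster is then selected.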

The flowchart of the proposed iPromoter-2L2.0 predictor is shown in Figure 4.

Figure 4.


A Flowchart Shows How iPromoter-2L2.0 Works

For the first layer, 10 key classifiers were obtained (Table 2) as formulated by

C1(i),(i=1,2,,10). (Equation 20)

For the second layer, nine key classifiers were obtained (Table 3) as formulated by

C2(i),(i=1,2,,9). (Equation 21)

By fusing the 10 key classifiers (cf. Equation 20) following this study,13 we can obtain the first-layer ensemble predictor as given by

CE1 = C1(1) ⊕ C1(2) ⊕ ⋯ ⊕ C1(10) = ⊕i=1…10 C1(i). (Equation 22)

By fusing the nine key classifiers (cf. Equation 21), we can obtain the second-layer ensemble predictor given by

CE2 = C2(1) ⊕ C2(2) ⊕ ⋯ ⊕ C2(9) = ⊕i=1…9 C2(i), (Equation 23)

where the symbol ⊕ in Equations 22 and 23 denotes the linear combination of the key individual classifiers. The weight factors were optimized by the genetic algorithm,34 whose parameters (population size and number of evolutional generations) were set as 200 and 2,000, respectively, for both the first and second layers.
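The fusion operator ⊕ can be sketched as a weighted combination of the key classifiers' probability outputs. In this sketch the weights are fixed by hand for illustration; in the paper they are tuned by the genetic algorithm, and the function name is ours.

```python
import numpy as np

def fuse(probas, weights):
    """Equations 22 and 23: weighted linear combination of the key
    classifiers' class-probability outputs.
    probas: list of (m, Y) arrays, one per key classifier;
    weights: one non-negative voting weight per classifier."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                       # normalise weights
    combined = np.tensordot(w, np.stack(probas), axes=1)  # (m, Y) fused probs
    return combined.argmax(axis=1)                        # final class labels
```

Shifting weight between two disagreeing classifiers shifts the fused decision accordingly.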

Table 2.

The 10 Key Classifiers for the First-Layer Prediction

Key Classifier Feature Vector Dimension
C1(1) kmera 768
C1(2) kmerb 396
C1(3) kmerc 2,880
C1(4) kmerd 624
C1(5) PseKNCe 1,080
C1(6) PseKNCf 11,880
C1(7) PseKNCg 46,440
C1(8) PseKNCh 1,566
C1(9) PseKNCi 2,808
C1(10) PseKNCj 729
a. The parameters used: ξ=5, δ=1, k = 1, C=23, γ=26.
b. The parameters used: ξ=5, k = 1, C=2, γ=24.
c. The parameters used: ξ=6, δ=1, k = 2, C=2, γ=24.
d. The parameters used: ξ=8, δ=1, k = 1, C=23, γ=26.
e. The parameters used: ξ=6, δ=1, k = 1, λ = 2, w = 0.5, C=23, γ=24.
f. The parameters used: ξ=6, δ=1, k = 3, λ = 2, w = 0.5, C=23, γ=24.
g. The parameters used: ξ=6, δ=1, k = 4, λ = 2, w = 0.5, C=2, γ=24.
h. The parameters used: ξ=7, δ=2, k = 2, λ = 2, w = 0.5, C=2, γ=22.
i. The parameters used: ξ=8, δ=1, k = 2, λ = 2, w = 0.5, C=23, γ=24.
j. The parameters used: ξ=8, δ=2, k = 1, λ = 5, w = 0.5, C=2, γ=22.

Table 3.

The Nine Key Classifiers for the Second-Layer Prediction

Key Classifier Feature Vector Dimension
C2(1) kmera 1,584
C2(2) kmerb 2,688
C2(3) PseKNCc 11,880
C2(4) PseKNCd 1,008
C2(5) PseKNCe 3,528
C2(6) PseKNCf 1,566
C2(7) PseKNCg 2,808
C2(8) PseKNCh 729
C2(9) PseKNCi 1,296
a. The parameters used: ξ=5, δ=2, k = 2, C=24, γ=24.
b. The parameters used: ξ=7, δ=1, k = 2, C=24, γ=24.
c. The parameters used: ξ=6, δ=1, k = 3, λ = 2, w = 0.5, C=24, γ=24.
d. The parameters used: ξ=7, δ=1, k = 1, λ = 2, w = 0.5, C=24, γ=21.
e. The parameters used: ξ=7, δ=1, k = 2, λ = 5, w = 0.5, C=2, γ=21.
f. The parameters used: ξ=7, δ=2, k = 2, λ = 2, w = 0.5, C=24, γ=21.
g. The parameters used: ξ=8, δ=1, k = 2, λ = 2, w = 0.5, C=24, γ=21.
h. The parameters used: ξ=8, δ=2, k = 1, λ = 5, w = 0.5, C=24, γ=21.
i. The parameters used: ξ=9, δ=1, k = 1, λ = 5, w = 0.5, C=24, γ=21.

Cross-Validation and Performance Measures

The performance of various predictors was evaluated by using 5-fold cross-validation with the following performance measures:12

Sn(i) = 1 − N−+(i)/N+(i), 0 ≤ Sn ≤ 1
Sp(i) = 1 − N+−(i)/N−(i), 0 ≤ Sp ≤ 1
Acc(i) = 1 − (N−+(i) + N+−(i))/(N+(i) + N−(i)), 0 ≤ Acc ≤ 1
MCC(i) = [1 − (N−+(i)/N+(i) + N+−(i)/N−(i))] / √{[1 + (N+−(i) − N−+(i))/N+(i)][1 + (N−+(i) − N+−(i))/N−(i)]}, −1 ≤ MCC ≤ 1, (Equation 24)

where N+(i) is the total number of samples in class i; N−(i) is the total number of samples not in class i; N−+(i) is the number of class-i samples incorrectly predicted to be in other classes; N+−(i) is the number of samples from other classes incorrectly predicted to be in class i; i = 1, 2, …, Y; and Y is the number of classes of this system. For the first-layer prediction, Y = 2, and i represents the promoter (i = 1) or non-promoter (i = 2). Similarly, for the second-layer prediction, Y = 6, and i = 1, 2, 3, 4, 5, or 6 for σ24, σ28, σ32, σ38, σ54, or σ70 promoters, respectively. For the details of these performance measures, please refer to a recent study.12
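The four measures of Equation 24 for one class can be computed directly from the confusion counts; this sketch uses our own function name and the N−+/N+− convention described above.

```python
import math

def class_metrics(n_pos, n_neg, fn, fp):
    """Equation 24 for one class i.
    n_pos, n_neg: numbers of positive and negative samples;
    fn: positives predicted negative (N−+); fp: negatives predicted
    positive (N+−)."""
    sn = 1 - fn / n_pos
    sp = 1 - fp / n_neg
    acc = 1 - (fn + fp) / (n_pos + n_neg)
    mcc = (1 - (fn / n_pos + fp / n_neg)) / math.sqrt(
        (1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg))
    return sn, sp, acc, mcc
```

This formulation is algebraically equivalent to the usual TP/TN/FP/FN definitions of sensitivity, specificity, accuracy, and MCC.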

Author Contributions

B.L. provided the main idea of the manuscript and wrote the manuscript. K.L. did the experiments and wrote the manuscript.

Conflicts of Interest

The authors declare no competing interests.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61672184 and 61822306), the Fok Ying-Tung Education Foundation for Young Teachers in the Higher Education Institutions of China (161063), and the Scientific Research Foundation in Shenzhen (JCYJ20180306172207178).

References

  • 1.Borukhov S., Nudler E. RNA polymerase: the vehicle of transcription. Trends Microbiol. 2008;16:126–134. doi: 10.1016/j.tim.2007.12.006.
  • 2.Silva S.D.A.E., Echeverrigaray S. Bacterial Promoter Features Description and Their Application on E. coli In Silico Prediction and Recognition Approaches. Intech; 2012.
  • 3.Janga S.C., Collado-Vides J. Structure and evolution of gene regulatory networks in microbial genomes. Res. Microbiol. 2007;158:787–794. doi: 10.1016/j.resmic.2007.09.001.
  • 4.Potvin E., Sanschagrin F., Levesque R.C. Sigma factors in Pseudomonas aeruginosa. FEMS Microbiol. Rev. 2008;32:38–55. doi: 10.1111/j.1574-6976.2007.00092.x.
  • 5.Lin H., Deng E.Z., Ding H., Chen W., Chou K.C. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. 2014;42:12961–12972. doi: 10.1093/nar/gku1019.
  • 6.Chen W., Lei T.-Y., Jin D.-C., Lin H., Chou K.-C. PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. Anal. Biochem. 2014;456:53–60. doi: 10.1016/j.ab.2014.04.001.
  • 7.Li Q.Z., Lin H. The recognition and prediction of sigma70 promoters in Escherichia coli K-12. J. Theor. Biol. 2006;242:135–141. doi: 10.1016/j.jtbi.2006.02.007.
  • 8.He W., Jia C., Duan Y., Zou Q. 70ProPred: a predictor for discovering sigma70 promoters based on combining multiple features. BMC Syst. Biol. 2018;12(Suppl 4):44. doi: 10.1186/s12918-018-0570-1.
  • 9.Zhang C.T. A symmetrical theory of DNA sequences and its applications. J. Theor. Biol. 1997;187:297–306. doi: 10.1006/jtbi.1997.0401.
  • 10.Zhang C.T., Zhang R., Ou H.Y. The Z curve database: a graphic representation of genome sequences. Bioinformatics. 2003;19:593–599. doi: 10.1093/bioinformatics/btg041.
  • 11.Song K. Recognition of prokaryotic promoters based on a novel variable-window Z-curve method. Nucleic Acids Res. 2012;40:963–971. doi: 10.1093/nar/gkr795.
  • 12.Liu B., Yang F., Huang D.S., Chou K.-C. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics. 2018;34:33–40. doi: 10.1093/bioinformatics/btx579.
  • 13.Liu B., Long R., Chou K.-C. iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework. Bioinformatics. 2016;32:2411–2418. doi: 10.1093/bioinformatics/btw186.
  • 14.Chen W., Zhang X., Brooker J., Lin H., Zhang L., Chou K.-C. PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics. 2015;31:119–120. doi: 10.1093/bioinformatics/btu602.
  • 15.Liu B., Liu F., Fang L., Wang X., Chou K.-C. repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics. 2015;31:1307–1309. doi: 10.1093/bioinformatics/btu820.
  • 16.Schneider T.D., Stephens R.M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990;18:6097–6100. doi: 10.1093/nar/18.20.6097.
  • 17.Chou K.-C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005;21:10–19. doi: 10.1093/bioinformatics/bth466.
  • 18.Liu B. BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief. Bioinform. 2017. Published online December 19, 2017. doi: 10.1093/bib/bbx165.
  • 19.Liu B., Liu F., Wang X., Chen J., Fang L., Chou K.-C. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015;43(W1):W65–W71. doi: 10.1093/nar/gkv458.
  • 20.Li D., Ju Y., Zou Q. Protein Folds Prediction with Hierarchical Structured SVM. Curr. Proteomics. 2016;13:79–85.
  • 21.Zhang N., Sa Y., Guo Y., Lin W., Wang P., Feng Y. Discriminating Ramos and Jurkat Cells with Image Textures from Diffraction Imaging Flow Cytometry Based on a Support Vector Machine. Curr. Bioinform. 2018;13:50–56.
  • 22.Wang S.P., Zhang Q., Lu J., Cai Y.D. Analysis and Prediction of Nitrated Tyrosine Sites with the mRMR Method and Support Vector Machine Algorithm. Curr. Bioinform. 2018;13:3–13.
  • 23.Chen W., Lv H., Nie F., Lin H. i6mA-Pred: Identifying DNA N6-methyladenine sites in the rice genome. Bioinformatics. 2019;35:2796–2800. doi: 10.1093/bioinformatics/btz015.
  • 24.Feng P.M., Chen W., Lin H., Chou K.C. iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal. Biochem. 2013;442:118–125. doi: 10.1016/j.ab.2013.05.024.
  • 25.Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2012;12:2825–2830.
  • 26.Liu B., Wang S., Long R., Chou K.-C. iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics. 2017;33:35–41. doi: 10.1093/bioinformatics/btw539.
  • 27.Liu B., Yang F., Chou K.-C. 2L-piRNA: A two-layer ensemble classifier for identifying piwi-interacting RNAs and their function. Mol. Ther. Nucleic Acids. 2017;7:267–277. doi: 10.1016/j.omtn.2017.04.008.
  • 28.Lin C., Chen W., Qiu C., Wu Y., Krishnan S., Zou Q. LibD3C: Ensemble classifiers with a clustering and dynamic selection strategy. Neurocomputing. 2014;123:424–435.
  • 29.Zou Q., Guo J., Ju Y., Wu M., Zeng X., Hong Z. Improving tRNAscan-SE Annotation Results via Ensemble Classifiers. Mol. Inform. 2015;34:761–770. doi: 10.1002/minf.201500031.
  • 30.Zou Q., Wang Z., Guan X., Liu B., Wu Y., Lin Z. An approach for identifying cytokines based on a novel ensemble classifier. BioMed Res. Int. 2013;2013:686090. doi: 10.1155/2013/686090.
  • 31.Yan K., Fang X., Xu Y., Liu B. Protein Fold Recognition based on Multi-view Modeling. Bioinformatics. 2019. Published online January 21, 2019. doi: 10.1093/bioinformatics/btz040.
  • 32.Liu B., Zhu Y. ProtDec-LTR3.0: protein remote homology detection by incorporating profile-based features into Learning to Rank. IEEE Access. 2019. Published online July 18, 2019.
  • 33.Frey B.J., Dueck D. Clustering by passing messages between data points. Science. 2007;315:972–976. doi: 10.1126/science.1136800.
  • 34.Mitchell M. An Introduction to Genetic Algorithms. MIT Press; 1998.
