Molecular Therapy. Nucleic Acids
. 2019 Aug 14;18:80–87. doi: 10.1016/j.omtn.2019.08.008

iPromoter-2L2.0: Identifying Promoters and Their Types by Combining Smoothing Cutting Window Algorithm and Sequence-Based Features

Bin Liu 1,2, Kai Li 3
PMCID: PMC6796744  PMID: 31536883

Abstract

Promoters are short regions at specific locations of DNA sequences that play key roles in directing gene transcription. They can be grouped into six types (σ24, σ28, σ32, σ38, σ54, and σ70). Recently, a predictor called “iPromoter-2L” was constructed to predict promoters and their six types, the first approach able to predict all six types of promoters. However, its predictive quality still needs to be further improved for real-world applications. In this study, we proposed the smoothing cutting window algorithm to find the window fragments of DNA sequences based on conservation scores so as to capture the sequence patterns of promoters. For each window fragment, discriminative features were extracted by using kmer and PseKNC. Combined with support vector machines (SVMs), different predictors were constructed and then clustered into several groups based on their distances. Finally, a new predictor called iPromoter-2L2.0 was constructed to identify promoters and their six types; it was developed by ensemble learning based on the key predictors selected from the cluster groups. The results showed that iPromoter-2L2.0 outperformed other existing methods for both promoter prediction and identification of the six types, indicating that iPromoter-2L2.0 will be helpful for genomics analysis.

Keywords: promoter, smoothing cutting window algorithm, ensemble learning

Introduction

A promoter is a DNA fragment at a specific location that can be recognized and bound by RNA polymerase to initiate transcription. In bacteria, the RNA polymerase contains five subunits (2α, β, β′, ω) and an extra σ factor.1, 2 The σ factors can be labeled as σ24, σ28, σ32, σ38, σ54, and σ70 according to their molecular weights. Different σ factors direct the RNA polymerase to bind different promoter regions, which affects the consequent activation of genes. σ24 and σ32 participate in the heat-shock response, σ28 participates in flagellar gene expression during normal growth, σ54 participates in nitrogen metabolism, and σ70, called the primary σ factor, is in charge of the transcription of most genes in growing cells.2, 3, 4

Because wet experiments for identifying the types of promoters are expensive, several predictors have been developed to identify promoters based on DNA sequence information; for example, iPro54-PseKNC5 based on the PseKNC6 was constructed to identify promoters. A position-correlation scoring function (PCSF)7 and a Bayes profile8 were proposed to identify promoters. By combining the variable window technique with the regular Z-curve method,9, 10, 11 the “variable-window Z-curve” was proposed to detect promoters. These methods were discussed in a recent study.12

Recently, iPromoter-2L12 was proposed, the first predictor able to predict promoters and their aforementioned six different types. This predictor employed the multi-window-based PseKNC approach to capture the sequence patterns of promoters. However, it is extremely hard for this predictor to find the optimized sequence windows by using the flexible-sliding-window approach to extract the discriminative features, preventing further performance improvement. To overcome these shortcomings, in this study we proposed the smoothing cutting window (SCW) algorithm to divide the DNA sequences into fragment windows based on the conservation scores, and we ensembled different predictors based on various sequence-based features to further improve the predictive performance.

Results and Discussion

Comparison with Other Existing Methods

Table 1 shows the results (Equation 24) generated by iPromoter-2L2.0 via 5-fold cross-validation on the benchmark dataset. The corresponding rates obtained by the existing methods are also given in Table 1. For the second-layer prediction, only iPromoter-2L and iPromoter-2L2.0 are able to predict the promoter types.

Table 1.

A Comparison of iPromoter-2L2.0 with Other Predictors for Identifying Promoters (the First Layer) and Their Types (the Second Layer) via the 5-fold Cross-Validation on the Same Benchmark Dataset

Method Acc (%) MCC Sn (%) Sp (%)
First Layer

PCSFa 74.81 0.4980 78.92 70.70
vw Z-curvea 80.28 0.6098 77.76 82.80
Stabilitya 78.04 0.5615 76.61 79.48
iPro54a 80.45 0.6100 77.76 83.15
iPromoter-2L1.0a 81.68 0.6343 79.20 84.16
iPromoter-2L2.0b 84.98 0.6998 84.13 85.84

Second Layer

iPromoter-2L1.0a
σ24 promoter 93.50 0.7338 72.52 96.93
σ28 promoter 96.82 0.5708 42.54 99.49
σ32 promoter 94.41 0.6524 52.58 99.14
σ38 promoter 94.69 0.2962 15.34 99.48
σ54 promoter 94.04 0.6459 53.19 99.57
σ70 promoter 80.66 0.6056 95.34 59.35
iPromoter-2L2.0b
σ24 promoter 94.62 0.8053 81.82 97.22
σ28 promoter 97.94 0.7561 71.64 99.23
σ32 promoter 95.38 0.7361 71.82 98.05
σ38 promoter 94.58 0.2242 7.36 99.85
σ54 promoter 98.11 0.6714 59.57 99.42
σ70 promoter 85.94 0.7109 95.22 72.47

See Equation 24. Acc, accuracy; Sn, sensitivity; Sp, specificity.

a. The results reported in Liu et al.12
b. The predictor proposed in this study.

From Table 1 we can see the following: (1) for the first-layer prediction, iPromoter-2L2.0 outperformed all the other methods in terms of all four performance measures (cf. Equation 24); (2) for the second-layer prediction, iPromoter-2L2.0 outperformed iPromoter-2L for the prediction of σ24 promoters, σ28 promoters, σ32 promoters, σ54 promoters, and σ70 promoters in terms of accuracy (Acc) and Matthews correlation coefficient (MCC), and its performance is comparable with that of iPromoter-2L for the prediction of σ38 promoters. The reason for the performance improvement over iPromoter-2L is that iPromoter-2L2.0 is based on the SCW algorithm, which is able to more accurately extract the sequence features that discriminate the promoters and their types.

It can be anticipated that the proposed SCW algorithm would have many potential applications, such as enhancer prediction, DNA replication origin prediction, etc.

Web Server and Its User Guide

We established a web server for iPromoter-2L2.0 to help readers use the proposed method by following the steps below.

  • Step 1. Click the hyperlink http://bliulab.net/iPromoter-2L2.0/ to access the homepage as shown in Figure 1. An introduction to the web server is given in the Read Me.

  • Step 2. Copy/paste or type the query DNA sequences into the input box at the center of Figure 1 or upload the data by the Browse button.

  • Step 3. Click on the Submit button—you will see the predicted results. If using the example sequences for the prediction, you will see the following results: (1) both the first and the second query sequences are non-promoters; (2) the third query sequence is a σ70 promoter.

  • Step 4. On the results page, the predicted results can be downloaded by clicking the Download button.

Figure 1.


A Screenshot of the Homepage of the Web Server for iPromoter-2L2.0

iPromoter-2L2.0 can be accessed at http://bliulab.net/iPromoter-2L2.0/.

Materials and Methods

Benchmark Dataset

To facilitate performance comparison, we employed the dataset S12 to construct the predictor and evaluate the various methods. The dataset can be formulated as12

S = S+ ∪ S−
S+ = S+(σ24) ∪ S+(σ28) ∪ S+(σ32) ∪ S+(σ38) ∪ S+(σ54) ∪ S+(σ70), (Equation 1)

where “∪” indicates the “union” in set theory; S+ indicates promoter samples; S− indicates non-promoter samples; and S+(σ24), S+(σ28), S+(σ32), S+(σ38), S+(σ54), and S+(σ70) indicate the six kinds of promoters. Specifically, the benchmark dataset S consists of 5,920 samples, half of which are promoters and the other half non-promoters. S+(σ24) contains 484 samples; S+(σ28) contains 134 samples; S+(σ32) contains 291 samples; S+(σ38) contains 163 samples; S+(σ54) contains 94 samples; and S+(σ70) contains 1,694 samples.

Sample Formulation

In this study, the DNA sequence samples were divided into several fragment windows by using the proposed SCW algorithm, and then for each fragment window, a sliding window approach was used to extract the sequence features by using kmer13 and PseKNC.6, 14, 15

SCW Algorithm

Previous studies showed that the distributions of conservation scores of promoters and non-promoters are obviously different.12 Here, we proposed the SCW algorithm to incorporate these sequence patterns into the predictor so as to improve the predictive performance.

A DNA sample is represented as

D = N1N2⋯Ni⋯N81, (Equation 2)

where Ni denotes the nucleotide at sequence position i, which can be one of the following four nucleotides, i.e.,

Ni ∈ {A (adenine), C (cytosine), G (guanine), T (thymine)}, (Equation 3)

where “∈” refers to “member of,” a symbol in set theory.

To reflect the conservation score distribution patterns along D, it was split into S + 1 fragments covering [1, τ1 − 1], [τ1, τ2 − 1], …, [τS, L] by the cutting points τj (j = 1, 2, …, S), where S is the total number of cutting points. The fragments can be represented as

ρ1 = N1N2⋯Nτ1−1
ρ2 = Nτ1Nτ1+1⋯Nτ2−1
⋮
ρS+1 = NτSNτS+1⋯NL. (Equation 4)

The cutting point τj is defined as follows:

τj =
  φ1, if φ1 > α and φ2 − φ1 > α
  φm, if 1 < m < Z and φm − φm−1 > α and φm+1 − φm > α
  φZ, if L − φZ > α and φZ − φZ−1 > α
  not a cutting point, otherwise, (Equation 5)

where α is a distance threshold, which was set as 8 in this study; φ is a candidate cutting point; and Z is the total number of candidate cutting points. For a given sequence position i, φ is defined as

φm =
  i, if SSDi < SSDi−1 and SSDi < SSDi+1 and 1 < i < L
  1, if SSDi < SSDi+1 and i = 1
  L, if SSDi < SSDi−1 and i = L
  i is not a candidate cutting point, otherwise, (Equation 6)

where SSDi represents the smooth standard deviation of the average conservation score (CS) of sequence position i, which can be calculated by

SSDi =
  (1/5) Σk=i−2…i+2 SDk, if 2 < i < L − 1
  (1/(i+2)) Σk=1…i+2 SDk, if i = 1, 2
  (1/(L−i+3)) Σk=i−2…L SDk, if i = L − 1, L, (Equation 7)

where k is the sequence position and SDk is the standard deviation of the average CS at the k-th sequence position, which can be calculated by

SDk = √((1/Y) Σy=1…Y (εky − μ)²), (Equation 8)

where Y represents the number of labels, which is equal to 2 for the first layer and 6 for the second layer; εky denotes the average CS of the y-th class samples at the k-th sequence position, which can be calculated by the approach introduced in Schneider and Stephens;16 and μ is the average CS over all labels at the k-th position.

The conservation profiles and the standard deviations of promoters and non-promoters are shown in Figure 2A, and the conservation profile and the standard deviation of each promoter type are shown in Figure 3A. The smoothed standard deviation curves are shown in Figures 2B and 3B. The DNA sequences were divided into several fragments by SCW, as shown in Figures 2C and 3C. The pseudo-code of the SCW algorithm is shown in Box 1.

Figure 2.


A Flowchart Shows the Steps of the Proposed Smoothing Cutting Window Algorithm for the First-Layer Prediction

The standard deviations shown in (A) are converted into the smooth standard deviations as shown in (B), based on which the DNA sequences are divided into several fragments, as shown in (C).

Figure 3.


A Flowchart Shows the Process of the Proposed Smoothing Cutting Window Algorithm for the Second-Layer Prediction

The SDs shown in (A) are converted into the smooth SDs as shown in (B), based on which the DNA sequences are divided into several fragments, as shown in (C).

Box 1. Algorithm: Smoothing Cutting Window.
  • Parameters: sequence length L, number of labels Y

  • Input: DNA sequence in Equation 2

  • Output: cutting points τ1,τ2,...τs

  • For y = 1 to Y do

  • For i = 1 to L do

  • Calculate conservation score εiy

  • End for

  • End for

  • For i = 1 to L do

  • Calculate SSDi by Equation 7

  • End for

  • Calculate cutting points τ1,τ2,...τs by Equations 5 and 6 and SSD

  • Return τ1,τ2,...τs

After the process shown in Box 1, each DNA sequence in S (cf. Equation 1) was divided into four fragments ([1, 28], [29, 44], [45, 56], [57, 81]), and each DNA sequence in S+ (cf. Equation 1) was divided into four fragments ([1, 17], [18, 41], [42, 56], [57, 81]). Then for each fragment, the sliding-window approach was used to extract the features.
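Box 1 can be turned into a short runnable sketch. The version below is a minimal NumPy rendering under our own assumptions: the per-class average conservation scores are taken as a ready-made (Y, L) array, and the function name and synthetic input are illustrative rather than the paper's code.

```python
import numpy as np

def scw_cutting_points(scores, alpha=8):
    """Sketch of the SCW algorithm (Box 1; Equations 5-8).

    scores: array of shape (Y, L) holding the per-class average
    conservation score at each of the L sequence positions.
    Returns the 1-based cutting points tau.
    """
    Y, L = scores.shape
    mu = scores.mean(axis=0)                         # average CS per position
    sd = np.sqrt(((scores - mu) ** 2).mean(axis=0))  # Equation 8

    # Equation 7: smooth SD with a centred 5-position window,
    # truncated at the two ends of the sequence.
    ssd = np.empty(L)
    for i in range(L):
        lo, hi = max(0, i - 2), min(L, i + 3)
        ssd[i] = sd[lo:hi].mean()

    # Equation 6: candidate cutting points are local minima of SSD.
    phi = [i + 1 for i in range(L)
           if (i == 0 or ssd[i] < ssd[i - 1])
           and (i == L - 1 or ssd[i] < ssd[i + 1])]

    # Equation 5: keep candidates farther than alpha from their
    # neighbouring candidates and from both sequence ends.
    tau = []
    for m, p in enumerate(phi):
        left = phi[m - 1] if m > 0 else 0
        right = phi[m + 1] if m < len(phi) - 1 else L
        if p - left > alpha and right - p > alpha:
            tau.append(p)
    return tau
```

On a synthetic two-class profile whose smoothed SD dips at two well-separated positions, the function returns exactly those positions as cutting points.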

A sliding window can be expressed by [ξ,δ], where ξ is the width of the window and δ is the step of sliding window. For each fragment obtained, the number of the segments produced by [ξ,δ] along the fragment sequence is given by12

η = INT[(|ρi| − ξ + δ)/δ], (Equation 9)

where “INT” is an “integer-cutting operator” (i.e., floor) and |ρi| denotes the length of the i-th fragment. For example, assuming |ρi| = 29, ξ = 6, and δ = 1 in Equation 9, we obtain η = 24; that is, the sliding window [6, 1] produces 24 DNA segments on a fragment of length 29.
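Equation 9 and the segment extraction can be sketched as follows; the function name is ours, chosen for illustration.

```python
def sliding_segments(fragment, xi, delta):
    """Slide a window of width xi with step delta along a fragment,
    yielding the eta = INT[(|rho| - xi + delta)/delta] segments
    of Equation 9."""
    eta = (len(fragment) - xi + delta) // delta   # integer-cutting (floor)
    return [fragment[i * delta : i * delta + xi] for i in range(eta)]
```

With a fragment of length 29 and the window [6, 1], this reproduces the worked example of 24 segments.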

kmer

kmer13 is a simple and effective method to extract the information in the DNA sequence. By using kmer, the DNA sequence fragment ρ (cf. Equation 4) can be represented as

ρ = [f1(kmer)  f2(kmer)  ⋯  fi(kmer)  ⋯  f4^k(kmer)]^T, (Equation 10)

where fi(kmer) (i = 1, 2, …, 4^k) is the frequency of the i-th k-mer (k neighboring nucleotides) in the fragment ρ, and T represents the transpose operator. For example, Equation 10 becomes the following 4-mer vector when k = 4:

ρ = [f(AAAA)  f(AAAC)  ⋯  f(TTTT)]^T = [f1(4mer)  f2(4mer)  ⋯  f256(4mer)]^T. (Equation 11)
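As a concrete rendering of Equation 10, a minimal k-mer frequency extractor (the function name is illustrative) enumerates all 4^k k-mers in lexicographic order and normalizes their counts:

```python
from itertools import product

def kmer_vector(fragment, k):
    """Frequency vector of Equation 10: the 4**k k-mer frequencies
    of the fragment, in lexicographic order (AA..A, AA..C, ...)."""
    counts = {''.join(p): 0 for p in product('ACGT', repeat=k)}
    total = len(fragment) - k + 1          # number of k-mers in the fragment
    for i in range(total):
        counts[fragment[i:i + k]] += 1
    return [counts[m] / total for m in sorted(counts)]
```

The resulting vector has 4^k components summing to 1.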

PseKNC

The PseKNC6 incorporates the short-range sequence information, the long-range sequence information, and the physicochemical properties of the dinucleotides,6 which can formulate the DNA sequence fragment ρ of Equation 4 as

ρ = [f1(PseKNC)  f2(PseKNC)  ⋯  f4^k(PseKNC)  f4^k+1(PseKNC)  ⋯  f4^k+λ(PseKNC)]^T. (Equation 12)

PseKNC6 has three parameters: k, λ (the number of sequence correlations considered17), and w (the weight factor). Each of the parameters has been clearly defined in the original paper6 and a comprehensive review.18

The kmer and PseKNC features can be easily generated by some existing tools, such as Pse-in-One19 and PseKNC-General.14
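For intuition, the structure of Equation 12 can be sketched as a Type-I PseKNC computation. Everything below is a simplified stand-in, not the published implementation: the dinucleotide property table `PROPS` uses one made-up value per dinucleotide, whereas the real PseKNC uses published physicochemical properties, and the function names are ours.

```python
from itertools import product

def _kmer_freq(seq, k):
    # normalised k-mer frequencies (first 4**k components of Equation 12)
    counts = {''.join(p): 0 for p in product('ACGT', repeat=k)}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    n = len(seq) - k + 1
    return [counts[m] / n for m in sorted(counts)]

def pseknc_vector(seq, k, lam, w, props):
    """Type-I PseKNC sketch: 4**k k-mer frequencies followed by
    lam long-range correlation factors, all sharing one denominator."""
    def corr(a, b):  # mean squared property difference of two dinucleotides
        pa, pb = props[a], props[b]
        return sum((x - y) ** 2 for x, y in zip(pa, pb)) / len(pa)

    L = len(seq)
    f = _kmer_freq(seq, k)
    thetas = [sum(corr(seq[i:i + 2], seq[i + j:i + j + 2])
                  for i in range(L - j - 1)) / (L - j - 1)
              for j in range(1, lam + 1)]          # lambda correlation tiers
    denom = sum(f) + w * sum(thetas)
    return [x / denom for x in f] + [w * t / denom for t in thetas]

# Stand-in property table: one fabricated value per dinucleotide.
dinucs = [''.join(p) for p in product('ACGT', repeat=2)]
PROPS = {d: (i / 16.0,) for i, d in enumerate(dinucs)}
```

The output has 4^k + λ components and, by construction, sums to 1.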

Operation Engine

Support vector machines (SVMs) have been successfully applied to several bioinformatics problems (B.L., C. L., and K. Yan, unpublished data).20, 21, 22, 23, 24 In this study, we employed SVMs to build the predictor, using the SVM with the radial basis function (RBF) kernel in the Scikit-learn package.25 The SVM has two parameters: C (regularization) and γ (kernel width).

Accordingly, when combining the sliding-window approach and the SVM based on kmer or PseKNC, there are a total of (2 + 2 + 1) = 5 or (2 + 2 + 3) = 7 parameters, respectively. The values of C and γ will be given later.
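One elementary classifier then amounts to fitting an RBF-kernel SVC on a feature matrix. The snippet below is a sketch: the random X and y stand in for real feature vectors and labels, and the C and γ values are placeholders rather than the tuned settings reported in Tables 2 and 3. `probability=True` is needed so that the class probabilities used later in Equation 19 are available.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((100, 16))            # e.g. 2-mer feature vectors (stand-in)
y = rng.integers(0, 2, 100)          # promoter / non-promoter labels (stand-in)

clf = SVC(C=8.0, gamma=0.0156, kernel='rbf', probability=True)
clf.fit(X, y)
proba = clf.predict_proba(X)         # per-class probabilities for Equation 19
```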

For the sliding window with

5 ≤ ξ ≤ 9 with step gap = 1
1 ≤ δ ≤ 2 with step gap = 1, (Equation 13)

For the kmer approach with

k=1,2,3, (Equation 14)

30 elementary classifiers can be developed, denoted by

C(i),(i=1,2,,30). (Equation 15)

For the PseKNC approach with

1 ≤ k ≤ 4 with step gap = 1
2 ≤ λ ≤ ξ − k with step gap = 3
w = 0.5, (Equation 16)

46 elementary classifiers can be developed, denoted by

C(i),(i=31,32,,76). (Equation 17)

Therefore, we have a total of 30 + 46 = 76 elementary classifiers.

Ensemble Learning

Inspired by previous studies,13, 26, 27, 28, 29, 30, 31, 32 a series of individual predictors can be combined into an ensemble predictor with better prediction quality by using a voting system.

When developing an ensemble learning model, there are two fundamental issues: the selection of the individual classifiers with low correlation from the elementary classifiers and the construction of an ensemble classifier by fusing the selected classifiers. In this study, we employed the affinity propagation (AP) clustering algorithm33 to cluster the elementary classifiers based on the distance among classifiers. For each cluster, one key classifier was selected.

In order to measure the complementarity of different elementary classifiers, the distance between any two elementary classifiers C(i) and C(j) was measured by the following equation:

Distance(C(i), C(j)) = (1/m) Σk=1…m (dik Δ djk), (Equation 18)

where m is the number of training samples, dik is the classification probability of classifier C(i) on the k-th sample, and dik Δ djk is calculated by

dik Δ djk =
  √((1/Y) Σy=1…Y (diky − djky)²), if C(i) and C(j) have different predictions on the k-th sample
  0, otherwise, (Equation 19)

where Y represents the number of labels; Y was set as 2 and 6 for promoter identification and promoter type prediction, respectively. diky represents the probability of C(i) predicting the k-th sample as category y. By using Equations 18 and 19, the distance between any two elementary classifiers can be accurately measured. The range of Distance(C(i), C(j)) is from 0 to 1, where 1 indicates that the predictive results of the two classifiers are completely complementary and 0 means that their results are identical. The elementary classifiers were then grouped into different clusters by using the AP clustering algorithm.33
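Equations 18 and 19 and the clustering step can be sketched as follows. The function names are ours; note that scikit-learn's `AffinityPropagation` expects a similarity matrix, so the pairwise distances are negated before clustering.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def classifier_distance(pred_i, pred_j, proba_i, proba_j):
    """Equations 18 and 19: mean disagreement between two classifiers.
    pred_*: (m,) hard predictions; proba_*: (m, Y) class probabilities."""
    differ = np.asarray(pred_i) != np.asarray(pred_j)       # Eq. 19 gate
    delta = np.sqrt(((np.asarray(proba_i) - np.asarray(proba_j)) ** 2)
                    .mean(axis=1))                          # Eq. 19 RMS term
    return float((delta * differ).mean())                   # Eq. 18 average

def cluster_classifiers(dist):
    """AP clustering of the elementary classifiers over a precomputed
    distance matrix; AP takes similarities, hence the negation."""
    ap = AffinityPropagation(affinity='precomputed', random_state=0)
    return ap.fit_predict(-np.asarray(dist))
```

Identical classifiers have distance 0; one key classifier per AP cluster is then selected.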

The flowchart of the proposed iPromoter-2L2.0 predictor is shown in Figure 4.

Figure 4.


A Flowchart Shows How iPromoter-2L2.0 Works

For the first layer, 10 key classifiers were obtained (Table 2) as formulated by

C1(i),(i=1,2,,10). (Equation 20)

For the second layer, nine key classifiers were obtained (Table 3) as formulated by

C2(i),(i=1,2,,9). (Equation 21)

By fusing the 10 key classifiers (cf. Equation 20) following this study,13 we can obtain the first-layer ensemble predictor as given by

CE1 = C1(1) ⊕ C1(2) ⊕ ⋯ ⊕ C1(10) = ⊕i=1…10 C1(i). (Equation 22)

By fusing the nine key classifiers (cf. Equation 21), we can obtain the second-layer ensemble predictor given by

CE2 = C2(1) ⊕ C2(2) ⊕ ⋯ ⊕ C2(9) = ⊕i=1…9 C2(i), (Equation 23)

where the symbol ⊕ in Equations 22 and 23 denotes the linear combination of the key individual classifiers. The weight factors were optimized by the genetic algorithm,34 whose parameters (population size and number of evolutional generations) were set as 200 and 2,000, respectively, for both the first and second layers.
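The fusion operator ⊕ can be sketched as a weighted combination of the key classifiers' probability outputs. In this sketch the weights are fixed by hand for illustration; in the paper they are tuned by the genetic algorithm, and the function name is ours.

```python
import numpy as np

def fuse(probas, weights):
    """Equations 22 and 23: weighted linear combination of the key
    classifiers' class-probability outputs.
    probas: list of (m, Y) arrays, one per key classifier;
    weights: one non-negative voting weight per classifier."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                       # normalise weights
    combined = np.tensordot(w, np.stack(probas), axes=1)  # (m, Y) fused probs
    return combined.argmax(axis=1)                        # final class labels
```

Shifting weight between two disagreeing classifiers shifts the fused decision accordingly.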

Table 2.

The 10 Key Classifiers for the First-Layer Prediction

Key Classifier Feature Vector Dimension
C1(1) kmera 768
C1(2) kmerb 396
C1(3) kmerc 2,880
C1(4) kmerd 624
C1(5) PseKNCe 1,080
C1(6) PseKNCf 11,880
C1(7) PseKNCg 46,440
C1(8) PseKNCh 1,566
C1(9) PseKNCi 2,808
C1(10) PseKNCj 729
a. The parameters used: ξ=5, δ=1, k = 1, C=23, γ=26.
b. The parameters used: ξ=5, k = 1, C=2, γ=24.
c. The parameters used: ξ=6, δ=1, k = 2, C=2, γ=24.
d. The parameters used: ξ=8, δ=1, k = 1, C=23, γ=26.
e. The parameters used: ξ=6, δ=1, k = 1, λ = 2, w = 0.5, C=23, γ=24.
f. The parameters used: ξ=6, δ=1, k = 3, λ = 2, w = 0.5, C=23, γ=24.
g. The parameters used: ξ=6, δ=1, k = 4, λ = 2, w = 0.5, C=2, γ=24.
h. The parameters used: ξ=7, δ=2, k = 2, λ = 2, w = 0.5, C=2, γ=22.
i. The parameters used: ξ=8, δ=1, k = 2, λ = 2, w = 0.5, C=23, γ=24.
j. The parameters used: ξ=8, δ=2, k = 1, λ = 5, w = 0.5, C=2, γ=22.

Table 3.

The Nine Key Classifiers for the Second-Layer Prediction

Key Classifier Feature Vector Dimension
C2(1) kmera 1,584
C2(2) kmerb 2,688
C2(3) PseKNCc 11,880
C2(4) PseKNCd 1,008
C2(5) PseKNCe 3,528
C2(6) PseKNCf 1,566
C2(7) PseKNCg 2,808
C2(8) PseKNCh 729
C2(9) PseKNCi 1,296
a. The parameters used: ξ=5, δ=2, k = 2, C=24, γ=24.
b. The parameters used: ξ=7, δ=1, k = 2, C=24, γ=24.
c. The parameters used: ξ=6, δ=1, k = 3, λ = 2, w = 0.5, C=24, γ=24.
d. The parameters used: ξ=7, δ=1, k = 1, λ = 2, w = 0.5, C=24, γ=21.
e. The parameters used: ξ=7, δ=1, k = 2, λ = 5, w = 0.5, C=2, γ=21.
f. The parameters used: ξ=7, δ=2, k = 2, λ = 2, w = 0.5, C=24, γ=21.
g. The parameters used: ξ=8, δ=1, k = 2, λ = 2, w = 0.5, C=24, γ=21.
h. The parameters used: ξ=8, δ=2, k = 1, λ = 5, w = 0.5, C=24, γ=21.
i. The parameters used: ξ=9, δ=1, k = 1, λ = 5, w = 0.5, C=24, γ=21.

Cross-Validation and Performance Measures

The performance of various predictors was evaluated by using 5-fold cross-validation with the following performance measures:12

Sn(i) = 1 − N−+(i)/N+(i), 0 ≤ Sn ≤ 1
Sp(i) = 1 − N+−(i)/N−(i), 0 ≤ Sp ≤ 1
Acc(i) = 1 − (N−+(i) + N+−(i))/(N+(i) + N−(i)), 0 ≤ Acc ≤ 1
MCC(i) = [1 − (N−+(i)/N+(i) + N+−(i)/N−(i))] / √{[1 + (N+−(i) − N−+(i))/N+(i)][1 + (N−+(i) − N+−(i))/N−(i)]}, −1 ≤ MCC ≤ 1, (Equation 24)

where N+(i) is the total number of samples in class i; N−(i) is the total number of samples not in class i; N−+(i) is the number of class-i samples incorrectly predicted to be in other classes; N+−(i) is the number of samples from other classes incorrectly predicted to be in class i; i = 1, 2, …, Y; and Y is the number of classes of this system. For the first-layer prediction, Y = 2, and i represents the promoter (i = 1) or non-promoter (i = 2). Similarly, for the second-layer prediction, Y = 6, and i = 1, 2, 3, 4, 5, or 6 for σ24, σ28, σ32, σ38, σ54, or σ70 promoters, respectively. For the details of these performance measures, please refer to a recent study.12
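The four measures of Equation 24 for one class can be computed directly from the confusion counts; this sketch uses our own function name and the N−+/N+− convention described above.

```python
import math

def class_metrics(n_pos, n_neg, fn, fp):
    """Equation 24 for one class i.
    n_pos, n_neg: numbers of positive and negative samples;
    fn: positives predicted negative (N−+); fp: negatives predicted
    positive (N+−)."""
    sn = 1 - fn / n_pos
    sp = 1 - fp / n_neg
    acc = 1 - (fn + fp) / (n_pos + n_neg)
    mcc = (1 - (fn / n_pos + fp / n_neg)) / math.sqrt(
        (1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg))
    return sn, sp, acc, mcc
```

This formulation is algebraically equivalent to the usual TP/TN/FP/FN definitions of sensitivity, specificity, accuracy, and MCC.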

Author Contributions

B.L. provided the main idea of the manuscript and wrote the manuscript. K.L. did the experiments and wrote the manuscript.

Conflicts of Interest

The authors declare no competing interests.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61672184 and 61822306), the Fok Ying-Tung Education Foundation for Young Teachers in the Higher Education Institutions of China (161063), and the Scientific Research Foundation in Shenzhen (JCYJ20180306172207178).

References

  • 1.Borukhov S., Nudler E. RNA polymerase: the vehicle of transcription. Trends Microbiol. 2008;16:126–134. doi: 10.1016/j.tim.2007.12.006.
  • 2.Silva S.D.A.E., Echeverrigaray S. Bacterial Promoter Features Description and Their Application on E. coli In Silico Prediction and Recognition Approaches. Intech; 2012.
  • 3.Janga S.C., Collado-Vides J. Structure and evolution of gene regulatory networks in microbial genomes. Res. Microbiol. 2007;158:787–794. doi: 10.1016/j.resmic.2007.09.001.
  • 4.Potvin E., Sanschagrin F., Levesque R.C. Sigma factors in Pseudomonas aeruginosa. FEMS Microbiol. Rev. 2008;32:38–55. doi: 10.1111/j.1574-6976.2007.00092.x.
  • 5.Lin H., Deng E.Z., Ding H., Chen W., Chou K.C. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. 2014;42:12961–12972. doi: 10.1093/nar/gku1019.
  • 6.Chen W., Lei T.-Y., Jin D.-C., Lin H., Chou K.-C. PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. Anal. Biochem. 2014;456:53–60. doi: 10.1016/j.ab.2014.04.001.
  • 7.Li Q.Z., Lin H. The recognition and prediction of sigma70 promoters in Escherichia coli K-12. J. Theor. Biol. 2006;242:135–141. doi: 10.1016/j.jtbi.2006.02.007.
  • 8.He W., Jia C., Duan Y., Zou Q. 70ProPred: a predictor for discovering sigma70 promoters based on combining multiple features. BMC Syst. Biol. 2018;12(Suppl 4):44. doi: 10.1186/s12918-018-0570-1.
  • 9.Zhang C.T. A symmetrical theory of DNA sequences and its applications. J. Theor. Biol. 1997;187:297–306. doi: 10.1006/jtbi.1997.0401.
  • 10.Zhang C.T., Zhang R., Ou H.Y. The Z curve database: a graphic representation of genome sequences. Bioinformatics. 2003;19:593–599. doi: 10.1093/bioinformatics/btg041.
  • 11.Song K. Recognition of prokaryotic promoters based on a novel variable-window Z-curve method. Nucleic Acids Res. 2012;40:963–971. doi: 10.1093/nar/gkr795.
  • 12.Liu B., Yang F., Huang D.S., Chou K.-C. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics. 2018;34:33–40. doi: 10.1093/bioinformatics/btx579.
  • 13.Liu B., Long R., Chou K.-C. iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework. Bioinformatics. 2016;32:2411–2418. doi: 10.1093/bioinformatics/btw186.
  • 14.Chen W., Zhang X., Brooker J., Lin H., Zhang L., Chou K.-C. PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics. 2015;31:119–120. doi: 10.1093/bioinformatics/btu602.
  • 15.Liu B., Liu F., Fang L., Wang X., Chou K.-C. repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics. 2015;31:1307–1309. doi: 10.1093/bioinformatics/btu820.
  • 16.Schneider T.D., Stephens R.M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990;18:6097–6100. doi: 10.1093/nar/18.20.6097.
  • 17.Chou K.-C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005;21:10–19. doi: 10.1093/bioinformatics/bth466.
  • 18.Liu B. BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief. Bioinform. 2017. Published online December 19, 2017. doi: 10.1093/bib/bbx165.
  • 19.Liu B., Liu F., Wang X., Chen J., Fang L., Chou K.-C. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015;43(W1):W65–W71. doi: 10.1093/nar/gkv458.
  • 20.Li D., Ju Y., Zou Q. Protein Folds Prediction with Hierarchical Structured SVM. Curr. Proteomics. 2016;13:79–85.
  • 21.Zhang N., Sa Y., Guo Y., Lin W., Wang P., Feng Y. Discriminating Ramos and Jurkat Cells with Image Textures from Diffraction Imaging Flow Cytometry Based on a Support Vector Machine. Curr. Bioinform. 2018;13:50–56.
  • 22.Wang S.P., Zhang Q., Lu J., Cai Y.D. Analysis and Prediction of Nitrated Tyrosine Sites with the mRMR Method and Support Vector Machine Algorithm. Curr. Bioinform. 2018;13:3–13.
  • 23.Chen W., Lv H., Nie F., Lin H. i6mA-Pred: Identifying DNA N6-methyladenine sites in the rice genome. Bioinformatics. 2019;35:2796–2800. doi: 10.1093/bioinformatics/btz015.
  • 24.Feng P.M., Chen W., Lin H., Chou K.C. iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal. Biochem. 2013;442:118–125. doi: 10.1016/j.ab.2013.05.024.
  • 25.Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2012;12:2825–2830.
  • 26.Liu B., Wang S., Long R., Chou K.-C. iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics. 2017;33:35–41. doi: 10.1093/bioinformatics/btw539.
  • 27.Liu B., Yang F., Chou K.-C. 2L-piRNA: A two-layer ensemble classifier for identifying piwi-interacting RNAs and their function. Mol. Ther. Nucleic Acids. 2017;7:267–277. doi: 10.1016/j.omtn.2017.04.008.
  • 28.Lin C., Chen W., Qiu C., Wu Y., Krishnan S., Zou Q. LibD3C: Ensemble classifiers with a clustering and dynamic selection strategy. Neurocomputing. 2014;123:424–435.
  • 29.Zou Q., Guo J., Ju Y., Wu M., Zeng X., Hong Z. Improving tRNAscan-SE Annotation Results via Ensemble Classifiers. Mol. Inform. 2015;34:761–770. doi: 10.1002/minf.201500031.
  • 30.Zou Q., Wang Z., Guan X., Liu B., Wu Y., Lin Z. An approach for identifying cytokines based on a novel ensemble classifier. BioMed Res. Int. 2013;2013:686090. doi: 10.1155/2013/686090.
  • 31.Yan K., Fang X., Xu Y., Liu B. Protein Fold Recognition based on Multi-view Modeling. Bioinformatics. 2019. Published online January 21, 2019. doi: 10.1093/bioinformatics/btz040.
  • 32.Liu B., Zhu Y. ProtDec-LTR3.0: protein remote homology detection by incorporating profile-based features into Learning to Rank. IEEE Access. 2019. Published online July 18, 2019.
  • 33.Frey B.J., Dueck D. Clustering by passing messages between data points. Science. 2007;315:972–976. doi: 10.1126/science.1136800.
  • 34.Mitchell M. An Introduction to Genetic Algorithms. MIT Press; 1998.
