Published in final edited form as: J Theor Biol. 2012 Aug 3;312:65–75. doi: 10.1016/j.jtbi.2012.07.013

Comprehensive comparative analysis and identification of RNA-binding protein domains: multi-class classification and feature selection

Samad Jahandideh 1,*, Vinodh Srinivasasainagendra 1, Degui Zhi 1,*
PMCID: PMC3867591  NIHMSID: NIHMS397612  PMID: 22884576

Abstract

RNA-protein interactions play an important role in various cellular processes, such as protein synthesis, gene regulation, post-transcriptional gene regulation, alternative splicing, and infections by RNA viruses. In this study, an automatic procedure was designed, using the Gene Ontology Annotation (GOA) and Structural Classification of Proteins (SCOP) databases, to capture structurally solved RNA-binding protein domains in different subclasses. Subsequently, we applied a tuned multi-class SVM (TMCSVM), Random Forest (RF), and multi-class ℓ1/ℓq-regularized logistic regression (MCRLR) to analyze and classify RNA-binding protein domains based on a comprehensive set of sequence and structural features, and compared the prediction accuracy of these three state-of-the-art predictors. Our results show that TMCSVM outperforms the other methods, suggesting its potential as a useful tool for facilitating the multi-class prediction of RNA-binding protein domains. MCRLR, on the other hand, by elucidating the contribution of individual features to the predictive accuracy for RNA-binding protein domain subclasses, helps to provide biological insight into the roles of sequences and structures in protein–RNA interactions.

Keywords: RNA-binding domain, tuned multi-class SVM, Random Forest, Multi-class ℓ1/ℓq -regularized logistic regression, Prediction

1. Introduction

Regulation of biological processes happens through the association and dissociation of macromolecules, i.e., proteins, RNA, and DNA. Furthermore, the functional components of cells are frequently complex assemblies of macromolecules. At the molecular level, RNA-protein complexes play an important role in various cellular processes, such as protein synthesis, gene regulation, post-transcriptional gene regulation, alternative splicing, and infections by RNA viruses. Therefore, understanding the principles of RNA-protein interactions is important, and prediction of RNA-binding proteins is essential for identifying the cellular processes in which RNA-protein complexes are involved.

It is commonly accepted that RNA recognition by proteins is mainly mediated by specific kinds of RNA-binding domains (RBDs) (Morozova et al., 2006; Shulman-Peleg et al., 2008). The RBDs can be classified into different subclasses based on their basic binding motifs, e.g., the KH domain, the double-stranded RNA-binding domain (dsRBD), and the zinc finger motif (Chen and Varani, 2005). Although new RBDs have been identified in recent years (Parker and Barford, 2006), an increasing amount of evidence on non-coding RNAs suggests that more RBDs remain to be identified (Lingel and Sattler, 2005).

To understand the functional importance of RNA in close relationship with proteins, computational studies of RNA-protein complexes have increased significantly (Ellis et al., 2007; Jones et al., 2001). Recently, a variety of approaches have been proposed to study RNA-protein interactions (Lunde et al., 2007). Although some interesting results have been obtained, the precise details of RNA-protein interaction are far from fully understood. For this reason, it is important to develop reliable computational methods to accurately predict RNA-binding proteins and to analyze the features important in RNA-protein interaction.

Homology-based methods are the most common approach to identifying the class of unknown proteins at the sequence or structure level. These methods are limited by the absence of experimentally annotated homologous proteins in protein databases. Hence it is strongly encouraged to develop computational tools that identify RNA-binding proteins (RBPs) using sequence- and structure-derived features. Most previous investigations predict RBPs using sequence-derived features (Han et al., 2004; Shao et al., 2009; Yu et al., 2006). In addition to these sequence-based methods, to date only one investigation, by Shazman and Mandel-Gutfreund (2008), has developed a structure-based method to predict RBPs. Shazman and Mandel-Gutfreund developed a multi-SVM-based method using four subgroups of features to describe the global composition of each protein: (i) largest patch parameters (such as patch size and patch surface accessibility), (ii) protein parameters (such as molecular weight), (iii) cleft/patch parameters (such as the overlap between the largest, second largest, and third largest clefts and the largest patch), and (iv) parameters related to other surface patches (such as the number of residues in the lysine-out patch and in the negative patch). Using the jackknife test, they reported 75.61% prediction accuracy for three subclasses of RBPs: tRNA-, rRNA-, and mRNA-binding proteins. In comparison with our work, their method is limited to three classes of RBPs, and their data collection was performed manually, which is less accurate. Despite the availability of several methods, identification of RBPs from sequence information with high accuracy is still a major challenge.

Here we present a comprehensive performance evaluation of several state-of-the-art predictor methods on an important problem, i.e., classifying RBDs using sequence- and structure-derived information. Combining a diverse set of features, we developed three different methods: tuned multi-class SVM (TMCSVM), Random Forest (RF), and multi-class ℓ1/ℓq-regularized logistic regression (MCRLR). By applying these methods, we show that RBDs can be classified based on their RNA target (7S, double-stranded, tRNA, rRNA, or mRNA). None of the five RBP subclasses contains an exclusive RNA-binding motif; nevertheless, in addition to successfully classifying RBPs, we discovered dissimilar sequence and structural features among them.

2. Materials and methods

2.1. Automatic dataset harvesting

Because most similar works on the prediction of RNA-binding proteins relied on manually collected and annotated datasets, in this work we constructed a dataset of non-redundant RNA-binding protein domains through a more accurate, automated data-harvesting procedure, using two main resources: (i) the Gene Ontology Annotation (GOA) database, available at http://www.ebi.ac.uk/GOA/, which covers ~2.5 million associations of protein chains with Gene Ontology (GO) terms, and (ii) the 40% non-redundant set of the Structural Classification of Proteins (SCOP) 1.75 from the ASTRAL website. Based on the GO classification, the RNA binding root comprises 28 leaves. The first step of our automatic procedure was to search the GOA database, one GO ID at a time, for the protein chains associated with each RNA binding subclass. Briefly, GO is a major bioinformatics tool for the unification of biology; more specifically, one of its aims is the annotation of genes and gene products. GO contains three ontologies that describe the molecular functions, biological processes, and cellular components of proteins (Ashburner et al., 2000). For more details and a comprehensive discussion we refer to Chou and Shen (2006), as well as the discussion elaborated in Chou and Shen (2008). The second step was a search across SCOP 1.75 to capture non-redundant RNA binding protein domains in the different subclasses (Fig. 1). We eliminated protein domains associated with more than one RNA binding subclass.
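As an illustration, this two-step procedure can be sketched in R. The sketch below is minimal and relies on assumed inputs: a hypothetical goa data frame (protein chain to GO ID associations parsed from the GOA file) and a hypothetical astral40 data frame (SCOP 1.75 domain identifiers from the ASTRAL 40% set mapped to PDB chains); the column names and the example GO IDs are assumptions for illustration, not taken from the original pipeline.

```r
# Minimal sketch of the automatic dataset-harvesting procedure (assumed inputs).
# goa:      data frame with columns pdb_chain, go_id       (parsed GOA associations)
# astral40: data frame with columns scop_domain, pdb_chain (ASTRAL 40% set, SCOP 1.75)

rna_binding_go_ids <- c("GO:0019843",   # rRNA binding (illustrative leaf terms)
                        "GO:0000049",   # tRNA binding
                        "GO:0003729")   # mRNA binding
# ...in the paper, all 28 leaf terms under the RNA binding root are searched

# Step 1: protein chains annotated with each RNA-binding subclass GO ID
hits <- unique(goa[goa$go_id %in% rna_binding_go_ids, c("pdb_chain", "go_id")])

# Step 2: map the annotated chains onto non-redundant SCOP 1.75 domains
domains <- merge(astral40, hits, by = "pdb_chain")

# Eliminate domains associated with more than one RNA-binding subclass
n_subclasses <- tapply(domains$go_id, domains$scop_domain,
                       function(g) length(unique(g)))
dataset <- domains[domains$scop_domain %in% names(n_subclasses)[n_subclasses == 1], ]
```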

Figure 1. The proposed automatic procedure for dataset harvesting.

2.2. Feature generation

In this study a combination of sequence- and structure-derived features was used for prediction of RNA-binding protein domains. Our representation of the protein sequence is a general form of Chou’s pseudo amino acid composition (Chou, 2011). Indeed, to avoid losing the important information hidden in protein sequences, the pseudo amino acid composition (PseAAC) was proposed (Chou, 2001; Chou, 2005) to replace the simple amino acid composition (AAC) for representing a protein sample. For a summary of its recent development and applications, see a comprehensive review (Chou, 2009). Ever since the concept of PseAAC was proposed by Chou in 2001, it has rapidly penetrated almost all fields of protein attribute prediction (Chen et al., 2009; Ding et al., 2009; Esmaeili et al., 2010; Georgiou et al., 2009; Guo et al., 2011; Hayat and Khan, 2012; Hu et al., 2011; Li et al., 2012; Lin, 2008; Liu et al., 2012; Mei, 2012; Mohabatkar, 2010; Mohabatkar et al., 2011; Nanni et al., 2012; Qin et al., 2012; Qiu et al., 2009; Qiu et al., 2011; Yu et al., 2010; Zhang and Fang, 2008; Zhao et al., 2012; Zou et al., 2011). According to Eq. 6 of a recent comprehensive review (Chou, 2011), the general form of PseAAC can be formulated as

P = [\psi_1 \;\; \psi_2 \;\; \cdots \;\; \psi_u \;\; \cdots \;\; \psi_\Omega]^{T} \qquad (1)

where T is the transpose operator, while the subscript Ω is an integer whose value, as well as the components ψ1, ψ2, …, will be defined by a series of feature extractions as elaborated below.

In addition to the sequence-derived features, structure-derived features were generated in this study. In total, 267 different sequence- and structure-derived features were generated from several information sources; they can be classified into six major subgroups (a minimal computational sketch for a few of these features follows this list):

  1. Sequence-derived features including: (i) composition of all 20 amino acids (20 features), (ii) composition of amino acids in 9 different physicochemical groups, i.e., tiny, small, aliphatic, aromatic, polar, non-polar, charged, acidic, and basic amino acids (9 features), (iii) pI, the isoelectric point (1 feature), (iv) molecular weight (1 feature), and (v) number of residues and number of atoms (2 features). This subgroup of features was generated using the seqinr package (version 3.0-3) in the R environment.

  2. Secondary structure features including: (i) composition of all 20 amino acids and composition of amino acids in physicochemical groups within three different secondary structures, i.e., helix, sheet, and random coil (87 features), and (ii) composition of 6 different secondary structures, i.e., H (α-helix), G (310 helix), E (extended β-strand), B (isolated β-bridge), T (turn), and S (bend) (6 features). Secondary structure parameters for each protein domain were computed from the output of the program DSSP (Kabsch and Sander, 1983). To calculate compositions within the three broad secondary structure classes, the DSSP states were reduced to three classes (H, G → H; E → E; all other states → C).

  3. Solvent accessibility features including: composition of all 20 amino acids and composition of amino acids in physicochemical groups within three different solvent accessibility states, i.e., buried, intermediate, and exposed (87 features). Based on standard ranges of solvent accessibility values (SAV), three solvent accessibility states are defined: the buried state, B, is assigned to residues with 0 ≤ SAV ≤ 0.16, the intermediate state, I, to residues with 0.16 < SAV ≤ 0.36, and the exposed state, E, to residues with 0.36 < SAV ≤ 1. Solvent accessibility values of residues were computed using the ASAView program (Ahmad et al., 2004).

  4. Hydrogen bond features: the hydrogen bond from the backbone CO of residue i to the backbone NH of residue i+N is denoted H-bond (i, i+N). In this study we computed the frequencies of H-bond (i, i+N) for N = -5, -4, -3, …, 3, 4, 5. Furthermore, the total number of hydrogen bonds and the numbers of parallel and antiparallel hydrogen bonds were computed. The values of these features were divided by the length of the protein domain. The output of the program DSSP (Kabsch and Sander, 1983) was used to generate these features (13 features).

  5. Electrostatic property features: eight electrostatic property features, including net molecular charge, net molecular charge per atom, overall molecular dipole moment in debyes, net molecular dipole moment per atom, number of positively charged residues, and number of negatively charged residues, were calculated using the Protein Dipole Moments Server (http://bip.weizmann.ac.il/dipol/).

  6. Patch features: the main, second, and third patch sizes, the main patch’s molecular weight, and the composition of all 20 amino acids and of amino acids in physicochemical groups within the main patch were calculated (33 features). To extract all continuous positive patches on the protein surface, the PatchFinder algorithm (Stawiski et al., 2003) was used. The patches were sorted by the number of grid points contained within each patch, and the three largest patches were selected.
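As a rough illustration of how a few of the features above can be computed, the following sketch uses the seqinr package for part of the sequence-derived subgroup and shows the DSSP state reduction and solvent-accessibility binning; the per-residue vectors dssp_states and sav are hypothetical names standing in for parsed DSSP and ASAView output.

```r
library(seqinr)

# Sequence-derived features (subgroup 1) for a domain given as a vector of
# one-letter amino acid codes, e.g. seq <- s2c("MKVLAT...").
AA <- c("A","C","D","E","F","G","H","I","K","L",
        "M","N","P","Q","R","S","T","V","W","Y")
aa_comp <- table(factor(seq, levels = AA)) / length(seq)  # composition of the 20 amino acids
pI      <- computePI(seq)                                 # isoelectric point
mol_wt  <- pmw(seq)                                       # molecular weight
n_res   <- length(seq)                                    # number of residues

# Secondary structure reduction (subgroup 2): DSSP states to three classes.
# dssp_states is an assumed per-residue vector of DSSP codes (H, G, E, B, T, S, ...).
ss3 <- ifelse(dssp_states %in% c("H", "G"), "H",
       ifelse(dssp_states == "E",           "E", "C"))

# Solvent accessibility states (subgroup 3): buried / intermediate / exposed.
# sav is an assumed per-residue vector of relative accessibility values in [0, 1].
acc_state <- cut(sav, breaks = c(-Inf, 0.16, 0.36, 1),
                 labels = c("buried", "intermediate", "exposed"))

# Composition of each amino acid within each accessibility state
# (one slice of the 87 solvent-accessibility features)
acc_comp <- table(acc_state, factor(seq, levels = AA)) / length(seq)
```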

2.3. Predictor methods

In this study, we used three different predictor methods, tuned multi-class SVM (TMCSVM), Random Forest (RF), and multi-class regularized logistic regression (MCRLR), to classify RBDs into three and five subclasses. The jackknife test was used for training and testing on the databases. In the jackknife test, one case is removed from the database and training is done using the remaining cases; testing is then done on the removed case. This procedure is repeated until all cases have been tested. Although this method is time-consuming, it is well suited to small databases such as ours. In addition to the jackknife test, we also used the self-consistency test to evaluate the prediction results. Both the jackknife and self-consistency tests are considered rigorous and objective methods for evaluating predictions.

Among the independent dataset test, the sub-sampling (e.g., 5- or 10-fold cross-validation) test, and the jackknife test, which are often used for examining the accuracy of a statistical prediction method (Chou and Zhang, 1995), the jackknife test is deemed the least arbitrary because it always yields a unique result for a given benchmark dataset, as elucidated in (Chou and Shen, 2008) and demonstrated by Eqs. 28–32 of (Chou, 2011). Therefore, the jackknife test has been widely recognized and increasingly used by investigators to test the power of various prediction methods (see, e.g., Chen et al., 2009; Chou et al., 2011; Chou et al., 2012; Ding et al., 2009; Esmaeili et al., 2010; Georgiou et al., 2009; Gu et al., 2010; Jiang et al., 2008; Lin, 2008; Li and Li, 2008; Lin et al., 2008; Lin and Wang, 2011; Li et al., 2012; Mei, 2012; Mohabatkar, 2010; Mohabatkar et al., 2011; Qiu et al., 2010; Wu et al., 2011; Wu et al., 2012; Xiao et al., 2011a; Xiao et al., 2011b; Xiao et al., 2012; Yu et al., 2010; Zeng et al., 2009; Zhang and Fang, 2008; Zhang et al., 2008; Zhou et al., 2007).
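For concreteness, the jackknife (leave-one-out) procedure described above can be written as a simple loop. The sketch below is generic, assuming a feature matrix X, a factor y of subclass labels, and a fit/predict pair standing in for any of the three classifiers (for example, a thin wrapper around e1071::svm and its predict method).

```r
# Minimal jackknife (leave-one-out) sketch: remove one case, train on the rest,
# predict the removed case, and repeat until every case has been tested.
jackknife_eval <- function(X, y, fit, predict_fn) {
  n <- nrow(X)
  pred <- character(n)
  for (i in seq_len(n)) {
    model   <- fit(X[-i, , drop = FALSE], y[-i])                     # train without case i
    pred[i] <- as.character(predict_fn(model, X[i, , drop = FALSE])) # test on case i
  }
  pred <- factor(pred, levels = levels(y))
  list(per_class = diag(prop.table(table(y, pred), 1)),  # correct-prediction rate per subclass
       overall   = mean(pred == y))                      # overall accuracy
}
```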

2.3.1. Tuned multi-class support vector machine

Basically, the support vector machine (SVM) is a type of learning machine based on statistical learning theory. SVMs have three remarkable characteristics: the absence of local minima, the sparseness of the solution, and implementation using the kernel Adatron algorithm. The kernel Adatron maps inputs to a high-dimensional feature space and then optimally separates the data into their respective classes by isolating those inputs which fall close to the class boundaries. Therefore, the kernel Adatron is especially effective in separating sets of data which share complex boundaries. Because the SVM training process seeks a globally optimal solution and avoids over-fitting, dealing with a large number of features is possible. In their basic form, SVMs are used for classification rather than function approximation. The theory and algorithms of SVMs can be found in Vapnik (1995, 1998).

In this study, we applied the tune function of the e1071 package in the R environment (version 2.11-1) to develop our multi-class SVM based method. Multi-class SVM in e1071 uses the “one-against-one” strategy, i.e., binary classification between all pairs of classes followed by voting. The tune function uses a grid search to find the best kernel function and parameters. Using the tune function within the jackknife procedure provides as many tuning runs as there are cases in the database, selecting an optimum model structure each time.
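A minimal sketch of this tuning step with e1071 is shown below. The data frame rbd (with a factor column subclass plus the numeric feature columns) and the grid of cost/gamma values are assumptions for illustration; the paper does not report the exact search ranges.

```r
library(e1071)

# rbd: assumed data frame, one row per domain, factor column 'subclass' + numeric features.
kernels <- c("linear", "polynomial", "radial", "sigmoid")
set.seed(1)

# Grid search over cost/gamma for each candidate kernel, with leave-one-out CV.
fits <- lapply(kernels, function(k)
  tune(svm, subclass ~ ., data = rbd, kernel = k,
       ranges = list(cost = 2^(-2:6), gamma = 2^(-6:2)),
       tunecontrol = tune.control(sampling = "cross", cross = nrow(rbd))))

# Keep the kernel/parameter combination with the lowest cross-validated error.
best <- fits[[which.min(sapply(fits, function(f) f$best.performance))]]
best_model <- best$best.model   # multi-class SVM via "one-against-one" voting
```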

2.3.2. Random Forest

Random Forest (RF) was developed by Breiman in 2001 (Breiman, 2001). RF classification extends the concept of decision trees and has been successfully used in various biological problems (Dudoit et al., 2002; Statnikov et al., 2008; Jia and Hu, 2011; Kandaswamy et al., 2011; Lin et al., 2011; Pugalenthi et al., 2012; Qiu and Wang, 2011; Shameer et al., 2011). RF is a collection of decision trees instead of one tree, where each tree is trained using a bootstrap sample from the training dataset. The trees are grown using a randomly selected subset of predictors at each node. After all trees have been constructed, a new object is classified by the class label with the most votes, where every tree in the forest casts a vote. Finally, predictive performance is estimated using the observations left out of the bootstrap sample, termed the out-of-bag (OOB) observations. An appeal of RF is that the forest of trees contains a large amount of information about the relationship between the variables and the observations; this information can be used for prediction, clustering, imputing missing data, and detecting outliers. The RF algorithm was implemented with the randomForest (version 4.6-2) R package (Liaw and Wiener, 2002). We used the tuneRF function, with the number of trees and stepFactor set to 1000 and 2, respectively. The default values provided by the package were used for the remaining parameters.
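A sketch of this RF setup is given below, with X an assumed feature matrix and y the subclass factor; ntree = 1000 and stepFactor = 2 follow the text, while the remaining values (improve threshold, seed) are illustrative. The class-proportion cutoff described later in Section 3.4 is also shown.

```r
library(randomForest)

# X: assumed numeric feature matrix (one row per domain); y: factor of RBD subclasses.
set.seed(1)

# tuneRF searches for a good 'mtry' (features tried at each split),
# doubling/halving from the default by stepFactor = 2.
mtry_search <- tuneRF(X, y, ntreeTry = 1000, stepFactor = 2,
                      improve = 0.01, trace = FALSE, plot = FALSE)
best_mtry <- mtry_search[which.min(mtry_search[, "OOBError"]), "mtry"]

# Class cutoffs proportional to the subclass distribution (cf. Section 3.4).
cutoff <- as.numeric(table(y)) / length(y)

rf <- randomForest(X, y, ntree = 1000, mtry = best_mtry, cutoff = cutoff)
print(rf$confusion)   # confusion matrix based on the out-of-bag (OOB) observations
```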

2.3.3. Multi-class ℓ1/ℓq-regularized logistic regression

The multi-class ℓ1/ℓq-regularized logistic regression model used in this study is a generalization of ℓ1-regularized logistic regression. Its strong theoretical guarantees and great empirical success stem from recent studies in areas such as machine learning, statistics, and applied mathematics (Bach, 2008; Duchi and Singer, 2009; Kowalski, 2009; Negahban et al., 2009; Yuan and Lin, 2006).

The multi-class ℓ1/ℓq-regularized logistic regression minimizes an objective of the form

\min_{x}\ \sum_{\ell=1}^{k}\sum_{i=1}^{m} w_i^{\ell}\,\log\Bigl(1+\exp\bigl(-y_i^{\ell}\bigl(x_\ell^{T} a_i^{\ell}+c_\ell\bigr)\bigr)\Bigr)+\lambda\,\lVert x\rVert_{\ell_1/\ell_q} \qquad (2)

where a_i^ℓ denotes a feature vector of size 1 × n (n is the number of features) for the i-th protein domain with respect to the ℓ-th RBD subclass, x_ℓ is the weight vector (the ℓ-th column of the weight matrix x), w_i^ℓ is the weight for a_i^ℓ, y_i^ℓ is the corresponding ±1 response, and c_ℓ is the intercept for the ℓ-th RBD subclass. To construct the multi-class ℓ1/ℓq-regularized logistic regression we used the mcLogisticR function of the SLEP package (version 4.0), which is written in Matlab. In this function, y is required to be an m × k matrix with elements 1 or −1 (m is the number of protein domains and k is the number of RBD subclasses).
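The authors fitted this model with the Matlab SLEP package; as a rough R analogue (explicitly not the mcLogisticR call itself), a multinomial logistic regression with a grouped ℓ1/ℓ2 penalty across classes can be fitted with glmnet. The grouped penalty shrinks a feature's whole vector of per-class coefficients to zero together, which mimics the feature-selection behaviour of the ℓ1/ℓq regularization described here.

```r
library(glmnet)

# X: assumed numeric feature matrix (domains x features); y: factor of RBD subclasses.
# type.multinomial = "grouped" applies an l1/l2 penalty to each feature's vector of
# class coefficients, so a feature is kept or dropped for all subclasses at once.
cv <- cv.glmnet(as.matrix(X), y, family = "multinomial",
                type.multinomial = "grouped", alpha = 1)

# Per-subclass coefficient vectors at the cross-validated lambda; non-zero entries
# play a role analogous to the preferred/avoided feature weights reported in Tables 4-7.
coefs <- coef(cv, s = "lambda.min")
```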

3. Results

3.1. Construction of dataset

The constructed dataset covers 7 of the 28 RNA binding subclasses with at least one protein domain member, including 7S RNA binding (10 protein domains), double-stranded RNA binding (16 protein domains), mRNA binding (11 protein domains), rRNA binding (29 protein domains), tRNA binding (16 protein domains), translational factor activity RNA binding (2 protein domains), and single-stranded RNA binding (1 protein domain). The RBDs of our dataset are summarized in Table 1. In constructing our methods we eliminated subclasses with fewer than 10 protein domains. We then constructed methods for the prediction of five subclasses (tRNA-, rRNA-, mRNA-, 7S-, and double-stranded RNA-binding domain subclasses) and three subclasses (tRNA-, rRNA-, and mRNA-binding domain subclasses).

Table 1.

Summary of the RNA binding domains in our dataset.

No. Protein Domain GO term Class Fold Superfamily Family Domain Species
1 d1914a1 7S_RNA_binding α+β SRP9/14 SRP9/14 SRP9/14 SRP9 Mouse
2 d1914a2 7S_RNA_binding α+β SRP9/14 SRP9/14 SRP9/14 SRP14 Mouse
3 d1hq1a_ 7S_RNA_binding α SPBD SPBD SPBD SSBP Ffh EC
4 d1jida_ 7S_RNA_binding α+β SRP19 SRP19 SRP19 SRP19 Human
5 d1kvva_ 7S_RNA_binding α+β SRP19 SRP19 SRP19 SRP19 AAF
6 d1lnga_ 7S_RNA_binding α+β SRP19 SRP19 SRP19 SRP19 AMJ
7 d1ls1a1 7S_RNA_binding α FHUDB D-SRP/SRP receptor G-proteins D-SRP/SRP receptor G-proteins SSBP Ffh TA
8 d1ls1a2 7S_RNA_binding α/β P-loop NTP hydrolases P-loop NTP hydrolases Nitrogenase iron protein-like GTPase domain of SSBP Ffh TA
9 d1qb2a_ 7S_RNA_binding α SPBD SPBD SPBD SRP54M Human
10 d1qzxa2 7S_RNA_binding α SPBD SPBD SPBD SSBP Ffh ASS
11 d1di2a_ DS_RNA_binding α+β dsRBD-like dsRBD-like dsRBD dsRBD A XL
12 d1ekza_ DS_RNA_binding α+β dsRBD-like dsRBD-like dsRBD Staufen, domain III DM
13 d1o0wa1 DS_RNA_binding α RNase III domain-like RNase III domain-like RNase III catalytic domain-like RNase III ECD TM
14 d1o0wa2 DS_RNA_binding α+β dsRBD-like dsRBD-like dsRBD RNase III, C-terminal domain TM
15 d1t4oa_ DS_RNA_binding α+β dsRBD-like dsRBD-like dsRBD RNase III, C-terminal domain SC
16 d1uhza_ DS_RNA_binding α+β dsRBD-like dsRBD-like dsRBD staufen homolog 2 Mouse
17 d1uila_ DS_RNA_binding α+β dsRBD-like dsRBD-like dsRBD ATP-dep RNA helicase A, Dhx9 Mouse
18 d1whna_ DS_RNA_binding α+β dsRBD-like dsRBD-like dsRBD Dus2l Mouse
19 d1whqa_ DS_RNA_binding α+β dsRBD-like dsRBD-like dsRBD ATP-dependent RNA helicase A, Dhx9 Mouse
20 d1x47a1 DS_RNA_binding α+β dsRBD-like dsRBD-like dsRBD Dgcr8 protein Human
21 d1x48a1 DS_RNA_binding α+β dsRBD-like dsRBD-like dsRBD dsRNA-dependent protein kinase pkr Mouse
22 d1x49a1 DS_RNA_binding α+β dsRBD-like dsRBD-like dsRBD dsRNA-dependent protein kinase pkr Mouse
23 d2dixa1 DS_RNA_binding α+β dsRBD-like dsRBD-like dsRBD Interferon -ids RNA DPK activator A Human
24 d2dmya1 DS_RNA_binding α+β dsRBD-like dsRBD-like dsRBD Spermatid perinuclear RBP Human
25 d2nuga1 DS_RNA_binding α RNase III domain-like RNase III domain-like RNase III catalytic domain-like RNase III ECD AA
26 d2nuga2 DS_RNA_binding α+β dsRBD-like dsRNA-binding domain-like dsRBD RNase III, C-terminal domain AA
27 d1afwa1 mRNA_binding α/β Thiolase-like Thiolase-like Thiolase-related Thiolase SC
28 d1j1ja_ mRNA_binding α α-α superhelix Translin Translin Translin Human
29 d1kvka1 mRNA_binding α+β RP S5 domain 2-like RP S5 domain 2-like GHMP Kinase, N-terminal domain Mevalonate kinase RN
30 d1kvka2 mRNA_binding α+β Ferredoxin-like GHMP Kinase, C-terminal domain Mevalonate kinase Mevalonate kinase RN
31 d1l5ja1 mRNA_binding α α-α superhelix Aconitase B, N-terminal domain Aconitase B, N-terminal domain Aconitase B, N-terminal domain EC
32 d1l5ja2 mRNA_binding α/β The “swivelling” β/β/α domain LeuD/IlvD-like LeuD-like Aconitase B, second N-terminal domain EC
33 d1l5ja3 mRNA_binding α/β Aconitase iron-sulfur domain Aconitase iron-sulfur domain Aconitase iron-sulfur domain Aconitase B, C-terminal domain EC
34 d1p5fa_ mRNA_binding α/β Flavodoxin-like Class I glutamine AM-like DJ-1/PfpI DJ-1 Human
35 d1q67a_ mRNA_binding β PH domain-like barrel PH domain-like Dcp1 Dcp1 SC
36 d1xlya_ mRNA_binding α RNA-binding protein She2p RNA-binding protein She2p RNA-binding protein She2p RNA-binding protein She2p SC
37 d3gcba_ mRNA_binding α/β Cysteine proteinases Cysteine proteinases Papain-like Bleomycin hydrolase SC
38 d1a32a_ rRNA_binding α S15/NS1 RNA-binding domain S15/NS1 RNA-binding domain RP S15 RP S15 BST
39 d1diva1 rRNA_binding α+β RP L9 C-domain RP L9 C-domain RP L9 C-domain RP L9 C-domain BST
40 d1dmga_ rRNA_binding α/β RP L4 RP L4 RP L4 RP L4 TM
41 d1egaa2 rRNA_binding α+β α-lytic protease prodomain-like Prokaryotic type KH domain Prokaryotic type KH domain GTPase Era C-terminal domain EC
42 d1feua_ rRNA_binding β RP L25-like RP L25-like RPL25-like RP TL5 (general stress protein CTC) TT
43 d1i4ja_ rRNA_binding α+β RP L22 RP L22 RP L22 RP L22 TA
44 d1i6ua_ rRNA_binding α+β RP S8 RP S8 RP S8 RP S8 AMJ
45 d1iqva_ rRNA_binding α Ribosomal protein S7 RP S7 RP S7 RP S7 APH
46 d1loua_ rRNA_binding α+β Ferredoxin-like RP S6 RP S6 RP S6 TT
47 d1n0ua1 rRNA_binding β R/I/E factor common domain Translation proteins EF eEF-2, domain II SC
48 d1n0ua2 rRNA_binding α/β P-loop NTP hydrolases P-loop NTP hydrolases G proteins eEF-2, N-terminal (G) domain SC
49 d1n0ua3 rRNA_binding α+β RP S5 domain 2-like RP S5 domain 2-like TMC eEF-2, domain IV SC
50 d1n0ua4 rRNA_binding α+β Ferredoxin-like EF-G C-terminal domain-like EF-G/eEF-2 domains III and V eEF-2 SC
51 d1n0ua5 rRNA_binding α+β Ferredoxin-like EF-G C-terminal domain-like EF-G/eEF-2 domains III and V eEF-2 SC
52 d1pkpa1 rRNA_binding α+β RP S5 domain 2-like RP S5 domain 2-like TMC RP S5, N-terminal domain BST
53 d1pkpa2 rRNA_binding α+β dsRBD-like dsRBD-like RP S5, N-terminal domain RP S5, N-terminal domain BST
54 d1rl6a1 rRNA_binding α+β RP L6 RP L6 RP L6 RP L6 BST
55 d1rl6a2 rRNA_binding α+β RP L6 RP L6 RP L6 RP L6 BST
56 d1seia_ rRNA_binding α+β RP S8 RP S8 RP S8 RP S8 BST
57 d1vmba_ rRNA_binding α+β Ferredoxin-like RP S6 RP S6 RP S6 TM
58 d1vqoa1 rRNA_binding β SH3-like barrel TP SH3-like domain C-terminal domain of RP L2 C-terminal domain of RP L2 AHM
59 d1vqoa2 rRNA_binding β OB-fold Nucleic acid-binding proteins Cold shock DNA-binding domain-like N-terminal domain of RP L2 AHM
60 d1wf3a1 rRNA_binding α/β P-loop NTP hydrolases P-loop NTP hydrolases G proteins GTPase Era, N-terminal domain TT
61 d1wf3a2 rRNA_binding α+β α-lytic protease prodomain-like Prokaryotic type KH domain Prokaryotic type KH domain GTPase Era C-terminal domain TT
62 d1whia_ rRNA_binding β RP L14 RP L14 RP L14 RP L14 BS
63 d2cqla1 rRNA_binding α+β RP L6 RP L6 RP L6 RP L6 Human
64 d2j5aa1 rRNA_binding α+β Ferredoxin-like RP S6 RP S6 RP S6 AA
65 d2v3ka1 rRNA_binding α/β α/β knot α/β knot EMG1/NEP1-like EMG1 SC
66 d3bbda1 rRNA_binding α/β α/β knot α/β knot EMG1/NEP1-like RBP NEP1 MJ
67 d1dm9a_ SS_ RNA_binding α+β α-L RNA-binding motif α-L RNA-binding motif Heat shock protein 15 kD HSP 15 Kd EC
68 d1d7qa_ TFA_ RNA_binding β OB-fold Nucleic acid-binding proteins Cold shock DNA-binding domain-like eIF1a Human
69 d2if1a_ TFA_ RNA_binding α+β eIF1-like eIF1-like eIF1-like eIF- 1 (SUI1) Human
70 d1a6fa_ tRNA_binding α+β RP S5 domain 2-like RP S5 domain 2-like RNase P protein RNase P protein BSU
71 d1dj0a_ tRNA_binding α+β Pseudouridine synthase Pseudouridine synthase Pseudouridine synthase I TruA Pseudouridine synthase I TruA EC
72 d1fl0a_ tRNA_binding β OB-fold Nucleic acid-binding proteins Myf domain EMAP II Human
73 d1gd7a_ tRNA_binding β OB-fold Nucleic acid-binding proteins Myf domain TRBP111 homolog CsaA TT
74 d1jjca_ tRNA_binding α+β Class II aaRS and biotin synthetases Class II aaRS and biotin synthetases Class I A-tRNA S- like, catalytic domain PheRS alpha subunit TT
75 d1nz0a_ tRNA_binding α+β RP S5 domain 2-like RP S5 domain 2-like RNase P protein RNase P protein TM
76 d1ou5a1 tRNA_binding α Poly A PCT region-like Poly A PCT region-like Poly A PCT region-like tRNA CCA-adding E, C-terminal domains HM
77 d1ou5a2 tRNA_binding α+β Nucleotidyltransferase Nucleotidyltransferase Poly A polymerase head domain-like tRNA CCA-adding E, head domain HM
78 d1pyba_ tRNA_binding β OB-fold Nucleic acid-binding proteins Myf domain TRBP111 AA
79 d1r6la1 tRNA_binding α+β RP S5 domain 2-like RP S5 domain 2-like Ribonuclease PH domain 1-like Ribonuclease PH, domain 1 PseA
80 d1r6la2 tRNA_binding α+β RPH domain 2-like RPH domain 2-like Ribonuclease PH domain 2-like Ribonuclease PH, domain 2 PseA
81 d1rqga1 tRNA_binding α ABD of a subclass of class I AA-tRNA-S ABD of a subclass of class I AA-tRNA-S ABD of a subclass of class I AA-tRNA-S MetRS PA
82 d1rqga2 tRNA_binding α/β AN-α hydrolase-like Nucleotidylyl transferase Class I A-tRNA S, catalytic domain MetRS PA
83 d2c5sa1 tRNA_binding α/β AN-α hydrolase-like AN-α hydrolase-like ThiI-like TBP ThiI, N-ter D BA
84 d2c5sa2 tRNA_binding α+β THUMP domain THUMP domain-like THUMP domain TBP ThiI, N-ter D BA
85 d2iy5a1 tRNA_binding α Long alpha-hairpin tRNA-binding arm PheRS PheRS TT

DS_RNA_binding: double-stranded_RNA_binding, SS_RNA_binding: single-stranded_RNA_binding, TFA_RNA_binding: translational_factor_activity_RNA_binding, SRP: Signal recognition particle alu RNA binding heterodimer, SPBD: Signal peptide-binding domain, FHUDB: Four-helical up-and-down bundle, P-loop NTP hydrolases: P-loop containing nucleoside triphosphate hydrolases, R/I/E factor common domain: Reductase/isomerase/elongation factor common domain, RP: Ribosomal protein, Class II aaRS and BS: Class II aaRS and biotin synthetases, Poly A PCT region-like: Poly A polymerase C-terminal region-like, RPH domain 2-like: Ribonuclease PH domain 2-like, ABD of a subclass of class I AA-tRNA-S: Anticodon-binding domain of a subclass of class I aminoacyl-tRNA synthetases, AN-α hydrolase-like: Adenine nucleotide alpha hydrolase-like, TP SH3-like domain: Translation proteins SH3-like domain, D-SRP/SRP receptor G-proteins: Domain of the SRP/SRP receptor G-proteins, Class I glutamine AM-like: Class I glutamine amidotransferase-like, SSBP: Signal sequence binding protein Ffh, RNase III ECD: RNase III endonuclease catalytic domain, PheRS: Phenylalanyl-tRNA synthetase, TBP ThiI, N-ter D: Thiamine biosynthesis protein ThiI, N-terminal domain, MetRS: Methionyl-tRNA synthetase, TRBP: tRNA-binding protein, tRNA CCA-adding E: tRNA CCA-adding enzyme, HSP: Heat shock protein, RBP: Ribosome biogenesis protein, EMG: Essential for mitotic growth, Interferon-ids RNA DPK activator A: Interferon-inducible double stranded RNA-dependent protein kinase activator A, Dus: dihydrouridine synthase, TMC: Translational machinery components. Species abbreviations: AAF: Archaeon Archaeoglobus fulgidus, TT: Thermus thermophilus, BA: Bacillus anthracis, PA: Pyrococcus abyssi, PseA: Pseudomonas aeruginosa, AA: Aquifex aeolicus, TM: Thermotoga maritima, AMJ: Archaeon Methanococcus jannaschii, MJ: Methanococcus jannaschii, SC: Saccharomyces cerevisiae, BST: Bacillus stearothermophilus, EC: Escherichia coli, BSU: Bacillus subtilis, AHM: Archaeon Haloarcula marismortui, APH: Archaeon Pyrococcus horikoshii, TA: Thermus aquaticus, HM: Human mitochondrial, RN: Rattus norvegicus, DM: Drosophila melanogaster, XL: Xenopus laevis, ASS: Archaeon Sulfolobus solfataricus.

3.2. ANOVA analysis for feature selection

In order to assess the effect of the number of features on the performance of the methods, ANOVA was used to select features that differ significantly between the three and five RNA-binding protein domain subclasses. Tables 2 and 3 show the ten top features with the lowest p-values. The ANOVA results indicate that the RNA binding subclasses show obvious differences in sequence- and structure-based features. Figure 2 shows the differences in shape, RBD size, main patch size, and frequency of two important charged amino acids, Arg and Lys, across the five RBD subclasses. In addition, reduced models were constructed using features selected at a significance level of p < 0.05, which yielded 45 and 102 features for the three- and five-subclass problems, respectively.
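The feature screening described above can be reproduced with a one-way ANOVA per feature; a minimal sketch, assuming a feature matrix X and a factor y of subclass labels, keeping features with p < 0.05 as in the reduced models:

```r
# One-way ANOVA per feature: does the feature differ across RBD subclasses?
# X: assumed numeric feature matrix; y: assumed factor of subclass labels.
pvals <- apply(X, 2, function(feature) anova(lm(feature ~ y))[["Pr(>F)"]][1])

# Ten top features with the lowest p-values (cf. Tables 2 and 3)
head(sort(pvals), 10)

# Reduced feature set at a significance level of p < 0.05
X_reduced <- X[, pvals < 0.05, drop = FALSE]
```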

Table 2.

The ten top features selected between the three RNA-binding domain subclasses by ANOVA analysis.

Number Feature P-value
1 Number of Arg in intermediate regions 2.0E-04
2 Molecular weight of RBDs 5.0E-04
3 Number of Ser in buried regions 7.0E-04
4 Number of Cys in main patch 0.001
5 Number of basic amino acids in sequence 0.0012
6 Number of Glu in sheet 0.0026
7 Number of charged amino acids in sequence 0.003
8 Number of Arg in exposed regions 0.0034
9 Number of charged amino acids in sheet 0.0039
10 Isoelectric point 0.0041

Table 3.

The ten top features selected between the five RNA-binding domain subclasses by ANOVA analysis.

Number Feature P-value
1 Dipole 4.00E-10
2 Total number of residues 8.00E-10
3 Total number of atoms 1.40E-09
4 Total number of negative residues 7.40E-09
5 Total number of positive residues 4.93E-08
6 Number of Lys in main patch 8.92E-08
7 Molecular weight of RBDs 8.55E-07
8 RM 2.18E-06
9 Number of small amino acids in main patch 1.50E-05
10 Number of Ala in buried regions 4.62E-05

Figure 2. Diversity of features among the five different RBD subclasses. (A) sample of 7S RBDs (d1914a1), (B) sample of rRNA RBDs (d2v3ka1), (C) sample of double-stranded RBDs (d1ekza_), (D) sample of mRNA RBDs (d1afwa1), and (E) sample of tRNA RBDs (d1a6fa_). The gray region represents the main patch, blue represents Arg residues, and green represents Lys residues.

3.3. Tuned multi-class support vector machine analysis

We used the tune function to select the optimized structure of TMCSVM through jackknife and self-consistency tests. The most important parameter of the TMCSVM topology is the kernel function, which was searched among four different kernel functions, i.e., linear, polynomial, radial, and sigmoid. Tables 8 and 9 show that the highest overall performance was obtained by TMCSVM. TMCSVM and reduced-TMCSVM show the highest rate (79.31%) for prediction of the rRNA BD subclass compared with the other methods in the five-subclass prediction, and SVM also shows the highest rate (50%) for prediction of the tRNA BD subclass in the five-subclass prediction. Our results confirm that, although TMCSVM is a machine learning method, it can deal with a large number of features because the SVM training process seeks a globally optimal solution and avoids over-fitting. However, the results obtained in the three-subclass prediction emphasize that this ability holds only within a limited range of the features/samples ratio.

Table 8.

Results of self-consistency and jackknife tests in prediction of five subclasses

Test  Method  7S (%)  Double-stranded (%)  mRNA (%)  rRNA (%)  tRNA (%)  Overall accuracy (%)
Self-consistency RF 100 100 100 100 100 100
Reduced-RF 100 100 100 100 100 100
SVM 100 100 100 100 100 100
Reduced-SVM 100 100 100 100 87.50 97.56
MCRLR 100 100 100 100 100 100
Reduced-MCRLR 100 100 100 100 100 100
Jackknife RF 60.00 81.25 63.64 44.83 37.50 54.88
Reduced-RF 70.00 81.25 81.82 44.83 37.50 58.54
SVM 40.00 87.50 54.55 79.31 50.00 67.07
Reduced-SVM 50.00 87.50 45.45 79.31 43.75 65.84
MCRLR 60.00 100 54.55 65.52 31.25 63.41
Reduced-MCRLR 50.00 87.50 45.45 65.52 43.75 60.98

Table 9.

Results of self-consistency and jackknife tests in prediction of three subclasses

Test  Method  mRNA (%)  rRNA (%)  tRNA (%)  Overall accuracy (%)
Self-consistency RF 100 100 100 100
Reduced-RF 100 100 100 100
SVM 100 100 100 100
Reduced-SVM 100 100 100 100
MCRLR 100 100 100 100
Reduced-MCRLR 100 100 100 100
Jackknife RF 81.82 62.07 43.75 60.71
Reduced-RF 81.82 62.07 62.50 66.07
SVM 63.64 82.76 37.50 66.07
Reduced-SVM 63.64 79.31 62.50 71.43
MCRLR 54.55 82.76 62.50 71.43
Reduced-MCRLR 18.18 79.31 37.50 55.36

3.4. Random Forest analysis

The R randomForest package was used to construct the RF classifier for prediction of RBD subclasses. In order to optimize the performance of RF, we defined class cutoffs based on the distribution of RBDs across subclasses, i.e., the number of RBDs in each subclass divided by the total number of RBDs. The results reveal that, although RF predicts all RBDs correctly in the self-consistency test, its jackknife performance is drastically reduced (Tables 8 and 9). However, reduced-RF shows the highest rates for prediction of the 7S RBD subclass (70%) and the mRNA RBD subclass (81.82%) compared with the other methods in the five-subclass prediction. In addition, RF and reduced-RF show the highest rate of 81.82% for prediction of the mRNA BD subclass in the three-subclass prediction. Furthermore, the results make it clear that the number of features used in RF training is an important issue, independent of the number of subclasses; indeed, RF is prone to over-fitting when trained with a large number of features.

3.5. Multi-class ℓ1/ℓq-regularized logistic regression

We ran the MCRLR method on the dataset for the five- and three-subclass problems using jackknife and self-consistency tests. MCRLR provides useful information about preferred and avoided features in each of the RNA binding subclasses. The tRNA BD subclass shows preferred and avoided features with higher average weights than the other subclasses in both the three- and five-subclass problems, through both the jackknife and self-consistency procedures (Tables 4–7). Our results confirm the unique properties of tRNA BPs previously reported by Shazman and Mandel-Gutfreund (2008).

Table 4.

Average weights of the ten most important features computed by MCRLR in the jackknife procedure using all features for prediction of the five RNA-binding domain subclasses. Positive values indicate preference for the feature in the corresponding subclass and negative values indicate avoidance.

No. Feature/Average Value
7S RBD subclass  Double-stranded RBD subclass  mRNA RBD subclass  rRNA RBD subclass  tRNA RBD subclass
1 Number of Met in MP/1.517 Number of basic AAs in MP/0.967 Molecular weight/1.373 Number of His in MP/1.913 Number of Arg in IR/2.004
2 Number of Met in RC/1.274 Number of Glu in MP/0.900 Number of Ser in BR/1.176 Number of Arg in IR/-1.555 Number of His in sheet/2.003
3 Number of Cys in sheet/0.998 Dipole/0.900 Number of Cys in MP/1.142 Number of Ile in MP/1.499 Number of Pro in MP/-1.583
4 Number of His in seq./-0.920 Number of Glu in IR/0.891 Number of Ser in IR/-1.094 Number of Leu in helix/-1.481 Number of Ile in MP/-1.576
5 Number of Met in ER/0.990 Number of Tyr in MP/0.831 Isoelectric point/-1.007 Number of Met in RC/-1.458 Number of Asp in BR/1.371
6 Number of Glu in MP/-0.886 Number of Ala in BR/0.811 Number of Cys in RC/0.969 Number of Ser in RC/1.330 Number of Ile in IR/-1.335
7 Number of Tyr in helix/0.811 Number of Gln in RC/-0.799 Number of charged AAs in seq./-0.949 Number of Tyr in MP/-1.304 Number of Met in sheet/1.255
8 Number of Glu in IR/0.759 Number of Arg in MP/-0.779 Number of Asp in ER/0.934 Number of Val in helix/1.280 Number of Phe in ER/-1.226
9 Frequency of antiparallel HB/-0.753 Number of Lys in MP/0.752 Number of Arg in ER/-0.931 Number of His in ER/-1.269 Number of Phe in IR/1.166
10 Second patch size/-0.729 Number of Asp in MP/0.728 Number of Asp in IR/0.920 Number of Cys in RC/-1.258 Number of Gln in IR/1.114

MP: main patch, RC: random coil, ER: exposed regions, IR: intermediate regions, HB: hydrogen bond, AAs: amino acids, BR: buried regions, Seq.: sequence.

Table 7.

Average weights of the ten most important features computed by MCRLR in the jackknife procedure using the features selected by ANOVA analysis for prediction of the three RNA-binding domain subclasses. Positive values indicate preference for the feature in the corresponding subclass and negative values indicate avoidance.

No. Feature/Average Value
mRNA RBD subclass rRNA RBD subclass tRNA RBD subclass
1 Number of Glu in seq./-3.365 Number of Ala in helix/-4.742 Number of Asn in helix/-5.94
2 Number of Cys in helix/-3.347 Number of Glu in seq./3.999 Number of Leu in seq./-3.765
3 Number of Asp in seq./1.950 Number of Leu in seq./2.571 Number of Trp in seq./-2.978
4 Number of Gln in helix/1.759 Number of Asn in helix/2.481 Number of Ala in helix/2.913
5 Number of Glu in helix/-1.621 Number of Cys in helix/1.900 Number of Gly in helix/2.386
6 Number of Val in seq./1.620 Number of Trp in seq/1.837 Number of aliphatic AAs in seq./2.048
7 Number of Leu in seq./1.570 Number of Lys in seq./-1.712 Number of Lys in seq./1.773
8 Number of Ala in helix/1.525 Number of Ile in seq./-1.618 Number of Ile in seq./1.567
9 Number of nonpolar AAs in seq./1.187 Number of Pro in seq./1.551 Number of Pro in helix/1.402
10 Number of Trp in seq./1.051 Number of nonpolar AAs in seq./-1.355 Number of Met in helix/-1.241

The results of the jackknife and self-consistency tests, shown in Tables 8 and 9, were obtained from the output of the model. The high performance of the MCRLR model in the self-consistency test confirms the usefulness of the defined features for prediction of RBP subclasses. The jackknife results show that the performance of reduced-MCRLR drops drastically, especially in the three-subclass prediction using the selected features. The rationale for this decrease in MCRLR performance is that its shrinkage ability is restricted when only a limited number of features is available (N = 45 for prediction of three subclasses). Indeed, the ℓ1/ℓq regularization constrains the total weight allocated to a set of features, with the end result that some features receive zero weight. Additionally, MCRLR shows the highest rate in prediction of double-stranded RBDs compared with the other methods.

4. Discussion

Knowledge of how biomacromolecules interact with each other is essential for understanding cellular processes. In this study, we investigated the interaction of proteins and RNA, an interaction important in various cellular processes.

According to a recent comprehensive review (Chou, 2011), to establish a really useful predictor for a protein system we need to: construct a valid benchmark dataset, formulate the protein samples with an effective mathematical expression, develop a powerful algorithm to operate the prediction, evaluate the anticipated accuracy of the predictor, and establish a user-friendly web-server.

Previous reports mention that aminoacyl tRNA synthetases and bacterial factors that mimic tRNA BPs have a highly negatively charged surface (Tworowski et al., 2005; Nakamura and Ito, 2003), but little more is known about the variation in feature distribution among different RBPs. In this study, in addition to multi-class classification of RBDs, we performed feature selection. Besides prediction accuracy comparable with TMCSVM, a clear variety of feature distributions was elucidated by using MCRLR. For example, our results demonstrate a striking diversity in the distribution of Lys and Arg, two charged amino acids important in interaction and catalytic reactions, across the different RBD subclasses. In the tRNA BD subclass, Lys is preferred in the sequence and Arg is preferred in intermediate regions, with high scores (Tables 4–7). In the mRNA BD subclass, Arg is preferred in the sequence; in the double-stranded RBD subclass, Lys is preferred in the main patch while Arg is avoided in the main patch. In the rRNA BD subclass, Lys is avoided in the sequence, and Arg appears to be preferred on the surface, as indicated by its negative weight for intermediate regions. In the 7S RBD subclass, Arg and Lys are not among the top preferred or avoided residues. From these results we infer that Lys and Arg in tRNA BDs, Arg in mRNA BDs, Lys in double-stranded RBDs, and exposed Arg in rRNA BDs are possibly important in the RNA-protein interactions and catalytic reactions of RBDs. Figure 2 illustrates the distribution of Arg and Lys in the main patches of the different RBD subclasses.

These results emphasize that the tRNA BDs have unique local and global properties that can be utilized for identifying novel proteins possibly involved in tRNA processing. Moreover, it is worth mentioning that the size of the second patch shows a positive average weight in the tRNA BD subclass, which suggests that the second patch may have specific properties, as noted by Shazman and Mandel-Gutfreund (2008). The growth of databases of solved 3D protein structures will help to reveal more details about RBDs.

In this study we developed a first-of-its-kind in silico approach for the analysis and prediction of RBD subclasses, in three- and five-subclass settings, using RF, TMCSVM, and MCRLR. Overall, TMCSVM outperforms the other methods, although tuning the SVM is time consuming. On the other hand, MCRLR shows some advantages, including fast training, reporting of the most important features for RBD prediction, and detection of avoided and preferred features in each subclass.

In addition, RF shows the worst accuracy among the three predictor methods, which indicates that RF is prone to over-fitting, especially when large numbers of features are fed into it. In conclusion, we used two types of predictor methods: (1) MCRLR as a statistical method, and (2) RF and TMCSVM as machine learning methods. Statistical methods are commonly accepted, and the popularity of these models may be attributed to the interpretability of their parameters and their ease of use, although they suffer from specific limitations. For example, statistical methods use linear combinations of independent variables and therefore are not well adapted to modeling the grossly nonlinear, complex interactions demonstrated in biological systems. On the other hand, machine learning methods are rich and flexible nonlinear systems that show robust performance in dealing with noisy or incomplete data and have the ability to generalize from the input data. They may be better suited than other modeling systems to predict outcomes when the relationships between variables are complex, multidimensional, and nonlinear, as found in complex biological systems. Although machine learning methods can give high prediction accuracy, some problems may arise in their training; for example, in this study we showed that RF, a well-known machine learning method, is not well suited to our problem and is prone to over-fitting. The "black box" nature and the empirical nature of model development are other disadvantages of machine learning methods (Tu, 1996).

5. Conclusion

A great challenge in classifying ligand-binding proteins (such as RBDs) is identifying to which ligand a protein will bind. For this purpose, we applied three different predictor methods to classify RNA-binding domains using a large number of sequence and structural features, trained on three and five different subclasses of known RBDs classified according to their RNA target. Our results show that TMCSVM has the highest prediction accuracy compared with the other methods. Overall, the results we obtained are encouraging, reinforcing the idea that combining sequence and structural properties of protein domains can give clues to a protein's interacting partner.

It is important to note that subclassification of the RBDs into three and five subclasses using our multiclass approach is only possible given the prior knowledge that the protein domain binds RNA. Indeed, requiring protein domains to be known RNA binders is a limitation of such predictor models.

Finally, our results showed that, in addition to multi-class prediction, the biological diversity of RBD subclasses can be interpreted using state-of-the-art methods such as ℓ1/ℓq-regularized logistic regression.

Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful predictors (Chou and Shen, 2009), in future work we shall make efforts to provide a web-server for the method presented in this paper.

Table 5.

Average weights of the ten most important features computed by MCRLR in the jackknife procedure using the features selected by ANOVA analysis for prediction of the five RNA-binding domain subclasses. Positive values indicate preference for the feature in the corresponding subclass and negative values indicate avoidance.

No. Feature/Average Value
7S RBD subclass  Double-stranded RBD subclass  mRNA RBD subclass  rRNA RBD subclass  tRNA RBD subclass
1 Number of Cys in sheet/2.029 Number of Glu in IR/1.675 Number of Ser in IR/-2.051 Number of small AAs in helix/3.773 Number of Ile in MP/-4.351
2 Number of Met in MP/2.011 Number of Arg in MP/-1.394 Number of Arg in ER/-2.034 Number of Ser in RC/3.652 Number of Arg in IR/4.083
3 Number of Met in RC/1.653 Number of Tyr in MP/1.230 Number of Cys in MP/1.976 Number of Ile in MP/3.378 Number of Small AAs in helix/-3.625
4 Frequency of antiparallel HB/-1.54 Dipole/1.160 Number of Ser in BR/1.733 Number of acidic AAs in ER/3.219 Number of Small AAs in RC/-3.118
5 Number of His in seq./-1.370 Number of Gln in RC/-1.070 Molecular weight/1.554 Number of Ser in IR/3.030 Number of His in seq./3.027
6 Number of Met in ER/1.300 Number of Phe in MP/1.062 Number of small AAs in BR/1.486 Number of Tyr in MP/-2.954 Number of Phe in IR/2.643
7 Number of Glu in IR/1.292 Number of Aromatic AAs in MP/1.047 Number of Arg in IR/1.482 Number of Pro in sheet/2.7428 Number of tiny AAs in sheet/2.288
8 Number of Leu in sheet/1.235 Number of Ser in IE/1.036 Number of Glu in IR/-1.363 Number of Met in RC/-2.742 Frequency of i+3 HB/2.091
9 Number of Gln in RC/1.231 Number of basic AAs in BR/-0.989 Isoelectric point/-1.319 Number of Arg in IR/-2.588 Number of Small AAs in MP/-1.816
10 Second patch size/-1.1781 Number of Met in RC/0.981 Number of Ile in MP/1.317 Number of Glu in IR/-2.458 Number of Phe in RC/-1.790

MP: main patch, RC: random coil, ER: exposed regions, IR: intermediate regions, HB: hydrogen bond, AAs: amino acids, BR: buried regions, Seq.: sequence.

Table 6.

Average weights of the ten most important features computed by MCRLR in the jackknife procedure using all features for prediction of the three RNA-binding domain subclasses. Positive values indicate preference for the feature in the corresponding subclass and negative values indicate avoidance.

No. Feature/Average Value
mRNA RBD subclass rRNA RBD subclass tRNA RBD subclass
1 Number of charged AAs in seq./-1.337 Number of Arg in IR/-2.794 Number of Arg in IR/2.129
2 Number of Ser in IR/-1.132 Number of Glu in sheet/1.556 Number of Pro in MP/-1.699
3 Number of Arg in seq./1.131 Number of Phe in helix/-1.278 Number of Pro in IR/-1.581
4 Number of Pro in seq./0.953 Number of charged AAs in seq./1.109 Number of Phe in IR/1.339
5 Number of Glu in sheet/0.949 Number of Phe in MP/1.091 Number of Asn in IR/-1.183
6 Number of Ser in BR/0.898 Number of Ser in IR/1.021 Number of Val in helix/-1.055
7 Number of Phe in IR/-0.782 Second patch size/-1.015 Number of Ile in IR/-1.033
8 Number of Arg in ER/-0.769 Number of Cys in helix/0.954 Number of Phe in helix/0.921
9 Number of Arg in IR/0.753 Number of His in sheet/-0.930 Number of Cys in helix/-0.790
10 Number of Cys in RC/0.712 Number of Cys in RC/-0.789 Second patch size/0.766

MP: main patch, RC: random coil, ER: exposed regions, IR: intermediate regions, AAs: amino acids, BR: buried regions, Seq.: sequence.

Highlights.

  1. Multi-class prediction of RNA-binding protein domains.

  2. We compared prediction accuracy of three different state-of-the-art predictor methods.

  3. In addition to successfully classifying RBPs, we discovered dissimilar sequence and structural features using ℓ1/ℓq-regularized logistic regression.

  4. Our method could be applied to identify novel RNA-binding proteins with unique folds.

Acknowledgments

We thank Abbas Mahdavi for his assistance in this investigation. This work is partially funded by NIH Grant R00RR024163.


References

  1. Ahmad S, Gromiha M, Fawareh H, Sarai A. ASAView: database and tool for solvent accessibility representation in proteins. BMC Bioinformatics. 2004;5:51. doi: 10.1186/1471-2105-5-51. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. Nature Genetics. 2000;25:25–29. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bach F. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research. 2008;9:1179–1225. [Google Scholar]
  4. Breiman L. Random Forests. Machine Learning. 2001;45:5–32. [Google Scholar]
  5. Chen C, Chen L, Zou X, Cai P. Prediction of protein secondary structure content by using the concept of Chou’s pseudo amino acid composition and support vector machine. Protein & Peptide Letters. 2009;16:27–31. doi: 10.2174/092986609787049420. [DOI] [PubMed] [Google Scholar]
  6. Chen Y, Varani G. Protein families and RNA recognition. FEBS J. 2005;272:2088–2097. doi: 10.1111/j.1742-4658.2005.04650.x. [DOI] [PubMed] [Google Scholar]
  7. Chou KC. Prediction of protein cellular attributes using pseudo amino acid composition. PROTEINS: Structure, Function, and Genetics. 2001;43:246–255. doi: 10.1002/prot.1035. (Erratum: ibid, 2001, Vol44, 60) [DOI] [PubMed] [Google Scholar]
  8. Chou KC. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005;21:10–19. doi: 10.1093/bioinformatics/bth466. [DOI] [PubMed] [Google Scholar]
  9. Chou KC. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Current Proteomics. 2009;6:262–274. [Google Scholar]
  10. Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review) J Theor Biol. 2011;273:236–247. doi: 10.1016/j.jtbi.2010.12.024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Chou KC, Shen HB. Hum-PLoc: A novel ensemble classifier for predicting human protein subcellular localization. Biochem Biophys Res Commun. 2006;347:150–157. doi: 10.1016/j.bbrc.2006.06.059. [DOI] [PubMed] [Google Scholar]
  12. Chou KC, Shen HB. Cell-PLoc: A package of Web servers for predicting subcellular localization of proteins in various organisms (updated version: Cell-PLoc 2.0: An improved package of web-servers for predicting subcellular localization of proteins in various organisms, Natural Science, 2010, 2, 1090–1103) Nature Protocols. 2008;3:153–162. doi: 10.1038/nprot.2007.494. [DOI] [PubMed] [Google Scholar]
  13. Chou KC, Shen HB. Review: recent advances in developing web-servers for predicting protein attributes. Natural Science. 2009;2:63–92. (openly accessible at http://www.scirp.org/journal/NS/) [Google Scholar]
  14. Chou KC, Wu ZC, Xiao X. iLoc-Euk: A Multi-Label Classifier for Predicting the Subcellular Localization of Singleplex and Multiplex Eukaryotic Proteins. PLoS ONE. 2011;6:e18258. doi: 10.1371/journal.pone.0018258. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Chou KC, Wu ZC, Xiao X. iLoc-Hum: Using accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. Molecular Biosystems. 2012;8:629–641. doi: 10.1039/c1mb05420a. [DOI] [PubMed] [Google Scholar]
  16. Chou KC, Zhang CT. Review: Prediction of protein structural classes. Critical Reviews in Biochemistry and Molecular Biology. 1995;30:275–349. doi: 10.3109/10409239509083488. [DOI] [PubMed] [Google Scholar]
  17. Ding H, Luo L, Lin H. Prediction of cell wall lytic enzymes using Chou’s amphiphilic pseudo amino acid composition. Protein & Peptide Letters. 2009;16:351–355. doi: 10.2174/092986609787848045. [DOI] [PubMed] [Google Scholar]
  18. Duchi J, Singer Y. Online and batch learning using forward backward splitting. Journal of Machine Learning Research. 2009;10:2899–2934. [Google Scholar]
  19. Dudoit S, Fridlyan J, Fridlyan TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc. 2002;97:77–87. [Google Scholar]
  20. Ellis JJ, Broom M, Jones S. Protein-RNA interactions: structural analysis and functional classes. Proteins. 2007;66:903–911. doi: 10.1002/prot.21211.
  21. Esmaeili M, Mohabatkar H, Mohsenzadeh S. Using the concept of Chou’s pseudo amino acid composition for risk type prediction of human papillomaviruses. J Theor Biol. 2010;263:203–209. doi: 10.1016/j.jtbi.2009.11.016.
  22. Georgiou DN, Karakasidis TE, Nieto JJ, Torres A. Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou’s pseudo amino acid composition. J Theor Biol. 2009;257:17–26. doi: 10.1016/j.jtbi.2008.11.003.
  23. Gu Q, Ding YS, Zhang TL. Prediction of G-Protein-Coupled Receptor Classes in Low Homology Using Chou’s Pseudo Amino Acid Composition with Approximate Entropy and Hydrophobicity Patterns. Protein & Peptide Letters. 2010;17:559–567. doi: 10.2174/092986610791112693.
  24. Guo J, Rao N, Liu G, Yang Y, Wang G. Predicting protein folding rates using the concept of Chou’s pseudo amino acid composition. Journal of Computational Chemistry. 2011;32:1612–1617. doi: 10.1002/jcc.21740.
  25. Han LY, Cai CZ, Lo SL, Chung MC, Chen YZ. Prediction of RNA-binding proteins from primary sequence by a support vector machine approach. RNA. 2004;10:355–368. doi: 10.1261/rna.5890304.
  26. Hayat M, Khan A. Discriminating Outer Membrane Proteins with Fuzzy K-Nearest Neighbor Algorithms Based on the General Form of Chou’s PseAAC. Protein & Peptide Letters. 2012;19:411–421. doi: 10.2174/092986612799789387.
  27. Hu L, Zheng L, Wang Z, Li B, Liu L. Using pseudo amino acid composition to predict protease families by incorporating a series of protein biological features. Protein & Peptide Letters. 2011;18:552–558. doi: 10.2174/092986611795222795.
  28. Jia SC, Hu XZ. Using Random Forest Algorithm to Predict beta-Hairpin Motifs. Protein & Peptide Letters. 2011;18:609–617. doi: 10.2174/092986611795222777.
  29. Jiang X, Wei R, Zhang TL, Gu Q. Using the concept of Chou’s pseudo amino acid composition to predict apoptosis proteins subcellular location: an approach by approximate entropy. Protein & Peptide Letters. 2008;15:392–396. doi: 10.2174/092986608784246443.
  30. Jones S, Daley DT, Luscombe NM, Berman HM, Thornton JM. Protein-RNA interactions: a structural analysis. Nucleic Acids Res. 2001;29:943–954. doi: 10.1093/nar/29.4.943.
  31. Kabsch W, Sander C. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637. doi: 10.1002/bip.360221211.
  32. Kandaswamy KK, Chou KC, Martinetz T, Moller S, Suganthan PN, Sridharan S, Pugalenthi G. AFP-Pred: A random forest approach for predicting antifreeze proteins from sequence-derived properties. J Theor Biol. 2011;270:56–62. doi: 10.1016/j.jtbi.2010.10.037.
  33. Kowalski M. Sparse regression using mixed norms. Applied and Computational Harmonic Analysis. 2009;27:303–324.
  34. Li BQ, Huang T, Liu L, Cai YD, Chou KC. Identification of Colorectal Cancer Related Genes with mRMR and Shortest Path in Protein-Protein Interaction Network. PLoS ONE. 2012;7:e33393. doi: 10.1371/journal.pone.0033393.
  35. Li FM, Li QZ. Predicting protein subcellular location using Chou’s pseudo amino acid composition and improved hybrid approach. Protein & Peptide Letters. 2008;15:612–616. doi: 10.2174/092986608784966930.
  36. Li LQ, Zhang Y, Zou LY, Zhou Y, Zheng XQ. Prediction of Protein Subcellular Multi-Localization Based on the General form of Chou’s Pseudo Amino Acid Composition. Protein & Peptide Letters. 2012;19:375–387. doi: 10.2174/092986612799789369.
  37. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2:18–22.
  38. Lin H. The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou’s pseudo amino acid composition. J Theor Biol. 2008;252:350–356. doi: 10.1016/j.jtbi.2008.02.004.
  39. Lin H, Ding H, Guo FB, Zhang AY, Huang J. Predicting subcellular localization of mycobacterial proteins by using Chou’s pseudo amino acid composition. Protein & Peptide Letters. 2008;15:739–744. doi: 10.2174/092986608785133681.
  40. Lin J, Wang Y. Using a novel AdaBoost algorithm and Chou’s pseudo amino acid composition for predicting protein subcellular localization. Protein & Peptide Letters. 2011;18:1219–1225. doi: 10.2174/092986611797642797.
  41. Lin WZ, Fang JA, Xiao X, Chou KC. iDNA-Prot: Identification of DNA Binding Proteins Using Random Forest with Grey Model. PLoS ONE. 2011;6:e24756. doi: 10.1371/journal.pone.0024756.
  42. Lingel A, Sattler M. Novel modes of protein-RNA recognition in the RNAi pathway. Curr Opin Struct Biol. 2005;15:107–115. doi: 10.1016/j.sbi.2005.01.010.
  43. Liu L, Hu XZ, Liu XX, Wang Y, Li SB. Predicting Protein Fold Types by the General Form of Chou’s Pseudo Amino Acid Composition: Approached from Optimal Feature Extractions. Protein & Peptide Letters. 2012;19:439–449. doi: 10.2174/092986612799789378.
  44. Lunde BM, Moore C, Varani G. RNA-binding proteins: modular design for efficient function. Nat Rev Mol Cell Biol. 2007;8:479–490. doi: 10.1038/nrm2178.
  45. Mei S. Multi-kernel transfer learning based on Chou’s PseAAC formulation for protein submitochondria localization. J Theor Biol. 2012;293:121–130. doi: 10.1016/j.jtbi.2011.10.015.
  46. Mohabatkar H. Prediction of cyclin proteins using Chou’s pseudo amino acid composition. Protein & Peptide Letters. 2010;17:1207–1214. doi: 10.2174/092986610792231564.
  47. Mohabatkar H, Mohammad Beigi M, Esmaeili A. Prediction of GABA(A) receptor proteins using the concept of Chou’s pseudo-amino acid composition and support vector machine. J Theor Biol. 2011;281:18–23. doi: 10.1016/j.jtbi.2011.04.017.
  48. Morozova N, Allers J, Myers J, Shamoo Y. Protein-RNA interactions: exploring binding patterns with a three-dimensional superposition analysis of high resolution structures. Bioinformatics. 2006;22:2746–2752. doi: 10.1093/bioinformatics/btl470.
  49. Nanni L, Lumini A, Gupta D, Garg A. Identifying Bacterial Virulent Proteins by Fusing a Set of Classifiers Based on Variants of Chou’s Pseudo Amino Acid Composition and on Evolutionary Information. IEEE/ACM Trans Comput Biol Bioinform. 2012;9:467–475. doi: 10.1109/TCBB.2011.117.
  50. Negahban S, Ravikumar P, Wainwright M, Yu B. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Advances in Neural Information Processing Systems. 2009:1348–1356.
  51. Parker JS, Barford D. Argonaute: a scaffold for the function of short regulatory RNAs. Trends Biochem Sci. 2006;31:622–630. doi: 10.1016/j.tibs.2006.09.010.
  52. Pugalenthi G, Kandaswamy KK, Chou KC, Vivekanandan S, Kolatkar P. RSARF: Prediction of Residue Solvent Accessibility from Protein Sequence Using Random Forest Method. Protein & Peptide Letters. 2012;19:50–56. doi: 10.2174/092986612798472875.
  53. Qin YF, Wang CH, Yu XQ, Zhu J, Liu TG, et al. Predicting Protein Structural Class by Incorporating Patterns of Over-Represented k-mers into the General form of Chou’s PseAAC. Protein & Peptide Letters. 2012;19:388–397. doi: 10.2174/092986612799789350.
  54. Qiu JD, Huang JH, Liang RP, Lu XQ. Prediction of G-protein-coupled receptor classes based on the concept of Chou’s pseudo amino acid composition: an approach from discrete wavelet transform. Analytical Biochemistry. 2009;390:68–73. doi: 10.1016/j.ab.2009.04.009.
  55. Qiu JD, Huang JH, Shi SP, Liang RP. Using the concept of Chou’s pseudo amino acid composition to predict enzyme family classes: an approach with support vector machine based on discrete wavelet transform. Protein & Peptide Letters. 2010;17:715–722. doi: 10.2174/092986610791190372.
  56. Qiu JD, Suo SB, Sun XY, Shi SP, Liang RP. OligoPred: A web-server for predicting homo-oligomeric proteins by incorporating discrete wavelet transform into Chou’s pseudo amino acid composition. Journal of Molecular Graphics & Modelling. 2011;30:129–134. doi: 10.1016/j.jmgm.2011.06.014.
  57. Qiu Z, Wang X. Improved Prediction of Protein Ligand-Binding Sites Using Random Forests. Protein & Peptide Letters. 2011;18:1212–1218. doi: 10.2174/092986611797642788.
  58. Shameer K, Pugalenthi G, Kandaswamy KK, Sowdhamini R. 3dswap-pred: Prediction of 3D Domain Swapping from Protein Sequence Using Random Forest Approach. Protein & Peptide Letters. 2011;18:1010–1020. doi: 10.2174/092986611796378729.
  59. Shao X, Tian Y, Wu L, Wang Y, Jing L, Deng N. Predicting DNA- and RNA-binding proteins from sequences with kernel methods. J Theor Biol. 2009;258:289–293. doi: 10.1016/j.jtbi.2009.01.024.
  60. Shazman S, Mandel-Gutfreund Y. Classifying RNA-binding proteins based on electrostatic properties. PLoS Comput Biol. 2008;4:e1000146. doi: 10.1371/journal.pcbi.1000146.
  61. Shulman-Peleg A, et al. Prediction of interacting single-stranded RNA bases by protein-binding patterns. J Mol Biol. 2008;379:299–316. doi: 10.1016/j.jmb.2008.03.043.
  62. Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008;9:319. doi: 10.1186/1471-2105-9-319.
  63. Stawiski EW, Gregoret LM, Mandel-Gutfreund Y. Annotating nucleic acid-binding function based on protein structure. J Mol Biol. 2003;326:1065–1079. doi: 10.1016/s0022-2836(03)00031-7.
  64. Tu JV. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. J Clin Epidemiol. 1996;49:1225–1231. doi: 10.1016/s0895-4356(96)00002-9.
  65. Vapnik V. Statistical Learning Theory. Wiley-Interscience; New York: 1998.
  66. Vapnik VN. The Nature of Statistical Learning Theory. Springer; Berlin: 1995.
  67. Wu ZC, Xiao X, Chou KC. iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites. Molecular BioSystems. 2011;7:3287–3297. doi: 10.1039/c1mb05232b.
  68. Xiao X, Wang P, Chou KC. iNR-PhysChem: A Sequence-Based Predictor for Identifying Nuclear Receptors and Their Subfamilies via Physical-Chemical Property Matrix. PLoS ONE. 2012;7:e30869. doi: 10.1371/journal.pone.0030869.
  69. Xiao X, Wu ZC, Chou KC. iLoc-Virus: A multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. J Theor Biol. 2011a;284:42–51. doi: 10.1016/j.jtbi.2011.06.005.
  70. Xiao X, Wu ZC, Chou KC. A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites. PLoS ONE. 2011b;6:e20592. doi: 10.1371/journal.pone.0020592.
  71. Yu L, Guo Y, Li Y, Li G, Li M, et al. SecretP: Identifying bacterial secreted proteins by fusing new features into Chou’s pseudo-amino acid composition. J Theor Biol. 2010;267:1–6. doi: 10.1016/j.jtbi.2010.08.001.
  72. Yu X, Cao J, Cai Y, Shi T, Li Y. Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines. J Theor Biol. 2006;240:175–184. doi: 10.1016/j.jtbi.2005.09.018.
  73. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B. 2006;68(1):49–67.
  74. Zeng YH, Guo YZ, Xiao RQ, Yang L, Yu LZ, Li ML. Using the augmented Chou’s pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. J Theor Biol. 2009;259:366–372. doi: 10.1016/j.jtbi.2009.03.028.
  75. Zhang GY, Fang BS. Predicting the cofactors of oxidoreductases based on amino acid composition distribution and Chou’s amphiphilic pseudo amino acid composition. J Theor Biol. 2008;253:310–315. doi: 10.1016/j.jtbi.2008.03.015.
  76. Zhang GY, Li HC, Gao JQ, Fang BS. Predicting lipase types by improved Chou’s pseudo-amino acid composition. Protein & Peptide Letters. 2008;15:1132–1137. doi: 10.2174/092986608786071184.
  77. Zhao XW, Li XT, Ma ZQ, Yin MH. Identify DNA-Binding Proteins with Optimal Chou’s Amino Acid Composition. Protein & Peptide Letters. 2012;19:398–405. doi: 10.2174/092986612799789404.
  78. Zhou XB, Chen C, Li ZC, Zou XY. Using Chou’s amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. J Theor Biol. 2007;248:546–551. doi: 10.1016/j.jtbi.2007.06.001.
  79. Zou D, He Z, He J, Xia Y. Supersecondary structure prediction using Chou’s pseudo amino acid composition. Journal of Computational Chemistry. 2011;32:271–278. doi: 10.1002/jcc.21616.
