Bioinformatics. 2018 Mar 8;34(14):2499–2502. doi: 10.1093/bioinformatics/bty140

iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences

Zhen Chen, Pei Zhao, Fuyi Li, André Leier, Tatiana T Marquez-Lago, Yanan Wang, Geoffrey I Webb, A Ian Smith, Roger J Daly, Kuo-Chen Chou, Jiangning Song
Editor: Alfonso Valencia
PMCID: PMC6658705  PMID: 29528364

Abstract

Summary

Structural and physicochemical descriptors extracted from sequence data have been widely used to represent sequences and to predict structural, functional, expression and interaction profiles of proteins and peptides as well as DNAs/RNAs. Here, we present iFeature, a versatile Python-based toolkit for generating various numerical feature representation schemes for both protein and peptide sequences. iFeature is capable of calculating and extracting a comprehensive spectrum of 18 major sequence encoding schemes that encompass 53 different types of feature descriptors. It also allows users to extract specific amino acid properties from the AAindex database. Furthermore, iFeature integrates 12 different types of commonly used feature clustering, selection and dimensionality reduction algorithms, greatly facilitating training, analysis and benchmarking of machine-learning models. The functionality of iFeature is made freely available via an online web server and a stand-alone toolkit.

Availability and implementation

http://iFeature.erc.monash.edu/; https://github.com/Superzchen/iFeature/.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

In recent years, machine learning techniques have been increasingly used as a powerful means to predict structural and functional properties of proteins and to assist in the annotation of genomic and proteomic data (Larranaga et al., 2006; Libbrecht and Noble, 2015). In this regard, it has proven crucial to transform protein and peptide sequences into effective mathematical expressions that describe their intrinsic correlation with the corresponding structural and functional attributes (Chou, 2011). Over the past decades, an increasing number of diverse feature encoding methods or descriptors extracted from protein and peptide sequence information have been proposed for improving various predictions. Applications include predicting protein structural and functional classes (Chou and Fasman, 1978), protein–protein interactions, protein–ligand interactions (Cao et al., 2015; Shen et al., 2007), subcellular locations (Chou and Shen, 2008) and enzyme substrates (Barkan et al., 2010; Rottig et al., 2010; Song et al., 2010), among others.

Several web servers and stand-alone software packages have been developed to calculate a variety of structural and physicochemical features, including PROFEAT (Li et al., 2006; Rao et al., 2011), PseAAC (Shen and Chou, 2008), PseAAC-Builder (Du et al., 2012), propy (Cao et al., 2013), PseAAC-General (Du et al., 2014), protr/ProtrWeb (Xiao et al., 2015), Rcpi (Cao et al., 2015) and PseKRAAC (Zuo et al., 2017). However, in addition to feature extraction, feature selection and ranking analysis is an equally crucial step in machine learning of protein structures and functions. To the best of our knowledge, there is no universal toolkit or web server currently available that integrates both feature extraction and feature selection analysis. It is in this spirit that we developed iFeature, a versatile open-source Python toolkit that bridges this gap. iFeature can be used not only to extract a great variety of numerical feature encoding schemes from protein or peptide sequences, but also for feature clustering, ranking, selection and dimensionality reduction, all of which facilitate users' subsequent efforts to identify relevant features and construct effective machine learning-based models. To help users interpret the outcomes, the clustering and dimensionality reduction results can be visualized in the form of scatter plots. iFeature also supports the integration of different feature types, making it more convenient to train models by combining different feature groups. Lastly, we developed a user-friendly web server for iFeature.

2 Implementation

An important advantage of iFeature is that it integrates the multi-faceted functionality of feature calculation, extraction, clustering, selection and dimensionality reduction analysis. The 18 major encoding schemes are listed in Table 1 and briefly discussed below.

Table 1.

List of various descriptors calculated by iFeature

| Descriptor group | Descriptor | Dimension |
| --- | --- | --- |
| Amino acid composition | Amino acid composition (AAC) | 20 |
| | Enhanced amino acid composition (EAAC) | – |
| | Composition of k-spaced amino acid pairs (CKSAAP) | 2400 |
| | Dipeptide composition (DPC) | 400 |
| | Dipeptide deviation from expected mean (DDE) | 400 |
| | Tripeptide composition (TPC) | 8000 |
| Grouped amino acid composition | Grouped amino acid composition (GAAC) | 5 |
| | Enhanced grouped amino acid composition (EGAAC) | – |
| | Composition of k-spaced amino acid group pairs (CKSAAGP) | 150 |
| | Grouped dipeptide composition (GDPC) | 25 |
| | Grouped tripeptide composition (GTPC) | 125 |
| Binary | Binary (BINARY) | – |
| Autocorrelation | Moran (Moran) | 240 |
| | Geary (Geary) | 240 |
| | Normalized Moreau-Broto (NMBroto) | 240 |
| C/T/D | Composition (CTDC) | 39 |
| | Transition (CTDT) | 39 |
| | Distribution (CTDD) | 195 |
| Conjoint triad | Conjoint triad (CTriad) | 343 |
| | Conjoint k-spaced triad (KSCTriad) | 343 × (k + 1) |
| Quasi-sequence-order | Sequence-order-coupling number (SOCNumber) | 60 |
| | Quasi-sequence-order descriptors (QSOrder) | 100 |
| Pseudo-amino acid composition | Pseudo-amino acid composition (PAAC) | 50 |
| | Amphiphilic PAAC (APAAC) | 80 |
| K-nearest neighbor | K-nearest neighbor for proteins (KNNprotein) | 60 |
| | K-nearest neighbor for peptides (KNNpeptide) | 60 |
| PSSM | Position-specific scoring matrix (PSSM) profile | – |
| AAindex | AAindex (AAINDEX) | – |
| BLOSUM62 | BLOSUM62 matrix | – |
| Z-scale | Z-scale (ZSCALE) | – |
| Predicted secondary structure | Secondary structure elements content (SSEC) | 3 |
| | Secondary structure elements binary (SSEB) | – |
| Predicted protein disorder | Disorder (Disorder) | – |
| | Disorder content (DisorderC) | 2 |
| | Disorder binary (DisorderB) | – |
| Predicted accessible surface area | Accessible surface area (ASA) | – |
| Predicted main-chain torsional angles | Torsional angles (TA) | – |
| Pseudo K-tuple reduced amino acids composition | PseKRAAC (type1 to type16) | – |
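
As a rough illustration of where some of the fixed dimensions in Table 1 come from, the following sketch computes AAC, DPC and CKSAAP for a single sequence in plain Python. It is not iFeature's own code: the function names, the made-up example sequence and the normalization (dividing by the number of counted pairs) are assumptions chosen for illustration, and gaps k = 0 to 5 are assumed so that CKSAAP yields 6 × 400 = 2400 values.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(seq):
    """Amino acid composition: fraction of each residue (20 values)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def cksaap(seq, k_max=5):
    """Composition of k-spaced amino acid pairs for gaps k = 0..k_max.

    For each gap k, pairs (seq[i], seq[i + k + 1]) are counted and
    divided by the number of counted pairs, giving 400 values per gap.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        for i in range(len(seq) - k - 1):
            pair = seq[i] + seq[i + k + 1]
            if pair in counts:  # skip pairs with non-standard residues
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in pairs)
    return features


def dpc(seq):
    """Dipeptide composition: the k = 0 case of CKSAAP (400 values)."""
    return cksaap(seq, k_max=0)


seq = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical example peptide
print(len(aac(seq)), len(dpc(seq)), len(cksaap(seq)))  # 20 400 2400
```

The same counting pattern extends to tripeptides (20³ = 8000 values for TPC), which is why the composition-based descriptors have dimensions that do not depend on sequence length.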

The first group includes six feature sets, i.e. amino acid composition, composition of k-spaced amino acid pairs (Chen et al., 2013; Liu et al., 2017), enhanced amino acid composition, dipeptide composition, dipeptide deviation from expected mean (Saravanan and Gautham, 2015) and tripeptide composition (Bhasin and Raghava, 2004). The second group is labeled 'grouped amino acid composition' and consists of five descriptors (Table 1). For this group, the 20 amino acid types are first categorized according to their physicochemical properties, and then the composition of each category is calculated. The third group is the binary encoding scheme, in which each amino acid is represented by a 20-dimensional binary vector. The fourth group includes three types of autocorrelation feature sets: normalized Moreau–Broto autocorrelation, Moran autocorrelation and Geary autocorrelation (Sokal and Thomson, 2006). This feature group allows users to select properties from the AAindex database (Kawashima et al., 2008). The fifth group consists of three feature sets: composition, transition and distribution (Dubchak et al., 1995, 1999). The sixth group is the conjoint triad (Shen et al., 2007). The seventh group contains two sequence-order feature sets, sequence-order-coupling number and quasi-sequence-order (Chou, 2000; Chou and Cai, 2004; Schneider and Wrede, 1994). The eighth group includes the pseudo-amino acid composition and the amphiphilic pseudo-amino acid composition (Chou, 2001, 2005). The ninth group includes two K-nearest neighbor features: KNNprotein and KNNpeptide (Chen et al., 2013). The tenth group is the PSSM encoding scheme, which extracts features from the position-specific scoring matrix (PSSM; Altschul, 1997) generated by PSI-BLAST. The eleventh group is the AAindex encoding scheme, where each amino acid is represented by a 531-dimensional vector (Tung and Ho, 2008). The twelfth group is the BLOSUM62 matrix-derived descriptor (Lee et al., 2011). The thirteenth group is the Z-scale encoding, where each amino acid is represented by five physicochemical descriptor variables. Feature groups 14 to 17 are derived from information about the predicted protein secondary structure, disorder, accessible surface area and torsional angles, respectively. The last group includes 16 types of pseudo K-tuple reduced amino acid compositions (Zuo et al., 2017).
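
The grouped composition and binary encodings described above can be sketched in a few lines of Python. This is a minimal illustration rather than iFeature's implementation: the five-way grouping below is an assumed physicochemical partition of the 20 residues, and the function names and the example peptide are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed five-way grouping by physicochemical class (illustrative only).
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}


def gaac(seq):
    """Grouped amino acid composition: fraction of residues per group."""
    return [
        sum(seq.count(a) for a in members) / len(seq)
        for members in GROUPS.values()
    ]


def binary(seq):
    """Binary (one-hot) encoding: 20 values per residue, concatenated."""
    vec = []
    for residue in seq:
        vec.extend(1 if residue == a else 0 for a in AMINO_ACIDS)
    return vec


peptide = "ACDKW"  # hypothetical peptide with one residue from each group
print(gaac(peptide))         # [0.2, 0.2, 0.2, 0.2, 0.2]
print(len(binary(peptide)))  # 100, i.e. 20 values for each of 5 residues
```

Note that the grouped composition always has five values, whereas the length of the binary encoding grows with the sequence, so it is typically applied to peptides of equal length.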

Moreover, high-dimensional features can cause overfitting, suffer from the curse of dimensionality (Bellman, 1961) and carry redundant information, so machine learning models trained on such high-dimensional initial features often perform poorly in practice. To address this problem, iFeature further integrates several commonly used feature clustering, selection and dimensionality reduction algorithms to filter out redundant features and retain the useful and relevant ones. All implemented feature analysis algorithms are listed in Table 2. All clustering methods support both sample and feature clustering procedures. For users who are not familiar with programming in Python, we also implemented an online web server of iFeature. It is hosted on the extensible cloud computing facility supported by the e-Research Centre at Monash University, equipped with 16 cores, 64 GB memory and a 2 TB hard disk. This configuration can be easily upgraded in line with increasing user demands in the future.
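
The distinction between sample clustering and feature clustering can be illustrated with a small sketch: the same algorithm is applied either to the rows of the feature matrix (samples) or to its columns (features), i.e. to the transposed matrix. The plain k-means below is an illustrative stand-in, not iFeature's implementation; the function name, the toy matrix and the simplistic seeding with the first k rows are assumptions.

```python
import math


def kmeans(rows, k, n_iter=20):
    """Plain Lloyd's k-means; the first k rows seed the centroids."""
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(r, centroids[c]))
            for r in rows
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy matrix: 4 samples (rows) x 3 features (columns), made up for illustration.
X = [
    [0.10, 0.12, 0.90],
    [0.80, 0.79, 0.20],
    [0.12, 0.11, 0.88],
    [0.82, 0.81, 0.18],
]

sample_labels = kmeans(X, k=2)                            # cluster the rows
feature_labels = kmeans([list(c) for c in zip(*X)], k=2)  # cluster the columns
print(sample_labels, feature_labels)  # [0, 1, 0, 1] [0, 0, 1]
```

In this toy example the first two features are nearly identical and fall into the same cluster, which is the kind of redundancy that feature clustering is meant to expose.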

Table 2.

A list of various feature clustering, selection and dimensionality reduction algorithms available in iFeature

| Type of functionality | Algorithm |
| --- | --- |
| Feature clustering | K-means (kmeans) |
| | Hierarchical clustering (hcluster) |
| | Mean shift (meanshift) |
| | DBSCAN (dbscan) |
| | Affinity propagation (apc) |
| Feature selection | Chi-square test (CHI2) |
| | Information gain (IG) |
| | Mutual information (MIC) |
| | Pearson's correlation coefficient (pearsonr) |
| Dimensionality reduction | Principal component analysis (PCA) |
| | Latent Dirichlet allocation (LDA) |
| | t-Distributed Stochastic Neighbor Embedding (t-SNE) |
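
To make the feature selection step more concrete, the sketch below ranks features by the absolute value of Pearson's correlation coefficient with a binary class label, one of the selection criteria listed in Table 2. It is a hedged, self-contained illustration and not iFeature's code: the function names, the feature names and the toy data are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_features(X, y, names):
    """Sort feature names by |r| with the label, strongest first."""
    columns = list(zip(*X))
    scores = {n: abs(pearson(list(col), y)) for n, col in zip(names, columns)}
    return sorted(names, key=scores.get, reverse=True)


# Made-up data: 6 samples, 3 features, binary labels.
X = [
    [0.9, 0.5, 0.3],
    [0.8, 0.4, 0.3],
    [0.7, 0.6, 0.2],
    [0.2, 0.5, 0.2],
    [0.1, 0.4, 0.1],
    [0.3, 0.6, 0.2],
]
y = [1, 1, 1, 0, 0, 0]
print(rank_features(X, y, ["f1", "f2", "f3"]))  # ['f1', 'f3', 'f2']
```

A ranking of this kind is usually followed by keeping only the top-scoring features before model training; the cut-off is a modeling choice that this sketch leaves open.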

3 Results

In this work, we have developed iFeature, a comprehensive, flexible and open-source Python toolkit for generating various sequence, structural and physicochemical features derived from protein/peptide sequences. iFeature also integrates various feature clustering, selection and dimensionality reduction algorithms that facilitate feature importance analysis, model training and benchmarking of machine learning-based models. iFeature has been extensively tested to ensure the correctness of its computations, and was purposely designed to ensure workflow efficiency. To the best of our knowledge, this is the first universal toolkit for integrated feature calculation, clustering and selection analysis. In the future, we will integrate more analysis and clustering algorithms to enable interactive analysis and machine learning-based modeling. iFeature is expected to be widely used as a powerful tool in bioinformatics, computational biology and proteome research.

Funding

This work was supported by grants from the Australian Research Council [ARC; LP110200333 and DP120104460], National Natural Science Foundation of China [NSFC; 31701142], National Health and Medical Research Council of Australia [NHMRC; APP1058540], the National Institute of Allergy and Infectious Diseases of the National Institutes of Health [R01 AI111965] and a Major Inter-Disciplinary Research (IDR) Grant Awarded by Monash University. A.L. and T.T.M.-L. were supported by Informatics startup packages through the UAB School of Medicine.

Conflict of Interest: none declared.

References

  1. Altschul S.F. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
  2. Barkan D.T. et al. (2010) Prediction of protease substrates using sequence and structure features. Bioinformatics, 26, 1714–1722.
  3. Bellman R.E. (1961) Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, NJ.
  4. Bhasin M., Raghava G.P. (2004) Classification of nuclear receptors based on amino acid composition and dipeptide composition. J. Biol. Chem., 279, 23262–23266.
  5. Cao D.S. et al. (2013) propy: a tool to generate various modes of Chou’s PseAAC. Bioinformatics, 29, 960–962.
  6. Cao D.S. et al. (2015) Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions. Bioinformatics, 31, 279–281.
  7. Chen X. et al. (2013) Incorporating key position and amino acid residue features to identify general and species-specific Ubiquitin conjugation sites. Bioinformatics, 29, 1614–1622.
  8. Chen Z. et al. (2013) hCKSAAP_UbSite: improved prediction of human ubiquitination sites by exploiting amino acid pattern and properties. Biochim. Biophys. Acta, 1834, 1461–1467.
  9. Chou K.C. (2000) Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem. Biophys. Res. Commun., 278, 477–483.
  10. Chou K.C. (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins, 43, 246–255.
  11. Chou K.C. (2005) Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 21, 10–19.
  12. Chou K.C. (2011) Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol., 273, 236–247.
  13. Chou K.C., Cai Y.D. (2004) Prediction of protein subcellular locations by GO-FunD-PseAA predictor. Biochem. Biophys. Res. Commun., 320, 1236–1239.
  14. Chou K.C., Shen H.B. (2008) Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat. Protoc., 3, 153–162.
  15. Chou P.Y., Fasman G.D. (1978) Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol. Relat. Areas Mol. Biol., 47, 45–148.
  16. Du P. et al. (2012) PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou’s pseudo-amino acid compositions. Anal. Biochem., 425, 117–119.
  17. Du P. et al. (2014) PseAAC-General: fast building various modes of general form of Chou’s pseudo-amino acid composition for large-scale protein datasets. Int. J. Mol. Sci., 15, 3495–3506.
  18. Dubchak I. et al. (1995) Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. USA, 92, 8700–8704.
  19. Dubchak I. et al. (1999) Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins, 35, 401–407.
  20. Kawashima S. et al. (2008) AAindex: amino acid index database, progress report 2008. Nucleic Acids Res., 36 (Database issue), D202–D205.
  21. Larranaga P. et al. (2006) Machine learning in bioinformatics. Brief. Bioinform., 7, 86–112.
  22. Lee T.Y. et al. (2011) Incorporating distant sequence features and radial basis function networks to identify ubiquitin conjugation sites. PLoS One, 6, e17331.
  23. Li Z.R. et al. (2006) PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res., 34, W32–W37.
  24. Libbrecht M.W., Noble W.S. (2015) Machine learning applications in genetics and genomics. Nat. Rev. Genet., 16, 321–332.
  25. Liu L.M. et al. (2017) iPGK-PseAAC: identify lysine phosphoglycerylation sites in proteins by incorporating four different tiers of amino acid pairwise coupling information into the general PseAAC. Med. Chem., 13, 552–559.
  26. Rao H.B. et al. (2011) Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res., 39, W385–W390.
  27. Rottig M. et al. (2010) Combining structure and sequence information allows automated prediction of substrate specificities within enzyme families. PLoS Comput. Biol., 6, e1000636.
  28. Saravanan V., Gautham N. (2015) Harnessing computational biology for exact linear B-cell epitope prediction: a novel amino acid composition-based feature descriptor. Omics, 19, 648–658.
  29. Schneider G., Wrede P. (1994) The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site. Biophys. J., 66, 335–344.
  30. Shen J. et al. (2007) Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. USA, 104, 4337–4341.
  31. Shen H.B., Chou K.C. (2008) PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem., 373, 386–388.
  32. Sokal R.R., Thomson B.A. (2006) Population structure inferred by local spatial autocorrelation: an example from an Amerindian tribal population. Am. J. Phys. Anthropol., 129, 121–131.
  33. Song J. et al. (2010) Cascleave: towards more accurate prediction of caspase substrate cleavage sites. Bioinformatics, 26, 752–760.
  34. Tung C.W., Ho S.Y. (2008) Computational identification of ubiquitylation sites from protein sequences. BMC Bioinformatics, 9, 310.
  35. Xiao N. et al. (2015) protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics, 31, 1857–1859.
  36. Zuo Y. et al. (2017) PseKRAAC: a flexible web server for generating pseudo K-tuple reduced amino acids composition. Bioinformatics, 33, 122–124.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
