Abstract
Cystine-stabilized peptides have great utility as they naturally block ion channels, inhibit acetylcholine receptors, or inactivate microbes. However, only a tiny fraction of these peptides has been characterized. Exploration for novel peptides most efficiently starts with the identification of candidates from genome sequence data. Unfortunately, though cystine-stabilized peptides have shared structures, they have low DNA sequence similarity, restricting the utility of BLAST and even more powerful sequence alignment-based annotation algorithms, such as PSI-BLAST and HMMER. In contrast, a supervised machine learning approach may improve discovery and function assignment of these peptides. To this end, we employed our previously described m-NGSG algorithm, which utilizes hidden signatures embedded in peptide primary sequences that define and categorize structural or functional classes of peptides. From the generalized m-NGSG framework, we derived five specific models that categorize cystine-stabilized peptide sequences into specific functional classes. When compared with PSI-BLAST, HMMER and existing function-specific models, our novel approach (named CSPred) consistently demonstrates superior performance in discovery and function-assignment. We also report an interactive version of CSPred, available through download (https://bitbucket.org/sm_islam/cystine-stabilized-proteins/src) or web interface (watson.ecs.baylor.edu/cspred), for the discovery of cystine-stabilized peptides of specific function from genomic datasets and for genome annotation. We fully describe, in the Availability section following the Discussion, the quick and simple usage of the CsPred website to automatically deliver function assignments for batch submissions of peptide sequences.
Introduction
Cystine-stabilized peptides are impressively abundant and widespread across the taxa. They form the neurotoxic venom fraction of spiders1, snakes2, scorpions3, sea anemones4, jellyfish, corals and conch5 and may be specific for insects, mammals, or reptiles. Other cystine-stabilized peptides serve as antimicrobials6 and defensins in humans, insects, fungi, plants and most other taxa. Functionally, the venom peptides include sodium7, calcium8 and potassium9 ion channel blockers, acetylcholine receptor inhibitors10, or protease inhibitors11. Antimicrobial peptides generally act as membrane disrupters specifically against bacterial or fungal cells, but, due to their ability to penetrate cell membranes, they can also enter eukaryotic cells to act on host DNA directly and to modulate immune responses6. The stability of these peptides and their specific and powerful functions make them strong candidates for a variety of medical and agricultural applications, including pain relief, disruption of cancer development, and environmentally friendly insecticides, fungicides and bactericides, delivered either directly or via transgenes.
Cystine-stabilized peptides are also achieving commercial success. Clinically, alpha-bungarotoxin has a long history of use in isolating and identifying specific acetylchloline receptors and in the diagnosis of myasthenia gravis10. Aprotinin has been shown clinically effective against flu infection by inhibiting protease cleavage of HA0 to HA1 and HA212, and Linaclotide is licensed for clinical use orally against irritable bowel syndrome13. The calcium channel blocker from conch, ziconotide (Prialt), is used clinically as a pain reliever8, and the chloride channel blocker from scorpion, chlorotoxin, reached Phase III trials as a treatment for glioblastoma cancer14. However, only a tiny fraction of cystine-stabilized peptides has been characterized experimentally15–17. To sort through the huge number of remaining cystine-stabilized peptides present in such a wide range of genomes for the purpose of classifying each of these peptides into one of the disparate functional groups, an efficient automated approach is warranted.
Sequence identity of the cystine-stabilized peptides varies broadly and can be distributed into different structural/motif and family-based (the native source of a peptide) classes18. The scorpion toxin-like superfamily17,19,20, agatoxins21, and conotoxins22 are examples of family-based classes, while STPs23, NTPs23, cyclotides24 and knottins25 are examples of structure or motif-based classes. Because of the high degree of heterogeneity in their primary sequences, several sequence alignment independent models have been reported to classify the structure of the cystine-stabilized /disulfide-rich family. For instance, Cypred26 predicts cyclic peptides including cyclotides; Knotter 1D predicts peptides with ICK motifs27; iCTX-Type structures predict types of Conotoxins targeting Ion Channels28; PredCSF predicts conotoxin superfamily from the primary protein sequences29; and PredSTP predicts sequential tri-disulfide motifs in cysteine rich peptide23. In addition, a specific functional group of cystine-stabilized peptides often come from different family or structural classes. Thus, family or structure/motif-based classification will may reveal the functional characteristic of a peptide. Under this context, it is necessary to develop a sequence alignment independent model to discover the functional characteristics in a family of origin or structure agnostic fashion.
Machine learning-based supervised models are widely used to predict the functional and structural class of proteins which are difficult to predict using sequence alignment-based algorithms. However, it is imperative to extract the relevant feature vectors (descriptors) and to implement an optimized classification algorithm to get expected performance from a model. Several classification algorithms have already been exploited to predict protein characteristics from the primary sequences30–32, but, extracting proper descriptors from protein sequences remains a challenging task. A number of descriptors, such as amino acid composition33, autocorrelation34, CTD (composition, transition, and distribution)35, conjoint triads36 and pseudo amino acid compositions37 are routinely used to build machine learning-based models. Recently, we demonstrated a complete pipeline of a classifier constructor where the feature generation model is integrated with a logistic regression algorithm38. This training set pipeline is denoted as m-NGSG (modified n-gram and skip-gram) where a modified n-grams39and skip-grams-based40 framework is used to generate descriptors from the protein sequences and utilize the hidden signatures from the descriptors for the supervised classification41. The m-NGSG framework has proven highly accurate for constructing reliable supervised prediction models38.
In this study, we applied m-NGSG to build five individual models to predict ion channel blockers, antimicrobial peptides, acetylcholine receptor inhibitors, serine protease inhibitors, and hemolytic proteins from disulfide stabilized proteins. Identification of hemolytic characteristics will allow the researcher to eliminate from consideration proteins cytotoxic to humans. The results demonstrate superiority of m-NGSG-based models to PSI-BLAST42, HMMER43 and other available models. Finally, we propose the CSPred model which combines the results of the five different models and gives a probability score for the five important functional characteristics of cystine-stabilized proteins. We also present three classifiers that assign ion channel blockers into three subclasses, sodium, potassium and calcium channel blockers.
Material and Methods
Data acquisition and preparation
The positive and negative datasets for ion channel blockers (ICB), antimicrobial peptides (AMP), acetylcholine receptor inhibitors (ACRI), serine protease inhibitors (SPI), and hemolytic proteins (HLP) are generated by obtaining protein sequences from UniprotKB (knowledgebase)44 using the search keys mentioned in Supplement Table 1. All the protein sequences, including positive and negative classes, contain a minimum of one disulfide bond and a chain size of less than 150 amino acid residues. Thereafter, the protein sequences are curated manually based on the functional attribute for each entry. A portion of the HLP positive dataset is collected from the HemoPI server45. Here, only the sequences containing a minimum of one pair of cysteines are selected from the dataset. The CD-HIT software46 is used to organize sequences based on identity thresh-holds to generate final datasets for each functional group of the cysteine stabilized proteins (See Supplement Table 1). From the positive and negative datasets of each selected functional group, 90% of the chains are retained for training sets, while 10% of the chains are reserved for out-of-sample test sets using a random shuffle-split process. The numbers of chains in each training and test sets are mentioned in Supplement Table 1. Further, to construct a separate compound model to classify the ICB into three different subclasses, we made three separate models using six different training sets (Supplement Table 2). The ICB classifier was constructed to classify the ICBs into sodium, potassium and calcium channel blockers. In order of make the process reproducible, the sequences of the training and test sets are provided in a separate supplementary folder (Supplementary Folder 1).
Model construction using m-NGSG
Five different binary classifiers are constructed to predict each of the five selected functional classes using the m-NGSG algorithm38. The m-NGSG algorithm (avialable at https://bitbucket.org/sm_islam/mngsg/src) offers an integrated and fully automated feature generation method followed by a logistic regression-based model construction, feature generation, and parameter optimization as described previously38. Parameter optimizations employed five-fold cross-validation using appropriate training sets. Supplement Table 3 illustrates the parameters selected by the m-NGSG optimizer for each functional group specific model. A combined model CSPred is further derived from the result aggregation of the five-individual function-based models. A diagram of the CSPred model construction is illustrated in Fig. 1.
Model evaluation
The performances of all five models were evaluated using a five-fold cross-validation. Precision (eq. 1), recall (eq. 2), F1-score (eq. 3), accuracy (eq. 4), and Mathews Correlation Coefficient (MCC) (eq. 5) values are calculated for each model as the evaluation matrices. For calculation of these evaluation matrices, the confusion matrices were constructed to calculate the True Positives (TP), False Positives (FP), True Negatives (TN) and False Negatives (FN). TP and TN are correctly predicted positive and negative data points, respectively. Similarly, FP and FN are incorrectly predicted positive and negative data points, respectively. From TP, TN, FP, and FN, the evaluation matrices were calculated using the following equations:
1 |
2 |
3 |
4 |
5 |
Comparison with PSI-BLAST and HMMER
The performance of each model except the subclasses of ICB is compared with PSI-BLAST42 and HMMER43. The ncbi-blast-2.5.0+ standalone software was downloaded to run PSI-BLAST locally. Similarly, HMMER 3.1b2 was installed in a Linux operating system, and the PHMMER function was used to run HMMER with the default parameters. Evaluation matrices were calculated for the identical training sets with PSI-BLAST using five-fold cross-validation. During cross-validation with PSI-BLAST and PHMMER, the training set was employed to populate the database, while the test set operated as the query. The class of each query sequence was predicted using the highest matching score with the sequences in the database. For PSI-BLAST, cross-validations were conducted using the threshold E-values of 0.01, 0.05, 0.1, 0.5, 1.0 and 5.0 with five iterations. All other parameters were kept as default. Afterwards, the sequence of the out-of-sample test sets from each functional group were predicted keeping the sequence of the corresponding training sets as databases.
Comparison with other available models
Several other models exist to predict subsets of functional groups. iAMP-2L47 and CAMPR348 are available to predict antimicrobial peptides, but are not scoped or optimized to predict cysteine stabilized peptides. We also compare the performance of our AMP model with iAMP-2L and CAMPR3. CAMPR3 offers four different classifiers to predict AMPs: Support Vector Machine (SVM), Random Forest (RF), Artificial Neural Network (ANN), and Discriminant Analysis (DA). We compared our AMP model with all classifiers offered by CAMPR3 using the out-of-sample test set and calculated precision, recall, F1-score, accuracy and MCC values. Similarly, the HemoPI45 model is dedicated to predicting hemolytic peptides, and was used to compare with our HLP model using the out-of-sample test set of HLP and calculated evaluation matrices.
Results
Evaluation of the m-NGSG-based models
ICB, AMP, ACRI, SPI and HLP represent five different functional class-based models constructed using the m-NGSG algorithm38. Each model was evaluated using precision, recall, F1-score, accuracy and MCC scores based on a five-fold cross-validation against a training set. The evaluation matrices are reported in Table 1. The training set accuracies of the five models range from 86.33% to 95.23% where AMP and ACRI rendered the lowest and highest accuracy, respectively. The models also generated F1-Scores ranging from 0.81 to 0.89 and MCC scores ranging from 0.74 to 0.90. In addition, we observed consistent performances from all three ICB subclassifiers. The training set of the NaB, KB and CaB classifiers produced 0.83, 0.82 and 0.73 MCC scores, respectively, from five-fold cross-validations. These classifiers also generated MCC scores of 0.94, 0.87, 0.69, respectively which indicates trivial overfit/underfits of the models. Figure 2 illustrates the other evaluation matrices of the ICB subcalssification models.
Table 1.
Models | Precision | Recall | F1-Score | Accuracy | MCC | |||||
---|---|---|---|---|---|---|---|---|---|---|
Training set | Test set | Training set | Test set | Training set | Test set | Training set | Test set | Training set | Test set | |
ICB | 91.25 | 95.58 | 83.80 | 92.85 | 0.87 | 0.94 | 89.67 | 95.32 | 0.78 | 0.90 |
AMP | 86.56 | 85.96 | 77.08 | 81.66 | 0.81 | 0.84 | 86.33 | 87.74 | 0.71 | 0.74 |
ACRI | 100.00 | 100.00 | 0.80 | 63.63 | 0.89 | 0.78 | 95.23 | 92.00 | 0.87 | 0.76 |
SPI | 97.52 | 96.43 | 79.66 | 81.81 | 0.88 | 0.88 | 91.90 | 92.55 | 0.83 | 0.84 |
HLP | 86.07 | 92.30 | 86.66 | 80.00 | 0.86 | 0.86 | 89.39 | 89.47 | 0.78 | 0.78 |
The precision, recall and accuracy values are shown in percentages.
Abbreviations: ICB = Ion channel blocker; AMP = Antimicrobial peptide; ACRI = Acetylcholine receptor inhibitor; SPI = Serine protease inhibitor; HLP = Hemolytic protein.
To judge the robustness of our approach, it is imperative to compare the performance of the model against established reliable available methods. PSI-BLAST and HMMER are used for generalized comparison, while other comparison groups are more specific. iAMP-2L and CAMPR3 are used to evaluate performance against AMPs, and HemoPI for HLP.
Comparison of the evaluation matrices and area under curve (AUC) with PSI-BLAST and HMMER
PSI-BLAST is a dependable and widely used algorithm to discover distantly related protein sequence using PSSM matrices42. HMMER is a Hidden Markov Model-based algorithm designed to detect remote homologs with a high sensitivity43. We compared the performance of each constructed model with PSI-BLAST and HMMER for the corresponding training sets using a five-fold cross-validation. Supplement Fig. 1 illustrates an extensive comparison among the m-NGSG based models, HMMER, and PSI-BLAST models made with different E-values. Precision, recall, F1-score, accuracy and MCC values are used to evaluate the models against PSI-BLAST. Figure 3A and Supplement Fig. 3 specifically show the comparison of the MCC values of each training set with PSI-BLAST and HMMER. Figure 3B and Supplement Fig. 3 illustrates the standard deviation of the MCC values generated from different folds using different models. The area under curve (AUC) for the five-different m-NGSG-based models were also compared with PSI-BLAST and HMMER using the corresponding training sets. The E-values yielding the best MCC values for each function-based training sets were used to run a PSI-BLAST for the comparison. Figure 4 shows the receiver operating characteristic (ROC) curves for m-NGSG-based models with a side by side area under curve (AUC) comparison among each m-NGSG-based model and the corresponding PSI-BLAST and HMMER-based models. For the five training sets, m-NGSG based models generated better AUCs compared to the corresponding PSI BLAST and HMMER based models.
Comparison of the evaluation matrices with PSI-BLAST and HMMER on the out-of-sample test set
Versatility of the five m-NGSG based models were tested by comparing their performance with PSI-BLAST and HMMER on the corresponding out-of-sample test set. We imported the same E-values from the ROC curve comparison to run the PSI-BLAST on the test sets. The MCC values were measured for each model and the corresponding PSI-BLAST and HMMER to achieve an appropriate comparison. Figure 5 displays comparative bar plots which illustrate the MCC values on the out-of-sample test set produces five different models and PSI-BLAST. According to Fig. 5, each of the five models shows better MCC values compared to their equivalent PSI-BLAST results while four models show better MCC value than HMMER. In the case of AMP, both the m-NGSG-based model and HMMER returns the same MCC value (0.74).
Comparison of AMP and Hemolytic peptide prediction models with other currently available models
Along with PSI-BLAST and HMMER, we used the iAMP-2L47 and CAMPR348 models to predict antimicrobial peptides (AMP), and the HemoPI45 algorithm to predict hemolytic peptides. While it is important to note that none of these models are dedicated to the identification of only cystine- stabilized peptides, their performance parameters should generalize to their prediction. We compared performances of iAMP-L2 and CAMPR3 with our m-NGSG-based AMP model and HemoPI with the m-NGSG-based HLP model using the corresponding out-of-sample test sets. Figure 6 shows the comparative precision, recall, accuracy and MCC values among different models. Among the other available models, CAMP-ANN showed the highest precision score 0.43 or 43% while the precision score produced by m-NGSG-based AMP model was 0.85. CAMP-SVM showed a slightly better recall score than m-NGSG, 0.83 and 0.81, respectively. Overall, the best accuracy score was generated by iAMP-2L (0.54), but it was far less than the accuracy score produced by m-NGSG, which was 0.75. Finally, the highest MCC score was generated by CAMP-ANN (0.13) which was also well below the MCC score of m-NGSG (0.74) (see Fig. 6). Similarly, Supplement Fig. 4 illustrates the comparison on the out-of-sample HLP test set among Hemo PI, PSI-BLAST and m-NGSG. Here, precision, recall, accuracy and MCC scores of HemoPI are 0.55 (55%), 0.67 (66%), 0.66 (66%) and 0.31, respectively. These are lower than corresponding scores of m-NGSG-based HLP.
Discussion
In this study, we constructed five different functional classifiers of cystine -stabilized peptides and combined them to build the CSPred model which predicts the probability of the selected five functional characteristics of query peptide sequences. After building a model, the most important step is to assess its performance using a k-fold cross-validation and out-of-sample test sets. We have performed this step carefully using a five-fold cross-validation and an out-of-sample test set for each of the five models. Table 1 shows a comprehensive comparison among evaluation matrices for each model. The cross-validation accuracy ranged from ~86% to ~95% for the models. The accuracies on the out-of-sample test sets were concordant to the cross-validation accuracies. No big difference was detected between the accuracies for the models except the ICB model where the difference between the test and training set was ~5%. However, the increase of accuracy on the test set indicates versatility of the ICB model. The other evaluation matrices such as F1-score and MCC were also quite consistent between the training and test set (see Table 1). The comparative analysis between the evaluation matrices explains the adaptability of each different function-based model.
The protein sequences collected from the Uniprot Knowledgebase were further filtered using the CD-hit46 software to reduce the homoscedasticity of the model by reducing the redundancy of the similar sequences. Although this operation reduced the amount of available training set, particularly of the ICB subclasses, that helped models to be kept optimally unbiased. The fasta sequences of each protein of the training and test sets from Uniprot Knowledgebase may contain some other domains along with the functional domains. Despite the presence of these noises (the subsequences from other domains) in the datapoints (here the data points are the primary sequences of the proteins), CSPred showed a robust performance in the cross-validation and out-of-sample test-set classification. This outcome indicates the strength of m-NGSG algorithm to classify data points with noise. From a previous study, we demonstrated that m-NGSG is capable to differentiate noises in datasets during a classification task38.
The ultimate success and novelty of a machine learning-based model depends on its superiority over other concurrent algorithms. PSI-BLAST and HMMER are successful and widely accepted algorithms to discover distantly related protein chains. Therefore, we compared the performance of each five function-based model with PSI-BLAST and HMMER using the corresponding training and out-of-sample test sets. One complexity and disadvantage to working with PSI-BLAST is choosing the optimal E-value; it is challenging to select an E-value that will give the best results. Supplement Fig. 1 shows a clear superiority of m-NGSG-based methods over the equivalent PSI-BLAST with different E-values and HMMER except better recall values obtained by HMMER compared to m-NGSG based models. That explains only a better sensitivity of HMMER than m-NGSG models but not the overall performance. The MCC score is chosen over an accuracy score for further comparison because MCC is more robust and reflects the sensitivity, specificity, precision and false negative rate while accuracy only reflects the average of sensitivity and specificity of a model49. Similar to the training set, m-NGSG-based models showed better MCC values on the out-of-sample test sets compared to the corresponding HMMER and PSI-BLAST with the optimized E-values, Fig. 5. This result demonstrates consistently better performance over HMMER and PSI-BLAST and unbiased behavior of the m-NGSG-based models. In addition to PSI-BLAST and HMMER, we compared the m-NGSG-based model with other available function specific prediction models. There are two available models to predict antimicrobial peptides: iAMP-2Land CAMPR3. CAMPR3 also has four different classifiers to perform the antimicrobial peptide prediction which are SVM, RF, ANN, and DA. We evaluated all the models by computing the four evaluation matrices (described in Fig. 5) on the out-of-sample test sets. The performance of iAMP-2L and CAMPR3 were significantly low compared to PSI-BLAST and m-NGSG-based AMP model. The reason is possibly the training sets are not optimized to predict the cystine-stabilized AMPs. The similar results are found when we compared HemoPI (see Supplement Fig. 4) with the m-NGSG-based HLP model. These results indicated that the m-NGSG-based models are superior to any other concurrent algorithms to classify functions of cystine-stabilized peptides.
Several cystine-stabilized peptides have already been licensed for clinical or agricultural use. This small fraction demonstrates the potential for new applications hidden among the thousands of undiscovered cystine-stabilized peptide sequences in genomes across many taxa. A voltage-gated calcium channel blocker cystine-stabilized peptide (Hv1a) from spider venom50 is now the primary product of Vestaron, Inc., with commercial production in E. coli for broad-scale application on crops plants as an eco-friendly insecticide that degrades within two weeks after application. This same spider peptide has been fused to a targeting moiety by another group to specifically target aphids as a transgene in plants51. In our own lab (CMK), antimicrobial cystine-stabilized peptides have been targeted for specific toxicity towards individual pathogenic bacterial species, with nontarget toxicity greatly reduced (Islam et al., unpublished data). This has implications for antibiotic treatment without the disruption of the native microbiome. Thus, a diverse array of different cystine-stabilized peptides has realized commercial application.
The model approach outlined here has the potential to greatly impact the discovery of functionally active peptides. A typical pipeline might involve finding top candidate peptide sequences using genome databases and one or more of the prediction models, followed by peptide production of the top candidates in an appropriate heterologous expression system and wet lab evaluation of the peptides. Using an antimicrobial peptide screen as an example, our AMP algorithm would be used to screen a dataset of peptides (see the Availability section below). These candidate sequences would then be screened with the HLP (hemolytic protein) algorithm to eliminate peptides that might be toxic to human cells. From the pool of remaining candidates, synthesized DNA sequences would be cloned into an E. coli expression vector alongside a stabilizing fusion partner such as SUMO52 and a purification tag such as 6x His-Tag. The peptides would be expressed and purified, and then confirmed for stability, toxicity against the target microbe, and lack of toxicity against human cells. Top candidates would be available for pharmaceutical production systems or to be used as transgenes in the organism to be protected by the antimicrobial peptide53. An advantage with using peptides over typical small molecule drugs is the relative ease with which an appropriate modifying peptide can be found and genetically fused to the effector peptide sequence, for example an antimicrobial peptide targeted to a specific pathogenic bacterium with a targeting peptide54. It should be noted that the metadata associated with the peptide candidate sequence can also be used to help select peptides with the desired action. For example, ion channel blockers with oral toxicity might be more commonly found in plants and algae, where an ion channel blocker might serve as an oral insecticide, than in spider or snake venom, where the ion channel blocker would most likely have toxicity only by injection.
Availability
CSPred is an open source collaborative initiative available in the bitbucket repository (https://bitbucket.org/sm_islam/cystine-stabilized-proteins/src). It is also publicly available as a free web application at watson.ecs.baylor.edu/cspred. The web server provides an accessibility to the CSPred, and a user does not need computational experience to use the model. Posting the web address (watson.ecs.baylor.edu/cspred) on a web browser will take the user to the CSPred webpage. There, the user needs to upload the fasta file of the unknown protein sequences and click the submit button. That action will trigger the prediction process. The result is divided into six columns. The first column is the protein ID labels of the fasta sequences. The second, third, fourth, fifth, and sixth columns display the probability values of being an ICB, AMP, ACRI, SRI, and HLP, respectively, for each sequence submitted. Thus, the web interface provides a simple avenue to categorize submitted protein sequences according to these five functional characteristics, and may use a high-throughput batch-style input. Supplementary Figs 5–7 demonstrate the pipeline to use the CSPred web application. The sub classifiers of ICB are not included in CSPred. However, all the training and test datasets are provided as a supplement enabling users to make their own models or reproduce the same models using the m-NGSG framework that is available at watson.ecs.baylor.edu/ngsg.
Electronic supplementary material
Acknowledgements
We acknowledge Benjamin J. Heil for making our web-application publicly available through http://watson.ecs.baylor.edu/cspred.
Author Contributions
S.M.A.I. and C.M.K. envisioned the proposal while S.M.A.I. and E.J.B. outlined the project plan, S.M.A.I. prepared the data and conducted the analysis. S.M.A.I., E.J.B. and C.M.K. generated the manuscript.
Competing Interests
The authors declare no competing interests.
Footnotes
Electronic supplementary material
Supplementary information accompanies this paper at 10.1038/s41598-018-27177-8.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.King GF, Hardy MC. Spider-Venom Peptides: Structure, Pharmacology, and Potential for Control of Insect Pests. Annu. Rev. Entomol. 2013;58:475–496. doi: 10.1146/annurev-ento-120811-153650. [DOI] [PubMed] [Google Scholar]
- 2.Chan YS, et al. Snake venom toxins: toxicity and medicinal applications. Appl. Microbiol. Biotechnol. 2016;100:6165–6181. doi: 10.1007/s00253-016-7610-9. [DOI] [PubMed] [Google Scholar]
- 3.Ortiz E, Gurrola GB, Schwartz EF, Possani LD. Scorpion venom components as potential candidates for drug development. Toxicon. 2015;93:125–135. doi: 10.1016/j.toxicon.2014.11.233. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Frazão B, Vasconcelos V, Antunes A. Sea Anemone (Cnidaria, Anthozoa, Actiniaria) Toxins: An Overview. Mar. Drugs. 2012;10:1812–1851. doi: 10.3390/md10081812. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Akondi KB, et al. Discovery, Synthesis, and Structure–Activity Relationships of Conotoxins. Chem. Rev. 2014;114:5815–5847. doi: 10.1021/cr400401e. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Nguyen LT, Haney EF, Vogel HJ. The expanding scope of antimicrobial peptide structures and their modes of action. Trends Biotechnol. 2011;29:464–472. doi: 10.1016/j.tibtech.2011.05.001. [DOI] [PubMed] [Google Scholar]
- 7.Munasinghe N, Christie M. Conotoxins That Could Provide Analgesia through Voltage Gated Sodium Channel Inhibition. Toxins. 2015;7:5386–5407. doi: 10.3390/toxins7124890. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Bourinet, E. & Zamponi, G. W. Block of voltage-gated calcium channels by peptide toxins. Neuropharmacology, 10.1016/j.neuropharm.2016.10.016 (2016). [DOI] [PubMed]
- 9.Norton RS, Chandy KG. Venom-derived peptide inhibitors of voltage-gated potassium channels. Neuropharmacology. 2017 doi: 10.1016/j.neuropharm.2017.07.002. [DOI] [PubMed] [Google Scholar]
- 10.Dutertre, S., Nicke, A. & Tsetlin, V. I. Nicotinic acetylcholine receptor inhibitors derived from snake and snail venoms. Neuropharmacology, 10.1016/j.neuropharm.2017.06.011 (2017). [DOI] [PubMed]
- 11.Mourão C, Schwartz E. Protease Inhibitors from Marine Venomous Animals and Their Counterparts in Terrestrial Venomous Animals. Mar. Drugs. 2013;11:2069–2112. doi: 10.3390/md11062069. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Zhirnov OP, Klenk HD, Wright PF. Aprotinin and similar protease inhibitors as drugs against influenza. Antiviral Res. 2011;92:27–36. doi: 10.1016/j.antiviral.2011.07.014. [DOI] [PubMed] [Google Scholar]
- 13.Layer P, Stanghellini V. Review article: linaclotide for the management of irritable bowel syndrome with constipation. Aliment. Pharmacol. Ther. 2014;39:371–384. doi: 10.1111/apt.12604. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Cohen-Inbar O, Zaaroor M. Glioblastoma multiforme targeted therapy: The Chlorotoxin story. J. Clin. Neurosci. 2016;33:52–58. doi: 10.1016/j.jocn.2016.04.012. [DOI] [PubMed] [Google Scholar]
- 15.Mobli M, Undheim EAB, Rash LD. Modulation of Ion Channels by Cysteine-Rich Peptides: From Sequence to Structure. Adv. Pharmacol. San Diego Calif. 2017;79:199–223. doi: 10.1016/bs.apha.2017.03.001. [DOI] [PubMed] [Google Scholar]
- 16.Silverstein KAT, et al. Small cysteine-rich peptides resembling antimicrobial peptides have been under-predicted in plants: Under-predicted cysteine-rich peptides in plants. Plant J. 2007;51:262–280. doi: 10.1111/j.1365-313X.2007.03136.x. [DOI] [PubMed] [Google Scholar]
- 17.Kuzmenkov AI, Grishin EV, Vassilevski AA. Diversity of Potassium Channel Ligands: Focus on ScorpionToxins. Biochem. Mosc. 2015;80:1764–1799. doi: 10.1134/S0006297915130118. [DOI] [PubMed] [Google Scholar]
- 18.Cheek S, Krishna SS, Grishin NV. Structural Classification of Small, Disulfide-rich Protein Domains. J. Mol. Biol. 2006;359:215–237. doi: 10.1016/j.jmb.2006.03.017. [DOI] [PubMed] [Google Scholar]
- 19.Santibáñez-López CE, Possani LD. Overview of the Knottin scorpion toxin-like peptides in scorpion venoms: Insights on their classification and evolution. Toxicon. 2015;107:317–326. doi: 10.1016/j.toxicon.2015.06.029. [DOI] [PubMed] [Google Scholar]
- 20.Possani LD, Becerril B, Delepierre M, Tytgat J. Scorpion toxins specific for Na+-channels. Eur. J. Biochem. 1999;264:287–300. doi: 10.1046/j.1432-1327.1999.00625.x. [DOI] [PubMed] [Google Scholar]
- 21.Adams ME. Agatoxins: ion channel specific toxins from the american funnel web spider, Agelenopsis aperta. Toxicon. 2004;43:509–525. doi: 10.1016/j.toxicon.2004.02.004. [DOI] [PubMed] [Google Scholar]
- 22.Olivera BM, et al. Peptide neurotoxins from fish-hunting cone snails. Science. 1985;230:1338–1343. doi: 10.1126/science.4071055. [DOI] [PubMed] [Google Scholar]
- 23.Islam SMA, Sajed T, Kearney CM, Baker EJ. PredSTP: a highly accurate SVM based model to predict sequential cystine stabilized peptides. BMC Bioinformatics. 2015;16:210. doi: 10.1186/s12859-015-0633-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Craik DJ, Simonsen S, Daly NL. The cyclotides: novel macrocyclic peptides as scaffolds in drug design. Curr. Opin. Drug Discov. Devel. 2002;5:251–260. [PubMed] [Google Scholar]
- 25.Gracy J, et al. KNOTTIN: the knottin or inhibitor cystine knot scaffold in 2007. Nucleic Acids Res. 2007;36:D314–D319. doi: 10.1093/nar/gkm939. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Kedarisetti P, Mizianty MJ, Kaas Q, Craik DJ. & Kurgan, L. Prediction and characterization of cyclic proteins from sequences in three domains of life. Biochim. Biophys. Acta BBA - Proteins Proteomics. 2014;1844:181–190. doi: 10.1016/j.bbapap.2013.05.002. [DOI] [PubMed] [Google Scholar]
- 27.Gelly J-C. The KNOTTIN website and database: a new information system dedicated to the knottin scaffold. Nucleic Acids Res. 2004;32:156D–159. doi: 10.1093/nar/gkh015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Ding H, et al. iCTX-Type: A Sequence-Based Predictor for Identifying the Types of Conotoxins in TargetingIon Channels. BioMed Res. Int. 2014;2014:1–10. doi: 10.1155/2014/286419. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Fan Y-X, Song J, Kong X, Shen H-B. PredCSF: An Integrated Feature-Based Approach for Predicting Conotoxin Superfamily. Protein Pept. Lett. 2011;18:261–267. doi: 10.2174/092986611794578341. [DOI] [PubMed] [Google Scholar]
- 30.Sharma A, et al. Computational approach for designing tumor homing peptides. Sci. Rep. 2013;3:1607. doi: 10.1038/srep01607. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Simeon S, et al. osFP: a web server for predicting the oligomeric states of fluorescent proteins. J. Cheminformatics. 2016;8:72. doi: 10.1186/s13321-016-0185-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Du P, Cao S, Li Y. SubChlo: Predicting protein subchloroplast locations with pseudo-amino acid composition and the evidence-theoretic K-nearest neighbor (ET-KNN) algorithm. J. Theor. Biol. 2009;261:330–335. doi: 10.1016/j.jtbi.2009.08.004. [DOI] [PubMed] [Google Scholar]
- 33.Zhang G-Y, Fang B-S. Predicting the cofactors of oxidoreductases based on amino acid composition distribution and Chou’s amphiphilic pseudo-amino acid composition. J. Theor. Biol. 2008;253:310–315. doi: 10.1016/j.jtbi.2008.03.015. [DOI] [PubMed] [Google Scholar]
- 34.Xia J-F, Han K, Huang D-S. Sequence-Based Prediction of Protein-Protein Interactions by Means of Rotation Forest and Autocorrelation Descriptor. Protein Pept. Lett. 2010;17:137–145. doi: 10.2174/092986610789909403. [DOI] [PubMed] [Google Scholar]
- 35.Dubchak I, Muchnik I, Holbrook SR, Kim SH. Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. USA. 1995;92:8700–8704. doi: 10.1073/pnas.92.19.8700. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Chang D, Syu Y-T, Lin P-C. Predicting the protein-protein interactions using primary structures with predicted protein surface. BMC Bioinformatics. 2010;11:S3. doi: 10.1186/1471-2105-11-S1-S3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Shen H-B, Chou K-C. PseAAC: A flexible web server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem. 2008;373:386–388. doi: 10.1016/j.ab.2007.10.012. [DOI] [PubMed] [Google Scholar]
- 38.Islam, S. M. A., Heil, B. J., Kearney, C. M. & Baker, E. J. Protein classification using modified n-grams and skip-grams. Bioinformatics10.1093/bioinformatics/btx823 (2017) [DOI] [PubMed]
- 39.Keˇselj, V., Peng, F., Cercone, N. & Thomas, C. N-gram-based author profiles for authorship attribution. In Pacific Association for Computational Linguistics (2003).
- 40.Guthrie, D., Allison, B., Liu, W., Guthrie, L. & Wilks, Y. A closer look at skip-gram modelling. In Proceedings of the 5th international Conference on Language Resources and Evaluation (LREC-2006), 1–4 (2006).
- 41.Islam, S. A., Kearney, C. M., Choudhury, A. & Baker, E. J. Protein Classification using Modified N-Gram and Skip-Gram Models: Extended Abstract. In 586–586, 10.1145/3107411.3108193 (ACM Press, 2017).
- 42.Altschul S. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011;39:W29–W37. doi: 10.1093/nar/gkr367. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Boeckmann B. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31:365–370. doi: 10.1093/nar/gkg095. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Chaudhary, K. et al. A Web Server and Mobile App for Computing Hemolytic Potency of Peptides. Sci. Rep. 6 (2016). [DOI] [PMC free article] [PubMed]
- 46.Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26:680–682. doi: 10.1093/bioinformatics/btq003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Xiao X, Wang P, Lin W-Z, Jia J-H, Chou K-C. iAMP-2L: A two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal. Biochem. 2013;436:168–177. doi: 10.1016/j.ab.2013.01.019. [DOI] [PubMed] [Google Scholar]
- 48.Waghu FH, Barai RS, Gurung P, Idicula-Thomas S. CAMP R3: a database on sequences, structures and signatures of antimicrobial peptides: Table 1. Nucleic Acids Res. 2016;44:D1094–D1097. doi: 10.1093/nar/gkv1051. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Powers DM. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. J. Mach. Learn. Technol. 2011;2:37–63. [Google Scholar]
- 50.Herzig V, King G. The Cystine Knot Is Responsible for the Exceptional Stability of the Insecticidal Spider Toxin ω-Hexatoxin-Hv1a. Toxins. 2015;7:4366–4380. doi: 10.3390/toxins7104366. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Bonning BC, et al. Toxin delivery by the coat protein of an aphid-vectored plant virus provides plant resistance to aphids. Nat. Biotechnol. 2013;32:102–105. doi: 10.1038/nbt.2753. [DOI] [PubMed] [Google Scholar]
- 52.Butt TR, Edavettal SC, Hall JP, Mattern MR. SUMO fusion technology for difficult-to-express proteins. Protein Expr. Purif. 2005;43:1–9. doi: 10.1016/j.pep.2005.03.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Peschen D, Li H-P, Fischer R, Kreuzaler F, Liao Y-C. Fusion proteins comprising a Fusarium-specific antibody linked to antifungal peptides protect plants against a fungal pathogen. Nat. Biotechnol. 2004;22:732–738. doi: 10.1038/nbt970. [DOI] [PubMed] [Google Scholar]
- 54.Eckert R, et al. Targeted Killing of Streptococcus mutans by a Pheromone-Guided ‘Smart’ Antimicrobial Peptide. Antimicrob. Agents Chemother. 2006;50:3651–3657. doi: 10.1128/AAC.00622-06. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.