AntiAngioPred: A Server for Prediction of Anti-Angiogenic Peptides

Azhagiya Singam Ettayapuram Ramaprasad; Sandeep Singh; Raghava Gajendra P S; Subramanian Venkatesan

doi:10.1371/journal.pone.0136990

PLoS One. 2015 Sep 3;10(9):e0136990. doi: 10.1371/journal.pone.0136990

Editor: Anna Tramontano³

PMCID: PMC4559406 PMID: 26335203

Abstract

The process of angiogenesis is a vital step towards the formation of malignant tumors. Anti-angiogenic peptides are therefore promising candidates in the treatment of cancer. In this study, we have collected anti-angiogenic peptides from the literature and analyzed the residue preference in these peptides. Residues like Cys, Pro, Ser, Arg, Trp, Thr and Gly are preferred while Ala, Asp, Ile, Leu, Val and Phe are not preferred in these peptides. There is a positional preference for Ser, Pro, Trp and Cys in the N-terminal region and for Cys, Gly and Arg in the C-terminal region of anti-angiogenic peptides. Motif analysis identifies motifs such as “CG-G”, “TC”, “SC” and “SP-S”, which are highly prominent in anti-angiogenic peptides. Based on the primary analysis, we developed prediction models using different machine learning based methods. The maximum accuracy and MCC for the amino acid composition based model are 80.9% and 0.62, respectively. The performance of the models on an independent dataset is also reasonable. Based on the above study, we have developed a user-friendly web server named “AntiAngioPred” for the prediction of anti-angiogenic peptides. The AntiAngioPred web server is freely accessible at http://clri.res.in/subramanian/tools/antiangiopred/index.html (mirror site: http://crdd.osdd.net/raghava/antiangiopred/).

Introduction

Angiogenesis is the growth of new capillary blood vessels, a process used in healing and reproduction. It occurs during wound healing and restores blood flow to tissues after injury. In healthy tissues, angiogenesis is controlled by maintaining a balance between growth and inhibitory factors [1, 2]. Angiogenesis is regulated by ‘on’ and ‘off’ switches. Angiogenesis-stimulating growth factors are considered the ‘on switches’, while angiogenesis inhibitors are considered the ‘off switches’. Excess production of angiogenic growth factors favors the growth of blood vessels, while an excess of angiogenic inhibitors prevents angiogenesis. Recent studies have identified several endogenous anti-angiogenic peptides from various biological sources, which regulate angiogenesis and tumor growth [3–6].

Several peptides derived from various proteins inhibit angiogenesis [3, 4, 7–12]. Matrix metalloproteinases also generate angiogenic inhibitors in vitro by proteolytically cleaving fragments from the pericellular matrix to produce endostatin, tumstatin, angiostatin, etc. [13]. These peptides inhibit endothelial cell proliferation, migration, tube formation and matrigel neovascularization. For example, the anti-angiogenic properties of arresten are mediated through α1β1 integrin [7]. Recently, novel anti-angiogenic activity was localized to amino acids 54–132 using deletion mutagenesis of tumstatin [9]. The peptides, similar to tumstatin and the tum-5 domain, bind and function via αvβ3 in an RGD-independent manner.

The increasing interest in peptide based therapeutics has led to the development of many databases of peptides with therapeutic properties such as anticancer [14], antihypertensive [15], antimicrobial [16], blood-brain barrier [17], antiparasitic [18], hemolytic [19], quorum-sensing [20], tumor homing [21] and cell penetrating [22]. So far, peptide based drugs have been employed for many diseases and are being investigated in clinical applications against tumors, either for imaging or therapy [3, 23–26]. In general, they are attractive molecules as therapeutics because of their natural availability, ability to penetrate cells, specific target binding, and diverse modifications giving flexibility for different applications. The discovery of peptide inhibitors of angiogenesis will help in the development of therapeutic treatments against cancer. Several web-based tools are available for the annotation of protein sequences to understand the family and subfamily of a protein [27–29]. So far, there are no web-based tools to predict anti-angiogenic peptides. Thus, the search for anti-angiogenic agents for the treatment of cancer is particularly important. Hence, in this study, a systematic attempt has been made to develop machine learning based models using various features extracted from peptide sequences, such as binary profile patterns (BPP), amino acid composition (AAC) and dipeptide composition (DPC). A user-friendly web server has also been developed to help experimental biologists predict anti-angiogenic peptides.

Methods

Datasets

Positive dataset

The main dataset was collected from the literature. In this study, we have obtained 257 anti-angiogenic peptides from various research articles and patents (S5 Table). Due to the redundancy in the sequences, CD-HIT software was used to eliminate highly similar sequences and it was ensured that no two sequences have more than 70% sequence identity. The resulting dataset contains 135 sequences in the positive dataset (S1 Table). Among these 135 sequences, 20% of the dataset (~28 sequences) was kept separately to be used as independent dataset (S3 Table).

Negative dataset

As there is no source of experimentally proven non-anti-angiogenic peptides, we extracted 135 random peptide regions from proteins in the Swiss-Prot database [30] and treated them as non-anti-angiogenic peptides (S1 Table). Some of these randomly selected peptides could be anti-angiogenic in nature, but the probability is very low. The random peptide sequences were extracted in such a way that the length distribution of the dataset remains the same as that of the positive dataset. Among these 135 sequences, 20% of the dataset (~28 sequences) was kept separately to be used as an independent dataset.
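One way to obtain a length-matched negative set is sketched below. This is only an illustration of the idea, not the procedure used in the study: the function name, the one-window-per-positive-peptide strategy and the fixed seed are assumptions, and the protein sequences are expected to be supplied by the caller (for example, parsed from a Swiss-Prot FASTA file).

```python
import random


def sample_length_matched_negatives(positive_peptides, protein_sequences, seed=0):
    """Cut one random window per positive peptide, with the same length.

    Hypothetical helper: it reproduces the length distribution of the
    positive set exactly by drawing one window for each positive peptide.
    """
    rng = random.Random(seed)
    negatives = []
    for peptide in positive_peptides:
        length = len(peptide)
        # Only proteins long enough to contain a window of this length qualify.
        candidates = [p for p in protein_sequences if len(p) >= length]
        if not candidates:
            raise ValueError(f"no protein of length >= {length}")
        protein = rng.choice(candidates)
        start = rng.randrange(len(protein) - length + 1)
        negatives.append(protein[start:start + length])
    return negatives
```

Because each window copies the length of one positive peptide, the two sets have identical length distributions by construction.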

Terminus datasets

We divided the main dataset into nine terminus datasets, which are NT5, CT5, NTCT5, NT10, CT10, NTCT10, NT15, CT15 and NTCT15. NT5 and CT5 contain first five residues and last five residues from the N-terminal and C-terminal region of the peptide sequence respectively. NTCT5 is obtained by joining the NT5 and CT5 sequence. Similarly other terminus datasets were also constructed to understand the region of the peptide containing maximum information to discriminate these peptides from random sequence.
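The construction of the terminus segments can be sketched as follows. The function name is hypothetical, and skipping peptides shorter than n residues is an assumption; the text does not say how such peptides were handled.

```python
def terminus_segments(peptide, n):
    """Return the NTn, CTn and NTCTn segments of a peptide.

    Returns None when the peptide is shorter than n residues
    (an assumption made for this sketch).
    """
    if len(peptide) < n:
        return None
    nt = peptide[:n]   # first n residues (N-terminal region)
    ct = peptide[-n:]  # last n residues (C-terminal region)
    return {f"NT{n}": nt, f"CT{n}": ct, f"NTCT{n}": nt + ct}
```

Calling this function with n = 5, 10 and 15 for every peptide gives the nine terminus datasets.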

Independent dataset

The independent dataset was made by extracting 20% of the sequences (~28 sequences) from the positive, as well as negative dataset, thereby making a total of 56 sequences (S2 Table). These sequences were not used in either training or testing procedure while developing any model.

Random datasets

In order to check the reliability of the models, we created five more random negative datasets using the same procedure as used for the negative dataset. These datasets were created to check whether the performance of the developed model changes if the negative dataset is replaced with another randomly created dataset. These datasets were named ‘Random1’, ‘Random2’, ‘Random3’, ‘Random4’ and ‘Random5’ (S4 Table).

Calculation of residue propensities

The propensity of each amino acid in anti-angiogenic peptides was calculated by the following formula:

P(i) = \frac{AACp(i)}{AACp(i) + AACs(i)}

(Eq 1)

where P(i) represents the propensity of the i-th amino acid, and AACp(i) and AACs(i) represent the average composition of the i-th amino acid in the positive and Swiss-Prot datasets, respectively. We also calculated the position-wise propensities of amino acids in both the N-terminal and C-terminal regions of the peptides.
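A minimal sketch of the propensity calculation in Eq 1 is given below. It assumes that "average composition" means the per-sequence percentage composition averaged over all sequences in a dataset; the function names and the choice to return None when a residue is absent from both datasets are illustrative assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def average_composition(sequences):
    """Mean percentage of each residue, averaged over the sequences."""
    totals = {aa: 0.0 for aa in AMINO_ACIDS}
    for seq in sequences:
        for aa in AMINO_ACIDS:
            totals[aa] += 100.0 * seq.count(aa) / len(seq)
    return {aa: totals[aa] / len(sequences) for aa in AMINO_ACIDS}


def residue_propensity(positive_peptides, reference_sequences):
    """P(i) = AACp(i) / (AACp(i) + AACs(i)) for each residue i."""
    aac_p = average_composition(positive_peptides)
    aac_s = average_composition(reference_sequences)
    propensity = {}
    for aa in AMINO_ACIDS:
        denominator = aac_p[aa] + aac_s[aa]
        # Undefined when the residue occurs in neither dataset.
        propensity[aa] = aac_p[aa] / denominator if denominator > 0 else None
    return propensity
```

With this definition a value of 0.5 means the residue is equally frequent in both datasets, and values close to 1 indicate enrichment in the positive set.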

Cross validation technique

In the present study, we used the ten-fold cross-validation technique to develop our models. In this technique, the sequences were randomly divided into ten sets. Nine sets were used for training the model while the remaining tenth set was used for testing. The process was then repeated ten times such that each set was used once as a test set. The average performance over all ten sets is reported as the final performance of the method.
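The splitting and averaging steps can be sketched as below. The names `k_fold_splits`, `cross_validate` and the `train_and_score` callback are hypothetical; the callback stands in for whatever classifier is trained on the nine training sets and scored on the held-out set.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random division into sets
    folds = [indices[i::k] for i in range(k)]     # k nearly equal sets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, train_and_score, k=10, seed=0):
    """Average the score returned by train_and_score over all k folds."""
    scores = [train_and_score(train, test)
              for train, test in k_fold_splits(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

Each sample index appears in exactly one test set, so every sequence is tested once and never used to train the model that scores it.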

Machine learning approaches

Different machine learning techniques like Support Vector Machines (SVM), Neural Networks (Multilayer Perceptron), Bayesian approach (Naïve Bayes) [31], Nearest Neighbor (IBk) [32], Decision trees (Random Forest and J48) [33, 34] and logistic regression [35] were used to develop the models. The SVM based method was implemented using the SVM^light software [36] while the rest of the methods were implemented using the WEKA package [37].

Input features for prediction

A machine learning based method requires a set of numerical features as input. These features capture global information about the biological molecules being studied. The features used in this study are described below.

Amino acid composition (AAC)

It is represented by the percentage of each amino acid within a peptide, giving a vector of size 20. It was calculated using the following equation:

AAC(i) = \frac{A_{i}}{N} \times 100

(Eq 2)

where AAC(i) represents the percentage of amino acid i, A_i represents the frequency of the i-th residue and N is the total number of residues in the peptide.
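As an illustration, the percentage composition described above can be computed as in the following sketch. The function name and the alphabetical ordering of the 20 residues in the output vector are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 features


def amino_acid_composition(peptide):
    """Return the 20-element vector of residue percentages for a peptide."""
    n = len(peptide)  # total number of residues in the peptide
    return [100.0 * peptide.count(aa) / n for aa in AMINO_ACIDS]
```

For a peptide made up only of standard residues the 20 values sum to 100.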

Dipeptide composition

Dipeptide composition refers to the percentage of all the possible pair of amino acids (e.g. AA, AC, AD etc.) present in the peptide. It represents a vector size of 400 (20 x 20) and also includes information about the neighboring residues. It was calculated using the following equation:

DPC(i) = \frac{DP(i)}{N}

(Eq 3)

where DPC(i) represents the percentage of dipeptide i, DP(i) represents the frequency of the i-th dipeptide and N represents the total number of dipeptides.
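A sketch of the dipeptide composition follows. It assumes overlapping dipeptides, so a peptide of length L contains N = L - 1 dipeptides, and it returns DP(i)/N exactly as written in Eq 3; multiplying each value by 100 would express it as a percentage. The names used are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs: AA, AC, AD, ...
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(peptide):
    """Return the 400-element vector DP(i) / N for a peptide."""
    n = len(peptide) - 1  # number of overlapping dipeptides
    if n < 1:
        raise ValueError("peptide must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n):
        pair = peptide[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are ignored
            counts[pair] += 1
    return [counts[dp] / n for dp in DIPEPTIDES]
```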

Binary profile

In binary profile, each amino acid is represented by a binary vector of size 20 where one element of the vector corresponding to the presence of a particular amino acid is represented by 1 and other 19 elements are represented by 0. (e.g. Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0). Therefore for a stretch of 5 amino acids, the total vector size of binary profile will be 100 (20 x 5).
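The encoding can be sketched as follows. The alphabetical residue ordering is an assumption (the example in the text only fixes Ala as the first element), and the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering, Ala first


def binary_profile(segment):
    """Concatenate one 20-element one-hot vector per residue."""
    vector = []
    for residue in segment:
        one_hot = [0] * len(AMINO_ACIDS)
        # str.index raises ValueError for non-standard residues.
        one_hot[AMINO_ACIDS.index(residue)] = 1
        vector.extend(one_hot)
    return vector
```

Applied to a five-residue segment such as an NT5 or CT5 sequence, this gives the 100-element vector described above.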

Two sample logos

Online service of two sample logo software was used to generate two sample logos [38–40]. It is useful in representing the frequency of amino acids at specific positions in the peptide sequence. The size of the residues displayed at each position is proportional to the relative frequency of each amino acid at that position.

Performance measures

The performance of the developed models was calculated using standard performance parameters: Sensitivity (Sn), Specificity (Sp), Accuracy (Acc) and Matthews correlation coefficient (MCC). These are given by the following equations:

Sensitivity = \frac{TP}{(TP + FN)}

(Eq 4)

Specificity = \frac{TN}{(TN + FP)}

(Eq 5)

Accuracy = \frac{(TP + TN)}{(TP + FP + TN + FN)} \times 100

(Eq 6)

MCC = \frac{(TP)(TN) - (FP)(FN)}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}

(Eq 7)

where TP, TN, FP and FN represent True Positive, True Negative, False Positive and False Negative, respectively.
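The four measures can be computed directly from the confusion-matrix counts, as in the sketch below. The function name is hypothetical, both classes are assumed to be present in the test set, and returning an MCC of 0 when its denominator is zero is a common convention rather than something stated in the text.

```python
import math


def performance(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc (in percent) and MCC following Eqs 4 to 7."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # Convention assumed here: MCC is 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Sn": sensitivity, "Sp": specificity, "Acc": accuracy, "MCC": mcc}
```

For example, with TP = 40, TN = 45, FP = 5 and FN = 10, the sensitivity is 0.8, the specificity is 0.9, the accuracy is 85% and the MCC is approximately 0.70.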

Results

Amino acid composition analysis

The amino acid composition analysis was carried out to identify residues that are dominant in anti-angiogenic peptides. We compared the average amino acid composition in anti-angiogenic and non-anti-angiogenic peptides (Fig 1). It was observed that residues like Cys, Pro, Ser, Arg, Trp, Thr and Gly are predominant in anti-angiogenic peptides while residues Ala, Asp, Ile, Leu, Val and Phe are under-represented in these peptides. Composition was also computed for all the random datasets and compared with the negative dataset; no bias was observed, suggesting that the negative dataset is effectively random. We also calculated the average amino acid composition of the entire Swiss-Prot database to be used as a reference for analyzing the difference (Fig 1).

Fig 1 — Composition of entire Swiss-Prot is taken for reference and composition of all random datasets.

Residue propensities and positional preference

The propensities of residues are in accordance with the amino acid composition analysis, with Cys, Trp, Ser, Arg and Pro being predominant in anti-angiogenic peptides while Val, Ala, Leu and Ile are less preferred in these peptides (S6 Table). To understand the position-wise preference of amino acids in the first 10 residues of the N-terminus and the last 10 residues of the C-terminus of anti-angiogenic peptides, we calculated the position-wise propensities using Swiss-Prot as the reference dataset (S7 Table). Cys, Ser, Thr and His are preferred at the N1 position; Pro at N2; Trp and Pro at N3; Ser and Phe at N4; and Cys is predominant at the N5, N6 and N7 positions in anti-angiogenic peptides. In the C-terminal region, Cys is prominent at C1 and C2; Gly and Cys at C3 and C4; and Cys at C5, while Arg is most favoured at the C8, C9 and C10 positions. We also performed residue based preference analysis using two sample logos (S1 and S2 Fig), which is in accordance with the results described above.

Motif analysis

To find the frequent motifs in the anti-angiogenic peptides, we extracted motifs with the MERCI software using the following criteria: i) the motif should be present in at least 10% (~14 peptides) of the peptides in the positive dataset, and ii) the motif can have a maximum of 5 gaps. Here, a gap represents the presence or absence of any amino acid. Using the above criteria, we obtained a total of 151 motifs. Further, we selected the motifs with a propensity (Eq 1) greater than or equal to 0.90. This resulted in a total of 22 motifs, which are "CG-G", "TC", "SC", "SP-S", "W-S-C", "WS-C", "S-T-C", "S-C-S", "CS-T", "C-S-T", "T-C", "S-C", "C-G-G", "TR", "S-T-G", "S-P-S", "SP", "RT", "P-W", "P-C", "C-N" and "CG" (hyphen ‘-’ represents a gap). These motifs are important for understanding and identifying anti-angiogenic peptides. The full list of 151 motifs sorted by propensity is given in S8 Table.
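To illustrate how a gapped motif can be matched against peptides, the sketch below translates each hyphen into an optional single-residue wildcard, following the statement that a gap represents the presence or absence of any amino acid. This is a reading of that sentence and not MERCI's own implementation; the function names are hypothetical.

```python
import re


def motif_to_regex(motif):
    """Translate a motif such as 'CG-G' into a compiled regular expression.

    Each '-' becomes '.?', i.e. zero or one residue of any type
    (an assumed interpretation of a gap).
    """
    parts = [".?" if ch == "-" else re.escape(ch) for ch in motif]
    return re.compile("".join(parts))


def motif_support(motif, peptides):
    """Fraction of peptides that contain the motif at least once."""
    pattern = motif_to_regex(motif)
    hits = sum(1 for peptide in peptides if pattern.search(peptide))
    return hits / len(peptides)
```

Under this reading, a motif satisfies the first criterion when `motif_support` on the positive dataset is at least 0.10.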

Performance of various machine learning approaches on the dataset

We used different machine learning classifiers like SVM, Random Forest (RF), IBk, J48, Naïve Bayes, Logistic and Multilayer Perceptron (MP) to develop amino acid composition based model on whole peptide dataset. This helps us to compare the performance of different classifiers on the same dataset. The models developed in this study are explained below.

Amino acid composition based model

We used the amino acid composition of the peptide as the input feature to develop prediction models using the SVM, J48, RF, Naïve Bayes, MP, Logistic and IBk machine learning classifiers (Table 1). The SVM (MCC = 0.48), MP (MCC = 0.49) and RF (MCC = 0.48) based models performed better than the other methods. The performance of the best models (SVM, RF, MP) was similar, and we therefore selected the SVM method for further development of models using different input features.