Journal of Biomedicine and Biotechnology. 2011 Aug 23;2011:432830. doi: 10.1155/2011/432830

Prediction of B-cell Linear Epitopes with a Combination of Support Vector Machine Classification and Amino Acid Propensity Identification

Hsin-Wei Wang 1, Ya-Chi Lin 1, Tun-Wen Pai 1,2,*, Hao-Teng Chang 3,4,5,*
PMCID: PMC3163029  PMID: 21876642

Abstract

Epitopes are antigenic determinants that are useful because they induce B-cell antibody production and stimulate T-cell activation. Bioinformatics can enable rapid, efficient prediction of potential epitopes. Here, we designed a novel B-cell linear epitope prediction system called LEPS, Linear Epitope Prediction by Propensities and Support Vector Machine, that combined physico-chemical propensity identification and support vector machine (SVM) classification. We tested the LEPS on four datasets: AntiJen, HIV, a newly generated PC, and AHP, a combination of these three datasets. Peptides with globally or locally high physicochemical propensities were first identified as primitive linear epitope (LE) candidates. Then, candidates were classified with the SVM based on the unique features of amino acid segments. This reduced the number of predicted epitopes and enhanced the positive prediction value (PPV). Compared to four other well-known LE prediction systems, the LEPS achieved the highest accuracy (72.52%), specificity (84.22%), PPV (32.07%), and Matthews' correlation coefficient (10.36%).

1. Introduction

Epitopes, also called antigenic determinants, are clusters of amino acid segments located on the surfaces of an antigen. Epitopes can elicit the immune response and are recognized by specific antibodies [1]. Basically, B-cell epitopes are categorized into two types: linear and conformational. Linear epitopes (LEs) are composed of contiguous amino acid residues within a continuous stretch of a primary protein sequence. Conformational epitopes (CEs) consist of amino acids that are dispersed among discontinuous regions but become aggregated on the protein surface [2, 3]. In general, over 90% of B-cell epitopes are discontinuous [4, 5]; thus, CEs play critical roles in biological and biomedical applications, including the prevention and neutralization of pathogen infections, and the design of therapeutic drugs. However, the prediction and identification of CEs within a protein depend on resolved three-dimensional structural information. One major, generally accepted concept is that conformational epitopes cannot be properly formed without binding to a corresponding antibody [6]. Therefore, antigen-antibody cocrystallographic information is a major concern in CE prediction. On the other hand, because CEs are discontinuous epitopes, it is difficult to design a peptide that forms the same conformation as the predicted CE. Thus, CEs that are predicted by computational analysis may not be verifiable in biochemical experiments, except with the cocrystallographic approach. Although B-cell LEs occupy a small part of the entire epitope group, they are important in biochemistry [7], virology [8], immunology [9], and vaccine research [10]. Therefore, research and development of accurate computational approaches for LE prediction remains a critical challenge in bioinformatics and computational biology [6]. 
Most published B-cell LE predictors have been based on the characteristics of amino acids, like hydrophobicity, surface accessibility, mobility, protrusion area, physico-chemical properties, antigenicity, and pocket characteristics [1, 3, 11–16]. For example, BcePred [16], BEPITOPE [17], PEOPLE [11], VaxiJen [18], and LEP [12] are bioinformatics tools that use various mathematical approaches to predict LEs according to the physico-chemical propensities of amino acids. Nevertheless, in 2005, Blythe and Flower led a group that evaluated the physico-chemical propensities of amino acids to predict LEs in proteins; they reported that even the best physico-chemical propensity scales available performed only slightly better than a random model [19]. Hence, it was proposed that, instead of using the antigenicity scale alone, LE prediction may be improved by integration with other computational approaches.

Several machine learning computational methods have been applied to improve the accuracy of LE prediction. For example, BepiPred combined a hydrophilicity scale with a hidden Markov model [20]; BCPred [21] and FBCPred [22] employed SVM with a subsequence kernel; Söllner and Mayer utilized a molecular operating environment with the decision tree and nearest neighbour approaches [6]. However, these machine learning approaches were mostly set to predict peptides of fixed lengths. It is difficult to analyze true LEs, because they generally range from 8 to 20 amino acid residues in length [11, 23–25]. Epitopes with fixed lengths are not typically sufficient to represent the whole region of antigenic determinants. To overcome the drawbacks of training and/or predicting fixed length epitopes, ABCPred used two artificial neural network methods, the feed-forward network and the recurrent neural network, for the prediction of B-cell LEs [26]. Both networks were used with different window lengths from 10 to 20 amino acids and a two-residue interval.

Although bioinformatists have expended great effort on developing LE predictors, there remains much room for improvement. Theoretically, an epitope identified by experimental immunological or biochemical methods must possess biological antigenicity that can induce antibody production in animals. However, when computational methods are used for the prediction, some experimentally identified epitopes could be missed or ignored. This raised the question of how to retrieve these missed epitopes and enhance their antigenicity scores in silico.

In 2008, LEP was developed for predicting LEs based on physico-chemical propensities combined with a mathematical morphology approach. LEP could retrieve some of the LEs that were locally embedded in the noise signals of the antigenic index [12]. We reasoned that, by combining the LEP with machine learning technologies, prediction accuracies could be further improved while retaining the advantage of variable-length prediction.

As mentioned above, the machine learning methods used in previous LE prediction methods were often trained to predict epitopes with fixed lengths. Chen's study showed that the frequencies of occurrence for some amino acid pairs in the epitope dataset were significantly higher than in non-epitope datasets, or vice versa [23]. We noticed this important statistical feature and applied it to enhance the performance of LE prediction systems. Hence, in order to explore the statistical advantages of verified epitopes and retain the antigenic characteristics of candidate peptides, we decided to extend the concept of amino acid pairs from Chen's study, which only considered peptides with 2 residues.

In this study, we developed a novel B-cell LE prediction system called LEPS (Linear Epitope Prediction by Propensities and Support Vector Machine). The LEPS is freely available for academic use at http://leps.cs.ntou.edu.tw. We adopted the library for SVM (LIBSVM) tool and trained it to recognize features of amino acid segments (AASs) with lengths from 2 to 4 residues. Then, SVM was used to characterize those patterns as epitope and non-epitope clusters [27]. Accordingly, the LEPS approach first performed physico-chemical propensities and mathematical morphology approaches and then used the AAS features to cluster the predicted LE candidates and remove the less probable LEs.

2. Materials and Methods

2.1. Testing Datasets and Predictors

Four datasets were used in this study. The AntiJen dataset was recommended at an international meeting sponsored by the National Institute for Allergy and Infectious Disease [6] and contained 171 protein sequences with 691 verified, nonoverlapping epitopes [19]. The HIV dataset was a collection of the antigenic determinants located on 10 HIV proteins with 54 nonoverlapping, verified epitopes [39]. The PC dataset, generated in this study, was a collection of 12 protein sequences with 98 nonoverlapping, verified epitopes (Table 1). In order to balance out the variation of each dataset in quantity and antigen diversity, these three datasets were merged into one, comprehensive dataset called the “AHP dataset.” These datasets were analyzed with different LE predictors, including the BepiPred [20], ABCPred [26], BCPred [21], and FBCPred [22], to compare performances with that of the LEPS developed here.

Table 1.

Epitopes predicted in the PC dataset after analysis with LEPS.

Antigen : length (UniProt IDa) LEPS-predicted Epitopes Experimental epitopes Ref.
PrP : 253
(P04156)
M1ANLGCWML9
R37YPGQG42 [28]
Q52GG54 [28]
Q91GGGT95 [28]
N100KPSKPKTNMKHMA113 [28]
G123GLGGYMLG131 [28]
S143DYEDRYYRENMHRYPN159 H140FGSDY145 [28]
Q160VYYRPMD167 [28]
F198TETD202 [28]
Y218ERESQAYYQRGS230

GAPDH : 338
(P20287)
A4KVGING10
A21AFLKNTVDV30
V31SVNDPFIDL40 V31SVNDPFIDLEYM43 [29]
K48RDSTHGTFPGEVSTENGKLKVN G58EVSTENGKLKVNGKLISVHCERDP82 [29]
KL73
C78ERDPANIPWDKDGA92
A108QAHIKNNRAK118 G100VFTTIDKAQAHIKN114 [29]
S123APSADAPM131
V136NENSYEKS144
V148SNASCTTN156
K163VIHDKFEIV172 K163VIHDKFEIVE173 [29]
V188VDGPSSKLWRDGRGAM204
A210STGAAKAVG219
L225NGKLT230
R235VPTPDVSV243
R249LGKGASYEE258
F287VGSTSSS294 S268GPLKGILEYTEDEVVSSDFVG289 [29]
I302SLNNNF308
Y315DNEFGY321
I329THMHKVDHA338

Ara h 1 : 626
(P43238)
K26SSPYQKKTENPC38 K26SSPYQKK33 [30]
Q47QEPDDLK54 Q48EPDDLKQKA57 [30]
E66YDPRCVY73 [30]
P75RGHTGTTNQRSPPGERTRGRQPG E90RTRGRQPGDYDDDRR105 [30]
DYDDDRRQPRREEGGRWGPAGPRE R108REEGGRW115 [30]
REREEDWRQPREDWRRPSHQQPR E124REEDWRQ131 [30]
KIRPEGREGEQEWGTPGSHVREETSR E134DWRRPSHQQPRKIRPEG151 [30]
NN173
P295GQFEDFF302 [30]
Y312LQGFSRN319 [30]
F325NAEFNEIRR334 [30]
Q345EERGQRR352 [30]
K381SVSKKGSEEEGDI394 D393ITNPINLRE402 [30]
N409NFGKLFEVK418 [30]
G463NLELV468 [30]
K472EQQQRGRREEEEDEDEEEEGSN
EV497
R498RYTARLKEG507 [30]
E525LHLLGFGIN534 [30]
H539RIFLAGDKD548 [30]
I551DQIEKQAKDLAFPGSGE568 [30]
P587QSQSQSPSSPEKESPEKEDQEEEN
QGGKGP617

SARS N : 422
(Q19QW0)
A36RPKQRRPQGLPNNTASWFT55 [31]
H60GKEEL65
T77NSGPDDQ84
L140NTPKDHIGTRNPNNN155
A156ATVLQLPQGTTLPKGFYAEGSRGG180 [31]
T266KQYNVTQAFGRRGP280 [31]
N286FGDQDLIRQGTDYK300 [31]
K356HIDAYKTFPPTEPKKDKKK375 [31]
R386QKKQPTVTLLPAADMDDFSRQLQN410 [31]

ZP3 : 399
(O77685, residue 24–422)
T31QSPAPGSSFSP42 T31QSPAPGSSFSPPPVVA47 [32]
Q71AAELTLGPSACAPVPAEPLSK92 [32]
H101ECGSELQMTPDSLIYSTVLHY122 [32]
P124NLSQ128 L126SQSPLVLRSSP137 [32]
G156IQPTWVPFHSTLSREQ172 [32]
D251SSSIFISPRPG262 [32]
V291TATDQAPSPLN302 [32]
A311DEWLPVEGPRD322 [32]
Q346EPGNPSEFEADLMLGPLVLSEAENGP372 [32]

AIV-H4 : 511
(A3KF09, residue17–527)
Q17NYTGNPVIC26 D107TCYPFDVPEYQSLR121 [33]
F137QWNTVKQNGKSGACKRANVNDFFNRLNWLVK [33]
S169DGNAYP175 SDGNAYPLQNLTKINNGDYARLYIWGVHHPSTDT202
N206LYKNNPGRVTVSTK220 [33]
T224SVVPNIGSGPLVRGGQSGRVSXYWTIV250 [33]
V257FNTIGNLIAPRGHYKLNNQKKSTILNTAIPIGSC
SKCHTDKGSLSTTKPFQNISRIAVGDCPRYV
QGSLKLATGMRNIPEKASRGLFGAI349
[33]
D455SEMNKLFERVRRQL469 [33]
A473EDKGNGCFEIFHKCDNN490 [33]
N512RFQIQGVKLTQGYM526 [33]

AIV-H5 : 568
(A5HNY9)
A25NNSTEQVDTIMEKNVTVTHAQDILEKTHNGKL57 [33]
E85FLNVPEWSYIVEKINPANDLCYP108 [33]
C151PYQGRSSFFRNVVW165 [33]
D199AAEQTRLYQNPTTY213 [33]
R223SKVNGQSGRMEFFWTILKPNDAINFESNGNFIA [33]
ENAYKIV273
L472RDNAKELGNGCFEFYHR489 [33]
E284LEYGNCNTKC294

AIV-H12 : 527 (C7FPM3, residue 1–527) T35LIEQNVPVT44 D31TVNTLIEQNVPVTQVEELVH51 [33]
K127YERVKMFDFTKWNVTYTGTSKACNNTSNQGS [33]
YRSMRWLTLKSGQFPVQTDEY180
F190TWAIHHPPTSDEQVKLYKNPNSLSSVTTDEINR [33]
FRPNIGPRPL234
Q238QGRMDYYWAVLKPGQTV255 [33]
T259NGNLIAPEYGHLITGKSHGRILKNDLPIGQCTTEC294 [33]
T310SKHYIGKCPKYIPS324 [33]
R334NVPQAQDRGLFGAIAGFIEG354 [33]
I430TDIWAYNAELLVLLENQKTLDEHDANVRNLHD [33]
VR465
G478CFEILHKCDDGCMDTIKNGT498 [33]
Q502DYEEESKLERQRINGVKLEENSTYK527 [33]

DEN-3 E-glycoprotein : 493 (D2JWZ8, residue 281–773) T331QLATLRKLCIEGKI345 [34]
D351SRCPTQGEAVLPEEQDPNY370 [34]
Q411YENLKYTVIITVHTGDQHQVGNETQGVT
AEITPQASTTE450
[34]
L476LTMKNKAWMVHRQW490 [34]
S533QEGA537 Q526EVVVLGSQEGAMHT540 [34]
W669YKKGSSI676
L707NSLG711

O. tsutsugamushi 47-kDa antigen : 466 (Q53246) H21SKSLLNQKAVLPQQKSDMHIN42 [35]
T65NIGISLNNKVSKYQQEV82 [35]
V97TNENVIAGR106 [35]
Y145ATFGDSNQS154 [35]
V173TNGIISSKGRDMG186 [35]
F193IQTNAAIHM202 [35]
H201MGSFGGPMF210 [35]
I233PSNTVLEAV242 [35]
L245KKGEKIR252 L245KKGEKIRRG254 [35]
L333LRNGKSMTLKCKIIANK350 [35]
Q357SNDQSLVVN366 [35]
L373TPDLVKKYNITSA386 [35]

HPV L1 protein : 510 (A8BQ01) D41VYVTRTNVYYHGGSSRLLTVGHPYYSIKKSNN
VAVPKV80
[36]
V122GRGQPL128 V90KLPDPNKFGLPDADLYDPDTQRLLWACVGVEVG
RGQPLGV130
[36]
T205TIEDGDMVET215 [36]
D219ICTNTCKYPDYLKMAAEPY238 [36]
G235DSMFFSLRREQMFTRHFFNRGGKMGDTIPD285 [36]
R326AQGHNNGMCW336
S350TNVSLCATEA360 [36]
F370KEYLRHMEEYDLQFIFQLCKITLTPEIMAY400 [36]
V416PPPPSASL424
K440PTPPKTPTDP450 P450YASLTFWDVDLSESFSMDLD470 [36]
G497TPPPTSKRKRV508

Bacillus anthracis, PA domain III and IV : 248 (P13423, residue 488–735) N538PSDPLETTKPDMT551 R532RIAAVNPSDPLETTKPDMT551 [37]
A596ELNATNIYTVL607 [37]
I620RDKRFHYDRNNIAVGADES639 [37]
L692NISSLRQDGKT703 [37]
N720PNYK724 L716YISNPNYKVNVYAVTKENT735 [37]

a Because some of the epitopes in the PC dataset were partial antigen fragments, the serial numbers for the residues in each epitope were assigned according to the sequence information retrieved from the UniProt database [38]. The overlapping amino acids between the experimentally verified and predicted epitopes are shown in bold.

2.2. System Flow

The proposed system was divided into three main steps (Figure 1(a)). The first step retrieved primitive epitope candidates from a query protein sequence with LEP [12], which was developed in our previous work and was used with the default settings. Then, an SVM classifier was applied to remove less probable epitope candidates and improve prediction accuracies. In the final step, the predicted epitope residues were highlighted in the query sequence and visualized in a predicted structure. The virtual structure was generated from Modeller 9.9, based on homologous protein structure modeling approaches [40].

Figure 1.

The design of LEPS. (a) Step 1(a): primitive epitope candidates with globally and locally high antigenicity were extracted by calculating weighting coefficients for various physicochemical propensities of each amino acid. After the filtering process with the SVM classifier (step 2(a)), predicted epitopes were highlighted (step 3(a)) in the query sequence and the simulated structure. (b) Step 1(b): 1230 experimentally verified epitopes and 872 non-epitopes were analyzed to determine the statistical characteristics of AASs. Step 2(b): subsequently, epitope indexes of 872 epitopes and 872 non-epitopes were used to train the SVM model to predict candidate epitopes based on the statistical characteristics defined in step 1(b).

2.3. Training Datasets and SVM Model

The process of training the SVM model comprised two major steps (Figure 1(b)). The first step (step 1(b)) evaluated the statistical characteristics that determined the frequencies of occurrence of AASs with various lengths from an independent B-cell epitope dataset (Bcipep [41]) and a non-epitope dataset (Chen et al. [23]). The second step (step 2(b)) produced an SVM model that recognized the epitopes and non-epitopes of the Chen dataset based on the statistical features derived from step 1(b).

The Bcipep dataset comprised 1230 experimentally verified, nonredundant B-cell LEs, with lengths ranging from 3 to 56 residues, that were identified in over 1000 antigen proteins. This dataset was used in step 1(b) to analyze the statistical characteristics associated with the frequencies of occurrence of AASs of 2 to 4 residues in length that represented epitopes.

The Chen dataset contained 872 epitopes and 872 non-epitopes. All epitopes and non-epitopes within this dataset were restricted to a length of 20 residues. These verified epitopes were retrieved from the Bcipep dataset by applying a “truncation-extension treatment.” That is, when the length of an LE was longer than 20 residues, an equal number of superfluous residues were truncated from both the N- and C-termini to preserve the central 20 residues. Conversely, when the length of an LE was shorter than 20 residues, an equal number of residues were added to both the N- and C-termini until the epitope comprised 20 residues. On the other hand, the 872 non-epitopes were generated by randomly selecting peptide segments from the Swiss-Prot database [42], with the stipulation that none was the same as any of the 872 epitopes. The 872 non-epitopes were used to analyze the statistical characteristics of AASs for non-epitopes in step 1(b). After determining the statistical features that were associated with frequencies of occurrence, the proposed system applied these features (step 2(b)) to produce an SVM model in a 5-fold cross-validation on the Chen dataset.
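
The truncation-extension treatment described above can be sketched as follows. This is a minimal illustration, not the procedure used to build the Chen dataset: the function name `truncate_extend`, the 0-based end-exclusive coordinates, the use of the parent antigen sequence as the source of flanking residues, and the handling of odd length differences and sequence ends are all assumptions made for this example.

```python
def truncate_extend(protein: str, start: int, end: int, target: int = 20) -> str:
    """Return a `target`-residue window centred on the epitope protein[start:end].

    `start` and `end` are 0-based, end-exclusive positions of the epitope
    within its parent protein sequence (hypothetical coordinate convention).
    """
    length = end - start
    diff = length - target
    if diff >= 0:
        # Truncation: drop residues from both termini and keep the centre.
        # Assumption: when diff is odd, the extra residue is removed
        # from the C-terminus.
        new_start = start + diff // 2
        return protein[new_start:new_start + target]

    # Extension: add flanking residues on both sides of the epitope.
    # Assumption: when the padding is odd, the extra residue is added
    # at the C-terminus.
    pad = -diff
    new_start = start - pad // 2
    new_end = new_start + target

    # Assumption: if the window runs past either end of the protein,
    # shift it back inside the sequence instead of padding.
    if new_start < 0:
        new_start, new_end = 0, target
    if new_end > len(protein):
        new_end = len(protein)
        new_start = max(0, new_end - target)
    return protein[new_start:new_end]
```

For a 4-residue epitope, the sketch adds 8 residues on each side; for a 30-residue epitope, it removes 5 residues from each terminus.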

2.4. Statistical Analysis of AASs and Epitope Indexes

For LE verification, we considered the statistical features to be AASs of 2 (AAS2), 3 (AAS3), and 4 (AAS4) residues in length for both epitopes and non-epitopes. For AAS2, 400 possible combinations of residue pairs were analyzed for occurrence frequencies within both the epitope and non-epitope datasets. The epitope index ($\mathrm{Epidex}_i^{2}$) of the $i$th pattern ($\mathrm{AAS}_i^{2}$) was calculated by taking the logarithm of the ratio of the proportion of $\mathrm{AAS}_i^{2}$ among all epitope AAS2 to the same proportion in the non-epitope AAS2 group, with the following equation:

$$\mathrm{Epidex}_i^{2} = \log\left(\frac{f_i^{2+} / \sum_i f_i^{2+}}{f_i^{2-} / \sum_i f_i^{2-}}\right), \quad i = 1, 2, \ldots, 400, \tag{1}$$

where $f_i^{2+}$ and $f_i^{2-}$ were the numbers of $\mathrm{AAS}_i^{2}$ in the epitope and non-epitope datasets, respectively, and $\sum_i f_i^{2+}$ and $\sum_i f_i^{2-}$ denoted the total number of AAS2 in the corresponding dataset. Finally, the values of $\mathrm{Epidex}_i^{2}$ were normalized to the range of [0,1] to avoid dominance of any individual $\mathrm{Epidex}_i^{2}$ in the classifier learning processes.
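
As a rough illustration of equation (1), the following sketch counts residue pairs in two peptide collections, takes the log ratio of their relative frequencies, and rescales the result to [0,1] by min-max normalization. The function names, the choice of min-max scaling, and in particular the `pseudocount` used to avoid zero counts are assumptions for this example; the text does not state how unobserved pairs were handled for AAS2.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def count_segments(peptides, k):
    """Count every overlapping k-residue segment (AAS) across the peptides."""
    counts = Counter()
    for pep in peptides:
        for i in range(len(pep) - k + 1):
            counts[pep[i:i + k]] += 1
    return counts


def epidex2(epitopes, non_epitopes, pseudocount=1.0):
    """Normalized log-ratio index for all 400 residue pairs."""
    pos = count_segments(epitopes, 2)
    neg = count_segments(non_epitopes, 2)
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

    # Assumption: a pseudocount keeps both frequencies strictly positive.
    pos_total = sum(pos[p] for p in pairs) + pseudocount * len(pairs)
    neg_total = sum(neg[p] for p in pairs) + pseudocount * len(pairs)

    raw = {}
    for p in pairs:
        f_pos = (pos[p] + pseudocount) / pos_total
        f_neg = (neg[p] + pseudocount) / neg_total
        raw[p] = math.log(f_pos / f_neg)

    # Min-max scaling to [0, 1]; the base of the logarithm does not
    # affect the scaled values.
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    return {p: (v - lo) / span if span else 0.0 for p, v in raw.items()}
```

With this scaling, pairs enriched in epitopes approach 1, pairs enriched in non-epitopes approach 0, and pairs that are equally frequent in both sets fall in between.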

There were a total of 8000 and 160,000 possible combinations for AAS3 and AAS4, respectively. A large portion of AAS3 or AAS4 did not appear in the non-epitope dataset; this would cause a problem, because it could lead to a zero in the denominator. Hence, the definitions of $\mathrm{Epidex}_i^{3}$ and $\mathrm{Epidex}_i^{4}$ were modified from the definition for $\mathrm{Epidex}_i^{2}$, and the corresponding epitope indexes for AAS3 and AAS4 were defined as follows:

$$\mathrm{Epidex}_i^{l} = \frac{f_i^{l+}}{\sum_i f_i^{l+}}, \tag{2}$$

where $l$ was equal to 3 or 4. Again, the values of $\mathrm{Epidex}_i^{3}$ and $\mathrm{Epidex}_i^{4}$ were normalized to the range of [0,1].
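
Equation (2) uses only the epitope dataset, so it can be sketched with a single counting pass. In the sketch below, the function name `epidex_l` is hypothetical, and the normalization is an assumption: segments that never occur in the epitope set are treated as having frequency 0, so scaling to [0,1] reduces to dividing by the largest observed frequency.

```python
from collections import Counter


def epidex_l(epitopes, l):
    """Relative frequency of each l-residue segment among epitope segments,
    scaled so that the most frequent segment has index 1.0.

    Only observed segments are stored; look up unseen segments with
    `index.get(segment, 0.0)`.
    """
    counts = Counter()
    for pep in epitopes:
        for i in range(len(pep) - l + 1):
            counts[pep[i:i + l]] += 1
    if not counts:
        return {}

    total = sum(counts.values())
    freq = {seg: n / total for seg, n in counts.items()}

    # Assumption: unseen segments have frequency 0, so the minimum of the
    # scale is 0 and min-max normalization divides by the maximum.
    top = max(freq.values())
    return {seg: f / top for seg, f in freq.items()}
```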

2.5. SVM Features and Model Selection

In this study, we adopted the SVM as a learning method to classify the epitope and non-epitope peptides. We employed the open source LIBSVM toolbox for executing this classification. In LIBSVM, each instance in the training set possessed one target value (class label) and several features (attributes). In the testing set, only the features were required for each instance. The objective of SVM was to generate a model from the training set that facilitated the prediction of the target value of each instance in the testing set. In this study, a peptide corresponded to an instance, and the target value (1 or −1) represented whether that peptide was an epitope. Each peptide contained three feature values based on $\mathrm{Epidex}_i^{2}$, $\mathrm{Epidex}_i^{3}$, and $\mathrm{Epidex}_i^{4}$. For example, a 20-mer peptide was decomposed into 19 $\mathrm{AAS}_i^{2}$ subsegments, and the corresponding epitope index of this peptide was obtained by taking the average of the 19 $\mathrm{Epidex}_i^{2}$ values of the corresponding $\mathrm{AAS}_i^{2}$. Similarly, the feature values of $\mathrm{Epidex}_i^{3}$ and $\mathrm{Epidex}_i^{4}$ could be obtained by calculating the averages of 18 $\mathrm{Epidex}_i^{3}$ and 17 $\mathrm{Epidex}_i^{4}$ values, respectively.
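
The three-feature representation can be sketched as below. The index dictionaries (`idx2`, `idx3`, `idx4`) are assumed to map each segment to its normalized epitope index, with unseen segments scored as 0.0; that default, and all function names, are assumptions for this illustration. The last helper writes one instance in LIBSVM's sparse text format (`<label> <index>:<value> ...`).

```python
def segments(peptide, k):
    """All overlapping k-residue subsegments of a peptide."""
    return [peptide[i:i + k] for i in range(len(peptide) - k + 1)]


def mean_index(peptide, index, k):
    """Average epitope index over the k-residue subsegments of a peptide."""
    segs = segments(peptide, k)
    if not segs:
        return 0.0
    # Assumption: segments missing from the index contribute 0.0.
    return sum(index.get(s, 0.0) for s in segs) / len(segs)


def peptide_features(peptide, idx2, idx3, idx4):
    """Three features per peptide: mean index for AAS2, AAS3, and AAS4."""
    return [
        mean_index(peptide, idx2, 2),
        mean_index(peptide, idx3, 3),
        mean_index(peptide, idx4, 4),
    ]


def to_libsvm_line(label, features):
    """Format one instance as a LIBSVM text line; label is +1 or -1."""
    body = " ".join(f"{i}:{v:.6f}" for i, v in enumerate(features, start=1))
    return f"{label:+d} {body}"
```

A 20-residue peptide yields 19, 18, and 17 subsegments for lengths 2, 3, and 4, matching the counts given in the text.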

The Chen dataset was used to construct an SVM model based on three feature values and the target values of each epitope and non-epitope. There were four common kernel functions provided by LIBSVM, including linear, polynomial, radial basis function (RBF), and sigmoid. We examined these four kernel functions with a 5-fold cross-validation. The training dataset was equally divided into 5 different subsets; four of the subsets were used for training the model, and the last one was used for testing the model. These processes were repeated five times with each individual subset used as the testing subset. Here, the RBF kernel was selected as the default kernel function, because it provided the best cross-validation accuracy with the training data. Subsequently, the RBF kernel function was applied to the whole training dataset to construct the final SVM classifier in the LEPS.
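
The 5-fold procedure can be illustrated with a small, classifier-agnostic sketch. It is not the LIBSVM workflow itself: `fit` and `predict` are hypothetical callables standing in for training and applying an SVM with a given kernel, and the shuffling seed and fold assignment are assumptions. Comparing kernels would amount to calling `cross_val_accuracy` once per kernel and keeping the one with the highest score.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Fold sizes differ by at most one when n is not divisible by k.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_val_accuracy(features, labels, fit, predict, k=5, seed=0):
    """Fraction of instances classified correctly across the k test folds."""
    correct = 0
    for train, test in k_fold_indices(len(labels), k, seed):
        model = fit([features[j] for j in train], [labels[j] for j in train])
        correct += sum(predict(model, features[j]) == labels[j] for j in test)
    return correct / len(labels)
```

For the 1744 instances of the Chen dataset (872 epitopes and 872 non-epitopes), this split produces four folds of 349 instances and one of 348.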

2.6. Performance Measurement

To evaluate the performance of the LEPS at the level of the amino acid residue, five indicators were used to measure effectiveness at the default settings. These indicators were (1) sensitivity (SEN), defined as the percentage of epitopes that were correctly predicted as epitopes; (2) specificity (SPE), defined as the percentage of non-epitopes that were correctly predicted as non-epitopes; (3) positive predictive value (PPV), defined as the probability that a predicted epitope was, in fact, an epitope; (4) accuracy (ACC), defined as the proportion of correctly predicted peptides; (5) Matthews' correlation coefficient (MCC), which was a measure of the predictive performance that incorporated both SEN and SPE into a single value between −1 and +1 [26]. These parameters were calculated with the following equations:

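$$\mathrm{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \tag{3}$$

$$\mathrm{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \tag{4}$$

$$\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FP} + \mathrm{TN} + \mathrm{FN}}, \tag{5}$$

$$\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \tag{6}$$

$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}, \tag{7}$$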

where TP represented the true positive; TN, the true negative; FP, the false positive; FN, the false negative.
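
Equations (3) to (7) translate directly into code. The sketch below takes residue-level counts and returns the five indicators as fractions (multiply by 100 for the percentages reported in this paper). The function name and the convention of returning 0.0 when a denominator is zero are assumptions for this example.

```python
import math


def performance(tp, tn, fp, fn):
    """Compute SEN, SPE, ACC, PPV, and MCC from confusion-matrix counts."""

    def ratio(num, den):
        # Assumption: report 0.0 when an indicator is undefined.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SEN": ratio(tp, tp + fn),
        "SPE": ratio(tn, tn + fp),
        "ACC": ratio(tp + tn, tp + fp + tn + fn),
        "PPV": ratio(tp, tp + fp),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
    }
```

For example, with TP = 30, TN = 50, FP = 10, and FN = 10, the sensitivity and PPV are both 0.75, the accuracy is 0.80, and the MCC is 1400/2400, or about 0.583.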

3. Results and Discussion

3.1. A New Linear Epitope Dataset: PC

The new dataset, called the PC dataset (collected by Pai and Chang), contained 12 sequences that did not overlap with other datasets. It was generated and analyzed in this study. The experimental epitopes in the PC dataset were identified with the peptide scan methodology, a conventional method for epitope determination. The average length of the identified epitopes in the PC dataset was 18.9 residues. This was considered a practical length for an epitope to be used in peptide vaccine development or antibody generation. The average epitope lengths in the HIV and AntiJen datasets were 26.4 and 16.3 residues, respectively. All sequences in the PC dataset were analyzed with the LEPS, and the predicted and experimentally verified epitopes are listed in Table 1.

3.2. The Performance of LEPS

The epitope information collected from the PC, AntiJen, and HIV datasets was utilized to verify the performance of LEPS. The PC dataset was described in the previous section. The original AntiJen dataset comprised 3619 epitopes, of which 3168 were found in the Swiss-Prot database. As in our previous report, we regenerated the original AntiJen dataset by removing the repeated epitopes [12]. The HIV dataset focused on one infectious pathogen and was recognized as a useful tool in the field of HIV immunology [39]. The AHP dataset combined these three datasets to balance the variations in each dataset including variations in epitope length and the physico-chemical properties of antigens. With these 4 datasets, we compared the performance of five LE predictors, including LEPS, BepiPred [20], ABCPred [26], BCPred [21], and FBCPred [22].

As expected, LEPS provided favorable results in all four datasets (Figure 2). Table 2 shows that LEPS displayed the best specificity (SPE), with values of 88.33%, 84.48%, 74.84%, and 84.22% in the PC, AntiJen, HIV, and AHP datasets, respectively. Moreover, LEPS showed the best PPVs, with values of 45.12%, 28.85%, 71.44%, and 32.07% in the PC, AntiJen, HIV, and AHP datasets, respectively. The PPV indicates the proportion of real epitopes among all positively predicted candidates, and it is one of the most important factors in vaccine development: reducing the number of false positive candidates improves the effectiveness and efficiency of identifying the real epitopes. Therefore, the LEPS is expected to outperform the other predictors in terms of the cost effectiveness of biological experiments. In the field of computational science, prediction accuracy is one of the most important factors for system evaluation. Except in the HIV dataset, LEPS displayed the best ACCs, with values of 61.66%, 73.81%, and 72.52% for the PC, AntiJen, and AHP datasets, respectively. These results showed that LEPS performed well for LE prediction. The LEPS also showed the best performance in the MCC for the AntiJen and AHP datasets (10.10% and 10.36%), whereas for the HIV dataset its MCC (22.76%) was lower than those of BCPred (29.80%) and FBCPred (27.81%). Taken together, LEPS displayed the best SPE and PPV for all four datasets and the best ACC for three of the four datasets; in the HIV dataset, its ACC (63.45%) was below those of BCPred (66.57%) and FBCPred (67.13%). However, it showed relatively low SEN compared to the other predictors, mainly because it predicted fewer LEs.

Figure 2.

Comparison of the performances of LEPS, BepiPred, ABCPred, BCPred, and FBCPred systems. The best performance for each indicator is marked with a star.

Table 2.

Comparison of the performances of LEPS, BepiPred, ABCPred, BCPred, and FBCPred systems.

Systems          SEN(a)   SPE(a)   ACC(a)   PPV(a)   MCC(a)

PC dataset
LEPS             12.78    88.33    61.66    45.12     3.65
BepiPred         48.23    59.72    55.33    38.19     7.49
ABCPred_0.8(b)   65.46    40.26    48.89    36.21     5.13
BCPred           50.92    59.35    52.83    36.07     4.43
FBCPred          51.03    52.55    52.20    35.26     3.17

AntiJen dataset
LEPS             26.72    84.48    73.81    28.85    10.10
BepiPred         51.79    57.61    55.52    22.02     6.04
ABCPred_0.8      67.33    40.40    44.70    21.83     5.46
BCPred           58.84    54.87    53.92    23.34     8.93
FBCPred          60.31    51.21    51.45    22.33     6.73

HIV dataset
LEPS             48.33    74.84    63.45    71.44    22.76
BepiPred         50.16    60.85    56.72    61.22     9.72
ABCPred_0.7      87.97    14.65    56.59    56.33     5.64
BCPred           80.18    54.57    66.57    65.55    29.80
FBCPred          73.20    58.20    67.13    65.56    27.81

AHP dataset(c)
LEPS             26.97    84.22    72.52    32.07    10.36
BepiPred         51.48    57.91    55.57    25.06     6.32
ABCPred_0.8      68.28    39.06    45.58    24.51     5.45
BCPred           59.45    54.80    54.50    26.32     9.73
FBCPred          60.40    51.66    52.31    25.38     7.60

(a) SEN: sensitivity; SPE: specificity; PPV: positive prediction value; ACC: accuracy; MCC: Matthews' correlation coefficient; unit: %.

(b) The subscripts of ABCPred denote threshold values according to the highest accuracy.

(c) This dataset is a merge of the other 3 datasets.

3.3. The LEPS Platform

The LEPS provides a user-friendly interface for biologists to predict linear epitope candidates (Figure 3(a)). LEPS accepts input either in FASTA format or as plain sequence text, and the default parameters are set as indicated. In this system, several physicochemical propensities can be dynamically modified by users, including secondary structures, hydropathy, surface accessibility, flexibility, polarity, and other factors. The scanning window size for each parameter is also adjustable. After executing the prediction, the overall antigenicity of the query protein and the predicted LE candidates are displayed. For example, Figure 3(b) shows the LEs in HIV integrase predicted by LEPS. Seventeen candidates were initially predicted by LEP based on the global and local distributions of antigenicity. These candidates were further filtered by SVM selection, which left only 9 candidates. Among these 9 epitope candidates, number 1 (residues 5–19), number 2 (residues 41–50), numbers 7 and 8 (residues 227–239 and residues 243–247), and number 9 (residues 261–266) overlapped with the experimental epitopes at residues 1–16, residues 42–55, residues 228–252, and residues 262–271, respectively. To verify the surface conditions of the predicted LEs within the query protein sequence, a protein structure was simulated based on homology modeling approaches. This structure can be viewed and analyzed by clicking on the button labeled “predicted structure.”
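
The overlaps reported for HIV integrase can be checked with a short range-intersection sketch. The residue ranges are those quoted in the paragraph above (1-based, inclusive); the function names and data layout are assumptions for this illustration, and only the five candidates whose positions are given in the text are included.

```python
def overlap(a, b):
    """Number of residues shared by two inclusive ranges (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


# Candidate number -> predicted residue range (from the text).
predicted = {1: (5, 19), 2: (41, 50), 7: (227, 239), 8: (243, 247), 9: (261, 266)}
# Experimentally identified epitope ranges (from the text).
experimental = [(1, 16), (42, 55), (228, 252), (262, 271)]


def matches(predicted, experimental):
    """For each candidate, list the experimental epitopes it overlaps
    together with the number of shared residues."""
    return {
        num: [(e, overlap(p, e)) for e in experimental if overlap(p, e) > 0]
        for num, p in predicted.items()
    }


for num, hits in matches(predicted, experimental).items():
    print(num, hits)
```

Candidates 7 and 8 both fall within the experimental epitope at residues 228–252, which is why the text pairs them with a single experimental range.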

Figure 3.

The LEPS server. (a) Users can input a query sequence and manually adjust the weight and window size of each propensity. (b) The output information of HIV integrase predicted by LEPS shows 17 candidates, and only 9 candidates were retained after SVM filtration. The final predicted epitope segments are labeled in yellow at the bottom.

3.4. Visualization of the Predicted LEs on 3D Structures

Predicted structures of the query sequences can be rendered by Jmol (http://www.jmol.org/) in LEPS, and the corresponding PDBs and PyMOL script files (http://www.pymol.org/) are downloadable by request. For example, Figure 4 shows the simulated structure of HIV integrase as predicted by Modeller, with the predicted epitope segments displayed in yellow solid spheres. Because there is a high probability that true epitopes will be exposed on the protein surfaces for binding with antibodies, visualization of the predicted LEs on 3D structures can facilitate the selection of suitable epitopes from predicted candidates according to their surface distributions. Figure 5 shows an example of the experimentally verified epitopes and predicted epitopes for the 10 kDa chaperonin protein in the AntiJen dataset. The yellow spheres in both Figures 5(a) and 5(b) show the true and predicted epitope atoms, respectively. The position of the remaining protein is shown in red and blue solid balls in the two simulated structures. In both cases, most of the epitope residues are located on the protein surface.

Figure 4.

The predicted LEs of HIV integrase mapped onto a simulated 3D structure. The predicted epitopes are labeled in yellow, and the selected epitopes (number 1 and number 3) are shown in yellow spheres.

Figure 5.

The experimental and predicted epitopes of 10 kDa chaperonin. The structural surfaces display the true epitopes (a) and predicted epitopes (b) in yellow spheres. The red and blue spheres represent the remainder of the protein. Both figures were created with PyMOL.

3.5. Acceptability of Low Sensitivities

Although LEPS can provide a highly accurate prediction of LEs, the low sensitivity is an issue that remains to be investigated. In general, epitope datasets face the challenge that biological experiments do not cover all the true epitopes within an individual antigen. Peptide scanning data can only identify potential epitopes that are recognized by a specific antibody. However, different antibodies to the same antigen might recognize different epitopes. These biological variations cause low coverage of epitopes within an antigen [43]. This situation implies that the sensitivities of an LE predictor should generally be low. Alternatively, an LE predictor might predict more epitopes to regain sensitivity, at the cost of reduced specificity; this would generally lead to higher experimental costs. Nevertheless, to persuade biologists to conduct in vitro experiments on the predicted potential LEs, the accuracy and MCC values could provide balanced statistics for evaluating the performance of a prediction system.

In this study, LEPS displayed high accuracy, MCC, specificity, and PPV, although the sensitivity was a little low. However, the reduced sensitivity was offset by the high PPV. Therefore, the LEPS provides a high probability of success for molecular biologists in predicting and selecting functional epitopes effectively and efficiently.

Acknowledgments

This work was supported by the National Science Council, Taiwan (NSC-98-2311-B-039-003-MY3 and NSC-99-2627-B-039-002 to H.-T. Chang and NSC100-2321-B-019-004, NSC 99-2627-B-019-007, and NSC98-2221-E-019-031-MY2 to T.-W. Pai) and by the Taiwan Department of Health Clinical Trial and Research Center of Excellence (DOH100-TD-B-111-004).

References

  • 1. Davies DR, Cohen GH. Interactions of protein antigens with antibodies. Proceedings of the National Academy of Sciences of the United States of America. 1996;93(1):7–12. doi: 10.1073/pnas.93.1.7.
  • 2. Van Regenmortel MHV. Immunoinformatics may lead to a reappraisal of the nature of B cell epitopes and of the feasibility of synthetic peptide vaccines. Journal of Molecular Recognition. 2006;19(3):183–187. doi: 10.1002/jmr.768.
  • 3. Barlow DJ, Edwards MS, Thornton JM. Continuous and discontinuous protein antigenic determinants. Nature. 1986;322(6081):747–748. doi: 10.1038/322747a0.
  • 4. Benjamin DC. B-cell epitopes: fact and fiction. Advances in Experimental Medicine and Biology. 1995;386:95–108. doi: 10.1007/978-1-4613-0331-2_8.
  • 5. Vinion-Dubiel AD, McClain MS, Cao P, Mernaugh RL, Cover TL. Antigenic diversity among Helicobacter pylori vacuolating toxins. Infection and Immunity. 2001;69(7):4329–4336. doi: 10.1128/IAI.69.7.4329-4336.2001.
  • 6. Greenbaum JA, Andersen PH, Blythe M, et al. Towards a consensus on datasets and evaluation metrics for developing B-cell epitope prediction tools. Journal of Molecular Recognition. 2007;20(2):75–82. doi: 10.1002/jmr.815.
  • 7. Andersen OS, Boisguerin P, Glerup S, et al. Identification of a linear epitope in sortilin that partakes in pro-neurotrophin binding. Journal of Biological Chemistry. 2010;285(16):12210–12222. doi: 10.1074/jbc.M109.062364.
  • 8. Xiang J, Zhang S, Cheng A, et al. Expression and characterization of recombinant VP19c protein and N-terminal from duck enteritis virus. Virology Journal. 2011;8(article 82) doi: 10.1186/1743-422X-8-82.
  • 9. Lanza A, Perillo L, Landi C, Femiano F, Gombos F, Cirillo N. Controversial role of antibodies against linear epitopes of desmoglein 3 in pemphigus vulgaris, as revealed by semiquantitative living cell immunofluorescence microscopy and in-cell ELISA. International Journal of Immunopathology and Pharmacology. 2010;23(4):1047–1055. doi: 10.1177/039463201002300409.
  • 10. Yadav M, Liebau E, Haldar C, Rathaur S. Identification of major antigenic peptide of filarial glutathione-S-transferase. Vaccine. 2011;29:1297–1303. doi: 10.1016/j.vaccine.2010.11.078.
  • 11. Alix AJP. Predictive estimation of protein linear epitopes by using the program PEOPLE. Vaccine. 1999;18(3-4):311–314. doi: 10.1016/s0264-410x(99)00329-1.
  • 12. Chang HT, Liu CH, Pai TW. Estimation and extraction of B-cell linear epitopes predicted by mathematical morphology approaches. Journal of Molecular Recognition. 2008;21(6):431–441. doi: 10.1002/jmr.910.
  • 13. Chang HT, Pai TW, Fan TC, et al. A reinforced merging methodology for mapping unique peptide motifs in members of protein families. BMC Bioinformatics. 2006;7, article no. 38 doi: 10.1186/1471-2105-7-38.
  • 14. Andersen PH, Nielsen M, Lund O. Prediction of residues in discontinuous B-cell epitopes using protein 3D structures. Protein Science. 2006;15(11):2558–2567. doi: 10.1110/ps.062405906.
  • 15. Pai TW, Chang MDT, Tzou WS, et al. REMUS: a tool for identification of unique peptide segments as epitopes. Nucleic Acids Research. 2006;34:W198–W201. doi: 10.1093/nar/gkl188.
  • 16. Saha S, Raghava GPS. BcePred: prediction of continuous B-cell epitopes in antigenic sequences using physico-chemical properties. Lecture Notes in Computer Science. 2004;3239:197–204.
  • 17. Odorico M, Pellequer JL. BEPITOPE: predicting the location of continuous epitopes and patterns in proteins. Journal of Molecular Recognition. 2003;16(1):20–22. doi: 10.1002/jmr.602.
  • 18. Doytchinova IA, Flower DR. VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinformatics. 2007;8, article no. 4 doi: 10.1186/1471-2105-8-4.
  • 19. Toseland CP, et al. AntiJen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data. Immunome Research. 2005;1:p. 4. doi: 10.1186/1745-7580-1-4.
  • 20. Larsen JE, et al. Improved method for predicting linear B-cell epitopes. Immunome Research. 2006;2:p. 2. doi: 10.1186/1745-7580-2-2.
  • 21. El-Manzalawy Y, Dobbs D, Honavar V. Predicting linear B-cell epitopes using string kernels. Journal of Molecular Recognition. 2008;21(4):243–255. doi: 10.1002/jmr.893.
  • 22. El-Manzalawy Y, Dobbs D, Honavar V. Predicting flexible length linear B-cell epitopes. In: Proceedings of the Computational Systems Bioinformatics Conference, vol. 7; 2008; pp. 121–132.
  • 23. Chen J, Liu H, Yang J, Chou KC. Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids. 2007;33(3):423–428. doi: 10.1007/s00726-006-0485-9.
  • 24. Florea L. Epitope prediction algorithms for peptide-based vaccine design. In: Proceedings of the IEEE Computer Society Bioinformatics conference, vol. 2; 2003; pp. 17–26.
  • 25. Roberts CGP, Meister GE, Jesdale BM, Lieberman J, Berzofsky JA, De Groot AS. Prediction of HIV peptide epitopes by a novel algorithm. AIDS Research and Human Retroviruses. 1996;12(7):593–610. doi: 10.1089/aid.1996.12.593.
  • 26. Saha S, Raghava GPS. Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins. 2006;65(1):40–48. doi: 10.1002/prot.21078.
  • 27. Chang CC, Lin CJ. LIBSVM: a library for support vector machine. 2001.
  • 28. Sachsamanoglou M, Paspaltsis I, Petrakis S, et al. Antigenic profile of human recombinant PrP: generation and characterization of a versatile polyclonal antiserum. Journal of Neuroimmunology. 2004;146(1-2):22–32. doi: 10.1016/j.jneuroim.2003.09.018.
  • 29. Argiro L, Kohlstädt S, Henri S, et al. Identification of a candidate vaccine peptide on the 37 kDa Schistosoma mansoni GAPDH. Vaccine. 2000;18(19):2039–2048. doi: 10.1016/s0264-410x(99)00521-6.
  • 30. Wesley Burks A, Shin D, Cockrell G, Stanley JS, Helm RM, Bannon GA. Mapping and mutational analysis of the IgE-binding epitopes on Ara h 1, a legume vicilin protein and a major allergen in peanut hypersensitivity. European Journal of Biochemistry. 1997;245(2):334–339. doi: 10.1111/j.1432-1033.1997.t01-1-00334.x.
  • 31. Liu SJ, Leng CH, Lien SP, et al. Immunological characterizations of the nucleocapsid protein based SARS vaccine candidates. Vaccine. 2006;24(16):3100–3108. doi: 10.1016/j.vaccine.2006.01.058.
  • 32. Cui X, Duckworth JA, Molinia FC, Cowan PE. Identification and evaluation of an infertility-associated ZP3 epitope from the marsupial brushtail possum (Trichosurus vulpecula). Vaccine. 2010;28(6):1499–1505. doi: 10.1016/j.vaccine.2009.11.052.
  • 33. Mueller M, Renzullo S, Brooks R, Ruggli N, Hofmann MA. Antigenic characterization of recombinant hemagglutinin proteins derived from different avian influenza virus subtypes. PLoS ONE. 2010;5(2) doi: 10.1371/journal.pone.0009097. Article ID e9097.
  • 34. da Silva AN, Nascimento EJ, Cordeiro MT, et al. Identification of continuous human B-cell epitopes in the envelope glycoprotein of dengue virus type 3 (DENV-3). PLoS ONE. 2009;4(10 article e7425) doi: 10.1371/journal.pone.0007425.
  • 35. Stetler RA, Gao Y, Signore AP, Cao G, Chen J. HSP27: mechanisms of cellular protection against neuronal injury. Current Molecular Medicine. 2009;9(7):863–872. doi: 10.2174/156652409789105561.
  • 36. Senger T, Becker MR, Schädlich L, Waterboer T, Gissmann L. Identification of B-cell epitopes on virus-like particles of cutaneous alpha-human papillomaviruses. Journal of Virology. 2009;83(24):12692–12701. doi: 10.1128/JVI.01582-09.
  • 37. Kelly-Cirino CD, Mantis NJ. Neutralizing monoclonal antibodies directed against defined linear epitopes on domain 4 of anthrax protective antigen. Infection and Immunity. 2009;77(11):4859–4867. doi: 10.1128/IAI.00117-09.
  • 38. Consortium TU. The universal protein resource (UniProt) in 2010. Nucleic Acids Research. 2009;38(1):D142–D148. doi: 10.1093/nar/gkp846. Article ID gkp846.
  • 39. Korber BTM. HIV Immunology and HIV/SIV Vaccine Databases. Los Alamos, NM, USA: Los Alamos National Laboratory; 2003. (Theoretical Biology and Biophysics). LA-UR 04-8162.
  • 40. Eswar N, Webb B, Marti-Renom MA, et al. Comparative protein structure modeling using Modeller. Current protocols in bioinformatics. 2006;(chapter 5, unit 5.6) doi: 10.1002/0471250953.bi0506s15.
  • 41. Saha S, Bhasin M, Raghava GPS. Bcipep: a database of B-cell epitopes. BMC Genomics. 2005;6:p. 79. doi: 10.1186/1471-2164-6-79.
  • 42. Boeckmann B, Bairoch A, Apweiler R, et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research. 2003;31(1):365–370. doi: 10.1093/nar/gkg095.
  • 43. Caoili SE. Benchmarking B-cell epitope prediction for the design of peptide-based vaccines: problems and prospects. Journal of Biomedicine and Biotechnology. 2010;2010:14 pages. doi: 10.1155/2010/910524. Article ID 910524.

Articles from Journal of Biomedicine and Biotechnology are provided here courtesy of Wiley
