A meta-learning approach for B-cell conformational epitope prediction

Yuh-Jyh Hu; Shun-Chien Lin; Yu-Lung Lin; Kuan-Hui Lin; Shun-Ning You

BMC Bioinformatics. 2014 Nov 18;15(1):378. doi: 10.1186/s12859-014-0378-y


PMCID: PMC4237749 PMID: 25403375

Abstract

Background

One of the major challenges in the field of vaccine design is identifying B-cell epitopes in continuously evolving viruses. Various tools have been developed to predict linear or conformational epitopes, each relying on different physicochemical properties and adopting distinct search strategies. We propose a meta-learning approach for epitope prediction based on stacked and cascade generalizations. Through meta learning, we expect a meta learner to be able to integrate multiple prediction models and to outperform the single best-performing model. The objective of this study is twofold: (1) to analyze the complementary predictive strengths of different prediction tools, and (2) to introduce a generic computational model that exploits the synergy among various prediction tools. Our primary goal is not to develop any particular classifier for B-cell epitope prediction, but to demonstrate the feasibility of applying meta learning to epitope prediction. With the flexibility of meta learning, researchers can construct various meta classification hierarchies that are applicable to epitope prediction in different protein domains.

Results

We developed hierarchical meta-learning architectures based on stacked and cascade generalizations. The bottom level of the hierarchy consisted of four conformational and four linear epitope prediction tools that served as the base learners. To perform consistent and unbiased comparisons, we tested the meta-learning method on an independent set of antigen proteins that had not been used previously to train the base epitope prediction tools. In addition, we conducted correlation and ablation studies of the base learners in the meta-learning model. Low correlation among the predictions of the base learners suggested that the eight base learners had complementary predictive capabilities. The ablation analysis indicated that the eight base learners interacted differentially and contributed unequally to the final meta model. The results of the independent test demonstrated that the meta-learning approach markedly outperformed the single best-performing epitope predictor.

Conclusions

Computational B-cell epitope prediction tools exhibit several differences that affect their performances when predicting epitopic regions in protein antigens. The proposed meta-learning approach for epitope prediction combines multiple prediction tools by integrating their complementary predictive strengths. Our experimental results demonstrate the superior performance of the combined approach in comparison with single epitope predictors.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0378-y) contains supplementary material, which is available to authorized users.

Keywords: B-cell epitope prediction, Linear epitopes, Conformational epitopes, Meta learning

Background

The ability of an antibody to respond to an antigen, such as a virus capsid protein fragment, depends on the antibody’s specific recognition of an epitope, which is the antigenic site to which an antibody binds. Based on their structure and interaction with antibodies, epitopes can be divided into two categories: linear and conformational. A linear epitope is formed by a continuous sequence of amino acids, whereas a conformational epitope is composed of residues that are discontinuous in the primary sequence but close together in three-dimensional space.

Several different approaches exist for predicting linear and conformational epitopes. Previous studies relied on the varying physicochemical properties of amino acids to predict linear epitopes [1–3]. A study on 484 amino acid scales revealed that predictions based on even the best-performing scales correlated poorly with experimentally confirmed epitopes [4]. This result prompted the development of machine-learning methods to improve prediction. BepiPred combines amino acid propensity scales with a hidden Markov model to achieve marginal improvement over methods based on physicochemical properties [5]. ABCpred uses artificial neural networks (ANN) for predicting linear B-cell epitopes [6]. Chen et al. proposed the amino acid pair (AAP) antigenicity scale [7] and trained a support vector machine (SVM) classifier that uses the AAP propensity scale to distinguish epitopes from nonepitopes. BCPREDS uses SVM combined with a variety of kernel methods, including string kernels, radial basis kernels, and subsequence kernels, to predict linear B-cell epitopes [8].

An increase in the availability of protein structures has enabled the identification of conformational epitopes by using various computational methods. For example, DiscoTope 2.0 uses a combination of amino acid composition information, spatial neighborhood information, and a surface measure for predicting epitopes [9]. ElliPro uses Thornton’s propensities and applies residue clustering to identify epitopes [10]. SEPPA 2.0 predicts conformational epitopes based on unit patches of residue triangles and a clustering coefficient, which describe local spatial context and compactness, with two additional parameters: accessible surface area (ASA) propensity and a consolidated amino acid index [11]. EPITOPIA combines structural and physicochemical features, and adopts a Bayesian classifier to predict epitopes [12]. EPSVR uses a support vector regression method to predict conformational epitopes. The meta learner EPMeta incorporates consensus results from multiple prediction servers by using a voting mechanism [13].

In this study, we propose combining multiple predictions to improve epitope prediction based on two meta-learning strategies: stacked generalization (stacking) [14,15] and cascade generalization (cascade) [16,17]. These strategies work in a hierarchical architecture of meta learners and base learners, in which the input space for meta learners is extended by the predictions of the base learners. We selected several linear and conformational epitope predictors as the base learners, and evaluated four inductive learning algorithms as the meta learners. To evaluate performance, we tested the combined method on an independent set of antigen proteins that, according to the tools’ documentation and publications, had not been used to train the epitope prediction tools. Our results indicate the potential of meta learning for epitope prediction.
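The data flow of the two strategies can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two toy scoring functions stand in for real epitope prediction tools, the two residue features and the six training residues are invented, and a 3-nearest-neighbour vote plays the role of the meta learner. It is not the implementation used in this study.

```python
import math

# Hypothetical stand-ins for base epitope predictors. Each maps a feature
# vector to a score; real tools would be run externally on a protein.
def toy_structure_tool(features):
    return features[0]

def toy_sequence_tool(features):
    return sum(features) / len(features)

BASE_LEARNERS = [toy_structure_tool, toy_sequence_tool]

def stack_features(features, base_learners):
    """Stacking: every base learner sees the original features only."""
    return list(features) + [learner(features) for learner in base_learners]

def cascade_features(features, base_learners):
    """Cascade: each base learner also sees the scores appended before it."""
    extended = list(features)
    for learner in base_learners:
        extended.append(learner(extended))
    return extended

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training vectors (Euclidean)."""
    distances = (math.dist(x, query) for x in train_x)
    nearest = sorted(zip(distances, train_y))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

# Invented residues: (relative accessibility, hydrophilicity), epitope label.
RESIDUES = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.7),
            (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
LABELS = [1, 1, 1, 0, 0, 0]

# The meta learner is trained on the extended input space.
TRAIN_X = [stack_features(r, BASE_LEARNERS) for r in RESIDUES]

def meta_predict(features):
    """Classify one residue with the stacked 3-NN meta learner."""
    query = stack_features(features, BASE_LEARNERS)
    return knn_predict(TRAIN_X, LABELS, query)
```

In this sketch the only difference between the two strategies is what each base learner is allowed to see: `stack_features` passes the original features to every learner, whereas `cascade_features` passes the progressively extended vector along the chain.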

Results and discussion

Prediction correlations between base learners

For a meta-learning method to perform effectively, the base learners must have complementary predictive capabilities, which can be reflected by relatively low correlation among their predictions. We selected four conformational and four linear epitope predictors as our base learners. The conformational predictors were DiscoTope 2.0 [9], ElliPro [10], SEPPA 2.0 [11], and Bpredictor [18], and the linear epitope predictors were BepiPred [5], ABCpred [6], AAP [7], and BCPREDS [8]. We calculated the Pearson’s correlation coefficients for the prediction scores produced by the base prediction tools. To further analyze the correlations among predictions based on the score rankings, we sorted the prediction scores of all protein residues provided by each base learner and then conducted a Spearman’s rank correlation analysis. Tables 1 and 2 list the Pearson’s and Spearman’s rank correlation coefficients for all pairs of linear and conformational predictors, respectively. The average Pearson’s correlation coefficients were 0.383 for the linear predictors and 0.384 for the conformational predictors; the corresponding average Spearman’s coefficients were 0.370 and 0.459. These values indicate a relatively weak correlation among the epitope predictions of the base learners.
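The two correlation measures can be computed as in the following minimal Python sketch. The helper names are illustrative and the example scores are invented rather than taken from any prediction tool. Spearman's coefficient is obtained here as Pearson's coefficient applied to average ranks, which is one common way of handling tied scores.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient on the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented per-residue scores from two hypothetical predictors.
tool_a = [1.0, 2.0, 3.0, 4.0, 5.0]
tool_b = [1.0, 4.0, 9.0, 16.0, 25.0]
```

Because `tool_b` is a monotone but nonlinear function of `tool_a`, the rank correlation is 1 while Pearson's coefficient is slightly below 1, which is why the two measures are reported side by side.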

Table 1.

Correlation analysis of linear epitope predictors

Linear	AAP		ABCpred		BCPREDS
Linear	Pearson	Spearman	Pearson	Spearman	Pearson	Spearman
AAP	1	1	-	-	-	-
ABCpred	0.241	0.251	1	1	-	-
BCPREDS	0.515	0.520	0.342	0.287	1	1
BepiPred	0.383	0.372	0.282	0.299	0.536	0.489

Table 2.

Correlation analysis of conformational epitope predictors

Conformational	SEPPA 2.0		DiscoTope 2.0		Bpredictor
Conformational	Pearson	Spearman	Pearson	Spearman	Pearson	Spearman
SEPPA 2.0	1	1	-	-	-	-
DiscoTope 2.0	0.246	0.400	1	1	-	-
Bpredictor	0.339	0.509	0.372	0.364	1	1
ElliPro	0.333	0.487	0.388	0.362	0.624	0.630
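As a check on the averages quoted in the text, the off-diagonal coefficients of the two correlation tables can be averaged directly. The sketch below only re-types the tabulated values; the variable and function names are illustrative.

```python
from statistics import mean

# (Pearson, Spearman) for each distinct pair of linear predictors.
LINEAR = {
    ("ABCpred", "AAP"): (0.241, 0.251),
    ("BCPREDS", "AAP"): (0.515, 0.520),
    ("BCPREDS", "ABCpred"): (0.342, 0.287),
    ("BepiPred", "AAP"): (0.383, 0.372),
    ("BepiPred", "ABCpred"): (0.282, 0.299),
    ("BepiPred", "BCPREDS"): (0.536, 0.489),
}

# (Pearson, Spearman) for each distinct pair of conformational predictors.
CONFORMATIONAL = {
    ("DiscoTope 2.0", "SEPPA 2.0"): (0.246, 0.400),
    ("Bpredictor", "SEPPA 2.0"): (0.339, 0.509),
    ("Bpredictor", "DiscoTope 2.0"): (0.372, 0.364),
    ("ElliPro", "SEPPA 2.0"): (0.333, 0.487),
    ("ElliPro", "DiscoTope 2.0"): (0.388, 0.362),
    ("ElliPro", "Bpredictor"): (0.624, 0.630),
}

def average(pairs, index):
    """Mean coefficient over all pairs; index 0 is Pearson, 1 is Spearman."""
    return round(mean(values[index] for values in pairs.values()), 3)

linear_pearson = average(LINEAR, 0)
linear_spearman = average(LINEAR, 1)
conformational_pearson = average(CONFORMATIONAL, 0)
conformational_spearman = average(CONFORMATIONAL, 1)
```

The four results agree with the averages of 0.383, 0.370, 0.384, and 0.459 given in the text.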

Classifier	TPR	FPR	Precision	Accuracy	F-score	MCC	AUC
2-level (ANN)^a	0.514	0.019	0.705	0.944	0.594	0.573	0.748
2-level (C4.5)^a	0.511	0.023	0.663	0.941	0.577	0.551	0.744
2-level (k-NN)^a	0.496	0.012	0.783	0.949	0.607	0.599	0.742
2-level (SVM)^a	0.593	0.009	0.848	0.959	0.697	0.689^d	0.920
3-level Stacking^b	0.579	0.009	0.850	0.958	0.689	0.682^d	0.925
Cascade^c	0.588	0.010	0.843	0.959	0.693	0.684^d	0.925

Classifier	TPR	FPR	Precision	Accuracy	F-score	MCC	AUC
SEPPA 2.0	0.450	0.097	0.291	0.867	0.348	0.290	0.793
DiscoTope 2.0	0.930	0.761	0.096	0.294	0.173	0.110	0.617
Bpredictor	0.129	0.017	0.399	0.916	0.195	0.192	0.690
ElliPro	0.711	0.512	0.108	0.506	0.186	0.109	0.635
AAP	0.831	0.770	0.085	0.278	0.154	0.039	0.490
ABCpred	0.603	0.548	0.088	0.463	0.152	0.031	0.536
BCPREDS	0.962	0.906	0.084	0.163	0.154	0.053	0.476
BepiPred	0.718	0.500	0.110	0.517	0.191	0.118	0.609

Base learner	Parameter	Range of parameter values	Selected value
SEPPA 2.0	scoring threshold	0.00 ~ 1.00	0.21
DiscoTope 2.0	scoring threshold	−70.00 ~ 10.00	−18.09
Bpredictor	scoring threshold	0.00 ~ 1.00	0.88
ElliPro	scoring threshold	0.00 ~ 1.00	0.44
AAP	window size	10, 12, 14, 16, 18, 20	16
ABCpred	scoring threshold	0.00 ~ 1.00	0.84
BCPREDS	window size	12, 14, 16, 18, 20, 22	20
BepiPred	scoring threshold	−4.00 ~ 3.00	0.02

Classifier	TPR	FPR	Precision	Accuracy	F-score	MCC	AUC
SEPPA 2.0	0.289	0.050	0.204	0.922	0.239	0.202	0.765
DiscoTope 2.0	0.930	0.763	0.051	0.266	0.097	0.080	0.699
Bpredictor	0.010	0.007	0.057	0.951	0.017	0.006	0.683
ElliPro	0.826	0.535	0.064	0.480	0.119	0.118	0.696
AAP	0.846	0.641	0.055	0.379	0.104	0.086	0.609
ABCpred	0.507	0.480	0.045	0.519	0.082	0.011	0.530
BCPREDS	0.990	0.874	0.048	0.163	0.091	0.072	0.570
BepiPred	0.761	0.499	0.063	0.512	0.117	0.106	0.656
CBTOPE^a	0.159	0.003	0.681	0.961	0.258	0.317	0.681
LBtope^b	0.632	0.578	0.046	0.431	0.086	0.022	0.575
EPMeta^c	0.129	0.043	0.118	0.922	0.124	0.083	0.595
3-level Stacking	0.194	0.008	0.520	0.958	0.283	0.300	0.793
Cascade	0.199	0.008	0.519	0.958	0.288	0.304	0.789

Classifier	TPR	FPR	Precision	Accuracy	F-score	MCC	AUC
SEPPA 2.0	0.155	0.045	0.161	0.913	0.158	0.112	0.697
3-level Stacking^a w/o SEPPA 2.0	0.418	0.030	0.384	0.935	0.386	0.351	0.820
Cascade^a w/o SEPPA 2.0	0.404	0.033	0.405	0.937	0.404	0.371	0.820
DiscoTope 2.0	0.917	0.625	0.090	0.409	0.164	0.148	0.748
3-level Stacking^a w/o DiscoTope 2.0	0.231	0.013	0.541	0.939	0.324	0.327	0.809
Cascade^a w/o DiscoTope 2.0	0.212	0.013	0.532	0.939	0.303	0.310	0.806
Bpredictor	0.045	0.028	0.067	0.933	0.054	0.021	0.683
3-level Stacking^b w/o Bpredictor	0.119	0.006	0.471	0.957	0.190	0.222	0.779
Cascade^b w/o Bpredictor	0.149	0.002	0.769	0.962	0.250	0.328	0.787
ElliPro	0.421	0.279	0.131	0.694	0.199	0.090	0.630
3-level Stacking^c w/o ElliPro	0.367	0.009	0.802	0.935	0.504	0.516	0.861
Cascade^c w/o ElliPro	0.346	0.010	0.770	0.932	0.478	0.488	0.857
CBTOPE^d	0.801	0.424	0.118	0.591	0.205	0.188	0.798
3-level Stacking^d	0.446	0.010	0.751	0.954	0.558	0.557	0.913
Cascade^d	0.446	0.010	0.762	0.954	0.562	0.562	0.908

Classifier	TPR	FPR	Precision	Accuracy	F-score	MCC	AUC
Tool-based	0.060	0.005	0.364	0.956	0.103	0.133	0.663
Feature-based	0.065	0.006	0.342	0.955	0.109	0.134	0.658

Classifier ^*	TPR	FPR	Precision	Accuracy	F-score	MCC	AUC
Conformational 3-level Stacking	0.144	0.018	0.261	0.946	0.186	0.168	0.801
+BCPREDS	0.184	0.006	0.597	0.960	0.281	0.317	0.753
+AAP	0.194	0.008	0.527	0.958	0.284	0.303	0.788
+ BepiPred	0.214	0.009	0.524	0.958	0.304	0.317	0.788
+ABCpred	0.194	0.008	0.520	0.958	0.283	0.300	0.793

Classifier ^*	TPR	FPR	Precision	Accuracy	F-score	MCC	AUC
Conformational cascade	0.154	0.010	0.397	0.954	0.222	0.228	0.743
+BCPREDS	0.204	0.007	0.577	0.960	0.301	0.327	0.745
+ABCpred	0.184	0.005	0.617	0.960	0.284	0.323	0.760
+AAP	0.189	0.006	0.603	0.960	0.288	0.323	0.765
+ BepiPred	0.199	0.008	0.519	0.958	0.288	0.304	0.789

Base feature	Description	Reference
Propensity score	The propensity score is derived from a scoring function that sums the log-odds ratios of the amino acids in the spatial neighborhood (defined in [9]) around each residue in a given protein.	[9]
Residue accessibility	Calculated using NACCESS, which computes the accessibilities of the whole molecule submitted in a PDB file. NACCESS calculates the atomic accessible surface defined by rolling a probe around a van der Waals surface. The residue accessibilities are categorized into 4 classes: all-polar, nonpolar, total-side, and main-chain.	[31]
Secondary structure	Secondary structure refers to highly regular local sub-structures defined by patterns of hydrogen bonds between the main-chain peptide groups. In such cases, the chain of amino acids folds into regular repeating structures, such as α helix, β structure, and coil.	[26]
Accessible surface area	Calculated using Gerstein et al.’s calc-surface program to measure the accessible surface area of a sphere, on each point of which the center of a solvent molecule can be placed in contact with this atom without penetrating any other atoms of the molecule.	[38,39]
Atom volume	Calculated using Gerstein et al.’s calc-volume program. It calculates volumes by applying a geometric construction called Voronoi polyhedra to divide the total volume among the atoms in a protein model.	[37]
B factor	The B factor is also known as the Debye-Waller factor or the temperature factor. It is used to describe the attenuation of x-ray scattering or coherent neutron scattering caused by thermal motion. Two B factors of a protein were considered in this study: the B factor of side chain and the B factor of main chain.	[32,33]
Solvent excluded surface	Calculated using Sanner et al.’s MSMS program, which builds the solvent excluded surface based on the reduced surface.	[34]
Solvent accessible surface	Calculated using Sanner et al.’s MSMS program, which builds the solvent accessible surface based on the reduced surface.	[34]
PSSM	Derived by using PSI-BLAST to search the non-redundant protein database and taking the information content of the resulting position-specific scoring matrix as the base feature.	[36]
Side chain polarity	The 20 amino acids were divided into four categories: polar, nonpolar, acidic polar, and basic polar.	[40]
Hydropathy index	Kyte and Doolittle devised the hydropathy index by applying a sliding-window strategy that continuously determined the average hydropathy in a window as it advanced through the sequence.	[41]
Antigenic propensity	Kolaskar and Tongaonkar analyzed 156 antigenic determinants (<20 residues per determinant) in 34 different proteins to obtain the antigenic propensities of amino acid residues.	[42,43]
Flexibility	Karplus and Schulz developed the flexibility scale based on the mobility of the protein segments on 31 proteins with known structures.	[35]
Hydrophilic scale	Parker et al. developed the hydrophilic scale based on the high-performance liquid chromatography (HPLC) peptide retention data.	[26]

Classifier ^*	TPR	FPR	Precision	Accuracy	F-score	MCC	AUC
3-level stacking	0.194	0.008	0.520	0.958	0.283	0.300	0.793
\SEPPA 2.0	0.169	0.009	0.447	0.956	0.245	0.256	0.755
\ElliPro	0.144	0.012	0.349	0.952	0.204	0.203	0.746
\BepiPred	0.144	0.012	0.341	0.952	0.203	0.200	0.749
\BCPREDS	0.109	0.009	0.338	0.953	0.165	0.173	0.717
\AAP	0.124	0.016	0.255	0.947	0.167	0.153	0.758
\Bpredictor	0.154	0.012	0.356	0.952	0.215	0.213	0.724
\DiscoTope 2.0	0.045	0.006	0.257	0.954	0.076	0.092	0.672
\ABCpred	0.065	0.006	0.342	0.955	0.109	0.134	0.658

1BZQ_A	1J5O_B	1KXT_A	1KXV_A	1N5Y_B	1N6Q_B	2OZ4_A	2R4R_A	2R4S_A	2VIS_C
2VIT_C	2ZJS_Y	3BSZ_F	3KJ4_A	3KJ6_A	-	-	-	-	-

1A2Y_C	1ADQ_A	1AFV_A	1AHW_C	1AR1_B	1BGX_T	1BQL_Y	1BVK_C	1C08_C	1DQJ_C
1DZB_X	1DZB_Y	1EGJ_A	1EO8_A	1EZV_E	1FDL_Y	1FNS_A	1FSK_A	1G7H_C	1G7I_C
1G7J_C	1G7L_C	1G7M_C	1G9M_G	1G9N_G	1GC1_G	1HYS_B	1IC4_Y	1IC5_Y	1IC7_Y
1J1O_Y	1J1P_Y	1J1X_Y	1JHL_A	1JPS_T	1JRH_I	1KIP_C	1KIQ_C	1KIR_C	1KYO_E
1LK3_A	1MEL_L	1MHP_B	1MLC_E	1N8Z_C	1NBY_C	1NBZ_C	1NDG_C	1NDM_C	1NSN_S
1OAK_A	1ORS_C	1OSP_O	1QLE_B	1R3K_C	1RJL_C	1RVF_1	1RVF_2	1RVF_3	1RZJ_G
1RZK_G	1TZH_V	1TZI_V	1UA6_Y	1UAC_Y	1UJ3_C	1V7M_V	1W72_A	1WEJ_F	1XIW_A
1YJD_C	1YQV_Y	1YY9_A	1ZTX_E	2AEP_A	2ARJ_Q	2B2X_A	2DD8_S	2EIZ_C	2HMI_B
2Q8A_A	2QQK_A	2QQN_A	2UZI_R	2VH5_R	2VXQ_A	2VXT_I	2W9E_A	2XTJ_A	2ZUQ_A
3G6D_A	3GRW_A	3O0R_B	3PGF_A	-	-	-	-	-	-

Performance measure	Definition
TPR^a	TP/(TP + FN)
FPR	FP/(FP + TN)
Precision^b	TP/(TP + FP)
Accuracy	(TP + TN)/(TP + TN + FP + FN)
F-score	2 × TPR × Precision/(TPR + Precision)
MCC	(TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN))
AUC	Area under the ROC curve
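The performance measures defined above can be computed from confusion-matrix counts as in the following minimal Python sketch. The counts and scores in the example are invented, the function names are illustrative, and the code assumes that every denominator is nonzero. MCC is computed with the standard Matthews correlation coefficient formula, and AUC is obtained through its equivalent pairwise form: the fraction of (epitope, nonepitope) residue pairs in which the epitope residue receives the higher score, with ties counted as one half.

```python
import math

def confusion_measures(tp, fp, tn, fn):
    """Threshold-dependent measures from confusion-matrix counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_score = 2 * tpr * precision / (tpr + precision)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "TPR": tpr,
        "FPR": fpr,
        "Precision": precision,
        "Accuracy": accuracy,
        "F-score": f_score,
        "MCC": mcc,
    }

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Invented example: 1000 residues, 100 of which are true epitope residues.
example = confusion_measures(tp=50, fp=10, tn=890, fn=50)
```

With these counts, accuracy is high (0.94) although only half of the epitope residues are recovered, which illustrates why measures such as F-score and MCC are reported alongside accuracy when nonepitope residues dominate the data.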
