Prediction of supertype-specific HLA class I binding peptides using support vector machines

Guang Lan Zhang; Ivana Bozic; Chee Keong Kwoh; J Thomas August; Vladimir Brusic

doi:10.1016/j.jim.2006.12.011

Author manuscript; available in PMC: 2010 Jan 13.

Published in final edited form as: J Immunol Methods. 2007 Jan 25;320(1-2):143–154. doi: 10.1016/j.jim.2006.12.011


PMCID: PMC2806231 NIHMSID: NIHMS21238 PMID: 17303158

Abstract

Experimental approaches for identifying T-cell epitopes are time-consuming, costly and not applicable to large-scale screening. Computer modeling methods can help to minimize the number of experiments required, enable systematic scanning for candidate major histocompatibility complex (MHC) binding peptides and thus speed up vaccine development. We developed a prediction system based on a novel data representation of peptide/MHC interaction and support vector machines (SVM) for prediction of peptides that promiscuously bind to multiple Human Leukocyte Antigen (HLA, human MHC) alleles belonging to an HLA supertype. Ten-fold cross-validation results showed that the overall performance of the SVM models is improved in comparison to our previously published methods based on hidden Markov models (HMM) and artificial neural networks (ANN), a finding also confirmed by blind testing. At specificity 0.90, sensitivity values of the SVM models were 0.90 and 0.92 for the HLA-A2 and -A3 datasets, respectively. The average areas under the receiver operating characteristic curve (A_ROC) of the SVM models in blind testing were 0.89 and 0.92 for the HLA-A2 and -A3 datasets, respectively. In validation using a full overlapping study of 9-mer peptides from human papillomavirus type 16 E6 and E7 proteins, the A_ROC values of the HLA-A2 and -A3 SVM models were 0.94 and 0.95, respectively. In addition, a large-scale experimental dataset was used to validate the HLA-A2 and -A3 SVM models. The SVM prediction models were integrated into a web-based computational system, MULTIPRED1, accessible at antigen.i2r.a-star.edu.sg/multipred1/.

Keywords: T-cell epitope, Human Leukocyte Antigen supertype, Promiscuous binding peptide, Support vector machines

1. Introduction

Cellular immunity in vertebrates is mediated by T cells of the immune system, which generate highly specific and lasting immune responses to pathogens (Fabbri et al., 2003). T-cell-based immune responses are mediated by antigenic peptides presented by major histocompatibility complex (MHC) molecules (Pamer and Cresswell, 1998; Yewdell and Bennink, 2001). Antigenic peptides bind MHC molecules and form peptide/MHC complexes. Peptide/MHC complexes shown to be recognized by T cells are called T-cell epitopes. Identifying promiscuous peptides that bind multiple Human Leukocyte Antigen (HLA, human MHC) alleles is a basis for T-cell epitope mapping and epitope-based vaccine development (Berzofsky et al., 2001; Srinivasan et al., 2004a; De Groot, 2006). HLA genes are the most polymorphic human genes known (Williams, 2001), with more than 2400 allelic variants identified in the human population as of July 2006 (www.anthonynolan.org.uk/HIG/). Because of the high HLA polymorphism, identifying promiscuous peptides that bind more than one HLA allele is essential for the development of vaccines with a broad and unbiased coverage of the human population. HLA alleles that share sequence similarity and that bind largely overlapping sets of peptides define HLA supertypes (Sette and Sidney, 1999; Doytchinova et al., 2004; Lund et al., 2004). Promiscuous peptides have been reported in the context of HLA supertypes (Threlked et al., 1997; Wilson et al., 2003; Srinivasan et al., 2004b). Epitope-based vaccines show great potential in fighting infectious diseases (Sette et al., 2000; Ada, 2003; Wilson et al., 2003), and they are also being investigated for the control of cancers, allergy, autoimmunity, and even dementia (Alexander et al., 2002; Durrant and Ramage, 2005; Quintana and Cohen, 2005; Verhagen et al., 2005; Wisniewski and Frangione, 2005; De Groot, 2006).

Experimental validation of peptide binding to HLA molecules is time-consuming and costly, and thus not applicable to large-scale screening across multiple HLA alleles. Computational methods are instrumental for systematic large-scale identification of MHC-binding peptides (Schirle et al., 2001; Brusic et al., 2004). One class of methods is the structure-based approach, which relies on the structural conservation observed in 3D structures of peptide–MHC complexes (Schueler-Furman et al., 2000; Bui et al., 2006; Tong et al., 2006). These methods are computationally intensive and have mainly been applied to MHC molecules with known crystal structures. Data-driven approaches include statistical methods based on experimental peptide binding measurements. These methods include binding motifs (Rammensee et al., 1993), quantitative matrices (Parker et al., 1994; Singh and Raghava, 2003; Reche and Reinherz, 2005; Peters and Sette, 2005), artificial neural networks (ANN) (Honeyman et al., 1998; Christensen et al., 2003), hidden Markov models (HMM) (Mamitsuka, 1998; Brusic et al., 2002), decision trees (Savoie et al., 1999; Segal et al., 2001), discriminant analysis (Mallios, 2001), multivariate regression (Lin et al., 2004), ensemble classifiers (Xiao and Segal, 2005), support vector machines (SVM) (Donnes and Elofsson, 2002; Zhao et al., 2003; Bhasin and Raghava, 2004; Riedesel et al., 2004; Bozic et al., 2005; Liu et al., 2006; Cui et al., 2007), and the biosupport vector machine, which modifies a conventional support vector machine by introducing a biobasis function so that the non-numerical attributes of amino acids can be recognized without a feature extraction process (Yang and Johnson, 2005). Recently, a combined structure- and sequence-based method was reported in which residue-based energy terms from molecular dynamics simulations are used as features to train SVM prediction models for peptide/MHC class I binding (Antes et al., 2006).

SVM-based models showed higher accuracy than other prediction methods in studies of peptide binding to a single HLA molecule. We have employed SVM models with a novel data representation, which captures information on the interaction between a peptide and an HLA molecule and allows the use of a single model for prediction of peptide binding to a multiplicity of alleles that belong to a particular HLA supertype. Earlier we reported the application of HMM (Brusic et al., 2002) and ANN (Zhang et al., 2005b) for prediction of peptide binding to the HLA-A2 supertype. A web-based prediction system, MULTIPRED (Zhang et al., 2005a), was developed using HMM and ANN models. In this study we extended MULTIPRED by applying SVM models. The SVM-MULTIPRED was applied to prediction of HLA class I supertype-specific promiscuous binding peptides in the context of HLA-A2 and -A3. Extensive testing, including blind testing and 10-fold cross-validation, was performed to assess the performance of the prediction models. Validation of the models was conducted using experimental data from human papillomavirus (HPV) type 16 E6 and E7 proteins and a large-scale experimental dataset made available recently by Peters et al. (2006). The performance of the SVM models was compared with that of the HMM and ANN models. MULTIPRED1 is the updated version of MULTIPRED (Zhang et al., 2005a). MULTIPRED1 is accessible at antigen.i2r.a-star.edu.sg/multipred1/.

2. Materials and methods

2.1. Data and data representation

Nine-mer peptide data were extracted from the MHCPEP database (Brusic et al., 1994), published articles, and a set of HLA non-binding peptides (Brusic, V. unpublished data). The HLA-A2 supertype dataset, named Dataset1, has 3050 peptides (664 binders and 2386 non-binders) related to 15 alleles (Table 1) of the HLA-A2 supertype, and the HLA-A3 supertype dataset, named Dataset2, has 2216 peptides (680 binders and 1536 non-binders) related to eight alleles (Table 2) of the HLA-A3 supertype. Nine-mer peptides were used in building models because the predominant length of peptides that bind HLA-A2 and -A3 (class I) alleles is nine amino acids (Rammensee et al., 1993). The datasets are available for download at antigen.i2r.a-star.edu.sg/multipred1/data.
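Because the models work on 9-mer peptides, scanning a whole protein for candidate binders amounts to sliding a window of nine residues along its sequence. The following is a minimal sketch of that step only; the function name `nine_mers` and the toy sequence are illustrative assumptions and are not taken from the MULTIPRED1 implementation.

```python
def nine_mers(sequence, k=9):
    """Return every overlapping k-mer (default k=9) of a protein sequence.

    A sequence of length n yields n - k + 1 peptides; a sequence shorter
    than k yields an empty list.
    """
    sequence = sequence.strip().upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Made-up 20-residue sequence, used only to illustrate the windowing.
toy_protein = "ACDEFGHIKLMNPQRSTVWY"
peptides = nine_mers(toy_protein)
print(len(peptides), peptides[0], peptides[-1])
```

Each peptide produced this way would then be scored by a prediction model; the scoring itself is not shown here.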

Table 1.

Number of 9-mer peptides related to 15 HLA alleles belonging to A2 supertype in Dataset1

HLA-A2 allele	Binders	Non-binders	Total
A*0201	440	1999	2439
A*0202	45	25	70
A*0203	46	7	53
A*0204	23	224	247
A*0205	16	40	56
A*0206	43	37	80
A*0207	4	11	15
A*0208	0	4	4
A*0209	5	1	6
A*0210	3	0	3
A*0211	4	0	4
A*0214	8	1	9
A*0217	2	4	6
A*6802	23	31	54
A*6901	2	2	4
Total	664	2386	3050

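Table 2.

Number of 9-mer peptides related to eight HLA alleles belonging to A3 supertype in Dataset2

HLA-A3 allele	Binders	Non-binders	Total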
A*0301	107	89	196
A*0302	146	259	405
A*1101	142	223	365
A*1102	142	211	353
A*3101	44	54	98
A*3301	35	62	97
A*3303	5	0	5
A*6801	59	638	697
Total	680	1536	2216

	A*0201	A*0202	A*0203	A*0206	A*6802	A*6901
The Peters dataset	3089	1447	1443	1437	1434	833
Overlapping	240	54	48	47	47	0
Non-overlapping	2849 (1024/1825)	1393 (611/782)	1395 (600/795)	1390 (480/910)	1387 (387/1000)	833 (86/747)

	A*0301	A*1101	A*3101	A*3301	A*6801
The Peters dataset	2094	1985	1869	1140	1141
Overlapping	97	102	71	70	69
Non-overlapping	1997 (452/1545)	1883 (618/1265)	1798 (399/1399)	1070 (161/909)	1072 (455/617)

Specificity	Sensitivity (SVM)	Sensitivity (ANN)	Sensitivity (HMM)
0.80	0.96 (−0.82)	0.95	0.69
0.90	0.90 (−0.52)	0.84	0.55
0.95	0.76 (−0.02)	0.55	0.42

Specificity	Sensitivity (SVM)	Sensitivity (ANN)	Sensitivity (HMM)
0.80	0.97 (−0.65)	0.86	0.56
0.90	0.92 (−0.30)	0.66	0.36
0.95	0.84 (−0.05)	0.41	0.24
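Reporting sensitivity at a fixed specificity means choosing a score threshold that rejects the required fraction of non-binders and then counting how many binders still score above it. The sketch below shows one common way to do this on toy scores; the function name, the tie handling (scores equal to the threshold count as rejected) and the data are assumptions for illustration, not the procedure used to produce the tables above.

```python
import math


def sensitivity_at_specificity(binder_scores, nonbinder_scores, specificity):
    """Return (sensitivity, threshold) at a target specificity.

    The threshold is the smallest non-binder score such that at least the
    target fraction of non-binders scores at or below it. A peptide is
    predicted to bind only if its score is strictly above the threshold.
    """
    ranked = sorted(nonbinder_scores)
    # Number of non-binders that must be rejected; the small epsilon guards
    # against floating-point round-up in the product.
    n_reject = math.ceil(specificity * len(ranked) - 1e-9)
    threshold = ranked[n_reject - 1] if n_reject > 0 else float("-inf")
    hits = sum(1 for score in binder_scores if score > threshold)
    return hits / len(binder_scores), threshold


# Toy scores, invented for illustration only.
nonbinders = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
binders = [5, 8.5, 9.5, 11, 12]
for target in (0.80, 0.90):
    print(target, sensitivity_at_specificity(binders, nonbinders, target))
```

Raising the target specificity moves the threshold up, so sensitivity can only stay the same or fall, which matches the trend in each column of the tables.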

HLA-A2 allele	SVM	ANN	HMM
A*0201	0.96	0.94	0.90
A*0202	0.83	0.65	0.80
A*0204	0.94	0.83	0.87
A*0205	0.96	0.91	0.82
A*0206	0.88	0.81	0.84
Average	0.914	0.828	0.846
Std. dev	0.057	0.113	0.039

HLA-A3 allele	SVM	ANN	HMM
A*0301	0.93	0.89	0.94
A*0302	0.86	0.84	0.86
A*1101	0.96	0.91	0.91
A*1102	0.96	0.86	0.86
A*3101	0.87	0.69	0.66
A*3301	0.92	0.63	0.58
A*6801	0.97	0.96	0.95
Average	0.924	0.826	0.823
Std. dev	0.044	0.144	0.120
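The summary rows of the two per-allele tables above can be checked directly from the listed values. The sketch below recomputes them for the SVM column, assuming the summary rows are the arithmetic mean and the sample standard deviation (n − 1 denominator); that assumption is consistent with the printed figures, and the helper name `summarize` is illustrative.

```python
from statistics import mean, stdev

# Per-allele SVM values copied from the two tables above.
a2_svm = [0.96, 0.83, 0.94, 0.96, 0.88]
a3_svm = [0.93, 0.86, 0.96, 0.96, 0.87, 0.92, 0.97]


def summarize(values):
    """Return (mean, sample standard deviation), rounded to three decimals."""
    return round(mean(values), 3), round(stdev(values), 3)


print("HLA-A2 SVM:", summarize(a2_svm))
print("HLA-A3 SVM:", summarize(a3_svm))
```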

HLA-A*	0201	0202	0203	0206	6802	6901
SVM A_ROC	0.91	0.83	0.82	0.79	0.74	0.81
ANN A_ROC	0.88	0.79	0.79	0.74	<0.64	–

HLA-A*	0301	1101	3101	3301	6801
SVM A_ROC	0.87	0.89	0.83	0.79	0.76
ANN A_ROC	0.85	0.87	<0.83	<0.81	<0.77
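A_ROC can be read as the probability that a randomly chosen binder receives a higher score than a randomly chosen non-binder. The sketch below computes it by comparing all binder/non-binder pairs, counting ties as half; this is a generic illustration on invented scores, with an assumed function name `a_roc`, and not the evaluation code behind the values above.

```python
def a_roc(binder_scores, nonbinder_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (binder, non-binder) pairs in which the binder
    scores higher; tied pairs contribute one half.
    """
    wins = 0.0
    for b in binder_scores:
        for n in nonbinder_scores:
            if b > n:
                wins += 1.0
            elif b == n:
                wins += 0.5
    return wins / (len(binder_scores) * len(nonbinder_scores))


# Toy scores, invented for illustration only.
print(a_roc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2, 0.4]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of binders from non-binders. The pairwise loop is quadratic in the number of peptides, which is acceptable for datasets of a few thousand peptides but would be replaced by a rank-based computation for larger ones.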
