Mature Epitope Density - A strategy for target selection based on immunoinformatics and exported prokaryotic proteins

Anderson R Santos; Vanessa Bastos Pereira; Eudes Barbosa; Jan Baumbach; Josch Pauling; Richard Röttger; Meritxell Zurita Turk; Artur Silva; Anderson Miyoshi; Vasco Azevedo

doi:10.1186/1471-2164-14-S6-S4

2013 Oct 25;14(Suppl 6):S4. doi: 10.1186/1471-2164-14-S6-S4



PMCID: PMC3908659 PMID: 24564223

Abstract

Background

Current immunological bioinformatic approaches focus on the prediction of allele-specific epitopes capable of triggering immunogenic activity. The prediction of major histocompatibility complex (MHC) class I epitopes is well studied, and various software solutions exist for this purpose. However, currently available tools do not account for the concentration of epitope products in the mature protein product and its relation to the reliability of target selection.

Results

We developed a computational strategy based on measuring the epitope's concentration in the mature protein, called Mature Epitope Density (MED). Our method, though simple, is capable of identifying promising vaccine targets. Our online software implementation provides a computationally light and reliable analysis of bacterial exoproteins and their potential for vaccines or diagnosis projects against pathogenic organisms. We evaluated our computational approach by using the Mycobacterium tuberculosis (Mtb) H37Rv exoproteome as a gold standard model. A literature search was carried out on 60 out of 553 Mtb's predicted exoproteins, looking for previous experimental evidence concerning their possible antigenicity. Half of the 60 proteins were classified as highest scored by the MED statistic, while the other half were classified as lowest scored. Among the lowest scored proteins, ~13% were confirmed as not related to antigenicity or not contributing to the bacterial pathogenicity, and 70% of the highest scored proteins were confirmed as related. There was no experimental evidence of antigenic or pathogenic contributions for three of the highest MED-scored Mtb proteins. Hence, these three proteins could represent novel putative vaccine and drug targets for Mtb. A web version of MED is publicly available online at http://med.mmci.uni-saarland.de/.

Conclusions

The software presented here offers a practical and accurate method to identify potential vaccine and diagnosis candidates against pathogenic bacteria by "reading" results from well-established reverse vaccinology software in a novel way, considering the epitope's concentration in the mature portion of the protein.

Background

Tuberculosis (TB) has been one of the major causes of morbidity and mortality worldwide for centuries, and control of the spread of Mycobacterium tuberculosis (Mtb) infection remains a public health priority [1]. More than 9 million new cases of TB in humans arise every year, resulting in nearly 2 million deaths worldwide [2]. Bacille Calmette-Guérin (BCG), the current vaccine against TB, has its limitations; although it is protective against severe childhood TB, it does not satisfactorily prevent pulmonary disease in adults [3]. Effective prophylactic and therapeutic immunization is a key strategy for global epidemic control [1]. Novel TB vaccine candidates include BCG or recombinant BCG (rBCG) strains, which are used as the prime vaccination in heterologous prime-boost strategies [4]. Booster vaccinations can include viral vectors that express immunodominant Mtb antigens or fusion proteins of these antigens, combined with adjuvanticity to ensure immunogenicity [5]. Many Mtb antigens, including Ag85A, Ag85B, TB10.4 and ESAT-6, have been tested as vaccine candidates; however, these have not been shown to be successful at treating TB [6]. Consequently, discovering new antigens continues to be a crucial factor for the successful development of vaccines against TB [7].

Exported proteins are currently the main target for Reverse Vaccinology (RV) due to their essential role in host-pathogen interactions [8]. Examples of these interactions include the following: (i) adherence to host cells; (ii) invasion of the cells to which the pathogen has adhered; (iii) damage to host tissues; (iv) resistance to environmental stress through the cell's defense machinery; and (v) mechanisms for subversion of the host's immune response [9,10]. In general, RV reveals a large number of proteins that could be potential vaccine targets, which then have to be confirmed via cost-intensive and time-consuming wet-lab experiments. Incorporating immunoinformatic filters that identify high-potential target proteins into the RV process could reduce these drawbacks [11]. Immunoinformatics focuses mainly on small peptides of 8 to 11 residues, called linear epitopes, and particularly on those that bind strongly to MHC class I molecules. Just one epitope per protein can be enough to create an immune response in the host [12-14]. Bioinformatic techniques to search for epitopes are well understood and available, but can sometimes lead to high false positive rates [15]. Despite this drawback, epitope predictors are capable of identifying weak or even strong epitope motifs that have been experimentally neglected [16].

Epitope density has been described previously as a function of "hot spots", or regions enriched in MHC class II binding epitopes [16]. That work reported 544, 609 and 757 15-mer peptides binding to three, two and just one of the molecules HLA-DR1, -DR2 and -DR4, respectively. An analysis of two of the 61 proteins examined in that study showed that Ag85B and MPT63 contain, respectively, 30 and 23 peptides with the highest binding to MHC molecules; however, experimental data were only available for 10 peptides derived from MPT63.

Asking whether specific defined domains have high epitope densities, one study found that signal peptides and trans-membrane domains have exceptionally high epitope densities [17]. That work computed the high epitope density of signal peptides using in silico methods, a result corroborated by the high percentage of identified signal peptide epitopes in the IEDB (Immune Epitope Database). The enhanced immunogenicity of signal peptides was experimentally confirmed using peptides derived from Mtb proteins. High antigen-specific response rates and population coverage were found for signal peptide sequences when compared with non-signal peptide antigens derived from the same proteins. The MED (Mature Epitope Density) concept is similar to epitope density [16]. To demonstrate the potential of MED to uncover bacterial targets for RV, we collected a set of experimental evidence from the literature that demonstrates a relationship between high MED scores and promising targets in M. tuberculosis (Mtb) strain H37Rv.

Results

Allele frequency

Figure 1 shows the MHC allele histogram of the predicted epitopes of all 553 Mtb H37Rv exported proteins. The horizontal axis represents the alleles available for prediction through the NetMHC software (version 3.0), and the vertical axis represents the absolute number of epitopes predicted for each allele across all exported proteins. The MHC alleles are ordered by decreasing number of predicted epitopes. The first five MHC alleles are human and account for 52.32% of all predicted epitopes, the first 15 account for 80.83%, and the last 24 MHC alleles account for only 2.58% of the overall NetMHC epitope prediction.

**MHC alleles used to predict MED score**. MHC alleles in the NetMHC software (horizontal axis) and the number of predicted strong binders to epitopes from *Mtb* H37Rv exported proteins (vertical axis).
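The cumulative shares quoted for Figure 1 can be derived from a table of per-allele epitope counts. The following is a minimal sketch of that calculation; the allele labels and counts are placeholders invented for illustration (they are not the data behind Figure 1), and `cumulative_share` is a hypothetical helper name.

```python
def cumulative_share(counts, top_k):
    """Percentage of all predicted epitopes contributed by the top_k alleles.

    counts maps an allele label to its number of predicted epitopes.
    """
    ranked = sorted(counts.values(), reverse=True)  # decreasing order, as in Figure 1
    total = sum(ranked)
    if total == 0:
        return 0.0
    return 100.0 * sum(ranked[:top_k]) / total


# Placeholder counts for illustration only (not the paper's data).
toy_counts = {"allele_A": 400, "allele_B": 300, "allele_C": 200, "allele_D": 100}

print(cumulative_share(toy_counts, 2))  # share of the two most productive alleles
```

With real NetMHC output, `counts` would hold one entry per allele and `top_k` would be set to 5 or 15 to obtain the shares reported above.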

Control datasets

In Figure 2, the control groups are shown in three panels exhibiting the number of proteins, the percentage this number represents and the average MED score. The horizontal axis of all three panels states the predicted sub-cellular location (cytoplasmic, membrane bound, PSE or secreted) for three groups of proteins: the Doytchinova et al. (2007) control groups (positive and negative control groups, represented by Dplus and Dminus, respectively) and an Mtb positive control group (Mtbplus) taken from the AntigenDB. The vertical axes display (from top to bottom) the number of proteins, the percentage represented by that number and the average MED score for each control group. Proteins predicted as cytoplasmic are the majority in both the Dminus and Mtbplus groups by number (top panel) and percentage (middle panel), while the Dplus group has more predicted exported proteins. The predominance of predicted cytoplasmic proteins in the Mtbplus group is surprising, as the majority of antigenic proteins was expected to be exported to the extracellular milieu, as observed in the Dplus group, which contains several pathogenic organisms.

**MED score applied to previous control groups**. Three previously published protein control groups were assessed according to predicted sub-cellular location and the average MED score. Quantities (top panel) and percentages (middle panel) of proteins, plus the average MED scores per predicted sub-cellular location, were analyzed. These control groups include *M. tuberculosis* antigenic proteins obtained from the AntigenDB site that were observed to elicit cellular immune responses, and the control groups presented by Doytchinova *et al*. (2007).

Two results in the bottom panel should be noted. First, the average MED scores were very similar among the three control groups, showing that MED is not necessarily a binary classifier for targets but rather a continuous measure capable of ranking the preferable targets; however, when MED scores differ substantially, it can be used like a binary classifier. This use was assessed on the evidence dataset shown in the next section. Second, the average MED score for proteins predicted as membrane-integral was twice as high as in the other sub-cellular compartments. This result agrees with other work in which signal peptides and trans-membrane domains were found to have exceptionally high CD8+ T cell epitope densities [17].

Evidence dataset

Figure 3 shows a histogram of the distribution of MED scores for all 553 Mtb exported proteins. As seen in Table 1, MED scores in the highest-scored dataset range from 15.67 to 27.00 nM/mer; this dataset lies on the far right side of Figure 3. These values contrast strongly with the MED scores in Table 2, which are between 0.00 and 3.19 nM/mer; this lowest-scored dataset lies on the far left side of Figure 3. As mentioned in the previous section, the MED score is not necessarily a binary classifier, but it can be used as one for proteins scored within these widely separated ranges, allowing us to develop evidence for the general importance of MED scores.

**MED scores from *M. tuberculosis***. MED score histogram for *Mtb* H37Rv exported proteins. Data in Tables 1 and 2 are situated at the extremities of this graph.

Table 1.

MED highest-scored proteins.

Genome Locus	n	d	MED (nM/mer)	Location	Evidence	Unique publication identifier
Rv2452c	14	18	27.00	SEC	favorable	10.1046/j.1365-2958.1999.01593.x
Rv1811	66	108	21.34	PSE	favorable	PMID:10760138
Rv3018c	145	234	20.72	PSE	favorable	10.1099/jmm.0.47565-0, 10.1046/j.1365-2958.1999.01593.x, 10.1016/j.vaccine.2004.08.046
Rv1489	37	63	20.36	PSE	favorable	10.1186/1471-2180-10-132, 10.1021/pr0500049, 10.1016/j.tube.2008.01.003
Rv0847	58	98	19.89	SEC	favorable	10.1016/j.tube.2006.01.014, 10.1016/j.tube.2006.01.014
Rv0436c	78	123	19.14	PSE	favorable	10.1074/jbc.M004658200
Rv0116c	117	214	17.61	SEC	favorable	10.1099/mic.0.024802-0
Rv1841c	167	308	17.33	PSE	favorable	10.1128/jb.184.4.1112-1120.2002
Rv2339	224	437	17.25	PSE	favorable	10.1093/molbev/msm111
Rv0589	195	364	17.10	PSE	favorable	10.1007/s11010-011-0733-5
Rv1158c	107	189	17.07	SEC	favorable	10.1016/j.tube.2004.09.005
Rv0286	129	242	17.04	PSE	favorable	10.1128/IAI.70.12.6996-7003.2002
Rv3497c	161	314	16.87	SEC	favorable	10.1073/pnas.1631248100
Rv1967	151	305	16.53	SEC	favorable	10.1111/j.1574-695X.2010.00677.x
Rv1620c	156	311	16.52	PSE	favorable	10.1073/pnas.1003219107, 20090285847
Rv3000	86	167	16.04	PSE	favorable	10.1016/j.tube.2006.01.014
Rv2690c	64	126	16.03	PSE	favorable	Patent 7393540
Rv0804	87	175	15.85	SEC	favorable	10.1107/S1744309108031679
Rv0598c	58	104	15.85	SEC	favorable	PMID:12657046
Rv3693	203	404	15.69	SEC	favorable	10.4049/jimmunol.1002212, 10.1002/pmic.200600853
Rv2262c	100	206	15.69	PSE	favorable	PMID:12368431


Table 1 lists 21 of the 30 highest MED-scored proteins among the Mtb H37Rv exported proteins. Each protein is accompanied by at least one unique publication identifier, which can be a DOI, a PubMed ID or a patent number. A protein can be cited by two or three different publications; some publications cite several proteins. The first column shows the protein locus tag, followed by the number of predicted epitopes (n) and the epitope probability as a function of its proportion in the mature protein (d). The MED score is calculated as n divided by d. Evidence can be favorable or contrary based on publication results and the expectation indicated by the MED score.

Table 2.

MED lowest-scored proteins.

Genome Locus	n	d	MED (nM/mer)	Location	Evidence	Unique publication identifier
Rv0532	59	555	3.19	SEC	contrary	10.1021/pr1005108
Rv0746	77	741	3.11	SEC	contrary	10.1186/1471-2148-6-95, 10.1016/j.micinf.2006.03.015
Rv1468c	37	328	3.03	SEC	contrary	10.1021/pr1005108
Rv3590c	48	542	2.96	SEC	favorable	10.1016/S1672-0229(08)60039-X
Rv3511	66	678	2.91	SEC	favorable	10.1186/1471-2148-6-95
Rv1100	20	160	2.88	PSE	contrary	10.1099/mic.0.27204-0
Rv3312A	4	64	2.69	SEC	contrary	10.1073/pnas.0602304104
Rv3595c	34	400	2.51	SEC	contrary	10.1186/1471-2148-6-95
Rv1091	60	814	2.40	SEC	contrary	10.1186/1471-2148-6-95
Rv3706c	4	50	2.32	PSE	contrary	10.3389/fmicb.2010.00121
Rv3345c	98	1498	2.05	SEC	favorable	10.1186/1471-2148-6-95, 10.1099/mic.0.26660-0
Rv0559c	4	78	2.05	SEC	contrary	10.1371/journal.pone.0007615
Rv3388	44	690	2.03	SEC	contrary	10.1016/j.tube.2003.12.014
Rv0833	52	689	1.75	PSE	favorable	10.1186/1471-2148-6-95
Rv2487c	28	655	1.15	SEC	contrary	Patent EP2207035
Rv3514	43	1448	0.93	SEC	contrary	10.1111/j.1365-2567.2010.03383.x
Rv3508	40	1860	0.71	SEC	contrary	10.1371/journal.pone.0002375, 10.1002/prot.10586
Rv3655c	0	0	0.00	PSE	contrary	10.1371/journal.pone.0010474


Table 2 lists 18 of the 30 lowest MED-scored proteins among the Mtb H37Rv exported proteins. Each protein is accompanied by at least one unique publication identifier, which can be a DOI, a PubMed ID or a patent number. A protein can be cited by two or three different publications; some publications cite several proteins. The first column in Tables 1 and 2 shows the protein locus tag, followed by the number of predicted epitopes (n) and the epitope probability as a function of its proportion in the mature protein (d). The MED score is calculated as n divided by d. Evidence can be favorable or contrary based on publication results and the expectation indicated by the MED score.

MED score limitations

Figure 4 helps to explain the main limitation of MED scores. It shows two pairs of box plots, each pair representing a numerator (predicted epitopes) and a denominator (possibilities or chances for epitopes) as used in Equation 1. The first pair of boxes shows the numerators and denominators of the 30 lowest MED-scored Mtb exported proteins, found at the far left side of Figure 3; the second pair shows those of the 30 highest MED-scored Mtb exported proteins, found at the far right side of Figure 3. The numerators and denominators were investigated to determine how protein length can influence the MED score. The number of epitopes predicted in the highest-scored subset is more than twice that of the lowest-scored subset. This result was expected because there is evidence that the highest-scored subset is composed of proteins related to antigenicity or contributing to bacterial pathogenicity, while the majority of the lowest-scored subset is not. The number of possibilities for linear epitopes in the lowest-scored subset is almost three times higher than in the highest-scored subset. This numerical difference in the denominators is the major limitation of the MED score strategy, especially for data above the average. Quartiles Q3 and Q4, among those with lowest chances, include half (7/14) of the evidence, in contrast to our hypothesis of an existing relation between MED and promising reverse vaccinology targets. These quartiles include denominators between 537 and 1,860 (just one greater than 1,498). Thus, according to the data, MED scores tend to produce false negatives when there is a difference factor of at least five between the number of predictions and the number of epitope possibilities located in the mature portion of the amino acid sequence. No false positives were observed when this factor was less than two. An interesting result is that the two biggest control groups from Figure 2, Dplus and Dminus, had average factors (fold) of 3.22 and 2.82, respectively.
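The factor discussed above can be checked directly against the n and d columns of Tables 1 and 2. The sketch below assumes the factor is the ratio d/n (chances per predicted epitope) and uses the thresholds of two and five mentioned in the text; the function names and the category labels are illustrative choices, not part of the MED software.

```python
# (locus, predicted epitopes n, epitope chances d, evidence) copied from Tables 1 and 2
ROWS = [
    ("Rv2452c", 14, 18, "favorable"),
    ("Rv2339", 224, 437, "favorable"),
    ("Rv3590c", 48, 542, "favorable"),
    ("Rv3345c", 98, 1498, "favorable"),
    ("Rv1100", 20, 160, "contrary"),
]


def fold(n, d):
    """Assumed definition of the factor: chances divided by predictions (d/n)."""
    return d / n if n > 0 else None  # undefined when nothing was predicted


def fold_category(n, d, low=2.0, high=5.0):
    """Label a protein by where its factor falls relative to the two thresholds."""
    f = fold(n, d)
    if f is None:
        return "undefined"
    if f < low:
        return "below 2"
    if f >= high:
        return "at least 5"
    return "between 2 and 5"


for locus, n, d, evidence in ROWS:
    print(locus, round(fold(n, d), 2), fold_category(n, d), evidence)
```

Under this reading, the Table 1 rows shown here fall below two, whereas the lowest-scored proteins with favorable evidence, such as Rv3590c and Rv3345c, exceed five.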

**MED score limitation**. Box plots of pairs of numerators and denominators for the 60 lowest and highest MED scores from the *Mtb* H37Rv exported proteins. "Predicted" stands for epitope predictions and "Chances" stands for possible 9-mer windows in the mature portion of an amino acid sequence; both are used in Equation 1. This graph illustrates the major limitation of MED scores: a factor greater than five between the numerator and denominator of a MED score calculation can mask an antigenic protein, creating a false negative.
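To make the "Chances" quantity concrete, the following sketch enumerates the 9-mer windows of a mature sequence. A sequence of length L contains L - 9 + 1 such windows when L is at least 9. The sequence used here is an arbitrary placeholder, the signal peptide is assumed to have been removed beforehand, and the helper names are hypothetical.

```python
def kmer_windows(mature_seq, k=9):
    """All overlapping k-mer windows of a mature amino acid sequence."""
    return [mature_seq[i:i + k] for i in range(len(mature_seq) - k + 1)]


def count_chances(mature_length, k=9):
    """Number of possible k-mer windows; zero when the sequence is shorter than k."""
    return max(mature_length - k + 1, 0)


# Arbitrary 16-residue placeholder, not a real Mtb protein.
toy_mature = "MKTAYIAKQRQISFVK"

windows = kmer_windows(toy_mature)
print(len(windows), windows[0], windows[-1])
```

Longer mature sequences therefore yield larger denominators, which is how protein length enters the limitation described in the caption.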

MED score sensitivity

Among the 30 proteins scored lowest by MED, 14 showed contrary evidence and just four showed favorable evidence to the MED score concept. Among the 30 highest-scored proteins, there was favorable evidence for 21 proteins based on the MED score and no protein with contrary evidence. For the remaining lowest- and highest-scored proteins, no favorable or contrary evidence related to MED scores was found. These results were used to create the ROC curve in Figure 5, which gives a sensitivity of 84% for MED scores at a false positive rate of 7%.
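The reported sensitivity can be related to the counts above with a short calculation. This sketch rests on an assumption about how the evidence labels map onto a confusion matrix: the 21 highest-scored proteins with favorable evidence are treated as true positives, and the four lowest-scored proteins with favorable evidence as antigens missed by a high-score cutoff (false negatives). It does not attempt to reproduce the 7% false positive rate, which depends on the thresholds swept by the ROC curve.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of antigenic proteins that receive a high score."""
    return true_positives / (true_positives + false_negatives)


tp = 21  # highest-scored proteins with favorable evidence (Table 1)
fn = 4   # lowest-scored proteins with favorable evidence (Table 2), assumed to be missed antigens

print(round(100 * sensitivity(tp, fn)))  # percentage, rounded to the nearest integer
```

Under this assumption the ratio is 21/25, which agrees with the 84% sensitivity stated above.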

**MED score ROC curve**. Receiver operating characteristic (ROC) curve of the Mature Epitope Density (MED) score, calculated for the 39 *Mtb* H37Rv exported proteins with favorable or contrary evidence to the MED concept.

Novel putative Mtb antigens

The Mtb H37Rv proteins Rv0235c, Rv0492A and Rv1004c were predicted to have some of the highest MED scores: 17.78, 20.31 and 18.58 nM/mer, respectively. The first two were predicted to be potentially exposed on the bacterial surface, and the third was predicted to be secreted. These proteins have, respectively, 78, 43 and 228 predicted epitopes against 138, 73 and 386 epitope chances. This is the first published indication of their roles in bacterial antigenicity; the MED scoring results suggest these proteins as useful putative targets for future investigations.