Abstract
In 2014, the Immune Epitope Database automated benchmark was created to compare the performance of MHC class I binding predictors. However, this is not a straightforward process due to the different and non-standardized outputs of the methods. Additionally, some methods are more restrictive regarding the HLA alleles and epitope sizes for which they predict binding affinities, while others are more comprehensive. To address how these problems impacted the ranking of the predictors, we developed an approach to assess the reliability of different metrics. We found that using percentile-ranked results improved the stability of the ranks and allowed the predictors to be reliably ranked despite not being evaluated on the same data. We also found that, given the rate at which new data are incorporated into the benchmark, a new method must wait at least 4 years to be ranked against the pre-existing methods. The best-performing tools with statistically indistinguishable scores in this benchmark were NetMHCcons, NetMHCpan4.0, ANN3.4, NetMHCpan3.0 and NetMHCpan2.8. The results of this study will be used to improve the evaluation and display of benchmark performance. We highly encourage anyone working on MHC binding predictions to participate in this benchmark to obtain an unbiased evaluation of their predictors.
Keywords: Epitope prediction, Benchmark, MHC-I, CD8+, IEDB tools
Introduction
T cell epitopes are molecules that, when bound by MHC molecules, are recognized by T cell receptors and trigger an immune response. Most T cell epitopes are peptides, which are subdivided based on the type of MHC molecule that presents them. MHC class I molecules present peptides to CD8+ T cells, while MHC class II molecules present peptides to CD4+ T cells. This work focuses on peptides bound to MHC class I molecules, which play a critical role in the detection of intracellular infections and cancer [1].
The many applications of epitope mapping have led to the development of a large number of computational methods that predict T cell epitopes from the amino acid sequence [2], mostly by predicting peptide binding to MHC molecules [3]. However, the broad selection of available prediction servers makes it burdensome for users to choose the best server and for developers to demonstrate the superiority of their newly developed methods.
To address the need for a blind test of MHC-I binding predictors, an automated benchmark was established in 2014 that uses data curated by the Immune Epitope Database (IEDB) [4, http://tools.iedb.org/auto_bench/mhci/weekly]. This ensures that the benchmarked data are ‘new’ to the participating tools and provides a realistic assessment of their performance.
Establishing an optimal benchmark involves several challenges. For instance, many methods restrict the MHC-I alleles and peptide sizes they accept by design [5–14], while others are more comprehensive [15–17], which prevents the use of the same datasets for all evaluations. This complication was acknowledged in the initial development of the benchmark but was deliberately left open for future assessment [4]. Other initially unforeseen obstacles emerged from the accumulation of data and the enrollment of new predictors.
In this paper, we address these concerns by simulating several hypothetical scenarios. We start by presenting a review of the current state of the benchmark and database, followed by an analysis of the ranking of the predictors. We experimented with different metrics, verified the minimum number of datasets a method should output results for and checked if all predictors could be fairly evaluated on a heterogeneous set of data.
Materials and Methods
Datasets
A dataset is a collection of peptides. A dataset constitutes an independent unit of evaluation in the benchmark, and each dataset consists of a unique combination of IEDB reference number, MHC-I allele, peptide length and measurement type (Figure 1, top). Each dataset contains the experimental measurements of peptide–MHC-I complexes and the predictions of each method (Figure 1, bottom). The number of entries varies per dataset, but a minimum of 10 peptides is required, with at least two binders and two non-binders. In this study, we used all 161 datasets accumulated in the benchmark since its inception.
Figure 1.
Top: Example of three datasets used in the IEDB benchmark. A dataset is a unique combination of IEDB reference number, MHC-I allele, peptide length and measurement type. Bottom: Example of the content of one dataset. (A larger version of this figure is included as supplementary material 1.)
A peptide is deemed a binder if (i) it was experimentally reported to qualitatively bind to an MHC, (ii) its half-life (T1/2) bound to the MHC is reported to be longer than 120 min, or (iii) its IC50 is reported to be lower than 500 nM. Peptides that do not meet any of these criteria are considered non-binders.
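As a concrete illustration, these criteria can be written as a small labeling routine. The sketch below uses illustrative field names (assay_type, value) that are assumptions for this example, not the IEDB schema.

```python
def is_binder(assay_type: str, value) -> bool:
    """Label a measurement as binder (True) or non-binder (False) using the
    criteria above. Field names and units are illustrative assumptions."""
    if assay_type == "binary":   # qualitative binding reported (True/False)
        return bool(value)
    if assay_type == "t_half":   # half-life bound to the MHC, in minutes
        return value > 120
    if assay_type == "ic50":     # inhibitory concentration, in nM
        return value < 500
    raise ValueError(f"unknown assay type: {assay_type}")
```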
When a method is capable of predicting the binding affinity of all the peptides in a dataset and produces valid results, we say the method ‘handled’ or ‘worked on’ that dataset.
Current metrics and original rank
In the current version of the benchmark, the area under the ROC curve (AUC) and the Spearman correlation coefficient (SRCC) are calculated separately for each dataset. To adjust for differences in how difficult each dataset is to predict, both measures are percent ranked (PR), which provides the relative standing of each method on each dataset. The percent rank is calculated according to Equation 1:

$$\mathrm{PR}(x) = \frac{c_x}{n} \times 100 \qquad (1)$$

where $x$ is the score of a method on a given dataset, $c_x$ is the number of methods with a score equal to or lower than $x$, and $n$ is the number of methods.
The overall value of any number of $k$ metrics over $D$ datasets is calculated according to Equation 2:

$$\mathrm{Ov}(m_1,\ldots,m_k) = \frac{1}{D\,k}\sum_{d=1}^{D}\sum_{j=1}^{k}\mathrm{PR}(m_{j,d}) \qquad (2)$$

where $\mathrm{Ov}$ is the overall value given a set of metrics ($m_1,\ldots,m_k$), $D$ is the total number of datasets, $k$ is the total number of metrics, $m_{j,d}$ is the value of metric $m_j$ on dataset $d$ and PR is the percent-ranked score (Equation 1).
To generate what is henceforth called the ‘original rank’, we used all the datasets each method was able to predict and assessed performance using the overall metric Ov(AUC, SRCC) (Equation 2).
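To make these definitions concrete, here is a minimal sketch of Equations 1 and 2 as reconstructed above; the array layout (one row per dataset, one column per method) and the function names are assumptions for this illustration, not the benchmark's actual code.

```python
import numpy as np

def percent_rank(scores: np.ndarray) -> np.ndarray:
    """Equation 1: PR(x) = 100 * c_x / n for each method's score x on one
    dataset, where c_x counts methods scoring <= x and n is the number of methods."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    return np.array([100.0 * np.sum(scores <= x) / n for x in scores])

def overall_value(metric_tables: list[np.ndarray]) -> np.ndarray:
    """Equation 2 (as reconstructed here): the mean of the percent-ranked
    metrics over all datasets and metrics, giving one overall value per method.
    Each table has shape (n_datasets, n_methods) and holds one raw metric,
    e.g. AUC or SRCC."""
    pr_tables = [np.apply_along_axis(percent_rank, 1, table) for table in metric_tables]
    return np.stack(pr_tables).mean(axis=(0, 1))
```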
Stability of metrics
To verify the stability of each metric, the datasets are first shuffled and divided into five partitions. Each method is evaluated on each partition P_i (i = 1, ..., 5) using one of the metrics m, generating a series of m-ranks. We use the Spearman ρ to measure the level of agreement between each m-rank and the original rank.
Five metrics were tested: AUC, SRCC, Kendall τ (KT), average precision (AP) and the precision-recall curve (PRC). We also repeated the algorithm weighting the results by the number of peptides in each dataset. Each experiment was repeated 100 times, each time changing the partition scheme.
Three major groups of metrics were tested: the raw metrics (m = AUC, SRCC, KT, AP, PRC), their weighted counterparts (m_w) and their percent-ranked counterparts (PR(m)). We made three 2-by-2 comparisons (Mann–Whitney U test) between the major groups m, m_w and PR(m). Within each group, we used one-way analysis of variance (ANOVA) followed by post hoc analysis (Tukey test). The normality of the distributions was tested with a Shapiro–Wilk test followed by a visual inspection of quantile–quantile plots. All statistical analyses were done with the following Python (v.3.10.1) libraries: statistics, SciPy v.1.8.0-1 and statsmodels v.0.13.2-1.
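The following sketch shows one run of this partitioning scheme under simplified assumptions: the metric values are assumed to be pre-computed in a (datasets × methods) array, and ranks are derived from per-partition means, which is an illustrative simplification of the overall metric.

```python
import numpy as np
from scipy.stats import spearmanr

def stability_run(metric_table: np.ndarray, rng: np.random.Generator, n_parts: int = 5) -> list[float]:
    """One repetition of the stability experiment (illustrative sketch).
    metric_table has shape (n_datasets, n_methods) and holds one metric,
    e.g. PR(AUC), per dataset and method."""
    n_datasets, _ = metric_table.shape
    # Rank methods on the full collection of datasets (the 'original rank').
    original_rank = (-metric_table.mean(axis=0)).argsort().argsort()

    correlations = []
    for part in np.array_split(rng.permutation(n_datasets), n_parts):
        # Rank methods using only the datasets of this partition.
        part_rank = (-metric_table[part].mean(axis=0)).argsort().argsort()
        rho, _ = spearmanr(original_rank, part_rank)
        correlations.append(rho)
    return correlations

# Example usage with random stand-in data: 161 datasets, 14 methods, 100 repetitions.
rng = np.random.default_rng(0)
table = rng.uniform(0.5, 1.0, size=(161, 14))
runs = [stability_run(table, rng) for _ in range(100)]
```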
Data heterogeneity
Eight datasets were commonly evaluated by all methods, so we checked whether the methods would rank differently if only these eight datasets were used. For every method, we took the predictions for the eight commonly evaluated datasets, applied the overall score Ov(AUC, SRCC) (Equation 2) and called the result the ‘lite’ version of the method. The rank output by the lite versions was compared with the original rank, and a paired t-test was applied to check for differences (SciPy v.1.8, Python v.3.10.1).
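A minimal sketch of this comparison is shown below; the scores are randomly generated stand-ins for the per-method overall values, since the real values live in the benchmark database.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, ttest_rel

rng = np.random.default_rng(0)
# Hypothetical Ov(AUC, SRCC) values, one per method: 'original' computed on all
# datasets a method handled, 'lite' computed on the eight common datasets.
original = rng.uniform(60, 90, size=14)
lite = original + rng.normal(0, 3, size=14)

rho, _ = spearmanr(original, lite)           # agreement between the two rankings
r, _ = pearsonr(original, lite)              # linear correlation of the scores
t_stat, p_value = ttest_rel(original, lite)  # paired t-test for a systematic shift
print(f"Spearman rho={rho:.2f}, Pearson r={r:.2f}, paired t-test P={p_value:.2f}")
```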
Minimum number of datasets
Each dataset counts as one unit of evaluation of the benchmark. In this experiment, we attempted to estimate the minimum number of datasets a newcomer should handle before it can be ranked against previously existing methods.
The datasets d_1 to d_161 are shuffled and the methods are ranked using one dataset (d_1), yielding a new rank (R_1). At each step, a new dataset d_n is added, yielding rank R_n (based on n datasets). The Spearman correlation is then taken between the 14 data points that represent the position of each method in R_n and the 14 data points of the original rank. This experiment was repeated 1000 times; each time the datasets were shuffled to change the order in which they are added. We established a cutoff of P-value < 0.05 for the Spearman ρ between the original rank and each R_n (n = 1, ..., 161).
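The sketch below reproduces one repetition of this procedure under the same simplifying assumptions as before (pre-computed metric values, ranks from running means); names such as metric_table are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

def incremental_p_values(metric_table: np.ndarray, rng: np.random.Generator) -> list[float]:
    """One repetition: add datasets one at a time in random order and record
    the P-value of the Spearman correlation between R_n and the original rank.
    metric_table has shape (n_datasets, n_methods)."""
    n_datasets, _ = metric_table.shape
    original_rank = (-metric_table.mean(axis=0)).argsort().argsort()

    order = rng.permutation(n_datasets)
    p_values = []
    for n in range(1, n_datasets + 1):
        rank_n = (-metric_table[order[:n]].mean(axis=0)).argsort().argsort()
        _, p = spearmanr(original_rank, rank_n)
        p_values.append(p)
    return p_values
```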
Grouped rank
The original rank of the methods was grouped in a top-down fashion using the Mann–Whitney U test for significance. Given the methods M_1, ..., M_k, where M_1 is the method that ranks highest on the original rank, the algorithm starts from M_1 and compares it with M_2. If they do not significantly differ, they are considered part of the same group and the algorithm proceeds to compare M_1 with M_3. If M_1 and M_i differ, M_i becomes the head of a new group and is compared with M_{i+1}, and so on.
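A sketch of this grouping procedure is given below; per_dataset_scores is an assumed mapping from a method name to its per-dataset overall scores, and the 0.05 significance threshold is an assumption made here for illustration.

```python
from scipy.stats import mannwhitneyu

def group_methods(per_dataset_scores: dict[str, list[float]],
                  ranked_names: list[str], alpha: float = 0.05) -> list[list[str]]:
    """Top-down grouping: walk the methods from best to worst and start a new
    group whenever a method differs significantly (Mann-Whitney U test) from
    the head of the current group."""
    groups = [[ranked_names[0]]]
    head = ranked_names[0]
    for name in ranked_names[1:]:
        _, p = mannwhitneyu(per_dataset_scores[head], per_dataset_scores[name])
        if p < alpha:                  # significant difference: new group, new head
            groups.append([name])
            head = name
        else:                          # statistically indistinguishable from the head
            groups[-1].append(name)
    return groups
```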
Results
Datasets were less frequent and more diverse than anticipated
The IEDB MHC-I automated benchmark is organized around datasets harvested from the literature or directly submitted by authors. In this section, we analyze the origins and composition of MHC binding datasets that were used in the benchmark. We looked into attributes such as annual frequency of submission, number of entries per dataset, the total number of binders and non-binders, peptide size and MHC-I allele diversity.
The current benchmark database contains 161 eligible datasets (Figure 2-B). The year 2014 displays the highest number because it includes datasets accumulated before the benchmark started. After 2015, a new batch of datasets was added every 9 weeks on average and, as such, it was not possible to provide weekly updates as originally planned [4].
Figure 2.

Panel (A) Number of datasets evaluated by each method since the inception of the benchmark. (B) Number of datasets incorporated into the benchmark through direct submission or literature curation. (C) Binders and non-binders. Upper left panel: histograms of counts of binders and non-binders. Upper right panel: stacked histogram of database size for each peptide size. Lower panel: bar showing the number of unique peptides shared among binders and non-binders. (D) Treemap depicting the totality of the MHC-I alleles by dataset occurrence. (E) Database growth for each experimental measurement (Binary, T1/2 and IC50 refer to qualitative experiments, half-life experiments and inhibitory concentration experiments, respectively) per number of datasets (left panel) and number of peptides (right panel). (F) Number of datasets per number of peptides in the dataset.
The number of peptides per dataset varied from 10 to 26 649. The majority contained tens of peptides and the number of datasets decreased with the increase in the number of peptides (Figure 2-F, supplementary material 2). Datasets obtained from literature curation were more frequent than those from direct submission, but the average number of peptides obtained by direct submission was higher (Table 1).
Table 1.
Database submissions
| | Total | Literature | Direct |
|---|---|---|---|
| Peptides* | 43 303 | 8 621 | 34 682 |
| Datasets | 161 | 116 | 45 |
| avg(P/D) | 268.96 | 74.31 | 770.71 |
*Repetitions included; avg(P/D) = average number of peptides per dataset
The benchmark is limited to peptides of size 8 to 11 and the vast majority of peptides used in the benchmark were of size 9 (65.37%). Overall, nearly 63% of the database was composed of binders, but there are more binders for 10-mers and 11-mers (Figure 2-C).
Because the binding of a given peptide is HLA-specific, the same peptide can be found on different datasets with different statuses of binder/non-binder. However, when all datasets were combined, less than 4.5% of the peptides had this shared status (Figure 2-C).
Regarding the experimental methods used, 65 datasets were described qualitatively (binary), 62 referred to inhibitory concentration experiments (IC50) and 34 to half-life experiments (T1/2). In contrast, the overwhelming majority of the peptides were in ‘binary’ datasets (Figure 2-E).
The most common MHC-I allele in the database was HLA-A*02:01, which accounted for 24.38% of the datasets, followed by HLA-B*07:02 (9.38%, Figure 2-D). Indeed, the majority (51.25%) of the datasets covered only seven MHC-I alleles (HLA-A*02:01, HLA-B*07:02, HLA-A*24:02, HLA-B*58:01, HLA-B*27:05, HLA-A*11:01, HLA-B*35:01). Combined, the MHC-I alleles that appeared only once account for 18.6% of the database.
Overall, we found that datasets included in the benchmark are highly heterogeneous in terms of size, and significantly overrepresented a subset of common MHC alleles and their preferred peptide ligand length of nine residues, reflecting study biases.
Percent-ranked metrics provide better generalization
The purpose of the IEDB MHC-I automated benchmark is to rank the multiple participating methods according to their performance. As such, the choice of metrics is of paramount importance.
It is impossible to directly compare the servers, as each adopts its own scoring system. While it is possible to circumvent this by arbitrarily setting a cutoff that converts the predictions into binary values, it is problematic to set a single value that fairly caters to all methods. Similarly, determining an individual cutoff for each method would be impractical and could lead to questionable, biased results. As such, most binary classification metrics are excluded from use. Among the metrics that do not require converting predictions into binary values, the initial version of the benchmark used AUC and SRCC; here we also considered other common metrics (KT, PRC and AP).
The ideal metric generalizes well: even when the methods are evaluated on only a few datasets, the resulting rank should mirror the original rank. Concretely, we measured how the rank produced by a metric fluctuates across subsets of the entire collection of datasets. We call this the ‘stability’ of a metric. To assess stability, we conceived an experiment that randomly partitioned the datasets into five groups and evaluated the correlation between the rank of each partition and the original rank (Figure 3).
Figure 3.
Simulation to assess the stability of metrics. (I) The database is randomly partitioned in five. (II) All datasets are evaluated using predictor p and metric m, and the methods are ranked according to their performance. (III) The datasets of each partition P_i are evaluated using predictor p and metric m, and the methods are ranked according to their performance. (IV) The Spearman correlation between the ranks output in (II) and (III) is taken. This simulation was repeated 100 times; each time, the datasets were reshuffled and randomly partitioned five ways.
The result of the stability experiment is summarized in Figure 4 and the values for all statistical analyses were included as supplementary material 3.
Figure 4.

Result of 100 runs of the stability experiment. Bar height indicates the Spearman rank correlation coefficient between the ranks output by the entire database and each of the five partitions. AUC = area under the ROC curve, SRCC = Spearman ρ, KT = Kendall τ, AP = average precision, PRC = precision-recall curve, M_w = metric weighted by number of peptides, PR(M) = percent-ranked metric, Ov = overall metric (Equation 2).
Aside from using the ‘raw’ metrics, two more variations of the stability experiment were tried. The first was adding weights. As it stands, the IEDB benchmark counts each dataset as one unit of evaluation regardless of the number of peptides. This is a source of concern, as it attributes the same significance to all datasets despite the variation in the number of peptides (Figure 2-F). The second variation was percent ranking the metrics (hereafter referred to as PR(m), Equation 1). The effects of PR(m) had never been evaluated, even though PR(m) has been applied since the inception of the benchmark.
Weighting the datasets did not bring any benefit to the stability of the ranks (Figure 4), but percent ranking the metrics made the stability significantly differ from that of both the unweighted and the weighted metrics (Mann–Whitney U test; P-values in supplementary material 3). Among the percent-ranked metrics, PR(AUC), PR(SRCC) and PR(KT) were significantly better than PR(PRC) and PR(AP) (one-way ANOVA followed by a Tukey HSD; supplementary material 3), but no significant difference was found between PR(AUC), PR(SRCC) and PR(KT).
Lastly, another variation of the stability experiment was attempted by combining the metrics. The current version of the benchmark combines PR(AUC) and PR(SRCC) into what is called the ‘overall value’ Ov(AUC, SRCC) (Equation 2). We tested all possible combinations Ov(m_1, ..., m_k) of AUC, SRCC and KT, because these showed the best results. No difference was found between the combined metrics and their individual counterparts (Figure 4, P-value = 0.1, one-way ANOVA).
In summary, the best generalization was obtained when the metrics were percent ranked, the best results were given by PR(AUC), PR(SRCC) and PR(KT), and combining metrics offered no discernible advantage. The current benchmark uses PR(AUC), PR(SRCC) and Ov(AUC, SRCC), and these findings confirm that the metrics currently in use are, albeit redundant, among those with the best power of generalization. Despite being one of the original metrics of the IEDB benchmark, it must be acknowledged that SRCC has little value for the increasingly common binary datasets (Figure 2-E). AUC, by contrast, seems a better fit for the problem: it works by defining successive cutoffs over the range of values output by a method, and therefore works regardless of the scoring system each method employs. On the other hand, AUC requires converting the experimental values to binary labels, which is solved with the cutoffs of 500 nM for IC50 and 120 min for T1/2.
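As an illustration of this conversion, the sketch below binarizes a dataset's measurements with the quoted cutoffs and scores one method's predictions with AUC. The function and field names are assumptions for this example, and scikit-learn's roc_auc_score is used merely as a convenient AUC implementation; it is not necessarily what the benchmark pipeline uses.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def dataset_auc(measurements: np.ndarray, predictions: np.ndarray, assay: str) -> float:
    """Binarize the experimental values with the cutoffs quoted above and
    compute the AUC of one method's predictions on one dataset. Prediction
    scores are assumed to be oriented so that higher means stronger binding."""
    if assay == "ic50":
        labels = measurements < 500     # binder if IC50 < 500 nM
    elif assay == "t_half":
        labels = measurements > 120     # binder if half-life > 120 min
    else:                               # 'binary': qualitative data, already 0/1
        labels = measurements.astype(bool)
    return roc_auc_score(labels, predictions)
```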
Heterogeneity of datasets does not significantly bias the rank of methods
The IEDB benchmark automatically runs all methods whenever a new dataset is assimilated. Optimally, all methods would be evaluated on the same data but, in reality, most do not work with every MHC-I allele or peptide size. Therefore, not all methods work on all datasets, leaving gaps in the evaluation table that have accumulated considerably over time and culminate in a very different number of results for each method (Figure 2-A). In addition, methods are not evaluated on datasets older than they are. Combined, these factors cause one of the greatest challenges of the IEDB benchmark: fairly evaluating the methods based on a heterogeneous set of results.
Here, we report an experiment that allowed us to estimate the difference between the original rank versus a putative rank based only on the datasets common to all methods.
Among the 161 datasets in the benchmark, eight datasets were evaluated by all methods, i.e. these were the datasets that all methods could handle. The predictors were tested using only the eight common datasets, yielding a new rank (called ‘lite rank’) that was compared with the original rank.
Both the Pearson and Spearman ρ correlations between the lite and original ranks were high (Figure 5, 0.84 and 0.86, respectively). Furthermore, a paired t-test between the two distributions showed that the null hypothesis could not be rejected (P-value = 0.16), i.e. no significant difference between the ranks was detected.
Figure 5.

Correlation between the rank output by the lite variations of the methods and the regular rank. Spearman ρ = 0.81, Pearson r = 0.84; paired t-test P-value = 0.16.
It is unrealistic to expect the datasets to be sufficiently standardized to allow a direct comparison, which renders any benchmark suboptimal in theory. Nevertheless, the similarity between the original and lite ranks suggests that, in practice, inferring the rank of the methods from heterogeneous data has no concerning impact.
Fair assessment requires 22+ datasets
A method that has just entered the benchmark cannot be fairly ranked against its peers before it has handled a minimum number of datasets. This ensures the rank is future-proof and does not fluctuate unnecessarily due to chance. Here, we attempted to estimate the minimum number of datasets a newcomer should handle before ranking.
Figure 6 shows the convergence of the P-value of the correlation between the original rank and the rank output by each R_n. The average P-value reached P < 0.05 at 22 datasets and the average plus one standard deviation reached P < 0.05 at 42 datasets (Figure 6), which suggests a new method should not be ranked against the others before it has handled approximately 22 datasets. From 22 to 42 datasets, the ranking can be considered trustworthy on average, and reliable after a method has predicted 42 datasets.
Figure 6.

Convergence of the P-value of the Spearman ρ between the rank of the methods when evaluated on all datasets and when evaluated on smaller subsets of 1 to 161 datasets (x-axis, number of datasets). The average P-value is represented by the line and the error bars depict the standard deviation. The horizontal dashed line highlights P-value = 0.05 and the vertical dashed lines highlight the number of datasets needed for the average and the worst case to reach P < 0.05 (22 and 42, respectively).
The results in this section show that, even though the benchmark contains 161 datasets, the original rank can be reliably recovered from a set of 22 to 42 datasets. Although this may initially seem a small number, only around one dataset every 9 weeks has entered the benchmark. Hence, a new method joining the benchmark should ideally wait from 192 to 378 weeks before a fair evaluation. Worse, this time frame assumes the newcomer predicts all the datasets in the meantime, which rarely happens, since most methods limit the MHC-I alleles and peptide sizes they work with.
Despite the need for 22 datasets on average to replicate the original rank, the lite rank is based on only eight datasets, so the failure to reject the null hypothesis may simply reflect a lack of statistical power. We intend to continue to monitor the lite rank, and a more definitive answer should be attained as more datasets become available.
Performance of predictors overlaps and falls into groups
The benchmark officially started on 21 March 2014 with only four predictors: ANN 3.4 [5], ARB [6], SMM [7] and NetMHCpan 2.8 [15]. Later that year (31 October 2014), NetMHCcons [8], SMMPMBEC [9], IEDB Consensus [10] and PickPocket [11] were incorporated. On 9 December 2016, ANN 3.4 was updated to ANN 4.0 [12], NetMHCpan 2.8 was updated to NetMHCpan 3.0 [16] and mhcflurry [13] joined. On 11 November 2018, NetMHCpan was updated to NetMHCpan 4.0 [17]. Lastly, DeepSeqPan [14] was added on 30 August 2019. Figure 7 shows the original rank of the methods.
Figure 7.

Original rank of the participating methods using Ov(AUC, SRCC) (Equation 2). (*) Methods that differ significantly according to a Mann–Whitney U test in a top-down two-by-two comparison.
As expected, the results varied depending on the dataset, as reflected by the error bars in Figure 7. Therefore, the positions of the methods should not be taken as absolute, since they are based only on the arithmetic means over the datasets and do not account for this variation. For that reason, we grouped the methods by testing for significant differences using a top-down approach, which resulted in five groups.
It should be acknowledged that DeepSeqPan and mhcflurry 1.2.0 are the newest methods to join the IEDB benchmark and have been evaluated using only 19 and 30 datasets, respectively (Figure 2-A), which is below the threshold of 42 datasets required for a method to be reliably assessed (Figure 6). Therefore, they must be further tested before a definitive conclusion can be drawn. It is also noteworthy how poorly the IEDB Consensus rated, considering it was tagged the ‘IEDB-recommended’ method on the IEDB website until 2020. The reason for such poor results is that the IEDB Consensus relies on the predictions of ANN, SMM and Comblib. The two former methods ranked higher than the IEDB Consensus, but their performance seems to have been negatively impacted by the predictions of Comblib.
Discussion
The quest for the best MHC-I binding prediction server has led to the proposal of several very informative benchmarks [20–24]. However, the quality of the predictors varies depending on the HLA allele, which often causes these benchmarks to result in ‘ties’, preventing them from giving a definitive rank [23, 25]. Here, we showed that the ranking of the predictors remains virtually the same whether or not the benchmark simulation was based on the same data. Considering that data are already scarce, it is a better, more informative strategy to use all the datasets a method can handle.
To improve their methods, developers are constantly updating them by integrating new data into their training sets. As a consequence, benchmarks must accept one of two drawbacks: they can be based only on new, albeit scarce, data [22, 23], or favor more abundant data at the risk of using peptides that may have already been assimilated into a method’s training set [20, 24]. The IEDB benchmark addresses this conflict by automatically testing the methods as soon as new data are released, thereby avoiding tests with old peptides while the results accumulate over time. The drawback, in this case, is that it takes time to properly appraise newly added methods. From the current rate of dataset submission (Figure 2-B) and the number of datasets needed for a method to be fairly evaluated against the others, we estimate that at least 4 years are needed for a confident evaluation on average.
As noted elsewhere, the IEDB benchmark only reports a rank of the servers, sacrificing details such as absolute binding affinities [21]. However, in-depth information requires a level of standardization that is unreasonable to expect, as different servers output results on different scales and not all experimental data are the result of binding experiments. As observed in the stability experiment, using as few metrics as possible suffices (Figure 4) and, arguably, allows a more intuitive interpretation of the results.
It has also been stated that the IEDB benchmark has yet to report allele-specific results [21]. While this is a desirable (and trivial) feature to implement, it can unfortunately be misleading given the currently available data. We reported that a method should predict a minimum of 22 to 42 datasets before its results can be taken credibly, yet 66.24% of the MHC-I alleles are covered by six datasets or fewer (Figure 2-D). Thus, reporting HLA-specific results would be invalid for the overwhelming majority of cases.
Recently, datasets have been generated with the aid of the tools available on the IEDB, where putative peptides are first predicted and then experimentally validated, which ultimately imposes an experimental bias on the datasets. One possible solution to this is the generation of dedicated datasets according to the original IEDB benchmark [4]. The incorporation of datasets with data from eluted ligand experiments is currently under development and should be used in the future rounds of the IEDB automated benchmark.
An open benchmark can only benefit the field of peptide binding prediction. The IEDB benchmark is unique in its continuous supply of weekly, online reports of the performance of the enrolled servers. However, accomplishing this requires developers to submit their methods, and participation has so far been somewhat lower than expected. We highly encourage developers to enter the MHC-I automated benchmark and provide support for those who wish to join.
Conclusion
In this work, we presented the data collected since the IEDB MHC class I benchmark was established. We reported the annual frequency of submission, the number of entries per dataset, the numbers of binders and non-binders, the number of peptides per size and the MHC-I allele diversity.
To find the most appropriate metric for the benchmark, we tested the power of generalization of each metric. Percent ranking the metrics was found to significantly improve stability, whereas weighting the datasets brought no advantage.
To appraise whether testing the methods using a heterogeneous set of data significantly impacts the scoring system, we compared the original rank with an alternative rank that uses only the datasets common to all methods. No significant difference was found.
Finally, it was found that a method ought to predict from 22 to 42 datasets to obtain conclusive benchmark results, which takes at least 4 years to gather given the current rate of submissions.
It is impossible to obtain an absolute rank because the quality of the methods varies across datasets. It is, however, possible to group the methods and obtain an approximate rank that serves the practical purpose of giving their relative standing.
The findings reported here will be used to further strengthen the MHC class I IEDB automated benchmark so users can make informed decisions about the choice of a predictor. Additionally, having such a solid framework helps developers by providing accurate information on the state-of-the-art of MHC I binding prediction for comparison.
Key Points
Dataset submissions were less frequent than anticipated.
No significant difference is found in the ranks of the predictors even if they are tested using different datasets.
Using percentile-ranked results of the original metrics yields the most generalizable ranks.
A minimum of 22 datasets is needed to reliably rank a new method.
The overall performance of the predictors overlaps and falls into groups.
Author contributions statement
R.T.: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing (original draft), Writing (review and editing). Z.Y.: Data curation, Methodology. J.A.G.: Methodology, Writing (review and editing). A.S.: Funding acquisition, Writing (review and editing). M.N.: Methodology, Validation, Writing (review and editing). B.P.: Conceptualization, Funding acquisition, Supervision, Writing (original draft), Writing (review and editing).
Supplementary Material
Acknowledgments
This work was supported by the National Institutes of Health [contract number 75N93019C00001].
Raphael Trevizani is a research scientist at Fiocruz and a consultant for the La Jolla Institute for Immunology working on the development of tools for immunoinformatics.
Zhen Yan is a Bioinformatics Application Developer of Bioinformatics Core at La Jolla Institute for Immunology. He is involved in the development and implementation of Bioinformatics tools related to immunology.
Jason Greenbaum is Director of the Bioinformatics Core at La Jolla Institute for Immunology and is involved in the development and implementation of computational tools and pipelines related to immunology.
Alessandro Sette is a professor at the La Jolla Institute of Immunology and he has devoted more than 35 years to the study of MHC molecules and their role in eliciting T cell responses.
Morten Nielsen is a professor at the Technical University of Denmark, Department of Health Technology, working on the development of immunoinformatics methods for the prediction of immune-receptor interactions.
Bjoern Peters is a professor at the La Jolla Institute of Immunology, working on computational and experimental approaches to study adaptive immune responses.
Contributor Information
Raphael Trevizani, Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, California 92037, USA; Fiocruz Ceará, Fundação Oswaldo Cruz, Rua São José s/n, Precabura, Eusébio/CE, Brazil.
Zhen Yan, Bioinformatics Core, La Jolla Institute for Immunology, La Jolla, California 92037, USA.
Jason A Greenbaum, Bioinformatics Core, La Jolla Institute for Immunology, La Jolla, California 92037, USA.
Alessandro Sette, Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, California 92037, USA; Department of Medicine, University of California San Diego, La Jolla, California 92093, USA.
Morten Nielsen, Department of Health Technology, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark; Instituto de Investigaciones Biotecnológicas, Universidad Nacional de San Martín, B1650 Buenos Aires, Argentina.
Bjoern Peters, Division of Vaccine Discovery, La Jolla Institute for Immunology, La Jolla, California 92037, USA; Department of Medicine, University of California San Diego, La Jolla, California 92093, USA.
References
- 1. Vaughan K, Xu X, Caron E, et al. Deciphering the MHC-associated peptidome: a review of naturally processed ligand data. Expert Rev Proteomics 2017;14(9):729–36.
- 2. Peters B, Nielsen M, Sette A. T cell epitope predictions. Annu Rev Immunol 2020;38(1):123–45.
- 3. Yewdell JW, Bennink JR. Immunodominance in major histocompatibility complex class I-restricted T lymphocyte responses. Annu Rev Immunol 1999;17(1):51–88.
- 4. Trolle T, Metushi IG, Greenbaum JA, et al. Automated benchmarking of peptide-MHC class I binding predictions. Bioinformatics 2015;31(13):2174–81.
- 5. Lundegaard C, Lund O, Nielsen M. Accurate approximation method for prediction of class I MHC affinities for peptides of lengths 8, 10, and 11 using prediction tools trained on 9mers. Bioinformatics 2008;24(11):1397–8.
- 6. Bui HH, Sidney J, Peters B, et al. Automated generation and evaluation of specific MHC binding predictive tools: ARB matrix applications. Immunogenetics 2005;57(5):304–14.
- 7. Peters B, Sette A. Generating quantitative models describing the sequence specificity of biological processes with the stabilized matrix method. BMC Bioinformatics 2005;6(1):1–9.
- 8. Karosiene E, Lundegaard C, Lund O, et al. NetMHCcons: a consensus method for the major histocompatibility complex class I predictions. Immunogenetics 2012;64(3):177–86.
- 9. Kim Y, Sidney J, Pinilla C, et al. Derivation of an amino acid similarity matrix for peptide:MHC binding and its application as a Bayesian prior. BMC Bioinformatics 2009;10(1):1–11.
- 10. Moutaftsi M, Peters B, Pasquetto V, et al. A consensus epitope prediction approach identifies the breadth of murine T CD8+-cell responses to vaccinia virus. Nat Biotechnol 2006;24(7):817–9.
- 11. Zhang H, Lund O, Nielsen M. The PickPocket method for predicting binding specificities for receptors based on receptor pocket similarities: application to MHC-peptide binding. Bioinformatics 2009;25(10):1293–9.
- 12. Andreatta M, Nielsen M. Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Bioinformatics 2016;32(4):511–7.
- 13. O’Donnell TJ, Rubinsteyn A, Laserson U. MHCflurry 2.0: improved pan-allele prediction of MHC class I-presented peptides by incorporating antigen processing. Cell Syst 2020;11(1):42–48.e7.
- 14. Liu Z, Cui Y, Xiong Z, et al. DeepSeqPan, a novel deep convolutional neural network model for pan-specific class I HLA-peptide binding affinity prediction. Sci Rep 2019;9(1).
- 15. Hoof I, Peters B, Sidney J, et al. NetMHCpan, a method for MHC class I binding prediction beyond humans. Immunogenetics 2009;61(1):1–13.
- 16. Nielsen M, Andreatta M. NetMHCpan-3.0; improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets. Genome Med 2016;8(1):1–9.
- 17. Jurtz V, Paul S, Andreatta M, et al. NetMHCpan-4.0: improved peptide-MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data. J Immunol 2017;199(9):3360–8.
- 18. Rammensee HG, Bachmann J, Emmerich NPN, et al. SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics 1999;50(3–4):213–9.
- 19. Nielsen M, Lundegaard C, Lund O. Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method. BMC Bioinformatics 2007;8(1).
- 20. Jiang L, Yu H, Li J, et al. Predicting MHC class I binder: existing approaches and a novel recurrent neural network solution. Brief Bioinform 2021;22(6).
- 21. Zhao W, Sher X. Systematically benchmarking peptide-MHC binding predictors: from synthetic to naturally processed epitopes. PLoS Comput Biol 2018;14(11):e1006457.
- 22. Lin HH, Ray S, Tongchusak S, et al. Evaluation of MHC class I peptide binding prediction servers: applications for vaccine research. BMC Immunol 2008;9(1).
- 23. Bonsack M, Hoppe S, Winter J, et al. Performance evaluation of MHC class-I binding prediction tools based on an experimentally validated MHC-peptide binding dataset. Cancer Immunol Res 2019;7(5):719–36.
- 24. Mei S, Li F, Leier A, et al. A comprehensive review and performance evaluation of bioinformatics tools for HLA class I peptide-binding prediction. Brief Bioinform 2020;21(4):1119–35.
- 25. Gfeller D, Bassani-Sternberg M, Schmidt J, et al. Current tools for predicting cancer-specific T cell immunity. OncoImmunology 2016;5(7):e1177691.
- 26. Jojic N, Reyes-Gomez M, Heckerman D, et al. Learning MHC I-peptide binding. Bioinformatics 2006;22(14):e227–35.
- 27. Jacob L, Vert JP. Efficient peptide-MHC-I binding prediction for alleles with few known binders. Bioinformatics 2007;24(3):358–66.
- 28. Nielsen M, Lundegaard C, Blicher T, et al. NetMHCpan, a method for quantitative predictions of peptide binding to any HLA-A and -B locus protein of known sequence. PLoS One 2007;2(8):e796.
- 29. Binkowski TA, Marino SR, Joachimiak A. Predicting HLA class I non-permissive amino acid residues substitutions. PLoS One 2012;7(8):e41710.
- 30. Zhang GL, Khan AM, Srinivasan NK, et al. MULTIPRED: a computational system for prediction of promiscuous HLA binding peptides. Nucleic Acids Res 2005;33(Web Server):W172–9.
- 31. Sette A, Sidney J. Nine major HLA class I supertypes account for the vast preponderance of HLA-A and -B polymorphism. Immunogenetics 1999;50(3–4):201–12.