Abstract
Background
Cytotoxic T cells are key effectors of the immune response against pathogens and tumors. Thus, identifying the immunogenic epitopes that drive T-cell activation constitutes a fundamental goal for antigen-based immunotherapies. T-cell antigen discovery is challenged by immense epitope landscapes that are infeasible to screen exhaustively given the high cost and low throughput of experimental immunogenicity validations. To date, immunoinformatic models with orders-of-magnitude higher throughput, such as HLA-I binding affinity tools, have been used to predict the antigenic potential of T-cell epitopes. However, the resulting immunogenicity screening success rates (ISSR), i.e., the capacity to rank truly immunogenic epitopes among the top-scored candidates prioritized for experimental validation, have remained incremental, and the immunological explainability underlying model predictions limited.
Results
PredIG is an interpretable predictor of T-cell epitope immunogenicity trained on 17,448 peptide-HLA-I allele pairs (pHLAs) with immunogenicity reported in T-cell reactivity and binding assays. For each pHLA, PredIG integrates an in silico feature space of antigenic properties (proteasomal cleavage, TAP translocation, HLA-I binding affinity, and presentation) and physicochemical epitope descriptors, with particular focus on the TCR-facing central residues. Leveraging this information, we built three antigen-specific XGBoost models to compute PredIG immunogenicity scores (PredIG-NeoAntigen, PredIG-NonCanonical, and PredIG-Pathogen). We then used Shapley Additive exPlanations (SHAP) to analyze their immunological interpretability, pinpointing a balanced feature importance between antigenic and physicochemical properties. This highlighted the strong contribution of antigen processing likelihood and physicochemical characteristics, often overlooked in T-cell epitope predictions. In benchmarks against immunogenicity, HLA-I binding, and pHLA stability predictors, PredIG obtained cutting-edge ISSR performance on our pathogen and non-canonical cancer antigen held-outs. In cancer neoantigens, we used PredIG to refine the success rates of HLA-I binding affinity predictions and to prioritize an additional set of immunogenic (neo)epitopes differing from top-binding candidates across the three antigen types tested.
Conclusions
Overall, we demonstrate how PredIG immunogenicity scores are instrumental to refine and expand the prioritization of actionable T-cell (neo)epitopes in infection and cancer, including non-canonical antigens not seen during training. Furthermore, PredIG displays unprecedented immunological interpretability, identifying important immunogenicity drivers beyond HLA-I binding affinity. Ultimately, PredIG enables high-throughput antigen discovery in open-source containerized environments (https://github.com/BSC-CNS-EAPM/PredIG) and facilitates accessibility via a streamlined webserver (https://horus.bsc.es/predig).
Supplementary Information
The online version contains supplementary material available at 10.1186/s13073-025-01569-8.
Background
Cytotoxic T cells are key effectors of the immune response against pathogens and cancer thanks to the T cell receptor (TCR) recognition of antigen-derived epitopes on HLA molecules and the subsequent killing of diseased presenting cells. These capabilities have led T cells to constitute the main drivers of modern immunotherapies [1–4], such as immune checkpoint inhibitors, adoptive TIL transfer, or neoantigen vaccines. Therefore, controlling their activation is a coveted therapeutic goal. Determining which peptide-HLA complexes (pHLAs) are immunogenic—capable of triggering an immune response—is particularly key to guide the design of the immunogens embedded in T-cell-directed vaccines. This type of immunotherapy, applicable both to cancer [5–9] and infection [10], uses poly(neo)epitope immunogens that contain concatenated peptides derived from multiple proteins. Using this integration, T-cell-directed vaccines limit the likelihood of immune escape via antigenic drift [11] or mutation loss [12, 13]. What is more, for pathogen vaccination, targeting T cell responses can avoid toxicities due to pervasive antibody enhancement effects observed in conformational immunogens [14–16].
However, both tumoral and infected cells present vast epitope landscapes that increase the throughput required in immunological screenings and, thus, challenge the identification of immunogenic candidates for vaccine design. The validation of T-cell epitope immunogenicity, usually performed via T-cell binding or T-cell reactivity assays, requires access to scarce patient material, such as PBMCs, tumor biopsies, or infected individuals; subsequent T-cell enrichments; and long co-culture experiments or complex stainings [17–20]. This process makes it unfeasible to validate the total number of potential T-cell epitopes experimentally, both for tumors and pathogens. Hence, the high-throughput and inexpensive implementation of immunoinformatic predictors has become a critical component for the prioritization of T-cell epitopes prior to subsequent T-cell assays.
Notably, the immunoinformatic strategy to identify optimal T-cell epitopes starts from different antigen sources in cancer and infection. To design T-cell-directed cancer vaccines, the effort is focused on predicting the most immunogenic neoantigens, which derive from tumor-specific mutations that create novel protein sequences recognizable as foreign by the immune system [21–23]. Neoantigens are naturally or immunotherapeutically exploited to guide tumor cell killing, and their generation depends on the tumor mutational burden (TMB), which often accounts for a large number of variants. For neoantigen screening purposes, this number is further multiplied by generating all the peptides, or neoepitopes, carrying each variant [21–23]. However, among the vast tumor neoantigen landscape to screen, only a few neoepitopes typically prove immunogenic (1 to 5%) [24–27]. Furthermore, to ensure the effectiveness of immunotherapeutic anti-tumor responses, it is optimal to target multiple neoantigens rather than a single candidate; thus, multiple bona fide neoepitopes have to be identified and validated. The risks of targeting a single neoantigen are mutation subclonality [28], the incapacity of single neoepitopes to trigger effective T-cell responses [29], or facilitating the immune escape of the tumor due to mutation loss caused by the immunoediting pressure of the treatment [12].
In the context of T-cell-directed vaccine design to fight infection, pathogens generate antigenic proteins that are entirely foreign to the human immune system; thus, all the derived epitopes have plausible immunogenicity to assess. In addition, the high mutational rates in pathogens, prompted by immune pressure and subsequent antigenic drift, further increase the number of potential peptides to validate over time [11]. To alleviate this, T-cell polyepitopes aim to cover the largest possible representation of the pathogen’s proteome, requiring the validation of long lists of epitopes. Additionally, a differential requirement for pathogen-directed T-cell vaccines, which are not personalized for each patient but designed for general populations, is that polyepitopes have to confer wide HLA coverage by including multiple epitopes that bind a set of HLA alleles representative of the genetic diversity of the targeted populations [30–32].
Regardless of the differing starting protein sources and levels of foreignness to the immune system, state-of-the-art bioinformatic strategies for T-cell epitope prioritization in cancer and infection use a similar toolkit. This includes predictors of HLA binding affinity [33–38], pHLA stability [39, 40], T-cell epitope immunogenicity [38, 41–43], and a plethora of proteogenomic pipelines [44–46]. Traditionally, HLA-I binding predictors have been the most popular methodologies including widely used tools such as MHCflurry [33, 34] and NetMHCpan [36]. The link between immunogenicity and HLA affinity works under the hypothesis that strong binding epitopes will reach and stay on the cell surface more consistently, thus increasing the likelihood for T cell detection [24, 25]. Still, due to the complexity of TCR recognition and T cell activation, the field of immunogenicity prediction struggles with low experimental success rates ranging between 1 and 5% [24–27].
Recently, the increasing availability of experimental T-cell binding and T-cell reactivity data for thousands of validated pHLA complexes has opened the possibility of creating models that predict T-cell immunogenicity directly [18, 47–51]. While HLA binding affinity experiments are ultimately a proxy for immunogenicity likelihood, T-cell assays measure T-cell response/activation directly; thus, this data type should inform immunogenicity predictions more faithfully. However, existing immunogenicity models have been hindered by low epitope diversity in T-cell assay databases, as the data is biased towards strong HLA binders due to the high cost and low throughput of validation techniques [49, 52]. Another limitation for machine learning training comes from the difficulty of generating reliable negative data, since labeling a pHLA pair as “non-immunogenic” is constrained by the diversity of the TCR repertoire and the T-cell phenotypes present in the experimental sample [24, 43]. Thus, herein, we propose to refer to negative cases as immunogenically “Silent.”
The community is already committed to alleviating the shortage in epitope diversity, the bias towards HLA binding strength, and the inconsistency in negative epitope annotations by scaling up and standardizing data generation, as reflected by the Cancer Grand Challenge awarded to deciphering T-cell immunogenicity [53]. Altogether, this creates emerging momentum for epitope prioritization methods based on T-cell assays, which hold the promise to enhance the immunogenicity success rate of current state-of-the-art predictions (the capacity to rank truly immunogenic epitopes among the top-scored candidates to screen) and to shed light on novel immunogenicity determinants.
With this aim, we present PredIG, a novel predictor of T-cell epitope immunogenicity. Our tool is developed on 17,448 pHLAs, immunogenically validated in T-cell assays, for which we have built a tailored feature space that predicts their antigenic properties and describes the physicochemical properties of the epitope. These are then exploited via an explainable machine learning framework to enhance the interpretability of our PredIG immunogenicity score. Furthermore, to consistently validate PredIG, we thoroughly benchmark our predictive performance against a set of state-of-the-art tools using three antigen-specific held-out datasets: (1) a cancer neoantigen set; (2) a non-canonical cancer antigen set; and (3) a pathogen set derived from SARS-CoV-2. Complementarily, we describe the capacity of PredIG scores to prioritize immunogenic pHLAs alternative to best-binding epitopes and to refine the success rate of HLA-I binding affinity predictions when used as a quality filter. Finally, to streamline the implementation of PredIG while boosting its usability and reproducibility, we deploy our software in a user-friendly webserver and in containerized environments (Docker and Singularity) for large-scale studies.
Implementation
Datasets
Antigen types, experimental sources, and immunogenicity label harmonization
To obtain a sizeable dataset for machine-learning modeling, we retrieved 17,448 pairs of epitope and HLA-I allele (pHLAs) with validated immunogenicity from public databases [48, 50, 54] and from recent literature [4, 41, 49, 55] in September 2023. As for disease of origin, our dataset contains epitopes derived from pathogens and from cancer, including neoantigens, tumor-associated antigens, and non-canonical cancer antigens. Experimental validations comprise T-cell reactivity assays, such as IFN-γ ELISpot, 41BB staining, or TNFα release, and T-cell binding assays, such as MHC-tetramers and multimers. The experimental results are considered ground truth and used as the immunogenicity label. Pairs of epitope and HLA allele (pHLAs) validated as “positive,” “positive-low,” “positive-intermediate,” and “positive-high” are harmonized as “positive” to binarize the label. pHLAs labeled as “negative” maintain the same label. Among pHLA cases with more than a single validation experiment, those with at least one responsive subject were considered positive, and the remaining negative instances were discarded to avoid discordant immunogenicity labels for the same pHLAs. This criterion is based on the fact that each T-cell assay replicate uses a unique T-cell sample with polyclonal TCRs and diverse T-cell states. Thus, one finding of immunogenicity validates the ground truth of a pHLA instance, whereas a negative finding does not invalidate it, because the latter can reflect the absence of the responsive TCR or the presence of non-effector T-cell states in the sample.
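As a minimal illustration of the harmonization rule above (a Python sketch, not the actual curation code; the function and label names are ours):

```python
def harmonize_label(raw_label):
    """Collapse graded positive annotations into a binary immunogenicity label."""
    positives = {"positive", "positive-low", "positive-intermediate", "positive-high"}
    return "positive" if raw_label.lower() in positives else "negative"

def resolve_replicates(labels):
    """A pHLA with at least one responsive subject is positive; replicates
    that only tested negative do not invalidate a positive finding."""
    binary = [harmonize_label(label) for label in labels]
    return "positive" if "positive" in binary else "negative"
```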
Data curation
All datasets were curated with a series of general quality filters and a subset of dataset-specific filters. The general filters include HLA nomenclature standardization to discard alleles/cases with missing or insufficient information (explained below); an epitope length between 8 and 15 amino acids; and at least one patient tested for all cases and one patient responding for positive cases. Duplicates between datasets were removed based on pHLA identity (i.e., the concatenation of epitope sequence and standardized HLA allele), maintaining the positive instances in cases of discordant annotation.
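The general filters and pHLA-identity de-duplication can be sketched as follows (illustrative Python with a hypothetical record schema; the actual pipeline is implemented differently):

```python
def curate(records):
    """Apply the general quality filters and de-duplicate on pHLA identity.

    records: dicts with 'epitope', 'hla', and 'label' keys (illustrative schema).
    On discordant duplicate annotations, the positive instance is kept.
    """
    kept = {}
    for r in records:
        if not 8 <= len(r["epitope"]) <= 15:
            continue  # epitope length filter (8-15 amino acids)
        key = r["epitope"] + "_" + r["hla"]  # concatenated pHLA identity
        if key not in kept or r["label"] == "positive":
            kept[key] = r
    return list(kept.values())
```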
We adapted HLA allele ontologies [56] to focus on experiments performed on mono-allelic cell lines or on epitopes predicted as binders for a single HLA-I allele of the antigen-presenting cells used in the assay. IEDB [50] data was curated as follows: host organism was restricted to “Homo sapiens” to only include epitopes tested against human T cells; MHCType was restricted to “MHC-I”; the dataset was split by Epitope.Relationship into “neo-epitope” for cancer datasets and “pathogen” for pathogenic datasets, and MHC allele restriction was refined using HLA allele ontologies [56]; Allele.Evidence.Code was used to discard “not determined” instances. Data from the LANL [57] and HCV [58] databases was retrieved using Repitope [59] and curated as in IEDB. TANTIGEN v1.0 [54] and v2.0 [48] data only contains positive cases; the number of subjects tested and the experimental source are annotated in the original publication. In PRIME v1.0 [41], we discarded data coming from the studies “Calis,” “Dengue,” and “Random” due to HLA resolution limitations; the number of subjects tested and responding and the experimental source depend on the specific study [41]. The non-canonical tumoral antigen dataset was retrieved from Gros et al. [55] and curated by the “nonC-TL” antigen category to include alternative reading frames (OffFrame), intronic, intergenic, and non-coding regions (non-CDS), and 5′ and 3′ untranslated regions (UTR5, UTR3). This dataset was validated by IFN-γ ELISpots and 41BB stainings. SARS-CoV-2 data retrieved from Schumacher et al. [4] was curated using general filters exclusively and validated by pHLA multimer assays. TESLA data for cancer neoantigens was curated using general filters and validated using pHLA multimers [49]. Tran et al. data, including cancer neoantigens from gastrointestinal tumors, was curated using general filters and validated by IFN-γ ELISpots and 41BB staining [60].
HLA nomenclature standardization
We only retained pHLA data points with 4 digits of HLA resolution, necessary to perform reliable binding predictions (Fig. 2B), and their nomenclature was standardized to the HLA allele nomenclature established by the WHO Nomenclature Committee for Factors of the HLA System [61], which is as follows: HLA-(Gene)*(Allele group):(specific HLA protein or allotype); for instance, HLA-A*02:01. Data points with insufficient allelic annotation were discarded (e.g., HLA-A2). Data points with deeper resolution were restricted to non-synonymous modifications specified by 4 digits (e.g., HLA-A*01:01:03 > HLA-A*01:01). All annotations that did not follow the standardized symbols were harmonized into 4 digits (e.g., HLA-A0101 > HLA-A*01:01).
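The standardization logic can be sketched with a regular expression (an illustrative Python re-implementation of the rules above, not the pipeline’s own script):

```python
import re

def standardize_hla(raw):
    """Normalize an HLA-I annotation to 4-digit nomenclature (e.g., HLA-A*02:01).

    Returns None for annotations with insufficient resolution (e.g., HLA-A2);
    deeper resolutions (e.g., HLA-A*01:01:03) are truncated to 4 digits.
    """
    m = re.match(r"^HLA-([A-Z]+)\*?(\d{2}):?(\d{2})(?::\d{2,3})*$", raw)
    if m is None:
        return None
    gene, group, protein = m.groups()
    return f"HLA-{gene}*{group}:{protein}"
```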
Fig. 2.
Curation of T-cell assay data to build PredIG’s training sets accounting for the inherent class imbalance derived from the low success rate of immunogenicity screenings. A A dataset of 17,448 pairs of epitope and HLA-I allele (pHLAs) validated in T-cell immunogenicity assays, including T-cell reactivity and T-cell binding techniques, was retrieved from IEDB [50], TANTIGEN v1.0 [54] and v2.0 [48], PRIME v1.0 [41], LANL [57], HCV [58], and recent literature [4, 49, 55]. B All pHLAs were curated using HLA allele ontologies [56], restricting the annotations to cases with 4 digits of HLA allelic resolution (e.g., HLA-B*07:02). C The experiments were enriched for single HLA-allele binding preferences, either through studies on mono-allelic cell lines or via HLA binding prediction to a single allele among the HLA molecules of the antigen-presenting cells in the experimental sample. D Immunogenicity class imbalance refers to the majority of negative or non-immunogenic cases over a minority of positive or immunogenic instances present in T-cell assay datasets [24, 25, 83]. While this imbalance is widely known in the field, it is hard to quantify accurately and becomes relevant for machine learning predictions of T-cell epitope immunogenicity, an example of a dichotomous response variable. XGBoost models are well suited to deal with highly imbalanced datasets by adjusting the Scale Pos Weight (SPW) hyperparameter. This hyperparameter increases the attention or weight placed on positive instances in the loss function that XGBoost models optimize to maximize prediction performance. PredIG model evolution revolves around a gradient of SPW values mimicking immunogenicity class imbalance settings (SPW-POS). These are rescaled for the lower class imbalance in our training set (SPW-RS). The results and implications of the PredIG-SPW optimizations are discussed across the paper.
The immunogenicity ground truth of pHLAs is determined in T-cell reactivity (IFN-γ ELISpots) or T-cell binding (MHC-tetramer stainings) assays, leading to “Immunogenic” and “Non-Immunogenic” qualitative labels. However, we also refer to non-immunogenic pHLAs as “Silent,” since a single experiment does not interrogate all the TCRs of an individual, let alone the entire human TCR repertoire. Thus, the lack of signal might be caused by the absence of reactive TCRs in the sample, making the pHLA pair silent but not precluding its recognition by other TCRs.
Data splits
The training set includes 13,073 pHLA pairs (6114 tumoral and 6959 of pathogen origin, of which 5006 were validated as immunogenic and 8067 as non-immunogenic) curated from the IEDB [50], LANL [57], HCV [58, 62], TANTIGEN v1.0 [54], and PRIME v1.0 [41]. The training set was split for internal model optimization using a fivefold cross-validation coupled to a grid-search exploration of XGBoost hyperparameters. Independently, three held-out sets were retrieved from recent publications [4, 55], the TESLA consortium [49], and TANTIGEN v2.0 [48], from which we removed any pHLA pair found in the training set to avoid data leakage. These are antigen-type-specific, including a cancer neoantigen dataset (held-out 1, n = 3564), a non-canonical cancer antigen dataset (held-out 2, n = 560), and a pathogen dataset containing SARS-CoV-2 T-cell epitopes (held-out 3, n = 243).
Feature space
Antigenic and physicochemical features
PredIG’s feature space is formed by 14 descriptors computed or predicted from the epitope and HLA-I allele sequences, grouped into 7 antigenic features and 7 physicochemical features. The former include proteasomal processing [63, 64], TAP transport [65, 66], antigen processing, antigen presentation, and HLA-I binding via different metrics [34, 67]. The latter include hydrophobicity, molecular weight, net charge, and peptide instability [68]. Once the features are computed, the identity of the pHLA instance is not fed to the XGBoost model [69].
Proteasomal cleavage
To assess proteasomal processing capabilities, we used NetCleave v2.0 [64] to predict the C-terminal cleavage propensity of epitopes. NetCleave is a deep-learning algorithm trained on MS-immunopeptidomics data. It uses 48 physicochemical descriptors of a customized cleavage site spanning the last 4 amino acids of the epitope and the 3 subsequent residues towards the C-terminus of the parental protein. NetCleave provides a probability score of the cleavage likelihood. It was run on a CSV input with Uniprot IDs using --pred_input 2, --mhc_class I, and --mhc_allele HLA [64]. The NetCleave implementation is available at https://github.com/BSC-CNS-EAPM/NetCleave/.
TAP transport efficiency
The transport of peptides from the cytoplasm into the ER lumen is driven by TAP, a peptide translocator located at the ER membrane. To determine TAP transport efficiency, we used NetCTLpan 1.1 [65, 66]. This tool is trained on HLA ligands eluted from the cell surface and sequenced by MS. Its score is built on a probability-weighted matrix based on the amino acid frequencies of the C-terminal residue and the three N-terminal residues of the transported peptides. The TAP module [65] within NetCTLpan [66] was run using tapmat_pred_fsa, tap.logodds.mat, -a 1, and specific peptide lengths.
Antigen processing predictor
MHCflurry v2.0 [34] includes an antigen processing predictor trained on HLA-eluted peptides that have undergone all the processing and transport steps. It models HLA-allele-independent processing effects, such as proteasomal cleavage, TAP transport efficiency, and ERAAP trimming, and determines a probabilistic antigen processing score. The processing module was run with default parameters and no parental protein context (--no-flanking).
HLA binding affinity
To calculate HLA binding affinity, we used two pan-allele tools: NOAH v1.0, a structure-based HLA-I binding predictor trained on position-specific epitope environments [67], and MHCflurry v2.0, a sequence-based neural network trained on in vitro binding affinity assays and MS-immunopeptidomics data [33, 34].
NOAH v1.0 is a structure-based HLA-I binding predictor based on a position-specific weight matrix (PSWM) architecture [67]. It is trained on pHLA crystal structures retrieved from the Protein Data Bank [70], from which it extracts a factorization of each position in the peptide sequence and its amino acid contacts in the local environment of the HLA-I binding groove [67]. It then uses the environment of each position to build a PSWM and derive the likelihood that the peptide properly fits and binds to an HLA-I allele. Peptide position factorization allows the method to perform de novo predictions for unseen alleles by transferring the learning of each position’s environment to new alleles. NOAH is run using default parameters and is available at https://github.com/BSC-CNS-EAPM/Neoantigens-NOAH.
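The core PSWM scoring step can be illustrated as follows (a Python sketch with a toy log-odds matrix; NOAH’s trained, environment-derived matrices are not reproduced here):

```python
def pswm_score(peptide, pswm):
    """Score a peptide against a position-specific weight matrix.

    pswm: one dict per peptide position mapping amino acid -> log-odds weight.
    The score is the sum of per-position weights, as in PSWM-based methods.
    """
    return sum(pswm[i][aa] for i, aa in enumerate(peptide))
```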
The MHCflurry v2.0 tool supports the prediction of HLA binding affinity (nM) and percentile rank for any HLA allele of known sequence [34]. The settings selected to run MHCflurry were as follows: input.csv; --out out.csv; --no-throw; --output-delimiter ','; --always-include-best-allele; and --no-flanking. In the input file, the epitopes were listed under “peptide” and each epitope’s MHC restriction was indicated under “allele,” with multiple alleles separated by spaces, leveraging the “HLA genotype” option.
Antigen presentation predictor
MHCflurry v2.0 [33] includes an antigen presentation predictor that integrates the antigen processing module with binding affinity predictions to jointly derive a probabilistic antigen presentation score. The presentation module was run with the same parameters as binding predictions.
TCR contact region
The central residues of HLA-I epitopes (P4 to PΩ-2) face upwards towards the TCR and are key for epitope recognition [71]. We have termed these the “TCR contact region.” To extract this sequence from any given epitope, we used a parsing script based on a sequence-subsetting function (R stringr v1.5.1).
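The extraction reduces to a simple subsetting of the epitope sequence (a Python sketch of the stringr logic, assuming PΩ-2 denotes the residue two positions before the C-terminus):

```python
def tcr_contact_region(epitope):
    """Return the central TCR-facing residues, P4 to P(omega-2), 1-indexed."""
    return epitope[3:len(epitope) - 2]
```

For the 9-mer GILGFVFTL, this yields the central residues GFVF (P4 to P7).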
Recognition physicochemical features
The physicochemical properties of an epitope and its TCR contact region are crucial for the protein–protein interaction that occurs between pHLA and TCR. These properties include hydrophobicity, molecular weight, net charge, and peptide instability and were determined using the R package Peptides v2.4.5 [68]. Settings and index properties were left as default. Hydrophobicity, molecular weight, and net charge were calculated for both the full epitope sequence and the TCR contact region. Peptide instability was calculated for the epitope sequence but not for the TCR contact region, since its residues are embedded within the full peptide sequence.
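Two of these descriptors can be sketched in Python for illustration (the study uses the R Peptides package; the net charge shown here is a crude residue count rather than the pKa-based calculation in Peptides):

```python
# Kyte-Doolittle hydropathy index (standard published values)
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
      "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
      "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
      "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def hydrophobicity(seq):
    """Mean Kyte-Doolittle hydropathy (GRAVY) of a peptide."""
    return sum(KD[aa] for aa in seq) / len(seq)

def net_charge_approx(seq):
    """Approximate net charge at neutral pH: +1 per Lys/Arg, -1 per Asp/Glu."""
    return sum(1 for aa in seq if aa in "KR") - sum(1 for aa in seq if aa in "DE")
```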
Model development
XGBoost
XGBoost [69] stands for eXtreme Gradient Boosting and is a decision-tree-based machine learning algorithm that uses gradient boosting to minimize prediction error over iterations and thus maximize predictive performance. It is a well-known winner of Kaggle ML competitions, where it has shown its fitness for tabular data and its capacity to deal with data imbalance [72]. XGBoost limits overfitting by penalizing model complexity with hyperparameters such as the learning rate (eta) or restrictions on tree depth (max_depth). In this implementation, we used the XGBoost R package v1.7.5.1 and the functions xgboost(), xgb.train(), and xgb.cv(). The parameters specified were: objective function “binary:logistic”; “aucpr” as evaluation metric with maximize set to “TRUE” to optimize predictive precision; prediction set to “TRUE” to obtain a probabilistic score; stratified set to “TRUE” to control the proportion of labels in the different folds; and nrounds set to “1000.” To accelerate training, the process was parallelized over 8 cores via nthread. Additionally, we set early_stopping_rounds to “5” and a watchlist using the test set to accelerate the internal performance evaluations. The obtained models were exported in .MODEL and .RAW formats.
Hyperparameter optimization
To tune the XGBoost model [69], we used a 5CV grid-search approach. The hyperparameters optimized were eta to control the learning rate, max_depth to limit tree depth and the complexity of the model, Scale Pos Weight (SPW) to compensate for class imbalance in immunogenicity data, and max_delta to constrain the estimation weight of each tree. The grid search explores all possible hyperparameter combinations and is coupled to the xgb.cv() function to train 5 models, each using four folds, and iteratively evaluate performance on the remaining fold. The hyperparameter space explored is the same in all PredIG optimizations for eta, max_depth, and max_delta. The Scale Pos Weight space ranges from 0.013 to 100 depending on the specific class imbalance optimization of the model (Additional File 1: Table S1) [73]. The gradient was set following literature recommendations on neoantigen class imbalance (1–5% immunogenic success rate) [24, 25] and was adapted to the non-canonical and pathogen sets, which, being more foreign to the immune system, show less pronounced class imbalances. SPW values were then calculated using the class imbalances in the test sets as realistic-to-extreme examples of class imbalance per antigen type. All values were explored in the different held-outs. Precisely, SPW-POS is obtained by dividing the number of negative instances by the number of positive instances. In addition, in the SPW-RS optimizations, the class imbalance in the training set (again, negative divided by positive instances) is rescaled by dividing it by the class imbalance in each test set. This rescaling corrects for the greater proportion of positive instances included in the training set, allowing the model to learn from more immunogenic pairs of epitope and HLA-I allele.
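The two SPW definitions reduce to simple ratios (illustrative Python):

```python
def spw_pos(n_negative, n_positive):
    """SPW-POS: mimic a target class imbalance as negatives over positives."""
    return n_negative / n_positive

def spw_rs(train_neg, train_pos, test_neg, test_pos):
    """SPW-RS: training-set imbalance rescaled by the test-set imbalance."""
    return (train_neg / train_pos) / (test_neg / test_pos)
```

For the training set described above (8067 negatives, 5006 positives), spw_pos gives roughly 1.61.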
Voting schemes
We used a geometric mean and an additive voting system [43] to combine the scores from different class imbalance optimizations in order to obtain an aggregated PredIG score. Briefly, the additive voting scheme consists of summing the probabilities predicted by each model and ranking based on this value; the geometric scheme uses the geometric mean instead. Both metrics were computed across all class imbalance optimizations (SPW-All) and across SPW-POS or SPW-RS separately.
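Both voting schemes can be expressed in a few lines (a Python sketch; scores_per_model holds one probability vector per SPW optimization):

```python
import math

def additive_vote(scores_per_model):
    """Additive voting: sum each pHLA's probabilities across models."""
    return [sum(scores) for scores in zip(*scores_per_model)]

def geometric_vote(scores_per_model):
    """Geometric-mean voting across the same model scores."""
    n = len(scores_per_model)
    return [math.prod(scores) ** (1.0 / n) for scores in zip(*scores_per_model)]
```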
Linear regression ensemble
To ensemble the PredIG scores obtained from different class imbalance optimizations into a single prediction, we used Lasso, Ridge, and Elastic Net linear models [74]. In each case, we modeled using 3 schemes: against all PredIG optimizations (SPW-All), against SPW-POS only, and against SPW-RS only.
Performance evaluation
To optimize the precision of our XGBoost models, we maximized AUCPR as the evaluation metric during training. The performance was validated against 3 held-out datasets containing cancer neoantigens (held-out 1), non-canonical cancer antigens (held-out 2), and SARS-CoV-2 pathogenic antigens (held-out 3). Due to the prioritization goal of the immunogenicity prediction problem posed here, we first evaluated performance using an enrichment metric, the immunogenicity screening success rate (ISSR). This metric assesses the number of ground-truth immunogenic pHLAs found among the top-X scored candidates in a target dataset. For low-throughput assessment, we used ISSR10, ISSR25, and ISSR50, whereas for high-throughput assessment, we used ISSR100, ISSR200, ISSR400, and ISSR1000 (the latter only in the neoantigen set). To aggregate success rates, we computed an averaged fraction of low-throughput enrichments (ISSR_low), of high-throughput enrichments (ISSR_high), and an overall enrichment using all metrics (ISSR_all). Secondly, to evaluate the overall precision in a dataset, we used AUCPR, and for overall performance, we used ROCAUC. Further statistical analyses were performed using MCC, accuracy, precision, recall, specificity, TPR, FPR, FDR, and odds ratio, all thresholded by the Youden index (not shown). To focus on the top predicted candidates, we also used partial ROCAUC (pAUC) [75]. The performance statistics were computed using the R libraries ROCR v1.0.11, pROC v1.18.5, PRROC v1.3.1, mltools v0.3.5, and epitools v0.5.10.1.
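The ISSR metric itself is straightforward to compute (illustrative Python; labels use 1 for immunogenic and 0 for silent):

```python
def issr_at_k(scores, labels, k):
    """Fraction of truly immunogenic pHLAs among the top-k scored candidates."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    return sum(label for _, label in ranked[:k]) / k
```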
T-cell epitope prioritization benchmark
The benchmark against state-of-the-art T-cell prioritization methods included NOAH [67], MHCflurry v2.0 affinity percentile [34], and NetMHCpan v4.1 EL Rank for HLA-binding [36]; NetCleave v2.0 for proteasomal cleavage [64], pHLA stability measurements [40], MHCflurry scores for antigen processing and presentation [34]; and PRIME v2.0 [38] and TLImm [76] for immunogenicity prediction.
HLA binding refinement
The assessment of success rate refinement over HLA-I binding predictors builds on the MHCflurry v2.0 affinity percentile, NetMHCpan v4.1 EL rank, and NOAH score. The “same budget” analysis refers to testing the same number of pHLA candidates while discarding strong binders with low PredIG scores and including the next-best binders with high PredIG scores. To optimize this setting, we explored a range of PredIG score thresholds to maximize the success rate for a given number of tests (e.g., 100 experiments). The “optimized budget” analysis uses the PredIG score to reduce the number of tests to perform by discarding strong-binding candidates with low PredIG scores (e.g., from an initial selection of the top-100 binding candidates, PredIG reduces the tests to perform to 40).
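The “same budget” selection can be sketched as follows (illustrative Python; the candidate tuple layout is ours, not the pipeline’s):

```python
def same_budget_selection(candidates, budget, predig_cutoff):
    """Fill a fixed experimental budget with the strongest binders that pass
    a PredIG score cutoff, skipping strong binders with low PredIG scores.

    candidates: (binding_rank, predig_score, pHLA_id) tuples; lower rank = stronger.
    """
    ranked = sorted(candidates)  # strongest binders first
    passing = [c for c in ranked if c[1] >= predig_cutoff]
    return [phla_id for _, _, phla_id in passing[:budget]]
```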
Identification overlap
To assess PredIG’s prioritization of truly immunogenic pHLAs alternative to the best binding predictions, we overlapped concatenated pHLA pairs between the top-100 PredIG candidates and the top-100 binding candidates and identified the differing pairs. Venn diagrams were computed using the ggvenn v0.1.10 and Upset plots using the UpSetR v1.4.0 R packages.
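The overlap analysis is a set comparison on concatenated pHLA identifiers (illustrative Python):

```python
def prediction_overlap(predig_top, binding_top):
    """Compare top-ranked pHLA sets from PredIG and a binding predictor.

    Returns (shared, PredIG-only, binding-only) sets of pHLA identifiers.
    """
    a, b = set(predig_top), set(binding_top)
    return a & b, a - b, b - a
```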
Prediction interpretability
To extract the feature contribution from XGBoost models, we calculated Shapley values [77] using shapforxgboost R package v0.1.3. To interpret the directionality of each feature towards predicting immunogenic pHLAs, we used a ggplot2 (v3.5.1) adapted version of XGBoost summary plots.
PredIG implementation and usage modes
PredIG is an open-source software released and maintained at https://github.com/BSC-CNS-EAPM/PredIG under the GNU General Public License version 2. PredIG accepts three types of input file: CSV-Uniprot, CSV-Recombinant, and FASTA. PredIG predictions can be run using three different models: PredIG-NeoA, optimized for cancer neoantigens; PredIG-NonCan, for non-canonical cancer antigens; and PredIG-Path, for pathogens. PredIG performs pan-HLA-I allele predictions.
“CSV-Uniprot” mode
Input a .CSV file with pairs of peptide and HLA-I allele listed in the "Epitope" column (peptide sequence) and the "HLA_allele" column (HLA nomenclature in 4 digits; e.g., HLA-A*02:01). The UniProt ID of the protein of origin of each peptide is provided in the "uniprot_id" column (required for proteasomal processing calculations within PredIG).
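A minimal CSV-Uniprot input might look as follows; the epitope and UniProt entries are illustrative examples, not taken from the PredIG datasets:

```csv
Epitope,HLA_allele,uniprot_id
GILGFVFTL,HLA-A*02:01,P03485
NLVPMVATV,HLA-A*02:01,P06725
```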
“CSV-Recombinant” mode
Input a .CSV file with pairs of peptide and HLA-I allele listed in the "Epitope" column (peptide sequence) and the "HLA_allele" column (HLA nomenclature in 4 digits; e.g., HLA-A*02:01). The amino acid sequence of the recombinant protein specific to each peptide is provided in the "protein_name" column (required for proteasomal processing calculations within PredIG).
“FASTA” mode
Input a FASTA file with the target protein sequence and a .CSV file with a list of HLA-I alleles of interest ("HLA_allele" column). By default, PredIG will generate all possible epitopes of 8 to 14 amino acids in length and score them against the input HLA-I alleles.
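The epitope enumeration in FASTA mode amounts to a sliding window over the protein sequence; a minimal sketch of that expansion (the function name and pairing logic are illustrative, not PredIG internals):

```python
def enumerate_epitopes(protein_seq, alleles, min_len=8, max_len=14):
    """Yield every (peptide, allele) pair that FASTA mode would score:
    all contiguous 8- to 14-mers crossed with the query HLA-I alleles."""
    for length in range(min_len, max_len + 1):
        for start in range(len(protein_seq) - length + 1):
            peptide = protein_seq[start:start + length]
            for allele in alleles:
                yield peptide, allele

# A 14-residue toy sequence yields 28 peptides (7 + 6 + ... + 1)
pairs = list(enumerate_epitopes("MTEYKLVVVGAGGV", ["HLA-A*02:01"]))
```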
The output of PredIG includes the epitope sequence, the HLA-I allele in 4-digit format, all the features required for PredIG predictions, and the PredIG score of the selected model.
Webserver
PredIG is implemented as a webserver at https://horus.bsc.es/predig. The server supports the batch prediction of up to 5000 pHLAs in a single search. Extended instructions on input options, searches, and output formats can be found in the server documentation. Results may take several minutes to compute. Queries with a large number of parental proteins might extend search times due to the automated retrieval of the C-terminal context from UniProt [78] for proteasomal cleavage prediction via NetCleave. In addition, a full protein sequence can be explored with chosen epitope lengths and HLA-I allele queries.
Containers
PredIG has been containerized for scalability and reproducibility using Docker [79] and Singularity [80]. The Docker container is available at https://hub.docker.com/r/bsceapm/PredIG and the Singularity container via the Singularity Community Catalog at https://github.com/BSC-CNS-EAPM/PredIG. For building and usage instructions, refer to https://github.com/BSC-CNS-EAPM/predig-containers/.
R dependencies
All R library dependencies used to develop PredIG can be found and installed through the code on GitHub and via the Docker and Singularity containers.
Results
A computational feature space of antigenic and physicochemical descriptors of pHLA pairs fosters the explainability of immunogenicity predictions
To bridge ML-based immunogenicity prediction with cellular immunology principles and enable a consistent biological explainability of PredIG predictions, we built a tailored feature space—the descriptors that we calculate in silico from pHLAs previously validated in T-cell reactivity and T-cell binding assays—comprising a set of seven antigenic and seven physicochemical descriptors that represent the steps leading to antigen presentation and recognition (Fig. 1A–B). These properties include proteasomal cleavage [63, 64, 67], TAP transport [65, 66], HLA-I binding [34, 67], and antigen processing and presentation [34] in the antigenic feature set (Fig. 1A), and calculations of amino acid hydrophobicity, molecular weight, net charge, and peptide stability to represent the physicochemical properties influencing epitope recognition [68] (Fig. 1B). Notably, PredIG's feature space is built entirely in silico and requires minimal input information per pHLA: the epitope sequence, HLA-I allele(s), and the epitope's parental protein.
Fig. 1.
The interpretability of PredIG's immunogenicity score is fostered by a computational feature space that computes and weights the importance of antigenic and physicochemical properties from pairs of immunogenic and non-immunogenic epitope and HLA-I allele validated in T-cell assays (e.g., IFN-γ ELISpots or MHC tetramers). A The antigenic feature set includes computational scores for proteasomal processing (NetCleave v2.0 [64], MHCflurry v2.0 processing score [34]), TAP transport (NetCTLpan v2.0 [65, 66]), HLA-I binding affinity (NOAH [67], MHCflurry v2.0 binding affinity and affinity percentile [34]), and antigen presentation (MHCflurry v2.0 presentation score). Adapted from Rock et al. [82]. B The physicochemical features are calculated for the entire epitope sequence and, separately, for the central residues of the epitope (P4–PΩ2) [68]. The properties calculated are hydrophobicity, molecular weight, net charge, and chemical stability (the latter for the entire epitope exclusively)
The antigenic and physicochemical features are calculated using a compendium of state-of-the-art bioinformatic tools (see the "Methods" section) [34, 63, 64, 66]. By integrating these using explainable AI (XAI) techniques [77], PredIG weights the contribution of antigenic and physicochemical properties in our immunogenicity score. This scheme enables a comprehensive interpretation of the epitope qualities predicted as immunogenically prone. The physicochemical features we use are calculated for the entire sequence of the epitope and specifically for its central residues (P4–PΩ2), termed here the "TCR contact region" (Fig. 1B). This dissection of the epitope sequence is performed because central epitope residues are known to face upward toward the TCR, strongly influencing the outcome of the immune synapse [71, 81]. Similarly, epitope hydrophobicity has been linked to greater T-cell activation capacity [71]. We use molecular weight as a proxy for residue bulkiness, as well as amino acid net charge, to assess their influence on the contact with the CDR loops of the TCR [59]. Lastly, peptide stability is included as a proxy for pHLA durability, in this case calculated only for the full peptide sequence.
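Extracting the TCR contact region and a hydrophobicity descriptor can be sketched as follows; the Kyte-Doolittle scale and the slicing interpretation of "P4-PΩ2" (fourth residue through the residue two positions before the C-terminus) are assumptions here, as the text does not fix either choice:

```python
# Kyte-Doolittle hydropathy values (assumed scale; standard published figures)
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def tcr_contact_region(epitope):
    """Central residues P4..P(Omega-2): drop the first three and the last
    two positions (interpretation of 'P4-PΩ2' assumed)."""
    return epitope[3:len(epitope) - 2]

def mean_hydrophobicity(peptide):
    """Average Kyte-Doolittle hydropathy over a peptide."""
    return sum(KD[aa] for aa in peptide) / len(peptide)

core = tcr_contact_region("GILGFVFTL")  # 9-mer -> "GFVF"
```

The same full-sequence/central-region split would apply to the molecular weight and net charge descriptors.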
PredIG, a three-in-one model optimized to predict the T-cell epitope immunogenicity of cancer neoantigens, non-canonical cancer antigens, and pathogen-derived epitopes
To obtain a sizeable dataset to train a machine learning (ML) algorithm, we retrieved 17,448 pairs of epitope and HLA-I allele (pHLAs) from public databases [48, 50, 54] and recent literature [4, 41, 49, 55, 60] (Fig. 2A–D). All pHLAs were obtained from annotated T-cell immunogenicity validation assays, such as T-cell reactivity via cytokine release (e.g., IFN-γ ELISpots) and T-cell binding experiments (e.g., MHC-tetramer stainings) (Fig. 2A). We curated the HLA nomenclature annotation of each pHLA complex, requiring 4 digits of allele resolution to enable accurate HLA binding affinity predictions (Fig. 2B). In addition, T-cell experiments were further curated to instances either tested on mono-allelic cell lines or with epitopes predicted to bind strongly to one allele among all the HLA molecules present in the experimental sample [56] (Fig. 2C).
Immunogenicity validation screenings often encounter few positive instances among a large number of tested candidates; hence, available T-cell assay datasets are severely imbalanced towards non-immunogenic pHLAs (majority class) [24]. In ML terms, this phenomenon is known as class imbalance: a small proportion of the minority class—positive or immunogenic pHLAs—diluted within a highly frequent majority class (negative or non-immunogenic pairs) [83]. Throughout this work, we term this factor "Immunogenicity Class Imbalance" (Fig. 2D), and its integration in the PredIG models is extensively tested, because accounting for class imbalance is strongly relevant when predicting a dichotomous response or label using ML algorithms.
Within the entire dataset of 17,448 T-cell assays retrieved, PredIG is trained on 13,073 pHLA pairs (6114 tumoral and 6959 of pathogen origin, of which 5006 were validated as immunogenic and 8067 as non-immunogenic) and internally validated using a fivefold cross-validation (Fig. 3A). Independently, we built three antigen-specific held-out or test sets—not present among the pHLAs in the training set—using cancer neoantigens (held-out 1, n = 3564), non-canonical cancer antigens (held-out 2, n = 560), and pathogen-derived epitopes (held-out 3, n = 243) for model benchmarking (Fig. 3B). Regarding the immunogenicity class imbalance, we deliberately built the training set with a smaller class imbalance and epitopes of mixed origin for PredIG to learn from a larger and more antigenically diverse set of immunogenic pHLAs. In contrast, to assess the generalization capacity of PredIG in realistic use-cases, our held-out test sets embed larger class imbalances and exclusively contain specific antigenic origins, as end-user studies will typically focus on a single type of antigen (Fig. 3A–B).
Fig. 3.
Optimization of PredIG XGBoost models using Scale Pos Weight (SPW) for immunogenicity class imbalance correction and benchmark against state-of-the-art methods for T-cell epitope prioritization. A PredIG is trained on a dataset that includes epitopes of pathogenic and cancer origin (n = 9863; immunogenic = 3804, non-immunogenic = 6059), tuned using 5CV, and validated against an internal test set (n = 3290; immunogenic = 1279, non-immunogenic = 2011). The large fraction of immunogenic pHLAs in both the training and test sets was purposely devised to maximize the learning of the model from a wide diversity of immunogenic pHLAs. B Next, we generated three type-of-antigen-specific held-out sets as use-cases with realistic immunogenicity class imbalance (e.g., a lower fraction of immunogenic pHLAs). These include the following: (1) cancer neoantigens (n = 3564, immunogenic = 144, non-immunogenic = 3420); (2) non-canonical cancer antigens (n = 560, immunogenic = 18, non-immunogenic = 542); and (3) pathogen-derived epitopes from SARS-CoV-2 (n = 243, immunogenic = 5, non-immunogenic = 238). C, E, G Optimization of the XGBoost models [69] studying a gradient of Scale Pos Weight (SPW) [73] hyperparameter combinations (termed PredIG-SPW) to foster generalization across different immunogenicity class imbalance contexts and specific types of antigens. The Scale Pos Weight hyperparameter increases the attention or weight over positive cases in the loss function that XGBoost models optimize, maximizing prediction performance on immunogenic pHLAs. Shown per antigen type are the top-5 best model solutions selected based on averaged ISSR (see Additional File 2: Fig. S1 for extended statistical analysis).
Statistically, the performance is evaluated in terms of discrimination capacity over the entire dataset (ROCAUC—oranges), precision over recall (AUCPR—reds), and prioritization of ground truth immunogenic epitopes among top-ranked candidates or immunogenicity screening success rates; the averages of individual screenings (e.g., numbers of T-cell assays to perform) are colored in green, blue, and purple for ISSR_Low, ISSR_High, and ISSR_All. ISSR_Low includes ISSR10, ISSR25, and ISSR50—considered low-throughput T-cell assays—for all held-outs. ISSR_High includes ISSR100 and ISSR200 for all held-outs, whereas ISSR400 and ISSR1000 are not included when their throughput exceeds the size of the given held-out dataset (ISSR400 not for pathogen and ISSR1000 not for pathogen nor for non-canonical cancer antigen held-outs). The heatmap color-scale is harmonically set across metrics so that the darker the color, the better the score. D, F, H Benchmark of the best PredIG-SPW optimizations per antigen type (PredIG-NeoA, PredIG-NonCan, PredIG-Path) against state-of-the-art methods for T-cell epitope prioritization. The SOTA methods in our benchmark include NOAH [67], NetMHCpan v4.1 [36], and MHCflurry v2.0 [34] for HLA-I binding, NetMHCstab [40] for pHLA stability, and PRIME v2.0 [38] and TLImm [76] for T-cell epitope immunogenicity. The statistical evaluation of the benchmark and color-scale visualization are the same as in the model optimization panels
To create our PredIG models, we used the XGBoost algorithm [69] given its state-of-the-art performance on tabular data [72], adaptability to small datasets, and capacity to deal with imbalanced data [69]. We tuned the XGBoost algorithm for class imbalance correction using the specific hyperparameters Scale Pos Weight (SPW) and Max Delta Step (Fig. 3C–H, Additional File 2: Fig. S1, Additional File 3 and 4: Table. S2 and S3) [73]. Specifically, the SPW hyperparameter increases the attention or weight over positive instances in the loss function that XGBoost models optimize to maximize predictive performance. For this reason, the optimization of SPW was studied extensively, giving rise to a PredIG-SPW nomenclature system to track model evolution (Fig. 3C, E, G, and Additional File 2: Fig. S1). In parallel, we used the area under the precision-recall curve (AUCPR) as the evaluation metric to maximize during the training iterations of the model. By maximizing AUCPR, we enhance the precision of PredIG on positive instances—immunogenic pHLAs—and favor the performance of the model on the top-ranked T-cell epitopes.
To adjust the class imbalance correction to the different frequencies of immunogenic epitopes expected per antigen type, we explored a gradient of SPW values, termed SPW-POS. This gradient ranged from the extreme class imbalance described for neoantigens (1:100) to a lower range for pathogen-derived and non-canonical cancer antigens, which are more foreign to the immune system (Fig. 3B, C, E, and G, Additional File 2: Fig. S1, Additional File 3 and 4: Table. S2 and S3). We additionally designed a second gradient, SPW-RS, that corrects the SPW-POS values by the class imbalance present in our training set (non-immunogenic over immunogenic pHLAs).
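A minimal sketch of how such gradients could be parameterized; the multiplicative form of the SPW-RS correction is an assumption here, as the text only states that SPW-POS values are corrected by the training-set ratio of non-immunogenic over immunogenic pHLAs (8067/5006 from the counts reported earlier):

```python
def training_imbalance(n_negative, n_positive):
    """Training-set class imbalance: non-immunogenic over immunogenic pHLAs."""
    return n_negative / n_positive

def spw_rs(spw_pos, n_negative, n_positive):
    """SPW-RS value: an SPW-POS weight corrected by the training-set
    imbalance (multiplicative correction assumed for illustration)."""
    return spw_pos * training_imbalance(n_negative, n_positive)

# Counts reported for the PredIG training data
N_NEG, N_POS = 8067, 5006
spw_pos_grid = [1, 5, 10, 25, 50, 100]  # hypothetical gradient up to the 1:100 neoantigen imbalance
spw_rs_grid = [spw_rs(w, N_NEG, N_POS) for w in spw_pos_grid]
```

Each grid value would be passed to XGBoost's `scale_pos_weight` hyperparameter during tuning.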
To assess the performance of the PredIG-SPW optimizations during development against the antigen-specific test sets, we first evaluated ROCAUC, a metric typically used in machine learning to assess the general discrimination or classification capacity of predictive models. Second, we used AUCPR, which ponders prediction precision (fraction of true positives among pHLAs predicted as positive) over recall or sensitivity. We explored both curves as statistical metrics to select the best PredIG-SPW optimizations (Fig. 3C, E, G, and Additional File 2: Fig. S1). However, ROCAUC and AUCPR did not agree on a unique PredIG-SPW optimization as the best performer across antigen types. Having observed that individual PredIG-SPW optimizations did not converge, we proceeded to combine multiple optimizations of class imbalance or PredIG-SPW: first, using aggregated metrics or voting schemes, and second, building ensemble architectures using a layer of linear models combining PredIG-SPW optimizations. The aggregated metrics tested were the geometric mean and an additive voting, unifying different combinations of SPW weights (either exclusively SPW-POS, exclusively SPW-RS, or both). However, voting schemes did not improve the performance of individual PredIG-SPW optimizations and increased model complexity, which challenged prediction explainability (Additional File 2: Fig. S1). Thus, we discarded this approach. Next, we aimed to combine PredIG-SPW scores into ensemble models using linear regressions and regularization via Lasso, Ridge, and Elastic Net. Again, these more complex model architectures did not improve the predictive performance while preventing straightforward prediction explainability, and were discarded (Additional File 2: Fig. S1).
Given the lack of agreement among go-to machine learning performance metrics, we tailored our statistical analysis to explore the capacity of PredIG to prioritize truly immunogenic epitopes among top-ranked pHLAs according to our scoring (success rate). Furthermore, we reasoned that this analysis should better recapitulate the experimental needs of immunological screening—shortlisting a bona fide set of T-cell epitopes to validate among a large number of possibilities—and that, if stable, it would constitute a better model-selection rationale. We purposely defined this metric as the "Immunogenicity Screening Success Rate" or ISSR (the capacity to rank truly immunogenic epitopes among top-scored candidates). We calculated ISSR upon different numbers of top-ranked epitopes to simulate a range of screening throughputs, referring to these as ISSR_X, given the number of experimental validations to be performed (ISSR10, 25, 50, 100, and 200, plus ISSR400 and 1000 when the held-out dataset size allowed). We implemented ISSR in all model optimizations across the three held-out sets. However, the peaks in ISSR performance again varied between models, specific screening throughputs, and held-out sets. We hypothesized that obtaining a stable behavior across the multiple screening throughputs tested was unlikely.
To further aggregate ISSR seeking performance stabilization for best-model selection, we integrated all the calculated ISSRs into a single metric (ISSR_All) and into two specific ISSRs aggregating low- and high-throughput studies (ISSR_Low and ISSR_High). The aggregated ISSRs achieved a greater stabilization of the predictive performance beyond the fluctuation observed across the previous gradient of screening sizes; however, we could not identify a single model that performed best across the three antigen-specific held-out sets. Still, maximizing ISSR_All, ISSR_Low, and ISSR_High, we identified one best-performing PredIG-SPW optimization per antigen type, and thus these metrics were chosen to simplify model choice. The selected PredIG optimizations were termed PredIG-NeoA for cancer neoantigens, PredIG-NonCan for non-canonical cancer antigens, and PredIG-Path for pathogen-derived antigens, corresponding to the optimizations PredIG-SPW-POS-Xtreme, PredIG-SPW-RS-I2, and PredIG-SPW-RS-I1, respectively (Fig. 3C, E, G, and Additional File 2: Fig. S1).
PredIG outcompetes the immunogenicity screening success rate of state-of-the-art methodologies in non-canonical cancer antigens and pathogen-derived epitopes
We compared the performance of PredIG against multiple state-of-the-art T-cell epitope predictors using the three independent datasets: 3564 immunogenically validated cancer neoantigens, 560 non-canonical cancer antigens, and 243 pathogen-derived epitopes (Fig. 3D, F, and H). The tools included in the benchmark cover the prediction of T-cell immunogenicity (PRIME [38] and TLImm [76]), HLA-I binding affinity (NetMHCpan [36], MHCflurry binding affinity and percentile rank scores [34], and the NOAH score [67]), pHLA stability (NetMHCstab [40]), antigen processing (NetCleave [63, 64] and MHCflurry's processing score), and antigen presentation (MHCflurry's presentation score [34]).
In the non-canonical cancer antigen set, PredIG achieves state-of-the-art performance as a classifier and surpasses the ISSR of all HLA binding predictors assessed as a ranker (Fig. 3F and Additional File 5: Fig. S2). Similarly, PredIG outcompetes PRIME and TLImm across all ISSRs except ISSR50, where equal rates are observed. As for antigen processing, the MHCflurry processing score is the top performer at ISSR100 and ISSR200, whereas NetCleave shows low success rates. NetMHCstab displays an inferior performance at lower throughputs and a competitive result at ISSR400. Among HLA binding predictors, percentile scores rank better than affinity scores (nM-based), and the structure-based NOAH falls behind in the low-throughput enrichments but recovers state-of-the-art performance in large-throughput tests. Conversely, the MHCflurry presentation score shows the lowest performance across success rates. In the classification curves, PredIG is the top performer in both AUCPR and ROCAUC, matched only by the MHCflurry presentation and processing scores (Fig. 3F and Additional File 5: Fig. S2).
In the pathogen set derived from SARS-CoV-2, PredIG clearly outcompetes the ISSR of all other methods (Fig. 3H and Additional File 5: Fig. S2). This behavior is reproduced across all success rates and AUCPR with notable differences. Conversely, in this case, ROCAUC is not associated with immunogenicity success rates, likely due to the small number of positive instances. Remarkably, this held-out dataset only contained 5 immunogenic pHLA cases, which PredIG ranks much higher than the other methods.
To further test the generalization capacity of PredIG-Path beyond the small number of immunogenic epitopes in our pathogen held-out (5 pHLAs), we performed an additional experiment over 10 additional held-out splits (termed pathogen-splits). These were formed by exchanging the 5 positive cases in our pathogen held-out for different sets of novel immunogenic T-cell epitopes deposited in the IEDB from SARS-CoV-2 publications indexed after 01/01/2024 (Additional File 6: Fig. S3). The novel pHLAs were filtered to avoid any data leakage with the training sets and the held-outs, leading to 292 immunogenic pHLAs. We organized these pHLAs into the novel pathogen-split held-outs by including sets of 5, 10, 15, 20, 25, 30, 35, 40, 45, and 50 different immunogenic pHLAs in each split, thus making the splits independent in terms of positive cases. The negative pHLAs were maintained from the original pathogen held-out (n = 238 non-immunogenic pHLAs). This enabled the assessment of PredIG-Path performance across a gradient of immunogenicity class imbalances and towards a larger group of immunogenic pHLAs derived from pathogens. Furthermore, the splits were stratified to maximize and control the diversity in terms of presenting HLA-I alleles.
The results depict a stable-to-increasing performance of PredIG-Path on the novel pathogen-splits, thus supporting the generalization capacity of our predictions towards a larger set of immunogenic epitopes. AUCPR ranges between 0.22 and 0.40 versus the original 0.31, whereas ROCAUC ranges between 0.70 and 0.92 versus the original 0.76. The ISSR performances at low- and high-throughput screenings are equal or superior in the novel splits, with the exception of ISSR10 at Split 5. The averaged ISSR displayed an incremental performance gain as the number of positives included in the splits increased. Overall, these results confirm the generalization capacity of PredIG-Path over a range of novel positive pHLAs derived from SARS-CoV-2 and support the usefulness of training PredIG on a diverse set of antigenic origins.
In the cancer neoantigen set, PredIG is a step below the state-of-the-art ranking performance obtained by the HLA binding predictors (NetMHCpan [36] and MHCflurry [34]) and by TLImm [76], a transfer learning-based tool that combines MS immunopeptidomics, HLA binding, and T-cell assay data (Fig. 3D and Additional File 5: Fig. S2). When compared to PRIME [38]—the only tool exclusively trained on T-cell assays with a dataset size similar to PredIG's—we obtain higher immunogenicity success rates in the low-throughput settings (ISSR10 to ISSR100) but lower rates in larger settings (ISSR200–1000). The pHLA stability score of NetMHCstab [40] displays a superior performance across enrichment throughputs. Among the antigen processing scorings, NetCleave [63, 64] displays lower success rates across conditions, whereas the MHCflurry processing score is less efficient at low throughputs but more so in larger experiments [34]. As for classification metrics, in terms of AUCPR, PredIG's optimizations are top performing, surpassed only by the MHCflurry presentation score [34], which displays the lowest ISSR. In terms of ROCAUC, PredIG achieves lower performances, which may be caused by the optimization of both precision and ranking performance during model development.
PredIG top-ranks immunogenic pHLAs alternative to best-binding epitopes
Beyond ranking performance, we assessed the identification overlap among the pHLAs top-ranked by different predictors, observing across antigen types that PredIG prioritizes immunogenic epitopes not found among the top-binding candidates prioritized by HLA-I binding predictors (Fig. 4A–C). In the cancer neoantigen held-out set, when comparing immunogenic epitopes ranked among the top-100 pHLA pairs by NetMHCpan, MHCflurry, NOAH, and PredIG—an aggregated total of 56 immunogenic epitopes—PredIG prioritizes 19/56 truly immunogenic pHLAs. Of this group, up to 8 are not prioritized by any HLA-I binding tool (Fig. 4A). In addition, when combining PredIG and NOAH, our in-house structure-based HLA-I binding predictor, we identify up to 19 immunogenic neoantigens not prioritized by NetMHCpan and MHCflurry.
Fig. 4.
PredIG prioritizes an alternative set of truly immunogenic (neo)antigens not ranked among top HLA-binding epitopes. A PredIG's top-100 candidates rank immunogenic pHLAs not prioritized in the top-100 of any of the HLA-I binding affinity tools tested. This pattern is reproduced across all the held-out datasets comprising cancer neoantigens, non-canonical cancer antigens, and pathogen-derived epitopes. In the held-out set for cancer neoantigens, aggregating the top-100 from all HLA binding methods leads to the identification of 56 unique immunogenic pHLAs. Upon these, PredIG's top-100 identifies 8 immunogenic pHLAs not ranked by any binding method. When combining PredIG and NOAH, we identify up to 19 alternative neoantigens, a relevant increase of 34% over the total candidates identified by sequence-based HLA-I binding affinity predictors (MHCflurry and NetMHCpan). B In the non-canonical cancer antigen held-out, PredIG's top-100 identifies 7 out of 9 unique immunogenic pHLAs, of which 2 are not identified by any other method. On the contrary, all other top-100 sets from binding methods combined are only able to rank 2 out of 9 candidates not seen by PredIG. Of note, due to the lower success rates in non-canonical cancer antigens, the degree of alternative identifications is constrained. C In the pathogen held-out, PredIG's top-100 identifies the only candidate ranked by binding methods plus 3 alternative pHLAs. Notably, PredIG ranks 4 out of 5 of the immunogenic candidates of this set among the top-100, whereas the other methods rank only 1. Of note, due to the lower success rates in pathogens, the degree of alternative identifications is constrained
Of note, due to the low number of immunogenic epitopes in the non-canonical cancer antigen and pathogen test sets (18 and 5, respectively), the degree of alternative identifications in these settings is constrained. In the non-canonical cancer antigen set, 9/18 immunogenic epitopes are identified when adding up the successful identifications among the top-100 of NetMHCpan, MHCflurry, NOAH, and PredIG. Within these, PredIG successfully ranks 7/9 pHLAs, of which 2/7 are not found by any other method (Fig. 4B). In the pathogen test set, PredIG ranks 4/5 immunogenic epitopes in our top-100, while all the HLA binding methods aggregated only identify 1 immunogenic candidate, which is also found among our identifications (Fig. 4C).
Refining the immunogenicity screening success rate of HLA-I binding affinity predictions using PredIG scores as a quality filter
At this point, provided that the training data used to build PredIG consists of T-cell assays performed on pHLAs strongly selected for HLA binding affinity [41, 50, 54], we assessed whether PredIG scoring thresholds could distinguish between immunogenic and non-immunogenic or silent pHLAs among strong binding candidates. Thus, we established PredIG scores as a quality filter to refine the ISSR of top binding pHLA candidates (Fig. 5A–B) [24]. Implementing the Youden index—the point of maximum discrimination capacity calculated from the ROC curve—for each PredIG best model (PredIG-NeoA, PredIG-NonCan, and PredIG-Path) as the filtering threshold at the top-200 binders in the corresponding held-out set, we identified significant enrichments using Fisher's exact test or the test of equal or given proportions across HLA binding affinity tools (Additional File 7: Table. S4).
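Computing such a Youden-index threshold can be sketched with a brute-force sweep over observed scores; a minimal illustration of the definition (maximizing sensitivity + specificity - 1), not the pROC-based implementation used in the paper:

```python
def youden_threshold(scores, labels):
    """Return (threshold, J): the score cut-off maximizing Youden's
    J = sensitivity + specificity - 1, sweeping every observed score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / n_pos - fp / n_neg  # sensitivity - (1 - specificity)
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

# Toy case: the two immunogenic pHLAs score above the two silent ones
t, j = youden_threshold([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])
```

pHLAs scoring at or above the returned threshold would then pass the quality filter.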
Fig. 5.
PredIG, used as a quality filter, refines the immunogenicity screening success rates (ISSRs) of top-ranked pHLAs predicted by HLA-I binding affinity tools. A PredIG scores can be used to refine the success rate of pHLA-I binding predictors when implemented as a quality filter upon the best HLA-binding candidates. Among the best-binding candidates—the lowest-scoring epitopes in the averaged HLA binding affinity percentiles between NetMHCpan and MHCflurry—immunogenic pHLAs (dark purple) are enriched in the high range of the PredIG score distribution, whereas a greater share of non-immunogenic epitopes (light purple) spread over the lower-scoring range. Thus, strong binding candidates prioritized by HLA-I binding predictions contain false positives that can be downsized by using a stringent threshold of the PredIG immunogenicity score. B Refinement of the ISSR in a "same budget screening" and an "optimized budget screening" using PredIG to filter and retain pHLAs or to filter and optimize the size of T-cell screenings, respectively. The refinement capacity is tested across antigen types and binding prediction tools (MHCflurry v2.0 [34], NetMHCpan v4.1 [36], and NOAH [67], plus an averaged scoring between the MHCflurry affinity percentile and NetMHCpan EL rank). The first four rows of each antigen-specific panel (*) display the baseline of truly immunogenic candidates among the top-100 binding pHLAs ranked by each binding tool. The Nº Hits are the truly immunogenic pHLAs identified among the total number of tests (Nº Tests). The next four rows (+) show the "same budget screening" approach, where top-binding epitopes with low PredIG scores are discarded and replaced by the next best binding pHLAs that surpass the PredIG score threshold optimized for each type of antigen (Additional File 1: Table. S1). With this, the number of tests remains at 100, as reflected in the "Nº Tests" column.
The last four rows (-) display the "optimized budget screening" approach, where only the pHLAs among the top-100 binding candidates that comply with the PredIG score threshold would be recommended for downstream validation. Hence, the number of tests to perform is lower (Nº Tests), optimizing the screening size while leading to enhanced success rates (Nº Hits/Nº Tests)
To test our refinement capacity, we applied a stringent PredIG score threshold to the top binding candidates predicted by the HLA binding affinity tools in our benchmark (NOAH [67], NetMHCpan [36], and MHCflurry [34]). Doing so, we observed an increase in the ISSR for all binding methods across the three antigen-specific held-out sets (Fig. 5A–B and Additional File 1: Table. S1). Next, we used a gradient of PredIG score thresholds on the top-100 candidates ranked by each binding predictor to maximize the performance of the PredIG score cut-off used as the recommended quality filter. As a result, we obtained a specific threshold for each antigen-type held-out (Additional File 1: Table. S1A–C). The need for specific thresholds could be attributed to the different influence of the class imbalance optimizations encoded in each PredIG-SPW best model (NeoA, NonCan, and Path) and their particular immunogenicity score distributions. In neoantigens, specific PredIG thresholds worked best when adapted to each HLA binding method, whereas for the non-canonical cancer antigen and pathogen-derived held-outs a stable cut-off was identified across all tools (Additional File 1: Table. S1A–C).
To further demonstrate the applicability of PredIG’s refinement capacity, we studied two different T-cell immunogenicity screening settings. First, in a “same budget screening,” we implemented our refinement using the antigen-specific PredIG score threshold as a quality filter while maintaining the number of pHLAs to screen experimentally. To this end, we discarded low-PredIG-scoring candidates among the top-binding pHLAs and then included the next-best binders that satisfied the PredIG threshold, until the same number of pHLA candidates was re-selected (Fig. 5B). Second, in an “optimized budget screening” setting, we reduced the number of tests and the experimental cost by using PredIG to discard low-scoring candidates previously selected via HLA binding tools, without including new instances (Fig. 5B).
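Conceptually, the two refinement settings amount to a simple filter over a candidate list pre-ranked by HLA binding. The snippet below is an illustrative sketch only, not PredIG's actual implementation; the candidate tuples and threshold value are hypothetical.

```python
# Illustrative sketch of the two screening settings (not PredIG's actual code).
# Each candidate is (binding_rank, predig_score, is_immunogenic), pre-sorted
# by HLA binding rank with the strongest binder first.

def same_budget_screening(candidates, threshold, budget=100):
    """Discard top binders below the PredIG threshold and refill with the
    next-best binders that pass it, keeping the number of assays constant."""
    passing = [c for c in candidates if c[1] >= threshold]
    return passing[:budget]

def optimized_budget_screening(candidates, threshold, budget=100):
    """Filter the top-`budget` binders by PredIG score without refilling,
    reducing the number of assays to perform."""
    return [c for c in candidates[:budget] if c[1] >= threshold]
```

Both strategies apply the same score cut-off; they differ only in whether candidates discarded from the top binders are replaced from further down the binding ranking.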
Overall, in the cancer neoantigen set, the success rates increased by over 6% across tools using the “same budget screening” approach and by up to 17% in the “optimized budget screening” (Fig. 5B). In absolute terms, performing the same number of T-cell assays (n = 100), this translated into identifying 31 to 37 immunogenic neoepitopes using PredIG as a quality filter, representing 2 to 6 additional immunogenic epitopes on top of a baseline of 29–30 positive epitopes identified by HLA binding predictors. Reducing the number of assays from 100 to 55, we retained 26 of the 30 positive cases ranked among the top-100 binders. Remarkably, when combining our in-house tools NOAH and PredIG, we reached a success rate of 47% in the “optimized budget screening” approach (Fig. 5B).
In the non-canonical cancer antigen set, which contains only 18 immunogenic pHLAs, PredIG’s “same budget screening” refinement captures an additional 2 to 4 epitopes from a baseline of 3 to 5 already identified among the top-100 HLA binders of the NOAH, MHCflurry, and NetMHCpan rankings. Under the “optimized budget screening” conditions, an ISSR of up to 13% is reached, compared to a baseline of 2 to 5%, while reducing the number of tests from 100 to 39 (Fig. 5B).
Finally, in the pathogen test set, which contains 5 immunogenic pHLAs, binding methods display a very low ISSR100, ranking only 1 out of 5 positive cases. The implementation of PredIG improved this to 3 out of 5 immunogenic epitopes within the same budget. Similarly, lowering the number of experiments to 16, we still ranked 1 out of 5 immunogenic candidates. Aggregated, these results imply a threefold and a sixfold ISSR improvement in the “same budget screening” and “optimized budget screening” settings, respectively (Fig. 5B).
PredIG explainability analysis assigns comparable weights to antigenic and physicochemical features in T-cell epitope immunogenicity scores
PredIG is built using an explainable machine learning framework that enables the determination of the antigenic and physicochemical descriptors that are most important for predicting immunogenic epitopes with our score. To this end, we analyzed feature importance using Shapley values [77] upon our XGBoost models [69]. This evaluates the contribution of each property to our predictions and further dissects the directionality of feature value distributions versus our immunogenicity score (Fig. 6A–C). This directionality analysis enables the association between high values of a given descriptor, for instance hydrophobicity, and a greater likelihood of being predicted immunogenic (positive Shapley value).
Fig. 6.

The SHAP-based explainability underlying the PredIG immunogenicity score assigns comparable weights to physicochemical and antigenic descriptors of pHLA-I and pinpoints antigen-type particularities. A SHAP-based feature importance plot comparing the overall contribution of antigenic versus physicochemical features to the immunogenicity score for PredIG-NeoA, PredIG-NonCan, and PredIG-Path. The larger the SHAP values obtained, the greater the contribution to the predictions. Importantly, physicochemical features (oranges), often overlooked in T-cell epitope tools and prioritization pipelines, have a relevant contribution within PredIG, accounting for levels similar to those of antigenic predictors (blues). B Expanded SHAP-based feature importance analysis to dissect property-specific contributions to the immunogenicity score optimized per antigen type. Among the group of antigenic features, the major weight is associated with the sum of HLA-I binding affinity scores. The second-largest contributor is antigen processing, with NetCleave reaching the highest individual Shapley value. Antigen transport is ranked third, with the prediction of TAP translocation via NetCTLpan. Last is MHCflurry’s antigen presentation score, which might be overshadowed by collinearities with individual metrics (binding affinity and antigen processing). In the physicochemical feature group, the degree of influence ranks molecular weight and hydrophobicity as important contributors, while net charge and peptide stability obtain modest importance. Remarkably, the contribution of the central residues of the epitope, or TCR contact region, reaches levels of importance similar to those of the properties of the full epitope. C SHAP directionality plot to interpret the association between the distribution of antigenic and physicochemical descriptors and PredIG immunogenicity predictions. Each dot represents a pHLA-I instance and displays its Shapley value on the X-axis. 
Each pHLA’s predicted descriptor value is reflected by the color scale: antigenic descriptors (blues) and physicochemical properties (orange). The Shapley values indicate the importance of that feature for PredIG to predict the immunogenicity of the given pHLA. Among positive SHAP cases (above zero), the higher the Shapley value, the greater the likelihood of being predicted as an immunogenic pHLA by PredIG, whereas negative SHAP cases (below zero) indicate the association between descriptor distributions and “non-immunogenic” predictions. In the blue color scale, dark blue depicts the “best likelihood” in the antigenic range of each method. NOAH and MHCflurry affinity percentile scores were inverted to display the best values as dark. The orange color scale depicts the value range of physicochemical properties from low to high
Across our antigen-specific PredIG models (PredIG-NeoA, PredIG-NonCan, and PredIG-Path), we observe a balanced contribution of antigenic and physicochemical features, with a small advantage for the former (Fig. 6A). This trend observed in Shapley values is orthogonally validated using Gini index importance analysis, another feature importance metric embedded within XGBoost models [84] (data not shown). This analysis confirms the large contribution of the physicochemical descriptors of the epitope and its central residues to the prediction of immunogenicity, which current T-cell antigen selection pipelines often overlook in favor of focusing exclusively on antigenic properties.
Among antigenic features (Fig. 6A–C), the major weight is associated with HLA binding affinity, as contributed by the sum of the NOAH, MHCflurry affinity percentile, and MHCflurry affinity scores. Antigen processing is the second-largest contributor, with the NetCleave score reaching the highest individual Shapley value. Third comes antigen transport, with the prediction of TAP translocation via NetCTLpan. Finally, the contribution of antigen presentation, here the combination of binding affinity and antigen processing in MHCflurry’s antigen presentation score, is the smallest and might be overshadowed by collinearities with the individual metrics.
In the physicochemical feature group (Fig. 6A–C), both when computed over the full peptide sequence and over the TCR contact region, the degree of influence ranks molecular weight and hydrophobicity as important contributors, whereas net charge and peptide stability show modest importance. Remarkably, the contribution of the central residues of the epitope reaches levels similar to those of the properties computed over the entire epitope sequence.
To compare the distributions of individual features against their importance in the model computed by SHAP values, we take advantage of the feature directionality visualization (Fig. 6C). The analysis of antigenic feature distributions shows that individual metrics lack a clear immunogenicity discrimination capacity, as no dichotomous pattern is observed in the color scales (Fig. 6C). For HLA binding affinity, the uniform distribution can be explained by the HLA binding strength pre-selection, via binding predictors, of the pHLAs that are tested in T-cell assays [48–50]. Antigen processing metrics display greater discrimination, with the MHCflurry antigen processing score ranking immunogenic epitopes as better processed and NetCleave reproducing this trend to a lesser extent. Despite its medium contribution to the feature importance of the model, the distribution of TAP scores indicates a certain discrimination capacity across antigen types.
The physicochemical distributions display slightly clearer distinction patterns underlying T-cell immunogenicity (Fig. 6A–C). Immunogenic candidates present larger and more hydrophobic amino acids, both at the full-peptide and at the central-residue level. Differently, the net charge is more relevant when computed for the central residues of the peptide than when averaged across the entire peptide sequence. The peptide stability distribution, in accordance with its low importance in the model, does not seem to play a large role in pHLA immunogenicity prediction.
PredIG deployment in a webserver and containerized environments to support small- and large-throughput studies on pHLAs and full protein sequences
To foster the usability of PredIG in the community and its adaptability to small- and large-scale studies, we devised a deployment in a user-friendly webserver and via containerized environments (Fig. 7). Both the webserver and the containers contain the three PredIG antigen-specific models: PredIG-NeoA, PredIG-NonCan, and PredIG-Path. The webserver, accessible online, runs queries of up to 5000 pHLAs, which typically complete within minutes. Three input formats are allowed: CSV-Uniprot, CSV-Recombinant, and FASTA. The first two CSV formats are peptide-based inputs that list pairs of HLA-I allele and epitope coupled to their parental protein annotation, either as a Uniprot ID or as the amino acid sequence of the entire protein. The FASTA mode enables the automated exploration of full protein sequences by generating all the potential epitopes from 8 to 14 amino acids in length and calculating their immunogenicity when presented by a given list of HLA-I alleles.
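The FASTA mode's epitope enumeration amounts to a sliding window over the protein sequence crossed with the target alleles. A minimal sketch of the idea follows; the function names are hypothetical and not part of PredIG's API.

```python
def enumerate_epitopes(protein_seq, min_len=8, max_len=14):
    """Generate every candidate epitope of min_len to max_len residues from
    a protein sequence, as the FASTA mode does before scoring."""
    return [protein_seq[i:i + length]
            for length in range(min_len, max_len + 1)
            for i in range(len(protein_seq) - length + 1)]

def pair_with_alleles(peptides, hla_alleles):
    """Cross each candidate epitope with each target HLA-I allele to form
    the pHLA pairs submitted for immunogenicity scoring."""
    return [(pep, allele) for pep in peptides for allele in hla_alleles]
```

For a protein of length L, this yields roughly 7L candidate peptides, which is why the containerized deployment matters for proteome-scale queries.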
Fig. 7.

PredIG’s flexible input types allow the exploration of individual pHLAs as well as full protein sequences, deploying our tool in a user-friendly webserver and in containerized environments for ease of installation and high-throughput reproducibility. A “CSV-Uniprot” mode: the input file is a .CSV with columns indicating the epitope, the presenting HLA-I allele, and the Uniprot ID of the epitope’s parental protein. B “CSV-Recombinant” mode: the input file is a .CSV with columns indicating the epitope, the presenting HLA-I allele, and the amino acid sequence of the protein of origin. This mode is designed to support (recombinant or mutated) proteins without a Uniprot ID but can also work with any protein sequence. C “FASTA” mode: the input is composed of a FASTA file with the amino acid sequence of the protein of interest and a .CSV file with a list of target HLA-I alleles (“HLA_allele” column). By default, PredIG will generate all possible epitopes from 8 to 14 amino acids in length and will calculate their immunogenicity score paired against all the HLA-I alleles in the input list. D The user can choose between three PredIG models: PredIG-NeoA, with predictive performance and class imbalance optimized for cancer neoantigens; PredIG-NonCan, for epitopes from non-canonical cancer antigens; and PredIG-Path, for pathogen-derived T-cell epitopes. E The PredIG result output is a .CSV file with one pHLA-I per row and columns containing the PredIG immunogenicity score as well as the different antigenic and physicochemical features computed within the model to generate the score. This comprehensive result sheet allows the interpretation of multiple layers to guide the prioritization of candidate pHLAs
PredIG containers host the same functionalities as the webserver while increasing the throughput. To this end, the containers run locally or in clusters, isolating the software dependencies for ease of installation and guaranteeing the reproducibility of the results. As such, containers accommodate high-throughput queries of thousands of pHLAs on the minute scale running on a single CPU. Additionally, our Docker [79] and Singularity [80] containers—documented in a dedicated repository (https://github.com/BSC-CNS-EAPM/predig-containers-builder)—facilitate a straightforward installation on local computers or on high-performance computing clusters. In summary, the containers permit scalable command-line usage, ensure the reproducibility of our method in any computational set-up, and, importantly, enable a smooth integration of our tool into immunoinformatic pipelines for T-cell (neo)epitope prioritization.
Discussion
PredIG is the first predictor of CD8+ T-cell epitope immunogenicity to bridge cellular immunology and (physicochemical) antigen recognition rules into an interpretable machine learning model using explainable AI techniques. To this end, we designed an immunology-driven feature space entirely in silico by embedding antigen processing, transport, and HLA binding predictions for immunogenicity-validated pHLA-Is together with physicochemical descriptors of the presented epitopes (Fig. 1A–B). The SHAP-based feature importance analysis of the main contributors to our immunogenicity score explains PredIG’s fundamental mechanisms for T-cell epitope prediction. In addition, the results sheet we provide to end-users includes the physicochemical and antigenic metrics, enhancing the interpretability of predicted candidates to a level of comprehensiveness not reported in other methods (Fig. 6A–C).
Performance-wise, our benchmark situates PredIG in a competitive position in the T-cell epitope prediction field (Fig. 3D, F, H, and Additional File 5: Fig. S2). We obtain cutting-edge ranking and classification performance in the non-canonical cancer antigen and pathogen-derived held-out sets. In particular, the success rates observed in non-canonical cancer antigens show the capacity to generalize to a type of antigen unseen in the model training phase (PredIG only uses cancer neoantigens and pathogen-derived epitopes during training). Furthermore, we performed an extended analysis of the generalization capacity of the PredIG-Path model over multiple splits of immunogenic pHLAs derived from SARS-CoV-2. The results support that training PredIG on a diverse set of antigenic origins is useful to predict the immunogenicity of pathogen-derived pHLAs and that PredIG-Path is not affected by the extreme class imbalance in the original pathogen held-out (which contained only 5 immunogenic pHLAs).
Another important performance aspect is the capacity to refine the immunogenicity screening success rates, or ISSRs, of HLA-I binding predictions by using a stringent PredIG score threshold as a quality filter upon the best HLA-binding candidates (Fig. 5A–B, Table 4). The increase in ISSRs is reproduced across our antigen-type-specific held-out sets and becomes especially relevant in the cancer neoantigen set (Fig. 5A–B). We reason that our models take advantage of the rich T-cell assay information encoded in our training data—which includes immunogenic and non-immunogenic epitopes highly selected for HLA binding affinity—to learn to accurately rank the immunogenicity potential among strong HLA-binding epitopes. As demonstrated, this is useful both to optimize the size of immunological screenings and to filter out strong-binding epitopes with low immunogenicity scores. With this, rather than only proposing a new prioritization method, PredIG synergizes with the field and can be easily included in (neo)antigen prioritization pipelines [45, 85, 86].
Orthogonally, we show that PredIG prioritizes the screening of truly immunogenic pHLAs that are alternative to the epitopes top-ranked by HLA binding affinity tools on the same dataset (Fig. 4A–C). This can be linked to the use of T-cell assays as training data and to our immunogenicity-driven feature space, which differs from the binding assays and MS-immunopeptidomics data used to train HLA binding affinity predictors. We observed these alternative identifications in all three antigen-specific held-outs, especially the prioritization of a notable fraction of alternative immunogenic neoepitopes. This complementary capacity can broaden the scope of T-cell epitope prioritization and inform on immunogenic properties beyond binding to HLA molecules [87].
Analyzing the performance of the state-of-the-art methods included in our benchmark [34, 36, 38, 40, 67], we envision different aspects that could be influencing the results (Fig. 3D, F, H, and Additional File 5: Fig. S2). In the cancer neoantigen dataset, the ISSRs appear to be higher than commonly observed in the field [42, 43], particularly when using TLImm, MHCflurry, and NetMHCpan (Fig. 3D). In this line, even though these methods are trained on other data modalities (HLA binding assays and MS immunopeptidomics), the sequences of the neoepitopes in our validation sets are likely to be encountered in the training sets of HLA binding predictors [34, 36], which are not easily retrainable. This pseudo-data leakage could explain the inflated performances observed in our neoantigen benchmark. TLImm, a transfer learning-based method trained on T-cell assays and inferring from MS immunopeptidomics and HLA binding, obtains remarkable performance in the low-throughput section of the cancer neoantigen held-out. However, this capacity is not reproduced in the held-outs for non-canonical cancer antigens nor in pathogen antigens, for which this method is also trained.
In the non-canonical cancer antigen and pathogen-derived sets, the results observed in terms of immunogenicity success rates are more concordant with recent literature, and PredIG stably surpasses the performance of the state-of-the-art tools tested. In these cases, aside from the less likely data leakage across predictors, this behavior might be due to the more realistic immunogenicity class imbalance, including a smaller number of immunogenic pHLAs (± pHLAs = 18/542 and 5/238) than in the neoantigen set (± pHLAs = 144/3420) (Fig. 3F and H).
During the optimization of the PredIG models, we observed discordances between classification (ROCAUC and AUCPR) and ranking metrics (ISSRs) that led to different best-model selections (Fig. 3C, E, G, and Additional File 2: Fig. S1). Of course, when models are to be selected, their overarching goal needs to be factored in. In our field, aiming to find a small list of immunogenically enriched pHLA candidates for vaccine development differs from classifying the neoantigen landscape in proteome-wide studies. ROCAUC is often used as the go-to metric for model selection [83], but as shown in our analyses, ROCAUC mostly does not associate with better success rates among top-scored candidates, or ISSRs. Thus, ROCAUC does not recommend the best predictive models for the prioritization of pHLAs for immunological screenings, which are normally limited to low-to-medium throughputs (Fig. 3C, E, and G). For this reason, the immunoinformatics field would be better off using success rates among top-ranked candidates, or ISSRs, directly as a benchmarking metric. To this point, we propose that immunogenicity success rates should be reported with greater clarity, since they are often obscured by terms such as TTIF and FR (the fraction of immunogenic epitopes among the top-20 and top-100 tested candidates, respectively) [43]. To ease interpretation, we have proposed “Immunogenicity Screening Success Rates,” or ISSR for short, coupled to the number of T-cell assays to perform for a given screening (e.g., ISSR10, ISSR100). ISSRs constitute a straightforward metric for interpreting prediction accuracy at a desired experimental throughput and facilitate the interpretation of immunogenicity prediction benchmarks.
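As defined above, ISSRk is simply the fraction of truly immunogenic pHLAs among the top-k candidates ranked by a predictor. A minimal sketch of the metric, assuming a label list already sorted by predicted score:

```python
def issr(ranked_labels, k):
    """Immunogenicity Screening Success Rate at throughput k: the fraction
    of truly immunogenic pHLAs (label 1) among the top-k scored candidates.
    `ranked_labels` must be sorted by predicted score, best first."""
    top = ranked_labels[:k]
    return sum(top) / len(top)
```

Unlike ROCAUC, which averages over the full score distribution, this metric only rewards hits that land within the experimental budget k, which is why the two can disagree on the best model.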
The wealth of T-cell immunogenicity data is set to increase greatly with initiatives such as the TCR Grand Challenge [53] and the TESLA consortium [49], and via recurrent improvements of databases such as the IEDB [50] or CEDAR [51]. Computationally, the T-cell immunoinformatics field is very active, and state-of-the-art methods are continuously updated [88]. For these reasons, immunogenicity models should be adaptable to keep pace, a requirement that PredIG meets by design since (1) the core XGBoost models of PredIG are retrainable [69]; (2) the small feature space (14 descriptors) can be expanded to encode new properties without significantly increasing model complexity; and (3) the minimal information dependency (target HLA-I allele, epitope sequence, and its parental protein as a Uniprot ID or amino acid sequence) makes PredIG suitable for virtually all new datasets in T-cell epitope discovery. This last component distinguishes PredIG from proteogenomic pipelines based on large feature spaces extracted from complex multi-omics data, which often require gold-standard datasets that exist in the literature but are not available to many end-users nor in current clinical practice [89]. Furthermore, to adapt to population-level studies, PredIG enables pan-HLA-I-allele predictions by integrating ML- and structure-based HLA-I binding affinity methods (Fig. 1A). Thus, any target HLA-I allele can be studied [32, 34, 67].
Beyond generating larger T-cell assay datasets using experimental techniques with higher throughput, such as the HANSolo assay [90], efforts should be devoted to generating reliable negative data with extensive TCR repertoire testing. This should avoid labeling epitopes predicted as non-binders to HLA molecules, but not tested in T-cell assays, as “non-immunogenic” when training immunogenicity prediction models [43]. To increase epitope diversity and diminish the bias towards strong HLA binders, large screenings should also perform T-cell assays against non-strong-binding candidates, as cases of weak HLA-I binders able to elicit immune responses have already been reported [91]. Similarly, non-canonical binding modes between peptide and HLA or within TCR-pHLA triads are not captured by current HLA affinity predictors [92] and could be revealed by larger unbiased screenings. These broader screenings will expand the understanding of the factors driving T-cell immunogenicity beyond HLA binding.
Another aspect that can influence the predictive success of immunogenicity ML models is the appropriate reproduction of the cellular context in the T-cell assays used as training data. Peptide pulsing experiments—which displace epitopes bound to HLA-I molecules on the surface of cells through an excessive concentration of synthetic peptides—are widely implemented but do not mimic the intracellular conditions of antigen processing and transport [93]. This can lead to false positive epitopes that do not fulfill proteasomal cleavage motifs or TAP transport preferences [65]. Thus, despite activating T cells in vitro, these might not lead to effective T-cell responses in vivo, and their use to train immunogenicity models will be misleading. This point further supports the inclusion of antigen processing and transport predictions to detect these cases in immunoinformatic pipelines, as we included in PredIG and orthogonally validated in our feature contribution analysis via Shapley values (Fig. 6A–C).
In addition to wider and more reliable T-cell assay datasets, we argue that feature space design should follow existing immunological knowledge to bridge biology and machine learning predictions. Following the same rationale, black-box machine learning algorithms should be avoided. A further aspect to handle carefully is the annotation of HLA nomenclature and the pairing to the HLA alleles present in the sample for reliable data curation (Fig. 2A–C). We covered these steps using IMGT nomenclature [94, 95] and HLA ontologies from the IEDB [56]. Altogether, explaining predictions in fundamental immunological terms and implementing these measures will help raise the confidence of the clinical and experimental communities in immunoinformatic models (Fig. 6A–C). In return, using explainable ML models trained following immunological principles can be useful to reassess the importance of the biological processes underlying T-cell epitope immunogenicity (Fig. 6A–C) [89]. We designed the feature space of PredIG and its explainable architecture following this rationale (Figs. 1A–B, 6A–C).
Analyzing the SHAP-based feature importance within PredIG, we observed a strong contribution of the physical and chemical descriptors of the epitope, whereas these properties are often overlooked in antigen prioritization pipelines that rely heavily or exclusively on antigenic descriptors [45, 85]. Notably, this pattern was reproduced in the models optimized for cancer neoantigens, non-canonical cancer antigens, and pathogen-derived epitopes. The importance of the nature of the central residues of the epitope—described in the literature to face upwards to the TCR and linked to immunogenicity [41, 96]—is reproduced in our analysis and should also be considered for immunogenicity prediction (Fig. 6A–C).
Next, in terms of antigenic properties, TAP transport and pHLA stability predictions deserve a closer evaluation (Figs. 3D, F, H, and 5A–C). Despite the slow development of TAP prediction methods—NetCTLpan was published in 2010 [65, 66]—the prediction of TAP obtains a relevant Shapley feature importance within PredIG (Fig. 6A–C). This may imply that TAP transport is not captured by antigen processing metrics trained on MS-immunopeptidomics data and calls for the further development of TAP translocation predictors. Similarly, the prediction of pHLA stability, here using NetMHCstab [40], reached strong success rates when applied to cancer neoantigens (Fig. 3D). All of this encourages the generation of larger peptide transport and pHLA complex stability datasets to update these methods, as well as their inclusion in immunogenicity pipelines.
Finally, to ensure the implementation of computational methodologies in the T-cell immunology community, it is relevant to make tools easily accessible and highly reproducible for both small- and large-scale studies. Thus, we have made PredIG available through a user-friendly webserver that enables non-programmers to interrogate their candidate epitopes with up to 5000 pHLAs per submission, to perform pan-HLA-I-allele predictions, and to use the antigen-specific PredIG models (Fig. 7). Complementarily, containerized environments (Docker [79] and Singularity [80]) enable all PredIG functionalities while facilitating installation, reproducibility, and scaling up to large-scale studies. In addition, the Singularity container permits the implementation of the tool on high-performance computing clusters. In both cases, the flexibility of input types—both pHLA pairs and protein sequences—suits PredIG predictions for peptide screenings on canonical proteins annotated in Uniprot and on recombinant proteins, for instance those containing tumor variants or introducing mutations to stabilize structural conformations of certain immunogens (Fig. 7).
Conclusions
Overall, our results support the utility of PredIG to increase and refine immunogenicity success rates for the discovery of actionable T-cell epitopes in cancer and infection. We demonstrate the capacity to generalize to non-canonical cancer antigens, a type of antigen unseen during PredIG’s training, as well as cutting-edge performance in pathogens. Complementing this predictive capacity, the use of a feature space rich in T-cell immunology information and the implementation of explainable AI techniques – XGBoost and SHapley Additive exPlanations (SHAP) – equip PredIG immunogenicity scores with an unprecedented biological interpretability. With this, we pinpointed the large contribution of often overlooked epitope features, such as peptide physicochemical properties and antigen processing and transport capacities. We envision that the integration of PredIG into immunoinformatic pipelines for T-cell epitope prioritization will be eased by its minimal input requirements (peptide sequence, target HLA-I allele, and parental protein as a Uniprot ID or amino acid sequence), its high throughput for epitope screening on large lists of pHLAs or on full protein sequences, and its complementary function as a quality filter to refine HLA binding affinity predictions. To this end, the deployment of PredIG in containerized environments (Singularity and Docker) and in a streamlined webserver will facilitate broader usage. Altogether, we consider that these capabilities position PredIG as a valuable method in the bioinformatic toolkit of the immunology, vaccinology, and immuno-oncology communities.
Supplementary Information
Additional file 1: Table S1. XGBoost hyperparameter space optimized during PredIG development to tune for immunogenicity class imbalance. "PredIG Model" lists the different PredIG-SPW optimizations, labeled using specific class imbalance settings of Scale Pos Weight (SPW) [69]. XGBoost models with the suffix "_POS" refer to SPW calculations based only on the gradient of class imbalance explored to optimize the performance against the cancer neoantigen, non-canonical antigen, and pathogen sets. Models with the suffix "RS" refer to the SPW-POS gradient corrected using the class imbalance in the training set. "Optimal Hyperparameters" lists the values determined as the best-performing choice by 5CV Grid Search or ParBayesian optimization routines. "Hyperparameter Space" lists the range of values allowed to be explored for each hyperparameter during the optimization of the PredIG models.
Additional file 2: Fig. S1. Extended statistical analysis for the optimization of the PredIG-SPW XGBoost models [73] implementing a gradient of Scale Pos Weight (SPW) [73]. The Scale Pos Weight hyperparameter increases the attention or weight over positive instances in the loss function that XGBoost models optimize to maximize prediction performance. The SPW values were set using two different strategies: SPW-POS and SPW-RS. SPW-POS is calculated without accounting for the existing class imbalance in the training set and takes positive values. SPW-RS rescales the gradient of SPW-POS values using the class imbalance in the training set and takes values from 0 to 1 (see Methods). In addition, multiple voting schemes [74] and linear ensembles combining different sets of SPW optimizations were explored [67]. Statistically, the performance is evaluated in terms of discrimination capacity over the entire dataset (ROCAUC - reds) and precision over recall (AUCPR - oranges), and in terms of prioritization of ground-truth immunogenic epitopes amongst top-ranked candidates, or immunogenicity screening success rates (ISSR10, 25, 50, 100, 200, 400 and 1000), colored in green, blue and purple for low-, high- and averaged screening throughputs respectively (e.g. number of T-cell assays to perform). The individual ISSRs explored are adapted to the size of the corresponding held-out dataset (e.g. ISSR1000 is not calculated for non-canonical cancer antigens and pathogens because these held-outs contain fewer than 1000 pHLAs). The heatmap color-scale is harmonically set across metrics so that the darker the color, the better the score.
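For context on the hyperparameter discussed above, a common heuristic sets XGBoost's scale_pos_weight to the negative-to-positive ratio of the training set, up-weighting the minority class in the loss. The sketch below shows that conventional heuristic only; the exact SPW-POS and SPW-RS formulas explored for PredIG are described in the Methods and are not reproduced here.

```python
def default_scale_pos_weight(labels):
    """Conventional XGBoost heuristic for class imbalance: the ratio of
    negative to positive training instances (larger values up-weight the
    minority immunogenic class in the loss). Illustrative only; not the
    SPW-POS/SPW-RS gradients optimized for PredIG."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos
```

At the neoantigen-like imbalance reported in the text (144 positives out of 3420 pHLAs), this heuristic would already suggest values far above 1, motivating the exploration of an SPW gradient rather than a single default.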
Additional file 3: Table S2. PredIG linear model ensembles for extended immunogenicity class imbalance assessment. PredIG linear ensembles (RIDGE, LASSO and ELASTIC) were trained combining individual PredIG models for class imbalance correction. Lasso linear models use a strong regularization penalty that sets to zero those features – pHLA descriptors – that would be assigned a minimal contribution to the prediction. Ridge uses a softer regularization penalty that minimizes such features but precludes neglecting their weights as zeros. Elastic net is a form of mixed regularization between Lasso and Ridge. In the ensemble models, the suffix "All" implies the ensembling of 8 PredIG optimizations (SPW-Pos + SPW-Rescaled); the suffix "Pos" implies ensembling the 4 SPW-Pos optimizations; and the suffix "Rescaled" ensembles the 4 SPW-Rescaled models.
Additional file 4: Table S3. Quantification of the immunogenic enrichment achieved by establishing the PredIG Youden score as a refinement threshold amongst top-200 HLA binding candidates per antigen-type held-out (p-values for Fisher's exact test or Test of Equal or Given Proportions). PredIG Youden indexes correspond to the maximum discrimination point on the ROC curve given the score distribution of a model. For the Cancer Neoantigens held-out, a Fisher's exact test (one-sided, greater) was performed comparing the immunogenicity of top-200 binding neoantigens above and below the PredIG score threshold corresponding to the Youden index of the PredIG-NeoA model (PredIG score = 0.956). For the Cancer Non-Canonical Antigens held-out, a Test of Equal or Given Proportions (one-sided, greater) was performed comparing the immunogenicity of top-200 binding non-canonical cancer antigens above the PredIG score threshold corresponding to the Youden index of the PredIG-NonCan model (PredIG score = 0.146). For the Pathogen held-out, a Test of Equal or Given Proportions (one-sided, greater) was performed comparing the immunogenicity of top-100 binding pathogen antigens above the PredIG score threshold corresponding to the Youden index of the PredIG-Path model (PredIG score = 0.192). Given the size of the pathogen held-out (n = 243 pHLAs), the test was performed for top-100 binders instead of top-200. The Test of Equal or Given Proportions was used when one of the positions in the confusion matrix contained fewer than 5 cases.
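The two statistics behind this table can be sketched with standard-library Python (helper names are hypothetical): the Youden threshold is the score maximizing sensitivity + specificity − 1, and the one-sided Fisher's exact test is a hypergeometric tail sum over the 2x2 immunogenicity table.

```python
from math import comb

def youden_threshold(scores, labels):
    """Return the score threshold maximizing the Youden index
    J = sensitivity + specificity - 1 over the observed scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / pos + tn / neg - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t

def fisher_exact_greater(a, b, c, d):
    """One-sided (greater) Fisher's exact test for the 2x2 table
    [[a, b], [c, d]]: probability of a count >= a in the top-left cell
    (e.g. immunogenic candidates above the PredIG threshold) under the
    hypergeometric null."""
    n, row1, col1 = a + b + c + d, a + b, a + c
    return sum(comb(col1, x) * comb(n - col1, row1 - x)
               for x in range(a, min(row1, col1) + 1)) / comb(n, row1)
```

For small expected counts the paper switches to the Test of Equal or Given Proportions; that chi-squared-based test is not reproduced here.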
Additional file 5: Fig. S2. Full benchmark comparing the best PredIG-SPW optimizations per antigen type versus state-of-the-art methodologies for T-cell epitope prediction. The best PredIG models (PredIG-NeoA, PredIG-NonCan and PredIG-Path) are compared against state-of-the-art methods for T-cell epitope prioritization. The SOTA methods in our benchmark include NOAH, NetMHCpan v4.1 [36] and MHCflurry v2.0 [34] for HLA-I binding, NetMHCstab [40] for pHLA stability, and PRIME v2.0 [38] and TLImm [76] for T-cell epitope immunogenicity. The statistical evaluation of the benchmark and color-scale visualization are the same as described in Additional file 2: Fig. S1.
Additional file 6: Fig. S3. PredIG-Path generalization to a diverse set of Pathogen-Splits including novel immunogenic pHLAs in the pathogen held-out. We generated 10 additional held-out datasets (termed Pathogen-Splits) by exchanging the 5 positive cases in our pathogen held-out for different sets of novel pHLAs with validated positive immunogenicity derived from SARS-CoV-2. These T-cell epitopes were deposited at IEDB from publications indexed after 01/01/2024 [50]. This analysis tests the capacity of the PredIG-Path model to rank these novel immunogenic pHLAs with a performance similar to that on the original pathogen held-out. The novel pHLAs were filtered to avoid any data leakage with the training sets and the held-outs, making these positive instances entirely unseen by PredIG. We organized these pHLAs into the novel Pathogen-Splits by including 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50 different immunogenic pHLAs in each split, leading to 292 novel immunogenic pHLAs. Thus, these novel splits are entirely independent in terms of positive cases. The negative pHLAs were maintained from the original pathogen held-out (n = 238 non-immunogenic pHLAs). This enabled the assessment of PredIG-Path performance across a gradient of decreasing immunogenicity class imbalance (larger splits have more positive instances relative to the fixed number of negative pHLAs). Furthermore, the pHLAs in the splits were stratified to maximize diversity in terms of presenting HLA-I alleles. Overall, these results confirm the generalization capacity of PredIG-Path over a range of novel positive pHLAs derived from SARS-CoV-2.
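A minimal sketch of how one such split could be assembled, assuming round-robin sampling over presenting alleles as a stand-in for the paper's stratification (function and argument names are hypothetical, and the paper's exact selection logic may differ):

```python
def build_split(positives_by_allele, n_pos, negatives):
    """Assemble one Pathogen-Split: draw n_pos positive pHLAs
    round-robin across HLA-I alleles (a simple way to maximize allele
    diversity) and pool them with the fixed negative set."""
    chosen = []
    while len(chosen) < n_pos:
        progressed = False
        for pool in positives_by_allele.values():
            if pool and len(chosen) < n_pos:
                chosen.append(pool.pop(0))  # take one peptide from this allele
                progressed = True
        if not progressed:  # every allele pool is drained
            break
    return chosen + list(negatives)
```

Calling this with n_pos of 5, 10, ..., 50 against the same fixed negatives reproduces the decreasing-imbalance gradient described above.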
Additional file 7: Table S4. Exploring a gradient of PredIG scores to define a quality filter for the refinement of HLA-I binding affinity predictions. "Ranking" refers to the threshold of PredIG immunogenicity score implemented as a quality filter (e.g. predig70 indicates a score threshold of 0.70), where TOP100 indicates the baseline ISSR100 for each tool. The numeric columns indicate the number of immunogenic epitopes amongst the top-100 (ISSR100) in each condition. The gradient of PredIG scores explored, coupled to the behavior of the success rate, reflects the refinement capacity of PredIG as a quality filter and pinpoints the method- and antigen-specific threshold choices. Top) Cancer neoantigens. Middle) Non-canonical cancer antigens. Bottom) Pathogen antigens.
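The refinement logic of this table can be sketched as follows (names are hypothetical): candidates are ranked by HLA-I binding, filtered by a PredIG score threshold, and the truly immunogenic epitopes among the surviving top-k are counted.

```python
def refine_top_binders(candidates, predig_threshold, k=100):
    """Each candidate is (hla_binding_rank, predig_score, is_immunogenic),
    with rank 1 the strongest predicted binder. Keep only candidates whose
    PredIG score clears the threshold and count the truly immunogenic
    epitopes among the first k survivors (an ISSR100-style tally)."""
    ranked = sorted(candidates, key=lambda c: c[0])          # best binders first
    kept = [c for c in ranked if c[1] >= predig_threshold]   # PredIG quality filter
    return sum(c[2] for c in kept[:k])
```

Sweeping `predig_threshold` over a grid (e.g. 0.50 to 0.95) and comparing the resulting counts to the unfiltered TOP100 baseline reproduces the gradient explored in the table.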
Additional file 8. PredIG datasets and XGBoost models. The datasets used for training are deposited at https://github.com/BSC-CNS-EAPM/PredIG, in the data section under the titles: predig_train_modf.csv, predig_test_modf.csv, predig_i1_modf.csv for the cancer neoantigen held-out; predig_i2_modf.csv for the cancer non-canonical antigen held-out and predig_i3_modf.csv for the pathogen held-out. The XGBoost models created are available as: predig_neoant.model for the PredIG model optimized for cancer neoantigens; predig_noncan.model for the PredIG model optimized for non-canonical cancer antigens; and predig_path.model for the PredIG model optimized for pathogens.
Acknowledgements
We thank our colleagues, collaborators, and funders for their invaluable support.
Authors' contributions
R.F.D. is the first author and was responsible for writing and curating the manuscript, conceptualizing the work, retrieving and curating the datasets, developing the predictive models and code, performing all analyses, building the figures, and coordinating model deployment. C.D.D. and A.C.S. deployed the models in containerized environments and on the webserver. M.V. participated in the conceptualization of the project and the curation of the final manuscript. E.P.P. and V.G. coordinated the project and supervised the final curation of the manuscript. R.F.D., E.P.P. and V.G. are corresponding authors. All authors read and approved the final manuscript.
Funding
R.F.D., C.D.D., A.C.S., and V.G. received funding to support this project from ARNmVSRVAC_NBD Misiones, CDTI – Misiones Science and Innovation 2021, Ministerio de Ciencia e Innovación (Ministry of Science and Innovation), Spanish Government, CN044800. R.F.D. and V.G. also received funding to support this project from the Limiting West Nile Virus impact by novel vaccines and therapeutics approaches (LWNVIVAT) project, HORIZON-HLTH-2023-DISEASE-03-18, project number 101137248. E.P.P. is supported by the Spanish Ministry of Science (PID2024-159258OB-I00 and RYC2019-026415-I MICIU/AEI/10.13039/501100011033, 'El FSE invierte en tu futuro'), Fundación FERO-ASEICA (BFERO2002.6) and Fundació Josep Carreras Contra la Leucemia. I.J.C. is supported by MCIU as a Centro de Excelencia Severo Ochoa (CEX2023-001258-S, MCIN/AEI/10.13039/501100011033). The remaining authors received no specific funding for this project.
Data availability
All data and models supporting the findings of this study are available within the paper and its Supplementary Information. All PredIG code and datasets are available at: https://github.com/BSC-CNS-EAPM/PredIG. The software containers are available at: https://hub.docker.com/repository/docker/bsceapm/predig and https://github.com/BSC-CNS-EAPM/predig-containers/. The PredIG webserver is available at: https://horus.bsc.es/predig.
Declarations
Ethics approval and consent to participate
This work exclusively used publicly available data.
Consent for publication
Not applicable.
Competing interests
The authors declare the following competing financial interest(s): At the time the work described in this manuscript was carried out, V.G. and C.D.D. were employees of Nostrum Biodiscovery. The remaining authors declare that they have no conflicts of interest.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Roc Farriol-Duran, Email: roc.farriol@bsc.es.
Eduard Porta-Pardo, Email: eporta@carrerasresearch.org.
Víctor Guallar, Email: victor.guallar@bsc.es.
References
- 1.Schumacher TN, Schreiber RD. Neoantigens in cancer immunotherapy. Science. 2015;348:69–74. [DOI] [PubMed] [Google Scholar]
- 2.Ribas A, Wolchok JD. Cancer immunotherapy using checkpoint blockade. Science. 2018. 10.1126/science.aar4060 [DOI] [PMC free article] [PubMed]
- 3.Moss P. The T cell immune response against SARS-CoV-2. Nat Immunol. 2022;23:186–93. [DOI] [PubMed] [Google Scholar]
- 4.Gangaev A, et al. Identification and characterization of a SARS-CoV-2 specific CD8+ T cell response with immunodominant features. Nat Commun. 2021;12:1–14. [DOI] [PMC free article] [PubMed]
- 5.Sahin U, et al. Personalized RNA mutanome vaccines mobilize poly-specific therapeutic immunity against cancer. Nature. 2017;547:222–6. [DOI] [PubMed] [Google Scholar]
- 6.Keskin DB, et al. Neoantigen vaccine generates intratumoral T cell responses in phase Ib glioblastoma trial. Nature. 2019;565:234–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Hu Z, et al. Personal neoantigen vaccines induce persistent memory T cell responses and epitope spreading in patients with melanoma. Nat Med. 2021:1–11. 10.1038/s41591-020-01206-4. [DOI] [PMC free article] [PubMed]
- 8.Rojas LA, et al. Personalized RNA neoantigen vaccines stimulate T cells in pancreatic cancer. Nature. 2023;618:144–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Sahin U, et al. An RNA vaccine drives immunity in checkpoint-inhibitor-treated melanoma. Nature. 2020;585:107–12. [DOI] [PubMed] [Google Scholar]
- 10.Arieta CM, et al. The T-cell-directed vaccine BNT162b4 encoding conserved non-spike antigens protects animals from severe SARS-CoV-2 infection. Cell. 2023;186:2392-2409.e21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Yewdell JW. Antigenic drift: understanding COVID-19. Immunity. 2021;54:2681–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Verdegaal EME, et al. Neoantigen landscape dynamics during human melanoma-T cell interactions. Nature. 2016;536:91–5. [DOI] [PubMed] [Google Scholar]
- 13.Al Bakir M, et al. Clonal driver neoantigen loss under EGFR TKI and immune selection pressures. Nature. 2025;639:1052–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Kardani K, Hashemi A, Bolhassani A. Comparison of HIV-1 Vif and Vpu accessory proteins for delivery of polyepitope constructs harboring Nef, Gp160 and P24 using various cell penetrating peptides. PLoS ONE. 2019;14:e0223844. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Feng T, et al. Immunity of two novel hepatitis C virus polyepitope vaccines. Vaccine. 2022;40:6277–87. [DOI] [PubMed] [Google Scholar]
- 16.Wells TJ, Esposito T, Henderson IR, Labzin LI. Mechanisms of antibody-dependent enhancement of infectious disease. Nat Rev Immunol. 2024. 10.1038/s41577-024-01067-9. [DOI] [PubMed]
- 17.Bentzen AK, et al. Large-scale detection of antigen-specific T cells using peptide-MHC-I multimers labeled with DNA barcodes. Nat Biotechnol. 2016;34:1037–45. [DOI] [PubMed]
- 18.Gurung HR, et al. Systematic discovery of neoepitope–HLA pairs for neoantigens shared among patients and tumor types. Nat Biotechnol. 2023:1–11. 10.1038/s41587-023-01945-y. [DOI] [PMC free article] [PubMed]
- 19.Tran E, et al. Immunogenicity of somatic mutations in human gastrointestinal cancers. Science. 2015;350:1387–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Gros A, et al. Recognition of human gastrointestinal cancer neoantigens by circulating PD-1+ lymphocytes. J Clin Invest. 2019;129:4992–5004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.De Mattos-Arruda L, et al. Neoantigen prediction and computational perspectives towards clinical benefit: recommendations from the ESMO precision medicine working group. Ann Oncol. 2020;31:978–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Lang F, Schrörs B, Löwer M, Türeci Ö, Sahin U. Identification of neoantigens for individualized therapeutic cancer vaccines. Nat Rev Drug Discov. 2022;21:261–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Addala V, et al. Computational immunogenomic approaches to predict response to cancer immunotherapies. Nat Rev Clin Oncol. 2023:1–19. 10.1038/s41571-023-00830-6. [DOI] [PubMed]
- 24.Bjerregaard A-M, et al. An analysis of natural T cell responses to predicted tumor neoepitopes. Front Immunol. 2017. 10.3389/fimmu.2017.01566. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Bjerregaard AM, et al. Corrigendum: an analysis of natural T cell responses to predicted tumor neoepitopes. Front Immunol. 2018;9:1007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Bonsack M, et al. Performance evaluation of MHC class-I binding prediction tools based on an experimentally validated MHC–peptide binding data set. Cancer Immunol Res. 2019;7:719–36. [DOI] [PubMed] [Google Scholar]
- 27.Roudko V, Greenbaum B, Bhardwaj N. Computational prediction and validation of tumor-associated neoantigens. Front Immunol. 2020;11:27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.McGranahan N, et al. Clonal neoantigens elicit T cell immunoreactivity and sensitivity to immune checkpoint blockade. Science. 2016;351:1463–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.McGranahan N, Swanton C. Neoantigen quality, not quantity. Sci Transl Med. 2019;11:eaax7918. [DOI] [PubMed]
- 30.Mahmud S, et al. Designing a multi-epitope vaccine candidate to combat MERS-CoV by employing an immunoinformatics approach. Sci Rep. 2021;11:15431. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Bhattacharya M, et al. Development of epitope-based peptide vaccine against novel coronavirus 2019 (SARS-COV-2): immunoinformatics approach. J Med Virol. 2020;92:618–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Nilsson JB, Grifoni A, Tarke A, Sette A, Nielsen M. PopCover-2.0. Improved selection of peptide sets with optimal HLA and pathogen diversity coverage. Front Immunol. 2021;12:728936. [DOI] [PMC free article] [PubMed]
- 33.O’Donnell TJ, et al. MHCflurry: open-source class I MHC binding affinity prediction. Cell Syst. 2018;7:129-132.e4. [DOI] [PubMed] [Google Scholar]
- 34.O’Donnell TJ, Rubinsteyn A, Laserson U. MHCflurry 2.0: improved pan-allele prediction of MHC class I-presented peptides by incorporating antigen processing. Cell Syst. 2020;11:42–48.e7. [DOI] [PubMed]
- 35.Shao XM, et al. High-throughput prediction of MHC class I and II neoantigens with MHCnuggets. Cancer Immunol Res. 2020;8:396–408. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Reynisson B, Alvarez B, Paul S, Peters B, Nielsen M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2020;48:W449–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Sarkizova S, et al. A large peptidome dataset improves HLA class I epitope prediction across most of the human population. Nat Biotechnol. 2019;38(2):199. 10.1038/s41587-019-0322-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Gfeller D, Schmidt J, Croce G, Guillaume P, Bobisse S, Genolet R, Queiroz L, Cesbron J, Racle J, Harari A. Improved predictions of antigen presentation and TCR recognition with MixMHCpred2.2 and PRIME2.0 reveal potent SARS-CoV-2 CD8+ T-cell epitopes. Cell Syst. 2023;14:72–83.e5. 10.1016/j.cels.2022.12.002. [DOI] [PMC free article] [PubMed]
- 39.Harndahl M, et al. Peptide-MHC class I stability is a better predictor than peptide affinity of CTL immunogenicity. Eur J Immunol. 2012;42:1405–16. [DOI] [PubMed] [Google Scholar]
- 40.Jørgensen KW, Rasmussen M, Buus S, Nielsen M. NetMHCstab – predicting stability of peptide–MHC-I complexes; impacts for cytotoxic T lymphocyte epitope discovery. Immunology. 2014;141:18–26. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Schmidt J, et al. Prediction of neo-epitope immunogenicity reveals TCR recognition determinants and provides insight into immunoediting. Cell Rep Med. 2021;2:100194. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Deng J, et al. IEPAPI: a method for immune epitope prediction by incorporating antigen presentation and immunogenicity. Brief Bioinform. 2023:bbad171. 10.1093/bib/bbad171. [DOI] [PubMed]
- 43.Müller M, et al. Machine learning methods and harmonized datasets improve immunogenic neoantigen prediction. Immunity. 2023;56:2650-2663.e6. [DOI] [PubMed] [Google Scholar]
- 44.Kula T, et al. T-scan: a genome-wide method for the systematic discovery of T cell epitopes. Cell. 2019;178:1016-1028.e13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Lee CH, et al. A robust deep learning workflow to predict CD8 + T-cell epitopes. Genome Med. 2023;15:70. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Zhou C, et al. PTuneos: prioritizing tumor neoantigens from next-generation sequencing data. Genome Med. 2019;11:67. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Lu M, et al. dbPepNeo2.0: a database for human tumor neoantigen peptides from mass spectrometry and TCR recognition. Front Immunol. 2022;13:855976. [DOI] [PMC free article] [PubMed]
- 48.Zhang G, Chitkushev L, Olsen LR, Keskin DB, Brusic V. TANTIGEN 2.0: a knowledge base of tumor T cell antigens and epitopes. BMC Bioinformatics. 2021;22:40. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Wells DK, et al. Key parameters of tumor epitope immunogenicity revealed through a consortium approach improve neoantigen prediction. Cell. 2020;183:818-834.e13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Vita R, et al. The immune epitope database (IEDB): 2018 update. Nucleic Acids Res. 2019;47:D339–43. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Koşaloğlu-Yalçın Z, et al. The cancer epitope database and analysis resource (CEDAR). Nucleic Acids Res. 2023;51:D845–52. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Dhanda SK, Malviya J, Gupta S. Not all T cell epitopes are equally desired: a review of in silico tools for the prediction of cytokine-inducing potential of T-cell epitopes. Brief Bioinform. 2022;23:bbac382. [DOI] [PubMed]
- 53.T-cell receptors | Cancer grand challenges. https://www.cancergrandchallenges.org/challenges/active-challenges/t-cell-receptors.
- 54.Olsen LR, et al. TANTIGEN: a comprehensive database of tumor T cell antigens. Cancer Immunol Immunother. 2017;66:731–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Lozano-Rabella M, et al. Exploring the immunogenicity of noncanonical HLA-I tumor ligands identified through proteogenomics. Clin Cancer Res. 2023;29:2250–65. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Vita R, et al. An ontology for major histocompatibility restriction. J Biomed Semantics. 2016;7:1–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Kuiken C, Thurmond J, Dimitrijevic M, Yoon H. The LANL hemorrhagic fever virus database, a new platform for analyzing biothreat viruses. Nucleic Acids Res. 2012;40:D587-592. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Yusim K, et al. Los Alamos hepatitis C immunology database. Appl Bioinformatics. 2005;4:217–25. [DOI] [PubMed] [Google Scholar]
- 59.Ogishi M, Yotsuyanagi H. Quantitative prediction of the landscape of T cell epitope immunogenicity in sequence space. Front Immunol. 2019;10:827. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Bai P, et al. Immune-based mutation classification enables neoantigen prioritization and immune feature discovery in cancer immunotherapy. Oncoimmunology. 2021;10:1868130. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.HLA Nomenclature @ hla.alleles.org. http://hla.alleles.org/nomenclature/naming.html.
- 62.Kuiken C, Yusim K, Boykin L, Richardson R. The Los Alamos hepatitis C sequence database. Bioinformatics. 2005;21:379–84. [DOI] [PubMed] [Google Scholar]
- 63.Amengual-Rigo P, Guallar V. NetCleave: an open-source algorithm for predicting C-terminal antigen processing for MHC-I and MHC-II. Sci Rep. 2021;11:1–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Farriol-Duran R, Vallejo-Vallés M, Amengual-Rigo P, Floor M, Guallar V. NetCleave: an open-source algorithm for predicting C-terminal antigen processing for MHC-I and MHC-II. In: Computational vaccine design (ed. Reche, P. A.) 211–226 (Springer US, New York, NY, 2023). 10.1007/978-1-0716-3239-0_15. [DOI] [PubMed]
- 65.Peters B, Bulik S, Tampe R, Van Endert PM, Holzhütter H-G. Identifying MHC class I epitopes by predicting the TAP transport efficiency of epitope precursors. J Immunol. 2003;171:1741–9. [DOI] [PubMed] [Google Scholar]
- 66.Stranzl T, Larsen MV, Lundegaard C, Nielsen M. NetCTLpan: pan-specific MHC class I pathway epitope predictions. Immunogenetics. 2010;62:357–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Barajas A, et al. Virus-like particle-mediated delivery of structure-selected neoantigens demonstrates immunogenicity and antitumoral activity in mice. J Transl Med. 2024;22:14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Osorio D, Rondón-Villarreal P, Torres R. Peptides: a package for data mining of antimicrobial peptides. R J. 2015;7:4–14. [Google Scholar]
- 69.Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016:785–794. 10.1145/2939672.2939785.
- 70.Berman HM. The protein data bank. Nucleic Acids Res. 2000;28:235–42. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Chowell D, et al. TCR contact residue hydrophobicity is a hallmark of immunogenic CD8+ T cell epitopes. Proc Natl Acad Sci U S A. 2015;112:E1754–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Grinsztajn L, Oyallon E, Varoquaux G. Why do tree-based models still outperform deep learning on typical tabular data? Adv Neural Inf Process Syst. 2022.
- 73.xgboost package - RDocumentation. https://www.rdocumentation.org/packages/xgboost/versions/1.7.3.1.
- 74.Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22. [PMC free article] [PubMed] [Google Scholar]
- 75.Ma H, Bandos AI, Rockette HE, Gur D. On use of partial area under the ROC curve for evaluation of diagnostic performance. Stat Med. 2013;32:3449–58. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Fasoulis R, Rigo MM, Antunes DA, Paliouras G, Kavraki LE. Transfer learning improves pMHC kinetic stability and immunogenicity predictions. ImmunoInformatics. 2024;13:100030. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Sharma P, Singh G, Kaur N. Data valuation using Shapley value in machine learning. AIP Conf Proc. 2023;2916:030001. [Google Scholar]
- 78.The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 2018;46:2699. [DOI] [PMC free article] [PubMed]
- 79.Merkel D. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014;2014:2.
- 80.Kurtzer GM, Sochat V, Bauer MW. Singularity: scientific containers for mobility of compute. PLoS ONE. 2017;12:e0177459. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81.Capietto AH, et al. Mutation position is an important determinant for predicting cancer neoantigens. J Exp Med. 2020;217:e20190179. [DOI] [PMC free article] [PubMed]
- 82.Rock KL, Reits E, Neefjes J. Present yourself! by MHC class I and MHC class II molecules. Trends Immunol. 2016;37:724–37. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Richardson E, et al. The receiver operating characteristic curve accurately assesses imbalanced datasets. PATTER. 2024;5:100994. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Strobl C, Boulesteix A-L, Augustin T. Unbiased split selection for classification trees based on the Gini index. Comput Stat Data Anal. 2007;52:483–501. [Google Scholar]
- 85.Gartner JJ, et al. A machine learning model for ranking candidate HLA class I neoantigens based on known neoepitopes from multiple human tumor types. Nat Cancer. 2021;2:563–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Hundal J, et al. PVACtools: a computational toolkit to identify and visualize cancer neoantigens. Cancer Immunol Res. 2020;8:409–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 87.Calis JJA, et al. Properties of MHC class I presented peptides that enhance immunogenicity. PLoS Comput Biol. 2013;9:e1003266. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.Buckley PR, et al. Evaluating performance of existing computational models in predicting CD8+ T cell pathogenic epitopes and cancer neoantigens. Brief Bioinform. 2022;23:bbac141. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89.Yao N, Greenbaum BD. Trade-offs inside the black box of neoantigen prediction. Immunity. 2023;56:2466–8. [DOI] [PubMed] [Google Scholar]
- 90.Cattaneo CM, et al. Identification of patient-specific CD4+ and CD8+ T cell neoantigens through HLA-unbiased genetic screens. Nat Biotechnol. 2023:1–5. 10.1038/s41587-022-01547-0. [DOI] [PMC free article] [PubMed]
- 91.Wang Y, et al. Weak binder for MHC molecule is a potent Mycobacterium tuberculosis-specific CTL epitope in the context of HLA-A24 allele. Microb Pathog. 2012;53:162–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.Guillaume P, et al. The c-terminal extension landscape of naturally presented HLA-I ligands. Proc Natl Acad Sci U S A. 2018;115:5083–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93.Rius C, et al. Peptide–MHC class I tetramers can fail to detect relevant functional T cell clonotypes and underestimate antigen-reactive T cell populations. J Immunol. 2018. 10.4049/jimmunol.1700242. [DOI] [PMC free article] [PubMed]
- 94.Lefranc MP, et al. IMGT unique numbering for immunoglobulin and T cell receptor constant domains and Ig superfamily C-like domains. Dev Comp Immunol. 2005;29:185–203. [DOI] [PubMed] [Google Scholar]
- 95.Barker DJ, et al. The IPD-IMGT/HLA database. Nucleic Acids Res. 2023;51(D1):D1053–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96.Szeto C, Lobos CA, Nguyen AT, Gras S. TCR recognition of peptide-MHC-I: rule makers and breakers. Int J Mol Sci. 2020;22:1–26. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Additional file 1: Table S1. XGBoost hyperparameter space optimized during PredIG development to tune for immunogenicity class imbalance. "PredIG Model" lists the different PredIG-SPW optimizations, labeled using the specific class imbalance settings of Scale Pos Weight (SPW) [69]. XGBoost models with the suffix "POS" refer to SPW calculations based only on the gradient of class imbalance explored to optimize performance against the cancer neoantigen, non-canonical antigen and pathogen sets. Models with the suffix "RS" refer to the SPW-POS gradient corrected using the class imbalance in the training set. "Optimal Hyperparameters" lists the values determined as the best-performing choices by 5CV Grid Search or ParBayesian optimization routines. "Hyperparameter Space" lists the range of values explored for each hyperparameter during the optimization of the PredIG models.
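A skeletal grid search of this kind can be written in a few lines, with `fit_and_score` standing in for a 5-fold cross-validated XGBoost training run returning a mean CV score (names are illustrative, not the paper's code):

```python
from itertools import product

def grid_search(param_grid, fit_and_score):
    """Enumerate the Cartesian product of a hyperparameter space and keep
    the best-scoring combination. `param_grid` maps hyperparameter names
    (e.g. max_depth, scale_pos_weight) to candidate value lists."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = fit_and_score(params)  # e.g. mean 5CV AUCPR of an XGBoost fit
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Bayesian optimization replaces the exhaustive enumeration with a surrogate model that proposes the next hyperparameter point, which is cheaper when the grid is large.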
Additional file 2: Fig. S1. Extended statistical analysis for the optimization of the PredIG-SPW XGBoost models [73] implementing a gradient of Scale Pos Weight (SPW) [73]. The Scale Pos Weight hyperparameter increases the attention or weight over positive instances in the loss function that XGBoost models optimize to maximize prediction performance.The SPW values were set using two different strategies: SPW-POS and SPW-RS. SPW-POS is calculated without accounting for the existing class imbalance in the training set and takes positive values. SPW-RS rescales the gradient of SPW-POS values using the class imbalance in the training set and takes values from 0 to 1 (See Methods). In addition, multiple voting schemes [74] and linear ensembles combining different sets of SPW-optimizations were explored [67]. Statistically, the performance is evaluated in terms of discrimination capacity over the entire dataset (ROCAUC - reds) and precision over recall (AUCPR - oranges) and in terms prioritization of ground truth immunogenic epitopes amongst top-ranked candidates or immunogenicity screening success rates (ISSR10, 25, 50, 100, 200, 400 and 1000) colored in green, blue and purple for low-, high- and averaged screening throughtputs respectively (e.g. number of T-cell assays to perform). The individual ISSRs explored are adapted to the size of the corresponding held-out dataset (e.g. ISSR1000 is not calculated for non-canonical cancer antigens and pathogens because the size of these held-outs is smaller than 1000). The heatmap color-scale is harmonically set accross metrics so that the darker the color the better the score.
Additional file 3: Table S2. PredIG linear model ensembles for extended immunogenicity class imbalance assessment. PredIG linear ensembles (RIDGE, LASSO and ELASTIC) trained combining individual PredIG models for class imbalance correction. Lasso linear models use a strong regularization penaly that sets to zero those features – pHLA descriptors – that would be assigned a minimal contribution to the prediction. Ridge uses a softer regularization penalty that minimizes such features but precludes neglecting their weight as zeros. Elastic net is a form of mixed regularization between Lasso and Ridge. In the ensemble models, the suffix "All" implies the ensembling of 8 PredIG optimizations (SPW-Pos + SPW-Rescaled); the suffix "Pos" implies ensembling the 4 SPW-Pos optimitzations and the suffix "Rescaled" assembles the 4 SPW-Rescaled models.
Additional file 4: Table S3. Quantification of the immunogenic enrichment achieved establishing PredIG Youden score as a refinement threshold amongst top-200 HLA binding candidates per antigen-type held-out (p-values for Fisher's exact test or Test of Equal of Given Proportions). PredIG Youden indexes correspond to the maximum discrimination point in the ROCAUC curve given the score distribution of a model. For the Cancer Neoantigens held-out, a Fisher's exact test (one sided - greater) was performed comparing the immunogenicity of top-200 binding neoantigens above and below the PredIG score threshold corresponding to the Youden Index of the PredIG-NeoA model (PredIG score = 0.956). For the Cancer Non-Canonical Antigens held-out, a Test of Equal or Given Proportions (one sided - greater) was performed comparing the immunogenicity of top-200 binding non-canonical cancer antigens above the PredIG score threshold corresponding to the Youden Index of the PredIG-NeoCan model (PredIG score = 0.146). For the Pathogen held-out, a Test of Equal or Given Proportions (one sided - greater) was performed comparing the immunogenicity of top-100 binding non-canonical cancer antigens above the PredIG score threshold corresponding to the Youden Index of the PredIG-Path model (PredIG score = 0.192). Given the size of the pathogen held-out (n = 243 pHLAs), the test was performed for top-100 binders instead as top-200. The Test of Equal or Given Proportions was performed when one of the positions in the confusion matrix contained less than 5 cases.
Additional file 5: Fig. S2. Full benchmark comparing the best PredIG-SPW optimizations per antigen type versus state-of-the-art methodologies for T cell epitope prediction. The best PredIG models (PredIG-NeoA, PredIG-NonCan and PredIG-Path) are compared against state-of-the-art methods for T-cell epitope prioritization. The SOTA methods in our benchmark include NOAH [36], NetMHCpan v4.1 [34] and MHCflurry v2.0 [40] for HLA-I binding, NetMHCstab [38] for pHLA stability, PRIME v2.0 [76] and TLImm [50] for T-cell epitope immunogenicity. The statistical evaluation of the benchmark and color-scale visualization are the same as described in Additional File 2: Fig. S1.
Additional file 6: Fig. S3. PredIG-Path generalization to a diverse set of Pathogen-Splits including novel immunogenic pHLAs in the pathogen held-out. We generated 10 additional held-out datasets (termed Pathogen-Splits) by exchanging the 5 positive cases in our pathogen held-out for different sets of novel pHLAs with validated positive immunogenicity derived from SARS-CoV-2. These T-cell epitopes were deposited at IEDB from publications indexed after 01/01/2024 [50]. This analysis tests the capacity of the PredIG-Path model to rank these novel immunogenic pHLAs with a performance similar to that on the original pathogen held-out. The novel pHLAs were filtered to avoid any data leakage with the training sets and the held-outs, making these positive instances entirely unseen by PredIG. We organized these pHLAs into the novel Pathogen-Splits by including 5, 10, 15, 20, 25, 30, 35, 40, 45, and 50 different immunogenic pHLAs per split, leading to 292 novel immunogenic pHLAs; thus, the splits are entirely independent in terms of positive cases. The negative pHLAs were maintained from the original pathogen held-out (n = 238 non-immunogenic pHLAs). This enabled the assessment of PredIG-Path performance across a gradient of decreasing immunogenicity class imbalance (larger splits have more positive instances relative to the fixed number of negative pHLAs). Furthermore, the pHLAs in the splits were stratified to maximize diversity in terms of presenting HLA-I alleles. Overall, these results confirm the generalization capacity of PredIG-Path over a range of novel positive pHLAs derived from SARS-CoV-2.
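The split-building procedure described above can be sketched as follows. This is a hypothetical illustration: the pHLA identifiers are placeholders, and the real procedure additionally stratifies positives to maximize HLA-I allele diversity, which is omitted here.

```python
# Hypothetical sketch of assembling the Pathogen-Splits: the negative set is
# fixed (the 238 non-immunogenic pHLAs of the original held-out), while
# disjoint positive sets of increasing size are drawn from the pool of novel
# SARS-CoV-2 immunogenic pHLAs (placeholder identifiers below).
novel_positives = [f"pos_pHLA_{i}" for i in range(292)]
negatives = [f"neg_pHLA_{i}" for i in range(238)]

splits, pool = {}, list(novel_positives)
for size in range(5, 55, 5):              # splits of 5, 10, ..., 50 positives
    splits[size], pool = pool[:size], pool[size:]  # disjoint draws

# Each split pairs its positives with the fixed negative set, yielding a
# gradient of class imbalance from 5:238 down to 50:238.
datasets = {size: {"pos": pos, "neg": negatives} for size, pos in splits.items()}
```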
Additional file 7: Table S4. Exploring a gradient of PredIG scores to define a quality filter for the refinement of HLA-I binding affinity predictions. "Ranking" refers to the PredIG immunogenicity score threshold implemented as a quality filter (e.g., predig70 denotes a score threshold of 0.70), where TOP100 indicates the baseline ISSR100 for each tool. The numeric columns indicate the number of immunogenic epitopes amongst the top-100 candidates (ISSR100) in each condition. The gradient of PredIG scores explored, coupled to the behavior of the success rate, reflects the refinement capacity of PredIG as a quality filter and pinpoints method- and antigen-specific threshold choices. Top) Cancer neoantigens. Middle) Non-canonical cancer antigens. Bottom) Pathogen antigens.
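The quality-filter logic behind the table can be sketched as follows. This is a synthetic illustration, not the paper's analysis: the candidate tuples and the `refined_issr` helper are invented for the example.

```python
# Hypothetical sketch of the quality-filter analysis: starting from the
# top-100 candidates ranked by an HLA-I binding tool, count how many
# immunogenic epitopes survive a given PredIG score cutoff (the refined
# ISSR100). A cutoff of 0.0 reproduces the TOP100 baseline.
def refined_issr(candidates, predig_cutoff, top_n=100):
    """candidates: iterable of (binding_rank, predig_score, is_immunogenic)."""
    top = sorted(candidates, key=lambda c: c[0])[:top_n]
    kept = [c for c in top if c[1] >= predig_cutoff]
    return sum(1 for c in kept if c[2])

# Synthetic pool: PredIG score decreases with binding rank; every third
# candidate is immunogenic.
pool = [(rank, 1 - rank / 500, rank % 3 == 0) for rank in range(1, 501)]
print(refined_issr(pool, 0.0))   # baseline ISSR100 (filter disabled)
print(refined_issr(pool, 0.9))   # stricter filter keeps fewer candidates
```

Scanning a grid of cutoffs with this helper mirrors the threshold gradient reported in the table.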
Additional file 8. PredIG datasets and XGBoost models. The datasets used for training are deposited at https://github.com/BSC-CNS-EAPM/PredIG, in the data section, under the titles: predig_train_modf.csv and predig_test_modf.csv; predig_i1_modf.csv for the cancer neoantigen held-out; predig_i2_modf.csv for the cancer non-canonical antigen held-out; and predig_i3_modf.csv for the pathogen held-out. The XGBoost models are available as: predig_neoant.model for the PredIG model optimized for cancer neoantigens; predig_noncan.model for the PredIG model optimized for non-canonical cancer antigens; and predig_path.model for the PredIG model optimized for pathogens.
Data Availability Statement
All data and models supporting the findings of this study are available within the paper and its Supplementary Information. All PredIG code and datasets are available at: https://github.com/BSC-CNS-EAPM/PredIG. The software containers are available at: https://hub.docker.com/repository/docker/bsceapm/predig and https://github.com/BSC-CNS-EAPM/predig-containers/. The PredIG webserver is available at: https://horus.bsc.es/predig.
