Table 1.
List of algorithms to predict immunogenicity.
References | Training data | Algorithm | Discriminative features (Immunogenicity) |
---|---|---|---|
Tung et al. (95) | Trained on 9-mer HLA-A2 restricted peptides. From MHCPEP, SYFPEITHI and IEDB, consisting of 558 immunogenic and 527 non-immunogenic peptides | Decision tree learning methods to identify informative physicochemical properties from 531 physicochemical properties retrieved from version 9.0 of the amino acid index (AAindex) database. Support vector machine with a weighted string kernel for immunogenicity prediction (named POPISK) | Top AAindex contributors: (i) Retention coefficient in HPLC, pH2.1, (ii) Principal property value z2, (iii) Hydrophobicity scale from native proteins, (iv) Normalized composition of membrane proteins, and (v) pK-C. Found positions 4, 6, 8, and 9 critical for 9-mer peptide |
Calis et al. (98) | Trained on 9-mers from MHC-I associated peptides. From IEDB and three immunogenicity studies in mice [(99, 100), and unpublished data on Coxiella burnetii-derived peptides], consisting of 600 immunogenic and 181 non-immunogenic peptides | Per non-anchor residue of the presented peptide, log enrichment score calculated as the log ratio between the fractions of a specific amino acid in immunogenic vs. non-immunogenic data, then weighted by the importance of that position, measured as Kullback-Leibler divergence. The weighted log enrichment scores of all (non-anchor) residues summed as immunogenicity score | Preference for residues with larger or aromatic side chains. Positions 4–6 critical for 9-mer peptide |
Trolle and Nielsen (101) | Trained on 9-mer peptides covering 9 HLA alleles. From 295 T cell epitopes from SYFPEITHI and 1,216 T cell epitopes from IEDB; allele-balanced training data created by randomly selecting 50 epitopes from each of 9 HLA alleles, except 2 alleles having 14 epitopes each (total 378 epitopes) | Weighted sum of pMHC binding affinity [NetMHCcons (102)], pMHC stability [NetMHCstab (103)] and T cell propensity prediction (98) (integrated algorithm named NetTepi). Optimal relative weights obtained | Performance gain obtained by summing pMHC binding affinity, pMHC stability and T cell propensity predictions, compared with individual predictions |
Chowell et al. (104) | Trained on 9-mer H-2Db and HLA-A2 restricted peptides (separately for two ANN-Hydro models). From IEDB, 204 immunogenic and 232 non-immunogenic (self-peptides from MHC ligand elution experiment with no known immunogenicity) for H-2Db, and 372 immunogenic and 201 non-immunogenic peptides for HLA-A2 | Hydrophobicity-based artificial neural network (ANN-Hydro) based on numeric sequence of amino acid hydrophobicity | Strong bias toward hydrophobic amino acids at TCR contact residues (P4, P6, P7, and P8 for 9-mers) within immunogenic epitopes. Negative correlation between polarity of amino acids and immunogenicity |
Łuksza et al. (105) | Trained on 2,552 MHC-I immunogenic peptides from IEDB. Neoantigens with mutations generated from non-hydrophobic, wild-type residues at positions 2 and 9 excluded (as prediction of MHC affinities for wild-type peptides with non-hydrophobic anchor residues led to non-informative amplitudes) | Recognition potential of a neoantigen = A × R, where amplitude (A) is relative probability that a neoantigen is presented on MHC-I whereas its wild-type counterpart is not, and R is probability that neoantigen will be recognized by TCR repertoire. R defined by a multistate thermodynamic model, treating sequence similarity as proxy for binding energies | High sequence similarity of a given neoantigen with epitopes in IEDB by gapless alignment with BLOSUM62 amino acid similarity matrix |
Bjerregaard et al. (106) | From 13 publications, analyzed total 1,948 peptide-HLA complexes, of which 53 reported immunogenic | HLA binding prediction by NetMHCpan-4.0. Similarity between each neo- and normal peptide using kernel similarity measure proposed by Shen et al. (107) | High predicted binding score (HLA binding strength). Peptide sequence dissimilarity to self (wild-type counterpart of the neopeptide), especially for those with comparable HLA binding |
Pogorelyy et al. (97) | Trained on 9-mer peptides. From (104), 3,671 immunogenic and 3,911 non-immunogenic peptides | Principal component analysis and dimensionality reduction on 10-dimensional vectors of Kidera factor sums for each epitope. Fit multinomial Gaussian model using expectation maximization to estimate probability of being immunogenic | Distinct physicochemical properties in Kidera space |
Jurtz et al. (93) | Trained on 8,920 TCRβ CDR3 sequences and 91 HLA-A2 cognate peptides obtained from IEDB. 379 TCR and 16 peptides from the MIRA assay in (108). Negative data from eluted peptide ligands from self (i.e., human) proteins, a set of 200,000 TCR CDR3 sequences from 20 healthy donors, and internally generated incorrect combinations of TCRs and peptides | Convolutional neural networks (CNN) to predict whether a given TCR is able to recognize a specific peptide, with amino acid sequences of peptide and CDR3 region of TCRβ chain as input. CNNs scan the input and detect patterns to be integrated into the network (named NetTCR) | Conserved sequence patterns of peptide-TCR pairs encoded by BLOSUM50 matrix |
Smith et al. (94) | Trained on 141 8–11-mer epitopes from MHC-I H2b and H2d haplotypes | Using amino acid features (tiny, small, aliphatic, aromatic, non-polar, polar, charged, basic and acidic), variables derived from presence/absence of each feature at each absolute and relative position, at site of SNV mutation, at beginning/middle/end residues, and difference of each feature in mutated vs. reference antigen. Most predictive features fed into a gradient boosting algorithm trained with 10,000-fold cross-validation | Peptide biochemical features: valine at position 1, valine at last position, small amino acids at the last position, basic amino acids of the reference at the mutated position, changes in the mutated position to a small amino acid, lysine at relative site 1, and presence of valine within the first 3 positions |
Ogishi and Yotsuyanagi (96) | Trained on 8–11 mer MHC-I and 11–30 mer MHC-II peptides. From IEDB, LANL HIV and HCV database and TANTIGEN database, 6,957 HLA-I and 16,642 HLA-II immunogenic peptides. 191,326 TCR CDR3β sequences obtained from MiXCR | TCR-peptide contact potential profiling (CPP) by optimal alignment between CDR3β (randomly down-sampled to 10,000 sequences) and peptides, using pairwise contact potential scales from AAindex. Peptide sequence-based estimates of physicochemical properties (= peptide descriptors) using: aIndex, blosumIndices, boman, charge, crucianiProperties, fasgaiVectors, hmoment, hydrophobicity, instaIndex, kideraFactors, mswhimScores, pI, protFP, vhseScales, and zScales. Most predictive peptide descriptors and CPP features compressed into a linear coordinate system through extremely randomized tree (ERT) algorithm | Physicochemical and CPP features: features from short (3- and 4-aa) and longest (8- and 11-aa for MHC-I and MHC-II, respectively) fragments, skewness- and kurtosis-derived features and AAindexes, including inverse of modified Miyazawa-Jernigan transfer energy, inverse of quasichemical energy in an average protein environment from interfacial regions of protein-protein complexes, and distance-dependent statistical potential within 10–12 Å |
Riley et al. (109) | Trained on 9-mer HLA-A2 restricted peptides. 155 immunogenic from IEDB, 2,756 HeLa HLA-A2 binding self-peptides and 1,044 HLA-A2 non-binders | A feed-forward neural network with inputs describing structural and structure-based energetic features of the 9 amino acids in the peptide sequence and of the peptide-HLA complex. Structural and energy features are those comprising the Talaris 2014 energy function or derived from Table S3 (109) | Structural and energetic features: van der Waals interaction, hydrophobic solvation, Coulombic potentials, hydrogen bond energies, side chain rotamer energies, and solvent accessible surface areas (SASA) |
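
The position-weighted enrichment score summarized in the Calis et al. (98) row can be illustrated with a minimal sketch. This is not the published implementation: the anchor positions (P2 and P9 of a 9-mer), the pseudocount, the pooling of non-anchor residues for the enrichment term, all function names, and the four toy peptides are assumptions made only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ANCHORS = {1, 8}    # assumed anchors: P2 and P9 of a 9-mer (0-based indices)
PSEUDOCOUNT = 1.0   # assumed smoothing so that no fraction is zero


def aa_fractions(residues):
    """Smoothed fraction of each amino acid in a list of residues."""
    counts = Counter(residues)
    total = len(residues) + PSEUDOCOUNT * len(AMINO_ACIDS)
    return {aa: (counts[aa] + PSEUDOCOUNT) / total for aa in AMINO_ACIDS}


def fit(immunogenic, non_immunogenic, length=9):
    """Return per-amino-acid log enrichment and per-position KL weights."""
    positions = [i for i in range(length) if i not in ANCHORS]

    # Log enrichment: log ratio of amino acid fractions, pooled over
    # all non-anchor positions (pooling is an assumption of this sketch).
    f_imm = aa_fractions([p[i] for p in immunogenic for i in positions])
    f_non = aa_fractions([p[i] for p in non_immunogenic for i in positions])
    log_enrichment = {aa: math.log(f_imm[aa] / f_non[aa]) for aa in AMINO_ACIDS}

    # Position importance: Kullback-Leibler divergence between the
    # immunogenic and non-immunogenic residue distributions at that position.
    weights = {}
    for i in positions:
        p = aa_fractions([pep[i] for pep in immunogenic])
        q = aa_fractions([pep[i] for pep in non_immunogenic])
        weights[i] = sum(p[aa] * math.log(p[aa] / q[aa]) for aa in AMINO_ACIDS)
    return log_enrichment, weights


def immunogenicity_score(peptide, log_enrichment, weights):
    """Sum of position-weighted log enrichment over non-anchor residues."""
    return sum(w * log_enrichment[peptide[i]] for i, w in weights.items())


# Toy peptides, invented for illustration only.
toy_immunogenic = ["AAWFWAAAA", "ALWFWALAL"]
toy_non_immunogenic = ["AASGSAAAA", "ALSGSALAL"]
log_enrichment, weights = fit(toy_immunogenic, toy_non_immunogenic)
score_pos = immunogenicity_score("AAWFWAAAA", log_enrichment, weights)
score_neg = immunogenicity_score("AASGSAAAA", log_enrichment, weights)
```

On this toy data only positions 3 to 5 differ between the two classes, so only those positions receive non-zero weight, and the peptide carrying aromatic residues there scores higher than the one carrying small residues. With real training sets the enrichment values and weights would of course depend on the data used.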