Skip to main content
Molecular & Cellular Proteomics : MCP logoLink to Molecular & Cellular Proteomics : MCP
. 2023 Feb 14;22(4):100506. doi: 10.1016/j.mcpro.2023.100506

Precision Neoantigen Discovery Using Large-Scale Immunopeptidomes and Composite Modeling of MHC Peptide Presentation

Rachel Marty Pyke 1,, Datta Mellacheruvu 1,, Steven Dea 1, Charles Abbott 1, Simo V Zhang 1, Nick A Phillips 1, Jason Harris 1, Gabor Bartha 1, Sejal Desai 1, Rena McClory 1, John West 1, Michael P Snyder 2, Richard Chen 1,§, Sean Michael Boyle 1,∗,§
PMCID: PMC10114598  PMID: 36796642

Abstract

Major histocompatibility complex (MHC)–bound peptides that originate from tumor-specific genetic alterations, known as neoantigens, are an important class of anticancer therapeutic targets. Accurately predicting peptide presentation by MHC complexes is a key aspect of discovering therapeutically relevant neoantigens. Technological improvements in mass spectrometry–based immunopeptidomics and advanced modeling techniques have vastly improved MHC presentation prediction over the past 2 decades. However, improvement in the accuracy of prediction algorithms is needed for clinical applications like the development of personalized cancer vaccines, the discovery of biomarkers for response to immunotherapies, and the quantification of autoimmune risk in gene therapies. Toward this end, we generated allele-specific immunopeptidomics data using 25 monoallelic cell lines and created Systematic Human Leukocyte Antigen (HLA) Epitope Ranking Pan Algorithm (SHERPA), a pan-allelic MHC-peptide algorithm for predicting MHC-peptide binding and presentation. In contrast to previously published large-scale monoallelic data, we used an HLA-null K562 parental cell line and a stable transfection of HLA allele to better emulate native presentation. Our dataset includes five previously unprofiled alleles that expand MHC diversity in the training data and extend allelic coverage in underprofiled populations. To improve generalizability, SHERPA systematically integrates 128 monoallelic and 384 multiallelic samples with publicly available immunoproteomics data and binding assay data. Using this dataset, we developed two features that empirically estimate the propensities of genes and specific regions within gene bodies to engender immunopeptides to represent antigen processing. Using a composite model constructed with gradient boosting decision trees, multiallelic deconvolution, and 2.15 million peptides encompassing 167 alleles, we achieved a 1.44-fold improvement of positive predictive value compared with existing tools when evaluated on independent monoallelic datasets and a 1.17-fold improvement when evaluating on tumor samples. With a high degree of accuracy, SHERPA has the potential to enable precision neoantigen discovery for future clinical applications.

Keywords: major histocompatibility complex, MHC, immunopeptidomics, neoantigens, cancer, cancer vaccines, machine learning, immunology, neoantigen prediction, next-generation sequencing

Graphical Abstract

graphic file with name fx1.jpg

Highlights

  • Generated 25 stably transfected monoallelic cell lines and applied immunopeptidomics.

  • Harmonized 512 public immunopeptidomic samples through systematic reprocessing.

  • Developed pan-allele MHC binding algorithm (SHERPA) utilizing 167 human HLA alleles.

  • Employed empirically derived antigen-processing features to predict MHC presentation.

  • SHERPA demonstrates up to 1.44-fold increased precision over competing algorithms.

In Brief

Accurately identifying neoantigens is critical for many clinical applications. We generated immunopeptidomics data from 25 stably transfected monoallelic cell lines. Then, we systematically reprocessed a large corpus of public data to improve MHC binding pocket diversity and to empirically learn the rules of antigen presentation. In applying these datasets, we trained SHERPA, an MHC binding and presentation prediction algorithm. SHERPA improves performance compared with existing tools by 1.44-fold in held-out monoallelic data and 1.11-fold for known immunogenic epitopes.


The concerted efforts of innate and adaptive immunity help maintain homeostasis and fight pathogen attacks. The innate immune system reacts quickly and is largely nonspecific, whereas the adaptive immune system is highly specific, typically takes a longer time to develop, and is long lasting. Within the adaptive immune system, T cells survey the health of a cell by examining major histocompatibility complex (MHC) protein complexes on the cell surface. All nucleated cells express MHC on their surface, and those cells presenting nonself and aberrant peptides are identified and eliminated. Although several factors are needed to mount an immunogenic response from CD8+ T cells, MHC class I presentation of relevant peptides is a gatekeeping step (1, 2, 3).

Early in vitro and in vivo experiments evaluating binding characteristics of various peptide MHC pairs indicated that presented peptides had allele-specific motifs. In vitro experiments primarily attempted to measure binding affinity of specific peptides and their cognate MHC complexes using competitive binding assays in a hypothesis-driven manner (4). Advances in LC–MS/MS heralded a new era of large-scale immunopeptidomics and the ability to learn the rules of binding and presentation in an unsupervised way (5). However, the assignment of peptides to their cognate alleles was problematic (6). Further advances in using genetically engineered cell lines expressing a single allele of interest, often at high copy number, helped circumvent this problem and produce unambiguous human leukocyte antigen (HLA) peptidomes (7).

Algorithms to model MHC binding have progressed along with these technological advances. Early approaches relied exclusively on MHC-peptide binding affinity data (8). As immunopeptidomics data have become increasingly available, more groups have incorporated these data into prediction models. Several recent approaches employ multiallelic immunopeptidomics data, including clustering-based deconvolution (9, 10), iterative assignment (11), and direct modeling (12). Other efforts capitalize on the unambiguous nature of monoallelic data (7, 13, 14). Furthermore, some algorithms aim to model MHC-peptide binding alone (10, 15, 16) whereas others extend to modeling antigen processing and surface presentation as well (7, 12, 13, 17). Although there is disagreement on the best way to model MHC presentation, there is clear consensus that advances in mass spectrometry (MS) have enabled scalable data generation and significantly improved MHC-peptide binding and presentation prediction algorithms.

With the goal of creating an accurate MHC-peptide binding and prediction algorithm, we generated a high-quality dataset comprising 25 monoallelic cell lines. All the monoallelic cell lines are distinct from previously published datasets because of their stable transfection and background cell line (K562). We also profiled five alleles that have not previously been profiled using monoallelic immunopeptidomics technologies, expanding the known MHC binding pocket diversity and the allelic representation in underprofiled Asian, African, and Middle Eastern populations; the lack of information for under-represented groups is an area of high current interest (18). Furthermore, we combined our dataset with curated publicly available binding affinity data and monoallelic immunopeptidomics data to create a robust MHC-binding prediction algorithm. In order to capture the diverse facets of antigen processing and presentation, we systematically reprocessed and deconvoluted publicly available multiallelic data from several tissue types with different transcript expression profiles using our binding model trained on monoallelic data and expanded our training data to encompass 167 alleles. We also employed this large dataset to empirically learn patterns in antigen processing at the protein and peptide level. We applied gradient boosted decision trees to train our prediction models, which we call SHERPA (Systematic HLA Epitope Ranking Pan Algorithm). We evaluated SHERPA on held-out monoallelic and independently generated multiallelic tumor datasets to demonstrate that our integrated approach for MHC-peptide presentation prediction has a 1.17-fold improvement in performance over the best existing tools.

Experimental Procedures

Immunopeptidomics Using Monoallelic Cell Lines

Experimental Design and Statistical Rationale

In this study, all peptides studied were derived from the MHC-I immunopeptidome. Experimental work was performed on two sample types: monoallelic cell lines and tumor tissue. For the monoallelic cell lines, 28 cell lines of 5 × 109 cells were processed. Three biological replicates were assessed for a single cell line. One cell line was an HLA-null line used as a negative control. Peptides from eight of the cell lines were processed a single time. Peptides from the other 20 of the cell lines were divided and processed on the mass spectrometer twice (11 with collision-induced dissociation [CID] only and 9 with both CID and electron-transfer/higher-energy collision dissociation). For the tumor tissues, 12 tissues were processed without controls or replicates. In addition to the experimental work, publicly available peptides were also analyzed. A Wald test and two-sided t tests were used in this study.

Cell Culture

We generated monoallelic cell lines by stably transfecting K562 parental cells with a single allele of interest using Jump-In technology (Thermo Fisher Scientific). Optimized plasmids comprising the sequences for HLA, beta 2 microglobulin, and IRES promoter were synthesized for each allele (GeneArt). Cells were screened for plasmid integration and expanded to 500 M cells through various passages. The cells were pelleted once surface expression of target alleles was confirmed using flow cytometry (using W6/32 antibody). Transfection experiments were performed by Thermo Fisher Scientific.

Immunoprecipitation of Peptide–MHC Complexes

Pelleted cells were resuspended in octylthioglucoside lysis buffer, and the cell lysate was incubated overnight with W6/32 antibody immobilized on Protein A sepharose. After washing the resin, MHC-bound peptides were eluted using 0.1 M acetic acid and 0.1% TFA. The success of immunoprecipitation was verified by confirming the depletion of MHC class I complex in post-immunoprecipitated samples using ELISA. Immunoprecipitations were performed by Cayman Chemical.

Peptide Sequencing Using LC–MS/MS

Eluted peptides were desalted using solid phase extraction (Empore C18), loaded on a column, eluted using 80/20 acetonitrile/water (0.1% TFA), lyophilized, and stored. Samples were reconstituted in 0.1% TFA before they were analyzed using LC–MS/MS. Chromatographic separation was performed using a 2 h gradient on a Waters NanoAcquity system. Peptides were analyzed using a Thermo Fisher Fusion Lumos mass spectrometer in data-dependent mode (MS1: Orbitrap at 60,000 full width at half maximum resolution, m/z range: 300 to 800; isolation window: 1.6 Da; fragmentation: electron-transfer/higher-energy collision dissociation and CID; MS2: Orbitrap at 15,000 full width at half maximum; cycle time: 3 s). MS experiments were performed by MS Bioworks, LLC.

Peptide Identification

Peptides were identified using PEAKS software (PEAKS Studio 10.0 build 20190129) (19) using the default two-step identification workflow, where the first step performs de novo sequencing to identify mass tags and the second step performs a database search on a subset of putative proteins identified using de novo mass tags. The workflow was run with the following settings. Protein database: SwissProt proteome database (20,402 entries; dated March 20, 2019); precursor mass tolerance: 10 ppm; fragment mass tolerance: 0.02 Da; enzyme specificity: none; fixed modifications: carbamidomethylation of cysteine (+57.0215); variable modifications: oxidation of methionine (+15.9949), N-terminal acetylation (+42.0106). Peptide-to-spectrum matches were filtered at 1% false discovery rate, estimated using decoy sequences.

Post Processing and Quality Control

Peptides identified at 1% false discovery rate were filtered further to remove spurious peptides using the following constraints: spurious peptides were defined as polymeric peptides, peptides from highly chimeric spectra (n > 2) and identifications with less than ten fragment ions (supplemental Code). Background contaminants, profiled using a mock transfection (GFP), were filtered out. Samples were manually inspected for the presence of motif signatures (using Gibbs Clustering software (9)) and peptide yields (heuristic cutoff of 500 peptides per sample).

Extraction and Processing of Public Datasets

Extraction and Processing of Publicly Available Monoallelic and Multiallelic Data

Publicly available immunopeptidomics data were identified after an exhaustive literature search. Raw data (.raw files) were downloaded and systematically processed similar to in-house data. HLA types to sample mappings were extracted from corresponding publications. Peptide identifications from samples that passed rigorous quality control criteria (as described previously) were aggregated.

Processing of in Vitro Binding Affinity Data

Raw HLA-peptidome data were downloaded from the Immune Epitope Database (IEDB) (date: March 9, 2020) and filtered to identify entries corresponding to in vitro binding assays. The “Object Type” column was filtered to exclude any nonlinear peptides, and the “Units” column was filtered to exclude any non-nM entries. Furthermore, only four digit MHC class I peptides of length 8 to 11 were retained. The peptides were derived from the following “method/technique” categories: “purified MHC/competitive/radioactivity,” “purified MHC/direct/fluorescence,” “purified MHC/competitive/fluorescence,” “cellular MHC/competitive/fluorescence,” “cellular MHC/direct/fluorescence,” “cellular MHC/competitive/radioactivity,” “binding assay,” and “lysate MHC/direct/radioactivity.” Ligands with IC50 values less than 500 nm were identified as binders.

Measurement and Analysis of Transcript Expression

Sequencing of Monoallelic Cell Lines

Representative samples of monoallelic K562 cell lines generated in-house were sequenced in triplicate using our in-house commercial platform ImmunoID NeXT with 200 million paired end reads (150 base pair) of sequencing for RNA from tumor samples. Reads were aligned in accordance with Personalis Cancer RNA pipeline and transcript per million (TPM) values were extracted. Another cell line, B721.221, with large amounts of publicly available monoallelic data, was sourced from the American Type Culture Collection and profiled in triplicate using ImmunoID NeXT similarly to minimize technological variance.

Generation of Transcriptome for External Immunopepdomics Datasets

For one of the largest publicly available monoallelic datasets generated using B721.221 cell (MSV000080527; MSV000084172) lines, we regenerated the transcriptome data on ImmunoID NeXT (713), as described previously. For the rest of the samples in the expanded dataset, we imputed TPM values with our internal database by taking the median values from all samples with a matching tissue type.

Evaluation of Differential Transcriptome Abundances Across Cell Lines and Tissue Types

To evaluate the differences in expression between K562 and B721.221, we performed differential gene expression analysis using the Deseq2 package (20), which takes in raw transcript counts from all replicates of a sample set to determine differentially expressed genes. We used a threshold of an absolute value log2 fold change of greater than or equal to 2 in conjunction with an adjusted p value threshold of less than or equal to 0.01 to identify differential gene expression. We then visualized the differentially expressed genes in a volcano plot to show cell line gene expression differences, using the Bioinfokit tool by Renesh Bedre (https://github.com/reneshbedre/bioinfokit/tree/v0.9). We then took the identified genes that were enriched in K562 and the genes that were depleted in K562 as compared with B721.221 and calculated the Gene Ontology enrichment terms for both gene sets using http://geneontology.org/. To evaluate the differences in expression between the tissue types, we restricted our analysis to the tissues with unique expression profiles (n = 30). Then, we identified the 5000 genes with the highest standard deviation between the expression profiles and visualized those genes as rows and the unique expression profiles as columns in a clustered heatmap.

Exploratory Analysis of Immunopeptidomics Data

Relationship Between Expression and Peptide Presentation

Peptides identified in K562 monoallelic samples were assigned a transcript expression value (TPM) by mapping each peptide to one or more proteins and proteins to one or more transcripts. The highest TPM was assigned as the representative value when there are multiple mappings. A set of random peptides (n > 50 million) were also generated and assigned TPMs as described. TPMs were binned to deciles, and enrichment of presented peptides in each bin was calculated as the log ratio of number of presented peptides to background random peptides.

Relationship Between Cleavage Specificities and Peptide Presentation

To understand the cleavage preferences of the proteasome in our monoallelic cell lines, we mapped each peptide to a protein (as described previously) and identified the right and left flanking sequences (five amino acids) of the peptides. A set of random peptides (n > 50 million) were also generated and assigned flanking sequences as described to serve as a null dataset. To calculate the enrichment or depletion of amino acids at each position in the flanking regions, we subtracted the observed amino acid frequency from the random amino acid frequency and divided by the random amino acid frequency. Of note, the C terminus and N terminus of the protein was also considered in addition to the individual amino acids. We visualized the enrichment and depletion as a heatmap.

Clustering Alleles by Binding Pocket Similarity

The binding pocket was represented by a pseudo sequence of 34 amino acids as described previously (8). The distance between alleles was calculated by taking the sum of the BLOSUM62 scores for the amino acids in each allele at each position across the pseudo sequence. All the unique alleles with in-house or public monoallelic immunopeptidomics data were compared with one another and visualized as a heatmap.

Estimating Allelic Coverage in Under-Represented Ethnic Populations

Allele frequencies from several ethnic populations were downloaded from http://www.allelefrequencies.net/ on May 22, 2020. The frequencies of the five novel alleles in under-represented ethnic populations were taken directly from the data on the website. To estimate the allelic coverage of all the unique alleles in the whole data across ethnic populations, we focused on the data derived from the National Marrow Donor Program (n = 18 populations). For each population, we performed a Monte Carlo simulation to generate 10,000 synthetic individuals with the allelic frequencies for that population specified by National Marrow Donor Program. Then, we calculated the percentage of the alleles in the synthetic cohort that were represented in the expanded dataset (in-house monoallelic, public monoallelic, IEDB, and public multiallelic).

Generation of Peptide Motifs

We generated the motifs for each allele using the weblogo software (version 3.7.1). All motifs were based on a specific length of peptide (8-, 9-, 10-, or 11-mers). Motifs for public datasets were based on the peptides derived from our internal processing pipeline.

Generation of Position-Specific Frequencies of Amino Acids of the Binding Pocket

To assess the representativeness of our expanded dataset to the full space of known alleles, we calculated the frequency of amino acid at each position of the binding pocket pseudo sequence for both the alleles in the IMGT (the international ImMunoGeneTics information system; http://www.imgt.org/) and our expanded dataset. We visualized these frequencies as stacked bar plots.

Feature Engineering and Building Prediction Models

Data Splitting for Training and Evaluation

In order to rigorously evaluate our performance on novel peptides, we ensured that no peptides with overlapping 8-mer cores were present across the training, validation, and testing datasets. We took the unique set of peptides across our monoallelic immunopeptidomics, multiallelic immunopeptidomics, and IEDB datasets. Then, we grouped peptides that contained any identical 8-mer substrings (“nested” peptides) and placed each peptide group in one of ten different subsets. Once we had roughly equal peptide numbers in each subset, we assigned one subset to be for validation (∼10%), one subset to be for testing (∼10%), and the final eight subsets to be for training (∼80%). Moreover, we fully held out 32 multiallelic samples from the training dataset (described in detail in the “Benchmarking and Evaluation of Prediction Models” section) and evaluated the models on a subset of the peptides (∼10%) that were fully held out from training (as described previously). All publicly available multiallelic data (except for the 32 held out samples) was used for feature engineering of the gene propensity and hot spot scores because systematically holding out proteins or peptides would skew the scores. Only the ∼80% training subset of the multiallelic data was used for deconvolution.

Specifications of Prediction Models

We trained two types of prediction algorithms, one that models binding and another that models presentation. The binding algorithm models MHC peptide interaction affinities and takes as inputs the HLA binding pocket, the amino acid sequence, and the length of peptide ligands. The presentation algorithm jointly models both antigen processing and peptide-MHC (pMHC) binding, whose inputs include the HLA binding pocket, amino acid sequence and length of peptide ligands, proteasomal processing footprints manifested in the left and right flanking regions of peptide ligands, abundance of source proteins engendering peptide ligands as measured by gene expression, and two features that model the propensity of antigen processing and presentation. Features corresponding to these inputs as generated as follows:

  • 1.

    HLA binding pocket (B): The binding pocket is represented by a pseudo sequence of amino acids as described previously (8). Briefly, 34 positions on the protein sequence of the HLA that are a distance of 4 Å or lesser in crystollagraphic structures were selected from the full protein to serve as the psuedo sequence. The amino acid sequence is encoded using a BLOSUM62 substitution matrix, where each amino acid is represented by a 20-dimensional vector constituting the relative weights of amino acid substitutions. We chose BLOSUM62 encoding as opposed to a one-hot encoding approach under the assumption that substitutions between amino acids with evolutionary similarity have a lower impact on epitope binding changes than substitutions between amino acids that are very dissimilar.

  • 2.

    Peptide ligand (P): The amino acid sequence of the peptide ligand is encoded using a BLOSUM62 substitution matrix, where each amino acid is represented by a 20-dimensional vector constituting the relative weights of amino acid substitutions. Since HLA ligands could be of variable lengths (8- to 11-mers), we adapted a middle-padding approach described previously and adjusted all peptides to a length of 11 amino acids by inserting blanks in the middle (12). Briefly, for every peptide with fewer than 11 amino acids, we assign one to three 20-dimensional vector(s) of zeros to the center of the peptide encoding to reach the maximal 11-mer length. We take this approach to have a pan-length algorithm that keeps the peptide anchors in consistent columns of the matrix.

  • 3.

    Peptide length (L): The number of amino acids in the peptide ligand is designated as the peptide length.

  • 4.

    Left and right flanking regions (F): Peptide sequences (5-mers) to the left and right of the peptide ligand in the source protein are used as the left and right flanking regions, respectively. Multimappers are resolved by assigning the protein with the highest transcript expression to the peptide. These left and right flanking 5-mers are encoded using BLOSUM62 substitution matrix as described previously.

  • 5.

    Abundance of source protein (T): Peptide ligands are redundantly assigned to all source proteins and then to transcripts. The transcript with the highest expression (calculated as the TPM) is chosen. Both the transcript and the TPM are then assigned to the peptide ligand.

  • 6.

    Gene propensity score (G): Publicly available multiallelic data were used to estimate gene propensity. Peptide ligands from each sample are mapped to source proteins and then to transcripts redundantly. The number of peptides mapping to each transcript-associated protein were determined. To calculate the expected number of peptides for each transcript-associated protein, the number of transcripts per million for each protein across all multiallelic data sources are determined and normalized by the protein length and the number of peptides derived from each sample. Then, the number of observed and expected peptides for each protein are summed across all the multiallelic datasets. Finally, the observed values were divided by the expected values to derive a gene propensity score for each protein (gene). Proteins without any observed transcripts across all samples are (potentially pseudo genes) given a score of −3 to deprioritize them.

  • 7.

    Hot spot score (H): Publicly available multiallelic data were used to estimate the hot spot score. Peptide ligands from each sample are mapped to source proteins. When peptides were mapped to multiple proteins, all potential mappings are used. The number of peptides overlapping a particular amino acid represents the hot spot score of that amino acid. To assign a hot spot score for a peptide, the peptide is mapped to its source protein and the amino acid hot spot score spanning the peptide is averaged.

We evaluated the contribution of each feature to the XGBoost model using the “gain” metric, which measures the relative contribution of a feature to the main model by aggregating the individual contributions of the feature in each tree. Higher values indicate greater importance.

Generating Negative Examples (Nonbinders)

Immunopeptidomics experiments only generate peptides that successfully bind to and are presented by MHC molecules. Thus, we synthetically generated negative examples instead of experimentally identifying them. We generated 20 negative examples for every positive example in our training and validation datasets. To generate a negative example, we randomly selected a protein from the SwissProt proteome (downloaded on March 20, 2019) and then randomly selected a peptide from within that protein. Peptides were selected to have length 8, 9, 10, or 11 with equal probability. Flanking regions were assigned based on the true flanking regions around the selected peptides. A gene expression value (TPM) was assigned by randomly selecting a transcript from the transcriptome of the associated positive example. The gene propensity score and hot spot score were assigned based on the protein and position in the protein of the peptide selected for the negative example.

Training the Prediction Algorithm

All models were trained using a gradient-boosted decision tree algorithm implemented using an open source package XGBoost (21). All numeric and encoded features were provided as a vectorized input feature vector for training the algorithm. Optimal training parameters were selected based on a subset of samples using sequential model–based optimization, implemented using the open source package HyperOpt (22). The resulting training parameters used for the final training were as follows: loss function—binary logistic; maximum depth—10; eta—0.01; subsample—0.7; early stopping rounds—5; min child weight—0.5; maximum delta step—1; tree method—hist; and number of estimators—500. Large training sets were subset and processed in parallel based on available compute resources and processed on a high performance cluster.

Calibrating Raw Scores Using Percent Rank Values

A set of 500,000 peptides were randomly selected from the human proteome. Once we trained a model, we calculated the predictions across the set of random peptides with every allele. Then, we ranked the random peptides according to their raw prediction probability (output of XGBoost) for each allele. For each new peptide predicted, the assigned rank is the percentage of the random set that is predicted to bind or to be presented with a better raw score than the new peptide. The ranks range from 0 to 100 with lower scores meaning better bound or presented peptides. The ranks are recalculated for each model and allele combination.

Applying Prediction Models to Multiallelic Cell Line and Patient Samples

The MONO-Binding model was used to generate binding ranks for the peptides in the public multiallelic samples (including tumors, tissues, and cell lines) with immunopeptidomics data. Only peptides with 8, 9, 10, and 11 amino acids were considered. About 10% of multiallelic data was held out for testing purposes. Predictions were made for all HLA alleles (up to six) of each sample. Samples without HLA typing were excluded from this analysis.

Deconvolution of Multillelic Data

Once predictions were made for all HLA alleles of each multiallelic sample with immunopeptidomics data, the data were deconvoluted to decide which allele-peptide pairs should be included in the training dataset for the final model. First, we excluded all allele-peptide pairs with a predicted binding rank of ≥0.5 to exclude all peptides that do not bind to any of the designated alleles. Second, if there were multiple alleles predicted to bind to a specific peptide, we selected the allele-peptide pair with the lowest rank (best binder) and excluded all other pairs. Then, we removed any duplicate allele-peptide pairs. Finally, we generated 20 negative examples (as described previously) for every new positive example derived from the multiallelic data.

Training Composite Models

We trained a total of three prediction models for our composite model. In addition, we also trained five more models to help us better understand features that contribute to optimal performance. The prediction models in our composite model were trained as follows:

  • 1.

    MONO-Binding: Trained using the IEDB, in-house monoallelic immunopeptidomics and public monoallelic immunopeptidomics data with B, P, and L as features.

  • 2.

    SHERPA-Binding: Trained using the IEDB, in-house monoallelic immunopeptidomics, public monoallelic immunopeptidomics, and deconvoluted multiallelic immunopeptidomics data with B, P, and L as features.

  • 3.

    SHERPA-Presentation: Trained using the in-house and public monoallelic immunopeptidomics data with SHERPA-Binding, F, T, G, and H as features.

The additional prediction models were trained as follows:

  • 1.

    PUBLIC-Binding: Trained using the public monoallelic immunopeptidomics data with B, P, and L as features.

  • 2.

    MONO-Binding-LOO: 126 allele-specific models trained using the IEDB, in-house monoallelic immunopeptidomics and public monoallelic immunopeptidomics data with B, P, L, IEDB-Binding, INHOUSE-Binding, and PUBLIC-Binding as features. Each allele-specific model was trained without peptides in the training dataset from the respective alleles.

  • 3.

    SHERPA-Binding + F: Trained using the IEDB, in-house monoallelic immunopeptidomics, public monoallelic immunopeptidomics, and deconvoluted multiallelic immunopeptidomics data with SHERPA-Binding and F as features.

  • 4.

    SHERPA-Binding + FT: Trained using the IEDB, in-house monoallelic immunopeptidomics, public monoallelic immunopeptidomics, and deconvoluted multiallelic immunopeptidomics data with SHERPA-Binding, F, and T as features.

  • 5.

    SHERPA-Binding + FTG: Trained using the IEDB, in-house monoallelic immunopeptidomics, public monoallelic immunopeptidomics, and deconvoluted multiallelic immunopeptidomics data with SHERPA-Binding, F, T, and G as features.

Benchmarking and Evaluation of Prediction Models

Generation of the Monoallelic Held-Out Test Data

For the monoallelic immunopeptidomics datasets (in-house and public), ∼10% of the positive examples were held out of the training and validation datasets. Each positive example was supplemented with 999 negative examples to reflect the positive to negative ratio accepted by the field (5, 23, 24, 25, 26). For the IEDB data, ∼10% of the positive and negative data was withheld from the training dataset. No supplementary negative examples were added, so the ratios in the test dataset reflect the positive to negative peptide ratio in IEDB. All monoallelic validation figures rely on the monoallelic immunopeptidomics test data except for one comparison which uses the IEDB test data.

Calculation of Evaluation Metrics

Three metrics were used to evaluate the performance of the prediction models. The metrics are as follows:

  • 1.

    Positive predictive value (PPV): PPVs were calculated by making predictions on the entire test dataset and calculating the percentage of peptides in the top X% of predictions that are positive examples, with X representing the portion of the dataset that are positive examples. Each PPV was calculated individually for the peptides of each allele and combined using a median giving a single metric. Of note, the positive to negative ratios of the monoallelic immunopeptidomics and IEDB test datasets vary, so the interpretations of the plots are different.

  • 2.

    Precision-recall curves: Precision-recall curves were generated by calculating the precision and the recall for every possible cutoff and plotting them as a single line.

  • 3.

    Fraction of observed peptides predicted by model: This metric was used for all multiallelic tumor validation analyses. First, a single score is selected to represent each peptide observed with immunopeptidomics by making predictions on all the patient’s HLA alleles and selecting the best (lowest) rank among the predictions. Then, the score is calculated by determining the percentage of observed peptides that are given a rank of ≤0.1.

Leave-One-Out Pan-Allelic Analysis

To evaluate the pan-allelic performance of the MONO-Binding model used for model-based deconvolution, we trained 126 independent models with the same features as the MONO-Binding model. For each model, we excluded the peptides from a specific allele from the set of peptides used to train the MONO-Binding model. To generate the predicted motif for each allele, we predicted the binding rank for 500,000 random peptides for a given allele with the model for which the allele had been excluded from training. Then, we generated the motif for peptides with the top percentile of binding ranks. Motifs were only visualized for alleles with at least 50 positive peptides in the training data. To generate the precision recall curve, we used all 126 models to predict the binding ranks of the monoallelic immunopeptidomics data (in-house and public) that was excluded from training and validation (10% test dataset). The predictions for each allele were made with the model that excluded that allele from training.

Generating Validation Data Using Tissue Samples

A total of 12 fresh-frozen tumor samples (five colorectal and seven lung) and matched adjacent normal fresh-frozen samples were selected for patient validation. These samples were purchased from a biobank and were collected under institutional review board–approved protocols and abide by the Declaration of Helsinki principles. Each tumor sample was divided into two pieces. Immunoprecipitation of MHC complexes followed by LC–MS/MS (as described previously) was performed in a portion of each tumor sample to yield immunopeptidomics data. DNA and RNA was extracted from the remaining tumor sample, and DNA was extracted from the adjacent normal sample for analysis with ImmunoID NeXT (as described previously). The RNA extracted from the tumor sample was used to yield transcriptomic data.

Extraction of Held-Out Validation Datasets From External Multiallelic Samples

About ∼10% of multiallelic immunopeptidomics data with nonoverlapping “nested” peptides with the training and validation datasets was withheld from deconvolution and training to serve as a test dataset. We used ten samples from each of two specific datasets to validate our internal tumor immunopeptidomics performance (PXD007635 and PXD009602) (27, 28). Only samples with HLA-A, HLA-B, and HLA-C typing at 4-digit resolution were used for the analysis.

Immunogenicity Evaluation

In order for a peptide to incite an immunogenic response, it must be presented on the cellular surface by an MHC allele. Thus, we evaluated the ability of the algorithms to positively identify immunogenic peptides. We used the dataset described in the study by Chowell et al. (29) for this analysis and focused exclusively on the immunogenic peptides. Then, we evaluated the percentage of immunogenic peptides that were predicted by the various models at ≤0.1 percentile rank.

Running Comparison Prediction Algorithms

NetMHCpan-4.1-BA, NetMHCpan-4.1-EL, and MHCFlurry-2.0-BA were used as comparison prediction algorithms. All algorithms were run according to default settings. Percentile rank outputs were used for each analysis.

Results

Generation of Monoallelic Immunopeptidomics Data

In order to generate high-quality data, we used an HLA-null K562 parental cell line and engineered monoallelic cell lines that expressed a single HLA allele of interest (Fig. 1A). Expanded cells with good surface expression of target alleles were used for immunoprecipitation of peptides associated with MHC complexes, followed by elution of ligands and peptide sequencing using LC–MS/MS. Our stable transfection protocol, which ensures single-site integration of the target allele, minimizes expression biases and enables modeling of antigen processing in closer-to-native condition. Moreover, our K562 parental cell line provides a different background than the majority of previously published monoallelic data (B721.221 parental cell line), providing a novel and standardized system to study peptide presentation.

Fig. 1.

Fig. 1

Generation and overview of the monoallelic data.A, a schematic of the experimental procedure to generate the monoallelic training data. A human leukocyte antigen (HLA) allele and beta 2 microglobulin (B2M) were stably transfected into an HLA-null K562 parental cell line. The major histocompatibility complex (MHC)–peptide complex was purified using a w6/32 antibody, and the peptides were gently eluted off the complexes. The peptides were sequenced with LC–MS/MS and identified with a database search. B, bar plots showing the peptide yields and distribution of peptide lengths for each of the 25 monoallelic cell lines. C, a comparison between motifs of peptides generated from our monoallelic cell line with HLA-B35:01 and a publicly available dataset for the same allele. Motifs are shown for peptides of length 8, 9, 10, and 11. See supplemental Fig. S1 for motifs from all 25 cells. See supplemental Fig. S2 for comparisons with other public datasets. D, a bar plot showing the distribution of the ratio of observed peptides from the monoallelic cell lines compared with random expectation across several transcript per million (TPM) ranges. Values are shown with a log10 transformation. E, a heatmap showing the enrichment and depletion of five amino acids upstream and downstream of the peptides identified from the monoallelic cell lines compared with a random expectation. Red denotes the enrichment of amino acids, and blue denotes the depletion of them. The C and N terminus of the protein are denoted with “-.”

We generated data for 25 alleles (supplemental Table S1) and identified an average of 1575 unique peptides per allele (range: 905–2712) (Fig. 1B). We first evaluated the allele-wise length distribution and recapitulated previously described observations, with HLA-A alleles presenting longer peptides than HLA-B alleles (two-sided t test, p = 0.0066, supplemental Fig. S1A) (13). We also compared the motifs in our raw data with those published previously. Whereas most of the motifs are concordant, we also observed some differences (supplemental Fig. S2). As an example, a previously published dataset (7, 13) for HLA-B∗35:01 shows a tryptic digest like lysine/arginine signature for 8, 10, and 11-mers on the C-terminal end of the peptide, whereas our data show a motif that is consistent with the ones observed with 9-mers (Fig. 1C). We observe similar patterns in HLA-A∗29:02, HLA-A∗30:02, and HLA-B∗57:01 (supplemental Fig. S3), suggesting that there may be contamination in the publicly available datasets. Since MHC-binding algorithms learn allele-specific motifs, such subtle differences are critical to create accurate prediction models. We then evaluated the relationship between transcript abundance and peptide presentation (Fig. 1D). We observed a positive correlation between the propensity of a transcript to engender presented peptides and its expression level, represented by the TPM. We also evaluated the proteasomal cleavage signatures of presented peptides by examining the left- and right-flanking amino acids (Fig. 1E). The relative enrichment of amino acids compared with a random background set indicates an overrepresentation of lysine and arginines on the C-terminal end of the peptide. This observation is in concordance with previously described tryptic and chemotryptic enzymatic activity of proteasome cleavage (7, 13). Of note, we observe an enrichment of peptides originating from either the C terminal or N terminal of the source protein, indicating preferential proteasomal processing of peptides requiring relatively lesser enzymatic cleavage (two-sided t test, p = 3.77e-32, supplemental Fig. S1B). Enrichment was not observed in the N-terminal end of the peptide in other monoallelic datasets (7, 13), possibly because of overexpression of MHC with transient transfection protocols leading to a lack of discrimination among presented peptides.

Novel Monoallelic Cell Lines Enhance Diversity of Training Data and Reveal Subtle Binding Preferences

Beyond the added value of our unique parental cell line and a stable transfection approach, our dataset includes five alleles that had not previously been profiled using monoallelic cell lines. We selected these five alleles on the basis of two criteria: (1) the uniqueness of the allele binding pocket compared with previously profiled alleles (Fig. 2A) and (2) the population frequency of the allele in ethnic populations that are underrepresented in current datasets (Fig. 2B). For example, we clustered alleles by their binding pocket similarity and selected HLA-A∗02:52 from the HLA-A∗02 subcluster because of its high frequency in Iranian Kurdish populations (7% frequency, Fig. 2B). Though several HLA-A∗02 alleles have been profiled by previous groups (light gray), we found small differences in the motif from closely clustered alleles despite very similar binding pockets. For example, unlike surrounding HLA∗02 alleles, we observed the presence of tyrosine in the second anchor position (C-terminal end of peptide) for HLA-A∗02:52. Moreover, although HLA-A∗02:52 clusters tightly with HLA-A∗02:07, HLA-A∗02:07 has a strong preference for aspartic acid at position 3, whereas HLA-A∗02:52 has no preference. Similarly, we profiled three HLA-B∗15 alleles (HLA-B∗15:11, HLA-B∗15:13, and HLA-B∗15:18) that are frequent across several Asian populations, including the second most frequent allele in a Beijing population at over 12% (HLA-B∗15:11). Despite inclusion in the same allele subgroup and having binding pockets that cluster together, the three alleles have very distinct associated peptide motifs with each having different preferences at each anchor position. These three alleles highlight the value of our novel alleles in the creation of a pan-allelic model, which relies on learning how amino acid changes impact binding preference. Another novel allele, HLA-B∗81:01, falls outside a major binding pocket cluster, but its peptide motif shows strong similarities to surrounding alleles (Fig. 2A). Of note, HLA-B∗81:01 has moderate population frequency across over a dozen African populations (2–6% frequencies). Taken together, our novel alleles enhance both binding pocket diversity and population representation in a manner that is not represented in previously published training data.

Fig. 2.

Fig. 2

Binding pocket diversity and population frequencies of novel alleles.A, heatmaps for human leukocyte antigen (HLA)-A and HLA-B that represent the binding pocket similarity between alleles with monoallelic immunopeptidomics data. Dark blue squares represent alleles that have very similar binding pockets, whereas white squares represent alleles with divergent binding pockets. The 25 alleles profiled with our monoallelic system are denoted in orange. The five alleles that have not previously been profiled are denoted in green. Motifs for these novel alleles are shown alongside motifs for related alleles in gray. Black boxes deonte the cluster of alleles containing the newly profiled alleles. B, a heatmap showing the frequencies of the five novel alleles in several populations of diverse world ethnicities. Dark purple denotes high population frequencies of the alleles, and light purple denotes low population frequencies.

Large-Scale Data Integration Enhances the Representativeness of HLA Peptidome

Given the large amount of publicly available immunopeptidomics and binding affinity data from diverse tissue types and cell lines, we reasoned that a systematic integration of all available high-quality datasets would improve the representation of training data and enable feature engineering. Toward this end, we downloaded a large corpus of publicly available raw data (supplemental Table S2), comprising 2.15 million peptides from 512 experiments and covering 167 distinct alleles. A majority of the datasets are multiallelic (384 experiments from 15 projects) (12, 23, 27, 28, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39), but there are also a large number of monoallelic experiments (128 experiments from six projects) (7, 13, 40, 41, 42, 43, 44) and alleles with binding assay data (n = 90 alleles) (26) (Fig. 3A). As anticipated, the tissue types in the multiallelic datasets have diverse expression profiles (Fig. 3B). Furthermore, we observed significant differences in the gene expression profiles of transcripts from B721.221 cell lines (used in a large-scale monoallelic dataset generated from B721.221 cell line with alleles from HLA-A, HLA-B, and HLA-C) (7, 13) and K562 cell lines (used in our in-house data generated) (Fig. 3C). About 21% of transcripts display significant differences in expression between B721.221 and K562 cell lines, and these variations impact key pathways that change the functionality of the cells (supplemental Table S3). Given the vast differences in underlying proteomes and antigen-processing machinery of various tissue types, incorporation of these diverse data is particularly important for modeling presentation effectively.

Fig. 3.

Fig. 3

Systematic expansion of human leukocyte antigen (HLA) ligandome through the incorporation of publicly available data.A, box plots representing the number of unique peptides per sample from monoallelic and multiallelic immunopeptidomics samples that were reprocessed through our pipeline. Bar plot showing the number of samples for each project. Samples are colored according to their project. Peptide yields are log10 transformed. See supplemental Table S2 for additional details. B, a heatmap of expression values (transcript per million [TPM]) of highly differentiated genes between tissue and tumor types of publicly available multiallelic immunopeptidomics data. Low expression is shown in red, and high expression is shown in blue. C, a volcano plot denoting the differential gene expression between the monoallelic parental cell lines, B721.221 and K562. Gene transcripts with significant upregulation in B721.221 compared with K562 are shown in green, whereas gene transcripts with significant upregulation in K562 compared with B721.221 are shown in red. Gene transcripts with no significant upregulation or downregulation are shown in gray. D, a bar plot denoting the weighted fraction of alleles in 18 ethnicity populations from the National Marrow Donor Program within the expanded training dataset, including monoallelic cell lines profiled in house, public monoallelic data, public multiallelic, data and binding assay data from Immune Epitope Database (IEDB). E, two stacked bar plots showing the frequencies of amino acids at each position in the pseudo binding pocket for all annotated alleles in IMGT (top) and all alleles from the expanded training dataset, including monoallelic cell lines profiled in house, public monoallelic data, public multiallelic data, and binding assay data from IEDB.

We evaluated the representation of the MHC alleles in our expanded dataset on two parameters—population coverage and structural diversity of the binding pocket. We assessed population coverage using a simulation study that utilizes allele frequencies from the US National Marrow Donor Program as documented in the allele frequency net database (45). The alleles within our extended dataset represent an average coverage of 96% of observed alleles across various ethnic populations in the United States (Fig. 3D). Our training data span a wide range of allele binding pocket clusters that have vastly different motif characteristics, indicating a robust representation at the macro level. Since it is possible that relatively small differences in binding pocket sequences might lead to substantial differences in the presented motif, possibly because of the varied importance of amino acids in the binding pocket (46), we evaluated the representation at each position in the binding pocket. We observed a high degree of concordance in the distribution of frequencies between all known alleles and alleles in our expanded training dataset (Fig. 3E). Of note, this concordance remains high even when we exclude alleles only present in multiallelic samples (supplemental Fig. S4, A and B). By expanding our in-house dataset using publicly available data, we not only enhance the breadth of our training data, both in terms of tissue of origin and allelic diversity but also minimize potential biases introduced by data acquisition techniques, such as chromatography and peptide fragmentation approaches. In addition, the expanded scale and scope helps minimize out-of-distribution modeling errors.

Modeling pMHC Presentation Using Novel Features Encapsulating Antigen Processing

The unbiased and ex vivo nature of MS-based immunopeptidomics data uniquely lends itself to model both MHC-peptide binding and presentation. Modeling binding entails learning the motif preferences of HLA alleles and their cognate peptides, whereas modeling presentation includes learning propensities of upstream antigen processing in addition to binding motifs (Fig. 4A). Several features that capture various aspects of antigen processing and the biophysical characteristics of MHC-peptide binding have been discussed in the literature, the most prominent of these are peptide sequence, binding pocket sequence, expression of source protein, cleavage patterns, and various propensities for presentation that are not captured by expression (12). We use a combination of widely applied and novel features to model MHC-peptide binding and presentation.

Fig. 4.

Fig. 4

Modeling binding and presentation.A, a schematic showing the difference between major histocompatibility complex (MHC) binding and MHC presentation. MHC binding involves the ability of an MHC allele to bind to a paired peptide and is modeled with the peptide (P), allele binding pocket (B), and peptide length (L). MHC presentation involves all steps in the antigen-processing pathway in addition to MHC binding and is modeled with the peptide (P), allele binding pocket (B), peptide length (L), gene expression (T), flanking regions around the peptide (F), propensity of the gene to engender peptides (G), and propensity of the region within the gene to engender peptides (H). B, boxplots representing the distribution of peptides per transcript observed in the reprocessed multiallelic immunopeptidomics data across transcript deciles. The peptides observed are normalized by transcript length. Red boxes denote the transcripts that generate many observed peptides despite low expression levels and transcripts that generate few observed peptides despite high expression levels. C, the distributions of expected and observed peptides from across the ACTB protein. Expected peptides, shown in gray, are generated by summing the number of frequent alleles predicted to bind each peptide (rank <2 by netMHCpan4.0). The 30 most frequent alleles in the reprocessed multiallelic immunopeptidomics dataset were used for the analysis. Observed peptides are measured from the reprocessed multiallelic immunopeptidomics data and are shown in green.

We modeled binding using three features: the amino acid sequence of the peptide ligand (P), the binding pocket pseudo sequence of the presenting allele (B), and peptide length (L). In addition to these three binding-specific features, we integrated four additional features into our presentation model that encapsulate antigen processing. First, as evidence from our monoallelic immunopeptidomics data, the abundance of the source protein has a strong influence on peptide presentation (Fig. 1D). Accordingly, we used a transformed value of transcript expression (T) as a surrogate for protein expression. Second, our data also showed the strong preference of amino acids flanking the presented peptides for proteasomal cleavage; therefore, we incorporated the five amino acids upstream and downstream of each peptide (F) as an additional feature (Fig. 1E). It has been discussed previously that certain proteins or regions within proteins have a higher propensity for MHC presentation (23, 47, 48). We have noticed similar trends in the large set of public data that we had downloaded and reprocessed systematically (supplemental Table S2). We evaluated the dependence of the number of peptides presented (adjusted for transcript length) as a function of transcript expression of its cognate gene. As expected, we observed a positive correlation in the median values, but we also observed several outliers that do not follow this trend (data points highlighted in red boxes, Fig. 4B). We modeled this skew (propensity) and developed a gene propensity score (G) as our third feature (supplemental Fig. S5A, see Experimental Procedures section). Similarly, we noticed positional preferences within proteins when we compared observed and expected peptides from various locations within a protein. An anecdotal example of such a skew in actin beta gene (ACTB) is shown in Figure 4C. Briefly, position-specific coverage of observed peptides was contrasted with expected coverage based on predictions for the 30 most frequent alleles in our dataset. Importantly, we found a lack of correlation between predicted and observed peptide profiles across the entire proteome (supplemental Fig. S5B). We modeled this phenomenon using position-specific coverage of immunopeptides in our large dataset and developed a hot spot score (H) as our final feature (see Experimental Procedures section). An ensemble of these propensity scores and other features described previously facilitate the modeling of antigen processing and surface presentation.

Systematically Incorporating Binding Affinity, Monoallelic Data, and Multiallelic Data into a Composite Model

Both in vitro binding assays and monoallelic immunopeptidomics experiments produce allele-specific data, but assigning a peptide to its cognate allele, also known as deconvolution, is a key challenge in modeling multiallelic data. Several approaches have been applied to perform deconvolution (10, 11, 12). We utilized our comprehensive binding affinity and monoallelic datasets to develop a model-based deconvolution approach to convert multiallelic data into pseudo monoallelic data (Fig. 5A and supplemental Table S4). First, we generated a binding prediction model using exclusively monoallelic data (in-house monoallelic data, public monoallelic data, and in vitro–binding assay data extracted from IEDB). Unlike the binding affinity data from IEDB that generates both positive and negative examples, immunopeptidomics data only generate positive examples, so we generated negative examples using a large set of random peptides from the human proteome at a 1:20 ratio. A large proportion of negative examples helps minimize the risk of learning random or systematic biases in synthetically generated negative examples. This monoallelic model (MONO-Binding) is able to accurately predict motifs for alleles excluded from the training dataset (Fig. 5B and supplemental Fig. S6). Given our confidence in the model trained with allele-specific data (MONO-Binding), we applied it to each of the multiallelic datasets to assign each peptide in a sample to its cognate allele and generate pseudo monoallelic data. From 289 multiallelic sources representing 118 unique HLA alleles, we began with over 1.15 million peptides and successfully mapped about 700,000 peptides to alleles (∼60% mapping rate). Then, the monoallelic and pseudo monoallelic data were merged to train a comprehensive binding model (SHERPA-Binding). Finally, SHERPA-Binding was used as a primary model (p-model) or feature for our SHERPA-Presentation model. This final model incorporates all presentation features but is trained exclusively on monoallelic data to avoid overfitting to the multiallelic datasets that generated the gene propensity and hot spot features. As expected, when we examined the feature importances for each of the models, we observe the peptide anchor residues and diverse binding pocket amino acids have the greatest influence over the SHERPA-Binding model, and SHERPA-Binding has the greatest influence over the SHERPA-Presentation model (supplemental Fig. S7). For all models, prediction probabilities were calibrated into percentile ranks to remove allele-specific biases as described earlier (15).

Fig. 5.

Fig. 5

Overview of composite modeling approach and model performance.A, a schematic of the composite modeling approach. Inhouse monoallelic immunopeptidomics data, public monoallelic immunopeptidomics data, and Immune Epitope Database (IEDB) data are used to train MONO-Binding. MONO-Binding is used to deconvolute the multiallelic immunopeptidomics data to create pseudo monoallelic data. All monoallelic and pseudomonoallelic data are combined to train the Systematic Human Leukocyte Antigen Epitope Ranking Pan Algorithm (SHERPA) (SHERPA)-Binding model. The SHERPA-Binding model is used as a feature along with other presentation features to train the SHERPA-Presentation model on monoallelic immunopeptidomics data. B, a precision-recall curve demonstrating the predicted pan-performance on unseen alleles (MONO-Binding-LOO) compared with MONO-Binding and NetMHCpan4.1-BA, NetMHCpan-4.1-EL, MHCFlurry-2.0-BA. A model was trained for each allele with the data for that allele excluded from the training dataset. The MONO-Binding-LOO curve represents the predictions from each of the models on the test data of the allele excluded from the training data. C and D, boxplots denoting the distributions of positive predictive values (top 0.1%) across alleles within the monoallelic immunopeptidomics held-out test data. Distributions are shown for (C) NetMHCpan4.1-BA, NetMHCpan-4.1-EL, MHCFlurry-2.0-BA, MONO-Binding, SHERPA-Binding, and SHERPA-Presentation and (D) SHERPA-Binding, SHERPA-Binding + F, SHERPA-Binding + FT, SHERPA-Binding + TTG, and SHERPA-Presentation. E, boxplots showing the distribution of precision and recall values across alleles in the monoallelic immunopeptidomics data for SHERPA-Presentation across several percentile rank thresholds. A percentile rank of 0.1 is selected as the optimal threshold.

Benchmarking the Performance of SHERPA-Binding and Presentation Models

To evaluate our models, we first tested their performance on ∼10% held-out monoallelic data (positive examples), mixed with negative examples in a 1:999 ratio to mimic true ratio (5, 23, 24, 25, 26). We chose this prevalence to reflect the underlying characteristics of peptide presentation (5, 23, 24, 25, 26). The MONO-Binding model significantly outperformed NetMHCpan-4.1-BA, NetMHCpan-4.1-EL, and MHCFlurry-2.0-BA (0.53, 0.34, 0.44, and 0.40, respectively, Fig. 5C and supplemental Data 1A). When we restricted the input data to be only the publicly available monoallelic data (PUBLIC-Binding), SHERPA still outperformed the other three models, suggesting that the XGBoost modeling approach or our strict data curation added significant value (supplemental Fig. S8A). The binding model trained on the full dataset (SHERPA-Binding), comprising both monoallelic and pseudo monoallelic data, had the same PPV compared with MONO-Binding, a model trained on monoallelic data alone (0.53 and 0.53, respectively; Fig. 5C). SHERPA-Presentation has a better PPV compared with SHERPA-Binding, attesting to the utility of presentation-specific features. We further evaluated the contribution of each of our features and confirmed that each feature has an additive effect on performance (Fig. 5D). Flanking regions had a relatively minor impact on performance (∼0.01 PPV gain compared with SHERPA-Binding). We found the largest performance gain with the addition of abundance (TPM, ∼0.03 increase in PPV above SHERPA-Binding + F) and gene propensity (∼0.05 increase in PPV above SHERPA-Binding + FT), but we also observed significant gain with the addition of the hot spot score (∼0.01 PPV above SHERPA-Binding + FTG). For an orthogonal allele-specific validation, we assessed the performance of the models using the held-out IEDB data. Of the comparison models, NetMHCpan-4.1-BA had the best performance. MONO-Binding and SHERPA-Binding performed similarly to NetMHCpan-4.1-EL and MHCFlurry-2.0-BA (supplemental Fig. S8B and supplemental Data 1B). Finally, we evaluated the precision and recall at various rank values for both our multiallelic binding and presentation models. We observed a consistently and significantly higher precision and recall compared with other models at all rank values (Fig. 5E and supplemental Fig. S8, CE). Using this analysis, we empirically determined a rank threshold of 0.1 to be our definition of a “binding” or “presented” peptide.

Experimental Validation of SHERPA Using Tissue Samples

To understand the utility of our prediction models in a patient sample, we performed immunopeptidomics on tumor samples from seven lung cancer patients and five colorectal cancer patients. In these experiments, we observed strong peptide yields (4043 median, supplemental Fig. S9A, supplemental Table S5, and supplemental Data 1C). As expected, we did not observe any neoantigens from these patients in the immunopeptidomics data, as data-dependent acquisition–based immunoprecipitation–MS protocols lack the required sensitivity of detection. A total of 45 alleles were represented across the 12 patients, and one of the alleles was outside our training dataset. Thus, testing our performance on these samples provided an opportunity to assess our prediction accuracy on trained alleles, as well as the pan-allelic prediction capability on an untrained allele. To establish a prediction score for each peptide across all six potential patient alleles (patient-centric rank), we used the rank from the allele with the lowest rank to represent how well a peptide is bound or presented by that patient. We observed a strong performance of our models, which recovered 1.17 times more experimentally observed peptides among those that were predicted to be bound or presented than the next best model (MHCFlurry-2.0-BA) (Fig. 6A). Furthermore, the multiallelic models outperformed the exclusively allele-specific models, highlighting the importance of the multiallelic deconvolution approach. Although all models used a percentile ranking approach, we found that MHCFlurry-2.0-BA predicts an average of 30 percentile rank for negative peptides (compared with the expected ∼50 percentile rank for all other models), suggesting that the sensitivity metric used in this analysis may be overrepresenting MHCFlurry’s performance (supplemental Fig. S9, C and D). To further corroborate our results, we performed the same analysis in the ∼10% of the peptides from tissue data from ovarian cancer and colorectal cancer studies that had been held out from multiallelic training (27, 28). We found similar trends across both these external datasets, on data that were not part of our training data, with the SHERPA-Presentation model showing the best results, showing 1.03 to 1.22 times more experimentally observed peptides predicted than MHCFlurry-2.0-BA (Fig. 6, B and C).

Fig. 6.

Fig. 6

Performance of Systematic Human Leukocyte Antigen Epitope Ranking Pan Algorithm (SHERPA) on tissue samples and immunogenic epitopes. Boxplots showing the distribution of prediction performance across (A) tumors profiled with immunopeptidomics in-house (lung and colorectal, left), (B) by Schuster et al. (ovarian, middle) and (C) Loffler et al. (colorectal, right). Performance is defined as the fraction of peptides observed with immunopeptidomics that are predicted to bind in the top 0.1% of all peptides percentile rank ≤0.1). Performance is shown for the following models: NetMHCpan4.1-BA, NetMHCpan-4.1-EL, MHCFlurry-2.0-BA, MONO-Binding, SHERPA-Binding, and SHERPA-Presentation. D and E, bar plots showing the sensitivity of NetMHCpan4.1-BA, NetMHCpan-4.1-EL, MHCFlurry-2.0-BA, MONO-Binding, and SHERPA-Binding on the Chowell et al. immunogenicity dataset: (D) performance across all epitopes and (E) performance across high frequency alleles.

Immunogenicity

Although the SHERPA models do not specifically predict the likelihood that a peptide will be immunogenic, MHC presentation is a gatekeeping step of an immunogenic response. Thus, one would expect that all immunogenic peptides are successfully bound and presented on the cell surface. As an additional evaluation, we used the dataset of immunogenic peptides described by Chowell et al. (29) to evaluate the ability of the models to predict these epitopes. Across the entire dataset, we observe the best performance by the MONO-Binding model and the SHERPA-Binding model (Fig. 6D and supplemental Data 1D). We see variable but similar performance across the specific alleles. Of note, we observe a 1.11-fold improvement over the next best model for HLA-A∗02:01, the most studied allele (Fig. 6E). Together, this evidence suggests that the performance of the SHERPA models is generalizable and resilient to data sources.

Discussion

Many personalized cancer immunotherapies require precise neoantigen identification. Although advances in next-generation sequencing technologies have enabled large-scale survey of putative neoantigens, algorithmic neoantigen ranking continues to have limited accuracy. Our work seeks to improve precision neoantigen discovery by addressing both the training data and modeling approach. Since prediction models are highly dependent upon their training data, we generated immunopeptidomics data using monoallelic cell lines that unambiguously map a peptide to an HLA allele. To ensure generalizability of SHERPA beyond our model system, we further expanded the HLA-peptidome by incorporating publicly available monoallelic immunopeptidomics data from different parental cell lines, allele-specific binding array data, and multiallelic tissue immunopeptidomics data. Moreover, we developed a modeling approach that integrates the data types strategically and mines the breadth of data to model antigen processing across a variety of tissue types. It also accounts for real-world data by accounting for diploid allele data.

While a combination of several factors lead to marked improvements in performance of SHERPA, we believe that three aspects set it apart. First, SHERPA employs the largest training data of any neoantigen prediction tool to our knowledge. Peptides from 167 unique human alleles and antigen processing from 30 unique expression profile backgrounds ensure the representativeness of our training data. Second, we designed features that mine extensive tissue-based immunopeptidomics data to model presentation and demonstrate ∼0.06 improvement in our monoallelic PPV assessment with their inclusion. Third, SHERPA’s composite modeling approach for integrating multiple data types reduces inherent biases caused by a single data source and allows for robust multiallelic deconvolution.

Our work is timely because it addresses two critical needs in the field. First, the Human Immuno-Peptidome Project outlined the need for large cohort immunopeptidomics studies in their effort to comprehensively map the human immunopeptidome (49). With this project, we contribute a new set of monoallelic immunopeptidomics data on a different cell line background than previous large studies and a previous set of lung and colorectal tumor tissues. Second, the scientific community has an urgent need for representing diverse ethnic populations in medical research (18). We addressed this issue by profiling five novel HLA alleles that are frequent in under-represented Asian, African, and Middle Eastern populations.

Though our work shows great improvement, both SHERPA and the field of neoantigen prediction continues to face some key challenges. We integrated different data types (binding array and immunopeptidomics) to decrease bias and increase generalizability, whereas the vast majority of previous pMHC data was produced using very similar protocols. Accordingly, developing newer approaches to affinity purification other than the commonly used W6/32 antibody immunoprecipitation, applying newer data acquisition strategies such as data-independent acquisition, and generating immunopeptidomics data under different cell states are avenues for future improvement. In addition, gene expression is an imperfect surrogate for protein abundance. Other methods, such as ribosome footprinting, may provide a better estimate of protein levels. Moreover, some of the features used in the SHERPA-Presentation model are exclusively designed to the human proteome without holding out specific proteins or protein subregions, so the results of the SHERPA-Presentation model should be interpreted in light of the inclusion of all proteins and protein regions in the feature generation process. The SHERPA-Binding model is a more accurate representation of performance on nonhuman candidates. Finally, SHERPA does not specifically predict the likelihood that a ligand will be immunogenic; however, SHERPA captures a higher percentage of immunogenic epitopes than other methods, suggesting that peptides predicted by SHERPA will be more likely to be immunogenic.

In summary, we demonstrate that SHERPA is an important method that balances the tension between the clarity of monoallelic motifs and the generalizability of tissue data while consistently outperforming previous models. Its improved performance is expected to be useful in neoantigen prediction for immunotherapy.

Data Availability

The MS proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD023064 (49). Evaluation data can be downloaded from the following link: https://drive.google.com/drive/folders/1y3IBELIU5TgUaEiWrPANxRlTVUhYQGGp?usp=sharing.

Supplemental data

This article contains supplemental data.

Conflict of interest

R. M. P., D. M., Steven Dea, C. A., S. V. Z., N. A. P., J. H., G. B., Sejal Desai, R. M., J. W., R. C., and S. M. B. are full-time employees of Personalis. M. P. S. cofounded Personalis. Personalis Inc provided the funding for this project.

Acknowledgments

The cell line transfections were conducted at Thermo Fisher Scientific. The immunopeptidomics experiments were conducted at Cayman Chemicals and MS Bioworks, LLC. We thank Josette Northcott for the help in interpretation of the flow cytometry data. We thank Eric Levy, Euan Ashley, Russ B. Altman, and Atul Butte for the helpful feedback on the article.

Author contributions

R. M. P., D. M., R. C., and S. M. B. conceptualization; R. M. P., D. M., and S. M. B. methodology; R. M. P., D. M., Steven Dea, S. V. Z., N. A. P., J. H., and G. B. software; R. M. P., D. M., and Steven Dea validation; R. M. P. and D. M. formal analysis; C. A. resources; R. M. P., D. M., and Steven Dea data curation; R. M. P. and D. M. writing–original draft; Steven Dea, C. A., Sejal Desai, R. M., M. P. S., R. C., and S. M. B. writing–review & editing; R. M. P. and D. M. visualization; R. C. and S. M. B. supervision; S. M. B. project administration; J. W. and R. C. funding acquisition.

Supplemental Data

Supplemental information
mmc1.pdf (4.4MB, pdf)
Supplemental Table S1
mmc2.xlsx (1.6MB, xlsx)
Supplemental Table S2
mmc3.xlsx (4.1MB, xlsx)
Supplemental Table S3
mmc4.xlsx (83.6KB, xlsx)
Supplemental Table S4
mmc5.xlsx (9KB, xlsx)
Supplemental Table S5
mmc6.xlsx (1.9MB, xlsx)

References

  • 1.Wells D.K., van Buuren M.M., Dang K.K., Hubbard-Lucey V.M., Sheehan K.C.F., Campbell K.M., et al. Key parameters of tumor epitope immunogenicity revealed through a consortium approach improve neoantigen prediction. Cell. 2020;183:818–834. doi: 10.1016/j.cell.2020.09.015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Yadav M., Jhunjhunwala S., Phung Q.T., Lupardus P., Tanguay J., Bumbaca S., et al. Predicting immunogenic tumour mutations by combining mass spectrometry and exome sequencing. Nature. 2014;515:572–576. doi: 10.1038/nature14001. [DOI] [PubMed] [Google Scholar]
  • 3.Schumacher T.N., Schreiber R.D. Neoantigens in cancer immunotherapy. Science. 2015;348:69–74. doi: 10.1126/science.aaa4971. [DOI] [PubMed] [Google Scholar]
  • 4.Sette A., Vitiello A., Reherman B., Fowler P., Nayersina R., Kast W.M., et al. The relationship between class I binding affinity and immunogenicity of potential cytotoxic T cell epitopes. J. Immunol. 1994;153:5586–5592. [PubMed] [Google Scholar]
  • 5.Hunt D.F., Henderson R.A., Shabanowitz J., Sakaguchi K., Michel H., Sevilir N., et al. Characterization of peptides bound to the class I MHC molecule HLA-A2.1 by mass spectrometry. Science. 1992;255:1261–1263. doi: 10.1126/science.1546328. [DOI] [PubMed] [Google Scholar]
  • 6.Andreatta M., Lund O., Nielsen M. Simultaneous alignment and clustering of peptide data using a Gibbs sampling approach. Bioinformatics. 2013;29:8–14. doi: 10.1093/bioinformatics/bts621. [DOI] [PubMed] [Google Scholar]
  • 7.Abelin J.G., Keskin D.B., Sarkizova S., Hartigan C.R., Zhang W., Sidney J., et al. Mass spectrometry profiling of HLA-associated peptidomes in mono-allelic cells enables more accurate epitope prediction. Immunity. 2017;46:315–326. doi: 10.1016/j.immuni.2017.02.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Nielsen M., Lundegaard C., Blicher T., Lamberth K., Harndahl M., Justesen S., et al. NetMHCpan, a method for quantitative predictions of peptide binding to any HLA-A and -B locus protein of known. PLoS One. 2007;2:796. doi: 10.1371/journal.pone.0000796. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Andreatta M., Alvarez B., Nielsen M. GibbsCluster: unsupervised clustering and alignment of peptide sequences. Nucleic Acids Res. 2017;45:W458–W463. doi: 10.1093/nar/gkx248. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Bassani-Sternberg M., Chong C., Guillaume P., Solleder M., Pak H., Gannon P.O., et al. Deciphering HLA-I motifs across HLA peptidomes improves neo-antigen predictions and identifies allostery regulating HLA specificity. PLoS Comput. Biol. 2017;13:e1005725. doi: 10.1371/journal.pcbi.1005725. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Reynisson B., Alvarez B., Paul S., Peters B., Nielsen M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2020;48:W449–W454. doi: 10.1093/nar/gkaa379. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Bulik-Sullivan B., Busby J., Palmer C.D., Davis M.J., Murphy T., Clark A., et al. Deep learning using tumor HLA peptide mass spectrometry datasets improves neoantigen identification. Nat. Biotechnol. 2018;37:55–63. doi: 10.1038/nbt.4313. [DOI] [PubMed] [Google Scholar]
  • 13.Sarkizova S., Klaeger S., Le P.M., Li L.W., Oliveira G., Keshishian H., et al. A large peptidome dataset improves HLA class I epitope prediction across most of the human population. Nat. Biotechnol. 2020;38:199–209. doi: 10.1038/s41587-019-0322-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.O'Donnell T., Rubinsteyn A., Bonsack M., Riemer A., Hammerbacher J. MHCflurry: open-source class I MHC binding affinity prediction. Cell Syst. 2018;7:129–132. doi: 10.1016/j.cels.2018.05.014. [DOI] [PubMed] [Google Scholar]
  • 15.Jurtz V., Paul S., Andreatta M., Marcatili P., Peters B., Nielsen M. NetMHCpan 4.0: improved peptide-MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data. J. Immunol. 2017;199:3360–3368. doi: 10.4049/jimmunol.1700893. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Shao X.M., Bhattacharya R., Huang J., Sivakumar I.K.A., Tokheim C., Zheng L., et al. High-throughput prediction of MHC class I and II neoantigens with MHCnuggets. Cancer Immunol. Res. 2020;8:396–408. doi: 10.1158/2326-6066.CIR-19-0464. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.O'Donnell T.J., Rubinsteyn A., Laserson U. MHCflurry 2.0: improved pan-allele prediction of MHC class I-presented peptides by incorporating antigen processing. Cell Syst. 2020;11:418–419. doi: 10.1016/j.cels.2020.09.001. [DOI] [PubMed] [Google Scholar]
  • 18.Diversifying clinical trials. Nat. Med. 2018;24:1779. doi: 10.1038/s41591-018-0303-4. [DOI] [PubMed] [Google Scholar]
  • 19.Zhang J., Xin L., Shan B., Chen W., Xie M., Yuen D., et al. Peaks DB: de novo sequencing assisted database search for sensitive and accurate peptide identification. Mol. Cell. Proteomics. 2012;11 doi: 10.1074/mcp.M111.010587. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Love M.I., Huber W., Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Chen T., Guestrin C. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery; New York, NY: 2016. XGBoost: a scalable tree boosting system; pp. 785–794. [Google Scholar]
  • 22.Bergstra J., Komer B., Eliasmith C., Yamins D., Cox D.D. Hyperopt: a Python library for model selection and hyperparameter optimization. Comput. Sci. Discov. 2015;8 [Google Scholar]
  • 23.Bassani-Sternberg M., Pletscher-Frankild S., Jensen L.J., Mann M. Mass spectrometry of human leukocyte antigen class I peptidomes reveals strong effects of protein abundance and turnover on antigen presentation. Mol. Cell. Proteomics. 2015;14:658–673. doi: 10.1074/mcp.M114.042812. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Rammensee H.G., Friede T., Stevanoviíc S. MHC ligands and peptide motifs: first listing. Immunogenetics. 1995;41:178–228. doi: 10.1007/BF00172063. [DOI] [PubMed] [Google Scholar]
  • 25.Rammensee H., Bachmann J., Emmerich N.P., Bachor O.A., Stevanović S. SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics. 1999;50:213–219. doi: 10.1007/s002510050595. [DOI] [PubMed] [Google Scholar]
  • 26.Vita R., Overton J.A., Greenbaum J.A., Ponomarenko J., Clark J.D., Cantrell J.R., et al. The immune epitope database (IEDB) 3.0. Nucleic Acids Res. 2015;43:D405–D412. doi: 10.1093/nar/gku938. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Schuster H., Peper J.K., Bösmüller H.-C., Röhle K., Backert L., Bilich T., et al. The immunopeptidomic landscape of ovarian carcinomas. Proc. Natl. Acad. Sci. U. S. A. 2017;114:E9942–E9951. doi: 10.1073/pnas.1707658114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Löffler M.W., Kowalewski D.J., Backert L., Bernhardt J., Adam P., Schuster H., et al. Mapping the HLA ligandome of colorectal cancer reveals an imprint of malignant cell transformation. Cancer Res. 2018;78:4627–4641. doi: 10.1158/0008-5472.CAN-17-1745. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Chowell D., Krishna S., Becker P.D., Cocita C., Shu J., Tan X., et al. TCR contact residue hydrophobicity is a hallmark of immunogenic CD8+ T cell epitopes. Proc. Natl. Acad. Sci. U. S. A. 2015;112:E1754–E1762. doi: 10.1073/pnas.1500973112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Khodadoust M.S., Olsson N., Wagar L.E., Haabeth O.A.W., Chen B., Swaminathan K., et al. Antigen presentation profiling reveals recognition of lymphoma immunoglobulin neoantigens. Nature. 2017;543:723–727. doi: 10.1038/nature21433. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Ternette N., Olde Nordkamp M.J.M., Müller J., Anderson A.P., Nicastri A., Hill A.V.S., et al. Immunopeptidomic profiling of HLA-A2-positive triple negative breast cancer identifies potential immunotherapy target antigens. Proteomics. 2018;18:e1700465. doi: 10.1002/pmic.201700465. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Shraibman B., Barnea E., Kadosh D.M., Haimovich Y., Slobodin G., Rosner I., et al. Identification of tumor antigens among the HLA peptidomes of glioblastoma tumors and plasma. Mol. Cell. Proteomics. 2019;18:1255–1268. doi: 10.1074/mcp.RA119.001524. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Pearson H., Daouda T., Granados D.P., Durette C., Bonneil E., Courcelles M., et al. MHC class I-associated peptides derive from selective regions of the human genome. J. Clin. Invest. 2016;126:4690–4701. doi: 10.1172/JCI88590. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Bassani-Sternberg M., Bräunlein E., Klar R., Engleitner T., Sinitcyn P., Audehm S., et al. Direct identification of clinically relevant neoepitopes presented on native human melanoma tissue by mass spectrometry. Nat. Commun. 2016;7:13404. doi: 10.1038/ncomms13404. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Chong C., Marino F., Pak H., Racle J., Daniel R.T., Müller M., et al. High-throughput and sensitive immunopeptidomics platform reveals profound interferonγ-mediated remodeling of the human leukocyte antigen (HLA) ligandome. Mol. Cell. Proteomics. 2018;17:533–548. doi: 10.1074/mcp.TIR117.000383. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Löffler M.W., Mohr C., Bichmann L., Freudenmann L.K., Walzer M., Schroeder C.M., et al. Multi-omics discovery of exome-derived neoantigens in hepatocellular carcinoma. Genome Med. 2019;11:28. doi: 10.1186/s13073-019-0636-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Laumont C.M., Daouda T., Laverdure J.-P., Bonneil É., Caron-Lizotte O., Hardy M.-P., et al. Global proteogenomic analysis of human MHC class I-associated peptides derived from non-canonical reading frames. Nat. Commun. 2016;7:10238. doi: 10.1038/ncomms10238. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Shraibman B., Kadosh D.M., Barnea E., Admon A. Human leukocyte antigen (HLA) peptides derived from tumor antigens induced by inhibition of DNA Methylation for development of drug-facilitated immunotherapy. Mol. Cell. Proteomics. 2016;15:3058–3070. doi: 10.1074/mcp.M116.060350. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Mommen G.P.M., Frese C.K., Meiring H.D., van Gaans-van den Brink J., de Jong A.P.J.M., van Els C.A.C.M., et al. Expanding the detectable HLA peptide repertoire using electron-transfer/higher-energy collision dissociation (EThcD) Proc. Natl. Acad. Sci. U. S. A. 2014;111:4507–4512. doi: 10.1073/pnas.1321458111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Illing P.T., Pymm P., Croft N.P., Hilton H.G., Jojic V., Han A.S., et al. HLA-B57 micropolymorphism defines the sequence and conformational breadth of the immunopeptidome. Nat. Commun. 2018;9:4693. doi: 10.1038/s41467-018-07109-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Di Marco M., Schuster H., Backert L., Ghosh M., Rammensee H.-G., Stevanović S. Unveiling the peptide motifs of HLA-C and HLA-G from naturally presented peptides and generation of binding prediction matrices. J. Immunol. 2017;199:2639–2651. doi: 10.4049/jimmunol.1700938. [DOI] [PubMed] [Google Scholar]
  • 42.Kaur G., Gras S., Mobbs J.I., Vivian J.P., Cortes A., Barber T., et al. Structural and regulatory diversity shape HLA-C protein expression levels. Nat. Commun. 2017;8:15924. doi: 10.1038/ncomms15924. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Alpízar A., Marino F., Ramos-Fernández A., Lombardía M., Jeko A., Pazos F., et al. A molecular basis for the presentation of phosphorylated peptides by HLA-B antigens. Mol. Cell. Proteomics. 2017;16:181–193. doi: 10.1074/mcp.M116.063800. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Yair-Sabag S., Tedeschi V., Vitulano C., Barnea E., Glaser F., Melamed Kadosh D., et al. The peptide repertoire of HLA-B27 may include ligands with lysine at P2 anchor position. Proteomics. 2018;18:e1700249. doi: 10.1002/pmic.201700249. [DOI] [PubMed] [Google Scholar]
  • 45.Dos Santos E.J.M., McCabe A., Gonzalez-Galarza F.F., Jones A.R., Middleton D. Allele frequencies net database: improvements for storage of individual genotypes and analysis of existing data. Hum. Immunol. 2016;77:238–248. doi: 10.1016/j.humimm.2015.11.013. [DOI] [PubMed] [Google Scholar]
  • 46.Johansen T.E., McCullough K., Catipovic B., Su X.M., Amzel M., Schneck J.P. Peptide binding to MHC class I is determined by individual pockets in the binding groove. Scand. J. Immunol. 1997;46:137–146. doi: 10.1046/j.1365-3083.1997.d01-102.x. [DOI] [PubMed] [Google Scholar]
  • 47.Müller M., Gfeller D., Coukos G., Bassani-Sternberg M. “Hotspots” of antigen presentation revealed by human leukocyte antigen ligandomics for neoantigen prioritization. Front. Immunol. 2017;8:1367. doi: 10.3389/fimmu.2017.01367. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Vizcaíno J.A., Kubiniok P., Kovalchik K.A., Ma Q., Duquette J.D., Mongrain I., et al. The human immunopeptidome project: a roadmap to predict and treat immune diseases. Mol. Cell. Proteomics. 2020;19:31–49. doi: 10.1074/mcp.R119.001743. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Perez-Riverol Y., Csordas A., Bai J., Bernal-Llinares M., Hewapathirana S., Kundu D.J., et al. The PRIDE database and related tools and resources in 2019: improving support for quantification data. Nucleic Acids Res. 2019;47:D442–D450. doi: 10.1093/nar/gky1106. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplemental information
mmc1.pdf (4.4MB, pdf)
Supplemental Table S1
mmc2.xlsx (1.6MB, xlsx)
Supplemental Table S2
mmc3.xlsx (4.1MB, xlsx)
Supplemental Table S3
mmc4.xlsx (83.6KB, xlsx)
Supplemental Table S4
mmc5.xlsx (9KB, xlsx)
Supplemental Table S5
mmc6.xlsx (1.9MB, xlsx)

Data Availability Statement

The MS proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD023064 (49). Evaluation data can be downloaded from the following link: https://drive.google.com/drive/folders/1y3IBELIU5TgUaEiWrPANxRlTVUhYQGGp?usp=sharing.


Articles from Molecular & Cellular Proteomics : MCP are provided here courtesy of American Society for Biochemistry and Molecular Biology

RESOURCES