Summary
Identification of proteasomal spliced peptides (PSPs) by mass spectrometry (MS) is not possible with traditional search engines. Here, we provide a protocol for running RHybridFinder (RHF), an R package for the computational inference of putative PSPs detected by MS. RHF extracts high confidence scored de novo sequenced peptides identified by PEAKS software. Those peptides are then matched to protein databases to infer cis- or trans-spliced major histocompatibility complex (MHC)-associated peptides. RHF is relatively fast and straightforward. PSPs have to be validated experimentally.
For complete details on the use and execution of the original protocol, please refer to Faridi et al. (2018).
Subject areas: Bioinformatics, Cancer, Computer sciences, Immunology, Mass Spectrometry
Highlights
• RHybridFinder (RHF) is an improved R package for the discovery of spliced peptides
• RHF builds upon the algorithm published in Faridi et al. (2018)
• RHF uses MS data analyzed in PEAKS
• The spliced peptide candidates generated by RHF need to be validated experimentally
Before you begin
The proteasome is recognized as the core enzymatic machinery of the antigen processing and presentation pathway wherein peptides derived from proteasomal proteolysis are selectively presented on the cell surface by MHC (major histocompatibility complex)-I molecules (Neefjes et al., 2011). In 2004, Hanada et al. discovered that the proteasome could cleave and splice peptide fragments to generate immunogenic epitopes presented by MHC class I molecules (Hanada et al., 2004). Following this groundbreaking discovery, other research groups have been able to uncover additional T cell spliced epitopes generated by the proteasome, referred to in this protocol as proteasomal spliced peptides (PSPs) (Berkers et al., 2015; Dalet et al., 2011; Ebstein et al., 2016; Michaux et al., 2014; Vigneron et al., 2004).
More recently, MS-based immunopeptidomics has been used to expedite the identification of PSPs in a systematic manner, including cis- and trans-spliced peptides (Berkers et al., 2015; Faridi et al., 2018; Liepe et al., 2010, 2016; Rolfs et al., 2019; Specht et al., 2020). However, MS-based studies using different computational approaches have led to a debate around the proportion of those PSPs in the MHC class I immunopeptidome (Lichti, 2021; Mylonas et al., 2018; Wilhelm et al., 2021).
Here, we provide a protocol to run RHybridFinder (RHF), an open access and improved R package built upon the computational workflow developed by Faridi et al. (2018) for the analysis of MS data to systematically identify putative PSPs. High-speed performance is the main strength of RHF, in addition to being relatively straightforward to run. The main limitation is that the PSPs identified by RHF may not be genuinely spliced by the proteasome in vivo. Their source and presentation should therefore be validated experimentally to move the debate forward (Figure 1).
Figure 1.
Overview of suggested workflow for the discovery of PSPs
We propose a four-step workflow for the identification of PSPs. The first three steps (blue squares: sample preparation, MS data acquisition and RHybridFinder) enable computational exploration of putative PSPs, followed by experimental validations (green square). A non-exhaustive list of possible experiments is shown for validating/gaining confidence in the identification of MHC-I peptides that are genuinely catalyzed by proteasomal splicing.
RHybridFinder is available on CRAN (https://cran.r-project.org/package=RHybridFinder) to enable more researchers to explore those debated peptides.
Data collection
For demonstration of the output of the different RHybridFinder functions, we have used datasets from the HLA Ligand Atlas (Marcu et al., 2021) deposited in PRIDE (Proteomics IDentification Database) PXD019643.
1. Download the following mzML files:
171002_AM_AUT01-DN17_Liver_W6-32_10%_DDA_3_400-650mz_msms4, 171002_AM_AUT01-DN17_Liver_W6-32_10%_DDA_3_400-650mz_msms5, 171002_AM_AUT01-DN17_Liver_W6-32_10%_DDA_3_400-650mz_msms6.
2. Analyze these files in PEAKS.
Installing RStudio/R
The RHybridFinder package was developed in RStudio and is implemented in the R programming language.
3. Download and install RStudio if not already installed (https://www.rstudio.com/products/rstudio/download/).
Installing and loading RHybridFinder
Below are the lines needed to install the RHybridFinder package from CRAN (the Comprehensive R Archive Network) and then load it.
4. Install RHybridFinder by typing install.packages("RHybridFinder") in the R console.
> install.packages("RHybridFinder")
5. Load RHybridFinder by typing library(RHybridFinder) in the R console.
> library(RHybridFinder)
CRITICAL: If you copy the lines of code from here, keep in mind that you might have to retype the quotation marks yourself so that they are straight quotes (").
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Deposited data | | |
| Human liver sample from autologous donor 17 - HLA Ligand Atlas | (Marcu et al., 2021) | PXD019643 |
| Software and algorithms | | |
| RStudio (version 1.3.1093) | RStudio website https://www.rstudio.com | SCR_000432 |
| R (>3.5.0) | R statistical software (https://www.r-project.org/) | SCR_001905 |
| PEAKS (PEAKS X studio) | PEAKS website: https://www.bioinfor.com/ | N/A |
| RHybridFinder (v. 0.2.0) | https://cran.r-project.org/web/packages/RHybridFinder/index.html | N/A |
| seqinr (v. 4.2-5) | CRAN - (Charif and Lobry, 2007) | N/A |
| foreach (v. 1.5.1), doParallel (v. 1.0.16) | CRAN | N/A |
| netMHCpan (v. 4.0 & 4.1) | DTU Health Tech: https://services.healthtech.dtu.dk (Reynisson et al., 2020) | SCR_018182 |
| hybrid finder | Faridi et al. (2018) (workflow on which the package is based) | N/A |
Step-by-step method details
Step 1: Load inputs into R
Timing: 1 min
Before running HybridFinder, the inputs need to be loaded into R. We propose the following way of loading the files in order to facilitate the process (Figure 2).
1. Create an object (folder_Exp1) for the path to the parent folder (Mel_Exp1); both can be named differently.
> folder_Exp1 <- file.path("/Users/YOURUSERNAME/Desktop/Mel_Exp1")
2. Import the de novo sequencing results as well as the database search results, both of which are located in the first_run child folder.
a. de novo sequencing results file
> denovo_Exp1 <- read.csv(file = file.path(folder_Exp1, "first_run", "all_denovo_candidates.csv"), header = TRUE, sep = ",", stringsAsFactors = FALSE)
b. database search results file
> db_search_Exp1 <- read.csv(file = file.path(folder_Exp1, "first_run", "DB search psm.csv"), header = TRUE, sep = ",", stringsAsFactors = FALSE)
3. Create an object for the path to the proteome file, located in the parent folder (folder_Exp1) (see refproteome_Exp1 in the example below). The fasta proteome will be imported into R by the HybridFinder function.
> refproteome_Exp1 <- file.path(folder_Exp1, "uniprothuman-20379entries-Nov2019_validated.fasta")
CRITICAL: Please note that if you copy the file access path in Windows, you will need to replace each backslash ("\") with a forward slash ("/").
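To illustrate the note above, a path copied from Windows can be converted before it is used in file.path or read.csv. This is a minimal sketch in base R; the helper name to_forward_slashes and the example path are hypothetical and not part of RHybridFinder.

```r
# Hypothetical helper: replace every backslash in a path with a forward slash.
to_forward_slashes <- function(path) {
  gsub("\\", "/", path, fixed = TRUE)
}

# In R source code a literal backslash must itself be escaped as "\\".
win_path <- "C:\\Users\\YOURUSERNAME\\Desktop\\Mel_Exp1"
folder_Exp1 <- to_forward_slashes(win_path)
print(folder_Exp1)
```

For paths that already exist on disk, base R's normalizePath(path, winslash = "/") is an alternative on Windows.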
Figure 2.
Recommended folder structure
The parent folder includes two child folders. The child folders include the various files that are necessary for running RHybridFinder. The dotted line (second_run) indicates that the DB search psm.csv file is added after the second DB search.
Access the datasets included in the R package
The RHybridFinder package also includes demonstration datasets from the HLA Ligand Atlas that have already been analyzed in PEAKS. These datasets include PEAKS de novo sequencing results and PEAKS database search results.
# access the de novo dataset
> data("denovo_Human_Liver_AUTD17", package = "RHybridFinder")
# access the database search dataset
> data("db_Human_Liver_AUTD17", package = "RHybridFinder")
Note: Due to size constraints, the proteome database (.fasta) file is not included in the package. It can be downloaded from the UniProt database.
Note: The denovo_Human_Liver_AUTD17 and db_Human_Liver_AUTD17 objects should appear in the Environment tab. If <promise> is displayed next to an object, click on the object and the data will be loaded.
Step 2: Run HybridFinder
Timing: 2–5 min (with parallelism, 8 cores) - 10–15 min (without parallelism)
In order to keep the runtime relatively short, we have implemented an option to use parallel computing. However, because parallel computing requires a sufficient number of processing cores to function properly, HybridFinder can also be run without it.
With the default parameters of the HybridFinder function and an “all de novo candidates.csv” file containing 16,286 peptide sequences, the runtime with parallelism on 8 cores is 2 min 17 s; approximately 5 min are required for double the number of peptides. Without parallelism, the runtime ranged between 10 and 15 min for 16,286 peptide sequences.
4. Run HybridFinder (please refer to Table 1 for details on the required inputs) and export the results to the parent folder.
> HybridFinder_results_Exp1 <- HybridFinder(denovo_candidates = denovo_Exp1, db_search = db_search_Exp1, proteome_db = refproteome_Exp1, customALCcutoff = NULL, with_parallel = TRUE, customCores = 8, export_files = TRUE, export_dir = folder_Exp1)
CRITICAL: if you use the datasets included in the package, please note that they are named differently so for instance the “denovo_candidates” and “db_search” parameters should be set to the datasets loaded from the package: denovo_Human_Liver_AUTD17 and db_Human_Liver_AUTD17, respectively.
CRITICAL: Make sure to store the HybridFinder results in an object (e.g., HybridFinder_results_Exp1), as the HybridFinder output dataframe is needed by the second function.
Note: At the end of the hybrid proteome will be the concatenated hybrid fake proteins with the name pattern ‘sp|denovo_HF_fake_protein_[#]’.
Note: Parallel computing is used only if with_parallel is set to TRUE and the PC has more than 5 cores.
CRITICAL: Please keep the number of other open windows to a minimum and save any work in other software prior to using HybridFinder with parallelism.
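Whether parallelism is worthwhile on a given machine can be checked before calling HybridFinder. The sketch below relies on detectCores() from the parallel package that ships with R; the helper choose_parallel_settings and the choice to leave two cores free are assumptions made for illustration and are not part of RHybridFinder.

```r
library(parallel)

# Hypothetical helper: derive with_parallel / customCores from a core count.
# The "more than 5 cores" rule mirrors the note above; leaving 2 cores free
# is an assumption made here to keep the machine responsive.
choose_parallel_settings <- function(n_cores) {
  if (is.na(n_cores) || n_cores <= 5) {
    return(list(with_parallel = FALSE, customCores = NULL))
  }
  list(with_parallel = TRUE, customCores = n_cores - 2)
}

# detectCores() may return NA on some platforms, which the helper handles.
settings <- choose_parallel_settings(detectCores())
str(settings)
```

The resulting values could then be supplied to the with_parallel argument and, when parallelism is enabled, to the customCores argument.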
Table 1.
HybridFinder function parameters
| Parameter | Description | Default value |
|---|---|---|
| denovo_candidates | the data frame containing the de novo sequencing results | No defaults. Necessary input. |
| db_search | the data frame containing the database search results | No defaults. Necessary input. |
| proteome_db | the file path to the proteome used for the database search | No defaults. Necessary input. |
| (Optional) customALCcutoff | A custom score cutoff that can be set by the user as long as it would be at least 85 | NULL. (ALC cutoff calculated automatically as median of matching peptide sequences of assigned spectra). If set manually, minimum is 85. |
| with_parallel : boolean (True or False) | representing whether parallel computing should be employed for running the function. | TRUE |
| (Optional) customCores | If with_parallel is set to TRUE and the PC has >5 cores, the user can set a custom amount of cores to be used by the function. | 6 |
| (Optional) export_files : boolean (True or False) | by default it is set to False, however, if set to True, then the following input is essential. | FALSE |
| (Optional) export_dir | file path to the directory where the output files should be stored. This parameter is necessary for the export. | NULL |
The function will output a list (Figure 3) containing: (1) the HybridFinder output containing all the denovo peptides along with their potential splice type explanation cis-/trans-, (2) a list of the step1 hybrid candidate peptides, (3) the hybrid proteome (merged proteome: the original user proteome along with the hybrid proteome composed of the concatenated candidate hybrid peptide sequences).
Note: In the example above, export_files has been set to TRUE and export_dir has been defined, which means that the files are also automatically exported. If these two parameters were not specified, or were set to FALSE and NULL, the results are only stored in the HybridFinder_results_Exp1 object. In this case, you can still use export_HybridFinder_results as in the code below, where HybridFinder_results_Exp1 is the object created above for the storage of the HybridFinder results.
> export_HybridFinder_results(HybridFinder_results_Exp1, export_dir= folder_Exp1)
Pause point: If you would like to conduct the rest of the protocol at a later time, either use the export functionality and later load the HybridFinder output for the second function, or save the object in R in an .rda file as follows and load it again when you resume at step 4 (Load checknetMHCpan inputs into R).
> save(HybridFinder_results_Exp1, file = file.path(folder_Exp1, "HybridFinder_results_Exp1.rda"))
> load(file.path(folder_Exp1, "HybridFinder_results_Exp1.rda"))
Figure 3.
Screenshot of the HybridFinder function results
In the results list you will find 3 items: 1) a dataframe containing the HybridFinder output. 2) a character vector containing the candidate spliced peptides. 3) a list which is in a seqinr class (Charif and Lobry, 2007) containing the merged hybrid proteome.
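Once the results list is available, its items can be inspected with base R. The sketch below uses a mock list that only imitates the structure described above; the peptide sequences and the exact category labels are invented for illustration, and only the column names Peptide and Potential_spliceType are taken from this protocol.

```r
# Mock stand-in for the list returned by HybridFinder; contents are invented.
mock_results <- list(
  data.frame(
    Peptide = c("AAAKLLPEV", "SLLDKVTGA", "KVLEYVLKV", "TLWVDPYEV"),
    Potential_spliceType = c("Linear", "cis", "trans", "cis"),
    stringsAsFactors = FALSE
  ),
  c("SLLDKVTGA", "KVLEYVLKV", "TLWVDPYEV"),
  list()  # placeholder for the merged hybrid proteome
)

HF_output <- mock_results[[1]]          # item 1: HybridFinder output
hybrid_candidates <- mock_results[[2]]  # item 2: candidate spliced peptides

# Number of peptides per potential splice type.
splice_counts <- table(HF_output$Potential_spliceType)
print(splice_counts)

# Rows of the output that correspond to the hybrid candidates.
candidate_rows <- HF_output[HF_output$Peptide %in% hybrid_candidates, ]
print(candidate_rows)
```

With real data, mock_results would be replaced by the object in which the HybridFinder results were stored.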
Step 3: Database search using hybrid Fasta
Timing: 1 h
An essential interim step must follow the HybridFinder function and consists of running a database search in PEAKS with the merged proteome. Importantly, now that a merged hybrid proteome has been obtained from the HybridFinder function, it can be used to obtain potential PSPs whose quality is comparable with all other database search peptides while filtering all peptides at the same FDR (False Discovery Rate) cutoff which can be adjusted by the users in PEAKS. In the original workflow by Faridi et al. (2018), the database search peptides in both runs were filtered in PEAKS at a 1% FDR.
5. Perform a database search in PEAKS using the original raw MS files and the same settings as in the first search, but this time using the merged hybrid proteome (.fasta) file generated by the HybridFinder function.
Step 4: Load checknetMHCpan inputs into R
Timing: 1 min
Prior to running checknetMHCpan, please ensure that netMHCpan (version 4.0 or 4.1) is installed. checknetMHCpan is the last step of the hybrid finder workflow: the function uses the database search results from the second PEAKS analysis and provides the binding affinity results of all the peptides along with their categorizations.
6. Create an object for the location of the netMHCpan executable.
> netmhcpan_dir <- file.path("/usr/local/bin")
7. Create an object (vector) for storing the HLA-I alleles for which you would like binding affinity predictions.
> alleles_Exp1 <- c("HLA-A*02:01", "HLA-A*03:01", "HLA-B*07:02")
8. Retrieve the HybridFinder output from the HybridFinder function results.
> HF_output_Exp1 <- HybridFinder_results_Exp1[[1]]
9. Import the database search results (from step 3: Database search using hybrid Fasta).
> rerun_db_search_Exp1 <- read.csv(file.path(folder_Exp1, "second_run", "DB search psm.csv"), sep = ",", header = TRUE, stringsAsFactors = FALSE)
Note: If your computer runs Windows (netMHCpan is not compatible with Windows), the web version of netMHCpan (http://www.cbs.dtu.dk/services/NetMHCpan-4.1/instructions.php) can be used instead. In this case, we propose using a separate function from this package (step2_wo_netMHCpan), which outputs a netMHCpan-ready input of sequences in .pep format.
Access the datasets included in the R package
The demonstration datasets from the HLA Ligand Atlas included in this package also include datasets for the checknetMHCpan/step2_wo_netMHCpan functions. After the HybridFinder function was run and the results were stored in HybridFinder_results_Exp1, PEAKS was run using the merged hybrid proteome. Below is a way to retrieve the second PEAKS run dataset included in the package:
> data("db_rerun_Human_Liver_AUTD17", package = "RHybridFinder")
Note: The merged proteome used for the second database search is based on the customALCcutoff being set to NULL (default parameter value).
CRITICAL: The merged proteome database changes from one sample to another, and also whenever the customALCcutoff parameter is changed. The same merged hybrid proteome cannot be used for separate analyses.
Step 5: Run checknetMHCpan
Timing: ∼ 1 min
The checknetMHCpan function embodies the second major step of the workflow. The categorizations of the hybrid peptides from the HybridFinder output are retrieved for matched peptides found in the second PEAKS database results. Then, peptide-MHC class I binding predictions for the entire database search results (for peptides between 9 and 12 amino acids) are computed using netMHCpan and are tidied in order to summarize the results.
10. Run checknetMHCpan using the code below (please refer to Table 2 for details on the required inputs) and export the results to the same folder:
> checknetMHCpan_results_Exp1 <- checknetMHCpan(netmhcpan_directory = netmhcpan_dir, netmhcpan_alleles = alleles_Exp1, peptide_rerun = rerun_db_search_Exp1, HF_step1_output = HF_output_Exp1, export_files = TRUE, export_dir = folder_Exp1)
Note: checknetMHCpan is compatible with the exports from both netMHCpan 4.0 & netMHCpan 4.1.
CRITICAL: If you use the datasets included in the package, please note that they are named differently, so for instance the “peptide_rerun” parameter should be set to the dataset loaded from the package, db_rerun_Human_Liver_AUTD17.
Table 2.
checknetMHCpan function parameters
| Parameter | Description | Default value |
|---|---|---|
| netmhcpan_directory | the directory where netMHCpan is installed (i.e., ‘/usr/bin’ or ‘/usr/local/bin’, depending on where you have it installed) | No defaults. Necessary input. |
| netmhcpan_alleles | a vector composed of the alleles the peptides will be tested against. | No defaults. Necessary input. |
| peptide_rerun | the database search results from the second PEAKS run | No defaults. Necessary input. |
| HF_step1_output | the data frame from the HybridFinder function containing the potential splice type explanations of the peptides as well as RT, m/z, ALC, Scan & Fraction | No defaults. Necessary input. |
| (Optional) export_files : boolean (True or False) | by default it is set to False, however, if set to True, then the following input is essential. | FALSE |
| (Optional) export_dir | file path to the directory where the output files should be stored. This parameter is necessary for the export. | NULL |
After running the code above, a results list should be returned (Figure 4).
Figure 4.
Screenshot of the checknetMHCpan results list
In the results list you will find 3 items: 1) a dataframe containing the netMHCpan results. 2) a dataframe containing the tidied netMHCpan results. 3) the database search results with the “Potential_spliceType” for the hybrid peptides retrieved from step1.
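These results can also be exported with the export_checknetMHCpan_results function, where checknetMHCpan_results_Exp1 is the object created above for the storage of the checknetMHCpan results.
> export_checknetMHCpan_results(checknetMHCpan_results_Exp1, export_dir = folder_Exp1)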
Note: If you intend on using the web version of netMHCpan (especially useful for windows OS users) or another software for peptide binding affinity, the step2_wo_netMHCpan function does the same as checknetMHCpan but without running netMHCpan. The function should return a list (Figure 5) containing the updated database search results as well as a list of the peptides which can be used as input in the web version of netMHCpan.
Figure 5.
Screenshot of the step2_wo_netMHCpan results list
In the results list you will find 2 items: 1) a character vector containing the netMHCpan-ready input. 2) the database search results with the “Potential_spliceType” for the hybrid peptides retrieved from step1.
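For orientation, a netMHCpan-ready peptide input is a plain-text file with one peptide per line. The sketch below builds such a file in base R from an invented peptide vector; it is not the implementation of step2_wo_netMHCpan, and the file name is hypothetical. The 9 to 12 amino acid range is the one stated in this protocol.

```r
# Invented peptide list; in practice this would come from the Peptide column
# of the second PEAKS database search (modifications removed).
peptides <- c("AAAKLLPEV", "SLLDKVTGAPK", "KVL", "TLWVDPYEVSYRLG", "AAAKLLPEV")

# Keep unique peptides of 9 to 12 amino acids.
pep_lengths <- nchar(peptides)
pep_input <- unique(peptides[pep_lengths >= 9 & pep_lengths <= 12])

# Write one peptide per line to a .pep file in a temporary directory.
pep_file <- file.path(tempdir(), "netMHCpan_input.pep")
writeLines(pep_input, pep_file)
print(readLines(pep_file))
```

The resulting file can be pasted or uploaded as peptide input in the web version of netMHCpan.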
Expected outcomes
HybridFinder
The HybridFinder function follows the same rationale as indicated in Faridi et al. (2018). After high-confidence de novo peptides are extracted, these are searched sequentially for an exact hit, followed by a search of pair fragments within one protein and then within two proteins (Figure 6). Finally, the sequences of all hybrid peptides are concatenated to create fake proteins, which are added at the bottom of the proteome database in order to constitute a merged hybrid proteome.
Figure 6.
HybridFinder function
HybridFinder extracts high-confidence de novo peptides by using an ALC cutoff based on the median ALC of the spectrum groups and peptide sequences that are common to the de novo and the database search results. The ALC cutoff is used to filter unassigned de novo spectrum groups in order to obtain high-confidence de novo spectra. All sequences are then searched in the proteome for the entire sequence; those that match are filtered and considered “Linear”. The remaining peptide spectrum groups are “cut” in order to create peptide fragment combinations. These are then searched in the proteome to determine whether fragment combinations exist within the same protein; matches are considered cis-spliced and further filtered. Finally, fragment combinations are created from those that did not match in the previous step and are searched to determine whether they exist in two proteins. If there is a match, these are considered trans-spliced peptides. The remaining uncategorized spectrum groups are considered not to have a biological explanation (NBE) and are therefore discarded.
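The sequential linear, cis, trans search described above can be sketched on a toy example. The code below is a simplified illustration in base R and not the package's implementation: the proteome sequences and peptides are invented, the function name classify_peptide is hypothetical, only two-fragment cuts are considered, and details such as minimum fragment lengths or ALC filtering are left out.

```r
# Toy proteome (invented sequences), stored as a named character vector.
proteome <- c(
  protA = "MKTAYLAKQRQLSFVKSHFSRQ",
  protB = "GSHMLEDPVDAFKLNGWTRR"
)

classify_peptide <- function(peptide, proteome) {
  # 1) Exact match of the whole sequence within any protein -> linear.
  if (any(grepl(peptide, proteome, fixed = TRUE))) {
    return("Linear")
  }
  n <- nchar(peptide)
  trans_possible <- FALSE
  # 2) Cut the peptide at every position and look up both fragments.
  for (i in seq_len(n - 1)) {
    left <- substr(peptide, 1, i)
    right <- substr(peptide, i + 1, n)
    has_left <- grepl(left, proteome, fixed = TRUE)
    has_right <- grepl(right, proteome, fixed = TRUE)
    # Both fragments inside the same protein -> cis-spliced candidate.
    if (any(has_left & has_right)) {
      return("cis")
    }
    # Both fragments exist, but only in different proteins.
    if (any(has_left) && any(has_right)) {
      trans_possible <- TRUE
    }
  }
  # 3) No cis explanation at any cut: fall back to trans, otherwise NBE.
  if (trans_possible) "trans" else "NBE"
}

peptides <- c("AYLAKQRQL", "KTAYSHFSR", "LEDPVKQRQ", "WWWWWWWWW")
splice_types <- vapply(peptides, classify_peptide, character(1),
                       proteome = proteome)
print(splice_types)
```

Checking every cut for a cis explanation before accepting a trans one reflects the order of the search described in the figure legend.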
Typically, when the HybridFinder function is run, 3 messages are printed representing each major stage of the algorithm and finally ‘Done!’ is printed once the processing is finished. The function returns a list containing 3 items: the HybridFinder output (Figure 7) where the predicted splice type is displayed, a character vector containing only the list of hybrid candidates (Figure 8) and finally the merged hybrid proteome (Figure 9) where the hybrid peptide candidates have been concatenated as fake proteins.
Figure 7.
Screenshot of the HybridFinder output dataframe
(5 rows) The Fraction column represents the LC-MS run, the Scan column is a number representing a unique index for the tandem mass spectra (F[Fraction#]:Scan#), m/z is the precursor mass-to-charge ratio, RT is the Retention Time (elution time) for the spectrum, and Peptide corresponds to the peptide sequences. The Length column represents the number of amino acids for a given peptide. ALC (Average Local Confidence) is a score calculated in PEAKS as the total of the residue local confidence scores in the peptide divided by the peptide length. These columns are not generated by the HybridFinder function; they are columns found in any PEAKS de novo sequencing export. For more information, please consult the PEAKS user manual. The Potential_spliceType column corresponds to the categorization resulting from the HybridFinder function. Finally, proteome_database_used is the filename of the fasta proteome provided by the user in the HybridFinder function (this column mainly helps the user keep track of the proteome used).
Figure 8.

Screenshot of the HybridFinder hybrid peptide candidates vector (5 rows)
Figure 9.
Screenshot of the bottom of the HybridFinder merged hybrid proteome (5 proteins)
The results might differ if the customALCcutoff score parameter is changed. If the results are exported, these are stored in a folder as .csv files and the merged proteome database is saved as .fasta file. The peptide sequences predicted as spliced are considered as preliminary candidates. Performing the rest of the steps is essential in order to obtain the final list.
checknetMHCpan and step2_wo_netMHCpan
The checknetMHCpan & step2_wo_netMHCpan functions represent the last step in Faridi et al.’s (2018) workflow. These two functions can be used once a database search has been performed using the merged hybrid proteome generated in step 1. Both functions retrieve the potential splice type categorization established in step 1. However, with checknetMHCpan the user can directly obtain MHC-I binding affinity predictions computed for all peptides between 9 and 12 amino acids using netMHCpan (Jurtz et al., 2017; Reynisson et al., 2020).
The checknetMHCpan function returns two formats of the netMHCpan results and the updated database search results from the second run with the potential splice type. The first format presents the netMHCpan results as they are (Figure 10). The second format is a tidied version of the netMHCpan results (Figure 11), where the rows are summarized into different columns to allow quick analysis of the netMHCpan results (especially when more than one HLA-I allele is used); these columns give the number of HLA-I alleles to which a given peptide is a strong or weak binder, as well as the corresponding alleles. Finally, the database search results dataframe (from the second PEAKS run) is updated with the potential splice type determined in the HybridFinder function for each peptide (Figure 12). Additionally, any sequence not identified in the HybridFinder output and solely attributed to the fake proteins created is removed. If exported, these outputs are stored in a folder containing 2 .csv files and a .tsv (tab-separated values) file.
Figure 10.
Screenshot of the checknetMHCpan netMHCpan results
(5 rows) HLA/MHC is the allele, Peptide is the amino acid sequence of the potential ligand, Core is the minimal 9 amino acid sequence core to enable HLA binding, Of is the starting position of the Core within the peptide, Gp and Gl are the position and the length of the deletions (respectively), if any. Ip and Il are the position and the length of the insertions (respectively), if any. Icore is the interaction core, Identity is PEPLIST (which indicates that peptides were used as input as opposed to proteins in fasta-format). Score is the raw prediction, Aff(nM) is the predicted IC50 value in nanoMolar units, %Rank is the percentile rank of the predicted affinity compared to a set of random natural ligands. BindLevel is designated by 3 qualifiers: Strong binder, Weak binder, None binder. Potential_spliceType is the categorization retrieved from the HybridFinder output on the potential splice type explanation of the peptide (i.e., linear, cis, trans).
Figure 11.
Screenshot of the checknetMHCpan tidied netMHCpan results
(5 rows) Peptide is the amino acid sequence of the potential ligand. The strongBinder, weakBinder and noneBinder (not shown in this figure) columns list the alleles to which a given peptide is a strong, weak or non-binder, respectively. If there is more than one allele, these are separated by commas. For each peptide, there is one %Rank column per allele (e.g., if 3 alleles were specified in the checknetMHCpan command, then each peptide will have 3 %Rank columns). strongBinder_count, weakBinder_count and noneBinder_count represent the number of alleles to which a peptide is a strong, weak or non-binder. Lastly, the Potential_spliceType column is the categorization retrieved from the HybridFinder output on the potential splice type explanation of the peptide (i.e., linear, cis, trans).
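The kind of per-peptide summary described in this legend can be sketched in base R. The example below is an illustration on invented data and not the code used by checknetMHCpan: the peptides and %Rank values are made up, the helper summarize_peptide is hypothetical, and the %Rank thresholds (below 0.5 for strong and below 2 for weak binders) are assumed here to follow the netMHCpan defaults.

```r
# Mock netMHCpan-style long table: one row per peptide/allele pair.
binding <- data.frame(
  Peptide = rep(c("AAAKLLPEV", "SLLDKVTGA"), each = 3),
  HLA = rep(c("HLA-A*02:01", "HLA-A*03:01", "HLA-B*07:02"), times = 2),
  Rank = c(0.2, 1.5, 30, 0.1, 0.4, 12),
  stringsAsFactors = FALSE
)

# Assumed thresholds: %Rank < 0.5 strong, < 2 weak, otherwise non-binder.
binding$BindLevel <- ifelse(binding$Rank < 0.5, "strong",
                            ifelse(binding$Rank < 2, "weak", "none"))

# Collapse to one row per peptide: alleles per level and their counts.
summarize_peptide <- function(df) {
  strong <- df$HLA[df$BindLevel == "strong"]
  weak <- df$HLA[df$BindLevel == "weak"]
  data.frame(
    Peptide = df$Peptide[1],
    strongBinder = paste(strong, collapse = ","),
    weakBinder = paste(weak, collapse = ","),
    strongBinder_count = length(strong),
    weakBinder_count = length(weak),
    stringsAsFactors = FALSE
  )
}

tidy <- do.call(rbind, lapply(split(binding, binding$Peptide), summarize_peptide))
rownames(tidy) <- NULL
print(tidy)
```

Summarizing per peptide in this way makes it straightforward to filter for peptides that bind at least one of the alleles of interest.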
Figure 12.
Screenshot of the checknetMHCpan database search results updated with the Potential_spliceType column
(5 rows) Peptide is the amino acid sequence of the potential ligand, X.log10P represents the best -10logP identification score for the corresponding peptide. Mass represents the monoisotopic mass of the peptide, Length is the number of amino acid residues that constitute the given peptide, ppm is the precursor mass error, m.z is the precursor mass-to-charge ratio, Z is the precursor charge, RT is the Retention Time (elution time) for the spectrum, Area represents the area under the curve of the peptide feature found at the same m/z and retention time as the MS/MS scan, Fraction is the LC-MS run, id represents the precursor ID associated with the PSM, Scan is a number representing a unique index for tandem mass spectra (F[Fraction#]:Scan#), from.Chimera (this column is not shown in this figure) displays whether the identified peptide is from chimeric spectra, Source.File is the mzML/mzXML file used in the PEAKS analysis, PTM is the type of the post-translational modification, Ascore is the localization score assigned to modifications on the peptide, Found.By represents the analysis (in this case PEAKS DB). Peptide_no_mods represents the peptide sequence without modifications, Potential_spliceType is linear, cis or trans and is retrieved from the HybridFinder function.
The step2_wo_netMHCpan function is the equivalent of checknetMHCpan, except that it does not compute binding affinity. The function returns a netMHCpan-ready list of peptides (Figure 13), as well as the updated database search results (Figure 14). If exported, the results are written to a folder containing a .pep file and a .csv file.
Figure 13.

Screenshot of the step2_wo_netMHCpan netMHCpan-ready input (5 rows)
Figure 14.
Screenshot of the step2_wo_netMHCpan database search results updated with the Potential_spliceType column
(5 rows) The dataframe contains the same columns as in Figure 12
After running checknetMHCpan or step2_wo_netMHCpan the final list of hybrid candidate peptides should be explored for further experimental validation (Figure 1).
Limitations
The presented package was developed and optimized for exports from PEAKS software. Therefore, results from other search engines or de novo sequencing software might not work with this package. Limitations related to the workflow include a possible bias towards results containing a higher proportion of leucine residues. This is due to the workaround proposed by Faridi et al. (2018), which is also used in this package: all isoleucines are switched to leucines in the database search and the proteome, since de novo sequencing does not differentiate between isoleucine and leucine. As mentioned above, it is also important to emphasize that this protocol does not enable the direct identification of high-confidence PSPs that are genuinely spliced by the proteasome in vivo. However, this protocol enables the computational identification of putative PSPs, which should then be validated experimentally in a rigorous manner as shown in Figure 1.
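The isoleucine-to-leucine workaround mentioned above amounts to a simple character substitution. The following base R sketch shows the idea on invented sequences; the helper name il_normalize is hypothetical and the code is not taken from the package.

```r
# Replace every isoleucine (I) with leucine (L) so that the two isobaric
# residues compare equal between de novo peptides and proteome sequences.
il_normalize <- function(sequences) {
  gsub("I", "L", sequences, fixed = TRUE)
}

# Invented example sequences that differ only at I/L positions.
denovo_peptide <- "SLIDKVTGL"
protein_fragment <- "SIIDKVTGI"

# Before normalization the sequences differ; afterwards they are identical.
same_before <- denovo_peptide == protein_fragment
same_after <- il_normalize(denovo_peptide) == il_normalize(protein_fragment)
print(c(before = same_before, after = same_after))
```

Because every I is reported as L after this step, the leucine content of the resulting peptide lists is inflated, which is the source of the bias described above.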
Troubleshooting
Problem 1
While installing the .tar.gz file for the package, in case you run into the following error: "Error in install.packages : type == "both" cannot be used with 'repos = NULL'"
Potential solution
The solution is to invoke the install.packages function while specifying where the package is located, setting the repository (repos) to NULL and setting the type to "source" (source package).
> install.packages("~/Downloads/RHybridFinder_0.1.0.tar.gz", repos = NULL, type = "source")
Problem 2
While running HybridFinder, in case you run into the following error: ”Error in prepare_input_for_HF(de novo_candidates, db_search): Please make sure you have the right input. N.B: The de novo results data frame should be the first input”
Potential solution
Verify that the de novo data frame has been correctly imported. Since the de novo results file is in .csv format, the separator should be a comma (","), stringsAsFactors should be set to FALSE, and the header should be set to TRUE. Please refer to step 1: Loading inputs into R.
Verify that the HybridFinder parameters are properly typed. The de novo sequencing results data frame is given first, followed by the database search results. Alternatively, write out the parameter names and assign them their appropriate objects (e.g., denovo_candidates = denovo_results_human_liver_Exp1). Please refer to step 2: Run HybridFinder.
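The import settings recommended here can be sketched as follows. This is a minimal, self-contained illustration: the temporary file, its column names (`Peptide`, `Score`) and the object name `denovo_results` are assumptions made for the example and do not reflect the actual layout of a PEAKS export.

```r
# Write a tiny stand-in for a PEAKS .csv export to a temporary file so the
# sketch is self-contained; the column names here are illustrative only.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Peptide,Score", "ALDKEGSVL,85", "KLDETGNSL,72"), csv_path)

# Import settings recommended in this section: comma separator,
# header = TRUE, stringsAsFactors = FALSE.
denovo_results <- read.csv(csv_path, sep = ",", header = TRUE,
                           stringsAsFactors = FALSE)

# Sanity checks before passing the object on: a non-empty data frame
# whose text columns are character vectors rather than factors.
stopifnot(is.data.frame(denovo_results), nrow(denovo_results) > 0)
str(denovo_results)
```

The same checks can be applied to the database search results before both data frames are handed to HybridFinder; an empty or wrongly parsed data frame is easier to spot at this stage than from the downstream error message.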
Problem 3
While running HybridFinder, you may run into the following error: "Error in `$<-.data.frame`(`*tmp*`, "db_id", value = character(0)) : replacement has 0 rows, data has […]"
Potential solution
Verify that the database search data frame has been correctly imported. Since the database search results file is in .csv format, the separator should be a comma (","), stringsAsFactors should be set to FALSE, and the header should be set to TRUE. Please refer to step 1: Loading inputs into R.
Problem 4
While running checknetMHCpan, if the following error is displayed: "Error in checknetMHCpan[…]: Please provide the proper input"
Potential solution
Verify that the de novo and database search data frames are not switched. Please refer to step 5: Run checknetMHCpan.
Problem 5
While running checknetMHCpan, if the following error is displayed: "Please check the input alleles: […]"
Potential solution
Ensure that the alleles are in the right format and that each allele is written correctly (e.g., HLA-A03:01 or HLA-A*03:01). Please refer to step 4: Load checknetMHCpan inputs into R.
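A quick pre-check of the allele strings can catch typing mistakes before checknetMHCpan is run. The sketch below is an assumption-laden illustration: the helper `looks_like_hla_allele` is hypothetical, it is not the validation performed by the package, and its pattern covers only the two example formats given above with two-digit fields.

```r
# Hypothetical pre-check (not the package's own validation): flag allele
# strings that do not look like "HLA-A03:01" or "HLA-A*03:01".
looks_like_hla_allele <- function(alleles) {
  grepl("^HLA-[A-Z]\\*?[0-9]{2}:[0-9]{2}$", alleles)
}

alleles <- c("HLA-A03:01", "HLA-A*03:01", "HLA-A0301", "A*03:01")
ok <- looks_like_hla_allele(alleles)

# Show which entries would need to be corrected before running checknetMHCpan.
print(data.frame(allele = alleles, well_formed = ok))
```

Entries flagged as not well formed here (a missing colon or a missing "HLA-" prefix) are the kind of input that is likely to trigger the allele error described in this problem.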
Problem 6
While running checknetMHCpan, if the path to netMHCpan is not correct, the following error might appear: "sh: 1: [/temporary directory/netMHCpan]: not found error in running command"
Potential solution
The issue could be either that the directory does not contain the netMHCpan file or that the directory path was not written correctly (e.g., 'usr/bin/' vs. '/usr/bin' or '/usr/bin/', where the first example is wrong and the other two are correct). Please refer to step 4: Load checknetMHCpan inputs into R.
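Both causes can be checked from within R before running checknetMHCpan. The following is a minimal sketch assuming a Unix-like system; the helper `check_netmhcpan_dir` and the temporary demo directory are hypothetical and serve only to illustrate the two checks (leading slash present, netMHCpan file found).

```r
# Hypothetical helper (not part of RHybridFinder): check that a directory
# path is absolute and actually contains a file named "netMHCpan".
check_netmhcpan_dir <- function(dir) {
  is_absolute <- startsWith(dir, "/")
  has_binary <- file.exists(file.path(sub("/+$", "", dir), "netMHCpan"))
  c(absolute = is_absolute, netMHCpan_found = has_binary)
}

# Self-contained demonstration with a temporary directory and an empty
# placeholder file standing in for the real netMHCpan executable.
demo_dir <- file.path(tempdir(), "netmhcpan_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "netMHCpan"))

print(check_netmhcpan_dir(demo_dir))               # no trailing slash
print(check_netmhcpan_dir(paste0(demo_dir, "/")))  # trailing slash
print(check_netmhcpan_dir("usr/bin/"))             # missing leading slash
```

In practice the function would be called with the actual netMHCpan directory that is passed to checknetMHCpan.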
Resource availability
Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Etienne Caron (etienne.caron@umontreal.ca).
Materials availability
This study did not generate new unique reagents.
Acknowledgments
This work was supported by funding from the Fonds de recherche du Québec - Santé (FRQS), the Cole Foundation, CHU Sainte-Justine and the Charles-Bruneau Foundations, Canada Foundation for Innovation, the Natural Sciences and Engineering Research Council (NSERC) (#RGPIN-2020-05232), and the Canadian Institutes of Health Research (CIHR) (#174924). K.A.K. is a recipient of IVADO’s postdoctoral scholarship (#3879287150). C.L. is currently supported by a National Health and Medical Research Council (NHMRC) of Australia CJ Martin Early Career Research Fellowship (1143366). A.W.P. is supported by an NHMRC Principal Research Fellowship (1137739). P.F. is supported by a Victorian Cancer Agency (Australia) Mid-Career Fellowship.
Author contributions
R code, conceptualization, and authorship, F.S. and P.K.; validation and conceptualization, Q.M., D.J.H., K.A.K., and I.S.; conceptualization and authors of the original workflow, P.F., C.L., and A.W.P.; conceptualization and authorship, E.C.
Declaration of interests
The authors declare no competing interests.
Contributor Information
Peter Kubiniok, Email: peterkubiniok@gmail.com.
Etienne Caron, Email: etienne.caron@umontreal.ca.
Data and code availability
The package is available on CRAN and includes data (PEAKS analyses) from the HLA Ligand Atlas (Marcu et al., 2021), deposited in PRIDE (Proteomics IDentification Database) under accession PXD019643. These data were analyzed in PEAKS and are used in this protocol for demonstration purposes only.
References
- Berkers C.R., Jong A. de, Schuurman K.G., Linnemann C., Geenevasen J.A.J., Schumacher T.N.M., Rodenko B., Ovaa H. Peptide splicing in the proteasome creates a novel type of antigen with an isopeptide linkage. J. Immunol. 2015;195:4075–4084. doi: 10.4049/jimmunol.1402454.
- Charif D., Lobry J.R. In: Structural Approaches to Sequence Evolution, Molecules, Networks, Populations. Bastolla Ugo, Porto Markus, Roman H. Eduardo, Vendruscolo Michele, editors. Springer Verlag; 2007. SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis; pp. 207–232.
- Dalet A., Robbins P.F., Stroobant V., Vigneron N., Li Y.F., El-Gamil M., Hanada K., Yang J.C., Rosenberg S.A., Eynde B.J.V. den. An antigenic peptide produced by reverse splicing and double asparagine deamidation. Proc. Natl. Acad. Sci. USA. 2011;108:E323–E331. doi: 10.1073/pnas.1101892108.
- Ebstein F., Textoris-Taube K., Keller C., Golnik R., Vigneron N., Eynde B.J.V. den, Schuler-Thurner B., Schadendorf D., Lorenz F.K.M., Uckert W. Proteasomes generate spliced epitopes by two different mechanisms and as efficiently as non-spliced epitopes. Sci. Rep. 2016;6:24032. doi: 10.1038/srep24032.
- Faridi P., Li C., Ramarathinam S.H., Vivian J.P., Illing P.T., Mifsud N.A., Ayala R., Song J., Gearing L.J., Hertzog P.J. A subset of HLA-I peptides are not genomically templated: Evidence for cis- and trans-spliced peptide ligands. Sci. Immunol. 2018;3:eaar3947. doi: 10.1126/sciimmunol.aar3947.
- Hanada K., Yewdell J.W., Yang J.C. Immune recognition of a human renal cancer antigen through post-translational protein splicing. Nature. 2004;427:252–256. doi: 10.1038/nature02240.
- Jurtz V., Paul S., Andreatta M., Marcatili P., Peters B., Nielsen M. NetMHCpan-4.0: improved peptide–MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data. J. Immunol. 2017;199:3360–3368. doi: 10.4049/jimmunol.1700893.
- Lichti C.F. Identification of spliced peptides in pancreatic islets uncovers errors leading to false assignments. Proteomics. 2021;21:e2000176. doi: 10.1002/pmic.202000176.
- Liepe J., Mishto M., Textoris-Taube K., Janek K., Keller C., Henklein P., Kloetzel P.M., Zaikin A. The 20S proteasome splicing activity discovered by SpliceMet. PLoS Comput. Biol. 2010;6:e1000830. doi: 10.1371/journal.pcbi.1000830.
- Liepe J., Marino F., Sidney J., Jeko A., Bunting D.E., Sette A., Kloetzel P.M., Stumpf M.P.H., Heck A.J.R., Mishto M. A large fraction of HLA class I ligands are proteasome-generated spliced peptides. Science. 2016;354:354–358. doi: 10.1126/science.aaf4384.
- Marcu A., Bichmann L., Kuchenbecker L., Kowalewski D.J., Freudenmann L.K., Backert L., Mühlenbruch L., Szolek A., Lübke M., Wagner P. HLA Ligand Atlas: a benign reference of HLA-presented peptides to improve T-cell-based cancer immunotherapy. J. Immunother. Cancer. 2021;9:e002071. doi: 10.1136/jitc-2020-002071.
- Michaux A., Larrieu P., Stroobant V., Fonteneau J.-F., Jotereau F., Eynde B.J.V. den, Moreau-Aubry A., Vigneron N. A spliced antigenic peptide comprising a single spliced amino acid is produced in the proteasome by reverse splicing of a longer peptide fragment followed by trimming. J. Immunol. 2014;192:1962–1971. doi: 10.4049/jimmunol.1302032.
- Mylonas R., Beer I., Iseli C., Chong C., Pak H.-S., Gfeller D., Coukos G., Xenarios I., Müller M., Bassani-Sternberg M. Estimating the contribution of proteasomal spliced peptides to the HLA-I Ligandome. Mol. Cell. Proteomics. 2018;17:2347–2357. doi: 10.1074/mcp.RA118.000877.
- Neefjes J., Jongsma M.L.M., Paul P., Bakke O. Towards a systems understanding of MHC class I and MHC class II antigen presentation. Nat. Rev. Immunol. 2011;11:823–836. doi: 10.1038/nri3084.
- Reynisson B., Alvarez B., Paul S., Peters B., Nielsen M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2020;48:W449–W454. doi: 10.1093/nar/gkaa379.
- Rolfs Z., Müller M., Shortreed M.R., Smith L.M., Bassani-Sternberg M. Comment on “A subset of HLA-I peptides are not genomically templated: Evidence for cis- and trans-spliced peptide ligands”. Sci. Immunol. 2019;4:eaaw1622. doi: 10.1126/sciimmunol.aaw1622.
- Specht G., Roetschke H.P., Mansurkhodzhaev A., Henklein P., Textoris-Taube K., Urlaub H., Mishto M., Liepe J. Large database for the analysis and prediction of spliced and non-spliced peptide generation by proteasomes. Sci. Data. 2020;7:146. doi: 10.1038/s41597-020-0487-6.
- Vigneron N., Stroobant V., Chapiro J., Ooms A., Degiovanni G., Morel S., Bruggen P. van der, Boon T., Eynde B.J.V. den. An antigenic peptide produced by peptide splicing in the proteasome. Science. 2004;304:587–590. doi: 10.1126/science.1095522.
- Wilhelm M., Zolg D.P., Graber M., Gessulat S., Schmidt T., Schnatbaum K., Schwencke-Westphal C., Seifert P., Krätzig N. de A., Zerweck J. Deep learning boosts sensitivity of mass spectrometry-based immunopeptidomics. Nat. Commun. 2021;12:3346. doi: 10.1038/s41467-021-23713-9.


CRITICAL: if you copy the lines of code from here, keep in mind that you might have to re-write the quotation marks yourself.
Timing: 1 min
Pause point: If you would like to conduct the rest of the protocol at a later time, either use the export functionality and then load the HybridFinder output for use in the second step, or save the objects in R in an .rda file and load them again when you resume at step 4: Load checknetMHCpan inputs into R.
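Saving and restoring R objects in an .rda file can be sketched with base R as follows. The object name `hf_output`, its placeholder content and the temporary file path are assumptions made for illustration; in practice the object would be the actual HybridFinder result and the path a location of your choice.

```r
# Stand-in for the object returned by HybridFinder; in practice this would
# be the actual result object created earlier (the name is hypothetical).
hf_output <- list(note = "placeholder for HybridFinder results")

# Save the object to an .rda file (a temporary path is used here).
rda_path <- file.path(tempdir(), "hf_output.rda")
save(hf_output, file = rda_path)

# Later, in a new session: load() restores the object under its original name.
rm(hf_output)
load(rda_path)
print(names(hf_output))
```

Several objects can be stored in the same file by listing them all in the call to save().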








