Abstract
Class I major histocompatibility complexes (MHC-I), encoded by the highly polymorphic HLA-A, HLA-B, and HLA-C genes in humans, are expressed on all nucleated cells. Both self and foreign proteins are processed into peptides of 8–10 amino acids, loaded onto MHC-I within the endoplasmic reticulum, and then presented on the cell surface. Foreign peptides presented in this fashion activate CD8+ T cells, and their immunogenicity correlates with their affinity for the MHC-I binding groove. Predicting antigen binding affinity for MHC-I is therefore a valuable tool for identifying potentially immunogenic antigens. While many predictors for MHC-I binding exist, no currently available tool can predict antigen/MHC-I binding affinity for antigens with explicitly labeled post-translational modifications or unusual/non-canonical amino acids (NCAAs), even though such modifications are increasingly recognized as critical mediators of peptide immunogenicity. In this work, we propose a machine learning application that quantifies the binding affinity of epitopes containing NCAAs to MHC-I and compares its performance with other commonly used regressors. Our model demonstrates robust performance, with 5-fold cross-validation yielding an R2 value of 0.477 and a root-mean-square error (RMSE) of 0.735, indicating strong predictive capability for peptides with NCAAs. This work provides a valuable tool for the computational design and optimization of peptides incorporating NCAAs, potentially accelerating the development of novel peptide-based therapeutics with enhanced properties and efficacy.
Introduction
The class I major histocompatibility complex (MHC-Ι) enables the adaptive immune response by presenting antigens to patrolling cytotoxic T cells [1,2]. Peptides presented by MHC-Ι originate in the cytoplasm and are usually length limited, having only 8–10 amino acids. This system has evolved principally to enable rapid identification and elimination of viral infected or malignant cells while minimizing the risk of aberrant recognition of self-peptides and consequential autoimmunity [2]. The MHC-Ι protein products are themselves encoded by the Human Leukocyte Antigen (HLA) genes in humans; both the co-dominantly expressed subtypes (A, B, and C) and the high degree of polymorphism observed in the peptide-binding domain of these genes enable MHC-Ι to complex with a large repertoire of peptides [2,3]. Post-translational modification of proteins and peptides (resulting in the incorporation of NCAAs) can further broaden the immunogenic landscape of peptides presented by MHC-Ι. Peptides containing various NCAAs are implicated as immunogens in a variety of diseases [4] including rheumatoid arthritis [5], hypertension and cardiometabolic inflammation [6], and cancer [7].
Recent advances in immunotherapy targeting cancer and autoimmune diseases, coupled with advances in data science, have incentivized the creation of computational tools that predict peptides likely to bind to MHC-I and induce immune responses [8,9]. These tools encompass a wide range of methodologies to analyze peptide-MHC interactions. Among these are sequence-based approaches like NetMHCpan [10–12] and MHCflurry [13] that use amino acid sequences to forecast binding affinities. Additionally, structure-based approaches such as Rosetta FlexPepDock [14–16] employ three-dimensional structural data to provide a detailed understanding of peptide-MHC binding dynamics and conformational stability. The most advanced and effective of these tools leverage machine learning techniques to construct predictive models. These models are trained on extensive datasets comprising antigen-MHC-I pairs and their corresponding binding affinity data. A significant portion of these data are derived from the Immune Epitope Database (IEDB) [17], which provides a comprehensive repository of experimentally validated immune epitopes. These methods are thoroughly benchmarked and reviewed by Zhao et al. [8].
Despite notable advances in both sequence- and structure-based epitope binding predictors, there are currently no tools capable of rapidly predicting antigen/MHC-I binding affinity for antigens with post-translational modifications or NCAAs. These modifications are increasingly recognized as critical mediators of peptide immunogenicity. The main goal of this research is to develop a new model able to predict the binding affinity of epitopes containing NCAAs to MHC-I.
Machine learning models have demonstrated superior performance in predicting binding affinity due to their ability to capture complex patterns and interactions within the data. The development and refinement of these models involve rigorous processes including feature generation, model training, and validation. Several popular algorithms are widely used for property prediction in the fields of chemistry and biology, including support vector machines (SVM), artificial neural networks (ANNs), principal component analysis (PCA), and partial least squares (PLS) regression [18,19].
In this work, we develop a simple encoder capable of creating feature vectors from peptides based on chemical structure. We then systematically benchmark several different supervised machine learning models on a filtered, publicly available dataset containing peptides with NCAAs and experimentally determined binding affinities.
Methods and results
This section details the study from three perspectives: data preparation, feature generation, and model testing and validation. The data preparation subsection explains the source and structure of the data used, focusing on data exploration and filtration. The feature generation subsection, the key part of this section, introduces how peptides with NCAAs are encoded. The model testing and validation subsection evaluates and compares performance metrics such as R2 and RMSE across different algorithms using five-fold cross-validation.
Data preparation
The initial dataset, a table with 100,141 rows and 29 columns, was exported from the publicly available Immune Epitope Database (IEDB). Among the 29 columns, five are of particular interest for this study: “Name”, “Qualitative Measurement”, “Quantitative Measurement”, “Response Measured”, and “HLA”. Table 1 lists possible or example values for these five columns. The “Name” column shows the peptide sequence within the binding complex of interest; in the example given in Table 1, GILGFVFTV + OTH(V9), the text between the “+” sign and the parentheses indicates the modification method applied to the peptide, and the text within the parentheses identifies the modified residue(s). The “HLA” column lists the HLA allele encoding the MHC molecule that binds the peptide. The “Qualitative Measurement” column has values ranging from strong to weak, representing binding strength. The “Quantitative Measurement” column provides a numerical value obtained from experiments, with the type of measurement specified in the “Response Measured” column.
Table 1. The five columns of most interest, example values, and the number of unique values.
| Column Name | Example Values | Number of Unique Values |
|---|---|---|
| Name | GILGFVFTV + OTH(V9) | 61813 |
| Qualitative Measurement | positive, negative | 5 |
| Quantitative Measurement | 0.1–65,000 | 223 |
| HLA | HLA-A*02:01 | 122 |
| Response Measured | half maximal inhibitory concentration (IC50) | 8 |
Fig 1 illustrates the data preparation process. Starting with the original dataset of 100,141 rows, it was confirmed that each peptide contains at least one NCAA. Since the objective of our research is to predict quantitative binding affinity, each row was required to have a non-NA value in the “Quantitative Measurement” column. Additionally, to ensure consistency of “HLA” and “Response Measured” across the training and test datasets, only the most populated values were retained: HLA-A*02:01 for “HLA” and IC50 in nanomolar (nM) for “Response Measured”. The final dataset of 166 rows was used for all further analysis.
Fig 1. Dataset generation for model training and validation.
Epitopes with experimentally determined IC50 (nM) values were extracted from the IEDB and filtered as shown to produce the dataset used for model training and validation.
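For illustration, the filtering in Fig 1 can be reproduced with a few pandas operations. This is a minimal sketch: the file name below is hypothetical, and the column labels follow Table 1, so a fresh IEDB export may require minor adjustments.

```python
import pandas as pd

# Minimal sketch of the Fig 1 filtering steps (file name is hypothetical;
# column labels follow Table 1 and may differ in a fresh IEDB export).
df = pd.read_csv("iedb_export.csv")  # 100,141 rows, 29 columns

# Keep only rows with a numeric binding measurement.
df = df.dropna(subset=["Quantitative Measurement"])

# Restrict to the most populated allele and assay type so that targets are comparable.
df = df[(df["HLA"] == "HLA-A*02:01") &
        (df["Response Measured"].str.contains("IC50", case=False, na=False))]

print(len(df))  # 166 rows remain after filtering
```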
Feature generation
With the sequences of peptides and their quantitative binding values prepared, the next step was to determine how to encode them for machine learning model building. Protein/amino acid encoding involves representing a protein or amino acid with an n-dimensional numerical vector. According to published studies, there are multiple encoding methods, which can be either whole sequence-based or amino acid-based [20]. In the latter approach, each amino acid is first encoded individually, and then the combination of feature vectors from all amino acids in the protein sequence constitutes the encoding of the entire peptide sequence.
Since the HLA allele is identical across the entire dataset used in this study, only the peptide of each binding complex needs to be considered when generating the input vector for model building. This simplifies the process and makes it more time efficient. Given that the target peptides in this study contain at least one NCAA, implying chemical modification at one or more residue positions, it is intuitive to use chemistry- or structure-based encoding rather than sequence encoding to retain residue-specific information.
The feature generation process includes four main steps, as illustrated in Fig 2. First, each peptide sequence is tokenized into amino acid tokens; across all 166 rows of data, 20 canonical and 28 non-canonical tokens were generated. Fig 3 shows the count of unique tokens across the entire dataset, while Fig 4 illustrates the distribution of tokens at each amino acid position. Tokens with names longer than one character indicate NCAAs. There are two types of NCAA tokens: those containing an underscore (“_”) were defined by the authors of this paper during the tokenization step, while the others retain the unique names used in their source records; for example, “Phg” refers to a racemic mixture of DL-phenylglycine, as described in the referenced literature [21]. Second, the structure of each amino acid token, particularly each NCAA, was verified against the literature referenced in the IEDB (located via the “Epitope IRI”), and the chemical structures were converted to SMILES [22] strings. Third, feature vectors for each token were generated from these SMILES strings using RDKit [23]. These descriptors capture various physicochemical properties such as molecular weight, partial charge, and the number of specific functional groups, resulting in a total of 208 features. Fourth, because this feature vector is large relative to the size of the prepared dataset and many features are highly correlated, principal component analysis (PCA) was applied to reduce the dimensionality from 208 to 10. Ten components were chosen because they capture 99.75% of the variance of the original feature vectors.
Fig 2. Process for encoding naturally occurring and non-canonical amino acids.
Peptides were tokenized by individual amino acid, structures of NCAAs were manually confirmed, and SMILES strings for each structural representation were generated. These SMILES strings were vectorized using RDKit, followed by feature reduction with PCA.
Fig 3. Distribution of canonical and NCAA tokens for every epitope in the dataset.
Fig 4. Distribution of canonical and NCAA tokens at each residue position (labeled N to C terminus).
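A minimal sketch of steps two through four is shown below. It assumes a hand-curated dictionary, token_to_smiles, mapping every token to its verified SMILES string (only two placeholder entries are shown); the exact descriptor list and count depend on the installed RDKit version.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.decomposition import PCA

def descriptor_vector(smiles):
    """Compute the full RDKit descriptor set (~208 physicochemical features)."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array([fn(mol) for _, fn in Descriptors.descList])

# Hypothetical token-to-structure map; the real map holds all 48 tokens.
token_to_smiles = {"G": "NCC(=O)O", "A": "N[C@@H](C)C(=O)O"}
tokens = sorted(token_to_smiles)
X = np.vstack([descriptor_vector(token_to_smiles[t]) for t in tokens])

# Reduce the descriptors with PCA (with the full 48-token map, 10 components
# explained 99.75% of the variance in our data).
pca = PCA(n_components=min(10, len(tokens)))
token_to_vec = dict(zip(tokens, pca.fit_transform(X)))
```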
At this point, a map was created with an amino acid token as the key and its corresponding feature vector of size 10 as the value. The final step is to combine the features of all tokens obtained in the first step to generate the feature vector for the entire peptide sequence. With each token’s vector size being 10 and the peptide length being nine or ten, the resulting feature vector dimensions for each peptide sequence would be 90 or 100. To ensure consistency of input data for building a machine learning model, an additional 10 zeros were appended to the feature vectors of peptides with a length of nine.
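A sketch of this final assembly step, assuming the token_to_vec map from the previous sketch and a per-peptide token list from step one (the helper name is illustrative):

```python
import numpy as np

MAX_LEN = 10   # longest peptide in the filtered dataset
VEC_DIM = 10   # per-token vector size after PCA

def encode_peptide(peptide_tokens, token_to_vec):
    """Concatenate per-token vectors and zero-pad to a fixed 100-dimensional vector."""
    vec = np.concatenate([token_to_vec[t] for t in peptide_tokens])
    padding = np.zeros(MAX_LEN * VEC_DIM - vec.size)
    return np.concatenate([vec, padding])   # shape (100,) for 9- or 10-mers
```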
Model testing and validation
To predict the binding affinity of peptides to HLA-A*02:01 based on both canonical and non-canonical amino acid composition, a machine learning model was established. Fig 5 illustrates the framework of the model. The model follows a structural feature-based supervised learning architecture in which each input is a 100-dimensional feature vector representing the structural and physicochemical properties of the peptide, and the output is the logarithm-transformed IC50 (nM) binding affinity.
Fig 5. Overview of the predictive model framework.
During model building, five-fold cross-validation was applied to the dataset, with root mean square error (RMSE) and R-squared (R2) used as evaluation metrics. To compare the training results of Partial Least Squares (PLS) regression with other commonly used algorithms, the open-source tool Lazy Predict was applied to the same dataset.
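The evaluation loop can be sketched as follows. Here X denotes the (166 × 100) feature matrix and y the log-transformed IC50 values from the encoding step; the variable names and random seed are illustrative rather than those used to generate the reported numbers.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold

def cross_validate_pls(X, y, n_components, n_splits=5, seed=0):
    """Return mean test-set R2 and RMSE over a k-fold split for a PLS model."""
    r2s, rmses = [], []
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = PLSRegression(n_components=n_components)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx]).ravel()
        r2s.append(r2_score(y[test_idx], pred))
        rmses.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return np.mean(r2s), np.mean(rmses)

# Component sweep analogous to Table 2:
# for k in range(2, 11):
#     print(k, cross_validate_pls(X, y, k))
```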
Three components were selected for the Partial Least Squares (PLS) model because, among the 2–10 components tested, three components yielded the best five-fold cross-validated R-squared (R2) and root mean square error (RMSE). The detailed results of this comparison are listed in Table 2. Fig 6 illustrates the correlation between the experimentally determined and predicted binding affinities for both the training set and the test set, using the PLS model from each individual cycle of five-fold cross-validation. The scatter plots reveal a clear correlation between the actual and predicted values, demonstrating the model’s effectiveness despite the relatively small dataset. This strong correlation in both training and test datasets indicates that the model generalizes well and is not overfitted.
Table 2. Performance of PLS with different numbers of components: cross-validated R2 and RMSE.
| Components | Cross-validated R2 | Cross-validated RMSE |
|---|---|---|
| 2 | 0.444 | 0.759 |
| 3 | 0.477 | 0.735 |
| 4 | 0.463 | 0.743 |
| 5 | 0.451 | 0.750 |
| 6 | 0.425 | 0.766 |
| 7 | 0.395 | 0.786 |
| 8 | 0.355 | 0.810 |
| 9 | 0.325 | 0.827 |
| 10 | 0.283 | 0.852 |
Fig 6. PLS model performance (5-fold cross validation) shown as actual vs. predicted log10(IC50).
The dataset was split into five equally sized folds; for each fold, the correlation between predicted and experimentally determined IC50 values was calculated. Training data are shown in red, test data in blue.
To provide a comprehensive comparison, the same 5-fold cross-validation was performed using the various regressors implemented in Lazy Predict (version 0.2.11) [24]; a total of 36 different regressors were included. Fig 7 displays the test set R-squared from the first validation cycle for all regressors, with algorithm names labeled for reference. Additional figures displaying RMSE and R2 across all regressors for each validation cycle are available in the GitHub repository, the link to which is provided in the Data Availability section.
Fig 7. Test set R-squared of different regressors from Lazy Predict in the first cross-validation cycle.
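A sketch of the Lazy Predict screen for a single fold is shown below; X and y are the feature matrix and log-transformed IC50 targets from the earlier sketches, and the report column names may differ slightly between lazypredict versions.

```python
from lazypredict.Supervised import LazyRegressor
from sklearn.model_selection import KFold

# One fold of the 5-fold split; in practice each fold is screened in turn.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
train_idx, test_idx = next(iter(kf.split(X)))

reg = LazyRegressor(verbose=0, ignore_warnings=True, custom_metric=None)
models, _ = reg.fit(X[train_idx], X[test_idx], y[train_idx], y[test_idx])

# One row per regressor; column names ("R-Squared", "RMSE") may vary by version.
print(models[["R-Squared", "RMSE"]])
```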
The test set R2 and RMSE results for the top three performing algorithms among these 36, along with those from PLS, are summarized in Table 3. Performance varied by data split, and the best-performing algorithm differed between validation cycles. However, certain algorithms appeared more frequently in the top ranks: among the 15 top-three entries in Table 3, the most frequent high performers were ExtraTrees (3 appearances), GradientBoosting (2 appearances), and Tweedie (2 appearances). Detailed test set R2 and RMSE values for these three algorithms, along with PLS, across all five validation cycles are summarized in Table 4.
Table 3. Comparison of test set R2 and RMSE for top three performing models and PLS regression for each validation.
| Cross-validation | Regressor | R2 | RMSE |
|---|---|---|---|
| 1 | DecisionTree | 0.6280 | 0.6819 |
| 1 | GradientBoosting | 0.5691 | 0.7339 |
| 1 | ExtraTrees | 0.5559 | 0.7451 |
| 1 | PLS | 0.5639 | 0.7385 |
| 2 | GradientBoosting | 0.6947 | 0.6812 |
| 2 | Bagging | 0.6292 | 0.7507 |
| 2 | BayesianRidge | 0.6248 | 0.7552 |
| 2 | PLS | 0.5780 | 0.8010 |
| 3 | AdaBoost | 0.6658 | 0.5084 |
| 3 | ExtraTrees | 0.6063 | 0.5518 |
| 3 | RandomForest | 0.6028 | 0.5542 |
| 3 | PLS | 0.5089 | 0.6163 |
| 4 | ExtraTrees | 0.6304 | 0.6615 |
| 4 | AdaBoost | 0.5931 | 0.8666 |
| 4 | Tweedie | 0.5840 | 0.7018 |
| 4 | PLS | 0.6761 | 0.6093 |
| 5 | NuSVR | 0.5528 | 0.5358 |
| 5 | SVR | 0.5418 | 0.5424 |
| 5 | Tweedie | 0.4836 | 0.5758 |
| 5 | PLS | 0.2871 | 0.6766 |
Table 4. Comparison of test set R2 and RMSE for PLS and the top three frequently high-performing regressors across all validation cycles.
| Fold | ExtraTrees R2 | ExtraTrees RMSE | GradientBoosting R2 | GradientBoosting RMSE | Tweedie R2 | Tweedie RMSE | PLS R2 | PLS RMSE |
|---|---|---|---|---|---|---|---|---|
| Cross-validation 1 | 0.5559 | 0.7451 | 0.5691 | 0.7340 | 0.4780 | 0.8079 | 0.5639 | 0.7385 |
| Cross-validation 2 | 0.5920 | 0.7875 | 0.6915 | 0.6847 | 0.4891 | 0.8812 | 0.5780 | 0.8010 |
| Cross-validation 3 | 0.6063 | 0.5518 | 0.5092 | 0.6160 | 0.5890 | 0.5638 | 0.5089 | 0.6163 |
| Cross-validation 4 | 0.6304 | 0.6615 | 0.4925 | 0.7752 | 0.5841 | 0.7018 | 0.6761 | 0.6093 |
| Cross-validation 5 | 0.3784 | 0.6317 | 0.4106 | 0.6152 | 0.4836 | 0.5759 | 0.2871 | 0.6766 |
For all algorithms, model performance varied by fold – suggesting sensitivity to the specific data splits. This variability may be caused by the relatively small size of the test sets, which can result in low variance in the target variable (y-values). Such low variance can, in turn, lead to cases where certain validation cycles exhibit lower R2 values despite achieving lower RMSE. Additionally, with smaller datasets, R2 is more susceptible to the influence of outliers, further complicating the interpretation of model performance.
Discussion
Compared with sequence-based prediction tools such as NetMHCpan [10,11], the most important improvement our model achieves is its ability to significantly expand the coverage of amino acid species in the modeled peptides. Not only does it include the 20 canonical amino acids, but it also takes NCAAs into account without compromising structural accuracy. As long as the structure of an NCAA is known, applying this protocol to predict affinity is straightforward. Additionally, to make the model even more user-friendly, we have eliminated the need to represent the MHC itself in the model-building process. This means that, compared with structure-based or docking-based approaches such as Rosetta FlexPepDock [15], our model provides results much faster with minimal human intervention, because our method does not require the provision and fine-tuning of large and complex protein structures, thereby accelerating the prediction process and reducing the potential for user error.
Despite these promising results, it is important to acknowledge that the current size of the training and test datasets is relatively small, which may limit the model’s performance. Although the Immune Epitope Database (IEDB) contains a substantial amount of data regarding peptide-MHC binding affinities, only a small percentage of these data includes quantitative binding values, and an even smaller portion pertains to peptides containing NCAAs. Collecting more data would enhance the model’s ability to capture a broader range of patterns and interactions, thereby improving its robustness and reliability.
Another future effort involves expanding the scope of the model to include MHCs from other species. For this study, we used data related solely to HLA-A*02:01 to ensure consistency, but extending the protocol to incorporate other MHC types would significantly widen the prediction coverage and improve the model’s reliability. By encompassing a larger variety of MHC alleles, we can better understand the nuances of peptide-MHC interactions across different biological contexts, making the model more universally applicable.
An additional consideration is the potential advantage of using ensemble regressors instead of relying on a single algorithm. Given the observed variability in performance across different models and validation folds, an ensemble approach that combines predictions from multiple algorithms may yield more reliable results. By aggregating output – such as through voting or averaging – ensemble models can help mitigate the impact of data-specific fluctuations, potentially improving overall prediction accuracy.
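As a purely illustrative sketch (not part of the published pipeline), the three frequent top performers from Table 4 could be averaged with scikit-learn's VotingRegressor:

```python
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import TweedieRegressor

# Simple prediction-averaging ensemble; hyperparameters are left at their defaults
# and would need tuning on the actual dataset.
ensemble = VotingRegressor([
    ("extra_trees", ExtraTreesRegressor(random_state=0)),
    ("grad_boost", GradientBoostingRegressor(random_state=0)),
    ("tweedie", TweedieRegressor()),
])
# ensemble.fit(X_train, y_train); preds = ensemble.predict(X_test)
```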
In conclusion, our model presents a notable advancement in peptide-MHC binding affinity predictions by expanding amino acid coverage and simplifying the prediction process. Future enhancements through increased dataset size and broader MHC coverage will further solidify its utility and accuracy, making it a powerful tool for computational immunology and related fields.
Data Availability
All relevant data for this study are publicly available from the GitHub repository (https://github.com/meilerlab/ML_PLS_MHC_peptide_binding_pred).
Funding Statement
J.M. is supported by a Humboldt Professorship of the Alexander von Humboldt Foundation. J.M. acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG) through SFB1423 (421152132), SFB 1664 (514901783), TRR (514664767), and SPP 2363 (460865652). J.M. is supported by the Federal Ministry of Education and Research (BMBF) through the Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), through the German Network for Bioinformatics Infrastructure (de.NBI), and through the German Academic Exchange Service (DAAD) via the School of Embedded Composite AI (SECAI 15766814). Work in the Meiler laboratory is further supported through the National Institute of Health (NIH) through R01 HL122010, R01 DA046138, R01 AG068623, U01 AI150739, R01 CA227833, R01 LM013434, S10 OD016216, S10 OD020154, S10 OD032234, 5T32HL144446-05. D.G.H. is supported by NIH grants IH R35HL140016 and NIH AG076785. N.B. is supported by NIH grant 5T32HL144446-05.
References
- 1. Janeway CA, Travers P, Walport M, Shlomchik MJ. The major histocompatibility complex and its functions. In: Immunobiology: The Immune System in Health and Disease. 5th edition. Garland Science; 2001. Available from: https://www.ncbi.nlm.nih.gov/books/NBK27156/
- 2. Matsumura M, Fremont DH, Peterson PA, Wilson IA. Emerging principles for the recognition of peptide antigens by MHC class I molecules. Science. 1992;257(5072):927–34. doi: 10.1126/science.1323878
- 3. Archbold JK, Macdonald WA, Gras S, Ely LK, Miles JJ, Bell MJ, et al. Natural micropolymorphism in human leukocyte antigens provides a basis for genetic control of antigen recognition. J Exp Med. 2009;206(1):209–19. doi: 10.1084/jem.20082136
- 4. Dendrou CA, Petersen J, Rossjohn J, Fugger L. HLA variation and disease. Nat Rev Immunol. 2018;18(5):325–39. doi: 10.1038/nri.2017.143
- 5. James EA, Rieck M, Pieper J, Gebe JA, Yue BB, Tatum M, et al. Citrulline-specific Th1 cells are increased in rheumatoid arthritis and their frequency is influenced by disease duration and therapy. Arthritis Rheumatol. 2014;66(7):1712–22. doi: 10.1002/art.38637
- 6. Kirabo A, Fontana V, de Faria APC, Loperena R, Galindo CL, Wu J, et al. DC isoketal-modified proteins activate T cells and promote hypertension. J Clin Invest. 2014;124(10):4642–56. doi: 10.1172/JCI74084
- 7. Kacen A, Javitt A, Kramer MP, Morgenstern D, Tsaban T, Shmueli MD, et al. Post-translational modifications reshape the antigenic landscape of the MHC I immunopeptidome in tumors. Nat Biotechnol. 2023;41(2):239–51. doi: 10.1038/s41587-022-01464-2
- 8. Zhao W, Sher X. Systematically benchmarking peptide-MHC binding predictors: From synthetic to naturally processed epitopes. PLoS Comput Biol. 2018;14(11):e1006457. doi: 10.1371/journal.pcbi.1006457
- 9. Mei S, Li F, Leier A, Marquez-Lago TT, Giam K, Croft NP, et al. A comprehensive review and performance evaluation of bioinformatics tools for HLA class I peptide-binding prediction. Brief Bioinform. 2020;21(4):1119–35. doi: 10.1093/bib/bbz051
- 10. Hoof I, Peters B, Sidney J, Pedersen LE, Sette A, Lund O, et al. NetMHCpan, a method for MHC class I binding prediction beyond humans. Immunogenetics. 2009;61(1):1–13. doi: 10.1007/s00251-008-0341-z
- 11. Nielsen M, Andreatta M. NetMHCpan-3.0; improved prediction of binding to MHC class I molecules integrating information from multiple receptor and peptide length datasets. Genome Med. 2016;8(1):33. doi: 10.1186/s13073-016-0288-x
- 12. Reynisson B, Alvarez B, Paul S, Peters B, Nielsen M. NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 2020;48(W1):W449–54. doi: 10.1093/nar/gkaa379
- 13. O’Donnell TJ, Rubinsteyn A, Bonsack M, Riemer AB, Laserson U, Hammerbacher J. MHCflurry: Open-Source Class I MHC Binding Affinity Prediction. Cell Syst. 2018;7(1):129–132.e4. doi: 10.1016/j.cels.2018.05.014
- 14. Bloodworth N, Barbaro NR, Moretti R, Harrison DG, Meiler J. Rosetta FlexPepDock to predict peptide-MHC binding: An approach for non-canonical amino acids. PLoS One. 2022;17(12):e0275759. doi: 10.1371/journal.pone.0275759
- 15. Raveh B, London N, Zimmerman L, Schueler-Furman O. Rosetta FlexPepDock ab-initio: simultaneous folding, docking and refinement of peptides onto their receptors. PLoS One. 2011;6(4):e18934. doi: 10.1371/journal.pone.0018934
- 16. Liu T, Pan X, Chao L, Tan W, Qu S, Yang L, et al. Subangstrom accuracy in pHLA-I modeling by Rosetta FlexPepDock refinement protocol. J Chem Inf Model. 2014;54(8):2233–42. doi: 10.1021/ci500393h
- 17. Vita R, Blazeska N, Marrama D, IEDB Curation Team Members, Duesing S, Bennett J, et al. The Immune Epitope Database (IEDB): 2024 update. Nucleic Acids Res. 2025;53(D1):D436–43. doi: 10.1093/nar/gkae1092
- 18. Ferreira LLG, Andricopulo AD. ADMET modeling approaches in drug discovery. Drug Discov Today. 2019;24(5):1157–65. doi: 10.1016/j.drudis.2019.03.015
- 19. Jiang L, Yu H, Li J, Tang J, Guo Y, Guo F. Predicting MHC class I binder: existing approaches and a novel recurrent neural network solution. Brief Bioinform. 2021;22(6):bbab216. doi: 10.1093/bib/bbab216
- 20. Jing X, Dong Q, Hong D, Lu R. Amino Acid Encoding Methods for Protein Sequences: A Comprehensive Review and Assessment. IEEE/ACM Trans Comput Biol Bioinform. 2020;17(6):1918–31. doi: 10.1109/TCBB.2019.2911677
- 21. Hoppes R, Oostvogels R, Luimstra JJ, Wals K, Toebes M, Bies L, et al. Altered peptide ligands revisited: vaccine design through chemically modified HLA-A2-restricted T cell epitopes. J Immunol. 2014;193(10):4803–13. doi: 10.4049/jimmunol.1400800
- 22. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci. 1988;28(1):31–6. doi: 10.1021/ci00057a005
- 23. Landrum G, Tosco P, Kelley B, Rodriguez R, Cosgrove D, Vianello R, et al. rdkit/rdkit: 2025_03_1 (Q1 2025) Release. Zenodo. 2025. doi: 10.5281/zenodo.15115844
- 24. lazypredict. Available from: https://pypi.org/project/lazypredict/