Proceedings of the National Academy of Sciences of the United States of America. 2011 Apr 8;108(17):6981–6986. doi: 10.1073/pnas.1018165108

Large-scale characterization of peptide-MHC binding landscapes with structural simulations

Chen Yanover 1, Philip Bradley 1
PMCID: PMC3084072  PMID: 21478437

Abstract

Class I major histocompatibility complex proteins play a critical role in the adaptive immune system by binding to peptides derived from cytosolic proteins and presenting them on the cell surface for surveillance by T cells. The varied peptide binding specificity of these highly polymorphic molecules has important consequences for vaccine design, transplantation, autoimmunity, and cancer development. Here, we describe a molecular modeling study of MHC-peptide interactions that integrates sampling techniques from protein–protein docking, loop modeling, de novo structure prediction, and protein design in order to construct atomically detailed peptide binding landscapes for a diverse set of MHC proteins. Specificity profiles derived from these landscapes recover key features of experimental binding profiles and can be used to predict peptide binding with reasonable accuracy. Family-wide comparison of the predicted binding landscapes recapitulates previously reported patterns of specificity divergence and peptide-repertoire diversity while providing a structural basis for observed specificity patterns. The size and sequence diversity of these structure-based binding landscapes enable us to identify subtle patterns of covariation between peptide sequence positions; analysis of the associated structural models suggests physical interactions that may mediate these sequence correlations.


Class I MHC proteins selectively bind short (typically, 8–10 amino acids) peptides derived from proteasomal degradation of cytosolic proteins and present these peptides on the cell surface for surveillance by CD8+ T lymphocytes. By this mechanism, non-self peptides derived from intracellular pathogens can be detected by the immune system, and infected cells can be targeted for destruction. The MHC proteins are highly polymorphic (the class I MHC gene HLA-B is estimated to be the most polymorphic gene in the human genome; ref. 1), with different proteins recognizing specific and often quite divergent peptide repertoires. MHC polymorphism has been reported to play a role in a wide range of phenomena including susceptibility to infectious and inflammatory diseases, drug toxicity, autoimmunity, cancer, and transplantation outcome (2–5). The specific role of peptide-repertoire variation in each context is not well understood, in part because comprehensive characterization of the peptide binding preferences of individual MHC proteins is experimentally challenging. As a result, there has been great interest in computational models of MHC-peptide recognition. This interest has led to the development of machine-learning algorithms that are able to predict—in many cases, with high accuracy—the binding of a novel peptide to a target MHC molecule by training on experimental binding data for that protein or related MHC family members (6–10).

The X-ray crystallographic structure determination of the first MHC protein (11) and, later, of several hundred peptide-MHC complexes, has shed light on the physicochemical attributes facilitating the formation of such complexes. The overall conserved nature of these structural features, coupled with MHC- and peptide-specific conformational subtleties, has made MHC proteins an intriguing model system for molecular modeling studies. Several studies have reported attempts to predict the structure of particular peptide-MHC complexes, either from a solved structure of the target protein itself (12, 13), or using the structure of a homologous MHC protein as a template (14, 15). Subsequently, binding and affinity prediction algorithms have been devised based on structural features (e.g., MHC-peptide amino acid distances, refs. 16 and 17; all-atom interaction energy components, refs. 15, 18, and 19; and intermolecular contacts, ref. 20) computed from low- or high-resolution models of peptide-MHC complexes. Notably, such high-resolution features are often integrated with experimental binding data and binding predictions are made using specifically trained machine-learning algorithms (15, 18, 20), hinting at the challenges faced by high-resolution modeling.

Here we report the direct use of atomically detailed molecular modeling to map peptide binding landscapes for a diverse collection of HLA-A and HLA-B proteins. For our purposes, a complete description of a peptide binding landscape would consist of the binding affinities of the target protein for all possible peptides, and three-dimensional cocomplex structures for all high-affinity binding peptides (peptide structure, as well as sequence, being critical for processes such as recognition by T cells). A complete description of this sort is currently out of reach both computationally and experimentally; we assemble instead an approximate binding landscape model consisting of amino acid sequences, molecular models, and predicted binding energies for a large and diverse population of optimized binding peptides generated by flexible-backbone peptide docking and sequence design simulations (Fig. 1). We show that a simple, position-independent binding model derived from this landscape recovers key features of experimental binding profiles, and can be used to predict peptide binding to the target protein with reasonable accuracy. We further show that these structure-based binding landscapes are well suited to the comparative study of binding preferences within the MHC family, providing structural insight into differences in binding specificity for closely related proteins, recapitulating previously reported variation in peptide-repertoire diversity, and allowing us to cluster proteins according to their specificity profiles. By leveraging the size, sequence diversity, and structural resolution of these binding landscape models, we are able to investigate features of MHC-peptide interactions—pairwise correlations between peptide sequence positions, for example—that are not readily addressed by direct analysis of the limited dataset of experimentally characterized peptides.

Fig. 1.

Mapping the peptide binding landscape with docking and sequence design. Many independent flexible-backbone peptide docking simulations generate diverse samples in structure space. Simultaneous peptide sequence refinement focuses exploration to the higher-affinity subspace.

Results

We used structural modeling simulations to generate peptide binding landscapes (Fig. 1) for 11 HLA-A and 18 HLA-B proteins (Table S1). For each target protein, we constructed three independent binding landscapes, one from a crystal structure of the target protein (“self-template”—where available), and two from homology models of the target protein, one built from an HLA-A template (A*02:01), and one from an HLA-B template (B*35:01). These homology-based predictions test our ability to model binding preferences of uncharacterized proteins. To facilitate direct comparison with experimental binding data, we reduced the diverse peptide sequences and binding energies in each landscape to a position-specific binding energy matrix (PBEM) by estimating an independent binding energy contribution for each amino acid at each position, and then to a position-specific frequency matrix (PFM) by transforming these energies to probabilities via the Boltzmann equation.
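The PBEM-to-PFM step described above can be sketched as a per-position Boltzmann weighting. This is a minimal illustration, not the authors' implementation: the function name `pbem_to_pfm`, the data layout (one dictionary per peptide position), the temperature factor `kT`, and the toy energies are all assumptions made for the example.

```python
import math

def pbem_to_pfm(pbem, kT=1.0):
    """Convert per-position energies to per-position frequencies.

    pbem: list of dicts, one per peptide position, mapping amino acid -> energy
          (lower energy = more favorable).
    kT:   temperature factor in the same units as the energies (assumed value).
    """
    pfm = []
    for column in pbem:
        # Shift by the column minimum before exponentiating (numerical stability;
        # the shift cancels in the normalization).
        e_min = min(column.values())
        weights = {aa: math.exp(-(e - e_min) / kT) for aa, e in column.items()}
        z = sum(weights.values())  # per-position partition function
        pfm.append({aa: w / z for aa, w in weights.items()})
    return pfm

# Toy two-position example with made-up energies (hypothetical values).
toy_pbem = [
    {"A": 0.0, "P": -2.0, "G": 1.0},
    {"I": -1.0, "V": -1.0, "F": 0.5},
]
for i, col in enumerate(pbem_to_pfm(toy_pbem), start=1):
    print(f"P{i}", {aa: round(p, 3) for aa, p in col.items()})
```

Each column of the result sums to one, and the lowest-energy amino acid at a position receives the highest frequency; the choice of `kT` controls how sharply the frequencies concentrate on it.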

Binding Prediction.

Each predicted peptide binding landscape consists of a large set of optimized peptide sequences, atomically detailed cocomplex structures, and predicted binding energies. Generating a similarly detailed experimental dataset is currently infeasible and we, therefore, resort to comparison with the most comprehensive data currently available: quantitative binding motifs derived using positional scanning combinatorial peptide libraries. Briefly, such a combinatorial library is composed of mixtures of peptides; in each mixture, the peptides share a single amino acid in a certain position and its measured binding affinity is, effectively, the average influence of that amino acid on binding, in a diverse set of sequential contexts (21). Note that reducing our structurally derived data to PFMs follows, essentially, the same rationale.
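The averaging rationale shared by positional scanning libraries and the structure-derived matrices can be illustrated as follows. The estimator actually used to build the PBEMs is not specified in this section, so the sketch assumes the simplest possibility: the mean energy of all peptides carrying a given amino acid at a given position, relative to the overall mean. The function name and the toy landscape are hypothetical.

```python
from collections import defaultdict

def positional_mean_energies(landscape):
    """Estimate a position-specific energy matrix by simple averaging.

    landscape: list of (peptide_sequence, binding_energy) pairs, all peptides
               of equal length.
    Returns a list of dicts (one per position) mapping amino acid -> mean
    energy of peptides carrying that amino acid there, minus the overall mean.
    """
    if not landscape:
        raise ValueError("landscape is empty")
    length = len(landscape[0][0])
    overall = sum(e for _, e in landscape) / len(landscape)
    sums = [defaultdict(float) for _ in range(length)]
    counts = [defaultdict(int) for _ in range(length)]
    for seq, energy in landscape:
        for pos, aa in enumerate(seq):
            sums[pos][aa] += energy
            counts[pos][aa] += 1
    return [
        {aa: sums[pos][aa] / counts[pos][aa] - overall for aa in sums[pos]}
        for pos in range(length)
    ]

# Made-up 3-mer landscape; real landscapes hold thousands of 9-mers.
toy_landscape = [("APK", -3.0), ("APR", -2.0), ("GAK", 1.0), ("GPK", -1.0)]
matrix = positional_mean_energies(toy_landscape)
print(matrix[1])  # contributions at the second position
```

In this toy landscape, proline at the second position comes out as favorable (negative contribution) because the peptides that carry it have lower energies on average than the full set.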

The experimental and computational PFMs are substantially similar (30 out of 34 comparisons with P < 0.01; P-value computation details are given in Materials and Methods) and share many specific sequence features (Fig. 2 and Figs. S1 and S2); for example, B*51:01 dominant proline in position P2 (throughout the paper, Pi denotes peptide position i), as well as isoleucine and valine in P9. Focusing on the key, specificity-determining positions (as identified by Sidney et al., ref. 21), the majority of the preferred amino acids are recovered by the structural simulations (using identical thresholds for selection of preferred amino acids; complete results are given in Figs. S1 A and B). Moreover, structural modeling identifies peptide-MHC interactions that may underlie observed binding preferences; for example, a bidentate hydrogen bond between D74 and R5 for B*08:01, and a hydrophobic pocket surrounding I9, whose position is constrained by a network of hydrogen bonds to the peptide backbone (Fig. 2; MHC and peptide sequence positions are denoted in italics and boldface, respectively).

Fig. 2.

PFMs derived from combinatorial library assays and structural simulations agree in many key positions. (Left) Representative side-by-side comparisons of experimental (Left) and computational (Second Column) PFMs, as well as examples for specificity-determining interactions suggested by structural modeling (Third Column; peptide backbone is shown in green, interacting peptide and MHC side chains are depicted in yellow and pink, respectively), for two MHC proteins: B*08:01 (Upper) and B*51:01 (Lower). (Right) Quality of the structure-based PFMs using B*35:01 structure as a template. Per-position divergence from the experimental PFM was transformed to a P value and plotted using a color scale ranging from P = 0.5 to P = 1/(9 × 15) (by chance, one would expect roughly half the squares in the table to be completely white and one to be black); P values for the total divergence across the peptide (“All”) and restricted to primary and secondary anchor positions (“Anchors”) as defined by ref. 21 are shown to the right (primary and secondary anchor positions are indicated by “A” and “a,” respectively).

To further evaluate the correctness of our structure-based binding predictions, we turn to the more traditional task of discriminating, for a given MHC, binding from nonbinding peptides, and, more generally, predicting MHC-peptide binding affinities. For this task, we use the position-specific binding energy matrices derived from our peptide binding landscapes to assign binding energies to peptide sequences; accuracy is assessed by calculating the area under the receiver operating characteristic curve (auROC), which measures our ability to discriminate binding from nonbinding peptides [perfect discrimination would give an auROC of 1.0; the expected auROC for a random ordering of the peptides is 0.5, and significance estimates can be calculated using the Mann–Whitney–Wilcoxon (MWW) rank-sum test]. Overall, the majority of energy matrices yield significantly accurate binding predictions (40 out of 48 with MWW P < 0.01, Fig. 3; accuracy of affinity predictions is shown in Fig. S3). In most cases, the binding energy matrix based on the self-structure (median auROC of 0.756 and 0.786 for HLA-A and HLA-B, respectively) performs better than, or equal to, the best matrix based on a homology model (0.689 and 0.778), attesting to the effect of minor MHC backbone variations on binding preferences. For example, the performance gap for HLA A*02:01—with auROC of 0.826 and 0.614 for self and homology modeling matrices, respectively—can be, in part, attributed to a shift in an MHC beta strand which enlarges the C-terminal pocket and leads to strong preference for phenylalanine at P9 rather than the experimentally observed isoleucine and valine (Fig. S2). On the whole, predictions for HLA-B proteins are more accurate than for HLA-A proteins, which may be due to the fact that, on average, the HLA-B peptide binding datasets contain a larger fraction of very low-affinity peptides (Table S1). 
We note that these predictions are in general less accurate than those made by state-of-the-art sequence-based machine-learning algorithms, although an unbiased direct comparison on the same set of peptides is difficult due to likely overlap with training data. To provide a rough frame of reference, we computed median auROC scores for three recently benchmarked (10) pan-specific binding prediction methods, yielding scores of 0.81–0.89 and 0.68–0.78 for HLA-A and HLA-B proteins, respectively (details in SI Materials and Methods; note that auROC scores can be sensitive to the sequences in the testing set).
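The evaluation described above, scoring peptides with a position-specific energy matrix and summarizing discrimination as an auROC, can be sketched as below. The auROC is computed directly as the normalized Mann–Whitney U statistic, that is, the fraction of binder/nonbinder pairs in which the binder receives the lower energy. Function names, the matrix layout, and the toy data are assumptions for illustration only.

```python
def score_peptide(pbem, peptide):
    """Sum independent per-position energies (lower = predicted stronger binder)."""
    return sum(pbem[pos][aa] for pos, aa in enumerate(peptide))

def au_roc(binder_scores, nonbinder_scores):
    """auROC as the normalized Mann-Whitney U statistic.

    Counts binder/nonbinder pairs in which the binder has the lower energy;
    ties count as half a pair.
    """
    if not binder_scores or not nonbinder_scores:
        raise ValueError("need at least one binder and one nonbinder")
    wins = 0.0
    for b in binder_scores:
        for n in nonbinder_scores:
            if b < n:
                wins += 1.0
            elif b == n:
                wins += 0.5
    return wins / (len(binder_scores) * len(nonbinder_scores))

# Toy two-position matrix and peptide sets (hypothetical values).
toy_pbem = [{"A": 0.0, "P": -2.0}, {"I": -1.0, "F": 0.5}]
binders = ["PI", "AI"]
nonbinders = ["AF", "PF"]
auc = au_roc(
    [score_peptide(toy_pbem, p) for p in binders],
    [score_peptide(toy_pbem, p) for p in nonbinders],
)
print(auc)
```

The quadratic pair loop is chosen for clarity; a rank-based computation gives the same value and scales better to large peptide sets.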

Fig. 3.

Binding prediction accuracy. Simulation-derived PBEMs for seven HLA-A (Upper) and twelve HLA-B proteins (Lower) were used to predict binding scores for corresponding peptides with experimentally determined binding affinity (6). Area under the ROC curve for discriminating binders from nonbinders is shown for self, as well as homology modeling, based PBEMs. Significantly discriminative binding predictions (MWW P < 0.01) are denoted by *; (un), self-structure unavailable; (self) template is a self-structure.

Binding Preference Divergence.

The peptide binding specificity of a single MHC protein is a complex function of its amino acid sequence, with overlapping subsets of MHC sequence positions contributing to preferences at individual peptide positions. Comparison of peptide binding data for closely related proteins—differing at one or a few sequence positions—represents one approach toward untangling this complexity and identifying the impact of individual MHC positions on binding. We selected six pairs of such proteins whose binding specificity had been recently compared by experimental means (Table S2), computed peptide binding landscapes for each pair, and directly compared the predicted difference in binding profiles for each protein pair with the experimentally observed specificity divergences.

Our structure-based analysis predicts the majority of peptide positions experimentally associated with considerable differences in binding specificity, for each pair of proteins (Fig. 4, Right, darker shades of gray and red boxes, respectively). A few additional positions—e.g., P3 for B*41:01, B*41:02 and P7 for A*03:01, A*03:02—appear to be altered based on our data, but these changes have not been confirmed experimentally. These discrepancies can be attributed, in part, to the limitations of our structural simulations (see Discussion), but also, perhaps, to the qualitative description of experimental binding preferences, called for by the relatively small number of binding peptides available for each protein; for example, although Escobar et al. (22) focus on P9, recalculation of the PFMs for B*35:01 and B*35:03 using their data revealed considerable divergences in P3 and P6, as also noted by our structural simulations.

Fig. 4.

Binding preference divergence. (Left) Differential PFMs, showing, for each position, the amino acids predicted to bind one MHC protein more favorably than another. The height of each column is proportional to the specificity divergence at that position; the difference in amino acid frequencies determine the height and order of the corresponding letters. Structural models (second and third columns) suggest plausible connections between MHC sequence mismatches and peptide binding preference (see text for details; color scheme as in Fig. 2). (Right) Per-position specificity divergences for six pairs of MHC proteins, differing at one or two positions (mismatched amino acids are shown on the right). Positions at which particular MHC polymorphism has been experimentally shown to alter the binding preference are red boxed.
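A per-position comparison of two PFMs along these lines can be sketched as below. The divergence measure used for Fig. 4 is not given in this section; the sketch assumes the Jensen–Shannon divergence (in bits) as a stand-in, and reports the signed frequency differences that would determine letter heights in a differential logo. All names are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (bits) between two amino acid distributions."""
    total = 0.0
    for aa in set(p) | set(q):
        pa, qa = p.get(aa, 0.0), q.get(aa, 0.0)
        m = 0.5 * (pa + qa)
        if pa > 0:
            total += 0.5 * pa * math.log2(pa / m)
        if qa > 0:
            total += 0.5 * qa * math.log2(qa / m)
    return total

def differential_pfm(pfm_a, pfm_b):
    """Per-position (divergence, signed frequency differences a minus b)."""
    result = []
    for col_a, col_b in zip(pfm_a, pfm_b):
        diffs = {
            aa: col_a.get(aa, 0.0) - col_b.get(aa, 0.0)
            for aa in set(col_a) | set(col_b)
        }
        result.append((js_divergence(col_a, col_b), diffs))
    return result

# Made-up single-position example: one protein prefers F/Y, the other Q.
pfm_a = [{"F": 0.5, "Y": 0.4, "Q": 0.1}]
pfm_b = [{"F": 0.1, "Y": 0.1, "Q": 0.8}]
for divergence, diffs in differential_pfm(pfm_a, pfm_b):
    print(round(divergence, 3), {aa: round(d, 2) for aa, d in sorted(diffs.items())})
```

Positive differences mark amino acids favored by the first protein and negative ones those favored by the second; the divergence is zero for identical columns and one bit for columns with no amino acids in common.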

Similarly, many of the per-position alterations in binding preference previously described experimentally are recapitulated by our simulations (Fig. 4, Fig. S4, and Table S2). Two such examples are shown in Fig. 4: the A*03:01 preference for phenylalanine and tyrosine at P3 compared to glutamine for A*03:02 (23) (Top); and the A*66:01 preference for polar acidic side chains at P1, in contrast to A*66:02, which preferentially binds serine (and lysine) at that position (24) (Bottom). An important feature of our analysis is that it allows, unlike the experimental data, a quantitative comparison between various positions within a single pair of MHC proteins, as well as across pairs. Such quantitative comparisons of binding specificity could prove invaluable clinically, for example in assessing the risk of hematopoietic cell transplantation from a mismatched donor (25). Moreover, structural modeling suggests mechanistic explanations for the differences in binding preferences: Q3 in a peptide bound to A*03:02 can form a bidentate hydrogen bond with Q156, whereas Y3 packs against the nonpolar L156 in A*03:01 and can hydrogen bond with E152; replacing positively charged R163 in A*66:01 with a negatively charged glutamate reduces the preference for negatively charged amino acids at P1, and increases binding for positively charged and neutral polar amino acids (Fig. 4).

Specificity divergence can also be used to portray, on a large scale, the landscape of MHC binding preferences, identifying “neighboring,” as well as distant, pairs of proteins. To that end, we hierarchically clustered all MHC proteins studied in this work—HLA-A and HLA-B combined—based on their pairwise divergences (Fig. S5A). We note that a similar approach has been taken in previous studies (see, e.g., refs. 26 and 27), using various distance measures, though, importantly, our analysis is solely based on structural simulations. The emerging dendrogram structure agrees well with the MHC class I supertype classification recently updated by Sidney et al. (28), with, mostly, members of each supertype grouped closely together (intrasupertype divergences are significantly smaller than intersupertype divergences: P < 10⁻⁷ for A*02:01-templated simulations and P = 0.0004 for B*35:01-templated simulations, after excluding divergences between the nearly identical allele pairs in Table S2). Notable exceptions are B*41:01 and B*41:02—both lack experimental data and have been classified into the B44 supertype based on sequence similarity only (28)—which show, in our analysis, considerably different binding preferences than the other B44 proteins. Intriguingly, the binding preferences of HLA-A proteins are not distinct from those of HLA-B ones, and, often, the closest neighbor of a protein from one group belongs to the other.
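Hierarchical clustering from a table of pairwise divergences can be sketched in a few lines. The linkage criterion behind Fig. S5A is not stated in this section, so the sketch assumes average linkage; the function name and the divergence values in the example are invented for illustration and are not results from the study.

```python
def average_linkage(names, dist):
    """Agglomerative clustering with average linkage.

    names: list of item labels.
    dist:  dict mapping frozenset({a, b}) -> pairwise divergence.
    Returns the list of merges as (cluster_1, cluster_2, height) tuples.
    """
    clusters = [(name,) for name in names]
    merges = []

    def cluster_distance(c1, c2):
        # Mean divergence over all cross-cluster pairs.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        height, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Made-up divergences between three proteins (hypothetical values).
names = ["A*03:01", "A*03:02", "B*08:01"]
dist = {
    frozenset(("A*03:01", "A*03:02")): 0.1,
    frozenset(("A*03:01", "B*08:01")): 0.9,
    frozenset(("A*03:02", "B*08:01")): 0.8,
}
for left, right, height in average_linkage(names, dist):
    print(left, right, round(height, 2))
```

The sequence of merges and their heights is what a dendrogram draws; proteins joined at low height have similar predicted binding preferences.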

The fact that the optimized peptide sequences comprising our binding landscapes are generated by a uniform procedure for all proteins allows direct comparison of their total sequence diversity; the analogous comparison of experimentally assayed binding peptides must be undertaken with great care, given the varying peptide selection criteria employed by different investigators. Such an analysis is motivated by several recent studies pointing to a relationship between MHC binding repertoire diversity—or, equivalently, the permissiveness of the binding motif—and various aspects of the immune response, in particular immunodominance of T-cell responses (29) and progression of HIV infection (30, 31). To quantify the diversity of binding peptides, we computed, for each MHC protein, the average distance among its strongest binders. This measure recapitulates the reported findings, showing that the repertoire of binding peptides for B*35:01 is more diverse than that of B*35:03 (30) (Fig. 5 and Fig. S5C, red framed bars); B*57:01’s repertoire is more limited than B*07:02, B*27:05, and B*35:01 (31) (blue bars); and HLA-A proteins bind, on average, more diverse sets of peptides than HLA-B proteins (29) (black bars, Fig. S5B).

Fig. 5.

Binding peptide diversity. For each MHC protein, the average pairwise Hamming distance for the 1% of lowest energy binders is shown, indicating the diversity of its binder repertoire (template, A*02:01). Proteins are ordered by diversity, and average values for HLA-A and HLA-B proteins are shown as black bars. Specific subsets of proteins are denoted as red-framed or blue-colored bars; see text for details. An asterisk stands for a two-tailed t-test P value < 10⁻³; ** indicates P values < 10⁻¹⁰.
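The diversity measure in this caption, the average pairwise Hamming distance among the lowest-energy 1% of peptides, can be sketched as follows. The function names, the data layout, and the rule of keeping at least two peptides for very small inputs are assumptions of the example.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def binder_diversity(landscape, top_fraction=0.01):
    """Average pairwise Hamming distance among the lowest-energy peptides.

    landscape:    list of (peptide_sequence, binding_energy) pairs.
    top_fraction: fraction of lowest-energy peptides to keep (1% in Fig. 5).
    """
    if len(landscape) < 2:
        raise ValueError("need at least two peptides")
    ranked = sorted(landscape, key=lambda item: item[1])
    n_top = max(2, int(len(ranked) * top_fraction))  # keep at least one pair
    top = [seq for seq, _ in ranked[:n_top]]
    pairs = list(combinations(top, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)

# Made-up landscape; with top_fraction=0.5 the two best peptides are compared.
toy = [("AAK", -3.0), ("AAR", -2.0), ("GGG", 0.0), ("PPP", 1.0)]
print(binder_diversity(toy, top_fraction=0.5))
```

With 20,000 docking simulations per protein, the 1% cutoff corresponds to about 200 peptides and roughly 20,000 pairs, so the exhaustive pair enumeration remains cheap.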

Pairwise Correlations.

Our structure-based binding specificity profiles implicitly assume independence between peptide positions, allowing easy visualization with sequence logos (32) and direct comparison with experimental peptide library data. This independence assumption is not likely to be completely valid, given that peptide positions can influence one another, either directly or indirectly. Previous work has shown that inclusion of pairwise attributes between peptide positions improves binding prediction accuracy, though only marginally (33), providing indirect evidence for their (possibly limited) importance. For the majority of MHC proteins, there is insufficient data on bound peptides to permit the identification of statistically significant pairwise correlations, considering the large number of position/amino acid combinations that must be tested and the biased nature of the peptide sequences themselves. Given that the structural simulations directly model residue interactions, and that the final sets of peptide sequences are large and diverse, we analyzed the predicted peptide binding landscapes to identify significant pairwise correlations and their underlying mechanistic basis.

We found evidence for a large number of significant pairwise correlations (Table S3). For a subset of the highly significant correlations that occur across multiple proteins, we examined the structural models of bound peptides with and without the correlating amino acid pair, and searched the experimental database of binding peptides for supporting evidence. We discovered two dominant classes of pairwise correlation: direct correlations mediated by favorable or unfavorable side-chain–side-chain interactions, and indirect correlations mediated by the peptide backbone conformation. In the first category are favorable interactions between directly contacting side chains (e.g., R3-E6, Fig. 6C, E3-R6, E3-R7, where R3-E6 indicates a pairwise correlation between R at P3 and E at P6), as well as positive and negative compensatory correlations between side chains that can occupy the same MHC pocket [a positive (negative) correlation means that the amino acid pair occurs more (less) often than expected based on the frequencies of the individual amino acids at their positions]. Compensatory interactions can be seen at positions 3 and 5, for example, where we observe positive correlations between amino acid pairs (A3-F5, Fig. 6A, A3-I5, F3-A5, Fig. 6B, Y3-G5, P3-F5) whose size compensates to fill a shared pocket formed by side chains of the MHC α2-helix, whereas amino acids that are either both small, both large, and/or like-charged (A3-G5, F3-Y5, R3-R5, D3-E5) show negative correlations. 
Analyzing the pooled dataset of experimentally validated binding peptides for all MHC proteins, we find evidence in support of these interactions: R-E and E-R amino acid pairs at positions 3 and 6 or 3 and 7 show positive correlations (P values from 0.1 to 0.001), and there are positive correlations for small-large and large-small amino acid combinations at positions 3 and 5 (P values less than 10⁻¹⁴ and 10⁻⁸, respectively; note that the true significance of experimental correlations is difficult to assess given the varied size and biased nature of experimentally assayed peptide sets and the effects of pooling data from multiple proteins).
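The definition of positive and negative correlation given above (an amino acid pair occurring more or less often than expected from the individual frequencies) can be made concrete with a short sketch. It reports only the log ratio of observed to expected co-occurrence; the significance test used in the study is not described in this section and is not reproduced here. The function name and the toy peptides are hypothetical.

```python
import math

def pair_enrichment(peptides, pos_i, aa_i, pos_j, aa_j):
    """log2(observed / expected) co-occurrence of two amino acids.

    Positions are 1-based (P1..P9). The expected count assumes independence
    between the two positions. Returns None when the ratio is undefined.
    """
    n = len(peptides)
    if n == 0:
        return None
    i, j = pos_i - 1, pos_j - 1
    n_i = sum(p[i] == aa_i for p in peptides)
    n_j = sum(p[j] == aa_j for p in peptides)
    n_ij = sum(p[i] == aa_i and p[j] == aa_j for p in peptides)
    expected = n_i * n_j / n
    if expected == 0 or n_ij == 0:
        return None
    return math.log2(n_ij / expected)

# Made-up 9-mers in which R at P3 always co-occurs with E at P6.
toy_peptides = ["AARAAEAAA", "AARAAEAAA", "AAAAAAAAA", "AAAAAAAAA"]
print(pair_enrichment(toy_peptides, 3, "R", 6, "E"))
```

A positive value corresponds to a positive correlation in the sense used in the text (for example R3-E6), and a negative value to a negative one (for example A3-G5).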

Fig. 6.

Structural simulations reveal covariation between peptide sequence positions. (A and B) A shared pocket formed by side chains of the MHC α2-helix favors compensatory (large-small/small-large) changes at P3 and P5. (C) A network of hydrogen bonds contributes to positive correlation between R3 and E6. (D) Sequence preferences at P2 are altered due to a shift in the peptide backbone (cyan vs. green) induced by proline at P3, which disrupts a conserved peptide backbone hydrogen bond to Y99.

The second category—indirect correlations mediated by backbone conformation—is dominated by correlations involving glycine and proline, two amino acids with unique backbone conformational properties. We see a number of highly significant correlations between proline at P3, for example, and a second amino acid at P2: favorable correlations tend to involve larger amino acids (e.g., M2-P3, Fig. 6D), whereas unfavorable interactions involve smaller ones (A2-P3 and S2-P3). As shown in Fig. 6D, proline at P3 disrupts a conserved MHC-peptide backbone hydrogen bond and lifts the peptide backbone slightly up out of the binding pocket, which may be better tolerated by large amino acids at P2 than small ones. We also see a highly significant negative correlation between aspartate at P4 and proline at P5; examination of structural models suggests that P5 may disrupt a peptide backbone conformation preferred by D4; an increased beta-turn propensity for this diresidue may also interfere with binding. Positive correlations for M2-P3 (P = 0.015), and negative correlations for D4-P5 (P < 10⁻⁵) are seen in the experimental dataset of binding peptides. Not all observed correlations are supported by known binding sequences, however: We observe a very strong positive correlation between G1 and P2 in the structural simulations, whereas there is a significant negative correlation between these amino acids in the known binding sequences (P < 10⁻³). If we restrict to predicted correlations that occur across multiple alleles and for which the experimental P values meet a significance threshold (e.g., 0.1 or 0.01), we find that the predictions are supported 2–3 times as often as they are contradicted (binomial P values between 0.05 and 0.01, see SI Materials and Methods). We find this degree of agreement encouraging given the small and potentially biased nature of the pooled experimental dataset.

Discussion

MHC proteins are among the most polymorphic in the human genome, with the International Immunogenetics HLA database currently listing the sequences of more than 1,000 HLA-A and 1,500 HLA-B proteins (34). Experimental binding data, however, are lagging far behind and binding peptides have been reported for fewer than 150 proteins (6). As a result, significant computational efforts have been devoted in recent years to the development of peptide binding prediction methods (see refs. 6 and 7 for recent reviews) but their accuracy still depends, to a great extent, on the availability of a fairly large number of experimental measurements, ideally of unrelated peptides, for either the protein itself (6) or some “closely related” proteins (27). Notably, the most accurate predictors rely on convoluted machine-learning algorithms, thus, on the one hand, attesting to the complex nature of MHC-peptide binding, and, on the other hand, presenting no readily interpretable information to help investigate it.

Here, we have used special purpose molecular modeling simulations to predict, for a given MHC protein, the sequence of thousands of binding peptides, to each of which is associated an atomically detailed structural model and a predicted intermolecular binding energy. We have demonstrated that a rather simple binding model—a PFM—inferred from these simulations agrees well with available experimental binding data. In addition, the accompanying structural models suggest plausible—and testable—mechanistic explanations for the observed binding preferences, and for divergences in binding specificity between closely related proteins. Predictions can be made for proteins without three-dimensional structures and for proteins with little or no experimental binding data. As a result, these predictions are insensitive to heterogeneities in the available peptide binding datasets, making them well-suited to comparisons of peptide binding between different proteins and across the MHC family.

Our structure-based approach to binding specificity prediction also has significant limitations. The molecular modeling, although state-of-the-art, is still quite crude: The MHC backbone is held fixed throughout the simulations (the side chains near the peptide can rearrange); the force fields were parameterized for monomeric protein structure prediction and design, and likely could be substantially improved for modeling protein–peptide interactions; modeling simulations focused exclusively on binding of 9-mer peptides. As a result of these and other limitations, the specificity predictions and related inferences from these modeling simulations are sometimes inaccurate: Anchor-position preferences are mispredicted in some of the homology-model-based binding profiles, and these mispredictions can have a disastrous effect on binding predictions. The fact that self-template binding predictions for these targets are significantly more accurate illustrates the sensitivity of our binding landscapes to the backbone of the template MHC molecule, and suggests that incorporation of MHC backbone flexibility may lead to substantial improvements in accuracy.

Notwithstanding these significant limitations, we are optimistic that, given the large community of researchers working to improve molecular modeling force fields and sampling methods, the quality of simulation-derived inferences will continually improve. Indeed, an advantage of a large-scale modeling study such as this one, focused on a family of proteins with extensive experimental structural and binding data, is the ability to highlight current limitations in modeling methods and suggest avenues for improvement. As an example, analysis of an initial underprediction of beta-branched amino acids at the second anchor position (P9) revealed a systematic error in our modeling of rotamer preferences at C-terminal positions; fixing this error substantially improved binding prediction for MHC proteins with aliphatic preferences at this position. Understanding the apparent bias toward histidine in many of the structure-based PFMs (Fig. S2) may also reveal specific force field deficiencies, whose correction could further improve performance.

Our structure-based approach is highly complementary to traditional sequence-based predictors. Although the peptide binding prediction accuracies are on the whole lower, particularly for well-characterized proteins, the atomically detailed models allow the investigation of new aspects of MHC-peptide interactions. We have used these models to investigate the extent and mechanistic basis of pairwise correlations between peptide positions, and are currently examining the role of the peptide backbone conformation in determining specificity profiles. Structural modeling may also shed light on T-cell receptor (TCR) recognition of peptide-MHC complexes, given that the peptide’s structure, as well as its sequence, is being recognized; doing so could lead to an improved understanding of T-cell alloreactivity and its role in transplant outcome. By explicitly modeling the TCR-peptide-MHC ternary complex, it may be possible to rationalize and perhaps predict patterns of TCR cross-reactivity to different peptide-MHC complexes. Finally, a physicochemical approach is well suited to modeling MHC interactions with peptides that contain nonnatural or posttranslationally modified amino acids, for which limited binding data currently exists. MHC molecules have been shown to specifically recognize and prefer phosphorylated variants of certain peptides, suggesting a mechanism for immune surveillance of cells with deregulated phosphorylation, a hallmark of malignant transformation (35). As molecular modeling methods continue to improve, we expect that structure-based predictions will play an increasingly important role in the investigation of MHC-peptide interactions.

Materials and Methods

Our test set includes all HLA proteins with positional library data (21); proteins with a known high-resolution (< 3 Å) crystal structure and binding measurements, per the Immune Epitope Database (6), for 50 or more 9-amino-acid-long peptides; the single HLA-A and HLA-B proteins with the largest number of experimental measurements that currently lack a solved structure; and six pairs of HLA proteins differing at up to two sequence positions (11 HLA-A and 18 HLA-B proteins in total; see Table S1 for details).

Peptide Binding Landscape Construction.

To generate a peptide binding landscape for a given MHC protein, we perform 20,000 independent, flexible-backbone peptide docking simulations. These simulations differ from standard protein docking simulations, which seek to predict the bound structure of a complex from the structures of the individual components, in that the sequence of the peptide, as well as its internal structure and orientation relative to the MHC molecule, is optimized during the simulation. Each simulation consists of a low-resolution phase, during which an initial backbone model for the peptide is assembled, and a high-resolution phase, in which this starting backbone model is refined by a Monte Carlo simulation with an all-atom representation and an atomically detailed molecular mechanics force field. These simulations are specialized for the MHC protein family in several ways. A kinematic tree is defined on the MHC-peptide system that builds the peptide outward from the two canonical anchor positions (P2 and P9 for a peptide of length nine). During the low-resolution phase, the peptide orientation is sampled by docking moves that explore alternative placements for the anchor residues. During the high-resolution refinement phase, a network of conserved MHC-peptide-backbone hydrogen bonds is enforced through the use of energetic restraints (Fig. S6). The sequence of the peptide is randomized at the start of each simulation, held fixed throughout the low-resolution phase, and optimized during the high-resolution phase through sequence-mutation Monte Carlo moves. The low-resolution buildup generates conformational diversity, whereas the high-resolution refinement allows the assignment of meaningful all-atom binding energies to the final models. An unbound energy is subsequently assigned to each final peptide model by removing the MHC and conducting a short Monte Carlo relaxation.
All simulations were implemented in the Rosetta modeling package (36), and will be made freely available to academic users as part of the next public release (see SI Materials and Methods for simulation details).
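
The sequence-optimization step described above can be illustrated with a toy Metropolis Monte Carlo search over peptide sequences. This is only a minimal sketch and not the Rosetta protocol: it omits backbone sampling, docking moves, and hydrogen-bond restraints entirely. The function names, the `toy_energy` scoring function (which simply rewards hypothetical anchor preferences at P2 and P9), the step count, and the temperature are all assumptions introduced for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(seq):
    """Hypothetical stand-in for an all-atom binding energy.

    Rewards assumed anchor preferences at P2 (index 1) and P9 (index 8);
    a real force field would score the full atomic model instead.
    """
    energy = 0.0
    if seq[1] in "LM":
        energy -= 2.0
    if seq[8] in "VL":
        energy -= 2.0
    return energy


def metropolis_sequence_search(energy_fn, length=9, n_steps=2000, kT=1.0, seed=0):
    """Randomize a peptide sequence, then optimize it with single-site
    mutation moves accepted by the Metropolis criterion."""
    rng = random.Random(seed)
    # The sequence is randomized at the start of the simulation.
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
    energy = energy_fn(seq)
    best_seq, best_energy = list(seq), energy
    for _ in range(n_steps):
        pos = rng.randrange(length)
        old_residue = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)  # propose a point mutation
        new_energy = energy_fn(seq)
        delta = new_energy - energy
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-delta / kT).
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            energy = new_energy
            if energy < best_energy:
                best_seq, best_energy = list(seq), energy
        else:
            seq[pos] = old_residue  # reject: restore the previous residue
    return "".join(best_seq), best_energy


if __name__ == "__main__":
    peptide, energy = metropolis_sequence_search(toy_energy)
    print(peptide, energy)
```

Repeating such a search many times from independent random starting sequences, as the 20,000 independent simulations do, yields an ensemble of low-energy sequences rather than a single optimum, which is what makes a landscape-level analysis possible.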

Peptide Binding Landscape Analysis.

The binding energy contribution for a given amino acid at a fixed peptide position (that is, an entry in a PBEM) is computed by taking the difference between the 10th percentiles of the bound and unbound energy for the ensemble of peptides containing that amino acid at the peptide position (for details, see SI Materials and Methods; all PBEMs are provided in Dataset S1). To construct a PFM, these energy contributions are transformed to probabilities using the Boltzmann distribution with a kT value of 1. PFMs are compared by computing per-position Jensen–Shannon probability divergences and averaging these divergences over all peptide positions to give a single total divergence for clustering purposes. The significance of average and per-position divergences between predicted and experimental PFMs was assessed by calculating divergences for 100,000 random PFMs whose columns had information content distributions matching the experimental data. Peptide-repertoire diversity is defined as the average pairwise Hamming distance for the 1% of lowest energy binders. Pairwise correlations are assessed by computing P values for 2 × 2 contingency tables reflecting co-occurrence of the given amino acid pair in the corresponding peptide set (further details are given in SI Materials and Methods).
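
The main quantities defined in this paragraph can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the analysis code used in the study: all function names are hypothetical, the percentile is computed with linear interpolation (the text does not specify the convention), the Jensen–Shannon divergence uses base-2 logarithms (also an assumption), and energies are taken to be in units where kT = 1. Selecting the 1% of lowest energy binders is left to the caller.

```python
import math
from itertools import combinations


def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between sorted values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list is undefined")
    rank = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    frac = rank - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def pbem_entry(bound_energies, unbound_energies, q=10):
    """PBEM entry: q-th percentile of bound minus q-th percentile of unbound
    energies, for peptides carrying one amino acid at one position."""
    return percentile(bound_energies, q) - percentile(unbound_energies, q)


def boltzmann_column(energies, kT=1.0):
    """Turn one PBEM column {amino acid: energy} into a PFM column."""
    e_min = min(energies.values())  # shift energies for numerical stability
    weights = {aa: math.exp(-(e - e_min) / kT) for aa, e in energies.items()}
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) of two distributions on the same keys."""
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)
    m = {k: 0.5 * (p[k] + q[k]) for k in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pfm_divergence(pfm_a, pfm_b):
    """Average of the per-position JS divergences between two PFMs,
    each given as a list of columns (one dict per peptide position)."""
    per_position = [js_divergence(a, b) for a, b in zip(pfm_a, pfm_b)]
    return sum(per_position) / len(per_position)


def repertoire_diversity(peptides):
    """Average pairwise Hamming distance over equal-length peptides."""
    pairs = list(combinations(peptides, 2))
    total = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return total / len(pairs)
```

With kT = 1, an amino acid whose energy contribution is lower by ln 3 than another's ends up three times as probable in the resulting PFM column, which gives a sense of how sharply the Boltzmann transform weights favorable residues.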

Acknowledgments.

We thank the members of the Rosetta development community for their many contributions to the software used in this research and gratefully acknowledge superlative computing support from the Hutchinson Center, with special thanks to Jeffrey Katcher, Carl Benson, and Dirk Petersen. P.B. was supported by Center new development funding and a Searle Scholars Fellowship.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1018165108/-/DCSupplemental.

References

1. Mungall AJ, et al. The DNA sequence and analysis of human chromosome 6. Nature. 2003;425:805–811. doi: 10.1038/nature02055.
2. Phillips EJ, Mallal SA. HLA and drug-induced toxicity. Curr Opin Mol Ther. 2009;11:231–242.
3. Fernando MMA, et al. Defining the role of the MHC in autoimmunity: A review and pooled analysis. PLoS Genet. 2008;4:e1000024. doi: 10.1371/journal.pgen.1000024.
4. Garrido F, et al. Implications for immunosurveillance of altered HLA class I phenotypes in human tumours. Immunol Today. 1997;18:89–95. doi: 10.1016/s0167-5699(96)10075-x.
5. Petersdorf EW. HLA matching in allogeneic stem cell transplantation. Curr Opin Hematol. 2004;11:386–391. doi: 10.1097/01.moh.0000143701.88042.d9.
6. Peters B, et al. A community resource benchmarking predictions of peptide binding to MHC-I molecules. PLoS Comput Biol. 2006;2:e65. doi: 10.1371/journal.pcbi.0020065.
7. Lin HH, Ray S, Tongchusak S, Reinherz EL, Brusic V. Evaluation of MHC class I peptide binding prediction servers: Applications for vaccine research. BMC Immunol. 2008;9:8. doi: 10.1186/1471-2172-9-8.
8. Hertz T, Yanover C. Pepdist: A new framework for protein-peptide binding prediction based on learning peptide distance functions. BMC Bioinformatics. 2006;7:S3. doi: 10.1186/1471-2105-7-S1-S3.
9. Heckerman D, Kadie C, Listgarten J. Leveraging information across HLA alleles/supertypes improves epitope prediction. J Comput Biol. 2007;14:736–746. doi: 10.1089/cmb.2007.R013.
10. Zhang H, Lundegaard C, Nielsen M. Pan-specific MHC class I predictors: A benchmark of HLA class I pan-specific prediction methods. Bioinformatics. 2009;25:83–89. doi: 10.1093/bioinformatics/btn579.
11. Bjorkman PJ, et al. Structure of the human class I histocompatibility antigen, HLA-A2. Nature. 1987;329:506–512. doi: 10.1038/329506a0.
12. Rosenfeld R, Zheng Q, Vajda S, DeLisi C. Computing the structure of bound peptides. Application to antigen recognition by class I major histocompatibility complex receptors. J Mol Biol. 1993;234:515–521. doi: 10.1006/jmbi.1993.1607.
13. Tong JC, Tan TW, Ranganathan S. Modeling the structure of bound peptide ligands to major histocompatibility complex. Protein Sci. 2004;13:2523–2532. doi: 10.1110/ps.04631204.
14. Sezerman U, Vajda S, DeLisi C. Free energy mapping of class I MHC molecules and structural determination of bound peptides. Protein Sci. 1996;5:1272–1281. doi: 10.1002/pro.5560050706.
15. Bordner AJ, Abagyan R. Ab initio prediction of peptide-MHC binding geometry for diverse class I MHC allotypes. Proteins. 2006;63:512–526. doi: 10.1002/prot.20831.
16. Schueler-Furman O, Altuvia Y, Sette A, Margalit H. Structure-based prediction of binding peptides to MHC class I molecules: Application to a broad range of MHC alleles. Protein Sci. 2000;9:1838–1846. doi: 10.1110/ps.9.9.1838.
17. Jojic N, Reyes-Gomez M, Heckerman D, Kadie C, Schueler-Furman O. Learning MHC I-peptide binding. Bioinformatics. 2006;22:e227–235. doi: 10.1093/bioinformatics/btl255.
18. Antes I, Siu SWI, Lengauer T. DynaPred: A structure and sequence based method for the prediction of MHC class I binding peptide sequences and conformations. Bioinformatics. 2006;22:e16–24. doi: 10.1093/bioinformatics/btl216.
19. Knapp B, Omasits U, Frantal S, Schreiner W. A critical cross-validation of high throughput structural binding prediction methods for pMHC. J Comput Aided Mol Des. 2009;23:301–307. doi: 10.1007/s10822-009-9259-2.
20. Schiewe AJ, Haworth IS. Structure-based prediction of MHC-peptide association: Algorithm comparison and application to cancer vaccine design. J Mol Graph Model. 2007;26:667–675. doi: 10.1016/j.jmgm.2007.03.017.
21. Sidney J, et al. Quantitative peptide binding motifs for 19 human and mouse MHC class I molecules derived using positional scanning combinatorial peptide libraries. Immunome Res. 2008;4:2. doi: 10.1186/1745-7580-4-2.
22. Escobar H, et al. Large scale mass spectrometric profiling of peptides eluted from HLA molecules reveals N-terminal-extended peptide motifs. J Immunol. 2008;181:4874–4882. doi: 10.4049/jimmunol.181.7.4874.
23. Elamin NE, Bade-Döding C, Blasczyk R, Eiz-Vesper B. Polymorphism between HLA-A*0301 and A*0302 located outside the pocket F alters the POmega peptide motif. Tissue Antigens. 2010;76:487–490. doi: 10.1111/j.1399-0039.2010.01547.x.
24. Bade-Döding C, et al. A single amino-acid polymorphism in pocket A of HLA-A*6602 alters the auxiliary anchors compared with HLA-A*6601 ligands. Immunogenetics. 2004;56:83–88. doi: 10.1007/s00251-004-0677-y.
25. Yanover C, et al. How do amino acid mismatches affect the outcome of hematopoietic cell transplants? A structural perspective. Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology. New York: Association for Computing Machinery; 2010. pp. 627–633.
26. Hertz T, Yanover C. Identifying HLA supertypes by learning distance functions. Bioinformatics. 2007;23:e148–e155. doi: 10.1093/bioinformatics/btl324.
27. Nielsen M, et al. NetMHCpan, a method for quantitative predictions of peptide binding to any HLA-A and -B locus protein of known sequence. PLoS ONE. 2007;2:e796. doi: 10.1371/journal.pone.0000796.
28. Sidney J, Peters B, Frahm N, Brander C, Sette A. HLA class I supertypes: A revised and updated classification. BMC Immunol. 2008;9:1. doi: 10.1186/1471-2172-9-1.
29. Rao X, Costa AICAF, van Baarle D, Kesmir C. A comparative study of HLA binding affinity and ligand diversity: Implications for generating immunodominant CD8+ T cell responses. J Immunol. 2009;182:1526–1532. doi: 10.4049/jimmunol.182.3.1526.
30. Thammavongsa V, et al. Assembly and intracellular trafficking of HLA-B*3501 and HLA-B*3503. Immunogenetics. 2009;61:703–716. doi: 10.1007/s00251-009-0399-2.
31. Kosmrlj A, et al. Effects of thymic selection of the T-cell repertoire on HLA class I-associated control of HIV infection. Nature. 2010;465:350–354. doi: 10.1038/nature08997.
32. Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: A sequence logo generator. Genome Res. 2004;14:1188–1190. doi: 10.1101/gr.849004.
33. Peters B, Tong W, Sidney J, Sette A, Weng Z. Examining the independent binding assumption for binding of peptide epitopes to MHC-I molecules. Bioinformatics. 2003;19:1765–1772. doi: 10.1093/bioinformatics/btg247.
34. Robinson J, et al. IMGT/HLA and IMGT/MHC: Sequence databases for the study of the major histocompatibility complex. Nucleic Acids Res. 2003;31:311–314. doi: 10.1093/nar/gkg070.
35. Mohammed F, et al. Phosphorylation-dependent interaction between antigenic peptides and MHC class I: A molecular basis for the presentation of transformed self. Nat Immunol. 2008;9:1236–1243. doi: 10.1038/ni.1660.
36. Leaver-Fay A, et al. ROSETTA3: An object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol. 2011;487:545–574. doi: 10.1016/B978-0-12-381270-4.00019-6.
