Protein Science : A Publication of the Protein Society. 2020 Oct 5;29(11):2259–2273. doi: 10.1002/pro.3958

Learning peptide recognition rules for a low‐specificity protein

Lucas C Wheeler, Arden Perkins, Caitlyn E Wong, Michael J Harms
PMCID: PMC7586891  PMID: 32979254

Abstract

Many proteins interact with short linear regions of target proteins. For some proteins, however, it is difficult to identify a well‐defined sequence motif that describes their target peptides. To overcome this difficulty, we used supervised machine learning to train a model that treats each peptide as a collection of easily‐calculated biochemical features rather than as an amino acid sequence. As a test case, we dissected the peptide‐recognition rules for human S100A5 (hA5), a low‐specificity calcium binding protein. We trained a Random Forest model against a recently released, high‐throughput phage display dataset collected for hA5. The model identifies hydrophobicity and shape complementarity, rather than polar contacts, as the primary determinants of peptide binding specificity in hA5. We tested this hypothesis by solving a crystal structure of hA5 and through computational docking studies of diverse peptides onto hA5. These structural studies revealed that peptides exhibit multiple binding modes at the hA5 peptide interface, all of which have few polar contacts with hA5. Finally, we used our trained model to predict new, plausible binding targets in the human proteome. This revealed a fragment of the protein α‐1‐syntrophin that binds to hA5. Our work advances understanding of the biochemistry and biology of hA5 and demonstrates how high‐throughput experiments coupled with machine learning of biochemical features can reveal the determinants of binding specificity in low‐specificity proteins.

Keywords: binding specificity, hydrophobicity, machine learning, peptides, S100 proteins, X‐ray crystallography

PDB Code(s): 6WN7

1. INTRODUCTION

Up to 40% of protein–protein interactions are mediated by globular domains that recognize a short, linear fragment of their interaction partner. 1 , 2 Such protein‐peptide interactions play key roles in processes ranging from signaling networks to biological phase transitions. 2 , 3 Understanding such systems therefore requires knowing which proteins recognize which peptides under what conditions. 2 , 4

Protein‐peptide interaction interfaces exhibit a wide range of specificity. For some proteins, one can describe specificity using a simple binding motif that encodes the amino acid(s) recognized at each site in the peptide. 5 , 6 One can predict protein targets by searching for matching sequences within the proteome. 6 Some proteins deviate from this highly specific paradigm, requiring more sophisticated approaches. 7 For example, many PDZ binding domains exhibit binding “multi‐specificity”, in which peptide preference must be represented as a handful of binding motifs. 8 , 9 Predicting interaction targets for such proteins is more difficult than for proteins with single binding motifs, but the same basic logic applies: search the proteome for sequences that match the binding motifs.

Even more extreme cases exist, such as S100 proteins. Members of this family of homodimeric calcium‐activated signaling proteins play important roles in a wide range of critical cellular processes such as innate immunity, cell‐cycle regulation, and inflammatory signaling. 10 , 11 S100s bind to short, linear peptide regions of target proteins with micromolar affinity, modulating their activity (Figure 1). 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 Peptides can interact with a single monomer in the homodimer (Figure 1b,c), or can interact across the homodimer interface (Figure 1d–f). Defining the peptide recognition rules of S100 proteins has, however, proven extremely difficult, as target peptides lack sufficient similarity to be usefully represented as a binding motif. 19 , 20 That said, the low specificity of the S100 protein family does not equate to no specificity. Specific targets within the highly‐variable sets of S100 binding partners appear to be evolutionarily conserved over hundreds of millions of years, even as the interface acquired mutations. 19 We therefore describe S100 proteins as “low specificity”: they use a single site to bind, with similar affinity, to peptides with very different sequences. (This definition does not require that peptide affinity be low; however, we expect that the maximum affinity of a low‐specificity protein for a given peptide will be lower than that of a high‐specificity protein that is optimized to interact with that peptide.)

FIGURE 1.

S100 proteins interact with peptides in a canonical peptide binding interface in a calcium‐dependent manner. (a) Overlay of S100:peptide structures shown in panels (b)–(f). The two chains of the homodimer are shown in black and white. The bound calcium ions are shown as blue spheres. The peptide binding surfaces are highlighted with arrows. Panels (b)–(f) show five different S100:peptide pairs. The S100 chains are shown as surfaces and the peptides are shown as tubes, colored as a rainbow from N‐ to C‐terminal. (b) S100B interacting with TRTK12 peptide (PDB ID: 3IQQ); (c) S100A1 interacting with TRTK12 peptide (PDB ID: 2KBM); (d) S100A11 interacting with Annexin I N‐terminus (PDB ID: 1QLS); (e) S100A4 interacting with myosin‐IIA peptide (PDB ID: 3ZWH); and (f) S100A6 interacting with Siah‐1 interacting peptide (PDB ID: 2JTT). To better show the peptide binding interface, this structure is rotated 90° relative to the other structures, as indicated.

Here we endeavor to dissect the specificity of a representative S100 protein: human S100A5 (hA5). This protein likely plays a signaling role in olfaction. 21 , 22 It interacts with a diverse set of peptide targets with no obvious sequence motif. 18 , 19 , 20 Instead of representing hA5's peptide specificity with a binding motif, we represented it using readily calculated biochemical properties of each peptide. We used machine‐learning to train a model against a recently released quantitative phage display dataset, 23 finding that we were able to reproduce the phage display data and several measured peptide binding interactions formed by hA5. We then used this model both to understand what biochemical features hA5 recognizes and to predict new binding targets. Our results demonstrate that it is possible to gain insights into the rules that define binding specificity, even for proteins with extremely low specificity. The software we developed for this purpose—HOPS: Hunches from Oregon about Peptide Specificity—is available for download (https://github.com/harmslab/hops).

2. RESULTS

2.1. Peptide sequence is insufficient to describe specificity

Previously, we collected a high‐throughput, phage‐display dataset for hA5 interacting with ≈40,000 random 12‐mer peptides. 23 We did two parallel panning experiments: one in the absence of competitor (Figure 2a), the other in the presence of a competitor peptide that binds at the site of interest (Figure 2b). We then used high‐throughput sequencing to quantify the frequencies of peptides in both the “conventional” and “competitor” samples (f_conventional and f_competitor, respectively). Peptides that bind at the site of interest are depleted preferentially in the competitor sample. This can be quantified by E = ln(f_competitor/f_conventional), meaning that E < 0 corresponds to a peptide that binds at the site of interest. We found that peptides with E < −1.37, corresponding to a four‐fold decrease in frequency with the addition of competitor, could be distinguished from zero with a false‐discovery rate of 0.05. Throughout our analysis, we therefore use a cutoff of E = −1.37 for peptides that bind.
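The enrichment score defined above can be sketched in a few lines of Python. This is an illustration only: the function names and the example frequencies are hypothetical, and the false‐discovery‐rate analysis that produced the cutoff is not reproduced here.

```python
import math

E_CUTOFF = -1.37  # cutoff quoted in the text


def enrichment(f_competitor, f_conventional):
    """Return E = ln(f_competitor / f_conventional) for one peptide."""
    if f_competitor <= 0 or f_conventional <= 0:
        raise ValueError("frequencies must be positive")
    return math.log(f_competitor / f_conventional)


def binds_site(e_value, cutoff=E_CUTOFF):
    """True if the peptide is depleted by competitor beyond the cutoff."""
    return e_value < cutoff


# Hypothetical frequencies: the peptide is five-fold depleted by competitor.
e = enrichment(2e-5, 1e-4)
print(round(e, 3), binds_site(e))
```

Note that exp(−1.37) is roughly 0.25, which is why the cutoff corresponds to an approximately four‐fold drop in frequency.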

FIGURE 2.

Interacting peptides can be identified using phage display. (a, b) Rows show two different experiments, done in parallel, for each protein. Biotinylated, Ca2+‐loaded hA5 is added to a population of phage either alone (row a) or in the presence of saturating competitor peptide (row b). Phage that bind to the protein (blue or purple) are pulled down using a streptavidin plate. Bound phage are then eluted using EDTA, which disrupts the peptide binding interface. In the absence of competitor (row a), phage bind adventitiously (purple) as well as at the interface of interest (blue). In the presence of competitor (row b), only adventitious binders are present. (c) Sequence logo for all peptides in the phage display dataset for which E < −1.37. Each position is highly variable in the position‐weight‐matrix. (d) Frequency sequence logos representing three of the 28 peptide clusters identified using DBSCAN.

We first sought to determine if sequence‐level rules were sufficient to describe the specificity of peptides enriched by hA5 in the phage display enrichment experiment. We took the 3,574 unique peptide sequences with E < −1.37 to calculate a position‐specific weight‐matrix. 24 This approach revealed extreme variability across positions in the peptide (Figure 2c).
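To make the position‐specific analysis concrete, the sketch below computes per‐position amino acid frequencies for a set of equal‐length peptides. It is the simplest form of such a matrix; any weighting or background correction used in the cited method is not reproduced, and the function name and toy sequences are hypothetical.

```python
from collections import Counter


def position_frequency_matrix(peptides):
    """Per-position amino acid frequencies for equal-length peptides."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    matrix = []
    for i in range(length):
        counts = Counter(p[i] for p in peptides)
        total = sum(counts.values())
        matrix.append({aa: n / total for aa, n in counts.items()})
    return matrix


# Toy 3-mers standing in for the enriched 12-mers.
toy = ["ACD", "ACE", "GCD", "ACD"]
pfm = position_frequency_matrix(toy)
for position, freqs in enumerate(pfm, start=1):
    print(position, freqs)
```

A position dominated by one residue (like position 2 in the toy data) would indicate a motif; the text reports that no position in the hA5 data behaves this way.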

We next used a clustering approach to identify a set of motifs representing a profile of “multi‐specificity” for hA5. Similar approaches have worked for other multi‐specific proteins, identifying a small set of motifs sufficient to describe binding preferences. 9 , 25 We clustered enriched peptides by Damerau‐Levenshtein distance using the DBSCAN algorithm. 26 , 27 , 28 We then generated position‐specific‐weight‐matrices for each cluster. Only 0.4% of peptides were placed in clusters; the remainder were separated into singleton clusters. The resulting clusters were highly diverse. Figure 2d shows three of the identified clusters: there is little sequence commonality between the three clusters. This extreme “multi‐specificity” of hA5 extends beyond a small set of motifs easily represented by position‐weight matrices. Thus, we concluded that simple sequence‐based rules were insufficient to identify the key determinants of specificity in hA5 or to construct a predictive model for binding partners.
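The clustering step can be illustrated with a small, dependency‐free sketch. It uses the optimal string alignment variant of the Damerau‐Levenshtein distance and a minimal DBSCAN written for clarity rather than speed. The `eps` and `min_samples` values and the toy peptides are assumptions chosen for illustration; the settings used in the study are not given in this section.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def dbscan(items, eps, min_samples, dist):
    """Minimal DBSCAN; returns one label per item, -1 meaning noise."""
    n = len(items)
    neighbors = [[j for j in range(n) if dist(items[i], items[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # expand from core points only
    return labels


# Toy peptides: three near-identical sequences and two unrelated ones.
peptides = ["ACDEFG", "ACDEGF", "ACDEFH", "WWWWWW", "KLMNPQ"]
labels = dbscan(peptides, eps=2, min_samples=2, dist=osa_distance)
print(labels)
```

In this toy run the three related sequences form one cluster and the other two are labeled as noise, mirroring on a small scale the outcome described above, where most peptides had no close neighbors.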

2.2. Supervised machine learning can be used to train a predictive model of enrichment

We sought an alternate approach to simple sequence‐based metrics. Inspired by the literature on characteristic biochemical properties of intrinsically‐disordered proteins, 29 as well as previous machine‐learning approaches applied to biochemical interactions, 7 , 30 , 31 we hypothesized that the biochemical features of peptide targets could be used to construct a predictive model of peptide enrichment. For each peptide sequence identified in the phage display dataset, we calculated a set of 57 features covering an array of biochemical properties. We included properties such as hydrophobicity, Chou‐Fasman α, accessible surface area, isoelectric point, and net charge (a full list of all predicted features is available in Table S1). We calculated each feature in sliding windows across the peptide sequence, resulting in a final set of 4,446 features for each individual peptide (Figure 3a).
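The text does not spell out the window sizes. One scheme that reproduces the stated total is to take every contiguous window of every length from 1 to 12: a 12‐mer has 78 such windows, and 78 × 57 = 4,446. The sketch below shows that arithmetic; the windowing scheme is an inference from the numbers, not a description of the HOPS internals, and the function name is hypothetical.

```python
def sliding_windows(peptide):
    """All contiguous windows of every length, as (start, size, subsequence)."""
    n = len(peptide)
    return [(start, size, peptide[start:start + size])
            for size in range(1, n + 1)
            for start in range(n - size + 1)]


N_BASE_FEATURES = 57  # number of base features stated in the text

windows = sliding_windows("RRLLFYKYVYKR")  # NCX1 12-mer from Table 1
n_features = len(windows) * N_BASE_FEATURES
print(len(windows), n_features)
```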

FIGURE 3.

Machine learning model predicts phage display enrichment. (a) Diagram of the process for training the machine‐learning model. Peptides are broken into sliding windows and a set of predicted biochemical features is calculated for each window. These are the features used in the machine‐learning model. (b) We found the best model input parameters using cross‐validation. Pairs of bars represent the average R²_train (blue) or R²_test (orange) for 10‐fold cross‐validation replicates of the data using the model parameters below. Squares indicate whether the feature was used in the model (filled) or not (empty). “Window”: whether sliding windows were used. “HOPS” and “CIDER” features are listed in Table S1. “Num. estimators” is the number of estimators included in the Random Forest. The R²_train and R²_test are indicated for the chosen model. (c) Points are individual peptides. The red line is a linear regression between the predicted E and measured E for each peptide in the test set. The dashed blue line indicates the threshold below which we can measure enrichment (E = −1.37). (d) ROC curve for classifying peptides as above or below the E cutoff. The area under the curve is shown on the plot. (e) Heat map shows the contribution of each site (position 1–12) and aggregated chemical feature (top‐to‐bottom) to the final model. Color indicates relative contribution from red (strong) to white (no contribution). The marginal contribution of each chemical feature is shown to the right of the plot. Table S1 describes which chemical features went into which aggregate bins.

We next used machine learning to train a model to reproduce our phage display E values using these features as inputs. Prior to training the model, we withheld 10% of the data to use as a test set. In pilot analyses, we tried several methods including Random Forest Regression, 32 Support Vector Machines, 33 and a Gaussian Process Model. 34 We found the Random Forest out‐performed the other approaches. We then optimized the nuisance parameters in our model (which features to include, whether to apply a sliding window, and the number of estimators) using k‐fold cross‐validation. The best model we identified used sliding windows, included the full set of 57 base features, and used 30 estimators (Figure 3b). We then trained our model against the complete training set and measured its predictive power using the previously withheld 10% test set. This yielded a final R²_train of 0.973. Overall, the model reproduced the test set phage display E values well, giving an R²_test of 0.867 (Figure 3c). There was a systematic deviation between the predicted and measured values of E for the highest and lowest values: the slope between E predicted and E measured was 1.16 rather than 1.00 (Figure 3c). This suggests that there are features important for the highest and lowest E values not fully captured by our model.
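The evaluation scheme described above (k‐fold splits for tuning, R² for scoring) can be sketched without any machine‐learning library. The Random Forest itself is omitted; the helper names are hypothetical, and the R² shown is the usual coefficient of determination, which is assumed rather than confirmed to be the definition used in the study.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is in one test fold."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Toy check: perfect predictions score 1; predicting the mean scores 0.
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

In use, a model would be fit on each `train_idx` subset and scored with `r_squared` on the matching `test_idx` subset, and the scores averaged across folds for each candidate parameter set.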

We also tested the utility of using the model to predict whether E for a peptide was expected to be above or below the E cutoff of −1.37. We calculated a Receiver Operating Characteristic (ROC) curve for the classifier, plotting the true positive rate against the false positive rate. This yielded a highly predictive model, with an area under the curve of 98.9%. This gives us high confidence in using this model to classify enriching peptides.
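As an illustration of the classification test, the area under a ROC curve equals the probability that a randomly chosen true binder is ranked above a randomly chosen non‐binder. The sketch below computes it that way for a handful of made‐up E values; the data and the function name are hypothetical.

```python
def roc_auc(is_positive, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Higher scores should indicate the positive class.
    """
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up measured and predicted E values for six peptides.
measured = [-3.0, -2.0, -1.5, -0.5, 0.2, 1.0]
predicted = [-2.5, -1.0, -1.6, -1.2, 0.1, 0.8]

labels = [e < -1.37 for e in measured]  # binders by the measured cutoff
scores = [-e for e in predicted]        # lower predicted E = stronger call
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Negating the predicted E is a design choice so that a larger score means "more likely to bind", which is the orientation the pairwise definition expects.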

We validated our model by reproducing the known binding of four S100 peptide targets that we had previously studied using isothermal titration calorimetry (ITC). 19 Of these, three bind to hA5 and one does not. We used the model to predict whether the known peptides would bind to hA5. For peptides longer than 12 amino acids, we calculated the score for all possible contiguous 12‐mers and took the best score as our predicted E value. The model predicted binding of the peptides NCX1, A5cons, and A6cons, while predicting that the SIP peptide would not bind (Table 1). These predictions accurately recapitulate the known pattern of binding for all four peptides.
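The rule for peptides longer than 12 residues can be sketched as follows. The scoring function passed in stands for the trained model and is a placeholder here; "best" is taken to mean the lowest (most negative) predicted E, which is an assumption consistent with the cutoff convention in the text.

```python
def best_window_score(peptide, score_12mer, width=12):
    """Score every contiguous window and return (lowest E, window)."""
    if len(peptide) < width:
        raise ValueError("peptide is shorter than the window width")
    windows = [peptide[i:i + width] for i in range(len(peptide) - width + 1)]
    return min((score_12mer(w), w) for w in windows)


def toy_score(window):
    """Hypothetical stand-in for the model: more tryptophan, lower E."""
    return -window.count("W")


# A 13-residue toy peptide has two 12-mer windows; only one contains W.
print(best_window_score("A" * 12 + "W", toy_score))
```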

TABLE 1.

Dissociation constants and model predictions for known peptide targets

Peptide | Sequence | KD (μM) | Epred | Predicted to bind (E < −1.37)?
NCX1 | RRLLFYKYVYKR | 18 | −1.68 | Yes
SIP | SEGLMNVLKKIYEDG | >100 | −0.43 | No
A5cons | rshsSSFQDWLLSRLPgggsae | 3 | −2.43 | Yes
A6cons | rshsGFDWRWGMEALTgggsae | 3 | −1.39 | Yes

Note: Data for the known target peptides used in our previous study. 19 NCX1 and SIP are fragments of human proteins. A5cons and A6cons were identified as consensus sequences from an earlier phage display experiment. The lowercase flanking sequences “rshs” and “gggsae” come from the M13 phage coat protein. KD and predicted E value (Epred) are shown. The statistically significant E cutoff for hA5 is −1.37.

2.3. Model classifies peptides based on hydrophobicity and shape complementarity

We next asked what aspects of the peptides were recognized by the trained model. We quantified the contribution of each feature and peptide position to the predicted E as measured by node impurity (see methods). We found that no one feature or peptide position dominated the prediction (Figure 3e). To summarize the data, we pooled features based on their chemical similarity. We pooled side chain volume and beta‐chain knob propensity, 35 , 36 along with a variety of other terms, into “geometry”; we pooled eleven different hydrophobicity scales into “hydrophobicity”; and we pooled predicted charge and number of hydrogen bonds into “polar” interactions. A full list of the individual features and their bins is given in Table S1.

We plotted the relative contribution of each property as a function of peptide position (Figure 3e). Each site contributed almost equally to the predicted enrichment (Figure 3e). Meanwhile, different molecular properties had radically different contribution levels. Geometry, hydropathy, and secondary structure propensity dominated the predictive power of the model. In contrast, polar contacts—often a strong determinant of specificity—had almost no predictive power (Figure 3e). These results are consistent with binding being determined by shape complementarity and hydrophobic interactions at the interface.

2.4. A high resolution crystal structure reveals binding site interaction variability in hA5

To test the hypothesis that peptide binding was determined by shape complementarity and hydrophobicity, we sought to co‐crystallize hA5 with a bound peptide. Unfortunately, we were unable to do so either by growing hA5 crystals in the presence of peptides or by soaking peptide targets into crystals grown in the absence of peptide. We did, however, discover a novel crystal form of calcium‐bound hA5 in the absence of peptide. Despite being twinned and requiring special care during structure solution and refinement, this crystal diffracted to 1.25 Å (the highest resolution crystal structure so far determined for hA5) and provided a clear view of the binding site and new insights into binding interactions (PDB: 6WN7, Figure 4a, Table 2).

FIGURE 4.

Crystal structure of hA5 reveals variability of the peptide interaction surface. (a) The unit cell of the hA5 crystal structure showing the packing of the asymmetric unit, which contains three homodimers (white/dark gray surface). Crystallographic symmetry mates that occupy the peptide‐binding site in three distinct conformations are shown as ribbon (orange, pink, and light pink). Unit cell axes are labeled in gray. (b) Overlay of all calcium‐bound structures of hA5: 1.25 Å crystal structure from this study (white/dark gray, 6 chains), a 2.60 Å crystal structure (PDB: 4dir, teal/blue, 2 chains), and an NMR solution structure (PDB: 2kay, yellow/olive, 2 chains). (c) The homodimer containing chains E and F (dark gray, white) with the regions of crystal symmetry mates that occupy the peptide‐binding site shown as sticks. (d, e) Electron density showing the binding site occupied by Met1 and Leu44, from separate chains, or Leu88 and Phe86 from the same chain. 2Fo‐Fc density shown as blue mesh at 1.5 σ.

TABLE 2.

Crystallography data collection and refinement statistics

Data collection PDB code: 6WN7
Structure Human S100A5 C43S, C79S
Space group P32
Unit cell a, b, c (Å), α, β, γ (°) 76.3, 76.3, 84.2, 90.0, 90.0, 120.0
Resolution (Å) 51.98–1.25(1.32‐)a
Completeness (%) 99.8 (94.6)
Unique reflections 151,437 (4561)
Multiplicity 10.6 (8.4)
Rmeas (%) 8.1 (115.8)
CC1/2 1.0 (0.48)
⟨I/σ⟩ 16.2 (2.1)
⟨I/σ⟩ ∼ 2.0 (Å)b 1.24
⟨mosaicity⟩ 0.34
Refinement
Resolution range (Å) 26.0–1.25
R‐factor (%) 17.0c
R‐free (%) 20.6c
Protein residues 536
Ca2+ 18
Water molecules 401
RMSD lengths (Å) 0.012
RMSD angles (°) 1.2
Ramachandran plot c
φ,ψ‐preferred (%) 98.26
φ,ψ‐allowed (%) 1.55
φ,ψ‐outliers (%) 0.19
B‐factors
⟨protein atoms⟩ (Å2) 15
⟨waters⟩ (Å2) 22

a Resolution cutoff was applied using CC1/2 > 0.3.

b Resolution at which ⟨I/σ⟩ falls to 2.0.

c Data is twinned and refined with a twin law of ‐k, −h, −l and twinning fraction of 0.44. Eighty‐fourth percentile all‐atom Molprobity clashscore for structures at comparable resolutions. 59 Refined with one TLS group per protein chain.

The crystal asymmetric unit contains three homodimers (chains AB, CD, and EF) with similar global structure (0.98–1.78 Å RMSD across all Cα, depending on the chains compared), but that form different interactions with one another and crystal symmetry mates (Figure 4a). An overlay shows good agreement for individual chains from a previously‐solved 2.6 Å crystal structure (4DIR, RMSD of 0.97 Å) and an NMR structure (2KAY, RMSD of 1.39 Å), with the regions of highest variation being Ser43‐Glu49 and the C‐terminal helix (Figure 4b). The positions of the homodimer partner chains also show variability, with shifts of 1–3 Å between structures (Figure 4c).

For each homodimer, the two peptide binding sites form two distinct interactions with crystal symmetry mates (Figure 4c). Two largely hydrophobic regions of the binding site coordinate nonspecific binding of bulky hydrophobic side chains, with the two pockets occupied by either (a) a Met and Leu from different crystal symmetry mate chains (Figure 4d), or (b) a Leu and Phe from the same crystal symmetry mate chain (Figure 4e). Thus, despite the absence of a peptide ligand, all of the peptide binding sites are in fact occupied, likely explaining why our attempts at diffusing peptides into the crystal were unsuccessful. The ability of the binding site to accommodate a variety of peptide ligands is demonstrated by two observations: binding can proceed through occupation of either one or both hydrophobic pockets, and the two pockets accommodate either Met and Leu or Leu and Phe, with few other interactions stabilizing the association. In fact, no intermolecular hydrogen bonds were formed at these interfaces. Across these binding modes, the binding site itself changed little, as evidenced by the similarity between different chains within the asymmetric unit (Figure 4b). This perhaps explains the importance of shape complementarity, and the low importance of hydrogen bond interactions, in our binding prediction model.

2.5. Docking models reveal hydrophobic nature of the interaction

Our machine learning model and crystal structure both indicate that binding recognition is mediated by hydrophobic contacts and shape complementarity (Figures 3 and 4). To gain structural insight into how this works for individual peptides with radically different sequences, we used ROSETTA to dock peptides to an hA5 dimer extracted from our crystal structure. We docked four peptides known to bind: A5cons (SSFQDWLLSRLP), A6cons (GFDWRWGMEALT), NCX1 (RRLLFYKYVYKR), and α‐1‐syn (GERWQRVLLSLA, see next section). We do not have high‐resolution structures of any peptide interacting with hA5; however, previous NMR chemical shift perturbation analyses revealed that A5cons, A6cons, and NCX1 all bind at the canonical peptide binding surface. 18 , 19

For each peptide, we generated probable docked models using FlexPepDocking. 37 To make sure that the docking results were not strongly determined by the starting conformation of the peptide, we started the docking runs with the peptide in four different starting states: either helical or extended, and either “up” or “down” in the pocket (see methods). We generated 400,000 docked models for each peptide. We took the 50,000 best‐scoring models for each peptide and used them for further analyses. To determine the similarity between the resulting models for a given peptide, we clustered the models based on the pairwise peptide Cα RMSD values between each model. Although we allowed up to 50 clusters, 95% of models for each peptide were partitioned into only 5 to 12 clusters.

To summarize the results, we calculated the Cα RMSD between each model and the best‐scoring model for that peptide, and then plotted this RMSD against the score for each model (Figure 5a). In such plots, the best model appears on the bottom left: the model with the lowest overall score has an RMSD of 0 against itself. We then colored each model on the plot by its cluster membership.
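The score‐versus‐RMSD summary can be sketched as below. The sketch assumes each model is a (score, coordinates) pair with peptide Cα coordinates already expressed in a shared receptor frame, so no superposition is performed; that assumption, the function names, and the toy coordinates are all hypothetical.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal length")
    total = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def rmsd_vs_best(models):
    """Return (score, RMSD to the lowest-scoring model) for every model."""
    _, best_coords = min(models, key=lambda m: m[0])
    return [(score, ca_rmsd(coords, best_coords)) for score, coords in models]


# Two toy two-atom "peptides"; the second is shifted by (3, 4, 0) angstroms.
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_b = [(3.0, 4.0, 0.0), (4.0, 4.0, 0.0)]
points = rmsd_vs_best([(-10.0, model_a), (-5.0, model_b)])
print(points)
```

Each returned pair is one point on a plot like those described for Figure 5a–d, with the lowest‐scoring model at an RMSD of zero.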

FIGURE 5.

Docked peptides show multiple binding modes. (a–d) Docking results for peptides indicated above each graph. Each point is a single model. The color of each point indicates its cluster membership, ranked from the cluster with the best to the worst score: black, blue, green, brown, and purple. We excluded models from outside the top five clusters from this plot. The x‐axis is the ROSETTA score for the model; the y‐axis is the Cα RMSD for each model against the best model for that peptide. (e–h) Plausible models for the peptide are indicated on the structure. The hA5 input structure is shown as a surface, with chain A and B shown in gray and white. The peptide is shown as a tube, colored from blue (N‐terminus) to red (C‐terminus). We show the five best‐scoring models for each peptide. (i) Molecular detail of the highest scoring overall peptide model (A5cons). Cα atoms are highlighted with colors matching panel f. The three hydrogen bonds formed between the peptide and hA5 are indicated with arrows; hydrophobic interactions are indicated with “*”. Sidechains that do not interact with S100 have been removed for clarity. (j) Overlay of all 20 peptide docks shown in panels e–h.

The A6cons and A5cons peptides yielded single binding solutions. For example, for A6cons, 71% of the top 50,000 models fell within a single cluster, including the 8 best‐scoring models (black points, Figure 5a). For this model, the peptide takes on a largely extended conformation that drapes across the hydrophobic binding surface in a belt‐like fashion (Figure 5e). The core, central region of the peptide engages similarly for A6cons peptides in this binding mode; however, there is variability in the placement of the N‐ and C‐termini of the peptide. As a result, even within this top cluster, average Cα RMSD between structures was 8 Å. A similar situation holds for A5cons (Figure 5b,f).

In contrast, the NCX1 and α‐1‐syn peptides did not yield unique docking solutions. For example, in NCX1, the top two models come from different clusters (the left‐most black and blue points in Figure 5c). The peptide in these two models occupies the same binding pocket, but it traces through the pocket in two different ways (Figure 5g), running in opposite directions through the pocket. The α‐1‐syn peptide gave an even wider diversity of solutions, with three different clusters represented in the top 10 structures (Figure 5d), representing quite different binding modes (Figure 5h).

All of the models are characterized by interactions that are almost entirely hydrophobic in nature. As an example, we can look in detail at one of the A5cons models (Figure 5i). This model had the best overall score for any peptide docked to hA5 (red circle, Figure 5b). In this model, the peptide forms five well‐packed hydrophobic interactions (indicated with asterisks), but only three hydrogen bonds to hA5 (indicated with arrows). This dearth of hydrogen bonds is common for all of the peptides. If we average over the cluster containing the best‐scoring model for each peptide, A6cons forms the most hydrogen bonds to hA5 (3.2 ± 2), while NCX1 forms the fewest (1.3 ± 1). Thus, as predicted by the machine learning model, polar interactions do not seem to play a key role in defining peptide binding.

We can also use these models to rationalize the finding that many diverse peptides bind. If we overlay the solutions shown in Figure 5e–h onto a single structure, we can see the sheer breadth of structures that are compatible with this binding site (Figure 5j): the interface can accommodate a wide variety of peptide configurations, as long as they can have hydrophobic amino acids and enough flexibility to pack into position.

2.6. The trained model identified a possible new hA5 target peptide

Finally, we attempted to use our trained model to predict new, biologically‐plausible targets for hA5. We used a 12 amino acid sliding window to find 10,477,400 unique 12‐mers in 20,206 human proteins extracted from UniProt. Applying our trained model to this k‐merized human proteome resulted in a set of predicted interacting peptides (Figure 6a,b). The resulting distribution of predicted E scores is shown in Figure 6a. The distribution is centered at zero, with a tail extending along the negative (higher enrichment) axis. An estimated 3.9% of proteomic 12‐mers had an E value below our apparent detection threshold of −1.37.
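The proteome scan can be sketched in two steps: collect the unique 12‐mers with a sliding window, then ask what fraction of predicted E values falls below the cutoff. The accession labels, sequences, and E values below are toy placeholders, and the function names are hypothetical.

```python
def unique_kmers(proteins, k=12):
    """Map each unique k-mer to the set of accessions that contain it."""
    kmers = {}
    for accession, sequence in proteins.items():
        for i in range(len(sequence) - k + 1):
            kmers.setdefault(sequence[i:i + k], set()).add(accession)
    return kmers


def fraction_below(e_values, cutoff=-1.37):
    """Fraction of predicted E values below the binding cutoff."""
    return sum(e < cutoff for e in e_values) / len(e_values)


# Toy "proteome": two overlapping sequences and one too short to window.
toy_proteome = {
    "P1": "ABCDEFGHIJKLM",
    "P2": "BCDEFGHIJKLMN",
    "P3": "SHORT",
}
kmers = unique_kmers(toy_proteome)
print(len(kmers), sorted(kmers["BCDEFGHIJKLM"]))
print(fraction_below([-2.0, -1.0, 0.0, 0.5]))
```

Keeping the accessions alongside each 12‐mer makes it possible to trace a high‐scoring peptide back to its protein of origin after prediction.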

FIGURE 6.

hA5 binds tightly to one of the predicted peptide targets. (a) Histogram showing the distribution of E scores for proteomic 12‐mers predicted to bind to hA5. Red dashed line indicates the cutoff of E = −1.37. (b) Sequences of the five proteomic peptides predicted to bind to hA5. Newly discovered target, α‐1‐syn, is highlighted in red. (c) Isothermal Titration Calorimetry (ITC) trace showing binding of peptide α‐1‐syn to hA5. We estimated parameters for a single‐site binding model to the data using the Bayesian MCMC sampler in pytc. 62 Lines show 100 individual fits sampled from the Bayesian posterior probability distribution. Inset shows structure of human α‐1‐syntrophin (PDB entry 1Z87) with the Q13424 peptide fragment (GERWQRVLLSLA) labeled in red. Detailed data on predicted peptides can be found in Table 3

We next sought to predict specific sequences that would bind. We selected five peptides from the top 0.05% of the E score distribution and purchased synthetic versions of these peptides. For experimental tractability, we selected peptides that were predicted to be soluble using the pepcalc.com server. To avoid effects from the peptide termini, we ordered the predicted 12‐mer peptides plus 3 additional amino acids taken from the full protein sequence at both the N‐ and C‐terminal ends (Figure 6a,b). The full peptide sequences and the proteins from which they were taken are shown in Table 3. We measured binding of these peptides to hA5 using ITC. We first conducted all the measurements at 25°C. If we were unable to detect a heat of binding at 25°C, we also attempted to measure the interaction at 10°C. Because these protein‐peptide interactions are expected to be hydrophobic, we would expect a nonzero ΔCp of binding, and thus a detectable heat of binding at one or both temperatures. 15 , 19
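The reasoning behind measuring at two temperatures follows from the standard relation between binding enthalpy and heat capacity (Kirchhoff's law). As a sketch, assuming ΔCp is constant between 10°C and 25°C (an assumption not stated in the text):

```latex
\Delta H(T) = \Delta H(T_{\mathrm{ref}}) + \Delta C_p \,(T - T_{\mathrm{ref}})
\quad\Longrightarrow\quad
\Delta H(10\,^{\circ}\mathrm{C}) = \Delta H(25\,^{\circ}\mathrm{C}) - (15\ \mathrm{K})\,\Delta C_p
```

If ΔCp is nonzero, ΔH cannot vanish at both temperatures: a peptide whose heat of binding happened to be zero at 25°C would give a heat of magnitude 15 K × |ΔCp| at 10°C.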

TABLE 3.

Model predictions for potential new peptide targets

Uniprot accession | Sequence | Protein of origin | E | KD (μM)
Q86UW7 | agsSQRAPPAPTREGrrd | Calcium‐dependent secretion activator 2 | −4.03 | >100
O75170 | dapGAGAPPAPGKKEapp | Serine/threonine‐protein phosphatase 6 | −3.94 | >100
Q13424 | gagGERWQRVLLSLAedt | α‐1‐syntrophin | −4.45 | 5
B2RNZ0 | keiKTAMWRLFVKIYFlqk | Human olfactory receptor 14I1 | −3.53 | >100
Q14147 | sedDRAGPAPPGASDgvd | ATP‐dependent RNA helicase DHX34 | −3.88 | >100

Note: Model E scores and measured binding affinities for peptides predicted by our model to bind to hA5. Flanking amino acids outside the predicted 12‐mer are shown in lowercase.

The peptide extracted from α‐1‐syntrophin protein (referred to hereafter as “α‐1‐syn”) bound to hA5 at 25°C with KD = 4.8 μM (95% confidence, 1.4 to 23 μM) and ΔH = −4.5 kcal·mol−1 (95% confidence, −11 to −1.5 kcal·mol−1) (Figure 6c). The peptide has little sequence similarity to other previously‐identified targets; however, it does possess five hydrophobic residues, including one tryptophan. It also has multiple charged and polar residues that, together, make it readily soluble in water.

The remaining four peptides gave no evidence of binding at either temperature (Table 3). These peptides are quite variable in sequence; however, three of the four (Q86UW7, O75170, Q14147) are rich in proline and alanine and are studded with charged residues. They also notably lack the large, bulky tryptophan possessed by the α‐1‐syn peptide (Table 3). Thus, it is possible that these proline‐rich peptides clash with the binding site despite favorable overall properties. It is less clear what may determine the lack of B2RNZ0 peptide binding to hA5.

3. DISCUSSION

We applied a supervised machine learning approach to a previously‐measured high‐throughput phage display dataset to predict the binding of peptide targets for human S100A5 (hA5). Using this model we were able to: (a) recapitulate the established pattern of specificity for a set of known targets, (b) determine that the major biochemical drivers of peptide binding were hydrophobicity and shape complementarity, and (c) identify a previously unknown target peptide from human α‐1‐syntrophin. By solving a crystal structure of the calcium‐bound form of hA5, we were able to propose a biophysical rationale for the low specificity of the protein: there are several different binding modes at the canonical peptide interface. This was confirmed by peptide docking studies, which found that peptides could dock in multiple orientations, while exhibiting a paucity of hydrogen bonds to the hA5 surface. Our results lay the groundwork for a more thorough understanding of the biochemistry and biology of hA5. We also provide evidence that high‐throughput binding experiments coupled with deep sequencing and machine learning constitute a potential way forward in understanding the determinants of binding specificity in very low‐specificity proteins.

3.1. A step forward in understanding the biochemistry of hA5 specificity

We find that the peptide specificity of hA5 is determined largely by shape complementarity and hydrophobic surface area—not polar contacts. These are the most predictive features in our trained model (Figure 3e). This result is supported by the crystal structure, which shows the peptide‐binding surface can be occupied by different combinations of bulky hydrophobic side chains from crystal symmetry mates (Figure 4c). For example, in one symmetry mate pair, a bulky hydrophobic side chain extends from one symmetry mate into the peptide binding pocket of another. Finally, our docking results show that peptides can be accommodated in multiple orientations in the binding pocket (Figure 5j)—forming many hydrophobic contacts, but few hydrogen bonds (Figure 5i).

As a result, the features that contribute to binding are distributed across the target peptides, rather than being concentrated onto one or two key sites. This observation is a notable deviation from the traditional way of thinking about protein–protein interaction specificity, which is often centered around the idea of binding “hot spots”. 38 This helps to explain why a straightforward representation of binding preferences as a motif or position‐weight‐matrix has not been possible for S100 proteins. We suspect that similar patterns may be identified in other low‐specificity proteins and that similar approaches to ours may be required to understand the determinants of binding specificity.

While hydrophobicity and shape complementarity are clearly important, our model likely underestimates the contribution of polar contacts. It systematically underestimates the magnitude of E of both the highest and lowest E peptides (Figure 3c). We suspect this is because these values of E depend most strongly on specific structural details, rather than the aggregate biochemical features considered by our model. Such contacts may be “smeared out” by the model, and thus make a smaller contribution to the model than they do in the actual molecular interface. This effect must be relatively small, however, as the model performs quite well overall and our structural analyses support a small role for polar contacts at this interface.

3.2. Implications for the biological roles of hA5

The large predicted interaction set for hA5 (Figure 6a) likely reflects the hydrophobic nature of the peptide binding surface. Any peptide that presents a compatible hydrophobic surface is expected to bind, possibly in multiple conformations. Crucially, however, this does not mean that any peptide will bind. We found four new peptides that did not interact with hA5 (Figure 6b), in addition to the previously known “SIP” peptide. 19 Further, we found previously that this specificity has been conserved for hundreds of millions of years in S100A5 paralogs, 19 suggesting that the low specificity does not represent a total lack of peptide binding preference.

Our results suggest a plausible, but previously unknown target for hA5 (Figure 6c). The peptide we identified is a fragment of human α‐1‐syntrophin, a largely disordered PDZ‐domain‐containing protein that is expressed in a variety of human tissues and serves as a scaffold for various signaling molecules. 39 , 40 , 41 The peptide fragment is part of a relatively exposed region of the α‐1‐syntrophin PDZ domain, and should be accessible to hA5 in the cell. There are several tissues where both proteins are expressed including kidney and brain. 40 , 41 , 42 , 43 , 44 , 45 Future biological experiments such as pull‐down assays should be used to test whether α‐1‐syntrophin is truly a biological interaction partner of hA5.

Aside from identifying a specific target, our results also allow us to create a rough estimate of the number of putative hA5 peptides that may exist in the proteome. Based on the predictions of our machine‐learning model, we estimate that the protein can bind to roughly 4% of the 10,477,400 unique 12‐mers in the human proteome. When we sampled five predicted binders, we found that only one bound. If we assume the model yields ≈80% false positives when applied to the proteome, there are ≈420,000 potential hA5 targets. If only 10% of these partners are physically accessible, with the rest occluded in the interior of proteins or cell membranes, we are still left with 42,000 peptide fragments that may be expected to bind to hA5.
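The arithmetic behind this estimate can be checked in a few lines. This is a minimal sketch that takes the 4% binding fraction and the 10% accessibility figure from the text at face value; the function name is ours and is not part of the authors' software.

```python
# Back-of-envelope estimate of hA5 targets, using the numbers quoted in the text.
N_12MERS = 10_477_400        # unique 12-mers in the human proteome
FRACTION_BINDING = 0.04      # fraction of 12-mers estimated to bind hA5
FRACTION_ACCESSIBLE = 0.10   # fraction assumed to be physically accessible


def estimate_targets(n_peptides, fraction_binding, fraction_accessible):
    """Return (potential targets, accessible targets) as integers."""
    potential = round(n_peptides * fraction_binding)
    accessible = round(potential * fraction_accessible)
    return potential, accessible


potential, accessible = estimate_targets(
    N_12MERS, FRACTION_BINDING, FRACTION_ACCESSIBLE
)
print(f"potential targets:  {potential:,}")
print(f"accessible targets: {accessible:,}")
```

Rounded to two significant figures, the two values correspond to the ≈420,000 and 42,000 quoted above.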

This suggests that other mechanisms are required to offset the low biochemical specificity of hA5. One possibility is hA5's precise cellular expression and localization. The protein has a very tight expression pattern and appears to be localized near specific bilayers, 45 , 46 , 47 thus limiting its available binding targets. hA5 also has relatively low affinities for peptides (≈ μM), 19 , 20 meaning that it and/or its partners must be at relatively high concentration for an interaction to form. Finally, it is also possible that there are additional higher‐order properties of proteins that restrict the true set of possible hA5 target peptides. For example, in addition to the peptide region itself, the nearby regions may need to possess flexibility to accommodate peptide binding, something our peptide model does not take into account.

3.3. Implications for predicting proteomic targets of low‐specificity proteins

Finally, our work suggests that even relatively sophisticated machine‐learning approaches may not be sufficient to build models that reliably predict new binding targets for low specificity proteins. In our case, only one of the five peptides we sampled from the human proteome interacted with hA5. This low success rate likely arises from a few sources. First, there are errors in the model itself—it does not perfectly reproduce the phage display data. Second, phage‐display does not perfectly map to binding of isolated peptides in vitro. The third, and likely most important issue, however, is statistical. The total number of nonbinding peptides in the proteome is almost certainly very large compared to the number of true targets; therefore, even a small false positive rate in our predictions would cause a huge number of false positives that can swamp out our true predictions. 48 This means that predicting specific new targets from the proteome, even with an exceedingly accurate model, will be quite challenging. Predictions will thus always require experimental follow up to validate their binding.
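The statistical point can be illustrated with Bayes' theorem. The sketch below computes the fraction of predicted binders that are true binders (the positive predictive value). The prevalence, sensitivity, and false positive rate are hypothetical values chosen for illustration; none of them are estimates from this study.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Fraction of predicted binders that are true binders (Bayes' theorem)."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Hypothetical scenario: 1% of peptides truly bind, the model finds 90% of
# them, and it wrongly flags 5% of the nonbinders.
ppv = positive_predictive_value(
    prevalence=0.01, sensitivity=0.90, false_positive_rate=0.05
)
print(f"fraction of predictions that are true binders: {ppv:.3f}")
```

Even with a seemingly small false positive rate, most predictions in this scenario are wrong because nonbinders vastly outnumber binders.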

3.4. Future directions

Unlike purely sequence‐based methods, our approach provides insight into what biochemical features are recognized by the protein. By recoding the amino acid sequence as a vector of biochemical properties, one gains insight into what features of the amino acids, and of the peptide as a whole, are being recognized by the protein, rather than what letters are preferred. This is particularly powerful for a case like hA5, where the amino acid preferences are not obvious, but it will likely be useful for more specific proteins as well. For example, if a motif contains a tryptophan and a tyrosine at a given site, what are the relative contributions of hydrophobicity and hydrogen bonding to binding?

All that is required as input for these calculations is a large collection of peptide sequences with some measured property such as enrichment, binding, or activity. Our software automatically calculates the chemical features and then writes them out in a format that can be fed into any machine learning platform. The approaches we implement here should thus be broadly applicable to other proteins that recognize short linear motifs, 1 , 2 , 4 providing a framework for future studies to decipher the biochemical determinants of binding preferences in these systems.

4. MATERIALS AND METHODS

4.1. Machine learning analysis

We implemented our machine learning model in Python 3 extended with numpy, 49 scipy, 50 and matplotlib. 51 We used sklearn 0.21.3 for our random forest regression. 32 , 52 , 53 A full list of the calculated features is shown in Table S1. As noted, some features were calculated using CIDER (using localCIDER 1.7) 29 ; we calculated the remaining features using our own software. We standardized all input features prior to training the model by subtracting the mean and dividing by the standard deviation of that feature as calculated across all observations. We trained the model using the default objective function in sklearn (least‐squares). Prior to doing any model fitting, we withheld 10% of the data as a test set. We did k‐fold cross‐validation on the training data to determine which parameters to include in the fit, using k = 10. We determined the relative contribution of each feature to our final trained model using the “feature_importances_” attribute of the sklearn model, which analyzes node impurity as measured by mean squared error. Our full implementation, including all data files and an example script, is available at https://github.com/harmslab/hops.
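The data‐handling steps described above (standardization, a 10% hold‐out, and 10‐fold cross‐validation) can be sketched with the Python standard library alone. This is an illustrative sketch, not the authors' implementation, which is available in the HOPS repository. The function names, the random seed, and the toy data are assumptions.

```python
import random
import statistics


def standardize(column):
    """Z-score one feature: subtract its mean, divide by its standard deviation."""
    mean = statistics.fmean(column)
    stdev = statistics.pstdev(column)
    if stdev == 0:
        return [0.0 for _ in column]
    return [(x - mean) / stdev for x in column]


def holdout_split(n_rows, test_fraction=0.10, seed=0):
    """Shuffle row indices and withhold a fraction of them as the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


feature = [1.0, 2.0, 3.0, 4.0, 5.0]          # made-up feature values
z = standardize(feature)                      # mean 0, standard deviation 1
train_idx, test_idx = holdout_split(n_rows=200)
splits = list(k_fold(train_idx, k=10))        # ten (train, validation) pairs
```

The test indices are set aside before any cross‐validation, so the folds are drawn only from the training rows.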

4.2. X‐ray crystallography

hA5 C43S/C79S was expressed and purified from BL21(DE3) cells as described previously. 19 To generate crystals, we dialyzed 4 mM protein into 1 mM HEPES, 8 mM CaCl2, 0.25 mM DTT at pH 7.5. We then mixed this solution 1:1 with 0.2 M (NH4)2SO4, 20% PEG 8000 (w/v). We grew crystals by hanging‐drop at 4°C. We harvested crystals, submerged them in a cryoprotectant solution of 25% PEG 1500, and then flash froze them by plunging into liquid nitrogen.

X‐ray diffraction data were collected at the Berkeley Advanced Light Source (ALS) beamline 5.0.3 at cryogenic temperatures on a single high‐diffracting hA5 crystal. Data were processed with iMosflm v. 7.2.150 and scaled with SCALA. 54 Data were cut to 1.25 Å resolution based on the method of Karplus & Diederichs 55 with CC1/2 > 0.3 and completeness >50% in the highest resolution bin. Analysis with POINTLESS 54 indicated space group P3112 as a candidate solution, but molecular replacement trials using PDB structure 4dir and Phaser 56 failed to correctly solve the structure. Analysis with Xtriage 57 indicated the data were twinned, with an estimated twin fraction of 0.5 and twin law of −k, −h, −l. Subsequent molecular replacement trials found a solution in space group P32 with three homodimers (6 chains) in the asymmetric unit. Strong density for each homodimer remained after refinement with each respective homodimer deleted. Manual model building was performed with Coot v. 0.8.3 58 and refinement with Phenix 1.10.156 using a twin law of −k, −h, −l, which refined to a twin fraction of 0.44. In late stages of refinement riding hydrogens were added and TLS was applied with one group per protein chain. The protein chains contain two Ca2+ ions each that are well‐defined and similar in coordination to previous S100 structures. A few solvent sites showed close approaches and may also be fully or partially occupied metal sites, but in the absence of further evidence these were modeled as waters. Residues 44–49 in all chains exhibited weak density and were left partially modeled. The final Rwork/Rfree of the structure was 17.6/20.0% (Table 2). We assessed structure quality using Molprobity. 59 The final structure was submitted to the protein data bank as code: 6WN7.

4.3. Docking studies

Docking analyses were performed using ROSETTA3.1 (build 2018.09.60072), 60 using the FlexPepDocking binary. 37 We generated 3‐, 5‐, and 9‐mer fragment libraries using the included “make_fragments.pl” script, with the UNIREF90 database as input. For each peptide, we used four starting models: either helical or extended, each in two orientations relative to the binding pocket, with N → C going “up” or “down” the pocket (according to the orientation shown in Figure 5e). When clustered, models came from all of the initial docking models, suggesting the results did not depend on the choice of starting model. We executed FlexPepDocking with the flags “-lowres_abinitio -pep_refine -ex1 -ex2aro”. We generated ≈400,000 docked models for each peptide.

After docking, we extracted the top 50,000 models for each peptide for downstream analysis. We clustered the models based on pairwise Cα RMSD between peptides, using hierarchical clustering by the unweighted pair group method with arithmetic mean (UPGMA). The cophenetic correlation coefficient ranged from 0.7 to 0.9 for all four peptides. We specified that the software identify 50 clusters; however, we found that 95% of the models ended up in the top 5 to 12 clusters for each peptide. Clustering and data analysis were done in Python 3.7 extended by numpy, 49 scipy, 50 and matplotlib. 51 Hydrogen bonds were counted in output structures using VMD 1.9.3, 61 with the criteria of <3.5 Å and <40°.
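To make the clustering step concrete, the following is a minimal pure‐Python sketch of UPGMA (average‐linkage) clustering on a pairwise distance matrix. It is written for readability and would be far too slow for 50,000 models. In practice a library routine such as scipy.cluster.hierarchy.linkage with method="average" would be the usual choice; the text does not state which routine the authors used. The RMSD values below are made up.

```python
def upgma(dist, n_clusters):
    """Average-linkage (UPGMA) clustering from a symmetric distance matrix.

    Repeatedly merges the two closest clusters until n_clusters remain. The
    distance between two clusters is the unweighted mean of all pairwise
    distances between their members.
    """
    clusters = [[i] for i in range(len(dist))]

    def average_distance(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]   # merge y into x
        del clusters[y]
    return [sorted(c) for c in clusters]


# Toy pairwise RMSD matrix (made-up values, in angstroms) for five docked models
rmsd = [
    [0.0, 0.5, 0.6, 4.0, 4.2],
    [0.5, 0.0, 0.4, 4.1, 4.3],
    [0.6, 0.4, 0.0, 3.9, 4.0],
    [4.0, 4.1, 3.9, 0.0, 0.7],
    [4.2, 4.3, 4.0, 0.7, 0.0],
]
clusters = upgma(rmsd, n_clusters=2)
print(clusters)
```

With these toy values, models 0 to 2 form one cluster and models 3 and 4 form the other.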

4.4. Isothermal titration calorimetry

Synthetic peptides were purchased from GenScript, Inc. For all peptides, we attempted to measure binding at 25°C. If binding could not be detected at 25°C, we also attempted the experiment at 10°C. ITC experiments were performed in 25 mM TES, 100 mM NaCl, 2 mM CaCl2, 1 mM TCEP, pH 7.4. Samples were equilibrated and degassed by centrifugation at 18,000 × g at the experimental temperature for 35 minutes. Synthetic peptides were dissolved directly into the experimental buffer prior to each experiment. All experiments were performed on a MicroCal ITC‐200. Gain settings were determined on a case‐by‐case basis to ensure quality data. A 750 rpm syringe stir speed was used for all experiments. Spacing between injections ranged from 300 to 900 s depending on gain settings and relaxation time of the binding process. A single‐site binding model was fit to the titration data using the Bayesian MCMC fitter in pytc. 62 Uniform priors were used for all parameters. The maximum likelihood (ML) estimate was used as a starting guess and the likelihood surface was then explored with 100 walkers, each taking 5,000 steps. The first 10% of steps were discarded as burn‐in. One clean ITC trace was used to fit the binding model. Negative results were double‐checked to ensure accuracy.
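The sampling bookkeeping described above can be sketched as follows. This is not the pytc API; the helper names are hypothetical, and the nearest‐rank percentile is one simple convention among several for summarizing posterior samples as a 95% interval.

```python
def discard_burn_in(chain, burn_fraction=0.10):
    """Drop the first burn_fraction of steps from one walker's chain."""
    n_burn = int(len(chain) * burn_fraction)
    return chain[n_burn:]


def credible_interval(samples, level=0.95):
    """Central interval from samples using nearest-rank percentiles."""
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * (len(ordered) - 1))]
    upper = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    return lower, upper


N_WALKERS, N_STEPS = 100, 5000     # values quoted in the text
retained_per_walker = len(discard_burn_in(list(range(N_STEPS))))
total_retained = N_WALKERS * retained_per_walker
print(f"samples kept after burn-in: {total_retained:,}")
```

With the settings quoted in the text, each walker keeps 4,500 of its 5,000 steps, so 450,000 samples remain for estimating parameter intervals such as those reported for KD and ΔH.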

AUTHOR CONTRIBUTIONS

Lucas C. Wheeler: Conceptualization; investigation; methodology; software; validation; visualization; writing‐original draft; writing‐review and editing. Arden Perkins: Investigation; methodology; visualization; writing‐review and editing. Caitlyn E. Wong: Investigation; writing‐review and editing. Michael J. Harms: Conceptualization; funding acquisition; investigation; methodology; project administration; software; supervision; visualization; writing‐original draft; writing‐review and editing.

Supporting information

Table S1 (table‐s1.pdf) Features used for supervised machine learning. Features denoted (CIDER) were calculated using the CIDER software package. 29 Other features were calculated using our own software package (HOPS: https://github.com/harmslab/hops). Features were organized into the combined feature group categories indicated in the second column.

ACKNOWLEDGMENTS

We acknowledge Dale Tronrud for helpful conversations regarding handling of the crystallographic twinning refinement. We thank Joseph Harman, Anneliese Morrison, and John Muyskens of the Harms lab for their helpful comments on the manuscript. This work was funded by NIH R01GM117140 (Michael J. Harms), NIH 7T32GM007759 (Lucas C. Wheeler), and NIH F32DK115195‐02 (Arden Perkins). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Wheeler LC, Perkins A, Wong CE, Harms MJ. Learning peptide recognition rules for a low‐specificity protein. Protein Science. 2020;29:2259–2273. 10.1002/pro.3958

Lucas C. Wheeler and Arden Perkins contributed equally.

Funding information National Institute of General Medical Sciences, Grant/Award Numbers: 7T32GM007759, F32DK115195, R01GM117140

REFERENCES

  • 1. London N, Raveh B, Schueler‐Furman O. Druggable protein–protein interactions – From hot spots to hot segments. Curr Opin Chem Biol. 2013;17:952–959.
  • 2. Ivarsson Y, Jemth P. Affinity and specificity of motif‐based protein–protein interactions. Curr Opin Struct Biol. 2019;54:26–33.
  • 3. Li P, Banjade S, Cheng H‐C, et al. Phase transitions in the assembly of multivalent signalling proteins. Nature. 2012;483:336–340.
  • 4. Seo M‐H, Kim PM. The present and the future of motif‐mediated protein–protein interactions. Curr Opin Struct Biol. 2018;50:162–170.
  • 5. Ren S, Uversky VN, Chen Z, Dunker AK, Obradovic Z. Short linear motifs recognized by SH2, SH3 and Ser/Thr kinase domains are conserved in disordered protein regions. BMC Genomics. 2008;9:S26.
  • 6. Hugo W, Ng S‐K, Sung W‐K. D‐SLIMMER: Domain–SLiM interaction motifs miner for sequence based protein–protein interaction data. J Proteome Res. 2011;10:5285–5295.
  • 7. Pethe MA, Rubenstein AB, Khare SD. Data‐driven supervised learning of a viral protease specificity landscape from deep sequencing and molecular simulations. Proc Natl Acad Sci. 2019;116:168–176.
  • 8. Ernst A, Gfeller D, Kan Z, et al. Coevolution of PDZ domain–ligand interactions analyzed by high‐throughput phage display and deep sequencing. Mol Biosyst. 2010;6:1782–1790.
  • 9. Gfeller D, Butty F, Wierzbicka M, et al. The multiple‐specificity landscape of modular peptide recognition domains. Mol Syst Biol. 2014;7:484–484.
  • 10. Santamaria‐Kisiel L, Rintala‐Dempsey AC, Shaw GS. Calcium‐dependent and ‐independent interactions of the S100 protein family. Biochem J. 2006;396:201–214.
  • 11. Donato R, Cannon BR, Sorci G, Riuzzi F, Hsu K, Weber DJ, Geczy CL. Functions of s100 proteins. Curr Mol Med. 2013;13:24–57.
  • 12. Lee Y‐T, Dimitrova YN, Schneider G, et al. Structure of the S100a6 complex with a fragment from the C‐terminal domain of Siah‐1 interacting protein: A novel mode for S100 protein target recognition. Biochemistry. 2008;47:10921–10932.
  • 13. Bertini I, Gupta SD, Hu X, Karavelas T, Luchinat C, Parigi G, Yuan J. Solution structure and dynamics of S100a5 in the Apo and Ca2+‐bound states. J Biol Inorg Chem. 2009;14:1097–1107.
  • 14. Leclerc E, Fritz G, Vetter SW, Heizmann CW. Binding of S100 proteins to RAGE: An update. Biochim Biophys Acta. 2009;1793:993–1007.
  • 15. Streicher WW, Lopez MM, Makhatadze GI. Annexin I and Annexin II N‐terminal peptides binding to S100 protein family members: Specificity and thermodynamic characterization. Biochemistry. 2009;48:2788–2798.
  • 16. Slomnicki LP, Nawrot B, Lesniak W. S100a6 binds P53 and affects its activity. Int J Biochem Cell Biol. 2009;41:784–790.
  • 17. Lesniak W, Slomnicki LP, Filipek A. S100a6 – New facts and features. Biochem Biophys Res Commun. 2009;390:1087–1092.
  • 18. Liriano MA. Ph.D. (University of Maryland, Baltimore, United States – Maryland). 2012.
  • 19. Wheeler LC, Anderson JA, Morrison AJ, Wong CE, Harms MJ. Conservation of specificity in two low‐specificity proteins. Biochemistry. 2018;57:684–695.
  • 20. Simon MA, Ecsedi P, Kovacs GM, Poti AL, Remenyi A, Kardos J, Gogl G, Nyitray L. High throughput competitive fluorescence polarization assay reveals functional redundancy in the s100 protein family. bioRxiv. 2019.
  • 21. Schaefer BW, Fritschy J‐M, Murmann P, et al. Brain S100a5 is a novel calcium‐, zinc‐, and copper ion‐binding protein of the EF‐hand superfamily. J Biol Chem. 2000;275:30623–30630.
  • 22. Olender T, Keydar I, Pinto JM, et al. The human olfactory transcriptome. BMC Genomics. 2016;17:619.
  • 23. Wheeler LC, Harms MJ. Were ancestral proteins less specific? bioRxiv. 2020.
  • 24. Crooks GE, Hon G, Chandonia J‐M, Brenner SE. WebLogo: A Sequence Logo Generator. Genome Res. 2004;14:1188–1190.
  • 25. Kim T, Tyndel MS, Huang H, et al. MUSI: An integrated system for identifying multiple specificity from very large peptide or nucleic acid data sets. Nucleic Acids Res. 2012;40:e47–e47.
  • 26. Damerau FJ. A technique for computer detection and correction of spelling errors. Commun ACM. 1964;7:171–176.
  • 27. Ling RF. On the theory and construction of k‐clusters. Comput J. 1972;15:326–332.
  • 28. Ester M, Kriegel H‐P, Sander J, Xu X. A Density‐Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Palo Alto, CA, USA: AAAI press, 1996; p. 226–231.
  • 29. Holehouse AS, Ahad J, Das RK, Pappu RV. CIDER: Classification of intrinsically disordered ensemble regions. Biophys J. 2015;108:228a.
  • 30. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of dna‐ and rna‐binding proteins by deep learning. Nat Biotechnol. 2015;33:831–838.
  • 31. Pethe MA, Rubenstein AB, Khare SD. Large‐scale structure‐based prediction and identification of novel protease substrates using computational protein design. J Mol Biol. 2017;429:220–236.
  • 32. Breiman L. Random forests. Mach Learn. 2001;45:5–32.
  • 33. Chang C‐C, Lin C‐J. LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol. 2011;2:27:1–27:27.
  • 34. Rasmussen C, Williams C. Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT press, 2006.
  • 35. Richards FM. Areas, volumes, packing and protein structure. Annu Rev Biophys Bioeng. 1977;6:151–176.
  • 36. Joo H, Tsai J. An amino acid code for beta‐sheet packing structure. Proteins. 2014;82:2128–2140.
  • 37. Raveh B, London N, Zimmerman L, Schueler‐Furman O. Rosetta flexpepdock ab‐initio: Simultaneous folding, docking and refinement of peptides onto their receptors. PLoS One. 2011;6:e18934.
  • 38. Zerbe BS, Hall DR, Vajda S, Whitty A, Kozakov D. Relationship between hot spot residues and ligand binding hot spots in protein–protein interfaces. J Chem Inf Model. 2012;52:2236–2244.
  • 39. Bhat HF, Baba RA, Bashir M, et al. Alpha‐1‐syntrophin protein is differentially expressed in human cancers. Biomarkers. 2011;16:31–36.
  • 40. Williams JC, Armesilla AL, Mohamed TMA, et al. The sarcolemmal calcium pump, alpha‐1 syntrophin, and neuronal nitric‐oxide synthase are parts of a macromolecular protein complex. J Biol Chem. 2006;281:23341–23348.
  • 41. Adams ME, Anderson KNE, Froehner SC. The alpha‐syntrophin ph and pdz domains scaffold acetylcholine receptors, utrophin, and neuronal nitric oxide synthase at the neuromuscular junction. J Neurosci. 2010;30:11004–11010.
  • 42. Neely JD, Amiry‐Moghaddam M, Ottersen OP, Froehner SC, Agre P, Adams ME. Syntrophin‐dependent expression and localization of aquaporin‐4 water channel protein. Proc Natl Acad Sci. 2001;98:14108–14113.
  • 43. Hashida‐Okumura A, Okumura N, Iwamatsu A, Buijs RM, Romijn HJ, Nagai K. Interaction of neuronal nitric‐oxide synthase with alpha1‐syntrophin in rat brain. J Biol Chem. 1999;274:11736–11741.
  • 44. Chan WY, Xia C‐L, Dong D‐C, Heizmann CW, Yew DT. Differential expression of s100 proteins in the developing human hippocampus and temporal cortex. Microsc Res Tech. 2003;60:600–613.
  • 45. Teratani T, Watanabe T, Yamahara K, et al. Restricted expression of calcium‐binding protein s100a5 in human kidney. Biochem Biophys Res Commun. 2002;291:623–627.
  • 46. Knott TK, Madany PA, Faden AA, et al. Olfactory discrimination largely persists in mice with defects in odorant receptor expression and axon guidance. Neural Dev. 2012;7:17.
  • 47. McIntyre JC, Davis EE, Joiner A, Williams CL, Tsai I‐C, Jenkins PM, McEwen DP, Zhang L, Escobado J, Thomas S, et al. Gene therapy rescues cilia defects and restores olfactory function in a mammalian ciliopathy model. Nat Med. 2012;18:1423–1428.
  • 48. Moons KGM, van Es G‐A, Deckers JW, Habbema JDF, Grobbee DE. Limitations of sensitivity, specificity, likelihood ratio, and bayes' theorem in assessing diagnostic probabilities: A clinical example. Epidemiology. 1997;8:12–17.
  • 49. Walt SVD, Colbert SC, Varoquaux G. The NumPy array: A structure for efficient numerical computation. Comput Sci Eng. 2011;13:22–30.
  • 50. Jones E, Oliphant T, Peterson P, et al. SciPy: Open source scientific tools for Python. 2001.
  • 51. Hunter JD. Matplotlib: A 2d graphics environment. Comput Sci Eng. 2007;9:90–95.
  • 52. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. Boca Raton, FL, USA: CRC press, 1984.
  • 53. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit‐learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–2830.
  • 54. Evans P. Scaling and assessment of data quality. Acta Crystallogr Sect D. 2006;62:72–82.
  • 55. Karplus PA, Diederichs K. Linking crystallographic model and data quality. Science. 2012;336:1030–1033.
  • 56. McCoy AJ, Grosse‐Kunstleve RW, Adams PD, Winn MD, Storoni LC, Read RJ. Phaser crystallographic software. J Appl Cryst. 2007;40:658–674.
  • 57. Liebschner D, Afonine PV, Baker ML, et al. Macromolecular structure determination using X‐rays, neutrons and electrons: Recent developments in Phenix. Acta Crystallogr Sect D. 2019;75:861–877.
  • 58. Emsley P, Cowtan K. Coot: Model‐building tools for molecular graphics. Acta Crystallogr Sect D. 2004;60:2126–2132.
  • 59. Chen VB, Arendall WB III, Headd JJ, et al. MolProbity: All‐atom structure validation for macromolecular crystallography. Acta Crystallogr Sect D. 2010;66:12–21.
  • 60. Leaver‐Fay A, Tyka M, Lewis SM, et al. In: Johnson ML, Brand L, editors. Methods in Enzymology, Computer Methods, Part C. Volume 487. Cambridge, MA, USA: Academic Press, 2011; p. 545–574.
  • 61. Humphrey W, Dalke A, Schulten K. Vmd ‐ visual molecular dynamics. J Mol Graph. 1996;14:33–38.
  • 62. Duvvuri H, Wheeler LC, Harms MJ. Pytc: Open‐source python software for global analyses of isothermal titration calorimetry data. Biochemistry. 2018;57:2578–2583.

