Structural ontogeny of protein-protein interactions

Aerin Yang; Hanlun Jiang; Kevin M Jude; Deniz Akpinaroglu; Stephan Allenspach; Alex Jie Li; James Bowden; Carla Patricia Perez; Liu Liu; Po-Ssu Huang; Tanja Kortemme; Jennifer Listgarten; K Christopher Garcia

doi:10.1126/science.adx6931

Author manuscript; available in PMC: 2026 Feb 14.

Published in final edited form as: Science. 2026 Feb 12;391(6786):eadx6931. doi: 10.1126/science.adx6931


Aerin Yang ¹,†, Hanlun Jiang ²,†, Kevin M Jude ¹,⁴,†, Deniz Akpinaroglu ⁵,⁶, Stephan Allenspach ²,†, Alex Jie Li ⁵,⁶, James Bowden ², Carla Patricia Perez ⁷, Liu Liu ¹, Po-Ssu Huang ⁸, Tanja Kortemme ⁵,⁶,⁹, Jennifer Listgarten ²,³,⁵,*, K Christopher Garcia ¹,⁴,*

¹Department of Molecular and Cellular Physiology, Stanford University School of Medicine, Stanford, CA 94305, USA.

²Department of Electrical Engineering and Computer Science, UC Berkeley, CA, USA.

³Center for Computational Biology, University of California Berkeley, CA, USA

⁴Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA 94305, USA.

⁵The UC Berkeley–UCSF Graduate Program in Bioengineering, University of California, San Francisco; San Francisco, CA 94143, USA.

⁶Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco; San Francisco, CA 94143, USA.

⁷Biophysics program, Stanford University, Stanford, CA, USA

⁸Department of Bioengineering, Stanford University, Stanford, CA, USA.

⁹Quantitative Biosciences Institute, University of California, San Francisco; San Francisco, CA 94143, USA.

Correspondence: jennl@berkeley.edu, kcgarcia@stanford.edu

†These authors contributed equally to this work.

Author contributions

K.C.G. conceived of the coevolution platform and synthetic interface engineering and contributed to experimental design, initiation of collaborations, evaluation of data, interpretation of results, visualization, writing of the manuscript, and funding acquisition. A.Y. developed the coevolution platform, designed the library, performed selection and sequence clustering and analysis, carried out experimental validation (binding assays, specificity matrix, and protein expression), analyzed the results, and contributed to visualization and writing of the manuscript. K.M.J. solved crystal structures of coevolved complexes, performed interface contact analysis, analyzed the results, and contributed to visualization and writing of the manuscript. L.L. contributed to binding assays. H.J. performed SPM and epistasis analysis, analyzed and interpreted the results, and contributed to writing of the manuscript. S.A. developed the SPM methodology and the algorithm for efficiently computing epistasis effects, and contributed to interpretation of results and writing of the manuscript. J.B. developed and applied the method for simulating coevolutionary trajectories, and contributed to interpretation of results and writing of the manuscript. J.L. conceived of the machine learning framework for analyzing multi-round selection read count data, supervised and contributed to the development and application of the computational methods (SPM, epistasis analysis and simulating coevolutionary trajectories), and contributed to funding acquisition and writing of the manuscript. D.A. and A.J.L. developed and applied the Frame2seq structure-based analyses, and contributed to interpretation of results and writing of the manuscript. T.K. contributed to interpretation of the Frame2seq results, writing of the manuscript, and funding acquisition. C.P.P. carried out Rosetta and PatchDock calculations. P.-S.H. contributed to manuscript writing.

Conceptualization: JL, KCG

Methodology: AY, DA, SA, AJL, JB, JL, KCG

Investigation: AY, HJ, DA, KMJ, SA, AJL, JB, CPP, LL, JL, KCG

Visualization: AY, HJ, DA, KMJ, SA, AJL, JB, JL, KCG

Funding acquisition: TK, JL, KCG

Supervision: TK, JL, KCG

Writing – original draft: AY, HJ, KMJ, SA, JB, JL, KCG

Writing – review & editing: AY, HJ, DA, KMJ, SA, AJL, JB, P-SH, TK, JL, KCG

PMCID: PMC12904254 NIHMSID: NIHMS2128659 PMID: 41678610

Abstract

Understanding how protein binding sites evolve interactions with other proteins could hold clues to targeting “undruggable” surfaces. We used synthetic coevolution to engineer new interactions between naïve surfaces, simulating the de novo formation of protein complexes. We isolated seven distinct structural families of protein Z-domain complexes and found that synthetic complexes explore multiple shallow energy wells through ratchet-like docking modes, whereas complexes formed by natural binding sites converged in a deep energy well with a relatively fixed geometry. Epistasis analysis of a machine learning-estimated fitness landscape revealed ‘seed’ contacts between binding partners that anchored the earliest stages of encounter complex formation. Our results suggest that ‘silent’ surfaces have a shallower energy landscape than natural binding sites, disfavoring tight binding, likely due to evolutionary counter-selection.

How new protein-protein interactions arise, diversify, and evolve distinct specificities remains poorly understood. Yet this is an important problem, because protein surfaces that have evolved to form protein-protein interfaces appear to have properties distinct from those of non-interacting protein surfaces (1-3). Natural protein binding sites are usually the most “druggable” sites on a given protein and tend to attract the majority of binders from combinatorial peptide or antibody libraries (4-6). Are there biophysical properties that distinguish regions of protein surfaces that have not evolved to bind to a ligand from natural binding sites? Although it has been proposed that protein binding sites tend to have more exposed hydrophobic amino acids, no definitive rules exist (2, 7, 8). Indeed, protein binding sites are characterized by a vast diversity of chemistries and conformations (9-11). As a result, it remains challenging to identify an unknown binding site solely from inspection of an unbound structure (3).

Protein interfaces often evolve under selective pressures, giving rise to compensatory mutations that confer binding specificity and stability synonymous with particular functions (12-15). Coevolution, characterized by reciprocal changes between interacting partners, reflects structural and functional constraints imposed by evolution and provides insights into the ontogeny of protein-protein interactions (16, 17). Early studies have revealed that residues in close physical proximity at protein interfaces tend to coevolve, exhibiting correlated mutations to maintain structural compatibility and functional integrity (18-20). These coevolutionary signals have been leveraged through computational approaches, such as statistical energy models, to infer residue-residue contacts and identify structural features of PPIs (16, 21-23). Although such models effectively infer evolutionary constraints from sequence variation, the data on which they are trained may not capture the full range of coevolutionary trajectories in PPI evolution. Consequently, experimental systems can help us to gain mechanistic insights while exploring areas of complex formation not uncovered by nature alone. Such experiments can directly track coevolutionary dynamics and trajectories, including transient intermediates, under controlled selection (13, 14, 24-26).

We previously developed a synthetic coevolution platform that enables bidirectional coevolution of interacting proteins through “library-on-library” selections, providing a synthetic proxy of protein coevolution at a systems level (27). This experiment demonstrated that coevolution can structurally remodel naturally conserved interaction surfaces into a plethora of diverse interface chemistries and fine specificities through amino acid side chain repacking. Interestingly, however, the remodeled interfaces were constrained into a common docking mode. This observation raised the question of whether these complexes were trapped in a deep energy minimum and thus encumbered with an evolutionary imprint for binding in a stereotyped manner that could not be traversed even with highly diverse libraries. The coevolution platform offers the possibility to address this question by engineering bidirectional interactions between two naïve protein surfaces unencumbered by a natural binding imprint and to study the epistasis and fitness landscapes (28-30), as a proxy for protein-protein evolution.

In this study, we ask whether there are fundamental differences between protein surfaces that have evolved to bind proteins and surfaces that have not. We have learned that the types of epistasis and structural adaptation that shape specificity and orthogonality differ between natural and non-natural protein binding sites. Further, in silico epistasis analysis identified initiating “seed” sequences at the earliest stages of an evolutionary path to a high-affinity complex.

Results

Designing de novo interfaces using synthetic coevolution

We carried out a synthetic proxy of protein-protein coevolution by designing interfaces between “silent” surfaces of proteins that have no ligand recognition history. We previously described a synthetic coevolution platform that enables bidirectional sampling of libraries to remodel the existing dimer interfaces between the natural binding surface of protein Z-domain and its affibody binder (27). Here, we leveraged this coevolution platform to isolate completely synthetic interfaces between protein Z-domain pairs that have no pre-existing interaction with one another. Our strategy was to “relocate” the binding interface from the canonical binding region that Z-domain uses to bind to its natural ligand, Fc-region of immunoglobulin, as well as synthetic proteins like affibodies, to a non-canonical site on both protein partners that is not used for ligand binding (Fig. 1A). Protein Z-domains and their engineered binding scaffolds, affibodies, typically interact with target proteins through their H1-H2 helices, which we refer to as the “natural” binding sites (31). However, we simultaneously diversified the non-canonical H2-H3 and H1-H3 regions of two Z-domains to encourage these libraries to converge on one another at sites on the proteins that are not used for binding ligands, which we refer to as “synthetic interfaces.” When accompanied by deep sequencing of the selection rounds, this strategy captures the emergence and coevolutionary trajectories of new protein-protein interactions as they transition from unbiased exploratory states to structurally optimized conformations. This approach does not recapitulate the temporal trajectory of natural evolution as all mutants are contained in the initial library (i.e. at time zero prior to any selection), but stepwise selection with higher stringency can simulate, in some respects, affinity maturation over time.

Figure 1. — (A) Overview of the interface relocation strategy. Canonical binding interfaces (H1-H2) of protein Z-domain and affibody were relocated to non-canonical H2-H3 and H1-H3 interfaces to create novel synthetic interfaces.

(B) Schematic of the protease-based cleavage-capture assay. Two proteins are displayed on yeast as a single-chain construct connected by a flexible linker containing a 3C protease cleavage site. Upon cleavage, interacting pairs retain fluorescence from the C-terminal HA-tag binding antibody, while non-interacting pairs lose the HA signal, enabling differentiation based on binding affinity. Flow cytometry plots were gated using SSC-A (side scatter area, a measure of cell granularity/complexity) and APC-A (HA-tag fluorescence readout).

(C) The workflow integrates experimental and computational approaches. Yeast display screening followed by next-generation sequencing (NGS) is used to track sequence evolution, identify pairwise convergence, and construct sequence similarity networks (SSNs) that visualize cluster formation based on sequence relationships. Crystal structures validate these clusters, revealing interface architecture and binding modes. The NGS data is also harnessed by our statistical machine learning method to model multi-round selection data to obtain an estimated fitness landscape, which can be used to uncover landscape geometry via simulating coevolutionary trajectories, and to elucidate key protein-protein interactions through epistasis analysis. Frame2seq provides sequence-structure mapping, linking sequence clusters to their structural configurations, using crystal structures as input. Epistasis analysis, derived from these computational models and validated by interface structures, reveals critical residue-residue interactions, providing insights into the evolutionary mechanisms shaping protein-protein interactions.

We used our coevolution platform with a protease-based cleavage-capture assay (Fig. 1B). The assay distinguishes between binding and non-binding pairs based on fluorescence retention; interacting pairs displayed on yeast retain HA-tag fluorescence after 3C protease cleavage, whereas non-interacting pairs lose the signal. The HA-tag signal correlates with binding affinities (27), enabling relative comparisons of evolved complexes.

We functionalized the two pre-selected sites of interaction with diverse libraries. Newly coevolved interfaces were revealed by tracking pairwise next-generation sequencing (NGS) sequence information of interacting pairs, uncovering patterns of convergence indicative of coevolution through epistasis. We considered several possible outcomes (Fig. 1C): (1) sequence clusters showing linkage between Z-A and Z-B sequences indicate successful library-on-library convergence to form a new bidirectional interface; (2) sequence convergence in the Z-A site paired with sequence drift in the Z-B site indicates that Z-A is adapting to an invariant surface on Z-B, and that the Z-B sequence drifts because there is no selective pressure to adapt to a surface of Z-A, so random amino acids are tolerated; (3) drift in the Z-A library paired with convergence in the Z-B library indicates that Z-A has no interactions, while Z-B has found an invariant surface on Z-A.

We then leveraged machine learning (ML) and statistical modeling to glean more from our experimental data, helping to strengthen our understanding of sequence-structure relationships, and epistatic interactions within them. Specifically, we employed our Selection Probabilistic Model (SPM), a statistical machine learning framework, to extract energy and fitness landscapes from the multi-round selection read count data, capturing both round-specific and global fitness landscapes, encapsulated in a neural network (Methods; fig. S1). Using the global fitness landscape, we simulated in silico coevolutionary trajectories, identifying potential evolutionary pathways toward accessible energy minima. In addition, we exploited structural information by using a newly-developed inverse folding model, Frame2seq (32), to comprehensively map the sequence-structure compatibility of all 714 sequences in the selected clusters and the six converged structural configurations with solved crystal structures (Methods). Finally, by performing an epistasis analysis of the SPM energy landscape and the Frame2seq sequence-structure compatibility scores, we uncovered key residue interactions that shaped the ontogeny of our protein-protein interactions.

Construction and selection of a coevolutionary library for evolving novel protein interfaces

To construct our coevolutionary library, we first screened random Z-domain docking interactions to triage for sterically incompatible surface contours, and identify complementary surfaces that might support a new interaction, using Rosetta and PatchDock (33). This process involved docking two poly-valine Z-domains to identify docking poises free of clashes, and that show high shape complementarity, with interface areas exceeding 1000 Å². Using valines as a generic sidechain representation has been widely used in protein design backbone modeling tasks before amino acid identities are resolved (34, 35). Here, in docking the two domains, the valines provided a neutral volume in the initial orientation to define the residues in the interface. Selected docked models underwent iterative refinement and energy minimization in Rosetta, yielding a stable, low-energy H2-H3/H1-H3 poly-Val docking model that served as the basis for our library design (Fig. 2A and fig. S2). Based on the docked poly-Val model, we identified key interface residues for randomization, replacing bulky or charged residues to facilitate efficient contacts between proteins at designated library positions (fig. S3). Five positions (8, 11, 14, 15, and 45) on Z-A and six positions (29, 30, 33, 43, 44, and 47) on Z-B were selected for randomization, using five hydrophobic amino acids (M, F, L, I, and V) to construct the coevolutionary library (Fig. 2A) (36-38). The library was then screened using our coevolution platform, with HA-tag fluorescence monitored by flow cytometry to assess enrichment across rounds (Fig. 2B).
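As a quick check on the scale of this design, the sketch below computes the theoretical diversity of the library from the numbers given above (five Z-A positions, six Z-B positions, five amino acids per position). It assumes that every position samples the five residues independently and ignores codon bias and transformation efficiency, neither of which is described here; the variable and function names are illustrative only.

```python
# Library positions and amino-acid alphabet as stated in the text.
ZA_POSITIONS = [8, 11, 14, 15, 45]
ZB_POSITIONS = [29, 30, 33, 43, 44, 47]
ALPHABET = "MFLIV"


def library_size(n_positions: int, alphabet: str = ALPHABET) -> int:
    """Distinct sequences if each position samples the alphabet independently
    (an assumption; real libraries are also limited by cloning efficiency)."""
    return len(alphabet) ** n_positions


za_variants = library_size(len(ZA_POSITIONS))   # Z-A side only
zb_variants = library_size(len(ZB_POSITIONS))   # Z-B side only
paired_variants = za_variants * zb_variants     # all Z-A/Z-B combinations

print(za_variants, zb_variants, paired_variants)
```

Under these assumptions the paired library spans 5¹¹ combinations, which gives a sense of the sequence space that the selection rounds and deep sequencing have to cover.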

Figure 2. — (A) Coevolution library design targeting non-canonical binding regions (H1-H3 face of Z-A and H2-H3 face of Z-B) in the Z-domain. Eleven library positions were selected for randomization using five hydrophobic amino acids (M, F, L, I, and V).

(B) Flow cytometry tracking HA-tag fluorescence enrichment. HA-tag fluorescence was monitored across selection rounds, showing a progressive increase in HA-tag signal, indicative of enrichment for interacting pairs.

(C) Selection progression and sequence diversity analysis. (Left) Flow cytometry dot plots demonstrate HA-tag fluorescence enrichment across rounds, with a higher percentage of cells retaining fluorescence in later rounds. (Middle) Sequence logos generated from next-generation sequencing (NGS) data illustrate that amino acid diversity is maintained across all library positions, even in the final round. (Right) Sequence similarity networks (SSNs) of enriched sequences (p-value < 0.05) highlight the emergence of distinct clusters in later rounds, reflecting patterns of convergence at the level of sequence communities rather than individual residues. SSN constructed from concatenated 11 amino acid sequences of Z-A and Z-B proteins. Edges in the SSNs were formed using an edit distance threshold of 3.

Flow cytometry data revealed progressive enrichment of interacting pairs, as indicated by increased HA-tag signal retention over successive rounds (Fig. 2C left). Selection was carried out until the majority of cells showed clear HA-tag retention after cleavage, relative to pre-cleavage staining, ensuring that libraries were enriched to comparable functional endpoints. Despite this enrichment, sequence logos indicated substantial diversity within the library, without clear convergence to a single consensus sequence even in the later rounds (Fig. 2C middle). This suggests that the library preserved multiple viable configurations rather than converging on a single optimal solution. Notably, sequence similarity networks (SSN) (39) of the top enriched sequences from each round revealed the emergence of distinct clusters in the later rounds, highlighting patterns of convergence at the level of sequence clusters rather than individual residues (Fig. 2C, right). As later analyses indicate, these clusters represent distinct interaction modes evolved at the interface, revealing that the synthetic coevolution platform has captured multiple binding solutions.

Divergent evolution of synthetic interfaces captured through sequence clustering, structures, and fitness landscape

To visualize the relationships among the enriched sequences in the round 5 NGS data, we constructed an SSN (39). Sequence communities within the SSN were merged into single nodes, resulting in a community map comprising seven distinct clusters (Fig. 3A). Although the overall amino acid composition across all eleven library positions remained diverse, sequence logos for each cluster showed patterns of convergence, suggesting distinct coevolutionary pathways to the different clusters. Surprisingly, we only saw clusters with convergence of both interfaces, suggesting that the partners had ‘homed in’ on one another through their library patches, albeit with imperfect structural alignments of the libraries (discussed below) (Fig. 1C). To further illustrate the specificity and orthogonality of interactions, we used a Circos plot (40) to represent pairwise relationships between Z-A and Z-B proteins within the round 5 NGS data (Fig. 3B). The plot revealed minimal cross-reactivity (orthogonality) between clusters, with most sequences paired within their respective clusters.
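The network construction can be illustrated with a minimal sketch: concatenated 11-residue library sequences become nodes, and two nodes are joined when their edit distance is within the threshold of 3 given in the Fig. 2C legend. Several details are assumptions rather than the authors' pipeline: the threshold is treated as inclusive, connected components stand in for whatever community-detection step was actually used, and the toy sequences are invented for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def build_ssn(seqs, threshold=3):
    """Edges (i, j) for sequence pairs within the threshold (inclusive: assumption)."""
    return [(i, j) for i, j in combinations(range(len(seqs)), 2)
            if edit_distance(seqs[i], seqs[j]) <= threshold]


def connected_components(n, edges):
    """Group nodes with union-find; a simple stand-in for community detection."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical concatenated Z-A + Z-B library residues (11 letters each).
toy = ["MFLIVMFLIVM", "MFLIVMFLIVL", "MFLIVMFLVVL", "VVVVVIIIIIF", "VVVVVIIIILF"]
edges = build_ssn(toy)
clusters = connected_components(len(toy), edges)
print(clusters)
```

Merging each resulting group into a single node gives the kind of community map described above. The pairwise comparison is quadratic in the number of sequences, so a real data set would likely need pre-filtering or an indexed search.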

Figure 3. — (A) A community map of sequence similarity network (SSN) of round 5 sequences. Clustered communities are merged into single nodes to form a community map. Seven distinct clusters were identified, each exhibiting distinct patterns of convergence, as illustrated in the sequence logos.

(B) A Circos plot (40) representing pairwise relationships between Z-A and Z-B proteins in round 5 next-generation sequencing (NGS) data. Each pair is normalized to have equal area, providing a visual representation of the approximate specificity or cross-reactivity of each sequence. The plot highlights high orthogonality, with sequences primarily pairing within their respective clusters and minimal cross-cluster interactions.

(C) Validation of representative interactions. (top) A table of representative pairs from each cluster. (Bottom) On-yeast cleavage-capture assay confirms strong binding affinities for all tested pairs, validating functional interactions. Data are mean ± SD; n = 3 independent replicates.

(D) Structural diversity in synthetic interface docking orientations. (Left) Superimposed crystal structures of synthetic interface complexes. (Right) Docking angle variations of Z-B helix 3 relative to the Cluster 5 structure highlight the diversity in docking geometries among clusters.

(E) Conserved docking orientations in natural interface complexes. Structures from a previous study (27) reveal that natural interface complexes of the Z-domain and its affibody binder exhibit relatively consistent docking angles, in contrast to the synthetic interface complexes in panel D.

(F) The relative energy landscapes of the 20 most accessible wells for the natural interface (LL1 library) and the synthetic interface. Energy is defined as the negative fitness (47). For ease of comparison, we computed a relative energy for each sequence within each interface by subtracting the energy of the strongest-binding sequence. We do so because relative energies are comparable between the two plots (see Section C of Supplementary Information for details), whereas energies are not. The width of each energy well corresponds to the accessibility (see main text and Methods) of that well. In each subplot, the most accessible energy well is marked by the curly brace labeled with its accessibility. For the natural interface, red dots denote the 20 most accessible sequences that are not crystal structures, while colored dots correspond to the crystal structures in Panel E, all falling into the same well, consistent with their conserved docking modes. For the synthetic interface, dots are colored according to the colors of the structures in Panel D based on their corresponding clusters. To visualize the geometry of each landscape, we placed the 20 most accessible wells in a random order, except that for the natural interface the deepest energy well is always centered and only the remaining wells are randomly ordered.

To explore the structural characteristics of coevolved complexes, we selected a representative of the most enriched sequences from each cluster and validated their binding using the cleavage-capture assay, confirming strong binding (Fig. 3C). For six of the seven clusters, we solved crystal structures of one coevolved complex, with resolution ranging from 1.43 to 2.85 Å (Table S1). Structural analysis revealed considerable diversity in the docking orientations across the complexes, whilst maintaining unidirectional polarity (i.e., we did not see reverse docking orientations), analogous to the hands of a clock. Docking angles varied by up to 91.1° between A5B5 and A7B7 complexes, and none of the coevolved docking poises resembled the initial docked poly-Val model (Fig. 3D and fig. S4). This wide diversity in ratchet-like docking orientations contrasts with our previous study (27), where highly constrained docking modes were observed in coevolved complexes involving the natural interface of the Z-domain and its affibody binder (Fig. 3E). Inter-residue distances across the natural interface complexes were highly conserved, whereas those of the synthetic interface complexes varied substantially (fig. S5). The buried surface areas (BSA) across the coevolved interfaces ranged from 1080 to 1475 Å², which are on the smaller side for PPIs but remain consistent with previous structural studies (11, 41). To assess interface packing, we calculated shape complementarity (sc) (42), packstat scores (43), and analyzed cold spots (44-46) across all solved natural and synthetic interface complexes (fig. S6). We identified a single cold spot (44) in the parent structure of the natural interface (fig. S6A), whereas no cold spots were detected in any of the coevolved natural or synthetic interfaces, indicating that the coevolved interfaces are well-packed.

Despite the library composition being limited to generally hydrophobic residues, the diversity in docking orientations of the synthetic interface clusters, spanning a range of approximately 90 degrees in 15-degree increments, suggests distinct fitness landscapes were sampled during coevolution. To elucidate characteristics of the overall fitness landscape, we employed a simple in silico coevolution simulation (one mutation at a time, and only improving the fitness), generating 10 million trajectories for each interface, using our learned fitness function from SPM (Methods). For each sequence, we computed its accessibility, defined as the fraction of all trajectories that landed at that sequence. Highly accessible sequences represent energy wells in the energy landscape where coevolutionary pathways converged (energy is the negative fitness (47)). We noticed a striking difference in the number and depth of energy wells between the natural and synthetic interfaces (Fig. 3F). For the natural interface, the energy landscape was dominated by a single highly accessible well containing all strongly binding sequences. In contrast, the energy landscape of the synthetic interface contained a multiplicity of comparably accessible wells that each contained strongly binding sequences (Fig. 3F and fig. S7). These differences in well depths and accessibilities between the natural and synthetic interfaces were substantiated as statistically significant through a bootstrap-based statistical test (fig. S8 and Methods), with P = 4.7 × 10⁻³¹⁰ and P = 2.9 × 10⁻¹⁴⁷, respectively. We additionally characterized the geometries of the energy landscapes experimentally as follows. We measured K_D values by SPR for a subset of the variants in the energy landscape for each of the synthetic and natural interfaces. Specifically, we leveraged K_D measurements for 6 variants in the synthetic interface experiments appearing in the specificity matrix (fig. S9) that happen to be among the 20 most accessible appearing in our energy landscape plot. Their measured K_D values ranged from 0.4 to 8.8 μM, which corresponds to an energy gap of 1.8 kcal/mol. For the natural interface, we obtained SPR measurements for the variant corresponding to the deepest energy well and 4 variants that lie at the bottom of 4 other wells in the natural landscape. The K_D value of the variant corresponding to the deepest energy well was 2.4 nM (27) and for the other 4 sequences, ranged from 6.7 μM to >100 μM (i.e., beyond the SPR measurable limit of 100 μM) (fig. S9D), for which the smallest energy gap between the deepest well variant and the other variants is 4.7 kcal/mol (from the 6.7 μM variant). Importantly, this natural interface landscape energy gap (conservatively estimated because it is the smallest one) is 2.6 times larger than the measured energy gaps in the synthetic interface.
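The trajectory simulation and the accessibility measure described in this paragraph can be sketched on a toy landscape. The code below is an illustration under stated assumptions, not the authors' implementation: `toy_fitness` is a made-up two-peak function standing in for the learned SPM fitness, starting sequences are drawn uniformly at random, and each step picks a random fitness-improving single mutation (the text specifies only "one mutation at a time, and only improving the fitness").

```python
import random
from collections import Counter

ALPHABET = "MFLIV"


def toy_fitness(seq: str) -> float:
    """Hypothetical two-peak fitness; a stand-in for the learned SPM landscape."""
    match_a = sum(x == "M" for x in seq)  # similarity to the all-M peak
    match_b = sum(x == "V" for x in seq)  # similarity to the all-V peak
    return max(match_a, 0.9 * match_b)


def adaptive_walk(start, fitness, rng):
    """Accept one improving point mutation at a time until none remains."""
    current = start
    while True:
        base = fitness(current)
        better = []
        for i in range(len(current)):
            for aa in ALPHABET:
                mutant = current[:i] + aa + current[i + 1:]
                if aa != current[i] and fitness(mutant) > base:
                    better.append(mutant)
        if not better:
            return current  # local optimum, i.e. the bottom of an energy well
        current = rng.choice(better)


def accessibility(fitness, length, n_traj, seed=0):
    """Fraction of trajectories that end at each sequence."""
    rng = random.Random(seed)
    ends = Counter()
    for _ in range(n_traj):
        start = "".join(rng.choice(ALPHABET) for _ in range(length))
        ends[adaptive_walk(start, fitness, rng)] += 1
    return {seq: count / n_traj for seq, count in ends.items()}


acc = accessibility(toy_fitness, length=4, n_traj=2000)
# Relative energy: negative fitness, shifted so the best endpoint sits at zero.
best = max(toy_fitness(s) for s in acc)
relative_energy = {s: best - toy_fitness(s) for s in acc}
print(acc, relative_energy)
```

On this toy landscape every walk ends at one of the two peaks, so the accessibilities split between two wells of different depth. A landscape dominated by one well would instead concentrate nearly all trajectories on a single endpoint.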

Structural insights into specificity and cross-reactivity between clusters

To further elucidate the specificity and cross-reactivity of coevolved complexes, we analyzed the structural and sequence features of representative pairs from each cluster. Specificity between the seven representative pairs was validated by measuring relative binding affinities of all possible combinations of Z-A and Z-B proteins (Fig. 4A). The resulting specificity matrix revealed strong intra-cluster specificity, with cognate pairs occupying the diagonal positions. Some instances of cross-reactivity were observed, including A4B1 and A7B3 mixed complexes, which formed between non-cognate pairs. That clusters can cross-react, despite evolving distinct sequence and structural solutions, suggests that the different docking modes are separated by shallow energetic barriers, consistent with the fitness analysis (48). To complement these analyses, we measured solution affinities (K_D) by SPR for all combinations in the specificity matrix. Cognate pairs, as well as the noted cross-reactive pairs (A4B1 and A7B3), bound in the sub- to low-micromolar range (0.3 - 1.7 μM). Importantly, the solution K_D values correlated strongly with the on-yeast cleavage–capture HA-tag readout (Spearman r = 0.700, Pearson r = 0.868, R² = 0.754; fig. S9), confirming that the specificity matrix accurately reflects binding in solution. We also examined whether biophysical parameters aligned with these affinities (11); however, common metrics such as buried surface area (BSA), shape complementarity (sc), and the hydrophobic fraction of buried surface area (% hydrophobic ΔSASA) showed no correlation with K_D, underscoring that structural features alone are insufficient predictors of affinity in our synthetic interfaces (fig. S10).
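K_D values measured by SPR can be related to free-energy differences through the standard relation ΔΔG = RT ln(K_D,weak / K_D,strong). The sketch below applies it to the K_D ranges quoted earlier for the energy landscapes (0.4 to 8.8 μM for the synthetic interface; 2.4 nM versus 6.7 μM for the natural interface). The temperature of 298.15 K is an assumption, since the measurement temperature is not stated in this section, and the function name is illustrative.

```python
import math

R_KCAL = 1.987204e-3  # molar gas constant in kcal/(mol*K)


def energy_gap(kd_weak: float, kd_strong: float, temperature: float = 298.15) -> float:
    """Free-energy gap RT*ln(kd_weak/kd_strong) in kcal/mol.

    Both K_D values must be in the same units; 298.15 K is an assumed temperature.
    """
    return R_KCAL * temperature * math.log(kd_weak / kd_strong)


synthetic_gap = energy_gap(8.8e-6, 0.4e-6)  # weakest vs strongest synthetic variant
natural_gap = energy_gap(6.7e-6, 2.4e-9)    # nearest competing well vs deepest well

print(round(synthetic_gap, 1), round(natural_gap, 1),
      round(natural_gap / synthetic_gap, 1))
```

With these inputs the gaps round to 1.8 and 4.7 kcal/mol, a ratio of about 2.6, in line with the values reported for the two landscapes.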

Figure 4. — (A) Specificity matrix of all possible Z-A and Z-B combinations of representative pairs in Fig. 3C. Binding affinities measured by on-yeast cleavage-capture assay were normalized to the highest affinity (A4B4 complex) and non-interacting control (11xAla mutant). The matrix shows high intra-cluster specificity and some inter-cluster cross-reactivity (e.g., A4B1, A7B3). Data are mean of n = 3 independent replicates.

(B) Workflow for scoring sequence-structure compatibility using Frame2seq. All library sequences are scored for compatibility with their representative structure.

(C) Experimentally determined specificity matrix of representative sequences (top) and predicted Frame2seq interface score matrix of all library sequences (bottom) (Spearman correlation = −0.746). Frame2seq scores are negative log likelihoods (lower scores mean higher fitness).

(D) Frame2seq interface scores of round 5 sequence similarity network (SSN) cluster sequences on all representative crystal structures (see Methods for details on representative structure selection). The correct (cognate) sequence-structure pairings are in blue, and the incorrect (noncognate) pairings are in orange.

(E) Interface structures reveal converged residue interactions. (top) Interface structures of coevolved complexes highlight converged residues in stick representation, with other library residues shown as lines. The center of mass (COM) of the converged residues is depicted as a grey sphere at the interface, illustrating their clustering in proximity. Each structure exhibits distinct COM positions, reflecting different interaction and docking geometries across clusters. (Bottom) Sequence logos of each cluster illustrate residue convergence, with converged residues marked by asterisks. These data demonstrate how pairwise residue convergence underpins cluster-specific interactions and correlates with distinct docking orientations, highlighting the structural determinants of specificity within coevolved interfaces.

(F) Pairwise specificity matrix highlighting biased cross-reactivity between cluster 3 and 7 pairs (left). A7 exhibits dual specificity by binding both B3 and B7, while A3 binds only B3. The interaction intensity in the left matrix is derived from the specificity matrix in Fig. 4A. Interfaces and top views of the A3B3 (pale cyan/cyan), A7B3 (slate/skyblue), and A7B7 (salmon/dark salmon) crystal structures (right). Z-A subunits are shown as surface representations and Z-B subunits as cartoons, with residue 29^B side chain atoms shown as spheres. Top views reveal distinct docking geometries: in the A7B7 complex, the Z-B subunit is rotated by 53.9° relative to the A7B3 complex. Models of putative A3B7, which fails to form a stable complex, are presented in fig. S13.

The experimental cross-reactivity matrix for the seven representative sequence pairs implies that a surprisingly high degree of orthogonality arises from the relatively limited choice of hydrophobic residues at each library position. To probe the generality of this observation beyond individual sequences, we sought to comprehensively map all sequences in the clusters to the six crystal structures. To do so, we used Frame2seq (32) to quantify sequence-structure compatibility in silico (Fig. 4B, Methods). We provided the experimentally solved crystal structure backbones and sequences outside the library positions as inputs to the model and computed Frame2seq model scores to estimate sequence-structure compatibility for the library residues at the interface (Methods; Eq. S1). We analyzed (i) chain pairing specificity (Fig. 4C) and (ii) compatibility of all selected sequences in each cluster with each of the six crystal structures (Fig. 4D). To assess chain pairing specificity, we scored all possible Z-A and Z-B combinations with cognate (diagonal) and non-cognate (off-diagonal) chain pairings (Fig. 4C). The resulting chain pairing confusion matrix correlates strongly with the experimentally validated specificity matrix (Spearman correlation = −0.746; the sign is negative because lower Frame2seq scores indicate higher fitness). This level of accuracy in predicting specificity is surprising given the relatively shallow fitness landscapes between clusters, and it encourages the use of such models for mapping sequence-structure space. To further assess the generalizability of Frame2seq, we benchmarked it against the Flex ddG dataset of experimentally measured interface mutations (49), where the model successfully distinguished destabilizing from stabilizing mutations (fig. S11), consistent with its predictive accuracy on our synthetic complexes.
To assess sequence-structure mapping, we scored all possible paired cluster sequences for compatibility with each of the six experimentally solved structures representing these clusters (Fig. 4D). Frame2seq consistently scores cognate sequence-structure pairings as forming more favorable (lower-scoring) interfaces. These results suggest that the sequences in each of the identified clusters indeed map well to the six representative crystal structures.
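
The comparison between an experimental specificity matrix and a predicted score matrix can be sketched as a rank correlation over matched matrix entries. The sketch below is illustrative only: the function names and the toy 3×3 matrices are hypothetical and are not the measured data, and it assumes a plain Spearman correlation over all entries with tied values given average ranks. A negative coefficient is expected because lower Frame2seq scores (negative log likelihoods) correspond to stronger binding.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


def flatten(matrix):
    """Row-major flattening so that entries of two matrices stay matched."""
    return [value for row in matrix for value in row]


# Hypothetical toy data: rows are Z-A variants, columns are Z-B variants.
binding = [[1.0, 0.1, 0.0],
           [0.2, 0.9, 0.1],
           [0.0, 0.3, 0.8]]   # normalized binding (higher = stronger)
scores = [[1.2, 2.9, 3.5],
          [2.7, 1.4, 3.0],
          [3.6, 2.5, 1.6]]    # negative log likelihood (lower = better)

rho = spearman(flatten(binding), flatten(scores))
print(f"Spearman correlation: {rho:.3f}")
```

Flattening both matrices in the same order keeps each cognate or non-cognate pairing aligned with its predicted score, which is what makes the rank comparison meaningful.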

We examined the high-resolution crystal structures of coevolved complexes from each cluster to understand the basis of both specificity and cross-reactivity. The structures revealed a wide range of interface packing solutions, with converged residue interactions driving specificity within clusters (Fig. 4E). Converged residues, marked with asterisks on the sequence logos, were highlighted in stick representation on the structures. These residues cluster closely within the interfaces, emphasizing that pairwise convergence observed in NGS data correlates strongly with cluster-specific interactions and distinct structural determinants that define docking orientations.

Interestingly, beyond intra-cluster specificity, we also observe inter-cluster cross-reactivity that can result in dual or multi-specificity by a single binding partner. To assess how common these cross-reactivities are, we quantified inter-cluster cross-reactivity among the sequences from the SSN clusters (Fig. 3A). We found that 19.9% of Z-A sequences exhibited at least one inter-cluster cross-reactivity, whereas only 3.4% of Z-B sequences did so, indicating that cross-reactivity occurs primarily on the A chain (fig. S12). This is exemplified by the A7 protein, which binds strongly to both B3 and B7 with comparable affinities (K_D = 487 nM for A7B7; 551 nM for A7B3) (Fig. 4F and fig. S9). Structural analysis revealed that A7 adopts different docking modes depending on its partner. In the A7B3 complex, the Z-B subunit is rotated by 53.9° relative to the A7B7 complex, and the H1 helix of A7 shifts by 4.1 Å (Fig. 4F and fig. S13). This demonstrates how sequence convergence within clusters supports intra-cluster specificity, while inter-cluster cross-reactivity emerges from docking plasticity, allowing a shared partner to navigate distinct energy wells through structural adaptation. Notably, cross-reactivity was also directional: A7B3 forms a complex, but the other possible cross-reactive pair, A3B7, fails to interact because of a steric clash caused by a Met29^BPhe substitution in Z-B (fig. S13). Interestingly, position 29^B strongly converges to Met in cluster 3 and Phe in cluster 7 (Fig. 4E), suggesting that these residues contribute to cluster-specific docking preferences and impose structural constraints on inter-cluster cross-reactivity.

Structural and epistatic signatures underlying synthetic and natural protein-protein interfaces

Our hypothesis in the design of the coevolution experiment was that the library patches would converge on one another through inter-chain epistasis. To ask how structural convergence occurred, we first curated all atomic contacts < 4 Å between subunits in all of our crystal structures (Fig. 5A, Table S2). We then classified randomized positions as “library” and constant positions as “framework”, denoting contacts as library-on-library ‘LL’ (A_lib-B_lib), library-on-framework ‘LF’ (A_lib-B_frame, A_frame-B_lib), or framework-on-framework ‘FF’ (A_frame-B_frame). The cores of all interfaces are dominated by LL contacts, which presumably act as “hotspots” to initiate complex formation at the early stages of selection. The centrally located LL patches are not perfectly structurally coincident; thus LF contacts occur at the peripheries of the misaligned LL regions, and FF interfaces occur adventitiously at the outermost edges of the interfaces.

Figure 5. — (A) Contacts between Z-domain chains. For each structure, chain A is shown as a surface representation and chain B as a cartoon with contacting residue side chains shown as sticks. Noncontacting residues are colored as in panel A, noncontacting library residues are colored blue or slate, contacting framework residues are gray or white, chain A contacting library residues are red, and chain B contacting library residues are salmon. The noncontacting helix of chain B is omitted for clarity. Below each structure is a pie chart of atomic contacts between helical residues: Framework-framework (gray), A library – B framework (red), A framework – B library (salmon), A library – B library (dark red). A representative natural interface Z-domain-affibody pair (LL2.c22) is shown in cyan.

(B) Dot plots of contact ratios (library-on-library contacts over library-on-framework contacts) and SPM epistasis effect size ratios (inter-chain epistasis effect sizes over intra-chain epistasis effect sizes) for the crystal structures. For each type of interface, the black horizontal bar marks the average contacts ratio or epistasis effect size ratio of the corresponding crystal structures. Each crystal structure uses the same markers in the two plots.

We sought to quantify the distinct patterns of structural interactions in the natural versus synthetic interfaces by comparing their contact ratios, defined as the number of LL contacts (A_lib-B_lib) divided by the number of LF contacts (A_lib-B_frame and A_frame-B_lib) in the crystal structures. Synthetic interfaces spanned a wide range of contact ratios, with an average of 0.61; whereas the natural interfaces had a narrower range of contact ratios and a lower average: 0.22 for the LL1 library and 0.33 for the LL2 library (Fig. 5B left). We hypothesized that the library-to-library interactions might be related to inter-chain epistasis, whereas library-on-framework interactions might be related to intra-chain epistasis. Indeed, we found that a higher ratio of inter-chain over intra-chain epistasis effect sizes tended to correspond to a higher ratio of LL to LF contacts (Fig. 5B right, fig. S14). Notably, the de novo complex formation has a higher proportion of LL contacts compared to the natural complex, suggesting LL contacts may be primary drivers of de novo complex formation. Presumably this difference arises from the natural protein partners being trapped in the deep energy well of the natural docking mode; within that well, the primary role of epistasis is to optimize contacts in the existing interface rather than to explore new interfaces.
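
As a minimal sketch of the contact ratio defined above, the snippet below classifies inter-chain atomic contacts as LL, LF, or FF and divides the LL count by the LF count. The function names, the library position sets, and the toy contact list are hypothetical; in particular, each tuple is assumed to represent one atomic contact already filtered to < 4 Å and annotated with the residue numbers of its chain A and chain B atoms.

```python
def classify_contact(a_is_library, b_is_library):
    """Label one inter-chain contact by the origin of its two residues."""
    if a_is_library and b_is_library:
        return "LL"   # library-on-library
    if a_is_library or b_is_library:
        return "LF"   # library-on-framework (either direction)
    return "FF"       # framework-on-framework


def contact_ratio(contacts, library_a, library_b):
    """Return (LL/LF ratio, per-class counts) for a list of atomic contacts.

    contacts: iterable of (residue_in_chain_A, residue_in_chain_B) tuples.
    library_a, library_b: sets of randomized positions on each chain.
    """
    counts = {"LL": 0, "LF": 0, "FF": 0}
    for res_a, res_b in contacts:
        label = classify_contact(res_a in library_a, res_b in library_b)
        counts[label] += 1
    if counts["LF"] == 0:
        raise ValueError("contact ratio is undefined without LF contacts")
    return counts["LL"] / counts["LF"], counts


# Hypothetical toy interface (positions are illustrative only).
library_a = {10, 14, 17}
library_b = {29, 33}
contacts = [(10, 29), (14, 33), (14, 33), (17, 40), (20, 29), (21, 41)]

ratio, counts = contact_ratio(contacts, library_a, library_b)
print(counts, ratio)
```

Counting atomic contacts rather than residue pairs means that a residue pair with several close atom pairs contributes more than once, which matches the atomic-contact curation described for the crystal structures.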

Epistatic dissection of coevolution in synthetic and natural interfaces

We next sought a more detailed understanding of how epistasis and structure play out during coevolution. Physical interactions could form in part through intermolecular (ZA-ZB) epistasis of library amino acids on their respective surfaces, and also through intramolecular epistasis within each binding partner (ZA or ZB). Intramolecular epistasis could preorganize a binding surface independently of a partner to make it more likely to bind to any partner. To determine which mechanisms may drive different stages of coevolution, we leveraged our epistasis analysis to reveal critical interactions in the overall SPM-predicted fitness landscape as well as those in the individual SPM-predicted fitness landscapes associated with each selection round (see Methods, “Extraction of epistasis importance from SPM-predicted fitness landscape” section). We also applied a similar analysis to the Frame2Seq sequence-structure compatibility scores.

First, leveraging the SPM fitness, we found a strong correlation between pairwise epistasis importance and physical proximity (Spearman correlation = −0.550, fig. S15). Epistasis “importance” refers to the maximum amount that a given interaction can change the fitness. The weaker correlation for inter-chain pairs likely arises from diverse docking modes in the crystal structures, as some pairwise epistasis may be important for a particular docking mode. Residues involved in the strongest pairwise intra-chain epistatic interactions tended to lie on the same helix at the binding interface (Fig. 6A).

Figure 6. — (A) Left and center: Visualization of intra-chain pairwise Selection Probabilistic Modeling (SPM) epistasis overlaid onto the crystal structure of A5B5. Widths of the bonds connecting the library residues are proportional to the strength of the corresponding epistasis importance. Right: Top 10 inter-chain pairwise SPM epistasis importance. The widths of the lines are proportional to the corresponding epistasis importance.

(B) Left: Illustration of Ratio of average Epistasis Importance (REI). It is computed as the ratio of the average inter-chain SPM epistasis importance per term over the average intra-chain SPM epistasis importance per term for each order of interaction. Middle and right: Comparison of REI for each round of selection experiments of synthetic and natural interfaces.

(C) Pairwise epistasis terms per cluster structure, overlaid and aligned on Z-A (above) and Z-B (below) when viewed from the front (left) and back (right) of the complex. Spheres represent library residues, and bars represent epistatic interactions, where thickness represents epistasis strength as computed by Frame2seq and color represents the residue shared among all such interactions (which is also depicted in the legend on the right).

Next we inspected how contributions of intra-chain and inter-chain interactions likely changed over selection rounds by leveraging our round-specific epistasis analysis as a proxy. As expected, we found that both intra- and inter-chain epistasis increased with increasing selection strength (i.e., increasing round number) (fig. S16). The relative contribution of inter-chain over intra-chain epistasis importance decreases with increasing selection strength (Fig. 6B), suggesting that after establishing binding contact in the early rounds, each chain may then individually focus more on fine-tuning of its own energetics. In contrast, when performing the same analysis on the original Z-domain-affibody interface, the relative contributions did not systematically decrease (Fig. 6B), presumably because binding contact did not need to be established in the first place.
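
The ratio of average epistasis importance (REI) described in Fig. 6B can be sketched as follows. This is a simplified illustration, not the SPM pipeline: the function name and the toy importance values are hypothetical, and each epistasis term is assumed to be keyed by the tuple of (chain, position) sites it involves, so that a term is inter-chain when its sites span both chains and intra-chain otherwise.

```python
from collections import defaultdict


def rei_by_order(terms):
    """Compute REI for each interaction order.

    terms: dict mapping a tuple of (chain, position) sites to a
    non-negative epistasis importance. The order of a term is the
    number of sites it involves.
    """
    # sums[order][kind] = [total importance, number of terms]
    sums = defaultdict(lambda: {"inter": [0.0, 0], "intra": [0.0, 0]})
    for sites, importance in terms.items():
        chains = {chain for chain, _ in sites}
        kind = "inter" if len(chains) > 1 else "intra"
        acc = sums[len(sites)][kind]
        acc[0] += importance
        acc[1] += 1

    rei = {}
    for order, acc in sums.items():
        (inter_sum, inter_n), (intra_sum, intra_n) = acc["inter"], acc["intra"]
        # REI is only defined when both kinds of term exist at this order.
        if inter_n and intra_n and intra_sum > 0:
            rei[order] = (inter_sum / inter_n) / (intra_sum / intra_n)
    return rei


# Hypothetical pairwise importances for one selection round.
terms = {
    (("A", 14), ("B", 47)): 0.6,   # inter-chain
    (("A", 45), ("B", 33)): 0.4,   # inter-chain
    (("A", 10), ("A", 14)): 0.2,   # intra-chain
    (("B", 29), ("B", 33)): 0.3,   # intra-chain
}

rei = rei_by_order(terms)
print(rei)
```

Averaging per term before taking the ratio keeps the comparison fair when the numbers of inter-chain and intra-chain terms differ, as they generally do for a two-chain library.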

We additionally performed epistasis analysis on the sequence-structure compatibility scores derived from Frame2seq using each of the six crystal structures (Methods, “Extraction of structure-conditioned epistasis importance using Frame2seq” section). We found that many of the inferred epistatic interactions were conserved (involving equivalent residues) across all six structures, despite the different docking modes (Fig. 6C). Furthermore, several of the identified interactions were also seen in the SPM global fitness epistasis analysis (fig. S17). These results suggest that a set of key interactions are important in the formation of the interface, and that the different docking modes “pivot” around these core interactions.

Seed sequences in the coevolutionary paths

We wondered whether there were primordial “seed” contacts that could serve as the earliest sentinels of protein binding emerging from the formerly silent surfaces in the selections. Moreover, such key interactions could be preconditions that favor an evolutionary path toward a specific binding conformation. To attempt to extract the identity of seed sequences, we analyzed our simulated coevolutionary trajectories and curated potential seed sequences as those that are weak binders (with SPM-predicted fitness between 18.35 and 22.07) with relatively high “exclusivity” and “contribution” towards their most likely energy well (the well that their trajectories fall into most frequently; we considered only wells that correspond to sequences in Fig. 4A whose binding affinity had been experimentally measured with normalized HA-tag binding affinity > 0.8). Exclusivity of a sequence for a well refers to the fraction of trajectories from that sequence that end in that well; contribution of a sequence to a well is defined as the fraction of all trajectories ending in that well that are accounted for by that sequence (Methods).
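
The exclusivity and contribution definitions above can be sketched directly from counts of simulated trajectories. In the snippet below, the function name, the seed labels, and the toy trajectory counts are hypothetical; each trajectory is assumed to be summarized as a (starting sequence, final well) pair.

```python
from collections import Counter


def exclusivity_and_contribution(trajectories):
    """Return {(seed, well): (exclusivity, contribution)}.

    exclusivity  = trajectories from seed ending in well / all from seed
    contribution = trajectories from seed ending in well / all ending in well
    """
    trajectories = list(trajectories)
    pair_counts = Counter(trajectories)
    seed_totals = Counter(seed for seed, _ in trajectories)
    well_totals = Counter(well for _, well in trajectories)
    return {
        (seed, well): (n / seed_totals[seed], n / well_totals[well])
        for (seed, well), n in pair_counts.items()
    }


# Hypothetical toy simulation: two candidate seeds, two energy wells.
trajectories = (
    [("seed1", "A7B3")] * 8 + [("seed1", "A1B1")] * 2
    + [("seed2", "A7B3")] * 4 + [("seed2", "A1B1")] * 6
)

stats = exclusivity_and_contribution(trajectories)
excl, contrib = stats[("seed1", "A7B3")]
print(f"exclusivity = {excl:.2f}, contribution = {contrib:.2f}")
```

The two quantities answer different questions: exclusivity asks how committed a sequence is to one well, whereas contribution asks how much of the traffic into that well passes through the sequence.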

This curation yielded plausible seed sequences for sequences A7B3 and A1B1. From these seed sequences, we selected one sequence with high exclusivity and relatively high contribution for each of A7B3 and A1B1 for further analysis (fig. S18). We found only one energetically favorable coevolutionary path to A7B3 from its selected seed (Fig. 7A and fig. S19). Our epistasis analysis identified F45^A-F33^B as the highest-ranking inter-chain pairwise epistasis effect size along this path, suggesting its pivotal role in progression from weak to strong binding partners. This interaction is physically located in the core of the protein interface (Fig. 7B). In contrast to the case of A7B3, for A1B1 we found multiple viable coevolutionary paths from the selected seed (Fig. 7C and fig. S18B). For all of the sequences along these paths, the interaction V14^A-V47^B ranked as the highest or second highest in inter-chain pairwise epistasis effect size, suggesting its pivotal role in binding evolution; residues in this interaction pair are in close physical contact with each other (Fig. 7D). Altogether, these results suggest that key interactions not only help to usher the evolutionary path toward a specific binding conformation from a seed but are also maintained along the path. Moreover, distinct key interactions may drive the differences between docking modes.

Figure 7. — (A) Illustration of the only energetically favorable coevolutionary path from the identified seed (VFLLF--VVFFIV) to A7B3. The conservation of the top-ranking inter-chain epistasis between F45^A-F33^B is highlighted.

(B) Visualization of the conserved top ranking inter-chain epistasis F45^A-F33^B on the crystal structure.

(C) Illustration of energetically favorable coevolutionary paths from the identified seed (LMVVF--VVIFMV) to A1B1. The widths of the arrows are proportional to the number of simulated trajectories visiting the corresponding paths. Gray arrows represent minor paths that were collectively visited less than 15% of all the simulated coevolutionary trajectories. Conserved top ranking pairwise inter-chain epistasis V14^A-V47^B is highlighted in each sequence in the paths.

(D) Conservation of top-ranking inter-chain epistasis visualized on the structural model of A1B1: V14^A-V47^B is the highest or the second highest ranking inter-chain epistasis effect size for the seed, all the waypoints and A1B1. The structural model of A1B1 is built by making the F14^AV point mutation of the crystal structure A4B

Discussion

We have used a coevolution platform to try to understand the differences between surfaces of proteins that have evolved to bind to other proteins and those that have not. Our strategy was to simulate the ontogeny of new protein-protein interactions in vitro between “silent” surfaces of proteins with no natural history of protein-protein interactions and then deconstruct this process. By integrating experimental coevolution and statistical machine learning, this platform has revealed how structural and epistatic principles, including “seed” interactions, shape binding specificity and adaptability. Synthetic complexes adopted a range of docking modes, distinguished by subtly different interfaces, by sampling shallow, sawtooth energetic landscapes. This was not seen during coevolution of a complex mediated by natural Z-domain binding surfaces, which appeared to reside within a deep energy well (27).

Our results offer insight into why surfaces of proteins that have not evolved to engage ligands are notoriously difficult to drug (50). Natural protein binding sites have been evolutionarily refined over millennia, appearing to possess structural and energetic imprints that make them conducive for binding and as drug targets (6, 51). It is well established that combinatorial libraries undergoing unbiased selections, such as peptide and antibody phage libraries, frequently converge on the natural binding sites of proteins, which serve as the most “druggable” sites for both small molecules and protein therapeutics (6). In contrast, regions of protein surfaces that do not engage ligands were likely evolutionarily “counter-selected” to avoid spurious interactions with non-specific proteins, and this likely contributes to lack of “druggability.” Targeting non-binding regions of proteins may be limited by a relative lack of evolutionary fitness that results in a dearth of structural features conducive to binding (1, 2, 6). Consequently, non-binding surfaces of proteins may be constrained to shallow energetic landscapes that have evolved to avoid non-specific interactions with irrelevant molecules. Our synthetic coevolution experiment overcame this limitation through molecular diversity, but even so, the resulting complexes exhibited a sawtooth energetic landscape that explored parallel evolutionary pathways to diverse structural solutions. This plasticity may have been further facilitated by the hydrophobic nature of our designed library, which promotes chemically compatible packing across multiple configurations, as shown by the lack of cold spots in the synthetic interfaces despite the shallow energy landscapes (38, 44). Underscoring this shallow energetic landscape, we observe that docking modes can interconvert in a ratchet-like fashion through only a few amino acid mutations, illustrating a remarkable potential for plasticity in molecular recognition.

A limitation of our coevolution experiment is that we used a biased library restricted to hydrophobic amino acids. We designed our library to be within the experimental limits of diversity of yeast display, which is about 10⁹. We chose 11 sites to cover a surface area that led to binding interfaces on the smaller side for PPI (~1080-1475 Å²) but still common (11, 41), and that allowed complete sampling of the theoretical diversity of the library (3.6×10⁸). We also wanted the library to be similar in composition to the library we previously used for remodeling the protein Z-affibody interface (27) so that we could make comparisons to the synthetic interfaces in the current manuscript. Indeed, several of our results are consistent with previous studies on evolution of PPI based on analysis of all 20 natural amino acids. For example, we found that dual-binding states (i.e., promiscuous intermediates (13)) frequently occur along shortest viable mutational paths between co-evolved crystal backbones (as per our SPM-estimated fitness); this result suggests that the biased libraries are capable of mimicking characteristics of natural repertoires at a systems level. We also did not see correlations between measured biophysical parameters (buried surface area (BSA), shape complementarity (Sc), and the hydrophobic fraction of buried surface area [% hydrophobic ΔSASA]) and pK_D as measured by SPR for our synthetic interfaces, as previously shown in (11). Consistent with well-packed natural PPI using all twenty amino acids, we found that the synthetic interfaces were well packed, lacking cold spots (44, 45), showing that even our limited library is capable of evolving natural-like interfaces.

Concordant with previous studies showing the importance of epistasis in natural protein evolution (15, 46, 52), we also found that epistasis plays a critical role in the evolution of both synthetic and natural interfaces, but distinct patterns emerged in each. Synthetic interfaces exhibit higher inter/intra epistasis importance ratios during early selection rounds, suggesting that inter-chain interactions played a stronger role in determining binding modes at the onset of selection, compared to the affinity maturation stage (Fig. 6B). These inter-chain interactions may have provided an initial structural foothold for binding, enabling intra-chain refinements in later rounds to optimize stability and specificity. We speculate that the original high affinity Z-domain-affibody complex, trapped in a deep energy well that constrains its ability to adapt to the epistasis through reorientation of docking modes, distributes the epistasis more evenly for interface repacking. As a result, natural interfaces maintain a balanced inter/intra epistasis importance ratio throughout selection.

It is instructive to ask how our results on a model system inform our understanding of the initiation step of complex formation during natural protein-protein coevolution. Presumably, naïve proteins acquired mutual affinity through evolution of mutations on complementary surfaces under a selective pressure over a long time period (53). In such a setting, natural coevolution is iterative over time, where advantageous mutations that stabilize the complex accumulate and are fixed, followed by further sampling of mutants until a functional threshold, such as affinity, is met, at which point further affinity-enhancing mutations become superfluous (29, 54, 55). In our system, seed interactions, such as F45^A-F33^B in A7B3 and V14^A-V47^B in A1B1, appear to act as stabilizing anchors, linking seed sequences to refined binding interfaces, while preserving evolutionary flexibility, as evidenced by the divergent coevolutionary trajectories. We suggest that such mechanisms likely model mechanisms of protein-protein evolution that occur in nature. During natural protein-protein evolution, it is unlikely that a suitable constellation of mutations would arise simultaneously at the seed stage that would confer high affinity in a single step (56). Thus, we suggest that our model system has captured the essence of a natural mechanism for the initiation of protein-protein complexes.

There are examples of artificially designed protein-protein interfaces that leverage molecular docking followed by side chain design calculations to create novel interfaces (57-59). These methods sample shape complementarity extensively through docking but generally lack the ability to capture nuanced side chain-guided interfacial plasticity because of the hierarchical nature of the design procedures, as well as challenges associated with overcoming local energy minima. Epistatic clusters of residues are secondary considerations in these types of design methods, which typically rely upon the design algorithm to capture them. We used docked poses without specific sequences to set up the residue positions that undergo sequence drift and arrived at sequences that convey orthogonality through new binding orientations, not easily designable by conventional methods (60, 61).

Artificial intelligence is making rapid strides in prediction and design of binding proteins (62-65), but these in silico approaches remain largely agnostic to the biophysical chemistry and evolutionary mechanisms by which protein-protein interactions (PPIs) are formed. Generative AI approaches are effective at producing binders to proteins, but are generally most effective targeting natural binding surfaces that have evolved to bind ligands (59). Both targeting naïve surfaces of proteins and achieving fine specificity among highly similar surfaces remain challenging. Understanding the structural principles underlying protein interfaces at a systems level, using large datasets of paired coevolved protein-protein complexes, could help to obtain deeper insight into molecular recognition and also bridge the gaps in current AI-based protein design approaches. Such challenges are difficult for existing models trained largely on static and monomeric structures, which do not explicitly capture how sequence changes reshape interfaces to mediate specificity and cross-reactivity. Our approach might provide insights to construct improved ML-based protein engineering strategies for “undruggable” therapeutic targets (41, 66-68).

Materials and Methods

Z domain-Z domain docking

A Z-domain was isolated from the crystal structure of Protein Z in complex with an in vitro selected affibody (PDB: 1LP1). The Z-domain structure was minimized using the Rosetta FastRelax protocol, applying backbone coordinate constraints to preserve the overall fold. A poly-valine variant of the Z-domain, where all amino acids were substituted with valine, was generated using RosettaRemodel. This poly-valine Z-domain was then docked against a second poly-valine Z-domain using PatchDock. Docked models were ranked based on their approximate interface area. Models with an interface area greater than 1000 Å² were subjected to refinement with the Rosetta FastRelax protocol under the same constraints. The refined models were ranked by Rosetta Energy Units (REU), and the 25 lowest energy structures were manually inspected in PyMOL. The final model was selected based on both energy and interaction mode preference (helices13 interacting with helices23) from the PatchDock set with interface areas exceeding 1000 Å².
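
The filter-then-rank step described above can be sketched as follows. This is only an illustration of the selection logic, not the Rosetta or PatchDock workflow itself: the record fields, model names, and numerical values are hypothetical, and each record is assumed to carry both the PatchDock interface area and the energy obtained after refinement.

```python
def select_docked_models(models, min_area=1000.0, n_keep=25):
    """Keep models above the interface-area cutoff, then return the
    n_keep lowest-energy models (most negative REU first)."""
    passing = [m for m in models if m["interface_area"] > min_area]
    return sorted(passing, key=lambda m: m["energy_reu"])[:n_keep]


# Hypothetical docked models (areas in Å², energies in REU).
models = [
    {"name": "m1", "interface_area": 1200.0, "energy_reu": -310.5},
    {"name": "m2", "interface_area": 950.0, "energy_reu": -330.0},
    {"name": "m3", "interface_area": 1050.0, "energy_reu": -325.2},
]

selected = select_docked_models(models)
print([m["name"] for m in selected])
```

Applying the area cutoff before ranking means a low-energy model with a small interface (m2 in the toy data) is excluded even though it would otherwise rank first.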

The original protein sequence was restored onto each Z-domain in the selected docked model, which underwent further refinement with Rosetta FastRelax. Target residues on the interface of one Z-domain were identified based on this refined model. These target residues were used to generate a rotamer interaction field (RIF) with RIFDock. A second round of PatchDock was performed, targeting the same residues used during the RIF generation step. PatchDock models with an interface area greater than 1000 Å² were used as seeds for RIFDock runs. RIFDock-generated models underwent additional refinement using the FastRelax protocol with coordinate constraints and were ranked by energy. The lowest energy models were visually inspected in PyMOL, revealing a convergence to a docked conformation within 0.45 Å of the best PatchDock model. This converged model was selected for further sequence design.

The docked model underwent refinement using RosettaDock local-refinement as well as side-chain packing and interface redesign with Rosetta FastDesign. Models were filtered by ddg (threshold = −15) and SASA (threshold=800). Redesigned models were ranked by energy, and top-ranking designs were analyzed to identify trends in residue preferences, which guided the selection of positions for constructing a degenerate codon library.

Protein expression and purification

The protein Z domain-encoding DNA plasmids were inserted into the pET28 bacterial expression vector. The vector includes the Z-domain gene with either a C-terminal His₆-tag or biotin-acceptor peptide tag (BAP tag, GLNDIFEAQKIEW) followed by His₆-tag, inserted between the NcoI and XhoI sites of pET28b vector (Novagen). For expression, the plasmids were transformed into E. coli BL21 (DE3) cells, which were cultured in TB medium containing 50 mg/L kanamycin at 37°C. When the optical density at 600 nm (OD₆₀₀) reached 0.6, protein expression was induced with 0.5 mM isopropyl-β-D-thiogalactoside (IPTG), and the cultures were incubated overnight at 25°C before cell harvesting. Protein purification was carried out using Ni²⁺-NTA agarose column chromatography (Ni-NTA, Qiagen), followed by further purification with size-exclusion chromatography on a Superdex S75 10/300GL Increase column (GE Healthcare). The final protein products were stored in HEPES-buffered saline (HBS; 20 mM HEPES, pH 7.5, 150 mM sodium chloride).

Yeast display of single-chain Z domain dimers

Single-chain Z protein dimers were expressed on the surface of Saccharomyces cerevisiae strain EBY100 (Invitrogen, cat. no. C839-00) through fusion to the C-terminus of the Aga2 protein. The dimers, linked via a GS-linker containing a 3C protease cleavage site, were positioned between an N-terminal cMyc epitope and a C-terminal HA tag. The construct, formatted as N-cMyc-ZA-linker-ZB-HA-C, was cloned into the pCT302 vector (Addgene #41845). Competent yeast cells were transformed with Z-domain plasmids using a yeast transformation kit (Zymo Research T2001) and plated on SDCAA agar plates (Teknova). Plates were incubated at 30°C for two days until colonies formed. Single colonies were picked and cultured in SDCAA media (pH 4.5, 20 g dextrose, 6.7 g yeast nitrogen base, 5 g Bacto casamino acids, 10.4 g sodium citrate and 6.4 g citric acid monohydrate dissolved in 1 liter of deionized H₂O, supplemented with 10 ml of Gibco™ Penicillin-Streptomycin, 10,000 U/ml) until reaching an OD₆₀₀ of 10. Cultures were then induced at 20°C for 24 hours by diluting to an OD₆₀₀ of 1.0 in SGCAA medium (similar to SDCAA but containing 20 g galactose instead of dextrose). Protein display levels were validated by staining cells with Alexa Fluor 647-labeled anti-HA antibody (1:50 dilution; Cell Signaling Technology, cat. no. 3444S). Fluorescence signals were analyzed using flow cytometry (Beckman Coulter, CytoFLEX).

On-yeast cleavage-capture assay

For the single clone cleavage-capture assay, colonies were selected from transformed EBY100 cells that were plated on SDCAA agar plates. A total of 5 × 10⁵ induced yeast cells were stained with Alexa Fluor 647-labeled anti-HA antibody at a 1:50 dilution. After staining, the cells were washed thoroughly with MACS buffer (autoMACS® Running Buffer, Miltenyi, cat. no. 130-091-221) to remove unbound antibodies. The washed cells were then incubated in 20 μL of 3C protease cleavage solution, prepared by diluting lab-made 3C protease to 0.4 mg/mL in MACS buffer, and maintained at 4°C. At each time point, 2 μL of the sample was taken and diluted in 100 μL of ice-cold MACS buffer. The fluorescence intensity of the samples was measured using flow cytometry. To evaluate the affinity between two interacting proteins, the mean fluorescence intensity (MFI) at each time point was normalized to the MFI before cleavage, expressed as a percentage of the maximum MFI.
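
The normalization in the last step can be sketched as below. The function name, the time points, and the MFI values are hypothetical; the sketch assumes that the pre-cleavage measurement is stored at time 0 and that no background subtraction is applied, neither of which is specified in the text.

```python
def normalize_mfi(mfi_by_time, baseline_time=0):
    """Express each MFI as a percentage of the pre-cleavage MFI.

    mfi_by_time: dict mapping time point -> mean fluorescence intensity.
    baseline_time: key of the measurement taken before cleavage.
    """
    baseline = mfi_by_time[baseline_time]
    if baseline <= 0:
        raise ValueError("pre-cleavage MFI must be positive")
    return {t: 100.0 * mfi / baseline for t, mfi in sorted(mfi_by_time.items())}


# Hypothetical time course (arbitrary time units, arbitrary MFI units).
raw = {0: 20000.0, 10: 15000.0, 30: 5000.0}
percent_retained = normalize_mfi(raw)
print(percent_retained)
```

A clone whose Z-B subunit stays bound after linker cleavage retains more HA signal, so a slower decay of the normalized MFI indicates a tighter interaction.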

Yeast displayed libraries

A site-directed mutagenesis library was generated through assembly PCR using DNA sequences containing degenerate codons. The gel-purified PCR product was combined with a linearized pCT302 vector and introduced into EBY100 cells via electroporation (69). Following electroporation, the cells were incubated in YPD medium at 30°C for one hour before being transferred to SDCAA media. To assess transformation efficiency, serial dilutions of the recovered cells were plated onto SDCAA agar plates. After a 2-day incubation, protein expression was induced in SGCAA media as described above.

Library DNA sequences are listed below.

GTTGATAATAAATTTAATGCADTSCAATGGDTSGCATTTDTSDTSATTTTGCATCTGCCCAATTTGAACGAGGAACAGAGAAACGCTTTCATACAGTCTCTAAAAGATGATCCAAGTCAATCAGCAAATTTADTSGCCGAAGCTGCGGCCTTAAATGCCGCTCAAGCGCCTAAGGAATTCGGCGGAGGTGGGAGSCTGGAAGTTCTGTTCCAGGGTCCGGGAGGCGGCGGGAGCGGATCCGTTGACAACAAGTTTAACAAAGAGCAGCAAAATGCGTTTTACGAGATATTACATCTTCCGAATCTTAATGAGATACAGAGGAATDTSDTSATTCAGDTSCTGAAAGATGACCCTAGCCAGAGCGCCDTSDTSCTGGCTDTSGCGAAGATCGCAAACGATGCACAAGCACCTAAA

Theoretical nucleotide diversity: 3.63 × 10⁸

Functional library size: 2.50 × 10⁹
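The theoretical nucleotide diversity can be checked by hand. The library sequence above contains 11 DTS codons, and each DTS codon (D = A/G/T, S = C/G) expands to 6 concrete codons, giving 6¹¹ ≈ 3.63 × 10⁸. The sketch below reproduces that arithmetic. It assumes that only the DTS codons are counted (the single S in the linker region is ignored, which is consistent with the reported figure). The amino-acid-level diversity it also computes is our own derived quantity and is not reported in the text.

```python
from itertools import product

# IUPAC degenerate bases that appear in the DTS codon.
IUPAC = {"D": "AGT", "T": "T", "S": "CG"}
# Standard genetic code entries for the six codons that DTS can encode.
CODON_TO_AA = {"ATC": "I", "ATG": "M", "GTC": "V", "GTG": "V", "TTC": "F", "TTG": "L"}

def expand_codon(degenerate_codon):
    """List every concrete codon encoded by a degenerate codon."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in degenerate_codon))]

dts_codons = expand_codon("DTS")
dts_amino_acids = sorted({CODON_TO_AA[c] for c in dts_codons})

n_sites = 11  # number of DTS codons counted in the library sequence
nucleotide_diversity = len(dts_codons) ** n_sites        # 6 ** 11
amino_acid_diversity = len(dts_amino_acids) ** n_sites   # 5 ** 11 (derived, not from the source)
```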

Selection of yeast-display libraries

The yeast-display library was enriched for interacting pairs through a combination of magnetic-activated cell sorting (MACS) and fluorescence-activated cell sorting (FACS), as previously described (27). Initial negative selection was performed using 10 times the theoretical diversity of the library to remove uncleavable variants caused by linker mutations. Positive and negative selections were alternated to enrich interacting pairs while minimizing the accumulation of uncleavable mutants. Library selection involved four rounds of positive selection using MACS (R1-R4) followed by one round of FACS (R5) to achieve higher purity of interacting pairs. MACS and FACS selections were performed as described previously (27).

Deep sequencing of yeast libraries

DNA was extracted from 5-10 × 10⁷ yeast cells per selection round using the Zymoprep II kit (Zymo Research). Unique 6-mer barcodes and random 8-mer sequences were incorporated into the flanking regions of the sequencing product through 30 cycles of PCR amplification. The amplified region covered library positions for both Z-A and Z-B. A second PCR step was performed to add Illumina primer sequences, resulting in final products containing the format: Illumina P5-barcode-N8-read-Illumina P7. The final PCR products were purified using agarose gel electrophoresis, quantified with a Nanodrop, and subjected to deep sequencing on an Illumina MiSeq platform with a 2×300 V3 kit.

The amplicons were amplified using the following primers:

Illumina forward primer:

5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGA-3'

Illumina reverse primer:

5'-CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTC-3'

Sequence library filter

To identify enriched oligopeptide pairs, a one-sided hypergeometric test was performed using the scipy.stats library. Counts of individual oligopeptides (A and B) and their pairwise combinations (AB) were extracted from the sequencing data. The total population size (M), counts of each A or B oligopeptide (n and N), and observed AB pair counts (k) were used as input parameters for the hypergeometric distribution. The survival function (P(X>(k−1))) was computed to determine p-values for each pair. Pairs with p-values below 0.05 and counts greater than or equal to 20 were retained as enriched sequences in the round 5 NGS data. The filtered results were stored for further analysis, ensuring that only statistically significant and highly abundant pairs were included in downstream analyses, such as sequence similarity networks and Circos plots.
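The enrichment test above can be sketched with the standard library alone. The function `hypergeom_sf` below is intended to compute the same quantity as `scipy.stats.hypergeom.sf(k - 1, M, n, N)`, namely P(X > k − 1). The read counts in the example are hypothetical, and the helper names are ours rather than the authors'.

```python
from math import comb

def hypergeom_sf(k_minus_1, M, n, N):
    """P(X > k_minus_1) for X ~ Hypergeometric(population M, n successes, N draws)."""
    lo = max(k_minus_1 + 1, 0, N - (M - n))
    hi = min(n, N)
    total = comb(M, N)
    return sum(comb(n, x) * comb(M - n, N - x) for x in range(lo, hi + 1)) / total

def is_enriched(k, M, n, N, alpha=0.05, min_count=20):
    """Apply both filters from the text: p-value below alpha and pair count >= min_count."""
    p_value = hypergeom_sf(k - 1, M, n, N)
    return p_value, (p_value < alpha and k >= min_count)

# Hypothetical counts: M total reads, n reads carrying oligopeptide A,
# N reads carrying oligopeptide B, k reads carrying the AB pair.
p_value, enriched = is_enriched(k=30, M=100_000, n=500, N=400)
```

Under independent pairing the expected AB count in this toy example is n·N/M = 2, so observing 30 reads gives a very small p-value and the pair passes both filters.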

Sequence similarity network and Circos Plot

Coevolution data were imported from a CSV file containing filtered round 5 NGS data, where p-values were < 0.05 and read counts were ≥20, resulting in 862 enriched sequences out of a total of 416,872 unique round 5 sequences, which were used for plotting sequence similarity networks and a Circos plot. Sequence similarity networks and community maps were constructed using the igraph software package (70). Nodes in the edit distance-based networks represented unique Z-A/Z-B pairs, and edges were added between nodes when the edit distance between two pairs fell below a predefined threshold.
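The network construction described above can be illustrated without igraph. In this sketch each Z-A/Z-B pair is represented by a hypothetical string of its library residues, the threshold value is an assumption (the text does not state it), and connected components stand in for the community detection that igraph performs in the actual analysis.

```python
from collections import deque

def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def build_network(pairs, threshold):
    """Adjacency sets with an edge wherever the edit distance is below `threshold`."""
    adjacency = {p: set() for p in pairs}
    for i, p in enumerate(pairs):
        for q in pairs[i + 1:]:
            if edit_distance(p, q) < threshold:
                adjacency[p].add(q)
                adjacency[q].add(p)
    return adjacency

def connected_components(adjacency):
    """Group nodes into connected components by breadth-first search."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        queue, component = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for nb in adjacency[node] - seen:
                seen.add(nb)
                queue.append(nb)
        components.append(component)
    return components

# Hypothetical 11-residue strings, one per Z-A/Z-B pair.
toy_pairs = ["LIVFMLIVFML", "LIVFMLIVFMV", "LIVFMLIVFVV", "FFFFFMMMMMM"]
network = build_network(toy_pairs, threshold=2)  # threshold is an assumed value
clusters = connected_components(network)
```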

A Circos plot was generated using the PyCirclize library to visualize coevolutionary interactions among protein sequences. A constant score value was assigned for uniform visualization, and interaction categories were color-coded (e.g., red, orange, yellow) to differentiate cluster groups. The final Circos plot provided an effective visualization of complex coevolutionary networks, highlighting sequence clusters and interaction patterns.

X-ray Crystallography

Some protein complexes were reductively methylated (71), as noted below, and all were digested with carboxypeptidases A and B, purified by size-exclusion chromatography, and concentrated. All crystals were grown by vapor diffusion using either 100 nl protein with 100 nl well solution or 100 nl protein with 80 nl well solution and 20 nl microseeds, and flash cooled with liquid nitrogen after cryoprotection as noted below.

Methylated A4B1 was concentrated to 245 mg/ml and crystallized against 0.1 M bis-tris-propane pH 9.0 and 30% polyethylene glycol (PEG) 6000. Crystals were cryoprotected by drawing through a drop of paratone N. Diffraction images were collected at the Advanced Light Source (ALS) beamline 8.2.1.

Initial crystals of A2B2 grew at 193 mg/ml from 0.1 M bis-tris pH 5.5 and 25% PEG 3350. These were used to make microseed stock to grow the final crystals at 228 mg/ml from 0.1 M bis-tris-propane pH 9 and 30% PEG 6000. Crystals were cryoprotected with addition of 30% glycerol. Diffraction images were collected at the Stanford Synchrotron Radiation Lightsource (SSRL) beamline 12-1.

Initial crystals of A3B3 were grown at 200 mg/ml from 0.1 M bis-tris pH 6.5 and 25% PEG 3350. A microseed stock from these crystals was used to grow the final crystals from 0.2 M NaCl, 0.1 M bis-tris pH 6.5, and 25% PEG 3350. Crystals were cryoprotected by drawing through a drop of paratone N, and diffraction images were collected at ALS beamline 8.2.1.

Crystals of A5B5 were grown with two cycles of microseeding. Initial seed crystals grew at 210 mg/ml from 0.2 M Li₂(SO₄), 0.1 M Tris pH 8.5, and 30% PEG 4000. Secondary seeds were grown with microseeding from 0.1 M HEPES pH 7.5 and 20% PEG 10,000. Final crystals were grown from 0.15 M DL-Malic acid pH 7.0 and 20% PEG 3350. Crystals were cryoprotected by drawing through a drop of paratone N, and diffraction images were collected at ALS beamline 5.0.1.

Methylated A6B6 was concentrated to 168 mg/ml and crystallized from 0.01 M CoCl₂, 0.1 M MES pH 6.5, and 1.8 M (NH₄)₂SO₄. Crystals were cryoprotected by drawing through a drop of paratone N, and diffraction images were collected at ALS beamline 2.0.1.

Initial crystals of A7B7 grown at 138 mg/ml from 0.1 M sodium cacodylate pH 6.5 and 1.4 M sodium acetate were used to prepare a microseed stock. Final crystals were grown at 192 mg/ml with microseeding from 0.1 M bis-tris pH 6.1 and 2.1 M (NH₄)₂SO₄. Crystals were cryoprotected with addition of 3 M sodium malonate pH 6.0, and diffraction images were collected at the National Synchrotron Light Source II (NSLS-II) AMX beamline.

Initial seed crystals of methylated A7B3 were grown at 167 mg/ml from 0.1 M Tris pH 8.5 and 25% PEG 3350. Final crystals were grown with microseeding at 136 mg/ml from 0.05 M MgCl₂, 0.1 M HEPES pH 7.5, and 30% PEG monomethyl ether 550. Crystals were harvested without further cryoprotection and diffraction images were collected at SSRL beamline 12-2.

Diffraction data were indexed, integrated, and scaled using either XDS (72) (A4B1, A2B2, A3B3, A5B5, A6B6, A7B3) or autoPROC (73) (A7B7). Space groups were assigned using pointless and reflections were merged with aimless from the CCP4 suite (74, 75). All structures were solved by molecular replacement in Phaser (76) by searching for individual subunits with the following search models: AlphaFold2-generated models of the chains (A3B3, A5B5), side-chain truncated models of the chains from PDB entry 8DA3 (A6B6), side-chain truncated models of the A6B6 chains (A4B1, A2B2, A7B7), and side-chain truncated models of the chains from PDB entry 8DAB (A7B3). Some structures were then rebuilt with phenix.autobuild (77) (A3B3, A6B6, A7B7, A7B3). Additional refinement was performed interactively in Coot (78) and in Phenix (79-82) (A2B2, A3B3, A5B5, A6B6, A7B7) or Buster (83, 84) with final refinement in Phenix (A4B1, A7B3). Final refinement included NCS restraints (A4B1) and TLS parameters determined automatically in Phenix (A4B1, A3B3, A5B5, A6B6, A7B7, and A7B3). Model geometry was assessed with MolProbity (85), and contacting residues were identified with PyMOL (Schrödinger, LLC) using a distance cutoff of 3.8 Å. Crystallographic software used in this project was compiled and maintained by SBGrid (86). Crystallographic data collection and refinement statistics along with PDB deposition codes are reported in table S1.
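The contact criterion above (a 3.8 Å distance cutoff) can be illustrated with a plain distance calculation. This is a sketch only and does not reproduce the PyMOL selection used in the study: the residue labels and coordinates are hypothetical, and treating the cutoff as inclusive is an assumption.

```python
from math import dist

def contacting_residues(chain_a_atoms, chain_b_atoms, cutoff=3.8):
    """Return sorted residue-label pairs with any inter-chain atom pair within `cutoff` Å."""
    contacts = set()
    for res_a, xyz_a in chain_a_atoms:
        for res_b, xyz_b in chain_b_atoms:
            if dist(xyz_a, xyz_b) <= cutoff:
                contacts.add((res_a, res_b))
    return sorted(contacts)

# Hypothetical atoms: (residue label, (x, y, z) in Å).
toy_chain_a = [("A10", (0.0, 0.0, 0.0)), ("A14", (10.0, 0.0, 0.0))]
toy_chain_b = [("B9", (3.5, 0.0, 0.0)), ("B13", (10.0, 3.9, 0.0))]
toy_contacts = contacting_residues(toy_chain_a, toy_chain_b)
```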

Selection of representative cluster structures

To obtain structural representatives for each cluster, we initially attempted to crystallize all seven cluster structures along the diagonal of the specificity matrix in Fig. 4A, along with two high-affinity cross-reactive complexes (A4B1 and A7B3). Among these, the crystal structures of A3B3, A5B5, A6B6, A4B1, and A7B3 were successfully solved. To expand our structural coverage, we selected additional cluster sequences that differed by a single amino acid substitution (an edit distance of one) from the original representative pairs. These variants were tested for binding and crystallization, leading to the successful structure determination of A2B2-L14^AM for Cluster 2 and A7B7-L43^BF for Cluster 7. For Cluster 1, an initial structure solution by molecular replacement at 1.73 Å resolution showed clear density for protein backbones, but most sidechain density was uninterpretable and the structure could not be refined. Closer inspection of the diffraction images identified satellite reflections consistent with incommensurate modulation of the crystal. In light of this result, we used the A4B1 structure as a representative of Cluster 1, as it differs from A1B1 by only a single amino acid substitution located outside the interface (fig. S20). Furthermore, we confirmed that the A4B1 structure aligns well with the backbone conformation of A1B1, validating its use as a pseudo-representative structure for Cluster 1.

Surface plasmon resonance

Dissociation constants (K_D) for Z-A/Z-B dimers were measured by surface plasmon resonance (SPR) on a BIAcore T100 instrument (GE Healthcare). Biotinylated Z-A chains were immobilized on a streptavidin-coated (SA) sensor chip (Cytiva), with a reference channel containing an unrelated protein. All experiments were performed in HBS-P+ buffer (Cytiva) supplemented with 0.5% BSA. Serial three-fold dilutions of analytes, starting from 50 μM, were injected at a flow rate of 50 μl/min, with an association phase of 120 s followed by a dissociation phase of 180-300 s. After each injection, the surface was regenerated with 0.1 M glycine (pH 2.5). Sensorgrams were analyzed using the BIAcore T100 evaluation software, and binding constants were determined by fitting to a steady-state affinity model.
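As an illustration of a steady-state affinity fit, the sketch below assumes the common 1:1 isotherm R_eq = R_max · C / (K_D + C) without an offset term. It is not the algorithm of the BIAcore evaluation software. The number of dilution points, the simulated responses, and the grid-search fit are all assumptions made for this example.

```python
def dilution_series(top_conc, fold, n_points):
    """Serial dilutions starting at `top_conc`, each `fold` times more dilute."""
    return [top_conc / fold ** i for i in range(n_points)]

def steady_state_response(conc, kd, rmax):
    """1:1 steady-state binding isotherm: R_eq = Rmax * C / (K_D + C)."""
    return rmax * conc / (kd + conc)

def fit_kd(concs, responses, kd_grid):
    """Least-squares fit: for each candidate K_D, Rmax has a closed-form optimum."""
    best = None
    for kd in kd_grid:
        g = [c / (kd + c) for c in concs]
        rmax = sum(r * gi for r, gi in zip(responses, g)) / sum(gi * gi for gi in g)
        sse = sum((r - rmax * gi) ** 2 for r, gi in zip(responses, g))
        if best is None or sse < best[0]:
            best = (sse, kd, rmax)
    return best[1], best[2]

# Three-fold dilutions from 50 µM, as in the text; eight points is an assumption.
concs_um = dilution_series(50.0, 3.0, 8)
# Noise-free synthetic responses from hypothetical K_D = 2 µM and Rmax = 100 RU.
simulated = [steady_state_response(c, 2.0, 100.0) for c in concs_um]
kd_grid = [10 ** (e / 100.0) for e in range(-300, 301)]  # 1 nM to 1 mM, in µM
fitted_kd, fitted_rmax = fit_kd(concs_um, simulated, kd_grid)
```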

Selection probabilistic model (SPM)

We developed a statistical machine learning generative model to model all of the multi-round read count data simultaneously. The core quantity estimated by the model is the probability that any sequence is selected in a given round; the model is aware that the output of one selection round is the input to the next round. This core estimated quantity can be thought of as the energy/fitness landscape for that round, and can be aggregated across rounds to obtain an overall energy/fitness function. We call our approach the selection probabilistic model (SPM). SPM is a generative model in that it models the data-generating process of multi-round selection experiments (depicted in fig. S1A) by describing each round in two steps: (i) an update of the sequence distribution from the previous round, where the update arises from the selection process, and (ii) generation/sampling of the observed sequence read counts from the updated sequence distribution of the first step, yielding a multinomial likelihood that can be maximized to estimate model parameters. For a detailed description, refer to SI Section A. We define $f_r^{θ}(x)$, the inferred fitness for round $r$ with learned parameters $θ$, as the logarithm of the estimated selection probability (for each sequence and each round), and $f^{θ}(x)$ as the global (over all rounds) fitness function. The energy is defined as negative fitness. Having inferred these fitness/energy landscapes, we use them for two downstream applications: (1) epistasis analysis (SI Section C) and (2) simulation of coevolutionary trajectories based on the fitness landscape (SI Section D).
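The two steps of the generative process can be sketched on a toy example. This is our reading of the description above and not the authors' implementation, whose exact parameterization is given in SI Section A. In particular, the update rule (reweighting the previous distribution by each sequence's selection probability and renormalizing), the function names, and all numbers are assumptions.

```python
from math import exp, log

def update_distribution(prev_dist, round_fitness):
    """Step (i): reweight the previous distribution by exp(fitness), i.e. the
    selection probability, then renormalize so the result sums to one."""
    weighted = {x: p * exp(round_fitness[x]) for x, p in prev_dist.items()}
    z = sum(weighted.values())
    return {x: w / z for x, w in weighted.items()}

def multinomial_log_likelihood(counts, dist):
    """Step (ii): log-likelihood of the observed read counts under `dist`,
    omitting the multinomial coefficient, which does not depend on the parameters."""
    return sum(c * log(dist[x]) for x, c in counts.items() if c > 0)

# Toy example with three hypothetical sequences and a uniform starting library.
prior = {"AAA": 1 / 3, "AAB": 1 / 3, "ABB": 1 / 3}
fitness_r1 = {"AAA": log(0.1), "AAB": log(0.3), "ABB": log(0.6)}  # log selection probabilities
round1 = update_distribution(prior, fitness_r1)

observed_counts = {"AAA": 10, "AAB": 30, "ABB": 60}  # hypothetical round-1 read counts
ll_model = multinomial_log_likelihood(observed_counts, round1)
ll_uniform = multinomial_log_likelihood(observed_counts, prior)
```

In this toy case the read counts are proportional to the updated distribution, so the model likelihood exceeds that of the unselected uniform library. Maximizing this likelihood over the fitness parameters is the training objective that the text describes.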

SPM training and evaluation

Using the SPM, we learned a function $f_r^{θ}(x)$, represented as a neural network, to predict the fitness associated with each selection round $r$ for each sequence $x$ (3,650,782 unique sequences in total). The optimal settings of the parameters $θ$ were determined by maximizing the logarithm of the multinomial likelihood (described in SI Section B) over the sequences in each selection round. Our neural network, $f^{θ}(x)$, comprised four fully connected hidden layers, each with a ReLU activation function (fig. S1B). To stabilize the training, we applied regularization by weight decay (1e-5) with the Adam optimizer (87) and a dropout rate of 0.1. The hyperparameters, consisting of the number of hidden layers (4), hidden dimension (100), learning rate (0.001), batch size (10,000), and regularization strength, were chosen based on the model that produced the lowest MSE between the predicted counts and the experimental observations in a five-fold cross validation on the final selection round (while always using all previous rounds for training). See SI Section B for the details of hyperparameter selection and model validation.

Extraction of epistasis importance from SPM-predicted fitness landscape

We use two different types of epistatic characterization for each epistatic term: (i) the epistatic effect size and (ii) the epistatic importance. As is made more precise in SI Section C, the effect size corresponds to the standard effect size one obtains in a linear additive model comprising all epistatic effects of all orders. For example, one epistatic effect size could be the parameter value in the linear additive model for position 2 being an A and position 5 being an L, which is a second-order effect because it involves two positions. In contrast, the epistatic importance refers to how much changes in amino acids at specified positions can change the fitness. For example, one epistatic importance could be the maximum fitness minus the minimum fitness obtained by choosing the most and least favorable amino acids at positions 2 and 5. Importance therefore approximates the maximally achievable dynamic range at a set of positions, over all possible amino acids. In practice, we use effect size when characterizing epistasis related to a particular variant (sequence), whereas when we seek to characterize epistasis for the entire fitness landscape, importance is the relevant quantity because it is not anchored on any particular variant.

Briefly, we developed an efficient method to extract epistatic information, that is, the importance of different interaction terms (single site, pairwise, and all higher orders), from the learned sequence-to-fitness relationship. In this method, a linearized version of the learned fitness function is constructed that consists of the effect sizes of the different terms (e.g., single site, pairwise, or higher order). This method is related to the Walsh-Hadamard transform of the fitness function (88, 89). After computing the effect sizes of the linearized fitness function, we define the epistasis importance of a term as the difference between the maximum and minimum effect sizes over all possible configurations of amino acids for that term. This choice is inspired by the partial dependence-based variable importance measure for categorical values introduced by Greenwell et al. (90). A detailed description of our epistasis-extraction method can be found in SI Section C.
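The distinction between effect size and importance can be made concrete on a two-position toy landscape. The decomposition below is a simple mean-centered (ANOVA-style) linearization that we use as a stand-in; the authors' efficient method is described in SI Section C and may differ in detail. The alphabet, the toy fitness function, and all names are hypothetical.

```python
from itertools import product
from statistics import mean

ALPHABET = "ILV"

def toy_fitness(seq):
    """Hypothetical landscape: additive terms plus one pairwise interaction."""
    additive = {"I": 1.0, "L": 0.0, "V": -1.0}
    score = additive[seq[0]] + 0.5 * additive[seq[1]]
    if seq == "IV":  # epistatic bonus for I at position 0 together with V at position 1
        score += 2.0
    return score

def decompose(fitness):
    """Split a two-position landscape into mean, single-site, and pairwise effect sizes."""
    seqs = ["".join(s) for s in product(ALPHABET, repeat=2)]
    mu = mean(fitness(s) for s in seqs)
    first = {a: mean(fitness(a + b) for b in ALPHABET) - mu for a in ALPHABET}
    second = {b: mean(fitness(a + b) for a in ALPHABET) - mu for b in ALPHABET}
    pair = {s: fitness(s) - mu - first[s[0]] - second[s[1]] for s in seqs}
    return mu, {"pos0": first, "pos1": second, "pos0xpos1": pair}

def importance(effect_sizes):
    """Epistasis importance of a term: maximum minus minimum effect size."""
    return max(effect_sizes.values()) - min(effect_sizes.values())

mu, terms = decompose(toy_fitness)
term_importance = {name: importance(effects) for name, effects in terms.items()}
```

Each entry of `terms` holds effect sizes tied to specific amino acid configurations, whereas `term_importance` gives one number per term that summarizes the whole landscape.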

Simulating coevolutionary trajectories on the SPM-predicted fitness landscape

To extract the geometry of our estimated binding fitness landscape, we simulated coevolutionary trajectories through it. We simulated coevolutionary trajectories as paths that start from random points on the landscape and always increase the fitness. Specifically, each move in a trajectory considers all single-position mutations (an edit distance of one) that increase the fitness, and selects one such edit at random. This process iterates until no moves are left that increase the fitness, at which point we have arrived at a terminal state, namely, a local fitness maximum (equivalently, a local energy minimum). After simulating 10 million such trajectories, we compute the “accessibility” of each sequence as the fraction of trajectories that ended at that sequence. Accessibility of an energy well is defined as the accessibility of that well’s lowest-energy sequence. Such an analysis can provide insight into the ruggedness of the landscape (the more local minima, the more rugged it is), as well as how different parts of sequence space differ (or not) in their accessibility. Although we simulated 10 million trajectories, fewer trajectories resulted in similar conclusions, suggesting that 10 million was sufficient. More details are provided in SI Section D.
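The trajectory simulation can be sketched as a random uphill walk. The example below is illustrative only: it uses a hypothetical two-peak fitness function over three positions instead of the SPM-predicted landscape, and far fewer trajectories than the 10 million used in the study.

```python
import random
from collections import Counter

ALPHABET = "ILVFM"

def toy_fitness(seq):
    """Hypothetical landscape with two peaks, at 'LLL' (higher) and 'FFF' (lower)."""
    to_l = sum(aa == "L" for aa in seq)
    to_f = sum(aa == "F" for aa in seq)
    return max(1.0 * to_l, 0.8 * to_f)

def single_mutants(seq):
    """Yield every sequence that differs from `seq` at exactly one position."""
    for i, old in enumerate(seq):
        for aa in ALPHABET:
            if aa != old:
                yield seq[:i] + aa + seq[i + 1:]

def adaptive_walk(start, fitness, rng):
    """Move to a randomly chosen fitness-increasing single mutant until none remains."""
    current = start
    while True:
        uphill = [m for m in single_mutants(current) if fitness(m) > fitness(current)]
        if not uphill:
            return current  # terminal state: local fitness maximum
        current = rng.choice(uphill)

def accessibility(fitness, seq_len, n_trajectories, seed=0):
    """Fraction of random-start trajectories that terminate at each sequence."""
    rng = random.Random(seed)
    endpoints = Counter()
    for _ in range(n_trajectories):
        start = "".join(rng.choice(ALPHABET) for _ in range(seq_len))
        endpoints[adaptive_walk(start, fitness, rng)] += 1
    return {seq: n / n_trajectories for seq, n in endpoints.items()}

access = accessibility(toy_fitness, seq_len=3, n_trajectories=2000)
```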

Energy landscape well depth, accessibility (width), and their statistical significance

In each energy landscape plot (such as the ones in Fig. 3F, fig. S8C and fig. S8D), each well has a specific variant that defines its energy minimum (shown with a colored dot). The width of a well is drawn proportional to the accessibility of that well, which can be seen by tracing out the energy barrier denoted by black lines to adjacent wells. For ease of visualization, we show only wells with the 20 most accessible sequences. We ordered these 20 wells at random, except that for the natural interface the deepest energy well is always placed at the center. Later, when we bootstrap this analysis to obtain statistical significance, each bootstrap data set has its own random ordering, showing that the ordering is unimportant to the major conclusions about differences in landscape geometry between synthetic and natural interfaces.

Although wells are shown adjacent to each other in these visualizations, namely with each well having one neighboring well to the left, and one to the right (other than the two wells on each end), in reality, each well has an energy barrier to every other well (and in particular to the other 19 wells in the plot). Consequently, we define (and show) the depth of an energy well as the minimal cumulative energy barrier that a coevolutionary trajectory needs to overcome from the representative sequence of this well to the representative sequence of any of the other 19 wells.

We use Dijkstra’s algorithm to find the path with the minimal energy barrier (details in SI Section D).
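One way to picture this search is sketched below. It is an assumption-laden illustration, not the procedure of SI Section D: we take the cost of a step from sequence u to sequence v to be the uphill energy max(0, E[v] − E[u]) and the cumulative barrier of a path to be the sum of these costs. The graph, the energies, and the function name are hypothetical.

```python
import heapq

def min_cumulative_barrier(energy, neighbors, source, targets):
    """Dijkstra's algorithm with step cost max(0, E[v] - E[u]) for a move u -> v.
    Returns the smallest cumulative uphill energy needed to reach any target."""
    targets = set(targets)
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        if node in targets:
            return cost
        for nb in neighbors[node]:
            new_cost = cost + max(0.0, energy[nb] - energy[node])
            if new_cost < best.get(nb, float("inf")):
                best[nb] = new_cost
                heapq.heappush(heap, (new_cost, nb))
    return float("inf")

# Hypothetical graph: two wells (w1, w2) connected through two intermediates (a, c).
toy_energy = {"w1": -5.0, "a": -2.0, "c": -1.0, "w2": -6.0}
toy_neighbors = {"w1": ["a", "c"], "a": ["w1", "w2"], "c": ["w1", "w2"], "w2": ["a", "c"]}
depth_w1 = min_cumulative_barrier(toy_energy, toy_neighbors, "w1", {"w2"})
depth_w2 = min_cumulative_barrier(toy_energy, toy_neighbors, "w2", {"w1"})
```

Dijkstra's algorithm applies here because every step cost is non-negative. With more than two wells, the depth of a well would be obtained by passing the representative sequences of all other displayed wells as `targets`.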

Having defined the width (accessibility) and depth, we can now define statistical tests related to these quantities in order to gauge whether the differences between natural and synthetic interfaces are statistically significant. To do so, for each interface, we sampled 1,000 bootstrap NGS read data sets (each containing the original number of read counts in that data set). For each of these 1,000 data sets, we re-computed our energy landscapes, well accessibilities, and depths. Specifically, for each of these 1,000 data sets, we trained an SPM to predict the fitness landscape and then simulated 10 million coevolutionary trajectories on it. We then performed a Mann-Whitney U test for the alternative hypothesis that the depth of the most accessible energy well in the natural interface is larger than 3 times the depth of the deepest energy well among the 20 most accessible wells in the synthetic interface (for 4 times larger, the results were not significant at level $α = 0.05$). Similarly, we performed a Mann-Whitney U test for the alternative hypothesis that the accessibility of the most accessible energy well in the natural interface is larger than 5 times the accessibility of the most accessible energy well in the synthetic interface (for 6 times larger, the results were not significant at level $α = 0.05$).
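The resampling step and the scaled one-sided comparison can be sketched as follows. Several parts are assumptions made for this example: reads are resampled with replacement in proportion to their observed counts, the "larger than 3 times" hypothesis is tested by multiplying the synthetic values by the factor before comparing, and the p-value uses a normal approximation without tie or continuity correction. With SciPy available, `scipy.stats.mannwhitneyu(x, scaled_y, alternative="greater")` would be the more usual choice. All numbers are hypothetical.

```python
import random
from collections import Counter
from math import erfc, sqrt

def bootstrap_counts(counts, rng):
    """Resample reads with replacement, keeping the original total read count."""
    seqs = list(counts)
    total = sum(counts.values())
    draws = rng.choices(seqs, weights=[counts[s] for s in seqs], k=total)
    return Counter(draws)

def mann_whitney_greater(x, y, factor=1.0):
    """One-sided Mann-Whitney U test of x against factor * y (normal approximation)."""
    scaled = [factor * v for v in y]
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in scaled)
    n1, n2 = len(x), len(scaled)
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, 0.5 * erfc(z / sqrt(2))

rng = random.Random(0)
toy_counts = {"LLLL": 50, "LLLF": 30, "FFFF": 20}   # hypothetical read counts
resampled = bootstrap_counts(toy_counts, rng)

# Hypothetical well depths, one value per bootstrap replicate.
natural_depths = [10.0 + i for i in range(10)]
synthetic_depths = [1.0 + 0.1 * i for i in range(10)]
u_3x, p_3x = mann_whitney_greater(natural_depths, synthetic_depths, factor=3.0)
u_20x, p_20x = mann_whitney_greater(natural_depths, synthetic_depths, factor=20.0)
```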

Identification of seed sequences from coevolutionary trajectories

We require each seed sequence for a specific target sequence to have three properties: (i) weak but experimentally detectable binding affinity (SPM-predicted fitness between 18.35 and 22.07, which are the average fitness values of sequences with NGS read counts of 5 and 20, respectively, in the final selection round); (ii) high exclusivity to its strong-binding target sequence, where exclusivity is defined as the ratio of the number of trajectories passing through a given weak-binding sequence and arriving at a specific strong-binding target sequence to the total number of trajectories passing through the weak-binding sequence (minimum exclusivity = 0.9); and (iii) high contribution to its strong-binding target sequence, where contribution is defined as the ratio of the number of trajectories passing through a given weak-binding sequence and arriving at a specific strong-binding target sequence to the total number of trajectories arriving at the target sequence (minimum contribution = 0.01). Intuitively, sequences with high exclusivity serve as checkpoints in the evolutionary paths to their corresponding target sequences, while sequences with high contribution act as the main origins of the evolutionary paths to their corresponding target sequences. We focused our analysis on seed sequences that are at least three single-point mutations away from their strong-binding sequences.
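The three criteria above can be expressed as a filter over simulated trajectories. The sketch below is illustrative: each trajectory is a list of sequences ending at its terminal state, "passing through" is assumed to include the starting sequence but not the endpoint, the fitness window is assumed to be inclusive, and all sequences, fitness values, and names are hypothetical.

```python
from collections import Counter

def seed_candidates(trajectories, fitness, target, fit_range=(18.35, 22.07),
                    min_exclusivity=0.9, min_contribution=0.01, min_distance=3):
    """Return {sequence: (exclusivity, contribution)} for sequences meeting all criteria."""
    through = Counter()            # trajectories passing through each sequence
    through_to_target = Counter()  # ... that also end at the target
    n_target = 0                   # trajectories ending at the target
    for path in trajectories:
        hits_target = path[-1] == target
        n_target += int(hits_target)
        for seq in set(path[:-1]):
            through[seq] += 1
            through_to_target[seq] += int(hits_target)
    if n_target == 0:
        return {}
    seeds = {}
    for seq, n_through in through.items():
        exclusivity = through_to_target[seq] / n_through
        contribution = through_to_target[seq] / n_target
        distance = sum(a != b for a, b in zip(seq, target))
        if (fit_range[0] <= fitness[seq] <= fit_range[1]
                and exclusivity >= min_exclusivity
                and contribution >= min_contribution
                and distance >= min_distance):
            seeds[seq] = (exclusivity, contribution)
    return seeds

# Hypothetical trajectories and fitness values for a 4-residue toy system.
toy_trajectories = [
    ["IVFM", "LVFM", "LLFM", "LLLM", "LLLL"],
    ["IVFM", "IVFL", "LVFL", "LLFL", "LLLL"],
    ["FVFM", "FVFF", "FFFF"],
    ["IVFM", "LVFM", "LLFM", "LLLM", "LLLL"],
    ["IVFL", "IVFF", "FVFF", "FFFF"],
]
toy_fitness = {"IVFM": 19.0, "LVFM": 21.0, "LLFM": 23.0, "LLLM": 25.0, "LLLL": 30.0,
               "IVFL": 20.0, "LVFL": 22.5, "LLFL": 24.0, "FVFM": 19.0, "FVFF": 21.0,
               "FFFF": 28.0, "IVFF": 20.5}
seeds = seed_candidates(toy_trajectories, toy_fitness, target="LLLL")
```

In this toy example "IVFL" is rejected despite lying in the fitness window, because only half of the trajectories passing through it reach the target (exclusivity 0.5).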

Scoring chain pairing specificity and sequence-structure compatibility using Frame2seq

We compute Frame2seq (32) model negative pseudo log likelihoods (PLL) at the interface to assess predicted chain pairing specificity and sequence-structure compatibility. We refer to Frame2seq negative PLL values at the interface as interface scores throughout this work. Given experimentally solved structures and the amino acid sequence at framework positions as input, we introduce a mask at the interface positions and score the clustered library sequences at the masked indices to output interface scores as follows:

$$-\log p\left(x_{i} = x_{i}^{\mathrm{library}} \mid x_{-i}^{\mathrm{library}}, S\right) \qquad \text{(Eq. S1)}$$

where $i$ iterates over interface positions, $x^{\mathrm{library}}$ is a clustered library sequence, $x_{-i}$ is the sequence $x$ with a mask introduced at $i$, and $S$ is the structure. We predict chain pairing specificity by scoring all Z-A and Z-B chain combinations on corresponding structures. We predict sequence-structure compatibility by scoring all clustered library sequences on all six experimentally solved structures.
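To make the interface score concrete, the sketch below evaluates the negative log-probability terms of Eq. S1 from a table of per-position amino acid probabilities. It does not call Frame2seq: `masked_probs` is a hypothetical stand-in for the model output at masked positions, and summing the per-position terms (rather than averaging them) is an assumption.

```python
from math import log

def interface_score(sequence, interface_positions, masked_probs):
    """Sum of -log p(x_i | x_-i, S) over the interface positions of `sequence`."""
    return sum(-log(masked_probs[i][sequence[i]]) for i in interface_positions)

# Hypothetical model output: probability of each amino acid at masked positions 1 and 3.
toy_probs = {
    1: {"L": 0.50, "I": 0.25, "V": 0.25},
    3: {"F": 0.25, "M": 0.75},
}
score_lf = interface_score("ALVF", interface_positions=[1, 3], masked_probs=toy_probs)
score_lm = interface_score("ALVM", interface_positions=[1, 3], masked_probs=toy_probs)
```

Because the score is a negative log-probability, a lower value means the sequence is predicted to be more compatible with the structure.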

Extraction of structure-conditioned epistasis importance using Frame2seq

To extract structure-specific epistasis importance, we utilize the aforementioned epistasis importance extraction method (details described in SI Section C) but substitute the fitness function $f^{θ}(x)$ from SPM with an approximate structure-conditioned fitness function $f_{\mathrm{F2s}}^{θ}(x \mid S)$ constructed using Frame2seq (where $S$ represents input structural information). We use Frame2seq with published pre-trained weights (32) without fine-tuning. Details of how this function is constructed are provided in SI Section E, and our approximations are validated empirically in SI Section F.

Supplementary Material

Supplementary Text

Tables S1 and S2

Figures S1 to S21

References (94 – 101)

MDAR Reproducibility Checklist

Acknowledgments:

We thank D. Waghray and A. Velasco for their support, and Michael Fischbach, J. Xiong and H. Nisonoff for helpful discussions. We thank Marc Allaire at ALS and Jean Jakoncic at NSLS-II for assistance with data collection. This is the author's version of the work. It is posted here by permission of the AAAS for personal use, not for redistribution.

Funding:

K.C.G. is an investigator of the Howard Hughes Medical Institute and is supported by NIH grant GM 150125, NIH-R01AI103867, Cancer Grand Challenges partnership financed by CRUK (CGCATF-2023/100006), and the National Cancer Institute (OT2CA297242), Pew Innovation Fund, and the Yosemite Innovation Fund. J.L. is supported by the Chan Zuckerberg Investigator program. H.J. was supported by the Croucher Fellowship. S.A. was supported by the Swiss National Science Foundation (grant no. P500PT_214430). J.C.B. was supported by a UC Berkeley Chancellor’s Fellowship and NSF Graduate Research Fellowship (DGE 2146752). A.J.L. is supported by an NSF Graduate Research Fellowship and a UCSF Discovery Fellowship. D.A. is supported by an NSF Graduate Research Fellowship. C.P.P. and P.-S.H. are supported by NIH grant R01GM147893. T.K. is supported by NIH grant R35GM145236 and is a Chan Zuckerberg Biohub Investigator. D.A. and T.K. benefitted from the Microsoft Accelerating Foundation Models Research (AFMR) grant program. The Berkeley Center for Structural Biology is supported in part by the Howard Hughes Medical Institute. The Advanced Light Source is a Department of Energy Office of Science User Facility under Contract No. DE-AC02-05CH11231. The Pilatus detector on beamline 5.0.1 was funded under NIH grant S10OD021832. The ALS-ENABLE beamlines are supported in part by the National Institutes of Health, National Institute of General Medical Sciences, grant P30 GM124169. Use of the Stanford Synchrotron Radiation Lightsource, SLAC National Accelerator Laboratory, is supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences under Contract No. DE-AC02-76SF00515. The SSRL Structural Molecular Biology Program is supported by the DOE Office of Biological and Environmental Research, and by the National Institutes of Health, National Institute of General Medical Sciences (P30GM133894).
The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of NIGMS or NIH. The Center for BioMolecular Structure (CBMS) is primarily supported by the National Institutes of Health, National Institute of General Medical Sciences (NIGMS) through Grant # P30GM133893, and by the DOE Office of Biological and Environmental Research FWP # BO070. This research used resources 17-ID-1 of the National Synchrotron Light Source II, a U.S. Department of Energy (DOE) Office of Science User Facility operated for the DOE Office of Science by Brookhaven National Laboratory under Contract No. DE-SC0012704. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. (DGE 2146752). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Footnotes

Competing interests: Authors declare that they have no competing interests.

Data and materials availability: Crystal structures and diffraction intensities have been deposited in the Protein Data Bank with accession IDs pdb_00009NKM, pdb_00009NKN, pdb_00009NKO, pdb_00009NKP, pdb_00009NKQ, pdb_00009NKR, and pdb_00009NKS. Raw diffraction images for each structure have been deposited in the SBGrid Data Bank and are linked to the corresponding PDB entry. NGS data are deposited at Dryad (91). Code for training and using SPM, computing epistasis from the fitness landscape, and simulating coevolutionary trajectories is available at https://github.com/hanlunj/coevolution and is also archived together with training data and model weights at Zenodo (92). Code and model for Frame2seq are available at https://github.com/dakpinaroglu/Frame2seq and Zenodo (93). Materials for this study are described in the methods, and reasonable requests can be directed to the corresponding authors.

References and Notes

1. Levy ED, A Simple Definition of Structural Regions in Proteins and Its Use in Analyzing Interface Evolution. J. Mol. Biol 403, 660–670 (2010).
2. Ma B, Elkayam T, Wolfson H, Nussinov R, Protein–protein interactions: Structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc. Natl. Acad. Sci 100, 5772–5777 (2003).
3. Gainza P, Sverrisson F, Monti F, Rodolà E, Boscaini D, Bronstein MM, Correia BE, Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Methods 17, 184–192 (2020).
4. Smith GP, Filamentous Fusion Phage: Novel Expression Vectors That Display Cloned Antigens on the Virion Surface. Science 228, 1315–1317 (1985).
5. Wrighton NC, Farrell FX, Chang R, Kashyap AK, Barbone FP, Mulcahy LS, Johnson DL, Barrett RW, Jolliffe LK, Dower WJ, Small peptides as potent mimetics of the protein hormone erythropoietin. Science 273, 458–463 (1996).
6. DeLano WL, Ultsch MH, De Vos AM, Wells JA, Convergent solutions to binding at a protein-protein interface. Science 287, 1279–1283 (2000).
7. Moreira IS, Fernandes PA, Ramos MJ, Hot spots-A review of the protein-protein interface determinant amino-acid residues. Proteins Struct. Funct. Bioinforma 68, 803–812 (2007).
8. Guharoy M, Chakrabarti P, Conserved residue clusters at protein-protein interfaces and their use in binding site identification. BMC Bioinformatics 11 (2010).
9. Wang X, Lupardus P, LaPorte SL, Garcia KC, Structural Biology of Shared Cytokine Receptors. Annu. Rev. Immunol 27, 29–60 (2009).
10. Boulanger MJ, Bankovich AJ, Kortemme T, Baker D, Garcia KC, Convergent mechanisms for recognition of divergent cytokines by the shared signaling receptor gp130. Mol. Cell 12, 577–589 (2003).
11. Erijman A, Rosenthal E, Shifman JM, How structure defines affinity in protein-protein interactions. PLoS One 9 (2014).
12. Storz JF, Compensatory mutations and epistasis for protein function. Curr. Opin. Struct. Biol 50, 18–25 (2018).
13. Aakre CD, Herrou J, Phung TN, Perchuk BS, Crosson S, Laub MT, Evolving New Protein-Protein Interaction Specificity through Promiscuous Intermediates. Cell 163, 594–606 (2015).
14. Avizemer Z, Martí-Gómez C, Hoch SY, McCandlish DM, Fleishman SJ, Evolutionary paths that link orthogonal pairs of binding proteins. (2023). doi:10.21203/rs.3.rs-2836905/v2.
15. Lipsh-Sokolik R, Fleishman SJ, Addressing epistasis in the design of protein function. Proc. Natl. Acad. Sci. U. S. A 121, 1–9 (2024).
16. de Juan D, Pazos F, Valencia A, Emerging methods in protein co-evolution. Nat. Rev. Genet 14, 249–261 (2013).
17. Lovell SC, Robertson DL, An Integrated View of Molecular Coevolution in Protein-Protein Interactions. Mol. Biol. Evol 27, 2567–2575 (2010).
18. Shindyalov IN, Kolchanov NA, Sander C, Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng. Des. Sel 7, 349–358 (1994).
19. Dunn SD, Wahl LM, Gloor GB, Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 24, 333–340 (2008).
20. Weigt M, White RA, Szurmant H, Hoch JA, Hwa T, Identification of direct residue contacts in protein-protein interaction by message passing. Proc. Natl. Acad. Sci. U. S. A 106, 67–72 (2009).
21. Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, Sander C, Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS One 6, e28766 (2011).
22. Hopf TA, Schärfe CPI, Rodrigues JPGLM, Green AG, Kohlbacher O, Sander C, Bonvin AMJJ, Marks DS, Sequence co-evolution gives 3D contacts and structures of protein complexes. Elife 3, 713–724 (2014).
23. Marks DS, Hopf TA, Sander C, Protein structure prediction from sequence variation. Nat. Biotechnol 30, 1072–1080 (2012).
24. Ding D, Green AG, Wang B, Lite TLV, Weinstein EN, Marks DS, Laub MT, Co-evolution of interacting proteins through non-contacting and non-specific mutations. Nat. Ecol. Evol 6, 590–603 (2022).
25.Dean AM, Thornton JW, Mechanistic approaches to the study of evolution: The functional synthesis. Nat. Rev. Genet 8, 675–688 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Levin KB, Dym O, Albeck S, Magdassi S, Keeble AH, Kleanthous C, Tawfik DS, Following evolutionary paths to protein-protein interactions with high affinity and selectivity. Nat. Struct. Mol. Biol 16, 1049–1055 (2009). [DOI] [PubMed] [Google Scholar]
27.Yang A, Jude KM, Lai B, Minot M, Kocyla AM, Glassman CR, Nishimiya D, Kim YS, Reddy ST, Khan AA, Garcia KC, Deploying synthetic coevolution and machine learning to engineer protein-protein interactions. Science 381 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Johnson MS, Reddy G, Desai MM, Epistasis and evolution: recent advances and an outlook for prediction. BMC Biol. 21, 1–12 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Starr TN, Thornton JW, Epistasis in protein evolution. Protein Sci. 25, 1204–1218 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Miton CM, Buda K, Tokuriki N, Epistasis and intramolecular networks in protein evolution. Curr. Opin. Struct. Biol 69, 160–168 (2021). [DOI] [PubMed] [Google Scholar]
31.Högbom M, Eklund M, Åke Nygren P, Nordlund P, Structural basis for recognition by an in vitro evolved affibody. Proc. Natl. Acad. Sci 100, 3191–3196 (2003). [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Akpinaroglu D, Seki K, Guo A, Zhu E, Kelly MJS, Kortemme T, Structure-conditioned masked language models for protein sequence design generalize beyond the native sequence space. (2023). 10.1101/2023.12.15.571823. [DOI] [Google Scholar]
33.Schneidman-Duhovny D, Inbar Y, Nussinov R, Wolfson HJ, PatchDock and SymmDock: Servers for rigid and symmetric docking. Nucleic Acids Res. 33, 363–367 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Huang PS, Ban YEA, Richter F, Andre I, Vernon R, Schief WR, Baker D, RosettaRemodel: A generalized framework for flexible backbone protein design. PLoS One 6 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Clark LA, van Vlijmen HWT, A knowledge-based forcefield for protein-protein interface design. Proteins 70, 1540–50 (2008). [DOI] [PubMed] [Google Scholar]
36.Hu JC, O’Shea EK, Kim PS, Sauer RT, Sequence Requirements for Coiled-Coils: Analysis with λ Repressor-GCN4 Leucine Zipper Fusions. Science 250, 1400–1403 (1990). [DOI] [PubMed] [Google Scholar]
37.Zhu B-Y, Zhou ME, Kay CM, Hodges RS, Packing and hydrophobicity effects on protein folding and stability: Effects of β-branched amino acids, valine and isoleucine, on the formation and stability of two-stranded α-helical coiled coils/leucine zippers. Protein Sci. 2, 383–394 (1993). [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Lim WA, Sauer RT, Alternative packing arrangements in the hydrophobic core of λ repressor. Nature 339, 31–36 (1989). [DOI] [PubMed] [Google Scholar]
39.Atkinson HJ, Morris JH, Ferrin TE, Babbitt PC, Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS One 4 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
40.Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA, Circos: An information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
41.Lo Conte L, Chothia C, Janin J, The atomic structure of protein-protein recognition sites. J. Mol. Biol 285, 2177–2198 (1999). [DOI] [PubMed] [Google Scholar]
42.Norel R, Lin SL, Wolfson HJ, Nussinov R, Shape complementarity at protein-protein interfaces. Biopolymers 34, 933–940 (1994). [DOI] [PubMed] [Google Scholar]
43.Sheffler W, Baker D, RosettaHoles: Rapid assessment of protein core packing for structure prediction, refinement, design, and validation. Protein Sci. 18, 229–239 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
44.Shirian J, Sharabi O, Shifman JM, Cold Spots in Protein Binding. Trends Biochem. Sci 41, 739–745 (2016). [DOI] [PubMed] [Google Scholar]
45.Gurusinghe SNS, Oppenheimer B, Shifman JM, Cold spots are universal in protein–protein interactions. Protein Sci. 31, 1–14 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
46.Heyne M, Shirian J, Cohen I, Peleg Y, Radisky ES, Papo N, Shifman JM, Climbing up and down Binding Landscapes through Deep Mutational Scanning of Three Homologous Protein-Protein Complexes. J. Am. Chem. Soc 143, 17261–17275 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
47.Levy RM, Haldane A, Flynn WF, Potts Hamiltonian models of protein co-variation, free energy landscapes, and evolutionary fitness. Curr. Opin. Struct. Biol 43, 55–62 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
48.Dutta S, Gullá S, Chen TS, Fire E, Grant RA, Keating AE, Determinants of BH3 Binding Specificity for Mcl-1 versus Bcl-xL. J. Mol. Biol 398, 747–762 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
49.Barlow KA, Ó Conchúir S, Thompson S, Suresh P, Lucas JE, Heinonen M, Kortemme T, Flex ddG: Rosetta Ensemble-Based Estimation of Changes in Protein-Protein Binding Affinity upon Mutation. J. Phys. Chem. B 122, 5389–5399 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
50.Thanos CD, DeLano WL, Wells JA, Hot-spot mimicry of a cytokine receptor by a small molecule. Proc. Natl. Acad. Sci. U. S. A 103, 15422–15427 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
51.Li K, Tokareva OS, Thomson TM, Wahl SCT, Travaline TL, Ramirez JD, Choudary SK, Agarwal S, Walkup WG, Olsen TJ, Brennan MJ, Verdine GL, McGee JH, De novo mapping of α-helix recognition sites on protein surfaces using unbiased libraries. Proc. Natl. Acad. Sci 119, 2017 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
52.Podgornaia AI, Laub MT, Pervasive degeneracy and epistasis in a protein-protein interface. Science 347, 673–677 (2015). [DOI] [PubMed] [Google Scholar]
53.Moutinho AF, Trancoso FF, Dutheil JY, Zhang J, The Impact of Protein Architecture on Adaptive Evolution. Mol. Biol. Evol 36, 2013–2028 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
54.Ortlund EA, Bridgham JT, Redinbo MR, Thornton JW, Crystal Structure of an Ancient Protein: Evolution by Conformational Epistasis. Science 317, 1544–1548 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
55.Weinreich DM, Delaney NF, DePristo MA, Hartl DL, Darwinian Evolution Can Follow Only Very Few Mutational Paths to Fitter Proteins. Science 312, 111–114 (2006). [DOI] [PubMed] [Google Scholar]
56.Tang C, Iwahara J, Clore GM, Visualization of transient encounter complexes in protein–protein association. Nature 444, 383–386 (2006). [DOI] [PubMed] [Google Scholar]
57.Huang P, Love JJ, Mayo SL, A de novo designed protein-protein interface. Protein Sci. 16, 2770–4 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
58.Fleishman SJ, Whitehead TA, Ekiert DC, Dreyfus C, Corn JE, Strauch E, Wilson IA, Baker D, Computational Design of Proteins Targeting the Conserved Stem Region of Influenza Hemagglutinin. Science 332, 816–821 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
59.Cao L, Coventry B, Goreshnik I, Huang B, Sheffler W, Park JS, Jude KM, Marković I, Kadam RU, Verschueren KHG, Verstraete K, Walsh STR, Bennett N, Phal A, Yang A, Kozodoy L, DeWitt M, Picton L, Miller L, Strauch EM, DeBouver ND, Pires A, Bera AK, Halabiya S, Hammerson B, Yang W, Bernard S, Stewart L, Wilson IA, Ruohola-Baker H, Schlessinger J, Lee S, Savvides SN, Garcia KC, Baker D, Design of protein-binding proteins from the target structure alone. Nature 605, 551–560 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
60.Chen Z, Boyken SE, Jia M, Busch F, Flores-Solis D, Bick MJ, Lu P, VanAernum ZL, Sahasrabuddhe A, Langan RA, Bermeo S, Brunette TJ, Mulligan VK, Carter LP, DiMaio F, Sgourakis NG, Wysocki VH, Baker D, Programmable design of orthogonal protein heterodimers. Nature 565, 106–111 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
61.Joachimiak LA, Kortemme T, Stoddard BL, Baker D, Computational Design of a New Hydrogen Bond Network and at Least a 300-fold Specificity Switch at a Protein-Protein Interface. J. Mol. Biol 361, 195–208 (2006). [DOI] [PubMed] [Google Scholar]
62.Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D, Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
63.Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong Q, Kinch LN, Schaeffer RD, Millán C, Park H, Adams C, Glassman CR, DeGiovanni A, Pereira JH, Rodrigues AV, Van Dijk AA, Ebrecht AC, Opperman DJ, Sagmeister T, Buhlheller C, Pavkov-Keller T, Rathinaswamy MK, Dalwadi U, Yip CK, Burke JE, Garcia KC, Grishin NV, Adams PD, Read RJ, Baker D, Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
64.Dauparas J, Anishchenko I, Bennett N, Bai H, Ragotte RJ, Milles LF, Wicky BIM, Courbet A, de Haas RJ, Bethel N, Leung PJY, Huddy TF, Pellock S, Tischer D, Chan F, Koepnick B, Nguyen H, Kang A, Sankaran B, Bera AK, King NP, Baker D, Robust deep learning–based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
65.Pacesa M, Nickel L, Schellhaas C, Schmidt J, Pyatova E, Kissling L, Barendse P, Choudhury J, Kapoor S, Alcaraz-Serna A, Cho Y, Ghamary KH, Vinué L, Yachnin BJ, Wollacott AM, Buckley S, Westphal AH, Lindhoud S, Georgeon S, Goverde CA, Hatzopoulos GN, Gönczy P, Muller YD, Schwank G, Swarts DC, Vecchio AJ, Schneider BL, Ovchinnikov S, Correia BE, BindCraft: one-shot design of functional protein binders. bioRxiv (2024), p. 2024.09.30.615802. [DOI] [PMC free article] [PubMed] [Google Scholar]
66.Jones S, Thornton JM, Principles of protein-protein interactions. Proc. Natl. Acad. Sci. U. S. A 93, 13–20 (1996). [DOI] [PMC free article] [PubMed] [Google Scholar]
67.Keskin O, Gursoy A, Ma B, Nussinov R, Principles of protein-protein interactions: What are the preferred ways for proteins to interact? Chem. Rev 108, 1225–1244 (2008). [DOI] [PubMed] [Google Scholar]
68.Grünberg R, Serrano L, Strategies for protein synthetic biology. Nucleic Acids Res. 38, 2663–2675 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
69.Chao G, Lau WL, Hackel BJ, Sazinsky SL, Lippow SM, Wittrup KD, Isolating and engineering human antibodies using yeast surface display. Nat. Protoc 1, 755–768 (2006). [DOI] [PubMed] [Google Scholar]
70.Csardi G, Nepusz T, The igraph software package for complex network research. InterJournal Complex Systems, 1695 (2006). [Google Scholar]
71.Walter TS, Meier C, Assenberg R, Au K-F, Ren J, Verma A, Nettleship JE, Owens RJ, Stuart DI, Grimes JM, Lysine Methylation as a Routine Rescue Strategy for Protein Crystallization. Structure 14, 1617–1622 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]
72.Kabsch W, XDS. Acta Crystallogr. Sect. D Biol. Crystallogr 66, 125–132 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
73.Vonrhein C, Flensburg C, Keller P, Sharff A, Smart O, Paciorek W, Womack T, Bricogne G, Data processing and analysis with the autoPROC toolbox. Acta Crystallogr D Biol Crystallogr 67, 293–302 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
74.Winn MD, Ballard CC, Cowtan KD, Dodson EJ, Emsley P, Evans PR, Keegan RM, Krissinel EB, Leslie AGW, McCoy A, McNicholas SJ, Murshudov GN, Pannu NS, Potterton EA, Powell HR, Read RJ, Vagin A, Wilson KS, Overview of the CCP4 suite and current developments. Acta Crystallogr. Sect. D Biol. Crystallogr 67, 235–242 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
75.Evans PR, Murshudov GN, How good are my data and what is the resolution? Acta Crystallogr. Sect. D Biol. Crystallogr 69, 1204–1214 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
76.McCoy AJ, Grosse-Kunstleve RW, Adams PD, Winn MD, Storoni LC, Read RJ, Phaser crystallographic software. J. Appl. Crystallogr 40, 658–674 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]
77.Terwilliger TC, Grosse-Kunstleve RW, Afonine PV, Moriarty NW, Zwart PH, Hung L-W, Read RJ, Adams PD, Iterative model building, structure refinement and density modification with the PHENIX AutoBuild wizard. Acta Crystallogr D Biol Crystallogr 64, 61–69 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
78.Emsley P, Lohkamp B, Scott WG, Cowtan K, Features and development of Coot. Acta Crystallogr. Sect. D Biol. Crystallogr 66, 486–501 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
79.Liebschner D, Afonine PV, Baker ML, Bunkóczi G, Chen VB, Croll TI, Hintze B, Hung L-W, Jain S, McCoy AJ, Moriarty NW, Oeffner RD, Poon BK, Prisant MG, Read RJ, Richardson JS, Richardson DC, Sammito MD, Sobolev OV, Stockwell DH, Terwilliger TC, Urzhumtsev AG, Videau LL, Williams CJ, Adams PD, Macromolecular structure determination using X-rays, neutrons and electrons: recent developments in Phenix. Acta Crystallogr. Sect. D Struct. Biol 75, 861–877 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
80.Echols N, Grosse-Kunstleve RW, Afonine PV, Bunkóczi G, Chen VB, Headd JJ, McCoy AJ, Moriarty NW, Read RJ, Richardson DC, Richardson JS, Terwilliger TC, Adams PD, Graphical tools for macromolecular crystallography in PHENIX. J. Appl. Cryst 45, 581–586 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
81.Afonine PV, Grosse-Kunstleve RW, Echols N, Headd JJ, Moriarty NW, Mustyakimov M, Terwilliger TC, Urzhumtsev A, Zwart PH, Adams PD, Towards automated crystallographic structure refinement with phenix.refine. Acta Crystallogr. Sect. D Biol. Crystallogr 68, 352–367 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
82.Headd JJ, Echols N, Afonine PV, Moriarty NW, Gildea RJ, Adams PD, Flexible torsion-angle noncrystallographic symmetry restraints for improved macromolecular structure refinement. Acta Crystallogr. Sect. D Biol. Crystallogr 70, 1346–1356 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
83.Bricogne G, Blanc E, Brandl M, Flensburg C, Keller P, Paciorek W, Roversi P, Sharff A, Smart OS, Vonrhein C, Womack TO, BUSTER version 20240710. Cambridge, United Kingdom Glob. Phasing Ltd; (2017). [Google Scholar]
84.Smart OS, Womack TO, Flensburg C, Keller P, Paciorek W, Sharff A, Vonrhein C, Bricogne G, Exploiting structure similarity in refinement: automated NCS and target-structure restraints in BUSTER. Acta Crystallogr D Biol Crystallogr 68, 368–380 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
85.Chen VB, Arendall WB, Headd JJ, Keedy DA, Immormino RM, Kapral GJ, Murray LW, Richardson JS, Richardson DC, MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr. Sect. D Biol. Crystallogr 66, 12–21 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
86.Morin A, Eisenbraun B, Key J, Sanschagrin PC, Timony MA, Ottaviano M, Sliz P, Cutting edge: Collaboration gets the most out of software. Elife 2 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
87.Kingma DP, Ba J, Adam: A method for stochastic optimization. arXiv Prepr. arXiv1412.6980 (2014). [Google Scholar]
88.Aghazadeh A, Ocal O, Ramchandran K, CRISPRLand: Interpretable large-scale inference of DNA repair landscape based on a spectral approach. Bioinformatics 36, i560–i568 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
89.Aghazadeh A, Nisonoff H, Ocal O, Brookes DH, Huang Y, Koyluoglu OO, Listgarten J, Ramchandran K, Epistatic Net allows the sparse spectral regularization of deep neural networks for inferring fitness functions. Nat. Commun 12, 5225 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
90.Greenwell BM, Boehmke BC, McCarthy AJ, A simple and effective model-based variable importance measure. arXiv Prepr. arXiv1805.04755 (2018). [Google Scholar]
91.Yang A, Garcia KC, NGS Data for: Structural ontogeny of protein-protein interactions, Dryad (2025). 10.5061/dryad.79cnp5j8b. [DOI] [Google Scholar]
92.Jiang H, Codebase and data for structural ontogeny of protein-protein interactions, Zenodo (2025). 10.5281/zenodo.17282141. [DOI] [Google Scholar]
93.Akpinaroglu D, Kortemme T, Structure-conditioned masked language models for protein sequence design generalize beyond the native sequence space (0.0.1), Zenodo (2023). 10.1101/2023.12.15.571823. [DOI] [Google Scholar]
94.Rocklin GJ, Chidyausiku TM, Goreshnik I, Ford A, Houliston S, Lemak A, Carter L, Ravichandran R, Mulligan VK, Chevalier A, Arrowsmith CH, Baker D, Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357, 168–175 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
95.D’Costa S, Hinds EC, Freschlin CR, Song H, Romero PA, Inferring protein fitness landscapes from laboratory evolution experiments. PLOS Comput. Biol 19, e1010956 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
96.Park Y, Metzger BPH, Thornton JW, The simplicity of protein sequence-function relationships. Nat. Commun 15, 7953 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
97.Posfai A, Zhou J, McCandlish DM, Kinney JB, Gauge fixing for sequence-function relationships. bioRxiv (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
98.Romero PA, Arnold FH, Exploring protein fitness landscapes by directed evolution. Nat. Rev. Mol. Cell Biol 10, 866–876 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
99.Wittmann BJ, Yue Y, Arnold FH, Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Syst. 12, 1026–1045 (2021). [DOI] [PubMed] [Google Scholar]
100.Teufel AI, Wilke CO, Accelerated simulation of evolutionary trajectories in origin-fixation models. J. R. Soc. Interface 14, 20160906 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
101.Dijkstra EW, A Note on Two Problems in Connexion with Graphs. Numer. Math 1, 269–271 (1959). [Google Scholar]

Associated Data

Supplementary Materials

Supplementary tables S1 and S2

NIHMS2128659-supplement-Supplementary_tables_S1_S2.xlsx (19.4 KB, xlsx)

adx6931_Supplementary Materials

NIHMS2128659-supplement-adx6931_Supplementary_Materials.pdf (2.9 MB, pdf)

adx6931_Reproducibility Checklist_seq1_v3

NIHMS2128659-supplement-adx6931_Reproducibility_Checklist_seq1_v3.docx (68.8 KB, docx)

[R1] 1.Levy ED, A Simple Definition of Structural Regions in Proteins and Its Use in Analyzing Interface Evolution. J. Mol. Biol 403, 660–670 (2010). [DOI] [PubMed] [Google Scholar]

[R2] 2.Ma B, Elkayam T, Wolfson H, Nussinov R, Protein–protein interactions: Structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc. Natl. Acad. Sci 100, 5772–5777 (2003). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] 3.Gainza P, Sverrisson F, Monti F, Rodolà E, Boscaini D, Bronstein MM, Correia BE, Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Methods 17, 184–192 (2020). [DOI] [PubMed] [Google Scholar]

[R4] 4.Smith GP, Filamentous Fusion Phage: Novel Expression Vectors That Display Cloned Antigens on the Virion Surface. Science 228, 1315–1317 (1985). [DOI] [PubMed] [Google Scholar]

[R5] 5.Wrighton NC, Farrell FX, Chang R, Kashyap AK, Barbone FP, Mulcahy LS, Johnson DL, Barrett RW, Jolliffe LK, Dower WJ, Small peptides as potent mimetics of the protein hormone erythropoietin. Science 273, 458–463 (1996). [DOI] [PubMed] [Google Scholar]

[R6] 6.DeLano WL, Ultsch MH, De Vos AM, Wells JA, Convergent solutions to binding at a protein-protein interface. Science 287, 1279–1283 (2000). [DOI] [PubMed] [Google Scholar]

[R7] 7.Moreira IS, Fernandes PA, Ramos MJ, Hot spots-A review of the protein-protein interface determinant amino-acid residues. Proteins Struct. Funct. Bioinforma 68, 803–812 (2007). [DOI] [PubMed] [Google Scholar]

[R8] 8.Guharoy M, Chakrabarti P, Conserved residue clusters at protein-protein interfaces and their use in binding site identification. BMC Bioinformatics 11 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Wang X, Lupardus P, LaPorte SL, Garcia KC, Structural Biology of Shared Cytokine Receptors. Annu. Rev. Immunol 27, 29–60 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] 10.Boulanger MJ, Bankovich AJ, Kortemme T, Baker D, Garcia KC, Convergent mechanisms for recognition of divergent cytokines by the shared signaling receptor gp130. Mol. Cell 12, 577–589 (2003). [DOI] [PubMed] [Google Scholar]

[R11] 11.Erijman A, Rosenthal E, Shifman JM, How structure defines affinity in protein-protein interactions. PLoS One 9 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] 12.Storz JF, Compensatory mutations and epistasis for protein function. Curr. Opin. Struct. Biol 50, 18–25 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] 13.Aakre CD, Herrou J, Phung TN, Perchuk BS, Crosson S, Laub MT, Evolving New Protein-Protein Interaction Specificity through Promiscuous Intermediates. Cell 163, 594–606 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] 14.Avizemer Z, Martí-Gómez C, Hoch SY, McCandlish DM, Fleishman SJ, Evolutionary paths that link orthogonal pairs of binding proteins. (2023). 10.21203/rs.3.rs-2836905/v2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] 15.Lipsh-Sokolik R, Fleishman SJ, Addressing epistasis in the design of protein function. Proc. Natl. Acad. Sci. U. S. A 121, 1–9 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.de Juan D, Pazos F, Valencia A, Emerging methods in protein co-evolution. Nat. Rev. Genet 14, 249–261 (2013). [DOI] [PubMed] [Google Scholar]

[R17] 17.Lovell SC, Robertson DL, An Integrated View of Molecular Coevolution in Protein-Protein Interactions. Mol. Biol. Evol 27, 2567–2575 (2010). [DOI] [PubMed] [Google Scholar]

[R18] 18.Shindyalov IN, Kolchanov NA, Sander C, Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng. Des. Sel 7, 349–358 (1994). [DOI] [PubMed] [Google Scholar]

[R19] 19.Dunn SD, Wahl LM, Gloor GB, Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 24, 333–340 (2008). [DOI] [PubMed] [Google Scholar]

[R20] 20.Weigt M, White RA, Szurmant H, Hoch JA, Hwa T, Identification of direct residue contacts in protein-protein interaction by message passing. Proc. Natl. Acad. Sci. U. S. A 106, 67–72 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] 21.Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, Sander C, Protein 3D Structure Computed from Evolutionary Sequence Variation. PLoS One 6, e28766 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22.Hopf TA, Schärfe CPI, Rodrigues JPGLM, Green AG, Kohlbacher O, Sander C, Bonvin AMJJ, Marks DS, Sequence co-evolution gives 3D contacts and structures of protein complexes. Elife 3, 713–724 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R23] 23.Marks DS, Hopf TA, Sander C, Protein structure prediction from sequence variation. Nat. Biotechnol 30, 1072–1080 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R24] 24.Ding D, Green AG, Wang B, Lite TLV, Weinstein EN, Marks DS, Laub MT, Co-evolution of interacting proteins through non-contacting and non-specific mutations. Nat. Ecol. Evol 6, 590–603 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] 25.Dean AM, Thornton JW, Mechanistic approaches to the study of evolution: The functional synthesis. Nat. Rev. Genet 8, 675–688 (2007). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R26] 26.Levin KB, Dym O, Albeck S, Magdassi S, Keeble AH, Kleanthous C, Tawfik DS, Following evolutionary paths to protein-protein interactions with high affinity and selectivity. Nat. Struct. Mol. Biol 16, 1049–1055 (2009). [DOI] [PubMed] [Google Scholar]

[R27] 27.Yang A, Jude KM, Lai B, Minot M, Kocyla AM, Glassman CR, Nishimiya D, Kim YS, Reddy ST, Khan AA, Garcia KC, Deploying synthetic coevolution and machine learning to engineer protein-protein interactions. Science 381 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R28] 28.Johnson MS, Reddy G, Desai MM, Epistasis and evolution: recent advances and an outlook for prediction. BMC Biol. 21, 1–12 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R29] 29.Starr TN, Thornton JW, Epistasis in protein evolution. Protein Sci. 25, 1204–1218 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] 30.Miton CM, Buda K, Tokuriki N, Epistasis and intramolecular networks in protein evolution. Curr. Opin. Struct. Biol 69, 160–168 (2021). [DOI] [PubMed] [Google Scholar]

[R31] 31.Högbom M, Eklund M, Åke Nygren P, Nordlund P, Structural basis for recognition by an in vitro evolved affibody. Proc. Natl. Acad. Sci 100, 3191–3196 (2003). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R32] 32.Akpinaroglu D, Seki K, Guo A, Zhu E, Kelly MJS, Kortemme T, Structure-conditioned masked language models for protein sequence design generalize beyond the native sequence space. (2023). 10.1101/2023.12.15.571823. [DOI] [Google Scholar]

[R33] 33.Schneidman-Duhovny D, Inbar Y, Nussinov R, Wolfson HJ, PatchDock and SymmDock: Servers for rigid and symmetric docking. Nucleic Acids Res. 33, 363–367 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R34] 34.Huang PS, Ban YEA, Richter F, Andre I, Vernon R, Schief WR, Baker D, Rosettaremodel: A generalized framework for flexible backbone protein design. PLoS One 6 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R35] 35.Clark LA, van Vlijmen HWT, A knowledge-based forcefield for protein-protein interface design. Proteins 70, 1540–50 (2008). [DOI] [PubMed] [Google Scholar]

[R36] 36.Hu JC, O’Shea EK, Kim PS, Sauer RT, Sequence Requirements for Coiled-Coils: Analysis with λ Repressor-GCN4 Leucine Zipper Fusions. Science 250, 1400–1403 (1990). [DOI] [PubMed] [Google Scholar]

[R37] 37.Zhu B-Y, Zhou ME, Kay CM, Hodges RS, Packing and hydrophobicity effects on protein folding and stability: Effects of β-branched amino acids, valine and isoleucine, on the formation and stability of two-stranded α-helical coiled coils/leucine zippers. Protein Sci. 2, 383–394 (1993). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R38] 38.Lim WA, Sauer RT, Alternative packing arrangements in the hydrophobic core of λrepresser. Nature 339, 31–36 (1989). [DOI] [PubMed] [Google Scholar]

[R39] 39.Atkinson HJ, Morris JH, Ferrin TE, Babbitt PC, Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS One 4 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R40] 40.Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, Jones SJ, Marra MA, Circos: An information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R41] 41.Lo Conte L, Chothia C, Janin J, The atomic structure of protein-protein recognition sites. J. Mol. Biol 285, 2177–2198 (1999). [DOI] [PubMed] [Google Scholar]

[R42] 42.Norel R, Lin SL, Wolfson HJ, Nussinov R, Shape complementarity at protein-protein interfaces. Biopolymers 34, 933–940 (1994). [DOI] [PubMed] [Google Scholar]

[R43] 43.Sheffler W, Baker D, RosettaHoles: Rapid assessment of protein core packing for structure prediction, refinement, design, and validation. Protein Sci. 18, 229–239 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R44] 44.Shirian J, Sharabi O, Shifman JM, Cold Spots in Protein Binding. Trends Biochem. Sci 41, 739–745 (2016). [DOI] [PubMed] [Google Scholar]

[R45] 45.Gurusinghe SNS, Oppenheimer B, Shifman JM, Cold spots are universal in protein–protein interactions. Protein Sci. 31, 1–14 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R46] 46.Heyne M, Shirian J, Cohen I, Peleg Y, Radisky ES, Papo N, Shifman JM, Climbing up and down Binding Landscapes through Deep Mutational Scanning of Three Homologous Protein-Protein Complexes. J. Am. Chem. Soc 143, 17261–17275 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R47] 47.Levy RM, Haldane A, Flynn WF, Potts Hamiltonian models of protein co-variation, free energy landscapes, and evolutionary fitness. Curr. Opin. Struct. Biol 43, 55–62 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R48] 48.Dutta S, Gullá S, Chen TS, Fire E, Grant RA, Keating AE, Determinants of BH3 Binding Specificity for Mcl-1 versus Bcl-xL. J. Mol. Biol 398, 747–762 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R49] 49.Barlow KA, Ó Conchúir S, Thompson S, Suresh P, Lucas JE, Heinonen M, Kortemme T, Flex ddG: Rosetta Ensemble-Based Estimation of Changes in Protein-Protein Binding Affinity upon Mutation. J. Phys. Chem. B 122, 5389–5399 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R50] 50.Thanos CD, DeLano WL, Wells JA, Hot-spot mimicry of a cytokine receptor by a small molecule. Proc. Natl. Acad. Sci. U. S. A 103, 15422–15427 (2006). [DOI] [PMC free article] [PubMed] [Google Scholar]

[R51] 51.Li K, Tokareva OS, Thomson TM, Wahl SCT, Travaline TL, Ramirez JD, Choudary SK, Agarwal S, Walkup WG, Olsen TJ, Brennan MJ, Verdine GL, McGee JH, De novo mapping of α-helix recognition sites on protein surfaces using unbiased libraries. Proc. Natl. Acad. Sci 119, 2017 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

52. Podgornaia AI, Laub MT, Pervasive degeneracy and epistasis in a protein-protein interface. Science 347, 673–677 (2015).

53. Moutinho AF, Trancoso FF, Dutheil JY, Zhang J, The Impact of Protein Architecture on Adaptive Evolution. Mol. Biol. Evol. 36, 2013–2028 (2019).

54. Ortlund EA, Bridgham JT, Redinbo MR, Thornton JW, Crystal Structure of an Ancient Protein: Evolution by Conformational Epistasis. Science 317, 1544–1548 (2007).

55. Weinreich DM, Delaney NF, DePristo MA, Hartl DL, Darwinian Evolution Can Follow Only Very Few Mutational Paths to Fitter Proteins. Science 312, 111–114 (2006).

56. Tang C, Iwahara J, Clore GM, Visualization of transient encounter complexes in protein–protein association. Nature 444, 383–386 (2006).

57. Huang P, Love JJ, Mayo SL, A de novo designed protein–protein interface. Protein Sci. 16, 2770–2774 (2007).

58. Fleishman SJ, Whitehead TA, Ekiert DC, Dreyfus C, Corn JE, Strauch E, Wilson IA, Baker D, Computational Design of Proteins Targeting the Conserved Stem Region of Influenza Hemagglutinin. Science 332, 816–821 (2011).

59. Cao L, Coventry B, Goreshnik I, Huang B, Sheffler W, Park JS, Jude KM, Marković I, Kadam RU, Verschueren KHG, Verstraete K, Walsh STR, Bennett N, Phal A, Yang A, Kozodoy L, DeWitt M, Picton L, Miller L, Strauch EM, DeBouver ND, Pires A, Bera AK, Halabiya S, Hammerson B, Yang W, Bernard S, Stewart L, Wilson IA, Ruohola-Baker H, Schlessinger J, Lee S, Savvides SN, Garcia KC, Baker D, Design of protein-binding proteins from the target structure alone. Nature 605, 551–560 (2022).

60. Chen Z, Boyken SE, Jia M, Busch F, Flores-Solis D, Bick MJ, Lu P, VanAernum ZL, Sahasrabuddhe A, Langan RA, Bermeo S, Brunette TJ, Mulligan VK, Carter LP, DiMaio F, Sgourakis NG, Wysocki VH, Baker D, Programmable design of orthogonal protein heterodimers. Nature 565, 106–111 (2019).

61. Joachimiak LA, Kortemme T, Stoddard BL, Baker D, Computational Design of a New Hydrogen Bond Network and at Least a 300-fold Specificity Switch at a Protein-Protein Interface. J. Mol. Biol. 361, 195–208 (2006).

62. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D, Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

63. Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong Q, Kinch LN, Schaeffer RD, Millán C, Park H, Adams C, Glassman CR, DeGiovanni A, Pereira JH, Rodrigues AV, Van Dijk AA, Ebrecht AC, Opperman DJ, Sagmeister T, Buhlheller C, Pavkov-Keller T, Rathinaswamy MK, Dalwadi U, Yip CK, Burke JE, Garcia KC, Grishin NV, Adams PD, Read RJ, Baker D, Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).

64. Dauparas J, Anishchenko I, Bennett N, Bai H, Ragotte RJ, Milles LF, Wicky BIM, Courbet A, de Haas RJ, Bethel N, Leung PJY, Huddy TF, Pellock S, Tischer D, Chan F, Koepnick B, Nguyen H, Kang A, Sankaran B, Bera AK, King NP, Baker D, Robust deep learning–based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).

65. Pacesa M, Nickel L, Schellhaas C, Schmidt J, Pyatova E, Kissling L, Barendse P, Choudhury J, Kapoor S, Alcaraz-Serna A, Cho Y, Ghamary KH, Vinué L, Yachnin BJ, Wollacott AM, Buckley S, Westphal AH, Lindhoud S, Georgeon S, Goverde CA, Hatzopoulos GN, Gönczy P, Muller YD, Schwank G, Swarts DC, Vecchio AJ, Schneider BL, Ovchinnikov S, Correia BE, BindCraft: one-shot design of functional protein binders. bioRxiv (2024), p. 2024.09.30.615802.

66. Jones S, Thornton JM, Principles of protein-protein interactions. Proc. Natl. Acad. Sci. U. S. A. 93, 13–20 (1996).

67. Keskin O, Gursoy A, Ma B, Nussinov R, Principles of protein-protein interactions: What are the preferred ways for proteins to interact? Chem. Rev. 108, 1225–1244 (2008).

68. Grünberg R, Serrano L, Strategies for protein synthetic biology. Nucleic Acids Res. 38, 2663–2675 (2010).

69. Chao G, Lau WL, Hackel BJ, Sazinsky SL, Lippow SM, Wittrup KD, Isolating and engineering human antibodies using yeast surface display. Nat. Protoc. 1, 755–768 (2006).

70. Csardi G, Nepusz T, The igraph software package for complex network research. InterJournal Complex Systems, 1695 (2006).

71. Walter TS, Meier C, Assenberg R, Au K-F, Ren J, Verma A, Nettleship JE, Owens RJ, Stuart DI, Grimes JM, Lysine Methylation as a Routine Rescue Strategy for Protein Crystallization. Structure 14, 1617–1622 (2006).

72. Kabsch W, XDS. Acta Crystallogr. Sect. D Biol. Crystallogr. 66, 125–132 (2010).

73. Vonrhein C, Flensburg C, Keller P, Sharff A, Smart O, Paciorek W, Womack T, Bricogne G, Data processing and analysis with the autoPROC toolbox. Acta Crystallogr. Sect. D Biol. Crystallogr. 67, 293–302 (2011).

74. Winn MD, Ballard CC, Cowtan KD, Dodson EJ, Emsley P, Evans PR, Keegan RM, Krissinel EB, Leslie AGW, McCoy A, McNicholas SJ, Murshudov GN, Pannu NS, Potterton EA, Powell HR, Read RJ, Vagin A, Wilson KS, Overview of the CCP4 suite and current developments. Acta Crystallogr. Sect. D Biol. Crystallogr. 67, 235–242 (2011).

75. Evans PR, Murshudov GN, How good are my data and what is the resolution? Acta Crystallogr. Sect. D Biol. Crystallogr. 69, 1204–1214 (2013).

76. McCoy AJ, Grosse-Kunstleve RW, Adams PD, Winn MD, Storoni LC, Read RJ, Phaser crystallographic software. J. Appl. Crystallogr. 40, 658–674 (2007).

77. Terwilliger TC, Grosse-Kunstleve RW, Afonine PV, Moriarty NW, Zwart PH, Hung L-W, Read RJ, Adams PD, Iterative model building, structure refinement and density modification with the PHENIX AutoBuild wizard. Acta Crystallogr. Sect. D Biol. Crystallogr. 64, 61–69 (2008).

78. Emsley P, Lohkamp B, Scott WG, Cowtan K, Features and development of Coot. Acta Crystallogr. Sect. D Biol. Crystallogr. 66, 486–501 (2010).

79. Liebschner D, Afonine PV, Baker ML, Bunkóczi G, Chen VB, Croll TI, Hintze B, Hung L-W, Jain S, McCoy AJ, Moriarty NW, Oeffner RD, Poon BK, Prisant MG, Read RJ, Richardson JS, Richardson DC, Sammito MD, Sobolev OV, Stockwell DH, Terwilliger TC, Urzhumtsev AG, Videau LL, Williams CJ, Adams PD, Macromolecular structure determination using X-rays, neutrons and electrons: recent developments in Phenix. Acta Crystallogr. Sect. D Struct. Biol. 75, 861–877 (2019).

80. Echols N, Grosse-Kunstleve RW, Afonine PV, Bunkóczi G, Chen VB, Headd JJ, McCoy AJ, Moriarty NW, Read RJ, Richardson DC, Richardson JS, Terwilliger TC, Adams PD, Graphical tools for macromolecular crystallography in PHENIX. J. Appl. Crystallogr. 45, 581–586 (2012).

81. Afonine PV, Grosse-Kunstleve RW, Echols N, Headd JJ, Moriarty NW, Mustyakimov M, Terwilliger TC, Urzhumtsev A, Zwart PH, Adams PD, Towards automated crystallographic structure refinement with phenix.refine. Acta Crystallogr. Sect. D Biol. Crystallogr. 68, 352–367 (2012).

82. Headd JJ, Echols N, Afonine PV, Moriarty NW, Gildea RJ, Adams PD, Flexible torsion-angle noncrystallographic symmetry restraints for improved macromolecular structure refinement. Acta Crystallogr. Sect. D Biol. Crystallogr. 70, 1346–1356 (2014).

83. Bricogne G, Blanc E, Brandl M, Flensburg C, Keller P, Paciorek W, Roversi P, Sharff A, Smart OS, Vonrhein C, Womack TO, BUSTER version 20240710. Global Phasing Ltd., Cambridge, United Kingdom (2017).

84. Smart OS, Womack TO, Flensburg C, Keller P, Paciorek W, Sharff A, Vonrhein C, Bricogne G, Exploiting structure similarity in refinement: automated NCS and target-structure restraints in BUSTER. Acta Crystallogr. Sect. D Biol. Crystallogr. 68, 368–380 (2012).

85. Chen VB, Arendall WB, Headd JJ, Keedy DA, Immormino RM, Kapral GJ, Murray LW, Richardson JS, Richardson DC, MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr. Sect. D Biol. Crystallogr. 66, 12–21 (2010).

86. Morin A, Eisenbraun B, Key J, Sanschagrin PC, Timony MA, Ottaviano M, Sliz P, Cutting edge: Collaboration gets the most out of software. eLife 2 (2013).

87. Kingma DP, Ba J, Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

88. Aghazadeh A, Ocal O, Ramchandran K, CRISPRLand: Interpretable large-scale inference of DNA repair landscape based on a spectral approach. Bioinformatics 36, i560–i568 (2020).

89. Aghazadeh A, Nisonoff H, Ocal O, Brookes DH, Huang Y, Koyluoglu OO, Listgarten J, Ramchandran K, Epistatic Net allows the sparse spectral regularization of deep neural networks for inferring fitness functions. Nat. Commun. 12, 5225 (2021).

90. Greenwell BM, Boehmke BC, McCarthy AJ, A simple and effective model-based variable importance measure. arXiv preprint arXiv:1805.04755 (2018).

91. Yang A, Garcia KC, NGS Data for: Structural ontogeny of protein-protein interactions, Dryad (2025). doi:10.5061/dryad.79cnp5j8b.

92. Jiang H, Codebase and data for structural ontogeny of protein-protein interactions, Zenodo (2025). doi:10.5281/zenodo.17282141.

93. Akpinaroglu D, Kortemme T, Structure-conditioned masked language models for protein sequence design generalize beyond the native sequence space (0.0.1), Zenodo (2023). doi:10.1101/2023.12.15.571823.

94. Rocklin GJ, Chidyausiku TM, Goreshnik I, Ford A, Houliston S, Lemak A, Carter L, Ravichandran R, Mulligan VK, Chevalier A, Arrowsmith CH, Baker D, Global analysis of protein folding using massively parallel design, synthesis, and testing. Science 357, 168–175 (2017).

95. D’Costa S, Hinds EC, Freschlin CR, Song H, Romero PA, Inferring protein fitness landscapes from laboratory evolution experiments. PLOS Comput. Biol. 19, e1010956 (2023).

96. Park Y, Metzger BPH, Thornton JW, The simplicity of protein sequence-function relationships. Nat. Commun. 15, 7953 (2024).

97. Posfai A, Zhou J, McCandlish DM, Kinney JB, Gauge fixing for sequence-function relationships. bioRxiv (2024).

98. Romero PA, Arnold FH, Exploring protein fitness landscapes by directed evolution. Nat. Rev. Mol. Cell Biol. 10, 866–876 (2009).

99. Wittmann BJ, Yue Y, Arnold FH, Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Syst. 12, 1026–1045 (2021).

100. Teufel AI, Wilke CO, Accelerated simulation of evolutionary trajectories in origin-fixation models. J. R. Soc. Interface 14, 20160906 (2017).

101. Dijkstra EW, A Note on Two Problems in Connexion with Graphs. Numer. Math. 1, 269–271 (1959).
