Author manuscript; available in PMC: 2026 Mar 11.
Published in final edited form as: Sci Signal. 2025 Nov 18;18(913):eads8396. doi: 10.1126/scisignal.ads8396

CoDIAC: A comprehensive approach for interaction analysis provides insights into SH2 domain function and regulation

Alekhya Kandoor 1, Gabrielle Martinez 1, Julianna M Hitchcock 1, Savannah Angel 1, Logan Campbell 1, Saqib Rizvi 1, Kristen M Naegle 1,*
PMCID: PMC12973274  NIHMSID: NIHMS2124730  PMID: 41252491

Abstract

Protein domains are conserved structural and functional units that serve as the building blocks of proteins. Through evolutionary expansion, domain families are represented by multiple members in diverse configurations with other domains, evolving new specificities for their interacting partners. Here, we developed a structure-based interface analysis to comprehensively map domain interfaces from experimental and predicted structures, including interfaces with macromolecules and intraprotein interfaces. We hypothesized that comprehensive contact mapping of domains could yield insights into domain selectivity and the conservation of domain-domain interfaces across proteins, as well as identify conserved posttranslational modifications (PTMs), relative to interaction interfaces, enabling the inference of specific effects due to PTMs or mutations. We applied this approach to the human Src homology 2 (SH2) domain family, a modular unit central to phosphotyrosine-mediated signaling, identifying an approach to understanding binding selectivity and evidence of coordinated regulation of SH2 domain binding interfaces by tyrosine and serine/threonine phosphorylation and acetylation. These findings suggest that multiple signaling systems can regulate protein activity and SH2 domain interactions in a coordinated manner. We provide the extensive features of the human SH2 domain family and this modular approach as an open source Python package for Comprehensive Domain Interface Analysis of Contacts (CoDIAC).

Introduction

Protein domain families share a common fold and function and serve as key building blocks for proteins. It was previously predicted that the human proteome contains 47,576 domains [1] with diverse functions, including binding to proteins, RNA, DNA, or lipids, and enzymatic activities such as regulating phosphorylation, proteolysis, and DNA repair. Recombination of domains into new protein architectures can tailor existing functions or generate entirely new ones. As domains duplicate and evolve, they develop new specificities, enabling novel functionality in both the domain-containing protein and its interaction partners [2]. Despite learning much from individual domain components, we still lack a complete understanding of how domain interaction interfaces drive molecular interactions and regulate overall protein function.

Domains are the targets of numerous posttranslational modifications (PTMs) [3], most with unknown functions. Previously, we hypothesized that conserved PTMs—those occurring in the same structural position across many domains in a family—likely share common effects. Applying this concept to the RRM domain family, we identified conserved tyrosine phosphorylation in the RNA-binding motif, suggesting global regulation of RNA interactions by phosphotyrosine signaling [4]. However, this approach lacked a comprehensive analysis between PTMs and the myriad interaction interfaces across domains. Understanding residue-level contacts with ligands or between domains in a protein could help to hypothesize functional effects of PTMs. Whereas the Protein Contact Atlas [5] provided noncovalent interactions from x-ray structures, challenges connecting structures with reference sequences, lack of programmatic access, and absence of coverage for NMR, EM, and AlphaFold structures [6] led us to develop a modular, Python-based package to extract interaction interfaces of domains from a wider range of sources. This tool identifies relevant structures, defines domain boundaries, and integrates contact analysis with PTMs and mutations using flat text formats and Jalview-based visualization. Most modules can be used individually or together for comprehensive domain family analysis.

Given the central roles of protein domains in protein function, this pipeline should enable diverse research applications. We applied it here to analyze Src homology 2 (SH2) domains—“reader” domains essential to phosphotyrosine (pTyr) signaling. The SH2 domain is interesting, because it is extensive (119 domains in 109 proteins within the human proteome), it co-occurs in diverse protein architectures (for example, alone, with other reader domains, and with a broad range of enzymatic domains), it is a member of a broad class of globular domains that interact with flexible protein ligands (termed domain-motif interactions), and it is extensively posttranslationally modified. Despite considerable experimental coverage in the PDB, structures represent only a fraction of the “SH2ome,” with even fewer structures complexed with ligands. Although progress has been made in understanding binding determinants, including developing SH2 domains as reagents for pTyr site enrichment in mass spectrometry [7], we still lack a comprehensive understanding of which SH2 domains interact with which of the 46,000 pTyr sites in the human proteome [8]. Better understanding could improve SH2 domain design as affinity reagents, as well as engineering new specificities for inhibitors targeting dysregulated tyrosine kinase signaling [9, 10]. Additionally, although one structural position of tyrosine phosphorylation has a known regulatory function in SRC family kinases [11, 12, 13], little is known about other extensive PTMs on SH2 domains.

Here, we describe our generalized contact mapping and domain-centric analysis pipeline, COmprehensive Domain Interface Analysis of Contacts (CoDIAC), and its application to human SH2 domains. We developed an approach to systematically project contacts, including from AlphaFold predictions, to infer contact maps for SH2 domains lacking structural data. Our analysis revealed both known and previously unappreciated insights. Through contact mapping, we found that the fraction of bonds formed between an SH2 domain and specific ligand positions directly correlated with the specificity conferred at those positions. We also discovered that domain-domain pairings recurring in the proteome maintained conserved interaction interfaces, but primarily when the entire protein architecture is conserved. Additionally, we found that partial protein structures sometimes misrepresent native interaction interfaces. Through conserved structural analysis of PTMs and binding interfaces, we identified extensive modification of both domain-domain and domain-ligand interfaces by serine/threonine phosphorylation, tyrosine phosphorylation, and acetylation, suggesting the regulation of SH2 domain “reader” function by multiple signaling systems. Finally, our ability to extrapolate across the SH2 domain family while considering multifaceted interfaces provides potential insights into clinically relevant mutations.

Results

CoDIAC: A comprehensive, domain-centric pipeline for the analysis of contacts

CoDIAC is a Python-based package that integrates multiple resources to identify and extract contact maps from all available experimental structures in the PDB [14] and AlphaFold predictions [6] that cover a domain of interest as defined by its InterPro accession [15]. The CoDIAC pipeline builds an annotated reference set of proteins and structures, extracts contact maps, and integrates with other features to produce resource files for analysis and Jalview visualization (Fig. 1A) [16]. CoDIAC uses UniProt [17] as the primary protein reference and pairwise sequence alignment from Biopython [18] to align structural sequences with UniProt reference sequences, identifying domain boundaries and generating features referenced to a common sequence. Beyond structurally extracted contact maps, CoDIAC translates other protein annotations onto reference sequences and integrates the extraction of PTMs from ProteomeScout [3] and PhosphoSitePlus [19], as well as variants from databases like OMIM [20] and clinically important gnomAD missense mutations [21, 22]. CoDIAC is available at https://github.com/NaegleLab/CoDIAC.
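The idea of placing a structure-derived sequence onto a reference sequence can be illustrated with a minimal sketch. It is a stand-in under stated assumptions: it uses Python's standard-library `difflib.SequenceMatcher` rather than the Biopython pairwise alignment that CoDIAC uses, and the function name `map_structure_to_reference`, the sequences, and the domain coordinates are hypothetical.

```python
from difflib import SequenceMatcher


def map_structure_to_reference(reference_seq, structure_seq):
    """Return {structure_index: reference_index} for matched residues (0-based).

    Hypothetical helper: matching blocks from difflib stand in for a real
    pairwise alignment such as the Biopython alignment CoDIAC relies on.
    """
    matcher = SequenceMatcher(None, reference_seq, structure_seq, autojunk=False)
    mapping = {}
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            mapping[block.b + offset] = block.a + offset
    return mapping


# Made-up sequences: the structure covers only part of the reference.
reference = "MKTAYIAKQRQISFVKSHFSRQ"
fragment = "AYIAKQRQISF"
index_map = map_structure_to_reference(reference, fragment)

# A domain boundary known in reference coordinates can then be checked
# against the residues actually resolved in the structure.
domain_start, domain_end = 5, 12  # hypothetical reference coordinates
covered = [i for i in index_map.values() if domain_start <= i <= domain_end]
print(len(covered), "of", domain_end - domain_start + 1, "domain residues covered")
```

Once every structure is expressed in reference coordinates, features from different structures of the same protein can be compared position by position.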

Fig. 1. Overview of the CoDIAC pipeline and its application to SH2 domains.

(A) Flowchart outlining the key modules of the CoDIAC pipeline, which identify relevant domain-containing proteins (agnostic of species) and their associated PDB and AlphaFold structures, annotate controlled domains from those resources for the extraction of domain interactions between proteins (ligand) and within proteins (domain-domain), and orient PTMs, mutations, and contact interfaces onto a common reference and Jalview-based visualization. (B) An overview of all of the data and features extracted for the human SH2 domain family with CoDIAC. The SH2 domain family has been sorted and grouped by shared architectures (A1, for example, is a group that contains only the SH2 domain). The domains immediately adjacent to the SH2 domain are labeled on the outermost tracks (for example, the A5 group is an architecture that consists of SH3-SH2-kinase) and are in color if experimental structures covering the pair of domains exist. The feature generation and inference performed for the SH2 domain family are highlighted in the inner two circles. The innermost circle represents the genes for which we have experimental (black), AlphaFold (slate-gray), and inferred (light gray) features for domain-domain interfaces, whereas the outer circle with hatched pattern represents ligand interface contacts with phosphotyrosine (pTyr)-containing ligands.

Whereas we demonstrated the domain-centric analysis of the human SH2 domain family with the complete CoDIAC pipeline, its modular components can be used independently. The first part produces descriptive text files containing sequences, gene names, and domain annotations for proteins of interest, which are useful for evolutionary studies across species or comprehensive proteome information. The feature integration components can be used beyond contact mapping to merge different resources. For example, here we integrated manually extracted features from previous research on phage display mutagenesis [23] and SH2 domain–phospholipid binding contacts [24]. Additionally, the contact map extraction component is domain-agnostic and can analyze any regions within structures. This modular framework provides a flexible toolkit for various protein and domain-based analyses.

CoDIAC contact mapping

A key aspect of CoDIAC is the conversion of structural files into fingerprints of noncovalent interactions. This enables domain-focused analysis by mapping contacts across all identified PDB and AlphaFold structures containing the domain of interest. For individual structures, CoDIAC uses Arpeggio [25] to generate flat text files of all interatomic interactions. These “adjacency files” contain distance and contact type information for each interacting residue pair across entities and chains. We then generate binary adjacency files indicating whether pairwise residue-level interactions have sufficient evidence (binary value 1) or not (binary value 0), based on user-defined parameters for maximum interaction distance or contact types. For experimental structures with multiple molecule representations, we aggregate across assemblies to determine sufficient residue-level contact evidence (also user-defined). For SH2 domain analysis, we retained interactions with distances <5 Å and present in at least 25% of chains. For structures with multiple entities (domain and ligand), we kept contacts shared by at least 50% of the domain-ligand pairs. The result is a contact map represented as a binary adjacency file detailing interactions with PDB ID, residue numbers, IDs, entity IDs, and binary interaction values. CoDIAC processes mmCIF files from any source, including PDB and AlphaFold, offering flexibility in contact mapping parameters within a common programming environment and producing readable file formats.
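The binarization step described above can be sketched as follows. This is a minimal illustration, not the CoDIAC implementation: the function name `binarize_contacts`, the tuple layout of the input, and the example values are assumptions, and only the distance cutoff (<5 Å) and the chain-fraction threshold (25%) are taken from the text.

```python
from collections import defaultdict


def binarize_contacts(contacts, n_chains, max_distance=5.0, min_chain_fraction=0.25):
    """Convert per-chain contact records into binary residue-pair evidence.

    contacts: iterable of (chain_id, residue_i, residue_j, distance_in_angstrom).
    Returns {(residue_i, residue_j): 1 or 0} for every pair that has at least
    one contact under the distance cutoff in at least one chain.
    """
    chains_with_pair = defaultdict(set)
    for chain_id, res_i, res_j, distance in contacts:
        if distance < max_distance:
            chains_with_pair[(res_i, res_j)].add(chain_id)
    return {
        pair: int(len(chains) / n_chains >= min_chain_fraction)
        for pair, chains in chains_with_pair.items()
    }


# Hypothetical records for a structure with four copies of the chain.
example_contacts = [
    ("A", 10, 50, 3.2),
    ("B", 10, 50, 4.1),
    ("A", 12, 60, 6.5),  # beyond the distance cutoff, never counted
    ("A", 15, 70, 4.9),  # seen in one of four chains (exactly 25%)
]
binary_map = binarize_contacts(example_contacts, n_chains=4)
```

Keeping the thresholds as keyword arguments mirrors the user-defined parameters mentioned in the text, so the same function could be rerun with stricter or looser settings.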

For domain-centric analysis, CoDIAC uses generated contact maps and annotated structure files to identify domain regions and analyze their interactions with other regions. In this work, we focused on SH2 domain interactions with other domains within a protein or with pTyr-containing ligands. CoDIAC generates Jalview-style feature files of contact maps for visualization and analysis. Because proteins often have multiple experimental structures, we aggregated across independent experiments to produce a single set of contact features for each available SH2 domain interaction interface. To avoid spurious contacts from study bias, we required features to be shared in at least 30% of the available structures when multiple structures exist, a threshold determined by measuring feature retention versus inclusion stringency (fig. S1). The resulting pipeline produces feature files indicating which SH2 domain residues interact with ligands or other domains, enabling exploration of domain regulation by PTMs or mutations and their effects on protein function and ligand binding.
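The aggregation across independent structures can be sketched as below. This is an illustrative example only: the function name `aggregate_features` and the example position sets are hypothetical, and the 30% threshold is the one value taken from the text.

```python
from collections import Counter


def aggregate_features(per_structure_features, min_fraction=0.30):
    """Keep contact positions supported by enough independent structures.

    per_structure_features: list of sets of reference alignment positions,
    one set per structure. With a single structure, all features are kept,
    since the threshold only applies when multiple structures exist.
    """
    n_structures = len(per_structure_features)
    if n_structures == 0:
        return set()
    if n_structures == 1:
        return set(per_structure_features[0])
    counts = Counter(pos for feats in per_structure_features for pos in set(feats))
    return {pos for pos, count in counts.items() if count / n_structures >= min_fraction}


# Hypothetical contact positions from four structures of one SH2 domain.
structures = [{37, 62, 103}, {62, 103, 105}, {62, 120}, {62, 103}]
consensus = aggregate_features(structures)
```

In this made-up example, positions seen in only one of four structures (25%) fall below the threshold and are dropped, which is the intended guard against contacts that reflect a single experiment.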

SH2 domain analysis of interfaces

We used CoDIAC to systematically extract domain interfaces from structures (both experimental and predicted) to understand binding specificity, interface conservation, and the function of mutations and PTMs on the human SH2 domain family (Fig. 1B). From the PDB, we identified 467 experimental structures covering 61 unique human SH2 proteins, with 135 structures including pTyr-containing ligand complexes. Most ligand-containing structures (101) occurred in trans (ligand and domain as separate entities), whereas 34 were bound in cis, with the ligand part of the same protein chain (fig. S2). For contact mapping, we excluded structures with variants in the SH2 domain during ligand interface mapping, as well as structures with any mutations during domain-domain interface mapping. The experimental structures provided 78 unique SH2-pTyr ligand pairs and 207 structures across 32 proteins with domain-domain interface coverage, including 144 representing full protein architectures (fig. S3). Although AlphaFold currently cannot predict ligand-bound structures involving modified residues, we leveraged the full-length predicted structures of SH2 domain-containing proteins with low prediction error to study domain–domain interaction interfaces. We obtained 109 predicted structures for full protein architecture domain-domain interface analysis. We used PROMALS3D [26] to generate a structure-based alignment of the SH2 domain family (with all data available at https://doi.org/10.6084/m9.figshare.26321968), representing a comprehensive resource of SH2 domain interfaces.

Comprehensive ligand contact mapping

We extracted SH2 domain residues interacting with ligands across all available structures (Fig. 2A). Canonical SH2 binding involves a central engagement of the pTyr residue with an invariant arginine in the binding pocket (alignment position 62), with specificity from surrounding residues interacting within a shallow pocket [27]. Approximately half of the binding energy comes from the invariant arginine-pTyr interaction, with the remainder from nearby ligand residues [28]. Our systematic analysis recovered the pTyr–invariant arginine interaction in most cases, with nine exceptions occurring under high local concentrations (fig. S4), including tandem SH2 domains with bivalent pTyr ligands (SYK, ZAP70, and RASA1) and cases in which SH2 domains (CRK and CRKL) are tethered to pTyr ligands in cis. Although these exceptions highlight possible noncanonical binding, we excluded them from canonical binding pocket mapping.

Fig. 2. SH2 domain ligand interface analysis.

(A) A matrix of ligand-based contacts of the SH2 domain relative to the reference alignment of the SH2 domain family (selected examples). The residue-level contacts for 28 unique SH2 domains and their ligands from 111 PDB structures are shown in this heatmap. Contacts are colored based on whether the residue makes contact with the N-terminal side or the pTyr in the ligand (pink), to the C-terminal side (orange), or to both sides (gray). Position 105 is the most C-terminal position that was consistently seen to interact with the N-terminal portion of the ligand. (B) Matrix representation of the combined feature set for the canonical binding interface of SH2-pTyr interactions, relative to the reference alignment of SH2 domains and oriented by the pTyr for ligands. Features viewed here are aggregated from across 78 unique SH2-pTyr ligand pairs, and each feature is projected as an interaction between the residues in the domain (x-axis) and the ligand (y-axis). Secondary structures and amino acid consensus annotations were obtained from the PROMALS3D alignment. Bottom: Representative mutations and their effects from phage display studies separated according to whether they increase affinity (“superbinder”) or alter specificity.

We explored how important the sequence identity of the ligand was to the mapping of the SH2 domain ligand-binding pocket. We encoded the SH2 domain residues that make contact with the ligands according to whether they make contact with the N-terminal or pTyr of the ligand, the C-terminal side of the ligand, or both (fig. S5 and Fig. 2A). This recapitulated findings from individual structural studies, showing that N-terminal SH2 regions engage N-terminal and pTyr ligand positions, whereas C-terminal SH2 binds to C-terminal ligand positions. We identified position 105 in our reference alignment as the primary bifurcation point between the pTyr pocket and specificity region across most domains (Fig. 2A). When we similarly encoded ligand residues by their interaction with SH2 domain regions before or after position 105, clustering showed that ligand interactions were driven by SH2 domain identity rather than by ligand sequence (fig. S6), consistent with flexible ligands fitting into rigid globular domains.
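The three-way encoding of domain residues can be sketched as follows. The function name `encode_ligand_side`, the label strings, and the example contact pairs are hypothetical; the sketch assumes ligand positions are numbered relative to the pTyr (position 0), with negative values N-terminal and positive values C-terminal.

```python
def encode_ligand_side(contact_pairs):
    """Label each domain position by the side of the ligand it contacts.

    contact_pairs: iterable of (domain_alignment_position, ligand_position),
    where ligand_position is relative to the pTyr (0 = pTyr).
    Returns {domain_position: "N/pTyr", "C", or "both"}.
    """
    sides = {}
    for domain_pos, ligand_pos in contact_pairs:
        side = "N/pTyr" if ligand_pos <= 0 else "C"
        sides.setdefault(domain_pos, set()).add(side)
    return {
        pos: "both" if len(labels) == 2 else next(iter(labels))
        for pos, labels in sides.items()
    }


# Hypothetical contacts between alignment positions and ligand positions.
pairs = [(62, 0), (37, -1), (37, 0), (103, 0), (103, 1), (124, 3)]
encoding = encode_ligand_side(pairs)
```

Applying the same encoding in the other direction (labeling ligand positions by whether they contact domain positions before or after a chosen alignment position) would only require swapping the roles of the two tuple elements.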

We also investigated how ligand length affected domain mapping by examining residues mapped by just the pTyr (fig. S7) and additional flanking residues (fig. S5). We found that the presence of N-terminal amino acids was unnecessary for recovering unique contacts, because ligands consistently mapped the same pocket residues regardless of N-terminal presence. Furthermore, even short ligands often mapped the same SH2 domain residues as longer ones, suggesting additional ligand residues frequently do not contribute previously uncharacterized domain contacts (see GRB2 in fig. S5). However, in some cases, very short ligands only mapped part of the specificity region, whereas longer peptides revealed additional contacts (see LCK in fig. S5). The high consistency of domain residues mapped across independent experiments with different ligand sequences and lengths suggests that the available PDB ligands provide robust mapping of the SH2 domain binding pocket, especially when aggregating information across structures.

Next, we explored the interactions between specific SH2 domain residues and ligand positions, using a 2D view with SH2 domain alignment positions on the x-axis, ligand positions on the y-axis, and values indicating the number of structures mapping contacts between position pairs (Fig. 2B). The most frequent interaction occurred between the invariant arginine (position 62) and the pTyr residue. High agreement also existed for other conserved structural positions interacting with the pTyr, including positions 103 (polar/His dominant) and 105 (hydrophobic/Lys dominant) on the βD-strand, which engage pTyr in >95% of structures, and position 37 on the αA helix (mostly arginine), which engages pTyr in ~85% of structures. These enriched positions align with known SH2-pTyr binding mechanisms [29, 30, 31]. We also found that positions 64 (last βB residue) and 74 (small residues on βC) strongly interacted with pTyr in approximately 85% of structures. As binding transitions to the C-terminal portion, binding mode diversity becomes apparent, again highlighting position 105 as the most C-terminal SH2 residue that engages the pTyr. This visualization also showed that ligand position −1 engagement predominantly occurred with adjacent pTyr binding (position 37), whereas the +1 position was concurrently engaged with position 103, suggesting dual constraints for these ligand residues. However, because other SH2 residues can independently interact with the +1 position (but not with −1), there may be differences in constraints between −1 and +1 positions. Comprehensive mapping bridges knowledge from individual structures and highlights domain-specific variations driving different ligand specificities.

We also examined whether our structural analysis could explain phage display mutagenesis results from researchers developing SH2 “superbinders” for pTyr pull-down reagents [23, 7]. When mapping phage-based mutation effects (fig. S8), the mutations that increased binding affinity directly corresponded to the N-terminal SH2 domain region coordinating pTyr binding, including positions 74 and 105—two of the triplet mutations yielding the 100-fold greater affinity currently used in improved reagents [7] (Fig. 2B). The third mutation in the triplet (position 67) is adjacent to a conserved pTyr interaction residue, suggesting that immediately adjacent positions also shape the binding pocket. Non-evolvable positions (which lose function when mutated) include the invariant arginine at position 62 and conserved positions not involved in binding—likely affecting domain folding rather than the binding interface. Positions altering SH2 domain specificity occurred at and beyond position 124, in the variable structural contact region (Fig. 2B and fig. S8). Thus, our systematic extraction identified both conserved structural positions in the pTyr pocket and variable, SH2-specific positions contributing to ligand selection, potentially guiding the development of SH2 domain products with altered affinity or specificity.

The relative proportion of residue-level interactions determines the importance of ligand binding

We observed that the total number of SH2 domain residues engaged with specific ligand positions correlated with known binding patterns—most interactions occur with the pTyr residue and the C-terminal ligand portion (Fig. 2B). To examine this further, we extracted the number of residue-level bonds made to each ligand position and normalized them by the maximum number of bonds observed, typically at the pTyr position (Fig. 3). Averaging across all SH2 domains confirmed that most bonds coordinate the pTyr residue, followed by the +1 and +3 positions, then +2, with relatively few bonds to the N-terminal side or beyond position +3. This pattern aligns with SH2 domain binding profiles obtained from degenerate peptide libraries, whereby C-terminal positions, particularly +3, are important selection criteria for protein-protein interactions [32, 33].
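The normalization described above can be written as a short sketch. The function name `bond_fractions` and the example contacts are hypothetical; the only element taken from the text is the normalization of per-position counts by the maximum count observed.

```python
from collections import Counter


def bond_fractions(contact_pairs):
    """Fraction of residue-level bonds made to each ligand position.

    contact_pairs: iterable of (domain_position, ligand_position), with the
    ligand position given relative to the pTyr (0 = pTyr). Counts are
    normalized by the largest count, so the most contacted position is 1.0.
    """
    counts = Counter(ligand_pos for _, ligand_pos in set(contact_pairs))
    if not counts:
        return {}
    top = max(counts.values())
    return {pos: counts[pos] / top for pos in sorted(counts)}


# Hypothetical contact map for one SH2 domain and its ligand.
example_pairs = [(37, 0), (62, 0), (64, 0), (103, 0), (103, 1), (105, 1), (124, 3)]
fractions = bond_fractions(example_pairs)
```

Because each domain is scaled to its own maximum, the resulting profiles can be compared or clustered across domains that differ in the total number of contacts they make.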

Fig. 3. Testing the hypothesis that the fraction of bonds dedicated to coordinating specific ligand positions is related to ligand specificity.

Top: For each domain complexed with ligand in available structures, we normalized the number of residue-level interactions made to ligand positions to the maximum number of residue-level bonds made (almost always a value of 1 for the pTyr interaction). ‘Total’ indicates the average across all available structures. The matrix was clustered, and we used this grouping and a relative emphasis on the fraction of bonds to define groupings, such as +2 binding for the GRB2/GRAP2-containing group. Bottom: motif logos generated for peptides pulled down from previously published superbinder experiments with stimulated and pervanadate-treated Jurkat cell lysates [7]. Here, we included any domain with more than 50 peptides and randomly downsampled to 100 total for controlling entropy. WT indicates that the pull-down was from WT SH2 domains, whereas S indicates that the data are from a “superbinder” mutant domain. The emphasis above the bars is based on the fraction of bonds. VAV3 and CRKL are not directly represented by available structure, but have close homologs (VAV1/VAV2 and CRK, respectively).

We hypothesized that the fraction of bonds used to coordinate specific ligand positions predicts interaction selectivity. To test this, we analyzed data from a study performing pTyr pull-downs from pervanadate-treated Jurkat cell lysates with SH2 domains, including superbinder mutants [7], to evaluate motif logos (Fig. 3). Whereas SH2 domains generally engage positions 0 to +3, with an emphasis on +1 and +3, there is considerable diversity among family members, which is revealed by hierarchical clustering of bond fractions. For example, although the average fraction of bonds at the +2 position is relatively low, some members (GRB2, GRAP2) show higher engagement at this position. The GRB2-binding motif is the only one with a significant +2 determinant, consistent with our classification as a “+2 binding mode” group. Additionally, GRB2-related members showed relatively few +3 position interactions, consistent with previous findings showing that a bulky tryptophan in the EF-loop restricts access to the +3 binding pocket [34]. Thus, residue-level bond patterns appear predictive of ligand selectivity information content.

Additional binding patterns emerged from the clustered heatmap, including groups with equivalent emphasis on the +1 and +3 positions and groups where +1 is stronger than +3. Motif logos generally match these bond fraction observations, suggesting that bond fraction can classify ligand binding modes more broadly. For example, PIK3R1(N) and PIK3R2(N), with stronger +1 bond emphasis, showed greater selectivity for +1 than for +3 in their motif logos. These domains also have the greatest bond fractions at the −1 and −2 positions and their logos show the strongest information content in positions N-terminal to the pTyr.

Although we observed high bond fractions at the +4 position for some domains, no motif logos suggested +4 position selectivity. Pascal et al. found that PLCG1(C), together with other Group II binders (SH2 domains that have small amino acids in β-D5, our alignment position 104), enable an extended interaction with the ligand (up to and past the +4 ligand position) [35]. The lack of ligand discrimination at the pull-down level, despite nonnegligible bonds at the +4 position, is likely explained by our earlier observation that extended peptide length does not map new SH2 domain residues: the SH2 domain residues that contact the +4 position are already engaged with the +1 to +3 residues (Fig. 2B). Bond fractions for PLCG1(C) suggest strong +1 importance, consistent with mutagenesis studies showing that only the +1 position provides significant binding energy outside the pTyr interaction [35]. Together, these results demonstrate that comprehensive analysis across structures reveals key information about SH2 domain–ligand coordination and domain-specific binding differences, while highlighting the need for deeper understanding of how combinatorial residue-level interactions determine specificity, because simple bond interaction analysis is insufficient to describe ligand discrimination.

The motif logos indicated, albeit weakly, preferences for acidic amino acids at the +1 position, consistent with the structural co-dependence between +1 and pTyr binding through SH2 residue 103 (Fig. 2B). Because the information content of the motif logos assumes an equal distribution of amino acids, we statistically tested for the presence of a +1 acidic amino acid in the peptides pulled down by SH2 domains, compared to the overall background of all peptides identified in the experiment, thereby controlling for sequence features related to the condition and to the ability to be phosphorylated by kinases (table 1). Indeed, we saw significant enrichment for glutamate and aspartate (E/D) in the +1 position for 10 of the 12 domains with pulldown data, which corresponds to 8 of the 10 domains for which we also have structures (many tests had very large effect sizes). The two domains lacking +1 acidic enrichment were CRKL and GRB2, which showed the strongest +3 determinants. Testing for −1 acidic enrichment showed that CRKL and GRB2 were instead enriched for this constraint, with fewer domains showing −1 enrichment overall and weaker effect sizes. Structures capturing the −1/pTyr dual constraint were less common than those capturing pTyr/+1 engagement, correlating with the observed effect sizes in phosphopeptide pulldowns. These independent, high-throughput data suggest that structural analysis is informative at both the family-wide and individual-domain levels: it points to the minimal constraints that may be required for a phosphopeptide to bind any SH2 domain and to the residue positions that guide binding by specific SH2 domains.
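An enrichment test of this kind can be sketched as below. The text does not state which statistical test was used, so the one-sided hypergeometric test here is an assumption, as are the function names `count_acidic_plus1` and `hypergeom_enrichment_p`, the peptide strings, and the index of the pTyr within each peptide.

```python
from math import comb


def count_acidic_plus1(peptides, ptyr_index):
    """Count peptides with D or E immediately C-terminal to the pTyr."""
    return sum(1 for pep in peptides if pep[ptyr_index + 1] in "DE")


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided P(X >= k) for a hypergeometric draw.

    N: background peptides, K: background peptides with the feature,
    n: pulled-down peptides, k: pulled-down peptides with the feature.
    """
    upper = min(n, K)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


# Hypothetical 7-mers centered on the phosphotyrosine (index 3).
background = ["AAAYEAA", "AAAYDAA", "AAAYGAA", "AAAYLAA", "AAAYPAA",
              "AAAYSAA", "AAAYTAA", "AAAYEAA", "AAAYVAA", "AAAYIAA"]
pulled_down = ["AAAYEAA", "AAAYDAA", "AAAYEAA", "AAAYGAA"]

K = count_acidic_plus1(background, ptyr_index=3)
k = count_acidic_plus1(pulled_down, ptyr_index=3)
p_value = hypergeom_enrichment_p(k, len(pulled_down), K, len(background))
```

Using all peptides identified in the experiment as the background, as the text describes, is what lets the test account for sequence biases shared by every phosphopeptide in that condition.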

Intraprotein, domain-domain, interaction interfaces

To understand the role of PTMs, mutations, and nonligand interfaces in regulating full protein architectures, we examined interfaces between SH2 domains and other modular domains within a protein (intraprotein interactions). This analysis enabled us to test domain-domain contact similarity between experimental and AlphaFold structures and assess interface conservation when domain pairs are reused in different protein architectures. PDB structures provided contact maps for nine domain-domain interfaces spanning 21 unique SH2 domains (fig. S9), whereas AlphaFold predictions yielded five additional interfaces plus comparisons with the nine from the PDB (fig. S10), covering 43 unique SH2 domains and their adjoining domains.

To evaluate structures predicted by AlphaFold, we compared domain-domain contact maps from experimental and predicted structures with the Jaccard Index (JI), the number of shared residue features normalized by the total features from both methods. Experimental and predicted contacts showed high similarity (JI >0.5) for 16 of 19 domain-domain interfaces (fig. S11A). For interfaces with low similarity, we examined the Predicted Aligned Error (PAE) from AlphaFold, which estimates distance error between residue pairs. The SH2-PE/DAG-bd interface in CHN2 shared only half the contacts between experiment and prediction and had a high average PAE, whereas the PTPN6 SH2-SH2 interface showed substantial differences, with poor PAE values (Data File S2). These results indicate that PAE is useful for assessing predicted structure reliability for contact extraction (fig. S11). Based on this analysis, we used predicted structure interfaces only when PAE estimates showed at least 50% of the domain-domain interface within an acceptable error range (PAE ≤ 10 Å). Given the generally high agreement between experimental and predicted structures, with PAE-based filtering to reduce errors, we generated contact maps for an additional 13 unique interfaces using AlphaFold predictions.
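The two checks described above can be sketched as follows. This is an illustration under assumptions: the function names `jaccard_index` and `interface_passes_pae`, the nested-list layout of the PAE matrix, and all example values are hypothetical, and "total features from both methods" is interpreted here as the union of the two feature sets. The thresholds (JI > 0.5, PAE ≤ 10 Å, at least 50% of the interface) come from the text.

```python
def jaccard_index(features_a, features_b):
    """Shared features divided by the union of features from both sources."""
    a, b = set(features_a), set(features_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def interface_passes_pae(pae, interface_pairs, max_pae=10.0, min_fraction=0.5):
    """Check that enough interface residue pairs have acceptable PAE.

    pae: square matrix (list of lists) of predicted aligned error in angstrom.
    interface_pairs: list of (i, j) residue index pairs forming the interface.
    """
    if not interface_pairs:
        return False
    acceptable = sum(1 for i, j in interface_pairs if pae[i][j] <= max_pae)
    return acceptable / len(interface_pairs) >= min_fraction


# Hypothetical interface positions from experiment and prediction.
experimental = {10, 11, 12, 40}
predicted = {11, 12, 40, 41}
similarity = jaccard_index(experimental, predicted)

# Hypothetical 3x3 PAE matrix and interface pairs.
pae_matrix = [[0.5, 4.0, 12.0],
              [4.0, 0.5, 9.0],
              [12.0, 9.0, 0.5]]
usable = interface_passes_pae(pae_matrix, [(0, 1), (0, 2), (1, 2)])
```

Treating the PAE filter as a separate gate from the similarity score means it can be applied to predicted interfaces that have no experimental counterpart to compare against.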

Having validated our approach for both experimental and predicted domain-domain interface extraction, we examined interface composition across the SH2 family and similarity between proteins sharing domain-domain pairings. We found that the extent of SH2 domain engagement in domain-domain interfaces varied considerably. For example, the SH2-SOCS_box interface is extensive (41 contacts) in SOCS proteins, whereas others consist of just a few contacts, such as the tandem SH2 domain interface in the PTP family (Fig. 4A). The extent of the interface may indicate overall protein regulation importance. For example, the extensive contacts between the SH2(N) and PTP catalytic domains in PTPN6 and PTPN11 are consistent with the role of the SH2 domain in maintaining the inactive conformation of the phosphatase [36].

Fig. 4. Analysis of intraprotein interfaces of the SH2 domain.

(A) All domain-domain interactions extracted from available structures. The sizes of the nodes are proportional to the total number of contacts made with that domain interface and the SH2 domain. If the Jaccard index (JI) of the overlap between features was > 0.5 for multiple proteins, the line connects those proteins (red if the full protein architectures are identical, blue if the full protein architectures are different). The interfaces are sorted, in a circular format, according to the size of the total number in the proteome (for example, 31 SH2 domain-containing proteins have an SH3-SH2 pairing). (B) AlphaFold predictions for the interfaces of SH3-SH2 (dark green) and SH2-kinase interfaces (cyan) across diverse full-protein architectures. Protein architecture is given on the left and protein name is on the right. Positions on the x-axis are relative to the reference SH2 domain alignment.

When measuring the similarity of domain-domain contacts across the family, we found high conservation of interfaces among proteins that share entire protein architectures (Fig. 4A). Examples include high interface similarity across homologs with identical architectures: PTPN11 and PTPN6, CHN1 and CHN2, and STAT5A and STAT5B. This conservation extends to families with minor architectural variations (for example SOCS2 and SOCS4, the JAKs, GRB7, GRB10, and GRB14, as well as STAT1, STAT2, and STAT3; figs. S9 and S10). However, when domain-domain pairings (sub-architectures) occur across highly variable protein compositions, their interfaces differ substantially. For example, the SH2-SH3 interfaces within SLA/SLA2 and SRC family kinases remain conserved only within their respective families (Fig. 4B). The SH2-kinase interface contains a core of approximately four to six amino acid residues near the N terminus of the SH2 domain that shows high similarity between ABL and SRC family kinases. This conservation suggests the additional F-actin-binding domain of ABL family kinases has a minimal effect on SH2–kinase domain interactions compared to SRC family kinases. Neither ABL nor SRC family kinases share the same level of SH2-kinase interface conservation with FER/FES or SYK family members (Fig. 4B). Thus, domain-domain interfaces tend to remain conserved among homologous protein duplications and can withstand small architectural expansions but rarely survive major alterations to overall protein architecture.

Given that domain additions can alter structural arrangements of shared sub-architectures, we investigated how contacts depended on experimental design, particularly protein length. We found marked differences when comparing partial versus full protein structures. For example, PTPN11, which contains two SH2 domains followed by a tyrosine phosphatase catalytic domain (SH2-SH2-PTP_cat), shows consistent interface contacts when the full protein is expressed. However, structures containing only the tandem SH2 domains show interfaces that differ substantially from those in full-length structures (fig. S12). Conversely, some STAT and SOCS partial protein structures maintain similar interfaces to those of full structures (confirmed by high-confidence AlphaFold predictions). Because partial protein structures can misrepresent native domain-domain interfaces, we increased our stringency by prioritizing full domain architectures and using partial structures only when domain-domain pairs aligned well structurally (low RMSD) with the full protein structures (Data file S2B). When experimentally derived full structures were unavailable, we used AlphaFold predictions with acceptable PAE values. This approach produced a reliable set of contact maps between SH2 domains and other protein domains across the family, providing insights into how these domains regulate protein function and folding (Fig. 1B).

Extending contact maps by evolution and structural similarity

Despite having numerous SH2 domain structures, coverage of the family remains limited. Only ≈24 and ≈18% of SH2 domains have experimental structures with ligands or multiple domains, respectively. We therefore developed a method to group SH2 domains for projecting contact maps from available structures to those lacking structural data, which is particularly challenging for ligand-binding pockets with variable specificity regions. Using evolutionary distances, we hierarchically grouped SH2 domains through a Neighbor-Joining method (fig. S13), identifying 26 initial clusters. Further analysis revealed the need for sub-clustering; for example, a cluster containing PIK3R(C) and VAV family domains showed strong within-group sequence homology but poor cross-group similarity (fig. S14). Because these cases could not be systematically determined through hierarchical tree cutting alone, we performed clustering within the 26 clusters, resulting in 58 sub-clusters (Data File S1). We validated this approach by comparison with sequence-identity-based and structure-based clustering methods (minimizing RMSD scores). Sequence-based clustering produced identical results, whereas structure-based clustering reproduced 79% of the clusters with more than two proteins. We used identical subclusters found across all approaches, grouping remaining domains together (Data file S1).
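Sub-clustering within a cluster requires choosing how many sub-clusters to form, and the Methods state that a silhouette score was used for this. The sketch below is a generic, textbook silhouette calculation in plain Python, not the authors' code. The distance matrix, the labels, and the function name `silhouette_score` are hypothetical and chosen only to show how a good partition scores higher than a poor one.

```python
def silhouette_score(dist, labels):
    """Mean silhouette width given a square distance matrix and cluster labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


# Hypothetical pairwise distances for four domains forming two tight pairs.
dist = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]
good = silhouette_score(dist, [0, 0, 1, 1])  # pairs kept together
poor = silhouette_score(dist, [0, 1, 0, 1])  # pairs split apart
print(f"good partition: {good:.2f}, poor partition: {poor:.2f}")
```

Scanning candidate numbers of sub-clusters and keeping the partition with the highest score is one common way to apply this measure.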

To test these groupings for ligand contact projection, we measured contact similarity within groups having multiple structures. Generally, we found good agreement, with the GRB2-GRAP2, VAV1-VAV2, and HCK-LCK-FGR groups showing high contact similarity (JI values >0.5, fig. S15). However, tandem SH2 domains revealed a discrepancy. Whereas clustering suggested that between-protein domains were more similar [for example PTPN6(N) is more related to PTPN11(N) than it is to PTPN6(C)], structural contacts showed that within-protein domains had similar binding patterns. This held true across all tandem domains with structural data (SYK, ZAP70, PTPN11, and PIK3R1). We therefore handled tandem SH2 domains as separate groups (Data file S1C), yielding 42 final clusters for ligand contact projection (Data file S1). Based on our finding that domain-domain interfaces are most conserved when full protein architectures are shared, we used protein architecture for domain-domain contact projection. Using projections from available structures within clusters for ligand contacts, and from clusters with identical architectures for domain-domain contacts, we extended contact maps to 31 (ligand) and 10 (domain-domain) additional SH2 domains (Fig. 1B). Although relaxing the constraints could increase coverage, we prioritized high structural and evolutionary similarity for reliable projections.

Inferring the effect of PTMs by conserved structural analysis

SH2 domains undergo extensive PTM. The 119 human SH2 domains contain 191 pTyr sites, 164 ubiquitylated lysines, 168 phosphorylated serines, 65 phosphorylated threonines, 34 acetylated lysines, and smaller numbers of methylations and SUMOylations that have been identified experimentally. Despite these numerous modifications, few studies report their functional effects. Notable exceptions include independent reports on SRC family kinases whereby SH2 domain phosphorylation alters binding. In LCK, phosphorylation of Tyr192 decreases pTyr ligand binding affinity, particularly affecting the interaction with the +3 position of the ligand [12], reducing LCK activity in TCR signaling [37]. Similarly, phosphorylation of Tyr194 in LYN decreases peptide binding regardless of the inherent affinity of the peptide [13]. Phosphorylation of Tyr213 of SRC reduces binding to its C-terminal tail phosphotyrosine (Tyr527) but not to other ligands [11]. These pTyr residues occupy the same structural position (alignment position 124), which is adjacent to a conserved +3 binding site at position 125 (Fig. 2A). Their location in the specificity-determining region near +3 binding residues explains how SRC family kinase SH2 domains maintain binding capacity but alter specificity when phosphorylated at this position, particularly affecting +3 determinants. Given the general importance of the +3 position, this modification can appear to cause substantial binding reduction, as was observed for Tyr194 of LYN [13]. We hypothesized that integrating PTMs with comprehensive structural analysis might enable faster functional prediction for the numerous SH2 domain PTMs.

We used CoDIAC modules with ProteomeScout [3], PhosphoSitePlus [19], and Jalview to map PTMs onto SH2 reference sequences and analyze them in the PROMALS3D alignment. This revealed patterns of conserved PTMs across structural positions, with multiple modifications appearing at specific alignment positions across many family members. We developed comprehensive reports analyzing the relationship of each PTM to the number of similar PTMs at the same structural position, distance to ligand-binding interfaces, nearby interface residues, and overlap with domain-domain or phospholipid-binding interfaces (Data file S3). From this analysis, we generated hypotheses about the functional effects of PTMs. Based on size and charge differences, we propose that modifications at or near the pTyr binding interface may block domain-ligand binding entirely, whereas those farther from pTyr likely modulate specificity or affinity. We identified 53 modifications directly on pTyr-engaging residues that could disrupt canonical binding: 18 pTyr, 29 pSer/pThr, and 7 N6-acetyl-lysine sites. Despite limited overall acetylation data, several modifications occur directly at interaction interfaces, notably on a highly conserved lysine at position 105 involved in pTyr coordination (as well as five additional acetylation events on specificity region residues). This suggests that acetylation may directly disrupt pTyr-mediated signaling by removing positive charges critical for pTyr coordination.
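The idea of counting PTMs that fall on the same structural position can be illustrated with a small sketch. It is not the CoDIAC or Jalview workflow: the gapped sequences, residue numbers, domain names, and both function names are hypothetical. It assumes each domain is given as a gapped string from a multiple sequence alignment plus the protein residue number of its first aligned residue.

```python
from collections import Counter


def residue_to_column(aligned_seq, domain_start):
    """Map protein residue numbers to 1-based alignment columns ('-' is a gap)."""
    mapping = {}
    residue = domain_start
    for column, char in enumerate(aligned_seq, start=1):
        if char != "-":
            mapping[residue] = column
            residue += 1
    return mapping


def ptm_column_counts(alignment, domain_starts, ptms):
    """Count PTMs of each type per alignment column across all domains."""
    counts = Counter()
    for name, sites in ptms.items():
        mapping = residue_to_column(alignment[name], domain_starts[name])
        for residue, ptm_type in sites:
            column = mapping.get(residue)
            if column is not None:  # ignore sites outside the aligned domain
                counts[(column, ptm_type)] += 1
    return counts


# Hypothetical toy alignment of three domains (not real SH2 sequences).
alignment = {"domA": "WY-GKIS", "domB": "WYHG-LT", "domC": "-YHGKIS"}
domain_starts = {"domA": 100, "domB": 50, "domC": 10}
ptms = {
    "domA": [(101, "pTyr"), (105, "pSer")],
    "domB": [(51, "pTyr")],
    "domC": [(10, "pTyr"), (15, "pSer")],
}

counts = ptm_column_counts(alignment, domain_starts, ptms)
for (column, ptm_type), n in sorted(counts.items()):
    print(f"alignment position {column}: {n} {ptm_type} site(s)")
```

Although the three tyrosines have different residue numbers in their own proteins, they share alignment column 2, which is what makes a conserved modification site visible.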

Phosphorylation showed distinct patterns. Conserved pSer and pThr sites predominate in the N-terminal half of the SH2 domain, frequently on residues directly contacting the ligand pTyr, suggesting that pSer and pThr signaling might directly dampen SH2 domain–pTyr interactions. In contrast, tyrosine phosphorylation tends to occur on or near residues binding the +1 to +3 region of the ligand, suggesting that pTyr might “tune” binding specificity. However, the most conserved tyrosine phosphorylation site (position 73, Fig. 5) sits adjacent to a structural position binding the ligand pTyr on the β-C strand, indicating that some pTyr sites may function as binding switches rather than as specificity modulators.

Fig. 5. PTMs in a Jalview-based visualization on the reference SH2 domain alignment for a subset of SH2 domains that cover the diversity of PTMs.

Ligand features were collapsed across structures and indicated if they were shared by 10 or more unique SH2-ligand pairs. Ligand interactions were labeled as interacting with pTyr (black) or specificity (gray) based on whether the interaction was predominantly with the pTyr residue or with other ligand positions (−1 to +3 predominantly). Tracks at the bottom indicate the number of PTMs found in that alignment position across the family. Numbers on the tracks indicate the number of PTMs identified in that position, an asterisk indicates more than 10 (pTyr: position 73 has 28, positions 123, 124, and 148 have 16, 14, and 13, respectively; pSer/pThr position 58 has 19, and the positions between 64 and 67 have 22, 6, 13, and 10 phosphorylation sites; position 74 has 10 pSer/pThr sites). Dark blue indicates pTyr; light blue indicates a pSer/pThr; maroon indicates N6-acetyl-lysine. Boxes around the alignment and tracks indicate locations of high conservation of PTMs and overlap with ligand-interacting residues.

Given the extensive phosphorylation at or near ligand-binding residues, we searched for functional studies of these modifications. Previous studies confirmed that some pSer/pThr [38] and pTyr sites [39] affect binding and signaling. For example, Lee et al. studied two pSer in PIK3R1 (p85α) in the same structural position of both tandem SH2 domains (alignment position 65, a conserved pSer/pThr site). Phosphorylation of either site markedly decreased SH2 domain ligand binding and inhibited PI3K signaling [38]. When both sites were phosphorylated, PI3K dissociated from its upstream activators, reducing AKT activation. This region of the SH2 domain contains 51 annotated pSer and pThr sites that directly interact with ligands, suggesting that there are similar regulatory effects across many SH2 domains.

For tyrosine phosphorylation, our conservation analysis identified the previously studied SRC family kinase regulatory sites and extended them beyond SFK SH2 domains. Alignment position 124 contains 14 modification sites, including in ZAP70(C), NCK1, and ABL2 SH2 domains (Data File S3). This position neighbors conserved specificity contacts at positions 125 to 126, consistent with phosphorylation affecting +2 and +3 interactions (Fig. 2B). We also observed that many SH2 domains contain pTyr pairs (positions 123 to 124), including in SRC family kinases. Weir et al. found that three tyrosines in FYN (Tyr185, Tyr213, and Tyr214; alignment positions 73, 123, and 124, which are all conserved phosphorylation sites) reduced SH2 domain binding capacity when phosphorylated [39]. These studies validate our comprehensive contact mapping approach for predicting the effects of PTMs. We estimate that 54 and 35% of SH2 domains can be regulated by tyrosine and serine/threonine phosphorylation, respectively, often with multiple regulatory sites in a single domain.

Beyond ligand binding, we examined PTMs at domain-domain interaction interfaces to identify potential regulation of larger protein structures. Because of the architecture-specific nature of these interfaces, we focused on PTMs directly at domain-domain interfaces, identifying 40 such modifications (Data file S3). This approach revealed key regulatory insights, including the finding by Burmeister et al. that phosphorylation of PTPN11(N) Thr73 and PTPN11(C) Ser189 (alignment position 140) by PKA inhibits PTPN11 catalytic activity [40] by stabilizing its closed conformation.

Our analysis confirmed that PTPN11(N) Thr73 is at the SH2 domain-PTP catalytic domain interface and that PTPN11(C) Ser189 is at the SH2-SH2 domain interface, supporting their role in regulating protein conformation. In total, we identified 28 PTMs at SH2 domain interfaces with other domains (Data file S3), providing directed hypotheses for functional testing. We also examined phospholipid-binding interfaces with hand-annotated features from Park et al. [24]. With data for about 10 SH2 domains having phospholipid-binding residues, we identified five pTyr sites across four domains that might affect lipid binding through proximity to positively charged interface residues (Data file S3). Several phospholipid-binding lysines (six in total) have been annotated as ubiquitylation sites, suggesting potential regulation of phospholipid binding, although surface accessibility may also explain this pattern.

Integrated analysis of clinically relevant mutations

We used CoDIAC to relate mutations to SH2 domain regulation by examining relationships between mutations, interaction interfaces, and PTMs (Data File S4). From 111 clinically important mutations identified across various databases (OMIM, gnomAD, and PDB), we found 29 in ligand contact areas, 40 within two amino acid residues of ligand binding regions, 16 at domain-domain contacts, and nine at modified residues. Across ligand and domain-domain contacts, 83 to 85% of mutations substantially change the physicochemical properties of the residues, compared to 44% at PTM sites (Data file S4).

Integrated analysis revealed diverse potential effects. For example, SRC K206L (at position 105) removes a positive charge at a pTyr-coordinating residue; STAT1 K637E (position 134) introduces a charge switch at a ligand-binding residue and eliminates a ubiquitylation site; and PTPN11 Y62D (position 112) affects a residue adjacent to the ligand interface and at the PTP catalytic interface, while removing a phosphorylation site. Mutations in Tyr62 and Tyr63 of PTPN11 are associated with Noonan syndrome, Leopard syndrome, and RASopathy (ClinVar accessions: VCV000013329.48, VCV000013333.86). Mutations disrupting the SH2 domain–PTP interface increase catalytic activity by destabilizing the closed conformation of the phosphatase [41]. Our analysis suggests that these mutations not only affect domain-domain interfaces but also disrupt ligand binding and phosphorylation-based regulation.

Another example from our integrated analysis involves SOCS1 mutations. The P123R and Y154H mutations activate the JAK-STAT pathway in tumor cells and are associated with B cell lymphomas [42]. We found that Pro123 (position 101) is directly adjacent to a conserved +1 ligand-binding residue (at position 102) and on the same β-D strand side as the pTyr/+1 interacting residue 103 (Fig. 2B). Tyr154 of SOCS1 (alignment position 148) is both a ligand binding contact and interacts with the SOCS_box domain. Although phosphorylation of Tyr154 has not been identified in SOCS1, it is consistent with a conserved phosphorylation position predicted to alter binding specificity through +3 ligand interactions. As one of the rare cases in which a ligand pocket residue also participates in intraprotein interactions, phosphorylation might regulate SH2 domain opening for ligands. Tyr154 is part of a “YY” doublet in which both tyrosines can be phosphorylated, suggesting that the Y154H mutation could affect (i) SH2-SOCS_box interface disruption, altering catalytic activity; (ii) ligand specificity changes; and (iii) phosphorylation-based regulation of both ligand and domain-domain interfaces. Although previous work hypothesized that Y154H would cause ligand contact loss [42], our analysis precisely identifies affected interactions and suggests additional regulatory mechanisms.

Discussion

CoDIAC provides a flexible framework for extracting contact maps from experimental and predicted structures, annotating domains within structures, and harnessing all available structures for protein domain families. Here, we used it to explore the interaction interfaces of modular domains, generating a comprehensive overview of SH2 domains and the intersection of PTMs and mutations with binding interfaces. As well as recovering insights from individual structure-based studies, the comprehensive evaluation across the entire family revealed emergent properties of SH2 domains.

A key limitation of this and all structure-based studies is that interaction interfaces, particularly domain-domain contacts, represent single, low-energy configurations and do not capture the dynamic nature of protein interactions. Nevertheless, these low-energy states remain relevant for understanding how mutations and PTMs affect protein configuration. Although we did not examine the role of linker regions in regulating protein function, a well-known phenomenon in SH2 domain-containing tyrosine kinases [43], CoDIAC can accommodate these regions, because interdomain regions are effectively annotated by the pipeline. Additionally, some structures show domain-domain interfaces that change in active conformational states, such as an SH2 domain–kinase interface in active ABL that stabilizes the open conformation [43]. Such interfaces are also important, including the possibility that PTMs could regulate conformational state stability. By selecting contacts shared by most structures to reduce study bias, we inherently biased our maps toward the most represented structures (typically inactive conformations). However, CoDIAC can be used to easily compare different interfaces and can map non-domain regions and various interface types, including interactions with lipids, small molecules, RNA, and DNA.

Our study highlighted several important caveats. Structures with partial protein coverage should be used cautiously, as demonstrated by the differences in contact maps between partial and full protein representations and the structural rearrangements that can occur when sub-architectures are shared between diverse families. Additionally, we observed that ligand engagement can occur without the canonical pTyr–invariant arginine interaction when ligands are presented in cis or through multivalent interactions (such as in tandem SH2 domains). This suggests that contact mapping from isolated domain-ligand pairs may limit our understanding of physiologically relevant interactions in signaling networks. Beyond tandem SH2 domains, many domain-ligand interacting modules work together with other modules (for example the SH3-SH2 pairing across much of the family), suggesting that noncanonical binding might be more prevalent in protein interaction networks than studies of isolated domains would indicate.

Conservation analysis of PTMs within SH2 domains revealed patterns suggesting extensive regulation of SH2 domain interaction interfaces by multiple signaling systems. Some insights were unexpected, contradicting the prevailing notion that modifications primarily occur in intrinsically disordered segments; many conserved sites are located directly on β-strands within the SH2 domain. The breadth of these PTMs across independent experiments and homologs, together with evidence of their regulation by drugs [44], growth factors [45, 46], or other stimuli [47], suggests that these structurally conserved modifications are transiently regulated, are important for signaling, and are not simply mass spectrometry artifacts. Studies describing modification effects in conserved positions on sites not yet in databases also suggest that additional modifiable residues in these positions may be discovered.

Although we highlighted individual PTM effects on regulating ligand or domain-domain interactions, evidence suggests that they co-occur and provide multifactorial control of SH2 domains. An intriguing pattern involves pSer sites directly adjacent to pTyr sites near key ligand-binding regions, including positions 72 and 73, which interact with the ligand pTyr. Comprehensive profiling of human serine/threonine kinases revealed that priming phosphorylation may be common for subsequent phosphorylation [48]. Using ProteomeScout, we found evidence of a doubly phosphorylated tryptic fragment containing these positions in the SH2D1A domain from breast cancer samples [49], suggesting that one phosphorylation may prime the kinase motif for the other. Although much remains to be tested about the regulation of SH2 domain interactions by PTMs, the comprehensive integration of structure and PTMs enables more direct hypothesis generation about modification effects and prioritization based on conservation and functional effect, while identifying multiple factors to consider in disease-relevant mutations.

Materials and Methods

UniProt, InterPro, and structure reference generation of the SH2 domain family

We developed and used the CoDIAC ‘InterPro’ module to gather all SH2 domain–containing proteins for ‘Homo Sapiens’ with the InterPro ID ‘IPR000980,’ resulting in 109 UniProt identifiers. We then used the ‘UniProt’ module of CoDIAC to build reference files for the proteins and a reference FASTA file for just the SH2 domain regions for the 119 unique SH2 domains. Given that there can sometimes be discrepancies around specific domain boundaries, we included the ability to alter the boundaries (systematically for all domains) when producing a FASTA sequence file of domains. Based on alignment quality, we found that truncating the boundary of the SH2 domains defined in the reference by one amino acid residue on the C-terminal side resulted in a substantially better alignment, and we selected an ‘n-terminal offset’ of 0 and a ‘c-terminal offset’ of −1. Additionally, as part of domain quality checks and given its high specificity for only the PIK3 regulatory family, we manually removed the InterPro region ‘PI3K_P85_iSH2:IPR032498’ in the PIK3R1/2/3 proteins, which is an α-helical region between the tandem SH2 domains. We also manually removed an InterPro-defined domain in SUPT6H (Spt6_SH2_C:IPR035018) because it overlapped with the parent SH2 family. Finally, we found that using boundaries defined by SMART [50], which were an average between UniProt and InterPro domains for the atypical SH2 domains (JAK family), resulted in a better overall alignment. At the time of this analysis (June 2024), UniProt returned 1,477 experimental structures associated with the UniProt set of proteins. We generated the structural reference datasets containing structures for SH2 domain proteins obtained through experiments (Integrated PDB reference file) and predictions (AlphaFold reference file). 
To generate this dataset, we used the CoDIAC ‘PDB’ module followed by the ‘IntegrateStructure_Reference’ module, which captured an exhaustive list of experimental details for each of the PDB identifiers associated with the UniProt IDs of human proteins containing an SH2 domain. Using the ‘IntegrateStructure’ module of CoDIAC, which aligns structural data with the reference data and annotates domains found within the structures, we found 467 total structures for analysis that covered an SH2 domain in its entirety. Gaps and variants of experimentally derived sequences, relative to reference, were noted in the annotated file for consideration of exclusion criteria during contact mapping. We performed an identical process of capturing all predicted structures for the UniProt IDs identified in the family, annotating their regions relative to the reference, and generated the AlphaFold reference file. The 109 predicted structures were downloaded as mmCIF files using AlphaFold database version 2 (v2). We used the ‘PTM’ module of CoDIAC to capture PTMs found within the domain boundary regions of SH2 domains, for all PTM types for which there were five or more sites in the family. PTMs were retrieved from ProteomeScout [3], using ProteomeScoutAPI [51], and from PhosphoSitePlus [19], using an API that we created and incorporated in CoDIAC that operates similarly to the ProteomeScoutAPI; we then combined the two sources and kept the unique set of all PTMs for final analysis. We used the ‘mutations’ module in CoDIAC to gather mutations within the SH2 domain regions defined in the reference from gnomAD and OMIM. We used PROMALS3D [26] to align the reference SH2 FASTA file and then used the ‘Jalview’ module in CoDIAC to translate and integrate feature files produced from PTMs together with contact map feature files, as well as to generate annotation tracks of features based on summing features along the columns of the alignment. 
Integrated features and annotation tracks were used to generate analysis reports concerning PTMs and mutations (compiled in data files S3 and S4, respectively). All data derived from this pipeline are available on Figshare (https://doi.org/10.6084/m9.figshare.26321968). Code used in this work is deposited on Zenodo: SH2-specific contact mapping https://doi.org/10.5281/zenodo.17196042; CoDIAC https://doi.org/10.5281/zenodo.17200311.
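The systematic boundary adjustment described above (an ‘n-terminal offset’ of 0 and a ‘c-terminal offset’ of −1) can be sketched as follows. This is an illustration only and does not reproduce the CoDIAC ‘UniProt’ module: the function names, the FASTA header layout, and the example sequence are hypothetical, and the sign convention (positive values extend the domain, negative values trim it) is an assumption chosen to match a C-terminal offset of −1 removing one residue.

```python
def domain_sequence(protein_seq, start, end, n_offset=0, c_offset=0):
    """Slice a domain from a protein sequence with adjustable boundaries.

    start and end are 1-based, inclusive residue numbers.
    Assumed convention: positive offsets extend the domain, negative offsets trim it.
    """
    new_start = start - n_offset
    new_end = end + c_offset
    if new_start < 1 or new_end > len(protein_seq) or new_start > new_end:
        raise ValueError("adjusted boundaries fall outside the protein")
    return new_start, new_end, protein_seq[new_start - 1:new_end]


def to_fasta(records):
    """Format (header, sequence) pairs as FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# Hypothetical protein with an annotated domain spanning residues 4 to 12.
protein = "MAAWYHGKISRNEAEQLLK"
_, _, full = domain_sequence(protein, 4, 12)
new_start, new_end, trimmed = domain_sequence(protein, 4, 12, n_offset=0, c_offset=-1)

fasta = to_fasta([(f"EXAMPLE|SH2|{new_start}|{new_end}", trimmed)])
print(fasta, end="")
```

Applying the same offsets to every domain keeps the adjustment systematic, so alignment quality can be compared across offset choices.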

Adjacency File and Contact Map generation

We generated binary text files using the ‘AdjacencyFiles’ module that are a simplified representation of interatomic interactions. To make these files, we first calculated interactions with the Python package Arpeggio [25], which outputs a JSON-formatted file comprising all types of interatomic interactions that exist within an input protein structure. For the purposes of domain-focused contact extraction, we kept all noncovalent interactions (aromatic, carbonyl, hydrophobic, ionic, polar, van der Waals, halogen, and hydrogen bond) occurring at a distance < 5 Å between atoms that reside on SH2 domains and their interacting domains and ligands. We did not include any interactions that may occur between protein entities and small molecules. All components (chains, residues, residue positions, atom pairs, distance, and contact type) of the filtered noncovalent interactions are saved as adjacency text files and, using our set threshold (retain interactions if found in ≥25% of the chains for domain-domain analysis and in ≥50% of bound chain replicates for domain-ligand complexes), we aggregated contacts across chains of specific entities. We represented whether the contact exists (binary value ‘1’) or not (binary value ‘0’) between or across protein entities in the final binarized adjacency text files. We were unable to generate Arpeggio JSON files for 12 PDB structures (PDB IDs: 7UNC, 7UND, 8H36, 8H37, 8OEU, 8OEV, 8OF0, 6GMH, 6TED, 7OOP, 7OPC, and 7OPD). The adjacency files that we successfully generated for 455 PDB and 109 AlphaFold structures in this work can be found on Figshare (https://doi.org/10.6084/m9.figshare.26309674). Across the 475 PDB structures, we identified 17 unique PTMs. 
We generated a PTM dictionary that is used by the ‘contactMap’ module to replace the three-letter codes of the modified residues with the single-letter code of their native residue in the structure sequence, which allowed for precise comparisons between structure and reference sequence. This PTM dictionary can be modified based on the modifications found within the structure datasets. We used the ‘contactMap’ module in CoDIAC to generate contact maps for domain-domain and ligand-binding interfaces of SH2 domains. This module generates a Python object that stores all of the structural information gathered from both the structure datasets (PDB and AlphaFold reference files) and the binary adjacency files. By incorporating annotated and interaction data for each structure, contact maps were constructed in the form of nested Python dictionaries whose outer and inner dictionary keys indicate the residues that are noncovalently linked, and the values for the inner dictionary represent the binary value of the interaction. We printed contacts from these dictionaries to Jalview-supported feature files describing SH2 domain contacts with domains and ligands. Using the ‘analysis’ module, we aggregated contacts across various PDB structures that represented an identical interface. We evaluated the threshold (percent) beyond which the loss of features upon aggregation becomes minimal and found 30% to be a suitable threshold for merging contacts across SH2 domain-containing structures (fig. S1). We found 207 PDB structures spanning 32 unique proteins that enabled us to examine domain-domain interfaces for contacts. We removed PDB structures for LCK, PLCG1, and ZAP70 from our analysis because we found structural discrepancies when evaluating RMSDs and were unsure whether the structures represented a native conformation. We did not extract contacts across SH2 domains and other modular domains within the PIK3R1, CRK, and SUPT6H structures. 
After these removals, the evaluation of WT structures for domain-domain binding interfaces resulted in maps for 19 unique SH2 domain proteins (Fig. 1B). We failed to generate contact maps for two of the HCK structures (PDB IDs: 2HCK and 1AD5) due to a mismatch between the structure and reference sequence alignments, caused by a 10–amino acid unmodeled region within their kinase domains.
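The nested-dictionary contact maps and the threshold-based merging described in this section can be sketched as below. This is a simplified illustration rather than the CoDIAC ‘contactMap’ or ‘analysis’ module: the function name `aggregate_contacts`, the residue numbers, and the choice to treat the threshold as "greater than or equal to" are assumptions. The same routine applies to any of the stated thresholds (25% and 50% across chains, 30% across structures) by changing one argument.

```python
def aggregate_contacts(contact_maps, threshold):
    """Merge contact maps, keeping contacts seen in >= threshold (fraction) of maps.

    Each map is a nested dict {residue_a: {residue_b: 0 or 1}}, where 1 means
    the two residues are noncovalently linked in that chain or structure.
    """
    if not contact_maps:
        return {}
    tallies = {}
    for cmap in contact_maps:
        for res_a, partners in cmap.items():
            for res_b, present in partners.items():
                if present:
                    key = (res_a, res_b)
                    tallies[key] = tallies.get(key, 0) + 1
    merged = {}
    for (res_a, res_b), n in tallies.items():
        if n / len(contact_maps) >= threshold:
            merged.setdefault(res_a, {})[res_b] = 1
    return merged


# Hypothetical contact maps from four structures of the same interface.
maps = [
    {32: {201: 1, 205: 1}, 55: {310: 1}},
    {32: {201: 1, 205: 0}, 55: {310: 1}},
    {32: {201: 1}, 60: {400: 1}},
    {32: {201: 1}},
]

merged_30 = aggregate_contacts(maps, threshold=0.30)
print(merged_30)
```

With a 30% threshold, the contacts seen in only one of four structures (25%) are dropped, while those seen in two or more are retained.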

PAE estimation for the predicted structural fold of domain-domain interfaces

The AlphaFold prediction system provides Predicted Aligned Error (PAE) values for every residue pair in a protein structure, which reflect the distance error (Å) between the residues oriented in three-dimensional space. Given a domain interface of interest, we first retrieved the PAE values between the residue pairs $r_i r_j$, where $r_i$ and $r_j$ reside on either of the domains. Because AlphaFold presents the PAE values in the form of an asymmetrical matrix, we calculated the average of the PAE estimates for the two relative structural configurations of the domain-domain interface (from both directions of the interface).

$$\mathrm{PAE}_{avg} = \frac{\mathrm{PAE}_{12} + \mathrm{PAE}_{21}}{2} \qquad (1)$$

where $\mathrm{PAE}_{12}$ is the fraction of the total number of residue pairs $r_i r_j$ that have a low PAE (≤ 10 Å), given that $r_i$ is in the SH2 domain and $r_j$ is in the partner domain, and $\mathrm{PAE}_{21}$ is the estimate for the same interface when $r_i$ is in the partner domain and $r_j$ is in the SH2 domain. We reported the final PAE estimate as this average value calculated from both directions of the structural orientation of the domain-domain interaction. We assessed an interface only if the PAE estimate indicated that at least 50% of the domain-domain interface fell within our defined tolerable error range.
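
The calculation in Eq. 1 can be sketched as follows. This is an illustrative example under stated assumptions rather than the CoDIAC code: the function names, the 4 × 4 toy matrix, and the residue index lists are hypothetical, and the PAE matrix is assumed to be available as a list of rows indexed by residue position.

```python
def directional_low_pae_fraction(pae, rows, cols, cutoff=10.0):
    """Fraction of residue pairs (i in rows, j in cols) with PAE <= cutoff."""
    pairs = [(i, j) for i in rows for j in cols]
    low = sum(1 for i, j in pairs if pae[i][j] <= cutoff)
    return low / len(pairs)


def interface_pae_estimate(pae, sh2_idx, partner_idx, cutoff=10.0):
    """Average the two directional fractions, as in Eq. 1."""
    pae_12 = directional_low_pae_fraction(pae, sh2_idx, partner_idx, cutoff)
    pae_21 = directional_low_pae_fraction(pae, partner_idx, sh2_idx, cutoff)
    return (pae_12 + pae_21) / 2


# Toy asymmetric PAE matrix (values in angstroms, invented for illustration).
toy_pae = [
    [0, 1, 4, 12],
    [1, 0, 8, 9],
    [5, 15, 0, 1],
    [20, 11, 1, 0],
]
sh2 = [0, 1]      # hypothetical SH2 domain residue indices
partner = [2, 3]  # hypothetical partner domain residue indices

estimate = interface_pae_estimate(toy_pae, sh2, partner)
keep_interface = estimate >= 0.5  # the 50% criterion stated in the text
print(estimate, keep_interface)
```

Here three of the four SH2-to-partner pairs and one of the four partner-to-SH2 pairs fall at or below 10 Å, so the two directional fractions are 0.75 and 0.25 and their average is 0.5.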

Methods for measuring the evolutionary and structural properties of SH2 domains

First, we used evolutionary distances as the basis to classify SH2 domain–containing proteins into families. We implemented a distance-based Neighbor-Joining algorithm to calculate the evolutionary distances among all 119 known SH2 domains and then used Ward’s method to hierarchically cluster the proteins. Using the elbow method, we identified an optimal linkage distance for clustering. An elbow point occurring at 0.5 indicates a decrease in variance within the clusters beyond this linkage distance (fig. S16). This threshold produced 40 clusters (Data file S1D), but we were unable to sufficiently project contacts, so we reassessed clustering for this purpose at the next elbow point, which occurs at linkage distance 0.8 and produced 26 clusters (Data file S1A). The clusters generated from thresholds of 0.5 and 0.8 were highly similar. We subdivided these 26 clusters into a total of 58 sub-clusters, using a silhouette score to identify the number of sub-clusters that could be formed within each cluster. Working with the higher linkage threshold of 0.8 was beneficial because it did not affect the protein grouping and enabled us to find more closely related neighbors, which improved the chances of contact projection. We gained features for nine additional proteins at a threshold of 0.8 compared with a threshold of 0.5. Upon performing these steps, we separated the SH2 domain family into 26 clusters and 58 sub-clusters (Data file S1B). We next validated the clusters by closely examining the sequences and structures of the domains to evaluate whether evolutionary-based agglomerative clustering would be beneficial in identifying and separating intra-clustered homologues efficiently. For primary sequence comparison, we calculated amino acid sequence identities between proteins within each of these 26 clusters. We hierarchically grouped each of the 26 clusters based on the sequence identity scores. 
This resulted in exactly the same sub-grouping that we had attained through evolutionary distances. For structural comparisons, we estimated RMSD values with the PyMOL utility ‘super,’ which performs a sequence-independent, structure-based dynamic programming alignment. We made these calculations between several structures with Python-controlled PyMOL scripts. The experimental structures did not cover the entire SH2 domain family, so we accounted for the unresolved protein domains by incorporating the predicted structures of the domains from AlphaFold, because we observed very low RMSD scores (< 2 Å) between the SH2 domain structures from experiments and from AlphaFold (fig. S17). We next repeated a similar hierarchical clustering method for the 26 clusters to identify intra-cluster structural homologs using RMSD scores.
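
To make the effect of the linkage-distance cutoff concrete, the sketch below applies Ward's method to a small pairwise distance matrix and stops merging once the next merge height would exceed the cutoff. It is a didactic pure-Python version written for this example, not the code used in the study; the function name `ward_clusters` and the toy distances are hypothetical, and the merge height is assumed to follow the common Lance-Williams formulation of Ward linkage. In practice a numerical library would normally be used for this step.

```python
import math


def ward_clusters(dist, cutoff):
    """Merge items with Ward's method until the next merge height exceeds `cutoff`.

    `dist` is a symmetric matrix (list of lists) of pairwise distances.
    Returns the flat clusters as sorted lists of item indices.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}
    # Squared distances between active clusters, keyed by (smaller id, larger id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while d2:
        (a, b), height2 = min(d2.items(), key=lambda item: item[1])
        if math.sqrt(height2) > cutoff:
            break
        del d2[(a, b)]
        size_a, size_b = len(members[a]), len(members[b])
        merged = members.pop(a) + members.pop(b)
        # Lance-Williams update for Ward linkage on squared distances.
        for c, items in members.items():
            size_c = len(items)
            d_ac = d2.pop((min(a, c), max(a, c)))
            d_bc = d2.pop((min(b, c), max(b, c)))
            total = size_a + size_b + size_c
            d2[(c, next_id)] = (
                (size_a + size_c) * d_ac + (size_b + size_c) * d_bc - size_c * height2
            ) / total
        members[next_id] = merged
        next_id += 1
    return sorted(sorted(group) for group in members.values())


# Toy distances among four hypothetical domains (not values from the study).
toy = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]
coarse = ward_clusters(toy, cutoff=0.8)
fine = ward_clusters(toy, cutoff=0.15)
print(coarse, fine)
```

With these toy values, the higher cutoff groups the four items into two pairs, whereas the lower cutoff leaves two of them as singletons, mirroring how a higher linkage threshold yields fewer, larger clusters.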

pTyr ligand peptide clustering

We represented every pTyr-containing peptide in our dataset as DPPS vectors [52] and then measured the average Euclidean distance between the DPPS vectors of each pair of peptides. These average Euclidean distances indicate how closely related the peptides are in their physicochemical properties. We hierarchically clustered the peptides based on these average distances. We adapted this method from our previous study [8] to categorize pTyr-containing peptides that were seven amino acid residues in length (−2 to +4 of the pTyr).
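
The distance calculation can be sketched as below, with several loud caveats. The descriptor table contains made-up two-dimensional placeholder values, not the published DPPS scores; the lowercase `y` for pTyr, the function names, and the choice to average the per-position Euclidean distances are all assumptions for illustration, because the text does not specify how the average was taken.

```python
import math

# Placeholder descriptors: invented 2-D values for illustration only,
# NOT the published DPPS scores.
TOY_DESCRIPTORS = {
    "A": (0.0, 0.0),
    "E": (3.0, 4.0),
    "D": (6.0, 8.0),
    "y": (1.0, 1.0),  # lowercase y stands for pTyr in this sketch
}


def peptide_distance(pep_a, pep_b, table=TOY_DESCRIPTORS):
    """Mean per-position Euclidean distance between two equal-length peptides."""
    if len(pep_a) != len(pep_b):
        raise ValueError("peptides must have the same length")
    per_position = [math.dist(table[x], table[y]) for x, y in zip(pep_a, pep_b)]
    return sum(per_position) / len(per_position)


def distance_matrix(peptides, table=TOY_DESCRIPTORS):
    """Symmetric matrix of pairwise peptide distances."""
    return [[peptide_distance(p, q, table) for q in peptides] for p in peptides]


# Hypothetical seven-residue peptides spanning -2 to +4 around the pTyr.
peptides = ["AAyEAAA", "AAyDAAA", "AAyAAAA"]
matrix = distance_matrix(peptides)
for row in matrix:
    print([round(value, 3) for value in row])
```

A matrix of this kind could then be passed to a hierarchical clustering routine to group the peptides.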

Phosphopeptide sequence analysis

We performed orthogonal testing of structure-extracted features by aligning the sequences of phosphopeptides pulled down from pervanadate-treated Jurkat cell lysates [7]. To generate motif logos, we required at least 50 sequences. If a pull-down yielded substantially more peptides for any one SH2 domain, we subsampled to 100 peptides to control for entropy differences (we did not see differences in the general motif logos across random subsamples). When available, we used the WT SH2 domain structure. However, for those domains for which the WT structure recovered too few peptides for useful motif generation, we used data from the triplet superbinder mutations. The three superbinder residue positions were shown by the authors not to have a substantial effect on binding specificity [23], which is consistent with the determination that these positions exclusively coordinate the pTyr residue, according to our structure extraction (Fig. 2B). To test the preference for acidic amino acids in the SH2 domain pull-downs, we first assembled a background of all of the unique phosphopeptides observed across the various pull-down experiments and then tested for the presence of a glutamate or an aspartate directly following or preceding the pTyr. We then used Fisher’s Exact test to measure the probability of having observed the incidence of the motif by random chance, given the full background. We corrected for multiple hypothesis testing (within the positional group of tests) using the False Discovery Rate (FDR) procedure and rejected hypotheses with FDR-corrected values < 0.05.
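
The enrichment test and multiple-testing correction can be illustrated with the following sketch. It assumes a one-sided (enrichment) Fisher's exact test computed from the hypergeometric distribution and the Benjamini-Hochberg form of the FDR procedure; the text names only Fisher’s Exact test and "the FDR procedure", so both choices, along with the function names and the counts, are assumptions for this example.

```python
from math import comb


def fisher_enrichment_p(hits_in_sample, sample_size, hits_in_background, background_size):
    """One-sided Fisher's exact p-value: P(X >= hits_in_sample) under the hypergeometric null."""
    upper = min(sample_size, hits_in_background)
    favorable = sum(
        comb(hits_in_background, x)
        * comb(background_size - hits_in_background, sample_size - x)
        for x in range(hits_in_sample, upper + 1)
    )
    return favorable / comb(background_size, sample_size)


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented counts: (peptides with an acidic residue at the position, pull-down size)
# for three hypothetical SH2 domains, against a background of 1000 unique
# phosphopeptides of which 100 carry the motif.
pull_downs = [(25, 100), (12, 100), (9, 80)]
pvals = [fisher_enrichment_p(k, n, 100, 1000) for k, n in pull_downs]
fdr = benjamini_hochberg(pvals)
significant = [q < 0.05 for q in fdr]
print(pvals, fdr, significant)
```

Grouping the tests by peptide position before calling `benjamini_hochberg` would correspond to correcting within the positional group of tests.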

Supplementary Material

Supplementary Data File 3
Supplementary Data File 1
Supplementary Data File 2
Supplementary Figures
Supplementary Data File 4

Acknowledgments:

We acknowledge M. Ryan for early work on PDB interfaces to identify relevant SH2 domain–containing proteins, E. Draizen for help with initial scripts for parsing and contact extractions with mmCIF files, and P. Bourne and M. Fallahi-Sichani for helpful discussions.

Funding:

Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R35GM138127 (to K.M.N.) and the National Institute of Allergy and Infectious Diseases under Award Number R01AI153617 (to K.M.N.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. G.M. was additionally supported by NIH grant T32CA009109.

Footnotes

Competing interests:

The authors declare that they have no competing interests.

Data and materials availability:

CoDIAC is available at https://doi.org/10.5281/zenodo.17200310. All SH2 domain-specific analysis code is at https://doi.org/10.5281/zenodo.17196042. All data are available at Figshare: the adjacency files generated for SH2 domains are located at https://doi.org/10.6084/m9.figshare.26309674, and all data derived from the CoDIAC pipeline on structure files are available at https://doi.org/10.6084/m9.figshare.26321968. All code and data are distributed under a noncommercial, share-alike, with-attribution license. All data needed to evaluate the conclusions in the paper are present in the paper and its Supplementary Materials.

References and Notes

  • [1]. Schaeffer RD, et al., Classification of Domains in Predicted Structures of the Human Proteome. Proceedings of the National Academy of Sciences of the United States of America 120 (12), e2214069120 (2023), doi: 10.1073/pnas.2214069120.
  • [2]. Jakubec D, Kratochvíl M, Vymĕtal J, Vondrášek J, Widespread Evolutionary Crosstalk among Protein Domains in the Context of Multi-Domain Proteins. PLOS ONE 13 (8), e0203085 (2018), doi: 10.1371/journal.pone.0203085.
  • [3]. Matlock MK, Holehouse AS, Naegle KM, ProteomeScout: A Repository and Analysis Resource for Post-Translational Modifications and Proteins. Nucleic Acids Research 43 (D1), D521–D530 (2015), doi: 10.1093/nar/gku1154.
  • [4]. Sloutsky R, Naegle KM, Proteome-Level Analysis Indicates Global Mechanisms for Post-Translational Regulation of RRM Domains. Journal of Molecular Biology 430 (1), 41–44 (2018), doi: 10.1016/j.jmb.2017.11.001.
  • [5]. Kayikci M, et al., Visualization and Analysis of Non-Covalent Contacts Using the Protein Contacts Atlas. Nature Structural and Molecular Biology 25 (2), 185–194 (2018), doi: 10.1038/s41594-017-0019-z.
  • [6]. Jumper J, et al., Highly Accurate Protein Structure Prediction with AlphaFold. Nature 596 (7873), 583–589 (2021), doi: 10.1038/s41586-021-03819-2.
  • [7]. Martyn GD, et al., Engineered SH2 Domains for Targeted Phosphoproteomics. ACS Chemical Biology 17 (6), 1472–1484 (2022), doi: 10.1021/acschembio.2c00051.
  • [8]. Ronan T, Garnett R, Naegle KM, New Analysis Pipeline for High-Throughput Domain-Peptide Affinity Experiments Improves SH2 Interaction Data. Journal of Biological Chemistry 295 (32), 11346–11363 (2020).
  • [9]. Liu A-D, et al., (Arg)9-SH2 Superbinder: A Novel Promising Anticancer Therapy to Melanoma by Blocking Phosphotyrosine Signaling. Journal of Experimental & Clinical Cancer Research 37 (1), 138 (2018), doi: 10.1186/s13046-018-0812-5.
  • [10]. Diop A, et al., SH2 Domains: Folding, Binding and Therapeutical Approaches. International Journal of Molecular Sciences 23 (24), 15944 (2022), doi: 10.3390/ijms232415944.
  • [11]. Stover DR, Furet P, Lydon NB, Modulation of the SH2 Binding Specificity and Kinase Activity of Src by Tyrosine Phosphorylation within Its SH2 Domain. Journal of Biological Chemistry 271 (21), 12481–12487 (1996), doi: 10.1074/jbc.271.21.12481.
  • [12]. Couture C, et al., Regulation of the Lck SH2 Domain by Tyrosine Phosphorylation. Journal of Biological Chemistry 271 (40), 24880–24884 (1996), doi: 10.1074/jbc.271.40.24880.
  • [13]. Jin LL, et al., Tyrosine Phosphorylation of the Lyn Src Homology 2 (SH2) Domain Modulates Its Binding Affinity and Specificity. Molecular & Cellular Proteomics 14 (3), 695–706 (2015), doi: 10.1074/mcp.M114.044404.
  • [14]. Berman HM, et al., The Protein Data Bank. Nucleic Acids Research 28 (1), 235–242 (2000), doi: 10.1093/nar/28.1.235.
  • [15]. Paysan-Lafosse T, et al., InterPro in 2022. Nucleic Acids Research 51 (D1), D418–D427 (2023), doi: 10.1093/nar/gkac993.
  • [16]. Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ, Jalview Version 2—a Multiple Sequence Alignment Editor and Analysis Workbench. Bioinformatics (Oxford, England) 25 (9), 1189–1191 (2009), doi: 10.1093/bioinformatics/btp033.
  • [17]. Bateman A, UniProt: A Worldwide Hub of Protein Knowledge. Nucleic Acids Research (2019), doi: 10.1093/nar/gky1049.
  • [18]. Cock PJA, et al., Biopython: Freely Available Python Tools for Computational Molecular Biology and Bioinformatics. Bioinformatics (Oxford, England) 25 (11), 1422–3 (2009), doi: 10.1093/bioinformatics/btp163.
  • [19]. Hornbeck PV, et al., PhosphoSitePlus: A Comprehensive Resource for Investigating the Structure and Function of Experimentally Determined Post-Translational Modifications in Man and Mouse. Nucleic Acids Research 40 (Database issue), D261–70 (2012), doi: 10.1093/nar/gkr1122.
  • [20]. Hamosh A, Online Mendelian Inheritance in Man (OMIM), a Knowledgebase of Human Genes and Genetic Disorders. Nucleic Acids Research 33 (Database issue), D514–D517 (2004), doi: 10.1093/nar/gki033.
  • [21]. Chen S, et al., A Genomic Mutational Constraint Map Using Variation in 76,156 Human Genomes. Nature 625 (7993), 92–100 (2024), doi: 10.1038/s41586-023-06045-0.
  • [22]. Landrum MJ, et al., ClinVar: Public Archive of Relationships among Sequence Variation and Human Phenotype. Nucleic Acids Research 42 (Database issue), D980–985 (2014), doi: 10.1093/nar/gkt1113.
  • [23]. Kaneko T, et al., Superbinder SH2 Domains Act as Antagonists of Cell Signaling. Science Signaling 5 (243), ra68–ra68 (2012), doi: 10.1126/scisignal.2003021.
  • [24]. Park M-J, et al., SH2 Domains Serve as Lipid-Binding Modules for pTyr-Signaling Proteins. Molecular Cell 62 (1), 7–20 (2016), doi: 10.1016/j.molcel.2016.01.027.
  • [25]. Jubb HC, et al., Arpeggio: A Web Server for Calculating and Visualising Interatomic Interactions in Protein Structures. Journal of Molecular Biology 429 (3), 365–371 (2017), doi: 10.1016/j.jmb.2016.12.004.
  • [26]. Pei J, Kim B-H, Grishin NV, PROMALS3D: A Tool for Multiple Protein Sequence and Structure Alignments. Nucleic Acids Research 36 (7), 2295–2300 (2008), doi: 10.1093/nar/gkn072.
  • [27]. Jaber Chehayeb R, Boggon TJ, SH2 Domain Binding: Diverse FLVRs of Partnership. Frontiers in Endocrinology 11, 575220 (2020), doi: 10.3389/fendo.2020.575220.
  • [28]. Gan W, Roux B, Binding Specificity of SH2 Domains: Insight from Free Energy Simulations. Proteins: Structure, Function, and Bioinformatics 74 (4), 996–1007 (2009), doi: 10.1002/prot.22209.
  • [29]. Grucza RA, Bradshaw JM, Futterer K, Waksman G, SH2 Domains: From Structure to Energetics, a Dual Approach to the Study of Structure-Function Relationships. Medicinal Research Reviews 19 (4), 273–293 (1999), doi: 10.1002/(SICI)1098-1128(199907)19:4<273::AID-MED2>3.0.CO;2-G.
  • [30]. Marengere LEM, Pawson T, Structure and Function of SH2 Domains. Journal of Cell Science 1994 (Supplement_18), 97–104 (1994), doi: 10.1242/jcs.1994.Supplement_18.14.
  • [31]. Marasco M, Carlomagno T, Specificity and Regulation of Phosphotyrosine Signaling through SH2 Domains. Journal of Structural Biology: X 4, 100026 (2020), doi: 10.1016/j.yjsbx.2020.100026.
  • [32]. Songyang Z, et al., Use of an Oriented Peptide Library to Determine the Optimal Substrates of Protein Kinases. Current Biology 4 (11) (1994).
  • [33]. Obenauer JC, Cantley LC, Yaffe MB, Scansite 2.0: Proteome-wide Prediction of Cell Signaling Interactions Using Short Sequence Motifs. Nucleic Acids Research 31 (13), 3635–3641 (2003), doi: 10.1093/nar/gkg584.
  • [34]. Rahuel J, et al., Structural Basis for Specificity of GRB2-SH2 Revealed by a Novel Ligand Binding Mode. Nature Structural Biology 3 (7), 586–589 (1996), doi: 10.1038/nsb0796-586.
  • [35]. Pascal SM, et al., Nuclear Magnetic Resonance Structure of an SH2 Domain of Phospholipase C-gamma 1 Complexed with a High Affinity Binding Peptide. Cell 77 (3), 461–472 (1994), doi: 10.1016/0092-8674(94)90160-0.
  • [36]. Kontaridis MI, Swanson KD, David FS, Barford D, Neel BG, PTPN11 (Shp2) Mutations in LEOPARD Syndrome Have Dominant Negative, Not Activating, Effects. Journal of Biological Chemistry 281 (10), 6785–6792 (2006), doi: 10.1074/jbc.M513068200.
  • [37]. Courtney AH, et al., A Phosphosite within the SH2 Domain of Lck Regulates Its Activation by CD45. Molecular Cell 67 (3), 498–511.e6 (2017), doi: 10.1016/j.molcel.2017.06.024.
  • [38]. Lee JY, Chiu Y-H, Asara J, Cantley LC, Inhibition of PI3K Binding to Activators by Serine Phosphorylation of PI3K Regulatory Subunit P85α Src Homology-2 Domains. Proceedings of the National Academy of Sciences 108 (34), 14157–14162 (2011), doi: 10.1073/pnas.1107747108.
  • [39]. Weir ME, et al., Novel Autophosphorylation Sites of Src Family Kinases Regulate Kinase Activity and SH2 Domain-Binding Capacity. FEBS Letters 590 (8), 1042–1052 (2016), doi: 10.1002/1873-3468.12144.
  • [40]. Burmeister BT, et al., Protein Kinase A (PKA) Phosphorylation of Shp2 Protein Inhibits Its Phosphatase Activity and Modulates Ligand Specificity. Journal of Biological Chemistry 290 (19), 12058–12067 (2015), doi: 10.1074/jbc.M115.642983.
  • [41]. Hof P, Pluskey S, Dhe-Paganon S, Eck MJ, Shoelson SE, Crystal Structure of the Tyrosine Phosphatase SHP-2. Cell 92 (4), 441–50 (1998).
  • [42]. Hadjadj J, et al., Early-Onset Autoimmunity Associated with SOCS1 Haploinsufficiency. Nature Communications 11 (1), 5341 (2020).
  • [43]. Filippakopoulos P, Müller S, Knapp S, SH2 Domains: Modulators of Nonreceptor Tyrosine Kinase Activity. Current Opinion in Structural Biology 19 (6), 643–649 (2009), doi: 10.1016/j.sbi.2009.10.001.
  • [44]. Asmussen J, et al., MEK-Dependent Negative Feedback Underlies BCR-ABL-Mediated Oncogene Addiction. Cancer Discovery 4 (2), 200–215 (2014), doi: 10.1158/2159-8290.CD-13-0235.
  • [45]. Wolf-Yadlin A, et al., Effects of HER2 Overexpression on Cell Signaling Networks Governing Proliferation and Migration. Molecular Systems Biology 2 (1), 54 (2006), doi: 10.1038/msb4100094.
  • [46]. Chylek LA, et al., Phosphorylation Site Dynamics of Early T-cell Receptor Signaling. PLoS ONE 9 (8), e104240 (2014), doi: 10.1371/journal.pone.0104240.
  • [47]. Caruso JA, et al., A Systems Toxicology Approach Identifies Lyn as a Key Signaling Phosphoprotein Modulated by Mercury in a B Lymphocyte Cell Model. Toxicology and Applied Pharmacology 276 (1), 47–54 (2014), doi: 10.1016/j.taap.2014.01.002.
  • [48]. Johnson JL, et al., An Atlas of Substrate Specificities for the Human Serine/Threonine Kinome. Nature 613 (7945), 759–766 (2023), doi: 10.1038/s41586-022-05575-3.
  • [49]. Mertins P, et al., Ischemia in Tumors Induces Early and Sustained Phosphorylation Changes in Stress Kinase Pathways but Does Not Affect Global Protein Levels. Molecular & Cellular Proteomics 13 (7), 1690–1704 (2014), doi: 10.1074/mcp.M113.036392.
  • [50]. Letunic I, Khedkar S, Bork P, SMART: Recent Updates, New Developments and Status in 2020. Nucleic Acids Research 49 (D1), D458–D460 (2020), doi: 10.1093/nar/gkaa937.
  • [51]. Holehouse AS, Naegle KM, Reproducible Analysis of Post-Translational Modifications in Proteomes - Application to Human Mutations. PLoS ONE 10 (12), 1–19 (2015).
  • [52]. Tian F, Yang L, Lv F, Yang Q, Zhou P, In Silico Quantitative Prediction of Peptides Binding Affinity to Human MHC Molecule: An Intuitive Quantitative Structure-Activity Relationship Approach. Amino Acids 36 (3), 535–554 (2009).
