PairK: Pairwise k‐mer alignment for quantifying protein motif conservation in disordered regions

Jackson C Halpin; Amy E Keating

doi:10.1002/pro.70004

2024 Dec 25;34(1):e70004. doi: 10.1002/pro.70004

PMCID: PMC11669117 PMID: 39720898

Abstract

Protein–protein interactions are often mediated by a modular peptide recognition domain binding to a short linear motif (SLiM) in the disordered region of another protein. To understand the features of SLiMs that are important for binding and to identify motif instances that are important for biological function, it is useful to examine the evolutionary conservation of motifs across homologous proteins. However, the intrinsically disordered regions (IDRs) in which SLiMs reside evolve rapidly. Consequently, multiple sequence alignment (MSA) of IDRs often misaligns SLiMs and underestimates their conservation. We present PairK (pairwise k‐mer alignment), an MSA‐free method to align and quantify the relative local conservation of subsequences within an IDR. Lacking a ground truth for conservation, we tested PairK on the task of distinguishing biologically important motif instances from background motifs, under the assumption that biologically important motifs are more conserved. The method outperforms both standard MSA‐based conservation scores and a modern LLM‐based conservation score predictor. PairK can quantify conservation over wider phylogenetic distances than MSAs, indicating that some SLiMs are more conserved than MSA‐based metrics imply. PairK is available as an open‐source python package at https://github.com/jacksonh1/pairk. It is designed to be easily adapted for use with other SLiM tools and for diverse applications.

Keywords: conservation, intrinsically disordered proteins, multiple sequence alignment, short linear motif

1. INTRODUCTION

Many protein–protein interactions involve a modular domain in one protein binding to a short linear motif (SLiM) in the disordered region of another protein. Examples of peptide recognition domains, or SLiM‐binding domains, include SH3, SH2, and PDZ domains. Over several decades, researchers have worked to define core motifs for many SLiM‐binding domains that describe the sequence requirements for peptide binding (Kumar et al., 2022). The motifs are usually defined by patterns that are common to a modest number of binding peptides discovered experimentally or by using de novo motif discovery tools such as SLiMFinder (Edwards et al., 2007). Motif descriptions typically encompass 3–10 residues, with only a few positions fully defined, and are often described using regular expressions that list the residue(s) required or allowed at each position. The low sequence complexity of motif definitions does not typically provide enough information to identify specific biological interactions using sequence alone (Krystkowiak & Davey, 2017). For example, when searching the proteome for sequences that match a motif regular expression, thousands of instances are usually found. However, the vast majority of these matches are not biologically relevant (Krystkowiak & Davey, 2017).

Interactions between SLiMs and their cognate SLiM‐binding domains use diverse mechanisms to enhance binding or confer specificity (Bugge et al., 2020). For instance, SLiM interactions have been reported to use the sequences flanking the core motif (Ball et al., 2000; Hwang et al., 2021; Singer et al., 2024), a secondary binding site on the binding domain (Acevedo et al., 2017), or tandem motif repeats to increase affinity and specificity (Harker et al., 2019; Klippel et al., 2011; Stevers et al., 2018). Different proteins can use distinct mechanisms to recruit the same SLiM‐binding domain (Ball et al., 2000; Boëda et al., 2007; Hwang et al., 2021, 2022), making it impossible to define a single motif that captures all of the information required for binding.

In the bioinformatic analysis of candidate SLiM interactions, filters or annotations are often employed to highlight the motif matches that are most likely to be biologically relevant in a candidate protein or proteome (Krystkowiak & Davey, 2017). Predicted disorder, e.g., as reflected by IUPRED scores (Erdős & Dosztányi, 2020; Mészáros et al., 2018), is used to rule out matches where a motif is buried in a structured domain. Annotations from Gene Ontology (Ashburner et al., 2000; The Gene Ontology Consortium, 2021) can identify candidate motifs that share functions with a binding domain of interest (Krystkowiak & Davey, 2017). Several tools have been developed that aggregate multiple motif‐filtering functions, such as SLiMSearch (Krystkowiak & Davey, 2017) and SLiMAn (Reys et al., 2024). SLiMSuite offers de novo motif discovery and filtering tools in one toolkit (Edwards et al., 2020).

Biologically relevant SLiMs can potentially be recognized based on their evolutionary conservation. The disordered regions of proteins, and SLiMs within them, can evolve rapidly, compared to structured domains (Davey et al., 2015). However, genuine motifs are often more conserved than their surrounding IDR sequences and frequently emerge as islands of conservation amidst highly variable disordered regions (Chica et al., 2008; Davey et al., 2009, 2012; Nguyen Ba et al., 2012). SLiM conservation has been used to discover new motifs, as in phylo‐HMM (Nguyen Ba et al., 2012) and the original implementation of SLiMPrints (Davey et al., 2012), and has also been employed as a filter to identify biologically likely motif matches (Davey et al., 2012; Edwards et al., 2020; Krystkowiak & Davey, 2017; Reys et al., 2024). Assessing sequence conservation typically involves collecting sequences that are homologous to the protein of interest, creating a multiple sequence alignment (MSA), and quantifying the conservation of each MSA column. For IDRs, methods such as DisCons (Varadi et al., 2015) and the conservation function of IUPred3 (Erdős et al., 2021) also report the conservation of disorder for each position. For SLiMs, conservation is typically quantified relative to the surrounding IDR sequence, as SLiMs tend to be more conserved than adjacent IDR sequences, but less conserved than folded domains (Davey et al., 2009, 2012; Nguyen Ba et al., 2012).

Most tools for assessing SLiM conservation use a predefined motif expression to account for allowed substitutions at each position. These methods usually yield a single score quantifying the conservation of the entire motif. However, there are drawbacks to using pre‐defined motifs. In such a scenario, the quality of the conservation score will depend on the quality of the motif definition. But many motifs are poorly defined, and for some peptide‐binding domains, there are few (or no) known binding partners. High‐throughput screens are now re‐defining SLiM binding requirements (Benz et al., 2022; Brauer et al., 2019; Singer et al., 2024) and often reveal that existing definitions are too restrictive. Additionally, increasing evidence suggests that features outside of the core motif, e.g. in the motif‐flanking sequence, play a large role in SLiM binding (Acevedo et al., 2017; Ball et al., 2000; Bugge et al., 2020; Harker et al., 2019; Hwang et al., 2021; Klippel et al., 2011; Singer et al., 2024; Stevers et al., 2018). These sequence features are not considered in motif‐based conservation scoring methods, yet their conservation could be critical for distinguishing protein segments that bind from those that don't. Thus, more flexible bioinformatics tools are needed to quantify per‐position conservation of SLiMs in a motif‐agnostic manner.

A recognized limitation of current SLiM conservation tools is their reliance on an initial MSA (Chica et al., 2008; Davey et al., 2012). Compared to folded domains, intrinsically disordered regions (IDRs) are often extremely challenging, if not impossible, to align. MSAs are constructed under the assumption that residues in each column of the alignment correspond to directly comparable structural positions across sequences. Folded domains, due to strict evolutionary constraints on structure, often exhibit this type of conserved residue positioning across homologs of the same domain, and MSAs of folded domains are highly informative. In contrast, positions across homologous IDRs are often not comparable and violate the underlying assumptions required for MSA interpretation. Because IDRs evolve without the same structural constraints as folded domains, homologous IDRs often exhibit characteristics such as unique residue composition, a high prevalence of insertions and deletions, and low‐complexity or repetitive regions that make alignment particularly challenging (Khan et al., 2015; Lange et al., 2016). Resulting MSAs are often difficult to interpret and can be misleading. In particular, IDR MSAs frequently misalign conserved SLiMs.

ConDens provides a way to quantify motif conservation that avoids MSA misalignment artifacts to some extent (Lai et al., 2012). ConDens assesses motif conservation by quantifying the density of matches to a motif regular expression over a defined region of an MSA and compares the results to an evolutionary model. However, the method requires a predefined motif definition, which limits its utility to well‐studied motifs. Additionally, ConDens uses an input MSA to define regions to scan for motif matches. Therefore, the method could still suffer from motif misalignment artifacts if motifs end up outside the defined scanning region in the MSA. Recent advances in protein large language models (LLMs) have led to methods for predicting per‐residue conservation scores without the need for an input MSA or set of homologous sequences (Marquet et al., 2022; Yeung et al., 2023). Yeung et al. found that their conservation score predictor, Kibby, predicted higher conservation for known phosphorylation sites in disordered regions than traditional conservation scores from MSAs (Yeung et al., 2023). These results suggest that LLM‐based conservation score predictors have the potential to score SLiM conservation on a per‐residue level while avoiding errors inherent in IDR MSAs.

We developed a simple, MSA‐free tool—PairK (pairwise k‐mer alignment)—that generates focused alignments of short sequences in IDRs for the analysis of SLiM conservation. PairK can calculate per‐position conservation scores directly from these alignments. PairK alignments can be generated from raw sequences or from sequences embedded using an LLM such as ESM2 (Lin et al., 2023). We developed a benchmark to evaluate the effectiveness of PairK by using residue‐level conservation scores to distinguish experimentally verified SLiMs from background motif matches in human‐disordered sequences. Our method outperformed both standard MSA‐based and modern LLM‐based conservation score predictors like the Kibby method of Yeung et al. (Yeung et al., 2023). We found that PairK can quantify residue‐level conservation across broader phylogenetic distances than MSAs and is more effective at distinguishing verified SLiMs from background matches over these distances. This suggests SLiMs may be more conserved—and evolve more slowly—than previously believed. The alignments generated by PairK can be used in other SLiM conservation analyses, such as motif‐based conservation scoring. Therefore, PairK is designed as a flexible python package for integration into existing pipelines or for incorporation into new algorithms. PairK is available as an open‐source package at https://github.com/jacksonh1/pairk.

2. RESULTS

2.1. Sequence alignments of disordered regions confound the quantification of SLiM conservation

To evaluate different methods for quantifying sequence conservation, we first developed a pipeline to retrieve and process homologous sequences for a specified protein. We used OrthoDB (Kuznetsov et al., 2023) to obtain precompiled groups of homologous sequences at various phylogenetic levels (called orthologous groups in OrthoDB). For a protein of interest (the query protein), the pipeline performs the following steps: it locates the query protein in OrthoDB, retrieves the orthologous group at a specified phylogenetic level, compares each homolog to the query protein, selects the least divergent homolog in each organism, clusters the least divergent homologs to reduce redundancy, and creates an MSA from the remaining sequences (see Methods for details).

To explore potential issues arising from aligning disordered regions, we examined the MSAs of proteins with verified SLiMs. Figure 1a presents a section of an MSA of RIAM, a verified Ena/VASP EVH1 binding partner in humans (Lafuente et al., 2004), aligned with its vertebrate homologs. The folded domain exhibits high conservation and minimal gaps, while the disordered region containing the Ena/VASP EVH1 binding motif (LPPPP) displays a larger number of insertions/deletions and a higher overall apparent divergence. Figure 1b provides a detailed view of the MSA region containing the SLiM. Although all the homolog sequences have at least one motif match in the short region shown, few are well‐aligned with the human RIAM motif, resulting in an artificially low conservation score for the first motif position. Several positions in the alignment, including several in the SLiM, show artificially high conservation due to prolines that are not confidently aligned (indicated by red arrows). This is because prolines are assigned a high score in most substitution matrices, and disordered regions are rich in proline residues. To optimize the alignment score, alignment algorithms tend to align prolines in IDRs when there is no other strong signal.

Figure 1. Detecting evolutionary conservation of short linear motifs is confounded by poor alignment of disordered regions. (a) Slice of an MSA of RIAM, which contains an LPPPP Ena/VASP binding motif (Lafuente et al., 2004), aligned to its vertebrate homolog sequences. (b) *Left* – Part of the SLiM region of the MSA from (a). Matches to the EVH1 motif are highlighted in red. Columns that appear artificially conserved are indicated with a red arrow. *Right* – The apparent conservation of the SLiM residues extracted from the example MSA. X‐axis labels are residues in the human protein. White space in the sequence logo indicates gaps in the corresponding alignment columns. The bar plot shows the conservation scores of the aligned columns. (c) The conservation scores (Shannon entropy from Capra & Singh, 2007) of residues in experimentally verified SLiMs vary with alignment algorithms. Data are from 240 verified SLiM instances (731 residues). Homologs are from metazoans.

To assess the sensitivity of the apparent evolutionary conservation of residues within SLiMs to the underlying MSA, we collected a set of experimentally verified SLiM instances from the Eukaryotic Linear Motif (ELM) database (Kumar et al., 2024) or manually curated them from the literature (see Methods) and processed them through the pipeline described. For each set of homologs, we generated MSAs using Mafft (Katoh & Standley, 2013), Muscle (Edgar, 2022), and Clustalo (Sievers et al., 2011) and calculated conservation scores for residues within the verified motifs. Figure 1c demonstrates that the conservation scores of many residues vary significantly when using different aligners, suggesting that the MSA, and thus the resulting motif residue conservation scores, are poorly determined.

2.2. PairK (pairwise k‐mer alignment)

We developed an alternative to MSAs for the evolutionary analysis of SLiMs. We assumed that homologous IDRs are highly divergent on a global level, but contain short, positionally conserved stretches of sequence that may correspond to SLiMs. Based on this assumption, we developed a method to align sequence fragments from a query IDR with fragments from homolog IDRs, in a pair‐wise manner. Using these alignments, we calculate conservation scores that describe the conservation of each residue relative to all other residues in equal‐sized fragments. We call the tool PairK (pairwise k‐mer alignment) (Figure 2a).

Figure 2. The pairwise k‐mer alignment method (PairK) for quantifying the conservation of SLiMs. (a) Schematic of the method. (b) Example z‐scores and sequence logos for an Ena/VASP binding motif from the protein RIAM and its vertebrate homologs. X‐axis labels are residues in the human protein. White space in the sequence logos indicates gaps in the corresponding alignment. The results using an MSA (the same MSA from Figure 1a) are shown at *left*. The positions in the MSA corresponding to the SLiM residues in the human sequence (LPPPP) are extracted and shown in the *middle‐left* panel, with gaps removed. The results from PairK (*middle*‐*right*) and the embedding‐based variant of PairK using ESM2 embeddings (Lin et al., 2023) (*right*) suggest that the LPPPP motif is more conserved than it appears in the MSA.

PairK takes as input a query IDR sequence (containing the sequence fragment of interest), a set of homologous IDR sequences, and a sequence fragment length k. The value of k is set equal to the length of the sequence fragment of interest in the query IDR. PairK can be divided into two main steps (see Methods for details).

First, to generate a pairwise k‐mer alignment, the query IDR sequence is divided into all possible k‐mers, where each k‐mer is a length k subsequence of the query IDR. Each k‐mer in the query IDR is then aligned to each homologous IDR in a pairwise manner without gaps, and the best‐matching fragment from each homologous IDR is determined by a scoring matrix. The combined list of the best matching fragments from each homolog and the corresponding k‐mer from the query sequence constitutes a “pseudo MSA” for that k‐mer. We use the term “pseudo MSA” because there is no attempt to perform a global alignment of the retrieved sequences. A pseudo MSA is generated for every k‐mer in the query IDR. For example, if the query IDR is 10 residues long and the user sets k = 5, PairK will produce 6 pseudo MSAs, one for each 5‐mer in the query IDR.
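The first step can be illustrated with a minimal sketch. This is not the PairK package API: the function names, the toy sequences, and the simple identity scoring function (standing in for a substitution matrix) are all assumptions made for illustration.

```python
# Illustrative sketch of gapless pairwise k-mer alignment.
# All names and sequences here are hypothetical, not the PairK API.

def kmers(seq, k):
    """Return every length-k subsequence of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_score(a, b):
    """Toy stand-in for a substitution matrix: 1 for identity, 0 otherwise."""
    return 1 if a == b else 0


def best_match(query_kmer, homolog, score=toy_score):
    """Best-scoring gapless fragment of homolog for one query k-mer.

    Ties are resolved in favor of the leftmost fragment (behavior of max).
    """
    candidates = kmers(homolog, len(query_kmer))
    if not candidates:  # homolog is shorter than k
        return None
    return max(
        candidates,
        key=lambda frag: sum(score(a, b) for a, b in zip(query_kmer, frag)),
    )


def pseudo_msas(query_idr, homolog_idrs, k):
    """One pseudo MSA per query k-mer: the query k-mer, then one fragment per homolog."""
    result = []
    for query_kmer in kmers(query_idr, k):
        rows = [query_kmer]
        for homolog in homolog_idrs:
            fragment = best_match(query_kmer, homolog)
            if fragment is not None:
                rows.append(fragment)
        result.append(rows)
    return result


query = "AAALPPPPAA"                        # hypothetical 10-residue query IDR
homologs = ["GGLPPPPSSGG", "TTTTMPPPPT"]    # hypothetical homolog IDRs
alignments = pseudo_msas(query, homologs, k=5)
print(len(alignments))   # 6 pseudo MSAs, one per 5-mer of the query
print(alignments[3])     # ['LPPPP', 'LPPPP', 'MPPPP']
```

Because every fragment is taken from a contiguous window, each pseudo MSA is gapless by construction, and no global alignment between homologs is ever attempted.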

In the second step of PairK, we compute conservation scores from each of the k‐mer pseudo MSAs. Conservation scores are calculated column‐wise for each position in each k‐mer pseudo MSA. The conservation scores are then converted to z‐scores, using the conservation scores of all positions in all k‐mer pseudo MSAs as the background. For a k‐mer of interest, the z‐scores describe the conservation of each residue relative to all other positions in all k‐mers from the same query IDR. Converting conservation scores to z‐scores corrects for the lower information content of small fragments (particularly for small values of k) and for the background divergence of the IDRs.
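The normalization can be sketched as follows. The column score used here (one minus normalized Shannon entropy) is a simplified stand-in for the conservation scores used in the paper, and the use of the population standard deviation is an assumption; the function names and toy pseudo MSAs are hypothetical.

```python
# Illustrative sketch: column-wise conservation scores converted to z-scores
# against all positions of all k-mer pseudo MSAs. Not the PairK API.
import math
from collections import Counter
from statistics import mean, pstdev


def column_conservation(column, alphabet_size=20):
    """1 - normalized Shannon entropy of one column (1.0 means invariant)."""
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    max_entropy = math.log(min(n, alphabet_size))
    if max_entropy == 0:  # a single-row column carries no variation
        return 1.0
    return 1.0 - entropy / max_entropy


def kmer_zscores(pseudo_msas):
    """Per-position z-scores, using every position of every k-mer as background."""
    scores = [
        [column_conservation(col) for col in zip(*rows)] for rows in pseudo_msas
    ]
    background = [s for kmer_scores in scores for s in kmer_scores]
    mu, sigma = mean(background), pstdev(background)
    if sigma == 0:  # all positions equally conserved
        return [[0.0] * len(kmer_scores) for kmer_scores in scores]
    return [[(s - mu) / sigma for s in kmer_scores] for kmer_scores in scores]


# Two hypothetical pseudo MSAs from the same query IDR (query k-mer first).
toy = [
    ["LPPPP", "LPPPP", "MPPPP"],
    ["AAALP", "GGSLP", "TTTMP"],
]
z = kmer_zscores(toy)
print([round(v, 2) for v in z[0]])
print([round(v, 2) for v in z[1]])
```

In this toy case the invariant proline columns receive positive z-scores, while the fully variable columns of the second k-mer fall below the background mean.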

To directly compare PairK scores with MSA scores, we also converted MSA conservation scores to z‐scores, such that each final score is relative to the other columns in the IDR region of the MSA. This z‐score normalization is very similar to that used in the SLiMPrints method (Davey et al., 2012), except that we normalized relative to the entire IDR instead of a window around each residue.

Figure 2b displays the resulting sequence logos and conservation scores for the example SLiM shown in Figure 1, using both the MSA method and PairK. In these logos, the height of each letter indicates the count of that residue in the corresponding alignment column. Gaps in an alignment column result in whitespace above the letters (where the logo letters do not fill the column). For the example in Figure 2b, the motif residues receive much higher scores using PairK compared to the MSA method. Additionally, the sequence logos for the PairK method suggest that the L in the first position and the P in the third position are more conserved than is reflected in the MSA analysis.

We developed a variation of the pairwise k‐mer alignment step of PairK that incorporates residue embeddings generated by large language models trained on protein sequences. We tested this using ESM2 (Lin et al., 2023) and DR‐BERT (Nambiar et al., 2023) embeddings. In this approach, residue embeddings are computed for the full‐length query sequence and its homologs (DR‐BERT requires a special procedure to deal with long sequences; see Methods). When the query IDR is split into k‐mers, the corresponding residue embeddings are sliced out of the query sequence embedding tensor. For each k‐mer, the best matching fragment in a homologous IDR is determined by calculating the Euclidean distance between the associated k‐mer embedding slice and the embedding slices of all equal‐length fragments in the homologous IDR. The homolog fragment with the lowest embedding distance is selected and used to construct the pseudo MSA. See Figure 2b for an example result from this approach using ESM2 embeddings. PairK is coded to use ESM2 but can use residue embeddings provided by the user, allowing the k‐mer alignment step to be performed with embeddings generated by other LLMs.
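The embedding‐based matching step can be sketched in a few lines. The example below uses tiny hand‐written two‐dimensional vectors in place of real ESM2 embeddings, and it assumes that the distance between two k‐mer slices is the Euclidean distance over all of their entries; the function names are hypothetical and do not correspond to the PairK API.

```python
# Illustrative sketch: choose the homolog fragment whose residue-embedding
# slice is closest to the query k-mer's slice. Embeddings here are toy
# vectors (lists of per-residue lists), not output from a language model.
import math


def slice_distance(a, b):
    """Euclidean distance between two equal-shaped embedding slices."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def best_match_by_embedding(query_emb, start, k, homolog_emb):
    """Start index of the homolog window closest to query_emb[start:start + k]."""
    query_slice = query_emb[start:start + k]
    n_windows = len(homolog_emb) - k + 1
    if n_windows < 1:  # homolog is shorter than k
        return None
    return min(
        range(n_windows),
        key=lambda j: slice_distance(query_slice, homolog_emb[j:j + k]),
    )


# One embedding vector per residue (hypothetical values).
query_emb = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
homolog_emb = [[5.0, 5.0], [1.0, 0.1], [0.9, 1.0], [3.0, 3.0], [0.0, 0.0]]

j = best_match_by_embedding(query_emb, start=1, k=2, homolog_emb=homolog_emb)
print(j)  # 1: homolog residues 1-2 are closest to query residues 1-2
```

The selected start index is then used to slice the homolog sequence itself, so the resulting pseudo MSA has the same gapless form as in the matrix‐based variant.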

2.3. PairK outperforms MSAs at quantifying SLiM residue conservation

Given initially promising results using PairK, as seen in Figure 2b, we sought a systematic approach to evaluate the performance of various scoring methods for quantifying the conservation of residues in SLiMs (Figure 3a). We handpicked seven SLiMs based on their abundant experimentally verified instances annotated in ELM: those defined for binding to domains AP2, EH, SH2, SH3, WW, 14‐3‐3, and Ena/VASP EVH1 (Kumar et al., 2024). Using the regular expressions that define these motifs, sourced from the ELM database or the literature, we identified motif matches in disordered regions of human proteins. The matches were divided into two categories: high‐confidence verified SLiMs (true positives) and background motif matches. The high‐confidence verified SLiMs are annotated true positives from the ELM or were manually curated from the literature (Tables S2 and S3, see Methods). True positives in the ELM are SLiM interactions that are manually curated and supported by various lines of evidence, primarily experimental (Kumar et al., 2024). As in other studies with similar benchmarks (Chica et al., 2008; Davey et al., 2012), we assumed that the large majority of background proteome motif matches are not biologically relevant binders, and the fraction of real motifs is much larger in the true positive set than in the background set. Our benchmark is based on the expectation that true positives are more evolutionarily conserved, on average, than background motif matches.

Figure 3. SLiM conservation scoring benchmark. (a) Schematic of the benchmark pipeline. Distributions of the conservation scores of the motifs in the benchmark are shown for the MSA method (b) and PairK (c). Homologous sequences were gathered at the metazoan level. PairK better separates motif matches that are validated (TPs, orange) from background motif matches in the proteome (BG, blue). (d) For each motif in the benchmark except 14‐3‐3, PairK (red) performs better than the MSA method (gray). Error bars are 95% confidence intervals from the bootstrap analysis. The plot is for homologs at the Metazoa level.

Each protein motif was evaluated using different scoring methods to calculate a conservation score for the query motif, defined as the average residue‐level conservation score of defined motif positions. Motif positions where all residues are allowed (designated by "x" or "." in regular expressions) were excluded from the analysis. We then assessed how effectively conservation scores distinguish true positives from background matches, using the area under the precision‐recall curve (auPRC) as our performance metric (Saito & Rehmsmeier, 2015). For further details, see Methods and Table S1.
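The scoring and evaluation described above can be sketched as follows. The motif is represented as a hypothetical list of per‐position tokens, and the area under the precision‐recall curve is approximated by average precision, which is one common estimator and an assumption here; the exact procedure used in the benchmark is given in the Methods. All names and numbers are illustrative.

```python
# Illustrative sketch: motif-level conservation score and a simple
# average-precision estimate of auPRC. Not the benchmark code.

WILDCARDS = ("x", ".")


def motif_score(position_scores, motif_tokens):
    """Mean residue-level score over defined (non-wildcard) motif positions."""
    defined = [
        s for s, token in zip(position_scores, motif_tokens)
        if token not in WILDCARDS
    ]
    if not defined:
        raise ValueError("motif has no defined positions")
    return sum(defined) / len(defined)


def average_precision(labels, scores):
    """Average precision: mean of the precision at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("at least one true positive is required")
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / n_positive


# Hypothetical 4-position motif with wildcards at positions 0 and 2.
tokens = ["x", "P", "x", "E"]
score = motif_score([0.0, 2.0, -1.0, 1.0], tokens)
print(score)  # 1.5

# Hypothetical benchmark: 1 = verified SLiM, 0 = background match.
labels = [1, 0, 1, 0]
scores = [2.0, 1.5, 1.0, 0.1]
ap = average_precision(labels, scores)
print(round(ap, 4))  # 0.8333
```

Precision‐recall metrics suit this task because verified motifs are far outnumbered by background matches, so a score is informative only if the top of the ranking is enriched in true positives.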

For each SLiM class in the benchmark except 14‐3‐3, PairK outperforms the MSA method (Figure 3b–d). Interestingly, the degree of improvement varies among the SLiMs. AP2 shows the largest improvement (~2x increase in auPRC), while 14‐3‐3 shows the least. The embedded version of PairK further improved performance for some SLiMs, with AP2 showing a ~3x increase in auPRC compared to the MSA method. Although DR‐BERT is an LLM trained specifically for IDRs (Nambiar et al., 2023), PairK run with DR‐BERT embeddings did not perform as well as PairK run with ESM2 embeddings in our benchmark (Figure S1). ESM2 was therefore used to generate embeddings for other analyses, unless otherwise stated.

For our initial analysis, and for all tests where we do not indicate otherwise, we set k equal to the length of the annotated motif for each domain of interest. We explored the impact of including a flanking sequence around the motif match, i.e., we increased the value of k by including flanking residues on both sides of the motif. In these cases, we performed the pairwise alignment step with the increased value of k and then extracted just the core‐motif residue scores (see Figure S2A for illustration). Although adding flanking sequence slightly reduced performance for most SLiMs, results were still better than for the MSA method (Figure S2B). Adding flanking sequences improved performance for WW and 14‐3‐3 SLiMs. The choice of scoring matrix for the pairwise alignment step, including a matrix developed for the alignment of disordered regions (Trivedi & Nagarajaram, 2019), had negligible effects on performance (Figure S2C).
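The flanking procedure amounts to widening the k‐mer window before alignment and slicing the core positions back out afterwards. A minimal sketch follows; the function names are hypothetical, and the choice to raise an error when a flank would run past the end of the IDR is an assumption made to keep the example simple.

```python
# Illustrative sketch: extend a motif match with flanking residues before
# alignment, then recover the scores of the core motif positions.

def flanked_window(motif_start, motif_len, flank, idr_len):
    """Return (start, k) of the k-mer covering the motif plus flank on each side."""
    start = motif_start - flank
    k = motif_len + 2 * flank
    if start < 0 or start + k > idr_len:
        raise ValueError("flank extends past the end of the IDR")
    return start, k


def core_scores(window_scores, motif_len, flank):
    """Slice the core-motif scores out of the scores for the flanked k-mer."""
    return window_scores[flank:flank + motif_len]


# A 5-residue motif starting at index 5 of a hypothetical 20-residue IDR,
# extended by five residues on each side.
start, k = flanked_window(motif_start=5, motif_len=5, flank=5, idr_len=20)
print(start, k)  # 0 15

window_scores = list(range(15))  # placeholder scores for the 15-mer
print(core_scores(window_scores, motif_len=5, flank=5))  # [5, 6, 7, 8, 9]
```

The alignment is driven by the full window, so the flanks influence which homolog fragment is selected even though only the core positions contribute to the final motif score.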

We assessed conservation as calculated using the alignment‐free score predictor of Yeung et al., referred to as the Kibby method here. After converting the Kibby scores to z‐scores, as for MSA scores, we found that Kibby outperformed MSAs in most instances except for the WW motif (Figure 3d). PairK outperformed Kibby for most SLiMs; in some cases, however, Kibby performed similarly to or better than PairK.

2.4. PairK quantifies SLiM conservation at greater phylogenetic distances

We examined how various conservation methods performed for orthologous groups at different phylogenetic levels. For progressively more divergent phylogenetic levels (Tetrapoda, Vertebrata, and Metazoa) (Figure 4a), the performance of the MSA method remained approximately the same, whereas that of PairK increased (Figure 4b,c; Figures S4 and S5). The increasing performance of PairK with increasing phylogenetic level is consistent with greater conservation of the SLiM relative to the rest of the IDR. Figure 4d shows an example. Here, we added flanking residues around the motif prior to the pairwise k‐mer alignment (increased k to 15) to show conservation of the motif and five residues N and C terminal to the motif. The sequence logos and scores look similar for both the MSA and PairK methods at the Vertebrata level. However, at the Metazoa level, the MSA method suggests that the SLiM is not conserved in this group of species. The MSA‐based logo for Metazoa reflects many gaps in the alignment and includes threonine at the first position. The apparent higher conservation of the second, third, and fourth prolines is likely due to spuriously aligned prolines from poorly aligned sequences. Overall, the MSA method suggests that the motif is not conserved in the >200 additional organisms present in the metazoan orthologous group. In contrast, by PairK, the SLiM residues score even higher at the Metazoa level than at the Vertebrata level. The sequence logo suggests that the motif residues remain generally conserved at the Metazoa level, but the surrounding residues in the IDR are less conserved, increasing the relative score of the motif residues.

Figure 4. PairK better distinguishes real motifs from background matches for more divergent homologs. (a) Phylogenetic tree of Eukaryotes. (b) Performance (reported as auPRC) for the MSA and PairK methods at different phylogenetic levels. The performance of the Kibby method is replotted at each level for comparison with the other methods; however, it is independent of the phylogenetic level, as only the query sequence is used in its calculation. Error bars are 95% confidence intervals from a bootstrap analysis. (c) The difference in auPRC score for PairK vs. the MSA method for individual motifs at different phylogenetic levels. (d) Example sequence logos and conservation scores for an experimentally verified SLiM from lamellipodin (RAPH1) that binds to the Ena/VASP EVH1 domain. The motif region is highlighted in red. For Vertebrata, the MSA and pairwise k‐mer methods (*top*) perform similarly and show similar sequence profiles. For Metazoa, the MSA (*bottom left*) has a high fraction of gaps, indicated by white space in the logo, while the pairwise k‐mer method (*bottom right*) indicates that the motif is still conserved in metazoans.

The enhanced accuracy of PairK, corroborated by benchmark results, opens avenues to uncover potentially significant but as‐yet unvalidated SLiMs and to reveal conserved sequence attributes that contribute to SLiM binding. We scrutinized motif matches in the benchmark that were highly rated by PairK but poorly scored by the MSA method (benchmark scores available in Table S4) and identified several candidates for further study (illustrated in Figure 5).

Figure 5. Sequence logos and conservation scores for examples from the benchmark. (a) TRAF6 motif match in G protein‐coupled receptor 179 with homologs from the vertebrate level. (b) Motif matches for the Ena/VASP EVH1 domain (vertebrate level for RIAM, and metazoan level for WASF2 and Roundabout homolog 2). White space in the sequence logos indicates gaps in the alignment. The x‐axis labels are the residues of the human sequence, which was the query sequence. For the MSA plots, the positions corresponding to the human residues were extracted from the MSA (as in Figure 1b) for easier visualization. The sequence shown on the x‐axis labels is the full query k‐mer. In (b) a larger value of k was used (k = 15) for the PairK method to show sequence flanking the motif. The motif residues are highlighted in red.

One example is sequence EVCPWEVTE from G protein‐coupled receptor 179, which matches the TRAF6 preferred motif xxxPxExx[FYWHDE] (Figure 5a). TRAF6 is an E3 ubiquitin ligase that binds to cell‐surface receptors and is involved in cellular processes including immunity and NF‐κB signaling. Previous work suggests that TRAF6 highly disfavors binding to sequences with a proline at positions following the motif‐required proline (position 0), because ligands form a beta‐sheet in complex with the domain (Halpin et al., 2022; Pullen et al., 1999; Ye et al., 2002). Conversely, TRAF6 favors binding to sequences with a negatively charged residue at the last position in the motif (position +5), based on the enrichment of this residue in known binding sequences (Huang et al., 2019; Jiang et al., 2004; Pullen et al., 1999; Sato et al., 2003; Shi et al., 2015; Tsukamoto et al., 1999; Ye et al., 2002). The MSA‐derived sequence logo for a TRAF6 motif match in G protein‐coupled receptor 179 indicates that features important for binding are not conserved in this protein. Glycine is observed at position +2, contrary to glutamate as mandated by the motif definition. Arginine and glutamine are also observed in position +5 instead of the preferred glutamate and aspartate. The presence of proline in position +4 is incompatible with beta‐sheet binding geometry. In contrast, when analyzed using PairK, G protein‐coupled receptor 179 looks much more likely to bind TRAF6, as it is apparent that there are stretches of continuous sequence in all homologs that match the motif, and nearly all of those segments have glutamate at position +5 and no prolines after position 0. Little is known about G protein‐coupled receptor 179, so it is difficult to further assess whether it is a likely TRAF6 binding partner.
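The position numbering used above can be made concrete with a short regular‐expression sketch. It treats "x" in the motif as a wildcard and numbers positions relative to the required proline; the helper name and the padded test sequence are hypothetical, and a lookahead is used so that overlapping matches are not missed.

```python
# Illustrative sketch: match the TRAF6 motif xxxPxExx[FYWHDE] and number
# positions relative to the required proline (position 0).
import re


def find_matches(motif, sequence):
    """Return (start, matched fragment) for every, possibly overlapping, match."""
    pattern = re.compile("(?=(" + motif.replace("x", ".") + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]


TRAF6_MOTIF = "xxxPxExx[FYWHDE]"
PROLINE_INDEX = 3  # index of the required proline within the motif

# The 9-mer from the text, padded with hypothetical flanking glycines.
matches = find_matches(TRAF6_MOTIF, "GGEVCPWEVTEGG")
print(matches)  # [(2, 'EVCPWEVTE')]

fragment = matches[0][1]
positions = {i - PROLINE_INDEX: residue for i, residue in enumerate(fragment)}
print(positions[0], positions[2], positions[5])  # P E E
```

With this numbering, the glutamate required by the motif sits at position +2 and the preferred acidic residue at position +5, matching the positions discussed for the sequence logos.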

We identified two untested motif matches that are likely to be genuine Ena/VASP EVH1 binding partners: WASF2 and Roundabout Homolog 2 (ROBO2) (Figure 5b). Both proteins are biologically plausible Ena/VASP interaction partners, due to their roles in cytoskeletal processes. Moreover, ROBO2 is a paralog of the confirmed Ena/VASP binder ROBO1 (Bashaw et al., 2000), while WASF2 is a component of the WAVE complex, which plays a crucial role in lamellipodia formation (Chen et al., 2014; Chereau et al., 2005; Suetsugu et al., 1999). These motif matches are conserved according to PairK, but not when analyzed using MSAs.

PairK can potentially unveil conserved sequence features that enhance binding affinity and specificity. The Ena/VASP motif from RIAM is a good example (Figure 5b). The MSA motif logo reveals a low conservation score for the first motif position and includes positively charged residues in the N‐terminal flanking sequence. However, previous studies have shown that Ena/VASP EVH1 prefers negatively charged residues adjacent to the motif (Ball et al., 2000; Carl et al., 1999). Furthermore, the first position of the motif is crucial for binding and engages a specific pocket on the EVH1 domain. In contrast to the MSA‐based analysis, PairK reveals the conservation of negatively charged residues in the N‐terminal flank and a higher conservation of leucine at the first motif position.

3. DISCUSSION

It is well known that disordered regions of proteins are challenging to align. Because MSAs form the basis for conservation scores, the quality of these alignments directly influences the reliability of conservation assessments, as illustrated in Figure 1. To detect conservation more reliably, we developed a multiple‐sequence‐alignment‐free method termed PairK (pairwise k‐mer alignment) (Figure 2) to quantify the conservation of short sequence fragments within IDRs (including SLiMs). We established a benchmark to evaluate the ability of our method to distinguish genuine SLiMs from background motif matches using the conservation of residues within SLiMs and showed that PairK significantly outperforms MSAs on this test. We use this benchmark to compare residue‐level conservation scoring methods in the absence of reliable annotations that can serve as a ground truth. We note that SLiM filtering tools, which integrate multiple metrics for scoring/filtering (such as SLiMSearch (Krystkowiak & Davey, 2017)) or regular expression‐based SLiM conservation tools, may perform better on this specific benchmark task.

PairK offers several advantages over typical MSA‐based conservation methods when assessing SLiM conservation. One advantage is that it does not allow gaps. Deciding how much to penalize gaps in an MSA can be difficult and arbitrary. PairK circumvents this issue by not allowing gaps at all. If there is no well‐matching fragment in a homolog sequence, the scores are penalized through the mismatching residues rather than gap penalties. The fact that the pseudo MSAs are gapless greatly simplifies the interpretation of conservation at specific positions in the motif (Figure 5) and reduces artifacts from spuriously aligning high‐scoring residues such as proline (Figure 1). It is more consistent with what is known about the binding of SLiMs to protein domains, which typically can't accommodate gaps or insertions in the core motif because of the requirement to engage specific pockets on the binding domain.

One application for our tool is evaluating candidate SLiMs discovered in proteomic experimental screens (Benz et al., 2022; Hwang et al., 2022; Ivarsson et al., 2014). Candidate binders from screens are not all biologically relevant. Thus, researchers are tasked with prioritizing sequences likely to yield significant biological insights for study, and sequence conservation emerges as a crucial factor in these decisions. Proteomic screens are often performed on SLiM‐binding domains with poorly defined motifs and tend to uncover sequences that bind to the domain but don't match the existing motif definition. For this application, conservation scoring methods that yield per‐residue conservation scores are required, and methods that depend upon a predefined motif may not be suitable. MSA‐based conservation scores can either underestimate or overestimate sequence conservation due to the quality of the underlying MSA. In contrast, PairK is less affected by artifacts resulting from aligning disordered regions (Figure 1) and is superior for distinguishing biologically relevant SLiMs from background matches (Figure 3). The protein LLM‐based conservation scoring method (the Kibby method (Yeung et al., 2023)) also outperforms the MSA method in our benchmark and has the advantage of not requiring homologous sequences at all. However, Kibby did not perform as well as PairK for most motifs (Figure 3).

Another application of SLiM conservation scoring is generating hypotheses about important SLiM binding determinants, based on residue‐level conservation. The identity and frequency of residues in homologs at each position within and around the motif (e.g., sequence logos as shown in Figure 5) are useful for this purpose. PairK is well‐suited for this kind of analysis because it only utilizes gapless, continuous sequence fragments, and the pseudo MSAs can be analyzed or used to generate sequence logos that summarize motif features. This is not true of the Kibby method, which doesn't provide the identity or frequency of residues in homologs at each position but only predicts a conservation score for each residue in a specific sequence.

We tested our conservation scoring methods on homologs from different phylogenetic levels (Figure 4). Notably, the performance of PairK improves with a broader phylogenetic range. As homologs diverge, the MSAs indicate that SLiMs are less conserved across metazoans. However, the enhanced performance of PairK implies that this apparent lack of conservation is due to declining MSA quality rather than reduced SLiM conservation. Alignments using MAFFT, Muscle, and Clustal Omega at diverse phylogenetic levels (Figure S3) support this observation, as the consistency of MSAs built using different tools diminishes with increased global divergence. Thus, our findings reveal an advantage of PairK: it enables the examination of SLiM conservation across broader phylogenetic ranges. This breadth enhances the conservation signal due to increased background divergence.

One consideration when using PairK is the choice of k. Adding residues flanking a potential motif (increasing k) could provide insight into the conservation of residues surrounding the motif. This could reveal sequence features outside of the core motif that are important for binding. Additionally, if a protein contains multiple motifs and the conservation of just one of them is of interest, the addition of flanking residues can help find the most similar motif in each homolog sequence. However, if k is too large, potentially irrelevant residues could influence the k‐mer alignment and lower the overall quality of the results. The gapless nature of PairK assumes that residue positions are conserved due to contacts with the SLiM‐binding domain. If k is too large and this assumption is broken, the resulting alignment quality will decrease. Our benchmark results suggest that PairK will perform differently for different SLiMs (Figure 3d). We advise running PairK on known instances of a SLiM of interest and comparing the conservation results with background matches to the motif, using a few different values for k. This will provide a sense of how conserved the specific SLiM instances are, how well the method separates true positives from background, and how much flanking sequence can be included without decreasing performance. When interpreting results from PairK, it is important to consider the z‐scores, rather than relying on the sequence logos or individual pseudo MSAs alone. K‐mer alignments contain less information than global alignments and the z‐score helps to indicate when the information content is too low to conclude that a k‐mer is conserved. For small values of k, many k‐mer matches will arise by chance. Z‐scores indicate whether the k‐mer of interest is more conserved than other k‐mers with the same k.

Several variations of our pairwise k‐mer alignment method could be beneficial for specific analyses. For example, it may be useful to employ position weighting during pairwise alignment. In such an approach, individual positions in each query k‐mer – homolog k‐mer match could be weighted, allowing users to include known SLiM information when selecting optimal scoring fragments from each homolog. For example, for the TRAF6 motif xxxPxExx[FYWHDE], with k = 9, scores from the “x” positions could be down‐weighted in the alignment to prioritize selecting homolog fragments that match the essential P/E positions and the last position. Correcting for residue biases in the input sequences might also boost sensitivity. We used a scoring matrix built for IDRs to account for differences in residue composition between IDRs and ordered regions. However, PairK could be improved by correcting for the specific residue composition of the input query and its homologs.

PairK can be used without any prior knowledge of motifs or regular expressions, and thus could potentially be used as an agnostic motif discovery tool. One way to do this would be to simulate homolog sequences of similar evolutionary distances and residue compositions to the real homologs. The k‐mer conservation of the real homologs could then be compared to the simulated homologs to find k‐mers that are more conserved than expected. Similar approaches have been taken to quantify the conservation of IDR physicochemical properties (Zarin et al., 2019).

PairK was designed as a flexible python library so that it is easy to adapt and incorporate into existing conservation scoring methods or new algorithms. The pairwise k‐mer alignment step of PairK can be performed separately from the conservation scoring step and pseudo MSAs can be easily accessed for custom applications. Alternative scoring methods can easily be applied to the pseudo MSAs produced by PairK, for example, the regular expression‐based score of Chica et al. (Chica et al., 2008) or many of the other existing motif‐based scoring methods. However, when using PairK alignments, care should be taken to incorporate an appropriate background correction, as we did here by using z‐scores. The k‐mer alignments contain much less information than global alignments, and at small k, many k‐mers can appear conserved by chance alone.

PairK is freely available as an open‐source Python package at https://github.com/jacksonh1/pairk.

4. METHODS

4.1. General tools

We used Biopython (Cock et al., 2009) to facilitate sequence processing. Sequence logos were generated using the logomaker python package (Tareen & Kinney, 2020). Many of the plots were generated using the Seaborn (Waskom, 2021) or matplotlib (Hunter, 2007) python packages. The multiple sequence alignment image in Figure 1a was generated using Jalview (Waterhouse et al., 2009).

4.2. Pipeline for gathering and processing homologous sequences for conservation analysis

The following pipeline was used to generate all groups of homologous sequences in this study. For a protein of interest, here called the query protein, we gathered precomputed orthologous groups at the specified phylogenetic level from a locally downloaded copy of the OrthoDB data v 11 (Kuznetsov et al., 2023). Any sequences that were shorter than 0.5 times the query sequence length were removed. We further removed any sequences with non‐amino acid characters (“X”, “x”, “*”, “J”, “B”, “U”, or “Z”). Next, we used the alfpy python package (v 1.0.6) (Zielezinski et al., 2017) to calculate the sequence distance between the query sequence and all other sequences in the orthologous group, using google distance (Choi & Rashid, 2008) between frequency vectors with a word size of 2. The sequence distances were then used to remove all but the closest sequence to the query sequence for each organism in the orthologous group (least divergent homolog), such that there remained one sequence for each organism. The remaining homologs were then clustered to 90% identity using CD‐HIT (v 4.8.1) (Fu et al., 2012; Li & Godzik, 2006) with the -g parameter. We redefined the representative sequence for the cluster containing the query protein to make sure that it was the cluster representative. The homolog sequences were then reduced to just the representative of each cluster. The final homolog sequences were then aligned with MAFFT (v 7.52) (Katoh & Standley, 2013) unless otherwise specified. When Clustal Omega (v 1.2.3) (Sievers et al., 2011) was used, we used default parameters. For Muscle (v 5.2) (Edgar, 2022), default parameters were used except we added the -super5 flag. The pipeline used to generate the homolog groups is available at https://github.com/jacksonh1/slim_conservation_orthogroup_generation.
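The first filtering steps of this pipeline can be illustrated with a minimal Python sketch. It is not the pipeline code from the repository above; the function names, the dictionary layout, and the toy inputs are assumptions made for illustration, and the distances are taken as given rather than computed with alfpy.

```python
# Characters treated as non-amino-acid symbols in the text above.
NON_AA_CHARS = set("Xx*JBUZ")


def filter_homologs(query_seq, homologs, min_length_fraction=0.5):
    """Drop homologs that are too short or contain non-amino-acid characters.

    `homologs` maps a sequence ID to its sequence (hypothetical layout).
    """
    min_length = min_length_fraction * len(query_seq)
    kept = {}
    for seq_id, seq in homologs.items():
        if len(seq) < min_length:
            continue  # shorter than 0.5 x the query length
        if NON_AA_CHARS & set(seq):
            continue  # contains X, x, *, J, B, U, or Z
        kept[seq_id] = seq
    return kept


def least_divergent_per_organism(distances, organisms):
    """Keep, for each organism, the sequence closest to the query.

    `distances` maps sequence ID -> distance to the query;
    `organisms` maps sequence ID -> organism name (both hypothetical).
    """
    best = {}
    for seq_id, dist in distances.items():
        org = organisms[seq_id]
        if org not in best or dist < distances[best[org]]:
            best[org] = seq_id
    return sorted(best.values())
```

Clustering with CD‐HIT and alignment with MAFFT would follow these steps and are not shown.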

4.3. Definition of disordered regions

To define the disordered regions in a query protein, we used IUPred2A (Erdős & Dosztányi, 2020; Mészáros et al., 2018) to calculate disorder scores for the sequence. We defined the IDRs as regions in the query sequence where the IUPred2A scores were above 0.4. If two IDRs were separated by fewer than 11 residues, we merged them into one IDR. If an IDR was shorter than 8 residues, it was discarded.
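The three rules above (threshold, merge, discard) can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code used in the study: the function name is hypothetical, per‐residue disorder scores are assumed to be already computed, and merging is assumed to happen before short regions are discarded.

```python
def define_idrs(scores, threshold=0.4, max_gap=10, min_length=8):
    """Return IDRs as (start, end) pairs, 0-based with `end` inclusive.

    `scores` is a list of per-residue disorder scores (e.g., from IUPred2A).
    """
    # 1. Contiguous runs of residues scoring above the threshold.
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(scores) - 1))

    # 2. Merge regions separated by fewer than 11 residues (gap <= max_gap).
    merged = []
    for region in regions:
        if merged and region[0] - merged[-1][1] - 1 <= max_gap:
            merged[-1] = (merged[-1][0], region[1])
        else:
            merged.append(region)

    # 3. Discard regions shorter than 8 residues.
    return [(s, e) for s, e in merged if e - s + 1 >= min_length]
```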

4.4. Column‐wise conservation scores

To calculate conservation scores for MSA columns and PairK pseudo MSA columns, we used the python script from Capra et al. (Capra & Singh, 2007), which we trivially modified for compatibility with modern python tools. Unless otherwise specified, conservation scores were calculated for each column of an MSA (or pseudo MSA) using the property entropy score from the Capra et al. script. The residue groups (V, L, I, M), (F, W, Y), (S, T), (N, Q), (H, K, R), (D, E), (A, G) were treated as equivalent amino acids (Williamson, 1995). For Figures 1 and S3, the Shannon entropy score from the Capra et al. script was used. For both the Shannon entropy and property entropy, a gap penalty was applied to the final score for the column (as in Capra et al.), where the score was multiplied by one minus the fraction of the column that was gaps (here termed gap fraction). The scores were inverted and normalized such that the values ranged from 0 to 1, with 1 being maximally conserved and 0 reflecting no conservation. Both Shannon entropy and property entropy showed similar performance on the benchmark (Figure S2D).
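A simplified sketch of a property‐entropy column score is given below to make the calculation concrete. It is not the Capra et al. script: it omits the pseudocounts and sequence weighting that such a script may apply, it assumes that cysteine and proline each form their own class (nine classes in total), and it assumes that the gap penalty multiplies the score by the non‐gap fraction of the column. The function name is hypothetical.

```python
import math
from collections import Counter

# Residue groups treated as equivalent, as listed in the text.
GROUPS = ["VLIM", "FWY", "ST", "NQ", "HKR", "DE", "AG"]
RESIDUE_TO_GROUP = {aa: group for group in GROUPS for aa in group}
N_CLASSES = 9  # seven groups plus C and P as singletons (assumption)


def column_conservation(column, gap_char="-"):
    """Score one alignment column: 1 = maximally conserved, 0 = not conserved."""
    residues = [aa for aa in column if aa != gap_char]
    if not residues:
        return 0.0
    # Count property classes; residues outside the groups are their own class.
    counts = Counter(RESIDUE_TO_GROUP.get(aa, aa) for aa in residues)
    total = len(residues)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    # Normalize by the largest entropy achievable for this column.
    max_entropy = math.log(min(total, N_CLASSES))
    normalized = entropy / max_entropy if max_entropy > 0 else 0.0
    gap_fraction = 1 - total / len(column)
    # Invert so that 1 is conserved, then down-weight gappy columns.
    return (1 - normalized) * (1 - gap_fraction)
```

For example, a column containing only L, I, V, and M scores as fully conserved because all four residues fall in the same group.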

4.5. Variability in MSAs generated by different methods for regions containing SLiMs (Figure 1c and Figure S3)

The pipeline described in section 4.2 was used to gather homologous sequences for proteins containing verified SLiMs. The same set of SLiMs was used as in the benchmark (TP set, described below) except for LIG_14‐3‐3_CanoR_1, which was removed due to its variable‐length regular expression. MSAs were produced with the final homolog sequences using MAFFT, Muscle, and Clustal Omega (see pipeline methods above for MSA details). From the MSAs, Shannon entropy conservation scores were calculated for the columns corresponding to the defined positions of the motif, i.e., any position in the regular expression not defined as “x” (x = any residue) (see Table S1 for motif position masks used).

4.6. MSA‐based conservation score

To calculate conservation scores from an MSA, the property entropy score was first calculated for each column in the alignment (described in section 4.4). For positions within the IDR (as defined by the query sequence, described in section 4.3), we converted the scores to z‐scores. The background score distribution used for the z‐score calculation was every score in the IDR whose column had a gap fraction less than 0.2. The z‐score for each position was the difference between the column score and the mean of the background distribution, divided by the standard deviation of the background distribution. The reported score for the motif was the average z‐score of the defined motif residues (see Table S1 for motif position masks used).
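The z‐score conversion can be written compactly. The sketch below is illustrative only: the function name and argument layout are assumptions, and the population standard deviation is used because the text does not state which estimator was applied.

```python
from statistics import mean, pstdev


def motif_z_score(scores, gap_fractions, motif_positions, max_gap_fraction=0.2):
    """Average z-score of the defined motif positions within one IDR.

    `scores` and `gap_fractions` hold one value per IDR column;
    `motif_positions` are indices into those lists (hypothetical layout).
    """
    # Background: scores from columns with a gap fraction below the cutoff.
    background = [s for s, g in zip(scores, gap_fractions) if g < max_gap_fraction]
    mu = mean(background)
    sigma = pstdev(background)  # population standard deviation (assumption)
    z_scores = [(s - mu) / sigma for s in scores]
    return mean([z_scores[i] for i in motif_positions])
```

In practice a guard against a near‐zero standard deviation is needed; the benchmark scoring in section 4.13 discards such cases.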

4.7. Kibby conservation scores

The conservation score predictor, Kibby (Yeung et al., 2023), generates per‐residue conservation scores for input sequences. It does not require an alignment or homologous sequences. We used the Kibby conservation score predictor to generate conservation scores for the full‐length query sequences in the benchmark datasets. The Yeung et al. script conservation_from_fasta.py was used with default parameters (language model esm2_t33_650M_UR50D) and ‘‐device = cuda’. The resulting conservation scores were converted to z‐scores in the same manner as the MSA‐based conservation scores, using the conservation scores of the IDR residues as the background score distribution. No gap fraction mask was used as there were no gaps or alignment involved. The reported score for the motif was the average z‐score of the defined motif residues (see Table S1 for motif position masks used).

4.8. Pairwise k‐mer alignment (PairK) method

To run PairK for a query IDR, we had to obtain the corresponding IDR in each homolog. For a direct comparison with the MSA‐based method, we extracted the homolog IDRs from an MSA by slicing the region of the alignment corresponding to the query IDR. Thus, the same IDR sequences that are used in the MSA‐based score calculations are used as input for PairK. The IDRs were then de‐aligned by removing all the gaps. From the query IDR, overlapping k‐mer sequences were generated using a sliding window approach. Starting from the first position in the sequence, the sequence within a window of length k was recorded as the k‐mer for that position. This process was repeated for each position in the query IDR until the window reached the end of the sequence. In more formal notation, for a query IDR q with a length of N residues, a total of N – k + 1 k‐mers was generated. The k‐mer at position n in q (k‐mer^q,n) is composed of the residues at positions n to n + k ‐ 1. For each k‐mer in the query IDR (k‐mer^q,n), a gapless pairwise “pseudo MSA” was constructed using the following procedure. For each homolog IDR h, a k‐mer was generated at each position m (k‐mer^h,m). For each homolog, the k‐mer^h,m that best matched the query k‐mer^q,n was identified by calculating an alignment score for the k‐mer^q,n – k‐mer^h,m gapless alignment at each homolog position m, using a scoring matrix. For each residue match in the k‐mer^q,n – k‐mer^h,m alignment, the match score was retrieved from the scoring matrix, and the sum of the match scores was taken as the alignment score. The pseudo MSA for each k‐mer^q,n was constructed by collecting the highest scoring k‐mer^h,m from each homolog IDR. In instances where there was more than one best‐scoring k‐mer^h,m, the first instance (most N‐terminal) was selected (behavior of the numpy argmax function (Harris et al., 2020)). 
Unless otherwise specified, we used the EDSSMat50 substitution matrix (Trivedi & Nagarajaram, 2019), which was built for disordered regions, as the scoring matrix. We also tested the Blosum62 matrix (Henikoff & Henikoff, 1992) and Grantham distance matrix (Grantham, 1974). The Grantham distance matrix values were normalized to the range 0–1 by subtracting from each matrix element the lowest value in the matrix and dividing by the difference between the highest and lowest matrix values. The normalized matrix was then converted to a similarity matrix by subtracting each matrix element from 1. All three matrices showed very similar performance, and we chose to use EDSSMat50 (Figure S2).
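The alignment step described above can be sketched in a few lines of Python. This is a minimal illustration, not the pairk package API: the function names are hypothetical, and a toy match/mismatch score stands in for the EDSSMat50 matrix.

```python
def kmers(seq, k):
    """All overlapping k-mers of `seq` (N - k + 1 of them)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def toy_score(a, b):
    """Toy substitution score (assumption): +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


def best_kmer(query_kmer, homolog_idr, score=toy_score):
    """Position and score of the best gapless match in one homolog IDR."""
    best_pos, best_score = None, None
    for m, candidate in enumerate(kmers(homolog_idr, len(query_kmer))):
        s = sum(score(a, b) for a, b in zip(query_kmer, candidate))
        # Strict '>' keeps the most N-terminal k-mer when scores tie.
        if best_score is None or s > best_score:
            best_pos, best_score = m, s
    return best_pos, best_score


def pseudo_msa(query_idr, n, k, homolog_idrs, score=toy_score):
    """Gapless pseudo MSA for the query k-mer starting at position n."""
    query_kmer = query_idr[n:n + k]
    rows = [query_kmer]
    for homolog in homolog_idrs:
        pos, _ = best_kmer(query_kmer, homolog, score)
        if pos is not None:  # homologs shorter than k contribute no row
            rows.append(homolog[pos:pos + k])
    return rows
```

With the query IDR AAFPPPPTEE, n = 2, and k = 5, the query k‐mer is FPPPP; a homolog IDR GGLPPPPSDD contributes LPPPP as its best gapless match under the toy score.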

For the embedding version of PairK, we used a similar procedure as for the normal PairK alignment, except residue embeddings from ESM2 (Lin et al., 2023) were used to select the best‐matching homologous subsequences for each k‐mer. In this version of the method, ESM2 residue embeddings were computed for the full‐length query sequence and homolog sequences, using the esm2_t33_650M_UR50D model. The start and stop tokens were then removed leaving a tensor of dimension N × 1280, where N is the number of residues in the sequence. Thus, each amino acid in the sequence had an associated 1280‐dimension feature vector. When the query IDR was split into k‐mer^q,n segments, the corresponding residue embeddings were also sliced out of the query embedding tensor, i.e., each k‐mer^q,n was associated with a tensor slice, T_ij^{q,n} ∈ ℝ^{k × 1280}. For each homolog IDR, k‐mer^h,m segments and their associated embedding slices were also generated, T_ij^{h,m} ∈ ℝ^{k × 1280}. For each k‐mer^q,n and for each homolog IDR, the best matching k‐mer^h,m was determined by calculating the Euclidean distance between T_ij^{q,n} – T_ij^{h,m} pairs for each m. The pseudo MSA for each k‐mer^q,n was constructed by collecting the k‐mer^h,m from each homolog IDR with the lowest Euclidean distance to that k‐mer^q,n.
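The embedding‐based selection can be sketched in plain Python using nested lists in place of tensors. The function names are hypothetical, and the distance is assumed to be the Euclidean distance taken over all elements of the two k × d slices; the text does not specify whether distances were instead computed per residue and then combined.

```python
import math


def slice_distance(slice_a, slice_b):
    """Euclidean distance over all elements of two k x d embedding slices."""
    return math.sqrt(
        sum(
            (x - y) ** 2
            for row_a, row_b in zip(slice_a, slice_b)
            for x, y in zip(row_a, row_b)
        )
    )


def best_kmer_by_embedding(query_slice, homolog_embedding):
    """Start position and distance of the closest k-residue window.

    `query_slice` is a k x d list of lists; `homolog_embedding` is an
    M x d list of lists covering the homolog IDR (hypothetical layout).
    """
    k = len(query_slice)
    best_pos, best_dist = None, None
    for m in range(len(homolog_embedding) - k + 1):
        d = slice_distance(query_slice, homolog_embedding[m:m + k])
        # Strict '<' keeps the most N-terminal window when distances tie.
        if best_dist is None or d < best_dist:
            best_pos, best_dist = m, d
    return best_pos, best_dist
```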

We also tried using embeddings from DR‐BERT (Nambiar et al., 2023) for the embedding version of the PairK alignment step (Figure S1). Due to the maximum sequence length of the model, sequences longer than 1022 residues were split into chunks and processed individually, after which they were merged to yield the final residue embeddings. To do this, the sequence was split into 1010 residue chunks that overlapped by 50 residues. Sequence chunks were padded with five glycines to prevent the model from treating residues at the N and C termini of internal sequence chunks as the start/end of the protein. The first chunk in a sequence was only padded with glycines at the C‐terminus and the last sequence chunk was only padded at the N‐terminus. DR‐BERT residue embeddings were then generated for each sequence chunk. The embeddings corresponding to the start and stop tokens as well as the added glycines were removed from each chunk. To merge the embeddings of the sequence chunks into one tensor for the full sequence, the embeddings in the 50 residue overlapping regions were averaged together and the final tensor was assembled by concatenation. The PairK alignment step with DR‐BERT embeddings was performed in the same manner as was done with ESM2 embeddings.
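One way to implement the chunking and merging described above is sketched below. The function names are hypothetical, the way the final (possibly shorter) chunk is handled is an assumption, and the embedding model itself is not called; the per‐chunk embeddings are taken as inputs after tokens and glycine padding have been removed.

```python
def chunk_spans(seq_len, chunk_size=1010, overlap=50, max_len=1022):
    """(start, end) spans, end exclusive, of overlapping chunks."""
    if seq_len <= max_len:
        return [(0, seq_len)]  # short enough to embed in one pass
    step = chunk_size - overlap
    spans = []
    start = 0
    while True:
        end = min(start + chunk_size, seq_len)
        spans.append((start, end))
        if end == seq_len:
            break
        start += step
    return spans


def padded_chunk(seq, span, pad="GGGGG"):
    """Pad internal chunk boundaries with glycines, but not the true termini."""
    start, end = span
    left = pad if start > 0 else ""
    right = pad if end < len(seq) else ""
    return left + seq[start:end] + right


def merge_chunks(chunk_embeddings, spans, seq_len):
    """Average per-residue vectors wherever chunks overlap."""
    dim = len(chunk_embeddings[0][0])
    sums = [[0.0] * dim for _ in range(seq_len)]
    counts = [0] * seq_len
    for embedding, (start, _end) in zip(chunk_embeddings, spans):
        for offset, vector in enumerate(embedding):
            counts[start + offset] += 1
            for j, value in enumerate(vector):
                sums[start + offset][j] += value
    return [[v / counts[i] for v in sums[i]] for i in range(seq_len)]
```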

After constructing pseudo MSAs for all query IDR k‐mers (by either version of PairK) we calculated conservation scores for every column in every pseudo MSA (see “column‐wise conservation scores,” section 4.4). The conservation scores were then converted to z‐scores, using all columns in all of the pseudo MSAs as the background distribution. The reported score for the motif was the average z‐score of the defined motif residues (see Table S1 for motif position masks used). PairK is available as a python tool (https://github.com/jacksonh1/pairk).

4.9. SLiM conservation benchmark

The code used to preprocess the data sources and generate the benchmark is available at https://github.com/jacksonh1/slim_conservation_benchmark.

4.10. Preprocessing of ELM data

We downloaded the following data tables from the ELM (downloaded on February 9, 2024) (Kumar et al., 2024): the SLiM classes table that contains the regular expression for each motif class, and the SLiM instances table, which contains the experimentally verified SLiM instances. The fasta file containing the full‐length sequences for each instance was also downloaded to obtain the amino‐acid sequence of the SLiM from the positions provided in the instance table. We then removed the small number of instances for which the SLiM sequence did not match the regular expression provided for the SLiM. The sequences used for the ELM annotations and the sequences used in OrthoDB can be slightly different, e.g., when the two databases used different sequence sources or different versions of the same sequence. Therefore, we had to map the ELM instance annotations to the corresponding proteins in OrthoDB. To do so, we mapped the UniProt ID for each instance in the ELM to its corresponding protein in OrthoDB using the homolog group pipeline tools described above and removed the small fraction of instances that could not be mapped. Finally, we found the SLiM positions within the OrthoDB sequence by searching for the ELM SLiM sequence within the full‐length OrthoDB sequence. To ensure that only one copy of the SLiM was found in the OrthoDB sequence, we included 15 residues flanking each side of the ELM SLiM sequence in the search. No sequences had multiple copies of the sequence used in the search. The small number of entries that did not contain a perfect match to the search sequence were discarded.
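The final mapping step, locating a SLiM in the OrthoDB sequence using flanking residues, can be sketched as follows. The function name is hypothetical, and positions are 0‐based with an exclusive end, which differs from the 1‐based positions typically given in annotation tables.

```python
def locate_slim(elm_seq, slim_start, slim_end, orthodb_seq, flank=15):
    """Map a SLiM from the ELM sequence onto the OrthoDB sequence.

    Returns (start, end) in `orthodb_seq`, or None when the flanked
    search sequence is absent or occurs more than once.
    """
    # Extend the SLiM by up to `flank` residues on each side.
    left = max(0, slim_start - flank)
    right = min(len(elm_seq), slim_end + flank)
    search = elm_seq[left:right]

    first = orthodb_seq.find(search)
    if first == -1:
        return None  # no perfect match: entry is discarded
    if orthodb_seq.find(search, first + 1) != -1:
        return None  # ambiguous: more than one copy found

    start = first + (slim_start - left)
    return start, start + (slim_end - slim_start)
```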

4.11. Benchmark – true positive set

We chose the following SLiMs from the ELM to use in the benchmark: DOC_WW_Pin1_4, LIG_AP2alpha_2, LIG_EH_1, LIG_SH2_GRB2like, LIG_14‐3‐3_CanoR_1, and LIG_SH3_CIN85_PxpxPR_1, based on the number of verified instances that were available. To generate the true positive set, we first filtered the preprocessed ELM data (described above) to include only SLiM instances annotated as true positives (most of the instances in the table). Any instances not annotated as Homo sapiens, Mus musculus, or Rattus norvegicus were then removed. Because we used only human sequences to search for background motif matches, we only included these closely related organisms in the true positive (TP) set. For each protein, we determined the disordered regions as described above. Any instances not in an IDR were removed. We added Ena/VASP EVH1 domain and the TRAF6 MATH domain to the benchmark but manually curated the verified SLiM instances for these domains from the literature (Tables S2 and S3) because we noticed either missing or incorrect annotations in the ELM for those SLiMs. The instances for Ena/VASP and TRAF6 were filtered in the same manner as the ELM instances. TRAF6 was removed from the overall analysis of the benchmark, due to the small number of true positives, but we calculated conservation scores for the motif to include in Table S4.

4.12. Benchmark – Background set

To generate the background matches (BG set), we retrieved all human sequences in OrthoDB. Although OrthoDB purports to remove duplicate and isoform sequences in each organism, we found some of these. Therefore, we clustered the set of all human sequences in OrthoDB to 95% identity using CD‐HIT (using the arguments -c 0.95 and -g 1) and kept only the representative sequence from each cluster. Clusters containing a protein with a true positive SLiM were removed. We used the remaining set of human sequences to search for background matches (~19,000 sequences total). To find the background motif matches, we first defined the disordered regions in the background sequences as described above. For each SLiM in the TP set, the motif regular expression was used to search the background sequences. Sequences matching the regular expression and in a disordered region were added to the background set for that SLiM. Next, the full set of ELM instances (preprocessed) was used to remove any motif match in the BG set that overlapped with any verified SLiM instance. In cases where the BG set was very large for a given motif, the BG set was randomly subsampled to reduce computation time. Finally, sequences longer than 5000 residues were removed from all sets. Final counts of the TP/BG sets for each motif are provided in Table S1.
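The regular expression search within disordered regions can be sketched as below. The function name is hypothetical, and two choices are assumptions rather than details stated in the text: overlapping matches are reported (via a lookahead), and a match must lie entirely within one IDR to count.

```python
import re


def motif_matches_in_idrs(sequence, motif_regex, idrs):
    """Find motif matches that lie fully inside a disordered region.

    `idrs` is a list of (start, end) pairs, 0-based with `end` inclusive.
    Returns (start, end, matched_sequence) tuples with `end` exclusive.
    """
    # A capturing group inside a lookahead reports overlapping matches.
    pattern = re.compile(f"(?=({motif_regex}))")
    hits = []
    for match in pattern.finditer(sequence):
        start, end = match.start(1), match.end(1)
        if any(s <= start and end - 1 <= e for s, e in idrs):
            hits.append((start, end, match.group(1)))
    return hits


# The TRAF6 motif from the text, with "x" (any residue) written as ".".
TRAF6_REGEX = "xxxPxExx[FYWHDE]".replace("x", ".")
```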

4.13. Benchmark scoring

To evaluate the performance of the MSA and PairK methods on the benchmark, conservation scores were calculated for all motif matches in the benchmark for homolog groups collected at the Tetrapoda, Vertebrata, and Metazoa phylogenetic levels (constructed using the pipeline described in section 4.2). For MSA conservation scores, the motif match (either BG or TP) was discarded if there were fewer than 20 scored columns in the background score distribution used to calculate the z‐scores, e.g., if too many columns exceeded the gap fraction cutoff value of 0.2 or for very short IDRs with fewer than 20 columns. For PairK, the motif match was discarded if there were fewer than 50 points in the background score distribution used to calculate the z‐scores or fewer than 10 k‐mers in the query IDR. For both MSA and PairK scores, the motif match was discarded if the standard deviation of the background distribution was less than 0.05. The final reported conservation score for the motif for both methods is the average z‐score of the defined motif residues (positions that are not ‘x’ in the motif, see Table S1). To be included in the final benchmark, we required each motif match to have a valid score for all methods at all of the phylogenetic levels. For example, if a motif match failed the MSA score method at the Metazoa level due to having too few background scores for a z‐score calculation, it was removed entirely from the benchmark even if it passed at the other levels. The full benchmark table with conservation scores is provided in Table S4.

4.14. Benchmark bootstrapping analysis

To test the robustness of the auPRC scores from the benchmark, we performed a bootstrapping analysis. We generated 1000 bootstrap replicates of the benchmark scores by random sampling with replacement. In each replicate, the number of scores for each SLiM and the number of true positives and background matches for each SLiM were kept the same as for the original benchmark scores. We calculated auPRC scores for each bootstrap replicate for the entire benchmark (Figure 4) and for each SLiM (Figure 3). We then calculated 95% confidence intervals from the bootstrap replicate auPRCs using the percentile function from the numpy python package (Harris et al., 2020).
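A stratified bootstrap of this kind can be sketched for a single SLiM as follows. The sketch is illustrative and makes several assumptions: average precision is used as a stand‐in for auPRC, a small pure‐Python percentile function replaces the numpy call, and all function names are hypothetical.

```python
import random


def average_precision(labels, scores):
    """Average precision, used here as a stand-in for auPRC (assumption)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision at each true positive
    return total / sum(labels)


def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def bootstrap_ci(tp_scores, bg_scores, n_replicates=1000, seed=0):
    """95% confidence interval from stratified bootstrap replicates."""
    rng = random.Random(seed)
    labels = [1] * len(tp_scores) + [0] * len(bg_scores)
    stats = []
    for _ in range(n_replicates):
        # Resample TP and BG separately so that their counts stay fixed.
        tp = rng.choices(tp_scores, k=len(tp_scores))
        bg = rng.choices(bg_scores, k=len(bg_scores))
        stats.append(average_precision(labels, tp + bg))
    return percentile(stats, 2.5), percentile(stats, 97.5)
```

Resampling the true positives and background matches separately is what keeps the class counts identical to those of the original benchmark in every replicate.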

AUTHOR CONTRIBUTIONS

Jackson C. Halpin: Writing – review and editing; writing – original draft; funding acquisition; conceptualization; investigation; methodology; validation; visualization; software; formal analysis; data curation. Amy E. Keating: Conceptualization; funding acquisition; project administration; resources; supervision; writing – original draft; writing – review and editing.

Supporting information

Figure S1. The performance of the embedding‐based variant of PairK on the benchmark when using residue embeddings from the ESM2 (1) (yellow) or DR‐BERT (2) (blue) LLMs. Homolog sequences were gathered at the Metazoa level. auPRC is the area under the precision‐recall curve. Error bars are 95% confidence intervals from the bootstrap analysis.

Figure S2. Effect of flanking sequence, scoring matrix, and conservation score method on PairK performance. (a) The PairK method. Stars indicate where the method is changed in parts B–D. The performance of the method is shown when adding residues flanking the potential motif for the alignment step (i.e. increasing k) (b), changing the alignment scoring matrix (c), and using different column‐wise conservation scoring methods (d). Columnwise scoring methods used: property entropy and Shannon entropy (3). Scoring matrices: Blosum62 (4), EDSSmat50 (5), and the Grantham matrix converted to a similarity matrix (6). Homolog sequences were gathered at the Vertebrata level. auPRC is the area under the precision‐recall curve.

Figure S3. The correlation between conservation scores of residues in experimentally verified SLiMs from MSAs produced by different alignment algorithms. The underlying homolog sequences were collected at the Tetrapoda (a), Vertebrata (b), or Metazoa (c) level. Data are from 240 verified SLiM instances (731 residues).

Figure S4. Benchmark conservation score distributions (density) from homolog groups retrieved at the Tetrapoda, Vertebrata, and Metazoa levels. True positives are shown in orange and background matches are shown in blue.

Figure S5. Performance of MSA (gray) and PairK (red) conservation scoring methods for each SLiM in the benchmark, performed on homologs generated at different phylogenetic levels. Error bars are 95% confidence intervals from the bootstrap analysis. auPRC values were used to calculate the “increase in auPRC” in Figure 4c.

Table S1. Counts of SLiM instances in the benchmark for each motif.

Table S2. Manually curated verified Ena/VASP EVH1 binding partners (7–22). Interaction data are at the protein level so, for each protein, any FPPPP and LPPPP sequence within an IDR was considered a true positive.

Table S3. Manually curated verified TRAF6 MATH domain interactions (23–29).

PRO-34-e70004-s002.pdf (1.4 MB, PDF)

Table S4. Benchmark conservation scores.

PRO-34-e70004-s001.xlsx (528.1 KB, XLSX)

ACKNOWLEDGMENTS

This work was supported by the National Institutes of Health under award numbers R35GM149227 and F32GM137510. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We thank Foster Birnbaum for help accelerating the embedding distance functions.

Halpin JC, Keating AE. PairK: Pairwise k‐mer alignment for quantifying protein motif conservation in disordered regions. Protein Science. 2025;34(1):e70004. doi: 10.1002/pro.70004

Review Editor: Nir Ben‐Tal

REFERENCES

Acevedo LA, Greenwood AI, Nicholson LK. A noncanonical binding site in the EVH1 domain of vasodilator‐stimulated phosphoprotein regulates its interactions with the Proline rich region of Zyxin. Biochemistry. 2017;56:4626–4636.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25:25–29.
Ball LJ, Kühne R, Hoffmann B, Häfner A, Schmieder P, Volkmer‐Engert R, et al. Dual epitope recognition by the VASP EVH1 domain modulates polyproline ligand specificity and binding affinity. EMBO J. 2000;19:4903–4914.
Bashaw GJ, Kidd T, Murray D, Pawson T, Goodman CS. Repulsive axon guidance: Abelson and enabled play opposing roles downstream of the roundabout receptor. Cell. 2000;101:703–715.
Benz C, Ali M, Krystkowiak I, Simonetti L, Sayadi A, Mihalic F, et al. Proteome‐scale mapping of binding sites in the unstructured regions of the human proteome. Mol Syst Biol. 2022;18:e10584.
Boëda B, Briggs DC, Higgins T, Garvalov BK, Fadden AJ, McDonald NQ, et al. Tes, a specific Mena interacting partner, breaks the rules for EVH1 binding. Mol Cell. 2007;28:1071–1082.
Brauer BL, Moon TM, Sheftic SR, Nasa I, Page R, Peti W, et al. Leveraging new definitions of the LxVP SLiM to discover novel Calcineurin regulators and substrates. ACS Chem Biol. 2019;14:2672–2682.
Bugge K, Brakti I, Fernandes CB, Dreier JE, Lundsgaard JE, Olsen JG, et al. Interactions by disorder – a matter of context. Front Mol Biosci. 2020;7:110.
Capra JA, Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics. 2007;23:1875–1882.
Carl UD, Pollmann M, Orr E, Gertler FB, Chakraborty T, Wehland J. Aromatic and basic residues within the EVH1 domain of VASP specify its interaction with proline‐rich ligands. Curr Biol. 1999;9:715–718.
Chen XJ, Squarr AJ, Stephan R, Chen B, Higgins TE, Barry DJ, et al. Ena/VASP proteins cooperate with the WAVE complex to regulate the actin cytoskeleton. Dev Cell. 2014;30:569–584.
Chereau D, Kerff F, Graceffa P, Grabarek Z, Langsetmo K, Dominguez R. Actin‐bound structures of Wiskott‐Aldrich syndrome protein (WASP)‐homology domain 2 and the implications for filament assembly. Proc Natl Acad Sci U S A. 2005;102:16644–16649.
Chica C, Labarga A, Gould CM, López R, Gibson TJ. A tree‐based conservation scoring method for short linear motifs in multiple alignments of protein sequences. BMC Bioinformatics. 2008;9:229.
Choi LJ, Rashid NA. Adapting normalized google similarity in protein sequence comparison. 2008 International Symposium on Information Technology. 2008;1:1–5.
Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423.
Davey NE, Cowan JL, Shields DC, Gibson TJ, Coldwell MJ, Edwards RJ. SLiMPrints: conservation‐based discovery of functional motif fingerprints in intrinsically disordered protein regions. Nucleic Acids Res. 2012;40:10628–10641.
Davey NE, Cyert MS, Moses AM. Short linear motifs – ex nihilo evolution of protein regulation. Cell Commun Signal. 2015;13:43.
Davey NE, Shields DC, Edwards RJ. Masking residues using context‐specific evolutionary conservation significantly improves short linear motif discovery. Bioinformatics. 2009;25:443–450.
Edgar RC. Muscle5: high‐accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny. Nat Commun. 2022;13:6968.
Edwards RJ, Davey NE, Shields DC. SLiMFinder: a probabilistic method for identifying over‐represented, convergently evolved, short linear motifs in proteins. PLoS One. 2007;2:e967.
Edwards RJ, Paulsen K, Aguilar Gomez CM, Pérez‐Bercoff Å. Computational prediction of disordered protein motifs using SLiMSuite. Methods Mol Biol. 2020;2141:37–72.
Erdős G, Dosztányi Z. Analyzing protein disorder with IUPred2A. Curr Protoc Bioinformatics. 2020;70:e99.
Erdős G, Pajkos M, Dosztányi Z. IUPred3: prediction of protein disorder enhanced with unambiguous experimental annotation and visualization of evolutionary conservation. Nucleic Acids Res. 2021;49:W297–W303.
Fu L, Niu B, Zhu Z, Wu S, Li W. CD‐HIT: accelerated for clustering the next‐generation sequencing data. Bioinformatics. 2012;28:3150–3152.
Grantham R. Amino acid difference formula to help explain protein evolution. Science. 1974;185:862–864.
Halpin JC, Whitney D, Rigoldi F, Sivaraman V, Singer A, Keating AE. Molecular determinants of TRAF6 binding specificity suggest that native interaction partners are not optimized for affinity. Protein Sci. 2022;31:e4429.
Harker AJ, Katkar HH, Bidone TC, Aydin F, Voth GA, Applewhite DA, et al. Ena/VASP processive elongation is modulated by avidity on actin filaments bundled by the filopodia cross‐linker fascin. Mol Biol Cell. 2019;30:851–862.
Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585:357–362.
Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992;89:10915–10919.
Huang W‐C, Liao J‐H, Hsiao T‐C, Wei T‐YW, Maestre‐Reyna M, Bessho Y, et al. Binding and enhanced binding between key immunity proteins TRAF6 and TIFA. Chembiochem. 2019;20:140–146.
Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9:90–95.
Hwang T, Parker SS, Hill SM, Grant RA, Ilunga MW, Sivaraman V, et al. Native proline‐rich motifs exploit sequence context to target actin‐remodeling Ena/VASP protein ENAH. Elife. 2022;11:e70680.
Hwang T, Parker SS, Hill SM, Ilunga MW, Grant RA, Mouneimne G, et al. A distributed residue network permits conformational binding specificity in a conserved family of actin remodelers. Elife. 2021;10:e70601.
Ivarsson Y, Arnold R, McLaughlin M, Nim S, Joshi R, Ray D, et al. Large‐scale interaction profiling of PDZ domains through proteomic peptide‐phage display using human and viral phage peptidomes. Proc Natl Acad Sci U S A. 2014;111:2542–2547.
Jiang Z, Mak TW, Sen G, Li X. Toll‐like receptor 3‐mediated activation of NF‐kappaB and IRF3 diverges at toll‐IL‐1 receptor domain‐containing adapter inducing IFN‐beta. Proc Natl Acad Sci U S A. 2004;101:3533–3538.
Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30:772–780.
Khan T, Douglas GM, Patel P, Nguyen Ba AN, Moses AM. Polymorphism analysis reveals reduced negative selection and elevated rate of insertions and deletions in intrinsically disordered protein regions. Genome Biol Evol. 2015;7:1815–1826.
Klippel S, Wieczorek M, Schümann M, Krause E, Marg B, Seidel T, et al. Multivalent binding of formin‐binding protein 21 (FBP21)‐tandem‐WW domains fosters protein recognition in the pre‐spliceosome. J Biol Chem. 2011;286:38478–38487.
Krystkowiak I, Davey NE. SLiMSearch: a framework for proteome‐wide discovery and annotation of functional modules in intrinsically disordered regions. Nucleic Acids Res. 2017;45:W464–W469.
Kumar M, Michael S, Alvarado‐Valverde J, Mészáros B, Sámano‐Sánchez H, Zeke A, et al. The eukaryotic linear motif resource: 2022 release. Nucleic Acids Res. 2022;50:D497–D508.
Kumar M, Michael S, Alvarado‐Valverde J, Zeke A, Lazar T, Glavina J, et al. ELM‐the eukaryotic linear motif resource‐2024 update. Nucleic Acids Res. 2024;52:D442–D455.
Kuznetsov D, Tegenfeldt F, Manni M, Seppey M, Berkeley M, Kriventseva EV, et al. OrthoDB v11: annotation of orthologs in the widest sampling of organismal diversity. Nucleic Acids Res. 2023;51:D445–D451.
Lafuente EM, van Puijenbroek AAFL, Krause M, Carman CV, Freeman GJ, Berezovskaya A, et al. RIAM, an Ena/VASP and profilin ligand, interacts with Rap1‐GTP and mediates Rap1‐induced adhesion. Dev Cell. 2004;7:585–595.
Lai ACW, Nguyen Ba AN, Moses AM. Predicting kinase substrates using conservation of local motif density. Bioinformatics. 2012;28:962–969.
Lange J, Wyrwicz LS, Vriend G. KMAD: knowledge‐based multiple sequence alignment for intrinsically disordered proteins. Bioinformatics. 2016;32:932–936.
Li W, Godzik A. Cd‐hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–1659.
Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary‐scale prediction of atomic‐level protein structure with a language model. Science. 2023;379:1123–1130.
Marquet C, Heinzinger M, Olenyi T, Dallago C, Erckert K, Bernhofer M, et al. Embeddings from protein language models predict conservation and variant effects. Hum Genet. 2022;141:1629–1647.
Mészáros B, Erdős G, Dosztányi Z. IUPred2A: context‐dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res. 2018;46:W329–W337.
Nambiar A, Forsyth JM, Liu S, Maslov S. DR‐BERT: a protein language model to annotate disordered regions. bioRxiv, 2023.02.22.529574. 2023.
Nguyen Ba AN, Yeh BJ, van Dyk D, Davidson AR, Andrews BJ, Weiss EL, et al. Proteome‐wide discovery of evolutionary conserved sequences in disordered regions. Sci Signal. 2012;5:rs1.
Pullen SS, Dang TT, Crute JJ, Kehry MR. CD40 signaling through tumor necrosis factor receptor‐associated factors (TRAFs). Binding site specificity and activation of downstream pathways by distinct TRAFs. J Biol Chem. 1999;274:14246–14254.
Reys V, Pons J‐L, Labesse G. SLiMAn 2.0: meaningful navigation through peptide‐protein interaction networks. Nucleic Acids Res. 2024;52:W313–W317.
Saito T, Rehmsmeier M. The precision‐recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10:e0118432.
Sato S, Sugiyama M, Yamamoto M, Watanabe Y, Kawai T, Takeda K, et al. Toll/IL‐1 receptor domain‐containing adaptor inducing IFN‐β (TRIF) associates with TNF receptor‐associated factor 6 and TANK‐binding kinase 1, and activates two distinct transcription factors, NF‐κB and IFN‐regulatory Factor‐3, in the toll‐like receptor signaling. J Immunol Res. 2003;171:4304–4310.
Shi Z, Zhang Z, Zhang Z, Wang Y, Li C, Wang X, et al. Structural insights into mitochondrial antiviral signaling protein (MAVS)‐tumor necrosis factor receptor‐associated factor 6 (TRAF6) signaling. J Biol Chem. 2015;290:26811–26820.
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal omega. Mol Syst Biol. 2011;7:539.
Singer A, Ramos A, Keating AE. Elaboration of the Homer1 recognition landscape reveals incomplete divergence of paralogous EVH1 domains. Protein Sci. 2024;33:e5094.
Stevers LM, de Vink PJ, Ottmann C, Huskens J, Brunsveld L. A thermodynamic model for Multivalency in 14‐3‐3 protein‐protein interactions. J Am Chem Soc. 2018;140:14498–14510.
Suetsugu S, Miki H, Takenawa T. Identification of two human WAVE/SCAR homologues as general actin regulatory molecules which associate with the Arp2/3 complex. Biochem Biophys Res Commun. 1999;260:296–302.
Tareen A, Kinney JB. Logomaker: beautiful sequence logos in python. Bioinformatics. 2020;36:2272–2274.
The Gene Ontology Consortium. The gene ontology resource: enriching a GOld mine. Nucleic Acids Res. 2021;49:D325–D334.
Trivedi R, Nagarajaram HA. Amino acid substitution scoring matrices specific to intrinsically disordered regions in proteins. Sci Rep. 2019;9:16380.
Tsukamoto N, Kobayashi N, Azuma S, Yamamoto T, Inoue J. Two differently regulated nuclear factor kappaB activation pathways triggered by the cytoplasmic tail of CD40. Proc Natl Acad Sci U S A. 1999;96:1234–1239.
Varadi M, Guharoy M, Zsolyomi F, Tompa P. DisCons: a novel tool to quantify and classify evolutionary conservation of intrinsic protein disorder. BMC Bioinformatics. 2015;16:153.
Waskom ML. seaborn: statistical data visualization. J Open Source Softw. 2021;6:3021.
Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ. Jalview version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009;25:1189–1191.
Williamson RM. Information theory analysis of the relationship between primary sequence structure and ligand recognition among a class of facilitated transporters. J Theor Biol. 1995;174:179–188.
Ye H, Arron JR, Lamothe B, Cirilli M, Kobayashi T, Shevde NK, et al. Distinct molecular mechanism for initiating TRAF6 signalling. Nature. 2002;418:443–447.
Yeung W, Zhou Z, Li S, Kannan N. Alignment‐free estimation of sequence conservation for identifying functional sites using protein sequence embeddings. Brief Bioinform. 2023;24:bbac599.
Zarin T, Strome B, Nguyen Ba AN, Alberti S, Forman‐Kay JD, Moses AM. Proteome‐wide signatures of function in highly diverged intrinsically disordered regions. Elife. 2019;8:e46883.
Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment‐free sequence comparison: benefits, applications, and tools. Genome Biol. 2017;18:186.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Table S1. Counts of SLiM instances in the benchmark for each motif.

Table S3. Manually curated verified TRAF6 MATH domain interactions (23–29).

PRO-34-e70004-s002.pdf^{(1.4MB, pdf)}

Table S4. Benchmark conservation scores.

PRO-34-e70004-s001.xlsx^{(528.1KB, xlsx)}

[pro70004-bib-0001] Acevedo LA, Greenwood AI, Nicholson LK. A noncanonical binding site in the EVH1 domain of vasodilator‐stimulated phosphoprotein regulates its interactions with the Proline rich region of Zyxin. Biochemistry. 2017;56:4626–4636. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0002] Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25:25–29. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0003] Ball LJ, Kühne R, Hoffmann B, Häfner A, Schmieder P, Volkmer‐Engert R, et al. Dual epitope recognition by the VASP EVH1 domain modulates polyproline ligand specificity and binding affinity. EMBO J. 2000;19:4903–4914. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0004] Bashaw GJ, Kidd T, Murray D, Pawson T, Goodman CS. Repulsive axon guidance: Abelson and enabled play opposing roles downstream of the roundabout receptor. Cell. 2000;101:703–715. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0005] Benz C, Ali M, Krystkowiak I, Simonetti L, Sayadi A, Mihalic F, et al. Proteome‐scale mapping of binding sites in the unstructured regions of the human proteome. Mol Syst Biol. 2022;18:e10584. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0006] Boëda B, Briggs DC, Higgins T, Garvalov BK, Fadden AJ, McDonald NQ, et al. Tes, a specific Mena interacting partner, breaks the rules for EVH1 binding. Mol Cell. 2007;28:1071–1082. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0007] Brauer BL, Moon TM, Sheftic SR, Nasa I, Page R, Peti W, et al. Leveraging new definitions of the LxVP SLiM to discover novel Calcineurin regulators and substrates. ACS Chem Biol. 2019;14:2672–2682. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0008] Bugge K, Brakti I, Fernandes CB, Dreier JE, Lundsgaard JE, Olsen JG, et al. Interactions by disorder – a matter of context. Front Mol Biosci. 2020;7:110. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0009] Capra JA, Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics. 2007;23:1875–1882. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0010] Carl UD, Pollmann M, Orr E, Gertlere FB, Chakraborty T, Wehland J. Aromatic and basic residues within the EVH1 domain of VASP specify its interaction with proline‐rich ligands. Curr Biol. 1999;9:715–718. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0011] Chen XJ, Squarr AJ, Stephan R, Chen B, Higgins TE, Barry DJ, et al. Ena/VASP proteins cooperate with the WAVE complex to regulate the actin cytoskeleton. Dev Cell. 2014;30:569–584. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0012] Chereau D, Kerff F, Graceffa P, Grabarek Z, Langsetmo K, Dominguez R. Actin‐bound structures of Wiskott‐Aldrich syndrome protein (WASP)‐homology domain 2 and the implications for filament assembly. Proc Natl Acad Sci U S A. 2005;102:16644–16649. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0013] Chica C, Labarga A, Gould CM, López R, Gibson TJ. A tree‐based conservation scoring method for short linear motifs in multiple alignments of protein sequences. BMC Bioinformatics. 2008;9:229. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0014] Choi LJ, Rashid NA. Adapting normalized google similarity in protein sequence comparison. 2008 International Symposium on Information Technology. 2008;1:1–5. [Google Scholar]

[pro70004-bib-0015] Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0016] Davey NE, Cowan JL, Shields DC, Gibson TJ, Coldwell MJ, Edwards RJ. SLiMPrints: conservation‐based discovery of functional motif fingerprints in intrinsically disordered protein regions. Nucleic Acids Res. 2012;40:10628–10641. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0017] Davey NE, Cyert MS, Moses AM. Short linear motifs – ex nihilo evolution of protein regulation. Cell Commun Signal. 2015;13:43. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0018] Davey NE, Shields DC, Edwards RJ. Masking residues using context‐specific evolutionary conservation significantly improves short linear motif discovery. Bioinformatics. 2009;25:443–450. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0019] Edgar RC. Muscle5: high‐accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny. Nat Commun. 2022;13:6968. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0020] Edwards RJ, Davey NE, Shields DC. SLiMFinder: a probabilistic method for identifying over‐represented, convergently evolved, short linear motifs in proteins. PLoS One. 2007;2:e967. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0021] Edwards RJ, Paulsen K, Aguilar Gomez CM, Pérez‐Bercoff Å. Computational prediction of disordered protein motifs using SLiMSuite. Methods Mol Biol. 2020;2141:37–72. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0022] Erdős G, Dosztányi Z. Analyzing protein disorder with IUPred2A. Curr Protoc Bioinformatics. 2020;70:e99. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0023] Erdős G, Pajkos M, Dosztányi Z. IUPred3: prediction of protein disorder enhanced with unambiguous experimental annotation and visualization of evolutionary conservation. Nucleic Acids Res. 2021;49:W297–W303. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0024] Fu L, Niu B, Zhu Z, Wu S, Li W. CD‐HIT: accelerated for clustering the next‐generation sequencing data. Bioinformatics. 2012;28:3150–3152. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0025] Grantham R. Amino acid difference formula to help explain protein evolution. Science. 1974;185:862–864. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0026] Halpin JC, Whitney D, Rigoldi F, Sivaraman V, Singer A, Keating AE. Molecular determinants of TRAF6 binding specificity suggest that native interaction partners are not optimized for affinity. Protein Sci. 2022;31:e4429. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0027] Harker AJ, Katkar HH, Bidone TC, Aydin F, Voth GA, Applewhite DA, et al. Ena/VASP processive elongation is modulated by avidity on actin filaments bundled by the filopodia cross‐linker fascin. Mol Biol Cell. 2019;30:851–862. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0028] Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585:357–362. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0029] Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992;89:10915–10919. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0030] Huang W‐C, Liao J‐H, Hsiao T‐C, Wei T‐YW, Maestre‐Reyna M, Bessho Y, et al. Binding and enhanced binding between key immunity proteins TRAF6 and TIFA. Chembiochem. 2019;20:140–146. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0031] Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9:90–95. [Google Scholar]

[pro70004-bib-0032] Hwang T, Parker SS, Hill SM, Grant RA, Ilunga MW, Sivaraman V, et al. Native proline‐rich motifs exploit sequence context to target actin‐remodeling Ena/VASP protein ENAH. Elife. 2022;11:e70680. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0033] Hwang T, Parker SS, Hill SM, Ilunga MW, Grant RA, Mouneimne G, et al. A distributed residue network permits conformational binding specificity in a conserved family of actin remodelers. Elife. 2021;10:e70601. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0034] Ivarsson Y, Arnold R, McLaughlin M, Nim S, Joshi R, Ray D, et al. Large‐scale interaction profiling of PDZ domains through proteomic peptide‐phage display using human and viral phage peptidomes. Proc Natl Acad Sci U S A. 2014;111:2542–2547. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0035] Jiang Z, Mak TW, Sen G, Li X. Toll‐like receptor 3‐mediated activation of NF‐kappaB and IRF3 diverges at toll‐IL‐1 receptor domain‐containing adapter inducing IFN‐beta. Proc Natl Acad Sci U S A. 2004;101:3533–3538. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0036] Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30:772–780. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0037] Khan T, Douglas GM, Patel P, Nguyen Ba AN, Moses AM. Polymorphism analysis reveals reduced negative selection and elevated rate of insertions and deletions in intrinsically disordered protein regions. Genome Biol Evol. 2015;7:1815–1826. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0038] Klippel S, Wieczorek M, Schümann M, Krause E, Marg B, Seidel T, et al. Multivalent binding of formin‐binding protein 21 (FBP21)‐tandem‐WW domains fosters protein recognition in the pre‐spliceosome. J Biol Chem. 2011;286:38478–38487. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0039] Krystkowiak I, Davey NE. SLiMSearch: a framework for proteome‐wide discovery and annotation of functional modules in intrinsically disordered regions. Nucleic Acids Res. 2017;45:W464–W469. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0040] Kumar M, Michael S, Alvarado‐Valverde J, Mészáros B, Sámano‐Sánchez H, Zeke A, et al. The eukaryotic linear motif resource: 2022 release. Nucleic Acids Res. 2022;50:D497–D508. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0041] Kumar M, Michael S, Alvarado‐Valverde J, Zeke A, Lazar T, Glavina J, et al. ELM‐the eukaryotic linear motif resource‐2024 update. Nucleic Acids Res. 2024;52:D442–D455. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0042] Kuznetsov D, Tegenfeldt F, Manni M, Seppey M, Berkeley M, Kriventseva EV, et al. OrthoDB v11: annotation of orthologs in the widest sampling of organismal diversity. Nucleic Acids Res. 2023;51:D445–D451. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0043] Lafuente EM, van Puijenbroek AAFL, Krause M, Carman CV, Freeman GJ, Berezovskaya A, et al. RIAM, an Ena/VASP and profilin ligand, interacts with Rap1‐GTP and mediates Rap1‐induced adhesion. Dev Cell. 2004;7:585–595. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0044] Lai ACW, Nguyen Ba AN, Moses AM. Predicting kinase substrates using conservation of local motif density. Bioinformatics. 2012;28:962–969. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0045] Lange J, Wyrwicz LS, Vriend G. KMAD: knowledge‐based multiple sequence alignment for intrinsically disordered proteins. Bioinformatics. 2016;32:932–936. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0046] Li W, Godzik A. Cd‐hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22:1658–1659. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0047] Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary‐scale prediction of atomic‐level protein structure with a language model. Science. 2023;379:1123–1130. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0048] Marquet C, Heinzinger M, Olenyi T, Dallago C, Erckert K, Bernhofer M, et al. Embeddings from protein language models predict conservation and variant effects. Hum Genet. 2022;141:1629–1647. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0049] Mészáros B, Erdős G, Dosztányi Z. IUPred2A: context‐dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res. 2018;46:W329–W337. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0050] Nambiar A, Forsyth JM, Liu S, Maslov S. DR‐BERT: a protein language model to annotate disordered regions. bioRxiv, 2023.02.22.529574. 2023. [DOI] [PubMed]

[pro70004-bib-0051] Nguyen Ba AN, Yeh BJ, van Dyk D, Davidson AR, Andrews BJ, Weiss EL, et al. Proteome‐wide discovery of evolutionary conserved sequences in disordered regions. Sci Signal. 2012;5:rs1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0052] Pullen SS, Dang TT, Crute JJ, Kehry MR. CD40 signaling through tumor necrosis factor receptor‐associated factors (TRAFs). Binding site specificity and activation of downstream pathways by distinct TRAFs. J Biol Chem. 1999;274:14246–14254. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0053] Reys V, Pons J‐L, Labesse G. SLiMAn 2.0: meaningful navigation through peptide‐protein interaction networks. Nucleic Acids Res. 2024;52:W313–W317. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0054] Saito T, Rehmsmeier M. The precision‐recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10:e0118432. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0055] Sato S, Sugiyama M, Yamamoto M, Watanabe Y, Kawai T, Takeda K, et al. Toll/IL‐1 receptor domain‐containing adaptor inducing IFN‐β (TRIF) associates with TNF receptor‐associated factor 6 and TANK‐binding kinase 1, and activates two distinct transcription factors, NF‐κB and IFN‐regulatory Factor‐3, in the toll‐like receptor signaling. J Immunol Res. 2003;171:4304–4310. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0056] Shi Z, Zhang Z, Zhang Z, Wang Y, Li C, Wang X, et al. Structural insights into mitochondrial antiviral signaling protein (MAVS)‐tumor necrosis factor receptor‐associated factor 6 (TRAF6) signaling. J Biol Chem. 2015;290:26811–26820. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0057] Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal omega. Mol Syst Biol. 2011;7:539. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0058] Singer A, Ramos A, Keating AE. Elaboration of the Homer1 recognition landscape reveals incomplete divergence of paralogous EVH1 domains. Protein Sci. 2024;33:e5094. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0059] Stevers LM, de Vink PJ, Ottmann C, Huskens J, Brunsveld L. A thermodynamic model for Multivalency in 14‐3‐3 protein‐protein interactions. J Am Chem Soc. 2018;140:14498–14510. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0060] Suetsugu S, Miki H, Takenawa T. Identification of two human WAVE/SCAR homologues as general actin regulatory molecules which associate with the Arp2/3 complex. Biochem Biophys Res Commun. 1999;260:296–302. [DOI] [PubMed] [Google Scholar]

[pro70004-bib-0061] Tareen A, Kinney JB. Logomaker: beautiful sequence logos in python. Bioinformatics. 2020;36:2272–2274. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pro70004-bib-0062] The Gene Ontology Consortium. The gene ontology resource: enriching a GOld mine. Nucleic Acids Res. 2021;49:D325–D334.

[pro70004-bib-0063] Trivedi R, Nagarajaram HA. Amino acid substitution scoring matrices specific to intrinsically disordered regions in proteins. Sci Rep. 2019;9:16380.

[pro70004-bib-0064] Tsukamoto N, Kobayashi N, Azuma S, Yamamoto T, Inoue J. Two differently regulated nuclear factor kappaB activation pathways triggered by the cytoplasmic tail of CD40. Proc Natl Acad Sci U S A. 1999;96:1234–1239.

[pro70004-bib-0065] Varadi M, Guharoy M, Zsolyomi F, Tompa P. DisCons: a novel tool to quantify and classify evolutionary conservation of intrinsic protein disorder. BMC Bioinformatics. 2015;16:153.

[pro70004-bib-0066] Waskom ML. seaborn: statistical data visualization. J Open Source Softw. 2021;6:3021.

[pro70004-bib-0067] Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ. Jalview version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009;25:1189–1191.

[pro70004-bib-0068] Williamson RM. Information theory analysis of the relationship between primary sequence structure and ligand recognition among a class of facilitated transporters. J Theor Biol. 1995;174:179–188.

[pro70004-bib-0069] Ye H, Arron JR, Lamothe B, Cirilli M, Kobayashi T, Shevde NK, et al. Distinct molecular mechanism for initiating TRAF6 signalling. Nature. 2002;418:443–447.

[pro70004-bib-0070] Yeung W, Zhou Z, Li S, Kannan N. Alignment‐free estimation of sequence conservation for identifying functional sites using protein sequence embeddings. Brief Bioinform. 2023;24:bbac599.

[pro70004-bib-0071] Zarin T, Strome B, Nguyen Ba AN, Alberti S, Forman‐Kay JD, Moses AM. Proteome‐wide signatures of function in highly diverged intrinsically disordered regions. Elife. 2019;8:e46883.

[pro70004-bib-0072] Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment‐free sequence comparison: benefits, applications, and tools. Genome Biol. 2017;18:186.
