Abstract
Gene duplication is the primary source of new genes, resulting in most genes having identifiable paralogs. Over time, paralog pairs may diverge in some respects but many retain the ability to perform the same functional role. Protein sequence identity is often used as a proxy for functional similarity and can predict shared functions between paralogs as revealed by synthetic lethal experiments. However, the advent of alternative protein representations, including embeddings from protein language models (PLMs) and predicted structures from AlphaFold, raises the possibility that alternative similarity metrics could better capture functional similarity between paralogs. Here, using two species (budding yeast and human) and two different definitions of shared functionality (shared protein–protein interactions and synthetic lethality), we evaluated a variety of alternative similarity metrics. For some tasks, predicted structural similarity or PLM similarity outperforms sequence identity, but more importantly these similarity metrics are not redundant with sequence identity, i.e. combining them with sequence identity leads to improved predictions of shared functionality. By adding contextual features, representing similarity to homologous proteins within and across species, we can significantly enhance our predictions of shared paralog functionality. Overall, our results suggest that alternative similarity metrics capture complementary aspects of functional similarity beyond sequence identity alone.
Graphical Abstract
Introduction
Comparisons of protein sequences provide the foundation for much of modern biology, allowing us to infer evolutionary relationships between genes and predict functions for homologous proteins [1–6]. Such sequence comparisons are not just made across species, for inferring functions of orthologs, but also within species, for paralogs (genes that arose from the duplication of a common ancestral gene). Most genes, including ∼70% of human genes, have identifiable paralogs, and therefore, understanding the shared and divergent functions of paralog pairs is crucial to understanding gene function and the genotype–phenotype map [7].
Over evolutionary timescales, pairs of paralogs accumulate mutations, leading to divergence in their protein sequences and potentially in their function. However, even after diverging for hundreds of millions of years, many pairs of paralogs still perform similar functions in the cell and in some cases exhibit sufficient functional similarity that they can phenotypically compensate for each other’s loss [8–11]. Such compensatory relationships can be revealed via perturbation experiments—often it is possible to mutate each paralog individually with little fitness consequence but their mutation in combination results in cell death, a phenomenon known as ‘synthetic lethality’ (SL). These compensatory relationships between paralogs contribute to ‘genetic robustness’—the ability of cells and organisms to withstand genetic perturbation [12, 13]—and can potentially be exploited for the development of targeted therapies in cancer where deletion/mutation of one paralog may render tumour cells vulnerable to inhibition of another [8, 14–23].
Previous work has demonstrated that the amino acid sequence identity between pairs of paralogs is the feature most predictive of SL, suggesting that sequence similarity is a reasonable proxy for the functional similarity of paralogs [16, 21, 24]. Recent computational advances have led to the development of new protein sequence representations, such as AlphaFold-predicted protein structures [25, 26] and protein language model (PLM) embeddings [27, 28]. These representations, which are essentially derived from protein sequences, are now being widely used to predict functions from protein sequences [29–31]. In this work, we assess whether similarity metrics derived from these alternative protein representations can be used to predict whether two paralogs share functions. Paralogs are ideal for this study because their shared functions can be relatively easily assessed experimentally (e.g. through double perturbation experiments), making them perfect candidates for systematic evaluations of similarity metrics.
We model this as a binary classification problem: can similarity metrics derived from sequence, structure, and PLM embeddings effectively identify pairs of paralogs where both paralogs share functions (Fig. 1)? We use two definitions of ‘shared functions’ between paralog pairs: shared protein–protein interactions (PPIs), indicating that the two paralogs occupy similar spaces in the PPI network, and SL, indicating that the paralogs can phenotypically compensate for each other’s loss [32]. Although SL is a clear indicator of shared function between paralogs, pairs can share functionality without being SL. SL is primarily observed between paralogs that function in broadly essential pathways or processes [33] and so indicates not just that two paralogs can perform the same function but also that this function is important for cell survival. In contrast, if two paralogs can interact with the same set of proteins (shared PPIs) then that can be taken as an indication that they can perform at least some of the same intracellular functions [34] regardless of whether or not that function is necessary for cell survival. We therefore use both definitions. We further note that shared PPIs are highly predictive of SL [16, 33] and so the two measures are related. By focusing on these two specific definitions, we aim to better understand how various sequence similarity metrics correlate with functional conservation in paralogs, and to provide insights into the broader question of sequence–function relationships. To ensure generalizability of our findings, we perform our evaluation using paralogs from two species: humans and the budding yeast Saccharomyces cerevisiae, both of which have comprehensive PPI maps as well as systematically generated synthetic lethal interaction data available.
Figure 1.
Experimental overview: evaluating sequence and structural similarity metrics for predicting shared paralog functions. (A) A paralog pair A1–A2 is annotated based on their ability to perform shared functions. Two distinct labels are used: shared PPI (the two genes of the pair share a significant overlap of their PPI partners), and SL (simultaneous loss of both genes of the pair induces cell death whereas their individual loss has no effect). (B) For every pair of paralogs, we include amino acid sequence identity as a baseline and calculate features from comparison of their predicted structures (such as the TM score) and their PLM embeddings (such as cosine similarity). We also calculate features derived from searches of each paralog against multispecies databases of protein sequences and predicted protein structures. These features allow us to evaluate how similar the paralogs are to each other with respect to other proteins in the same species (paralogs) and across other species (orthologs) (e.g. whether A2 is the most similar protein in the database to A1 or whether there are 10 more similar paralogs). (C) Using a cross-validation approach we evaluate the ability of each individual feature to predict two definitions of shared functions (PPI/SL) across two species (yeast and human). Using machine learning we also evaluate the predictive power of feature subsets (e.g. all structural features) and all features combined.
Our evaluation across four datasets (two definitions × two species) reveals that sequence similarity features derived from alternative protein representations provide non-redundant, complementary functional information that is not captured by sequence identity alone. While individual similarity measures perform better in specific scenarios, the combination of all features in machine learning classifiers consistently offers a more accurate prediction of whether two paralogs share functions. We further show that features derived by searching for homologous proteins within and across species provide additional information beyond simple pairwise comparison and can aid the identification of paralog pairs where both paralogs share functions.
Materials and methods
Building four paralog pair classification datasets
Human and yeast paralog pairs lists
We obtained 124 813 human (Homo sapiens; genome assembly GRCh38.p14) and 5673 yeast (S. cerevisiae S288c; genome assembly R64-1-1) protein-coding paralog pairs and their corresponding amino acid sequence identities from Ensembl release 111 [35]. Each gene was mapped to its corresponding UniProt entry. Paralog pairs for which all AlphaFold and PLM features could not be computed were subsequently filtered out, resulting in a final total of 107 103 paralog pairs for humans and 5673 for yeast.
Annotating paralog pairs as sharing functions based on PPI comparison
For the shared PPI labels, we assessed whether paralog pairs share functions based on the overlap of their PPIs. We used systematically generated PPI networks for this. For humans, PPI data were obtained from BioPlex 3.0 (HEK293T cell AP-MS) [36], including only interactions where the paralog served as the bait. For yeast, PPI data were sourced from the Michaelis et al. [37] yeast interactome. We measured the overlap of interactors for each paralog pair (A1–A2) using the −log10(P-value) from a Fisher’s exact test (FET) to determine the significance of the overlap. The FET was conducted by comparing the observed number of shared PPIs between two paralogs to the total set of all detected proteins involved in interactions within the experiment, serving as the background population. Paralog pairs were classified as ‘sharing functions’ if the −log10(P-value) from the FET was ≥−log10(0.05/number of pairs in the species), indicating significant overlap. Conversely, pairs were classified as ‘not sharing functions’ if the −log10(P-value) from the FET was ≤−log10(0.05), indicating non-significant overlap. Pairs with nominally significant overlap that did not meet significance criteria after Bonferroni correction for multiple hypothesis testing were considered ambiguous and removed from the datasets. For both yeast and human, we defined additional PPI datasets (see Supplementary Tables S1–S2) based on the full physical interaction network from BioGRID (release 4.4.231) [38] as well as datasets restricted to yeast two-hybrid interactions. Additionally, for all PPI datasets, we generated an additional version using Jaccard score, in which paralog pairs were classified as ‘sharing functions’ if the Jaccard score was ≥0.5 and as ‘not sharing functions’ if the score was ≤0.1.
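The labelling rule above can be illustrated with a minimal Python sketch. It assumes a one-sided (enrichment) test and a 2 × 2 table built from shared, A1-only, A2-only, and remaining background proteins; neither detail is specified in the text, and the function names and toy protein identifiers are hypothetical.

```python
import math
from scipy.stats import fisher_exact


def overlap_neg_log10_p(partners_a1, partners_a2, background):
    """Return -log10(P) for the overlap between two interactor sets.

    `background` is the set of all proteins detected in the experiment.
    The one-sided alternative is an assumption of this sketch.
    """
    background = set(background)
    a1 = set(partners_a1) & background
    a2 = set(partners_a2) & background
    shared = len(a1 & a2)
    only_a1 = len(a1 - a2)
    only_a2 = len(a2 - a1)
    neither = len(background) - shared - only_a1 - only_a2
    table = [[shared, only_a1], [only_a2, neither]]
    _, p_value = fisher_exact(table, alternative="greater")
    return math.inf if p_value <= 0 else -math.log10(p_value)


def label_pair(neg_log10_p, n_pairs, alpha=0.05):
    """Apply the Bonferroni and nominal thresholds described in the text."""
    bonferroni_cutoff = -math.log10(alpha / n_pairs)
    nominal_cutoff = -math.log10(alpha)
    if neg_log10_p >= bonferroni_cutoff:
        return "sharing functions"
    if neg_log10_p <= nominal_cutoff:
        return "not sharing functions"
    return None  # nominally significant only: ambiguous, removed


# Toy example with hypothetical protein identifiers.
toy_background = {f"P{i}" for i in range(100)}
toy_a1 = [f"P{i}" for i in range(10)]
toy_a2 = [f"P{i}" for i in range(5, 15)]
toy_score = overlap_neg_log10_p(toy_a1, toy_a2, toy_background)
print(toy_score, label_pair(toy_score, n_pairs=5673))
```

Pairs for which `label_pair` returns `None` correspond to the ambiguous pairs that were removed from the datasets.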
Annotating paralog pairs as sharing functions based on SL
In humans, SL labels were assigned using the method described in De Kegel et al. [33] (training dataset). Briefly, SL between paralogs was called by analysing CRISPR screens in a panel of cancer cell lines [39] and using a linear regression model to identify associations between loss (via deletion, silencing, and mutation) of one paralog and sensitivity to the inhibition of its counterpart. De Kegel et al. [33] only evaluated paralog pairs from smaller families (<20 paralogs) and that met a minimum sequence identity threshold (20%). Here, we used the same pipeline but assessed all paralog pairs in Ensembl 111 without applying sequence identity or family size filters. We also defined an additional human SL dataset, named ‘SL-Lenient’ (Supplementary Table S3), which combines hits from multiple existing combinatorial screens. In this dataset, paralog pairs are classified as SL if they were detected as synthetic lethal in at least one screen, and as non‐SL if they were screened but never observed as SL. Genetic interactions were obtained from Thompson et al. [40], Dede et al. [22], Parrish et al. [41], Gonatopoulos-Pournatzis et al. [42], and Ito et al. [43]. For Thompson et al., Dede et al., and Parrish et al., we used the authors’ own definition of SL. For Ito et al. and Gonatopoulos-Pournatzis et al., where many of the hits represent negative genetic interactions with minimal impact on growth, we additionally required that the constructs targeting the gene pair have a minimum log fold-change value <−0.8. See De Kegel et al. [33] for additional details on how this threshold was chosen. In yeast, SL labels were assigned based on the ‘Synthetic Genetic Array’ genetic interaction dataset [44], which involves pairwise screening of non-essential genes. Pairs with a genetic interaction score ≤−0.35 and a P-value ≤.05 were classified as SL, while the remaining screened pairs were classified as non-SL. 
We defined an additional yeast dataset of negative genetic interactions, in which gene pairs with a genetic interaction score ≤−0.08 and a P-value ≤.05 were classified as ‘Neg. GI’ (Supplementary Table S3). Non-screened pairs were systematically filtered out.
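As an illustration, the yeast thresholds above can be written as a small Python function. This is a sketch only: the function name and the use of `None` for non-screened pairs are assumptions, while the cut-offs are those stated in the text.

```python
def label_yeast_pair(gi_score, p_value, score_cutoff=-0.35, alpha=0.05):
    """Label a screened yeast paralog pair from its genetic interaction data.

    Returns True (SL), False (non-SL), or None when the pair was not
    screened (represented here by missing values, an assumption).
    """
    if gi_score is None or p_value is None:
        return None  # non-screened pairs are filtered out
    return gi_score <= score_cutoff and p_value <= alpha


# The more lenient 'Neg. GI' dataset only changes the score cut-off.
print(label_yeast_pair(-0.40, 0.01))                      # strong negative interaction
print(label_yeast_pair(-0.10, 0.01))                      # too weak for SL
print(label_yeast_pair(-0.10, 0.01, score_cutoff=-0.08))  # counts as Neg. GI
```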
Computing paralog pair features
Features derived from comparing predicted structures
For each paralog pair (A1–A2), we retrieved the AlphaFold2-predicted structures for A1 and A2 canonical isoforms from the AlphaFold Database [25, 45] (UP000005640_9606_HUMAN_v4 for human; UP000002311_559292_YEAST_v4 for yeast). We then used Foldseek [46] (version 6.29e255) to search the structure of A1 against the entire AlphaFold database. The resulting Foldseek ranked search output is similar to a BLAST [1] output table, as it contains alignment results between a query protein and potential target proteins, ranking them based on alignment scores. For A1 as the query, we extracted the alignment scores corresponding to A2 as the target. This process was repeated with A2 as the query and A1 as the target. The following pair features were defined based on the alignment scores: fident, template modelling (TM) score(s), local distance difference test (LDDT), bits, e-value, probability, and alignment length. For each feature, the minimum value between the two alignments (A1 as query and A2 as query) was used. In the minority of cases (3.4%) where neither A1 nor A2 provided an alignment with their respective paralog, the pairs were systematically filtered out.
Features derived from database searches
The Foldseek ranked search output, previously used for assessing pairwise structural similarity for all paralog pairs, was also utilized to define structural similarity search features. For sequence similarity search features, we employed MMseqs2 [47] (Many-against-Many sequence searching; version 13.45111) in a similar manner to Foldseek. We chose MMseqs2 due to its similarity in implementation to Foldseek, which helps minimize algorithmic differences and ensures fair comparisons. We searched the A1 UniProt sequence against the UniProtKB/Swiss-Prot [48] (release 2023_04 of 13 September 2023) sequence database and identified A2 in the resulting ranked search table. We used the similarity search results to define the following three features: the rank of A2 in the ranked search results, the number of proteins from the same species as A1–A2 that rank higher than A2 (paralogs), and the number of species that possess proteins ranking higher than A2 (species). This process was repeated with A2 as the query and A1 as the target, and the final pair features were determined by taking the minimum values observed between the two alignments (A1 as query and A2 as query). In the minority of cases (3.4%) where neither A1 nor A2 provided an alignment with their respective paralog, the pairs were systematically filtered out.
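The three search features can be sketched in Python as follows. The input format (a best-first list of `(target_id, species)` tuples) is hypothetical and does not reproduce the actual Foldseek or MMseqs2 output columns. Dropping the query's self-hit and counting only species other than the query's own are assumptions of this sketch, not details given in the text.

```python
def search_features(hits, query_id, paralog_id, query_species):
    """Compute rank, paralogs, and species features for one search direction.

    `hits` is a best-first list of (target_id, species) tuples.
    Returns None when the paralog is absent from the results.
    """
    ranked = [(t, s) for t, s in hits if t != query_id]  # drop self-hit (assumption)
    for index, (target, _) in enumerate(ranked):
        if target == paralog_id:
            above = ranked[:index]
            return {
                "rank": index + 1,
                "paralogs": sum(1 for _, s in above if s == query_species),
                "species": len({s for _, s in above if s != query_species}),
            }
    return None


def pair_features(a1_as_query, a2_as_query):
    """Combine both search directions by taking the minimum of each feature."""
    found = [f for f in (a1_as_query, a2_as_query) if f is not None]
    if not found:
        return None  # neither direction recovered the paralog: pair is filtered out
    return {key: min(f[key] for f in found) for key in found[0]}


# Toy example with hypothetical identifiers.
toy_hits_a1 = [("A1", "human"), ("X", "human"), ("Y", "mouse"), ("A2", "human")]
toy_hits_a2 = [("A2", "human"), ("A1", "human")]
forward = search_features(toy_hits_a1, "A1", "A2", "human")
reverse = search_features(toy_hits_a2, "A2", "A1", "human")
print(forward, reverse, pair_features(forward, reverse))
```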
Distances between PLM embeddings as pair features
We used two sources of PLM embeddings: ProtT5 and ESM2. For ProtT5 [28] (ProtT5-XL-U50 model), ‘per-protein’ pre-encoded embeddings for A1 and A2 were retrieved from the UniProt database and compared using the following distance metrics: cosine, Euclidean, Manhattan, and TS-SS (triangle area similarity–sector area similarity) [49]. For ESM2 [50], the UniProt protein sequences of A1 and A2 were encoded using the ‘esm2_t48_15B_UR50D’ model. Since the resulting per-residue embeddings vary in size with sequence length and are therefore not directly comparable, we generated ‘fixed-size’ protein embeddings using four methods based on the implementation by Yeung et al. [51]: beginning-of-sequence special token, end-of-sequence special token, mean of both special tokens, and mean of all residue tokens. These fixed-size embeddings were then compared using cosine, Euclidean, Manhattan, and TS-SS distances.
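The pooling and comparison steps can be sketched with NumPy as below. The sketch assumes a token matrix whose first and last rows are the beginning- and end-of-sequence special tokens, uses the short method names bs, es, mst, and mrt as labels, and reports cosine as a distance (1 minus cosine similarity). TS-SS is omitted, and none of this reproduces the implementation of Yeung et al.

```python
import numpy as np


def pool_tokens(token_embeddings, method):
    """Reduce a (length + 2, dim) token matrix to one fixed-size vector.

    Row 0 is assumed to be the beginning-of-sequence token and the last
    row the end-of-sequence token (an assumed layout).
    """
    tokens = np.asarray(token_embeddings, dtype=float)
    bos, eos, residues = tokens[0], tokens[-1], tokens[1:-1]
    if method == "bs":    # beginning-of-sequence token
        return bos
    if method == "es":    # end-of-sequence token
        return eos
    if method == "mst":   # mean of both special tokens
        return (bos + eos) / 2.0
    if method == "mrt":   # mean of all residue tokens
        return residues.mean(axis=0)
    raise ValueError(f"unknown pooling method: {method}")


def embedding_distances(u, v):
    """Cosine, Euclidean, and Manhattan distances between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cosine_similarity = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return {
        "cosine": 1.0 - cosine_similarity,
        "euclidean": float(np.linalg.norm(u - v)),
        "manhattan": float(np.abs(u - v).sum()),
    }


# Toy 2-dimensional example: two residues flanked by special tokens.
toy_tokens = [[1.0, 0.0], [2.0, 2.0], [4.0, 4.0], [0.0, 1.0]]
print(pool_tokens(toy_tokens, "mrt"))
print(embedding_distances(pool_tokens(toy_tokens, "bs"), pool_tokens(toy_tokens, "es")))
```

Because the pooled vectors have the same dimension regardless of protein length, the same distance functions apply to the per-protein ProtT5 embeddings without any pooling step.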
Evaluating pair features on label predictions
Spearman’s correlation coefficients
To evaluate the relationships between the different features, Spearman correlation coefficients were calculated using Python 3.11.4 and the pandas 2.1.3 package. Since the features could be either positively or negatively correlated, we presented the absolute values of the correlation coefficients for clarity.
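A minimal pandas sketch of this step is shown below; the column names and values are toy placeholders rather than the actual feature table.

```python
import pandas as pd


def abs_spearman(features: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Spearman correlations between feature columns, as absolute values."""
    return features.corr(method="spearman").abs()


# Toy feature table (hypothetical values).
toy_features = pd.DataFrame(
    {
        "sequence_identity": [1, 2, 3, 4],
        "embedding_distance": [-1, -8, -27, -64],  # perfectly anti-correlated ranks
        "other_feature": [2, 1, 4, 3],
    }
)
toy_corr = abs_spearman(toy_features)
print(toy_corr)
```

Taking absolute values means that a distance (which decreases with similarity) and a similarity score (which increases with it) are treated alike when judging redundancy.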
Evaluating individual features for predicting labels
We evaluated each feature across all four datasets for their ability to predict whether paralog pairs share functions. Using four-fold cross-validation, we first standardized the feature values using z-score normalization (sklearn version 1.3.2 used for all machine learning analyses). Then, we assessed the performance of simple logistic regression classifiers (max iteration = 1000), each containing only one feature at a time, to predict the paralog pair labels. The performance was measured by computing the area under the receiver operating characteristic curve (AUROC, abbreviated as AUC in the figures) for each feature, and we presented both the individual AUC values and the mean AUC across the folds.
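The single-feature evaluation can be sketched with scikit-learn as follows. Stratified, shuffled folds, the random seed, and fitting the z-score scaler inside each training fold via a pipeline are assumptions of this sketch; the text specifies only four-fold cross-validation, z-score normalization, and `max_iter = 1000`. The data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def single_feature_auc(values, labels, n_splits=4, seed=0):
    """Per-fold AUROC of a one-feature logistic regression classifier."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    y = np.asarray(labels)
    # The scaler is refitted on each training fold, so no test-fold
    # statistics leak into the standardization.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=folds, scoring="roc_auc")


# Synthetic example: one informative feature and binary labels.
rng = np.random.default_rng(0)
toy_labels = rng.integers(0, 2, size=400)
toy_feature = 2.0 * toy_labels + rng.normal(size=400)
fold_aucs = single_feature_auc(toy_feature, toy_labels)
print(fold_aucs, fold_aucs.mean(), fold_aucs.std())
```

The per-fold values returned here correspond to the individual AUC values, and their mean and standard deviation to the summary statistics reported per feature.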
XGBoost classifiers evaluation on label predictions
To evaluate the predictive power of each ‘set’ of features (predicted structure similarity, PLM embeddings, and similarity search), rather than individual features, we used XGBoost classifiers. As before, we applied four-fold cross-validation and standardized all feature values. We assessed the performance of XGBoost classifiers (n_estimators = 600, random_state = 8, learning_rate = 0.1, colsample_bytree = 0.5, use_label_encoder = False, eval_metric=‘logloss’, and XGBoost version 2.0.3) that incorporated various subsets of features to predict the paralog pair labels. The performance of these classifiers was measured by computing their AUROC. We presented the mean AUROC across the folds, along with the ROC and the precision-recall curve for each set of features.
Constructing additional datasets for Gene Ontology-based paralog pair classification
Gene Ontology terms retrieval
Human Gene Ontology (GO) terms were sourced from UniProt (reference proteome data, 2023-07-12) [48]. For yeast, GO terms were retrieved from the Saccharomyces Genome Database (2024-03-25) [52]. Only experimental GO terms were retained, filtered by evidence codes: EXP, IDA, IEP, IGI, IMP, IPI, TAS, and IC. Paralog genes were then annotated with their corresponding experimental GO terms.
GO semantic distances between paralogs
The GOGO software [53] was used to calculate semantic distances between the two genes of each paralog pair, generating separate distances for biological processes (BPO), molecular functions (MFO), and cellular components (CCO).
Paralog pairs classification based on GO semantic distances
Paralog pairs were classified using fixed thresholds: pairs with semantic similarities ≥0.9 were labelled as sharing functions, while those with similarities ≤0.3 were labelled as not sharing functions. Pairs with semantic similarities between these two thresholds were excluded from the analysis. This process resulted in six datasets: three for human (BPO, MFO, and CCO) and three for yeast (BPO, MFO, and CCO), where paralog pairs were classified based on their respective semantic similarity.
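The thresholding can be illustrated with a short Python sketch. The input structure (a dictionary of per-ontology similarities for each pair) and the pair identifiers are hypothetical; only the 0.9 and 0.3 cut-offs come from the text.

```python
def label_by_similarity(similarity, high=0.9, low=0.3):
    """Return True (sharing), False (not sharing), or None (excluded)."""
    if similarity is None:
        return None  # no similarity available for this ontology
    if similarity >= high:
        return True
    if similarity <= low:
        return False
    return None  # intermediate similarity: excluded from the analysis


def build_go_datasets(pair_similarities, ontologies=("BPO", "MFO", "CCO")):
    """Build one labelled dataset per ontology, dropping excluded pairs."""
    datasets = {ontology: {} for ontology in ontologies}
    for pair, scores in pair_similarities.items():
        for ontology in ontologies:
            label = label_by_similarity(scores.get(ontology))
            if label is not None:
                datasets[ontology][pair] = label
    return datasets


# Toy example with hypothetical pairs and similarity values.
toy_similarities = {
    ("A1", "A2"): {"BPO": 0.95, "MFO": 0.50, "CCO": 0.10},
    ("B1", "B2"): {"BPO": 0.20, "MFO": 0.90},
}
toy_datasets = build_go_datasets(toy_similarities)
print(toy_datasets)
```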
Results
Defining paralog pairs with shared functions using PPI networks and SL
We first obtained a list of 107 103 paralog pairs in humans (H. sapiens) and 5673 paralog pairs in budding yeast (S. cerevisiae) from Ensembl 111 [35]. We then annotated each set of paralog pairs according to whether or not they share functions. We reasoned that pairs of paralogs could be viewed as having shared functions if they (i) could interact with the same PPI partners or (ii) could phenotypically compensate for each other’s loss, as evidenced by SL (Fig. 1A). For the PPI labels, we made use of systematically generated PPI networks (BioPlex for humans; social interactome for yeast) [36, 37] to avoid the ascertainment bias associated with networks that aggregate interactions from multiple sources [54]. Using these networks, paralog pairs were annotated as having shared functions if they shared a significant fraction of their PPI partners (see the ‘Materials and methods’ section), resulting in 3.6% of pairs in humans and 10% of pairs in yeast labelled as having shared functions (Fig. 2A). For SL pairs, we made use of different sources of labels for each species. In budding yeast, a near complete map of all genetic interactions between pairs of protein-coding genes has been generated [44] which includes the vast majority of paralog pairs. We used this to annotate 4.8% of yeast paralog pairs as having shared functions (Fig. 2A). No similar comprehensive genetic interaction map exists for humans, but genome-wide CRISPR screens in genetically diverse cancer cell lines can be used to identify robust and unbiased synthetic lethal and non-synthetic lethal pairs [33]. Using this approach, we annotated 0.5% of human paralog pairs as synthetic lethal (Fig. 2A). We note that in each dataset a subset of gene pairs was left unannotated due either to lack of data (e.g. missing CRISPR gene scores for some genes) or ambiguous calls (e.g. a nominally significant overlap in PPI partners that did not meet the threshold for significance; see the ‘Materials and methods’ section).
Figure 2.
Details of ‘shared functions’ labels in two species. (A) Table showing the counts of paralog pairs labelled according to four ‘shared functions’ class annotations (one PPI and one SL dataset for each species, humans (Hs—H. sapiens) and budding yeast (Sc—S. cerevisiae)). (B) Venn diagram illustrating the overlap between pairs labelled as sharing functions based on PPI and SL in humans and in budding yeast, highlighting the small but significant level of agreement and the varying scales between the different datasets. P-value and Odds ratio from an FET.
Fig. 2B shows the composition of the resulting four datasets. Although these datasets derive from the same lists of Ensembl paralog pairs, the labels within each species are defined using different functional criteria (PPI and SL) and so they are only partially overlapping. A pair labelled as ‘sharing functions’ in one dataset might be labelled as ‘not sharing functions’ or not labelled at all in another. We assessed the agreement between the two functional definitions using FETs, revealing statistically significant relationships between the datasets of the same species (Fig. 2B; PPI–SL P-value = 5.81 × 10⁻⁸ in humans and PPI–SL P-value = 3.54 × 10⁻¹² in yeast). These results suggest that different functional definitions based on distinct methodological approaches successfully capture common biological insights regarding paralog function sharing.
To evaluate various sequence similarity metrics for predicting paralogs sharing functions, we analysed each of these four paralog datasets separately. We aimed to assess the suitability of these metrics for specific annotations and to understand how different features perform across various functional annotations, species, dataset sizes, and label distributions.
Evaluation of structural similarity between predicted structures to predict shared functions in paralogous genes
Protein sequences fold into 3D structures, enabling them to interact with their environment and perform their functions. Thanks to AlphaFold2 [25, 45], predicted protein structures derived from sequences are now widely available for many species including both yeast and humans. To assess whether structural similarity can predict if two paralogs perform the same function, we compared the AlphaFold2-predicted structures of paralog pairs. We used Foldseek [46], a recently developed tool designed for fast and efficient protein structure alignments, to align the AlphaFold-predicted structures.
Foldseek is akin to BLAST but for protein structures. It enables fast structure alignments and quick searches for structural similarities across large protein databases. We evaluated the scores computed by Foldseek as features that define the pairwise structural similarity of the paralogs. We defined nine pairwise structural similarity features (Fig. 3A, left) as follows: amino acid sequence identity after Foldseek structural alignment (fident); three TM scores that measure global structural similarity (TM score of the alignment, TM score normalized by the query length, and TM score normalized by the target length) [55]; LDDT [56], which measures local structure similarity; and four additional scores from the Foldseek alignment (e-value, bits score, probability, and alignment length).
Figure 3.
Pairwise structural comparison features provide non-redundant information about shared paralog function. (A) Overview of nine predicted structural similarity features derived from Foldseek alignment between AlphaFold-predicted structures of the two paralogs. (B) Spearman correlation coefficients (absolute values) between these nine features and sequence identity in the 100k human paralog pairs from Ensembl 111. (C) Mean AUC values comparing the nine individual structural features with minimum sequence identity for predicting shared functions across the four datasets using four-fold cross-validation. Error bars show the standard deviation of AUC values from cross-validation. (D) Performance of an XGBoost classifier using the nine predicted structure similarity features plus sequence identity compared with a classifier using solely sequence identity. Top chart shows the mean AUC (AUROC) on each dataset, bottom shows ROC and precision-recall curves for each dataset.
Considering these nine pairwise features are derived from AlphaFold-predicted structures, and hence essentially from protein sequences, we investigated whether they contain redundant information already captured by sequence identity or if they provide novel information. Although alternative metrics derived from pairwise protein sequence comparisons (e.g. sequence similarity defined using the BLOSUM 62 substitution matrix) are used in other contexts, previous work analysing SL has used sequence identity rather than any of these metrics [16, 22, 33, 40] and so we use this as the baseline for our analysis. Analysis of the assembled datasets suggests that sequence similarity (or MMseqs2 e-score) does not perform significantly better than sequence identity (Supplementary Fig. S3). We first computed Spearman correlation coefficients between the selected features for all human paralog pairs (Fig. 3B). We found that only fident displayed a high correlation (Spearman correlation of 0.86) with standard sequence identity. This high correlation is expected, as fident is effectively an amino acid sequence identity measurement calculated from a pairwise alignment derived from aligned structures. The other structural similarity metrics displayed lower correlations with sequence identity, e.g. LDDT score (0.78) and TM score query (0.74), suggesting that they capture non-redundant information. Similar observations were made for the correlations between calculated features in the yeast dataset as shown in Supplementary Fig. S1.
As these nine pairwise features capture pairwise structural divergence, they may also capture functional divergence. To evaluate their ability to predict whether two paralogs share functions, we performed a four-fold cross-validation on each of our four datasets individually. This approach ensured that we assessed the predictive performance separately within each dataset. As shown in Fig. 3C, most features had at least some predictive power (mean AUC > 0.5), though their performance varied across datasets. With the exception of predicting shared PPI in yeast, no structural similarity feature outperformed sequence identity at predicting shared functions. Some features (e.g. fident and LDDT) performed similarly to or marginally worse than sequence identity at predicting SL in yeast and humans, but did not outperform it. For yeast PPIs, most structural features, led by LDDT and bits score, outperformed sequence identity, which performed particularly poorly at this task (AUC = 0.58). This suggests that structural comparisons may be particularly valuable when sequence-based comparisons perform poorly, such as in the yeast shared PPI task. However, they might be unnecessary when sequence identity alone is sufficient, as seen in human SL.
Since the different structural features seem to capture various aspects of functional divergence, we developed an XGBoost classifier combining all nine predicted structure features with sequence identity. We evaluated this classifier’s performance using cross-validation on the four datasets and compared it to a classifier that relied solely on sequence identity. As shown in Fig. 3D, this structure-based classifier outperformed sequence identity in most datasets, except in human-SL, where sequence identity alone performed exceptionally well. These results suggest that sequence identity and predicted structural similarities provide non-redundant information about shared paralog functions.
Evaluation of distances between PLM embeddings to predict shared functions in paralogous genes
PLMs are another artificial intelligence (AI) application in biology that learn the language of protein sequences through unsupervised training on vast sequence datasets. Once trained, a vector representation—called a PLM embedding—can be extracted from the model’s final hidden layer when provided with a protein sequence. Thus, a PLM embedding represents a protein sequence as a high-dimensional vector in the learned protein language space. It has been demonstrated that PLM embeddings efficiently capture key features of proteins, such as structural motifs, evolutionary signals, and functional elements [27, 28, 50, 57, 58]. PLM embeddings have been successfully used for various applications, including predicting secondary structure, subcellular localization, and inter-residue distances for 3D structure prediction, as well as serving as an alternative to traditional sequence similarity in homology-based annotation transfer. To evaluate the ability of distances between PLM embeddings to predict if two paralogs perform the same function, we investigated two commonly used PLMs: ESM2 [50] (15B parameters; trained on UniRef50) and ProtT5 [28] (a transformer using 3B parameters; trained on UniRef50), along with various metrics for comparing the produced PLM embeddings.
ESM2 produces ‘per residue’ embeddings that are not directly comparable at the protein level because their lengths vary based on the protein sequence. To make them comparable, we used four methods to compute fixed-size embeddings: beginning of sequence, end of sequence, mean of residue tokens, and mean of special tokens. For ProtT5, we directly obtained the fixed-size ‘per protein’ embeddings available on UniProt. For each fixed-size embedding (five in total—four from ESM2 and one from ProtT5), we computed four different distances (cosine, Euclidean, Manhattan, and TS-SS) between the embeddings of each paralog pair, resulting in a total of 20 PLM embedding distances used as pair features.
Given that these 20 features are derived from protein sequences, we investigated whether PLM embedding distances contain redundant information already captured by sequence identity. We first computed Spearman correlation coefficients between the selected features for all human paralog pairs (Fig. 4B). While all feature pairs showed some correlation, only those corresponding to different distance metrics between the same fixed-size embeddings displayed high correlation (Spearman correlation > 0.85). In general, the correlation with protein sequence identity was only moderate (average Spearman correlation across all metrics being 0.62). The observed correlations suggest that the 20 PLM features capture non-redundant information, influenced by the choice of PLM, the method used to compute fixed-size embeddings, and to a lesser extent, the distance metrics used (similar observations were made in yeast, as shown in Supplementary Fig. S1).
Figure 4.
Distances between PLM embeddings provide novel information about the functional similarity of paralog proteins. (A) Overview of the 20 PLM embedding features derived from ESM2 and ProtT5 embeddings. ProtT5 embeddings are available directly in fixed sizes per protein, while ESM2 embeddings are output in varying sizes that depend on the protein sequence length. ESM2 embeddings are converted to fixed sizes using four different methods: (i) bs: fixed size using the beginning of sequence token, (ii) es: fixed size using the end of sequence token, (iii) mrt: fixed size using the mean of residue tokens, and (iv) mst: fixed size using the mean of special tokens. Four different distance metrics (cosine, Euclidean, Manhattan, and TS-SS) are used to compare each paralog pair’s embeddings for a given fixed size, resulting in 20 features. (B) Spearman correlation coefficients (absolute values) between these 20 features and sequence identity in 100k human paralog pairs from Ensembl 111, revealing that the method of fixing the size of the embedding matters more than the distance metric. (C) Mean AUC of the 20 individual features compared with minimum sequence identity for predicting shared functions across the four datasets using four-fold cross-validation. Error bars show the standard deviation of AUC values from cross-validation. (D) Performance of an XGBoost classifier using the 20 PLM features plus sequence identity compared with a classifier using solely sequence identity, the predicted structure similarity classifier, and a classifier with all 29 features combined with sequence identity for four-fold cross-validation on shared functions prediction across the four datasets. Top chart shows the mean AUC (AUROC) on each dataset; bottom shows ROC and precision-recall curves for each dataset.
As these 20 PLM pair features directly capture pairwise sequence similarity, we hypothesized that they may also capture functional similarity. To evaluate their ability to predict whether two paralogs share functions, we performed four-fold cross-validation on the four datasets, as we did for the structural similarity metrics. As shown in Fig. 4C, most PLM features performed comparably to each other across all four datasets. The biggest differences were observed between ESM2- and ProtT5-derived distances, with ProtT5 generally performing better, and between the different ESM2 fixed-size embeddings, indicating that the method used to make the embeddings fixed-size matters more than the distance metric used. Notably, distances derived from ProtT5 per-protein embeddings and from the ESM2 mean of residue tokens (mrt) perform particularly well on yeast PPI data but not as well on human PPI data. However, sequence identity performance also varies widely across these datasets. This variation may be at least partially attributable to the significant differences in the distributions of paralogs in the two species. The human genome contains paralogs deriving from at least two whole genome duplication (WGD) events, compared with one in yeast, and overall human genes have more paralogs.
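To make the per-feature evaluation concrete, here is a hedged sketch of how the mean and standard deviation of AUC over four folds could be computed for a single feature. It assumes that the raw feature value is used directly as the ranking score (no model is fitted), that folds are stratified by class label, and that the sample standard deviation is reported; the text does not specify these details, so treat them as illustrative choices. All function names are hypothetical.

```python
import random
import statistics


def auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=4, seed=0):
    """Split indices into k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds


def feature_auc(scores, labels, k=4, seed=0):
    """Mean and sample standard deviation of AUC over k held-out folds."""
    per_fold = [auc([scores[i] for i in fold], [labels[i] for i in fold])
                for fold in stratified_folds(labels, k, seed)]
    return statistics.mean(per_fold), statistics.stdev(per_fold)
```

For distance-type features, where smaller values mean greater similarity, the scores would be negated before calling `feature_auc` so that higher scores correspond to the positive class.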
Since the different PLM features seem to capture functional similarity without being highly correlated, we combined all 20 PLM features with minimum sequence identity in an XGBoost classifier and evaluated its performance using cross-validation on four datasets. As shown in Fig. 4D, this PLM classifier outperformed sequence identity alone in all datasets except human-SL, where it achieved equivalent performance.
To determine if the nine structural features and the 20 PLM features contain redundant or unique information, we built a third XGBoost classifier incorporating all previous 29 pairwise features. This classifier significantly outperformed sequence identity alone (Fig. 4D), as well as the individual PLM and structural classifiers, across all four datasets. Combining these features resulted in substantial gains in predictive power for all classification challenges. These results suggest that sequence identity, PLM embedding distances, and predicted structural similarities all contain non-redundant functional information.
Sequence–structure similarity searches against databases further capture novel information on paralog pairs sharing functions
Previous work has established that the likelihood of two paralogs displaying SL is influenced not only by the similarity of the two proteins, but also by the number of other paralogs in the same family and whether or not the pair in question are the most similar within a larger family [33]. The existence of other members within a larger family may also influence the likelihood of two proteins sharing PPIs. We hypothesize that within a given paralog family, if shared functions exist, the closer paralogs—those that are more similar—will be better candidates for sharing functions, while the distant ones—those that are less similar—will be less likely to share functions. Here, we aim to evaluate whether the presence of more similar sequences or structures in databases can predict if two paralogs share functions.
To determine the existence of more similar protein sequences, we first performed similarity searches using sequences and structural information. The two paralog sequences of a pair were queried against the UniProt database [48] using MMseqs2 [47]. The resulting position of the paralog in the ranked search results was translated into different features: the number of more similar protein sequences among all homologs (rank), the number of more similar proteins within the same species (paralogs), and the number of species with more similar orthologs (species). Similarly, to identify more similar protein structures, we used Foldseek [46] to query the predicted structures of the paralogs against the AlphaFold database [25, 26, 45], resulting in three analogous features for structures. This approach yielded a total of six features capturing the existence of more similar sequences and structures (Fig. 5A).
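The three search-derived features can be illustrated with a short sketch that operates on an already ranked hit list. This does not call MMseqs2 or Foldseek; it assumes the search output has been parsed into `(protein_id, species)` tuples ordered from best to worst hit, and that the query itself is ignored if it appears among its own hits. The function name and input layout are hypothetical.

```python
def search_features(hits, query_id, query_species, paralog_id):
    """Derive rank, paralogs, and species features for one query/paralog pair.

    hits: list of (protein_id, species) tuples, best hit first (assumed layout).
    Returns None when the paralog is absent from the hit list.
    """
    better = []  # hits that rank above the paralog
    for protein_id, species in hits:
        if protein_id == paralog_id:
            break
        if protein_id != query_id:  # skip the self-hit
            better.append((protein_id, species))
    else:
        return None  # loop finished without finding the paralog

    return {
        # number of more similar proteins among all homologs
        "rank": len(better),
        # number of more similar proteins within the same species
        "paralogs": sum(1 for _, sp in better if sp == query_species),
        # number of other species with a more similar protein
        "species": len({sp for _, sp in better if sp != query_species}),
    }
```

Running the same function on sequence hits and on structure hits for a pair would give the six features; how the two search directions (A1 as query and A2 as query) are combined into one value per pair is not covered by this sketch.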
Figure 5.
Sequence and structure similarity searches provide non-redundant information about shared paralog function. (A) Overview of the similarity search features. We use MMseqs2 to search paralog sequences against the UniProt sequence database and Foldseek to search AlphaFold-predicted structures of paralogs against the AlphaFold database. Features are extracted from the ranked search results, specifically focusing on the paralog rank (rank of A2 when A1 is the query). To describe more precisely the divergence within and across species reflected in the database searches, we derive two additional features from the ranked search result table: (i) paralogs: the number of proteins within the same species that rank better than A2 and (ii) species: the number of species possessing proteins (orthologs) that rank better than A2. These three features (rank, paralogs, and species) are computed for both sequence and structure similarity searches, making a total of six similarity search features. (B) Spearman correlation coefficients (absolute values) between these six features and sequence identity in 100k human paralog pairs from Ensembl 111. (C) Mean AUC of the six individual features compared with minimum sequence identity for predicting shared functions across the four datasets using four-fold cross-validation. Error bars show the standard deviation of AUC values from cross-validation. (D) Performance of an XGBoost classifier using the six similarity search features plus sequence identity compared with a classifier using solely sequence identity, a classifier with only the three sequence similarity search features plus sequence identity, and a classifier with only the three structure similarity search features plus sequence identity for four-fold cross-validation on shared functions prediction across the four datasets. Top chart shows the mean AUC (AUROC) on each dataset; bottom shows ROC and precision-recall curves for each dataset.
In human, the six features showed low to medium Spearman correlation with sequence identity (all <0.7; Fig. 5B), indicating that they capture information not captured by sequence identity. Notably, the ‘species’ features were the most correlated with sequence identity (0.58 for sequence and 0.68 for structure), while the ‘paralogs’ features showed the lowest correlations (0.18 for sequence and 0.13 for structure). Similar trends were observed in yeast, as shown in Supplementary Fig. S1.
Since these six similarity search features capture the sequence similarity of a paralog pair in its genomic and evolutionary context by relying on sequence/structure databases, we anticipated that they may also capture functional similarity. To evaluate their individual ability to predict whether two paralogs share functions, we again performed four-fold cross-validation on the four classification datasets. As shown in Fig. 5C, most similarity search features had at least some predictive power (AUC > 0.5), although their performance varied across the four datasets. Overall, the ‘paralogs’ features, which reflect the presence within the genome of sequences or structures more similar to one of the two paralogs, exhibited the strongest predictive power in most datasets.
Next, we combined all six similarity search features with minimum sequence identity in an XGBoost classifier and evaluated its performance using cross-validation on four datasets. As shown in Fig. 5D, this similarity search classifier outperformed sequence identity alone in all datasets. The substantial improvement in all datasets when the six similarity search features are combined with sequence identity, compared with using sequence identity alone, highlights the presence of functional information captured by structure and sequence databases that pairwise sequence identity alone does not capture.
Combining search and similarity features to better predict paralog pairs sharing functions
To capture the most comprehensive relationship between sequence divergence and functional divergence, we combined all previous features (structure similarity, PLM distances, and similarity searches) to model if two paralogs share functions in the two human datasets. Initially, we hypothesized that there might be a single latent variable underlying all the features that is predictive of shared functions and so we used the first dimension of a principal components analysis with all features to model the sequence divergence–functional divergence relationship. However, this approach performed poorly compared to most individual features and classifiers (Fig. 6). This suggests that there is no single latent variable underlying all the similarity metrics, and we may need the full set of features to accurately capture ‘functional’ similarity. We then used all features to train machine learning classifiers (XGBoost) on the two human datasets. These classifiers achieved the best four-fold cross-validation performances on the two human training datasets (Fig. 6), significantly outperforming sequence identity alone or any of the classifiers trained on feature subsets (structural features, embedding features, and database search features). Interestingly, sequence identity alone still performs very well on SL prediction. To assess the robustness of our findings to the PPI network used and the source of synthetic lethal interactions, we performed cross-validation on additional datasets (Supplementary Tables S1–S4). For the PPI networks, in each species, we analysed the full physical interaction network from BioGRID [38] as well as a restricted network containing only yeast two-hybrid interactions. We also considered an alternative definition of shared PPIs, based on Jaccard score rather than the FET, for each network. In all cases, our combined feature set was non-redundant with sequence identity and the full classifier outperformed sequence identity alone. 
For human SL, we considered interactions identified using combinatorial CRISPR screens [22, 40–43], while for yeast, we considered a more lenient set of ‘negative genetic interactions’ (GI score ≤ −0.08). As with the PPIs, our findings are robust to the definition of SL used (Supplementary Figs S4–S5). This indicates that our features successfully capture different and unique aspects of sequence similarity between two protein sequences and their underlying functional relationships.
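The two definitions of shared PPIs mentioned above (Jaccard score and FET) can be sketched as follows. This is an illustration under explicit assumptions rather than the authors' procedure: the network is a dictionary mapping each protein to a set of interaction partners, the two paralogs are removed from each other's partner sets before comparison, the FET is computed as a one-sided hypergeometric tail, and `background_size` (the number of proteins that could have been partners) is a parameter the caller must choose. No labelling threshold is implied.

```python
import math


def partners(network, protein, exclude=()):
    """Interaction partners of `protein`, minus itself and any excluded IDs."""
    return set(network.get(protein, ())) - set(exclude) - {protein}


def jaccard_shared_ppi(network, a, b):
    """Jaccard score: shared partners divided by all partners of either protein."""
    na = partners(network, a, exclude=(b,))
    nb = partners(network, b, exclude=(a,))
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


def overlap_p_value(network, a, b, background_size):
    """One-sided Fisher's exact test for partner overlap.

    Probability of seeing at least the observed overlap if the partners of
    `b` were drawn at random from `background_size` candidate proteins.
    """
    na = partners(network, a, exclude=(b,))
    nb = partners(network, b, exclude=(a,))
    overlap = len(na & nb)
    size_a, size_b = len(na), len(nb)
    # Integer arithmetic keeps the tail sum exact until the final division.
    numerator = sum(
        math.comb(size_a, k) * math.comb(background_size - size_a, size_b - k)
        for k in range(overlap, min(size_a, size_b) + 1)
    )
    return numerator / math.comb(background_size, size_b)
```

A pair could then be labelled as sharing PPIs when the Jaccard score exceeds a chosen cut-off or when the p-value falls below a chosen significance level; both cut-offs would have to be taken from the ‘Materials and methods’ section.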
Figure 6.
Integrating all features improves prediction of shared PPIs, SL, and GO semantic similarity. Performance of an XGBoost classifier using all 36 sequence similarity features together compared with a classifier using solely sequence identity, the classifier using the nine predicted structure similarity features plus sequence identity, the classifier using the 20 PLM features plus sequence identity, the classifier using the six similarity search features plus sequence identity, and a classifier using only a single latent variable representing all 36 features for four-fold cross-validation on shared functions prediction across five human datasets (PPI; SL; GO BPO; GO MFO; and GO CCO). Top chart shows the mean AUC (AUROC) on each dataset; bottom shows ROC and precision-recall curves for each dataset.
To validate these features and test whether they also capture other definitions of shared function, we trained classifiers on three new human datasets based on the three GO term categories: BPO, MFO, and CCO. We included only GO terms supported by experimental evidence to avoid class labels that were themselves based on sequence comparisons (see the ‘Materials and methods’ section). Paralog pairs were labelled as sharing functions (or not) based on the semantic similarity between their GO terms (Supplementary Table S4). In these three new tasks, our features significantly outperformed sequence identity alone (similar observations were made in yeast, as shown in Supplementary Fig. S2), supporting their value as sequence similarity metrics and demonstrating their ability to predict other definitions of shared paralog function.
Lastly, because a gene might appear in multiple pairs and thus be present in both training and testing sets—in different, unseen pairs—our ML classifiers might be learning characteristics specific to individual genes or even closely related paralogs rather than capturing general sequence features [59, 60]. To determine if the presence of genes belonging to the same family in different pairs influences our results when predicting on other pairs of family members, we went one step further and performed four-fold cross-validation on each of our 14 datasets (the 5 main human, the 5 main yeast, and the 4 additional datasets), but this time we split the data based on gene family, meaning that an entire family is either in the training set or in the test set. As expected, this task is more challenging for most classifiers; however, the overall trend remains robust: Integrating all features still improves prediction, even when predicting on unseen families (Supplementary Fig. S6). In particular, both the similarity search classifier and the classifier using all 36 features remain robust in these scenarios.
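The family-based split described above can be sketched as a grouped fold assignment in which every pair from the same gene family lands in the same fold. The greedy size balancing used here is an illustrative design choice and not necessarily how the folds were built in the study; the function name and the input layout (one family label per paralog pair) are hypothetical.

```python
import random
from collections import Counter


def family_folds(pair_families, k=4, seed=0):
    """Assign each pair to one of k folds without splitting any family.

    pair_families: list with the family label of each paralog pair.
    Returns a list of fold indices, one per pair, in the same order.
    """
    sizes = Counter(pair_families)
    families = sorted(sizes)
    random.Random(seed).shuffle(families)       # random order among equal sizes
    families.sort(key=lambda fam: -sizes[fam])  # stable sort: largest first

    fold_of = {}
    load = [0] * k  # number of pairs currently assigned to each fold
    for fam in families:
        target = load.index(min(load))  # put the family in the emptiest fold
        fold_of[fam] = target
        load[target] += sizes[fam]
    return [fold_of[fam] for fam in pair_families]
```

Training on the pairs from three folds and testing on the fourth then guarantees that no family contributes pairs to both sides of a split. With scikit-learn available, `GroupKFold` offers equivalent grouped splitting.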
Discussion
Recent AI advances in language processing have significantly impacted protein sequence analysis by introducing new representations, such as PLM embeddings, and enabling accurate protein structure predictions, like those generated by AlphaFold2. In this study, we evaluated how well metrics derived from such models can predict shared functions in paralogs, compared with the traditional measure of sequence identity. Each of these features, whether comparing amino acid superposition in 3D structures, calculating distances between high-dimensional PLM embedding vectors, or assessing similarity through database searches, is based on a different representation of protein sequences. Sequence identity compares the direct amino acid sequence, AlphaFold-predicted structures capture co-evolutionary signals between amino acids, PLM embeddings represent sequences in a protein language space, and similarity searches compare sequences or structures against others in databases. Although they originate from the same protein sequences, these different metrics are non-redundant and capture unique aspects of sequence divergence between paralogs, as one could expect given that each method is designed to highlight different properties. Furthermore, the combination of all these features in machine learning classifiers significantly outperforms any individual feature or feature type classifier across all datasets. While some studies may focus on specific types of features, such as structural data, PLM embeddings, or sequence-based approaches, our findings demonstrate that each type of feature captures different aspects of the sequence–function relationship. Our finding that the combination of all features performs well on tasks beyond shared PPI and SL prediction, i.e. predicting pairs with high GO semantic similarity, suggests that the set of features we have calculated may be useful for other challenges that relate to paralog similarity, e.g. predicting common targets of small molecules [61]. As a resource for the community, we therefore provide the features we have calculated for all paralog pairs on Zenodo at https://zenodo.org/records/14973633, along with all predictions made by each of our classifiers for every yeast and human pair.
Both yeast and humans have undergone WGD, yet they differ in the number of duplicated paralogs within each family. Human paralog families tend to be larger, potentially reflecting varying degrees of functional divergence. Despite these differences in duplication history, most features perform comparably in nearly all equivalent human and yeast datasets. The variations in feature performance may largely be attributed to differences in dataset composition, with each dataset representing a distinct subset of the known paralog pairs. These subsets differ in their sequence identity distributions and in the degree to which sequence and function are correlated. A key example is the human SL dataset constructed and studied in this paper. While it aims to be as comprehensive as current knowledge allows, human SL data remain less complete than those for yeast, where almost all paralog pairs have been systematically tested for SL. This means that some human SL paralog pairs may not be represented in the dataset, potentially limiting our ability to fully capture sequence/SL relationships, unlike in yeast. Nevertheless, across all datasets, the proposed sequence-derived features, especially in combination, effectively capture sequence–function relationships in both species. These promising results suggest the potential for broader application of these features beyond paralogs, such as predicting ortholog functions or determining whether two predicted orthologs share functions. Future work could explore this by applying our feature set to predict outcomes in protein replacement experiments [62–64], where one protein is substituted with its ortholog to observe functional conservation across species.
Acknowledgements
We thank members of the Ryan lab for useful conversations and critical reading of the manuscript.
Author contributions: Olivier Dennler (Conceptualization [equal], Data curation [lead], Formal analysis [lead], Investigation [equal], Methodology [equal], Software [lead], Validation [equal], Visualization [equal], Writing—original draft [equal], Writing—review & editing [equal]) and Colm J. Ryan (Conceptualization [equal], Funding acquisition [lead], Investigation [equal], Methodology [equal], Project administration [lead], Resources [lead], Supervision [lead], Validation [supporting], Visualization [supporting], Writing—original draft [equal], Writing—review & editing [equal])
Contributor Information
Olivier Dennler, School of Medicine, University College Dublin, Dublin 4, D04 V1W8, Ireland; School of Computer Science, University College Dublin, Dublin 4, D04 V1W8, Ireland; Conway Institute, University College Dublin, Dublin 4, D04 V1W8, Ireland.
Colm J Ryan, School of Medicine, University College Dublin, Dublin 4, D04 V1W8, Ireland; School of Computer Science, University College Dublin, Dublin 4, D04 V1W8, Ireland; Conway Institute, University College Dublin, Dublin 4, D04 V1W8, Ireland.
Supplementary data
Supplementary data is available at NAR Genomics & Bioinformatics online.
Conflict of interest
None declared.
Funding
This work was supported by a Science Foundation Ireland grant to C.J.R. (20/FFP-P/8641). Funding to pay the Open Access publication charges for this article was provided by Science Foundation Ireland.
Data availability
The data—including all computed features, datasets, and scripts used in this article—are available on GitHub at https://github.com/cancergenetics/paralog_seq_similarity and on Zenodo at https://zenodo.org/records/14975581. The produced features, datasets, and predictions are available on Zenodo at https://zenodo.org/records/14973633.
References
- 1. Altschul SF, Gish W, Miller W et al. Basic local alignment search tool. J Mol Biol. 1990; 215:403–10. 10.1016/S0022-2836(05)80360-2.
- 2. Eisen JA. Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res. 1998; 8:163–7. 10.1101/gr.8.3.163.
- 3. Eddy SR. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009; 23:205–11.
- 4. Studer RA, Robinson-Rechavi M. How confident can we be that orthologs are similar, but paralogs differ? Trends Genet. 2009; 25:210–6. 10.1016/j.tig.2009.03.004.
- 5. Nehrt NL, Clark WT, Radivojac P et al. Testing the ortholog conjecture with comparative functional genomic data from mammals. PLoS Comput Biol. 2011; 7:e1002073. 10.1371/journal.pcbi.1002073.
- 6. Hernández-Salmerón JE, Moreno-Hagelsieb G. Progress in quickly finding orthologs as reciprocal best hits: comparing blast, last, diamond and MMseqs2. BMC Genomics. 2020; 21:741. 10.1186/s12864-020-07132-6.
- 7. Ibn-Salem J, Muro EM, Andrade-Navarro MA. Co-regulation of paralog genes in the three-dimensional chromatin architecture. Nucleic Acids Res. 2017; 45:81–91. 10.1093/nar/gkw813.
- 8. Dean EJ, Davis JC, Davis RW et al. Pervasive and persistent redundancy among duplicated genes in yeast. PLoS Genet. 2008; 4:e1000113. 10.1371/journal.pgen.1000113.
- 9. Vavouri T, Semple JI, Lehner B. Widespread conservation of genetic redundancy during a billion years of eukaryotic evolution. Trends Genet. 2008; 24:485–8. 10.1016/j.tig.2008.08.005.
- 10. Cusack SA, Wang P, Lotreck SG et al. Predictive models of genetic redundancy in Arabidopsis thaliana. Mol Biol Evol. 2021; 38:3397–414. 10.1093/molbev/msab111.
- 11. Iohannes SD, Jackson D. Tackling redundancy: genetic mechanisms underlying paralog compensation in plants. New Phytol. 2023; 240:1381–9. 10.1111/nph.19267.
- 12. Ewen-Campen B, Mohr SE, Hu Y et al. Accessing the phenotype gap: enabling systematic investigation of paralog functional complexity with CRISPR. Dev Cell. 2017; 43:6–9. 10.1016/j.devcel.2017.09.020.
- 13. Hu Y, Ewen-Campen B, Comjean A et al. Paralog Explorer: a resource for mining information about paralogs in common research organisms. Comput Struct Biotechnol J. 2022; 20:6570–7. 10.1016/j.csbj.2022.11.041.
- 14. Hartwell LH, Szankasi P, Roberts CJ et al. Integrating genetic approaches into the discovery of anticancer drugs. Science. 1997; 278:1064–8. 10.1126/science.278.5340.1064.
- 15. Kaelin WG. The concept of synthetic lethality in the context of anticancer therapy. Nat Rev Cancer. 2005; 5:689–98. 10.1038/nrc1691.
- 16. DeLuna A, Vetsigian K, Shoresh N et al. Exposing the fitness contribution of duplicated genes. Nat Genet. 2008; 40:676–81. 10.1038/ng.123.
- 17. VanderSluis B, Bellay J, Musso G et al. Genetic interactions reveal the evolutionary trajectories of duplicate genes. Mol Syst Biol. 2010; 6:429. 10.1038/msb.2010.82.
- 18. Han K, Jeng EE, Hess GT et al. Synergistic drug combinations for cancer identified in a CRISPR screen for pairwise genetic interactions. Nat Biotechnol. 2017; 35:463–74. 10.1038/nbt.3834.
- 19. O’Neil NJ, Bailey ML, Hieter P. Synthetic lethality and cancer. Nat Rev Genet. 2017; 18:613–23. 10.1038/nrg.2017.47.
- 20. Lord CJ, Ashworth A. PARP inhibitors: synthetic lethality in the clinic. Science. 2017; 355:1152–8. 10.1126/science.aam7344.
- 21. De Kegel B, Ryan CJ. Paralog buffering contributes to the variable essentiality of genes in cancer cell lines. PLoS Genet. 2019; 15:e1008466. 10.1371/journal.pgen.1008466.
- 22. Dede M, McLaughlin M, Kim E et al. Multiplex enCas12a screens detect functional buffering among paralogs otherwise masked in monogenic Cas9 knockout screens. Genome Biol. 2020; 21:262. 10.1186/s13059-020-02173-2.
- 23. Huang A, Garraway LA, Ashworth A et al. Synthetic lethality as an engine for cancer drug target discovery. Nat Rev Drug Discov. 2020; 19:23–38. 10.1038/s41573-019-0046-z.
- 24. Gu Z, Steinmetz LM, Gu X et al. Role of duplicate genes in genetic robustness against null mutations. Nature. 2003; 421:63–6. 10.1038/nature01198.
- 25. Jumper J, Evans R, Pritzel A et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021; 596:583–9. 10.1038/s41586-021-03819-2.
- 26. Varadi M, Anyango S, Deshpande M et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022; 50:D439–44. 10.1093/nar/gkab1061.
- 27. Rives A, Meier J, Sercu T et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci USA. 2021; 118:e2016239118. 10.1073/pnas.2016239118.
- 28. Elnaggar A, Heinzinger M, Dallago C et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2022; 44:7112–27. 10.1109/TPAMI.2021.3095381.
- 29. Nallapareddy V, Bordin N, Sillitoe I et al. CATHe: detection of remote homologues for CATH superfamilies using embeddings from protein language models. Bioinformatics. 2023; 39:btad029. 10.1093/bioinformatics/btad029.
- 30. Zhao C, Liu T, Wang Z. PANDA-3D: protein function prediction based on AlphaFold models. NAR Genom Bioinform. 2024; 6:lqae094. 10.1093/nargab/lqae094.
- 31. Dickson A, Mofrad MRK. Fine-tuning protein embeddings for functional similarity evaluation. Bioinformatics. 2024; 40:btae445. 10.1093/bioinformatics/btae445.
- 32. Eisenberg D, Marcotte EM, Xenarios I et al. Protein function in the post-genomic era. Nature. 2000; 405:823–6. 10.1038/35015694.
- 33. De Kegel B, Quinn N, Thompson NA et al. Comprehensive prediction of robust synthetic lethality between paralog pairs in cancer cell lines. Cell Syst. 2021; 12:1144–59. 10.1016/j.cels.2021.08.006.
- 34. Brun C, Guénoche A, Jacq B. Approach of the functional evolution of duplicated genes in Saccharomyces cerevisiae using a new classification method based on protein–protein interaction data. J Struct Funct Genom. 2003; 3:213–24. 10.1023/A:1022694824569.
- 35. Harrison PW, Amode MR, Austine-Orimoloye O et al. Ensembl 2024. Nucleic Acids Res. 2024; 52:D891–9. 10.1093/nar/gkad1049.
- 36. Huttlin EL, Ting L, Bruckner RJ et al. The BioPlex Network: a systematic exploration of the human interactome. Cell. 2015; 162:425–40. 10.1016/j.cell.2015.06.043.
- 37. Michaelis AC, Brunner A-D, Zwiebel M et al. The social and structural architecture of the yeast protein interactome. Nature. 2023; 624:192–200. 10.1038/s41586-023-06739-5.
- 38. Oughtred R, Rust J, Chang C et al. The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci. 2021; 30:187–200. 10.1002/pro.3978.
- 39. Meyers RM, Bryan JG, McFarland JM et al. Computational correction of copy number effect improves specificity of CRISPR–Cas9 essentiality screens in cancer cells. Nat Genet. 2017; 49:1779–84. 10.1038/ng.3984.
- 40. Thompson NA, Ranzani M, van der Weyden L et al. Combinatorial CRISPR screen identifies fitness effects of gene paralogues. Nat Commun. 2021; 12:1302. 10.1038/s41467-021-21478-9.
- 41. Parrish PCR, Thomas JD, Gabel AM et al. Discovery of synthetic lethal and tumor suppressor paralog pairs in the human genome. Cell Rep. 2021; 36:109597. 10.1016/j.celrep.2021.109597. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Gonatopoulos-Pournatzis T, Aregger M, Brown KR et al. Genetic interaction mapping and exon-resolution functional genomics with a hybrid Cas9–Cas12a platform. Nat Biotechnol. 2020; 38:638–48. 10.1038/s41587-020-0437-z. [DOI] [PubMed] [Google Scholar]
- 43. Ito T, Young MJ, Li R et al. Paralog knockout profiling identifies DUSP4 and DUSP6 as a digenic dependence in MAPK pathway-driven cancers. Nat Genet. 2021; 53:1664–72. 10.1038/s41588-021-00967-z. [DOI] [PubMed] [Google Scholar]
- 44. Costanzo M, VanderSluis B, Koch EN et al. A global genetic interaction network maps a wiring diagram of cellular function. Science. 2016; 353:aaf1420. 10.1126/science.aaf1420. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Varadi M, Bertoni D, Magana P et al. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 2024; 52:D368–75. 10.1093/nar/gkad1011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Kempen Mv, Kim SS, Tumescheit C et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol. 2024; 42:243–6. 10.1038/s41587-023-01773-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Steinegger M, Söding J MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017; 35:1026–8. 10.1038/nbt.3988. [DOI] [PubMed] [Google Scholar]
- 48. UniProt Consortium Bateman A, Martin M-J et al. UniProt: the Universal Protein knowledgebase in 2023. Nucleic Acids Res. 2023; 51:D523–31. 10.1093/nar/gkac1052. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Heidarian H, Dinneen D A hybrid geometric approach for measuring similarity level among documents and document clustering. 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), Oxford, UK, 2016. 2016; Piscataway, NJ, USA: IEEE; 142–51. [Google Scholar]
- 50. Lin Z, Akin H, Rao R et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023; 379:1123–30. 10.1126/science.ade2574.
- 51. Yeung W, Zhou Z, Mathew L et al. Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies. Brief Bioinform. 2023; 24:bbac619. 10.1093/bib/bbac619.
- 52. Cherry JM, Hong EL, Amundsen C et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 2012; 40:D700–5. 10.1093/nar/gkr1029.
- 53. Zhao C, Wang Z. GOGO: an improved algorithm to measure the semantic similarity between gene ontology terms. Sci Rep. 2018; 8:15107. 10.1038/s41598-018-33219-y.
- 54. Braun P, Tasan M, Dreze M et al. An experimentally derived confidence score for binary protein–protein interactions. Nat Methods. 2009; 6:91–7. 10.1038/nmeth.1281.
- 55. Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins. 2004; 57:702–10. 10.1002/prot.20264.
- 56. Mariani V, Biasini M, Barbato A et al. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics. 2013; 29:2722–8. 10.1093/bioinformatics/btt473.
- 57. Heinzinger M, Elnaggar A, Wang Y et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics. 2019; 20:723. 10.1186/s12859-019-3220-8.
- 58. Bepler T, Berger B. Learning the protein language: evolution, structure, and function. Cell Syst. 2021; 12:654–69. 10.1016/j.cels.2021.05.017.
- 59. Ben-Hur A, Noble WS. Choosing negative examples for the prediction of protein–protein interactions. BMC Bioinformatics. 2006; 7:S2. 10.1186/1471-2105-7-S1-S2.
- 60. Feng Y, Long Y, Wang H et al. Benchmarking machine learning methods for synthetic lethality prediction in cancer. Nat Commun. 2024; 15:9058. 10.1038/s41467-024-52900-7.
- 61. Kruger FA, Overington JP. Global analysis of small molecule binding to related protein targets. PLoS Comput Biol. 2012; 8:e1002333. 10.1371/journal.pcbi.1002333.
- 62. Kachroo AH, Laurent JM, Yellman CM et al. Systematic humanization of yeast genes reveals conserved functions and genetic modularity. Science. 2015; 348:921–5. 10.1126/science.aaa0769.
- 63. Laurent JM, Garge RK, Teufel AI et al. Humanization of yeast genes with multiple human orthologs reveals functional divergence between paralogs. PLoS Biol. 2020; 18:e3000627. 10.1371/journal.pbio.3000627.
- 64. Lai H-Y, Yu Y-H, Jhou Y-T et al. Multiple intermolecular interactions facilitate rapid evolution of essential genes. Nat Ecol Evol. 2023; 7:745–55. 10.1038/s41559-023-02029-5.
Data Availability Statement
The data—including all computed features, datasets, and scripts used in this article—are available on GitHub at https://github.com/cancergenetics/paralog_seq_similarity and on Zenodo at https://zenodo.org/records/14975581. The produced features, datasets, and predictions are available on Zenodo at https://zenodo.org/records/14973633.







