Abstract
Motivation
Representation learning has revolutionized sequence-based prediction of protein function and subcellular localization. Protein networks are an important source of information complementary to sequences, but the use of protein networks has proven to be challenging in the context of machine learning, especially in a cross-species setting.
Results
We leveraged the STRING database of protein networks and orthology relations for 1322 eukaryotes to generate network-based cross-species protein embeddings. We did this by first creating species-specific network embeddings and subsequently aligning them based on orthology relations to facilitate direct cross-species comparisons. We show that these aligned network embeddings ensure consistency across species without sacrificing quality compared to species-specific network embeddings. We also show that the aligned network embeddings are complementary to sequence embedding techniques, despite the use of sequence-based orthology relations in the alignment process. Finally, we validated the embeddings by using them for two well-established tasks: subcellular localization prediction and protein function prediction. Training logistic regression classifiers on aligned network embeddings and sequence embeddings improved the accuracy over using sequence alone, reaching performance numbers close to state-of-the-art deep-learning methods.
Availability and implementation
The source code and scripts for generating the network-based cross-species protein embeddings are available at https://github.com/deweihu96/SPACE. Precomputed network embeddings and sequence embeddings for all eukaryotic proteins are included in STRING version 12.0 (https://string-db.org/cgi/download).
1 Introduction
Machine learning has long been used to enhance our understanding of proteins by making predictions about their characteristics (Kouba et al. 2023), including functions, subcellular localizations, and structure. This works because machine learning can identify complex patterns that would otherwise be hard to find, e.g. in protein sequences. Over the decades, the growth in biological data combined with advances in machine learning techniques has considerably improved the accuracy of these predictions, helped elucidate the functions and localizations of proteins (Wang et al. 2014, Bonetta and Valentino 2020), and been applied in the discovery of new drugs (Vamathevan et al. 2019) and biomarkers (Rost et al. 2016).
Over the past several years, protein language models trained on vast amounts of protein sequences have revolutionized computational biology (Rives et al. 2021, Brandes et al. 2022, Elnaggar et al. 2022, Lin et al. 2023). These models, such as ProtT5 (Elnaggar et al. 2022) and ESM2 (Lin et al. 2023), trained on more than 45 million protein sequences from UniRef50 dataset (Suzek et al. 2015), learn to represent protein sequences as high-dimensional vectors, encapsulating both protein domains and shorter motifs in the sequences. These protein embeddings serve as a foundation for further analysis and are often employed within smaller, labeled datasets to train supervised models. The impact of this approach has been broad; protein language models are now used by all state-of-the-art tools for predicting protein function (Wang et al. 2023), subcellular localization (Thumuluri et al. 2022), and post-translational modifications (Pokharel et al. 2022).
While sequence embeddings have revolutionized protein sequence analysis by enabling precise predictions across various tasks, protein–protein interaction (PPI) networks (De Las Rivas and Fontanillo 2010) serve as a valuable complementary information source that captures the complex interplay between proteins, crucial for understanding biological processes and disease mechanisms (Vendruscolo and Fuxreiter 2022, Hasselgren and Oprea 2024). For instance, node2loc (Pan et al. 2022) demonstrated that PPI-based embeddings can perform better in subcellular localization prediction compared to sequence embeddings, and NetGO (You et al. 2019) shows that incorporating PPIs through nearest-neighbor searches can enhance function prediction accuracy. These studies highlight the potential of PPI networks as a complementary information source for various predictive tasks. An important source of PPI networks is the STRING database, which comprises protein physical interactions and functional associations for 12 535 species, including 1322 eukaryotes (Szklarczyk et al. 2023). These interactions are integrated from curated databases, literature mining, computational prediction, and orthology-based knowledge transfer between species.
Network-based protein embeddings are typically generated using random-walk techniques such as deepwalk (Perozzi et al. 2014) and node2vec (Grover and Leskovec 2016), or graph neural networks (GNNs) (Kipf and Welling 2016, Zhou et al. 2020). Both deepwalk and node2vec employ random walks to learn node representations, but node2vec uses biased walks to balance local and global network exploration. Compared with GNNs, node2vec offers several advantages, particularly for unlabeled graphs like STRING. As an unsupervised method, node2vec does not require node features or labeled data. Additionally, node2vec generally exhibits better scalability to large graphs and is often more computationally efficient than many GNN architectures (Khoshraftar and An 2024). Moreover, as demonstrated by Liu et al. (2023), the weighted version of node2vec, which considers edge weights while performing random walks, outperforms GNNs specifically on weighted, undirected, and unlabeled networks from STRING.
However, when node2vec is applied separately to multiple networks, e.g. PPI networks from different species, the resulting embeddings are not directly comparable. Consequently, proteins with similar characteristics across networks may have completely different embeddings. This hinders the use of network protein embeddings in tasks that span multiple species and prevents knowledge transfer between species (Martins 2023, Yuan et al. 2024). To deal with this challenge, researchers have turned to embedding alignment techniques as a potential solution. Embedding alignment, as defined by Kalinowski and An (2020), is the task of finding a mapping between two vector spaces representing embeddings of different datasets, enabling tasks such as cross-lingual translation (Mikolov et al. 2013, Joulin et al. 2018, Patra et al. 2019) and knowledge graph (KG) integration (Heimann et al. 2018, Chu et al. 2019, Du and Tong 2019).
To create cross-species network protein embeddings, past work has used orthologs as anchors through several approaches: network kernels for pairwise directed alignment (Fan et al. 2019), autoencoders for pairwise bidirectional alignment (Li et al. 2023), and node2vec applied to multi-species networks including ortholog relationships as edges (Mancuso et al. 2024). By aligning the protein networks of diverse species, these approaches enhanced the precision and robustness of cross-species protein predictions, marking a substantial improvement over traditional single-species network embeddings (Fan et al. 2019, Li et al. 2023, Mancuso et al. 2024). However, existing methods for creating multi-species embeddings (Li et al. 2023, Mancuso et al. 2024) do not scale to the number of eukaryotic species in STRING. FedCoder (Baumgartner et al. 2023) is an innovative approach to integrating multiple KG embedding spaces into a uniform “Web of Embeddings”. In FedCoder, autoencoders are applied to learn mappings for embeddings from individual KGs that project into a shared latent space. In the latent space of the autoencoders, linked entities from different KGs have similar representations, effectively aligning the original embedding spaces. A key advantage of FedCoder for aligning multiple networks is its scalability, exhibiting linear computational complexity with respect to the number of embedding spaces, in contrast to pairwise methods that scale quadratically. Additionally, the performance of FedCoder improves as more embedding spaces are integrated.
In this study, we introduce SPACE, a comprehensive set of embeddings for all eukaryotic proteins in the STRING database. SPACE includes pre-calculated aligned cross-species network embeddings generated with FedCoder (Baumgartner et al. 2023) as well as protein sequence embeddings from the ProtT5 model. We show that the network embeddings are well-aligned across eukaryotes without sacrificing quality in the individual species and are complementary to sequence embeddings. We further demonstrate the effectiveness of both types of embeddings by applying them to two cross-species prediction tasks, namely subcellular localization prediction and protein function prediction. To facilitate widespread use of SPACE as a foundation for prediction tasks, we make the full set of pre-computed network and sequence embeddings available from the STRING website (https://string-db.org/cgi/download).
2 Methods
2.1 Dataset
The datasets used to generate SPACE include protein sequences, protein functional association networks, and orthologs. The protein sequences and functional association networks are sourced from STRING version 12.0 (Szklarczyk et al. 2023), using the full networks including all evidence channels and orthology-transferred associations. Detailed network statistics for all species are provided as Table S12, available as supplementary data at Bioinformatics online. The orthologs that serve as alignment anchors are derived from the hierarchical orthologous groups (OGs) of eggNOG 6.0, which are based on sequence similarity alone (Hernández-Plaza et al. 2023). We construct pairs of orthologs for any two species based on the OGs at the taxonomic level of their last common ancestor. For instance, if species A and B have i and j unique proteins, respectively, within an OG, this results in i × j orthologous pairs.
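The pair construction can be illustrated with a minimal Python sketch. The dictionary layout and the function name `ortholog_pairs` are hypothetical and are not taken from the SPACE code base; the sketch only shows that an OG with i proteins from species A and j proteins from species B yields i × j pairs.

```python
from itertools import product

def ortholog_pairs(og_members, species_a, species_b):
    """Return all cross-species protein pairs within one orthologous group.

    og_members maps a species identifier to the proteins of that species
    in the group (hypothetical layout, chosen for illustration only).
    """
    proteins_a = sorted(set(og_members.get(species_a, [])))
    proteins_b = sorted(set(og_members.get(species_b, [])))
    # Cartesian product: i proteins in A and j proteins in B give i * j pairs.
    return list(product(proteins_a, proteins_b))

og = {"A": ["a1", "a2"], "B": ["b1", "b2", "b3"]}
pairs = ortholog_pairs(og, "A", "B")
print(len(pairs))  # 2 * 3 = 6
```

Many-to-many groups therefore contribute disproportionately many pairs, which is the motivation for down-weighting them during alignment.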
In the network alignment process, we start with 48 seed species carefully selected to include a diverse range of well-studied organisms. All the other eukaryotic species are defined as non-seed species. For seed-species selection specifically, we evaluated the STRING functional association networks, excluding curated database and evidence transferred by orthology, against the KEGG database (Kanehisa et al. 2021). For all subsequent embedding, alignment, and benchmarking steps, we used the full STRING networks, including all evidence channels.
The 48 seed species were selected for hyperparameter tuning of both the individual network embedding (using node2vec) and the network embedding alignment. Since node2vec cannot generate embeddings for unseen nodes, we randomly partition the edges into six partitions. Self-loops were added to singleton proteins that were not connected to any other proteins in the training set. The first five partitions were used for cross-validation, while the sixth served as a test set. For the network embedding alignment, to prevent the orthology information leakage, we instead partition the proteins based on orthology. Proteins from the networks were partitioned into six groups, ensuring that all proteins from the same OGs were assigned to the same partition. Similarly, the first five partitions were utilized for cross-validation, with the sixth partition reserved as a test set.
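The orthology-aware split can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes each protein maps to exactly one OG, and the names `partition_by_og` and `protein_to_og` are hypothetical.

```python
import random
from collections import defaultdict

def partition_by_og(protein_to_og, n_partitions=6, seed=0):
    """Assign proteins to partitions so that each OG stays in one partition."""
    ogs = sorted(set(protein_to_og.values()))
    rng = random.Random(seed)
    rng.shuffle(ogs)
    # Round-robin over the shuffled OGs; proteins inherit their OG's partition.
    og_to_part = {og: idx % n_partitions for idx, og in enumerate(ogs)}
    partitions = defaultdict(list)
    for protein, og in protein_to_og.items():
        partitions[og_to_part[og]].append(protein)
    return dict(partitions), og_to_part

# Toy input: 12 OGs with two proteins each.
protein_to_og = {f"p{og}_{k}": f"OG{og}" for og in range(12) for k in range(2)}
partitions, og_to_part = partition_by_og(protein_to_og)

cv_folds = [partitions[i] for i in range(5)]  # first five partitions: cross-validation
test_set = partitions[5]                      # sixth partition: held-out test set
```

Because whole OGs are assigned together, no orthologous pair can straddle a training fold and the test set, which is the leakage the split is meant to prevent.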
2.2 ProtT5 for protein sequence embeddings
We generated 1024-dimensional protein sequence embeddings with the encoder-only ProtT5-XL-UniRef50 model (Elnaggar et al. 2022), Rostlab/prot_t5_xl_half_uniref50-enc (https://huggingface.co/Rostlab/prot_t5_xl_half_uniref50-enc). ProtT5-XL-UniRef50 is a self-supervised protein language model built upon the T5-3B (Raffel et al. 2020) model, trained on the UniRef50 (Suzek et al. 2015) dataset, which contains more than 45 million protein sequences. We selected the half-precision, encoder-only version of the model, based on the original publication (Elnaggar et al. 2022) and its use in other downstream tasks (Bernhofer and Rost 2022, Heinzinger et al. 2022, Villegas-Morcillo et al. 2022). The ProtT5 model produces an embedding for each amino acid in a protein sequence, and we average the amino acid embeddings to obtain the embedding of a protein. The full-length sequences of all proteins were fed into the model, using CPU mode for the fewer than 1000 proteins that could not fit in GPU memory.
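The averaging step amounts to mean pooling over the sequence length. The sketch below uses toy 3-dimensional vectors in place of the 1024-dimensional per-residue outputs and does not call the model itself; the function name `mean_pool` is hypothetical.

```python
def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-length protein vector."""
    if not residue_embeddings:
        raise ValueError("cannot pool an empty sequence")
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / length for d in range(dim)]

# Two residues, three dimensions each (toy values).
toy = [[1.0, 0.0, 2.0], [3.0, 4.0, 2.0]]
protein_vector = mean_pool(toy)
print(protein_vector)  # [2.0, 2.0, 2.0]
```

The pooled vector has the same dimensionality regardless of protein length, which is what allows a single classifier to consume proteins of any size.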
2.3 node2vec for single-species network embedding
The node2vec algorithm (Grover and Leskovec 2016) was used to create network-based protein embeddings for each species. node2vec is a random-walk-based algorithm (Le 2017, Xia et al. 2020) that efficiently maps nodes in a graph to a low-dimensional space of features. We used a weighted version of node2vec, which considers the edge weights in the networks to create 128-dimensional embeddings that capture the network’s topological information, encoding both local and global structural properties. A grid search of predefined hyperparameter search spaces (Table S1, available as supplementary data at Bioinformatics online) was carried out, and the results were expressed as mean and standard deviation link prediction (Table S2, available as supplementary data at Bioinformatics online) scores over the cross-validation set. The hyperparameters producing the best link prediction scores across all seed species were selected and used for all the species. The node2vec implementation of PecanPy (https://pypi.org/project/pecanpy/) was used throughout our work.
2.4 FedCoder for multiple networks embedding alignment
To apply FedCoder (Baumgartner et al. 2023) to biological orthology data, we implemented a weighting mechanism (Eq. (1)) for many-to-many orthology relationships and modified the loss function balance parameter. For each pair of orthologous proteins i and j, we define the weighting factor used in the alignment process as:
| (1) |
where the two terms are the occurrence frequencies of protein i and protein j among the orthologous groups of the two species. This weighting mechanism treats orthologs with higher weights as more important than down-weighted many-to-many orthologs.
To reduce the complexity of training and improve the data quality, we excluded pairs with weights smaller than a threshold value, which we treated as a hyperparameter. We reparameterized the original FedCoder loss-balance parameter to provide more intuitive hyperparameter tuning, with the new parameter ranging from 0 to 1.
Equation (2) shows the overall loss function. It consists of two parts: the alignment loss and the reconstruction loss, the latter summed over all N species:
| (2) |
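As a hedged illustration only, a loss with the structure described here can be written as a convex combination of the two parts. The symbols $\alpha$, $\mathcal{L}_{\mathrm{align}}$, and $\mathcal{L}_{\mathrm{rec}}^{(n)}$ are notation assumed for this sketch, and attaching $\alpha$ to the alignment term is an assumption rather than a statement of the exact form of Eq. (2):

```latex
\mathcal{L} \;=\; \alpha \, \mathcal{L}_{\mathrm{align}}
\;+\; (1 - \alpha) \sum_{n=1}^{N} \mathcal{L}_{\mathrm{rec}}^{(n)},
\qquad 0 \le \alpha \le 1
```

Under this assumed form, $\alpha = 1$ would optimize alignment alone and $\alpha = 0$ would reduce training to N independent autoencoders, which is one way a balance parameter bounded between 0 and 1 can be read intuitively.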
2.5 Seed species alignment
We started the alignment with 48 seed species for two main reasons: (1) computational scalability, as it is computationally expensive to load all 1322 eukaryotic species and their orthologous relationships into a single alignment process; and (2) study bias, as aligning understudied species with well-studied species can degrade information and add noise, given that many model organisms are well-studied with dense networks, while many other species in STRING are understudied.
We tuned the hyperparameters using grid search (Table S3, available as supplementary data at Bioinformatics online). The best hyperparameters were selected based on the average alignment loss with standard deviation. Consistent with the original FedCoder publication, an autoencoder with one layer and without activation functions performed the best. However, our experiments showed that omitting negative sampling during alignment further improved alignment quality. We obtained aligned seed species embeddings with 512 dimensions. The autoencoders for both seed and non-seed species were implemented with PyTorch (https://pytorch.org/).
2.6 Incorporating non-seed species
To broaden our analysis, we incorporated non-seed species by aligning their embeddings to those of the previously aligned seed species. We found that aligning non-seed species to their corresponding taxonomic groups (fungi, plants, animals, and protists) yielded better results compared to aligning them with all seed species collectively. Despite not utilizing all seed species in the alignment process, the non-seed species effectively align with the entire seed species set due to the pre-established alignment among seed species. During the integration of non-seed species, we fixed embeddings of seed species and employed a separate autoencoder for each non-seed species. This approach aims to simultaneously (1) minimize the distance between orthologous pairs of seed and non-seed species in the latent space, and (2) reconstruct the input embeddings of non-seed species from the latent space using a decoder. This independent alignment strategy enabled parallel processing of multiple non-seed species and enhanced scalability. To ensure consistency, we retained the same hyperparameters used in the seed species alignment process.
2.7 Alignment quality assessment
To quantitatively assess the quality of cross-species alignment, we first divided orthologous pairs into three categories, namely seed with seed, seed with non-seed, and non-seed with non-seed. For computational efficiency, we randomly selected 1% of the proteins, i.e. one in 10 000 of the orthologous pairs, and generated an equal number of species-matched non-orthologous pairs. We calculated cosine similarities for all the sampled pairs and compared the distributions using the Mann–Whitney U test. We repeated these experiments 10 times and reported the mean and standard deviation of the median value of each experiment (Table S5, available as supplementary data at Bioinformatics online).
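The comparison of orthologous and non-orthologous pairs can be sketched in plain Python. The vectors below are toy values and the function names are hypothetical. Only the U statistic is computed here. In practice a library routine such as `scipy.stats.mannwhitneyu` would also provide the p-value.

```python
import math
from statistics import median

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mann_whitney_u(x, y):
    """U statistic of x versus y: count of (x_i, y_j) with x_i > y_j, ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

# Toy similarities for orthologous pairs and species-matched background pairs.
ortholog_sims = [cosine_similarity([1, 0], [1, 0.1]),
                 cosine_similarity([0, 1], [0.2, 1])]
background_sims = [cosine_similarity([1, 0], [0, 1]),
                   cosine_similarity([1, 0], [-1, 0.5])]

u_stat = mann_whitney_u(ortholog_sims, background_sims)
print(median(ortholog_sims) > median(background_sims))  # True
```

The sampling fraction follows from requiring both members of a pair to be sampled: keeping 1% of proteins retains roughly 0.01 × 0.01 = 0.0001 of the pairs.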
2.8 Singletons in the networks
Some proteins, known as singletons, do not interact with other proteins in our networks. However, their sequence information is still valuable and can be used to generate sequence-based embeddings. To ensure all proteins in STRING are represented in both sequence and network embeddings for downstream tasks, we provide network embeddings for singletons as well. We first scale the network embeddings of all species to a common fixed range. We then assign network embeddings to singletons based on three categories: (1) singletons in interaction orthologous groups, where at least one protein is part of any network: these are assigned the average of their interaction orthologs’ embeddings, with small random noise added to ensure uniqueness; (2) singletons in singleton orthologous groups, where no proteins are part of any network, but they are orthologs to each other: random embeddings are generated for these groups in two narrow ranges, one of which is [0.99, 1.0], which ensures that they are distinct from the scaled network embeddings. Singletons in the same singleton OG receive these embeddings with added random noise; (3) singletons without orthologs: individual embeddings are generated using the same method, ensuring each is unique and appropriately positioned in the embedding space.
2.9 Quality control of the aligned embeddings
KEGG (Kanehisa et al. 2021) provides a detailed collection of pathway maps, which are crucial for understanding biological functions and interactions. To ensure that the alignment process preserves functional information, we compared the aligned embeddings to the original node2vec embeddings using KEGG pathway co-membership. This quality-control step verifies that the alignment does not degrade the biologically meaningful signal of the original networks. A dataset spanning 378 eukaryotic species in STRING was assembled using KEGG pathway annotations. We generated the receiver operating characteristic (ROC) curves for each species by ranking protein pairs based on their cosine similarities in different types of embeddings (node2vec, aligned, and ProtT5). Protein pairs were classified as false positives (FPs) if they did not share any pathway affiliation, or as true positives (TPs) if they co-occurred in at least one KEGG pathway. Protein pairs where at least one member was absent from the KEGG database were excluded from TP/FP designation. These ROC curves plotted cumulative FPs on the x-axis against cumulative TPs on the y-axis. Since the high-precision part of the ROC curve is most relevant for assessing performance, we calculated the partial area under the curve (AUC) up to an FP rate of 0.1% as well as the full AUC. We grouped species by kingdom and compared AUC distributions between embedding methods using the Wilcoxon signed-rank test.
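The ranking-based ROC construction and the partial AUC can be sketched as below. This is an illustration under stated assumptions rather than the evaluation code used for the paper: tied similarity scores are not treated specially, and the partial area is divided by the FP-rate cutoff so that a perfect ranking scores 1, which is one common convention and not necessarily the one used here. The function names are hypothetical.

```python
def roc_points(scored_pairs):
    """Cumulative (FP, TP) counts after ranking pairs by decreasing similarity.

    scored_pairs: list of (cosine_similarity, shares_pathway) tuples.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0, 0)]
    for _, shares_pathway in ranked:
        if shares_pathway:
            tp += 1
        else:
            fp += 1
        points.append((fp, tp))
    return points

def partial_auc(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve up to max_fpr, divided by max_fpr."""
    total_fp, total_tp = points[-1]
    if total_fp == 0 or total_tp == 0:
        raise ValueError("need both positive and negative pairs")
    area = 0.0
    for (fp0, tp0), (fp1, tp1) in zip(points, points[1:]):
        x0, x1 = fp0 / total_fp, fp1 / total_fp
        if x0 >= max_fpr:
            break
        y0, y1 = tp0 / total_tp, tp1 / total_tp
        if x1 > max_fpr:
            # Clip the last segment at the cutoff by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area / max_fpr

pairs = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.1, False)]
points = roc_points(pairs)
full_auc = partial_auc(points)        # whole curve
early_auc = partial_auc(points, 0.5)  # the 0.1% cutoff corresponds to max_fpr=0.001
```

On real data the number of protein pairs per species is large enough that a cutoff of 0.001 still covers many ranked pairs. The toy example uses 0.5 only because it has two negatives.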
2.10 Benchmark datasets and methods
Subcellular localization (Dönnes and Höglund 2004, Dubey and Chouhan 2011) refers to identifying a given protein’s specific location(s) within a cell. This task is essential in cell biology and proteomics because the functionality of proteins is inherently tied to their locations within the cell. To evaluate the performance of our embeddings in predicting subcellular localization, we used the SwissProt cross-validation set and the Human Protein Atlas (HPA) test set from DeepLoc 2.0 (Thumuluri et al. 2022), mapped to STRING identifiers using the UniProt ID mapping (https://www.uniprot.org/id-mapping) and the STRING human aliases file, respectively. The SwissProt cross-validation set contains multiple species, providing a comprehensive evaluation of the embeddings’ performance across different organisms, while the HPA test set focuses on human proteins. To measure the effectiveness of the embeddings, we compared ProtT5 embeddings, aligned embeddings, and the concatenation of aligned and sequence embeddings (SPACE embeddings). For each localization, we trained a logistic regression model to distinguish proteins from that localization from all others (one-vs-rest). Additionally, we compared the performance of DeepLoc 2.0 with our logistic regression models over the HPA subset by running the original DeepLoc 2.0 implementation. The performance was measured by precision–recall curves, accuracy, Jaccard score, MicroF1, MacroF1, and per-localization Matthews correlation coefficient (MCC) (Chicco and Jurman 2020). All the metrics were calculated with scikit-learn (https://pypi.org/project/scikit-learn/). We further tested whether the proteins cluster according to their localizations in the aligned space. We calculated cosine similarities between protein pairs within each localization and between proteins from different localizations, and compared similarity distributions using the Mann–Whitney U test.
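Two ingredients of this benchmark are simple enough to illustrate directly: building the concatenated feature vector and scoring one localization with MCC. The sketch below is a plain-Python illustration with hypothetical function names. It does not train a classifier, and the zero-denominator convention (returning 0) is an assumption of this sketch.

```python
import math

def concatenate(network_vec, sequence_vec):
    """Concatenated features: aligned network embedding followed by sequence embedding."""
    return list(network_vec) + list(sequence_vec)

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for one binary (one-vs-rest) label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0

features = concatenate([0.0] * 512, [0.0] * 1024)
print(len(features))  # 512 + 1024 = 1536

# One localization, four proteins: 1 = annotated to it, 0 = all others.
score = mcc([1, 1, 0, 0], [1, 0, 0, 0])
```

MCC is computed per localization because the classes are highly imbalanced, and a metric that uses all four cells of the confusion matrix is less easily inflated by the majority class than accuracy.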
Furthermore, we explored the capability of SPACE embeddings to predict Gene Ontology (GO) terms (Ashburner et al. 2000, Aleksander et al. 2023), a crucial benchmark for assessing the functional annotation accuracy of protein embeddings. GO term prediction encompasses three aspects: biological processes, cellular components, and molecular functions, offering a comprehensive lens to evaluate our embeddings’ biological relevance and applicability. We built subsets of the training and test data used in NetGO 2.0 (Yao et al. 2021), filtered through the UniProt ID mapping. We trained logistic regression models per GO term on the training set and evaluated the performance on the test set using three types of embeddings: ProtT5, aligned embeddings, and SPACE embeddings. We again used the one-vs-rest strategy for multi-class classification, where each GO term is treated as a separate binary classification problem. We show precision–recall curves for our methods. We also used the same evaluation metrics mentioned in NetGO 2.0 (Yao et al. 2021): maximum MicroF1 (Fmax), area under the precision–recall curve (AUPRC), and minimum semantic distance (Smin). All the metrics were calculated with CAFA-evaluator (https://pypi.org/project/cafaeval/).
3 Results and discussion
3.1 SPACE: pre-calculated sequence and cross-species network embeddings
In this study, we present SPACE (STRING Proteins as Complementary Embeddings), a comprehensive set of protein embeddings for all eukaryotic proteins in the STRING database. SPACE includes pre-calculated aligned cross-species network embeddings generated with FedCoder (see Section 2) as well as 1024-dimensional protein sequence embeddings from the ProtT5 model (Fig. 1a). This approach leverages protein networks and orthology relations for 1322 eukaryotes from the STRING database to create network-based cross-species protein embeddings. The SPACE workflow begins by creating 128-dimensional species-specific network embeddings using node2vec, which captures the information from protein–protein interaction networks within each species. These embeddings are then aligned across species using a two-step process. First, embeddings from 48 selected seed species are aligned using the FedCoder method. This process employs per-species autoencoders to decrease the distance between orthologs in the latent space while preserving the original network information for each species, resulting in 512-dimensional cross-species embeddings. Subsequently, using a similar encoder–decoder architecture, the embeddings for non-seed species are aligned to the established latent space of their corresponding taxonomic groups (fungi, plants, animals, or protists).
Figure 1.
SPACE workflow and demonstration of successful cross-species embedding alignment. (a) Overview of the SPACE workflow. The pipeline begins with input from the STRING database in two forms: protein–protein interaction networks and protein sequences. The networks are processed through node2vec to generate 128-dimensional species-specific embeddings. The network alignment process first aligns 48 seed species using the FedCoder method to create a 512-dimensional shared latent space, then aligns each remaining non-seed species to its corresponding taxonomic group (fungi, plants, animals, or protists) in this established latent space using autoencoders. In parallel, sequences are processed through the ProtT5 encoder to generate sequence embeddings. (b) UMAP visualization demonstrating the effectiveness of the cross-species embedding alignment. The plots show aligned network protein embeddings for four evolutionarily diverse seed species (Homo sapiens, Saccharomyces cerevisiae, Arabidopsis thaliana, and Dictyostelium discoideum) and one non-seed species (Rattus norvegicus). Colored points represent proteins from the named species, while gray points show the background distribution of proteins from other species. The overlapping patterns in the embeddings demonstrate successful alignment, with some regions representing functional associations found throughout eukaryotes and others representing functions specific to particular kingdoms. The unmapped cluster from R. norvegicus is mainly composed of olfactory proteins.
3.2 SPACE embeddings maintain pathway integrity
We validated that the aligned embeddings retain pathway-level information from the STRING network as effectively as node2vec embeddings. KEGG pathway annotations (Kanehisa et al. 2021) were used for evaluation, and ProtT5 sequence embeddings were included as a baseline for comparison. Due to space constraints, we manually selected 12 seed species covering a wide range in the phylogenetic tree, as shown in Fig. 2. Detailed figures for all other species are available in Fig. S5, available as supplementary data at Bioinformatics online.
Figure 2.
Comparison of protein embedding methods across diverse eukaryotic species using KEGG pathways. The plots show receiver operating characteristic (ROC) curves comparing three different embedding approaches: aligned network embeddings (solid lines), node2vec embeddings (dashed lines), and ProtT5 sequence embeddings (dotted lines). Results are presented for 12 representative species divided into four panels: (top left) animal species including Homo sapiens, Drosophila melanogaster, and Danio rerio; (top right) plants including Arabidopsis thaliana, Zea mays, and Solanum tuberosum; (bottom left) fungi including Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Cryptococcus neoformans; and (bottom right) protists including Dictyostelium discoideum, Trypanosoma brucei, and Plasmodium falciparum. For each species, all possible protein pairs where both proteins are annotated in KEGG pathways were evaluated and ranked by their cosine similarity scores. The x-axis shows cumulative false positives, while the y-axis shows cumulative true positives. True positives are defined as protein pairs sharing at least one KEGG pathway, while false positives are pairs without shared pathways. The curves demonstrate that aligned embeddings maintain pathway information comparable to original node2vec embeddings across most species, while both network-based methods consistently outperform sequence embeddings in capturing pathway relationships.
These curves show that the aligned embeddings (solid lines) capture pathway information as accurately as the node2vec embeddings (dashed lines) that they are based on. The aligned embeddings slightly outperform node2vec for some species (e.g. Homo sapiens and Dictyostelium discoideum), whereas node2vec performs better for certain other species (e.g. Arabidopsis thaliana). The only notable outlier is Saccharomyces cerevisiae, which performs worse after alignment. This is likely explained by this species having one of the highest-quality networks in STRING and thus an unusually good node2vec embedding.
While both aligned and node2vec embeddings show varying agreement with pathways, sequence embeddings (dotted lines) derived from ProtT5 consistently underperform in the KEGG benchmark. This performance gap can be attributed to the nature of the information that sequence embeddings capture. Sequence embeddings are designed to encode protein structure, domains, and local motifs (Elnaggar et al. 2022). Conversely, they do not incorporate information on interactions, which are crucial for understanding biological processes and pathways.
Statistical analysis of all ROC curves shows that the aligned and node2vec embeddings perform much better than ProtT5 across all kingdoms, and that the aligned embeddings perform slightly better than node2vec except in fungi (Figs S2 and S3 and Table S6, available as supplementary data at Bioinformatics online). This is consistent with STRING already transferring evidence across species based on orthology; one should thus not expect the alignment process to improve the embeddings for individual species.
In summary, our alignment process succeeds in preserving the original information contained in the node2vec embeddings while enabling cross-species comparability. This retention of species-specific network characteristics, combined with their support for cross-species comparisons, makes the aligned embeddings a valuable complement to sequence embeddings, with the former capturing the complex interplay of proteins and the latter providing information about protein structure and evolution. However, because the STRING networks already include orthology-transferred associations, the potential gains from alignment may be limited in species that already benefit from such cross-species evidence, particularly in fungi.
3.3 Combining embeddings improves cross-species subcellular localization prediction
To evaluate the usefulness of ProtT5 embeddings, aligned embeddings, and their concatenation (SPACE embeddings), we test them on downstream tasks. The first such task is subcellular localization prediction using the DeepLoc 2.0 (Thumuluri et al. 2022) datasets (after id-mapping, detailed in Section 2). We retained 24 816 out of 28 303 proteins, covering 144 species, from the Swiss-Prot cross-validation set, and 1646 out of 1717 proteins from the Human Protein Atlas (HPA) test set, which consists solely of human proteins. We trained logistic regression models to predict each localization from the embeddings. The predictive performance is shown as precision-recall curves (Fig. 3a and b).
Figure 3.
Precision–recall curves of different embeddings and visualization of SPACE embeddings in protein subcellular localization prediction. (a) Precision–recall curves on SwissProt cross-validation set (24 816 proteins across 144 species) comparing SPACE (concatenation of aligned network and ProtT5 sequence embeddings, red), aligned network embeddings (black), and ProtT5 sequence embeddings (gray). (b) Precision–recall curves on Human Protein Atlas (HPA) test set (1646 human proteins), with DeepLoc2 predictions (blue star) included as an additional baseline. The curves demonstrate that SPACE embeddings consistently maintain higher precision across all recall values compared to individual embedding types. (c) UMAP visualization of aligned network embeddings based on their projections onto logistic regression weight vectors for subcellular localization prediction. The distinct clustering patterns demonstrate that the aligned embeddings successfully capture protein localization information across multiple species, with clear separation observed for major cellular compartments such as nucleus, mitochondrion, and cell membrane. Proteins with multiple localizations were excluded from this visualization to ensure clear compartment separation.
We examined precision–recall curves on both the Swiss-Prot cross-validation and HPA test sets (Fig. 3a and b). On the Swiss-Prot cross-validation set, SPACE embeddings consistently maintain higher precision across all recall values compared to both aligned network embeddings and ProtT5 embeddings (Fig. 3a). This advantage is particularly pronounced at recall values of 0.25 and above, where SPACE can make more predictions at the same precision. The performance difference between SPACE and ProtT5 is further confirmed on the HPA test set (Fig. 3b). Moreover, all the embedding-based predictors outperform DeepLoc 2.0 on the HPA test set (a direct comparison cannot be made for the Swiss-Prot dataset). Standard performance metrics are provided in Tables S8 and S9, available as supplementary data at Bioinformatics online.
To explore the performance of the embedding-based predictors in more detail, we looked at the Matthews correlation coefficients (MCCs) for individual localizations (Tables S8 and S9, available as supplementary data at Bioinformatics online). Aligned and SPACE embeddings allow for better predictions for some localizations that are challenging for sequence embeddings. For example, the targeting mechanisms for lysosome/vacuole, Golgi apparatus, and peroxisome are complex and make sequence-based prediction challenging. Lysosomal proteins require both signal sequences and subsequent mannose 6-phosphate (M6P) modifications in the Golgi apparatus, but they share trafficking machinery and hydrolase pathways (Braulke and Bonifacino 2009). Golgi-resident proteins have specific targeting/retention mechanisms (Munro 1998), and proteins passing through the Golgi apparatus carry various targeting signals for their final destinations, but they participate in conserved glycosylation and sorting complexes (De Matteis and Luini 2008). Peroxisomal proteins use two different targeting signals (PTS1/PTS2) but engage in metabolic pathways with shared import machinery (PEX proteins) (Saleem et al. 2006). Despite this complexity, these organelles have distinct functions, which are captured by the STRING network. This likely explains why network-based embeddings perform better than sequence embeddings alone for these organelles, highlighting how functional context can be more informative than sequence features for predicting localization.
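For readers less familiar with the metric, the following minimal sketch computes the MCC for a single localization from binary labels and binary predictions. The function name and the toy labels are hypothetical, and returning 0 when the denominator vanishes is a common convention assumed here rather than something specified in this study.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary (0/1) labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumed): report 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy example for one localization: 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
score = mcc(y_true, y_pred)  # tp=2, tn=4, fp=1, fn=1 -> 7/15
```

Because the MCC uses all four cells of the confusion matrix, it remains informative for rare localizations where accuracy would be dominated by true negatives.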
We further visualized the aligned embeddings of the Swiss-Prot dataset, excluding proteins with multiple locations, with UMAP to demonstrate the biological relevance of our aligned embeddings (Fig. 3c). To obtain a UMAP that best shows the information relevant to subcellular localization, we used the projection of the aligned embeddings onto the set of weight vectors from the logistic regressions as input. This representation shows that proteins cluster according to their cellular compartments. The distinct grouping of proteins from certain subcellular locations, such as nucleus, mitochondrion, and cell membrane, shows that the aligned embeddings successfully capture protein localization across multiple species. Consistent with this, we observe that proteins with the same subcellular localization have higher cosine similarity than proteins with different localizations. This is statistically significant for every individual localization (P values are given in Table S10, available as supplementary data at Bioinformatics online).
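The two operations described here, projecting embeddings onto classifier weight vectors and comparing cosine similarities within and between localizations, can be sketched with NumPy as follows. The synthetic embeddings are assumptions, and the class centers are used as a stand-in for fitted logistic regression coefficients; the resulting `projected` matrix is what would be passed to a UMAP implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, dim, n_classes = 50, 8, 3

# Toy embeddings: each class is shifted toward its own random center.
centers = rng.normal(size=(n_classes, dim)) * 3.0
labels = np.repeat(np.arange(n_classes), n_per_class)
emb = centers[labels] + rng.normal(size=(labels.size, dim))

# Stand-in for the logistic regression weight vectors (one row per
# localization); fitted coefficients would be used in practice.
W = centers
projected = emb @ W.T  # shape (n_proteins, n_localizations)

# Cosine similarity between all protein pairs in the embedding space.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
cos = unit @ unit.T

# Compare pairs with the same label (excluding self-pairs) to other pairs.
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(labels.size, dtype=bool)
mean_same = cos[same & off_diag].mean()
mean_diff = cos[~same].mean()
```

A significance test on the two sets of similarities would then be applied per localization; the choice of test is left open here because it is not stated in this section.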
3.4 Combining embeddings enhances cross-species protein function prediction
Our second downstream task is protein function prediction using the NetGO 2.0 (Yao et al. 2021) datasets. We mapped 74 838 proteins across 152 eukaryotic species from the training set of 120 856 prokaryotic and eukaryotic proteins, with an additional 15 827 proteins from 12 species serving as the test set (detailed in Section 2.5). We trained logistic regression models to predict each GO term (Ashburner et al. 2000, Aleksander et al. 2023) from the embeddings. The predictive performance on the test set is shown as precision–recall curves (Fig. 4), and standard performance metrics are provided in Table S11, available as supplementary data at Bioinformatics online.
Figure 4.
Precision–recall curves of different embeddings in protein function prediction. (a) Molecular function, (b) biological process, and (c) cellular component. Each panel compares SPACE (concatenation of aligned network and ProtT5 sequence embeddings, red), aligned network embeddings (black), and ProtT5 sequence embeddings (gray). Stars indicate precision and recall values that yield maximum MicroF1 scores. The curves reveal that SPACE embeddings show particular strength in biological process prediction, while maintaining competitive performance in molecular function and cellular component prediction, highlighting the complementary nature of sequence and network information in capturing different aspects of protein function.
We examined precision–recall curves for each GO aspect (Fig. 4). For molecular functions, ProtT5 sequence embeddings show slightly better performance in the mid-recall range, but SPACE embeddings are marginally better in the higher recall range. For biological processes, SPACE consistently outperforms aligned network embeddings and ProtT5 sequence embeddings across most recall values. Finally, SPACE shows modest advantages at higher recall values for cellular components.
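The operating points marked by stars in Fig. 4 correspond to the maximum micro-averaged F1 score along each curve. A minimal sketch of how such a point can be located is given below; the function name, the toy label matrix, and the threshold grid are hypothetical, and the sketch pools all protein–term pairs, which is one common reading of micro-averaging.

```python
import numpy as np

def max_micro_f1(y_true, scores, thresholds):
    """Return (f1, precision, recall, threshold) at the best pooled F1."""
    best = (0.0, 0.0, 0.0, None)
    for t in thresholds:
        pred = scores >= t
        # Counts are pooled over all protein-term pairs (micro-averaging).
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        if tp == 0:
            continue  # precision and F1 are undefined or zero
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (float(f1), float(precision), float(recall), float(t))
    return best

# Toy example: 3 proteins x 3 GO terms.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
scores = np.array([[0.9, 0.2, 0.6], [0.4, 0.8, 0.1], [0.7, 0.3, 0.5]])
best_f1, best_p, best_r, best_t = max_micro_f1(
    y_true, scores, [0.25, 0.45, 0.55, 0.65, 0.75, 0.85]
)
```

In this toy case the best threshold is 0.55, where four of the five true annotations are recovered without false positives.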
The standard evaluation metrics (Table S11, available as supplementary data at Bioinformatics online) confirm the patterns observed in the precision–recall curves. Across all three GO aspects, SPACE demonstrates robust performance, with similar performance to ProtT5 in both molecular function and cellular component prediction. SPACE particularly shows its strength in predicting biological processes.
The precision–recall curves and the standard metrics show that the molecular function predictions rely heavily on the sequence embeddings, presumably because they encode protein domain information. Meanwhile, biological process prediction benefits from integrating network embeddings that capture protein–protein interactions and pathway relationships.
4 Conclusions
This research presents SPACE, a collection of complementary embeddings for all proteins from the 1322 eukaryotic species in the STRING database. It consists of pre-calculated protein sequence embeddings from the ProtT5-XL-UniRef50 model and aligned, cross-species network embeddings created using node2vec and FedCoder. The aligned network embeddings are designed to complement sequence-based embeddings. SPACE provides these pre-calculated sequence and aligned network embeddings with the aim of supporting a broad range of prediction tasks across species.
Our results show that the embeddings are directly comparable between proteins from different species, that the aligned network embeddings retain the information from species-specific embeddings, and that the sequence and network embeddings are complementary. We demonstrate the latter on two downstream tasks, namely subcellular localization and protein function prediction. The performance achieved on these tasks using simple logistic regression demonstrates the power of the embeddings.
The practical applications of SPACE embeddings are vast, extending beyond basic research to potential clinical implications. For example, cross-species embeddings can be the starting point for training a general predictor of protein interactions among eukaryotic parasites and their hosts. The comparability across species can also improve large-scale functional annotations, especially for less-studied organisms.
In conclusion, our research highlights the potential of combining sequence and cross-species network-based embeddings. We believe that SPACE provides easy access to embeddings that can serve as a foundation for protein-related prediction tasks.
Acknowledgements
We acknowledge Deic, Denmark, for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium through Deic, Denmark, Deic-KU-L5-2023–004.
Contributor Information
Dewei Hu, Novo Nordisk Foundation Center for Protein Research, Department of Cellular and Molecular Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark.
Damian Szklarczyk, Department of Molecular Life Sciences, University of Zurich, Zurich 8057, Switzerland; SIB Swiss Institute of Bioinformatics, Amphipôle, Quartier UNIL-Sorge, Lausanne 1015, Switzerland.
Christian von Mering, Department of Molecular Life Sciences, University of Zurich, Zurich 8057, Switzerland; SIB Swiss Institute of Bioinformatics, Amphipôle, Quartier UNIL-Sorge, Lausanne 1015, Switzerland.
Lars Juhl Jensen, Novo Nordisk Foundation Center for Protein Research, Department of Cellular and Molecular Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark; ZS Discovery, ZS Associates, Kongens Lyngby 2800, Denmark.
Author contributions
Dewei Hu (Conceptualization [equal], Data curation [equal], Formal analysis [lead], Investigation [lead], Methodology [lead], Project administration [equal], Validation [lead], Visualization [lead], Writing—original draft [lead], Writing—review & editing [equal]), Damian Szklarczyk (Data curation [equal], Formal analysis [equal], Resources [equal], Validation [equal], Writing—review & editing [equal]), Christian von Mering (Supervision [equal], Writing—review & editing [equal]), and Lars Jensen (Conceptualization [lead], Formal analysis [supporting], Funding acquisition [lead], Investigation [supporting], Methodology [lead], Project administration [lead], Supervision [lead], Validation [supporting], Visualization [supporting], Writing—original draft [supporting], Writing—review & editing [lead])
Supplementary data
Supplementary data is available at Bioinformatics online.
Conflict of interest: None declared.
Funding
This work was supported by the Novo Nordisk Foundation and the Swiss Institute of Bioinformatics. D.H. and L.J.J. have received funding from the Novo Nordisk Foundation (NNF14CC0001 and NNF20SA0035590); D.S. and C.v.M. have received funding from the Swiss Institute of Bioinformatics. Funding for open access charge: Novo Nordisk Foundation (NNF24SA0098829).
Data availability
The related datasets are available on Zenodo: https://doi.org/10.5281/zenodo.15600639. Precomputed network embeddings and sequence embeddings for all eukaryotic proteins are included in STRING version 12.0 (https://string-db.org/cgi/download).
References
- Aleksander SA, Balhoff J, Carbon S et al.; Gene Ontology Consortium. The gene ontology knowledgebase in 2023. Genetics 2023;224:iyad031.
- Ashburner M, Ball CA, Blake JA et al. Gene ontology: tool for the unification of biology. Nat Genet 2000;25:25–9.
- Baumgartner M, Dell’Aglio D, Paulheim H et al. Towards the web of embeddings: integrating multiple knowledge graph embedding spaces with FedCoder. J Web Semant 2023;75:100741.
- Bernhofer M, Rost B. TMbed: transmembrane proteins predicted through language model embeddings. BMC Bioinformatics 2022;23:326.
- Bonetta R, Valentino G. Machine learning techniques for protein function prediction. Proteins: Struct Funct Bioinform 2020;88:397–413.
- Brandes N, Ofer D, Peleg Y et al. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 2022;38:2102–10.
- Braulke T, Bonifacino JS. Sorting of lysosomal proteins. Biochim Biophys Acta 2009;1793:605–14.
- Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 2020;21:6.
- Chu X, Fan X, Yao D et al. Cross-network embedding for multi-network alignment. In: The World Wide Web Conference. New York, NY, USA: Association for Computing Machinery, 2019, 273–84.
- De Las Rivas J, Fontanillo C. Protein–protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol 2010;6:e1000807.
- De Matteis MA, Luini A. Exiting the Golgi complex. Nat Rev Mol Cell Biol 2008;9:273–84.
- Dönnes P, Höglund A. Predicting protein subcellular localization: past, present, and future. Genom Proteom Bioinform 2004;2:209–15.
- Du B, Tong H. MrMine: multi-resolution multi-network embedding. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. New York, NY, USA: Association for Computing Machinery, 2019, 479–88.
- Dubey A, Chouhan U. Subcellular localization of proteins. Arch Appl Sci Res 2011;3:392–401.
- Elnaggar A, Heinzinger M, Dallago C et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell 2022;44:7112–27.
- Fan J, Cannistra A, Fried I et al. Functional protein representations from biological networks enable diverse cross-species inference. Nucleic Acids Res 2019;47:e51.
- Grover A, Leskovec J. node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: Association for Computing Machinery, 2016, 855–64.
- Hasselgren C, Oprea TI. Artificial intelligence for drug discovery: are we there yet? Annu Rev Pharmacol Toxicol 2024;64:527–50.
- Heimann M, Shen H, Safavi T et al. REGAL: representation learning-based graph alignment. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. New York, NY, USA: Association for Computing Machinery, 2018, 117–26.
- Heinzinger M, Littmann M, Sillitoe I et al. Contrastive learning on protein embeddings enlightens midnight zone. NAR Genom Bioinform 2022;4:lqac043.
- Hernández-Plaza A, Szklarczyk D, Botas J et al. eggnog 6.0: enabling comparative genomics across 12 535 organisms. Nucleic Acids Res 2023;51:D389–94.
- Joulin A, Bojanowski P, Mikolov T et al. Loss in translation: learning bilingual word mapping with a retrieval criterion. arXiv, arXiv:1804.07745, 2018, preprint: not peer reviewed.
- Kalinowski A, An Y. A survey of embedding space alignment methods for language and knowledge graphs. arXiv, arXiv:2010.13688, 2020, preprint: not peer reviewed.
- Kanehisa M, Furumichi M, Sato Y et al. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res 2021;49:D545–51.
- Khoshraftar S, An A. A survey on graph representation learning methods. ACM Trans Intell Syst Technol 2024;15:1–55.
- Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv, arXiv:1609.02907, 2016, preprint: not peer reviewed.
- Kouba P, Kohout P, Haddadi F et al. Machine learning-guided protein engineering. ACS Catal 2023;13:13863–95.
- Le D-H. Random walk with restart: a powerful network propagation algorithm in bioinformatics field. In: 2017 4th NAFOSTED Conference on Information and Computer Science. Hanoi, Vietnam: IEEE, 2017, 242–7.
- Li L, Dannenfelser R, Zhu Y et al. Joint embedding of biological networks for cross-species functional alignment. Bioinformatics 2023;39:btad529.
- Lin Z, Akin H, Rao R et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379:1123–30.
- Liu R, Hirn M, Krishnan A. Accurately modeling biased random walks on weighted networks using node2vec. Bioinformatics 2023;39:btad047.
- Mancuso CA, Johnson KA, Liu R et al. Joint representation of molecular networks from multiple species improves gene classification. PLoS Comput Biol 2024;20:e1011773.
- Martins YC. Analysis of protein–protein interactions networks and cross-species transfer learning comparison for seven organisms. bioRxiv, 2023–06, 2023, preprint: not peer reviewed.
- Mikolov T, Le QV, Sutskever I. Exploiting similarities among languages for machine translation. arXiv, arXiv:1309.4168, 2013, preprint: not peer reviewed.
- Munro S. Localization of proteins to the Golgi apparatus. Trends Cell Biol 1998;8:11–5.
- Pan X, Chen L, Liu M et al. Identifying protein subcellular locations with embeddings-based node2loc. IEEE/ACM Trans Comput Biol Bioinform 2022;19:666–75.
- Patra B, Moniz JRA, Garg S et al. Bilingual lexicon induction with semi-supervision in non-isometric embedding spaces. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, 2019, 184–93.
- Perozzi B, Al-Rfou R, Skiena S. Deepwalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: Association for Computing Machinery, 2014, 701–10.
- Pokharel S, Pratyush P, Heinzinger M et al. Improving protein succinylation sites prediction using embeddings from protein language model. Sci Rep 2022;12:16933.
- Raffel C, Shazeer N, Roberts A et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 2020;21:1–67. http://jmlr.org/papers/v21/20-074.html
- Rives A, Meier J, Sercu T et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci USA 2021;118:e2016239118.
- Rost B, Radivojac P, Bromberg Y. Protein function in precision medicine: deep understanding with machine learning. FEBS Lett 2016;590:2327–41.
- Saleem R, Smith J, Aitchison J. Proteomics of the peroxisome. Biochim Biophys Acta 2006;1763:1541–51.
- Suzek BE, Wang Y, Huang H et al.; UniProt Consortium. Uniref clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 2015;31:926–32.
- Szklarczyk D, Kirsch R, Koutrouli M et al. The string database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res 2023;51:D638–46.
- Thumuluri V, Almagro Armenteros JJ, Johansen AR et al. Deeploc 2.0: multi-label subcellular localization prediction using protein language models. Nucleic Acids Res 2022;50:W228–34.
- Vamathevan J, Clark D, Czodrowski P et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 2019;18:463–77.
- Vendruscolo M, Fuxreiter M. Protein condensation diseases: therapeutic opportunities. Nat Commun 2022;13:5550.
- Villegas-Morcillo A, Gomez AM, Sanchez V. An analysis of protein language model embeddings for fold prediction. Brief Bioinform 2022;23:bbac142.
- Wang S, You R, Liu Y et al. Netgo 3.0: protein language model improves large-scale functional annotations. Genom Proteom Bioinform 2023;21:349–58.
- Wang Z, Zou Q, Jiang Y et al. Review of protein subcellular localization prediction. CBIO 2014;9:331–42.
- Xia F, Liu J, Nie H et al. Random walks: a review of algorithms and applications. IEEE Trans Emerg Top Comput Intell 2020;4:95–107.
- Yao S, You R, Wang S et al. Netgo 2.0: improving large-scale protein function prediction with massive sequence, text, domain, family and network information. Nucleic Acids Res 2021;49:W469–75.
- You R, Yao S, Xiong Y et al. Netgo: improving large-scale protein function prediction with massive network information. Nucleic Acids Res 2019;47:W379–87.
- Yuan H, Mancuso CA, Johnson K et al. Computational strategies for cross-species knowledge transfer and translational biomedicine. arXiv, arXiv:2408.08503, 2024, preprint: not peer reviewed.
- Zhou J, Cui G, Hu S et al. Graph neural networks: a review of methods and applications. AI Open 2020;1:57–81.