Abstract
Embeddings from protein Language Models (pLMs) are replacing evolutionary information from multiple sequence alignments (MSAs) as the most successful input for protein prediction. Is this because embeddings capture evolutionary information? We tested various approaches to explicitly incorporate evolutionary information into embeddings on various protein prediction tasks. While older pLMs (SeqVec, ProtBert) significantly improved through MSAs, the more recent pLM ProtT5 did not benefit. For most tasks, pLM-based methods outperformed MSA-based methods, and the combination of both even decreased performance for some tasks (intrinsic disorder). We highlight the effectiveness of pLM-based methods and find limited benefits from integrating MSAs.
Keywords: Embeddings, Machine learning, Evolutionary information, Protein language models, Multiple sequence alignments, Protein structure prediction, Secondary structure
Subject terms: Machine learning, Bioinformatics, Structural biology, Protein sequence analyses
Introduction
MSA + ML = better protein prediction
Three decades ago, the combination of machine learning1 and evolutionary information derived from multiple sequence alignments (MSAs) brought about a breakthrough in protein secondary structure prediction2,3, i.e., of 1D-structure4. Predictions improved even more for larger families, increasing evolutionary information through diversity5,6. The idea was applied to other aspects of structure prediction in 1D7–10 and 2D11; it was refined for 2D predictions through evolutionary couplings12–14 and, most recently, with a leap in 3D structure prediction through DeepMind’s AlphaFold215. The coupling of MSAs and ML was at the root of almost all state-of-the-art (SOTA) protein prediction methods published since the introduction of this concept 30 years ago2,3.
Embeddings from protein language models (pLMs) replace traditional input
Over the last four years, novel representations of proteins, dubbed embeddings, have emerged from converting models developed for natural language processing (NLP) into protein language models (pLMs)16–22. Embeddings are learned solely from raw protein sequences without requiring additional phenotype annotations (self-supervised), using either auto-regressive pre-training or masked language modeling. Pre-learned embeddings from so-called foundation models have successfully been used as input to a second level of protein prediction tasks (supervised training), surpassing MSA-based SOTA methods17,23–28. For some tasks, embeddings are so informative that even extremely shallow models (few free parameters), such as logistic or linear regression, suffice for the second-step supervised training25,26. In fact, raw embeddings capture information about inter-residue distance constraints without ever having encountered any structural information during training28. Besides improved performance and ease of adaptation to various tasks (ease of use), embedding-based methods often also speed up predictions. In computational biology, most resources are needed for inference, i.e., for applying methods to new proteins, with Internet prediction servers such as PredictProtein29 completing over 10⁷ (10 M) predictions over its 32 years2,29. Thus, the pLM pre-training cost, which is incurred only once, becomes less crucial. If embedding-based methods outperform MSA-based methods, does this imply that pLMs capture evolutionary information? Or do pLMs and MSAs correlate because they capture similar constraints on the observed protein sequences? If embeddings did not capture evolutionary information, we could improve predictions by combining embeddings with evolutionary information.
Here, we hypothesized that embedding-based methods could be improved through evolutionary information. We trained five similar architectures on different amino acid sequence representations and predicted protein secondary structure to test our idea. To avoid bias from any particular pLM, we applied our methodology to three embedding types: SeqVec16, ProtBert-BFD17 (ProtBert), and ProtT5-XL-U5017 (ProtT5). We extended our analysis by comparing embedding- and MSA-based methods for various representative prediction tasks.
Results
MSAs can improve embedding-based predictions
We generated two types of embeddings for each of the three investigated pLMs (SeqVec16, ProtBert17, and ProtT517): MSA embeddings and raw embeddings. We refer to the embeddings explicitly enriched by evolutionary information derived from multiple sequence alignments (MSAs) as MSA embeddings, as opposed to the unaltered raw embeddings. We explicitly added evolutionary information in different ways. Firstly, by adding the readout from MSAs, namely the position-specific scoring matrix (PSSM), as an additional input feature for the supervised training (dubbed PSSMSplit and PSSMConcat). The main difference between the two PSSM-based approaches lies in how the prediction models use the input features: while PSSMConcat concatenates the raw query sequence embedding with the PSSM features and uses them together as a single model input vector, PSSMSplit uses raw embeddings and PSSM features as separate input features, only concatenating learned features from the different inputs at a later layer (Figs. S7, S8). Secondly, we added evolutionary information by generating MSAs for our query sequences and making predictions for each sequence in the alignment. To obtain the prediction for our query sequence, we averaged over the predictions for all proteins in the MSA (dubbed MSACons). Thirdly, we generated embeddings for all sequences in the MSA of a query protein and averaged the embedding vectors column-wise over the MSA. This averaged embedding was then used as input for subsequent methods (MSA embeddings). Figure 1 provides a schematic overview of the employed approaches. We first studied the effect of these alternatives on the well-understood problem of secondary structure prediction in three states (H: helix, E: strand, and O: other, i.e., ¬H ∧ ¬E).
Fig. 1.
Prediction workflows. Schematic overview of the five main approaches for comparing the addition of evolutionary information to an embedding baseline. (a) Raw embeddings: the query sequence is encoded with a pLM to obtain an embedding representation. This embedding is fed into a CNN to generate task-specific predictions. (b) MSA embeddings: for each query sequence, an MSA is generated using MMseqs230. Each sequence in the MSA is embedded with a pLM and the embeddings are averaged column-wise according to the MSA. This averaged embedding is fed into a CNN to generate task-specific predictions. (c) MSA Consensus (MSACons): an MSA is generated for each query sequence using MMseqs230. Each sequence in the MSA is embedded with a pLM and each sequence embedding is fed into a CNN to make predictions for each sequence in the alignment. Final predictions for the query sequence are obtained by averaging predictions column-wise, based on the MSA. (d) PSSMConcat: the query sequence is encoded with a pLM to obtain an embedding representation. Additionally, an MSA is generated using MMseqs230, from which a PSSM is computed. The per-residue embedding and PSSM features are concatenated into feature vectors, which are fed into a CNN to generate task-specific predictions. (e) PSSMSplit: the query sequence is encoded with a pLM to obtain an embedding representation. An MSA is also generated using MMseqs230, from which a PSSM is computed. The embedding and PSSM are provided as separate input features to a single CNN to generate task-specific predictions.
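Both MSA-based averaging strategies reduce to the same column-wise operation, differing only in what is averaged: per-residue embedding vectors (MSA embeddings, panel b) or per-residue class probabilities (MSACons, panel c). The following minimal Python sketch illustrates that shared logic; it is not the published implementation, and `embed_sequence` (a pLM encoder returning an L × 1024 matrix) and `predict_states` (a trained CNN returning L × 3 class probabilities) are hypothetical placeholders.

```python
import numpy as np

def column_average(msa_rows, per_residue_fn, width):
    """Average per-residue vectors column-wise over an MSA.
    msa_rows: aligned strings of equal length, first row = query; '-' marks gaps.
    per_residue_fn: maps an ungapped sequence to an (ungapped length, width) array."""
    query = msa_rows[0]
    sums = np.zeros((len(query), width))
    counts = np.zeros(len(query))
    for row in msa_rows:
        vectors = per_residue_fn(row.replace("-", ""))
        res = 0  # residue index within this (ungapped) sequence
        for col, aa in enumerate(row):
            if aa == "-":
                continue
            sums[col] += vectors[res]
            counts[col] += 1
            res += 1
    keep = [c for c, aa in enumerate(query) if aa != "-"]  # report query positions only
    return sums[keep] / counts[keep][:, None]

# MSA embeddings: average the 1024-d embeddings, then one forward pass of the CNN:
#   avg_emb = column_average(msa, embed_sequence, width=1024)
#   prediction = predict_states(avg_emb)
# MSACons: predict 3-state probabilities per aligned sequence, then average those:
#   prediction = column_average(msa, lambda s: predict_states(embed_sequence(s)), width=3)
```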
For SeqVec16 and ProtBert17 embeddings, the explicit addition of evolutionary information significantly improved secondary structure prediction (at the 95% confidence interval, CI: ±1.96 standard errors), by up to six percentage points for the simpler SeqVec and by almost one for the more advanced ProtBert (assessed by the three-state per-residue accuracy Q3; Fig. 2, Table S1). In contrast, MSAs did not improve predictions for ProtT5 (Fig. 2, Table S1). PSSM-based models and averaging predictions for all proteins in an MSA (MSACons) even remained numerically below the raw embeddings (Fig. 2, Table S1; the decrease was not statistically significant). Visualizing the CNN layers that combine PSSM-based and embedding-based features in our PSSM-based models (PSSMConcat, PSSMSplit) revealed that weights applied to PSSM features are often strongly negative or positive for all embedding types, indicating that the models actively use these features (Figs. S2, S3), even if no significant performance changes could be observed.
Fig. 2.

Secondary structure prediction. We tested transfer learning to predict protein secondary structure (in three states: helix, strand, other) with three pLM “generations”: SeqVec16, ProtBert17, ProtT517. For each, we showed five alternatives: raw: the original raw embeddings; MSA: embedding values averaged over all proteins in the family as defined by the MSA; MSACons: average over the pLM-based prediction for each protein; PSSMConcat: raw embeddings concatenated with the PSSM; PSSMSplit: raw embeddings and PSSM processed by disjunct model layers. Baseline performance approximated random predictions: shuffle labels: random draws from the class distribution; majority class: predicting all residues in class other. The three-state per-residue accuracy (Q3) is given for the TEST100 data set. Error bars mark 1.96 standard errors, i.e., the 95% confidence interval (CI).
The prediction of intrinsically disordered regions in proteins31 is a task for which MSA-based predictions are among the top-performing methods despite the difficulties of aligning such proteins32,33. Nevertheless, embedding-based methods such as SETH, which rely purely on single-sequence input, perform similarly26. For this task, we compared to the MSA-based ODiNPred. The MSA-augmented SETH MSACons performed significantly worse than the original SETH26 (Fig. 3a).
Fig. 3.
Raw embeddings vs MSACons. Explicitly adding evolutionary information from multiple sequence alignments (MSAs; here: MSACons) to embedding-based methods slightly increased performance for three of the four representative prediction tasks (binding, TMH, and TMB). Error bars mark 1.96 standard errors, i.e., the 95% confidence interval (CI). (a) Per-residue Spearman correlation for disorder prediction on the CheZOD11743 dataset; (b) per-residue binding residue F1 prediction performance on the TestSet22524; (c) per-segment Qok for TMHs on 571 helical transmembrane proteins23; (d) per-segment Qok for TMBs on 57 beta-barrel membrane proteins23. The original MSA-based methods were (left to right): ODiNPred43, bindPredictML1734, TOPCONS235, BOCTOPUS236; the original embedding-based methods: SETH26, bindEmbed21DL24, TMbed23.
In the past, our group found that predicting which residues in a protein bind to non-protein substrates is particularly challenging due to the scarcity of reliable experimental data34. For this task, we compared the embedding-based bindEmbed21DL24 and the MSA-based bindPredictML1734 to adding MSA information. Here, augmenting bindEmbed21DL with evolutionary information (MSACons) improved numerically over the original bindEmbed21DL (bindEmbed21DL24 vs. bindEmbed21DL MSACons, Fig. 3b); however, this difference was not statistically significant.
Our final task for comparing raw embeddings as input with embeddings combined with evolutionary information was the prediction of transmembrane segments. We compared top MSA-based methods, such as TOPCONS235 (for transmembrane helices, TMHs) and BOCTOPUS236 (for transmembrane beta-strands, TMBs), with the embedding-based TMbed23. All of those successfully predict transmembrane segments37–39. We found that the per-segment Qok (“Methods” section) slightly improved through MSACons over the original TMbed; for TMHs, this improvement was statistically significant (TMbed23 vs. TMbed MSACons, Fig. 3c,d).
In addition to the tasks discussed, we also evaluated our approach on predicting the conservation of residues within a protein family, based on the embedding-based VESPA25 approach. Due to issues discovered with the reproducibility of VESPA’s reported performance, details for this task are provided in the SOM (SOM Sect. 1.2 Conservation Prediction).
For all four tasks, collected to capture varying aspects of protein prediction, embedding-based methods outperformed or were on par with the best MSA-based methods. Averaging over predictions for all proteins in an MSA (MSACons) improved statistically significantly for TMH predictions; it improved numerically for predicting TMBs and binding residues. In contrast, averages were significantly lower for disorder prediction (at the 95% confidence interval, CI: ±1.96 standard errors).
Raw embeddings mostly outperformed evolutionary information
Comparing values recently published, pLM-based methods outperform or match MSA-based solutions for many protein prediction tasks. On the per-residue level, these include the prediction of secondary structure (Fig. 4a, Table S9), signal peptides40 (Fig. 4c, Table S11), binding residues (Fig. 3b, Table S7), and disordered regions (Fig. 3a, Table S6). On the per-protein level, these include the prediction of CATH-classes41 (Fig. 4e, Table S13), SCOP classes42 (Fig. 4f, Table S14), and subcellular location27 (Fig. 4g, Table S15). On the per-segment level, these tasks include transmembrane prediction (TMHs: Fig. 3c, Table S8; TMBs: Fig. 3d, Table S8).
Fig. 4.
Literature MSA vs. embedding. We compared the performance of MSA-based to embedding-based methods published for eight different protein prediction tasks (baselines: shuffle labels: random draws from the class distribution; random pairs: average TM-score of randomly picking two unrelated proteins). In each panel, values refer to the same dataset and only the numerically best-performing method is shown (more methods in Tables S9–S13). Error bars mark 1.96 standard errors, i.e., the 95% confidence interval (CI), if available in the original literature. The tasks and methods were: (a) Per-residue Q3 performance for secondary structure prediction on the CASP12 dataset54 (methods: NetSurfP-2.055, Ankh56). (b) Per-residue TM-score average for 3D structure prediction on the CASP14 dataset57 (methods: AlphaFold215, ESMFold58). (c) Per-residue MCC performance for signal peptide prediction on the SignalP-5.0 benchmark59 (methods: SignalP-5.059, SignalP-6.040). (d) Per-residue Spearman correlation of mutation effect prediction on the DMS4160 (methods: GEMME61, ESM-1v (fine-tuned)62). (e) Per-protein MCC performance for CATH prediction on the TOP177341 (methods: BLAST63, CATHe41). (f) Per-protein Spearman correlation for SCOP class prediction on ASTRAL 2.0642,64 (methods: TMalign65, MT-LSTM42). (g) Per-protein Q10 performance for localization prediction on the DeepLoc test set66 (methods: DeepLoc66, LA(ProtT5)27). (h) Per-protein Fmax score for GO term prediction on the CAFA3 dataset50 (BPO: biological process ontology; MFO: molecular function ontology; CCO: cellular component ontology) (methods: DomFun51, goPredSim52, GOLabeler53).
The opposite (evolutionary information more accurate than embeddings) remains true for 3D structure prediction44 (Fig. 4b, Table S10), possibly because AlphaFold215 has optimized so many aspects of this task in ways not readily transferable to using pLMs as input45. Similarly, the best evolutionary information-based methods still appear to outperform pLM-based methods in predicting the effect of point mutants (SAV: single amino acid variants; Fig. 4d, Table S12) as measured by deep mutational scanning (DMS/MAVE) experiments25,46–48 and in predicting aspects of function as categorized by Gene Ontology (GO) terms49 (Fig. 4h, Table S16). For both of these problems, advanced machine learning models on top of MSAs continue to outperform simple pLM-based solutions, for the time being50–53. However, for both problems, embedding-based methods, such as VESPA25,46 and goPredSim52, come close without using MSAs.
Discussion
We hypothesized that explicitly using evolutionary information from multiple sequence alignments (MSAs) when inputting embeddings from large protein Language Models (pLMs) could improve the performance of subsequent supervised protein prediction tasks. While we confirmed this hypothesis for earlier pLMs, such as SeqVec16 and ProtBert17, we failed to improve for embeddings from more advanced pLMs, such as ProtT517 (Figs. 2, 3, 4, Tables S5–S16). The likely interpretation of the first finding (improvement through explicitly using MSAs) is that those first pLM generations either failed to capture evolutionary information or did not capture it to the same degree as MSAs. The gradual increase in the power of transfer learning (using the “lessons learned” by pLMs, i.e., the embeddings, as input to subsequent supervised training) with each new pLM generation suggested that pLM embeddings increasingly capture information correlated with evolutionary information taken from MSAs.
Do embeddings capture or correlate with evolutionary information?
The difficult question then becomes: do embeddings from advanced pLMs correlate with evolutionary information, or do they explicitly capture it? Understanding this distinction is important because it impacts how we interpret the utility of embeddings and the underlying insights they provide.
Biologically, correlation would imply that embeddings reflect biophysical and functional constraints on protein sequences. Examples include inter-residue bonds, the balance between hydrophobic and hydrophilic residues67, and binding sites. These constraints are reflected in both MSAs and embeddings from advanced pLMs, suggesting that pLMs can learn the underlying grammar of protein sequences that also constrains MSAs. We conclude that embeddings and MSAs may be correlated because they both reflect these constraints.
Our results did not decisively support or invalidate either hypothesis. We considered several approaches to explore this further. Firstly, we could evaluate if differences in family size result in prediction differences. Previous studies have observed such differences17,46,68. However, the scarcity of high-resolution data for small families and biophysical differences between small and large families further complicate this analysis. For a more in-depth discussion of this, see SOM Sect. 2.1. Limitations of family size.
Secondly, we could investigate the impact of training protein language models without evolutionary information. A pLM trained on large databases like BFD69 may implicitly learn the same constraints captured in an MSA simply by encountering sequences from larger families more frequently. However, training pLMs without this evolutionary information is currently impractical: reducing the data used during pLM training to only non-redundant sequences would require much larger databases than are currently available, making this approach unfeasible in the near future. We provide further details in SOM Sect. 2.2. Limitations of Training pLMs without Evolutionary Information.
Thirdly, systematically biasing training of pLMs towards a specific family could help answer the question if pLMs capture or correlate with the evolutionary information contained in MSAs. However, this approach exceeds our computational resources and may introduce confounding factors that would make drawing conclusions difficult. For more details, see SOM Sect. 2.3. Limitations of biasing pLM training towards Families.
Given these challenges, the most practical way of answering our question was by evaluating if explicitly adding evolutionary information in the form of MSAs or PSSMs improves embedding-based predictions. Our results imply that embeddings capture information that at least correlates with the information in MSAs. Adding MSAs to the embedding-based approaches did not consistently enhance performance. Ultimately, further research on whether embeddings capture or correlate with evolutionary information can help guide future research and tool development. If embedding-based methods only correlate, improving MSA-based methods could still be valuable. If embeddings capture evolutionary information, this may shift the focus in our field toward further development and optimization of pLMs.
Impact of evolutionary distance on prediction performance
To what extent evolutionary information aids in predicting specific protein properties depends on the distance over which these properties are conserved as well as the sequence similarity in the MSA or PSSM, relative to the query sequence. For example, binding residues are not expected to be conserved well below 75% sequence identity70 and will benefit minimally from highly diverse alignments but may benefit from highly similar aligned sequences. Conversely, protein structure is still conserved at thresholds as low as 30% sequence similarity70,71 and is therefore expected to benefit more from diverse MSAs. Similarly, the diversity of evolutionary information captured by or correlated with embeddings is likely to impact which tasks are best suited for embedding-based models. However, the diversity of evolutionary information encoded in embeddings is currently hard to ascertain. These differences are critical for understanding the varied impact of evolutionary information and embeddings across different prediction tasks and should be considered in future model development and evaluations.
Performance gains for embedding-based methods only for early-generation pLMs
For secondary structure prediction, the advance from relatively simple pLMs such as SeqVec16 (based on ELMo72, predicting the next word/amino acid in a sentence/protein sequence) to the more advanced ProtBert17 (based on BERT73) sufficed to shrink the gain from explicitly using MSAs to a numerical improvement that remained statistically insignificant (Fig. 2). Stepping from BERT to T574 (or more precisely, from ProtBert to ProtT517), the numerical advantage of using MSAs (ProtBert, Fig. 2) turned into a numerical disadvantage (ProtT5, Fig. 2). One of the major aspects in the advance of T5 and BERT over, e.g., ELMo, is the concept of attention heads75. This explained why the accurate prediction of protein 3D structure from sequence leaped in the transition from SeqVec to ProtBert and ProtT5, as shown recently28. However, we did not use these attention heads for secondary structure prediction. Thus, the rise of secondary structure prediction performance to a similar level with and without explicitly using MSAs (Fig. 2, Tables S1, S2) implied that especially ProtT5 additionally extracted other constraints from protein sequences relevant for structure formation.
MSA average worsens the prediction of disorder
When evaluating the performance of integrating evolutionary information into existing embedding-based methods by averaging over all single-protein predictions in an MSA (MSACons), we observed that only one of the methods showed a significant improvement (Fig. 3, Tables S5–S16). For the per-residue prediction of disorder (as measured by CheZOD scores76), the MSA-average over predictions (MSACons) significantly decreased performance over the original method (Fig. 3a, Table S6).
This decrease may be related to the difficulty of correctly aligning intrinsically disordered proteins (IDPs)77–79. Disordered regions often have lower conservation and evolve faster, leading to higher variability and potential misalignments80,81. Incorrectly aligned proteins introduce noise, which can degrade the performance of methods relying on MSA-derived features. Furthermore, with the evolutionary conservation of disordered regions being less stringent, orthologs are frequently less similar80,81, challenging the conceptual foundation that MSA information would consistently improve disordered regions. While there are tools specialized in accounting for these challenges, we adhered to our MMseqs230 workflow to ensure consistency across all prediction tasks. Adjusting our approach to optimize MSA generation specifically for disordered proteins would require extensive research, potentially warranting a separate publication.
Similar or better performance for embedding-based methods on most prediction tasks
Our comparison of embedding-based and evolutionary information-based methods in the literature shows that embedding-based models are as accurate or better for most tasks (Figs. 3, 4, Tables S5–S16). The only three exceptions in our set were tasks related to predicting protein function (Fig. 4h, Table S16), the effect of sequence variation upon molecular function (Fig. 4d, Table S12), and 3D structure (Fig. 4b, Table S10).
Most proteins consist of several structural domains82–84. Recent state-of-the-art methods predict aspects of function based on domain information by first chopping full-length proteins into their domain constituents and then learning patterns in these units, e.g., DomFun51 or the recent breakthrough method ChainSaw85. In contrast, the embedding-based goPredSim transfers annotations from a protein of known function to a query protein through the similarity of per-protein embeddings, an approach referred to as embedding-based annotation transfer (EAT)52. The domain-chopping of DomFun, rather than its explicit use of evolutionary information, might ultimately be more relevant for its edge over goPredSim (Fig. 4h, Table S16), which bases its decision on full-length proteins. In fact, embedding-based protein comparisons substantially drop in performance when using full-length proteins rather than domains86.
Methods predicting the effect of sequence variation upon protein function have recently been assessed through the lens of deep mutational scanning (DMS)25,43,46. Most of those analyses suggested that no prediction method consistently outperforms all others25,46,61. While DMS data continue to improve, resources are still limited due to the enormous experimental complexity. Alternative data sets come with their own issues, e.g., most existing methods have been over-trained to fit small data sets, yet other data sets heavily depend on creating MSAs. Given all these issues, whether MSA-based methods outperform pLM-based solutions in predicting the effect of sequence variation upon molecular function remains to be seen.
This left no example other than 3D structure prediction (Fig. 4b, Table S10) for which MSA-based methods (i.e., AlphaFold215) outperformed pLM-based methods. Once again, it is too early to tell whether or not this is rooted in the fact that the developers of this unique breakthrough engineered so many outstanding solutions for so many aspects of the structure prediction challenge. Overall, pLMs already extract important information from MSAs. With every generation of pLMs advancing significantly45, the point at which pLMs can essentially replace MSAs as input for all aspects of protein prediction appears to be getting closer. Clearly, pLMs have become too advanced to benefit from simple MSA-averaging (Fig. 2, Tables S1, S2). However, the lack of improvement does not add any evidence to answer whether embeddings capture evolutionary information directly or whether they, instead, capture the constraints underlying the existing sequences as much as MSAs do. Ultimately, this distinction may not matter in practice as long as pLMs effectively replace MSAs for protein prediction.
Conclusions
Over the past three decades, the evolutionary information captured in multiple sequence alignments (MSAs) of protein families, combined with machine learning, has become the most successful tool to predict aspects of protein structure and function. Recently, embeddings from protein Language Models (pLMs) have replaced MSAs as the standard input for protein prediction. For many tasks, they reach or beat MSA-based predictions. What if we combined MSAs and embeddings? We tested various approaches to incorporate evolutionary information into embeddings. While for earlier pLMs such as SeqVec16 and ProtBert17 the combination significantly improved accuracy, the more recent pLM ProtT517 did not benefit from adding MSAs (Figs. 2, 3, 4). Is this because pLMs capture evolutionary information, or because prediction methods reached some level of saturation? Embedding-based methods often outperform MSA-based methods (Fig. 4, Tables S5–S16). Naturally, better-performing methods are closer to the theoretical performance limit, or saturation point, than those that perform worse. Except for this obvious observation, we have no evidence to claim that saturation explained our findings. Combining MSAs and embeddings even decreased accuracy for disorder, possibly due to bias introduced in the way we created MSAs. However, we could not find any evidence to settle the open question whether pLMs capture the evolutionary information contained in MSAs directly or whether pLMs simply correlate with evolutionary information by capturing the same underlying constraints on sequence space (all observed sequences) in a similar way as MSAs do. Overall, our study highlights the effectiveness of embedding-based methods.
Methods
Data set—secondary structure prediction
We extracted sequences from the PDB87 for 156,897 proteins (resolution ≤ 3.5 Å, ≥ 40 residues). At the time of data generation (2020-10-26), this yielded 573,479 chains, which we cross-checked against the PDBredo DB88 and CATH89 by computationally removing any IDs unavailable in either of those databases. This kept 296,596 protein chain sequences from 117,623 different proteins.
All remaining sequences were pairwise aligned by MMseqs230. For each pair, we computed the HSSP-value (HVAL)71,90 (Eq. 2) and the percentage pairwise sequence identity (PIDE, Eq. 1) as follows91:
$$\text{PIDE} = 100 \cdot \frac{n_{\text{ident}}}{L} \qquad (1)$$

$$\text{HVAL} = \text{PIDE} - \begin{cases} 480 \cdot L^{-0.32\,\left(1 + e^{-L/1000}\right)} & \text{for } L \le 450 \\ 19.5 & \text{for } L > 450 \end{cases} \qquad (2)$$
where L is the number of M-states, and n_ident is the number of identical residues in the alignment. Note that the HVAL computes proximity for protein pairs based only on their sequences.
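As an illustration, both measures can be computed as in the following hedged Python sketch; it assumes the HSSP-curve threshold as given in Eq. (2), and the variable names are of our own choosing.

```python
import math

def pide(n_ident: int, L: int) -> float:
    """Percentage pairwise sequence identity (Eq. 1); L = number of aligned match states."""
    return 100.0 * n_ident / L

def hval(pide_value: float, L: int) -> float:
    """HSSP-value (Eq. 2): distance of a sequence pair from the HSSP curve.
    HVAL <= 0 was used here as the redundancy-reduction threshold."""
    threshold = 480.0 * L ** (-0.32 * (1 + math.exp(-L / 1000.0))) if L <= 450 else 19.5
    return pide_value - threshold
```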
Training, validation, and test set (secondary structure prediction)
Given that our prediction task is structure-based, we used the HVAL to reduce the redundancy between the separate data sets used for training, validation, and final performance assessment (test) to account for and remove structural similarity: TEST100 was used only to assess performance, and VAL100 was used for optimizing hyper-parameters. We chose these sets as follows:
TEST100 we randomly selected 100 sequences meeting three criteria: (1) Deposited after April 2018 to allow a fair comparison to other recent methods, (2) resolution ≤ 2 Å, (3) any sequence pair (a,b) with a,b ∈ TEST100 must have an HVAL ≤ 0 (Eq. 2).
VAL100 we randomly selected an additional 100 sequences constrained to (1) deposited before April 2018, (2) resolution ≤ 2 Å, and (3) we ascertained that any sequence pair (a,b) with either a ∈ TEST100 or a ∈ VAL100 and b ∈ VAL100 had a maximal HVAL ≤ 0.
TRAIN6727 we used the remaining sequences for training if and only if the following criteria were fulfilled: (1) deposition before April 2018, (2) CATH annotations on the topology level (T) had to be different from any contained in TEST100 or VAL100, (3) HVAL ≤ 0 for any pair (a,b) with a ∈ TEST100 or a ∈ VAL100 and b ∈ TRAIN6727, and (4) PIDE ≤ 70 as computed by MMseqs230 for any pair (a,b) with a,b ∈ TRAIN6727, if a ≠ b. This yielded 6727 protein chains for training.
Preventing data leakage and ensuring robustness
To prevent data leakage between training, validation, and test sets, we used an HVAL ≤ 0. This cutoff roughly corresponds to a maximum sequence similarity of 30% or lower, but additionally ensures that no pair of sequences shares similar structures71. While this approach minimizes the risk of overlap, we acknowledge that sequences in MSAs could exhibit higher pairwise similarities and a higher HVAL. To ensure that no data leakage was introduced in this way, we performed pairwise global alignments for training, validation, and test sequences, including the ones introduced by the MSAs. This included aligning all training sequences to all validation sequences, all training sequences to all test sequences, and all validation sequences to all test sequences. Our analysis of the resulting sequence similarities confirmed that, even with the sequences introduced through MSAs, all pairs between these sets fell below the threshold of 0.3. Furthermore, our analyses with ProtT517 and ProtBert17 embeddings showed no significant performance gains for MSA or PSSM models (Fig. 2, Table S1). If there were data leakage, a noticeably increased performance would be expected for any evolutionary information-based model; the lack of such a change confirms that no data leakage was introduced. To ensure the robustness and generalization capability of our models, we evaluated our top-performing model (ProtT5 raw embeddings) on the CASP12 dataset54. Our model achieved performance consistent with previous methods (Table S9), further validating the robustness of our approach and the effectiveness of our data splits.
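A sketch of such a pairwise leakage check is shown below; it uses Biopython's global aligner as one possible tool, and the alignment program and scoring parameters are our assumptions, not necessarily those used in this study.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.match_score = 1        # illustrative scoring; not the settings used in this study
aligner.mismatch_score = 0
aligner.open_gap_score = -0.5
aligner.extend_gap_score = -0.1

def global_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of identical residues over the global alignment length."""
    aln = aligner.align(seq_a, seq_b)[0]
    row_a, row_b = str(aln[0]), str(aln[1])
    matches = sum(a == b and a != "-" for a, b in zip(row_a, row_b))
    return matches / len(row_a)

def leaking_pairs(set_a, set_b, threshold=0.3):
    """Return all cross-set sequence pairs at or above the 0.3 similarity threshold."""
    return [(a, b) for a in set_a for b in set_b if global_identity(a, b) >= threshold]
```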
Secondary structure states
We reduced the eight classes of DSSP92 secondary structure assignments to three states through the following standard protocol3: Alpha helix classes DSSP-H, DSSP-G, and DSSP-I to helix (H), beta-strand classes DSSP-E and DSSP-B to strand (E), and all remaining classes to other (“–” or “ ”, often misleadingly referred to as loop).
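In code, this protocol reduces to a small lookup table (a sketch using the standard DSSP one-letter codes; residues without an assignment default to other):

```python
# Eight DSSP classes mapped to three states (refs. 3, 92).
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",            # alpha-, 3-10-, and pi-helix -> helix (H)
    "E": "E", "B": "E",                      # extended strand, beta-bridge -> strand (E)
    "T": "-", "S": "-", "-": "-", " ": "-",  # turn, bend, unassigned -> other (-)
}

def to_three_state(dssp_string: str) -> str:
    """Convert a per-residue DSSP assignment string to the 3-state alphabet H/E/-."""
    return "".join(DSSP8_TO_3.get(s, "-") for s in dssp_string)
```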
Data sets—MSA-based averages
We obtained the datasets originally used to estimate performance to appropriately compare adaptations with the original methods. To provide a comprehensive overview of our approach, we have summarized the datasets and methods used for each prediction task in Table 1. For VESPA25, we used the ConSurf10k25 test set (https://zenodo.org/record/5238537) with per-residue conservation scores in 9 classes (1: most variable to 9: most conserved) for 519 sequence-unique proteins. For SETH26, we used the CheZOD11743 set introduced along with ODiNPred43, with per-residue CheZOD scores for 117 proteins. The CheZOD score quantifies the extent of residue disorder by comparing the nuclear magnetic resonance spectroscopy-derived chemical shift values93 with the computed random coil chemical shifts94. For bindEmbed21DL24, we used TestSet22524 (https://github.com/Rostlab/bindPredict) with per-residue annotations for 225 proteins (classified as binding metal ions, nucleic acids, or small molecules, or as non-binding). For TMbed23, we used 571 transmembrane proteins (TMPs) with transmembrane alpha helices (alpha-TMP)23 and the 57 TMPs with transmembrane beta strands (beta-TMP)23 (https://github.com/BernhoferM/TMbed) with per-residue annotations for four states: TM alpha-helix, TM beta-strand, signal peptide, none, as well as the so-called membrane topology (inside/outside).
Table 1.
Overview of prediction tasks, dataset and approaches used for integrating evolutionary information.
| Prediction task | Dataset | Approach for integration of evolutionary information |
|---|---|---|
| α-helix bundle transmembrane segments | 571 α-TMPs23 | MSACons |
| β-barrel transmembrane segments | 57 β-TMP23 | MSACons |
| Binding residue prediction | TestSet22524 | MSACons |
| Conservation | ConSurf10k test25 | MSACons |
| Disorder | CheZOD11743 | MSACons |
| Secondary structure prediction | TRAIN6727, VAL100, TEST100 | MSACons, PSSMSplit, PSSMConcat, MSA embeddings |
Overview of the datasets used for testing the effect of integrating evolutionary information. The first column specifies the prediction task investigated, the second the dataset used for evaluating prediction performance, and the last the approach(es) used for integrating evolutionary information.
Data sets—literature
We obtained additional datasets to compute random baselines and compare recent MSA- and pLM-based methods. In particular, we used for per-residue predictions: secondary structure: CASP1254 dataset (20 unique sequences, https://www.dropbox.com/s/te0vn0t7ocdkra7/CASP12_HHblits.csv?dl=1), 3D structure: CASP1457 dataset (51 unique proteins, https://predictioncenter.org/download_area/CASP14/), signal peptides: SignalP-5.0 benchmark59 set (8811 unique sequences, https://services.healthtech.dtu.dk/services/SignalP-5.0/benchmark_set.fasta), SAV effect prediction: DMS4160 dataset (41 unique proteins, https://static-content.springer.com/esm/art%3A10.1038%2Fs41592-018-0138-4/MediaObjects/41592_2018_138_MOESM4_ESM.xlsx). For per-protein predictions, we obtained: CATH superfamily: TOP177341 dataset (6712 unique sequences, https://zenodo.org/record/6327572), SCOP class: ASTRAL 2.06 test19,64 set (5602 unique sequences, http://bergerlab-downloads.csail.mit.edu/bepler-protein-sequence-embeddings-from-structure-iclr2019/scope.tar.gz), GO function: CAFA350 dataset (130,827 unique proteins, https://biofunctionprediction.org/cafa-targets/CAFA3_targets.tgz), and subcellular location: DeepLoc test66 set (2889 unique sequences, https://services.healthtech.dtu.dk/services/DeepLoc-1.0/deeploc_data.fasta). Although we provided some figures for the sizes of those data sets, those numbers are not strictly comparable. For one task, N proteins might capture all aspects of prediction methods; for another, M >> N might still not suffice. Similarly, head-on comparisons between the accuracy of tasks T1 and T2 are potentially very misleading. For guidance, we provided values reflecting similar solutions toward random baselines. (Even improvement over random is irrelevant as a means of comparison, as illustrated by the following example: assume task 1 (T1) to have three equally distributed states (random Q3 = 33%) and task 2 (T2) to have ten equally distributed states (random Q10 = 10%). Imagine a method three times better than random for T1 (Q3 = 3 × 33 = 100%) and another reaching the same gain over random for T2 (Q10 = 3 × 10 = 30%). The method for T1 is completely right, while that for T2 might be completely blind on four of the ten states and still only get “half” of the predictions right.)
MSA and PSSM
We created multiple sequence alignments (MSAs) by running MMseqs230 against UniRef5095 for all query proteins. MMseqs230 was run with a sensitivity (-s) of 7.5 for the prefiltering module, a minimum sequence identity (–min-seq-id) of 0.2, and 3 iterations (–num-iterations). MMseqs2 first selects only cluster representatives and then extends the results by adding the most diverse sequences to the alignment. Position-specific scoring matrices (PSSMs) were also obtained from MMseqs2. This workflow was consistently applied to all tasks that required MSA or PSSM generation, namely: secondary structure (TRAIN 6727, VAL100, TEST100), transmembrane prediction (57 β-TMP23, 571 α-TMPs23), binding residues (TestSet22524), conservation (ConSurf10k test25) and disorder (CheZOD11743). For conservation prediction, we additionally obtained the precomputed MSAs of our query sequences from ConSurfDB96 that have been generated with MAFFT97.
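Scripted, such a workflow could look like the following sketch; the database names and the choice of MMseqs2 modules for extracting MSAs and PSSMs are assumptions for illustration, not the exact published pipeline.

```python
import subprocess

def run(cmd):
    """Run one MMseqs2 module; fail loudly if it errors."""
    subprocess.run(cmd, check=True)

# Assumed database names; queryDB and uniref50DB would be built with `mmseqs createdb`.
query_db, target_db, tmp = "queryDB", "uniref50DB", "tmp"

# Iterative profile search with the parameters stated above.
run(["mmseqs", "search", query_db, target_db, "resultDB", tmp,
     "-s", "7.5", "--min-seq-id", "0.2", "--num-iterations", "3"])

# One MSA per query and a PSSM derived from a sequence profile.
run(["mmseqs", "result2msa", query_db, target_db, "resultDB", "msaDB"])
run(["mmseqs", "result2profile", query_db, target_db, "resultDB", "profileDB"])
run(["mmseqs", "profile2pssm", "profileDB", "pssm.tsv"])
```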
Embeddings
We used three different protein language models (pLMs) to generate our embeddings: The transformer-based ProtT517 and ProtBert17 and the bi-directional pLM SeqVec16. For each model, we compared the following two alternatives.
Raw embeddings
For ProtT5 and ProtBert, we used only the encoder to convert each protein sequence into an embedding matrix (dimensions 1024 × L; L number of residues), representing each residue by a 1024-dimensional vector. For SeqVec, we extracted the forward and backward pass of the first LSTM layer (also dimension 1024 × L). We used the pre-trained models without any task-specific fine-tuning throughout.
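For ProtT5, raw per-residue embeddings can be generated roughly as in the following sketch, which follows the publicly documented ProtTrans usage of the Hugging Face checkpoint; it is a hedged example, not our exact inference script.

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").eval()

sequence = "MSEQVENCE"  # toy example
# ProtT5 expects space-separated residues with rare amino acids mapped to X.
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
inputs = tokenizer(prepared, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, L + 1, 1024)
embedding = hidden[0, : len(sequence)]          # drop the special end token -> L x 1024
```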
MSA embeddings
Besides the embeddings for each query sequence Q, we computed embeddings for each protein in the MSA of Q. We removed gaps and mapped rare amino acids to X before the encoding. For each residue position i in Q, we averaged over all per-residue embeddings from any protein aligned at i.
Prediction methods
We compared identical model architectures for three different 1024-dimensional input vectors: raw embeddings, MSA embeddings, and MSACons (which applies the model to the raw embeddings of each protein in the MSA and averages the resulting predictions). Additionally, we tested two highly similar architectures to incorporate the additional PSSM input. Figure 1 provides a schematic overview of the used approaches. By comparing architectures as similar as possible, we ensured that performance differences could primarily be attributed to different inputs instead of model complexity, capacity, or other unknown confounding factors.
Input
On top of raw and MSA embeddings (both 1024 input units), we evaluated the combination of raw embeddings and PSSMs (PSSMSplit and PSSMConcat). The mixture models (embeddings + PSSMs) used 1024 (embedding dimension) + 20 (PSSM dimension) input units (SOM: Figs. S5–S7).
Models
Our models consisted of multiple consecutive convolutional layers, each using leaky ReLU as activation function98. The input dimensions of the consecutive layers for the embedding-only models (MSACons, raw, and MSA embeddings) were 1024, 32, 16, and 8, and the output layer had three units for the three secondary structure states (H: helix, E: strand, –: other; Fig. S6). Note that the MSACons model applied the original method to predict for each sequence aligned in the MSA and then averaged over the predictions, while MSA embeddings input the averaged embeddings into the original method. To combine PSSMs and embeddings, we tested two solutions; as neither improved over the simpler approaches, we confined the results to the SOM (Figs. S7, S8).
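Read literally, this description translates into a small convolutional network such as the following PyTorch sketch; the kernel size, padding, and other hyper-parameters are illustrative assumptions not specified above.

```python
import torch.nn as nn

class SecondaryStructureCNN(nn.Module):
    """Per-residue 3-state classifier over 1024-d embeddings (hedged sketch).
    Channel sizes follow the text (1024 -> 32 -> 16 -> 8 -> 3); the kernel size
    and padding are assumptions."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        pad = kernel_size // 2
        self.net = nn.Sequential(
            nn.Conv1d(1024, 32, kernel_size, padding=pad), nn.LeakyReLU(),
            nn.Conv1d(32, 16, kernel_size, padding=pad), nn.LeakyReLU(),
            nn.Conv1d(16, 8, kernel_size, padding=pad), nn.LeakyReLU(),
            nn.Conv1d(8, 3, kernel_size, padding=pad),   # 3 states: H, E, other
        )

    def forward(self, emb):                  # emb: (batch, L, 1024)
        x = emb.transpose(1, 2)              # Conv1d expects (batch, channels, L)
        return self.net(x).transpose(1, 2)   # (batch, L, 3) per-residue logits
```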
Performance measures—secondary structure
We used 3-state per-residue accuracy (Q3, Eq. 3) as our main performance measure for secondary structure prediction evaluation:
$$Q_3 = 100 \cdot \frac{\text{number of residues correctly predicted in one of the three states}}{\text{number of residues}} \qquad (3)$$
Standard deviations and errors for all measures were estimated on a per-chain basis by counting the total and correctly predicted number of residues in a chain and calculating the listed performance measures for each chain. The resulting distribution over all chains was used to compute the mean and standard deviation (SD). The standard error was calculated by dividing the SD by the square root of the sample size (Eq. 4).
$$\text{SE} = \frac{\text{SD}}{\sqrt{n}} \qquad (4)$$
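A minimal sketch of this per-chain evaluation follows; whether the SD uses the n−1 denominator is our assumption.

```python
import numpy as np

def q3(pred_states: str, true_states: str) -> float:
    """Three-state per-residue accuracy (Eq. 3) for one chain."""
    correct = sum(p == t for p, t in zip(pred_states, true_states))
    return 100.0 * correct / len(true_states)

def mean_sd_se(per_chain_values):
    """Mean, standard deviation, and standard error (Eq. 4) over all chains."""
    values = np.asarray(per_chain_values, dtype=float)
    sd = values.std(ddof=1)   # sample SD; the exact estimator used is an assumption
    return values.mean(), sd, sd / np.sqrt(len(values))
```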
Additional performance measures for secondary structure prediction evaluation have been confined to the SOM (SOM 3.3 Additional Performance Measures).
Performance measures—other tasks
We evaluated embedding-based methods following the original method publications. For conservation prediction, we used the MCC as described in the SOM (Eq. S1; TPx: correctly predicted as conserved (> 5); FPx: predicted conserved but annotated as not conserved (< 6); TNx: correctly predicted as not conserved; FNx: predicted not conserved but annotated as conserved). The Spearman correlation ρ (Eq. 5) assessed disorder predictions:
$$\rho = 1 - \frac{6 \sum_{i} d_i^2}{n\,(n^2 - 1)} \qquad (5)$$

with d_i the difference between the ranks of predicted and observed (CheZOD) scores for residue i and n the number of residues.
The F1 score (Eq. 6) evaluated binding residue predictions:
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (6)$$

$$\text{precision} = \frac{TP}{TP + FP} \qquad (7)$$

$$\text{recall} = \frac{TP}{TP + FN} \qquad (8)$$
with TP/TN: correctly predicted as binding/non-binding and FP/FN: incorrectly predicted as binding/non-binding.
Transmembrane predictions were compared on a per-protein level. For this purpose, a segment was considered correctly predicted if (1) its start and end did not differ by more than five residues from the annotated segment and (2) the overlap (intersection) of the predicted and observed segment covered at least half of their union23. We evaluated transmembrane prediction performance using Qok as described in Eq. (9):
$$Q_{ok} = 100 \cdot \frac{\text{number of proteins with all transmembrane segments predicted correctly}}{\text{number of proteins}} \qquad (9)$$
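The two matching rules and the resulting Qok can be sketched as follows; how predicted segments are paired with observed ones (here simply in sequence order) is our simplification of the original evaluation23.

```python
def segment_correct(pred, obs, max_shift=5):
    """pred/obs: (start, end) residue indices of one segment.
    Correct if both endpoints are within max_shift residues and the overlap
    (intersection) covers at least half of the union."""
    (ps, pe), (qs, qe) = pred, obs
    if abs(ps - qs) > max_shift or abs(pe - qe) > max_shift:
        return False
    overlap = max(0, min(pe, qe) - max(ps, qs) + 1)
    union = max(pe, qe) - min(ps, qs) + 1
    return overlap >= 0.5 * union

def qok(per_protein_segments):
    """Qok (Eq. 9): percentage of proteins for which every observed segment is
    matched by a correct prediction and no additional segments are predicted."""
    ok = 0
    for pred_segs, obs_segs in per_protein_segments:
        if len(pred_segs) == len(obs_segs) and all(
            segment_correct(p, o) for p, o in zip(sorted(pred_segs), sorted(obs_segs))
        ):
            ok += 1
    return 100.0 * ok / len(per_protein_segments)
```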
Q10 (Eq. 10) evaluated location prediction:
$$Q_{10} = 100 \cdot \frac{\text{number of proteins predicted in the correct one of ten location classes}}{\text{number of proteins}} \qquad (10)$$
Supplementary Information
Acknowledgements
Thanks to Tim Karl and Inga Weise (both TUM) for invaluable help with technical and administrative aspects of this work, to the members of the Rostlab, especially Michael Heinzinger, for the interesting discussions, to Daniel Kogan for an insightful brainstorming session for visualizing our workflows, to Tristan Bepler (NYSBC, New York) for providing the ASTRAL 2.06 test set, and to all other researchers making datasets and results available online. This work was supported by a grant from the BMBF (German Ministry for Education and Research; Grant Number 031L0168). Last but not least, thanks to all those who maintain public databases, in particular Steven Burley (PDB, Rutgers Univ.), Christine Orengo (CATH, UCL London), Anastassis Perrakis (PDBredo, Amsterdam) and their teams. This work was only possible with those contributions.
Abbreviations
- CATH
Protein structure classification database
- DMS
Deep mutational scanning
- FN
False negatives
- FP
False positives
- GO
GeneOntology
- HVAL
Distance to HSSP curve
- HSSP curve
A curve that represents an empirically determined threshold for automatic family assignment
- MCC
Matthews correlation coefficient
- ML
Machine learning
- MMseqs2
Many-against-many sequence searching tool
- MSA
Multiple sequence alignment
- NLP
Natural language processing
- PDB
Protein Data Bank
- PDBredo DB
A database for optimized crystallographic structure models
- pLM
Protein language model
- ProtBert
ProtBert-BFD protein language model
- ProtT5
ProtT5-XL-U50 protein language model
- PSSM
Position specific scoring matrix
- Q3
3-State accuracy
- SD
Standard deviation
- SE
Standard error
- SOTA
State-of-the-art
- SOV
The fractional overlap of segments
- TM
Transmembrane
- TN
True negatives
- TP
True positive
Author contributions
KE has designed and conceptualized this study. KE has implemented, trained, and evaluated the included models for secondary structure prediction. KE adopted the previously published methods TMbed, VESPA, SETH, and bindEmbed21DL to use the MSAConsensus approach. KE conducted the literature research for a comparison of previously published work. KE analyzed and interpreted the results. KE prepared all figures and tables. All authors have participated in preparing the manuscript and approved the final version. BR supervised the project and provided valuable feedback.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Data availability
The source code is publicly available as a GitHub repository at https://github.com/erckert/EV-embeddings. The dataset used for the comparison of different approaches to integrating evolutionary information for secondary structure prediction is publicly available at: 10.5281/zenodo.10026192.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-024-71783-8.
References
- 1.Zemla, A., Venclovas, C., Fidelis, K. & Rost, B. A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins34, 220–223. 10.1002/(sici)1097-0134(19990201)34:2%3c220::aid-prot7%3e3.0.co;2-k (1999). [DOI] [PubMed] [Google Scholar]
- 2.Rost, B. & Sander, C. Jury returns on structure prediction. Nature360, 540–540. 10.1038/360540b0 (1992). 10.1038/360540b0 [DOI] [PubMed] [Google Scholar]
- 3.Rost, B. & Sander, C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol.232, 584–599 (1993). 10.1006/jmbi.1993.1413 [DOI] [PubMed] [Google Scholar]
- 4.Rost, B. PHD: Predicting one-dimensional protein structure by profile based neural networks. Methods Enzymol.266, 525–539 (1996). 10.1016/S0076-6879(96)66033-9 [DOI] [PubMed] [Google Scholar]
- 5.Jones, D. T. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol.292, 195–202 (1999). 10.1006/jmbi.1999.3091 [DOI] [PubMed] [Google Scholar]
- 6.Rost, B. & Sander, C. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins Struct. Funct. Genet.19, 55–72 (1994). 10.1002/prot.340190108 [DOI] [PubMed] [Google Scholar]
- 7.Liu, J. & Rost, B. NORSp: Predictions of long regions without regular secondary structure. Nucleic Acids Res.31, 3833–3835 (2003). 10.1093/nar/gkg515 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Radivojac, P. et al. Protein flexibility and intrinsic disorder. Protein Sci.13, 71–80 (2004). 10.1110/ps.03128904 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Schlessinger, A., Liu, J. & Rost, B. Natively unstructured loops differ from other loops. PLoS Comput. Biol.3, e140 (2007). 10.1371/journal.pcbi.0030140 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Schlessinger, A. & Rost, B. Protein flexibility and rigidity predicted from sequence. Proteins Struct. Funct. Bioinform.61, 115–126 (2005). 10.1002/prot.20587 [DOI] [PubMed] [Google Scholar]
- 11.Punta, M. & Rost, B. PROFcon: Novel prediction of long-range contacts. Bioinformatics21, 2960–2968 (2005). 10.1093/bioinformatics/bti454 [DOI] [PubMed] [Google Scholar]
- 12.Jones, D. T., Singh, T., Kosciolek, T. & Tetchner, S. MetaPSICOV: Combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics31, 999–1006. 10.1093/bioinformatics/btu791 (2015). 10.1093/bioinformatics/btu791 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Marks, D. S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE6, e28766. 10.1371/journal.pone.0028766 (2011). 10.1371/journal.pone.0028766 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Michel, M. et al. PconsFold: Improved contact predictions improve protein models. Bioinformatics30, i482-488. 10.1093/bioinformatics/btu458 (2014). 10.1093/bioinformatics/btu458 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature596, 583–589. 10.1038/s41586-021-03819-2 (2021). 10.1038/s41586-021-03819-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Heinzinger, M. et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinform.20, 723. 10.1186/s12859-019-3220-8 (2019). 10.1186/s12859-019-3220-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Elnaggar, A. et al. ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell.44, 7112–7127. 10.1109/TPAMI.2021.3095381 (2022). 10.1109/TPAMI.2021.3095381 [DOI] [PubMed] [Google Scholar]
- 18.Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods1, 1–8 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Bepler, T. & Berger, B. Learning protein sequence embeddings using information from structure. 10.48550/ARXIV.1902.08661 (2019).
- 20.Madani, A. et al. ProGen: Language modeling for protein generation. http://arXiv.org/2004.03497, 10.1101/2020.03.07.982272 (2020).
- 21.Rao, R. et al. Evaluating protein transfer learning with TAPE. http://arXiv.org/1906.08230 (2019). [PMC free article] [PubMed]
- 22.Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci.118, e2016239118. 10.1073/pnas.2016239118 (2021). 10.1073/pnas.2016239118 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Bernhofer, M. & Rost, B. TMbed: Transmembrane proteins predicted through language model embeddings. BMC Bioinform.23, 326. 10.1186/s12859-022-04873-x (2022). 10.1186/s12859-022-04873-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Littmann, M., Heinzinger, M., Dallago, C., Weissenow, K. & Rost, B. Protein embeddings and deep learning predict binding residues for various ligand classes. Sci. Rep.11, 23916. 10.1038/s41598-021-03431-4 (2021). 10.1038/s41598-021-03431-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Marquet, C. et al. Embeddings from protein language models predict conservation and variant effects. Hum. Genet.141, 1629–1647. 10.1007/s00439-021-02411-y (2022). 10.1007/s00439-021-02411-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Ilzhöfer, D., Heinzinger, M. & Rost, B. SETH predicts nuances of residue disorder from protein embeddings. Front. Bioinform.2, 1 (2022). 10.3389/fbinf.2022.1019597 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Stärk, H., Dallago, C., Heinzinger, M. & Rost, B. Light attention predicts protein location from the language of life. Bioinform. Adv.1, 035. 10.1093/bioadv/vbab035 (2021). 10.1093/bioadv/vbab035 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Weissenow, K., Heinzinger, M. & Rost, B. Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction. Structure30, 1169–1177. 10.1016/j.str.2022.05.001 (2022). 10.1016/j.str.2022.05.001 [DOI] [PubMed] [Google Scholar]
- 29.Bernhofer, M. et al. PredictProtein—Predicting protein structure and function for 29 years. Nucleic Acids Res.10.1093/nar/gkab354 (2021). 10.1093/nar/gkab354 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol.35, 1026–1028. 10.1038/nbt.3988 (2017). 10.1038/nbt.3988 [DOI] [PubMed] [Google Scholar]
- 31.Dunker, A. K. et al. What’s in a name? Why these proteins are intrinsically disordered. Intrins. Disord. Proteins1, e24157 (2013). 10.4161/idp.24157 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Del Conte, A. et al. CAID prediction portal: A comprehensive service for predicting intrinsic disorder and binding regions in proteins. Nucleic Acids Res.10.1093/nar/gkad430 (2023). 10.1093/nar/gkad430 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Liu, J., Tan, H. & Rost, B. Loopy proteins appear conserved in evolution. J. Mol. Biol.322, 53–64 (2002). 10.1016/S0022-2836(02)00736-2 [DOI] [PubMed] [Google Scholar]
- 34.Schelling, M., Hopf, T. A. & Rost, B. Evolutionary couplings and sequence variation effect predict protein binding sites. Proteins86, 1064–1074. 10.1002/prot.25585 (2018). 10.1002/prot.25585 [DOI] [PubMed] [Google Scholar]
- 35.Tsirigos, K. D., Peters, C., Shu, N., Käll, L. & Elofsson, A. The TOPCONS web server for consensus prediction of membrane protein topology and signal peptides. Nucleic Acids Res.43, W401–W407. 10.1093/nar/gkv485 (2015). 10.1093/nar/gkv485 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Hayat, S., Peters, C., Shu, N., Tsirigos, K. D. & Elofsson, A. Inclusion of dyad-repeat pattern improves topology prediction of transmembrane β-barrel proteins. Bioinformatics32, 1571–1573. 10.1093/bioinformatics/btw025 (2016). 10.1093/bioinformatics/btw025 [DOI] [PubMed] [Google Scholar]
- 37.Hendrickson, W. A. Atomic-level analysis of membrane-protein structure. Nat. Struct. Mol. Biol.23, 464–467. 10.1038/nsmb.3215 (2016). 10.1038/nsmb.3215 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Newport, T. D., Sansom, M. S. P. & Stansfeld, P. J. The MemProtMD database: A resource for membrane-embedded protein structures and their lipid interactions. Nucleic Acids Res.47, D390–D397. 10.1093/nar/gky1047 (2019). 10.1093/nar/gky1047 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Varga, J., Dobson, L., Reményi, I. & Tusnády, G. E. TSTMP: Target selection for structural genomics of human transmembrane proteins. Nucleic Acids Res.45, D325–D330. 10.1093/nar/gkw939 (2017). 10.1093/nar/gkw939 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Teufel, F. et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat. Biotechnol.40, 1023–1025. 10.1038/s41587-021-01156-3 (2022). 10.1038/s41587-021-01156-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Nallapareddy, V. et al. CATHe: Detection of remote homologues for CATH superfamilies using embeddings from protein language models. Bioinformatics1, 029. 10.1093/bioinformatics/btad029 (2023). 10.1093/bioinformatics/btad029 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Bepler, T. & Berger, B. Learning the protein language: Evolution, structure, and function. Cell Syst. 12, 654–669. 10.1016/j.cels.2021.05.017 (2021).
- 43. Dass, R., Mulder, F. A. A. & Nielsen, J. T. ODiNPred: Comprehensive prediction of protein order and disorder. Sci. Rep. 10, 14780. 10.1038/s41598-020-71716-1 (2020).
- 44. Haas, J. et al. Continuous automated model evaluation (CAMEO) complementing the critical assessment of structure prediction in CASP12. Proteins Struct. Funct. Bioinform. 86, 387–398. 10.1002/prot.25431 (2018).
- 45. Weissenow, K., Heinzinger, M., Steinegger, M. & Rost, B. Ultra-fast protein structure prediction to capture effects of sequence variation in mutation movies. bioRxiv. 10.1101/2022.11.14.516473 (2022).
- 46. Notin, P. et al. Tranception: Protein fitness prediction with autoregressive transformers and inference-time retrieval. 10.48550/ARXIV.2205.13760 (2022).
- 47. Weile, J. & Roth, F. P. Multiplexed assays of variant effects contribute to a growing genotype–phenotype atlas. Hum. Genet. 137, 665–678. 10.1007/s00439-018-1916-x (2018).
- 48. Fowler, D. M. & Fields, S. Deep mutational scanning: A new style of protein science. Nat. Methods 11, 801–807. 10.1038/nmeth.3027 (2014).
- 49. Ashburner, M. et al. Gene Ontology: Tool for the unification of biology. Nat. Genet. 25, 25–29. 10.1038/75556 (2000).
- 50. Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 244. 10.1186/s13059-019-1835-8 (2019).
- 51. Rojano, E. et al. Assigning protein function from domain-function associations using DomFun. BMC Bioinform. 23, 43. 10.1186/s12859-022-04565-6 (2022).
- 52. Littmann, M., Heinzinger, M., Dallago, C., Olenyi, T. & Rost, B. Embeddings from deep learning transfer GO annotations beyond homology. Sci. Rep. 11, 1160. 10.1038/s41598-020-80786-0 (2021).
- 53. You, R. et al. GOLabeler: Improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics 34, 2465–2473. 10.1093/bioinformatics/bty130 (2018).
- 54. Abriata, L. A., Tamò, G. E., Monastyrskyy, B., Kryshtafovych, A. & Dal Peraro, M. Assessment of hard target modeling in CASP12 reveals an emerging role of alignment-based contact prediction methods. Proteins Struct. Funct. Bioinform. 86, 97–112. 10.1002/prot.25423 (2018).
- 55. Klausen, M. S. et al. NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning. Proteins Struct. Funct. Bioinform. 87, 520–527. 10.1002/prot.25674 (2019).
- 56. Elnaggar, A. et al. Ankh: Optimized protein language model unlocks general-purpose modelling (2023).
- 57. Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein structure prediction (CASP)—Round XIV. Proteins Struct. Funct. Bioinform. 89, 1607–1617. 10.1002/prot.26237 (2021).
- 58. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130. 10.1126/science.ade2574 (2023).
- 59. Almagro Armenteros, J. J. et al. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat. Biotechnol. 37, 420–423. 10.1038/s41587-019-0036-z (2019).
- 60. Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822. 10.1038/s41592-018-0138-4 (2018).
- 61. Laine, E., Karami, Y. & Carbone, A. GEMME: A simple and fast global epistatic model predicting mutational effects. Mol. Biol. Evol. 36, 2604–2619. 10.1093/molbev/msz179 (2019).
- 62. Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. bioRxiv. 10.1101/2021.07.09.450648 (2021).
- 63. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410. 10.1016/S0022-2836(05)80360-2 (1990).
- 64. Fox, N. K., Brenner, S. E. & Chandonia, J.-M. SCOPe: Structural classification of proteins—Extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 42, D304–D309. 10.1093/nar/gkt1240 (2014).
- 65. Zhang, Y. & Skolnick, J. TM-align: A protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309. 10.1093/nar/gki524 (2005).
- 66. Almagro Armenteros, J. J., Sønderby, C. K., Sønderby, S. K., Nielsen, H. & Winther, O. DeepLoc: Prediction of protein subcellular localization using deep learning. Bioinformatics 33, 3387–3395. 10.1093/bioinformatics/btx431 (2017).
- 67. Xia, Y., Huang, E. S., Levitt, M. & Samudrala, R. Ab initio construction of protein tertiary structures using a hierarchical approach. J. Mol. Biol. 300, 171–185. 10.1006/jmbi.2000.3835 (2000).
- 68. Lin, Z. et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv. 10.1101/2022.07.20.500902 (2022).
- 69. Steinegger, M., Mirdita, M. & Söding, J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods 16, 603–606. 10.1038/s41592-019-0437-4 (2019).
- 70. Devos, D. & Valencia, A. Practical limits of function prediction. Proteins Struct. Funct. Bioinform. 41, 98–107. 10.1002/1097-0134(20001001)41:1<98::AID-PROT120>3.0.CO;2-S (2000).
- 71. Rost, B. Twilight zone of protein sequence alignments. Protein Eng. Des. Sel. 12, 85–94. 10.1093/protein/12.2.85 (1999).
- 72. Peters, M. E. et al. Deep contextualized word representations. Preprint at https://arxiv.org/abs/1802.05365 (2018).
- 73. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT 2019 4171–4186 (Association for Computational Linguistics, 2019).
- 74.Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. 10.48550/ARXIV.1910.10683 (2020).
- 75. Vaswani, A. et al. Attention is all you need. In Proc. 31st International Conference on Neural Information Processing Systems 6000–6010 (Curran Associates Inc., Long Beach, 2017).
- 76. Nielsen, J. T. & Mulder, F. A. A. There is diversity in disorder—“In all chaos there is a cosmos, in all disorder a secret order”. Front. Mol. Biosci. 3, 4. 10.3389/fmolb.2016.00004 (2016).
- 77. Lange, J., Wyrwicz, L. S. & Vriend, G. KMAD: Knowledge-based multiple sequence alignment for intrinsically disordered proteins. Bioinformatics 32, 932–936. 10.1093/bioinformatics/btv663 (2016).
- 78. Radivojac, P., Obradovic, Z., Brown, C. J. & Dunker, A. K. Improving sequence alignments for intrinsically disordered proteins. Pac. Symp. Biocomput. 589–600 (2002).
- 79. Riley, A. C., Ashlock, D. A. & Graether, S. P. The difficulty of aligning intrinsically disordered protein sequences as assessed by conservation and phylogeny. PLoS ONE 18, e0288388. 10.1371/journal.pone.0288388 (2023).
- 80. Brown, C. J. et al. Evolutionary rate heterogeneity in proteins with long disordered regions. J. Mol. Evol. 55, 104–110. 10.1007/s00239-001-2309-6 (2002).
- 81. Huang, H. & Sarai, A. Analysis of the relationships between evolvability, thermodynamics, and the functions of intrinsically disordered proteins/regions. Comput. Biol. Chem. 41, 51–57. 10.1016/j.compbiolchem.2012.10.001 (2012).
- 82. Ahnert, S. E., Marsh, J. A., Hernandez, H., Robinson, C. V. & Teichmann, S. A. Principles of assembly reveal a periodic table of protein complexes. Science 350, aaa2245. 10.1126/science.aaa2245 (2015).
- 83. Ponting, C. P. & Russell, R. R. The natural history of protein domains. Annu. Rev. Biophys. Biomol. Struct. 31, 45–71. 10.1146/annurev.biophys.31.082901.134314 (2002).
- 84. Rey, F. A. One protein, many functions. Nature 468, 773–775. 10.1038/468773a (2010).
- 85. Wells, J., Hawkins-Hooker, A., Bordin, N., Paige, B. & Orengo, C. Chainsaw: Protein domain segmentation with fully convolutional neural networks. bioRxiv. 10.1101/2023.07.19.549732 (2023).
- 86. Schütze, K., Heinzinger, M., Steinegger, M. & Rost, B. Nearest neighbor search on embeddings rapidly identifies distant protein relations. Front. Bioinform. 10.3389/fbinf.2022.1033775 (2022).
- 87. Berman, H. M. et al. The protein data bank. Nucleic Acids Res. 28, 235–242. 10.1093/nar/28.1.235 (2000).
- 88. Joosten, R. P., Long, F., Murshudov, G. N. & Perrakis, A. The PDB_REDO server for macromolecular structure model optimization. IUCrJ 1, 213–220. 10.1107/s2052252514009324 (2014).
- 89. Sillitoe, I. et al. CATH: Increased structural coverage of functional space. Nucleic Acids Res. 49, D266–D273. 10.1093/nar/gkaa1079 (2021).
- 90. Sander, C. & Schneider, R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 9, 56–68. 10.1002/prot.340090107 (1991).
- 91. Mika, S. & Rost, B. UniqueProt: Creating representative protein sequence sets. Nucleic Acids Res. 31, 3789–3791. 10.1093/nar/gkg620 (2003).
- 92. Kabsch, W. & Sander, C. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637. 10.1002/bip.360221211 (1983).
- 93. Howard, M. J. Protein NMR spectroscopy. Curr. Biol. 8, R331–R333. 10.1016/S0960-9822(98)70214-3 (1998).
- 94. Nielsen, J. T. & Mulder, F. A. A. In Intrinsically Disordered Proteins: Methods and Protocols (eds Kragelund, B. B. & Skriver, K.) 303–317 (Springer, 2020).
- 95. Suzek, B. E. et al. UniRef clusters: A comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932. 10.1093/bioinformatics/btu739 (2015).
- 96. Ben Chorin, A. et al. ConSurf-DB: An accessible repository for the evolutionary conservation patterns of the majority of PDB proteins. Protein Sci. 29, 258–267. 10.1002/pro.3779 (2020).
- 97. Katoh, K., Rozewicki, J. & Yamada, K. D. MAFFT online service: Multiple sequence alignment, interactive sequence choice and visualization. Brief. Bioinform. 20, 1160–1166. 10.1093/bib/bbx108 (2019).
- 98. Fukushima, K. Cognitron: A self-organizing multilayered neural network. Biol. Cybern. 20, 121–136. 10.1007/BF00342633 (1975).
Associated Data
Data Availability Statement
The source code is publicly available as a GitHub repository at https://github.com/erckert/EV-embeddings. The dataset used for the comparison of different approaches to integrating evolutionary information for secondary structure prediction is publicly available at https://doi.org/10.5281/zenodo.10026192.
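For convenience, the short Python sketch below illustrates one way to retrieve both resources programmatically; it is not part of the published code base. It assumes that `git` and the `requests` package are installed, that the repository URL above is still valid, and that the public Zenodo REST API (record ID 10026192, taken from the DOI above) keeps its current response layout.

```python
# Minimal sketch (not from the original release) for fetching the cited resources.
import subprocess
import requests

# Clone the source code repository named in the Data Availability Statement.
subprocess.run(
    ["git", "clone", "https://github.com/erckert/EV-embeddings.git"],
    check=True,
)

# List the files attached to the Zenodo record behind DOI 10.5281/zenodo.10026192.
record = requests.get("https://zenodo.org/api/records/10026192", timeout=30).json()
for entry in record.get("files", []):
    # Each entry exposes the file name and a direct download link.
    print(entry["key"], entry["links"]["self"])
```

Each printed link can then be downloaded with any HTTP client to obtain the secondary structure dataset.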



