Abstract
Accurate detection of protein sequence homology is essential for understanding evolutionary relationships and predicting protein functions, particularly for detecting remote homology in the “twilight zone” (20–35% sequence similarity), where traditional sequence alignment methods often fail. Recent studies show that embeddings from protein language models (pLMs) can improve remote homology detection over traditional methods. Alignment-based approaches, such as those combining pLMs with dynamic programming alignment, further improve performance but often suffer from noise in the resulting similarity matrices. To address this, we evaluate a newly developed embedding-based sequence alignment approach that refines residue-level embedding similarity using K-means clustering and double dynamic programming (DDP). We show that the incorporation of clustering and DDP consistently contributes to improved performance in detecting remote homology. Experimental results demonstrate that our approach outperforms both traditional sequence-based methods and state-of-the-art embedding-based approaches on several benchmarks. Our study illustrates that embedding-based alignment refined with clustering and DDP offers a powerful approach for identifying remote homology, with the potential to evolve further as pLMs continue to advance.
Keywords: Protein language models, Protein sequence alignment, Needleman-Wunsch algorithm, Double dynamic programming, Protein embeddings, Twilight zone
Subject terms: Computational biology and bioinformatics, Mathematics and computing
Introduction
Identifying protein sequence homology through sequence similarity remains a standard approach for detecting evolutionarily conserved functions across proteins1,2. For decades, protein sequence homology has supported numerous applications, including the prediction of protein functions3–7, protein structures and protein interactions8–17, protein design18, and evolutionary relationships1. Traditional sequence homology-based methods are typically fast and accurate when proteins share high sequence similarity. However, when protein sequence similarity falls below 20–35%, a range often referred to as the twilight zone19, their accuracy declines rapidly. It is well-established that protein structure is more conserved than sequence across evolutionary time20, making remote homology detection, the task of detecting structurally similar proteins with low sequence similarity, a major challenge for existing sequence homology-based approaches.
In contrast, structure-based alignment tools such as TM-align20, DALI21, FAST22, and Mammoth23 can accurately detect such remote homologs by superimposing protein three-dimensional (3D) structures. Nonetheless, they require experimentally determined or predicted structures, which remain unavailable for most proteins. Despite the recent progress made by protein structure prediction methods9,11,12,24,25, including AlphaFold210, which has transformed the field with rapid and highly accurate structure predictions, structure prediction still faces critical limitations, particularly in keeping pace with the exponential growth in the number of available protein sequences26. For example, metagenomics alone contributes billions of unique protein sequences27,28, of which only a small fraction have known structures29, highlighting the importance of efficient sequence-based approaches that capture structural similarities without requiring explicit structure prediction.
Recent advances in sequence-based approaches using pre-trained protein language models (pLMs) show promise. These transformer-based models, inspired by advances in natural language processing, are trained on millions of protein sequences using self-supervised learning, treating protein sequences as sentences. Through this training, pLMs learn the “language of life” by capturing important biological information30. Given a protein sequence as input, a pLM produces high-dimensional vector representations, known as embeddings, for each residue or for the entire sequence. In recent years, these embeddings have become an important feature for sequence alignments, particularly for remotely homologous proteins in the twilight zone. To this end, several methods represent protein sequences by averaging residue-level embeddings into fixed-length vectors in high-dimensional space31–33. Evo-velocity32 uses the Euclidean distance between these averaged embeddings to construct evolutionary graphs via K-nearest neighbors, while ProtTucker31 employs contrastive learning to improve clustering of similar CATH domains34. TM-Vec29 also uses averaged embeddings to directly predict TM-scores35 for computing structural similarity. Although these approaches are efficient, they often overlook fine-grained residue-level alignment information36. To overcome this, alignment-based strategies37–39 have been introduced, which compute residue–residue similarity using embedding-derived alignments or train neural networks to produce scoring and gap parameters. Most recently, EBA36 proposed an unsupervised alignment-based approach that combines residue-level embedding similarities with dynamic programming to detect structural relationships in the twilight zone, outperforming existing pLM-based approaches. However, alignment-based methods suffer from noise in the resulting similarity matrix36.
Can these embedding similarity matrices be refined in an unsupervised manner to effectively detect remote homology? To address this challenge, we develop a new unsupervised protein sequence alignment approach that refines residue-level embedding similarity by incorporating K-means clustering and a double dynamic programming strategy. We evaluate the effectiveness of our approach through a fourfold strategy. First, we perform structural alignment benchmarking on the PISCES dataset40 (≤ 30% sequence similarity) by calculating Spearman correlations between predicted alignment scores and TM-align20–derived similarity scores (TM-scores) to evaluate structural similarity across remote homologs. Second, we conduct an ablation study to assess the contribution of each component in our pipeline (specifically, the clustering- and double dynamic programming-based refinement introduced in this work) by systematically removing these components and measuring their individual effects on alignment performance. Third, we evaluate functional generalization using the CATH34 annotation transfer task across all levels of the classification hierarchy (Class, Architecture, Topology, and Homology). Fourth, to assess alignment quality, we benchmark our approach on the HOMSTRAD dataset. Notably, our approach leverages embeddings obtained from pretrained pLMs and does not require any additional training or parameter optimization.
Methods
Protein Embeddings from Language Models: As shown in Fig. 1, the initial step of our approach converts protein sequences into residue-level embeddings using pretrained protein language models (pLMs). These models generate high-dimensional vector representations for each residue, capturing both sequence context and physicochemical properties. In this study, we use residue-level embeddings generated by three widely used pLMs: ProtT5 (ProtT5-XL-UniRef50)30, ProstT541, and ESM-1b (esm1b_t33_650M_UR50S)42. While ProtT5 and ESM-1b are transformer43-based models trained on the UniRef5044 database using masked language modeling, ProstT5 builds upon ProtT5 by incorporating sequential and structural information through Foldseek45’s 3Di-token encoding. These models output fixed-length vectors for each residue, with dimensionalities of 1024 for ProtT5 and ProstT5, and 1280 for ESM-1b.
Fig. 1.
Overview of our approach. Given a pair of input protein sequences, P (of length u) and Q (of length v), residue-level embeddings are generated using a pretrained protein language model. In Stage 1, a residue–residue similarity matrix (SMu×v) is computed. To reduce noise in the initial similarity matrix, Stage 2 applies Z-score normalization independently across rows and columns, resulting in a normalized similarity matrix (SM′u×v). Stage 3 further refines the similarity matrix by first performing a Needleman–Wunsch dynamic programming pass and then applying K-means clustering to the combined embeddings from both sequences. Cluster-based weights are then incorporated to adjust similarity scores, and a second Needleman–Wunsch dynamic programming pass is performed on the final similarity matrix (SM′′u×v). Finally, the alignment score derived from this matrix is normalized by either the shorter (Scoremin) or the longer (Scoremax) sequence length.
While these embeddings offer a powerful representation of protein sequences, a common practice is to average the residue-level embeddings to obtain a fixed-size vector for each protein31–33. This enables fast sequence comparison using distance metrics such as Euclidean distance. While this approach can capture high-level similarities between proteins, it often struggles with accurate residue-level alignment33,36.
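This averaged-embedding comparison can be sketched as follows (a minimal illustration with random toy arrays standing in for real pLM outputs; the toy lengths and 16-dimensional embedding size are assumptions, chosen only to keep the example small):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical residue-level embeddings for two proteins (lengths 5 and 8);
# real pLM embeddings would be 1024-dim (ProtT5/ProstT5) or 1280-dim (ESM-1b).
emb_p = rng.normal(size=(5, 16))
emb_q = rng.normal(size=(8, 16))

# Mean-pool each protein's residue embeddings into one fixed-size vector ...
mean_p = emb_p.mean(axis=0)
mean_q = emb_q.mean(axis=0)

# ... and compare proteins with a single Euclidean distance.
# Note: residue-level correspondence is lost in this pooled comparison.
dist = float(np.linalg.norm(mean_p - mean_q))
```

The pooled vectors enable fast whole-protein comparisons, but, as noted above, the residue-level detail needed for accurate alignment is discarded in the averaging step.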
Stage 1: Construction of embedding similarity matrix
To address this limitation, we construct a residue–residue similarity matrix that captures the fine-grained spatial relationships between sequences. Specifically, to compare two protein sequences P and Q, with lengths u and v, we first compute an embedding similarity matrix (SMu×v), where each entry represents the similarity between a pair of residues. The similarity score between residue a in sequence P and residue b in sequence Q is computed as:

$$ SM_{a,b} = -\,d(e_a, e_b) \tag{1} $$

where $e_a$ and $e_b$ are the residue-level embeddings of residues a (∈ P) and b (∈ Q), respectively, and $d$ denotes the Euclidean distance.
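Stage 1 can be sketched in a few lines of NumPy (a minimal sketch with random toy embeddings; negating the Euclidean distance so that closer embeddings score higher is our reading of the distance-to-similarity transform):

```python
import numpy as np

def similarity_matrix(emb_p, emb_q):
    """Stage 1 sketch: residue-residue similarity from Euclidean distance.

    Entry [a, b] scores residue a of P against residue b of Q; the
    negated pairwise distance makes closer embeddings more similar.
    """
    return -np.linalg.norm(emb_p[:, None, :] - emb_q[None, :, :], axis=-1)

rng = np.random.default_rng(0)
sm = similarity_matrix(rng.normal(size=(5, 16)), rng.normal(size=(8, 16)))
```

For proteins of lengths u and v with embedding dimension d, this broadcasted computation performs the O(uvd) work noted in the complexity analysis below.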
Stage 2: Z-score-based normalization of the similarity matrix
To reduce noise, inspired by prior work36, we transform the initial similarity matrix (SM) using a Z-score normalization strategy. For each residue a ∈ P, we compute the row-wise mean ($\mu_a$) and standard deviation ($\sigma_a$) as follows:

$$ \mu_a = \frac{1}{v} \sum_{b=1}^{v} SM_{a,b} \tag{2} $$

$$ \sigma_a = \sqrt{\frac{1}{v} \sum_{b=1}^{v} \left( SM_{a,b} - \mu_a \right)^2} \tag{3} $$

Similarly, for each residue b ∈ Q, we compute the column-wise mean ($\mu_b$) and standard deviation ($\sigma_b$) as follows:

$$ \mu_b = \frac{1}{u} \sum_{a=1}^{u} SM_{a,b} \tag{4} $$

$$ \sigma_b = \sqrt{\frac{1}{u} \sum_{a=1}^{u} \left( SM_{a,b} - \mu_b \right)^2} \tag{5} $$

The Z-scores are computed with respect to both the row and column distributions as follows:

$$ Z^{row}_{a,b} = \frac{SM_{a,b} - \mu_a}{\sigma_a} \tag{6} $$

$$ Z^{col}_{a,b} = \frac{SM_{a,b} - \mu_b}{\sigma_b} \tag{7} $$

where $Z^{row}_{a,b}$ and $Z^{col}_{a,b}$ are the row- and column-wise Z-scores for a residue pair (a, b). The Z-score-based similarity matrix (SM′) is then obtained by averaging the row- and column-wise Z-scores of each residue pair as follows:

$$ SM'_{a,b} = \frac{Z^{row}_{a,b} + Z^{col}_{a,b}}{2} \tag{8} $$
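The row/column Z-score averaging of Stage 2 (Eqs. 2–8) can be sketched as follows (a minimal sketch; the small `eps` guarding against division by zero is an implementation assumption, not part of the paper's formulation):

```python
import numpy as np

def zscore_normalize(sm, eps=1e-8):
    """Stage 2 sketch: Z-score each entry against its row distribution
    (Eqs. 2-3, 6) and its column distribution (Eqs. 4-5, 7), then
    average the two Z-scores (Eq. 8)."""
    row_z = (sm - sm.mean(axis=1, keepdims=True)) / (sm.std(axis=1, keepdims=True) + eps)
    col_z = (sm - sm.mean(axis=0, keepdims=True)) / (sm.std(axis=0, keepdims=True) + eps)
    return (row_z + col_z) / 2.0

rng = np.random.default_rng(0)
sm_prime = zscore_normalize(rng.normal(size=(5, 8)))
```

Because every row of the row-wise term and every column of the column-wise term is centered, the normalized matrix has (near-)zero overall mean, so only entries that stand out against both their row and their column background retain large values.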
Stage 3: Refining the Z-score-based similarity matrix with clustering and double dynamic programming
To further remove noise and highlight informative signals, we introduce a new refinement strategy that updates the Z-score-normalized similarity matrix (SM′) from Stage 2 using clustering and double dynamic programming (DDP). We first apply Needleman–Wunsch46 dynamic programming (with zero gap penalties) to the Z-score-based similarity matrix SM′. Next, we apply K-means clustering to the combined residue embeddings from both sequences, grouping residues into 20 clusters based on their embedding representations. While this is inspired in part by prior work47 that applied clustering in 20-dimensional space for alignment-free protein classification, our use of clustering is fundamentally different, as it refines residue-level similarity matrices for sequence alignment. Based on the resulting cluster assignments, a weight matrix is created that assigns higher similarity scores to residue pairs belonging to the same cluster. Specifically, each aligned residue pair is used to adjust the corresponding similarity score by blending it with a cluster-informed value. The updated similarity score between residue a (∈ P) and residue b (∈ Q) is computed as:

$$ SM''_{a,b} = \alpha \cdot SM'_{a,b} + (1 - \alpha) \cdot w_{a,b} \tag{9} $$

where α is a fixed blending factor (α = 0.8), and $w_{a,b}$ = 1.0 if residues a and b share the same cluster and $w_{a,b}$ = 0.5 otherwise. Residue pairs that were not aligned in the first pass are left unchanged in this step. On this refined similarity matrix (SM′′), a second Needleman–Wunsch dynamic programming pass produces the final alignment. A similar refinement strategy was originally proposed in 48 and later adopted for contact (or distance)-assisted protein structure prediction13,14,49; our approach differs in applying such refinement to embedding-derived similarity matrices for sequence alignment. Finally, the resulting alignment score ($S$) is normalized by either the minimum or the maximum sequence length to allow fair comparison across protein pairs (refer to Supplementary Text S1 for pseudocode):

$$ Score_{min} = \frac{S}{\min(u, v)}, \qquad Score_{max} = \frac{S}{\max(u, v)} \tag{10} $$

This asymmetric normalization ensures that alignment scores are interpretable across varying sequence lengths, inspired by previous works20,35,36.
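The Stage 3 refinement can be sketched end-to-end as follows (a minimal, illustrative implementation: Needleman–Wunsch with zero gap penalties, a toy cluster count of 4 instead of the paper's 20 so that short random inputs suffice, and a random stand-in for the Stage 2 matrix; the helper names are ours, not the paper's):

```python
import numpy as np
from sklearn.cluster import KMeans

def nw_align(sm):
    """Needleman-Wunsch with zero gap penalties: returns the alignment
    score and the aligned residue pairs recovered by traceback."""
    u, v = sm.shape
    h = np.zeros((u + 1, v + 1))
    for i in range(1, u + 1):
        for j in range(1, v + 1):
            h[i, j] = max(h[i - 1, j - 1] + sm[i - 1, j - 1],
                          h[i - 1, j], h[i, j - 1])
    pairs, i, j = [], u, v
    while i > 0 and j > 0:
        if h[i, j] == h[i - 1, j - 1] + sm[i - 1, j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif h[i, j] == h[i - 1, j]:
            i -= 1
        else:
            j -= 1
    return h[u, v], pairs

def refine_and_score(sm_prime, emb_p, emb_q, n_clusters=4, alpha=0.8):
    """Stage 3 sketch: first NW pass, cluster-informed update of the
    aligned pairs (Eq. 9), second NW pass, and length-normalized
    scores (Eq. 10). n_clusters=4 is a toy value; the paper uses 20."""
    _, pairs = nw_align(sm_prime)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(np.vstack([emb_p, emb_q]))
    labels_p, labels_q = labels[:len(emb_p)], labels[len(emb_p):]
    sm2 = sm_prime.copy()
    for a, b in pairs:  # unaligned residue pairs are left unchanged
        w = 1.0 if labels_p[a] == labels_q[b] else 0.5
        sm2[a, b] = alpha * sm2[a, b] + (1.0 - alpha) * w
    score, _ = nw_align(sm2)
    u, v = sm2.shape
    return score / min(u, v), score / max(u, v)

rng = np.random.default_rng(0)
emb_p, emb_q = rng.normal(size=(6, 8)), rng.normal(size=(9, 8))
sm_prime = rng.normal(size=(6, 9))  # stand-in for the Stage 2 output
score_min, score_max = refine_and_score(sm_prime, emb_p, emb_q)
```

With zero gap penalties the DP score is non-negative, so normalizing by the shorter length always yields the larger of the two scores, mirroring the Scoremin/Scoremax pair of Eq. 10.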
Complexity analysis: For two protein sequences of lengths u and v, construction of the residue–residue similarity matrix requires O(uvd) operations, where d is the embedding dimensionality. Each Needleman–Wunsch alignment requires O(uv) operations, and our pipeline applies it twice (once on the Z-score normalized matrix and once after clustering refinement). The K-means clustering stage operates on (u + v) residue embeddings, with complexity O(K(u + v)dT), where K = 20 is the number of clusters and T is the number of iterations until convergence (scikit-learn defaults to max_iter = 300). Overall, our approach scales as O(uvd), with clustering and the second alignment introducing modest overhead.
Benchmark datasets, methods to compare, and performance evaluation
To evaluate the performance in structural similarity analysis, we benchmark our approach using protein pairs from the PISCES40 dataset, containing 19,599 protein pairs with a maximum pairwise sequence similarity of 30% and a minimum chain length of 75 residues. While these protein pairs show detectable homology at an e-value threshold of 10⁻⁴ using HHsearch50, the maximum sequence similarity cutoff of 30% indicates distant homology in the twilight zone. On this dataset, our approach is compared against state-of-the-art protein embedding-based approaches, such as EBA36, ProtTucker31, TM-Vec29, and pLM-BLAST51, as well as traditional sequence alignment methods, such as HH-align50 and Needleman–Wunsch46. Notably, EBA aligns protein sequences in embedding space using dynamic programming on residue-level representations from language models. ProtTucker calculates similarity via Euclidean distance between contrastively trained protein embeddings, while TM-Vec directly predicts TM-scores from embeddings using deep learning. pLM-BLAST identifies local similarity by comparing contextual embeddings. Traditional competing methods include HH-align, which performs profile–profile alignments using hidden Markov models, and Needleman–Wunsch global alignment with a standard BLOSUM52 substitution matrix. Although our approach also uses Needleman–Wunsch, including it with a standard BLOSUM substitution matrix as a competing method highlights the performance gap between fixed-matrix alignments and embedding-enhanced strategies, particularly in low-homology settings. Furthermore, the performance of each method is evaluated by computing the Spearman correlation between its predicted similarity score and the TM-score computed by TM-align20, which serves as the ground truth. Correlations are calculated using TM-scores (by TM-align) normalized by the lengths of both the shorter and the longer protein in each pair.
It is noted that the reported Spearman correlation values for all competing methods on the PISCES dataset are obtained from the reported results of 36. For a fair comparison, we use the same embedding types as the competing embedding-based methods.
To further evaluate the performance in transferring CATH domain annotations, we use the same lookup and test sets as used by prior work31,36. In particular, while the lookup set maintains very low sequence similarity to the test set (HVAL < 0), it is ensured that for each protein in the test set, there exists at least one protein in the lookup set with an identical CATH classification at the specified level. On this dataset, our approach is compared against state-of-the-art pLM-based approaches, such as EBA, ProtTucker, and TM-Vec, as well as non-pLM-based approaches such as Foldseek45, HMMER2, and MMseqs253. While HMMER and MMseqs2 are widely used sequence profile-based approaches, Foldseek is a state-of-the-art structure-based approach. Similar to prior work31,36, the performance is evaluated using accuracy as follows:
$$ \mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(y_i = \hat{y}_i\right) \tag{11} $$

where N denotes the total number of test samples, and $y_i$ and $\hat{y}_i$ are the ground truth (experimental annotation) and the prediction for protein i, respectively. The indicator $\mathbb{1}(y_i = \hat{y}_i)$ is 1 when the prediction matches the ground truth and 0 otherwise. It is noted that accuracy scores of competing methods are obtained from the reported results of 31,36. Notably, both our approach and EBA use ProstT541 embeddings and normalize alignment scores by the length of the longer protein in each aligned pair. In contrast, ProtTucker and TM-Vec rely on embeddings from ProtT530.
Moreover, to evaluate the alignment quality, we use the HOMSTRAD dataset54, which contains manually curated structural alignments covering 1032 protein families. Following prior work36, we generate one pairwise alignment per family by selecting the first and last sequences, resulting in 1032 sequence pairs. The corresponding HOMSTRAD reference alignments are used as ground truth. For each predicted alignment, we calculate precision and sensitivity as follows:

$$ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{12} $$

$$ \mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{13} $$

where TP (or FP) is the number of true (or false) positives and FN is the number of false negatives. We report average values across all families. On this dataset, we benchmark our performance against state-of-the-art structure-based approaches, such as TM-align, DALI21,55, Foldseek, Foldseek-TM, and CLE-SW56, as well as sequence-based approaches such as EBA and MMseqs2. For a fair comparison, both embedding-based approaches (our approach and EBA) use ProstT5 embeddings. It is noted that sensitivity/precision scores of competing methods are obtained from the reported results of 36.
Results and discussions
Performance on PISCES dataset
We evaluate the performance of our approach in structural similarity analysis on the PISCES dataset, which includes 19,599 protein pairs filtered to ensure low sequence identity (≤ 30%) and a minimum chain length of 75 residues. Table 1 presents the Spearman correlation between the predicted similarity scores of competing methods and TM-align–computed TM-scores. Spearman correlations are reported with respect to TM-scores normalized by the shorter sequence length (TMmin) and the longer sequence length (TMmax), providing a dual perspective on alignment quality. It is worth mentioning that the Spearman correlations of competing methods are obtained from published results reported in 36. To make a fair comparison, we use the same embedding types as competing methods.
Table 1.
Performance comparison on PISCES dataset (best performance is listed in bold).
| Method type | Methods | Embeddings | Spearman correlation c (TMmin) | Spearman correlation c (TMmax) |
|---|---|---|---|---|
| Traditional | Needleman–Wunsch a | – | 0.61 | 0.43 |
| | HH-align a | – | 0.82 | 0.77 |
| Embedding based | ProtTucker a, b | ProtT5 | –0.46 | –0.38 |
| | pLM-BLAST a, b | ProtT5 | 0.58 | 0.60 |
| | TM-Vec a, b | ProtT5 | 0.81 | 0.82 |
| | EBA a | ESM-1b | 0.87 | 0.80 |
| | | ProtT5 | 0.90 | 0.84 |
| | | ProstT5 | 0.92 | 0.86 |
| | Our work | ESM-1b | 0.89 | 0.81 |
| | | ProtT5 | 0.91 | 0.84 |
| | | ProstT5 | 0.93 | 0.87 |
a Spearman correlation values are obtained from the reported results of 36.
b can only run using ProtT5 embeddings.
c correlations are reported with respect to TM-scores normalized by the shorter sequence length (TMmin) and the longer sequence length (TMmax). Our work and EBA generate two alignment scores per protein pair, one normalized by the shorter sequence length and one by the longer, allowing a direct comparison with TMmin and TMmax, respectively. Since other competing methods generate only one prediction score, the same value is used for comparison against both TM-score variants.
As shown in Table 1, our approach (refer to Our work), which integrates clustering- and DDP-based refinement of the embedding similarity matrix, consistently achieves the highest Spearman correlations with TM-scores (the ground truth) across all three protein language models: ProstT5, ProtT5, and ESM-1b. Among these, ProstT5 embeddings yield the highest correlation, with Spearman correlations of 0.93 / 0.87 (TMmin/TMmax), outperforming all competing embedding-based and traditional methods. Similarly strong correlations are observed with ProtT5 (0.91 / 0.84) and ESM-1b embeddings (0.89 / 0.81), highlighting the effectiveness of our approach across embedding types. Moreover, compared to the next best method, EBA, which applies dynamic programming alignment in embedding space, our approach achieves higher correlations across all embeddings. For instance, with ProstT5 embeddings, EBA achieves Spearman correlations of 0.92 / 0.86, falling short of the correlations achieved by our approach (0.93 / 0.87). Improvements in Spearman correlations are also noted with ProtT5 (0.91 / 0.84 vs. 0.90 / 0.84) and ESM-1b (0.89 / 0.81 vs. 0.87 / 0.80) embeddings. These improvements highlight the effectiveness of our clustering- and DDP-based approach over the alignment-based state-of-the-art approach, EBA, in remote homology detection. In addition, among other embedding-based methods, TM-Vec, which trains twin neural networks on ProtT5 embeddings to directly predict TM-scores, achieves Spearman correlations of 0.81 / 0.82 (TMmin/TMmax) and performs better than some embedding-based and traditional methods, but still lags significantly behind our approach. pLM-BLAST, which uses local alignment heuristics on contextual embeddings, achieves moderate correlations (0.58 / 0.60). ProtTucker, which employs contrastive learning over ProtT5 embeddings, yields negative correlations (–0.46 / –0.38), suggesting its embedding space is not aligned with TM-based structural similarity.
Notably, these three approaches (TM-Vec, pLM-BLAST, and ProtTucker) are trained using ProtT5 embeddings and cannot be directly applied across other embedding models, unlike our approach and EBA. Furthermore, while traditional sequence-based tools that do not utilize protein embeddings typically underperform in low-homology settings, HH-align, which leverages HMM-based profile–profile comparisons, achieves Spearman correlations of 0.82 / 0.77, outperforming several embedding-based methods such as pLM-BLAST and ProtTucker. Although it is outperformed by our approach (refer to Our work) and EBA, HH-align’s performance highlights the enduring value of evolutionary profiles in identifying remote homologs. In contrast, Needleman–Wunsch alignment using a BLOSUM62 matrix performs considerably worse (0.61 / 0.43), illustrating the limitations of fixed substitution scoring. Notably, our approach also uses the Needleman–Wunsch algorithm for global alignment but replaces the BLOSUM matrix with a similarity matrix derived from protein embeddings and refined through clustering and DDP, illustrating the effectiveness of our embedding-based scoring matrix. Overall, Table 1 demonstrates the superior performance of our approach among both traditional and embedding-based approaches on the PISCES dataset, with ProstT5 embeddings yielding the highest Spearman correlations. The consistent improvements across different pLMs and TM-score variants illustrate the effectiveness of our approach in detecting remote homologs.
We further evaluate the performance of our approach (refer to Our work) and the next-best approach, EBA, in the twilight zone by dividing the PISCES dataset into < 20% and 20–30% sequence identity subranges. Spearman correlations are reported with respect to TM-scores (TMmin/TMmax). To ensure a fair comparison, both methods use ProstT5 embeddings. As shown in Supplementary Table S1, our approach (refer to Our work) achieves higher correlations than EBA in both similarity ranges. In the < 20% range, correlations remain strong for both methods, with our approach (refer to Our work) reaching 0.923/0.864 (TMmin/TMmax) compared to 0.911/0.851 for EBA. At 20–30% sequence similarity, correlations are higher overall, and again both methods perform well, with our approach (refer to Our work) showing a modest but consistent improvement (0.934/0.875 vs. 0.925/0.860). These results indicate that while both methods are effective at capturing structural similarity under low-sequence-similarity conditions, our clustering- and DDP-based approach demonstrates a measurable improvement over EBA.
To further evaluate the predictive quality of our alignment scores, Fig. 2 shows a head-to-head comparison of our predicted scores (refer to Eq. 10) using ProstT5 embeddings against TM-scores computed by TM-align, which serve as the ground truth. Specifically, as shown in Fig. 2, our predicted alignment scores normalized by the shorter/longer sequence length (Scoremin / Scoremax) are compared to TM-scores normalized similarly (TMmin / TMmax). Each point in the scatter plots represents a protein pair, with the color scale indicating the length ratio between the sequences (shorter/longer). In both cases, our approach demonstrates strong positive correlations with the TM-scores, consistent with the high Spearman coefficients reported in Table 1. Moreover, the estimation of similarity between a pair of proteins is influenced by their length ratio: when the ratio is high (sequences of similar length, darker points), the scores show higher correlation with TMmin, whereas low ratios (large length differences, lighter points) are associated with reduced agreement (refer to Fig. 2(A)). With TMmax (refer to Fig. 2(B)) this effect is less pronounced but still partially evident. Notably, the gray dashed line in each panel represents a TM-score of 0.5, the commonly used threshold for identifying the correct fold. Overall, this figure demonstrates that our alignment scores, derived from clustering- and DDP-refined embedding similarity matrices, are well-aligned with the structural similarity metric and effective in detecting remote homologs, while the correlation with TM-scores is modestly affected by the relative sequence length ratio.
Fig. 2.
A head-to-head comparison between alignment scores predicted by our approach (using ProstT5 embeddings) and TM-scores (ground truth) computed by TM-align on the PISCES dataset. (A) Our predicted alignment scores normalized by the shorter sequence length (Scoremin) vs. TM-scores normalized similarly (TMmin). (B) Our predicted alignment scores normalized by the longer sequence length (Scoremax) vs. TM-scores normalized accordingly (TMmax). Each data point represents a protein pair from the PISCES dataset, filtered for low sequence identity (≤ 30%) and a minimum chain length of 75 residues. The horizontal gray dashed line at a TM-score of 0.5 indicates the commonly used threshold for correct fold identification. A color bar indicates the length ratio between the two sequences in each pair, calculated as the ratio of the shorter sequence length to the longer sequence length.
To complement the structural benchmarking, we also assess the runtime efficiency of our approach (refer to Our work) compared to TM-align. Notably, when embeddings are precomputed, the reported runtime of our approach reflects only the time required for alignment on a CPU. As shown in Fig. 3, our approach (with precomputed embeddings) exhibits a substantial computational advantage over TM-align while maintaining strong structural alignment performance. Specifically, across all tested protein pairs from the PISCES dataset, our approach (with precomputed embeddings) achieves an average runtime of 0.044 s per alignment, compared to 0.172 s required by TM-align (refer to Supplementary Table S2), representing approximately a fourfold speed-up. We also report end-to-end runtimes that include embedding generation (using a GPU) with different pLMs (ESM-1b, ProtT5, ProstT5) in Supplementary Table S2. In particular, using ProstT5 embeddings, which provide the best predictive performance, the average runtime remains competitive at 0.130 s per pair, still faster than TM-align.
Fig. 3.
Runtime comparison between our approach (with precomputed embeddings) and TM-align on the PISCES dataset. X-axis represents average protein length and Y-axis represents alignment runtime (in seconds). Runtime for our approach (blue points) is computed assuming that residue-level embeddings were pre-generated, reflecting only the time required for alignment after embeddings are generated. TM-align’s runtime (gray points) reflects the full computation time of structural alignment. Each data point represents one protein pair from the PISCES dataset.
Ablation study
To investigate the individual contributions of each stage in our approach, we perform an ablation study using the PISCES dataset, evaluating Spearman correlations between predicted alignment scores and TM-scores (by TM-align) across three embedding sources: ProstT5, ProtT5, and ESM-1b. As shown in Table 2, we evaluate four variants: (1) Our work: the full pipeline; (2) Our work w/o Stage 3: excludes clustering- and DDP-based refinement; (3) Our work w/o Stage 2: excludes Z-score normalization; (4) Our work w/o Stage 1: uses averaged residue embeddings to compute the Euclidean distance between protein pairs. Notably, similarity scores are expected to exhibit a positive correlation, whereas distances yield a negative correlation.
Table 2.
Contribution of individual ablated variants on PISCES dataset (best performance for each embedding is listed in bold).
| Embeddings | Methods | Spearman correlation a (TMmin) | Spearman correlation a (TMmax) |
|---|---|---|---|
| ProstT5 | Our work w/o stage 1 b | –0.67 | –0.49 |
| | Our work w/o stage 2 c | 0.58 | 0.20 |
| | Our work w/o stage 3 d | 0.91 | 0.84 |
| | Our work e | 0.93 | 0.87 |
| ProtT5 | Our work w/o stage 1 b | –0.46 | –0.37 |
| | Our work w/o stage 2 c | 0.74 | 0.55 |
| | Our work w/o stage 3 d | 0.88 | 0.83 |
| | Our work e | 0.91 | 0.84 |
| ESM-1b | Our work w/o stage 1 b | –0.37 | –0.33 |
| | Our work w/o stage 2 c | 0.65 | 0.55 |
| | Our work w/o stage 3 d | 0.86 | 0.79 |
| | Our work e | 0.89 | 0.81 |
a correlations are reported with respect to TM-scores normalized by the shorter sequence length (TMmin) and the longer sequence length (TMmax). While similarity scores are expected to exhibit a positive correlation, distance yields a negative correlation.
b uses average residue embeddings to compute Euclidean distance between protein pairs. Distance yields a negative correlation.
c excludes Z-score based normalization.
d excludes clustering and DDP-based refinement.
e our complete pipeline with Stage 1, 2, and 3.
Across all three types of embeddings, the full pipeline (refer to Our work), which incorporates all three stages, consistently outperforms the other ablated variants by yielding the highest Spearman correlations with TM-scores. Notably, the best performance is observed with ProstT5 embeddings, where the spearman correlations of our full pipeline (refer to Our work) are 0.93/0.87 (TMmin/TMmax), followed by ProtT5 (0.91/0.84) and ESM-1b (0.89/0.81) embeddings, illustrating that ProstT5 provides more robust embeddings for remote homology detection. Furthermore, when Stage 3 is removed (refer to Our work w/o Stage 3), which excludes the clustering- and DDP-based refinement introduced in this work, performance decreases consistently across all embeddings. For instance, with ProstT5 embeddings, the correlation drops from 0.93/0.87 to 0.91/0.84. Similar reductions are observed with ProtT5 (from 0.91/0.84 to 0.88/0.83) and ESM-1b embeddings (from 0.89/0.81 to 0.86/0.79). These results indicate that the clustering- and DDP-based refinement introduced in this work provides an additive improvement by refining residue-level similarities beyond what normalization alone can achieve. Moreover, the removal of Stage 2 (refer to Our work w/o Stage 2), which eliminates Z-score normalization, leads to a substantial performance decline across all embeddings. For example, the correlation with ProstT5 embeddings drops sharply to 0.58/0.20, confirming that normalization is essential for denoising the initial similarity matrix. Similar trends are observed with ProtT5 (0.74/0.55) and ESM-1b embeddings (0.65/0.55), demonstrating the importance of denoising the initial similarity matrix. Lastly, using embedding distance alone (refer to Our work w/o Stage 1) yields a negative correlation (–0.67 / − 0.49) with ProstT5 embeddings. 
Similar trends are also observed with ProtT5 (–0.46/–0.37) and ESM-1b (–0.37/–0.33) embeddings, indicating that raw distances are insufficient for capturing structural similarity and highlighting the importance of alignment-based similarity scoring.
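For reference, the Spearman correlations reported throughout this section are standard rank correlations between per-pair alignment scores and TM-scores. A minimal, dependency-free sketch of this computation is shown below; the score and TM-score values are hypothetical, not taken from the benchmark.

```python
def rank(values):
    """Assign 1-based ranks; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of equal values (sorted, so ties are contiguous)
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical alignment scores and TM-scores for five protein pairs
alignment_scores = [0.82, 0.41, 0.65, 0.23, 0.77]
tm_scores = [0.91, 0.48, 0.70, 0.30, 0.85]
print(spearman(alignment_scores, tm_scores))  # perfectly concordant ranks give 1.0
```

In practice, `scipy.stats.spearmanr` computes the same quantity; the explicit version above makes the rank-based nature of the metric clear.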
Together, these results demonstrate the complementary contributions of each stage in our approach. Specifically, the poor correlations observed when Stage 1 is removed confirm that alignment-based similarity scoring is essential; relying solely on averaged embedding distances fails to capture the structural relationships needed for effective alignment. Stage 2, which applies Z-score normalization, plays a critical role in denoising the initial similarity matrix. Its removal leads to a sharp decline in performance, indicating that normalization is important to amplify informative signals while suppressing background noise. Finally, Stage 3, the clustering- and DDP-based refinement introduced in this work, provides consistent improvements across all embedding types, even after Z-score normalization, demonstrating its potential in refining residue-level similarities. While its contribution is moderate compared to Stage 2, Stage 3 remains a key enhancement that improves alignment quality in an unsupervised manner and strengthens the overall effectiveness of our embedding-based approach for remote homology detection.
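To make the first two stages concrete, the sketch below builds a residue-level similarity matrix from two embedding matrices and applies Z-score normalization. Cosine similarity and global (whole-matrix) Z-scoring are illustrative assumptions; the exact operations are defined in the Methods section.

```python
import numpy as np

def cosine_similarity_matrix(E1, E2):
    """Stage 1 (sketch): residue-level similarity from an L1 x d and an
    L2 x d embedding matrix, giving an L1 x L2 matrix for alignment."""
    E1n = E1 / np.linalg.norm(E1, axis=1, keepdims=True)
    E2n = E2 / np.linalg.norm(E2, axis=1, keepdims=True)
    return E1n @ E2n.T

def zscore(S):
    """Stage 2 (sketch): Z-score normalization to suppress background
    similarity and amplify informative residue-pair signals."""
    return (S - S.mean()) / S.std()

# Toy embeddings for two hypothetical proteins of lengths 8 and 10 (d = 16)
rng = np.random.default_rng(0)
S = cosine_similarity_matrix(rng.normal(size=(8, 16)), rng.normal(size=(10, 16)))
Z = zscore(S)
print(Z.shape)  # one row per residue of protein 1, one column per residue of protein 2
```

The normalized matrix then serves as the scoring matrix for dynamic programming alignment, before the Stage 3 clustering- and DDP-based refinement.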
Furthermore, we evaluate the average alignment runtime of each ablated variant to assess computational efficiency. All runtime estimates assume that residue-level embeddings have been pre-generated and thus reflect only the time required for alignment on a CPU. As shown in Supplementary Table S2, the variant using only embedding distances without alignment (refer to Our work w/o Stage 1) is the fastest, requiring only 7 × 10⁻⁵ seconds per protein pair, but as discussed earlier, it yields the worst performance. Our work w/o Stage 2 and Our work w/o Stage 3 slightly increase the runtime to 0.020 and 0.021 s per pair, respectively, while the full pipeline (refer to Our work), which includes the clustering- and DDP-based refinement, requires 0.044 s on average. Notably, despite this increase, our full method remains approximately 4 times faster than TM-align (0.172 s) when embeddings are precomputed. When the cost of embedding generation (on a GPU) is also included, the end-to-end runtime of the full pipeline averages 0.119 s with ESM-1b, 0.126 s with ProtT5, and 0.130 s with ProstT5, remaining competitive with TM-align. These results suggest that our approach provides an efficient and robust framework for structure-aware sequence alignment.
Sensitivity to number of clusters and blending factor: The number of clusters (K = 20) and blending factor (α = 0.8) in our approach are inspired in part by prior work47 that uses 20-dimensional clustering for protein representations and by the intuition that a moderate blending factor balances alignment signals with cluster refinement. To further examine the effect of these parameters, as shown in Supplementary Figure S1, we vary K (20, 50, 100) and α (1.0, 0.8, 0.6) using ProstT5 embeddings on the PISCES dataset. In this setting, α = 1.0 corresponds to no contribution from clustering-based refinement (refer to Eq. 9), serving as a baseline to assess the added value of clustering- and DDP-based refinement of our approach. Relative to this baseline, α values below 1.0 yield consistently higher Spearman correlations on the PISCES dataset, indicating that clustering and DDP provide measurable gains. The highest Spearman correlations are obtained at K = 20 and α = 0.8, reaching 0.93/0.87 (TMmin/TMmax). Comparable performance is also observed at alternative settings, such as K = 100 with α = 0.6 (0.921/0.860), indicating that our approach performs consistently across different hyperparameter choices and that the selected defaults provide a reasonable balance without requiring extensive tuning.
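One illustrative reading of the blending factor, consistent with the observation that α = 1.0 removes the clustering contribution, is a convex combination of the normalized similarity matrix with a clustering-derived matrix. The sketch below uses a simple cluster-agreement matrix as a stand-in for the DDP-refined signal; the actual formulation is given by Eq. 9 and may differ in its details.

```python
import numpy as np

def blend(S_norm, labels1, labels2, alpha=0.8):
    """Convex blend (sketch of Eq. 9) of the normalized similarity matrix with
    a cluster-agreement matrix: 1 where two residues fall in the same K-means
    cluster, else 0. alpha = 1.0 disables the clustering contribution,
    matching the ablation baseline discussed in the text."""
    S_cluster = (labels1[:, None] == labels2[None, :]).astype(float)
    return alpha * S_norm + (1.0 - alpha) * S_cluster

# Hypothetical normalized similarities and K-means cluster labels
S_norm = np.array([[1.2, -0.3],
                   [-0.5, 0.9]])
labels1 = np.array([3, 7])  # cluster assignments for residues of protein 1
labels2 = np.array([3, 7])  # cluster assignments for residues of protein 2
print(blend(S_norm, labels1, labels2, alpha=0.8))
```

With α = 0.8, on-diagonal entries (same-cluster residue pairs) are boosted by 0.2 while off-diagonal entries are only rescaled, which is the intended effect of letting cluster structure refine the alignment signal.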
CATH annotation transfer performance
To evaluate how well our approach captures functionally relevant structural information from protein sequences, we benchmark its performance on the CATH annotation transfer task. This task involves predicting the CATH classification of a protein at four hierarchical levels—Class (C), Architecture (A), Topology (T), and Homology (H)—by transferring annotations from its most similar protein in a lookup set. Following the protocol used in prior work31,36, the lookup and test sets share no detectable sequence similarity (HVAL < 0), ensuring that any successful transfer reflects structural, not sequence, similarity. The accuracy scores of competing methods are obtained from published results reported in 31,36.
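The annotation transfer protocol itself is a nearest-neighbor lookup: each test protein inherits the CATH codes of its best-scoring match in the lookup set, and accuracy is computed per level. A minimal sketch, with hypothetical domain identifiers and scores:

```python
def transfer_annotation(query_scores, lookup_labels):
    """Transfer CATH labels (C, A, T, H) from the highest-scoring lookup protein.
    query_scores: {lookup_id: alignment score for one query}
    lookup_labels: {lookup_id: (C, A, T, H) codes}"""
    best = max(query_scores, key=query_scores.get)
    return lookup_labels[best]

def level_accuracy(predicted, truth, level):
    """Accuracy at one CATH level (0=Class, 1=Architecture, 2=Topology, 3=Homology)."""
    return sum(p[level] == t[level] for p, t in zip(predicted, truth)) / len(truth)

# Hypothetical lookup set with CATH codes, and one query's alignment scores
lookup_labels = {"d1": (1, 10, 10, 10), "d2": (2, 60, 120, 10), "d3": (3, 30, 70, 20)}
query_scores = {"d1": 0.42, "d2": 0.88, "d3": 0.51}
print(transfer_annotation(query_scores, lookup_labels))  # labels of best hit d2
```

Because lookup and test sets share no detectable sequence similarity (HVAL < 0), a correct transfer at finer levels such as Topology or Homology reflects structure-aware scoring rather than sequence memorization.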
As shown in Fig. 4, our approach (refer to Our work) outperforms all competing methods across all four CATH levels and achieves the highest accuracy overall. At the Class level, our approach reaches 91% accuracy, comparable to the next best method, EBA, and it shows consistent improvements over EBA at the remaining levels: 85% for Architecture (vs. 84% for EBA), 80% for Topology (vs. 78%), and 89% for Homology (vs. 88%). These consistent gains across the hierarchy indicate the effectiveness of our clustering- and DDP-based refinement of the similarity matrix. Notably, both approaches use ProstT5 embeddings and normalize alignment scores by the length of the longer protein in each pair, ensuring a fair comparison. Our approach also outperforms competing pLM-based methods such as ProtTucker and TM-Vec, which use ProtT5 embeddings, across the CATH hierarchy. While ProtTucker and TM-Vec perform competitively at higher levels, achieving accuracies of 88% and 89% at the Class level and 82% and 83% at the Architecture level, respectively, they exhibit a noticeable decline at finer-grained levels. Specifically, at the Topology level, their accuracy drops to 68% (ProtTucker) and 71% (TM-Vec), falling behind our approach by 12 and 9 percentage points, respectively. Similar gaps are observed at the Homology level, where ProtTucker and TM-Vec reach accuracies of 79% and 81%, respectively. Among non–pLM-based methods, Foldseek is the best-performing approach, achieving accuracies of 77% at Class (Δ14% vs. Our work), 73% at Architecture (Δ12%), 59% at Topology (Δ21%), and 77% at Homology (Δ12%). Similar or more pronounced gaps are observed for the other non–pLM-based methods, HMMER and MMseqs2.
Fig. 4.
Performance comparison on CATH annotation transfer across four levels: Class (C), Architecture (A), Topology (T), and Homology (H). The heatmap shows accuracy percentages achieved by our clustering- and DDP-refined embedding-based work using ProstT5 embeddings (refer to Our work), compared to other state-of-the-art protein language model (pLM)–based approaches (EBA, ProtTucker, TM-Vec) and non–pLM-based approaches (Foldseek, HMMER, and MMseqs2). Each cell represents the accuracy at the corresponding CATH level, with darker colors indicating higher accuracy. It is noted that accuracy scores of competing methods are obtained from the reported results of 31,36. Both our approach and EBA use ProstT531 embeddings and normalize alignment scores by the length of the longer protein in each aligned pair. ProtTucker and TM-Vec rely on embeddings from ProtT530.
Overall, the performance gap between pLM-based and non–pLM-based approaches indicates the value of pLM embeddings for accurate CATH annotation transfer, while the consistent improvements achieved by our approach across all CATH levels highlight the advantage of our clustering- and double dynamic programming-based refinement over existing state-of-the-art methods in identifying structural and functional relationships across the CATH hierarchy.
Performance on HOMSTRAD dataset
To assess alignment quality, we benchmark our approach on the HOMSTRAD dataset using precision and sensitivity. As shown in Fig. 5, structure-based methods such as TM-align and DALI remain the most accurate overall, with DALI attaining the highest precision (0.913) and TM-align the highest sensitivity (0.877). Other structure-based methods, including Foldseek, Foldseek-TM, and CLE-SW, also perform strongly. In contrast, the sequence-based method MMseqs2 achieves substantially lower precision (0.597) and sensitivity (0.325). Embedding-based approaches bridge this gap: EBA and our approach (refer to Our work) achieve competitive performance using ProstT5 embeddings, highlighting the value of embedding information for sequence alignment. Specifically, our approach obtains a precision of 0.864 and a sensitivity of 0.860, outperforming EBA (0.861 and 0.859, respectively) and substantially exceeding MMseqs2. It is also worth noting that structure-based methods rely on experimentally determined or predicted structural information, whereas embedding- and sequence-based methods operate directly on sequences. Together, these results demonstrate that while structure-based methods remain the most accurate, embedding-based sequence alignment is highly effective in the absence of structural data.
Fig. 5.
Benchmarking alignment quality on the HOMSTRAD dataset using (A) precision and (B) sensitivity. Bars are color-coded by method type: gray for structure-based methods (TM-align, Foldseek-TM, Foldseek, DALI, CLE-SW), blue for embedding-based methods (EBA and our approach), and orange for the sequence-based method (MMseqs2). Both our approach (refer to Our work) and EBA use ProstT531 embeddings. Precision and sensitivity scores for the competing methods are obtained from the reported results of 36. Note that sequence- and embedding-based methods rely only on sequence or embedding information and do not use structural data.
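Precision and sensitivity here are computed over aligned residue pairs against the reference structural alignments; the sketch below illustrates this standard formulation (the exact definitions follow ref. 36) on hypothetical toy alignments.

```python
def precision_sensitivity(predicted_pairs, reference_pairs):
    """Precision: fraction of predicted aligned residue pairs present in the
    reference alignment. Sensitivity: fraction of reference pairs recovered."""
    predicted, reference = set(predicted_pairs), set(reference_pairs)
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    sensitivity = correct / len(reference) if reference else 0.0
    return precision, sensitivity

# Toy alignments as (query_residue, target_residue) index pairs
predicted = [(0, 0), (1, 1), (2, 3), (3, 4)]
reference = [(0, 0), (1, 1), (2, 2), (3, 4), (4, 5)]
print(precision_sensitivity(predicted, reference))  # 3 of 4 predicted pairs correct, 3 of 5 recovered
```

The two metrics trade off naturally: a conservative aligner that emits few pairs can score high precision but low sensitivity, which is why both are reported.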
Conclusions
Despite recent advances in protein language models (pLMs), accurately aligning protein sequences in the twilight zone of sequence similarity remains a major challenge in structural bioinformatics. To address this, we introduce a new unsupervised alignment approach that combines residue-level embeddings from pretrained pLMs with a clustering- and double dynamic programming-based refinement stage. Notably, our approach operates entirely in embedding space and requires no additional training, making it broadly applicable across language models. Benchmarking on the PISCES dataset demonstrates that our approach achieves the highest Spearman correlation with TM-scores (by TM-align) across all evaluated embedding types, reaching up to 0.93 (TMmin) and 0.87 (TMmax) with ProstT5 embeddings, and consistently outperforming both traditional methods such as HH-align and Needleman–Wunsch and embedding-based state-of-the-art methods including EBA, TM-Vec, and pLM-BLAST, illustrating the effectiveness of the clustering- and DDP-based approach in detecting remote homology. Ablation studies further demonstrate that each component of our pipeline (similarity matrix construction, normalization, and refinement) plays a critical role in driving performance, with the clustering- and double dynamic programming-based refinement introduced in this work showing consistent improvements across all embedding types. Moreover, our approach generalizes beyond structural benchmarks: in CATH domain annotation transfer, it achieves the highest accuracy across all four classification levels (Class, Architecture, Topology, and Homology), outperforming pLM-based state-of-the-art methods such as EBA, ProtTucker, and TM-Vec, as well as non–pLM-based methods like Foldseek, HMMER, and MMseqs2.
To further assess alignment quality, we benchmark our approach on the HOMSTRAD dataset, where it achieves precision and sensitivity competitive with structure-based methods while outperforming existing embedding- and sequence-based approaches. The computational cost remains modest: our approach is approximately four times faster than TM-align when embeddings are precomputed. Overall, this study demonstrates that embedding-based alignment with unsupervised clustering and double dynamic programming provides a powerful approach for remote homology detection. We note that our evaluation is limited to ProtT5, ProstT5, and ESM-1b due to computational constraints; newer models such as ESM-225 have shown improved performance, and their integration into our approach represents a promising avenue for future work. We also anticipate that systematic exploration of clustering strategies and adaptive tuning of the blending factor could further enhance the robustness of our approach. In addition, future extensions may include benchmarking against retrieval-based baselines (e.g., PLMSearch57, DHR58) and conducting detailed error analyses. Together, these directions highlight the potential for our approach to continue improving as protein language models advance, while broader developments in interpretability and localization within deep learning59–61 may also provide useful context for future extensions.
Acknowledgements
This work was made possible in part by a grant of high performance computing resources and technical support from the Alabama Supercomputer Authority. The authors would like to thank Dr. Ben Okeke, Dr. Olcay Kursun, and Sai Prashanthi Pallati for helpful discussions.
Author contributions
R.S.: Writing - review & editing, Writing - original draft, Software, Validation; N.R.: Writing - review & editing, Writing - original draft, Software, Validation; S.D.: Writing - review & editing, Software, Validation; P.U.: Writing - review & editing, Visualization, Validation; C.S.: Writing - review & editing, Writing - original draft, Validation; S.B.: Writing - review & editing, Writing - original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal analysis, Data curation, Conceptualization. All authors approve the final version.
Funding
This work is partially supported by the NSF grant 2435093 (to SB).
Data availability
The PISCES40 dataset is publicly available at http://dunbrack.fccc.edu/pisces/ and https://git.scicore.unibas.ch/schwede/eba_benchmark/-/tree/main/pisces/data?ref_type=heads. The CATH34 dataset is publicly available at https://www.cathdb.info and https://git.scicore.unibas.ch/schwede/eba_benchmark/-/tree/main/cath/data?ref_type=heads. The HOMSTRAD54 dataset is publicly available at http://www-cryst.bioc.cam.ac.uk/homstrad/ and https://git.scicore.unibas.ch/schwede/eba_benchmark/-/tree/main/homstrad?ref_type=heads.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-025-23319-x.
References
- 1.Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol.215, 403–410 (1990). [DOI] [PubMed] [Google Scholar]
- 2.Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res.39, W29–W37 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nat. Genet.25, 25–29 (2000). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Finn, R. D. et al. The Pfam protein families database. Nucleic Acids Res.36, D281–D288 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Blum, M. et al. The interpro protein families and domains database: 20 years on. Nucleic Acids Res.49, D344–D354 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Huerta-Cepas, J. et al. EggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res.47, D309–D314 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Loewenstein, Y. et al. Protein function annotation by homology-based inference. Genome Biol.10, 207 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Greener, J. G., Kandathil, S. M. & Jones, D. T. Deep learning extends de Novo protein modelling coverage of genomes using iteratively predicted structural constraints. Nat. Commun.10, 3977 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature577, 706–710 (2020). [DOI] [PubMed] [Google Scholar]
- 10.Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature596, 583–589 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science373, 871–876 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Chowdhury, R. et al. Single-sequence protein structure prediction using a Language model and deep learning. Nat. Biotechnol.40, 1617–1623 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Baker, K., Hughes, N. & Bhattacharya, S. An interactive visualization tool for educational outreach in protein contact map overlap analysis. Front Bioinforma4, (2024). [DOI] [PMC free article] [PubMed]
- 14.Bhattacharya, S., Roche, R., Moussad, B. & Bhattacharya, D. DisCovER: distance- and orientation-based covariational Threading for weakly homologous proteins. Proteins Struct. Funct. Bioinforma. 90, 579–588 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Bhattacharya, S. & Bhattacharya, D. Evaluating the significance of contact maps in low-homology protein modeling using contact-assisted Threading. Sci. Rep.10, 2908 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Bhattacharya, S. & Bhattacharya, D. Does inclusion of residue-residue contact information boost protein threading? Proteins Struct. Funct. Bioinforma. 87, 596–606 (2019). [DOI] [PubMed] [Google Scholar]
- 17.Bhattacharya, S., Roche, R., Shuvo, M. H. & Bhattacharya, D. Recent advances in protein homology detection propelled by Inter-Residue interaction map Threading. Front Mol. Biosci.8, (2021). [DOI] [PMC free article] [PubMed]
- 18.Shin, J. E. et al. Protein design and variant prediction using autoregressive generative models. Nat. Commun.12, 2403 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Rost, B. Twilight zone of protein sequence alignments. Protein Eng. Des. Sel.12, 85–94 (1999). [DOI] [PubMed] [Google Scholar]
- 20.Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res.33, 2302–2309 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Holm, L., Kääriäinen, S., Wilton, C. & Plewczynski, D. Using Dali for structural comparison of proteins. Curr. Protoc. Bioinforma. 14, 551–5524 (2006). [DOI] [PubMed] [Google Scholar]
- 22.Zhu, J. & Weng, Z. FAST: a novel protein structure alignment algorithm. Proteins Struct. Funct. Bioinforma. 58, 618–627 (2005). [DOI] [PubMed] [Google Scholar]
- 23.Ortiz, A. R., Strauss, C. E. M. & Olmea, O. MAMMOTH (Matching molecular models obtained from theory): an automated method for model comparison. Protein Sci.11, 2606–2621 (2002). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods. 19, 679–682 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science379, 1123–1130 (2023). [DOI] [PubMed] [Google Scholar]
- 26.Varadi, M. et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res.50, D439–D444 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Richardson, L. et al. MGnify: the Microbiome sequence data analysis resource in 2023. Nucleic Acids Res.51, D753–D759 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Nordberg, H. et al. The genome portal of the department of energy joint genome institute: 2014 updates. Nucleic Acids Res.42, D26–D31 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Hamamsy, T. et al. Protein remote homology detection and structural alignment using deep learning. Nat. Biotechnol.42, 975–985 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell.44, 7112–7127 (2022). [DOI] [PubMed] [Google Scholar]
- 31.Heinzinger, M. et al. Contrastive learning on protein embeddings enlightens midnight zone. NAR Genomics Bioinforma. 4, lqac043 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Hie, B. L., Yang, K. K. & Kim, P. S. Evolutionary velocity with protein Language models predicts evolutionary dynamics of diverse proteins. Cell. Syst.13, 274–285e6 (2022). [DOI] [PubMed] [Google Scholar]
- 33.Schütze, K., Heinzinger, M., Steinegger, M. & Rost, B. Nearest neighbor search on embeddings rapidly identifies distant protein relations. Front Bioinforma2, (2022). [DOI] [PMC free article] [PubMed]
- 34.Sillitoe, I. et al. CATH: increased structural coverage of functional space. Nucleic Acids Res.49, D266–D273 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Xu, J. & Zhang, Y. How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics26, 889–895 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Pantolini, L. et al. Embedding-based alignment: combining protein Language models with dynamic programming alignment to detect structural similarities in the twilight-zone. Bioinformatics40, btad786 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Bepler, T. & Berger, B. Learning the protein language: Evolution, structure, and function. Cell. Syst.12, 654–669e3 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Morton, J. T. et al. Protein structural alignments from sequence. Preprint at bioRxiv 10.1101/2020.11.03.365932 (2020).
- 39.Iovino, B. G. & Ye, Y. Protein embedding based alignment. BMC Bioinform.25, 85 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Wang, G. & Dunbrack, R. L. Jr. PISCES: a protein sequence culling server. Bioinformatics19, 1589–1591 (2003). [DOI] [PubMed] [Google Scholar]
- 41.Heinzinger, M. et al. Bilingual Language model for protein sequence and structure. NAR Genomics Bioinforma. 6, lqae150 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. 118, e2016239118 (2021). [DOI] [PMC free article] [PubMed]
- 43.Vaswani, A. et al. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems 6000–6010 (Curran Associates Inc., Red Hook, NY, USA, 2017).
- 44.Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics31, 926–932 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol.42, 243–246 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol.48, 443–453 (1970). [DOI] [PubMed] [Google Scholar]
- 47.Bielińska-Wąż, D., Wąż, P. & Błaczkowska, A. 20D-dynamic representation of protein sequences combined with K-means clustering. 10.2174/0113862073359729250220131623 (2025). [DOI] [PubMed]
- 48.Taylor, W. R. Protein structure comparison using iterated double dynamic programming. Protein Sci.8, 654–665 (1999). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Ovchinnikov, S. et al. Protein structure determination using metagenome sequence data. Science355, 294–298 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinform.20, 473 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Kaminski, K., Ludwiczak, J., Pawlicki, K., Alva, V. & Dunin-Horkawicz, S. pLM-BLAST: distant homology detection based on direct comparison of sequence representations from protein Language models. Bioinformatics39, btad579 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Henikoff, S. & Henikoff, J. G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89, 10915–10919 (1992). [DOI] [PMC free article] [PubMed]
- 53.Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol.35, 1026–1028 (2017). [DOI] [PubMed] [Google Scholar]
- 54.Mizuguchi, K., Deane, C. M., Blundell, T. L. & Overington, J. P. HOMSTRAD: A database of protein structure alignments for homologous families. Protein Sci.7, 2469–2471 (1998). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Holm, L. & Sander, C. Protein structure comparison by alignment of distance matrices. J. Mol. Biol.233, 123–138 (1993). [DOI] [PubMed] [Google Scholar]
- 56.Wang, S. & Zheng, W. M. CLePAPS: fast pair alignment of protein structures based on conformational letters. J. Bioinform Comput. Biol.06, 347–366 (2008). [DOI] [PubMed] [Google Scholar]
- 57.Liu, W. et al. PLMSearch: protein Language model powers accurate and fast sequence search for remote homology. Nat. Commun.15, 2775 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Hong, L. et al. Fast, sensitive detection of protein homologs using deep dense retrieval. Nat. Biotechnol.43, 983–995 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Nie, L. et al. Unveiling the black box of PLMs with semantic anchors: towards interpretable neural semantic parsing. Proc. AAAI Conf. Artif. Intell.37, 13400–13408 (2023). [Google Scholar]
- 60.Zhang, B. et al. Multi-label subcellular localization predict based on cluster balanced subspace partitioning method and multi-class contrastive representation learning. IEEE J. Biomed. Health Inf. 1–14. 10.1109/JBHI.2025.3537284 (2025). [DOI] [PubMed]
- 61.Li, X. et al. Interpretable deep learning: interpretation, interpretability, trustworthiness, and beyond. Knowl. Inf. Syst.64, 3197–3234 (2022). [Google Scholar]