Abstract
Homology is a key concept underpinning the comparison of sequences across organisms. Sequence-level homology is based on a statistical framework optimized over decades of work. Recently, computational protein structure prediction has enabled large-scale homology inference beyond the limits of accurate sequence alignment. In this regime, it is possible to observe nearly identical protein structures lacking detectable sequence similarity. In the absence of a robust statistical framework for structure comparison, it is largely assumed similar structures are homologous. However, it is conceivable that matching structures could arise through convergent evolution, resulting in analogous proteins without shared ancestry. Large databases of predicted structures offer a means of determining whether analogs are present among structure matches. Here, I find that a small subset (∼2.6%) of Foldseek clusters lack sequence-level support for homology, including ∼1% of strong structure matches with template modeling score ≥ 0.5. This result by itself does not imply these structure pairs are nonhomologous, since their sequences could have diverged beyond the limits of recognition. Yet, strong matches without sequence-level support for homology are enriched in structures with predicted repeats that could induce spurious matches. Some of these structural repeats are underpinned by sequence-level tandem repeats in both matching structures. I show that many of these tandem repeat units have genealogies inconsistent with their corresponding structures sharing a common ancestor, implying these highly similar structure pairs are analogous rather than homologous. This result suggests caution is warranted when inferring homology from structural resemblance alone in the absence of sequence-level support for homology.
Keywords: homology, analogy, protein structure search, TM-score
Significance.
Fast protein structural search programs are revolutionizing our ability to detect remote homologs. It is presently unclear whether strong protein structure matches arise solely through homology or may result from convergent evolution to similar protein structures. Shared ancestry is the basis for many evolutionary analyses, and analogous structures could taint search results. Here, I show that strong structure matches (template modeling score [TM-score] ≥ 0.5) lacking evidence for homology in their underlying sequences are depleted in multidomain proteins and enriched in structural repeats. Tandem sequence repeats underlying the structural repeats display genealogies that are inconsistent with the proteins sharing a common ancestor. Therefore, strong protein structure matches can arise from analogy, and structural similarity alone is insufficient to assert homology. Structure search programs may need to incorporate structural complexity to avoid nonhomologous hits, much as sequence search programs mask repeats and low complexity regions.
Introduction
Recent advances in protein structure prediction have dramatically increased access to high-quality protein structures (Jumper et al. 2021; Lin et al. 2023). Databases of predicted protein structures coupled with fast protein structural search programs, such as Foldseek (van Kempen et al. 2024) and PLMSearch (Liu et al. 2024), have unlocked the ability to perform structural searches at scale. Given a predicted query structure, these programs can quickly find similar structures within a large database of targets. However, it is presently unclear whether these so-called “structurlogs” share a common ancestor (i.e. are homologs) or arose through convergent evolution (i.e. are analogs). The ability to discern between these two possibilities would be useful given the requirement for homology in many downstream analyses.
Analogy traditionally refers to traits with similar functions but different ancestry. For proteins, the assumption can be made that function follows form and, therefore, structural similarity is a proxy for functional similarity. There are many notable exceptions wherein similar protein structures can serve different functions, or similar functions can be carried out by different protein structures. For simplicity, I follow the convention of referring to proteins without a shared ancestor, but with similar structures, as analogous without consideration for whether the proteins serve the same function (Orengo et al. 2001). Since the objective of structure search is typically to identify homologous sequences via structural similarity, the existence of analogous structures could interfere with that objective.
It is known that structural motifs, such as α-helices and enzymatic active site configurations, have repeatedly evolved across different proteins (Cheng et al. 2007; McGhee 2011; Murata et al. 2024). Claims of convergent evolution for larger macromolecular protein structures have been called into question (Mackin et al. 2014; Seong and Krasileva 2023). Structural search programs often use manually curated homolog databases to establish cutoffs that avoid nonhomologous matches (Barrio-Hernandez et al. 2023), but these cutoffs do not guarantee that significant matches are due to homology rather than analogy. Since both homologous and analogous structures are expected to share topological similarity, it is unclear how to distinguish between homology and analogy using structures alone. This issue is of particular concern for small, low complexity, or repetitive proteins that readily evolve across different genomes (Wolynes 1996; Johnston et al. 2022).
A universal threshold for sequence-level homology has yet to be established. BLAST- and HMM-based approaches employ a statistical model based on the probability of finding a sequence match by chance, which depends on the size of the database being searched. Low complexity and repetitive regions are typically masked, because these sequence motifs are far more likely to independently arise in biological sequences than expected by chance (Frith 2011b). In this work, I adopted a pragmatic approach to quantifying homology based on summing amino acid substitution scores for well-aligned positions in a structural alignment. I used bootstrapping to quantify similarity relative to random sequences with the same amino acid composition. This approach quickly identifies matches with lower than expected sequence-level similarity, although this information alone is insufficient to guarantee two proteins are nonhomologous.
Here, I apply phylogenetic methods to investigate the evidence for analogy versus homology. I focused on strong structure matches because weak matches could be the result of spurious nonhomologous hits. Structural searches can be made independently of sequences, which allows sequence similarity to serve as an independent test of homology. Strong structure matches lacking support for sequence homology could result from analogous structures or an extremely high level of sequence divergence. Only a small fraction of strong structure matches lacked sequence-level support for homology. This subset was enriched in structural repeats, which may result in the false appearance of homology. I expose analogy by making use of the fact that homologous tandem repeats must have existed prior to divergence between sequences. Collectively, the results of this study provide insight into the relationships among homology, analogy, and the degree to which structures match.
Results
A Subset of Strong Structure Matches Lack Sequence Support for Homology
Homology implies that traits become more similar approaching their last common ancestor, while analogy implies traits become less similar in ancestors (Fig. 1a). Determining whether two proteins are homologous is equivalent to testing whether they share a common ancestor within finite time (Schaper et al. 2012). For a pair of sequences, this test can be approximated by determining whether the substitution score for aligned residues is higher than expected for randomly aligned residues (Fig. 1b). This test depends on an accurate alignment, which requires structural alignment for protein matches with identities <20%.
Fig. 1.
Discerning homology from analogy in structure matches. a) Birds share homologous beaks that descended from a common ancestor through divergence. In contrast, parrots and parrot fish have analogous beak morphology as a result of convergent evolution. b) Genotypes can be used to determine whether similar traits are the result of shared ancestry (homology) or resulted from convergent evolution (analogy). Asking whether structures share a common ancestor is equivalent to testing if the time to their most recent common ancestor (tMRCA) is less than infinity. To this end, a structural match's substitution score is compared to that of bootstrapped sequence alignments drawn from the match's background sequence distribution. Sequences with support for homology will have substitution scores greater than the random expectation. This approach can distinguish homology from nonhomology but not analogy from nonhomology. c) Tandem repeat units must exist before they diverge in order for the repeat units to be homologous. The process of descent from a common ancestor is expected to result in intermixed repeats on a phylogenetic tree created from the alignment of repeat units. However, tandem repeat units underlying analogous proteins are expected to segregate into separate clades on a phylogenetic tree. Therefore, the support for the branch partitioning the two clades on the phylogenetic tree is a measure of analogy.
A previously published set of “structurlogs” provides an ideal data set for testing homology (Barrio-Hernandez et al. 2023). This data set was generated by clustering the AlphaFold Database (Varadi et al. 2022) at 50% sequence identity and then clustering the remaining structures with Foldseek at an E-value of 0.01 with at least 90% overlap (van Kempen et al. 2024). To avoid issues related to incorrect structure predictions, I downloaded the AlphaFold Database and subset it to only high confidence predicted structures. The TM-score was used to quantify structural similarity based on pairwise alignments output by US-align (Zhang et al. 2022).
As shown in Fig. 2, 97.4% of structure matches had support for homology using this method (bootstrap support ≥ 0.99). Support for homology decreased below a TM-score of ∼0.5. Previous work showed the probability of random structure matches is 5.5E−7 at a TM-score of 0.5 for proteins with 80 to 200 amino acids (Xu and Zhang 2010). A minority (1.0%) of structure matches with high TM-scores (≥0.5) had low levels of support for homology (<0.99). Strong structure matches with low support for homology were depleted of multidomain proteins (one-sided Fisher's exact test P < 5E−3) and enriched in structural repeats (one-sided Fisher's exact test P < 1E−15). Given that low complexity and repetitive sequences are the source of many nonhomologous sequence matches, it is possible that some of these unsupported structure matches may result from analogy.
Fig. 2.
Some strong structure matches lack support for homology. Points represent Foldseek structure matches between high confidence AlphaFold Database structures. Average support for homology increased with structural similarity (TM-score). However, 1% of strong structure matches (TM-score ≥ 0.5) lacked support for homology (i.e. points below 0.99 to the right of the vertical dashed line). This subset was depleted of multidomain proteins and enriched in structural repeats. The curve represents a logistic regression fit and 95% confidence intervals (shaded area). Red points represent the presence of a structural repeat with high repeat unit similarity (average repeat unit TM-score ≥ 0.5).
Some Nonhomologous Proteins Fold into Strongly Analogous Structures
I next sought to investigate the authenticity of strong structure matches with low support for homology. Of the 145,196 total structural alignments, 134,836 (92.9%) had high TM-scores (≥0.5), and 1,338 (1%) of these lacked support for homology. Notably, 14,707 strong structure matches included a protein with a detectable structural repeat (10.9%), including 477 of those lacking support for homology (35.7%). Only 2,155 strong structure matches had a protein with a detectable sequence-level tandem repeat (1.6%), including 95 of those lacking support for homology (7.1%). Using the RepeatsDB taxonomy (Clementel et al. 2024), I manually classified the subset of 90 structure pairs lacking support for homology that had predicted structure and sequence repeats. These structures were composed mostly of β-solenoids (59%) or α-helical coils consisting of a single (20%) α-helix or multiple (9%) α-helices separated by turns. As shown in Fig. 3, this subset also included more complex repeat structures, such as propellers (6%) and β-beads (1%).
Fig. 3.
Representative structural repeat proteins. Foldseek-clustered AlphaFold Database structure pairs with high TM-scores (≥0.5) but low support for homology (<0.99) were investigated. Matches were filtered to the subset of 90 structure pairs with detectable structural repeats and sequence-level tandem repeats. Selected structural alignments from different repeat types are shown. Pairs of structures with the highest TM-scores tended to be β-solenoids or α-helical coils. However, more complex repeat structures were also observed. UniProt accessions are listed in the color corresponding to each structure.
Repetitive sequences are notoriously difficult for programs to accurately align. For this reason, I used structural alignments to guide manual multiple sequence alignment of the repeat units underlying the strongest structure matches without support for homology. This process resulted in 30 high-quality alignments that were used to construct phylogenetic trees of the repeats' evolution. As shown in Fig. 4, none of the trees displayed a branching order indicating homology (Fig. 1c). Instead, the trees had high bootstrap support (≥0.80) for genealogies in which all of the repeat units arose more recently than a hypothetical common ancestor. This order of events precludes the possibility of homology for the tandem repeat arrays spanning the majority of the proteins. Given their high TM-scores, these structure pairs can be considered analogs that resulted from convergent evolution to very similar structures.
Fig. 4.
Strong structure matches likely due to analogy. The repeat units underlying a subset of 30 matching structures with very high structural similarity (TM-score ≥ 0.78) underwent structure-guided alignment. All balanced minimum evolution trees constructed from the repeat unit alignments had strong bootstrap support for topologies consistent with the repeat units arising after splitting from a hypothetical common ancestor. This tree topology implies the structures are analogous, because the repeats did not exist in the same ancestor. Some tandem repeats had different periodicity between structures, further indicating the sequences are nonhomologous. Trees are mid-point rooted with bootstrap support listed above the root. UniProt accessions are listed in the color corresponding to the leaves. Repeat unit alignments are depicted as a set of boxes colored by amino acid, with gaps (i.e. “–”) in black.
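The topological criterion described in Fig. 1c and applied in Fig. 4 can be illustrated with a minimal Python sketch. It is not the analysis code used in this study: the nested-tuple tree encoding, the leaf naming scheme ("protein_unit"), and the function names are all assumptions made for illustration. The sketch checks whether some branch of a single tree separates every repeat unit of one protein from every repeat unit of the other.

```python
def leaves(tree):
    """Return the set of leaf names under a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set below every node of a rooted tree."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def repeats_segregate(tree, protein_of=lambda leaf: leaf.split("_")[0]):
    """True if one branch splits the repeat units of two proteins.

    The leaf naming convention handled by `protein_of` is hypothetical.
    Because only two proteins are involved, the split exists whenever the
    units of either protein form a clade, wherever the root was placed.
    """
    all_leaves = leaves(tree)
    proteins = sorted({protein_of(leaf) for leaf in all_leaves})
    if len(proteins) != 2:
        raise ValueError("expected repeat units from exactly two proteins")
    first = frozenset(l for l in all_leaves if protein_of(l) == proteins[0])
    second = all_leaves - first
    return any(clade in (first, second) for clade in clades(tree))


# Units segregate by protein (the pattern expected under analogy)
separate = ((("A_1", "A_2"), "A_3"), (("B_1", "B_2"), "B_3"))
# Units are intermixed (the pattern expected under homology)
intermixed = ((("A_1", "B_1"), ("A_2", "B_2")), ("A_3", "B_3"))
print(repeats_segregate(separate), repeats_segregate(intermixed))
```

Under these assumptions, the bootstrap support reported above the root in Fig. 4 would correspond to the fraction of bootstrap trees for which such a check succeeds.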
Discussion
Convergence is a key concept in molecular evolution with many examples and implications for the predictability of traits (McGhee 2011; Satterlee et al. 2024). The existence of functional convergence was previously established for proteins (Logsdon and Doolittle 1997) and noncoding RNAs (Salehi-Ashtiani and Szostak 2001). Here, I uncovered rare cases of structural convergence among AlphaFold structures clustered by Foldseek. Most structure matches showed evidence for underlying sequence homology, especially those with high TM-scores. However, a notable fraction (∼1%) of strong structure matches lacked sequence-level support for homology, and a few of these contained recent tandem repeats that made it possible to demonstrate they are actually putative analogs. It is plausible that many more analogs exist among structure search results given the shared building blocks (i.e. α-helices and β-sheets) comprising most proteins.
The tree topologies contrasted in Fig. 1c are only two of many possibilities, and other topologies may not clearly discern homology from analogy. It is known that some tandem repeats undergo internal repeat duplication and deletion (Persi et al. 2016; Zhu et al. 2016; Galpern et al. 2022), which could result in mixed homologous and analogous protein matches. It is similarly feasible an ancient progenitor element is shared across two proteins but recently underwent tandem repetition. Nevertheless, all 30 tree topologies that were closely investigated (Fig. 4) are consistent with large repeats arising independently in two proteins and, therefore, not sharing a common ancestor. It is important to note that many other structural repeats existed in proteins with support for homology (Fig. 2; supplementary table S1, Supplementary Material online), so the existence of a repeat does not imply an absence of homology.
A limitation of this study was the use of substitution scores to identify potentially nonhomologous matches. On one hand, bootstrapping is unlikely to expose low complexity regions that are a common source of spurious sequence matches (Frith 2011a), and the number of nonhomologous matches may have been underestimated by this approach. Bootstrapping also required 50 well-aligned residues, which excluded 9,433 smaller structure matches and may have further underestimated the prevalence of nonhomology or analogy. On the other hand, some proteins may have diverged to the point where their substitution score is within the background distribution of a permuted sequence, which would overestimate nonhomology. Structural similarity provides the only remaining evidence for homology in these cases, and it would be useful to have a measure of significance for structural matches with a comparably robust statistical foundation to the methods commonly employed for quantifying sequence homology. While defining such a method is beyond the scope of this study, future research may consider calculating the likelihood of a given set of matching structural elements by chance.
The main conclusion of this study is that analogous structures may infiltrate structure search results, and strong structural similarity does not imply homology. In contrast to analogous matches, nonhomologous matches were expected given the Foldseek clusters were obtained with an E-value of 0.01. This permissive threshold could result in a nonhomologous hit every 100 structure searches even in the absence of analogy. Furthermore, Foldseek E-values are believed to be underestimated by orders of magnitude (Edgar 2024). The substantial number of pairs with low TM-scores confirmed that weak structure matches were pervasive among Foldseek clusters. Hence, less permissive E-values are required to avoid spurious hits when performing many searches, such as when clustering large numbers of sequences. Stringent E-values are particularly important when assigning homology through transitivity, i.e. when assuming clustered proteins are all homologous because they share similarity to a single cluster representative. As with sequence-based clustering (Wright 2024), it is more accurate to establish relationships using all-versus-all measurements of similarity (i.e. average- or complete-linkage).
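The arithmetic behind the E-value argument can be made explicit with a short Python sketch. It assumes that E-values are calibrated, that searches are independent, and that chance hits follow a Poisson distribution; the function names are hypothetical, and none of this is part of Foldseek.

```python
import math


def expected_spurious_hits(e_value, n_searches):
    """Expected number of chance hits across independent searches."""
    return e_value * n_searches


def prob_any_spurious_hit(e_value, n_searches):
    """Probability of at least one chance hit (Poisson assumption)."""
    return 1.0 - math.exp(-e_value * n_searches)


# At an E-value of 0.01, 100 searches are expected to yield one chance hit
print(expected_spurious_hits(0.01, 100))
print(round(prob_any_spurious_hit(0.01, 100), 3))
```

If the reported E-values were underestimated by orders of magnitude, the true expectation would scale up by the same factor, which is one reason a stricter threshold is advisable when many searches are performed.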
It is difficult to definitively prove analogy, and the method drawing on tandem repeats used here will only apply to a small fraction of proteins. Therefore, the burden of proof lies on proving homology rather than disproving analogy, and caution is merited when there is an absence of proof for homology. The fact that analogs may pollute structure search results does not diminish the utility of structure search programs for finding remote homologs. The high conservation of structures facilitates finding and aligning remote homologs that might be missed by sequence search programs. Notably, the vast majority (99%) of strong structure matches had sequence-level support for homology, implying TM-score is a reasonable, albeit imperfect, proxy for shared ancestry. Future measures of structure similarity may account for repeats and low complexity structures to avoid nonhomologous hits. In the meantime, sequence information can be used to validate structure matches when homology is required.
Materials and Methods
AlphaFold Database (v4) structures were subset to the 119,404,806 having at least 80% of positions with pLDDT of at least 70, which is the confidence level classified as providing a good backbone prediction (Tunyasuvunakool et al. 2021). Foldseek clusters (Barrio-Hernandez et al. 2023) were subset to those containing exactly two sequences such that one sequence was the cluster representative. Pairwise structural alignment was performed using US-align (v20240602) under default settings, which outputs TM-scores and a sequence alignment (Zhang et al. 2022). TM-scores normalized by the length of the shorter protein in the pair were used throughout.
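As a minimal sketch of the two quantities defined in this paragraph, the Python below applies the pLDDT filter and evaluates the standard TM-score formula for a fixed superposition, normalized by the shorter protein. US-align additionally searches for the superposition that maximizes the score, which is not attempted here. The function names and the 0.5 Å floor on the distance scale for very short chains are assumptions for illustration.

```python
def is_high_confidence(plddt, min_plddt=70.0, min_fraction=0.8):
    """True if at least `min_fraction` of residues have pLDDT >= `min_plddt`."""
    if not plddt:
        return False
    confident = sum(1 for score in plddt if score >= min_plddt)
    return confident / len(plddt) >= min_fraction


def tm_score(distances, len_a, len_b):
    """TM-score of a fixed superposition, normalized by the shorter protein.

    distances: alpha-carbon distances (angstroms) of the aligned residue pairs.
    """
    norm_len = min(len_a, len_b)
    if norm_len > 15:
        # Standard length-dependent distance scale d0
        d0 = 1.24 * (norm_len - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.0
    d0 = max(d0, 0.5)  # assumed floor so that short chains stay well defined
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / norm_len


# A perfect superposition covering half of the shorter protein scores 0.5
print(tm_score([0.0] * 50, 100, 120))
print(is_high_confidence([90.0] * 8 + [50.0] * 2))
```

Normalizing by the shorter protein means a small protein that fits entirely within a larger one can still reach a high score, which is worth keeping in mind when interpreting the ≥0.5 threshold.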
Substitution scores were computed for well-aligned residues (α-carbon distance < 5 Å) using the PFASUM50 substitution matrix (Keul et al. 2017). Amino acids in the alignments were resampled with replacement to determine the distribution of substitution scores for random alignments of similar sequence composition. Support for homology was quantified as the fraction of 10,000 bootstrap replicates with a substitution score below that of the structural alignment. At least 50 well-aligned residues were required for analysis to avoid pairs with insufficient information. Domain predictions were made with UniDoc (Zhu et al. 2023). Structural repeats were predicted with CE-Symm (v2.3.0) and only considered if their average repeat unit TM-score was at least 0.5 (Bliven et al. 2019).
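The bootstrap procedure can be sketched in Python as follows. This is one plausible reading of the description above rather than the code used in the study: residues from each side of the alignment are resampled independently with replacement and re-paired at random, and support is the fraction of replicates scoring below the observed alignment. The `toy_score` function is a hypothetical stand-in for PFASUM50, and the input is assumed to contain only well-aligned residue pairs.

```python
import random


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix such as PFASUM50."""
    return 4 if a == b else -1


def homology_support(pairs, score=toy_score, replicates=10_000,
                     min_pairs=50, seed=0):
    """Fraction of random re-pairings that score below the observed alignment.

    pairs: list of (residue_a, residue_b) tuples for well-aligned positions.
    """
    if len(pairs) < min_pairs:
        raise ValueError("too few well-aligned residues for analysis")
    rng = random.Random(seed)
    observed = sum(score(a, b) for a, b in pairs)
    residues_a = [a for a, _ in pairs]
    residues_b = [b for _, b in pairs]
    n = len(pairs)
    below = 0
    for _ in range(replicates):
        # Resampling preserves each sequence's amino acid composition
        sample_a = rng.choices(residues_a, k=n)
        sample_b = rng.choices(residues_b, k=n)
        if sum(score(a, b) for a, b in zip(sample_a, sample_b)) < observed:
            below += 1
    return below / replicates


seq = "ACDEFGHIKLMNPQRSTVWY" * 3
identical = list(zip(seq, seq))
print(homology_support(identical, replicates=1000) >= 0.99)
```

In this sketch a pair would be counted as supported when the returned value is at least 0.99, matching the threshold used in the Results.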
Statistical analyses were conducted in R (v4.4.0). The glm function was used to perform logistic regression with the quasibinomial family. Domain enrichment analysis was done with the fisher.test function using all pairs with TM-score ≥ 0.5 to compare the association between proteins having multiple (≥2) domains and support (≥0.99) for homology. Similarly, fisher.test was used to quantify the association between proteins with structural repeats (average repeat unit TM-score ≥ 0.5) and support (≥0.99) for homology. Structure superposition images were created with the RCSB Protein Data Bank pairwise structural alignment tool using TM-align default settings (Bittrich et al. 2024).
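For readers who wish to check the repeat enrichment, a one-sided Fisher's exact test can be written in a few lines of Python using exact integer arithmetic. This is a sketch, not the R code used here. The 2 × 2 table is reconstructed from the counts given in the Results (134,836 strong matches, 1,338 lacking support, 14,707 with a structural repeat, 477 with both), under the assumption that those counts use the same repeat criterion as the test; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test P-value for the table [[a, b], [c, d]].

    Returns the hypergeometric probability that the top-left cell is at
    least as large as observed, given fixed row and column totals.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(comb(col1, k) * comb(total - col1, row1 - k)
                    for k in range(a, upper + 1))
    return float(Fraction(numerator, comb(total, row1)))


# Rows: lacking support / with support; columns: repeat / no repeat
unsupported_repeat = 477
unsupported_other = 1338 - 477
supported_repeat = 14707 - 477
supported_other = 134836 - 14707 - unsupported_other
p_value = fisher_greater(unsupported_repeat, unsupported_other,
                         supported_repeat, supported_other)
print(p_value < 1e-15)
```

Exact integer arithmetic avoids the floating-point overflow that binomial coefficients of this size would otherwise cause, at the cost of some speed.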
Structures with sequence-level repeats were identified with the DetectRepeats function in the DECIPHER (v3.3.1) package (Wright 2015) using the default minimum score of 10. I attempted to align tandem repeats for the top scoring 32 of 90 structure matches lacking support for homology and having detectable sequence-level repeats. Only two structure matches were excluded because they contained complex repeats that could not be confidently aligned across sequences (clusters A0A183HIC3 and A0A6P1CX94). Using the DECIPHER function Treeline, balanced minimum evolution phylogenetic trees were constructed from multiple sequence alignments of repeat units according to pairwise distance matrices based on the WAG model of amino acid evolution (Whelan and Goldman 2001). Branch support values were obtained from 1,000 bootstrap replicates generated by resampling columns of the multiple sequence alignment with replacement and recomputing the tree.
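The column resampling step used for branch support can be illustrated with a short Python sketch. It does not reproduce DECIPHER's Treeline: tree reconstruction is delegated to a hypothetical callback (`has_split`) that would rebuild a tree from each replicate and report whether the branch of interest is present. All names are assumptions for illustration.

```python
import random


def resample_columns(alignment, rng):
    """Return a bootstrap replicate by sampling columns with replacement."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    columns = [rng.randrange(width) for _ in range(width)]
    return ["".join(row[i] for i in columns) for row in alignment]


def branch_support(alignment, has_split, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which `has_split` returns True.

    has_split: hypothetical callback that rebuilds a tree from a resampled
    alignment and tests for the branch of interest.
    """
    rng = random.Random(seed)
    hits = sum(bool(has_split(resample_columns(alignment, rng)))
               for _ in range(replicates))
    return hits / replicates


units = ["ACDE-GH", "ACDEFGH", "TCDE-GH"]
replicate = resample_columns(units, random.Random(1))
print(len(replicate), len(replicate[0]))
```

Resampling whole columns keeps the residues of each alignment position together, so every replicate remains a valid alignment of the same repeat units.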
Acknowledgments
I am grateful to Spencer Bliven for fixing an issue in CE-Symm identified during the course of this work.
Supplementary Material
Supplementary material is available at Genome Biology and Evolution online.
Funding
This work was funded by the NIAID at the NIH (grant number U01AI176418).
Data Availability
Results for all structural alignments are provided as a supplemental table.
Literature Cited
- Barrio-Hernandez I, Yeo J, Janes J, Mirdita M, Gilchrist CLM, Wein T, Varadi M, Velankar S, Beltrao P, Steinegger M. Clustering predicted structures at the scale of the known protein universe. Nature. 2023:622(7983):637–645. 10.1038/s41586-023-06510-w.
- Bittrich S, Segura J, Duarte JM, Burley SK, Rose Y. RCSB Protein Data Bank: exploring protein 3D similarities via comprehensive structural alignments. Bioinformatics. 2024:40(6):btae370. 10.1093/bioinformatics/btae370.
- Bliven SE, Lafita A, Rose PW, Capitani G, Prlic A, Bourne PE. Analyzing the symmetrical arrangement of structural repeats in proteins with CE-Symm. PLoS Comput Biol. 2019:15(4):e1006842. 10.1371/journal.pcbi.1006842.
- Cheng H, Kim BH, Grishin NV. MALISAM: a database of structurally analogous motifs in proteins. Nucleic Acids Res. 2007:36(Database):D211–D217. 10.1093/nar/gkm698.
- Clementel D, Arrias PN, Mozaffari S, Osmanli Z, Castro XA; Repeats DBc curators, Ferrari C, Kajava AV, Tosatto SCE, Monzon AM. RepeatsDB in 2025: expanding annotations of structured tandem repeats proteins on AlphaFoldDB. Nucleic Acids Res. 2024:53(D1):D575–D581. 10.1093/nar/gkae965.
- Edgar RC. Protein structure alignment by Reseek improves sensitivity to remote homologs. Bioinformatics. 2024:40(11):btae687. 10.1093/bioinformatics/btae687.
- Frith MC. Gentle masking of low-complexity sequences improves homology search. PLOS ONE. 2011a:6(12):e28819. 10.1371/journal.pone.0028819.
- Frith MC. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res. 2011b:39(4):e23. 10.1093/nar/gkq1212.
- Galpern EA, Marchi J, Mora T, Walczak AM, Ferreiro DU. Evolution and folding of repeat proteins. Proc. Natl. Acad. Sci. USA. 2022:119(31):e2204131119. 10.1073/pnas.2204131119.
- Johnston IG, Dingle K, Greenbury SF, Camargo CQ, Doye JPK, Ahnert SE, Louis AA. Symmetry and simplicity spontaneously emerge from the algorithmic nature of evolution. Proc. Natl. Acad. Sci. USA. 2022:119(11):e2113883119. 10.1073/pnas.2113883119.
- Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Zidek A, Potapenko A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021:596(7873):583–589. 10.1038/s41586-021-03819-2.
- Keul F, Hess M, Goesele M, Hamacher K. PFASUM: a substitution matrix from Pfam structural alignments. BMC Bioinformatics. 2017:18(1):293. 10.1186/s12859-017-1703-z.
- Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, Smetanin N, Verkuil R, Kabeli O, Shmueli Y, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023:379(6637):1123–1130. 10.1126/science.ade2574.
- Liu W, Wang Z, You R, Xie C, Wei H, Xiong Y, Yang J, Zhu S. PLMSearch: protein language model powers accurate and fast sequence search for remote homology. Nat. Commun. 2024:15(1):2775. 10.1038/s41467-024-46808-5.
- Logsdon JM Jr, Doolittle WF. Origin of antifreeze protein genes: a cool tale in molecular evolution. Proc. Natl. Acad. Sci. USA. 1997:94(8):3485–3487. 10.1073/pnas.94.8.3485.
- Mackin KA, Roy RA, Theobald DL. An empirical test of convergent evolution in rhodopsins. Mol. Biol. Evol. 2014:31(1):85–95. 10.1093/molbev/mst171.
- McGhee GR. Convergent evolution: limited forms most beautiful. Cambridge, MA, USA: MIT Press; 2011. 177–208.
- Murata H, Toko K, Chikenji G. Protein superfolds are characterised as frustration-free topologies: a case study of pure parallel beta-sheet topologies. PLoS Comput Biol. 2024:20(8):e1012282. 10.1371/journal.pcbi.1012282.
- Orengo CA, Sillitoe I, Reeves G, Pearl FMG. Review: what can structural classifications reveal about protein evolution? J. Struct. Biol. 2001:134(2-3):145–165. 10.1006/jsbi.2001.4398.
- Persi E, Wolf YI, Koonin EV. Positive and strongly relaxed purifying selection drive the evolution of repeats in proteins. Nat. Commun. 2016:7(1):13570. 10.1038/ncomms13570.
- Salehi-Ashtiani K, Szostak JW. In vitro evolution suggests multiple origins for the hammerhead ribozyme. Nature. 2001:414(6859):82–84. 10.1038/35102081.
- Satterlee JW, Alonso D, Gramazio P, Jenike KM, He J, Arrones A, Villanueva G, Plazas M, Ramakrishnan S, Benoit M, et al. Convergent evolution of plant prickles by repeated gene co-option over deep time. Science. 2024:385(6708):eado1663. 10.1126/science.ado1663.
- Schaper E, Kajava AV, Hauser A, Anisimova M. Repeat or not repeat?—statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Res. 2012:40(20):10005–10017. 10.1093/nar/gks726.
- Seong K, Krasileva KV. Prediction of effector protein structures from fungal phytopathogens enables evolutionary analyses. Nat. Microbiol. 2023:8(1):174–187. 10.1038/s41564-022-01287-6.
- Tunyasuvunakool K, Adler J, Wu Z, Green T, Zielinski M, Žídek A, Bridgland A, Cowie A, Meyer C, Laydon A, et al. Highly accurate protein structure prediction for the human proteome. Nature. 2021:596(7873):590–596. 10.1038/s41586-021-03828-1.
- van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Soding J, Steinegger M. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 2024:42(2):243–246. 10.1038/s41587-023-01773-0.
- Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, Stroe O, Wood G, Laydon A, et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022:50(D1):D439–D444. 10.1093/nar/gkab1061.
- Whelan S, Goldman N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 2001:18(5):691–699. 10.1093/oxfordjournals.molbev.a003851.
- Wolynes PG. Symmetry and the energy landscapes of biomolecules. Proc. Natl. Acad. Sci. USA. 1996:93(25):14249–14255. 10.1073/pnas.93.25.14249.
- Wright ES. DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment. BMC Bioinformatics. 2015:16(1):322. 10.1186/s12859-015-0749-z.
- Wright ES. Accurately clustering biological sequences in linear time by relatedness sorting. Nat. Commun. 2024:15(1):3047. 10.1038/s41467-024-47371-9.
- Xu J, Zhang Y. How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics. 2010:26(7):889–895. 10.1093/bioinformatics/btq066.
- Zhang C, Shine M, Pyle AM, Zhang Y. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat. Methods. 2022:19(9):1109–1115. 10.1038/s41592-022-01585-1.
- Zhu H, Sepulveda E, Hartmann MD, Kogenaru M, Ursinus A, Sulz E, Albrecht R, Coles M, Martin J, Lupas AN. Origin of a folded repeat protein from an intrinsically disordered ancestor. eLife. 2016:5:e16761. 10.7554/eLife.16761.
- Zhu K, Su H, Peng Z, Yang J. A unified approach to protein domain parsing with inter-residue distance matrix. Bioinformatics. 2023:39(2):btad070. 10.1093/bioinformatics/btad070.