Abstract
Although most globular proteins fold into a single stable structure1, an increasing number have been shown to remodel their secondary and tertiary structures in response to cellular stimuli2. State-of-the-art algorithms3–5 predict that these fold-switching proteins assume only one stable structure6,7, missing their functionally critical alternative folds. Why these algorithms predict a single fold is unclear, but all of them infer protein structure from coevolved amino acid pairs. Here, we hypothesize that coevolutionary signatures are being missed. Suspecting that over-represented single-fold sequences may be masking these signatures, we developed an approach to search both highly diverse protein superfamilies–composed of single-fold and fold-switching variants–and protein subfamilies with more fold-switching variants. This approach successfully revealed coevolution of amino acid pairs uniquely corresponding to both conformations of 56/58 fold-switching proteins from distinct families. Then, using a set of coevolved amino acid pairs predicted by our approach, we successfully biased AlphaFold25 to predict two experimentally consistent conformations of a candidate protein with unsolved structure. The discovery of widespread dual-fold coevolution indicates that fold-switching sequences have been preserved by natural selection, implying that their functionalities provide evolutionary advantage and paving the way for predictions of diverse protein structures from single sequences.
Keywords: Protein Fold Switching, Coevolutionary Analysis, Protein Evolution, Protein Structure Prediction, Protein Folding, Conformational Diversity
Introduction
Though machine learning methods have recently revolutionized protein structure prediction3,5,8, some systems remain a challenge9–12. For example, fold-switching proteins, also known as metamorphic proteins13,14, transition between two sets of stable secondary and tertiary structure2,15. These structural transitions modulate protein functions involved in suppressing human innate immunity during SARS-CoV-2 infection16, regulating the expression of bacterial virulence genes17, maintaining the cycle of the cyanobacterial circadian clock18,19, and more20. In spite of their biological importance, AlphaFold2 predicted only one conformation for 92% of these dual-folding proteins, and 30% of the predicted conformations were likely not the lowest energy state21. Other structure prediction algorithms, such as trRosetta22 and EVcouplings4, also systematically failed to predict experimentally validated fold switching in the universally conserved NusG family of transcription factors7.
Most state-of-the-art protein structure prediction algorithms, including all just mentioned, infer folding information from evolutionary conservation patterns. Very early studies of protein structure23 recognized that amino acid pairs with similar selection rates, also known as coevolved residue pairs, tend to be in direct contact24,25. These coevolved contacts can greatly constrain the number of possible conformations that computational methods must sample to predict a protein’s fold26, motivating the development of increasingly sophisticated methods that infer amino acid coevolution27–33. Multiple sequence alignments (MSAs), collections of sequences likely homologous to the sequence of interest, are the inputs to most of these methods. Typically, the accuracy of inferred coevolved residue pairs increases with MSA depth34, though recent deep learning-based methods can make accurate inferences from shallow MSAs5,35.
Some previous work hints that amino acid contacts unique to both conformations of fold-switching proteins may have coevolved. This work demonstrates that alternative protein conformations can be detected in subfamily-specific MSAs7,36. For example, we recently identified fold-switching proteins within the universally conserved NusG transcription factor family by leveraging structural information derived from MSAs from protein superfamilies (deep MSAs) and protein subfamilies (shallow MSAs with sequences similar to a target of interest)7. Furthermore, Dishman and colleagues found that several reconstructed ancestors of the fold-switching chemokine XCL1 switch folds, from which they concluded that XCL1 fold switching was evolutionarily selected37. These studies, though suggestive, focus on a couple of specific systems and infer fold switching from experimental observation of a few variants (XCL1) or inconsistent secondary structure predictions (NusG). Weak coevolutionary couplings of a fold-switching NusG have also been predicted, though the couplings had high proportions of noise: 38% at best38.
To robustly search for amino acid contacts unique to both conformations of fold-switching proteins, we applied unsupervised learning techniques to both superfamily and subfamily-specific MSAs of all 91 known fold switchers with two distinct experimentally determined structures2. One technique identifies coevolution of amino acid pairs using Markov Random Fields (MRFs). The MRF construction offers several advantages: (i) it converges to a global minimum as MSA depth increases, (ii) it can generate reasonable predictions from fairly shallow MSAs, and (iii) the MRF formalism accounts for noncausal correlations that arise when two residues interact with a third but not with one another39–41. Among the numerous MRF-based methods4,42–44, we selected GREMLIN (Generative REgularized ModeLs of proteINs) because of its superior performance38,45. The second technique, MSA Transformer, infers coevolved amino acid pairs using a language model that focuses on both evolutionary patterns of amino acids within an MSA (column-wise attention) and properties of the individual sequences (row-wise attention), often with better accuracy than GREMLIN for single-fold proteins35.
We gauge the success of these methods by quantifying the overlap between predicted and experimentally determined residue-residue contacts from both folds. These comparisons are easily visualized with contact maps, which display amino acid pairs either measured or predicted to be proximal (heavy atom distance ≤8Å45). Though typical contact maps are symmetric about the diagonal, those used here are asymmetric to maximize information content. For example, the large light gray circles in the upper triangular portion of Figure 1 represent contacts unique to the experimentally determined monomeric fold of KaiB, while the black circles in the lower triangular portion represent contacts unique to KaiB’s experimentally determined tetrameric fold. Contacts common to both experimentally determined folds are shown in medium gray on both sides of the diagonal. Where appropriate, interchain contacts are represented by smaller circles using the same color scheme. Predicted contacts, though not shown in Figure 1, are smaller and teal. Correct predictions are opaque circles; incorrect predictions are translucent diamonds.
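The contact definition and the asymmetric dual-fold map described above can be illustrated with a minimal Python sketch. It assumes that each structure has already been parsed into a dictionary of residue numbers and heavy-atom coordinates; the function names and the `min_separation` parameter are hypothetical and are not taken from the authors' code.

```python
from itertools import combinations
from math import dist


def residue_contacts(residues, cutoff=8.0, min_separation=1):
    """Return the set of residue pairs (i, j), i < j, that are in contact.

    residues: dict mapping residue number -> list of (x, y, z) heavy-atom
    coordinates in angstroms. Two residues are in contact when any pair of
    their heavy atoms lies within `cutoff`. `min_separation` is a hypothetical
    option for skipping near-neighbors in sequence.
    """
    contacts = set()
    for (i, atoms_i), (j, atoms_j) in combinations(sorted(residues.items()), 2):
        if j - i < min_separation:
            continue
        if any(dist(a, b) <= cutoff for a in atoms_i for b in atoms_j):
            contacts.add((i, j))
    return contacts


def dual_fold_map(contacts_a, contacts_b):
    """Split two contact sets into (unique to A, unique to B, common).

    In an asymmetric map, the first set would be drawn in one triangle,
    the second in the other, and the common set on both sides.
    """
    return contacts_a - contacts_b, contacts_b - contacts_a, contacts_a & contacts_b


# Toy one-atom-per-residue "structures" (made-up coordinates).
fold_a = {1: [(0, 0, 0)], 2: [(3.8, 0, 0)], 3: [(7.6, 0, 0)], 4: [(20, 0, 0)]}
fold_b = {1: [(0, 0, 0)], 2: [(3.8, 0, 0)], 3: [(12, 0, 0)], 4: [(16, 0, 0)]}
unique_a, unique_b, common = dual_fold_map(
    residue_contacts(fold_a), residue_contacts(fold_b)
)
```

Keeping the three sets separate is what allows one triangle of the map to carry each fold while shared contacts appear on both sides of the diagonal.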
Figure 1:
Example of a dual-fold contact map from experimentally determined structures. KaiB monomeric/tetrameric contacts within 8Å are shown in the upper/lower triangles of the contact map in light gray/black. Contacts common to both folds are shown in dark gray. Interchain contacts within 10Å are shown as smaller circles in their respective colors. Monomeric/tetrameric contacts were calculated from PDBs 5JYT/4KSO. Protein structures were generated with PyMOL46. Plots in all figures were generated with Matplotlib47.
Results
Approach to identify dual-fold contacts
Figure 2 depicts our workflow to search for dual-fold coevolution (Methods). The query sequence, which corresponds to two distinct experimentally determined structures, is used to generate a deep MSA. This MSA is filtered to create successively shallower MSAs with sequences increasingly identical to the query (Figure 2a). These increasingly subfamily-specific MSAs are intended to reveal coevolutionary couplings from alternative conformations, as they did with RfaH, a fold-switching NusG protein48 whose ground state α-helical conformation can be detected only in subfamily-specific MSAs7. Accordingly, coevolutionary analysis is performed on each MSA using GREMLIN and MSA Transformer (Figure 2b). Predictions from both methods run on these nested MSAs are combined and superimposed on a single contact map (Figure 2c). Finally, these predictions are filtered by density-based scanning49 to remove noise (Figure 2d).
Figure 2:
Graphical depiction of the workflow designed for applying coevolutionary analysis to fold-switching proteins. a) An MSA suitable for coevolutionary analysis is pruned using a Query Identity (QID) filter, removing distantly related sequences from the dataset and generating subfamily-specific MSAs. b) Each MSA (original + all pruned) is used as input for coevolutionary analysis. c) Predictions from all MSAs are superimposed on a single contact map. d) A clustering algorithm filters noise, leaving dense clusters of predicted amino acid contacts. Contacts unique to the dominant/alternative folds are light gray/black; common contacts are medium gray; experimentally consistent predictions are teal circles; incorrect predictions (noise) are translucent teal diamonds.
Contacts are categorized as follows. Dominant fold: unique contacts corresponding to the experimentally determined structure that overlaps most with predicted contacts from the deepest MSA (light gray contacts in Figure 2b–d); Alternative fold: unique contacts corresponding to the other experimentally determined structure (black contacts in Figure 2b–d); Common: predicted contacts overlapping with experimentally determined contacts shared by both folds (gray contacts on both sides of the diagonal in Figure 2b–d); Noise: predicted contacts that do not overlap with any experimentally determined contacts (readily visible in Figure 2c). Though we call them noise, these experimentally inconsistent contacts could result from yet another alternative, but experimentally uncharacterized, conformation.
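The four categories above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: a prediction is taken to overlap an experimental contact when both residue indices lie within ±1 (the tolerance given in Methods), common contacts are checked before fold-specific ones (an assumed precedence), and all function names are hypothetical.

```python
def near(pair, reference, tol=1):
    """True if `pair` lies within +/- tol residues of any contact in `reference`."""
    i, j = pair
    return any(abs(i - a) <= tol and abs(j - b) <= tol for a, b in reference)


def classify_predictions(predicted, fold_a, fold_b, tol=1):
    """Label each predicted pair as 'common', 'fold_a', 'fold_b', or 'noise'.

    fold_a and fold_b are sets of experimentally determined contacts (i, j),
    i < j, for the two structures. Checking common contacts first is an
    assumption made for this sketch.
    """
    common = fold_a & fold_b
    only_a, only_b = fold_a - common, fold_b - common
    labels = {}
    for pair in predicted:
        if near(pair, common, tol):
            labels[pair] = "common"
        elif near(pair, only_a, tol):
            labels[pair] = "fold_a"
        elif near(pair, only_b, tol):
            labels[pair] = "fold_b"
        else:
            labels[pair] = "noise"
    return labels


def name_dominant(deep_msa_labels):
    """Return (dominant, alternative) from labels of the deepest-MSA predictions.

    The dominant fold is the one whose unique contacts overlap most with
    predictions from the deepest MSA; ties default to fold_a in this sketch.
    """
    counts = {"fold_a": 0, "fold_b": 0}
    for label in deep_msa_labels.values():
        if label in counts:
            counts[label] += 1
    if counts["fold_a"] >= counts["fold_b"]:
        return "fold_a", "fold_b"
    return "fold_b", "fold_a"


# Made-up contacts and predictions for illustration only.
fold_a = {(1, 10), (2, 20), (5, 30)}
fold_b = {(1, 10), (3, 40)}
labels = classify_predictions(
    [(1, 11), (2, 19), (5, 31), (3, 40), (7, 50)], fold_a, fold_b
)
```

Because "dominant" is assigned purely from prediction counts, the label carries no biophysical meaning, which matches how the terms are used in the text.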
Evolutionary selection of dual-fold proteins
We applied our approach to all known fold-switching proteins, 91 single sequences with two distinctly folded experimentally determined structures2. These proteins are found in all kingdoms of life and represent >80 distinct fold families (Extended Data Table 1). Although efforts were made to generate the deepest possible MSA for each fold-switching sequence (Methods), the depths of 32 MSAs were too shallow for downstream analysis (<5*length of query sequence45) and one displayed severe artifacting after analysis (Extended Data Table 1). Thus, the rest of our approach was applied only to the remaining 58 fold-switching sequences with sufficiently deep MSAs (Extended Data Table 1, Extended Data Figure 1). Conformations with more contacts predicted in the deep MSA are denoted “dominant”, and those with fewer predicted contacts, “alternative”. This terminology holds no biophysical significance: 33% of “dominant” conformations do not correspond to the lowest energy states (Extended Data Table 2).
Our approach predicted substantially more correct contacts than coevolutionary analysis run on deep superfamily MSAs alone, the standard approach34. The number of correctly predicted contacts increased for 56/58 proteins, with mean/median increases of 46%/41% (Figure 3a). Predicted amino acid contacts uniquely corresponding to the 58 alternative conformations were highly enhanced, with mean/median increases of 98%/69% (Figure 3b). Noise was amplified substantially less than correctly predicted alternative contacts, with mean/median increases of 50%/30% (Extended Data Figure 2a). Prior to density-based filtering, mean/median noise was amplified by 377%/277%, demonstrating that, on average, 85% of the extra noise accrued from subfamily MSAs is sparsely distributed (Extended Data Figure 2b).
Figure 3.

Our workflow amplifies correctly predicted contacts for fold-switching proteins. Amplification of all correctly predicted contacts is observed for 56/58 proteins (a) and is especially strong for contacts uniquely corresponding to the alternative fold (b). Identity lines in both plots are dashed. (c) Noise generated by our pipeline was significantly reduced by density-based filtering. Violin plots show the distributions of %non-dominant contacts for single-fold and fold-switching proteins. The left and right distributions were generated from n=181 and n=58 datapoints, respectively. Inner bold black boxes span the interquartile ranges (IQRs) of each distribution (first quartile, Q1, through third quartile, Q3); medians of each distribution are white dots; the lower line (whisker) is the lowest datum above Q1 − 1.5*IQR; the upper line (whisker) is the highest datum below Q3 + 1.5*IQR.
Statistical analysis confirmed that the additional coevolutionary contacts identified by our approach are products of evolution rather than chance. Specifically, the likelihood of generating the additional correct contacts, with concomitant noise, was very low for all 58 proteins, with p-values ranging from 10^−6 to 0 (one-tailed hypergeometric test, Extended Data Table 1). These low p-values demonstrate that the dual-fold coevolutionary signatures identified by GREMLIN and MSA Transformer were almost certainly not generated by chance. Instead, these statistically significant dual-fold signatures suggest that evolution has selected for protein sequences that assume two distinct folds. Furthermore, the distribution of non-dominant contacts for single-fold proteins was significantly lower than for fold switchers (p < 1.1*10^−94, Epps-Singleton test, Figure 3c, Extended Data Table 3). Because our approach especially enriches the number of contacts corresponding to the alternative conformation (Figure 3b), this difference reinforces that conclusion.
Prediction of two folds from one sequence
Widespread dual-fold coevolution opens the possibility of predicting both conformations of a fold-switching protein from its sequence. We tested this possibility on a NusG Variant with low sequence identity (≤29%) to homologs with experimentally determined three-dimensional structures. NusG proteins are the only transcription factors known to be conserved in all kingdoms of life50. Unlike most NusGs with atomic level structures, whose C-terminal domains (CTDs) assume a β-roll fold17,51–54, this Variant’s CTD switches from an α-helical ground state to a β-roll7, much like its homolog, RfaH48. Nevertheless, AlphaFold2 consistently predicts that the CTD of this Variant assumes a β-roll fold only (Figure 4a, Extended Data Figure 3). This prediction corroborates the observations discussed previously: all NusG CTDs are expected to assume β-roll folds (dominant conformation), though a subpopulation can also assume α-helical folds (alternative conformation). To test whether the coevolutionary signal of the β-roll fold might be masking a weaker α-helical signature, we examined the coevolved amino acid pairs identified by our approach. Twenty-one amino acid positions in the CTD formed coevolved pairs corresponding only to the β-roll fold, while only four positions formed coevolved pairs corresponding exclusively to the α-helical fold (Figure 4a). To weaken the coevolutionary signal corresponding to the β-roll fold, we changed all 21 positions to alanine, the mutation of choice for perturbing structure55, in every sequence of the MSA except that of the Variant (Figure 4b). From this modified MSA, AlphaFold2 predicted a ground state α-helical structure (Figure 4b). The secondary structures of both CTDs have high prediction confidences (pLDDT scores), except for the most C-terminal helix in the α-hairpin conformation (Extended Data Figure 4).
Figure 4.
AlphaFold2 successfully predicts two conformations of a candidate sequence without experimentally determined structures. (a) A NusG N-terminal (NGN) fold (light gray) and a C-terminal β-roll fold (lavender) are predicted from a deep input MSA (region corresponding to the CTD shown). Predicted β-sheets in the C-terminal domain agree closely with secondary structures predicted from nuclear magnetic resonance experiments (black boxes surrounding lavender bars). (b) A NusG N-terminal (NGN) fold (light gray) and a C-terminal α-helical hairpin fold (teal) are predicted from a modified input MSA in which columns predicted to form only β-roll contacts are changed to alanine. Predicted α-helices in the C-terminal domain agree with secondary structures predicted from nuclear magnetic resonance experiments (black boxes surrounding teal bars). Protein structures were generated with PyMOL46.
Both predicted conformations are consistent with amino-acid-specific secondary structure predictions calculated from nuclear magnetic resonance assignments7 (Figure 4). Furthermore, without suppressing the strong β-roll coevolutionary signature, AlphaFold2 consistently predicted the β-roll fold regardless of the input MSA and whether or not templates were used. The α-helical CTD conformation was also missed by RoseTTAFold3 and RGN256, an MSA-independent deep learning method that outperforms AlphaFold2 on orphan protein sequences (Extended Data Figure 3). Together, these results demonstrate that the coevolved contacts identified by our approach facilitated predictions of two folds from one sequence.
Discussion
Although globular proteins are generally observed to assume single unique folds, an increasing number can switch between distinct sets of stable secondary and tertiary structure. These fold-switching proteins facilitate cancer progression57, foster SARS-CoV-2 pathogenesis58, fight microbial infection37, and more20.
By running well-developed coevolutionary analysis methods35,42,45 on many sets of protein superfamilies and subfamilies, we identified statistically significant coevolutionary signals corresponding to two folds of 58 diverse fold-switching proteins. This widespread selection indicates that fold switching (1) confers evolutionary advantage and (2) is a fundamental biological mechanism. These results, coupled with the difficulties associated with experimentally characterizing fold switchers2,14, suggest that fold-switching proteins may be more naturally abundant than currently realized. Accordingly, recent experimentally confirmed predictions suggest that over 3500 proteins in the NusG transcription factor family of ~15,500 proteins switch folds7. Furthermore, since subfamily MSAs have also been used to infer other protein properties59,60, our approach might successfully extend beyond fold switchers to other forms of structural heterogeneity, such as allostery, which previous coevolutionary approaches have predicted with some success61,62.
The observed prevalence and biological relevance of fold-switching proteins underscore the need to develop computational methods that reliably predict more. Although state-of-the-art predictive algorithms have revolutionized protein structure prediction3,5,8, they systematically fail to predict protein fold switching6,7. Nevertheless, we show here that AlphaFold2 can be biased to predict two folds from one amino acid sequence. The key to this approach was suppressing the strong coevolutionary signature of the β-roll fold, allowing AlphaFold2 to detect weaker α-helical signals. This approach may extend to other fold-switching proteins55, though a systematic analysis has not yet been performed. Furthermore, a recent preprint reported that AlphaFold2 could be guided to predict both structures of 3/6 known fold-switching proteins by clustering their sequences into subfamilies63.
Our findings lay the groundwork for a more functionally complete picture of the proteome by capturing dual-fold coevolutionary signatures of fold-switching proteins from their genomic sequences. Still, further technical advances are needed to reliably predict protein fold switching. First, dual-fold contacts must be distinguished from noise or true contacts arising from other phenomena, such as multimerization. Second, dual-fold contacts must be correctly separated into their two respective folds without the knowledge of both conformations, on which we rely here. Third, robust dual-fold contacts must be predicted reliably. Our approach works only on sequences for which deep MSAs can be generated. As a result, fold switching could not be predicted in more than one third (33 of 91) of the sequences in our initial dataset. Furthermore, it is uncertain how many sets of the dual-fold contacts we generated are complete enough to robustly predict two distinct folds from single amino acid sequences. Nevertheless, the rapid growth of diverse protein sequences64, recent advances in deep learning65,66, and increasingly accessible computational resources leave us optimistic that these challenges will be overcome.
Materials and Methods
MSA generation.
Fold-switching protein sequences were used as inputs for jackhmmer67,68 to generate multiple sequence alignments (MSAs) after searching the UniRef9064 release from January 2021. To achieve optimal MSA depths, multiple searches were performed with the --incE and --incdomE thresholds set to the same value, ranging from 10^−1 to 10^−250 in increments of 10^−3. We then searched for the deepest MSA in this range with a maximum of 60,000 sequences. Each jackhmmer run was iterated until the MSA converged or until 10 iterations had occurred.
MSA preparation.
To generate subfamily MSAs, distantly related sequences were pruned from deep superfamily MSAs using hhfilter69. This software filters alignments by QID, pairwise sequence identity between the query sequence used to generate the MSA and each subsequent sequence within it. Subfamily MSAs of varying depths were generated with QID thresholds ranging from 1% to 50% in increments of 1%. All MSAs—both superfamily and subfamily—were prepared for coevolutionary analysis by removing any sequences with >25% gaps and then filtering any columns with >75% gaps.
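The pruning and gap-filtering steps above can be illustrated with a small Python sketch. It is not hhfilter itself: here QID is assumed to be the percentage of non-gap query positions at which a sequence carries the same residue as the query, which may differ in detail from hhfilter's definition. The function names are hypothetical, and the first row of the alignment is assumed to be the query.

```python
GAP = "-"


def query_identity(query, seq):
    """Percent of non-gap query columns where `seq` matches the query (assumed QID)."""
    positions = [k for k, q in enumerate(query) if q != GAP]
    matches = sum(1 for k in positions if seq[k] == query[k])
    return 100.0 * matches / len(positions)


def filter_by_qid(msa, min_qid):
    """Keep the query (row 0) plus every sequence with identity >= min_qid percent."""
    query = msa[0]
    return [query] + [s for s in msa[1:] if query_identity(query, s) >= min_qid]


def drop_gappy(msa, max_seq_gap=0.25, max_col_gap=0.75):
    """Remove sequences with >25% gaps, then columns with >75% gaps."""
    kept = [s for s in msa if s.count(GAP) / len(s) <= max_seq_gap]
    n_cols = len(kept[0])
    cols = [
        c for c in range(n_cols)
        if sum(s[c] == GAP for s in kept) / len(kept) <= max_col_gap
    ]
    return ["".join(s[c] for c in cols) for s in kept]


# Toy alignments (made-up sequences).
msa = ["ACDE", "ACDF", "A-DE", "WYWY", "----"]
subfamily = filter_by_qid(msa, 50)
prepared = drop_gappy(["AC-E", "AD-E", "AE-E", "AF-E", "AGHE"])
```

Under these assumptions, the nested subfamily series described in the text would correspond to something like `[drop_gappy(filter_by_qid(msa, q)) for q in range(1, 51)]`, one alignment per QID threshold from 1% to 50%.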
Coevolutionary Analysis.
Prepared MSAs from each protein family were used as separate inputs into both GREMLIN41,42 and MSA Transformer35, each run with default parameters. Typically, the number of coevolved amino acid pairs retained from each run of either program is 3L/245,70, where L is the length of the target protein. Here, 2L pairs are retained for the deepest MSA because we expect more coevolved contacts to arise from fold-switching sequences with two conformations. The number of contacts retained from subfamilies decreases linearly with the number of sequences in the subfamily MSA, normalized by the number of sequences in the deepest MSA, to a minimum value of 3L/2 for the shallowest input MSA from each protein family. The N contacts with the highest APC scores from each GREMLIN run and the M contacts with the highest Z-scores from each MSA Transformer run were retained, where N and M ∈ [3L/2, 2L]. Predicted contacts from each MSA were combined and duplicates were removed.
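One way to read the retention rule above is as a linear interpolation between 3L/2 (shallowest MSA) and 2L (deepest MSA). The Python sketch below encodes that reading; the exact normalization used by the authors is not fully specified in the text, so the interpolation, the rounding, and the function names are assumptions.

```python
def contacts_to_retain(length, depth, deepest_depth, shallowest_depth):
    """Number of top-scoring pairs kept for an MSA with `depth` sequences.

    Assumed rule: 2L pairs for the deepest MSA, 3L/2 for the shallowest,
    and a linear interpolation on MSA depth in between.
    """
    low, high = 1.5 * length, 2.0 * length
    if deepest_depth == shallowest_depth:
        return int(high)
    fraction = (depth - shallowest_depth) / (deepest_depth - shallowest_depth)
    return int(low + fraction * (high - low))


def top_pairs(scores, n):
    """Return the n residue pairs with the highest scores.

    scores: dict mapping (i, j) -> score, for example an APC-corrected
    coupling from GREMLIN or a Z-score from MSA Transformer.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {pair for pair, _ in ranked[:n]}


def combine(*prediction_sets):
    """Union predictions from all runs; using sets removes duplicates."""
    merged = set()
    for predictions in prediction_sets:
        merged |= predictions
    return merged


# Made-up scores for illustration only.
example_scores = {(1, 9): 0.9, (2, 8): 0.4, (3, 7): 0.7}
kept = top_pairs(example_scores, 2)
```

For a 100-residue protein whose deepest and shallowest MSAs hold 1000 and 200 sequences, this reading retains 200, 175, and 150 pairs at depths of 1000, 600, and 200, respectively.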
Noise filtering.
All predicted contacts generated from the original MSA and the subfamily MSAs were superimposed onto a single contact map. These predictions were clustered using a density-based algorithm (DBSCAN)49 that efficiently identifies structure in datasets with arbitrarily shaped clusters. The main criterion for deciding whether a point belongs to a cluster is how many other points lie close to it. The EPS parameter defines a radial distance from a core point, and points within that radius are clustered with it. All points included in the cluster are then used as new core points to search for additional points within the EPS radius. Clusters are built iteratively in this way until the entire dataset is clustered. The minimum number of points required to define a cluster in this work was 3. The sparsest points in the dataset were then defined as noise and eliminated to produce the final, densest set of filtered predictions. The EPS value was optimized for each set of superimposed contacts using a receiver operating characteristic (ROC) curve: the optimal EPS is the value at which the first derivative of the curve is >1, meaning that increasing EPS gains more true positives than false positives, while the first derivative at the next value is <1, meaning that further increasing EPS gains more false positives than true positives. True positives are defined as predictions within +/−1 residue of crystallographic contacts. However, EPS values could not be so stringent that fewer contacts were returned than from the original run on deep MSAs.
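The density-based filtering step can be sketched with a compact, pure-Python version of DBSCAN. This is an illustration rather than the implementation used in the study: Euclidean distance between (i, j) index pairs is assumed, a point counts toward its own neighborhood (a common convention), and the EPS value is passed in directly instead of being tuned on a ROC curve.

```python
from math import dist


def dbscan(points, eps, min_pts=3):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(k):
        # Indices of all points within eps of point k (including k itself).
        return [m for m, q in enumerate(points) if dist(points[k], q) <= eps]

    cluster = -1
    for k in range(len(points)):
        if labels[k] is not None:
            continue
        seeds = neighbors(k)
        if len(seeds) < min_pts:
            labels[k] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[k] = cluster
        queue = [m for m in seeds if m != k]
        while queue:
            m = queue.pop()
            if labels[m] == -1:
                labels[m] = cluster  # border point: joins the cluster, not expanded
            if labels[m] is not None:
                continue
            labels[m] = cluster
            reachable = neighbors(m)
            if len(reachable) >= min_pts:  # core point: keep growing the cluster
                queue.extend(reachable)
    return labels


def filter_noise(points, eps, min_pts=3):
    """Keep only the predicted contacts that fall inside a dense cluster."""
    labels = dbscan(points, eps, min_pts)
    return [p for p, label in zip(points, labels) if label != -1]


# Made-up predicted contacts: four clustered pairs and one isolated pair.
predicted = [(1, 10), (2, 10), (2, 11), (3, 11), (30, 50)]
dense = filter_noise(predicted, eps=1.5)
```

In this toy example the isolated pair (30, 50) is discarded while the four neighboring pairs survive, mirroring how sparse predictions are removed from the superimposed contact map.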
Statistical tests.
p-values were calculated using the one-tailed hypergeometric test (also known as Fisher’s exact test) to evaluate the significance of the additional structural information obtained from the subfamily alignments, as described by:
(1),
where N_exp is the total number of unique experimentally determined contacts from both conformations of a fold-switching protein; N_pred,superfamily is the number of unique contacts correctly predicted by GREMLIN and MSA Transformer on the superfamily MSA only; N_pred,subfamilies is the number of unique contacts correctly predicted by GREMLIN and MSA Transformer on all subfamily MSAs, excluding those also predicted from the superfamily; N_noise is L^2 − N_exp, where L is the maximum sequence length of an experimentally determined structure; N_noise,superfamily is the number of unique contacts incorrectly predicted by GREMLIN or MSA Transformer on the superfamily MSA only; and N_noise,subfamilies is the number of unique contacts incorrectly predicted by GREMLIN or MSA Transformer on all subfamily MSAs, excluding those also predicted from the superfamily.
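A one-tailed hypergeometric test built from these quantities can be sketched in Python as follows. The mapping of the six counts onto the hypergeometric parameters is an assumption made for illustration (the printed form of equation 1 may differ): the population is taken to be every residue pair not already predicted from the superfamily MSA, the successes are the experimental contacts still unpredicted, and the draws are all new subfamily predictions. The function names are hypothetical.

```python
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws)."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favorable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return favorable / total


def subfamily_p_value(n_exp, n_pred_super, n_pred_sub,
                      length, n_noise_super, n_noise_sub):
    """Chance of drawing at least n_pred_sub correct contacts at random.

    Assumed mapping (not taken verbatim from the source):
      N_noise    = L^2 - N_exp
      successes  = N_exp - N_pred,superfamily
      failures   = N_noise - N_noise,superfamily
      draws      = N_pred,subfamilies + N_noise,subfamilies
    """
    n_noise = length ** 2 - n_exp
    successes = n_exp - n_pred_super
    failures = n_noise - n_noise_super
    draws = n_pred_sub + n_noise_sub
    return hypergeom_tail(successes + failures, successes, draws, n_pred_sub)


# Tiny made-up example: L = 4, so L^2 = 16 possible pairs.
p = subfamily_p_value(n_exp=6, n_pred_super=2, n_pred_sub=2,
                      length=4, n_noise_super=4, n_noise_sub=1)
```

Exact integer arithmetic with `math.comb` avoids floating-point underflow until the final division, which matters when p-values are extremely small.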
Structure predictions.
Structure predictions of Variant 5 were performed with AlphaFold 2.1.2 in four configurations: with templates deposited in the PDB by 4/20/22 or without templates, and with MSAs generated by the standard pipeline (UniRef9071, MGnify72, and MMseqs273 (BFD clust)) or with the shallowest MSA generated by our approach. In all four runs, only the β-roll fold was predicted in the five top-scoring models (Extended Data Figure 3). The α-helical fold was predicted by modifying MSA columns that our pipeline predicted to form only β-roll contacts. These columns corresponded to amino acids in the 100–168 range, and the deepest MSA generated by our approach was modified by mutating these columns to alanine. AlphaFold 2.1.2 was run on this modified MSA without templates. The standard RoseTTAFold pipeline (https://robetta.bakerlab.org) was used to predict three-dimensional structures of Variant 5 both with the standard MSA generation protocol and with the shallowest MSA generated by our pipeline.
Finally, the RGN2 Colab notebook (https://colab.research.google.com/github/aqlaboratory/rgn2/blob/master/rgn2_prediction.ipynb) was run on the sequence of Variant 5 with standard parameters. The sequence of Variant 5 is: MESFLNWYLIYTKVKKEDYLEQLLTEAGLEVLNPKIKKTKTVRNKKKEVIDPLFPCYLFVKADLNVHLRIISYTQGIRRLVGGSNPTIVPIEIIDTIKSRMVDGFIDTKSEEFKKGDTILIKDGPFKDFVGIFQEELDSKGRVSILLKTLALQPRITVDKDMIEKLHN. Experimentally determined secondary structures were taken from7.
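The MSA modification used to suppress the β-roll signal can be sketched in Python as follows. This is a minimal illustration, not the script used in the study: the alignment is assumed to be a list of equal-length strings with the query in the first row, column indices are zero-based, and leaving gap characters untouched is an assumption. The function name is hypothetical.

```python
def mask_columns_to_alanine(msa, columns, keep_row=0, gap="-"):
    """Replace residues in the given columns with alanine in every row but one.

    msa: list of aligned sequences (equal-length strings).
    columns: zero-based alignment columns to overwrite with 'A'.
    keep_row: index of the sequence left unchanged (the query by default).
    Gap characters are preserved, which is an assumption of this sketch.
    """
    targets = set(columns)
    masked = []
    for row, seq in enumerate(msa):
        if row == keep_row:
            masked.append(seq)
            continue
        masked.append("".join(
            "A" if (col in targets and residue != gap) else residue
            for col, residue in enumerate(seq)
        ))
    return masked


# Toy alignment (made-up sequences); columns 1 and 2 are masked.
toy_msa = ["MKVL", "MRIL", "M-VF"]
modified = mask_columns_to_alanine(toy_msa, columns=[1, 2])
```

Leaving the query row intact keeps the target sequence unchanged while removing the covariation in the selected columns, which is the property the approach relies on.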
Supplementary Material
Acknowledgments
We thank Loren Looger, Carolyn Ott, Yuri Wolf, Nash Rochman, Robert Best, Andy LiWang, Eugene Koonin, Devlina Chakravarty, and Danielle and Jean Thierry-Mieg for helpful discussions. This work utilized resources from the NIH HPC Biowulf cluster (http://hpc.nih.gov), and it was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.
References
- 1.Anfinsen C. B. Principles that govern the folding of protein chains. Science 181, 223–230, doi: 10.1126/science.181.4096.223 (1973). [DOI] [PubMed] [Google Scholar]
- 2.Porter L. L. & Looger L. L. Extant fold-switching proteins are widespread. Proc Natl Acad Sci U S A 115, 5968–5973, doi: 10.1073/pnas.1800168115 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Baek M. et al. Accurate prediction of protein structures and interactions using a three-Strack neural network. Science 373, 871–876, doi: 10.1126/science.abj8754 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Hopf T. A. et al. The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics 35, 1582–1584, doi: 10.1093/bioinformatics/bty862 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Jumper J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589, doi: 10.1038/s41586-021-03819-2 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Chakravarty D. & Porter L. L. AlphaFold2 fails to predict protein fold switching. Protein Sci 31, e4353, doi: 10.1002/pro.4353 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Porter L. L. et al. Many dissimilar NusG protein domains switch between alpha-helix and beta-sheet folds. Nat Commun 13, 3802, doi: 10.1038/s41467-022-31532-9 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.AlQuraishi M. Machine learning in protein structure prediction. Curr Opin Chem Biol 65, 1–8, doi: 10.1016/j.cbpa.2021.04.005 (2021). [DOI] [PubMed] [Google Scholar]
- 9.David A., Islam S., Tankhilevich E. & Sternberg M. J. E. The AlphaFold Database of Protein Structures: A Biologist’s Guide. J Mol Biol 434, 167336, doi: 10.1016/j.jmb.2021.167336 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Ruff K. M. & Pappu R. V. AlphaFold and Implications for Intrinsically Disordered Proteins. J Mol Biol 433, 167208, doi: 10.1016/j.jmb.2021.167208 (2021). [DOI] [PubMed] [Google Scholar]
- 11.Tunyasuvunakool K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596, doi: 10.1038/s41586-021-03828-1 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Outeiral C., Nissley D. A. & Deane C. M. Current structure predictors are not learning the physics of protein folding. Bioinformatics, doi: 10.1093/bioinformatics/btab881 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Murzin A. G. Biochemistry. Metamorphic proteins. Science 320, 1725–1726, doi: 10.1126/science.1158868 (2008). [DOI] [PubMed] [Google Scholar]
- 14.Dishman A. F. & Volkman B. F. Design and discovery of metamorphic proteins. Curr Opin Struct Biol 74, 102380, doi: 10.1016/j.sbi.2022.102380 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Bryan P. N. & Orban J. Proteins that switch folds. Curr Opin Struct Biol 20, 482–488, doi: 10.1016/j.sbi.2010.06.002 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Gao X. et al. Crystal structure of SARS-CoV-2 Orf9b in complex with human TOM70 suggests unusual virus-host interactions. Nat Commun 12, 2843, doi: 10.1038/s41467-021-23118-8 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Kang J. Y. et al. Structural Basis for Transcript Elongation Control by NusG Family Universal Regulators. Cell 173, 1650–1662 e1614, doi: 10.1016/j.cell.2018.05.017 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Chang Y. G. et al. Circadian rhythms. A protein fold switch joins the circadian oscillator to clock output in cyanobacteria. Science 349, 324–328, doi: 10.1126/science.1260031 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Chavan A. G. et al. Reconstitution of an intact clock reveals mechanisms of circadian timekeeping. Science 374, eabd4453, doi: 10.1126/science.abd4453 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Kim A. K. & Porter L. L. Functional and Regulatory Roles of Fold-Switching Proteins. Structure 29, 6–14, doi: 10.1016/j.str.2020.10.006 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Chakravarty D. & Porter L. L. AlphaFold2 fails to predict protein fold switching. Protein Science 31 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Du Z. et al. The trRosetta server for fast and accurate protein structure prediction. Nat Protoc 16, 5634–5651, doi: 10.1038/s41596-021-00628-9 (2021). [DOI] [PubMed] [Google Scholar]
- 23.Yanofsky C., Horn V. & Thorpe D. Protein Structure Relationships Revealed by Mutational Analysis. Science 146, 1593–1594, doi: 10.1126/science.146.3651.1593 (1964). [DOI] [PubMed] [Google Scholar]
- 24. Altschuh D., Lesk A. M., Bloomer A. C. & Klug A. Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. Journal of Molecular Biology 193, 693–707, doi: 10.1016/0022-2836(87)90352-4 (1987).
- 25. Anishchenko I., Ovchinnikov S., Kamisetty H. & Baker D. Origins of coevolution between residues distant in protein 3D structures. Proc Natl Acad Sci U S A 114, 9122–9127, doi: 10.1073/pnas.1702664114 (2017).
- 26. Yang J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci U S A 117, 1496–1503, doi: 10.1073/pnas.1914677117 (2020).
- 27. Korber B. T., Farber R. M., Wolpert D. H. & Lapedes A. S. Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proceedings of the National Academy of Sciences 90, 7176–7180, doi: 10.1073/pnas.90.15.7176 (1993).
- 28. Göbel U., Sander C., Schneider R. & Valencia A. Correlated mutations and residue contacts in proteins. Proteins: Structure, Function, and Bioinformatics 18, 309–317, doi: 10.1002/prot.340180402 (1994).
- 29. Morcos F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences 108, E1293–E1301, doi: 10.1073/pnas.1111471108 (2011).
- 30. Jones D. T., Buchan D. W. A., Cozzetto D. & Pontil M. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 28, 184–190, doi: 10.1093/bioinformatics/btr638 (2012).
- 31. Dunn S. D., Wahl L. M. & Gloor G. B. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics 24, 333–340, doi: 10.1093/bioinformatics/btm604 (2008).
- 32. de Juan D., Pazos F. & Valencia A. Emerging methods in protein co-evolution. Nature Reviews Genetics 14, 249–261, doi: 10.1038/nrg3414 (2013).
- 33. Lockless S. W. & Ranganathan R. Evolutionarily conserved pathways of energetic connectivity in protein families. Science 286, 295–299, doi: 10.1126/science.286.5438.295 (1999).
- 34. Ovchinnikov S. et al. Protein structure determination using metagenome sequence data. Science 355, 294–298, doi: 10.1126/science.aah4043 (2017).
- 35. Rao R. M. et al. in International Conference on Machine Learning. 8844–8856 (PMLR).
- 36. Kim A. K., Looger L. L. & Porter L. L. A high-throughput predictive method for sequence-similar fold switchers. Biopolymers, e23416, doi: 10.1002/bip.23416 (2021).
- 37. Dishman A. F. et al. Evolution of fold switching in a metamorphic protein. Science 371, 86–90, doi: 10.1126/science.abd8700 (2021).
- 38. Galaz-Davison P., Ferreiro D. U. & Ramirez-Sarmiento C. A. Coevolution-derived native and non-native contacts determine the emergence of a novel fold in a universally conserved family of transcription factors. Protein Sci 31, e4337, doi: 10.1002/pro.4337 (2022).
- 39. Balakrishnan S., Kamisetty H., Carbonell J. G., Lee S.-I. & Langmead C. J. Learning generative models for protein fold families. Proteins: Structure, Function, and Bioinformatics 79, 1061–1078, doi: 10.1002/prot.22934 (2011).
- 40. Marks D. S., Hopf T. A. & Sander C. Protein structure prediction from sequence variation. Nature Biotechnology 30, 1072–1080, doi: 10.1038/nbt.2419 (2012).
- 41. Kamisetty H., Ovchinnikov S. & Baker D. Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era. Proceedings of the National Academy of Sciences 110, 15674–15679, doi: 10.1073/pnas.1314045110 (2013).
- 42. Balakrishnan S., Kamisetty H., Carbonell J. G., Lee S. I. & Langmead C. J. Learning generative models for protein fold families. Proteins 79, 1061–1078, doi: 10.1002/prot.22934 (2011).
- 43. Morcos F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci U S A 108, E1293–1301, doi: 10.1073/pnas.1111471108 (2011).
- 44. Zerihun M. B., Pucci F., Peter E. K. & Schug A. pydca v1.0: a comprehensive software for direct coupling analysis of RNA and protein sequences. Bioinformatics 36, 2264–2265, doi: 10.1093/bioinformatics/btz892 (2020).
- 45. Kamisetty H., Ovchinnikov S. & Baker D. Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci U S A 110, 15674–15679, doi: 10.1073/pnas.1314045110 (2013).
- 46. The PyMOL Molecular Graphics System, Version 2.0 Schrödinger, LLC.
- 47. Hunter J. D. Matplotlib: A 2D graphics environment. Comput Sci Eng 9, 90–95 (2007).
- 48. Burmann B. M. et al. An alpha helix to beta barrel domain switch transforms the transcription factor RfaH into a translation factor. Cell 150, 291–303, doi: 10.1016/j.cell.2012.05.042 (2012).
- 49. Ester M., Kriegel H.-P., Sander J. & Xu X. in KDD. 226–231.
- 50. Werner F. A nexus for gene expression-molecular mechanisms of Spt5 and NusG in the three domains of life. J Mol Biol 417, 13–27, doi: 10.1016/j.jmb.2012.01.031 (2012).
- 51. Drogemuller J. et al. An autoinhibited state in the structure of Thermotoga maritima NusG. Structure 21, 365–375, doi: 10.1016/j.str.2012.12.015 (2013).
- 52. Guo G. et al. Structural and biochemical insights into the DNA-binding mode of MjSpt4p:Spt5 complex at the exit tunnel of RNAPII. J Struct Biol 192, 418–425, doi: 10.1016/j.jsb.2015.09.023 (2015).
- 53. Stein P. E., Leslie A. G., Finch J. T. & Carrell R. W. Crystal structure of uncleaved ovalbumin at 1.95 A resolution. J Mol Biol 221, 941–959, doi: 10.1016/0022-2836(91)80185-w (1991).
- 54. Webster M. W. et al. Structural basis of transcription-translation coupling and collision in bacteria. Science 369, 1355–1359, doi: 10.1126/science.abb5036 (2020).
- 55. Del Alamo D., Sala D., McHaourab H. S. & Meiler J. Sampling alternative conformational states of transporters and receptors with AlphaFold2. Elife 11, doi: 10.7554/eLife.75751 (2022).
- 56. Chowdhury R. et al. Single-sequence protein structure prediction using a language model and deep learning. Nat Biotechnol, doi: 10.1038/s41587-022-01432-w (2022).
- 57. Li B. P. et al. CLIC1 Promotes the Progression of Gastric Cancer by Regulating the MAPK/AKT Pathways. Cell Physiol Biochem 46, 907–924, doi: 10.1159/000488822 (2018).
- 58. Gordon D. E. et al. Comparative host-coronavirus protein interaction networks reveal pan-viral disease mechanisms. Science 370, doi: 10.1126/science.abe9403 (2020).
- 59. Gu X. & Vander Velden K. DIVERGE: phylogeny-based analysis for functional-structural divergence of a protein family. Bioinformatics 18, 500–501, doi: 10.1093/bioinformatics/18.3.500 (2002).
- 60. Rodriguez G. J., Yao R., Lichtarge O. & Wensel T. G. Evolution-guided discovery and recoding of allosteric pathway specificity determinants in psychoactive bioamine receptors. Proc Natl Acad Sci U S A 107, 7787–7792, doi: 10.1073/pnas.0914877107 (2010).
- 61. Morcos F., Jana B., Hwa T. & Onuchic J. N. Coevolutionary signals across protein lineages help capture multiple protein conformations. Proc Natl Acad Sci U S A 110, 20533–20538, doi: 10.1073/pnas.1315625110 (2013).
- 62. Sfriso P. et al. Residues Coevolution Guides the Systematic Identification of Alternative Functional Conformations in Proteins. Structure 24, 116–126, doi: 10.1016/j.str.2015.10.025 (2016).
- 63. Wayment-Steele H. K., Ovchinnikov S., Colwell L. & Kern D. Prediction of multiple conformational states by combining sequence clustering with AlphaFold2. bioRxiv (2022).
- 64. UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 49, D480–D489, doi: 10.1093/nar/gkaa1100 (2021).
- 65. Wang X., Zhao Y. & Pourpanah F. Recent advances in deep learning. International Journal of Machine Learning and Cybernetics 11, 747–750 (2020).
- 66. Bepler T. & Berger B. Learning the protein language: Evolution, structure, and function. Cell Syst 12, 654–669 e653, doi: 10.1016/j.cels.2021.05.017 (2021).
- 67. Johnson L. S., Eddy S. R. & Portugaly E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics 11, 431, doi: 10.1186/1471-2105-11-431 (2010).
- 68. Mistry J., Finn R. D., Eddy S. R., Bateman A. & Punta M. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res 41, e121–e121, doi: 10.1093/nar/gkt263 (2013).
- 69. Steinegger M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20, 473, doi: 10.1186/s12859-019-3019-7 (2019).
- 70. Ovchinnikov S., Kamisetty H. & Baker D. Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. Elife 3, e02030, doi: 10.7554/eLife.02030 (2014).
- 71. Suzek B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932, doi: 10.1093/bioinformatics/btu739 (2015).
- 72. Mitchell A. L. et al. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res 48, D570–D578, doi: 10.1093/nar/gkz1035 (2020).
- 73. Steinegger M. & Soding J. Clustering huge protein sequence sets in linear time. Nat Commun 9, 2542, doi: 10.1038/s41467-018-04964-5 (2018).