Abstract
Investigating structural variability is essential for understanding protein biological functions. Although AlphaFold2 accurately predicts static structures, it fails to capture the full spectrum of functional states. Recent methods have used AlphaFold2 to generate diverse structural ensembles, but they offer limited interpretability and overlook the evolutionary signals underlying the predictions. In this work, we enhance the generation of conformational ensembles and identify sequence patterns that influence the alternative fold predictions for several protein families. Building on prior research that clustered multiple sequence alignments to predict fold-switching states, we introduce a refined clustering strategy that integrates protein language model representations with hierarchical clustering, overcoming limitations of density-based methods. Our strategy effectively identifies high-confidence alternative conformations and generates abundant sequence ensembles, providing a robust framework for applying direct coupling analysis (DCA). Through DCA, we uncover key coevolutionary signals within the clustered alignments, leveraging them to design mutations that stabilize specific conformations, which we validate using alchemical free energy calculations from molecular dynamics. Notably, our method extends beyond fold-switching, effectively capturing a variety of conformational changes.
1. Introduction
The three-dimensional structure of proteins is crucial to their function in biological processes, yet predicting the diverse conformations that proteins can adopt is a significant challenge in structural biology. Recently, AlphaFold2 (AF2), a deep learning model for protein structure prediction, achieved unprecedented accuracy and marked a major breakthrough in computational structural biology. Nevertheless, despite its remarkable performance, AF2 is primarily designed to predict a single dominant conformation, and its ability to capture multiple functional states of a protein remains limited. Several strategies have been developed to improve conformational sampling using the AlphaFold2 algorithm. However, these methods often lack interpretability, particularly in terms of the evolutionary signals that drive the conformational transitions. As a result, the connection among sequence divergence, evolutionary pressures, and the resulting structural states remains poorly understood in many cases. Existing strategies for sampling protein conformations with AlphaFold2 fall into two main classes. The first incorporates conformational information into either the training data or the framework itself, while the second manipulates the multiple sequence alignment (MSA) input. Within the first class, AlphaFlow fine-tunes AF2 by introducing a flow matching objective, which generates structural ensembles by training with both experimental data and molecular dynamics simulations. C-Fold retrains AlphaFold2 after removing the template track from the architecture and using a conformational split of the Protein Data Bank (PDB), which enables the model to learn structural flexibility from experimental data. Another method integrates state-annotated structure databases into AlphaFold2’s structural templates, specifically for G-protein-coupled receptors. The second class of approaches involves manipulating the input of AlphaFold2.
One of the most successful techniques involves reducing the depth of the MSA by randomly subsampling sequences, which can increase uncertainty and enhance structural diversification when the number of retained sequences is very low. Similarly, another method subsamples the MSA but enhances model diversity by adjusting the number of noncenter sequences excluded during AlphaFold2’s initial clustering step, thereby modulating coevolutionary signals. An alternative strategy involves masking columns of the MSA, similarly to the SPEACH_AF method, which alters the information within the alignment and forces AF2 to reconstruct the masked region to explore alternative conformations. Furthermore, AF-Cluster generates diverse protein conformations by clustering aligned sequences based on similarity. Using these clusters as separate inputs for AF2 effectively samples distinct conformational states of fold-switching proteins. Notably, experimentally verified predictions have confirmed its capability to identify mutations that induce conformational changes.
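The random-subsampling idea described above can be illustrated with a minimal sketch. The function name `subsample_msa`, the choice to always retain the first row as the query, and the toy sequences are assumptions made for illustration only; they are not taken from any of the cited methods.

```python
import random

def subsample_msa(msa, depth, seed=0):
    """Return a shallow MSA: the query (first row, an assumption) plus
    depth-1 other rows drawn uniformly without replacement."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    query, rest = msa[0], msa[1:]
    rng = random.Random(seed)
    k = min(depth - 1, len(rest))
    return [query] + rng.sample(rest, k)

# Toy aligned sequences (hypothetical, equal length, '-' marks a gap).
msa = ["MKV-LA", "MKI-LA", "MRV-LG", "MKVALA", "LKV-LA"]
shallow = subsample_msa(msa, depth=3, seed=1)
```

Repeating the draw with different seeds yields different shallow alignments, each of which could be passed to a structure predictor as a separate input.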
We embrace this last perspective and hypothesize that AF2 predictions are biased toward conformations with prominent coevolutionary signals in the MSA, rather than necessarily favoring the most thermodynamically stable state. This effect stems from the interplay between competing coevolutionary signals within the alignment, which correspond to distinct structural states and influence the predicted conformation. Building on this hypothesis, our aim is to detect metastable states with a limited number of predictions, prioritizing efficiency and interpretability. To achieve this goal, we need to overcome the limitations of existing methods, which often produce small cluster sizes for alternative states, making it difficult to identify evolutionary signatures linked to contacts from different conformations. By generating more comprehensive groups through our approach, we can address this issue and perform reliable statistical analysis of the clustered sequences, ultimately enabling us to identify key regions within the alignments that drive changes in the predicted structure.
We combine agglomerative clustering with representations generated with a protein language model and demonstrate superior detection of high-confidence alternative states compared with density-based clustering on a set of fold-switching proteins. Clustering a large number of sequences allows us to apply direct coupling analysis (DCA), a statistical framework that models coevolutionary signals within sequence alignments to infer direct residue interactions, and to identify coevolved pairs that are crucial for the alternative prediction. Focusing on covarying residues that interact specifically in the alternative state, we design targeted mutations that preferentially stabilize particular conformations by measuring frequency shifts with respect to the full alignments. We also validate a specific mutation pair stabilizing an alternative fold using alchemical molecular dynamics (MD) simulations. Figure 1 shows a schematic overview of the pipeline. We showcase the versatility of the developed strategy by applying it to various conformational changes beyond fold-switching. Overall, by uncovering evolutionary signals that underlie structural transitions, we distill essential sequence-structure relationships while minimizing computational overhead.
Figure 1. Aligned homologous sequences are embedded using the MSA Transformer and clustered to input AlphaFold2. Predictions match either the default state (FS for KaiB) or the alternative state (GS for KaiB). DCA of clustered alignments helps identify mutated pairs (M) aimed at stabilizing the alternative state compared with the wild-type sequence (W), as validated through MD-based free energy calculations.
2. Clustering Multiple Sequence Alignments
The pioneering AF-Cluster study introduced clustering of aligned sequences as input for AlphaFold2 to predict diverse conformations, relying on DBSCAN, a density-based clustering algorithm. However, this approach results in a highly fragmented landscape characterized by numerous small clusters, only a few of which yield meaningful alternative predictions. Many clusters produce low-confidence structures that deviate from both the ground-state and the fold-switched conformation. Since the method requires running AlphaFold2 on each group, such an abundance of small clusters significantly limits scalability to large data sets. Moreover, the presence of a “halo” of data points that fail to meet the density criterion leads to significant information loss: biases in sequencing and recombination events can create discontinuities in the sequence landscape, causing density-based approaches to misclassify many sequences as noise. We develop an approach focused on detecting metastable states with a limited number of predictions, enabling larger and more cohesive clusters that facilitate the identification of critical sequence features. We propose an agglomerative hierarchical clustering (AHC) approach and compare representations from the final layer of the MSA Transformer model to those obtained through standard sequence-based clustering. Leveraging attention mechanisms to integrate information across homologous sequences in an alignment, the MSA Transformer captures both residue-level dependencies and evolutionary constraints. We choose to use these representations to transition from a sequence-based similarity measure to a structured, continuous latent space while preserving evolutionary information.
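To make the contrast with density-based clustering concrete, the following is a minimal sketch of agglomerative clustering on per-sequence embedding vectors. The average-linkage criterion, the Euclidean metric, the fixed number of clusters, and the random vectors standing in for MSA Transformer embeddings are all assumptions for illustration; the text above does not specify these choices. The loop is deliberately naive, and optimized implementations exist in libraries such as SciPy and scikit-learn.

```python
import numpy as np

def average_linkage(X, n_clusters):
    """Naive average-linkage agglomerative clustering.
    X: (n_sequences, n_features) embeddings. Returns integer labels."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = [[i] for i in range(n)]          # start from singletons
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # mean pairwise distance between the two groups
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Hypothetical embeddings: two well-separated groups of sequences.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(5, 4)),
               rng.normal(3.0, 0.1, size=(7, 4))])
labels = average_linkage(X, n_clusters=2)
```

Unlike a density-based scheme, every sequence receives a cluster label here, so no sequence is discarded as noise.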
3. Results on Fold-Switching Proteins
In this study, we select eight fold-switching proteins for analysis. They reflect a diverse set of scenarios, spanning a range of MSA depths and fold-switching behaviors, from subtle local rearrangements to more extensive structural transitions. The AHC clustering method consistently identifies a larger fraction of sequences associated with alternative predicted conformations compared to DBSCAN on these proteins (Figure 2). This improvement is crucial for downstream analyses as it favors the identification of critical signals within the MSA that modulate the switching between states. AHC achieves a slight improvement over direct sequence-based clustering by incorporating representations from the MSA Transformer. However, the overall performance difference between these two approaches remains relatively small, suggesting that sequence similarity alone is often sufficient to guide effective clustering, while evolutionary-based embeddings provide only a marginal refinement.
Figure 2. Fraction of sequences associated with alternative state predictions, relative to the full MSA depth, for different clustering methods (AHC and DBSCAN) and inputs (sequences and MSA Transformer representations).
Based on these findings, we use MSA Transformer-based AHC clustering for the subsequent analysis. Figure 3 presents the RMSD of cluster-based AlphaFold2 predictions with respect to experimental PDB structures of the two distinct states across the eight fold-switching proteins. This analysis demonstrates that the clustering strategy produces a well-balanced distribution of conformations, effectively capturing structural variability across the different folds. The number of false positives, defined as predictions that deviate significantly from both experimental reference structures, is minimal compared to the foundational work, highlighting the improved reliability of our approach. Moreover, the predicted structures exhibit consistently high pLDDT scores, indicating AlphaFold2’s strong confidence in their accuracy. Figure S1 reports the RMSD values for all structures generated with DBSCAN clustering, highlighting a smaller number and reduced size of clusters associated with alternative predictions compared to the AHC approach. These results underscore the effectiveness of our clustering strategy in capturing meaningful conformational diversity while minimizing noise.
Figure 3. AlphaFold2 structure prediction with clustered MSA for eight fold-switching proteins. RMSDs are calculated with respect to the experimental PDB structures of the two fold-switching states.
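The assignment of a prediction to one of the two reference states, or to neither (a false positive in the sense used above), can be sketched as follows. The RMSD is computed after optimal superposition with the Kabsch algorithm. The 3.0 Å cutoff, the function names, and the random coordinates are illustrative assumptions, not values or code from this study.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Guard against improper rotations (reflections).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return float(np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1))))

def assign_state(pred, ref_default, ref_alt, cutoff=3.0):
    """Label a prediction by its closest reference; cutoff is hypothetical."""
    r_def = kabsch_rmsd(pred, ref_default)
    r_alt = kabsch_rmsd(pred, ref_alt)
    if min(r_def, r_alt) > cutoff:
        return "unassigned", r_def, r_alt
    return ("default" if r_def <= r_alt else "alternative"), r_def, r_alt

# Toy coordinates: the prediction is a rotated, translated copy of ref_alt.
rng = np.random.default_rng(0)
ref_default = rng.normal(scale=5.0, size=(20, 3))
ref_alt = rng.normal(scale=5.0, size=(20, 3))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
pred = ref_alt @ rot.T + np.array([1.0, -2.0, 0.5])
state, r_def, r_alt = assign_state(pred, ref_default, ref_alt)
```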
To ensure a rigorous comparison with AF-Cluster, we focus on a subset of well-characterized fold-switching proteins, KaiB, Mad2, and RfaH, as reported in the study and documented in the GitHub repository. As shown in Figure S2 in SI, AF-Cluster is able to recover alternative conformational states for these proteins, but only by generating a large number of MSA clusters and requiring hundreds of AF2 runs per protein. This makes the approach computationally intensive and difficult to scale. Furthermore, the resulting predictions span a broad continuum of RMSD values relative to the default AF2 model, complicating structural interpretation and limiting its effectiveness for blind identification of the fold-switching behavior. In contrast, our strategy leverages fewer but larger and more coherent clusters, enabling effective conformational separation with significantly fewer AF2 runs and improved interpretability, offering a scalable and efficient alternative for detecting fold-switching.
3.1. Benchmark on the Extended Data Set
To expand the evaluation of our method’s robustness, we extend the clustering analysis to a larger collection of fold-switching proteins. We consider the 96 systems listed in the data set and retain 81 proteins with sufficiently deep alignments, as a minimal MSA depth is required for the method to be applicable. For proteins exceeding 300 residues, sequences are trimmed to isolate specific domains based on InterPro annotations. Encouragingly, our method successfully identifies a total of 26 fold-switching events, all in agreement with experimentally determined structures in the PDB, against the 18 fold-switching events detected by AF-Cluster, as reported in a recent benchmark performed by Chakravarty et al. All successfully detected fold-switching events are shown in Figures 3, 4, S3, and S4.
Figure 4. AF2 structure prediction with clustered MSA for Lymphotactin and CLIC1. RMSDs are calculated with respect to the experimental PDB structures of the two fold-switching states. Circle size reflects the cluster size. The structures shown in the plots are AF2 predictions aligned based on the fold-switching region, which is colored blue.
As illustrative cases, in Figure 4, we show results obtained for two fold-switching proteins, CLIC1 and lymphotactin, which the foundational study previously failed to resolve into alternative states. The third protein mentioned in that study is Selecase, which poses a particularly challenging case due to the limited number of sequences in its full MSA (only 140). For CLIC1, we fail to identify clusters corresponding to predicted alternative conformations. While some clusters exhibit high RMSD with respect to the default full MSA prediction, the observed structural variability does not align with a meaningful fold-switching event. In contrast, for lymphotactin, we succeed where the original approach failed, identifying a cluster of 90 sequences associated with the alternative folded state.
This finding underscores the power of our approach in detecting fold-switching events, showing how our refined clustering strategy allows an increase in efficiency and precision with respect to AF-Cluster, while maintaining a similar approach.
3.2. Structure Variability in Evoformer Representations
We perform principal component analysis (PCA) on representations from the Evoformer block, which is a crucial component of AlphaFold2’s architecture, to explore how structural variations emerge at different stages of its workflow. The Evoformer block consists of two main modules: the MSA Transformer module, which encodes evolutionary relationships between sequences, and the pair representation module, which refines residue–residue interactions using triangle attention to update structural parameters. These modules continuously exchange information, enabling the model to integrate coevolutionary constraints and enhance the accuracy of structural predictions. In most systems, the first two principal components of the single representations provide limited separation between the two states (Figure S5). In contrast, the PCA of the pair representations (Figure 5) reveals a clear distinction between the two conformations, underscoring the importance of incorporating MSA-derived information into residue–residue interactions. This pattern is further supported by the cosine distance distribution of pairwise and single representations (Figures S6 and S7), which shows high intrastate similarity while maintaining a clear separation between conformations. These findings indicate that AlphaFold2 captures fold-switching events at the abstract representation level. The observed differentiation suggests that both structural and evolutionary constraints are encoded within the Evoformer stage, effectively embedding multiple conformations in its internal representations before the structure module refines the final prediction.
Figure 5. PCA of AF2 Evoformer pair representations of clustered MSAs. The size of circles reflects the cluster size, while the color encodes alternative (blue), default (red), and unassigned (gray) structure predictions.
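The two analyses described above, PCA projection and pairwise cosine distances, can be sketched with NumPy on flattened representation vectors. The random vectors below are hypothetical stand-ins for flattened Evoformer pair representations (one row per clustered MSA); how the representations are extracted and flattened is not specified in the text and is left out of this sketch.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project rows of X onto the leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def cosine_distance_matrix(X):
    """Pairwise cosine distances (1 - cosine similarity) between rows."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T

# Hypothetical data: 4 clusters near one state, 6 near another.
rng = np.random.default_rng(1)
center_a, center_b = rng.normal(size=64), rng.normal(size=64)
reps = np.vstack([center_a + 0.05 * rng.normal(size=(4, 64)),
                  center_b + 0.05 * rng.normal(size=(6, 64))])
coords = pca_project(reps)
dist = cosine_distance_matrix(reps)
```

In this toy setting the first principal component separates the two groups, and intragroup cosine distances are smaller than intergroup ones, which is the qualitative pattern reported for the pair representations.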
4. DCA Informs the Design of Conformation-Stabilizing Mutations
The number of grouped sequences is sufficiently large to reliably examine mutation patterns that may influence the predictions using advanced statistical techniques. In particular, we employ direct coupling analysis, which identifies direct evolutionary couplings between pairs of residues by estimating the statistical dependencies between columns in the MSA. It assumes that residues that are spatially close in the folded structure of a protein tend to coevolve, meaning that mutations at one residue position often correlate with mutations at another. This coevolution can be captured by the model and used to predict contacts between residues in the three-dimensional structure of the molecule. It has been shown that extensively sequenced protein families encode coevolutionary signals corresponding to residue contacts across diverse functional conformational states. To ensure the reliability of this analysis, sequences should exhibit sufficient variability within the clusters. Although DCA is constrained by the limited size of the alignments, their sequence heterogeneity justifies the search for potential coevolutionary signals (Figure S8). Through clustering of the full family, we aim to partition sequences into groups that isolate the covariance signal corresponding to a specific conformation, effectively disentangling it from signatures of alternative contacts. In this way, we can identify candidate mutations that promote stabilization of the alternative over the default state.
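As a rough illustration of what a statistical dependency between MSA columns means, the sketch below ranks column pairs by plain mutual information. This is a deliberately simpler stand-in, not DCA: mutual information does not separate direct from indirect couplings, and it omits sequence reweighting and pseudocounts. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def column_mi(msa, i, j):
    """Mutual information (in nats) between alignment columns i and j."""
    n = len(msa)
    fi = Counter(s[i] for s in msa)
    fj = Counter(s[j] for s in msa)
    fij = Counter((s[i], s[j]) for s in msa)
    return sum((c / n) * math.log(c * n / (fi[a] * fj[b]))
               for (a, b), c in fij.items())

def rank_pairs(msa):
    """All column pairs sorted by decreasing mutual information."""
    length = len(msa[0])
    scores = {(i, j): column_mi(msa, i, j)
              for i, j in combinations(range(length), 2)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy alignment: columns 0 and 3 covary perfectly (A with E, D with K),
# column 1 varies independently, column 2 is conserved.
toy_msa = ["AGLE", "ASLE", "AGLE", "ASLE",
           "DGLK", "DSLK", "DGLK", "DSLK"]
ranked = rank_pairs(toy_msa)
top_pair, top_score = ranked[0]
```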
To assess whether MSA clustering can isolate coevolutionary signals associated with distinct conformations, we conduct a systematic analysis comparing contact predictions obtained from DCA on clustered MSAs. For each cluster associated with an alternative predicted fold, we perform DCA and select the top L ranked residue pairs, where L is the sequence length. These are then compared to the contacts observed in both the default and the alternative AF2-predicted structures. We specifically identify contacts that are unique to either the alternative or the default conformation and measure the fractions that are correctly recovered by DCA. This enables us to compute, for each structure, the ratio between the fraction of correctly predicted contacts that are exclusive to the alternative fold and the fraction of those that are exclusive to the default fold. As shown in Figure 6, the results demonstrate a marked enrichment of alternative-specific contacts among the top DCA predictions derived from clustered MSAs in 6 proteins, supporting the hypothesis that these clusters can highlight fold-specific coevolutionary constraints that are otherwise averaged out in the full MSA.
Figure 6. Ratio of correctly predicted alternative to default-specific contacts using DCA on clustered vs full MSAs. For each fold-switching protein, bars indicate the percentage of contacts unique to the alternative conformation divided by those unique to the default conformation. Contacts are predicted by applying DCA to cluster-derived MSAs (orange) and to the full family MSA (red).
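The ratio plotted here can be sketched with set operations on residue pairs. One reading of the definition is assumed: each fraction is the number of state-exclusive contacts recovered among the top-ranked pairs, divided by the total number of contacts exclusive to that state. The function name and the toy contact sets are hypothetical.

```python
def exclusive_recovery(top_pairs, contacts_default, contacts_alt):
    """Fractions of state-exclusive contacts recovered, and their ratio."""
    only_alt = contacts_alt - contacts_default
    only_def = contacts_default - contacts_alt
    frac_alt = len(top_pairs & only_alt) / len(only_alt) if only_alt else 0.0
    frac_def = len(top_pairs & only_def) / len(only_def) if only_def else 0.0
    ratio = frac_alt / frac_def if frac_def > 0 else float("inf")
    return frac_alt, frac_def, ratio

# Toy contact maps as sets of (i, j) residue pairs; (1, 2) is shared.
contacts_alt = {(1, 2), (1, 5), (2, 6), (3, 7), (4, 8)}
contacts_default = {(1, 2), (1, 9), (2, 10), (3, 11), (4, 12)}
top_pairs = {(1, 2), (1, 5), (2, 6), (3, 7), (1, 9)}

frac_alt, frac_def, ratio = exclusive_recovery(
    top_pairs, contacts_default, contacts_alt)
```

A ratio above 1 indicates that the ranked pairs recover alternative-exclusive contacts preferentially, which is the enrichment the figure reports for clustered alignments.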
In Figure 7, we highlight in orange (upper corner) the top L ranked DCA-predicted contacts from the clusters that maximized the detection of alternative contacts. For comparison, red points in the lower corner indicate the top DCA couplings derived from the full alignment. The strongest couplings computed from the full MSA align remarkably well with the contact map predicted by AlphaFold2’s default model, confirming that predictions are biased by the dominant coevolutionary signals in the protein families. On the other hand, alternative clusters exhibit a weaker correspondence between couplings and actual residue contacts, as expected due to the significantly smaller alignment sizes and the similarity among clustered sequences. However, several covarying pairs in the alternative alignments correspond to contacts unique to their respective conformations and appear crucial in guiding AlphaFold2 toward predicting alternative states.
Figure 7. Comparison of DCA-derived contacts from full and alternative MSA. Top L DCA scores from full MSA (red, lower). Top L DCA-scored pairs from the cluster that maximizes alternative contacts detection (orange, upper). AF2 predicted contacts with the full and alternative alignments are shown in gray in the lower and upper corner, respectively.
We also compare DCA predictions to those obtained using the Alternative Contact Enhancement (ACE) method, a framework that prunes deep alignments into shallow, subfamily-specific alignments to reveal coevolutionary signals enriched in specific conformational states, such as alternative folds (Figures S9 and S10). Across the seven proteins considered (excluding CopK due to insufficient MSA depth), our method recovers a comparable number of unique alternative-specific contacts in five cases and outperforms ACE in two, underscoring its ability to capture fold-specific coevolutionary constraints often missed by full MSA-based analyses.
4.1. Mutation Candidates across Proteins
We filter the ranked mutations to include only those pairs that form contacts in the alternative predicted structure but are absent in the default structure derived from the full MSA. Among the mutations identified for KaiB (which are listed in the GitHub repository mentioned below), the double mutation E31H with P67E stands out as a strong candidate for stabilizing the ground state (GS) over the fold-switched state (FS). Specifically, E31H disrupts the salt bridge between E31 and R79 in the FS, while potentially forming a new electrostatic interaction between H31 and E67 when histidine is protonated (at a lower pH). This interaction may contribute to the stabilization of the GS (Figure S11). Recall that, for KaiB, the FS corresponds to the AlphaFold2 prediction derived from the full MSA, while the GS refers to the alternative prediction. For RfaH, the double mutation P133M-N156L emerges as a key modification. This combination introduces two hydrophobic side chains, which are likely to interact with each other, in contrast with their solvent-exposed conformation observed in the full MSA state (see Figure S12). Proplasmepsin, which displays several large clusters associated with the alternative state, presents a proposed mutation V59Y-E167K that is particularly interesting: the Y–K contact benefits from strong hydrogen bonding and cation-π interactions, offering a clear advantage over the hydrophobic-polar mismatch between V and E. Moreover, the E167K mutation may destabilize the GS by disrupting the charge balance with neighboring residues K9, E11, and K16 (Figure S13). Additionally, the double mutation N61D-R355K introduces a salt bridge in the FS, which could provide stability over the default state. The proposed double mutation I53M-F60M in CopK would potentially stabilize the alternative state, where M53 and M60 would be close to each other rather than solvent-exposed as in the default state (see Figure S14).
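The frequency-shift criterion used to pick amino acid pairs at a coupled position pair can be sketched as follows. The function names and the two-column toy alignments are hypothetical; the letters are chosen only to echo the kind of substitution discussed above and are not real alignment data.

```python
from collections import Counter

def pair_frequencies(msa, i, j):
    """Joint frequency of each amino acid pair at columns i and j."""
    counts = Counter((s[i], s[j]) for s in msa)
    n = len(msa)
    return {ab: c / n for ab, c in counts.items()}

def top_pair_shift(cluster_msa, full_msa, i, j):
    """Amino acid pair whose frequency rises most in the cluster
    relative to the full alignment, with the size of the shift."""
    f_cluster = pair_frequencies(cluster_msa, i, j)
    f_full = pair_frequencies(full_msa, i, j)
    shifts = {ab: f - f_full.get(ab, 0.0) for ab, f in f_cluster.items()}
    return max(shifts.items(), key=lambda kv: kv[1])

# Toy data: each string holds only the two coupled columns (made up).
full_msa = ["EP"] * 7 + ["HE"] * 3
cluster_msa = ["HE", "HE", "HE", "EP"]   # subset linked to the alternative state
best_pair, shift = top_pair_shift(cluster_msa, full_msa, 0, 1)
```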
4.2. Validation of Mutation Effect with Alchemical Free Energy Calculations Based on Molecular Dynamics
Among the candidate mutations discussed above, we select the KaiB double mutation E31H-P67E for validation and use molecular dynamics alchemical free energy calculations (AFEC) to quantify how strongly the mutation favors the GS over the FS. Alchemical methods allow simulation of trajectories in which molecular species are mutated into different ones by switching on/off the nonbonded interactions of specifically chosen atoms, and the free energy associated with the transformation can be computed by integrating along the alchemical path λ (see Figure 8). Early applications of these methods allowed for prediction of the effects of protein mutations on protein stability and drug binding, whereas a recent study used AFEC to investigate the impact of mutations on the protein–protein interface between two signaling proteins. We aim to adapt these methods to assess how a mutation affects the relative stability of two fold-switching conformations. This is achieved by performing AFEC on both conformers and calculating the difference in ΔG values associated with the alchemical transformations, as illustrated in the thermodynamic cycle (Figure 8b). To this end, we adapt the AFEC procedure developed in a recent work, which employs the Hamiltonian replica exchange scheme (further details in the Methods section below). For the E31H-P67E mutation, the resulting ΔΔG indicates a stabilization of the GS conformation over the FS by approximately 3.5 kcal/mol, supporting the effectiveness of our pipeline in designing double mutations that influence the relative stability of fold-switching protein conformers. This result further reinforces the hypothesis that coevolutionary signals play a role in guiding AlphaFold2’s predictions toward specific conformational states.
Figure 8. Alchemical free energy calculations. Assessment of the impact of mutations on the relative thermodynamic stability of KaiB GS and FS conformers (a), defined as ΔΔG = ΔG_W − ΔG_M. ΔΔG can be derived as the difference between the horizontal ΔG values in the thermodynamic cycle (b), which are computed with MD by integrating along an alchemical path describing the transformation of the wild-type protein into the mutant. (c) The double mutation E31H-P67E in KaiB stabilizes the GS over the FS by 3.5 kcal/mol.
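The thermodynamic cycle can be written out explicitly. The sign convention in the first line (conformational free energy taken as FS minus GS) is an assumption, since the caption defines only ΔΔG = ΔG_W − ΔG_M. The third line is the standard thermodynamic integration expression, shown as one common way of integrating along λ; the estimator actually used with the Hamiltonian replica exchange scheme is not specified here.

```latex
\begin{align}
\Delta G_X &= G_X^{\mathrm{FS}} - G_X^{\mathrm{GS}},
  \qquad X \in \{W, M\} \\
\Delta\Delta G &= \Delta G_W - \Delta G_M
  = \Delta G_{\mathrm{GS}}^{W \to M} - \Delta G_{\mathrm{FS}}^{W \to M} \\
\Delta G_{c}^{W \to M} &= \int_0^1
  \left\langle \frac{\partial H_c(\lambda)}{\partial \lambda} \right\rangle_{\lambda}
  \, d\lambda, \qquad c \in \{\mathrm{GS}, \mathrm{FS}\}
\end{align}
```

The second equality follows because free energy is a state function, so the four legs of the cycle sum to zero. Under the assumed convention, a negative ΔΔG corresponds to a mutation that favors the GS; with the opposite convention the sign flips, while the magnitude is unchanged.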
5. Capturing Diverse Conformational Changes beyond Fold-Switching
To demonstrate the broader applicability of our approach, we extend our analysis beyond fold-switching, defined here as transitions between conformations characterized by distinct patterns of secondary structure contacts. While previous work has primarily focused on such dramatic topological rearrangements, many biologically relevant conformational changes involve subtler yet functionally critical shifts such as domain motions, loop rearrangements, or reorganization of specific side-chain interactions. We showcase that our MSA clustering strategy, initially designed to capture fold-switches, is also effective in identifying the alternative conformations associated with these more subtle structural transitions.
Specifically, we demonstrate its effectiveness across two distinct protein families: a kinase and a G-protein-coupled receptor. In both cases, we establish a clear correspondence between statistical enrichment and structural outcomes, highlighting the method’s ability to uncover biologically meaningful sequence patterns that shape functional conformational transitions. Consistent with previous findings using MSA subsampling, our clustering approach identifies both open and closed conformations of the Beta-1 adrenergic receptor, characterized by a slight shift of helix 7 relative to helix 6. However, the predictions reveal more pronounced structural differences beyond these subtle shifts. In particular, one cluster exhibits a significant divergence from the full MSA prediction (Figure 9a), leading to a major fold switch on the extracellular side. This transition is defined by a reorganization of sulfur bridge interactions, differing from those in the default conformer. Specifically, the alternative conformation disrupts the native C82-C167 and C160-C166 sulfur bridges while establishing a novel C82-C160 bridge (Figure 9b). Crucially, the statistical properties of this cluster provide a strong rationale for the observed structural rearrangement. The sequences associated with the alternative conformation show a strong enrichment of cysteine at position 160, accompanied by a corresponding depletion at position 167 (Figure 9c). This cysteine redistribution within the clustered sequences directly mirrors the rearrangement of disulfide bonds, reinforcing the idea that evolutionary constraints encoded in the MSA drive the observed conformational change.
Figure 9. GPCR Beta-1 adrenergic receptor: AlphaFold2 structure prediction performed with clustered MSA. (a) Conformation with high pLDDT and significant RMSD with respect to the full MSA prediction. (b) Secondary structure rearrangements in the extracellular region: breaking of C82-C167/C160-C166 and forming of C82-C160 sulfur bridges. (c) Cysteine frequencies at rearranged positions in full and alternative MSA.
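The single-site comparison in panel (c) amounts to contrasting the frequency of one residue type at a given column in the cluster and in the full alignment. A minimal sketch follows, using a log2 ratio with a small pseudocount to avoid division by zero. The pseudocount scheme, the function names, and the two-column toy alignment (whose columns merely stand in for the two rearranged positions) are assumptions for illustration.

```python
import math

def residue_frequency(msa, pos, aa, pseudocount=1.0):
    """Smoothed frequency of residue aa at column pos (presence vs absence)."""
    count = sum(1 for s in msa if s[pos] == aa)
    return (count + pseudocount) / (len(msa) + 2.0 * pseudocount)

def log2_enrichment(cluster_msa, full_msa, pos, aa):
    """Positive: enriched in the cluster; negative: depleted."""
    return math.log2(residue_frequency(cluster_msa, pos, aa)
                     / residue_frequency(full_msa, pos, aa))

# Toy alignments with two columns (made-up data).
cluster_msa = ["CA"] * 4
full_msa = ["CA"] * 4 + ["AC"] * 12

enrich_first = log2_enrichment(cluster_msa, full_msa, 0, "C")
enrich_second = log2_enrichment(cluster_msa, full_msa, 1, "C")
```

In the toy data, cysteine is enriched at the first column and depleted at the second, mirroring the qualitative pattern described for the alternative cluster.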
Building on recent work, which demonstrated that subsampling the input MSA of Abl1 Kinase captures both active and inactive conformations, we apply our clustering approach to the tyrosine kinase domain of the Epidermal Growth Factor Receptor. Our method successfully distinguishes both functional states by detecting characteristic activation loop rearrangements. Notably, clusters associated with the inactive conformation contain a substantial number of sequences, reflecting a strong evolutionary signal favoring this state. Furthermore, AlphaFold2 predicts the alternative conformation with remarkably high confidence (pLDDT 90%; Figure 10). Focusing on the MSA linked to the inactive state, we apply our DCA-based mutation detection method and identify a strongly coupled triplet, where position 164 is shared between two residue pairs: 158–164 and 159–164 (Figure S15). Specifically, position 164 undergoes a K-to-N mutation, reducing the positive charge, while residues 158 and 159 mutate from Y to C and H to L, respectively. These mutations likely mitigate the electrostatic repulsion present in the active state, instead stabilizing the inactive conformation where these residues are spatially close and form favorable contacts.
Figure 10. Tyrosine Kinase: AlphaFold2 structure prediction performed with clustered MSA. Three clusters are associated with the inactive state prediction. RMSDs are computed with respect to the AF2 predictions.
6. Discussion
We systematically identify metastable states in fold-switching proteins leveraging MSA Transformer-based agglomerative clustering, with greater confidence, efficiency, and precision than existing strategies. The effectiveness of the method naturally depends on the availability of homologous sequences. By grouping many sequences, we enable a robust statistical analysis of evolutionary constraints. This allows us to extract meaningful signals linked to conformational changes and design stabilizing mutations using Direct Coupling Analysis, suggesting that AlphaFold2 can be guided by such signals to favor specific conformations. We test the method on 81 well-characterized fold-switching proteins, consistently obtaining accurate predictions with high pLDDT scores for 26 tested systems. The number of false positives from AF2 predictions remains remarkably low, indicating that nearly all clustered sequences yield predictions that closely align with one of the two fold-switching states. Importantly, we are able to identify the alternative state in a protein where the foundational clustering approach failed. Through Principal Component Analysis of the Evoformer module representations derived from clustered alignments, we reveal a clear separation of conformational states in this stage of the AlphaFold2 model. These findings suggest that the MSA representations play a crucial role in refining pairwise representations within the Evoformer, steering the model toward distinct and accurate structural predictions. Furthermore, by applying DCA within each sequence cluster, we pinpoint key residue pairs with strong coevolutionary signals specific to each cluster. By leveraging these insights, we can identify residue–residue interactions that are likely crucial in driving conformational transitions between alternative states.
The coevolving pairs identified in the alternative alignments facilitate the design of targeted mutations that exhibit distinct statistical profiles compared to the full MSA, and are predicted to selectively stabilize the corresponding conformational state. To evaluate the impact of the E31H–P67E double mutation in KaiB on protein stability, we perform molecular dynamics simulations. In particular, we use alchemical free energy calculations, a well-established framework for reliable estimations of ΔΔG values, which is used here to quantify the impact of mutations on the relative populations of alternative conformations. The results confirm our hypothesis, showing that the E31H–P67E mutation significantly stabilizes the ground state of KaiB over the fold-switched state. We further extend the application of our clustering framework to capture a broader range of conformational transitions beyond fold-switching by applying it to proteins that undergo distinct, functionally relevant structural changes. We analyze the Beta-1 adrenergic receptor and detect both open and closed states, together with a distinct conformation that diverges significantly from the full MSA prediction, revealing a major fold switch on the extracellular side. The predicted structure disrupts native disulfide bridges while forming a new one, a change that is reflected in the differences in cysteine statistics between the corresponding cluster and the full MSA. Similarly, in the Tyrosine Kinase domain of the Epidermal Growth Factor Receptor, our analysis successfully captures characteristic activation loop rearrangements that distinguish the active and inactive states. The mutations identified through our DCA-based approach likely mitigate electrostatic repulsion in the active state, instead stabilizing the inactive conformation where these residues are positioned in close proximity.
Remarkably, our strategy enables blind detection of alternative folds, as high-confidence predictions that diverge markedly in the RMSD from the dominant conformation often indicate structurally distinct fold-switching events. Our findings reveal key evolutionary patterns within protein families that shape AlphaFold2’s predictions of the alternative folds. By refining the clustering strategy, we improve not only the efficiency and precision of the predictions but also their reliability and interpretability, as DCA within clusters reveals meaningful covariant pairs. The detected mutation patterns reflect evolutionary pressures driving the emergence of distinct fold-switching phenotypes, enabling the design of targeted mutations that selectively stabilize a specific structural state. Our results highlight the power of clustering-driven sequence analysis in uncovering biologically relevant metastable states and informing the rational design of mutations to stabilize specific conformations.
7. Methods
7.1. Clustering Techniques
The DBSCAN algorithm adopted in the seminal work identifies dense regions, where sequences are closely packed, and separates them from low-density areas. It relies on two critical hyperparameters: the maximum distance between points in a cluster (ϵ) and the minimum sample size, which is the smallest number of points required to form a dense region. Through a grid search, the ϵ value is optimized to produce the highest number of clusters while maintaining a very low minimum sample size to ensure high heterogeneity of sequence groups. We re-evaluate the performance of DBSCAN, choosing an increased minimum sample size of 20 (with ϵ set as previously illustrated). We then introduce an agglomerative hierarchical clustering approach, using Ward’s linkage method to merge clusters. This method iteratively merges clusters while minimizing the increase in the total within-cluster variance. We iteratively decrease the method’s maximum cluster number parameter until each cluster contains at least 20 elements, to prevent the creation of small, unrepresentative clusters.
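The Ward merging rule and the minimum-size stopping criterion can be illustrated with a minimal, standard-library sketch. It is not the code used in this work: the function names, the toy 2D points (stand-ins for sequence embeddings), and the parameter values are assumptions chosen for illustration. Because hierarchical clusterings are nested, lowering the maximum cluster number step by step is equivalent to continuing to merge until every cluster reaches the minimum size, which is what the loop does.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b are merged."""
    d2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return a["n"] * b["n"] / (a["n"] + b["n"]) * d2


def merge(a, b):
    """Merge two clusters, updating size, centroid, and member indices."""
    n = a["n"] + b["n"]
    centroid = [(a["n"] * x + b["n"] * y) / n
                for x, y in zip(a["centroid"], b["centroid"])]
    return {"n": n, "centroid": centroid, "members": a["members"] + b["members"]}


def ward_min_size(points, max_clusters, min_size):
    """Ward agglomerative clustering that keeps merging until there are at
    most max_clusters clusters and each one has at least min_size members."""
    clusters = [{"n": 1, "centroid": list(p), "members": [i]}
                for i, p in enumerate(points)]
    while len(clusters) > 1 and (
        len(clusters) > max_clusters
        or min(c["n"] for c in clusters) < min_size
    ):
        # Naive O(n^2) search for the cheapest merge; fine for a toy example.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c["members"]) for c in clusters]


# Toy data (hypothetical): two groups of five points and one group of two.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
          (14, 14), (14, 15)]
clusters = ward_min_size(points, max_clusters=3, min_size=4)
print(sorted(clusters))
```

In this toy case the two-point group is too small to stand alone, so it is absorbed by its nearest neighbour. For alignments with thousands of sequences, an optimized library routine would be the practical choice over this naive loop.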
7.2. Alternative State Detection
AlphaFold2 predictions are performed using ColabFold, following a protocol with three recycling iterations, generating five models per prediction, and without employing any structural templates. We report the results for the top-ranked model, as determined by the pLDDT score. RMSDs between AF2 predictions and experimental PDB structures were calculated using MDTraj, considering only backbone C atoms. Atom indices for the target and reference structures were carefully selected based on structural alignment. For the MinE system, a subset of 30 atoms located in the central region of the protein was used for RMSD calculation, enabling a clearer distinction between the default and alternative states in the 2D RMSD plots shown in Figure . Each prediction was classified as either the default (full MSA) state or the alternative state based on its RMSD values. The analysis scripts are available in the GitHub repository (see Data Availability), within each system’s output folder, as a notebook named 2D_rmsd_plot.ipynb.
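As a rough illustration of the last two steps (choosing the top-ranked model and labeling it from its RMSD values), the sketch below uses plain Python. The decision rule shown, assigning each prediction to the reference structure with the smaller RMSD, is an assumption, since the text does not spell out the exact criterion; the dictionary layout, model names, and numbers are hypothetical.

```python
from statistics import mean


def top_ranked(models):
    """Return the model with the highest mean per-residue pLDDT."""
    return max(models, key=lambda m: mean(m["plddt"]))


def classify_state(rmsd_default, rmsd_alternative):
    """Label a prediction by the reference it is closest to (assumed rule).

    rmsd_default: RMSD (in Å) to the default (full MSA) state.
    rmsd_alternative: RMSD (in Å) to the alternative state.
    """
    return "default" if rmsd_default <= rmsd_alternative else "alternative"


# Hypothetical output for one cluster: five models, each with per-residue
# pLDDT values and a pair of RMSDs (to default, to alternative).
models = [
    {"name": "model_1", "plddt": [62.0, 70.5, 68.1], "rmsd": (7.9, 2.4)},
    {"name": "model_2", "plddt": [88.3, 91.0, 86.7], "rmsd": (8.6, 1.1)},
    {"name": "model_3", "plddt": [75.2, 80.4, 79.9], "rmsd": (8.1, 1.8)},
    {"name": "model_4", "plddt": [55.0, 61.3, 58.8], "rmsd": (3.0, 6.5)},
    {"name": "model_5", "plddt": [81.1, 84.6, 83.0], "rmsd": (8.4, 1.5)},
]
best = top_ranked(models)
label = classify_state(*best["rmsd"])
print(best["name"], label)
```

Plotting the two RMSD values of every cluster's top-ranked model against each other gives the kind of 2D RMSD plot described above.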
7.3. Direct Coupling Analysis and Alignments’ Statistics
We employ the pydca package using pseudolikelihood maximization for the inference of couplings. We select the top L residue pairs with the strongest couplings from each alternative MSA (L is the sequence length). For each pair, we propose the mutation that shows the greatest divergence in frequency between the alternative and full alignments. We merge the pairs from all clusters and rank them based on their frequency difference between the clustered and full MSAs. The top L pairs with the most pronounced frequency shifts are then retained. Contacts are defined as residue pairs whose side chains are within 10 Å of each other in the predicted structure. Contact maps for alternative states are constructed by averaging residue distances across all predicted alternative structures.
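The frequency-divergence step can be sketched as follows. This is an illustrative reading of the procedure, not the published implementation: it assumes that divergence is measured on joint residue-pair frequencies at the two coupled columns, that gapped pairs are skipped, and that the coupled pairs have already been obtained from DCA. All function names and the toy alignments are hypothetical.

```python
from collections import Counter


def pair_frequencies(msa, i, j):
    """Frequency of each (residue_i, residue_j) combination at columns i, j."""
    counts = Counter((seq[i], seq[j]) for seq in msa)
    return {pair: n / len(msa) for pair, n in counts.items()}


def propose_mutation(cluster_msa, full_msa, i, j):
    """Residue pair most enriched in the cluster relative to the full MSA.

    Returns (pair, shift), or (None, 0.0) if only gapped pairs are present.
    """
    f_cluster = pair_frequencies(cluster_msa, i, j)
    f_full = pair_frequencies(full_msa, i, j)
    candidates = [p for p in f_cluster if "-" not in p]  # skip gaps (assumed)
    if not candidates:
        return None, 0.0
    best = max(candidates, key=lambda p: f_cluster[p] - f_full.get(p, 0.0))
    return best, f_cluster[best] - f_full.get(best, 0.0)


def rank_pairs(coupled_pairs, cluster_msa, full_msa, top_n):
    """Rank coupled column pairs by the frequency shift of their proposal."""
    proposals = [(i, j, *propose_mutation(cluster_msa, full_msa, i, j))
                 for i, j in coupled_pairs]
    proposals.sort(key=lambda t: t[3], reverse=True)
    return proposals[:top_n]


# Toy alignments (hypothetical): the cluster is a subset of the full MSA.
full_msa = ["AEKP"] * 6 + ["AHKE"] * 2
cluster_msa = ["AHKE", "AHKE"]
ranked = rank_pairs([(0, 2), (1, 3)], cluster_msa, full_msa, top_n=1)
print(ranked)
```

Here columns 1 and 3 carry H and E in every cluster sequence but in only a quarter of the full alignment, so that pair is retained, while the fully conserved columns 0 and 2 show no shift.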
7.4. Alchemical Free Energy Calculations
In this work, we adapt an AFEC setup developed previously, which makes use of the Hamiltonian replica exchange scheme, proposing exchanges every 200 fs. We run 48 replicas simultaneously, interpolating bond interactions, Lennard-Jones parameters, and partial charges. We make use of the double system single box strategy to account for the change in net charge induced by the mutations. In particular, we simultaneously simulate the ground-state (GS) and fold-switched (FS) KaiB systems in the same box, using soft constraints to fix the distance of their centers of mass to 6 nm and performing the alchemical transformations in opposite directions. In this way, the ΔG computed from the single simulation corresponds directly to the ΔΔG assessing the impact of the mutations on the relative stability of alternative conformers (Figure ). Simulation boxes consist of rhombic dodecahedrons containing the two proteins, water, Na+, and Cl– ions with an excess salt concentration of 0.1 M. The systems are energy minimized and subjected to a multistep equilibration procedure for 8 replicas corresponding to λ values ranging from 0 to 1: 100 ps of thermalization to 300 K in the NVT ensemble are conducted with the stochastic dynamics integrator (i.e., Langevin dynamics), followed by a further 100 ps in the NPT ensemble using the Parrinello–Rahman barostat. In production runs, the stochastic dynamics integrator is used in combination with the Parrinello–Rahman barostat to keep the pressure at 1 bar. Equations of motion are integrated with a time-step of 2 fs. Long-range electrostatic interactions are handled by particle-mesh Ewald. Each replica is simulated for 10 ns, for a total of 48 × 10 = 480 ns. The starting structures for KaiB are selected from the AF2 predictions. To generate the hybrid residue topologies, we use the pmx package.
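The thermodynamic cycle behind the double system single box argument can be written out explicitly. The following is a sketch under stated assumptions: the notation, the sign convention, and the choice of which conformer is transformed in which direction are illustrative and not fixed by the text, and the two proteins are assumed not to interact at the imposed 6 nm separation.

```latex
% Notation (assumed for this sketch):
%   \Delta G^{\mathrm{seq}}_{\mathrm{GS}\to\mathrm{FS}} : conformational free energy
%       difference between FS and GS for a given sequence (wt or mut);
%   \Delta G^{X}_{a \to b} : free energy of alchemically transforming
%       sequence a into sequence b within conformer X (GS or FS).
\begin{align}
\Delta\Delta G
  &= \Delta G^{\mathrm{mut}}_{\mathrm{GS}\to\mathrm{FS}}
   - \Delta G^{\mathrm{wt}}_{\mathrm{GS}\to\mathrm{FS}} \\
  &= \Delta G^{\mathrm{FS}}_{\mathrm{wt}\to\mathrm{mut}}
   - \Delta G^{\mathrm{GS}}_{\mathrm{wt}\to\mathrm{mut}} \\
  &= \Delta G^{\mathrm{FS}}_{\mathrm{wt}\to\mathrm{mut}}
   + \Delta G^{\mathrm{GS}}_{\mathrm{mut}\to\mathrm{wt}}
   = \Delta G_{\mathrm{box}}
\end{align}
```

The second line follows from closing the cycle, since free energy is a state function. The third line uses the fact that reversing an alchemical transformation flips the sign of its free energy, so a single box in which FS goes from wild type to mutant while GS goes from mutant to wild type yields the full difference in one calculation. With this convention, a positive ΔΔG means the mutation stabilizes GS relative to FS.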
Molecular dynamics simulations are performed using GROMACS 2022.5, with the AMBER ff14SB force field for amino acids, the TIP3P model for water, and Joung and Cheatham parameters for ions. No long-range dispersion corrections are applied in the MD simulations with the AMBER ff14SB force field. The protonation states of all ionizable residues, including histidine H31 in the E31H+P67E KaiB mutant, are assigned using the default GROMACS tools, corresponding to a neutral histidine (HSD). Finally, free energy differences are computed with the BAR method implemented in GROMACS.
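To make the BAR step concrete, the sketch below solves Bennett's self-consistent equation for a single pair of states by bisection. It is a didactic stand-in, not the GROMACS implementation: energies are assumed to be in units of kT, the function names are hypothetical, and the synthetic Gaussian work values replace the energy differences that a real calculation would collect between neighbouring λ windows.

```python
import math
import random


def fermi(x):
    """Numerically stable evaluation of 1 / (1 + exp(x))."""
    if x > 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))


def bar_delta_f(w_forward, w_reverse, tol=1e-9):
    """Bennett acceptance ratio estimate of the free energy difference.

    w_forward: U1 - U0 evaluated on samples from state 0 (units of kT).
    w_reverse: U0 - U1 evaluated on samples from state 1 (units of kT).
    """
    m = math.log(len(w_forward) / len(w_reverse))

    def g(df):
        fwd = sum(fermi(m + w - df) for w in w_forward)
        rev = sum(fermi(-m + w + df) for w in w_reverse)
        return fwd - rev  # increases monotonically with df

    # Expand the bracket until it contains the root, then bisect.
    lo, hi = -1.0, 1.0
    while g(lo) > 0:
        lo *= 2.0
    while g(hi) < 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Synthetic check (hypothetical data): Gaussian work distributions that
# satisfy the Crooks relation for a known free energy difference.
rng = random.Random(0)
true_df, sigma, n = 2.0, 1.0, 5000
w_f = [rng.gauss(true_df + 0.5 * sigma**2, sigma) for _ in range(n)]
w_r = [rng.gauss(-true_df + 0.5 * sigma**2, sigma) for _ in range(n)]
estimate = bar_delta_f(w_f, w_r)
print(f"BAR estimate: {estimate:.3f} kT (reference {true_df} kT)")
```

In a multi-window calculation such as the one described here, an estimate of this kind is obtained for each pair of adjacent λ values and the results are summed along the alchemical path.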
Acknowledgments
The authors acknowledge the AREA Science Park supercomputing platform ORFEO made available for conducting the research reported in this paper and the technical support of the Laboratory of Data Engineering staff. V.P., A.C., and F.C. were supported by the European Union – NextGenerationEU within the project PNRR “PRP@CERIC” IR0000028 - Mission 4 Component 2 Investment 3.1 Action 3.1.1.
All relevant data are displayed in the manuscript. The codes for reproducing the results and figures shown in the manuscript are available at https://github.com/RitAreaSciencePark/MSARC. This folder also contains all parameter files, topologies, and scripts needed to reproduce the AFEC simulations. All clustered MSAs and related AF2 predictions presented in this work can be found at 10.5281/zenodo.15869824.
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.5c01090.
Additional results on the analysis of Evoformer representations; design of mutations; and benchmark on an extended data set of fold-switching proteins (PDF)
V.P. and F.C. contributed equally to this work. V.P. performed the molecular dynamics simulations, while F.C. developed the code for cluster generation and conducted the Direct Coupling Analysis (DCA). A.C. supervised the project and provided guidance throughout the research process.
The authors declare no competing financial interest.
References
- Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A., et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2.
- Saldaño T., Escobedo N., Marchetti J., Zea D. J., Mac Donagh J., Velez Rueda A. J., Gonik E., García Melani A., Novomisky Nechcoff J., Salas M. N., et al. Impact of protein conformational diversity on AlphaFold predictions. Bioinformatics. 2022;38:2742–2748. doi: 10.1093/bioinformatics/btac202.
- Sala D., Engelberger F., Mchaourab H., Meiler J. Modeling conformational states of proteins with AlphaFold. Curr. Opin. Struct. Biol. 2023;81:102645. doi: 10.1016/j.sbi.2023.102645.
- Jing B., Berger B., Jaakkola T. AlphaFold meets flow matching for generating protein ensembles. arXiv. 2024. doi: 10.48550/arXiv.2402.04845.
- Bryant P., Noé F. Structure prediction of alternative protein conformations. Nat. Commun. 2024;15:7328. doi: 10.1038/s41467-024-51507-2.
- Heo L., Feig M. Multi-state modeling of G-protein coupled receptors at experimental accuracy. Proteins: Struct., Funct., Bioinf. 2022;90:1873–1885. doi: 10.1002/prot.26382.
- Del Alamo D., Sala D., Mchaourab H. S., Meiler J. Sampling alternative conformational states of transporters and receptors with AlphaFold2. eLife. 2022;11:e75751. doi: 10.7554/eLife.75751.
- Monteiro da Silva G., Cui J. Y., Dalgarno D. C., Lisi G. P., Rubenstein B. M. High-throughput prediction of protein conformational distributions with subsampled AlphaFold2. Nat. Commun. 2024;15:2464. doi: 10.1038/s41467-024-46715-9.
- Stein R. A., Mchaourab H. S. SPEACH_AF: Sampling protein ensembles and conformational heterogeneity with Alphafold2. PLOS Computational Biology. 2022;18:e1010483. doi: 10.1371/journal.pcbi.1010483.
- Wayment-Steele H. K., Ojoawo A., Otten R., Apitz J. M., Pitsawong W., Hömberger M., Ovchinnikov S., Colwell L., Kern D. Predicting multiple conformations via sequence clustering and AlphaFold2. Nature. 2024;625:832–839. doi: 10.1038/s41586-023-06832-9.
- Morcos F., Pagnani A., Lunt B., Bertolino A., Marks D. S., Sander C., Zecchina R., Onuchic J. N., Hwa T., Weigt M. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl. Acad. Sci. U. S. A. 2011;108:E1293–E1301. doi: 10.1073/pnas.1111471108.
- Cocco S., Feinauer C., Figliuzzi M., Monasson R., Weigt M. Inverse statistical physics of protein sequences: a key issues review. Rep. Prog. Phys. 2018;81:032601. doi: 10.1088/1361-6633/aa9965.
- Ekeberg M., Hartonen T., Aurell E. Fast pseudolikelihood maximization for direct-coupling analysis of protein structure from many homologous amino-acid sequences. J. Comput. Phys. 2014;276:341–356. doi: 10.1016/j.jcp.2014.07.024.
- Morcos F., Jana B., Hwa T., Onuchic J. N. Coevolutionary signals across protein lineages help capture multiple protein conformations. Proc. Natl. Acad. Sci. U. S. A. 2013;110:20533–20538. doi: 10.1073/pnas.1315625110.
- Chodera J. D., Mobley D. L., Shirts M. R., Dixon R. W., Branson K., Pande V. S. Alchemical free energy methods for drug discovery: progress and challenges. Curr. Opin. Struct. Biol. 2011;21:150–160. doi: 10.1016/j.sbi.2011.01.011.
- Ester M., Kriegel H.-P., Sander J., Xu X., et al. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD'96. 1996; pp 226–231.
- Rao R. M., Liu J., Verkuil R., Meier J., Canny J., Abbeel P., Sercu T., Rives A. MSA transformer. International Conference on Machine Learning. 2021; pp 8844–8856.
- Porter L. L., Looger L. L. Extant fold-switching proteins are widespread. Proc. Natl. Acad. Sci. U. S. A. 2018;115:5968–5973. doi: 10.1073/pnas.1800168115.
- Chakravarty D., Schafer J. W., Chen E. A., Thole J. F., Ronish L. A., Lee M., Porter L. L. AlphaFold predictions of fold-switched conformations are driven by structure memorization. Nat. Commun. 2024;15:7296. doi: 10.1038/s41467-024-51801-z.
- Sfriso P., Duran-Frigola M., Mosca R., Emperador A., Aloy P., Orozco M. Residues coevolution guides the systematic identification of alternative functional conformations in proteins. Structure. 2016;24:116–126. doi: 10.1016/j.str.2015.10.025.
- Schafer J. W., Porter L. L. Evolutionary selection of proteins with two folds. Biophys. J. 2023;122:474a. doi: 10.1016/j.bpj.2022.11.2543.
- Mey A. S., Allen B. K., Macdonald H. E. B., Chodera J. D., Hahn D. F., Kuhn M., Michel J., Mobley D. L., Naden L. N., Prasad S., et al. Best Practices for Alchemical Free Energy Calculations [Article v1.0]. Living J. Comp. Mol. Sci. 2020;2:18378. doi: 10.33011/livecoms.2.1.18378.
- Gao J., Kuczera K., Tidor B., Karplus M. Hidden Thermodynamics of Mutant Proteins: A Molecular Dynamics Analysis. Science. 1989;244:1069–1072. doi: 10.1126/science.2727695.
- La Serra M. A., Vidossich P., Acquistapace I., Ganesan A. K., De Vivo M. Alchemical Free Energy Calculations to Investigate Protein–Protein Interactions: The Case of the CDC42/PAK1 Complex. J. Chem. Inf. Model. 2022;62:3023–3033. doi: 10.1021/acs.jcim.2c00348.
- Piomponi V., Fröhlking T., Bernetti M., Bussi G. Molecular simulations matching denaturation experiments for N6-Methyladenosine. ACS Cent. Sci. 2022;8:1218–1228. doi: 10.1021/acscentsci.2c00565.
- da Silva G. M., Cui J., Dalgarno D., et al. High-throughput prediction of protein conformational distributions with subsampled AlphaFold2. Nat. Commun. 2024;15:2464. doi: 10.1038/s41467-024-46715-9.
- Müllner D. fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. J. Stat. Software. 2013;53:1–18. doi: 10.18637/jss.v053.i09.
- Mirdita M., Schütze K., Moriwaki Y., Heo L., Ovchinnikov S., Steinegger M. ColabFold: making protein folding accessible to all. Nat. Methods. 2022;19:679–682. doi: 10.1038/s41592-022-01488-1.
- McGibbon R. T., Beauchamp K. A., Harrigan M. P., Klein C., Swails J. M., Hernández C. X., Schwantes C. R., Wang L.-P., Lane T. J., Pande V. S. MDTraj: A Modern Open Library for the Analysis of Molecular Dynamics Trajectories. Biophys. J. 2015;109:1528–1532. doi: 10.1016/j.bpj.2015.08.015.
- Zerihun M. B., Pucci F., Peter E. K., Schug A. pydca v1.0: a comprehensive software for direct coupling analysis of RNA and protein sequences. Bioinformatics. 2020;36:2264–2265. doi: 10.1093/bioinformatics/btz892.
- Patel D., Patel J. S., Ytreberg F. M. Implementing and Assessing an Alchemical Method for Calculating Protein–Protein Binding Free Energy. J. Chem. Theory Comput. 2021;17:2457–2464. doi: 10.1021/acs.jctc.0c01045.
- Goga N., Rzepiela A., De Vries A., Marrink S., Berendsen H. Efficient algorithms for Langevin and DPD dynamics. J. Chem. Theory Comput. 2012;8:3637–3649. doi: 10.1021/ct3000876.
- Parrinello M., Rahman A. Polymorphic transitions in single crystals: A new molecular dynamics method. J. Appl. Phys. 1981;52:7182–7190. doi: 10.1063/1.328693.
- Darden T., York D., Pedersen L. Particle mesh Ewald: An N log (N) method for Ewald sums in large systems. J. Chem. Phys. 1993;98:10089–10092. doi: 10.1063/1.464397.
- Gapsys V., Michielssens S., Seeliger D., de Groot B. L. pmx: Automated protein structure and topology generation for alchemical perturbations. J. Comput. Chem. 2015;36:348–354. doi: 10.1002/jcc.23804.
- Abraham M. J., Murtola T., Schulz R., Páll S., Smith J. C., Hess B., Lindahl E. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX. 2015;1:19–25. doi: 10.1016/j.softx.2015.06.001.
- Maier J. A., Martinez C., Kasavajhala K., Wickstrom L., Hauser K. E., Simmerling C. ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameters from ff99SB. J. Chem. Theory Comput. 2015;11:3696–3713. doi: 10.1021/acs.jctc.5b00255.
- Jorgensen W. L., Chandrasekhar J., Madura J. D., Impey R. W., Klein M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 1983;79:926–935. doi: 10.1063/1.445869.
- Joung I. S., Cheatham T. E. III. Determination of alkali and halide monovalent ion parameters for use in explicitly solvated biomolecular simulations. J. Phys. Chem. B. 2008;112:9020–9041. doi: 10.1021/jp8001614.
- Bennett C. H. Efficient estimation of free energy differences from Monte Carlo data. J. Comput. Phys. 1976;22:245–268. doi: 10.1016/0021-9991(76)90078-4.