[Preprint]. 2023 Sep 7:arXiv:2309.03649v1. [Version 1]

Exploring kinase DFG loop conformational stability with AlphaFold2-RAVE

Bodhi P. Vani, Akashnathan Aranganathan, Pratyush Tiwary
PMCID: PMC10508826  PMID: 37731662

Abstract

Kinases compose one of the largest fractions of the human proteome, and their malfunction is implicated in many diseases, in particular cancers. The ubiquity and structural similarities of kinases make specific and effective drug design difficult. In particular, the conformational variability arising from the evolutionarily conserved DFG motif adopting in and out conformations, and the relative stabilities thereof, are key in structure-based drug design for ATP-competitive drugs. These relative conformational stabilities are extremely sensitive to small changes in sequence, and pose an important problem for sampling method development. Since the invention of AlphaFold2, the world of structure-based drug design has noticeably changed. In spite of it being limited to crystal-like structure prediction, several methods have leveraged its underlying architecture to improve dynamics and enhanced sampling of conformational ensembles, including AlphaFold2-RAVE. Here, we extend AlphaFold2-RAVE and apply it to a set of kinase sequences: wild-type DDR1 and three single-point mutants that are known to behave drastically differently. We show that AlphaFold2-RAVE is able to efficiently recover the changes in relative stability using transferable learnt order parameters and potentials, thereby supplementing AlphaFold2 as a tool for exploration of Boltzmann-weighted protein conformations.

Graphical Abstract


Introduction

The first step of a typical structure-based drug design pipeline2 is target protein structure prediction.3 Traditionally this has been done using experimental techniques like X-ray crystallography and NMR spectroscopy, or computational techniques like homology modeling.4–6 These are all either time-consuming, of limited accuracy, or reliant on adequate prior knowledge. With AlphaFold2 (AF2),7 we saw a paradigm shift in protein structure prediction. However, protein function is not solely dependent on a single native-like structure; rather, it is only properly understood or characterized through the protein's structural ensemble, including potentially several metastable conformations. Moreover, it is not sufficient to have a sense of conformational diversity alone, as the relative thermodynamic stabilities of protein conformations can be key in understanding activity, effects of mutation, and differences in behavior of closely related proteins. In the short time since the development of AF2, several publications have found ways to bridge the gap between conformational variability and AF2 predictions. Many of these leverage AF2's internal architecture, and range over a spectrum from needing substantial input from physics-based simulation engines to not needing any physics at all. A common approach is to exploit the multiple sequence alignment input featurization to introduce stochasticity and deviations from the native structure in the AF2 prediction.8 This includes our work9 combining AF2 with the machine learning-based enhanced sampling method Reweighted Autoencoded Variational Bayes for Enhanced Sampling (RAVE)10 into a protocol we call AF2-RAVE, which goes from sequence to conformations ranked by their thermodynamic or Boltzmann weights.

A well-explored way to study the conformational diversity of a biomolecule is the computational method of molecular dynamics (MD), i.e. parametrizing intra- and inter-molecular forces with a force field and integrating Newton's equations of motion.11–13 However, there are two key challenges in MD. The first is the difficulty of sampling biologically relevant timescales. Since the integration of the equations of motion is limited by the fastest degrees of motion, which are on a femtosecond timescale, seeing changes of interest that can occur on timescales of nanoseconds to hours is often prohibitively expensive and intractable with our current computational capacities. This has given rise to a large body of work on enhanced sampling algorithms14–17 for difficult-to-sample distributions. These algorithms essentially attempt to sample a modified distribution and then reweight observables to obtain the correct statistics. This includes a wide range of methods addressing different concerns in sampling, each with its own set of challenges and its own limitations. Broadly, these methods can be classified in at least two ways: those that attempt to change the underlying Hamiltonian of the system,18–20 and those which aim to statistically bias trajectories by splitting and resampling them.21–23
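To make the reweighting idea concrete, here is a minimal numpy sketch (not from the paper's codebase) that recovers an unbiased ensemble average from frames sampled under a static bias potential V(s): each frame is weighted by exp(+V/kBT). The observable values, bias energies, and kT value below are placeholders for illustration; dynamically updated biases such as metadynamics need additional corrections, as discussed later.

```python
import numpy as np

def reweighted_average(observable, bias_energy, kT):
    """Unbiased ensemble average of an observable from a statically biased simulation.

    observable  : per-frame values of the observable A(s_t)
    bias_energy : per-frame bias potential V(s_t), in the same units as kT
    kT          : thermal energy k_B * T

    Frames sampled under H + V are reweighted by w_t = exp(+V(s_t)/kT),
    so <A> = sum_t w_t A_t / sum_t w_t.
    """
    w = np.exp((bias_energy - bias_energy.max()) / kT)  # shift by max for numerical stability
    return np.sum(w * observable) / np.sum(w)

# Illustrative usage with synthetic numbers only
rng = np.random.default_rng(0)
obs = rng.normal(size=1000)                 # stand-in observable
bias = rng.uniform(0.0, 5.0, size=1000)     # stand-in bias energies, kJ/mol
print(reweighted_average(obs, bias, kT=2.494))  # kT at 300 K in kJ/mol
```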

The second challenge in MD, closely related to the first, is the so-called curse of dimensionality. Biomolecular systems are usually in the range of roughly 10³ to 10⁷ atoms, leading to an unmanageably large number of degrees of freedom. We do not aim to sample the entire configuration space of these systems. However, it is commonly true that most biomolecules have a small number of low-lying degrees of freedom, or a low-lying manifold, that completely describes transitions of interest, or can separate conformational differences of biological relevance.24,25 This underlying manifold is rather confusingly referred to by many names; some commonly used ones are: reaction coordinate, alluding to the fact that the manifold traverses transitions between metastable states; collective variable, as it is usually a function of multiple coordinates of the system; and order parameter, as it is used to parametrize different metastable states. In this work, we will refer to these degrees of freedom as "collective variables" (CVs) when using them as inputs or a basis set, and "order parameter" (OP) when referring to the finally obtained variable that we bias along. Solving this second challenge thus has implications for the first as well.

While some of the aforementioned enhanced sampling methods attempt a generalized increase in sampling across every degree of freedom for every atom, for instance by increasing temperature, most of them aim to increase sampling along a specific manifold in the configuration space of the molecule. Even for the more generalized methods, actually quantifying the increase in sampling for high dimensional systems is difficult. Conversely, for methods that sample along predetermined manifolds, it is often the case that the wrong choice results in incomplete or incorrect configuration space sampling. Prior to the advent of machine learning in chemistry, these low-lying degrees of freedom were almost always chosen by careful inspection and prior biophysical knowledge.26–31 Since the popularization of manifold learning methods, the identification of collective variables using ML has become a large field of interest.32–34 However, most methods in this field are still best suited to a small set of problems and each has its own particular limitations;15,32 in particular, the need for a priori information remains a frequent bottleneck.

In this work, we use the AF2-RAVE9 protocol, but with some significant refinements that make it more efficient, statistically robust, and transferable to mutations, suggesting it might be transferable within families of closely related proteins. We demonstrate this protocol on a kinase and its mutants.

Protein kinases, and in particular the DFG-in to DFG-out transition, have been extensively studied using MD along with several enhanced sampling methods.35,36 This family of enzymes is one of the most important therapeutic targets for structure-based drug design, as they are ubiquitous in the human proteome.37 Their main role is to mediate cell signalling in a large range of cellular processes, in particular replication, hence implicating them in a majority of cancers. While there already exist several highly effective medicinal molecules for cancer therapy that function by targeting and inhibiting kinases,38 one could argue that at best we have scratched the surface in terms of kinase-based therapeutics.39

In their active state, protein kinases catalyze the phosphorylation of substrate proteins through the transfer of the γ-phosphate group from adenosine triphosphate (ATP) or guanosine triphosphate (GTP). Often, the substrate protein is another in a cascade of kinases required for cell signalling.40,41 While there are over 500 kinases in the human kinome, making them a challenging class to study, the kinome universally shares some highly conserved structural motifs with a structurally well characterized active state. One key motif for this characterization is the Asp-Phe-Gly (DFG) motif in the activation loop. This motif has two structural conformers: one with the Asp pointing into the loop, the DFG-in or active conformation, and one with it pointing out into the solvent, the DFG-out or inactive conformation. ATP binding, and hence phosphorylation catalysis, can only occur in the DFG-in conformation. Most drugs targeting kinases are "ATP competitive", i.e. they bind so as to prohibit ATP binding. These ATP-competitive drugs are classified mainly into two types: those binding to the active site in the DFG-in conformation, hence inhibiting the binding of ATP, and those binding to the DFG-out conformation, hence stabilizing the inactive state. However, the presence of kinases in many essential cellular functions and their homologous nature make specificity and efficacy particularly hard to achieve. Often, we are interested in drugs that bind preferentially to specific kinases without affecting other kinases. Given this, the characterization of diverse inactive states is of particular importance, both in terms of a structural understanding of these states and in terms of their thermodynamic stabilities relative to the active state. For instance, given a target kinase, finding a uniquely stable inactive state with a novel binding site could lead to a more specific, less promiscuous (and hence less toxic) drug. However, the most robust way to do this computationally, obtaining an MD trajectory that traverses the space of active and inactive states multiple times, is effectively impossible. Additionally, the transition from DFG-in to DFG-out is highly non-local and a concerted combination of several long-range and large-scale motions, so that characterizing and studying it even with the aid of enhanced sampling is difficult.

Another important covariate in the study of kinase conformational ensembles is that of point mutations, which are often the cause of incorrect signalling leading to pathology. One key reason for this is that the balance of probability between active and inactive conformations is delicate and often flipped by changing a single residue, as shown by the system we have chosen in this work, DDR1.

Our AF2-RAVE based approach to this problem involves combining structural ensembles obtained from AF2 with a machine learning algorithm that learns order parameters for biasing. To demonstrate this, we use the discoidin domain receptor tyrosine kinase 1 (DDR1), which in the wild type is more stable in an inactive DFG-out conformation. However, for several single-site mutants, specifically D671N, Y755A and Y759A, the relative DFG conformational stabilities are flipped. One of the reasons we choose this set of systems is that a recent paper42 provides extremely detailed and valuable work on their DFG stabilities, with atypically long unbiased MD trajectories. We show that our results agree with theirs qualitatively, in that we predict the flipping of DFG conformation preference on mutation. While we sacrifice some quantitative accuracy, our method is faster by roughly 2–3 orders of magnitude (exact simulation lengths are described in Results), and has the potential to be reused without relearning much of the information.

We will begin by discussing the methods used within the protocol and outlining AF2-RAVE. Next, we discuss some molecular biology background for kinases, in particular aspects that may be important to the DFG-in to DFG-out transition. Finally, we will describe our protocol and the results we obtained.

Methods

In this section we will first describe the methods that compose our protocol: (i) AlphaFold2 and MSA depth modification, (ii) metadynamics, and (iii) the state predictive information bottleneck (SPIB),43 the most recent variant of RAVE.10 Finally, we list important parameters for our MD simulations in (iv) Simulation details. Additionally, Fig. 1 shows a high-level flowchart of the method.

Figure 1:


A high level schematic of the method, showing: (i) a typical input sequence, (ii) AF2 generated seed structures, (iii) regular space clustering and unbiased runs, (iv) SPIB to suggest OPs, (v) metadynamics runs.

AlphaFold2 and MSA depth modification

The search for a computational model to predict crystal structures or other native-like structures of proteins has been a central part of computational molecular biology. When AF2 was introduced in 2020, it was unprecedented in its speed and accuracy.7 The internal architecture of AlphaFold2 uses three primary components: an alignment of multiple evolutionarily related sequences, an attention-based neural network, or transformer, and a black-hole-initialized, attention-based graph neural network structure module. The model is trained on the entire RCSB Protein Data Bank of experimentally derived structures. While transformative to the field, the model does not quite solve the protein folding problem, as proteins in vivo are not defined by a single structure but by their structural ensemble.

The multiple sequence alignment (MSA) form of the input has been found to be a convenient place to introduce stochasticity. The simplest way to do this, which we employ in AF2-RAVE, is to decrease the depth of the MSA input into both channels of the model, and then run the model repeatedly with randomly chosen subsets of the full MSA. In some sense this process is a way to withhold data from the model to produce ostensibly incorrect outputs. However, AF2 has also been found to perform consistently badly on proteins that deviate from the norm of their evolutionarily related sequences, which is likely due to the MSA form of input featurization leading to significant bias. From this perspective, the above-described protocol should lead to some random sampling of the correct structure in these special cases. Moreover, in cases where the protein family has multiple metastable structures, with the natively stable structure differing between members of the family, this protocol could conceivably provide hints or "breadcrumbs" for the entire conformational space of interest.
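As an illustration of this MSA-subsampling idea (a generic sketch, not the exact ColabFold machinery used in this work), one can repeatedly draw shallow random subsets of a precomputed .a3m alignment and feed each subset to AF2. The file names, the depth of 16, and the 128 seeds below mirror values mentioned later in Results but are otherwise placeholders.

```python
import random

def read_a3m(path):
    """Parse an a3m/FASTA alignment into (header, sequence) pairs."""
    entries, header, seq = [], None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    entries.append((header, "".join(seq)))
                header, seq = line, []
            else:
                seq.append(line)
    if header is not None:
        entries.append((header, "".join(seq)))
    return entries

def subsample_msa(entries, depth, seed):
    """Keep the query (first entry) plus a random subset of the remaining sequences."""
    rng = random.Random(seed)
    query, rest = entries[0], entries[1:]
    picked = rng.sample(rest, k=min(depth - 1, len(rest)))
    return [query] + picked

# Hypothetical usage: generate several shallow MSAs to feed to AF2 repeatedly.
full_msa = read_a3m("ddr1.a3m")                      # placeholder file name
for seed in range(128):
    shallow = subsample_msa(full_msa, depth=16, seed=seed)
    with open(f"ddr1_msa16_seed{seed}.a3m", "w") as out:
        out.write("\n".join(h + "\n" + s for h, s in shallow) + "\n")
```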

A recent study44 has proposed that AF2 has indeed learned an energy surface for protein folding in its transformer weights. The authors propose, and provide significant evidence for, the idea that the MSA pair representation matrix simply initializes close to the correct minimum while the transformer architecture performs an optimization step on this energy surface. This suggests that our MSA reduction protocol simply initializes the transformer closer to a different local minimum, possibly a non-native minimum of the learned energy surface. This might be a biologically native-like structure from the previously described special cases, or a biologically relevant metastable structure.

In spite of this, the modified version of AF2 still produces some highly unphysical structures, as we illustrated previously.9 Worse still, the structures obtained, including those that are metastable, do not follow any physically reasonable probability distribution. Nor is there an obvious way to directly obtain a distribution or free energy surface from them that could account for both enthalpy and entropy. Some direct notion of physics and thermodynamics is still required for this information to be usable.

Metadynamics

Hamiltonian-based enhanced sampling algorithms rely on editing the underlying energetics of the dynamical system. For instance, umbrella sampling adds harmonic restraints at successive points in conformation space in replicate simulations and recombines them to reproduce the original energy surface. One of the most powerful methods in this family for exploring complex energy landscapes with high barriers is metadynamics.20

For one or more predetermined order parameters, metadynamics aims to learn a bias potential that is the negative of the true free energy. To achieve this, a history-dependent potential is added to the Hamiltonian of the dynamics. This potential is updated periodically by adding Gaussian functions, centered at the current values of the order parameters, to the bias function.

This potential acts as a driving force, pushing the system away from visited regions, forcing it to explore new areas of the configuration space. The specific version we employ is well-tempered metadynamics, wherein the height of the Gaussians is modulated with a time dependence to mimic a high temperature simulation and to prevent the bias potential from growing indefinitely. In this case the bias potential can be shown to converge to the true underlying free energy modulo a multiplicative constant.45
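A minimal sketch of the well-tempered update rule on a one-dimensional grid is given below; in practice the bias is handled by the simulation engine or a plugin such as PLUMED, and the hill height, width, and bias factor used here are illustrative values only, not the production settings of this work.

```python
import numpy as np

class WellTemperedBias1D:
    """Minimal 1D well-tempered metadynamics bias on a grid (illustrative only)."""

    def __init__(self, grid, height, sigma, bias_factor, kT):
        self.grid = grid                        # grid of OP values s
        self.V = np.zeros_like(grid)            # accumulated bias V(s)
        self.h0 = height                        # initial Gaussian height
        self.sigma = sigma                      # Gaussian width
        self.dT = (bias_factor - 1.0) * kT      # k_B * DeltaT controlling the tempering

    def V_at(self, s):
        return np.interp(s, self.grid, self.V)

    def deposit(self, s):
        """Add a Gaussian at the current OP value s, scaled down by the bias already there."""
        h = self.h0 * np.exp(-self.V_at(s) / self.dT)                  # well-tempered scaling
        self.V += h * np.exp(-0.5 * ((self.grid - s) / self.sigma) ** 2)

# At convergence, F(s) is recovered up to an additive constant as
# F(s) ~ -(gamma / (gamma - 1)) * V(s), with gamma the bias factor.
bias = WellTemperedBias1D(grid=np.linspace(-3, 3, 601),
                          height=1.2, sigma=0.2, bias_factor=10.0, kT=2.494)
for s in np.random.default_rng(1).normal(size=500):   # stand-in for OP values visited during the run
    bias.deposit(s)
```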

While clever and asymptotically accurate estimates for rewinding a time-dependent bias exist,20 they are subject to normalization errors and some strong assumptions. Additionally, in spite of the well-tempering, the method's unbounded nature can often result in sampling regions of configuration space that we are not interested in or that are rare enough to be irrelevant. In this study we use metadynamics to learn a potential that we then freeze and use as a static Hamiltonian bias. By controlling the region explored by the initial metadynamics through a stop condition, we avoid learning an unbounded bias potential and compute our final statistical estimates with fewer errors. Next, we initialize independent walkers using the same static bias from both DFG-in and DFG-out structures, and run them until we see a transition followed by a stable trajectory in the basin in which they were not initialized. In general, since metadynamics relies on dynamically pushing the simulation to undiscovered regions, learning an effective static bias from it has historically been considered difficult, and hence is not a common practice.45,46 Every single independently launched trajectory visits the DFG-in basin when launched from DFG-out, and the DFG-out basin when launched from DFG-in. We find this to be a computationally more efficient protocol for obtaining overlap in the explored configuration space. The same static bias can also occasionally lead to back-and-forth transitions within a single trajectory, but the approach used in this work is computationally much more attractive.

It is important to note, however, that metadynamics suffers from the limitation common to enhanced sampling methods of requiring an a priori notion of the approximate reaction coordinate, or underlying low-dimensional manifold, to use as order parameters for sampling. A now common approach, which we also employ in this work, is to use a machine learning method to learn OPs for sampling; specifically, here we use a time-lagged state predictive autoencoder, described below. Previous work has shown its suitability for learning metadynamics OPs for a range of systems such as conformational changes and membrane permeation,47 as well as for use in this AF2-seeded approach.9 This represents a crucial step towards making enhanced sampling methods usable on novel systems with limited a priori biophysical understanding.

State predictive information bottleneck

To solve the problem of the unknown underlying manifold required for biasing, one that captures the relevant slow degrees of freedom, we use a method based originally on the reweighted autoencoder for variational Bayes algorithm (RAVE).10 We use its most recent form, the state predictive information bottleneck (SPIB).43

In general, a variational autoencoder (VAE) is a neural network framework that attempts to learn a low-dimensional probabilistic function of the input (the encoder) and a function that is then able to reproduce the input (the decoder). This underlying low-dimensional function is the information bottleneck, i.e. it minimizes the information retained from the input while maximizing its ability to predict the output. Here, since we aim to study a dynamical trajectory, we modify the basic VAE to incorporate a past-future information bottleneck. Given a frame of the trajectory as input, instead of reproducing this input, we reproduce a trajectory frame at a later time stamp, i.e. with some time lag. Additionally, we note that since proteins have several degrees of freedom that move at different time scales, for a specific time lag we are not attempting to reproduce the trajectory at every coordinate. Since we do not know which degrees of freedom correspond to which time scale, we instead choose for the output a notion of states, represented by one-hot encoded vectors. These states are iteratively learnt between epochs of training the neural network. This protocol has been shown to be effective in several complex systems.47–49
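The following PyTorch sketch illustrates the past-future information bottleneck idea: a stochastic encoder compresses the CVs at time t into a low-dimensional latent space, and a decoder predicts the one-hot state label at time t+Δt, trained with a cross-entropy plus KL objective. This is a simplified stand-in for the published SPIB implementation: it uses a standard Gaussian prior, omits the iterative state-label refinement between epochs, and all dimensions and hyperparameters are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeLaggedIB(nn.Module):
    """Simplified past-future information bottleneck (SPIB-like), for illustration only."""

    def __init__(self, n_cvs, n_states, latent_dim=2, hidden=64, beta=1e-3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_cvs, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_states))
        self.beta = beta

    def forward(self, x_t):
        h = self.encoder(x_t)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decoder(z), mu, logvar

    def loss(self, x_t, state_t_plus_lag):
        logits, mu, logvar = self(x_t)
        ce = F.cross_entropy(logits, state_t_plus_lag)             # predict the future state label
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return ce + self.beta * kl

# Hypothetical usage: x holds the CV values of each frame, y the state label of the
# frame one time lag later (here random stand-ins).
model = TimeLaggedIB(n_cvs=40, n_states=3)
x = torch.randn(1024, 40)
y = torch.randint(0, 3, (1024,))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):
    opt.zero_grad()
    loss = model.loss(x, y)
    loss.backward()
    opt.step()
```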

Since we aim to make our protocol generalizable, we start with a large set of input CVs. However, biasing using metadynamics on a function of a large set of CVs is hard to control and not always statistically stable. To alleviate this to some extent, we adopt a basis CV refinement step as a stand-in for regularization in our OP learning protocol, wherein we run SPIB three times, each time discarding features with weights lower than 0.25 of the maximum weight.
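This refinement step can be mimicked by inspecting the first encoder layer of the trained model and keeping only those input CVs whose strongest absolute weight is at least 0.25 of the overall maximum, then retraining on the reduced set; the exact definition of the feature weight in the production protocol may differ in detail, so treat this as a sketch built on the hypothetical model above.

```python
import numpy as np

def keep_features(first_layer_weight, threshold=0.25):
    """Indices of input CVs whose largest absolute weight in the first encoder layer
    is at least `threshold` times the overall maximum."""
    w = np.abs(first_layer_weight)        # shape: (hidden, n_cvs)
    per_feature = w.max(axis=0)           # strongest coupling of each CV to the latent space
    return np.where(per_feature >= threshold * per_feature.max())[0]

# Hypothetical usage with the TimeLaggedIB sketch above:
# W = model.encoder[0].weight.detach().numpy()
# kept = keep_features(W)
# ...rebuild the input CV set from `kept`, retrain SPIB, and repeat three times.
```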

Simulation details

The protein is represented by the AMBER03 force field.50 The simulations are performed at 300 K with the BAOAB integrator51 in OpenMM;52 LINCS is used to constrain the lengths of bonds to hydrogen atoms;53 Particle Mesh Ewald is used to calculate electrostatics;54 the step size is 2 fs. The systems are solvated with the TIP3P water model and equilibrated under NVT and NPT for 200 ps and 300 ps, respectively. To prevent melting during biasing, we also restrain the conserved αC-helix of the N-lobe. Large-scale motion of this motif is essential, both to see a transition and for drug unbinding pathways.55,56 However, biasing CVs that include distances from this helix often leads to irreversible disordering, and we find that applying torsional restraints on residues 65 to 81 is sufficient to allow for smooth upward motion. In Figure 2, we show the αC-helix and its position with respect to the DFG loop.
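For orientation, a minimal OpenMM setup consistent with these choices might look like the sketch below; the input file name is a placeholder, the αC-helix torsional restraints and the separate NVT stage are omitted, and OpenMM's own constraint and Langevin-middle (BAOAB-type) implementations stand in for the cited algorithms.

```python
import openmm as mm
from openmm import app, unit

pdb = app.PDBFile("ddr1_solvated.pdb")                       # placeholder input structure
ff = app.ForceField("amber03.xml", "tip3p.xml")              # AMBER03 protein + TIP3P water

system = ff.createSystem(pdb.topology,
                         nonbondedMethod=app.PME,            # Particle Mesh Ewald electrostatics
                         nonbondedCutoff=1.0 * unit.nanometer,
                         constraints=app.HBonds)             # constrain bonds to hydrogen atoms
system.addForce(mm.MonteCarloBarostat(1.0 * unit.bar, 300 * unit.kelvin))  # NPT stage

# BAOAB-type Langevin integrator at 300 K with a 2 fs step
integrator = mm.LangevinMiddleIntegrator(300 * unit.kelvin,
                                         1.0 / unit.picosecond,
                                         0.002 * unit.picoseconds)

sim = app.Simulation(pdb.topology, system, integrator)
sim.context.setPositions(pdb.positions)
sim.minimizeEnergy()
sim.step(150_000)   # 300 ps at 2 fs/step, standing in for the NPT equilibration
```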

Figure 2:


a) The structural anatomy of the DDR1 kinase molecule showing the activation loop (red), Gly-rich P-loop (blue), αC helix (purple) and the characteristic N-lobe β sheets (green). These motifs are relevant to the DFG-in to DFG-out transition. b) The conserved salt bridge between K57 in β3 and E74 in the αC helix that is crucial for ligand dissociation and for basic kinase functioning.55

Results and Discussion

Our overall protocol comprises the methods described in the previous section. We start by using reduced-MSA AlphaFold2 to generate a diverse set of initial conformations, and cluster them using regular space clustering. We choose this method of clustering because the reduced-MSA AF2 outputs tend to be quite sparse in regions of interest, and we want to prioritize the tails of the distribution over highly sampled regions. The cluster centers are then used to seed unbiased MD trajectories, which are our input trajectories for RAVE. RAVE learns a two-dimensional order parameter expressed as an information bottleneck. Then, we run well-tempered metadynamics biasing along this two-dimensional order parameter. Finally, we use the bias learnt by metadynamics for the wild-type protein as a static bias to sample distributions for the wild-type and mutant sequences.

We will discuss the results in two sections. First, we discuss the modified AF2 outputs for all four sequences, and the process of learning a biasing potential, and next, we discuss results from biased dynamics for DDR1 WT and mutants.

When we refer to the kinase DFG-in, -out, and -inter structures, we use the Dunbrack method57 for classification using distance cutoffs. Representative structures for these states are shown in Fig. 3. When referring to dimensions or OPs that are learnt through RAVE, we label them as information bottlenecks (IB) and assign variables σ.

Figure 3:


Representative structures from reduced MSA AF2 for the DFG-in (purple), DFG-inter (blue) and DFG-out (green) shown in two views, superimposed on the same structure.

Learning a bias potential from AF2-RAVE on WT DDR1

Our first step is to generate structures using the reduced-MSA version of AF2. We generate 1280 structures for each kinase: 640 each for MSAs of depth 16 and 32, with 128 random seeds generating 5 structures each. AlphaFold2, even with the reduced-MSA approach, is simply unable to distinguish the conformational diversity expected for these 4 sequences, giving effectively identical results for all. This can be seen from Fig. 4a), where we show the populations of the active, inactive, and known intermediate or transition state ("DFG-up" or "DFG-inter") after filtering out obviously unphysical structures (e.g. with broken bonds). These structures are classified using the Dunbrack classification described below. We note that while we see a significant DFG-out population, even in the wild type, which is known to have higher inactive-state stability, AF2 predicts the more commonly found DFG-in structure. These populations are in disagreement with those implied by previous long MD simulations of the same kinase, in spite of thoroughly searching MSA hyperparameters to force increased structural diversity.42 The transition state is now commonly called "inter",57 as it has been consistently found to be a necessary structure for observing the DFG-in to DFG-out trajectory. Previously, it was referred to as "DFG-up", for the upward-pointing position of the Asp residue sidechain in the traditional structural view (with the N-lobe above the C-lobe), while the previously named "DFG-down" position is now referred to as unassigned, as it is a high-energy, physically unlikely structure. The fact that we see the transition state from reduced-MSA AlphaFold2 is extremely significant and useful, as we have previously attempted to use RAVE simply with crystal structures of the DFG-in and -out states and failed, as simulations tend to push towards and then get stuck in the "DFG-down" configuration. We show further analysis of the AF2 reduced-MSA outputs in the SI.

Figure 4:


Populations of the active, inactive, and known transition ("DFG-up" or "DFG-inter") states for the wild type and mutants D671N, Y755A, and Y759A, a) through reduced-MSA AF2 (MSA lengths 8 and 16 combined) and b) using our AF2-RAVE protocol. We clearly see that AF2 by itself is unable to distinguish between wild type and mutants, and in particular gives us the wrong order of stability. On the other hand, AF2-RAVE is able to find the reversal in stability on point mutation, and gives us more thermodynamically representative populations for these states, in excellent agreement with benchmark calculations performed in Hanson et al.42 using unbiased MD simulations.

Next, we use SPIB on these structures to propose possible OPs for biasing. The active-inactive conformational transition is a highly delocalized one, and requires an understanding of several intramolecular interactions. In particular, the αC-helix forms a conserved salt bridge with the β3 strand that plays a crucial role in the molecule's transition.55,58 Further, globally, the opening and closing of the two lobes (N and C) define the active-inactive transition.59 From a dynamics perspective, the prime motifs involved in this transition include the activation loop (A-loop), P-loop and αC-helix, which mainly belong to the N-lobe (Fig. S1). To this end, we include a number of other CVs, described below, including the distances used for the DFG classification above. These CVs roughly correspond to those used in several previous studies focusing on different parts of the kinase molecular structure.57,60–62 In Table S1, we list all the residues involved in the distances that we consider, indicating our acronym, the conserved residue (if applicable), the residue ID for DDR1 (numbered from 1), and a description of the motif it belongs to. In Table S2, we list all the distances used as initial inputs for SPIB, and indicate them visually in Figure S1.
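As an example of how such distance CVs can be computed from a trajectory, the sketch below uses MDTraj to measure Cβ–Cβ distances; the file names and residue numbers are placeholders standing in for the pairs listed in Table S2.

```python
import mdtraj as md
import numpy as np

traj = md.load("unbiased_seed0.xtc", top="ddr1.pdb")   # placeholder trajectory/topology files

def cb_index(topology, resseq):
    """Index of the CB atom of a residue, selected by its residue number."""
    sel = topology.select(f"name CB and resSeq {resseq}")
    return int(sel[0])

# Hypothetical residue numbers standing in for the pairs of Table S2
pairs = np.array([
    [cb_index(traj.topology, 57), cb_index(traj.topology, 74)],    # e.g. sbridgeK CB - ChelE CB
    [cb_index(traj.topology, 57), cb_index(traj.topology, 181)],   # e.g. sbridgeK CB - DFGAsp CB
])

distances_nm = md.compute_distances(traj, pairs)   # shape (n_frames, n_pairs), in nm
distances_A = 10.0 * distances_nm                  # convert to angstroms
```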

In this nomenclature, the Dunbrack distances, which we use to project our final potentials of mean force (PMFs), are sbridgeK CB - ChelE CB and sbridgeK CB - DFGAsp CB. Referring to these as d1 and d2, the structures are classified as: DFG-in if d1 < 11 and d2 > 14, DFG-out if d1 > 11 and d2 < 14, DFG-inter if d1 < 11 and d2 < 11, and unassigned if d1 > 11 and d2 > 11.
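Assuming, as is standard for the Dunbrack classification, that these distances are in angstroms, the cutoffs translate directly into a small classifier:

```python
def dunbrack_state(d1, d2):
    """Classify a kinase frame from the two Dunbrack distances (assumed in angstroms).

    d1: sbridgeK CB - ChelE CB distance
    d2: sbridgeK CB - DFGAsp CB distance
    """
    if d1 < 11 and d2 > 14:
        return "DFG-in"
    if d1 > 11 and d2 < 14:
        return "DFG-out"
    if d1 < 11 and d2 < 11:
        return "DFG-inter"
    # includes d1 > 11 and d2 > 14, plus any frame outside the listed ranges
    return "unassigned"

print(dunbrack_state(9.5, 15.2))   # -> "DFG-in"
```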

In Fig. S2 we show the distributions of AF2 outputs for all four sequences of DDR1 over the entire set of input CVs and find that they all look quite similar, in spite of the known differences between these mutants. This is interesting, considering we usually expect AF2 to perform more confidently on evolutionarily faithful sequences. However, in this case, we suspect that DDR1 being unusual in its DFG-out stability contributes to its easy access to conformational diversity. Nevertheless, our AF2 outputs still predict higher stability for the DFG-in structures universally across the DDR1 sequences.

To learn a biasing potential, we run multiple metadynamics trajectories in parallel that share bias potentials. This is done only for the wild-type sequence, and the same bias is then used for further calculations for all sequences. The initial structures are chosen as in the original AF2-RAVE paper,9 as follows: first, we run AF2 using ColabFold with a manually set MSA depth of 8 and 16. Next, we run regular space clustering with a minimum distance parameter of 9 on standardized CV values, using the set of CVs described above.
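Regular space clustering itself is only a few lines: scan the standardized CV vectors in order and accept a frame as a new cluster center whenever it lies farther than the minimum distance from every center accepted so far. A generic sketch (the actual implementation and distance metric may differ):

```python
import numpy as np

def regular_space_clustering(X, d_min):
    """Return indices of cluster centers chosen by regular space clustering.

    X     : (n_samples, n_features) array, assumed already standardized
    d_min : minimum Euclidean distance allowed between any two accepted centers
    """
    centers = [0]                       # the first frame always starts a cluster
    for i in range(1, len(X)):
        dists = np.linalg.norm(X[centers] - X[i], axis=1)
        if np.all(dists > d_min):       # far from every existing center -> new center
            centers.append(i)
    return np.array(centers)

# Hypothetical usage on the standardized CV matrix of the AF2-generated structures:
# cvs = (cvs_raw - cvs_raw.mean(axis=0)) / cvs_raw.std(axis=0)
# seed_indices = regular_space_clustering(cvs, d_min=9.0)
```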

We use the following stopping conditions for the initial metadynamics runs: (1) if the walker started in a DFG-in (DFG-out) structure, it must reach the DFG-out (DFG-in) structure, and (2) if the walker did not start in one of the two main metastable states, it must reach one of them. We only stop if the transition is stable for 1 nanosecond during the biased simulation. These simulations are between 5 ns and 20 ns long.
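In code, this stopping rule amounts to checking that the most recent nanosecond of the biased trajectory has settled into a metastable state different from the starting one; a sketch using per-frame Dunbrack labels (the frame spacing is a placeholder):

```python
def should_stop(states, start_state, frames_per_ns):
    """Stop once the walker has reached a different metastable state and stayed
    there for a full nanosecond of biased simulation.

    states        : list of per-frame labels, e.g. from dunbrack_state()
    start_state   : label of the state the walker was initialized in
    frames_per_ns : number of saved frames per nanosecond of simulation
    """
    if len(states) < frames_per_ns:
        return False
    tail = states[-frames_per_ns:]      # the most recent nanosecond
    target = tail[-1]
    stable = all(s == target for s in tail)
    reached_new_basin = target in ("DFG-in", "DFG-out") and target != start_state
    return stable and reached_new_basin

# e.g. should_stop(per_frame_labels, start_state="DFG-out", frames_per_ns=100)
```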

In Fig. S4, we show the bias learnt through this process. In order to demonstrate the transferability and efficiency of our protocol, we learn an IB and a bias potential using only the wild type. This also provides a basis to propose learning universal biases for systems with transferable CVs, potentially allowing for more efficient sampling of homologous families of proteins.

Biased dynamics on DDR1 mutants

In this study, our main result is the accurately predicted relative stabilities of the DFG-in and DFG-out structures for the wild type and mutants of the kinase DDR1. In Fig. S5, we show predicted PMFs for these structures. We compute the DFG-in versus DFG-out relative thermodynamic stability by integrating probabilities over the Dunbrack definitions for kinase structural states and calculating the ΔG between these states: (i) WT: 0.5 kBT, (ii) D671N: −0.2 kBT, (iii) Y755A: −0.42 kBT, (iv) Y759A: −0.13 kBT. Each of these was computed with 5 trajectories, with standard deviations of (i) WT: 0.23 kBT, (ii) D671N: 0.12 kBT, (iii) Y755A: 0.29 kBT, (iv) Y759A: 0.11 kBT. This flipping of the Boltzmann ranking of active and inactive states upon mutation is in concurrence with previous findings,42 which were obtained using long unbiased MD simulations. We can also integrate over our reweighted data to obtain the Dunbrack populations and compare with those from AF2, as shown in Fig. 4.
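These ΔG values reduce to a log-ratio of reweighted state populations. The sketch below assumes the sign convention that positive values mean DFG-out is more stable than DFG-in, consistent with the positive wild-type value and negative mutant values quoted above; the labels and weights are per-frame quantities from the static-bias simulations.

```python
import numpy as np

def delta_g_in_minus_out(labels, weights):
    """ΔG = G(DFG-in) - G(DFG-out), in units of kBT, from reweighted frame populations.

    labels  : per-frame Dunbrack labels ("DFG-in", "DFG-out", ...)
    weights : per-frame reweighting factors exp(+V/kT) from the static bias

    Sign convention (assumed): positive ΔG means DFG-out is more stable.
    """
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)
    p_in = weights[labels == "DFG-in"].sum()
    p_out = weights[labels == "DFG-out"].sum()
    return -np.log(p_in / p_out)

# Averaging over independent walkers gives a mean and standard deviation, e.g.:
# dgs = [delta_g_in_minus_out(l, w) for l, w in zip(all_labels, all_weights)]
# print(np.mean(dgs), np.std(dgs))
```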

In Fig. 5 we show snapshots of one example of our sampled trajectories. We also note some salient details of the transition we study. We find that in the course of the transition, the breakage and formation of the salt bridge is clear, and the large-scale upward motion of the αC-helix is absolutely essential for sampling the dynamics, as noted previously.

Figure 5:


An example trajectory for the DFG-out to DFG-in transition obtained using AF2-RAVE. See Code Availability for details of an example video showing the transition.

Conclusion

In this work, we have extended our previous protocol, AF2-RAVE, to obtain Boltzmann-ranked conformational diversity in protein kinases, specifically from the perspective of the pharmaceutically relevant and evolutionarily conserved DFG loop. We have previously shown that the protocol is able to capture a versatile range of transitions: rotameric metastability, large-scale helical motions, and partial disordering.9 Here, we study the DFG loop conformational change that is of utmost therapeutic importance. Moreover, we choose DDR1 since there exists an unusually thorough sampling with unbiased trajectories by Hanson et al.,42 which provides a more robust comparison than enhanced sampling-based work on large biomolecules is usually afforded. With AF2-RAVE we obtain the same conformational ranking and similar thermodynamic stabilities as found by Hanson et al.42 for the DFG-in versus DFG-out conformations of DDR1. However, the total MD simulation time in our work is around 2–3 orders of magnitude shorter.

It is important to note that our trajectories rely on collective variables that allow us to sample the DFG-inter transition state. Without AF2 structures to seed our initial unbiased trajectories, the collective variable learnt usually leads to a DFG-down structure resulting in an unsuccessful transition trajectory. Another significant factor of our work is the use of a static learnt bias, which is usually considered difficult and avoided. However, we find that the restriction to transitions of importance is necessary for this measurement in terms of both speed and replication.

This entire protocol can be repeated for other kinases in a few different ways. The first is to learn a bias using a wild type and then sample mutants that are known to be pathological or disease-causing due to changes in activity. The second is to learn a bias using a single kinase and sample other closely related kinases using the same potential. Finally, we hope that eventually a generalized set of collective variables and biases can be learnt that could sample across the human kinome. This protocol can also be used to sample novel states that can then serve as relatively high-throughput inputs for cryptic pocket prediction algorithms,63 and we demonstrate some examples in the SI.

Acknowledgments

P.T. and B.V. were supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R35GM142719. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. We are grateful to NSF ACCESS Bridges2 (project CHE180053) and the University of Maryland Zaratan High-Performance Computing cluster for enabling the work performed here. We thank Drs. Eric Beyerle, Xinyu Gu, Zack Smith, and Dedi Wang for critical reading of the manuscript, and Drs. Mrinal Shekhar, Shashank Pant, Zack Smith and Dedi Wang for helpful discussions regarding kinases.

Footnotes

The authors declare the following competing financial interest(s): P.T. is a consultant to Schrodinger, Inc. and is on their Scientific Advisory Board.

Supporting Information Available

Detailed description of methods used and further analysis of systems and sampling can be found in the supplement.

Code Availability

The code to run AF2-RAVE in a seamless manner is available at https://github.com/tiwarylab/alphafold2rave. This can be run on Google Colab using GPUs. Using Colab Pro is advised. Codes, parameters, and bias files used to specifically run the simulations from this protocol can be found at https://github.com/tiwarylab/kinase_Aloop. These will currently only work for the sequences used in this paper, but most files are easily adaptable for use in other kinases with some specific changes, which are marked. An example video is also in the folder, and while full trajectories are too large to upload, they can be made available on request.

Data Availability

All data associated with this work is available through https://github.com/tiwarylab/kinase_Aloop.

References


