Author manuscript; available in PMC: 2014 Aug 13.
Published in final edited form as: J Chem Theory Comput. 2013 Aug 13;9(8):10.1021/ct400198q. doi: 10.1021/ct400198q

Hidden Conformation Events in DNA Base Extrusions: A Generalized Ensemble Path Optimization and Equilibrium Simulation Study

Liaoran Cao 1, Chao Lv 1, Wei Yang 1,2,*
PMCID: PMC3829643  NIHMSID: NIHMS501688  PMID: 24250279

Abstract

DNA base extrusion is a crucial component of many biomolecular processes. Elucidating how bases are selectively extruded from the interiors of double-strand DNAs is pivotal to accurately understanding and efficiently sampling this general type of conformational transition. In this work, the on-the-path random walk (OTPRW) method, which is the first generalized ensemble sampling scheme designed for finite-temperature-string path optimizations, was improved and applied to obtain the minimum free energy path (MFEP) and the free energy profile of a classical B-DNA major-groove base extrusion pathway. Along the MFEP, an intermediate state and the corresponding transition state were located and characterized. The MFEP result suggests that a base-plane-elongation event, rather than the commonly emphasized base-flipping event, is dominant in the transition state formation portion of the pathway, and that the energetic penalty at the transition state is mainly introduced by the stretching of the Watson-Crick base pair. Moreover, to facilitate the essential base-plane-elongation dynamics, the surrounding environment of the flipped base needs to be intimately involved. Further taking advantage of the extended-dynamics nature of the OTPRW Hamiltonian, an equilibrium generalized ensemble simulation was performed along the optimized path; based on the collected samples, several base-flipping (opening) angle collective variables were evaluated. Consistent with the MFEP result, the collective variable analysis reveals that none of these commonly employed flipping (opening) angles alone can adequately represent the base extrusion pathway, especially in the pre-transition-state portion. As further revealed by the collective variable analysis, the base-pairing partner of the extrusion target undergoes a series of in-plane rotations to facilitate the base-plane-elongation dynamics. A base-plane rotation angle is identified as a possible reaction coordinate to represent these in-plane rotations. Notably, these in-plane rotation motions may play a pivotal role in determining the base extrusion selectivity.

I. Introduction

DNA base extrusion is an important component of many biological events; this general conformational transition is a common strategy for protein machines to access specific bases that are usually protected inside helical double-strand DNAs1–12. Due to its obvious importance, base extrusion, during which base flipping is usually assumed to be the major event, has been a target of extensive studies, particularly by molecular dynamics (MD) simulation methods9,13–34. Although many valuable insights have been obtained from previous investigations, key questions remain to be answered, such as how base extrusions are activated, which motions are responsible for the energetic costs of transition state formation, and how base extrusion selectivity is controlled. To address these questions, it is important to describe essential events that can selectively represent the slowest dynamics of a specific base-extrusion pathway. Moreover, base extrusions may occur on timescales that are commonly intractable. Therefore, understanding base extrusion pathways can also be instrumental to feasibly sampling this general type of conformational change in future studies, especially of more complex processes.

To describe DNA extrusions, a variety of collective variables have been proposed16,29,32–34. A straightforward one is the base-separation distance, which can be exemplified by the distance between the pyrimidine N1 and purine N3 atoms32. Although such a base-pair separation distance can change monotonically with the progress of a base extrusion process, it cannot selectively represent a unique path; it distinguishes neither between the target base and partner base extrusion pathways nor between the major and minor groove extrusion routes. To address this selectivity issue, torsion-angle based collective variables were suggested16,29,33,34. Among them, the base-opening angle by Lavery et al.34 and the center of mass pseudo-dihedral angle (CPD), which was originally proposed by the Mackerell group16 and further refined by Simmerling and co-workers29, are commonly applied. These reaction coordinates were designed largely based on the hypothesis that base flipping (opening) is the major event in base extrusion processes. Notably, when a base-flipping (opening) angle collective variable is employed as the order parameter for free energy calculations, non-trivial simulation lengths are often required, and the calculated free energy profiles can be very sensitive to the specific base-flipping (opening) order parameter definitions and sampling lengths24,29. These observations suggest that essential “hidden” events may exist in the space orthogonal to the commonly applied collective variables, and that these “hidden” events may play a crucial role in base extrusion processes. To uncover these key “hidden” events, in this work, we seek to elucidate a B-DNA major-groove base extrusion pathway and understand the associated essential motions.

Various approaches have been applied to map the conformational/chemical free energy pathways of complex processes20,27,35–39. To find the optimal free energy path that represents a complex molecular process, two general strategies are often employed. One general strategy is to directly construct high-dimensional free energy surfaces along pre-chosen collective variables40; this can be achieved via the umbrella sampling method41, the blue-moon-ensemble thermodynamic integration method42, the non-equilibrium steered molecular dynamics (SMD) method43, or generalized ensemble (GE) sampling based algorithms such as the adaptive umbrella sampling method44, the metadynamics method45, the adaptive biasing force (ABF) method46,47, the adaptive biased molecular dynamics (ABMD) method48, or the orthogonal space sampling methods49–51. A recent work39 based on the combination of the SMD and ABMD methods is an excellent example that demonstrates the usefulness of this general strategy in understanding DNA conformational changes. Direct construction of high-dimensional free energy surfaces allows multiple free energy pathways to be simultaneously mapped; however, as the number of collective variable dimensions increases, the diffusion sampling overhead, which is spent on covering event-irrelevant phase regions, can grow to a level that renders the approach impractical. If one unique path is targeted, as an alternative strategy, chain-of-states path optimization approaches52–57, among which the finite-temperature-string (FTS) method56,58–63 is of particular interest, allow sampling to be concentrated on phase regions near the target pathway. By nature, FTS path optimizations can be computationally demanding. To improve FTS sampling efficiency, in our previous work61, the first generalized ensemble sampling based FTS algorithm was developed; this method can naturally ensure orthogonal space structural continuity and permit more flexible usage of computing resources, for instance with a multi-walker implementation64. It should be noted that, unlike regular GE free energy simulation methods, in which simulated systems travel along fixed collective variable directions, GE-FTS performs random sampling along continuously updated paths. To emphasize this inherent difference, we named the GE based FTS path optimization scheme the on-the-path random walk (OTPRW) method61.

In this study, a combined OTPRW path optimization and equilibrium generalized ensemble simulation strategy was used to understand a classical B-DNA major-groove base extrusion process. To more reliably sample the target base extrusion pathway, the OTPRW method was improved, specifically with the original metadynamics kernel replaced by an adaptive biasing force (ABF) based kernel46,47,64. Via this integration-based OTPRW (iOTPRW) approach, the MFEP and the free energy profile of the target base extrusion pathway were simultaneously obtained. Along the MFEP, an intermediate state and the corresponding transition state were located and characterized. The MFEP result suggests that a base-plane-elongation event occurs before the commonly emphasized base-flipping event, which largely takes place after the formation of the transition state. The energetic penalty at the transition state is mainly introduced by the stretching motion of the Watson-Crick (WC) base pair; to facilitate this stretching dynamics, the surrounding environment of the target base needs to be intimately involved. Further taking advantage of the extended-dynamics nature of the OTPRW Hamiltonian, an equilibrium GE simulation was performed along the optimized path; based on the collected samples, several collective variables were evaluated. Notably, none of the previously proposed flipping (opening)-angle collective variables alone can adequately represent the transition state formation portion of the pathway; indeed, at the pre-transition-state (pre-TS) stage, these collective variables play no obvious role. The collective variable analysis also allows us to identify essential events that couple with the base-plane-elongation motion: the base-pairing partner of the extrusion target undergoes a series of in-plane rotation transitions, and an in-plane rotation angle (IPRA) can be used as a general collective variable to represent these motions. Notably, these in-plane rotation motions are likely to play a pivotal role in governing the base extrusion selectivity.

II. Theoretical Methods

II.A. A brief introduction to the OTPRW sampling method

As mentioned above, the OTPRW method is the first GE sampling strategy designed for FTS path optimizations. In FTS calculations, a set of collective variables θ(X) = (θ1(X),…,θm(X)) is pre-defined to represent the path space that can sufficiently describe a target process. Then, a string of θ-space points can be employed to depict a pathway between two end points, $Z_A=(z_1^A,\ldots,z_m^A)$ and $Z_B=(z_1^B,\ldots,z_m^B)$. Alternatively, a set of λ-dependent functions Z(λ) = (z1(λ),…,zm(λ)), in which Z(0) = ZA and Z(1) = ZB, can be used as a continuous representation of this θ-space pathway; here, the progressing parameter λ spans from 0 to 1 to represent the fractional on-the-path distance from the starting point ZA. If every single point Z(λ′) along a string satisfies the following condition,

$$\left[M[Z(\lambda')]\,\nabla G_\theta[Z(\lambda')]\right]^{\perp}=0, \tag{1}$$

this pathway can be called a minimum free energy path (MFEP)56,58. In Equation 1, M represents the diffusion tensor matrix; $\nabla G_\theta=(\partial G/\partial\theta_1,\ldots,\partial G/\partial\theta_m)$ stands for the free energy gradient vector; and ⊥ denotes the projection perpendicular to the curve. Based on Equation 1, an iterative procedure of MD simulations, which are used to collect samples for ∇Gθ calculations, and subsequent path optimization operations is generally referred to as a collective variable space FTS algorithm58–63. In FTS path optimization practice, sampling is commonly performed on a series of independent images that are equally spaced along a to-be-updated string. Two practical issues with this original sampling strategy need to be noted: (1) on-the-path structural continuity of the environmental portion (in the orthogonal space perpendicular to the path) cannot be ensured via non-communicating sampling on independent images; (2) it is challenging for a path that represents an incorrect mechanism to be switched into the correct reaction channel. In the OTPRW sampling scheme, target molecular systems, instead of being independently constrained on non-communicating images, are randomly propagated along to-be-updated paths to collect samples for path optimizations61. Thereby, path sampling efficiency and robustness can be improved because orthogonal space structural continuity is naturally ensured, and certain path-space sampling that is forbidden in the original FTS sampling scheme becomes possible (Scheme 1).
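The MFEP condition of Equation 1 can be checked numerically on a discretized string. The following is a minimal sketch, assuming the string is stored as a list of θ-space points and that the product of M and the free energy gradient has already been evaluated at each point; the function names and the finite-difference tangent are illustrative choices, not part of the original implementation.

```python
import math

def tangent(path, j):
    """Unit tangent at image j from a central (or one-sided) finite difference."""
    lo, hi = max(j - 1, 0), min(j + 1, len(path) - 1)
    diff = [b - a for a, b in zip(path[lo], path[hi])]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def perpendicular_part(vec, unit_tangent):
    """Remove from vec its component along the unit tangent of the curve."""
    along = sum(v * t for v, t in zip(vec, unit_tangent))
    return [v - along * t for v, t in zip(vec, unit_tangent)]

def mfep_residual(path, m_grad):
    """Largest norm of the perpendicular part of M*grad(G) over interior images.

    m_grad[j] is assumed to hold M*grad(G) at image j (hypothetical input).
    A value close to zero indicates that Equation 1 is satisfied.
    """
    worst = 0.0
    for j in range(1, len(path) - 1):
        perp = perpendicular_part(m_grad[j], tangent(path, j))
        worst = max(worst, math.sqrt(sum(p * p for p in perp)))
    return worst
```

In practice the residual would be monitored over successive path updates; a vector field that is everywhere parallel to the string gives a residual of zero.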

Scheme 1.

Illustration of the path-space tunneling mechanism that the OTPRW sampling method enables. Two possible pathways are depicted by the dashed (the starting path) and solid (an alternative path) lines, respectively. In the OTPRW sampling scheme (the red arrow), the system can evolve from the starting path to an alternative path that shares common image states; such path-space tunneling allows large energy barriers (the central hill) that are insurmountable in the independent-image sampling scheme (the blue arrow) to be bypassed.

In the OTPRW sampling design, the following extended-dynamics Hamiltonian is applied,

$$H_\lambda=H_o+\frac{p_\lambda^2}{2m_\lambda}+\sum_{i=1}^{m}\frac{1}{2}K_i\left(\theta_i(X)-z_i(\lambda)\right)^2. \tag{2}$$

In Equation 2, Ho represents the unmodified Hamiltonian; λ is treated as a one-dimensional dynamic particle with a mass of mλ, and its momentum is denoted as pλ; via the energy terms $\sum_{i=1}^{m}\frac{1}{2}K_i(\theta_i(X)-z_i(\lambda))^2$, the system is restrained to the instantaneous path described by Z(λ) = (z1(λ),…,zm(λ)). In our implementation, the λ particle is propagated based on the Langevin equation, in which the reservoir temperature is set to be the same as the system temperature. To restrain λ within the range between 0 and 1, a boundary potential

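$$U_{bound}(\lambda)=\begin{cases}\frac{1}{2}K_{bound}\lambda^2+f_m(0), & \lambda<0\\ 0, & 0\le\lambda\le 1\\ \frac{1}{2}K_{bound}(\lambda-1)^2+f_m(1), & \lambda>1\end{cases}, \tag{3}$$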

is employed. Considering that λ may travel out of the boundaries, when λ moves below zero, Z(λ) is set equal to Z(0); and when λ moves above one, Z(λ) is set equal to Z(1).
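The restraint term of Equation 2 and the boundary treatment of Equation 3 can be written down directly. The sketch below is illustrative only: the function names are hypothetical, plain Python lists stand in for the collective variable vectors, and the end-point values fm(0) and fm(1) are passed in as numbers.

```python
def restraint_energy(theta, z, k):
    """Sum over i of 0.5*K_i*(theta_i - z_i)^2, the path restraint of Equation 2."""
    return sum(0.5 * ki * (ti - zi) ** 2 for ki, ti, zi in zip(k, theta, z))

def clamp(lam):
    """Lambda value used to evaluate Z(lambda): frozen at 0 or 1 outside [0, 1]."""
    return min(1.0, max(0.0, lam))

def boundary_potential(lam, k_bound, fm0, fm1):
    """Piecewise boundary potential of Equation 3."""
    if lam < 0.0:
        return 0.5 * k_bound * lam ** 2 + fm0
    if lam > 1.0:
        return 0.5 * k_bound * (lam - 1.0) ** 2 + fm1
    return 0.0
```

For example, `restraint_energy([1.0, 2.0], [0.0, 0.0], [100.0, 100.0])` evaluates two restraints with deviations of 1 and 2 length units, and `clamp(-0.02)` returns 0.0 so that Z(λ) stays at Z(0) while the boundary potential pushes λ back into range.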

To realize random walks in the λ space, the on-the-path energy surface described by Equation 2 needs to be flattened with the addition of a biasing potential fm (λ), the target of which is the negative of the λ-dependent free energy profile −G(λ). Because G(λ) is not known a priori and is altered upon each path optimization operation, in an OTPRW calculation, not only the path Z(λ) but also the biasing potential fm (λ) ought to be recursively updated. In the original OTPRW implementation61, the metadynamics method45 was employed as the fm (λ) recursion kernel. As is generally known, the performance of metadynamics recursions can be sensitive to both orthogonal-space energy surface ruggedness and pre-chosen metadynamics parameters. Consequently, the accuracy of free energy gradient estimations, which is essential for both path optimizations and on-the-path biasing potential updates, can be poor, particularly when instantaneous paths are not yet close to the final MFEP. In the current iOTPRW design, an ABF-like recursion strategy is employed to replace the original metadynamics-based recursion strategy so that samples can be more robustly collected at close-equilibrium conditions.

II.B. The integration-based OTPRW sampling method

The iOTPRW algorithm has two key inter-related components, respectively for adaptive updates of fm (λ) and Z(λ). Correspondingly, two time intervals need to be set: ΔTf for fm (λ) recursions and ΔTZ for Z(λ) optimizations. Usually, ΔTZ should be much longer than ΔTf; for instance, in this study, ΔTf was set as 0.1 pico-second (ps) and ΔTZ was set as 200 ps.

The fm(λ) recursion procedure

In iOTPRW, fm (λ) is recursively evaluated based on the thermodynamic integration (TI) equation42,65:

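$$f_m(\lambda)=-G(\lambda)=-\int_0^{\lambda}\left\langle\frac{\partial H_\lambda}{\partial\lambda}\right\rangle_{\lambda'}d\lambda'=-\int_0^{\lambda}\sum_{i=1}^{m}\left\{\left.\frac{dz_i(\lambda)}{d\lambda}\right|_{\lambda'}\left\langle K_i\left(z_i(\lambda')-\theta_i(X)\right)\right\rangle_{\lambda'}\right\}d\lambda'. \tag{4}$$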

As mentioned earlier, in OTPRW, the on-the-path Hamiltonian (Equation 2) is altered upon each Z(λ) update. Therefore, 〈Ki (zi (λ) − θi (X))〉 calculations should be performed only based on samples collected within the same ΔTZ time interval. Following the ABF approach46,47,64, we can calculate 〈Ki (zi (λ) − θi (X))〉λ using the following equation,

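$$\left\langle K_i\left(z_i(\lambda)-\theta_i(X)\right)\right\rangle_{\lambda}=\frac{\sum_{t}K_i\left(z_i(t)-\theta_i(X)\right)\delta(\lambda(t)-\lambda)}{\sum_{t}\delta(\lambda(t)-\lambda)},\quad i=1,\ldots,m, \tag{5}$$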

where t denotes all the sample-collecting time-steps that are within the current ΔTZ time interval and before the current fm(λ) update. To avoid possible issues caused by low-precision estimates, 〈Ki (zi (λ) − θi (X))〉λ at a state λ′ is re-evaluated only when the number of samples collected for the λ′ image exceeds a pre-set cutoff value, Ncutoff; otherwise, the previous 〈Ki (zi (λ) − θi (X))〉λ value that was utilized for the last fm (λ) update will be re-used in the current fm (λ) update.

Although λ is a continuous variable, for the convenience of sample collection, the λ space is linearly partitioned into a number of discrete bins; i.e., if the whole λ space is partitioned into L bins, the jth bin has (2j−1)/(2L) as its bin center and (j−1)/L and j/L as its sample collection boundaries. Accordingly, we can assign samples collected in the jth bin to the bin-center image λ = (2j−1)/(2L) for the $\langle K_i(z_i(\lambda)-\theta_i(X))\rangle_{(2j-1)/(2L)}$ calculations (Equation 5). Because all the 〈Ki (zi (λ) − θi (X))〉 values are evaluated at the bin centers, using the TI formula (Equation 4), we can estimate the fm (λ) values of the bin boundary states (0, 1/L, 2/L, …, 1), based on which we can generate fm (λ) via the B-Spline fitting method.
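The binning and integration steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: each sample is taken to be a pair (λ, dH/dλ) in which the sum over collective variables in Equation 4 has already been evaluated, bins are indexed from 0 rather than 1, the integral is approximated by the midpoint rule over bin centers, and the final B-Spline fit is omitted. All function names are hypothetical.

```python
def bin_index(lam, n_bins):
    """0-based bin j collects samples with j/L <= lambda < (j+1)/L."""
    return min(n_bins - 1, max(0, int(lam * n_bins)))

def update_mean_forces(samples, n_bins, n_cutoff, previous):
    """Per-bin average of dH/dlambda; under-sampled bins keep the previous value."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for lam, dh_dlam in samples:
        j = bin_index(lam, n_bins)
        sums[j] += dh_dlam
        counts[j] += 1
    return [sums[j] / counts[j] if counts[j] > n_cutoff else previous[j]
            for j in range(n_bins)]

def bias_at_boundaries(mean_forces):
    """f_m at lambda = 0, 1/L, ..., 1 from midpoint-rule thermodynamic integration."""
    width = 1.0 / len(mean_forces)
    f = [0.0]
    for g in mean_forces:
        f.append(f[-1] - g * width)  # f_m = -G, so each bin subtracts <dH/dlambda>*width
    return f
```

Because the mean forces live at bin centers, the midpoint rule lands exactly on the bin boundaries, which is why the text estimates fm(λ) at 0, 1/L, 2/L, …, 1 before fitting.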

The Z(λ) optimization procedure

In each path optimization step, a major task is to calculate the free energy gradient vectors ∇Gθ for the bin-center states. In OTPRW, we use 〈Ki (zi (λ) − θi (X))〉, which is estimated during the latest fm (λ) recursion step, to approximate ∂G/∂θi. Note that in a path optimization cycle, if the number of samples collected for an image (for instance, the jth bin) is lower than Ncutoff, we set $\nabla G_\theta[Z(\tfrac{2j-1}{2L})]$ to zero; thereby, the images that have low numbers of samples will not be updated in the upcoming optimization operation. To evolve Z(λ), we employ the steepest descent minimization method; specifically, the point at an image center (for instance, the jth image center), $Z(\tfrac{2j-1}{2L})$, is updated to $Z(\tfrac{2j-1}{2L})-\{M[Z(\tfrac{2j-1}{2L})]\,\nabla G_\theta[Z(\tfrac{2j-1}{2L})]\}\,\Delta z$, in which Δz is the path minimization step size. After the bin-center points are updated, the λ values of these newly evolved points need to be re-calculated to reflect the fractions of their on-the-path distances from the point ZA, the λ value of which is always set to zero. Thereafter, we can generate the λ-dependent functions Z(λ) = (z1(λ),…,zm(λ)), also via the B-Spline fitting method.
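The steepest-descent update of the bin-center points can be illustrated as follows. This is a minimal sketch with hypothetical names: it assumes a single constant tensor M shared by all images (or the identity when none is given), whereas the text allows M to depend on Z, and it leaves the subsequent λ re-assignment and B-Spline fit to separate steps.

```python
def steepest_descent_step(points, gradients, counts, n_cutoff, step, m_tensor=None):
    """Move each bin-center point by -M*grad(G)*step; skip under-sampled bins.

    points, gradients: one theta-space vector per bin center.
    counts: number of samples collected in each bin during this cycle.
    m_tensor: optional constant matrix M (assumption); identity if None.
    """
    new_points = []
    for z, g, n in zip(points, gradients, counts):
        if n < n_cutoff:
            g = [0.0] * len(z)  # gradient set to zero, so the point is not moved
        if m_tensor is not None:
            g = [sum(row[k] * g[k] for k in range(len(g))) for row in m_tensor]
        new_points.append([zi - gi * step for zi, gi in zip(z, g)])
    return new_points
```

With a force constant in kcal/mol/Å² and distances in Å, the gradient has units of kcal/mol/Å, so a step size expressed in Å²·mol/kcal yields a displacement in Å.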

II.C. On-the-path generalized ensemble simulation and data analysis

The above procedure can enable efficient convergence of Z(λ) and fm (λ). When an iOTPRW calculation reaches the convergence phase, we can fix both Z(λ) and fm (λ) and carry out equilibrium GE simulations to collect samples along the optimized pathway. To obtain a higher-precision free energy profile along λ, an on-the-path ABF simulation can be performed; in this case, only the optimized path Z(λ) needs to be fixed and fm (λ) can be adaptively re-evaluated.

Based on the data obtained from an on-the-path equilibrium GE simulation, the free energy landscape along any trial collective variable set s(X) can be constructed using the following equation:

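$$G_o(s)=-kT_o\ln\sum_{t}\exp\left[\frac{\sum_{i=1}^{m}\frac{1}{2}K_i\left(\theta_i(X)-z_i(\lambda)\right)^2+f_m(\lambda(t))}{kT_o}\right]\delta\left(s(X(t))-s\right)+\text{const}, \tag{6}$$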

where k denotes the Boltzmann constant; t represents scheduled sample-collection time-steps. This extended-ensemble-to-canonical-ensemble re-weighting strategy has been employed in our earlier orthogonal space random walk based simulation studies 50,66; it is a generalized form of the umbrella sampling formula41, which was originally derived in the canonical-ensemble-to-canonical-ensemble reweighting framework.
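A histogram-based version of the re-weighting in Equation 6 may help make the formula concrete. The sketch below is an illustration under stated assumptions: the collective variable s is one-dimensional, the delta function is replaced by uniform bins, each sample carries its own restraint energy and fm(λ) value, and at least one sample falls inside the histogram range. The function name and argument layout are hypothetical.

```python
import math

def reweighted_profile(samples, s_min, s_max, n_bins, kT):
    """Free energy along s from biased samples, shifted so the minimum is zero.

    samples: iterable of (s_value, restraint_energy, f_m_value) per saved frame.
    kT: Boltzmann constant times temperature, in the same energy units.
    """
    weights = [0.0] * n_bins
    width = (s_max - s_min) / n_bins
    for s, restraint, fm in samples:
        j = math.floor((s - s_min) / width)
        if 0 <= j < n_bins:
            # undo the bias: frames sampled under a larger bias energy count more
            weights[j] += math.exp((restraint + fm) / kT)
    profile = [(-kT * math.log(w) if w > 0.0 else None) for w in weights]
    lowest = min(p for p in profile if p is not None)
    return [None if p is None else p - lowest for p in profile]
```

At 300 K, kT is about 0.596 kcal/mol. Bins without samples are reported as `None` rather than as an infinite free energy.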

III. Computational Details

The iOTPRW method was implemented in a customized version of the CHARMM program67,68, with which all the simulations in this study were performed. The sequence of the model ds-DNA is shown in Table 1. In this well-studied B-DNA dodecamer (Figure 1), the base of Cytosine 6 (C6) is the extrusion target, and Guanine 19 (G19) is the Watson-Crick (WC) partner of C6. In the simulated system, the ds-DNA is embedded in a truncated-octahedral box with 4256 water molecules, which are represented by the TIP3P model69, and 22 sodium ions. The ds-DNA and the counter-ions were treated with the CHARMM27 force field70,71.

Table 1.

The sequence of the target ds-DNA. The C6 base is the to-be-extruded target, and G19 is the base-pair partner of C6.

5′ G T C A G C G C A T G G 3′
3′ C A G T C G C G T A C C 5′

Figure 1.

The chemical environment of the target base (C6) and the center of mass pseudo-dihedral angle (CPD) definition. The C6 base is colored in blue and the G19 base is colored in red. The major groove at the C6 position is located on the backside. Standard atom numbering of the C and G bases is shown on the G7 and C18 bases. In the CPD definition, the four atom groups are separately circled and the central pseudo-bond is drawn as a green line.

III.A. The on-the-path random walk path optimization

As discussed in Section II, in a collective variable FTS calculation, a collective variable function set θ(X) = (θ1(X),…,θm(X)) needs to be pre-chosen to represent the path space. In this study, ten distance collective variables were employed. As listed in Table 2, the first two distance collective variables (θ1 and θ2) are used to describe the C6 and G19 base separation motion; the next four distance collective variables (θ3–θ6) are employed to describe relative motions between the G19 base and its neighboring base planes; and the last four distance collective variables (θ7–θ10) are used to describe internal motions of the base planes adjacent to the C6-G19 base plane. It is known that the two major order parameters, θ1 and θ2, cannot distinguish specific base extrusion pathways. In this setup, we hypothesized that the base extrusion local environment, which is represented by the remaining eight collective variables (θ3–θ10), allows the base extrusion selectivity to be enforced; this hypothesis is confirmed by the results of the path optimization calculation and the equilibrium GE simulation.

Table 2.

The definition of the collective variable space θ = (θ1,…,θ10). In an ApBq symbol (for instance, N3C6, the N3 atom of C6), “A” represents the atom type and “B” represents the base type; “p” stands for the atom number (the numbering scheme is illustrated in Figure 1) and “q” stands for the nucleotide number.

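θ = (θ1,…,θ10)

Collective variable   Atom pair        Motion described
θ1                    N3C6 – N1G19     The base-pair elongation
θ2                    O2C6 – N2G19     The base-pair elongation
θ3                    N1G19 – N3C18    The relative motions between the G19 and the neighboring base planes
θ4                    C6G19 – C4C18    The relative motions between the G19 and the neighboring base planes
θ5                    N1G19 – N3C20    The relative motions between the G19 and the neighboring base planes
θ6                    C6G19 – C4C20    The relative motions between the G19 and the neighboring base planes
θ7                    N1G5 – N3C20     The neighboring base internal motions
θ8                    N2G5 – O2C20     The neighboring base internal motions
θ9                    N1G7 – N3C18     The neighboring base internal motions
θ10                   N2G7 – O2C18     The neighboring base internal motions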

To generate a starting pathway, an orthogonal space tempering (OST) simulation was performed. The OST simulation was carried out with the commonly employed CPD as the order parameter; the CPD definition is shown in Figure 1. To restrict the motion to a major-groove base extrusion pathway, the CPD value was confined between −100° and 5°. The details of the OST method can be found in the original method paper51; in this simulation, the orthogonal space sampling temperature was set as 1500 K. Note that the purpose of this OST simulation was not to calculate a converged free energy profile along CPD but to generate a reasonable trajectory for the construction of an initial pathway. Therefore, only a short OST trajectory that connects the two CPD end states was generated. From this OST trajectory, a total of 101 θ-space points were built to represent the initial string (Figure 2). The structures corresponding to the two end CPD states were used to define ZA and ZB. To ensure the initial path smoothness, each of the 99 intermediate points, which are shown in Figure 2, was built by averaging the θ values of 5 picosecond (ps) neighboring samples. Based on these θ-space points, we obtained the λ-dependent string functions Z(λ) = (z1(λ),…,z10(λ)) via the B-Spline fitting method; here, the λ value of each point was calculated to reflect the fraction of its on-the-path distance from ZA.
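The two bookkeeping steps in this construction, averaging neighboring samples into a smooth point and assigning each point a λ value from its on-the-path distance, can be sketched as below. The sketch assumes straight-line (Euclidean) segments between consecutive θ-space points when measuring path length, which may differ from how the distance is measured along the fitted B-Spline; the function names are hypothetical.

```python
import math

def window_average(frames):
    """Component-wise mean of neighboring theta samples, giving one smoothed point."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def path_fractions(points):
    """lambda of each point = path length from Z_A to it / total path length."""
    lengths = [0.0]
    for a, b in zip(points, points[1:]):
        lengths.append(lengths[-1] + math.dist(a, b))
    total = lengths[-1]
    return [s / total for s in lengths]
```

By construction the first point (ZA) receives λ = 0 and the last point (ZB) receives λ = 1, in line with Z(0) = ZA and Z(1) = ZB.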

Figure 2.

The θ-space points on the initial-guess path. The base-pair separation collective variables are shown in (a); the collective variables describing the interactions between the G19 base and the neighboring base planes are shown in (b); and the collective variables describing the internal motions of the adjacent base pairs are shown in (c).

In the iOTPRW Hamiltonian (Equation 2), mλ was set as 10³ a.m.u.; the friction coefficient in the λ Langevin dynamics was set as 2.0 × 10⁵ a.m.u./ps; the path restraint force constant Ki was set as 100 kcal/mol/Å². The λ boundary restraint constant Kbound (Equation 3) was set as 10⁵ kcal/mol. During free energy derivative estimations and path optimizations, the λ space that spans from 0 to 1 was uniformly partitioned into 100 images. ΔTf was set as 0.1 ps and ΔTZ was set as 200 ps. In each path optimization step, the steepest-descent method was employed with the minimization step size Δz set as 0.005 Å²·mol/kcal.
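To put the two update intervals in perspective, the short calculation below converts them into MD step counts. It uses the 1 fs time-step stated in Section III.C and the approximately 12 ns total iOTPRW length reported in Section III.B; the variable names are illustrative, and integer femtoseconds are used to avoid floating-point rounding.

```python
TIMESTEP_FS = 1            # MD time-step (Section III.C)
DT_F_FS = 100              # 0.1 ps between f_m(lambda) recursions
DT_Z_FS = 200_000          # 200 ps between Z(lambda) optimizations
TOTAL_FS = 12_000_000      # about 12 ns of iOTPRW sampling (Section III.B)

steps_per_f_update = DT_F_FS // TIMESTEP_FS        # MD steps between bias updates
f_updates_per_path_update = DT_Z_FS // DT_F_FS     # bias updates per path update
path_updates = TOTAL_FS // DT_Z_FS                 # path optimizations in 12 ns
```

Under these settings the biasing potential is refreshed every 100 MD steps and 2000 times per path optimization cycle, and the 12 ns run corresponds to about 60 path optimization steps.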

III.B. The on-the-path generalized ensemble simulation and collective variable analysis

After about 12 nanoseconds (ns) of iOTPRW simulation, Z(λ) was fixed and a 5 ns equilibrium GE simulation was performed. After the first 700 ps of data was removed, the remaining samples were used to perform the collective variable analysis, particularly to assess how various collective variables are involved in the base extrusion process. To evaluate each specific collective variable ξ, the two-dimensional free energy surface, Go(ξ, λ), was constructed. Because λ represents an “ideal” reaction coordinate along the optimized path Z(λ), the correlation between ξ and λ in the free energy surface Go(ξ, λ) can provide a meaningful basis for understanding the role that ξ plays in the base extrusion process.

III.C. The general molecular dynamics simulation setup

The molecular dynamics simulation setup was generated through the CHARMM-GUI server72. The particle mesh Ewald (PME) method73 was applied to treat the long-range Coulombic interactions, while the short-range interactions were totally switched off at 12 Å. The Nosé-Hoover method74 was employed to maintain a constant reservoir temperature of 300 K, and the Langevin piston algorithm75 was used to maintain a constant pressure of 1 atm. The simulation time-step was set as 1 femtosecond.

IV. Results and Discussion

IV. A. The integration-based on-the-path random walk simulation result

The iOTPRW calculation convergence behavior

During the 12-ns iOTPRW simulation, the progressing parameter λ repeatedly traveled between the two end states (Figure 3a). The absence of any obvious sampling bottleneck, where λ random diffusion would be blocked, indicates that there is no significant “hidden” barrier in the space orthogonal to the pathway that is described by the pre-set collective variables; i.e., the target reaction pathway can be adequately represented by the pre-chosen collective variable set θ(X). Furthermore, throughout the iOTPRW simulation, all the base extrusions occur through the C6 major-groove channel; this suggests that the employed environmental collective variables (θ1–θ10) are sufficient to selectively steer base extrusions along the target pathway, at least within the timescale that the iOTPRW simulation can represent.

Figure 3.

The iOTPRW calculation results. (a) The time-dependent progressing parameter changes. (b) The time-dependent “maximum deviation” changes; the map of the RMSD between each pair of the estimated free energy profiles is shown in the inset. (c) The free energy surface along the optimized path; the committor probability distribution at the transition state is shown in the inset.

To monitor the convergence behavior, the root mean square deviation (RMSD) between each pair of the estimated free energy profiles was mapped (the inset of Figure 3b). The “maximum deviation” at each sampling length t′ is defined as the maximum value in the upper right (t′, tmax) corner of the average RMSD map. The “maximum deviation” value represents the largest possible uncertainty within the overall simulation length; thus, it can be a fair indicator of calculation convergence. Figure 3b shows that at about 9 ns, the maximum deviation dropped below 0.54 kcal/mol, and at about 10 ns, the maximum deviation dropped below 0.1 kcal/mol. To monitor the path optimization convergence along a certain collective variable θi direction, the map of the RMSD between the corresponding functions zi (λ) at each pair of simulation lengths was generated. As shown by the right panels of Figure 5, it takes about 7.5 ns for the collective variables that describe the base-pair separation, such as θ1 (Figure 5a), to enter the convergence phase, and it takes less than 9 ns for the collective variables that describe the base-extrusion environment changes, such as θ4 (Figure 5b) and θ7 (Figure 5c), to enter their convergence phases. Furthermore, the convergence behaviors of the collective variable functions and the free energy profile are in good agreement. It should be noted that in an iOTPRW simulation, the majority of the sampling time is usually used to optimize the collective variable functions; once these collective variable functions enter their convergence phase, the sampling time required for the final free energy convergence should be relatively short.
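The “maximum deviation” convergence measure can be sketched as follows. This reflects one reading of the definition above, stated here as an assumption: for a given sampling length, it takes the largest RMSD between any two free energy profiles saved at or after that length. The profiles are assumed to be tabulated on a common λ grid with a common reference (for example, zero at λ = 0), and the function names are hypothetical.

```python
import math

def rmsd(profile_a, profile_b):
    """Root mean square deviation between two profiles on the same lambda grid."""
    n = len(profile_a)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)) / n)

def max_deviation(profiles, start):
    """Largest pairwise RMSD among the profiles saved at or after index `start`."""
    later = profiles[start:]
    return max((rmsd(p, q) for i, p in enumerate(later) for q in later[i + 1:]),
               default=0.0)
```

Evaluating `max_deviation` for increasing `start` gives a non-increasing curve, so the point where it drops below a chosen tolerance marks the onset of the convergence phase.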

Figure 5.

The evolutions of the essential collective variable candidates along the initial (the dotted lines) and optimized (the solid lines) pathways. The left panel shows the evolutions of the representative essential collective variable candidates. The base-pair separation variable changes are shown in (a); the changes of the variables describing the interactions between the G19 base and the neighboring base planes are shown in (b); and the changes of the variables describing the internal motions of the neighboring base pairs are shown in (c). The right panel shows the maps of the RMSD between the representative collective variable functions obtained at each pair of simulation lengths.

The energetic and structural details of the MFEP result

Based on the final free energy profile (Figure 3c), we can identify two free energy minimum regions: the intra-helical region at λ = 0.03 and an intermediate (IM) region, in which λ broadly spans between 0.38 and 0.50. The free energy of the IM region is about 11.6 kcal/mol higher than that of the intra-helical region. The transition state (TS) between these two free energy minima is located around λ = 0.26, and the corresponding free energy barrier height is about 13.1 kcal/mol. To verify this transition state, a committor probability analysis76 was performed. For this analysis, we carried out a 500-ps equilibrium simulation based on the on-the-path Hamiltonian, in which Z(λ) was set as the optimized string and λ was fixed at 0.26; to obtain the committor value for each of the 500 collected samples, we generated 100 MD trajectories, each of which was initiated with a different random seed. As shown in the inset of Figure 3c, the calculated committor probabilities form a normal distribution centered around 0.5 (0.54 from the Gaussian fitting, with an R² value of 0.84); this result further confirms the quality of the path optimization convergence and supports the assignment of the λ = 0.26 state as a valid transition state. Contrary to what is generally hypothesized, at the transition state (Figure 4a), the C6 base has not yet been extruded from the intra-helical interior and the C6-G19 base pair still stays in the same plane. Apparently, during the transition-state formation, the stretching of the WC base plane rather than the base flipping is the dominant event. As a consequence of the base-pair stretching, at the transition state, although the WC moieties still point directly toward each other, the WC hydrogen bonding interactions are mostly lost. 
Furthermore, we can tell from the TS and IM structures (Figures 4a and 4b) that at the transition state, the pathway starts to switch from the base-pair-elongation mode to the base-flipping mode; specifically when the system moves forward from the TS region, the C6 base starts to flip out from the “dry interior” so that it can be stabilized by water molecules in the major-groove (Figure 4c). Apparently, in this base extrusion process, the base-pair stretching and base-flipping motions proceed in a step-wise fashion; the energetic penalty at the transition state is mainly introduced by the activation of the earlier motion, whose role may simply be preparing a ready environment for the latter motion to effortlessly occur.
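The committor test described above can be sketched in a few lines. This is only an illustrative outline, not the analysis code used in this work: the shooting outcomes are replaced by synthetic data, the names `committor_values` and `summarize` are hypothetical, and the Gaussian fitting of the histogram is replaced here by simple moment estimates (mean and standard deviation).

```python
import random
import statistics


def committor_values(outcomes):
    """Return one committor estimate per sample.

    outcomes[i][j] is True when shot j launched from sample i reached the
    product-side (IM) basin before the reactant-side (intra-helical) basin.
    """
    return [sum(shots) / len(shots) for shots in outcomes]


def summarize(p_b):
    """Moment estimates (mean, standard deviation) of the committor values.

    Note: this is a stand-in for a Gaussian fit of the histogram.
    """
    return statistics.fmean(p_b), statistics.stdev(p_b)


# Synthetic stand-in for real shooting outcomes (illustrative numbers only):
# 500 samples, 100 shots each, mirroring the protocol sizes quoted in the text.
rng = random.Random(0)
n_samples, n_shots = 500, 100
outcomes = []
for _ in range(n_samples):
    # Hypothetical "true" committor of this sample, clipped to [0, 1].
    p_true = min(1.0, max(0.0, rng.gauss(0.5, 0.1)))
    outcomes.append([rng.random() < p_true for _ in range(n_shots)])

p_b = committor_values(outcomes)
mean_pb, std_pb = summarize(p_b)
print(f"mean committor = {mean_pb:.2f}, spread = {std_pb:.2f}")
```

For a state that is a good transition state, the per-sample committor values are expected to cluster around 0.5; a bimodal distribution peaked near 0 and 1 would instead indicate that the fixed-λ ensemble straddles the two basins.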

Figure 4.


The structures of the transition state (a), the intermediate state (b), and the flipped-out end state (c).

Earlier umbrella sampling (US) studies reveal that base extrusion free energy profile calculations are sensitive not only to reaction coordinate definitions but also to force field treatments9,24,29. Here, we focus our comparison on the result of a recent US study24, in which the identical DNA sequence and energy function were employed and the CPD collective variable was employed as the order parameter. In the region of λ ∈ (0, 0.5), the free energy profiles obtained by the US and iOTPRW calculations are consistent in terms of general shape and even the free energy height of the IM region, which is estimated to be about 12.0 kcal/mol by the US calculation. The major difference lies around the pre-TS and TS regions: the free energy barrier height was calculated to be 16.0 kcal/mol by the US method, about 2.9 kcal/mol higher than the MFEP free energy barrier; and the US free energy profile is more rugged in the pre-TS region, where the iOTPRW free energy surface is very smooth. Employing the equilibrium GE simulation samples, we re-calculated the CPD-dependent potential of mean force (PMF); although these samples were collected along the optimized path, because the CPD may not be an ideal reaction coordinate, the CPD-based free energy barrier is estimated to be 11.9 kcal/mol, about 1.2 kcal/mol lower than the MFEP free energy barrier. As elaborated in the next section, these discrepancies suggest that a base-flipping collective variable alone may not be sufficient to guide effective sampling of the transition state formation, which is actually predominantly a base-plane elongation event. It is worth noting that the CPD-dependent PMF should be contributed by the microstates along all the possible paths, each of which corresponds to a unique flipped-out (OUT) end state, while the MFEP free energy profile should describe the free energy change along a single optimized path and is usually end-state dependent.
In the region of λ ∈ (0, 0.5), the to-be-extruded base is still confined in a narrow area; in this region, a converged US free energy profile along a correct reaction coordinate should be comparable with the MFEP free energy profile. However, in the region of λ ∈ (0.5, 1.0), the C6 base flips away from the tightly bound IM region and moves towards the pre-defined OUT state (Figure 3c). Thus in this region, the MFEP free energy profile is end-state dependent and cannot be directly compared with the US calculation result; as a matter of fact, in the region of λ ∈ (0.5, 1.0), the PMF along a simple collective variable is likely to have a large entropic contribution from all the possible pathways. Notably, the focus of this study is mainly on the region of λ ∈ (0, 0.5), because this is the key region where the transition state is formed and crucial “hidden” events are involved as well.

The collective variable changes along the MFEP

The above MFEP free energy profile analysis further suggests that at the pre-TS stage of the process, the associated events may be too complex for a single collective variable such as the CPD to represent. From the optimized MFEP, we can identify seven collective variables (θ1, θ2, θ3, θ4, θ6, θ7, and θ8) as essential coordinate candidates, because these coordinates have distinct on-the-path changes, particularly in the λ ∈ (0, 0.5) region.

The left panels of Figure 5 show how these collective variables evolve along the optimized path. Note that the short beginning portion of an MFEP pathway (for instance, along this MFEP path, the region between the starting point ZA and the free energy minimum point Z(λ = 0.03)) is usually not informative but is a necessary consequence of the path-optimization setup, because it is always preferable for the starting point to be chosen as a point before the reactant free energy minimum. The left panel of Figure 5a displays the on-the-path evolutions of the base-pair separation collective variables, θ1 and θ2, which show that the MFEP transition state (λ = 0.26) has a tighter structure (with both θ1 and θ2 slightly smaller than 6 Å) than the transition state obtained by a previous US study16, the θ1 value of which is around 7 Å. This structural difference provides a reasonable explanation of why the free energy barrier from the US calculation may have been overestimated. Notably, at the early stage of the path, the two base-pair separation distances change in a nonsynchronous manner. As observed from the trajectory, along this part of the pathway, an inter-base rocking motion indeed occurs around the vector along θ1; i.e., the C6 plane rocks away from the original WC base plane. Starting from λ = 0.2, the rocking motion is reversed, and when the system approaches the transition state, the broken WC base plane is re-formed. Such two-step rocking motions may play a role in helping to release the strain introduced by the base-plane elongation motion. The left panel of Figure 5b displays the progress of the collective variables that describe the relative motions between the G19 plane and its adjacent base planes, θ4 and θ6 (θ3 is not shown because its on-the-path evolution is similar to that of θ6). Evidently, during the formation of the transition state, the inter-base-plane motions need to be actively involved.
Like the C6-plane rocking motions, these motions may also facilitate the stretching of the WC base pair. More importantly, as discussed in the next section, these motions can be crucial in selectively steering the base-extrusion process along the target channel. The left panel of Figure 5c displays the progress of the collective variables that describe the internal motion of the adjacent G5-C20 plane, θ7 and θ8; these on-the-path evolutions show that, simultaneously with the transition state formation, the G5-C20 base plane is buckled; and after the transition state, accompanying the flipping of the C6 base, the G5-C20 plane relaxes back to its canonical WC shape. These neighboring base-plane responses are clearly induced by the C6-G19 elongation motion. Interestingly, along the optimized path, the structure of the other neighboring (G7-C18) plane (represented by θ9 and θ10) does not have any obvious change. Such asymmetric responses further indicate that the base-extrusion environment plays a crucial role in selecting base-extrusion channels.

In summary, a well converged MFEP and the corresponding free energy profile were generated via the iOTPRW simulation. To our knowledge, this is the first time that the base-plane-elongation motion rather than the commonly focused base flipping motion is identified as the key component, specifically the major event responsible for the transition state formation, in a base extrusion process. Moreover, we find out that the base-pair stretching and base flipping motions are largely dissociative. Furthermore, it is revealed that some intricate motions involving neighboring base planes actively participate in this part of the base extrusion pathway. It is worth noting that the base-pair elongation motion and associated changes of the surrounding base-pair structures were reported in an earlier study as well16. Here, we further verify these observations with more accurate details and more importantly identify the crucial roles that they play in the base extrusion process.

IV. B. The on-the-path generalized ensemble simulation and the collective variable analysis

The iOTPRW result allows us to identify several essential coordinate candidates from the employed collective variable set. Due to the fact that these collective variables were defined in the context of a specific process, the obtained path information may not be directly transferable for other DNA base extrusion studies to use. Therefore, it is necessary to perform further analysis to derive relatively simple and transferable representations of the motions that may be generally meaningful. In addition, it is of importance to revisit previously proposed collective variables and re-evaluate them based on the obtained MFEP.

The analysis on the flipping (opening) angle collective variables

As explained in III.B, in our collective variable analysis, the free energy profile, Go(ξ, λ), needs to be constructed to map the relationship between a collective variable ξ and the “ideal” reaction coordinate λ. Here, we focus our examination on three flipping (opening) angle collective variables: the base-opening angle, the original CPD, and a modified CPD (mCPD). As shown by the left panel of Figure 6, all of these two-dimensional free energy surfaces agree well with the free energy profile from the path optimization calculation (Figure 3c); for instance, the free energy barriers in these free energy profiles are all close to 13.1 kcal/mol. Such agreement indicates that the collective variable set θ(X) = (θ1(X),…,θm(X)) has adequately included the representation of each of the three base flipping order parameter functions. More importantly, Figure 6 shows that these flipping-angle-based collective variables have no obvious change at the pre-TS stage of the base extrusion process; this further confirms that the TS formation portion of the pathway is predominantly the base-pair-elongation event rather than the base-flipping event. In terms of path selectivity, although the three commonly applied flipping-angle-based collective variables were designed to enforce the final base flipping direction, as shown in Figure 6, these collective variables cannot guide effective sampling until λ is close to 0.2. Furthermore, based on these high-quality equilibrium GE ensemble samples, we calculated the PMFs along these three collective variables (the right panels of Figure 6). Consistent with the above understanding, these PMFs have only moderate flipping-angle-dependent changes until these flipping angles enter the PMF barrier regions, where sharp increases occur.
It should be noted that the PMF value at a collective variable state is contributed by all the relevant microstate regions; and due to the exponential relationship, the energetics of the lowest free energy microstate region are likely to dominate the corresponding PMF value. Therefore, if a collective variable is not an ideal reaction coordinate, the correct PMF barrier height is likely to be lower than the MFEP barrier height. In the present study, all the flipping angle collective variables show broad distributions at every λ state (the left panel of Figure 6); consequently, in these PMFs, the contribution of the MFEP transition state is totally overshadowed by the energetics of the neighboring lower free energy regions (along λ), in particular the IM regions. As shown by the right panel of Figure 6, the PMF barriers along the three flipping (opening) angles are respectively 11.3 kcal/mol, 11.9 kcal/mol, and 11.1 kcal/mol, about 1.2 kcal/mol to 2.0 kcal/mol lower than the MFEP barrier height; as a matter of fact, these values mostly reflect the free energy height of the MFEP IM region: 11.6 kcal/mol (Figure 3). In addition, these three collective variables show different behaviors in terms of how they correlate with λ; these differences explain why US-based free energy profile calculations are sensitive to flipping-angle definitions.
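The domination of a PMF value by the lowest free energy region can be made concrete with a small numerical sketch. Assuming equal-width bins along λ, the PMF along ξ follows from the two-dimensional surface as F(ξ) = −kT ln Σ_λ exp[−G(ξ, λ)/kT], up to an additive constant. The function name `pmf_from_surface`, the temperature of 300 K, and the two-bin toy surface below are illustrative assumptions rather than quantities taken from the simulations.

```python
import math

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def pmf_from_surface(g_rows, temperature=300.0):
    """Marginalize a free energy surface G(xi, lambda) over lambda.

    g_rows[i][j] is the free energy (kcal/mol) of collective-variable bin i
    at path bin j; use math.inf for unvisited bins. Equal bin widths along
    lambda are assumed, so F(xi_i) is returned up to an additive constant.
    """
    kt = KB * temperature
    pmf = []
    for row in g_rows:
        g_min = min(row)
        if math.isinf(g_min):
            pmf.append(math.inf)  # this xi bin was never visited
            continue
        # Log-sum-exp, shifted by the row minimum for numerical stability.
        total = sum(math.exp(-(g - g_min) / kt) for g in row)
        pmf.append(g_min - kt * math.log(total))
    return pmf


# Toy case (assumed, for illustration): a single xi bin that is populated
# both at a TS-like state (13.1 kcal/mol) and at an IM-like state (11.6).
toy_surface = [[13.1, 11.6]]
f_toy = pmf_from_surface(toy_surface)[0]
print(f"PMF value of the mixed bin: {f_toy:.2f} kcal/mol")
```

In this toy case the marginalized value lies just below the lower of the two contributions, which is the same mechanism by which a broad ξ distribution at each λ state lets the IM region mask the transition state in the PMF.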

Figure 6.


The two-dimensional free energy profiles along each of the three examined collective variables (in the horizontal direction) and the progressing parameter (in the vertical direction). (a) The base-opening angle is the target collective variable; (b) the original CPD is the target collective variable; and (c) the mCPD is the target collective variable. The white lines in these free energy profiles represent the average collective variable values along the path.

The selectivity-governing motions

According to the above analysis, in base-extrusion simulations, it is crucial to have the base-pair-elongation motion properly represented in reaction coordinate definitions. In the current study, the base-pair separation distance collective variables, θ1 and θ2, are included for this purpose. As mentioned in the introduction, these collective variables alone cannot selectively describe a unique path, in terms of distinguishing either between the target base and partner base extrusion pathways or between the major and minor groove extrusion routes. Thus, a question arises: what motions govern the path selectivity at the pre-TS stage of the extrusion process? Based on the details of the on-the-path collective variable changes discussed in IV. A., among the identified essential coordinate candidates, θ1 and θ2 mainly account for the base-pair-elongation motion; θ7 and θ8 only describe a neighboring plane internal response motion; and these collective variables do not have any obvious change till λ approaches 0.2 (the left panel of Figure 6). Therefore, θ3, θ4, and θ6 are the only candidates that are possibly responsible for the target path extrusion selectivity. Interestingly, all three of these collective variables were introduced to describe how the C6-G19 base plane, specifically the G19 base, moves relative to the neighboring base planes. A survey of our simulation trajectories shows that the on-the-path changes of these three variables indeed follow a series of G19-base in-plane rotation motions.
To represent these rotation motions in a general manner, as illustrated in Figure 7a, we can utilize the vector generated via the projection of N9(G7)→C6(C18) on the G7-C18 reference plane and the vector generated via the projection of C8(G19)→C2(G19) on the C6-G19 reference plane to define an in-plane rotation angle (IPRA) function; here, the G7-C18 and C6-G19 reference planes refer to the corresponding base planes in an “ideal” B-DNA structure, upon which the structures obtained from the trajectory that contain the N9(G7)→C6(C18) and C8(G19)→C2(G19) vectors are superimposed. As shown in Figure 7b, unlike the base-flipping angle collective variables, the IPRA undergoes distinct transitions throughout the whole process. At the very beginning of the process, as the IPRA value decreases, the C6-G19 base plane slightly rotates into the minor groove region. At around λ = 0.08, the base plane starts to rotate in the reverse direction; thus, through the concurrent base-pair-elongation motion, the C6 base can be selectively extruded into the target major groove region. At around the state of λ = 0.16, although the Watson-Crick hydrogen bonding functional groups of C6 are not yet exposed (Figure 4a), the C6 base has been sufficiently delivered into the major groove region. In the following portion of the process, the G19 base starts to relax back towards its canonical ds-DNA position (Figure 7b), while due to the continuing separation between the C6 and G19 bases, the C6 base still stays in the major groove region. Apparently, these in-plane rotation motions are essential not only for the base-extrusion path selectivity but also for enabling the base-plane-elongation dynamics in the confined ds-DNA environment. In the region of λ ∈ (0, 0.16), without any obvious base-flipping motion, a free energy increase of about 9.5 kcal/mol has already been built up (Figure 3c).
After the transition state (Figure 4a), when the base-pair separation generates enough room for C6 to be mobilized inside the elongated Watson-Crick structure, the C6 base starts to flip away from the “dry interior” of the B-DNA (Figure 4b) and at the same time, the G19 base almost completes its in-plane rotation and returns to the canonical ds-DNA location (Figure 7b). As also shown in Figure 7b, starting at around λ =0.75, the G19 base rotates away from its canonical ds-DNA position. Such unusual base rotation is likely to be caused by the MFEP end-state definition, which as discussed earlier can greatly influence the near-end-state MFEP result; and it also explains the sharp increase in the corresponding part of the free energy profile (Figure 3c).
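The geometric construction behind the IPRA can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the two inter-atomic vectors and the two reference-plane normals are taken as given (after superposition onto the ideal B-DNA reference), the two reference planes are assumed to be nearly parallel, and the sign convention (counterclockwise about the first plane normal is positive) is chosen arbitrarily. All function names are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def project_onto_plane(v, normal):
    """Remove from v its component along the plane normal."""
    k = _dot(v, normal) / _dot(normal, normal)
    return tuple(x - k * n for x, n in zip(v, normal))


def in_plane_rotation_angle(v1, n1, v2, n2):
    """Signed angle (degrees) between two plane-projected vectors.

    v1 is projected onto the plane with normal n1 and v2 onto the plane
    with normal n2; the angle is measured from the first projection to the
    second, about n1. The sign convention is an assumption of this sketch.
    """
    p1 = project_onto_plane(v1, n1)
    p2 = project_onto_plane(v2, n2)
    cos_part = _dot(p1, p2)
    sin_part = _dot(_cross(p1, p2), n1) / math.sqrt(_dot(n1, n1))
    return math.degrees(math.atan2(sin_part, cos_part))


# Toy geometry: both reference planes are the xy-plane (normal along z).
normal = (0.0, 0.0, 1.0)
v_a = (1.0, 0.0, 0.3)    # stands in for the first inter-atomic vector
v_b = (0.0, 1.0, -0.2)   # stands in for the second inter-atomic vector
angle = in_plane_rotation_angle(v_a, normal, v_b, normal)
print(f"in-plane rotation angle = {angle:.1f} degrees")
```

Because both vectors are projected onto reference planes before the angle is taken, out-of-plane displacements such as buckling or flipping do not contribute, which is what lets such an angle isolate the in-plane rotation from the flipping motion.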

Figure 7.


The in-plane rotation motions in the DNA base extrusion. (a) The in-plane rotation angle (IPRA) function is defined based on the vectors N9(G7)→C6(C18) and C8(G19)→C2(G19). The G19 major-groove rotations progress with increasing IPRA value. (b) The two-dimensional free energy profile along the IPRA (in the horizontal direction) and the progressing parameter (in the vertical direction).

The above analysis further confirms the fact that at the early stage of the C6-base major-groove extrusion process, the base-plane-elongation motion, which can create room for the mobilization of the C6 base, is the predominant event. Thus not surprisingly, all the commonly employed flipping-angle collective variables fail to represent the essential initial portion of the target base extrusion pathway. In addition, we identify G19-base in-plane rotations as essential motions that may govern the base-extrusion selectivity.

V. Concluding Remarks

In this work, the on-the-path random walk (OTPRW) method, which is the first generalized-ensemble-sampling-based finite-temperature-string algorithm, was improved to be an integration-based OTPRW (iOTPRW) algorithm for the understanding of a B-DNA major-groove base extrusion pathway. From our iOTPRW MFEP optimization calculation, we identified a base-plane-elongation event rather than a base-flipping event as the dominant motion in the transition state formation portion of the process; and we also show that to facilitate the essential base-plane-elongation dynamics, the surrounding environment of the flipped base needs to be actively involved. Further taking advantage of the extended-dynamics nature of the OTPRW Hamiltonian, an on-the-path equilibrium simulation was performed along the optimized path. The analysis of the collected samples reveals that none of the commonly employed flipping-angle-based collective variables alone can adequately represent the base extrusion pathway. Furthermore, the collective variable analysis also allows us to uncover the essential hidden events that couple with the base-plane-elongation motion and govern the base extrusion selectivity: the base-pairing partner of the extrusion target undergoes a series of in-plane rotation motions to accommodate the base-plane-elongation dynamics. A base-plane rotation angle is identified as a general reaction coordinate to represent these in-plane rotations.

DNA base extrusion is a crucial component of many biomolecular processes. We believe that our understanding will shed light on related future studies. It is worth noting that in this study, we have employed a relatively simple model system and only focused our analysis on the motions surrounding the base extrusion site. As one could expect, in more complex systems, more extended and larger-scale dynamics may be involved. If the iOTPRW method is applied to such systems, carefully choosing collective variable sets is critical. Here, the term “base extrusion” instead of “base flipping” or “base opening” is employed to define the whole target process in order to differentiate it from its component events, including the base-flipping motion, in the context of this work; nevertheless, to follow the tradition of this topic, “base flipping” is possibly more appropriate to represent such processes.

Acknowledgments

We would like to thank Dr. Alex Mackerell for helpful discussion and suggestions. Funding support from the National Institutes of Health (GM054403) and the National Science Foundation (MCB0919983) is acknowledged. We would also like to thank the Florida State University High Performance Computing (HPC) Center and the Institute of Molecular Biophysics Computing Facility (Dr. Michael Zawrotny) for computing support.

References
