Abstract
Wordom is a versatile, user-friendly, and efficient program for manipulation and analysis of molecular structures and dynamics. The following new analysis modules have been added since the publication of the original Wordom paper in 2007: assignment of secondary structure, calculation of solvent accessible surfaces, elastic network model, motion cross correlations, protein structure network, shortest intra-molecular and inter-molecular communication paths, kinetic grouping analysis, and calculation of mincut-based free energy profiles. In addition, an interface with the Python scripting language has been built and the overall performance and user accessibility enhanced. The source code of Wordom (in the C programming language) as well as documentation for usage and further development are available as an open source package under the GNU General Purpose License from http://wordom.sf.net. © 2010 Wiley Periodicals, Inc. J Comput Chem, 2011
Keywords: structural/dynamics analysis program, free energy landscape, elastic network model, protein structure network, communication paths
Introduction
Wordom is a program aimed at fast manipulation and analysis of individual molecular structures and molecular conformation ensembles. Its development started in 2003 and the relative publication appeared in 2007.1
A number of programs are already available to analyze molecular structures and dynamics. These include: (a) the most common molecular simulation and analysis packages, like CHARMM,2, 3 Gromacs,4 and Amber5, 6; (b) a number of molecular viewers, like VMD,7 and Pymol8; (c) command-line oriented analysis programs and script suites, like MMTSB,9 carma,10 and pcazip11; and (d) packages that provide environments for structural analysis, like Bio3D,12 MMTK,13 or Biskit.14 In this panorama, Wordom was originally conceived as a simple command-line utility to quickly access data in common structural data files. Basic manipulation tools were then implemented, which paved the way for the adoption of a modular framework to easily add analysis routines. At the time of the first publication, novel analysis modules already formed the bulk of Wordom's code, and others have been added since then.
Some of the new modules (Table 1), such as secondary structure assignment (SSA), surface area calculations, and elastic network models (ENM), implement tools that are already available in some form in other software packages or web servers. However, their use on trajectory files is either cumbersome or unpractical. Indeed, programs for SSA and surface computation are widespread, but most of them can only deal with a single structure file at a time, thus making the handling of multiconformation files complex and time consuming. On the same line, ENM can be computed by the CHARMM program2, 3 or via web servers.41–45 However, the former is slower and significantly more complicated than Wordom in input setting, whereas the latter do not handle multiconformation files. Moreover, a number of ENM-based analysis tools are available in different programs and/or web servers, whereas Wordom joins many of them together in a single interface.
Table 1.
Module | Labela | Function | Reference |
---|---|---|---|
Secondary Structure Assignment | SSA | Assignment of secondary structure based on geometric criteria | 15, 16 |
Molecular Surface | SURF | Calculation of solvent accessible, solvent excluding and van der Waals surfaces; surface correlation along a trajectory | 17, 18 |
Elastic Network Model | ENM | Calculation of elastic network models on a protein structure | 19–24 |
Cross Correlation | CORR | Correlations of atomic displacements along a trajectory | 25–27 |
Protein Structure Network | PSN | Calculation of network of amino acid interactions | 28–31 |
PSN Path | PSN-path | Path calculation within protein structure network | 32, 33 |
Clustering | CLUS | Clustering according to conformation similarity | 34–36 |
cut-based Free Energy Profile | cFEP | Computation of a one-dimensional free energy profile that preserves barriers between free energy basins | 39, 46 |
Kinetic Grouping Analysis | KGA | Determination of free energy basins based on kinetic behavior | 40 |
Abbreviation/acronym used in the text.
Other novel modules introduce procedures and algorithms not available elsewhere, such as protein structure network (PSN) analysis,28, 29 search for the shortest intra-molecular and inter-molecular communication paths (PSN-path),32 kinetic grouping analysis (KGA),40 and mincut-based Free Energy Profile (cFEP).46 The principles underlying these modules have been reported in the relevant papers, but, so far, no other publicly available software can perform these analyses. In particular, PSN and PSN-path are based on the application of graph theory to protein structures, allowing to represent molecular systems as networks of interacting amino acids and to infer the functional implications of such networks in the context of intra-molecular and inter-molecular communication.30, 31, 37, 38, 47 Importantly, cFEP and KGA are rigorous methods for determining free energy basins and barriers and thus for investigating the free energy surface of simulated processes, e.g., reversible folding and conformational changes of structured peptides and miniproteins.39, 46, 48
Significant technical improvements include a more user-friendly input syntax and a more general procedure for selecting subsets of atoms. Some parts of the code have been rewritten to gain speed, robustness, and facilitate the addition of new modules. As for performance, Wordom has been modified to treat calculations relative to different frames as different threads and exploit multicore compute architectures (coarse-grained data parallelism). This multithread approach is now present in the modules in which frames are treated independently of each other. Future modules that fall under this category will be able to easily use this kind of threading without major modifications to the code. This approach does not prevent single modules to adopt internal threading. An example is the clustering module, which can now be used in multicore mode in the CPU-intensive step of frame–frame comparison. Finally, an interface with the Python scripting language has been implemented to take advantage of its flexibility and speed of coding.
This article details the analysis tools added to Wordom after the original publication, with particular emphasis on those modules that are not available in other analysis programs.
New Tools in Wordom
Secondary Structure Assignment
The SSA module is able to evaluate the secondary structure of a peptide or protein using two methods, DLIKE or DCLIKE, derived from the DSSP49 and DSSPcont15, 16 algorithms, respectively. These two approaches are considered two standards in the field of secondary structure assignments. DSSPcont is a consensus-based DSSP assignment, in which the whole DSSP procedure is run 10 times with different values of the energy cutoff that defines an hydrogen bond (H-bond).15, 16 Assignments are then weighted according to the cutoff and a consensus is given as the final output. DSSP and DSSPcont assignments are generally comparable.
Both algorithms have been rewritten from scratch since the DSSP license does not allow free reuse of the code. The output is a simple string where the nth character corresponds to the secondary structure of the nth amino acid. There are eight possible letters in the secondary structure “alphabet”: H, G, I, E, B, T, S, and L, standing for α helix, 310 helix, π helix, extended, isolated β-bridge, hydrogen bonded turn, bend, and unstructured loop, respectively.15 No extra information such as that included in the typical DSSP output is given, since the SSA module is meant to be used for a quick analysis of the secondary structure profile along a trajectory, rather than for a complete and throughout characterization of a single structure.
Comparisons between the secondary structure assignments by Wordom and by the DSSP program are shown in Table 2. The agreement is good, i.e., 92%, considering also that most discrepancies do not concern exchanges between helices and strands. The higher speed of the SSA module compared to DSSP shows itself on trajectory files (see Table 3). In fact, whereas the SSA module can compute the secondary structure along a trajectory very fast, DSSP works on single frame files previously extracted from the trajectory. Thus, the better performance of Wordom must be ascribed, at least in part, to the lack of input/output operations associated with handling each molecule conformation as a standalone file (see Table 3). The speedup is more pronounced when dealing with small systems, e.g., peptides.
Table 2.
DSSPa | |||||||||
---|---|---|---|---|---|---|---|---|---|
E | B | T | S | L | H | G | I | Total | |
Wordom/SSAa | |||||||||
E | 2103 | 31 | 18 | 21 | 85 | 2 | 0 | 0 | 2260 |
B | 11 | 58 | 6 | 6 | 38 | 1 | 7 | 0 | 127 |
T | 16 | 1 | 638 | 51 | 32 | 6 | 3 | 0 | 747 |
S | 12 | 0 | 5 | 656 | 13 | 0 | 2 | 0 | 688 |
L | 44 | 3 | 9 | 6 | 1351 | 3 | 0 | 0 | 1416 |
H | 0 | 1 | 11 | 2 | 7 | 951 | 17 | 0 | 989 |
G | 1 | 0 | 39 | 0 | 3 | 1 | 163 | 0 | 207 |
I | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 |
Total | 2187 | 94 | 728 | 742 | 1529 | 964 | 192 | 0 | 6436 |
The test set consists of 29 proteins (2CCY, 1ECA, 2IFO, 1TPM, 1HRE, 1PHT, 2POR, 3BCL, 2HLA, 1CDQ, 1AFC, 1MSA, 1VMO, 1HXN, 1NSC, 2BBK, 3AAH, 1TSP, 2PEC, 1PPK, 1STD, 4TIM, 1BRS, 1NTR, 1PYA, 2DNJ, 1PLQ, 1BNH, and 1PYP) selected as representatives of common folds.50 Results have been pooled together for each program and compared. Each element ij of the matrix reports the number of residues assigned by Wordom and by DSSP to be in conformation i and j, respectively.
Table 3.
A script extracted each single frame by mean of Wordom and called DSSP on the extracted frame.
A script called DSSP on the already-extracted frames.
Calculation through the Wordom SSA module.
Contrarily to DSSP, because Wordom is conceived to operate on the results of simulations, the structure files must contain all the atoms that contribute to the backbone H-bonds. Therefore, structures derived directly from the protein data bank (PDB), especially the X-ray structures that miss hydrogen atoms or entire residues, must be completed before submission to the SSA module.
Molecular Surface Calculation, Correlation, and Clusterization
Wordom computes different kinds of molecular surfaces using two different algorithms: an exact analytical method developed by Hu and coworkers (i.e., ARVO algorithm)17 and a fast numerical method developed by Pascual-Ahuir and coworkers (i.e., GEPOL algorithm).18 ARVO calculates the solvent accessible surface area by expressing the molecular surface as surface integrals of the second kind and then transforming these integrals into a sum of double integrals using the stereographic projection method.17 In contrast, GEPOL describes the molecular surface as a series of tesserae and then calculates the overall area.18 The Wordom implementation of GEPOL allows calculation of the van der Waals, solvent accessible and solvent excluding surfaces as well as tuning of three different parameters [i.e., number of divisions (ndiv), overlapping factor (ofac), and radius of the smaller sphere (rmin)] to balance the speed and accuracy of area computation. Wordom implementations are faster than the original programs (see Table 4).
Table 4.
Module | # Selected atomsa | Approximate CPU timeb |
---|---|---|
Surface (WordomARVO)c | 115d | 2980 |
Surface (ARVO)e | 115 | 3690 |
Surface (WordomGEPOL–ASURF)f | 115 | 2130 |
Surface (GEPOLASURF)g | 115 | 2660 |
Surface (WordomGEPOL–ESURF)h | 115 | 5900 |
Surface (GEPOLESURF)i | 115 | 7290 |
Surface (WordomGEPOL–WSURF)j | 115 | 1890 |
Surface (GEPOLWSURF)k | 115 | 1970 |
Correlation (DCC)i | 360m | 4 |
Correlation (LMI)n | 360 | 63 |
PSNo | 2593 | 391 |
PSN-path | – | 15 per pair |
Clustering (distances only)p | 316q | 1461 |
Clustering (QT-like)r | 316 | 100 |
Clustering (hiero)s | 316 | >50,000 |
Clustering (leader)t | 316 | 10 |
Clustering (leader)u | 316 | 10 |
Clustering (leader)v | 316 | 45 |
The considered system is a 10,000 frame trajectory of the GTP-bound Gαi1 subunit (PDB: 1CIP; 2593 atoms; 316 residues and 1 GTP molecule (44 atoms)).
CPU time (seconds) on an AMD Athlon 64 3000+, 2 GHz, 2 GB RAM.
Solvent accessible surface area computed by the Wordom implementation of the ARVO algorithm.
selection consisted in GTP and first 9 residues (selection /*/@(1–10)/*)
Solvent accessible surface area computed by the ARVO program.
Solvent accessible surface area computed by the Wordom implementation of the GEPOL algorithm (highest accuracy).
Solvent accessible surface area computed by the GEPOL program (highest accuracy).
Solvent excluded surface area computed by the Wordom implementation of the GEPOL algorithm; accuracy settings: rmin 0.5, ofac 0.8, ndiv 5.
Solvent excluded surface area computed by the GEPOL program; accuracy setting: rmin 0.5, ofac 0.8, ndiv 5.
van der Waals surface area computed by the Wordom implementation of the GEPOL algorithm; highest accuracy.
van der Waals surface area computed by the GEPOL program; highest accuracy.
Residue-residue correlation by means of the dynamic cross correlation method; masses were not taken into account.
Selection consisted in all Cα atoms and GTP
Residue-residue correlation by means of the linear mutual information method; masses were not taken into account.
PSN analysis probing 11 different Imin values (from 0.0 to 5.0 with a 0.5 step).
Only the RMSD-based distance matrix was computed at this stage and written to file.
All Cα atoms were selected.
Clustering by the QT-like algorithm, using a precalculated distance matrix (RMSD cutoff 1.0 Å).
Clustering by the hierarchical algorithm, using a pre-calculated distance matrix (RMSD cutoff 1.0 Å).
Clustering by the leader-like algorithm (RMSD cutoff 1.0 Å); distance matrix is not necessary.
Clustering by the leader-like algorithm (RMSD cutoff 1.0 Å) and turning on the non-markovian option. In this case, the bottleneck is disk speed (CPU usage 18%).
Clustering by the leader-like algorithm (DRMS cutoff 1.0 Å).
Using either one of these two algorithms, Wordom can perform different regression analyses (i.e., linear, logarithmic, exponential, and power) to correlate surface area values from two different selections computed along a trajectory. Moreover, a number of statistical parameters can be derived from the surface timeseries (i.e., range, time average, covariance, and standard deviation). Finally, clustering (binning) of the trajectory snapshots can be also performed on the basis of the surface area values of a given selection, dividing the trajectory frames in different clusters of user-defined width.
Elastic Network Model
The ENM is a coarse grained normal mode analysis (NMA) technique able to describe the vibrational dynamics of protein systems around an energy minimum. Within this technique, the protein structure is described by a reduced subset of atoms (usually Cα-atoms), whose coordinates can be derived either from structure determinations (crystallography, NMR) or from molecular simulations. The interactions between particle pairs are given by a single term Hookean harmonic potential.19 The total energy of the system is thus described by the simple Hamiltonian:
(1) |
where dij and are the instantaneous and equilibrium distances between Cα-atoms i and j, respectively, whereas kij is a force constant, whose definition varies depending on the type of ENM used. The second derivatives of the harmonic potential are stored in a 3N × 3N Hessian matrix (H), whose diagonalization gives a set of 3N-6 nonzero-frequency eigenvectors and associated eigenvalues.
Two alternative versions of ENM have been implemented. In the first version, termed “linear cutoff-enm,” the force constant is equal to 1 for pairwise interactions between the Cα-atoms lying within a cutoff distance chosen by the user, and adjacent Cα-atoms are assigned a force constant equal to 10.20 In the second one, termed “Kovacs-ENM,”21 the force constant depends on the distance of the interacting particles:
(2) |
where C is constant (with a default value of 40 Kcal/mol ċ Å2).21
The structural perturbation method (SPM) has been recently described as a technique useful to characterize allosteric wiring diagrams in the context of the ENM lowest frequency modes.22 According to this methodology, amino acid positions that are relevant to protein dynamics are searched by perturbing systematically all the springs that connect the Cα-atoms and then measuring the residue-specific response of such perturbations in the context of a given mode m. The perturbation response is computed as:
(3) |
where νm is the eigenvector of mode m, is its transpose, and δH is the Hessian matrix of the perturbation to the energy of the elastic network:
(4) |
The response δωim is proportional to the elastic energy of the springs that are connected to the ith residue when they are perturbed by an arbitrary value (0.1), thus defining the most critical nodes for the dynamics of a given mode. The number of modes used for the computation is specified by the user (from 1 up to 3N-6). It is also possible to generate, for each analyzed mode, a pdb file containing the values of δωim in the β-factor field (Fig. 1).
Theoretical β-factors can be computed inside the ENM module, by the formula23
(5) |
where νmn is the nth element of eigenvector m, λm is the associated eigenvalue, k is the Boltzmann constant, and T is the temperature in K.
Cross correlations between theoretical and experimental β-factors can be also computed according to the following equation:
(6) |
where and are the theoretical and experimental β-factors, and and are the theoretical and experimental β-factor average over all atoms, respectively. The number of modes used for the computation is specified by the user (from 1 up to 3N-6).
Moreover, involvement coefficients I between the ENM modes and the displacement vector between a given structure/frame T and a reference structure R can be computed according to the following equation:
(7) |
where and is the ith coordinate in the two conformers and νmn is the nth element of eigenvector m.24 By default, the computation is done for all 3N-6 modes, and only the values of I greater than an arbitrary threshold (i.e., 0.2) are output.
The cumulative square overlap (CSO) between all modes and the displacement vector is computed according to the following equation:
(8) |
Finally, residue correlation Cij is computed as:51
(9) |
Cross Correlation
Wordom implements two different algorithms to calculate correlations of atomic displacements along an MD trajectory. One algorithm, called dynamic cross-correlation (DCC),25 is a simple and well established method based on the calculation of the normalized covariance of atom/residue position vectors. DCC values are computed as:
(10) |
where i and j may be atoms or centroids of atoms grouped by residue, and ri and rj are the corresponding position vectors. DCC represents the extent of atom/residue displacement correlation within a range that goes from 1.0 to −1.0; where 1.0 indicates completely correlated (same period and phase) and −1.0 completely anti-correlated (same period and opposite phase) displacements. The second algorithm, called linear mutual information (LMI),26, 27 is computationally more expensive (see Table 4) than DCC but overcomes some limitations of the DCC algorithm and is able to estimate correlations between perpendicular motions. LMI values are computed as:
(11) |
where i and j may be atoms or residues, Cij is the pair-covariance matrix, and Ci and Cj are marginal covariance matrices.26, 27 LMI correlation values can vary from 0.0 to 1.0, which indicate completely uncorrelated and completely correlated displacements, respectively.
The Wordom implementation of the DCC and LMI algorithms incorporates some setup options. In particular, it is possible to calculate correlations by treating atoms independently or collectively with respect to the residues they belong to. It is also possible to take into account the atomic masses.
For a selection of 360 atoms within 10000 trajectory frames, the DCC and LMI methods took, respectively, 4″ and 63″ on the same processor (Table 4). The relative correlation matrices are shown in Figure 2.
Protein Structure Network
In recent times, the concept of PSN has been explored, giving more insights into the global properties of protein structures.30, 31 The representation of protein structures as networks of interactions between amino acids has proven to be useful in a number of studies, such as protein folding,47 residue contribution to the protein–protein binding free energy in given complexes,37 and prediction of functionally important residues in enzyme families.38 All these aspects pertain to the issue of intra-molecular and inter-molecular communication.30, 31
Wordom implementation of PSN analysis is based on the work and algorithms described in the relevant papers by the Vishveshwara and coworkers.28, 29 PSN is constructed from the atomic coordinates of residues, which represent the nodes of the network. Two nodes are connected by an edge if the percentage of interaction between them is greater than or equal to a given Interaction Strength cutoff.
(12) |
where Iij is the interaction percentage of nodes i and j, nij is the number of side-chain atom pairs within a given distance cutoff (4.5 Å as a default), and Ni and Nj are, respectively, the normalization factors (NF) for residues i and j, which take into account the differences in size of the different nodes and their propensity to make the maximum number of contacts with other nodes in protein structures. The NFs for the 20 amino acids in our implementation were taken from the work by Kannan and Vishveshwara.28 Novel NFs for nonamino acid nodes can be introduced as well by the user. In this respect, the current version of the module holds also the NFs for retinal, guanine nucleotide di- and tri-phosphates (GDP and GTP, respectively), and Mg2+. In detail, retinal NF (i.e., 170.13) was computed as the average number of contacts done by the molecule in a dataset of 83 crystallographic structures concerning the different photointermediate states of bacteriorhodopsin, bovine rhodopsin, sensory rhodopsin, and squid rhodopsin. The NFs for GDP and GTP (i.e., 220.19 and 274.78, respectively) were derived from datasets of 55 and 69 G proteins, respectively. Finally, the NFs for Mg2+ concerns GTPases and is 14.65 and 22.01 in the GDP- (i.e., based upon 41 GTPase structures) and GTP-bound states (i.e., based upon 68 GTPase structures). Iij are calculated for all node pairs excluding j = i ± n, where n is a given neighbour cutoff (2 as default), and each node pair with an Iij value greater than or equal to a given Imin cutoff is connected by an edge. Different networks can be achieved by probing a range of Imin cutoffs. At high Imin cutoffs, only nodes with high number of interacting atom pairs will be connected by edges, indicative of stronger inter-residue interactions. At a given Imin cutoff, those nodes that realize more than a given number of edges (4 as default) are called hubs. The percentage of interaction of a hub node is
(13) |
where Ii is the hub interaction percentage of node i, nij is the number of side-chain atom pairs within a given distance cutoff and Ni is the normalization factor of residue i. Node inter-connectivity is finally used to highlight cluster-forming nodes, where a cluster is a set of connected amino acids in a graph. Node clusterization procedure is such that nodes are iteratively assigned to a cluster if they can establish a link with at least one node in such cluster. A node not linkable to existing clusters initiates a novel cluster and so on until the node list is exhausted. The size (defined as the number of nodes) of the largest cluster is used to calculate the Icritic value. Icritic is defined as the Imin at which the size of the largest cluster is half the size of the largest cluster at Imin = 0.0. At Imin = Icritic weak node interactions are discarded, emphasizing the effects of stronger interactions on PSN properties.
The Wordom implementation of PSN analysis allows the user to: (a) modify all the involved cutoffs (i.e., distance, neighbor, hub); (b) make residue selections; (c) set Imin ranges; and (d) set, when dealing with a trajectory, the fraction of frames for which a PSN property is defined as stable. Furthermore, Wordom computes all network properties described in the relevant papers by Vishveshwara's group (i.e., interaction strength for all node pairs, stable node interactions, hub frequencies, cluster compositions, and dimensions).28, 29
An original feature of Wordom is the hub correlation analysis, a simple but effective method to highlight correlations in the propensity of two nodes to behave as hubs along an MD trajectory (Fig. 3).
The results of an application of the PSN module to a 10,000 frame trajectory of the Gαi1-subunit complexed with GTP are shown in Figures 4A and 4B. The relative CPU time required for such a demonstrative calculation is reported in Table 4.
Search for Communication Paths
As an extension of the PSN analysis tool, Wordom can calculate the shortest non-covalently connected path(s) between two residues of interest in a single structure or in a trajectory (Fig. 4C), by combining PSN node inter-connectivities and residue correlated motions, as described in the relevant paper by the Gosh and Vishveshwara.32 Path search through the PSN-path module uses Dijkstra's algorithm33 to traverse PSN inter-connectivities, and to find the shortest paths in each frame. It consists in: (a) search for all shortest paths between selected residue pairs based upon the PSN connectivities and (b) selection of paths that contain at least one residue correlated with either one of the two extremities (i.e., the first and last amino acids in the path). Once the shortest paths have been found, their frequencies, i.e., the number of frames containing the selected path divided by the total number of frames in the trajectory, are computed, which helps selection of the most meaningful paths. Steps (a) and (b) of path search employ the outputs from the PSN and CORR modules, respectively. The Wordom implementation allows the user to tune several parameters of the path-search routine (i.e., minimum interaction strength cutoff between nodes, lowest accepted residue correlation cutoff, minimum length and frequency of paths). Either the DCC or LMI methods can be chosen as a source of residue correlations.
Clustering
The original RMSD- and DRMS-based clustering module allowed the choice of three different algorithms: leader-like,34 hierarchical35 and quality threshold-like (QT-like).36 QT-like differs from the original QT algorithm in the check performed to assess whether a conformation belongs to a cluster or not. The original QT builds a perspective cluster for each frame by comparing it with all others and adding conformations progressively farther away from the starting frame until each new addition is within the chosen threshold with respect to all previously added frames. The largest of all these perspective clusters is then taken as the first cluster, its members are taken out of the conformation population and the procedure is run again until all conformations are either in a cluster or isolated. The threshold can be seen as the diameter of the cluster thus formed. In contrast, QT-like builds the clusters only checking that newly added conformations are within the threshold with respect to the reference frame; the threshold is thus the radius of the cluster.
The clustering module has been optimized both in its performance (speed and memory usage) and accuracy. In the leader-like34 algorithm (the fastest but least accurate one), each subsequent frame is compared with the existing cluster centers and, in case no cluster center is within the threshold, a new cluster is created with the frame as its center. The original implementation allowed the choice of two different frame-comparison modalities. According to the first modality a frame is compared with all the existing clusters and assigned to the nearest one (more accurate, default behavior). With the second option a frame is assigned to the first cluster within the threshold (faster). In the latest version a third option has been added, such that each frame n is compared with the existing clusters moving backward from the cluster that holds frame n − 1, to the cluster that holds frame n − 2, and so on, until a distance lower than the threshold is found. In non-Markovian data sets (e.g., snapshots of MD simulations saved every few ps which are correlated) this approximation greatly speeds up the process, because the likelihood that a frame belongs to the same cluster as the preceding frame(s) is quite high. The accuracy of the new option is only slightly lower than the “comparison with all clusters” approach, but the execution is faster than the original “stop at first viable cluster” option. Leader-like clustering is less accurate than the QT or the Hierarchical algorithms since it compares each frame only with the clusters that have been already found along the trajectory, thus making the outcome dependent on the frame order.
More relevant improvements concern the Hierarchical and QT-like algorithms. Indeed, they have been both modified so that the original memory requirements have been almost halved. Furthermore, the actual clustering algorithms massively use multithreading in the CPU-intensive computation of the inter-frame distances (RMSD or DRMS). Also, the distance matrix can now be saved for later use, so that, if clustering with different threshold values is desired, the distance-computing step needs to be performed only once. Finally, the original two-pass clustering has been improved as well. In detail, after a first clustering run on a subset of frames, a second pass can be run that assigns each considered frame to the nearest cluster found in the first run. In the original version, frames with no clusters within a distance lower than the threshold were labelled as isolated. In contrast, Wordom now treats these isolated frames as new cluster centers, so that new clusters can be found and populated in the second run. This improves the overall accuracy and allows for a smaller portion of the total data set to be used in the first run.
When accuracy is paramount, the QT-like algorithm is probably the most appropriate, being more accurate than the leader-like one and significantly faster (with comparable accuracy) than the Hierarchical algorithm (Table 4). Yet, in spite of the improvements, it remains considerably memory-hungry. Therefore, when dealing with big data sets (>1M–10M frames, depending on the available computing power and memory), with which it is impossible to consider all frame–frame distances, the user can choose to either use QT on a subset of frames and then run a two-pass clustering, or to opt for the leader-like algorithm.
Determination of Free Energy Basins and Barriers
Wordom has two distinct modules, cFEP46 and KGA,40 devoted to the identification of (meta)stable states sampled by MD simulations. The key idea of cFEP and KGA is to group conformations not according to structural criteria, but mainly according to equilibrium kinetics. In this way, an analysis of the MD trajectory in terms of free energy basins, i.e., basins of attraction of the dynamics, is provided. The main advantage of cFEP with respect to KGA is the information on the height and location of the free-energy barriers along the cumulative partition function,46 which can be used to identify the transition state structure(s).52 On the other hand, the KGA procedure does not require the use of a reaction coordinate to determine the free-energy basins.40
For both cFEP and KGA procedures, MD snapshots (i.e., Cartesian coordinate sets) need to be finely clusterized and assigned to a discrete set of microstates. Clusterization can be done according to atomistic, i.e., RMSD-based clustering, or to coarse-grained representations such as secondary structure strings. Both RMSD-based and coarse-grained clustering have proven to be good discretization methods of MD trajectory snapshots into a set of microstates that describe large conformational changes (see Ref.40, 48, and53–56 for examples in protein folding). Application to large proteins requires more sophisticated clustering procedures like principal component space.57
Mincut-Based Free Energy Profile
The cFEP module refers to a rigorous method introduced by Krivov and Karplus46 for determining a one-dimensional free energy profile that preserves the barriers between free energy basins; given the barriers, free energy basins can be determined. The method uses the relative partition function,46 which is a reaction coordinate that takes into account all possible pathways to a reference state (e.g., the folded state).
The cFEP algorithm is based on a network description of the conformational dynamics. Each microstate (see above) represents a node of the conformation space network53, 58 and a link is made if a direct transition between two microstates is observed during the timeseries in a time step of a given size (see Fig. 5).59
The cFEP module implemented in Wordom is a precise and fast approximation of the minimum-cut method.60, 61 The free energy is projected as a function of the partition function relative to a reference node.39, 46 With this method, microstates are ranked according to their kinetic proximity with respect to a reference microstate (Fig. 6). The relative partition function is used as the progress coordinate, and the free energy barriers are determined as a function of it, either based on the probability of reaching the folded state before unfolding (pfold)46 or on the mean first passage time (mfpt)39 to a selected node (both calculated analytically from the transition matrix). The pfold implementation, which requires a target node, is appropriate to find barriers between two well-defined basins, which are specified by the user through the assignment of pfold = 1 to the representative node of one basin, and pfold = 0 to the representative node of the other. On the other hand, the mfpt-based method is more suitable for free energy profiles relative to a single target basin (e.g., for unfolding profiles), for which the representative node of the target basin is assigned mfpttarget = 0.
Kinetic Grouping Analysis
The free energy basins are determined by KGA on the basis of kinetic behaviors (fast relaxation at equilibrium) along an MD simulation.40 The KGA method is based on a network description of the conformational dynamics.
Two protein microstates are grouped in a basin if, along the MD trajectory, they interconvert frequently within a short commitment time τcommit, which represents a typical relaxation time within basins. The principle behind this approach is that if two conformations interconvert rapidly, they are not separated by a barrier and therefore belong to the same basin. The τcommit is a characteristic of the investigated system. It is an important (user-chosen) parameter of KGA and defines the resolution with which basins are isolated. A short τcommit will group structures only locally or if the free energy surface is very smooth. A longer τcommit is more generous and might group sub-basins, isolated by a shorter τcommit, into larger basins. The log–log plot of the distribution of first passage times to the native microstate (or a representative microstate of another basin) usually reflects two timescales: the inter- and intra-basin relaxation times (see Fig. 7 of Muff and Caflisch40). The barrier that separates the two regimes can give a good indication for the relaxation time.
The KGA module allows for isolation of either all relevant basins at once or of a single basin. In the first case, for a fixed commitment time τcommit, a matrix with interconversion (commitment) probabilities pcommit between any pair of microstates can be calculated in principle, and microstates with pcommit ≥ 0.5 are grouped together. Because the computational cost of all-against-all calculations increases quadratically, in practice one selects a subset of highly populated microstates (e.g., the 500 most populated microstates), calculates the pcommit-matrix and divides them into basins. In a post-processing step, all other microstates are assigned commitment probabilities to the isolated basins; finally, all microstates having a pcommit ≥ 0.5 to a given basin are assigned to it. Otherwise, the microstates remain unassigned. Both the heavy-microstate calculation and the post-processing are done by Wordom in the same function. On the other hand, if only one basin is of interest or if the relaxation times within basins lay on different timescales, it is better to choose an appropriate τcommit for each basin separately and then calculate the commitment probability (pcommit) according to it. In this way, it is possible to isolate basins one-by-one. In this case, the user has to run the procedure a number of times equal to the number of basins that need to be isolated. In addition to the τcommit, a representative microstate of each basin (usually the most populated microstate in the basin) has to be specified. Finally, all microstates that commit to the representative microstate of a basin with probability pcommit ≥ 0.5 are considered as part of that basin.
Python Bindings
Using the SWIG (simple wrappers and interface generator)62 tools, a python module has been written that gives access to most of Wordom's input/output functions and structures in the python environment via a simple import command. Basic analysis functions (e.g., RMSD, distances, atoms selections) are also exposed to the python environment. The availability of Wordom's input/output functions allows scripts to operate directly on molecular data, whereas access to Wordom's analysis functions makes it easy to compute properties on molecules or whole trajectories, and to further process the output without writing full-fledged C code or resort to temporary files. It is also practical to write the prototype of an analysis module in python and then convert it to C to enhance its performance, as has been done for some of the recently added modules.
Conclusions
Wordom is a user-friendly program for manipulating and analysing data from structural studies and molecular simulations. The latest release represents a significant improvement and enrichment of the original version published in 2007,1 as it provides new analysis tools that are unique to Wordom. These include new procedures for efficient structural analysis such as dynamic PSN and shortest communication path modules, which are effective tools to infer amino acids essential for stability and function as well as to unravel intra-molecular and inter-molecular communication mechanisms. Other novelties are user-friendly methods for determining free energy basins and barriers using the network of transitions sampled by MD simulations. With these new tools, Wordom can be used to analyze the free energy surface and therefore investigate the thermodynamics and kinetics of complex molecular processes, e.g., the reversible folding of structured peptides (Fig. 6).
Improvements include also the implementation of an interface with the popular scripting language Python.
Like the original version, this version of Wordom is released under the general purpose license (GPL), which allows anybody to download, modify, and redistribute both source code and binary files. This license has been adopted in order to foster diffusion and encourage contributions to the code.
Acknowledgments
The authors thank M. Cecchini, G. Settanni, and A. Vitalis for helpful discussions, Y. Valentini for technical help, S. Krivov for suggesting the modification to the leader clustering algorithm, and P. Schütz for providing Figure 6.
References
- 1.Seeber M, Cecchini M, Rao F, Settanni G, Caflisch A. Bioinformatics. 2007;23:2625. doi: 10.1093/bioinformatics/btm378. [DOI] [PubMed] [Google Scholar]
- 2.Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M. J Comput Chem. 1983;4:187. [Google Scholar]
- 3.Brooks BR, Brooks CL, III, Mackerell AD, Jr., Nilsson L, Petrella RJ, Roux B, Won Y, Archontis G, Bartels C, Boresch S, Caflisch A, Caves L, Cui Q, Dinner AR, Feig M, Fischer S, Gao J, Hodoscek M, Im W, Kuczera K, Lazaridis T, Ma J, Ovchinnikov V, Paci E, Pastor RW, Post CB, Pu JZ, Schaefer M, Tidor B, Venable RM, Woodcock HL, Wu X, Yang W, York DM, Karplus M. J Comput Chem. 2009;30:1545. doi: 10.1002/jcc.21287. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Lindahl E, Hess B, van der Spoel D. J Mol Model. 2001;7:306. [Google Scholar]
- 5.Case D, Cheatham T, III, Darden T, Gohlke H, Luo R, Merz K, Jr., Onufriev A, Simmerling C, Wang B, Woods R. J Comput Chem. 2005;26:1668. doi: 10.1002/jcc.20290. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Case D, Darden T, Cheatham T, III, Simmerling C, Wang J, Duke R, Luo R, Walker R, Zhang W, Merz K, Roberts B, Wang B, Hayik S, Roitberg A, Seabra G, Kolossvry I, Wong K, Paesani F, Vanicek J, Wu X, Brozell S, Steinbrecher T, Gohlke H, Cai Q, Ye X, Wang J, Hsieh M-J, Cui G, Roe D, Mathews D, Seetin M, Sagui C, Babin V, Luchko T, Gusarov S, Kovalenko A, Kollman P. San Francisco: University of California; 2010. [Google Scholar]
- 7.Humphrey W, Dalke A, Schulten K. J Mol Graphics. 1996;14:33. doi: 10.1016/0263-7855(96)00018-5. [DOI] [PubMed] [Google Scholar]
- 8.DeLano W. The PyMOL Molecular Graphics System. San Carlos, CA, USA: DeLano Scientific; 2002. [Google Scholar]
- 9.Feig M, Karanicolas J, Brooks C., III J Mol Graph Model. 2004;22:377. doi: 10.1016/j.jmgm.2003.12.005. [DOI] [PubMed] [Google Scholar]
- 10.Glykos N. J Comput Chem. 2006;27:1765. doi: 10.1002/jcc.20482. [DOI] [PubMed] [Google Scholar]
- 11.Meyer T, Ferrer-Costa C, Perez A, Rueda M, Bidon-Chanal A, Luque F, Laughton C, Orozco M. J Chem Theory Comput. 2006;2:251. doi: 10.1021/ct050285b. [DOI] [PubMed] [Google Scholar]
- 12.Grant B, Rodrigues A, ElSawy K, McCammon J, Caves L. Bioinformatics. 2006;22:2695. doi: 10.1093/bioinformatics/btl461. [DOI] [PubMed] [Google Scholar]
- 13.Hinsen K. J Comput Chem. 2000;21:79. [Google Scholar]
- 14.Grunberg R, Nilges M, Leckner J. Bioinformatics. 2007;23:769. doi: 10.1093/bioinformatics/btl655. [DOI] [PubMed] [Google Scholar]
- 15.Andersen CAF, Palmer AG, Brunak S, Rost B. Structure. 2002;10:174. doi: 10.1016/s0969-2126(02)00700-1. [DOI] [PubMed] [Google Scholar]
- 16.Carter P, Andersen C, Rost B. Nucleic Acids Res. 2003;31:3293. doi: 10.1093/nar/gkg626. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Buša J, Džurina J, Hayryan E, Hayryan S, Hu C, Plavka J, Pokornỳ I, Skřivánek J, Wu M. Comput Phys Commun. 2005;165:59. [Google Scholar]
- 18.Pascual-Ahuir J, Silla E, Tunon I. J Comput Chem. 1994;15:1127. [Google Scholar]
- 19.Tirion M. Phys Rev Lett. 1996;77:1905. doi: 10.1103/PhysRevLett.77.1905. [DOI] [PubMed] [Google Scholar]
- 20.Delarue M, Sanejouand Y. J Mol Biol. 2002;320:1011. doi: 10.1016/s0022-2836(02)00562-4. [DOI] [PubMed] [Google Scholar]
- 21.Kovacs J, Chacón P, Abagyan R. Proteins. 2004;56:661. doi: 10.1002/prot.20151. [DOI] [PubMed] [Google Scholar]
- 22.Zheng W, Brooks B, Doniach S, Thirumalai D. Structure. 2005;13:565. doi: 10.1016/j.str.2005.01.017. [DOI] [PubMed] [Google Scholar]
- 23.Bahar I, Atilgan A, Erman B. Fold Des. 1997;2:173. doi: 10.1016/S1359-0278(97)00024-2. [DOI] [PubMed] [Google Scholar]
- 24.Marques O, Sanejouand Y. Proteins. 1995;23:557. doi: 10.1002/prot.340230410. [DOI] [PubMed] [Google Scholar]
- 25.McCammon JA, Harvey S. Dynamics of Proteins and Nucleic Acids. Cambridge: Cambridge University Press; 1987. [Google Scholar]
- 26.Kraskov A, Stoegbauer H, Grassberger P. Phys Rev E. 2004;69:66138. doi: 10.1103/PhysRevE.69.066138. [DOI] [PubMed] [Google Scholar]
- 27.Lange O, Grubmuller H. Proteins. 2006;62:1053. doi: 10.1002/prot.20784. [DOI] [PubMed] [Google Scholar]
- 28.Kannan N, Vishveshwara S. J Mol Biol. 1999;292:441. doi: 10.1006/jmbi.1999.3058. [DOI] [PubMed] [Google Scholar]
- 29.Brinda KV, Vishveshwara S. Biophys J. 2005;89:4159. doi: 10.1529/biophysj.105.064485. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Vishveshwara S, Ghosh A, Hansia P. Curr Protein Pept Sci. 2009;10:146. doi: 10.2174/138920309787847590. [DOI] [PubMed] [Google Scholar]
- 31.del Sol A, Fujihashi H, Amoros D, Nussinov R. Mol Syst Biol. 2006;2:2006.0019. doi: 10.1038/msb4100063. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Ghosh A, Vishveshwara S. Proc Natl Acad Sci USA. 2007;104:15711. doi: 10.1073/pnas.0704459104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Dijkstra E. Numer Math. 1959;1:269. [Google Scholar]
- 34.Hartigan J. Clustering Algorithms. New York, NY, USA: Wiley; 1975. [Google Scholar]
- 35.Johnson S. Psychometrika. 1967;32:241. doi: 10.1007/BF02289588. [DOI] [PubMed] [Google Scholar]
- 36.Heyer L, Kruglyak S, Yooseph S. Genome Res. 1999;9:1106. doi: 10.1101/gr.9.11.1106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.del Sol A, O'Meara P. Proteins. 2005;58:672. doi: 10.1002/prot.20348. [DOI] [PubMed] [Google Scholar]
- 38.Amitai G, Shemesh A, Sitbon E, Shklar M, Netanely D, Venger I, Pietrokovski S. J Mol Biol. 2004;344:1135. doi: 10.1016/j.jmb.2004.10.055. [DOI] [PubMed] [Google Scholar]
- 39.Krivov SV, Muff S, Caflisch A, Karplus M. J Phys Chem B. 2008;112:8701. doi: 10.1021/jp711864r. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Muff S, Caflisch A. Proteins. 2008;70:1185. doi: 10.1002/prot.21565. [DOI] [PubMed] [Google Scholar]
- 41.Suhre K, Sanejouand Y. Nucleic Acids Res. 2004;32(Web Server Issue):W610. doi: 10.1093/nar/gkh368. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Eyal E, Yang L, Bahar I. Bioinformatics. 2006;22:2619. doi: 10.1093/bioinformatics/btl448. [DOI] [PubMed] [Google Scholar]
- 43.Zheng W. AD-ENM Web Server. 2010. Available at: http://enm.lobos.nih.gov. Accessed on October 12.
- 44.Lindahl E, Azuara C, Koehl P, Delarue M. Nucleic Acids Res. 2006;34(Web Server issue):W52. doi: 10.1093/nar/gkl082. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Hollup S, Salensminde G, Reuter N. BMC Bioinformatics. 2005;6:52. doi: 10.1186/1471-2105-6-52. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Krivov SV, Karplus M. J Phys Chem B. 2006;110:12689. doi: 10.1021/jp060039b. [DOI] [PubMed] [Google Scholar]
- 47.Vendruscolo M, Dokholyan N, Paci E, Karplus M. Phys Rev E. 2002;65:61910. doi: 10.1103/PhysRevE.65.061910. [DOI] [PubMed] [Google Scholar]
- 48.Paoli B, Pellarin R, Caflisch A. J Phys Chem B. 2010;114:2023. doi: 10.1021/jp910216j. [DOI] [PubMed] [Google Scholar]
- 49.Kabsch W, Sander C. Biopolymers. 1983;22:2577. doi: 10.1002/bip.360221211. [DOI] [PubMed] [Google Scholar]
- 50.Orengo C, Michie A, Jones S, Jones D, Swindells M, Thornton J. Structure. 1997;5:1093. doi: 10.1016/s0969-2126(97)00260-8. [DOI] [PubMed] [Google Scholar]
- 51.Van Wynsberghe A, Cui Q. Structure. 2006;14:1647. doi: 10.1016/j.str.2006.09.003. [DOI] [PubMed] [Google Scholar]
- 52.Muff S, Caflisch A. J Chem Phys. 2009;130:125104. doi: 10.1063/1.3099705. [DOI] [PubMed] [Google Scholar]
- 53.Rao F, Caflisch A. J Mol Biol. 2004;342:299. doi: 10.1016/j.jmb.2004.06.063. [DOI] [PubMed] [Google Scholar]
- 54.Rao F, Settanni G, Guarnera E, Caflisch A. J Chem Phys. 2005;122:184901. doi: 10.1063/1.1893753. [DOI] [PubMed] [Google Scholar]
- 55.Ihalainen J, Paoli B, Muff S, Backus E, Bredenbeck J, Woolley G, Caflisch A, Hamm P. Proc Natl Acad Sci USA. 2008;105:9588. doi: 10.1073/pnas.0712099105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Paoli B, Seeber M, Backus E, Ihalainen J, Hamm P, Caflisch A. J Phys Chem B. 2009;113:4435. doi: 10.1021/jp810431s. [DOI] [PubMed] [Google Scholar]
- 57.Yew Z, Krivov S, Paci E. J Phys Chem B. 2008;112:16902. doi: 10.1021/jp807316e. [DOI] [PubMed] [Google Scholar]
- 58.Gfeller D, De Los Rios P, Caflisch A, Rao F. Proc Natl Acad Sci USA. 2007;104:1817. doi: 10.1073/pnas.0608099104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Gfeller D, de Lachapelle DM, De Los Rios P, Caldarelli G, Rao F. Phys Rev E. 2007;76:026113. doi: 10.1103/PhysRevE.76.026113. [DOI] [PubMed] [Google Scholar]
- 60.Gomory R, Hu T. SIAM J Appl Math. 1961;9:551. [Google Scholar]
- 61.Krivov S, Karplus M. Proc Natl Acad Sci USA. 2004;101:14766. doi: 10.1073/pnas.0406234101. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Beazley D. The Simple Wrapper and Interface Generator. Available at: http://www.swig.org.