Abstract
Dynamic variations in the distances between pairs of atoms are used for clustering subdomains of biomolecules. We draw on a well-known target function for clustering and first show mathematically that the assignment of atoms to clusters has to be crisp, not fuzzy, as hitherto assumed. This reduces the computational load of clustering drastically, and we demonstrate results for several biomolecules relevant in immunoinformatics. Results are evaluated regarding the number of clusters, cluster size, cluster stability, and the evolution of clusters over time. Crisp clustering lends itself as an efficient tool to locate semirigid domains in the simulation of biomolecules. Such domains seem crucial for an optimum performance of subsequent statistical analyses, aiming at detecting minute motional patterns related to antigen recognition and signal transduction.
1. Introduction
Molecular dynamics (MD) can be used to investigate functional elements in biomolecules [1–5]. In addition to static structures (such as crystal structures stored in the protein data bank (PDB) [6]) molecular dynamics yields information on dynamic properties [7, 8], lending themselves for evaluation of, for example, signal transduction. However, key patterns of motion related to such a functional element may be hidden among a large amount of “other” movements, reflecting no more than ordinary thermal motility of the biomolecule. Molecular dynamics itself can be carried out along relatively standardized protocols [9, 10]. However, recognizing specific patterns of motion, which are deemed crucial for a functional element, remains a tricky task, requiring sophisticated statistical methods [11], such as principal component analysis [12, 13] or normal mode analysis [14].
For all the mentioned approaches, an initial and essential step is the “fitting” of the molecular structure of each time step of an MD trajectory (henceforward called frame) to a reference structure, x ref [15]. A given frame x i is first translated to let its centre of mass coincide with that of the reference frame. Then x i is rotated (around its centre of mass) to minimize square deviations between corresponding atoms of x i and x ref:
(1) |
In many approaches, RMSD has been used not only for fitting but also for directly (and successfully) quantifying molecular deformations [15], including structural changes, drifts, and trends [16]. In many cases, however, even sophisticated statistical methods fail, when applied to MD-frames after fitting. The suspicion is that the process of fitting itself might cause this failure. How does this come about?
By default, the GROMACS [17] fitting procedure uses atomic masses as weights for superposition of a structure's atomic coordinates to a reference structure. Accordingly, the fitting of x i “as a whole” is being optimized. In some cases, fitting the whole molecule may be inadequate and even conceal what one is searching for. For example, consider a molecule with one or more flexible loops. While the body of such a molecule behaves like a slightly deformable, rigid body, a loop may be conformationally flexible exhibiting largely uncorrelated movements with respect to the rest of the molecule. In the fitting criterion, however, atoms within the loop and those in the body may have equal weights. Since all deviations enter quadratic into (1), large movements of an even small number of loop-atoms may generate dominant contributions to the RMSD. In such a case, in order to minimize total RMSD, x i is rotated predominantly to accommodate for the few atoms within the loop. As a result, the large remaining body of the molecule has to “follow its own loop,” as if the tail chases the dog [maybe this was not the primary intention of fitting]. Needless to say, due to such movements caused by fitting, that minute motile elements may become totally submerged, without any chance of being retrieved from the trajectory, not even by sophisticated statistics.
The described situation is typical and demands more elaborate fitting methods. Choosing unequal weights suggests itself as a nearby and convenient solution. The more rigid parts of the molecule should receive more weight, the flexible ones less. However, how should one know, prior to fitting, which parts are semirigid and which are flexible?
One possibility would be a two-pass procedure, in the first pass fitting to the whole molecule with uniform weights (w i = 1) and evaluating the RMSFi:
(2) |
where 〈〉 denotes the average over a trajectory and x i,ref denotes a reference position of atom i, not changing over time. Note that RMSFi will highly depend on the choice of the reference position, which is usually the mean coordinate of atom i over the whole trajectory. Then, in a second pass of fitting, weights are chosen inversely proportional to the RMSFi, as reported by [18]. Highly motile atoms receive less weight and lose their role in shaking the remaining main parts of the molecule. However, this method suffers from the fact that RMSFi depends on the selection of x i,ref in the first pass of fitting; that is, the correction procedure depends on the error it is supposed to correct.
Another possibility is the identification of semirigid domains (clusters) within the molecule, as reported by [18]. In particular, the definition of clusters may be based on distances d ij = ||x i − x j|| between pairs of atoms rather than coordinates computed in the trajectory. The standard deviation of distance variation (STDDV) between an atomic pair (i, j) is given by
(3) |
where N is the number of atoms, 〈〉 denotes the average over a trajectory, and S ij is measured in nm. Evidently, pair-distances are unaffected by any kind of arbitrariness due to fitting.
Given a number of clusters (N clust), let c im denote the partial class membership of atom i in cluster m. For normalization we require
(4) |
The following criterion has been proposed to identify an optimum decomposition into a given number (N clust) of clusters [18]. Minimize the target function:
(5) |
under the constraints of (4). Once identified, any such cluster may be used as “primary fitting domain,” by assigning large weights to the atoms therein. With little motion within such a cluster, the motility of the remaining atoms of the molecule will appear relative to that cluster. This generally increases the chance of tracing relevant patterns of motion outside the cluster, for many statistical methods being applied.
The important difference from the known structure-analyzing tools of GROMACS, whether based on RMS deviation after fitting or RMS deviation of atom-pair distances, is that there is no need of a reference structure here. Also the clustering algorithms themselves, though with different criteria, assign conformations to a cluster for the molecule as a whole, while in our work, groups of atoms are assigned to the same cluster if their mutual distances vary little over time (a spatial clustering within the molecule).
MD trajectories for protein complexes were analyzed by clustering of averaged standard deviations of distance variation (STDDV); see below. Obtaining the most rigid cluster of atoms can be seen as the first step to facilitate the search for protein motions.
2. Methods
2.1. Construction of Complexes for Molecular Dynamics Simulation
We applied the clustering algorithm to a series of molecular systems as follows, see Table 1.
Table 1.
Molecular system | Simulation length |
---|---|
Penta-L-alanine (A5) | 1000 ns |
LC13 TCR/ABCD3/HLA-B*44:02 (B4402) | 250 ns |
LC13 TCR/ABCD3/HLA-B*44:03 (B4403) | 250 ns |
LC13 T cell receptor (TCR) in complex with major histocompatibility complex (MHC) HLA-B*44:05 and the ABCD3 peptide (EEYLQAFTY) has been successfully crystallized by Macdonald et al. [20] and its structure is accessible on http://www.pdb.org/ assigned the PDB ID 3KPS. However, there are no structure files of LC13 TCR in complex with HLA-B*44:02 and HLA-B*44:03. Therefore, we applied homology modelling to create these structures.
In-silico mutagenesis was carried out using Swiss PDB Viewer [21]. For generation of LC13/ABCD3/HLA-B*44:03, we used PDB structure 3KPS as a template and introduced mutations Y116D (numbers according to PDB numbering) and D156L to the MHC thus changing the HLA type from B*44:05 to B*44:03. For generation of LC13/ABCD3/HLA-B*44:02 we used PDB structure 3KPS as a template and introduced mutation Y116D to the MHC thus changing the HLA type from B*44:05 to B*44:02; see Figures 1 and 2.
2.2. Molecular Dynamics Simulation Protocol
The workflow of the molecular dynamics simulation of the penta-L-alanine system is closely related to that in the work of Bernhard and Noé [18]. MD simulation of penta-L-alanine was performed using GROMACS 4.0.7 [17] according to the following protocol.
First, we immersed penta-L-alanine in an explicit SPC [22] artificial water bath (cubic box) allowing for a minimum distance of 1 nm between peptide and box boundaries. Second, we minimized the solvated system using a steepest descent method. Next, we warmed up the system to 293 K during a 100 ps position restraint MD simulation. Finally, we carried out the MD production run with LINCS constraint algorithm acting on bonds with hydrogen atoms using an integration step of 2 fs and the GROMOS96 53a6 force field [23]. Coordinates were written to the trajectory every 2 ps. Coulomb interactions were computed using Particle Mesh Ewald (PME) with a maximum grid spacing of 0.12 nm and interpolation order 4. Both, Van der Waals and Coulomb interactions were computed with a cut-off at 1.4 nm. Berendsen temperature coupling to 293 K and Berendsen isotropic pressure coupling to 1 bar were used. All further parameters were set in accordance with Omasits et al. [24].
MD simulation of TCR/pMHC systems was performed using GROMACS 4.0.7 [17] according to the following protocol. First, we immersed the TCR/pMHC complex in SPC [22] artificial water bath (cubic box) allowing for a minimum distance of 2 nm between complex and box boundaries. Second, we added sodium and chloride ions to a concentration of 0.15 mol/L, and at the same time neutralizing the net charge of the system. Third, we minimized the energy of the solvated system using a steepest descent method. Next, we warmed up the system to 310 K during a 100 ps position restraints MD simulation. Finally, we carried out MD production runs with LINCS constraint algorithm acting on all bonds and using the GROMOS96 53a6 force field [23]. Hydrogen motions were removed allowing for an integration step of 5 fs. Coordinates were written to the trajectory every 50 ps. Coulomb interactions were computed using Particle Mesh Ewald (PME) with a maximum grid spacing of 0.12 nm and interpolation order 4. Both, Van der Waals and Coulomb interactions were computed with a cut-off at 1.4 nm. Velocity rescale temperature coupling to 310 K and Berendsen isotropic pressure coupling to 1 bar were used. All further parameters were set in accordance with Omasits et al. [24].
2.3. Optimization of Cluster Membership
Each atom i in a molecular dynamics simulation may be uniquely assigned to one of the N clust clusters considered [7, 25, 26], represented by a “crisp” vector of cluster-membership; for example, c im = [0,0, 0,1, 0,0] if atom i belongs to cluster 4 out of N clust = 6 clusters. Alternatively, each atom i may be considered to belong to several clusters simultaneously, represented by fuzzy, noninteger memberships, with normalization condition see (4). Fuzzy memberships are the more general case, it seems that they might yield lower minima of the target function than crisp memberships and should therefore be preferred. Interestingly, Bernhard and Noé [18] report that fuzzy memberships, upon optimization with a gradient method, tend to end up as crisp, that is, either 0 or 1. We have scrutinized this issue and will demonstrate how this comes about. Even more, as one of the main results of this work, we will prove that the solution has to be crisp. The proof is given via mathematical arguments; see results section. This finding allows us to restrict the search space to crisp memberships, without diminishing the generality of the optimization problem posed.
2.3.1. Optimization of Crisp Cluster Memberships by a Two-Stage Monte Carlo Method
Initially, the number of clusters, N clust, is chosen and the target function (5) has to be minimized. We will show that, under certain assumptions (see Section 3.1), it is sufficient to search in
(6) |
In a less formal formulation, the objective is to assign each of the N atoms to one particular cluster (crisp memberships).
It may become quite tricky to attack such a problem with an analytic gradient approach since the boundary conditions are usually difficult to handle and the search domain consists of isolated points. We have therefore chosen a two-step search, in which a random process is succeeded by an exhaustive search.
Every constellation (i.e., Monte Carlo trial for improvement) which cannot be improved by a single move of one of the N atoms from one cluster to another is considered a result (minimum constellation). The result with the lowest q(c) is the ground state, but all other minimum constellations should also be included in the further analysis.
2.3.2. Search Algorithm
(i) Start-Up. Initially, each of the N components is randomly assigned to one (of the N clust) cluster.
(ii) Random Search. In the first step, each of the N components (atoms) is moved from its current cluster to another randomly chosen cluster with probability P. If this mutation yields a reduction in q(c) the new constellation is preserved; otherwise, it is rejected. This process is repeated K times. A very rudimentary benchmarking analysis has shown that P = 1/N and K = N · N clust are reasonable values to use.
(iii) Exhaustive Search. In the second step, the algorithm tries to improve q(c) by single step moves for each component separately. If there is no possible move to improve q(c), the constellation is necessarily a local minimum in our sense.
(iv) Ground State. Usually there is a large number of minimum constellations in the above sense. For a matrix with significant structure, the ground state will be reached after only a few trials. For ill-conditioned matrices (those without structure), a minimum constellation very close to the ground state will also be found after a few trials, although the absolute ground state might be difficult to find.
3. Results
3.1. Crisp Cluster Membership as a Necessary Consequence
We formulate and prove a lemma that crisp memberships are a necessary consequence of the topology of the multidimensional space of pair-distance standard deviations.
Due to its definition, S is a symmetric, nonsingular N × N adjacency matrix whose entries are the standard deviations explained earlier (3); thus, S ij > 0 for i ≠ j and S ii = 0. Let
(7) |
The objective is to find for c ∈ Ω.
3.1.1. Lemma
If S is a symmetric, nonsingular N × N matrix with nonnegative entries and , for c ∈ Ω, then .
To prove this lemma one uses Lagrange multipliers:
(8) |
Since q(c) is a polynomial of order 2, the derivatives with respect to c im and λ i yield a system of N × (N clust + 1) linear equations of the form:
(9) |
with I N being N × N identity matrix and c m = (c 1m,…, c Nm). The determinant of the matrix is −N clustdet(S)Nclust−1 and therefore, under the assumption det (S) ≠ 0, there must be a unique solution, which is given by
(10) |
Unfortunately, this solution yields the maximum of q(c) for c ∈ Ω. However, since Ω is convex and bounded, one knows that any argmin q(c) must be on δΩ. Therefore, there must be at least one c i′m′ = 0. Reducing the above system of linear equations by this constraint, one sees that the revolving system again has a unique solution:
(11) |
which again gives a maximum of q(c). The derivation of the value of λ i is rather involved and not shown.
With the same argument as before, one can proceed by setting all c im equal to zero with the exception of one c im for each m. This iterative procedure shows that clustering with respect to q(c) leads to a unique assignment of each of the N components to one particular cluster.
3.2. Heuristic Evaluation of Solution Space around the Minima
Our theoretical result, that memberships are crisp, can be illustrated very intuitively; see Figure 3. Left and right end of x-axis correspond to constellations with minimum target function, located at the boundary, and no other minimum is found in between.
3.3. Clusters of Atomic Motions in MD Trajectory
For above mentioned TCR/pMHC complexes B4402 and B4403, STDDV matrices were computed from the MD trajectories; see Figure 4. Only the second parts of the trajectories (corresponding to approx. 125 ns simulation time) were considered to exclude relaxation effects; see also Section 3.6.1.
STDDV matrices were clustered as described above for N clust = 2 to 6. After computation, clusters are renumbered according to size (=number of atoms), the largest one always being labelled as cluster 1. For N clust = 5,6 cluster memberships for B4402 and B4403 were remapped onto the protein structure and displayed in VMD [27]; see Figure 5.
Note that NO information whatsoever about secondary structural elements, such as α-helices and β-sheets, has entered the clustering procedure. Still, clusters more or less seem to retrieve some of these structural elements; see Figure 5. This could be related to extensive hydrogen bonding in α-helices and β-sheets stabilizing these secondary protein structure elements. How could bond constraints in MD simulations influence the resulting clusters? In our calculations, we just considered the protein backbone, because amino acids side chains show larger spatial fluctuations. The backbone C α atoms are separated by a planar and rigid amid bond, so neighboring C α atoms will experience less variation in distance.
3.4. More Clusters Improve Target Function
The number of clusters has to be preselected in our approach. If we had just one cluster, the total distance variability contained in matrix S would be part of that cluster. Increasing the number of clusters generally reduces the fraction of variability contained within clusters, expressed as percentage of total in Figure 6.
3.5. Larger Clusters Turn Out to Be More Rigid
Clusters were constructed to achieve maximum internal “rigidity,” that is, a minimum sum of pair-distance standard deviations. One might expect that large clusters, since they accommodate many atoms within larger spatial domains, should turn out to be less rigid then smaller clusters. However, the opposite is true: larger clusters turn out to be more rigid; see the declining trend of normalized STDDV with increasing cluster size in Figure 7. This demonstrates again that structures in motility are captured via clustering. If there is no structure within matrix S, normalized cluster sizes would result nearly equal, that is, centered around 1.
Clearly, clusters do not result identical for different subsections of a MD trajectory. In Figure 7, data are shown for two trajectories and 50 subsections each, each clustered for N clust = 5 and 6. For details, see legend of Figure 7. 100 Monte Carlo attempts were performed for each clustering, out of which the optimum (smallest target function) was adopted. These results confirm the general trend that larger clusters are more rigid.
An overview of dispersions within and between clusters is given in Figure 8, the numerical results for 6 clusters being given in Table 2.
Table 2.
1 | 2 | 3 | 4 | 5 | 6 | |
---|---|---|---|---|---|---|
1 | 0.058 | 0.115 | 0.093 | 0.145 | 0.178 | 0.167 |
2 | 0.115 | 0.068 | 0.146 | 0.193 | 0.219 | 0.129 |
3 | 0.093 | 0.146 | 0.079 | 0.123 | 0.175 | 0.212 |
4 | 0.145 | 0.193 | 0.123 | 0.075 | 0.210 | 0.264 |
5 | 0.178 | 0.219 | 0.175 | 0.210 | 0.090 | 0.241 |
6 | 0.167 | 0.129 | 0.212 | 0.264 | 0.241 | 0.124 |
The full trajectory of protein complex B4402 was clustered into 6 clusters. Numbers in main diagonal give averaged STDDVs [nm] within clusters, corresponding to left markers (symbols: mean, minimum, maximum) in Figure 8. Off-diagonal values relate to STDDV between clusters, corresponding to right marker in Figure 8.
3.6. Stability of Clusters
Clusters have been evaluated regarding stability, in order to check whether they lend themselves as reliable semirigid domains for fitting MD-configurations. Of note that (at least) two sources of variability of cluster memberships need to be scrutinized as follows:
variability due to the stochastic nature of the Monte Carlo clustering method and
variability due to different parts of an MD trajectory being clustered.
We will demonstrate that variability due to our Monte Carlo clustering method is negligible. As opposed to that, the “adequate” choice and preparation of the MD trajectory has tremendous impact and remains an issue of a never ending debate [28–32].
3.6.1. Variability between Different Parts of a Single MD Trajectory
Adequate sampling of phase space is essential regarding MD-simulations [33]. Much work has been done to detect changes, drifts, and trends as markers for inadequate sampling [16, 29, 34]. Block averaging was proposed as one of the remedies [35].
In this work, the clustering presented above was based on matrices S computed from whole 250 ns MD trajectories. Clearly, the matrices S(t 1, t 2) for each subset (t 1, t 2) of a trajectory would be different, entailing different results for clustering. The question is which is the most reliable clustering for a given molecule?
To answer this question, we shall quantify the variability of clusters for subsets, relate it to the result for the whole trajectory, and derive a “stiff kernel,” that is, those atoms which do not (or very rarely) change clusters between subsets of the trajectory.
First, we inspect the total dispersion contained within S ij, evaluated separately for 50 subsets of 5 ns each (i.e., 100 frames out of 5000 frames in a whole trajectory); see Figure 9. In this trajectory we observe an irregular oscillating time behavior, starting with a declining tendency. The existence of such a substantial initial phase indicates that the influence of the starting configuration does not die out until after (roughly) half of the total simulation time has passed by. To account for this fact in a heuristic way, straight lines were fitted to the first and second half of the trajectory, allowing for some overlap. This accommodates with the finding of our previous work [29] that only the second half of the trajectory can be considered an unbiased sample from phase space and should be taken for further evaluations.
3.6.2. The Path along Most Stable Clusterings
For a preselected (number of clusters) N clust, clustering was performed for each of the 50 subset-trajectories (5 ns each), yielding an assignment for each atom to one of the clusters. Note that clusters were primarily labeled (cluster 1, 2, etc.) according to their size in that particular clustering (see Figure 10). Thus, the following situation may occur. Given a clustering of a subset-trajectory with first and second cluster about equal in size. Then, when clustering the following subset-trajectory, a few atoms from cluster 1 may end up in (i.e., “migrate” into) cluster 2, which may suffice to make cluster 2 now the largest cluster and therefore receiving the label “cluster 1” in the second clustering. This would yield a very peculiar result. The few “migrating atoms” would formally belong to the same cluster (in this example cluster 1), while the majority of atoms would switch between clusters 1 and 2. To avoid this misleading and undesired artifact, we refined the procedure as follows. In the first clustering, clusters were assigned labels (1, 2, etc.) according to decreasing size. After each subsequent clustering, we evaluated all permutations of cluster labels regarding the number of “migrating atoms” with respect to the first clustering. That permutation of labels, which yielded a minimum of migrating atoms, was finally adopted. As an example, the resulting number of migrating atoms is shown for B4402 and N clust = 2 in Figure 11.
3.6.3. Quantifying the Stability of Cluster Assignments
In Figure 11 the number of migrating atoms was considered. Now, we evaluate which atoms migrate. For quantification, we resort to the Kullback-Leibler-distance [36, 37]:
(12) |
1/N clust represents the assumed background probability if the assignment of atom i to any of the clusters were equally probable. p im is the actual probability for atom i to belong to cluster m, estimated from an average over cluster memberships c im obtained from clustering subsets of the trajectory:
(13) |
Large values of KLDi indicate that atom i stays predominantly in the same cluster throughout the trajectory. On the contrary, values of KLDi close to zero indicate a random distribution of an atom between all clusters. Figure 12 shows KLDi for B4402 and N clust = 5.
4. Discussion
4.1. Clustering Reflects Structure within STDDV-Matrix
Target function (5) only counts distance variability within the clusters, not between atoms belonging to different clusters. Thus, if we reorder atoms according to their cluster membership, clusters appear as squares along the main diagonal of the matrix S; see Figure 15. If we hypothetically assume that elements Sij are more or less homogenously distributed across the matrix, the “area” of each cluster in the matrix will roughly correspond to the variability within that cluster. Clearly, these squares have to be of equal size to make their joint area minimum (and thus minimize the total distance variation within clusters).
We have verified this prognosis by randomly rearranging elements of matrix S and then performing the clustering procedure. Cluster sizes resulted almost equal (for illustration see left panel of Figure 13) in each of 20 trials of random rearrangement and clustering. This finding was verified for 2 ≤ N clust ≤ 6 (data not shown).
As opposed, clusters of very different sizes resulted for matrices S derived from MD simulation (illustrated in right panel of Figure 13). This clearly indicates that clusters of unequal size do not result by chance but reflect distinct dependencies within S. The sensitivity of clustering to existing structures within S is also reflected in target function q(S), as can be seen by comparing matrices S from MD and their randomly rearranged counterparts; see Figure 14. In both cases averaged STDDV decreases (improves) with increasing number of clusters. However, in presence of real dependencies between atomic mobility, clustering achieves much more improvement.
4.2. Many Small Clusters Are More Rigid but Cover Less Distance Variation
One might ask why a larger cluster number is less favorable. Obviously, the extreme case of defining each single pair of atoms as a separate cluster would lead to minimum STDDV within clusters but would represent a trivial and useless solution. Note that STTDV within clusters decreases with increasing N clust, which makes clusters more homogeneous and is a desired effect. However, at the same time the overall amount of variability caught within clusters also decreases, which is an undesired effect, since larger portions of the molecule are disregarded. These trends are reflected by the total area of coloured squares along the diagonal in Figure 13. The smaller the squares (with increasing N clust), the smaller their total area, even if there are more squares (the area decreases quadratic with the side lengths of squares).
This tendency has been quantitatively demonstrated for real MD-data in Figure 6. On top of that, Figure 14 shows that the same trend also holds for unstructured matrices, as mentioned above. However, results also show that matrices without structure (randomly rearranged elements) allow for very little reduction in the STDDV covered within clusters, as compared to structured matrices and that this difference further increases with N clust.
4.3. Clustering Is Stable and Self-Contained
Powerful statistics on MD-trajectories (which is, e.g., able to detect small motions related to signaling) needs careful fitting of configuration frames as a prerequisite. Fitting to a domain which should be as rigid and large as possible is one of the options.
Hence, finding large clusters is desirable. However, large clusters in general might prove unstable. Therefore, we have carefully investigated this issue for the target function proposed by Bernhard and Noé [18] in conjunction with our clustering method.
It turned out that larger clusters are even more stable (see Figure 7), which is a strong indicator of stability and self-containment of our results.
4.4. Ergodicity
Atomic motions in a large molecule constitute a highly dimensional phase space, and MD-simulations can in most cases explore only part of the total phase space. At least, one can never be sure about the exact fraction of phase space actually visited in a specific simulation run. As a consequence we use only the second half of our MD trajectory for final clustering (see Figure 5), in accordance with our previous work [29].
4.5. Computational Resources
Clustering takes very little iteration steps for A5, due to monotonous, continuous relationships between elements of S; see the appendix. Note that the rapidity of clustering in this case is not only a primary consequence of the small number of atoms but a matter of simplicity in structure.
Randomized matrices S take significantly more iterations for clustering, since many minima are almost equal regarding the target function, rendering solutions ambiguous. As opposed to this, clustering large molecules with structured internal motion, such as B4402 and B4403, yields well-defined minima after a reasonable number of trials.
4.6. Cluster Interpretation
There is an obvious difference in visual appearance between STDDV matrices for B4402 and B4403 as seen in Figure 4. Trajectory B4402 yields an unstructured, rather flat STDDV matrix (upper panel in Figure 4), while B4403 shows a distinctly structured STDDV matrix (lower panel in Figure 4). The relation between cluster rigidity and size is illustrated in Figure 7 and shows a clear trend: cluster rigidity increases with increasing cluster size, reflected in a decreasing STDDV within clusters. However, this does not mean that the largest cluster is always the most rigid cluster (i.e., has lowest STDDV). For N clust = 5, we see that in B4403 the largest cluster is at the same time the most rigid one. For N clust = 5 in B4402 the second largest cluster is the most rigid one. For N clust = 6 the situation is inverted: in B4402 the largest cluster is the most rigid one. In B4403 the second largest cluster is most rigid.
All in all, the most rigid cluster was always found among the two largest clusters. The mappings of the clustering results for N clust = 5 and N clust = 6 for B4402 and B4403 have been displayed in Figure 5.
4.7. Conclusion and Prospects
A main result of the paper is the finding that the target function proposed by others [18] has only crisp solutions; that is, each atom belongs to one single cluster only (and not to several clusters in a fuzzy sense). This finding allows for a much more efficient search for optimum clustering.
Based on this new finding, the process of clustering was evaluated regarding various aspects to provide concomitant information for possible application by other investigators. Applicability was demonstrated for two trajectories of 250 ns each for large biomolecular complexes whose dynamics is of key importance for the understanding of immune reactions.
Further improvements can be expected from a more detailed investigation of the Kullback-Leibler distance [36, 37]. In this work it has only been reported as a means for assessing the quality of clustering by some given method. In future work, the Kullback-Leibler-distance may enter the clustering procedure itself and render clusters even more stable between subtrajectories and over time.
Acknowledgments
The MD trajectories used in the present work were generated on the IBM-BlueGene computer facility at Bulgarian National Centre for Supercomputing Applications (NCSA). The work was supported in part by BSF and OeAD under Grants DCVP 02/1/2009, DNTS-A 01-2/2013, and WTA-BG 06/2013. We gratefully acknowledge mathematical advice and helpful discussions from Professor Rudolf Karch, PhD and Peicho Petkov, PhD, and technical assistance by Michael Cibena.
Appendices
A. Pair-Distance Variations in a Small Molecule
For comparison, we applied our clustering method also to the A5 penta-L-alanine peptide already analyzed by Bernhard and Noé [18]. They choose penta-L-alanine to evaluate their clustering algorithm on MD simulations for reasons of simplicity of this molecule. A5 has four amide bonds, each comprising four atoms (CONH). The delocalization of the nitrogen's free electron between conjugated carbonyl and amine groups poses planarity and rigidity onto this structure.
Mu et al. [38] showed that A5 does not remain in an alpha-helical conformation, but rather exhibits repeated folding and unfolding events. As expected for such a molecule, our clustering of the averaged STDDV matrix indicates that atoms close together show little fluctuations of their mutual distances, such as the atoms in the middle of the pentapeptide; they are, so to speak, in the centre of the storm. Conversely, atoms near the edges show large distance fluctuations with respect to all other atoms.
Clustering such a homogeneous matrix S does not reveal structural elements within the molecule, but rather splits it into equal parts, according to the number of clusters preselected. Note, however, that this behavior is a property of the equally distributed matrix values Sij and not a general rule.
B. Software Availability
The software used for clustering into crisp domains of preselected number is currently being implemented in Java and will be made available for free download from http://www.meduniwien.ac.at/msi/biosim/index.php?lang=en&seite=en_forLehreIt_pairDistanceClustering.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
- 1.Bao G. Mechanics of biomolecules. Journal of the Mechanics and Physics of Solids. 2002;50(11):2237–2274. [Google Scholar]
- 2.Lavery R, Lebrun A, Allemand J-F, Bensimon D, Croquette V. Structure and mechanics of single biomolecules: experiment and simulation. Journal of Physics Condensed Matter. 2002;14(14):R383–R414. [Google Scholar]
- 3.Gordon D, Chen R, Chung SH. Computational methods of studying the binding of toxins from venomous animals to biological ion channels: theory and applications. Physiological Reviews. 2013;93:767–802. doi: 10.1152/physrev.00035.2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Cui Y. Using molecular simulations to probe pharmaceutical materials. Journal of Pharmaceutical Sciences. 2011;100(6):2000–2019. doi: 10.1002/jps.22392. [DOI] [PubMed] [Google Scholar]
- 5.Adcock SA, McCammon JA. Molecular dynamics: survey of methods for simulating the activity of proteins. Chemical Reviews. 2006;106(5):1589–1615. doi: 10.1021/cr040426m. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Berman HM, Westbrook J, Feng Z, et al. The protein data bank. Nucleic Acids Research. 2000;28(1):235–242. doi: 10.1093/nar/28.1.235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Wriggers W, Schulten K. Protein domain movements: detection of rigid domains and visualization of hinges in comparisons of atomic coordinates. Proteins. 1997;29:1–14. [PubMed] [Google Scholar]
- 8.Taylor WR. Protein structural domain identification. Protein Engineering. 1999;12(3):203–216. doi: 10.1093/protein/12.3.203. [DOI] [PubMed] [Google Scholar]
- 9.Haile JM. Molecular Dynamics Simulation: Elementary Methods. John Wiley & Sons; 1992. Fundamentals. [Google Scholar]
- 10.Haile JM. Molecular Dynamics Simulation: Elementary Methods. John Wiley & Sons; 1992. Hard spheres. [Google Scholar]
- 11.Berendsen HJ, Hayward S. Collective protein dynamics in relation to function. Current Opinion in Structural Biology. 2000;10(2):165–169. doi: 10.1016/s0959-440x(00)00061-0. [DOI] [PubMed] [Google Scholar]
- 12.Amadei A, Linssen ABM, Berendsen HJC. Essential dynamics of proteins. Proteins: Structure, Function and Genetics. 1993;17(4):412–425. doi: 10.1002/prot.340170408. [DOI] [PubMed] [Google Scholar]
- 13.Balsera MA, Wriggers W, Oono Y, Schulten K. Principal component analysis and long time protein dynamics. Journal of Physical Chemistry. 1996;100(7):2567–2572. [Google Scholar]
- 14.Hayward S, De Groot BL. Normal modes and essential dynamics. Methods in Molecular Biology. 2008;443:89–106. doi: 10.1007/978-1-59745-177-2_5. [DOI] [PubMed] [Google Scholar]
- 15.García AE. Large-amplitude nonlinear motions in proteins. Physical Review Letters. 1992;68(17):2696–2699. doi: 10.1103/PhysRevLett.68.2696. [DOI] [PubMed] [Google Scholar]
- 16.Zhang X, Bhatt D, Zuckerman DM. Automated sampling assessment for molecular simulations using the effective sample size. Journal of Chemical Theory and Computation. 2010;6(10):3048–3057. doi: 10.1021/ct1002384. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Hess B, Kutzner C, Van Der Spoel D, Lindahl E. GRGMACS 4: algorithms for highly efficient, load-balanced, and scalable molecular simulation. Journal of Chemical Theory and Computation. 2008;4(3):435–447. doi: 10.1021/ct700301q. [DOI] [PubMed] [Google Scholar]
- 18.Bernhard S, Noé F. Optimal identification of semi-rigid domains in macromolecules from molecular dynamics Simulation. PLoS ONE. 2010;5(5) doi: 10.1371/journal.pone.0010491.e10491 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Robinson J, Halliwell JA, McWilliam H, Lopez R, Parham P, Marsh SG. The IMGT/HLA database. Nucleic Acids Research. 2013;41:D1222–D1227. doi: 10.1093/nar/gks949. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Macdonald WA, Chen Z, Gras S, et al. T cell allorecognition via molecular mimicry. Immunity. 2009;31(6):897–908. doi: 10.1016/j.immuni.2009.09.025. [DOI] [PubMed] [Google Scholar]
- 21.Guex N, Peitsch MC. SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis. 1997;18(15):2714–2723. doi: 10.1002/elps.1150181505. [DOI] [PubMed] [Google Scholar]
- 22.Berendsen HJ, Postma JPM, Van Gunsteren WF, Hermans J. Interaction models for water in relation to protein hydration. Intermolecular Forces. 1981;14:331–342. [Google Scholar]
- 23.Oostenbrink C, Villa A, Mark AE, Van Gunsteren WF. A biomolecular force field based on the free enthalpy of hydration and solvation: the GROMOS force-field parameter sets 53A5 and 53A6. Journal of Computational Chemistry. 2004;25(13):1656–1676. doi: 10.1002/jcc.20090. [DOI] [PubMed] [Google Scholar]
- 24.Omasits U, Knapp B, Neumann M, et al. Analysis of key parameters for molecular dynamics of pMHC molecules. Molecular Simulation. 2008;34(8):781–793. [Google Scholar]
- 25.Yesylevskyy SO, Kharkyanen VN, Demchenko AP. Hierarchical clustering of the correlation patterns: new method of domain identification in proteins. Biophysical Chemistry. 2006;119(1):84–93. doi: 10.1016/j.bpc.2005.07.004. [DOI] [PubMed] [Google Scholar]
- 26.Shibuya T. Fast hinge detection algorithms for flexible protein structures. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2010;7(2):333–341. doi: 10.1109/TCBB.2008.62. [DOI] [PubMed] [Google Scholar]
- 27.Humphrey W, Dalke A, Schulten K. VMD: visual molecular dynamics. Journal of Molecular Graphics. 1996;14(1):33–38. doi: 10.1016/0263-7855(96)00018-5. [DOI] [PubMed] [Google Scholar]
- 28.Allen MP. Introduction to Molecular Dynamics Simulation. John von Neumann Institute for Computing; 2004. [Google Scholar]
- 29.Schreiner W, Karch R, Knapp B, Ilieva N. Relaxation estimation of RMSD in molecular dynamics immuno-simulations. Computational and Mathematical Methods in Medicine. 2012;2012:9 pages. doi: 10.1155/2012/173521.173521 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Clarage JB, Romo T, Andrews BK, Pettitt BM, Phillips GN., Jr. A sampling problem in molecular dynamics simulations of macromolecules. Proceedings of the National Academy of Sciences of the United States of America. 1995;92(8):3288–3292. doi: 10.1073/pnas.92.8.3288. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Daggett V. Long timescale simulations. Current Opinion in Structural Biology. 2000;10(2):160–164. doi: 10.1016/s0959-440x(00)00062-2. [DOI] [PubMed] [Google Scholar]
- 32.Hess B. Convergence of sampling in protein simulations. Physical Review E. Statistical, Nonlinear, and Soft Matter Physics. 2002;65(3) doi: 10.1103/PhysRevE.65.031910.031910 [DOI] [PubMed] [Google Scholar]
- 33.Smith LJ, Daura X, Van Gunsteren WF. Assessing equilibration and convergence in biomolecular simulations. Proteins. Structure, Function and Genetics. 2002;48(3):487–496. doi: 10.1002/prot.10144. [DOI] [PubMed] [Google Scholar]
- 34.Grossfield A, Zuckerman DM. Quantifying uncertainty and sampling quality in biomolecular simulations. Annual Reports in Computational Chemistry. 2009;5:23–48. doi: 10.1016/S1574-1400(09)00502-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Flyvbjerg H, Petersen HG. Error estimates on averages of correlated data. The Journal of Chemical Physics. 1989;91(1):461–466. [Google Scholar]
- 36.McClendon CL, Hua L, Barreiro A, Jacobson MP. Comparing conformational ensembles using the Kullback-Leibler divergence expansion. Journal of Chemical Theory and Computation. 2012;8:2115–2126. doi: 10.1021/ct300008d. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Kullback S, Leibler RA. On information and sufficiency. Annals of Mathematical Statistics. 1951;22:79–86. [Google Scholar]
- 38.Mu Y, Nguyen PH, Stock G. Energy landscape of a small peptide revealed by dihedral angle principal component analysis. Proteins. Structure, Function and Genetics. 2005;58(1):45–52. doi: 10.1002/prot.20310. [DOI] [PubMed] [Google Scholar]