Abstract
Molecular dynamics (MD) simulations now routinely reach timescales of microseconds and beyond. This has led to a corresponding increase in the amount of MD trajectory data that needs to be stored, particularly when those trajectories contain explicit solvent molecules. As such, it is desirable to be able to compress trajectory data while retaining as much of the original information as possible. In this work, we describe compressing MD trajectory data using the NetCDF4/HDF5 file format, making use of quantization of the original positions to achieve better compression ratios. We also analyze the effect this has on both the resulting positions and the energies calculated from post‐processing these trajectories, and recommend an optimal level of quantization. Overall, we find the NetCDF4/HDF5 format to be an excellent choice for storing MD trajectory data in terms of speed, compressibility, and versatility.
Keywords: data analysis, energy calculation, molecular dynamics, trajectory compression
1. INTRODUCTION
Molecular dynamics (MD) simulations of biomolecules typically produce a variety of data, the largest of which tends to be the atomic coordinates saved periodically throughout the simulation (often referred to as the MD “trajectory”); output may also include atomic velocities, forces, energies, and so on. Thanks to advances in GPU hardware, modern MD simulations can readily generate gigabytes (GB) to terabytes (TB) of data.1,2 Compression is an attractive solution to the problem of how to store very large MD trajectories. Simply put, data compression is reducing the number of bits needed to store that data. Take for example the sentence:
“the quick brown fox jumped over the lazy dog”.
One could compress this sentence by noticing that the character string “the ” appears twice and using a single character like “$” to represent it:
“$quick brown fox jumped over $lazy dog”.
In this case, the original sentence can be reconstructed (as long as it is known $ stands for “the”). Another way to compress the sentence is to omit the word “the” entirely:
“quick brown fox jumped over lazy dog”.
In this case, the new sentence sounds somewhat awkward but still retains its meaning. In both cases, the number of characters has been reduced and the data is “compressed”. These simplistic examples illustrate the two types of compression: “lossless” and “lossy”. Lossless compression reduces the size of data without losing any information; it works by removing redundancy in the data. The Lempel–Ziv–Welch (LZW) algorithm is a widely used example of a lossless compression algorithm.3 This approach tends to work well for data with lots of redundancy, such as plain ASCII text, but works less well for binary numerical data (especially data representing real numbers), where repeated patterns are harder to identify. Lossy compression works by removing data that are not essential for a mostly accurate reconstruction of the original, that is, data that will not be missed. The discrete cosine transform (DCT) algorithm is an example of a lossy compression algorithm.4 Lossy compression algorithms sacrifice some fidelity to the original data for higher compression ratios.
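To make the lossless case concrete, the following minimal C sketch (assuming zlib, the library behind gzip, is available) compresses the example sentence and recovers it exactly; note that for inputs this small the format overhead can exceed the savings, so redundancy removal only pays off on larger data:

```c
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    const char *text = "the quick brown fox jumped over the lazy dog";
    unsigned char comp[128], decomp[128];
    uLongf clen = sizeof(comp), dlen = sizeof(decomp);

    /* Lossless round trip: compress, then decompress. */
    compress2(comp, &clen, (const Bytef *)text, strlen(text) + 1, 9);
    uncompress(decomp, &dlen, comp, clen);

    /* decomp now holds the original sentence byte-for-byte. */
    printf("%lu -> %lu bytes: \"%s\"\n",
           (unsigned long)(strlen(text) + 1), (unsigned long)clen,
           (char *)decomp);
    return 0;
}
```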
There currently exist a wide variety of MD trajectory formats that use either human‐readable ASCII (i.e., text) characters or directly written binary. In general, binary formats are preferred as they are faster to read/write and store more information per bit than ASCII formats, although ASCII formats tend to be more compressible. It is possible to compress MD trajectories after the fact with commonly available programs such as gzip (which uses the DEFLATE algorithm5) or bzip2 (which uses the Burrows–Wheeler transform6). There are also trajectory formats with compression schemes built in. The Byte Structure Variable Length Coding (BS‐VLC) algorithm of Melo et al.7 uses a combination of coordinate delta coding between frames (i.e., storing the differences between coordinates instead of the coordinates themselves, as sketched below) and quantization. The Essential Dynamics method of Meyer et al.8 (implemented in software like pyPcazip9) reduces trajectories to eigenvectors and projections via principal component analysis of the Cartesian coordinates; this scheme was combined with the DCT method by Kumar et al.10,11 to achieve more compression.
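To illustrate the delta coding idea referenced above (a generic sketch, not the BS‐VLC algorithm itself), consider a single quantized integer coordinate tracked across n frames; because atoms move only slightly between frames, the deltas cluster near zero and compress much better than the raw values:

```c
#include <stddef.h>

/* Delta-encode in place: keep the first value, then store only the
 * frame-to-frame differences. Walk backwards so each element still
 * sees its original predecessor. Assumes values fit in an int. */
void delta_encode(int *x, size_t n)
{
    if (n < 2)
        return;
    for (size_t i = n - 1; i > 0; i--)
        x[i] -= x[i - 1];
}

/* Decode via prefix sum; with integers the round trip is exact. */
void delta_decode(int *x, size_t n)
{
    if (n < 2)
        return;
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}
```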
The BS‐VLC and Essential Dynamics methods require making one or more passes over existing trajectory data. However, it is often desirable to compress the data in such a way that it can be worked with directly, that is, the data does not need to be completely written before compression, or completely decompressed before it can be worked with. Formats and schemes that allow this include the XTC format,12,13 the TNG format,14 the HRTC scheme,15 the predictive compression schemes of Marais et al.16 and Dvořák et al.,17 the H5 format of MDTraj,18 and the H5MD format of MDAnalysis.19 These make use of various compression strategies such as quantization, delta coding, particle reordering (to minimize the differences resulting from delta coding), and/or predictive positional schemes.
Lossy compression schemes are typically evaluated using geometric criteria: the maximum difference in absolute position, the coordinate root‐mean‐square deviation, the radial distribution function of certain atoms, and so on. However, most of these compression schemes are not evaluated using energy as a criterion. Calculating energies from trajectories is important in many types of analyses, including MM‐PB/GB‐SA, grid inhomogeneous solvation theory, and linear interaction energy, to name a few. In this manuscript, we detail a simple trajectory compression scheme created by adapting the current Amber NetCDF format in the analysis program CPPTRAJ20 from NetCDF3 (or “classic”) to NetCDF4/HDF5. Staying within the NetCDF framework is advantageous because it requires minimal changes to existing parsers and can make use of the built‐in compression available with the NetCDF4/HDF5 format. In addition, a floating point to integer quantization scheme is used along with integer (byte) shuffling to improve compression ratios. We evaluate different quantization levels to determine the one with the best tradeoff among errors in position, energy reproduction compared to the original trajectories, final trajectory size, and read/write performance. We also compare performance to several existing compressed and non‐compressed trajectory formats.
2. RESULTS
2.1. Position RMSD
Figure 1 shows the average position RMSD for each converted trajectory from the reference (original) trajectory for the Amber SPCE and SPCFW simulations (reference nc3), and the Gromacs SPCE simulation (reference trr).
FIGURE 1.
Average RMSD of converted trajectories from reference trajectories. A‐SPCE is Amber SPCE (reference nc3), G‐SPCE is Gromacs SPCE (reference trr), and A‐SPCFW is Amber SPCFW (reference nc3). The magenta line indicates the expected error for converting between Å and nm. RMSD, root mean square deviation
The average RMSD for all trajectory formats is under 1 Å. There is some error associated with converting between formats that store coordinates in Å (e.g., nc3) and formats that store coordinates in nm (e.g., trr), since CPPTRAJ uses Å internally; this is denoted by the magenta line. Above this line, the average RMSD correlates well with the expected precision loss for each format. For example, the average RMSD for the Amber ASCII format (crd), which stores coordinates to three decimal places, is around 0.001 Å. The average RMSD for the Gromacs XTC format (xtc), which also stores coordinates to three decimal places (but in nm), is around 0.01 Å. The average RMSD for the NetCDF4 integer compressed formats (iX.nc4) correlates with the level of quantization (0.1 Å for 10×, 0.01 Å for 100×, etc.).
2.2. Energy RMSE
Figure 2 shows the energy RMSE for each converted trajectory from the reference (original) trajectory for each simulation.
FIGURE 2.
Energy RMSE of converted trajectories from reference trajectories. A‐SPCE is Amber SPCE (reference nc3), G‐SPCE is Gromacs SPCE (reference trr), and A‐SPCFW is Amber SPCFW (reference nc3). The magenta line indicates the expected error for converting between Å and nm. Points not shown (e.g., nc3 for A‐SPCE) have an ERMSE of zero. RMSE, root mean square error
The error in energy correlates well with the error in position. However, it is clear that even small deviations in positions (as indicated by a low coordinate RMSD) can translate into significant (>1 kcal/mol) differences in energy. Figure 3a shows the ERMSE for each individual energy term for the Gromacs SPCE trajectory for lossy compressed formats. The Amber SPCE simulation shows similar values and trends and is shown in Appendix S1.
FIGURE 3.
(a) Individual term energy RMSE of converted lossy compressed trajectories from Gromacs SPCE TRR trajectory. (b) Individual term energy RMSE of converted lossy compressed trajectories from Amber SPCFW NetCDF v3 trajectory. RMSE, root mean square error
For almost every format, the terms with the highest ERMSE are, in descending order, the electrostatic (Elec), van der Waals (VDW), and bond terms; the exception is the extremely low precision i1.nc4 format, where the order is VDW, bond, Elec. It is notable that the crd and i3.nc4 formats overlap quite well, with the exception of the VDW (and to a lesser extent the Elec) term; this is because, although the precision of the atomic coordinates in i3.nc4 matches that of the crd format, the i3.nc4 unit cell parameters are still stored at full precision.
Unlike the two rigid SPCE simulations, the Amber flexible SPC water simulation (A‐SPCFW) shows higher energy RMSEs, particularly for the lower‐precision formats. Figure 3b shows the ERMSE for each individual energy term for the Amber SPCFW trajectory for lossy compressed formats. Although the non‐bonded terms have values similar to the SPCE simulations, the bond term for the SPCFW simulation now has the highest ERMSE, and the ERMSE of the angle term has increased by an order of magnitude. This is because in the SPCE simulations, energies are not calculated for the constrained bonds (bonds to hydrogen). In the SPCFW simulations, the energies of the bonds in each water molecule are now calculated, and there is also an angle energy for each water molecule.
In addition to absolute energies, the difference in energy between structures is arguably a more important metric, since the zero point of a molecular mechanics force field is largely arbitrary. The total root mean square error over every possible frame‐to‐frame energy difference (deltaRMSE) was calculated and is shown for the Gromacs SPCE and Amber SPCFW trajectories in Figure 4. The values for the Amber SPCE trajectory are similar and are shown in Appendix S1.
FIGURE 4.
(a) The deltaRMSE for all frame to frame pairs of the Gromacs SPCE TRR trajectory to those in converted lossy compressed trajectories. (b) The deltaRMSE for all frame to frame pairs of the Amber SPCFW NetCDF v3 trajectory to those in converted lossy compressed trajectories. RMSE, root mean square error
The values are quite similar to those in Figure 3, indicating that the error due to precision loss is essentially random and cannot be easily corrected for. As was the case with the ERMSE values, the deltaRMSE values for the Amber SPCFW simulation are higher due to differences in the bond energy term. Taken together, it is clear that precision loss is a concern when attempting to reproduce energy values, particularly when bonds to hydrogen are not constrained.
2.3. File size
File sizes for each format are shown in Figure 5. The expected size for a trajectory with coordinates stored in single precision is about 793 MB, indicated by the magenta line. The Amber ASCII (crd) format is the largest by far at about 1.6 GB. The ASCII trajectory compresses quite well, down to about 584 MB with gzip and 411 MB with bzip2. The XTC and TNG formats have the smallest sizes at about 255 and 307 MB, respectively (about a third of the size of the original NetCDF/TRR formats). The losslessly compressed formats range from about 686/697 MB (h5) to 751/770 MB (c1.nc4, c2.nc4, and gz.h5md), depending on whether an Amber NetCDF v3 file or a Gromacs TRR file is being compressed. The files converted from the Gromacs TRR trajectory have slightly better compression ratios; the reason appears to be that the time values in the Gromacs trajectory start from 0 ps (and hence are smaller and more easily compressible), whereas the time values in the Amber NetCDF v3 file start from 1,031 ps. It is interesting to note that increasing the deflate level from one to two has very little impact on the final file size. The integer compressed trajectories range in size from 329 to 675 MB when converted from NetCDF v3, and from 244 to 652 MB when converted from TRR. Integer shuffling has a large impact on the final compressed file size; turning it off increases the file size by about 23%. In terms of overall file size, the XTC and TNG formats are the clear winners.
FIGURE 5.
(Top and middle) Trajectory read/write times for each format in seconds. (Bottom) File sizes for each trajectory in MB. The magenta line is the approximate expected file size for a trajectory with single‐precision coordinates
2.4. Trajectory processing speed
The trajectory read/write speeds for each format are also shown in Figure 5. The trajectory write times are the times needed to convert from the reference format (NetCDF v3, TRR) to the given format. Each run was repeated multiple times to ensure the trajectory being read was cached and total times would be representative of write performance (see Section 4 for more details). CPPTRAJ was used to convert all formats with the following exceptions: tng and xtc were written by “gmx convert‐trj”, h5 was written by “mdconvert” from MDTraj, and h5md and gz.h5md were written using MDAnalysis (see Table 1). The trajectory read times are the times needed for CPPTRAJ to process the trajectory and perform a simple distance calculation between the first two atoms of the system (all formats).
TABLE 1.
Trajectory formats tested
| Format key | Description | Written by | Compression |
|---|---|---|---|
| nc3 | Amber NetCDF v3 | CPPTRAJ | No |
| dcd | CHARMM DCD | CPPTRAJ | No |
| trr | Gromacs TRR | CPPTRAJ | No |
| crd | Amber ASCII | CPPTRAJ | No |
| cpptraj.xtc | Gromacs XTC | CPPTRAJ | Lossy |
| xtc | Gromacs XTC via gmx | gmx convert‐trj | Lossy |
| tng | Gromacs TNG | gmx convert‐trj | Lossy |
| nc4 | Amber NetCDF v4 | CPPTRAJ | No |
| c1.nc4 | Compressed NetCDF v4 (level 1) | CPPTRAJ | Lossless |
| c2.nc4 | Compressed NetCDF v4 (level 2) | CPPTRAJ | Lossless |
| i1.nc4 | Int. Compressed NetCDF v4 (×10) | CPPTRAJ | Lossy |
| i2.nc4 | Int. Compressed NetCDF v4 (×100) | CPPTRAJ | Lossy |
| i3.nc4 | Int. Compressed NetCDF v4 (×1,000) | CPPTRAJ | Lossy |
| i4.nc4 | Int. Compressed NetCDF v4 (×10,000) | CPPTRAJ | Lossy |
| i5.nc4 | Int. Compressed NetCDF v4 (×100,000) | CPPTRAJ | Lossy |
| i6.nc4 | Int. Compressed NetCDF v4 (×1,000,000) | CPPTRAJ | Lossy |
| crd.gz | Gzip Amber ASCII | CPPTRAJ | Lossless |
| crd.bz2 | Bzip2 Amber ASCII | CPPTRAJ | Lossless |
| h5 | MDTraj HDF5 | MDTraj mdconvert | Lossless |
| h5md | MDAnalysis HDF5 | MDAnalysis | No |
| gz.h5md | MDAnalysis HDF5 with compression | MDAnalysis | Lossless |
| noshuffle3.nc4 | Int. Compressed NetCDF v4 (×10,000), no shuffle | CPPTRAJ | Lossy |
Note: The gzip and bzip2 ASCII trajectories (crd.gz and crd.bz2) were written compressed directly from CPPTRAJ, not written and then compressed afterwards. Similarly, CPPTRAJ reads these compressed trajectories directly without having to decompress them first.
The DCD, TRR, and NetCDF v3 (nc3) formats are the fastest to write, all processing in under 3 s. The NetCDF v4 (nc4) format is the next fastest to write, processing in about 4 s, likely due to slightly more overhead from the underlying HDF5 library compared to NetCDF v3. The XTC format is next, with write times ranging from 4 to 12 s. Interestingly, CPPTRAJ appears to write XTC files faster than “gmx convert‐trj”, even when the latter uses the “‐normpbc” flag to avoid reimaging calculations. This may be due to “gmx convert‐trj” doing extra checks, and/or from the underlying XDRfile library (adapted from MDTraj18) that CPPTRAJ uses to handle XTC files. The integer compressed formats are next fastest, taking anywhere from 12 to 24 s to write, with higher precision files taking longer (as they are harder to compress). The lossless compressed NetCDF v4 files and the TNG and h5/h5md formats follow, taking around half a minute to write. Although the h5md format is not compressed, it is not as fast as the NetCDF4 format, probably because of overhead from the Python layer. The compressed h5md (gz.h5md) format is slower still, taking about a minute to write. The slowest formats are the Amber ASCII (crd) format and its compressed versions (crd.gz, crd.bz2), which take anywhere from 1.5 to 5 min to complete.
The read times for all binary formats are fairly consistent, ranging from about 5 to 8 s. The notable exception is the TNG format, which takes about a minute to read; this is perhaps due to overhead from its decompression scheme. The processing speed of the lossy compressed trajectories correlates with the level of precision (lower precision is faster to process, as there is less data to read). The ASCII formats are all slow to process, with the uncompressed format taking about 30 s to process, the gzip format taking about 4 s to process, and the bzip2 format taking up to 1.5 min to process.
3. CONCLUSIONS
The results here suggest that ASCII formats should be avoided. Even though they can be compressed down to about 52% of the original binary trajectory size and have reasonable positional accuracy (5 × 10−4 Å) and energy RMSE (1 kcal/mol), they are extremely slow to process, particularly when compressed. It is also worth noting that when converting between Å and nm, there is a precision loss of about 2 × 10−6 Å and 3 × 10−3 kcal/mol for all single precision binary trajectory formats.
Based on our tests, the XTC format is usually the best compressed format in terms of overall size (about 33% of the original binary trajectory), read/write performance, and positional accuracy (5 × 10−3 Å). However, if it is desired to reproduce energies and energy deltas from the original simulation, more precision is needed. This is particularly important if bonds to hydrogen are not constrained, for example, when a flexible water model is being used. The compressed, integer‐quantized Amber trajectory format detailed here, implemented in NetCDF v4/HDF5 with a conversion factor of 10,000×, is a good blend of size (about 66% of the original binary trajectory), read/write performance, positional accuracy (5 × 10−5 Å), and energy RMSE (0.066 kcal/mol for rigid water models, 0.080 kcal/mol for flexible water models). For compressing Gromacs TRR trajectories, the H5 format of MDTraj is also a decent alternative; although it is slower to process and has a slightly worse compression ratio (about 88% of the original binary trajectory), it has perfect positional accuracy and an energy RMSE of effectively zero (since no conversion from nm to Å is needed).
Overall, the NetCDF4/HDF5 format seems like an ideal mix of versatility and performance for storing MD trajectory data. It is relatively fast to read and write, can be compressed/decompressed on‐the‐fly, and unlike other binary trajectory formats it is both self‐describing and extensible (i.e., future additions to the format can easily be done in such a way as to not break existing parsers).
4. METHODS
4.1. NetCDF4/HDF5 trajectory compression
For a lossless compressed trajectory, each variable in the Amber NetCDF trajectory file is set to use zlib compression via the nc_def_var_deflate() function (coordinates, velocities, forces, temperature, unit cell parameters, time, replica indices, etc.). A deflate level of 1 was used unless otherwise indicated; values higher than this were found to produce drastically diminishing returns in terms of compression.
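As an illustration, compression is requested at variable definition time through the NetCDF‐C API. The sketch below is a hypothetical helper (error checking omitted; the variable layout follows the Amber NetCDF trajectory convention) showing the essential calls:

```c
#include <netcdf.h>

/* Define a float coordinates variable with zlib (deflate) compression.
 * The file must have been created with NC_NETCDF4 for this to work.
 * Error checking omitted for brevity. */
int define_deflated_coords(int ncid, int dim_frame, int dim_atom, int dim_spatial)
{
    int dimids[3] = { dim_frame, dim_atom, dim_spatial };
    int varid;

    nc_def_var(ncid, "coordinates", NC_FLOAT, 3, dimids, &varid);
    /* shuffle off, deflate on, deflate level 1 (higher levels gave
       diminishing returns, as noted above). */
    nc_def_var_deflate(ncid, varid, 0, 1, 1);
    return varid;
}
```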
Integer quantization was implemented as follows for coordinates, velocities, and forces. When used, a new integer variable is defined in the NetCDF file in place of the existing floating point variable (“compressedpos” in place of “coordinates,” “compressedvel” in place of “velocities,” and “compressedfrc” in place of “forces”). In addition to setting a deflate level with nc_def_var_deflate(), the “shuffle” option is turned on. The integer compressed variable is given a double precision attribute holding the compression factor used to convert the floating point values to integers.
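A corresponding sketch for the quantized case; note the attribute name “compressionFactor” is illustrative, as the text does not specify the exact name used by CPPTRAJ:

```c
#include <netcdf.h>

/* Define the integer-quantized coordinates variable with byte shuffling
 * plus deflate, and record the quantization factor as a double precision
 * attribute so readers can reverse the conversion. */
int define_quantized_coords(int ncid, int dimids[3], double compressFac)
{
    int varid;

    nc_def_var(ncid, "compressedpos", NC_INT, 3, dimids, &varid);
    /* Shuffle groups the i-th byte of every integer together, exposing
       the redundancy in the high-order bytes to the deflate stage. */
    nc_def_var_deflate(ncid, varid, 1, 1, 1);
    nc_put_att_double(ncid, varid, "compressionFactor", NC_DOUBLE, 1,
                      &compressFac);
    return varid;
}
```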
For writes, quantization of a given floating point value, fval, occurs as follows:

$$ival = \mathrm{round}(fval \times compressFac)$$

where ival is the converted integer value, round() is a function that returns the integral value nearest to the input value (with halfway cases rounded away from zero), and compressFac is the aforementioned compression factor attribute. In the code, ival is checked for overflow before writing.
For reads, the quantized integer value is converted back to floating point via:

$$fval = ival / compressFac$$
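Putting the two conversions together, a minimal sketch of the round trip (illustrative, not the actual CPPTRAJ source):

```c
#include <math.h>
#include <limits.h>

/* Write side: scale, round half away from zero (C's round() behaves
 * exactly this way), and check for 32-bit integer overflow.
 * Returns 0 on success, 1 on overflow. */
int quantize(double fval, double compressFac, int *ival)
{
    double scaled = round(fval * compressFac);
    if (scaled > (double)INT_MAX || scaled < (double)INT_MIN)
        return 1;
    *ival = (int)scaled;
    return 0;
}

/* Read side: divide by the stored compression factor. With a factor of
 * 10,000x this restores coordinates to within ~5e-5 Angstroms. */
double dequantize(int ival, double compressFac)
{
    return (double)ival / compressFac;
}
```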
4.2. Test trajectories
Several trajectories generated from both Amber21 and Gromacs22 molecular dynamics runs were used to evaluate the various trajectory formats. We used trajectories from both software packages because, although CPPTRAJ reads all Gromacs formats natively, internally CPPTRAJ uses the AKMA unit scheme while Gromacs trajectories use metric units, and we wanted to be able to distinguish precision errors due to lossy compression from precision errors due to unit conversion.
4.2.1. Gromacs SPCE
Gromacs version 2021.2 was used. The system used for the Gromacs MD run was the Trpzip2 beta hairpin (PDB ID 1LE1). The system was built in a cubic unit cell with 2,233 SPCE waters and 2 neutralizing Cl− ions (6,921 atoms total). The system was minimized for 336 steps of steepest descent using PME for long‐range electrostatics and a potential‐shift cutoff scheme for long‐range Lennard–Jones interactions, with a cutoff of 10 Å. The system was then relaxed for 100 ps in NVT using leapfrog integration with a 2 fs timestep and a velocity‐rescaling thermostat set to 300 K, followed by 100 ps in NPT using a Parrinello–Rahman barostat. Bonds to H were constrained with the LINCS algorithm23 with default settings. Production dynamics were run in NVT on an Nvidia GeForce GTX 780 GPU (CUDA 11.4) using the same settings as the previous NVT run for 10 ns, with coordinates written to a TRR file every 500 steps, for 10,001 frames total (file size 794 MB). Full details on how the system was set up (including the exact command lines used) are available in Appendix S1.
4.2.2. Amber SPCE
Amber version 16 was used for the MD runs (as it runs better on the older Nvidia GTX 780 GPU), while Amber version 22 was used to build the system. The system used for the Amber MD run was the Trpzip2 beta hairpin (PDB ID 1LE1), built in a cubic unit cell with 2,233 SPCE waters and 2 neutralizing Cl− ions (the same composition as the Gromacs SPCE system, 6,921 atoms total). To ensure the system size matched the one used in the Gromacs simulation, a custom brute‐force script was used to adjust the unit cell size so that the appropriate number of SPCE waters were added (https://github.com/drroe/Solvate.sh). The system was built using the ff14SB24 force field with Joung and Cheatham parameters for the ions.25 The system was then relaxed using the protocol of Roe and Brooks at 300 K.26 Production dynamics were run for 5,000,500 steps on an Nvidia GeForce GTX 780 GPU (CUDA 9.2). The system was run in NVT using Langevin dynamics with a 2 fs timestep at 300 K and a collision frequency of 5 ps−1. Long‐range electrostatics were handled using the particle mesh Ewald27 method with a cutoff of 9.0 Å and default Amber parameters. Long‐range Lennard–Jones interactions were handled using a cutoff of 9.0 Å and a long‐range correction.28 Bonds to hydrogen were constrained using SHAKE29,30 with default settings. Coordinates were written to a NetCDF v3 file every 500 steps, for 10,001 frames total (file size 793 MB). Full details on how the system was set up (including input scripts) are available in Appendix S1.
4.2.3. Amber SPCFW
The SPCFW system was built and relaxed in the same manner as the SPCE system except that SHAKE was not used, the timestep was 1 fs, 10,001,000 steps were run, and coordinates were written to a NetCDF v3 file every 1,000 steps for 10,001 frames total (file size 793 MB).
4.3. Trajectory formats
The 22 trajectory formats (which include several variations of certain formats) tested are listed in Table 1. All formats listed are readable by CPPTRAJ, and all were written by CPPTRAJ except where otherwise noted. The XTC and TNG formats were written with “gmx convert‐trj” from Gromacs 2021.2. When doing trajectory timings, the “‐normpbc” keyword was used to suppress reimaging; when writing the trajectories for the energy calculations this keyword was omitted to ensure all bonded atoms were properly imaged. The H5 format was written with the mdconvert tool from MDTraj 1.9.7. The H5MD formats were written with MDAnalysis version 2.1.0. Scripts and command lines used for trajectory conversion are provided in Appendix S1.
4.4. Trajectory input/output speed measurements
Trajectory write time was measured as the total execution time needed for the given program (see Table 1) to convert the reference trajectory to the target format. To ensure that the total execution time would be representative of write performance (i.e., the write is the majority of the total execution time), each run was repeated several times to ensure that the reference trajectory would be cached, and therefore reading the reference trajectory would not significantly impact the total execution time. This was confirmed by examining detailed input/output (I/O) timings from CPPTRAJ compiled with the “‐DTIMER” compiler define, which showed that when cached, trajectory reads took on average only about 0.5 s of the total execution time, with trajectory writes taking up almost all the remaining total execution time (see Appendix S1).
Trajectory read time was measured as the total execution time needed for CPPTRAJ to read the given trajectory format and perform a simple distance calculation between atoms 1 and 2. To ensure the total execution time would be representative of read performance, each run was conducted on a newly written trajectory to prevent any read caching. This was again confirmed from detailed CPPTRAJ I/O timings, which showed that trajectory reads took up the majority (>99%) of total execution time for each format (see Appendix S1).
4.5. Energy root mean square error calculation
Energies were calculated for each frame of each trajectory via the “energy” command of CPPTRAJ, which uses an Amber‐like energy function21:

$$E_{\mathrm{Total}} = E_{\mathrm{Bond}} + E_{\mathrm{Angle}} + E_{\mathrm{Dihedral}} + E_{\mathrm{VDW14}} + E_{\mathrm{Elec14}} + E_{\mathrm{Nonbond}}$$

where E_Total is the total energy, E_Bond and E_Angle are simple Hooke's law bond and angle energies, E_Dihedral is the torsion energy implemented via a Fourier series, E_VDW14 is the van der Waals energy for 1–4 pairs (i.e., atoms separated by three bonds, A1–A2–A3–A4) implemented via a Lennard–Jones 6–12 potential, E_Elec14 is the electrostatic energy for 1–4 pairs implemented via a Coulomb potential, and E_Nonbond is the energy for non‐bonded atom pairs, calculated via:

$$E_{\mathrm{Nonbond}} = E_{\mathrm{Self}} + E_{\mathrm{Recip}} + E_{\mathrm{Direct}}$$
where E_Self is the PME cancelling Gaussian, E_Recip is the reciprocal space PME energy (calculated using the helPME library, https://github.com/andysim/helpme), and E_Direct is the direct space energy, calculated for atom pairs within a specified cutoff via:

$$E_{\mathrm{Direct}} = E_{\mathrm{VDW}} + E_{\mathrm{Elec}} + E_{\mathrm{LR}}$$
where E_VDW is the van der Waals energy implemented via a Lennard–Jones 6–12 potential, E_Elec is the electrostatic energy for PME implemented via a Coulomb potential adjusted by the complementary error function,27 and E_LR is a long‐range correction for the Lennard–Jones 6–12 term.28 The energy root mean square error (ERMSE) of a converted trajectory from the reference (original) trajectory for a given term was then calculated as:

$$\mathrm{ERMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(ER_i - EC_i\right)^2}$$

where N is the number of frames in each trajectory, ER_i is the energy of frame i in the reference trajectory, and EC_i is the energy of frame i in the converted trajectory. Scripts for calculating the energy and energy RMSE can be found in Appendix S1.
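For reference, a direct C translation of the ERMSE formula (a hypothetical helper, not CPPTRAJ code); the deltaRMSE described below follows the same pattern over frame pairs:

```c
#include <math.h>
#include <stddef.h>

/* ERMSE between two trajectories: er[i] and ec[i] hold the per-frame
 * energies of the reference and converted trajectories, respectively. */
double ermse(const double *er, const double *ec, size_t nframes)
{
    double sum = 0.0;
    for (size_t i = 0; i < nframes; i++) {
        double d = er[i] - ec[i];
        sum += d * d;
    }
    return sqrt(sum / (double)nframes);
}
```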
The energy calculation of the Amber trajectories used the same parameter/topology file that was used in the corresponding simulation. For the energy calculation of the Gromacs trajectory, the Amber SPCE parameter/topology file was used after being re‐ordered to match the ordering of the Gromacs topology via the CPPTRAJ “remap” command:
```text
parm trpzip2.ff14SB.mbondi3.gmxspce.parm7
trajin trpzip2.gmxspce.nomin.rst7
readdata remap.dat name Map
remap data Map parmout gmxorder.trpzip2.ff14SB.mbondi3.gmxspce.parm7
trajout gmxorder.trpzip2.gmxspce.nomin.rst7
```
where the file “remap.dat” is a file containing the desired atom ordering:
<Gromacs atom #> <Amber atom #>
The delta ERMSE (i.e., the difference between the energy difference of two frames in the reference trajectory and the energy difference of the same two frames in a converted trajectory) was calculated as:

$$\mathrm{deltaRMSE} = \sqrt{\frac{1}{M}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\left[\left(ER_i - ER_j\right) - \left(EC_i - EC_j\right)\right]^2}$$

where M is the total number of unique frame‐to‐frame pairs (M = N(N − 1)/2).
4.6. Position root mean square deviation calculation
The average root mean square deviation (RMSD) of the reference trajectory to a converted trajectory is calculated as:

$$\langle \mathrm{RMSD} \rangle = \frac{1}{N}\sum_{i=1}^{N} \mathrm{rms}\left(XC_i, XR_i\right)$$

where N is the number of frames in each trajectory, rms() is a function that calculates the position RMSD of a target structure to a reference structure, XR_i is the structure of frame i from the reference trajectory, and XC_i is the structure of frame i from the converted trajectory. The CPPTRAJ script for calculating the trajectory‐to‐trajectory RMSD can be found in Appendix S1.
AUTHOR CONTRIBUTIONS
Daniel R Roe: Conceptualization (lead); writing – original draft (lead). Bernard R Brooks: Supervision (lead); writing – review and editing (supporting).
CONFLICT OF INTEREST
The authors have declared no conflict of interest.
Supporting information
Appendix S1: Supporting Information
ACKNOWLEDGMENTS
This work was supported by the intramural research program of the National Heart, Lung and Blood Institute (NHLBI) of the National Institutes of Health, NHLBI Z01 HL001051‐23. Computational resources provided by the LoBoS cluster (NIH/NHLBI).
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.
REFERENCES
1. Salomon‐Ferrer R, Götz AW, Poole D, Le Grand S, Walker RC. Routine microsecond molecular dynamics simulations with AMBER on GPUs. 2. Explicit solvent particle mesh Ewald. J Chem Theory Comput. 2013;9(9):3878–3888. 10.1021/ct400314y
2. Cheatham TE, Roe DR. The impact of heterogeneous computing on workflows for biomolecular simulation and analysis. Comput Sci Eng. 2015;17(2):30–39. 10.1109/MCSE.2015.7
3. Welch TA. A technique for high‐performance data compression. Computer (Long Beach Calif). 1984;17(6):8–19. 10.1109/MC.1984.1659158
4. Ahmed N, Natarajan T, Rao KR. Discrete cosine transform. IEEE Trans Comput. 1974;C‐23(1):90–93. 10.1109/T-C.1974.223784
5. Deutsch P. DEFLATE compressed data format specification version 1.3; 1996. 10.17487/rfc1951
6. Burrows M, Wheeler D. A block‐sorting lossless data compression algorithm. Palo Alto, CA; 1994.
7. Melo A, Puga AT, Gentil F, Brito N, Alves AP, Ramos MJ. Byte Structure Variable Length Coding (BS‐VLC): A new specific algorithm applied in the compression of trajectories generated by molecular dynamics. J Chem Inf Comput Sci. 2000;40(3):559–566. 10.1021/ci990069u
8. Meyer T, Ferrer‐Costa C, Pérez A, et al. Essential dynamics: A tool for efficient trajectory compression and management. J Chem Theory Comput. 2006;2(2):251–258. 10.1021/ct050285b
9. Shkurti A, Goni R, Andrio P, et al. pyPcazip: A PCA‐based toolkit for compression and analysis of molecular simulation data. SoftwareX. 2015;5:44–50. 10.1016/j.softx.2016.04.002
10. Kumar A, Zhu X, Tu Y‐C, Pandit S. Compression in molecular simulation datasets. Intell Sci Big Data Eng. 2014;8261(8):22–29. 10.1007/978-3-642-42057-3_4
11. Kumar A, Tu Y‐C. Exploiting locality for query processing and compression in scientific databases. Proceedings of the Fourth SIGMOD PhD Workshop on Innovative Database Research (IDAR '10). New York, NY: ACM Press, 2010; p. 13–18. 10.1145/1811136.1811139
12. Clementi E, Corongiu G. Methods and techniques in computational chemistry: METECC‐95. Cagliari: STEF, 1995; p. 435.
13. Krylov NA, Efremov RG. Libxtc: An efficient library for reading XTC‐compressed MD trajectory data. BMC Res Notes. 2021;14(1):14–17. 10.1186/s13104-021-05536-5
14. Spångberg D, Larsson DSD, van der Spoel D. Trajectory NG: Portable, compressed, general molecular dynamics trajectories. J Mol Model. 2011;17(10):2669–2685. 10.1007/s00894-010-0948-5
15. Huwald J, Richter S, Ibrahim B, Dittrich P. Compressing molecular dynamics trajectories: Breaking the one‐bit‐per‐sample barrier. J Comput Chem. 2016;37(20):1897–1906. 10.1002/jcc.24405
16. Marais P, Kenwood J, Smith KC, Kuttel MM, Gain J. Efficient compression of molecular dynamics trajectory files. J Comput Chem. 2012;33(27):2131–2141. 10.1002/jcc.23050
17. Dvořák J, Maňák M, Váša L. Predictive compression of molecular dynamics trajectories. J Mol Graph Model. 2020;96:107531. 10.1016/j.jmgm.2020.107531
18. McGibbon RT, Beauchamp KA, Harrigan MP, et al. MDTraj: A modern open library for the analysis of molecular dynamics trajectories. Biophys J. 2015;109(8):1528–1532. 10.1016/j.bpj.2015.08.015
19. Jakupovic E, Beckstein O. MPI‐parallel molecular dynamics trajectory analysis with the H5MD format in the MDAnalysis Python package. Proceedings of the 20th Python in Science Conference (SciPy 2021); p. 40–48. 10.25080/majora-1b6fd038-005
20. Roe DR, Cheatham TE. PTRAJ and CPPTRAJ: Software for processing and analysis of molecular dynamics trajectory data. J Chem Theory Comput. 2013;9(7):3084–3095. 10.1021/ct400341p
21. Case DA, Cheatham TE, Darden T, et al. The Amber biomolecular simulation programs. J Comput Chem. 2005;26(16):1668–1688. 10.1002/jcc.20290
22. Hess B, Kutzner C, van der Spoel D, Lindahl E. GROMACS 4: Algorithms for highly efficient, load‐balanced, and scalable molecular simulation. J Chem Theory Comput. 2008;4(3):435–447. 10.1021/ct700301q
23. Hess B, Bekker H, Berendsen HJC, Fraaije JGEM. LINCS: A linear constraint solver for molecular simulations. J Comput Chem. 1997;18(12):1463–1472.
24. Maier JA, Martinez C, Kasavajhala K, Wickstrom L, Hauser KE, Simmerling C. ff14SB: Improving the accuracy of protein side chain and backbone parameters from ff99SB. J Chem Theory Comput. 2015;11(8):3696–3713. 10.1021/acs.jctc.5b00255
25. Joung IS, Cheatham TE. Determination of alkali and halide monovalent ion parameters for use in explicitly solvated biomolecular simulations. J Phys Chem B. 2008;112(30):9020–9041. 10.1021/jp8001614
26. Roe DR, Brooks BR. A protocol for preparing explicitly solvated systems for stable molecular dynamics simulations. J Chem Phys. 2020;153(5):054123. 10.1063/5.0013849
27. Darden T, York D, Pedersen L. Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems. J Chem Phys. 1993;98(12):10089–10092. 10.1063/1.464397
28. Allen MP, Tildesley DJ. Computer simulation of liquids. Oxford: Oxford University Press, 1987.
29. Ryckaert JP, Ciccotti G, Berendsen HJC. Numerical integration of the cartesian equations of motion of a system with constraints: Molecular dynamics of n‐alkanes. J Comput Phys. 1977;23(3):327–341.
30. Miyamoto S, Kollman PA. SETTLE: An analytical version of the SHAKE and RATTLE algorithm for rigid water models. J Comput Chem. 1992;13(8):952–962. 10.1002/jcc.540130805