Abstract
This paper presents a survey of three algorithms that transform atomic-level molecular snapshots from molecular dynamics (MD) simulations into metadata representations suitable for in situ analytics based on machine learning methods. MD simulations, which study the classical time evolution of a molecular system at atomic resolution, are widely recognized in the fields of chemistry, material sciences, molecular biology and drug design; they are among the most common simulations run on supercomputers. Next-generation supercomputers will have dramatically higher performance than current systems, generating more data that needs to be analysed (e.g. in terms of number and length of MD trajectories). In the future, the coordination of data generation and analysis can no longer rely on manual, centralized analysis traditionally performed after the simulation is completed, or on data representations that have been defined for traditional visualization tools. Powerful data preparation phases (i.e. phases in which original raw data are transformed into concise and still meaningful representations) will need to precede data analysis phases. Here, we discuss three algorithms for transforming traditionally used molecular representations into concise and meaningful metadata representations. The transformations can be performed locally. The resulting metadata can be fed into machine learning methods for runtime in situ analysis of increasingly large MD trajectories supported by high-performance computing. In this paper, we provide an overview of the three algorithms and their use in three different applications: protein–ligand docking in drug design; protein folding simulations; and protein engineering based on analytics of protein functions depending on proteins' three-dimensional structures.
This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
Keywords: protein–ligand docking, protein folding, protein engineering, machine learning, MapReduce
1. Problem overview and proposed solution
Molecular dynamics (MD) simulations are widely recognized in the fields of chemistry, material sciences, molecular biology and drug design. The system sizes and time-scales accessible to MD simulations have been steadily increasing [1]. Today, MD simulations are the most common simulations running on petascale machines. For example, a survey of resources used on XSEDE machines over the past six months shows that biomolecular codes (predominantly MD codes such as Amber [2], CHARMM [3] and NAMD [4]) use 25.7% of the XSEDE resources (i.e. total amount of XD service units [SUs] used by jobs in the indicated field of science). At the same time, work by Luu and co-authors shows that HPC resources can already sit up to 75% idle, waiting on I/O operations, while running scientific simulations because of poor data handling [5]. The transition from petascale to exascale computing will bring unprecedented computing capability to MD simulations. Next-generation high-performance computing (HPC) systems (e.g. XSEDE high-end clusters such as Stampede2 and national laboratory supercomputers such as Summit) will have dramatically higher compute performance than current systems such as the San Diego Supercomputer Center cluster Comet or the Oak Ridge National Laboratory supercomputer Titan. The increase in computing capability directly translates into the ability to execute more and longer simulations. For MD simulations, this means generating more data that needs to be analysed (e.g. in terms of number and length of MD trajectories). Because of power constraints, however, the I/O bandwidth and parallel file system capacity of next-generation HPC systems will not grow at the same pace. The steady increase in compute power (in petaflops) and the stagnant I/O bandwidth (in TB/s) in next-generation machines such as Summit and Aurora, compared with current and past HPC systems such as Titan, is well known.
For example, Summit's parallel file system is provisioned to have 1 TB/s peak I/O bandwidth, the same as that of Titan. In some cases, new HPC systems will share older existing parallel file systems with their predecessors, creating under-provisioned systems and contention conditions. While many architectural implementations and scheduling aspects of next-generation HPC systems remain a topic of discussion (e.g. the role of burst buffers, or BBs, the ideal BB location, and the feasibility of smart I/O staging), the research community must revisit the way MD simulations are being executed. Hardware and scheduling strategies such as BBs and staging are not the magic silver bullet they are sometimes made out to be; I/O contention will still be a problem if the burst buffer capacity is exceeded [5–7]. Moreover, BBs can improve offloading bandwidth but do not help with loading data back from storage for increasingly complex workflows [8]. The overall coordination of data generation and analysis will no longer be able to rely on the manual, centralized approaches used today. Clearly needed are new comprehensive workflows for MD simulations in which HPC meets data analytics.
Trends in HPC are moving away from the centralized nature of MD analysis towards a distributed approach that is predominantly performed in situ, supports a broad range of MD codes, and can enable instantaneous tuning of MD workflows (i.e. stopping, starting and forking MD jobs). Contrary to traditional MD data analyses that use a centralized approach (i.e. first generate and save all the trajectory data to storage, followed by post-simulation analysis), these new trends analyse data as they are generated, save to disk only what is really needed for future analysis, and annotate MD outputs to drive the next steps in increasingly complex MD workflows. Note that here we refer to the analysis of MD-generated data (e.g. capturing significant rare events and monitoring convergence of observables based on inherently noisy and high-dimensional MD outputs) rather than to the generation process itself (e.g. modelling of the atom interactions and parallelization of single MD jobs), since flops are essentially free and other efforts tackle the computing challenges. At the same time, in situ MD analyses are leveraging machine learning methods, for example, to identify recurrent patterns and predict rare events in MD trajectories. In general, machine learning methods require some data preparation before data are fed into them: large, coarse-grained, redundant data must be transformed into concise, meaningful representations. To support in situ analyses of MD simulations, new data representations are needed that, while still capturing structural properties of molecules locally, better fit the profile of input data suitable for machine learning methods. In other words, these representations should enable in situ analysis by capturing what is going on in each frame without disrupting the simulation (e.g. stealing CPU and memory on the node), without moving all the frames to a central file system to be analysed once the simulation is over, and without comparing each frame with past frames of the same job or with frames of other jobs.
In this paper, we present three algorithms that transform molecular structures into metadata suitable for in situ analysis of increasingly large MD trajectories. The three algorithms are applied to datasets from three use cases: protein–ligand docking simulations; protein folding simulations; and analytics of protein functions in protein engineering. The overarching approach common to the three use cases is the need to map individual substructures in local molecular structures to metadata, frame by frame at runtime. The global knowledge of interest to scientists can be rebuilt from the distributed local knowledge captured in the metadata by using MapReduce-based techniques and machine learning tools. MapReduce-based frameworks and machine learning tools can use the ensembles of metadata for automatic, strategic analysis and steering of MD simulations within a trajectory or across trajectories, making unnecessary the manual identification of those portions of trajectories in which rare events take place or in which critical conformational features are embedded. Building on our previous work, we provide evidence that, by mapping individual molecular structures to metadata frame by frame, locally and at runtime, our algorithms still capture the knowledge in each frame while potentially facilitating automatic, strategic analysis and enabling the steering of MD simulations within a trajectory or across trajectories.
2. Three data transformations for three use cases
The three algorithms we present were applied to three different scientific use cases: drug design and protein–ligand docking; protein folding and the identification of rare events in folding trajectories; and protein variants expressed in protein engineering. Each use case requires a different transformation of the standard atomic representation into metadata, and each set of metadata is suitable for different machine learning methods. While there is no one-size-fits-all transformation algorithm, each use case is representative of a broader range of MD applications within which our transformations can be broadly used.
3. Drug design and protein–ligand docking
The first algorithm transforms a small three-dimensional (3D) molecule composed of n atoms, called a ligand, into a single 3D point. We use the algorithm to capture the conformation and orientation of large datasets of 3D ligands once docked into a protein binding site. In general, protein–ligand docking simulates the docking of ligands that function like drugs into proteins involved in a disease process. In the docking process, the ligand docks into the protein binding site and plays an essential role in turning the protein function on or off, which is the molecular basis of the therapeutic benefit of most drugs [9].
Given a ligand and a protein with its binding site, understanding the docking process requires that a large number of ligand conformations be generated; for each conformation, a large number of orientations have to be docked into the pocket and scored. Consequently, protein–ligand docking is a search with uncertainties in a very large space of potential ligand conformations and associated orientations, driven by the protein, the ligand, the computational methods and the degrees of freedom to be explored [10]. Computationally, given a protein and the ligand docking into the protein binding site (i.e. the protein–ligand complex), a docking simulation consists of a sequence of hundreds of thousands of independent attempts or trials. For each attempt, a random ligand conformation is generated and docked into the protein active site with randomly generated orientations. The attempts are performed in a distributed fashion by an ensemble of MD jobs executed across multiple computing nodes. During the simulation, a distributed dataset of docked ligands is generated. The large datasets we use to illustrate our algorithm were produced by Docking@Home (D@H) [11]. D@H was used over six years (2006–2012) to perform high-throughput docking simulations on computing resources donated by the public (e.g. idle computing cycles of desktops and laptops connected to the Internet).
The search for ‘good’ ligands that can potentially serve as drugs is supported by the hypothesis that, given an accurate computational method and a sufficiently large sampled space, output conformations of a protein–ligand docking simulation tend to converge towards conformations close to those observed experimentally, for example, by X-ray crystallography. To search effectively for those ligands across large datasets at runtime, rather than leveraging a posteriori visualization of the protein binding site with the docked ligand or relying on energy values, we map each docked ligand at runtime into a 3D point that serves as a proxy for the ligand conformation and orientation in the docking pocket. Our approach has three key steps. The first step is based on a geometry reduction: it encodes the geometry of a 3D ligand conformation into a single 3D point. Each 3D point is the result of three projections of the ligand atoms onto the x-y, y-z and x-z planes, respectively, followed by the linear interpolation of the three projections and the measurement of the three linear slopes. The three slopes become the coordinates of the 3D point. Figure 1 describes the mapping of two different 3D conformations of ligand 1hvi into two 3D points.
Figure 1.
Mapping of two different 3D sampled conformations of ligand 1hvi into two 3D points [12]. (Online version in colour.)
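The geometry-reduction step above can be sketched in a few lines. The sketch below is an illustration under our own assumptions, not the production code of [12]: we take the linear interpolation to be an ordinary least-squares fit, and the function names are ours.

```python
def slope(u, v):
    """Least-squares slope of v regressed against u (simple linear fit)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = sum((a - mu) ** 2 for a in u)
    return num / den

def ligand_to_point(atoms):
    """Map an n-atom ligand (a list of (x, y, z) tuples) to a single 3D point.

    The three coordinates are the slopes of the least-squares lines fitted
    to the x-y, y-z and x-z projections of the atom coordinates."""
    xs, ys, zs = zip(*atoms)
    return (slope(xs, ys), slope(ys, zs), slope(xs, zs))
```

For example, three collinear atoms at (0, 0, 0), (1, 2, 3) and (2, 4, 6) map to the point (2.0, 1.5, 3.0), the three projection slopes.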
The second step builds an octree by assigning an octant identifier, or octkey, to every 3D point. The octkey is a sequence of digits between 0 and 7 that maps the point to a specific octant of the projection 3D space. The length of the octkey depends on the desired granularity of the search, ranging from a coarse-grained search over large octants (one or two digits) to a fine-grained search over smaller octants (e.g. eight or more digits). Octkeys are used as keys of key-value pairs. The use of key-value pairs, one for each docked ligand generated locally by an MD job, allows us to leverage MapReduce-based programming models for the final step of our approach, in which we extract a subset of similar ligand conformations. These are the ligands that are most likely to be of interest to scientists.
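The octkey construction can be sketched as follows; the bounding box, the bit convention for the digits, and the default depth are illustrative assumptions of ours rather than the exact choices made in [12].

```python
def octkey(point, lo=(-10.0, -10.0, -10.0), hi=(10.0, 10.0, 10.0), depth=8):
    """Encode a 3D point as an octkey: one digit in 0-7 per octree level,
    each digit selecting the octant of the current box containing the point."""
    key = []
    lo, hi = list(lo), list(hi)
    for _ in range(depth):
        digit = 0
        for axis in range(3):
            mid = (lo[axis] + hi[axis]) / 2.0
            if point[axis] >= mid:
                digit |= 1 << axis   # set this axis's bit for the upper half
                lo[axis] = mid       # descend into the upper half-box
            else:
                hi[axis] = mid       # descend into the lower half-box
        key.append(str(digit))
    return "".join(key)
```

Two points sharing a long octkey prefix lie in the same small octant, so similar ligand conformations collide on the same keys, which is what makes the key-value formulation MapReduce-friendly.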
In the third step, we identify the densest octant, i.e. the subspace of 3D points with the highest point density. For the sake of scalability, the octree-based clustering can be performed on top of a MapReduce-based framework such as Hadoop [13], Spark [14], MR-MPI [15] or Mimir [16–18]. Open-source versions of the Mimir software are available at [19]. In figure 2a, we show the outcome for one of the HIV ligands docking into the HIV protease. The figure shows the octants that were traversed by our algorithm. Lighter dots (cyan) are conformations with RMSD larger than 2 Å and darker dots (red) are near-native conformations (i.e. conformations with RMSD smaller than or equal to 2 Å compared with the known crystal structure). The deepest, densest octant found by the octree clustering with 8-digit octkeys is shown in figure 2b (octant framed in black). The best conformations converge towards a dense octant in the octree, as stated in our hypothesis. The details on the applicability of our transformation algorithm are beyond the scope of this survey paper; we present our scalable and accurate approach to transform ligands to metadata and systematically identify ‘good’ ligands for a diverse set of protein–ligand complexes in [12,20,21].
Figure 2.
Octree of the sampled conformations of ligand 1k1l. The densest octant, located deeper in the octree, contains near-native ligand structures. (a) Entire octree for ligand 1k1l and (b) zoom-in showing the dense octant [12]. (Online version in colour.)
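The densest-octant search can be sketched sequentially as a group-by on octkey prefixes; in practice it runs as a MapReduce job over distributed key-value pairs [13–18]. The greedy level-by-level prefix search and the `min_points` threshold below are illustrative assumptions of ours.

```python
from collections import Counter

def densest_octant(octkeys, min_points=10):
    """Return the deepest octkey prefix still containing at least
    min_points ligands, i.e. the densest octant found by descending
    the octree one digit (level) at a time."""
    best = ""
    depth = max(len(k) for k in octkeys)
    for level in range(1, depth + 1):
        # group ligands by octkey prefix of the current length
        counts = Counter(k[:level] for k in octkeys if len(k) >= level)
        prefix, n = counts.most_common(1)[0]
        if n < min_points:
            break            # octants at this level are too sparse
        best = prefix
    return best
```

With, say, twelve ligands sharing octkey "777" and a few scattered elsewhere, the search descends to prefix "777" and stops there, mirroring the deep dense octant of figure 2b.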
4. Protein folding and rare events
The second algorithm transforms one or multiple secondary structures (i.e. α-helices and β-strands) from an MD trajectory frame into a single eigenvalue at runtime. We use the eigenvalue representations as powerful structural collective variables capable of capturing relevant molecular motions. These collective variables can ultimately map the occurrence of rare events (e.g. the transformation of amino acids from an α-helix into a straight β-strand, or the swinging of a protein helix with respect to the rest of the protein) at runtime in trajectories, without the need for visualizing the trajectories a posteriori. While the first algorithm is applied to the final frame of short MD simulations, this second algorithm is applied to each one of the many frames output by long MD simulations. Given a frame, both algorithms reduce small molecules or molecular substructures (i.e. small ligands or combinations of secondary structures) into numerical values (i.e. a single 3D point or a single eigenvalue).
When not using a posteriori visualization, measuring structural changes of a frame with respect to past frames of the same trajectory (or to frames in other trajectories) requires (a) keeping frames in memory or moving frames across nodes, and (b) quantitatively measuring similarity between two or more frames with, for example, the root mean square deviation (RMSD). This process violates the in situ constraints that the analysis neither compete for resources with the simulation nor require moving data. Rather than using RMSDs, we first simplify the system made of m amino acids by extracting the positions of the m α-carbon (Cα) backbone atoms (xi, yi, zi) for 1 ≤ i ≤ m and using the backbone atoms as our domains of interest. In our mapping algorithm, we capture the dynamic relationship between two or more segments of a molecular system (e.g. Segment 1 consisting of m1 amino acids k1, … , km1 and Segment 2 consisting of m2 amino acids ℓ1, … , ℓm2); we further extend our approach based on distance matrices and eigenvalues by constructing bipartite distance matrices B = [bij]. Since we want to focus on capturing information on the relative positions of two or more segments in a domain, we consider only distances between Cα atoms in, for example, Segment 1 and Cα atoms in, for example, Segment 2; all other distances are recorded as zeros. The bipartite distance matrix is symmetric and has zeros on the diagonal. Because of its block structure, whenever λ is an eigenvalue of B, −λ is also an eigenvalue of B. The collective variable we associate with the segments in a given domain of interest at a given time t is the largest (positive) eigenvalue, λmax, of B.
Figure 3 depicts an example in which the positions of two β-strands in the 1E0L protein are monitored with respect to each other and mapped to a bipartite distance matrix in order to identify the formation of strands as an example of a monitored rare event. The matrix B for two segments has at most 10 non-zero eigenvalues: five pairs of λ and −λ. B has trace zero since its diagonal is identically zero. For the bipartite distance matrix B, we observed that the four smallest positive eigenvalues are much smaller than the largest eigenvalue and the negative eigenvalues perfectly mirror the positive ones. Thus, capturing only the largest eigenvalue provides a sufficiently good proxy distance metric.
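A minimal sketch of the collective-variable computation, assuming plain NumPy rather than the in situ implementation of [22,23]:

```python
import numpy as np

def bipartite_lambda_max(seg1, seg2):
    """Largest eigenvalue of the bipartite Ca distance matrix B.

    seg1, seg2: arrays of shape (m1, 3) and (m2, 3) holding the Ca
    coordinates of the two segments.  Only cross-segment distances are
    filled in; the two diagonal blocks stay zero, so B is symmetric
    with a zero diagonal and its eigenvalues come in +/- pairs."""
    m1, m2 = len(seg1), len(seg2)
    # cross-segment Euclidean distances, shape (m1, m2)
    d = np.linalg.norm(seg1[:, None, :] - seg2[None, :, :], axis=2)
    b = np.zeros((m1 + m2, m1 + m2))
    b[:m1, m1:] = d
    b[m1:, :m1] = d.T
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order
    return float(np.linalg.eigvalsh(b)[-1])
```

In the degenerate one-atom-per-segment case, B reduces to [[0, d], [d, 0]] with eigenvalues ±d, so λmax is simply the inter-atomic distance; for full segments it summarizes their relative placement in one number per frame.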
Figure 3.
Capturing the relative position of two segments in 1E0L with respect to each other with a bipartite distance matrix computed on a single frame and its largest eigenvalue [22]. (Online version in colour.)
As an example of an intermolecular rare event, we show how the time-series analysis of eigenvalues from a bipartite (two-structure) distance matrix can capture conformational changes. We considered a trajectory of the 1BDD protein and observed the relative change between structures over time. The three helices of the protein (i.e. Helix 1, Helix 2 and Helix 3) form relatively quickly (within the first 1000 frames). Figure 4 shows, in time series, the largest eigenvalue of each helix for Frames 1100–1700. At this point in the trajectory, the helices have formed and are relatively stable: the three largest eigenvalues fluctuate around a constant value. A different story emerges, however, if we consider the eigenvalues of the bipartite distance matrices comparing Helix 1 with Helix 2, Helix 1 with Helix 3 and Helix 2 with Helix 3. Figure 5 shows the eigenvalues of these bipartite distance matrices in time series over the same range of frames (i.e. Frames 1100–1700). We see little change in the eigenvalues comparing the relative positions of atoms in Helix 1 with atoms in Helix 2. However, we see a significant spike occurring between Frames 1300 and 1400 in the other two comparisons. Hence we infer that Helix 3 makes a dramatic movement relative to the other helices and then returns to a position similar to that before the change. Figure 6 shows three representative frames of the trajectory. The region corresponding to the first helix (Helix 1) is coloured red, the second helix (Helix 2) green and the third helix (Helix 3) blue. In Frame 1300, the protein is compact. In Frame 1330, Helix 3 (blue) has swung away from the other helices, rendering the protein temporarily much less compact. By Frame 1390, the third helix has returned, and the protein is once again compact. This is essentially what the eigenvalue spike predicted. Note that the set of collective variables in our two empirical case studies is an example of a molecule-agnostic parameter.
The global knowledge of occurring rare events within and across long trajectories can be built from patterns of eigenvalues, which are substantially smaller to manage and visualize than the original trajectories.
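Detecting such spikes in a stream of eigenvalues can be as simple as a sliding-window z-score test; the sketch below is one illustrative possibility under our own assumptions (window size, threshold), not the pattern classification used in [23].

```python
def spike_frames(series, window=50, threshold=4.0):
    """Flag frames whose eigenvalue deviates strongly from a trailing
    window of past values, a simple proxy for a rare event."""
    hits = []
    for t in range(window, len(series)):
        w = series[t - window:t]
        mean = sum(w) / window
        var = sum((v - mean) ** 2 for v in w) / window
        std = var ** 0.5 or 1e-12   # guard against a perfectly flat window
        if abs(series[t] - mean) / std > threshold:
            hits.append(t)
    return hits
```

Because each test uses only the trailing window of scalar eigenvalues, the detector keeps no frames in memory and fits the in situ constraints discussed above.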
Figure 4.
Stable eigenvalues of individual helices for 1BDD [19,23]. (Online version in colour.)
Figure 5.
Example of sudden spike in eigenvalues in Helices 1–3 and Helices 2–3 but lack of corresponding change in Helices 1–2, indicating Helix 3 moves relative to the other two [19,23].
Figure 6.
(a–c) Visualization of the intermolecular transformation in the 1BDD protein with its swinging Helix 3 [19,23]. (Online version in colour.)
Our use of multidimensional scaling, such as the eigendecomposition of distance matrices, to quantify the variances in one or more relevant substructures has been discussed in depth for trajectories of folding proteins in [23,24]. In our past work, we showed how dramatic variances of eigenvalues can be associated with rare events in two use cases: the transformation of amino acids from an α-helix into a straight β-strand, and the swinging of a protein helix with respect to the rest of the protein.
5. Protein variants and protein engineering
The third algorithm builds upon the fact that proteins with similar sequences have similar functions. Our algorithm can be used to support the screening of millions of protein variants expressed to produce desired properties (protein engineering). Our mapping transforms heterogeneous macromolecules such as proteins, capturing and exposing their structure (i.e. secondary and tertiary characteristics), into homogeneous tensor representations, so that the representation can be processed efficiently for different types of proteins and has the potential to scale on large supercomputers [25]. Contrary to the previous two algorithms that transform small molecular structures into single values, this algorithm transforms an entire large molecular structure into an image-based representation (i.e. multiple tensors). Specifically, we map structural and conformational information of macromolecules into a codified image. The process, described in detail in [25] for an example encoding of the gene V protein (PDBid: 1AE2), is presented in figure 7. Starting from the protein in its atomic representation, the method first combines knowledge from:
(a) the Ramachandran plot [26], to determine which of six types of secondary structures (i.e. α-helix, β-strand, polyproline PII-helix, γ′-turn, γ-turn and cis-peptide bonds) an amino acid residue may be part of, and
(b) the Euclidean distance matrix between the α-carbon atoms in the backbone of the protein, providing raw information on the secondary and tertiary structure of the protein.
Figure 7.
Example of encoding procedure from original 3D protein to image for the gene V protein (PDBid: 1AE2) [24]. (Online version in colour.)
This knowledge is used to build a 3-channel representation of the protein as a tensor (i.e. the red-green-blue channels of an image). The colours encode secondary structures, while the colour intensities represent distances between the structures. The final step consists of resizing the image by applying a bicubic interpolation to produce an output of consistent dimensions across proteins, regardless of their original length [25].
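A much-simplified sketch of this encoding follows. The channel palette (three classes rather than six), the intensity scaling and the omission of the bicubic resize step are our own assumptions for illustration; see [25] for the actual procedure.

```python
import numpy as np

# Hypothetical channel assignment: each secondary-structure class writes
# into one RGB channel (the palette in [25] differs and covers six classes).
CHANNEL = {"helix": 0, "strand": 1, "other": 2}

def encode_protein(ca_coords, ss_labels):
    """Encode a protein as an m x m x 3 'image': pixel (i, j) carries the
    Ca(i)-Ca(j) distance as intensity, placed in the channel given by the
    secondary-structure class of residue i."""
    ca = np.asarray(ca_coords, dtype=float)
    m = len(ca)
    # full Ca-Ca Euclidean distance matrix, shape (m, m)
    d = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=2)
    d = d / d.max() if d.max() > 0 else d   # scale intensities to [0, 1]
    img = np.zeros((m, m, 3))
    for i, label in enumerate(ss_labels):
        img[i, :, CHANNEL[label]] = d[i, :]
    return img
```

A final bicubic resize of `img` to a fixed resolution (e.g. with an imaging library) would complete the pipeline, giving every protein the same tensor shape regardless of residue count.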
Neural network architectures can analyse proteins in this encoded format by treating each colour channel independently prior to grouping. Results in [25,27] rely on this approach to show how the processing of protein structures into images and the identification of protein structure can take place in a few seconds, providing nearly instant feedback for in situ analyses. Figure 8 shows a normalized confusion matrix for the classification of 62 991 proteins from the Protein Data Bank into eight biological processes from the biological process taxonomy in RCSB-PDB. The diagonal of the matrix shows the average accuracy of the predictions versus the real-world results. We validated the effectiveness of our transformation algorithm and the strength of the neural network using our transformations through a fivefold cross-validation over roughly 63 thousand images, achieving an accuracy of 80% across the eight distinct classes. Our overall approach of encoding and classifying proteins for real-time processing leading to high-throughput analysis is presented in [25,27].
Figure 8.
Mapping proteins into images and the classification of the proteins into eight biological processes from the biological process taxonomy in RCSB-PDB for 62 991 proteins from the Protein Data Bank [24]. (Online version in colour.)
6. Summary
This paper presents three algorithms that transform raw molecular structures into metadata, suitable for three use cases combining structural molecular biosciences, machine learning, data analytics and HPC. In the three use cases, we leverage the three algorithms to implement new in situ analytics capable of tracing molecular events such as binding events, conformational changes or phase transitions, and structural variants in MD simulations at runtime. Our algorithms reduce the knowledge embedded in high-dimensional molecular organizations to a set of relevant metadata. The transformations in the first use case, dealing with protein–ligand docking, map single ligand conformations, each output by an individual MD job, into a single three-dimensional point that serves as a proxy for the ligand's structural conformation; the extraction of the global knowledge is performed with a clustering method executed on top of MapReduce-based frameworks. The transformations in the second use case, dealing with the identification of rare events in MD trajectories, map single molecular substructures in frames, each output by an evolving MD job, into a single eigenvalue that serves as a proxy for the spatial properties of one or multiple domains of interest; the extraction of the global knowledge from the distributed set of eigenvalues is performed by classifying eigenvalue patterns. Last, the transformations in the third use case, dealing with protein engineering, map an entire MD frame into a single image (tensor) that serves as a proxy for the structural properties of multiple synthetic proteins; the extraction of the global knowledge (e.g. the family a new protein can potentially belong to) is performed by convolutional neural networks that identify patterns in the images.
The applicability of our transformations is driven by the type of scientific question an MD simulation targets. Our work has applied the transformations to three different ranges of MD problems, integrating them into distributed frameworks and unsupervised machine learning techniques to build, accurately and efficiently, an explicit global organization of structural and temporal MD trajectories, such that anomalies, convergence and correlations are exposed at runtime while minimizing data movement and communication in in situ analyses.
Acknowledgements
The authors thank Zhang B, Cicotti P, Armen RS, Liwo A, Crivelli S, Benson J, Carrillo-Cabada H, Razavi AS, Cuendet MA, Weinstein H and Deelman E for their feedback in different phases of the algorithms' conceptualization and design.
Data accessibility
The manuscript does not report primary data.
Authors' contributions
M.T. worked with T.E. and T.J. on the conception and design of the approaches in §§3–5; she drafted the article and revised it critically for important intellectual content; and she is the final approver of the version to be published. T.E. provided substantial contributions to the conception and design of the approaches in §§3 and 5. T.J. provided substantial contributions to the conception and design of the approaches in §4.
Competing interests
We declare we have no competing interests.
Funding
This work has been supported by the US National Science Foundation, awards IIS 1741057, CCF 1318445, OAC 0802650.
References
- 1. Perilla JR, Goh BC, Cassidy CK, Liu B, Bernardi RC, Rudack T, Yu H, Wu Z, Schulten K. 2015. Molecular dynamics simulations of large macromolecular complexes. Curr. Opin. Struct. Biol. 31, 64–74. (10.1016/j.sbi.2015.03.007)
- 2. Case DA, et al. 2005. The Amber biomolecular simulation programs. J. Comput. Chem. 26, 1668–1688. (10.1002/jcc.20290)
- 3. Brooks BR, et al. 2009. CHARMM: the biomolecular simulation program. J. Comput. Chem. 30, 1545–1614. (10.1002/jcc.21287)
- 4. Phillips JC, et al. 2005. Scalable molecular dynamics with NAMD. J. Comput. Chem. 26, 1781–1802. (10.1002/jcc.20289)
- 5. Luu H, Winslett M, Gropp W, Ross R, Carns P, Harms K, Prabhat M, Byna S, Yao Y. 2015. A multiplatform study of I/O behavior on petascale supercomputers. In Proc. 24th Int. Symp. on High-Performance Parallel and Distributed Computing (HPDC), pp. 33–44.
- 6. Herbein S, Ahn DH, Lipari D, Scogland TRW, Stearman M, Grondona M, Garlick J, Springmeyer B, Taufer M. 2016. Scalable I/O-aware job scheduling for burst buffer enabled HPC clusters. In Proc. 25th ACM Int. Symp. on High-Performance Parallel and Distributed Computing (HPDC), pp. 69–80.
- 7. Liu N, Cope J, Carns P, Carothers C, Ross R, Grider G, Crume A, Maltzahn C. 2012. On the role of burst buffers in leadership-class storage systems. In Proc. IEEE 28th Symp. on Mass Storage Systems and Technologies (MSST).
- 8. Sato K, Mohror K, Moody A, Gamblin T, de Supinski BR, Maruyama N, Matsuoka S. 2014. A user-level InfiniBand-based file system and checkpoint strategy for burst buffers. In Proc. 14th IEEE/ACM Int. Symp. on Cluster, Cloud and Grid Computing (CCGrid).
- 9. Estrada T, Armen RS, Taufer M. 2010. Automatic selection of near-native protein–ligand conformations using a hierarchical clustering and volunteer computing. In Proc. First ACM Int. Conf. on Bioinformatics and Computational Biology (BCB), pp. 204–213.
- 10. Jain A. 2008. Bias, reporting, and sharing: computational evaluations of docking methods. J. Comput. Aided Mol. Des. 22, 201–212. (10.1007/s10822-007-9151-x)
- 11. Taufer M, Armen R, Chen J, Teller P, Brooks C. 2009. Computational multiscale modelling in protein–ligand docking. IEEE Eng. Med. Biol. Mag. 28, 58–69. (10.1109/MEMB.2009.931789)
- 12. Estrada T, Zhang B, Cicotti P, Armen RS, Taufer M. 2012. A scalable and accurate method for classifying protein–ligand binding geometries using a MapReduce approach. Comput. Biol. Med. 42, 758–771. (10.1016/j.compbiomed.2012.05.001)
- 13. White T. 2012. Hadoop: The Definitive Guide. O'Reilly Media, Inc.
- 14. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. 2010. Spark: cluster computing with working sets. In Proc. USENIX Workshop on Hot Topics in Cloud Computing (HotCloud).
- 15. Plimpton SJ, Devine KD. 2011. MapReduce in MPI for large-scale graph algorithms. Parallel Comput. 37, 610–632. (10.1016/j.parco.2011.02.004)
- 16. Gao T, Guo Y, Zhang B, Cicotti P, Lu Y, Balaji P, Taufer M. 2018. On the power of combiner optimizations in MapReduce over MPI workflows. In Proc. 24th IEEE Int. Conf. on Parallel and Distributed Systems (ICPADS), pp. 441–448.
- 17. Gao T, Guo Y, Wei Y, Wang B, Lu Y, Cicotti P, Balaji P, Taufer M. 2017. Bloomfish: a highly scalable distributed k-mer counting framework. In Proc. 23rd IEEE Int. Conf. on Parallel and Distributed Systems (ICPADS), pp. 170–179.
- 18. Gao T, Guo Y, Zhang B, Cicotti P, Lu Y, Balaji P, Taufer M. 2017. Mimir: memory-efficient and scalable MapReduce for large supercomputing systems. In Proc. IEEE Int. Parallel and Distributed Processing Symp. (IPDPS), pp. 1098–1108.
- 19. Gao T, Guo Y, Zhang B, Cicotti P, Lu Y, Balaji P, Taufer M. 2019. Scalable data skew mitigation in MapReduce over MPI for supercomputing systems. IEEE Trans. Parallel Distrib. Syst. (Accepted for publication.)
- 20. Zhang B, Estrada T, Cicotti P, Balaji P, Taufer M. 2017. Enabling scalable and accurate clustering of distributed ligand geometries on supercomputers. Parallel Comput. 63, 38–60. (10.1016/j.parco.2017.02.005)
- 21. Zhang B, Estrada T, Cicotti P, Balaji P, Taufer M. 2014. Enabling in-situ data analysis for large protein-folding trajectory datasets. In Proc. 28th IEEE Int. Parallel & Distributed Processing Symp. (IPDPS), pp. 221–230.
- 22. Johnston T, Zhang B, Liwo A, Crivelli S, Taufer M. 2017. In situ data analytics and indexing of protein trajectories. J. Comput. Chem. 38, 1419–1430. (10.1002/jcc.24729)
- 23. Johnston T, Zhang B, Liwo A, Crivelli S, Taufer M. 2017. In situ data analytics and indexing of protein trajectories. J. Comput. Chem. 38, 1419–1430. (10.1002/jcc.24729)
- 24. Johnston T, Zhang B, Liwo A, Crivelli S, Taufer M. 2015. In-situ data analysis of protein folding trajectories. CoRR abs/1510.08789.
- 25. Estrada T, Benson J, Carrillo-Cabada H, Razavi AS, Cuendet MA, Weinstein H, Deelman E, Taufer M. 2018. Graphic encoding of macromolecules for efficient high-throughput analysis. In Proc. 2018 ACM Int. Conf. on Bioinformatics, Computational Biology, and Health Informatics (BCB), pp. 315–324.
- 26. Ramachandran GN, Ramakrishnan C, Sasisekharan V. 1963. Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7, 95–99. (10.1016/S0022-2836(63)80023-6)
- 27. Carrillo-Cabada H, Benson J, Razavi A, Mulligan B, Cuendet MA, Weinstein H, Taufer M, Estrada T. 2019. A graphic encoding method for quantitative classification of protein structure and representation of conformational changes. IEEE/ACM Trans. Comput. Biol. Bioinform. (10.1109/TCBB.2019.2945291)