MDRepo—an open data warehouse for community-contributed molecular dynamics simulations of proteins

Amitava Roy; Ethan Ward; Illyoung Choi; Michele Cosi; Tony Edgin; Travis S Hughes; Md Shafayet Islam; Asif M Khan; Aakash Kolekar; Mariah Rayl; Isaac Robinson; Paul Sarando; Edwin Skidmore; Tyson L Swetnam; Mariah Wall; Zhuoyun Xu; Michelle L Yung; Nirav Merchant; Travis J Wheeler

doi:10.1093/nar/gkae1109

. 2024 Nov 13;53(D1):D477–D486. doi: 10.1093/nar/gkae1109

MDRepo—an open data warehouse for community-contributed molecular dynamics simulations of proteins

Amitava Roy ^1,², Ethan Ward ^2,², Illyoung Choi ³, Michele Cosi ⁴, Tony Edgin ⁵, Travis S Hughes ^6,⁷, Md Shafayet Islam ⁸, Asif M Khan ⁹, Aakash Kolekar ¹⁰, Mariah Rayl ¹¹, Isaac Robinson ¹², Paul Sarando ¹³, Edwin Skidmore ¹⁴, Tyson L Swetnam ¹⁵, Mariah Wall ¹⁶, Zhuoyun Xu ¹⁷, Michelle L Yung ¹⁸, Nirav Merchant ¹⁹, Travis J Wheeler ^20,^✉

PMCID: PMC11701643 PMID: 39535038

Abstract

Molecular Dynamics (MD) simulation of biomolecules provides important insights into conformational changes and dynamic behavior, revealing critical information about folding and interactions with other molecules. The collection of simulations stored in computers across the world holds immense potential to serve as training data for future Machine Learning models that will transform the prediction of structure, dynamics, drug interactions, and more. Ideally, there should exist an open access repository that enables scientists to submit and store their MD simulations of proteins and protein-drug interactions, and to find, retrieve, analyze, and visualize simulations produced by others. However, despite the ubiquity of MD simulation in structural biology, no such repository exists; as a result, simulations are instead stored in scattered locations without uniform metadata or access protocols. Here, we introduce MDRepo, a robust infrastructure that provides a relatively simple process for standardized community contribution of simulations, activates common downstream analyses on stored data, and enables search, retrieval, and visualization of contributed data. MDRepo is built on top of the open-source CyVerse research cyber-infrastructure, and is capable of storing petabytes of simulations, while providing high bandwidth upload and download capabilities and laying a foundation for cloud-based access to its stored data.

Graphical Abstract

Introduction

In Molecular Dynamics (MD) simulation, the movement and interactions of one or more molecules is estimated over time by calculating the force on every atom at discreet time steps on the order of femtoseconds. MD simulation of the fluctuation of a protein molecule with several thousand atoms is commonly captured over time scales of nanoseconds to microseconds, enabling exploration of molecular interactions at spatial and temporal scales that are difficult to observe experimentally (1). The primary products of MD simulations are coordinates over time, known as trajectories, saved at a user-defined frequency (often pico- to nanoseconds). The resulting size of the files capturing simulated atomic trajectories is on the order of many gigabytes. A wide variety of post-simulation analyses are performed by researchers, ranging from quality control, to free energy estimates, to measurements of molecular mobility.

It is common in most data-intensive areas of biological science that primary data is captured in central, open repositories, made publicly available when associated research is published. This has been codified in the data management and sharing policies of most journals and large research funding organizations. For example, is now a common funding agency mandate that all generated data should adhere to the FAIR guiding principles for scientific data management and stewardship (i.e. ensuring that data are Findable, Accessible, Interoperable, and Reusable (2)).

This guiding principle has driven the creation of invaluable centralized open-access repositories for large-scale data across bioinformatics. Notable examples include archives of sequence and functional information for proteins (3) and DNA (4,5), a gene expression atlas (6), a data bank for protein structures (7), and databases for classifications of protein families (8) and structural domains (9). These repositories are characterized by their support for scalable expansion, performant search and retrieval, structured metadata, and open access nature; most are designed to grow by accepting data contributions from researchers across the globe.

Despite the trend for open access repositories, there is no equivalent option for creators of MD simulations. A few special-purpose databases do exist (e.g. (10–16)), but these are limited in scope or scale, and none are designed to receive contributions from the community, or to interact directly with the computational resources required for large-scale analysis or model training. BioSimGrid (17) was devised two decades ago with the purpose of enabling community contributions of diverse MD simulations, but funding and technical challenges caused its mission to remain unfulfilled. Because no adequate repository currently exists, the worldwide collection of protein MD simulations, reaching well into the petabytes in scale, is stored in a highly fragmented landscape. Researchers wishing to fulfill data sharing mandates are forced to either host their own web server (with predictable decay in availability (18)) or share simulation data via one of several unstructured, general-purpose open repositories (e.g. OSF (19), Zenodo (https://www.zenodo.org), FigShare (https://figshare.com)). Meanwhile, the large majority of MD data languish on private computers with no public access capability.

The lost opportunities resulting from the current fragmented data landscape can hardly be overstated. One obvious consequence is that a researcher who might benefit from a collection of previously-performed simulations is likely to be unaware of their existence, and will therefore either repeat the expensive calculations or proceed with no such simulation data. Arguably more important is the lost potential to use the large collection of existing simulations to train systems (especially machine learning models) for a variety of analytical problems that would benefit from a nuanced understanding of the diversity of dynamics of molecular systems. One compelling example of the role that a large collection of MD simulations could play in training of machine learning models is in the context of rapid computational prediction of drug binding affinity and dynamics. Modern computational protein-drug affinity estimation methods demonstrate limited general predictive power (20,21); the failure of these models limits their utility in drug development, and stems in great part from insufficient volume of training data (22,23) and lack of representation of structural variability (24).

Recent simulation data sets intended for model training (16,25) hint at the potential for using simulated trajectories to improve machine-learning models for binding and affinity prediction. Unfortunately, scale and data availability are limiting: one resource (25) appears to provide only structures and summary statistics over simulated trajectories (the data are inaccessible at the time of this writing), while the other is limited to only 100 snapshots over 8ns per simulation (16); neither is designed for expansion via community contribution, and even the many thousands of simulations that they contain are a small fraction of the full set of protein-drug interactions studied by researchers across the globe. A large and diverse pool of MD simulations will serve as the launch pad for future machine learning methods in protein-drug affinity prediction, just as large-scale protein structure databases provided the necessary training data for transformative deep learning methods for structural prediction (26–28).

It is remarkable that there does not currently exist an open repository that enables scientists to submit and store their protein/drug MD simulations for access by others. This is particularly true when considering the extensive use of such simulations in research labs around the world, the large computational burden of individual simulation runs, the reusability of resulting data, the increasing emphasis on FAIR data management, and the high value of such data for training machine learning tools to perform a broad spectrum of related analyses. As evidenced by the many existing databases, there is no shortage of interest in creating such a repository, so we suspect that the lack of a general repository is primarily due to the challenges of scale. Considering both published and unpublished simulations, large-scale projects, and individual research efforts, it seems likely that several million protein MD simulations have been performed over the decades. Therefore, the total size of existing MD simulation data must range in the many petabytes in size. This data scale places extreme demands on infrastructure, both for storage and the data transmission required to enable convenient access to multiple simulations. These demands necessitate a hardware and system architecture that are generally beyond the scope of a single research group.

Here, we introduce a new service designed to fill this void. MDRepo is an open repository that is designed to support community contribution, large-scale retrieval, visualization, and cloud-backed analysis of biomolecule MD simulations. It is designed to provide a home for the millions of simulations accumulated over decades of research effort, with an expected eventual scale of 10s of petabytes. Storage of simulated trajectories is intended to reduce redundant research efforts, improve reproducibility, and enable new discoveries and modeling techniques. In the initial release, MDRepo is built to accommodate protein simulations (with or without ligands); it will soon expand to capture simulations of all biomolecules. Data stored in MDRepo are released under the open Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), ensuring unfettered use and distribution of its simulations. In the following sections, we introduce the MDRepo user interface and describe its underlying architecture.

Website and user interface

Researchers will interact with MDRepo primarily through its website. A site user can explore stored simulations, its metadata, and any available results of downstream analyses. They can also identify simulations matching particular search constraints and manage data movement (contribution and download) of any number of simulations. Each simulation is stored as a separate entry, with standardized metadata captured for each. Each entry is assigned a unique and persistent accession number.

Data Exploration page

The common page for searching and exploring MDRepo data is the Explore page. This page, as seen in Figure 1, presents a list of all simulations in the database, and can be sorted and filtered to meet user requirements. Search fields include the simulation ‘Description’, ‘Biomolecules’ and ‘Ligands’ associated with the simulation, the ‘Protein sequence’ and the ‘Software’ used to create the simulation (some fields are hidden from view in the screen capture). Fields can be dynamically added and removed from user view. Results are paginated with user-selected page length (default is 10 simulations per page).

Simulation Detail page

A site user may click on an entry in the Explore page, and navigate to the Simulation Detail page (Figure 2) for a selected simulation. This resulting page provides additional simulation properties, such as duration and time steps, RMSD/RMSF values, simulation software version and parameters, and (where applicable) a linkout to the original website source of the simulation. The Simulation Detail page also contains a visualization of the trajectory as rendered with the NGL viewer, and provides access to download all files associated with the simulation.

Data download

Data for a single simulation, including the files associated with the simulation, can be downloaded from the Simulation Detail page, using the list of files presented at the bottom of the page (Figure 3). The selected files will be compressed into a ‘.zip’ file and downloaded through the browser.

Figure 3. — On the *MDRepo* simulation detail page, a user may select and download one or more files associated with a simulation.This figure shows the collection of files available for simulation MDR00001111.

The MDRepo system also supports download of multiple simulations at the same time. Rather than downloading batch simulations through the browser, MDRepo is designed to support high throughput and fault-tolerant download directly to a user-side server, such as an HPC resource, where there is expected to be both sufficient storage to hold the requested data and sufficient computational power to perform analyses on the downloaded simulations. A user can download many simulations by first selecting the desired entries from the Explore page, then clicking the ‘Download Selection’ button. This causes the backend to generate a download token that contains access information about the simulations to be downloaded. The user must install the MDRepo command-line tool (mdrepo, https://github.com/MD-Repo/md-repo-cli/) on the recipient server, run the command ‘mdrepo get’ as instructed (Figure 4), and supply the token provided as a result of choosing ‘Download Selection’. Note that the Download Selection button for multiple downloads is only available for users who have signed in with their ORCID iD. This limitation is taken to limit risk of denial of service attack.

Figure 4. — After selecting a batch of simulations to be downloaded, the user clicks the Download Selection button, receives instructions for running the ‘`mdrepo get`’ command, and is provided a token to support download on the recipient computer. For batch download, the computer is expected to be a server, not the system from which the user accessed MDRepo.

Data submission

MDRepo allows contributions from authenticated users. A contributor must create a metadata file for each simulation that they wish to upload, and organize their simulation files in a specific manner. They can then use the mdrepo command-line tool to upload their simulation files.

A submission directory is a single directory containing a set of subdirectories, with one subdirectory for each simulation to be uploaded. Each simulation subdirectory contains one trajectory file, one structure / coordinate file, one topology file (.psf for CHARMM, NAMD, XPLOR, .top, .itp, .tpr for GROMACS, .prmtop for AMBER etc.), the simulation metadata file, and any additional files produced with the simulation.

The metadata file for each Simulation can be generated manually (optionally with assistance from the help page provided on the MDRepo website: https://mdrepo.org/metadata), or with a contributor-created script that converts user-structured data into the precise format expected by MDRepo. The required metadata includes information such as simulation descriptions, protein/ligand information, the software used to produce the simulation, published papers, contributor details, and information about the files to be uploaded.

Contributors must have the mdrepo command-line tool installed on the system containing the simulation directories in order to perform the upload. They may then click the ‘Contribute’ button on any page of the MDRepo site, and choose ‘Get upload tokens’. A cryptographic token is then created, which ensures that the account that generated the token is associated with the person that submits the simulation files from their computer. Simulation upload progress or errors can be tracked on the upload logs page (Figure 5).

Figure 5. — Once a user has contributed one or more simulations, they can view their upload log. The log contains information about ongoing and past contributions.

Methods

System Architecture

Users of MDRepo will primarily interact with the website (https://mdrepo.org/), where they can explore existing simulations, request batch downloads from the data store, and initiate data contributions. For contributions and large-scale retrievals, the user manages data transfer with our mdrepo command-line tool, which controls data upload/download in a high-throughput and fault-tolerant manner.

All functionality rests on the foundation provided by the open-source CyVerse (29) research cyber-infrastructure. MDRepo is a cloud native platform deployed onto Kubernetes, a container orchestration engine. This enables MDRepo to scale specific services, such as the web application, in response to increasing connections, cpu load, or RAM utilization. In addition, Kubernetes provides facilities for high availability and load balancing of containerized services. Upgrades can be seamlessly deployed using controlled rollout process. All these features provide for a robust MDRepo platform that can scale and grow as the number of users and data grows.

Website

The MDRepo website is hosted on a set of virtual machines (VMs) within at the JetStream2 cloud computing environment (30). The front end provides an interactive user experience based on the React (https://react.dev) and Next.js (https://nextjs.org) frameworks, supplemented with Chakra UI (https://v2.chakra-ui.com) components. Visualization of simulations is performed using the NGL viewer (31). Website backend operations are handled by the Django web framework (https://www.djangoproject.com), with database operations supported by PostgreSQL (https://www.postgresql.org). MDRepo only supports requests for batch upload or download from site users who have authenticated using a valid ORCID account (https://orcid.org) to avoid site vandalism and denial of service attacks; general exploration and single-simulation downloads do not require authentication.

Data storage

MDRepo content is stored in one of two ways, with content storage divided in a way that can accommodate peta-scale simulation data while providing for a fast and interactive website. (i) Metadata about both simulations and users is captured in a site-specific PostgreSQL relational database hosted on a VM co-located with the primary webserver. (ii) All large primary data files (such as trajectory and topology files) are stored in CyVerse Data Store. CyVerse Data Store is managed across two sites: the University of Arizona (UArizona) and the Texas Advanced Computing Center (TACC), and is built on top of the Integrated Rule Oriented Data System (iRODS) (32), a federated data grid and data management system with high throughput data handling capabilities. Data managed through iRODS benefits from the underlying metadata driven rules and policies that afford fine grained access control, along with automation through its message bus architecture and subscription based event monitoring that allows integration with external systems. Data stored in CyVerse Data Store is replicated between UArizona and TACC to ensure data availability, reliability, and resilience.

Data contribution and download

A unique aspect of MDRepo is its design in support of direct contribution of MD simulations. We expect that most data uploads will be performed from computer servers where simulations were performed, rather than from researchers’ laptops. Similarly, we expect that downloads involving multiple simulations will generally aim to gather data to a user side server where large-scale analyses can be performed. We have developed a command-line tool to meet these needs, mdrepo (https://github.com/MD-Repo/md-repo-cli), written in the Go programming language. The user first requests an MDRepo token from its website (for either contribution or download) to initiate data transfer, then provides that token to the command-line tool for authenticated data transfer.

A path to a directory is provided to mdrepo for upload. The directory may contain multiple subdirectories – each is treated as a distinct simulation for submission, and must contain a topology file, a trajectory file, and a metadata file containing information specifically describing the simulation (see https://mdrepo.org/metadata for specifications, and for a sample Python script to help researchers automate the task of preparing large numbers of metadata files). In the case of download, sub-directories are added to the provided path, one for each requested simulation. When using the mdrepo commandline tool, data transfer speeds will generally be limited only by the bandwidth of the computer being used to perform the transfer – in our experience the MDRepo back end supports roughly Gigabit-speed transfer.

Post-upload processing pipeline

Upon completion of simulation upload to the iRODS landing directory, a postprocess event is initiated in the MDRepo backend. This event validates each submitted trajectory, performs a few standard analyses, and loads information from the simulation metadata file into the webserver’s database. File upload status is monitored by the CyVerse Datawatch (https://gitlab.com/cyverse/datawatch) system; after all files in a simulation directory have completed transfer, Datawatch makes a post request to the Django backend, where Django Q2 (https://django-q2.readthedocs.io/en/master/) manages the task queue for the entire submission process.

Data formatting consists of the following steps:

Verify that the metadata file is valid (correct format, all required fields are present and meet the field requirements).
Confirm that files specified in the metadata file exist in the upload.
Ensure that the submission is not a duplicate (based on the file hashes of the uploaded files).
Check that trajectory files do not exceed the current maximum size (25 GB as of September 2024).
Perform a collection of basic sanity checks and quality control tests.

If all of these checks pass, the simulation files are then copied from the iRODS landing directory to the task processing server. The following processing steps are then taken:

Using MDTraj (33), save or convert the structure file to the ‘.pdb’ format.
Using MDTraj, save or convert the trajectory file to an ‘.xtc’ file (which provides modest compression).
Compute a hash of the topology file. If the topology hash matches other hashes already stored in the database, but the trajectory hashes do not, then the simulation is a replicate of an existing contribution, initialized from the same starting topology. The simulation is automatically linked with the existing contribution matching the hash.
Compute RMSD and RMSF values of the trajectory.
Produce a copy of the simulation, with extra atoms (lipids, water, and ions) removed from the simulation files using VMD (34)) .
Create a thumbnail image of the protein structure using the minimal structure file and NGL viewer (35).
Extract the protein sequence from the structure file, and store it in the database entry for the simulation.
Using MDTraj, create a down-sampled trajectory with 100 frames; this serves as the lightweight visual presented on the webpage for the simulation.
Save database objects corresponding to the simulation, metadata, and simulation upload logs.
Upload files from the simulation processing server to their final iRODS destination.

Seeding MDRepo with simulations

Many repositories of MD simulations have been created over the years (e.g. (10–15)), each containing hundreds or thousands of simulations. Each of these repositories, typically containing simulations produced by the database hosts, is a valuable resource that provides researchers access to a quantity and diversity of simulations that they would be unlikely to produce on their own. However, data organization and access patterns differ between services, and relatively low server bandwidth means that batch downloads are generally quite slow. With the aim of improving accessibility of these data for researchers (relatively simple search/download protocols and improved access speed), we have imported simulations from two of these databases into MDRepo. The first data source is the ATLAS repository (15), which holds the results of simulations based on pdb-sourced topology files. At the time of download (April 2024), the ATLAS website described the results as ‘freely available’, though no specific licence is described. Download of 2798 simulations from ATLAS required nearly 3 weeks to complete due to download bandwidth limitations. The second data set is GPCRmd (13), which was released under the same Creative Commons Attribution 4.0 license as MDRepo. At the time of download (January 2024), we retrieved 1457 simulations across 48 G-protein coupled receptor proteins, with download requiring over one week to complete. Each resulting MDRepo entry contains a reference (a ‘linkout’) to the URL associated with the simulation source at the time of import, to ensure that proper credit is given to data creators. (Note: since our download, the GPCRmd site has moved behind a registration wall, and many of the retrieved simulations seem to be unavailable on the site. This change in accessibility for data released under an open license highlights one of the motivations for an enduring and perpetually open simulation repository).

In addition to simulations gathered from two existing repositories, we have received several dozen contributions during an invitation-only phase of system validation. We intend to world with developers of other repositories to provide a secure and long-term storage option for their data, and we anticipate that the number of contributions from individual data creators will grow in the coming months.

Discussion

MDRepo is an open repository for community-generated MD simulations of biomolecules. It is designed to provide a home for millions of simulations accumulated over years of research effort, with a robust storage infrastructure that ensures both data safety and high throughput data access. The centralized and open access nature of the repository will help to meet demands for reduced environmental impact by reducing redundant effort, improving reproducibility, and obviating dispersed storage solutions. Meanwhile, the anticipated 10s of petabytes of simulation data will enable new discoveries and modeling techniques.

A researcher may submit any number of simulations to MDRepo, from a single trajectory to thousands. Submitted simulations are subjected to some post-submission validation and preparation, then stored in infrastructure backed by CyVerse. Each simulation is stored as a separate entry, with standardized metadata captured for each. MDRepo places no restrictions on the use or distribution of stored data.

Users can search and explore simulations submitted by others. An individual trajectory can be downloaded directly from the website. Downloading a batch of simulations is performed with the MDRepo command-line tool.

We have seeded the repository with several thousand simulations, some gathered from other valuable repositories, and some newly generated for the repository. While this initial seed will serve as a large valuable resource to the community, it is only a first step. By the time of publication, we will have added hundreds of simulations of ligand-bound interactions for NAMPT and kinase drug targets, but the promise of MDRepo will only be reached through extensive data contribution from the community. With this in mind, we are actively developing an interface between MDRepo and the forthcoming European Union-funded Molecular Dynamics Data Bank (MDDB), which is intended to serve as a federated system of topical simulation repositories (36). This effort will include development of functionality in support of specialized approaches such as umbrella or enhanced sampling, and an expanded interface for retrieving simulation subsets.

We anticipate that a growing pool of community-contributed data will enable new discoveries via re-analysis of individual simulations and through development (by us and others) of new Machine Learning models designed to leverage the rich trove of training data. We welcome the opportunity to work with the broader community to extend the collection of stored simulations into the millions, and to improve the functional and analytical features of the website.

One important philosophy driving some design decisions for MDRepo is the way that we handle cases of missing metadata. Rather than enforce strict metadata requirements, MDRepo allows contributions with incomplete metadata. This means that some simulations will be non-reproducible, or otherwise imperfectly documented, but we believe that the benefit of such data outweighs the harm. This approach will maximize data availability, and users of MDRepo will be able to filter downloaded simulations to remove entries missing metadata that is vital for their purposes. This approach is in agreement with the retention of proteins with limited support in UniProt TrEMBL, and retention of protein-ligand interactions with poor resolution in PDB.

One important benefit of the architectural design of MDRepo is that the data are stored in a location that is designed to support access from academic and corporate cloud systems. Though the functionality does not yet exist, in the future we will establish the infrastructure to allow researchers to avoid the step of downloading data to their own servers, and instead to bring containerized analysis pipelines and machine learning models close to the data, for analysis in the cloud.

MDRepo currently constrains the size of submissions, only accepting simulations with trajectory files no larger than 25GB. We plan to raise this limit in the future, to allow storage of longer simulations and simulations of much larger systems.

In its present design, MDRepo is designed to store and manage results from conventional molecular dynamics simulations. At this stage, it does not accommodate more specialized approaches, such as enhanced sampling methods (e.g., replica exchange, metadynamics) or ensemble-based techniques. In the coming months, we intend to establish support for such simulations following a round of community feedback.

While the current design ensures that data creators can be credited for their contributions to data found in MDRepo (through a combination of contributor lists, paper citations, and link-outs), we recognize the importance of improving the landscape of credit; over the coming months, we will formalize a robust framework for microcitation, so that researchers who contribute simulations to MDRepo will receive credit when those simulations are used by work leading to publications by other researchers.

Acknowledgements

We thank members of the CompbioAsia research community, and the associated Bioinformatics Research Consortium, for discussions that initiated development of MDRepo, with particular gratitude for early suggestions by Charles Laughton and Chandra Verma. We also gratefully acknowledge the high performance computing (HPC) resources supported by the University of Arizona TRIF, UITS, and Research, Innovation, and Impact (RII) and maintained by the UArizona Research Technologies department.

Contributor Information

Amitava Roy, Department of Biomedical and Pharmaceutical Sciences, University of Montana, Missoula, MT, USA.

Ethan Ward, R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ, USA.

Illyoung Choi, CyVerse, University of Arizona, Tucson, AZ, USA.

Michele Cosi, Data Science Institute, University of Arizona, Tucson, AZ, USA.

Tony Edgin, CyVerse, University of Arizona, Tucson, AZ, USA.

Travis S Hughes, Department of Biomedical and Pharmaceutical Sciences, University of Montana, Missoula, MT, USA; Biochemistry and Biophysics, University of Montana, Missoula, MT, USA.

Md Shafayet Islam, Department of Physics, Shahjalal University of Science and Technology, Sylhet, Bangladesh.

Asif M Khan, University of Doha for Science and Technology, Doha, Qatar.

Aakash Kolekar, R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ, USA.

Mariah Rayl, Biochemistry and Biophysics, University of Montana, Missoula, MT, USA.

Isaac Robinson, R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ, USA.

Paul Sarando, CyVerse, University of Arizona, Tucson, AZ, USA.

Edwin Skidmore, CyVerse, University of Arizona, Tucson, AZ, USA.

Tyson L Swetnam, CyVerse, University of Arizona, Tucson, AZ, USA.

Mariah Wall, CyVerse, University of Arizona, Tucson, AZ, USA.

Zhuoyun Xu, CyVerse, University of Arizona, Tucson, AZ, USA.

Michelle L Yung, CyVerse, University of Arizona, Tucson, AZ, USA.

Nirav Merchant, CyVerse, University of Arizona, Tucson, AZ, USA.

Travis J Wheeler, R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ, USA.

Data availability

This manuscript describes a repository for MD simulations. Simulations are released under the Creative Commons Attribution 4.0 license. All data are available at the MDRepo website, https://mdrepo.org.

Funding

National Science Foundation under DBI Grant [0735191, 1265383, 1743442], along with support from the University of Arizona Research, Innovation & Impact (RII) through BIO5 and IT4IR TRIF Funds; computational workload depends on Jetstream2 at Indiana University through allocation BIO230080 from the NSF ACCESS program, which is supported by OAC Grants [2138259, 2138286, 2138307, 2137603, 2138296]; NSF IRES grant [1953405]. Funding for open access charge: National Science Foundation under DBI Grant [0735191, 1265383, 1743442], along with support from the University of Arizona Research, Innovation & Impact (RII) through BIO5 and IT4IR TRIF Funds.Computational workload depends on Jetstream2 at Indiana University through allocation BIO230080 from the NSF ACCESS program, which is supported by OAC Grants [2138259, 2138286, 2138307, 2137603, 2138296]; NSF IRES grant [1953405].

Conflict of interest statement. None declared.

References

1. Dror R.O., Dirks R.M., Grossman J., Xu H., Shaw D.E.. Biomolecular simulation: a computational microscope for molecular biology. Annu. Rev. Biophys. 2012; 41:429–452. [DOI] [PubMed] [Google Scholar]
2. Wilkinson M.D., Dumontier M., Aalbersberg I.J., Appleton G., Axton M., Baak A., Blomberg N., Boiten J.-W., da Silva Santos L.B., Bourne P.E.et al.. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data. 2016; 3:1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
3. The UniProt Consortium UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023; 51:D523–D531. [DOI] [PMC free article] [PubMed] [Google Scholar]
4. Sayers E.W., Cavanaugh M., Clark K., Pruitt K.D., Sherry S.T., Yankie L., Karsch-Mizrachi I.. GenBank 2024 update. Nucleic Acids Res. 2024; 52:D134–D137. [DOI] [PMC free article] [PubMed] [Google Scholar]
5. Katz K., Shutov O., Lapoint R., Kimelman M., Brister J.R., O’Sullivan C.. The Sequence Read Archive: a decade more of explosive growth. Nucleic Acids Res. 2022; 50:D387–D390. [DOI] [PMC free article] [PubMed] [Google Scholar]
6. GTEx Consortium The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020; 369:1318–1330. [DOI] [PMC free article] [PubMed] [Google Scholar]
7. Burley S.K., Bhikadiya C., Bi C., Bittrich S., Chao H., Chen L., Craig P.A., Crichlow G.V., Dalenberg K., Duarte J.M.et al.. RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Res. 2023; 51:D488–D508. [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Paysan-Lafosse T., Blum M., Chuguransky S., Grego T., Pinto B.L., Salazar G.A., Bileschi M.L., Bork P., Bridge A., Colwell L.et al.. InterPro in 2022. Nucleic Acids Res. 2023; 51:D418–D427. [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Andreeva A., Kulesha E., Gough J., Murzin A.G.. The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res. 2020; 48:D376–D382. [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Meyer T., D’Abramo M., Rueda M., Ferrer-Costa C., Pérez A., Carrillo O., Camps J., Fenollosa C., Repchevsky D., Gelpí J.L.et al.. MoDEL (Molecular Dynamics Extended Library): a database of atomistic molecular dynamics trajectories. Structure. 2010; 18:1399–1409. [DOI] [PubMed] [Google Scholar]
11. Newport T.D., Sansom M. S.P., Stansfeld P.J.. The MemProtMD database: a resource for membrane-embedded protein structures and their lipid interactions. Nucleic Acids Res. 2019; 47:D390–D397. [DOI] [PMC free article] [PubMed] [Google Scholar]
12. Juračka J., Šrejber M., Melíková M., Bazgier V., Berka K.. MolMeDB: molecules on membranes database. Database. 2019; 2019:baz078. [DOI] [PMC free article] [PubMed] [Google Scholar]
13. Rodríguez-Espigares I., Torrens-Fontanals M., Tiemann J.K., Aranda-García D., Ramírez-Anguita J.M., Stepniewski T.M., Worp N., Varela-Rial A., Morales-Pastor A., Medel-Lacruz B.et al.. GPCRmd uncovers the dynamics of the 3D-GPCRome. Nature Methods. 2020; 17:777–787. [DOI] [PubMed] [Google Scholar]
14. van der Kamp M.W., Schaeffer R.D., Jonsson A.L., Scouras A.D., Simms A.M., Toofanny R.D., Benson N.C., Anderson P.C., Merkley E.D., Rysavy S.et al.. Dynameomics: a comprehensive database of protein dynamics. Structure. 2010; 18:423–435. [DOI] [PMC free article] [PubMed] [Google Scholar]
15. Vander Meersche Y., Cretin G., Gheeraert A., Gelly J.-C., Galochkina T.. ATLAS: protein flexibility description from atomistic molecular dynamics simulations. Nucleic Acids Res. 2024; 52:D384–D392. [DOI] [PMC free article] [PubMed] [Google Scholar]
16. Siebenmorgen T., Menezes F., Benassou S., Merdivan E., Didi K., Mourão A.S.D., Kitel R., Liò P., Kesselheim S., Piraud M.et al.. MISATO: machine learning dataset of protein–ligand complexes for structure-based drug discovery. Nat. Computat. Sci. 2024; 4:1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
17. Tai K., Murdock S., Wu B., Ng M.H., Johnston S., Fangohr H., Cox S.J., Jeffreys P., Essex J.W., Sansom M.S.. BioSimGrid: towards a worldwide repository for biomolecular simulations. Org. Biomol. Chem. 2004; 2:3219–3221. [DOI] [PubMed] [Google Scholar]
18. Kern F., Fehlmann T., Keller A.. On the lifetime of bioinformatics web services. Nucleic Acids Res. 2020; 48:12523–12533. [DOI] [PMC free article] [PubMed] [Google Scholar]
19. Foster E.D., Deardorff A.. Open science framework (OSF). J. Med. Libr. Assoc. 2017; 105:203. [Google Scholar]
20. Cerón-Carrasco J.P. When virtual screening yields inactive drugs: dealing with false theoretical friends. ChemMedChem. 2022; 17:e202200278. [DOI] [PMC free article] [PubMed] [Google Scholar]
21. Díaz-Rovira A.M., Martín H., Beuming T., Díaz L., Guallar V., Ray S.S.. Are deep learning structural models sufficiently accurate for virtual screening? application of docking algorithms to AlphaFold2 predicted structures. J. Chem. Inform. Model. 2023; 63:1668–1674. [DOI] [PubMed] [Google Scholar]
22. Stärk H., Ganea O., Pattanaik L., Barzilay R., Jaakkola T.. Equibind: Geometric deep learning for drug binding structure prediction. International Conference on Machine Learning. 2022; PMLR; 20503–20521. [Google Scholar]
23. Lin S., Shi C., Chen J.. Generalizeddta: combining pre-training and multi-task learning to predict drug-target binding affinity for unknown drug discovery. BMC bioinformatics. 2022; 23:367. [DOI] [PMC free article] [PubMed] [Google Scholar]
24. Axelrod S., Gomez-Bombarelli R.. Molecular machine learning with conformer ensembles. Mach. Learn. Sci. Technol. 2023; 4:035025. [Google Scholar]
25. Korlepara D.B., CS V., Srivastava R., Pal P.K., Raza S.H., Kumar V., Pandit S., Nair A.G., Pandey S., Sharma S.et al.. Plas-20k: Extended dataset of protein-ligand affinities from md simulations for machine learning applications. Sci. Data. 2024; 11:180. [DOI] [PMC free article] [PubMed] [Google Scholar]
26. Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A.et al.. Highly accurate protein structure prediction with AlphaFold. Nature. 2021; 596:583–589. [DOI] [PMC free article] [PubMed] [Google Scholar]
27. Lin Z., Akin H., Rao R., Hie B., Zhu Z., Lu W., dos Santos Costa A., Fazel-Zarandi M., Sercu T., Candido S.et al.. Language models of protein sequences at the scale of evolution enable accurate structure prediction. BioRxiv. 2022; 2022:500902. [Google Scholar]
28. Baek M., DiMaio F., Anishchenko I., Dauparas J., Ovchinnikov S., Lee G.R., Wang J., Cong Q., Kinch L.N., Schaeffer R.D.et al.. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021; 373:871–876. [DOI] [PMC free article] [PubMed] [Google Scholar]
29. Swetnam T.L., Antin P.B., Bartelme R., Bucksch A., Camhy D., Chism G., Choi I., Cooksey A.M., Cosi M., Cowen C.et al.. CyVerse: cyberinfrastructure for open science. PLOS Computat. Biol. 2024; 20:e1011270. [DOI] [PMC free article] [PubMed] [Google Scholar]
30. Hancock D.Y., Fischer J., Lowe J.M., Michael S., Weakley L.M.. Jetstream2: Research Clouds as a Convergence Accelerator. Comput.Sci. Eng. 2024; 99:1–11. [Google Scholar]
31. Rose A., Bradley A.R., Valasatava Y., Duarte A.M., Prlić A., Rose P.W.. NGL Viewer: web-based molecular graphics for large complexes. Bioinformatics. 2018; 34:3755–3758. [DOI] [PMC free article] [PubMed] [Google Scholar]
32. Rajasekar A., Moore R., Hou C.-Y., Lee C., Marciano R., de Torcy A., Wan M., Schroeder W., Chen S., Gilbert L.et al.. iRODS Primer: Integrated Rule-Oriented Data System, Vol.2, Synthesis Lectures on Information Concepts, Retrieval, and Services. 2010; https://link.springer.com/book/10.1007/978-3-031-02271-5.
33. McGibbon R.T., Beauchamp K.A., Harrigan M.P., Klein C., Swails J.M., Hernández C.X., Schwantes C.R., Wang L.-P., Lane T.J., Pande V.S.. MDTraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys. J. 2015; 109:1528–1532. [DOI] [PMC free article] [PubMed] [Google Scholar]
34. Spivak M., Stone J.E., Ribeiro J., Saam J., Freddolino P.L., Bernardi R.C., Tajkhorshid E.. VMD as a platform for interactive small molecule preparation and visualization in quantum and classical simulations. J. Chem. Inform. Model. 2023; 63:4664–4678. [DOI] [PMC free article] [PubMed] [Google Scholar]
35. Rose A.S., Hildebrand P.W.. NGL Viewer: a web application for molecular visualization. Nucleic Acids Res. 2015; 43:W576–W579. [DOI] [PMC free article] [PubMed] [Google Scholar]
36. MDDB Preliminary report on sustainability models for the MDDB infrastructure. Technical report, MDDBR. 2024; https://mddbr.eu/wp-content/uploads/2024/07/D4.3-Preliminary-report-on-sustainability-models-for-the-MDDB-infrastructure.pdf.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

[B1] 1. Dror R.O., Dirks R.M., Grossman J., Xu H., Shaw D.E.. Biomolecular simulation: a computational microscope for molecular biology. Annu. Rev. Biophys. 2012; 41:429–452. [DOI] [PubMed] [Google Scholar]

[B2] 2. Wilkinson M.D., Dumontier M., Aalbersberg I.J., Appleton G., Axton M., Baak A., Blomberg N., Boiten J.-W., da Silva Santos L.B., Bourne P.E.et al.. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data. 2016; 3:1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B3] 3. The UniProt Consortium UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023; 51:D523–D531. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B4] 4. Sayers E.W., Cavanaugh M., Clark K., Pruitt K.D., Sherry S.T., Yankie L., Karsch-Mizrachi I.. GenBank 2024 update. Nucleic Acids Res. 2024; 52:D134–D137. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B5] 5. Katz K., Shutov O., Lapoint R., Kimelman M., Brister J.R., O’Sullivan C.. The Sequence Read Archive: a decade more of explosive growth. Nucleic Acids Res. 2022; 50:D387–D390. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B6] 6. GTEx Consortium The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020; 369:1318–1330. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B7] 7. Burley S.K., Bhikadiya C., Bi C., Bittrich S., Chao H., Chen L., Craig P.A., Crichlow G.V., Dalenberg K., Duarte J.M.et al.. RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Res. 2023; 51:D488–D508. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B8] 8. Paysan-Lafosse T., Blum M., Chuguransky S., Grego T., Pinto B.L., Salazar G.A., Bileschi M.L., Bork P., Bridge A., Colwell L.et al.. InterPro in 2022. Nucleic Acids Res. 2023; 51:D418–D427. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B9] 9. Andreeva A., Kulesha E., Gough J., Murzin A.G.. The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res. 2020; 48:D376–D382. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B10] 10. Meyer T., D’Abramo M., Rueda M., Ferrer-Costa C., Pérez A., Carrillo O., Camps J., Fenollosa C., Repchevsky D., Gelpí J.L.et al.. MoDEL (Molecular Dynamics Extended Library): a database of atomistic molecular dynamics trajectories. Structure. 2010; 18:1399–1409. [DOI] [PubMed] [Google Scholar]

[B11] 11. Newport T.D., Sansom M. S.P., Stansfeld P.J.. The MemProtMD database: a resource for membrane-embedded protein structures and their lipid interactions. Nucleic Acids Res. 2019; 47:D390–D397. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B12] 12. Juračka J., Šrejber M., Melíková M., Bazgier V., Berka K.. MolMeDB: molecules on membranes database. Database. 2019; 2019:baz078. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B13] 13. Rodríguez-Espigares I., Torrens-Fontanals M., Tiemann J.K., Aranda-García D., Ramírez-Anguita J.M., Stepniewski T.M., Worp N., Varela-Rial A., Morales-Pastor A., Medel-Lacruz B.et al.. GPCRmd uncovers the dynamics of the 3D-GPCRome. Nature Methods. 2020; 17:777–787. [DOI] [PubMed] [Google Scholar]

[B14] 14. van der Kamp M.W., Schaeffer R.D., Jonsson A.L., Scouras A.D., Simms A.M., Toofanny R.D., Benson N.C., Anderson P.C., Merkley E.D., Rysavy S.et al.. Dynameomics: a comprehensive database of protein dynamics. Structure. 2010; 18:423–435. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B15] 15. Vander Meersche Y., Cretin G., Gheeraert A., Gelly J.-C., Galochkina T.. ATLAS: protein flexibility description from atomistic molecular dynamics simulations. Nucleic Acids Res. 2024; 52:D384–D392. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B16] 16. Siebenmorgen T., Menezes F., Benassou S., Merdivan E., Didi K., Mourão A.S.D., Kitel R., Liò P., Kesselheim S., Piraud M.et al.. MISATO: machine learning dataset of protein–ligand complexes for structure-based drug discovery. Nat. Computat. Sci. 2024; 4:1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B17] 17. Tai K., Murdock S., Wu B., Ng M.H., Johnston S., Fangohr H., Cox S.J., Jeffreys P., Essex J.W., Sansom M.S.. BioSimGrid: towards a worldwide repository for biomolecular simulations. Org. Biomol. Chem. 2004; 2:3219–3221. [DOI] [PubMed] [Google Scholar]

[B18] 18. Kern F., Fehlmann T., Keller A.. On the lifetime of bioinformatics web services. Nucleic Acids Res. 2020; 48:12523–12533. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B19] 19. Foster E.D., Deardorff A.. Open science framework (OSF). J. Med. Libr. Assoc. 2017; 105:203. [Google Scholar]

[B20] 20. Cerón-Carrasco J.P. When virtual screening yields inactive drugs: dealing with false theoretical friends. ChemMedChem. 2022; 17:e202200278. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B21] 21. Díaz-Rovira A.M., Martín H., Beuming T., Díaz L., Guallar V., Ray S.S.. Are deep learning structural models sufficiently accurate for virtual screening? application of docking algorithms to AlphaFold2 predicted structures. J. Chem. Inform. Model. 2023; 63:1668–1674. [DOI] [PubMed] [Google Scholar]

[B22] 22. Stärk H., Ganea O., Pattanaik L., Barzilay R., Jaakkola T.. Equibind: Geometric deep learning for drug binding structure prediction. International Conference on Machine Learning. 2022; PMLR; 20503–20521. [Google Scholar]

[B23] 23. Lin S., Shi C., Chen J.. Generalizeddta: combining pre-training and multi-task learning to predict drug-target binding affinity for unknown drug discovery. BMC bioinformatics. 2022; 23:367. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B24] 24. Axelrod S., Gomez-Bombarelli R.. Molecular machine learning with conformer ensembles. Mach. Learn. Sci. Technol. 2023; 4:035025. [Google Scholar]

[B25] 25. Korlepara D.B., CS V., Srivastava R., Pal P.K., Raza S.H., Kumar V., Pandit S., Nair A.G., Pandey S., Sharma S.et al.. Plas-20k: Extended dataset of protein-ligand affinities from md simulations for machine learning applications. Sci. Data. 2024; 11:180. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B26] 26. Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A.et al.. Highly accurate protein structure prediction with AlphaFold. Nature. 2021; 596:583–589. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B27] 27. Lin Z., Akin H., Rao R., Hie B., Zhu Z., Lu W., dos Santos Costa A., Fazel-Zarandi M., Sercu T., Candido S.et al.. Language models of protein sequences at the scale of evolution enable accurate structure prediction. BioRxiv. 2022; 2022:500902. [Google Scholar]

[B28] 28. Baek M., DiMaio F., Anishchenko I., Dauparas J., Ovchinnikov S., Lee G.R., Wang J., Cong Q., Kinch L.N., Schaeffer R.D.et al.. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021; 373:871–876. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B29] 29. Swetnam T.L., Antin P.B., Bartelme R., Bucksch A., Camhy D., Chism G., Choi I., Cooksey A.M., Cosi M., Cowen C.et al.. CyVerse: cyberinfrastructure for open science. PLOS Computat. Biol. 2024; 20:e1011270. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B30] 30. Hancock D.Y., Fischer J., Lowe J.M., Michael S., Weakley L.M.. Jetstream2: Research Clouds as a Convergence Accelerator. Comput.Sci. Eng. 2024; 99:1–11. [Google Scholar]

[B31] 31. Rose A., Bradley A.R., Valasatava Y., Duarte A.M., Prlić A., Rose P.W.. NGL Viewer: web-based molecular graphics for large complexes. Bioinformatics. 2018; 34:3755–3758. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B32] 32. Rajasekar A., Moore R., Hou C.-Y., Lee C., Marciano R., de Torcy A., Wan M., Schroeder W., Chen S., Gilbert L.et al.. iRODS Primer: Integrated Rule-Oriented Data System, Vol.2, Synthesis Lectures on Information Concepts, Retrieval, and Services. 2010; https://link.springer.com/book/10.1007/978-3-031-02271-5.

[B33] 33. McGibbon R.T., Beauchamp K.A., Harrigan M.P., Klein C., Swails J.M., Hernández C.X., Schwantes C.R., Wang L.-P., Lane T.J., Pande V.S.. MDTraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys. J. 2015; 109:1528–1532. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B34] 34. Spivak M., Stone J.E., Ribeiro J., Saam J., Freddolino P.L., Bernardi R.C., Tajkhorshid E.. VMD as a platform for interactive small molecule preparation and visualization in quantum and classical simulations. J. Chem. Inform. Model. 2023; 63:4664–4678. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B35] 35. Rose A.S., Hildebrand P.W.. NGL Viewer: a web application for molecular visualization. Nucleic Acids Res. 2015; 43:W576–W579. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B36] 36. MDDB Preliminary report on sustainability models for the MDDB infrastructure. Technical report, MDDBR. 2024; https://mddbr.eu/wp-content/uploads/2024/07/D4.3-Preliminary-report-on-sustainability-models-for-the-MDDB-infrastructure.pdf.

PERMALINK

MDRepo—an open data warehouse for community-contributed molecular dynamics simulations of proteins

Amitava Roy

Ethan Ward

Illyoung Choi

Michele Cosi

Tony Edgin

Travis S Hughes

Md Shafayet Islam

Asif M Khan

Aakash Kolekar

Mariah Rayl

Isaac Robinson

Paul Sarando

Edwin Skidmore

Tyson L Swetnam

Mariah Wall

Zhuoyun Xu

Michelle L Yung

Nirav Merchant

Travis J Wheeler

Abstract

Graphical Abstract

Graphical Abstract.

Introduction

Website and user interface

Data Exploration page

Figure 1.

Simulation Detail page

Figure 2.

Data download

Figure 3.

Figure 4.

Data submission

Figure 5.

Methods

System Architecture

Website

Data storage

Data contribution and download

Post-upload processing pipeline

Seeding MDRepo with simulations

Discussion

Acknowledgements

Contributor Information

Data availability

Funding

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases