Abstract
In the Big Data era, a change of paradigm in the use of molecular dynamics is required. Trajectories should be stored under FAIR (findable, accessible, interoperable and reusable) requirements to favor their reuse by the community under an open science paradigm.
The communities that embraced data archiving efforts decades ago are now, in the era of data-driven biology, gaining the most from the AI revolution. The structural biology community was a pioneer in this regard, establishing the Protein Data Bank in 1971 and making data accessible using the FAIR principles even before these were articulated1,2. The genomics and bioinformatics community has followed the example, establishing many widely used databases3,4. By contrast, molecular simulation has been anchored in usage paradigms dating back to the seventies, when molecular dynamics (MD) simulation was first applied to study biomacromolecules5. At that time, MD was used by theoretical physicists and chemists in proof-of-concept simulations, but 50 years later, MD has evolved into a cornerstone molecular biology technique that can provide accurate, quantitative analysis and property prediction. MD is now employed by tens of thousands of researchers worldwide, accounting for roughly 15% of global supercomputer usage. Unfortunately, these rich and costly data are not systematically maintained, and when further analyses are required, simulations have to be rerun — an unacceptable situation from scientific, environmental and sustainability standpoints. In this letter, we argue for a collaborative endeavor to archive MD simulation data and describe ongoing efforts to establish cost-effective and sustainable data archiving strategies.
Advances in computer technology have made it possible to simulate large, realistic biological systems beyond the millisecond timescale, and we are seeing simulations in the 10⁹-particle range, covering entire organelles and even minimal cells, resulting in a “deluge of data”6 in a field that lacks agreed strategies for data storage. As in the seventies, trajectories obtained after a huge effort are often ignored (or even deleted) after a hypothesis-driven analysis is presented in a scientific publication. For a field entirely based on sampling, and where the recipe for observations can be described exactly and critically assessed, this is a huge problem. Instead of being able to reanalyze, reuse, and potentially spot undetected artifacts or new features in data, readers are often expected to blindly trust the closed set of statements made by the authors in a paper. The lack of a systematic approach to storing data (and associated provenance and metadata) prevents new studies based on previous trajectories; impedes meta-analyses, extension of trajectories, training of machine learning approaches, optimization of force fields and simulation protocols, generation of new conformations for modeling of reactivity; hampers the use of trajectories to train coarse-grained and mesoscopic models or generative models; and prohibits the integration of MD results into the rich ecosystem of biology databases. Some journals and funding institutions now require the deposition of trajectories. Without a centralized reference repository, this has led to the use of existing generic repositories (for example, Zenodo, Figshare) and the creation of numerous small, independent databases. As a result, we may face vast amounts of dispersed and disconnected data, which are expensive to maintain and often useless for further analysis. It is clear that the community needs to escape from a paradigm that made sense in the seventies but now hinders progress, and move to an open science model.
Establishing an archive for biosimulation data, subject to quality assessment, would address these issues, democratize the field, and materially increase the impact of MD simulations on life science research. The traditional view held by the simulation community that storing and archiving is more expensive than recomputing, which might have been correct in the past, is no longer valid, as demonstrated by the massive Folding@home study on the SARS-CoV-2 main protease7, or for simulations with many millions of atoms8. Moreover, the new science that can be learned from stored trajectories is more important than the cost. For instance, the ABC Consortium9 was established in 2004 as a community effort generating a multi-gigabyte database of DNA simulations, which had grown to hold 15 terabytes of data by 2019. The original goal of ABC was to study DNA polymorphisms, but the database has become crucial in other fields, such as force field refinement, the study of signal transfer in DNA and the development of coarse-grained models. The current HexABC database contains 400 terabytes of data generated by 14 different groups to explore hexamer dependencies of DNA dynamics. However, its future use, which is difficult to anticipate, might be more important than the current goals of the project. Another example emerged during the COVID-19 pandemic10,11, when the Molecular Sciences Software Institute (MolSSI), in collaboration with European groups including BioExcel, European Open Science Cloud, European Bioinformatics Institute and Zenodo, created the COVID-19 Molecular Structure and Therapeutics Hub (https://covid.molssi.org). It went live in April 2020, connecting scientists across the global biomolecular simulation community, as well as improving the connection between simulation and experimental and clinical data and their investigators.
A further example is MDverse (https://mdverse.github.io/), an effort to make MD trajectories FAIRer by indexing and curating thousands of simulations scattered across the internet. Many other examples are now under development, highlighting the general belief of the community that the traditional paradigm from the seventies should be abandoned and all well-annotated, validated trajectories should be stored and integrated in a general data infrastructure to favor the advance of science and the optimization of computational resources.
The challenges that lie ahead for the community are diverse. The technical ones (sustained data storage capacity, bandwidth, and processing capacity for analysis) can be alleviated by a distributed database policy following initiatives such as the EGA infrastructure (European Genome-Phenome Archive; https://ega-archive.org/) and by the commitment of funding institutions and high-performance computing centers to offer storage, bandwidth and processing capabilities. Other key decisions, such as quality requirements for storing and maintaining the data, the sparsity of the trajectory, the compression strategy, or whether stored trajectories should be dry or also contain solvent molecules, should be taken by the community, keeping in mind that, while storing all the potential information derived from an MD simulation might be impossible, preserving as much data as possible should be a priority.
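To make these trade-offs concrete, the following is a back-of-the-envelope sketch of how frame sparsity and solvent stripping affect the uncompressed size of a trajectory. The atom counts, the frame count and the assumption of single-precision (4-byte) coordinates with no compression are purely illustrative; they are not taken from any particular MD engine or file format.

```python
def trajectory_bytes(n_atoms: int, n_frames: int, stride: int = 1,
                     bytes_per_coord: int = 4) -> int:
    """Uncompressed size of the coordinates kept when saving every `stride`-th frame."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    kept_frames = (n_frames + stride - 1) // stride  # frames 0, stride, 2*stride, ...
    return n_atoms * 3 * bytes_per_coord * kept_frames


# Hypothetical system: 100,000 atoms in total, of which 10,000 belong to the solute.
N_TOTAL, N_SOLUTE, N_FRAMES = 100_000, 10_000, 1_000_000

full = trajectory_bytes(N_TOTAL, N_FRAMES)
dry = trajectory_bytes(N_SOLUTE, N_FRAMES)
dry_sparse = trajectory_bytes(N_SOLUTE, N_FRAMES, stride=10)

for label, size in [("solvated, every frame", full),
                    ("dry, every frame", dry),
                    ("dry, every 10th frame", dry_sparse)]:
    print(f"{label:>24}: {size / 1e9:8.1f} GB")
```

Under these assumptions, stripping the solvent and keeping one frame in ten each reduce the stored volume by an order of magnitude, at the price of information that cannot be recovered without rerunning the simulation.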
A centralized management entity should coordinate the federated nodes, defining required metadata (crucial for reproducibility, extension of trajectories, increase of the time density of snapshots, or meta-analysis), setting deposition policies, guaranteeing compliance with FAIR principles and providing a common entry point through a web interface and a programmatic representational state transfer (REST) API. The myriad variants of MD programs, protocols, formats and simulation conditions lead to more complex problems. Recent MD repositories and databases11 are already prepared to manage not only plain MD trajectories but also Markov state models, ensembles, multiscale simulations (hybrid or combined approaches involving mesoscale, coarse-grained and atomistic methods, as well as quantum mechanics with molecular mechanics), constant pH, replica exchange, and MD trajectories biased with metadynamics or similar methods. NoSQL databases such as MongoDB (with the GridFS file storage and retrieval specification) allow efficient storing and querying of the diversity of outputs provided by MD engines and are already adopted by MD storage initiatives. However, much more work is required for an effective analysis framework that can manage an increasingly large number of MD variants and trajectory formats.
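A minimal sketch may help show why a document-oriented store suits such heterogeneous outputs: each record carries only the fields relevant to its method, yet all records can be queried in the same way. The code below is a plain-Python, in-memory stand-in and does not use the MongoDB or GridFS API; the record fields, identifiers and force-field labels are hypothetical.

```python
from typing import Any

# Hypothetical records: each simulation type carries its own method-specific fields.
RECORDS: list[dict[str, Any]] = [
    {"id": "sim-001", "method": "plain_md", "force_field": "ff-A", "length_ns": 500},
    {"id": "sim-002", "method": "replica_exchange", "force_field": "ff-A",
     "n_replicas": 32, "temperatures_K": [300, 310, 320]},
    {"id": "sim-003", "method": "metadynamics", "force_field": "ff-B",
     "collective_variables": ["distance", "dihedral"]},
]


def find(records: list[dict[str, Any]], **criteria: Any) -> list[dict[str, Any]]:
    """Return the records whose fields equal all of the given criteria.

    A record lacking a requested field simply does not match, so
    method-specific fields never need to exist on every record.
    """
    missing = object()  # sentinel that compares unequal to any real value
    return [r for r in records
            if all(r.get(key, missing) == value for key, value in criteria.items())]


print([r["id"] for r in find(RECORDS, force_field="ff-A")])
print([r["id"] for r in find(RECORDS, n_replicas=32)])
```

A relational schema would need either a wide table full of empty columns or one table per method; here a new MD variant only adds new keys to its own records.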
Data should be findable, with each entry registered with a persistent identifier, ideally a DOI, ensuring proper citation, following the example of the WorkflowHub registry (https://workflowhub.eu/). Furthermore, they should be stored in an interoperable manner, so that they can be read and exploited by current and future data scraping and machine learning algorithms. To this end, the community must reach an agreement to standardize MD data exchange formats with (i) efficient trajectory compression, including simple system specifications (for example, atom or residue names and connectivity); (ii) key-value trees storing high-level and full simulation settings metadata; and (iii) a metadata-based ontology12, which would allow the user to search databases on the basis of the contents, the nature or even the purpose of the simulations. Standardized provenance should be stored by means of data blocks specifying commands or operations used to generate the trajectory, together with names, stored hash sums of the complete files used for input, and specific software used (with precise versions). This would allow the user to reproduce all the different steps followed to prepare and run the simulation, including modeling of missing residues, physical conditions (for example, pH, salt concentration, temperature and pressure) and force fields, methodology used to obtain parameters involving non-standard molecules (for example, small molecules, membrane systems, ionic coordination), and the equilibration and, possibly, the sampling process. Minimum metadata should include system information, simulation parameters, author(s), data license and copyright, and, importantly, the main purpose of the simulation. The definition of standardized protocols (that is, list of operations) for production run and analysis, including a troubleshooting section, could be added.
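As an illustration of the provenance blocks and minimum metadata described above, the sketch below builds one provenance step (command, software version, hash sums of input files) and checks an entry for missing required fields. It is not a proposed standard: the key names, the `md-engine` program and its command line, and the example values are all hypothetical, and SHA-256 is assumed as the hash function.

```python
import hashlib
import json

# Minimum metadata fields named in the text; the key spellings are assumptions.
REQUIRED_FIELDS = ("system", "simulation_parameters", "authors", "license", "purpose")


def sha256_hex(data: bytes) -> str:
    """Hash sum of the complete contents of an input file."""
    return hashlib.sha256(data).hexdigest()


def provenance_step(command: str, software: str, version: str,
                    inputs: dict[str, bytes]) -> dict:
    """One provenance block: the command, the exact software version and input hashes."""
    return {
        "command": command,
        "software": {"name": software, "version": version},
        "inputs": [{"name": name, "sha256": sha256_hex(content)}
                   for name, content in sorted(inputs.items())],
    }


def missing_metadata(metadata: dict) -> list[str]:
    """Names of required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


# Hypothetical entry; file contents are inlined so the sketch runs without any files.
entry = {
    "metadata": {
        "system": "example protein in water",
        "simulation_parameters": {"temperature_K": 300, "pressure_bar": 1.0, "pH": 7.0},
        "authors": ["A. Researcher"],
        "license": "CC-BY-4.0",
        "purpose": "",  # left empty on purpose to show the check failing
    },
    "provenance": [
        provenance_step("md-engine run --input topology.top", "md-engine", "1.2.3",
                        {"topology.top": b"example topology contents"}),
    ],
}

print(json.dumps(entry["provenance"], indent=2))
print("missing:", missing_metadata(entry["metadata"]))
```

Storing a hash rather than trusting a file name lets a later user verify that the input they hold is byte-for-byte the one used to generate the trajectory.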
These, along with a set of metadata-dependent quality control analyses, both general and system specific, are crucial requisites for gaining trust from the community and for defining deposition rules. A data repository following FAIR principles and the associated analysis tools will increase the impact and the reproducibility (complex at the binary level; that is, it is difficult to reproduce exactly the same trajectory owing to numerical errors) of MD in related fields in the life science data ecosystem, from genomics to structural biology and from protein and drug design to molecular biology. MD data would provide unique dynamic information on biological macromolecules, fully complementary to the rich information available from the Protein Data Bank. This could be integrated into the life science ecosystem following the approach of the Protein Data Bank in Europe Knowledge Base, designed for the integration and enrichment of 3D structure data and functional annotations13. All this information will contribute to knowledge democratization, helping research teams with limited resources and fueling further advances in artificial intelligence (AI) in the scientific domain14 (Fig. 1).
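The binary-level caveat mentioned above can be illustrated with a small sketch: floating-point addition is not associative, so summing the same contributions in a different order (as can happen when work is divided differently across processors) may change the last bits of the result. The three values below are arbitrary stand-ins; in an actual simulation such tiny differences are then amplified by the chaotic nature of the dynamics.

```python
import math

contributions = [0.1, 0.2, 0.3]  # arbitrary stand-ins for per-atom force terms

left_to_right = (contributions[0] + contributions[1]) + contributions[2]
right_to_left = contributions[0] + (contributions[1] + contributions[2])

print(left_to_right == right_to_left)              # bitwise-identical?
print(math.isclose(left_to_right, right_to_left))  # equal within tolerance?
```

This is why reproducibility in MD is usually judged on statistical properties of the ensemble rather than on bit-identical trajectories.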
Fig. 1 | Data cycle workflow for implementing FAIR (findable, accessible, interoperable and reusable) principles in biomolecular simulations. The diagram highlights the added value that can be extracted from accessible open data.
The MDDB project (https://mddbr.eu/) and similar initiatives aim to establish such a repository, providing (i) data quality assessment metrics to increase the trust of the community in the deposited data; (ii) a common data format, metadata requirements and ontologies to facilitate interoperability; (iii) a minimum set of information needed to store and reproduce the simulations, including data provenance, license and copyright; and (iv) a standard and robust infrastructure to store and share the data, with persistent identifiers and different ways to access them. We believe science will be better served by fully embracing this data-driven view of biomolecular simulation. Furthermore, data-driven initiatives such as the one supported by this Correspondence would help interaction with other simulation communities, such as materials science, which share some of the problems the biomolecular simulation community is facing.
Acknowledgements
The authors thank the whole MD community for useful input and discussions. The MDDB project is supported by the European Union’s Horizon Europe programme under grant agreement 101094651 awarded to M.O., E.L., S.V., J.L.G., J.I., A.C. and P.C.B.
Footnotes
Competing interests
The authors declare no competing interests.
References
- 1. wwPDB Consortium. Nucleic Acids Res. 47, D520–D528 (2019).
- 2. Wilkinson MD et al. Sci. Data 3, 160018 (2016).
- 3. Thakur M et al. Nucleic Acids Res. 51, D9–D17 (2023).
- 4. Rigden DJ & Fernández XM. Nucleic Acids Res. 50, D1–D10 (2022).
- 5. McCammon JA, Gelin BR & Karplus M. Nature 267, 585–590 (1977).
- 6. Hospital A et al. Wiley Interdiscip. Rev. Comput. Mol. Sci. 10, e1449 (2020).
- 7. von Delft F et al. Nature 594, 330–332 (2021).
- 8. Dommer A et al. Int. J. High Perform. Comput. Appl. 37, 28–44 (2023).
- 9. da Rosa G et al. Biophys. Rev. 13, 995–1005 (2021).
- 10. Amaro RE & Mulholland AJ. J. Chem. Inf. Model. 60, 2653–2656 (2020).
- 11. Beltrán D et al. Nucleic Acids Res. 52, D393–D403 (2024).
- 12. Hospital A et al. Nucleic Acids Res. 44, D272–D278 (2016).
- 13. PDBe-KB Consortium. Nucleic Acids Res. 48, D344–D353 (2019).
- 14. Dessimoz C & Thomas PD. Sci. Data 11, 268 (2024).
