Author manuscript; available in PMC: 2026 Mar 2.
Published in final edited form as: Nat Methods. 2025 Apr;22(4):641–645. doi: 10.1038/s41592-025-02635-0

The need to implement FAIR principles in biomolecular simulations

Rommie E Amaro 1, Johan Åqvist 2, Ivet Bahar 3,4, Federica Battistini 5, Adam Bellaiche 6, Daniel Beltran 7, Philip C Biggin 8, Massimiliano Bonomi 9, Gregory R Bowman 10, Richard A Bryce 11, Giovanni Bussi 12, Paolo Carloni 13,14, David A Case 15, Andrea Cavalli 16,17, Chia-En A Chang 18, Thomas E Cheatham III 19, Margaret S Cheung 20,21, Christophe Chipot 22,23,24, Lillian T Chong 25, Preeti Choudhary 6, G Andres Cisneros 26,27, Cecilia Clementi 28, Rosana Collepardo-Guevara 29,30,31, Peter Coveney 32,33, Roberto Covino 34,35, T Daniel Crawford 36,37, Matteo Dal Peraro 38, Bert L de Groot 39, Lucie Delemotte 40, Marco De Vivo 41, Jonathan W Essex 42, Franca Fraternali 43, Jiali Gao 44, Josep Ll Gelpí 5,45, Francesco L Gervasio 46,47,48,49, Fernando D González-Nilo 50, Helmut Grubmüller 51, Marina G Guenza 52, Horacio V Guzman 53, Sarah Harris 54, Teresa Head-Gordon 55, Rigoberto Hernandez 56, Adam Hospital 7,57, Niu Huang 58, Xuhui Huang 59, Gerhard Hummer 60,61, Javier Iglesias-Fernández 62, Jan H Jensen 63, Shantenu Jha 64, Wanting Jiao 65, William L Jorgensen 66, Shina C L Kamerlin 67,68, Syma Khalid 8, Charles Laughton 69, Michael Levitt 70, Vittorio Limongelli 71, Erik Lindahl 40,72, Kresten Lindorff-Larsen 73, Sharon Loverde 74, Magnus Lundborg 40, Yun L Luo 75, F Javier Luque 76,77, Charlotte I Lynch 8, Alexander D MacKerell Jr 78, Alessandra Magistrato 79, Siewert J Marrink 80, Hugh Martin 32, J Andrew McCammon 81,82, Kenneth Merz 83,84, Vicent Moliner 85, Adrian J Mulholland 86, Sohail Murad 87, Athi N Naganathan 88, Shikha Nangia 89, Frank Noe 90,91,92,93, Agnes Noy 94, Julianna Oláh 95, Megan L O’Mara 96, Mary Jo Ondrechen 97, Jose N Onuchic 92,98,99,100, Alexey Onufriev 101,102,103, Sílvia Osuna 104,105, Giulia Palermo 18,106, Anna R Panchenko 107,108,109, Sergio Pantano 110,111, Carol Parish 112, Michele Parrinello 113, Alberto Perez 114, Tomas Perez-Acle 115,116, Juan R Perilla 117, B Montgomery Pettitt 118, Adriana Pietropaolo 119, Jean-Philip Piquemal 120, Adolfo B Poma 121, Matej Praprotnik 122,123, Maria J Ramos 124, Pengyu Ren 125, Nathalie Reuter 126,127, Adrian Roitberg 114, Edina Rosta 128, Carme Rovira 105,129, Benoit Roux 130, Ursula Rothlisberger 131, Karissa Y Sanbonmatsu 132,133, Tamar Schlick 134,135, Alexey K Shaytan 136,137, Carlos Simmerling 3,138, Jeremy C Smith 139,140, Yuji Sugita 141,142,143, Katarzyna Świderek 85, Makoto Taiji 144, Peng Tao 145, D Peter Tieleman 146, Irina G Tikhonova 147, Julian Tirado-Rives 66, Iñaki Tuñón 148, Marc W van der Kamp 149, David van der Spoel 2, Sameer Velankar 6, Gregory A Voth 150, Rebecca Wade 151, Ariel Warshel 152, Valerie Vaissier Welborn 37,153, Stacey D Wetmore 154, Travis J Wheeler 155, Chung F Wong 156, Lee-Wei Yang 157, Martin Zacharias 158, Modesto Orozco 5,7
PMCID: PMC12950262  NIHMSID: NIHMS2081560  PMID: 40175561

Abstract

In the Big Data era, a change of paradigm in the use of molecular dynamics is required. Trajectories should be stored under FAIR (findable, accessible, interoperable and reusable) requirements to favor their reuse by the community under an open science paradigm.


The communities that embraced data archiving efforts decades ago are now, in the era of data-driven biology, gaining the most from the AI revolution. The structural biology community was a pioneer in this regard, establishing the Protein Data Bank in 1971 and making data accessible using the FAIR principles even before these were articulated1,2. The genomics and bioinformatics community has followed the example, establishing many widely used databases3,4. By contrast, molecular simulation has been anchored in usage paradigms dating back to the seventies, when molecular dynamics (MD) simulation was first applied to study biomacromolecules5. At that time, MD was used by theoretical physicists and chemists in proof-of-concept simulations, but 50 years later, MD has evolved into a cornerstone molecular biology technique that can provide accurate, quantitative analysis and property prediction. MD is now employed by tens of thousands of researchers worldwide, accounting for roughly 15% of global supercomputer usage. Unfortunately, these rich and costly data are not systematically maintained, and when further analyses are required, simulations have to be rerun — an unacceptable situation from scientific, environmental and sustainability standpoints. In this letter, we argue for a collaborative endeavor to archive MD simulation data and describe ongoing efforts to establish cost-effective and sustainable data archiving strategies.

Advances in computer technology have made it possible to simulate large, realistic biological systems beyond the millisecond timescale, and we are seeing simulations in the 10^9-particle range, covering entire organelles and even minimal cells, resulting in a “deluge of data”6 in a field that lacks agreed strategies for data storage. As in the seventies, trajectories obtained after a huge effort are often ignored (or even deleted) after a hypothesis-driven analysis is presented in a scientific publication. For a field entirely based on sampling, and where the recipe for observations can be described exactly and critically assessed, this is a huge problem. Instead of being able to reanalyze, reuse, and potentially spot undetected artifacts or new features in data, readers are often expected to blindly trust the closed set of statements made by the authors in a paper. The lack of a systematic approach to storing data (and associated provenance and metadata) prevents new studies based on previous trajectories; impedes meta-analyses, extension of trajectories, training of machine learning approaches, optimization of force fields and simulation protocols, generation of new conformations for modeling of reactivity; hampers the use of trajectories to train coarse-grained and mesoscopic models or generative models; and prohibits the integration of MD results into the rich ecosystem of biology databases. Some journals and funding institutions now require the deposition of trajectories. Without a centralized reference repository, this has led to the use of existing generic repositories (for example, Zenodo, Figshare) and the creation of numerous small, independent databases. As a result, we may face vast amounts of dispersed and disconnected data, which are expensive to maintain and often useless for further analysis. It is clear that the community needs to escape from a paradigm that made sense in the seventies but now hinders progress, and move to an open science model.

Establishing an archive for biosimulation data, subject to quality assessment, would address these issues, democratize the field, and increase the material impact of MD simulations on life science research. The traditional view held by the simulation community that storing and archiving is more expensive than recomputing, which might have been correct in the past, is no longer valid, as demonstrated by the massive Folding@home study on the SARS-CoV-2 main protease7, or for simulations with many millions of atoms8. However, the new science that can be learned from stored trajectories is more important than the cost. For instance, the ABC Consortium9 was established in 2004 as a community effort generating a multi-gigabyte database of DNA simulations, which had grown to hold 15 terabytes of data by 2019. The original goal of ABC was to study DNA polymorphisms, but the database has become crucial in other fields, such as force field refinement, the study of signal transfer in DNA and the development of coarse-grained models. The current HexABC database contains 400 terabytes of data generated by 14 different groups to explore hexamer dependencies of DNA dynamics. However, its future use, which is difficult to anticipate, might be more important than the current goals of the project. Another example emerged during the COVID-19 pandemic10,11, when the Molecular Sciences Software Institute (MolSSI), in collaboration with European groups including BioExcel, European Open Science Cloud, European Bioinformatics Institute and Zenodo, created the COVID-19 Molecular Structure and Therapeutics Hub (https://covid.molssi.org). It went live in April 2020, connecting scientists across the global biomolecular simulation community, as well as improving the connection between simulation and experimental and clinical data and their investigators. A further example is MDverse (https://mdverse.github.io/), an effort to make MD trajectories FAIRer by indexing and curating thousands of simulations scattered across the internet. Many other examples are now under development, highlighting the general belief of the community that the traditional paradigm from the seventies should be abandoned and all well-annotated, validated trajectories should be stored and integrated in a general data infrastructure to favor the advance of science and the optimization of computational resources.

The challenges that lie ahead for the community are diverse. The technical ones (sustained data storage capacity, bandwidth, and processing capacity for analysis) can be alleviated by a distributed database policy following initiatives such as the EGA infrastructure (European Genome-Phenome Archive; https://ega-archive.org/) and by the commitment of funding institutions and high-performance computing centers to offer storage, bandwidth and processing capabilities. Other key decisions, such as quality requirements for storing and maintaining the data, the sparsity of the trajectory, the compression strategy, or whether stored trajectories should be dry or also contain solvent molecules, should be made by the community, keeping in mind that, while storing all the potential information derived from an MD simulation might be impossible, preserving as much data as possible should be a priority.
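To make the trade-off between sparsity and solvent removal concrete, the following is a minimal back-of-the-envelope sketch in Python. It assumes uncompressed single-precision coordinates (three 4-byte values per atom per frame) and ignores velocities, box information and file headers. The system sizes, frame count and stride are hypothetical numbers chosen only for illustration, not values from any archive.

```python
def trajectory_size_bytes(n_atoms: int, n_frames: int, stride: int = 1,
                          bytes_per_coord: int = 4) -> int:
    """Uncompressed size of the coordinates kept when saving every `stride`-th frame."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    # Frames 0, stride, 2*stride, ... are kept (ceiling division).
    kept_frames = (n_frames + stride - 1) // stride
    return n_atoms * 3 * bytes_per_coord * kept_frames


# Hypothetical system: 1,000,000 atoms in total, 100,000 of them solute.
full = trajectory_size_bytes(1_000_000, 10_000)
dry = trajectory_size_bytes(100_000, 10_000)
dry_sparse = trajectory_size_bytes(100_000, 10_000, stride=10)

for label, size in [("solvated", full), ("dry", dry),
                    ("dry, every 10th frame", dry_sparse)]:
    print(f"{label}: {size / 1e9:.1f} GB")
```

Under these assumptions the three choices differ by two orders of magnitude (120 GB, 12 GB and 1.2 GB), which is why the stored frame density and solvent policy need a community-level decision rather than a per-group one.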

A centralized management entity should coordinate the federated nodes, defining required metadata (crucial for reproducibility, extension of trajectories, increase of the time density of snapshots, or meta-analysis), setting deposition policies, guaranteeing compliance with FAIR principles and providing a common entry point through web-based and programmatic representational state transfer (REST) APIs. The myriad variants of MD programs, protocols, formats and simulation conditions lead to more complex problems. Recent MD repositories and databases11 are already prepared to manage not only plain MD trajectories but also Markov state models, ensembles, multiscale simulations (hybrid or combined approaches involving mesoscale, coarse-grained and atomistic methods, as well as quantum mechanics with molecular mechanics), constant pH, replica exchange, and MD trajectories biased with metadynamics or similar methods. NoSQL databases such as MongoDB (with the GridFS specification for file storage and retrieval) allow efficient storage and querying of the diversity of outputs provided by MD engines and have already been adopted by MD storage initiatives. However, much more work is required to build an effective analysis framework that can manage an increasingly large number of MD variants and trajectory formats.
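As a minimal illustration of why a schema-flexible document model suits this diversity of simulation types, the sketch below uses plain Python dictionaries in place of a real document database. It does not call MongoDB or GridFS, and the field names, identifiers and values are hypothetical rather than an agreed schema.

```python
from typing import Any

# Hypothetical entries: each simulation type carries its own extra fields,
# which a fixed relational table would have to anticipate in advance.
entries: list[dict[str, Any]] = [
    {"id": "sim-001", "method": "plain MD", "engine": "GROMACS",
     "length_ns": 1000},
    {"id": "sim-002", "method": "replica exchange", "engine": "AMBER",
     "length_ns": 200, "replicas": 32},
    {"id": "sim-003", "method": "metadynamics", "engine": "GROMACS",
     "length_ns": 500, "collective_variables": ["distance", "torsion"]},
]


def find(collection: list[dict[str, Any]], **criteria: Any) -> list[dict[str, Any]]:
    """Return entries whose fields equal all criteria; a missing field never matches."""
    missing = object()
    return [doc for doc in collection
            if all(doc.get(key, missing) == value
                   for key, value in criteria.items())]


# Query by a field shared by all entries, then by one only some entries have.
print([doc["id"] for doc in find(entries, engine="GROMACS")])
print([doc["id"] for doc in find(entries, replicas=32)])
```

The same query function works whether or not an entry defines a given field, which is the property that lets one collection hold plain, enhanced-sampling and multiscale simulations side by side.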

Data should be findable, with each entry registered with a persistent identifier, ideally a DOI, ensuring proper citation, following the example of the WorkflowHub registry (https://workflowhub.eu/). Furthermore, they should be stored in an interoperable manner, so that they can be read and exploited by current and future data scraping and machine learning algorithms. To this end, the community must reach an agreement to standardize MD data exchange formats with (i) efficient trajectory compression, including simple system specifications (for example, atom or residue names and connectivity); (ii) key-value trees storing high-level and full simulation settings metadata; and (iii) metadata-based ontology12, which would allow the user to search databases on the basis of the contents, the nature or even the purpose of the simulations. Standardized provenance should be stored by means of data blocks specifying commands or operations used to generate the trajectory, together with names, stored hash sums of the complete files used for input, and specific software used (with precise versions). This would allow the user to reproduce all the different steps followed to prepare and run the simulation, including modeling of missing residues, physical conditions (for example, pH, salt concentration, temperature and pressure) and force fields, methodology used to obtain parameters involving non-standard molecules (for example, small molecules, membrane systems, ionic coordination), and the equilibration and possibly sampling process. Minimum metadata should include system information, simulation parameters, author(s), data license and copyright, and, importantly, the main purpose of the simulation. The definition of standardized protocols (that is, list of operations) for production run and analysis, including a troubleshooting section, could be added. These, along with a set of metadata-dependent quality control analyses, both general and system specific, are crucial requisites for gaining trust from the community and for defining deposition rules. A data repository following FAIR principles and the associated analysis tools will increase the impact and the reproducibility (complex at the binary level; that is, it is difficult to reproduce exactly the same trajectory owing to numerical errors) of MD in related fields in the life science data ecosystem, from genomics to structural biology and from protein and drug design to molecular biology. MD data would provide unique dynamic information on biological macromolecules fully complementary with the rich information available from the Protein Data Bank. This could be integrated into the life science ecosystem following the approach of the Protein Data Bank in Europe Knowledge Base, designed for the integration and enrichment of 3D structure data and functional annotations13. All this information will contribute to knowledge democratization, helping research teams with limited resources and fueling further advances in artificial intelligence (AI) in the scientific domain14 (Fig. 1).
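The provenance blocks and minimum metadata described above could be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not a proposed standard: the field names, the choice of SHA-256 as the hash, and the command, software name and version in the usage example are all hypothetical placeholders.

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Assumed field names for the minimum metadata listed in the text.
REQUIRED_METADATA = ("system", "simulation_parameters", "authors",
                     "license", "copyright", "purpose")


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large inputs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_step(command: str, inputs: list[Path],
                    software: str, version: str) -> dict:
    """One provenance block: the command, hashed input files and software version."""
    return {
        "command": command,
        "software": {"name": software, "version": version},
        "inputs": [{"name": p.name, "sha256": sha256_of(p)} for p in inputs],
    }


def missing_metadata(metadata: dict) -> list[str]:
    """Names of required fields that are absent or empty."""
    return [key for key in REQUIRED_METADATA if not metadata.get(key)]


# Usage with a throwaway input file; command and software are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    topology = Path(tmp) / "system.top"
    topology.write_bytes(b"placeholder topology\n")
    step = provenance_step("md_engine run --input system.top",
                           [topology], "md_engine", "1.0.0")

print(json.dumps(step, indent=2))
print(missing_metadata({"system": "DNA duplex", "authors": ["A. Author"]}))
```

Storing a hash next to each input file name lets a later user confirm that the files they hold are byte-identical to those used originally, and a check such as `missing_metadata` is the kind of simple rule a deposition policy could enforce before accepting an entry.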

Fig. 1 | Data cycle workflow for implementing FAIR (findable, accessible, interoperable and reusable) principles in biomolecular simulations. The diagram highlights the added value that can be extracted from accessible open data.

The MDDB project (https://mddbr.eu/) and similar initiatives aim to establish such a repository, providing (i) data quality assessment metrics to increase the trust of the community in the deposited data; (ii) a common data format, metadata requirements and ontologies to facilitate interoperability; (iii) a minimum set of information needed to store and reproduce the simulations, including data provenance, license and copyright; and (iv) a standard and robust infrastructure to store and share the data, with persistent identifiers and different ways to access them. We believe science will be better served by fully embracing this data-driven view of biomolecular simulation. Furthermore, data-driven initiatives such as the one supported by this Correspondence would facilitate interaction with other simulation communities, such as the materials science community, which shares some of the problems the biomolecular simulation community is facing.

Acknowledgements

The authors thank the whole MD community for useful input and discussions. The MDDB project is supported by the European Union’s Horizon Europe programme under grant agreement 101094651 awarded to M.O., E.L., S.V., J.L.G., J.I., A.C. and P.C.B.

Footnotes

Competing interests

The authors declare no competing interests.
