PLOS Computational Biology. 2023 Nov 27;19(11):e1011676. doi: 10.1371/journal.pcbi.1011676

Facilitating bioinformatics reproducibility with QIIME 2 Provenance Replay

Christopher R Keefe 1, Matthew R Dillon 1, Elizabeth Gehret 1, Chloe Herman 1,2, Mary Jewell 3, Colin V Wood 1, Evan Bolyen 1, J Gregory Caporaso 1,2,*
Editor: Jan-Ulrich Kreft
PMCID: PMC10703398  PMID: 38011287

Abstract

Study reproducibility is essential to corroborate, build on, and learn from the results of scientific research, but it is notoriously challenging in bioinformatics, which often involves large data sets and complex analytic workflows spanning many different tools. Additionally, many biologists are not trained in how to effectively record their bioinformatics analysis steps to ensure reproducibility, so critical information is often missing. Software tools used in bioinformatics can automate provenance tracking of the results they generate, removing most barriers to bioinformatics reproducibility. Here we present an implementation of that idea, Provenance Replay, a tool for generating new executable code from results generated with the QIIME 2 bioinformatics platform, and discuss considerations for bioinformatics developers who wish to implement similar functionality in their software.

Introduction

Reproducibility, the ability of a researcher to duplicate the results of a study, is a necessary condition for scientific research to be considered informative and credible [1]. Peer review relies on study documentation to maintain the trustworthiness of scientific research [2–4]. Without comprehensive documentation, reviewers may be unable to verify a study’s validity and merit, and other researchers will be unable to interrogate the results or learn from the researchers’ approach, limiting the study’s value.

The biomedical research community has recently been concerned with a “reproducibility crisis,” and several high-profile publications have reported that researchers were unable to confirm the findings of original studies [5,6]. This discussion generally focuses on one type of reproducibility failure: an inability to corroborate a study’s results. However, it neglects a deeper issue: many studies fail to provide even the minimum documentation necessary to reproduce their methodology.

Although there is no standard nomenclature for reproducibility in the literature, existing definitions illustrate the goals of different types of reproducibility. For example, the Turing Way defines research as “Reproducible,” “Replicable,” “Robust,” or “Generalizable” based on whether a study’s results can be repeated using methods and data that are the same as, or different from, the original study [7]. Gundersen and Kjensmo create similar categories in their work on reproducibility, but they define a hierarchy based on the degree of generality [8].

We incorporated these ideas into a hierarchy of reproducible research, in which different levels of reproducibility are represented by the classes “Reproducible,” “Replicable,” “Robust,” and “Generalizable” (Fig 1). Under this hierarchy, generalizable studies produce findings that are corroborated in other contexts, providing the building blocks for advancing scientific knowledge. Lower degrees of generality allow researchers to validate studies and expand focused work toward generalized conclusions [9].

Fig 1. Turing Way (TW) reproducibility classes [7] ordered in a hierarchy based on generality, similar to Gundersen and Kjensmo [8].


High quality research documentation is essential to all levels of reproducibility. In bioinformatics, research typically involves large datasets, complex computer software, and analytical procedures with many distinct steps. In order to reproduce such a study, one needs both prospective provenance, the analytic workflow specified as a recipe for data creation, and retrospective provenance, the details of the runtime environment and the resources used for analysis [10].

Prospective provenance is most frequently realized in a document written by the data analyst, for example a Jupyter Notebook, Snakemake workflow, or RMarkdown document that includes executable code and, in some cases, analysis notes. Because these documents tend to evolve over time, great care must be taken to link specific revisions of them to the results that they generated. Even when specific revisions are conclusively linked to specific research results, problems can arise if it is not clear how code was executed in an interactive environment to generate a given result. For example, it is not uncommon for novice data scientists to execute cells out of order in a large Jupyter Notebook or RMarkdown file, in which case a linear interpretation of the document would not accurately describe how it was used to generate a result.

Retrospective provenance can be realized by capturing information about the analysis as it is run, including hardware and software environments, resource use, and the data and metadata involved. Like prospective provenance, this is most frequently captured by a data analyst in the form of written notes. Most published research does not meet these reproducibility needs, as researchers must balance competing demands on their time and grapple with publication structures that incentivize producing new work over documenting for reproducibility [9,11].

Even if researchers know what needs to be tracked and are diligent about tracking that information, recording all of the prospective and retrospective provenance required to reproduce a computational analysis is tedious and error-prone for humans. In our opinion, provenance tracking is a task better left to computer software. Analytic software tools that automatically produce research documentation have the potential to reduce the risk of paper retraction; facilitate collaboration, review, and debugging; and improve the continuity and impact of scientific research [7].

Engineering bioinformatics software to facilitate aspects of analysis reproducibility is a topic of contemporary interest in the bioinformatics software literature [12]. For example, Snakemake [13] can automatically generate workflow graphs via its --dag option, providing built-in workflow visualization. Love et al. [14] present tximeta, an approach for linking reference data checksums to the RNA-seq analysis results that rely on that reference data, ensuring that relevant references can be uniquely identified; they also briefly review work highlighting the need for improved provenance tracking in bioinformatics, as well as scientific computing tools that aim to facilitate it. Of the existing tools that we are aware of, CWLProv [15] and Research Objects [16] serve the broadest purpose of facilitating reproducibility of entire computational workflows, and are most similar to the work presented here. CWLProv provides a layer between the Common Workflow Language (CWL) and the W3C PROV model to document retrospective provenance of arbitrary CWL workflows. Research Objects are a concept designed for value-added publication of research products, including research data that is discoverable for other work and that includes data provenance to enable users to understand how the data was generated.

QIIME 2 is a biological data science platform that was initially built to facilitate microbiome amplicon analysis [17], but has been expanding into new domains, including analysis of highly-multiplexed serology assays [18], pathogen genomics [19], and microbiome shotgun metagenomics, through alternative distributions (i.e., bundles of QIIME 2 plugins, where plugins serve as Python 3 wrappers for arbitrary analytic software, including software written in languages other than Python). QIIME 2 has a built-in system that automatically tracks prospective and retrospective data provenance for users as they run their analyses, and the popularity of this feature is in part responsible for its adoption in other domains. In QIIME 2, users conduct analysis using Actions that each produce one or more Results. The prospective and retrospective provenance of all preceding analysis steps is automatically stored in each Result, allowing users to determine how a Result was generated and how an analysis was conducted (Fig 2), even if scripts or notes were not recorded (or were misplaced) by the user, or if revision identifiers of those documents were not linked to research results. Additionally, QIIME 2 assigns a universally unique identifier (UUID) to every Result it creates, enabling data to be conclusively identified. QIIME 2 therefore automatically supports reproducible, replicable, and robust bioinformatics analyses, without any effort on the part of its users.
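For readers unfamiliar with these identifiers, the minimal sketch below shows how a Result's UUID and semantic type can be inspected through the QIIME 2 Python 3 API. It assumes a working QIIME 2 environment; the filename is hypothetical.

```python
# Minimal sketch: inspecting a QIIME 2 Result via the Python 3 API.
# Assumes a QIIME 2 environment; "table.qza" is a hypothetical filename
# standing in for any QIIME 2 Artifact.
from qiime2 import Artifact

table = Artifact.load('table.qza')

# The UUID conclusively identifies this Result; it is also the key under
# which the Result appears in the provenance of downstream Results.
print(table.uuid)  # e.g., 34005356-xxxx-xxxx-xxxx-xxxxxxxxxxxx
print(table.type)  # the semantic type, e.g., FeatureTable[Frequency]
```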

Fig 2. Schematic diagram of the provenance of a QIIME 2 Visualization.


A: An example of a QIIME 2 visualization illustrating the taxonomic composition of several samples. B: The directed, acyclic graph (DAG) tracing the history of the panel A visualization from initial data import into QIIME 2 through the creation of the visualization (denoted with an asterisk). This DAG can be used for analytical interpretation or publication, and serves as the input to Provenance Replay. Additional detail is provided in panel C on the nodes included in the dashed box. C: A DAG describing the inputs and outputs of the ‘action’ node highlighted in gray. D: “Action details,” captured during the execution of the node highlighted in gray. Some data collected as action details, such as information about the computational environment where the action was run, is not presented in this schematic for readability, but can be observed in the provenance of either of the two QIIME 2 results included in Supporting Information S1. The node selected for panel D was arbitrarily chosen.

The work presented here attempts to improve computational methods reproducibility in bioinformatics by reducing the practical overhead of creating reproducibility documentation. Built around the automated, decentralized provenance capture implemented in QIIME 2, we present Provenance Replay, a tool that validates the integrity of QIIME 2 Results, parses the provenance data they contain, and programmatically generates executable scripts that allow for the reproduction, study, and extension of the source analysis.

Design and implementation

Provenance Replay is written in Python 3 [20] and depends heavily on the Python standard library, NetworkX [21], PyYAML [22], and QIIME 2 itself, making particular use of QIIME 2’s PluginManager and Usage API. It ingests one or more QIIME 2 Results, parses their provenance data into a directed acyclic graph (DAG) implemented as a NetworkX DiGraph, and produces outputs by subsetting and manipulating this DiGraph and its contents. Outputs include BibTeX-formatted [23] citations for all Actions and Plugins used in a computational analysis, and executable scripts targeting the user’s preferred QIIME 2 interface. Users interact with the software through a command-line interface implemented with Click [24], or through its Python 3 API.
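To make the core data structure concrete, the schematic sketch below shows how parsed provenance might be represented as a NetworkX DiGraph and traversed in topological order to emit commands. The node identifiers, attribute keys, and traversal are illustrative assumptions, not Provenance Replay’s internal API.

```python
# Schematic sketch of provenance as a NetworkX DiGraph (illustrative;
# node and attribute names are assumptions, not Provenance Replay
# internals).
import networkx as nx

dag = nx.DiGraph()

# Nodes are Result UUIDs; attributes carry the recorded action details.
dag.add_node('11ab-...', plugin='demux', action='emp-single')
dag.add_node('22cd-...', plugin='dada2', action='denoise-single')
dag.add_node('33ef-...', plugin='taxa', action='barplot')

# Edges run from each input Result to the Results derived from it.
dag.add_edges_from([('11ab-...', '22cd-...'), ('22cd-...', '33ef-...')])

# Replay emits one command per action, visiting nodes in an order that
# guarantees inputs exist before they are used (a topological sort).
for uuid in nx.topological_sort(dag):
    node = dag.nodes[uuid]
    print(f"qiime {node['plugin']} {node['action']}  # produces {uuid}")
```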

The initial software design was based on a literature review, existing API targets, and discussion with QIIME 2 developers, as well as an initial requirements engineering process. The requirements engineering process consisted of requirements elicitation by focus group and requirements validation using the Technology Acceptance Model (TAM) [25]. Focus group participants were recruited through posts on the QIIME 2 community forum and participated in one-hour focus group sessions, which included a software demonstration and discussion. Discussion questions were intended to elicit open-ended feedback and exploration of possible features of value. When asked how likely they were to recommend Provenance Replay to a colleague who uses QIIME 2, 79% of focus group participants (15/19) were classified as promoters (scores of 9–10 out of 10), 21% (4/19) as passives (scores of 7–8 out of 10), and none as detractors, resulting in a net promoter score of +79. Provenance Replay also scored well on the TAM instruments, with respondents rating both its Perceived Ease of Use (overall mean 5.8 on a scale of 1–7) and its Perceived Usefulness (overall mean 6.0 on a scale of 1–7) as “high.” Additional detail on this process is provided in [26].

Provenance Replay runs under QIIME 2 versions 2021.11 and newer, and can parse data provenance from Results generated with any version of QIIME 2. It can replay a single QIIME 2 Result in a few seconds, and a very large analysis (450 Results) in 8–10 minutes, on a contemporary small-business laptop (Intel Core i7-8565U CPU @ 1.8 GHz, 16 GB RAM, openSUSE Tumbleweed running on an M.2 SSD). As such, most users will not need to work in a cluster environment, and native installation is recommended (and supported on Linux, macOS, and Windows via Windows Subsystem for Linux 2 (WSL2)).

Results

Provenance Replay is software for documenting and enacting in silico reproducibility in QIIME 2; it can produce command-line (bash) and Python 3 scripts directly from a QIIME 2 Result. Provenance Replay outputs are self-documenting, using UUIDs to identify them as products of specific QIIME 2 Results, and they include step-by-step instructions for executing the scripts produced. Provenance Replay also implements MD5 checksum-based validation of Result provenance, which can alert users if a Result was altered after it was generated, in which case its data provenance is no longer reliable.
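The checksum validation concept can be illustrated with the Python standard library alone. The simplified sketch below assumes, as is true of recent QIIME 2 archive versions, that each Result contains an md5sum-style checksums.md5 manifest; it is not Provenance Replay’s actual implementation, and the filename is hypothetical.

```python
# Simplified sketch of MD5 checksum validation for a QIIME 2 Result.
# Assumes the archive contains an md5sum-style checksums.md5 manifest
# (true of recent QIIME 2 archive versions); this is not Provenance
# Replay's actual implementation.
import hashlib
import zipfile

def result_is_intact(result_path: str) -> bool:
    with zipfile.ZipFile(result_path) as zf:
        # The archive root is the Result's UUID; the manifest lives there.
        manifest = next(n for n in zf.namelist()
                        if n.endswith('checksums.md5'))
        root = manifest.rsplit('/', 1)[0]
        for line in zf.read(manifest).decode().splitlines():
            expected, relpath = line.split(maxsplit=1)
            relpath = relpath[2:] if relpath.startswith('./') else relpath
            observed = hashlib.md5(zf.read(f'{root}/{relpath}')).hexdigest()
            if observed != expected:
                return False  # altered since creation: provenance unreliable
    return True

print(result_is_intact('taxa-bar-plot.qzv'))  # hypothetical filename
```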

Provenance Replay has many features that we consider good general targets for tools that automate reproducibility documentation:

  • Completeness: Provenance Replay provides comprehensive access to captured provenance data.

  • Ease of Documentation: Users can generate a complete “reproducibility supplement,” including replay scripts and citation information, with a single command through different user interface types.

  • Ease of Application: Replay scripts are executable with minimal modification, target a variety of interfaces, and are self-documenting.

  • Accessibility: Replay documents are designed for human readability and include their own usage instructions. Additionally, by providing multiple user interfaces for running Provenance Replay, as well as multiple target interfaces for its outputs, users with varying degrees of computational experience can interpret its results.

Provenance Replay automatically removes most barriers to in silico methods reproducibility in QIIME 2 (with some exceptions discussed below). This simplifies the process of documenting research, and it has already been used to generate reproducibility supplements for scientific publications [27,28].
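As a sense of scale for the “single command” supplement generation described above, the sketch below shows how a reproducibility supplement might be requested through the Python 3 API. The import path, function name, and parameters are assumptions for illustration only; consult the current Provenance Replay documentation for the exact interface.

```python
# Hypothetical sketch of generating a reproducibility supplement through
# the Python 3 API. The import path, function name, and parameters below
# are assumptions for illustration -- consult the Provenance Replay
# documentation for the actual interface.
from provenance_lib import replay_supplement  # assumed import path

replay_supplement(
    payload='taxa-bar-plot.qzv',              # any QIIME 2 Result(s)
    out_fp='reproducibility-supplement.zip',  # replay scripts + citations
)
```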

Availability and future directions

The Provenance Replay software is open source and free for all use (BSD 3-clause license). As of QIIME 2 2023.5 (released 24 May 2023), the software is included in the QIIME 2 “core distribution” (such that it is installed with QIIME 2), and as of QIIME 2 2023.9 (released 11 October 2023) it is included in the QIIME 2 framework itself, ensuring that it will stay current as QIIME 2 continues to evolve and will be available in all QIIME 2 distributions, including any developed by third-parties.

Reproducible and robust bioinformatics (Fig 1) involves unambiguous identification of the data used in an analysis, and enabling this is an important target for tools aiming to facilitate reproducible bioinformatics. This can be challenging to achieve, as it requires stable, unique identifiers, and if data can be mutated, the identifiers should be versioned. QIIME 2 uniquely identifies its data artifacts with UUIDs, and those artifacts are immutable (once created, they cannot be changed without the creation of a new data artifact with a different UUID). This ensures that analyses are reproducible and robust if a researcher has access to the data. Providing general purpose access to QIIME 2 Results is outside the scope of the system, so it remains the responsibility of the user to ensure their data are available to others. Sharing QIIME 2 Results and Provenance Replay reproducibility supplements in a single archive through a stable service such as FigShare is an excellent way for users to ensure that their analysis will be reproducible, replicable, and robust. QIIME 2 may facilitate data access in the future by enabling programmatic retrieval of data artifacts from Qiita [29], or integrating q2-fondue [30] commands with Provenance Replay results, to load data from the NCBI Sequence Read Archive into QIIME 2 as a step in replaying an analysis.

In addition to enabling unambiguous identification of data (including any reference data) that is used, there are several other important targets for tools aiming to facilitate reproducible bioinformatics. First, commands used to generate data must be recorded with all parameter settings, including default values, and unambiguously linked to their input and output data. While checksums of entire files are tempting unique identifiers, they take too long to compute on large files to be practical; QIIME 2 instead uses version 4 UUIDs. Next, all relevant software versions, including versions of underlying dependencies, must be recorded. Details about the environment where command execution occurred are also important to record, including the operating system and its version, and the versions of the programming languages used; differences in any of these variables could cause a failure to reproduce results. Minting and recording an execution identifier (e.g., as a UUID) can help to remove ambiguity regarding when or how a command or workflow was applied. And, while not essential for reproducibility, recording a timestamp of execution can be helpful when trying to make sense of collections of results. Finally, it is generally a good idea to also provide software tools in a containerized environment, to ensure that a working environment will remain accessible in the future (e.g., if unavailability of compatible binaries prevents environment recreation). Khan et al. (2019) [15] provide a more detailed list of specific recommendations, with corresponding justifications, on best practices for ensuring reproducible computational workflows.
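As a concrete, platform-agnostic illustration of several of these targets (execution identifiers, timestamps, environment details, and dependency versions), the sketch below captures them with only the Python standard library. It is a generic example, not QIIME 2’s provenance format.

```python
# Generic sketch of recording the retrospective provenance targets named
# above: an execution UUID, a timestamp, environment details, and
# dependency versions. Standard library only; not QIIME 2's format.
import json
import platform
import sys
import uuid
from datetime import datetime, timezone
from importlib import metadata

record = {
    'execution_id': str(uuid.uuid4()),  # disambiguates repeated runs
    'timestamp': datetime.now(timezone.utc).isoformat(),
    'os': f'{platform.system()} {platform.release()}',
    'python': sys.version.split()[0],
    # versions of the direct dependencies named in this paper
    # (assumes these packages are installed in the environment)
    'packages': {pkg: metadata.version(pkg)
                 for pkg in ('networkx', 'pyyaml', 'click')},
}
print(json.dumps(record, indent=2))
```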

A future goal is to enable Provenance Replay to output replay scripts for users of QIIME 2 through graphical interfaces. QIIME 2 can be accessed through a Python 3 API, a command line interface (CLI), and through various workflow systems, including Galaxy and CWL. At present, Provenance Replay can output Python 3 scripts and bash scripts, providing documentation options for users of the API and CLI. A future development target is to provide reproducibility instructions for Galaxy [31] users as well. Providing complete documentation through higher-level (e.g., graphical) interfaces is more verbose, but ultimately expands the audience who can learn from that documentation.

Comprehensive study documentation is a necessary prerequisite to scientific reproducibility, but many researchers are unable to provide adequate documentation due to limited training, resources, and competing demands for their time. Tools such as Provenance Replay provide a means for ensuring study reproducibility while reducing the documentation burden on bioinformatics users, who may forget to record steps in their computational lab notebooks, or who may not be aware of all of the information that needs to be documented to ensure reproducibility. Provenance Replay largely automates in silico reproducibility in QIIME 2, and this approach can provide a model for other scientific computing platforms. Moving forward, computational tools that record data provenance for the user will be a major advancement for methods reproducibility, allowing researchers to more easily corroborate results, learn from the work of others, and build on the conclusions of scientific studies.

Supporting information

S1 File. qiime2-provenance-replay-code-and-tutorial.

The provenance replay code, as of QIIME 2 2023.9, and a brief usage tutorial with corresponding data. The two .qzv files included in this supplement are “QIIME Zipped Visualization” files. These can be used with QIIME 2 Provenance Replay to generate replay scripts, can be viewed using QIIME 2 View (https://view.qiime2.org), or can be unzipped with any typical unzip utility as they are .zip files with a specific internal structure that enables QIIME 2 to interpret them.

(ZIP)
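Because .qzv files are ordinary zip archives, their structure can be inspected without QIIME 2 at all. The following sketch, using only the Python standard library, prints a Result’s UUID (the archive’s root directory) and lists its recorded provenance actions; the filename is hypothetical.

```python
# Minimal sketch: inspecting a .qzv with the standard library alone.
# The filename is hypothetical; any QIIME 2 Result (.qza/.qzv) works.
import zipfile

with zipfile.ZipFile('taxa-bar-plot.qzv') as zf:
    names = zf.namelist()
    print(names[0].split('/')[0])  # archive root directory = Result UUID
    # Provenance for the Result and all of its ancestors lives under
    # <uuid>/provenance/; each recorded action has an action.yaml file.
    for name in names:
        if '/provenance/' in name and name.endswith('action.yaml'):
            print(name)
```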

Data Availability

Provenance Replay is open source and free for all use (BSD 3-clause license). The original stand-alone version is available at https://github.com/qiime2/provenance-lib. As of QIIME 2 2023.9 (released 11 October 2023) Provenance Replay is included in the QIIME 2 framework, available at https://github.com/qiime2/qiime2.

Funding Statement

This work was funded by NCI ITCR award 1U24CA248454-01 to JGC. The funders had no role in the design, implementation or presentation of the work shared here.

References

  • 1. Cacioppo JT, Kaplan RM, Krosnick JA, Olds JL, Dean H. Social, behavioral, and economic sciences perspectives on robust and reliable science. Report of the Subcommittee on Replicability in Science, Advisory Committee to the National Science Foundation Directorate for Social, Behavioral, and Economic Sciences. 2015. Available from: https://www.nsf.gov/sbe/AC_Materials/SBE_Robust_and_Reliable_Research_Report.pdf
  • 2. University of California Museum of Paleontology. How Science Works. Understanding Science. 2022. Available from: https://undsci.berkeley.edu/lessons/pdfs/how_science_works.pdf
  • 3. Gazzaniga MS. Psychological Science. 6th ed. W. W. Norton; 2018. Available from: http://archive.org/details/dokumen.pub_psychological-science-1-6nbsped-9780393640403
  • 4. Nicholas D, Watkinson A, Jamali HR, Herman E, Tenopir C, Volentine R, et al. Peer review: still king in the digital age. Learn Publ. 2015;28:15–21. doi: 10.1087/20150104
  • 5. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349:aac4716. doi: 10.1126/science.aac4716
  • 6. Baker M. 1,500 scientists lift the lid on reproducibility. Nature. 2016;533:452–454. doi: 10.1038/533452a
  • 7. The Turing Way Community. The Turing Way: A handbook for reproducible, ethical and collaborative research. Zenodo. doi: 10.5281/zenodo.7625728
  • 8. Gundersen OE, Kjensmo S. State of the Art: Reproducibility in Artificial Intelligence. Proceedings of the AAAI Conference on Artificial Intelligence. 2018.
  • 9. Shiffrin RM, Börner K, Stigler SM. Scientific progress despite irreproducibility: A seeming paradox. Proceedings of the National Academy of Sciences. 2018;115:2632–2639. doi: 10.1073/pnas.1711786114
  • 10. Zhao Y, Wilde M, Foster I. Applying the Virtual Data Provenance Model. In: Hutchison D, Kanade T, Kittler J, Kleinberg JM, Mattern F, Mitchell JC, et al., editors. Provenance and Annotation of Data. Berlin, Heidelberg: Springer; 2006. pp. 148–161. doi: 10.1007/11890850_16
  • 11. Munafò MR, Nosek BA, Bishop DVM, Button KS, Chambers CD, Percie du Sert N, et al. A manifesto for reproducible science. Nature Human Behaviour. 2017;1:1–9. doi: 10.1038/s41562-016-0021
  • 12. Mesirov JP. Computer science. Accessible reproducible research. Science. 2010;327:415–416. doi: 10.1126/science.1179653
  • 13. Köster J, Rahmann S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics. 2012;28:2520–2522. doi: 10.1093/bioinformatics/bts480
  • 14. Love MI, Soneson C, Hickey PF, Johnson LK, Pierce NT, Shepherd L, et al. Tximeta: Reference sequence checksums for provenance identification in RNA-seq. PLoS Comput Biol. 2020;16:e1007664. doi: 10.1371/journal.pcbi.1007664
  • 15. Khan FZ, Soiland-Reyes S, Sinnott RO, Lonie A, Goble C, Crusoe MR. Sharing interoperable workflow provenance: A review of best practices and their practical application in CWLProv. Gigascience. 2019;8:giz095. doi: 10.1093/gigascience/giz095
  • 16. Bechhofer S, Buchan I, De Roure D, Missier P, Ainsworth J, Bhagat J, et al. Why linked data is not enough for scientists. Future Gener Comput Syst. 2013;29:599–611. doi: 10.1016/j.future.2011.08.004
  • 17. Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol. 2019;37:852–857. doi: 10.1038/s41587-019-0209-9
  • 18. Brown AM, Bolyen E, Raspet I, Altin JA, Ladner JT. PepSIRF + QIIME 2: software tools for automated, reproducible analysis of highly-multiplexed serology data. arXiv [q-bio.QM]. 2022. Available from: http://arxiv.org/abs/2207.11509
  • 19. Bolyen E, Dillon MR, Bokulich NA, Ladner JT, Larsen BB, Hepp CM, et al. Reproducibly sampling SARS-CoV-2 genomes across time, geography, and viral diversity. F1000Res. 2020;9:657. doi: 10.12688/f1000research.24751.2
  • 20. Python Software Foundation. Python Language Reference. Python Software Foundation; 2001. Available from: http://www.python.org
  • 21. Hagberg AA, Schult DA, Swart PJ. Exploring Network Structure, Dynamics, and Function using NetworkX. In: Varoquaux G, Vaught T, Millman J, editors. Proceedings of the 7th Python in Science Conference. Pasadena, CA, USA; 2008. pp. 11–15.
  • 22. Simonov K, YAML community. PyYAML. The YAML Project; 2006. Available from: https://pyyaml.org/
  • 23. Boulogne F, Mangin O, Verney L, et al. BibTexParser. sciunto-org. Available from: https://bibtexparser.readthedocs.io/en/master/
  • 24. Pallets. Click. Pallets; 2014. Available from: https://click.palletsprojects.com/en/7.0.x/
  • 25. Davis FD. Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology. MIS Q. 1989;13:319–340. doi: 10.2307/249008
  • 26. Keefe CR. Improving In Silico Scientific Reproducibility With Provenance Replay Software. Caporaso JG, editor. Master of Science thesis, Northern Arizona University. 2022. doi: 10.6084/m9.figshare.24217224.v1
  • 27. Borsom EM, Conn K, Keefe CR, Herman C, Orsini GM, Hirsch AH, et al. Predicting neurodegenerative disease using pre-pathology gut microbiota composition: a longitudinal study in mice modeling Alzheimer’s disease pathologies. 2022 Apr. doi: 10.21203/rs.3.rs-1538737/v1
  • 28. Weninger SN, Herman C, Meyer RK, Beauchemin ET, Kangath A, Lane AI, et al. Oligofructose improves small intestinal lipid-sensing mechanisms via alterations to the small intestinal microbiota. Microbiome. 2023;11:169. doi: 10.1186/s40168-023-01590-2
  • 29. Gonzalez A, Navas-Molina JA, Kosciolek T, McDonald D, Vázquez-Baeza Y, Ackermann G, et al. Qiita: rapid, web-enabled microbiome meta-analysis. Nat Methods. 2018;15:796–798. doi: 10.1038/s41592-018-0141-9
  • 30. Ziemski M, Adamov A, Kim L, Flörl L, Bokulich NA. Reproducible acquisition, management and meta-analysis of nucleotide sequence (meta)data using q2-fondue. Bioinformatics. 2022;38:5081–5091. doi: 10.1093/bioinformatics/btac639
  • 31. Afgan E, Baker D, Batut B, van den Beek M, Bouvier D, Čech M, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018;46:W537–W544. doi: 10.1093/nar/gky379
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011676.r001

Decision Letter 0

Mark Alber, Jan-Ulrich Kreft

19 Sep 2023

Dear Dr Caporaso,

Thank you very much for submitting your manuscript "Facilitating Bioinformatics Reproducibility with QIIME 2 Provenance Replay" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

Dear Authors,

Thank you for submitting your manuscript to PLoS Computational Biology. It has now been reviewed by two experts who have done a thorough job and made constructive suggestions for improvement. I have also read the manuscript and agree with their assessment and recommendations.

I agree it will be important to reduce the reliance on the master's thesis by going back to the original citations and this would also be fair to the authors of these studies.

Regarding the tension between developing the concepts about reproducibility and explaining the actual work sufficiently, I agree that the latter is essential but if the authors can manage to better explain the conceptual work in a concise way, I would be happy to have that included. However it must become clearer than it is at the moment. I felt the explanation was not clear enough for someone not familiar with the ideas already and for those who already are, it is less of a problem but not doing a great job as to what are new ideas and why they are important. If you decide to develop this into a separate publication that is fine. If you want to keep it, make it clearer while keeping it concise.

Best wishes,

Jan-Ulrich Kreft

Guest Editor

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Jan-Ulrich Kreft

Guest Editor

PLOS Computational Biology

Mark Alber

Section Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The presented software article motivates and presents a python application capable of extracting metadata from a QIIME2 analysis. This metadata serves as documentation of the analyses performed within QIIME2, and can be used to re-run the analyses. The advantage of programmatically extracting this documentation, or data provenance information, is that it requires very little time and effort from the researcher and guarantees retrieval of accurate and complete information, which can be tedious to achieve manually.

The application is a useful and much-needed addition to the QIIME2 software suite, making it easier and faster for researchers to provide provenance information in scientific publications and related works. The idea is not particularly novel, and references to existing work need to be added, but means to increase the chances that scientists provide good provenance data are very important.

The presentation of the application, its output, and the placement in the field should be improved before publication as outlined in the major comments below.

Major comments:

- The last sentence of the abstract says ‘… and discuss considerations for bioinformatics developers who wish to implement similar functionality in their software.’ I cannot find this back in the Results or Future Directions sections. I think it would be useful to list the key elements that should be reported for microbiome research, and any suggestions you might have for developers in the field.

- The manuscript heavily relies on the reference Keefe 2022, a master thesis which is not easily accessible, only upon request. I feel this is a major limitation and all information necessary to fully understand the paper, its motivation and the application should be given in the manuscript or supplement.

- No proper introduction of the field of workflow documentation is given, and references to similar approaches are missing. This needs to be added. For example, Snakemake generates an analysis graph (DAG) and provides documentation during runtime, and the config file allows ‘replay’ at any time. Other post-analysis documentation tools exist but are not discussed and cited. See how Love et al., 2020, provided an overview of existing applications for provenance tracking of RNA-seq data for an example of how it should look.

- I like the reference to the reproducibility classes of the Turing Way categories/Gundersen. The figure legend (figure 1) is a bit hard to digest though. The concept is clear and easy to understand, so make the figure legend easy to comprehend as well, cite the references in the legend as well, and keep the details for the main text.

- The authors do not place their application in the context of Figure 1. Please do so and discuss the challenges associated with the more general reproducibility classes.

- Figure 2 is extremely difficult to interpret with the limited information given. Even if the reader is very familiar with the steps of the analysis, it is impossible to match input and output data and actions/methods to the DAG. The DAG needs to be sensibly annotated to be useful. Same holds for the Action details which are almost not human-readable in the presented form. Make sure the action details are easily matched to the DAG and the input/output files of the user, to make the DAG a useful feature of Provenance Replay.

Minor comments:

- Avoid spoken expressions like “aren’t” in the text (abstract).

References:

[Love et al., 2020] Love MI, Soneson C, Hickey PF, Johnson LK, Pierce NT, et al. (2020) Tximeta: Reference sequence checksums for provenance identification in RNA-seq. PLOS Computational Biology 16(2): e1007664. https://doi.org/10.1371/journal.pcbi.1007664

Reviewer #2: Keefe et al have developed a useful software component that will, as advertised, serve as an aid to reproducibility in microbiome data analysis. The problem that the authors identify is real: even sophisticated bioinformatics scientists have trouble recording and documenting complex analytical workflows. The solution presented by the authors is to let the computer record the steps and software dependencies from the analysis, a task which seems to be so much more appropriate for computers than humans. While this solution only covers one field of research on one bioinformatics platform, it does point the way forward for other areas of research and software.

Major comments

1. There is a tension within the article between developing broad ideas about reproducibility vs introducing the software. As the primary purpose of this article is to introduce new software, the authors should not develop broad ideas that are not absolutely necessary to motivate and describe the software presented. As just one example, is it necessary to merge reproducibility categories from two organizations into a unified hierarchy? Can’t the software’s utility be justified separately in the context of each framework, without merging them? Personally, I think the broad ideas are interesting and would make for an impactful, but separate, manuscript. The authors do not have the space to properly develop broad ideas about reproducibility here, and any attempts to do so will distract from the main focus of the paper.

2. Much of the background for this manuscript seems to come from a Master’s thesis written by the lead author. The thesis is primarily used as a source for the goals that the software seeks to meet. I tried but was unable to obtain a copy of the thesis from the library at Northern Arizona University. I understand referencing a thesis, but am uncomfortable with the degree to which the basis of the article relies on a thesis from the lead author. As I’ve not read the thesis, I’m not sure about the degree to which the goals are crafted by the author himself vs. aggregated from other sources. If the goals outlined in Keefe 2022 are collected from other sources, please cite them directly. Throughout the article, the reliance on Keefe 2022 should be minimized in favor of briefly describing the rationale and citing original sources.

3. In the Design and Implementation section, the authors say, “initial software design was based on literature review, existing API targets, and discussion with QIIME 2 developers, as well as an initial requirements engineering process and formal focus groups with prospective users.” Even if you don’t have space for all the details, we at least need to hear about the requirements engineering process, and we need to see some data from the focus groups. Because you have gone above and beyond the design practice in many labs, the readers need to see what a robust design process looks like. And if you did it right, these data will provide strong support for your design.

4. In Availability and Future Directions, the authors say, “this approach can provide a model for other scientific computing platforms,” but then don’t address HOW other computing platforms might use ideas from QIIME 2 Provenance Replay. This is the one glaring question that needs to be addressed by the authors. It seems to me that the design was so successful in QIIME 2 because the framework was built from the ground up to keep track of software used in each step. Is the plan to decouple QIIME 2 from microbiome data analysis and extend into other areas? To have separate QIIME 2-like frameworks for each field of research? To develop a general tool that tracks software dependencies on the command line? How does Galaxy fit into this picture? Do Jupyter Notebooks and RMarkdown have anything to learn from this? If you have to trim other parts of the article to make space for exploring this question, please do. This is where you can emphasize the payoff from your software and examine some broad ideas while staying relevant to the topic of the article.

Minor comments

1. The reproducibility hierarchy introduced by the authors, merging work from Gunderson and Kjensmo with ideas from the Turing Way, is presented in an overly technical way. Instead of “we contextualize key factors in generalizability within the broader goals of reproducible research,” you could just as easily say “we developed an expanded hierarchy that includes ideas from both groups.” The figure legend is impossible to understand without a full understanding of the text, which is itself difficult. This prevents the figure from serving as a visual introduction to your ideas about reproducibility. This may be a moot point if you follow the suggestion from major comment 1.

2. You probably mean to cite the Turing Way as “The Turing Way Community. (2021, November 10). The Turing Way: A handbook for reproducible, ethical and collaborative research. Zenodo. http://doi.org/10.5281/zenodo.3233853”

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Julia C Engelmann

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011676.r003

Decision Letter 1

Mark Alber, Jan-Ulrich Kreft

10 Nov 2023

Dear Dr Caporaso,

Thank you for the thorough revision addressing all comments raised by the reviewers satisfactorily. There is therefore no need to send it out to the reviewers again and we are pleased to inform you that your manuscript 'Facilitating Bioinformatics Reproducibility with QIIME 2 Provenance Replay' has been provisionally accepted for publication in PLOS Computational Biology.

There are a few minor changes that I would like you to consider at the proofing stage. In the Design and Implementation section, you report percentages of focus group participants with 4 significant figures although there were only 19. The values should be rounded to two sig. fig. Reporting TAM scores of e.g. 5.82 would be easier to interpret if you stated the maximum score. In Results you use the term computational literacy; it may be nicer to use experience instead.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Jan-Ulrich Kreft

Guest Editor

PLOS Computational Biology

Mark Alber

Section Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011676.r004

Acceptance letter

Mark Alber, Jan-Ulrich Kreft

20 Nov 2023

PCOMPBIOL-D-23-00967R1

Facilitating Bioinformatics Reproducibility with QIIME 2 Provenance Replay

Dear Dr Caporaso,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
