Standardized and accessible multi-omics bioinformatics workflows through the NMDC EDGE resource

Julia M Kelliher; Yan Xu; Mark C Flynn; Michal Babinski; Shane Canon; Eric Cavanna; Alicia Clum; Yuri E Corilo; Grant Fujimoto; Cameron Giberson; Leah YD Johnson; Kaitlyn J Li; Po-E Li; Valerie Li; Chien-Chi Lo; Wendi Lynch; Paul Piehowski; Kaelan Prime; Samuel Purvine; Francisca Rodriguez; Simon Roux; Migun Shakya; Montana Smith; Setareh Sarrafan; Shreyas Cholia; Lee Ann McCue; Chris Mungall; Bin Hu; Emiley A Eloe-Fadrosh; Patrick SG Chain

Comput Struct Biotechnol J. 2024 Sep 27;23:3575–3583. doi: 10.1016/j.csbj.2024.09.018

PMCID: PMC11832004 PMID: 39963423

Abstract

Accessible and easy-to-use standardized bioinformatics workflows are necessary to advance microbiome research from observational studies to large-scale, data-driven approaches. Standardized multi-omics data enables comparative studies, data reuse, and applications of machine learning to model biological processes. To advance broad accessibility of standardized multi-omics bioinformatics workflows, the National Microbiome Data Collaborative (NMDC) has developed the Empowering the Development of Genomics Expertise (NMDC EDGE) resource, a user-friendly, open-source web application (https://nmdc-edge.org). Here, we describe the design and main functionality of the NMDC EDGE resource for processing metagenome, metatranscriptome, natural organic matter, and metaproteome data. The architecture relies on three main layers (web application, orchestration, and execution) to ensure flexibility and expansion to future workflows. The orchestration and execution layers leverage best practices in software containers and accommodate high-performance computing and cloud computing services. Further, we have adopted a robust user research process to collect feedback for continuous improvement of the resource. NMDC EDGE provides an accessible interface for researchers to process multi-omics microbiome data using production-quality workflows to facilitate improved data standardization and interoperability.

Keywords: Microbiome, Multi-omics, Bioinformatics workflows, Standardization, Software, Open-source

Highlights

• NMDC EDGE is a resource for accessible, standardized microbiome multi-omics workflows.
• Layered software architecture ensures flexibility and enables updates to workflows.
• Feedback is collected through user research efforts to improve the resource.

1. Introduction

Multi-omics methods, including a combination of metagenomics, metatranscriptomics, metabolomics, and/or metaproteomics methods, have become more affordable and accessible, enabling new ways to explore diverse microbiomes [25]. Challenges nonetheless exist for researchers to select appropriate computational tools for data analysis and integration, which often require extensive bioinformatics experience and computational resources [20], [51]. Further, the growing number of bioinformatics tools has also resulted in inconsistent data outputs that can be neither compared nor standardized across samples or studies [33]. This limits the generation of Findable, Accessible, Interoperable, and Reusable (FAIR) data, making meta-analyses and machine-learning applications difficult or impossible [55]. Together, these challenges surrounding multi-omics data processing significantly hinder progress in the field of microbiome research.

Web-based resources and cyberinfrastructures are effective ways to promote the accessibility of bioinformatics tools to the larger community because they allow for widespread, on-demand access, reduce the need for specialized local hardware/software installations and maintenance, and foster collaboration by providing centralized platforms for data sharing and tool integration [52], [37], [2], [36]. They can streamline running bioinformatics workflows without the need for local downloads and can be made available to researchers across the globe. However, to fully leverage these resources, users often need substantial training, an in-depth understanding of computational tools, and the ability to weave multiple tools into a workflow.

The Department of Energy’s (DOE) National Microbiome Data Collaborative (NMDC) program strives to provide the microbiome research community with tools and resources that facilitate FAIR data practices [57], [15]. To address shortcomings with many existing bioinformatics workflows and resources, we developed NMDC Empowering the Development of Genomics Expertise (EDGE) to support access to the standardized bioinformatics workflows used to process microbiome data available in the NMDC Data Portal [15]. These standardized workflows, developed at two DOE user facilities, the Joint Genome Institute (JGI) and the Environmental Molecular Sciences Laboratory (EMSL), process raw multi-omics data and produce interoperable annotated data from metagenomes, metatranscriptomes, metaproteomes, and natural organic matter characterizations. To date, access to these workflows has largely been limited to the facilities for which they were developed. Herein, we describe the NMDC EDGE resource, which was modeled after the generalized EDGE bioinformatics platform [36] and modified to support a greater volume of projects focusing on microbiome multi-omics data.

2. Methods

2.1. NMDC EDGE architecture overview

The NMDC EDGE architecture is built using a flexible, modular design, updated from the generalized EDGE bioinformatics platform [36], and is divided into three distinct layers: web application, orchestration, and execution (Fig. 1). This modular design supports updates to workflows and inclusion of new workflows.

2.2. NMDC EDGE architecture layers

The web application layer (Fig. 1A) forms the user interface and is responsible for user interactions. It employs the MERN (MongoDB, Express.js, React, and Node.js) stack, a modern JavaScript web application stack, ensuring a user-friendly experience and low maintenance costs [26]. The web application frontend collects user inputs, provides data visualizations and data downloads, and manages user profiles. It is developed with the ReactJS framework and the CoreUI free React admin template (https://coreui.io/product/free-react-admin-template/). The backend of the web application uses a MongoDB database to store user credentials, input data, and workflow outputs, and it controls access to different projects. The frontend communicates with the backend via a set of HTTP Application Programming Interfaces (APIs), which provide a standard approach to accessing and manipulating resources, and a dedicated backend service tracks the status and progress of submitted workflows. In terms of extensibility, this architecture allows new workflows to be supported through the seamless addition of new ReactJS components.

The orchestration layer (Fig. 1B) manages the bioinformatics workflows and job scheduling. It is built on the Cromwell workflow manager, which executes complex workflows defined in the Workflow Description Language (WDL) [53]. Cromwell uses a MySQL database to store execution information, which allows workflow executions to be tracked and managed. This setup bridges the gap between the web application layer and the execution layer.
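
As a rough illustration of how a web backend can hand a WDL workflow to Cromwell, the sketch below prepares a submission for Cromwell's REST interface (POST /api/workflows/v1 with the workflowSource and workflowInputs form fields) and interprets a status response. The server address, the toy "hello" workflow, and the helper names are assumptions made for illustration and are not taken from the NMDC EDGE code base; no request is actually sent.

```python
import json
from typing import Dict, Tuple
from urllib.parse import urljoin

# Assumption: a Cromwell server reachable at this address.
CROMWELL_URL = "http://localhost:8000"

# Cromwell statuses after which a workflow will not change state again.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Aborted"}

# Toy WDL 1.0 workflow, used only to have something to submit.
HELLO_WDL = """\
version 1.0

workflow hello {
  input {
    String name
  }
  call say { input: name = name }
}

task say {
  input {
    String name
  }
  command {
    echo "hello ~{name}"
  }
  output {
    String out = read_string(stdout())
  }
}
"""


def submit_request(wdl_text: str, inputs: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Return the URL and form fields for a workflow submission."""
    url = urljoin(CROMWELL_URL, "/api/workflows/v1")
    fields = {
        "workflowSource": wdl_text,             # the WDL document itself
        "workflowInputs": json.dumps(inputs),   # inputs as a JSON string
    }
    return url, fields


def status_url(workflow_id: str) -> str:
    """Return the URL that reports the status of one workflow."""
    return urljoin(CROMWELL_URL, f"/api/workflows/v1/{workflow_id}/status")


def is_finished(status_response: Dict[str, str]) -> bool:
    """True once Cromwell reports a terminal status."""
    return status_response.get("status") in TERMINAL_STATUSES
```

With an HTTP client, the fields would be sent as multipart form data, and the status URL would be polled until is_finished returns True.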

Unlike the web application and orchestration layers, which are abstracted from the computing environment, the execution layer (Fig. 1C) is responsible for the actual execution of tasks and interacts directly with the computing resources and job schedulers. NMDC EDGE has been tested with the Simple Linux Utility for Resource Management (Slurm) and will also run with other resource management tools supported by Cromwell. Because this layer is isolated, the architecture remains flexible and can execute tasks without requiring modifications to existing workflows.
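
To make the role of the execution layer concrete, the following sketch shows one way a task's resource request could be translated into a Slurm submission. The options used (--job-name, --cpus-per-task, --mem, --time, --wrap) are standard sbatch options, but the function, the job name, the resource values, and the script path are hypothetical. In a Cromwell deployment this translation normally lives in the backend configuration rather than in Python.

```python
import shlex
from typing import List


def sbatch_command(job_name: str, cpus: int, memory_gb: int,
                   walltime: str, script: str) -> List[str]:
    """Build an sbatch argument list for one workflow task.

    Only the scheduler-facing part changes if another resource manager
    is used; the workflow definition itself stays the same.
    """
    return [
        "sbatch",
        f"--job-name={job_name}",
        f"--cpus-per-task={cpus}",
        f"--mem={memory_gb}G",          # memory per node, in gigabytes
        f"--time={walltime}",           # wall-clock limit, HH:MM:SS
        f"--wrap=/bin/bash {shlex.quote(script)}",
    ]


# Hypothetical task: 16 CPUs, 64 GB of memory, 12-hour limit.
command = sbatch_command("reads_qc", 16, 64, "12:00:00", "/scratch/run1/script.sh")
print(" ".join(command))
```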

To streamline updates and support inclusion of new bioinformatics workflows, all workflow executable files are provided as software containers. Compared to native installations of any bioinformatics workflow, software containers ensure reproducibility and provide increased flexibility for adding new tools and workflows, as they eliminate software incompatibility issues by encapsulating the necessary dependencies and environments. The web application layer and the orchestration layer are deployed to a virtual machine (VM), which has shared project storage space with the high-performance computing (HPC) environment that executes all the workflows.

NMDC EDGE is currently hosted at the San Diego Supercomputer Center (SDSC) and operates within a VM environment with 8 CPUs and 16 GB of RAM dedicated to web hosting and the Cromwell workflow manager. We obtained our allocation on Expanse using ACCESS (Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support) [5]. The workflow manager, in turn, oversees the execution of workflow jobs on the SDSC Expanse cluster computer, which shares a file system with the VM. The workflows are currently executed on Expanse cluster compute nodes with 256 GB of memory. One benefit to the design of NMDC EDGE is its portability across computational resources, where advanced users can download the source code from GitHub (https://github.com/microbiomedata/nmdc-edge). For local HPC installation of NMDC EDGE, users must configure the orchestration and execution layers to work with their existing resource manager and computing environment. For cloud deployments, reconfiguration of the orchestration and execution layers is required, including cloud native code development or replacing Cromwell with the cloud provider's workflow orchestration system, but no changes are needed to the web application layer or the workflow runtime locations.

3. Results

3.1. The NMDC workflows

Currently, NMDC EDGE offers five main workflows for processing sequence data (metagenome, metatranscriptome, and prediction of viruses and plasmids) and mass spectrometry data (metaproteome and natural organic matter characterization) (Fig. 2) [13], [14], [15], [8]. The inputs, tools, and outputs of the workflows are outlined in Table 1, and more detailed information regarding the versions, parameters, and other specifics about the workflows can be found on the NMDC documentation site (https://nmdc-documentation.readthedocs.io/en/latest/index.html) and in the NMDC GitHub (https://github.com/microbiomedata/). Every version of the NMDC standardized workflows has fixed tools, parameters, and underlying databases to maintain standardization across runs.

Fig. 2. The NMDC standardized bioinformatics workflows and their associated inputs and main outputs for (A) sequencing data and (B) mass spectrometry data. Additional output files and visualizations are shown in Table 1. The rectangles indicate workflows, and gray parallelograms indicate inputs and outputs. Abbreviations: DI FT-ICR MS, direct infusion Fourier-transform ion cyclotron resonance mass spectrometry; MAGs, metagenome-assembled genomes; LC-MS/MS, liquid chromatography-tandem mass spectrometry.

Table 1.

The NMDC standardized bioinformatics workflows, their inputs, the tools they are using, and the outputs available through NMDC EDGE, in addition to the downloadable output files.

Workflow	Sub-workflow	Input	Tools	Outputs	Additional NMDC EDGE Output Visualizations
Metagenome Workflow	Reads QC	Raw Illumina sequencing data (.fastq, .fq, .fastq.gz, .fq.gz)	BBTools: rqcfilter2, bbduk, BBMap [3]	Cleaned data as a compressed interleaved FASTQ file (.fq.gz) and QC statistics (.txt)	QC statistics in a summary table
Metagenome Workflow	Read-based Taxonomy Classification	Illumina data (QC-ed) (.fastq, .fq, .fastq.gz, .fq.gz)	GOTTCHA2 [17], Kraken2 [56], Centrifuge [32]	Profiling results for each tool at 3 taxonomic levels (species, genus, family)	Summary tables, interactive Krona plots [46]
Metagenome Workflow	Metagenome Assembly	Illumina data (QC-ed) (.fastq, .fq, .fastq.gz, .fq.gz)	BBTools: bbcms, BBMap; metaSPAdes [3], [42]	Assembled contigs file, scaffolds file, assembly coverage and description files	Table of assembly statistics
Metagenome Workflow	Metagenome Annotation	Assembled contig file (.fasta, .fa, .fna, .fasta.gz, .fa.gz, .fna.gz)	tRNAscan-SE [9], Infernal [43], CRT-CLI [4], Prodigal [22], GeneMarkS-2 [38], LAST [18], HMMER [16], [13]	Structural annotations, functional annotations, KEGG summary, Enzyme Commission summary, gene phylogeny summary	Tables of annotation statistics and features
Metagenome Workflow	Metagenome Assembled Genomes (MAGs)	Assembled contigs (.fasta, .fa, or .fna), read mapping file from the assembly (.sam.gz or .bam), functional annotation of the assembly (.gff)	SAMtools [35], MetaBat2 [27], CheckM [48], GTDB-TK [10], HMMER [16], Prodigal [22], pplacer [39], FastANI [23], FastTree [49], mash [47]	File of High Quality (HQ) and Medium Quality (MQ) bins as well as other lower quality bins	Summary tables of MAG binning information and quality; each bin’s annotation results
Metatranscriptome Workflow	Reads QC	Raw Illumina sequencing data (.fastq, .fq, .fastq.gz, .fq.gz)	BBTools: rqcfilter2, BBMap [3]	Cleaned data as a compressed interleaved FASTQ file (.fq.gz) and QC statistics (.txt)	QC statistics
Metatranscriptome Workflow	Metatranscriptome Assembly	Illumina data (QC-ed) (.fastq, .fq, .fastq.gz, .fq.gz)	rnaSPAdes [7]	Assembled contigs file of transcripts, scaffolds file, assembly coverage, and description files	Tables of assembly statistics
Metatranscriptome Workflow	Metatranscriptome Annotation	Assembled contig file (.fasta, .fa, .fna, .fasta.gz, .fa.gz, .fna.gz)	tRNAscan-SE [9], Infernal [43], CRT-CLI [4], Prodigal [22], GeneMarkS-2 [38], LAST [18], HMMER [16], [13]	Structural annotations, functional annotations, KEGG summary, Enzyme Commission summary, gene phylogeny summary	Tables of annotation statistics and features
Metatranscriptome Workflow	Read Count	Mapped reads (.bam) and annotation file (.gff)	readCov_metaTranscriptome_2k20.pl (dongyingwu/rnaseqct:1.1)	Read counts for transcripts	Table of transcripts and their read counts
Viruses & Plasmids Workflow		Assembly file from a metagenome, metatranscriptome, or genome assembly workflow (.fasta, .fa, .fna)	geNomad [8], CheckV [44]	List of predicted virus sequences and/or regions, along with confidence scores, annotation, completeness, and contamination estimations. List of predicted plasmids with confidence scores and annotations.	Interactive summary table of information about predicted viruses; interactive summary table of information about predicted plasmids; interactive summary table of virus quality
Natural Organic Matter Workflow		Direct Infusion Mass Spectrum Instrument data (Bruker (.d) and Thermo (.raw)) and/or mass list data (.csv, .txt)	CoreMS 1.0 [14]	Table of all measured m/z and all possible molecular formula assignments for each m/z	Mass Spectrum, Mass Error Distribution, van Krevelen diagram, and Carbon # vs. DBE diagram for each heteroatomic class
Metaproteome Workflow		LC-MS/MS Data (.raw); Assembled contig file (.fasta); functional annotation of the assembly (.gff)	MSConvert [21], [30], MSGF+ [31], [41], MASIC [24]	Table of identified peptide sequences, protein table with relative abundance measurements, and functional annotations	Summary table of QC metrics

3.1.1. Metagenome workflow

The NMDC standardized metagenome workflow (Fig. 2A) leverages JGI’s production pipeline for short-read data and consists of the following workflows: reads quality control (QC), metagenome assembly, metagenome annotation, and binning of population genomes to generate metagenome-assembled genomes (MAGs) (Table 1) [13], [15].

The reads QC workflow utilizes rqcfilter2 to trim and filter low quality data from raw metagenome Illumina reads (FASTQ files). The workflow additionally removes artifacts, linkers, adapters, spike-in reads, and reads mapping to several hosts and common contaminants. The NMDC EDGE interface provides users with a summary table of QC statistics and a variety of metrics, including the number of reads and bases before and after QC filtering.

The read-based taxonomy classification workflow, which is not part of the JGI production pipeline, uses three distinct classifiers (GOTTCHA2, Kraken2, and Centrifuge) to profile quality-controlled reads [17], [56], [32]. Three tools are offered to accommodate varied project goals and sequencing approaches: depending on the algorithm and cut-off levels of each tool, results span a spectrum from high sensitivity to high specificity. The NMDC EDGE interface also provides summary tables and interactive Krona plots as visual outputs for this workflow [46].

The metagenome assembly workflow uses bbcms, metaSPAdes, and BBMap to run error correction, assembly, and assembly validation, respectively [13]. NMDC EDGE provides an output table of assembly statistics. The metagenome annotation workflow takes in assembled metagenomes and generates structural and functional annotations. The metagenome annotation results in NMDC EDGE include tables of statistics for processed sequences, predicted genes, and general quality information from the workflow. The MAGs workflow uses MetaBAT2 to generate metagenome bins and applies the Minimum Information about a Metagenome-Assembled Genome (MIMAG) standards, using annotated tRNAs and rRNAs together with CheckM marker-gene estimates of completeness and contamination, followed by taxonomic lineage assignment [6], [13], [11]. The MAGs result page in NMDC EDGE provides a summary section with information on binned and unbinned contigs, genome completeness, estimated contamination, and the number of genes present on all bins determined to be high quality or medium quality.
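
The quality tiers can be illustrated with the published MIMAG thresholds: a high-quality draft requires more than 90% completeness, less than 5% contamination, the 23S, 16S, and 5S rRNA genes, and at least 18 tRNAs, while a medium-quality draft requires at least 50% completeness and less than 10% contamination. The function below is a simplified sketch of that rule. Its name and arguments are hypothetical, it groups everything else as lower quality, and the NMDC workflow's own implementation may differ in detail.

```python
def mimag_quality(completeness: float, contamination: float,
                  has_all_rrnas: bool, trna_count: int) -> str:
    """Classify a bin as 'HQ', 'MQ', or 'LQ' using MIMAG-style thresholds.

    completeness and contamination are percentages (0-100), for example
    as estimated from marker genes by a tool such as CheckM.
    """
    if (completeness > 90 and contamination < 5
            and has_all_rrnas and trna_count >= 18):
        return "HQ"
    if completeness >= 50 and contamination < 10:
        return "MQ"
    # Simplification: every remaining bin is reported as lower quality.
    return "LQ"


# Hypothetical bins: (completeness, contamination, rRNAs present, tRNA count)
bins = {
    "bin.1": (97.2, 1.1, True, 20),
    "bin.2": (95.0, 2.0, False, 20),   # misses rRNAs, so falls back to MQ
    "bin.3": (41.5, 3.0, False, 9),
}
for name, metrics in bins.items():
    print(name, mimag_quality(*metrics))
```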

Users can run a single workflow within the metagenome pipeline with the appropriate input files, and the entire metagenome workflow is available to run from start to finish on NMDC EDGE from a single input raw Illumina file (Fig. 2A). Upon completion of the run, users can view the results, which are grouped by individual workflow.

3.1.2. Metatranscriptome workflow

The NMDC standardized metatranscriptome workflow (Fig. 2A) leverages JGI’s production pipeline and consists of reads QC, metatranscriptome assembly, annotation, and read count workflows. As in the metagenome workflow, the reads QC workflow uses rqcfilter2, but it also removes ribosomal RNA reads. The metatranscriptome assembly workflow uses rnaSPAdes [7] for assembly and BBMap to map the reads back to contigs. Assembled transcripts are then annotated with the metagenome annotation workflow described above. Next, reads are counted by mapping in the sense and antisense directions to annotated features (e.g., coding sequence or CDS) and unannotated features (e.g., intergenic region). The workflow result page includes a table of the top 100 expressed genes as measured by their read counts. Selecting the header of a column sorts the data by that column. Users can also download a .tsv file of all detected features in the input dataset for further analysis.
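
For users working with the downloaded .tsv file, a table like the top-100 view can be rebuilt locally. The sketch below is a minimal example that assumes two hypothetical column names, feature_id and read_count; the actual column names in the NMDC EDGE output may differ and should be checked in the file header.

```python
import csv
import io
from typing import List, Tuple


def top_features(tsv_text: str, n: int = 100) -> List[Tuple[str, int]]:
    """Return the n features with the highest read counts.

    Assumes a tab-separated table with 'feature_id' and 'read_count'
    columns (hypothetical names used for illustration).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(row["feature_id"], int(row["read_count"])) for row in reader]
    rows.sort(key=lambda item: item[1], reverse=True)
    return rows[:n]


# Small made-up table standing in for the downloaded file.
example = "feature_id\tread_count\ngene_a\t12\ngene_b\t340\ngene_c\t57\n"
print(top_features(example, n=2))
```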

3.1.3. Viruses & plasmids workflow

This workflow uses the newly developed geNomad tool [8] to detect putative viruses and plasmids from metagenome and metatranscriptome data (Fig. 2A). The workflow provides quality and confidence information from the outputs of CheckV [44]. In NMDC EDGE, the results are displayed as multiple tables. The first output table includes information about predicted viruses in the input data, including sequence length, topology, coordinates, number of genes, genetic code, virus score, false discovery rate (FDR), number of hallmark genes, marker enrichment, and taxonomy. The second table provides the plasmid prediction summary, which includes information on sequence length, topology, number of genes, genetic code, plasmid score, false discovery rate (FDR), number of hallmark genes, marker enrichment, conjugation genes, and any antimicrobial resistance (AMR) genes present.

3.1.4. Natural organic matter workflow

This workflow leverages EMSL’s CoreMS framework and takes in direct infusion Fourier-transform ion cyclotron resonance mass spectrometry (DI FT-ICR MS) data that undergoes signal processing and molecular formula assignment [14] (Fig. 2B). Time-domain data is transformed into the frequency domain and then into the mass-to-charge ratio (m/z) domain using the Fourier transform and Ledford-based equations [54]. Data is then denoised, followed by peak picking and recalibration against an external reference list of known compounds, and searched against a dynamically generated molecular formula library with a defined molecular search space. The downloadable output file from NMDC EDGE is a molecular formula table whose columns describe measurements and attributes of the mass spectrometry data, including measured m/z, peak height, the molecular formula candidate, the mass accuracy associated with each molecular formula, the heteroatomic class, and a composite confidence score that combines the mass accuracy and the spectral similarity of the fine isotopic structure. For each m/z measurement, the table shows all molecular formula candidates that are possible within the molecular search space and the parameters associated with instrument performance. NMDC EDGE uses the default parameters defined in the CoreMS software; these parameters can be modified using the enviroMS and CoreMS Python packages and Docker images [14].
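
Several quantities behind the reported columns and plots follow from standard formulas: mass error in parts per million, double bond equivalents (DBE) for a CHNO formula, and the H/C and O/C ratios plotted in a van Krevelen diagram. The sketch below computes them directly. It does not use or reproduce CoreMS, its function names are hypothetical, and the composite confidence score is not modeled.

```python
from typing import Tuple


def mass_error_ppm(measured_mz: float, theoretical_mz: float) -> float:
    """Mass error of an assignment in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def double_bond_equivalents(c: int, h: int, n: int = 0) -> float:
    """DBE of a neutral formula; oxygen and sulfur do not contribute."""
    return c - h / 2 + n / 2 + 1


def van_krevelen_ratios(c: int, h: int, o: int) -> Tuple[float, float]:
    """Return (H/C, O/C), the two axes of a van Krevelen diagram."""
    return h / c, o / c


# Worked example with C6H12O6 and a made-up pair of m/z values.
print(double_bond_equivalents(c=6, h=12))        # 1.0
print(van_krevelen_ratios(c=6, h=12, o=6))       # (2.0, 1.0)
print(round(mass_error_ppm(200.0002, 200.0), 3)) # 1.0
```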

3.1.5. Metaproteome workflow

The metaproteome workflow (Fig. 2B) is an end-to-end data-dependent acquisition (DDA) workflow for protein identification and relative quantification using bottom-up mass spectrometry (MS) data. The workflow takes in raw liquid chromatography-tandem mass spectrometry (LC-MS/MS) data files and an associated metagenome FASTA file to generate peptide identifications at the user-specified false discovery rate (FDR) using MSGF+, relative abundance values derived by MASIC from MS1 area-under-the-curve measurements, and protein functional annotations from the provided metagenome [31], [40], [24]. A QC output displayed in NMDC EDGE summarizes the proteomic search results and enables users to quickly gauge dataset quality. Result files are provided as downloadable text files. The raw output of the pipeline, prior to FDR correction, is provided along with the NMDC-produced FASTA file used for the database search. The FDR-corrected results are output at the unique peptide sequence level as well as at the protein level, where they are rolled up using parsimonious inference and by summing peptide-level abundance measurements [45].
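
The summing step of the protein-level roll-up can be sketched in a few lines. This example assumes that each peptide has already been assigned to exactly one protein; the parsimonious inference that produces such an assignment is not implemented here, and the peptide sequences, protein identifiers, and abundances are made up.

```python
from collections import defaultdict
from typing import Dict


def roll_up(peptide_abundance: Dict[str, float],
            peptide_to_protein: Dict[str, str]) -> Dict[str, float]:
    """Sum peptide-level abundances into protein-level abundances."""
    totals: Dict[str, float] = defaultdict(float)
    for peptide, abundance in peptide_abundance.items():
        totals[peptide_to_protein[peptide]] += abundance
    return dict(totals)


# Hypothetical FDR-filtered peptides and their assigned proteins.
abundances = {"LVEALK": 1.5e6, "AGFTDR": 2.5e6, "TTPSYK": 4.0e5}
assignment = {"LVEALK": "protein_1", "AGFTDR": "protein_1", "TTPSYK": "protein_2"}
print(roll_up(abundances, assignment))
```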

3.2. Running the NMDC workflows in NMDC EDGE

The NMDC EDGE resource is open and available at no cost to the microbiome research community at https://nmdc-edge.org (Fig. 3). To run the available NMDC workflows, users must log in to the site using ORCID credentials (https://orcid.org/). Once logged in, the ‘Upload Files’ option allows users to upload omics data files to the NMDC EDGE resource. The allowable file types are listed on the webpage, and the maximum per-file size is 10.0 gigabytes (GB) due to limitations of using HTTPS to upload data. Raw files remain on the NMDC EDGE server for 180 days before they are deleted, and users are allocated a total storage space of 150.0 GB. The limits on individual file size and storage space are comparable to those of other web-based bioinformatics resources [52], [2]. Users can manage their uploads under the ‘My Uploads’ menu to delete, share, or publish files publicly. Publicly available data files may also be added to NMDC EDGE using the ‘Retrieve SRA Data’ workflow, which imports data housed in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) directly into NMDC EDGE [34], [28]. Users can then select these datasets as their input data when running a workflow.
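
The upload rules stated above (10 gigabytes per file, 150 gigabytes in total, deletion after 180 days) can be expressed as a small pre-upload check. This is a sketch only: the function names are hypothetical, decimal gigabytes are assumed, and the checks actually performed by NMDC EDGE may be implemented differently.

```python
from datetime import date, timedelta
from typing import Tuple

GB = 10**9                      # assumption: decimal gigabytes
MAX_FILE_BYTES = 10 * GB        # per-file upload limit
MAX_TOTAL_BYTES = 150 * GB      # per-user storage allocation
RETENTION_DAYS = 180            # raw files are deleted after this period


def can_upload(file_bytes: int, used_bytes: int) -> Tuple[bool, str]:
    """Check one file against the per-file and total storage limits."""
    if file_bytes > MAX_FILE_BYTES:
        return False, "file exceeds the per-file limit"
    if used_bytes + file_bytes > MAX_TOTAL_BYTES:
        return False, "upload would exceed the storage allocation"
    return True, "ok"


def deletion_date(uploaded_on: date) -> date:
    """Date on which an uploaded raw file is due to be deleted."""
    return uploaded_on + timedelta(days=RETENTION_DAYS)


print(can_upload(4 * GB, used_bytes=120 * GB))   # (True, 'ok')
print(deletion_date(date(2024, 1, 1)))           # 2024-06-29
```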

Selecting any available workflow will open a webpage where users can input their run information (Fig. 3). Users can select test data, publicly shared data files, SRA data, or privately uploaded files as input to the workflow. Users must also provide any other required information (e.g., if their data is interleaved or the files are paired). Once all required information has been entered and the run has been submitted, the project will appear in the ‘My Projects’ menu (Fig. 3). Within this page, users can view run information, such as the project status (Submitted, In queue, Running, Complete, or Failed) and run type. Users are also given the option to share their project publicly (with all NMDC EDGE users) or to share it with specific NMDC EDGE accounts. Users can select the “View Results” icon to navigate to their results, including summary tables, workflow-specific visualizations, log files, and downloadable output files.
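
The project statuses named above lend themselves to a small state model. The sketch below is an assumption-laden illustration of how a backend could track them: the status names come from the text, whereas the allowed transitions, the class, and the field names are hypothetical and do not describe the actual NMDC EDGE implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set


class Status(Enum):
    SUBMITTED = "Submitted"
    IN_QUEUE = "In queue"
    RUNNING = "Running"
    COMPLETE = "Complete"
    FAILED = "Failed"


# Hypothetical transition rules; the real service may allow others.
ALLOWED: Dict[Status, Set[Status]] = {
    Status.SUBMITTED: {Status.IN_QUEUE, Status.FAILED},
    Status.IN_QUEUE: {Status.RUNNING, Status.FAILED},
    Status.RUNNING: {Status.COMPLETE, Status.FAILED},
    Status.COMPLETE: set(),
    Status.FAILED: set(),
}


@dataclass
class Project:
    project_id: str
    workflow: str
    status: Status = Status.SUBMITTED
    history: List[Status] = field(default_factory=list)

    def advance(self, new_status: Status) -> None:
        """Move to a new status, rejecting transitions that make no sense."""
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.value} -> {new_status.value} is not allowed")
        self.history.append(self.status)
        self.status = new_status

    def to_json(self) -> Dict[str, str]:
        """Payload a status endpoint might return to the frontend."""
        return {"id": self.project_id, "workflow": self.workflow,
                "status": self.status.value}


project = Project("demo-1", "Metagenome Reads QC")
for step in (Status.IN_QUEUE, Status.RUNNING, Status.COMPLETE):
    project.advance(step)
print(project.to_json())
```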

The NMDC EDGE resource uses a first-in-first-out (FIFO) scheduler to execute jobs in the order they are submitted to the system. The ‘Job Queue’ menu lists all running and pending jobs (Fig. 3). Users can access this menu to see how many projects are running or waiting to run, which can aid in estimating run times. The average runtimes for each of the workflows are listed in Table 2, but the runtimes will vary based on file size, file type, data complexity, and the job queue.
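
First-in-first-out scheduling can be illustrated with a double-ended queue: jobs are appended at one end and started from the other, so the number of jobs ahead of a project is simply its position in the queue. The class and project identifiers below are hypothetical and only demonstrate the ordering behavior.

```python
from collections import deque
from typing import Deque, Optional


class JobQueue:
    """Minimal FIFO queue: jobs start in the order they were submitted."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()

    def submit(self, project_id: str) -> None:
        self._pending.append(project_id)

    def jobs_ahead(self, project_id: str) -> int:
        """Number of pending jobs that will start before this one."""
        return list(self._pending).index(project_id)

    def next_job(self) -> Optional[str]:
        """Pop the oldest pending job, or None if the queue is empty."""
        return self._pending.popleft() if self._pending else None


queue = JobQueue()
for project in ("project_a", "project_b", "project_c"):
    queue.submit(project)
print(queue.jobs_ahead("project_c"))  # 2
print(queue.next_job())               # project_a
```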

Table 2.

Average workflow runtimes in NMDC EDGE. For the reads QC, read-based taxonomy classification, metagenome assembly, metagenome annotation, and metagenome MAGs workflows, runtimes from standalone runs and from runs within the larger metagenome pipeline were combined and averaged to produce the results shown in the table. This information was compiled using the runtime for each task for the 4865 projects that had been run to date when the analysis was performed. File sizes for these projects ranged from 53 B to 13.9 GB, with an average size of 325 MB and a standard deviation of 973.6 MB.

Workflow	Average Runtime (hours)
Metagenome Reads QC	1.82
Metagenome Read-based Taxonomy Classification	0.81
Metagenome Assembly	2.28
Metagenome Annotation	8.85
Metagenome MAGs	0.41
Metatranscriptome	12.64
Viruses and Plasmids	0.97
Natural Organic Matter	0.45
Metaproteome	5.31
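
As a rough planning aid, the per-workflow averages in Table 2 can be added up to approximate an end-to-end metagenome run. The sketch below does this arithmetic. Because the averages pool standalone and pipeline runs and exclude time spent waiting in the job queue, the total should be read as an indication only, not as a measured pipeline runtime.

```python
from typing import Iterable

# Average runtimes in hours, copied from Table 2.
AVERAGE_RUNTIME_HOURS = {
    "Metagenome Reads QC": 1.82,
    "Metagenome Read-based Taxonomy Classification": 0.81,
    "Metagenome Assembly": 2.28,
    "Metagenome Annotation": 8.85,
    "Metagenome MAGs": 0.41,
}


def estimated_hours(steps: Iterable[str]) -> float:
    """Sum the average runtimes of the selected workflows."""
    return round(sum(AVERAGE_RUNTIME_HOURS[step] for step in steps), 2)


# All five metagenome workflows run back to back.
print(estimated_hours(AVERAGE_RUNTIME_HOURS))  # 14.17
```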

3.3. Training materials & documentation

A suite of training materials and documentation is available through NMDC EDGE that includes video tutorials, user guides, and technical documentation (https://nmdc-edge.org/tutorial; https://nmdc-documentation.readthedocs.io/en/latest/tutorials/run_workflows.html) (Fig. 3). Translations of the user guides into Spanish and French are also available. Additional instructional content and descriptions can also be found on the NMDC YouTube channel (www.youtube.com/@microbiomedata) and within publicly available NMDC training materials [29], [50].

3.4. User research & usability testing

Since the launch of NMDC EDGE in May 2021, more than 1580 users have collectively run over 6130 workflows. For NMDC EDGE, we employ a user-centered design methodology to collect feedback from the research community, leading to iterative and ongoing improvement to ensure we are meeting the needs of the microbiome research community. Rolling feedback can be submitted via our support email (support@microbiomedata.org) or the feedback form provided on the NMDC EDGE homepage. Beta testing is an important step in releasing a workflow to the NMDC EDGE interface. Researchers who have contributed NMDC EDGE feedback or participated in beta testing are acknowledged within this publication. The initial round of beta testing, conducted in 2021, resulted in 49 action items, of which only two were determined to be infeasible by the team. The remaining feedback has been implemented (42 of 49; 86%) or is in the process of implementation (5 of 49; 10%). For example, user feedback drove updates to the workflow input pages to clarify the types of acceptable input data, led to the inclusion of clearer avenues for users to reach out to the team for support, allowed the team to identify and fix memory and workflow issues, and led to the improvement of the tutorials and user guides. The most recent round of beta testing, conducted in 2023 (form provided in Supplementary Table 1), resulted in 63 insights and 40 action items. This feedback is actively being discussed, addressed, and implemented.

4. Discussion

The NMDC EDGE resource provides bioinformatics workflows using a flexible, modular design for the microbiome research community. These workflows are production-quality, meaning that they are used by DOE user facility production pipelines that routinely process thousands of datasets. The user-friendly interface has allowed for broader adoption by researchers from various backgrounds to process diverse microbiome datasets. NMDC EDGE supports access to the standardized bioinformatics workflows used to process microbiome data available in the NMDC Data Portal [15]. With the standardized workflows and workflow parameters, researchers can more readily compare their data processed through NMDC EDGE with other datasets available via the NMDC Data Portal, making these datasets more FAIR, particularly with regard to their interoperability and reusability.

The standardization and restrictions on customization of workflow parameters allow for more accurate and meaningful comparisons of datasets; however, there are inherent limitations with these features. The standardized parameters of the NMDC workflows may not be optimal for all samples, sample types, or research questions. Users can retrieve the publicly available NMDC workflows and customize their local runs, but NMDC EDGE has locked down the workflows to allow for seamless comparisons with the projects available on the NMDC Data Portal. Another limitation of the NMDC EDGE resource is that only a limited number of initial workflows are available, which may not satisfy all use cases for the broader microbiome research community performing multi-omics and integrative analyses.

New features and improvements are planned for future NMDC EDGE releases. Updates to the existing workflows and their underlying tools and databases will follow a coordinated release schedule with the JGI and EMSL user facility workflows. All workflow and software updates are versioned and tracked in both the software release notes (https://github.com/microbiomedata/nmdc-edge/releases) and the NMDC workflow containers. Currently, NMDC workflows are tested manually. We will implement automated testing and adopt best practices from the software industry, such as continuous integration and continuous delivery/deployment. A new workflow for long-read sequencing data will be added to reflect the growing utilization of this technology for microbiome sequencing [19], [12], [1]. An option for reference-free metaproteome analysis will be incorporated for users to assess the protein composition in their samples without the need for matched metagenomes. A gas chromatography mass spectrometry metabolomics workflow, based on the one in use at EMSL, will be added to NMDC EDGE. Annual rounds of beta testing, user research, and usability testing for the resource will be conducted to continuously improve the workflows, user interface, outputs, and overall user experience. We will work with researchers to successfully apply these workflows to their research, and we aim to make this resource as accessible as possible, regardless of researcher background, location, expertise, or computational resource availability.

5. Conclusions

Despite the rapid growth in microbiome research, many barriers exist for multi-omics data processing and standardization. The NMDC EDGE resource provides production-quality bioinformatics workflows through an intuitive web interface, with the flexibility to support updates and the addition of new workflows in the future. Overall, NMDC EDGE serves as a valuable resource to the microbiome research community to facilitate improved data standardization and interoperability.

CRediT authorship contribution statement

Yuri E. Corilo: Writing – review & editing, Supervision, Software, Methodology. Grant Fujimoto: Software, Resources, Methodology. Bin Hu: Writing – review & editing, Writing – original draft, Supervision, Software, Resources, Methodology, Conceptualization. Eric Cavanna: Writing – review & editing, Software, Resources, Methodology. Emiley A. Eloe-Fadrosh: Writing – review & editing, Writing – original draft, Supervision, Project administration, Funding acquisition, Conceptualization. Alicia Clum: Writing – review & editing, Writing – original draft, Validation, Supervision, Software, Resources, Methodology. Lee Ann McCue: Writing – review & editing, Writing – original draft, Supervision, Project administration. Michal Babinski: Writing – original draft, Validation, Software, Methodology. Christopher Mungall: Writing – review & editing, Supervision, Project administration. Shane Canon: Software, Resources, Methodology. Setareh Sarrafan: Writing – review & editing, Supervision, Project administration. Yan Xu: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Conceptualization. Shreyas Cholia: Writing – review & editing, Supervision, Software, Resources, Project administration, Methodology. Mark C. Flynn: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Methodology, Funding acquisition, Conceptualization. Migun Shakya: Writing – review & editing, Writing – original draft, Validation, Software, Resources, Methodology. Montana Smith: Writing – review & editing, Validation, Supervision. Julia M. Kelliher: Writing – review & editing, Writing – original draft, Visualization, Supervision, Software, Resources, Project administration, Conceptualization. Francisca Rodriguez: Writing – review & editing, Writing – original draft, Visualization, Resources. 
Simon Roux: Writing – review & editing, Writing – original draft, Validation, Software, Resources, Methodology. Samuel Purvine: Writing – review & editing, Supervision, Resources, Methodology. Paul Piehowski: Writing – review & editing, Writing – original draft, Software, Resources, Methodology. Kaelan Prime: Writing – review & editing. Chien-Chi Lo: Writing – review & editing, Writing – original draft, Validation, Software, Resources, Methodology. Wendi Lynch: Writing – review & editing, Project administration. Po-E Li: Writing – review & editing, Writing – original draft, Software, Resources, Methodology. Valerie Li: Writing – review & editing, Software, Resources, Methodology. Leah Y.D. Johnson: Writing – review & editing, Writing – original draft, Visualization, Software, Resources, Investigation. Kaitlyn J. Li: Writing – review & editing, Writing – original draft, Visualization, Validation, Software, Resources, Methodology. Cameron Giberson: Software, Resources, Methodology. Patrick S.G. Chain: Writing – review & editing, Writing – original draft, Supervision, Software, Resources, Project administration, Methodology, Conceptualization.

Declaration of Competing Interest

The authors do not have any conflicts of interest to disclose.

Acknowledgments

We would like to thank Amy Chen, Kaitlyn Creamer, Cassandra Ettinger, Sarai Finks, Buck Hanson, Judson Hervey, Alex Honeyman, Marcel Huntemann, Matthew Kellom, Marie Kroeger, Justine Macalindong, Kevin Myers, Marijke Rittmann, Josué Rodríguez-Ramos, Jason Rothman, Brett Youtsey, and Ying Zhang for their valuable beta-testing feedback. The work conducted by the National Microbiome Data Collaborative (https://ror.org/05cwx3318) is supported by the Genomic Science Program in the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research (BER) under contract numbers DE-AC02-05CH11231 (LBNL), 89233218CNA000001 (LANL), and DE-AC05-76RL01830 (PNNL). The work used Expanse at SDSC through allocation MCB180107: EDGE Bioinformatics Science Gateway from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

Footnotes

Appendix A. Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2024.09.018.

Contributor Information

Julia M. Kelliher, Email: jkelliher@lanl.gov.

Patrick S.G. Chain, Email: pchain@lanl.gov.

Appendix A. Supplementary material

Supplementary material

mmc1.xlsx (12.8 KB, xlsx)

References

1. Agustinho D.P., Fu Y., Menon V.K., Metcalf G.A., Treangen T.J., Sedlazeck F.J. Unveiling microbial diversity: harnessing long-read sequencing technology. Nat Methods. 2024:1–13. doi: 10.1038/s41592-024-02262-1.
2. Arkin A.P., Cottingham R.W., Henry C.S., Harris N.L., Stevens R.L., Maslov S., et al. KBase: The United States department of energy systems biology knowledgebase. Nat Biotechnol. 2018;36:566–569. doi: 10.1038/nbt.4163.
3. BBMap. SourceForge 2023. https://sourceforge.net/projects/bbmap/ (accessed June 19, 2024).
4. Bland C., Ramsey T.L., Sabree F., Lowe M., Brown K., Kyrpides N.C., et al. CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinforma. 2007;8:209. doi: 10.1186/1471-2105-8-209.
5. Boerner T.J., Deems S., Furlani T.R., Knuth S.L., Towns J. Practice and Experience in Advanced Research Computing. Association for Computing Machinery; New York, NY, USA: 2023. ACCESS: Advancing Innovation: NSF’s Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support; pp. 173–176.
6. Bowers R.M., Kyrpides N.C., Stepanauskas R., Harmon-Smith M., Doud D., Reddy T.B.K., et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol. 2017;35:725–731. doi: 10.1038/nbt.3893.
7. Bushmanova E., Antipov D., Lapidus A., Prjibelski A. rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data. GigaScience. 2019;8 doi: 10.1093/gigascience/giz100.
8. Camargo A.P., Roux S., Schulz F., Babinski M., Xu Y., Hu B., et al. Identification of mobile genetic elements with geNomad. Nat Biotechnol. 2023:1–10. doi: 10.1038/s41587-023-01953-y.
9. Chan P.P., Lowe T.M. tRNAscan-SE: Searching for tRNA genes in genomic sequences. Methods Mol Biol. 2019;1962:1–14. doi: 10.1007/978-1-4939-9173-0_1.
10. Chaumeil P.-A., Mussig A.J., Hugenholtz P., Parks D.H. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics. 2020;36:1925–1927. doi: 10.1093/bioinformatics/btz848.
11. Chen I.-M.A., Chu K., Palaniappan K., Ratner A., Huang J., Huntemann M., et al. The IMG/M data management and analysis system v.7: content updates and new features. Nucleic Acids Res. 2023;51:D723–D732. doi: 10.1093/nar/gkac976.
12. Chen L., Zhao N., Cao J., Liu X., Xu J., Ma Y., et al. Short- and long-read metagenomics expand individualized structural variations in gut microbiomes. Nat Commun. 2022;13:3175. doi: 10.1038/s41467-022-30857-9.
13. Clum A., Huntemann M., Bushnell B., Foster B., Foster B., Roux S., et al. DOE JGI Metagenome Workflow. mSystems. 2021;6 doi: 10.1128/mSystems.00804-20.
14. Corilo Y.E., Kew W.R., McCue L.A. EMSL-Computing/CoreMS: CoreMS 1.0.0 2021. https://doi.org/10.5281/zenodo.4641553.
15. Eloe-Fadrosh E.A., Ahmed F., Anubhav, Babinski M., Baumes J., Borkum M., et al. The National Microbiome Data Collaborative Data Portal: an integrated multi-omics microbiome data resource. Nucleic Acids Res. 2022;50:D828–D836. doi: 10.1093/nar/gkab990.
16. Finn R.D., Clements J., Eddy S.R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011;39:W29–W37. doi: 10.1093/nar/gkr367.
17. Freitas T.A.K., Li P.-E., Scholz M.B., Chain P.S.G. Accurate read-based metagenome characterization using a hierarchical suite of unique signatures. Nucleic Acids Res. 2015;43 doi: 10.1093/nar/gkv180.
18. Frith M.C., Wan R., Horton P. Incorporating sequence quality data into alignment improves DNA read mapping. Nucleic Acids Res. 2010;38 doi: 10.1093/nar/gkq010.
19. Gehrig J.L., Portik D.M., Driscoll M.D., Jackson E., Chakraborty S., Gratalo D., et al. Finding the right fit: evaluation of short-read and long-read sequencing approaches to maximize the utility of clinical microbiome data. Microb Genom. 2022;8 doi: 10.1099/mgen.0.000794.
20. Hu B., Canon S., Eloe-Fadrosh E.A., Anubhav, Babinski M., Corilo Y., et al. Challenges in bioinformatics workflows for processing microbiome omics data at scale. Front Bioinform. 2022;1 doi: 10.3389/fbinf.2021.826370.
21. https://github.com/ProteoWizard/pwiz.
22. Hyatt D., Chen G.-L., LoCascio P.F., Land M.L., Larimer F.W., Hauser L.J. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinforma. 2010;11:119. doi: 10.1186/1471-2105-11-119.
23. Jain C., Rodriguez-R L.M., Phillippy A.M., Konstantinidis K.T., Aluru S. High throughput ANI analysis of 90 K prokaryotic genomes reveals clear species boundaries. Nat Commun. 2018;9:5114. doi: 10.1038/s41467-018-07641-9.
24. https://github.com/PNNL-Comp-Mass-Spec/MASIC.
25. Jansson J.K., Baker E.S. A multi-omic future for microbiome studies. Nat Microbiol. 2016;1:1–3. doi: 10.1038/nmicrobiol.2016.49.
26. Kadam P., Goplani A., Mattoo S., Gupta S., Amrutkar D., Dhanke J., et al. Introduction to MERN stack & comparison with previous technologies. Eur Chem Bull. 2023;12:14382–14386. doi: 10.48047/ecb/2023.12.si4.1300.
27. Kang D.D., Li F., Kirton E., Thomas A., Egan R., An H., et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 2019;7 doi: 10.7717/peerj.7359.
28. Katz K., Shutov O., Lapoint R., Kimelman M., Brister J.R., O’Sullivan C. The sequence read archive: a decade more of explosive growth. Nucleic Acids Res. 2022;50:D387–D390. doi: 10.1093/nar/gkab1053.
29. Kelliher J., Rodriguez F., Johnson L., Ockert I., Roux S., Eloe-Fadrosh E., et al. 2023 NMDC Ambassador Presentations 2023. doi: 10.5281/zenodo.10015793.
30. Kessner D., Chambers M., Burke R., Agus D., Mallick P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics. 2008;24:2534–2536. doi: 10.1093/bioinformatics/btn323.
31. Kim S., Pevzner P.A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat Commun. 2014;5:5277. doi: 10.1038/ncomms6277.
32. Kim D., Song L., Breitwieser F.P., Salzberg S.L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016;26:1721–1729. doi: 10.1101/gr.210641.116.
33. Kyrpides N.C., Eloe-Fadrosh E.A., Ivanova N.N. Microbiome data science: understanding our microbial planet. Trends Microbiol. 2016;24:425–427. doi: 10.1016/j.tim.2016.02.011.
34. Leinonen R., Sugawara H., Shumway M., on behalf of the International Nucleotide Sequence Database Collaboration. The sequence read archive. Nucleic Acids Res. 2011;39:D19–D21. doi: 10.1093/nar/gkq1019.
35. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
36. Li P.-E., Lo C.-C., Anderson J.J., Davenport K.W., Bishop-Lilly K.A., Xu Y., et al. Enabling the democratization of the genomics revolution with a fully integrated web-based bioinformatics platform. Nucleic Acids Res. 2017;45:67–80. doi: 10.1093/nar/gkw1027.
37. Lo C.-C., Shakya M., Connor R., Davenport K., Flynn M., Gutiérrez A.M., y, et al. EDGE COVID-19: a web platform to generate submission-ready genomes from SARS-CoV-2 sequencing efforts. Bioinformatics. 2022;38:2700–2704. doi: 10.1093/bioinformatics/btac176.
38. Lomsadze A., Gemayel K., Tang S., Borodovsky M. Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes. Genome Res. 2018;28:1079–1089. doi: 10.1101/gr.230615.117.
39. Matsen F.A., Kodner R.B., Armbrust E.V. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinforma. 2010;11:538. doi: 10.1186/1471-2105-11-538.
40. Monroe M.E., Shaw J.L., Daly D.S., Adkins J.N., Smith R.D. MASIC: A software program for fast quantitation and flexible visualization of chromatographic profiles from detected LC–MS(/MS) features. Comput Biol Chem. 2008;32:215–217. doi: 10.1016/j.compbiolchem.2008.02.006.
41. MSGFPlus/msgfplus 2024.
42. Nurk S., Meleshko D., Korobeynikov A., Pevzner P.A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017;27:824–834. doi: 10.1101/gr.213959.116.
43. Nawrocki E.P., Eddy S.R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29:2933–2935. doi: 10.1093/bioinformatics/btt509.
44. Nayfach S., Camargo A.P., Schulz F., Eloe-Fadrosh E., Roux S., Kyrpides N.C. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol. 2021;39:578–585. doi: 10.1038/s41587-020-00774-7.
45. Nesvizhskii A.I., Keller A., Kolker E., Aebersold R. A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry. Anal Chem. 2003;75:4646–4658. doi: 10.1021/ac0341261.
46. Ondov B.D., Bergman N.H., Phillippy A.M. Interactive metagenomic visualization in a Web browser. BMC Bioinforma. 2011;12:385. doi: 10.1186/1471-2105-12-385.
47. Ondov B.D., Treangen T.J., Melsted P., Mallonee A.B., Bergman N.H., Koren S., et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17:132. doi: 10.1186/s13059-016-0997-x.
48. Parks D.H., Imelfort M., Skennerton C.T., Hugenholtz P., Tyson G.W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25:1043–1055. doi: 10.1101/gr.186072.114.
49. Price M.N., Dehal P.S., Arkin A.P. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLOS ONE. 2010;5 doi: 10.1371/journal.pone.0009490.
50. Rodríguez-Ramos J., Kelliher J., Rodriguez F., Johnson L., Eloe-Fadrosh E. Standardized Workflows and NMDC EDGE Training: Spanish Translation 2023. doi: 10.5281/zenodo.10014901.
51. Smith D.R. Buying in to bioinformatics: an introduction to commercial sequence analysis software. Brief Bioinforma. 2015;16:700–709. doi: 10.1093/bib/bbu030.
52. Swetnam T.L., Antin P.B., Bartelme R., Bucksch A., Camhy D., Chism G., et al. CyVerse: Cyberinfrastructure for open science. PLOS Comput Biol. 2024;20 doi: 10.1371/journal.pcbi.1011270.
53. Voss K., Gentry J., Van Der Auwera G. Full-stack genomics pipelining with GATK4 + WDL + Cromwell. F1000Research. 2017;6. doi: 10.7490/f1000research.1114631.1.
54. Wang M., Marshall A.G. Mass shifts induced by negative frequency peaks in linearly polarized Fourier transform ion cyclotron resonance signals. Int J Mass Spectrom Ion Process. 1988;86:31–51. doi: 10.1016/0168-1176(88)80053-3.
55. Wilkinson M.D., Dumontier M., Aalbersberg Ij.J., Appleton G., Axton M., Baak A., et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3 doi: 10.1038/sdata.2016.18.
56. Wood D.E., Lu J., Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20:257. doi: 10.1186/s13059-019-1891-0.
57. Wood-Charlson E.M., Anubhav, Auberry D., Blanco H., Borkum M.I., Corilo Y.E., et al. The National Microbiome Data Collaborative: enabling microbiome science. Nat Rev Microbiol. 2020;18:313–314. doi: 10.1038/s41579-020-0377-0.
