Abstract
The National Microbiome Data Collaborative (NMDC) Data Portal (https://data.microbiomedata.org) supports microbiome multi-omics data exploration and access through an integrated, distributed data framework aligned with the FAIR (Findable, Accessible, Interoperable and Reusable) data principles (1). The NMDC Data Portal currently hosts 10.2 terabytes of multi-omics microbiome data, spanning five data types (metagenomes, metatranscriptomes, metaproteomes, metabolomes, and natural organic matter characterizations), generated at two Department of Energy User Facilities, the Joint Genome Institute (JGI) at Lawrence Berkeley National Laboratory (LBNL) and the Environmental Molecular Systems Laboratory (EMSL) at Pacific Northwest National Laboratory (PNNL). A flexible data schema (https://github.com/microbiomedata/nmdc-schema) leveraging community-driven standards underpins how data is managed and integrated. Annotated multi-omic data products are produced by the NMDC workflows and linked through common biosamples to enable search capabilities based on environmental context, instrumentation, and functional attributes. As a pilot system, the NMDC Data Portal offers download capabilities and several search components, including interactive geographic visualization of samples; environmental classification distribution visualized through an interactive Sankey diagram; time-series slider to select longitudinal samples of interest; and an upset plot displaying the number of multi-omics data generated from the same biosample within a study.
INTRODUCTION
The growth of microbiome data, and in particular multi-omics data, has been coupled with a rapidly growing suite of approaches to synthesize diverse data streams into meaningful ecological and microbe-host dynamic processes (2–4). As sequencing technologies have advanced ahead of other omics technologies (e.g. proteomics and metabolomics), the largest remaining bottleneck for analysis and interpretation of microbiome data rests in the ability to analyze them in a robust, integrated, and standardized manner. Ongoing efforts have already generated petabytes of data from a staggeringly diverse array of microbial habitats on Earth, creating a rich data resource for addressing grand challenges in the areas of bioenergy, environment, agriculture and health. Unfortunately, these efforts have, to date, not been complemented by the development of similarly cross-cutting and integrative solutions for data capture, storage, curation, analysis and sharing with associated community coordination (5).
The National Microbiome Data Collaborative (NMDC) is a new US-based pilot initiative launched in June 2019 to support microbiome data exploration and discovery through a collaborative, integrative data science ecosystem (6). The primary goal is to democratize microbiome data science by providing access to multi-omics microbiome data to support reproducible, cross-study analyses aligned with the FAIR data principles (1). To demonstrate the utility of linking across multi-omics microbiome data, the NMDC team has initially focused on projects funded through the Facilities Integrating Collaborations for User Science (FICUS) Program, a partnership between two DOE User Facilities, the JGI at LBNL and EMSL at PNNL. These projects span diverse terrestrial and aquatic environments, and address questions related to microbially-mediated carbon transformations, biogeochemical cycling, and plant-microbiome interactions. Currently, metagenome and metatranscriptome data are made available through maintained resources at the JGI, including the JGI Genome Portal (https://genome.jgi.doe.gov/portal), the Genomes OnLine Database (GOLD, https://gold.jgi.doe.gov) (7), and the Integrated Microbial Genomes and Microbiomes (IMG/M, https://img.jgi.doe.gov) (8). Metaproteome, metabolome and natural organic matter characterization data generated at EMSL are made available through the MyEMSL/NEXUS system (https://search.emsl.pnnl.gov). While the primary research teams are aware of data generated for their own projects across these two User Facilities, it remains challenging for the broader research community to search associated data and link complementary data derived from the same biosample or project. Further, publications resulting from data generated at the JGI and EMSL reference a suite of repositories (e.g., PRIDE (9), MetaboLights (10), INSDC (11)), so the data are dispersed across resources, which makes meta-analyses of multi-omics data difficult.
Here, we describe the pilot NMDC Data Portal (https://data.microbiomedata.org) that provides a resource for consistently processed multi-omics data that is integrated to enable search, access, analysis and download. The pilot refers to a proof-of-concept infrastructure developed through a user-centered design process. Open-source bioinformatics workflows are used to process raw multi-omics data and produce interoperable and reusable annotated data from metagenome, metatranscriptome, metaproteome, metabolome, and natural organic matter characterizations. The NMDC Data Portal offers several search and navigation components, and data can be downloaded through the graphical user interface using an ORCiD (https://orcid.org/) authentication, with associated download metrics. All multi-omics data are available under a Creative Commons 4.0 license, which enables public use with attribution, as outlined in the NMDC Data Use Policy (https://microbiomedata.org/nmdc-data-use-policy). This first iteration of the NMDC Data Portal was released in March 2021, and will continue to expand its data hosting and functionality on a quarterly basis. Associated release notes and updated user guides will accompany each quarterly release.
RESOURCE CONTENT
Multi-omics data from diverse environments
The NMDC Data Portal contains 10.2 terabytes of data associated with 638 biosamples, 7 studies and 5 data types from a breadth of environmental microbiomes, spanning river sediments, subsurface shale carbon reservoirs, plant-microbe associations, and temperate and tropical soils (Table 1). JGI’s microbiome data management systems, IMG/M (8) and GOLD (7), serve as the main integration points for metagenome and metatranscriptome projects and associated metadata. EMSL’s data management system MyEMSL/NEXUS (https://search.emsl.pnnl.gov) hosts a search and retrieval interface that provides access to metaproteome, metabolome, and natural organic matter characterization data and associated metadata. Development of the NMDC Data Portal resulted in establishing infrastructure coordination for sample management and data hosting between the JGI and EMSL, which is important, as much of the experimental data are derived from the same biosamples. Thus, the NMDC Data Portal supports sample tracking, integration, and reuse through identifiers and harmonized metadata (12). As the NMDC Data Portal is a pilot infrastructure, incoming projects for which study information and curated environmental metadata become available are first validated and loaded with a flag (‘Omics data coming soon’) before processed instrumentation data is integrated into the portal.
Table 1.
Omics data types and analysis products available through the NMDC Data Portal. Currently, there are three studies with available processed and integrated data: Riverbed sediment microbial communities from the Columbia River consisting of 85 biosamples, 294 organic matter characterizations, 50 metagenomes, 38 metaproteomes, and 34 metabolomes (36); Soil microbial communities from the East River watershed near Crested Butte, Colorado consisting of 53 biosamples, 652 organic matter characterizations, 48 metagenomes, and 45 metatranscriptomes (37); and Deep subsurface shale carbon reservoir microbial communities from Ohio and West Virginia consisting of 25 biosamples and 25 metagenomes (38)
| Omics Data Type | Analysis products |
|---|---|
| Metagenome | Read QC: QC Statistics, Filtered Sequencing Reads |
| | Read-based Analysis: Krona Plot, Classification Report, Taxonomic Classification |
| | Assembly: Assembly Coverage Stats, Assembly Contigs, Assembly Scaffolds, Assembly AGP, Assembly Coverage BAM |
| | Annotation: Annotation Enzyme Commission, Annotation KEGG Orthology, Functional Annotation GFF, Annotation Amino Acid FASTA, Structural Annotation GFF |
| | Binning: Metagenome Bins, CheckM Statistics |
| Metatranscriptome | Read QC: QC Statistics, Filtered Sequencing Reads |
| | Annotation: Annotation Enzyme Commission, Annotation KEGG Orthology, Functional Annotation GFF, Annotation Amino Acid FASTA, Structural Annotation GFF |
| Metaproteome | Unfiltered Metaproteomics results, Filtered peptide results, Filtered protein results, Aggregate workflow statistics |
| Metabolome | GC-MS Metabolomics Results |
| Natural Organic Matter | FT ICR-MS analysis results |
Open-source bioinformatics workflows for processing raw multi-omics data (e.g. metagenome, metatranscriptome, metaproteome, metabolome, and natural organic matter characterization data) have been developed based on production-quality workflows at the JGI and EMSL. These workflows form the basis for producing interoperable and reusable annotated data products (https://nmdc-workflow-documentation.readthedocs.io/en/latest).
Metagenomes
Illumina-sequenced shotgun metagenome data undergo pre-processing, error correction, assembly, structural and functional annotation, and binning leveraging the JGI’s production pipelines (13), along with an additional read-based taxonomic analysis component. Standardized outputs from the read QC, read-based analysis, assembly, annotation, and binning are available for search and download for 123 metagenomes on the NMDC Data Portal.
Metatranscriptomes
Illumina-sequenced shotgun reads from cDNA libraries undergo pre-processing and error correction as described above for the metagenome workflow, with additional steps to filter ribosomal reads. High-quality reads are then assembled into transcripts using MEGAHIT (14) and annotated using the annotation module described in the metagenome workflow. The high-quality reads are mapped back to the annotated transcripts using HISAT2 (15), the number of reads mapped per feature is calculated using featureCounts (16), and RPKM values per feature are calculated using edgeR (17). Results from read QC, assembly, and annotation are available for search and download for 45 metatranscriptomes on the NMDC Data Portal.
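To make the per-feature RPKM step concrete, the sketch below implements the textbook definition (reads per kilobase of feature per million mapped reads) in plain Python. It is not the edgeR implementation, which may additionally apply library-size normalization factors; the function name `rpkm` and the example counts are assumptions introduced here for illustration.

```python
def rpkm(read_count: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """Textbook RPKM: reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    # (reads / (length / 1e3)) / (total / 1e6) == reads * 1e9 / (length * total)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Hypothetical per-feature counts: (reads mapped, feature length in bp).
    counts = {"feature_a": (500, 1000), "feature_b": (250, 2000)}
    total_mapped = 1_000_000
    for name, (reads, length) in counts.items():
        print(name, rpkm(reads, length, total_mapped))
```

Dividing by feature length before comparing features matters because a longer transcript collects more reads at the same expression level.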
Metaproteomes
Data-dependent mass spectrometry raw data files are first converted to mzML, using MSConvert (18). Peptide identification is achieved using MSGF+ and the associated metagenomic information in the FASTA file, and peptide identification false discovery rate is controlled using a decoy database approach (19). Intensity information is extracted using MASIC (20) and combined with protein information. Protein annotation information is obtained from the associated metagenome annotation output. Standardized outputs for quality control, and peptide and protein-level quantitative data are available for search and download for 38 metaproteomes on the NMDC Data Portal.
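The decoy database approach to false discovery rate control can be sketched as follows. This is one common estimator (decoy hits divided by target hits above a score threshold), not necessarily the exact procedure used in the NMDC workflow; the function name `decoy_fdr`, the convention that a higher score is better, and the example scores are all assumptions.

```python
from typing import List, Tuple


def decoy_fdr(psms: List[Tuple[float, bool]], threshold: float) -> float:
    """Estimate FDR among peptide-spectrum matches scoring at or above threshold.

    Each item is (score, is_decoy). Higher scores are assumed to be better.
    """
    accepted = [is_decoy for score, is_decoy in psms if score >= threshold]
    decoys = sum(accepted)             # matches to the decoy database
    targets = len(accepted) - decoys   # matches to the real (target) database
    if targets == 0:
        return 0.0
    return decoys / targets


if __name__ == "__main__":
    # Hypothetical matches: (score, is_decoy).
    example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, True)]
    print(decoy_fdr(example, threshold=7.0))
```

In practice the threshold is tightened until the estimated rate falls below a chosen cutoff.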
Metabolomes
The gas chromatography-mass spectrometry (GC-MS) based metabolomics workflow (metaMS), developed by leveraging EMSL’s CoreMS mass spectrometry software framework, allows targeted and semi-targeted analysis of metabolomics data (21). The raw data is parsed into CoreMS data structures and undergoes all the steps of signal processing (signal noise reduction, m/z based chromatogram peak deconvolution, abundance threshold calculation, peak picking) and molecular identification, including the molecular search using a metabolite standard compound library, spectral similarity calculation, and similarity score calculation (22), all in a single step. The putative metabolite annotation data is available to download for 34 metabolomes on the NMDC Data Portal. Data-dependent liquid chromatography–mass spectrometry (LC–MS) based workflows are currently under development. Additionally, it should be noted that all available data derives from exploratory, untargeted analysis and is semi-quantitative.
Natural organic matter characterization (NOM)
Direct Infusion Fourier Transform mass spectrometry (DI FT-MS) data undergoes signal processing and molecular formula assignment leveraging EMSL’s CoreMS framework (21). Raw time domain data is transformed into the m/z domain using the Fourier Transform and the Ledford equation (23). Data is denoised, followed by peak picking and recalibration using an external reference list of known compounds, and then searched against a dynamically generated molecular formula library with a defined molecular search space. The confidence scores for all the molecular formula candidates are calculated based on the mass accuracy and fine isotopic structure, and the candidate with the highest score is assigned as the best match. The molecular formula characterization table is available to download for 946 natural organic matter characterizations on the NMDC Data Portal. Importantly, natural organic matter characterizations represent the largest number of available omics data sets within the NMDC Data Portal, yet they represent multiple associated runs deriving from the same biosample and relate to laboratory-based extraction protocols (e.g. extracted via chloroform, methanol, or water fractionation). Additional details on sequential extraction of organic matter are described by Tfaily and colleagues (24).
Metadata standards for biosamples
Metadata that contextualize physical samples, including sample collection, sample preparation, data processing methods, and data products, are essential for the interpretation of measurements or any data produced from a biological sample (25). For standardizing the sets of fields that describe physical samples, the NMDC team has adopted the Genomic Standards Consortium (GSC) Minimum Information about any (x) Sequence (MIxS) templates (26). This provides a standard data dictionary of sample descriptors (e.g. location, biome, altitude, depth) organized into seventeen environmental packages (https://gensc.org/mixs) for sequence data. The NMDC team has mapped fields used to describe samples in the GOLD database to MIxS version 5 (v5) elements. In addition, we are adopting the MIxS standards for sequence data types (e.g. sequencing method, PCR primers and conditions), and are leveraging standards and controlled vocabularies developed by the Proteomics Standards Initiative (27), the National Cancer Institute's Proteomic Data Commons (https://pdc.cancer.gov/data-dictionary/dictionary.html), and the Metabolomics Standards Initiative (28) for mass spectrometry data types (e.g., ionization mode, mass resolution, scan rate).
We have engaged in two areas of curation in order to best support search capabilities and to maximize interoperability of heterogeneous data sets. First, for each of the seven studies’ biosamples, we have applied a manual curation process designed to reconcile the biosample metadata collected separately at the JGI and EMSL, in order to integrate the data produced by each Facility. This effort is coordinated with a curation process that allows the research teams to update and correct metadata about the biosamples that were sent to the JGI and EMSL, as well as to provide additional metadata. Second, extensive manual curation work has also been undertaken at EMSL to accurately associate parent and child samples generated when biosamples were aliquoted in order to generate diverse data sets, and to identify the array of various data types that are most frequently generated from microbiomes using EMSL instrumentation.
In collaboration with the GOLD (7) and the Environment Ontology (EnvO) (29) teams, we devised a system for mapping GOLD Ecosystem Classification path descriptors to EnvO. A GOLD Ecosystem Classification path is a hierarchical 5-place system that uses terms at different levels of granularity to describe ecosystem classifications. This contrasts with how ecosystem classifications are characterized in MIxS, which uses a 3-place system with terms drawn from EnvO. We have mapped all distinct GOLD Ecosystem Classification path descriptors used for 638 biosamples in the seven available studies to EnvO. Further details of the curation process, along with 40 619 biosamples with curated MIxS-EnvO triad fields, are available from Mukherjee et al. (7).
DATA STORAGE INFRASTRUCTURE
Data schema and architecture
The NMDC team has developed a data schema for representing studies, samples, data objects and relationships amongst these entities (https://github.com/microbiomedata/nmdc-schema). The NMDC schema is defined using the Linked data Modeling Language (LinkML, https://linkml.io). LinkML allows us to easily generate Python classes used for Extract-Transform-Load (ETL) processes, and schemas against which we validate the ETL output. For example, our ETL process ingests metadata from the JGI and EMSL. Using the Python classes generated from the LinkML schema, we transform the metadata into JavaScript Object Notation (JSON) structured documents, and these JSON documents are validated against the NMDC JSON schema that is auto generated by LinkML.
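As a rough illustration of the generate-classes, serialize and validate loop described above, the following sketch uses only the Python standard library. The `BiosampleRecord` class, its fields, the `BIOSAMPLE_SCHEMA` fragment and the `validate` helper are all hypothetical stand-ins: they are not taken from the nmdc-schema package, and the small validator checks only required fields and basic types rather than implementing full JSON Schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class BiosampleRecord:
    """Hypothetical stand-in for a class generated from a LinkML schema."""
    id: str
    name: str
    depth_m: Optional[float] = None


# Hypothetical schema fragment written in JSON Schema style.
BIOSAMPLE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "required": ["id", "name"],
    "properties": {
        "id": {"type": "string"},
        "name": {"type": "string"},
        "depth_m": {"type": ["number", "null"]},
    },
}


def _matches(value: Any, type_name: str) -> bool:
    """Check a value against one of the three type names used above."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "null":
        return value is None
    return False


def validate(doc: Any, schema: Dict[str, Any]) -> List[str]:
    """Return a list of error messages; an empty list means the document passed."""
    if not isinstance(doc, dict):
        return ["document is not a JSON object"]
    errors = [f"missing required field: {key}"
              for key in schema.get("required", []) if key not in doc]
    for key, rule in schema.get("properties", {}).items():
        if key not in doc:
            continue
        allowed = rule["type"] if isinstance(rule["type"], list) else [rule["type"]]
        if not any(_matches(doc[key], t) for t in allowed):
            errors.append(f"field {key!r} should be of type {allowed}")
    return errors


def to_json(record: BiosampleRecord) -> str:
    """Transform step: serialize a typed record into a JSON document."""
    return json.dumps(asdict(record))


if __name__ == "__main__":
    record = BiosampleRecord(id="sample-1", name="example sediment sample", depth_m=0.15)
    print(validate(json.loads(to_json(record)), BIOSAMPLE_SCHEMA))
```

Keeping the typed classes and the validation schema derived from one source definition is what prevents the two from drifting apart.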
In order to easily distribute the NMDC schema for use by software developers, we deploy the NMDC schema as a Python library on the PyPI platform (https://pypi.org/project/nmdc-schema). For instance, using this library, a developer can validate a JSON document against the schema either by executing the command ‘validate-json-schema -i <JSON document>’ in a terminal or by accessing the Python modules within the nmdc-schema library. Moreover, the PyPI platform enables us to easily manage schema changes and deploy these changes to the research community. A developer can either use the most recent version of the schema or use a previous version.
As noted above, metadata is extracted from the respective data management systems at the JGI (e.g. GOLD) and EMSL (MyEMSL/NEXUS), integrated based on the NMDC data schema associations, and transformed into JSON documents. These JSON documents are stored in a MongoDB database (https://www.mongodb.com) and validated against the NMDC JSON schema. The Dagster framework (https://dagster.io) is used to organize and orchestrate the ETL process (https://github.com/microbiomedata/nmdc-runtime). NMDC metadata as JSON documents are then transformed into a relational model and stored in a PostgreSQL database (https://www.postgresql.org) optimized for use as the persistence layer beneath a custom search application (https://data.microbiomedata.org). The search application server API is implemented in Python using the FastAPI framework (https://fastapi.tiangolo.com). The client is written in JavaScript using the Vue.js framework (https://vuejs.org) and additional JS libraries.
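The document-to-relational step can be pictured with a small sketch. Here the standard-library `sqlite3` module stands in for PostgreSQL so that the example is self-contained; the table layout, column names and example documents are hypothetical and do not reflect the actual NMDC relational model.

```python
import sqlite3
from typing import Any, Dict, Iterable, List, Tuple


def load_documents(conn: sqlite3.Connection, docs: Iterable[Dict[str, Any]]) -> None:
    """Flatten nested study documents into two related tables."""
    conn.execute("CREATE TABLE study (id TEXT PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE biosample ("
        "id TEXT PRIMARY KEY, study_id TEXT REFERENCES study(id), depth_m REAL)"
    )
    for doc in docs:
        conn.execute("INSERT INTO study VALUES (?, ?)", (doc["id"], doc["name"]))
        for sample in doc.get("biosamples", []):
            # The nesting in the document becomes a foreign key in the table.
            conn.execute(
                "INSERT INTO biosample VALUES (?, ?, ?)",
                (sample["id"], doc["id"], sample.get("depth_m")),
            )
    conn.commit()


def biosamples_per_study(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Example of the aggregate query a search application might issue."""
    return conn.execute(
        "SELECT s.name, COUNT(b.id) FROM study s "
        "LEFT JOIN biosample b ON b.study_id = s.id "
        "GROUP BY s.id ORDER BY s.id"
    ).fetchall()


if __name__ == "__main__":
    # Hypothetical documents, shaped like the output of a transform step.
    documents = [
        {"id": "study-1", "name": "Study A",
         "biosamples": [{"id": "bs-1", "depth_m": 0.1}, {"id": "bs-2"}]},
        {"id": "study-2", "name": "Study B", "biosamples": []},
    ]
    connection = sqlite3.connect(":memory:")
    load_documents(connection, documents)
    print(biosamples_per_study(connection))
```

A relational layout of this kind lets counts and joins for faceted search run as indexed SQL queries instead of scans over nested documents.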
All production infrastructure for data ETL and for the search portal is deployed as Docker containers for a Rancher-fronted Kubernetes cluster managed by the National Energy Research Scientific Computing Center (NERSC) as part of its Spin service (https://www.nersc.gov/systems/spin/).
DATA QUERY AND ACCESS
A key feature of the NMDC Data Portal is enabling the research community to discover data through a variety of search capabilities. When a user navigates to the NMDC Data Portal they are presented with a few ways to navigate the data. These include the ability to refine and subset NMDC data through faceted search and interactive visualizations. The data are organized by Study, Sample, and Omics data types. The NMDC home page provides several different mechanisms for refining the search results and we describe those in detail here. A detailed User Guide is available through the Data Portal (https://the-nmdc-portal-user-guide.readthedocs.io/en/latest).
Faceted search and access
The NMDC Data Portal is a unique resource that enables researchers to search across multi-omics analyses by functional annotation, environment, or analysis. Currently, the pilot enables search by investigator name, omics processing information, KEGG Orthology (KO), module and pathway (30) terms, and a suite of environmental descriptors, as well as two systems for ecosystem classifications, GOLD Ecosystem Classification paths (7) and the MIxS-EnvO triad (Environmental Broad Scale, Environmental Local Scale, Environmental Medium) (Figure 1A). The KO terms are annotated by the NMDC annotation workflow that is applied to metagenome, metatranscriptome, and transitively to metaproteome data as described above, and indexed for performant search. For example, a search for KO term ‘K10944’ (methane/ammonia monooxygenase subunit A [EC:1.14.18.3 1.14.99.39]) results in filtered data for 88 metagenomes and 34 metatranscriptomes for all three available studies (Figure 1A). Combinatorial KO terms can be added to the search filter to further refine based on functional attributes of interest.
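A combinatorial KO search of this kind can be sketched as a set-containment test. The sketch assumes that multiple KO terms are combined with AND semantics, which the text implies but does not state; the index contents and record names are invented for illustration, and K10945 is used only as an arbitrary second identifier.

```python
from typing import Dict, Iterable, List, Set


def search_by_ko(index: Dict[str, Set[str]], terms: Iterable[str]) -> List[str]:
    """Return the records whose annotations contain every requested KO term."""
    wanted = set(terms)
    return sorted(record for record, kos in index.items() if wanted <= kos)


if __name__ == "__main__":
    # Hypothetical index: omics record -> KO terms found by annotation.
    ko_index = {
        "metagenome-1": {"K10944", "K10945"},
        "metagenome-2": {"K10944"},
        "metatranscriptome-1": {"K10945"},
    }
    print(search_by_ko(ko_index, ["K10944"]))
    print(search_by_ko(ko_index, ["K10944", "K10945"]))
```

Each added term can only shrink the result list, which matches the refine-as-you-go behaviour of a faceted search panel.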
Figure 1.
Faceted search and interactive visualization on the NMDC Data Portal. (A) The left search panel supports the ability to query by KEGG Orthology (KO) functional terms, KEGG modules, and KEGG pathways across all omics data. Text searches based on keywords provide dynamic suggestions for incomplete search queries as a user types. Each of these actions refines the list of studies and biosamples in the search results, as well as the visualizations. All terms can be used in a combinatorial fashion to further subset the data of interest. (B) The Omics navigation tab consists of four dynamic panels: omics data barplot, geographic map, temporal slider, and upset plot. For each plot, the upper right ‘?’ provides a description of the plot and its functionality. (C) The Environment navigation tab is a single interactive Sankey diagram for the five-level GOLD Ecosystem Classification paths. Hovering over each area will display the path level (e.g. Environmental → Terrestrial) along with the number of biosamples within that ecosystem path (e.g., 455 Terrestrial biosamples).
Similarly, biosample attributes such as depth or latitude/longitude can be searched using a range of numerical input from the user, including ‘is between’, ‘is greater than’, ‘is greater than or equal to’, ‘is less than’, ‘is less than or equal to’, ‘is equal to’, and ‘is not’ to refine search criteria. For example, searching by depth with ‘greater than’ 15 centimeters returns one study with 18 metagenomes. Lastly, the five-level GOLD Ecosystem Classification paths and MIxS-EnvO triad (Environmental Biome, Environmental feature, Environmental material) terms are available for search refinement, and curated as described above by the NMDC team to ensure accuracy and consistency. When multiple search criteria are applied, data that do not meet the combined criteria are grayed out in the search interface.
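The numeric comparison options listed above map naturally onto a small operator table. The sketch below is one possible implementation, not the portal's code: the field name `depth_cm`, the inclusive treatment of ‘is between’, and the choice to exclude samples with a missing value are all assumptions.

```python
import operator
from typing import Any, Callable, Dict, List, Optional

# Comparison labels as they appear in the search interface.
OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    "is greater than": operator.gt,
    "is greater than or equal to": operator.ge,
    "is less than": operator.lt,
    "is less than or equal to": operator.le,
    "is equal to": operator.eq,
    "is not": operator.ne,
}


def matches(value: Optional[float], op_name: str, *bounds: float) -> bool:
    """Apply one named comparison; samples without a value never match."""
    if value is None:
        return False
    if op_name == "is between":
        low, high = bounds
        return low <= value <= high  # assumed inclusive on both ends
    (bound,) = bounds
    return OPERATORS[op_name](value, bound)


def filter_biosamples(samples: List[Dict[str, Any]], field: str,
                      op_name: str, *bounds: float) -> List[Dict[str, Any]]:
    """Keep only the samples whose field satisfies the comparison."""
    return [s for s in samples if matches(s.get(field), op_name, *bounds)]


if __name__ == "__main__":
    # Hypothetical biosamples with depth in centimeters.
    biosamples = [
        {"id": "a", "depth_cm": 5.0},
        {"id": "b", "depth_cm": 20.0},
        {"id": "c", "depth_cm": None},
        {"id": "d", "depth_cm": 15.0},
    ]
    deep = filter_biosamples(biosamples, "depth_cm", "is greater than", 15)
    print([s["id"] for s in deep])
```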
Interactive visualizations
There are two tabs in the top center of the NMDC Data Portal home page that display data either by ‘Omics’ or by ‘Environment’. These core navigation tabs reflect feedback from the user community on different desired navigation and visualization routes depending on their specific research needs. The ‘Omics’ tab provides a suite of visual summaries focused on multi-omics data availability from a biosample- and study-level perspective. The four visualization panels include (i) a bar plot depicting the number of omics processing runs for each data type available, (ii) a geographic map with circle colors and numbers representing the number of biosamples from a particular location, (iii) a temporal slider to select samples of interest based on sample collection date, and (iv) an upset plot displaying the number of multi-omics data sets generated from the same biosample within a study (Figure 1B). All four visualization panels are interactive and will filter data based on the respective search criteria. For the map, panning and zooming allow users to explore data based on geographic coordinates. Further, the ‘Search this region’ button will limit search results to the current map bounds. The temporal slider allows users to filter biosamples and studies by specific sample collection dates, currently ranging from 4 January 2014 to 10 January 2020 and grouped by collection month, and to view a histogram of the number of samples available in a particular time period. Lastly, the upset plot enables filtering by clicking the sample bar to select multi-omics data associated with a given biosample. For example, selecting the 33 Samples bar will filter based on available biosamples with associated metagenome, metaproteome, and metabolomics data.
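The counts behind an upset plot can be computed by grouping biosamples on the exact combination of omics types available for each one. The following is a minimal sketch with invented sample identifiers; it assumes each bar represents an exclusive combination, as is conventional for upset plots, and the function name `upset_counts` is introduced here.

```python
from collections import Counter
from typing import Dict, FrozenSet, Set


def upset_counts(sample_omics: Dict[str, Set[str]]) -> "Counter[FrozenSet[str]]":
    """Count biosamples per exact combination of omics data types."""
    # Samples with no omics data yet are skipped rather than given a bar.
    return Counter(frozenset(types) for types in sample_omics.values() if types)


if __name__ == "__main__":
    # Hypothetical biosamples and the omics types generated from each.
    example = {
        "bs-1": {"metagenome", "metaproteome", "metabolome"},
        "bs-2": {"metagenome"},
        "bs-3": {"metagenome", "metaproteome", "metabolome"},
        "bs-4": set(),
    }
    for combo, n in upset_counts(example).most_common():
        print(sorted(combo), n)
```

Clicking a bar then corresponds to filtering on one key of this counter.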
The ‘Environment’ tab presents an interactive Sankey diagram of the filtered search results based on the five-level GOLD Ecosystem Classification paths (Figure 1C). Hovering over parts of the diagram will display the ecosystem level along with the number of samples in that ecosystem category. When a user selects a facet from the left-hand panel, the diagram is updated dynamically to reflect the remaining studies. This visualization provides users with quick insight into the specific environment from which samples were derived.
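The data behind such a Sankey diagram reduce to counts of biosamples flowing between adjacent levels of the classification path. A minimal sketch follows; the example paths are illustrative values rather than records from the portal, and the function name `sankey_links` is an assumption.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

Link = Tuple[int, str, str]  # (source level index, source label, target label)


def sankey_links(paths: Iterable[Sequence[str]]) -> "Counter[Link]":
    """Count biosamples on each link between adjacent path levels."""
    links: "Counter[Link]" = Counter()
    for path in paths:
        for level, (source, target) in enumerate(zip(path, path[1:])):
            # The level index keeps identical labels at different depths apart.
            links[(level, source, target)] += 1
    return links


if __name__ == "__main__":
    # Illustrative five-level paths, one per biosample.
    example_paths = [
        ("Environmental", "Terrestrial", "Soil", "Unclassified", "Unclassified"),
        ("Environmental", "Aquatic", "Freshwater", "River", "Sediment"),
        ("Environmental", "Terrestrial", "Deep subsurface", "Unclassified", "Unclassified"),
    ]
    for link, n in sorted(sankey_links(example_paths).items()):
        print(link, n)
```

Recomputing these counts on the filtered biosample list is enough to redraw the diagram when a facet is selected.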
Download functionality
Once a user has discovered data of interest, they are able to initiate a download of the data. The NMDC Data Portal does not require a login for search and discovery, but does require a user to log in with their ORCiD credentials prior to allowing the data download to proceed. Currently, individual file and bulk download operations are supported, with bulk download subset by search criteria for analysis outputs and file type for a given sample. In Figure 2, the metagenome omics type for multiple studies was selected and only those analysis files are available for download. Further refinement can be done in the Bulk Download dropdown menu to select the different analysis file types. The number of files and total size are displayed dynamically in the Bulk Download button. Once the selection is made, a compressed zip file containing all of the files can be downloaded through the browser. Data download statistics based on unique ORCiD credentials are also presented to support usage metrics.
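The file count and total size shown for a bulk download selection amount to a filter followed by a sum. The sketch below illustrates that calculation only; the record fields `file_type` and `size_bytes`, the file type labels and the sizes are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple


def bulk_summary(files: List[Dict[str, Any]],
                 selected_types: Iterable[str]) -> Tuple[int, int]:
    """Return (number of files, total bytes) for the selected file types."""
    wanted = set(selected_types)
    chosen = [f for f in files if f["file_type"] in wanted]
    return len(chosen), sum(f["size_bytes"] for f in chosen)


def format_gib(size_bytes: int) -> str:
    """Render a byte count in gibibytes with one decimal place."""
    return f"{size_bytes / 2**30:.1f} GiB"


if __name__ == "__main__":
    # Hypothetical analysis output files matching the current search.
    catalog = [
        {"file_type": "Metagenome Bins", "size_bytes": 2**30},
        {"file_type": "Metagenome Bins", "size_bytes": 2**29},
        {"file_type": "QC Statistics", "size_bytes": 4096},
    ]
    count, total = bulk_summary(catalog, ["Metagenome Bins"])
    print(count, format_gib(total))
```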
Figure 2.
Bulk download functionality. Search criteria for analysis outputs and file type for a given sample provide a streamlined way to select and bulk download data of interest. For available metagenomes, all analysis outputs are displayed with the number of associated files. To select metagenome bins, the number of files available (2157) from 123 samples is listed along with the archive size (2 GiB). To download data, users are required to use ORCiD authentication (https://orcid.org).
Design and user research
The NMDC is a resource designed together with and for the scientific community. We have engaged in extensive user research through interviews and direct collaboration with the scientific community that have informed the design, development, and display of data through the NMDC Data Portal. This methodology (31) enables the scientific community to provide feedback, supports iterative and continuous improvement of our systems, and ensures that our systems enable a high level of scientific productivity. Feedback collected from the scientific community during early iterations of the Data Portal can be linked to the features and design directions found in the current release. Our community-centered design approach not only ensures that the NMDC can evolve with the needs of the microbiome research community, but will also be important for uncovering creative design solutions, clarifying expectations, reducing redesign, and perhaps most importantly, enabling shared ownership (32) of the NMDC. We hope that this inclusive approach will enable us to expand our engagements with the microbiome research community and the utility of the NMDC Data Portal.
FUTURE DIRECTIONS AND CONCLUSIONS
The first iteration of the NMDC Data Portal was released in March 2021, and will be continually updated on a quarterly basis with new studies, features, and capabilities to support microbiome data discovery and access. As described above, the focus on environmental microbiomes and coordinating multi-omics data generated at the User Facilities, the JGI and EMSL, will continue to support a global network of thousands of scientists. The NMDC is committed to being a FAIR resource, following best practices for accessing data, and for embedding sufficient contextual metadata to ensure that relevant data can be found, easily repurposed and reused, and combined with other data sets for meta-analyses. In future iterations of the NMDC Data Portal, we will leverage available tools for evaluating ‘FAIRness’ to iteratively refine the NMDC APIs, release files, and web interfaces such that they are more useful for both users and for machine access. We will utilize the FAIRness evaluation framework (https://fairsharing.github.io/FAIR-Evaluator-FrontEnd) and create FAIRness Maturity Indicators (MIs) and Compliance Tests for microbiome data, and make these available through sites such as FAIRsharing (33).
While the current version of the NMDC Data Portal does not support external submissions, we do plan to develop a streamlined web-based interface for bulk sample metadata submission from environmental studies combined with our internal processes for extracting metadata from GOLD and EMSL. To support this effort, we plan to develop a standard approach for representing mappings between templates/schema elements, along with a standard approach for versioning to coordinate mapping efforts between the systems used by the NMDC and our partners. Further, we plan to develop a general purpose converter that can take metadata from one system and translate it to another using these mappings. This converter will enable harmonization of metadata across systems.
The bioinformatic workflows currently used to process the multi-omics data are tightly coupled to the JGI and EMSL User Facilities production processes, and plans are in place to integrate updates, as appropriate, to keep pace with best practices and new technologies. These workflows will be augmented in the future to support broader sequencing platforms (e.g. long-read sequencing) and data formats, add de novo assembly of metatranscriptome data and statistical metrics for quantifying gene transcription levels, and develop a set of algorithms to automate the microbiome data reanalysis process. Additional search functionalities to support a broader range of annotation systems (e.g., Pfam and GO) will be developed to complement the KEGG functional search, alongside providing search results that are biologically informative (relative abundance estimates and coverage information). Further, we plan to make the workflows more broadly available as integrated components in EDGE (34) and KBase (35). Available metagenome workflows and training materials are currently hosted in the beta-version of NMDC EDGE (https://edge-nmdc.org). Additionally, while amplicon data is not currently hosted within the NMDC Data Portal, the data schema does support amplicon metadata and associated MIMARKS standards, to allow biosamples to be linked to related multi-omics data within a given study.
Because the NMDC Data Portal was publicly released only recently, in March 2021, we do not have reliable tracking metrics for usage at this time. However, we are currently developing these tracking metrics and have already implemented data download statistics as outlined above. The NMDC Data Portal, its emphasis on curated metadata and production-quality bioinformatic workflows, and the associated engagement activities together make up a unique collaborative resource for environmental microbiome researchers.
DATA AVAILABILITY
The NMDC Data Portal is freely available at https://data.microbiomedata.org, with available code on the GitHub repository https://github.com/microbiomedata.
ACKNOWLEDGEMENTS
We gratefully acknowledge the research teams involved in providing feedback during the initial development of the NMDC Data Portal, along with the NMDC Champions and Ambassadors for their continued engagement and support.
Contributor Information
Emiley A Eloe-Fadrosh, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Faiza Ahmed, Kitware, Clifton Park, NY 12065, USA.
Anubhav, Pacific Northwest National Laboratory, Richland, WA 99354, USA.
Michal Babinski, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
Jeffrey Baumes, Kitware, Clifton Park, NY 12065, USA.
Mark Borkum, Pacific Northwest National Laboratory, Richland, WA 99354, USA.
Lisa Bramer, Pacific Northwest National Laboratory, Richland, WA 99354, USA.
Shane Canon, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Danielle S Christianson, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Yuri E Corilo, Pacific Northwest National Laboratory, Richland, WA 99354, USA.
Karen W Davenport, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
Brandon Davis, Kitware, Clifton Park, NY 12065, USA.
Meghan Drake, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA.
William D Duncan, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Mark C Flynn, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
David Hays, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Bin Hu, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
Marcel Huntemann, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Julia Kelliher, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
Sofya Lebedeva, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Po-E Li, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
Mary Lipton, Pacific Northwest National Laboratory, Richland, WA 99354, USA.
Chien-Chi Lo, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
Stanton Martin, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA.
David Millard, Pacific Northwest National Laboratory, Richland, WA 99354, USA.
Kayd Miller, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Mark A Miller, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Paul Piehowski, Pacific Northwest National Laboratory, Richland, WA 99354, USA.
Elais Player Jackson, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
Samuel Purvine, Pacific Northwest National Laboratory, Richland, WA 99354, USA.
T B K Reddy, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Rachel Richardson, Pacific Northwest National Laboratory, Richland, WA 99354, USA.
Marisa Rudolph, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Setareh Sarrafan, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Migun Shakya, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
Montana Smith, Pacific Northwest National Laboratory, Richland, WA 99354, USA.
Kelly Stratton, Pacific Northwest National Laboratory, Richland, WA 99354, USA.
Jagadish Chandrabose Sundaramurthi, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Pajau Vangay, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Donald Winston, Polyneme LLC, New York, NY 10038, USA.
Elisha M Wood-Charlson, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Yan Xu, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
Patrick S G Chain, Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
Lee Ann McCue, Pacific Northwest National Laboratory, Richland, WA 99354, USA.
Douglas Mans, Pacific Northwest National Laboratory, Richland, WA 99354, USA.
Christopher J Mungall, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Nigel J Mouncey, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
Kjiersten Fagnan, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
FUNDING
This work is supported by the Genomic Science Program in the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research (BER) [DE-AC02-05CH11231 to L.B.N.L., 89233218CNA000001 to L.A.N.L., DE-AC05-00OR22725 to O.R.N.L., DE-AC05-76RL01830 to P.N.N.L.]. Funding for open access charge: Department of Energy.
Conflict of interest statement. None declared.
REFERENCES
- 1. Wilkinson M.D., Dumontier M., Aalbersberg I.J., Appleton G., Axton M., Baak A., Blomberg N., Boiten J.-W., da Silva Santos L.B., Bourne P.E. et al. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data. 2016; 3:160018.
- 2. Zhou W., Sailani M.R., Contrepois K., Zhou Y., Ahadi S., Leopold S.R., Zhang M.J., Rao V., Avina M., Mishra T. et al. Longitudinal multi-omics of host–microbe dynamics in prediabetes. Nature. 2019; 569:663–671.
- 3. Lloyd-Price J., Arze C., Ananthakrishnan A.N., Schirmer M., Avila-Pacheco J., Poon T.W., Andrews E., Ajami N.J., Bonham K.S., Brislawn C.J. et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature. 2019; 569:655–662.
- 4. Jansson J.K., Baker E.S. A multi-omic future for microbiome studies. Nature Microbiol. 2016; 1:16049.
- 5. Kyrpides N.C., Eloe-Fadrosh E.A., Ivanova N.N. Microbiome data science: understanding our microbial planet. Trends Microbiol. 2016; 24:425–427.
- 6. Wood-Charlson E.M., Anubhav, Auberry D., Blanco H., Borkum M.I., Corilo Y.E., Davenport K.W., Deshpande S., Devarakonda R., Drake M. et al. The National Microbiome Data Collaborative: enabling microbiome science. Nat. Rev. Microbiol. 2020; 18:313–314.
- 7. Mukherjee S., Stamatis D., Bertsch J., Ovchinnikova G., Sundaramurthi J.C., Lee J., Kandimalla M., Chen I.-M.A., Kyrpides N.C., Reddy T.B.K. Genomes OnLine Database (GOLD) v.8: overview and updates. Nucleic Acids Res. 2020; 49:D723–D733.
- 8. Chen I.-M.A., Chu K., Palaniappan K., Ratner A., Huang J., Huntemann M., Hajek P., Ritter S., Varghese N., Seshadri R. et al. The IMG/M data management and analysis system v.6.0: new tools and advanced capabilities. Nucleic Acids Res. 2020; 49:D751–D763.
- 9. Perez-Riverol Y., Csordas A., Bai J., Bernal-Llinares M., Hewapathirana S., Kundu D.J., Inuganti A., Griss J., Mayer G., Eisenacher M. et al. The PRIDE database and related tools and resources in 2019: improving support for quantification data. Nucleic Acids Res. 2018; 47:D442–D450.
- 10. Haug K., Cochrane K., Nainala V.C., Williams M., Chang J., Jayaseelan K.V., O’Donovan C. MetaboLights: a resource evolving in response to the needs of its scientific community. Nucleic Acids Res. 2019; 48:D440–D444.
- 11. Karsch-Mizrachi I., Takagi T., Cochrane G., International Nucleotide Sequence Database Collaboration. The international nucleotide sequence database collaboration. Nucleic Acids Res. 2018; 46:D48–D51.
- 12. Damerow J.E., Varadharajan C., Boye K., Brodie E.L., Burrus M., Chadwick K.D., Crystal-Ornelas R., Elbashandy H., Alves R.J.E., Ely K.S. et al. Sample identifiers and metadata to support data management and reuse in multidisciplinary ecosystem sciences. Data Sci. J. 2021; 20:11.
- 13. Clum A., Huntemann M., Bushnell B., Foster B., Foster B., Roux S., Hajek P.P., Varghese N., Mukherjee S., Reddy T.B.K. et al. DOE JGI metagenome workflow. mSystems. 2021; 6:e00804-20.
- 14. Li D., Liu C.-M., Luo R., Sadakane K., Lam T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015; 31:1674–1676.
- 15. Kim D., Paggi J.M., Park C., Bennett C., Salzberg S.L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 2019; 37:907–915.
- 16. Liao Y., Smyth G.K., Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014; 30:923–930.
- 17. Robinson M.D., McCarthy D.J., Smyth G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26:139–140.
- 18. Chambers M.C., Maclean B., Burke R., Amodei D., Ruderman D.L., Neumann S., Gatto L., Fischer B., Pratt B., Egertson J. et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 2012; 30:918–920.
- 19. Kim S., Gupta N., Pevzner P.A. Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases. J. Proteome Res. 2008; 7:3354–3363.
- 20. Monroe M.E., Shaw J.L., Daly D.S., Adkins J.N., Smith R.D. MASIC: a software program for fast quantitation and flexible visualization of chromatographic profiles from detected LC–MS(/MS) features. Comput. Biol. Chem. 2008; 32:215–217.
- 21. Corilo Y.E., Kew W.R., McCue L.A. EMSL-Computing/CoreMS: CoreMS 1.0.0 (v1.0.0). 2021; Zenodo. 10.5281/zenodo.4641553.
- 22. Hiller K., Hangebrauk J., Jäger C., Spura J., Schreiber K., Schomburg D. MetaboliteDetector: comprehensive analysis tool for targeted and nontargeted GC/MS based metabolome analysis. Anal. Chem. 2009; 81:3429–3439.
- 23. Marshall A.G., Hendrickson C.L., Jackson G.S. Fourier transform ion cyclotron resonance mass spectrometry: a primer. Mass Spectrom. Rev. 1998; 17:1–35.
- 24. Tfaily M.M., Chu R.K., Toyoda J., Tolić N., Robinson E.W., Paša-Tolić L., Hess N.J. Sequential extraction protocol for organic matter from soils and sediments using high resolution mass spectrometry. Anal. Chim. Acta. 2017; 972:54–61.
- 25. Vangay P., Burgin J., Johnston A., Beck K.L., Berrios D.C., Blumberg K., Canon S., Chain P., Chandonia J.M., Christianson D. et al. Microbiome metadata standards: report of the National Microbiome Data Collaborative's workshop and follow-on activities. mSystems. 2021; 6:e01194-20.
- 26. Yilmaz P., Kottmann R., Field D., Knight R., Cole J.R., Amaral-Zettler L., Gilbert J.A., Karsch-Mizrachi I., Johnston A., Cochrane G. et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nature Biotechnol. 2011; 29:415–420.
- 27. Taylor C.F., Paton N.W., Lilley K.S., Binz P.-A., Julian R.K., Jones A.R., Zhu W., Apweiler R., Aebersold R., Deutsch E.W. et al. The minimum information about a proteomics experiment (MIAPE). Nat. Biotechnol. 2007; 25:887–893.
- 28. Sansone S.-A., Fan T., Goodacre R., Griffin J.L., Hardy N.W., Kaddurah-Daouk R., Kristal B.S., Lindon J., Mendes P., Morrison N. et al. The metabolomics standards initiative. Nat. Biotechnol. 2007; 25:846–848.
- 29. Buttigieg P.L., Pafilis E., Lewis S.E., Schildhauer M.P., Walls R.L., Mungall C.J. The environment ontology in 2016: bridging domains with increased scope, semantic density, and interoperation. J. Biomed. Semantics. 2016; 7:57.
- 30. Kanehisa M., Sato Y., Kawashima M., Furumichi M., Tanabe M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 2016; 44:D457–462.
- 31. Abras C., Maloney-Krichmar D., Preece J. User-Centered Design. In: Bainbridge W. (ed.) Encyclopedia of Human-Computer Interaction. 2004; Thousand Oaks: Sage Publications.
- 32. Preece J., Rogers Y., Sharp H. Interaction Design: Beyond Human-Computer Interaction. 2002; NY: John Wiley & Sons.
- 33. Wilkinson M.D., Dumontier M., Sansone S.-A., Bonino da Silva Santos L.O., Prieto M., Batista D., McQuilton P., Kuhn T., Rocca-Serra P., Crosas M. et al. Evaluating FAIR maturity through a scalable, automated, community-governed framework. Scientific Data. 2019; 6:174.
- 34. Li P.-E., Lo C.-C., Anderson J.J., Davenport K.W., Bishop-Lilly K.A., Xu Y., Ahmed S., Feng S., Mokashi V.P., Chain P.S.G. Enabling the democratization of the genomics revolution with a fully integrated web-based bioinformatics platform. Nucleic Acids Res. 2016; 45:67–80.
- 35. Arkin A.P., Cottingham R.W., Henry C.S., Harris N.L., Stevens R.L., Maslov S., Dehal P., Ware D., Perez F., Canon S. et al. KBase: the United States Department of Energy Systems Biology Knowledgebase. Nature Biotechnol. 2018; 36:566–569.
- 36. Stegen J. Coupling Microbial Communities to Carbon and Contaminant Biogeochemistry in the Groundwater-Surface Water Interaction Zone [Data set]. 2014; 10.25585/1487765.
- 37. Sorensen P., Brodie E., Beller H., Wang S., Bill M., Bouskill N. Sample Collection Metadata for Soil Cores from the East River Watershed, Colorado collected in 2017 [Data set]. 2019; Berkeley, CA (United States): Lawrence Berkeley National Laboratory (LBNL).
- 38. Wrighton K.C. Microbial controls on biogeochemical cycling in deep subsurface shale carbon reservoirs [Data set]. 2014; DOE Joint Genome Institute 10.25585/1487763.