Abstract
Motivation
Widespread interest in the study of the microbiome has resulted in data proliferation and the development of powerful computational tools. However, many scientific researchers lack the time, training, or infrastructure to work with large datasets or to install and use command line tools.
Results
The National Institute of Allergy and Infectious Diseases (NIAID) has created Nephele, a cloud-based microbiome data analysis platform with standardized pipelines and a simple web interface for transforming raw data into biological insights. Nephele integrates common microbiome analysis tools as well as valuable reference datasets like the healthy human subjects cohort of the Human Microbiome Project (HMP). Nephele is built on the Amazon Web Services cloud, which provides centralized and automated storage and compute capacity, thereby reducing the burden on researchers and their institutions.
Availability and implementation
https://nephele.niaid.nih.gov and https://github.com/niaid/Nephele
1 Introduction
The human microbiome, or the collective genomes of humans and our microbial symbionts, has been described as a major frontier in biomedical research (Cho and Blaser, 2012). In general, metagenomics—or the study of the combined genetic material of samples from any environment (including the human host)—has experienced an immense surge of interest over the past decade. Researchers across many life and clinical science disciplines are increasingly exploring the impact of microbes and microbial community dynamics. Concomitant increases are occurring in microbiome data generation as well as in the development of computational tools to perform data analysis (Aguiar-Pulido et al., 2016; Morgan and Huttenhower, 2014; Noecker et al., 2017). However, many researchers studying the microbiome lack resources or specialized skills for managing and analyzing the large, multi-dimensional datasets associated with metagenomic analyses.
To address this challenge, a team at the National Institutes of Health (NIH) developed Nephele (ne-FEH-lee), an analysis platform that co-locates microbiome data and analysis tools in a cloud computing environment. Nephele simplifies the transfer, analysis and visualization of large microbiome datasets, and offers standardized pipelines of common microbiome analysis tools and reference databases to process amplicon-based (e.g. 16S, 18S, ITS) and whole metagenome shotgun sequencing data.
2 Nephele
Nephele repackages commonly used, open-source algorithms for rRNA gene sequence surveys and for metagenomic sequencing analyses—including QIIME (Caporaso et al., 2010), mothur (Schloss et al., 2009), bioBakery (Huttenhower, 2015) and a5-miseq (Coil et al., 2015)—and generates tabular and graphical outputs using tools like phyloseq (McMurdie et al., 2013). Each of the aforementioned tools is primarily designed to be used interactively through the Unix command line interface. While this offers great control and flexibility, it can also be a barrier for inexperienced users.
Nephele presents standardized, multi-step analysis pipelines through a web user interface. Nephele allows users to configure analysis steps (such as which reference database to use to compare microbial sequences) and parameters (such as specific threshold values for identification of operational taxonomic units in classifications) in order to make microbiome analysis more accessible to the research community.
Nephele shares some common elements with other metagenomics analysis tools and services. For example, like the MG-RAST service and associated data portal, Nephele offers users a web interface and a pipeline for data processing (Meyer et al., 2008; Wilke et al., 2016). Also, like the Cloud Virtual Resource, or CloVR, Nephele leverages virtual machine images and cloud computing capabilities to analyze microbiome data (Angiuoli et al., 2011). Nephele offers an alternative to these and other metagenomics tools, and focuses primarily on providing a simple interface to users while reducing cost of analysis and time to results.
2.1 Cloud technology
Nephele takes advantage of Amazon Web Services (AWS) cloud infrastructure both for web application hosting and for on-demand, scalable storage and compute capacity. More specifically, users provide input files, submit jobs and download results through Nephele's web interface, while data processing is conducted on ephemeral compute environments, or instances. A dedicated AWS instance is automatically ‘spun up’ to perform an individual data analysis job and ‘spun down’ upon completion, which offers flexibility and scalability to accommodate many concurrent analyses without the need for job queues. Parallel processing, when applicable, takes place using multiple processors on a single instance, and multiple instance sizes are available to accommodate larger or more complex analyses. Additionally, all input files, configuration parameters and virtual machine images used for Nephele analyses are tracked, so all jobs are reproducible. The cloud-based features leveraged by Nephele, as well as a high-level system architecture diagram, are described in more detail in the guide available here: https://nephele.niaid.nih.gov/#guide.
2.2 Analysis engine
Nephele's Analysis Engine enables users to submit raw sequence data and a metadata file for standardized, multi-step analysis pipeline processing. The analysis engine pipeline options are organized by analysis type (e.g. amplicon survey for 16S, 18S, or ITS, versus shotgun WGS), by primary analysis tool (e.g. mothur or QIIME) and by data type (e.g. SFF, single- or paired-end FASTQ, etc.). Through the web interface, users provide input files, select processing parameters, submit jobs and download results.
Users import data into Nephele using one of two methods: by uploading raw sequence data and a metadata file, or by supplying a public URL where files are stored. The latter approach expedites the submission process by eliminating the need to wait during large file uploads. Upon receipt, input files are automatically validated in order to confirm proper file format and to check for common mistakes, such as improperly structured mapping files, inconsistencies with file identifiers, or missing file references. After successful validation, users may specify certain optional parameters or retain common default values, including selecting which reference database to use for taxonomic assignment and whether alignment should be open, closed, or de novo. Next, users provide their name, a description for the job and an email address at which they would like to be notified of job updates. Finally, users submit their jobs; they can track progress during analyses and download final results upon job completion. Typical job outputs include BIOM files, heat maps, bar plots and taxonomy tables, among other reports and visualizations.
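The kinds of validation checks described above can be illustrated with a minimal sketch. This is not Nephele's validation code: the column name ForwardFastqFile, the function name validate_mapping and the exact set of checks are assumptions made for illustration, and only the #SampleID header follows the common QIIME mapping file convention.

```python
import csv
import io

# Hypothetical required columns; only "#SampleID" follows QIIME convention.
REQUIRED_COLUMNS = ["#SampleID", "ForwardFastqFile"]


def validate_mapping(mapping_text, uploaded_files):
    """Return a list of problems found in a tab-separated mapping file.

    An empty list means that no problems were detected.
    """
    rows = list(csv.reader(io.StringIO(mapping_text), delimiter="\t"))
    if not rows:
        return ["mapping file is empty"]

    header = rows[0]
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems  # structure is wrong, so row checks cannot proceed

    id_idx = header.index("#SampleID")
    file_idx = header.index("ForwardFastqFile")
    seen_ids = set()
    for line_no, row in enumerate(rows[1:], start=2):
        # Improperly structured row: wrong number of fields
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
            continue
        sample_id, file_name = row[id_idx], row[file_idx]
        # Inconsistent identifiers: the same sample ID used twice
        if sample_id in seen_ids:
            problems.append(f"line {line_no}: duplicate sample ID {sample_id}")
        seen_ids.add(sample_id)
        # Missing file reference: mapping names a file that was not supplied
        if file_name not in uploaded_files:
            problems.append(f"line {line_no}: file not uploaded: {file_name}")
    return problems


mapping = "#SampleID\tForwardFastqFile\nS1\ts1.fastq\nS1\ts2.fastq\nS3\n"
for problem in validate_mapping(mapping, {"s1.fastq"}):
    print(problem)
```

In this made-up example the second data row reuses the sample ID S1 and names a file that was not supplied, and the third row is missing a field, so three problems are reported.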
Nephele also offers users the ability to compare experimental data to the 16S rRNA amplicon datasets that were collected in the first phase of the NIH Human Microbiome Project (NIH HMP Working Group, 2009; Proctor, 2016), which have been made available by AWS as part of their Public Datasets initiative (see https://aws.amazon.com/datasets/human-microbiome-project/). Nephele's ‘compare-to-HMP’ analysis consists of merging taxonomic assignments from the HMP and user-supplied data, and performing beta-diversity analysis using the Bray-Curtis method available through QIIME. The output contains principal coordinates analysis (PCoA) plots as well as bar plots containing user samples and closely related HMP samples.
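For readers unfamiliar with the Bray-Curtis measure, the sketch below shows the standard definition applied to two samples: the sum of absolute differences in taxon counts divided by the sum of all counts. It does not reproduce QIIME's implementation or Nephele's pipeline; the function name and the taxon counts are invented for illustration, and taking the union of taxa stands in for the merging step described above.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two {taxon: count} dictionaries.

    Returns 0.0 for identical samples and 1.0 for samples sharing no taxa.
    """
    # Merge the taxa of both samples; a taxon absent from one counts as 0.
    taxa = set(sample_a) | set(sample_b)
    numerator = sum(abs(sample_a.get(t, 0) - sample_b.get(t, 0)) for t in taxa)
    denominator = sum(sample_a.get(t, 0) + sample_b.get(t, 0) for t in taxa)
    if denominator == 0:
        raise ValueError("both samples are empty")
    return numerator / denominator


# Made-up counts, for illustration only
user_sample = {"Bacteroides": 40, "Prevotella": 10}
hmp_sample = {"Bacteroides": 30, "Prevotella": 10, "Streptococcus": 10}

print(bray_curtis(user_sample, hmp_sample))  # prints 0.2
```

Computing this value for every pair of samples yields the distance matrix on which a PCoA is then performed.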
Nephele is being further developed to accommodate additional pipelines and to provide bioinformaticians and advanced users with more granular control of parameters as well as greater flexibility in generating outputs.
2.3 Cost, access, reproducibility and data sharing
While there is no direct cost to users, Nephele analysis jobs do incur costs to NIH for the use of cloud-based compute and storage resources. On an individual analysis job, storage costs are negligible. Because AWS compute costs vary by type of compute environment, can fluctuate over time, and are incurred by the hour (rounding up), we report time per gigabyte (GB) analyzed as a method for estimating cost. As an example, the expected duration of a QIIME-based 16S analysis of paired-end FASTQ data using Nephele is approximately 1.4 h per GB. Therefore, to analyze 2.5 GB of data using this pipeline and the default Nephele instance type (c3.4xlarge, which at the time of this writing costs $0.84 per hour), the expected 3.5 h of compute time would be billed as 4 h, for a total expected cost of $3.36. On average, whole metagenome assembly jobs using Nephele's a5-miseq-based pipeline require 6.9 h of compute time per GB, while functional annotation jobs using Nephele's bioBakery-based pipeline take 4.9 h per GB.
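The arithmetic behind the 16S example above can be reproduced with a short sketch. The function name is hypothetical, and the sketch assumes only what the text states: compute time scales linearly with data size and is billed per started hour.

```python
import math


def estimate_job_cost(data_gb, hours_per_gb, hourly_rate_usd):
    """Return (billable_hours, cost_usd) for a job billed per started hour."""
    expected_hours = data_gb * hours_per_gb
    # Round away floating-point noise first, then round up to whole hours.
    billable_hours = math.ceil(round(expected_hours, 9))
    return billable_hours, round(billable_hours * hourly_rate_usd, 2)


# Values from the text: 2.5 GB, 1.4 h per GB, $0.84 per hour (c3.4xlarge)
hours, cost = estimate_job_cost(2.5, 1.4, 0.84)
print(hours, cost)  # prints 4 3.36
```

The same function can be applied to the other reported rates (6.9 h per GB for assembly, 4.9 h per GB for functional annotation), with the caveat that instance prices change over time.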
Nephele is freely accessible to the global microbiome research community. Upon initial registration, all users are provided with five ‘use codes’ (tokens each valid for a single analysis) to test the system and to allow NIH to control its cost liabilities. Users who intend to perform additional analyses beyond the initial five may do so by requesting extended access. However, to promote equity among users and cost sustainability for NIH, any user who wishes to make extensive use of the Nephele service (e.g. to analyze hundreds of datasets) is encouraged to take advantage of Nephele's public virtual machine image and open source pipeline code.
The NIH encourages the use of public resources like Nephele to enable data sharing and reuse, whenever possible, for the benefit of the broader scientific community. While direct, user-to-user data sharing features do not currently exist within Nephele, there are two supported mechanisms whereby users can share data with others. First, a user may elect to share his or her private data by sharing a pre-signed URL to data objects on AWS Simple Storage Service (S3), which enables 24-h access to download files. A second mechanism to support data sharing on Nephele is for the user to permit his or her processed data to be accessible to other researchers for possible future reuse. This is enabled by selecting ‘Yes’ to a data sharing prompt just before submitting an analysis job on Nephele. If a user selects ‘Yes,’ data files are maintained by NIH and are reserved for potential reuse in future data sharing features that might be built into Nephele or related resources. Otherwise, the data will be removed 90 days after job submission.
2.4 Impact
Nephele simplifies and expedites results generation, and the standardized yet configurable options can provide valuable time savings to individual researchers and high-throughput sequencing facilities alike. Moreover, because Nephele tracks input files, configuration settings and virtual machine images used in data analyses, all jobs are reproducible.
Procurement and maintenance of scientific computing infrastructure can impose a heavy cost on investigators and institutions. The on-demand, pay-per-use nature of cloud computing conserves valuable funding by significantly reducing hardware and labor costs. Lessons learned from Nephele could also help to inform future developments in mechanisms for distributing or allocating research funding.
3 Conclusions
Growing evidence supports a critical role for microbiota in human health and disease. Amplicon-based and whole metagenomic sequencing analyses are critical to understanding this relationship—but analysis of large and complex datasets requires advanced computing infrastructure and sophisticated software that are costly or otherwise inaccessible to many researchers. Nephele is a novel microbiome data analysis platform that leverages the Amazon cloud and empowers researchers who do not have ready access to high-performance computing environments—or the support personnel and/or budgets to manage them—by providing all prerequisites for working with microbiome sequence data in an on-demand, pay-as-you-go model. This approach facilitates standardization of tools and analysis, sharing of methods and results, reproducibility of analysis and direct extension of capabilities by the research community. By supporting greater adoption and improved efficiency in microbiome research, resources like Nephele can facilitate broader participation—and ideally, new discoveries—in microbiome research.
Acknowledgements
This work was supported in part by the Office of Science Management and Operations (OSMO) of the NIAID. The authors also thank the many organizations and individuals who made significant contributions to the vision and development of Nephele, especially John J. McGowan, Michael Tartakovsky, Yentram Huyen, Maria Giovanni, Vivien Bonazzi, Owen White and Rob Knight; the Office of Cyber Infrastructure and Computational Biology of the National Institute of Allergy and Infectious Diseases; Amazon Web Services; DLT Solutions; BioDigital, Inc.; and supporting and former team members, Andrei Gabrielian, Jonathan Groth, Ramandeep Kaur, Alexander Levitsky, Jianchang Ning, Conrad Shyu, Chris Simons and Richard Burke Squires.
Funding
This project has been funded in part with Federal funds from the National Institutes of Health, Department of Health and Human Services, under contract number GS35F0373X.
Conflict of Interest: none declared.
References
- Aguiar-Pulido V. et al. (2016) Metagenomics, Metatranscriptomics, and Metabolomics Approaches for Microbiome Analysis. Evolutionary Bioinformatics Online, 12(Suppl 1), 5–16.
- Angiuoli S.V. et al. (2011) CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics, 12, 356.
- Caporaso J.G. et al. (2010) QIIME allows analysis of high-throughput community sequencing data. Nat. Methods, 7, 335–336.
- Cho I., Blaser M.J. (2012) The human microbiome: at the interface of health and disease. Nat. Rev. Genet., 13, 260–270.
- Coil D. et al. (2015) A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data. Bioinformatics, 31, 587–589.
- Huttenhower C. (2015) Huttenhower Lab Tools. The Huttenhower Lab Department of Biostatistics, Harvard T.H. Chan School of Public Health, Cambridge, MA, USA. https://bitbucket.org/biobakery/biobakery/wiki/Home.
- McMurdie P.J. et al. (2013) phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PloS One, 8, e61217.
- Meyer F. et al. (2008) The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 9, 386.
- Morgan X.C., Huttenhower C. (2014) Meta'omic analytic techniques for studying the intestinal microbiome. Gastroenterology, 146, 1437–1448.e1.
- NIH HMP Working Group. (2009) The NIH Human Microbiome Project. Genome Res., 19, 2317–2323.
- Noecker C. et al. (2017) High-resolution characterization of the human microbiome. Translat. Res. J. Lab. Clin. Med., 179, 7–23.
- Proctor L.M. (2016) The National Institutes of Health Human Microbiome Project. Semin. Fetal Neonatal Med., 21, 368–372.
- Schloss P.D. et al. (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol., 75, 7537–7541.
- Wilke A. et al. (2016) The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Res., 44, D590–D594.
