Author manuscript; available in PMC: 2023 Mar 16.
Published in final edited form as: Curr Opin Syst Biol. 2017 Jul 11;4:92–96. doi: 10.1016/j.coisb.2017.07.003

The Microbiome and Big Data

Jose A Navas-Molina 1, Embriette R Hyde 2, Jon Sanders 2, Rob Knight 1,2,3,*
PMCID: PMC10019530  NIHMSID: NIHMS1846646  PMID: 36937228

Abstract

Microbiome datasets have expanded rapidly in recent years. Advances in DNA sequencing, as well as the rise of shotgun metagenomics and metabolomics, are producing datasets that exceed the ability of researchers to analyze them on their personal computers. Here we describe what Big Data is in the context of microbiome research, how this data can be transformed into knowledge about microbes and their functions in their environments, and how the knowledge can be applied to move microbiome research forward. In particular, the development of new high-resolution tools to assess strain-level variability (moving away from OTUs), the advent of cloud computing and centralized analysis resources such as Qiita (for sequences) and GNPS (for mass spectrometry), and better methods for curating and describing “metadata” (contextual information about the sequence or chemical information) are rapidly assisting the use of microbiome data in fields ranging from human health to environmental studies.

From cells to bits: what is Big Data in microbiome research?

Since the term “microbiome” was coined by Joshua Lederberg in 2001 [1], the microbiome research field has exploded both in terms of the heterogeneity of the data produced and in the amount of data generated. Early approaches to characterizing the microbiome were based on targeted detection techniques in the laboratory, such as culturing and assays based on the Polymerase Chain Reaction (PCR), and assessed limited numbers of subjects (on the order of tens) [2]. The introduction of sequencing technologies revolutionized the field, enabling investigators to characterize microbial communities directly from primary samples. Historically, the 16S rRNA gene, a marker gene that exists in all bacteria and archaea as an essential part of the ribosome, has been targeted for these sequence-based profiling efforts. Its ubiquity among bacteria and archaea and the low cost of the approach have made it the most widely used target for microbiome profiling of samples. Similarly, amplification and sequencing of the 18S rRNA gene and the internal transcribed spacer (ITS) permit investigators to profile the eukaryotic and fungal communities, respectively, present in a sample. Since the introduction of Next Generation Sequencing, technologies have evolved from generating a few hundred thousand reads per run (454 GS) to tens of millions of reads (Illumina MiSeq) or even a few billion reads per run (Illumina HiSeq) [3]. Benchmarked protocols, such as those used by the Earth Microbiome Project and widely adopted by researchers around the globe, facilitate meta-analyses of unprecedented size: investigators can combine studies, each with hundreds to thousands of samples, into a single large analysis effort.

The precipitous drop in sample processing and sequencing costs associated with new technology development is enabling researchers to move beyond simple taxonomy and abundance-based work to species and strain level profiling as well as descriptions of functional pathways through whole genome shotgun metagenomics sequencing. As a result, researchers are able to ask more critical questions of their samples and are utilizing other technologies, such as detection of small molecules via mass spectrometry, to confirm or refute hypotheses driven by functional pathway and gene abundance information obtained from shotgun sequencing data.

The rate at which these technologies are increasing their data output is faster than our computational power is growing [4], effectively shifting the costs of a research study from the sequencing pipeline to the data analysis pipeline. Additionally, as researchers utilize larger and larger datasets, they are able to design large-scale studies to ask (and answer) complex questions. The metadata associated with samples, therefore, is becoming an increasingly large contributor to microbiome big data and the challenges associated with streamlining data analysis. Standards such as MIMARKS [5] have helped investigators format their metadata to facilitate data analysis and data upload to repositories such as the European Bioinformatics Institute’s European Nucleotide Archive (EBI ENA). Nevertheless, as samples are increasingly processed in parallel with multiple different protocols (e.g., 16S, 18S, ITS, shotgun, and metabolomics), correct formatting of metadata to capture this information and facilitate multi-omics correlative analyses will require careful attention and appropriate implementation of tools capable of handling hundreds to thousands of columns of data for hundreds to thousands of samples. Tools such as Qiita (qiita.microbio.me) are being developed to address the challenges associated with analyzing large numbers of samples, processed via multiple different protocols, and with complex metadata; these tools rely on both the availability and effective usage of large-scale compute resources. The ability to apply tools such as QIIME in the cloud, e.g., using Amazon Web Services [6], has broadened these capabilities far beyond the original user base, and enabled users in developing countries such as Bangladesh to use these tools without operating their own large-scale compute infrastructure. These techniques are now being applied in the United States through Illumina’s BaseSpace (https://basespace.illumina.com/home/index) and NIH’s Cloud Pilot (https://commonfund.nih.gov/bd2k/commons).

From bits to knowledge: how is Big Data moving microbiome research forward?

Initial efforts to characterize and understand the healthy human microbiome using next generation sequencing techniques [7,8] raised more questions than answers, and led to the explosion of microbiome research that has identified associations between the microbiome and diseases as varied as obesity, inflammatory bowel disease, cardiovascular disease, and autism (among many others). Most of these studies have simply identified associations, and whether those associations reflect causation remains unknown. Key studies, such as the obesity work done by Jeffrey I. Gordon and his team at Washington University [9–11] and the personalized nutrition work done by Eran Segal of the Weizmann Institute [12], are coming closer to answering the question of causality versus association. However, it is becoming increasingly clear that integrating DNA sequence data with other ‘omics techniques such as metatranscriptomics (sequencing the RNA), proteomics (characterizing the proteins), and metabolomics (characterizing the metabolites) will be key for advancing microbiome research. An example of the power of combining multiple techniques for assessing the microbiome is the National Institutes of Health’s (NIH) Human Microbiome Project (HMP), the largest human microbiome sequencing effort at the time of its publication in 2012. 16S rRNA gene amplicons were generated from a total of 4,788 samples collected from 242 healthy adults [7] and sequenced using 454 pyrosequencing. Additionally, whole genome shotgun sequencing on the paired-end Illumina platform was performed on a subset of 681 samples, generating 2.9 Gigabases per sample (close to 2 terabytes of data for the entire dataset).
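The data volume quoted for the HMP shotgun subset can be checked with a short back-of-the-envelope calculation. The sketch below uses only the two figures given in the text; the conversion of one byte per base is an assumption made here for illustration (real FASTQ files also store quality scores and headers, and are usually compressed).

```python
# Back-of-the-envelope check of the HMP shotgun data volume quoted in the text.
SAMPLES = 681                # shotgun-sequenced samples (from the text)
GIGABASES_PER_SAMPLE = 2.9   # sequence yield per sample (from the text)
BYTES_PER_BASE = 1           # assumption: one byte per base call, no quality scores

total_gigabases = SAMPLES * GIGABASES_PER_SAMPLE
# Decimal units: 1 terabyte = 1000 gigabytes.
total_terabytes = total_gigabases * BYTES_PER_BASE / 1000

print(f"{total_gigabases:.1f} Gb in total, about {total_terabytes:.2f} TB")
```

Under that assumption the total comes to roughly 1,975 gigabases, or just under 2 terabytes, which is consistent with the figure stated for the entire dataset.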

The HMP shotgun metagenomics data revealed a key observation: while no taxon was observed in all individuals (i.e., no “core” healthy microbiome was identified), the functional pathways inferred from the shotgun data were evenly distributed across individuals and body sites. While this was an important observation, the addition of other data types, such as RNA-seq or metabolomics, would have provided precise information regarding the actual activity of the microbial community and which small molecules were present, respectively. This further exemplifies the importance of combining different -omics techniques for generating hypotheses that ultimately lead to studies designed to obtain a more complete picture of a given microbial community (and the significance of its presence). For example, as reported by Bouslimani et al. [13], using a paired sequencing-mass spectrometry approach allowed the investigators to identify correlations between the genus Propionibacterium and the presence of oleic acid, palmitic acid, mono-oleic, and palmitic acylated glycerols on human skin. Hypothesizing that Propionibacterium mediates the hydrolysis of triacylglycerides or diacylglycerides from human acylated glycerols, Bouslimani et al. cultured Propionibacterium acnes in a medium supplemented with the triglyceride triolein and examined the resulting metabolic products, ultimately confirming their hypothesis.
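To make the idea of correlating paired sequencing and mass spectrometry measurements concrete, the following is a minimal sketch of a Spearman rank correlation between the relative abundance of one taxon and the intensity of one metabolite across sites. It is not the method used by Bouslimani et al.; the function names and all numeric values are hypothetical and chosen only for illustration.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical values for five skin sites (illustrative only, not study data).
taxon_abundance = [0.02, 0.10, 0.35, 0.50, 0.80]
metabolite_intensity = [1.1, 2.0, 4.8, 4.1, 9.3]

rho = spearman(taxon_abundance, metabolite_intensity)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based measure is used here because abundance and intensity data are rarely normally distributed. The sketch does not handle constant inputs (zero variance), and a real analysis would also need to account for compositionality and for testing many taxon-metabolite pairs at once.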

Microbiome citizen science initiatives such as the American Gut Project (AGP; americangut.org) have made significant contributions to the field by “democratizing” microbiome research and thus providing large-scale datasets that can be used as comparative frameworks for other studies. Citizens support the science by sending samples from their bodies, their pets, or their environment as well as the necessary funds to cover the sample processing. These projects face the challenge of dealing with large numbers of samples; while most current microbiome studies contain hundreds or a few thousand samples, these citizen science efforts contain a continually growing number of samples that in some cases are on the order of over ten thousand samples, pushing the limits of the current computational tools. Furthermore, this democratization is not free: subject data is self-reported, and at times, significant amounts of data are necessary to correctly characterize the sample source. The American Gut Project currently collects up to 400 variables about study participants, including detailed dietary information proffered through a standardized food frequency questionnaire (VioScreen). Analyzing all these variables is a challenge, and one solution is crowdsourcing the data analysis itself. All de-identified AGP data are made public as soon as they are available, allowing researchers and clinicians around the world to use the data to identify correlations between those variables and the microbiome data which can generate new hypotheses, or to contextualize their own studies with the largest open source human microbiome dataset that currently exists. 
The power of meta-analyses is apparent from early work by Lozupone and Knight [14], in which 21,752 16S rRNA sequences from diverse environments sampled across 111 studies were analyzed together to find that the main environmental driver differentiating microbial communities was salinity, rather than temperature, humidity, or a number of other environmental factors. However, when we restrict the analysis to the human gut microbiome, technical factors that differ between studies, such as DNA extraction, PCR primers, and sequencing platform are often larger than the biological effects we seek to discover [15]. Performing similar large-scale meta-analyses with the AGP data and the hundreds of other publicly available human microbiome datasets will be critical for identifying universal microbiome signatures associated with different health and disease states, and for understanding which technical variables have larger effect sizes than biological variables.

Big Data has also proven critical in the context of microbial epidemiology. Using Mycobacterium tuberculosis as an example, Guthrie and Gardy [16] describe the utility of using next-generation sequencing techniques for understanding disease outbreaks. Whole genome sequencing of a specific pathogen can reveal the infection path (including patient 0) of the outbreak by allowing investigators to follow mutations from several strains isolated from infected individuals. Whole genome sequencing can also be used to diagnose disease. For example, determination of antibiotic resistance of Mycobacterium tuberculosis is a notoriously difficult clinical problem; current gold-standard diagnostic techniques are culture-based and can take up to 8 weeks to generate results. Whole genome sequencing can reduce this time to a few days when the mutations responsible for drug resistance are well characterized and the reference databases are high quality. As a byproduct, the usage of whole genome sequencing for outbreak tracking and rapid diagnostics generates a genome catalogue that can be used for new drug development as well as better disease characterization. Clinical sequencing and diagnostic timeframes are becoming even faster with the advent of nanopore sequencing technology, currently commercialized by Oxford Nanopore Technologies (ONT) through the MinION sequencer. The reads produced by ONT devices are longer but less accurate than those of other sequencing technologies; however, they are generated extremely rapidly and on a portable device. Similar in size and price to a high-end smartphone, the MinION sequencer facilitates near-immediate data acquisition, meaning sequences can be generated much closer to the biological source. Nanopore sequencers have been used to perform same-day diagnosis of tuberculosis [17] as well as in situ monitoring of an Ebola outbreak [18].
The speed and portability can also benefit non-epidemiological microbiome work by making field-based studies more feasible where sample transit and storage are difficult or impossible. The MinION has already been used for on-site microbiological surveys in Antarctica [19] and produced the first sequences generated in space aboard the International Space Station [20].
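The outbreak-tracing idea described in this section, following mutations across isolates from infected individuals, can be illustrated with pairwise SNP distances. The sketch below is a toy example under strong assumptions: the sequences are hypothetical, already aligned, and restricted to a handful of sites, whereas real genomic epidemiology calls variants against a reference genome and uses phylogenetic and epidemiological evidence together.

```python
from itertools import combinations

def snp_distance(seq_a, seq_b):
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

# Hypothetical aligned sites from four patient isolates (toy data).
isolates = {
    "patient_A": "ACGTACGTAC",
    "patient_B": "ACGTACGTAT",
    "patient_C": "ACGAACGTAT",
    "patient_D": "TCGAACGTAT",
}

# Distance for every unordered pair of isolates.
distances = {
    (p, q): snp_distance(isolates[p], isolates[q])
    for p, q in combinations(isolates, 2)
}

# List the closest pairs first; small distances suggest recent shared ancestry.
for (p, q), d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{p} vs {q}: {d} SNP(s)")
```

In this toy data each isolate differs from the next by a single SNP, which is the kind of stepwise accumulation of mutations that lets investigators propose an order of transmission. A small distance alone does not establish who infected whom.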

Looking to the future: opportunities and challenges

The tools and technologies that have enabled microbiome research thus far continue to improve at breakneck pace. Increased usage of fast, portable sequencers such as the MinION and of multi-omics techniques means that the amount of data collected by microbiome researchers will quickly reach unprecedented sizes, which will pose challenges for data storage and analysis. This wealth of information will also facilitate the understanding of bacterial community mechanics and interactions like never before, leading to groundbreaking developments not only in human health [21–23], but also in agriculture [24], biofuels [25], and many other applications. One of the biggest challenges facing the field as investigators aim to achieve these goals is the ability to integrate and correlate the massive amounts of data produced by these protocols and to identify biologically relevant information that can be used to formulate testable hypotheses.

As investigators begin to utilize and combine multi-omics technologies, they are faced with tools and protocols that are at different stages of development. For example, one of the difficulties associated with mass spectrometry analysis of small molecules is that in many cases we are unable to determine whether molecules are microbial or host-derived due to lack of annotation, and if indeed derived from the microbiome, which specific group(s) of bacteria generated the chemical signature. Applying mass spectrometry techniques to more and more microbiome datasets will enable researchers to expand the existing databases. Even among sequence data, biases exist towards well studied environments, such as the human gut, while less studied environments, such as coral reefs, are not represented accurately (Earth Microbiome Project, in review). Developing tools to cross-compare sequence and small molecule data is also a key challenge; many of the techniques to assess sequence data are phylogeny based and cannot be applied to mass spectrometry outputs. Additionally, statistical approaches for assessing microbiome sequence data [26,27] will need to be validated on mass spectrometry data, or new, appropriate tools will need to be developed. Finally, visualizing multi-omics data together in a clear, meaningful way poses an interesting challenge, particularly given that such tools will need to be able to process thousands of data points from thousands of samples.

Large-scale meta-analyses, such as those described in the previous section, also pose a unique challenge. Current 16S rRNA studies contain tens of millions of reads, and the amount of data utilized in meta-analyses is likely to be orders of magnitude larger as shotgun sequence and metabolomics data become a routine part of microbiome studies. The largest known meta-analysis in existence, performed on the first 27,742 samples from 91 different studies in the Earth Microbiome Project (EMP; earthmicrobiome.org) exposed key problems. First, the current tools utilized to analyze the data cannot handle more than 30,000 samples at a single time. Additionally, the importance of standardizing metadata also became crystal clear. Although standard metadata definitions exist, data repositories currently do not enforce their compliance, and the metadata normalization effort is shifted to the researcher performing the meta-analysis. New tools as well as more accurate documentation will be key to facilitate the adoption of the standards in the community.
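Because repositories do not enforce metadata standards, a researcher assembling a meta-analysis typically has to check each study's sample metadata before combining it with others. The following is a minimal sketch of such a check on a tab-separated table. The required column names and the list of placeholder tokens are assumptions chosen for illustration; an actual checklist such as MIMARKS/MIxS defines its own fields and permitted values.

```python
import csv
import io

# Hypothetical required columns; a real standard defines its own checklist.
REQUIRED_COLUMNS = {"sample_name", "collection_date", "env_material", "lat_lon"}
# Hypothetical tokens treated as "no value provided".
MISSING_TOKENS = {"", "na", "n/a", "unknown", "missing"}

def validate_metadata(tsv_text):
    """Return a list of human-readable problems found in a metadata table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    problems = []
    absent = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if absent:
        problems.append(f"missing columns: {sorted(absent)}")
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        name = (row.get("sample_name") or "").strip()
        if name in seen:
            problems.append(f"line {line_no}: duplicate sample_name {name!r}")
        seen.add(name)
        # Only check required columns that are actually present in the table.
        for column in sorted(REQUIRED_COLUMNS & set(row)):
            if (row[column] or "").strip().lower() in MISSING_TOKENS:
                problems.append(
                    f"line {line_no}: empty or placeholder value in {column!r}"
                )
    return problems

example = (
    "sample_name\tcollection_date\tenv_material\n"
    "S1\t2016-05-01\tsoil\n"
    "S1\tunknown\tseawater\n"
)
for problem in validate_metadata(example):
    print(problem)
```

For the example table, the check reports the absent `lat_lon` column, the repeated sample name, and the placeholder collection date. Running a check of this kind when data are deposited, rather than at meta-analysis time, is one way the normalization burden described above could be shifted back toward the data producer.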

Last but not least, one of the most important challenges that will face microbiome research in the near future is the translation of results from the laboratory to everyday life. The human body is a supra-organism containing a wide variety of microorganisms that provide up to 99% of the genetic material present in our bodies. Ignoring this part of the system when assessing the well-being of a person is akin to performing a routine physical but only checking the blood pressure of the patient. Although the ultimate goal of human microbiome research is to implement clinical microbiome surveys, there is much work to be done before this goal can be realized. First, and most importantly, more data need to be collected and analyzed. Well-designed studies on clinical cohorts will be key for identifying meaningful host-microbiome associations and how these associations can be leveraged to improve human health. Universal Standard Operating Procedures will also be critical to minimize lab to lab variation [28], including protocols for sample collection, handling, storage and processing, as well as standardizing analysis tools. Clinician education will also be critical to enable health care providers to understand the limits of microbiome research as well as the advantages, and easy to understand microbiome analysis reports will be a key part of clinician education. Finally, sample processing and analysis times and costs need to be reduced. While in some cases genomic analysis is more rapid than gold standard diagnostics, in many cases, the processing time and costs outweigh the advantages of these techniques. For example, RNA-seq remains a lengthy, complex approach. The MinION may be useful for addressing this issue as it is able to directly accept RNA without the requirement for cDNA generation; however, widespread use of this tool will likely be closely tied to a reduction in the current error rate suffered by the system.

Microbiome research is currently on the precipice of producing orders of magnitude more data than ever before. To accurately assess and utilize this data, investigators will rely on the development of tools, pipelines, and SOPs able to effectively handle big data. Together, researchers, clinicians, and computer scientists are poised to revolutionize microbiome research and its applications in human health, agriculture, food science, and a number of other critical fields.

Highlights:

  • Cloud-based tools enable rapid computation of microbiome datasets

  • Analysis of large microbial datasets contributes critical information for health care, epidemiology, agriculture, and biofuels

  • Efficiently using microbiome data in the future will require accurate and rapid metadata curation, integration of multiple datatypes, and simple yet elegant visualizations

Acknowledgements:

Funding: This work was supported in part by the NIH, the NSF, and the Alfred P. Sloan Foundation.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References:

  • 1. Lederberg J: ‘Ome Sweet ‘Omics -- A Genealogical Treasury of Words. Sci. 2001, [no volume].
  • 2. Brigidi P, Vitali B, Swennen E, Bazzocchi G, Matteuzzi D: Effects of probiotic administration upon the composition and enzymatic activity of human fecal microbiota in patients with irritable bowel syndrome or functional diarrhea. Res. Microbiol. 2001, 152:735–741.
  • 3. Goodwin S, McPherson JD, McCombie WR: Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 2016, 17:333–351.
  • 4. Wetterstrand K: DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. 2013, [no volume].
  • 5. Yilmaz P, Kottmann R, Field D, Knight R, Cole JR, Amaral-Zettler L, Gilbert JA, Karsch-Mizrachi I, Johnston A, Cochrane G, et al.: Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat Biotechnol 2011, 29:415–420.
  • 6. Ragan-Kelley B, Walters WA, McDonald D, Riley J, Granger BE, Gonzalez A, Knight R, Perez F, Caporaso JG: Collaborative cloud-enabled tools allow rapid, reproducible biological insights. ISME J. 2013, 7:461–464.
  • 7. Human Microbiome Project C: A framework for human microbiome research. Nature 2012, 486:215–221.
  • 8. Human Microbiome Project C: Structure, function and diversity of the healthy human microbiome. Nature 2012, 486:207–214.
  • 9. Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI: An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 2006, 444:1027–1031.
  • 10. Turnbaugh PJ, Ridaura VK, Faith JJ, Rey FE, Knight R, Gordon JI: The effect of diet on the human gut microbiome: a metagenomic analysis in humanized gnotobiotic mice. Sci Transl Med 2009, 1:6ra14.
  • 11. Ridaura VK, Faith JJ, Rey FE, Cheng J, Duncan AE, Kau AL, Griffin NW, Lombard V, Henrissat B, Bain JR, et al.: Gut microbiota from twins discordant for obesity modulate metabolism in mice. Science 2013, 341:1241214.
  • 12. Zeevi D, Korem T, Zmora N, Halpern Z, Elinav E, Segal E, Israeli D, et al.: Personalized Nutrition by Prediction of Glycemic Responses. Cell 2015, doi: 10.1016/j.cell.2015.11.001. •• Integrating dietary habits, physical activity, blood glucose levels, and the gut microbiome, the authors design and implement an algorithm for predicting personalized diets to control blood sugar in diabetic Israelis.
  • 13. Bouslimani A, Porto C, Rath CM, Wang M, Guo Y, Gonzalez A, Berg-Lyon D, Ackermann G, Moeller Christensen GJ, Nakatsuji T, et al.: Molecular cartography of the human skin surface in 3D. Proc. Natl. Acad. Sci. U. S. A. 2015, 112:E2120–9. •• The authors utilize a high spatial resolution 3D mapping tool to visualize and correlate chemical and microbial composition of the human skin.
  • 14. Lozupone CA, Knight R: Global patterns in bacterial diversity. Proc Natl Acad Sci U S A 2007, 104:11436–11440.
  • 15. Lozupone CA, Stombaugh J, Gonzalez A, Ackermann G, Wendel D, Vazquez-Baeza Y, Jansson JK, Gordon JI, Knight R: Meta-analyses of studies of the human microbiota. Genome Res 2013, 23:1704–1714.
  • 16. Guthrie JL, Gardy JL: A brief primer on genomic epidemiology: lessons learned from Mycobacterium tuberculosis. Ann. N. Y. Acad. Sci. 2016, 1388:59–77. • In this review, the authors provide a thorough description of tracing pathogen transmission routes using genomic data, discussing M. tuberculosis as an example. Notably, the authors define the scenarios to which genomics may be applied, discuss appropriate technologies and tools, and define the appropriate clinical and epidemiological backdrop for transmission inference.
  • 17. Votintseva AA, Bradley P, Pankhurst L, Del Ojo Elias C, Loose M, Nilgiriwala K, Chatterjee A, Smith EG, Sanderson N, Walker TM, et al.: Same-Day Diagnostic and Surveillance Data for Tuberculosis via Whole-Genome Sequencing of Direct Respiratory Samples. J. Clin. Microbiol. 2017, 55:1285–1298.
  • 18. Quick J, Loman NJ, Duraffour S, Simpson JT, Severi E, Cowley L, Bore JA, Koundouno R, Dudas G, Mikhail A, et al.: Real-time, portable genome sequencing for Ebola surveillance. Nature 2016, 530:228–232.
  • 19. Johnson SS, Zaikova E, Goerlitz DS, Bai Y, Tighe SW: Real-Time DNA Sequencing in the Antarctic Dry Valleys Using the Oxford Nanopore Sequencer. J. Biomol. Tech. 2017, 28:2–7.
  • 20. Castro-Wallace SL, Chiu CY, John KK, Stahl SE, Rubins KH, McIntyre ABR, Dworkin JP, Lupisella ML, Smith DJ, Botkin DJ, et al.: Nanopore DNA Sequencing and Genome Assembly on the International Space Station. bioRxiv 2016, [no volume].
  • 21. van Nood E, Vrieze A, Nieuwdorp M, Fuentes S, Zoetendal EG, de Vos WM, Visser CE, Kuijper EJ, Bartelsman JFWM, Tijssen JGP, et al.: Duodenal Infusion of Donor Feces for Recurrent Clostridium difficile. N. Engl. J. Med. 2013, 368:407–415.
  • 22. Cox LM, Blaser MJ: Antibiotics in early life and obesity. Nat. Rev. Endocrinol. 2015, 11:182–190.
  • 23. Ling LL, Schneider T, Peoples AJ, Spoering AL, Engels I, Conlon BP, Mueller A, Schaberle TF, Hughes DE, Epstein S, et al.: A new antibiotic kills pathogens without detectable resistance. Nature 2015, 517:455–459.
  • 24. Sessitsch A, Mitter B: 21st century agriculture: integration of plant microbiomes for improved crop production and food security. Microb. Biotechnol. 2015, 8:32–33.
  • 25. Hess M, Sczyrba A, Egan R, Kim T-W, Chokhawala H, Schroth G, Luo S, Clark DS, Chen F, Zhang T, et al.: Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science 2011, 331:463–467.
  • 26. Morton JT, Sanders J, Quinn RA, Mcdonald D, Gonzalez A, Vázquez-baeza Y, Navas-molina JA: Balance Trees Reveal Microbial Niche Differentiation. mSystems 2017, 2:e00162–16.
  • 27. Mandal S, Van Treuren W, White RA, Eggesbø M, Knight R, Peddada SD: Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 2015, 26:27663.
  • 28. Sinha R, Abnet CC, White O, Knight R, Huttenhower C: The microbiome quality control project: baseline study design and future directions. Genome Biol. 2015, 16:276.
