The Journal of Infectious Diseases. 2024 Aug 27;233(1):76–86. doi: 10.1093/infdis/jiae378

Planning and Analyzing a Low-Biomass Microbiome Study: A Data Analysis Perspective

George I Austin 1,2, Tal Korem 3,4,✉,2
PMCID: PMC12445924  NIHMSID: NIHMS2108398  PMID: 39189314

Abstract

As investigations of low-biomass microbial communities have become more common, so too has the recognition of major challenges affecting these analyses. These challenges have been shown to compromise biological conclusions and have contributed to several controversies. Here, we review some of the most common and influential challenges in low-biomass microbiome research. We highlight key approaches to alleviate these potential pitfalls, combining experimental planning strategies and data analysis methods.

Keywords: microbiome, low-biomass microbiome, contamination, batch effect, experimental design


Low-biomass microbiome studies show great potential alongside major controversies. This review surveys key methodological challenges and discusses experimental planning strategies and data analysis methods that can address them.


Advances in microbiome research have spurred a growing interest in identifying microbial DNA in tissues or ecosystems with low microbial biomass, including tumors [1, 2], lungs [3, 4], placenta [5], and blood [6], as well as in nonhuman ecosystems such as the deep biosphere [7] or glaciers [8]. While such endeavors yielded many success stories, they have also fueled several controversies and contradictory results [9–13], highlighting a need for a deeper understanding of the common pitfalls, approaches, and limitations of low-biomass microbiome studies [14–19]. For example, the human placenta was once claimed to harbor a microbiome [5], but additional research revealed that the results were driven by contamination [12, 13]. Another article analyzing the tumor microbiome [20] was recently retracted amid concerns regarding misclassification of human DNA and machine learning approaches [10, 21]. In this review, we survey common challenges encountered in low-biomass microbiome studies, offer insights into how they can be addressed, and highlight remaining methodological gaps. We focus on aspects of study design and data analysis, rather than experimental approaches, which have been reviewed elsewhere [16, 22–27]. We note that while some have classified “low-biomass” quantitatively (eg, <10 000 microbial cells/mL [28]), we advocate for considering biomass as a continuum, with certain challenges having a stronger effect the fewer microbes are present in the ecosystem.

KEY ANALYTICAL CHALLENGES OF LOW-BIOMASS MICROBIOME STUDIES

Low-biomass microbiome studies have been undertaken using a variety of approaches, including 16S ribosomal RNA gene amplicon sequencing, metagenomics, and metatranscriptomics [1–8]. Below, we discuss key challenges that have emerged in various studies. While issues of host depletion and abundance estimation are not applicable to 16S sequencing, it provides limited phylogenetic and functional resolution.

Host DNA Misclassification

Metagenomic or metatranscriptomic data from low-biomass human microbiome studies typically consist mostly of sequences originating from the host (eg, in the tumor microbiome only approximately 0.01% of sequenced reads were estimated to be microbial [2, 29]). While this is sometimes referred to as “host contamination” [23, 30], we find that this term is somewhat inaccurate, as host DNA is expected to genuinely be present in the ecosystem. It also misspecifies the main issue, which is that unaccounted-for host DNA can be misidentified as microbial [10] (Figure 1A). In most cases, this will generate noise that will impede the ability to identify signals in the data. However, if the levels of undetected host DNA are confounded with a phenotype, host DNA misclassification might result in artifactual signals.

Figure 1.

Challenges of data analysis in low-biomass microbiome research. A–D, Illustrations of 4 challenges in microbiome studies that particularly complicate low-biomass studies: host misclassification [10] (sometimes called “host contamination”), in which DNA from the host, such as a human, gets incorrectly classified as bacterial (A); contamination [17], the introduction of microbial DNA during sample collection and processing that can be misclassified as belonging to the low-biomass microbiome of interest (B); well-to-well leakage [14], in which the contents of a sample can transfer to other samples, usually spatially adjacent to it during processing (C); and processing bias [31], the compositional impact of the varying efficiency each experimental and analytic stage has for different microbes, such that the relative abundance of different microbes can be over- or understated, depending on the sample composition and the experimental and analytic methods used (D). E–H, A hypothetical experiment illustrating how failing to account for these technical challenges during study design and computational analyses can lead to incorrect conclusions. This is an analysis of 108 samples (54 cases and 54 controls), of which 106 have an identical microbial composition. By construction, there is no underlying association between the phenotype (case/control) and any microbe. However, if the distribution of samples across plates is confounded with the phenotype (E and G), then technical factors of contamination, leakage, and bias will manufacture a difference in the measured communities between cases (F) and controls (H). I, As a result, implementations of standard differential abundance methods without technical corrections would identify multiple significant differences between the microbiomes of the case and control samples, despite the fact that their true compositions are nearly identical.

External Contamination

Among the most common issues plaguing low-biomass microbiome research is the unwanted introduction of DNA that originated from sources other than the environment being investigated. This external DNA, broadly described as “contamination,” can be introduced in various experimental stages, such as sample collection or DNA extraction, each with its own microbial composition [16–18, 22, 32, 33] (Figure 1B). Contamination is particularly relevant in low-biomass studies, because it will generally account for a greater proportion of the observed data. Similarly to host DNA misclassification, in most cases contamination will generate noise; however, if it is confounded with a phenotype it might result in artifactual signals.

Well-to-Well Leakage (“Cross-contamination”)

Another source of DNA that did not originate in the ecosystem itself is other samples that are processed concurrently, often nearby in space (eg, in adjacent wells on a 96-well plate; Figure 1C). Termed “cross-contamination,” “well-to-well leakage,” or the “splashome” [14, 33–35], this phenomenon can compromise the inferred composition of every sample. Importantly, we have recently demonstrated that well-to-well leakage into contamination controls violates the assumptions of most state-of-the-art computational decontamination methods [33].

Batch Effects and Processing Bias

“Batch effects” is a general term (and challenge) in biology, which describes the differences observed among samples from different laboratories or processing batches, and can be attributed to differences in protocols, personnel, reagent batches, or even ambient temperature [18, 36–39] (Figure 1D). In the microbiome field, these differences have also been attributed to variable efficiency of different experimental and analytic processing stages for different microbes, altogether termed “processing bias,” which may be increased by some experimental approaches used in low-biomass research [15, 31, 40–42]. Processing biases were shown to potentially distort inferred signals not only when batches are confounded with a phenotype, but also when they are not, if the mean sample efficiency is associated with a phenotype [42]. Batch confounding and batch effects, which are exacerbated by processing bias, underlie how many issues, such as those discussed above, generate artifactual signals. Furthermore, the difficulty of distinguishing between processing bias and contamination could complicate the interpretation of low-biomass research.

Identification and Classification of Microbes

Low-biomass microbial ecosystems are often understudied, and the microbes potentially residing in them might not be adequately represented in reference genomic datasets [26, 43, 44]. As a result, the accurate identification of microbes in these samples is often problematic. This is of particular concern when the identity of detected microbes serves to validate that the analytic pipeline yields sensical results [11, 45].

A Hypothetical Case Study

We demonstrate the risks inherent to these challenges with a hypothetical analysis of a simulated case:control dataset with 54 samples from cases and 54 from controls. For simplicity, we demonstrate a scenario in which 53 of the samples from each group have an identical distribution consisting of 2 taxa, along with 1 extra sample per group that contains a monoculture of a third or fourth taxon. In this scenario, the cases and controls are processed separately. Consequently, the batch that contains the case samples (Figure 1E) is subjected to distinct contamination, well-to-well leakage, and processing bias, resulting in an observed dataset that is highly different from the true samples (Figure 1F). The control samples (Figure 1G) undergo slightly different contamination, well-to-well leakage, and processing biases, which would result in an observed dataset significantly different from that of the case samples (Figure 1H). As a result, and despite the fact that 98% of all samples are identical, an analysis of the 2 observed datasets would identify 6 taxa “associated” with case/control status (Figure 1I): 2 due to contamination, 2 due to well-to-well leakage, and 2 due to processing bias. While this example emphasizes the risk of the different challenges, we note that the reason underlying artifactual results is the confounding of the phenotype with the batch structure. In the absence of such confounding, the result of the aforementioned processes would be noise (Figure 2). In either case, these pitfalls are addressable through effective study design and analytics, which we outline below.
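The role of confounding in this scenario can be illustrated with a deliberately simplified sketch. It assumes that each plate adds a fixed, plate-specific fraction of contaminant reads to every sample; the plate names and contaminant fractions are hypothetical values chosen only for illustration, and leakage and processing bias are not modeled.

```python
from statistics import mean

def case_control_gap(assignments, contamination_by_batch):
    """Mean contaminant fraction in cases minus that in controls.

    assignments: list of (phenotype, batch) tuples,
                 with phenotype in {"case", "control"}.
    contamination_by_batch: dict mapping batch -> contaminant read fraction.
    """
    cases = [contamination_by_batch[b] for p, b in assignments if p == "case"]
    controls = [contamination_by_batch[b] for p, b in assignments if p == "control"]
    return mean(cases) - mean(controls)

# Hypothetical contaminant load per plate (fraction of reads); illustrative only.
contamination = {"plate_1": 0.30, "plate_2": 0.05}

# Confounded design: all cases on plate 1, all controls on plate 2.
confounded = [("case", "plate_1")] * 54 + [("control", "plate_2")] * 54

# Unconfounded design: each plate holds 27 cases and 27 controls.
balanced = (
    [("case", "plate_1")] * 27 + [("case", "plate_2")] * 27
    + [("control", "plate_1")] * 27 + [("control", "plate_2")] * 27
)

gap_confounded = case_control_gap(confounded, contamination)
gap_balanced = case_control_gap(balanced, contamination)
```

Under these assumptions the confounded design shows a case/control difference equal to the difference between the plates, although no sample differs biologically, while the balanced design shows none. The contaminant still adds variance in the balanced design, which is the sense in which it becomes noise rather than signal.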

Figure 2.

Unconfounded experimental design can avoid artifactual signals. Similar to Figure 1E–H, but in which case and control samples are uniformly scattered across and within each plate. A–D, This hypothetical example illustrates how a microbiome experimental design in which case and control sample processing is not confounded by factors such as sequencing plate and well locations within each plate avoids the artifactual signals potentially generated by the challenges of contamination, well-to-well leakage, and processing bias.

EXPERIMENTAL STUDY DESIGN

Optimal experimental study design is essential for low-biomass microbiome studies. We offer some important considerations.

Avoid Batch Confounding by Optimizing Study Design

A critical step in reducing the impact of low-biomass challenges on subsequent data analysis is to ensure that phenotypes and covariates of interest are not confounded with the batch structure at any experimental stage (eg, sample shipment batches or DNA extraction batches). As demonstrated in Figure 2, in which both batches include a similar ratio of case and control samples, an unconfounded design increases the likelihood that experimental biases will mask true signals rather than introduce artifactual ones into downstream analyses. A broad definition of analytic goals, even for exploratory analysis, combined with study-specific considerations and domain knowledge, will help in identifying covariates and outcomes to consider when designing a study. While randomization of samples is helpful, we advocate for a more active approach in generating unconfounded batches, such as the approach proposed by BalanceIT [46]. If batches cannot be de-confounded from a covariate, such as in the case of a clinical site with a different case:control ratio than other sites, we recommend that rather than analyzing data from all batches together, the generalizability of results be assessed explicitly across batches [40, 47].
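One simple way to actively balance batches is stratified assignment: samples are shuffled within each phenotype group and then dealt round-robin across batches. The sketch below illustrates this generic idea for a single categorical covariate. It is not the BalanceIT algorithm, and the function names and sample identifiers are hypothetical.

```python
import random
from collections import Counter, defaultdict

def assign_balanced_batches(samples, n_batches, seed=0):
    """Assign samples to batches so each phenotype is spread evenly.

    samples: list of (sample_id, phenotype) tuples.
    Returns a dict mapping sample_id -> batch index (0-based).
    """
    rng = random.Random(seed)
    by_phenotype = defaultdict(list)
    for sample_id, phenotype in samples:
        by_phenotype[phenotype].append(sample_id)

    assignment = {}
    offset = 0
    for phenotype in sorted(by_phenotype):
        ids = by_phenotype[phenotype]
        rng.shuffle(ids)  # randomize order within the phenotype stratum
        for i, sample_id in enumerate(ids):
            # Deal round-robin; the offset staggers the starting batch per
            # stratum so that total batch sizes also stay close to equal.
            assignment[sample_id] = (i + offset) % n_batches
        offset += len(ids)
    return assignment

def phenotype_counts_per_batch(samples, assignment):
    """Count how many samples of each phenotype landed in each batch."""
    counts = defaultdict(Counter)
    for sample_id, phenotype in samples:
        counts[assignment[sample_id]][phenotype] += 1
    return counts

samples = [(f"case_{i}", "case") for i in range(54)]
samples += [(f"ctrl_{i}", "control") for i in range(54)]
assignment = assign_balanced_batches(samples, n_batches=2)
counts = phenotype_counts_per_batch(samples, assignment)
```

With 54 cases and 54 controls split over 2 batches, each batch receives 27 of each. Balancing several covariates at once requires a more elaborate scheme than this single-covariate sketch.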

Use Process Controls That Represent All Contamination Sources

Even when best efforts were made to reduce batch confounding, addressing the challenges of low-biomass microbiome studies is critical to reduce the impact of residual confounding and to improve detection of true signals. Chief among these challenges is contamination (Figure 1B). While best laboratory practices can reduce contamination, they cannot eliminate it. It has therefore become standard to collect process controls whose contents represent contamination introduced throughout the study [1, 18, 32, 35].

Which controls to collect is a topic of ongoing investigation. Some have recommended focusing on control samples that pass through the entire experiment and therefore “represent” all contaminants concurrently [48, 49]. We caution, however, that in large studies this requires careful planning that ensures that these control samples are present in each batch, or they might miss certain contamination sources (see Figure 3 for an example). We and others therefore advocate for additionally identifying contamination sources and profiling them separately using process-specific controls [16, 32, 33] (Figure 3). The types of controls collected should be tailored to each study; examples include surface or adjacent tissue samples [1, 13, 50], empty collection kits [18, 35], blank extraction controls [13, 20], no-template controls [1, 14], or library preparation controls [20, 33]. For each of these, attention should be given to factors that may cause a difference in the contamination profile, such as manufacturing batches for swabs [13].

Figure 3.

“Whole experiment” controls might miss certain contamination sources. An illustration of how separate contamination sources are captured by different sets of controls. Certain process controls, such as the blank swab collected early in the pipeline, will contain combinations of downstream contamination sources. Samples from separate process batches, such as DNA extraction batches, can be subjected to different contaminants. Note that the blank swab does not represent the contamination source of batch 1 (dark purple).

We note that we are not aware of a general consensus or data-driven guidance on the required number of controls per contamination source. We have found that 2 control samples are always preferable to 1, and that in specific cases more controls are helpful [33], particularly when a high amount of contamination is expected. We anticipate that the optimal number of controls, as well as the prioritization between different types, will vary between studies and ecosystems. While it is best to collect process control samples for every possible source of contamination, this might not always be feasible, in which case careful analytic strategies and alternative decontamination methods that do not use controls should be applied.

Minimize Well-to-Well Leakage and Account for It in Experimental Design

Our current ability to address well-to-well leakage analytically is limited, making study design and experimental measures even more important. To minimize leakage, high- and low-biomass samples (eg, stool and tissue samples) should not be processed together, as it has been demonstrated that such scenarios risk high leakage into the low-biomass samples [14]. It was further shown that using robotic systems bears higher risk of well-to-well leakage [14]. Beyond these observations, we are currently not aware of accepted standards or guidelines for addressing this challenge. We therefore recommend that laboratories performing low-biomass microbiome studies profile and calibrate the extent of this phenomenon with their protocols, potentially using monocultures [14, 26].

Well-to-well leakage is especially destructive for contamination controls, as it contradicts the assumptions made by most decontamination methods [33]. We therefore recommend that efforts be made to limit the number of samples processed adjacent to contamination controls. For example, for processing done on a 96-well plate, we have observed that leaving blank wells around them reduces well-to-well leakage [33]. Additionally, as noted above, well-to-well leakage can confound results by exposing samples from similar groups to the same leakage pattern, which risks creating artifactual results (Figure 1F, 1H, and 1I). Thus, just as with batch design, it can often be helpful to randomize sample locations, and it is important to ensure that there are no covariates of interest associated with processing order or spatial arrangement of an experiment (Figure 2).
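A plate layout that follows both recommendations, empty buffer wells around controls and randomized sample positions, could be generated along the following lines. This is only a sketch: the choice of control wells, the definition of "adjacent" as the 8 surrounding wells, and all function names are assumptions made for illustration.

```python
import random

ROWS, COLS = "ABCDEFGH", range(1, 13)  # standard 96-well plate

def neighbors(well):
    """Wells sharing an edge or corner with `well` (e.g. 'B2')."""
    r, c = ROWS.index(well[0]), int(well[1:])
    out = set()
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if (dr or dc) and 0 <= rr < len(ROWS) and 1 <= cc <= 12:
                out.add(f"{ROWS[rr]}{cc}")
    return out

def layout_plate(sample_ids, control_wells, seed=0):
    """Randomly place samples while keeping wells around controls empty."""
    reserved = set(control_wells)
    for w in control_wells:
        reserved |= neighbors(w)  # buffer wells stay blank
    free = [f"{r}{c}" for r in ROWS for c in COLS if f"{r}{c}" not in reserved]
    if len(sample_ids) > len(free):
        raise ValueError("not enough wells once buffers are reserved")
    rng = random.Random(seed)
    rng.shuffle(free)  # random well positions for the samples
    layout = {w: "control" for w in control_wells}
    layout.update(dict(zip(free, sample_ids)))
    return layout

layout = layout_plate([f"s{i}" for i in range(60)], control_wells=["A1", "H12"])
```

If the sample list passed in has already been balanced across plates by phenotype, shuffling positions within the plate also helps keep well location from tracking case/control status.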

Record and Publish Detailed Experimental Metadata

Correcting for factors such as batch effects, contamination, and well-to-well leakage requires in-depth sample metadata. It is therefore critical to record how each sample traversed through the experimental pipeline, including factors such as collection site and time, as well as batches of shipment, aliquoting, DNA extraction, PCR (polymerase chain reaction), library preparation, and other study-specific stages. The order of processing and location on processing plates should also be recorded. This information is important both for the initial analysis of a study, as well as for future reanalyses and meta-analyses performed by other investigators, and should therefore be made publicly available alongside the data. We advocate for its explicit inclusion in checklists such as STORMS (Strengthening The Organization and Reporting of Microbiome Studies) [51]. There are many additional relevant factors, such as which researchers were processing which samples, or what were the production batches of different reagents. In general, recording and examining as much information describing the experimental pipelines as possible gives investigators the best opportunity for successful data analysis.
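As one possible way to make such record keeping systematic, the per-sample information listed above can be captured in a structured record that also reports which fields are still missing. The field names below are hypothetical and would need to be adapted to the stages of a given study.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class SampleProcessingRecord:
    """Per-sample path through the experimental pipeline (illustrative fields)."""
    sample_id: str
    collection_site: Optional[str] = None
    collection_time: Optional[str] = None  # e.g. an ISO 8601 timestamp
    shipment_batch: Optional[str] = None
    aliquoting_batch: Optional[str] = None
    extraction_batch: Optional[str] = None
    pcr_batch: Optional[str] = None
    library_prep_batch: Optional[str] = None
    plate_id: Optional[str] = None
    well: Optional[str] = None
    processing_order: Optional[int] = None

def missing_fields(record):
    """Names of metadata fields that have not been recorded yet."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]

record = SampleProcessingRecord(
    sample_id="S001",
    collection_site="site_A",
    extraction_batch="ext_03",
    plate_id="P1",
    well="C7",
)
gaps = missing_fields(record)
```

Checking for gaps at the time of processing, rather than at analysis time, makes it more likely that the missing information can still be recovered.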

DATA ANALYSIS

Effective processing and analysis of the resulting data are needed to ensure robust interpretation of low-biomass data (Figures 2 and 4). Importantly, in every analytic decision made in these studies lies a tradeoff between discovery of novel signals and reporting of results that are conservative and high confidence. Analyses with several levels of confidence may therefore be beneficial. For example, Poore et al. [20] analyzed 4 datasets with increasing levels of decontamination stringency and 2 abundance estimation pipelines. We discuss 5 recommendations for sequencing-based low-biomass microbiome data analysis.

Figure 4.

Cross-batch or cross-study generalization demonstrates robust data analysis. Illustration of an analysis unlikely to suffer from inflated results due to contamination, bias, misclassification, or well-to-well leakage. The training set (A) is contained within a separate batch from the test set (B) and is processed separately. Both batches contain random dispersion of case and control samples, along with process controls to facilitate decontamination. C, A standard metagenomics processing pipeline for the training batch, which includes host depletion, reference mapping, decontamination, and bias correction. Afterward, some analysis of the resulting training dataset should be implemented to evaluate the biological relevance of the sequenced microbes, for example, by devising machine learning (ML) models that are predictive of a phenotype. D, Processing of the test samples should follow the same 4 steps as the training set, upon which the resulting samples should be used to assess the generalization of the associations or predictive models described for the training set. None of the processing or analysis steps, of either training or test samples, should incorporate test set labels. While we illustrate an example train-test split across 2 plates, certain situations might require relevant analysis split across patient populations, covariates, or more.

Avoid Misclassification of Host Reads

A substantial concern is that sequenced reads originating in the DNA of the host might be misclassified as microbial [10]. A key step to address this is to filter host reads by mapping them against reference genomes. These references have seen impactful updates in recent years and are still evolving, and orders-of-magnitude differences in mapping rates are observed between different references [29]. Therefore, the use of updated references, such as T2T-CHM13 [52] and the Human Pangenome Reference Consortium [53], is necessary. Furthermore, a recent analysis showed that the number of reads removed using subsequent reference updates has not reached saturation, indicating that human reads are potentially still missed [29]. While these are likely in low quantities, we recommend applying additional quality assurance steps, such as investigating taxa whose abundance across samples is correlated with the abundance of detected host DNA.
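The quality assurance step suggested above can be sketched as a rank correlation between each taxon's abundance and the fraction of reads identified as host in each sample. The flagging threshold of 0.8 and all names below are assumptions for illustration; a flagged taxon is a candidate for closer inspection, not a confirmed misclassification.

```python
def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant input: correlation undefined, treat as no signal
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def flag_host_correlated(taxa_abundance, host_fraction, threshold=0.8):
    """Taxa whose abundance across samples tracks the detected host fraction."""
    return sorted(
        taxon for taxon, abundance in taxa_abundance.items()
        if spearman(abundance, host_fraction) >= threshold
    )

# Toy data: 5 samples; values are invented for illustration.
host_fraction = [0.90, 0.95, 0.99, 0.97, 0.92]
taxa_abundance = {
    "taxon_X": [1, 5, 20, 9, 2],  # rises and falls with the host fraction
    "taxon_Y": [7, 3, 5, 4, 6],   # does not track the host fraction
}
flagged = flag_host_correlated(taxa_abundance, host_fraction)
```

In this toy example only `taxon_X` is flagged. With real data, multiple testing and the compositional nature of the abundances would need to be considered before drawing conclusions.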

A complementary way to reduce misclassification of human reads is to ensure that microbial reference genome databases do not contain host sequences, for example, using postprocessing methods [29, 54]. We note that some methods, such as Kraken [55], require the human reference genome to also be present within the reference database to avoid misclassification [10, 55, 56], but this is not necessary for other methods such as Bowtie2 [57]. While it has been suggested that reference databases should only include complete genomes because these are less likely to contain errors [10], we find that this is unnecessarily restrictive. Recent assembly efforts have demonstrated incredible diversity not captured in datasets of complete genomes [43, 58, 59], and it is likely that many microbial inhabitants of low-biomass microbial ecosystems do not have complete reference genomes. We recommend, instead, approaches such as the inclusion of a simulated sample composed entirely of host DNA that is then processed with the same analytic pipeline to quantify the rate of false-positive misclassification of host DNA as microbial [29].

Interpret Microbial Abundance Estimation With Caution

In low-biomass microbiome research, the identity of taxa measured in the ecosystem is often used as a measure for analytical rigor or the effectiveness of decontamination [9, 11, 45]. We caution against this approach: Such ecosystems are often understudied, and misidentification of taxonomy is likely [60]. We also caution against the opposite: It is advisable not to heavily rely on specific taxonomic classifications without deeper investigation and potentially validation, even if these show some biological plausibility.

Additional approaches may be necessary to alleviate issues in abundance estimation. Homology between genomes and ambiguously mapped reads [61] can cause spurious read mapping to a particular genome and a false-positive identification. Examination of coverage breadth has been proposed for this purpose and can be incorporated into abundance estimation [56, 62], under the premise that a certain proportion of the genome needs to be covered to consider a microbe as potentially present in the ecosystem [29]. Further development of bespoke methods for draft genome references and low read counts, characteristic of low-biomass microbiomes, is necessary. Additionally, absolute abundance measures such as spike-in taxa can be used to establish a limit of detection for low-abundance taxa [13].
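Coverage breadth itself is straightforward to compute once reads have been aligned to a reference genome. The sketch below assumes alignments are given as 0-based, half-open (start, end) intervals on a single contiguous genome; the example coordinates are invented, and the breadth cutoff to apply in practice is study-specific and not prescribed here.

```python
def coverage_breadth(alignments, genome_length):
    """Fraction of genome positions covered by at least one read.

    alignments: iterable of (start, end) half-open intervals, 0-based.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(alignments):
        if current_end is None or start > current_end:
            # Gap before this read: close the previous merged interval.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching read: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length

# Five reads piled onto one region, e.g. a segment homologous to another genome.
stacked = [(100, 250)] * 5
# Five reads spread across the genome.
spread = [(0, 150), (100, 250), (400, 550), (700, 850), (900, 1000)]

breadth_stacked = coverage_breadth(stacked, genome_length=1000)
breadth_spread = coverage_breadth(spread, genome_length=1000)
```

Both toy genomes receive the same number of reads, but the stacked case covers 15% of the genome and the spread case 65%, which is the distinction that read counts alone cannot make.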

An alternative approach that avoids reference databases is de novo assembly [43, 58, 59]. However, it is typically challenging in low-biomass microbiome studies, as obtaining the necessary sequencing depth is often difficult and expensive. Bespoke development of reference-guided co-assembly pipelines could greatly advance this field [63]. Finally, complementary measurements, such as cultures or fluorescence in situ hybridization (FISH) [64], which are independent of some complications that affect sequencing data, offer potential for additional validation.

Decontaminate Analytically, but Consider Uncaptured Contamination

“Decontamination” methods in microbiome research aim to computationally remove unwanted microbial DNA introduced during sample collection and processing (Figure 1B). The use of process controls is strongly recommended [16, 17], and most decontamination methods use these controls to identify contamination [33, 48, 65] (Figure 4A and 4B). For instance, decontam [65] classifies species as contaminants based on their prevalence in controls and samples, while SCRuB [33] models taxonomic compositions of contamination sources, and accounts for the possibility of well-to-well leakage into process controls (Table 1). When using SCRuB, we recommend sequentially modeling each contamination source using process-specific controls. Additionally, since precise contamination communities are specific to individual processing groups/batches, every SCRuB decontamination layer should be performed per individual batch or grouping (Figure 3).
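The intuition behind prevalence-based classification can be illustrated with a generic one-sided Fisher exact test that asks whether a taxon is detected in process controls more often than expected if detection were unrelated to control status. This is a sketch of the general idea only, not decontam's implementation, and the detection counts are invented.

```python
from math import comb

def fisher_one_sided(in_controls, n_controls, in_samples, n_samples):
    """P(at least `in_controls` detections fall among controls) under the
    hypergeometric null that detection is independent of control status."""
    total = n_controls + n_samples
    detected = in_controls + in_samples
    upper = min(n_controls, detected)
    numerator = sum(
        comb(n_controls, k) * comb(n_samples, detected - k)
        for k in range(in_controls, upper + 1)
    )
    return numerator / comb(total, detected)

# Taxon detected in all 4 controls but only 2 of 20 biological samples.
p_likely_contaminant = fisher_one_sided(4, 4, 2, 20)
# Taxon detected in 1 of 4 controls and 10 of 20 biological samples.
p_likely_resident = fisher_one_sided(1, 4, 10, 20)
```

A small p-value indicates enrichment in controls. As discussed in this section, well-to-well leakage of genuine sample content into controls can inflate a taxon's prevalence in controls, which is one reason such tests should be interpreted together with plate layout information.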

Table 1.

Overview of Public Resources Commonly Used to Address Challenges in Low-Biomass Microbiome Research

| Category | Subcategory | Methods | Description |
|---|---|---|---|
| Host depletion (a) | Alignment (a) | BWA [66]; Bowtie2 [57] | Used to filter sequenced reads that align to the genome of the host. |
| Host depletion (a) | Database depletion (a) | Conterminator [54]; Exhaustive [29] | Removes elements similar to host genomes from microbial databases. |
| Host reference (a) | Human genome (a) | T2T-CHM13 [52] | Human genome reference. |
| Host reference (a) | Human pangenome (a) | HPRC [53] | Genetically diverse collection of reference human genomes. |
| Host reference (a) | Human transcripts (a) | GENCODE [67] | Collection of human transcript references for RNA-sequencing data. |
| Microbial abundance estimation (a,b) | Alignment (a) | BWA [66]; Bowtie2 [57] | Used to compare sequenced reads to microbial genomes by aligning the entire length of each read. |
| Microbial abundance estimation (a,b) | K-mer–based approaches (a) | Kraken [55]; KrakenUniq [56] | Used to compare sequenced reads to microbial genomes by comparing k-mers within each read. |
| Microbial abundance estimation (a,b) | Marker genes (a) | MetaPhlAn [68]; mOTUs [69] | Assess microbial composition by comparing sequenced reads against microbial marker genes. Would typically be sensitive to the low read count in low-biomass samples. |
| Microbial abundance estimation (a,b) | Coverage (a) | KrakenUniq [56]; ICRA [62] | Account for the fraction of reference taxa genomes that is covered by sequenced reads. |
| Microbial abundance estimation (a,b) | 16S analysis (b) | DADA2 [70]; Deblur [71] | Identify and quantify microbial abundance using 16S ribosomal RNA amplicon sequencing data. |
| Decontamination (a,b) | Contaminant classification (a,b) | decontam (a,b) [65]; Squeegee (a) [72] | Use information from collected samples to infer if a taxon is a contaminant and should be removed from downstream analyses. |
| Decontamination (a,b) | Control-free decontamination (a,b) | decontam (a,b) [65]; Squeegee (a) [72] | Methods that identify and remove contamination without the use of process controls. |
| Decontamination (a,b) | Composition modeling (a,b) | microDecon (a,b) [48]; SCRuB (a,b) [33] | Processes collected samples to infer and remove a contamination source from a dataset. |
| Well-to-well leakage (a,b) | Handle leakage into controls (a,b) | SCRuB (a,b) [33] | Remove potential well-to-well leakage from controls to avoid compromising the decontamination process. |
| Well-to-well leakage (a,b) | Strain-level comparison (a) | inStrain (a) [34, 73] | Compare which pairs of samples share similar strains. |
| Bias/batch correction (a,b) | Batch correction (a,b) | ComBat [74]; ConQuR [75]; Voom [76]; SNM [77]; MMUPHin [78]; PLSDA-batch [79] | Transformations that strengthen microbial composition similarities across multiple batches, often while preserving differences within specified covariates. |
| Bias/batch correction (a,b) | Bias correction (a,b) | radEmu [41]; DEBIAS-M [40] | Methods that explicitly model and correct for processing bias. |

Abbreviations: BWA, Burrows-Wheeler Aligner; ConQuR, Conditional Quantile Regression; DADA2, Divisive Amplicon Denoising Algorithm 2; DEBIAS-M, domain adaptation with phenotype estimation and batch integration across studies of the microbiome; HPRC, Human Pangenome Reference Consortium; ICRA, iterative coverage-based read assignment; MetaPhlAn, metagenomic phylogenetic analysis; MMUPHin, meta-analysis methods with a uniform pipeline for heterogeneity in microbiome studies; mOTU, marker gene–based operational taxonomic unit; PLSDA, partial least squares discriminant analysis; SCRuB, source-tracking for contamination removal in microbiomes; SNM, supervised normalization of microarrays; T2T-CHM13, Telomere-to-telomere assembly of the CHM13 cell line.

(a) Denotes relevance to metagenomic and metatranscriptomic sequencing.

(b) Denotes relevance to amplicon sequencing.

Importantly, it is likely that not all contamination sources are adequately captured in process controls [13, 20], necessitating complementary approaches for identifying contamination that do not require such controls. For example, decontam's frequency test identifies an inverse correlation between taxa abundance and sample concentration as evidence for contamination [65]. Squeegee compares samples from different ecosystems that have a procedural similarity, such as similar DNA extraction kits, and considers taxa shared among them as contaminants [72]. Using lists of “known contaminants” or “kitomes” is common, particularly in low-biomass microbiome studies. However, we advocate for more experiment-specific data-driven approaches, as these lists often contain known commensals and it is possible for similar species to both inhabit an environment of interest and be introduced as contamination.
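The reasoning behind a frequency-based test can be sketched as a comparison of two simple models in log-log space. If a taxon is a contaminant added in a roughly constant amount, its relative frequency should fall in inverse proportion to the sample's total DNA concentration (slope of -1); if it is a genuine community member, its frequency should not depend on concentration (slope of 0). The code below is a simplified illustration of that comparison with invented values, not decontam's implementation, and it assumes all frequencies and concentrations are strictly positive.

```python
from math import log

def frequency_model_residuals(frequencies, concentrations):
    """Sum of squared residuals under the two competing log-log models.

    Returns (ss_contaminant, ss_noncontaminant).
    """
    lf = [log(f) for f in frequencies]      # requires frequencies > 0
    lc = [log(c) for c in concentrations]   # requires concentrations > 0
    n = len(lf)
    # Non-contaminant model: log(frequency) = b (constant).
    b = sum(lf) / n
    ss_noncontaminant = sum((y - b) ** 2 for y in lf)
    # Contaminant model: log(frequency) = a - log(concentration).
    a = sum(y + x for y, x in zip(lf, lc)) / n
    ss_contaminant = sum((y - (a - x)) ** 2 for y, x in zip(lf, lc))
    return ss_contaminant, ss_noncontaminant

concentrations = [1.0, 2.0, 4.0, 8.0]        # e.g. ng/uL; illustrative
contaminant_like = [0.40, 0.20, 0.10, 0.05]  # halves as concentration doubles
resident_like = [0.10, 0.11, 0.09, 0.10]     # roughly constant

ss_c1, ss_n1 = frequency_model_residuals(contaminant_like, concentrations)
ss_c2, ss_n2 = frequency_model_residuals(resident_like, concentrations)
```

The model with the smaller residual sum of squares fits better: the contaminant model for the first taxon and the non-contaminant model for the second. A real analysis would convert this comparison into a score or p-value and would need enough samples spanning a range of concentrations.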

Some have argued that observing different microbial abundances between similar samples that are processed with different kits is an indication of contamination [13, 45]. However, we find that this argument does not sufficiently distinguish between experimental contamination (Figure 1B) and processing bias or batch effects (Figure 1D). These have been shown to introduce changes even in high-biomass samples [15, 31, 38, 39]. Therefore, we believe that variations in abundances of particular microbes across sequencing kits alone should not be considered evidence for contamination (batch-specific differences in microbial presence, however, offer stronger evidence for contamination). In general, as we hope this discussion demonstrates, this is an area of ongoing research with substantial need for development of new, “control-free” decontamination methods.

Be Wary of Information Leakage in Batch Correction

Given how common batch effects are, many tools have been developed to address them [74–79] (Table 1). These are generally designed to reduce the differences between microbial compositions across different batches, while sometimes maintaining differences in specified covariates. Other methods, such as DEBIAS-M [40] or radEmu [41], were developed specifically to capture and correct for multiplicative processing bias proposed for microbiome studies [15, 31].

Batch- and bias-correction methods can be important for identifying generalizable microbiome patterns across batches and studies. However, the use of covariates and outcome variables by these methods can inflate the strength of observed associations. Particularly, in predictive modeling, it can compromise separation between training and test sets. As outlined in Figure 4, no processing steps should incorporate any test-set labels. Recently, the application of batch correction methods has sparked controversy, as they were shown to introduce phenotype-associated values to sparse features [10]. While we demonstrated that this is often a benign result of using pseudocounts [80], caution should still be used when interpreting features manipulated by most batch correction methods. As with all data processing steps, it is recommended to perform exploratory data analysis to understand how different host depletion, reference mapping, decontamination, and batch/bias-correction pipelines have impacted analyses.

Use Generalization Benchmarks

The strongest indication of adequate handling of the various challenges of low-biomass microbiome studies is the detection of a robust association, in a nonconfounded setting, between a microbial feature and some covariate, such as a patient's phenotype. To ensure generalization, we recommend doing this in a predictive setting, via a train-test split that employs cross-validation within the training set for model selection and hyperparameter tuning. While optimal experimental design can alleviate concerns of batch confounding, we still recommend evaluating generalization, when possible, across studies or batches, such that residual batch confounding would make the benchmark harder rather than easier (Figure 4). Demonstrating through such a benchmark that a model generalizes effectively to an independently processed test dataset is an effective justification for the legitimacy of the selected preprocessing steps and the biological relevance of the detected associations.
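The benchmark structure described above can be sketched as a leave-one-batch-out split with inner cross-validation restricted to the training samples. The helper names and the toy batch assignments are assumptions made for illustration; model fitting and scoring are omitted.

```python
def leave_one_batch_out(batches):
    """Yield (held_out_batch, train_indices, test_indices) for each batch."""
    for held_out in sorted(set(batches)):
        train_indices = [i for i, b in enumerate(batches) if b != held_out]
        test_indices = [i for i, b in enumerate(batches) if b == held_out]
        yield held_out, train_indices, test_indices


def inner_folds(train_indices, n_folds):
    """Yield (inner_train, validation) splits drawn only from train_indices."""
    folds = [train_indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        validation = folds[k]
        inner_train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield inner_train, validation


# Toy design (assumed): six samples from three independently processed studies.
batches = ["study_1", "study_1", "study_2", "study_2", "study_3", "study_3"]

splits = []
for held_out, train_indices, test_indices in leave_one_batch_out(batches):
    # Model selection and hyperparameter tuning would use only these folds.
    tuning_folds = list(inner_folds(train_indices, n_folds=2))
    # The held-out batch is touched once, for the final evaluation.
    splits.append((held_out, train_indices, test_indices, tuning_folds))
```

If scikit-learn is available, its `LeaveOneGroupOut` splitter provides the outer loop, with batch or study identifiers passed as `groups`.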

DISCUSSION

There is a great opportunity for furthering our understanding of both human and nonhuman ecosystems through the analysis of low-biomass microbiomes. We discussed key challenges that have emerged in this field, which can introduce the possibility of both overstated and understated conclusions. We reviewed experimental design and computational analysis approaches that can mitigate these challenges, and highlighted areas in need of further methodological development.

While we described computational resources and tools that can be used for low-biomass microbiome research (Table 1), there is much need for further research and development of new methods. In particular, we note that recent controversies surrounding the potential misclassification of host DNA [10, 29] highlight a necessity for new methods. Additionally, there continues to be a lack of methods that can quantify and remove well-to-well leakage between samples, due to the computational challenge of confidently inferring the true source of the “leak” in most datasets. Finally, there is a need for new approaches that detect contamination in a data-driven fashion yet without the use of process controls, as these are not always available or sufficient. As this is a rapidly evolving field, we expect that the set of optimal tools will expand in the coming years.

Contributor Information

George I Austin, Department of Biomedical Informatics; Program for Mathematical Genomics, Department of Systems Biology.

Tal Korem, Program for Mathematical Genomics, Department of Systems Biology; Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, New York.

Notes

Acknowledgments. We thank members of the Korem group for useful discussions.

Financial support. This work was supported by the Program for Mathematical Genomics at Columbia University and the National Institutes of Health (grant numbers R01HD106017 to T. K. and T15LM007079 to G. I. A.).

References

  • 1. Nejman D, Livyatan I, Fuks G, et al. The human tumor microbiome is composed of tumor type-specific intracellular bacteria. Science 2020; 368:973–80.
  • 2. Narunsky-Haziza L, Sepich-Poore GD, Livyatan I, et al. Pan-cancer analyses reveal cancer-type-specific fungal ecologies and bacteriome interactions. Cell 2022; 185:3789–806.
  • 3. Erb-Downward JR, Thompson DL, Han MK, et al. Analysis of the lung microbiome in the “healthy” smoker and in COPD. PLoS One 2011; 6:e16384.
  • 4. Segal LN, Clemente JC, Tsay J-CJ, et al. Enrichment of the lung microbiome with oral taxa is associated with lung inflammation of a Th17 phenotype. Nat Microbiol 2016; 1:16031.
  • 5. Aagaard K, Ma J, Antony KM, Ganu R, Petrosino J, Versalovic J. The placenta harbors a unique microbiome. Sci Transl Med 2014; 6:237ra65.
  • 6. Païssé S, Valle C, Servant F, et al. Comprehensive description of blood microbiome from healthy donors assessed by 16S targeted metagenomic sequencing. Transfusion 2016; 56:1138–47.
  • 7. Morono Y, Inagaki F. Chapter 3—analysis of low-biomass microbial communities in the deep biosphere. In: Sariaslani S, Michael Gadd G, eds. Advances in applied microbiology. Cambridge, MA: Academic Press, 2016:149–78.
  • 8. Hamilton TL, Peters JW, Skidmore ML, Boyd ES. Molecular evidence for an active endogenous microbiome beneath glacial ice. ISME J 2013; 7:1402–12.
  • 9. Tan CCS, Ko KKK, Chen H, et al. No evidence for a common blood microbiome based on a population study of 9,770 healthy humans. Nat Microbiol 2023; 8:973–85.
  • 10. Gihawi A, Ge Y, Lu J, et al. Major data analysis errors invalidate cancer microbiome findings. mBio 2023; 14:e0160723.
  • 11. Gihawi A, Cooper CS, Brewer DS. Caution regarding the specificities of pan-cancer microbial structure. Microb Genom 2023; 9:mgen001088.
  • 12. Lauder AP, Roche AM, Sherrill-Mix S, et al. Comparison of placenta samples with contamination controls does not provide evidence for a distinct placenta microbiota. Microbiome 2016; 4:29.
  • 13. de Goffau MC, Lager S, Sovio U, et al. Human placenta has no microbiome but can contain potential pathogens. Nature 2019; 572:329–34.
  • 14. Minich JJ, Sanders JG, Amir A, Humphrey G, Gilbert JA, Knight R. Quantifying and understanding well-to-well contamination in microbiome research. mSystems 2019; 4:e00186-19.
  • 15. Brooks JP, Edwards DJ, Harwich MD Jr, et al. The truth about metagenomics: quantifying and counteracting bias in 16S rRNA studies. BMC Microbiol 2015; 15:66.
  • 16. Eisenhofer R, Minich JJ, Marotz C, Cooper A, Knight R, Weyrich LS. Contamination in low microbial biomass microbiome studies: issues and recommendations. Trends Microbiol 2019; 27:105–17.
  • 17. Salter SJ, Cox MJ, Turek EM, et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol 2014; 12:87.
  • 18. de Goffau MC, Lager S, Salter SJ, et al. Recognizing the reagent microbiome. Nat Microbiol 2018; 3:851–3.
  • 19. Lu J, Salzberg SL. Removing contaminants from databases of draft genomes. PLoS Comput Biol 2018; 14:e1006277.
  • 20. Poore GD, Kopylova E, Zhu Q, et al. Microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 2020; 579:567–74. [Retracted]
  • 21. Poore GD, Kopylova E, Zhu Q, et al. Retraction note: microbiome analyses of blood and tissues suggest cancer diagnostic approach. Nature 2024; 631:694.
  • 22. Kim D, Hofstaedter CE, Zhao C, et al. Optimizing methods and dodging pitfalls in microbiome research. Microbiome 2017; 5:52.
  • 23. Liu Y-X, Qin Y, Chen T, et al. A practical guide to amplicon and metagenomic analysis of microbiome data. Protein Cell 2020; 12:315–30.
  • 24. Rajar P, Dhariwal A, Salvadori G, et al. Microbial DNA extraction of high-host content and low biomass samples: optimized protocol for nasopharynx metagenomic studies. Front Microbiol 2022; 13:1038120.
  • 25. Costello M, Fleharty M, Abreu J, et al. Characterization and remediation of sample index swaps by non-redundant dual indexing on massively parallel sequencing platforms. BMC Genomics 2018; 19:332.
  • 26. Minich JJ, Zhu Q, Janssen S, et al. KatharoSeq enables high-throughput microbiome analysis from low-biomass samples. mSystems 2018; 3:e00218-17.
  • 27. Orlando L, Allaby R, Skoglund P, et al. Ancient DNA analysis. Nat Rev Methods Primers 2021; 1:14.
  • 28. Selway CA, Eisenhofer R, Weyrich LS. Microbiome applications for pathology: challenges of low microbial biomass samples during diagnostic testing. J Pathol Clin Res 2020; 6:97–106.
  • 29. Sepich-Poore GD, McDonald D, Kopylova E, et al. Robustness of cancer microbiome signals over a broad range of methodological variation. Oncogene 2024; 43:1127–48.
  • 30. Ong CT, Ross EM, Boe-Hansen GB, Turni C, Hayes BJ, Tabor AE. Technical note: overcoming host contamination in bovine vaginal metagenomic samples with nanopore adaptive sequencing. J Anim Sci 2022; 100:skab344.
  • 31. McLaren MR, Willis AD, Callahan BJ. Consistent and correctable bias in metagenomic sequencing experiments. Elife 2019; 8:e46923.
  • 32. Weiss S, Amir A, Hyde ER, Metcalf JL, Song SJ, Knight R. Tracking down the sources of experimental contamination in microbiome studies. Genome Biol 2014; 15:564.
  • 33. Austin GI, Park H, Meydan Y, et al. Contamination source modeling with SCRuB improves cancer phenotype prediction from microbiome data. Nat Biotechnol 2023; 41:1820–8.
  • 34. Lou YC, Hoff J, Olm MR, et al. Using strain-resolved analysis to identify contamination in metagenomics data. Microbiome 2023; 11:36.
  • 35. Olomu IN, Pena-Cortes LC, Long RA, et al. Elimination of “kitome” and “splashome” contamination results in lack of detection of a unique placental microbiome. BMC Microbiol 2020; 20:157.
  • 36. Gaulke CA, Schmeltzer ER, Dasenko M, Tyler BM, Vega Thurber R, Sharpton TJ. Evaluation of the effects of library preparation procedure and sample characteristics on the accuracy of metagenomic profiles. mSystems 2021; 6:e0044021.
  • 37. Nearing JT, Comeau AM, Langille MGI. Identifying biases and their potential solutions in human microbiome studies. Microbiome 2021; 9:113.
  • 38. Costea PI, Zeller G, Sunagawa S, et al. Towards standards for human fecal sample processing in metagenomic studies. Nat Biotechnol 2017; 35:1069–76.
  • 39. Sinha R, Abu-Ali G, Vogtmann E, et al. Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium. Nat Biotechnol 2017; 35:1077–86.
  • 40. Austin GI, Kav AB, Park H, Biermann J, Uhlemann A-C, Korem T. Processing-bias correction with DEBIAS-M improves cross-study generalization of microbiome-based prediction models. bioRxiv [Preprint]. Posted online 12 February 2024. doi: 10.1101/2024.02.09.579716. Accessed 15 July 2024.
  • 41. Clausen DS, Willis AD. Estimating fold changes from partially observed outcomes with applications in microbial metagenomics. arXiv [Preprint]. Posted online 7 February 2024. http://arxiv.org/abs/2402.05231. Accessed 15 July 2024.
  • 42. McLaren MR, Nearing JT, Willis AD, Lloyd KG, Callahan BJ. Implications of taxonomic bias for microbial differential-abundance analysis. bioRxiv [Preprint]. Posted online 8 October 2022. https://www.biorxiv.org/content/10.1101/2022.08.19.504330v2.abstract. Accessed 9 April 2024.
  • 43. Nayfach S, Shi ZJ, Seshadri R, Pollard KS, Kyrpides NC. New insights from uncultivated genomes of the global human gut microbiome. Nature 2019; 568:505–10.
  • 44. Escapa IF, Chen T, Huang Y, Gajare P, Dewhirst FE, Lemon KP. New insights into human nostril microbiome from the expanded Human Oral Microbiome Database (eHOMD): a resource for the microbiome of the human aerodigestive tract. mSystems 2018; 3:e00187-18.
  • 45. Kennedy KM, de Goffau MC, Perez-Muñoz ME, et al. Questioning the fetal microbiome illustrates pitfalls of low-biomass microbial studies. Nature 2023; 613:639–49.
  • 46. Chiang AWT, Gazestani VH, Altieri MG, et al. Optimal balancing of clinical factors in large scale clinical RNA-seq studies. bioRxiv [Preprint]. Posted online 25 October 2021. https://www.biorxiv.org/content/10.1101/2021.06.30.450639v3.abstract. Accessed 3 April 2024.
  • 47. Whalen S, Schreiber J, Noble WS, Pollard KS. Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet 2022; 23:169–81.
  • 48. McKnight DT, Huerlimann R, Bower DS, Schwarzkopf L, Alford RA, Zenger KR. microDecon: a highly accurate read-subtraction tool for the post-sequencing removal of contamination in metabarcoding studies. Environmental DNA 2019; 1:14–25.
  • 49. Callahan B. Decontam issue #129: extraction control, pcr control and “buffer” control. 2023. https://github.com/benjjneb/decontam/issues/129. Accessed 9 April 2024.
  • 50. Marotz C, Belda-Ferre P, Ali F, et al. SARS-CoV-2 detection status associates with bacterial community composition in patients and the hospital environment. Microbiome 2021; 9:132.
  • 51. Mirzayi C, Renson A; Genomic Standards Consortium, et al. Reporting guidelines for human microbiome research: the STORMS checklist. Nat Med 2021; 27:1885–92.
  • 52. Rhie A, Nurk S, Cechova M, et al. The complete sequence of a human Y chromosome. Nature 2023; 621:344–54.
  • 53. Liao W-W, Asri M, Ebler J, et al. A draft human pangenome reference. Nature 2023; 617:312–24.
  • 54. Steinegger M, Salzberg SL. Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank. Genome Biol 2020; 21:115.
  • 55. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol 2019; 20:257.
  • 56. Breitwieser FP, Baker DN, Salzberg SL. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol 2018; 19:198.
  • 57. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012; 9:357–9.
  • 58. Pasolli E, Asnicar F, Manara S, et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 2019; 176:649–62.e20.
  • 59. Almeida A, Nayfach S, Boland M, et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol 2021; 39:105–14.
  • 60. Darling JA, Pochon X, Abbott CL, Inglis GJ, Zaiko A. The risks of using molecular biodiversity data for incidental detection of species of concern. Divers Distrib 2020; 26:1116–21.
  • 61. Hong C, Manimaran S, Shen Y, et al. PathoScope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples. Microbiome 2014; 2:33.
  • 62. Zeevi D, Korem T, Godneva A, et al. Structural variation in the gut microbiome associates with host health. Nature 2019; 568:43–8.
  • 63. Coleman I, Korem T. Embracing metagenomic complexity with a genome-free approach. mSystems 2021; 6:e0081621.
  • 64. Prudent E, Raoult D. Fluorescence in situ hybridization, a complementary molecular tool for the clinical diagnosis of infectious diseases by intracellular and fastidious bacteria. FEMS Microbiol Rev 2019; 43:88–107.
  • 65. Davis NM, Proctor DM, Holmes SP, Relman DA, Callahan BJ. Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome 2018; 6:226.
  • 66. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 2009; 25:1754.
  • 67. Frankish A, Diekhans M, Jungreis I, et al. GENCODE 2021. Nucleic Acids Res 2020; 49:D916–23.
  • 68. Blanco-Míguez A, Beghini F, Cumbo F, et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat Biotechnol 2023; 41:1633–44.
  • 69. Ruscheweyh H-J, Milanese A, Paoli L, et al. Cultivation-independent genomes greatly expand taxonomic-profiling capabilities of mOTUs across various environments. Microbiome 2022; 10:212.
  • 70. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. DADA2: high-resolution sample inference from Illumina amplicon data. Nat Methods 2016; 13:581–3.
  • 71. Amir A, McDonald D, Navas-Molina JA, et al. Deblur rapidly resolves single-nucleotide community sequence patterns. mSystems 2017; 2:e00191-16.
  • 72. Liu Y, Elworth RAL, Jochum MD, Aagaard KM, Treangen TJ. De novo identification of microbial contaminants in low microbial biomass microbiomes with Squeegee. Nat Commun 2022; 13:6799.
  • 73. Olm MR, Crits-Christoph A, Bouma-Gregson K, Firek BA, Morowitz MJ, Banfield JF. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat Biotechnol 2021; 39:727–36.
  • 74. Zhang Y, Parmigiani G, Johnson WE. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom Bioinform 2020; 2:lqaa078.
  • 75. Ling W, Lu J, Zhao N, et al. Batch effects removal for microbiome data via conditional quantile regression. Nat Commun 2022; 13:5418.
  • 76. Law CW, Chen Y, Shi W, Smyth GK. Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 2014; 15:R29.
  • 77. Mecham BH, Nelson PS, Storey JD. Supervised normalization of microarrays. Bioinformatics 2010; 26:1308.
  • 78. Ma S, Shungin D, Mallick H, et al. Population structure discovery in meta-analyzed microbial communities and inflammatory bowel disease using MMUPHin. Genome Biol 2022; 23:208.
  • 79. Wang Y, Lê Cao KA. PLSDA-batch: a multivariate framework to correct for batch effects in microbiome data. Brief Bioinform 2023; 24:bbac622.
  • 80. Austin GI, Korem T. Compositional transformations can reasonably introduce phenotype-associated values into sparse features. bioRxiv [Preprint]. Posted online 21 February 2024. https://www.biorxiv.org/content/biorxiv/early/2024/02/21/2024.02.19.581060. Accessed 4 April 2024.

Articles from The Journal of Infectious Diseases are provided here courtesy of Oxford University Press
