Accurate prediction of absolute prokaryotic abundance from DNA concentration

Jakob Wirbel; Tessa M Andermann; Erin F Brooks; Lanya Evans; Adam Groth; Mai Dvorak; Meenakshi Chakraborty; Bianca Palushaj; Gabriella ZM Reynolds; Imani E Porter; Monzr Al Malki; Andrew Rezvani; Mahasweta Gooptu; Hany Elmariah; Lyndsey Runaas; Teng Fei; Michael J Martens; Javier Bolaños-Meade; Mehdi Hamadani; Shernan Holtan; Rob Jenq; Jonathan U Peled; Mary M Horowitz; Kathleen L Poston; Wael Saber; Leslie S Kean; Miguel-Angel Perales; Ami S Bhatt

2025 Apr 28;5(5):101030. doi: 10.1016/j.crmeth.2025.101030

PMCID: PMC12146642 PMID: 40300608

Summary

Quantification of the absolute microbial abundance in a human stool sample is crucial for a comprehensive understanding of the microbial ecosystem, but this information is lost upon metagenomic sequencing. While several methods exist to measure absolute microbial abundance, they are technically challenging and costly, presenting an opportunity for machine learning. Here, we observe a strong correlation between DNA concentration and the absolute number of 16S ribosomal RNA copies as measured by digital droplet PCR in clinical stool samples from individuals undergoing hematopoietic cell transplantation (BMT CTN 1801). Based on this correlation and additional measurements, we trained an accurate yet simple machine learning model for the prediction of absolute prokaryotic load, which showed exceptional prediction accuracy on an external cohort that includes people living with Parkinson’s disease and healthy controls. We propose that, with further validation, this model has the potential to enable accurate absolute abundance estimation based on readily available sample measurements.

Keywords: absolute microbial abundance, machine learning, digital droplet PCR, microbiome, metagenomics

Highlights

• Strong correlation between DNA concentration and absolute prokaryotic abundance
• Accurate machine learning model for the prediction of absolute prokaryotic abundance
• Validation of the machine learning model using an external cohort

Motivation

Information about absolute microbial abundance is lost with standard metagenomic sequencing, making it difficult to compare microbial ecosystems. Methods to measure absolute microbial abundance have been developed, but they are technically challenging and require costly additional experiments. Here, we observed a strong correlation between DNA concentration and absolute prokaryotic abundance, which we use to train a simple machine learning model that is able to predict absolute abundance based on easily available sample characteristics.

Wirbel et al. observe a strong correlation between DNA concentration and absolute prokaryotic abundance in a large stool metagenomic cohort. They use this relationship as the basis to train a machine learning model for the prediction of absolute prokaryotic abundance, which they validate in an external dataset.

Introduction

The study of the human gut microbiome has revealed many associations between microbes and human diseases,¹ including inflammatory bowel disease,²,³ colorectal cancer,⁴,⁵ Parkinson’s disease (PD),⁶,⁷ and graft-versus-host disease,⁸ among others. In such studies, the relative microbial abundance and composition of a stool sample are typically assessed using high-throughput sequencing. A major drawback of this approach is that sequencing data are inherently compositional,⁹ and therefore, information about the absolute number of microbes in an ecosystem is lost in the process. This has the potential to lead to biased or misleading results. For example, the observation of higher relative abundance for one microbe could be explained by an increase in the number of this microbe in a population or by a reduction in the abundance of other microbes.¹⁰ The compositional nature of sequencing data is widely recognized but often ignored in analysis,¹¹ despite potentially biased results and proposed compositionality-aware analysis methodology.¹⁰,¹²,¹³

Multiple methods have been proposed for measuring the absolute abundance of microbes or quantitative microbiome profiling (QMP), including spike-in standards,¹⁴,¹⁵ quantitative or digital droplet PCR (qPCR and ddPCR, respectively),¹⁶,¹⁷,¹⁸,¹⁹ and flow cytometry.²⁰,²¹ For example, QMP showed a lower prokaryotic load in patients with Crohn’s disease compared to healthy controls.²⁰ However, QMP measurements are not routinely included in large metagenomic studies, as they require additional expertise or equipment and increase the cost of a project.

To overcome these issues, a recent study by Nishijima and colleagues²² proposed to predict the absolute prokaryotic load in a sample using a machine learning model based on taxonomic composition, as they found some prokaryotic species to be correlated with flow-cytometry-based QMP measurements in two large-scale metagenomic studies. They used this relationship to train XGBoost regression models in two studies and were able to predict the prokaryotic load in the corresponding other study with a Pearson correlation of 0.56 between the predicted and measured values. After applying their model to numerous publicly available metagenomic studies, they observed that the predicted prokaryotic load indeed confounds multiple reported microbiome-disease associations.²² While this study set a precedent for the use of predictive tools for absolute abundance, the dynamic range of absolute abundance in that study was rather small, and prediction accuracy was limited.

Results

To measure the absolute prokaryotic abundance, we performed ddPCR with universal primers for the 16S ribosomal RNA gene (16S) as detailed in a recent protocol paper by our group (see Figure 1A and Doyle et al.²³). All samples analyzed here are part of the large-scale BMT CTN 1703/1801 clinical trial (see Bolaños-Meade et al.²⁴ and STAR Methods for more information on the cohort) investigating the microbiome of individuals undergoing reduced-intensity conditioning allogeneic hematopoietic cell transplantation (allo-HCT) for hematological malignancies. For this study, stool samples were processed in a standardized way to minimize the variance introduced by varying the amounts of stool used for DNA extraction (see Figure S1) and were then sequenced using shotgun metagenomic sequencing for taxonomic characterization.

Workflow for this study and comparison to published absolute prokaryotic abundance data

(A) Schematic illustrating our workflow: for each stool sample, all DNA was extracted from frozen stool. The amount of stool used for extraction was measured to enable calculation per wet gram of stool. The extracted DNA was then used as input for metagenomic shotgun sequencing, and the resulting reads were subjected to standard downstream analysis pipelines (see STAR Methods). Additionally, the DNA was diluted 1:10,000 and used as input for ddPCR to measure the number of 16S rRNA copies in the sample. Measurements of relevance for the resulting machine learning model are bolded and italicized.

(B) Numbers of absolute prokaryotic abundance (either measured as 16S rRNA copies or bacterial cells, depending on technology) across different datasets. Studies are colored by the health status of the analyzed population (HCT, individuals undergoing hematopoietic cell transplant) and separated by the technology used for absolute abundance measurement. See the STAR Methods for details on the data from the study by Rolling et al.²⁵ Horizontal lines in the violin plot indicate the 25th, 50th (bold), and 75th quantiles.

Samples from different individuals and sampling time points relative to allo-HCT were randomly distributed across 96-well plates for DNA extraction. We used six 96-well plates to measure the absolute number of 16S copies per DNA extraction via ddPCR, resulting in 528 samples being analyzed (one column per plate was reserved for internal ddPCR standards). Due to the randomization across plates, there was no significant skew toward specific time points relative to allo-HCT compared to the full study set (see STAR Methods). The number of 16S copies for samples that were too dilute (n = 16) was set to the lower limit of detection. Of the total number of samples, ten samples produced an insufficient number of droplets and were removed from the analysis, resulting in a dataset of 518 samples with absolute abundance measurement.
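The sample accounting in this paragraph can be retraced with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the helper `floor_at_lod` and the example detection limit are hypothetical illustrations, not values from the study.

```python
# Sample accounting for the ddPCR plates, using the counts given in the text.
WELLS_PER_PLATE = 96
WELLS_PER_COLUMN = 8           # a 96-well plate has 12 columns of 8 wells
N_PLATES = 6
N_INSUFFICIENT_DROPLETS = 10   # samples removed for too few droplets

# One column per plate is reserved for internal ddPCR standards.
samples_per_plate = WELLS_PER_PLATE - WELLS_PER_COLUMN
n_analyzed = N_PLATES * samples_per_plate
n_final = n_analyzed - N_INSUFFICIENT_DROPLETS


def floor_at_lod(copies, lod):
    """Set a measurement below the lower limit of detection to that limit."""
    return max(copies, lod)


print(n_analyzed, n_final)
# Hypothetical example: a too-dilute sample is raised to an assumed limit of 100.
print(floor_at_lod(10.0, 100.0))
```

Note that the 16 samples set to the lower limit of detection stay in the dataset; only the ten samples with too few droplets are removed.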

Due to extensive medical treatment during and after transplantation, including administration of a variety of antibiotics, the microbiome of individuals undergoing allo-HCT is known to be disrupted in terms of alpha diversity⁸,²⁵ and absolute prokaryotic abundance.²⁵,²⁶ In line with these expectations, the numbers of 16S copies per wet gram (as a proxy for bacterial numbers per wet gram) in our study are lower than previously reported numbers of bacteria measured with flow cytometry in largely healthy (i.e., less heavily treated) populations²⁰,²¹,²² (see Figure 1B). Studies included here employed different technologies to measure the absolute abundance of prokaryotes, and therefore, comparing their numbers is difficult, as the limits of detection and normalization procedures might differ. Nevertheless, compared to the machine learning study from Nishijima et al.,²² the data in our study encompass a larger dynamic range spanning multiple orders of magnitude.

We hypothesized that there should be a direct relationship between the total amount of DNA in a sample and the absolute prokaryotic abundance (as measured by log₁₀ 16S copies per extraction), as the majority of DNA in stool is typically of prokaryotic origin. Other sources of DNA could be host contamination, other microbes such as microbial eukaryotes, or food remnants. We calculated the Spearman correlation coefficient between DNA concentration and absolute prokaryotic abundance and observed a strong positive correlation (rho = 0.92, p < 2e−16; see Figure 2A). Another positive correlation was observed for prokaryotic alpha diversity (measured as Shannon diversity, rho = 0.34, p = 2.1e−15; see Figure 2B), as reported in Nishijima et al.²² For the relative abundance of high-level taxonomic groups (Eukarya, Archaea, and Bacteria), no strong positive correlation was observed (see Figure S2).
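For readers unfamiliar with the statistic, a minimal sketch of a Spearman correlation is given here: values are converted to ranks (ties receive their average rank) and a Pearson correlation is computed on the ranks. The function names and the toy measurements are hypothetical and are not taken from the study's code or data.

```python
import math
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: DNA concentration and log10 16S copies per extraction.
dna_conc = [0.5, 2.0, 8.0, 30.0, 120.0]
log10_16s = [5.1, 6.0, 6.4, 7.9, 8.3]

rho = spearman(dna_conc, log10_16s)
rho_logged = spearman([math.log10(v) for v in dna_conc], log10_16s)
print(rho, rho_logged)
```

Because the statistic depends only on ranks, log-transforming the DNA concentration leaves rho unchanged, which is one reason it suits measurements spanning several orders of magnitude.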

Machine learning model for accurate prediction of 16S copies in stool metagenomic samples

(A) Relationship between absolute DNA concentration and 16S copies per extraction, measured by ddPCR. The trendline displays the mean log copies estimated using a linear model with the formula y∼log(x), and the surrounding gray areas indicate 95% confidence intervals.

(B) Relationship between alpha diversity (as measured by Shannon’s index) and 16S copies per extraction, measured by ddPCR. The trendline displays the mean log copies estimated using a linear model with the formula y∼x, and the surrounding gray areas indicate 95% confidence intervals.

(C) Relationship between measured 16S copies per extraction and 16S copies predicted by the full machine learning model. The plot shows the average prediction for hold-out data points from the ten cross-validation repeats.

(D) Mean relative model weight for each predictor across the 10 repeats of the cross-validated model fitting. Black error bars indicate the standard deviation across cross-validation folds.

In all plots, the limit of quantification (LoQ) is indicated by a dashed black line.

We hypothesized that the strong correlation between absolute prokaryotic abundance and DNA concentration could be the basis for a predictive model. Since DNA concentration is relatively easy to measure in contrast to absolute prokaryotic abundance, such a model could potentially substitute costly and time-consuming additional experiments. Note that we aim to predict the number of 16S copies per extraction, not per wet gram of stool, because this latter value requires additional normalization by the amount of stool used as input for the extraction (see Figure S1).

To test this hypothesis, we trained a random forest model on our data, using only the DNA concentration as input, employing a ten times-repeated 10-fold cross-validation strategy. This “DNA-only” model achieved a Spearman correlation between the measured and predicted values of 0.89 (see Figure S3). As we had performed metagenomic sequencing on all samples, we had additional sample information available that might be relevant to this prediction task. We added high-level domain taxonomic information (in order to avoid excessive dimensionality), the fraction of human reads, prokaryotic alpha diversity, and the type of sample storage (same-day versus next-day sample freezing) as predictors. This “full” model achieved a Spearman correlation of 0.91 (see Figure 2C) and outperformed the DNA-only model in other metrics as well (see Table 1), showing better prediction accuracy across samples (p = 0.0003, paired t test).
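The resampling scheme can be illustrated with a small index generator. This is a sketch of generic repeated k-fold splitting, not the resampling implementation used in the study; the function name `repeated_kfold` and the seed are assumptions made for illustration.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        for fold in range(n_folds):
            test_idx = indices[fold::n_folds]
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield repeat, fold, train_idx, test_idx


# 518 samples with absolute abundance measurements, as stated in the text.
splits = list(repeated_kfold(518))

# Count how often each sample ends up in a hold-out fold.
times_held_out = [0] * 518
for _, _, _, test_idx in splits:
    for i in test_idx:
        times_held_out[i] += 1

print(len(splits), min(times_held_out), max(times_held_out))
```

Each sample is held out exactly once per repeat, so every sample receives ten hold-out predictions that can be averaged.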

Table 1.

Cross-validation performance of machine learning models

Model	Pearson’s r	Spearman’s rho	R²	MSE (mean-squared error)	CCC (concordance correlation coefficient)
DNA-only	0.92	0.89	0.82	0.11	0.92
Full model∗	0.94	0.91	0.86	0.08	0.93

∗ indicates best performance.
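Three of the metrics in Table 1 can be written out explicitly. The sketch below uses common textbook definitions of the mean-squared error, the coefficient of determination, and Lin's concordance correlation coefficient; whether the study's software uses exactly these definitions is an assumption, and the toy values are hypothetical.

```python
from statistics import mean


def mse(measured, predicted):
    """Mean-squared error between measured and predicted values."""
    return mean([(m - p) ** 2 for m, p in zip(measured, predicted)])


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    m_bar = mean(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - m_bar) ** 2 for m in measured)
    return 1 - ss_res / ss_tot


def ccc(measured, predicted):
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(measured)
    mx, my = mean(measured), mean(predicted)
    var_x = sum((m - mx) ** 2 for m in measured) / n
    var_y = sum((p - my) ** 2 for p in predicted) / n
    cov = sum((m - mx) * (p - my) for m, p in zip(measured, predicted)) / n
    return 2 * cov / (var_x + var_y + (mx - my) ** 2)


# Hypothetical log10 16S copies: predictions are offset by a constant 0.5.
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 6.5, 7.5, 8.5]

print(mse(measured, predicted), r_squared(measured, predicted), ccc(measured, predicted))
```

In this toy case the predictions are perfectly correlated with the measurements, yet the CCC stays below 1 because it also penalizes the systematic offset. This is why the CCC complements the correlation coefficients when judging agreement.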

To better understand our machine learning model, we extracted the feature importance from all models trained in the cross-validation procedure and normalized the importance per model by the summed importance for all predictors. As expected, the DNA concentration carried the strongest relative model weight, followed by the fraction of host reads in sequencing and prokaryotic alpha diversity (see Figure 2D). Reassuringly, the type of sample storage (same-day versus next-day sample freezing) did not seem to have a strong impact on absolute prokaryotic abundance.
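The normalization described above amounts to dividing each predictor's importance by the summed importance within a model and then summarizing across models. A minimal sketch follows; the predictor names and raw importance values are hypothetical, and only three models are shown for brevity.

```python
from statistics import mean, stdev

# Hypothetical raw importances from three fitted models.
raw_importances = [
    {"dna_concentration": 8.0, "host_read_fraction": 1.0, "shannon_diversity": 1.0},
    {"dna_concentration": 6.0, "host_read_fraction": 3.0, "shannon_diversity": 1.0},
    {"dna_concentration": 7.0, "host_read_fraction": 2.0, "shannon_diversity": 1.0},
]


def normalize(importance):
    """Scale one model's importances so that they sum to 1."""
    total = sum(importance.values())
    return {name: value / total for name, value in importance.items()}


relative = [normalize(model) for model in raw_importances]

# Mean and standard deviation of the relative weight per predictor.
summary = {}
for name in raw_importances[0]:
    weights = [model[name] for model in relative]
    summary[name] = (mean(weights), stdev(weights))

for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.2f} +/- {sd:.2f}")
```

Normalizing per model makes the weights comparable across models whose raw importance scales differ.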

To estimate the test error of our model, we employed a ten times-repeated 10-fold cross-validation strategy during model training. As an additional and more rigorous evaluation of model generalizability, we trained models on data from five of the six 96-well plates used to run the ddPCR and evaluated the model performance on the data from the left-out plate, repeating it for each plate (see Figure S4). In this more difficult prediction task (since fewer data are available for training), we observed a high correlation between the measured and predicted values (mean Spearman correlation of 0.91 ± 0.03, mean concordance correlation coefficient [CCC] of 0.93 ± 0.03 across plates), indicating that our model generalizes well to completely unseen data.
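The leave-one-plate-out evaluation can be sketched as a grouped split in which every plate serves once as the test set. The function name and the plate labels are hypothetical; for simplicity the example assumes 88 samples on each of the six plates, ignoring the ten samples that were removed.

```python
def leave_one_plate_out(plate_ids):
    """Yield (held_out_plate, train_idx, test_idx), one split per plate."""
    for held_out in sorted(set(plate_ids)):
        train_idx = [i for i, p in enumerate(plate_ids) if p != held_out]
        test_idx = [i for i, p in enumerate(plate_ids) if p == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical layout: plates 1..6 with 88 samples each.
plate_ids = [plate for plate in range(1, 7) for _ in range(88)]

folds = list(leave_one_plate_out(plate_ids))
for held_out, train_idx, test_idx in folds:
    print(held_out, len(train_idx), len(test_idx))
```

Unlike random folds, this split keeps all samples from one ddPCR run together, so plate-specific batch effects cannot leak from the training data into the test data.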

To further validate our model on completely unseen data, we performed ddPCR to measure the absolute prokaryotic abundance in an external case-control cohort of people living with PD, which included healthy controls (total samples = 179, PD samples = 103). Although the ddPCR protocol and bioinformatic processing were identical, the PD cohort differed from the allo-HCT population in terms of diagnosis, lifestyle and medication, and sample handling and DNA extraction, potentially leading to biases that could limit model generalization. However, we observe not only a very similar relationship between DNA concentration and 16S copies per extraction (rho = 0.98, p < 2e−16; see Figure 3A) but also high accuracy for the prediction of the PD data by the model trained on the allo-HCT data alone (R² = 0.92; see Figure 3B). This analysis demonstrates excellent generalizability for our machine learning model.

Evaluation of our model on an external case-control cohort of people living with Parkinson’s disease and healthy controls

(A) Relationship between absolute DNA concentration and 16S copies per extraction, measured by ddPCR, for the PD cohort (in orange) compared to the allo-HCT cohort (in gray). The trendline displays the mean log copies estimated using a linear model with the formula y∼log(x), and the surrounding gray areas indicate 95% confidence intervals.

(B) Relationship between measured 16S copies per extraction and predicted 16S copies for the external PD cohort.

In both plots, the limit of quantification (LoQ) is indicated by a dashed black line.

Discussion

Measuring the absolute abundance of prokaryotes in a sample can require additional experiments associated with increased costs, technical expertise, and labor. Therefore, absolute abundance is not routinely measured in large-scale metagenomic studies, despite absolute prokaryotic counts being crucial for a bias-free understanding of microbiome-host relationships.⁹,¹⁰,²⁷ Here, we propose a relatively simple machine learning model for the prediction of absolute prokaryotic abundance based on DNA concentration and other sample-level information readily available after standard metagenomic shotgun sequencing. This model showed unparalleled accuracy using a cross-validation approach (see Figure 2C) and for the prediction of a truly external validation cohort (see Figure 3B), highlighting its potential to complement and eventually supplant costly and technically challenging experiments in future metagenomic studies. Future studies will be required to further validate this model using large independent datasets.

In our clinical dataset of individuals undergoing allo-HCT, we observe the 16S copy number to vary over multiple orders of magnitude (see Figure 1B), which is not surprising given the extensive antibiotic and chemotherapy exposure in this population.²⁵,²⁶ Our approach is naturally limited by the dynamic range of our method, potentially failing for samples with extreme host contamination or dilution, such as diarrhea. Nonetheless, this wide dynamic range is a strength of our study compared to the previous machine learning approach for absolute prokaryotic abundance prediction,²² which used the data from healthy or less heavily treated participants as training data, varying over only a single order of magnitude.

However, compared to the study from Nishijima et al.,²² our work has a notable limitation: Nishijima and colleagues used their taxonomy-based machine learning model to perform a retrospective analysis of the predicted absolute prokaryotic abundance in published data and revealed that many microbiome-disease associations are potentially confounded by changes in absolute abundance. Such a retrospective analysis of published data is not possible for our model, as DNA concentration and the other sample measurements (such as the fraction of host reads) are not routinely part of the sample information included in publications or sequencing read archives.

Going forward, it might be useful for researchers to report the amount of stool used for extraction, the DNA concentration, and the fraction of host reads as part of the minimal metadata for their samples, going beyond currently proposed standards such as the MIxS and STORMS checklists.²⁸,²⁹ This would enable future validation or retrospective analyses of the proposed model for the prediction of absolute prokaryotic abundance.

Another limitation of our work is that the model has been trained on consistently processed stool metagenomic samples that were not stored in any preservative solution. Consequently, the model might not generalize to other studies using preservative solutions for sample storage or different DNA extraction techniques, as it cannot be ruled out that those sample processing choices alter the observed relationship between DNA concentration and absolute prokaryotic abundance.¹⁹ Somewhat allaying these concerns is the fact that the samples from the PD cohort were stored in a preservative solution and underwent a modified extraction protocol (see STAR Methods) but could be predicted with remarkable accuracy by the model trained on the allo-HCT data alone (see Figure 3). Still, additional validation is needed for studies using other preservative solutions or different DNA extraction techniques or for samples from other body sites with lower bacterial biomass, such as oral or skin samples, which typically contain more host reads than stool metagenomic samples.

Overall, we hope that the proposed model can be useful for researchers who want to estimate the absolute prokaryotic abundance of their samples based on measurements that are easily and routinely measured with metagenomic sequencing. To facilitate broader adoption of this method for estimating absolute prokaryotic abundance, the trained machine learning models, together with the underlying data, are available on GitHub for other researchers to apply to their own data.

Limitations of the study

There are several considerations for using the proposed model for the prediction of absolute prokaryotic abundance. First, further validation of the proposed model is needed, as there exists a multitude of processing options for metagenomic stool samples. For example, different storage buffers, DNA extraction protocols, or DNA concentration measurement modalities might affect the generalizability of the machine learning model. Second, it is important to keep in mind that the proposed model predicts the number of 16S copies within a sample. Further normalization for the percentage of water (or storage buffer) and the number of 16S copies per genome is required to estimate the number of prokaryotic cells per dry gram of feces. Third, the generalizability of the model to other environmental samples or host-associated microbiomes needs to be assessed, as host contamination and microbial community structure in those samples might differ significantly from those of stool samples. Finally, the proposed model will only be applicable when the absolute prokaryotic abundance is within the observed dynamic range in this study. Samples with extremely high or extremely low prokaryotic abundance will potentially lead to unreliable predictions from the machine learning model.
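The second and fourth considerations can be made concrete with a short sketch. The conversion below is one straightforward reading of the normalization described in the text, not a formula given by the authors; the function names, the water fraction, the number of 16S copies per genome, and the range bounds are all hypothetical placeholders.

```python
def cells_per_dry_gram(copies_per_extraction, wet_mass_g,
                       water_fraction, copies_per_genome):
    """Convert 16S copies per extraction to estimated cells per dry gram.

    Assumes the extracted stool mass, its water fraction, and an average
    number of 16S copies per genome are known (all hypothetical inputs).
    """
    dry_mass_g = wet_mass_g * (1.0 - water_fraction)
    cells = copies_per_extraction / copies_per_genome
    return cells / dry_mass_g


def within_training_range(log10_copies, lower, upper):
    """Flag whether a value lies inside the range covered by training data."""
    return lower <= log10_copies <= upper


# Hypothetical example: 4e9 copies from 0.1 g wet stool with 75% water,
# assuming an average of 4 copies of the 16S gene per genome.
estimate = cells_per_dry_gram(4.0e9, 0.1, 0.75, 4.0)
print(f"{estimate:.2e} cells per dry gram")

# Hypothetical bounds; the real bounds would come from the training data.
print(within_training_range(9.6, lower=4.0, upper=11.0))
```

A range check of this kind is a simple guard against applying the model to samples with prokaryotic loads it has never seen.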

Resource availability

Lead contact

Requests for further information and resources should be directed to and will be fulfilled by the lead contact, Ami S. Bhatt (asbhatt@stanford.edu).

Materials availability

This study did not generate new unique reagents.

Data and code availability

• All data tables are available under the MIT license on GitHub at www.github.com/jakob-wirbel/absolute_abundance and on Zenodo.³⁰
• All scripts to reproduce the presented analyses, as well as the trained machine learning models (as mlr3 model objects), are available under the MIT license on GitHub at www.github.com/jakob-wirbel/absolute_abundance and on Zenodo.³⁰
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Acknowledgments

We would like to thank members of the Bhatt lab for enriching discussions and valuable feedback, especially Boryana Doyle. We would also like to thank the Boehm lab at Stanford for technical assistance and access to their machines. We thank D. Solow-Cordero and S. Sim for assistance in using the Stanford High-Throughput Bioscience Center and Stanford Functional Genomics Facility, which is supported by NIH Shared Instrumentation Grants S10RR019513, S10RR026338, S10OD025004, and S10OD026899 and by an anonymous donation. We further acknowledge support from the Biostatistics Shared Resource at the Medical College of Wisconsin Cancer Center. We would like to acknowledge all members and participating clinical centers of the BMT CTN 1703/1801 trial and all participants who generously donated samples for the study. This manuscript was prepared using BMT CTN 1703/1801 Research Materials obtained from the BMT CTN Repository operated by the NMDP and does not necessarily reflect the opinions or views of the BMT CTN 1703/1801 protocol team, the BMT CTN, the NHLBI, or NCI. The PD study was supported by the Wu Tsai Neuroscience Institute and the Knight Initiative for Brain Resilience at Stanford University, and we would like to thank all participants for donating their samples. A.S.B. is supported by an Allen Distinguished Investigator award, and the Bhatt lab is supported by NIH R01AI148623 and R01AI143757 and a Stand Up 2 Cancer grant. J.W. is a Damon Runyon Quantitative Biology Fellow supported by the Damon Runyon Cancer Research Foundation (DRQ-22-24). M.C. acknowledges support from an NIH-funded predoctoral fellowship (5T32HG000044-25) and the National Defense Science and Engineering Graduate Fellowship (starting September 2022). M.-A.P., J.U.P., and T.F. received support from the NIH/NCI Cancer Center Support Grant P30 CA008748. 
Support for this study was provided by grants U10HL069294 and U24HL138660 to the Blood and Marrow Transplant Clinical Trials Network from the National Heart, Lung, and Blood Institute and the National Cancer Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. K.L.P. is supported by grants from the NIH (U19 AG065156, R01 NS107513, R01 NS115114, P30 AG066515, R01 AG081144, R21 NS132101, U01 DK140939, and R01 AG089169), the Michael J. Fox Foundation for Parkinson’s Research (grants 020756, 16921, and 18411), the Knight Initiative for Brain Resilience, the Wu Tsai Neuroscience Institute, the Lewy Body Dementia Association, the Alzheimer's Drug Discovery Foundation, and the Sue Berghoff LBD Research Fellowship.

Author contributions

L.E. and A.G. processed all stool samples for the allo-HCT study, and J.W. and M.D. performed the ddPCR experiments. For the PD study, M.C., G.Z.M.R., and I.E.P. collected and processed the stool samples and also performed the ddPCR experiments. J.W. performed data analysis and machine learning training. E.F.B., T.F., R.J., J.U.P., T.M.A., W.S., M.-A.P., L.S.K., and A.S.B. planned the BMT CTN 1801 study, and M.A.M., A.R., M.G., H.E., and L.R. oversaw recruitment in the top five accruing centers. A.S.B., L.S.K., and M.-A.P. chaired the BMT CTN 1801 study committee, and W.S. chaired the BMT CTN 1801 protocol team. B.P., K.L.P., and A.S.B. conceived and oversaw the PD study. A.S.B., K.L.P., and T.M.A. supervised the work. J.W. and A.S.B. wrote the manuscript with input from all authors. All authors read and approved the final manuscript.

Declaration of interests

M.-A.P. reports honoraria from Adicet, Allogene, Allovir, Caribou Biosciences, Celgene, Bristol-Myers Squibb, Equilium, Exevir, ImmPACT Bio, Incyte, Karyopharm, Kite/Gilead, Merck, Miltenyi Biotec, MorphoSys, Nektar Therapeutics, Novartis, Omeros, OrcaBio, Sanofi, Syncopation, VectivBio AG, and Vor Biopharma. He serves on data safety and monitoring boards (DSMBs) for Cidara Therapeutics and Sellas Life Sciences and on the scientific advisory board of NexImmune. He has ownership interests in NexImmune, Omeros, and OrcaBio. He has received institutional research support for clinical trials from Allogene, Incyte, Kite/Gilead, Miltenyi Biotec, Nektar Therapeutics, and Novartis. J.U.P. reports research funding, intellectual property fees, and travel reimbursement from Seres Therapeutics and consulting fees from DaVolterra, CSL Behring, Crestone, Inc., MaaT Pharma, Canaccord Genuity, Inc., and RA Capital. He serves on an advisory board of and holds equity in Postbiotics Plus Research. He serves on an advisory board of and holds equity in Prodigy Biosciences. He has filed intellectual property applications related to the microbiome (reference numbers #62/843,849, #62/977,908, and #15/756,845). The Memorial Sloan Kettering Cancer Center (MSK) has financial interests relative to Seres Therapeutics. K.L.P. has been on the scientific advisory board for Amprion, where she receives stock options. She has been a consultant for Novartis, Biohaven, Curasen, and Neuron23, where she receives consulting fees. A.S.B. is a founder of Stylus Medicine, serves on the scientific advisory board, and is a board observer. She also serves on the scientific advisory boards of Caribou Biosciences and Cantata Biosciences.

STAR★Methods

Key resources table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Chemicals, peptides, and recombinant proteins

Microbial Pathogen DNA Standards for Detection and Identification	NIST	#RM8376
ddPCR SuperMix	Bio-Rad	#1863024
/56-FAM/CGTATTACC/ZEN/GCGGCTGCTGGCAC/3IABkFQ/	Integrated DNA technologies	N/A

Critical commercial assays

DNA/RNA Shield-Fecal Collection Tube	Zymo Research	#R1101
biopsy punch	Fisher Scientific	#12-460-410
QIAamp PowerFecal Pro DNA Kit	QIAGEN	#51804
DNeasy PowerClean Pro	QIAGEN	#12997-50
Agilent 5400 Fragment Analyzer System	Agilent	#M5312AA
NEB Ultra II kit	NEB	#E7645L
NovaSeq 6000 platform	Illumina	#20012850
Velocity 11 Vprep	Agilent	N/A
ddPCR 96-well plates	Bio-Rad	#12001925
Auto Droplet Generator	Bio-Rad	#1864101
QX200 Droplet Reader	Bio-Rad	#1864003

Deposited data

Tables of DNA concentration, log-16S copies per extraction for reproduction of the presented results	This study	Zenodo: https://doi.org/10.5281/zenodo.14026970

Oligonucleotides

5′-TCCTACGGGAGGCAGCAGT-3′	Integrated DNA technologies	331F
5′-GGACTACCAGGGTATCTAATCCTGTT-3′	Integrated DNA technologies	797R

Software and algorithms

Custom code	This study	https://doi.org/10.5281/zenodo.14026970
QuantaSoft software v1.7.4.0917	Bio-Rad	#1864011
NextFlow v22.10.5	Di Tommaso et al. 2017³¹	https://github.com/nextflow-io/nextflow
HTStream	N/A	https://github.com/s4hts/HTStream
TrimGalore v0.6.7	N/A	https://github.com/FelixKrueger/TrimGalore
bwa v0.7.17	Li and Durbin 2009³²	https://github.com/lh3/bwa
MetaPhlAn v4.0.4	Manghi et al. 2023³³	https://github.com/biobakery/MetaPhlAn
mlr3 v0.16.0	Lang et al. 2019³⁴	https://github.com/mlr-org/mlr3

Experimental model and study participant details

Allo-HCT study cohort

The samples analyzed in this study originate from individuals recruited for the BMT CTN 1703/1801 clinical trial (NCT03959241). This trial was a multi-center, randomized controlled trial comparing two graft-versus-host-disease prophylaxis medication regimens (see Bolaños-Meade et al.²⁴ for the results of the trial). Individuals enrolled in BMT CTN 1801 provided stool samples before and at specified intervals after infusion with hematopoietic stem cells. Stool samples were collected without preservative solution and stored at 4°C within 30 min after collection and then shipped overnight to the National Marrow Donor Program, where they were aliquoted and frozen at −80°C. A subset of samples was frozen directly after collection (same-day versus next-day sample storage). In total, 2573 stool samples were collected, extracted into 96-well plates, and then sequenced using shotgun metagenomic sequencing.

For this project, six 96-well plates were randomly selected for absolute prokaryotic abundance measurements. Due to the randomization of samples onto the 96-well plates, the included subset does not have any significant skew in terms of sampling time point relative to allo-HCT (p = 0.988, permutation test with 1000 random subsets) and only a small depletion of same-day sample storage samples (p = 0.007, chi-square test).

The current study only reports on the observed association between DNA concentration and 16S copies per extraction, whereas the results of the microbiome dynamics after hematopoietic cell transplantation and participant demographic information will be reported elsewhere.

PD study cohort

A case-control study, approved by the institutional review board under IRB 62642 (principal investigator B.R.P), was conducted at the Stanford Movement Disorder Clinic and the broader California region. Informed consent was obtained from all individuals prior to sample collection. Individuals with Parkinson’s disease (PD) and healthy controls (HC) provided blood and stool samples in addition to completing an online lifestyle questionnaire. PD participants self-identified and confirmed that they had been diagnosed with Parkinson’s disease by a qualified neurologist. HC were required to have no history of neurodegenerative disease (e.g., Alzheimer’s disease, multiple sclerosis, and amyotrophic lateral sclerosis). Of the enrolled HC, 96% shared a household with a PD participant. Individuals with a history of inflammatory bowel disease and/or those actively receiving chemotherapy or immunotherapy were excluded from both study groups.

Stool samples were collected in Zymo preservative solution (Cat. No. R1101) and either shipped or hand-delivered to Stanford to be frozen.

Similarly to the allo-HCT data, the current study only reports on the observed association between DNA concentration and 16S copies per extraction; other results from sample analyses and participant demographic information will be reported elsewhere.

Method details

Stool sample processing for the allo-HCT cohort

All stool samples were processed consistently in the same laboratory to minimize batch effects. From every sample, we aimed to extract an equal amount of frozen stool using a biopsy punch (Fisher Scientific, Cat. No. 12-460-410) and determined the exact amount by weighing the sample tube before and after adding the frozen stool. This allowed us to normalize the 16S counts per DNA extraction to 16S counts per wet gram of stool (see Figure S1).
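The normalization to wet weight can be illustrated with a short Python sketch. It assumes that the stool mass is simply the difference between the two tube weights and that both are recorded in grams; the function and argument names are hypothetical.

```python
def stool_mass_g(tube_before_g, tube_after_g):
    """Wet stool mass as the difference of the tube weights (in grams)."""
    mass = tube_after_g - tube_before_g
    if mass <= 0:
        raise ValueError("tube weight after adding stool must exceed the weight before")
    return mass


def copies_per_wet_gram(copies_per_extraction, tube_before_g, tube_after_g):
    """Normalize 16S copies per DNA extraction to copies per wet gram of stool."""
    return copies_per_extraction / stool_mass_g(tube_before_g, tube_after_g)
```

For example, 2 × 10⁹ copies per extraction obtained from 0.25 g of stool correspond to 8 × 10⁹ copies per wet gram.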

DNA was extracted using the QIAamp PowerFecal Pro DNA Kit (QIAGEN; Cat. No. 51804) without modifications to the manufacturer’s instruction and eluted into 100μL of elution buffer.

Stool sample processing for the PD cohort

All stool samples were processed consistently in the same laboratory to minimize batch effects. Since stool samples were stored in Zymo preservative solution, an equal amount of stool (200μL) was extracted from each vial using a pipette. The exact amount was determined by weighing the sample tubes before and after adding the stool.

For DNA extraction, we used the QIAamp PowerFecal Pro DNA Kit (QIAGEN; Cat. No. 51804), but with the following modifications: the CD2 inhibitor removal step was excluded, as per the recommendations of Zymo. In addition, we further treated the samples with the Qiagen DNeasy PowerClean Pro Cleanup kit (Cat. No. 12997-50) and eluted to a final volume of 50 μL.

Metagenomic shotgun sequencing and data processing

DNA concentration of the extracted DNA was measured using an Agilent 5400 Fragment Analyzer System (Agilent, Cat. No. M5312AA) as part of the quality control prior to sequencing.

Metagenomic libraries were prepared using the NEB Ultra II kit (NEB; Cat. No. E7645L) according to the manufacturer’s instructions. Libraries were pooled and 2 × 150 bp reads were generated using the NovaSeq 6000 platform (Illumina; Cat. No. 20012850).

All raw reads were processed with the Nextflow pipeline available under https://github.com/bhattlab/bhattlab_workflows_nf, using Nextflow v22.10.5.³¹ In short, reads were deduplicated with HTStream SuperDeduper v1.3.3 and low-quality bases were trimmed with TrimGalore v0.6.7. Then, reads were mapped against the human genome (hg38) using bwa v0.7.17³² and all matching reads were discarded. Finally, preprocessed reads were used for metagenomic profiling with MetaPhlAn v4.0.4,³³ since it also allows for profiling of the eukaryotic fraction in a sample. The fraction of host reads is one of the outputs of the Nextflow workflow. All parameters for the described tools are recorded in the Nextflow parameter file available under www.github.com/jakob-wirbel/absolute_abundance.

Quantification of 16S copies via ddPCR

For quantification of 16S copies, the protocol detailed in Doyle et al.²³ was followed. In short, 5 μL of sample was diluted into nuclease-free water in five 1:10 steps (transferring 5 μL into 45 μL of nuclease-free water at each step) using a liquid handler (Agilent, Velocity 11 Vprep) in a 96-well plate format. The last column was reserved for four negative controls (nuclease-free water) and four positive controls (Microbial Pathogen DNA Standards for Detection and Identification, NIST, Cat. No. RM8376). Then, 6 μL of the diluted sample was transferred into ddPCR 96-well plates (Bio-Rad, Cat. No. 12001925) and mixed with 16 μL of ddPCR mastermix (1× ddPCR SuperMix [Bio-Rad, Cat. No. 1863024], primer-probe mixture with a final concentration of 0.4 μM, and nuclease-free water). The primer-probe mixture consisted of equal parts probe (HPLC-purified FAM probe from Integrated DNA Technologies, sequence: /56-FAM/CGTATTACC/ZEN/GCGGCTGCTGGCAC/3IABkFQ/) and 16S primers (331F: 5′-TCCTACGGGAGGCAGCAGT-3′; 797R: 5′-GGACTACCAGGGTATCTAATCCTGTT-3′), all at 100 μM concentration. After sealing the plate, the wells were mixed thoroughly by vortexing each row and column of the plate for at least 5 s. Droplets were generated with the Auto Droplet Generator (Bio-Rad, Cat. No. 1864101) according to the manufacturer’s instructions, and the new plate was sealed again before running the following thermocycler program: 95°C for 10 min, followed by 40 cycles of 95°C for 30 s, 56°C for 1 min, and 72°C for 2 min. The program ended with 4°C for 5 min, followed by 95°C for 5 min. After the PCR reaction had finished, the plate was analyzed using the QX200 Droplet Reader (Bio-Rad, Cat. No. 1864003) and droplets were quantified using the QuantaSoft software (v1.7.4.0917).

Samples were rejected if they contained fewer than 1000 accepted droplets (n = 10). The number of 16S copies per reaction was calculated as follows:

$$\text{16S copies per reaction} = \log\left(\frac{N_{\text{droplets}}}{N_{\text{negative droplets}}}\right) \times \frac{1}{0.795\,\text{nL}} \times \frac{1000\,\text{nL}}{1\,\mu\text{L}} \times 22\,\mu\text{L}$$

using the average droplet size of 0.795 nL and the reaction volume of 22 μL, as recommended in Doyle et al.²³ The number of 16S copies per DNA extraction was then calculated as follows:

$$\text{16S copies per extraction} = \text{16S copies per reaction} \times 10^{5} \times \frac{1}{5} \times 100$$

to account for the dilution factor, the 5 μL of sample initially used for the dilution, and the overall volume of the DNA extraction.
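The two calculations above can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: it assumes that "log" denotes the natural logarithm (as in the usual Poisson correction for droplet counts), and the function names, constants, and default arguments are hypothetical labels for the values given in the text.

```python
import math

DROPLET_VOLUME_NL = 0.795    # average droplet size stated in the text
REACTION_VOLUME_UL = 22.0    # 6 uL diluted sample + 16 uL mastermix
MIN_ACCEPTED_DROPLETS = 1000  # rejection threshold stated in the text


def copies_per_reaction(n_droplets, n_negative_droplets):
    """16S copies per ddPCR reaction from total and negative droplet counts."""
    if n_droplets < MIN_ACCEPTED_DROPLETS:
        raise ValueError("fewer than 1000 accepted droplets; sample is rejected")
    if not 0 < n_negative_droplets <= n_droplets:
        raise ValueError("negative droplets must be between 1 and the total count")
    # Assumption: natural logarithm (mean copies per droplet under a Poisson model).
    copies_per_droplet = math.log(n_droplets / n_negative_droplets)
    copies_per_ul = copies_per_droplet / DROPLET_VOLUME_NL * 1000.0  # nL -> uL
    return copies_per_ul * REACTION_VOLUME_UL


def copies_per_extraction(per_reaction, dilution_factor=10**5,
                          sample_volume_ul=5.0, extraction_volume_ul=100.0):
    """Scale copies per reaction to copies per DNA extraction, as in the text."""
    return per_reaction * dilution_factor / sample_volume_ul * extraction_volume_ul
```

With the default arguments, each 16S copy detected per reaction corresponds to 2 × 10⁶ copies per DNA extraction. The extraction volume is kept as a parameter because the two cohorts were eluted into different volumes.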

Comparison to other absolute abundance studies

To compare absolute abundance to other published datasets, we downloaded the absolute bacterial counts per wet gram of stool as available in the supplementary material of each included study.¹⁹^,²⁰^,²¹^,²²^,²⁵ When comparing data across those studies, we noticed that the values from Rolling et al.²⁵ were three orders of magnitude lower than those of the other studies, potentially reflecting differences in normalization procedures, so we multiplied them by a factor of 10³.

Machine learning approach

To train machine learning models, we used the random forest regression model from the ranger package v0.15.1³⁵ available through the mlr3 v0.16.0 meta-package.³⁴ Input features consisted of the DNA concentration measurements described above and, for the full model, the following additional sample measurements: the fraction of host reads among all reads passing quality control (see above), prokaryotic alpha diversity estimated using the Shannon index (from the diversity function in vegan v2.6-4³⁶), kingdom-level relative abundance values from MetaPhlAn 4,³³ and the type of sample storage (same-day versus next-day freezing).
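For reference, the Shannon index used as an input feature can be written as H = −Σ pᵢ ln pᵢ over the relative abundances pᵢ. The Python sketch below uses the natural logarithm, which to our understanding matches the default of the diversity function in vegan; the function name is hypothetical and the code is not taken from the authors' scripts.

```python
import math


def shannon_index(abundances):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over non-zero abundances."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)
```

A community of four equally abundant taxa yields H = ln 4, whereas a sample dominated by a single taxon yields H = 0.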

Models were trained using a ten-times repeated 10-fold cross-validation strategy to avoid over-optimistic test error evaluation, using the resample function from mlr3. Within each fold, the hyper-parameters mtry and num.trees for the random forest model were optimized in an internal 3-fold cross-validation. Predictions were averaged across the ten cross-validation repeats for global model evaluation. Prediction quality was evaluated using Pearson’s r or Spearman’s rho between measured and predicted values, as well as the R² and mean-squared error measures (both from mlr3measures). Additionally, the concordance correlation coefficient (CCC³⁷) was computed using the epiR package (v2.0.76).
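To make the evaluation measures concrete, the sketch below computes the mean-squared error, R², Pearson's r, and the concordance correlation coefficient in plain Python. It assumes R² is defined as one minus the ratio of residual to total sum of squares and that the CCC uses population (1/n) moments as in Lin's original definition; the exact conventions in mlr3measures and epiR may differ in detail, and all function names are illustrative.

```python
import math
from statistics import fmean


def mean_squared_error(truth, pred):
    return fmean((t - p) ** 2 for t, p in zip(truth, pred))


def r_squared(truth, pred):
    """1 - SS_residual / SS_total (assumed definition)."""
    mean_truth = fmean(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
    ss_tot = sum((t - mean_truth) ** 2 for t in truth)
    return 1.0 - ss_res / ss_tot


def _moments(x, y):
    """Means, population variances, and population covariance of x and y."""
    mx, my = fmean(x), fmean(y)
    sxx = fmean((a - mx) ** 2 for a in x)
    syy = fmean((b - my) ** 2 for b in y)
    sxy = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, syy, sxy


def pearson_r(x, y):
    _, _, sxx, syy, sxy = _moments(x, y)
    return sxy / math.sqrt(sxx * syy)


def concordance_correlation(x, y):
    """Lin's CCC: penalizes both scatter and systematic shift from the identity line."""
    mx, my, sxx, syy, sxy = _moments(x, y)
    return 2.0 * sxy / (sxx + syy + (mx - my) ** 2)
```

Unlike Pearson's r, the CCC drops below 1 when predictions are perfectly correlated with the measurements but systematically shifted, which is why it complements the correlation measures.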

For the leave-one-plate out validation, models were trained in the same way on data from all but one plate and were then applied to the left-out plate for external evaluation. Model predictions were averaged across all models.
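The leave-one-plate-out scheme can be sketched in Python as follows. To keep the example self-contained, a one-feature least-squares line stands in for the random forest; this substitution, the dictionary keys, and the function names are assumptions made purely for illustration.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept (stand-in model)."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def leave_one_plate_out(samples):
    """Train on all plates but one and predict the held-out plate.

    samples: list of dicts with hypothetical keys 'plate', 'x' (feature),
    and 'y' (measured value). Returns {plate: [predictions]}.
    """
    predictions = {}
    for held_out in sorted({s["plate"] for s in samples}):
        train = [s for s in samples if s["plate"] != held_out]
        test = [s for s in samples if s["plate"] == held_out]
        slope, intercept = fit_line([s["x"] for s in train], [s["y"] for s in train])
        predictions[held_out] = [slope * s["x"] + intercept for s in test]
    return predictions
```

Because no sample from the held-out plate is seen during training, plate-specific batch effects cannot leak into the model being evaluated.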

Feature importance was extracted using the impurity estimator available from the ranger package, normalized to relative importance (dividing by the summed importance over all predictors), and lastly averaged across cross-validation folds.
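The normalization and averaging steps can be illustrated with a brief Python sketch. The input is assumed to be one dictionary of raw impurity importances per cross-validation fold; the feature names and function names are hypothetical.

```python
def relative_importance(raw_importance):
    """Divide each predictor's importance by the sum over all predictors."""
    total = sum(raw_importance.values())
    return {feature: value / total for feature, value in raw_importance.items()}


def mean_relative_importance(per_fold_importance):
    """Average the relative importance of each predictor across folds."""
    relative = [relative_importance(fold) for fold in per_fold_importance]
    features = relative[0].keys()
    return {f: sum(r[f] for r in relative) / len(relative) for f in features}
```

Normalizing within each fold before averaging keeps folds with larger raw impurity values from dominating the averaged importance.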

Quantification and statistical analysis

All statistical analyses were performed using R v4.2.2. Significance was defined as a p-value lower than 0.05. All plots were generated using ggplot2 v3.4.2 as part of the tidyverse v2.0.0 suite of tools.³⁸

Additional resources

Additional information regarding the BMT CTN 1801 clinical trial can be found under: https://clinicaltrials.gov/study/NCT03959241.

Published: April 28, 2025

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.crmeth.2025.101030.

Contributor Information

Jakob Wirbel, Email: wirbel@stanford.edu.

Ami S. Bhatt, Email: asbhatt@stanford.edu.

Supplemental information

Document S1. Figures S1–S4

mmc1.pdf (856.7 KB, pdf)

Document S2. Article plus supplemental information

mmc2.pdf (5.8 MB, pdf)

References

1. Schmidt T.S.B., Raes J., Bork P. The Human Gut Microbiome: From Association to Modulation. Cell. 2018;172:1198–1215. doi: 10.1016/j.cell.2018.02.044.
2. Franzosa E.A., Sirota-Madi A., Avila-Pacheco J., Fornelos N., Haiser H.J., Reinker S., Vatanen T., Hall A.B., Mallick H., McIver L.J., et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat. Microbiol. 2019;4:293–305. doi: 10.1038/s41564-018-0306-4.
3. Lloyd-Price J., Arze C., Ananthakrishnan A.N., Schirmer M., Avila-Pacheco J., Poon T.W., Andrews E., Ajami N.J., Bonham K.S., Brislawn C.J., et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature. 2019;569:655–662. doi: 10.1038/s41586-019-1237-9.
4. Thomas A.M., Manghi P., Asnicar F., Pasolli E., Armanini F., Zolfo M., Beghini F., Manara S., Karcher N., Pozzi C., et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat. Med. 2019;25:667–678. doi: 10.1038/s41591-019-0405-7.
5. Wirbel J., Pyl P.T., Kartal E., Zych K., Kashani A., Milanese A., Fleck J.S., Voigt A.Y., Palleja A., Ponnudurai R., et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat. Med. 2019;25:679–689. doi: 10.1038/s41591-019-0406-6.
6. Bedarf J.R., Hildebrand F., Coelho L.P., Sunagawa S., Bahram M., Goeser F., Bork P., Wüllner U. Functional implications of microbial and viral gut metagenome changes in early stage L-DOPA-naïve Parkinson’s disease patients. Genome Med. 2017;9:39. doi: 10.1186/s13073-017-0428-y.
7. Romano S., Savva G.M., Bedarf J.R., Charles I.G. Meta-analysis of the Parkinson’s disease gut microbiome suggests alterations linked to intestinal inflammation. NPJ Parkinsons Dis. 2021;7:27. doi: 10.1038/s41531-021-00156-z.
8. Peled J.U., Gomes A.L.C., Devlin S.M., Littmann E.R., Taur Y., Sung A.D., Weber D., Hashimoto D., Slingerland A.E., Slingerland J.B., et al. Microbiota as predictor of mortality in allogeneic hematopoietic-cell transplantation. N. Engl. J. Med. 2020;382:822–834. doi: 10.1056/NEJMoa1900623.
9. Gloor G.B., Macklaim J.M., Pawlowsky-Glahn V., Egozcue J.J. Microbiome Datasets Are Compositional: And This Is Not Optional. Front. Microbiol. 2017;8:2224. doi: 10.3389/fmicb.2017.02224.
10. Morton J.T., Marotz C., Washburne A., Silverman J., Zaramela L.S., Edlund A., Zengler K., Knight R. Establishing microbial composition measurement standards with reference frames. Nat. Commun. 2019;10:2719. doi: 10.1038/s41467-019-10656-5.
11. Kleine Bardenhorst S., Berger T., Klawonn F., Vital M. Data Analysis Strategies for Microbiome Studies in Human Populations—a Systematic Review of Current Practice. mSystems. 2021;6. doi: 10.1128/mSystems.01154-20.
12. Fernandes A.D., Macklaim J.M., Linn T.G., Reid G., Gloor G.B. ANOVA-like differential expression (ALDEx) analysis for mixed population RNA-Seq. PLoS One. 2013;8. doi: 10.1371/journal.pone.0067019.
13. Mandal S., Van Treuren W., White R.A., Eggesbø M., Knight R., Peddada S.D. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 2015;26. doi: 10.3402/mehd.v26.27663.
14. Stämmler F., Gläsner J., Hiergeist A., Holler E., Weber D., Oefner P.J., Gessner A., Spang R. Adjusting microbiome profiles for differences in microbial load by spike-in bacteria. Microbiome. 2016;4:28. doi: 10.1186/s40168-016-0175-0.
15. Rao C., Coyte K.Z., Bainter W., Geha R.S., Martin C.R., Rakoff-Nahoum S. Multi-kingdom ecological drivers of microbiota assembly in preterm infants. Nature. 2021;591:633–638. doi: 10.1038/s41586-021-03241-8.
16. Barlow J.T., Bogatyrev S.R., Ismagilov R.F. A quantitative sequencing framework for absolute abundance measurements of mucosal and lumenal microbial communities. Nat. Commun. 2020;11:2590. doi: 10.1038/s41467-020-16224-6.
17. Jian C., Luukkonen P., Yki-Järvinen H., Salonen A., Korpela K. Quantitative PCR provides a simple and accessible method for quantitative microbiota profiling. PLoS One. 2020;15. doi: 10.1371/journal.pone.0227285.
18. Langenfeld K., Chin K., Roy A., Wigginton K., Duhaime M.B. Comparison of ultrafiltration and iron chloride flocculation in the preparation of aquatic viromes from contrasting sample types. PeerJ. 2021;9. doi: 10.7717/peerj.11111.
19. Maghini D.G., Dvorak M., Dahlen A., Roos M., Doyle B., Kuersten S., Bhatt A.S. Quantifying bias introduced by sample collection in relative and absolute microbiome measurements. Nat. Biotechnol. 2024;42:328–338. doi: 10.1038/s41587-023-01754-3.
20. Vandeputte D., Kathagen G., D’hoe K., Vieira-Silva S., Valles-Colomer M., Sabino J., Wang J., Tito R.Y., De Commer L., Darzi Y., et al. Quantitative microbiome profiling links gut community variation to microbial load. Nature. 2017;551:507–511. doi: 10.1038/nature24460.
21. Vandeputte D., De Commer L., Tito R.Y., Kathagen G., Sabino J., Vermeire S., Faust K., Raes J. Temporal variability in quantitative human gut microbiome profiles and implications for clinical research. Nat. Commun. 2021;12:6740. doi: 10.1038/s41467-021-27098-7.
22. Nishijima S., Stankevic E., Aasmets O., Schmidt T.S.B., Nagata N., Keller M.I., Ferretti P., Juel H.B., Fullam A., Robbani S.M., et al. Fecal microbial load is a major determinant of gut microbiome variation and a confounder for disease associations. Cell. 2025;188:222–236.e15. doi: 10.1016/j.cell.2024.10.022.
23. Doyle B., Reynolds G.Z.M., Dvorak M., Maghini D.G., Natarajan A., Bhatt A.S. Absolute quantification of prokaryotes in the microbiome by 16S rRNA qPCR or ddPCR. Nat. Protoc. 2025. doi: 10.1038/s41596-025-01165-5.
24. Bolaños-Meade J., Hamadani M., Wu J., Al Malki M.M., Martens M.J., Runaas L., Elmariah H., Rezvani A.R., Gooptu M., Larkin K.T., et al. Post-transplantation cyclophosphamide-based graft-versus-host disease prophylaxis. N. Engl. J. Med. 2023;388:2338–2348. doi: 10.1056/NEJMoa2215943.
25. Rolling T., Zhai B., Gjonbalaj M., Tosini N., Yasuma-Mitobe K., Fontana E., Amoretti L.A., Wright R.J., Ponce D.M., Perales M.A., et al. Haematopoietic cell transplantation outcomes are linked to intestinal mycobiota dynamics and an expansion of Candida parapsilosis complex species. Nat. Microbiol. 2021;6:1505–1515. doi: 10.1038/s41564-021-00989-7.
26. Morjaria S., Schluter J., Taylor B.P., Littmann E.R., Carter R.A., Fontana E., Peled J.U., van den Brink M.R.M., Xavier J.B., Taur Y. Antibiotic-induced shifts in fecal Microbiota density and composition during hematopoietic stem cell transplantation. Infect. Immun. 2019;87. doi: 10.1128/IAI.00206-19.
27. Tito R.Y., Verbandt S., Aguirre Vazquez M., Lahti L., Verspecht C., Lloréns-Rico V., Vieira-Silva S., Arts J., Falony G., Dekker E., et al. Microbiome confounders and quantitative profiling challenge predicted microbial targets in colorectal cancer development. Nat. Med. 2024;30:1339–1348. doi: 10.1038/s41591-024-02963-2.
28. Yilmaz P., Kottmann R., Field D., Knight R., Cole J.R., Amaral-Zettler L., Gilbert J.A., Karsch-Mizrachi I., Johnston A., Cochrane G., et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat. Biotechnol. 2011;29:415–420. doi: 10.1038/nbt.1823.
29. Mirzayi C., Renson A., Genomic Standards Consortium. Massive Analysis and Quality Control Society. Zohra F., Elsafoury S., Geistlinger L., Kasselman L.J., Eckenrode K., van de Wijgert J., et al. Reporting guidelines for human microbiome research: the STORMS checklist. Nat. Med. 2021;27:1885–1892. doi: 10.1038/s41591-021-01552-x.
30. Wirbel J., Bhatt A.S. Zenodo; 2025. Data and Code for “Accurate Prediction of Absolute Prokaryotic Abundance from DNA concentration.”
31. Di Tommaso P., Chatzou M., Floden E.W., Barja P.P., Palumbo E., Notredame C. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 2017;35:316–319. doi: 10.1038/nbt.3820.
32. Li H., Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
33. Manghi P., Blanco-Míguez A., Manara S., NabiNejad A., Cumbo F., Beghini F., Armanini F., Golzato D., Huang K.D., Thomas A.M., et al. MetaPhlAn 4 profiling of unknown species-level genome bins improves the characterization of diet-associated microbiome changes in mice. Cell Rep. 2023;42. doi: 10.1016/j.celrep.2023.112464.
34. Lang M., Binder M., Richter J., Schratz P., Pfisterer F., Coors S., Au Q., Casalicchio G., Kotthoff L., Bischl B. mlr3: A modern object-oriented machine learning framework in R. J. Open Source Softw. 2019;4:1903.
35. Wright M.N., Ziegler A. Ranger: A fast implementation of random forests for high dimensional data in C++ and R. J. Stat. Software. 2017;77:1–17. doi: 10.18637/jss.v077.i01.
36. Oksanen J., Blanchet F.G., Kindt R., Legendre P., Minchin P.R., O’hara R.B., Simpson G.L., Solymos P., Stevens M.H.H., Wagner H., et al. Package “vegan.” Community Ecology Package, version 2. 2013:1–295.
37. Lin L.I. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989;45:255–268.
38. Wickham H., Averick M., Bryan J., Chang W., McGowan L., François R., Grolemund G., Hayes A., Henry L., Hester J., et al. Welcome to the tidyverse. J. Open Source Softw. 2019;4:1686.

Associated Data

Data Availability Statement

• All data tables are available under the MIT license on GitHub at www.github.com/jakob-wirbel/absolute_abundance and on Zenodo.³⁰
• All scripts to reproduce the presented analyses, as well as the trained machine learning models (as mlr3 model objects), are available under the MIT license on GitHub at www.github.com/jakob-wirbel/absolute_abundance and on Zenodo.³⁰
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

[bib1] 1.Schmidt T.S.B., Raes J., Bork P. The Human Gut Microbiome: From Association to Modulation. Cell. 2018;172:1198–1215. doi: 10.1016/j.cell.2018.02.044. [DOI] [PubMed] [Google Scholar]

[bib2] 2.Franzosa E.A., Sirota-Madi A., Avila-Pacheco J., Fornelos N., Haiser H.J., Reinker S., Vatanen T., Hall A.B., Mallick H., McIver L.J., et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat. Microbiol. 2019;4:293–305. doi: 10.1038/s41564-018-0306-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] 3.Lloyd-Price J., Arze C., Ananthakrishnan A.N., Schirmer M., Avila-Pacheco J., Poon T.W., Andrews E., Ajami N.J., Bonham K.S., Brislawn C.J., et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature. 2019;569:655–662. doi: 10.1038/s41586-019-1237-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] 4.Thomas A.M., Manghi P., Asnicar F., Pasolli E., Armanini F., Zolfo M., Beghini F., Manara S., Karcher N., Pozzi C., et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat. Med. 2019;25:667–678. doi: 10.1038/s41591-019-0405-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] 5.Wirbel J., Pyl P.T., Kartal E., Zych K., Kashani A., Milanese A., Fleck J.S., Voigt A.Y., Palleja A., Ponnudurai R., et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat. Med. 2019;25:679–689. doi: 10.1038/s41591-019-0406-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] 6.Bedarf J.R., Hildebrand F., Coelho L.P., Sunagawa S., Bahram M., Goeser F., Bork P., Wüllner U. Functional implications of microbial and viral gut metagenome changes in early stage L-DOPA-naïve Parkinson’s disease patients. Genome Med. 2017;9:39. doi: 10.1186/s13073-017-0428-y. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] 7.Romano S., Savva G.M., Bedarf J.R., Charles I.G. Meta-analysis of the Parkinson’s disease gut microbiome suggests alterations linked to intestinal inflammation. NPJ Parkinsons Dis. 2021;7:27. doi: 10.1038/s41531-021-00156-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] 8.Peled J.U., Gomes A.L.C., Devlin S.M., Littmann E.R., Taur Y., Sung A.D., Weber D., Hashimoto D., Slingerland A.E., Slingerland J.B., et al. Microbiota as predictor of mortality in allogeneic hematopoietic-cell transplantation. N. Engl. J. Med. 2020;382:822–834. doi: 10.1056/NEJMoa1900623. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib9] 9.Gloor G.B., Macklaim J.M., Pawlowsky-Glahn V., Egozcue J.J. Microbiome Datasets Are Compositional: And This Is Not Optional. Front. Microbiol. 2017;8:2224. doi: 10.3389/fmicb.2017.02224. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] 10.Morton J.T., Marotz C., Washburne A., Silverman J., Zaramela L.S., Edlund A., Zengler K., Knight R. Establishing microbial composition measurement standards with reference frames. Nat. Commun. 2019;10:2719. doi: 10.1038/s41467-019-10656-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib11] 11.Kleine Bardenhorst S., Berger T., Klawonn F., Vital M. Data Analysis Strategies for Microbiome Studies in Human Populations—a Systematic Review of Current Practice. mSystems. 2021;6 doi: 10.1128/mSystems.01154-20. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] 12.Fernandes A.D., Macklaim J.M., Linn T.G., Reid G., Gloor G.B. ANOVA-like differential expression (ALDEx) analysis for mixed population RNA-Seq. PLoS One. 2013;8 doi: 10.1371/journal.pone.0067019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] 13.Mandal S., Van Treuren W., White R.A., Eggesbø M., Knight R., Peddada S.D. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 2015;26 doi: 10.3402/mehd.v26.27663. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib14] 14.Stämmler F., Gläsner J., Hiergeist A., Holler E., Weber D., Oefner P.J., Gessner A., Spang R. Adjusting microbiome profiles for differences in microbial load by spike-in bacteria. Microbiome. 2016;4:28. doi: 10.1186/s40168-016-0175-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib15] 15.Rao C., Coyte K.Z., Bainter W., Geha R.S., Martin C.R., Rakoff-Nahoum S. Multi-kingdom ecological drivers of microbiota assembly in preterm infants. Nature. 2021;591:633–638. doi: 10.1038/s41586-021-03241-8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib16] 16.Barlow J.T., Bogatyrev S.R., Ismagilov R.F. A quantitative sequencing framework for absolute abundance measurements of mucosal and lumenal microbial communities. Nat. Commun. 2020;11:2590. doi: 10.1038/s41467-020-16224-6. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib17] 17.Jian C., Luukkonen P., Yki-Järvinen H., Salonen A., Korpela K. Quantitative PCR provides a simple and accessible method for quantitative microbiota profiling. PLoS One. 2020;15 doi: 10.1371/journal.pone.0227285. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib18] 18.Langenfeld K., Chin K., Roy A., Wigginton K., Duhaime M.B. Comparison of ultrafiltration and iron chloride flocculation in the preparation of aquatic viromes from contrasting sample types. PeerJ. 2021;9 doi: 10.7717/peerj.11111. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] 19.Maghini D.G., Dvorak M., Dahlen A., Roos M., Doyle B., Kuersten S., Bhatt A.S. Quantifying bias introduced by sample collection in relative and absolute microbiome measurements. Nat. Biotechnol. 2024;42:328–338. doi: 10.1038/s41587-023-01754-3. [DOI] [PubMed] [Google Scholar]

[bib20] 20.Vandeputte D., Kathagen G., D’hoe K., Vieira-Silva S., Valles-Colomer M., Sabino J., Wang J., Tito R.Y., De Commer L., Darzi Y., et al. Quantitative microbiome profiling links gut community variation to microbial load. Nature. 2017;551:507–511. doi: 10.1038/nature24460. [DOI] [PubMed] [Google Scholar]

[bib21] 21.Vandeputte D., De Commer L., Tito R.Y., Kathagen G., Sabino J., Vermeire S., Faust K., Raes J. Temporal variability in quantitative human gut microbiome profiles and implications for clinical research. Nat. Commun. 2021;12:6740. doi: 10.1038/s41467-021-27098-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib22] 22.Nishijima S., Stankevic E., Aasmets O., Schmidt T.S.B., Nagata N., Keller M.I., Ferretti P., Juel H.B., Fullam A., Robbani S.M., et al. Fecal microbial load is a major determinant of gut microbiome variation and a confounder for disease associations. Cell. 2025;188:222–236.e15. doi: 10.1016/j.cell.2024.10.022. [DOI] [PubMed] [Google Scholar]

[bib24] 23.Doyle B., Reynolds G.Z.M., Dvorak M., Maghini D.G., Natarajan A., Bhatt A.S. Absolute quantification of prokaryotes in the microbiome by 16S rRNA qPCR or ddPCR. Nat. Protoc. 2025 doi: 10.1038/s41596-025-01165-5. [DOI] [PubMed] [Google Scholar]

[bib25] 24.Bolaños-Meade J., Hamadani M., Wu J., Al Malki M.M., Martens M.J., Runaas L., Elmariah H., Rezvani A.R., Gooptu M., Larkin K.T., et al. Post-transplantation cyclophosphamide-based graft-versus-host disease prophylaxis. N. Engl. J. Med. 2023;388:2338–2348. doi: 10.1056/NEJMoa2215943. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib23] 25.Rolling T., Zhai B., Gjonbalaj M., Tosini N., Yasuma-Mitobe K., Fontana E., Amoretti L.A., Wright R.J., Ponce D.M., Perales M.A., et al. Haematopoietic cell transplantation outcomes are linked to intestinal mycobiota dynamics and an expansion of Candida parapsilosis complex species. Nat. Microbiol. 2021;6:1505–1515. doi: 10.1038/s41564-021-00989-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib26] 26.Morjaria S., Schluter J., Taylor B.P., Littmann E.R., Carter R.A., Fontana E., Peled J.U., van den Brink M.R.M., Xavier J.B., Taur Y. Antibiotic-induced shifts in fecal Microbiota density and composition during hematopoietic stem cell transplantation. Infect. Immun. 2019;87 doi: 10.1128/IAI.00206-19. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib27] 27.Tito R.Y., Verbandt S., Aguirre Vazquez M., Lahti L., Verspecht C., Lloréns-Rico V., Vieira-Silva S., Arts J., Falony G., Dekker E., et al. Microbiome confounders and quantitative profiling challenge predicted microbial targets in colorectal cancer development. Nat. Med. 2024;30:1339–1348. doi: 10.1038/s41591-024-02963-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib28] 28.Yilmaz P., Kottmann R., Field D., Knight R., Cole J.R., Amaral-Zettler L., Gilbert J.A., Karsch-Mizrachi I., Johnston A., Cochrane G., et al. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications. Nat. Biotechnol. 2011;29:415–420. doi: 10.1038/nbt.1823. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib29] 29.Mirzayi C., Renson A., Genomic Standards Consortium. Massive Analysis and Quality Control Society. Zohra F., Elsafoury S., Geistlinger L., Kasselman L.J., Eckenrode K., van de Wijgert J., et al. Reporting guidelines for human microbiome research: the STORMS checklist. Nat. Med. 2021;27:1885–1892. doi: 10.1038/s41591-021-01552-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib30] 30.Wirbel J., Bhatt A.S. Zenodo; 2025. Data and Code for “Accurate Prediction of Absolute Prokaryotic Abundance from DNA concentration.”. [DOI] [PubMed] [Google Scholar]

[bib31] 31.Di Tommaso P., Chatzou M., Floden E.W., Barja P.P., Palumbo E., Notredame C. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 2017;35:316–319. doi: 10.1038/nbt.3820. [DOI] [PubMed] [Google Scholar]

[bib32] 32.Li H., Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib33] 33.Manghi P., Blanco-Míguez A., Manara S., NabiNejad A., Cumbo F., Beghini F., Armanini F., Golzato D., Huang K.D., Thomas A.M., et al. MetaPhlAn 4 profiling of unknown species-level genome bins improves the characterization of diet-associated microbiome changes in mice. Cell Rep. 2023;42 doi: 10.1016/j.celrep.2023.112464. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib34] 34.Lang M., Binder M., Richter J., Schratz P., Pfisterer F., Coors S., Au Q., Casalicchio G., Kotthoff L., Bischl B. mlr3: A modern object-oriented machine learning framework in R. J. Open Source Softw. 2019;4:1903. [Google Scholar]

[bib35] 35.Wright M.N., Ziegler A. Ranger: A fast implementation of random forests for high dimensional data in C++ and R. J. Stat. Software. 2017;77:1–17. doi: 10.18637/jss.v077.i01. [DOI] [Google Scholar]

36. Oksanen J., Blanchet F.G., Kindt R., Legendre P., Minchin P.R., O’Hara R.B., Simpson G.L., Solymos P., Stevens M.H.H., Wagner H., et al. Package “vegan.” Community Ecology Package, version 2. 2013:1–295.

37. Lin L.I. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989;45:255–268.

38. Wickham H., Averick M., Bryan J., Chang W., McGowan L., François R., Grolemund G., Hayes A., Henry L., Hester J., et al. Welcome to the tidyverse. J. Open Source Softw. 2019;4:1686.
