PLOS One. 2023 Jul 21;18(7):e0288926. doi: 10.1371/journal.pone.0288926

A macroecological perspective on genetic diversity in the human gut microbiome

William R Shoemaker
Editor: Karthik Raman
PMCID: PMC10361512  PMID: 37478102

Abstract

While the human gut microbiome has been intensely studied, we have yet to obtain a sufficient understanding of the genetic diversity that it harbors. Research efforts have demonstrated that a considerable fraction of within-host genetic variation in the human gut is driven by the ecological dynamics of co-occurring strains belonging to the same species, suggesting that an ecological lens may provide insight into empirical patterns of genetic diversity. Indeed, an ecological model of self-limiting growth and environmental noise known as the Stochastic Logistic Model (SLM) was recently shown to successfully predict the temporal dynamics of strains within a single human host. However, its ability to predict patterns of genetic diversity across human hosts has yet to be tested. In this manuscript I determine whether the predictions of the SLM explain patterns of genetic diversity across unrelated human hosts for 22 common microbial species. Specifically, the stationary distribution of the SLM explains the distribution of allele frequencies across hosts and predicts the fraction of hosts harboring a given allele (i.e., prevalence) for a considerable fraction of sites. The accuracy of the SLM was correlated with independent estimates of strain structure, suggesting that patterns of genetic diversity in the gut microbiome follow statistically similar forms across human hosts due to the existence of strain-level ecology.

Introduction

The human gut microbiome harbors astounding levels of genetic diversity. Hundreds to thousands of species continually reproduce in a typical host, accruing a total of ∼10^9 de novo mutations each day [1]. Due to the comparatively brief generation time of microbes in the human gut [2], those mutations that are beneficial can rapidly fix on a timescale of days to months [1, 3–9]. Such evolutionary dynamics have the capacity to alter the genetic composition of a species within a given host. However, while all genetic diversity ultimately arises due to mutation, this fact does not mean that all the genetic variants observed in the human gut are necessarily subject to evolutionary dynamics.

For many bacterial species a large number of genetic variants do not fix or become extinct within a given host. Instead, these variants fluctuate at intermediate frequencies over time on timescales ranging from months to years [1, 10–13]. Such within-host genetic structure is reflected by the shape of phylogenetic trees constructed from microbial isolates, where the existence of a low number of deep phylogenetic branches suggests the existence of strain structure [11, 14–19]. This pattern of diversity within hosts arises due to the co-occurrence of a few (O(1-4)) genetically and ecologically diverged strains that belong to the same species, a process known as oligocolonization [3, 11]. This sub-species ecological structure that can occur within a host is more than a descriptive detail, as it has been proposed that strains are the relevant scale at which interactions and dynamics occur in microbial systems [20, 21]. Thus, the dynamics of the genetic variants that comprise a given strain are subject to exogenous and endogenous ecological processes [22–24]. However, evolution within a strain does not stop, as genetic variants continue to be acquired and segregate over time within a given strain [5]. Such dynamics are a clear departure from those captured by standard population genetic models used to describe microbial evolution, where genetic variants either arise in a population due to mutation or are introduced by migration and then proceed towards extinction or fixation (i.e., origin-fixation models). This departure suggests that measures of genetic diversity estimated within the human gut are shaped by the ecology of strains alongside evolutionary processes such as low recombination rates that result in physical linkage between alleles [3, 5, 10, 25].

This confluence of ecological and evolutionary dynamics requires new approaches and theory for characterizing genetic diversity in the human gut. Many studies tackle such complexity by examining individual species [6, 26–28] or by searching for genetic differences between species [11, 12, 29–33]. While such approaches are useful for identifying individual species that are potential contributors towards specific conditions such as disease or the ability to metabolize certain resources, it is difficult to translate isolated observations into general patterns. By focusing on individual species and differences between species it is plausible that uncharacterized patterns of genetic diversity that are generalizable across species may have been overlooked.

As an alternative, it is reasonable to first identify genetic patterns that are similar across species (i.e., statistical invariance). Such an approach may provide the empirical motivation necessary to identify mathematical models that can explain said patterns and aid in the identification of underlying ecological or evolutionary dynamics [10, 34–36]. In recent years, substantial progress has been made towards characterizing the typical microbial evolutionary dynamics across species that operate within and across human hosts [3, 4, 7, 8, 37–39]. An example of such a pattern is the observation that the relationship between synonymous nucleotide divergence (a proxy for evolutionary time) and the ratio of nonsynonymous and synonymous divergence (dS vs. dN/dS) falls on a single curve across microbial species in the human gut, representing 20 genera, 14 families, 7 orders, 6 classes, and 5 phyla [3, 39]. While this approach can often be limited by the number of observations and measurement error, modern data curation methods can often alleviate these limitations. Using this approach it is possible to leverage the richness of species in the human gut microbiome, where each species can be viewed as a draw from an unknown distribution and, as an ensemble, be used to identify patterns that are statistically invariant [40, 41].

To identify such patterns, it is useful to examine prior attempts that successfully pared down the complexity of the gut. Notable recent examples come from the discipline of macroecology, which has succeeded at predicting patterns of microbial diversity and abundance at the species level across disparate environments, including the gut microbiome [42–47]. This approach emphasizes the benefits of identifying patterns of diversity that are statistically invariant, motivating the development of quantitative predictions derived from ecological first principles. Recent work suggests that this approach may hold across scales of organization in the human gut, as species-level macroecological patterns have been extended to temporal patterns of strain-level ecology within a single host [46]. This consistency in strain-level patterns provided the motivation to apply an established model of ecology to predict macroecological quantities within a single host over time, the Stochastic Logistic Model of growth (SLM). In macroecology, the SLM has been found to successfully characterize the distribution of species relative abundance across hosts and over time within a host (i.e., the Abundance Fluctuation Distribution (AFD)), the relationship between the mean abundance of a species and the fraction of hosts where it is present (i.e., the abundance-prevalence relationship [48]), and the relationship between the mean and variance of the abundance of a species (i.e., Taylor’s Law [49]) [44]. Inspired by the success of the SLM, it was recently applied at the strain level to explain the temporal form of the AFD and Taylor’s Law within a single human host [46]. In that study it was found that the temporal dynamics of strains within a single healthy human host were invariant with respect to time (i.e., stationary). Motivated by this result, it was determined that the empirical distribution of strain frequencies over time followed the distribution of the SLM at stationarity, a gamma distribution. The results of that study suggest that the SLM, a model that succeeded in predicting patterns of strains within a single human host, may also succeed in predicting patterns of genetic diversity across unrelated hosts due to the existence of strain structure.

In this study, I sought to determine whether the SLM as a model of ecology was capable of quantitatively predicting patterns of genetic diversity across hosts due to the existence of strain structure. I identified patterns of diversity that remained statistically invariant among phylogenetically distant species, providing the motivation necessary to identify the SLM as a plausible model of across-host patterns of diversity. To evaluate the feasibility of the SLM while accounting for the effects of sampling, I obtained predictions for the fraction of hosts harboring an allele at a given nucleotide site (i.e., prevalence) using zero free parameters. I identified evolutionary models of allele frequencies that predict the same stationary probability distribution as the SLM and found that their assumptions are unrealistic to explain the data. To confirm that the success of the SLM was due to the presence of strains, I inferred whether strain structure was present in each host for each species using an established computational approach, finding that the presence of strain structure was correlated with the accuracy of the SLM in predicting allelic prevalence.

Results

Patterns of genetic diversity are statistically invariant across species

In order to determine whether it is possible to predict patterns of genetic diversity in the human gut, it is necessary to first investigate the degree of similarity in measures of genetic diversity across phylogenetically distant species. This manner of visualization, known as a data collapse, allows one to assess whether it is reasonable to assume that similar dynamics underlie different systems [50–52]. Such an analysis also provides the benefit of allowing for the identification of previously unknown empirical patterns for subsequent investigation. To determine whether there is evidence that the distributions of measures of genetic diversity have qualitatively similar forms across species, I compiled allele frequency data for 22 bacterial species across human hosts using a quality control pipeline that explicitly accounted for the rate of sequencing error using the Maximum-likelihood Analysis of Population Genomic Data (MAPGD) program (Materials and methods). The total number of processed hosts ranged from 108–371 across species, with a median of 182 (Fig 1a). The total number of sites ranged from 39–37,204 across species, with a median of 10,269 synonymous and 5,204 nonsynonymous sites (Fig 1b, S1b Fig). These results, and their existence for both synonymous and nonsynonymous sites, provide the empirical basis necessary to formulate quantitative predictions.

Fig 1. Distributions of genetic diversity exhibit similar statistical forms across phylogenetically distant species in the human gut.

a,b) Similarity in patterns of genetic diversity was evaluated for sites obtained from the 22 most prevalent bacterial species. c) The distribution of within-host allele frequencies across all hosts as well as d) the distribution of mean allele frequencies were rescaled to determine whether they exhibited similar forms, specifically by rescaling their logarithm using the standard score (i.e., z-score). In c, statistical fits of a gamma (the distribution predicted by the SLM) and a lognormal (a point of comparison) are illustrated as black lines. To limit the effect of the bounded nature of allele frequencies on the distribution, mean frequencies containing observations of f = 1 were excluded from subplot d. e) The relationship between statistical moments of within-host allele frequencies was consistent across species, as there was a strong linear relationship between the mean frequency of an allele and its variance on a log-log scale (i.e., Taylor’s Law). To reduce the contribution of an excess number of zeros towards estimates of $\bar{f}$ and $\sigma_f^2$, alleles with non-zero values of f in <35% of hosts were excluded. f) Finally, the fraction of sites harboring alleles present in a given number of hosts decreased in a similar manner across species. All sites in this analysis are synonymous; identical analyses were performed on alleles at nonsynonymous sites (S1 Fig). Species within the same genus were assigned the same primary or secondary color with different degrees of saturation.

First, I obtained the distribution of across-host allele frequencies for each nucleotide site and then pooled the frequencies of all sites. If the typical allele was present due to evolutionary processes, then the empirical distribution of within-host allele frequencies across hosts would be equivalent to the ensemble of single-site frequency spectra expected from within-host evolution [53, 54]. If the typical allele was present because it was on the background of a strain, then the macroecological view of this distribution is that it captures the distribution of relative abundances of strains across hosts, the AFD [44]. Furthermore, the degree of similarity across species allows one to assess whether it is reasonable to predict that a single probability distribution is capable of explaining the distribution of within-host allele frequencies across hosts for phylogenetically distant species.

In order to determine whether different distributions share a single form it is useful to rescale them by key parameters [50]. Inspired by prior work [44], I rescaled the distribution of within-host frequencies across hosts by 1) pooling all non-zero frequencies for a given species, 2) log-transforming all frequencies, 3) calculating the mean and standard deviation of the log-transformed frequencies, and 4) expressing each log-transformed frequency as a standard score (i.e., z-score). By repeating this process for each species, one can determine whether the form of the distribution qualitatively varies across species or whether the distributions simply differ in their statistical moments. The distribution of within-host allele frequencies had a similar qualitative form across species in the human gut for synonymous (Fig 1c) and nonsynonymous sites (S1c Fig), suggesting that a single distribution may be sufficient to characterize all species. Regions of the distribution are well-captured by the gamma distribution [44], suggesting that it would be informative to examine models that lead to a gamma distribution and then assess the gamma’s capacity to predict quantities calculated from individual alleles. As a point of comparison, I fit the distribution in Fig 1c using a lognormal distribution. This distribution was previously used to evaluate the AFD at the species level in disparate ecosystems [44]. The lognormal clearly deviates from the bulk of the distribution, a result that is even more apparent when the probability density is plotted as a survival probability (S2a and S3a Figs). A comparison using the Akaike Information Criterion (AIC) supports this conclusion (Synonymous: AIC_gamma = 6,277,330, AIC_lognormal = 6,618,492; Nonsynonymous: AIC_gamma = 3,359,295, AIC_lognormal = 3,490,025).
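The four rescaling steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function name `rescale_log_frequencies` and the example frequencies are hypothetical, the base-10 logarithm is an arbitrary choice (the z-score is unchanged by the base), and the sample standard deviation is assumed.

```python
import math
import statistics


def rescale_log_frequencies(freqs):
    """Return z-scores of the log-transformed non-zero allele frequencies.

    Hypothetical helper: pools the non-zero values, log-transforms them,
    then centers and scales by the mean and (sample) standard deviation.
    """
    logs = [math.log10(f) for f in freqs if f > 0]  # steps 1 and 2
    mean = statistics.fmean(logs)                   # step 3
    sd = statistics.stdev(logs)                     # step 3 (assumed: sample SD)
    return [(x - mean) / sd for x in logs]          # step 4


# Made-up pooled within-host frequencies for one species (zeros are dropped).
example = [0.0, 0.01, 0.02, 0.05, 0.1, 0.4]
z = rescale_log_frequencies(example)
```

Applying the same function to each species separately and overlaying the histograms of `z` would reproduce the kind of data collapse described in the text.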

Beyond the shape of the distribution of within-host allele frequencies, the statistical moments of within-host frequencies across hosts also exhibit qualitatively similar forms. I repeated the same standard score rescaling procedure for the logarithm of the mean within-host allele frequency across hosts ($\bar{f}$). The distribution of $\bar{f}$ tended to overlap across species for both synonymous (Fig 1d, S2b Fig) and nonsynonymous sites (S1d and S3b Figs). While here I am not explicitly interested in the processes that shape the distribution of the mean, as the mean will be used as an empirical input for calculating predictions in the subsequent section, the result does suggest that statistical moments calculated across hosts display features of invariance.

The mean and variance of random variables frequently follow linear relationships on logarithmic scales across biological systems, most notably in patterns of biodiversity in ecological communities [44, 55, 56] but also among population genetic patterns [57–59]. The existence of this relationship, known as Taylor’s Law [49], would imply in the context of this study that the mean and variance of allele frequencies across hosts are not independent among species, reducing the number of parameters necessary to characterize the dynamics of the system [44]. Furthermore, if the variance scales quadratically with the mean then the existence of the relationship implies that the coefficient of variation of f is constant across sites, an observation that can considerably reduce the difficulty of characterizing the dynamics of the system.

By examining the relationship between $\bar{f}$ and the variance of f ($\sigma_f^2$), it is clear that the two moments follow a linear relationship on a logarithmic scale for low values of $\bar{f}$ ($0 < \bar{f} \lesssim 0.35$). The exponent of this relationship is ∼1.96 (bootstrapped 95% CI from 10,000 samples: [1.85, 2.06]) for synonymous sites, a value that is remarkably close to two, implying that the coefficient of variation in f can be viewed as a constant for the range of $\bar{f}$ where the relationship is linear (Fig 1e). The exponent is slightly reduced for nonsynonymous sites (∼1.83, 95% CI [1.72, 1.93]; S1e Fig), suggesting that the variance increases with the mean at a slower rate relative to synonymous sites. Given that purifying selection is pervasive across species within the human gut [3, 12, 39, 60], it is likely that the typical allele at a nonsynonymous site confers a deleterious fitness effect, reducing its variance across hosts for low values of $\bar{f}$. However, the linear relationship does not extend to high values of $\bar{f}$. Given that f is, by definition, a bounded quantity (i.e., 0 ≤ f ≤ 1), it is possible that the relationship between $\bar{f}$ and $\sigma_f^2$ for values of $\bar{f}$ approaching 1 is governed by the upper bound on f. To determine whether this is the case, I plotted the maximum possible value of $\sigma_f^2$ for a given value of $\bar{f}$ constrained on the lower and upper bounds of f. This relationship, known as the Bhatia–Davis inequality, is defined as $\sigma_f^2 \le (\max(f) - \bar{f})(\bar{f} - \min(f)) = (1 - \bar{f})\bar{f}$ [61]. The empirical relationship between $\bar{f}$ and $\sigma_f^2$ follows the inequality across species for $\bar{f} \gtrsim 0.35$, suggesting that the relationship can be explained solely by the mathematical constraints on f, making the relationship in this range uninformative for the purpose of identifying universal evolutionary or ecological patterns. It is worth noting that an exponent of two can emerge if the underlying distribution is sufficiently skewed [62, 63]. However, similar to ecological analyses of the relationship between the mean and variance of species abundances [44, 56, 64], the fact that the observed mean allele frequency varies by close to two orders of magnitude suggests that this pattern reflects a true scaling relationship. While the existence of this relationship may not be system-specific [64], it does allow us to make a claim about the relationship between statistical moments.
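As a rough illustration of the two quantities discussed above, the sketch below estimates a Taylor's Law exponent as the ordinary least-squares slope of log variance on log mean, and evaluates the Bhatia–Davis upper bound on the variance. The function names and the toy data are hypothetical; the study's own estimate used bootstrapping over sites, which is not reproduced here.

```python
import math


def taylors_law_exponent(means, variances):
    """Least-squares slope of log10(variance) against log10(mean)."""
    xs = [math.log10(m) for m in means]
    ys = [math.log10(v) for v in variances]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def bhatia_davis_bound(mean_f, lower=0.0, upper=1.0):
    """Maximum variance of a variable bounded on [lower, upper] with a given mean."""
    return (upper - mean_f) * (mean_f - lower)


# Toy sites with a constant coefficient of variation of 0.5,
# i.e. variance = 0.25 * mean^2, so the exponent should be close to two.
means = [0.001, 0.01, 0.05, 0.2]
variances = [0.25 * m ** 2 for m in means]
slope = taylors_law_exponent(means, variances)
bounds = [bhatia_davis_bound(m) for m in means]
```

With a constant coefficient of variation the fitted slope is two up to floating-point error, and every toy variance sits below its Bhatia–Davis bound, mirroring the low-mean regime described in the text.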

Finally, I turned my attention to the fraction of hosts where a given allele is present. I found that the number of hosts in which a typical allele is present is small for synonymous (Fig 1f) and nonsynonymous sites (S1f Fig). Alternatively stated, the fraction of hosts harboring a given allele (i.e., prevalence) is typically low.

Predicting the prevalence of an allele across hosts

The existence of multiple patterns of genetic diversity that are universal across evolutionarily distant bacterial species suggests that comparable dynamics are ultimately responsible. The next task is to identify a candidate model capable of explaining said patterns. The shape of the rescaled distribution of allele frequencies suggests that a gamma distribution is a suitable candidate, reducing the range of feasible models to those that are capable of predicting said distribution or a distribution of similar form. Different approaches can be used to identify such a model. However, given the consistency of the patterns, it is appropriate to focus on models that solely contain parameters that can be measured from empirical data (i.e., no statistical fitting) rather than relying on estimates of free parameters via statistical inference (i.e., statistical fitting) [65].

We begin with the assumption that the frequency dynamics of a typical allele within a host are primarily driven by the ecological dynamics of the strain on which said allele resides. Appropriate Langevin equations (i.e., stochastic differential equations) that capture relevant ecological dynamics can be used. However, regardless of the underlying dynamics, the available data constrains the ways in which a given model can be evaluated. Given that temporal metagenomic data for the human gut microbiome remains restricted to a small number of hosts, I focused on samples taken at a single timepoint across a large number of unrelated hosts. This detail means that time-dependent solutions of the probability distribution of f cannot be empirically evaluated (i.e., p(f, t)), so I instead focused on stationary probability distributions and limiting cases where time-dependence is captured by parameters that can, in principle, be estimated (i.e., p(f)). This assumption of stationarity (i.e., time-invariance) is supported by previous research efforts that examined macroecological patterns that were stationary with respect to time at the strain level [46].

To model the dynamics of an allele that is on the genetic background of a strain, it is necessary to identify essential features of growth. There are two main features that are necessary to capture the deterministic dynamics of a strain: 1) that the rate of growth is often exponential when a species or strain is far from its carrying capacity and 2) that growth is self-limiting. There is also the need to consider stochasticity in growth that can be driven by environmental noise. To capture these features I examined a Langevin equation known as the Stochastic Logistic Model of growth (SLM), a model that has recently been shown to describe a range of macroecological patterns for microbial communities across disparate environments [44, 66–68] as well as the temporal dynamics of strains within a single human host [46]. The SLM is defined as

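$$\frac{df}{dt} = \underbrace{\frac{f}{\tau_i}\left(1 - \frac{f}{K_i}\right)}_{\text{Self-limiting growth}} + \underbrace{\sqrt{\frac{\sigma_{\tau_i}}{\tau_i}}\, f \cdot \eta(t)}_{\text{Environmental noise}} \tag{1}$$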

Here I am assuming that the allele I observed within a given host was present because it was on the background of a strain. This interpretation means that the fluctuations of the allele over time are due to the ecological fluctuations of the strain, where the terms $1/\tau_i$, $K_i$, and $\sigma_{\tau_i}$ respectively represent the intrinsic growth rate, the carrying capacity within a given species in terms of relative abundance (0 ≤ $K_i$ ≤ 1), and the coefficient of variation of growth rate fluctuations of the ith strain.

Environmental noise is captured by the product of a linear frequency term (as opposed to demographic noise, which would be captured by the term $\sqrt{f}$), the compound parameter $\sqrt{\sigma_{\tau_i}/\tau_i}$, and a Brownian noise term η(t) that introduces stochasticity into the equation. Using standard definitions of Langevin equations, the expected value of η(t) is 〈η(t)〉 = 0 [69]. The correlation between the noise at time t and at another time t′ is defined as 〈η(t)η(t′)〉 = δ(t − t′) [69]. This standard definition means that the noise term has zero correlation with itself at any two distinct times and is only correlated with itself at the same time.

This definition of a Langevin equation is convenient in that it is possible to obtain a partial differential equation describing how the probability distribution of f changes with time (i.e., the Fokker-Planck equation) [69]. Once this equation is obtained, the probability distribution of f at stationarity (i.e., no dependence on time) can be obtained. One finds that the SLM predicts that the frequency of a given allele on the background of a strain follows a gamma distribution (additional detail provided in Materials and methods)

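$$f \overset{\mathrm{SLM}}{\sim} \mathrm{Gamma}\left(\frac{2}{\sigma_{\tau_i}} - 1,\; \frac{2}{K_i \sigma_{\tau_i}}\right) \tag{2}$$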

This distribution is fully characterized by the mean frequency across hosts ($\bar{f}$) and the squared inverse of the coefficient of variation across hosts ($\beta = \bar{f}^2/\sigma_f^2$) [44, 70]

$$f \sim \mathrm{Gamma}\left(\beta,\; \frac{\beta}{\bar{f}}\right) \tag{3}$$
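To make the link between the Langevin equation and the gamma distribution concrete, the following sketch simulates the SLM with a simple Euler–Maruyama scheme and compares the long-run mean of the trajectory with the mean of the stationary gamma in Eq 2 (shape divided by rate, i.e. $K_i(1 - \sigma_{\tau_i}/2)$). Everything here is illustrative: the function names, the parameter values, the step size, and the Itô interpretation of the noise term are assumptions, and the study itself relied on the analytical stationary solution rather than simulation.

```python
import math
import random


def gamma_params_from_moments(mean_f, var_f):
    """Shape (beta) and rate (beta / mean) of the gamma in Eq 3 from two moments."""
    beta = mean_f ** 2 / var_f
    return beta, beta / mean_f


def simulate_slm(tau, K, sigma, f0, dt, n_steps, seed=0):
    """Euler-Maruyama path of the SLM (assumed Ito interpretation)."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(sigma / tau * dt)  # sqrt(sigma/tau) * sqrt(dt)
    f = f0
    path = []
    for _ in range(n_steps):
        drift = (f / tau) * (1.0 - f / K) * dt          # self-limiting growth
        f = f + drift + noise_scale * f * rng.gauss(0.0, 1.0)  # environmental noise
        path.append(f)
    return path


# Hypothetical parameters: tau = 1, K = 0.5, sigma = 0.5.
path = simulate_slm(tau=1.0, K=0.5, sigma=0.5, f0=0.3, dt=0.01, n_steps=400_000)
stationary = path[40_000:]                 # discard an initial burn-in
sim_mean = sum(stationary) / len(stationary)
expected_mean = 0.5 * (1.0 - 0.5 / 2.0)    # shape / rate from Eq 2

# Moment matching (Eq 3) for the gamma implied by the same parameters.
beta, rate = gamma_params_from_moments(expected_mean, 3.0 / 64.0)
```

For these parameter values Eq 2 gives a shape of 3 and a rate of 8, which the moment-matching helper recovers from the corresponding mean and variance; the simulated mean should land near 0.375, with the agreement limited by the finite length of the trajectory.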

The similarity in the shape of the distribution of $\bar{f}$ across species and the relationship between $\bar{f}$ and $\sigma_f^2$ suggests that the mean and variance are appropriate quantities to evaluate the predictive capacity of each model (Fig 1d and 1e, S1d and S1e Fig). However, $\bar{f}$ and $\sigma_f^2$ can be interpreted as parameters of the SLM (i.e., empirical inputs), meaning that the SLM cannot be used to predict $\bar{f}$ and $\sigma_f^2$. To test the applicability of the SLM, I chose to examine the fraction of hosts harboring a given allele (i.e., prevalence), a quantity that has been used to examine microbial ecology and evolution across systems [44, 71], including the human gut microbiome [3, 7, 44].

In order to test prevalence predictions, it is necessary to account for the sampling effort at a given site in a given host (i.e., total depth of sequencing coverage). This can be accomplished by deriving the sampling distribution of the gamma, providing the probability of observing A reads of a gamma distributed allele with D coverage.

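$$\Pr\left[A \mid D, \beta, \beta/\bar{f}\right] = \frac{\Gamma(\beta + A)}{A!\,\Gamma(\beta)} \left(\frac{\bar{f} D}{\beta + \bar{f} D}\right)^{A} \left(\frac{\beta}{\beta + \bar{f} D}\right)^{\beta} \tag{4}$$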

A value A = 0 represents the absence of an allele, which can be used to define presence as the complement, providing a natural definition of the prevalence of an allele across hosts.

$$\varrho = 1 - \frac{1}{M} \sum_{m=1}^{M} \Pr\left[0 \mid D_m, \beta, \beta/\bar{f}\right] \tag{5}$$

where I have defined prevalence as the average of the probability of presence over M hosts. While Eq 5 is correct, the pipeline used for sequence data enacted a cutoff for the total depth of coverage, resulting in a coverage cutoff for a given minor allele ($A_{\mathrm{cutoff}} = 10$; S4 Fig). This cutoff truncates the sampling distribution of minor allele read counts, meaning that read counts for a given allele less than the specified cutoff are effectively observed as zeros. This inferential detail can be explicitly accounted for by summing over the probabilities of observing alternative allele read counts up to and excluding the cutoff

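$$\varrho = 1 - \frac{1}{M} \sum_{m=1}^{M} \sum_{A_i = 0}^{A_{\mathrm{cutoff}} - 1} \Pr\left[A_i \mid D_m, \beta, \beta/\bar{f}\right] \tag{6}$$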
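The prevalence prediction in Eqs 4 to 6 is straightforward to compute once $\bar{f}$, $\beta$, and the per-host coverage are known. The sketch below is a minimal, hypothetical implementation (the function names and example values are not from the study's code); it evaluates Eq 4 in log space with `math.lgamma` for numerical stability, and the `a_cutoff` argument switches between Eq 5 (`a_cutoff=1`) and Eq 6 (`a_cutoff=10`).

```python
import math


def read_count_prob(A, D, beta, mean_f):
    """Eq 4: probability of observing A alternate reads at total depth D."""
    scaled = mean_f * D
    log_p = (
        math.lgamma(beta + A) - math.lgamma(A + 1) - math.lgamma(beta)
        + A * math.log(scaled / (beta + scaled))
        + beta * math.log(beta / (beta + scaled))
    )
    return math.exp(log_p)


def predicted_prevalence(depths, beta, mean_f, a_cutoff=1):
    """Eqs 5 and 6: one minus the mean probability of an 'absent' call.

    An allele is treated as absent in a host when its read count falls
    below a_cutoff; a_cutoff=1 recovers Eq 5 (absence means zero reads).
    """
    absent = 0.0
    for D in depths:
        absent += sum(read_count_prob(A, D, beta, mean_f) for A in range(a_cutoff))
    return 1.0 - absent / len(depths)


# Hypothetical site: beta = 1, mean frequency 0.1, one host with depth 10.
p_eq5 = predicted_prevalence([10], beta=1.0, mean_f=0.1)
p_eq6 = predicted_prevalence([10], beta=1.0, mean_f=0.1, a_cutoff=10)
```

In this toy case the read-count distribution is geometric, so the two predictions can be checked by hand: the probability of zero reads is 1/2, while the probability of at least ten reads is (1/2)^10, which shows how strongly a read-count cutoff can lower the expected prevalence.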

The choice of prevalence also allows one to evaluate the extent that the gamma distribution can recapitulate empirical relationships between genetic quantities. One such relationship is that the prevalence of a species (equivalently known as occupancy in macroecology) should increase with its mean abundance across communities [72], a pattern that has been found to exhibit statistically similar forms at the species level across microbial systems [48, 73, 74] and can be quantitatively explained through the existence of macroecological laws [44].

The Stochastic Logistic Model succeeds at predicting allelic prevalence

By examining the relationship between observed and predicted allelic prevalence, I found that the SLM generally succeeded in predicting this relationship for both synonymous and nonsynonymous sites using zero free parameters (Fig 2a, S5 and S6 Figs). Alternatively stated, predictions were obtained by computing the predicted prevalence using Eq 6 without the need to perform statistical fitting. The fraction of all sites with relative errors ≤ 0.1 (≤ 10%) ranged from 0.19–0.6 across species, suggesting that a considerable fraction of genetic variants within the human gut are driven by ecological dynamics (Fig 2b, S7 and S8 Figs). Furthermore, the SLM generally succeeded in recapitulating the relationship between $\bar{f}$ and prevalence, a strain-level analogue of abundance-prevalence relationships in macroecology (Fig 2c, S9 and S10 Figs). While the SLM on its own cannot be used to predict Taylor’s Law since the mean and variance were used as empirical inputs, the evidence for the existence of Taylor’s Law constrains the parameterization of the SLM and its subsequent interpretation [44, 75]. For the sites that the SLM is able to predict with a high degree of accuracy, the existence of Taylor’s Law implies that β is constant across sites ($\sigma_{\tau_i} = \sigma_{\tau}$; Fig 1e, S12 and S13 Figs). Thus, the function for the expected allele frequency reduces to the proportionality 〈f〉 ∝ $K_i$.

Fig 2. The SLM successfully predicts genetic patterns for prevalent alleles.

a) Predicted values of prevalence were obtained (Eq 6) and matched with their observed values, where the data generally fell on the one-to-one line (dashed black line) for sites with a prevalence ≳ 0.01. b) The pattern of the SLM performing better for higher values of prevalence was illustrated by quantifying the relative error of the prevalence predictions. The mean of the logarithm of the relative error of the SLM over all sites ($\overline{\log_{10} \varepsilon}$, dashed black line) was ∼0.1. c) The contingency of the SLM’s success was illustrated by examining the relationship between the mean frequency of an allele across hosts ($\bar{f}$) and its prevalence. The predictions of the SLM (not a statistical fit) succeed for high mean frequency alleles (dashed black line). All analyses here were performed on alleles at synonymous sites using the common commensal gut species B. vulgatus. The color of each datapoint is proportionate to the number of sites. Visualizations of the predictions in this plot for all species for nonsynonymous and synonymous sites can be found in S1 Text.

These results suggest that it is worth investigating the accuracy of the SLM across observed estimates of prevalence. Given that a considerable fraction of sites have prevalence values close to one (e.g., the dot in the top-right corner of Fig 2b), where the SLM has its highest degree of accuracy, it is necessary to remove these sites in order to examine the full distribution of prediction errors. By focusing on sites with observed prevalences <0.9, one can see that observed and predicted prevalence values followed a one-to-one relationship across a wide range of observed prevalence values for many species (Fig 3a). From these results, one can glean a few insights into the appropriateness of the SLM. First, for most observed prevalence values, when the predicted prevalence differs from the observed value it is generally below the one-to-one line. This pattern does not change as the observed prevalence increases, suggesting that the SLM is generally able to capture the distribution of f and its relation to prevalence for the range of observed prevalence values. When the SLM is inaccurate it generally underpredicts the true prevalence of an allele. Alternatively stated, the SLM can predict an excess of zero observations (f = 0). However, there is a large uptick in the predicted prevalence for alleles with high observed values of prevalence, where the predictions are effectively on the one-to-one line. Consistent with this uptick, there is a large drop in the error as observed prevalence increases (Fig 3b), suggesting a negative relationship between the observed prevalence of an allele and the relative error of the SLM. I found that the correlation between these two quantities was negative for all species (Fig 3c). By permuting the order of these quantities I obtained null distributions of correlation coefficients for all species, where the majority of observed coefficients (16/22) clearly fall below the bounds of their null 95% confidence intervals.
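The permutation procedure mentioned above can be sketched as follows. This is a generic illustration rather than the study's implementation: the text does not state which correlation coefficient was used, so Pearson's r is assumed here, and the function names, the number of permutations, and the toy data are all hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient (assumed; the coefficient is not specified)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_null_interval(xs, ys, n_perm=1000, seed=0):
    """Central 95% interval of correlations after shuffling ys relative to xs."""
    rng = random.Random(seed)
    shuffled = list(ys)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(pearson_r(xs, shuffled))
    null.sort()
    return null[int(0.025 * n_perm)], null[int(0.975 * n_perm) - 1]


# Toy data: log10 observed prevalence and a relative error that declines with it.
log_prevalence = [-2.0 + 0.1 * i for i in range(20)]
rel_error = [-x + 0.01 * ((i % 3) - 1) for i, x in enumerate(log_prevalence)]

r_obs = pearson_r(log_prevalence, rel_error)
lower, upper = permutation_null_interval(log_prevalence, rel_error)
```

An observed coefficient that falls below `lower` would be read, as in the text, as a negative correlation that is unlikely under the permutation null.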

Fig 3. The SLM succeeds and fails in a consistent manner across phylogenetically distant species.

a) Distributions of logarithmic relative errors of prevalence obtained from Eqs 6 and 21 were rescaled by their mean and standard deviation to illustrate their similarity across species. b) Binning observed prevalences reveals how the predicted values tend to follow a one-to-one relationship (dashed black line) across species, with variation among species. c) By calculating the correlation between log10-transformed observed prevalence and the relative error of our predictions, one finds a negative correlation for the majority of species. A permutation test confirms that these negative correlations are significant (95% CIs as black lines) for the majority of species (16/22). All analyses in this plot were performed using synonymous sites. Identical analyses and equivalent results for nonsynonymous sites can be found in S11 Fig.

The dependence of the predictive success of the SLM on the observed prevalence of an allele across hosts is a curious pattern. Ideally, one expects that the SLM is capable of explaining the dynamics of alleles if they primarily exist on the genetic background of strains, as the SLM has been found to explain the temporal dynamics of strain frequencies within a single host [46]. The fact that the SLM, when it is inaccurate, tends to underpredict true prevalence allows one to rule out models that predict a greater proportion of f = 0 values than expected from a given distribution (e.g., zero-inflated gamma), as they would further reduce the predicted prevalence and increase the error of the prediction. Such models are often representative of competitive exclusion [44], an ecological outcome where the presence of a given species precludes the existence of another species. In the context of this study, competitive exclusion corresponds to the hypothesis that a strain is not found in a given host because it is unable to compete, resulting in alternative stable states. This connection allows one to rule out competitive exclusion as a probable contributor towards the predictive error.

Alternatively, the comparatively poor performance of an ecological model among low prevalence alleles provides room for evolutionary explanations. Low prevalence alleles may be present due to the evolutionary dynamics operating within a host, where a given allele can arise in a host due to mutation, increasing the observed prevalence to a value higher than that predicted by an ecological model of strain dynamics. Without a model describing both the ecological and evolutionary dynamics of f within a host, it is difficult to parse alleles into discrete “ecological” and “evolutionary” categories. Regardless, the distributions of prevalence prediction errors have a strikingly bimodal form for several species (S7 and S8 Figs), suggesting that there may be some truth to the claim that evolution, rather than ecology, can disproportionately contribute to the prevalence of alleles across hosts.

Strain-level ecology determines the accuracy of the gamma distribution

While the gamma distribution succeeds at explaining patterns of genetic diversity for an appreciable number of sites across microbial species, the ability to connect quantitative predictions to empirical distributions ultimately rests on the fact that the SLM predicts a distribution that is fully parameterized by the mean and variance of f (Eq 3). Alternatively stated, one does not directly estimate the carrying capacity; rather, one estimates the mean frequency, the expectation of which is a function of the carrying capacity under the SLM that reduces to a proportionality when Taylor’s law holds [44]. This reliance on statistical moments estimated from observational data suggests that alternative Langevin equations that can also predict a gamma distribution are equivalent candidates to the SLM in the absence of additional evidence.

In contrast, models of ecology and evolution that predict probability distributions other than the gamma are inappropriate from the outset, as the form of the gamma distribution that explicitly considers the effect of sampling (Eq 4) succeeded in predicting the prevalence of alleles of moderate-to-high mean frequency across hosts (Eq 6). For example, models of neutral evolution ($\frac{\partial f}{\partial t} \propto \sqrt{\frac{f(1-f)}{N}} \cdot \eta(t)$) and neutral ecology ($\frac{\partial f}{\partial t} \propto \sqrt{f} \cdot \eta(t)$) predict that Fig 1c should resemble a Gaussian distribution over short timescales, a prediction that is incompatible with the observed distribution of within-host allele frequencies across hosts [76, 77] (Fig 1c). In addition, models that predict a lognormal distribution of within-host allele frequencies across hosts such as an ecological model of a strain with a constant rate of growth (as opposed to the logistic growth term in the SLM) and environmental noise (the same noise term as that in the SLM) are also inappropriate since a lognormal distribution did a poor job explaining the empirical distribution [78] (Fig 1c).

Given that the set of potential Langevin equations is constrained to those capable of returning a gamma distribution, I evaluated two evolutionary models that predict a gamma distribution to determine whether they were viable alternatives to the SLM. Starting with a Langevin equation, the frequency dynamics of a single allele are governed by forward and backward mutation (μ, υ), selection (s), and random genetic drift for a population of size N.

$$\frac{\partial f}{\partial t} = sf(1-f) + \mu(1-f) + \upsilon f + \sqrt{\frac{f(1-f)}{N}} \cdot \eta(t) \tag{7}$$

To reduce the number of free parameters and remove nonlinear terms, one can examine Eq 7 in the low frequency limit (f ≪ 1) and obtain Langevin equations for evolution under positive (s > 0) and purifying selection (s < 0)

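$$\frac{\partial f}{\partial t} = -sf + \mu + \sqrt{\frac{f}{N}} \cdot \eta(t) \tag{8a}$$
$$\frac{\partial f}{\partial t} = sf + \mu + \sqrt{\frac{f}{N}} \cdot \eta(t) \tag{8b}$$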

A gamma-distributed stationary solution of Eq 8a is straightforward to derive in the case of purifying selection (s < 0; [76, 79]). For the s > 0 case, a gamma distribution of allele frequencies can be derived where the time-dependence is captured by the maximum frequency that an allele can reach ($f_{max} \equiv \frac{e^{st}-1}{2Ns}$) which, in principle, can be estimated from empirical data (Materials and methods). This dynamic form of mutation-selection balance is a gamma distribution that is solely parameterized by the maximum obtainable frequency of said allele ($f_{max}$) and the population scaled mutation rate ($2N\mu$) [25]. Together, these selection regimes provide two forms of mutation-selection balance that predict the gamma distribution (S1 Text).

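$$f_{\mathrm{Evo},\, s<0} \sim \mathrm{Gamma}\left(2N\mu,\, 2N|s|\right) \tag{9a}$$
$$f_{\mathrm{Evo},\, s>0} \sim \mathrm{Gamma}\left(2N\mu,\, f_{max}^{-1}\right) \tag{9b}$$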

Both of these distributions can be parameterized using the mean and variance, where $\langle f \rangle = \frac{\mu}{|s|}$ and $\beta = \frac{1}{2N\mu}$ for s < 0 and $\langle f \rangle = 2N\mu f_{max}$ and $\beta = (2N\mu)^{2}$ for s > 0. These distributions provide alternative explanations for the predictive capacity of the gamma distribution, and it is worth investigating their feasibility.

There is evidence that purifying selection is widespread in the human gut microbiome [3, 39], though it is unlikely to be responsible for the range of across-host allele frequency fluctuations that could be inferred in this study. After accounting for sequencing error and total depth of coverage using MAPGD, the mean of the lowest inferred non-zero allele frequency across all species was ∼0.02. Given that microbial populations in the human gut are typically very large in size, it is unlikely that the same allele independently reached frequencies ≳ 0.02 in multiple hosts under negative selection. Assuming that independence among sites holds, the frequencies of individual alleles should not exceed $\frac{1}{N|s|}$ [80], meaning that it is unlikely that a substantial fraction of the alleles with non-zero inferred frequencies were driven by purifying selection. This explanation becomes even less likely when one considers the negative relationship between observed allelic prevalence and the accuracy of the gamma.
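To make the scale of this bound concrete, the short calculation below asks how weak purifying selection would have to be for an allele to reach the lowest typical inferred frequency. It is a rough sketch under stated assumptions: the population sizes are the order-of-magnitude range quoted in the Discussion, and the helper name is hypothetical.

```python
def max_selection_strength(N, f):
    """Largest |s| compatible with an allele reaching frequency f,
    obtained by rearranging the bound f <= 1 / (N * |s|)."""
    return 1.0 / (N * f)

f_min_inferred = 0.02            # mean lowest inferred non-zero frequency (from the text)
population_sizes = (1e11, 1e13)  # assumed order-of-magnitude range of N

s_bounds = {N: max_selection_strength(N, f_min_inferred) for N in population_sizes}
for N, s_max in s_bounds.items():
    print(f"N = {N:.0e}: |s| would need to be below ~{s_max:.0e}")
```

Under these assumed population sizes, the selection coefficients compatible with the observed frequencies are many orders of magnitude below one.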

The dynamic mutation-selection balance parameterization of the gamma distribution also contains forward mutation, a useful feature given that mutation can contribute towards the total observed frequency of a beneficial allele that is increasing in frequency within a given host (Eqs 9a and 9b). Such a feature is appealing, as, when the SLM fails, it does so by under-predicting observed prevalence. However, while the inclusion of forward mutation may increase the predicted value of prevalence, it is unlikely that positive selection substantially contributes towards the typical across-host single-site frequency spectra for alleles of moderate-to-high mean frequency. Furthermore, the explicit time-dependence in the parameterization of fmax implies that a given allele is increasing in frequency in the f ≪ 1 regime in all hosts where non-zero frequencies were observed. This model is highly useful for describing the dynamics of an ensemble of populations where the initial frequency is known and of low frequency, but it is likely inapplicable to single-timepoint samples across unrelated human hosts. To the extent that this parameterization holds, it likely does so for alleles that are observed at low frequencies, restricting the model’s applicability to low prevalence alleles. Because the SLM tends to fail for this prevalence regime, it is possible that a distribution derived from a model of evolutionary dynamics, gamma or otherwise, is necessary to predict genetic diversity among low prevalence alleles.

Beyond the consideration of evolutionary models, it is ultimately necessary to evaluate the extent to which the success of the gamma is due to the existence of strain structure. This is a difficult task given that there is no guarantee that a given allele is present in multiple hosts because it is on the background of genetically identical strains. This limitation arises due to the difficulty inherent in determining whether a given host harbors a given strain from a single static metagenomic sample. Such difficulty persists in part due to technical and statistical limitations, but also due to the lack of practical strain definitions [81–84]. It is difficult to phase strains from short-read sequencing data where physical linkage between variants cannot be established, making it necessary to instead identify a single haplotype within a given host [3, 11]. So, while it is currently possible to identify the prevalence of dominant lineages across hosts, there is no straightforward approach that allows one to assign an observed allele to a given strain.

Given these methodological constraints, I identified the existence of strain-level structure for a given species within a given host using StrainFinder, a program that infers the number of strains using the shape of the site-frequency spectrum (Materials and methods; [85]). I found that the fraction of hosts harboring strain structure ranged from 0.089–0.64 among the species I examined, with a median of ∼0.27 (Fig 4a). If alleles with higher prevalence were driven by the presence of strain structure, then one would expect a positive correlation between the accuracy of the predicted allele prevalence and the fraction of hosts containing strain structure across species, equivalent to observing a negative correlation between relative error and the fraction of hosts containing strain structure. This prediction held, as the correlation between these two variables was typically negative within a given range of prevalence values, but tended to become more negative when only alleles with high prevalence estimates were included (Fig 4b). To resolve this relationship, I examined the degree of correlation between the relative error of the gamma and the fraction of hosts with strain structure across a wide range of observed prevalence thresholds (Fig 4c). The correlation tended to increase with prevalence, though there was a substantial decrease once a prevalence threshold of ∼0.1 was reached. The fraction of hosts with strain structure for a given species provides context, as it places a lower bound on the range of allelic prevalences that can be driven by strain-level ecology, meaning that alleles with prevalence values lower than the lowest observed fraction of hosts with strain structure are unlikely to be driven by ecology. Given that the species with the lowest observed fraction of hosts with strain structure were close to this descent, it is possible that this value is where the ecological dynamics of strains began to predominantly influence across-host patterns of genetic diversity.

Fig 4. The presence of strain structure is correlated with the accuracy of the SLM.


a) The presence or absence of strain structure was inferred from the distribution of allele frequencies for each species within each host using StrainFinder, providing an estimate of the fraction of hosts that harbor strain structure. b) This per-species estimate of strain structure can be compared to the mean relative error of allelic prevalence predictions obtained using the SLM (Eq 6) to determine whether the success of the SLM is correlated with the existence of strains. By examining this relationship when sites with rare alleles (i.e., low prevalence) are included (ϱ^0.05: dashed line) and excluded (ϱ^0.15: solid line), one sees a stronger correlation for alleles with a high prevalence threshold. c) This trend can be systematically evaluated by calculating the correlation across a range of prevalence values (black dots). A permutation test establishes 95% confidence intervals of the null (grey window).

Discussion

This study demonstrated that a model of ecology that explained the dynamics of strains within a single host can be successfully applied to explain across-host patterns of diversity at individual nucleotide sites, the constituents of strains. I identified patterns of genetic diversity in the human gut microbiome that were statistically invariant across evolutionarily distant species. Motivated by these results, and the prominence of strain structure in the human gut, I identified a prospective model of ecological dynamics (i.e., the Stochastic Logistic Model [44]) that could explain said patterns. Using this model, I was able to predict the fraction of hosts that harbored a given allele (i.e., prevalence) using measurable parameters (i.e., zero statistical fitting) for a considerable fraction of sites across species. Prediction accuracy tended to improve among more prevalent alleles, a result that is consistent with the conceptual picture that both ecology and evolution are operating within a given species in the human gut [10]. The accuracy of prevalence predictions was correlated with independent estimates of strain structure, providing additional empirical evidence that patterns of genetic diversity across human hosts are driven by strain structure for a considerable fraction of sites.

The success of the SLM for common alleles implies that one’s level of sampling (i.e., sequencing coverage) is the primary determinant of whether or not strain structure can be detected within a healthy human host for several species [44]. A lack of genuine absences of strain structure (i.e., extinction) in healthy human hosts would subsequently imply that competitive exclusion is rare at the strain level. This is a strong claim and proving it is beyond the scope of this study. Instead, it is worth noting that the claim seemingly contrasts with the observation that strains exhibit varying frequencies across hosts for several species [3], though fluctuations across hosts alone are not demonstrative of competitive exclusion. Like any model with stochasticity, fluctuations across hosts are expected under the SLM and some number of absences will inevitably arise due to the finite nature of sampling. However, it is also possible that the carrying capacity of a given strain could vary from host-to-host for several species, widening the across-host distribution of frequencies (Fig 1c). This detail can readily be incorporated into the SLM if one assumes that the carrying capacity in a given host is an independently drawn random variable from some unknown distribution [67]. Thus, given certain assumptions, the success of the SLM is reconcilable with the view that the carrying capacity of a strain is host-dependent.

While the SLM successfully predicted the prevalence of common alleles across hosts and has been shown to describe the temporal dynamics of strains within a host [46], all models have limitations, and it is useful to briefly discuss those that are applicable. A fundamental limitation of the SLM is that it is phenomenological in nature. It is unclear how microscopic details that are relevant to the ecological dynamics of strains, namely, consumer-resource dynamics [20, 40, 86, 87], map onto the SLM. Models that incorporate such dynamics are capable of recapitulating the temporal dynamics of microbial communities at the species level [88], and are necessary to model the emergence of new strains and their subsequent eco-evolutionary dynamics [87], suggesting that these microscopic details may be necessary to describe certain macroecological patterns at the strain level. As a contrast, the phenomenological nature of the SLM could be viewed as an asset when one wants to capture and predict multiple empirical patterns using an analytic solution without the use of fitted parameters. Indeed, the SLM succinctly captures the dynamics of a constrained random walk [88] and it is likely that alternative models of strain-level ecology that capture the same stochastic process are equally applicable.

The conclusion that properties of a substantial fraction of alleles can be predicted across hosts using a model of ecological, as opposed to evolutionary, dynamics is of consequence to studies of diversity in the human gut. Measures of genetic diversity within a single host (e.g., nucleotide diversity) are often used to assess the genetic content of a microbial species [12, 89–92]. Recent efforts to characterize patterns of genetic diversity within a single host have found that the temporal dynamics of nucleotide diversity are primarily driven by fluctuations in strain frequencies over time [46]. In addition, the contribution of strain structure to estimates of genetic diversity from unrelated human hosts has been previously reported [3]. This study builds on past results by specifying that strain structure shapes patterns of genetic diversity across hosts. The implications of strain-level ecology likely extend to measures of genetic differentiation between populations (e.g., fixation index, FST) that have been used to assess the degree of structure of a given species across human hosts [12]. The results presented in this study suggest that single-sample measures of genetic diversity that do not account for strain ecology are unlikely to be informative of evolutionary processes operating within the human gut.

The success of a prediction is often contingent on one’s range of observation. After accounting for sequencing error, the lowest inferred allele frequencies ranged from 0.006–0.06 across species with an average of ∼0.02. A straightforward calculation suggests that this range is higher than the true minimum frequency (i.e., 1/N) by several orders of magnitude. The mean relative abundances of a given species across hosts ranged from 0.02–0.1. Previously established order-of-magnitude estimates of the typical number of cells in the human gut range from ∼$10^{13}$–$10^{14}$, from which one can use the mean relative abundance distribution across species to calculate a first-pass range of empirical abundances of ∼$10^{11}$–$10^{13}$ [1, 93]. This range suggests that the true minimum frequency of an allele is at least eight orders of magnitude lower than the minimum inferred allele frequency. It is clear from this calculation that, at present, researchers are only able to examine a narrow range of possible allele frequencies within the human gut microbiome, a range that is likely primarily driven by strain structure.
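The order-of-magnitude argument above can be reproduced in a few lines. The sketch below only combines the ranges quoted in this paragraph; the variable names are illustrative and the calculation is a back-of-the-envelope check rather than part of the study's pipeline.

```python
import math

total_gut_cells = (1e13, 1e14)            # typical number of cells in the human gut
mean_relative_abundance = (0.02, 0.1)     # range of mean relative abundances across species
lowest_inferred_frequency = (0.006, 0.06) # range of lowest inferred allele frequencies

# First-pass range of the population size N of a single species within a host.
population_size = (
    total_gut_cells[0] * mean_relative_abundance[0],
    total_gut_cells[1] * mean_relative_abundance[1],
)

# Gap, in orders of magnitude, between the lowest inferred frequency and 1/N.
gaps = [
    math.log10(f_obs * N)
    for f_obs in lowest_inferred_frequency
    for N in population_size
]
smallest_gap = min(gaps)
largest_gap = max(gaps)
print(f"N ranges from {population_size[0]:.0e} to {population_size[1]:.0e}")
print(f"gap between observable and true minimum frequency: "
      f"{smallest_gap:.1f} to {largest_gap:.1f} orders of magnitude")
```

Even the most conservative pairing of these ranges leaves a gap that is consistent with the "at least eight orders of magnitude" statement.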

The implication of this relatively narrow observational window is that the success of predictions derived from ecological principles, and the feasibility of alternative single-locus models of evolution, are likely contingent on present limitations on the depth and error rates of shotgun metagenomic sequencing. As advances in sequencing technology and statistical inference continue to permit lower observational thresholds and provide information about physical linkage between variants [5, 94–97], one expects that an increasingly higher fraction of observed alleles will be subject to evolutionary processes rather than the ecological processes affecting the strain harboring said allele, reducing the aptness of ecological models. Succinctly stated, the existence of strain structure suggests that the dynamics of allele frequencies are likely dependent on the frequency scale at which observations can be made. By expanding said range, it may be possible to identify a frequency threshold where observable alleles are primarily driven by evolutionary dynamics rather than the ecological dynamics of strains, providing the means to test quantitative predictions of recent developments in population genetic theory [25, 80]. Recognition of the possibility of such scale-dependence has the potential to shape future studies and rigorously assess the purported universality of empirical patterns of genetic diversity in the human gut microbiome.

Throughout this manuscript I have assumed that strains are sufficiently genetically diverged such that within-host structure is overt (i.e., many alleles with intermediate frequencies within a host, 0 < f < 1). However, de novo strains can emerge within a host, resulting in within-host structure where strains are separated by only a handful of SNVs (e.g., Bacteroides fragilis strains in [1]). It is worth considering how the emergence of new strains relates to the patterns documented in this manuscript. A recently diverged strain within a single host is analogous to a species that is only found in a single host. Given that the sampling form of the gamma distribution used in this study succeeded at predicting the prevalence of species present in a single host [44], it should, in principle, also succeed in predicting the prevalence of a SNV observed in a single host due to recently emerged strain structure. This expectation is not the case, as predictions for low prevalence alleles consistently failed for all species included in this study (Figs 2 and 3, S5, S6 and S11 Figs). It is reasonable to interpret this lack of predictive success for low prevalence alleles as a consequence of said alleles being present in a low number of hosts due to evolutionary dynamics, rather than their presence being a reflection of newly emerged within-host strain structure. However, this interpretation does not mean that recently diverged strains are absent in this cohort of human hosts. Rather, it is likely that the macroecological lens applied in this study has insufficient resolution to identify alleles that reflect recently diverged strains that have colonized a low number of hosts, a number similar to the number of hosts in which we would expect to observe a given allele due to evolutionary dynamics (e.g., recurrent mutation).

Finally, it is worth commenting on the applicability of these results towards evaluating the ecological effects of environmental perturbations, namely, host-induced changes. Across-host patterns are often the consequence of within-host dynamics, so, in principle, it should be possible to use the SLM to predict across-host changes in response to perturbations. However, it is difficult to know a priori whether a host is in a perturbed state unless the perturbation is administered as part of a controlled study (e.g., major diet change, drug trials, etc.). For example, one could compile data from studies on different human hosts where courses of antibiotics were administered and metagenomic sequencing was performed over time. Knowing the inferred frequency of a strain or the frequencies of its constituent alleles at the start of the trial f0, one could then leverage the SLM and its stationary solution to determine when statistical quantities calculated across hosts such as the mean frequency or prevalence reach their stationary values as the system relaxes away from its perturbed state (i.e., 〈f|t, f0〉 → 〈f〉 or 〈ϱ|t, ϱ0〉 → 〈ϱ〉).

Materials and methods

Data acquisition and processing

To investigate patterns of genetic diversity within the human gut microbiome, I used shotgun metagenomic data from 468 healthy North American individuals sequenced by the Human Microbiome Project [24, 98]. I first processed the data using a previously developed analysis pipeline to identify the set of sites in core genes [3]. This pipeline uses a standard reference-based approach (MIDAS v1.2.2 [29]) to map reads from each metagenomic sample to reference genes across a panel of prevalent species and filter reads based on quality scores and read mapping criteria. Definitions of “species” vary across disciplines in biology. To avoid ambiguity, I opted for a direct operational definition provided by the resolution of the reference genome panel used by MIDAS, a definition that has been used in many studies of the human gut microbiome [3, 5, 25, 36, 39, 46, 99].

The relative abundance of each species in a given host was inferred using merge_midas.py species. Then, the command merge_midas.py genes was run with the following flags: --sample_depth 10, --min_samples 1, and --max_species 150. Using this processed gene output, the command merge_midas.py snps was run with the following flags: --sample_depth 5, --site_depth 3, --min_samples 1, --max_species 150, and --site_prev 0.0. I only processed samples from unrelated hosts and removed all temporal samples, leaving a set of samples that correspond to observations from unique hosts. I retained alleles in sites that were present in a core gene (i.e., a gene that was present in ≥ 90% of hosts) with a minimum total within-host depth of coverage of 20 (D ≥ 20) in at least 20 hosts. These parameter settings are effectively identical to the settings used in prior studies that have examined the HMP [3, 7, 39, 46]. The (non)synonymous status of sites was determined using MIDAS reference genomes. While members of a species can differ in genomic content and can alter the accuracy of calling the status of a site, this pipeline was previously run on the same dataset used in this manuscript to test population genetic predictions on measures of genetic diversity using nonsynonymous and synonymous sites, finding that the empirical data matched theoretical predictions [3], implying that any ambiguity in site (non)synonymous assignment did not shape our results.

After identifying the set of sites in core genes that passed the quality control thresholds, I obtained appropriate BAM files so that allele frequencies could be reliably estimated. First, using the reference genomes used by MIDAS and the BAM file generated by MIDAS for all species in a single host, I split the BAM file containing all species into separate BAM files for each species using the command samtools view with default settings [100]. Each BAM file was sorted using the samtools sort command with default settings, from which a .header file was generated using samtools view.

Sequencing errors can obfuscate naïve estimates of low frequency alleles (f ≪ 1). This effect is particularly pronounced for measures of genetic diversity, as sequencing error-induced noise can easily swamp real biological signal when statistics are calculated over a large number of sites. Studies often attempt to control for errors by restricting their analyses to sites that pass a particular threshold for the total depth of coverage and/or the coverage of the minor allele. However, the effectiveness of an arbitrary cutoff will not be identical across sites and hosts due to variance in the total depth of coverage. For this specific analysis, reliable frequency estimates are also necessary since an allele of any non-zero frequency can contribute towards measures of prevalence, so it is imperative for low frequency alleles to be estimated in a statistically justified and unbiased manner that balances sensitivity and specificity.

To account for sequencing errors, I elected to use the full maximum likelihood estimator MAPGD v0.4.40 [101], an unbiased estimator that accounts for the total depth of coverage and an unknown sequencing error. This estimator was chosen due to its comparatively high sensitivity to low frequency alleles without sacrificing its false discovery rate in real and simulated data [102], reflecting a balance between sensitivity and specificity. Using BAM and .header files processed from MIDAS output, MAPGD was run with a log-likelihood ratio polymorphism cutoff of 20 (-a 20), a choice informed by prior benchmarking studies and MAPGD recommendations [102]. The choice of log-likelihood ratio cutoff is unlikely to shape the results of this study, as the cutoff effectively establishes a coverage cutoff which can be incorporated into predictions derived using probability distributions that explicitly account for sampling (e.g., Eq 4). Samples with insufficient coverage for MAPGD to be run were removed from all downstream analyses. I then polarized alleles based on the major allele across hosts.

Gamma and lognormal distributions were fit to the distribution of within-host allele frequencies across all hosts using SciPy. Because the distribution was rescaled using the logarithm of the data, the distributions were fit as if I was interested in the logarithm of the random variable (i.e., log10(f)). This detail translates to fitting the loggamma instead of the gamma and the Gaussian instead of the lognormal. AIC was calculated using custom scripts.
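The fitting procedure described above can be sketched with SciPy as follows. This is an illustrative reconstruction under stated assumptions, not the custom scripts used in the study: the function names, the standardization step, and the synthetic frequencies are assumptions, and the location and scale parameters of each fitted distribution are left free so that they absorb the rescaling and the choice of logarithm base.

```python
import numpy as np
from scipy import stats

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def compare_fits(frequencies):
    """Fit loggamma and Gaussian distributions to rescaled log10 frequencies.

    A gamma-distributed f implies a loggamma-distributed log(f), and a
    lognormal f implies a Gaussian log(f).
    """
    log_f = np.log10(np.asarray(frequencies, dtype=float))
    z = (log_f - log_f.mean()) / log_f.std()  # rescale by mean and standard deviation
    results = {}
    for name, dist in (("loggamma", stats.loggamma), ("gaussian", stats.norm)):
        params = dist.fit(z)                      # maximum likelihood estimates
        log_lik = np.sum(dist.logpdf(z, *params))
        results[name] = aic(log_lik, len(params))
    return results

# Synthetic allele frequencies drawn from a gamma distribution (made-up parameters).
rng = np.random.default_rng(0)
f = rng.gamma(shape=0.5, scale=0.1, size=2000)
f = f[f > 0]  # guard against zeros before taking the logarithm

fit_aic = compare_fits(f)
print(fit_aic)
```

The distribution with the lower AIC is preferred; for data that are truly gamma distributed, the loggamma fit is expected to win despite its extra parameter.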

The MAPGD inference procedure makes no assumptions about the existence of strain structure. If strain structure is present in the data it will shape the distribution of allele frequencies, subsequently altering measures of genetic diversity and the predictive capacity of the SLM. To determine whether the error of the SLM was related to the existence of strains, I used an algorithm to determine whether strain structure was present for each species within each host. As an independent estimate of strain structure, I estimated strain frequencies for all species using all sites with ≥ 20 fold coverage within each host by applying StrainFinder v1.0 [85] to the frequency spectra obtained from the upstream pipeline [3]. The program StrainFinder was run on each sample for each species using 10 initial conditions using local convergence criteria with the following flags: --dtol 1, --ntol 2, --max_time 20000, --converge. The program was run for strain numbers ranging from one to four, and the estimates with the top five log-likelihoods were retained. I then selected strain frequencies with the lowest Bayesian Information Criterion for each species in each sample. Joint density plots (e.g., Fig 2) were made using functions from macroecotools v0.4.0 [103].

Predicting allele prevalence using the SLM

Deriving the distribution of allele frequencies from the SLM

We begin with the assumption that the typical polymorphism observed within a given host for a given species is present because it is on the background of a colonizing strain. In such a scenario, the dynamics of an allele are not determined by its evolutionary attributes (i.e., fitness effect, mutation rate, etc.) but by the ecological dynamics of the strain. There is increasing evidence that the stochastic logistic model of growth (SLM) is a suitable null model of microbial ecological dynamics at the species level [44, 66, 67] and recent evidence indicates that the SLM sufficiently fits the temporal dynamics of strains within a human host for the vast majority of microbial species [46]. A non-trivial application of the SLM to strain-level ecology requires there to be more than one strain within a given host, giving the allele a range of frequencies of 0 < f < 1

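$$\frac{\partial f}{\partial t} = \frac{f}{\tau_i}\left(1 - \frac{f}{K_i}\right) + \sqrt{\frac{\sigma_{\tau_i}}{\tau_i}}\, f \cdot \eta(t) \tag{10}$$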

The terms $\frac{1}{\tau_i}$, $K_i$, and $\sigma_{\tau_i}$ represent the intrinsic growth rate of the strain, its carrying capacity, and the coefficient of variation of growth rate fluctuations, respectively. The term η(t) is a Brownian noise term where 〈η(t)〉 = 0 and 〈η(t)η(t′)〉 = δ(t − t′) [69]. By definition, strain frequencies within a species must be between zero and one, so $0 < K_i < 1$.

Using the Itô ↔ Fokker-Planck equivalence [69], one can formulate a partial differential equation for the probability p(f, t) that an allele has frequency f at time t

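$$\frac{\partial p(f,t)}{\partial t} = -\frac{\partial}{\partial f}\left[\left(\frac{f}{\tau_i}\left(1 - \frac{f}{K_i}\right)\right)p(f,t)\right] + \frac{\sigma_{\tau_i}}{2\tau_i}\frac{\partial^2}{\partial f^2}\left(f^2 p(f,t)\right) \tag{11}$$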

From which one sets $\frac{\partial p}{\partial t} = 0$ to obtain the stationary probability distribution of allele frequencies

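$$p(f) = \frac{1}{\Gamma\left(2\sigma_{\tau_i}^{-1} - 1\right)}\left(\frac{2}{K_i \sigma_{\tau_i}}\right)^{2\sigma_{\tau_i}^{-1} - 1}\exp\left[-\frac{2}{K_i \sigma_{\tau_i}}f\right]f^{2\sigma_{\tau_i}^{-1} - 2} \tag{12}$$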

This distribution, known as the abundance fluctuation distribution in macroecology [44], is a gamma distribution with the following mean and squared coefficient of variation

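$$\langle f \rangle = K_i\left(1 - \frac{\sigma_{\tau_i}}{2}\right) \tag{13a}$$
$$\frac{\langle f^2 \rangle - \langle f \rangle^2}{\langle f \rangle^2} = \frac{\sigma_{\tau_i}}{2 - \sigma_{\tau_i}} \tag{13b}$$

Defining empirical estimates of $\langle f \rangle$ and $\frac{\langle f^2 \rangle - \langle f \rangle^2}{\langle f \rangle^2}$ as $\bar{f}$ and $\beta^{-1}$, I obtained a form of the gamma (represented in its shape/rate parameterization form) that can be used to generate ecological predictions of measures of genetic diversity with zero free parameters

$$p(f \mid \beta, \beta/\bar{f}) = \frac{1}{\Gamma(\beta)}\left(\frac{\beta}{\bar{f}}\right)^{\beta}\exp\left[-f\frac{\beta}{\bar{f}}\right] f^{\beta - 1} \tag{14}$$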
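As an illustration of how the two parameters of Eq 14 can be obtained from data, the sketch below estimates the mean frequency and the inverse squared coefficient of variation from a list of within-host frequencies and evaluates the gamma density. The function names are hypothetical, and the use of the population variance (rather than the sample variance) is an assumption of this sketch.

```python
import math
import statistics

def estimate_gamma_parameters(frequencies):
    """Return (mean_f, beta), where beta = mean^2 / variance."""
    mean_f = statistics.fmean(frequencies)
    var_f = statistics.pvariance(frequencies, mu=mean_f)
    if var_f == 0:
        raise ValueError("beta is undefined when all frequencies are equal")
    return mean_f, mean_f ** 2 / var_f

def gamma_afd_density(f, mean_f, beta):
    """Gamma density of Eq 14 with shape beta and rate beta / mean_f."""
    rate = beta / mean_f
    log_p = (beta * math.log(rate) - math.lgamma(beta)
             - rate * f + (beta - 1.0) * math.log(f))
    return math.exp(log_p)

if __name__ == "__main__":
    # Hypothetical frequencies of one allele across three hosts.
    mean_f, beta = estimate_gamma_parameters([0.1, 0.2, 0.3])
    print(mean_f, beta, gamma_afd_density(0.15, mean_f, beta))
```

Working with the log-density and `math.lgamma` avoids overflow in the gamma function when the shape parameter is large.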

Deriving prevalence predictions using the SLM

The probability of detecting an allele of a given frequency within a host depends on one’s sampling effort. The impact of finite sampling from gamma distributed random variables has been previously examined within macroecology [44], results that I apply and extend in this section. To model this process, I start by letting (A, D) denote the number of reads of the alternate allele and total sequencing depth at a given site. I estimate the frequency of the alternate allele within a given host as $\hat{f} = A/D$. I assume that the sampling distribution of A is binomial

$$\Pr[A \mid D, f] = \binom{D}{A} f^{A} (1 - f)^{D - A} \tag{15}$$

When D ≫ 1 and f ≪ 1 while Df remains finite, the binomial sampling process can be approximated by the Poisson distribution

$$\Pr[A \mid D, f] = \frac{(D \cdot f)^{A}}{A!} e^{-D \cdot f} \tag{16}$$
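The quality of the Poisson approximation can be checked numerically. The sketch below compares the binomial and Poisson probabilities for a deep, low-frequency site and for a shallow, intermediate-frequency site; the function names and the example values of A, D, and f are illustrative assumptions.

```python
import math

def binomial_pmf(A, D, f):
    """Binomial probability of A alternate reads out of D (Eq 15)."""
    return math.comb(D, A) * f ** A * (1.0 - f) ** (D - A)

def poisson_pmf(A, D, f):
    """Poisson approximation with rate D * f (Eq 16)."""
    rate = D * f
    return math.exp(A * math.log(rate) - rate - math.lgamma(A + 1))

if __name__ == "__main__":
    # Deep coverage, rare allele: the two probabilities should be close.
    print(binomial_pmf(3, 1000, 0.005), poisson_pmf(3, 1000, 0.005))
    # Shallow coverage, intermediate frequency: the approximation degrades.
    print(binomial_pmf(10, 20, 0.5), poisson_pmf(10, 20, 0.5))
```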

Using this approximation, one can solve the integral for the probability of observing A reads assigned to the alternate allele out of D total reads when f is a gamma distributed random variable

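$$\Pr[A \mid D, \beta, \beta/\bar{f}] = \int_{0}^{1} \Pr[A \mid D, f] \cdot p(f \mid \beta, \beta/\bar{f})\, df \tag{17a}$$
$$= \frac{D^{A}}{A!\,\Gamma(\beta)}\left(\frac{\beta}{\bar{f}}\right)^{\beta}\int_{0}^{1} f^{A + \beta - 1} e^{-f\left(D + \beta \bar{f}^{-1}\right)}\, df \tag{17b}$$
$$= \frac{\Gamma(\beta + A)}{A!\,\Gamma(\beta)}\left(\frac{\bar{f} D}{\beta + \bar{f} D}\right)^{A}\left(\frac{\beta}{\beta + \bar{f} D}\right)^{\beta} \tag{17c}$$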
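The step from Eq 17b to Eq 17c can be written out explicitly. The sketch below assumes that the gamma density places negligible mass above f = 1, so that the upper limit of integration can be extended to infinity; this assumption and the shorthand c are introduced here for illustration and are not stated in the text.

```latex
\begin{aligned}
c &\equiv D + \frac{\beta}{\bar{f}}, \qquad
\int_{0}^{\infty} f^{A+\beta-1} e^{-c f}\, df = \frac{\Gamma(A+\beta)}{c^{A+\beta}} \\
\Pr[A \mid D, \beta, \beta/\bar{f}]
&\approx \frac{D^{A}}{A!\,\Gamma(\beta)} \left(\frac{\beta}{\bar{f}}\right)^{\beta}
   \frac{\Gamma(A+\beta)}{\left(D + \beta/\bar{f}\right)^{A+\beta}} \\
&= \frac{\Gamma(\beta+A)}{A!\,\Gamma(\beta)}
   \left(\frac{D}{D + \beta/\bar{f}}\right)^{A}
   \left(\frac{\beta/\bar{f}}{D + \beta/\bar{f}}\right)^{\beta} \\
&= \frac{\Gamma(\beta+A)}{A!\,\Gamma(\beta)}
   \left(\frac{\bar{f} D}{\beta + \bar{f} D}\right)^{A}
   \left(\frac{\beta}{\beta + \bar{f} D}\right)^{\beta}
\end{aligned}
```

The last line follows from multiplying the numerator and denominator of each fraction by the mean frequency.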

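The distribution $\Pr[A \mid D, \beta, \beta/\bar{f}]$ is a negative binomial distribution, which represents a gamma distribution of allele frequencies that explicitly accounts for sampling. By setting A = 0, one can calculate the probability of not detecting the alternate allele (i.e., absence) with a sampling depth of D reads [44, 71]

$$\Pr[0 \mid D, \beta, \beta/\bar{f}] = \left(1 + \frac{D \cdot \bar{f}}{\beta}\right)^{-\beta} \tag{18}$$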

From which one can calculate the expected prevalence of the allele ϱ over M hosts as

$$\varrho = \frac{1}{M}\sum_{m=1}^{M}\left(1 - \Pr[0 \mid D_m, \beta, \beta/\bar{f}]\right) \tag{19}$$
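The prevalence prediction can be sketched in a few lines. The code below evaluates the sampled gamma distribution of Eq 17c, obtains the probability of absence by setting A = 0, and averages the probability of presence over the sequencing depths of M hosts as in Eq 19. The function names and example values are illustrative assumptions rather than the code used in this study.

```python
import math

def sampled_gamma_pmf(A, depth, mean_f, beta):
    """Probability of A alternate reads at a given depth (Eq 17c)."""
    denom = beta + mean_f * depth
    log_p = (math.lgamma(beta + A) - math.lgamma(A + 1) - math.lgamma(beta)
             + A * math.log(mean_f * depth / denom)
             + beta * math.log(beta / denom))
    return math.exp(log_p)

def prob_absence(depth, mean_f, beta):
    """Probability of observing zero alternate reads (Eq 17c with A = 0)."""
    return (1.0 + mean_f * depth / beta) ** (-beta)

def predicted_prevalence(depths, mean_f, beta):
    """Expected fraction of hosts in which the allele is detected (Eq 19)."""
    return sum(1.0 - prob_absence(d, mean_f, beta) for d in depths) / len(depths)

if __name__ == "__main__":
    # Hypothetical depths for four hosts and hypothetical gamma parameters.
    print(predicted_prevalence([20, 35, 60, 120], mean_f=0.05, beta=1.5))
```

Because the prediction depends on host-specific depths, two alleles with identical gamma parameters can have different predicted prevalences if they are covered to different depths.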

Evaluating prevalence predictions

Full derivations of the predicted prevalence of each model can be found in the Materials and Methods. The predicted values of prevalence were compared to the following estimate of observed prevalence.

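$$\hat{\varrho} = \frac{1}{M}\sum_{m=1}^{M}\left(1 - \delta_{f_m,0}\right) = 1 - \frac{1}{M}\sum_{m=1}^{M}\delta_{f_m,0} \tag{20}$$

where $f_m$ is the frequency of the alternative allele in the $m$th host and the Kronecker delta $\delta_{f_m,0}$ is equal to 1 if $f_m = 0$ and zero otherwise. To evaluate the success of the predictions I calculated the relative error of a given prediction

$$\varepsilon = \left|\frac{\mathrm{Obs.} - \mathrm{Pred.}}{\mathrm{Obs.}}\right| \tag{21}$$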
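The observed prevalence and the relative error are simple to compute. The following minimal sketch uses hypothetical function names and example values; it assumes that an allele is counted as present in a host whenever its estimated frequency is nonzero.

```python
def observed_prevalence(frequencies):
    """Fraction of hosts in which the allele frequency is nonzero (Eq 20)."""
    return sum(1 for f in frequencies if f != 0) / len(frequencies)

def relative_error(observed, predicted):
    """Relative error of a prediction (Eq 21); undefined if observed is 0."""
    return abs((observed - predicted) / observed)

if __name__ == "__main__":
    # Hypothetical frequencies of one allele across four hosts.
    obs = observed_prevalence([0.0, 0.1, 0.0, 0.3])
    print(obs, relative_error(obs, 0.4))
```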

I performed permutation tests to determine whether the SLM had higher success among alleles with higher prevalence. By permuting all values of ε for a given species and calculating the Pearson correlation coefficient between it and the observed prevalence, I obtained a null distribution of correlation coefficients from which I calculated 95% intervals.
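The permutation procedure can be sketched as follows. The code shuffles the relative errors, recomputes the Pearson correlation with observed prevalence for each shuffle, and reports an approximate 95% interval of the resulting null distribution. The function names, the number of permutations, and the simple index-based percentile rule are assumptions of this sketch.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def permutation_null_interval(errors, prevalences, n_permutations=1000, seed=0):
    """Approximate 95% interval of the correlation under permuted errors."""
    rng = random.Random(seed)
    shuffled = list(errors)
    null = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        null.append(pearson_r(shuffled, prevalences))
    null.sort()
    lower = null[int(0.025 * n_permutations)]
    upper = null[int(0.975 * n_permutations) - 1]
    return lower, upper

if __name__ == "__main__":
    # Hypothetical data in which error declines with prevalence.
    prevalences = [i / 20 for i in range(1, 21)]
    errors = [1.0 - p for p in prevalences]
    print(pearson_r(errors, prevalences),
          permutation_null_interval(errors, prevalences))
```

An observed correlation that falls below the lower bound of the interval indicates that the error decreases with prevalence more strongly than expected by chance.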

To determine whether there was a relationship between the error of the SLM and the fraction of hosts with strain structure across species, I implemented a permutation approach. First, for each species, I calculated the number of alleles in a given prevalence threshold (T total thresholds). I then permuted all values of ε and calculated the mean ε using the number of alleles that were found in each prevalence threshold, $(\bar{\varepsilon}_1, \ldots, \bar{\varepsilon}_T)$. The correlation coefficient was then calculated between $\bar{\varepsilon}_t$ and the fraction of hosts containing strains among all species for each prevalence threshold, allowing me to obtain a null distribution of correlation coefficients for all values of t. I only retained a prevalence threshold for a given species if there were at least 10 sites within the threshold.

Supporting information

S1 Fig. Measures of genetic diversity among nonsynonymous sites.

Measures of genetic diversity calculated from nonsynonymous sites exhibit similar statistical forms across phylogenetically distant species in the human gut, similar to patterns observed among synonymous sites (Fig 1).

(TIF)

S2 Fig. AFD survival curves for synonymous sites.

Survival forms of rescaled distributions of within-host allele frequencies across hosts and mean frequencies across hosts. Representing the data presented in Fig 1c and 1d reveals how distributions of genetic diversity have similar forms across phylogenetically distant species. Each non-black line represents a species. A dashed black line represents the fit of a gamma distribution and a dotted black line represents the fit of a lognormal distribution.

(TIF)

S3 Fig. AFD survival curves for nonsynonymous sites.

The equivalent plot for S2 Fig for nonsynonymous sites.

(TIF)

S4 Fig. Coverage distribution.

The use of the log-likelihood ratio in MAPGD introduces a lower bound on the total depth of coverage (D) necessary to estimate the frequency of an allele at a given site. a) The existence of a lower bound translates to a truncation of the data, where I did not observe any sites with a coverage less than 20 that were processed by MAPGD. b,c) This truncation means that the depth of coverage of a minor allele (A) cannot be less than half the total coverage (e.g., 10).

(TIF)

S5 Fig. Synonymous prevalence predictions.

A direct comparison between the observed prevalence of all alleles and their corresponding predicted prevalences using the SLM for synonymous sites. A total of 1,000 datapoints were sampled without replacement for each subplot.

(TIF)

S6 Fig. Nonsynonymous prevalence predictions.

Analogous analyses to S5 Fig using nonsynonymous sites.

(TIF)

S7 Fig. Synonymous prevalence prediction error distributions.

By calculating the relative error of all alleles for the SLM I can examine the error distributions across species. To visually compare the two models, I examined the survival distribution of the relative errors (i.e., the complement of the empirical cumulative distribution function). All alleles in this plot are at synonymous sites.

(TIF)

S8 Fig. Nonsynonymous prevalence prediction error distributions.

Analogous analyses to S7 Fig using nonsynonymous sites.

(TIF)

S9 Fig. Synonymous relationship between $\bar{f}$ and prevalence.

The empirical relationship between the mean frequency of an allele ($\bar{f}$) and its prevalence across hosts can be recapitulated by the SLM for synonymous sites. Blue dots represent observed values and the shade of blue is proportional to the density of observations. The black line is the predicted relationship calculated using Eq 11. A total of 1,000 datapoints were sampled without replacement for each subplot.

(TIF)

S10 Fig. Nonsynonymous relationship between $\bar{f}$ and prevalence.

Analogous analyses to S9 Fig using nonsynonymous sites.

(TIF)

S11 Fig. Nonsynonymous prevalence error analysis.

The equivalent analyses in Fig 3 were performed on alleles at nonsynonymous sites. The results of these analyses are qualitatively consistent with those of synonymous sites.

(TIF)

S12 Fig. Relationship between $\bar{f}$ and β for synonymous sites.

The relationship between the empirical estimates of the two parameters of the SLM: the mean allele frequency across hosts ($\bar{f}$) and the squared inverse of the coefficient of variation of frequencies across hosts (β). Each point is an individual allele. All alleles are on synonymous sites. A total of 1,000 datapoints were sampled without replacement for each subplot.

(TIF)

S13 Fig. Relationship between $\bar{f}$ and β for nonsynonymous sites.

Analogous analyses to S12 Fig using nonsynonymous sites.

(TIF)

S1 Text. Supplemental information.

Derivation of the distribution of allele frequencies under a linearized single-locus model of evolution.

(PDF)

Acknowledgments

I thank S. Bald and R.W. Wolff for their assistance with StrainFinder and both N.R. Garud and R.W. Wolff for their comments on an early draft. Thanks to B.H. Good for pivotal discussions, sharing their insights, and for making their lecture notes available to the public. Thanks to S. Bubnovich, J. Grilli, D. Reyes-González, and N.I. Wisnoski for their feedback on the manuscript. Finally, thanks to M.S. Ackerman for their assistance with MAPGD. This work used computational and storage services associated with the Hoffman2 Shared Cluster provided by UCLA Institute for Digital Research and Education’s Research Technology Group.

Data Availability

All raw sequencing data for the metagenomic samples used in this study were downloaded from the HMP (Huttenhower et al. 2012; Lloyd-Price et al. 2017 [24]) (URLs: https://portal.hmpdacc.org/, https://aws.amazon.com/datasets/human-microbiome-project/). MIDAS output was processed using publicly available code [3]. Processed data are available on Zenodo: https://doi.org/10.5281/zenodo.6793770. All code written for this study is available on GitHub: https://github.com/wrshoemaker/StrainPrevalence.

Funding Statement

This work was supported by the NSF Postdoctoral Research Fellowships in Biology Program under Grant No. 2010885 (W.R.S.). https://beta.nsf.gov/funding/opportunities/postdoctoral-research-fellowships-biology-prfb The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Zhao S, Lieberman TD, Poyet M, Kauffman KM, Gibbons SM, Groussin M, et al. Adaptive Evolution within Gut Microbiomes of Healthy People. Cell Host & Microbe. 2019;25(5):656–667.e8. doi: 10.1016/j.chom.2019.03.007
  • 2. Ghosh OM, Good BH. Emergent evolutionary forces in spatial models of luminal growth in the human gut microbiota; 2021. Available from: https://www.biorxiv.org/content/10.1101/2021.07.15.452569v1.
  • 3. Garud NR, Good BH, Hallatschek O, Pollard KS. Evolutionary dynamics of bacteria in the gut microbiome within and across hosts. PLOS Biology. 2019;17(1):e3000102. doi: 10.1371/journal.pbio.3000102
  • 4. Yaffe E, Relman DA. Tracking microbial evolution in the human gut using Hi-C reveals extensive horizontal gene transfer, persistence and adaptation. Nature Microbiology. 2020;5(2):343–353. doi: 10.1038/s41564-019-0625-0
  • 5. Roodgar M, Good BH, Garud NR, Martis S, Avula M, Zhou W, et al. Longitudinal linked-read sequencing reveals ecological and evolutionary responses of a human gut microbiome during antibiotic treatment. Genome Research. 2021;31(8):1433–1446. doi: 10.1101/gr.265058.120
  • 6. Ghalayini M, Launay A, Bridier-Nahmias A, Clermont O, Denamur E, Lescat M, et al. Evolution of a Dominant Natural Isolate of Escherichia coli in the Human Gut over the Course of a Year Suggests a Neutral Evolution with Reduced Effective Population Size. Applied and Environmental Microbiology. 2018;84(6):e02377–17. doi: 10.1128/AEM.02377-17
  • 7. Chen DW, Garud NR. Rapid evolution and strain turnover in the infant gut microbiome; 2021. Available from: https://www.biorxiv.org/content/10.1101/2021.09.26.461856v1.
  • 8. Groussin M, Poyet M, Sistiaga A, Kearney SM, Moniz K, Noel M, et al. Elevated rates of horizontal gene transfer in the industrialized human microbiome. Cell. 2021;184(8):2053–2067.e18. doi: 10.1016/j.cell.2021.02.052
  • 9. Dapa T, Wong DP, Vasquez KS, Xavier KB, Huang KC, Good BH. Within-host evolution of the gut microbiome. Current Opinion in Microbiology. 2023;71:102258. doi: 10.1016/j.mib.2022.102258
  • 10. Good BH, Hallatschek O. Effective models and the search for quantitative principles in microbial evolution. Current Opinion in Microbiology. 2018;45:203–212. doi: 10.1016/j.mib.2018.11.005
  • 11. Truong DT, Tett A, Pasolli E, Huttenhower C, Segata N. Microbial strain-level population structure and genetic diversity from metagenomes. Genome Research. 2017;27(4):626–638. doi: 10.1101/gr.216242.116
  • 12. Schloissnig S, Arumugam M, Sunagawa S, Mitreva M, Tap J, Zhu A, et al. Genomic variation landscape of the human gut microbiome. Nature. 2013;493(7430):45–50. doi: 10.1038/nature11711
  • 13. Faith JJ, Guruge JL, Charbonneau M, Subramanian S, Seedorf H, Goodman AL, et al. The long-term stability of the human gut microbiota. Science (New York, NY). 2013;341(6141):1237439. doi: 10.1126/science.1237439
  • 14. Moeller AH. Metagenomic signatures of balancing selection in the human gut. Molecular Ecology. 2023;32(10):2582–2591. doi: 10.1111/mec.16474
  • 15. Valles-Colomer M, Blanco-Míguez A, Manghi P, Asnicar F, Dubois L, Golzato D, et al. The person-to-person transmission landscape of the gut and oral microbiomes. Nature. 2023;614(7946):125–135. doi: 10.1038/s41586-022-05620-1
  • 16. Turroni F, Foroni E, Pizzetti P, Giubellini V, Ribbera A, Merusi P, et al. Exploring the Diversity of the Bifidobacterial Population in the Human Intestinal Tract. Applied and Environmental Microbiology. 2009;75(6):1534–1545. doi: 10.1128/AEM.02216-08
  • 17. Vatanen T, Plichta DR, Somani J, Münch PC, Arthur TD, Hall AB, et al. Genomic variation and strain-specific functional adaptation in the human gut microbiome during early life. Nature Microbiology. 2019;4(3):470–479. doi: 10.1038/s41564-018-0321-5
  • 18. Forster SC, Kumar N, Anonye BO, Almeida A, Viciani E, Stares MD, et al. A human gut bacterial genome and culture collection for improved metagenomic analyses. Nature Biotechnology. 2019;37(2):186–192. doi: 10.1038/s41587-018-0009-7
  • 19. Ferretti P, Pasolli E, Tett A, Asnicar F, Gorfer V, Fedi S, et al. Mother-to-Infant Microbial Transmission from Different Body Sites Shapes the Developing Infant Gut Microbiome. Cell Host & Microbe. 2018;24(1):133–145.e5. doi: 10.1016/j.chom.2018.06.005
  • 20. Goyal A, Bittleston LS, Leventhal GE, Lu L, Cordero OX. Interactions between strains govern the eco-evolutionary dynamics of microbial communities. eLife. 2022;11:e74987. doi: 10.7554/eLife.74987
  • 21. Wang Z, Fridman Y, Maslov S, Goyal A. Fine-scale diversity of microbial communities due to satellite niches in boom-and-bust environments; 2022. Available from: https://www.biorxiv.org/content/10.1101/2022.05.26.493560v2.
  • 22. Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, Mende DR, et al. Enterotypes of the human gut microbiome. Nature. 2011;473(7346):174–180. doi: 10.1038/nature09944
  • 23. Ley RE, Lozupone CA, Hamady M, Knight R, Gordon JI. Worlds within worlds: evolution of the vertebrate gut microbiota. Nature Reviews Microbiology. 2008;6(10):776–788. doi: 10.1038/nrmicro1978
  • 24. Lloyd-Price J, Mahurkar A, Rahnavard G, Crabtree J, Orvis J, Hall AB, et al. Strains, functions and dynamics in the expanded Human Microbiome Project. Nature. 2017;550(7674):61–66. doi: 10.1038/nature23889
  • 25. Good BH. Linkage disequilibrium between rare mutations. Genetics. 2022;220(4):iyac004. doi: 10.1093/genetics/iyac004
  • 26. Baumgartner M, Bayer F, Pfrunder-Cardozo KR, Buckling A, Hall AR. Resident microbial communities inhibit growth and antibiotic-resistance evolution of Escherichia coli in human gut microbiome samples. PLOS Biology. 2020;18(4):e3000465. doi: 10.1371/journal.pbio.3000465
  • 27. Tett A, Pasolli E, Masetti G, Ercolini D, Segata N. Prevotella diversity, niches and interactions with the human host. Nature Reviews Microbiology. 2021;19(9):585–599. doi: 10.1038/s41579-021-00559-y
  • 28. Karcher N, Pasolli E, Asnicar F, Huang KD, Tett A, Manara S, et al. Analysis of 1321 Eubacterium rectale genomes from metagenomes uncovers complex phylogeographic population structure and subspecies functional adaptations. Genome Biology. 2020;21(1):138. doi: 10.1186/s13059-020-02042-y
  • 29. Nayfach S, Rodriguez-Mueller B, Garud N, Pollard KS. An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography. Genome Research. 2016;26(11):1612–1625. doi: 10.1101/gr.201863.115
  • 30. Niu J, Xu L, Qian Y, Sun Z, Yu D, Huang J, et al. Evolution of the Gut Microbiome in Early Childhood: A Cross-Sectional Study of Chinese Children. Frontiers in Microbiology. 2020;11. doi: 10.3389/fmicb.2020.00439
  • 31. Kundu P, Blacher E, Elinav E, Pettersson S. Our Gut Microbiome: The Evolving Inner Self. Cell. 2017;171(7):1481–1493. doi: 10.1016/j.cell.2017.11.024
  • 32. Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, Contreras M, et al. Human gut microbiome viewed across age and geography. Nature. 2012;486(7402):222–227. doi: 10.1038/nature11053
  • 33. Tierney BT, Yang Z, Luber JM, Beaudin M, Wibowo MC, Baek C, et al. The Landscape of Genetic Content in the Gut and Oral Human Microbiome. Cell Host & Microbe. 2019;26(2):283–295.e8. doi: 10.1016/j.chom.2019.07.008
  • 34. Prosser JI, Bohannan BJM, Curtis TP, Ellis RJ, Firestone MK, Freckleton RP, et al. The role of ecological theory in microbial ecology. Nature Reviews Microbiology. 2007;5(5):384–392. doi: 10.1038/nrmicro1643
  • 35. Marquet PA, Allen AP, Brown JH, Dunne JA, Enquist BJ, Gillooly JF, et al. On Theory in Ecology. BioScience. 2014;64(8):701–710. doi: 10.1093/biosci/biu098
  • 36. Good BH, Rosenfeld LB. Eco-evolutionary feedbacks in the human gut microbiome. bioRxiv; 2022. Available from: https://www.biorxiv.org/content/10.1101/2022.01.26.477953v1.
  • 37. Garud NR, Pollard KS. Population Genetics in the Human Microbiome. Trends in Genetics. 2020;36(1):53–67. doi: 10.1016/j.tig.2019.10.010
  • 38. Poyet M, Groussin M, Gibbons SM, Avila-Pacheco J, Jiang X, Kearney SM, et al. A library of human gut bacterial isolates paired with longitudinal multiomics data enables mechanistic microbiome research. Nature Medicine. 2019;25(9):1442–1452. doi: 10.1038/s41591-019-0559-3
  • 39. Shoemaker WR, Chen D, Garud NR. Comparative Population Genetics in the Human Gut Microbiome. Genome Biology and Evolution. 2021;(evab116). doi: 10.1093/gbe/evab116
  • 40. Cui W, Marsland R, Mehta P. Diverse communities behave like typical random ecosystems. Physical Review E. 2021;104(3):034416. doi: 10.1103/PhysRevE.104.034416
  • 41. Advani M, Bunin G, Mehta P. Statistical physics of community ecology: a cavity solution to MacArthur’s consumer resource model. Journal of Statistical Mechanics (Online). 2018;2018:033406. doi: 10.1088/1742-5468/aab04e
  • 42. Descheemaeker L, de Buyl S. Stochastic logistic models reproduce experimental time series of microbial communities. eLife. 2020;9:e55650. doi: 10.7554/eLife.55650
  • 43. Shoemaker WR, Locey KJ, Lennon JT. A macroecological theory of microbial biodiversity. Nature Ecology & Evolution. 2017;1(5):1–6.
  • 44. Grilli J. Macroecological laws describe variation and diversity in microbial communities. Nature Communications. 2020;11(1):4743. doi: 10.1038/s41467-020-18529-y
  • 45. Ji BW, Sheth RU, Dixit PD, Tchourine K, Vitkup D. Macroecological dynamics of gut microbiota. Nature Microbiology. 2020;5(5):768–775. doi: 10.1038/s41564-020-0685-1
  • 46. Wolff R, Shoemaker W, Garud N. Ecological Stability Emerges at the Level of Strains in the Human Gut Microbiome. mBio. 2023;0(0):e02502–22. doi: 10.1128/mbio.02502-22
  • 47. Shoemaker WR, Grilli J. Macroecological patterns in coarse-grained microbial communities; 2023. Available from: https://www.biorxiv.org/content/10.1101/2023.03.02.530804v1.
  • 48. Shade A, Dunn RR, Blowes SA, Keil P, Bohannan BJM, Herrmann M, et al. Macroecology to Unite All Life, Large and Small. Trends in Ecology & Evolution. 2018;33(10):731–744. doi: 10.1016/j.tree.2018.08.005
  • 49. Taylor LR. Aggregation, Variance and the Mean. Nature. 1961;189(4766):732–735. doi: 10.1038/189732a0
  • 50. Phillips R. Theory in Biology: Figure 1 or Figure 7? Trends in Cell Biology. 2015;25(12):723–729. doi: 10.1016/j.tcb.2015.10.007
  • 51. Bhattacharjee SM, Seno F. A measure of data collapse for scaling. Journal of Physics A: Mathematical and General. 2001;34(33):6375–6380. doi: 10.1088/0305-4470/34/33/302
  • 52. Stanley HE. Scaling, universality, and renormalization: Three pillars of modern critical phenomena. Reviews of Modern Physics. 1999;71(2):S358–S366. doi: 10.1103/RevModPhys.71.S358
  • 53. Theys K, Feder AF, Gelbart M, Hartl M, Stern A, Pennings PS. Within-patient mutation frequencies reveal fitness costs of CpG dinucleotides and drastic amino acid changes in HIV. PLOS Genetics. 2018;14(6):e1007420. doi: 10.1371/journal.pgen.1007420
  • 54. Vogl C, Bergman J. Computation of the Likelihood of Joint Site Frequency Spectra Using Orthogonal Polynomials. Computation. 2016;4(1):6. doi: 10.3390/computation4010006
  • 55. Marquet PA, Quiñones RA, Abades S, Labra F, Tognelli M, Arim M, et al. Scaling and power-laws in ecological systems. Journal of Experimental Biology. 2005;208(9):1749–1769. doi: 10.1242/jeb.01588
  • 56. Ramsayer J, Fellous S, Cohen JE, Hochberg ME. Taylor’s Law holds in experimental bacterial populations but competition does not influence the slope. Biology Letters. 2012;8(2):316–319. doi: 10.1098/rsbl.2011.0895
  • 57. Kendal WS, Jørgensen B. Taylor’s power law and fluctuation scaling explained by a central-limit-like convergence. Physical Review E. 2011;83(6):066115. doi: 10.1103/PhysRevE.83.066115
  • 58. Kendal WS. A scale invariant clustering of genes on human chromosome 7. BMC Evolutionary Biology. 2004; p. 10. doi: 10.1186/1471-2148-4-3
  • 59. Kendal WS. An Exponential Dispersion Model for the Distribution of Human Single Nucleotide Polymorphisms. Molecular Biology and Evolution. 2003;20(4):579–590. doi: 10.1093/molbev/msg057
  • 60. He M, Sebaihia M, Lawley TD, Stabler RA, Dawson LF, Martin MJ, et al. Evolutionary dynamics of Clostridium difficile over short and long time scales. Proceedings of the National Academy of Sciences of the United States of America. 2010;107(16):7527–7532. doi: 10.1073/pnas.0914322107
  • 61. Bhatia R, Davis C. A Better Bound on the Variance. The American Mathematical Monthly. 2000;107(4):353–357. doi: 10.1080/00029890.2000.12005203
  • 62. Cohen JE, Xu M. Random sampling of skewed distributions implies Taylor’s power law of fluctuation scaling. Proceedings of the National Academy of Sciences. 2015;112(25):7749–7754. doi: 10.1073/pnas.1503824112
  • 63. Giometto A, Formentin M, Rinaldo A, Cohen JE, Maritan A. Sample and population exponents of generalized Taylor’s law. Proceedings of the National Academy of Sciences. 2015;112(25):7755–7760. doi: 10.1073/pnas.1505882112
  • 64. Xiao X, Locey KJ, White EP. A Process-Independent Explanation for the General Form of Taylor’s Law. The American Naturalist. 2015;186(2):E51–E60. doi: 10.1086/682050
  • 65. Transtrum MK, Machta BB, Brown KS, Daniels BC, Myers CR, Sethna JP. Perspective: Sloppiness and emergent theories in physics, biology, and beyond. The Journal of Chemical Physics. 2015;143(1):010901. doi: 10.1063/1.4923066
  • 66. Zaoli S, Grilli J. The stochastic logistic model with correlated carrying capacities reproduces beta-diversity metrics of microbial communities. PLOS Computational Biology. 2022;18(4):e1010043. doi: 10.1371/journal.pcbi.1010043
  • 67. Zaoli S, Grilli J. A macroecological description of alternative stable states reproduces intra- and inter-host variability of gut microbiome. Science Advances. 2021;7(43):eabj2882. doi: 10.1126/sciadv.abj2882
  • 68. Camacho-Mateu J, Lampo A, Sireci M, Muñoz MA, Cuesta JA. Species interactions reproduce abundance correlations patterns in microbial communities; 2023. Available from: http://arxiv.org/abs/2305.19154.
  • 69. Gardiner CW. Stochastic methods: a handbook for the natural and social sciences. 4th ed. No. 13 in Springer series in synergetics. Berlin Heidelberg: Springer; 2009.
  • 70. Engen S, Lande R. Population Dynamic Models Generating Species Abundance Distributions of the Gamma Type. Journal of Theoretical Biology. 1996;178(3):325–331. doi: 10.1006/jtbi.1996.0028
  • 71. Shoemaker WR, Lennon JT. Predicting Parallelism and Quantifying Divergence in Microbial Evolution Experiments. mSphere. 2022. doi: 10.1128/msphere.00672-21
  • 72. Gaston KJ, Blackburn TM, Greenwood JJD, Gregory RD, Quinn RM, Lawton JH. Abundance–occupancy relationships. Journal of Applied Ecology. 2000;37(s1):39–59. doi: 10.1046/j.1365-2664.2000.00485.x
  • 73. Sloan WT, Woodcock S, Lunn M, Head IM, Curtis TP. Modeling Taxa-Abundance Distributions in Microbial Communities using Environmental Sequence Data. Microbial Ecology. 2007;53(3):443–455. doi: 10.1007/s00248-006-9141-x
  • 74. Burns AR, Stephens WZ, Stagaman K, Wong S, Rawls JF, Guillemin K, et al. Contribution of neutral processes to the assembly of gut microbial communities in the zebrafish over host development. The ISME Journal. 2016;10(3):655–664. doi: 10.1038/ismej.2015.142
  • 75. Cohen JE. Every variance function, including Taylor’s power law of fluctuation scaling, can be produced by any location-scale family of distributions with positive mean and variance. Theoretical Ecology. 2020;13(1):1–5. doi: 10.1007/s12080-019-00445-7
  • 76. Ewens WJ. Mathematical population genetics. Springer; 2010.
  • 77. Hubbell SP. The unified neutral theory of biodiversity and biogeography. No. 32 in Monographs in population biology. Princeton: Princeton University Press; 2001.
  • 78. Øksendal BK. Stochastic differential equations: an introduction with applications. 6th ed. Universitext. Berlin; New York: Springer; 2007.
  • 79. Nei M. The frequency distribution of lethal chromosomes in finite populations. Proceedings of the National Academy of Sciences of the United States of America. 1968;60(2):517–524. doi: 10.1073/pnas.60.2.517
  • 80. Cvijović I, Good BH, Desai MM. The Effect of Strong Purifying Selection on Genetic Diversity. Genetics. 2018;209(4):1235–1278. doi: 10.1534/genetics.118.301058
  • 81. Yan Y, Nguyen LH, Franzosa EA, Huttenhower C. Strain-level epidemiology of microbial communities and the human microbiome. Genome Medicine. 2020;12:71. doi: 10.1186/s13073-020-00765-y
  • 82. Dijkshoorn L, Ursing BM, Ursing JB. Strain, clone and species: comments on three basic concepts of bacteriology. Journal of Medical Microbiology. 2000;49(5):397–401. doi: 10.1099/0022-1317-49-5-397
  • 83. Zhu A, Sunagawa S, Mende DR, Bork P. Inter-individual differences in the gene content of human gut bacterial species. Genome Biology. 2015;16(1):82. doi: 10.1186/s13059-015-0646-9
  • 84. Shapiro BJ, Polz MF. Ordering microbial diversity into ecologically and genetically cohesive units. Trends in Microbiology. 2014;22(5):235–247. doi: 10.1016/j.tim.2014.02.006
  • 85. Smillie CS, Sauk J, Gevers D, Friedman J, Sung J, Youngster I, et al. Strain Tracking Reveals the Determinants of Bacterial Engraftment in the Human Gut Following Fecal Microbiota Transplantation. Cell Host & Microbe. 2018;23(2):229–240.e5. doi: 10.1016/j.chom.2018.01.003
  • 86. Chesson P. MacArthur’s consumer-resource model. Theoretical Population Biology. 1990;37(1):26–38. doi: 10.1016/0040-5809(90)90025-Q
  • 87. Good BH, Martis S, Hallatschek O. Adaptation limits ecological diversification and promotes ecological tinkering during the competition for substitutable resources. Proceedings of the National Academy of Sciences. 2018;115(44):E10407–E10416. doi: 10.1073/pnas.1807530115
  • 88. Ho PY, Good BH, Huang KC. Competition for fluctuating resources reproduces statistics of species abundance over time across wide-ranging microbiotas. eLife. 2022;11:e75168. doi: 10.7554/eLife.75168
  • 89. Madi N, Chen D, Wolff R, Shapiro BJ, Garud NR. Community diversity is associated with intra-species genetic diversity and gene loss in the human gut microbiome. bioRxiv; 2022. Available from: https://www.biorxiv.org/content/10.1101/2022.03.08.483496v1.
  • 90. Li J, Rettedal EA, van der Helm E, Ellabaan M, Panagiotou G, Sommer MOA. Antibiotic Treatment Drives the Diversification of the Human Gut Resistome. Genomics, Proteomics & Bioinformatics. 2019;17(1):39–51. doi: 10.1016/j.gpb.2018.12.003
  • 91. N’Guessan A, Brito IL, Serohijos AWR, Shapiro BJ. Mobile Gene Sequence Evolution within Individual Human Gut Microbiomes Is Better Explained by Gene-Specific Than Host-Specific Selective Pressures. Genome Biology and Evolution. 2021;13(8):evab142. doi: 10.1093/gbe/evab142
  • 92. Simonet C, McNally L. Kin selection explains the evolution of cooperation in the gut microbiota. Proceedings of the National Academy of Sciences. 2021;118(6):e2016046118. doi: 10.1073/pnas.2016046118
  • 93. Sender R, Fuchs S, Milo R. Revised Estimates for the Number of Human and Bacteria Cells in the Body. PLOS Biology. 2016;14(8):e1002533. doi: 10.1371/journal.pbio.1002533
  • 94. Kuleshov V, Jiang C, Zhou W, Jahanbani F, Batzoglou S, Snyder M. Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome. Nature Biotechnology. 2016;34(1):64–69. doi: 10.1038/nbt.3416
  • 95. Zlitni S, Bishara A, Moss EL, Tkachenko E, Kang JB, Culver RN, et al. Strain-resolved microbiome sequencing reveals mobile elements that drive bacterial competition on a clinical timescale. Genome Medicine. 2020;12(1):50. doi: 10.1186/s13073-020-00747-0
  • 96. DeMaere MZ, Darling AE. bin3C: exploiting Hi-C sequencing data to accurately resolve metagenome-assembled genomes. Genome Biology. 2019;20(1):46. doi: 10.1186/s13059-019-1643-1
  • 97. Press MO, Wiser AH, Kronenberg ZN, Langford KW, Shakya M, Lo CC, et al. Hi-C deconvolution of a human gut microbiome yields high-quality draft genomes and reveals plasmid-genome interactions; 2017. Available from: https://www.biorxiv.org/content/10.1101/198713v1.
  • 98. Methé BA, Nelson KE, Pop M, Creasy HH, Giglio MG, Huttenhower C, et al. A framework for human microbiome research. Nature. 2012;486(7402):215–221. doi: 10.1038/nature11209
  • 99. Liu Z, Good BH. Dynamics of bacterial recombination in the human gut microbiome; 2022. Available from: https://www.biorxiv.org/content/10.1101/2022.08.24.505183v1.
  • 100. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10(2). doi: 10.1093/gigascience/giab008
  • 101. Lynch M, Bost D, Wilson S, Maruki T, Harrison S. Population-Genetic Inference from Pooled-Sequencing Data. Genome Biology and Evolution. 2014;6(5):1210–1218. doi: 10.1093/gbe/evu085 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102. Guirao-Rico S, González J. Benchmarking the performance of Pool-seq SNP callers using simulated and real sequencing data. Molecular Ecology Resources. 2021;21(4):1216–1229. doi: 10.1111/1755-0998.13343 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103.Xiao X, Thibault K, Harris DJ, Baldridge E, White E. weecology/macroecotools: v0.4.0; 2016. Available from: https://zenodo.org/record/166721.

Decision Letter 0

Karthik Raman

6 Apr 2023

PONE-D-22-30239

A macroecological perspective on genetic diversity in the human gut microbiome

PLOS ONE

Dear Dr. Shoemaker,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. In particular, please address a number of writing issues and clarify how the manuscript presents new results that contribute to the field.

Please submit your revised manuscript by May 21 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Karthik Raman, Ph.D.

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“This work was supported by the NSF Postdoctoral Research Fellowships in Biology Program under Grant No. 2010885 (W.R.S.). This work used computational and storage services associated with the Hoffman2 Shared Cluster provided by UCLA Institute for Digital Research and Education’s Research Technology Group.”

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“This work was supported by the NSF Postdoctoral Research Fellowships in Biology Program under Grant No. 2010885 (W.R.S.).

https://beta.nsf.gov/funding/opportunities/postdoctoral-research-fellowships-biology-prfb

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

3. Please upload a copy of Figure 5, to which you refer in your text on page 17. If the figure is no longer to be included as part of the submission please remove all reference to it within the text.

Additional Editor Comments:

The reviews for the manuscript are in. While the reviewers found interesting aspects of your work, they have raised major concerns. Please revise the manuscript taking into consideration the reviews of both the reviewers.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This manuscript by William Shoemaker explores intraspecies diversity within and across individual microbiomes, with a focus on the applicability of the stochastic logistic model and the existence of strain structure. It is my understanding that the main result of this modeling is that strains (ecology) are responsible for more within-host variants than mutation (evolution). This result echoes the current thinking in the field, which has been shown before by other papers, including at least one on which this author is a contributor.

This manuscript is difficult to read – partly because what is done is frequently described before the goal of the work is stated, and partly because concepts are a bit muddled / poorly introduced. I list some examples below. I also have some issues with data analysis. Most critically, it is hard to understand what is new in this paper. While it is my understanding that novelty is not a criterion for publication, more fully placing work in the context of the literature is very important for any theoretical or modeling based paper.

Writing issues:

It is hard to understand from the abstract, introduction, text, and discussion exactly what is new in this manuscript. The introduction spends much time opining on the value and need for models, without really explaining what has been shown before about the SLM in microbiomes. In particular, the manuscript cites works, including from this author, showing that strains are responsible for most intraspecies dynamics in the microbiome (this has been known for quite some time) and that SLMs describe these dynamics well. What was the open question the author set out to solve? If there was some doubt about the applicability of the SLM and strains prior to this work, what was the critical missing analysis? What required the need for this approach?

It requires a lot of work for a reader to understand what is being referred to at each point in the text. For example:

• At Line 94: “the empirical distribution” is unclear; “the distribution of within-host allele frequencies across all hosts” would be clearer. Even clearer would be “the distribution of within-host frequencies at a given site, across all hosts” or something else.

• Line 339: “unlikely for the allele frequency fluctuations I was able to observe”. Does this refer to fluctuations across hosts, or time? At the very least, a figure panel should be referred to here.

• Line 418: what is a “true strain absence”? Is the author suggesting that single strain colonization is rare for all species and strains? This would conflict with the literature (e.g., B. fragilis almost always colonizes humans as a single strain, PMID: 30673701), but if it is supported by the data this should be made more clear.

• Line 104: which variable? Mean and variance across what?

Line 500: I’m not sure exactly what point is trying to be made here. If there is already an interpretation of the data (antibiotics wiped out a strain), how does the SLM help?

Technical notes:

I find Figure 1 (as well as some s to not be presented in a manner such that I can evaluate the claim being made: that the rescaled distributions are similar across species. To evaluate this, I would like to see example distributions (perhaps cumulative curves) drawn for each species – instead what I am presented with is a mess of dots on top of each other in 1b and 1d. I see notable differences, even though these are rescaled already. No goodness-of-fit tests are applied and no alternative distributions are provided.

Site frequency spectra are not widely used in microbiology, as the SFS is something that results from recombination. The idea that within-host or across-host evolution would result in a meaningful SFS is confusing to this reviewer.

Fig 2c: I’m not sure what the value of predicting “mean within-host allele frequency across hosts” is, if all hosts—including those without the allele at all—are included. Perhaps I am missing something. Also, it seems like most of the data here isn’t fitting the line, but instead fitting a curve starting at y = 10^-1. These graphs should be made as density maps to make it clearer how well the data and the prediction line up.

Some thoughts:

It is clear that the author has put a lot of work into the generation of this data and model, but it is always important to be clear about what is known and what is novel. The formalization of an already proposed model can be valuable. Presenting a model which more clearly shows what has been shown by others could also be valuable; if this is the goal then more effort needs to be put into making it clear.

Reviewer #2: In this work, the author used a statistical approach to a publicly available metagenomic dataset to evaluate the dynamics of genetic diversity in the human gut microbiome. They used a proposed distribution of allele frequencies to predict the prevalence of a given allele in a given species in a host. They found that the Stochastic Logistic Model accurately predicted this across species, and could also predict the prevalence of alleles across multiple hosts.

Generally, this is an interesting work. However, as someone with a limited background in biostatistics, I find the description of the underlying modeling approach very hard to follow. For instance, the assumptions behind the SLM described from line 197-208 are not clear to me. Hence, my recommendation is to clarify the description of the SLM, its assumptions, and the overall discussed concepts for readers less familiar with them. I understand that this is not my area of expertise, but it would be very helpful to make the presented results more accessible to scientists in the gut microbiome field.

Minor comments:

- The legends for Figures 1-4 are more a repetition of the Results section than a clear description of what is shown in each panel.

- Line 339: MAGPD is mentioned here for the first time without having been defined before.

- Line 448-449: “Regardless of the specific model, the conclusion that a substantial fraction of observable alleles are primarily subject to ecological, as opposed to evolutionary, dynamics is of consequence to studies of genetic diversity in the human gut.” It is not clear to me which part of the Results section precisely shows this.

- Line 655: There is a figure legend for Figure 5, which does not exist.

- Line 704: Is this really the correct URL for the Human Microbiome Project?

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Jul 21;18(7):e0288926. doi: 10.1371/journal.pone.0288926.r002

Author response to Decision Letter 0


24 Apr 2023

Below is my response that can also be found in the uploaded document 20230420_1538_Response.pdf

Reviewer #1

This manuscript by William Shoemaker explores intraspecies diversity within and across individual microbiomes, with a focus on the applicability of the stochastic logistic model and the existence of strain structure. It is my understanding that the main result of this modeling is that strains (ecology) are responsible for more within-host variants than mutation (evolution). This result echoes the current thinking in the field, which has been shown before by other papers, including at least one on which this author is a contributor.

Author response: The main result of this paper is that a model of ecology that succeeded at capturing the temporal dynamics of strains within a single host (Wolff et al., 2023), the Stochastic Logistic Model of growth, was capable of quantitatively predicting patterns of allelic prevalence across hosts for a considerable fraction of sites. In the revision I have rewritten portions of the Abstract, Introduction, and Discussion to emphasize these points (Lines 69-72; 73-83; 441-456).

This manuscript is difficult to read – partly because what is done is frequently described before the goal of the work is stated, and partly because concepts are a bit muddled / poorly introduced. I list some examples below. I also have some issues with data analysis. Most critically, it is hard to understand what is new in this paper. While it is my understanding that novelty is not a criterion for publication, more fully placing work in the context of the literature is very important for any theoretical or modeling based paper.

Author response: Interdisciplinary research is often difficult to communicate. In writing the submitted version of this manuscript I meticulously incorporated feedback from peers in macroecology, microbial ecology, statistical physics, and population genetics. I understand that attempting to please too many points of view can lead to a message that is less than precise. I have worked to revise this manuscript to address the Reviewer’s comments. For example, the Abstract now clarifies the purpose of this study and how it fits into the field’s current understanding of strain ecology. The Introduction now explicitly states how we can quantitatively describe the dynamics of strains within a single human host, a result that provides the motivation to use the SLM to investigate patterns of genetic diversity across hosts (Lines 50-83).

By adopting a statistical physics approach and leveraging my past efforts to characterize patterns of diversity within a single host, I was able to 1) successfully predict the prevalence of alleles across hosts for a high proportion of sites and 2) provide evidence that the success of my predictions was due to the existence of strain structure.

Writing issues:

It is hard to understand from the abstract, introduction, text, and discussion exactly what is new in this manuscript. The introduction spends much time opining on the value and need for models, without really explaining what has been shown before about the SLM in microbiomes. In particular, the manuscript cites works, including from this author, showing that strains are responsible for most intraspecies dynamics in the microbiome (this has been known for quite some time) and that SLMs describe these dynamics well. What was the open question the author set out to solve? If there was some doubt about the applicability of the SLM and strains prior to this work, what was the critical missing analysis? What required the need for this approach?

Author response: The question was whether an ecological model that succeeded in predicting macroecological patterns at the species level and in predicting the temporal dynamics of strains within a single host could also predict patterns of genetic diversity across hosts due to the existence of strain structure.

Making the jump from within-host dynamics to across-host patterns is not an easy task at the strain level. Within a given host, the relative abundance of a strain often has to be inferred using the temporal trajectories of individual genetic variants (Roodgar et al., 2021; Wolff et al., 2023). This approach requires several metagenomic sequencing samples within a host over a period of time, a fairly high cost that limits the number of hosts that can be sequenced. Therefore, when investigating questions of strain macroecology across a large number of unrelated hosts, one often only has access to samples obtained from a single timepoint (e.g., the Human Microbiome Project data examined in this study). In this scenario, it is reasonable to examine individual nucleotide sites and determine whether their patterns follow what one would expect if they were present because they were on a strain, rather than dealing with the difficulty of inferring strain relative abundances from a single metagenomic sample.

I have revised the Introduction to clarify existing knowledge gaps of strain ecology in the human gut. I now explicitly list patterns that the SLM has previously captured at the species level within and across human hosts (e.g., results described in Grilli, 2020) and how it has succeeded at explaining the temporal dynamics of strains within a single human host (Wolff et al., 2023). This body of work raises the question of whether the SLM can capture patterns of strain level diversity across hosts. Because genetic variants are the constituents of strains, this question can be tested as a question of allelic patterns across hosts and whether the SLM succeeds (Lines 69-72).

It requires a lot of work for a reader to understand what is being referred to at each point in the text. For example:

1) At Line 94: “the empirical distribution” is unclear; “the distribution of within-host allele frequencies across all hosts” would be clearer. Even clearer would be “the distribution of within-host frequencies at a given site, across all hosts” or something else.

Author response: When referring to a distribution in the revised manuscript I now specify the quantity of a distribution (e.g., Line 106).

2) Line 339: “unlikely for the allele frequency fluctuations I was able to observe”. Does this refer to fluctuations across hosts, or time? At the very least, a figure panel should be referred to here.

Author response: I thank the Reviewer for pointing out this typo. “Allele frequency fluctuations” should be “range of allele frequencies”, as the preceding sentence discusses the lower bound on allele frequencies that could be inferred after accounting for sequencing error. This typo has been corrected in the revised manuscript. The data referred to in this sentence are summarized in the preceding sentence (Lines 375-378).

3) Line 418: what is a “true strain absence”? Is the author suggesting that single strain colonization is rare for all species and strains? This would conflict with the literature (e.g., B. fragilis almost always colonizes humans as a single strain, PMID: 30673701), but if it is supported by the data this should be made more clear.

Author response: By “true strain absence” I mean strain structure that we do not observe in a host because it is not in the host, as opposed to strain structure that is present at an abundance in a host that is too low to be detected. The gamma distribution implies that the perceived absence of strain structure is typically driven by insufficient depth of sampling (Grilli, 2020), a conclusion that is consistent with the paper cited by the Reviewer (Garud et al., 2019). I have revised this section of the manuscript to clarify these points (Lines 457-471).

4) Line 104: which variable? Mean and variance across what?

Author response: I rescaled allele frequencies by pooling all sites across all hosts and calculating the mean and variance. I rescaled mean allele frequencies by pooling all sites and calculating the mean and variance. I now clarify these points in the revised manuscript (Lines 114-120; 137-138).

5) Line 500: I’m not sure exactly what point is trying to be made here. If there is already an interpretation of the data (antibiotics wiped out a strain), how does the SLM help?

Author response: The utility of an interpretation depends on the goal of the study. A course of antibiotics is a particularly strong perturbation that would alter the dynamics of strains. Other, arguably less drastic perturbations could be caused by a number of factors (e.g., a host traveling to a different climate, a temporary change in diet, etc.). In these scenarios it would be useful to have an ecological framework to determine whether the dynamics of strains differ from a quantitative null. The gamma distribution obtained from the SLM provides such a null, allowing researchers to determine whether the strain dynamics we observe truly depart from what we would expect if growth rate, environmental noise, and carrying capacity were the only ecologically important factors. For example, the gamma distribution was recently used as a null model to determine whether interactions between strains were sufficiently strong that the assumptions of the SLM were invalid (Goyal et al., 2022). I have revised this section of the manuscript to incorporate the Reviewer’s feedback and increase overall clarity (Lines 554-558).

Technical notes

6) I find Figure 1 (as well as some s to not be presented in a manner such that I can evaluate the claim being made: that the rescaled distributions are similar across species. To evaluate this, I would like to see example distributions (perhaps cumulative curves) drawn for each species – instead what I am presented with is a mess of dots on top of each other in 1b and 1d. I see notable differences, even though these are rescaled already. No goodness-of-fit tests are applied and no alternative distributions are provided.

Author response: The purpose of Fig. 1b is to provide empirical motivation for examining the SLM as a model of strain ecology across hosts. A gamma appears to capture the distribution, so one should investigate models of ecology that generate a gamma distribution. I propose the SLM as such a model. In the revision I have used a standard fitting procedure that I now describe in the manuscript (Lines 120-134). Because the distribution in Fig. 1b uses tens of thousands of data points, a goodness of fit test would inevitably produce a P-value less than 0.05, suggesting that a more insightful approach would be to compare the fit of the gamma with alternative distributions.

There are many options for alternative distributions, though the Reviewer did not recommend a specific one. Since the empirical Abundance Fluctuation Distribution (AFD) in Fig. 1c is wide (stretches over several orders of magnitude), a Gaussian would be inappropriate at the outset. As a reasonable alternative, I chose the lognormal as it has been used in the past as a macroecological comparison for the gamma AFD (Grilli, 2020). By fitting a lognormal, we see that it in no way captures the empirical distribution (Fig. 1c). This result is even more apparent when plotting the same data as a survival distribution, as recommended by the reviewer. Here we see that all species except one follow the gamma distribution (Fig. S2 in the revised manuscript, embedded below). I cannot compare these distributions using a likelihood ratio test since they are not nested, but by calculating the Akaike Information Criterion it is clear that the gamma is a far better descriptor of the empirical distribution (AIC_gamma = 6,277,330; AIC_lognorm = 6,618,492; Lines 132-134). The AIC values are so large because the log-likelihood is summed over a large number of data points for each distribution.
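The kind of AIC comparison described above can be illustrated with a minimal, standard-library Python sketch. It is not the author's analysis code: the data are simulated, the function names (`gamma_max_loglik`, `lognormal_max_loglik`, `aic`) are hypothetical, and the gamma shape parameter is found by a simple ternary search over the profile log-likelihood, which is an assumption made here rather than the fitting procedure described in the manuscript.

```python
import math
import random


def gamma_max_loglik(xs):
    """Maximized gamma log-likelihood via a 1-D search over the shape k.

    For a fixed shape k, the maximum-likelihood scale is mean(x) / k, which
    leaves a concave profile log-likelihood in k alone.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(math.log(x) for x in xs) / n

    def profile(k):
        return n * ((k - 1) * mean_log - k - k * math.log(mean_x / k) - math.lgamma(k))

    lo, hi = 1e-3, 1e3  # assumed search bounds for the shape parameter
    for _ in range(200):  # ternary search on a concave function
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if profile(m1) < profile(m2):
            lo = m1
        else:
            hi = m2
    k_hat = (lo + hi) / 2
    return profile(k_hat), k_hat


def lognormal_max_loglik(xs):
    """Maximized lognormal log-likelihood (closed-form MLE on the log scale)."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / n
    sigma2 = sum((v - mu) ** 2 for v in logs) / n
    return n * (-mu - 0.5 * math.log(2 * math.pi * sigma2) - 0.5)


def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik


# Simulated stand-in for rescaled abundances: gamma with shape 0.5 (illustrative only).
rng = random.Random(0)
data = [rng.gammavariate(0.5, 1.0) for _ in range(5000)]

gamma_ll, shape_hat = gamma_max_loglik(data)
aic_gamma = aic(gamma_ll, 2)
aic_lognormal = aic(lognormal_max_loglik(data), 2)
```

Because both candidate distributions have two parameters, the AIC difference in this sketch reduces to twice the difference in maximized log-likelihood. The magnitude of each AIC grows with the number of data points, which is consistent with the large values reported above.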

Regarding distributions of mean abundances, I purposefully chose not to propose a mathematical distribution (e.g., Gaussian, gamma, etc.) to explain the empirical distribution in Fig. 1d because a mathematical descriptor of the distribution is not relevant to the subsequent analyses. Using the sampling form of the gamma we can only describe the patterns of genetic diversity across hosts for a given site using the empirical mean and variance. We are not interested in suggesting a probability distribution to capture the empirical distribution of means because the mean is an empirical input used to calculate the predicted value of prevalence (Eq. 6). This point has been clarified in the revised manuscript (Lines 143-152).

Regarding the appropriateness of the gamma, I explicitly test the gamma distribution by using the mean and variance to predict the prevalence of an allele across hosts (Fig. 2-4). This is a prediction obtained using only empirical inputs (i.e., the mean and variance) with zero free parameters. Alternatively stated, no statistical fits were performed for that analysis.
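To make the "zero free parameters" point concrete, the following Python sketch computes a predicted prevalence from only the empirical mean and variance of allele frequencies and the per-host depth of coverage. It assumes the gamma-Poisson (negative binomial) sampling form of the gamma distribution used by Grilli (2020), under which the probability of observing zero reads of an allele in a host with depth D is (1 + mean·D/β)^(−β) with β = mean²/variance. Eq. 4 and Eq. 6 are not reproduced in this section, so this form is an assumption about them, and the function names and toy data are hypothetical.

```python
def predicted_prevalence(freqs, depths):
    """Predicted fraction of hosts in which the allele is observed.

    freqs  : within-host allele frequencies, one per host (zeros included)
    depths : total depth of coverage at the site, one per host
    Assumes a gamma distribution of frequencies with Poisson sampling of reads.
    """
    n_hosts = len(freqs)
    mean_f = sum(freqs) / n_hosts
    var_f = sum((f - mean_f) ** 2 for f in freqs) / (n_hosts - 1)
    if mean_f <= 0 or var_f <= 0:
        raise ValueError("mean and variance must be positive")
    beta = mean_f ** 2 / var_f  # squared inverse coefficient of variation
    # Probability that zero reads of the allele are sampled in each host.
    p_absent = [(1 + mean_f * d / beta) ** (-beta) for d in depths]
    return 1 - sum(p_absent) / n_hosts


def observed_prevalence(freqs):
    """Fraction of hosts in which the allele has a non-zero frequency."""
    return sum(1 for f in freqs if f > 0) / len(freqs)


# Toy data for a single site across six hosts (illustrative values only).
toy_freqs = [0.0, 0.0, 0.1, 0.3, 0.0, 0.2]
toy_depths = [50] * 6

predicted = predicted_prevalence(toy_freqs, toy_depths)
observed = observed_prevalence(toy_freqs)
```

No parameter is fitted here: the mean, variance, and depths are all read directly off the data, so any agreement between `predicted` and `observed` across many sites is a test of the distributional assumption rather than a consequence of optimization.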

7) Site frequency spectra are not widely used in microbiology, as the SFS is something that results from recombination. The idea that within-host or across-host evolution would result in a meaningful SFS is confusing to this reviewer.

Author response: The SFS has been intensively examined in theoretical studies motivated by microbial evolution, including studies that examine purely asexually evolving populations (Cvijović et al., 2018; Kosheleva & Desai, 2013; Neher & Hallatschek, 2013; Okada & Hallatschek, 2021) as well as populations evolving with varying rates of recombination (Good et al., 2014; Neher et al., 2013). Furthermore, the SFS has been the primary data object used for strain inference, whether it is by quantifying the existence of strain structure (e.g., StrainFinder in Smillie et al. (2018)) or by obtaining a single haplotype from the SFS using the unique form generated by the existence of co-occurring strains (Garud, Good et al., 2019).

The reviewer is correct in that one’s interpretation of the SFS depends on the degree of recombination in the population. There is increasing evidence that bacteria recombine more often than previously thought. Measures of correlation in genotype frequencies (e.g., linkage disequilibrium) decay with genetic distance (i.e., number of base pairs) at a faster rate than expected for an asexual population (Garud, Good et al., 2019; Good, 2022). While this rate of decay is slower than what might be expected under free recombination, it does mean that an asexual view of bacterial evolution, and how that view shapes our understanding of empirical patterns such as the SFS, is not entirely valid.

However, in all these studies, and in the field at large, the SFS is defined as the distribution of allele frequencies among all sites within a single host/population. In this study I solely refer to the distribution of allele frequencies at a single site across hosts, a distribution that has previously been defined as the single-site frequency spectrum (Theys et al., 2018).
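
The distinction between the two quantities can be illustrated with a small sketch. The toy frequency table and function names below are hypothetical and serve only to show which axis of the data each definition runs along.

```python
# Hypothetical toy data: allele frequency at each site (rows) in each host (columns).
freqs = {
    "site_1": {"host_A": 0.0, "host_B": 0.3, "host_C": 0.8},
    "site_2": {"host_A": 0.1, "host_B": 0.0, "host_C": 0.5},
}


def within_host_sfs(freqs, host):
    """Conventional SFS input: frequencies of all sites within one host."""
    return [freqs[site][host] for site in sorted(freqs)]


def across_host_distribution(freqs, site):
    """Quantity used in this study: frequencies of one site across all hosts."""
    return [freqs[site][host] for host in sorted(freqs[site])]


print(within_host_sfs(freqs, "host_C"))
print(across_host_distribution(freqs, "site_1"))
```

The first function reads down a column of the table, the second reads along a row; only the second corresponds to the single-site distribution examined here.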

8) Fig 2c: I’m not sure what the value of predicting “mean within-host allele frequency across hosts” is, if all hosts—including those without the allele at all—are included. Perhaps I am missing something. Also, it seems like most of the data here isn’t fitting the line, but instead fitting a curve starting at y = 10^-1. These graphs should be made as density maps to make it clearer how well the data and the prediction line up.

Author response: In this manuscript I predict the fraction of hosts harboring a given allele (i.e., prevalence). I do not predict the mean frequency of an allele across hosts; that quantity is an empirical input used to calculate the predicted prevalence (Eq. 6). As for why hosts where an allele was not observed are included in the calculation of the mean, the hypothesis is that allele frequencies across hosts can be described by a gamma distribution, the stationary solution of the SLM. If the underlying empirical data truly follow the form of the gamma distribution that explicitly accounts for total depth of coverage (i.e., sampling; Eq. 4), then excluding hosts where an allele was not observed (i.e., the allele had a coverage of zero) would mean that I am no longer sampling from that distribution, but from a truncated form of it.
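
The effect of truncation can be shown with a short worked calculation. The sketch below assumes gamma-distributed frequencies with Poisson read sampling (Eq. 4 is not reproduced in this response, so the manuscript's exact sampling form may differ), and the parameter values are hypothetical. Because hosts with zero reads contribute nothing to the mean, conditioning on the allele being observed divides the mean by the probability of observation.

```python
def zero_probability(mean_freq, beta, depth):
    """Probability of sampling zero reads of the allele in one host.

    Assumption: gamma-distributed frequency (mean `mean_freq`, shape `beta`)
    with Poisson sampling at total coverage `depth`.
    """
    return (1 + mean_freq * depth / beta) ** (-beta)


def mean_freq_excluding_absences(mean_freq, beta, depth):
    """Expected sampled frequency when hosts with zero reads are dropped."""
    p0 = zero_probability(mean_freq, beta, depth)
    return mean_freq / (1 - p0)


# Hypothetical parameter values chosen for illustration only
true_mean, shape, coverage = 0.05, 0.5, 20
print(true_mean, mean_freq_excluding_absences(true_mean, shape, coverage))
```

With these toy values the mean computed only over hosts where the allele was seen is more than twice the mean of the full distribution, which is why absences are retained when the mean is used as a model input.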

Regarding the data, first, the black dashed line is a prediction using measures obtained from empirical data (i.e., the mean and variance) with zero free parameters. It is not a statistical fit because there were no free parameters to infer. I now clarify this point throughout the manuscript, including at lines that cite Fig. 2c (Lines 77-80; 277-279; 657-660), and the figure legend.

Second, I stated in the original submission that the prediction tended to succeed for alleles with high mean frequency. We expect that some fraction of polymorphic sites for a given species within a host are polymorphic because they are on the background of a strain (i.e., ecology), whereas other sites are segregating due to evolution. This result is consistent with the conceptual picture that both ecology and evolution are operating within a given species, but it is novel in that I was able to predict the prevalence of alleles with a high mean abundance across hosts using an ecological model that succeeded at predicting the dynamics of strains within hosts (Wolff et al., 2023). I now clarify these points in the revised manuscript (Lines 441-456).

The figures presented in the original manuscript were already illustrated as density maps. The color of each data point is proportional to the density of points in its immediate area. However, this detail was not made explicit, so the revised manuscript now explicitly mentions this detail in the figure legend. There was a plotting error in the original manuscript, as the color did not correspond to the density. This has been corrected in the revised manuscript. I have also added a color bar indicating how the color corresponds to the number of points for all figures with density plots (Figs. 2, S5, S6, S9, S10, S12, S13).

Some thoughts: It is clear that the author has put a lot of work into the generation of this data and model, but it is always important to be clear about what is known and what is novel. The formalization of an already proposed model can be valuable. Presenting a model which more clearly shows what has been shown by others could also be valuable; if this is the goal then more effort needs to be put into making it clear.

Author response: I agree with the reviewer. The novelty of this study is that I demonstrated how an ecological model that captures the dynamics of strains within a single host (Wolff et al., 2023) was able to capture patterns of genetic diversity across hosts due to the existence of strain structure. In the revision I have incorporated their feedback by carefully clarifying in the Abstract, Introduction, and Discussion the novelty of this work and how it builds on prior studies (Lines 56-72; 73-83; 441-456).

Reviewer #2

Major comments

Generally, this is an interesting work. However, as someone with a limited background in biostatistics, I find the description of the underlying modeling approach very hard to follow. For instance, the assumptions behind the SLM described from line 197-208 are not clear to me. Hence, my recommendation is to clarify the description of the SLM, its assumptions, and the overall discussed concepts for readers less familiar with them. I understand that this is not my area of expertise, but it would be very helpful to make the presented results more accessible to scientists in the gut microbiome field.

Author response: I thank the reviewer for their comments. I have worked to clarify the motivation for and definition of the SLM in the revised manuscript. I have added brackets in Eq. 1 to describe the physical meaning of each term in the equation and additional text to describe how noise is modeled in the equation (Lines 211-242). Additional technical steps are provided in the Materials and Methods (Lines 633-679). As for the rest of the manuscript, I have revised several sections in the Abstract, Introduction, and Discussion to clarify novelty of this work and how it builds on prior studies (Lines 56-72; 73-83; 441-456).

Minor comments

1) The legends for Figures 1-4 are more a repetition of the Results section than a clear description of what is shown in each panel.

Author response: I initially included additional descriptive details in the legends at the request of past reviewers. The legends of Fig. 1-4 have now been revised to emphasize the contents of the figures rather than the results.

2) Line 339: MAGPD is mentioned here for the first time without having been defined before.

Author response: A definition of the acronym has now been included and moved earlier in the manuscript to where data from MAPGD was first described (Line 97).

3) Line 448-449: “Regardless of the specific model, the conclusion that a substantial fraction of observable alleles are primarily subject to ecological, as opposed to evolutionary, dynamics is of consequence to studies of genetic diversity in the human gut.” It is not clear to me which part of the Results section precisely shows this.

Author response: In this manuscript I demonstrate that the Stochastic Logistic Model of growth, an ecological model that has been used to capture the temporal dynamics of strains within a single host, can predict patterns of allelic prevalence across hosts for alleles of moderate-to-high mean frequency (Fig. 2, 3). Because alleles are the constituents of strains, this result means that the across-host patterns of a substantial portion of alleles are driven by strain-level ecology. This result is supported by the observation that the error of the prevalence predictions is inversely related to the fraction of hosts harboring strains (Fig. 4). I have revised the Discussion to clarify this point (Lines 441-456).

4) Line 655: There is a figure legend for Figure 5, which does not exist.

Author response: Thank you for pointing this out. The figure legend has been removed; there was never an intention to include a fifth figure. It was an artifact of reformatting the paper across multiple LaTeX templates.

5) Line 704: Is this really the correct URL for the Human Microbiome Project?

Author response: I have used a URL that links to a repository containing the data of the Human Microbiome Project. This URL and the data therein were listed under the Data and Code Availability Statement of a previously published paper that also used Human Microbiome Project data (Garud, Good, et al., 2019). In the revision I now include the official URL of the Human Microbiome Project alongside this link.

References

Cvijović, I., Good, B. H., & Desai, M. M. (2018). The Effect of Strong Purifying Selection on Genetic Diversity. Genetics, 209(4), 1235–1278. https://doi.org/10.1534/genetics.118.301058

Garud, N. R., Good, B. H., Hallatschek, O., & Pollard, K. S. (2019). Evolutionary dynamics of bacteria in the gut microbiome within and across hosts. PLOS Biology, 17(1), e3000102. https://doi.org/10.1371/journal.pbio.3000102

Good, B. H. (2022). Linkage disequilibrium between rare mutations. Genetics, 220(4), iyac004. https://doi.org/10.1093/genetics/iyac004

Good, B. H., Walczak, A. M., Neher, R. A., & Desai, M. M. (2014). Genetic Diversity in the Interference Selection Limit. PLOS Genetics, 10(3), e1004222. https://doi.org/10.1371/journal.pgen.1004222

Goyal, A., Bittleston, L. S., Leventhal, G. E., Lu, L., & Cordero, O. X. (2022). Interactions between strains govern the eco-evolutionary dynamics of microbial communities. ELife, 11, e74987. https://doi.org/10.7554/eLife.74987

Grilli, J. (2020). Macroecological laws describe variation and diversity in microbial communities. Nature Communications, 11(1), 4743. https://doi.org/10.1038/s41467-020-18529-y

Kosheleva, K., & Desai, M. M. (2013). The Dynamics of Genetic Draft in Rapidly Adapting Populations. Genetics, 195(3), 1007–1025. https://doi.org/10.1534/genetics.113.156430

Neher, R. A., & Hallatschek, O. (2013). Genealogies of rapidly adapting populations. Proceedings of the National Academy of Sciences, 110(2), 437–442. https://doi.org/10.1073/pnas.1213113110

Neher, R. A., Kessinger, T. A., & Shraiman, B. I. (2013). Coalescence and genetic diversity in sexual populations under selection. Proceedings of the National Academy of Sciences, 110(39), 15836–15841. https://doi.org/10.1073/pnas.1309697110

Okada, T., & Hallatschek, O. (2021). Dynamic sampling bias and overdispersion induced by skewed offspring distributions. Genetics, 219(4), iyab135. https://doi.org/10.1093/genetics/iyab135

Roodgar, M., Good, B. H., Garud, N. R., Martis, S., Avula, M., Zhou, W., Lancaster, S. M., Lee, H., Babveyh, A., Nesamoney, S., Pollard, K. S., & Snyder, M. P. (2021). Longitudinal linked-read sequencing reveals ecological and evolutionary responses of a human gut microbiome during antibiotic treatment. Genome Research, 31(8), 1433–1446. https://doi.org/10.1101/gr.265058.120

Smillie, C. S., Sauk, J., Gevers, D., Friedman, J., Sung, J., Youngster, I., Hohmann, E. L., Staley, C., Khoruts, A., Sadowsky, M. J., Allegretti, J. R., Smith, M. B., Xavier, R. J., & Alm, E. J. (2018). Strain Tracking Reveals the Determinants of Bacterial Engraftment in the Human Gut Following Fecal Microbiota Transplantation. Cell Host & Microbe, 23(2), 229-240.e5. https://doi.org/10.1016/j.chom.2018.01.003

Theys, K., Feder, A. F., Gelbart, M., Hartl, M., Stern, A., & Pennings, P. S. (2018). Within-patient mutation frequencies reveal fitness costs of CpG dinucleotides and drastic amino acid changes in HIV. PLOS Genetics, 14(6), e1007420. https://doi.org/10.1371/journal.pgen.1007420

Wolff, R., Shoemaker, W., & Garud, N. (2023). Ecological Stability Emerges at the Level of Strains in the Human Gut Microbiome. MBio, 0(0), e02502-22. https://doi.org/10.1128/mbio.02502-22

Attachment

Submitted filename: 20230420_1538_Response.pdf

Decision Letter 1

Karthik Raman

22 May 2023

PONE-D-22-30239R1

A macroecological perspective on genetic diversity in the human gut microbiome

PLOS ONE

Dear Dr. Shoemaker,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 06 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Karthik Raman, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments:

I am in agreement with one of the reviewers, who has given detailed constructive criticism to further improve the manuscript. It is important that the manuscript be carefully revisited to systematically address all the concerns of the reviewers.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This manuscript has improved from the last version. In R2R document, the author states that the purpose of this manuscript is to demonstrate that the previously proposed model, in which (1) most intrastrain polymorphisms arise from migration and growth of strains rather than within-person evolution; and (2) this is formalized with an SLM model, can explain allelic prevalence across hosts. Bolstering the SLM model and the importance of strain structure is a fine goal (though I have a question on the contextualization of this below), and now I can begin to interpret the manuscript.

Significant work still remains to be done on the writing side. While the author appears to have attempted to address my concerns, shortcuts were taken rather than addressing the real problem (e.g. point #1f below), the text still falls short of communicating what is being advanced by these models, and murky language is used throughout--- such that I still relied heavily on the response to the reviewer (R2R) to interpret what the author intended. I do not think that all interdisciplinary work needs to be so difficult to read. I also have major concerns around the framing and some technical aspects of this work.

1) The writing is murky and it is hard to follow what point the author is making throughout.

1a) As an example to the author, I provide below an edited alternative of the middle of the abstract which is easier to read. In this example, the connection between strains and ecology is made more clear to a naive reader, and what the SLM is modeling (strains) is made clear within the sentence that introduces it. The use of a “however” is removed from a location where a clear contrast isn’t being made. The phrases “determine whether” and “is capable of capturing” are replaced by more direct language. I am not sure if the suggestion for the last sentence reflects what the author intended; if I am incorrect, the author should clarify what is meant by “patterns of genetic diversity across hosts follow statistically similar forms”.

Revised abstract:

“…Recent efforts have suggested that a large fraction intrahost, intraspecies, genetic variation is driven by dynamics between co-colonizing of strains of the same species, highlighting the importance of modeling ecological forces. In particular, the Stochastic Logistic Model (SLM) of growth, commonly used in macroecology to describe between-species variation, has been used successfully to predict the temporal dynamics of strains within a single human host. Here, using data from 22 common microbial species across a large cohort of unrelated hosts, I show that the SLM also successfully predicts across-host genetic diversity in the human gut. The SLM predicts both the distribution of allele frequencies across hosts and the fraction of hosts harboring an allele for a given site (i.e., prevalence). The accuracy of the SLM in predicting these across-host parameters is correlated with independent estimates of strain structure, confirming that the success of the SLM arises from the presence of strain-level ecology in the human gut.

1b) In regards to my point #2 in the previous round the authors have changed a phrase but have not fixed the core of the problem, which is requiring a reader to read every sentence in order and guess what is being referred to be able to follow. One should not write “allele frequency fluctuations that I was able to observe” when they could say something like “THE LARGE allele frequency fluctuations ACROSS HOSTS” or whatever feature the author is actually intending here. As a reader, I cannot evaluate the validity of a sentence that is vague.

1c) Line 62 would be clearer if the manuscript said “In macroecology, the SLM…. “

1d) The last paragraph of the introduction would be much clearer with specific methods (is “an ecological model” different than the SLM already discussed? In the last sentence, what is “an alternative evolutionary model”?) or with the general methods removed (just state result).

1e) “First, I examined the distribution of within-host allele frequencies across hosts.” This should be explained better. This is not a normal metric, and it would be great if this could be spelled out.

1f) Line 70: “If the typical allele was present due to evolutionary forces, then the empirical distribution of within-host allele frequencies across hosts can be viewed as an ensemble of single-site frequency spectra.” By definition I believe that this metric is an ensemble of SFS. I think you mean “… across hosts WOULD BE EQUIVALENT TO THE ensemble of single-site frequency spectra EXPECTED FROM WITHIN PERSON EVOLUTION.” It would also be helpful for the reader if the shape of each of these expectations were described in terms of the distributions used in Figure 1 in this paragraph.

1g) The dN/dS curve mentioned is all for Bacteroides and related species. This curve has not been investigated in other major taxa (e.g. Clostridia). In addition, this result is mentioned but not explained well enough for a reader to understand this point. “a clear example being the observation that THE RATIO OF NONSYNONYMOUS DIVERGENCE TO SYNONYMOUS DIVERGENCE DECAYS NEARLY IDENTICALLY OVER LONG TIME SCALES OF across microbial species in the human gut”

1h) “This disproportionate focus on individual species and differences between species can lead to idiosyncratic notions about the typical dynamics of a genetic variant in the microbiome….. Instead, it is reasonable to start by identifying genetic patterns” I am just not sure what this means. One interpretation of these sentences is that the author is claiming that looking at things one species at a time is incorrect. I’m not sure the author needs to say this. I think it would be enough to say that identifying a single model that works across species would reveal general principles of eco-evo dynamics across microbiomes.

There are parts of the results section that are likewise confusing, but I have selected to use my time to give several clear examples of areas for improvement in the introduction and results as an example to the author of how more clarity can be achieved.

Other major concerns:

2) It is very true that new and better eco-evo models are needed for microbial populations, but the presentation of current knowledge in the population genetics field and the remaining gaps are incorrectly summarized.

2a) The surprise of strain structure is a bit overstated. One can infer that strains are a real entity by building phylogenetic trees from available isolates or from metagenomic samples with single strains (e.g. quasiphasable as in Garud and Good et al). This sort of expectation should be more explicitly stated in the introduction.

2b) Page 2, Line 20: “Such dynamics are a clear departure from those captured by standard population genetic models, where genetic variants either arise in a population due to mutation or are introduced by migration and then proceed towards extinction or fixation (i.e., origin-fixation models), suggesting that measures of genetic diversity estimated within the human gut are shaped by the ecology of strains alongside evolution [10].” There are many population genetic models that deal with standing variation – the entire field of human population genetics on quantitative traits comes to mind. I think the bigger gap is that models for population genetics on standing variation don’t deal with the genome-wide linkage inherent to bacteria.

2c) “This confluence of ecological and evolutionary dynamics calls into question the feasibility of characterizing genetic diversity in the human gut.” “Calls into question the feasibility” is a bit overstated. How about “requires new approaches and theory for”

3) I am not swayed by the argument that there are always multiple strains in a subject (response to previous point #3). B. fragilis is often found at very high abundances, and the presence of strain structure does not correlate with the abundance in that species, as would be expected if there was a detection limit problem. The author has not proved “a lack of genuine absences of strain structure”, which is a very strong claim when not written as a double negative. This could be mitigated by putting a qualifier of “for most species” for most sentences in this paragraph.

4) If I am understanding correctly, this paper is about the cross-host applicability of the SLM. However, the examples in the last paragraph of the discussion are all about within-host dynamics, which were investigated in a prior manuscript. This is confusing and perhaps misleading. Could the across-host applicability of the SLM help us understand how the microbiome at a global scale might change in response to a global perturbation like global warming?

5) I understand that Figure 1 isn’t about a goodness of fit, though the author has now done some analysis in this direction now anyway. I still don’t really understand what I am supposed to learn from this data looking roughly gamma though. How do the lognormal and gamma distribution relate to the SFS expected for within-host evolution described in lead up to Figure 1? Can the author give an example of a generative process that wouldn’t result in the patterns shown in Figure 1?

Reviewer #2: In their revision, the author worked to improve the readability of the manuscript, clarifying the concepts and the aim of the study. Overall, they did a good job. I thank the author for rewriting the manuscript to make it easier to follow. I still have some difficulties interpreting the equations, which is likely due to my unfamiliarity with the subject.

Overall comments:

- Since the article was based on a previously published work (Wolff et al. 2023), I think it would improve understanding of the context of the study further if one or two lines summarizing the results of that study were added at line 69.

- While the last paragraph of the discussion addresses this, it is still not entirely clear to me how the model can be applied to dysbiosis. What are the implications of the prediction of the prevalence of a given gene in a fraction of hosts? What does that tell us about the microbiome?

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Jul 21;18(7):e0288926. doi: 10.1371/journal.pone.0288926.r004

Author response to Decision Letter 1


8 Jun 2023

Editor Comments

I am in agreement with one of the reviewers, who has given detailed constructive criticism to further improve the manuscript. It is important that the manuscript be carefully revisited to systematically address all the concerns of the reviewers.

Author response: I thank the editor for the feedback along with the opportunity to improve the manuscript. In my response I describe how all comments made by both reviewers were addressed in my resubmission.

Reviewer #1

This manuscript has improved from the last version. In R2R document, the author states that the purpose of this manuscript is to demonstrate that the previously proposed model, in which (1) most intrastrain polymorphisms arise from migration and growth of strains rather than within-person evolution; and (2) this is formalized with an SLM model, can explain allelic prevalence across hosts. Bolstering the SLM model and the importance of strain structure is a fine goal (though I have a question on the contextualization of this below), and now I can begin to interpret the manuscript.

Author response: I thank the reviewer for their earlier constructive comments, which have improved the manuscript.

Significant work still remains to be done on the writing side. While the author appears to have attempted to address my concerns, shortcuts were taken rather than addressing the real problem (e.g. point #1f below), the text still falls short of communicating what is being advanced by these models, and murky language is used throughout--- such that I still relied heavily on the response to the reviewer (R2R) to interpret what the author intended. I do not think that all interdisciplinary work needs to be so difficult to read. I also have major concerns around the framing and some technical aspects of this work.

1) The writing is murky and it is hard to follow what point the author is making throughout.

Author response: In this revision I have made a considerable effort to clarify my points. In addition to incorporating the reviewer’s suggestions, I have gone through the manuscript and identified additional opportunities for clarification.

1a) As an example to the author, I provide below an edited alternative of the middle of the abstract which is easier to read. In this example, the connection between strains and ecology is made more clear to a naive reader, and what the SLM is modeling (strains) is made clear within the sentence that introduces it. The use of a “however” is removed from a location where a clear contrast isn’t being made. The phrases “determine whether” and “is capable of capturing” are replaced by more direct language. I am not sure if the suggestion for the last sentence reflects what the author intended; if I am incorrect, the author should clarify what is meant by “patterns of genetic diversity across hosts follow statistically similar forms”.

Revised abstract:

“…Recent efforts have suggested that a large fraction intrahost, intraspecies, genetic variation is driven by dynamics between co-colonizing of strains of the same species, highlighting the importance of modeling ecological forces. In particular, the Stochastic Logistic Model (SLM) of growth, commonly used in macroecology to describe between-species variation, has been used successfully to predict the temporal dynamics of strains within a single human host. Here, using data from common microbial species across a large cohort of unrelated hosts, I show that the SLM also successfully predicts across-host genetic diversity in the human gut. The SLM predicts both the distribution of allele frequencies across hosts and the fraction of hosts harboring an allele for a given site (i.e., prevalence). The accuracy of the SLM in predicting these across-host parameters is correlated with independent estimates of strain structure, confirming that the success of the SLM arises from the presence of strain-level ecology in the human gut.

Author response: I appreciate the reviewer’s suggestions. I have revised the Abstract to incorporate all suggested edits made by the reviewer.

1b) In regards to my point #2 in the previous round the authors have changed a phrase but have not fixed the core of the problem, which is requiring a reader to read every sentence in order and guess what is being referred to be able to follow. One should not write “allele frequency fluctuations that I was able to observe” when they could say something like “THE LARGE allele frequency fluctuations ACROSS HOSTS” or whatever feature the author is actually intending here. As a reader, I cannot evaluate the validity of a sentence that is vague.

Author response: In the prior round of comments the reviewer quoted a line of the manuscript that contained a typo. I fixed the typo and I thank the reviewer for clarifying their prior comment. The revision now specifies that “fluctuations” refers to “fluctuations across hosts” (line 410). I have also gone through the manuscript and clarified every mention of the term “fluctuation”.

1c) Line 62 would be clearer if the manuscript said “In macroecology, the SLM…. “

Author response: I have made the suggested edit (line 69).

1d) The last paragraph of the introduction would be much clearer with specific methods (is “an ecological model” different than the SLM already discussed? In the last sentence, what is “an alternative evolutionary model”?) or with the general methods removed (just state result).

Author response: I have changed “an ecological model” to “the SLM as a model of ecology” (line 84). I have rewritten the last sentence to clarify that I am referring to evolutionary Langevin equations that also generate the same stationary probability distribution as the SLM (lines 91-93). I then specify that I inferred the presence of strain structure and found that strain structure was correlated with the accuracy of the SLM in predicting allelic prevalence (lines 93-97).

1e) “First, I examined the distribution of within-host allele frequencies across hosts.” This should be explained better. This is not a normal metric, and it would be great if this could be spelled out.

Author response: I have made the requested edits, changing the quoted line to:

“First, I obtained the distribution of across-host allele frequencies for each nucleotide site and then pooled the frequencies of all sites.” (lines 118-119).

1f) Line 70: “If the typical allele was present due to evolutionary forces, then the empirical distribution of within-host allele frequencies across hosts can be viewed as an ensemble of single-site frequency spectra.” By definition I believe that this metric is an ensemble of SFS. I think you mean “… across hosts WOULD BE EQUIVALENT TO THE ensemble of single-site frequency spectra EXPECTED FROM WITHIN PERSON EVOLUTION.” It would also be helpful for the reader if the shape of each of these expectations were described in terms of the distributions used in Figure 1 in this paragraph.

Author response: I appreciate the reviewer’s comments regarding the importance of clarity. I have incorporated their request into the manuscript (lines 119-123).

The purpose of Fig. 1c is to illustrate how a single probability distribution is capable of explaining the empirical distributions of many species. Visual inspection (and AIC tests based on the reviewer’s previous requests) suggests that the gamma distribution sufficiently captures the empirical distribution. I then identify the SLM as a reasonable model because 1) of its past success in explaining within-host strain dynamics and 2) its stationary distribution is the gamma distribution. I then proceed with testing the gamma at each site (Figs. 2,3) and determine that the accuracy of the gamma is correlated with an independent estimate of strain structure (Fig. 4), validating the ecological interpretation of the gamma distribution. In the revision I identified sections of the manuscript related to the above points that required additional clarity and made appropriate edits (lines 219-227, 253-255).

1g) The dN/dS curve mentioned is all for Bacteroides and related species. This curve has not been investigated in other major taxa (e.g. Clostridia). In addition, this result is mentioned but not explained well enough for a reader to understand this point. “a clear example being the observation that THE RATIO OF NONSYNONYMOUS DIVERGENCE TO SYNONYMOUS DIVERGENCE DECAYS NEARLY IDENTICALLY OVER LONG TIME SCALES across microbial species in the human gut”

Author response: The curve presented in Fig. 3 of Garud, Good et al. was expanded with additional species in Shoemaker et al., where it visualized species from 20 genera, 14 families, 7 orders, 6 classes, and 5 phyla (2019; 2021). The assemblage of species represented in Shoemaker et al. includes members of the class Clostridia (e.g., Clostridium, Butyrivibrio; 2021). I understand that it is not an exhaustive representation of bacterial diversity, but it provides empirical motivation for identifying patterns that hold across evolutionarily distant species in the human gut microbiome. Therefore, I briefly summarize the taxonomic diversity in the line quoted by the reviewer in the revision. I have also edited the line to state that the ratio dN/dS decays with increasing dS, where dS is interpreted as a proxy for evolutionary time (lines 47-51).

1h) “This disproportionate focus on individual species and differences between species can lead to idiosyncratic notions about the typical dynamics of a genetic variant in the microbiome….. Instead, it is reasonable to start by identifying genetic patterns” I am just not sure what this means. One interpretation of these sentences is that the author is claiming that looking at things one species at a time is incorrect. I’m not sure the author needs to say this. I think it would be enough to say that identifying a single model that works across species would reveal general principles of eco-evo dynamics across microbiomes.

Author response: Fig. 1c-f contains visual examples of what is meant by the line quoted by the reviewer. I believe that it is reasonable to first inspect empirical patterns before applying a model. From past experiences, discussions with peers, and assessing published literature, the practice of plotting distributions of the same quantity from different species/systems on the same axis to identify qualitatively similar behavior is often overlooked in the life sciences. This observation was the motivation for describing the benefits of performing a “data collapse” and investigating whether distributions of within-host allele frequencies for different species have invariant forms (lines 101-107; Fig. 1c). However, the approach that I, and other researchers in microbial evolution and ecology, took is not the singular approach. Rather, it can be viewed as one way to address scientific questions rather than a prescription and I have revised the manuscript to reflect this view and the views of the reviewer (lines 38-45).

There are parts of the results section that are likewise confusing, but I have selected to use my time to give several clear examples of areas for improvement in the introduction and results as an example to the author of how more clarity can be achieved.

Author response: I have combed through the manuscript and done my best to meet the reviewer’s requests. I thank the reviewer for identifying areas where clarity can be improved.

Other major concerns:

2) It is very true that new and better eco-evo models are needed for microbial populations, but the presentation of current knowledge in the population genetics field and the remaining gaps are incorrectly summarized.

2a) The surprise of strain structure is a bit overstated. One can infer that strains are a real entity by building phylogenetic trees from available isolates or from metagenomic samples with single strains (e.g., quasiphasable as in Garud and Good et al). This sort of expectation should be more explicitly stated in the introduction.

Author response: In this manuscript I start from the observation that a number of genetic variants at the species level neither become fixed nor go extinct over extended periods of time within a single human gut. This empirical motivation has been used in other manuscripts to introduce readers to the concept of multiple strains co-occurring within a single human host (Good & Hallatschek, 2018).

In lines 11-16 I altered the language to be more descriptive so that I do not overstate the existence of strain structure. At the request of the reviewer, I have included a sentence discussing how this ecological structure is often reflected in the shape of phylogenetic trees constructed from microbial isolates (lines 13-16).

2b) Page 2, Line 20: “Such dynamics are a clear departure from those captured by standard population genetic models, where genetic variants either arise in a population due to mutation or are introduced by migration and then proceed towards extinction or fixation (i.e., origin-fixation models), suggesting that measures of genetic diversity estimated within the human gut are shaped by the ecology of strains alongside evolution [10]. “ There are many population genetic models that deal with standing variation – the entire field of human population genetics on quantitative traits comes to mind. I think the bigger gap is that models for population genetics on standing variation don’t deal with the genome-wide linkage inherent to bacteria.

Author response: The existence of strain structure in microbial communities is not a matter of standing genetic variation in the way it is for humans. Microbial strains that are present in the same host represent a form of ecological structure. Namely, what we observe as standing genetic variation that persists over extended timescales (over a year in Wolff et al., 2023) is in reality the presence of multiple community members that occupy different ecological roles while de novo mutations continue to be acquired and segregate within each strain (Dapa et al., 2023; Roodgar et al., 2021).

This empirical observation has motivated those in the field to develop models that integrate principles from microbial ecology with those from population genetics. For example, in Good, Martis, et al. the authors develop an evolutionary model where each strain acquires mutations that impact growth rate (i.e., fitness) at one rate and mutations that impact which resources they consume at another rate (i.e., ecological strategy; 2018). This model captures the emergence and coexistence of strains within a human host over time, where each individual strain can be viewed as an asexual population where the population scaled mutation rate (population size multiplied by the per-genome mutation rate) is sufficiently small such that typically only a single mutation is segregating within a given strain (strong selection weak mutation limit). Subsequent modeling efforts extend this model to the genome-wide linkage regime where asexually reproducing strains have a population scaled mutation rate sufficiently high so that multiple mutations segregate within a single strain (Wong and Good, 2022).

Regarding recombination, there is no doubt that linkage contributes to the patterns of genetic diversity we observe in the human gut as this topic remains an active area of research (e.g., Garud, Good, 2019; Roodgar et al., 2021; Good, 2022). However, the extent that an absence of recombination is necessary to preserve the ecological structure of strains within a human host is currently unknown, as published eco-evolutionary models of strain dynamics have primarily examined purely asexually reproducing populations. To reflect this point, in the revision I incorporated the reviewer’s comment regarding the need to consider genetic linkage (line 29-31).

2c) “This confluence of ecological and evolutionary dynamics calls into question the feasibility of characterizing genetic diversity in the human gut.” “Calls into question the feasibility” is a bit overstated. How about “requires new approaches and theory for

Author response: I have made the suggested edit (lines 32-33).

3) I am not swayed by the argument that there are always multiple strains in a subject (response to previous point #3). B fragilis is often found at very high abundances, and the presence of strain structure does not correlate with the abundance in that species, as would be expected if there was a detection limit problem. The author has not proved “a lack of genuine absences of strain structure”, which is a very strong claim when not written in the double negative. This could be mitigated by putting a qualifier of “for most species” for most sentences in this paragraph.

Author response: The reviewer is correct that the implications of the gamma with regard to the existence of strains make strong claims. In this manuscript I attempt neither to prove nor to disprove this claim; I simply state the interpretation of the gamma based on past macroecological research performed at the species level (Grilli, 2020). In the revision I now clarify this point (lines 493-501). I have also incorporated the reviewer’s suggestion and now include the qualifier “for several species” throughout the paragraph (lines 357, 495, 500, 505).

4) If I am understanding correctly, this paper is about the cross-host applicability of the SLM. However, the examples in the last paragraph of the discussion are all about within-host dynamics, which were investigated in a prior manuscript. This is confusing and perhaps misleading. Could the across-host applicability of the SLM help us understand how the microbiome at a global scale might change in response to a global perturbation like global warming?

Author response: I apologize for the lack of clarity. The last paragraph was originally written at the request of a previous reviewer asking that I talk about the applicability of the SLM in general. To incorporate this comment and the comment of Reviewer 2, I have rewritten this paragraph to focus on the overall state of strain macroecology in-light of this study and how the assumption of stationarity could be leveraged to identify the effects of perturbations across-hosts (lines 567-587).

The reviewer does raise an interesting question regarding global perturbations. I would imagine that the temperature of endothermic systems like the human gut would be largely immune to slight (though globally important) gradual changes in the temperature of the climate. Scenarios where the gut ecology of a large number of human hosts is simultaneously perturbed would be ideal for evaluating across-host strain ecology. This scenario is most likely to occur in human experimental trials (e.g., effect of change in diet, novel drug testing, etc.). In this scenario the SLM could be used to examine how statistical measures over an ensemble of hosts (e.g., mean abundance over hosts of a given strain) relax over time towards a stationary state after a perturbation (i.e., 〈x|t,x_0〉 → 〈x〉).
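The relaxation described above can be illustrated with a minimal simulation sketch. It assumes the SLM in the form used in species-level macroecology (Grilli, 2020), dx/dt = (x/τ)(1 − x/K) + √(σ/τ)·x·η(t), interpreted in the Itô sense, for which the stationary mean is K(1 − σ/2) when σ < 2. The function names, parameter values, and the Euler-Maruyama scheme on ln(x) are illustrative choices rather than anything taken from the manuscript.

```python
import math
import random


def simulate_slm_ensemble_mean(x0, K, sigma, tau, n_hosts, dt, n_steps, seed=0):
    """Return the ensemble mean <x | t, x0> over hosts at each time step.

    Assumed (hypothetical) setup: every host starts at the same perturbed
    abundance x0 and evolves independently under the SLM.
    """
    rng = random.Random(seed)
    # Integrate y = ln(x) so abundances stay positive. Under the Ito
    # convention, d(ln x) = [(1 - x/K) - sigma/2] / tau dt + sqrt(sigma/tau) dW.
    y = [math.log(x0)] * n_hosts
    noise_scale = math.sqrt(sigma * dt / tau)
    means = [x0]
    for _ in range(n_steps):
        for i in range(n_hosts):
            x = math.exp(y[i])
            drift = ((1.0 - x / K) - sigma / 2.0) / tau
            y[i] += drift * dt + noise_scale * rng.gauss(0.0, 1.0)
        means.append(sum(math.exp(v) for v in y) / n_hosts)
    return means


def stationary_mean(K, sigma):
    """Mean of the stationary gamma distribution of the SLM (requires sigma < 2)."""
    return K * (1.0 - sigma / 2.0)


if __name__ == "__main__":
    trajectory = simulate_slm_ensemble_mean(
        x0=0.05, K=1.0, sigma=0.5, tau=1.0, n_hosts=1000, dt=0.02, n_steps=750
    )
    print("initial mean:", trajectory[0])
    print("final mean:", trajectory[-1])
    print("stationary prediction:", stationary_mean(1.0, 0.5))
```

In a trial setting, the time at which the across-host mean stops drifting and fluctuates around the stationary value would be one way to read off when the cohort has returned to stationarity.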

5) I understand that Figure 1 isn’t about a goodness of fit, though the author has now done some analysis in this direction now anyway. I still don’t really understand what I am supposed to learn from this data looking roughly gamma though. How do the lognormal and gamma distribution relate to the SFS expected for within-host evolution described in lead up to Figure 1? Can the author give an example of a generative process that wouldn’t result in the patterns shown in Figure 1?

Author response: In my prior revision I fit a lognormal distribution at the reviewer’s request that I include and compare an additional distribution as an alternative to the gamma. The purpose of Fig. 1c is to

1) Demonstrate that different species have similar distributions of within-host allele frequencies across hosts when rescaled and

2) Propose the gamma distribution as a potential model of allelic diversity across hosts.

This is the empirical motivation that leads to the SLM being identified as a reasonable model, as the stationary solution of the SLM predicts a gamma distribution and describes the temporal dynamics of strains within a single human host (Eqs. 10-12). In contrast, a Langevin equation that models linear ecological growth (as opposed to the logistic growth of the SLM) and environmental noise (the same form of noise as the SLM) predicts a lognormal distribution. This model would be inappropriate since the lognormal did a comparatively poor job explaining the empirical distribution in Fig.1c relative to the gamma distribution.
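For reference, the contrast drawn above can be written out explicitly. The notation below follows the species-level form of the SLM in Grilli (2020) and is an assumption on my part; the manuscript's own Eqs. 10-12 may use different symbols. Both equations are read in the Itô sense, with growth timescale τ, carrying capacity K, noise strength σ < 2, and white noise η(t).

```latex
% Stochastic Logistic Model (logistic growth + environmental noise)
\frac{dx}{dt} = \frac{x}{\tau}\left(1 - \frac{x}{K}\right) + \sqrt{\frac{\sigma}{\tau}}\, x\, \eta(t)

% Its stationary distribution is a gamma distribution
P(x) = \frac{1}{\Gamma(\beta)} \left(\frac{\beta}{\bar{x}}\right)^{\beta} x^{\beta - 1} e^{-\beta x / \bar{x}},
\qquad \bar{x} = K\left(1 - \frac{\sigma}{2}\right),
\qquad \beta = \frac{2}{\sigma} - 1

% Linear growth with the same form of noise
\frac{dx}{dt} = \frac{x}{\tau} + \sqrt{\frac{\sigma}{\tau}}\, x\, \eta(t)

% Here \ln x performs a random walk with drift, so x is lognormal at any fixed time t
\ln x(t) \sim \mathcal{N}\!\left(\ln x_0 + \left(1 - \frac{\sigma}{2}\right)\frac{t}{\tau},\; \frac{\sigma t}{\tau}\right)
```

The self-limiting term −x²/(τK) is what turns the lognormal of pure multiplicative growth into a gamma distribution with a well-defined stationary state.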

However, different Langevin equations can produce the same probability distribution and it is worth considering whether such equations are viable alternatives to the SLM. In the manuscript I identified two Langevin equations of molecular evolution that also predict a gamma distribution (Eqs. 7-9; S1 Text). I describe the features of these models, their parameters, and how the assumptions necessary to explain the existence of genetic variants that are present at intermediate frequencies (0 < f < 1) over a large number of hosts are unrealistic (lines 409-439). As for a generative process that would not produce the pattern shown in Fig. 1c and would not produce a lognormal, an evolutionarily neutral allele or an allele on the background of an ecologically neutral strain would generate a Gaussian distribution over short timescales.

Finally, this section leads into an explicit test of whether the SLM as an ecological interpretation of the gamma is reasonable, where I find that the accuracy of the gamma is correlated with independent estimates of strain structure (Fig. 4), pointing to the SLM as a feasible ecological model. I have revised this section of the manuscript for clarity (Lines 367-386). I now also describe how distributions different from the gamma (i.e., lognormal or Gaussian) could emerge as a consequence of ecological or evolutionary dynamics (lines 373-385).

Reviewer #2

In their revision, the author worked to improve the readability of the manuscript, clarifying the concepts and the aim of the study. Overall, they did a good job. I thank the author for rewriting the manuscript to make it easier to follow. I still have some difficulties interpreting the equations, which is likely due to my unfamiliarity with the subject.

Author response: I thank the reviewer for their feedback.

Overall comments

Since the article was based on a previously published work (Wolff et al. 2023), I think it would improve understanding of the context of the study further if one or two lines summarizing the results of that study were added at line 69.

Author response: I have added the requested lines to the revision (lines 76-80).

While the last paragraph of the discussion addresses this, it is still not entirely clear to me how the model can be applied to dysbiosis. What are the implications of the prediction of the prevalence of a given gene in a fraction of hosts? What does that tell us about the microbiome?

Author response: I apologize to the reviewer for the lack of clarity. Much of this paragraph was written at the request of a past reviewer that asked for the macroecological implications of this work in terms of practical application to be discussed. To incorporate this comment and the comment of Reviewer 1, I have rewritten this paragraph to focus on the state of strain macroecology in-light of this study. Specifically, how the analyses in this study assume that strain dynamics are stationary with respect to time and how that assumption could be leveraged to determine when statistics calculated over the gut microbiomes of a large number of perturbed hosts approach stationarity (e.g., mean frequency or prevalence approaching a stationary value over time; lines 576-587).

References

Dapa, T., Wong, D. P., Vasquez, K. S., Xavier, K. B., Huang, K. C., & Good, B. H. (2023). Within-host evolution of the gut microbiome. Current Opinion in Microbiology, 71, 102258. https://doi.org/10.1016/j.mib.2022.102258

Garud, N. R., Good, B. H., Hallatschek, O., & Pollard, K. S. (2019). Evolutionary dynamics of bacteria in the gut microbiome within and across hosts. PLOS Biology, 17(1), e3000102. https://doi.org/10.1371/journal.pbio.3000102

Good, B. H. (2022). Linkage disequilibrium between rare mutations. Genetics, 220(4), iyac004. https://doi.org/10.1093/genetics/iyac004

Good, B. H., Martis, S., & Hallatschek, O. (2018). Adaptation limits ecological diversification and promotes ecological tinkering during the competition for substitutable resources. Proceedings of the National Academy of Sciences, 115(44), E10407–E10416. https://doi.org/10.1073/pnas.1807530115

Good, B. H., & Hallatschek, O. (2018). Effective models and the search for quantitative principles in microbial evolution. Current Opinion in Microbiology, 45, 203–212. https://doi.org/10.1016/j.mib.2018.11.005

Grilli, J. (2020). Macroecological laws describe variation and diversity in microbial communities. Nature Communications, 11(1), 4743. https://doi.org/10.1038/s41467-020-18529-y

Roodgar, M., Good, B. H., Garud, N. R., Martis, S., Avula, M., Zhou, W., Lancaster, S. M., Lee, H., Babveyh, A., Nesamoney, S., Pollard, K. S., & Snyder, M. P. (2021). Longitudinal linked-read sequencing reveals ecological and evolutionary responses of a human gut microbiome during antibiotic treatment. Genome Research, 31(8), 1433–1446. https://doi.org/10.1101/gr.265058.120

Shoemaker, W. R., Chen, D., & Garud, N. R. (2021). Comparative Population Genetics in the Human Gut Microbiome. Genome Biology and Evolution, evab116. https://doi.org/10.1093/gbe/evab116

Wolff, R., Shoemaker, W., & Garud, N. (2023). Ecological Stability Emerges at the Level of Strains in the Human Gut Microbiome. MBio, 0(0), e02502-22.

Wong, D., & Good, B. (2022). Ecological diversification of rapidly adapting populations. In APS March Meeting Abstracts (Vol. 2022, pp. M05-011).

Attachment

Submitted filename: 20230608_2201_point-by-point.pdf

Decision Letter 2

Karthik Raman

2 Jul 2023

PONE-D-22-30239R2

A macroecological perspective on genetic diversity in the human gut microbiome

PLOS ONE

Dear Dr. Shoemaker,

Thank you for submitting your manuscript to PLOS ONE. The manuscript is now almost ready for acceptance. However, we have a new reviewer, who has a minor comment, which the author could potentially comment on, and further improve the manuscript.

Please submit your revised manuscript by Aug 16 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Karthik Raman, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #3: The author has addressed all prior reviewer concerns. I have no further concerns, and I commend the author on an interesting piece. One minor point that the author may want to consider is to discuss de novo evolution of coexisting strains from a recent ancestor within the lifespan of a human host. Most of the discussion of ecological dynamics of strains appears to assume rather distant lineages (separated by > thousands of SNPs) colonizing the same host. However, the Zhao et al. reference (ref 1) shows two B. fragilis strains within an individual, separated by a handful of SNPs, where both lineages coexist for a year or more (i.e., stationary dynamics - one lineage does not sweep the other). This would suggest a kind of sympatric speciation (transition from the evolutionary to the ecological regime) is possible over very short timescales.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 

PLoS One. 2023 Jul 21;18(7):e0288926. doi: 10.1371/journal.pone.0288926.r006

Author response to Decision Letter 2


5 Jul 2023

Reviewer three comment

The author has addressed all prior reviewer concerns. I have no further concerns, and I commend the author on an interesting piece. One minor point that the author may want to consider is to discuss de novo evolution of coexisting strains from a recent ancestor within the lifespan of a human host. Most of the discussion of ecological dynamics of strains appears to assume rather distant lineages (separated by > thousands of SNPs) colonizing the same host. However, the Zhao et al. reference (ref 1) shows two B. fragilis strains within an individual, separated by a handful of SNPs, where both lineages coexist for a year or more (i.e., stationary dynamics - one lineage does not sweep the other). This would suggest a kind of sympatric speciation (transition from the evolutionary to the ecological regime) is possible over very short timescales.

Author response: The reviewer raises an interesting question. They are correct in their assessment that I implicitly assumed that strains are diverged by many SNVs. This is due to the limitations of algorithms such as strain-finder, which uses the distribution of allele frequencies to test for the existence of strain structure within a host, where strain structure cannot be determined if the strains within a host are genetically diverged at only a handful of sites.

However, strains must ultimately have descended from a common ancestor, and it is worth discussing whether the existence of recently evolved strains can impact the patterns evaluated in this study. A newly evolved strain within a single host is analogous to a species that is only present in a single host. Given that the sampling form of the gamma distribution used in this study succeeds at predicting the prevalence of these species (Grilli, 2020), it should, in principle, be capable of predicting the prevalence of an SNV observed in a single host due to recently evolved strain structure. Across species we find that predictions for low prevalence alleles consistently fail though they tend to succeed for high prevalence alleles (e.g., Figs. S5). It is reasonable to interpret the lack of predictive success for low prevalence alleles as a consequence of said allele being present in a low number of hosts due to evolutionary dynamics, rather than its presence being a reflection of ecological strain structure. However, this interpretation does not mean that recently diverged strains are absent in the cohort of human hosts used in this study. Rather, it is possible that the macroecological lens applied here is insensitive to recently diverged strains that have colonized only a few hosts, a number comparable to the number of hosts in which we would expect to find a given allele at intermediate frequencies due to evolutionary dynamics alone (e.g., repeated mutation).

In the revised manuscript I now summarize the above response to address the reviewer’s comment (lines 574-595).

References

Grilli, J. (2020). Macroecological laws describe variation and diversity in microbial communities. Nature Communications, 11(1), 4743. https://doi.org/10.1038/s41467-020-18529-y

Attachment

Submitted filename: 20230603_1112_point_by_point.pdf

Decision Letter 3

Karthik Raman

7 Jul 2023

A macroecological perspective on genetic diversity in the human gut microbiome

PONE-D-22-30239R3

Dear Dr. Shoemaker,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Karthik Raman, Ph.D.

Academic Editor

PLOS ONE

Acceptance letter

Karthik Raman

13 Jul 2023

PONE-D-22-30239R3

A macroecological perspective on genetic diversity in the human gut microbiome

Dear Dr. Shoemaker:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Karthik Raman

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Measures of genetic diversity among nonsynonymous sites.

    Measures of genetic diversity calculated from nonsynonymous sites exhibit similar statistical forms across phylogenetically distant species in the human gut, similar to patterns observed among synonymous sites (Fig 1).

    (TIF)

    S2 Fig. AFD survival curves for synonymous sites.

    Survival forms of rescaled distributions of within-host allele frequencies across hosts and mean frequencies across hosts. Representing the data presented in Fig 1c and 1d as survival curves reveals how distributions of genetic diversity have similar forms across phylogenetically distant species. Each non-black line represents a species. A dashed black line represents the fit of a gamma distribution and a dotted black line represents the fit of a lognormal distribution.

    (TIF)

    S3 Fig. AFD survival curves for nonsynonymous sites.

    The equivalent of S2 Fig for nonsynonymous sites.

    (TIF)

    S4 Fig. Coverage distribution.

    The use of the log-likelihood ratio in MAPGD introduces a lower bound on the total depth of coverage (D) necessary to estimate the frequency of an allele at a given site. a) The existence of a lower bound translates to a truncation of the data, where I did not observe any sites with a coverage less than 20 that were processed by MAPGD. b,c) This truncation means that the depth of coverage of a minor allele (A) cannot be less than half the total coverage (e.g., 10).

    (TIF)

    S5 Fig. Synonymous prevalence predictions.

    A direct comparison between the observed prevalence of all alleles and their corresponding predicted prevalences using the SLM for synonymous sites. A total of 1,000 datapoints were sampled without replacement for each subplot.

    (TIF)

    S6 Fig. Nonsynonymous prevalence predictions.

    Analogous analyses to S5 Fig using nonsynonymous sites.

    (TIF)

    S7 Fig. Synonymous prevalence prediction error distributions.

    By calculating the relative error of all alleles for the SLM I can examine the error distributions across species. To visually compare the two models, I examined the survival distribution of the relative errors (i.e., the complement of the empirical cumulative distribution function). All alleles in this plot are at synonymous sites.

    (TIF)

    S8 Fig. Nonsynonymous prevalence prediction error distributions.

    Analogous analyses to S7 Fig using nonsynonymous sites.

    (TIF)

    S9 Fig. Synonymous relationship between f¯ and prevalence.

    The empirical relationship between the mean frequency of an allele (f¯) and its prevalence across hosts can be recapitulated by the SLM for synonymous sites. Blue dots represent observed values and the shade of blue is proportional to the density of observations. The black line is the predicted relationship calculated using Eq 11. A total of 1,000 datapoints were sampled without replacement for each subplot.

    (TIF)

    S10 Fig. Nonsynonymous relationship between f¯ and prevalence.

    Analogous analyses to S9 Fig using nonsynonymous sites.

    (TIF)

    S11 Fig. Nonsynonymous prevalence error analysis.

    The equivalent analyses in Fig 3 were performed on alleles at nonsynonymous sites. The results of these analyses are qualitatively consistent with those of synonymous sites.

    (TIF)

    S12 Fig. Relationship between f¯ and β for synonymous sites.

    The relationship between the empirical estimates of the two parameters of the SLM: the mean allele frequency across hosts (f¯) and the squared inverse of the coefficient of variation of frequencies across hosts (β). Each point is an individual allele. All alleles are on synonymous sites. A total of 1,000 datapoints were sampled without replacement for each subplot.

    (TIF)

    S13 Fig. Relationship between f¯ and β for nonsynonymous sites.

    Analogous analyses to S12 Fig using nonsynonymous sites.

    (TIF)

    S1 Text. Supplemental information.

    Derivation of the distribution of allele frequencies under a linearized single-locus model of evolution.

    (PDF)

    Attachment

    Submitted filename: 20230420_1538_Response.pdf

    Attachment

    Submitted filename: 20230608_2201_point-by-point.pdf

    Attachment

    Submitted filename: 20230603_1112_point_by_point.pdf

    Data Availability Statement

    All raw sequencing data for the metagenomic samples used in this study were downloaded from the HMP (Huttenhower et al., 2012; Lloyd-Price et al., 2017) (URLs: https://portal.hmpdacc.org/, https://aws.amazon.com/datasets/human-microbiome-project/). MIDAS output was processed using publicly available code (Garud et al., 2019). Processed data is available on Zenodo: https://doi.org/10.5281/zenodo.6793770. All code written for this study is available on GitHub: https://github.com/wrshoemaker/StrainPrevalence.


    Articles from PLOS ONE are provided here courtesy of PLOS