PLOS Computational Biology. 2024 Mar 15;20(3):e1011937. doi: 10.1371/journal.pcbi.1011937

Bayesian inference of relative fitness on high-throughput pooled competition assays

Manuel Razo-Mejia 1,*, Madhav Mani 2,3, Dmitri Petrov 1,4,5
Editor: Sergei Maslov6
PMCID: PMC10971673  PMID: 38489348

Abstract

The tracking of lineage frequencies via DNA barcode sequencing enables the quantification of microbial fitness. However, experimental noise coming from biotic and abiotic sources complicates the computation of a reliable inference. We present a Bayesian pipeline to infer relative microbial fitness from high-throughput lineage tracking assays. Our model accounts for multiple sources of noise and propagates uncertainties throughout all parameters in a systematic way. Furthermore, using modern variational inference methods based on automatic differentiation, we are able to scale the inference to a large number of unique barcodes. We extend this core model to analyze multi-environment assays, replicate experiments, and barcodes linked to genotypes. On simulations, our method recovers known parameters within posterior credible intervals. This work provides a generalizable Bayesian framework to analyze lineage tracking experiments. The accompanying open-source software library enables the adoption of principled statistical methods in experimental evolution.

Author summary

In this study, we present a novel Bayesian pipeline for analyzing DNA barcode tracking sequencing data, addressing the challenge of accurately quantifying competitive microbial fitness in the presence of experimental noise. Our method uniquely contributes to understanding microbial evolutionary dynamics by enabling reliable inference of the relative fitness of diverse microbial strains from high-throughput lineage tracking assays. Our approach is distinct in its ability to systematically account for and propagate uncertainties from various noise sources throughout all inferred parameters. Furthermore, the error-propagation quality of our Bayesian method allows us to extend the inference pipeline to common dataset structures, such as jointly analyzing multiple experimental replicates or accounting for multiple unique barcodes mapping the equivalent genotypes. This comprehensive treatment of uncertainties is crucial in experimental settings where noise can significantly influence the results. Furthermore, we have optimized our pipeline for scalability, allowing it to handle large numbers of unique barcodes effectively. This scalability is essential for analyzing complex datasets typical in microbial fitness studies.

Introduction

The advent of DNA barcoding (the ability to uniquely identify cell lineages with DNA sequences integrated at a specific locus) and high-throughput sequencing has opened new avenues for understanding microbial evolutionary dynamics with an unprecedented level of temporal resolution [1–3]. These experimental efforts rely on our ability to reliably infer the relative fitness of an ensemble of diverse genotypes. Moreover, inferring these fitness values over an ensemble of environmental conditions can help us determine the phenotypic diversity of a rapid adaptation process [4].

As with any other sequencing-based quantification, tracking lineages via DNA barcode sequencing is inevitably accompanied by noise arising from experimental manipulation of the microbial cultures, DNA extraction, sequencing library preparation involving multiple rounds of PCR amplification, and the sequencing process itself. Thus, accounting for the uncertainty when inferring the relevant parameters from the data is a crucial step to draw reliable conclusions. Bayesian statistics presents a paradigm by which one can account for all known sources of uncertainty in a principled way [5]. This, combined with the development of modern Markov Chain Monte Carlo sampling algorithms [6] and approximate variational approaches [7], has driven a resurgence of Bayesian methods in different fields [8].

We present a Bayesian inference pipeline to quantify the uncertainty about the parametric information we can extract from high-throughput competitive fitness assays given a model of the data generation process and experimental data. In these assays, the fitness of an ensemble of genotypes is determined relative to a reference genotype [3, 4]. Fig 1(A) shows a schematic of the experimental procedure in which an initial pool of barcoded strains is mixed with a reference strain and inoculated into fresh media. After some time (usually, enough time for the culture to saturate) an aliquot is transferred to fresh media, while the remaining culture is used for DNA sequencing of the lineage barcodes. The time-series information of the relative abundance of each lineage, i.e., the barcode frequency depicted in Fig 1(B), is used to infer the relative fitness (the growth advantage on a per-cycle basis) for each lineage with respect to the reference strain. The proposed statistical model accounts for multiple sources of uncertainty when inferring the lineages’ relative fitness values (see Section “Experimental setup” for details on sources of uncertainty accounted for by the model). Furthermore, minor changes to the core statistical model allow us to account for relevant experimental variations of these competition assays. More specifically, in Section “Fitness inference on multiple environments”, we present a variation of the statistical model to infer fitness on growth-dilution cycles in multiple environments with proper error propagation. In addition, as described in Section “Accounting for experimental replicates via hierarchical models”, our statistical model can account for batch-to-batch differences when jointly analyzing multiple experimental replicates using a Bayesian hierarchical model. Finally, a variant of these hierarchical models, presented in Section “Accounting for multiple barcodes per genotype via hierarchical models”, can account for variability within multiple barcodes mapping to equivalent genotypes within the same experiment.

Fig 1. Typical competitive fitness experiment.

(A) Schematic of the typical experimental design to determine the competitive fitness of an ensemble of barcoded genotypes. Genotypes are pooled together and grown over multiple growth-dilution cycles. At the end of each cycle, a sample is processed to generate a library for amplicon sequencing. (B) Typical barcode trajectory dataset. From each time point, the relative frequency of each barcode is determined from the total number of reads. Shades of blue represent different relative fitness. Darker gray lines define the typical trajectory of neutral lineages. (C) Sources of uncertainty accounted for by our method. The Bayesian model fit to the data propagates uncertainties from the categories schematically depicted to all parameters in the inference.

For all the model variations presented in this paper, we benchmark the ability of our pipeline to infer relative fitness parameters against synthetic data generated from logistic growth simulations with added random noise. A Julia package accompanies the present method to readily implement the inference pipeline with state-of-the-art scientific computing software.

Results

Experimental setup

The present work is designed to analyze time-series data of relative abundance of multiple microbial lineages uniquely identified by a DNA barcode [3, 4]. In these competition assays, an ensemble of genotypes is pooled together with an unlabeled reference strain that, initially, represents the vast majority (≥90%) of the cells in the culture (see schematic in Fig 1(A)). Furthermore, a fraction of labeled genotypes equivalent to the unlabeled reference strain—hereafter defined as neutral lineages—are spiked in at a relatively high abundance (≈ 3 − 5%). The rest of the culture is left for the ensemble of genotypes of interest. This experimental design in which the barcodes of interest represent a small fraction of the culture serves three main purposes: First, the model used to infer the relative fitness differences between genotypes is valid in the regime where the genotype frequency is significantly smaller than one (see Section “Fitness model”). Second, potential problems with frequency-dependent selection are minimized as long as the frequency of each genotype remains small. Third, having a single genotype dominating the culture standardizes the environment experienced by all genotypes. This is because variations in the chemistry of the environment are effectively dictated by the dominating genotype, allowing for reproducible growth-dilution cycles.

To determine the relative fitness of the ensemble of genotypes, a series of growth-dilution cycles is performed in either a single environment or multiple environments. In other words, the cultures are grown for some time; then, an aliquot is inoculated into fresh media for the next growth cycle. This process is repeated for roughly 4–7 cycles, depending on the initial abundances of the mutants and their relative growth rates. The DNA barcodes are sequenced at the end of each growth cycle to quantify the relative abundance of each of the barcodes. We point the reader to [4] for specific details on these assays for S. cerevisiae and to [3] for equivalent assays for E. coli. Fig 1(B) presents a typical barcode trajectory where the black trajectories represent the so-called neutral lineages, genetically equivalent to the untagged ancestor strain that initially dominates the culture. These spiked-in neutral lineages simplify the inference problem since the fitness of all relevant barcodes is quantified with respect to these barcodes, which is why it is referred to as relative fitness.

Preliminaries on mathematical notation

Before jumping directly into the Bayesian inference pipeline, let us establish the mathematical notation used throughout this paper. We define (column) vectors as underlined lowercase symbols such as

$$\underline{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}. \tag{1}$$

In the same way, we define matrices as double-underline uppercase symbols such as

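$$\underline{\underline{A}} = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1N} \\ A_{21} & A_{22} & \cdots & A_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ A_{M1} & A_{M2} & \cdots & A_{MN} \end{bmatrix}. \tag{2}$$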

Fitness model

Empirically, each barcode frequency trajectory follows an exponential function of the form [1, 3, 4]

$$f_{t+1}^{(b)} = f_t^{(b)} e^{\left(s^{(b)} - \bar{s}_t\right)\tau}, \tag{3}$$

where $f_t^{(b)}$ is the frequency of barcode $b$ at the end of cycle number $t$, $s^{(b)}$ is the relative fitness with respect to the reference strain (the quantity we want to infer from the data), $\bar{s}_t$ is the mean fitness of the culture at the end of cycle number $t$, and $\tau$ is the time elapsed between cycle $t$ and $t + 1$. We can rewrite Eq 3 as

$$\frac{1}{\tau} \ln \frac{f_{t+1}^{(b)}}{f_t^{(b)}} = \left(s^{(b)} - \bar{s}_t\right). \tag{4}$$

Eq 4 separates the measurements—the barcode frequencies—from the unobserved (sometimes referred to as latent) parameters we want to infer from the data—the population mean fitness and the barcode relative fitness. This is ultimately the functional form used in our inference pipeline. Therefore, the relative fitness is computed by knowing the log frequency ratio of each barcode throughout the growth-dilution cycles.

The presence of the neutral lineages facilitates the determination of the population mean fitness value $\bar{s}_t$. Since every relative fitness is determined relative to the neutral lineage that dominates the culture, we define their fitness to be $s^{(n)} = 0$, where the superscript $(n)$ specifies their neutrality. This means that Eq 4 for a neutral lineage takes the simpler form

$$\frac{1}{\tau} \ln \frac{f_{t+1}^{(n)}}{f_t^{(n)}} = -\bar{s}_t. \tag{5}$$

Therefore, we can use the data from these reference barcodes to directly infer the value of the population mean fitness.
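
To make Eqs 4 and 5 concrete, the following minimal Python sketch computes naive point estimates from noise-free frequency trajectories. It is only an illustration of the algebra, not the Bayesian pipeline developed in this paper: the function names, the averaging over neutral lineages, and the toy numbers are all assumptions introduced here, and no uncertainty is propagated.

```python
import math

def log_freq_ratios(freqs, tau=1.0):
    """Left-hand side of Eq 4: (1/tau) * ln(f_{t+1} / f_t) for adjacent cycles."""
    return [math.log(f_next / f_now) / tau for f_now, f_next in zip(freqs, freqs[1:])]

def mean_fitness_from_neutrals(neutral_trajs, tau=1.0):
    """Eq 5: s_bar_t is minus the neutral log frequency ratio (averaged over neutrals)."""
    per_neutral = [log_freq_ratios(traj, tau) for traj in neutral_trajs]
    n_pairs = len(per_neutral[0])
    return [-sum(r[t] for r in per_neutral) / len(per_neutral) for t in range(n_pairs)]

def naive_relative_fitness(traj, s_bar, tau=1.0):
    """Eq 4 rearranged: s = log ratio + s_bar_t, averaged over cycle pairs."""
    ratios = log_freq_ratios(traj, tau)
    return sum(r + sb for r, sb in zip(ratios, s_bar)) / len(ratios)

def trajectory(f0, s, s_bar, tau=1.0):
    """Hypothetical helper: build a noise-free trajectory by iterating Eq 3."""
    freqs = [f0]
    for sb in s_bar:
        freqs.append(freqs[-1] * math.exp((s - sb) * tau))
    return freqs

# Toy values chosen for illustration only.
true_s_bar = [0.1, 0.2, 0.3]
neutrals = [trajectory(0.03, 0.0, true_s_bar), trajectory(0.05, 0.0, true_s_bar)]
mutant = trajectory(0.001, 0.5, true_s_bar)

s_bar_hat = mean_fitness_from_neutrals(neutrals)
s_hat = naive_relative_fitness(mutant, s_bar_hat)
```

With noise-free input the sketch recovers the values used to build the trajectories. With real data, every log frequency ratio is noisy, which is what motivates the probabilistic treatment that follows.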

It is important to notice that the frequencies $f_t^{(b)}$ are not the allele frequencies in the population (most of the culture is not sequenced since the reference strain is not barcoded), but rather the relative frequencies in the total number of sequencing reads. A way to conceptualize this subtle but important point is to assume exponential growth in the number of cells $N_t^{(b)}$ of the form

$$N_{t+1}^{(b)} = N_t^{(b)} e^{\lambda^{(b)} \tau}, \tag{6}$$

for every barcode $b$ with growth rate $\lambda^{(b)}$. However, when we sequence barcodes, we do not directly measure the number of cells, but some number of reads $r_t^{(b)}$ that map to barcode $b$. In the simplest possible scenario, we assume

$$r_t^{(b)} \propto N_t^{(b)}, \tag{7}$$

where, importantly, the proportionality constant depends on the total number of reads for the library for cycle $t$, which might vary from library to library. Therefore, to compare the number of reads between libraries at different time points, we must normalize the number of reads to the same scale. The simplest form is to define a relative abundance, i.e., a frequency with respect to the total number of reads,

$$f_t^{(b)} \equiv \frac{r_t^{(b)}}{\sum_{b'} r_t^{(b')}}. \tag{8}$$

This is the frequency Eq 3 describes.
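
As an illustration of why the normalization in Eq 8 matters, the sketch below simulates two barcodes growing exponentially (Eq 6), sequences each cycle at a different, arbitrary depth (the proportionality constant of Eq 7), and normalizes reads to frequencies (Eq 8). All numerical values and names are hypothetical. Under these idealized, noise-free assumptions, the difference between the mutant and neutral log frequency ratios equals the difference in growth rates regardless of the sequencing depth.

```python
import math

def reads_to_frequencies(reads):
    """Eq 8: divide each barcode's reads by the total reads in that library."""
    total = sum(reads)
    return [r / total for r in reads]

# Hypothetical setup: barcode 0 is neutral, barcode 1 is a mutant.
growth_rates = [1.0, 1.5]    # lambda^(b) in Eq 6
cells = [1000.0, 100.0]      # initial cell numbers N_0^(b)
tau = 1.0
depths = [0.5, 2.0, 0.1]     # reads per cell; differs between libraries (Eq 7)

freqs_per_cycle = []
for depth in depths:
    reads = [depth * n for n in cells]
    freqs_per_cycle.append(reads_to_frequencies(reads))
    # Grow every barcode for one cycle (Eq 6).
    cells = [n * math.exp(lam * tau) for n, lam in zip(cells, growth_rates)]

def log_ratio(barcode, t):
    """(1/tau) * ln(f_{t+1} / f_t) for one barcode between cycles t and t+1."""
    return math.log(freqs_per_cycle[t + 1][barcode] / freqs_per_cycle[t][barcode]) / tau

# Mutant minus neutral log ratio: the library-specific depth cancels out.
fitness_estimates = [log_ratio(1, t) - log_ratio(0, t) for t in range(len(depths) - 1)]
```

Here every entry of `fitness_estimates` equals the growth-rate difference of 0.5 up to floating-point error, even though the raw read counts change by more than an order of magnitude between libraries.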

Our ultimate objective is to infer the relative fitness $s^{(b)}$ for each of the $M$ relevant barcodes in the experiment, hereafter referred to as $s^{(m)}$ to distinguish it from the general relative fitness $s^{(b)}$ and the neutral lineages' relative fitness $s^{(n)}$. To do so, we account for the three primary sources of uncertainty in our model:

  1. Uncertainty in the determination of frequencies. Our model relates frequencies between adjacent growth-dilution cycles to the fitness of the corresponding strain. However, we do not directly measure frequencies. Instead, our data for each barcode consists of a length $T$ vector of counts $\underline{r}^{(b)}$ for each of the $T$ cycles in which the measurements were taken.

  2. Uncertainty in the value of the population mean fitness. We define neutral lineages to have fitness $s^{(n)} = 0$, helping us anchor the value of the population mean fitness $\bar{s}_t$ for each pair of adjacent growth cycles. Moreover, we take this parameter as an empirical parameter to be obtained from the data, meaning that we do not impose a functional form that relates $\bar{s}_t$ to $\bar{s}_{t+1}$. Thus, we must infer the $T - 1$ values of this population mean fitness with their uncertainty that must be propagated to the value of the mutants’ relative fitness.

  3. Uncertainty in each of the mutants’ fitness values.

Fig 1(C) shows schematically the sources of uncertainty accounted for by our model. The first three sources (population composition, barcode read counts, and barcode frequencies) all contribute to uncertainty source 1. Uncertainty sources 2 and 3 are depicted by the last two sources, respectively. To account for all these sources of uncertainty in a principled way, in the next section, we develop a Bayesian inference pipeline.

Bayesian inference

As defined in the previous section, our ultimate objective is to infer the vector of relative fitness values

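$$\underline{s}^M = \left(s^{(1)}, s^{(2)}, \ldots, s^{(M)}\right)^\dagger, \tag{9}$$

where $\dagger$ indicates the transpose. Our data consists of a $T \times B$ matrix $\underline{\underline{R}}$, where $B = M + N$ is the number of unique barcodes given by the sum of the number of unique, relevant barcodes we care about, $M$, and the number of unique neutral barcodes, $N$, and $T$ is the number of growth cycles where measurements were taken. The data matrix is then of the form

$$\underline{\underline{R}} = \begin{bmatrix} - & \underline{r}_1 & - \\ - & \underline{r}_2 & - \\ & \vdots & \\ - & \underline{r}_T & - \end{bmatrix}, \tag{10}$$

where each row $\underline{r}_t$ is a $B$-dimensional array containing the raw barcode counts at cycle $t$. We can further split each vector $\underline{r}_t$ into two vectors of the form

$$\underline{r}_t = \begin{bmatrix} \underline{r}_t^N \\ \underline{r}_t^M \end{bmatrix}, \tag{11}$$

i.e., the vector containing the neutral lineage barcode counts $\underline{r}_t^N$ and the corresponding vector containing the mutant barcode counts $\underline{r}_t^M$. Following the same logic, matrix $\underline{\underline{R}}$ can be split into two matrices as

$$\underline{\underline{R}} = \begin{bmatrix} \underline{\underline{R}}^N & \underline{\underline{R}}^M \end{bmatrix}, \tag{12}$$

where $\underline{\underline{R}}^N$ is a $T \times N$ matrix with the barcode reads time series for each neutral lineage and $\underline{\underline{R}}^M$ is the equivalent $T \times M$ matrix for the non-neutral lineages.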

Our objective is to compute the joint probability distribution for all relative fitness values given our data. We can express this joint posterior distribution using Bayes theorem as

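$$\pi\left(\underline{s}^M \mid \underline{\underline{R}}\right) = \frac{\pi\left(\underline{\underline{R}} \mid \underline{s}^M\right) \pi\left(\underline{s}^M\right)}{\pi\left(\underline{\underline{R}}\right)}, \tag{13}$$

where hereafter $\pi(\cdot)$ defines a probability density function. When defining our statistical model, we need not focus on the denominator on the right-hand side of Eq 13. Thus, we can write

$$\pi\left(\underline{s}^M \mid \underline{\underline{R}}\right) \propto \pi\left(\underline{\underline{R}} \mid \underline{s}^M\right) \pi\left(\underline{s}^M\right). \tag{14}$$

However, when implementing the model computationally, the normalization constant on the right-hand side of Eq 13 must be computed. This can be done from the definition of the model via an integral of the form

$$\pi\left(\underline{\underline{R}}\right) = \int d^M \underline{s}^M \, \pi\left(\underline{\underline{R}} \mid \underline{s}^M\right) \pi\left(\underline{s}^M\right), \tag{15}$$

also known as a marginalization integral. Hereafter, differentials of the form $d^n$ imply an $n$-dimensional integral.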
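
To illustrate how Eqs 13–15 fit together, the following sketch evaluates them on a grid for a single, one-dimensional fitness parameter. The Gaussian likelihood with known noise, the Gaussian prior, and the four "observations" are hypothetical choices made only so that the result can be checked against the textbook conjugate solution; they are not the model defined in S1 Text.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) distribution evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical noisy measurements of one fitness value s (illustration only).
observations = [0.42, 0.55, 0.48, 0.51]
sigma_obs = 0.1                     # assumed known observation noise
prior_mu, prior_sigma = 0.0, 1.0    # assumed Normal prior on s

ds = 0.001
grid = [-1.0 + i * ds for i in range(3001)]   # s values from -1 to 2

def unnormalized_posterior(s):
    """Numerator of Eq 13: likelihood times prior."""
    likelihood = 1.0
    for y in observations:
        likelihood *= normal_pdf(y, s, sigma_obs)
    return likelihood * normal_pdf(s, prior_mu, prior_sigma)

unnorm = [unnormalized_posterior(s) for s in grid]
evidence = sum(unnorm) * ds                    # Eq 15 as a Riemann sum
posterior = [u / evidence for u in unnorm]     # Eq 13
posterior_mean = sum(s * p for s, p in zip(grid, posterior)) * ds
```

In one dimension this brute-force sum is trivial, and `posterior_mean` agrees with the analytical conjugate result 196/401 ≈ 0.489. The cost of such a grid grows exponentially with the number of parameters, which is why the high-dimensional problem in this paper requires a different numerical strategy.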

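Although Eqs 13 and 14 seem simple enough, recall that Eq 3 relates barcode frequency values and the population mean fitness to the mutant relative fitness. Therefore, we must include these nuisance parameters as part of our inference problem. We direct the reader to Section B of S1 Text for the exact definitions of these parameters. Here, it suffices to say that the inference problem must include the vector $\underline{\bar{s}}_T$ of all population mean fitness values and the matrix $\underline{\underline{F}}$ of all barcode frequencies within the sequencing data. With these nuisance variables in hand, the full inference problem we must solve takes the form

$$\pi\left(\underline{s}^M, \underline{\bar{s}}_T, \underline{\underline{F}} \mid \underline{\underline{R}}\right) \propto \pi\left(\underline{\underline{R}} \mid \underline{s}^M, \underline{\bar{s}}_T, \underline{\underline{F}}\right) \pi\left(\underline{s}^M, \underline{\bar{s}}_T, \underline{\underline{F}}\right). \tag{16}$$

To recover the marginal distribution over the non-neutral barcodes' relative fitness values, we can numerically integrate out all nuisance parameters, i.e.,

$$\pi\left(\underline{s}^M \mid \underline{\underline{R}}\right) = \int d^{T-1} \underline{\bar{s}}_T \int d^B \underline{f}_1 \cdots \int d^B \underline{f}_T \; \pi\left(\underline{s}^M, \underline{\bar{s}}_T, \underline{\underline{F}} \mid \underline{\underline{R}}\right). \tag{17}$$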

Factorizing the posterior distribution

The left-hand side of Eq 16 is extremely difficult to work with. However, we can take advantage of the structure of our inference problem to rewrite it in a more manageable form. Specifically, the statistical dependencies of our observations and latent variables allow us to factorize the joint distribution into the product of multiple conditional distributions. To gain some intuition about this factorization, let us focus on the inference of the population mean fitness values $\underline{\bar{s}}_T$. Eq 5 relates the value of the population mean fitness to the neutral lineage frequencies and nothing else. This suggests that when writing the posterior for these population mean fitness parameters, we should be able to condition it only on the neutral lineage frequency values, i.e., $\pi\left(\underline{\bar{s}}_T \mid \underline{\underline{F}}^N\right)$. We point the reader to Section B in S1 Text for the full mathematical details on this factorization. For our purpose here, it suffices to say we can rewrite the joint probability distribution as a product of conditional distributions of the form

$$\pi\left(\underline{s}^M, \underline{\bar{s}}_T, \underline{\underline{F}} \mid \underline{\underline{R}}\right) = \pi\left(\underline{s}^M \mid \underline{\bar{s}}_T, \underline{\underline{F}}^M\right) \pi\left(\underline{\bar{s}}_T \mid \underline{\underline{F}}^N\right) \pi\left(\underline{\underline{F}} \mid \underline{\underline{R}}\right). \tag{18}$$

Written in this form, Eq 18 captures the three sources of uncertainty listed in Section “Fitness model” in each term. Reading the right-hand side of Eq 18 from right to left, the first term accounts for the uncertainty when inferring the frequency values given the barcode reads. The second term accounts for the uncertainty in the values of the mean population fitness at different time points. The last term accounts for the uncertainty in the parameter we care about, namely the mutants’ relative fitnesses. We refer the reader to Section B in S1 Text for an extended description of the model with specific functional forms for each term on the right-hand side of Eq 18 as well as the extension of the model to account for multiple experimental replicates or hierarchical genotypes.

Variational inference

One of the technical challenges to the adoption of Bayesian methods is the analytical intractability of integrals such as that of Eq 17. Furthermore, even though efficient Markov Chain Monte Carlo (MCMC) algorithms such as Hamiltonian Monte Carlo can numerically perform this integration [6], the dimensionality of the problem in Eq 18 makes an MCMC-based approach prohibitively slow.

To overcome this computational limitation, we rely on the recent development of the automatic differentiation variational inference algorithm (ADVI) [7]. Briefly, when performing ADVI, our target posterior distribution $\pi\left(\theta \mid \underline{\underline{R}}\right)$, where $\theta = \left(\underline{s}^M, \underline{\bar{s}}_T, \underline{\underline{F}}\right)$, is replaced by an approximate posterior distribution $q_\phi(\theta)$, where $\phi$ fully parametrizes the approximate distribution. As further explained in Section A in S1 Text, the numerical integration problem is replaced by an optimization problem of the form

$$q_\phi^*(\theta) = \min_\phi D_{KL}\left(q_\phi(\theta) \,\|\, \pi\left(\theta \mid \underline{\underline{R}}\right)\right), \tag{19}$$

where $D_{KL}$ is the Kullback-Leibler divergence. In other words, the complicated high-dimensional numerical integration problem is transformed into a much simpler problem of finding the value of the parameters $\phi$ such that Eq 19 is satisfied as best as possible within some finite computation time. Although computing Eq 19 requires the very posterior distribution we are trying to approximate, $\pi\left(\theta \mid \underline{\underline{R}}\right)$, it can be shown that maximizing the so-called evidence lower bound (ELBO) [9], which is equivalent to minimizing the variational free energy [10], is mathematically equivalent to performing the optimization prescribed by Eq 19. We direct the reader to Section A in S1 Text for a short primer on variational inference.
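
The optimization in Eq 19 can be illustrated with a deliberately simple toy problem. In the sketch below, the "posterior" is assumed to be a known one-dimensional Gaussian, so the Kullback-Leibler divergence to a Gaussian approximation $q_\phi$ with $\phi = (m, \omega)$ (mean and log standard deviation) has a closed form and can be minimized by plain gradient descent. This is not ADVI itself: ADVI targets an unknown posterior, optimizes the ELBO with stochastic gradient estimates, and obtains the gradients by automatic differentiation. All names and values are hypothetical.

```python
import math

def kl_gaussians(m, s, mu_p, sd_p):
    """Closed-form KL( Normal(m, s) || Normal(mu_p, sd_p) )."""
    return math.log(sd_p / s) + (s ** 2 + (m - mu_p) ** 2) / (2 * sd_p ** 2) - 0.5

# Toy target standing in for the posterior (assumed known here).
mu_p, sd_p = 0.5, 0.5

# Variational parameters phi = (m, omega); the std is exp(omega) so it stays positive.
m, omega = 0.0, 0.0
learning_rate = 0.1

for _ in range(500):
    s = math.exp(omega)
    grad_m = (m - mu_p) / sd_p ** 2          # d KL / d m
    grad_omega = -1.0 + s ** 2 / sd_p ** 2   # d KL / d omega (chain rule through exp)
    m -= learning_rate * grad_m
    omega -= learning_rate * grad_omega

fitted_sd = math.exp(omega)
final_kl = kl_gaussians(m, fitted_sd, mu_p, sd_p)
```

Because the approximating family contains the target in this toy case, the divergence can be driven to zero and the fitted parameters match the target. For the posterior in Eq 16 the approximation is generally not exact, and the optimization only returns the closest member of the chosen family.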

This work is accompanied by the Julia library BarBay.jl, which uses implementations of both MCMC-based integration and ADVI optimization available within the Julia ecosystem [11] to numerically approximate the solution of Eq 17.

Inference on a single dataset

To assess the inference pipeline performance, we applied it to a simulated dataset with known ground truth relative fitness values (see Section D in S1 Text for details on the simulation). Fig 2(A) shows the structure of the synthetic dataset. The majority of barcodes of interest (faint color lines) are adaptive compared to the neutral barcodes ($s^{(m)} > 0$). Although the barcode frequency trajectories look relatively smooth, our fitness model requires the computation of the log frequency ratio between adjacent time points as derived in Eq 4. Fig 2(B) shows such data transformation where we can better appreciate the observational noise input into our statistical model. This noise is evident for the darker lines representing the neutral barcodes since all of these lineages are assumed to be identically distributed.

Fig 2. Single dataset inference.

(A) Frequency trajectories that represent the raw data going into the inference. (B) Log frequency ratio between two adjacent time points used by the inference pipeline. Darker lines represent the neutral barcodes. These transformed data are much more noisy than the seemingly smooth frequency trajectories. (C) Examples of the posterior predictive checks for all neutral lineages (upper left panel) and a subset of representative mutant lineages. Shaded regions represent the 95%, 68%, and 5% credible regions for the data. The reported errors above the plot represent the 68% credible region on the mutant relative fitness marginal distribution. (D) Comparison between the ground truth fitness value from the logistic-growth simulation and the inferred fitness value. Gray error bars represent the 68% posterior credible region for the relative fitness values. (E) The empirical cumulative distribution function (ECDF) for the absolute z-score value of the ground truth parameter value within the inferred fitness posterior distribution.

To visualize the performance of our inference pipeline in fitting our fitness model to the observed data, we compute the so-called posterior predictive checks (PPC). In short, the PPC consists of repeatedly generating synthetic datasets in agreement with the inference results. In other words, we use the resulting parameter values from the ADVI inference to generate possible datasets in agreement with the inferred values (see Section C in S1 Text for further details on these computations). Fig 2(C) shows these results for all neutral lineages (upper left corner plot) and a few representative non-neutral barcodes. The different color shades represent the 95%, 68%, and 5% credible regions, i.e., the regions where we expect to find the data with the corresponding probability. In terms of a parameter, the X% credible region is the interval where we expect the true parameter value to lie with X% probability.
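
The logic of a posterior predictive check can be sketched in a few lines. The example below is a hedged illustration, not the computation described in Section C of S1 Text: the posterior samples are stand-ins drawn from arbitrary Gaussians, and the Gaussian likelihood for the log frequency ratio (centered at $s^{(m)} - \bar{s}_t$ as in Eq 4) is an assumption made here. For each posterior draw, one replicated data point is generated, and the credible regions are read off as quantiles of the replicated values.

```python
import random

def quantile(sorted_vals, q):
    """Empirical quantile (nearest rank) of a pre-sorted list."""
    idx = int(round(q * (len(sorted_vals) - 1)))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, idx))]

def credible_interval(samples, mass):
    """Central interval containing the requested probability mass."""
    ordered = sorted(samples)
    tail = (1.0 - mass) / 2.0
    return quantile(ordered, tail), quantile(ordered, 1.0 - tail)

rng = random.Random(0)
n_draws = 4000

# Hypothetical stand-ins for posterior samples returned by the inference.
s_draws = [rng.gauss(0.5, 0.05) for _ in range(n_draws)]         # relative fitness
sbar_draws = [rng.gauss(0.2, 0.02) for _ in range(n_draws)]      # population mean fitness
sigma_draws = [abs(rng.gauss(0.1, 0.01)) for _ in range(n_draws)]  # likelihood noise

# One replicated log frequency ratio per posterior draw (assumed Normal likelihood).
replicated = [rng.gauss(s - sb, sig) for s, sb, sig in zip(s_draws, sbar_draws, sigma_draws)]

bands = {mass: credible_interval(replicated, mass) for mass in (0.95, 0.68, 0.05)}
```

Plotting the observed log frequency ratio against `bands` for every pair of adjacent time points gives the shaded regions of a PPC figure. An observation that repeatedly falls outside the 95% band would signal that the model does not describe the data well.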

The main advantage of our method is the natural interpretability of these credible regions: an X% credible region is the region of parameter space where, given our statistical model, our prior information, and the observed experimental data, we expect the actual value of the parameter to lie with X% probability. Bayesian methods thus avoid common misconceptions associated with the construction of frequentist confidence intervals [12].

To capture the global performance of the model, Fig 2(D) compares the known ground truth with the inferred relative fitness value for all barcodes of interest. There is an excellent degree of correspondence between these values, with the error bars representing the 68% credible region for the parameter value crossing the identity line for most barcodes. This latter point is made clear with Fig 2(E) where ≈90% of ground truth fitness values fall within one standard deviation of the mean in the inferred posterior distributions.
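
The summary statistic behind an ECDF panel such as Fig 2(E) is simple to compute. The sketch below assumes that each posterior is summarized by its mean and standard deviation and uses made-up numbers for four barcodes; the function names are hypothetical.

```python
def abs_z_scores(truth, post_mean, post_std):
    """|ground truth - posterior mean| in units of the posterior standard deviation."""
    return [abs(t - m) / s for t, m, s in zip(truth, post_mean, post_std)]

def ecdf(values, x):
    """Empirical cumulative distribution function: fraction of values <= x."""
    return sum(v <= x for v in values) / len(values)

# Made-up numbers for four barcodes (illustration only).
truth = [0.10, 0.50, 0.80, 1.20]
post_mean = [0.12, 0.45, 0.80, 1.00]
post_std = [0.05, 0.04, 0.03, 0.10]

z = abs_z_scores(truth, post_mean, post_std)
fraction_within_one_std = ecdf(z, 1.0)   # 0.5 for these toy numbers
```

Evaluating `ecdf(z, x)` over a range of `x` values traces the full curve; the value at `x = 1` is the fraction of barcodes whose ground truth lies within one posterior standard deviation of the posterior mean.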

Fitness inference on multiple environments

The fitness model in Eq 3 relates nuisance parameters such as the population mean fitness and the barcode frequencies to the relative fitness parameter we want to infer from the data. These dependencies imply that uncertainty on the estimates of these nuisance parameters influences the inference of the relevant parameters. For example, imagine a scenario where the neutral lineages data were incredibly noisy, leading to poor estimates of the population mean fitness values $\underline{\bar{s}}_T$. Since the relative fitness of any non-neutral barcode $s^{(m)}$ is determined with respect to these neutral barcodes, not accounting for the lack of precision in the value of the population mean fitness would result in misleading estimates of the accuracy with which we determine the value of the parameter we care about. Thus, propagating these sources of uncertainty in nuisance parameters is vital to generate an unbiased estimate of the relevant information we want to extract from the data. One of the benefits of Bayesian methods is the intrinsic error propagation embedded in the mathematical framework. For our previous example, the uncertainty on the value of the population mean fitness values is propagated to the relative fitness of a non-neutral barcode since we defined a joint posterior distribution over all parameters as fully expressed in Eq 16.

This natural error propagation can help us with the experimental design schematized in Fig 3(A). Here, rather than performing growth-dilution cycles in the same environment, the cells are diluted into a different environment. Thus, the uncertainty on the fitness estimate for the previous environment must be propagated to that of the next one. To validate the extension of our statistical model to this scenario, Fig 3(B) shows the trajectory of the log frequency ratios between adjacent time points. The different colored regions correspond to the different environments. For this simulation, the growth rate of Environment 2 was set to be, on average, half of the average growth rate in Environment 1. Equivalently, the growth rate in Environment 3 was set to be, on average, twice the average growth rate in Environment 1. Fig 3(C)–3(E) show the correspondence between the simulation ground truth and the inferred fitness values, where the error bars represent the 68% credible region. Fig 3(F) summarizes the performance of our inference pipeline by showing the empirical cumulative distribution functions for the absolute value of the ground truth fitness value z-score within the posterior distribution. This plot shows that, overall, ≈75% of inferred mean values fall within one standard deviation of the ground truth. For completeness, Fig 3(G) shows the posterior predictive checks for a few example barcodes.

Fig 3. Multi-environment fitness inference.

(A) Schematic of the simulated experimental design where growth-dilution cycles are performed in a different environment for each cycle. (B) Log frequency ratios between adjacent time points. Darker lines represent the neutral barcodes. The colors in the background demarcate the corresponding environment, matching colors in (A). Environment 2 is set to have, on average, half the growth rate of environment 1. Likewise, environment 3 is set to have, on average, twice the growth rate of environment 1. (C-E) Comparison between the ground truth fitness value from the logistic-growth simulation and the inferred fitness value for each environment. Gray error bars represent the 68% posterior credible region. (F) The empirical cumulative distribution function (ECDF) for the absolute z-score value of the ground truth parameter value within the inferred fitness posterior distribution for all fitness values (black line) and each environment individually (color lines). (G) Examples of the posterior predictive checks for all neutral lineages (upper left panel) and a subset of representative mutant lineages. Shaded regions surrounding the data represent the 95%, 68%, and 5% credible regions for the data. The reported errors above the plot represent the 68% credible region on the mutant relative fitness marginal distribution. Background colors match those of (A).

Accounting for experimental replicates via hierarchical models

Our inference pipeline can be extended to account for multiple experimental replicates via Bayesian hierarchical models [13]. Briefly, when accounting for multiple repeated measurements of the same phenomenon, there are two extreme approaches to the data analysis. On the one hand, we can treat each measurement as entirely independent, losing the power to utilize multiple measurements when trying to learn a single parameter. This can negatively impact the inference since, in principle, the value of our parameter of interest should not depend on the particular experimental replicate in question. Moreover, this approach does not allow us to properly “combine” the uncertainties in both experiments when performing the inference. On the other hand, we can pool all data together and treat our different experiments as a single measurement with higher coverage. However, by doing so, we lose all information about experiment-to-experiment variability due to intrinsic biological variability and environmental fluctuations, such as small temperature fluctuations, that we cannot control. In this sense, combining the datasets defeats the purpose of performing multiple measurements to account for this variability when extracting the relevant information from the observations.

Hierarchical models present a middle ground between these extremes. First, hierarchical models rely on the definition of so-called hyper-parameters that capture the parametric inference we are interested in; for this inference problem, we have a hyper-fitness value $\theta^{(m)}$ for each non-neutral barcode. Second, each experiment draws randomly from the distribution of this hyper-parameter, allowing for subtle variability between experiments to be accounted for; in the present inference pipeline, each experimental replicate gets assigned a local fitness value $s_i^{(m)}$, where the extra sub-index indicates the $i$-th experimental replicate. Conceptually, we can think of the local fitness for replicate $i$ as being sampled from a distribution that depends on the value of the global hyper-fitness value, i.e., $s_i^{(m)} \sim \pi_{\theta^{(m)}}$, where the subindex $\theta^{(m)}$ indicates the distribution’s parametric dependence on the hyper-fitness value. This way of interpreting the connection between the distribution $\pi_{\theta^{(m)}}$ and the local fitness implies that a large replicate-to-replicate variability would lead to a broad hyper-fitness distribution, implying a large uncertainty when determining the parameter that characterizes the overall relative fitness. We point the reader to Section B.4 in S1 Text for the full definition of the hierarchical model used in this section. Importantly, as schematized in Fig 4(A), the influence between different experimental replicates runs both ways. First, the data from one experimental replicate ($\underline{\underline{R}}_k^M$ in the diagram) informs all local fitness values via the global hyper-fitness (upper panel in Fig 4(A)). Second, the local fitness value is informed by the data from all experimental replicates via the same global hyper-fitness parameter (lower panel in Fig 4(A)).

Fig 4. Hierarchical model on experimental replicates.

(A) Schematic depiction of the interactions between local fitness values s_kM through the global hyper-fitness value θ_M for K hypothetical experimental replicates. The upper diagram shows how the data from replicate k informs all local fitness values via the hyper-fitness parameter. The lower panel shows the reverse, where all other datasets inform the local fitness value. (B-C) Simulated replicate datasets with 900 barcodes of interest and 100 neutral lineages. (D) Comparison between the simulation ground truth hyper-fitness and each replicate ground truth fitness. The scatter between parameters captures experimental batch effects. (E) Examples of the posterior predictive checks for all neutral lineages (upper left panels) and a subset of representative mutant lineages. Shaded regions from light to dark represent the 95%, 68%, and 5% credible regions. (F-G) Comparison between the simulation’s ground truth hyper-fitness (F) and replicate fitness (G) values and the inferred parameters. Gray error bars represent the 68% posterior credible region. (H-I) The empirical cumulative distribution function (ECDF) for the absolute z-score value of the ground truth parameter value within the inferred hyper-fitness posterior distribution (H) and replicate fitness (I).

To test the performance of this model, we simulated two experimental replicates with 1000 unique barcodes (see Fig 4(B) and 4(C)) where we randomly sampled a ground truth hyper-fitness value θ(m) for each barcode. We sampled a variation from this hyper-fitness value for each experimental replicate si(m) to capture experimental batch effects. Fig 4(D) shows the relationship between hyper-fitness and replicate fitness values for this simulation. The spread around the identity line represents the expected batch-to-batch variation. The posterior predictive check examples in Fig 4(E) show that the hierarchical model can correctly fit the data for each experimental replicate. Furthermore, Fig 4(F) and 4(G) show a high correlation between the ground truth and the inferred fitness values. The empirical cumulative distribution functions shown in Fig 4(H) and 4(I) reveal that for ≈75% of the non-neutral barcodes, the ground truth hyper-fitness values fall within one standard deviation from the mean value in the posterior distributions.
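The z-score diagnostic summarized in Fig 4(H) and 4(I) can be sketched as follows. This is a toy calculation, not the paper's analysis code: the posterior means and standard deviations are synthetic stand-ins generated under the assumption of perfectly calibrated Gaussian posteriors, and the function name `abs_zscore_ecdf` is hypothetical.

```python
import numpy as np


def abs_zscore_ecdf(truth, post_mean, post_std):
    """Return sorted |z|-scores of the truth and their ECDF values."""
    z = np.abs((truth - post_mean) / post_std)
    z_sorted = np.sort(z)
    ecdf = np.arange(1, z.size + 1) / z.size
    return z_sorted, ecdf


rng = np.random.default_rng(0)
n = 900  # number of non-neutral barcodes, as in the simulated datasets

# Synthetic ground truth and ASSUMED Gaussian posteriors of known width.
truth = rng.normal(0.0, 1.0, size=n)
post_std = np.full(n, 0.05)
post_mean = truth + rng.normal(0.0, post_std)

z_sorted, ecdf = abs_zscore_ecdf(truth, post_mean, post_std)

# Fraction of barcodes whose ground truth lies within one posterior
# standard deviation of the posterior mean.
frac_within_one_sd = np.mean(z_sorted <= 1.0)
print(frac_within_one_sd)
```

For a calibrated Gaussian posterior, this fraction is expected to be close to 0.68, which gives a reference point when reading the ECDF curves.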

As shown in Fig 5, the structure imposed by the hierarchical model schematized in Fig 4(A), where we explicitly account for the connection between experimental replicates, can improve the quality of the inference. Inferred fitness values between experimental replicates exhibit a stronger degree of correlation in the hierarchical model (Fig 5(A)) compared to conducting inference on each replicate independently (Fig 5(B)). Moreover, when comparing the inferred hyper-fitness values (the objective parameter when performing multiple experimental measurements), the hierarchical model outperforms averaging the independent experimental replicates, as shown in Fig 5(C) and 5(D).

Fig 5. Comparison between hierarchical model and single dataset model.

(A-B) Comparison of inferred fitness values between experimental replicates when fitting a hierarchical model (A) or independently fitting each dataset (B). Gray error bars represent the 68% credible regions. (C) Comparison between the ground truth hyper-fitness value and the inferred parameters. The blue dots show the inferred hyper-fitness values when assuming a hierarchical model. Gray error bars show the 68% credible region for this inference. The yellow dots show the average of the mean inferred fitness values for the two experimental replicates. No error bars are shown for these, as it is inappropriate to compute one with two data points per non-neutral barcode. (D) Empirical cumulative distribution function (ECDF) of the absolute difference between the inferred mean and the ground truth hyper-fitness.

Accounting for multiple barcodes per genotype via hierarchical models

Hierarchical models can also capture another experimental design in which multiple barcodes map to the same or an equivalent genotype. As we will show, this many-to-one mapping can improve the inference compared to the extreme cases of inferring the fitness of each barcode independently or pooling the data of all barcodes mapping to a single genotype. As schematized in Fig 6(A), a small modification of the base model allows us to map the structure of our original model to that of a hierarchical model with a fitness hyperparameter vector θ_G, where G is the number of genotypes in the dataset. This analysis pipeline assumes a known barcode-to-genotype mapping to assign each barcode to the corresponding hyper-fitness parameter uniquely.

Fig 6. Hierarchical model for multiple barcodes per genotype.

(A) Schematic depiction of the hierarchical structure for multiple barcodes mapping to a single genotype. A set of barcodes mapping to an equivalent genotype map to “local” fitness values s(b) that are connected via a hyper-fitness parameter for the genotype θ(g). (B) Simulated dataset with 100 neutral lineages and 900 barcodes of interest distributed among 90 genotypes. (C-E) Comparison between the inferred and ground truth fitness values for a hierarchical model (C), a model where each barcode is inferred independently (D), and a model where barcodes mapping to the same genotype are pooled together (E). Gray error bars represent the 68% credible regions. (F) Empirical cumulative distribution function (ECDF) of the absolute difference between the inferred mean and the ground truth fitness values for all three models. (G) Examples of the posterior predictive checks for all neutral lineages (upper left panels) and a subset of representative mutant lineages. Shaded regions from light to dark represent the 95%, 68%, and 5% credible regions.

Fig 6(B) shows a single experimental replicate in which 90 genotypes were assigned a random number of barcodes (drawn from a multinomial distribution with a mean of ten barcodes per genotype) for a total of 900 non-neutral barcodes. To assess the performance of the hierarchical model proposed in Fig 6(A), we performed inference using this hierarchical model, as well as the two extreme cases: ignoring the connection between the barcodes belonging to the same genotype, which is equivalent to performing inference using the model presented in Fig 2(A) over the individual barcodes, or pooling the data of all barcodes belonging to the same genotype into a single count, which is equivalent to performing inference using the model presented in Fig 2(A) over the pooled barcodes. Fig 6(C)–6(E) show the comparison between the simulation ground truth and the inferred values for these three cases. Not only do the hierarchical model results show higher degrees of correlation with the ground truth, but the error bars (representing the 68% credible regions) are smaller, meaning that the uncertainty in the estimate of the parameter we care about decreases when using the hierarchical model. The improvement in the prediction can be seen in Fig 6(F), where the empirical cumulative distribution function of the absolute difference between the mean inferred value and the simulation ground truth is shown for all three inference models. The hierarchical model’s curve ascends more rapidly, showing that, in general, the inferred values are closer to the ground truth. For completeness, Fig 6(G) shows some examples of how the hierarchical model can capture the raw log-frequency count observations.
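To give some intuition for why sharing information across barcodes of the same genotype can help, the following toy sketch compares per-barcode estimates against a precision-weighted genotype estimate. It is a simplified Gaussian stand-in, not the count-based model of the paper: the noise levels, the fixed ten barcodes per genotype, the known variances, and all names are assumptions, and the precision-weighted mean is only the textbook normal-normal analog of a hierarchical estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genotypes, n_per_genotype = 90, 10  # fixed barcode number (simplification)
tau = 0.1                             # assumed barcode-to-barcode spread

# Ground truth genotype fitness and per-barcode "local" fitness.
theta_true = rng.normal(0.0, 1.0, size=n_genotypes)
s_true = rng.normal(theta_true[:, None], tau,
                    size=(n_genotypes, n_per_genotype))

# Noisy per-barcode fitness measurements with heterogeneous, known noise.
sigma = rng.uniform(0.05, 0.5, size=(n_genotypes, n_per_genotype))
y = rng.normal(s_true, sigma)

# Independent analysis: each barcode alone estimates its genotype fitness.
err_indep = np.abs(y - theta_true[:, None]).ravel()

# Hierarchical analog: precision-weighted mean over a genotype's barcodes,
# with weights 1 / (tau^2 + sigma^2) from the normal-normal model.
w = 1.0 / (tau**2 + sigma**2)
theta_hier = (w * y).sum(axis=1) / w.sum(axis=1)
err_hier = np.abs(theta_hier - theta_true)


def ecdf(x):
    """Sorted values and ECDF heights, as plotted in an ECDF panel."""
    xs = np.sort(x)
    return xs, np.arange(1, xs.size + 1) / xs.size


x_indep, F_indep = ecdf(err_indep)
x_hier, F_hier = ecdf(err_hier)
print(np.median(err_indep), np.median(err_hier))
```

In this toy setting the ECDF of the weighted estimate rises faster than that of the single-barcode estimates, which mirrors the qualitative reading of an ECDF comparison such as the one in Fig 6(F).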

Discussion

Experimental evolution of microbial systems has dramatically advanced our understanding of the basic principles of biological evolution [14]. From questions related to the optimal fine-tuning of gene expression programs [15], to the dimensionality, geometry, and accessibility of the adaptive fitness landscape explored by these rapidly adapting populations [4, 16], to the emergence of eco-evolutionary dynamics in a long-term evolution experiment [17]; for all of these and other cases, the microbial experimental platform combined with high-throughput sequencing has been essential to tackling these questions with empirical data. This exciting research area promises to improve as new culturing technologies [18] and more complex lineage barcoding schemes [2, 19] are adopted.

For this data-heavy field, properly accounting for the uncertainty in parameters inferred from experiments is vital to ensure the conclusions drawn are reliable. Bayesian statistics presents a principled way to quantify this uncertainty systematically [20]. Moreover, Bayesian analysis offers a more natural way to interpret the role that probability theory plays when performing data analysis compared to the often-misinterpreted frequentist methods [21]. Nevertheless, the technical challenges associated with Bayesian analysis have limited its application. This is set to change as the misuse of frequentist concepts such as the p-value receives more attention [22]. Moreover, advances in numerical methods such as Hamiltonian Monte Carlo [6] and variational inference [7] allow complex Bayesian models to be fit to empirical data.

In this paper, we present a computational pipeline to analyze lineage-tracking time-series data for massively parallel competition assays. More specifically, we fit a Bayesian model to infer the fitness of multiple genotypes relative to a reference [3, 4]. The proposed model accounts for multiple sources of uncertainty with proper error propagation intrinsic to Bayesian methods. To scale the inference pipeline to large datasets with >10,000 barcodes, we use the ADVI algorithm [7] to fit a variational posterior distribution. The main difference between our method and previous inference pipelines, such as [23], is that the present analysis provides interpretable errors on the inferred fitness values. The reported uncertainty intervals, known as credible regions, can be formally interpreted as capturing the corresponding probability mass of finding the true value of the parameter given the model, the prior information, and the data. Furthermore, minor modifications to the structure of the statistical model presented in this work allow for the analysis of different experimental designs, such as growth-dilution cycles in different environments, joint analysis of multiple experimental replicates of the same experiment via hierarchical models, and a hierarchical model for multiple barcodes mapping to equivalent genotypes. We validate our analysis pipeline on simulated datasets with known ground truth, showing that the model fits the data adequately, capturing the ground truth parameters within the posterior distribution.
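As a concrete illustration of how a credible region such as the 68% regions reported in the figures can be read off a posterior, the sketch below computes equal-tailed regions from posterior samples. The Gaussian samples are synthetic placeholders for a fitted posterior, and the helper name `credible_region` is hypothetical; the equal-tailed construction is one common convention and is assumed here, not taken from the paper.

```python
import numpy as np


def credible_region(samples, mass=0.68):
    """Equal-tailed credible region containing `mass` of the samples."""
    tail = 100.0 * (1.0 - mass) / 2.0
    lower, upper = np.percentile(samples, [tail, 100.0 - tail])
    return lower, upper


rng = np.random.default_rng(7)

# Placeholder posterior samples for one relative fitness value.
samples = rng.normal(loc=0.3, scale=0.05, size=20_000)

lo68, hi68 = credible_region(samples, mass=0.68)
lo95, hi95 = credible_region(samples, mass=0.95)
print(lo68, hi68)
print(lo95, hi95)
```

For a Gaussian posterior, the 68% region spans roughly one standard deviation on either side of the mean, which is why this value is a common choice for error bars.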

It is important to highlight some of the consequences of the general experimental design and the implicit assumptions within the proposed statistical model to analyze the resulting data. First, the composition of the population is such that the initial fraction of the population occupied by the barcoded genotypes is small (usually >90% of the initial population is the non-labeled reference strain). This constraint is important as the fitness model used to fit the time series data assumes that the tracked frequencies are ≪ 1. Second, when computing log frequency ratios, we can run into the issue of dividing by zero. This is a common problem when dealing with molecular count data [24]. Our model gets around this issue by assuming that the frequency of any barcode can get arbitrarily close to zero but can never be exactly zero. Therefore, we implicitly assume that no lineage goes extinct during the experiment. Moreover, the statistical model directly accounts for the uncertainty associated with having zero barcode counts, increasing the corresponding uncertainty. Third, the models presented in this paper require the existence of a labeled sub-population of barcoded reference strains. These barcodes help determine the fitness baseline, as every fitness is quantified with respect to this reference genotype. This experimental design constraint facilitates the inference of the population mean fitness since most of the culture (the unlabeled reference genotype) is not tracked. Fourth, the fitness model in Eq 3 does not assume any functional form for the cellular growth dynamics. This description of the changes in the relative frequency of the genotypes can approximate the dynamics for multiple growth forms. For example, in Section E in S1 Text, we apply our inference pipeline to simulations like those presented in [23] with non-logistic growth models. Moreover, Section F in S1 Text reanalyzes experimental data from [4], where previous work showed that the fitness advantage from some of the mutants results from a shorter lag phase at the beginning of the growth cycle [25]. Thus, our method’s inferred relative fitness is a coarse-grained quantity that integrates the differences in fitness across the entire growth cycle as long as frequencies and frequency changes are small enough to be approximated by Eq 3. Finally, the presented statistical model assumes that relative fitness is solely a constant of the environment and the genotype. Future directions of this work could extend the fitness model to properly analyze data with time-varying or frequency-dependent fitness values.
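The role of the barcoded reference lineages as a fitness baseline can be illustrated with a noise-free toy calculation. Since Eq 3 is not reproduced in this section, the sketch assumes an exponential update of the frequencies followed by renormalization; this assumed form, the parameter values, and the function name `simulate_frequencies` are illustrative choices, not the paper's implementation.

```python
import numpy as np


def simulate_frequencies(f0, s, n_cycles, tau=1.0):
    """Deterministic frequency trajectories under an ASSUMED update:
    each lineage grows by exp(s * tau), then frequencies renormalize."""
    f = [np.asarray(f0, dtype=float)]
    for _ in range(n_cycles):
        grown = f[-1] * np.exp(np.asarray(s) * tau)
        f.append(grown / grown.sum())
    return np.array(f)


tau = 1.0
# Lineages: unlabeled reference, barcoded neutral, mutant A, mutant B.
f0 = [0.90, 0.02, 0.04, 0.04]
s = [0.0, 0.0, 0.5, -0.2]

freqs = simulate_frequencies(f0, s, n_cycles=4, tau=tau)

# Log frequency ratios between adjacent time points, one row per cycle.
log_ratio = np.diff(np.log(freqs), axis=0)

# The neutral barcode (s = 0) tracks the population mean fitness term,
# so subtracting its log ratio isolates each mutant's relative fitness.
neutral = log_ratio[:, 1]
s_hat = (log_ratio[:, 2:] - neutral[:, None]) / tau
print(s_hat.mean(axis=0))
```

In this idealized case the relative fitness values are recovered exactly, because the normalization term cancels between mutant and neutral lineages; with real count data, sampling noise and zero counts make the probabilistic treatment described in the text necessary.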

In total, the statistical model presented in this work and the software package accompanying the paper allow for a principled way of quantifying the accuracy with which we can extract relevant parametric information from large-scale multiplexed fitness competition assays. Furthermore, the implementation of Bayesian models and their fitting via automatic differentiation approaches opens the gate to extend this type of formal analysis to the data-rich literature in experimental evolution and to other applications of high-throughput technologies.

Materials and methods

All synthetic data generation and custom scripts used in this work were stored using Git version control. Code for analysis and figure generation can be found on the GitHub repository (https://github.com/mrazomej/bayesian_fitness). The accompanying software package BarBay.jl can be directly installed from the Julia package repository, or via cloning the corresponding GitHub repository (https://github.com/mrazomej/BarBay.jl).

Supporting information

S1 Text. Supplementary materials.

Section A gives a short primer on variational inference. Section B defines the probabilistic models used throughout the main text. Section C details how the validity of the model is computed via posterior predictive checks. Section D explains how the simulated frequency trajectories are generated. Section E compares the inferences of our method with state-of-the-art methods in the literature. Section F reanalyzes experimental data from yeast evolution experiments. Section G details how the computation time scales with the number of barcodes.

(PDF)

pcbi.1011937.s001.pdf (2.9MB, pdf)

Acknowledgments

We would like to thank Griffin Chure and Michael Betancourt for their helpful advice and discussion. We would like to thank Karna Gowda, Spencer Farrell, and Shaili Mathur for critical observations on the manuscript. We are especially thankful to Grant Kinsler for kindly providing raw experimental data as well as lengthy discussions about the state-of-the-art inference method.

Data Availability

This paper is accompanied by a highly documented Julia software library—BarBay.jl (see documentation in https://mrazomej.github.io/BarBay.jl). Furthermore, to ensure transparency with every piece of information presented in this paper, we have made all of the code used in the processing, analysis, and figure generation for this work also publicly available on this paper’s GitHub repository (https://github.com/mrazomej/bayesian_fitness).

Funding Statement

This work was supported by the NIH/NIGMS, Genomics of rapid adaptation in the lab and in the wild (R35GM11816506, MIRA grant, to DP); the NIH, Unravelling mechanisms of tumor suppression in lung cancer (R01CA23434903, to DP); the NIH (PQ4), Quantitative and multiplexed analysis of gene function in cancer in vivo (R01CA23125303, to DP); the NIH, Genetic Determinants of Tumor Growth and Drug Sensitivity in EGFR Mutant Lung Cancer (R01CA263715, to DP); the NIH, Dissecting the interplay between aging, genotype, and the microenvironment in lung cancer (U01AG077922, to DP); the NIH, Genetic dissection of oncogenic Kras signaling (R01CA230025, to DP); the National Science Foundation-Simons Center for Quantitative Biology at Northwestern University and the Simons Foundation grant (597491, to MM); and the Chan Zuckerberg Initiative, an advised fund of Silicon Valley Community Foundation (DAF2023-329587, to MM). MRM was supported by the Schmidt Science Fellowship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Levy SF, Blundell JR, Venkataram S, Petrov DA, Fisher DS, Sherlock G. Quantitative Evolutionary Dynamics Using High-Resolution Lineage Tracking;519(7542):181–186.
  • 2. Nguyen Ba AN, Cvijović I, Rojas Echenique JI, Lawrence KR, Rego-Costa A, Liu X, et al. High-Resolution Lineage Tracking Reveals Travelling Wave of Adaptation in Laboratory Yeast;575(7783):494–499.
  • 3. Ascensao JA, Wetmore KM, Good BH, Arkin AP, Hallatschek O. Quantifying the Local Adaptive Landscape of a Nascent Bacterial Community;14(1):248.
  • 4. Kinsler G, Geiler-Samerotte K, Petrov DA. Fitness Variation across Subtle Environmental Perturbations Reveals Local Modularity and Global Pleiotropy of Adaptation;9:1–52.
  • 5. Eddy SR. What Is Bayesian Statistics?;22(9):1177–1178.
  • 6. Betancourt M. A Conceptual Introduction to Hamiltonian Monte Carlo.
  • 7. Kucukelbir A, Tran D, Ranganath R, Gelman A, Blei DM. Automatic Differentiation Variational Inference.
  • 8. Efron B. Bayes’ Theorem in the 21st Century;340(6137):1177–1178.
  • 9. Kingma DP, Welling M. Auto-Encoding Variational Bayes. Available from: http://arxiv.org/abs/1312.6114.
  • 10. Gottwald S, Braun DA. The Two Kinds of Free Energy and the Bayesian Revolution;16(12):e1008420.
  • 11. Ge H, Xu K, Ghahramani Z. Turing: A Language for Flexible Probabilistic Inference. In: Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics. PMLR. p. 1682–1690. Available from: https://proceedings.mlr.press/v84/ge18b.html.
  • 12. Morey RD, Hoekstra R, Rouder JN, Lee MD, Wagenmakers EJ. The Fallacy of Placing Confidence in Confidence Intervals;23(1):103–123.
  • 13. Betancourt MJ, Girolami M. Hamiltonian Monte Carlo for Hierarchical Models. Available from: http://arxiv.org/abs/1312.0906.
  • 14. Kussell E. Evolution in Microbes;42:493–514.
  • 15. Dekel E, Alon U. Optimality and Evolutionary Tuning of the Expression Level of a Protein;436(7050):588–592.
  • 16. Maeda T, Iwasawa J, Kotani H, Sakata N, Kawada M, Horinouchi T, et al. High-Throughput Laboratory Evolution Reveals Evolutionary Constraints in Escherichia Coli;11(1):5970.
  • 17. Good BH, Mcdonald MJ, Barrick JE, Lenski RE, Desai MM. The Dynamics of Molecular Evolution over 60,000 Generations.
  • 18. Jagdish T, Nguyen Ba AN. Microbial Experimental Evolution in a Massively Multiplexed and High-Throughput Era;75:101943.
  • 19. Yang D, Jones MG, Naranjo S, Rideout WM, Min KHJ, Ho R, et al. Lineage Tracing Reveals the Phylodynamics, Plasticity, and Paths of Tumor Evolution;185(11):1905–1923.e25.
  • 20. Gelman A, Shalizi CR. Philosophy and the Practice of Bayesian Statistics;math.ST(1996):36.
  • 21. VanderPlas J. Frequentism and Bayesianism: A Python-driven Primer; p. 1–9.
  • 22. Nuzzo R. Statistical Errors;506.
  • 23. Li F, Tarkington J, Sherlock G. Fit-Seq2.0: An Improved Software for High-Throughput Fitness Measurements Using Pooled Competition Assays.
  • 24. Lovell DR, Chua XY, McGrath A. Counts: An Outstanding Challenge for Log-Ratio Analysis of Compositional Data in the Molecular Biosciences;2(2):lqaa040.
  • 25. Li Y, Venkataram S, Agarwala A, Dunn B, Petrov DA, Sherlock G, et al. Hidden Complexity of Yeast Adaptation under Simple Evolutionary Conditions;28(4):515–525.e6.
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011937.r001

Decision Letter 0

Sergei Maslov, Zhaolei Zhang

15 Nov 2023

Dear Dr. Razo-Mejia,

Thank you very much for submitting your manuscript "Bayesian inference of relative fitness on high-throughput pooled competition assays" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

Two major issues raised by several reviewers are that the algorithm has not been tested on experimental data and has not been compared to currently available algorithms. Addressing at least one of these issues is required for publication.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Sergei Maslov

Academic Editor

PLOS Computational Biology

Zhaolei Zhang

Section Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Review is uploaded as an attachment

Reviewer #2: By tracking lineages frequency via DNA barcodes in competitive cultures, it is possible to measure microbial fitness and phenotypic diversity on a large scale. However, many biological and non-biological noises will have a significant impact on the relationship between barcodes and phenotypes (fitness). Accordingly, the authors of this manuscript developed a Bayesian model-based pipeline that takes into account some of these uncertainties in the experimental setup described above. Additionally, this model was applied to analyze simulated multi-environment and replicate (by batches, or barcodes of the same genotype) experiments. As a whole, the manuscript has a solid theoretical basis and should be of general interest to the field. I believe that additional work is necessary before it can be published. My concerns are listed below.

== MAJOR ==

1. The manuscript contains a little bit too many technical details, so that readers might become distracted from the main logic flow. As an example, I do not believe it is necessary to comment on frequentist confidence intervals versus Bayesian credible regions starting at line 247. However, I lack a strong statistical background to provide a more subjective or comprehensive assessment or to say that all the technical details are trivial/unnecessary. Other reviewers with expertise in that area may be able to provide better suggestions. Nonetheless, moving some details to supplementary to better emphasize the main logic seems reasonable.

2. There is no description of the detailed simulation procedure in the main text or in SI. This would make reproducing the results difficult. The reader must also understand the factors that have been taken into consideration during the simulation in order to determine (i) whether the results presented are related to the reader's own experimental environment or not, and (ii) whether the performance assessment based on the simulation is just expected since the simulation entails exactly the same type of noise as those considered in the inference model. I understand from the manuscript that (ii) is exactly the case. I am not saying this approach is wrong, but it certainly needs clarification. Even better, sometimes the simulation may include some additional noise or perturbation that the inference model is unable to account for, so that one can determine whether the inference is robust against those specific additional noises or perturbations.

3. There is a disconnect between the results and the actual experimental data. In all tests, inferences are drawn based on simulated data. I understand that it gives the authors the ground truth fitness. Nevertheless, every simulation is based on some simplification or assumption that may not be valid in reality. It is difficult to determine the relevance of the method without actually relating the model to real data. I would like to provide two more specific comments in this regard.

3.1. The authors seem to assume some specific composition of the competing population (e.g. most strains are not barcoded), as well as some specific distribution of mutational fitness effects (s) (most barcodes are slightly more adaptive than wild-type). In my opinion, both assumptions are frequently violated. For example, in https://www.science.org/doi/10.1126/science.aae0568, https://www.nature.com/articles/nature17995 and https://academic.oup.com/mbe/article/39/5/msac086/6575838, all genotypes are barcoded. In the commonly used "deep mutational scanning" assays, where only proximal mutations are tested, most mutants (containing only a few simple mutations) have fitness very similar to that of the wild-type. Can the model accurately estimate their fitness? What is the significance of their differences with wild-type?

3.2. There should be some analysis based on actual experimental data (such as those in the three papers mentioned above). A straightforward test would be to determine whether the Bayesian-inferred fitness shows better between-replicate-correlations than the more naively estimated fitness reported in those papers. I am certain that the author can come up with more analyses relating the actual experimental data to the accuracy of their methodology.

==Minor==

1. The schematic diagram in Figure 1 could use a lot more details, especially regarding the source of uncertainties (at least those have been considered in the model)

2. Figure 2D and E. This is related to major comment 2. Please explain how the “ground truth” is chosen/defined and used in the simulation. Also, why is “68%” used? The number just seems weird.

3. Figure 6F. It would be more professional to connect the two lines reaching y=1.0 early on, to the top right corner of the plot area (ECD = 1 across that range).

4.Line 310, “This loses the subtle differences due to biotic and abiotic batch effects, effectively halving the data that goes into our inference problem” I don’t understand this, please elaborate.

5. I am unable to see the name of the particular section whenever the author refers to one (for example, line 53, "... See Section ?? for details...", also in line 56, etc.). I have tried two PCs (both using Acrobat on Win11). Are there invisible symbols due to PDF transformation?

Reviewer #3: In Razo Mejia et al, the authors describe a Bayesian framework for the analysis of barcode fitness assays. The authors show how general the framework can be, and test the validity of the inference on synthetic data. I really enjoyed the reading and seeing this type of inference approach finally seeing the light is a blessing to the community. I particularly enjoyed the section dealing with hyperparameters and how it can be used for interesting assays. While it is not the first time barcode fitness inferences have taken a Bayesian spin, it is the first to do so with a true Bayesian perspective on many different parameters. Unfortunately, I am not familiar with Julia (and don’t have it installed) and so cannot evaluate the software within the allotted review timeframe as the documentations are fairly lengthy.

The only major issue I have is that I am unable to judge how much this framework improves on the inference compared to the general approaches the community have undertaken. How poor are the current approaches implemented by the community? Or is it less so about the fitness estimates in the simple case (single environment, assuming replicates are same fitnesses), and more about the fact that the approach can be expanded? Or is it just the credible intervals on fitness values (augmented by priors)? Because my understanding is that almost all the ad hoc approaches taken currently by the community (by the lab of Levy, Petrov, Sherlock, Desai, Dunham, Gresham, or even earlier by the genomic era through the barcoded yeast deletion collection). are very adequate for their purposes. It would have been interesting and borderline necessary to compare at least the simple approaches to this. If it is simply differences between confidence intervals and credible intervals, then it would be ideal if the authors show distinct differences in their output notwithstanding the philosophical differences between them as the differences are usually fairly minimal when priors are not strong.

I have a few minor comments:

1) There are a few missing references to specific sections (e.g. line 53, line 56, but there are at least dozens throughout).

2) I’m very familiar with the experimental setup, but I think the readers will not appreciate why the unlabeled reference strain should be at ~90% of the population in these assays as described in lines 69-75. Many barcode assays have not done this in the literature and as far as I know there have been no reported issues. From my recollection, this is done due to frequency-dependent selection on some key lineages, but there is no evidence that this design resolves this (and frequency-dependent selection is not solved by any inference approach).

3) Is the simulation really sufficient for this work? Without going overboard and testing every single scenarios, it seems overly simple for the power of the proposed approach. I guess I see drift being implemented through the Poisson noise but is the Gaussian noise really adequate for the read frequency if one does not sequence to extremely high coverage? I’m also wondering if systematic noise that all play a role influence the framework at all: such as prolonged lag phase leading to uncertainty on the generations per transfer, exponential jackpotting issues during sequencing that may be poorly estimated from the neutral lineages, day to day variations etc. The main issue is that I feel like all the older approaches can adequately infer fitness for the simulation that was performed.

4) I think the aside debating the fundamental differences between confidence intervals and credible intervals is a bit misplaced. Yet if the authors want to keep this in, then perhaps the frequentist confidence intervals should be made more precise by emphasizing the repeated construction of CIs: “frequentist CIs represent the rangeS (emphasis on plural) of values where X% of ranges …” since upon repeating the experiments the confidence intervals would differ. As is written (in singular ‘range’), it may imply a fixed range of values between repetitions (vs a fixed construction method) and the ‘repetition’ containing the true population parameter would not really make sense.
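
The point in comment 4 about repeated construction can be illustrated with a short simulation. What follows is a minimal sketch and is not taken from the manuscript or from BarBay.jl: the function names, the Gaussian data model, the true mean of 0.5, the sample size, and the normal-approximation interval are all illustrative assumptions. It repeats a hypothetical experiment many times, builds one 95% confidence interval per repetition, and reports the fraction of those intervals that contain the fixed true value.

```python
import random
import statistics


def confidence_interval(sample, z=1.96):
    """Normal-approximation 95% confidence interval for the mean of `sample`."""
    mean = statistics.fmean(sample)
    sem = statistics.stdev(sample) / len(sample) ** 0.5  # standard error of the mean
    return mean - z * sem, mean + z * sem


def coverage(true_mean=0.5, sd=1.0, n=50, n_repeats=2000, seed=0):
    """Repeat the (hypothetical) experiment and build one interval per repetition.

    Returns the fraction of intervals containing `true_mean`, plus the intervals.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    intervals = []
    for _ in range(n_repeats):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        intervals.append(confidence_interval(sample))
    hits = sum(lo <= true_mean <= hi for lo, hi in intervals)
    return hits / n_repeats, intervals


frac, intervals = coverage()
print(f"fraction of intervals covering the true mean: {frac:.3f}")
print("first three intervals:", [(round(lo, 2), round(hi, 2)) for lo, hi in intervals[:3]])
```

Each repetition yields a different range, and it is the construction procedure, applied over many repetitions, that covers the true value close to the nominal 95% of the time. No single interval carries that probability on its own, which is the distinction the plural "ranges" is meant to convey.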

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology, see: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Attachment

Submitted filename: ploscompbio_review_11122023.pdf

pcbi.1011937.s002.pdf (23.8KB, pdf)
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011937.r003

Decision Letter 1

Sergei Maslov, Zhaolei Zhang

21 Feb 2024

Dear Dr. Razo-Mejia,

We are pleased to inform you that your manuscript 'Bayesian inference of relative fitness on high-throughput pooled competition assays' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Sergei Maslov

Academic Editor

PLOS Computational Biology

Zhaolei Zhang

Section Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: Thanks to the authors for a very thorough and thoughtful response and revision. The analysis of the Kinsler dataset greatly adds to the paper, and I also appreciate their clarification of the flexibility of the model to handle non-logistic growth. I have no further comments.

Reviewer #2: In general, I am satisfied with the authors' response, especially after they applied their method to empirical datasets. My congratulations go out to the authors for their nice work.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011937.r004

Acceptance letter

Sergei Maslov, Zhaolei Zhang

11 Mar 2024

PCOMPBIOL-D-23-01682R1

Bayesian inference of relative fitness on high-throughput pooled competition assays

Dear Dr Razo-Mejia,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Lilla Horvath

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. Supplementary materials.

    Section A gives a short primer on variational inference. Section B defines the probabilistic models used throughout the main text. Section C details how the validity of the model is computed via posterior predictive checks. Section D explains how the simulated frequency trajectories are generated. Section E compares the inferences of our method with state-of-the-art methods in the literature. Section F reanalyzes experimental data from yeast evolution experiments. Section G details how the computation time scales with the number of barcodes.

    (PDF)

    pcbi.1011937.s001.pdf (2.9MB, pdf)
    Attachment

    Submitted filename: ploscompbio_review_11122023.pdf

    pcbi.1011937.s002.pdf (23.8KB, pdf)
    Attachment

    Submitted filename: ploscompbio_reviews.pdf

    pcbi.1011937.s003.pdf (1.1MB, pdf)

    Data Availability Statement

    This paper is accompanied by a highly documented Julia software library—BarBay.jl (see documentation in https://mrazomej.github.io/BarBay.jl). Furthermore, to ensure transparency with every piece of information presented in this paper, we have made all of the code used in the processing, analysis, and figure generation for this work also publicly available on this paper’s GitHub repository (https://github.com/mrazomej/bayesian_fitness).


    Articles from PLOS Computational Biology are provided here courtesy of PLOS
