2025 Jul 26;87(9):120. doi: 10.1007/s11538-025-01491-5

Unpacking fitness differences between two invaders in a multispecies context

Tomas Ferreira Amaro Freire 1, Sten Madec 2, Erida Gjini 1
PMCID: PMC12296887  PMID: 40715892

Abstract

Ecosystems are constantly exposed to newcoming strains or species. Which newcomer will be able to invade a resident multi-species community depends on the invader’s relative fitness. Classical fitness differences between two growing strains are measured using the exponential model. Here we complement this approach, developing a more explicit framework to quantify fitness differences between two co-invading strains, based on the replicator equation. By assuming that the resident species’ frequencies remain constant during the initial phase of invasion, we are able to determine the invasion fitness differential between the two strains, which drives growth rate differences post-invasion. We then apply our approach to a critical current global problem: invasion of the gut microbiota by antibiotic-resistant strains of the pathobiont Escherichia coli, using previously-published data. Our results underscore the context-dependent nature of fitness and demonstrate how species frequencies in a host environment can explicitly modulate the selection coefficient between two strains. This mechanistic framework can be augmented with machine-learning algorithms and multi-objective optimization to predict relative fitness in new environments, to steer selection, and design strategies to lower resistance levels in microbiomes.

Supplementary Information

The online version contains supplementary material available at 10.1007/s11538-025-01491-5.

Introduction

Why do we measure fitness effects of mutations? One of the key goals is to link them to underlying evolutionary processes and the particular dynamics of alleles under selection, but also to overarching ecological and environmental processes. A typical way to quantify this is by the selection coefficient $s$. When considering an asexual (haploid) population consisting of two genotypes, a mutant and a wild-type, with respective population sizes (or densities) $N_1$ and $N_2$, and frequencies $p$ and $(1-p)$, under the assumptions of continuous growth and no age structure, we can define the selection coefficient as $s=\frac{d\ln(N_1)}{dt}-\frac{d\ln(N_2)}{dt}$ (Fisher 1958), which has units of time$^{-1}$. The mutant will increase in frequency if $s>0$ and decrease if $s<0$, at a speed determined by $s$. Since the ratio of allelic frequencies is equal to the ratio of population sizes (or densities) of each genotype, we may also write $s=\frac{d}{dt}\ln\left(\frac{p}{1-p}\right)$. In particular, if selection is density-independent and the two genotypes do not interact, then $s=r_m-r_w$ where $r$ is the Malthusian parameter (Fisher 1958) or intrinsic rate of increase of each genotype ($m$, mutant; $w$, wild-type). In practice, $r$ is estimated as the regression slope of log-population size against time in the exponential phase, i.e. at low population density (assuming no Allee effects). Such methodology is widely studied (Chevin 2011) and classically used in biological studies of bacterial evolution (Dykhuizen 1990; Lenski 1991), in models to describe invader dynamics over short periods of time (Hastings 1996; Crooks 2005), and in models to compare fitness of different strains (Wiser and Lenski 2015), with applications ranging from microbial competition studies and experimental evolution, to epidemiology, including assessments of viral variants during the COVID-19 pandemic (Xu 2020; Kucharski 2020; Volz 2021).
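The step from the growth-rate definition of the selection coefficient to its log-odds form can be spelled out as follows, writing the mutant frequency as $p=N_1/(N_1+N_2)$ as in the text. This is a short worked derivation added for illustration, not a new result:

```latex
\begin{align*}
\frac{p}{1-p} &= \frac{N_1/(N_1+N_2)}{N_2/(N_1+N_2)} = \frac{N_1}{N_2},\\[4pt]
\frac{d}{dt}\ln\!\left(\frac{p}{1-p}\right)
  &= \frac{d\ln N_1}{dt}-\frac{d\ln N_2}{dt} = s .
\end{align*}
```

If $s$ is constant over the observation window, integrating gives $\ln\frac{p(t)}{1-p(t)}=\ln\frac{p(0)}{1-p(0)}+s\,t$, so the log-odds of the mutant change linearly in time with slope $s$.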

Yet, understanding fitness differences between two entities within a complex environment, e.g. an ecosystem, presents more challenges. The cost or selective advantage of a certain mutation is strongly influenced by the environment in which the organism grows, both in its abiotic (for example, nutrient availability) and biotic (interactions with other organisms) dimensions. How organism physiology and bio-processes, in dealing with these components, give rise to phenotypes, and how these phenotypes map to fitness, is a central question in evolutionary biology and ecology, with heterogeneity in selective advantage across environments a focus of modern investigations.

A key scenario of this kind arises when two invaders simultaneously co-invade an ecosystem composed of other species, for example when an antibiotic-resistant and an antibiotic-susceptible bacterial strain enter a multi-species microbiota context (Cardoso et al. 2020). In general, co-invasion, where multiple invading species or strains enter a system simultaneously, is a widespread phenomenon with significant implications in microbiology, epidemiology and agriculture. For instance, co-invasion is commonly observed in agricultural crops and plant communities (Mahuku 2015; Seabloom 2009; Susi 2015), and has been important for understanding the context-dependent nature of antibiotic resistance (Cardoso et al. 2020; Lindon 2024) and pathogen dynamics in epidemiology (Alizon 2013; Wong 2023).

Although it is expected that the ability of different species or strains to invade and establish themselves in a new environment depends on subtle differences in growth rates, which can be influenced by interactions with the resident species (Kurkjian et al. 2021; Litvak and Bäumler 2019; He 2013), a framework to quantify these relationships has not yet been developed. In this report, we present a theory-driven model framework (Madec and Gjini 2020) aimed at quantifying the fitness differences between two co-invading strains in a multi-species context. We apply our framework to a series of co-invasion experimental data with E. coli strains in mice (Cardoso et al. 2020), quantifying fitness costs between drug-resistant and wild-type strains.

The data originate from competitive fitness assays of antibiotic resistance in the mouse gut. The antibiotic-resistant strains used carried common streptomycin (StrR (rpsLK43T)) and rifampicin (RifR (rpoBH526Y)) resistance mutations, and the double-resistant strain carried both mutations (StrR RifR (rpsLK43T rpoBH526Y)). These mutations have been identified in many critical pathogens, such as Mycobacterium tuberculosis and Salmonella spp., and also in pathogenic and commensal E. coli bacteria (Barreto 2009; Hong 2010; Rahmani 2012). The aim of this study was to examine how interspecies interactions in the natural ecosystem comprising the mammalian gut influence the costs of antibiotic resistance in-vivo, specifically, the initial co-invasion dynamics of two strains in mice that had a complex microbiota.

By harnessing this dataset as a prototype for our method, in a proof-of-concept spirit, we establish a methodology to directly link heterogeneity in selective advantage or fitness cost between two invaders to the underlying multi-species composition of the host environment (in this case, microbiota). We first detail the mechanistic approach based on linearization of the replicator model (LR-M), highlighting its strengths and limitations, and secondly we illustrate an approach based on machine learning algorithms, namely the Random Forest (RF-A). Both of these approaches provide tools and insights for determining how resident polymorphic communities (in this case, microbiota) shape initial invasion dynamics between two invaders.

More generally, our research aims to improve the understanding of how the multi-species composition of a host environment influences selective advantage of invaders, and whether it can be used to predict and steer competition in future environments, including for ecosystem control or biomedical applications.

Methods

Co-invasion in a multispecies context

The classical measurement of the selection coefficient for two strains, in a co-invasion scenario, can yield different values depending on the environment. Variations in resources, environmental conditions, and biotic or abiotic factors can all influence the outcome, as the two invaders may respond differently to such gradients. A particular case of environmental variation is a multispecies context, studied, for example, by Cardoso et al. (2020), where variation in co-invasion outcomes can be seen as a final consequence of host microbiota composition. Although this study reported different fitness costs for the same antibiotic-resistant strain relative to the WT strain in different mice and proposed a resource-consumer model, it did not fully close the explanatory loop linking the specific microbiota variation to the variation in selection coefficients. Here, we aim to close this gap, using and highlighting a replicator framework (Madec and Gjini 2020; Gjini and Madec 2023).

Empirical invasion of two invaders

When two strains are mixed in equal proportions (1:1) and grow independently, their relative abundances initially follow the equation:

$$(r_A-r_B)\,t=\ln\frac{n_A}{n_B}, \tag{1}$$

where $r_A$ and $r_B$ are their respective growth rates. From this linear relationship with time, we can obtain the selection coefficient between the two strains, $s=r_A-r_B$.
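As a minimal sketch of how Eq. 1 is used in practice, the selection coefficient can be read off as the slope of a straight-line fit of $\ln(n_A/n_B)$ against time. The function name, time points and growth rates below are hypothetical and chosen only for illustration; they are not taken from the dataset analyzed here.

```python
import numpy as np

def selection_coefficient(times, n_a, n_b):
    """Fit ln(n_A / n_B) = s * t + c and return (s, c), following Eq. 1."""
    times = np.asarray(times, dtype=float)
    log_ratio = np.log(np.asarray(n_a, dtype=float) / np.asarray(n_b, dtype=float))
    # For deg=1, np.polyfit returns [slope, intercept].
    slope, intercept = np.polyfit(times, log_ratio, deg=1)
    return slope, intercept

# Hypothetical noise-free example: a 1:1 inoculum growing exponentially.
t = np.array([0.0, 1.0, 2.0, 3.0])
r_a, r_b = 0.4, 1.0                      # assumed growth rates, so s = r_a - r_b = -0.6
n_a = 100.0 * np.exp(r_a * t)
n_b = 100.0 * np.exp(r_b * t)

s_hat, log_ratio_at_zero = selection_coefficient(t, n_a, n_b)
print(f"estimated s = {s_hat:.3f}")
```

With real counts the fit will not be exact, and the significance of the slope would need to be assessed separately, as done for the data in Fig. 1.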

A replicator equation model for multispecies dynamics

The frequency dynamics of N species (zi) is modeled using a replicator equation framework (Madec and Gjini 2020; Gjini and Madec 2023):

$$\frac{dz_i}{d\tau}=\Theta z_i\left(\sum_{j\neq i}\lambda_{ij}z_j-\sum_{1\le k\neq j\le N}\lambda_{kj}z_jz_k\right),\quad 1\le i,j\le N \tag{2}$$

In this equation, $\tau$ represents time, $\Theta$ a speed constant, and $\lambda_{ij}$ the pairwise initial growth rates between any species $i$ and $j$, when $i$ grows in an equilibrium set by $j$ alone, also known as pairwise invasion fitness (Geritz et al. 1998). Here, $z_i$ represents the frequency (i.e. the relative proportion) of species $i$ in the community, so that $\sum_{i=1}^{N}z_i=1$. This model is particularly useful when working with microbiota data, since such data often arise from high-throughput sequencing techniques like 16S rRNA sequencing, which typically measure the relative abundances of microbial taxa.

This replicator model has an equivalent Lotka-Volterra model for $N$ species with equal growth rates, and interaction matrix $(a_{ij})$, where $\lambda_{ij}=a_{ij}-a_{jj}$ for any two species.
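The replicator dynamics of Eq. 2 can be sketched numerically as follows. This is an illustrative forward-Euler integration, not the authors' code: the matrix `lam` (with zero diagonal), the initial frequencies and the step size are all hypothetical values chosen for the example.

```python
import numpy as np

def replicator_rhs(z, lam, theta=1.0):
    """Right-hand side of Eq. 2; lam[i, j] plays the role of lambda_ij (zero diagonal)."""
    per_species = lam @ z          # sum over j != i of lam[i, j] * z[j]
    q = z @ lam @ z                # sum over k != j of lam[k, j] * z[j] * z[k]
    return theta * z * (per_species - q)

def simulate(z0, lam, theta=1.0, dt=0.01, steps=500):
    """Forward-Euler integration; fine for a sketch, not for stiff systems."""
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        z = z + dt * replicator_rhs(z, lam, theta)
    return z

# Hypothetical pairwise invasion fitness matrix for N = 4 species.
lam = np.array([
    [0.0,  0.5, -0.2,  0.1],
    [-0.3, 0.0,  0.4,  0.2],
    [0.1, -0.1,  0.0,  0.3],
    [0.2,  0.3, -0.4,  0.0],
])
z0 = np.array([0.4, 0.3, 0.2, 0.1])   # frequencies sum to 1
z_end = simulate(z0, lam)
print(z_end, z_end.sum())
```

Because the quadratic term is the frequency-weighted average of the per-species terms, the right-hand sides sum to zero whenever the frequencies sum to one, so the total frequency is conserved up to numerical error.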

Scenario of 2 strains invading a multi-species system

The initial growth rate of an invader in a system at state $z(\tau)$ depends on both invader and system traits (Gjini and Madec 2023) and is given by:

$$r_{\text{invader}}=\underbrace{\sum_{j=1}^{N}z_j\lambda_{\text{inv},j}}_{\text{invader traits}}-\underbrace{Q(z)}_{\text{system invasion resistance}} \tag{3}$$

where $\lambda_{\text{inv},j}$ depends on invader traits, namely how this entity interacts with each resident species $j=1,\dots,N$ in the system at the time of invasion. Here, $Q$ depends on the system as a whole and is equal for all invaders (Gjini and Madec 2023).

When 2 strains, A and B, co-invade a multi-species host environment, the initial difference in their growth rates is:

$$r_A-r_B=\sum_{j=1}^{N}z_j(\lambda_{Aj}-\lambda_{Bj})=z\,\Delta\Lambda, \tag{4}$$

where the common quadratic term $Q$ has canceled out, $z$ is the row vector of species frequencies in the resident community, and $\Delta\Lambda$ is the column vector of invasion fitness differences between the two strains relative to the resident species. Assuming low initial frequencies of the invaders compared to those of residents ($z_A,z_B\ll\sum_j z_j\approx 1$), the initial influence of the two invaders on the system can be neglected. With this approach, and under the assumption of a linear selection coefficient, we attribute this linearity to the constant frequencies of the resident species in the system during the initial invasion period. Our analysis assumes that the timescale of the invader dynamics is faster than any significant change in the resident community composition during the invasion experiment, but this assumption can be relaxed when more longitudinal data are available (see SI Text 1). Under many realizations of such co-invasion in different resident systems, we have $s=F\,\Delta\Lambda$, where $s=(r_A-r_B)_{i=1..M}$ are the observed growth rate differences between A and B in $M$ different hosts, and $F$ is the $M\times N$ matrix of their multispecies frequency compositions.
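The cancellation of the shared term in Eq. 4 can be checked numerically. The sketch below assumes that $Q(z)$ is the quadratic term of Eq. 2 evaluated on the resident community; all numerical values (resident frequencies, resident interaction matrix, invader traits) are hypothetical.

```python
import numpy as np

def invader_growth_rate(z, lam_inv, lam_res):
    """Eq. 3: invader-specific linear term minus the shared quadratic term Q(z)."""
    q = z @ lam_res @ z            # assumed form of Q(z), taken from Eq. 2
    return z @ lam_inv - q

z = np.array([0.5, 0.2, 0.2, 0.1])     # hypothetical resident frequencies
lam_res = np.array([                   # hypothetical resident-resident matrix
    [0.0,  0.5, -0.2,  0.1],
    [-0.3, 0.0,  0.4,  0.2],
    [0.1, -0.1,  0.0,  0.3],
    [0.2,  0.3, -0.4,  0.0],
])
lam_a = np.array([-0.8, -1.5, -1.0, 0.6])   # hypothetical traits of invader A
lam_b = np.array([0.3,  0.9,  1.2, 0.1])    # hypothetical traits of invader B

s_direct = invader_growth_rate(z, lam_a, lam_res) - invader_growth_rate(z, lam_b, lam_res)
s_linear = z @ (lam_a - lam_b)              # right-hand side of Eq. 4
print(s_direct, s_linear)
```

Whatever the resident interaction matrix, the two numbers agree, which is why only the difference vector $\Delta\Lambda$ needs to be estimated from co-invasion data.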

Results

Growth rate differences linked to microbiota compositions via the replicator

We assume the frequencies of N resident species are measured just prior to invasion, in M replicates (hosts) of the same co-invasion with 2 strains, collected in data matrix F. The empirical initial growth rate differential between strains A and B, is denoted by a vector s, which links to F via:

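$$F\,\Delta\Lambda=s \quad\Longleftrightarrow\quad
\begin{pmatrix}
z_1^{(1)} & \cdots & z_N^{(1)}\\
z_1^{(2)} & \cdots & z_N^{(2)}\\
\vdots & & \vdots\\
z_1^{(M)} & \cdots & z_N^{(M)}
\end{pmatrix}
\begin{pmatrix}
\lambda_{A1}-\lambda_{B1}\\
\lambda_{A2}-\lambda_{B2}\\
\vdots\\
\lambda_{AN}-\lambda_{BN}
\end{pmatrix}
=
\begin{pmatrix}
\hat{r}_A^{(1)}-\hat{r}_B^{(1)}\\
\hat{r}_A^{(2)}-\hat{r}_B^{(2)}\\
\vdots\\
\hat{r}_A^{(M)}-\hat{r}_B^{(M)}
\end{pmatrix}. \tag{5}$$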

Thus, each row of F is the frequency vector of multispecies composition in each host (assumed constant), ΔΛ the invasion fitness difference vector between the two strains, and s the observed selection coefficient in each host. The goal is to estimate the vector ΔΛ.

When N=M, we can obtain ΔΛ as an exact solution of the linear system:

$$\Delta\Lambda=F^{-1}s, \tag{6}$$

and this requires the data matrix F to be invertible.

When $N<M$, i.e. the number of species is smaller than the number of hosts, we can use least squares regression:

$$\Delta\Lambda=(F^TF)^{-1}F^Ts. \tag{7}$$

This allows us to leverage many classical statistical results regarding the uncertainty of estimation and prediction.

When N>M, the number of species exceeds that of hosts, and one can proceed with some suitable aggregation of species to a lower number, to reduce the number of variables, and then apply the same method.
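The three cases above can be sketched in a few lines. The helper name `estimate_delta_lambda` and the synthetic data are assumptions made for illustration; the least-squares branch uses `numpy.linalg.lstsq`, which gives the same estimate as Eq. 7 when $F$ has full column rank.

```python
import numpy as np

def estimate_delta_lambda(F, s):
    """Estimate the invasion fitness difference vector from F (M x N) and s (length M)."""
    F = np.asarray(F, dtype=float)
    s = np.asarray(s, dtype=float)
    m, n = F.shape
    if m == n:
        return np.linalg.solve(F, s)                   # exact solution, Eq. 6
    if m > n:
        est, *_ = np.linalg.lstsq(F, s, rcond=None)    # least squares, Eq. 7
        return est
    raise ValueError("More species than hosts: aggregate species before fitting.")

# Synthetic check with hypothetical values: 6 hosts, 3 resident species.
rng = np.random.default_rng(1)
true_delta = np.array([-1.0, -2.0, 0.5])
F = rng.dirichlet(np.ones(3), size=6)      # each row is a composition summing to 1
s = F @ true_delta                         # noise-free selection coefficients

delta_ls = estimate_delta_lambda(F, s)         # N < M
delta_exact = estimate_delta_lambda(F[:3], s[:3])  # N = M
print(delta_ls, delta_exact)
```

On noise-free data both branches recover the vector used to generate `s`; with real measurements only the regression branch provides uncertainty estimates.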

From model to data: E.coli strains invade different microbiota

We extracted and prepared data from Cardoso et al. (2020), and applied the model to 3 co-invasion experiments between antibiotic-resistant mutants and wild-type E.coli: 1) Rif. resistant vs. WT in single-caged mice, 2) Rif. resistant vs. WT in co-housed mice, and 3) Strep. resistant vs. WT in co-housed mice (see Supplementary Dataset and Fig. 1), with microbiota resolution at the phylum level (N=4 species). The species were Bacteroidetes (B), Proteobacteria (P), Verrucomicrobia (V) and Firmicutes (F).

Fig. 1.


Microbiota compositions and initial relative growth between two invaders. Data publicly available and extracted from experiments with E.coli strains in mice can be found in Cardoso et al. (2020). A-B Data 1 (M=5, N=4): Mice were single-housed and injected with WT + a rifampicin-resistant mutant; C-D Data 2 (M=6, N=4): Mice were co-housed and injected with WT + a rifampicin-resistant mutant; E-F Data 3 (M=6, N=3): Mice were co-housed and injected with WT + a streptomycin-resistant mutant. Left panels show host microbiota compositions just prior to invasion. Right panels show plots of ln(Res/Wt). The slopes rA-rB were estimated from the linear fit with time, and their significance, assessed by an F-test. Only hosts with a significant non-zero slope were retained for analysis (mouse 1 in Data 1 and mouse 3 in Data 2 were removed, depicted in dashed lines in B,D). For details see SI dataset and SI Text 2

We also combined Data 1 and 2 under a joint regression, as they shared the same co-invaders. This led to a larger sample size ($M=9$) and thus more statistical power. Results are presented in Table 1, revealing the ecological interaction basis of fitness cost in these antibiotic-resistant strains. A fitness cost ($\Delta\Lambda_j<0$) for each antibiotic-resistant strain is confirmed relative to WT, with respect to every phylum ($j=$ B, P, or V), except for the rifampicin-resistant strain relative to species $j=$ F, which displays a positive signal in Data 1 and 2, separately and jointly, but without reaching individual significance. The lack of statistical significance in Data 1 and 2 is expected, as these datasets contain relatively few data points, limiting the power of the regression analysis; this issue is mitigated by combining Data 1 and 2, which increases sample size. The positive value of $\Delta\Lambda$ for Firmicutes, while not statistically significant, raises the possibility of a differential ecological interaction between Firmicutes and the rifampicin-resistant strain, perhaps reflecting reduced antagonism or even facilitation in this specific context. Although we do not draw strong conclusions from this result, it suggests a potential direction for future studies, including investigation of how Firmicutes may influence mutant-specific fitness outcomes.

Table 1.

Invasion fitness differential between two invading strains of E. coli relative to host microbiota. Solution of a linear system for Data 1, and regression estimates (with 95% CI) for Data 2 ($R^2=0.999$), Data 3 ($R^2=0.999$), and Data 1+2 jointly fitted ($R^2=0.976$). Entries are the coefficients $\Delta\Lambda_j$ for each resident species $j$; a dash marks a missing entry (Data 3 has $N=3$, see Fig. 1). For more details on the data, see Fig. 1

| Host microbiota species j | Data 1 | Data 2 | Data 3 | Data 1+2 |
|---|---|---|---|---|
| Bacteroidetes (B) | -0.83 | -1.78 [-10.17, 6.61] | -1.26 [-1.50, -1.02] | -1.16 [-1.80, -0.52] |
| Proteobacteria (P) | -2.17 | -9.17 [-73.97, 55.63] | -1.02 [-1.61, -0.44] | -2.92 [-5.08, -0.75] |
| Verrucomicrobia (V) | -2.77 | -2.00 [-8.05, 4.04] | - | -2.26 [-3.99, -0.52] |
| Firmicutes (F) | 0.22 | 8.98 [-15.92, 33.89] | -3.32 [-6.45, -0.19] | 1.07 [-1.23, 3.36] |

Quality of selection coefficient estimation and prediction

From the regression framework, we obtain explicitly the uncertainty around each coefficient as:

$$\mathrm{var}(\widehat{\Delta\Lambda}_j)=\frac{SSR}{M-N}\,\left[(F^TF)^{-1}\right]_{jj}, \tag{8}$$

where $SSR$ is the sum of squared residuals of the regression. These results enable us to make predictions about selection coefficients in new microbiota compositions. Specifically, given a new composition $c$ (a vector of species frequencies summing to 1), we can predict the selection coefficient between the two strains as:

$$\hat{\theta}=c\,\widehat{\Delta\Lambda}, \tag{9}$$

where, in mean, the standard error is $s_{\hat{\theta}}=\hat{\sigma}\sqrt{c(F^TF)^{-1}c^T}$, with $\hat{\sigma}=\sqrt{SSR/(M-N)}$, while for a particular case, $s_d=\hat{\sigma}\sqrt{1+c(F^TF)^{-1}c^T}$, depending both on past ($F$) and current ($c$) data (see SI Text 3 for details).
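Equations 8 and 9, together with the two standard errors, can be put together in a short sketch. The function names and the synthetic data are hypothetical, and the intervals use the Student $t$ quantile with $M-N$ degrees of freedom from `scipy.stats`; this follows standard no-intercept least-squares theory rather than reproducing the authors' code.

```python
import numpy as np
from scipy import stats

def fit_with_uncertainty(F, s):
    """Least-squares fit of s = F @ delta, with coefficient standard errors (Eq. 8)."""
    m, n = F.shape
    ftf_inv = np.linalg.inv(F.T @ F)
    coef = ftf_inv @ F.T @ s
    ssr = float(np.sum((s - F @ coef) ** 2))
    sigma = np.sqrt(ssr / (m - n))
    coef_se = sigma * np.sqrt(np.diag(ftf_inv))
    return coef, coef_se, sigma, ftf_inv

def predict(c, coef, sigma, ftf_inv, dof, alpha=0.05):
    """Point prediction (Eq. 9) with intervals for the mean and for a particular case."""
    theta = float(c @ coef)
    leverage = float(c @ ftf_inv @ c)
    se_mean = sigma * np.sqrt(leverage)
    se_case = sigma * np.sqrt(1.0 + leverage)
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=dof)
    ci_mean = (theta - t_crit * se_mean, theta + t_crit * se_mean)
    ci_case = (theta - t_crit * se_case, theta + t_crit * se_case)
    return theta, ci_mean, ci_case

# Synthetic, hypothetical data: 8 hosts, 3 species, small measurement noise.
rng = np.random.default_rng(2)
F = rng.dirichlet(np.ones(3), size=8)
s = F @ np.array([-1.0, -2.0, 0.5]) + rng.normal(scale=0.05, size=8)

coef, coef_se, sigma, ftf_inv = fit_with_uncertainty(F, s)
c_new = np.array([0.3, 0.3, 0.4])          # hypothetical new composition
theta, ci_mean, ci_case = predict(c_new, coef, sigma, ftf_inv, dof=F.shape[0] - F.shape[1])
print(theta, ci_mean, ci_case)
```

The interval for a particular case is always wider than the interval for the mean, because it adds the residual variance of a single new observation to the uncertainty of the fitted coefficients.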

Method validity test and potential use for microbiota engineering

Validity and predictive power Having quantified $\Delta\Lambda$, including its uncertainty, we could predict $r_A-r_B$ when the same strains A and B invade other hosts, with any microbiota compositions. To illustrate this predictive power, we validated our linear replicator method (LR-M) on the same dataset, but through a leave-one-out cross-validation approach (Fig. 2, SI Text 4), which confirmed good regression performance. Nevertheless, and especially in this small dataset, sensitivity of linear regression to outliers or multicollinearity may lead to inaccurate predictions, warranting care and preliminary tests.
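A leave-one-out cross-validation of the linear replicator regression can be sketched as below. The function name and the synthetic, noise-free data are assumptions for illustration only; on real data each held-out prediction would be accompanied by its confidence interval.

```python
import numpy as np

def loo_predictions(F, s):
    """For each host, refit on the remaining hosts and predict the held-out one."""
    m = F.shape[0]
    preds = np.empty(m)
    for i in range(m):
        keep = np.arange(m) != i
        coef, *_ = np.linalg.lstsq(F[keep], s[keep], rcond=None)
        preds[i] = F[i] @ coef
    return preds

# Synthetic, hypothetical data: 8 hosts, 3 species, no noise.
rng = np.random.default_rng(3)
F = rng.dirichlet(np.ones(3), size=8)
s = F @ np.array([-1.0, -2.0, 0.5])

s_loo = loo_predictions(F, s)
print(np.max(np.abs(s_loo - s)))
```

In this idealized case every held-out selection coefficient is recovered, since the data were generated from the linear model itself; deviations on real data reflect noise, outliers, or collinearity among species frequencies.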

Fig. 2.


Linear replicator model predictions vs. observed selection coefficients. We re-estimated invasion fitness parameters via a leave-one-out cross-validation procedure. For each host not used in estimation, we predicted the selection coefficient (and 95% CI), based on its microbiota composition, $c$. The $1-\alpha$ confidence interval for predicted $\hat{\theta}=c\,\widehat{\Delta\Lambda}$ is: $(\hat{\theta}\pm t_{\alpha/2}(M-N)\,s_{\hat{\theta}})$ in mean, or $(\hat{\theta}\pm t_{\alpha/2}(M-N)\,s_d)$ for a particular case (shown here for tests with Data 1+2 and Data 3). See also SI Text 4

This procedure confirms that this regression-based framework predicts selection coefficients in new hosts well, based on the information extracted from previous microbiota-selection coefficient data. Additionally, it illustrates two ways in which linear regression-based prediction can fail or be inaccurate: i) due to the presence of outliers, and ii) due to the presence of collinearity between the predictor variables (here the species frequencies). The collinearity can be captured in the $R_j^2$ for each partial regression of a focal species on the remaining species in the dataset (see Supplementary Material, SI Text 4). For Data 1+2, when considering $R_j^2$ we confirmed explicitly the two cases where collinearity is very prominent: when removing mouse 2 or mouse 4 from estimation. In these two cases, in the data used for estimation, at least one variable has $R_j^2>0.9$, which increases the variance around the estimated regression coefficients for these species, and is likely to lead to inaccurate predictions in future cases where these species frequencies may be high. Complementary diagnostics such as variance inflation factors (VIFs) can be used to quantify the effects of multicollinearity more directly. Additionally, regularization methods like ridge regression may improve robustness by reducing the impact of correlated predictors through coefficient shrinkage.

In this particular dataset, collinearity structure is clearly the main driving factor behind inaccurate predictions for these two mice, because the other component of the variance in regression coefficients (the total variance in each variable across the sample, $SST_j$) is comparable between all the test cases. In contrast, for Data 3, prediction performance was consistently good.
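The collinearity diagnostics discussed above can be sketched as follows. This example assumes no-intercept partial regressions and an uncentered $R_j^2$, chosen because compositional rows sum to one; the exact definition used in SI Text 4 may differ. Function names and data are hypothetical.

```python
import numpy as np

def partial_r2(F):
    """Uncentered R_j^2 from regressing each column of F on the others (no intercept)."""
    n = F.shape[1]
    r2 = np.empty(n)
    for j in range(n):
        y = F[:, j]
        others = np.delete(F, j, axis=1)
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2[j] = 1.0 - (resid @ resid) / (y @ y)
    return r2

def variance_inflation(F):
    """VIF_j = 1 / (1 - R_j^2), under the same assumptions as partial_r2."""
    return 1.0 / (1.0 - partial_r2(F))

# Synthetic, hypothetical compositions: 9 hosts, 4 species.
rng = np.random.default_rng(4)
F = rng.dirichlet(np.ones(4), size=9)

r2 = partial_r2(F)
vif = variance_inflation(F)
sst = (F ** 2).sum(axis=0)                  # uncentered total sum of squares per species
ftf_inv_diag = np.diag(np.linalg.inv(F.T @ F))
print(r2, vif)
```

Under these assumptions the diagonal of $(F^TF)^{-1}$ in Eq. 8 equals $\mathrm{VIF}_j$ divided by the total sum of squares of species $j$, which makes explicit how a high $R_j^2$ inflates the variance of the corresponding coefficient.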

Microbiota engineering to steer selection On the application front, we also explored how this method could inform the design of optimal microbiomes to steer selection coefficients towards desired regimes (Fig. 3, SI Text 5). In general, determining the right microbiota compositions and diversity levels, that keep selection coefficient s between two invaders below or above a threshold, can be framed and solved as a constrained optimization problem.

Fig. 3.


Searching microbiota compositions for target selection coefficients between invaders. We used the estimated invasion fitness differential for the Rif-WT pair of strains (Cardoso et al. 2020), and sought microbiota compositions yielding selection coefficients $s$ in 2 target regimes: positive $s>0.2$, and negative $s<-0.2$ (red and blue circles in A). We efficiently explored the space of possible solutions by minimizing $\sum_i z_i^2$ so as to favor more diversity, related to Simpson’s index. In this search, the species frequencies $z_i$ potentially converge to different admissible values (B-C) that satisfy the inequality constraint $|z\,\Delta\Lambda|>0.2$, but vary in diversity. In particular, using Simpson’s index, we observe more diversity in compositions that satisfy the negative inequality $s<-0.2$ (blue points in A), whereas in compositions that satisfy the positive inequality $s>0.2$ (red points in A), we observe less diversity and necessarily a high abundance of F

In Fig. 3, we illustrate our model’s capability for use in optimization and microbiota system bioengineering. The points shown were obtained using the specific estimated coefficients of the regression for Data 1+2 (Table 1) and applying MATLAB’s fmincon function (MATLAB 2024) to search for the microbiota compositions that satisfy the constraints of the selection coefficients being below -0.2 and above 0.2. This allowed us to explore various compositions that satisfied the constraints, starting from random initial guesses and searching the variable space in the direction of minimizing $\sum_j z_j^2$. Thus, as a ‘cost’ function for the optimization search, we considered an inverse measure of diversity, which favors greater diversity related to the Simpson index $(\sum_j z_j^2)^{-1}$, although this was not used as a strict criterion, as different solutions satisfying the constraint and varying in diversity levels were admitted for illustration.

It is important to note that the microbiota compositions illustrated do not represent the absolute optimal solution. Since this is a convex optimization problem, the true minimum could be calculated analytically. However, our use of fmincon was an efficient way to find points that satisfied the constraints, illustrating the wide space of potential balances between host microbiota composition and diversity, and the target selection coefficients desired. More specific optimization remains an open avenue for future investigation.
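The constrained search described above can be sketched in Python with `scipy.optimize.minimize` (SLSQP) as a stand-in for MATLAB’s fmincon; this substitution is ours and is not the code used for Fig. 3. The coefficients are the Data 1+2 point estimates from Table 1, in the order B, P, V, F. Unlike the figure, which shows many admissible points from random starts, this sketch starts from the uniform composition and returns the single most diverse composition in each regime, with non-strict inequalities.

```python
import numpy as np
from scipy.optimize import minimize

# Point estimates for Data 1+2 (Table 1), ordered as B, P, V, F.
DELTA_LAMBDA = np.array([-1.16, -2.92, -2.26, 1.07])

def most_diverse_composition(delta_lambda, threshold, sign):
    """Minimize sum_j z_j^2 on the simplex, subject to sign * (z @ delta_lambda) >= threshold."""
    n = delta_lambda.size
    constraints = [
        {"type": "eq", "fun": lambda z: np.sum(z) - 1.0},
        {"type": "ineq", "fun": lambda z: sign * (z @ delta_lambda) - threshold},
    ]
    z0 = np.full(n, 1.0 / n)               # start from the uniform composition
    result = minimize(lambda z: np.sum(z ** 2), z0, method="SLSQP",
                      bounds=[(0.0, 1.0)] * n, constraints=constraints)
    return result.x

z_pos = most_diverse_composition(DELTA_LAMBDA, 0.2, +1.0)   # target s >= 0.2
z_neg = most_diverse_composition(DELTA_LAMBDA, 0.2, -1.0)   # target s <= -0.2

for label, z in (("s >= 0.2", z_pos), ("s <= -0.2", z_neg)):
    simpson = 1.0 / np.sum(z ** 2)
    print(label, np.round(z, 3), "s =", round(float(z @ DELTA_LAMBDA), 3),
          "inverse Simpson =", round(float(simpson), 3))
```

Since only Firmicutes has a positive coefficient in this dataset, any composition in the positive regime must be dominated by F, consistent with the pattern reported in Fig. 3.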

Alternative prediction using machine learning: the random forest algorithm

In addition to our mechanistic regression approach, we explored a model-free alternative by employing a Random Forest machine learning algorithm (RF-A) to predict the selection coefficient rA-rB from ‘learned’ host microbiota compositions:

input: microbiota → output: selection coefficient

Random forests (Breiman 2001), one of the classic ML techniques, are an ensemble learning method used for classification and regression. They reduce variance and improve generalization by averaging predictions across multiple decision trees and have found success in the field of microbiota research (Yatsunenko 2012; Roguet 2013; Zhang 2019). Given the small dataset in our study, they offer robustness against overfitting and can perform well even when explanatory variables outnumber observations (Cutler 2007). Unlike the proposed mechanistic model, where a linear relation between species frequencies and selection coefficients naturally arises (Eq. 6), random forests do not assume a predefined relationship between explanatory variables (input) and the response variable (output). Instead, they infer patterns directly from the data by aggregating predictions from multiple decision trees (Hastie et al. 2009). The primary hyperparameter we control in our random forest model is the number of trees. Increasing this parameter improves stability, but also increases computational costs (Liaw and Wiener 2002). We implemented this algorithm in Python and used a manually specified number of trees while keeping the other hyperparameters at their default values. For more details, see SI Text 6 and SI Code.

Our first goal was to assess the model’s performance using leave-one-out cross-validation (LOO-CV), configuring the RF with 20 trees and repeating the training process 10 times to account for inherent stochasticity. Secondly, we wanted to compare the predictive performance of the replicator-based linear regression and the random forest algorithm. Our final goal was to evaluate whether different levels of microbiota resolution influence predictive accuracy in the RF setting. Results are summarized in Fig. 4 and Table 2.
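The leave-one-out procedure with repeated Random Forest training can be sketched with scikit-learn’s `RandomForestRegressor`, using 20 trees, 10 repeats, and default values for the other hyperparameters as described above. The function name, the seeds and the synthetic data are hypothetical; the authors’ actual implementation is in the SI Code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_loo_predictions(F, s, n_trees=20, n_repeats=10, seed=0):
    """Leave-one-out predictions; returns mean, min and max over repeated trainings."""
    m = F.shape[0]
    preds = np.empty((m, n_repeats))
    for i in range(m):
        keep = np.arange(m) != i
        for r in range(n_repeats):
            model = RandomForestRegressor(n_estimators=n_trees, random_state=seed + r)
            model.fit(F[keep], s[keep])
            preds[i, r] = model.predict(F[i:i + 1])[0]
    return preds.mean(axis=1), preds.min(axis=1), preds.max(axis=1)

# Synthetic, hypothetical data: 8 hosts, 4 phyla.
rng = np.random.default_rng(5)
F = rng.dirichlet(np.ones(4), size=8)
s = F @ np.array([-1.16, -2.92, -2.26, 1.07]) + rng.normal(scale=0.05, size=8)

rf_mean, rf_min, rf_max = rf_loo_predictions(F, s)
print(np.round(rf_mean, 2))
```

The (min, max) range over repeats reflects only the stochasticity of training, not sampling uncertainty in the data. Note also that a regression forest averages training responses, so its predictions cannot fall outside the range of selection coefficients seen in training.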

Fig. 4.


Comparing predictive performance of mechanistic vs. machine learning approach. We plot observed vs. predicted selection coefficients for the Replicator Equation-based prediction, LR-M (blue) vs. the Random Forest-based prediction RF-A (green), in the leave-one-out cross-validation approach. The actual observed selection coefficients are given by the black crosses. The figure illustrates the performance of both models, with the RF-A model achieving lower uncertainty. The interval (min, max) for the random forest algorithm was obtained by point-estimates over 10 independent trainings with 20 trees, whose mean is denoted by the green circle

Table 2.

Relative errors from LOO-CV performance of the two approaches. To evaluate the performance of both methods, the replicator-based regression and the random forest algorithm, we considered the relative error for each instance of prediction of the true selection coefficient $s$: $RE=\frac{\hat{s}-s}{s}$, and finally, two key performance metrics obtained by averaging over several predictions: Mean Relative Error ($MRE=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{\hat{s}_i-s_i}{s_i}\right|$), and Root Mean Squared Error ($RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{s}_i-s_i)^2}$). We show the absolute relative error for each point prediction using LR-M (linear replicator model) and RF-A (Random Forest algorithm) on Data 1+2 and Data 3 (rounded to two decimal places) and take their average. An asterisk (*) marks the cases in which the point prediction by the replicator-based linear regression is closer to the empirical selection coefficient than the prediction by the Random Forest algorithm (30% of predictions for Data 1+2, and 50% of predictions for Data 3)

| Dataset | Prediction host index i | Relative error, LR-M | Relative error, RF-A | True $s_i$ |
|---|---|---|---|---|
| Data 1+2 | 2 | 3.92 | 0.16 | -1.12 |
| Data 1+2 | 3 | 0.72 | 0.45 | -0.76 |
| Data 1+2 | 4 | 5.69 | 1.02 | -0.50 |
| Data 1+2 | 5 | 0.03* | 0.17 | -1.48 |
| Data 1+2 | 1 | 0.19 | 0.12 | -1.39 |
| Data 1+2 | 2 | 0.23 | 0.16 | -1.61 |
| Data 1+2 | 4 | 0.17* | 0.20 | -1.04 |
| Data 1+2 | 5 | 0.10* | 0.13 | -1.50 |
| Data 1+2 | 6 | 0.41 | 0.00 | -0.99 |
| Data 1+2, global comparison | MRE | 1.27 | 0.28 | |
| Data 1+2, global comparison | RMSE | 2.32 | 0.41 | |
| Data 3 | 1 | 0.02 | 0.01 | -1.25 |
| Data 3 | 2 | 0.01* | 0.03 | -1.22 |
| Data 3 | 3 | 0.10 | 0.05 | -1.43 |
| Data 3 | 4 | 0.08 | 0.05 | -1.41 |
| Data 3 | 5 | 0.09* | 0.13 | -1.23 |
| Data 3 | 6 | 0.02* | 0.03 | -1.27 |
| Data 3, global comparison | MRE | 0.05 | 0.05 | |
| Data 3, global comparison | RMSE | 0.05 | 0.06 | |
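The error metrics defined in the caption of Table 2 can be written compactly as below. This is a sketch of the formulas as stated in the caption, applied to two hypothetical predictions; the numbers are not taken from the table.

```python
import numpy as np

def absolute_relative_errors(s_true, s_pred):
    """|(s_hat - s) / s| for each prediction."""
    s_true = np.asarray(s_true, dtype=float)
    s_pred = np.asarray(s_pred, dtype=float)
    return np.abs((s_pred - s_true) / s_true)

def mean_relative_error(s_true, s_pred):
    """MRE: average of the absolute relative errors."""
    return float(np.mean(absolute_relative_errors(s_true, s_pred)))

def root_mean_squared_error(s_true, s_pred):
    """RMSE: square root of the mean squared difference (s_hat - s)."""
    diff = np.asarray(s_pred, dtype=float) - np.asarray(s_true, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical true and predicted selection coefficients.
s_true = np.array([-1.0, -2.0])
s_pred = np.array([-1.1, -1.8])

mre = mean_relative_error(s_true, s_pred)        # both relative errors equal 0.1
rmse = root_mean_squared_error(s_true, s_pred)   # sqrt((0.01 + 0.04) / 2)
print(mre, rmse)
```

Relative errors are only meaningful when the true selection coefficient is not close to zero, which holds for all hosts retained in Table 2.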

Comparison between the two approaches In the Leave-One-Out Cross-Validation (LOO-CV), the model is trained on all but one of the data points (hosts) and tested on the remaining one, iterating over all possible held-out hosts. The RF-A algorithm was trained on the same dataset, with the microbiota frequency vector as predictors (input), and the observed selection coefficients as the response variable (output). In our case, the prediction intervals were obtained by training 10 independent models (20 trees each) for every held-out host, and taking the min and max over all predicted values. For the point-estimate analysis we considered the mean of these 10 models as the point prediction from the RF-A algorithm.

We compare the quality of these predictions in Table 2. Phylum was chosen as the standard level of microbiota resolution for direct comparison with previous results, while we also explored a higher resolution level, level 3, though only in the RF-A case due to dimensionality constraints (Table S2).

First, when examining the results for Data 1+2, it is clear that the RF-A algorithm outperforms the replicator model in most predictions (Table 2) and, therefore, on average, according to both metrics. We attribute this to the machine learning algorithm’s superior ability to adapt to outliers and bypass collinearity problems of classical regression. This trend is not observed in Data 3, where the performance metrics are nearly identical for both models: when the data are of high quality and free of collinearity structure, as is the case for Data 3, both approaches perform similarly well. This highlights the explanatory and predictive power of the simpler replicator framework (see SI Text 6, and Table S3 for variable significance in the RF-A setting).

However, it is worth noting that the replicator-based linear regression provides comparatively more accurate point predictions in about 30% of hosts from dataset 1+2 and 50% of hosts from dataset 3 and, even in cases of lower performance, still yields confidence intervals that almost always contain the empirically-observed values of s, which is not the case with the uncertainty intervals obtained with RF-A. While not surprising, this data analysis seems to suggest that the linear regression framework may be more sensitive to realistic data limitations, both in terms of sample size and quality, and reflects this explicitly in its estimates. In contrast, the RF-A machine learning algorithm yields much narrower intervals, creating an impression of over-confidence, at the expense of sometimes being less accurate (Fig. 4).

Finally, when comparing the performance of the machine learning algorithm at different levels of microbiota resolution, we observe that RF-A performs better at the lower resolution level (Phylum) in both datasets (Table S2). This suggests that, at higher levels of resolution, the inclusion of additional species may introduce noise that is not relevant, and may even hamper the predictions. However, further analysis with more data is necessary before drawing definitive conclusions.

Overall, our mathematical approach clarifies and quantifies how host environment composition affects the relative competition of two invaders. With the explicit replicator model, it is possible to start unpacking not only the fitness cost of antibiotic resistance in terms of host microbiota species, as in this dataset (Cardoso et al. 2020), but more generally the fitness effects of any mutation, or any kind of invasion in a biodiverse environment.

Discussion

Linking host microbiota to invaders’ relative fitness difference

To model invader dynamics in a multi-species ecosystem, we apply a replicator equation framework (Hofbauer and Sigmund 2003; Gjini and Madec 2023) tailored to invasion scenarios. It effectively and minimally describes frequency-dependent dynamics, is related to the well-known Lotka-Volterra models (Bomze 1995), and lends itself to the contextual nature of fitness cost between 2 strains upon co-invasion. How can host multi-species composition be connected to the selection coefficient observed between two invaders? Assuming that the frequencies of resident species remain constant in the initial invasion phase, the problem simplifies to a linear system (Eq. 5), where, depending on data dimensions, either an exact solution or an optimal regression-based estimate can be found for the vector of invasion fitness differences between the two invaders with respect to each constituent species. This method relates to and extends classical selection coefficient estimation approaches, widely used in ecology and biology, based on pure exponential growth (Fisher 1958; Lenski 1991; Dykhuizen 1990).

We demonstrated this proof-of-concept on E. coli invasion data in mice with different microbiota (Cardoso et al. 2020), although the framework is applicable to other multispecies compositions and invasion scenarios. For example, the method could be applied to in vivo nasopharyngeal or lung microbiota and respiratory pathogen co-invasion (Dickson and Huffnagle 2015; Yagi 2021), vaginal microbiota and its protective role against viral infection (Stefan 2020; Mendling 2016; Ravel 2011), in vitro invasion studies in artificially constructed microbiomes (Lindon 2024), and other general multispecies environments shaping initial relative frequencies of two newcomers in the system.

We stated some obvious limitations of the method, related for example to the requirement of large sample sizes (number of hosts) relative to the number of microbiota species, and to its sensitivity to outliers and to collinearity among model variables. Still, if data quality criteria are met and outliers are properly handled (e.g., through standard regression diagnostics such as Cook’s distance, studentized residuals, or leverage scores, or through suitable information-theoretic metrics), the application of the linear regression offered by the replicator framework should pose no major problems.
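Two of the diagnostics named above, leverage scores and Cook's distance, can be computed directly from the design matrix. The sketch below uses the standard textbook formulas and synthetic data. The function name `regression_diagnostics` and all values are hypothetical, and the model is again assumed to be a no-intercept linear regression of selection coefficients on resident frequencies.

```python
import numpy as np

def regression_diagnostics(X, s):
    """Leverage and Cook's distance for the no-intercept model s ~ X @ delta."""
    X = np.asarray(X, dtype=float)
    s = np.asarray(s, dtype=float)
    n, p = X.shape
    # Hat matrix: projection onto the column space of X.
    # Equals X (X^T X)^{-1} X^T when X has full column rank.
    H = X @ np.linalg.pinv(X)
    leverage = np.diag(H)
    resid = s - H @ s
    mse = resid @ resid / (n - p)  # residual variance estimate
    # Cook's distance: influence of each host on the fitted coefficients.
    cooks = resid**2 / (p * mse) * leverage / (1.0 - leverage) ** 2
    return leverage, cooks

# Hypothetical data: 6 hosts, 2 resident species; host 5 is an outlier.
X = np.array([
    [0.1, 0.9],
    [0.3, 0.7],
    [0.5, 0.5],
    [0.7, 0.3],
    [0.9, 0.1],
    [0.5, 0.5],
])
delta = np.array([0.1, -0.1])  # made-up fitness differences
s = X @ delta
s[5] = 0.5                     # perturb one host to mimic an outlier

leverage, cooks = regression_diagnostics(X, s)
print("leverage:", leverage)
print("most influential host:", int(np.argmax(cooks)))
```

The leverage values sum to the number of fitted coefficients, which is a quick sanity check. Hosts with extreme compositions have high leverage even when they are not outliers, so leverage and Cook's distance are best read together.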

By offering an explicit quantification of the effect of each species on selection, this framework constitutes a first step in understanding underlying mechanisms, and potentially using these interactions for strategies in microbiome engineering, antibiotic resistance management, and ecosystem control.

Mechanistic model vs. machine-learning algorithm

In recent years, machine learning techniques have gained significant traction in microbiome research, providing powerful tools to uncover complex patterns and relationships within microbial communities (Medina 2022; Li 2022; Marcos-Zambrano 2021; Statnikov 2013). These methods offer an effective means of analyzing high-dimensional datasets, often encountered in microbiota studies. Among these methods, Random Forest (RF-A), a supervised ML algorithm, has emerged as a particularly robust and flexible approach for analyzing heterogeneous microbiome data.

In this context, with the growing use of AI in research, we brought additional insights by comparing the results of our replicator-based linear regression with a classical machine learning algorithm, the Random Forest. Our results indicated that, depending on data quality, the RF-A may outperform the replicator model, especially in precision. However, both methods perform similarly well in terms of accuracy and could be used in complementary fashion.

This dual approach enables us to compare the strengths and limitations of both mechanistic models and model-free approaches in capturing the dynamics of microbial invasion. While the proposed replicator-based model with invasion fitnesses limits the multi-species effects to be at most linearly additive for the final selection coefficient, the RF-A machine learning algorithm bypasses this constraint, and explores a wider set of possibly nonlinear effects and relationships.
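The additivity constraint can be illustrated with synthetic data. The sketch below does not implement Random Forest. It only shows that, if the true selection coefficient contained a hypothetical pairwise interaction between two resident species, a purely additive linear fit would leave a systematic residual that disappears once the interaction is added as a feature. All names and values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical compositions: 30 hosts, 3 resident species (rows sum to 1).
X = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=30)

delta = np.array([-0.05, 0.02, 0.10])  # made-up additive effects
gamma = 1.0                            # made-up interaction strength
interaction = X[:, 0] * X[:, 1]
s = X @ delta + gamma * interaction    # synthetic non-additive response

def rmse_of_fit(design, response):
    """Root-mean-square residual of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return float(np.sqrt(np.mean((response - design @ coef) ** 2)))

# Additive model: one coefficient per resident species.
rmse_additive = rmse_of_fit(X, s)
# Augmented model: additive terms plus the species 0 x species 1 product.
rmse_augmented = rmse_of_fit(np.column_stack([X, interaction]), s)

print("additive RMSE:", rmse_additive)
print("augmented RMSE:", rmse_augmented)
```

A flexible learner such as Random Forest can approximate terms like this interaction without being told about them, at the cost of no longer yielding one interpretable coefficient per species.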

So, our empirical tests indicate that if the main objective is point prediction, machine learning approaches like RF-A may be the preferred choice to connect host microbiota to invaders’ fitness differences.

While the flexibility of machine learning adds to its predictive power, it also comes with well-known limitations, the principal one being reduced interpretability and the lack of explicit underlying processes. In contrast, our linear replicator model explicitly and mechanistically quantifies how invaders interact with particular resident species (Gjini and Madec 2023), and provides confidence intervals for these estimates. This interpretability and explicit statistical foundation make it a valuable tool for microbiota engineering, offering straightforward guidance for experimental design.

Outlook

The next challenge lies in using the selection coefficient estimates from existing datasets to prospectively design systems that satisfy certain desired control targets. Similarly, it remains challenging to optimize diversity resolution in biotic environments for robust inference of invader parameters. There may be cases when the number of species in the microbiota composition is larger than the number of hosts, and these may require new bases of aggregation and classification, beyond genetic distance, taking into account functional diversity between members of a community. We saw that increasing the taxonomic resolution for this particular dataset in the Random Forest application did not significantly improve predictions, but this question deserves deeper investigation.
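One simple form of aggregation is to sum frequency columns that share a coarser label, which reduces the number of unknowns in the regression. The sketch below is a hypothetical helper, not part of the published analysis. The grouping shown is by phylum, but the same function would accept functional groups or any other classification.

```python
import numpy as np

def aggregate_frequencies(X, groups):
    """Sum fine-scale frequency columns into coarser groups.

    X      : (n_hosts, n_taxa) frequency matrix.
    groups : list of length n_taxa giving the group label of each column.
    Returns the aggregated matrix and the sorted list of group labels.
    """
    X = np.asarray(X, dtype=float)
    labels = sorted(set(groups))
    out = np.zeros((X.shape[0], len(labels)))
    for j, g in enumerate(groups):
        out[:, labels.index(g)] += X[:, j]
    return out, labels

# Hypothetical genus-level data: 2 hosts, 4 genera.
X_genus = np.array([
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
])
# Illustrative mapping of each genus column to a phylum.
phylum_of = ["Bacteroidetes", "Firmicutes", "Firmicutes", "Proteobacteria"]

X_phylum, phyla = aggregate_frequencies(X_genus, phylum_of)
print(phyla)
print(X_phylum)
```

Aggregation preserves the total frequency in each host, so the aggregated matrix can be passed to the same linear regression with fewer coefficients to estimate.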

Finally, a key area that calls for attention is the integration of short- and long-term invasion outcomes (Arnoldi 2022), along key gradients (Gjini and Madec 2021), and across systems. Here, motivated also by the data (Cardoso et al. 2020), we treated only the initial short-term dynamics post-invasion of two invaders, and did not concern ourselves with their ultimate fate in the system. However, linking these initial fitness signatures to final persistence and coexistence dynamics is crucial for our understanding of long-term outcomes in ecological systems. The explicit linear approach outlined here can be adapted to obtain the full global dynamics of the multispecies community by considering stepwise each member as an invader (Gjini and Madec 2023).

Furthermore, identifying the coupling parameters and processes that connect seemingly separate invasion dynamics across time and space, and environmental gradients, will provide essential unifying insights into our global quantification of fitness.

Supporting Information

SI dataset. All microbiota composition data and selection coefficients between two invaders, extracted from Cardoso et al. (2020); Cardoso et al. Data (2020). SI Text. Extended methods and references. SI Code. https://github.com/tomasfreire/Unpacking_fitness_differences

Acknowledgements

We thank Ermanda Dekaj for initial extraction of the data, and Isabel Gordo and Susana Vinga for helpful discussions. The work is supported by the Portuguese Foundation for Science and Technology via grant number 2022.03060.PTDC.

Funding

Open access funding provided by FCT|FCCN (b-on).

Declarations

Conflicts of Interest

The authors declare no conflicts of interest.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Fisher RA (1958) The Genetical Theory of Natural Selection, 2nd edn. Dover, New York
  2. Chevin L-M (2011) On measuring selection in experimental evolution. Biol Lett 7(2):210–213
  3. Dykhuizen DE (1990) Experimental studies of natural selection in bacteria. Annual Review of Ecology and Systematics, pp 373–398
  4. Lenski RE et al (1991) Long-term experimental evolution in Escherichia coli. I. Adaptation and divergence during 2,000 generations. Am Nat 138(6):1315–1341
  5. Hastings A (1996) Models of Spatial Spread: Is the Theory Complete? Ecology 77(6):1675–1679. 10.2307/2265772
  6. Crooks JA (2005) Lag times and exotic species: The ecology and management of biological invasions in slow-motion. Écoscience 12(3):316–329. 10.2980/i1195-6860-12-3-316.1
  7. Wiser MJ, Lenski RE (2015) A Comparison of Methods to Measure Fitness in Escherichia coli. PLOS ONE 10(5):e0126210. 10.1371/journal.pone.0126210
  8. Xu B et al (2020) Epidemiological data from the COVID-19 outbreak, real-time case information. Scientific Data 7(1). 10.1038/s41597-020-0448-0
  9. Kucharski AJ et al (2020) Early dynamics of transmission and control of COVID-19: a mathematical modelling study. Lancet Infect Dis 20(5):553–558. 10.1016/s1473-3099(20)30144-4
  10. Volz E et al (2021) Evaluating the Effects of SARS-CoV-2 Spike Mutation D614G on Transmissibility and Pathogenicity. Cell 184(1):64-75.e11. 10.1016/j.cell.2020.11.020
  11. Cardoso LL et al (2020) Dysbiosis individualizes the fitness effect of antibiotic resistance in the mammalian gut. Nature Ecology & Evolution 4(9):1268–1278
  12. Mahuku G et al (2015) Maize Lethal Necrosis (MLN), an Emerging Threat to Maize-Based Food Security in Sub-Saharan Africa. Phytopathology 105(7):956–965. 10.1094/phyto-12-14-0367-fi
  13. Seabloom EW et al (2009) Diversity and Composition of Viral Communities: Coinfection of Barley and Cereal Yellow Dwarf Viruses in California Grasslands. Am Nat 173(3):E79–E98. 10.1086/596529
  14. Susi H et al (2015) Co-infection alters population dynamics of infectious disease. Nature Communications 6(1). 10.1038/ncomms6975
  15. Lindon S et al (2024) Antibiotic resistance alters the ability of Pseudomonas aeruginosa to invade bacteria from the respiratory microbiome. Evol Lett 8(5):735–747
  16. Alizon S (2013) Parasite co-transmission and the evolutionary epidemiology of virulence. Evolution 67(4):921–933
  17. Wong A et al (2023) The interactions of SARS-CoV-2 with cocirculating pathogens: Epidemiological implications and current knowledge gaps. PLOS Pathogens 19(3):e1011167. 10.1371/journal.ppat.1011167
  18. Kurkjian HM, Javad Akbari M, Momeni B (2021) The impact of interactions on invasion and colonization resistance in microbial communities. PLoS Comput Biol 17(1):1–18. 10.1371/journal.pcbi.1008643
  19. Litvak Y, Bäumler AJ (2019) The founder hypothesis: A basis for microbiota resistance, diversity in taxa carriage, and colonization resistance against pathogens. PLoS Pathog 15(2):1–6. 10.1371/journal.ppat.1007563
  20. He X et al (2013) The social structure of microbial community involved in colonization resistance. ISME J 8(3):564–574. 10.1038/ismej.2013.172
  21. Madec S, Gjini E (2020) Predicting N-Strain Coexistence from Co-colonization Interactions: Epidemiology Meets Ecology and the Replicator Equation. Bull Math Biol 82:142. 10.1007/s11538-020-00816-w
  22. Barreto et al (2009) Detection of antibiotic resistant E. coli and Enterococcus spp. in stool of healthy growing children in Portugal. J Basic Microbiol 49(6):503–512
  23. Hong S et al (2010) Genetic characterization of atypical Shigella flexneri isolated in Korea. J Microbiol Biotechnol 20(10):1457–62
  24. Rahmani F et al (2012) Drug resistance in Vibrio cholerae strains isolated from clinical specimens. Acta Microbiol Immunol Hung 59(1):77–84
  25. Gjini E, Madec S (2023) Towards a mathematical understanding of invasion resistance in multispecies communities. Royal Society Open Science 10(11):231034
  26. Geritz S, Kisdi E, Meszéna G et al (1998) Evolutionarily Singular Strategies and the Adaptive Growth and Branching of the Evolutionary Tree. Evol Ecol 12:35–57. 10.1023/A:1006554906681
  27. Cardoso et al. Data (2020) 10.5061/dryad.j0zpc869r. Accessed: 2023-09-30
  28. MATLAB Documentation (2024) MathWorks. Available at https://www.mathworks.com/help/
  29. Breiman L (2001) Random Forests. Mach Learn 45:5–32. 10.1023/A:1010933404324
  30. Yatsunenko T et al (2012) Human gut microbiome viewed across age and geography. Nature 486(7402):222–227. 10.1038/nature11053
  31. Roguet A et al (2018) Fecal source identification using random forest. Microbiome 6(1). 10.1186/s40168-018-0568-3
  32. Zhang J et al (2019) NRT1.1B is associated with root microbiota composition and nitrogen use in field-grown rice. Nat Biotechnol 37(6):676–684. 10.1038/s41587-019-0104-4
  33. Cutler DR et al (2007) Random forests for classification in ecology. Ecology 88(11):2783–2792
  34. Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning. Springer
  35. Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
  36. Hofbauer J, Sigmund K (2003) Evolutionary game dynamics. Bull Am Math Soc 40(4):479–519. 10.1090/S0273-0979-03-00988-1
  37. Bomze IM (1995) Lotka-Volterra equation and replicator dynamics: new issues in classification. Biol Cybern 72(5):447–453
  38. Dickson RP, Huffnagle GB (2015) The lung microbiome: new principles for respiratory bacteriology in health and disease. PLoS Pathog 11(7):e1004923
  39. Yagi K et al (2021) The lung microbiome during health and disease. Int J Mol Sci 22(19):10872
  40. Stefan KL et al (2020) Commensal microbiota modulation of natural resistance to virus infection. Cell 183(5):1312–1324
  41. Mendling W (2016) Vaginal microbiota. In: Microbiota of the human body: Implications in health and disease, pp 83–93
  42. Ravel J et al (2011) Vaginal microbiome of reproductive-age women. Proceedings of the National Academy of Sciences 108(supplement_1):4680–4687
  43. Medina RH et al (2022) Machine learning and deep learning applications in microbiome research. ISME Communications 2(1). 10.1038/s43705-022-00182-9
  44. Li P et al (2022) Machine learning for data integration in human gut microbiome. Microbial Cell Factories 21(1). 10.1186/s12934-022-01973-4
  45. Marcos-Zambrano LJ et al (2021) Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment. Frontiers in Microbiology 12. 10.3389/fmicb.2021.634511
  46. Statnikov A et al (2013) A comprehensive evaluation of multicategory classification methods for microbiomic data. Microbiome 1(1). 10.1186/2049-2618-1-11
  47. Arnoldi J-F et al (2022) Invasions of ecological communities: Hints of impacts in the invader’s growth rate. Methods Ecol Evol 13(1):167–182
  48. Gjini E, Madec S (2021) The ratio of single to co-colonization is key to complexity in interacting systems with multiple strains. Ecol Evol 11(13):8456–8474

Articles from Bulletin of Mathematical Biology are provided here courtesy of Springer
