Skip to main content
PLOS Computational Biology logoLink to PLOS Computational Biology
. 2025 Sep 18;21(9):e1013437. doi: 10.1371/journal.pcbi.1013437

wavess: An R package for simulation of adaptive within-host virus sequence evolution

Narmada Sambaturu 1,2,#, Zena Lapp 1,#, Fernando D K Tria 1, Ethan Romero-Severson 1, Carmen Molina-París 1, Thomas Leitner 1,*
Editor: Jordan Douglas3
PMCID: PMC12459828  PMID: 40966297

Abstract

Simulating within-host virus sequence evolution allows for the investigation of factors such as the role of recombination in virus diversification and the impact of selective pressures on virus evolution. Here, we provide a new software to simulate virus within-host evolution called wavess (within-host adaptive virus evolution sequence simulator), a discrete-time individual-based model and a corresponding user-friendly R package. The underlying model simulates recombination, a latent infected cell reservoir, and three forms of selection: conserved sites fitness and replicative fitness in comparison to a reference sequence, and immune fitness including cross-reactivity imposed by a co-evolving immune response. In the R package, we also provide functions to generate model inputs from empirical data, as well as functions to analyze the simulation outputs. At user-defined time points, the software returns various counts related to the virus population(s) and a set of sampled virus sequences. We applied this model to investigate the selection pressures on HIV-1 env sequences longitudinally collected from 11 individuals. The best-fitting immune cost differed across individuals, mirroring the real-world expectation of heterogeneous immune responses among human hosts. Furthermore, the phylogenies reconstructed from these simulated sequences were similar to the phylogenies reconstructed from the real sequences for all summary statistics tested. To our knowledge, compared to other similar models, wavess has been more rigorously validated against real within-host virus sequences, and is the first to be implemented as an R package. The wavess R package can be downloaded from https://github.com/MolEvolEpid/wavess.

Author summary

During a virus infection, the virus population within an individual has the potential to evolve over time. While the evolutionary rate and duration of infection determine the number of mutations the virus can accumulate, changes to a virus sequence may impact protein function and host immune recognition; viruses with maintained or improved functional capacities that have evaded the host immune system survive, while others might go extinct. Thus, the history of virus evolution within an individual can be captured by sequencing viruses sampled over time. Here, we present an individual-based model and corresponding user-friendly R package, wavess, that allows users to simulate within-host virus evolution and evaluate different parameters that affect sequence evolution. We tested and validated wavess by showing that the phylogenetic trees reconstructed from simulated sequences match those generated from empirical HIV-1 data. Taken together, we believe that wavess will be a useful tool for the virus evolution research community.

Introduction

The diversity acquired by virus sequences during long-term evolution within a host provides useful information for applications such as inferring transmission, developing vaccine candidates, and understanding latent virus reservoir dynamics, ultimately with the goal of curing an infected person. However, in many instances, we are limited in the amount of within-host virus sequence data available to study outcomes of interest, and furthermore, these data may not provide insight into potential underlying mechanisms driving virus evolution. A realistic model of within-host virus evolution calibrated to real sequence data can overcome some of these limitations and enable us to, for instance, investigate sequence evolution over time under different biological scenarios, or perform parameter inference via approximate Bayesian computation.

While many genetic simulation software packages exist [1], few are tailored to within-host evolution of viruses. Those that are applicable to virus evolution tend to be individual-based forward simulation models since important biological features such as recombination and selection can be easily incorporated (e.g., the general simulation framework SLiM [2] and the virus-specific simulation framework SANTA-SIM [3]). It is often not straightforward to customize general simulation frameworks for long-term or chronic within-host virus evolution. On the other hand, more specialized packages that are tailored to within-host virus evolution are not focused on explicitly modeling the immune response to infection, for instance detailed modeling of immune epitopes and their cross-reactivity. Furthermore, it is important to validate any given model against real within-host virus sequences, which has been done for simple summary statistics [3,4], but not for more complex statistics such as those generated from reconstructed phylogenies. Finally, while the goal of these tools is to provide researchers with realistic models, they lack pre- and post-processing capabilities that allow for direct incorporation of empirical data into the analysis. Large amounts of empirical data are available for many viruses, and such data can be used to inform model inputs and validate model outputs. Therefore, there is an existing need for a within-host virus sequence evolution simulation framework that is readily informed by empirical data, easy to use, and straightforward to validate against real-world data.

Here, we present the user-friendly R package and corresponding model, wavess, a within-host adaptive virus evolution sequence simulator. wavess allows users to run easily customizable forward-in-time individual-based models to study within-host virus evolution in long-term or chronic infections. Previously, we developed a similar, but simpler, model with HIV in mind [5], which included the ability to simulate virus sequences optionally including selection, recombination, and latency. In our new implementation, providing a comprehensive R package, wavess includes functionality to take empirical data and generate simulation inputs; therefore, with appropriate adaptations to model inputs and parameters, wavess could be applied to other chronic infections such as HCV, or to long-term infections of typically non-chronic viruses such as SARS-CoV-2. We also provide functions for output sequence analysis to support model fitting and downstream analyses. Using HIV-1 as an example, we show that, given empirically informed inputs, the model output sequences recapitulate the trends observed in real data for several distinct phylogenetic summary statistics. Below, we first describe the model and package implementation, then delve into a theoretical sensitivity analysis followed by comparison of model output to empirical data, and finally provide detailed methods for the theoretical and empirical analyses.

Design and implementation

wavess, which is heavily inspired by the model developed by Immonen et al. [5], models the within-host evolution of human viruses as a sequence of synchronized generations beginning with one or more founder sequences. Each generation is one full virus life cycle, from infecting a cell to exiting the cell. Within a generation, infected host cells can change state between being active or latent, viruses in active cells can undergo mutation and recombination, and virus fitness may change based on selective pressures (Fig 1). The next generation of infecting viruses is sampled based on virus fitness. Event counts (e.g., latent cell dynamics, mutation events, cells with a recombination event) and virus sequences are sampled according to a specified sampling scheme.

Fig 1. Overview of the wavess algorithm and R package.

Fig 1

The left panel describes required and optional inputs, the middle panel describes the algorithm itself, and the right panel describes the model output and potential analysis.

Infected cells

Two populations of infected cells are tracked: an active population and a latent one. The active population can be initialized with one or more infected cells, each containing a virus founder sequence; all founder sequences must be of equal length. Each infected cell in the active population contains one or two virus sequences that can mutate and recombine, and active cells can produce viruses that infect the next generation of active cells. By default, the active cell population follows a discrete logistic growth curve. The viruses that are selected for the next generation are randomly sampled based on their fitness (see below). The number of latently infected cells in each generation is determined by user-specified rates of death, proliferation, activation (transition from latent to active), and deposition (transition from active to latent). These rates are constant and do not change over time. Viruses in latent cells cannot mutate or recombine, and latent cells do not produce viruses.

Viruses

Each virus has a nucleotide sequence on which the virus fitness is calculated. Nucleotide sequences must all be the same length and are not allowed to have gaps. In each generation, the viruses in active cells can mutate and recombine. Indels are not modeled.

Mutation.

The probability of a particular nucleotide mutating to another is specified by a per-site and per-generation mutation rate and a nucleotide substitution rate matrix Q [6]. The Q matrix can correspond to any desired model of DNA evolution.

Recombination.

The probability of a recombination event is specified by a per-site and per-generation recombination rate; therefore, there may be more than one cross-over event between two parent viruses in a single active cell. Cells in which at least one template switch occurs are dually infected. All other cells are singly infected.

Selection

Three selective pressures are considered in wavess: (i) conserved fitness, selection constrained by a set of conserved sites in comparison to a reference sequence; (ii) replicative fitness, selection driven by replicative ability in comparison to a reference sequence; and (iii) immune fitness, selection imposed by a co-evolving immune response at a set of pre-defined epitope locations. Each of the three forms of selection has a fitness cost associated with it. The fitness cost is set in the range [0,1), where 0 indicates no cost, and 1, which indicates no ability to survive, is not allowed as we require a non-zero probability of survival.

The fitness, Fj, of virus j is defined by the product of the fitness of each component:

Fj=FjC×FjR×FjI, (1)

where FjC is the conserved fitness, FjR the replicative fitness, and FjI the immune fitness. We describe each of the components of this equation below.

Fitness by comparison to a reference sequence.

Conserved fitness and replicative fitness are computed by comparing nucleotides in the virus to an aligned reference sequence, which can be different for each fitness type. Every site in the aligned founder sequence is labeled either conserved or non-conserved. Conserved fitness dictates that a mutation that causes a difference from the reference nucleotide at a conserved site is deleterious to the virus, thus simulating purifying selection. Replicative fitness dictates that at non-conserved sites, non-reference nucleotides are considered to be slightly less viable than the reference. The equation used to compute FjC and FjR is

FjX=(1c)n, for any X{C,R}, (2)

where n is the number of conserved (non-conserved) sites that differ from the reference sequence for X = C (X = R), and c is the per-site cost associated with a difference from the reference sequence [5] (Fig 2A). The wavess R package uses a default value of c = 0.99 when X = C to impose a high cost of mutations in conserved sites and a default value of c = 0.001 [7] when X = R. For small c, Eq 2 is equivalent to ecn, which has previously been used to simulate replicative fitness [8].

Fig 2. Visualization of model components related to selection.

Fig 2

(A) Fitness by comparison to a reference sequence (Eq 2) for different values of n (number of mutations) and c (cost per mutation). (B) Depiction of the host immune response to an epitope (Eq 3). Over time, an epitope is recognized once it reaches a certain frequency, followed by a gradual linear increase in the strength of the immune response until a maximum immune cost is reached. (C) Beta distribution from which the reduction in immune cost d is sampled, based on edit distance (number of amino acid mutations) h (Eq 4).

A nucleotide position is never considered for both conserved and replicative fitness. When only replicative fitness is modeled, all positions in the simulated sequence that are not an insertion relative to the reference sequence are included in the fitness calculation for FjR. However, when both forms of fitness are used, if the position is considered to be a conserved site, then it is not included when calculating replicative fitness. When modeling conserved fitness, if the nucleotide in the founder sequence at a user-defined conserved site is not the same as the expected conserved nucleotide, then it is not considered a conserved site under the assumption that the virus founder sequence was viable within its true host.

Fitness due to a co-evolving immune response.

We model a co-evolving immune response as part of the host environment, which dynamically adapts to recognize frequent virus amino acid epitopes at pre-defined locations of equal length in the sequence. This immune recognition imposes a fitness cost to viruses which contain that epitope, and the strength of recognition optionally increases over time to allow for modeling processes such as antibody affinity maturation or T-cell clonal expansion. Once a virus epitope is recognized, the memory is retained such that an identical sequence coming up, say from the latent reservoir, is immediately recognized. Further, cross-reactivity results in (weaker) recognition of mutated but similar epitopes.

Once an epitope i reaches a user-defined frequency in the population, the fitness cost due to the immune response, ci, increases linearly each generation t until it reaches a maximum value ci* (Fig 2B), as

ci(t)=min{ci*,ci*(tti0)/g},tti0, (3)

where ti0 is the time when the immune response against epitope i is generated and g is the time it takes to reach its maximum.

To model cross-reactivity when a new epitope variant arises in an environment where the host immune system has already mounted an immune response against the virus, we first determine the number of amino acid positions that differ (edit distance) between the new epitope and each of the immune-recognized epitopes. We then identify the cost ch of the immune-recognized epitope with the minimum edit distance h to the new epitope. The strength of cross-reactivity to the new epitope, cix, and therefore the immune cost to the new epitope, is the product of ch and a random value d (the cross-reactivity factor) drawn from a Beta distribution with α=1 and β=h2 (Fig 2C):

cix=chd. (4)

We chose to use a Beta distribution because we assume that larger edit distances between new and recognized epitopes correlate with lower cross-reactivity. The fitness cost of the cross-reactive epitope remains the same across generations unless it reaches the aforementioned user-defined frequency in the population. If this happens, then the immune response evolves according to Eq 3, where the epitope recognition generation ti0 is based on the cross-reactive immune cost and the original time to maximum immune cost: tcixg. This is to maintain the same slope in immune cost increase over time for all epitopes at a given location.

We assume that the immune fitness cost of a virus j is driven by the epitope with the strongest immune response against it, leading to an immune fitness FjI of:

FjI=1maxijci, (5)

where epitope i must be present in virus j.

Model implementation

The wavess R [9] package includes functions for learning model inputs from empirical sequence data, simulating within-host virus evolution, and generating summary statistics of the output for model evaluation (Fig 1). The back-end of the function that implements the wavess algorithm, run_wavess(), is implemented in Python 3 [10]. We also provide a run_wavess.py file that may be used instead of the R function. The default model parameter values and example inputs included in the package are listed in Table 1 and are tuned to HIV-1 env gp120. The R package, Python script, and example data are hosted on GitHub (https://github.com/MolEvolEpid/wavess). The package includes vignettes that describe in detail generating model inputs, running wavess, and analyzing model outputs.

Table 1. Model parameters.

Parameter Default value [range] a Unit Notes/Reference
Virus
Sequence features
Founder sequence(s) HIV-1 env gp120 NA DEMB11US006 [11]
Conserved sites See Methods NA Empirical [12]
Replicative reference HIV-1 subtype B consensus NA [12,13]
Population
Generation time 1 [12] Days [7]
Starting population size 10 [11000] Infected cells
Maximum growth rate 0.3 [0.011] generation −1
Maximum population size 2000 [1,00010,000] Infected cells [7]
Mutation rate 4×105 [105104] site −1 generation −1 [14]
Q matrix See Methods Unitless [15]
Recombination rate 1.5×105 [106104] site −1 generation −1 [16]
Individual fitness
Conserved cost 0.99 [00.99] Unitless
Replicative cost 0.001 [00.003] Unitless [7]
Immune cost 0.3 [00.9] Unitless Fitted
Host
Immune response
Epitope location See Methods NA Empirical [12]
Number of epitopes 10 NA Empirical [12]
Epitope amino acid length 10 [120] NA Empirical [12]
Threshold epitope count 100 [11000] NA
Time to maximum immune cost 90 [110,000] Days [17]
Latency rates
Deposition rate 0.001 [0.00010.01] day −1
Activation rate 0.01 [0.0010.1] day −1
Homeostatic proliferation rate 0.01 [0.0010.1] day −1 [18]
Latent death rate 0.01 [0.0010.1] day −1 Set to equal proliferation rate

 a Default value as of wavess v1.0.0; range is what is investigated in the sensitivity analysis of this report.

Inputs.

run_wavess() has three required inputs, each of which can be generated with a helper function:

  1. A per-generation infected cell population growth curve (define_growth_curve(); e.g., Fig 3A).

  2. A sampling scheme (define_sampling_scheme()).

  3. The founder sequence(s) of the infection (extract_seqs()).

Fig 3. Visualization of example model inputs and empirical data used to inform model inputs.

Fig 3

(A) Example active (deterministic) and latent (stochastic) growth curves based on default values. (B) Conserved nucleotide sites for the example founder sequence (gp120 of DEMB11US006). (C) Epitope probability distribution for HXB2 gp120. After being converted to nucleotides, these positions can be mapped to the founder sequence.

An example HIV founder sequence (gp120 of DEMB11US006 [11]) is included in the package, but users can also extract a founder sequence from their own alignment or simply provide their own founder sequence.

By default, no selective pressures are modeled. Rather, there are optional inputs corresponding to modeling the three different types of selective pressure: conserved, replicative, and immune. We provide helper functions to generate these inputs based on empirical data:

  1. identify_conserved_sites() takes as input an alignment of the genomic region to be simulated that is representative of virus diversity, and provides conserved sites and a consensus sequence that can be used as the reference sequence for modeling replicative fitness.

  2. sample_epitopes() provides epitope locations randomly sampled from an epitope probability distribution that can be generated with get_epitope_frequencies(), which takes as input known amino acid epitope locations along the sequence.

The example data provided for modeling virus fitness are specific to the example founder sequence, and include a vector of conserved sites (Fig 3B), the HIV-1 subtype B consensus sequence from the Los Alamos National Laboratory (LANL) HIV database [12,13], and ENV amino acid epitope locations (Fig 3C), also from the LANL HIV database [12].

Outputs and analysis.

The model outputs event counts, the sampled sequences, and the fitness of each sampled sequence. We also provide the function calc_tr_stats() to calculate summary statistics based on a phylogeny reconstructed from the output sequence data. We focus on calculating summary statistics that capture various expected characteristics of the phylogenies of chronic virus infections. Note, however, that other summary statistics might be more useful depending on the question(s) investigated. The phylogenetic summary statistics calculated include:

  1. Mean branch length. This measures the phylogenetic inter-node genetic distance, affected by the overall evolutionary rate of the simulated evolutionary system.

  2. Mean change in diversity per unit time, measured as the change in within-timepoint tip-to-tip distance across time (slope) calculated using linear regression. This captures how the virus population diversity has changed over time.

  3. Mean change in divergence per unit time, measured as the change in root-to-tip distance across time (slope) calculated using linear regression. This captures how virus divergence from the root state has changed over time.

  4. Mean leaf depth, measured as the Sackin index normalized by the number of tips in the phylogeny (treebalance::avgLeafDepI()) [19]. This captures how ladder-like the tree is, which may be indicative of the strength of selection.

  5. Mean number of lineages through time, calculated as the maximum parsimony score (phangorn::parsimony()) based on ancestral reconstruction of tree tip labels indicating sampling timepoints [20], normalized by the number of timepoints minus one. This estimate of the mean number of phylogenetic lineages that survived from one sampling to the next is another potential indicator of selection strength.

calc_tr_stats() also takes a branch length threshold, which is used to collapse branches less than that value into hard polytomies prior to computing the mean leaf depth. This allows the user to reduce bias in the leaf depth calculation due to zero or very short branches, which have no information about the bifurcation pattern but still influence the leaf depth calculation.

Dependencies.

The R package dependencies include the Python packages numpy [21] and scipy [22], and the R packages ape [23], phangorn [20], treebalance [19], reticulate [24], dplyr [25], tibble [25], and tidyr [25]. For plotting in the vignettes, ggplot2 [25] and ggtree [26] are used. For the Python implementation, the pandas [27] and Bio [28] packages are also required.

Results

Summary statistics are sensitive to input parameters

Model sensitivity to input parameter values is dependent on the model output of interest, which is defined by the research question at hand. As an example, we investigate the impact of 14 numeric input parameters on the HIV-1 phylogeny resulting from sequences sampled from the active cell population during ten years of infection. To do so, we first generated 42 points in parameter space using Latin Hypercube Sampling (LHS) with values across the ranges defined in Table 1. As expected, these different points resulted in a wide range in the number of mutation, recombination, and latency events, and in virus fitness (Fig 4A).

Fig 4. Sensitivity of model output to input parameters.

Fig 4

(A) Event counts and fitness over time colored by the most relevant parameter value. (B) Normalized mean elementary effect (μ**) of a 20% change in input parameter values across the range in Table 1 relative to stochastic noise for five phylogenetic summary statistics. The partial rank correlation coefficient is shown in parentheses.

We next used the LHS points to generate trajectories through parameter space based on the elementary effects sensitivity analysis method [29] with a perturbation size of 20% of the parameter range. To account for model stochasticity, we computed a normalized mean elementary effect, μ**, where the between-point difference for a parameter is divided by the within-point difference across replicate simulations (see Methods). For some pairs of parameters and summary statistics, there is a similar amount of within-point versus between-point variability (Fig 4B and S1 FigA), indicating that, on average, we do not observe a difference in the summary statistic when that parameter value is perturbed. For other pairs, the between-point difference is larger than the within-point difference (Fig 4B and S1 FigA), indicating that those summary statistics are sensitive to those parameter values. Furthermore, while the correlation between μ** and the mean between-point difference is strong across summary statistics (S1 FigB), there is a weaker correlation between μ** and the partial rank correlation coefficient (PRCC) (S1 FigC). This difference may be because μ** accounts for non-monotonic effects of parameter perturbations on summary statistic values, but PRCC does not.

In these HIV-1 infected host simulations, the parameters with the most influence on the phylogeny were mutation rate and immune cost (Fig 4B). Mean branch length, and mean annual change in diversity and divergence, all branch length statistics, were most sensitive to mutation rate (partial rank correlation coefficient [PRCC]  = 0.69, 0.55 and 0.79, respectively), with a 20% change in parameter value across the range tested leading to a greater than two-fold change over stochastic effects. The mean number of lineages through time was most sensitive to immune cost (PRCC  = −0.70), with a 20% change leading to a greater than three-fold change in mean number of lineages through time relative to stochastic effects. Other parameters for which a summary statistic was at least two-fold higher with a perturbation change compared to no change include epitope length, days to maximum immune cost, and recombination rate. Changing other parameters resulted in less than a two-fold change in all summary statistics, although mean branch length was generally impacted the most. Average leaf depth, a metric of tree imbalance, was not very sensitive to input parameters under the sampling scheme we used here, although this may be due to large within-point stochasticity as the correlation between average leaf depth and immune cost was quite strong (PRCC  = 0.65).

Model output matches empirical data

The usefulness of wavess depends on how well it recapitulates empirical data for the research question of interest. As an example, we investigated how well we can match our model output to empirical HIV-1 env sequences sampled longitudinally from infected individuals [30,31], and whether this can shed any insights into the immune response of the individuals. We carried out this analysis bearing in mind that virus evolution in these individuals was also a stochastic event with only one realization. For this analysis, we used the functions provided in the wavess R package to generate dataset-specific inputs, including founder sequences and sampling schemes. To investigate the host immune response, we varied the immune cost and epitope locations across simulations (see Methods). To evaluate the model, we used the normalized phylogenetic summary statistics described above, and computed the normalized sum of squared errors (SSE) between the phylogeny reconstructed from the empirical data and the phylogenies of each of the simulated sequence alignments. The summary statistics are not strongly correlated with each other in the empirical data (S2 Fig).

We found that selecting different epitope locations did not significantly affect the phylogenetic summary statistics even when accounting for immune cost (S3 Fig). Furthermore, we observed very little difference between simulations with an immune cost of 0 and 0.1, and between simulations with an immune cost of 0.6 and higher; however, summary statistic values differed in the 0.1-0.6 range (Fig 5A and S4 Fig). Within the 0.1-0.6 range, the empirical summary statistic value fell into the range of simulated summary statistic values for all datasets and summary statistics (n = 55; S4 Fig). For most datasets, we observed one immune cost with a lower SSE than the rest, and the value of the immune cost with the lowest SSE differed across individuals (Fig 5B and S5 Fig). This suggests that the selection pressure imposed by the immune response had different strengths among the individuals studied, which mirrors the previously observed heterogeneous immune response among people [3234].

Fig 5. Summary of simulations performed using empirical founder sequences and sampling schemes with varying immune cost.

Fig 5

(A) Phylogenetic summary statistics for each immune cost. (B) Mean sum of the squared errors (SSE) of the simulated data relative to the empirical data for each immune cost. Color indicates number of replicates performed for a given immune cost.

We next took a subset of the simulations including only the top 5% based on normalized SSE for each dataset (n=30 per dataset). In this subset, we again observed that the distribution of immune costs varied by individual (Fig 6A). Across all 11 individuals, the empirical summary statistics were on average 0.78 standard deviations away from the mean simulated value (range: [0.01, 2.36]) (Fig 6B). Furthermore, the summary statistic from the best-fitting simulation result and the empirical data were often very similar (Fig 5B). Finally, for the empirical compared to the best-fitting simulated data, the distribution of rate heterogeneity across nucleotide sites estimated using empirical Bayesian methods had similar shapes with long right tails, but the empirical data had a lower median rate (0.52 versus 0.76) and a higher maximum rate (25.2 versus 12.62) (S6 Fig). Empirical phylogenies and rate heterogeneity distributions of representative datasets with an estimated low, medium, and high immune cost are shown in Fig 7, together with the phylogeny and site rate heterogeneity distribution from the best-fitting simulated data for that sampling scheme.

Fig 6. Summary of top 5% of simulation results based on sum of the squared errors between the empirical and simulated data (n=30 per dataset).

Fig 6

(A) Distribution of immune costs. (B) Distribution of phylogenetic summary statistics. Open black circles indicate the best-fitting simulated data and green asterisks indicate the empirical data value.

Fig 7. Phylogenies reconstructed from empirical (top) and simulated (bottom) HIV-1 within-host sequences.

Fig 7

For representative datasets with an estimated (A) low, (B) moderate, or (C) high immune cost. The color indicates the estimated sampling time from infection. The scale bar (0.01) is in units of substitutions/site. The inset for each plot shows the distribution of site-rates estimated by IQ-TREE using an empirical Bayesian method [35].

These findings suggest that the within-host virus evolution simulations generated by wavess can recapitulate empirical data, and can provide insights into the evolutionary parameters that impact HIV-1 within-host evolution.

Availability and future directions

wavess is hosted on GitHub (https://github.com/MolEvolEpid/wavess). Both the R package and the Python script are included in the repository. While we believe that the current implementation of wavess will allow researchers to answer many biological questions of interest, there is also substantial room to extend the functionality of the package even further. Ways in which wavess could be extended in the future include: (1) allowing for the simulation of segmented viruses, (2) tracking recombination events and virus ancestors and descendants to enable the generation of ancestral recombination graphs and genealogies, (3) dividing the immune response into separate antibody and cytotoxic T cell-like immune responses, (4) adding evolution specific to drug resistance, (5) allowing latent cells to have different lifespans, (6) allowing for compartmentalization of viruses into different tissues, and (7) modeling indels. Furthermore, wavess could be embedded into epidemiological-level models of transmission (e.g. [36]) to make them more realistic. Taken together, we envision that wavess will be an extremely useful tool for the virus evolution research community.

Methods

Elementary effects sensitivity analysis modified for stochastic simulations

We followed the standard elementary effects one-at-a-time global sensitivity analysis algorithm to generate points (sets of 14 parameter values) in parameter space and trajectories of those points [29]. We performed replicate simulations for each point to account for stochasticity in model output. Specifically, we generated 42 trajectories consisting of 15 points each and performed 10 replicate simulations for each point along the trajectory. To ensure broad coverage of the parameter space, the initial 42 points were generated using LHS [37] on the 0-1 range with the R function lhs::optimumLHS() [38]. Note that LHS assumes that the input parameter distributions are uncorrelated, which we believe is a reasonable assumption for the parameters used in the sensitivity analysis here. For a given trajectory, the order in which parameters were perturbed was selected at random and each parameter was perturbed by ±0.2. The perturbation direction was selected randomly, and if the resulting value was outside the range 0-1 then the opposite direction was selected. Values from LHS followed by perturbation were linearly transformed from the 0-1 range to the true parameter value using the ranges indicated in Table 1.

For parameter values that were not perturbed, the default values were used (Table 1). For each simulation, 20 samples were taken from the active cell population at generation 5 and each year post-infection, and a maximum-likelihood phylogeny was inferred from each resulting simulated sequence alignment using IQ-TREE 2 [35] with the GTR+I+R nucleotide substitution model. Phylogenies were rooted on the founder sequence using phytools::reroot() [39]; the founder sequence was then dropped from the tree. For each collapsed rooted phylogeny, we computed the five summary statistics described above using wavess::calc_tr_stats() with a branch length threshold of 1/1503, where 1503 is the length of the founder sequence; these were the model outputs of interest for the sensitivity analysis.

In a standard elementary effects sensitivity analysis, the revised mean elementary effect μk* is computed for each parameter k, which estimates the impact of the parameter on model output by comparing the output between adjacent points in the trajectory [40]. To extend this to stochastic models, one option is to to calculate μk* using the mean value of the model output across replicate simulations for each point in parameter space [41]. However, this method does not allow us to determine whether a given parameter change leads to more of a difference than is stochastically observed across replicates at one point in parameter space. Therefore, we instead modified the elementary effects equation by, for each trajectory, normalizing the mean between-point difference in the value of each model output across replicates by the mean within-point difference of the model output across replicates for the original point. Then we calculated the mean normalized value across all trajectories.

The mean between-point difference for the kth of p parameters for a single trajectory r was calculated as:

Brk=1n2i=1nj=1n|y(x1,,xk1,xk+Δ,xk+1,,xp)iy(x1,,xk1,xk,xk+1,,xp)j|, (6)

where n is the number of replicates (10) and y is a model output. Note that the perturbation size we used here is identical across all simulations (0.2), so we do not need to divide the difference by the perturbation size. This allows for a more straightforward comparison of the normalized values.

The mean within-point difference for parameter k of a single trajectory is given by:

Wrk=2n(n1)i=2nj=1i1|y(x1,,xk1,xk,xk+1,,xp)iy(x1,,xk1,xk,xk+1,,xp)j|. (7)

The normalized mean elementary effect is then calculated as:

μk**=1mr=1mBrkWrk, (8)

where m is the number of trajectories (in our analysis, m=42). μk** thus allows us to determine whether a given parameter change has more of a difference than is stochastically observed using the original parameter value. Note that the framework described here can be used to investigate sensitivity to any sampling scheme and model output of interest.

We also computed the partial rank correlation coefficient (PRCC) for the 42 initial points in parameter space obtained from LHS using epiR::epi.prcc() [42] to investigate the direction of effect between each parameter and summary statistic. Note, however, that this metric only captures monotonic relationships between the parameter and model output.

Model inputs and analysis for comparison to empirical data

Sample information.

We validated wavess using HIV-1 env sequences sampled longitudinally from 11 individuals infected with HIV-1 subtype B (n = 7 from [30], n = 4 from [31]), describing natural HIV-1 evolution. This number excludes individuals from these studies with viruses containing >1% nucleotide sequence diversity at the first time point (n = 3) because we focus on validating the model with datasets where we have sequences from early in infection, and where the individual was likely infected with only one variant. The individuals included in our analysis were either treatment-naive or were eventually put on relatively ineffective treatment targeting non-ENV proteins. In one patient (P2), we observed a decrease in virus load and an increase in CD4 +  T cell counts for the last three sampling times (S7 Fig). As treatment appeared to be working at these times, we removed the corresponding samples from the analysis. We included all other samples, even when patients were being treated, as the treatment did not appear to be working well based on viral load and CD4 +  T cell count. Furthermore, the evolution of envelope sequences should not be influenced by these drugs. We customized the simulations for each individual by using person-specific founder sequences and sampling schemes that matched the empirical data for each one. To obtain sampling times, we assumed that the infection time was the midpoint between the last negative and first positive test; this is considered sero-conversion time by the original authors. Sequences and metadata were obtained from the LANL HIV database [12]. All dataset IDs used here match the IDs from the original publications.

Input sequences.

To generate person-specific founder sequences, we first chose the longest sequence from the first time point for each individual. In the event of a tie, one sequence was chosen randomly. We aligned this sequence to the HXB2 (GenBank accession number K03455) env gene as well as the subtype B sequence DEMB11US006 (GenBank accession number KC473833) [11] using the mafft v7.526 ginsi command [43]. We then trimmed off any portion of the sequence outside the env gene. Next, we identified, across all sequences, the smallest (6225, env start) and largest (7786) HXB2 nucleotide coordinates within env. To ensure consistency across simulations, we chose to simulate this entire portion of env (1562 nucleotides). For sequences that did not span this entire region, we filled in leading and trailing missing nucleotides with DEMB11US006 to generate the full founder sequence.

Conserved sites.

We considered conserved sites to be HXB2 positions that are identical in >99% of sequences in the env filtered alignment of all HIV-1 subtypes from the LANL HIV database [12], identified using wavess::identify_conserved_sites(). These positions were mapped from HXB2 to each individual founder sequence using wavess::map_ref_founder().

Reference sequence.

We used the HIV-1 Subtype B consensus sequence from the LANL HIV database [12,13] as the reference sequence for replicative fitness. This was aligned to the founder sequence using the mafft v7.526 ginsi command [43]

Epitopes.

To determine the epitope length to use for our simulations, we identified all linear human gp120 epitopes from the LANL HIV database and used the median length (10 amino acids, or 30 nucleotides) for our simulations.

We used the binding, contacts, and neutralization features from the LANL ENV features database [12] to determine the probability of an epitope occurring at each HXB2 position across the length of the founder sequence using wavess::get_epitope_frequencies(). We then used this distribution to randomly sample 10 non-overlapping epitopes of length 30nt across the sequence for each simulation replicate using wavess::sample_epitopes(). The same set of epitope locations was used for each set of immune cost simulations, and for each dataset. The start positions were mapped from HXB2 to each individual founder sequence via alignment using wavess::map_ref_founder(); each epitope was set to be 30nt long beginning from this start position.

Parameter values.

As HIV sequence evolution is influenced by the strength of an individual’s immune responses [3234], we varied the immune cost and epitope locations across simulations. Each combination of immune cost (range 0-0.9, n = 10) and epitope location (n = 20 sets for immune costs of 0 and 0.7-0.9, n = 100 for immune costs of 0.1-0.6) was simulated once, yielding 680 simulations per sampling scheme. All other parameters were fixed at the values defined in Table 1. wavess::define_growth_curve() with default values was used to define the infected cell growth curve.

Phylogenies.

We first removed leading and trailing segments of the simulated sequences that originated from DEMB11US006, thus keeping only the portion of the simulated sequences derived from the empirical sequence. Then we built phylogenetic trees for each set of empirical and simulated sequences using IQ-TREE 2 [35] with the GTR+I+R model of sequence evolution and the –rate option to obtain empirical Bayesian estimates of site rates [44]. Phylogenies were rooted on the sequences from the first timepoint using phytools::reroot() [39]. We then calculated the five summary statistics described above for each tree using wavess::calc_tr_stats() with a branch length threshold of 1/Lf, where Lf is the length of the original founder sequence. Next, we computed the difference between phylogenies representing the empirical sequences to the phylogenies reconstructed from each of the sets of simulated sequences by calculating the sum of the squared errors between the empirical and simulated data for centered and scaled values of the five phylogenetic summary statistics.

Data analysis and visualization

We used the Snakemake (v7.32.4) workflow manager [45] to develop an analysis pipeline and parallelize simulations. All data analysis and visualization was performed in R v4.3.3 [9] with the following packages (in addition to wavess): ape v5.8 [23], phytools v2.1.1 [39], tidyverse v2.0.0 [25], epiR v2.0.78 [42], ggcorrplot v0.1.4.1 [46], ggridges v0.5.4 [47], ggtree v3.10.0 [26], treeio v1.26.0 [48], scales v1.3.0 [49], and ggpubr v0.6.0 [50].

Supporting information

S1 Fig. Comparison of metrics to quantify the sensitivity of a model output (facets) to parameter (color) perturbations.

(A) Mean within-point difference in model output across replicates (μ** denominator) compared to mean between-point difference with a parameter perturbation (μ** numerator). Black line is y = x. (B) Normalized mean elementary effect (μ**) compared to mean between-point difference with a parameter perturbation (μ** numerator). (C) Normalized mean elementary effect (μ**) compared to the partial rank correlation coefficient. R indicates Pearson correlation coefficient, ρ indicates Spearman correlation coefficient.

(TIF)

pcbi.1013437.s001.tif (328.1KB, tif)
S2 Fig. Spearman correlation between phylogenetic summary statistics.

Based on 11 maximum-likelihood phylogenies reconstructed from empirical data.

(TIF)

pcbi.1013437.s002.tif (181.4KB, tif)
S3 Fig. Comparison of phylogenetic summary statistics for different sets of epitope locations.

(TIF)

pcbi.1013437.s003.tif (354.8KB, tif)
S4 Fig. Phylogenetic summary statistics.

For each real dataset (red line) and simulated dataset across immune costs (black point).

(TIF)

pcbi.1013437.s004.tif (807.7KB, tif)
S5 Fig. Mean sum of squared errors between simulated and empirical data.

For each dataset and phylogenetic summary statistic across immune costs.

(TIF)

pcbi.1013437.s005.tif (625.9KB, tif)
S6 Fig. Distribution of nucleotide site rates for each dataset.

Calculated by IQ-TREE using an empirical Bayesian method where, for each site, the posterior mean site rate across rate categories is weighted by the posterior probability of the site being in each category. The top row is the rates estimated from real sequence alignments and the bottom row is rates estimated from the best-fitting simulated data.

(TIF)

pcbi.1013437.s006.tif (131.3KB, tif)
S7 Fig. Viral load and CD4 count over time for individuals included in the analysis.

The dotted line indicates when an individual was no longer treatment-naive. Samples removed prior to the analysis are indicated as open circles on the plot.

(TIF)

pcbi.1013437.s007.tif (260.3KB, tif)

Acknowledgments

We thank Macauley Locke and Ruian Ke for discussions about the sensitivity analysis.

Data Availability

All data and code are available on GitHub. The package code and corresponding example data are available at https://github.com/MolEvolEpid/wavess and the code and data to recreate the manuscript analysis and figures are available at https://github.com/MolEvolEpid/wavess_manuscript.

Funding Statement

This work was supported by the National Institutes of Health [R01AI087520 to TL] and the Los Alamos National Laboratory [Laboratory Directed Research and Development program fellowship project no. 20230873PRD4 to ZL and project no. 20210959PRD3 to NS]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1.Genetic Simulation Resources. [cited 2023 Oct 1]. https://surveillance.cancer.gov/genetic-simulation-resources/home/
  • 2.Haller BC, Messer PW. SLiM 4: multispecies eco-evolutionary modeling. Am Nat. 2023;201(5):E127–39. doi: 10.1086/723601 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Jariani A, Warth C, Deforche K, Libin P, Drummond AJ, Rambaut A, et al. SANTA-SIM: simulating viral sequence evolution dynamics under selection and recombination. Virus Evol. 2019;5(1):vez003. doi: 10.1093/ve/vez003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Terbot JW 2nd, Cooper BS, Good JM, Jensen JD. A simulation framework for modeling the within-patient evolutionary dynamics of SARS-CoV-2. Genome Biol Evol. 2023;15(11):evad204. doi: 10.1093/gbe/evad204 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Immonen TT, Conway JM, Romero-Severson EO, Perelson AS, Leitner T. Recombination enhances HIV-1 envelope diversity by facilitating the survival of latent genomic fragments in the plasma virus population. PLoS Comput Biol. 2015;11(12):e1004625. doi: 10.1371/journal.pcbi.1004625 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Felsenstein J. Inferring phylogenies. New York: Oxford University Press; 2003.
  • 7.Balagam R, Singh V, Sagi AR, Dixit NM. Taking multiple infections of cells and recombination into account leads to small within-host effective-population-size estimates of HIV-1. PLoS One. 2011;6(1):e14531. doi: 10.1371/journal.pone.0014531 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Rouzine IM, Coffin JM. Evolution of human immunodeficiency virus under selection and weak recombination. Genetics. 2005;170(1):7–18. doi: 10.1534/genetics.104.029926 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2023.
  • 10.Foundation PS. Python Programming Language. 2023. https://www.python.org/
  • 11.Sanchez AM, DeMarco CT, Hora B, Keinonen S, Chen Y, Brinkley C, et al. Development of a contemporary globally diverse HIV viral panel by the EQAPOL program. J Immunol Methods. 2014;409:117–30. doi: 10.1016/j.jim.2014.01.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.HIV Databases. https://www.hiv.lanl.gov/content/index.
  • 13.Linchangco GV, Foley B, Leitner T. Updated HIV-1 consensus sequences change but stay within similar distance from worldwide samples. Frontiers in Microbiology. 2022;12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Sanjuán R, Nebot MR, Chirico N, Mansky LM, Belshaw R. Viral mutation rates. J Virol. 2010;84(19):9733–48. doi: 10.1128/JVI.00694-10 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Zanini F, Puller V, Brodin J, Albert J, Neher RA. In vivo mutation rates and the landscape of fitness costs of HIV-1. Virus Evol. 2017;3(1):vex003. doi: 10.1093/ve/vex003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Neher RA, Leitner T. Recombination rate and selection strength in HIV intra-patient evolution. PLoS Comput Biol. 2010;6(1):e1000660. doi: 10.1371/journal.pcbi.1000660 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Richman DD, Wrin T, Little SJ, Petropoulos CJ. Rapid evolution of the neutralizing antibody response to HIV type 1 infection. Proc Natl Acad Sci U S A. 2003;100(7):4144–9. doi: 10.1073/pnas.0630530100 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Macallan DC, Busch R, Asquith B. Current estimates of T cell kinetics in humans. Curr Opin Syst Biol. 2019;18:77–86. doi: 10.1016/j.coisb.2019.10.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Fischer M, Herbst L, Kersting S, Kühn AL, Wicke K. Tree balance indices. Springer; 2023. 10.1007/978-3-031-39800-1 [DOI]
  • 20.Schliep KP. phangorn: phylogenetic analysis in R. Bioinformatics. 2011;27(4):592–3. doi: 10.1093/bioinformatics/btq706 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585(7825):357–62. doi: 10.1038/s41586-020-2649-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261–72. doi: 10.1038/s41592-019-0686-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019;35(3):526–8. doi: 10.1093/bioinformatics/bty633 [DOI] [PubMed] [Google Scholar]
  • 24.Kalinowski T, Ushey K, Allaire JJ, Tang Y. Reticulate: Interface to “Python”. 2024. https://cran.r-project.org/web/packages/reticulate/index.html
  • 25.Wickham H, Averick M, Bryan J, Chang W, McGowan L, François R, et al. Welcome to the tidyverse. JOSS. 2019;4(43):1686. doi: 10.21105/joss.01686 [DOI] [Google Scholar]
  • 26.Yu G, Smith DK, Zhu H, Guan Y, Lam TT. ggtree: an r package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol Evol. 2016;8(1):28–36. doi: 10.1111/2041-210x.12628 [DOI] [Google Scholar]
  • 27.Team Tpd. pandas-dev/pandas: Pandas. Zenodo; 2024. https://zenodo.org/records/13819579
  • 28.Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422–3. doi: 10.1093/bioinformatics/btp163 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Morris MD. Factorial sampling plans for preliminary computational experiments. Technometrics. 1991;33(2):161–74. doi: 10.1080/00401706.1991.10484804 [DOI] [Google Scholar]
  • 30.Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, Farzadegan H, et al. Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol. 1999;73(12):10489–502. doi: 10.1128/JVI.73.12.10489-10502.1999 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Dapp MJ, Kober KM, Chen L, Westfall DH, Wong K, Zhao H, et al. Patterns and rates of viral evolution in HIV-1 subtype B infected females and males. PLoS One. 2017;12(10):e0182443. doi: 10.1371/journal.pone.0182443 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Halapi E, Leitner T, Jansson M, Scarlatti G, Orlandi P, Plebani A, et al. Correlation between HIV sequence evolution, specific immune response and clinical outcome in vertically infected infants. AIDS. 1997;11(14):1709–17. doi: 10.1097/00002030-199714000-00007 [DOI] [PubMed] [Google Scholar]
  • 33.Lee HY, Perelson AS, Park S-C, Leitner T. Dynamic correlation between intrahost HIV-1 quasispecies evolution and disease progression. PLoS Comput Biol. 2008;4(12):e1000240. doi: 10.1371/journal.pcbi.1000240 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Lemey P, Kosakovsky Pond SL, Drummond AJ, Pybus OG, Shapiro B, Barroso H, et al. Synonymous substitution rates predict HIV disease progression as a result of underlying replication dynamics. PLoS Comput Biol. 2007;3(2):e29. doi: 10.1371/journal.pcbi.0030029 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020;37(5):1530–4. doi: 10.1093/molbev/msaa015 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Kupperman MD, Ke R, Leitner T. Identifying impacts of contact tracing on epidemiological inference from phylogenetic data. bioRxiv. 2024. 2023.11.30.567148. https://www.biorxiv.org/content/10.1101/2023.11.30.567148v5
  • 37.McKay MD, Beckman RJ, Conover WJ. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics. 1979;21(2):239–45. [Google Scholar]
  • 38.Carnell R. LHS: Latin Hypercube Samples. R package version 1.2.0. 2024. https://CRAN.R-project.org/package=lhs
  • 39.Revell LJ. phytools: an R package for phylogenetic comparative biology (and other things). Methods Ecol Evol. 2011;3(2):217–23. doi: 10.1111/j.2041-210x.2011.00169.x [DOI] [Google Scholar]
  • 40.Campolongo F, Cariboni J, Saltelli A. An effective screening design for sensitivity analysis of large models. Environmental Modelling & Software. 2007;22(10):1509–18. [Google Scholar]
  • 41.Marino S, Hogue IB, Ray CJ, Kirschner DE. A methodology for performing global uncertainty and sensitivity analysis in systems biology. J Theor Biol. 2008;254(1):178–96. doi: 10.1016/j.jtbi.2008.04.011 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Stevenson M, Sergeant E, Heuer C, Nunes T, Heuer C, Marshall J, et al.. epiR: Tools for the Analysis of Epidemiological Data; 2025. https://cran.r-project.org/web/packages/epiR/index.html
  • 43.Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–80. doi: 10.1093/molbev/mst010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Mayrose I, Graur D, Ben-Tal N, Pupko T. Comparison of site-specific rate-inference methods for protein sequences: empirical Bayesian methods are superior. Mol Biol Evol. 2004;21(9):1781–91. doi: 10.1093/molbev/msh194 [DOI] [PubMed] [Google Scholar]
  • 45.Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins-Tinch CH, Sochat V, et al. Sustainable data analysis with Snakemake. F1000Res. 2021;10:33. doi: 10.12688/f1000research.29032.2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Kassambara A. kassambara/ggcorrplot. Original-date: 2016 -01-12T16:51:33Z; 2025. https://github.com/kassambara/ggcorrplot
  • 47.wilkelab/ggridges. Wilke Lab; 2025. Original-date: 2017-09-13T04:05:16Z. https://github.com/wilkelab/ggridges
  • 48.Wang L-G, Lam TT-Y, Xu S, Dai Z, Zhou L, Feng T, et al. Treeio: An R package for phylogenetic tree input and output with richly annotated and associated data. Mol Biol Evol. 2020;37(2):599–603. doi: 10.1093/molbev/msz240 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Scale Functions for Visualization. https://scales.r-lib.org/
  • 50.ggplot2 Based Publication Ready Plots. https://rpkgs.datanovia.com/ggpubr/
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013437.r001

Decision Letter 0

Jordan Douglas

29 Jul 2025

PCOMPBIOL-D-25-01084

wavess: An R package for agent-based simulation of within-host virus sequence evolution

PLOS Computational Biology

Dear Dr. Leitner,

Thank you for submitting your manuscript to PLOS Computational Biology. After careful consideration, we feel that it has merit but does not fully meet PLOS Computational Biology's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 30 days Sep 28 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Jordan Douglas

Academic Editor

PLOS Computational Biology

Natalia Komarova

Section Editor

PLOS Computational Biology

Additional Editor Comments:

Thank you for your interesting, relevant, and highly-polished mansucript; which has met positive feedback from all three Reviewers. Please work through these very minor revisions proposed by the Reviewers. I look forward to seeing the revised manuscript.

Jordan Douglas

Journal Requirements:

If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

1) We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex. If you are providing a .tex file, please upload it under the item type u2018LaTeX Source Fileu2019 and leave your .pdf version as the item type u2018Manuscriptu2019.

2) Please upload all main figures as separate Figure files in .tif or .eps format. For more information about how to convert and format your figure files please see our guidelines: 

https://journals.plos.org/ploscompbiol/s/figures

3) We notice that your supplementary Figures are included in the manuscript file. Please remove them and upload them with the file type 'Supporting Information'. Please ensure that each Supporting Information file has a legend listed in the manuscript after the references list.

4) Please amend your detailed Financial Disclosure statement. This is published with the article. It must therefore be completed in full sentences and contain the exact wording you wish to be published.

1) State what role the funders took in the study. If the funders had no role in your study, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

2) If any authors received a salary from any of your funders, please state which authors and which funders..

If you did not receive any funding for this study, please simply state: u201cThe authors received no specific funding for this work.u201d

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This work offers robust computational tools for simulating virus evolution within a host.

Its framework supports recombination, latent infected cell reservoirs, and three types of selection. The system is applied to investigate selection pressures on HIV-1 env sequences, using longitudinal samples from 11 individuals. Validation against real within-host virus data is thorough and rigorous.

Overall, this is a well-constructed and relevant contribution.

comments:

(i) Figure 1, insets (G) and (I), from Grenfell (2004; https://doi.org/10.1126/science.1090727) compare HIV population-level phylogeny with within-host HIV phylogeny. In this study, the authors found that the phylogenies reconstructed from the simulated sequences closely resembled the within-host phylogenies derived from empirical data across all tested summary statistics.

Given these results, would it be feasible to also reconstruct the population-level HIV phylogeny using this simulation framework?

(ii) Latin Hypercube Sampling (LHS) assumes that the input parameter distributions are uncorrelated. Is this assumption realistic in practice?

(iii) second paragraph section 2.3.3: correct "the the".

(iv) section 6.3: The R version v5.3.3 has not yet been released.

Reviewer #2: In the manuscript ”wavess: An R package for agent-based simulation of within-host virus sequence evolution”, the authors describe the titular R package and the model underlying it. In addition, they thoroughly test its sensitivity to varying parameters and compare its results to real data. The biological and mathematical theory behind the model as well as its results are convincing, and the manuscript is very clearly written (and so is the model code behind it). The model itself represents a significant contribution to the growing field of agent-based modelling in pathogen evolution.

I tested the R package using the detailed instructions provided by the authors, and it works as advertised. While I am no expert on R, their code appears on closer reading to do what is stated in the description in the manuscript. The biological assumptions of the model also seem to be well-founded. Finally, the package allows for a large degree of customisation to suit various modelling needs.

My main comments are really more intended as points of discussion or food for thought:

- The authors state that both viruses and their host cells are treated as agents in the model. However, it seems that the infection process itself is not explicitly modelled. Rather, a sample determined through a fitness function is taken from the previous generation of viral sequences and transferred to the following. This is a valid method, but I would not say that it treats the viruses as agents in their own right. As the agency of host cells is also limited, I would almost say that the main agents of the model are the sequences themselves.

- As a consequence of the above, the phylogeny of the model strains cannot be tracked directly, but has to be reconstructed. Again, this is not a problem as such, but one advantage of the agent-based approach can be the ability to directly track relationships between individual agents (e.g., ancestors and descendants). I think a small note on this choice of methodology, possibly in the section on future modelling directions, could benefit the discussion.

Minor comments:

- Drawing a random number from a beta distribution to calculate cross-reactivity seems reasonable, but it would be a good idea with a note on the reasoning behind using this specific distribution.

- The expression given in eq. 3 appears to always give an immune cost of 0 at time t0. However, eq. 4 then describes a starting immune cost. Should this be added to eq. 3?

- It is stated in section 4.2 that 96 % of empirical summary statistics fall within the range of simulated statistics. Since a probability distribution is calculated based on the model results, would it be possible to test whether this observation matches the theoretical probability of outliers?

- In eq. 8, the sum runs over the variable t, but no t appears inside the sum. It would make the equation easier to understand to explicitly show what exactly is summed over.

- On page 10, line 4: “We next subset the simulations to include…”. I found this sentence hard to parse, and suggest replacing it with “We next took a subset of the simulations including…”

- Page 12, line 7: a small grammatical error in “…which parameters was perturbed…”

All in all, I find this model highly interesting and the associated package quite user-friendly. I thank the authors for their contribution to this interesting field.

Reviewer #3: This paper serves as a comprehensive introduction and explanation of a recently implemented R package for simulating within-host evolution of viral sequences. The package simulates models of viral fitness and selection, host immune response, latent reservoirs, and recombination. The model virus that this was built for is HIV but it is applicable well beyond that.

I found the paper easy to follow and well written. The package works as described and has comprehensive vignettes to walk the user through the main functions.

In my view, the paper is publishable as is but could be improved by addressing the following:

0. I think you should make some of the use cases for this software clearer - how would people use it to "allow for the investigation of factors such as the

role of recombination"? (to quote the abstract). An obvious way would be to do parameter inference eg via approximate Bayesian computation or similar - you could mention and reference some works along these lines.

1. in the intro, explain how the paper is structured. Many of the details are relegated to section 6 - it would be good to signal that up front.

2. "Many modeling frameworks" referred to in the intro but only 2 referenced. Add some references here or point to another article where we can find a fuller list.

3. figure 4 needs to be larger to be able to read it, currently lots of detail and a very small font

4. Section 3.2 Page 8: I was confused by "Mean number of transitions per timepoint" as I was unsure what you meant by transition (my mind first turned to transitions vs transversions.) Looking back several pages in the paper I understood but you could remind readers here what you mean.

5. IN the R package, the vignettes are not being found by the system. Eg:

> library(wavess)

> vignette("prepare_input_data")

Warning message:

vignette ‘prepare_input_data’ not found

6. (v minor point) In the vignettes, pipes |> and %>% are both used - choose one (|> works throughout by the looks)

7. With any simulation package, users always ask "but what about...?" In section 5 you discuss a little about future directions but it is not clear how you have structured the implementation to allow others to contribute extensions themselves. THis would be worth a word (or at least thinking about).

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Andreas Eilersen

Reviewer #3: No

Figure resubmission:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. If there are other versions of figure files still present in your submission file inventory at resubmission, please replace them with the PACE-processed versions.

Reproducibility:

To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013437.r003

Decision Letter 1

Jordan Douglas

14 Aug 2025

Dear Leitner,

We are pleased to inform you that your manuscript 'wavess: An R package for simulation of adaptive within-host virus sequence evolution' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Jordan Douglas

Academic Editor

PLOS Computational Biology

Natalia Komarova

Section Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013437.r004

Acceptance letter

Jordan Douglas

PCOMPBIOL-D-25-01084R1

wavess: An R package for simulation of adaptive within-host virus sequence evolution

Dear Dr Leitner,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofia Freund

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Comparison of metrics to quantify the sensitivity of a model output (facets) to parameter (color) perturbations.

    (A) Mean within-point difference in model output across replicates (μ** denominator) compared to mean between-point difference with a parameter perturbation (μ** numerator). Black line is y = x. (B) Normalized mean elementary effect (μ**) compared to mean between-point difference with a parameter perturbation (μ** numerator). (C) Normalized mean elementary effect (μ**) compared to the partial rank correlation coefficient. R indicates Pearson correlation coefficient, ρ indicates Spearman correlation coefficient.

    (TIF)

    pcbi.1013437.s001.tif (328.1KB, tif)
    S2 Fig. Spearman correlation between phylogenetic summary statistics.

    Based on 11 maximum-likelihood phylogenies reconstructed from empirical data.

    (TIF)

    pcbi.1013437.s002.tif (181.4KB, tif)
    S3 Fig. Comparison of phylogenetic summary statistics for different sets of epitope locations.

    (TIF)

    pcbi.1013437.s003.tif (354.8KB, tif)
    S4 Fig. Phylogenetic summary statistics.

    For each real dataset (red line) and simulated dataset across immune costs (black point).

    (TIF)

    pcbi.1013437.s004.tif (807.7KB, tif)
    S5 Fig. Mean sum of squared errors between simulated and empirical data.

    For each dataset and phylogenetic summary statistic across immune costs.

    (TIF)

    pcbi.1013437.s005.tif (625.9KB, tif)
    S6 Fig. Distribution of nucleotide site rates for each dataset.

    Calculated by IQ-TREE using an empirical Bayesian method where, for each site, the posterior mean site rate across rate categories is weighted by the posterior probability of the site being in each category. The top row is the rates estimated from real sequence alignments and the bottom row is rates estimated from the best-fitting simulated data.

    (TIF)

    pcbi.1013437.s006.tif (131.3KB, tif)
    S7 Fig. Viral load and CD4 count over time for individuals included in the analysis.

    The dotted line indicates when an individual was no longer treatment-naive. Samples removed prior to the analysis are indicated as open circles on the plot.

    (TIF)

    pcbi.1013437.s007.tif (260.3KB, tif)
    Attachment

    Submitted filename: wavess_review_responses.pdf

    pcbi.1013437.s008.pdf (125.8KB, pdf)

    Data Availability Statement

    All data and code are available on GitHub. The package code and corresponding example data are available at https://github.com/MolEvolEpid/wavess and the code and data to recreate the manuscript analysis and figures are available at https://github.com/MolEvolEpid/wavess_manuscript.


    Articles from PLOS Computational Biology are provided here courtesy of PLOS

    RESOURCES