phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets

Nicola De Maio; William Boulton; Lukas Weilguny; Conor R Walker; Yatish Turakhia; Russell Corbett-Detig; Nick Goldman

doi:10.1371/journal.pcbi.1010056

. 2022 Apr 29;18(4):e1010056. doi: 10.1371/journal.pcbi.1010056

phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets

Nicola De Maio ^1,^*,^#, William Boulton ^1,^¤a,^#, Lukas Weilguny ¹, Conor R Walker ^1,^2,^¤b, Yatish Turakhia ³, Russell Corbett-Detig ^4,⁵, Nick Goldman ¹

Editor: Ville Mustonen⁶

PMCID: PMC9094560 PMID: 35486906

Abstract

Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, and are an essential component of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here, we present a new algorithm and software for efficiently simulating sequence evolution along extremely large trees (e.g. > 100, 000 tips) when the branches of the tree are short, as is typical in genomic epidemiology. Our algorithm is based on the Gillespie approach, and it implements an efficient multi-layered search tree structure that provides high computational efficiency by taking advantage of the fact that only a small proportion of the genome is likely to mutate at each branch of the considered phylogeny. Our open source software allows easy integration with other Python packages as well as a variety of evolutionary models, including indel models and new hypermutability models that we developed to more realistically represent SARS-CoV-2 genome evolution.

Author summary

One of the most influential responses to the SARS-CoV-2 pandemic has been the widespread adoption of genome sequencing to keep track of viral spread and evolution. This has resulted in vast availability of genomic sequence data, that, while extremely useful and promising, is also increasingly hard to store and process efficiently. An important task in the processing of this genetic data is simulation, that is, recreating potential histories of past and future virus evolution, to benchmark data analysis methods and make statistical inference. Here, we address the problem of efficiently simulating large numbers of closely related genomes, similar to those sequenced during SARS-CoV-2 pandemic, or indeed to most scenarios in genomic epidemiology. We develop a new algorithm to perform this task, that provides not only computational efficiency, but also extreme flexibility in terms of possible evolutionary models, allowing variation in mutation rates, non-stationary evolution, and indels; all phenomena that play an important role in SARS-CoV-2 evolution, as well as many other real-life epidemiological scenarios.

This is a PLOS Computational Biology Methods paper.

Introduction

Sequence evolution simulation is important for many aspects of bioinformatics [1]. Its most ubiquitous applications are for testing and comparing the performance of various essential tools (such as alignment, phylogenetic, and molecular evolution inference software, see e.g. [2–4]). However, simulating sequence evolution is also often used for testing hypotheses (e.g. [5]) and for inference, either for example through Approximate Bayesian Computation [6, 7], see [8, 9], or, more recently, using deep learning, see [10–12].

Many simulators address the task of simulating gene trees, or ancestral recombination graphs, as well as simulating evolution along these trees (e.g. [13–16]). Instead, here we focus on the problem of generating sequences given an input tree, as done by “phylogenetic” simulators (e.g. [17–19]). Realistic simulation of sequence evolution along a phylogenetic tree is essential, for example, for assessing and improving our methods for inference of SARS-CoV-2 phylogenies, which is a largely still open problem [20]. One important factor is the large numbers of available genome sequences for SARS-CoV-2 (> 3, 000, 000 in the GISAID database [21] as of September 2021). Despite this, there are currently no available simulation frameworks capable of simulating the scale and complex evolutionary features of SARS-CoV-2 and similar genome datasets. For this reason, we focus on the issue of simulating realistic substitution patterns for large datasets of closely related samples, as broadly observed in genomic epidemiology sequence data, and for arbitrarily complex substitution and indel models.

Here we show that sequence simulation for such large numbers of genomes is exceedingly computationally demanding for existing software. Complex evolutionary models, for example codon substitution models and rate variation, can cause significant further slow-downs. Furthermore, many existing methods do not allow the simulation of mutational patterns realistic for SARS-CoV-2, such as non-stationary and highly variable mutational processes [22–24], or don’t allow the simulation of indels. We propose a new approach to efficiently simulate the evolution of many closely related genomes along a phylogenetic tree and under general sequence evolution models. Our approach simulates one mutation (substitution or indel) at a time using the Gillespie method [25], and is further tailored to reduce time and memory demand by efficiently representing and storing information regarding non-mutated positions of the genome. Furthermore, we use a multi-layered search tree structure to efficiently sample mutation events along the genome even when each position has its own mutation rate, and to efficiently traverse the phylogenetic tree and avoid redundant operations. Our approach empowers extremely flexible and fine-grained evolutionary models. For example, non-stationary models are specifiable, with each nucleotide position of the genome assigned a distinct mutational profile, and each codon a distinct nonsynonymous/synonymous rate ratio. Similarly flexible indel models are also specifiable.

Materials and methods

We consider the problem of simulating evolution of a DNA (or RNA) sequence along a specified input phylogenetic tree, and under a given evolutionary model. Our simulation approach is based on the Gillespie method [25], as is typically used in molecular evolution simulators [18, 19]. We assume that each position of the genome (either nucleotide or codon) evolves independently of the others, and under a time-homogeneous substitution process; that is, the rates of evolution at each position are initially specified by the user or are sampled randomly by the simulator. We focus on the efficient simulation of sequence evolution for large phylogenetic trees with short branches: we assume that only a few mutations happen on each branch across the genome, which is typical for genomic epidemiology, and in particular for SARS-CoV-2 [26].

Simple approach

If we assume that evolutionary rates are homogeneous across the genome, it is simple to use the Gillespie approach efficiently in this scenario by adopting an efficient representation of ancestral genomes in terms of differences with respect to a root genome [27]. As a very simplified example, let’s consider the case in which there is no selective force at play, mutation rates are constant across the genome, there are no indels, and all bases mutate into all other bases at the same rate (JC69 model [28] with equal nucleotide frequencies). Throughout the manuscript, we will not assume equilibrium or stationarity in sequence evolution, but instead assume that we are given a genome at the root of the phylogeny, which we then evolve down the tree according to given rates.

In this simplified scenario, the total mutation rate across the genome is equal to the mutation rate for one base, 3r, times the genome length (which we assume constant), L. Starting from the root and its genome, we visit each branch of the tree one at the time in preorder traversal. For each branch of the tree, we consider its length t_b, and we recursively sample a time for the next mutation from an exponential distribution with rate parameter 3rL. If the sampled time t is over t_b, we move to the next branch. Otherwise, we decrease t_b by t and we sample a mutation event. In the considered scenario, this simply means sampling one position of the genome at random (a random integer number 1 ≤ i ≤ L), and then a random allele b, different from the current allele at position i, to mutate into. Additional steps are also required to keep track of mutations which have already occurred and allow them to further mutate, for example, possibly reversing a mutated allele back to the reference allele. We track each sampled mutation by adding it to a list of mutations for the current branch. It is worth noting however that there are more efficient ways to keep track of mutations that have already occurred, which we discuss in subsequent sections.

A pseudocode description of the algorithm is given in Algorithm 1. So overall, the total cost of this simple algorithm is constant in genome size, and is linear in the number of tips N. It does however scale with the number of mutation events (total tree length) M = O(Nl) where l is the average number of mutations per branch. The initialization step has cost O(N) in order to read the phylogenetic tree, and further O(L) with more complex models in order to keep track of the positions of different alleles. Performing the simulations has cost O(M log(N) + M² log(N)/N) = O(l² N log(N)); the main cost here is to screen previous mutation events at each new mutation, and this can be significantly reduced as explained in the next section. There is a caveat however. The default output of our software phastSim is a concise representation of the simulated sequences. If, however, we want to produce a file containing the full alignment in FASTA or PHYLIP formats, the memory and time cost of the algorithm will become O(NL).

Algorithm 1 Simple algorithm for one phylogenetic branch.

Here evolution on one branch is considered. t_b is initialized as the length of the considered branch. r is the mutation rate from one nucleotide to any other nucleotide. L is genome size. ref[i] is the reference allele at position i. “Node” is the child node of the currently considered branch.

Sample t (the time to next mutation event) from an exponential distribution with rate parameter 3rL.

while t < t_b do

Sample a random integer 0 < i ≤ L

if i is not a position previously mutated in an ancestor of Node then

a ← ref[i].

else

a is the current allele for Node

end if

Sample a random new allele b ≠ a.

Add mutation (i, a, b) to the list of mutations of Node.

Update current allele for Node at position i as b.

t_b ← t_b − t

Sample t from an exponential distribution with parameter 3rL.

end while

In classical implementations of sequence evolution simulators [17], for each node of the tree we need to update each base of the genome one at the time, therefore incurring in cost O(NL). Therefore, when the number of expected mutations is M ≪ NL we expect an advantage in using this approach.

A considerable limitation of the above simple approach is that we assume that rates are the same across the genome, and this is hardly realistic [29, 30]. We implemented an extension of this algorithm above which accounts for both an arbitrary nucleotide substitution model (UNREST [31]), and for rate variation across the genome in terms of a finite number of rate categories. To achieve this, we extended the algorithm above to keep track of which positions of the genome have which rates. This allows us to efficiently calculate total mutation rates for each class of sites, and to efficiently sample sites within a class.

We also implemented a new model of rate variation in order to better fit the patterns of hypermutability observed in SARS-CoV-2. In this model, small proportions of hypermutable sites are given a (possibly much) higher mutation rate. At an hypermutable site, only one specific mutation rate (from one nucleotide to one other nucleotide) is enhanced. For example, one such site with hypermutability might have only the G→T mutation rate increased 100-fold, while all other rates at that site remain the same. This is to model the effects observed in SARS-CoV-2 which are possibly attributable to APOBEC and ROS activity (or other still unclear mechanisms) [22, 23].

However, as the number of site classes increases, and as the number of alleles increase (for example when considering codon models), the efficiency of the extension of the simple approach described above deteriorates, especially when each site of the genome is given different evolutionary rates. For this reason, we developed a more complex (“hierarchical”) algorithm that remains efficient in light of rate variation, with only a small efficiency sacrifice relative to the simple method in the scenario of no rate variation. We allow phastSim users to choose between the simple approach or the more complex hierarchical one, that we describe below. Advanced features, for example simulation of indels, are only implemented with the hierarchical algorithm.

Hierarchical approach

Binary genome search tree

We first describe the structure and algorithm that allow us to efficiently sample a mutation event along the genome when each position might have a distinct mutation rate. This structure needs to be efficiently updatable following a mutation event; in fact, a mutation event changes the allele at a position of the genome, and therefore also its mutation rate. This is very similar to the problem of sampling from a categorical distribution with many elements, where the probabilities can be slightly modified at each sample [32]. A Huffman tree [33] would be close to optimal for this task, however, here we implement a binary search tree, which has a slightly higher expected cost [32] but allows us to more efficiently model blocks of contiguous nucleotides, and therefore to efficiently simulate indels.

In our “genome” search tree (which is distinct from, and should not be confused with, a phylogenetic tree), each node corresponds to a contiguous block of nucleotides along the genome. The root node represents the whole genome, and contains a rate value corresponding to the global mutation rate of the whole genome. The two children of the root correspond to the first and the second half of the genome, respectively. There is no overlap between the regions considered by each child node, and their union gives the region considered by the parent node. Consequently, the sum of the rates of the children of a node is equal to the rate of the node. Given this structure, we also refer to this binary search tree as the “genome” tree. A terminal node of the genome search tree corresponds to one unit of the genome, either a base or a codon, depending on the model we choose for simulations. A terminal node contains not only information about the position of the unit along the genome, but also the reference allele at this position and the mutation rates associated with it, to allow sampling of a specific mutation event at the given position/node. A graphical representation of an example genome search tree is depicted in Fig 1.

Fig 1 — An example genome search tree for ancestral genome ACGGT. Blue nodes are terminal and red nodes are internal. Inside each node we represent on top the genome positions represented by the node; at the center inside terminal nodes we show the allele of the node; at the bottom of nodes is their total rate. Under each terminal node we show the example relevant mutation rates. The path highlighted in orange shows an example sampling of one mutation event. A parameter R is assigned an initial random number sampled uniformly between 0 and the total rate 8.1, in this case it is R = 4.7. As we move downward, the value of R can decrease, as described in Algorithm 2, determining which site will mutate and how. Here, an initial R = 4.7 results in the sampling of a G→T mutation at genome position 4.

Sampling a mutation time is done as in the simple approach: sampling from an exponential distribution with rate parameters determined by the total mutation rate at the root of the genome search tree. Then, to sample a specific substitution event at a specific genome position, we first sample a random value uniformly in [0, 1) and multiply it by the total mutation rate R. Then, we traverse the tree from the root to the terminal node corresponding to the mutated position, which takes log(L) time. Finally, once reaching the corresponding terminal node (genome position) we choose a random substitution event affecting this position and correspondingly a new allele a for this position. An example mutation sampling is depicted in Fig 1. A pseudocode description of this algorithm is given in Algorithm 2. The cost of this approach is linear in the number of alleles, making it much more efficient than classical simulation methods based on matrix exponentiation when large state spaces (e.g. codon models) are considered. Furthermore, the computation cost for simulating under a codon model can be further reduced by considering that typically a codon model only allows a maximum of 9 substitution events from any codon, so at each terminal node we only need to consider a maximum of 9 events and rates at any time. Thanks to this, the cost of running a codon model with this approach is similar to the cost of running a nucleotide model.

Algorithm 2 Sampling of a substitution event along a genome search tree, and updating the genome search tree.

We assume we are given node “Root”, the root of a genome search tree structure. For any given genome node, “Node”, Node.rate represents the corresponding total subtree rate. Node.children represents the list of its child nodes (2 children for internal nodes, 0 for terminal nodes). Node.parent is the parent node of “Node”. Node.allele is its current allele. Node.rates (which is only allocated for terminal nodes) represents the 2-dimensional matrix of mutation rates (from any allele to any other allele) at the given position, and for simplicity here we assume that a rate of an allele into itself (a matrix diagonal entry) is 0; for codon models, for efficiency the rows of the matrix are only allocated and filled when they are needed for the first time. Node.position (only defined for terminal nodes) refers to the genome position represented by the node. The list “mutations” is used to record the mutation events simulated on the considered phylogenetic branch. The function “sample(rates)” samples a rate from a list proportional to its value. We are also given a random number 0 ≤ R < Root.rate for which we want to sample the corresponding mutation event.

Node ← Root

while Node is not terminal do

for Child in Node.children do

if Child.rate < R then

R ← R − Child.rate

else

Node.rate ← Node.rate − Child.rate

Node ← Child

break

end if

end for

end while

oldAllele ← Node.allele

Node.allele ← sample(Node.rates[oldAllele])

mutations.append([Node.position,oldAllele,Node.allele])

Node.rate ← ∑_bNode.rates[Node.allele][b]

NewRate ← Node.rate

while Node.parent is not null do

Node ← Node.parent

Node.rate ← Node.rate + NewRate

NewRate ← Node.rate

end while

As mentioned before, once a mutation event is sampled, we need to modify the sampling process so that the change in allele at the mutated position is taken into account, since this change usually affects local and global mutation rates (a rare exception is for example when substitution rates are all equal). Modifying our genome search tree following a substitution event is both simple and efficient: we simply need to modify the rates and allele at the mutated terminal node, and then update the rate of all ancestors of this terminal node accordingly. Algorithm 2 for example describes how to sample a substitution event from a genome search tree as well as how to update the genome search tree accordingly. Again, this can be done in log(L) time for each new substitution event sampled. However, while this is efficient for simulating evolution along a temporal line, that is, along a single branch of the phylogeny descendant from the root, it becomes inefficient for simulating evolution along a phylogenetic tree. This is because, if we modify the tree, then we cannot use it as it is for the sibling nodes. In other words, when we reach a split in the phylogenetic tree, and we have two children of the same phylogenetic node, we need to pass the same genome search tree of the phylogenetic parent node to both phylogenetic children. However, we can’t only pass a pointer to the same tree to both children, because evolving along one branch leading to one sibling would modify the genome search tree also for the other sibling. If we take the approach of duplicating the genome search tree at each phylogenetic split, we end up with a cost O(NL), which we are trying to avoid. For this reason, we devise an alternative, hierarchical, multi-layered approach to evolving a genome search tree, described below. Later on in the text we also describe the extension of our approach and of the genome search tree structure to simulate indels.

Hierarchical, multi-layer evolving genome search tree

In order to use our genome search tree structure to sample mutations along a phylogenetic tree, we add a further “vertical” dimension to it. At each branch of the phylogenetic tree, instead of modifying a genome search tree, we take the approach of building on it, without modifying the starting genome search tree nodes, so that the original genome search tree is not lost but instead is preserved at “layer 0” of our multi-layer structure. When we sample a mutation, we create a few new genome search tree nodes in the corresponding layer of the structure, instead of not an entire new genome search tree. By doing this, we can effectively adapt a (multi-layer) genome search tree as new mutations are sampled without losing the original genome search tree. This means that when we can pass the same genome search tree to two children of a phylogenetic node without needing to duplicate the genome search tree structure. Instead, we simply remove (de-allocate, or ignore) the genome search tree nodes that have been added to other layers by the descendants of the first child node, and pass the same genome search tree structure to the two considered child nodes. A graphical representation of an example multi-layer genome search tree and its evolution is given in Fig 2.

Fig 2 — We track the evolution of the multi-layer genome search tree starting from the genome search tree of Fig 1. Colors for the genome search tree are the same as in Fig 1 (right side of each panel). On the left side of each panel, we show an extract of the phylogenetic tree containing three nodes (“P” for parent, which in this example is the root of the phylogeny, and “L” and “R” for left and right node). “L” has further descendants, but we don’t show them here and only focus on this triplet of nodes as an example. The green arrow along the phylogenetic tree shows the current step of the preorder traversal being considered by the given panel. Black arrows show past steps. Vertical dashed lines in the multi-layer genome search tree connect nodes that represent the same portions of the genome but that are in different layers. “L0” stands for “Layer 0” and “L1” for “Layer 1”, etc. A At the phylogenetic root “P” we initialize the genome search tree for layer 0. B As we move to child “L”, a new substitution is sampled (as in Fig 1) and 3 corresponding genome nodes are created in layer 1. These nodes correspond to the nodes in the original genome search tree whose rate is affected by the new mutation. C As we traverse the subtree of the descendants of L, new nodes and mutations might be added in the layers below. D We are finished traversing the subtree of the descendants of L, and we return to L, at which point all nodes in layer below 1 have either been removed or have become irrelevant. E We return to P, at which point the genome search tree nodes previously added layer 1 are also ignored or deleted. F We move from P to R, and in doing so new mutation events might be sampled and the corresponding genome nodes might be added to layer 1 (new genome search tree nodes corresponding to 1 new substitution are shown in the new layer 1).

We start with a genome search tree at the phylogenetic root node; additional nodes are then added at further layers. A genome search tree layer n represents the genome nodes specific to a particular depth of the phylogenetic tree; phylogenetic nodes closer to the root (in terms of number of branches that need to be traversed from the root) will be associated with a lower n, and those more distant from the root with higher n. All the initial nodes of the original genome search tree belong to layer 0, the layer corresponding to the phylogenetic root. Then, as we move from the phylogenetic root to its first child, we add nodes to the tree in layer 1, representing the consequences of mutation events happening along the branch between the phylogenetic root and the first child. Nodes in layer 0 only point to nodes in layer 0, and never to nodes in other layers. More generally, nodes in layer n only point to nodes in layers m ≤ n. Every time the multi-layer genome search tree is passed from phylogenetic parent (layer n) to child (layer n + 1), new nodes are added to the corresponding layer (n + 1) if mutation events occur on the corresponding phylogenetic branch.

We traverse the phylogenetic tree in preorder traversal, so, starting from the root, we move to the first child, to which we pass the initial genome search tree, add new layers, then do the same for this child’s children. For each new mutation occurring on this phylogenetic branch connecting the root and its first child, we traverse the genome search tree, and every time we would modify the genome search tree (to update the mutation rates following a change of allele at a position) we instead create new genome search tree nodes in the child layer. Once we have traversed the whole phylogenetic subtree of the first child of the root, we have to move to second child of the root. This operation does not incur the cost of duplicating any part of the genome search tree, as we only need to pass to the second child the pointer to the root of layer 0 of the hierarchical genome search tree. Similarly, at any internal phylogenetic node at layer n, to both children we pass the pointer to the root of layer n of the genome search tree. The only additional step which might be required is the de-allocation of nodes in layer n + 1 as we move from one node to its sibling (thanks to our preorder traversal, the nodes currently in and below this layer will not be used again), but this step at most only slows simulations by a small constant factor.

At the start of the simulations for each branch, moving from layer n to n + 1, we first create a new genome root node for layer n + 1. This root initially points to the same children as the genome root at layer n, and it also has the same total rate. After creating a new layer root, we sample mutation events for the current phylogenetic branch. To sample mutations, we follow the binary search tree determined by the root of layer n + 1. As a new mutation event is picked, we either create new layer n + 1 nodes, or modify existing layer n + 1 nodes. When sampling a new mutation, every time we reach a node in the genome search tree, we either modify the rate of the node, if it’s in layer n + 1, or we create a new layer n + 1 node, if the original node was in a different layer. The new node is given at first the same children as the original node. When a terminal node is reached, we calculate its new rates (unless they have already been created before for some other node in the phylogenetic tree, in which case we just retrieve them from a dictionary) and total rate, and we pass the new total rate to its parent node, which uses it to update its own total rate, and so on. In total, the cost of sampling a new mutation event and updating the multi-layered structure is O(log(L)). A sketch of the mutation sampling process and multi-layer genome search tree update is given in Algorithm 3. The total cost of the algorithm is then O(L + N + M log(L)), where the addendum L is due to the initial creation of the layer 0 genome search tree, and N is due to the tree traversing process. In a scenario like SARS-CoV-2 genomic epidemiology, this can lead to dramatic reduction in computational demand compared to the standard O(NL) in the field, since M appears typically not distant from N [22–24]. We give a summary of the global hierarchical method in Algorithm 4.

Algorithm 3 SampleMutation(Node,Layer,R): Sampling of a mutation event along a multi-layer genome search tree.

This function is initially run on the root node “Root” of a genome search tree for layer “Layer”. Parameters are as in Algorithm 2; in addition, Node.layer represents the layer of the considered node. While below we simplify a few details, in reality we don’t recalculate rates at every mutation, but we only calculate them the first time they are needed, and then store them in dictionaries.

if Node.layer ≠ Layer then

create a new node NewNode copy of Node

NewNode.layer ← Layer

Node ← NewNode

end if

if Node is terminal then

sample mutation event from Node.allele using Node.rates and R.

expand if needed Node.rates, and update Node.rate and Node.allele

return Node

else

for c in Length(Node.children) do

Child ← Node.children[c]

if Child.rate < R then

R ← R − Child.rate

else

Node.rate ← Node.rate − Child.rate

NewChild ← SampleMutation(Child,Layer,R) {note that SampleMutation is the current function, so this function is called recursively}

Node.children[c] ← NewChild

Node.rate ← Node.rate + NewChild.rate

return Node

end if

end for

end if

Algorithm 4 SimulatePhyloNode(Node,GenomeNode,Layer): Hierarchical algorithm for simulating sequence evolution along a branch of the phylogenetic tree.

Here evolution on a branch of the phylogenetic tree is considered. The branch is passed through Node, which represents the child node of the branch. The branch length is Node.length. Simulation of the whole phylogeny is performed by calling SimulatePhyloNode(Root,GenomeRoot,0), where Root is the root of the phylogenetic tree (we assume Root.length = 0) and GenomeRoot is the root of the initial genome search tree for layer 0. This layer 0 genome search tree is created by considering the genome of the phylogenetic root, which is typically either sampled at random or read from a reference genome.

Sample t (the time to next mutation event) from an exponential distribution with rate parameter GenomeNode.rate

CurrentTime ← t

while CurrentTime < Node.length do

Sample a random uniform vaule 0 ≤ R < GenomeNode.rate

GenomeNode ← SampleMutation(GenomeNode,Layer,R) {note that this is the function defined by Algorithm 3}

Sample t (the time to the next mutation event) from an exponential distribution with rate parameter GenomeNode.rate

CurrentTime ← CurrentTime + t

end while

for Child in Node.children do

run SimulatePhyloNode(Node,GenomeNode,Layer+1) {note that this function is the one defined in the current Algorithm, which is therefore recursive in nature}

end for

if needed, de-allocate all nodes of layer Layer from GenomeNode down to its descendants. {Because genome nodes with layer Layer are descendant only from nodes with layer Layer, we do not need to traverse the whole multi-layer genome search tree, but only its layer Layer.}

Indels

We further extended the multi-layer genome search tree approach to efficiently simulate insertions and deletions. Each leaf on the genome search tree is assigned a deletion rate, insertion rate, and substitution rate, denoted R_d, R_i, and R_s respectively, and the total mutation rate for the leaf will be R_d + R_i + R_s. The substitution rate R_s itself is the sum of all substitution rates from the current allele of the leaf. Insertions are modeled as occurring on the right (3′ end) of the sampled position; to model insertion at the 5′ end of the genome, a dummy terminal genome search tree node is employed representing the leftmost end of the genome, and is initialized with R_s = R_d = 0 but with non-zero R_i. Just as with the substitution rates, which can be site specific, R_d and R_i can be drawn from a gamma distribution, or can be constant across the genome. When a mutation event is sampled at a node, it will be sampled as a deletion, an insertion, or a substitution proportionally to R_d, R_i and R_s.

Our software allows for indels with lengths drawn from a number of parametric distributions following the options allowed with INDELible, see Table 1 for an overview of the various distributions that have been implemented. Sampled indels have always length ≥ 1.

Table 1. Indel length distribution options.

Distribution	Parameters	P(X = n), n > 0
Geometric	p	(1 − p)ⁿ⁻¹p
Negative Binomial	p, k	$(\binom{k + n - 1}{n}) {(1 - p)}^{n - 1} p^{k}$
Zeta	a	n^−a/ζ(a)
Lavalette	a, k	${(\frac{k n}{k - n + 1})}^{- a}$ for n ≤ k
Discrete	vector v	v _n

Open in a new tab

Below we explain in more detail the algorithm used to efficiently simulate insertions and deletions using multi-layered genome search trees. In short, insertion events are simulated by adding new small subtrees to the genome search tree in the current layer. Deletion events are instead simulated by setting substitution and indel rates to 0 in deleted nodes.

Insertion algorithm

The algorithms for simulating insertions and deletions mostly proceed as the one simulating substitutions (Algorithm 3) in that we traverse the genome search tree to find the terminal node “Leaf” affected by the next sampled mutation. We then sample the type of the next mutation event (insertion, deletion, or substitution) proportional to the corresponding mutation rates R_i, R_d and R_s of “Leaf”. The process for simulating a substitution remains the same as before. If instead a new insertion event is simulated, we sample a length l for the inserted material from the corresponding prior distribution, and then add a new subtree to the genome search tree as detailed in Algorithms 5 and 6.

Algorithm 5 insertNode(Node, l): this function inserts a new genome subtree at the given terminal genome search tree node “Node” at which the insertion is sampled, given the insertion length l. We assume that Node is part of the current layer, and that new nodes are created at the current layer.

insertionRootNode ← populateGenomeTree(l) {This calls algorithm 6 to generate an insertion subtree of size l.}

Create a new genome search tree internal node newInternalNode

newInternalNode.parent ← Node.parent

Replace Node with newInternalNode as child of Node.parent.

Node.parent ← newInternalNode

insertionRootNode.parent ← newInternalNode

newInternalNode.children ← [Node, insertionRootNode]

newInternalNode.rate ← Node.rate + insertionRootNode.rate

Algorithm 6 populateGenomeTree(l): this function recursively creates a new genome subtree for a given insertion of length l, and returns the root node of the subtree.

Create a new node genome search tree node Node.

if l = 1 then

Sample Node.allele from a prior distribution (typically the reference nucleotide or codon frequencies).

Sample mutation rates R_i, R_d, R_s for Node.

Node.rate ←R_i + R_d + R_s

else

leftL ← ⌊l/2⌋

rightL ← l—leftL

leftNode ← populateGenomeTree(leftL)

rightNode ← populateGenomeTree(rightL)

leftNode.parent ← Node

rightNode.parent ← Node

Node.children ← [leftNode, rightNode]

Node.rate ← leftNode.rate + rightNode.rate

end if

return Node

If the user specifies a root genome, inserted nucleotides are randomly and independently sampled from the root genome nucleotide frequencies; if the user does not specifies a root genome, but instead specifies root nucleotide frequencies for phastSim to sample a random root genome from them, then the same sampling is done for inserted sequences. The substitution model for each inserted nucleotide/codon is chosen at the time of insertion in the same way as for the root genome (in particular also accounting for rate variation).

Note that the addition of new subtrees to the genome search tree will typically make it less balanced, and a potentially less efficient search tree. In typical scenarios considered here, that is, when divergence is low and all genomes are closely related to the root, the effect of this imbalance on the overall search-efficiency of the genome search tree will be extremely minor.

Deletion algorithm

If the next mutation event at Leaf will instead be a deletion, again, we first sample a deletion length l, and then we proceed to set to 0 the total mutation rate for node Leaf and its following l − 1 positions of the genome in the current layer, ignoring positions that are already deleted. The main subtlety of this approach is to avoid traversing the whole genome search tree (incurring a cost of O(L) operations), to delete these l characters. Algorithm 7 below performs this task efficiently employing at most O(l log(L)) operations. This algorithm returns the number of characters deleted. If Leaf represents a position near the 3′ end of the genome, the sampled deletion length might overflow past the end of the genome, and so fewer than l positions might be effectively deleted.

Algorithm 7 deleteNodes(Rand,GenomeNode,Layer,RemainingDeletions): recursive algorithm for deleting nodes of the genome search tree following a deletion event. It returns the number of positions deleted. “Rand” is the random number that has been used to sample the deletion event—here it’s used to direct the search to the first deleted positions and the following ones.

if GenomeNode.isTerminal then

{Skip gap characters and only delete nodes with a non-gap symbol.}

if GenomeNode.allele ≠ “-” then

GenomeNode.allele ← “-”

GenomeNode.rate ← 0

return 1

else

return 0

end if

else

deletedPositions ← 0

totalRate ← 0

for Child in GenomeNode.children do

if RemainingDeletions = 0 then

return 0

end if

if Rand > child.rate then

Rand = Rand—child.rate

else

if child.layer ≠ Layer then

create newChild, copy of child in layer Layer.

else

newChild ← child

end if

newDeletions ← deleteNodes(Rand,newChild,Layer,RemainingDeletions)

RemainingDeletions ← RemainingDeletions—newDeletions

deletedPositions ← deletedPositions + newDeletions

totalRate ← totalRate + newChild.rate

end if

end for

GenomeNode.rate ← totalRate

return deletedPositions

end if

Further details of the implementation

Substitution models

Thanks to our algorithm, we can allow any substitution model without incurring a dramatic increase in computational demand, and without risking numerical instability (which can sometimes be a problem with classical matrix exponentiation approaches). Users can easily specify different nucleotide substitution matrices (e.g. JC [28], HKY [34], or GTR [6]). By default, we adopt the most general nucleotide substitution model, UNREST [31], using as default rates those we estimated from SARS-CoV-2 [22].

We also implemented codon models, which, with our hierarchical approach, come at only a small additional computational demand compared to nucleotide models. To define substitution rates of codon models, we use an extension of the GY94 [35] model, and separately model the nucleotide mutation process and the amino acid selection one. Unlike GY94 (which assumes an HKY nucleotide mutation process), we allow any general nucleotide mutation process as defined by an UNREST matrix. Then, nonsynonymous mutations rates are modified by a single factor ω (see next section for variation of ω across codons). Under this model, a substitution from codon c₁ to codon c₂ therefore has rate:

\begin{matrix} r_{c_{1} \to c_{2}} = {\begin{matrix} m_{n_{1} \to n_{2}}, & if c_{1} and c_{2} are synonymous and differ only by \\ nucleotides n_{1} and n_{2} at a position, \\ ω m_{n_{1} \to n_{2}}, & if c_{1} and c_{2} are non-synonymous and differ only by \\ nucleotides n_{1} and n_{2} at a position, \\ 0, & if c_{1} and c_{2} differ by more than one nucleotide, \end{matrix} \end{matrix}

(1)

where $m_{n_{1} \to n_{2}}$ is the mutation rate from nucleotide n₁ to nucleotide n₂.

We don’t allow, at this stage, instantaneous multi-nucleotide mutation events, or amino acid substitution models, but we plan to address them in future extensions. A description of currently implemented models and a comparison with those in other similar simulation software is given in Table 2.

Table 2. A comparison of features of different sequence evolution simulation software packages.

	phastSim	Seq-Gen [17]	INDELible [18]	pyvolve [36]
Indels	Yes	No	Yes	No
Nucleotide Models	≤ UNREST	≤ GTR	≤ UNREST	≤ GTR
Codon Models	Extended GY94	No	GY94-style	GY94, MutSel and MG94-style
Amino Acid Models	No	Yes	Yes	Yes
Hypermutability	Yes	No	No	No

Open in a new tab

Models of rate variation

We consider four types of variation in rates across the genome. These types can be used in combination, or separately, as required.

The first type of variation is changes in the position-specific mutation rate across the genome. Every nucleotide position i in the genome (even when using a codon model) is assigned its own mutation rate scaling factor γ_i. This means that, at position i, the mutation rate from any nucleotide n₁ to any other nucleotide n₂ becomes $γ_{i} m_{n_{1} \to n_{2}}$ . We allow two ways to sample values of γ_i for each i. One way is to sample them from a continuous Gamma distribution with parameters Γ(α, α), with α specified by the user; this results in each genome position having a distinct γ_i. Alternatively, we allow the definition of discrete categories, with a finite number of categories, each with its own proportion of sites and γ rate.

The second type of variation we model is variation in ω, with each codon position i across the genome being given its own ω_i. As with γ_i, values of ω_i can either be sampled from a continuous Gamma or a finite categorical distribution.

Lastly, to accommodate the strong variation in mutation rates observed in SARS-CoV-2 [22, 23] attributable to APOBEC, ADAR, or ROS activity, we introduce a new model of rate variation. This model allows, for a certain position, to have one specific mutation rate (from one specific nucleotide to another specific nucleotide) enhanced by a certain amount μ. In this case we only allow a categorical distribution, with the first category having no enhancement (μ = 1) and the other categories having μ > 1. For any nucleotide position i that is assigned a hypermutable category and therefore has μ_i > 1, we then sample uniformly a start nucleotide n_s and a destination nucleotide n_d. The mutation rates $m_{n_{1} \to n_{2}}^{i}$ for position i then become:

\begin{matrix} m_{n_{1} \to n_{2}}^{i} = {\begin{matrix} γ_{i} m_{n_{1} \to n_{2}}, & if n_{1} \neq n_{s} or n_{2} \neq n_{d}, \\ γ_{i} μ_{i} m_{n_{1} \to n_{2}}, & otherwise . \end{matrix} \end{matrix}

(2)

Rate normalization

We assume that, for the given input phylogenetic tree, branch lengths represent expected numbers of substitutions per nucleotide (no matter if a nucleotide or a codon model is used) for the root genome. As mutations accumulate across the phylogeny, the total mutation rate of the genome might slightly change; this is particularly true because we allow substitution models that are not at equilibrium. This also means that while branch lengths near the root represent the expected numbers of substitutions per nucleotide, as one moves down the tree the expected number of nucleotide substitutions expected on a branch might not be a simple function of the branch length.

Output formats

As default, our software creates an output file where it stores information about which genome position evolved under which rate. It also creates a file where each tip name is listed together with the mutations it contains that distinguish its genome from the reference genome. In scenarios similar to SARS-CoV-2 datasets (where each genome is very similar to the reference), this format requires much less space and time to generate than FASTA or PHYLIP formats (see the “Simple approach” subsection).

An optional output format that our software can create is a tree in Newick format, where each branch of the input phylogeny is annotated with a list of mutation events that occurred on that branch. This format is richer than the others, as it provides information regarding each mutation event, even those that might be over-written by other mutations at the same position; it is also more efficient than multiple sequence alignment formats in the scenario of short branch lengths considered here. We also allow a binary analogue of this annotated Newick tree, called a MAT (mutation annotated tree) [37], which is compatible with the phylogenetic software UShER [27].

Finally, we also allow the creation of unaligned FASTA output. However, note that the creation of a FASTA file costs O(NL) in time and space. In the case simulations are performed without indels, we also allow the generation of a PHYLIP format alignment output.

Python package

Our software phastSim is implemented as a Python package, and can be found at https://github.com/NicolaDM/phastSim or https://pypi.org/project/phastSim/. phastSim uses the ETE3 library [38] to robustly read input trees in different variants of the Newick format. We tested the correctness of our simulator with a series of tests, some of which are showcased in S1 Text.

Comparisons with other methods

To assess the performance of our approach we compare it to other popular sequence simulation methods. INDELible [18] is a sequence evolution simulation software that is particularly useful since it allows simulation of indels and evolution of codon sequences. INDELible allows simulations under two different algorithms: method 1 (“INDELible-m1”) uses matrix exponentiation to model substitutions, while method 2 (“INDELible-m2”) uses instead the Gillespie approach for the same task. Here we consider both algorithms. Seq-Gen [17] is a particularly popular and efficient simulation tool, but it does not allow simulation under complex models, and in particular it does not allow simulation of indels. Finally, we consider the software pyvolve [36], which offers an extremely broad choice of models of sequence evolution, and, like phastSim, is implemented in Python.

SARS-CoV-2 datasets

As a ground for comparison between existing sequence evolution simulators we consider different scenarios typical for genomic epidemiology. First, we consider the simulation of a scenario similar to SARS-CoV-2 evolution. We simulate trees with a custom script (an adapted version of the simulator NGESH [39]) and under a Yule process with birth rate equal to genome length (29,903), so to have in the order of one mutation per branch. We use these trees as input for different sequence simulation methods, tracking their computational demand. We simulate sequence evolution under an UNREST model [31] with rates inferred from SARS-CoV-2 data [22] where possible (for phastSim, pyvolve [36] and INDELible [18]) and a GTR model [6] otherwise (for Seq-Gen [17]).

While in the basic simulation setting we consider a simple model of nucleotide evolution without substitution rate variation across the genome, without selection and without indels, we also extend this scenario so to measure the impact of different simulation parameters on the performance of different methods. In the scenario of discrete category rate variation, we consider a nucleotide model with 10 categories of substitution rates across the genome. In the scenario of continuous rate variation we consider a nucleotide model with continuous variation in rate (each site has a distinct rate sampled from a Gamma distribution). In the the codon model scenario we consider a codon substitution model with substitution rates derived from the nucleotide model. In the codon model with discrete rate variation scenario, we consider in addition 10 discrete rate categories across the genome for the selection parameter ω. In the codon model with continuous rate variation scenario, we consider a codon model where each codon is allowed its own value of ω (only allowed in phastSim). Finally, for methods that allow indels (INDELible and phastSim) we also consider the scenario of a nucleotide sequence evolving with the same UNREST substitution model described before in addition to uniform insertion and deletion rates of 0.1 and with indel length distribution of Geo(0.5).

Bacterial datasets

To demonstrate a scenario in which we are interested in simulating bacterial genome evolution within one outbreak, we use the E. Coli reference genome (https://www.ncbi.nlm.nih.gov/nuccore/U00096.3 [40], 4,641,652 nucleotides) as our root genome sequence. Here we only consider the scenario of a nucleotide model without rate variation. We again assume a scenario typical for genomic epidemiology, that the birth rate of the simulated tree is equal to the genome length. The number of mutations simulated is therefore comparable to the number of branches in the tree.

Results

SARS-CoV-2 scenario

Due to the vast size of current SARS-CoV-2 genome datasets, simulation of SARS-CoV-2 genomes is expected to be an application where simulator efficiency will be particularly important. While phastSim and pyvolve are both Python implementations, therefore sharing similar benefits (high compatibility with other packages and ease of extensions) and draw-backs (reduced efficiency compared to some other languages), we see that the two approaches have dramatically different run time demands (Fig 3): simulating 50 sequences under pyvolve requires on average more time than simulating 500,000 in phastSim. We can also see that INDELible-m2 is marginally faster than INDELible-m1 in this scenario, due to the low number of mutations per branch. However, while phastSim and INDELible-m2 are both similar Gillespie approaches, simulating 5,000 sequences with INDELible-m2 requires slightly more time than simulating 500,000 sequences in phastSim (Fig 3), despite the fact that INDELible is coded in C++. Seq-Gen appears to be very efficient, but it’s still more than one order of magnitude slower than phastSim on large phylogenetic trees in this scenario. Also note that, for large trees considered here, we can reduce computational demand in phastSim by more than 5-fold by not producing a FASTA output alignment; this way we can also save very significant amounts of memory demand. Regarding small trees (< 10⁴ tips) most of the demand in phastSim is associated with initializing the simulations (loading packages and initializing the genome search tree structure); these initialization costs do not depend on tree size, and instead depend on genome size, and they are why phastSim is relatively less efficient on small trees. If simulation on small trees are indeed of interest, these initialization costs could be reduced by re-using the same genome search tree structure over multiple replicates, or, in the case of simple evolutionary models, by using our simple, non-hierarchical simulation approach. Differences in memory demands between methods are similar to time demands, with the benefits of phastSim being even more evident (S1 Text).

Fig 3 — On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the number of tips simulated. Each point represents ten replicates. We do not run the most demanding simulators when each replicate would take substantially more than 1 minute to run. In red is the time to run phastSim with a concise output, and in orange is the time for phastSim with additionally generating a FASTA format output. In green is the demand of pyvolve, and in purple of Seq-Gen. In yellow and brown are respectively the time for running INDELible with method 1 (matrix exponentiation) and method 2 (Gillespie approach).

Bacterial scenario

To showcase the impact of simulated genome size on the performance of different methods, we consider the simulation of the evolution of an entire bacterial genome. As genome size increases, time and memory demand of traditional simulators is expected to grow linearly. Indeed, we now see that Seq-Gen takes considerably more time to simulate the same number of genomes than in the SARS-CoV-2 scenario (Fig 4). phastSim also has an increased computational demand, but only in terms of the initial step of generating an initial genome search tree. This initial cost is linear with respect to genome length, but does not increase with the number of samples or with the number of mutations simulated. In total, in this scenario phastSim can simulate sequence evolution along trees with more than 1000 times more samples than Seq-Gen. A further reduction in computational demand, in particular in terms of the initial cost of generating a genome search tree, can be obtained by using the simple non-hierarchical algorithm (Fig 4), which however comes at the cost of narrowing the choice of evolutionary models to less complex ones. Similar results are observed with respect to memory demand (S1 Text).

Fig 4 — On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the number of tips simulated. Each point represents ten replicates. We do not run Seq-Gen for more than 1000 tips due to high computational demand. In red is the time to run phastSim, and in orange is the time for phastSim with the simple, non-hierarchical approach. In purple is the time demand of Seq-Gen.

Evolutionary and indel models

One of the advantages of the approach we present here is that simulating evolution under increasingly complex models comes at almost no additional time or memory cost (Fig 5 and S1 Text). It can be seen, for example, that INDELible-m1 and Seq-Gen incur a significantly higher time cost when using a continuous variation in mutation rate. Running INDELible-m2 with a codon model appears to come with no additional computational demand, similarly to phastSim (Fig 5). For these comparisons we have considered the SARS-CoV-2 simulation scenario.

Fig 5 — On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the model used for simulations: “nucleotide” is a nucleotide substitution model without variation; “nuc+10cat” is a nucleotide model with 10 rate categories; “nuc+alpha” is a nucleotide model with continuous variation in rate (each site has a distinct rate sampled from a Gamma distribution); “codon” represents a codon substitution model; “codon+10cat” represents a codon substitution model with 10 categories for ω; “codon+alpha” is a codon model with continuous rate variation in mutation rate and in ω (only allowed in phastSim). Each value represents ten replicates. Seq-Gen does not allow codon models. Colors are as in Fig 3. Here we used alignments of 1000 tips.

Our algorithm also allows efficient simulation of insertion and deletion events (indels). Among the other simulators considered here, only INDELible can simulate indels. In the SARS-CoV-2 scenario, phastSim can simulate substitutions and indels under about 10 times larger phylogenies than INDELible for the same computational run time (Fig 6), and under about 200 times larger phylogenies for the same memory demand (S1 Text).

Fig 6 — In this scenario we compare phastSim against Indelbile-m1 and Indelible-m2 (the only other methods considered here that model indels). Each value represents ten replicates.

The impact of branch lengths

Probably the main limiting factor in the applicability of the approach presented here are tree branch lengths. Since the demand of our approach is affected linearly by the number of mutation events, and as we scale up the length of the tree we need to simulate more mutation events, then the length of the phylogenetic branches will significantly affect the performance of our approach. We can see that, in the SARS-CoV-2 scenario, the impact is not strictly linear (Fig 7). This is because there are additional factors which contribute to phastSim demand in addition to the number of mutation events. For example, one also has to consider the time to initialize the genome search tree, which is linear in genome size, as well as the time to read, initialize, and traverse the input phylogenetic tree, which are linear in the number of tips. Predictably, the computational demands of Seq-Gen and INDELible-m1 seem not affected by the length of the branches. It is instead surprising to see that the computational demand of INDELible-m2 seems also not affected by the branch lengths, despite it using a Gillespie approach; the reason is that probably other factors, independent of the number of mutations, cause the bulk of the demand in this scenario.

Fig 7 — On the Y axis we show the number of seconds it takes to perform simulations using different software. On the X axis is the rescaling factor we use to make the phylogenetic tree branch lengths longer or shorter. Colors are as in Fig 3. Here we used alignments of 5000 tips.

Discussion

We have introduced a new approach to simulating sequence evolution that is particularly efficient when used on phylogenies with many tips and with short branches. Our software phastSim implements this new algorithm and is implemented in Python, allowing it to be easily extended and combined with other Python packages. phastSim relies on the ETE 3 tree phylogenetic structure, and in particular it uses ETE 3 to read input phylogenetic trees. This allows flexibility in the phylogenetic tree input format. Furthermore, thanks to the fact that the efficiency of the algorithm is not affected by the complexity of the substitution model used, we allow a broad choice of evolutionary models, such as codon models with position-specific mutation rates and selective pressures. We also implement a new model of hypermutability to more realistically describe the mutational process in SARS-CoV-2. Also, we can efficiently simulate indel events, which are rarely modeled by other simulation packages.

We show that, compared with other simulators, phastSim is more efficient in the scenarios common to genomic epidemiology, that is, when simulating many closely related bacterial or viral genomes. Its particular efficiency with bacterial genomes means that it ideally matches the needs of software that simulate bacterial ancestral recombination graphs (e.g. [9, 41]). phastSim can also be easily run using the output of phylogenetic simulator, most relevantly VGsim [42] which allows fast simulations of very large and short phylogenies typical of SARS-CoV-2 and other genomic epidemiological scenarios, and which also allows the simulation of the effects of selection on the phylogenetic tree shape. phastSim is implemented as a Python package, which allows for easy integration into other Python pipelines.

In the future, it would be possible, and of interest, to expand the features of phastSim, in particular allowing a broader spectrum of models, for example allowing column-specific amino acid fitness profiles; also, it could be possible to implement the described algorithm in more efficient programming languages.

In conclusion, we have presented a novel algorithm, and corresponding software implementation phastSim, to efficiently simulate sequence evolution along large trees of closely related sequences. This new approach considerably outperforms other methods in the scenarios of genomic epidemiology, for example when simulating SARS-CoV-2 genome sequence datasets. This approach also allows for more realistic models of sequence evolution, allowing more efficient and accurate sequence data simulation and inference.

Supporting information

S1 Text. Supplementary text.

Supplementary file containing results regarding memory demand of different methods and containing results from testing of phastSim features.

(PDF)

Click here for additional data file.^{(881.3KB, pdf)}

Acknowledgments

We are very thankful to Vladimir Shchur for the valuable suggestions on our work.

Data Availability

The code and data used for this project are available at https://github.com/NicolaDM/phastSim. phastSim can be easily installed across most platforms (see PyPI repository https://pypi.org/project/phastSim/) using the pip installer.

Funding Statement

NDM, WB, LW, CRW, and NG were supported by the European Molecular Biology Laboratory (EMBL), https://www.embl.org/. CRW was funded by the National Institute of Health Research (NIHR https://cambridgebrc.nihr.ac.uk) Cambridge Biomedical Research Centre, grant number IS-BRC-1215- 20014. R.C.-D. was supported by funding from the Schmidt Futures Foundation https://schmidtfutures.com/, by an Alfred P. Sloan foundation https://sloan.org/ fellowship, and by NIH/NIGMS https://www.nigms.nih.gov/ grant R35GM128932. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Arenas M. Simulation of molecular data under diverse evolutionary scenarios. PLoS Comput Biol. 2012;8(5):e1002495. doi: 10.1371/journal.pcbi.1002495 [DOI] [PMC free article] [PubMed] [Google Scholar]
2. Fletcher W, Yang Z. The effect of insertions, deletions, and alignment errors on the branch-site test of positive selection. Molecular biology and evolution. 2010;27(10):2257–2267. doi: 10.1093/molbev/msq115 [DOI] [PubMed] [Google Scholar]
3. Jordan G, Goldman N. The effects of alignment error and alignment filtering on the sitewise detection of positive selection. Molecular biology and evolution. 2012;29(4):1125–1139. doi: 10.1093/molbev/msr272 [DOI] [PubMed] [Google Scholar]
4. Vialle RA, Tamuri AU, Goldman N. Alignment modulates ancestral sequence reconstruction accuracy. Molecular biology and evolution. 2018;35(7):1783–1797. doi: 10.1093/molbev/msy055 [DOI] [PMC free article] [PubMed] [Google Scholar]
5. Worobey M, Pekar J, Larsen BB, Nelson MI, Hill V, Joy JB, et al. The emergence of SARS-CoV-2 in Europe and North America. Science. 2020;370(6516):564–570. doi: 10.1126/science.abc8169 [DOI] [PMC free article] [PubMed] [Google Scholar]
6. Tavaré S, et al. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on mathematics in the life sciences. 1986;17(2):57–86. [Google Scholar]
7. Beaumont MA, Zhang W, Balding DJ. Approximate Bayesian computation in population genetics. Genetics. 2002;162(4):2025–2035. doi: 10.1093/genetics/162.4.2025 [DOI] [PMC free article] [PubMed] [Google Scholar]
8. Wilson DJ, Gabriel E, Leatherbarrow AJ, Cheesbrough J, Gee S, Bolton E, et al. Rapid evolution and the importance of recombination to the gastroenteric pathogen Campylobacter jejuni. Molecular biology and evolution. 2009;26(2):385–397. doi: 10.1093/molbev/msn264 [DOI] [PMC free article] [PubMed] [Google Scholar]
9. De Maio N, Wilson DJ. The bacterial sequential Markov coalescent. Genetics. 2017;206(1):333–343. doi: 10.1534/genetics.116.198796 [DOI] [PMC free article] [PubMed] [Google Scholar]
10. Zou Z, Zhang H, Guan Y, Zhang J. Deep residual neural networks resolve quartet molecular phylogenies. Molecular Biology and Evolution. 2019;37(5):1495–1507. doi: 10.1093/molbev/msz307 [DOI] [PMC free article] [PubMed] [Google Scholar]
11. Suvorov A, Hochuli J, Schrider DR. Accurate inference of tree topologies from multiple sequence alignments using deep learning. Systematic biology. 2020;69(2):221–233. doi: 10.1093/sysbio/syz060 [DOI] [PMC free article] [PubMed] [Google Scholar]
12. Leuchtenberger AF, Crotty SM, Drucks T, Schmidt HA, Burgstaller-Muehlbacher S, von Haeseler A. Distinguishing Felsenstein zone from Farris zone using neural networks. Molecular Biology and Evolution. 2020;37(12):3632–3641. doi: 10.1093/molbev/msaa164 [DOI] [PMC free article] [PubMed] [Google Scholar]
13. Beiko RG, Charlebois RL. A simulation test bed for hypotheses of genome evolution. Bioinformatics. 2007;23(7):825–831. doi: 10.1093/bioinformatics/btm024 [DOI] [PubMed] [Google Scholar]
14. Hudson RR. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics. 2002;18(2):337–338. doi: 10.1093/bioinformatics/18.2.337 [DOI] [PubMed] [Google Scholar]
15. Laval G, Excoffier L. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics. 2004;20(15):2485–2487. doi: 10.1093/bioinformatics/bth264 [DOI] [PubMed] [Google Scholar]
16. Ewing G, Hermisson J. MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics. 2010;26(16):2064–2065. doi: 10.1093/bioinformatics/btq322 [DOI] [PMC free article] [PubMed] [Google Scholar]
17. Rambaut A, Grass NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics. 1997;13(3):235–238. doi: 10.1093/bioinformatics/13.3.235 [DOI] [PubMed] [Google Scholar]
18. Fletcher W, Yang Z. INDELible: a flexible simulator of biological sequence evolution. Molecular biology and evolution. 2009;26(8):1879–1888. doi: 10.1093/molbev/msp098 [DOI] [PMC free article] [PubMed] [Google Scholar]
19. Sipos B, Massingham T, Jordan GE, Goldman N. PhyloSim-Monte Carlo simulation of sequence evolution in the R statistical computing environment. BMC bioinformatics. 2011;12(1):1–6. doi: 10.1186/1471-2105-12-104 [DOI] [PMC free article] [PubMed] [Google Scholar]
20. Morel B, Barbera P, Czech L, Bettisworth B, Hübner L, Lutteropp S, et al. Phylogenetic analysis of SARS-CoV-2 data is difficult. bioRxiv. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
21. Shu Y, McCauley J. GISAID: Global initiative on sharing all influenza data–from vision to reality. Eurosurveillance. 2017;22(13):30494. doi: 10.2807/1560-7917.ES.2017.22.13.30494 [DOI] [PMC free article] [PubMed] [Google Scholar]
22. De Maio N, Walker CR, Turakhia Y, Lanfear R, Corbett-Detig R, Goldman N. Mutation rates and selection on synonymous mutations in SARS-CoV-2. Genome Biology and Evolution. 2021;13(5):evab087. doi: 10.1093/gbe/evab087 [DOI] [PMC free article] [PubMed] [Google Scholar]
23. Turakhia Y, De Maio N, Thornlow B, Gozashti L, Lanfear R, Walker CR, et al. Stability of SARS-CoV-2 phylogenies. PLOS Genetics. 2020;16(11):e1009175. doi: 10.1371/journal.pgen.1009175 [DOI] [PMC free article] [PubMed] [Google Scholar]
24. Rice AM, Morales AC, Ho AT, Mordstein C, Mühlhausen S, Watson S, et al. Evidence for strong mutation bias towards, and selection against, U content in SARS-CoV-2: implications for vaccine design. Molecular Biology and Evolution. 2020. [Google Scholar]
25. Gillespie DT. Exact stochastic simulation of coupled chemical reactions. The journal of physical chemistry. 1977;81(25):2340–2361. doi: 10.1021/j100540a008 [DOI] [Google Scholar]
26. Hodcroft EB, De Maio N, Lanfear R, MacCannell DR, Minh BQ, Schmidt HA, et al. Want to track pandemic variants faster? Fix the bioinformatics bottleneck; 2021. [DOI] [PubMed] [Google Scholar]
27. Turakhia Y, Thornlow B, Hinrichs AS, De Maio N, Gozashti L, Lanfear R, et al. Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nature Genetics. 2021;53(6):809–816. doi: 10.1038/s41588-021-00862-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
28. Jukes TH, Cantor CR, et al. Evolution of protein molecules. Mammalian protein metabolism. 1969;3:21–132. doi: 10.1016/B978-1-4832-3211-9.50009-7 [DOI] [Google Scholar]
29. Ellegren H, Smith NG, Webster MT. Mutation rate variation in the mammalian genome. Current opinion in genetics & development. 2003;13(6):562–568. doi: 10.1016/j.gde.2003.10.008 [DOI] [PubMed] [Google Scholar]
30. Yang Z. Among-site rate variation and its impact on phylogenetic analyses. Trends in Ecology & Evolution. 1996;11(9):367–372. doi: 10.1016/0169-5347(96)10041-0 [DOI] [PubMed] [Google Scholar]
31. Yang Z. Estimating the pattern of nucleotide substitution. Journal of molecular evolution. 1994;39(1):105–111. doi: 10.1007/BF00178256 [DOI] [PubMed] [Google Scholar]
32.Tang D. Efficient algorithms for modifying and sampling from a categorical distribution. arXiv preprint arXiv:190611700. 2019.
33. Huffman DA. A method for the construction of minimum-redundancy codes. Proceedings of the IRE. 1952;40(9):1098–1101. doi: 10.1109/JRPROC.1952.273898 [DOI] [Google Scholar]
34. Hasegawa M, Kishino H, Yano Ta. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of molecular evolution. 1985;22(2):160–174. doi: 10.1007/BF02101694 [DOI] [PubMed] [Google Scholar]
35. Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular biology and evolution. 1994;11(5):725–736. [DOI] [PubMed] [Google Scholar]
36. Spielman SJ, Wilke CO. Pyvolve: a flexible Python module for simulating sequences along phylogenies. PloS one. 2015;10(9):e0139047. doi: 10.1371/journal.pone.0139047 [DOI] [PMC free article] [PubMed] [Google Scholar]
37. McBroome J, Thornlow B, Hinrichs AS, Kramer A, De Maio N, Goldman N, et al. A daily-updated database and tools for comprehensive SARS-CoV-2 mutation-annotated trees. Molecular Biology and Evolution. 2021. doi: 10.1093/molbev/msab264 [DOI] [PMC free article] [PubMed] [Google Scholar]
38. Huerta-Cepas J, Serra F, Bork P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Molecular biology and evolution. 2016;33(6):1635–1638. doi: 10.1093/molbev/msw046 [DOI] [PMC free article] [PubMed] [Google Scholar]
39.Tresoldi T. Ngesh, a tool for simulating random phylogenetic trees. Version 0.5; 2021. https://github.com/tresoldi/ngesh.
40. Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, et al. The complete genome sequence of Escherichia coli K-12. science. 1997;277(5331):1453–1462. doi: 10.1126/science.277.5331.1453 [DOI] [PubMed] [Google Scholar]
41. Brown T, Didelot X, Wilson DJ, De Maio N. SimBac: simulation of whole bacterial genomes with homologous recombination. Microbial genomics. 2016;2(1). doi: 10.1099/mgen.0.000044 [DOI] [PMC free article] [PubMed] [Google Scholar]
42. Shchur V, Spirin V, Pokrovskii V, Burovski E, De Maio N, Corbett-Detig R. VGsim: scalable viral genealogy simulator for global pandemic. medRxiv. 2021;. doi: 10.1101/2021.04.21.21255891 [DOI] [PMC free article] [PubMed] [Google Scholar]

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010056.r001

Decision Letter 0

Ville Mustonen, Joel O Wertheim

10 Nov 2021

Dear Dr De Maio,

Thank you very much for submitting your manuscript "phastSim: efficient simulation of sequence evolution for pandemic-scale datasets" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

I agree with the consensus among the three reviewers. This approach is novel and potentially quite useful. That said, the reviewers identified several areas that require attention and improvement. In particular, Reviewers #1 and #3 raise an important point about the structure and focus of the manuscript.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Joel O. Wertheim

Associate Editor

PLOS Computational Biology

Ville Mustonen

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this manuscript, the authors present a novel tool for simulating the evolution of an ancestral sequence along a given phylogeny. The authors have made the tool available as an open source package available on GitHub with user-friendly installation possible via PyPI (using the pip package manager).

As promised in the manuscript and GitHub repo, I was able to easily install the tool, along with all of its dependencies, on my laptop via a simple "pip install phastSim" command (using an Ubuntu 18.04 environment in the Windows Subsystem for Linux). I was then able to run the tool using the example dataset provided in the GitHub repo, which finished running in just ~15 seconds on my laptop, which is quite impressive given the sample dataset size. I only briefly skimmed through the code base on GitHub, but at a glance, it's quite clean and organized. Regarding the algorithms presented, the authors present clever techniques for efficiently simulating sequence evolution along a tree, and importantly, their approach supports the simulation of insertions and deletions, something not supported by Pyvolve nor (if I recall correctly) Seq-Gen.

Regarding the manuscript itself, I believe the paper is overall well-written: as expected from a paper of this nature, the authors (1) introduce the bioinformatics problem at hand, (2) discuss prior work in the space, (3) introduce their novel approach, (4) present their approach's algorithms and tool implementation, (5) describe a simulation experiment to benchmark their tool against existing methods, (6) present the results of the benchmarking experiment, and (7) discuss the results. However, I believe the paper requires some significant revision:

- The technical details of the simulation experiment are currently presented in the "Results" section of the manuscript. From my perspective, it would make more sense to move the technical details about the methods behind the simulation experiment to the "Materials and methods" section of the manuscript. I believe only the results of the simulation experiment (i.e., the actual benchmarking measurements) should be presented in the "Results" section

- The paper only shows runtime measurements for the various tools, but because of the large number of simulations that need to be executed, ideally with many replicates in parallel, in large-scale simulation experiments such as those used to study COVID-19 (e.g. Pekar et al., Science 2021), and because the simulation of sequence evolution can be quite memory-intensive (as mentioned by the authors), the benchmarking results should include plots depicting peak memory usage of the various tools as well. I apologize in advance for asking for this, as I'm sure it'll require quite of work to be redone, but in addition to runtime measurements, peak memory measurements are critical to properly compare these tools

- Figure 1 should be cleaned up to look a bit more professional / production-quality. For example, the child branches coming out of the internal nodes of the tree are quite inconsistent in terms of spacing, and rather than using "->" to denote a right arrow, it would be better to actually use a right arrow (→), etc.

- Figure 2 should be cleaned up substantially: the image looks quite distorted (namely the small red-and-blue trees), perhaps because of resizing vertically?

- Figures 4, 5, and 7 should be redone to look consistent with Figures 3 and 6 (especially the legends and tick labels). Further, these 3 figures have far too much vertical space: the y-max should be much smaller (e.g. 150 seconds for Figure 4, 3 seconds for Figure 5, and 45 seconds for Figure 7)

- Why are Figures 4, 5, and 7 using different-sized trees for each tool in these experiments? These should be replaced with the exact same trees for each tool, just as was done in Figures 3 and 6. If the decision to use different-sized trees was for the sake of presentation due to huge variation in runtime across the tools, the authors can use a log-scale for the vertical axis. Even in the figures' current form, the boxes are quite squished, and log-scale may help better depict them

- Figure 5's minimum vertical axis value (y-min) should be 0, not -1, as these are runtimes (if the vertical axis is changed to log-scale as I recommended in a prior bullet, the y-min would need to be a positive number rather than 0)

- I was not able to find the datasets used in the simulation experiments. I understand that GISAID has tight restrictions on releasing actual sequences, but the phylogenies used in the simulation experiments, along with the raw benchmarking measurements should be made publicly available (e.g. in a separate GitHub repo, on Data Dryad, on figshare, etc.). If the authors are worried about GISAID terms with respect to the phylogeny, the only identifiable component would be the tip labels, so the authors can simply replace the tip labels with arbitrary values (e.g. "0", "1", etc.). I would recommend also including all scripts/commands utilized in conducting the benchmarking experiments so that a reader can simply copy-and-paste the exact commands you used and (more-or-less) reproduce the benchmarking results

Less significant general comments for improvement of presentation:

- The formatting of the pseudocode in the various algorithms is somewhat inconsistent. Of note, the spacing between the equal signs in assignments is inconsistent (sometimes "a=b", sometimes "a= b", sometimes "a =b", and sometimes "a = b"). It would be good to revise to be consistent; I would recommend putting spaces between symbols for clarity

- Assignment operations in pseudocode are typically denoted using a left arrow (←) rather than using an equal sign (=)

- Multiple parts of the paper say "sample ___ from an exponential distribution with parameter ____", which is slightly ambiguous: the exponential distribution has two possible parameterizations (rate, or scale = 1/rate), and while the rate parameterization is the most typical representation (to my knowledge), it would be good to specify, e.g. "sample ____ from an exponential distribution with rate parameter ____"

- All figures appear quite pixelated in the PDF I downloaded. However, this may be an artifact of the submission system, so it may not actually be an issue on the authors' end (but it would be good to double check)

- Figure 3 may be improved by presenting the vertical axis in log-scale to better distinguish between the runtimes of smaller values of "number of tips" (though not necessary, as the trends are quite clear even in the current presentation)

Specific comments about wording/grammar/text throughout the manuscript:

- "Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, as well as being part of some inference methods" --> The last clause of this sentence is grammatically incorrect and should be revised

- "Here we present a new algorithm and software for ..." --> There should be a comma after "Here"

- "Our algorithm is based on the Gillespie approach, and implements an ..." --> There should be an "it" before "implements"

- "either for example through Approximate Bayesian Computation [6, 7], see e.g. [8, 9]," --> I wonder if this could just be changed to be "Approximate Bayesian Computation [6-9]" (i.e., remove the "e.g." part)? Same comment for the following sentence

- The paragraph starting with "In this simplified “vanilla” scenario..." may benefit from being split into two parts, e.g. with a new paragraph starting at " A pseudocode description of..."

- The end of the paragraph starting with "In this simplified “vanilla” scenario..." presents details about the tool implementation, though those tool-specific descriptions should likely be moved to the portion of the manuscript that describes the tool

- There are some more minor grammar issues throughout, so the paper may benefit from another pass of internal revisions for such things

Overall, I was thoroughly impressed by this work, and I look forward to utilizing phastSim in my own research!

Reviewer #2: The review is uploaded as a separate PDF.

Reviewer #3: Summary. The authors develop phastSim, a software package for simulating the evolution of sequences along a tree. The authors' primary innovation is the development of a data structure that allows efficient computation of sequences on large trees. As someone who is an expert in this field, I have experimented with using a binary-search tree to efficiently identify the location of a mutation during simulation. I also abandoned such an algorithm because of the primary problem diagnosed in this paper: copying the binary-search tree to descendant phylogenetic branches is an expensive operation. The authors solved this problem by developing a multi-layer binary search tree that doesn't have to be copied at every phylogenetic split. Instead nodes maintain different views of the shared data-structure, and descendant branches add nodes to the data-structure to update their views when mutations happen without affecting other views. I found this algorithm an interesting solution to the problem.

Major Comments. The paper's primary result is comparing features and runtime between existing programs. Such comparisons should be a secondary result in my opinion. Instead, simulation papers should focus on demonstrating the accuracy of their simulation software. However in this paper, there is no evidence presented that the simulated data generated by phastSim agrees with the models being simulated. It's not uncommon to find subtle bugs in simulation programs that introduce bias into simulations. This is why it is important for simulation papers to demonstrate their accuracy before they compare their performance to other programs. Accuracy can be demonstrated several ways, including using summary statistics, statistical tests, or parameter estimation to show that the simulated output matches what one would expect from the model. Doing all three for several different models support by phastSim would make a strong case that the software is accurate.

It appreciate that the code is open source and freely licensed.

Minor Comments. Several of the algorithms presented as figures in the manuscript were adequately explained in the text. I think the paper would be improved by removing some of these algorithms from the paper. For example, Algorithms 2 and 6.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Niema Moshiri

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Attachment

Submitted filename: phastSim review.pdf

Click here for additional data file.^{(43.7KB, pdf)}

PLoS Comput Biol. 2022 Apr 29;18(4):e1010056. doi: 10.1371/journal.pcbi.1010056.r002

Author response to Decision Letter 0

2 Feb 2022

Attachment

Submitted filename: phastSim_ResponseToReviewers.pdf

Click here for additional data file.^{(107.2KB, pdf)}

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010056.r003

Decision Letter 1

Ville Mustonen

14 Mar 2022

Dear Dr De Maio,

Thank you for the thorough revision. Please have a look of the few remaining, mostly stylistic, suggestions by the reviewers and try to accommodate them.

Thank you very much for submitting your manuscript "phastSim: efficient simulation of sequence evolution for pandemic-scale datasets" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Ville Mustonen

Deputy Editor

PLOS Computational Biology

Ville Mustonen

Deputy Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: We thank the authors for their significant revisions. The paper looks excellent, and all of our key concerns have been addressed. We have the following (extremely minor) comments regarding the revised version of the manuscript:

- In Figures 3 and 6, it's unclear to me why "tree generation" is included in the runtime comparisons, as it's not really relevant to the task at hand (sequence simulation). I would recommend removing it such that the comparison is only between sequence simulation methods

- In Figure 3, in the blue curve (tree generation), why is there such a large variance at the 5th point from the left? I imagine at least 1 measurement may have gotten skewed by background processes on the benchmarking machine or something; I would recommend trying to rerun that point while the machine is not being used. Note that this comment is moot if the "tree generation" curves are removed from the figures as per my previous comment

- In the Algorithm 6 pseudocode, at the top of the "else" statement, rather than using the syntax "int(l/2)" (which is likely the Python code that was used to typecast the result of a floating point division to int), I would recommend using the mathematical notation for "floor", e.g. ⌊l/2⌋. In general, it may be good for the authors to take a pass through the algorithms to ensure that they are using standard mathematical pseudocode syntax rather than Python-like syntax where applicable

Reviewer #2: The authors have addressed all my comments and I am happy to recommend acceptance at this stage.

Reviewer #3: The authors have fully addressed my comments from the previous version.

I looked at the supplemental figures showing the accuracy of the simulations. I feel that these figures were quickly put together, and the supplement would benefit from some more time spent on them. This includes (1) increasing font sizes following journal guidelines and (2) consistently marking location of no-error on all histograms. Also the histograms seem a bit blocky to me. The authors should explore other visualizations, including jitter plots, qqplots, and empirical CDFs. The authors should also make note of when two lines or plots are on top of one another, that helps readers know that data hasn't been left out.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Niema Moshiri

Reviewer #2: No

Reviewer #3: No

Figure Files:

Data Requirements:

Reproducibility:

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. 2022 Apr 29;18(4):e1010056. doi: 10.1371/journal.pcbi.1010056.r004

Author response to Decision Letter 1

22 Mar 2022

Attachment

Submitted filename: responseToReviewers.pdf

Click here for additional data file.^{(63.8KB, pdf)}

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010056.r005

Decision Letter 2

Ville Mustonen

25 Mar 2022

Dear Dr De Maio,

We are pleased to inform you that your manuscript 'phastSim: efficient simulation of sequence evolution for pandemic-scale datasets' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Ville Mustonen

Deputy Editor

PLOS Computational Biology

Ville Mustonen

Deputy Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010056.r006

Acceptance letter

Ville Mustonen

26 Apr 2022

PCOMPBIOL-D-21-01738R2

phastSim: efficient simulation of sequence evolution for pandemic-scale datasets

Dear Dr De Maio,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Anita Estes

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Text. Supplementary text.

Supplementary file containing results regarding memory demand of different methods and containing results from testing of phastSim features.

(PDF)

Click here for additional data file.^{(881.3KB, pdf)}

Attachment

Submitted filename: phastSim review.pdf

Click here for additional data file.^{(43.7KB, pdf)}

Attachment

Submitted filename: phastSim_ResponseToReviewers.pdf

Click here for additional data file.^{(107.2KB, pdf)}

Attachment

Submitted filename: responseToReviewers.pdf

Click here for additional data file.^{(63.8KB, pdf)}

Data Availability Statement

[pcbi.1010056.ref001] 1. Arenas M. Simulation of molecular data under diverse evolutionary scenarios. PLoS Comput Biol. 2012;8(5):e1002495. doi: 10.1371/journal.pcbi.1002495 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref002] 2. Fletcher W, Yang Z. The effect of insertions, deletions, and alignment errors on the branch-site test of positive selection. Molecular biology and evolution. 2010;27(10):2257–2267. doi: 10.1093/molbev/msq115 [DOI] [PubMed] [Google Scholar]

[pcbi.1010056.ref003] 3. Jordan G, Goldman N. The effects of alignment error and alignment filtering on the sitewise detection of positive selection. Molecular biology and evolution. 2012;29(4):1125–1139. doi: 10.1093/molbev/msr272 [DOI] [PubMed] [Google Scholar]

[pcbi.1010056.ref004] 4. Vialle RA, Tamuri AU, Goldman N. Alignment modulates ancestral sequence reconstruction accuracy. Molecular biology and evolution. 2018;35(7):1783–1797. doi: 10.1093/molbev/msy055 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref005] 5. Worobey M, Pekar J, Larsen BB, Nelson MI, Hill V, Joy JB, et al. The emergence of SARS-CoV-2 in Europe and North America. Science. 2020;370(6516):564–570. doi: 10.1126/science.abc8169 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref006] 6. Tavaré S, et al. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on mathematics in the life sciences. 1986;17(2):57–86. [Google Scholar]

[pcbi.1010056.ref007] 7. Beaumont MA, Zhang W, Balding DJ. Approximate Bayesian computation in population genetics. Genetics. 2002;162(4):2025–2035. doi: 10.1093/genetics/162.4.2025 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref008] 8. Wilson DJ, Gabriel E, Leatherbarrow AJ, Cheesbrough J, Gee S, Bolton E, et al. Rapid evolution and the importance of recombination to the gastroenteric pathogen Campylobacter jejuni. Molecular biology and evolution. 2009;26(2):385–397. doi: 10.1093/molbev/msn264 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref009] 9. De Maio N, Wilson DJ. The bacterial sequential Markov coalescent. Genetics. 2017;206(1):333–343. doi: 10.1534/genetics.116.198796 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref010] 10. Zou Z, Zhang H, Guan Y, Zhang J. Deep residual neural networks resolve quartet molecular phylogenies. Molecular Biology and Evolution. 2019;37(5):1495–1507. doi: 10.1093/molbev/msz307 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref011] 11. Suvorov A, Hochuli J, Schrider DR. Accurate inference of tree topologies from multiple sequence alignments using deep learning. Systematic biology. 2020;69(2):221–233. doi: 10.1093/sysbio/syz060 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref012] 12. Leuchtenberger AF, Crotty SM, Drucks T, Schmidt HA, Burgstaller-Muehlbacher S, von Haeseler A. Distinguishing Felsenstein zone from Farris zone using neural networks. Molecular Biology and Evolution. 2020;37(12):3632–3641. doi: 10.1093/molbev/msaa164 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref013] 13. Beiko RG, Charlebois RL. A simulation test bed for hypotheses of genome evolution. Bioinformatics. 2007;23(7):825–831. doi: 10.1093/bioinformatics/btm024 [DOI] [PubMed] [Google Scholar]

[pcbi.1010056.ref014] 14. Hudson RR. Generating samples under a Wright–Fisher neutral model of genetic variation. Bioinformatics. 2002;18(2):337–338. doi: 10.1093/bioinformatics/18.2.337 [DOI] [PubMed] [Google Scholar]

[pcbi.1010056.ref015] 15. Laval G, Excoffier L. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics. 2004;20(15):2485–2487. doi: 10.1093/bioinformatics/bth264 [DOI] [PubMed] [Google Scholar]

[pcbi.1010056.ref016] 16. Ewing G, Hermisson J. MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics. 2010;26(16):2064–2065. doi: 10.1093/bioinformatics/btq322 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref017] 17. Rambaut A, Grass NC. Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Bioinformatics. 1997;13(3):235–238. doi: 10.1093/bioinformatics/13.3.235 [DOI] [PubMed] [Google Scholar]

[pcbi.1010056.ref018] 18. Fletcher W, Yang Z. INDELible: a flexible simulator of biological sequence evolution. Molecular biology and evolution. 2009;26(8):1879–1888. doi: 10.1093/molbev/msp098 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref019] 19. Sipos B, Massingham T, Jordan GE, Goldman N. PhyloSim-Monte Carlo simulation of sequence evolution in the R statistical computing environment. BMC bioinformatics. 2011;12(1):1–6. doi: 10.1186/1471-2105-12-104 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref020] 20. Morel B, Barbera P, Czech L, Bettisworth B, Hübner L, Lutteropp S, et al. Phylogenetic analysis of SARS-CoV-2 data is difficult. bioRxiv. 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref021] 21. Shu Y, McCauley J. GISAID: Global initiative on sharing all influenza data–from vision to reality. Eurosurveillance. 2017;22(13):30494. doi: 10.2807/1560-7917.ES.2017.22.13.30494 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref022] 22. De Maio N, Walker CR, Turakhia Y, Lanfear R, Corbett-Detig R, Goldman N. Mutation rates and selection on synonymous mutations in SARS-CoV-2. Genome Biology and Evolution. 2021;13(5):evab087. doi: 10.1093/gbe/evab087 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref023] 23. Turakhia Y, De Maio N, Thornlow B, Gozashti L, Lanfear R, Walker CR, et al. Stability of SARS-CoV-2 phylogenies. PLOS Genetics. 2020;16(11):e1009175. doi: 10.1371/journal.pgen.1009175 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref024] 24. Rice AM, Morales AC, Ho AT, Mordstein C, Mühlhausen S, Watson S, et al. Evidence for strong mutation bias towards, and selection against, U content in SARS-CoV-2: implications for vaccine design. Molecular Biology and Evolution. 2020. [Google Scholar]

[pcbi.1010056.ref025] 25. Gillespie DT. Exact stochastic simulation of coupled chemical reactions. The journal of physical chemistry. 1977;81(25):2340–2361. doi: 10.1021/j100540a008 [DOI] [Google Scholar]

[pcbi.1010056.ref026] 26. Hodcroft EB, De Maio N, Lanfear R, MacCannell DR, Minh BQ, Schmidt HA, et al. Want to track pandemic variants faster? Fix the bioinformatics bottleneck; 2021. [DOI] [PubMed] [Google Scholar]

[pcbi.1010056.ref027] 27. Turakhia Y, Thornlow B, Hinrichs AS, De Maio N, Gozashti L, Lanfear R, et al. Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nature Genetics. 2021;53(6):809–816. doi: 10.1038/s41588-021-00862-7 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref028] 28. Jukes TH, Cantor CR, et al. Evolution of protein molecules. Mammalian protein metabolism. 1969;3:21–132. doi: 10.1016/B978-1-4832-3211-9.50009-7 [DOI] [Google Scholar]

[pcbi.1010056.ref029] 29. Ellegren H, Smith NG, Webster MT. Mutation rate variation in the mammalian genome. Current opinion in genetics & development. 2003;13(6):562–568. doi: 10.1016/j.gde.2003.10.008 [DOI] [PubMed] [Google Scholar]

[pcbi.1010056.ref030] 30. Yang Z. Among-site rate variation and its impact on phylogenetic analyses. Trends in Ecology & Evolution. 1996;11(9):367–372. doi: 10.1016/0169-5347(96)10041-0 [DOI] [PubMed] [Google Scholar]

[pcbi.1010056.ref031] 31. Yang Z. Estimating the pattern of nucleotide substitution. Journal of molecular evolution. 1994;39(1):105–111. doi: 10.1007/BF00178256 [DOI] [PubMed] [Google Scholar]

[pcbi.1010056.ref032] 32.Tang D. Efficient algorithms for modifying and sampling from a categorical distribution. arXiv preprint arXiv:190611700. 2019.

[pcbi.1010056.ref033] 33. Huffman DA. A method for the construction of minimum-redundancy codes. Proceedings of the IRE. 1952;40(9):1098–1101. doi: 10.1109/JRPROC.1952.273898 [DOI] [Google Scholar]

[pcbi.1010056.ref034] 34. Hasegawa M, Kishino H, Yano Ta. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of molecular evolution. 1985;22(2):160–174. doi: 10.1007/BF02101694 [DOI] [PubMed] [Google Scholar]

[pcbi.1010056.ref035] 35. Goldman N, Yang Z. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular biology and evolution. 1994;11(5):725–736. [DOI] [PubMed] [Google Scholar]

[pcbi.1010056.ref036] 36. Spielman SJ, Wilke CO. Pyvolve: a flexible Python module for simulating sequences along phylogenies. PloS one. 2015;10(9):e0139047. doi: 10.1371/journal.pone.0139047 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref037] 37. McBroome J, Thornlow B, Hinrichs AS, Kramer A, De Maio N, Goldman N, et al. A daily-updated database and tools for comprehensive SARS-CoV-2 mutation-annotated trees. Molecular Biology and Evolution. 2021. doi: 10.1093/molbev/msab264 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref038] 38. Huerta-Cepas J, Serra F, Bork P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Molecular biology and evolution. 2016;33(6):1635–1638. doi: 10.1093/molbev/msw046 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref039] 39.Tresoldi T. Ngesh, a tool for simulating random phylogenetic trees. Version 0.5; 2021. https://github.com/tresoldi/ngesh.

[pcbi.1010056.ref040] 40. Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, et al. The complete genome sequence of Escherichia coli K-12. science. 1997;277(5331):1453–1462. doi: 10.1126/science.277.5331.1453 [DOI] [PubMed] [Google Scholar]

[pcbi.1010056.ref041] 41. Brown T, Didelot X, Wilson DJ, De Maio N. SimBac: simulation of whole bacterial genomes with homologous recombination. Microbial genomics. 2016;2(1). doi: 10.1099/mgen.0.000044 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010056.ref042] 42. Shchur V, Spirin V, Pokrovskii V, Burovski E, De Maio N, Corbett-Detig R. VGsim: scalable viral genealogy simulator for global pandemic. medRxiv. 2021;. doi: 10.1101/2021.04.21.21255891 [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets

Nicola De Maio

William Boulton

Lukas Weilguny

Conor R Walker

Yatish Turakhia

Russell Corbett-Detig

Nick Goldman

Roles

Abstract

Author summary

Introduction

Materials and methods

Simple approach

Hierarchical approach

Binary genome search tree

Fig 1. Example genome search tree and its use.

Hierarchical, multi-layer evolving genome search tree

Fig 2. Example of multi-layer genome search tree and its evolution.

Indels

Table 1. Indel length distribution options.

Insertion algorithm

Deletion algorithm

Further details of the implementation

Substitution models

Table 2. A comparison of features of different sequence evolution simulation software packages.

Models of rate variation

Rate normalization

Output formats

Python package

Comparisons with other methods

SARS-CoV-2 datasets

Bacterial datasets

Results

SARS-CoV-2 scenario

Fig 3. Comparison of running times of different simulators in a scenario similar to SARS-CoV-2 data.

Bacterial scenario

Fig 4. Comparison of running times of different simulators in a scenario similar to E. Coli outbreak data.

Evolutionary and indel models

Fig 5. Comparison of running times of different simulators in a SARS-CoV-2 scenario using different evolutionary models.

Fig 6. Comparison of running times of Indelible and phastSim simulators in a SARS-CoV-2 scenario with indels.

The impact of branch lengths

Fig 7. Comparison of running times of different simulators in a SARS-CoV-2 scenario after rescaling the tree branch lengths by different factors.

Discussion

Supporting information

Acknowledgments

Data Availability

Funding Statement

References

Decision Letter 0

Ville Mustonen

Joel O Wertheim

Roles

Author response to Decision Letter 0

Decision Letter 1

Ville Mustonen

Roles

Author response to Decision Letter 1

Decision Letter 2

Ville Mustonen

Roles

Acceptance letter

Ville Mustonen

Roles

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases