PLOS One. 2022 Nov 2;17(11):e0275623. doi: 10.1371/journal.pone.0275623

RASCL: Rapid Assessment of Selection in CLades through molecular sequence analysis

Alexander G Lucaci 1,*, Jordan D Zehr 1, Stephen D Shank 1, Dave Bouvier 2, Alexander Ostrovsky 3, Han Mei 2, Anton Nekrutenko 2, Darren P Martin 4, Sergei L Kosakovsky Pond 1,*
Editor: Vladimir Makarenkov 5
PMCID: PMC9629619  PMID: 36322581

Abstract

An important unmet need revealed by the COVID-19 pandemic is the near-real-time identification of potentially fitness-altering mutations within rapidly growing SARS-CoV-2 lineages. Although powerful molecular sequence analysis methods are available to detect and characterize patterns of natural selection within modestly sized gene-sequence datasets, the computational complexity of these methods and their sensitivity to sequencing errors render them effectively inapplicable in large-scale genomic surveillance contexts. Motivated by the need to analyze new lineage evolution in near-real time using large numbers of genomes, we developed the Rapid Assessment of Selection within CLades (RASCL) pipeline. RASCL applies state-of-the-art phylogenetic comparative methods to evaluate selective processes acting at individual codon sites and across whole genes. RASCL is scalable and produces automatically updated, regular lineage-specific selection analysis reports, even for lineages that include tens or hundreds of thousands of sampled genome sequences. Key to this performance is (i) generation of automatically subsampled high-quality datasets of gene/ORF sequences drawn from a selected “query” viral lineage; (ii) contextualization of these query sequences in codon alignments that include high-quality “background” sequences representative of global SARS-CoV-2 diversity; and (iii) the extensive parallelization of a suite of computationally intensive selection analysis tests. Within hours of being deployed to analyze a novel rapidly growing lineage of interest, RASCL will begin yielding JavaScript Object Notation (JSON)-formatted reports that can be either imported into third-party analysis software or explored in standard web-browsers using the premade RASCL interactive data visualization dashboard.
By enabling the rapid detection of genome sites evolving under different selective regimes, RASCL is well-suited for near-real-time monitoring of the population-level selective processes that will likely underlie the emergence of future variants of concern in measurably evolving pathogens with extensive genomic surveillance.

Introduction

Rapid characterization and assessment of clade-specific mutations that are found in persistent or rapidly expanding SARS-CoV-2 lineages have become an important component of efforts to monitor and manage the COVID-19 pandemic. Identifying the most relevant mutations, e.g., those likely to impact transmission or immune escape, is a priority when new lineages are discovered, as these are key to assessing a lineage’s potential threat level. If observed mutations have not been previously characterized, computational and laboratory-based analytical approaches to inferring whether the mutations provide transmission or immune escape advantages are generally too slow to inform early public health responses.

Epidemiologically relevant mutations are likely subject to natural selection because they provide a fitness advantage [1]. Such mutations can be identified by detecting the subtle patterns of nucleotide variation within gene-sequence datasets that are indicative of selective processes. There are a multitude of powerful computational techniques that, given sufficiently informative sequence data, can identify individual codons within genes that are evolving under a range of different selective regimes [2]. Hundreds of papers and preprints have used some of the methods implemented in HyPhy [3] and Datamonkey [4] for SARS-CoV-2 selection analyses, e.g., [5–8], but on small, commonly hand-curated, datasets. This is because codon-based selection analyses do not scale well to more than a few hundred sequences when using the generic out-of-the-box versions of these analyses, and because noisy sequencing data (errors in assembled consensus genomes) can drive false positives.

We developed RASCL to standardize and accelerate comparative selection detection analyses of SARS-CoV-2 variants of interest (VOI) or variants of concern (VOC). The tool has been used to study selective forces which, at least in part, drove the emergence of the Alpha, Beta, Gamma, and Omicron VOCs [9–12]. More broadly, a tool like RASCL enables near-real-time monitoring of emergent lineages, which in turn can be used both to detect potentially adaptive mutations before they rise to high frequencies, and to help establish relationships between individual mutations and key viral characteristics including pathogenicity, transmissibility, immune evasiveness and drug resistance [13–17]. Through routine analysis of patterns of ongoing selection within individual major lineages, we can reveal the variants or circulating sub-lineages that carry potentially concerning fitness-enhancing mutations, and which would therefore most likely drive future viral transmission [18].

Materials and methods

RASCL application overview

The “query” set of whole genome sequences, which is the target of the selection analyses (e.g., BA.5 clade sequences), is compared against a diverse set of “background” sequences chosen to represent SARS-CoV-2 sequences that circulated globally throughout the pandemic. Our background dataset is available at https://github.com/veg/RASCL/tree/main/data/ReferenceSetViPR and was assembled from the Virus Pathogen Database and Analysis Resource (ViPR, viprbrc.org) [19], a curated database of publicly available viral pathogen sequences, assemblies, and genome annotations (S1 File). The inclusion of the background dataset also provides an “outgroup” clade, enabling the study of selection on branches basal to the clade of interest (COI) and the discovery of regions under selection pressure that are unique to the COI. The application uses several open-source tools, as well as selection analysis modules from the HyPhy software package, and assembles the results from the analysis into JSON files, which can then be visualized with our full-featured ObservableHQ [20] notebook.

Map and compress

Specifically, RASCL takes as input (i) a “query” dataset comprising a single FASTA file containing unaligned SARS-CoV-2 full or partial genomes belonging to a clade of interest (e.g. all sequences from the PANGO [17, 18] lineage B.1.617.2) and (ii) a generic “background” dataset that might comprise, for example, a set of sequences that are representative of global SARS-CoV-2 genomic diversity, e.g. those assembled from ViPR. It is not necessary to remove sequences in the query dataset that are duplicated in the background dataset—the pipeline will do this automatically.

The choice of query and background datasets is analysis-specific. For example, if another clade of interest is provided as a background, it is possible to directly identify the sites that are evolving differentially between the two clades. Other sensible choices of query sequences might be sequences from a specific country/region, or sequences sampled during a particular time-period. Note that the analysis does not require the two sets to be reciprocally monophyletic, but in many applications, this will be the case. Following the automated mapping of whole genome datasets into individual coding sequences (based on the NCBI reference annotation), the gene datasets (each containing a set of query and background sequences) are processed in parallel.

Prepare for selection analysis

Using complete linkage distance clustering implemented in the TN93 package (tn93-cluster tool, https://github.com/veg/tn93), RASCL subsamples from available sequences while maintaining overall genomic diversity; the clustering threshold distance is chosen automatically to include no more than a user-specified number of genomes “D” (e.g., 300). Other available subsampling methods for SARS-CoV-2 genomes rely on spatiotemporal distributions [21]; our method instead relies on increasing sequence diversity to enhance evolutionary signal for downstream selection analyses while reducing computational complexity (our recent analyses of several SARS-CoV-2 clades are discussed in the Results section below). Core method implementations in HyPhy can handle up to 25,000 subsampled sequences in a reasonable time, but the computational cost increases rapidly, and for faster turnaround a setting of 1000 sequences is recommended. Following dataset compression, RASCL creates a combined (query and background) alignment with only the sequences that are divergent enough to be useful for subsequent selection analyses. A maximum likelihood phylogenetic tree is inferred from the merged dataset with RAxML-NG [22] or IQ-TREE [23], and the query and background branches of this tree are labeled as Query or Background; internal branches of the tree are labeled using maximum parsimony.
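
The automatic choice of a clustering threshold can be illustrated with a minimal sketch. It is not the tn93-cluster implementation: the function names, the toy distance values, and the linear scan over candidate thresholds are assumptions made for illustration only. The sketch performs greedy complete-linkage clustering and returns the smallest candidate threshold that yields no more than a requested number of clusters, from each of which one representative sequence could then be retained.

```python
from itertools import combinations


def complete_linkage_clusters(dist, n, threshold):
    """Merge clusters greedily while the largest pairwise distance
    inside the merged cluster stays at or below `threshold`."""
    clusters = [[i] for i in range(n)]

    def linkage(a, b):
        # Complete linkage: cluster distance is the maximum pairwise distance.
        return max(dist[min(i, j), max(i, j)] for i in a for j in b)

    while len(clusters) > 1:
        d, x, y = min(
            ((linkage(a, b), x, y)
             for (x, a), (y, b) in combinations(enumerate(clusters), 2)),
            key=lambda item: item[0],
        )
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so index y is still valid
        del clusters[y]
    return clusters


def choose_threshold(dist, n, max_clusters):
    """Return the smallest candidate threshold giving <= max_clusters clusters."""
    candidates = [0.0] + sorted(set(dist.values()))
    for threshold in candidates:
        clusters = complete_linkage_clusters(dist, n, threshold)
        if len(clusters) <= max_clusters:
            return threshold, clusters
    return candidates[-1], clusters


# Toy data (hypothetical): five sequences placed on a line, distances scaled
# to resemble small genetic distances.
positions = [0, 1, 2, 10, 11]
dist = {(i, j): abs(positions[i] - positions[j]) / 1000
        for i, j in combinations(range(len(positions)), 2)}

threshold, clusters = choose_threshold(dist, len(positions), max_clusters=2)
print(threshold, clusters)
```

In this toy case the two tight groups of sequences collapse into two clusters, so two representatives would stand in for five inputs. The quadratic search shown here is only practical for small examples.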

Selection analyses

Selection analyses are performed with state-of-the-art molecular evolution [24] models implemented in HyPhy. To partially mitigate the potentially confounding influences of sequencing errors and of within-host evolution [25, 26], where mutations occurring within an individual have not been filtered by selection at the broader population level, selection analyses are performed only on the internal branches of phylogenetic trees, where one or more rounds of virus transmission are captured [27]. The following individual selection tests are applied to each gene-level alignment of merged query and background sequences.

  • Branch-site Unrestricted Statistical Test for Episodic Diversification with Synonymous rate variation (BUSTED[S]): this method tests for gene-wide selection which is either pervasive (occurring throughout the evolutionary tree) or episodic (occurring only on some lineages) [28].

  • Single-Likelihood Ancestor Counting (SLAC): this method uses a combination of maximum-likelihood and counting approaches to infer pervasive selection through nonsynonymous (dN) and synonymous (dS) substitution rates on a per-site basis for a given coding alignment and corresponding phylogeny. We use the results from SLAC to create substitution mapping of genomic sites and selection analysis results across methods [29].

  • Coevolution detection using Bayesian Graphical Models (BGM): this method identifies groups of sites that might be co-evolving using the joint distribution of substitutions [30].

  • Fixed Effects Likelihood (FEL): this method locates codon sites within a gene with evidence of pervasive positive diversifying or negative selection by inferring nonsynonymous (dN) and synonymous (dS) substitution rates on a per-site basis for a given coding alignment and corresponding phylogeny [29].

  • Mixed Effects Model of Evolution (MEME): a more sensitive analysis than FEL, this method locates codon sites with evidence of episodic positive diversifying selection [31].

  • Relaxed Selection (RELAX): compares gene-wide selection pressure and looks for evidence that the strength of selection has been relaxed (or intensified) between the query clade and background sequences [32].

  • Contrast-FEL: comparison of site-by-site selection pressure between query and background sequences to detect evidence indicative of different selective regimes [33].

  • A FUBAR [34] Approach to Directional Selection (FADE): this method identifies amino-acid sites with evidence of directional selection [35]. FUBAR refers to our previously published method Fast, Unconstrained Bayesian AppRoximation for Inferring Selection.

  • FitMultiModel (FMM): this method identifies genes with complex multiple instantaneous substitutions that occur within a codon, a rare but potent source of evolutionary signal [36].
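
Because the tests listed above are restricted to internal branches, and internal branches are labeled by maximum parsimony, it may help to see how such labeling can work. The following is a minimal sketch using the Fitch parsimony algorithm on a toy tree; it is not RASCL's own labeling code. The nested-tuple tree encoding, the function name, and the choice to resolve ambiguous nodes as "Background" are all assumptions made for illustration.

```python
def label_internal_nodes(tree, leaf_labels, default="Background"):
    """Assign a Query/Background label to every internal node of a tree
    given as nested tuples of leaf names, using Fitch parsimony."""
    state_sets = {}

    def up(node):
        # Bottom-up pass: intersect child state sets, or take their union
        # when the children share no state.
        if isinstance(node, str):
            states = {leaf_labels[node]}
        else:
            child_sets = [up(child) for child in node]
            shared = set.intersection(*child_sets)
            states = shared if shared else set.union(*child_sets)
        state_sets[node] = states
        return states

    labels = {}

    def down(node, parent_state):
        # Top-down pass: keep the parent's state when it is allowed,
        # otherwise fall back to the default or the only available state.
        states = state_sets[node]
        if parent_state in states:
            state = parent_state
        elif default in states:
            state = default
        else:
            state = next(iter(states))
        if not isinstance(node, str):
            labels[node] = state
            for child in node:
                down(child, state)

    up(tree)
    down(tree, None)
    return labels


# Toy tree (hypothetical): a query clade of three tips and three background tips.
tree = ((("q1", "q2"), "q3"), ("b1", ("b2", "b3")))
leaf_labels = {"q1": "Query", "q2": "Query", "q3": "Query",
               "b1": "Background", "b2": "Background", "b3": "Background"}

labels = label_internal_nodes(tree, leaf_labels)
query_internal = [node for node, state in labels.items() if state == "Query"]
print(len(query_internal), "internal nodes labeled Query")
```

Each internal node stands for the branch leading to it, so the nodes labeled Query here correspond to the internal branches of the query clade on which the tests would be run.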

Software availability

The RASCL application, depicted at high level in Fig 1, is implemented:

Fig 1. The RASCL application overview.

We highlight the high-level architecture of the RASCL workflow, which proceeds in multiple phases. (1) In the Map and Compress step, input query and background whole genome sequences are separated into individual genes from the viral genome by mapping to the reference gene; for each gene we then extract the representative gene diversity using genetic distance clustering. (2) We prepare our gene alignments for selection analysis by merging alignments from the background and query datasets into a “combined” dataset, from which we infer a phylogenetic tree and annotate it based on query and background sequences. (3) We perform selection analyses in HyPhy (described in further detail in the Methods section). (4) We combine the results of selection analyses across the viral genome by mapping substitutions to each position in the viral genome, creating a selection ‘profile’ for each statistically significant site, and write them to an interpretable JSON-formatted file. These combined results are then used for further post-hoc or downstream analysis or ingested by our interactive notebook.

Visualization and downstream post-hoc analyses

Results are combined using a Python script “generate-report.py” into two machine-readable JSON files (“summary.json” and “annotation.json”) that represent detailed analysis results for gene segments and individual sites, respectively. JSON is an open-standard, text-based file format that is also human-readable; it is well-suited for representing structured data and is commonly used for transmitting data in web applications. These JSON files can either be used as input for other software or visualized within a standard web-browser via a feature-rich interactive RASCL dashboard hosted on ObservableHQ (Fig 2). For the web application implementation of RASCL, alignments, trees and analysis results are stored and made web-accessible via the Galaxy platform. The interactive notebooks include an alignment viewer, a visualization of individual codons/amino acid states at user-selected sites mapped onto the tips of phylogenetic trees, and detailed tabulated information on analysis results for individual genes and codon-sites.
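
As a minimal sketch of using these files as input for other software, the snippet below loads a report and outlines its top-level structure. Only the file names come from the text above; the helper names are hypothetical, and no assumption is made about the keys inside either file, which should be inspected before any further processing.

```python
import json
from pathlib import Path


def load_report(path):
    """Parse one RASCL JSON report from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def top_level_outline(report):
    """Map each top-level key to the type of its value, without
    assuming anything about the report's schema."""
    if isinstance(report, dict):
        return {key: type(value).__name__ for key, value in report.items()}
    return {"<root>": type(report).__name__}


# File names as given in the text; they are skipped if not present locally.
for name in ("summary.json", "annotation.json"):
    if Path(name).exists():
        print(name, top_level_outline(load_report(name)))
```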

Fig 2. Example visualization using our interactive notebook.

Here, we highlight some of the features of our interactive notebook, which was created to facilitate result exploration. Key features include: (1) tables with statistically significant results for each selection analysis, (2) the ability to explore the full phylogenetic tree or a site-level tree to examine selection acting on individual sites, and (3) a multiple sequence alignment viewer for any of the genes in the results.

Results and discussion

RASCL uses molecular sequence data from genotypically distinct viral lineages to identify distinguishing features and evolution within lineages. A query set of sequences is compared against a globally diverse set of background sequences. The background data set typically contains globally circulating viral sequences, and the query data set is the set of sequences of particular interest to the user. Below, we describe our analyses of several variants of SARS-CoV-2 whole genome sequences, but RASCL is applicable to any measurably evolving pathogen with sufficient surveillance data.

An overview of molecular surveillance of important SARS-CoV-2 viral clades

As a concrete example of the utility of RASCL, consider our analysis of 112,017 BA.1 (WHO Omicron, all available BA.1 sequences as of January 2, 2022) sequences. RASCL selected a median (per gene) of 524 BA.1 sequences (a compression ratio of 99.53%) and a median of 145 background sequences (from a dataset of over 150,000 publicly available sequences from ViPR) to represent genomic diversity in SARS-CoV-2. Using the Spike gene as an exemplar, we compressed all available sequences down to 933 representative sequences, reflecting a compression ratio of 99.17%. This level of compression is representative of recent analyses of a few VOIs/VOCs (see Table 1) with RASCL.
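
The compression percentages quoted above follow from the fraction of available sequences that is not retained. The short sketch below reproduces the arithmetic; the function name is ours and is used only for illustration.

```python
def compression_percent(available, retained):
    """Percentage of available sequences removed by subsampling."""
    return 100.0 * (1.0 - retained / available)


ba1_available = 112_017  # BA.1 sequences available as of January 2, 2022

median_per_gene = compression_percent(ba1_available, 524)
spike = compression_percent(ba1_available, 933)

print(f"median per gene: {median_per_gene:.2f}%")  # 99.53%
print(f"Spike:           {spike:.2f}%")            # 99.17%
```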

Table 1. Examples of various recent RASCL analyses on variants of interest and variants of concern for SARS-CoV-2.

Full results, and more recent updates, are available through our interactive notebooks (see Methods section for additional details). The lower compression ratios for Omicron reflect the higher genomic variability of this lineage, which gave rise to many sublineages soon after emergence, combined with a high volume of viral genome sequencing.

| Lineage | Available | Median analyzed | Compression ratio | Positively selected sites (p≤0.05) | Positively selected sites (p≤0.0001) |
|---|---|---|---|---|---|
| BA.2 (WHO Omicron) | 48,623 | 319 | 1:152 | 168 | 29 |
| BA.1 (WHO Omicron) | 112,017 | 524 | 1:208 | 200 | 42 |
| B.1.617.2 (WHO Delta) | 1,983 | 59 | 1:29 | 77 | 1 |
| B.1.621 (WHO Mu) | 3,288 | 101 | 1:32 | 67 | 3 |
| C.37 (WHO Lambda) | 2,127 | 80 | 1:26 | 53 | 1 |
| P.1 (WHO Gamma) | 2,070 | 47 | 1:43 | 49 | 4 |
| B.1.1.7 (WHO Alpha) | 8,586 | 169 | 1:50 | 44 | 1 |

Importantly, there is evidence of diversifying positive selection acting on the BA.1 sequences, on 42 (p≤0.0001) individual sites (there are 4312 sites that are polymorphic in the amino-acid space among clade sequences; selection also includes the basal branch of the clade) along the internal branches of the clade (S1 Table). There is evidence of diversifying positive selection acting on the BA.1 sequences, on 359 individual sites along all branches of the clade (S2 Table), with 40 sites with an LRT p-value of ≤ 0.0001. When comparing the strength of selection on BA.1 to background sequences along internal tree branches, 21 individual sites along the internal branches of the clade (S3 Table) showed statistically significant differences. Over the entire tree, 31 sites demonstrate evidence of directional selection (S4 Table). Along the internal branches of the BA.1 clade, 81 pairs of sites showed evidence of coevolution (S5 Table), 47 sites showed evidence of negative selection (S6 Table), and 15 (out of 21 segments considered, 71.4%) genes/ORFs showed evidence of episodic diversifying selection (BUSTED, q-value ≤ 0.1, S7 Table).

A closer examination of the SARS-CoV-2 BA.5 clade

We investigated the nature and extent of selective forces acting on the viral genes in BA.5 clade (all available BA.5 sequences as of August 9, 2022) sequences by performing a series of comparative phylogenetic analyses on a median of 258 BA.5 sequences and a median of 113 sequences from available sequences chosen to represent genomic diversity in SARS-CoV-2. We compiled our background dataset from the globally subsampled Nextstrain [39] (https://nextstrain.org/ncov/gisaid/global/all-time) build (last accessed June 12, 2022, genomes were sampled from the beginning of the SARS-CoV-2 pandemic) (S1 File). We observe that:

  • There is evidence of diversifying positive selection acting on the BA.5 sequences, on 94 individual sites (there are 2737 sites that are polymorphic in the amino-acid space among clade sequences; selection also includes the basal branch of the clade) along the internal branches of the clade (S8 Table).

  • There is evidence of diversifying positive selection acting on the BA.5 sequences, on 133 individual sites along all branches of the clade (S9 Table).

  • When comparing the strength of selection on BA.5 to background sequences along internal tree branches, 12 individual sites along the internal branches of the clade (S10 Table) showed statistically significant differences.

  • Over the entire tree 11 sites show evidence of directional selection (S11 Table).

  • Along the internal branches of the BA.5 clade, 37 pairs of sites showed evidence of coevolution (S12 Table).

  • Along the internal branches of the BA.5 clade, 14 sites showed evidence of negative selection (S13 Table).

  • Along the internal branches of the BA.5 clade, 14 (out of 21 segments considered, 66.7%) genes/ORFs showed evidence of episodic diversifying selection (BUSTED[S], q-value ≤ 0.1, Table 2).

Table 2. BUSTED[S] selection results on the BA.5 SARS-CoV-2 clade across segments.

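| Segment | omega1 | p1 | omega2 | p2 | omega3 | p3 | p | q |
|---|---|---|---|---|---|---|---|---|
| 3C | 0.03 | 0.94 | 0.07 | 0.01 | 4.93 | 0.05 | 0.2449 | 0.3429 |
| E | 0 | 0.32 | 1 | 0 | 1.18 | 0.68 | 0.4462 | 0.5511 |
| M | 0 | 0.17 | 0 | 0.79 | 8.69 | 0.04 | 0.0006 | 0.0009 |
| N | 0 | 0.93 | 0.69 | 0.05 | 51.71 | 0.02 | 0 | 0 |
| ORF3a | 0 | 0.35 | 0 | 0.61 | 25.77 | 0.04 | 0 | 0 |
| ORF6 | 0 | 0.03 | 0 | 0.93 | 34.85 | 0.04 | 0 | 0 |
| ORF7a | 0.21 | 0.91 | 0.29 | 0.05 | 28.91 | 0.03 | 0 | 0 |
| ORF8 | 0.42 | 0.96 | 0.43 | 0.01 | 45.23 | 0.02 | 0 | 0 |
| RdRp | 0.11 | 0.98 | 0.47 | 0.01 | 76.27 | 0.01 | 0 | 0 |
| S | 0.75 | 0.98 | 0.87 | 0.02 | 64042.59 | 0 | 0 | 0 |
| endornase | 0 | 0.59 | 0 | 0.39 | 18.75 | 0.03 | 0.0005 | 0.0008 |
| helicase | 0 | 0.73 | 0 | 0.09 | 2.3 | 0.17 | 0.32 | 0.4201 |
| leader | 1 | 0 | 1 | 1 | 8003.48 | 0 | 0.0002 | 0.0004 |
| methyltransferase | 0 | 0.06 | 0 | 0.93 | 67.61 | 0.01 | 0 | 0 |
| nsp10 | 0.32 | 0.49 | 0.5 | 0.51 | 1 | 0 | 0.5 | 0.5833 |
| nsp2 | 0 | 0.89 | 0.74 | 0.09 | 72.38 | 0.01 | 0 | 0 |
| nsp3 | 0 | 0.76 | 0 | 0.23 | 47.93 | 0.01 | 0 | 0 |
| nsp6 | 0 | 0.07 | 0 | 0.46 | 1 | 0.47 | 0.5 | 0.5526 |
| nsp7 | 0.32 | 0.85 | 0.34 | 0.1 | 1 | 0.05 | 0.5 | 0.525 |
| nsp8 | 0 | 0.92 | 0 | 0.07 | 34.97 | 0.02 | 0 | 0 |
| nsp9 | 0.06 | 0 | 0.36 | 1 | 1.11 | 0 | 0.5 | 0.5 |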

Segment corresponds to the gene or ORF under analysis. omega1 refers to the first omega rate class, p1 refers to the proportion of sites which fit this rate class. omega2 refers to the second omega rate class, p2 refers to the proportion of sites which fit this rate class. omega3 refers to the third omega rate class, which captures the episodic diversifying features, p3 refers to the proportion of sites which fit this rate class. p refers to the p-value for the likelihood ratio test. q refers to the multiple-test corrected q-value (Benjamini-Hochberg). We indicate statistically significant segments with bolded text.
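
To make the multiple-test correction concrete, the sketch below applies the standard Benjamini-Hochberg step-up procedure to the rounded p-values listed in Table 2 and counts the segments with q ≤ 0.1. It is our own minimal implementation, not the code used to produce the table. Because the inputs are rounded and the step-up procedure enforces monotone q-values, the largest q-values it returns may differ slightly from the tabulated ones.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as the input."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, keeping q-values monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        q_values[i] = running_min
    return q_values


# Rounded p-values as listed in Table 2 (a value of 0 means "below rounding").
table2_p = {
    "3C": 0.2449, "E": 0.4462, "M": 0.0006, "N": 0, "ORF3a": 0, "ORF6": 0,
    "ORF7a": 0, "ORF8": 0, "RdRp": 0, "S": 0, "endornase": 0.0005,
    "helicase": 0.32, "leader": 0.0002, "methyltransferase": 0, "nsp10": 0.5,
    "nsp2": 0, "nsp3": 0, "nsp6": 0.5, "nsp7": 0.5, "nsp8": 0, "nsp9": 0.5,
}

segments = list(table2_p)
q_values = benjamini_hochberg([table2_p[s] for s in segments])
significant = [s for s, q in zip(segments, q_values) if q <= 0.1]
print(len(significant), "of", len(segments), "segments with q <= 0.1")
```

The count agrees with the 14 of 21 segments reported as significant in the text.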

We consider our results from gene-wide estimates of adaptation, where we observed that 3 structural and 11 non-structural proteins yield statistically significant results (Table 2). Within the set of structural proteins, we find Spike (S), Membrane glycoprotein (M), and the nucleocapsid phosphoprotein (N). These genes have been implicated in complex biological functions, including as a highly conserved target (M) [40], and have been the focus of studies on viral infection [41], pathology [42], and vaccination and therapeutic intervention [43]. Interestingly, within the set of non-structural proteins we find ORF3a, ORF6, ORF7a, and ORF8, which have been implicated in novel biological mechanisms in the SARS-CoV-2 virus, including the induction of autophagy and a role as a viral ion channel (ORF3a) [44, 45], disruption of nucleocytoplasmic transport (ORF6) [46], and inhibition of the host interferon response (ORF7a, ORF8) [47, 48]. We also find several members of the ORF1ab polyprotein, including RNA-dependent RNA polymerase (RdRp), endoRNAse, leader, methyltransferase, nsp2, nsp3, and nsp8. Several important sites in Spike from Table 3 are discussed, including S/339, S/371, S/440, S/764, and S/1162. We highlight these sites due to the level of statistical signal associated with them (we find ≥ 5 branches selected in the exploratory MEME analysis), and provide selection profiles for each of these sites below.

Table 3. Site profiles for selected sites in the BA.5 Spike gene.

| # | Position | Codon | Branches | FEL | MEME | CFEL | Composition (BA.5) | Composition (Background) |
|---|---|---|---|---|---|---|---|---|
| 1 | 22576 | 339 | 5 | 0.00000645 | 0.0000158 | 0.000363 | D327 G15 -8 N1 | G169 D14 -1 |
| 2 | 22672 | 371 | 6 | 1.01E-07 | 1.56E-07 | 0.0075 | F310 S28 -9 L4 | S168 L8 F6 -2 |
| 3 | 22879 | 440 | 6 | 0.00000818 | 0.0000178 | 0.0716 | K298 N44 -9 | N168 K15 -1 |
| 4 | 23797 | 764 | 6 | 0.00000729 | 0.0000179 | 0.000463 | K325 N17 -8 I1 | N167 K13 -4 |
| 5 | 25045 | 1162 | 6 | 0.0000201 | 0.0000274 | 0.0208 | P308 L41 -2 | P182 Q1 S1 |

Position: the genomic position (SARS-CoV-2), i.e., the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Codon: the location of the codon in the corresponding Spike gene. Branches: the number of tree branches (internal branches of the BA.5 clade) that have evidence of diversifying positive selection at this site (empirical Bayes factor ≥ 100). FEL: the p-value for the likelihood ratio test that non-synonymous rate / synonymous rate ≠ 1 at this site. MEME: the internal-branch p-value for the likelihood ratio test that a non-zero fraction of internal branches have omega > 1 (i.e., episodic diversifying selection at this branch). CFEL: the p-value for the likelihood ratio test that omega ratios between the internal branches of the two clades are different. p-values reported in this table are not corrected for multiple testing. Amino acid composition (including gaps, shown as “-”) is reported for the BA.5 and background datasets at the corresponding site.

Assessing evolutionary pressures on the SARS-CoV-2 μ (B.1.621) clade

We identify genomic sites in B.1.621 (μ) [49] clade sequences that may be subject to selective forces and could be prioritized for further studies but have not yet reached high frequencies. We present an analysis of the B.1.621 variant, performed on individual genes/protein products, using a median of 101 μ sequences subsampled from all available sequences in GISAID [50] to represent the genomic diversity in this clade (all sequences as of September 7, 2021). A similarly subsampled global SARS-CoV-2 dataset, drawn from publicly available sequences in the ViPR database, is used as background and is provided in our GitHub repository (linked to in our Methods section). Interactive results for this analysis can be explored in our ObservableHQ notebook (https://observablehq.com/@aglucaci/rascl-mu) and our Virological (https://virological.org/) post (https://virological.org/t/assessing-evolutionary-pressures-on-the-sars-cov-2-mu-b-1-621-clade/760). Our analysis identified 67 (S14 Table) individual codon sites (among 1643 sites that are polymorphic in the amino-acid space) that showed evidence of episodic diversifying selection along internal branches of this clade using the MEME method at q ≤ 0.20 false discovery rate (FDR). A total of 5 sites (S15 Table) were found to be subject to directional selection using the FADE method.

We identify high-priority sites in SARS-CoV-2 μ (B.1.621) sequences (Table 4), with a “Rank” for each site based on a point system described below. We identify and rank sites (+1 for each category) according to the following protocol:

Table 4. A table of high-priority sites in SARS-CoV-2 μ B.1.621 sequences.
| Coordinate (SARS-CoV-2) | Gene/ORF | Codon (in gene/ORF) | p-value | q-value | Rank | Property |
|---|---|---|---|---|---|---|
| 16075 | RDRP/ORF1b | 879/870Y | 0.0452115 | 0.140312 | 4 | |
| 28873 | N | 201G | 0.0460285 | 0.138086 | 4 | |
| 28253 | ORF8 | 121VHF | 0.0300385 | 0.180231 | 4 | |
| 25336 | S | 1259HV | 0.0157386 | 0.134903 | 4 | |
| 25333 | S | 1258D | 0.00852771 | 0.0959367 | 4 | |
| 13516 | RDRP/ORF1b | 26/17I | 0.0336153 | 0.168077 | 4 | |
| 14122 | RDRP/ORF1b | 228/219D | 0.036867 | 0.170155 | 4 | |
| 14530 | RDRP/ORF1b | 364/355F | 0.0416466 | 0.144161 | 4 | |
| 14767 | RDRP/ORF1b | 443/434V | 0.0384624 | 0.161005 | 4 | |
| 14785 | RDRP/ORF1b | 449/440T | 0.0383252 | 0.168257 | 4 | charge |
| 17976 | Helicase/ORF1b | 580/1503F | 0.0113156 | 0.107201 | 3 | |
| 20550 | Endornase/ORF1b | 310/2361I | 0.00645087 | 0.0893198 | 3 | |
| 21234 | Methyltransferase/ORF1b | 192/2589Y | 0.049474 | 0.134929 | 3 | |
| 21640 | S | 27LT | 0.000824192 | 0.0247258 | 3 | |
| 21997 | S | 146YNTPS | 0.000598969 | 0.0215629 | 3 | |
| 22003 | S | 148S | 0.0221244 | 0.165933 | 3 | |
| 22000 | S | 147HQNP | 0.0110123 | 0.1166 | 3 | Overall, secondary |
| 19482 | Exonuclease/ORF1b | 481/2005V | 0.0474395 | 0.133424 | 3 | |
| 25707 | ORF3a | 106FPIL | 0.0410847 | 0.145005 | 3 | |
| 27210 | ORF6 | 4PHI | 7.99E-06 | 0.000479141 | 3 | |
| 29023 | N | 251S | 0.0425228 | 0.141743 | 3 | |
| 19548 | Exonuclease/ORF1b | 503/2027VS | 0.0470548 | 0.134442 | 3 | |
| 29443 | N | 391AN | 0.0325738 | 0.167522 | 3 | |
| 14470 | RDRP/ORF1b | 344/335I | 0.0339005 | 0.164921 | 3 | |
| 18327 | Exonuclease/ORF1b | 96/1620I | 0.0395957 | 0.15494 | 3 | |
| 2944 | NSP3/ORF1a | 633/894S | 0.0246692 | 0.158588 | 3 | |
| 17820 | Helicase/ORF1b | 528/1451S | 0.0392635 | 0.160624 | 3 | |
| 17025 | Helicase/ORF1b | 263/1186I | 0.0241484 | 0.167182 | 3 | |
| 15001 | RDRP/ORF1b | 521/512C | 0.0457987 | 0.139725 | 3 | |
| 9472 | NSP4/ORF1a | 307/3070T | 0.040304 | 0.145094 | 3 | |
| 9424 | NSP4/ORF1a | 291/3054T | 0.0402514 | 0.147862 | 3 | |
| 9139 | NSP4/ORF1a | 196/2959S | 0.0111345 | 0.111345 | 3 | |
| 8614 | NSP4/ORF1a | 21/2784VT | 0.0400482 | 0.153376 | 2 | |
| 6535 | NSP3/ORF1a | 1273/2091D | 0.0246588 | 0.164392 | 2 | |
| 8578 | NSP4/ORF1a | 9/2772H | 0.0198022 | 0.162018 | 2 | volume |
| 19479 | Exonuclease/ORF1b | 480/2004CV | 0.00358349 | 0.0586389 | 2 | volume, charge |
| 8659 | NSP4/ORF1a | 36/2799N | 0.0441831 | 0.142017 | 2 | charge |
| 18960 | Exonuclease/ORF1b | 307/1831V | 0.0462054 | 0.134145 | 2 | |
| 16260 | Helicase/ORF1b | 8/931Y | 0.0309811 | 0.17989 | 2 | |
| 2230 | NSP2/ORF1a | 476/656SA | 0.0316127 | 0.172433 | 2 | |

Genomic position (in SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: the gene or ORF to which this site belongs. Codon in gene: the location of the codon in the corresponding gene/ORF. p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this branch). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). We assign a “Rank” to each site based on a point system described above. Properties: which, if any, of the five composite biochemical properties [52] are conserved or changed at this site.

  • Inferred to be under positive selective pressure.

  • Are not clade-defining mutations.

  • Contain mutations that are not predictable based on the evolution of Sarbecovirus sequences [51].

  • Contain mutations that occur in a large fraction of unique haplotypes. This was shown to be predictive of near-term growth in a separate analysis from our global SARS-CoV-2 analysis [17].

Additionally, where available, we provide interpretation of our identified sites in terms of known functional significance, temporal growth trends and location on the 3D structure of the protein. Briefly, each site is given a point for each of the following requirements that it meets: the site is found to be positively selected; the site occurs outside of the clade-defining site set; the site has any unexpected mutations as described in our Sarbecoviruses evolutionary analysis notebook (https://observablehq.com/@spond/sars-cov-2-pvo); and the site has a mutation present above the minimum threshold in the SARS-CoV-2 Global Haplotype analysis (https://observablehq.com/@spond/sc2-haplotypes). For the last requirement, we remove any mutation present in the background clade and remove gaps.
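
The point system can be sketched in a few lines. The class and field names below are hypothetical and chosen only to mirror the four requirements listed above; the sketch does not reproduce how each requirement is evaluated, only how the points add up to a “Rank”.

```python
from dataclasses import dataclass


@dataclass
class SiteEvidence:
    """One flag per requirement; each satisfied requirement is worth one point."""
    positively_selected: bool
    outside_clade_defining_set: bool
    unexpected_in_sarbecoviruses: bool
    above_haplotype_threshold: bool

    def rank(self) -> int:
        return sum((
            self.positively_selected,
            self.outside_clade_defining_set,
            self.unexpected_in_sarbecoviruses,
            self.above_haplotype_threshold,
        ))


# Hypothetical sites: one meeting all four requirements, one meeting two.
top_site = SiteEvidence(True, True, True, True)
lower_site = SiteEvidence(True, True, False, False)
print(top_site.rank(), lower_site.rank())
```

Under this scheme the highest possible rank is 4, which matches the top rank shown in Table 4.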

Evidence of natural selection history operating on SARS-CoV-2 genomes

For the set of high-priority SARS-CoV-2 genomic sites inferred from B.1.621 sequences (taken from Table 4), we observe when and how selection (positive or negative) operated on them through a series of 3-month overlapping intervals going back to the beginning of the pandemic (Fig 3). The earliest intervals end in February 2020 and the latest in September 2021. For selected sites, we observe the temporal trends of high-priority sites in Spike (Fig 4) and RdRp (Fig 5) in B.1.621 sequences. We also describe the spatial location, on the structure of the protein, of sites in the Spike gene inferred to be under positive selective pressure in B.1.621 (μ) sequences (Fig 6; sites from Table 4).

Fig 3. Evolutionary trajectories of 40 high-priority selected sites (from Table 4).

If a site was found to be positively (red) or negatively (blue) selected during a specific time, a bubble will be drawn at a corresponding point on the plot. The area of the bubble is scaled as -log10 p, where p is the p-value of the FEL likelihood ratio test. Larger bubbles correspond to smaller p-values; p-values are not directly comparable between different time windows and different genes due to differences in sample sizes and other factors. The x-axis shows the endpoint of the time-window, e.g., March 30th, 2021, will correspond to the analysis performed with the data from January 1, 2021, to March 30, 2021. Figures like this can be generated with the “Evidence of natural selection history operating on SARS-CoV-2 genomes” ObservableHQ notebook (https://observablehq.com/@spond/sars-cov-2-selected-sites).
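
The bubble scaling described in this legend can be written out as a small sketch. The scale factor and the lower clamp on the p-value are our assumptions, added so that a p-value reported as 0 does not produce an infinite area; they are not taken from the notebook.

```python
import math


def bubble_area(p_value, scale=10.0, min_p=1e-300):
    """Bubble area proportional to -log10(p); smaller p gives a larger bubble."""
    clamped = max(p_value, min_p)  # avoid log10(0) for p-values rounded to 0
    return scale * -math.log10(clamped)


for p in (0.05, 0.01, 0.0001):
    print(p, round(bubble_area(p), 2))
```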

Fig 4. Temporal trends of the substitution combinations at selected sites represented in Table 4 in the Spike gene for B.1.621 (μ) sequences in 2021 (from left to right: S/27, S/146, S/147, S/1258, S/1259).

The symbol “.” denotes the reference residue at that site. Figures like this can be generated using the “Trends in mutational patterns across SARS-CoV-2 Spike” notebook (https://observablehq.com/@spond/spike-trends). Additional search parameters include “B.1.621[pangolin] AND 20210101[after]”.

Fig 5. Temporal trends of the substitution combinations at all sites represented in Table 4 in the RDRP (RNA-dependent RNA polymerase) gene for B.1.621 (μ) sequences in 2021 (from left to right: RDRP/26, RDRP/228, RDRP/344, RDRP/364, RDRP/443, RDRP/449, RDRP/521, RDRP/879).

The symbol “.” denotes the reference residue at that site. Figures like this can be generated using the “Trends in mutational patterns across SARS-CoV-2 Spike” notebook (https://observablehq.com/@spond/spike-trends). Additional search parameters include “B.1.621[pangolin] AND 20210101[after]”.

Fig 6. Spike protein crystal structure annotation 6CRZ (https://www.rcsb.org/structure/6CRZ) with MEME sites, a measure of episodic selection (These sites are listed in Table 4).

The color legend for this figure is as follows: the N-terminal domain (NTD) region is highlighted in blue, the receptor-binding domain (RBD) region in green, the heptad repeat (HR) region in ruby, and MEME (positively selected) sites in orange. To interact with the figure, visit: https://observablehq.com/@aglucaci/categorical-ngl-rascl-mu.

Potential biological and clinical significance of mutations

The ongoing monitoring of emergent VOIs and VOCs can detect adaptive mutations before they rise to high frequency and help establish their relationship to key clinical parameters including pathogenicity and transmissibility. Additionally, continued evolution within a particular clade may form the foundation for a subclade with further functional sites of interest. Based on current information for the Spike gene from Stanford Coronavirus Antiviral Resistance Database (CoVDB, https://covdb.stanford.edu/) [53] we include several annotations with clinical relevance. From the SARS-CoV-2 B.1.621 (μ) Spike gene, we identify the following sites of interest from Table 4 due to their interaction with epitope binding in monoclonal antibodies (mAbs): 144, 145, 146, 147, 148, 417, and 501.

Conclusions

A need that has recurred throughout the course of the COVID-19 pandemic is to rapidly identify molecular changes as they arise within the SARS-CoV-2 genome and to interpret how these changes impact the fitness and host-interaction of the virus. This information is crucial for providing public health officials with up-to-date evidence when making public health decisions. To gather this information, computational and laboratory-based analytical approaches have been used to test and validate hypotheses about the observed genotype and phenotypic implications [54]. These current approaches require significant effort and time to complete, so their results may arrive too slowly to inform early public health responses. Computational methods that detect natural selection can be leveraged to identify sites of interest within viral clades. However, because of the massive number of available SARS-CoV-2 genome sequences, many such methods are incapable of providing results in a timely manner, bottlenecked by the increased computational complexity associated with large-scale analysis. We address this limitation with RASCL, an agile method that can be used to rapidly characterize and assess natural selection at sites and across proteins within viral genomes. With RASCL, SARS-CoV-2 VOI/VOCs can be screened for signals of selection in a standardized manner and at an accelerated rate, while providing easily interpretable, near-real-time results.

The novelty of RASCL lies in its design; it is highly modular and easily adaptable to rapidly analyze any evolving pathogen. Regarding modularity, the pipeline consists of phases, described in Fig 1, each of which can be parallelized on either a high-performance computing environment or a personal computer. We take the intermediate and terminal files created throughout the analysis and combine the pertinent output files into the commonly used, standardized JSON format. To make interpretation and visualization of the results easy for the user, we created a customizable RASCL dashboard page using ObservableHQ, which runs in any web browser. The results page is dynamic and interactive, allowing the user to inspect the results for different signals of selection with ease. The modularity of RASCL makes it highly scalable, yielding near-real-time results at any stage of an outbreak. At the beginning stages of pathogen emergence, when very few sequences exist, analyses run quickly; as the outbreak persists and the number of sequences increases, subsampling can be increased, limiting the computational bottleneck.
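The step of combining per-analysis output files into a single JSON document can be sketched as follows. This is an illustrative Python example only: the function name, the directory layout, and the convention of keying each result by its file name are assumptions, and the actual RASCL summary format may differ.

```python
import json
from pathlib import Path


def combine_results(result_dir: Path, out_path: Path) -> dict:
    """Merge every *.json file in `result_dir` into one dictionary.

    Each input file is stored under its file name without the ".json"
    suffix (a hypothetical convention, e.g. "S.FEL.json" -> "S.FEL").
    """
    combined = {}
    for path in sorted(Path(result_dir).glob("*.json")):
        if path.resolve() == Path(out_path).resolve():
            continue  # do not re-ingest a previously written summary
        with path.open() as handle:
            combined[path.stem] = json.load(handle)
    with Path(out_path).open("w") as handle:
        json.dump(combined, handle, indent=2, sort_keys=True)
    return combined
```

A single combined file of this kind is convenient for browser-based dashboards, which can then load all results with one request.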

RASCL is available in two forms: as a standalone pipeline that uses Snakemake, and as a web application integrated as a workflow in the Galaxy framework. By implementing the method in these two ways, users at any level of bioinformatics expertise benefit, making RASCL highly accessible. RASCL has been designed such that, with minimal modifications to the background and reference genomes, genes under analysis, and default thresholding settings, any other evolving pathogen can be rapidly scrutinized to inform public health measures. For example, RASCL can be modified to screen for signals of selection in gene sequences from the current Monkeypox outbreak. Differences in virulence have been reported between Monkeypox isolates from two different geographic regions [55, 56]; thus, rapidly identifying the evolution within viral genes is highly relevant to public health efforts.

To date RASCL has been used to characterize the role of natural selection in the emergence of the Beta, Gamma, Omicron [17] and BA.4/BA.5 [57] VOC lineages, as well as to identify patterns of convergent evolution in the Alpha, Beta and Gamma lineages [11]. Whole genome sequences from any viral clade of interest (i.e., emerging pathogens) can be separated into a query dataset and a background dataset representing the global diversity of viral sequences. Together these serve as the input to the tool, which identifies the selective forces acting within and between the two sets of sequences. RASCL has also been used to monitor the evolution of several lineages (see Table 1), and whenever future genomic surveillance efforts reveal new, potentially problematic SARS-CoV-2 lineages or sublineages, we will use RASCL to analyze these too.

Among the limitations of the current study is that, at this moment, the RASCL application does not take recombination within genes into account; recombination between genes is handled by performing gene-by-gene analyses. While recombination plays a generally recognized role in the evolution of coronaviruses between species, only a limited amount of recombination is observed in the globally circulating viral population of SARS-CoV-2 [58]. In addition, the selection analyses employed in the RASCL application (described in the Methods section) provide robust statistical inference that is biased only when significant recombination changes the topology of the inferred phylogenetic relationships. Future versions of RASCL will include an optional configuration to detect genetic recombination using state-of-the-art methods [59–61] and will include an updated interactive notebook to visualize and interpret these kinds of complex evolutionary signals. By taking recombination into account, we look forward to increasing the role that the RASCL application can play in the global monitoring and surveillance of evolution in SARS-CoV-2 and other important pathogens.

Our focus in this study is on the rapid, near-real-time monitoring and analysis of emerging pathogens, where the RASCL software application provides interpretable results for molecular surveillance of continued natural evolution. Although it is an area of active research and heated debate, the question of SARS-CoV-2 origins [62, 63] is out of scope for our study. Indeed, while our analysis of VOIs/VOCs in SARS-CoV-2 has uncovered important and emerging regions of interest in key viral proteins, RASCL has broader applicability to global health threats with known natural origins and existing animal reservoirs. However, further investigation into SARS-CoV-2 origins is critical for understanding the true biological and epidemiological context and complex evolutionary history [64] of global pathogens.

Supporting information

S1 File. GISAID accession IDs for the analyses reported in Table 1.

We also report the GISAID accession IDs for our Nextstrain background dataset.

(ZIP)

S1 Table. List of sites found to be under diversifying positive selection by MEME (p≤0.05) along internal branches in BA.1, as well as biochemical properties that are important at this site (via the PRIME method).

Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. % Of branches with omega>1: the fraction of tree branches (internal branches of the BA.1 clade) that have evidence of diversifying positive selection at this site (100%: pervasive selection; <100%: episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this site). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five composite biochemical properties from [52] are conserved or changed at this site.

(CSV)
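The q-values reported in these supporting tables are described as Benjamini-Hochberg corrections of the per-site p-values. As a minimal, self-contained Python sketch of that standard procedure (the function name is illustrative, and this is not claimed to be the code used to produce the tables):

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg q-values in the same order as the input.

    For the p-value of rank r (1 = smallest) among m tests, the raw
    adjustment is p * m / r; q-values are then made monotone by taking
    the running minimum from the largest p-value downwards.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values
```

Sites would then be reported at a chosen false discovery rate by keeping those whose q-value does not exceed the threshold (for example, FDR ≤ 0.2 as used for Contrast-FEL in S3 Table).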

S2 Table. List of sites found to be under diversifying positive selection by MEME (p≤0.05) along all branches in BA.1, as well as biochemical properties that are important at this site (via the PRIME method).

Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. % Of branches with omega>1: the fraction of tree branches (internal branches of the BA.1 clade) that have evidence of diversifying positive selection at this site (100%: pervasive selection; <100%: episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this site). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five composite biochemical properties from [52] are conserved or changed at this site.

(CSV)

S3 Table. List of sites found to be selected differentially along internal branches between BA.1 and background sequences (FDR ≤ 0.2) using the Contrast-FEL method.

Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Ratio of omega (BA.1: reference): the ratio of site-level omega estimates for the two sets of branches. If this ratio is > 1, then selection on BA.1 is stronger. These values are highly imprecise and should be viewed as qualitative measures. p-value: the p-value for the likelihood ratio test that omega ratios between the internal branches of the two clades are different. This is not corrected for multiple testing. q-value: multiple-test corrected q-value (Benjamini-Hochberg).

(CSV)

S4 Table. List of sites found to be evolving under directional selection in the entire tree, using a FUBAR-like implementation of the DEPS [35] method.

The BA.1 tree was rooted on the genome reference sequence for this analysis. Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Target amino-acid: which amino-acids have statistical support (Bayes Factor ≥ 100) for accelerated evolution towards them.

(CSV)

S5 Table. Pairs of sites (BA.1) found to have epistatic (co-evolving) substitution patterns by BGM method.

Gene/ORF: which gene or ORF does this site belong to. Codon 1/2 (in gene/ORF): the location of the two interacting codons in the corresponding Gene/ORF. Posterior probability of non-independence: estimated posterior probability that substitutions which occur on the interior branches of the BA.1 clade are not independent.

(CSV)

S6 Table. List of sites found to be under pervasive negative selection by FEL (p≤0.05) along internal branches in BA.1.

Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Synonymous rate: Site estimate for the synonymous substitution rate (alpha). These values are highly imprecise and should be viewed as qualitative measures. Non-synonymous rate: Site estimate for the non-synonymous substitution rate (beta). These values are highly imprecise and should be viewed as qualitative measures. p-value: the p-value for the likelihood ratio test that beta / alpha ≠ 1. q-value: multiple-test corrected q-value (Benjamini-Hochberg).

(CSV)

S7 Table. BUSTED[S] selection results on the BA.1 SARS-CoV-2 clade across segments.

Segment corresponds to the gene or ORF under analysis. Omega1 refers to the first omega rate class; p1 refers to the proportion of sites that fit this rate class. Omega2 refers to the second omega rate class; p2 refers to the proportion of sites that fit this rate class. Omega3 refers to the third omega rate class, which captures the episodic diversifying features; p3 refers to the proportion of sites that fit this rate class. P-value refers to the p-value for the likelihood ratio test. Q-value refers to the multiple-test corrected q-value (Benjamini-Hochberg).

(CSV)

S8 Table. List of sites found to be under diversifying positive selection by MEME (p≤0.05) along internal branches in BA.5, as well as biochemical properties that are important at this site (via the PRIME method).

Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. % Of branches with omega>1: the fraction of tree branches (internal branches of the BA.5 clade) that have evidence of diversifying positive selection at this site (100%: pervasive selection; <100%: episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this site). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five composite biochemical properties from [52] are conserved or changed at this site.

(CSV)

S9 Table. List of sites found to be under diversifying positive selection by MEME (p≤0.05) along all branches in BA.5, as well as biochemical properties that are important at this site (via the PRIME method).

Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. % Of branches with omega>1: the fraction of tree branches (internal branches of the BA.5 clade) that have evidence of diversifying positive selection at this site (100%: pervasive selection; <100%: episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this site). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five composite biochemical properties from [52] are conserved or changed at this site.

(CSV)

S10 Table. List of sites found to be selected differentially along internal branches between BA.5 and background sequences (FDR ≤ 0.2) using the Contrast-FEL method.

Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Ratio of omega (BA.5: background): the ratio of site-level omega estimates for the two sets of branches. If this ratio is > 1, then selection on BA.5 is stronger. These values are highly imprecise and should be viewed as qualitative measures. p-value: the p-value for the likelihood ratio test that omega ratios between the internal branches of the two clades are different. This is not corrected for multiple testing. q-value: multiple-test corrected q-value (Benjamini-Hochberg).

(CSV)

S11 Table. List of sites found to be evolving under directional selection in the entire tree, using a FUBAR-like implementation of the DEPS method.

The BA.5 tree was rooted on the genome reference sequence for this analysis. Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Target amino-acid: which amino-acids have statistical support (Bayes Factor ≥ 100) for accelerated evolution towards them.

(CSV)

S12 Table. Pairs of sites (BA.5) found to have epistatic (co-evolving) substitution patterns by BGM method.

Gene/ORF: which gene or ORF does this site belong to. Codon 1/2 (in gene/ORF): the location of the two interacting codons in the corresponding Gene/ORF. Posterior probability of non-independence: estimated posterior probability that substitutions which occur on the interior branches of the BA.5 clade are not independent.

(CSV)

S13 Table. List of sites found to be under pervasive negative selection by FEL (p≤0.05) along internal branches in BA.5.

Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Synonymous rate: Site estimate for the synonymous substitution rate (alpha). These values are highly imprecise and should be viewed as qualitative measures. Non-synonymous rate: Site estimate for the non-synonymous substitution rate (beta). These values are highly imprecise and should be viewed as qualitative measures. p-value: the p-value for the likelihood ratio test that beta / alpha ≠ 1. q-value: multiple-test corrected q-value (Benjamini-Hochberg).

(CSV)

S14 Table. List of sites found to be under diversifying positive selection by MEME (p≤0.05) along internal branches in B.1.621, as well as biochemical properties that are important at this site (via the PRIME method).

Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. % of branches with omega > 1: the fraction of tree branches (internal branches of the B.1.621 clade) that have evidence of diversifying positive selection at this site (100%: pervasive selection; <100%: episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this site). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five composite biochemical properties from [52] are conserved or changed at this site.

(CSV)

S15 Table. List of sites found to be evolving under directional selection in the entire tree, using a FUBAR-like implementation of the DEPS method.

The B.1.621 tree was rooted on the genome reference sequence for this analysis. Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Target amino-acid: which amino-acids have statistical support (Bayes Factor ≥ 100) for accelerated evolution towards them.

(CSV)

Acknowledgments

We gratefully acknowledge all data contributors, i.e., the Authors and their Originating laboratories responsible for obtaining the specimens, and their Submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based. We thank the global community of health-care workers and scientists who work tirelessly to face the pandemic head-on. We thank members of the Datamonkey, HyPhy, and Galaxy teams for their continued assistance in the development and application of our software.

Abbreviations

BUSTED[S]

Branch-site Unrestricted Statistical Test for Episodic Diversification with synonymous rate variation

BGM

Bayesian Graphical Models

CFEL

Contrast-FEL

FADE

A FUBAR* Approach to Directional Selection (*FUBAR: Fast, Unconstrained Bayesian AppRoximation for Inferring Selection)

FEL

Fixed Effects Likelihood

HyPhy

Hypothesis Testing using Phylogenies

RASCL

Rapid Assessment of Selection within CLades

NCBI

National Center for Biotechnology Information

MEME

Mixed Effects Model of Evolution

TN93

Tamura-Nei, 1993

RELAX

Relaxation of selective strength

SLAC

Single-Likelihood Ancestor Counting

ViPR

Virus Pathogen Database and Analysis Resource

VOC

Variants of concern

VOI

Variants of interest

Data Availability

The RASCL application, depicted at a high level in Fig 1, is implemented:
  • As a standalone pipeline (https://github.com/veg/RASCL) in Snakemake [37].
  • As a web application (https://galaxy.hyphy.org/u/hyphy/w/rapid-assessment-of-selection-on-clades-and-lineages), integrated as a workflow in the Galaxy framework, that is freely available for use on powerful public computing infrastructure (https://usegalaxy.org).

Funding Statement

DPM is funded by the Wellcome Trust (222574/Z/21/Z). This research was supported in part by grants R01 AI134384 (NIH/NIAID) and grant 2027196 (NSF/DBI,BIO) to AN and SLKP. The funding bodies played no role in the design of the study, the collection, analysis, and interpretation of data, nor in writing the manuscript.

References

  • 1. Harvey WT, Carabelli AM, Jackson B, Gupta RK, Thomson EC, Harrison EM, et al. SARS-CoV-2 variants, spike mutations and immune escape. Nat Rev Microbiol. 2021. Jul;19(7):409–24. doi: 10.1038/s41579-021-00573-0
  • 2. Arenas M. Trends in substitution models of molecular evolution. Front Genet. 2015;6:319. doi: 10.3389/fgene.2015.00319
  • 3. Kosakovsky Pond SL, Poon AFY, Velazquez R, Weaver S, Hepler NL, Murrell B, et al. HyPhy 2.5-A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies. Mol Biol Evol. 2020. Jan 1;37(1):295–9. doi: 10.1093/molbev/msz197
  • 4. Weaver S, Shank SD, Spielman SJ, Li M, Muse SV, Kosakovsky Pond SL. Datamonkey 2.0: A Modern Web Application for Characterizing Selective and Other Evolutionary Processes. Mol Biol Evol. 2018. Mar 1;35(3):773–7. doi: 10.1093/molbev/msx335
  • 5. Benvenuto D, Giovanetti M, Ciccozzi A, Spoto S, Angeletti S, Ciccozzi M. The 2019-new coronavirus epidemic: Evidence for virus evolution. J Med Virol. 2020. Apr;92(4):455–9. doi: 10.1002/jmv.25688
  • 6. Dearlove B, Lewitus E, Bai H, Li Y, Reeves DB, Joyce MG, et al. A SARS-CoV-2 vaccine candidate would likely match all currently circulating variants. Proc Natl Acad Sci U S A. 2020. Sep 22;117(38):23652–62. doi: 10.1073/pnas.2008281117
  • 7. Li X, Giorgi EE, Marichannegowda MH, Foley B, Xiao C, Kong XP, et al. Emergence of SARS-CoV-2 through recombination and strong purifying selection. Sci Adv. 2020. Jul;6(27):eabb9153. doi: 10.1126/sciadv.abb9153
  • 8. Viana R, Moyo S, Amoako DG, Tegally H, Scheepers C, Althaus CL, et al. Rapid epidemic expansion of the SARS-CoV-2 Omicron variant in southern Africa. Nature. 2022. Mar;603(7902):679–86. doi: 10.1038/s41586-022-04411-y
  • 9. Tegally H, Wilkinson E, Giovanetti M, Iranzadeh A, Fonseca V, Giandhari J, et al. Detection of a SARS-CoV-2 variant of concern in South Africa. Nature. 2021. Apr;592(7854):438–43. doi: 10.1038/s41586-021-03402-9
  • 10. Faria NR, Mellan TA, Whittaker C, Claro IM, Candido D da S, Mishra S, et al. Genomics and epidemiology of the P.1 SARS-CoV-2 lineage in Manaus, Brazil. Science. 2021. May 21;372(6544):815–21. doi: 10.1126/science.abh2644
  • 11. Martin DP, Weaver S, Tegally H, San JE, Shank SD, Wilkinson E, et al. The emergence and ongoing convergent evolution of the SARS-CoV-2 N501Y lineages. Cell. 2021. Sep 30;184(20):5189–5200.e7. doi: 10.1016/j.cell.2021.09.003
  • 12. Martin DP, Lytras S, Lucaci AG, Maier W, Grüning B, Shank SD, et al. Selection Analysis Identifies Clusters of Unusual Mutational Changes in Omicron Lineage BA.1 That Likely Impact Spike Function. Mol Biol Evol. 2022. Apr 11;39(4):msac061. doi: 10.1093/molbev/msac061
  • 13. Hamed SM, Elkhatib WF, Khairalla AS, Noreddin AM. Global dynamics of SARS-CoV-2 clades and their relation to COVID-19 epidemiology. Sci Rep. 2021. Apr 19;11(1):8435. doi: 10.1038/s41598-021-87713-x
  • 14. Young BE, Wei WE, Fong SW, Mak TM, Anderson DE, Chan YH, et al. Association of SARS-CoV-2 clades with clinical, inflammatory and virologic outcomes: An observational study. EBioMedicine. 2021. Apr;66:103319. doi: 10.1016/j.ebiom.2021.103319
  • 15. Luchsinger LL, Hillyer CD. Vaccine efficacy probable against COVID-19 variants. Science. 2021. Mar 12;371(6534):1116. doi: 10.1126/science.abg9461
  • 16. Abdool Karim SS, de Oliveira T. New SARS-CoV-2 Variants—Clinical, Public Health, and Vaccine Implications. N Engl J Med. 2021. May 13;384(19):1866–8. doi: 10.1056/NEJMc2100362
  • 17. Maher MC, Bartha I, Weaver S, di Iulio J, Ferri E, Soriaga L, et al. Predicting the mutational drivers of future SARS-CoV-2 variants of concern. Sci Transl Med. 2022. Feb 23;14(633):eabk3445. doi: 10.1126/scitranslmed.abk3445
  • 18. Rambaut A, Holmes EC, O’Toole Á, Hill V, McCrone JT, Ruis C, et al. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020. Nov;5(11):1403–7. doi: 10.1038/s41564-020-0770-5
  • 19. Pickett BE, Sadat EL, Zhang Y, Noronha JM, Squires RB, Hunt V, et al. ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic Acids Res. 2012. Jan;40(Database issue):D593–598. doi: 10.1093/nar/gkr859
  • 20. Perkel JM. Reactive, reproducible, collaborative: computational notebooks evolve. Nature. 2021. May;593(7857):156–7. doi: 10.1038/d41586-021-01174-w
  • 21. Cheng Y, Ji C, Han N, Li J, Xu L, Chen Z, et al. covSampler: A subsampling method with balanced genetic diversity for large-scale SARS-CoV-2 genome data sets. Virus Evolution. 2022. Jul 1;8(2):veac071. doi: 10.1093/ve/veac071
  • 22. Kozlov AM, Darriba D, Flouri T, Morel B, Stamatakis A. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics. 2019. Nov 1;35(21):4453–5. doi: 10.1093/bioinformatics/btz305
  • 23. Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol. 2015. Jan;32(1):268–74. doi: 10.1093/molbev/msu300
  • 24. Spielman SJ, Weaver S, Shank SD, Magalis BR, Li M, Kosakovsky Pond SL. Evolution of Viral Genomes: Interplay Between Selection, Recombination, and Other Forces. Methods Mol Biol. 2019;1910:427–68. doi: 10.1007/978-1-4939-9074-0_14
  • 25. Kosakovsky Pond SL, Frost SDW, Grossman Z, Gravenor MB, Richman DD, Brown AJL. Adaptation to Different Human Populations by HIV-1 Revealed by Codon-Based Analyses. PLoS Comput Biol. 2006. Jun;2(6):e62. doi: 10.1371/journal.pcbi.0020062
  • 26. Pybus OG, Rambaut A, Belshaw R, Freckleton RP, Drummond AJ, Holmes EC. Phylogenetic Evidence for Deleterious Mutation Load in RNA Viruses and Its Contribution to Viral Evolution. Molecular Biology and Evolution. 2007. Mar 1;24(3):845–52. doi: 10.1093/molbev/msm001
  • 27. Lorenzo-Redondo R, Fryer HR, Bedford T, Kim EY, Archer J, Pond SLK, et al. Persistent HIV-1 replication maintains the tissue reservoir during therapy. Nature. 2016. Feb 4;530(7588):51–6. doi: 10.1038/nature16933
  • 28. Wisotsky SR, Kosakovsky Pond SL, Shank SD, Muse SV. Synonymous Site-to-Site Substitution Rate Variation Dramatically Inflates False Positive Rates of Selection Analyses: Ignore at Your Own Peril. Mol Biol Evol. 2020. Aug 1;37(8):2430–9. doi: 10.1093/molbev/msaa037
  • 29. Kosakovsky Pond SL, Frost SDW. Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol Biol Evol. 2005. May;22(5):1208–22. doi: 10.1093/molbev/msi105
  • 30. Poon AFY, Lewis FI, Frost SDW, Kosakovsky Pond SL. Spidermonkey: rapid detection of co-evolving sites using Bayesian graphical models. Bioinformatics. 2008. Sep 1;24(17):1949–50. doi: 10.1093/bioinformatics/btn313
  • 31. Murrell B, Wertheim JO, Moola S, Weighill T, Scheffler K, Kosakovsky Pond SL. Detecting individual sites subject to episodic diversifying selection. PLoS Genet. 2012;8(7):e1002764. doi: 10.1371/journal.pgen.1002764
  • 32. Wertheim JO, Murrell B, Smith MD, Kosakovsky Pond SL, Scheffler K. RELAX: detecting relaxed selection in a phylogenetic framework. Mol Biol Evol. 2015. Mar;32(3):820–32. doi: 10.1093/molbev/msu400
  • 33. Kosakovsky Pond SL, Wisotsky SR, Escalante A, Magalis BR, Weaver S. Contrast-FEL-A Test for Differences in Selective Pressures at Individual Sites among Clades and Sets of Branches. Mol Biol Evol. 2021. Mar 9;38(3):1184–98. doi: 10.1093/molbev/msaa263
  • 34. Murrell B, Moola S, Mabona A, Weighill T, Sheward D, Kosakovsky Pond SL, et al. FUBAR: a fast, unconstrained bayesian approximation for inferring selection. Mol Biol Evol. 2013. May;30(5):1196–205. doi: 10.1093/molbev/mst030
  • 35. Kosakovsky Pond SL, Poon AFY, Leigh Brown AJ, Frost SDW. A Maximum Likelihood Method for Detecting Directional Evolution in Protein Sequences and Its Application to Influenza A Virus. Mol Biol Evol. 2008. Sep;25(9):1809–24. doi: 10.1093/molbev/msn123
  • 36. Lucaci AG, Wisotsky SR, Shank SD, Weaver S, Kosakovsky Pond SL. Extra base hits: Widespread empirical support for instantaneous multiple-nucleotide changes. PLoS One. 2021;16(3):e0248337. doi: 10.1371/journal.pone.0248337
  • 37. Mölder F, Jablonski KP, Letcher B, Hall MB, Tomkins-Tinch CH, Sochat V, et al. Sustainable data analysis with Snakemake. F1000Res. 2021;10:33. doi: 10.12688/f1000research.29032.2
  • 38. Afgan E, Baker D, Batut B, van den Beek M, Bouvier D, Cech M, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018. Jul 2;46(W1):W537–44. doi: 10.1093/nar/gky379
  • 39. Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, et al. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics. 2018. Dec 1;34(23):4121–3. doi: 10.1093/bioinformatics/bty407
  • 40. Shen L, Bard JD, Triche TJ, Judkins AR, Biegel JA, Gai X. Emerging variants of concern in SARS-CoV-2 membrane protein: a highly conserved target with potential pathological and therapeutic implications. Emerg Microbes Infect. 2021. Dec;10(1):885–93. doi: 10.1080/22221751.2021.1922097 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Rathnasinghe R, Jangra S, Ye C, Cupic A, Singh G, Martínez-Romero C, et al. Characterization of SARS-CoV-2 Spike mutations important for infection of mice and escape from human immune sera. Nat Commun. 2022. Jul 7;13(1):3921. doi: 10.1038/s41467-022-30763-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Zhao LP, Roychoudhury P, Gilbert P, Schiffer J, Lybrand TP, Payne TH, et al. Mutations in viral nucleocapsid protein and endoRNase are discovered to associate with COVID19 hospitalization risk. Sci Rep. 2022. Jan 24;12(1):1206. doi: 10.1038/s41598-021-04376-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Costa CFS, Barbosa AJM, Dias AMGC, Roque ACA. Native, engineered and de novo designed ligands targeting the SARS-CoV-2 spike protein. Biotechnol Adv. 2022. Oct;59:107986. doi: 10.1016/j.biotechadv.2022.107986 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Su WQ, Yu XJ, Zhou CM. SARS-CoV-2 ORF3a Induces Incomplete Autophagy via the Unfolded Protein Response. Viruses. 2021. Dec 9;13(12):2467. doi: 10.3390/v13122467 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Kern DM, Sorum B, Mali SS, Hoel CM, Sridharan S, Remis JP, et al. Cryo-EM structure of SARS-CoV-2 ORF3a in lipid nanodiscs. Nat Struct Mol Biol. 2021. Jul;28(7):573–82. doi: 10.1038/s41594-021-00619-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Addetia A, Lieberman NAP, Phung Q, Hsiang TY, Xie H, Roychoudhury P, et al. SARS-CoV-2 ORF6 Disrupts Bidirectional Nucleocytoplasmic Transport through Interactions with Rae1 and Nup98. mBio. 2021. Apr 13;12(2):e00065–21. doi: 10.1128/mBio.00065-21 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Pawlica P, Yario TA, White S, Wang J, Moss WN, Hui P, et al. SARS-CoV-2 expresses a microRNA-like small RNA able to selectively repress host genes. Proc Natl Acad Sci U S A. 2021. Dec 28;118(52):e2116668118. doi: 10.1073/pnas.2116668118 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Chen J, Lu Z, Yang X, Zhou Y, Gao J, Zhang S, et al. Severe Acute Respiratory Syndrome Coronavirus 2 ORF8 Protein Inhibits Type I Interferon Production by Targeting HSP90B1 Signaling. Front Cell Infect Microbiol. 2022;12:899546. doi: 10.3389/fcimb.2022.899546 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Halfmann PJ, Kuroda M, Armbrust T, Theiler J, Balaram A, Moreno GK, et al. Characterization of the SARS-CoV-2 B.1.621 (Mu) variant. Science Translational Medicine. 2022. May 17;14(657):eabm4908. doi: 10.1126/scitranslmed.abm4908 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Khare S, Gurry C, Freitas L, Schultz MB, Bach G, Diallo A, et al. GISAID’s Role in Pandemic Response. China CDC Wkly. 2021. Dec 3;3(49):1049–51. doi: 10.46234/ccdcw2021.255 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Lytras S, Hughes J, Martin D, Swanepoel P, de Klerk A, Lourens R, et al. Exploring the Natural Origins of SARS-CoV-2 in the Light of Recombination. Genome Biology and Evolution. 2022. Feb 1;14(2):evac018. doi: 10.1093/gbe/evac018 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Atchley WR, Zhao J, Fernandes AD, Drüke T. Solving the protein sequence metric problem. Proceedings of the National Academy of Sciences. 2005. May 3;102(18):6395–400. doi: 10.1073/pnas.0408677102 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Tzou PL, Tao K, Pond SLK, Shafer RW. Coronavirus Resistance Database (CoV-RDB): SARS-CoV-2 susceptibility to monoclonal antibodies, convalescent plasma, and plasma from vaccinated persons. PLOS ONE. 2022. Mar 9;17(3):e0261045. doi: 10.1371/journal.pone.0261045 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54. McCallum M, Bassi J, De Marco A, Chen A, Walls AC, Di Iulio J, et al. SARS-CoV-2 immune evasion by the B.1.427/B.1.429 variant of concern. Science. 2021. Aug 6;373(6555):648–54. doi: 10.1126/science.abi7994 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Chen N, Li G, Liszewski MK, Atkinson JP, Jahrling PB, Feng Z, et al. Virulence differences between monkeypox virus isolates from West Africa and the Congo basin. Virology. 2005. Sep 15;340(1):46–63. doi: 10.1016/j.virol.2005.05.030 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Isidro J, Borges V, Pinto M, Sobral D, Santos JD, Nunes A, et al. Phylogenomic characterization and signs of microevolution in the 2022 multi-country outbreak of monkeypox virus. Nat Med. 2022. Aug;28(8):1569–72. doi: 10.1038/s41591-022-01907-y [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Tegally H, Moir M, Everatt J, Giovanetti M, Scheepers C, Wilkinson E, et al. Emergence of SARS-CoV-2 Omicron lineages BA.4 and BA.5 in South Africa. Nat Med. 2022. Jun 27;1–6. doi: 10.1038/s41591-022-01911-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Turakhia Y, Thornlow B, Hinrichs A, McBroome J, Ayala N, Ye C, et al. Pandemic-Scale Phylogenomics Reveals The SARS-CoV-2 Recombination Landscape. Nature. 2022. Aug 11;1–2. doi: 10.1038/s41586-022-05189-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Samson S, Lord É, Makarenkov V. SimPlot++: a Python application for representing sequence similarity and detecting recombination. Bioinformatics. 2022. Jun 1;38(11):3118–20. doi: 10.1093/bioinformatics/btac287 [DOI] [PubMed] [Google Scholar]
  • 60. Martin DP, Varsani A, Roumagnac P, Botha G, Maslamoney S, Schwab T, et al. RDP5: a computer program for analyzing recombination in, and removing signals of recombination from, nucleotide sequence datasets. Virus Evol. 2020. Apr 12;7(1):veaa087. doi: 10.1093/ve/veaa087 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Kosakovsky Pond SL, Posada D, Gravenor MB, Woelk CH, Frost SDW. Automated Phylogenetic Detection of Recombination Using a Genetic Algorithm. Molecular Biology and Evolution. 2006. Oct 1;23(10):1891–901. doi: 10.1093/molbev/msl051 [DOI] [PubMed] [Google Scholar]
  • 62. Boni MF, Lemey P, Jiang X, Lam TTY, Perry BW, Castoe TA, et al. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic. Nat Microbiol. 2020. Nov;5(11):1408–17. doi: 10.1038/s41564-020-0771-4 [DOI] [PubMed] [Google Scholar]
  • 63. Domingo JL. What we know and what we need to know about the origin of SARS-CoV-2. Environ Res. 2021. Sep;200:111785. doi: 10.1016/j.envres.2021.111785 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Makarenkov V, Mazoure B, Rabusseau G, Legendre P. Horizontal gene transfer and recombination analysis of SARS-CoV-2 genes helps discover its close relatives and shed light on its origin. BMC Ecol Evol. 2021. Jan 21;21(1):5. doi: 10.1186/s12862-020-01732-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Emily Chenette

20 Jul 2022

PONE-D-22-06403

RASCL: RAPID ASSESSMENT OF SELECTION IN CLADES THROUGH MOLECULAR SEQUENCE ANALYSIS

PLOS ONE

Dear Dr. Lucaci,

Thank you for submitting your manuscript to PLOS ONE; I sincerely apologise for the unusually delayed review timeframe.

Your manuscript has been assessed by one reviewer, whose comments are appended below. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Among the limitations raised by the reviewer is that the documentation for the software is incomplete; please ensure that the software meets PLOS ONE's policies for sharing software (https://journals.plos.org/plosone/s/materials-software-and-code-sharing#loc-sharing-software).

Please note that we have only been able to secure a single reviewer to assess your manuscript. We are issuing a decision on your manuscript at this point to prevent further delays in the evaluation of your manuscript. Please be aware that the editor who handles your revised manuscript might find it necessary to invite additional reviewers to assess this work once the revised manuscript is submitted. However, we will aim to proceed on the basis of this single review if possible.

Please submit your revised manuscript by Aug 29 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Emily Chenette

Editor in Chief

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

3. Thank you for stating the following in the Funding Section of your manuscript:

“DPM is funded by the Wellcome Trust (222574/Z/21/Z). This research was supported in part by grants R01 AI134384 (NIH/NIAID) and grant 2027196 (NSF/DBI,BIO) to AN and SLKP. The funding bodies played no role in the design of the study, the collection, analysis, and interpretation of data, nor in writing the manuscript.”

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Funding section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“DPM is funded by the Wellcome Trust (222574/Z/21/Z). This research was supported in part by grants R01 AI134384 (NIH/NIAID) and grant 2027196 (NSF/DBI,BIO) to AN and SLKP. The funding bodies played no role in the design of the study, the collection, analysis, and interpretation of data, nor in writing the manuscript.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. Thank you for stating the following in your Competing Interests section: 

“None”

Please complete your Competing Interests on the online submission form to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

This information should be included in your cover letter; we will change the online submission form on your behalf.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This manuscript is structured like an "application note", although I am not aware of this journal having this type of article. Thus, the text is very brief and many aspects of the analysis are not explained to the level that I would expect for a standard manuscript. However, the methods presented in the manuscript are important and timely, representing significant advances in our ability to detect selection in large amounts of SARS-CoV-2 genome data.

Please clarify the role of the query/background partition of sequence data in the analysis. What are the respective roles of the different molecular evolution models employed in this workflow, e.g., SLAC, BGM, FEL, etc.? Since many of these are various tests for selection, how are their outputs integrated into a meaningful whole? You should clarify the distinction between "pervasive" and "episodic" selection. The paper would be more accessible to a broader audience if it provided at least a brief explanation of what each tool does.

I found this software unnecessarily difficult to run. It requires some specific computing environments (Anaconda and Slurm) and the documentation is very incomplete. It should be possible to run RASCL via the Snakemake file on a single Linux workstation with a sufficient number of cores and RAM. The package does not appear to include sufficient example files for running a demo (there is `TEST.fasta`, but no `BA.5-04132022`). The Galaxy web service for this tool was not available.

Please provide line numbers in the revised manuscript. If you are using LaTeX, please use the url package to allow line breaks at slashes in long URLs.

Specific comments:

* Abstract, "sensitivity to sequencing errors" and also first page "because noisy sequencing data" -- I think the compressed time-scale and depth of sampling of the SARS-CoV-2 pandemic are important factors in the difficulty of deploying standard selection analyses. More specifically, they are the reason why sequencing error is more problematic than usual.

* "Since epidemiologically relevant mutations [...]", can you please clarify what this means? For example, mutations that affect transmission rates.

* Figure 1, the flowchart is very difficult to read with a poor text to whitespace ratio, and the figure legend is not very helpful. What are "clade genomes"?

* "It is not necessary to remove sequences in the query dataset that are duplicated in the reference dataset -- the pipeline will do this automatically." How does this step handle sequences with ambiguous (uncalled) bases? How does it decide which duplicate sequence to retain as the representative sequence?

* Please explain what ViPR is.

* reversed left square bracket by "RAxML-NG, Kozlov"

* "To mitigate the potentially confounding influences of within-host evolution [...]" How would within-host evolution confound our ability to detect among-host evolution in response to selection?

* JSON (JavaScript Object Notation) should be defined at first usage. It would also be helpful to explain that it is a standard, human-readable format for the interchange of serialized data online, or something to that effect.

* Table 1, what is the background clade for these analyses?

* "Of note, the lower compression ratios for Omicron may reflect its rapid detection, and a mature sequencing ecosystem." Please expand on this. What do you mean by a "mature sequencing ecosystem" and how would this affect the compression ratio?

* Availability and requirements: the HyPhy batch language is a highly application-specific language that will be unfamiliar to the broader research community, and the language itself has undergone extensive revisions in recent years. Is there an up-to-date language specification available for new users?

* The web service https://usegalaxy.eu/u/hyphy/w/rascl was unavailable when I attempted to access it for this review (HTTP error 500). However, the main page https://usegalaxy.eu/ was working fine. RASCL was not listed in available tools, and I could not locate it using the search interface.

* The installation instructions in GitHub `README` are missing a `cd RASCL` step between 1 and 2. Also, the last `RASCL` argument in the `git clone` command is not strictly necessary, since it is the default value.

* I got the following error when following the installation instructions:

```
ResolvePackageNotFound:
- stephenshank::tn93=1.0.7
```

I'm running conda version 4.11.0 on macOS. I didn't run into this problem on my Linux workstation, so I assume that the issue is that there is no macOS binary distributed via this package - however, there is one available via `bioconda::tn93`. Is there any reason why the RASCL environment can't point to that package?

* There is no file named `snakemake_config.json` in the RASCL directory. The `README.md` configuration instructions should be corrected to refer to `config.json`.

* The configuration instructions are unclear. If the "clade of interest" is set to `B.1.1.7`, does that refer to a FASTA filename (user input data), or to labels contained within that FASTA file? Why don't these instructions employ the same terminology as the manuscript, *i.e.*, "query" and "background"? In the file `cluster.json`, what do the labels `cluster`, `nodes`, `ppn` and `name` correspond to?

* Running RASCL appears to require the Slurm workload manager (`qsub`). If so, then this should be listed as a package dependency. However, users should have the option of running RASCL without a workload manager.

* observablehq.com notebook has older GitHub link, https://github.com/veg/SARS-CoV-2_Clades

* "Label tree with amino-acids" checkbox does not seem to be working.

* My JS console is listing a lot of errors on page load that appear to be associated with phylotree.js; for example:

* Error: <path> attribute d: Expected number, "MNaN,0LNaN,NaNLNa…".

* Error: <g> attribute transform: Expected number, "translate (NaN,NaN) ".

* for the notebook interface, is it possible to provide some visual cue to indicate that table rows can be sorted by clicking on column labels?

* can the developers please provide an additional set of instructions for running RASCL in a Linux environment that is not running Anaconda? Some shared computing environments prohibit users from using Anaconda, e.g., Compute Canada.

* please provide instructions for retrieving JSON data from the observablehq site via some API.

* The Python scripts have some weird formatting. For example, in `tn93_cluster.py` there are long spaces within `add_argument()` calls, each on a single line. Please conform to PEP 8 conventions if possible. `generate-report.py` has a monolithic for-loop spanning over 500 lines. This code would be much more maintainable if the developers applied a more modular code style. Also see lines 689-695 and 743-749 for weird whitespace.
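
As an illustration of the modular, PEP 8-style layout requested in the last comment, and of reading a JSON report as serialized data, a minimal Python sketch follows. It is not taken from the RASCL repository: the function names, the `--input` and `--threshold` options, and the JSON layout (`"sites"` mapping a site index to a record with a `"p"` field) are hypothetical and chosen only for illustration.

```python
import argparse
import json


def build_parser():
    """Return a parser with one short add_argument() call per option."""
    parser = argparse.ArgumentParser(description="Summarize a JSON report.")
    parser.add_argument(
        "--input",
        required=True,
        help="Path to a JSON report (hypothetical layout).",
    )
    parser.add_argument(
        "--threshold",
        type=float,
        default=0.05,
        help="Keep sites with a p-value at or below this value.",
    )
    return parser


def load_report(path):
    """Read a JSON document from disk and return it as Python objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def significant_sites(report, threshold):
    """Return sorted site indices whose p-value is <= threshold.

    Assumes the hypothetical layout {"sites": {"<index>": {"p": <float>}}}.
    """
    sites = report.get("sites", {})
    return sorted(
        int(index)
        for index, record in sites.items()
        if record["p"] <= threshold
    )


def main(argv=None):
    """Parse options, load the report, and print one site index per line."""
    args = build_parser().parse_args(argv)
    report = load_report(args.input)
    for site in significant_sites(report, args.threshold):
        print(site)
```

Calling `main(["--input", "report.json"])` would run the three steps in order. Each step is a small function that can be tested separately, which is the kind of structure the comment contrasts with a single 500-line loop.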

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Art Poon

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Nov 2;17(11):e0275623. doi: 10.1371/journal.pone.0275623.r002

Author response to Decision Letter 0


30 Aug 2022

Aug 26, 2022

Dr. Emily Chenette

Editor in Chief

PLOS ONE

Dear Dr. Chenette,

RE: Response to reviewers for: RASCL: RAPID ASSESSMENT OF SELECTION IN CLADES THROUGH MOLECULAR SEQUENCE ANALYSIS.

Thank you for the opportunity to submit a revised manuscript on “RASCL: RAPID ASSESSMENT OF SELECTION IN CLADES THROUGH MOLECULAR SEQUENCE ANALYSIS”. Please find attached our revised contribution that incorporates responses to the Editor and Reviewer comments. Each of the comments has been addressed and a detailed response is attached. Both a marked-up manuscript and a clean LaTeX version of the paper are included.

The authors have declared that no competing interests exist.

The amended Funding Statement is as follows: DPM is funded by the Wellcome Trust (222574/Z/21/Z). This research was supported in part by grants R01 AI134384 (NIH/NIAID) and grant 2027196 (NSF/DBI,BIO) to AN and SLKP. The funding bodies played no role in the design of the study, the collection, analysis, and interpretation of data, nor in writing the manuscript.

Yours sincerely,

Alexander G. Lucaci, M.S.

Ph.D. Candidate

Institute for Genomics and Evolutionary Medicine (iGEM)

Temple University

Editor comments

Comment #1

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

Response 1: We have modified our manuscript to meet the PLOS ONE style requirements, including those for file naming. These changes are reflected in the revised manuscript.

Comment #2

Please update your submission to use the PLOS LaTeX template. The template and more information on our requirements for LaTeX submissions can be found at http://journals.plos.org/plosone/s/latex.

Response 2: We have updated our revised manuscript to use the PLOS LaTeX template.

Comment #3

Thank you for stating the following in the Funding Section of your manuscript:

“DPM is funded by the Wellcome Trust (222574/Z/21/Z). This research was supported in part by grants R01 AI134384 (NIH/NIAID) and grant 2027196 (NSF/DBI,BIO) to AN and SLKP. The funding bodies played no role in the design of the study, the collection, analysis, and interpretation of data, nor in writing the manuscript.”

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Funding section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“DPM is funded by the Wellcome Trust (222574/Z/21/Z). This research was supported in part by grants R01 AI134384 (NIH/NIAID) and grant 2027196 (NSF/DBI,BIO) to AN and SLKP. The funding bodies played no role in the design of the study, the collection, analysis, and interpretation of data, nor in writing the manuscript.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

Response 3: The requested changes are reflected in our revised manuscript: we have removed the Funding statement from the manuscript and included the amended Funding Statement in our cover letter.

Comment #4

Thank you for stating the following in your Competing Interests section:

“None”

Please complete your Competing Interests on the online submission form to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

This information should be included in your cover letter; we will change the online submission form on your behalf.

Response 4: We have included the amended Competing Interests section in our cover letter.

Reviewer comments

Comment #1

Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No

Response 1: Our ability to make all underlying data publicly available is restricted under the GISAID Database Access Agreement, which states that “You will not distribute, redistribute, share, or otherwise make available Data, to any third party or the public, unless the individual is an Authorized User of GISAID”. However, for authorized users, we provide a file of sequence identifiers (S1 File of the revised manuscript) for data retrieval and replication of results.

Comment #2

This manuscript is structured like an "application note", although I am not aware of this journal having this type of article. Thus, the text is very brief and many aspects of the analysis are not explained to the level that I would expect for a standard manuscript.

Response 2: We have modified our manuscript to provide additional details, without unnecessary bloat.

Comment #3

Please clarify the role of the query/background partition of sequence data in the analysis.

Response 3: We have included a clarification of the respective roles of query and background sequence data.

Comment #4

What are the respective roles of the different molecular evolution models employed in this workflow, e.g., SLAC, BGM, FEL, etc.? Since many of these are various tests for selection, how are their outputs integrated into a meaningful whole? You should clarify the distinction between "pervasive" and "episodic" selection.

Response 4: Each method is designed to ask and answer specific biological and statistical questions. We have modified the manuscript to further explain their relative roles.

Comment #5

I found this software unnecessarily difficult to run. It requires some specific computing environments (Anaconda and Slurm) and the documentation is very incomplete. It should be possible to run RASCL via the Snakemake file on a single Linux workstation with a sufficient number of cores and RAM. The package does not appear to include sufficient example files for running a demo (there is `TEST.fasta`, but no `BA.5-04132022`). The Galaxy web service for this tool was not available.

Response 5: We have expanded our README file to be more user friendly. This includes information about simple demo runs, advanced configuration, and how to run locally.

We have provided a “run_Local.sh” version of the submitting script to facilitate the user running the program on a local machine. We have also updated the demo data, reflected in the “/data/Example1” folder within our repository, to be the default configuration for the demo run.

We have updated the Galaxy web service link to reflect a stable link at https://galaxy.hyphy.org/u/hyphy/w/rapid-assessment-of-selection-on-clades-and-lineages. Should the link become unreachable in the future, we also provide a backup link to our workflow via our dedicated Github repository.

Comment #6

Please provide line numbers in the revised manuscript. If you are using LaTeX, please use the url package to allow line breaks at slashes in long URLs.

Response 6: We have added line numbers to our revised manuscript.

Comment #7

* Abstract, "sensitivity to sequencing errors" and also first page "because noisy sequencing data" -- I think the compressed time-scale and depth of sampling of the SARS-CoV-2 pandemic are important factors in the difficulty of deploying standard selection analyses. More specifically, they are the reason why sequencing error is more problematic than usual.

Response 7: A compressed time-scale and depth of sampling can lead to a high number of duplicates, where not enough time has passed to observe inter-host transmission variability of the virus. Sequencing errors also arose in large part while the global scientific community was working out how best to process SARS-CoV-2 samples and developing standards for protocols and assembly, all of which contributed to noise. We partially address this by restricting some of our analyses to internal branches, which reduces noise (we explain this further below). However, the scientific community has matured since the early days of the pandemic. What we face now is still some contribution of noise, but mainly highly dense sampling (i.e. low-temporal sampling), which leads to a significant number of duplicate and near-duplicate sequences.

Comment #8

* "Since epidemiologically relevant mutations [...]", can you please clarify what this means? For example, mutations that affect transmission rates.

Response 8: Here, we refer to epidemiological mutations in spike that received early attention, such as D614G, as well as those that followed the emergence of seed variants.

Comment #9

* Figure 1, the flowchart is very difficult to read with a poor text to whitespace ratio, and the figure legend is not very helpful. What are "clade genomes"?

Response 9: Figure 1 has been expanded into Figures 1 and 2 to add clarity. Additionally, we have modified the figure legend text to describe the figures more fully.

Comment #10

* "It is not necessary to remove sequences in the query dataset that are duplicated in the reference dataset -- the pipeline will do this automatically." How does this step handle sequences with ambiguous (uncalled) bases? How does it decide which duplicate sequence to retain as the representative sequence?

Response 10: This is done using the standard TN93 distance calculation (tn93-cluster), where the default is to RESOLVE ambiguities (e.g. R will “match” A or G, N will “match” any resolved base). Among all the sequences that are placed in the “duplicate” bin, the one that has the fewest overall ambiguities (as a fraction of sequence length) will be retained. Because most of the ambiguities in SARS-CoV-2 consensus genomes are ‘N’, this has the effect of selecting the “least ambiguous” sequence. In addition to this, we further “mask” (with ‘---’, using the strike_ambigs.bf script) partially resolved codons (e.g. ANC) in the “post-compression” MSA prior to submitting them to HyPhy. This is because PARTIALLY resolved codons with missing data (N) can create false positive selection signals along very short tree branches.
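The representative-selection and codon-masking rules described in Response 10 can be illustrated with a minimal Python sketch. This is not the tn93-cluster or strike_ambigs.bf code: the function names and toy sequences are hypothetical, treating two characters as a “match” when their IUPAC base sets overlap is an assumption about what resolving ambiguities means, and leaving fully unresolved codons (e.g. NNN) untouched is also an assumption. Gap characters are not handled here.

```python
# IUPAC nucleotide codes mapped to the bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
RESOLVED = set("ACGT")


def bases_match(x, y):
    """Assumed rule: two characters match if their IUPAC base sets overlap."""
    return bool(set(IUPAC[x]) & set(IUPAC[y]))


def is_duplicate(seq_a, seq_b):
    """Aligned sequences count as duplicates if every column matches."""
    return len(seq_a) == len(seq_b) and all(
        bases_match(x, y) for x, y in zip(seq_a, seq_b)
    )


def ambiguity_fraction(seq):
    """Fraction of positions that are not fully resolved bases."""
    return sum(ch not in RESOLVED for ch in seq) / len(seq)


def pick_representative(duplicates):
    """Return the name of the sequence with the fewest ambiguities."""
    return min(duplicates, key=lambda name: ambiguity_fraction(duplicates[name]))


def mask_partial_codons(seq):
    """Replace partially resolved codons (e.g. ANC) with '---'."""
    out = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        n_resolved = sum(ch in RESOLVED for ch in codon)
        # Partially resolved: some, but not all, positions are A/C/G/T.
        partially_resolved = 0 < n_resolved < len(codon)
        out.append("---" if partially_resolved else codon)
    return "".join(out)


# Toy "duplicate" bin of aligned sequences (hypothetical data).
toy_bin = {"seq1": "ATGANCGGR", "seq2": "ATGAACGGA", "seq3": "ATGNNNGGA"}
representative = pick_representative(toy_bin)
masked = mask_partial_codons(toy_bin["seq1"])
print(representative, masked)
```

In this toy bin all three sequences match one another, and the fully resolved one is kept as the representative; masking is applied codon by codon so that the reading frame is preserved.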

Comment #11

* Please explain what ViPR is.

Response 11: We have modified the manuscript to include a description for ViPR.

Comment #12

* reversed left square bracket by "RAxML-NG, Kozlov"

Response 12: Fixed.

Comment #13

* "To mitigate the potentially confounding influences of within-host evolution [...]" How would within-host evolution confound our ability to detect among-host evolution in response to selection?

Response 13: This comment refers to the observation that many within-host mutations are maladaptive at the population level (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1480537/ and https://academic.oup.com/mbe/article/24/3/845/1246056). We added a clarification to the text, as follows:

“To partially mitigate the potentially confounding influences of within-host evolution, where mutations occurring within an individual have not been filtered by selection at the broader population-level, and sequencing errors, these analyses are performed only on the internal branches of phylogenetic trees, where at least one or more rounds of virus transmission are captured…”

Comment #14

* JSON (JavaScript Object Notation) should be defined at first usage. It would also be helpful to explain that it is a standard, human-readable format for the interchange of serialized data online, or something to that effect.

Response 14: We have modified the text as requested.

Comment #15

* Table 1, what is the background clade for these analyses?

Response 15: The background clade for these analyses is a curated dataset from ViPR. It is also publicly available in our dedicated GitHub repository at: https://github.com/veg/RASCL/tree/main/data/ReferenceSetViPR

Comment #16

* "Of note, the lower compression ratios for Omicron may reflect its rapid detection, and a mature sequencing ecosystem." Please expand on this. What do you mean by a "mature sequencing ecosystem" and how would this affect the compression ratio?

Response 16: The Omicron variant's designation as a VOC on November 26, 2021 occurred in a significantly more mature scientific ecosystem than that of earlier variants. Many Omicron variant sequences were made available on public databases such as GISAID through the concerted effort of experienced public health laboratories and their respective teams. The rapid pace of sequencing reflects an ability to better detect, sample, sequence, and assemble the SARS-CoV-2 viral genome that was not available at the beginning of the pandemic.

Comment #17

* Availability and requirements: the HyPhy batch language is highly application-specific that will be unfamiliar to the broader research community, and the language itself has undergone extensive revisions in recent years. Is there an up-to-date language specification available for new users?

Response 17: We have designed the RASCL application to run with minimal user configuration. We therefore expect very little programming background, especially with our user-friendly Galaxy workflow, which provides a point and click interface. The end user does not need to tinker with the underlying HyPhy batch language, methods, or application in order to successfully complete the analysis.

Comment #18

* The web service https://usegalaxy.eu/u/hyphy/w/rascl was unavailable when I attempted to access it for this review (HTTP error 500). However, the main page https://usegalaxy.eu/ was working fine. RASCL was not listed in available tools, and I could not locate it using the search interface.

Response 18: We have updated the Galaxy web service link to reflect a stable link at https://galaxy.hyphy.org/u/hyphy/w/rapid-assessment-of-selection-on-clades-and-lineages. Should the link become unreachable in the future, we also provide a backup link to our workflow via our dedicated Github repository.

Comment #19

* The installation instructions in GitHub `README` are missing a `cd RASCL` step between 1 and 2. Also, the last `RASCL` argument in the `git clone` command is not strictly necessary, since it is the default value.

Response 19: We have modified the README file to reflect these changes.

Comment #20

* I got the following error when following the installation instructions:

```
ResolvePackageNotFound:
- stephenshank::tn93=1.0.7
```

I'm running conda version 4.11.0 on macOS. I didn't run into this problem on my Linux workstation, so I assume that the issue is that there is no macOS binary distributed via this package - however, there is one available via `bioconda::tn93`. Is there any reason why the RASCL environment can't point to that package?

Response 20: The “bioconda::tn93” package was an outdated version that did not provide key functionality required for the RASCL application. We successfully reached out to the bioconda team to provide the latest available version, and we now point to it in our environment file. Additionally, we have significantly improved usability and support on the Linux and Intel OSX platforms. Due to the recency of the Apple M1 chip, there is no official support for this platform in bioconda. As such, we are not yet able to provide package manager support for that architecture. We have successfully built tn93 on M1 architectures from source and provide details on how to do so. A number of other commonly used packages also suffer from this problem, and we will update our usability and support once an industry-wide solution is made available.

Comment #21

* There is no file named `snakemake_config.json` in the RASCL directory. The `README.md` configuration instructions should be corrected to refer to `config.json`.

Response 21: We have modified the README file to reflect these changes.

Comment #22

* The configuration instructions are unclear. If the "clade of interest" is set to `B.1.1.7`, does that refer to a FASTA filename (user input data), or to labels contained within that FASTA file? Why don't these instructions employ the same terminology as the manuscript, *i.e.*, "query" and "background"? In the file `cluster.json`, what do the labels `cluster`, `nodes`, `ppn` and `name` correspond to?

Response 22: We have modified our analysis configuration file (“config.json”) and the corresponding instructions to add clarity:

● “Clade of interest” now refers to the “Label” variable, which is used to annotate the phylogenetic tree.

● The “Query_WholeGenomeSeqs” refers to the relative location of the query whole genome dataset (e.g. “Example1/Query-Alpha.fasta”)

● The “Background_WholeGenomeSeqs” refers to the relative location of the background whole genome dataset (e.g. “Example1/Background-preAlpha.fasta”)

● All other variables include the same terminology for consistency: "max_background", "threshold_background", "max_query", "threshold_query" and are further explained in our README file.

We have modified our README file to contain explainer text about our cluster configuration file (“cluster.json”).

● The “cluster” variable refers to the workload manager

● The “nodes” variable is a request for resource allocation from the server, in this case it refers to the number of nodes.

● The “ppn” variable is a request for resource allocation from the server, in this case it refers to the number of processors per node.

● The “name” variable is a specification to submit the jobs for the RASCL application to a specific queue. These have different names and priorities, please refer to your local system administrator for more information.

● We have added an additional variable “walltime” which is a request for a certain period of time for resource allocation from the server.
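As a rough illustration of the two configuration files described in Response 22, the Python sketch below builds dictionaries with the variable names listed above and writes them out as JSON. The key names, the label, and the two example paths come from the text; every other value (sequence caps, thresholds, workload manager command, queue name, walltime format) is a placeholder chosen for illustration and should not be read as the defaults shipped with RASCL. The `missing_keys` helper is hypothetical.

```python
import json

# Analysis configuration ("config.json"). Numeric values are placeholders.
config = {
    "Label": "B.1.1.7",  # clade of interest, used to annotate the tree
    "Query_WholeGenomeSeqs": "Example1/Query-Alpha.fasta",
    "Background_WholeGenomeSeqs": "Example1/Background-preAlpha.fasta",
    "max_query": 500,               # placeholder
    "threshold_query": 0.0005,      # placeholder
    "max_background": 1000,         # placeholder
    "threshold_background": 0.001,  # placeholder
}

# Cluster configuration ("cluster.json"). All values are placeholders.
cluster = {
    "cluster": "qsub",        # workload manager submission command
    "nodes": 1,               # number of nodes requested
    "ppn": 8,                 # processors per node
    "name": "my-queue",       # queue to submit jobs to
    "walltime": "72:00:00",   # requested run time
}

CONFIG_KEYS = [
    "Label", "Query_WholeGenomeSeqs", "Background_WholeGenomeSeqs",
    "max_query", "threshold_query", "max_background", "threshold_background",
]
CLUSTER_KEYS = ["cluster", "nodes", "ppn", "name", "walltime"]


def missing_keys(settings, required):
    """Return the required keys that are absent from a settings dict."""
    return [key for key in required if key not in settings]


config_text = json.dumps(config, indent=2)
cluster_text = json.dumps(cluster, indent=2)
print(config_text)
print(cluster_text)
print(missing_keys(config, CONFIG_KEYS), missing_keys(cluster, CLUSTER_KEYS))
```

Checking for missing keys before launching a run is a cheap way to catch a misnamed variable; the README remains the authoritative description of each setting.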

Comment #23

* Running RASCL appears to require the Slurm workload manager (`qsub`). If so, then this should be listed as a package dependency. However, users should have the option of running RASCL without a workload manager.

Response 23: We have provided a “run_Local.sh” version of the submitting script to facilitate the user running the program on a local machine without a workload manager.

Comment #24

* observablehq.com notebook has older GitHub link, https://github.com/veg/SARS-CoV-2_Clades

Response 24: We have modified the ObservableHQ notebook to reflect our latest link at https://github.com/veg/RASCL

Comment #25

* "Label tree with amino-acids" checkbox does not seem to be working.

Response 25: This checkbox is only active when viewing a specific site within the viral genome, for example via “View this site (in SARS-CoV-2 reference coordinates)”. When viewing a single site, the phylogenetic tree viewer shows a site-level tree with the codon at that position for each sequence by default. When the checkbox is enabled, we translate the codon into its corresponding amino acid in the phylogenetic tree viewer. We have modified the checkbox label for clarity to “Label tree with amino-acids (Site-level trees only)”.
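The codon-to-amino-acid relabelling described in Response 25 amounts to a lookup in the standard genetic code. The sketch below shows the idea in Python; it is not the notebook's own (JavaScript) code, the tip names are made up, and returning "?" for gapped or ambiguous codons is an assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def tip_label(codon, as_amino_acid):
    """Return the codon itself, or its amino acid when the option is on."""
    codon = codon.upper()
    if not as_amino_acid:
        return codon
    # Assumption: anything outside the 64 resolved codons is shown as '?'.
    return CODON_TABLE.get(codon, "?")


# Hypothetical site-level tree tips and the codon each carries at one site.
site_codons = {"tip_a": "GAT", "tip_b": "GGT", "tip_c": "---"}
codon_labels = {tip: tip_label(c, False) for tip, c in site_codons.items()}
aa_labels = {tip: tip_label(c, True) for tip, c in site_codons.items()}
print(codon_labels)
print(aa_labels)
```

With the option off, the tips keep their codon labels; with it on, GAT and GGT are shown as D and G respectively.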

Comment #26

* My JS console is listing a lot of errors on page load that appear to be associated with phylotree.js; for example:

* Error: attribute d: Expected number, "MNaN,0LNaN,NaNLNa…".

* Error: attribute transform: Expected number, "translate (NaN,NaN) ".

Response 26: We have opened an issue with the developers of the phylotree.js package to change these console errors to warnings.

Comment #27

* for the notebook interface, is it possible to provide some visual cue to indicate that table rows can be sorted by clicking on column labels?

Response 27: We have provided a visual cue to the table columns in the form of an up/down arrow to indicate that sorting is permissible.

Comment #28

* can the developers please provide an additional set of instructions for running RASCL in a Linux environment that is not running Anaconda? Some shared computing environments prohibit users from using Anaconda, e.g., Compute Canada.

Response 28: We have provided instructions in our README for non-conda based installation of environment dependencies.

Comment #29

* please provide instructions for retrieving JSON data from the observablehq site via some API.

Response 29: Observable notebooks allow retrieval of any named cells via embedding (https://observablehq.com/@observablehq/embeds) or data export from named cells. We believe this should be sufficient for most users.

Comment #30

* The Python scripts have some weird formatting. For example, in `tn93_cluster.py` there are long spaces within `add_argument()` calls, each on a single line. Please conform to PEP 8 conventions if possible. `generate-report.py` has a monolithic for-loop spanning over 500 lines. This code would be much more maintainable if the developers applied a more modular code style. Also see lines 689-695 and 743-749 for weird whitespace.

Response 30: We have revised our custom Python scripts to improve their style and maintainability accordingly.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Vladimir Makarenkov

9 Sep 2022

PONE-D-22-06403R1

RASCL: Rapid Assessment of Selection in CLades through molecular sequence analysis

PLOS ONE

Dear Dr. Lucaci,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Oct 24 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Vladimir Makarenkov

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I appreciate the amount of work that has gone into revising the manuscript and the source code. The manuscript is greatly improved and I was able to run the code on my Linux workstation.

Reviewer #2: Lucaci and collaborators introduce a new piece of software, RASCL, that allows for rapid monitoring of the evolution of SARS-CoV-2 strains. RASCL combines sequence mapping, clustering, and numerous phylogenetic analyses to produce reports about the evolutionary selective trends in a given dataset. I think that the paper is relevant and generally well-written.

I think that the authors should use recombination detection algorithms available for example in SimPlot++ (Samson et al., Bioinformatics 2022) to identify sequence change due to recombination.

Also, I think that the paper could benefit from a short discussion about possible origins of SARS-Cov-2. You could rely on the following references in this discussion: Boni, Maciej F., et al. "Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic." Nature microbiology 5.11 (2020): 1408-1417. Domingo JL. What we know and what we need to know about the origin of SARS-CoV-2. Environ Res. 2021;200:111785. Makarenkov, V., Mazoure, B., Rabusseau, G. et al. Horizontal gene transfer and recombination analysis of SARS-CoV-2 genes helps discover its close relatives and shed light on its origin. BMC Ecol Evo 21, 5 (2021).

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Nov 2;17(11):e0275623. doi: 10.1371/journal.pone.0275623.r004

Author response to Decision Letter 1


16 Sep 2022

September 14, 2022

Dr. Vladimir Makarenkov

Academic Editor

PLOS ONE

RE: Response to reviewers for: RASCL: RAPID ASSESSMENT OF SELECTION IN CLADES THROUGH MOLECULAR SEQUENCE ANALYSIS.

Thank you for the opportunity to submit a revised manuscript on “RASCL: RAPID ASSESSMENT OF SELECTION IN CLADES THROUGH MOLECULAR SEQUENCE ANALYSIS”.

Please find attached our revised contribution that incorporates responses to the Editor and Reviewer comments. Each of the comments has been addressed and a detailed response is attached. Both a marked-up manuscript and a clean LaTeX version of the paper are included.

We have taken special care to address the important points raised by the reviewer regarding SARS-CoV-2 origins and genetic recombination detection. The latter is a limitation of our application's current implementation, which we aim to address in future versions of RASCL so that it can play a more significant role in the global pathogen monitoring ecosystem.

Yours sincerely,

Alexander G. Lucaci, M.S.

Ph.D. Candidate in Bioinformatics

Institute for Genomics and Evolutionary Medicine (iGEM)

Temple University

Comment #1

I think that the authors should use recombination detection algorithms available for example in SimPlot++ (Samson et al., Bioinformatics 2022) to identify sequence change due to recombination.

Answer #1: We have modified our manuscript to discuss the use and role of recombination detection in our application.

Comment #2

Also, I think that the paper could benefit from a short discussion about possible origins of SARS-Cov-2. You could rely on the following references in this discussion: Boni, Maciej F., et al. "Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic." Nature microbiology 5.11 (2020): 1408-1417. Domingo JL. What we know and what we need to know about the origin of SARS-CoV-2. Environ Res. 2021;200:111785. Makarenkov, V., Mazoure, B., Rabusseau, G. et al. Horizontal gene transfer and recombination analysis of SARS-CoV-2 genes helps discover its close relatives and shed light on its origin. BMC Ecol Evo 21, 5 (2021).

Answer #2: We have modified our manuscript to include a discussion on the possible origins of SARS-CoV-2.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 2

Vladimir Makarenkov

20 Sep 2022

RASCL: Rapid Assessment of Selection in CLades through molecular sequence analysis

PONE-D-22-06403R2

Dear Dr. Alexander G Lucaci,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Vladimir Makarenkov

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Vladimir Makarenkov

3 Oct 2022

PONE-D-22-06403R2

RASCL: Rapid Assessment of Selection in CLades through molecular sequence analysis

Dear Dr. Lucaci:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Vladimir Makarenkov

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. GISAID accession ID’s for the analyses reported in Table 1.

    We also report the GISAID accession ID’s for our Nextstrain background dataset.

    (ZIP)

    S1 Table. List of sites found to be under diversifying positive selection by MEME (p≤0.05) along internal branches in BA.1, as well as biochemical properties that are important at this site (via the PRIME method).

    Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. % Of branches with omega>1: the fraction of tree branches (internal branches BA.1 clade) that have evidence of diversifying positive selection at this site (100%: pervasive selection; <100%: episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this branch). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five composite biochemical properties from [52] are conserved or changed at this site.

    (CSV)
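The q-value and “% of branches” columns described for S1 Table can be illustrated with a short Python sketch. It shows the textbook Benjamini-Hochberg adjustment and the 100% versus <100% labelling stated in the caption; it is not the code used to produce the tables, the function names are hypothetical, and the p-values are made-up numbers. Which set of tests the authors pooled for the correction is not stated here, so that choice is left to the caller.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that q-values
    # are non-decreasing in p (the usual step-up adjustment).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def selection_mode(percent_branches):
    """Label a site following the caption: 100% pervasive, <100% episodic."""
    return "pervasive" if percent_branches >= 100 else "episodic"


# Made-up site-level p-values, for illustration only.
toy_p = [0.01, 0.04, 0.03, 0.20]
toy_q = benjamini_hochberg(toy_p)
print([round(q, 4) for q in toy_q])
print(selection_mode(100), selection_mode(12.5))
```

Because the adjustment depends on the number of tests, q-values computed per gene will differ from q-values computed across the whole genome.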

    S2 Table. List of sites found to be under diversifying positive selection by MEME (p≤0.05) along all branches in BA.1, as well as biochemical properties that are important at this site (via the PRIME method).

    Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. % Of branches with omega>1: the fraction of tree branches (internal branches BA.1 clade) that have evidence of diversifying positive selection at this site (100%: pervasive selection; <100%: episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this branch). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five composite biochemical properties from [52] are conserved or changed at this site.

    (CSV)

    S3 Table. List of sites found to be selected differentially along internal branches between BA.1 and background sequences (FDR ≤ 0.2) using the Contrast-FEL method.

    Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Ratio of omega (BA.1: reference): the ratio of site-level omega estimates for the two sets of branches. If this ratio is > 1, then selection on BA.1 is stronger. These values are highly imprecise and should be viewed as qualitative measures. p-value: the p-value for the likelihood ratio test that omega ratios between the internal branches of the two clades are different. This is not corrected for multiple testing. q-value: multiple-test corrected q-value (Benjamini-Hochberg).

    (CSV)
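
    The Benjamini-Hochberg q-values reported in these tables can be reproduced from a column of raw p-values along the following lines. This is a generic sketch of the standard procedure, not code taken from RASCL; the function name and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the same order as the input p-values."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Illustrative p-values only (not taken from the supplementary tables).
p = [0.01, 0.04, 0.03, 0.5]
q = benjamini_hochberg(p)
# Keep sites passing the FDR <= 0.2 threshold used for Contrast-FEL.
selected = [i for i, value in enumerate(q) if value <= 0.2]
print(q, selected)
```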

    S4 Table. List of sites found to be evolving under directional selection in the entire tree, using a FUBAR-like implementation of the DEPS [35] method.

    The BA.1 tree was rooted on the genome reference sequence for this analysis. Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Target amino-acid: which amino-acids have statistical support (Bayes Factor ≥ 100) for accelerated evolution towards them.

    (CSV)

    S5 Table. Pairs of sites (BA.1) found to have epistatic (co-evolving) substitution patterns by the BGM method.

    Gene/ORF: which gene or ORF does this site belong to. Codon 1/2 (in gene/ORF): the location of the two interacting codons in the corresponding Gene/ORF. Posterior probability of non-independence: estimated posterior probability that substitutions which occur on the interior branches of the BA.1 clade are not independent.

    (CSV)

    S6 Table. List of sites found to be under pervasive negative selection by FEL (p≤0.05) along internal branches in BA.1.

    Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Synonymous rate: Site estimate for the synonymous substitution rate (alpha). These values are highly imprecise and should be viewed as qualitative measures. Non-synonymous rate: Site estimate for the non-synonymous substitution rate (beta). These values are highly imprecise and should be viewed as qualitative measures. p-value: the p-value for the likelihood ratio test that beta / alpha ≠ 1. q-value: multiple-test corrected q-value (Benjamini-Hochberg).

    (CSV)
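
    As a rough illustration of how the FEL columns relate to one another, the sketch below converts a pair of log-likelihoods into a likelihood ratio test p-value and labels a site from its alpha and beta estimates. It assumes the test statistic is compared against a chi-squared distribution with one degree of freedom; the function names, thresholds, and input values are hypothetical and are not taken from the RASCL or HyPhy source.

```python
import math


def lrt_p_value(lnl_null: float, lnl_alt: float) -> float:
    """p-value for a likelihood ratio test, assuming a chi-squared null with 1 d.f."""
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # For one degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


def classify_site(alpha: float, beta: float, p_value: float, threshold: float = 0.05) -> str:
    """Label a site from its synonymous (alpha) and non-synonymous (beta) rate estimates."""
    if p_value > threshold:
        return "not significant"
    return "negative" if beta < alpha else "positive"


# Illustrative numbers only: the alternative model improves the log-likelihood by 3 units.
p = lrt_p_value(lnl_null=-1000.0, lnl_alt=-997.0)
print(p, classify_site(alpha=1.2, beta=0.1, p_value=p))
```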

    S7 Table. BUSTED[S] selection results on the BA.1 SARS-CoV-2 clade across segments.

    Segment corresponds to the gene or ORF under analysis. Omega1 refers to the first omega rate class; p1 refers to the proportion of sites which fit this rate class. Omega2 refers to the second omega rate class; p2 refers to the proportion of sites which fit this rate class. Omega3 refers to the third omega rate class, which captures the episodic diversifying features; p3 refers to the proportion of sites which fit this rate class. P-value refers to the p-value for the likelihood ratio test. Q-value refers to the multiple-test corrected q-value (Benjamini-Hochberg).

    (CSV)
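
    One way to read the three (omega, proportion) pairs in this table is to check that the proportions sum to one and to summarize the weight placed on classes with omega > 1. The sketch below does only that; the function name and the example values are illustrative assumptions, not results from the table.

```python
def summarize_omega_classes(classes, tol=1e-6):
    """Summarize a list of (omega, proportion) pairs such as [(omega1, p1), (omega2, p2), (omega3, p3)]."""
    total = sum(proportion for _, proportion in classes)
    if abs(total - 1.0) > tol:
        raise ValueError(f"proportions sum to {total}, expected 1")
    # Proportion-weighted mean omega across the rate classes.
    mean_omega = sum(omega * proportion for omega, proportion in classes)
    # Total proportion assigned to classes with omega > 1.
    positive_fraction = sum(proportion for omega, proportion in classes if omega > 1.0)
    return mean_omega, positive_fraction


# Illustrative values only.
mean_omega, positive_fraction = summarize_omega_classes([(0.0, 0.90), (1.0, 0.09), (25.0, 0.01)])
print(mean_omega, positive_fraction)
```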

    S8 Table. List of sites found to be under diversifying positive selection by MEME (p≤0.05) along internal branches in BA.5, as well as biochemical properties that are important at this site (via the PRIME method).

    Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. % Of branches with omega>1: the fraction of tree branches (internal branches of the BA.5 clade) that have evidence of diversifying positive selection at this site (100%: pervasive selection; <100%: episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this site). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five composite biochemical properties from [52] are conserved or changed at this site.

    (CSV)

    S9 Table. List of sites found to be under diversifying positive selection by MEME (p≤0.05) along all branches in BA.5, as well as biochemical properties that are important at this site (via the PRIME method).

    Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. % Of branches with omega>1: the fraction of tree branches (internal branches of the BA.5 clade) that have evidence of diversifying positive selection at this site (100%: pervasive selection; <100%: episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this site). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five composite biochemical properties from [52] are conserved or changed at this site.

    (CSV)

    S10 Table. List of sites found to be selected differentially along internal branches between BA.5 and background sequences (FDR ≤ 0.2) using the Contrast-FEL method.

    Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Ratio of omega (BA.5: background): the ratio of site-level omega estimates for the two sets of branches. If this ratio is > 1, then selection on BA.5 is stronger. These values are highly imprecise and should be viewed as qualitative measures. p-value: the p-value for the likelihood ratio test that omega ratios between the internal branches of the two clades are different. This is not corrected for multiple testing. q-value: multiple-test corrected q-value (Benjamini-Hochberg).

    (CSV)

    S11 Table. List of sites found to be evolving under directional selection in the entire tree, using a FUBAR-like implementation of the DEPS method.

    The BA.5 tree was rooted on the genome reference sequence for this analysis. Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Target amino-acid: which amino-acids have statistical support (Bayes Factor ≥ 100) for accelerated evolution towards them.

    (CSV)

    S12 Table. Pairs of sites (BA.5) found to have epistatic (co-evolving) substitution patterns by the BGM method.

    Gene/ORF: which gene or ORF does this site belong to. Codon 1/2 (in gene/ORF): the location of the two interacting codons in the corresponding Gene/ORF. Posterior probability of non-independence: estimated posterior probability that substitutions which occur on the interior branches of the BA.5 clade are not independent.

    (CSV)

    S13 Table. List of sites found to be under pervasive negative selection by FEL (p≤0.05) along internal branches in BA.5.

    Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Synonymous rate: Site estimate for the synonymous substitution rate (alpha). These values are highly imprecise and should be viewed as qualitative measures. Non-synonymous rate: Site estimate for the non-synonymous substitution rate (beta). These values are highly imprecise and should be viewed as qualitative measures. p-value: the p-value for the likelihood ratio test that beta / alpha ≠ 1. q-value: multiple-test corrected q-value (Benjamini-Hochberg).

    (CSV)

    S14 Table. List of sites found to be under diversifying positive selection by MEME (p≤0.05) along internal branches in B.1.621, as well as biochemical properties that are important at this site (via the PRIME method).

    Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. % of branches with omega > 1: the fraction of tree branches (internal branches of the B.1.621 clade) that have evidence of diversifying positive selection at this site (100%: pervasive selection; <100%: episodic selection). p-value: the p-value for the likelihood ratio test that a non-zero fraction of branches have omega > 1 (i.e., episodic diversifying selection at this site). This is not corrected for multiple testing; the MEME test is generally conservative on real data. q-value: multiple-test corrected q-value (Benjamini-Hochberg). Properties: which, if any, of the five composite biochemical properties from [52] are conserved or changed at this site.

    (CSV)

    S15 Table. List of sites found to be evolving under directional selection in the entire tree, using a FUBAR-like implementation of the DEPS method.

    The B.1.621 tree was rooted on the genome reference sequence for this analysis. Coordinate (SARS-CoV-2): the starting coordinate of the codon in the NCBI reference SARS-CoV-2 genome. Gene/ORF: which gene or ORF does this site belong to. Codon (in gene/ORF): the location of the codon in the corresponding Gene/ORF. Target amino-acid: which amino-acids have statistical support (Bayes Factor ≥ 100) for accelerated evolution towards them.

    (CSV)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    The RASCL application, depicted at a high level in Fig 1, is implemented:
    • As a standalone pipeline (https://github.com/veg/RASCL) in Snakemake [37].
    • As a web application (https://galaxy.hyphy.org/u/hyphy/w/rapid-assessment-of-selection-on-clades-and-lineages), integrated as a workflow in the Galaxy framework, that is freely available for use on powerful public computing infrastructure (https://usegalaxy.org).


    Articles from PLOS ONE are provided here courtesy of PLOS