Abstract
An important component of efforts to manage the ongoing COVID-19 pandemic is the Rapid Assessment of how natural selection contributes to the emergence and proliferation of potentially dangerous SARS-CoV-2 lineages and CLades (RASCL). The RASCL pipeline enables continuous comparative phylogenetics-based selection analyses of rapidly growing clade-focused genome surveillance datasets, such as those produced following the initial detection of potentially dangerous variants. From such datasets RASCL automatically generates down-sampled codon alignments of individual genes/ORFs containing contextualizing background reference sequences, analyzes these with a battery of selection tests, and outputs results as both machine-readable JSON files and interactive notebook-based visualizations.
Rapid characterization and assessment of the clade-specific molecular features of individual persistent or rapidly expanding SARS-CoV-2 lineages has become an important component of efforts to monitor and manage the COVID-19 pandemic. Analyses of natural selection have been broadly incorporated into such assessments as a primary tool for inferring the selective processes under which novel SARS-CoV-2 variants evolve (Tegally et al., 2021; Faria et al., 2021; Martin et al., 2021; MacLean et al., 2021). Ongoing monitoring of emergent variants of interest (VOI) or concern (VOC) can detect potentially adaptive mutations before they rise to high frequency, and help establish the relationships between individual mutations and key viral characteristics including pathogenicity, transmissibility, and drug resistance (Hamed et al., 2021; Young et al., 2021; Luchsinger et al., 2021; Abdool Karim et al., 2021; Maher et al., 2021). Molecular patterns of ongoing selection that are evident within sequences sampled from particular VOI or VOC clades may also reveal the sub-lineages within these clades that carry potentially fitness-enhancing mutations and which are therefore most likely to drive future viral transmission (Rambaut et al., 2020).
Here, we present RASCL (Rapid Assessment of SARS-CoV-2 CLades), an analytic pipeline designed to investigate the nature and extent of selective forces acting on viral genes in SARS-CoV-2 sequences through comparative phylogenetic analyses (Figure 1A). RASCL is implemented as an easy-to-use, standalone pipeline and as a web application, integrated in the Galaxy framework and available for use on powerful public computing infrastructure (Afgan et al., 2018).
Figure 1.
(A) A flowchart diagram of the main analytic engine of RASCL. (B) Examples of the ObservableHQ visualization notebook elements for the main Omicron clade (BA.1).
The RASCL pipeline takes as input (i) a “query” dataset comprising a single FASTA file containing unaligned SARS-CoV-2 full or partial genomes belonging to a clade of interest (e.g., all sequences from the PANGO lineage B.1.617.2) and (ii) a generic “background” dataset that might comprise, for example, a set of sequences that are representative of global SARS-CoV-2 genomic diversity assembled from ViPR (Pickett et al., 2012). It is not necessary to remove sequences in the query dataset from the background dataset; the pipeline does this automatically. The choice of “query” and “background” datasets is analysis-specific. For example, if another clade of interest is provided as background, it is possible to directly identify sites that are evolving differently between the two clades. Other sensible choices of query sequences might be sequences from a specific country or region, or sequences sampled during a particular time period. Following the disassembly of whole-genome datasets into individual coding sequences (based on the NCBI SARS-CoV-2 reference annotation), the gene datasets (each containing a set of query and background sequences) are processed in parallel.
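The automatic removal of query sequences from the background set can be illustrated with a minimal sketch. It assumes that records are matched by sequence identifier (the first whitespace-delimited token of the FASTA header); the function names `read_fasta`, `seq_id`, and `remove_query_from_background` are hypothetical and are not taken from the RASCL code base, which may match records differently.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def seq_id(header: str) -> str:
    """Return the first whitespace-delimited token of a header (assumed to be the ID)."""
    parts = header.split()
    return parts[0] if parts else ""


def remove_query_from_background(
    query: Dict[str, str], background: Dict[str, str]
) -> Dict[str, str]:
    """Keep only background records whose ID does not occur in the query set."""
    query_ids = {seq_id(h) for h in query}
    return {h: s for h, s in background.items() if seq_id(h) not in query_ids}


if __name__ == "__main__":
    # Toy data: "q1" appears in both files and should be dropped from the background.
    query = dict(read_fasta(io.StringIO(">q1 clade-of-interest\nATG\nAAA\n>q2\nATGCCC\n")))
    background = dict(read_fasta(io.StringIO(">q1\nATGAAA\n>bg1\nATGTTT\n")))
    print(list(remove_query_from_background(query, background)))
```

Matching on identifiers keeps the sketch simple; matching on sequence content instead would also catch records that were resubmitted under a different name.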
Using complete linkage distance clustering implemented in the TN93 package (https://github.com/veg/tn93), RASCL subsamples from available sequences while attempting to maintain genomic diversity; the clustering threshold distance is chosen automatically to include no more than a user-specified number of genomes (e.g., 300). A combined (query and background) alignment is then created, retaining from the background dataset only those sequences that are divergent enough to be useful for subsequent selection analyses. A maximum likelihood phylogenetic tree is inferred from the combined dataset (RAxML-NG, Kozlov et al., 2019, or IQ-TREE, Nguyen et al., 2015), and the query and background branches of this tree are labeled. Selection analyses are then performed with state-of-the-art molecular evolution models implemented in HyPhy (Pond et al., 2020):
SLAC: performs substitution mapping (Pond and Frost, 2005).
BGM: identifies groups of sites that are apparently co-evolving (Poon et al., 2008).
FEL: locates codon sites with evidence of pervasive positive diversifying or negative selection (Pond and Frost, 2005).
MEME: locates codon sites with evidence of episodic positive diversifying selection (Murrell et al., 2012).
BUSTEDS: tests for gene-wide episodic selection (Wisotsky et al., 2020).
RELAX: compares gene-wide selection pressure between the query clade and background sequences (Wertheim et al., 2015).
CFEL: compares site-by-site selection pressure between query and background sequences (Pond et al., 2021).
FADE: identifies amino-acid sites with evidence of directional selection (Pond et al., 2008).
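The diversity-preserving subsampling that precedes these analyses can be sketched as follows. This is an illustration under stated assumptions, not the TN93 package's implementation: a simple proportion-of-differences distance stands in for the TN93 distance, the clustering is a naive agglomerative loop suitable only for toy inputs, and the names `p_distance` and `complete_linkage_subsample` are hypothetical. The sketch merges clusters by complete linkage until no more than the requested number remain, reports the distance of the last merge as the effective threshold, and keeps one representative per cluster.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites; a simple stand-in for the TN93 distance."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def complete_linkage_subsample(
    seqs: Dict[str, str], max_clusters: int
) -> Tuple[float, List[str]]:
    """Merge clusters (complete linkage) until at most max_clusters remain.

    Returns the largest merge distance used (the effective threshold) and
    one representative name per remaining cluster.
    """
    if max_clusters < 1:
        raise ValueError("max_clusters must be at least 1")
    names = list(seqs)
    dist = {
        frozenset((a, b)): p_distance(seqs[a], seqs[b])
        for a, b in combinations(names, 2)
    }
    clusters = [[n] for n in names]
    threshold = 0.0

    def linkage(c1: List[str], c2: List[str]) -> float:
        # Complete linkage: distance between the two most distant members.
        return max(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > max_clusters:
        # Find the closest pair of clusters under complete linkage (i < j).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        threshold = max(threshold, linkage(clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return threshold, [c[0] for c in clusters]


if __name__ == "__main__":
    toy = {"a": "AAAA", "b": "AAAT", "c": "TTTT", "d": "TTTA"}
    print(complete_linkage_subsample(toy, max_clusters=2))
```

Because complete linkage bounds the largest distance within each cluster, every discarded sequence lies within the reported threshold of the representative that is kept for its cluster.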
To mitigate the potentially confounding influences of within-host evolution and sequencing errors, these analyses are performed only on internal branches of the phylogenetic tree (Lorenzo-Redondo et al., 2016). Results are combined into two machine-readable JSON files (“Summary” and “Annotation”) that are used for web processing. A feature-rich interactive notebook in ObservableHQ (Perkel 2021, https://observablehq.com/@aglucaci/rascl) is used to visualize and summarize RASCL results (Figure 1B).
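The restriction to internal branches, together with the query/background labeling of those branches, can be illustrated with a small sketch. It makes two assumptions that are not stated in the text: the tree is represented as nested Python tuples (tips are strings) rather than parsed from a Newick file, and an internal branch is labeled "query" only when every tip below it belongs to the query set. The function names `tips` and `internal_branches` are hypothetical, and the labeling rule used by RASCL itself may differ.

```python
from typing import List, Set, Tuple, Union

Tree = Union[str, tuple]  # a tip name, or a tuple of child subtrees


def tips(node: Tree) -> List[str]:
    """Return the names of all tips below (and including) a node."""
    if isinstance(node, str):
        return [node]
    return [t for child in node for t in tips(child)]


def internal_branches(tree: Tree, query_names: Set[str]) -> List[Tuple[Tuple[str, ...], str]]:
    """List every internal branch below the root as (sorted tips below it, label).

    Terminal branches (those leading to a single tip) are skipped entirely.
    """
    out: List[Tuple[Tuple[str, ...], str]] = []

    def visit(node: Tree, is_root: bool) -> None:
        if isinstance(node, str):
            return  # terminal branch: excluded from the analysis
        if not is_root:
            below = tips(node)
            # Assumed rule: "query" only if all descendant tips are query sequences.
            label = "query" if all(t in query_names for t in below) else "background"
            out.append((tuple(sorted(below)), label))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return out


if __name__ == "__main__":
    toy_tree = ((("q1", "q2"), "q3"), ("b1", "b2"))
    for branch, label in internal_branches(toy_tree, {"q1", "q2", "q3"}):
        print(label, branch)
```

Excluding terminal branches removes substitutions observed in only a single sampled genome, which is where within-host variants and sequencing errors are most likely to appear.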
RASCL is currently available in two distributions: (i) through a web interface via the Galaxy Project as a workflow (https://usegalaxy.eu/u/hyphy/w/rascl); and (ii) as a standalone pipeline via a dedicated GitHub repository (https://github.com/veg/RASCL). For the web application implementation, the alignment, tree, and analysis results are stored and made web-accessible via the Galaxy platform. Results are visualized with an interactive notebook hosted on ObservableHQ (Figure 1B; Perkel 2021) that includes an alignment viewer, a visualization of individual codons/amino acid states at user-selected sites mapped onto the tips of a phylogenetic tree, and detailed tabulated information on analysis results for individual genes and codon sites.
RASCL has been used to characterize the role of natural selection in the emergence of the Beta (Tegally et al., 2021), Gamma (Faria et al., 2021), and Omicron (Viana et al., 2021) VOC lineages, and to identify patterns of convergent evolution in N501Y SARS-CoV-2 lineages (Martin et al., 2021). Whenever future genomic surveillance efforts reveal new potentially problematic SARS-CoV-2 lineages, we anticipate that RASCL will be productively used to analyze these too. Finally, RASCL has been designed so that, with minimal modification, it can also be adapted to analyze any other viral pathogen for which sufficient sequencing data are available.
Acknowledgements
We thank members of the Datamonkey/HyPhy and Galaxy teams for their assistance in the development of this application. DPM is funded by the Wellcome Trust (222574/Z/21/Z). This research was supported in part by grants R01 AI134384 (NIH/NIAID) and 2027196 (NSF/DBI,BIO) to AN and SLKP.
Footnotes
Availability: RASCL is available from a dedicated repository at https://github.com/veg/RASCL and as a Galaxy workflow at https://usegalaxy.eu/u/hyphy/w/rascl. Existing clade/variant analysis results are available at https://observablehq.com/@aglucaci/rascl.
References
- 1. Tegally H., Wilkinson E., Giovanetti M. et al. Detection of a SARS-CoV-2 variant of concern in South Africa. Nature 592, 438–443 (2021). doi: 10.1038/s41586-021-03402-9
- 2. Faria NR, Mellan TA, Whittaker C, Claro IM, Candido DDS, Mishra S, Crispim MAE, Sales FCS, Hawryluk I, McCrone JT, Hulswit RJG, Franco LAM, Ramundo MS, de Jesus JG, Andrade PS, Coletti TM, Ferreira GM, Silva CAM, Manuli ER, Pereira RHM, Peixoto PS, Kraemer MUG, Gaburo N Jr, Camilo CDC, Hoeltgebaum H, Souza WM, Rocha EC, de Souza LM, de Pinho MC, Araujo LJT, Malta FSV, de Lima AB, Silva JDP, Zauli DAG, Ferreira ACS, Schnekenberg RP, Laydon DJ, Walker PGT, Schlüter HM, Dos Santos ALP, Vidal MS, Del Caro VS, Filho RMF, Dos Santos HM, Aguiar RS, Proença-Modena JL, Nelson B, Hay JA, Monod M, Miscouridou X, Coupland H, Sonabend R, Vollmer M, Gandy A, Prete CA Jr, Nascimento VH, Suchard MA, Bowden TA, Pond SLK, Wu CH, Ratmann O, Ferguson NM, Dye C, Loman NJ, Lemey P, Rambaut A, Fraiji NA, Carvalho MDPSS, Pybus OG, Flaxman S, Bhatt S, Sabino EC. Genomics and epidemiology of the P.1 SARS-CoV-2 lineage in Manaus, Brazil. Science. 2021 May 21;372(6544):815–821. doi: 10.1126/science.abh2644. Epub 2021 Apr 14.
- 3. Elbe S., and Buckland-Merrett G. (2017) Data, disease and diplomacy: GISAID’s innovative contribution to global health. Global Challenges, 1:33–46. doi: 10.1002/gch2.1018
- 4. Pickett BE, Sadat EL, Zhang Y, Noronha JM, Squires RB, Hunt V, Liu M, Kumar S, Zaremba S, Gu Z, Zhou L, Larson CN, Dietrich J, Klem EB, Scheuermann RH. ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic Acids Res. 2012 Jan;40(Database issue):D593–8. doi: 10.1093/nar/gkr859. Epub 2011 Oct 17.
- 5. Kosakovsky Pond Sergei L, Poon Art FY, Velazquez Ryan, Weaver Steven, Hepler N Lance, Murrell Ben, Shank Stephen D, Magalis Brittany Rife, Bouvier Dave, Nekrutenko Anton, Wisotsky Sadie, Spielman Stephanie J, Frost Simon DW, Muse Spencer V (2020) HyPhy 2.5—A Customizable Platform for Evolutionary Hypothesis Testing Using Phylogenies. Molecular Biology and Evolution 37(1): 295–299.
- 6. Weaver Steven, Shank Stephen D., Spielman Stephanie J., Li Michael, Muse Spencer V., Kosakovsky Pond Sergei L. (2018) Datamonkey 2.0: a modern web application for characterizing selective and other evolutionary processes. Mol. Biol. Evol. 35(3):773–777.
- 7. Tamura K, Nei M. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol. 1993 May;10(3):512–26. doi: 10.1093/oxfordjournals.molbev.a040023.
- 8. Kozlov Alexey M, Darriba Diego, Flouri Tomáš, Morel Benoit, Stamatakis Alexandros. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics, Volume 35, Issue 21, 1 November 2019, Pages 4453–4455. doi: 10.1093/bioinformatics/btz305
- 9. Kosakovsky Pond Sergei L. and Frost Simon D. W. (2005) Not So Different After All: A Comparison of Methods for Detecting Amino Acid Sites Under Selection. Molecular Biology and Evolution 22(5): 1208–1222.
- 10. Poon Art F. Y., Lewis Fraser I., Frost Simon D. W., Kosakovsky Pond Sergei L. Spidermonkey: rapid detection of co-evolving sites using Bayesian graphical models. Bioinformatics, Volume 24, Issue 17, 1 September 2008, Pages 1949–1950. doi: 10.1093/bioinformatics/btn313
- 11. Murrell Ben, Wertheim Joel O., Moola Sasha, Weighill Thomas, Scheffler Konrad and Kosakovsky Pond Sergei L. (2012) Detecting Individual Sites Subject to Episodic Diversifying Selection. PLoS Genetics 8(7): e1002764.
- 12. Smith M. D., Wertheim J. O., Weaver S., Murrell B., Scheffler K. and Kosakovsky Pond S. L. (2015) Less Is More: An Adaptive Branch-Site Random Effects Model for Efficient Detection of Episodic Diversifying Selection. Molecular Biology and Evolution 32: 1342–1353.
- 13. Wisotsky SR, Kosakovsky Pond SL, Shank SD, Muse SV. Synonymous Site-to-Site Substitution Rate Variation Dramatically Inflates False Positive Rates of Selection Analyses: Ignore at Your Own Peril. Mol Biol Evol. 2020 Aug 1;37(8):2430–2439. doi: 10.1093/molbev/msaa037.
- 14. Wertheim JO, Murrell B, Smith MD, Kosakovsky Pond SL, Scheffler K. RELAX: detecting relaxed selection in a phylogenetic framework. Mol Biol Evol. 2015 Mar;32(3):820–32. doi: 10.1093/molbev/msu400. Epub 2014 Dec 23.
- 15. Kosakovsky Pond SL, Wisotsky SR, Escalante A, Magalis BR, Weaver S. Contrast-FEL-A Test for Differences in Selective Pressures at Individual Sites among Clades and Sets of Branches. Mol Biol Evol. 2021 Mar 9;38(3):1184–1198. doi: 10.1093/molbev/msaa263.
- 16. Hamed S.M., Elkhatib W.F., Khairalla A.S. et al. Global dynamics of SARS-CoV-2 clades and their relation to COVID-19 epidemiology. Sci Rep 11, 8435 (2021). doi: 10.1038/s41598-021-87713-x
- 17. Young Barnaby E et al. Association of SARS-CoV-2 clades with clinical, inflammatory and virologic outcomes: An observational study. EBioMedicine, Volume 66, 103319.
- 18. Rambaut A., Loman N., Pybus O., Barclay W., Barrett J., Carabelli A., … & Volz E. (2020). Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations. Genom. Epidemiol.
- 19. Luchsinger LL, Hillyer CD. Vaccine efficacy probable against COVID-19 variants. Science. 2021 Mar 12;371(6534):1116. doi: 10.1126/science.abg9461.
- 20. Abdool Karim SS, de Oliveira T. New SARS-CoV-2 Variants - Clinical, Public Health, and Vaccine Implications. N Engl J Med. 2021 May 13;384(19):1866–1868. doi: 10.1056/NEJMc2100362. Epub 2021 Mar 24.
- 21. Perkel JM. Reactive, reproducible, collaborative: computational notebooks evolve. Nature. 2021 May;593(7857):156–157. doi: 10.1038/d41586-021-01174-w.
- 22. Afgan Enis, Baker Dannon, Batut Bérénice, van den Beek Marius, Bouvier Dave, Čech Martin, Chilton John, Clements Dave, Coraor Nate, Grüning Björn A, Guerler Aysam, Hillman-Jackson Jennifer, Hiltemann Saskia, Jalili Vahid, Rasche Helena, Soranzo Nicola, Goecks Jeremy, Taylor James, Nekrutenko Anton, Blankenberg Daniel. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W537–W544. doi: 10.1093/nar/gky379
- 23. Nguyen L.-T., Schmidt H.A., von Haeseler A., Minh B.Q. (2015) IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies. Mol. Biol. Evol., 32:268–274. doi: 10.1093/molbev/msu300
- 24. Martin D. P., Weaver S., Tegally H., San J. E., Shank S. D., Wilkinson E., Lucaci A. G., Giandhari J., Naidoo S., Pillay Y., Singh L., Lessells R. J., NGS-SA, COVID-19 Genomics UK (COG-UK), Gupta R. K., Wertheim J. O., Nekrutenko A., Murrell B., Harkins G. W., Lemey P., … Kosakovsky Pond S. L. (2021). The emergence and ongoing convergent evolution of the SARS-CoV-2 N501Y lineages. Cell, 184(20), 5189–5200.e7. doi: 10.1016/j.cell.2021.09.003
- 25. MacLean Oscar A., Lytras Spyros, Weaver Steven, Singer Joshua B., Boni Maciej F., Lemey Philippe, Kosakovsky Pond Sergei L., and Robertson David L. “Natural selection in the evolution of SARS-CoV-2 in bats created a generalist virus and highly capable human pathogen.” PLoS Biology 19, no. 3 (2021): e3001115.
- 26. Maher M. Cyrus, Bartha Istvan, Weaver Steven, di Iulio Julia, Ferri Elena, Soriaga Leah, Lempp Florian A., Hie Brian L., Bryson Bryan, Berger Bonnie, Robertson David L., Snell Gyorgy, Corti Davide, Virgin Herbert W., Kosakovsky Pond Sergei L., Telenti Amalio. Predicting the mutational drivers of future SARS-CoV-2 variants of concern. medRxiv 2021.06.21.21259286; doi: 10.1101/2021.06.21.21259286
- 27. Kosakovsky Pond S. L., Poon A. F., Leigh Brown A. J., & Frost S. D. (2008). A maximum likelihood method for detecting directional evolution in protein sequences and its application to influenza A virus. Molecular Biology and Evolution, 25(9), 1809–1824. doi: 10.1093/molbev/msn123
- 28. Lorenzo-Redondo R, Fryer HR, Bedford T, Kim EY, Archer J, Pond SLK, Chung YS, Penugonda S, Chipman J, Fletcher CV, Schacker TW, Malim MH, Rambaut A, Haase AT, McLean AR, Wolinsky SM. Persistent HIV-1 replication maintains the tissue reservoir during therapy. Nature. 2016 Feb 4;530(7588):51–56. doi: 10.1038/nature16933. Epub 2016 Jan 27.
- 29. Viana R, Moyo S, Amoako DG, Tegally H, Scheepers C, Althaus CL, Anyaneji UJ, Bester Phillip A, Boni Maciej F, Chand Mohammed, Choga Wonderful T, Colquhoun Rachel, Davids Michaela, Deforche Koen, Doolabh Deelan, Engelbrecht Susan, Everatt Josie, Giandhari Jennifer, Giovanetti Marta, Hardie Diana, Hill Verity, Hsiao Nei-Yuan, Iranzadeh Arash, Ismail Arshad, Joseph Charity, Joseph Rageema, Koopile Legodile, Kosakovsky Pond Sergei L, Kraemer Moritz UG, Kuate-Lere Lesego, Laguda-Akingba Oluwakemi, Lesetedi-Mafoko Onalethatha, Lessells RJ, Lockman Shahin, Lucaci Alexander G, Maharaj Arisha, Mahlangu Boitshoko, Maponga Tongai, Mahlakwane Kamela, Makatini Zinhle, Marais Gert, Maruapula Dorcas, Masupu Kereng, Matshaba Mogomotsi, Mayaphi Simnikiwe, Mbhele Nokuzola, Mbulawa Mpaphi B, Mendes Adriano, Mlisana Koleka, Mnguni Anele, Mohale Thabo, Moir Monika, Moruisi Kgomotso, Mosepele Mosepele, Motsatsi Gerald, Motswaledi Modisa S, Mphoyakgosi Thongbotho, Msomi Nokukhanya, Mwangi Peter N, Naidoo Yeshnee, Ntuli Noxolo, Nyaga Martin, Olubayo Lucier, Pillay S, Botshelo Radibe, Ramphal Y, Ramphal U, San JE, Scott Lesley, Shapiro Roger, Singh Lavanya, Smith-Lawrence Pamela, Stevens Wendy, Strydom Amy, Subramoney Kathleen, Tebeila Naume, Tshiabuila Derek, Tsui Joseph, van Wyk Stephanie, Weaver Steven, Wibmer Constantinos K, Wilkinson Eduan, Wolter Nicole, Zarebski Alexander E, Zuze Boitumelo, Goedhals Dominique, Preiser Wolfgang, Treurnicht Florette, Venter Marietje, Williamson Carolyn, Pybus Oliver G, Bhiman Jinal, Glass Allison, Martin DP, Rambaut A, Gaseitsiwe S, von Gottberg A, de Oliveira T. Rapid epidemic expansion of the SARS-CoV-2 Omicron variant in southern Africa. medRxiv, MEDRXIV-2021–268028v1-deOliveira (2021).
- 30. Tegally H, Wilkinson E, Giovanetti M, et al. Detection of a SARS-CoV-2 variant of concern in South Africa. Nature. 2021 Apr;592(7854):438–443. doi: 10.1038/s41586-021-03402-9.
- 31. Faria Nuno R., Mellan Thomas A., Whittaker Charles, Claro Ingra M., da S. Candido Darlan, Mishra Swapnil, Crispim Myuki A. E., et al. “Genomics and Epidemiology of the P.1 SARS-CoV-2 Lineage in Manaus, Brazil.” Science 372, no. 6544 (May 21, 2021): 815–21. doi: 10.1126/science.abh2644.