Wellcome Open Res. 2021 Sep 17;6:121. Originally published 2021 May 19. [Version 2] doi: 10.12688/wellcomeopenres.16661.2

Tracking the international spread of SARS-CoV-2 lineages B.1.1.7 and B.1.351/501Y-V2 with grinch

Áine O'Toole 1,a,#, Verity Hill 1,#, Oliver G Pybus 2, Alexander Watts 3,4, Issac I Bogoch 5,6, Kamran Khan 3,4,5, Jane P Messina 7; The COVID-19 Genomics UK (COG-UK) consortium; Network for Genomic Surveillance in South Africa (NGS-SA); Brazil-UK CADDE Genomic Network, Houriiyah Tegally 8, Richard R Lessells 8, Jennifer Giandhari 8, Sureshnee Pillay 8, Kefentse Arnold Tumedi 9, Gape Nyepetsi 10, Malebogo Kebabonye 11, Maitshwarelo Matsheka 9, Madisa Mine 10, Sima Tokajian 12, Hamad Hassan 13, Tamara Salloum 12, Georgi Merhi 12, Jad Koweyes 12, Jemma L Geoghegan 14,15, Joep de Ligt 15, Xiaoyun Ren 15, Matthew Storey 15, Nikki E Freed 16, Chitra Pattabiraman 17, Pramada Prasad 17, Anita S Desai 17, Ravi Vasanthapuram 17, Thomas F Schulz 18, Lars Steinbrück 18, Tanja Stadler 19; Swiss Viollier Sequencing Consortium, Antonio Parisi 20, Angelica Bianco 20, Darío García de Viedma 21,22, Sergio Buenestado-Serrano 21, Vítor Borges 23, Joana Isidro 23, Sílvia Duarte 24, João Paulo Gomes 23, Neta S Zuckerman 25, Michal Mandelboim 25, Orna Mor 25, Torsten Seemann 26, Alicia Arnott 27, Jenny Draper 27, Mailie Gall 27, William Rawlinson 28, Ira Deveson 29, Sanmarié Schlebusch 30, Jamie McMahon 30, Lex Leong 31, Chuan Kok Lim 31, Maria Chironna 32, Daniela Loconsole 32, Antonin Bal 33, Laurence Josset 33, Edward Holmes 34, Kirsten St George 35, Erica Lasek-Nesselquist 35, Reina S Sikkema 36, Bas Oude Munnink 36, Marion Koopmans 36, Mia Brytting 37, V Sudha rani 38, S Pavani 38, Teemu Smura 39, Albert Heim 18, Satu Kurkela 40, Massab Umair 41, Muhammad Salman 41, Barbara Bartolini 42, Martina Rueca 42, Christian Drosten 43, Thorsten Wolff 44, Olin Silander 16, Dirk Eggink 45, Chantal Reusken 45, Harry Vennema 45, Aekyung Park 46, Christine Carrington 47, Nikita Sahadeo 47, Michael Carr 48, Gabo Gonzalez 48; SEARCH Alliance San Diego; National Virus Reference Laboratory; SeqCOVID-Spain; Danish Covid-19 Genome Consortium (DCGC); Communicable Diseases Genomic Network (CDGN); Dutch 
National SARS-CoV-2 surveillance program; Division of Emerging Infectious Diseases (KDCA), Tulio de Oliveira 8, Nuno Faria 2,49, Andrew Rambaut 1, Moritz U G Kraemer 2,b
PMCID: PMC8176267  PMID: 34095513

Version Changes

Revised. Amendments from Version 1

We have updated the figures to amend some issues with proofing. We have added in some details of other excellent resources for SARS-CoV-2 international surveillance. Over on cov-lineages.org (which has had a facelift since time of publishing), we have also added in a resources page (https://cov-lineages.org/resources.html) that points the user to both internally developed and externally developed resources for SARS-CoV-2 lineage and variant tracking. Figures 1 and 2 along with title were also updated.

Abstract

Late in 2020, two genetically-distinct clusters of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) with mutations of biological concern were reported, one in the United Kingdom and one in South Africa. Using a combination of data from routine surveillance, genomic sequencing and international travel, we track the international dispersal of lineages B.1.1.7 and B.1.351 (variant 501Y-V2). We account for potential biases in genomic surveillance efforts by including passenger volumes from the locations where each lineage was first reported, London and South Africa respectively. Using the software tool grinch (global report investigating novel coronavirus haplotypes), we track the international spread of lineages of concern with automated daily reports. Further, we have built a custom tracking website (cov-lineages.org/global_report.html) which hosts this daily report and will continue to include novel SARS-CoV-2 lineages of concern as they are detected.

Keywords: genomic surveillance, air travel, SARS-CoV-2, genomics, genome sequencing, virus, surveillance, pandemic, B.1.1.7, B.1.351, N501Y, coronavirus, sequencing, genomic epidemiology

Introduction

In December 2020, routine genomic surveillance in the United Kingdom (UK) 1 reported a new and genetically distinct phylogenetic cluster of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (variant VOC202012/01, lineage B.1.1.7). Preliminary analysis suggests that this lineage carries an unusually large number of genetic changes 2. The earliest known cases of B.1.1.7 were sampled in southern England in late September 2020, and by December the lineage had spread to most UK regions and was growing rapidly 3. In October 2020, a separate SARS-CoV-2 cluster (variant 501Y.V2, lineage B.1.351), which carried a different constellation of genetic changes, was detected by the Network for Genomic Surveillance in South Africa 4, 5. Both lineages carry mutations, especially in the virus spike protein, that may affect virus function, and both appear to have grown rapidly in relative frequency since their discovery. Early analyses of the spatial spread of SARS-CoV-2 highlight the potential for rapid virus dissemination through national and international travel 6, 7. Therefore, continued genomic monitoring of lineages of concern is required.

To facilitate tracking of these lineages on an international scale, we developed a software tool grinch (global report investigating novel coronavirus haplotypes) that collates SARS-CoV-2 genomic data and epidemiological metadata. Resources such as grinch on cov-lineages.org can inform public health bodies and institutions around the world. Other excellent resources to track lineages and variants are available, including covariants.org, which tracks the spread of SARS-CoV-2 variants of interest, and outbreak.info, which gathers multiple sources of genetic and epidemiological data to track lineages. We include a non-exhaustive list of resources for tracking SARS-CoV-2 at https://cov-lineages.org/resources.html.

Methods

To better characterise the international distribution of lineages B.1.1.7 and B.1.351, we sourced SARS-CoV-2 sequences from GISAID 8, 9 and assigned lineages using pangolin (v2.1.6, https://github.com/cov-lineages/pangolin), which implements the nomenclature scheme described in Rambaut et al. 10. Genomes are assigned lineage B.1.1.7 if they exhibit at least 5 of the 17 mutations inferred to have arisen on the phylogenetic branch immediately ancestral to the cluster (Table 1) 2, or to B.1.351 if they exhibit at least 5 of 9 lineage-associated mutations (Table 1) 5. Lineage count and frequency data have been calculated daily using grinch. Using International Air Transport Association (IATA) travel data from October 2020, available through bluedot.global, we aggregated and collated the passenger volumes from international airports in London and South Africa to international destinations on the same booking. Destinations with more than 5,000 passengers from London and more than 300 passengers from South Africa during the month of October are displayed on the cov-lineages.org website and in the underlying data for this publication 11. grinch, with custom python modules that make use of geopandas v0.9, matplotlib v3.2 and seaborn v0.10, combines this information and produces reports with descriptive tables and figures that can be found at https://cov-lineages.org/global_report.html.
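The destination thresholds described above can be illustrated with a short filter. This is a minimal sketch rather than grinch code: the function name `destinations_to_display`, the origin labels and the passenger counts are hypothetical and chosen only for illustration, while the two thresholds are the ones stated in the text.

```python
# Thresholds from the Methods text: passengers during October 2020.
# The dictionary keys are hypothetical labels, not identifiers used by grinch.
THRESHOLDS = {"London": 5000, "South Africa": 300}


def destinations_to_display(passenger_counts, origin):
    """Return destinations whose passenger volume exceeds the origin's threshold.

    passenger_counts maps a destination country to its passenger count.
    """
    threshold = THRESHOLDS[origin]
    return sorted(dest for dest, n in passenger_counts.items() if n > threshold)


# Made-up counts for illustration only; these are not IATA data.
example_counts = {"Country A": 12000, "Country B": 5000, "Country C": 5001}
print(destinations_to_display(example_counts, "London"))
```

The comparison is strict (`>`), which matches the wording "more than 5,000 passengers", so a destination with exactly 5,000 passengers is excluded in this sketch.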

Table 1. Defining mutations for lineages of interest.

Lineage | Defining mutations
B.1.1.7 | orf1ab:T1001I; orf1ab:A1708D; orf1ab:I2230T; del:11288:9; del:21765:6; del:21991:3; S:N501Y; S:A570D; S:P681H; S:T716I; S:S982A; S:D1118H; Orf8:Q27*; Orf8:R52I; Orf8:Y73C; N:D3L; N:S235F
B.1.351/501Y-V2 | E:P71L; N:T205I; orf1a:K1655N; S:D80A; S:D215G; S:K417N; S:E484K; S:N501Y; S:E484K
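The "at least 5 of 17" rule described in the Methods can be sketched as a set-overlap count against the B.1.1.7 mutations in Table 1. This is an illustration only, not grinch or pangolin code: the function name `matches_b117` and the representation of a genome as a collection of mutation strings in the Table 1 notation are assumptions made for the example.

```python
# The 17 B.1.1.7 defining mutations, copied from Table 1.
B117_DEFINING = frozenset([
    "orf1ab:T1001I", "orf1ab:A1708D", "orf1ab:I2230T",
    "del:11288:9", "del:21765:6", "del:21991:3",
    "S:N501Y", "S:A570D", "S:P681H", "S:T716I", "S:S982A", "S:D1118H",
    "Orf8:Q27*", "Orf8:R52I", "Orf8:Y73C",
    "N:D3L", "N:S235F",
])

# Minimum number of defining mutations required, as stated in the Methods.
MIN_MATCHES = 5


def matches_b117(genome_mutations):
    """Return True if the genome carries at least MIN_MATCHES defining mutations.

    genome_mutations is assumed to be an iterable of mutation strings written
    in the same notation as Table 1 (a hypothetical input format).
    """
    shared = B117_DEFINING & set(genome_mutations)
    return len(shared) >= MIN_MATCHES


# A made-up genome carrying five of the defining mutations plus one other change.
example_genome = ["S:N501Y", "S:A570D", "S:P681H", "S:T716I", "S:S982A", "S:D614G"]
print(matches_b117(example_genome))
```

Counting matches against a threshold, rather than requiring all 17 mutations, tolerates genomes with missing or ambiguous coverage at some sites.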

Implementation

All of the code underlying this daily lineage tracking web-report can be found at GitHub and Zenodo 12. grinch is a python-based tool, the analysis pipeline of which is built on a snakemake backbone 13. Every 24 hours a scheduled cron 14 task runs on our local servers. We download the latest data from GISAID and deduplicate based on sequence names. The sequences are assigned their most likely lineage using pangolin’s latest version and model files. All processed metadata is available and maintained on the cov-lineages.org GitHub repository. To run grinch, the user must have access to a GISAID direct download key and a password and provide these within a configuration file for use. The command used to run grinch is grinch -i grinch_config.yaml, using the config file provided at doi: 10.5281/zenodo.4640379 15.
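The deduplication step mentioned above can be sketched as follows. The text states only that records are deduplicated on sequence name; which copy is kept is not specified, so keeping the first occurrence is an assumption here, as are the function name `deduplicate_by_name` and the `(name, sequence)` record layout.

```python
def deduplicate_by_name(records):
    """Keep one record per sequence name, preserving input order.

    records is an iterable of (name, sequence) pairs. Keeping the first
    occurrence of each name is an assumption made for this sketch.
    """
    seen = set()
    unique = []
    for name, sequence in records:
        if name in seen:
            continue  # a record with this name has already been kept
        seen.add(name)
        unique.append((name, sequence))
    return unique


# Made-up records for illustration; the names and sequences are not real data.
records = [
    ("sample/1/2020", "ACGT"),
    ("sample/2/2020", "ACGA"),
    ("sample/1/2020", "ACGT"),
]
print(len(deduplicate_by_name(records)))
```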

Operation

Most users will not run grinch themselves; instead, all information and useful descriptive figures are provided daily on the web report. Users can navigate to cov-lineages.org in a web browser of choice to view the latest daily report.

Results and discussion

As of 7th Jan 2021, 45 countries had reported the presence of B.1.1.7 and 13 countries had reported B.1.351/501Y.V2. B.1.1.7 and B.1.351 genome sequences were available for 28 and 8 countries, respectively (Figure 1a, b, c) 11. Although some countries report increases in the relative frequency of B.1.1.7, genome sequencing efforts vary considerably. Potential targeting of sequencing towards travellers from the UK could bias frequency estimates upwards (Figure 1b, c), and differing genome sharing policies and delays may also skew reporting estimates. The time between the initial collection date of a new variant sample in a country and the first availability of a corresponding virus genome on GISAID was, on average, 12 days (range 1–71).
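The reporting lag summarised above (a mean and a range in days) can be computed as sketched below. This is an illustration under stated assumptions: the function name `lag_summary` and the example dates are hypothetical, and the snippet does not reproduce the 12-day figure reported in the text.

```python
from datetime import date


def lag_summary(date_pairs):
    """Return (mean, minimum, maximum) lag in days.

    date_pairs is a non-empty list of (collection_date, availability_date)
    pairs, one per country, both given as datetime.date objects.
    """
    lags = [(available - collected).days for collected, available in date_pairs]
    return sum(lags) / len(lags), min(lags), max(lags)


# Made-up dates for illustration only.
pairs = [
    (date(2020, 12, 1), date(2020, 12, 2)),   # lag of 1 day
    (date(2020, 12, 1), date(2020, 12, 11)),  # lag of 10 days
]
mean_lag, shortest, longest = lag_summary(pairs)
print(mean_lag, shortest, longest)
```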

Figure 1.

a) The cumulative number of countries with reports of lineage B.1.1.7 (grey line) and cumulative number of genomes of B.1.1.7 deposited in GISAID. b) Rolling seven-day average of the proportion of B.1.1.7 genomes, compared to all sampled genomes in that country, for countries with more than ten sequences of the variant and with more than ten days between the first B.1.1.7 sequence and the most recent one. c) Number of sequences (log10) per country. Colour indicates the proportion of sequences that are classified as lineage B.1.1.7. d) Number of air travellers from major international London airports (Heathrow, Gatwick, Luton, City, Stansted, Southend) during October 2020. Colour indicates the number of sampled genomes of lineage B.1.1.7. ‘Reported’ refers to countries for which we found media reports stating that sequences of that particular lineage had been detected, but for which there were no sequences on GISAID. This is distinct from ‘not reported’, where no records of that lineage were found for a given country. e) Map of international flights from major international London airports to countries with B.1.1.7 sequences. Colours indicate the date of earliest detection of B.1.1.7 in each country. The width of the lines indicates the number of flights. International Air Transport Association data used here account for ~90% of passenger travel itineraries on commercial flights, excluding transportation via unscheduled charter flights (the remainder is modelled using market intelligence). Data shown represent origin-destination journeys during October 2020. Routes to countries that have not yet detected B.1.1.7 and deposited data on GISAID are not included.
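The inclusion rule and smoothing described for panel b can be sketched as follows. This is not the grinch plotting code. The caption does not say how the window is aligned or how the first few days are handled, so a trailing window that averages over the days available is an assumption, as are the function names `is_eligible` and `rolling_mean`.

```python
from datetime import date, timedelta


def is_eligible(lineage_dates):
    """Caption's inclusion rule: more than ten sequences spanning more than ten days."""
    if len(lineage_dates) <= 10:
        return False
    return (max(lineage_dates) - min(lineage_dates)).days > 10


def rolling_mean(values, window=7):
    """Trailing rolling mean; the first entries average over fewer days."""
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Made-up sampling dates: twelve lineage genomes collected on consecutive days.
sample_dates = [date(2020, 12, 1) + timedelta(days=i) for i in range(12)]

# Made-up daily proportions of the lineage among all genomes sampled that day.
daily_proportion = [0.0, 0.1, 0.1, 0.2, 0.3, 0.3, 0.4, 0.5]

if is_eligible(sample_dates):
    smoothed = rolling_mean(daily_proportion)
    print(len(smoothed))
```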

The number of B.1.1.7 and B.1.351/501Y.V2 genome sequences reported in each country is a consequence of (i) the intensity of local genomic surveillance; (ii) the level of concern about new variant introductions; (iii) the volume of international travel among affected countries; and (iv) the amount of local transmission following the introduction of a lineage from elsewhere. To explore these factors, we analysed the most recent available IATA travel data (October 2020). We collated the total number of origin-to-destination air journeys between major London international airports and each country. The calculation was repeated for journeys originating in all international South African airports. We focussed on London and South Africa as they are the locations with the first reports and highest reported prevalence of lineages B.1.1.7 and B.1.351 respectively 2, 5. However, due to low SARS-CoV-2 genomic surveillance in many locations, we cannot reject the hypothesis that these lineages initially originated elsewhere. Figure 1d shows destinations receiving >5,000 travellers in October 2020 from the UK (Figure 2 shows destinations receiving >300 travellers from South Africa).

Figure 2.

a) The cumulative number of countries with reports of lineage B.1.351 (black line) and cumulative number of genomes of B.1.351 deposited in GISAID. b) Rolling seven-day average of the proportion of B.1.351 genomes, compared to all sampled genomes in that country, for countries with more than ten sequences of the variant and with more than ten days between the first B.1.351 sequence and the most recent one. c) Number of sequences (log10) per country. Colour indicates the proportion of sequences that are classified as lineage B.1.351. d) Number of air travellers from South Africa during October 2020. Colour indicates the number of sampled genomes of lineage B.1.351. ‘Not reported’ refers to countries with no record of B.1.351; ‘reported’ refers to countries for which we found media reports of the lineage but which had no SARS-CoV-2 genomes shared on GISAID at that time. e) Map of international flights to countries with B.1.351 sequences. Colours indicate the date of earliest detection of B.1.351 in each country. The width of the lines indicates the number of flights. International Air Transport Association data used here account for ~90% of passenger travel itineraries on commercial flights, excluding transportation via unscheduled charter flights (the remainder is modelled using market intelligence). Data shown represent origin-destination journeys during October 2020. Routes to countries that have not yet detected B.1.351 and deposited data on GISAID are not included.

Of the countries that receive >5,000 travellers from London, 16 have sequenced B.1.1.7. Of the 45 countries that have identified B.1.1.7 (32 in travellers and 13 with local onward transmission), only 6 perform real-time routine genomic surveillance (Denmark, UK, Iceland, The Netherlands, Australia, Sweden), 3 have prioritised sequencing based on S-gene target failure tests 16, 30 primarily targeted sequencing towards arriving travellers from the UK, and there was no information available for 10 (details at https://github.com/cov-lineages/lineages-website/blob/master/_data/). Of the 13 countries that have identified B.1.351 (four with local onward transmission including South Africa), 4 perform routine sequencing (South Africa, UK, Botswana, Australia), 6 target sequencing of travellers, and there was no information available for 3. Consequently, the number of sequences reported shows no clear relationship with flight numbers and instead reflects current genomic surveillance effort. For example, in September, the UK sequenced ~13% of its reported cases and Denmark sequenced ~21%. In comparison, Israel sequenced ~0.002% of its cases during the same period 17, 18.

Our study has several limitations. The passenger flight data do not include recent changes to holiday travel, and recent restrictions on travel from the UK and South Africa are not reflected in the mobility data. Further, flight data may not accurately reflect the final destination if multiple tickets are purchased.

The discovery and rapid spread of B.1.1.7 and B.1.351/501Y.V2 highlight the importance of real-time and open data for tracking the spread of SARS-CoV-2 and for informing future public health interventions and travel advice.

Data availability

Underlying data

Zenodo: Accession IDs included in publication Tracking the international spread of SARS-CoV-2 lineages B.1.1.7 and B.1.351/501Y-V2. https://doi.org/10.5281/zenodo.4642401 9.

This project contains the following underlying data:

Zenodo: cov-lineages.org website. https://doi.org/10.5281/zenodo.4640140 11.

This project contains the following underlying data:

  • Website data archived at time of publication

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Extended data

Zenodo: Supplementary materials with group affiliations for Tracking the international spread of SARS-CoV-2 lineages B.1.1.7 and B.1.351/501Y-V2. https://doi.org/10.5281/zenodo.4704471 19.

This project contains the following extended data:

  • Supplementary materials with group authorship affiliations and full acknowledgements.

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Software availability

Acknowledgements

An earlier version of this article can be found on Virological (url: https://virological.org/t/tracking-the-international-spread-of-sars-cov-2-lineages-b-1-1-7-and-b-1-351-501y-v2/592).

We thank Norelle Sherry, Benjamin Howden and Michelle Sait for their contribution to sequencing in Australia. We also include full acknowledgements and details of group authorships at https://doi.org/10.5281/zenodo.4704471 19. We would also like to extend our gratitude to everyone involved in the global sequencing effort.

Funding Statement

I.I.B. is supported by the Canadian Institutes of Health Research, COVID-19 Rapid Research Funding Opportunity (02179-000). K.K. is the founder of BlueDot, a social enterprise that develops digital technologies for public health. K.K., A.W., A.T.B. and C.H. are employed at BlueDot. I.I.B. has consulted for BlueDot. T.d.O. and the NGS-SA are funded by the South African Medical Research Council (SAMRC), MRC SHIP and the Department of Science and Innovation (DSI) of South Africa. N.R.F. acknowledges support from a Wellcome Trust and Royal Society Sir Henry Dale Fellowship (204311/Z/16/Z) and a Medical Research Council-São Paulo Research Foundation CADDE partnership award (MR/S0195/1 and FAPESP 18/14389-0). V.H. was supported by the Biotechnology and Biological Sciences Research Council (BBSRC) [grant number BB/M010996/1]. M.U.G.K. acknowledges support from the Branco Weiss Fellowship and EU grant 874850 MOOD. The contents of this publication are the sole responsibility of the authors and do not necessarily reflect the views of the European Commission. O.G.P., J.P.M. and M.U.G.K. acknowledge support from the Oxford Martin School. A.R. acknowledges the support of the Wellcome Trust (Collaborators Award 206298/Z/17/Z – ARTIC network) and the European Research Council (grant agreement no. 725422 – ReservoirDOCS). Á.O'T. is supported by the Wellcome Trust Hosts, Pathogens & Global Health Programme [grant number: grant.203783/Z/16/Z] and Fast Grants [award number: 2236]. COG-UK is supported by funding from the Medical Research Council (MRC) part of UK Research & Innovation (UKRI), the National Institute of Health Research (NIHR) and Genome Research Limited, operating as the Wellcome Sanger Institute. T.F.S. acknowledges support from the Deutsche Forschungsgemeinschaft (SFB900, EXC2155 RESIST). SeqCOVID-SPAIN is supported by a grant from the Instituto de Salud Carlos III COV0020/00140.

[version 2; peer review: 3 approved]

References

  • 1. COVID-19 Genomics UK (COG-UK) consortium: An integrated national scale SARS-CoV-2 genomic surveillance network. Lancet Microbe. 2020;1(3):e99–100. 10.1016/S2666-5247(20)30054-9
  • 2. Rambaut A, Loman N, Pybus O, et al.: Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations. 2020; published online Dec 18 (accessed Jan 8, 2021).
  • 3. Volz E, Mishra S, Chand M, et al.: Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: Insights from linking epidemiological and genetic data. bioRxiv. 2021. 10.1101/2020.12.30.20249034
  • 4. Msomi N, Mlisana K, de Oliveira T: A genomics network established to respond rapidly to public health threats in South Africa. Lancet Microbe. 2020;1(6):e229–30. 10.1016/S2666-5247(20)30116-6
  • 5. Tegally H, Wilkinson E, Giovanetti M, et al.: Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa. bioRxiv. 2020. 10.1101/2020.12.21.20248640
  • 6. du Plessis L, McCrone JT, Zarebski AE, et al.: Establishment and lineage dynamics of the SARS-CoV-2 epidemic in the UK. Science. 2021;371(6530):708–712. 10.1126/science.abf2946
  • 7. Lu J, du Plessis L, Liu Z, et al.: Genomic Epidemiology of SARS-CoV-2 in Guangdong Province, China. Cell. 2020;181(5):997–1003.e9. 10.1016/j.cell.2020.04.023
  • 8. Elbe S, Buckland-Merrett G: Data, disease and diplomacy: GISAID’s innovative contribution to global health. Glob Chall. 2017;1(1):33–46. 10.1002/gch2.1018
  • 9. O'Toole A: Accession IDs included in publication Tracking the international spread of SARS-CoV-2 lineages B.1.1.7 and B.1.351/501Y-V2 [Data set]. Zenodo. 2021. 10.5281/zenodo.4642401
  • 10. Rambaut A, Holmes EC, O’Toole Á, et al.: A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020;5(11):1403–7. 10.1038/s41564-020-0770-5
  • 11. O'Toole A: cov-lineages.org website. Zenodo. 2021. 10.5281/zenodo.4640140
  • 12. O'Toole A, Hill V: grinch. Zenodo. 2021. 10.5281/zenodo.4640037
  • 13. Mölder F, Jablonski KP, Letcher B, et al.: Sustainable data analysis with Snakemake [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Res. 2021;10:33. 10.12688/f1000research.29032.1
  • 14. Reznick L: Using cron and crontab. Sys Admin. 1993;2(4):29–32.
  • 15. O'Toole A: grinch_config.yaml [Data set]. Zenodo. 2021. 10.5281/zenodo.4640379
  • 16. Bal A, Destras G, Gaymard A, et al.: Two-step strategy for the identification of SARS-CoV-2 variants co-occurring with spike deletion H69-V70, Lyon, France, August to December 2020. bioRxiv. 2020. 10.1101/2020.11.10.20228528
  • 17. Hasell J, Mathieu E, Beltekian D, et al.: A cross-country database of COVID-19 testing. Sci Data. 2020;7(1):345. 10.1038/s41597-020-00688-8
  • 18. Dong E, Du H, Gardner L: An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis. 2020;20(5):533–4. 10.1016/S1473-3099(20)30120-1
  • 19. O'Toole A: Supplementary materials with group affiliations for Tracking the international spread of SARS-CoV-2 lineages B.1.1.7 and B.1.351/501Y-V2. Zenodo. 2021. 10.5281/zenodo.4704471
Wellcome Open Res. 2021 Jun 10. doi: 10.21956/wellcomeopenres.18372.r43967

Reviewer response for version 1

George Githinji 1

The article by O’Toole et al. (2021) describes a bioinformatics tool for the analysis of SARS-CoV-2 sequence data. The article is concise, and the relevant details have been considered. For example, the software and source code are available and well documented. The tool has shown great utility in public health based on its application in tracking and describing two SARS-CoV-2 variants of global concern.

Some minor comments below:

  1. The transition from an article of public health importance to a software tool is abrupt. I think a paragraph or a link aimed at orientating the audience would be useful.

  2. It would be useful to outline the special niche that the tool occupies or the gaps it fills relative to similar utilities and webpages such as covariants.org and outbreak.info.

  3. The readme file at https://github.com/cov-lineages/grinch lacks full installation documentation. An introductory paragraph of the tool and its utility would also be useful. The scripts directory could be better organised by separating the snakemake files from the regular python files. I would imagine a workflows dir and a scripts dir.

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Is the rationale for developing the new software tool clearly explained?

Yes

Is the description of the software tool technically sound?

Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes

Reviewer Expertise:

bioinformatics, molecular epidemiology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Wellcome Open Res. 2021 Sep 10.
Áine Niamh O'Toole 1

Thank you for your review and time, and apologies for taking so long to respond!

The transition from an article of public health importance to a software tool is abrupt. I think a paragraph or a link aimed at orientating the audience would be useful.

We have added in a paragraph that bridges the public health information and the resource information. 

It would be useful to outline the special niche that the tool occupies or the gaps it fills relative to similar utilities and webpages such as covariants.org and outbreak.info.

The linking paragraph also discusses other resources such as outbreak.info and covariants.org. We now also provide a non-exhaustive list of several other resources on the cov-lineages.org website.

The readme file at https://github.com/cov-lineages/grinch lacks full installation documentation. An introductory paragraph of the tool and its utility would also be useful. The scripts directory could be better organised by separating the snakemake files from the regular python files. I would imagine a workflows dir and a scripts dir.

We have updated the readme on the grinch repository and updated usage in that it can be run in full analysis pipeline mode or in report only mode. We have also added in a brief description of the grinch setup and where the macro data is hosted.

Wellcome Open Res. 2021 Jun 3. doi: 10.21956/wellcomeopenres.18372.r43964

Reviewer response for version 1

Anderson F Brito 1

Thank you for developing these tools for daily tracking of SARS-CoV-2 lineages.

Some comments about the tool and its functionalities:

  • In the manuscript, the authors mention that users with access to GISAID direct download could run this pipeline locally, and generate their own reports. Can this tool be adapted to display genomic results for tracking national spread, or even state-level spread of lineages?

  • If this pipeline is intended to be constantly executed locally by users, it would be helpful to provide more information about how to install and run the pipeline, including reference to example input and output files.  I have tried to run the pipeline using my GISAID data provision credentials, but that was not successful, as I ran into errors for which I could not find a solution online (GitHub and Zenodo).

  • About the online reports, increasing the font size in the plots being displayed (bars, curves, etc.) would make labels and legends more intelligible and improve the readability of their content.

  • About the flight data, why are only flight counts from October shown? Are these data only used for tracking the potential spread in early stages of viral emergence, or do you see other uses for such data?

Concerning the manuscript, a few minor points:

  • The colour gradient in the legend of Figure 1 is incomplete and does not go from 1 to 76. I think it must be just a formatting issue.

  • How were the "reported" cases shown in Figures 1 and 2 detected? By differential PCR? I know that applies to B.1.1.7, but what about B.1.351?

  • The legend in Figure 2 refers to "B.1.1.7" sequences, while the figure shows "B.1.351" sequences. It must be a typo.

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Is the rationale for developing the new software tool clearly explained?

Yes

Is the description of the software tool technically sound?

Partly

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Partly

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Partly

Reviewer Expertise:

Virology, Bioinformatics, Evolution, Epidemiology.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Wellcome Open Res. 2021 Sep 10.
Áine Niamh O'Toole 1

  • In the manuscript, the authors mention that users with access to GISAID direct download could run this pipeline locally, and generate their own reports. Can this tool be adapted to display genomic results for tracking national spread, or even state-level spread of lineages?

    We’ve added in an option to grinch with the --analysis flag, that can either run the whole GISAID-processing pipeline, or just generate a set of reports from a metadata file. It currently would need resolution to remain at the country level, but we like the idea of doing a more local report and will work towards implementing something like this in the future. 

  • If this pipeline is intended to be constantly executed locally by users, it would be helpful to provide more information about how to install and run the pipeline, including reference to example input and output files.  I have tried to run the pipeline using my GISAID data provision credentials, but that was not successful, as I ran into errors for which I could not find a solution online (GitHub and Zenodo).

    We have supplied some more information on the GitHub repository on how to install and run the pipeline; however, the tool isn’t necessarily intended for users to run locally themselves, as we process the data on our end and share the macro count data on GitHub.

  • About the online reports, increasing the font size in the plots being displayed (bars, curves, etc.) would make labels and legends more intelligible and improve the readability of their content.

    We have made the axes longer to account for more countries; however, we are working towards re-implementing these reports in javascript so they can be interactive and more responsive to the browser size.

  • About the flight data, why are only flight counts from October shown? Are these data only used for tracking the potential spread in early stages of viral emergence, or do you see other uses for such data?

    October related to the date of early spread of both lineages described in the text and was due to limitations of access to data. We hope to continue to develop this resource and supply more recent dates that track over time. 

 

Concerning the manuscript, a few minor points:

  • The colour gradient in the legend of Figure 1 is incomplete and does not go from 1 to 76. I think it must be just a formatting issue.

    This was a proofs issue and we’ve hopefully rectified this now.

  • How were the "reported" cases shown in Figures 1 and 2 detected? By differential PCR? I know that applies to B.1.1.7, but what about B.1.351?

    This was the set of manually curated media reports that we were tracking at the time.

  • The legend in Figure 2 refers to "B.1.1.7" sequences, while the figure shows "B.1.351" sequences. It must be a typo.

    Thank you, we have now fixed this in the figure.

Wellcome Open Res. 2021 Sep 10.
Áine Niamh O'Toole 1

Also, thank you for your time reviewing this manuscript (we do really appreciate it!) and apologies for taking so long to respond.

Wellcome Open Res. 2021 May 27. doi: 10.21956/wellcomeopenres.18372.r43966

Reviewer response for version 1

Rob Lanfear 1

This article describes a software tool, grinch, that can be used to produce automated reports on SARS-CoV-2 lineages. The authors apply it to two lineages of concern in the article, and also highlight that the main utility of grinch is not in static one-off reports, but in regularly updated reports available at https://cov-lineages.org/global_report.html.

The paper clearly describes the software and demonstrates its utility. I’d like to commend the authors for putting this tool and the associated website together so quickly, for maintaining both to a very high standard, for making sure that all of the work is open and reproducible, and for the huge amount of work and enormous collaborative effort that has gone into this clear and concise report.

I have no serious reservations about the software tool or the data, analyses, or conclusions presented in the manuscript. The software is clear, open-source, sufficiently documented, and almost all of the proposed utility is presented on a clear and regularly updated website. The manuscript is clearly written, well researched, concise, and the conclusions are well justified by the analyses.

Of course, I do have a few comments, some of which I hope might be useful in improving the paper and/or the website.

Minor comments on the manuscript:

  1. I felt there was some tension in this article about whether it’s a software note or a public health report. The title suggests the latter, but much of the article (and the article type of “Software Tool Article”) suggests the former. Most of this tension for me as a reader came from looking at the title, which has no mention of software, so I think sets up expectations that differ from what is then provided (quite reasonably) in the paper. A very simple way to address this would be to start the title with “Using grinch to track…” or to end it with “… using grinch”.

  2. Similar to point 1, the abstract doesn’t actually mention ‘grinch’ or https://cov-lineages.org/global_report.html. It would seem clearer to me to incorporate in the abstract the framing that this article presents a generally applicable software tool, demonstrated on two lineages of concern.

  3.  I would like to see some mention of related efforts somewhere in the report. A full detailed comparison is neither warranted nor useful here because all such websites can and should change regularly, but a couple of sentences comparing cov-lineages.org to sites like outbreak.info and covariants.org would be very useful. At a minimum, it seems useful to list the similar sites the authors are aware of, if only because the fact one can see similar patterns presented on those sites serves as a useful validation of the software presented in this paper.

  4. Given the situation, this is a desirable, not a requirement, but I’d love to see some unit tests on the GitHub repo. It seems potentially important to have this when the intention is to produce daily updates for public health. (Though I note that getting the same end result from completely independent implementations on other sites is probably worth more than a lot of unit tests).

  5. I struggled with Figure 1D. It wasn’t clear to me what ‘reported’ and ‘not reported’ mean. And the legend makes it really hard to figure out how colours map to counts.  

  6. It’s stated that there is no correlation between the numbers of sequences and flight numbers. It would be nice to see the scatter plot for this (maybe as an inset to figure 1D?), as well as the effect size and p-value of a suitable model.

  7. Following from the previous point, the explanation for the lack of a correlation with absolute numbers seems reasonable. But it still seems to me that flight numbers could correlate with the frequency of B117 at a fixed time interval from the first detected case in a given locality (thus somewhat factoring out sequencing effort in the locality). Is it possible to add this analysis?

  8. Please add installation instructions to the GitHub repo

Minor comments on https://cov-lineages.org/global_report.html:

  1. Figure 3 for each lineage is a map of sequence counts by region. I find the legend here completely baffling. All it states is grey=No variant (that makes sense), pink = 1 sequence (that makes sense too), and purple = ‘Max sequences’. I have no idea what to make of this. How many is ‘Max’, and how am I supposed to quantitatively interpret intermediate colours to pink and purple? It’s so obvious I’m certain there are good reasons why this isn’t already done, but it does seem like a continuous colour scale is what should be used here. Similar to the scale in Figure 2 (grey for no data, shades of green nicely spaced and annotated for different values of a continuous variable).

  2. For the widespread lineages like B.1.1.7, there’s a lot of overplotting on Figures 4 and 5, which make the counts and the country names very difficult to read. This could be addressed by just making the figures larger.

  3. The table of links to news reports is absolutely wonderful. Would it be possible to include a button here to allow users to suggest additional news links? (I assume there’s an existing mechanism for doing this, but I couldn’t find one, so if not maybe just a link to a github issue with (potentially) a pre-filled title and required information would help?)

Really minor comments about the manuscript:

  1. The first use of IATA (first para of the methods) is missing “International”, i.e. it says “Using Air Transport…”.

  2. The second use of IATA (second para of results) does not need to be spelled out.

  3. Figure 1A seems like it is missing a second Y axis for the number of GISAID genomes reported.

  4. In the PDF version and the HTML version it seems that new lines were added wherever there was a ‘>’, e.g. ‘>5,000 travellers in October’ and ‘>300 travellers from South Africa’ both (erroneously?) start on new lines.

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Is the rationale for developing the new software tool clearly explained?

Yes

Is the description of the software tool technically sound?

Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes

Reviewer Expertise:

Phylogenetics, molecular evolution, bioinformatics. I have a passing familiarity with SARS-CoV-2 data analysis.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Wellcome Open Res. 2021 Sep 10.
Áine Niamh O'Toole 1

Thank you so much for your detailed review, and our apologies for not responding sooner. All your comments were well justified and fair. We have responded to your comments and concerns below and believe we have addressed them.

I felt there was some tension in this article about whether it’s a software note or a public health report. The title suggests the latter, but much of the article (and the article type of “Software Tool Article”) suggests the former. Most of this tension for me as a reader came from looking at the title, which has no mention of software, so I think sets up expectations that differ from what is then provided (quite reasonably) in the paper. A very simple way to address this would be to start the title with “Using grinch to track…” or to end it with “… using grinch”.

I completely agree, it’s a very fair point! We had originally submitted the article as a public health report, but this didn’t fit with the Wellcome Open Research journal remit, so we resubmitted it under Software Tool. I have changed the title as suggested to end with “using grinch”.

Similar to point 1, the abstract doesn’t actually mention ‘grinch’ or https://cov-lineages.org/global_report.html. It would seem clearer to me to incorporate in the abstract the framing that this article presents a generally applicable software tool, demonstrated on two lineages of concern.

Our abstract and introduction now both contain references to grinch and the reports at cov-lineages.org.

I would like to see some mention of related efforts somewhere in the report. A full detailed comparison is neither warranted nor useful here because all such websites can and should change regularly, but a couple of sentences comparing cov-lineages.org to sites like outbreak.info and covariants.org would be very useful. At a minimum, it seems useful to list the similar sites the authors are aware of, if only because the fact one can see similar patterns presented on those sites serves as a useful validation of the software presented in this paper.

We have added in a short paragraph about these resources and a link to a more extensive (but definitely non-exhaustive) list of resources.

Given the situation, this is a desirable, not a requirement, but I’d love to see some unit tests on the GitHub repo. It seems potentially important to have this when the intention is to produce daily updates for public health. (Though I note that getting the same end result from completely independent implementations on other sites is probably worth more than a lot of unit tests).

Since publication, we’ve re-worked the back-end analysis pipeline, and the GISAID data is now processed with the datapipe pipeline written by Rachel Colquhoun (https://github.com/COG-UK/datapipe). The reports and webpages are still generated with grinch; however, the main data processing steps are now done with the robust datapipe pipeline.
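As a rough illustration of the kind of unit test the reviewer has in mind, the sketch below checks a small per-country counting step against a hand-written metadata table. The helper `count_sequences_by_country` and the column names `country` and `lineage` are hypothetical and chosen only for this example; they are not taken from grinch or datapipe.

```python
import csv
import io
from collections import Counter


def count_sequences_by_country(metadata_csv, lineage):
    """Count rows per country for one lineage in a CSV-formatted string.

    Hypothetical helper: the 'country' and 'lineage' column names are
    assumptions for illustration, not the schema used by grinch.
    """
    reader = csv.DictReader(io.StringIO(metadata_csv))
    return Counter(row["country"] for row in reader if row["lineage"] == lineage)


# Tiny invented metadata table used only by the tests below.
METADATA = (
    "sequence_name,country,lineage\n"
    "seq1,United Kingdom,B.1.1.7\n"
    "seq2,United Kingdom,B.1.1.7\n"
    "seq3,South Africa,B.1.351\n"
    "seq4,Denmark,B.1.1.7\n"
)


def test_counts_only_requested_lineage():
    counts = count_sequences_by_country(METADATA, "B.1.1.7")
    assert dict(counts) == {"United Kingdom": 2, "Denmark": 1}


def test_unknown_lineage_gives_empty_counts():
    counts = count_sequences_by_country(METADATA, "A.1")
    assert dict(counts) == {}
```

Tests written this way can be collected by a runner such as pytest, so they could run automatically before each scheduled report update.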

It’s stated that there is no correlation between the numbers of sequences and flight numbers. It would be nice to see the scatter plot for this (maybe as an inset to figure 1D?), as well as the effect size and p-value of a suitable model. Following from the previous point, the explanation for the lack of a correlation with absolute numbers seems reasonable. But it still seems to me that flight numbers could correlate with the frequency of B117 at a fixed time interval from the first detected case in a given locality (thus somewhat factoring out sequencing effort in the locality). Is it possible to add this analysis?

We have amended the text to state that there is no clear relationship, rather than no correlation.
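For readers who want to try the analysis the reviewer suggests, one possible approach is a Spearman rank correlation between traveller numbers and sequence counts, with a permutation test for the p-value. The sketch below is only an illustration under that assumption; it is not the analysis used in the article, and the function names are hypothetical.

```python
import random


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_permutations=10_000, seed=0):
    """Two-sided p-value for rho, obtained by shuffling y relative to x."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Invented numbers for illustration only; NOT data from the article.
travellers = [5200, 310, 1200, 80, 2500, 640]
sequences = [12, 3, 40, 1, 7, 9]
rho = spearman(travellers, sequences)
p_value = permutation_p_value(travellers, sequences, n_permutations=2000)
print(f"rho = {rho:.2f}, p = {p_value:.3f}")
```

A rank-based measure is used here because sequence counts are heavily skewed by differences in sequencing effort between countries, which a linear correlation would handle poorly.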

Please add installation instructions to the GitHub repo.

We have added updated usage and install instructions, and a description of the behaviour to the repository, at https://github.com/cov-lineages/grinch/blob/main/README.md

Minor comments on https://cov-lineages.org/global_report.html:

  1. Figure 3 for each lineage is a map of sequence counts by region. I find the legend here completely baffling. All it states is grey=No variant (that makes sense), pink = 1 sequence (that makes sense too), and purple = ‘Max sequences’. I have no idea what to make of this. How many is ‘Max’, and how am I supposed to quantitatively interpret intermediate colours to pink and purple? It’s so obvious I’m certain there are good reasons why this isn’t already done, but it does seem like a continuous colour scale is what should be used here. Similar to the scale in Figure 2 (grey for no data, shades of green nicely spaced and annotated for different values of a continuous variable).

    We have since amended the legend for the report. 

  2. For the widespread lineages like B.1.1.7, there’s a lot of overplotting on Figures 4 and 5, which make the counts and the country names very difficult to read. This could be addressed by just making the figures larger.

    We have made these figures larger, but recognise that these plots may be better displayed by having a top 20 country barchart (like in https://cov-lineages.org/lineage.html?lineage=B.1.1.7) and intend to adopt an interactive, more attractive global report in the future.

  3. The table of links to news reports is absolutely wonderful. Would it be possible to include a button here to allow users to suggest additional news links? (I assume there’s an existing mechanism for doing this, but I couldn’t find one, so if not maybe just a link to a github issue with (potentially) a pre-filled title and required information would help?)

    We really like this idea and will work towards implementing it!
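The pre-filled issue link the reviewer describes could be built along these lines. GitHub accepts `title`, `body` and `labels` query parameters on a repository's `issues/new` page; the function name, the choice of repository, the `news-report` label and the body template below are all assumptions made for this sketch.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def news_link_issue_url(lineage, country, repo="cov-lineages/grinch"):
    """Build a GitHub 'new issue' URL with a pre-filled title and body.

    The repository, label and body template are assumptions for
    illustration; the label would need to exist in the target repository.
    """
    params = {
        "title": f"News report suggestion: {lineage} in {country}",
        "body": "Link to report:\nDate of report:\nCountry:\nLineage:\n",
        "labels": "news-report",
    }
    return f"https://github.com/{repo}/issues/new?{urlencode(params)}"


# Round-trip check: the query string decodes back to the intended title.
url = news_link_issue_url("B.1.1.7", "Denmark")
query = parse_qs(urlsplit(url).query)
print(query["title"][0])
```

Such a URL could sit behind a button next to the table of news reports, so that a suggestion arrives with the required fields already laid out.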

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Availability Statement

    Underlying data

    Zenodo: Accession IDs included in publication Tracking the international spread of SARS-CoV-2 lineages B.1.1.7 and B.1.351/501Y-V2. https://doi.org/10.5281/zenodo.4642401 9.

    This project contains the following underlying data:

    Zenodo: cov-lineages.org website. https://doi.org/10.5281/zenodo.4640140 11.

    This project contains the following underlying data:

    • Website data archived at time of publication

    Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

    Extended data

    Zenodo: Supplementary materials with group affiliations for Tracking the international spread of SARS-CoV-2 lineages B.1.1.7 and B.1.351/501Y-V2. https://doi.org/10.5281/zenodo.4704471 19.

    This project contains the following extended data:

    • Supplementary materials with group authorship affiliations and full acknowledgements.

    Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).


    Articles from Wellcome Open Research are provided here courtesy of The Wellcome Trust
