Abstract
There is considerable public and scientific interest in the origin, spread, and evolution of SARS-CoV-2. Lu et al. recently conducted genomic sequencing and analysis of SARS-CoV-2 in Guangdong, revealing its early transmission out of Hubei and shedding light on the effectiveness of controlling local transmission chains.
The outbreak of SARS-CoV-2 (the cause of the disease known as COVID-19) was first reported in Wuhan city (Hubei province), China, in December 2019, and swept across the nation and then around the world within 2 months, leading to its declaration as a pandemic by the World Health Organization (WHO) on 11 March 2020. As of 20 May 2020, the worldwide COVID-19 death toll has reached 318 789, and nearly 5 million confirmed cases have been reported across all continents [1]. The enormous health and economic impacts caused by this virus, at such an unprecedented speed, have raised considerable public interest, fueling research into its origin, spread, and evolution. Analysis of genomic data is a key tool for studying emerging pathogens [2]. The recent identification and genomic characterization of bat and pangolin coronaviruses, which are evolutionarily related to human SARS-CoV-2, have suggested that these species could act as the zoonotic origins of the human virus [3,4]. Other pressing questions for understanding the ongoing pandemic include how SARS-CoV-2 has spread across China and the world from its point of origin, and whether disease control measures effectively prevent introduced infections from seeding further transmission.
Several studies using mathematical modeling of COVID-19 and other coronavirus transmission have suggested that measures such as social distancing and city lockdown may be effective methods of disease control [5,6], but these seldom assess individual transmission chains. Lu et al. recently sequenced SARS-CoV-2 genomes from 53 patients in Guangdong, the province of China that reported the highest number of cases (n = 1388) outside Hubei [7]. Phylogenetic analysis of these and other genomes of SARS-CoV-2 worldwide revealed that the Guangdong virus strains are scattered among strains from other Chinese provinces and other countries, consistent with the epidemiological finding that the majority of infections detected in Guangdong were imported cases. Twenty-three Guangdong virus strains formed five statistically well-supported phylogenetic clusters, and this might indicate some local transmission following the introductions of these virus lineages. Notably, the observed durations of these putative local transmission chains largely overlap with the period from mid-January to late February in which most Guangdong local infections defined by the patients’ travel histories were reported. These genetic and epidemiologic findings consistently suggest that control measures successfully reduced local transmission in Guangdong after February.
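The idea of reading putative local transmission chains off a phylogeny can be illustrated with a minimal sketch. It is not the procedure used by Lu et al. [7]: the toy tree, the tip names, the support values, and the thresholds `min_support` and `min_size` are all hypothetical, chosen only to show how well-supported clades made up entirely of genomes from one region might be extracted.

```python
# Toy phylogeny: an internal node is (support, [children]); a tip is "Region/id".
# All names and support values below are invented for illustration.
toy_tree = (1.0, [
    (0.95, ["Guangdong/A", "Guangdong/B", "Guangdong/C"]),
    (0.60, ["Guangdong/D", "Guangdong/E"]),            # poorly supported clade
    (0.90, ["Hubei/F", (0.85, ["Guangdong/G", "Guangdong/H"])]),
    "Zhejiang/I",
])


def tips(node):
    """Return all tip names below a node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [name for child in children for name in tips(child)]


def local_clusters(node, region="Guangdong", min_support=0.8, min_size=2):
    """Return maximal, well-supported clades whose tips all come from `region`.

    The support and size thresholds are assumptions, not values from the paper.
    """
    if isinstance(node, str):
        return []
    support, children = node
    names = tips(node)
    same_region = all(name.startswith(region + "/") for name in names)
    if same_region and support >= min_support and len(names) >= min_size:
        return [sorted(names)]
    # Otherwise look for qualifying clades deeper in the tree.
    return [cluster
            for child in children
            for cluster in local_clusters(child, region, min_support, min_size)]


clusters = local_clusters(toy_tree)
for cluster in clusters:
    print(cluster)
```

In this toy tree, the clade containing D and E is excluded because its support falls below the assumed threshold, and the clade containing the Hubei genome is skipped in favour of its Guangdong-only subclade. A monophyletic regional cluster is only suggestive of local transmission; as the text notes, it "might indicate" it, and sparse sampling elsewhere can create the same pattern.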
Genomic information about pathogens provides valuable empirical information about their transmission histories [1], such as the identification of transmission chains through phylogenetic analysis of genome sequences, as illustrated by the work of Lu et al. [7]. In addition to revealing pathogen evolutionary processes and acting as a proxy of disease transmission history, phylogenetic trees also serve as versatile frameworks for comparative analysis of virus genetics and phenotypes, disease epidemiology, clinical manifestations, and population demography and environments, thus facilitating the identification of possible interplay between these various aspects of disease dynamics (Figure 1). There has been active research into methods for integrating multidimensional data related to pathogens for statistical inference, especially in a Bayesian framework [8]. It is anticipated that such integrative analysis of SARS-CoV-2, including the work by Lu et al. [7], as well as a more recent study by Dellicour et al. in Belgium [9], will greatly promote our understanding of virus dynamics and interactions with hosts and environments, and pave the way to genomic surveillance studies of SARS-CoV-2 in the near future.
Figure 1.
Illustration of Comparative Analysis of Different Types of Data Related to Emerging Infections.
The disease transmission chain in an outbreak is depicted in the last layer. The penultimate layer shows the genomic samples of the pathogen from the outbreak which are connected in a phylogenetic tree. A newly emerged lineage is highlighted in green and is characterized by the mutation X in red. Other related data such as virus phenotypes, clinical manifestations, population demography, geography, and environmental features can be mapped to the phylogenetic tree based on the information associated with the genomic samples, as shown in the first to fourth (from top down) layers. Comparison to the genomic sequences in the phylogenetic tree can identify the association with the emerged lineage (green), and some examples of such an interpretation are indicated in the text in the right column. Notably, an association may also be established between two or more layers – for example, the emerged lineage carries mutation X (fifth layer), which increases virus shedding titer and duration of infection (fourth layer), and hence causes higher disease severity (third layer).
Despite the usefulness of genomic data and phylogenetic analysis, the resolution of the SARS-CoV-2 phylogeny in the early phase of the pandemic remains low, possibly because of the relatively slow genetic drift of the virus, as well as differential sampling intensities in different regions. Many lineages are defined by a single mutation, and hence the phylogenetic structure could easily be distorted by sequences that carry a high number of sequencing errors or mutations introduced during passage in virus culture. One important issue is the variable intensity of virus genome sampling in different regions. This is seen in the SARS-CoV-2 dataset used by Lu et al. [7], which has relatively few genomes from outside Guangdong province (e.g., only 32 genomes from Hubei province, which had >60000 cases at the time). Such undersampling in regions with high disease incidence could result in phylogenetic analyses that underestimate virus exportation from these regions to other regions [10] such as Guangdong, and hence the extent of local transmission in Guangdong might be overestimated [7]. Notably, the SARS-CoV-2 genome dataset publicly available at the time of writing remains highly uneven across different countries. For instance, the GISAID (Global Initiative on Sharing All Influenza Data; https://www.gisaid.org) database had only five full genome sequences from Iran, where >120000 cases have been reported; by contrast, it had ~1300 genomes from ~7000 cases reported in Australia. Therefore, any interpretation of phylogenetic analyses involving undersampled regions must be made with caution. These limitations constrain certainty in interpretations obtained using phylogenetics, and a conservative approach is essential when interpreting results from complex phylogenetic models and multidimensional data.
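The unevenness of sampling described above can be made concrete with a short calculation of genomes sequenced per 1000 reported cases. The counts are the approximate figures quoted in this article; the Guangdong row pairs the 53 genomes sequenced by Lu et al. [7] with the 1388 reported cases. Because the Hubei and Iran case counts are given only as lower bounds (>60000 and >120000), the rates for those two regions are upper bounds. The function and variable names are illustrative only.

```python
# (genomes, reported cases) as quoted in the text; the values are approximate.
regions = {
    "Hubei": (32, 60000),       # ">60000" cases, so the rate is an upper bound
    "Guangdong": (53, 1388),
    "Iran": (5, 120000),        # ">120000" cases, so the rate is an upper bound
    "Australia": (1300, 7000),  # "~1300" genomes from "~7000" cases
}


def genomes_per_1000_cases(genomes, cases):
    """Sampling intensity: sequenced genomes per 1000 reported cases."""
    return 1000 * genomes / cases


rates = {name: genomes_per_1000_cases(g, c) for name, (g, c) in regions.items()}

# Print regions from least to most intensively sampled.
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name:10s} {rate:8.2f} genomes per 1000 cases")
```

On these figures the sampling intensity spans more than three orders of magnitude, from roughly 0.04 genomes per 1000 cases in Iran to roughly 186 in Australia, which is why clades involving undersampled regions call for cautious interpretation.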
Many countries are now investing in genomic surveillance of SARS-CoV-2, and data sharing on the GISAID public database has reached 25995 full genomes at unprecedented speed. Although the challenges of low phylogenetic resolution and biased sampling will undoubtedly remain, it is anticipated that future research into these SARS-CoV-2 genomes will take on these challenges with more robust statistical methods and cautious data interpretation, and will potentially provide important insights into SARS-CoV-2 transmission and evolution within different countries and across the world, thereby aiding more effective control of the disease.
References
- 1. World Health Organization. WHO; 2020. Coronavirus Disease (COVID-2019) Situation Report 121.
- 2. Pybus O.G., Rambaut A. Evolutionary analysis of the dynamics of viral infectious disease. Nat. Rev. Genet. 2009;10:540–550. doi: 10.1038/nrg2583.
- 3. Zhou P. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579:270–273. doi: 10.1038/s41586-020-2012-7.
- 4. Lam T.T. Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins. Nature. 2020 doi: 10.1038/s41586-020-2169-0. Published online March 26, 2020.
- 5. Kraemer M.U.G. The effect of human mobility and control measures on the COVID-19 epidemic in China. Science. 2020;368:493–497. doi: 10.1126/science.abb4218.
- 6. Kissler S.M. Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period. Science. 2020;368:860–868. doi: 10.1126/science.abb5793.
- 7. Lu J. Genomic epidemiology of SARS-CoV-2 in Guangdong province, China. Cell. 2020;181 doi: 10.1016/j.cell.2020.04.023.
- 8. Suchard M.A. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol. 2018;4 doi: 10.1093/ve/vey016.
- 9. Dellicour S. A phylodynamic workflow to rapidly gain insights into the dispersal history and dynamics of SARS-CoV-2 lineages. BioRxiv. 2020 doi: 10.1101/2020.05.05.078758. Published online May 9, 2020.
- 10. Grubaugh N.D. Genomic epidemiology reveals multiple introductions of Zika virus into the United States. Nature. 2017;546:401–405. doi: 10.1038/nature22400.

