VirusViz: comparative analysis and effective visualization of viral nucleotide and amino acid variants

Anna Bernasconi; Andrea Gulino; Tommaso Alfonsi; Arif Canakoglu; Pietro Pinoli; Anna Sandionigi; Stefano Ceri

doi:10.1093/nar/gkab478

. 2021 Jun 9;49(15):e90. doi: 10.1093/nar/gkab478

VirusViz: comparative analysis and effective visualization of viral nucleotide and amino acid variants

Anna Bernasconi ^1,², Andrea Gulino ^2,², Tommaso Alfonsi ³, Arif Canakoglu ⁴, Pietro Pinoli ⁵, Anna Sandionigi ⁶, Stefano Ceri ^7,^✉

PMCID: PMC8344854 PMID: 34107016

Abstract

Variant visualization plays an important role in supporting the viral evolution analysis, extremely valuable during the COVID-19 pandemic. VirusViz is a web-based application for comparing variants of selected viral populations and their sub-populations; it is primarily focused on SARS-CoV-2 variants, although the tool also supports other viral species (SARS-CoV, MERS-CoV, Dengue, Ebola). As input, VirusViz imports results of queries extracting variants and metadata from the large database ViruSurf, which integrates information about most SARS-CoV-2 sequences publicly deposited worldwide. Moreover, VirusViz accepts sequences of new viral populations as multi-FASTA files plus corresponding metadata in CSV format; a bioinformatic pipeline builds a suitable input for VirusViz by extracting the nucleotide and amino acid variants. Pages of VirusViz provide metadata summarization, variant descriptions, and variant visualization with rich options for zooming, highlighting variants or regions of interest, and switching from nucleotides to amino acids; sequences can be grouped, groups can be comparatively analyzed. For SARS-CoV-2, we manually collect mutations with known or predicted levels of severity/virulence, as indicated in linked research articles; such critical mutations are reported when observed in sequences. The system includes light-weight project management for downloading, resuming, and merging data analysis sessions. VirusViz is freely available at http://gmql.eu/virusviz/.

Graphical Abstract

INTRODUCTION

With the progress of the COVID-19 pandemic, the SARS-CoV-2 virus is undergoing adaptive evolution that improves its fitness to the host’s environment—the human population (1). Unprecedented interest follows the sequencing of new variants, as scientists want to understand better how fast they spread, if they enhance the virus transmission, if they correlate with heavier or milder forms of the disease, and if they could hamper the effectiveness of currently authorized and future vaccines.

Even if the term ‘variant’ has been used to denote particular strains or lineages, in the following we use it with the traditional meaning of single nucleotide variant or amino acid change with respect to the the reference sequence. Taking advantage of our experience in data integration for human genomics (2,3), we designed the Viral Conceptual Model (4) and used its structure to develop ViruSurf (5) (http://gmql.eu/virusurf/)—a database hosting viral sequences collected from GenBank (6), COG-UK (7) and NMDC (http://nmdc.cn/). This database hosts the largest freely queryable collection of SARS-CoV-2 variants deposited worldwide (as of 11 May 2021, about 714K sequences). The data is frequently updated by an incremental pipeline. ViruSurf allows to inspect and download query results, but does not further support users in data visualization, exploration, and analysis.

VirusViz, hereby presented, complements our database and search system; it is a web-based application to visualize and inspect viral variants (currently supporting SARS-CoV, MERS-CoV, Dengue, and Ebola) over populations and their sub-populations, which can be directly invoked from within the ViruSurf result page, thereby enabling a seamless analysis and visualization of query results.

In addition to database results, VirusViz can be used to analyze given viral populations that are not deposited to a public database; for satisfying this need, we offer the option of directly inputting new viral sequences in FASTA format, with metadata provided within a single CSV file. A bioinformatic pipeline extracts their nucleotide and amino acid variants as well as the impact of each nucleotide variant and the lineage of SARS-CoV-2 sequences. When the pipeline terminates, the new sequences and their variants can be inspected.

Several user-provided inputs and database query results can be explored over time. For saving data analysis projects, we designed a persistent structure in standard JSON format that encodes results from database queries or from the bioinformatic pipeline; the JSON file is progressively enriched during the user’s analysis, e.g. by keeping track of the defined sub-populations or highlighted regions. Thanks to this encoding, data analysis sessions can be suspended, resumed or merged at any time. Such functionality may be exploited for merging results collected at different time points, as well as from private data and public data.

While any variant of a given population can be observed within VirusViz, certain variants – that could have biological relevance – continually emerge from the current literature. Specifically for SARS-CoV-2, we systematically collect and classify these variants within a manually curated knowledge base, providing external links to related articles; they are reported when observed in specific sequences.

The user may also highlight particular regions describing relevant portions of the genome (e.g. the Receptor Binding Domain in the Spike protein) or sets of variants characterizing notable strains (e.g. those denoted as the UK, Brazilian and South African variants, which are getting the press attention at the time of writing).

Related Work

Many search systems for viral sequences and their variants were built since the outbreak of the COVID-19 pandemic (8). Some of these are equipped with engines for processing user-input sequences and annotating them with variants and other features (e.g. CoV-GLUE (9) and 2019nCoVR (10)). A number of resources have focused on interactive visualization regarding SARS-CoV-2, including CoVMT (11), GESS (12), coronApp (13), Pango lineages, CoVariants, hcov19-variants (embedded within the GISAID website), cov.lanl, outbreak.info, and CoV Genetics.

Most of the mentioned resources have a focus only on specific accruing mutations. Table 1 represents a comparison of such resources, considering provided features along four main categories: accepted inputs, annotation services applied on the data, criteria for population design, visualization analysis. To the best of our knowledge, VirusViz is the first completely open access web application supporting variant analysis and visualization for several viral species (not only SARS-CoV-2) for all types of variants (not only amino acid ones), accepting input from both public sequences (selected by users through ViruSurf) and user-provided sequences (processed through a bioinformatic pipeline), integrated with a curated knowledge base of variants of interest (continuously updated). In particular, VirusViz stands out for allowing a comparative inspection of variant distributions.

Table 1.

Comparison of resources for searching and visualizing viral sequences. Inline graphic indicates that the resource presents the full functionality; else, we briefly indicate in which way the functionality is limited. For example – for CoVariants – aa/VoI in ‘amino acid analysis’ means that it is available only for particular amino acid changes and Variants of Interest (or notable variants); country in ‘define populations by metadata’ means that it may be performed using only country as a filter

		CoV-GLUE	2019nCoVR	CoVMT	GESS	coronApp	Pango lineages^a	CoVariants^b	hcov19 variants^c	cov.lanl^d	outbreak info^e	CoV Genetics^f	VirusViz
	Features	(9)	(10)	(11)	(12)	(13)							-
input	process user-input FASTA			-	-		-	-	-		-	-
	process user-input metadata	-	-	-	-	-	-	-	-	-	-	-
	support other viruses	-	-	-	-	-	-	-	-	-	-	-
annotation	driven by nuc/aa/variant	aa	aa	aa	nuc	nuc+aa	var	var	var	aa	aa+var	aa	nuc+aa
	nucleotide analysis						-	-	-		-	-
	amino acid analysis						VoI	aa/VoI	VoI
	co-occurrence analysis	-	-	-		-	-	aa/VoI
	integrate ext. knowledge	-	-	-	-	-	-		-	-		-
design	define populations by metadata	-	country	-	-	-	-	country	-	-	country	country
	define populations by mut/var	-	nuc	-	-	-	VoI	aa/VoI	-	-	aa/VoI	Spike aa
analysis	mutations on genome	-				top 10	-	-	-		-
	mut/var through time	-			-	-	-			-			-
	comparison of distributions	-	-	-	-	-	-		-	-	-	-

Open in a new tab

Some resources do not have a related publication, their links are: ^ahttps://cov-lineages.org/; ^bhttp://covariants.org/; ^chttps://www.gisaid.org/hcov19-variants/; ^dhttps://cov.lanl.gov/; ^ehttps://outbreak.info/; ^fhttps://covidcg.org/.

MATERIALS AND METHODS

VirusViz pipeline for processing user-provided FASTA sequences

The annotation and variant calling used for user-provided sequences is inherited from ViruSurf (5). To assign the appropriate reference alignment, the virus species must be indicated. Each sequence is aligned to the reference genome using the Needleman–Wunsch algorithm with affine gap penalty, and nucleotide variants with associated putative impact are computed using SnpEff (14); coding regions are translated to amino acid sequences and variants are inferred.

More advanced analysis is implemented in VirusViz: we added the assignment of each SARS-CoV-2 sequence to its lineage using Pangolin (15) and we defined the notion of distance between two sequences as the counter of the variants that they do not have in common. Distance is used to identify the closest sequences to a given one, and to compute the heterogeneity score–both for nucleotide and amino acid variants–as the average number of different variants between every pair of sequences in a population.

Project management

The VirusViz web application is stateless; in order to support users in suspending, resuming, and merging data analysis sessions, we designed a light-weight project management. We represent the project’s state using a formatted JSON file; the full structure is shown in Supplementary Figures S1 and S2 and an example instance in Supplementary Figure S3. A JSON file with this format is used for uploading to VirusViz either the result of a ViruSurf query or the output of the VirusViz pipeline.

While a data analysis session takes place, the project state is progressively aligned, thereby supporting a download to the user’s local file system that reflects the session evolution; JSON files can be re-uploaded at a later time. Thanks to this unified format, merge operations between projects can occur by combining two project states into one.

The system treats any virus input seamlessly. The taxonomy identifier of the viral species drives the selection of the reference genome, the regions of interest, and the virus image used below plots.

Architecture

VirusViz is a rich-client web application. The back-end is used for processing user-provided FASTA sequences and extracting variants; it is developed in Python using the Flask micro-framework and exposes a REST API to the web interface.

The front end at the client side provides most of VirusViz functionalities; it is implemented using the JavaScript framework Angular.js and the library Apache ECharts v5 for advanced visualizations. The state of the computation is encapsulated in a JSON structure, it includes sequences, variants, groups, regions and all the other entities present in the interface. The client-side logic for filtering sequences and their variants, computing heterogeneity scores, managing the interactions with the user, and communicating with the back-end is developed in JavaScript and relies on several third-party open source tools and libraries, listed in https://github.com/DEIB-GECO/VirusViz/wiki/External-software. VirusViz is freely available without login requirements.

SARS-CoV-2 knowledge base

Several published papers or preprint submissions claim particular effects of SARS-CoV-2 amino acid variants. We manually collected these resources through a systematic literature search and organized their content using carefully planned orthogonal categories, i.e. lower or higher protein stability, viral transmission, infectivity, disease severity, fatality rate, sensitivity to convalescent sera or monoclonal antibodies, binding affinity to host receptor. Variants are annotated with their effects and linked to the specific resources that define them. A complete discussion of the knowledge base design is outside the scope of this paper (a work in progress paper is available (16)). We just note that, in many cases, the reported effects are contrasting with each other and/or different in their significance, since the linked references may be peer-reviewed publications but also preprints.

RESULTS

The main interactions supported by VirusViz are highlighted in Figure 1. A Landing page (http://gmql.eu/virusviz/) includes several descriptions of example projects (documented in our WIKI, https://github.com/DEIB-GECO/VirusViz/wiki); their selection triggers the upload of the relevant files to VirusViz, leading the user to the Project home page.

From the Landing page users can also invoke external resources that produce data uploads for VirusViz: currently we support ViruSurf, but other databases are under development. The tool allows querying a rich collection of sequences based on their metadata. More advanced features allow to further filter results based on the presence of specific variants, both at the nucleotide and at the amino acid level. The produced result, consisting of a set of sequences—order of thousands—with their metadata and variants, is then used as input to VirusViz.

From the Landing page, users can also reach an Input provisioning page, where sequences can be uploaded. Input is provided as two cross-linked files: (i) a multi-FASTA file containing several sequences, each prefixed with a user-provided identifier and (ii) a CSV file for metadata, which includes user-provided identifiers within the first column and arbitrary metadata describing the sequence in the following columns. On completion of the pipeline, the user is taken to the Project home page; as the execution requires time, an optional email message can be set to notify completion.

The Project home page is the center of the data analysis steps; from here, it is possible to save/resume projects and merge an existing project to the current one. The page is linked through a top navigation bar to six pages:

Group definition, for creating named sub-populations of interest;
Region definition, for creating named regions of interest as (possibly discontinuous) intervals on the nucleotide sequence or on the amino acid sequence of a given protein;
Population report, describing for a given group the metadata about sequences as well as aggregate information about their variants;
Sequence report, describing the properties of each individual sequence;
Variant distribution, for showing bar plots of nucleotide and amino acid variants;
Group comparison, for comparatively showing bar plots of selected groups.

To illustrate VirusViz pages, we employ a running example (that can be opened directly from the VirusViz Landing page); all figures in this section describe specific aspects of the running example. The population has been obtained from ViruSurf by performing a query, executed on 9 February 2021, extracting 2695 SARS-CoV-2 sequences, with associated variants, collected in six US states between 1 November 2020 and 15 January 2021.

Group definition

The Group definition page (see top part of Figure 2) is used for defining sub-populations by means of three kinds of filters, respectively on: (i) metadata attributes (selecting for a given ‘Attribute’ a subset of its ‘Values’), (ii) nucleotide variants (selecting the position and change type), and (iii) amino acid variants (selecting a protein and then the position and change type). Groups must be named, tentatively reflecting their selection criteria; the default group ‘ALL’ includes all the sequences. A table summarizes the properties of groups, including the number of sequences within the group and the filtering conditions used in their definition, which can be copied for creating a new group. The full group content can be downloaded in suitable formats.

Region definition

The Region definition page (see bottom part of Figure 2) is used for collecting the definition of named regions of interest, to be highlighted within VirusViz plots; some of them can be read from a predefined list, others can be added by users by specifying either a set of ranges of nucleotide positions, or a set of ranges of amino acid positions within a specific protein. Added regions are displayed within two tables; these are used in the Variant distribution and Group comparison pages.

Population report

The Population report (see Figure 3) provides a general overview of a given sequence group through the following elements:

Group selector (either the default ‘ALL’ group or a user-defined group): via a drop-down menu, the user can choose one group of interest and the whole content of the page is correspondingly updated.
Pie charts for categorical metadata (with values and corresponding percentage within the selected group) and bar plots for numerical or dates metadata.
Heterogeneity score indication, considering both nucleotides and amino acids; scores reflect the distance among variants of the sequences of the group, small values indicate that sequences have similar variants.
Two tables with observed nucleotide and amino acid variants in the group, ordered by descending order of count.
One table with interesting amino acid variants present in the group; these are retrieved from the knowledge base, including their effects and pointers to the descriptive literature resources.

Sequence report

The Sequence report (shown in Figure 4) is organized as a table. Each sequence corresponds to a row, for which we provide a pointer to the associated metadata (including the computed lineage) and counters linked to the lists of nucleotide variants (with corresponding impact annotations), amino acid variants and knowledge base variants. Another link can be clicked to compute a sequence’s closest sequences within the group, based on a distance metrics that counts pairs of different variants. For each sequence, the user may download the corresponding FASTA file (when available) or directly open the Variant distribution page, where variants present in the given sequence are highlighted to distinguish them from the other variants present in the group.

Variant distribution

This page provides the bar plot of the variants of a given group. The horizontal axis represents positions in the nucleotide or amino acid sequences; each bar’s height represents the number of sequences with a mutated nucleotide/amino acid in that position. Two working modes are available, respectively for nucleotides (top part of Figure 5) and for amino acids (bottom part of Figure 5). In the latter case, the user should also select a specific protein. After selection, a click on the ‘Plot’ button produces the graph.

Figure 5. — Top: bar plot representing the distribution of nucleotide variants of the running example. Bars highlighted in red indicate the number of sequences having a C to T mutation. The 3’ UTR of SARS-CoV-2 (bases 29675–29903) is highlighted with a blue background. Bottom: bar plot representing the distribution of amino acid variants in the nucleocapsid (N) protein of the running example. Using the slider at the bottom of the plot, we zoom to a particular range (166–217) of the protein where two variants from the knowledge base are highlighted in red (substitutions at positions 203 and 204); they are known for potentially lowering protein stability and improving viral transmission (21). By hovering on the bar at position 204, we observe a G to R change in 337 sequences, and a G to V change in five sequences.

The specific structure of the virus sequence is shown with a colored image at the bottom of the bar plot, adapted to the specific choice of virus species. A slider can be used for zooming to a preferred coordinate range. By hovering on the plot, the user may visualize the specific position represented by each bar and the precise sequence changes with its corresponding counts.

The user can select different elements to be highlighted in red in the plot: (i) variants belonging to a specific sequence (possibly chosen from the Sequence report table); (ii) specific variants (e.g. all C to T nucleotide substitutions, as shown in the top part of Figure 5); (iii) notable variants (e.g. the 203/204 substitutions in the N protein, as shown in the bottom part of Figure 5). The plot can be adjusted by excluding the most popular variants; this adjustment is advised to enhance bar visibility of less frequent variants when mixed with frequent but not relevant variants. One region, selected by means of a dedicated filter, can be added to the plot (e.g. the 3’ UTR region in the top part of Figure 5).

Group comparison

As shown in Figure 6, this page displays a comparison of the variant frequency in multiple sub-populations, by presenting the corresponding bar plots in a stack. Similarly to the Variant distribution page, it provides two working modes, for nucleotides or amino acids. In addition, two options are available for building tracks, based either on the distinct values of a specific metadata attribute, or on user-defined groups.

Figure 6 shows an example for amino acids focused on the Spike protein, comparing tracks representing different user-defined groups. Many bars are observed in correspondence with the variants of the B.1.1.7 lineage (17), collectively named UK variant and highlighted with thin blue regions.

Several controls allow changing the plots in order to improve their comparative visualization; plots can be deleted, their order can be changed, their size can be normalized and scaled. In addition to filtering variants, as in the previous page, it is also possible to ease the visualization and comparative observations by changing the maximum value represented in the Y-axis of the tracks. The default setting is the number of sequences present in that group; the user may instead set the most present variant count as such maximum value on the track. Both options may be calculated over each single track or globally. Setting the Y-axis maximum value to the globally most present variant allows to compare bars based on their height (representing an absolute value), and it is therefore advised when tracks have similar sample sizes. Visual comparisons may be suggestive of hypotheses that need confirmation by means of accurate data analysis, carefully checking and eliminating any source of bias.

Use case

South African variants in most recent US sequences

Finally, we simulate a scenario in which a user provides novel sequences to be analyzed within VirusViz. We selected from NCBI Virus portal (18) 36 sequences submitted on 16–17 February 2021 from four US States. We use the VirusViz pipeline to process the sequences formatted in a multi-FASTA file, along with their metadata in a CSV file; their processing may take from 30 minutes up to one hour depending on the current load of our servers.

We then compare these new recent sequences against the ones of the running example (submitted up to 29 January). To analyse the two projects together, we merge the current project with the previously saved running example project, whose three groups are displayed in Figure 6 marked as ‘usa-6states-nov20-jan21’; we also build a new group with all the sequences belonging to the user-input batch. In Figure 7, we comparatively inspect the four groups by focusing on a portion of the N protein, where we highlight one of the nine variants pertaining to the B.1.351 lineage (19) (named the 501.V2 or South African variant); even if the sample size of the last group is small, we observe that its presence is increasing over time. Three additional use cases can be found in the Supplementary Material (see Supplementary Figures S6–S12). They provide examples relative to SARS-CoV-2 variants of concern/interest (initially observed in UK, California and New York) based on ViruSurf. This material describes how new variants can be traced since their starting dates.

Figure 7. — Comparison of three groups from the running example and an additional group derived from user-input batch of sequences, submitted to GenBank in February 2021. We observe the T205I change in the N protein; comparatively, bars show an increase of the percentage of the sequences carrying the variant. The figure has been built by zooming 1.5x the default Y-axis options and selecting the X-axis range 180–220 of the N protein.

DISCUSSION

The interest on the spreading of SARS-CoV-2 variants during the winter and spring seasons of 2021 is enormous, as variants are increasing the diffusion of the COVID-19 disease and may hamper the effectiveness of vaccination campaigns. The VirusViz system is a new powerful tool for supporting comparative visual analysis of variants, both of nucleotides and of amino acids, accepting input from both public sequences and user-provided sequences, integrated with a curated knowledge base of variants of interest; public sequences in ViruSurf and variants selected within the knowledge base are continuously updated. VirusViz is designed with a general approach, thus it can show variant distribution for other viral species. Although many other web tools can be used to inspect SARS-CoV-2 variants, VirusViz is the only tool currently supporting user-defined grouping of sequences for the comparative inspection of variant distributions across groups. We are in the process of developing a new data source, called EpiSurf, which integrates public epitopes from IEDB (20) with public amino acid sequences from ViruSurf. The new resource will be used to assess epitope’s conservancy (depending on the absence/presence of mutations), an important property in the design/assessment of vaccines, drugs, and serological assays. This further demonstrates the possibility of easily adding new data sources that use VirusViz for data visualization.

DATA AVAILABILITY

The VirusViz web application is freely available at http://gmql.eu/virusviz/, without login requirements.

Supplementary Material

gkab478_Supplemental_File

Click here for additional data file.^{(3.2MB, pdf)}

ACKNOWLEDGEMENTS

The authors would like to thank Ruba Al Khalaf for her contribution to the development of the knowledge base, all external testers of VirusViz, and Ilaria Capua, Matteo Chiara, Ana Conesa, Luca Ferretti, Alice Fusaro, Susanna Lamers, Stefania Leopardi, Alessio Lorusso, Francesca Mari, Carla Mavian, Graziano Pesole, Alessandra Renieri, Lorenzo Subissi, Stephen Tsui, Limsoon Wong, and Federico Zambelli for sharing their knowledge and experience with us, inspiring the development of ViruSurf and VirusViz.

Contributor Information

Anna Bernasconi, Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy.

Andrea Gulino, Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy.

Tommaso Alfonsi, Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy.

Arif Canakoglu, Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy.

Pietro Pinoli, Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy.

Anna Sandionigi, Quantia Consulting S.r.l., Via Petrarca 20, 22066, Mariano Comense, Como, Italy.

Stefano Ceri, Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

VirusViz is supported by the ERC Advanced Grant [693174] ‘Data-Driven Genomic Computing (GeCo)’; EIT ‘DATA against COVID-19’ Innovation Activity [20663] ‘ViruSurf’. Funding for open access charge: ERC Advanced Grant [693174] 'Data-Driven Genomic Computing (GeCo)'.

Conflict of interest statement. None declared.

REFERENCES

1.Korber B., Fischer W.M., Gnanakaran S., Yoon H., Theiler J., Abfalterer W., Hengartner N., Giorgi E.E., Bhattacharya T., Foley B.et al.. Tracking changes in SARS-CoV-2 spike: evidence that D614G increases infectivity of the COVID-19 virus. Cell. 2020; 182:812–827. [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Bernasconi A., Ceri S., Campi A., Masseroli M., Mayr H., Guizzardi G., Ma H., Pastor O.. Conceptual Modeling for Genomics: Building an Integrated Repository of Open Data. Conceptual Modeling. ER 2017. Lecture Notes in Computer Science. 2017; 10650:Cham: Springer; 325–339. [Google Scholar]
3.Canakoglu A., Bernasconi A., Colombo A., Masseroli M., Ceri S.. GenoSurf: metadata driven semantic search system for integrated genomic datasets. Database. 2019; 2019:baz132. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Bernasconi A., Canakoglu A., Pinoli P., Ceri S.. Empowering virus sequence research through conceptual modeling. International Conference on Conceptual Modeling. 2020; Springer; 388–402. [Google Scholar]
5.Canakoglu A., Pinoli P., Bernasconi A., Alfonsi T., Melidis D.P., Ceri S.. ViruSurf: an integrated database to investigate viral sequences. Nucleic Acids Res. 2021; 49:D817–D824. [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Sayers E.W., Cavanaugh M., Clark K., Ostell J., Pruitt K.D., Karsch-Mizrachi I.. GenBank. Nucleic Acids Res. 2019; 47:D94–D99. [DOI] [PMC free article] [PubMed] [Google Scholar]
7.The COVID-19 Genomics UK (COG-UK) consortium An integrated national scale SARS-CoV-2 genomic surveillance network. Lancet Microbe. 2020; 1:e99–e100. [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Bernasconi A., Canakoglu A., Masseroli M., Pinoli P., Ceri S.. A review on viral data sources and search systems for perspective mitigation of COVID-19. Brief Bioinform. 2021; 22:664–675. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Singer J., Gifford R., Cotten M., Robertson D.. CoV-GLUE: a web application for tracking SARS-CoV-2 genomic variation. 2020; Preprints doi:18 June 2020, preprint: not peer reviewed 10.20944/preprints202006.0225.v1. [DOI]
10.Gong Z., Zhu J.W., Li C.P., Jiang S., Ma L.N., Tang B.X., Zou D., Chen M.L., Sun Y.B., Song S.H.et al.. An online coronavirus analysis platform from the National Genomics Data Center. Zool Res. 2020; 41:705–708. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Alam I., Radovanovic A., Incitti R., Kamau A.A., Alarawi M., Azhar E.I., Gojobori T.. CovMT: an interactive SARS-CoV-2 mutation tracker, with a focus on critical variants. Lancet Infect. Dis. 2021; 21:P602. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Fang S., Li K., Shen J., Liu S., Liu J., Yang L., Hu C.D., Wan J.. GESS: a database of global evaluation of SARS-CoV-2/hCoV-19 sequences. Nucleic Acids Res. 2021; 49:D706–D714. [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Mercatelli D., Triboli L., Fornasari E., Ray F., Giorgi F.M.. Coronapp: a web application to annotate and monitor SARS-CoV-2 mutations. J Med Virol. 2021; 93:3238–3245. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Cingolani P., Platts A., Wang l. E. L., Coon M., Nguyen T., Wang L., Land S.J., Lu X., Ruden D.M.. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012; 6:80–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Rambaut A., Holmes E.C., O’Toole A., Hill V., McCrone J.T., Ruis C., du Plessis L., Pybus O.G.. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020; 5:1403–1407. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Al Khalaf R., Alfonsi T., Ceri S., Bernasconi A., Cherfi S., Perini A., Nurcan S.. CoV2K: a knowledge base of SARS-CoV-2 variant impacts. Research Challenges in Information Science. 2021; Cham: Springer. [Google Scholar]
17.Volz E., Mishra S., Chand M., Barrett J.C., Johnson R., Geidelberg L., Hinsley W.R., Laydon D.J., Dabrera G., O’Toole Á.et al.. Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: Insights from linking epidemiological and genetic data. 2021; medRxiv doi:04 January 2021, preprint: not peer reviewed 10.1101/2020.12.30.20249034. [DOI]
18.Hatcher E.L., Zhdanov S.A., Bao Y., Blinkova O., Nawrocki E.P., Ostapchuck Y., Schäffer A.A., Brister J.R.. Virus variation resource - improved response to emergent viral outbreaks. Nucleic Acids Res. 2017; 45:D482–D490. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Tegally H., Wilkinson E., Giovanetti M., Iranzadeh A., Fonseca V., Giandhari J., Doolabh D., Pillay S., San E.J., Msomi N.et al.. Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa. 2020; medRxiv doi:22 December 2020, preprint: not peer reviewed 10.1101/2020.12.21.20248640. [DOI]
20.Vita R., Mahajan S., Overton J.A., Dhanda S.K., Martini S., Cantrell J.R., Wheeler D.K., Sette A., Peters B.. The immune epitope database (IEDB): 2018 update. Nucleic Acids Res. 2019; 47:D339–D343. [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Wu S., Tian C., Liu P., Guo D., Zheng W., Huang X., Zhang Y., Liu L.. Effects of SARS-CoV-2 mutations on protein structures and intraviral protein-protein interactions. J Med Virol. 2021; 93:2132–2140. [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Wang Z., Schmidt F., Weisblum Y., Muecksch F., Barnes C.O., Finkin S., Schaefer-Babajew D., Cipolla M., Gaebler C., Lieberman J.A.et al.. mRNA vaccine-elicited antibodies to SARS-CoV-2 and circulating variants. Nature. 2021; 592:616–622. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Hoffmann M., Kleine-Weber H., Pöhlmann S.. A multibasic cleavage site in the spike protein of SARS-CoV-2 is essential for infection of human lung cells. Mol Cell. 2020; 78:779–784. [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Zhang W., Davis B.D., Chen S.S., Sincuir Martinez J.M., Plummer J.T., Vail E.. Emergence of a Novel SARS-CoV-2 variant in Southern California. JAMA. 2021; 325:1324–1326. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

gkab478_Supplemental_File

Click here for additional data file.^{(3.2MB, pdf)}

Data Availability Statement

The VirusViz web application is freely available at http://gmql.eu/virusviz/, without login requirements.

[B1] 1.Korber B., Fischer W.M., Gnanakaran S., Yoon H., Theiler J., Abfalterer W., Hengartner N., Giorgi E.E., Bhattacharya T., Foley B.et al.. Tracking changes in SARS-CoV-2 spike: evidence that D614G increases infectivity of the COVID-19 virus. Cell. 2020; 182:812–827. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B2] 2.Bernasconi A., Ceri S., Campi A., Masseroli M., Mayr H., Guizzardi G., Ma H., Pastor O.. Conceptual Modeling for Genomics: Building an Integrated Repository of Open Data. Conceptual Modeling. ER 2017. Lecture Notes in Computer Science. 2017; 10650:Cham: Springer; 325–339. [Google Scholar]

[B3] 3.Canakoglu A., Bernasconi A., Colombo A., Masseroli M., Ceri S.. GenoSurf: metadata driven semantic search system for integrated genomic datasets. Database. 2019; 2019:baz132. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B4] 4.Bernasconi A., Canakoglu A., Pinoli P., Ceri S.. Empowering virus sequence research through conceptual modeling. International Conference on Conceptual Modeling. 2020; Springer; 388–402. [Google Scholar]

[B5] 5.Canakoglu A., Pinoli P., Bernasconi A., Alfonsi T., Melidis D.P., Ceri S.. ViruSurf: an integrated database to investigate viral sequences. Nucleic Acids Res. 2021; 49:D817–D824. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B6] 6.Sayers E.W., Cavanaugh M., Clark K., Ostell J., Pruitt K.D., Karsch-Mizrachi I.. GenBank. Nucleic Acids Res. 2019; 47:D94–D99. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B7] 7.The COVID-19 Genomics UK (COG-UK) consortium An integrated national scale SARS-CoV-2 genomic surveillance network. Lancet Microbe. 2020; 1:e99–e100. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B8] 8.Bernasconi A., Canakoglu A., Masseroli M., Pinoli P., Ceri S.. A review on viral data sources and search systems for perspective mitigation of COVID-19. Brief Bioinform. 2021; 22:664–675. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B9] 9.Singer J., Gifford R., Cotten M., Robertson D.. CoV-GLUE: a web application for tracking SARS-CoV-2 genomic variation. 2020; Preprints doi:18 June 2020, preprint: not peer reviewed 10.20944/preprints202006.0225.v1. [DOI]

[B10] 10.Gong Z., Zhu J.W., Li C.P., Jiang S., Ma L.N., Tang B.X., Zou D., Chen M.L., Sun Y.B., Song S.H.et al.. An online coronavirus analysis platform from the National Genomics Data Center. Zool Res. 2020; 41:705–708. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B11] 11.Alam I., Radovanovic A., Incitti R., Kamau A.A., Alarawi M., Azhar E.I., Gojobori T.. CovMT: an interactive SARS-CoV-2 mutation tracker, with a focus on critical variants. Lancet Infect. Dis. 2021; 21:P602. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B12] 12.Fang S., Li K., Shen J., Liu S., Liu J., Yang L., Hu C.D., Wan J.. GESS: a database of global evaluation of SARS-CoV-2/hCoV-19 sequences. Nucleic Acids Res. 2021; 49:D706–D714. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B13] 13.Mercatelli D., Triboli L., Fornasari E., Ray F., Giorgi F.M.. Coronapp: a web application to annotate and monitor SARS-CoV-2 mutations. J Med Virol. 2021; 93:3238–3245. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B14] 14.Cingolani P., Platts A., Wang l. E. L., Coon M., Nguyen T., Wang L., Land S.J., Lu X., Ruden D.M.. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012; 6:80–92. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B15] 15.Rambaut A., Holmes E.C., O’Toole A., Hill V., McCrone J.T., Ruis C., du Plessis L., Pybus O.G.. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nat Microbiol. 2020; 5:1403–1407. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B16] 16.Al Khalaf R., Alfonsi T., Ceri S., Bernasconi A., Cherfi S., Perini A., Nurcan S.. CoV2K: a knowledge base of SARS-CoV-2 variant impacts. Research Challenges in Information Science. 2021; Cham: Springer. [Google Scholar]

[B17] 17.Volz E., Mishra S., Chand M., Barrett J.C., Johnson R., Geidelberg L., Hinsley W.R., Laydon D.J., Dabrera G., O’Toole Á.et al.. Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: Insights from linking epidemiological and genetic data. 2021; medRxiv doi:04 January 2021, preprint: not peer reviewed 10.1101/2020.12.30.20249034. [DOI]

[B18] 18.Hatcher E.L., Zhdanov S.A., Bao Y., Blinkova O., Nawrocki E.P., Ostapchuck Y., Schäffer A.A., Brister J.R.. Virus variation resource - improved response to emergent viral outbreaks. Nucleic Acids Res. 2017; 45:D482–D490. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B19] 19.Tegally H., Wilkinson E., Giovanetti M., Iranzadeh A., Fonseca V., Giandhari J., Doolabh D., Pillay S., San E.J., Msomi N.et al.. Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa. 2020; medRxiv doi:22 December 2020, preprint: not peer reviewed 10.1101/2020.12.21.20248640. [DOI]

[B20] 20.Vita R., Mahajan S., Overton J.A., Dhanda S.K., Martini S., Cantrell J.R., Wheeler D.K., Sette A., Peters B.. The immune epitope database (IEDB): 2018 update. Nucleic Acids Res. 2019; 47:D339–D343. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B21] 21.Wu S., Tian C., Liu P., Guo D., Zheng W., Huang X., Zhang Y., Liu L.. Effects of SARS-CoV-2 mutations on protein structures and intraviral protein-protein interactions. J Med Virol. 2021; 93:2132–2140. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B22] 22.Wang Z., Schmidt F., Weisblum Y., Muecksch F., Barnes C.O., Finkin S., Schaefer-Babajew D., Cipolla M., Gaebler C., Lieberman J.A.et al.. mRNA vaccine-elicited antibodies to SARS-CoV-2 and circulating variants. Nature. 2021; 592:616–622. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B23] 23.Hoffmann M., Kleine-Weber H., Pöhlmann S.. A multibasic cleavage site in the spike protein of SARS-CoV-2 is essential for infection of human lung cells. Mol Cell. 2020; 78:779–784. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B24] 24.Zhang W., Davis B.D., Chen S.S., Sincuir Martinez J.M., Plummer J.T., Vail E.. Emergence of a Novel SARS-CoV-2 variant in Southern California. JAMA. 2021; 325:1324–1326. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

VirusViz: comparative analysis and effective visualization of viral nucleotide and amino acid variants

Anna Bernasconi

Andrea Gulino

Tommaso Alfonsi

Arif Canakoglu

Pietro Pinoli

Anna Sandionigi

Stefano Ceri

Abstract

Graphical Abstract

Graphical Abstract.

INTRODUCTION

Related Work

Table 1.

MATERIALS AND METHODS

VirusViz pipeline for processing user-provided FASTA sequences

Project management

Architecture

SARS-CoV-2 knowledge base

RESULTS

Figure 1.

Group definition

Figure 2.

Region definition

Population report

Figure 3.

Sequence report

Figure 4.

Variant distribution

Figure 5.

Group comparison

Figure 6.

Use case

South African variants in most recent US sequences

Figure 7.

DISCUSSION

DATA AVAILABILITY

Supplementary Material

ACKNOWLEDGEMENTS

Contributor Information

SUPPLEMENTARY DATA

FUNDING

REFERENCES

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases