The NCBI Comparative Genome Viewer (CGV) is an interactive visualization tool for the analysis of whole-genome eukaryotic alignments

Sanjida H Rangwala; Dmitry V Rudnev; Victor V Ananiev; Dong-Ha Oh; Andrea Asztalos; Barrett Benica; Evgeny A Borodin; Nathan Bouk; Vladislav I Evgeniev; Vamsi K Kodali; Vadim Lotov; Eyal Mozes; Marina V Omelchenko; Sofya Savkina; Ekaterina Sukharnikov; Joël Virothaisakun; Terence D Murphy; Kim D Pruitt; Valerie A Schneider

doi:10.1371/journal.pbio.3002405

. 2024 May 7;22(5):e3002405. doi: 10.1371/journal.pbio.3002405

The NCBI Comparative Genome Viewer (CGV) is an interactive visualization tool for the analysis of whole-genome eukaryotic alignments

Sanjida H Rangwala ^1,^*, Dmitry V Rudnev ¹, Victor V Ananiev ¹, Dong-Ha Oh ¹, Andrea Asztalos ¹, Barrett Benica ¹, Evgeny A Borodin ¹, Nathan Bouk ¹, Vladislav I Evgeniev ¹, Vamsi K Kodali ¹, Vadim Lotov ¹, Eyal Mozes ¹, Marina V Omelchenko ¹, Sofya Savkina ¹, Ekaterina Sukharnikov ¹, Joël Virothaisakun ¹, Terence D Murphy ¹, Kim D Pruitt ¹, Valerie A Schneider ¹

Editor: Andreas Hejnol²

PMCID: PMC11101090 PMID: 38713717

Abstract

We report a new visualization tool for analysis of whole-genome assembly-assembly alignments, the Comparative Genome Viewer (CGV) (https://ncbi.nlm.nih.gov/genome/cgv/). CGV visualizes pairwise same-species and cross-species alignments provided by National Center for Biotechnology Information (NCBI) using assembly alignment algorithms developed by us and others. Researchers can examine large structural differences spanning chromosomes, such as inversions or translocations. Users can also navigate to regions of interest, where they can detect and analyze smaller-scale deletions and rearrangements within specific chromosome or gene regions. RefSeq or user-provided gene annotation is displayed where available. CGV currently provides approximately 800 alignments from over 350 animal, plant, and fungal species. CGV and related NCBI viewers are undergoing active development to further meet needs of the research community in comparative genome visualization.

Commonly used genome browsers only show one genome assembly at a time and cannot show comparisons between multiple genomes. This study develops a new visualization tool called the Comparative Genome Viewer (CGV) that aids in the pairwise comparison of whole-genome eukaryotic assembly-assembly alignments.

Introduction

Comparative genome visualization

Comparative genomics leverages shared evolutionary histories among different species to answer basic biological questions and understand the causes of disease. As sequencing costs have dropped and assembly algorithms have improved, there has been tremendous growth in the number of high-quality genome assemblies available in public archives and the diversity in the organisms they represent. These data now make it possible to use comparative genomics approaches to explore more elements of biology and reveal the need for different types of analysis tools to support this exploration. The NIH Comparative Genomics Resource (CGR) maximizes the impact of eukaryotic research organisms and their genomic data to biomedical research [1]. CGR facilitates reliable comparative genomics analyses for all eukaryotic organisms through community collaboration and a National Center for Biotechnology Information (NCBI) genomics toolkit. As part of CGR, we have created the Comparative Genome Viewer (CGV), a web-based visualization tool to facilitate comparative genomics research.

Graphical visualization of genomic data can illuminate relationships among different data types and highlight differences and anomalies; for example, areas of a genome that are depleted in gene annotation, or have unusually high repeat content, or are more variable between species. Interactive genome browsers have become particularly valuable in recent years in helping biologists navigate large sequence datasets and more easily find genomic locations of interest to their specific research question. These visualizations can display molecular data that can help resolve competing hypotheses and expose patterns that can spur additional research questions.

Linear genome browsers, such as the Genome Data Viewer (GDV) at NCBI [2], the UCSC Genome Browser [3], Integrative Genomics Viewer (IGV) [4], and JBrowse [5], visualize gene annotation, sequence variation, and other types of molecular data as “tracks” laid out in parallel and anchored on a single genome assembly. These browsers can display comparative genome data such as genome sequence alignments (e.g., UCSC’s chain/net or NCBI assembly alignments), which can provide an indication of conservation in a discrete genomic region. However, linear genome browser tracks cannot show as easily whether a genome region has been rearranged (e.g., translocated to different chromosomes) in one genome relative to another one. Genome translocations and other types of breaks in syntenic blocks are more easily communicated using 2D viewers that can provide a whole-genome comparison of alignments of one genome relative to another.

Different types of 2D visualizations have been proposed to facilitate analysis of larger scale genome structural differences between 2 or more genomes. These visuals include 2D line graphs (also known as dotplots) [6,7], circular diagrams (i.e., Circos plots) [8], vertical genome maps (e.g., Ensembl’s synteny view [9]), and linear genome browsers that stack one assembly on top of another [10–12]. Different types of visuals have advantages and disadvantages. Circos plots can show multiple datasets in one graphic but can be visually challenging to interpret and usually do not support views of sub-genomic regions. Dotplots can allow zooming to view chromosome or sub-chromosome regions but cannot easily or elegantly display gene or other annotation in the same visual. In order to better serve different research questions, some groups provide a choice of multiple different types of visuals for genome comparisons [13–15].

Genome comparison data

Broadly, there are 2 types of whole-genome comparison data that can be displayed in a comparative genome visualization tool. The first type of data is locations of gene orthologs. Orthology is typically determined using a protein homology-based method (e.g., protein–protein BLAST) in consideration with local gene order conservation [16,17]. This type of data can lend itself to straightforward “beads on string” visualizations that allows researchers to easily determine how syntenic gene regions have evolved across different species [17–19].

The second type of comparison data is whole-genome assembly alignments (e.g., Mauve [20], LASTZ [21]), which are sequence-based and include both genic and intergenic regions. Whole-genome alignments can be much more complex than simple gene ortholog locations but have the advantage of including alignments in regulatory regions and other regions not annotated as genes.

Here, we introduce a new viewer tool at NCBI, the CGV, that is a key element of CGR. The main view of CGV takes the “stacked linear browser” approach—chromosomes from 2 assemblies are laid out horizontally with colored bands connecting regions of sequence alignment. Initial usability research with conceptual prototypes revealed that this type of visual was the easiest to interpret for scientists with a broad range of research expertise in genomics. We display whole-genome pairwise assembly-assembly alignments in CGV. These sequence-based alignments can be used to analyze gene synteny conservation but can also expose similarities in regions outside known genes, e.g., ultraconserved regions that may be involved in gene regulation. Because CGV is a web-based application, researchers do not need to install or configure software or generate their own comparison files before they can begin using it for their research. Below we describe some of the features of CGV and provide examples of how visualization in this tool can generate insights into genome structure and evolution.

Results

Overview of CGV

We developed a web application, the CGV (https://ncbi.nlm.nih.gov/genome/cgv/), to aid in comparing genome structures between 2 eukaryotic assemblies. CGV facilitates analyses of genome variation and evolution between different strains or species, as well as evaluation of assembly quality between older and newer assemblies from the same species.

Alignments are generated at NCBI using BLAST [22] or LASTZ-based algorithms [21] or imported from the UCSC Genomics Institute (https://hgdownload.soe.ucsc.edu/downloads.html) and other research groups (e.g., T2T/HPRC, https://humanpangenome.org/). Shorter alignments are merged where possible; however, because of repeats and gaps, even very similar genomic regions may be broken down into multiple alignment segments. More closely related genomes will provide more contiguous alignments, while more distant species may align only to short highly conserved regions. In addition, while we are often able to provide alignments for polyploid genomes, it is more difficult to distinguish orthologs (identity by descent) from homeologues (identity by duplication) for species pairs with more recent whole-genome duplications. CGV is therefore more suited to analyzing alignments for polyploids with more distinct sub-genomes, e.g., allopolyploids or older autopolyploids (S1 Appendix). Refer to Materials and methods for more details on how we generate whole-genome assembly alignments and load them into the viewer.

The CGV home page provides a menu where users can select from available species and assembly combinations (Fig 1A). We add new whole-genome alignments as high-profile assemblies become available and in response to requests from the scientific community. As of February 2024, we provided a selection of about 800 alignments from over 350 eukaryotic species (Fig 1B). Whole-genome sequence alignments between more distantly related species may be sparse or low-quality with limited analytical utility; therefore, most of the alignments we offer are between assemblies of the same species or more closely related species within the same class or order.

CGV’s main view (the “ideogram view”) displays pairwise alignments as colored connectors linking the chromosomes in the 2 assemblies (Fig 1C). The view is filtered by default to show only reciprocal best hits between assemblies in order to facilitate the analysis of orthologous genomic regions. Researchers can choose to show the non-best placed alignments to reveal additional closely related sequence duplications or ancestral homologues (Fig 1F and S1 Appendix). Users of CGV can also filter alignments in view by size (e.g., to only show large alignment blocks) or by orientation (e.g., to only show regions that have undergone a potential inversion). The complete whole-genome alignment data in GFF3 and human-readable formats like XLSX can be downloaded from the viewer for a researcher’s own use.

Users can click to select a chromosome from each assembly to zoom to the alignments for the selected chromosome. They can navigate further within this chromosome comparison using the zoom in/out and pan buttons or by pinch-zoom or drag to pan. Users can zoom directly to a particular region of a chromosome by dragging their cursor over the coordinate ruler or the ideogram for either assembly. Double-clicking on a selected alignment segment will synchronously zoom both the top and bottom assembly on the aligned coordinates so that they are stacked on top of one another (Fig 1D).

Where available, RefSeq or assembly-submitter provided gene annotation is displayed on the chromosomes (Fig 1D). Similarities in gene order denote regions of synteny, while discrepancies can point to evolutionarily or biologically significant differences. Differences may also result from assembly errors, particularly if evaluating different assemblies from the same species or strain. Researchers can use the search feature in CGV to find their gene of interest by name or keyword, and subsequently navigate to the location of the gene in the viewer (Fig 1E). If the gene region is aligned, the viewer will simultaneously navigate to the aligned location, which may contain the gene’s known or putative ortholog on the second assembly. The “flip” button allows the user to reverse one chromosome to see inverted alignments displayed in the same relative orientation, which may aid in the detection of discrepancies in gene annotation in regions that are locally syntenic between the 2 assemblies. Once a user has completed their analysis of a region of interest, they can export the image as an SVG to adapt for use in publications and presentations.

Users can click on an alignment segment to show an information panel (Fig 1D). This panel reports the chromosome scaffold accession and sequence coordinates of the alignment on each assembly, as well as the percent identity, number of gaps and mismatches, and alignment length. While the ideogram view in CGV does not display specific nucleotide bases, users can open another panel from the right-click menu that shows the alignment sequence. They can also download the alignment FASTA file of a particular alignment segment for downstream analysis, such as BLAST search or primer design. Researchers can also navigate from CGV to NCBI’s genome browser, the GDV [2]. GDV can display the assembly-alignment data viewed in CGV as a linear track alongside additional data mapped onto a genome assembly, such as detailed transcript and CDS annotation, repeats, GC content, variation data, or user-provided annotations. Zooming to a location within GDV can reveal differences in nucleotide sequence or gene exon or CDS annotation between the 2 assemblies.

In addition to the main ideogram-based view, the Comparative Genome Viewer also provides a 2D dotplot view of the pairwise genome alignment (Fig 2A). The dotplot shows aligned sequence locations in one assembly on the X-axis plotted against aligned locations on the second assembly on the Y-axis. Alignments in the reverse orientation are plotted with an opposite slope and in a different color (purple) than alignments in the same orientation (green), making it easier to identify inversions and inverted translocations. The CGV dotplot shows both reciprocal best-placed and non-best placed alignments. As a result, compared to the ideogram view, this plot may more easily expose differences in copy number between 2 assemblies, such as segmental duplications or differences in genome or chromosome ploidy. Users can select and zoom to a view showing the comparison between a pair of chromosomes in the whole-genome plot (i.e., a “cell” in the plot) (Fig 2B). Once a researcher has discovered a chromosome pair of interest in the dotplot, they can navigate back to the ideogram view to conduct even more detailed analysis, including examining gene annotation and investigating short alignment segments that were beyond the resolution of the dotplot.

Analysis using CGV: Conservation of linkage groups with local rearrangement of synteny

CGV can aid in detecting unusual patterns in genome evolution in different taxa. Researchers had previously observed that genomes from Drosophila species conserve gene content within linkage groups, known as Muller elements, corresponding to chromosomes or large sub-chromosome regions. Within these Muller elements, gene order can be reshuffled extensively in one species relative to another [23]. For alignment between D. albomicans versus D. melanogaster, the CGV ideogram view shows restriction of alignment from each chromosome in one genome to a particular chromosome or chromosome region (i.e., linkage group) in the other genome (Fig 3A). However, within a chromosome–chromosome pair, sequence alignment is broken into many small fragments whose relative order is not conserved. This fragmentation of alignment is more clearly visible in the CGV dotplot, which shows that pairwise alignments are restricted to a single chromosome pair, but appear in a scattered pattern, suggesting that the sequence and gene order has been extensively rearranged within chromosomes (Fig 3B).

Fig 3 — (A) CGV ideogram view of alignment between *Drosophila albomicans* and *Drosophila melanogaster* genomes. Alignments are restricted to a single chromosome or chromosome region. The view can be recreated at https://www.ncbi.nlm.nih.gov/genome/cgv/browse/GCF_009650485.2/GCF_000001215.4/40865/7291. (B) CGV dotplot view of alignment between *Drosophila albomicans* and *Drosophila melanogaster* demonstrates that sequence order is “scrambled” within linkage groups, as demonstrated by a scatter pattern indicating many short rearranged alignments. The view can be recreated at https://www.ncbi.nlm.nih.gov/genome/cgv/plot/GCF_009650485.2/GCF_000001215.4/40865/7291. (C) CGV dotplot view of alignment between starfish species *Luida sarsii* and *Asteria rubens* with similar scatter pattern to *Drosophila* alignments. The view can be recreated at https://www.ncbi.nlm.nih.gov/genome/cgv/plot/GCA_949987565.1/GCF_902459465.1/41045/2723838. (D) CGV ideogram view of alignment between chromosome 1 of *Luida sarsii* and chromosome 1 of *Asteria rubens*. These chromosomes align to each other across their length, but the alignment is broken into multiple short segments which are extensively rearranged. The view can be recreated at https://www.ncbi.nlm.nih.gov/genome/cgv/browse/GCA_949987565.1/GCF_902459465.1/41045/2723838#OX465101.1/NC_047062.1/size=1,firstpass=0. (E) CGV dotplot view of alignment between starfish species *Luida sarsii* and *Patiria pectinifera* with similar scatter pattern to *Drosophila* alignments. The view can be recreated at https://www.ncbi.nlm.nih.gov/genome/cgv/plot/GCA_949987565.1/GCA_029964075.1/41165/7594. (F) CGV dotplot view of alignment between starfish species *Plazaster borealis* and *Pisaster ochraceus*. Alignments show less scatter and more of a diagonal slope, indicating more conservation of sequence order between these 2 species’ genomes. The view can be recreated at https://www.ncbi.nlm.nih.gov/genome/cgv/plot/GCA_021014325.1/GCA_010994315.2/41175/466999. CGV, Comparative Genome Viewer.

We observed a similar pattern to Drosophila when comparing some genomes from different starfish species using CGV. When looking at pairwise alignments between Asterias rubens, Patiria pectinifera, and Luida sarsii species, sequences from a chromosome from 1 genome mainly align to a single other chromosome in the other species. However, within a pairwise chromosome alignment, the sequence order is rearranged, resulting in a scatter pattern in the dotplot (Fig 3C and 3E). The ideogram view can show the alignment fragmentation and rearrangement in more granular detail (Fig 3D). We also noted that some starfish pairs show more conservation of location synteny [24] (Fig 3F), consistent with measured sequence distance (Mash distance) based on shared k-mers [25] (Mash distance between Plazaster borealis and Pisaster ochraceus is 0.128 compared to other starfish pairs with a Mash distance >0.3).

Bhutkar and colleagues [23] speculated that the need to keep certain genes in the same regulatory environment may result in conservation of genes within linkage groups even in the absence of selective pressure to maintain the gene order. More recently, conservation of macrosynteny with extensive small-scale sequence rearrangement was reported in comparisons between other invertebrate species, such as cephalopods, cnidarians, jellies, and sponges [18,26,27]. These rearrangements were used to parse the phylogenetic relationships within this clade. We demonstrate here that CGV can aid researchers in detecting and analyzing this phenomenon in starfish and other evolutionarily varied taxa.

Analysis using CGV: Detection of gene family expansions between related genomes

CGV can uncover potential copy number differences in segmental gene families. These differences may appear as gaps in the alignment in otherwise syntenic gene regions. Segmental insertions or deletions may be too small to be apparent on the whole-genome or whole-chromosome alignment but can be detected when searching and navigating to a gene of interest.

Initial analysis of the complete human telomere-to-telomere CHM13 (T2T-CHM13) genome indicated an expansion of amylase genes on chromosome 1 in the T2T assembly compared to the human GRCh38 reference assembly [28,29]. The whole-genome alignment between GRCh38.p14 and T2T-CHM13v2.0 generated by the T2T/HPRC consortium [29] is available to analyze in CGV and validates this initial finding. A search for “alpha amylase” in CGV finds 4 matches in the GRCh38.p14 assembly and 10 matches in the T2T-CHM13v2.0 assembly (Fig 4A). Navigating to the AMY1A gene on chromosome 1 reveals a nearby sequence segment in the T2T-CHM13 assembly that is not aligned to GRCh38 (Fig 4B). This region in the T2T-CHM13 assembly contains numerous annotated loci that lack official gene nomenclature (i.e., named only as “LOC” followed by the numerical locus identifier ID); 6 of these loci are described as “alpha-amylase” gene family members (Fig 4B). Therefore, there are at least 6 additional alpha-amylase genes in the T2T-CHM13 genome compared to the GRCh38 reference assembly. It is possible that the copy number of this gene is variable in humans; it is also possible that the GRCh38reference genome represents fewer than the typical number of gene copies.

Fig 4 — (A) Gene search of a T2T/HPRC-generated alignment between 2 human assemblies in CGV finds an alpha-amylase gene cluster on chromosome 1 containing 10 copies in the T2T-CHM113v2.0 assembly and 4 copies in the GRCh38.p14 assembly. (B) CGV view showing that T2T-CHM13v2.0 contains an insertion relative to GRCh38.p14, which appears as an unaligned region on chromosome 1. This insertion contains additional alpha-amylase gene copies. An information panel (tooltip) indicates one of these additional family members. These tooltips appear when the user hovers their cursor over the gene annotation or gene label. The view can be recreated at https://www.ncbi.nlm.nih.gov/genome/cgv/browse/GCF_009914755.1/GCF_000001405.40/23025/9606#NC_060925.1:103415704-103764412/NC_000001.11:103566852-103915505/size=1000,firstpass=0. (C) CGV view showing that chromosome 2 of *Canis lupus familiaris* (dog) UMICH_Zoey_3.1 aligns to chromosomes 2, 15, 23, and 25 of Dog10K_Boxer_Tasha. The view can be recreated at https://ncbi.nlm.nih.gov/genome/cgv/browse/GCF_005444595.1/GCF_000002285.5/17685/9615#NC_049262.1:6542815-78714085//size=10000. (D) UMICH_Zoey_3.1 assembly chromosome 2 alignment to Dog10K_Boxer_Tasha chromosome 25 contains the *MALRD1* gene, which is annotated as *LOC608668* in the Tasha assembly (boxed in red). Gene synteny is not conserved outside of the region of this gene. Webpage source: National Library of Medicine. CGV, Comparative Genome Viewer.

The T2T/HPRC alignment described above contains only reciprocal best placed alignments; therefore, it does not include alignments corresponding to the additional copies of alpha amylase in the T2T-CHM13v2.0 assembly. As a result, you will not be able to see alignments to these paralogs, even when the option to “show non-best placed alignments” is checked on. S1 Appendix (B–E) provides additional examples of genomic regions visualized in CGV that feature local gene copy number expansions in one genome relative to another. These alignments were generated by NCBI (see Materials and methods; S1 Fig) and, as a result, include non-best placed alignments in the alignment data. Turning on the option to “show non-best placed alignments” reveals that single copy genes in the human genome (e.g., BMPR2, EIF4A3, or EYS) are present in multiple copies in the corresponding genomic region of other primates (S1 Appendix B–E). These primate assemblies are likely to be high quality, so that the copy number differences observed are likely to be real. In situations where one or both genome assemblies are of lower quality, differences in copy number observed in CGV may reflect assembly or annotation errors.

Analysis using CGV: Possible gene translocation between 2 dog assemblies

For closely related strains or species, CGV can help uncover and validate structural anomalies, such as where gene order synteny has been disrupted. Visual inspection of the whole-genome CGV ideogram view of alignment between 2 dog genomes—the Great Dane Zoey (UMICH_Zoey_3.1) and the boxer Tasha (Dog10K_Boxer_Tasha)—indicated a region that aligned to chromosome 2 in the Zoey assembly and chromosome 25 in the Tasha assembly (Fig 4C and 4D). This region contains the MALRD1 gene in the Zoey assembly, which is shown to align to a MALRD1 homolog annotated as LOC608668 in Tasha (Fig 4D). CGV alignments indicate that LOC608668 is likely the Tasha MALRD1 gene; there are no better alignments to MALRD1 detected in CGV or by an independent BLAST search of the Tasha genome.

Zooming out in the aligned region of MALRD1 indicates that gene synteny is not conserved outside of this gene r (Fig 4D). It appears that this gene has been translocated from chromosome 2 on the Zoey assembly to chromosome 25 on Tasha. It is also possible that the Tasha genome may have been misassembled in this region, and the MALRD1 gene sequence is properly situated on chromosome 2 within the otherwise conserved syntenic block. A researcher would need to further examine the quality of the Tasha assembly in this region to distinguish these possibilities, for example, by examining the sequencing reads for the Tasha assembly or viewing an HiC map of assembly structure.

If this anomaly represents a true difference between the 2 genomes, it could prove biologically significant. The translocation of MALRD1 may have placed it in a different gene regulatory environment in the Tasha genome, which could possibly result in different levels or patterns of gene expression. The human ortholog of MALRD1 was shown to be involved in bile acid metabolism [30]. This gene region was also genetically linked to Alzheimer’s disease [31]. Therefore, if valid and not assembly artifacts, differences like these could provide insight into human health.

Discussion

We describe here a new visualization tool for eukaryotic assembly-assembly alignments, the CGV. We developed this web application with a view toward serving both expert genome scientists as well as organismal biologists, students, and educators. Users of CGV do not need to generate their own alignments or configure the software using command line tools. Instead, they can select from our menu of available alignments, access a view immediately in a web application, and start their analysis. We are continuing to add new alignments regularly and invite researchers to contact us if assemblies or organisms of interest are missing. We continue to do periodic outreach to the community to help us improve our visual interfaces so that they are simple, intuitive, and accessible.

CGV exclusively displays whole-genome sequence alignments provided by NCBI; users cannot currently upload their own alignment data or choose assemblies to align in real time. There are both technical and scientific considerations to allowing researchers to select and align assemblies automatically themselves. Currently, whole-genome assembly-assembly alignments take several hours to days, using up to 1,000 CPU processing hours per pairwise alignment of larger genomes, such as those for mammalian or plant assemblies. Moreover, whole-genome alignments are difficult to generate past a certain genetic distance (i.e., Mash > 0.3). Alignments between more distant species may be of limited research value as they will likely have sparse and short segments that may correspond only to the most highly conserved coding sequence (CDS) (Fig 5 and S1 Data). Some more closely related genomes may also be inappropriate for CGV’s alignment algorithm. For example, our alignments may not accurately distinguish sub-genomes for genome pairs with recent whole-genome duplications. We also cannot accurately align assembly pairs where there are large differences in ploidy levels and genome sizes. We suggest protein similarity or gene orthology-based alignments as more appropriate for comparison between genomes that are not appropriate for CGV’s whole-genome sequence analysis. At present, we manually vet requested alignments to make sure that requested assemblies are complete, of high quality (e.g., high scaffold N50 or BUSCO scores [32]), and at a reasonable evolutionary distance. This review ensures that alignments will be useful to both the original requester and others in the research community.

Fig 5 — Percentages of total target genome or CDS nucleotides covered by ungapped alignments are plotted against the Mash distance between the pair of genomes. (A) Alignments between the human GRCh38.p14 assembly and other vertebrates. (B) Alignments of fall army worm (FAW, *Spodoptera frugiperda*, a major insect agricultural pest) and related insects. At lower Mash distances, the whole-genome alignments cover most of the genome and CDS. At Mash distances greater than 0.25 or 0.3, the alignment covers less than 20% of the genome overall, and between 30% and 60% of the CDS. Refer to S1 Data for the data used to populate these graphs. CDS, coding sequence.

Many research questions in comparative biology may be best answered by simultaneously visualizing alignments among more than 2 assemblies. We are exploring user needs for multigenome alignment visualization in order to better support this analysis at NCBI. Our lessons from developing CGV will prove valuable in this upcoming initiative.

Materials and methods

Preparation of assembly-assembly alignments

Genome assemblies are aligned using a two-phase pipeline first described in Steinberg and colleagues [33], with adaptations for cross-species alignments. The pipeline is engineered in a custom database-driven workflow engine (“gpipe”) that executes programs written in C++; the extensive internal database and C++ library dependencies preclude distribution of code for compilation or use outside of NCBI without extensive re-engineering. Here, we provide a detailed description of the processing. In the first phase, initial alignments are generated using BLAST [22] or LASTZ [21], or imported from a third-party source such as UCSC [34]. In the second phase, alignments are merged and ranked to distinguish reciprocal-best alignments from additional alignments that are locally best on one assembly but not the other. The resulting alignment set is omnidirectional and can be used to project information from query to subject assembly or vice versa.

In the first phase, for both BLAST and LASTZ, repetitive sequences present in the query and target assemblies are soft-masked using WindowMasker [35]. The default parameters are usually suitable for aligning assembly pairs within the same species. However, more aggressive masking is required when aligning cross-species assemblies. The masking rate is adjusted with the parameter t_thres_pct set to 99.5 (default, for BLAST same-species), 98.5 (for BLAST cross-species), or 97.5 (LASTZ cross-species), or lower for some genomes with extensive and diverse repeat composition. The 97.5 to 98.5 values typically result in a masking percentage similar to RepeatMasker (http://www.repeatmasker.org) with a species-specific repeat library, with the advantage of not needing to define repeat models beforehand.

Genome assemblies are aligned using either BLAST or LASTZ. The selection of the aligner and specific parameters depends on the level of similarity between the assemblies. We use Mash [25] to compute the approximate distance between 2 assemblies (Fig 5 and S1 Data). BLAST is employed for aligning pairs of assemblies belonging to the same species, as well as cross-species assembly pairs with a Mash distance less than 0.1. An exemplar BLAST command is:

blastn -evalue 0.0001 -gapextend 1 -gapopen 2 -max_target_seqs 250 -soft_masking true -task megablast -window_size 150 -word_size 28

A BLAST word_size of 28 is used for pairs of assemblies with Mash distances below 0.05, such as human and orangutan, while a word_size of 16 is used to enhance sensitivity for more distant cross-species pairs with Mash distances ranging from 0.05 to 0.1, such as human and rhesus macaque.

Assembly pairs with Mash distances exceeding 0.1, such as human and mouse assemblies, are aligned using LASTZ. The make_lastz_chains pipeline [21] is employed to generate alignments between query and target assemblies in UCSC chain format. The default parameters are often adequate to produce satisfactory alignments for many assembly pairs, though some distant assembly pairs (e.g., Mexican tetra-medaka) warrant changes such as the use of a different substitution matrix (BLASTZ_Q=HoxD55).

Alignments generated with the make_lastz_chains pipeline or precomputed alignments imported from UCSC are in chain format. The UCSC chainNet pipeline [36] is run on query x target and target x query alignment sets separately so that the alignments are “flattened” in a way that the reference sequences are covered only once by the alignments in each set, and the 2 chainNet outputs are concatenated.

In the second phase, alignments are converted to NCBI ASN.1 format and processed further for the NCBI’s CGV and GDV browsers. In this phase, the full set of BLAST or LASTZ-derived alignments is processed to merge neighboring alignments and split and rank overlapping alignments to identify a subset of best alignments (S1A Fig). Merging is accomplished on a sequence-pair-by-sequence-pair basis, and ranking is accomplished globally for the assembly pair. The process is designed to find a dominant diagonal among a set of potentially conflicting alignments.

Merging involves the following steps. First, when applicable, alignments based on common underlying sequence components of the assemblies (e.g., the same BAC component used in both human GRCh37 and GRCh38) are identified and merged into the longest and most consistent stretches possible. Second, adjacent alignments are merged if there are no conflicting alignments. Third, alignments are split on gaps using a default threshold of 50 bp (for the same or closer species) or 50 kb (for more distant species), or longer than 5% of the alignment length. Alignments are also split at any point where they intersect with overlapping alignments (S1A Fig). Duplicate or low-quality alignments fully contained within higher-quality alignments are dropped.

After merging and splitting, alignments are subsequently processed using a sorting and ranking algorithm (S1B Fig). Alignments are sorted based on a series of properties, including the use of common components, assembly level (alignment to chromosomes preferred over alignment to unplaced scaffolds), total sequence identity, and alignment length. The alignments are then scanned twice, once each on the query and subject sequence ranges, to sort out reciprocal best-placed (also referred to as “first pass” or reciprocity = 3) and non-best placed (also referred to as “second pass” or reciprocity = 1 or 2) alignment sets. Finally, all alignments in each reciprocity are merged again to stitch together adjacent alignments with no conflicting alignments into the longest representative stretches.

Assembly-assembly alignments are stored in an internal database available for rendering in NCBI’s CGV and GDV browsers. The alignment data are publicly available in GFF3 and ASN.1 formats at https://ftp.ncbi.nlm.nih.gov/pub/remap/.

For display in CGV, assembly alignment batches are filtered to keep only alignments that contain chromosome scaffolds as both anchors and targets, since non-chromosomal scaffolds are not displayed in this viewer. Alignments are converted into a compact binary format designed to keep only the synteny data required for display. This preparation step is done by programs written in C++ and bash scripts that tie them together.

Technical architecture of CGV

CGV operates on a two-tier model, with a front end implemented using HTML/JavaScript running in the user’s web browser and a back end running at NCBI. The graphical rendering is done on the front end using modern WebGL technologies. The main advantage of this approach is speed and fluidity of the user interface since most of the alignment data needed to be rendered is sent to the front end at the initial load and there are no additional roundtrips to the server when the user interacts with the page (e.g., panning or zooming). Using front end graphical rendering also reduces the network traffic between NCBI and the end user, which makes CGV more responsive.

The back end of the CGV application resolves internal alignment identifiers to an alignment data file that the front end can use for generating graphical images. The back end is implemented as an industry-standard gRPC service written in C++ and running in a scalable NCBI service mesh (linkerd, namerd, consul). When a CGV view is initially loaded, our gRPC service requests the alignment data needed for the particular page. On graphical pages, gRPC resolves an alignment identifier to a URL with prepared synteny/alignment data and the page loads the file at this URL into a WASM module which is written in C++ and compiled with Emscripten. The WASM module serves this data on demand to the page’s JavaScript code which uses it for building the image and all user interactions.

In parallel, the list of assembly-assembly alignments and their metadata is sent to the selection menu (i.e., “Set up your view”) on the CGV landing page. This allows the selection menu to report scientific and common species names, assembly accessions, and assembly names.

The alignment selection menu on the CGV landing page is a traditional web form page. We utilize the NCBI version of United States Web Design System (USWDS) design standards and components (https://www.ncbi.nlm.nih.gov/style-guide) to unify graphical design with other US government pages.

On the more graphically intensive ideogram and dotplot pages, graphical rendering is done in the user’s web browser using WebGL using d3.js or pixi.js libraries, which allows for efficient interactivity, scalability, and fluidity of user interaction. Other elements of the page use jQuery/extJS and USWDS components. CGV reuses chromosome ideograms initially developed for NCBI’s Genome Data Viewer [2].

Gene annotations shown in CGV are obtained from NCBI’s public databases using NCBI Entrez Programming Utilities (E-utilities) (https://www.ncbi.nlm.nih.gov/books/NBK25501/). Annotation-build specific gene search is provided by an NCBI Datasets (https://www.ncbi.nlm.nih.gov/datasets/) gRPC service.

Design of CGV application

A philosophy of user-centered design, which puts user needs at the forefront of decision-making, was an integral element in the development of the CGV. Participants for user research were recruited from members of the genomics research community who provided their contact information through feedback links on the CGV application and other sequence analysis tools at NCBI. Some of these researchers were previously familiar with CGV, while others had little or no experience with this tool. Data from user research testing sessions was compiled and analyzed for patterns in behavior, thereby allowing the team to validate that the design was moving in a direction that facilitated analysis of sequence alignment data. To date, we have conducted user research with over 30 different experts in the field of comparative genomics. We also evaluate the application for Section 508 compliance, which helps insures CGV performs well on mobile devices and is accessible to users with limited or no visibility.

Supporting information

S1 Appendix. Examples of CGV showing segmental duplications when non-best placed alignments are included in the view.

CGV’s ideogram view displays best-placed alignments by default. Non-best placed alignments can be added from the “Adjust Your View” configure options (Fig 1F). (A) CGV view of whole-genome alignment of Arabidopsis thaliana and A. suecica, a polyploid hybrid of A. thaliana and A. arenosa [37]. The default view shows regions of similarity between A. thaliana and the A. thaliana subgenome of A. suecica. Including non-best placed alignments reveals additional alignment segments (marked by red boxes) that correspond to alignment between A. thaliana and the A. arenosa subgenome of A. suecica. https://www.ncbi.nlm.nih.gov/genome/cgv/browse/GCA_019202805.1/GCF_000001735.4/48965. (B–E) CGV alignments between human and chimpanzee (B, C, and E) and human and bonobo (D) genome assemblies in regions that contain local segmental duplications. When non-best placed alignments are shown, additional alignment segments are displayed that identify potential additional copies of BMPR2 (B), EIF4A3 (C and D), or EYS related gene sequence (E). EIF4A3 was previously reported to have multiple copies in both chimpanzee and bonobo relative to the human genome [38]. (B). https://www.ncbi.nlm.nih.gov/genome/cgv/browse/GCF_028858775.1/GCF_000001405.40/35595/0#NC_072400.1:104175446-105102154/NC_000002.12:202185329-203061800/size=1000,firstpass=0. (C). https://www.ncbi.nlm.nih.gov/genome/cgv/browse/GCF_028858775.1/GCF_000001405.40/35595/0#NC_072415.1:91293196-91791317/NC_000017.11:80086216-80514871/size=1000,firstpass=0. (D). https://www.ncbi.nlm.nih.gov/genome/cgv/browse/GCF_029289425.1/GCF_000001405.40/36375/9606#NC_073266.1:100957682-101607434/NC_000017.11:80031356-80681076/size=1000,firstpass=0. (E). https://www.ncbi.nlm.nih.gov/genome/cgv/browse/GCF_028858775.1/GCF_000001405.40/35595/0#NC_072404.1:68562328-78605127/NC_000006.12:62860419-76648142/size=1000,firstpass=0.

(PDF)

pbio.3002405.s001.pdf^{(1.1MB, pdf)}

S1 Data. Alignment coverage at different Mash distances for selected assembly pairs.

(XLSX)

pbio.3002405.s002.xlsx^{(22.6KB, xlsx)}

S1 Fig. Merging, sorting, and ranking assembly-assembly alignments.

(A) Flowchart depicting how adjacent alignment segments are merged. Subsequently, alignments are split once again at large gaps. (B) Flowchart showing how overlapping alignments are separated, ranked, and re-merged. Reciprocal best-placed alignments are designated as “first pass,” while the non-best placed alignment is designated “second pass.”

(TIF)

pbio.3002405.s003.tif^{(1MB, tif)}

Acknowledgments

We thank Anne Ketter and Emily W Davis for help with project planning and coordination. We thank Guangfeng Song for aid with user experience research and analysis.

Thanks to Wayne Matten, Michelle Formica-Frizzi, and others in the NCBI customer service team for marketing, webinars, and video tutorials. Anatoliy Kuznetsov and Andrei Shkeda conducted early technical design work that influenced the architecture of the CGV application. Deanna M. Church and Mike DiCuccio participated in initial design and testing of the assembly alignment protocol. Finally, we thank members of the greater genomics research community who have participated in usability sessions and provided feedback and alignment requests to CGV. CGV was developed as a part of the National Institutes of Health’s Comparative Genomics Resource (CGR) (https://www.ncbi.nlm.nih.gov/comparative-genomics-resource/).

Abbreviations

CDS: coding sequence
CGR: Comparative Genomics Resource
CGV: Comparative Genome Viewer
GDV: Genome Data Viewer
IGV: Integrative Genomics Viewer
NCBI: National Center for Biotechnology Information
USWDS: United States Web Design System

Data Availability

Assembly-assembly alignment data displayed in the Comparative Genome Viewer are publicly available in GFF3 and ASN.1 formats at https://ftp.ncbi.nlm.nih.gov/pub/remap/. Gene annotation can be accessed via NCBI Datasets command line interfaces at https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/.

Funding Statement

This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1.Bornstein K, Gryan G, Chang ES, Marchler-Bauer A, Schneider VA. The NIH Comparative Genomics Resource: addressing the promises and challenges of comparative genomics on human health. BMC Genomics. 2023;24(1):575. doi: 10.1186/s12864-023-09643-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Rangwala SH, Kuznetsov A, Ananiev V, Asztalos A, Borodin E, Evgeniev V, et al. Accessing NCBI data using the NCBI Sequence Viewer and Genome Data Viewer (GDV). Genome Res. 2021;31(1):159–169. doi: 10.1101/gr.266932.120 [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Nassar LR, Barber GP, Benet-Pages A, Casper J, Clawson H, Diekhans M, et al. The UCSC Genome Browser database: 2023 update. Nucleic Acids Res. 2023;51(D1):D1188–D1195. doi: 10.1093/nar/gkac1072 [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178–192. doi: 10.1093/bib/bbs017 [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Buels R, Yao E, Diesh CM, Hayes RD, Munoz-Torres M, Helt G, et al. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol. 2016;17:66. doi: 10.1186/s13059-016-0924-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Cabanettes F, Klopp C. D-GENIES: dot plot large genomes in an interactive, efficient and simple way. PeerJ. 2018;6:e4958. doi: 10.7717/peerj.4958 [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Haug-Baltzell A, Stephens SA, Davey S, Scheidegger CE, Lyons E. SynMap2 and SynMap3D: web-based whole-genome synteny browsers. Bioinformatics. 2017;33(14):2197–2198. doi: 10.1093/bioinformatics/btx144 [DOI] [PubMed] [Google Scholar]
8.Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, et al. Circos: an information aesthetic for comparative genomics. Genome Res. 2009;19(9):1639–1645. doi: 10.1101/gr.092759.109 [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Herrero J, Muffato M, Beal K, Fitzgerald S, Gordon L, Pignatelli M, et al. Ensembl comparative genomics resources. Database (Oxford). 2016;2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Wang H, Su Y, Mackey AJ, Kraemer ET, Kissinger JC. SynView: a GBrowse-compatible approach to visualizing comparative genome data. Bioinformatics. 2006;22(18):2308–2309. doi: 10.1093/bioinformatics/btl389 [DOI] [PubMed] [Google Scholar]
11.Xiao Z, Lam HM. ShinySyn: a Shiny/R application for the interactive visualization and integration of macro- and micro-synteny data. Bioinformatics. 2022;38(18):4406–4408. doi: 10.1093/bioinformatics/btac503 [DOI] [PubMed] [Google Scholar]
12.Richardson JE, Baldarelli RM, Bult CJ. Multiple genome viewer (MGV): a new tool for visualization and comparison of multiple annotated genomes. Mamm Genome. 2022;33(1):44–54. doi: 10.1007/s00335-021-09904-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Diesh C, Stevens GJ, Xie P, De Jesus MT, Hershberg EA, Leung A, et al. JBrowse 2: a modular genome browser with views of synteny and structural variation. Genome Biol. 2023;24(1):74. doi: 10.1186/s13059-023-02914-z [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Albert VA, Krabbenhoft TJ. Navigating the CoGe Online Software Suite for Polyploidy Research. Methods Mol Biol. 2023;2545:19–45. doi: 10.1007/978-1-0716-2561-3_2 [DOI] [PubMed] [Google Scholar]
15.Lee J, Hong WY, Cho M, Sim M, Lee D, Ko Y, et al. Synteny Portal: a web-based application portal for synteny block analysis. Nucleic Acids Res. 2016;44(W1):W35–W40. doi: 10.1093/nar/gkw310 [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Wang Y, Tang H, Debarry JD, Tan X, Li J, Wang X, et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 2012;40(7):e49. doi: 10.1093/nar/gkr1293 [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Lovell JT, Sreedasyam A, Schranz ME, Wilson M, Carlson JW, Harkess A, et al. GENESPACE tracks regions of interest and gene copy number variation across multiple genomes. Elife. 2022:11. doi: 10.7554/eLife.78526 [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Schultz DT, Haddock SHD, Bredeson JV, Green RE, Simakov O, Rokhsar DS. Ancient gene linkages support ctenophores as sister to other animals. Nature. 2023;618(7963):110–117. doi: 10.1038/s41586-023-05936-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Muffato M, Louis A, Poisnel CE, Roest CH. Genomicus: a database and a browser to study gene synteny in modern and ancestral genomes. Bioinformatics. 2010;26(8):1119–1121. doi: 10.1093/bioinformatics/btq079 [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Darling AC, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004;14(7):1394–1403. doi: 10.1101/gr.2289704 [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, et al. Integrating gene annotation with orthology inference at scale. Science. 2023;380(6643):eabn3107. doi: 10.1126/science.abn3107 [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000;7(1–2):203–214. doi: 10.1089/10665270050081478 [DOI] [PubMed] [Google Scholar]
23.Bhutkar A, Schaeffer SW, Russo SM, Xu M, Smith TF, Gelbart WM. Chromosomal rearrangement inferred from comparisons of 12 Drosophila genomes. Genetics. 2008;179(3):1657–1680. doi: 10.1534/genetics.107.086108 [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Lee Y, Kim B, Jung J, Koh B, Jhang SY, Ban C, et al. Chromosome-level genome assembly of Plazaster borealis sheds light on the morphogenesis of multiarmed starfish and its regenerative capacity. Gigascience. 2022:11. doi: 10.1093/gigascience/giac063 [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17(1):132. doi: 10.1186/s13059-016-0997-x [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Simakov O, Bredeson J, Berkoff K, Marletaz F, Mitros T, Schultz DT, et al. Deeply conserved synteny and the evolution of metazoan chromosomes. Sci Adv. 2022;8(5):eabi5884. doi: 10.1126/sciadv.abi5884 [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Albertin CB, Medina-Ruiz S, Mitros T, Schmidbaur H, Sanchez G, Wang ZY, et al. Genome and transcriptome mechanisms driving cephalopod evolution. Nat Commun. 2022;13(1):2427. doi: 10.1038/s41467-022-29748-w [DOI] [PMC free article] [PubMed] [Google Scholar]
28.Vollger MR, Guitart X, Dishuck PC, Mercuri L, Harvey WT, Gershman A, et al. Segmental duplications and their variation in a complete human genome. Science. 2022;376(6588):eabj6965. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44–53. doi: 10.1126/science.abj6987 [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Reue K, Lee JM, Vergnes L. Regulation of bile acid homeostasis by the intestinal Diet1-FGF15/19 axis. Curr Opin Lipidol. 2014;25(2):140–147. doi: 10.1097/MOL.0000000000000060 [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Zubenko GS, Hughes HB 3rd. Predicted gene sequence C10orf112 is transcribed, exhibits tissue-specific expression, and may correspond to AD7. Genomics. 2009;93(4):376–382. doi: 10.1016/j.ygeno.2008.12.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
32.Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–3212. doi: 10.1093/bioinformatics/btv351 [DOI] [PubMed] [Google Scholar]
33.Steinberg KM, Schneider VA, Graves-Lindsay TA, Fulton RS, Agarwala R, Huddleston J, et al. Single haplotype assembly of the human genome from a hydatidiform mole. Genome Res. 2014;24(12):2066–2076. doi: 10.1101/gr.180893.114 [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14(4):708–715. doi: 10.1101/gr.1933104 [DOI] [PMC free article] [PubMed] [Google Scholar]
35.Morgulis A, Gertz EM, Schaffer AA, Agarwala R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics. 2006;22(2):134–141. doi: 10.1093/bioinformatics/bti774 [DOI] [PubMed] [Google Scholar]
36.Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003;100(20):11484–11489. doi: 10.1073/pnas.1932072100 [DOI] [PMC free article] [PubMed] [Google Scholar]
37.Jiang X, Song Q, Ye W, Chen ZJ. Concerted genomic and epigenomic changes accompany stabilization of Arabidopsis allopolyploids. Nat Ecol Evol. 2021;5(10):1382–1393. doi: 10.1038/s41559-021-01523-y [DOI] [PMC free article] [PubMed] [Google Scholar]
38.Mao Y, Catacchio CR, Hillier LW, Porubsky D, Li R, Sulovari A, et al. A high-quality bonobo genome refines the analysis of hominid evolution. Nature. 2021;594(7861):77–81. doi: 10.1038/s41586-021-03519-x [DOI] [PMC free article] [PubMed] [Google Scholar]

PLoS Biol. doi: 10.1371/journal.pbio.3002405.r001

Decision Letter 0

Richard Hodge

21 Oct 2023

Dear Dr Rangwala,

Thank you for submitting your manuscript entitled "Interactive visualization of whole genome alignment between two eukaryotic genomes using NCBI’s Comparative Genome Viewer (CGV)" for consideration as a Methods and Resources Article by PLOS Biology. Please accept my apologies for the delay in getting back to you as we consulted with an academic editor about your submission.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I am writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. After your manuscript has passed the checks it will be sent out for review. To provide the metadata for your submission, please Login to Editorial Manager (https://www.editorialmanager.com/pbiology) within two working days, i.e. by Oct 23 2023 11:59PM.

If your manuscript has been previously peer-reviewed at another journal, PLOS Biology is willing to work with those reviews in order to avoid re-starting the process. Submission of the previous reviews is entirely optional and our ability to use them effectively will depend on the willingness of the previous journal to confirm the content of the reports and share the reviewer identities. Please note that we reserve the right to invite additional reviewers if we consider that additional/independent reviewers are needed, although we aim to avoid this as far as possible. In our experience, working with previous reviews does save time.

If you would like us to consider previous reviewer reports, please edit your cover letter to let us know and include the name of the journal where the work was previously considered and the manuscript ID it was given. In addition, please upload a response to the reviews as a 'Prior Peer Review' file type, which should include the reports in full and a point-by-point reply detailing how you have or plan to address the reviewers' concerns.

During the process of completing your manuscript submission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Richard

Richard Hodge, PhD

Senior Editor, PLOS Biology

rhodge@plos.org

PLOS

Empowering researchers to transform science

Carlyle House, Carlyle Road, Cambridge, CB4 3DN, United Kingdom

ORCiD I plosbio.org I @PLOSBiology I Blog

California (U.S.) corporation #C2354500, based in San Francisco

PLoS Biol. doi: 10.1371/journal.pbio.3002405.r002

Decision Letter 1

Richard Hodge

15 Jan 2024

Dear Dr Rangwala,

Thank you for your patience while your manuscript "Interactive visualization of whole eukaryote genome alignments using NCBI’s Comparative Genome Viewer (CGV)" was peer-reviewed at PLOS Biology. It has now been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by three independent reviewers.

In light of the reviews, which you will find at the end of this email, we would like to invite you to revise the work to thoroughly address the reviewers' reports.

Reviewer #1 is very positive and has no requests. Reviewer #2 is fairly positive, and appreciates the effort as s/he has tried to build such a tool themselves. They question how well the CGV tool copes with comparisons that cross WGD events, and suggest that you either re-jig the algorithm to improve it, state explicitly that this is a limitation, or convince them that it *does* actually cope well. Because of this, they are sceptical that it would be as useful for plants (where WGDs are more common) as for animals; s/he also wonders how well it will perform where divergence is higher and synteny is less extensively conserved. Reviewer #3 is mostly positive, but wants you to do a direct comparison with related tools in order to make a case for the need for this new addition to the field; s/he also has a list of issues with the current performance and behaviour of the tool.

Given the extent of revision needed, we cannot make a decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is likely to be sent for further evaluation by all or a subset of the reviewers.

We expect to receive your revised manuscript within 2 months. Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension.

At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may withdraw it.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point-by-point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Revised Article with Changes Highlighted" file type.

*Re-submission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roli Roberts

Roland G Roberts PhD

Senior Editor

PLOS Biology

rroberts@plos.org

on behalf of

Richard Hodge,

Senior Editor

PLOS Biology

rhodge@plos.org

------------------------------------

REVIEWERS' COMMENTS:

Reviewer #1:

In this manuscript Rangwala et al. present a new tool for visualising whole genome comparisons using precomputed alignments of publicly available assemblies for hundreds of species across the tree of life.

They outline some widely used tools and visualisation methods for displaying genomic data and outline limitations with respect of assembly-assembly alignments. They show how the Comparative Genome Viewer presents alignment data in a clear and accessible manner on a platform (browser) open to all users, they do a step by step guide to using CGV and also highlight some interesting use cases.

They talk at length about front and back end set up and design choices and implementation for a good user experience, they also cover some usage limitations and potential new features to implement.

This is a nice well written paper, figures are clear and concise and relevant to the manuscript. There are no big concerns about the presented data and there is definitely space for a tool such as this.

Not something to address in the paper but two thirds of genomes in the alignment set are animals with 20% of the total being mammalian, this is not representative of the publicly available assemblies in GenBank. In addition, it would be worth having some assembly quality metric to help the end user to better understand the genome data, as the authors raise repeatedly the potential for there to be assembly errors.

I do recommend the paper for publication.

Reviewer #2:

Rangwala and colleagues have built a useful and user-friendly interactive visualization tool for pairwise comparison of genomic synteny and alignments. Having worked to build one of these myself, I commend the authors on the simplicity of the user-facing operations and the variety of downstream analytical tools. I have a few thoughts that I hope will improve the utility of the tool, or alternatively better contextualize the scale/scope of the results presented in the manuscript.

It looks like the synteny finding algorithm doesn't deal well with whole-genome duplications. As far as I could tell, there is only one comparison available in the CGV that crosses a whole-genome duplication, that of Xenopus laevis vs. X. tropicalis (https://www.ncbi.nlm.nih.gov/genome/cgv/browse/GCF_000004195.4/GCF_017654675.1/38475/8355), and the alignments therein do not look good. Since the WGD is on the X. laevis branch (Fig. 2 here: https://www.nature.com/articles/nature19840), both homeologs should be similarly related to X. tropicalis, so primary and secondary alignments have no meaning with regard to the homeologs. Yet you can only see the duplication with the inclusion of secondary alignments. I understand that finding synteny from whole-genome alignments that cross a WGD is difficult (unlike with CDS, which is simple), but it is doable with the right context or algorithm. I'd strongly suggest either re-writing the algorithm to handle these cases, or state up front that the CGV cannot handle comparisons of variable ploidy. If you disagree with me and think that the GCV can handle WGDs, please include examples of simple comparisons, like Xenopus laevis --> human, wheat --> Brachypodium distachyon and Brassica napus (oilseed rape) --> Arabidopsis thaliana, and show that genome coverage is equally good for all homeologs.

Depending on the response to my above comment regarding WGDs, I think the statement that CGV is something that is useful for eukaryotes in general is not accurate. Really, it appears that it is only useful for single-copy genomes. I bring this point up because all of the plant species included in the CGV only extend to other genomes that coalesce before a whole-genome duplication event. However, in many cases, a user would want to use this tool in plants like they would in animals - to explore regions of interest between a genome and its related model system. Nearly all alignments from crops to the major plant models (Arabidopsis and Brachypodium) cross at least one WGD. There are ~5x more plant species than vertebrates … the CGV would not be useful for the vast majority of these.

The CGV is not going to be useful for comparisons of very diverged species that lack solid synteny. In amniotes, synteny is remarkably conserved and there still is ~50% genome coverage between Human and Xenopus: https://www.ncbi.nlm.nih.gov/genome/cgv/browse/GCF_000004195.4/GCF_000001405.40/46255/9606#NC_030677.2//size=10000 cover ~50% of the genome). However, as you go deeper (e.g. humans to zebrafish) or with less conserved genomes (https://www.ncbi.nlm.nih.gov/genome/cgv/browse/GCF_000002335.3/GCF_023101765.2/42855/7070), the utility breaks down. While this limitation is obvious to someone well-versed in comparative genomics, it won't be to the vast majority of users and needs to be stated up front.

It is stated that ~700 genomes are available - this makes it seem like one could compare all of these genomes to each other. This is not the case. The majority of the pre-computed alignments are to a very restricted set of genomes, mostly within the same species. I think this is fine, but this limitation needs to be stated clearly up front. Perhaps instead of having all genomes together in one dropdown, have a "project" or whatever selection that gets you to all genomes that can be compared?

Genomes have variable naming conventions, and this can produce comparisons that look messy, but are actually very syntenic. The rat/mouse comparison illustrates this: there are some rearrangements, but these are hard to see since syntenic chromosomes are not stacked. The Phytozome comparative genomics viewer (https://phytozome-next.jgi.doe.gov/tools/synteny) handles this with the "sort scaffold" toggle. Incorporating this here would be helpful.

Reviewer #3:

In this manuscript, the authors introduce the Comparative Genome Viewer (CGV), a web-based pairwise whole-genome alignment visualization tool hosted by the NCBI. The tool allows visualization of ~700 pre-computed pairwise alignments from eukaryotics species. The pairwise alignments are computed using BLAST, LASTZ, or imported from other sources and then post-processed using a custom algorithm. The visualization uses an existing "stacked linear browser" approach and includes gene annotations as tracks for both genome assemblies. The tool has links out to views within the existing Genome Data Viewer (GDV) for browsing within single genomes.

Overall, the CGV is an intuitive tool that provides access to a large number pairwise whole genome alignments. It should be useful to researchers needing to examine orthologous regions from previously published genome assemblies. It is not designed to handle newly-sequenced unpublished genomes, and thus generators of such data will need to use other tools for initial comparative analysis of their genomes. I have just one major critique of the manuscript with respect to its comparisons to existing tools and handful of minor comments.

Major comments:

1. There are several other large-scale providers of pairwise genome alignments and associated visualization tools. For example, the UCSC Genome Browser has a chain/net track that allows for visualizing pairwise alignments with another genome and can easily show rearranged segments (contradicting lines 67-70). The Ensembl website also hosts many pairwise alignments and has dedicated visualizations for viewing them. This manuscript should more strongly make the case for why an additional tool is needed in this space. For example, a figure showing how the these different tools display the alignment of the same region could help to show the strengths and weaknesses of these tools.

Minor comments:

2. Going to dotplot view from ideogram view puts you back at whole-genome level not the regions currently being viewed in ideogram view, which is counterintuitive.

3. The download button downloads entire whole-genome alignment, not the currently displayed part of the alignment.

4. It is unlcear how to download in XLSX format - this option was always greyed out even when zooming in to small regions. It is also unclear what information this format conveys.

5. Can other data tracks other than gene annotations be added?

6. I found the base level view interface difficult to understand, particulary in how it interacts with the main interface. Some further desciription or documentation of this interface would be helpful.

7. Figures in this manuscript need to be at higher resolution as many details are blurry - given that CGV can export in SVG this should be easily achievable.

8. A custom approach is described for processing initial alignments, however, no software is provided as an implementation. Ideally, this software should be made publicly available so that other researchers can evaluate the approach on other data.

9. Please cite and explain Mash distance at its first mention in the manuscript.

10. Line 60 - perhaps mention IGV as it is a popular desktop-based genome browser

11. An example figure showing a non-best alignment with one region in one genome alignment to multiple regions in another would be helpful to understand how this visualization works.

12. amylase family expansion example: it is unclear why there is an unaligned sequence segment in CHM13 if it contains additional copies of the alpha-amylase gene. Are these part of "next-best" alignments that are not shown? If not, why are they not part of an alignment?

PLoS Biol. 2024 May 7;22(5):e3002405. doi: 10.1371/journal.pbio.3002405.r003

Author response to Decision Letter 1

6 Mar 2024

Attachment

Submitted filename: CGV-PLoS Biology- Response to Reviewers.docx

pbio.3002405.s004.docx^{(35.9KB, docx)}

PLoS Biol. doi: 10.1371/journal.pbio.3002405.r004

Decision Letter 2

Richard Hodge

13 Mar 2024

Dear Sanjida,

Thank you for your patience while we considered your revised manuscript "Interactive visualization of whole eukaryote genome alignments using NCBI’s Comparative Genome Viewer (CGV)" for publication as a Methods and Resources Article at PLOS Biology. This revised version of your manuscript has been evaluated by the PLOS Biology editors and the Academic Editor. Please accept my apologies, but I was unable to switch Dong-Ha to corresponding author for the submission as this is a permission that is only granted to the authors when the submission is sent back. I noted that your parental leave begins around 18th March so I hope that you still see this message, but I have cc'ed Dong-Ha in this decision letter to ensure that he is looped in to this correspondence. If you wish, Dong-Ha can be switched to corresponding author upon resubmission and then in principle switched back during the production process.

Based on our Academic Editor's assessment of your revision, I am pleased to say that we are likely to accept this manuscript for publication, provided you satisfactorily address the following data and other policy-related requests that I have provided below (A-E):

(A) We would like to suggest the following modification to the title:

“NCBI Comparative Genome Viewer (CGV) is an interactive visualization tool for the analysis of whole-genome eukaryotic alignments”

(B) You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

-Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

-Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the figure panels as they are essential for readers to assess your analysis and to reproduce it.

NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

(C) Please also ensure that each of the relevant figure legends in your manuscript include information on *WHERE THE UNDERLYING DATA CAN BE FOUND*, and ensure your supplemental data file/s has a legend.

(D) Per journal policy, as the code that you have generated is important to support the conclusions of your manuscript, we require that you make it available without restrictions upon publication. Please ensure that the code is sufficiently well documented and reusable, and that your Data Statement in the Editorial Manager submission system accurately describes where your code can be found.

(E) Please ensure that your Data Statement in the submission system accurately describes where your data can be found and is in final format, as it will be published as written there.

------------------------------------------------------------------------

As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.

We expect to receive your revised manuscript within two weeks.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

- a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list

- a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable, if not applicable please do not delete your existing 'Response to Reviewers' file.)

- a track-changes file indicating any changes that you have made to the manuscript.

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Press*

Should you, your institution's press office or the journal office choose to press release your paper, please ensure you have opted out of Early Article Posting on the submission form. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

Please do not hesitate to contact me should you have any questions.

Kind regards

Richard

Richard Hodge, PhD

Senior Editor, PLOS Biology

rhodge@plos.org

PLOS

Empowering researchers to transform science

Carlyle House, Carlyle Road, Cambridge, CB4 3DN, United Kingdom

ORCiD I plosbio.org I @PLOSBiology I Blog

California (U.S.) corporation #C2354500, based in San Francisco

PLoS Biol. 2024 May 7;22(5):e3002405. doi: 10.1371/journal.pbio.3002405.r005

Author response to Decision Letter 2

5 Apr 2024

Attachment

Submitted filename: CGV-PLoS Biology- Response to Reviewers and Editors.docx

pbio.3002405.s005.docx^{(36.7KB, docx)}

PLoS Biol. doi: 10.1371/journal.pbio.3002405.r006

Decision Letter 3

Richard Hodge

8 Apr 2024

Dear Dong-Ha,

On behalf of my colleagues and the Academic Editor, Andreas Hejnol, I am pleased to say that we can accept your manuscript for publication, provided you address any remaining formatting and reporting issues. These will be detailed in an email you should receive within 2-3 business days from our colleagues in the journal operations team; no action is required from you until then. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have completed any requested changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS

We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have previously opted in to the early version process, we ask that you notify us immediately of any press plans so that we may opt out on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study.

Best wishes,

Richard

Richard Hodge, PhD

Senior Editor, PLOS Biology

rhodge@plos.org

PLOS

Empowering researchers to transform science

Carlyle House, Carlyle Road, Cambridge, CB4 3DN, United Kingdom

ORCiD I plosbio.org I @PLOSBiology I Blog

California (U.S.) corporation #C2354500, based in San Francisco

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Appendix. Examples of CGV showing segmental duplications when non-best placed alignments are included in the view.

(PDF)

pbio.3002405.s001.pdf^{(1.1MB, pdf)}

S1 Data. Alignment coverage at different Mash distances for selected assembly pairs.

(XLSX)

pbio.3002405.s002.xlsx^{(22.6KB, xlsx)}

S1 Fig. Merging, sorting, and ranking assembly-assembly alignments.

(TIF)

pbio.3002405.s003.tif^{(1MB, tif)}

Attachment

Submitted filename: CGV-PLoS Biology- Response to Reviewers.docx

pbio.3002405.s004.docx^{(35.9KB, docx)}

Attachment

Submitted filename: CGV-PLoS Biology- Response to Reviewers and Editors.docx

pbio.3002405.s005.docx^{(36.7KB, docx)}

Data Availability Statement

[pbio.3002405.ref001] 1.Bornstein K, Gryan G, Chang ES, Marchler-Bauer A, Schneider VA. The NIH Comparative Genomics Resource: addressing the promises and challenges of comparative genomics on human health. BMC Genomics. 2023;24(1):575. doi: 10.1186/s12864-023-09643-4 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref002] 2.Rangwala SH, Kuznetsov A, Ananiev V, Asztalos A, Borodin E, Evgeniev V, et al. Accessing NCBI data using the NCBI Sequence Viewer and Genome Data Viewer (GDV). Genome Res. 2021;31(1):159–169. doi: 10.1101/gr.266932.120 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref003] 3.Nassar LR, Barber GP, Benet-Pages A, Casper J, Clawson H, Diekhans M, et al. The UCSC Genome Browser database: 2023 update. Nucleic Acids Res. 2023;51(D1):D1188–D1195. doi: 10.1093/nar/gkac1072 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref004] 4.Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178–192. doi: 10.1093/bib/bbs017 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref005] 5.Buels R, Yao E, Diesh CM, Hayes RD, Munoz-Torres M, Helt G, et al. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol. 2016;17:66. doi: 10.1186/s13059-016-0924-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref006] 6.Cabanettes F, Klopp C. D-GENIES: dot plot large genomes in an interactive, efficient and simple way. PeerJ. 2018;6:e4958. doi: 10.7717/peerj.4958 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref007] 7.Haug-Baltzell A, Stephens SA, Davey S, Scheidegger CE, Lyons E. SynMap2 and SynMap3D: web-based whole-genome synteny browsers. Bioinformatics. 2017;33(14):2197–2198. doi: 10.1093/bioinformatics/btx144 [DOI] [PubMed] [Google Scholar]

[pbio.3002405.ref008] 8.Krzywinski M, Schein J, Birol I, Connors J, Gascoyne R, Horsman D, et al. Circos: an information aesthetic for comparative genomics. Genome Res. 2009;19(9):1639–1645. doi: 10.1101/gr.092759.109 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref009] 9.Herrero J, Muffato M, Beal K, Fitzgerald S, Gordon L, Pignatelli M, et al. Ensembl comparative genomics resources. Database (Oxford). 2016;2016. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref010] 10.Wang H, Su Y, Mackey AJ, Kraemer ET, Kissinger JC. SynView: a GBrowse-compatible approach to visualizing comparative genome data. Bioinformatics. 2006;22(18):2308–2309. doi: 10.1093/bioinformatics/btl389 [DOI] [PubMed] [Google Scholar]

[pbio.3002405.ref011] 11.Xiao Z, Lam HM. ShinySyn: a Shiny/R application for the interactive visualization and integration of macro- and micro-synteny data. Bioinformatics. 2022;38(18):4406–4408. doi: 10.1093/bioinformatics/btac503 [DOI] [PubMed] [Google Scholar]

[pbio.3002405.ref012] 12.Richardson JE, Baldarelli RM, Bult CJ. Multiple genome viewer (MGV): a new tool for visualization and comparison of multiple annotated genomes. Mamm Genome. 2022;33(1):44–54. doi: 10.1007/s00335-021-09904-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref013] 13.Diesh C, Stevens GJ, Xie P, De Jesus MT, Hershberg EA, Leung A, et al. JBrowse 2: a modular genome browser with views of synteny and structural variation. Genome Biol. 2023;24(1):74. doi: 10.1186/s13059-023-02914-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref014] 14.Albert VA, Krabbenhoft TJ. Navigating the CoGe Online Software Suite for Polyploidy Research. Methods Mol Biol. 2023;2545:19–45. doi: 10.1007/978-1-0716-2561-3_2 [DOI] [PubMed] [Google Scholar]

[pbio.3002405.ref015] 15.Lee J, Hong WY, Cho M, Sim M, Lee D, Ko Y, et al. Synteny Portal: a web-based application portal for synteny block analysis. Nucleic Acids Res. 2016;44(W1):W35–W40. doi: 10.1093/nar/gkw310 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref016] 16.Wang Y, Tang H, Debarry JD, Tan X, Li J, Wang X, et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 2012;40(7):e49. doi: 10.1093/nar/gkr1293 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref017] 17.Lovell JT, Sreedasyam A, Schranz ME, Wilson M, Carlson JW, Harkess A, et al. GENESPACE tracks regions of interest and gene copy number variation across multiple genomes. Elife. 2022:11. doi: 10.7554/eLife.78526 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref018] 18.Schultz DT, Haddock SHD, Bredeson JV, Green RE, Simakov O, Rokhsar DS. Ancient gene linkages support ctenophores as sister to other animals. Nature. 2023;618(7963):110–117. doi: 10.1038/s41586-023-05936-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref019] 19.Muffato M, Louis A, Poisnel CE, Roest CH. Genomicus: a database and a browser to study gene synteny in modern and ancestral genomes. Bioinformatics. 2010;26(8):1119–1121. doi: 10.1093/bioinformatics/btq079 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref020] 20.Darling AC, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004;14(7):1394–1403. doi: 10.1101/gr.2289704 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref021] 21.Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, et al. Integrating gene annotation with orthology inference at scale. Science. 2023;380(6643):eabn3107. doi: 10.1126/science.abn3107 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref022] 22.Zhang Z, Schwartz S, Wagner L, Miller W. A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000;7(1–2):203–214. doi: 10.1089/10665270050081478 [DOI] [PubMed] [Google Scholar]

[pbio.3002405.ref023] 23.Bhutkar A, Schaeffer SW, Russo SM, Xu M, Smith TF, Gelbart WM. Chromosomal rearrangement inferred from comparisons of 12 Drosophila genomes. Genetics. 2008;179(3):1657–1680. doi: 10.1534/genetics.107.086108 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref024] 24.Lee Y, Kim B, Jung J, Koh B, Jhang SY, Ban C, et al. Chromosome-level genome assembly of Plazaster borealis sheds light on the morphogenesis of multiarmed starfish and its regenerative capacity. Gigascience. 2022:11. doi: 10.1093/gigascience/giac063 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref025] 25.Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17(1):132. doi: 10.1186/s13059-016-0997-x [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref026] 26.Simakov O, Bredeson J, Berkoff K, Marletaz F, Mitros T, Schultz DT, et al. Deeply conserved synteny and the evolution of metazoan chromosomes. Sci Adv. 2022;8(5):eabi5884. doi: 10.1126/sciadv.abi5884 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref027] 27.Albertin CB, Medina-Ruiz S, Mitros T, Schmidbaur H, Sanchez G, Wang ZY, et al. Genome and transcriptome mechanisms driving cephalopod evolution. Nat Commun. 2022;13(1):2427. doi: 10.1038/s41467-022-29748-w [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref028] 28.Vollger MR, Guitart X, Dishuck PC, Mercuri L, Harvey WT, Gershman A, et al. Segmental duplications and their variation in a complete human genome. Science. 2022;376(6588):eabj6965. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref029] 29.Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44–53. doi: 10.1126/science.abj6987 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref030] 30.Reue K, Lee JM, Vergnes L. Regulation of bile acid homeostasis by the intestinal Diet1-FGF15/19 axis. Curr Opin Lipidol. 2014;25(2):140–147. doi: 10.1097/MOL.0000000000000060 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref031] 31.Zubenko GS, Hughes HB 3rd. Predicted gene sequence C10orf112 is transcribed, exhibits tissue-specific expression, and may correspond to AD7. Genomics. 2009;93(4):376–382. doi: 10.1016/j.ygeno.2008.12.001 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref032] 32.Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–3212. doi: 10.1093/bioinformatics/btv351 [DOI] [PubMed] [Google Scholar]

[pbio.3002405.ref033] 33.Steinberg KM, Schneider VA, Graves-Lindsay TA, Fulton RS, Agarwala R, Huddleston J, et al. Single haplotype assembly of the human genome from a hydatidiform mole. Genome Res. 2014;24(12):2066–2076. doi: 10.1101/gr.180893.114 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref034] 34.Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14(4):708–715. doi: 10.1101/gr.1933104 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref035] 35.Morgulis A, Gertz EM, Schaffer AA, Agarwala R. WindowMasker: window-based masker for sequenced genomes. Bioinformatics. 2006;22(2):134–141. doi: 10.1093/bioinformatics/bti774 [DOI] [PubMed] [Google Scholar]

[pbio.3002405.ref036] 36.Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A. 2003;100(20):11484–11489. doi: 10.1073/pnas.1932072100 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref037] 37.Jiang X, Song Q, Ye W, Chen ZJ. Concerted genomic and epigenomic changes accompany stabilization of Arabidopsis allopolyploids. Nat Ecol Evol. 2021;5(10):1382–1393. doi: 10.1038/s41559-021-01523-y [DOI] [PMC free article] [PubMed] [Google Scholar]

[pbio.3002405.ref038] 38.Mao Y, Catacchio CR, Hillier LW, Porubsky D, Li R, Sulovari A, et al. A high-quality bonobo genome refines the analysis of hominid evolution. Nature. 2021;594(7861):77–81. doi: 10.1038/s41586-021-03519-x [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

The NCBI Comparative Genome Viewer (CGV) is an interactive visualization tool for the analysis of whole-genome eukaryotic alignments

Sanjida H Rangwala

Dmitry V Rudnev

Victor V Ananiev

Dong-Ha Oh

Andrea Asztalos

Barrett Benica

Evgeny A Borodin

Nathan Bouk

Vladislav I Evgeniev

Vamsi K Kodali

Vadim Lotov

Eyal Mozes

Marina V Omelchenko

Sofya Savkina

Ekaterina Sukharnikov

Joël Virothaisakun

Terence D Murphy

Kim D Pruitt

Valerie A Schneider

Roles

Abstract

Introduction

Comparative genome visualization

Genome comparison data

Results

Overview of CGV

Fig 1. Overview of CGV.

Fig 2. CGV dotplot view of Xenopus laevis and Xenopus tropicalis alignment.

Analysis using CGV: Conservation of linkage groups with local rearrangement of synteny

Fig 3. CGV shows conservation of linkage groups in the absence of conservation of gene order.

Analysis using CGV: Detection of gene family expansions between related genomes

Fig 4. CGV can help uncover gene duplications and rearrangements in closely related genomes.

Analysis using CGV: Possible gene translocation between 2 dog assemblies

Discussion

Fig 5. Genome and CDS coverage of assembly-assembly alignments relative to Mash distance.

Materials and methods

Preparation of assembly-assembly alignments

Technical architecture of CGV

Design of CGV application

Supporting information

Acknowledgments

Abbreviations

Data Availability

Funding Statement

References

Decision Letter 0

Richard Hodge

Roles

Decision Letter 1

Richard Hodge

Roles

Author response to Decision Letter 1

Decision Letter 2

Richard Hodge

Roles

Author response to Decision Letter 2

Decision Letter 3

Richard Hodge

Roles

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases