CrusView integrates karyotype information when comparing two genomes, which allows users to perform karyotype-based genome assembly and karyotype-assisted genome synteny analyses with preset karyotype patterns of Brassicaceae genomes.
Abstract
In plants and animals, chromosomal breakage and fusion events based on conserved syntenic genomic blocks lead to conserved patterns of karyotype evolution among species of the same family. However, karyotype information has not been well utilized in genomic comparison studies. We present CrusView, a Java-based bioinformatic application utilizing Standard Widget Toolkit/Swing graphics libraries and a SQLite database for performing visualized analyses of comparative genomics data in Brassicaceae (crucifer) plants. Compared with similar software and databases, one of the unique features of CrusView is its integration of karyotype information when comparing two genomes. This feature allows users to perform karyotype-based genome assembly and karyotype-assisted genome synteny analyses with preset karyotype patterns of the Brassicaceae genomes. Additionally, CrusView is a local program, which gives its users high flexibility when analyzing unpublished genomes and allows users to upload self-defined genomic information so that they can visually study the associations between genome structural variations and genetic elements, including chromosomal rearrangements, genomic macrosynteny, gene families, high-frequency recombination sites, and tandem and segmental duplications between related species. This tool will greatly facilitate karyotype, chromosome, and genome evolution studies using visualized comparative genomics approaches in Brassicaceae species. CrusView is freely available at http://www.cmbb.arizona.edu/CrusView/.
The Brassicaceae (crucifer) plant family contains more than 3,700 species, including the model plant organism Arabidopsis (Arabidopsis thaliana); economically important crop species, such as Brassica rapa and Brassica napus; and close relatives of Arabidopsis used in abiotic stress research, such as Eutrema salsugineum and Schrenkiella parvula. Because Brassicaceae plants have high scientific and economic importance, several whole-genome sequencing projects of the species in this family have been recently launched (http://www.brassica.info). Moreover, Brassicaceae is also a good system for population genomics. The 1001 Arabidopsis Genomes Project (http://www.1001genomes.org/) plans to generate complete genome sequences for 1,001 Arabidopsis strains to study the associations between genetic variation and phenotypic diversity. The Value-directed Evolutionary Genomics Initiative project aims to understand the genome evolution of Brassicaceae species by sequencing several close relatives of Arabidopsis, such as Arabidopsis lyrata and Capsella rubella. Recent advances in high-throughput sequencing technology have greatly expedited these whole-genome sequencing projects of versatile nonmodel organisms. Although increasingly longer reads can now be produced from high-throughput sequencing experiments, de novo assembler tools can only generate contig and/or scaffold sequences from high-throughput sequencing reads. These tools cannot generate complete chromosome sequences without genetic and/or physical maps that typically require years to create. This limitation makes chromosome-scale structural variation (i.e. translocation, inversion, deletion and insertion, and segmental and tandem duplication) and genomic macrosynteny analyses difficult to perform.
In both plants and animals, genomes of species within the same family have evolved with conserved karyotype patterns due to the rearrangements of large chromosomal segments. Chromosomal karyotypes can be obtained from comparative chromosomal painting (CCP) experiments by performing in situ hybridization experiments on bacterial artificial chromosome sequences between related species. The genome of each Brassicaceae member is composed of 24 conserved genomic blocks that have been considered as the basic units of chromosomal rearrangement during genome evolution (Lysak et al., 2006). The sizes of these conserved blocks range from several to dozens of megabases. Currently, karyotypes profiled by CCP experiments in approximately 20 Brassicaceae species are available; such karyotypes include those from Arabidopsis (n = 5), Hornungia alpina (n = 6), Eutrema spp. (n = 7), A. lyrata (n = 8), B. rapa (n = 10), and Polyctenium fremontii (n = 14). By utilizing the karyotype information in Brassicaceae, we have developed a tool, KGBassembler (for Karyotype-based Genome assembler for Brassicaceae), to finalize the assembly of chromosomes from scaffolds/contigs without relying on a genetic/physical map (Ma et al., 2012).
Over the past 2 years, complete whole-genome sequences of several Brassicaceae species have been released, including the aforementioned A. lyrata, S. parvula, B. rapa, and E. salsugineum (Dassanayake et al., 2011; Hu et al., 2011; Wang et al., 2011; Wright and Agren, 2011; Wu et al., 2012; Yang et al., 2013). These genomic resources have opened a new era of comparative genomics in Brassicaceae to better understand the genomic evolution (Cheng et al., 2012). Numerous tools and databases are available for performing comparative genomics analysis in plants. CoGe is a comparative genomics analysis platform that is now a part of the iPlant Collaborative Project (Goff et al., 2011). The CoGe database currently includes nearly 2,000 genome sequences of approximately 1,500 organisms, allowing users to perform online visual analyses of genome synteny and duplication events (Tang and Lyons, 2012). PLAZA and Vista are also Web-based databases that provide comparative analysis services on the genomic data deposited in the databases (Frazer et al., 2004; Van Bel et al., 2012). Other stand-alone bioinformatic applications for comparative genomic analysis, such as Easyfig and genoPlotR, are commonly used to generate synteny plots of given genome segments at a scale ranging from a single gene to one chromosome (Guy et al., 2010; Sullivan et al., 2011).
In this work, we present a Java-based bioinformatic application, CrusView, for performing visualized analyses of genome synteny and karyotype evolution in Brassicaceae species. CrusView features a user-friendly graphical user interface (GUI) implemented with Standard Widget Toolkit (SWT)/Swing graphics libraries and a SQLite database used to manage local genomic data. Compared with the most commonly used tools in comparative genomics, one of the unique features of CrusView is that available karyotype data of a Brassicaceae species are incorporated to facilitate karyotype-based chromosome assembly and analyses of chromosomal structural evolution. Compared with Web-based tools, the stand-alone CrusView tool was also designed to give users higher flexibility in analyzing currently unpublished genome data and integrating self-defined genomic information based on the users’ interests, such as gene families, gene duplications, chromosomal break points, Gene Ontology terms, and groups of orthologs/paralogs, with the genomic synteny maps. In addition, CrusView can generate images representing genomic synteny between two compared genomes in PNG/SVG/PDF high-resolution formats that are suitable for publication.
RESULTS AND DISCUSSION
To demonstrate the basic functionality of CrusView, we prepared two example genomes and related data sets from Arabidopsis (n = 5) and E. salsugineum (n = 7) to perform visualized comparative genomics analyses. E. salsugineum (also known as salt cress and Thellungiella halophila) is a halophytic relative of Arabidopsis; it inhabits the seashore saline soils of eastern China. Because E. salsugineum and Arabidopsis share similar life cycles, morphological characters, and genetic composition, E. salsugineum has been widely used in plant salt-tolerance studies using the genetic systems and molecular tools previously established in Arabidopsis. The E. salsugineum genome (243 Mb) contains seven chromosomes and approximately 24,000 protein-coding genes (Yang et al., 2013). The karyotype maps derived from CCP experiments of both E. salsugineum and Arabidopsis are currently available (Lysak et al., 2006). We used these two genomes to demonstrate the karyotype-based genome assembly of the E. salsugineum chromosomes and the comparative analyses of E. salsugineum and Arabidopsis with integrated karyotype information.
Overview of the Functional Panels in CrusView
CrusView can be launched via Web start at http://www.cmbb.arizona.edu/crusview. The navigation panel includes quick buttons that perform basic operations in CrusView. The published karyotypes of 20 Brassicaceae species have been integrated into CrusView, and they are shown in the left “karyotype” panel. We will constantly collect the published karyotypes generated based on CCP experiments. Each time CrusView is launched, the program will automatically query the CrusView server to update the local karyotype database. Genomic data files from E. salsugineum and Arabidopsis can be imported into the SQLite database to run a demonstration for users who run CrusView for the first time. The primary visualization window shows the seven chromosomes of the primary E. salsugineum genome (Fig. 1). The protein-coding genes of E. salsugineum are designated with the corresponding colors based on the conserved genomic blocks in which they are located. The top right panel shows the color schemes and the letter labels for the 24 genomic blocks (A–X), while the bottom right panel shows the five chromosomes of the secondary Arabidopsis genome (Fig. 1). The information window displays the genomic annotations of the genes in the primary genome recorded in the Browser Extensible Data (BED) file, including the gene identifiers (IDs), chromosomal locations, genomic block IDs, orthologous group IDs, sequence similarities with the homologs in the secondary genome, gene functional descriptions, and other user-defined information (Fig. 1). The user can switch the primary and secondary genomes, zoom in/out of the chromosome images, perform a query for genes of interest, and invoke a chromosome-level comparison window using the quick buttons in the navigation panel.
Figure 1.
Functional panels in the CrusView main screen. A, Navigation panel. B, List of available karyotypes in Brassicaceae species. C, Main window showing the primary genome (E. salsugineum). D, Color scheme and letter labels of the 24 conserved genomic blocks. E, Window showing the secondary genome (Arabidopsis). F, Gene annotation panel. G, Digital ancestral karyotype of A. lyrata. H, Digital karyotype of Arabidopsis. I, Digital karyotype of E. salsugineum.
Visualized Karyotype Comparison between E. salsugineum and Arabidopsis
One of the unique functions of CrusView is that it can generate the digital karyotype of a genome, allowing users to visually compare the chromosomal karyotypes of the primary and secondary genomes. The A. lyrata (n = 8) genome represents an ancestral karyotype in the Brassicaceae family in which each member’s genome is composed of 24 conserved genomic blocks according to the karyotype analyses of several representative species in the family using CCP experiments (Lysak et al., 2006). Each conserved genomic block is a large chromosomal segment that can be represented by a group of Arabidopsis genes in synteny with their orthologs in the genomes of other Brassicaceae species. Thus, the Arabidopsis genes can be used as markers to infer the assignment of the 24 conserved genomic blocks to another species’ genome in Brassicaceae (Lysak et al., 2006; Yang et al., 2013). Our previously developed software program KGBassembler includes a pipeline to assign the genes in a Brassicaceae species genome to the 24 conserved genome blocks, with a color scheme and a letter label (A–X) based on the homology with Arabidopsis genes (Ma et al., 2012). Here, we elucidate this procedure using E. salsugineum as a newly sequenced genome based on three basic steps: first, the Arabidopsis amino acid sequences were mapped to the E. salsugineum scaffold sequences using BLAST, followed by the selection of the best aligned locations; second, the Arabidopsis genes mapped onto the E. salsugineum scaffolds were used to infer the conserved genomic blocks, followed by the assignment of the color schemes and letter labels of the 24 blocks to the E. salsugineum genes; and third, pseudochromosome sequences were generated based on the CCP-derived (n = 7) karyotype of E. salsugineum. This pipeline was integrated into CrusView and can be applied to any newly sequenced Brassicaceae species genome to perform karyotype-based genome assembly and generate digital karyotypes for comparison purposes.
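As an illustration only, the first two steps described above (selecting the best aligned location for each Arabidopsis gene, then inferring a block label per scaffold) can be sketched in Python. This is not the KGBassembler implementation: the tuple layout of the alignment hits, the `at_block` lookup table, the majority-vote rule, and both function names are assumptions made for this example.

```python
from collections import Counter

def best_hits(hits):
    """Keep the highest-scoring scaffold location for each Arabidopsis gene.

    `hits` is an iterable of (at_gene, scaffold, start, score) tuples,
    a simplified, hypothetical stand-in for parsed alignment output.
    """
    best = {}
    for at_gene, scaffold, start, score in hits:
        if at_gene not in best or score > best[at_gene][2]:
            best[at_gene] = (scaffold, start, score)
    return best

def scaffold_blocks(best, at_block):
    """Label each scaffold with the block carried by most of its mapped genes.

    `at_block` maps an Arabidopsis gene ID to a block letter (A-X).
    Genes missing from the table vote for "0" (undetermined).
    The majority vote is an assumption of this sketch.
    """
    votes = {}
    for at_gene, (scaffold, _start, _score) in best.items():
        block = at_block.get(at_gene, "0")
        votes.setdefault(scaffold, Counter())[block] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}
```

A real pipeline would also need to handle ties, scaffolds spanning more than one block, and alignment quality filters, none of which are modeled here.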
In CrusView, the digital karyotypes of the primary and secondary genomes will greatly facilitate visualized genomic comparison and the identification of major chromosomal rearrangement events that shaped the chromosomal karyotype of the studied Brassicaceae species genome. For example, Arabidopsis chromosome 2 (AtChr2) resulted from the merging of E. salsugineum chromosome 4 (EsChr4) and the long arm (14–37 Mb) of EsChr3 (Fig. 1). Moreover, when compared with the ancestral karyotype of the eight A. lyrata chromosomes, users may study the different evolutionary paths of the karyotype in another species. For example, although AtChr1 resulted from the merging of A. lyrata AlChr1 and AlChr2, the structure of EsChr1 remains unchanged compared with AlChr1 (Fig. 1). Furthermore, users can search for the IDs of genes of interest or for ortholog group IDs from the navigation panel and map their positions on the compared primary and secondary genomic karyotypes.
Visualized Fine Adjustment of Pseudochromosome Assembly in CrusView
The automatic generation of pseudochromosome sequences based on the KGBassembler algorithm may miss or misplace certain scaffolds that are either relatively short or contain too many repetitive sequences and therefore do not carry sufficient gene synteny information for inferring the assignment of conserved genomic blocks. Additionally, de novo scaffold assembly is usually interrupted at the edges of highly repeated centromere sequences. Thus, manual adjustment of the pseudochromosomes may be necessary. Unlike KGBassembler, in which users need to edit a text file for manual adjustment, CrusView allows users to perform visualized fine adjustment of pseudochromosome assembly in the GUI and to consider additional genomic information, such as positions of genetic markers, centromere-specific (CentO) tandem repeats, and the density of protein-coding genes, during the adjustment. Users can directly load the project result produced in KGBassembler for visualized fine adjustment or use the “assembling” function in CrusView to assemble pseudochromosomes from the scaffold sequences. When the assembling function in CrusView is run for the first time, users must indicate the working folder containing the required input files described in “Materials and Methods” and an output folder to save the generated chromosome sequences. Users may set up necessary parameters in the “parameter panel” and save the parameters into a Windows Initialization (INI) configuration file that can be directly loaded to run the assembling function (Fig. 2). The details of the parameters are explained in the KGBassembler manual, and users may wish to apply different parameter settings to produce the best assembly, which is largely dependent on the quality of the scaffold sequences themselves as generated by de novo assembler tools.
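To make the idea of a reusable INI parameter file concrete, the following Python sketch reads such a file with the standard `configparser` module. The section name `assembly` and the keys `working_dir`, `output_dir`, and `min_scaffold_length` are hypothetical placeholders chosen for this example; the actual parameter names are those documented in the KGBassembler manual.

```python
import configparser

# Hypothetical example file; the section and key names are NOT the real ones.
EXAMPLE_INI = """
[assembly]
working_dir = ./input
output_dir = ./output
min_scaffold_length = 10000
"""

def load_assembly_parameters(ini_text):
    """Parse INI text and return the assembly parameters as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["assembly"]
    return {
        "working_dir": section.get("working_dir"),
        "output_dir": section.get("output_dir"),
        # Fall back to 0 (no length filter) when the key is absent.
        "min_scaffold_length": section.getint("min_scaffold_length", fallback=0),
    }
```

Keeping the parameters in a plain-text file of this kind makes it easy to rerun an assembly with slightly different settings and to record which settings produced a given result.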
Figure 2.
Genome assembling function. A, Digital karyotype of E. salsugineum. B, Unplaced short-scaffold sequences. C, Parameter panel. D, Menu bar. E, Main working panel for the manual curation of the genome assembly of E. salsugineum. F, Density of protein-coding genes on scaffolds. G, CentO tandem repeat. H, Genetic marker track.
To fine-tune the draft pseudochromosome sequences, CrusView allows users to add files containing genetic markers and CentO tandem repeats. In plants, CentO sequences are approximately 170-bp motifs that are tandemly arrayed and specifically located in the core centromeric regions (Benson, 1999). CentO repeats located at one end of a long scaffold generally indicate the centromeric end of that scaffold (Fig. 2). Moreover, the density of protein-coding genes is typically higher in the euchromatic regions of short and long arms than in the pericentromeric heterochromatic regions (Fig. 2). Thus, these types of information are very useful in assisting users to further inspect and adjust the scaffold layouts and orientations on the chromosomes as well as the genomic positions of the genetic markers. Users can simply perform drag-and-drop actions with a mouse to correct potentially misplaced scaffolds or to adjust the orientation of scaffolds. Once manual adjustment is complete, users can save the pseudochromosome sequences to a FASTA file and simultaneously generate the gene annotation file. Finally, users can use the “push to main screen” function to directly add the assembled pseudochromosome and perform further visualized comparative analyses.
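The gene-density track mentioned above can be approximated with a simple windowed count. The Python sketch below is an illustration under stated assumptions, not CrusView code: gene positions are taken as 0-based start coordinates, the window size is arbitrary, and the rule "the gene-poor half points toward the centromere" is a heuristic introduced only for this example.

```python
def gene_density(gene_starts, scaffold_length, window=100_000):
    """Count genes per fixed-size window along a scaffold (0-based starts)."""
    n_windows = (scaffold_length + window - 1) // window
    counts = [0] * n_windows
    for pos in gene_starts:
        if 0 <= pos < scaffold_length:
            counts[pos // window] += 1
    return counts

def gene_poor_end(counts):
    """Return 'left' or 'right' for the gene-poor half, or None if tied.

    Heuristic assumption: pericentromeric regions carry fewer genes,
    so the gene-poor end is the candidate centromeric end.
    """
    half = len(counts) // 2
    left = sum(counts[:half])
    right = sum(counts[len(counts) - half:])
    if left == right:
        return None
    return "left" if left < right else "right"
```

In practice such a hint would be weighed together with the CentO repeat locations and genetic markers rather than used on its own.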
Visualization of Genomic Synteny between Two Genomes
The “compare two genomes” function in CrusView can provide a visualization of genomic synteny for each pair of homologous chromosomes for the primary and secondary genomes. Chromosome-scale genomic synteny can be visualized in two ways: as a chromosomal karyotype with homologous genes linked between the two chromosomes and as a dot plot indicating chromosomal macrosynteny with duplication events (Fig. 3A). For example, a comparison of the karyotypes of EsChr4 and AtChr2 indicated that Arabidopsis chromosome 2 resulted from an event in which the entire chromosome 4 (genomic blocks I and J) merged with the long arm of chromosome 3 (genomic blocks K, G, and H) in E. salsugineum (Fig. 3A). In addition, the visualized chromosomal synteny with karyotype information can also allow users to examine the differences in the chromosome structures between the two genomes. For instance, the 8-Mb-long region from 27 to 35 Mb of block J on EsChr4 remains highly similar to the 7-Mb-long region from 13 to 20 Mb on AtChr2, whereas the 25-Mb-long I block of EsChr4 has seemingly expanded dramatically, with highly enriched repetitive sequences and transposable elements, compared with the corresponding approximately 17-Mb I block region on AtChr2. More interestingly, a small region of EsChr4 between positions 10 and 11 Mb was found to result from the inverted translocation of a region from AtChr2. The selection of a genomic region with the mouse can invoke the information window, which contains the genes located in the regions of interest. By clicking on a gene homologous to the corresponding Arabidopsis gene, users will be redirected to The Arabidopsis Information Resource database, which contains detailed gene function information.
Figure 3.
Visualization of genome synteny and gene alignment. A, Panels for genome synteny visualization: a, navigation bar; b, primary genome; c, secondary genome; d, chromosome synteny; e, dot plot; f, genes in the selected area; g, action list; h, selection of segmental duplication; i, genes in the ortholog groups. B, Alignment of multiple gene members in the CDPK family showing tandem duplication events. C, Exon-level alignment of the SALT OVERLY SENSITIVE1 (SOS1) genes between Arabidopsis and E. salsugineum.
Chromosome-scale genomic synteny can also be visualized as a dot plot in CrusView to facilitate the identification of segmental duplication and tandem duplication events between the two compared species. From the dot-plot screen, users can select the regions containing duplication events of interest with the mouse to obtain information regarding the genes located in the selected regions (Fig. 3A). Right clicking the mouse will invoke a pull-down list of advanced actions, such as querying selected genes in the external The Arabidopsis Information Resource database to view detailed functional descriptions, retrieving gene sequences to a FASTA file, performing exon-level sequence alignment for a single gene, and aligning multiple genes in a user-defined synteny region using AJaligner. Figure 3B demonstrates a genomic region between 23.8 and 24.1 Mb on AtChr4 encompassing two tandem duplication events of the gene members in the calcium-dependent protein kinase (CDPK) family that may be involved in stress-responsive pathways in Arabidopsis. While AtCDPK27 and AtCDPK31 represent a pair of tandemly duplicated genes that correspond to the single-copy E. salsugineum gene Thhalv10028618m.g, AtCDPK21 and AtCDPK23 correspond to the single-copy gene Thhalv10028567m.g (Fig. 3B). An exon-level sequence alignment of a pair of interesting orthologous genes will reveal exon-level structural variations, amino acid variations, insertions and deletions, and single-nucleotide polymorphisms, which is illustrated by the comparison of SALT OVERLY SENSITIVE1 in Arabidopsis and its E. salsugineum ortholog (Fig. 3C).
Visualization of a User-Defined List of Genes, Duplication Events, and Copy Number Variations in a Genomic Synteny Plot
Using CrusView, users may visualize a group of genes of interest in the two compared genomes to determine their associations with genomic synteny and possible duplication events. We demonstrate this utility by analyzing the tandemly duplicated F-box superfamily that has been found to display great copy number variations between Arabidopsis (505 genes) and E. salsugineum (613 genes). First, the genes in E. salsugineum were assigned to the orthologous groups annotated in the OrthoMCL database (Li et al., 2003). Each ortholog group indicated by a unique ID contains the putative orthologous genes in Arabidopsis and E. salsugineum. We found that one of the ortholog groups (OG5_127192) that showed high variation in copy number contained 148 and 130 F-box genes in Arabidopsis and E. salsugineum, respectively. In plants, F-box genes constitute a large superfamily encoding components of E3 ubiquitin ligases involved in substrate-specific protein degradation. Next, using the “predict tandem duplication” function in CrusView, highly homologous genes defined with a cutoff of 40% protein identity and located adjacent to each other within a 5-kb window were highlighted in green in the dot plot of EsChr3 and AtChr3 (Fig. 4). The protein identity cutoff and window size can both be adjusted by the user when predicting tandem duplications. Then, using the “keyword search” function, a group of genes of interest was displayed in the current dot plot. For instance, when searching ID OG5_127192, F-box genes classified in this ortholog group by OrthoMCL were highlighted in red in the same dot-plot image (Fig. 4). From the overlapping green dots (tandemly duplicated genes) and red dots (F-box genes in group OG5_127192), we observed a macrosyntenic block covering an approximately 5-Mb region on AtChr3 and an approximately 15-Mb region on EsChr3, encompassing 59 and 78 tandemly arrayed F-box genes in Arabidopsis and E. salsugineum, respectively (Fig. 4).
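The two adjustable thresholds used for tandem duplication prediction (protein identity and window size) lend themselves to a short sketch. The Python function below is an illustration, not the CrusView implementation: it assumes that "within a window" means the gap between the end of one gene and the start of the next gene on the same chromosome, and the input layouts and function name are hypothetical.

```python
def predict_tandem_duplicates(genes, identity, min_identity=40.0, window=5_000):
    """Return neighboring gene pairs that pass both thresholds.

    genes:    list of (gene_id, chrom, start, end) tuples
    identity: dict mapping frozenset({id_a, id_b}) -> percent protein identity
    """
    # Sort by chromosome, then start, so consecutive items are neighbors.
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    pairs = []
    for a, b in zip(ordered, ordered[1:]):
        same_chrom = a[1] == b[1]
        gap = b[2] - a[3]  # distance from end of gene a to start of gene b
        pct = identity.get(frozenset((a[0], b[0])), 0.0)
        if same_chrom and gap <= window and pct >= min_identity:
            pairs.append((a[0], b[0]))
    return pairs
```

Raising `min_identity` or shrinking `window` makes the prediction stricter, which mirrors the user-adjustable cutoff and window size described above.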
Figure 4.
Mapping duplication events and genes of interest onto the dot-plot synteny map. A dot-plot synteny map of EsChr3 and AtChr3 is shown. The blue dots represent homologous gene pairs in the Arabidopsis and E. salsugineum genomes. The blue dots arranged along the diagonal line indicate a macrosynteny region. The aligned blue dots deviating from the diagonal line indicate segmental duplications. The green dots represent potential tandemly duplicated genes selected using a protein identity cutoff of 40% and a 5-kb window size. The red dots represent F-box genes selected by a keyword search. The overlapping red dots and green dots indicate the tandemly duplicated F-box genes on EsChr3 and AtChr3.
Similarly, users can add further genomic information to the BED file to allow searching for self-defined keywords, such as Gene Ontology terms, gene functional descriptions, or gene families. CrusView also allows users to filter a list of genes or genomic positions of interest from the user-defined genomic information file, which can be displayed on the dot-plot synteny map. Users can define the color schemes for different gene groups on the plots using the setting function of CrusView. Finally, the digital karyotype maps, macrosynteny plots based on the 24 color-coded genomic blocks, and dot-plot synteny map showing duplication events and mapped genes of interest can be saved as publication-quality PNG/SVG/PDF images.
CONCLUSION
In this work, we developed a Java-based bioinformatic application, CrusView, using the SWT/Swing graphics libraries in Java and a SQLite database; this application was designed to facilitate research in comparative genomics. We demonstrated the basic functionality of CrusView by performing a visual comparison of the Arabidopsis and E. salsugineum genomes in the plant Brassicaceae (crucifer) family. Compared with other bioinformatic tools that have been developed for similar purposes, one of CrusView’s unique features is its incorporation of genomic karyotype information derived from CCP experiments. The karyotype of a species associated with the genome structure visualized in CrusView can greatly assist users in identifying chromosomal rearrangements, genomic synteny, and major duplication events among the related species. Thus, this unique CrusView feature may facilitate the understanding of karyotype, chromosome, and genome evolution based on a comparative genomics approach. Furthermore, by taking advantage of a species’ karyotype, CrusView provides a unique function to infer pseudochromosome sequences from scaffold sequences generated by de novo assemblers based on conserved genomic blocks. This feature is especially convenient for nonmodel species that lack a genetic and/or physical map. However, users should be aware that CrusView does not replace de novo assembler tools, and its performance in finalizing the assembly of a pseudochromosome sequence depends largely on the quality of the scaffolds and contigs produced from whole-genome shotgun sequencing projects.
CrusView also includes an array of utilities that can be used to visualize genome synteny and duplication events and to map a list of genes of interest associated with syntenic regions between the two analyzed genomes. Compared with database-based comparative genomics tools, CrusView is much more flexible in the ability to analyze unpublished genomes; it allows users to integrate self-defined genomic information, such as Gene Ontology classifications, gene families of interest, hot spots of chromosomal breakage/fusion points, high-frequency recombination sites, and tandem duplication to study their correlations with genomic variations and duplication events. User-defined information and genome synteny plots can be exported as high-resolution, publication-quality PNG/SVG/PDF images.
Karyotype mapping based on in situ hybridization experiments is a common genomic technique that is widely used in animals and plants. Conserved patterns of chromosomal rearrangements based on syntenic genomic blocks as basic units of chromosomal breakage and fusion events are commonly observed in the animal and plant kingdoms (Lysak et al., 2006; Ferguson-Smith and Trifonov, 2007). Therefore, although CrusView was primarily developed and preset based on the karyotype evolution patterns in the Brassicaceae family (primarily for the convenience of the Brassicaceae community), this software program may also be used to perform karyotype-based genome assembly or karyotype-assisted genome synteny analysis in other plant families or in other organisms for which karyotype data exist. If users wish to use the current version of CrusView for non-Brassicaceae species, they can access the “setting” function to define the color schemes and letter labels of the conserved genomic blocks based on the karyotype evolution patterns of the species of interest. Additionally, to promote the broad use of CrusView in other organisms, the source code of CrusView has been released through Sourceforge.net to allow academic users to freely download and modify the programs.
MATERIALS AND METHODS
Basic Input Files for CrusView
CrusView utilizes the Java Web-start function so that it can be launched through the CrusView homepage. When it is run for the first time, CrusView creates a “CrusView” folder on the user’s local computer and automatically installs the programs and basic data set in the folder. CrusView simultaneously creates a local Java SQLite database to manage the genomic data that the user wishes to analyze. The data files include a FASTA file containing chromosome or scaffold/contig sequences and a GFF file containing gene model annotation that will be imported into the SQLite database. The user must also prepare a BED file in the “bed” folder to provide additional information, such as ortholog group IDs, genome block IDs, and protein sequence identities between the primary and secondary genomes. To enable the advanced search function, the BED file may also include the user’s self-defined genomic information and functional descriptions added in the last column, such as Gene Ontology terms, gene families, and recombination hot spots. To analyze a specific group of genes of interest, the user can load a TXT file containing the gene IDs or genomic positions and their further descriptions into CrusView through the provided functions.
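A keyword search over the free-text last column of such an annotation file can be sketched as follows. The eight-column layout in `COLUMNS` is an assumption made for this illustration only; the source states which kinds of information the BED file carries but not the exact column order, so the CrusView manual should be consulted for the real format.

```python
import csv

# Hypothetical column order, assumed for this sketch only.
COLUMNS = ["chrom", "start", "end", "gene_id", "block",
           "ortholog_group", "identity", "description"]

def read_annotations(lines):
    """Parse tab-separated annotation lines into a list of dicts."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        row = dict(zip(COLUMNS, fields))
        row["start"], row["end"] = int(row["start"]), int(row["end"])
        rows.append(row)
    return rows

def keyword_search(rows, keyword):
    """Return IDs of genes whose description contains the keyword
    or whose ortholog group ID equals it (case-insensitive)."""
    key = keyword.lower()
    return [r["gene_id"] for r in rows
            if key in r.get("description", "").lower()
            or key == r.get("ortholog_group", "").lower()]
```

Because the search only inspects text columns, any user-defined term added to the last column (for example a Gene Ontology term or a gene family name) becomes searchable without further changes.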
Input Files for Karyotype-Based Genome Assembly
For species only containing scaffold sequences but with an available CCP-derived karyotype map, a karyotype-based genome assembly of pseudochromosomes from scaffold sequences is recommended. The KGBassembler will be invoked by the “assembling” function in CrusView. The assembly function requires the following input files: a KARYOTYPE file containing CCP-based karyotype information obtained from the CrusView Web site or prepared by the user based on instructions, a PSL file containing Arabidopsis (Arabidopsis thaliana) genes aligned on the scaffolds, and a FASTA file containing scaffold sequences. The user can either provide a configuration file in Windows Initialization (INI) format or edit the “Parameter” tab in the CrusView interface to set up necessary parameters for assembly. If a genetic map with gene marker information is prepared by the user as a GMM file with designated format described in the CrusView manual, CrusView may also incorporate this information during the manual adjustment of the pseudochromosomes. To facilitate the prediction of scaffold orientations on the pseudochromosomes, the user may run the tandem repeat finder software program (Benson, 1999) to identify the scaffolds containing CentO sequences. CentO repeat locations formatted as a BED file can be loaded into CrusView as an additional track.
After the KGBassembler has generated the pseudochromosome sequences, the user may use CrusView to perform fine adjustments to the orientations and orders of the scaffolds on the pseudochromosomes based on the additional information provided by the user, such as the density of protein-coding genes, user-customized genetic markers, and the locations of CentO tandem repeats on the scaffolds. CrusView has been implemented with an enhanced GUI that can be used to further adjust the pseudochromosome assembly using dragging-and-placing mouse actions. By clicking the “save assembly” button, the pseudochromosome sequences and gene annotation information will be saved in a FASTA file and a GFF file, respectively.
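Conceptually, adjusting the order and orientation of scaffolds amounts to editing an ordered layout and then rebuilding the sequence from it. The Python sketch below illustrates that idea only; the `(scaffold_id, orientation)` layout, the run of `N` characters used as a gap, and the function names are assumptions of this example rather than a description of how CrusView or KGBassembler store or write assemblies.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def build_pseudochromosome(layout, scaffolds, gap=100):
    """Join scaffolds in layout order into one sequence.

    layout:    list of (scaffold_id, orientation), orientation '+' or '-'
    scaffolds: dict mapping scaffold_id -> sequence
    gap:       number of N characters placed between neighbors
               (an arbitrary choice for this sketch)
    """
    parts = []
    for scaffold_id, orientation in layout:
        seq = scaffolds[scaffold_id]
        parts.append(seq if orientation == "+" else reverse_complement(seq))
    return ("N" * gap).join(parts)
```

Swapping two entries in `layout`, or flipping an orientation from `+` to `-`, corresponds to the reordering and reorienting of scaffolds described above; gene coordinates in the accompanying annotation would need to be shifted accordingly.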
Conversion of a User’s Unpublished Genome Sequence and Self-Defined Gene Annotation to Input Files Compatible with CrusView
To help the user analyze a yet-to-be-published genome sequence, CrusView includes a function for preparing the required input files. The user must provide the genome or scaffold sequences in FASTA format, the gene annotation file in GFF or GTF format, and, for karyotype-based assembly of pseudochromosome sequences, one additional karyotype file. The user is also prompted to submit protein sequences to the OrthoMCL online database (http://www.orthomcl.org) so that the genes can be assigned to their corresponding ortholog groups, which facilitates genome comparisons, gene duplication analyses, and copy number variation analyses. To assign the 24 conserved genomic block IDs to the genes, the user must provide a BLAST result of the protein sequences of the analyzed genome against Arabidopsis proteins. Any additional genomic information that the user wishes to include is integrated into the last column of the BED file to enable the keyword search function in CrusView.
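The block-assignment step can be sketched in Python as follows, assuming the BLAST result is in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, bit score in column 12). The lookup table from Arabidopsis genes to block IDs is a hypothetical placeholder that the user would have to supply, and choosing the best hit by bit score is an assumption rather than CrusView's documented procedure. Genes without a mapped hit receive "0", the label CrusView uses for undetermined regions.

```python
from typing import Dict, Iterable, Tuple


def best_hits(blast_lines: Iterable[str]) -> Dict[str, str]:
    """Map each query gene to its highest-scoring Arabidopsis subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def assign_blocks(hits: Dict[str, str],
                  arabidopsis_block: Dict[str, str]) -> Dict[str, str]:
    """Give each gene the block ID of its best Arabidopsis hit, else "0"."""
    return {gene: arabidopsis_block.get(subject, "0")
            for gene, subject in hits.items()}
```

The resulting gene-to-block mapping is the kind of per-gene annotation that could then be written alongside the gene coordinates in the BED file.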
Inference of Genomic Macrosynteny Based on Conserved Genomic Blocks
The genomes of the Brassicaceae species share 24 conserved genomic blocks (large chromosomal segments) designated A to X. CrusView uses the additional ID “0” to label undetermined regions that are not assigned to any genomic block. The chromosomal locations of the 24 genomic blocks can be inferred from the CCP-derived karyotype. Genes located within the same conserved genomic block are assigned the same designated color code to illustrate the digital karyotype of the studied species. Genes that share the same genomic block ID are considered to lie in the same genomic macrosyntenic region. To analyze a genome that lacks a CCP-derived karyotype, or a genome from another plant or animal family with different conserved genomic blocks, the user can define custom block IDs with hexadecimal color codes in the BED file.
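The macrosynteny rule above (genes sharing a block ID lie in the same macrosyntenic region) can be sketched in Python as follows. The function names are hypothetical, and the accepted color syntax (six hexadecimal digits with an optional leading `#`) is an assumption; the exact syntax expected in the BED file should be taken from the CrusView manual.

```python
import re
import string
from collections import defaultdict
from typing import Dict, List, Tuple

# The 24 conserved blocks A to X, plus "0" for undetermined regions.
BLOCK_IDS = list(string.ascii_uppercase[:24]) + ["0"]

# Assumed color syntax: six hexadecimal digits, optional leading "#".
HEX_COLOR = re.compile(r"#?[0-9A-Fa-f]{6}")


def is_hex_color(code: str) -> bool:
    """Check a user-defined color code against the assumed syntax."""
    return HEX_COLOR.fullmatch(code) is not None


def shared_blocks(genes_a: Dict[str, str],
                  genes_b: Dict[str, str]) -> Dict[str, Tuple[List[str], List[str]]]:
    """Group genes of two genomes by block ID and keep blocks found in both.

    genes_a and genes_b map gene IDs to block IDs. Genes labeled "0"
    are ignored because they belong to no conserved block.
    """
    def group(genes: Dict[str, str]) -> Dict[str, List[str]]:
        grouped: Dict[str, List[str]] = defaultdict(list)
        for gene, block in genes.items():
            if block != "0":
                grouped[block].append(gene)
        return grouped

    a, b = group(genes_a), group(genes_b)
    return {block: (sorted(a[block]), sorted(b[block]))
            for block in sorted(set(a) & set(b))}
```

For a genome outside Brassicaceae, the same grouping works with any user-defined set of block IDs, since the function does not depend on the A to X labels.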
Visualization of Chromosomal Karyotype, Genomic Synteny, and Gene Alignment
The GUI and visualization functions of CrusView were implemented with the Java SWT/Swing libraries. The genomic data of an analyzed species can be visualized at three levels: the genome level, the chromosome level, and the gene level. If karyotype information has been associated with the studied genome, all of the chromosomes are displayed with the 24 genomic block IDs in their corresponding colors. The user can select any two chromosomes of interest from the two compared species to visualize chromosomal synteny. When the karyotypes of two chromosomes are compared, pairs of orthologous genes between the two species are linked to indicate major chromosomal rearrangement events. CrusView also generates a dot plot for each pair of selected chromosomes to visualize tandem and segmental duplication events. The user may select a group of genes in the dot plot by dragging a selection frame with the mouse to trigger gene-level visualization. At this level, CrusView can display a multigene alignment within a designated genomic region (less than 1 Mb) between the two genomes and an exon-to-exon alignment of one pair of orthologous genes with single-nucleotide polymorphism information.
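The data behind such a dot plot can be sketched in Python as follows. This is an illustration under stated assumptions, not CrusView's plotting code: each gene is represented by a hypothetical (gene ID, position in bp, ortholog group) tuple, and two genes produce a dot when they share an ortholog group.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Gene = Tuple[str, int, str]          # (gene_id, position_bp, ortholog_group)
Point = Tuple[int, int, str, str]    # (x_bp, y_bp, gene_x, gene_y)


def dot_plot_points(genes_x: List[Gene], genes_y: List[Gene]) -> List[Point]:
    """Return one point per pair of genes that share an ortholog group.

    genes_x holds the genes of the chromosome drawn on the x axis,
    genes_y those of the chromosome drawn on the y axis.
    """
    by_group: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for gene_id, pos, group in genes_y:
        by_group[group].append((gene_id, pos))

    points: List[Point] = []
    for gene_id, pos, group in genes_x:
        for other_id, other_pos in by_group.get(group, []):
            points.append((pos, other_pos, gene_id, other_id))
    return sorted(points)
```

In this representation, a conserved collinear region appears as a diagonal run of points, while one gene matching several neighboring genes on the other axis yields a short vertical or horizontal stack of points, which is the pattern a tandem duplication would be expected to produce.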
Output Image Files Generated from CrusView
CrusView can generate high-resolution images and save them in PNG, SVG, or PDF format for use in publications. Such images include digital karyotypes, genome synteny plots, dot plots of two chromosomes, multigene alignments within a genomic region, exon-to-exon alignment plots, plots of genomic duplication events, and mappings of a list of genes of interest onto the genomic synteny plots.
Software Availability
CrusView is publicly available online (http://www.cmbb.arizona.edu/crusview) and runs as a Java Web Start application on 32- and 64-bit Windows and Linux systems, with options for different memory sizes. Sample data sets from Arabidopsis and Eutrema salsugineum are provided to demonstrate the basic functions of CrusView. The software manual and a series of video tutorials for CrusView are also provided online (http://www.cmbb.arizona.edu/crusview/video_tutorial).
Glossary
- CCP: comparative chromosomal painting
- GUI: graphical user interface
- ID: gene identifier
- CentO: centromere-specific
References
- Benson G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27: 573–580
- Cheng F, Wu J, Fang L, Wang X. (2012) Syntenic gene analysis between Brassica rapa and other Brassicaceae species. Front Plant Sci 3: 198
- Dassanayake M, Oh DH, Haas JS, Hernandez A, Hong H, Ali S, Yun DJ, Bressan RA, Zhu JK, Bohnert HJ, et al. (2011) The genome of the extremophile crucifer Thellungiella parvula. Nat Genet 43: 913–918
- Ferguson-Smith MA, Trifonov V. (2007) Mammalian karyotype evolution. Nat Rev Genet 8: 950–962
- Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I. (2004) VISTA: computational tools for comparative genomics. Nucleic Acids Res 32: W273–W279
- Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, et al. (2011) The iPlant collaborative: cyberinfrastructure for plant biology. Front Plant Sci 2: 34
- Guy L, Kultima JR, Andersson SG. (2010) genoPlotR: comparative gene and genome visualization in R. Bioinformatics 26: 2334–2335
- Hu TT, Pattyn P, Bakker EG, Cao J, Cheng JF, Clark RM, Fahlgren N, Fawcett JA, Grimwood J, Gundlach H, et al. (2011) The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat Genet 43: 476–481
- Li L, Stoeckert CJ, Jr, Roos DS. (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13: 2178–2189
- Lysak MA, Berr A, Pecinka A, Schmidt R, McBreen K, Schubert I. (2006) Mechanisms of chromosome number reduction in Arabidopsis thaliana and related Brassicaceae species. Proc Natl Acad Sci USA 103: 5224–5229
- Ma C, Chen H, Xin M, Yang R, Wang X. (2012) KGBassembler: a karyotype-based genome assembler for Brassicaceae species. Bioinformatics 28: 3141–3143
- Sullivan MJ, Petty NK, Beatson SA. (2011) Easyfig: a genome comparison visualizer. Bioinformatics 27: 1009–1010
- Tang H, Lyons E. (2012) Unleashing the genome of Brassica rapa. Front Plant Sci 3: 172
- Van Bel M, Proost S, Wischnitzki E, Movahedi S, Scheerlinck C, Van de Peer Y, Vandepoele K. (2012) Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol 158: 590–600
- Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, Bai Y, Mun JH, Bancroft I, Cheng F, et al. (2011) The genome of the mesopolyploid crop species Brassica rapa. Nat Genet 43: 1035–1039
- Wright SI, Agren JA. (2011) Sizing up Arabidopsis genome evolution. Heredity (Edinb) 107: 509–510
- Wu HJ, Zhang ZH, Wang JY, Oh DH, Dassanayake M, Liu BH, Huang QF, Sun HX, Xia R, Wu YR, et al. (2012) Insights into salt tolerance from the genome of Thellungiella salsuginea. Proc Natl Acad Sci USA 109: 12219–12224
- Yang R, Jarvis DE, Chen H, Beilstein M, Grimwood J, Jenkins J, Shu S, Prochnik S, Xin M, Ma C, et al. (2013) The reference genome of the halophytic plant Eutrema salsugineum. Front Plant Sci 4: 46




