Using the Generic Synteny Browser (GBrowse_syn)

Sheldon J McKay; Ismael A Vergara; Jason E Stajich

doi:10.1002/0471250953.bi0912s31

Author manuscript; available in PMC: 2011 Aug 26.

Published in final edited form as: Curr Protoc Bioinformatics. 2010 Sep;CHAPTER:Unit–9.12. doi: 10.1002/0471250953.bi0912s31

PMCID: PMC3162311 NIHMSID: NIHMS234286 PMID: 20836076

Abstract

Genome browsers are software tools that allow the user to view genome annotations in the context of a reference sequence, such as a chromosome, contig, or scaffold. The Generic Genome Browser (GBrowse) is an open source genome browser package developed as part of the Generic Model Organism Database Project (see Unit 9.9; Stein et al., 2002). The increasing number of sequenced genomes has led to a corresponding growth in the field of comparative genomics, which requires methods to view and compare multiple genomes. Using the same software framework as GBrowse, the Generic Synteny Browser (GBrowse_syn) allows the comparison of co-linear regions of multiple genomes using the familiar GBrowse-style web page. Like GBrowse, GBrowse_syn can be configured to display any organism and is currently the synteny browser used for model organisms such as C. elegans (WormBase; www.wormbase.org; see Unit 1.8) and Arabidopsis (TAIR; www.arabidopsis.org; see Unit 1.11). GBrowse_syn is part of the GBrowse software package and can be downloaded from the web and run on any unix-like operating system, such as Linux, Solaris, or Mac OS X. GBrowse_syn is still under active development. This unit covers installation and configuration as part of the current stable version of GBrowse (v1.71).

Introduction

GBrowse_syn was designed to be portable and configurable like its parent application GBrowse. It can be run on any unix-like operating system with the MySQL database management system installed. GBrowse_syn views multiple genomes by comparing co-linear regions of one or more genomes against a single reference sequence, with the ability to toggle between reference and target sequences. The original use case was the comparison of three nematode genomes at WormBase but, as the number of sequenced nematode and other genomes continues to grow, more than three species can be compared with this software. GBrowse_syn is designed to use the same database adapters as GBrowse for displaying sequence annotations and uses a central joining database to link any number of GBrowse data sources and render them on the same screen.

This unit has two main protocols and one Alternate Protocol. Basic Protocol 1 shows how to configure GBrowse_syn to use the example data set of two aligned rice genomes with the alignments and sequence annotations in MySQL relational databases. In addition to multiple sequence alignment data, GBrowse_syn can use any kind of co-linearity data that has coordinate and strand information. Basic Protocol 2 shows how to configure OrthoCluster (see Unit 6.10) synteny blocks to be loaded and browsed in GBrowse_syn. Whole genome alignment strategies for complex genomes usually involve hierarchical strategies where syntenic (or co-linear) regions are first identified and then aligned at the nucleotide sequence level. Alternate Protocol 2 shows how to load the GBrowse_syn alignment database from the relatively more complex output of the MERCATOR/MAVID whole genome alignment workflow (Dewey, 2007). Support Protocol 1 describes how to install GBrowse_syn and its dependencies from the most current stable source code (version 1.71 at time of writing).

Basic Protocol 1: Configuring and Using GBrowse_syn

GBrowse_syn is installed along with the GBrowse package. Sample alignment and configuration data are included with the installation. This protocol will describe the basic configuration and use of GBrowse_syn.

Example Data

Genome annotation files are provided in GFF3 format for two rice species, referred to throughout the first part of the protocol as ‘rice’ and ‘wild_rice’, together with blastz-derived (Schwartz et al., 2003) whole genome alignment data between the genomic DNA of the two species. The files are installed in the databases directory under the GBrowse document root, the HTDOCS option described in Unit 9.9, which is the location of GBrowse cascading style sheets, help files, tutorial, etc. The location of the document root will vary according to system architecture and user options selected at install time. The correct location of these files on the server will be displayed in the welcome screen shown when GBrowse_syn is used for the first time (see Support Protocol 1). In this example the location of the document root is:

/var/www/html/gbrowse

Necessary Resources

Hardware

Unix (Linux, Solaris, or other variety) workstation or Macintosh with OS X 10.2.3 or higher
Internet connection

Software

No additional software is required if Support Protocol 1 has been completed.

Files

The data and configuration files needed for this protocol are pre-installed with the GBrowse package or its prerequisites, as described in Support Protocol 1.
This protocol assumes a unix-like operating system. The examples shown in this protocol are run on Linux (CentOS release 5.3) using MySQL server version 5.0.77. Many steps will require the sudo command for administrator level access to the system.

Obtaining example data

These are instructions for setting up and using GBrowse_syn with the examples that are installed along with the GBrowse package. The alignment data and genome annotation data were provided courtesy of Bonnie Hurwitz.

1)
Go to the document root using the unix cd command. The $ symbol represents the Linux command prompt.
- $ cd /var/www/html/gbrowse
2)
Examine the document tree of the databases directory using the ls -R command.
- $ \ls -R databases
- databases:
- gbrowse_syn yeast_chr1+2
- databases/gbrowse_syn:
- alignments rice wild_rice
- databases/gbrowse_syn/alignments:
- rice.aln.gz
- databases/gbrowse_syn/rice:
- rice.gff3
- databases/gbrowse_syn/wild_rice:
- wild_rice.gff3
- databases/yeast_chr1+2:
- chr1.fa chr2.fa yeast_chr1+2.gff3
- In some systems the ls command may be aliased to use other options by default. The backslash (\) before the ls command will invoke ls with no options other than the specified -R (recursive) argument. The files you will need are under the gbrowse_syn subdirectory.
3)
Change to the directory with the alignment data file and unpack the compressed file using the gunzip command.
- $ cd databases/gbrowse_syn/alignments
- $ sudo gunzip rice.aln.gz
Figure 9.12.1 shows the first few lines of the alignment file. The syntax of the sequence names in the alignment is critical because it contains meta-data required to get the coordinates and strand of each sequence.

The syntax is:
- Species-seqid(strand)/start-end
The database loading script load_alignments_msa.pl (discussed below) will check the name format while parsing the alignments. Violations of the required syntax will cause a fatal exception and the script will stop executing.
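The name syntax above can be checked before an alignment file is loaded. The following Python sketch is an illustration only and is not part of GBrowse_syn: the regular expression, the SeqName tuple and the parse_seq_name function are hypothetical names, and the sketch assumes that the species name contains no hyphen and that the strand is written as + or -.

```python
import re
from typing import NamedTuple


class SeqName(NamedTuple):
    """Fields encoded in an alignment sequence name (hypothetical container)."""
    species: str
    seqid: str
    strand: str
    start: int
    end: int


# Species-seqid(strand)/start-end
# Assumption: the species name has no hyphen and the strand is '+' or '-'.
NAME_RE = re.compile(
    r"^(?P<species>[^-\s]+)-(?P<seqid>[^()\s]+)"
    r"\((?P<strand>[+-])\)/(?P<start>\d+)-(?P<end>\d+)$"
)


def parse_seq_name(name: str) -> SeqName:
    """Split a sequence name into its parts, or raise ValueError if malformed."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(
            f"sequence name does not follow the required syntax: {name!r}"
        )
    return SeqName(
        species=match["species"],
        seqid=match["seqid"],
        strand=match["strand"],
        start=int(match["start"]),
        end=int(match["end"]),
    )


# Made-up name for illustration; it is not taken from rice.aln.
print(parse_seq_name("wild_rice-3(-)/100-200"))
```

Raising an exception on a malformed name mirrors the fail-fast behavior described for the loading script, so that a formatting problem is found before any data are written to the database.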
4)
Create a database named ‘rice_synteny’ (you will need a MySQL account with CREATE and GRANT privileges). Substitute your own user name and password for ‘user’ and ‘pass’.
- $ mysql -uuser -ppass
- mysql> create database rice_synteny;
- Query OK, 1 row affected (0.00 sec)
5)
Grant SELECT privileges to the user ‘www-data’, the default web user name for MySQL in this installation, then quit the mysql shell.
- mysql> grant SELECT on rice_synteny.* to 'www-data'@'localhost';
- Query OK, 0 rows affected (0.02 sec)
- mysql> quit
- Bye

6)
Load the alignment file unpacked in step 3 into the database using the load_alignments_msa.pl script, which is pre-installed with GBrowse and can be run without specifying the location of the script. The command is all on one line.
- $ load_alignments_msa.pl -u user -p pass -d rice_synteny -v -c rice.aln
where

`-u`	`username with CREATE, INSERT, GRANT privileges`
`-p`	`password (if required)`
`-d`	`database name`
`-v`	`verbose progress reporting (optional)`
`-c`	`start new database. This option overwrites any existing database of that name (recommended).`

Now that we have loaded the alignment database, also referred to as the joining database because it links together the data sources for each of the species, turn to the species annotation data in GFF3 format (Figure 9.12.2). The GFF3 format is described in Table 9.9.1 of Unit 9.9. The location of the species annotation data relative to the document root is:

databases/gbrowse_syn/rice/rice.gff3
databases/gbrowse_syn/wild_rice/wild_rice.gff3

By default, the GFF files are used with a flat-file adapter that can access the GFF files directly. Due to the large number of gene models for the two species, using flat files as species databases may be slow and cause excessive latency in rendering the images for GBrowse_syn on some servers. It is a relatively simple process to convert the GFF3 files to MySQL databases using the script bp_seqfeature_load.pl, which is installed with the bioperl-live distribution (see Support Protocol 1).

7)
Repeat steps 4-5 to create and set the permissions on two additional databases, named ‘rice’ and ‘wild_rice’.

8)
Change to the rice subdirectory and load the rice.gff3 file into a Bio::DB::SeqFeature::Store database using the bp_seqfeature_load.pl script.
- $ cd ../rice
- $ bp_seqfeature_load.pl -u user -p pass -d rice -c -f rice.gff3
where

`-u`	`username with MySQL root-level privileges`
`-p`	`password (if required)`
`-d`	`database name`
`-f`	`specifies fast loading. This feature is a big time saver but the GFF3 file must be well formatted, so that all subfeatures with the same ID are situated together in the file. The example files in this protocol are compatible with this option.`
`-c`	`start new database. This option overwrites any existing database of that name. (recommended)`

9)
Change to the wild_rice subdirectory and repeat step 8 for the wild_rice.gff3 file.
- $ cd ../wild_rice
- $ bp_seqfeature_load.pl -u user -p pass -d wild_rice -c -f wild_rice.gff3
- Note the database name is also changed to wild_rice.

Figure 9.12.1 — The first few lines of the rice.aln file, a CLUSTAL-formatted alignment file. Note that this is simply a formatting convention and does not imply that the CLUSTAL program was used to generate the data.

Figure 9.12.2 — A sample of the genome annotations for the ‘rice’ data source. These annotations are in GFF3 format, which is explained in detail in Unit 9.9. This sample contains three gene models in a three level containment hierarchy (gene > mRNA > CDS).

Configuration files

10)
Find the configuration files for the alignment data and the two rice species at the locations below, relative to the configuration root. The configuration root is the full system path to the configuration files. The actual path will vary by operating system and configuration options at the time of installation. In this example, the configuration root is

/var/www/conf/gbrowse.conf
- /var/www/conf/gbrowse.conf/synteny/oryza.synconf.disabled
- /var/www/conf/gbrowse.conf/synteny/rice_synteny.conf
- /var/www/conf/gbrowse.conf/synteny/wild_rice_synteny.conf
- The file oryza.synconf.disabled will have its name changed to oryza.synconf in a subsequent step. It has the ‘disabled’ extension so that GBrowse_syn will not try to load the data source until the configuration file is ready and the data source is fully configured.
The most important difference in configuration between GBrowse_syn and GBrowse is that GBrowse_syn uses a joining database that links different species together via database features corresponding to alignment data, synteny blocks, gene orthology, etc. This example uses three configuration files: one for each of the rice species and one to link the species together via the joining database. The species configuration files have the same structure and options as GBrowse configuration files and specify track display options, etc. The example shown in Figure 9.12.3 is a minimal configuration file. For examples of the many configurable options for GBrowse, see Unit 9.9. The other configuration file is the GBrowse_syn configuration file, which specifies the joining database that links the species, information about the species and their configuration files, and display options. See Table 9.12.1 for configurable options for the GBrowse_syn configuration file.

When first installed, as shown in Support Protocol 1, GBrowse_syn scans the configuration directory for files ending in .synconf to look for configured data sources. If none are found, it prints a welcome screen, which is described in Support Protocol 1.

Species configuration files have the same structure as GBrowse configuration files, though they tend to be less complex (see Figure 9.12.4). Note the data source is configured by default to use the memory adapter for flat files. Flat file databases are best used for small data sets. Because the rice example annotations contain many gene models, there can be excessive latency in rendering images on some system configurations. In order to speed up GBrowse_syn for the example data provided, you will use MySQL databases for the two species' genome annotations.
11)
In order to deploy the MySQL databases you have loaded for the rice data, use a text editor such as pico, emacs, vi, etc, and edit rice_synteny.conf so that the database arguments read (you will have to use sudo to edit the installed configuration files):
- db_adapter = Bio::DB::SeqFeature::Store
  db_args = dsn dbi:mysql:rice
- The Bio::DB::SeqFeature::Store adapter and relational database schema are optimized for GFF3 and are the method of choice for loading this format into a relational database management system, in this case MySQL. Using them greatly accelerates data access and decreases the latency the user experiences when browsing multiple genomes with GBrowse_syn.
12)
Repeat step 11 for wild_rice_synteny.conf, using the database name ‘wild_rice’.
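Because the same two settings are edited in both species configuration files, the change can also be scripted. The following Python sketch is an illustration only: set_db_settings is a hypothetical helper, the sample configuration text is made up (the installed files may differ), and the db_args value is copied verbatim from step 11. It assumes that continuation lines of a multi-line value begin with white space.

```python
import re


def set_db_settings(conf_text: str, adapter: str, args: str) -> str:
    """Return conf_text with the db_adapter and db_args values replaced."""
    new_values = {"db_adapter": adapter, "db_args": args}
    out = []
    skipping = False  # True while dropping continuation lines of an old value
    for line in conf_text.splitlines():
        if skipping and line[:1] in (" ", "\t") and line.strip():
            continue  # indented continuation of the value just replaced
        skipping = False
        match = re.match(r"^(db_adapter|db_args)(\s*=\s*)", line)
        if match:
            key, sep = match.group(1), match.group(2)
            out.append(f"{key}{sep}{new_values[key]}")
            skipping = True
        else:
            out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical excerpt of a species configuration file using the memory adapter.
sample = """[GENERAL]
description = Example species
db_adapter  = Bio::DB::SeqFeature::Store
db_args     = -adaptor memory
              -dir /var/www/html/gbrowse/databases/gbrowse_syn/rice
"""

updated = set_db_settings(sample, "Bio::DB::SeqFeature::Store", "dsn dbi:mysql:rice")
print(updated)
```

The function works on text rather than on files so that the result can be inspected before it is written back; saving it over the installed configuration file would still require administrator access, as noted in step 11.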
13)
Rename oryza.synconf.disabled to oryza.synconf (see step 10) so that GBrowse_syn loads the data source, then reload the web page to see the display shown in Figure 9.12.5. Click on the example rice 3:16050173..1606497. You should see the display shown in Figure 9.12.6, which shows the alignment between the rice and wild rice genomes, with rice as the reference genome.

Figure 9.12.3 — Complete configuration file for the ‘oryza’ data source that is installed as an example with the GBrowse package. This file is similar in structure to a GBrowse configuration file, as described in Unit 9.9. In addition to the connection information for the joining database, this file specifies the location of the configuration files for the species to be compared in GBrowse_syn and the theme color and tracks to load for each species.

Table 9.12.1.

Configurable options for the GBrowse_syn configuration file. Options shown in bold face are required. Options shown in italics are recommended.

Option	Description
Join	The data source name (DSN) for the joining database. Figure x.x.x shows a typical example.
source map	The mapping of symbolic source (name), configuration file name and description for each species. Figure 9.12.3 shows a typical example.
Tmpimages	The URL (location relative to the document root) where temporary image and cached data should be stored, eg: /gbrowse/tmp.
*Buttons*	The location for common GBrowse images of buttons, arrows, etc., eg: /gbrowse/images/buttons. Default images will be used unless otherwise specified.
Stylesheet	The URL for a cascading style sheet (CSS) file that specifies various configurable web display options. These can be customized. The default GBrowse stylesheet is used unless otherwise specified.
Examples	Example segments to display. These specify the reference species, sequence and coordinates. Some examples are shown in fig x.x.x.
zoom levels	The zoom levels that will be available in the navigation menu. default: zoom levels = 5000 10000 25000 50000 100000 200000 400000
config_extension	The file extension (.syn or .conf) for species configuration files. Note this extension has to be used consistently throughout the GBrowse_syn configuration directory. default: syn
description	The description of the data source for public display. default: none
max_span	The gap between inset panels, expressed as a portion of the reference panel width, that triggers merging of inset panels. default: 0.3
max_segment	The maximum allowed sequence length to be displayed in the reference panel. default: 400Kb
min_alignment_size	The minimum alignment size, expressed as a fraction of the total reference sequence length, that will be used to create an inset panel. default: 0.01
imagewidth	The default width, in pixels, of the reference panel. default: 5
interimage_pad	The space between inset panels, in pixels. default: 5
vertical_pad	The vertical space between panels, in pixels. default: 5
align_height	The height of the alignment or syntenic block features, in pixels. default: 5
max_gap	The maximum gap allowed between chained alignment features. default: 50Kb
overview_ratio	The relative width of the overview panel in relation to the width of the reference panel. default: 0.9
overview bgcolor	The background color of the overview panel. Named web colors or hexadecimal codes are acceptable. default: gainsboro
grid coordinates	This option is for sparse grid coordinate data. If set to ‘exact’, all coordinates will be used. Otherwise, coordinates that are multiples of 10, 100, 1000, etc will be used depending on the size of the displayed segment.

Figure 9.12.4 — The rice_synteny.conf configuration file. Minimal information is required as this is not intended as a stand-alone genome browser. A detailed list of configurable options for GBrowse configuration files can be found in Unit 9.9. Note that the [EG] track is referenced by the main configuration file oryza.synconf in Figure 9.12.3.

Figure 9.12.5 — The startup screen for the *Oryza sativa* sample data source included with the GBrowse package. Clicking on one of the example segment links is a good way to get started browsing.

Figure 9.12.6 — Example segment rice 3:16050173..1606497. With the default options, shaded polygons with grid lines are shown. The grid lines correspond to mapped sequence coordinates in the aligned segments.

NOTE: The reference/target relationship is stored reciprocally in the alignment database. Clicking on a part of the inset panel for another genome that does not have other behaviors, such as popup balloons or links, will reload the page with the reference/target relationship reversed.

Interpreting Results

14)
Examine the general layout, which shows a central reference panel, or a lower reference panel in cases where there are only two species. Inset panels for other species with matching regions appear above or below the reference sequence panel. Clicking on one of the inset panels makes that species the reference sequence, which facilitates rapid switching between reference sequences. The example used in this protocol uses two closely related species; in most regions the whole segment is co-linear and there is only one inset panel.
15)
Hover the mouse over the blue text web page features (Figure 9.12.5) to display a popup balloon. Clicking one of these links takes you to a help page describing that feature. A detailed list of features is described on a web-based help page (http://gmod.org/wiki/GBrowse_syn_Help). Figure 9.12.7 shows an excerpt from this help page, which is kept up to date as new features are added to GBrowse_syn.
16)
In the overview panel, click and drag on the scale bar to activate the rubber band selection, which aids in moving, re-centering, or resizing the viewed region. Clicking anywhere on the overview panel will also move the detailed view to that region.

The overall look and feel are similar to GBrowse, though not all GBrowse features such as draggable tracks and rubber band selection in the detail panels are available.
17)
Compare the species. There is no upper limit on the number of species that can be compared with GBrowse_syn. Because only a single reference sequence is shown at one time, the reference panel is repeated as many times as necessary to compare it to all species. An “all in one view” is also available, although it is not very informative if a large number of species are being compared. Figure 9.12.8 shows a more complex example of a five species comparison from WormBase (http://www.wormbase.org). The lower section of the web page offers a number of image display options, such as width, shading and grid lines for aligned regions. The grid lines option is especially useful, as it tracks corresponding nucleotide residue positions at columns in the DNA sequence alignment, which highlights relatively large insertions and deletions. The example shown in Figure 9.12.8 is of particular interest because it shows extensive insertions and deletions among the five genomic DNA sequences being compared.
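The idea behind the grid lines can be illustrated with a small Python sketch that maps reference positions to target positions through the columns of a gapped pairwise alignment. This is not the GBrowse_syn implementation: grid_points is a hypothetical function, the two aligned strings are made up, and the sketch assumes 1-based coordinates with both sequences on the plus strand.

```python
from typing import Iterator, Tuple


def grid_points(
    ref_aln: str,
    tgt_aln: str,
    ref_start: int,
    tgt_start: int,
    step: int = 10,
) -> Iterator[Tuple[int, int]]:
    """Yield (ref_pos, tgt_pos) at aligned columns where ref_pos is a multiple of step."""
    if len(ref_aln) != len(tgt_aln):
        raise ValueError("aligned sequences must have the same number of columns")
    ref_pos, tgt_pos = ref_start - 1, tgt_start - 1
    for ref_char, tgt_char in zip(ref_aln, tgt_aln):
        # A gap character consumes a column but no residue in that sequence.
        if ref_char != "-":
            ref_pos += 1
        if tgt_char != "-":
            tgt_pos += 1
        if ref_char != "-" and tgt_char != "-" and ref_pos % step == 0:
            yield ref_pos, tgt_pos


# Made-up alignment: a 2 bp insertion in the target followed by a 2 bp deletion.
points = list(grid_points("ACGT--ACGTAC", "ACGTTTAC--AC", 1, 101, step=5))
print(points)  # [(5, 107), (10, 110)]
```

In this toy case the offset between the paired coordinates changes from 102 to 100 between the two grid points, which is the kind of shift that makes insertions and deletions visible when grid lines are drawn between panels.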

Figure 9.12.7 — An excerpt from the GMOD (Generic Model Organism Database) Wiki pages that describes web page features for GBrowse_syn. These features continue to be updated and changes are posted to the Wiki.

Figure 9.12.8 — A five species whole genome DNA sequence alignment comparison from WormBase (http://www.wormbase.org), showing regions that are co-linear with *Caenorhabditis elegans* genomic segment X:1085001..1115000. The displayed region uses the default settings for the display options shown in the bottom panels of the image.

Alignment chaining

18)
Select the chain alignments option in the Display Settings part of the GBrowse_syn web page. GBrowse_syn will perform an “on the fly” analysis of the alignments or co-linear regions from other sources, and will merge or join parts that are within a configurable distance of each other (the default is 50 kb), are on the same strand, and have either monotonically increasing or monotonically decreasing coordinates, depending on the orientation (see Figure 9.12.9). This method is analogous to the blastz chaining described in Kent et al. (2003).

Figure 9.12.9 — Alignment chaining. A) Alignment of a segment of the rice and wild-rice genomes with the alignment data provided. B) The same region with the “chain alignments” option selected. Same-strand alignments with monotonically increasing (or decreasing) coordinates are merged or connected by dashed lines where there are gaps. This example allows gaps of up to 50 kb between chained alignments. Note the loss of two genes in domestic vs. wild rice.
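The chaining criteria described above (same strand, gap within a configurable distance, coordinates that increase or decrease monotonically) can be sketched in Python as follows. This is a simplified greedy illustration and not the GBrowse_syn code: Hit, can_join and chain_hits are hypothetical names, the coordinates are made up, and the sketch assumes that the gap limit applies to both the reference and the target and that chained hits do not overlap.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One aligned block; start <= end on both sequences (hypothetical layout)."""
    ref_start: int
    ref_end: int
    tgt_start: int
    tgt_end: int
    strand: str  # '+' same orientation, '-' target reversed


def can_join(prev: Hit, nxt: Hit, max_gap: int) -> bool:
    """True if nxt may follow prev in a chain."""
    if prev.strand != nxt.strand:
        return False
    ref_gap = nxt.ref_start - prev.ref_end
    if prev.strand == "+":
        tgt_gap = nxt.tgt_start - prev.tgt_end  # target coordinates increase
    else:
        tgt_gap = prev.tgt_start - nxt.tgt_end  # target coordinates decrease
    return 0 <= ref_gap <= max_gap and 0 <= tgt_gap <= max_gap


def chain_hits(hits: Iterable[Hit], max_gap: int = 50_000) -> List[List[Hit]]:
    """Greedily group hits, ordered by reference start, into chains."""
    chains: List[List[Hit]] = []
    for hit in sorted(hits, key=lambda h: h.ref_start):
        for chain in chains:
            if can_join(chain[-1], hit, max_gap):
                chain.append(hit)
                break
        else:
            chains.append([hit])  # no existing chain accepts this hit
    return chains


hits = [
    Hit(1000, 2000, 5000, 6000, "+"),
    Hit(2500, 3000, 6400, 6900, "+"),
    Hit(3200, 3500, 4000, 4300, "-"),      # opposite strand: separate chain
    Hit(90000, 91000, 95000, 96000, "+"),  # beyond the 50 kb gap limit
]
for chain in chain_hits(hits):
    print(len(chain), chain[0].ref_start, chain[-1].ref_end)
```

With the default limit the four made-up hits form three chains; raising max_gap to 100000 lets the distant plus-strand hit join the first chain, which shows how the configurable distance controls the amount of merging.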

Basic Protocol 2: Browsing OrthoCluster Synteny Blocks with GBrowse_syn

Although GBrowse_syn was developed with whole genome DNA sequence alignments in mind, it can also be used to display syntenic or co-linear regions that are not based on DNA sequence alignments. For example, OrthoCluster (Ng et al. 2009; Zeng et al. 2008) is a tool that has been developed for the accurate detection of synteny blocks among multiple species. Briefly, OrthoCluster takes as input two types of files: (i) a genome file, which contains the list of all genes with their chromosome/contig, start position, end position and strand; and (ii) a correspondence file, which contains the orthologous relationships among genes in all genomes. A detailed protocol on generating these input files and on running OrthoCluster is available (see Unit 6.10). The following protocol illustrates how to generate the GBrowse_syn input files based on pair-wise synteny block detection using OrthoCluster for three nematode genomes: Caenorhabditis elegans (ele), Caenorhabditis briggsae (bri) and Pristionchus pacificus (ppa). The procedure shown here can be extended to any number and type of genomes.