Using Galaxy to Perform Large-Scale Interactive Data Analyses

Jennifer Hillman-Jackson; Dave Clements; Daniel Blankenberg; James Taylor; Anton Nekrutenko; the Galaxy Team

doi:10.1002/0471250953.bi1005s19

Author manuscript; available in PMC: 2012 Aug 14.

Published in final edited form as: Curr Protoc Bioinformatics. 2007 Sep;CHAPTER:Unit–10.5. doi: 10.1002/0471250953.bi1005s19

PMCID: PMC3418382 NIHMSID: NIHMS386148 PMID: 18428782

Abstract

Innovations in biomedical research technologies continue to provide experimental biologists with novel and increasingly large genomic and high-throughput data resources to be analyzed. As creating and obtaining data has become easier, the key decision faced by many researchers is a practical one: where and how should an analysis be performed? Datasets are large, and analysis tool set-up and use are riddled with complexities outside the scope of core research activities. The authors believe that Galaxy (galaxyproject.org) provides a powerful solution that simplifies data acquisition and analysis in an intuitive web application, granting all researchers access to key informatics tools previously only available to computational specialists working in Unix-based environments. We will demonstrate through a series of biomedically relevant protocols how Galaxy specifically brings together (1) data retrieval from public and private sources, for example, UCSC’s Eukaryote and Microbial Genome Browsers (genome.ucsc.edu), (2) custom tools (wrapped Unix functions, format standardization/conversions, interval operations) and third-party analysis tools, for example, Bowtie/Tuxedo Suite (bowtie-bio.sourceforge.net), Lastz (www.bx.psu.edu/~rsharris/lastz/), SAMTools (samtools.sourceforge.net), FASTX-toolkit (hannonlab.cshl.edu/fastx_toolkit), and MACS (liulab.dfci.harvard.edu/MACS), and creates results formatted for visualization in tools such as the Galaxy Track Browser (GTB, galaxyproject.org/wiki/Learn/Visualization), UCSC Genome Browser (genome.ucsc.edu), Ensembl (www.ensembl.org), and GeneTrack (genetrack.bx.psu.edu).

Galaxy has rapidly become the most popular choice for integrated next-generation sequencing (NGS) analytics and collaboration, where users can perform, document, and share complex analyses within a single interface in an unprecedented number of ways.

Keywords: comparative genomics, genomic alignments, Web application, genome variation

INTRODUCTION

Most experimental biologists cannot take full advantage of genomic data because of a formidable wall of unnecessary computational issues. The goal of Galaxy [Goecks et al., 2010] is to solve these issues. Consider the following example: A researcher wants to identify protein-coding exons containing the highest density of SNPs. Most biologists know three primary sources of genome-wide data for vertebrates: Entrez at NCBI [unit 1.3; Gibney et al., 2011], the Genome Browser at UCSC [unit 1.4; Karolchik et al., 2009], and Ensembl [unit 1.15; Fernández Suárez et al., 2010] at the EBI/Wellcome Trust Sanger Institute (UK). Although these three sources offer extensive information about genes, including genomic structure, gene expression profiles, and SNPs, the end user must still perform this task elsewhere, because the listed resources do not provide the functionality necessary to perform this analysis. Typically, this project ends up in the hands of a graduate student who might initially try to achieve this using popular desktop applications. Unfortunately, Excel (like many other desktop applications) cannot handle that much data. As a result, this relatively simple task becomes a complex endeavor that may easily take weeks or months. In the authors’ view, this does not have to be complicated. Galaxy bridges the gap between data and analyses by allowing experimental biologists without programming experience to easily perform large-scale studies from within their Web browsers.

In this unit, the authors describe the functionality of Galaxy using a series of examples that correspond to the following protocols: Basic Protocol 1 covers the most fundamental features of Galaxy. Basic Protocol 2 elaborates on different types of data accepted by Galaxy. It also shows the user how to upload data and set data attributes. Basic Protocol 3 demonstrates analysis with ChIP-seq high throughput sequencing data. Basic Protocol 4 shows that manipulation of genomic intervals is one of Galaxy’s greatest strengths. Basic Protocol 5 explains how Galaxy enables users to manipulate multiple alignments.

In addition, each protocol has a corresponding Galaxy tutorial including a Screencast (web video) hosted on the Galaxy wiki at http://galaxycast.org/CurrentProtocolsBioinfo2012.

BASIC PROTOCOL 1

FINDING HUMAN CODING EXONS WITH HIGHEST SNP DENSITY

Suppose one wants to find the top hundred protein-coding exons in the human genome with the highest density of single nucleotide polymorphisms (SNPs). Answering this question is not trivial. To do so, one needs to compare all human exons to all human SNPs. To put this into perspective, the current version of the human genome at UCSC for hg19 includes over 350,000 known coding exons and dbSNP build 134 [Sherry et al., 2001] contains nearly 49 million SNPs. Galaxy is specifically designed to make such large-scale analyses fast and user-friendly. Galaxy’s interface is accessible from http://usegalaxy.org. In the following protocol, the authors will use RefSeq [Pruitt et al., 2005] exons and dbSNP annotations on chromosome 22 extracted from the UCSC Table Browser [Fujita et al., 2011].

Necessary Resources

Hardware

An Internet-connected computer.

Software

Internet browser that supports JavaScript (e.g., most current browsers such as Mozilla Firefox, Safari, Opera, Chrome, or Microsoft Internet Explorer)

Files

None

1.
Open the Galaxy Project’s homepage by pointing your Web browser to http://galaxyproject.org.
- The project homepage features four prominent sections: Use Galaxy, Get Galaxy, Learn Galaxy, and Get Involved.
2.
Click on the “Use Galaxy” link (http://usegalaxy.org), which will bring up the free public Galaxy server.
- The Galaxy interface, populated with sample data, is shown in Figure 10.5.1.
  
  [*Fig 1 near here]
3.
Hover over “User” in the top bar, then click on “Register” in the submenu.
- The center panel of the Galaxy interface will change into a form asking you to provide user account details.
4.
Fill in the Create Account form and click “submit”.
- Although Galaxy can be used without creating an account, the authors highly recommend registering. First, having an account allows you to access your data from any machine connected to the Internet. Second, having an account safeguards data stored in the history against deletion. Anonymous histories and datasets are not reusable from one session to the next. You also cannot do all the protocols in this unit without an account.
  
  When registering for the account, note the mailing list subscription checkbox. By checking it, a new user will be subscribed to the galaxy-announce mailing list. This list is a moderated, low volume list for announcements of interest to the Galaxy community.
5.
After registering you will be automatically logged in. For subsequent sessions, log in using your e-mail and password: hover over “User” in the top menu and click “Login” in the submenu.
6.
Name this history.
1. Click on “Unnamed History” in the History panel.
2. Enter “Basic Protocol 1” and hit return.
  
  You can only do this step if you are logged in.
7.
Click the “Get Data” link at the top of the Tools panel.
8.
Click the UCSC Main link.
- The UCSC Table Browser interface will appear in the middle panel of the Galaxy screen. The History panel on the right will disappear until you leave UCSC Main.
9.
Import coordinates of protein-coding exons of known human genes from the UCSC Table Browser to Galaxy. Make sure the following parameters are set as shown in Figure 10.5.2A:
```
     clade:           Mammal
     genome:          Human
     assembly:        Feb 2009 (GRCh37/hg19)
     group:           Genes and Gene Predictions Tracks
     track:           RefSeq Genes
     region:          position
     position:        chr22
     output format:   BED – browser extensible data
     Send output to:  Galaxy
```
1. Click the “get output” button.
  
  This brings up the next screen of the Table Browser interface as shown in Figure 10.5.2B.
  
  [*Fig 2 near here]
2. Select the “Coding Exons” radio button.
3. Click the “Send query to Galaxy” button.
  
  This will return you to Galaxy and create the first item, called “UCSC Main on Human: refGene (chr22:1-51304566)”, in your History panel, and a large green box in the center panel showing that the upload has been successfully added to the Galaxy job queue. The history item is initially gray, showing that it is queued (Figure 10.5.3). The history item becomes yellow when the job is running, and green once it is complete. If a task fails for any reason, the history item will turn red.
  
  This dataset contains ~ 7,100 exons.
  
  [*Fig 3 near here]
4. Click the dataset’s name (underlined text, upper left corner) to expand the box.
  
  Icons in the upper right corner (e.g., eye, pencil, and ×) as well as links in the expanded history item allow one to perform tasks described in the legend for Figure 10.5.4.
  
  [*Fig 4 near here]
11.
Rename the dataset to something more memorable.
1. In the history panel, click the pencil icon next to the “UCSC Main on Human: refGene (chr22:1-51304566)” dataset.
  
  This opens the “Edit Attributes” panel in the center as shown in Figure 10.5.5.
  
  [*Fig 5 near here]
2. Copy and paste the contents of the “Name” field into the “Info” field.
  
  This step is not necessary, but it does keep this somewhat useful information (“UCSC Main on Human: refGene (chr22:1-51304566)”) associated with the dataset.
3. Type “Exons hg19 chr22” into the “Name” text box.
4. Click the “Save” button.
  
  The item is now renamed in the History panel. To see the new Info value (which was the old name), click on the new name in the History panel.
  
  Giving the dataset a meaningful name makes it easier to keep track of multiple datasets when working in large histories.
12.
In the Tools panel click “Get Data” and then “UCSC Main” under it.
13.
Import coordinates of SNPs from the UCSC Table Browser to Galaxy. Make sure the following parameters are set:
```
     clade:           Mammal
     genome:          Human
     assembly:        Feb 2009 (GRCh37/hg19)
     group:           Variation and Repeats
     track:           Common SNPs(132)
     region:          position
     position:        chr22
     output format:   BED – browser extensible data
     Send output to:  Galaxy
```
1. Click the “get output” button.
  
  This brings up the next screen of the Table Browser interface.
2. Select the “Whole Gene” radio button.
3. Click the “Send query to Galaxy” button.
  
  This will create the second history item, named “UCSC Main on Human: snp132Common (chr22:1-51304566)”. This dataset is much larger than the Exons dataset, with ~170,000 SNPs in it.
4. Rename the new dataset. Click on the new dataset’s pencil icon, copy the old name to the “Info” text box and type “SNPs hg19 chr22” in the “Name” text box. Finish by clicking the “Save” button.
14.
Click “Operate on Genomic Intervals” in the Tools panel.
15.
Click “Join” to perform a Join operation.
1. Set
```
     Join:               Exons hg19 chr22
     with:               SNPs hg19 chr22
     with min overlap:   1
     Return:             Only records that are joined (INNER JOIN)
```
  This will join any exon and SNP records that overlap by one or more base pairs. For explanation of various join options, see Basic Protocol 4.
2. Click “Execute”.
  
  This will take a few minutes to compute.
  
  The join tool allows the user to find intersections between two sets of genomic intervals. In our case we are joining protein-coding exons and SNPs as shown in Figure 10.5.6A.
  
  The result of this operation, a dataset with ~4,800 overlapping exon-SNP pairs, is shown in Figure 10.5.7. The first six columns represent protein-coding exons while the last six represent SNPs. The six columns are: (1) chromosome, (2) start position, (3) end position, (4) description, (5) score (always 0 in this example), and (6) strand (+ or −). Figure 10.5.7 highlights a single exon (located on chromosome 22 between positions 17,264,508 and 17,265,299), which contains (overlaps with) 4 SNPs. One can see that coordinates of SNPs (columns eight and nine) are always within start and end position of the exon (columns two and three).
  
  [*Figs 6 and 7 near here]
3. Rename the join dataset. Click on the new dataset’s pencil icon, copy the old name to the “Info” text box and type “Exon-SNP Pairings” in the “Name” text box. Finish by clicking “Save”.
16.
Click “Statistics” in the Tools panel.
17.
Click “Count” to count the number of SNPs per exon as shown in Figure 10.5.6B.
1. Set
```
     from dataset:                               Exon-SNP Pairings
     Count occurrences of values in column(s):   c4
     delimited by:                               Tab
```
  Column 4 contains the exon name.
2. Click “Execute”.
  
  Figure 10.5.7 shows that the number of times each exon is listed equals the number of SNPs that exon overlaps with. Thus, by counting the number of occurrences of every exon in this dataset one can compute how many SNPs each exon overlaps with. The resulting dataset contains ~2,600 lines, one for each exon that overlaps with one or more SNPs.
3. Rename the new dataset. Click the new dataset’s pencil icon, set the “Info” text box to “Count on data 3; Count of unique values in c4” and type “Exon SNP Counts, unsorted” in the “Name” text box. Finish by clicking “Save”.
18.
Sort results by the number of SNPs per exon as shown in Figure 10.5.6C.
- a.
  In the Tools panel click “Filter and Sort” and then “Sort”.
- b.
  Set
```
     Sort Query:        Exon SNP Counts, unsorted
     on column:         c1
     with flavor:       Numerical sort
     everything in:     Descending order
```
- c.
  Click “Execute”.
  
  The resulting history item contains the input dataset, sorted by the number of SNPs in each exon (column 1).
- d.
  Rename the sorted dataset. Click the new dataset’s pencil icon, copy the old name to the “Info” text box and type “Exon SNP Counts, sorted” in the “Name” text box. Finish by clicking “Save”.
19.
Select the top 100 exons from this list as shown in Figure 10.5.6D.
1. In the Tools panel click “Text Manipulation” and then “Select first”.
2. Set
```
     Select first:      100
     from:              Exon SNP Counts, sorted
```
3. Click “Execute”.
  
  After execution is finished your new history item will contain a list of the 100 exons with the highest SNP density.
4. Rename the new dataset. Click the new dataset’s pencil icon, copy the old name to the “Info” text box and type “Exon SNP Counts, top 100” in the “Name” text box. Finish by clicking “Save”.
  
  The question asked by this protocol has now been answered: The last dataset lists only the exons on chromosome 22 with the most SNPs. However, we lost some information about those exons, such as coordinates and strand, in the process. The final step will link these data back into the result.
20.
Retrieve the other information for the top 100 exons as shown in Figure 10.5.6E.
1. In the Tools panel, click “Join, Subtract, and Group” and then “Compare two datasets”.
2. Set
```
     Compare:         Exons hg19 chr22
     using column:    c4
     Against:         Exon SNP Counts, top 100
     using column:    c2
     To find:         Matching rows of 1st dataset
```
  The exon name, the common value between the two datasets, is in column 4 of the exons dataset and column 2 of the counts dataset.
3. Click “Execute”.
21.
Rename and format the final result dataset.
1. Click on the new dataset’s pencil icon, copy the old name to the “Info” text box and type “SNP Coding Exons chr22” in the “Name” text box. Finish by clicking the “Save” button. Click on the new history item’s pencil icon to name and format the BED file.
2. Set “Score column for visualization:” to “5”.
3. Click on “Save”.
The resulting dataset contains 100 rows from the Exons dataset. Each row contains a full BED record. This dataset can now be used anywhere a genomic interval dataset (see Basic Protocol 4) or BED dataset can be used. It can also be visualized in genome browsers.
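
For readers who find it helpful to see the logic of steps 15 through 20 outside the Galaxy interface, the following is a minimal Python sketch of the same sequence: join, count, sort, select first, and compare. It is not how Galaxy implements these tools. The function names (`overlap`, `join`, `count_by_column`, `top_exons`) and the toy exon and SNP records are invented for illustration, and the sketch assumes BED-style half-open coordinates and a simple all-against-all comparison that would be far too slow for genome-scale data.

```python
from collections import Counter

# Toy BED records: (chrom, start, end, name, score, strand).
# All values below are invented for illustration only.
exons = [
    ("chr22", 100, 200, "exonA", 0, "+"),
    ("chr22", 300, 400, "exonB", 0, "-"),
    ("chr22", 500, 600, "exonC", 0, "+"),
]
snps = [
    ("chr22", 110, 111, "rs1", 0, "+"),
    ("chr22", 150, 151, "rs2", 0, "+"),
    ("chr22", 199, 200, "rs3", 0, "+"),
    ("chr22", 350, 351, "rs4", 0, "+"),
    ("chr22", 450, 451, "rs5", 0, "+"),
]


def overlap(a, b):
    """Bases shared by two half-open intervals (0 if on different chromosomes)."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def join(first, second, min_overlap=1):
    """Inner join: one 12-column row per overlapping (first, second) pair."""
    return [f + s for f in first for s in second
            if overlap(f, s) >= min_overlap]


def count_by_column(rows, col):
    """Count occurrences of each value in a column; rows are (count, value)."""
    counts = Counter(row[col] for row in rows)
    return [(n, value) for value, n in counts.items()]


def top_exons(exons, snps, n):
    """Return full BED records for the n exons overlapping the most SNPs."""
    pairs = join(exons, snps)                      # step 15: Join
    counts = count_by_column(pairs, 3)             # step 17: Count on c4
    ranked = sorted(counts, key=lambda r: r[0],    # step 18: Sort on c1,
                    reverse=True)                  #          descending
    top = ranked[:n]                               # step 19: Select first
    wanted = {name for _, name in top}             # step 20: Compare c4 to c2
    return [e for e in exons if e[3] in wanted]


if __name__ == "__main__":
    for record in top_exons(exons, snps, 2):
        print("\t".join(str(field) for field in record))
```

As in the protocol, the counting step discards coordinates and strand, so the final comparison against the original exon list is what restores a full BED record for each selected exon.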

Figure 10.5.1 — Galaxy interface contains four areas: the top bar, Tools panel (left column), detail panel (middle column), and History panel (right column). The top bar contains user account controls as well as help and contact links. The left panel lists the analysis tools and data sources available to the user. The middle panel displays interfaces for tools selected by the user. The right panel (the History panel) shows datasets and the results of analyses performed by the user. Pictured here are four history items in two different stages of completion: The two “FASTQ Groomer” items are yellow, meaning they are in progress, while the two “ungroomed” items are shown in green, meaning they have completed successfully. Every action by the user generates one or more new history items, which can then be used in subsequent analyses, downloaded, or visualized.

Figure 10.5.2 — Uploading a list of protein-coding exons (in BED format) of known human genes from the UCSC Table browser involves two steps (A and B) described in the text.

Figure 10.5.3 — When a job is queued, a history item is initially gray. When a job is running, a history item is yellow. When a job is complete, a history item is green (successful) or red (error).

Figure 10.5.4 — Close-up of a Galaxy history item. Clicking on links and icons triggers the following events: eye = show the first megabyte of the dataset in Galaxy’s middle panel; pencil = open the metadata editor, which brings up an interface in the middle panel of the Galaxy screen for editing the attributes of the current history item (for example, one may wish to give the history item a more descriptive name or change column assignments; see Basic Protocol 2); × = delete the item from the history (to undelete or permanently delete, use the history’s Options menu and select “View deleted datasets”); “save” = copy the dataset to your computer; “i” = view details about this dataset in the center panel, including the dataset(s), if any, it was generated from; “rerun” = display this tool in the center panel with the same settings it was run with, allowing this step to be rerun exactly or modified and rerun; “tags” = add free-text tags to this dataset; “sticky note” = add a free-text annotation. Finally, if the dataset can be visualized in a browser, links to the Galaxy Track Browser (stacked bars icon) and to UCSC, GeneTrack, Ensembl, and others will also be displayed.

Figure 10.5.5 — The “Edit Attributes” form in the center panel. Each attribute can be modified and saved. In this figure the system generated name has been copied to the “Info” field, and a short descriptive name entered in the “Name” field.

Figure 10.5.6 — Data manipulation tools: Join (A), Count (B), Sort (C), Select first lines (D), and Compare two datasets (E).

Figure 10.5.7 — Result of joining two interval datasets, highlighting a single exon that contains (overlaps with) 4 SNPs.

BASIC PROTOCOL 2

LOADING DATA AND UNDERSTANDING DATATYPES

In Galaxy, information is stored in “datasets”, which are analogous to files. Datasets can be added to your history by uploading files from your computer or by extracting data from external sources integrated with Galaxy, such as UCSC’s ENCODE datasets [Raney et al., 2011]. This protocol also explains transferring external data via HTTP/FTP, copying from shared or public Galaxy histories and libraries, and running data manipulation and analysis tools within Galaxy. In addition to its data contents, each Galaxy dataset is associated with “metadata”, which is information that describes the characteristics of the dataset. Metadata can include the assigned and given names/annotation, the associated reference genome and build, the format datatype, and, frequently, additional datatype-specific labels and definitions.
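
One way to picture the relationship between a dataset and its metadata is as a record that carries descriptive fields alongside its contents. The sketch below is purely illustrative: the class `Dataset`, its field names, and the helper `rename` are hypothetical and do not reflect Galaxy's internal data model. It mirrors the renaming convention used in Basic Protocol 1, where the system-generated name is kept in the “Info” field.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical stand-in for a Galaxy dataset plus its metadata."""
    name: str                      # given name shown in the History panel
    info: str = ""                 # free-text annotation ("Info" field)
    genome_build: str = "?"        # associated reference genome and build
    datatype: str = "tabular"      # format datatype, e.g. "bed"
    extra: dict = field(default_factory=dict)   # datatype-specific labels
    rows: list = field(default_factory=list)    # the data contents


def rename(dataset, new_name):
    """Give the dataset a new name, keeping the old name in the info field."""
    dataset.info = dataset.name
    dataset.name = new_name
    return dataset


if __name__ == "__main__":
    exons = Dataset(
        name="UCSC Main on Human: refGene (chr22:1-51304566)",
        genome_build="hg19",
        datatype="bed",
        extra={"score_column": 5},
    )
    rename(exons, "Exons hg19 chr22")
    print(exons.name, "|", exons.info)
```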

In this protocol we demonstrate how metadata is assigned and modified for common genome analysis datasets uploaded into Galaxy using the methods listed above. We also use Galaxy to transform a dataset from a custom format into a standard BED format.