Using CisGenome to Analyze ChIP-chip and ChIP-seq Data

Hongkai Ji; Hui Jiang; Wenxiu Ma; Wing Hung Wong

doi:10.1002/0471250953.bi0213s33

Author manuscript; available in PMC: 2012 Mar 1.

Published in final edited form as: Curr Protoc Bioinformatics. 2011 Mar;CHAPTER:Unit2.13. doi: 10.1002/0471250953.bi0213s33

PMCID: PMC3072298 NIHMSID: NIHMS278142 PMID: 21400695

Abstract

Chromatin immunoprecipitation (ChIP) coupled with genome tiling array hybridization (ChIP-chip) and ChIP followed by massively parallel sequencing (ChIP-seq) are high throughput approaches to profile genome-wide protein-DNA interactions. Both technologies are increasingly used to study transcription factor binding sites and chromatin modifications. CisGenome is an integrated software system for analyzing ChIP-chip and ChIP-seq data. This unit describes basic functions of CisGenome and how to use them to find genomic regions with protein-DNA interactions, visualize binding signals, associate binding regions with nearby genes, search for novel transcription factor binding motifs, and map existing DNA sequence motifs to user-supplied genomic regions to define their exact locations.

Keywords: transcription factor, chromatin immunoprecipitation, tiling array, next generation sequencing, motif, gene regulation

INTRODUCTION

Chromatin immunoprecipitation (ChIP) coupled with genome tiling array hybridization (ChIP-chip) (Ren et al., 2000; Cawley et al., 2004) and ChIP followed by massively parallel sequencing (ChIP-seq) (Barski et al., 2007; Johnson et al., 2007; Mikkelsen et al., 2007; Robertson et al., 2007) are powerful technologies for mapping transcription factor binding sites and chromatin modifications. A typical ChIP-chip or ChIP-seq experiment generates tens of millions of data points. Extracting useful information from the huge amount of data is a significant computational challenge. CisGenome (Ji et al., 2008) is an integrated software system to help researchers to cope with this challenge. This software contains a wide range of functionalities including (1) identification of protein-DNA binding regions from ChIP-chip data (also known as “peak calling”), (2) ChIP-seq peak calling, (3) data visualization, (4) association of peaks with nearby genes, (5) retrieving DNA sequences, (6) novel transcription factor binding motif discovery, and (7) mapping existing motifs to protein-DNA binding regions (Figure 2.13.1). This unit introduces how a typical user can use these functions to perform basic ChIP-chip and ChIP-seq data analyses.

ChIP-chip and ChIP-seq data analyses typically begin with detection of genomic regions that contain the protein-DNA interactions of interest. Usually, this involves processing raw microarray or sequence data to remove technological biases and distinguish bona fide biological signals from random noise. The signals often stand out as some kind of enrichment, such as increased probe intensities in ChIP-chip or elevated sequence read counts in ChIP-seq. In this unit, detection of these enrichment signals is also referred to as “peak calling”. After obtaining a list of putative protein-DNA binding regions, one can visually examine the enrichment signals in these regions, with gene annotations (e.g. starts and ends of exons and introns) and probe signals or sequence read counts displayed side by side. Visualization is not only a useful way to identify potential artifacts in the experimental data, but also an important first step toward new discoveries. Once visual examination confirms the data quality, one can annotate protein-DNA interactions with nearby genes. Genes identified in this way can shed light on the functions of transcription factors or chromatin marks through subsequent gene ontology analyses (The Gene Ontology Consortium, 2000; see Unit 7.2) or cross-referencing with gene expression data. For many transcription factors, the DNA sequences in the binding regions provide an opportunity to discover the specific DNA sequence patterns (also called “motifs”) recognized by the transcription factors, if such patterns are not already known. It is believed that transcription factor binding motifs are part of the mysterious and highly complex genetic codes written in the genome to dictate when, where and at what level genes should be expressed. Understanding these codes will greatly help us understand the mechanisms behind gene regulation and disease. There are also a large number of transcription factors for which the DNA binding motifs have already been characterized. For both known and new motifs, it is useful to map them to protein-DNA binding regions to identify their exact locations. These loci can serve as candidates for subsequent experimental validation, such as knock-out or transgenic experiments.

Following the natural order of data analyses, this unit is organized as follows (Figure 2.13.1). First, we introduce how to use CisGenome to call peaks from ChIP-chip data generated using Affymetrix genome tiling arrays (Basic Protocol 1). Next, we present protocols to visualize enrichment signals (Basic Protocol 2), associate peaks with genes (Basic Protocol 3), retrieve DNA sequences (Basic Protocol 4), discover novel transcription factor binding motifs (Basic Protocol 5), and map motifs to genomic regions (Basic Protocol 6). Collectively, these protocols can be assembled into a basic analysis pipeline for processing Affymetrix ChIP-chip data. Peak calling for non-Affymetrix ChIP-chip data is introduced in Basic Protocol 7, and ChIP-seq peak calling is introduced in Basic Protocols 8 and 9. Once peak calling is done, subsequent analyses can be performed as described in Basic Protocols 2-6. Therefore, by replacing Basic Protocol 1 with Basic Protocols 7-9, one can easily construct analysis pipelines for non-Affymetrix ChIP-chip and ChIP-seq data. In addition to the basic protocols, we provide two support protocols. Support Protocol 1 describes how to install CisGenome. Support Protocol 2 describes how to install the genome databases required by many analysis functions of CisGenome.

BASIC PROTOCOL 1

CHIP-CHIP PEAK CALLING FOR AFFYMETRIX TILING ARRAY DATA

This section introduces how to use CisGenome to detect protein-DNA binding regions from an Affymetrix tiling array experiment. A tiling array is a microarray that contains probes to interrogate the whole genome or targeted genomic regions. ChIP-chip experiments are usually carried out by hybridizing ChIP and control DNA samples to tiling arrays. Affymetrix tiling arrays are one of the most popular tiling array platforms currently in use.

A typical Affymetrix tiling array experiment generates multiple CEL files. These files contain raw probe intensities stored in a standard format defined by Affymetrix. Often, data in these files are saved in a binary format to reduce storage space and facilitate efficient data retrieval. This means users may not be able to read the files as plain text in a text editor.

For each tiling array platform, Affymetrix also provides a library of BPMAP file(s) that contains the array design information. The library may contain one or more BPMAP files which specify how probes in the array(s) are mapped to the genome. The BPMAP files also have a binary format defined by Affymetrix.

In CisGenome, analysis of Affymetrix ChIP-chip data consists of three steps: loading data, sample normalization, and peak calling. We will use a sample data set to illustrate this procedure. The sample data were produced in a ChIP-chip experiment studying the mouse transcription factor Gli3. The data were generated using the Affymetrix Mouse Tiling 2.0R array set, which contains seven chips in total to cover the whole mouse genome. The BPMAP library contains seven BPMAP files, one for each chip. In the experiment, three ChIP samples and three control samples were profiled. Each sample was hybridized to all seven chips and produced seven CEL files. In total, there are 42 CEL files and 7 BPMAP files. The data can be downloaded from http://www.biostat.jhsph.edu/~hji/cisgenome/index_files/basicprotocol.htm.
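The file bookkeeping in this example (six samples, each hybridized to seven chips) can be checked before any data are loaded. The following Python sketch is not part of CisGenome. The sample labels and the `<sample>_chip<k>.CEL` naming pattern are hypothetical and serve only to illustrate the arithmetic; actual CEL file names will differ.

```python
from collections import defaultdict

N_CHIPS = 7  # chips in the Mouse Tiling 2.0R array set, as stated in the text

# Hypothetical sample labels mapped to Group IDs (1 = ChIP, 2 = control).
SAMPLES = {"IP1": 1, "IP2": 1, "IP3": 1,
           "CT1": 2, "CT2": 2, "CT3": 2}


def expected_cel_files(samples, n_chips):
    """Build hypothetical file names: one CEL file per sample per chip."""
    return [f"{s}_chip{k}.CEL" for s in samples for k in range(1, n_chips + 1)]


def check_manifest(cel_files, samples, n_chips):
    """Return {sample: count} for samples whose CEL count is not n_chips."""
    per_sample = defaultdict(list)
    for name in cel_files:
        # Assumes the hypothetical "<sample>_chip<k>.CEL" naming pattern.
        per_sample[name.split("_chip")[0]].append(name)
    return {s: len(per_sample[s]) for s in samples
            if len(per_sample[s]) != n_chips}


cel_files = expected_cel_files(SAMPLES, N_CHIPS)
print(len(cel_files))                                # 6 samples x 7 chips = 42
print(check_manifest(cel_files, SAMPLES, N_CHIPS))   # empty dict when complete
```

A check of this kind can catch a missing or misnamed CEL file before samples are assembled in the loading wizard.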

Necessary Resources

Hardware

A personal computer (PC) equipped with Windows operating system (OS)

Software

CisGenome (see Support Protocol 1 for installation guide)

Files

CEL files that contain raw tiling array data

BPMAP files that contain the array design information. The BPMAP files can be downloaded from http://www.biostat.jhsph.edu/~hji/cisgenome/index_files/download.htm or http://www.affymetrix.com/.

Load Data

1. Enter the CisGenome installation folder. Start CisGenome by double-clicking CisGenome.exe.
2. Click the menu “File > Load Data > Tiling Array Dataset > Import from Affymetrix CEL+BPMAP” (Figure 2.13.2).
3. A wizard dialog will appear (Figure 2.13.3a). In this dialog, set parameters as follows:
  a. Specify a name for the data set (e.g. GliData).
  b. Click the button highlighted by circle 1.1. In the dialog that appears, choose the folder that contains the BPMAP files (circle 1.2) and then click “OK”.
  c. All BPMAP files in the selected folder will now be listed in the box on the left-hand side named “Available BPMAP” (Figure 2.13.3b). Select the BPMAP files corresponding to the ChIP-chip data. Click “Add” to move them to the box on the right-hand side titled “BPMAP used in the Project”. For the Gli example, all seven BPMAP files need to be added to the project.
  d. Use the “Move Up” and “Move Down” buttons to adjust the order of the BPMAP files (Figure 2.13.3b).
  e. Click “Next” at the bottom (Figure 2.13.3b).
4. A new wizard page will appear (Figure 2.13.4a). Set parameters in this new page as follows:
  a. Click the button in circle 1.3 (Figure 2.13.4a). In the dialog that appears, choose the folder that contains the CEL files (Figure 2.13.4a, circle 1.4) and then click “OK”.
  b. All CEL files in the selected folder should now be listed in the box named “Available Arrays” on the left (Figure 2.13.4b). Click “Create New Sample” (Figure 2.13.4b, circle 1.5) to add a sample to the data set. Provide a sample name in the “Sample ID” box (e.g. use “IP1” for ChIP sample 1, and use “CT3” for Input control sample 3, etc.). Provide a group identifier for the sample in the “Group ID” box (Figure 2.13.4b, circle 1.6). For example, set Group ID = 1 if the sample is a ChIP sample, and set Group ID = 2 if it is a control sample.
    If the group identifiers are defined in this way, then the expression “1>2” in subsequent analyses will mean that “the mean probe intensity in the ChIP samples is greater than the mean probe intensity in the control samples”.
  c. For the newly created sample, find all related CEL files. Select them in the “Available Arrays” box, and click “Add” to move them to the box titled “Arrays in the Sample” (Figure 2.13.4b, circle 1.7).
  d. Use the “Move Up” and “Move Down” buttons to adjust the order of the CEL files in the sample so that they match the corresponding BPMAP files (Figure 2.13.4b, circle 1.8).
  e. Repeat steps 4b–4d to add all ChIP and control samples to the data set.
  f. Click the “Finish” button at the bottom (Figure 2.13.4b).
5. Check the “Project Explorer” window in the CisGenome GUI (Figure 2.13.5). Under the section titled “Tiling Array Datasets”, a new data set should have been created. If you double-click a CEL file (e.g. Figure 2.13.5, circle 1.9), a CisGenome Browser window should appear.

Figure 2.13.2 — The CisGenome graphic user interface (GUI) and menu system. The menu for creating an Affymetrix tiling array data set is shown as an example.

Figure 2.13.3 — The dialog for adding BPMAP files to an Affymetrix ChIP-chip data set.

Figure 2.13.4 — The dialog for adding CEL files to an Affymetrix ChIP-chip data set.

Figure 2.13.5 — The newly created tiling array data set shown in the CisGenome Project Explorer. Double-clicking a CEL file will open a CisGenome Browser window displaying a heat map of the array image.

The CisGenome Browser window displays a heat map of the raw image of the selected array (Figure 2.13.5). If the heat map appears, the data have been loaded successfully.

Normalize Samples

6. Click the menu “Tiling Array > Normalization > Quantile (CEL+BPMAP)”.
7. In the configuration dialog that appears (Figure 2.13.6), specify the tiling array data set (e.g. GliData) to be normalized, and choose a folder and file header (e.g. GliData_norm) for exporting the normalized array intensities. Click “OK”.
8. The program will start to run. When it finishes, a new data set containing the normalized data will be added to the Project Explorer under the “Tiling Array Datasets” section (Figure 2.13.7, circle 1.10).
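For readers unfamiliar with quantile normalization, the idea behind the menu item used in step 6 can be sketched in a few lines of Python. This is a generic textbook version and only an assumption about the general technique; it is not CisGenome's implementation, which may handle ties, masked probes, and multi-chip array sets differently. The intensities below are made-up numbers.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of intensities each).

    After normalization every array has the same empirical distribution:
    the k-th smallest value of each array is replaced by the mean of the
    k-th smallest values across all arrays.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(col[k] for col in sorted_arrays) / len(arrays)
                 for k in range(n_probes)]

    normalized = []
    for a in arrays:
        # Probe indices ordered by intensity; ties keep their original order.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


# Three hypothetical arrays with four probes each.
raw = [[5, 2, 3, 4],
       [4, 1, 4, 2],
       [3, 4, 6, 8]]
normalized = quantile_normalize(raw)
for row in normalized:
    print(row)
```

After this step the sorted values of every array are identical, which removes array-wide intensity differences while preserving the rank order of probes within each array.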

Figure 2.13.6 — The dialog for normalizing an Affymetrix tiling array data set.

Figure 2.13.7 — ChIP-chip peak calling. Before peak detection, a normalized tiling array data set (circle 1.10) needs to be available in the Project Explorer, and one needs to provide several basic peak calling parameters in a dialog.

Call Peaks

9. Click the menu “Tiling Array > Peak Detection (TileMap)”.
10. In the dialog that appears (Figure 2.13.7), provide the following parameters:
  a. Choose a normalized tiling array data set to analyze (e.g. GliData_norm).
  b. Choose an analysis type. Use “Two Sample Comparison” if the data contain samples from two experimental conditions (e.g. ChIP vs. control). Use “Multiple Sample Comparison” if the data contain samples from more than two experimental conditions (e.g. ChIP, Input control, and IgG control). “One Sample Comparison” is usually used for non-Affymetrix tiling array data (see Basic Protocol 7).
  c. Specify a pattern to look for. In a two sample comparison, if Group ID = 1 for ChIP samples and Group ID = 2 for control samples, then the pattern to look for is 1>2. This pattern means that we are trying to find genomic regions where the probe intensities in group 1 (i.e. ChIP samples) are greater than the probe intensities in group 2 (i.e. control samples). In a multiple sample comparison, the pattern can be specified as a combination of pairwise comparisons such as (1>2 & 1>3), where “&” means AND.
  d. Specify a folder and a file header to save the results.
  e. (Optional) Specify parameters in the “Pre/Post Processing” and “Advanced Settings” tabs. The configurable parameters are listed in Table 2.13.1.
  f. Click “OK”.
11. The peak detection program will start to run. After it finishes, the detected peaks will be saved in a COD file named “[file header]_all.cod”, in which [file header] is the file header specified in step 10d. A COD file is a tab-delimited text file with five required columns to describe genomic regions (Figure 2.13.8). The COD file produced by ChIP-chip peak calling will be added to the “Genomic Regions (BED, COD)” section in the Project Explorer (Figure 2.13.8, circle 1.11). It will also be opened and displayed in a window (Figure 2.13.8). In the “Signals” section of the Project Explorer, several BAR files (named *.bar) will be added as well (Figure 2.13.8, circle 1.12).
  The five required columns are (i) a numerical or string identifier, (ii) chromosome, (iii) start coordinate, (iv) end coordinate, and (v) strand. Additional columns are allowed after the first five to annotate regions. See the section on Guidelines for Understanding Results to learn more about the COD file produced by CisGenome ChIP-chip peak calling. A COD file can be opened and edited with any text editor (e.g. Notepad) or with a spreadsheet program (e.g. Excel). The BAR file format is a binary format defined by Affymetrix. The BAR files produced by CisGenome contain probe-level enrichment signals which can be used for subsequent visualization.
12. After the analysis, one can save the project using the menu “File > Save Project” or “File > Save Project As”. Save the project to a file named [project title].cgw (e.g. GliData.cgw). In the future, use the menu “File > Open Project” to reload the project whenever needed.
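Because a COD file is plain tab-delimited text with five required columns (step 11), it is straightforward to read outside CisGenome for downstream scripting. The Python sketch below is a minimal reader under stated assumptions: the `Region` record and `read_cod` function are hypothetical names, lines beginning with “#” are assumed to be header or comment lines, and the example content is made up rather than taken from real peak calls.

```python
import io
from typing import NamedTuple


class Region(NamedTuple):
    region_id: str   # column 1: numerical or string identifier
    chrom: str       # column 2: chromosome
    start: int       # column 3: start coordinate
    end: int         # column 4: end coordinate
    strand: str      # column 5: strand
    extra: tuple     # any optional annotation columns after the first five


def read_cod(handle):
    """Parse tab-delimited COD text into a list of Region records.

    Assumption: blank lines and lines starting with '#' are skipped.
    """
    regions = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(f"expected at least 5 columns, got {len(fields)}")
        regions.append(Region(fields[0], fields[1], int(fields[2]),
                              int(fields[3]), fields[4], tuple(fields[5:])))
    return regions


# Made-up example content; column names in the header line are illustrative.
example = io.StringIO(
    "#id\tchromosome\tstart\tend\tstrand\tannotation\n"
    "1\tchr3\t1000\t1500\t+\t501\n"
    "2\tchrX\t20000\t20300\t+\t301\n"
)
regions = read_cod(example)
for r in regions:
    print(r.region_id, r.chrom, r.start, r.end, r.strand)
```

To read a file on disk, the same function can be given an open file handle, e.g. `open("GliData_all.cod")`, where the file name is only an example built from the file header convention above.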

Table 2.13.1. Optional parameters for ChIP-chip peak calling

Parameters and Description
1. Mask outlier/masked data points in the raw data: if yes, the outlier and masked probes in the original CEL file will not be used in peak detection.
2. Truncation lower bound (TLB): a numerical value. If it is equal to x, then all normalized probe intensities smaller than x will be truncated to x before peak calling.
3. Truncation upper bound (TUB): a numerical value. If it is equal to x, then all normalized probe intensities bigger than x will be truncated to x before peak calling.
4. Transform: before peak calling, the truncated probe intensities can be transformed by log2, logit or inverse logit function. Choose “None” if no transformation is needed.
5. Post processing: one can choose to merge two neighboring peaks if the gap between the two peaks <= x bp and the number of probes below the peak calling cutoff between the two peaks <= y.
6. Post filtering: one can choose to not report a peak if the peak length < x bp or the peak does not contain at least y continuous probes whose enrichment signals are above the peak calling cutoff. Both x and y are integers.
7. Region Summary Method: choose to use TileMap moving average (MA) or Hidden Markov Model (HMM) to call peaks. The default is MA.
8. MA settings: If MA is chosen in 7, provide the half window sizes x (in probes) and y (in bp), and a peak calling cutoff z. For each probe, the MA algorithm uses all probes within y bp and separated from the probe in question by no more than x-1 other probes to compute enrichment signals. If the signal is greater than z, the probe will be selected to construct peaks.
9. HMM settings: If HMM is chosen in 7, provide the expected peak length x (i.e. how many probes are expected to be covered by an average peak) and the peak cutoff y. The HMM computes a posterior probability for each probe being in a peak. Probes with a posterior probability above y will be used to construct peaks.
10. Method to compute false discovery rate (FDR): choose from “Left tail”, “UMS”, “Permutation” and “No FDR”. “Left tail” works for two sample comparisons. It flips the ChIP and control sample labels, and detects peaks after the label swap. The FDR is estimated by the ratio [No. of peaks detected after the label swap] / [No. of peaks detected before the label swap]. “UMS” uses the unbalanced mixture subtraction method introduced in Ji and Wong (2005) to estimate FDR. “Permutation” works by permuting sample labels and detecting peaks afterwards. The FDR is estimated by the ratio [No. of peaks detected after the label permutation] / [No. of peaks detected before the label permutation]. “UMS” and “Permutation” can work for multiple sample comparisons. “No FDR” will skip FDR computation.
11. UMS settings, permutation settings, and variance assumptions: These parameters are usually set automatically. Manually setting them requires deep understanding of the TileMap algorithm. Users are referred to Ji and Wong (2005) if they want to learn how to set these parameters manually.
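The truncation, transformation, and moving average (MA) window options in Table 2.13.1 (parameters 2-4 and 8) can be illustrated with a short Python sketch. It is a simplified illustration, not the TileMap implementation: the function names are hypothetical, only the log2 transform is shown, and a plain mean of arbitrary per-probe numbers stands in for the probe-level statistics that TileMap actually averages (see Ji and Wong, 2005).

```python
import math


def preprocess(value, lower, upper, transform="log2"):
    """Truncate a normalized intensity to [lower, upper], then transform it."""
    clipped = min(max(value, lower), upper)
    if transform == "log2":
        return math.log2(clipped)
    if transform == "none":
        return clipped
    raise ValueError("this sketch handles only 'log2' and 'none'")


def ma_window(positions, i, half_probes, half_bp):
    """Indices of the probes that contribute to the signal of probe i.

    Probe j is used when it lies within half_bp of probe i and is separated
    from it by at most half_probes - 1 other probes, i.e. |j - i| <=
    half_probes. positions must be sorted genomic coordinates.
    """
    lo = max(0, i - half_probes)
    hi = min(len(positions) - 1, i + half_probes)
    return [j for j in range(lo, hi + 1)
            if abs(positions[j] - positions[i]) <= half_bp]


def moving_average(stats, positions, half_probes, half_bp):
    """Plain mean of per-probe statistics over each probe's window."""
    out = []
    for i in range(len(stats)):
        window = ma_window(positions, i, half_probes, half_bp)
        out.append(sum(stats[j] for j in window) / len(window))
    return out


# Made-up probe coordinates and per-probe statistics.
positions = [100, 135, 170, 205, 900]
stats = [1.0, 2.0, 3.0, 4.0, 10.0]
smoothed = moving_average(stats, positions, half_probes=2, half_bp=100)
print(smoothed)  # [2.0, 2.5, 2.5, 3.0, 10.0]

cutoff = 2.4  # hypothetical peak calling cutoff z
print([i for i, s in enumerate(smoothed) if s > cutoff])  # [1, 2, 3, 4]
```

The isolated probe at position 900 has no neighbors within 100 bp, so its window contains only itself. Requirements such as a minimum number of continuous probes (parameter 6) help keep single-probe signals of this kind from being reported as peaks.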

Figure 2.13.8 — ChIP-chip peak calling results. Peaks are summarized in a COD file shown in the right window. A number of BAR files are also created to store enrichment signals. Both the COD file and the BAR files are added to the Project Explorer on the left.
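The peak list summarized in the COD file reflects the post-processing, post-filtering, and FDR options listed in Table 2.13.1 (parameters 5, 6 and 10). The Python sketch below illustrates these ideas in simplified form. The function names are hypothetical, peaks are reduced to (start, end) pairs on a single chromosome, the probe-count conditions are omitted, and the gap and length definitions are simplifying assumptions rather than CisGenome's exact rules.

```python
def merge_peaks(peaks, max_gap):
    """Merge neighboring peaks whose gap is <= max_gap bp.

    peaks: (start, end) tuples on one chromosome, sorted by start.
    The gap is taken as next_start - previous_end (a simplification).
    """
    merged = []
    for start, end in peaks:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def filter_peaks(peaks, min_length):
    """Drop peaks shorter than min_length bp (length taken as end - start)."""
    return [(s, e) for s, e in peaks if e - s >= min_length]


def left_tail_fdr(n_after_swap, n_before_swap):
    """Left-tail FDR estimate as described in Table 2.13.1:
    [No. of peaks after the label swap] / [No. of peaks before the swap].
    Returns None when no peaks were detected before the swap.
    """
    if n_before_swap == 0:
        return None
    return n_after_swap / n_before_swap


# Made-up peaks and counts for illustration only.
peaks = [(100, 300), (350, 600), (2000, 2050)]
merged = merge_peaks(peaks, max_gap=100)
kept = filter_peaks(merged, min_length=100)
print(merged)  # [(100, 600), (2000, 2050)]
print(kept)    # [(100, 600)]
print(left_tail_fdr(5, 100))  # 0.05
```

In this toy example the first two peaks are 50 bp apart and are merged, while the 50-bp peak at position 2000 is removed by the length filter.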

BASIC PROTOCOL 2

VISUALIZATION

CisGenome provides a lightweight browser that can be used to visualize enrichment signals (e.g. probe intensities in ChIP-chip, or read counts in ChIP-seq) along the genome. This browser runs locally on the user's computer and does not require Internet access. This section introduces how to use the CisGenome Browser to visualize the data.

An Alternative Way to Display the Data

The procedure described below allows one to create a browser session manually. However, if ChIP-chip or ChIP-seq peak calling is performed within the CisGenome GUI, a browser session will be created automatically. In Figure 2.13.8, by choosing a peak in the COD file and double-clicking the first column of the peak, one will be directed to the browser session in which data within the peak are displayed. If the genome database has been loaded into the CisGenome GUI before peak calling, then the browser will also display the gene annotation, conservation and DNA sequence tracks automatically.