metaseq: a Python package for integrative genome-wide analysis reveals relationships between chromatin insulators and associated nuclear mRNA

Ryan K Dale; Leah H Matzat; Elissa P Lei

doi:10.1093/nar/gku644

2014 Jul 24;42(14):9158–9170. doi: 10.1093/nar/gku644


PMCID: PMC4132753 PMID: 25063299

Abstract

Here we introduce metaseq, a software library written in Python, which enables loading multiple genomic data formats into standard Python data structures and allows flexible, customized manipulation and visualization of data from high-throughput sequencing studies. We demonstrate its practical use by analyzing multiple datasets related to chromatin insulators, which are DNA–protein complexes proposed to organize the genome into distinct transcriptional domains. Recent studies in Drosophila and mammals have implicated RNA in the regulation of chromatin insulator activities. Moreover, the Drosophila RNA-binding protein Shep has been shown to antagonize gypsy insulator activity in a tissue-specific manner, but the precise role of RNA in this process remains unclear. Better understanding of chromatin insulator regulation requires integration of multiple datasets, including those from chromatin-binding, RNA-binding, and gene expression experiments. We use metaseq to integrate RIP- and ChIP-seq data for Shep and the core gypsy insulator protein Su(Hw) in two different cell types, along with publicly available ChIP-chip and RNA-seq data. Based on the metaseq-enabled analysis presented here, we propose a model where Shep associates with chromatin cotranscriptionally, then is recruited to insulator complexes in trans where it plays a negative role in insulator activity.

INTRODUCTION

Biologists can no longer focus on only one aspect of their system of interest. The wealth of existing genome-wide high-throughput sequencing data makes possible a multi-faceted approach that combines data from a variety of sources, such as gene expression, histone modification, DNA-binding and RNA-binding experiments. Ideally, these data are integrated with novel experimental results to derive new biological insights. However, the fact that these data are available does not necessarily mean they are easy to use or to integrate. One aspect limiting integration is the wide variety of data formats. While a reasonably stable consensus on standard file formats has been reached, there are still many options. These include raw sequences (FASTQ format (1)), mapped reads (BAM/SAM (2)), aggregated read density (wig, bedGraph, bigWig) and summary files including called peaks (BED, bigBed (3,4)), variant calls (VCF (5)) or tabular results (TSV, CSV) such as those from differential expression analysis. This diversity in binary and plain text formats can be a hindrance to integrated analysis.

Here, we introduce metaseq (https://github.com/daler/metaseq), a Python library designed to analyze and visualize data from disparate file formats and experimental protocols (ChIP-seq, RNA-seq, RNA immunoprecipitation and sequencing (RIP-seq), etc.) in a flexible and interactive manner. There are four core functional capabilities that will be described in more detail below: (i) a unified interface for accessing genomic data by interval across multiple file formats, (ii) integration with existing Python tools for working with tabular data and extension of these tools specifically for use with genomic data, such as those keyed by feature ID, (iii) a persistent mapping between feature ID and its genomic coordinates and (iv) extension of existing plotting and visualization tools to aid in the presentation of genome-wide data. metaseq is a library distinguished by its flexibility, modularity and ability to import a wide variety of genomic data into standard Python data structures that can be used in any Python code. metaseq is useful for rapid prototyping of analyses all the way through final figure creation. Here, we demonstrate its utility by analyzing genomic data from RIP-seq, RNA-seq and ChIP-seq experiments related to chromatin insulator proteins in order to derive new biological insights.

Chromatin insulators are DNA–protein complexes that have been proposed to organize the genome into distinct transcriptional domains, thereby playing a role in the regulation of gene expression. They are defined in functional terms: an insulator can exhibit barrier activity by preventing the spread of heterochromatin, or enhancer blocking activity by preventing enhancers from contacting promoters when placed in between the two elements. These properties may derive from an ability to promote DNA looping through insulator–insulator interactions (reviewed in (6)).

Recent work in both Drosophila and mammals has implicated RNA in the regulation of insulator activities. Of particular interest to this study, the gypsy insulator complex of Drosophila was found associated with full-length, spliced mRNA that may regulate insulator activity (7). However, gypsy insulator core components, including the Suppressor of Hairy wing (Su(Hw)) zinc finger DNA-binding protein, do not harbor RNA-binding domains and have no known capacity to bind RNA directly, suggesting that some other component(s) recruit RNA to insulators. One direct interactor of Su(Hw), the RNA-binding protein Shep, was identified as a negative regulator of gypsy insulator activity primarily in central nervous system (CNS) tissue (8). Preliminary evidence suggests that the RNA-binding capacity of Shep is needed for its effect on insulator activity; however, it has yet to be determined whether Shep indeed interacts with gypsy-associated mRNAs, and if so, whether these interactions occur in a cell type-specific manner. Furthermore, the genomic origins of Shep- and insulator-associated nuclear mRNAs have not been explored in comparison to chromatin association of Shep and Su(Hw).

In order to gain insight into the relationship between Shep and Su(Hw) as well as the biogenesis and specificity of Shep and gypsy insulator-associated nuclear mRNAs, we performed RIP-seq for both Shep and Su(Hw) in two different cell lines, either derived from CNS or non-CNS tissue. We integrated these data with our previously obtained ChIP-seq data, which identifies genomic binding sites for the same two factors in the same two cell lines. We then explored these data along with publicly available expression and genome-wide ChIP results for various factors within the same cell lines produced by the modENCODE consortium. Based on our analyses, we conclude that Shep binds a significant subset of Su(Hw)-associated nuclear mRNAs only in CNS-derived cells. Furthermore, at least a fraction of Shep is likely recruited to chromatin during transcription through interaction with RNA. In contrast, insulator-associated mRNAs, a subset of which are produced from long and low expression genes, are mainly produced in trans to chromatin insulator sites. These findings illustrate the utility and flexibility of metaseq for integrated analysis to explore multiple high-throughput sequencing datasets and obtain new biological insights.

MATERIALS AND METHODS

metaseq implementation

The target user for metaseq is the bioinformatician with knowledge of the Python programming language. metaseq takes advantage of many widely-used bioinformatics Python packages and benefits greatly from the work done by the authors of, and communities surrounding, those packages. We encourage users to consult the full documentation found at http://pythonhosted.org/metaseq for more details. In addition, the Supplementary Materials contain the source code detailing the creation of each figure presented in this manuscript. These files are intended to serve as ancillary documentation for learning metaseq as well as a template for the practical application of metaseq to new projects. metaseq is available at https://github.com/daler/metaseq (MIT license) and is also available on the Python Package Index.

Accessing genomic signal by interval

metaseq integrates multiple existing Python tools for accessing data from commonly used file formats: BAM files are handled with pysam (https://github.com/pysam-developers/pysam), the binary formats bigBed and bigWig are accessed via bx-python (https://bitbucket.org/james_taylor/bx-python), and tab-delimited files (BED, GFF, GTF, VCF) are indexed with tabix (2) and retrieved with pybedtools/BEDTools (9,10). Importantly, metaseq provides a single, uniform API to access all of these formats, allowing code to be data format-agnostic. This is particularly useful when a user has an existing analysis but wants to incorporate publicly- or newly-available data, which may be in a different format. For practical purposes we refer to the underlying data as genomic signal, no matter what the format is. For example, BAM files of mapped reads, bigWig files of coverage and BED files of called peaks can all be considered genomic signal.

Arrays of signal

metaseq provides flexible tools for converting genomic signal into NumPy arrays (a data structure commonly used in scientific Python programs). The concepts of windows (regions of interest) and signal (underlying genome-wide data) can be combined in useful ways. Windows can be represented by any supported interval format (BAM, BED, GFF, GTF, VCF), while signal can be represented by any interval format or bigWig format. For example, one common task is to plot a heatmap of histone modifications at transcription start sites (TSS) by building an array with windows of TSS +/− 1 kb for all annotated genes and signal of histone modification ChIP-seq read density. Another example is that a BED file of motifs (signal) can be compared between promoters (windows) and 3′UTRs (another set of windows) to look for relative enrichment in different parts of genes. While metaseq provides utilities for creating, sorting, clustering and plotting these data, the resulting NumPy arrays themselves are generic and can be easily used in users’ own custom code.
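As a rough illustration of the windows-plus-signal idea (a sketch only, not metaseq's own implementation), the code below averages per-base signal into a fixed number of bins for each window and stacks the rows into a NumPy array. The function name `windows_to_array` and the toy coverage vector are hypothetical.

```python
import numpy as np

def windows_to_array(coverage, windows, bins=4):
    """Average per-base signal into `bins` bins for each (start, end) window.

    coverage: 1-D array of per-base signal for one chromosome (toy data here).
    windows:  list of (start, end) half-open intervals, each at least `bins` long.
    """
    rows = []
    for start, end in windows:
        # Split the window into `bins` nearly equal chunks and average each one.
        chunks = np.array_split(coverage[start:end], bins)
        rows.append([chunk.mean() for chunk in chunks])
    return np.array(rows)

# Toy per-base coverage and two 8-bp windows (hypothetical values).
coverage = np.arange(20, dtype=float)
arr = windows_to_array(coverage, [(0, 8), (10, 18)], bins=4)
```

Each row of `arr` corresponds to one window and each column to one bin, which is the layout needed for the heatmaps and average profiles described here.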

Parallelization and caching

To speed up retrieval of signal, a large number of windows can optionally be automatically split across available CPUs. Depending on the structure of the underlying data and the hardware specifications, the speedup can be linear with the number of CPUs (Supplementary Figure S1). A script is provided with metaseq to identify the optimal number of processes to use for a particular dataset and hardware configuration. metaseq also provides utilities for calculating arrays once, saving results to disk, and then subsequently loading them in downstream analysis as memory-mapped files. This feature makes it possible to work with many large arrays that would not otherwise fit in available memory, which can drastically reduce computational time.
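The compute-once, load-as-memory-map pattern can be sketched with plain NumPy as follows. This is a minimal sketch of the general technique and does not reproduce metaseq's own caching utilities; the helper names `cache_array` and `load_cached`, the file name and the random data are hypothetical.

```python
import os
import tempfile
import numpy as np

def cache_array(arr, path):
    """Write the array to disk once in NumPy's .npy format."""
    np.save(path, arr)

def load_cached(path):
    """Re-open the cached array as a read-only memory map."""
    return np.load(path, mmap_mode="r")

# Hypothetical example: a 1000-window x 100-bin array of signal.
signal = np.random.default_rng(0).random((1000, 100))
cache_path = os.path.join(tempfile.mkdtemp(), "signal.npy")
cache_array(signal, cache_path)

cached = load_cached(cache_path)
column_means = cached.mean(axis=0)  # computed from the on-disk data
```

Because `np.load(..., mmap_mode="r")` returns a memory-mapped array, many such arrays can be opened at once without reading them all into memory.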

Tables of results

Windows and signal as described above are defined and accessed by genomic coordinate. In contrast, some data used in genomic research is in tabular form and is accessed by ID, such as the output from DESeq or Cufflinks differential expression analysis, where each row is keyed by gene ID. metaseq extends DataFrame objects from the pandas package (high-performance data structures optimized for tabular data, http://pandas.pydata.org) by adding new tools for manipulating and visualizing genomic data. However, the genomic interval of each feature is typically not included in these kinds of results tables, so a mapping of gene ID to interval is required for integration with data accessed by genomic coordinate.

Mapping of ID to genomic interval

In order to map gene ID to interval, we use the gffutils package (https://github.com/daler/gffutils), which converts a GFF or GTF file of annotations into a SQLite database for persistent, indexed retrieval of intervals keyed by ID. Database creation only needs to be performed once for each set of annotations. The database then allows a user to look up the interval for a gene of interest in the tabular data (e.g. an upregulated gene) and extract the genomic signal over that interval from another data source (e.g. a BAM file of mapped reads). An additional benefit of using a gffutils database is that the full gene models can be hierarchically navigated, e.g. to retrieve all of the introns for a gene of interest or to identify constitutive exons.
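The idea of a persistent, indexed ID-to-interval lookup can be sketched with Python's built-in `sqlite3` module. This is not the gffutils schema or API; the table layout, the helper names `build_db` and `interval_for`, and the two toy GTF-style records are assumptions made for illustration. An in-memory database is used here, whereas a file path would make it persistent.

```python
import sqlite3

# Two toy GTF-style gene records (hypothetical IDs and coordinates).
GTF_LINES = [
    'chr2L\ttoy\tgene\t1000\t5000\t.\t+\t.\tgene_id "geneA";',
    'chr2L\ttoy\tgene\t7000\t9000\t.\t-\t.\tgene_id "geneB";',
]

def build_db(lines):
    """Load gene records from GTF-style lines into an in-memory SQLite table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE features (id TEXT PRIMARY KEY, chrom TEXT, "
        "start INTEGER, stop INTEGER, strand TEXT)"
    )
    for line in lines:
        chrom, _, _, start, stop, _, strand, _, attrs = line.split("\t")
        gene_id = attrs.split('gene_id "')[1].split('"')[0]
        db.execute(
            "INSERT INTO features VALUES (?, ?, ?, ?, ?)",
            (gene_id, chrom, int(start), int(stop), strand),
        )
    return db

def interval_for(db, gene_id):
    """Return (chrom, start, stop, strand) for a gene ID, or None if absent."""
    return db.execute(
        "SELECT chrom, start, stop, strand FROM features WHERE id = ?",
        (gene_id,),
    ).fetchone()

db = build_db(GTF_LINES)
```

A gene ID taken from a results table can then be passed to `interval_for` to obtain the coordinates needed to extract signal from another data source.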

Mini-browsers

One example of how the building blocks provided by metaseq can be used to build novel functionality is demonstrated in the minibrowser module, which integrates the ID and interval components of metaseq. ‘Mini-browsers’ act as standalone Python genome browsers with a plugin architecture, allowing flexible custom designs to be created. Mini-browsers are designed to operate as callbacks so that when a user clicks a point in a scatterplot created by metaseq, the genomic coordinates of that region are retrieved from the gffutils database, genomic signal is extracted from the configured file or files, and a new figure is opened showing data over the region for the clicked point. This allows rapid exploration of data without having to upload or navigate through a web browser to the region for the clicked point.

Preparation of RIP-seq libraries

Nuclear extracts from Drosophila cell lines BG3 (CNS-derived) and Kc167 (hemocyte-derived) were prepared from 3.0 × 10⁸ to 6.3 × 10⁹ cells per IP essentially as described in (11) with the following modifications: cell extracts were immediately cleared by centrifugation after dounce homogenization and purified nuclei were lysed by dounce homogenization with the B pestle in HBSMT (50 mM HEPES, pH 6.7; 150 mM NaCl; 5 mM KCl; 2.5 mM MgCl₂; 0.3% Triton X-100) supplemented with 1 mM PMSF, Complete EDTA-free protease inhibitor (Roche), and 1 U/μl RNasin (Promega). Nuclear extracts were incubated rotating for 1 h at 4°C with Protein A conjugated sepharose (GE Healthcare) prebound to preimmune serum. Precleared extracts were then transferred to Protein A sepharose prebound to Shep or Su(Hw) antisera (8) for 1 h at 4°C, rotating. Beads and bound complexes were washed in HBSMT three times and HBSM twice. RNA was isolated directly from Protein A beads by acid:phenol extraction and EtOH precipitated with NaOAc. RNA was DNase I treated, poly(A)⁺ RNA was selected using a MicroPoly(A)Purist column (Ambion) and 100 ng was cloned using the Illumina mRNA-seq v2.2 cloning protocol and ∼80 bp paired-end adapters. Size-selected fragments (∼200 bp including adapters) were sequenced on an Illumina GA II or HiSeq 2000 using commercial sequencing primers, resulting in 36-bp reads. Highly similar profiles were obtained using two independent Shep antibodies; therefore, the antibody displaying the highest signal to noise ratio, Rb5, was utilized for subsequent analyses.

Chromatin immunoprecipitation

ChIP was performed as described previously (8).

Raw data processing

Raw reads were processed with cutadapt v1.2.1 (12) to remove adapters and perform quality trimming, with default parameters except for --quality-cutoff=20, --minimum-length=36 (for RIP-seq), --minimum-length=25 (for ChIP-seq), and adjusting --quality-base as appropriate for each library. Trimmed reads were mapped to the UCSC dm3 assembly using TopHat2/Bowtie2 (13) (for RIP-seq) or Bowtie2 (ChIP-seq) using default parameters, with the addition of --phred64-quals where appropriate.

For RIP-seq, we used annotations from FlyBase r5.54. Reads were then counted in annotated exons with HTSeq (14). Counts from technical replicates (same sample run on different sequencer models, GAII or HiSeq 2000) were summed. Counts for biological replicates were loaded into DESeq (15). Since we expect input and RIP samples to be substantially different from each other, we chose to use method = ‘per-condition’ for DESeq's estimateDispersions. We also used fitType = ‘local’, which fit the mean-variance relationship better than the default ‘parametric’ setting. Scripts to download these processed data are included in the Supplementary Materials.

Mapped ChIP-seq reads were filtered with samtools v0.1.19 to remove multimappers (samtools view -q 20) and duplicate reads (samtools rmdup -s). Peaks were called using both MACS2 v2.0.10 (https://github.com/taoliu/MACS) and SPP v1.11 (16). We chose a conservative set of peaks by retaining only peaks called by both algorithms at FDR = 0.05. Specifically, we used the 1-bp summits from MACS2 at FDR = 0.05 that intersected a peak region called by SPP at FDR = 0.05. Downstream analysis typically looked within a 1-kb window around each peak summit.
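The consensus step (keep a 1-bp summit only if it falls inside a region called by the second algorithm, then extend it into a 1-kb window) can be sketched in plain Python. This is an illustrative sketch rather than the pipeline used in this study, which relied on pybedtools/BEDTools. The function name `consensus_summits` and the coordinates are hypothetical, and the regions are assumed to be sorted and non-overlapping on a single chromosome.

```python
from bisect import bisect_right

def consensus_summits(summits, regions, flank=500):
    """Keep summits that fall inside any region; return flank-extended windows.

    summits: list of 1-bp summit positions on one chromosome.
    regions: non-overlapping (start, end) half-open regions, sorted by start.
    """
    starts = [start for start, _ in regions]
    windows = []
    for pos in sorted(summits):
        i = bisect_right(starts, pos) - 1  # last region starting at or before pos
        if i >= 0 and pos < regions[i][1]:
            windows.append((max(0, pos - flank), pos + flank))
    return windows

# Hypothetical coordinates on a single chromosome.
macs2_summits = [1200, 5000, 9100]
spp_regions = [(1000, 1500), (9000, 9500)]
windows = consensus_summits(macs2_summits, spp_regions)
```

Here the summit at 5000 is dropped because no region from the second caller contains it, while the other two summits are retained and extended by 500 bp on each side.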

Public data

The Supplementary Materials include scripts for downloading and pre-processing ChIP-chip and RNA-seq data from modENCODE. Briefly, for the ChIP-chip data the modENCODE Python web service client was used to query modMine for all Drosophila datasets satisfying the following conditions: (i) the data type was in WIG format, (ii) the experiment type was not ‘Computational annotation’ or ‘ChIP-chip RNAi’, (iii) the cell type was either ‘ML_DmBG3-c2’ or ‘Kc167’ and (iv) ‘Mvalues’ was not found in the data URL. This latter filter removed data from individual replicates, but retained the WIG files representing the merged replicates. The WIG files for each entry meeting these criteria were downloaded and converted to bigWig format using UCSC's wigToBigWig program. The values in the bigWig files are those reported by modENCODE. Specifically, M-values were calculated for each replicate (log₂(ChIP) - log₂(input) for all perfect match ChIP-chip probes) and then shifted to a mean of zero. M-values for all replicates were smoothed using the lowess method with a 500-bp window, resulting in the final values reported by modENCODE.
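The per-replicate M-value calculation described above (log₂(ChIP) minus log₂(input), then shifted to a mean of zero) can be written out as a short sketch. The lowess smoothing step is deliberately omitted, and the function name `m_values` and the probe intensities are hypothetical.

```python
import math

def m_values(chip, control):
    """log2(ChIP) - log2(input) per probe, shifted so the replicate mean is zero."""
    raw = [math.log2(c) - math.log2(i) for c, i in zip(chip, control)]
    mean = sum(raw) / len(raw)
    return [m - mean for m in raw]

# Hypothetical probe intensities for one replicate.
chip_intensity = [8.0, 4.0, 2.0, 1.0]
input_intensity = [2.0, 2.0, 2.0, 2.0]
m = m_values(chip_intensity, input_intensity)
```

After centering, positive values indicate probes with above-average ChIP enrichment within that replicate and negative values indicate below-average enrichment.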

For the RNA-seq data, we ran the analysis from raw FASTQ files in order to have expression data for the same FlyBase annotations used for the RIP-seq analysis. Paired-end 37-bp FASTQ files were downloaded from SRA using accessions documented at GSE15596 for BG3 and Kc167 cells. Data processing was performed similarly to the RIP-seq analysis (including summing read counts for technical replicates), with the exception of using sickle for paired-end trimming and the ‘pooled’ method for DESeq's estimateDispersions.

RIP-seq data generated from this study are available at GEO accession GSE55894. Raw ChIP-seq data from previous studies are available at GSE40797 (Kc167 cells) and GSE51462 (BG3 cells).

RESULTS

ChIP-seq profiling of Shep and Su(Hw) in two different cell types

We first compared the genome-wide binding profiles of the gypsy insulator protein Su(Hw) and the CNS-enriched insulator antagonist Shep in two different cell lines. We performed ChIP-seq of Su(Hw) and Shep in parallel in Kc167 (17) and BG3 cells (8), which are embryonic hemocyte and larval CNS-derived lines, respectively. Both lines have been profiled for a wide variety of chromatin associated proteins and histone posttranslational modifications by the modENCODE consortium as well as independent groups. We defined binding sets as the summits of peaks jointly identified by the SPP (16) and macs2 (https://github.com/taoliu/MACS) algorithms at a 5% false discovery rate (FDR) and extended the summits by 500 bp in both directions. We found that most Shep Kc167 sites (80% of 1453) overlap a subset of BG3 sites (23% of 6315). The smaller number of total binding sites for Shep in Kc167 compared to BG3 cells is consistent with the lower expression of Shep in non-CNS tissue (8). For Su(Hw), we found a majority of Su(Hw) sites shared between both cell types (70% of 3845 Kc167 sites and 56% of 4887 BG3 sites) (Figure 1A). Consistency of Su(Hw) binding across cell types has been observed previously (18–20).

Figure 1. — Binary heatmaps of ChIP-seq peak centers +/− 500 bp showing overlap of a single factor between cell types (A, B) or overlap of both factors in a single cell type (C, D). Each row indicates a unique genomic region, and a black mark in the column shows the presence of the factor in that region. Total number of peaks in each experiment is indicated in parentheses. See Supplementary Figure S2 for the combined 4-way comparison and an analogous Venn diagram showing the same data.

In contrast, Shep and Su(Hw) do not strongly overlap in either cell type. Only 9% of Shep binding sites overlap with 11% of Su(Hw) binding sites in BG3 cells, and 19% of Shep binding sites overlap with 8% of Su(Hw) binding sites in Kc167 cells (Figure 1C–D). We visualize these overlaps using binary heatmaps, which, unlike Venn diagrams, can handle cases where a single peak in one dataset is overlapped by multiple peaks in another dataset. Furthermore, this kind of visualization scales to larger multi-way comparisons than a traditional Venn diagram while still remaining interpretable (Supplementary Figure S2). Taken together, these data show that Kc167 peaks are mostly a subset of a larger number of BG3 peaks for both proteins (Figure 1A, B), and Shep and Su(Hw) do not often colocalize with each other in either cell type (Figure 1C, D, Supplementary Figure S2).
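One way to build the matrix behind such a binary heatmap is to merge all peaks into union regions and then record, for each region, which datasets contribute a peak. The sketch below is an assumption about how this could be done, not the code used for Figure 1; the function name and the peak coordinates are hypothetical and a single chromosome is assumed.

```python
def binary_overlap_matrix(datasets):
    """Merge all intervals into union regions, then flag each dataset's presence.

    datasets: dict mapping a label to a list of (start, end) half-open
              intervals on one chromosome.
    Returns (regions, rows), where rows[i][j] is 1 if dataset j has any peak
    inside merged region i.
    """
    labels = list(datasets)
    tagged = sorted(
        (start, end, j)
        for j, label in enumerate(labels)
        for start, end in datasets[label]
    )
    regions, rows = [], []
    for start, end, j in tagged:
        if regions and start < regions[-1][1]:  # overlaps the current region
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
            rows.append([0] * len(labels))
        rows[-1][j] = 1
    return regions, rows

# Hypothetical peaks; one BG3 peak is overlapped by two Kc167 peaks.
peaks = {
    "Kc167": [(100, 200), (250, 350), (900, 1000)],
    "BG3": [(150, 300), (2000, 2100)],
}
regions, rows = binary_overlap_matrix(peaks)
```

The case that is awkward for a Venn diagram, a single peak overlapped by two peaks from the other dataset, collapses into one row marked in both columns.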

Analysis of chromatin context of Su(Hw) and Shep binding sites

We next examined the relationships between Su(Hw) or Shep chromatin association and other known chromatin factors profiled within the same cell type. Previous studies showed that Shep binding primarily coincides with factors related to active transcription rather than gypsy insulator proteins (8). In order to explore this relationship in more detail, we compared average normalized ChIP-chip signals from the modENCODE consortium (21). We used all available factors profiled in Kc167 or BG3 cells and examined these signals across 2 kb windows centered on Su(Hw) peaks, where higher signal is mainly observed for certain insulator proteins such as CP190 and Mod(mdg4)2.2, but not for marks of active transcription (Figure 2A and C). In contrast, Shep peaks (Figure 2B and D) generally correspond to factors related to active transcription, such as RNA polymerase II and Histone H3 trimethylated on lysine 4 (H3K4me3) in both cell types.

Figure 2. — ChIP-chip signal of factors profiled by the modENCODE consortium over Su(Hw) and Shep peaks in Kc167 cells (A, B) and BG3 cells (C, D). Each panel represents one set of called peaks identified in this study, and each row in the panel represents the average normalized ChIP-chip signal reported by modENCODE (lowess-smoothed log₂(IP/input), or M-score) for a single factor over those peaks. Rows are sorted by the mean value over the center 200 bp.

The parallel processing capability of metaseq enables retrieving and binning signal data across multiple files for thousands of sites in a reasonable amount of time (∼1 s per 1000 intervals from a single file using 8 CPUs, Supplementary Figure S1). Furthermore, the resulting arrays can be cached to disk and loaded later as memory-mapped files, and this approach has multiple advantages. First, storing the data on disk avoids recomputation in downstream analysis. Second, memory-mapped files can be read directly in a fraction of the time it would take to parse the same data stored in a text file. Third, the memory-mapped files allow working with many arrays on disk that would not otherwise fit in memory. In Figure 2, each of the 150+ rows is stored as a memory-mapped array with thousands of rows (one row representing one peak), all of which can be loaded, averaged and plotted in <10 s on typical hardware.

Meta-gene plotting shows Shep is recruited to the 5′ end of actively transcribed genes

Since Shep chromatin association coincides with marks of active transcription, we were interested in studying the relationship between Shep recruitment, gene structure and expression status. It was previously shown in BG3 cells that 65% of chromatin-bound Shep was found at the TSS of genes, while Su(Hw) was found mainly associated with introns and intergenic regions (8). In order to examine the relationship between Shep recruitment and gene structure in more detail, we used metaseq to produce a heatmap of normalized ChIP-seq signal over the upstream, gene body and downstream regions of all annotated genes in parallel (‘meta-gene’ plot, Figure 3A). Genes were ranked in order of expression using modENCODE RNA-seq data for BG3 cells, revealing a positive relationship between Shep recruitment at the TSS and higher gene expression. A summary line plot of overall average Shep ChIP signals shows strong preferential recruitment of Shep to the 5′ end of genes (Figure 3B). Finally, signal averages for distinct gene expression quantiles clearly demonstrate that Shep is preferentially recruited to the 5′ end of more highly expressed genes (Figure 3C). In Kc167 cells, only mild enrichment for Shep is observed at the TSS in a gene expression-dependent manner (Supplementary Figure S3, note y-axis scale). In contrast, Su(Hw) chromatin association in BG3 cells shows little preferential association with any particular region examined, and this profile appears independent of gene expression status (Supplementary Figure S4). Su(Hw) binding in Kc167 cells exhibits mild depletion at the TSS at more highly transcribed genes (Supplementary Figure S5).

Figure 3. — Meta-gene plot of ChIP-seq signal for Shep in BG3 cells. Each row in the matrix (A) represents the normalized ChIP-seq enrichment over one gene scaled to 500 bins, and +/− 5 kb regions scaled to 100 bins. Enrichment is calculated by first scaling IP and input libraries to reads per million mapped reads (RPMMR) and then subtracting the input signal from the IP signal (color bar, bottom right). Genes are ranked by expression (RPKM) in BG3 cells (right panel, data from modENCODE). Middle line plot (B) shows the column averages of the heatmap, with the wider band indicating 95% confidence interval. Bottom line plot (C) shows average ChIP-seq signal over expression quantiles (percentiles indicated in legend). Note that white rows in the heatmap are repetitive genes (rRNA, histones) where multi-mapping reads have been removed in the ChIP-seq analysis.
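The enrichment calculation given in the caption (scale IP and input to reads per million mapped reads, then subtract input from IP) amounts to the following arithmetic. This is a minimal sketch; the function name, the binned counts and the library sizes are hypothetical.

```python
def rpmmr_enrichment(ip_counts, input_counts, ip_total, input_total):
    """Scale binned read counts to reads per million mapped reads, then IP - input."""
    ip_scale = 1e6 / ip_total
    input_scale = 1e6 / input_total
    return [
        ip * ip_scale - inp * input_scale
        for ip, inp in zip(ip_counts, input_counts)
    ]

# Hypothetical binned counts over one gene and hypothetical library sizes.
enrichment = rpmmr_enrichment(
    ip_counts=[40, 10, 5],
    input_counts=[10, 10, 10],
    ip_total=2_000_000,
    input_total=4_000_000,
)
```

Scaling by library size before subtracting keeps a deeper-sequenced library from dominating the difference.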

All figures in this manuscript are shown exactly as created by metaseq with no additional editing or post-processing. This feature highlights the flexibility and customization possible with a library of tools and utilities rather than a one-size-fits-all command-line program approach, which would require the developer to anticipate possible options to make available to users. The metaseq-specific functionality demonstrated in Figure 3 includes the following: (i) the ability to flexibly specify regions based on standard interval formats (BED, GFF, GTF; here we use upstream, downstream and gene body regions from a custom set of annotations); (ii) extraction of ChIP and input signal across these regions from BAM or bigWig files to construct the array in parallel across multiple CPUs; (iii) loading a data table of gene expression, sorting it to match the annotations and computing a sort index for plotting the array by expression level and (iv) a colormap with min and max determined by the 1st and 99th percentiles respectively and with white centered on zero. The figure is then constructed using standard Python data manipulation and visualization tools such as pandas, NumPy and matplotlib.
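Steps (iii) and (iv) can be approximated with NumPy alone. The sketch below computes a sort index that orders array rows by expression, and the percentile-based color limits together with the fractional position at which zero falls between them. It is not metaseq's implementation; the helper names are hypothetical, and treating genes missing from the expression table as zero is an assumption of this sketch.

```python
import numpy as np

def expression_sort_index(gene_ids, expression):
    """Indices that reorder array rows (in gene_ids order) by ascending expression.

    expression: dict of gene ID -> expression value. Genes absent from the
    table are treated as 0 here, which is an assumption of this sketch.
    """
    values = np.array([expression.get(g, 0.0) for g in gene_ids])
    return np.argsort(values, kind="stable")

def color_limits(arr, low=1, high=99):
    """Return (vmin, vmax, zero_frac): percentile limits and where zero falls
    between them as a fraction, e.g. for placing white in a diverging colormap."""
    vmin, vmax = np.percentile(arr, [low, high])
    zero_frac = (0.0 - vmin) / (vmax - vmin)
    return vmin, vmax, zero_frac

# Toy inputs (hypothetical gene IDs and values).
order = expression_sort_index(["g1", "g2", "g3"], {"g1": 5.0, "g3": 1.0})
vmin, vmax, zero_frac = color_limits(np.linspace(-1, 3, 101))
```

The array rows would then be plotted as `arr[order]`, with `zero_frac` used to position the white point of a diverging colormap between `vmin` and `vmax`.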

Intersection analysis of Su(Hw) and Shep RIP-seq profiles

Given that Shep associates with chromatin of highly expressed genes and that Shep is an RNA-binding protein, we were interested in the relationship between Shep chromatin association and nuclear RNA-binding. We determined the population of mRNA transcripts stably bound to Shep in either BG3 or Kc167 nuclei by performing Shep RNA immunoprecipitation followed by oligo-dT selection and sequencing (RIP-seq) (22). In order to identify enriched transcripts, we performed differential gene expression analysis using the DESeq algorithm (15) on two independent biological replicates of total nuclear mRNA input versus IP sample. Using a P_adj threshold of 0.05, we identified 50 and 472 Shep-associated transcripts in BG3 and Kc167 cells, respectively. Since gypsy insulator complexes have also recently been shown to associate specifically with certain mRNAs, we also performed Su(Hw) RIP-seq in parallel, obtaining 503 and 1028 Su(Hw)-associated transcripts in BG3 and Kc167 cells, respectively (Figure 4).

Figure 4. — Scatterplots of enrichment versus expression in RIP-seq for Su(Hw) and Shep in BG3 and Kc167 cells. Green (Su(Hw)) or blue (Shep) dots show genes that encode transcripts enriched by RIP-seq with an adjusted P-value < 0.05. Red dots indicate the genes that encode transcripts pulled down by both Su(Hw) and Shep RIP in the same cell type, and gray dots show all other genes. Rug plots extending along the bottom represent genes that had zero reads in the RIP samples and therefore have undefined log₂ fold change. Genes with zero reads in the input samples have both undefined log₂(RPKM) and undefined log₂ fold change, and so are not shown.

We next compared the RNAs associated with Shep or Su(Hw) in either cell line. A subset of transcripts enriched in the Shep RIP are also enriched in Su(Hw) RIP in each cell type (Figure 4, red dots). We found only 43 of 472 (9%, hypergeometric test P = 0.12) Shep-enriched transcripts are also Su(Hw)-enriched in Kc167 cells (Figure 4A and B), while 25 of 50 (50%, hypergeometric test P < 2.2e−16; Figure 4C and D) are shared in BG3 cells. These results are consistent with the possibility that Shep acts as an RNA-binding protein adapter for gypsy insulator complexes in CNS tissue, from which BG3 cells are derived. In addition, since only a small subset of Su(Hw) associated transcripts are bound by Shep in either cell type, we suggest that additional RNA binding proteins may mediate these particular interactions with insulator complexes.
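For readers unfamiliar with the hypergeometric test for overlap, a minimal standard-library sketch of the upper-tail probability is given below. The background population size (the number of genes considered) is not stated in this section, so it is left as a parameter, and the P-values reported above are not reproduced here. The function name `hypergeom_sf` and the toy numbers are hypothetical.

```python
from math import comb

def hypergeom_sf(overlap, population, set_a, set_b):
    """P(X >= overlap) when set_b items are drawn without replacement from
    `population` items, of which set_a are 'successes'."""
    total = comb(population, set_b)
    upper = min(set_a, set_b)
    return sum(
        comb(set_a, k) * comb(population - set_a, set_b - k)
        for k in range(overlap, upper + 1)
    ) / total

# Toy example: 10 genes in total, two sets of 5, all 5 shared.
p_toy = hypergeom_sf(overlap=5, population=10, set_a=5, set_b=5)
```

For the comparison described above, `set_a` and `set_b` would be the numbers of Su(Hw)- and Shep-enriched transcripts in one cell type and `overlap` the number shared between them.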

Using metaseq for these basic scatterplots has several advantages. First, gene expression data often contains zero values, resulting in undefined or infinite values when calculating fold change. Such values are represented by metaseq as ‘rug’ plots that always appear on the extreme borders of the axes and remain there even when interactively panning and zooming. This avoids either removing these data from display or choosing arbitrary x or y values for plotting. Second, the plots are interactive in that clicking on a point will by default print the gene ID and any information in the underlying dataset. Optionally, a ‘minibrowser’ can be configured to open a new window when a point is clicked, displaying the genomic signal over the gene of interest (Supplementary Figure S6). Using these interactive features to explore these plots by clicking on outliers, we observed that many transcripts bound by Su(Hw) in BG3 cells are small nucleolar RNAs (snoRNAs). By overlaying the gene type information using metaseq, we find a substantial portion of the Su(Hw)-enriched genes in BG3 cells, but not Kc167 cells, encode snoRNAs (Supplementary Figure S7). In fact, almost half (10 of 25) of the shared Shep/Su(Hw)-enriched genes in BG3 cells encode snoRNAs.
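The separation of points with a defined fold change from those that need a rug mark can be sketched as follows. This illustrates the idea only and is not metaseq's plotting code; the function name `split_for_rug` and the counts are hypothetical.

```python
import math

def split_for_rug(ip_counts, input_counts):
    """Separate genes with a finite log2 fold change from those needing a rug mark.

    Returns (finite, rug): finite maps gene -> log2(IP / input); rug lists genes
    whose fold change is undefined or infinite because a count is zero.
    """
    finite, rug = {}, []
    for gene, ip in ip_counts.items():
        inp = input_counts[gene]
        if ip > 0 and inp > 0:
            finite[gene] = math.log2(ip / inp)
        else:
            rug.append(gene)
    return finite, rug

# Hypothetical read counts for three genes.
finite, rug = split_for_rug(
    ip_counts={"a": 8, "b": 0, "c": 4},
    input_counts={"a": 2, "b": 5, "c": 0},
)
```

Genes in `finite` are drawn as ordinary scatter points, while genes in `rug` are drawn as ticks along the axis border rather than being dropped or assigned an arbitrary coordinate.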

Exploratory visualization reveals that Su(Hw)-associated RNA originates from long, low-expression loci

Based on the above analyses, we noted that many Su(Hw)-associated RNAs in Kc167 cells appear to be expressed at a low level. Interestingly, 438 of the 1028 Su(Hw)-associated RNAs in Kc167 cells actually have no detectable expression in the input, yet have consistently high enough read counts across IP replicates to be significantly enriched following DESeq analysis (Figure 4A, Table 1). In contrast, the majority of Shep-associated RNAs are detectably expressed at steady state in nuclear extracts (Figure 4B and D). We took advantage of metaseq's ability to create radviz plots (23), which are a graphical technique for exploratory visualization of multidimensional data, to further characterize the associated RNAs. Briefly, variables of interest are spaced evenly around a unit circle. Points (representing genes in this case) are placed within the circle as if they were attached by springs to the points representing variables, where a higher value corresponds to a stronger ‘spring’. From these plots, we noted that many low-expression Su(Hw)-associated transcripts are derived from unusually large genomic loci with long introns (Supplementary Figure S8). In order to examine this in greater detail, we plotted genes encoding both Su(Hw) and Shep-associated transcripts (hereafter referred to as ‘RIP-enriched genes’) based on expression in nuclear extracts compared to length of originating locus (Figure 5, Supplementary Figure S9). The trend for long, low-expression Su(Hw)-bound RNA is most apparent in Kc167 cells. Importantly, these Su(Hw) RIP-enriched genes do not necessarily harbor Su(Hw) ChIP peaks, suggesting that this result is not due to an experimental artifact. As illustrated in Figure 5 and Supplementary Figure S9, the plotting utilities in metaseq include the ability to add marginal stacked histograms of subsets of the data. This feature enables more detailed comparisons among subsets of the data that are otherwise obscured by large numbers of points in the scatterplot, as is often the case for genome-wide data.
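The spring analogy behind radviz plots can be made concrete with a short sketch. It is not taken from metaseq: the helper names `rescale`, `radviz_anchors` and `radviz_point`, and the choice of min-max scaling, are assumptions for illustration. Each variable gets an anchor on the unit circle, and each gene is placed at the average of the anchor positions weighted by its (non-negative, rescaled) values, which is the equilibrium position of springs whose stiffness equals those values.

```python
import math


def rescale(column):
    """Min-max scale one variable to [0, 1] so variables are comparable."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]


def radviz_anchors(n):
    """Place n variable anchors evenly around the unit circle."""
    return [(math.cos(2 * math.pi * j / n), math.sin(2 * math.pi * j / n))
            for j in range(n)]


def radviz_point(values, anchors):
    """Spring equilibrium: anchor positions averaged with `values` as weights.

    values must be non-negative; a gene with all-zero values sits at the centre.
    """
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)
    x = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return (x, y)


anchors = radviz_anchors(4)

# A gene dominated by the first variable is pulled onto that anchor ...
pulled = radviz_point([1.0, 0.0, 0.0, 0.0], anchors)
# ... while a gene with equal values for every variable stays at the centre.
balanced = radviz_point([0.5, 0.5, 0.5, 0.5], anchors)
```

Because only the relative sizes of the values matter, genes with very different absolute magnitudes can land on the same spot, which is why radviz serves here as an exploratory view followed by more direct plots.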

Table 1. Number of RIP-enriched transcripts identified and subset with undetectable expression in the input nuclear fraction.

	Enriched	Enriched with zero input expression
Kc Su(Hw)	1028	438 (43%)
Kc Shep	472	15 (3%)
BG3 Su(Hw)	503	54 (11%)
BG3 Shep	50	0 (0%)

Figure 5. Scatterplots showing the relationship of locus length in kb and expression for all transcripts, along with marginal histograms. Top histograms show distribution of x-axis values; side histograms show distribution of y-axis values. Green (Su(Hw)) or blue (Shep) dots show genes encoding transcripts enriched by RIP, and gray dots show all other genes.

Analysis of chromatin context from which Shep and Su(Hw)-associated transcripts are derived

Given apparent differences in expression status of Su(Hw) and Shep associated transcripts, we were interested in examining the chromatin features associated with Shep and Su(Hw) RIP-enriched genes. Therefore, we examined modENCODE ChIP signals, this time across 2 kb windows centered on the TSS of the RIP-enriched genes for Su(Hw) or Shep in either Kc167 or BG3 cells (Figure 6). For each factor, we subtracted the average input-normalized signal over all non-enriched TSSs from the average input-normalized signal over all enriched TSSs, thereby highlighting the chromatin context specific to RIP-enriched genes. Consistent with low expression observed for Su(Hw)-bound transcripts in Kc167 cells, strong depletion for H3K4me3 is detected at the promoters of origin (Figure 6A). In contrast, high enrichment of H3K4me3 is observed at the promoters of Shep-associated transcripts in both cell types (Figure 6B and D). Interestingly, Su(Hw)-bound transcripts in BG3 cells also derive from an active chromatin context (Figure 6C).
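The subtraction described above (average signal over enriched TSSs minus average signal over all other TSSs) can be sketched as follows. The data layout, the function names `mean_profile` and `enrichment_difference`, and the toy values are assumptions for illustration, and plain Python lists stand in for the arrays a real analysis would use.

```python
def mean_profile(rows):
    """Column-wise mean of equal-length signal rows (one row per TSS window)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def enrichment_difference(signal, enriched_ids):
    """Mean profile over enriched genes minus mean profile over all other genes.

    signal: dict mapping gene ID to a list of input-normalized values, one per
            bin across the window centered on that gene's TSS.
    enriched_ids: set of gene IDs called as RIP-enriched.
    """
    enriched = [row for gid, row in signal.items() if gid in enriched_ids]
    other = [row for gid, row in signal.items() if gid not in enriched_ids]
    if not enriched or not other:
        raise ValueError("need at least one enriched and one non-enriched gene")
    return [e - o for e, o in zip(mean_profile(enriched), mean_profile(other))]


# Toy signal with three bins per TSS window (values made up for illustration).
signal = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [3.0, 4.0, 5.0],
    "g3": [0.0, 1.0, 1.0],
    "g4": [2.0, 1.0, 1.0],
}
diff = enrichment_difference(signal, {"g1", "g2"})
```

Positive bins in the result indicate signal that is stronger at enriched TSSs than at the background set, and negative bins indicate depletion. One such difference profile per factor would form one row of a heatmap like the one in Figure 6.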

Figure 6. Chromatin context of genes encoding RIP-enriched transcripts. Heatmaps are centered on the 5′-most TSS of genes and extend up- and downstream 1 kb. Values represent the average M-scores (lowess-smoothed log₂(IP/input)) over enriched gene TSSs minus the average M-scores over gene TSSs at all other genes. For each cell type, rows are sorted by Su(Hw) signal in that cell type.

Overlap analysis of RIP-enriched genes and ChIP-seq peaks to address cis versus trans recruitment of RNA

Finally, we asked whether Su(Hw) or Shep RNA association occurs in cis, that is, whether transcribed RNA associates with adjacent chromatin-bound protein. We examined whether RIP-enriched genes harbor a ChIP peak for the same factor in their vicinity. metaseq includes a function that accepts an input file of intervals (such as a BED file of called peaks) and a transformation function that acts on a gene (for example to return the TSS of a gene) and returns an array indicating whether a peak was found in that transformed region. The fraction of RIP-enriched genes harboring a ChIP peak for the same factor within 1 kb is significantly higher compared to non-RIP-enriched genes (Fisher's exact test, P ≤ 4.1e−5 for all cases; Table 2), suggesting that a subset of transcripts enriched in Shep or Su(Hw) purifications could be associating in cis. While the actual ratio of overlap is moderate in most cases (22–26%), a majority of Shep BG3 RIP-enriched genes overlap with Shep BG3 ChIP peaks (58%). However, the remaining RIP-enriched genes (42–78%) have no peak nearby, and by this definition, associate in trans to chromatin binding sites, consistent with previously identified mRNA isolated in complex with gypsy insulator proteins (7).
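The overlap test described above can be sketched in plain Python. This does not reproduce metaseq's API: the names `extend_gene`, `has_peak` and `fisher_exact_greater`, the tuple layouts and the toy coordinates are hypothetical, intervals are assumed to be 0-based and half-open as in BED files, and the one-sided form of Fisher's exact test is an assumption because the sidedness of the test is not stated here.

```python
from math import comb


def extend_gene(gene, flank=1000):
    """Hypothetical transform: the gene body extended by `flank` bp on each side."""
    chrom, start, end = gene
    return chrom, max(0, start - flank), end + flank


def has_peak(genes, peaks, transform):
    """Return one boolean per gene: does any peak overlap the transformed region?"""
    flags = []
    for gene in genes:
        chrom, start, end = transform(gene)
        flags.append(any(p_chrom == chrom and p_start < end and p_end > start
                         for p_chrom, p_start, p_end in peaks))
    return flags


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: enriched / not enriched; columns: with peak / without peak.
    Returns the hypergeometric probability of seeing at least `a` enriched
    genes with a peak, given the table margins.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


# Toy intervals (made up for illustration only).
genes = [("chr2L", 5000, 9000), ("chr2L", 20000, 26000), ("chr3R", 1000, 4000)]
peaks = [("chr2L", 9500, 9700), ("chr3R", 50000, 50500)]
flags = has_peak(genes, peaks, extend_gene)

# Toy 2x2 table: 3 of 4 enriched genes and 1 of 4 other genes have a peak.
p_value = fisher_exact_greater(3, 1, 1, 3)
```

Passing the transformation as a function keeps the overlap logic independent of the definition of "vicinity", so the same call could test TSS windows, gene bodies or any other region derived from a gene.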

Table 2. Fraction of RNA (with peak/total) associated with each protein that had a ChIP peak of the same protein within 1 kb of the originating gene body.

	Kc Su(Hw)	Kc Shep	BG3 Su(Hw)	BG3 Shep
Bound RNA with peak near origin	22% (222/1028)	26% (124/472)	26% (130/503)	58% (30/50)
Unbound RNA with peak near origin	14%	10%	14%	31%
Fisher's exact test pval	1.7e−9	1.9e−10	2.2e−10	4.1e−5

DISCUSSION

As illustrated here, metaseq features data manipulation utilities such as uniform API access to signal data, acquisition of binned signal across tens of thousands of windows in parallel, and extension of existing tabular data tools for genomic-specific analysis tasks. For visualization, metaseq provides tools for colormap generation, heatmap sorting, radviz plots, scatter plots with marginal histograms and mini-browsers for interactive data exploration. We have demonstrated the applicability of metaseq for integrating genomic analyses by comparing RIP-seq, ChIP-seq and RNA-seq data from multiple sources, showing that metaseq is capable of generating novel biological insight from new and existing datasets.

Similar software

Several Python packages offer useful functionality that can be further integrated into metaseq. Genomedata (24) provides tools for reading and loading data into the HDF5 file format, which could be used as an alternative or in addition to the memory-mapped NumPy files currently used in metaseq. Cruzdb (25) provides tools for querying the UCSC genome browser's MySQL databases and can be used in metaseq as an alternative for the ID-to-coordinate mapping functionality that is currently supplied by gffutils.

The CummeRbund suite of tools (26) integrates packages from the R statistical programming environment for working with tabular genomic data from the Cufflinks differential expression software (27), whereas the tabular genomic data part of metaseq is not specific to any particular package or type of analysis. Software that can be used to generate heatmaps of genomic signal over specified windows, such as Homer (28), ngsplot (29) and deepTools (30), focuses on pre-packaged scripts for command-line usage. metaseq complements these various software packages and allows more complex analyses by providing a library of modular components that can be incorporated into genomics researchers’ custom analysis code, adding flexibility and interaction for deeper analysis of high-throughput genomics data.

Role of metaseq in genomic research

Not only are high-throughput genomic datasets becoming increasingly available, but new techniques (reviewed in (31)) are constantly being developed beyond the standard ChIP- and RNA-seq protocols. A modular library of tools like metaseq is critical for assisting in the development of custom algorithms, visualization, and downstream analysis of data from such experiments and combining them in ways that could not be anticipated when developing the library. Here, we used chromatin insulators as a representative example of a complex, interconnected system that requires the integration of multiple kinds of datasets and show the practical usage of metaseq. The fundamental advantage of metaseq is that its modular tools make genomic data available in standard Python data structures, thus providing a bridge from primary data analysis (mapping, peak-calling, differential expression detection) to downstream analysis code capable of furthering biological insight.

Shep may act as an RNA-binding adapter for gypsy insulator-associated RNAs in CNS tissue

Comparison of RIP-seq of Shep versus Su(Hw) in Kc167 or BG3 cells showed a statistically significant level of overlap only in the CNS-derived cell line (BG3). These results suggest that Shep could act as an RNA-binding protein adapter for gypsy insulator complexes. Since Shep binds only a subset of the larger number of Su(Hw)-associated nuclear mRNAs, Shep is likely only one of several such putative adapters within and outside the CNS context. Another candidate RNA-binding protein adapter is Rump (17), but its tissue specificity likely lies outside of the CNS lineage. Notably, a substantial number of the transcripts stably bound by both Shep and Su(Hw), particularly in BG3 cells, are snoRNAs. It was recently shown that ∼22% of Drosophila snoRNAs associate with chromatin (caRNAs) (32); however, this particular class of snoRNAs is not enriched with Shep or Su(Hw) (data not shown). We did identify a substantial number of Shep-associated transcripts that are not stably associated with Su(Hw) in either Kc167 or BG3 cells, suggesting that Shep likely harbors additional nuclear functions outside of insulator activity. It should be noted that snoRNAs, along with many additional mRNAs purified in complex with Su(Hw) in this study, were not previously identified in higher specificity sequential immunopurifications of gypsy insulator complexes (7). The larger number of transcripts identified with the single-antibody purification used here may be due to either lower specificity of complex purification or higher consistency of library preparation using a simpler purification protocol.

Shep may be recruited to RNA cotranscriptionally

We found that 58% of RNAs bound by Shep in BG3 cells originate from genes with a Shep ChIP peak in their vicinity. For this subset of transcripts, Shep may first be recruited to chromatin and then transferred to the mature mRNA transcribed at this locus. Interestingly, the majority of Shep chromatin binding occurs at the TSS of active genes, and its binding signal correlates with the strength of expression of the locus. The chromatin context of Shep chromatin binding sites is well correlated with the chromatin context of originating loci of RNA bound to Shep in Kc cells, and both are enriched for marks of active transcription, further suggesting a cis association of RNA. It was previously observed that Shep ChIP profiles consist of peaks broader than those of insulator proteins (8), consistent with nascent RNA binding contributing to its chromatin association. However, not all Shep chromatin association apparently results in stable RNA association, as far more Shep ChIP peaks than bound RNAs are observed.

The majority of Su(Hw) insulator-associated mRNAs are produced in trans to insulator sites

In contrast to Shep, we did not observe correlation between the location of Su(Hw) sites on chromatin and the genes encoding Su(Hw)-associated RNAs, suggesting that the majority of these RNAs are produced in trans to insulator sites. These RNAs could be targeted to insulator complexes by RNA-binding proteins such as Shep and Rump; in fact, Rump has been previously shown to be involved in mRNA localization in the neuron and early embryo (33,34). Interestingly, a large number of Su(Hw)-associated mRNAs are derived from long, low expression genes. It is tempting to speculate that these particular mRNAs have specialized functions in nuclear architecture, perhaps using mechanisms similar to those proposed for long non-coding RNAs (reviewed in (35)).

Another point to consider regarding association of RNA in trans is the propensity of chromatin insulators to mediate long-range interactions. DNA looping could bring chromatin-bound insulator proteins in proximity to a transcribed locus in the 3D space of the nucleus, despite being distal in terms of linear sequence. High-resolution chromatin conformation experiments, such as Hi-C or ChIA-PET (36,37), in these cell types will be important for determining if such ‘spatially cis, genomically trans’ interactions can explain the RNA association with Su(Hw) and Shep or if other recruitment or guide factors play a role. In this case, metaseq could be used to integrate such datasets, loading the various types of genomic data into standard Python data structures ready for use in downstream custom analyses.

ACCESSION NUMBERS

GSE15596, GSE55894, GSE40797 and GSE51462.

Supplementary Material

Supplementary Data

Supplementary Data are provided as an additional data file (zip, 1.9 MB).

Acknowledgments

metaseq could not exist without the software tools that are a product of a robust online community of open source bioinformatics software. We thank the authors and contributors of the packages cited here, especially the authors of pysam, samtools, bx-python, bedtools and matplotlib. We thank members of the Lei lab and D. Sturgill for critical reading of the manuscript. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

FUNDING

Intramural Program of the National Institute of Diabetes and Digestive and Kidney Diseases (DK015602-07 to E.L.).

Conflict of interest statement. None declared.

REFERENCES

1. Cock P.J.A., Fields C.J., Goto N., Heuer M.L., Rice P.M. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 2010;38:1767–1771. doi: 10.1093/nar/gkp1137.
2. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
3. Kuhn R.M., Haussler D., Kent W.J. The UCSC genome browser and associated tools. Brief. Bioinform. 2013;14:144–161. doi: 10.1093/bib/bbs038.
4. Kent W.J., Zweig A.S., Barber G., Hinrichs A.S., Karolchik D. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics. 2010;26:2204–2207. doi: 10.1093/bioinformatics/btq351.
5. Danecek P., Auton A., Abecasis G., Albers C.A., Banks E., DePristo M.A., Handsaker R.E., Lunter G., Marth G.T., Sherry S.T., et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–2158. doi: 10.1093/bioinformatics/btr330.
6. Matzat L.H., Lei E.P. Surviving an identity crisis: a revised view of chromatin insulators in the genomics era. Biochim. Biophys. Acta. 2013;1839:203–214. doi: 10.1016/j.bbagrm.2013.10.007.
7. Matzat L.H., Dale R.K., Lei E.P. Messenger RNA is a functional component of a chromatin insulator complex. EMBO Rep. 2013;14:916–922. doi: 10.1038/embor.2013.118.
8. Matzat L.H., Dale R.K., Moshkovich N., Lei E.P. Tissue-specific regulation of chromatin insulator function. PLoS Genet. 2012;8:e1003069. doi: 10.1371/journal.pgen.1003069.
9. Dale R.K., Pedersen B.S., Quinlan A.R. Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics. 2011;27:3423–3424. doi: 10.1093/bioinformatics/btr539.
10. Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033.
11. Lei E.P., Corces V.G. RNA interference machinery influences the nuclear organization of a chromatin insulator. Nat. Genet. 2006;38:936–941. doi: 10.1038/ng1850.
12. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17:10–12.
13. Kim D., Pertea G., Trapnell C., Pimentel H., Kelley R., Salzberg S.L. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14:R36. doi: 10.1186/gb-2013-14-4-r36.
14. Anders S., Theodor Pyl P., Huber W. HTSeq—A Python framework to work with high-throughput sequencing data. 2014. doi: 10.1093/bioinformatics/btu638; doi: 10.1101/002824.
15. Anders S., Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11:R106. doi: 10.1186/gb-2010-11-10-r106.
16. Kharchenko P.V., Tolstorukov M.Y., Park P.J. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat. Biotechnol. 2008;26:1351–1359. doi: 10.1038/nbt.1508.
17. King M.R., Matzat L.H., Dale R.K., Lim S.J., Lei E.P. The RNA-binding protein Rumpelstiltskin antagonizes gypsy chromatin insulator function in a tissue-specific manner. J. Cell Sci. 2014;127:2956–2966. doi: 10.1242/jcs.151126.
18. Bushey A.M., Ramos E., Corces V.G. Three subclasses of a Drosophila insulator show distinct and cell type-specific genomic distributions. Genes Dev. 2009;23:1338–1350. doi: 10.1101/gad.1798209.
19. Roy S., Ernst J., Kharchenko P.V., Kheradpour P., Negre N., Eaton M.L., Landolin J.M., Bristow C.A., Ma L., Lin M.F. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010;330:1787–1797. doi: 10.1126/science.1198374.
20. Schwartz Y.B., Linder-Basso D., Kharchenko P.V., Tolstorukov M.Y., Kim M., Li H.-B., Gorchakov A.A., Minoda A., Shanower G., Alekseyenko A.A., et al. Nature and function of insulator protein binding sites in the Drosophila genome. Genome Res. 2012;22:2188–2198. doi: 10.1101/gr.138156.112.
21. Celniker S.E., Dillon L.A.L., Gerstein M.B., Gunsalus K.C., Henikoff S., Karpen G.H., Kellis M., Lai E.C., Lieb J.D., MacAlpine D.M., et al. Unlocking the secrets of the genome. Nature. 2009;459:927–930. doi: 10.1038/459927a.
22. Ray D., Kazan H., Cook K.B., Weirauch M.T., Najafabadi H.S., Li X., Gueroussov S., Albu M., Zheng H., Yang A., et al. A compendium of RNA-binding motifs for decoding gene regulation. Nature. 2013;499:172–177. doi: 10.1038/nature12311.
23. Hoffman P., Grinstein G., Marx K., Grosse I., Stanley E. DNA visual and analytic data mining. Visualization ’97, Proceedings. 1997; pp. 437–441.
24. Hoffman M.M., Buske O.J., Noble W.S. The genomedata format for storing large-scale functional genomics data. Bioinformatics. 2010;26:1458–1459. doi: 10.1093/bioinformatics/btq164.
25. Pedersen B.S., Yang I.V., De S. CruzDB: software for annotation of genomic intervals with UCSC genome-browser database. Bioinformatics. 2013;29:3003–3006. doi: 10.1093/bioinformatics/btt534.
26. Goff L., Trapnell C., Kelley D. CummeRbund: analysis, exploration, manipulation, and visualization of Cufflinks high-throughput sequencing data. 2012. R package version 2.6.1.
27. Trapnell C., Williams B.A., Pertea G., Mortazavi A., Kwan G., van Baren M.J., Salzberg S.L., Wold B.J., Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 2010;28:516–520. doi: 10.1038/nbt.1621.
28. Heinz S., Benner C., Spann N., Bertolino E., Lin Y.C., Laslo P., Cheng J.X., Murre C., Singh H., Glass C.K. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell. 2010;38:576–589. doi: 10.1016/j.molcel.2010.05.004.
29. Shen L., Shao N., Liu X., Nestler E. ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases. BMC Genomics. 2014;15:284. doi: 10.1186/1471-2164-15-284.
30. Ramírez F., Dündar F., Diehl S., Grüning B.A., Manke T. deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 2014;42:W187–W191. doi: 10.1093/nar/gku365.
31. Soon W.W., Hariharan M., Snyder M.P. High-throughput sequencing for biology and medicine. Mol. Syst. Biol. 2013;9:640. doi: 10.1038/msb.2012.61.
32. Schubert T., Pusch M.C., Diermeier S., Benes V., Kremmer E., Imhof A., Längst G. Df31 protein and snoRNAs maintain accessible higher-order structures of chromatin. Mol. Cell. 2012;48:434–444. doi: 10.1016/j.molcel.2012.08.021.
33. Sinsimer K.S., Jain R.A., Chatterjee S., Gavis E.R. A late phase of germ plasm accumulation during Drosophila oogenesis requires lost and rumpelstiltskin. Development. 2011;138:3431–3440. doi: 10.1242/dev.065029.
34. Xu X., Brechbiel J.L., Gavis E.R. Dynein-dependent transport of nanos RNA in Drosophila sensory neurons requires Rumpelstiltskin and the germ plasm organizer Oskar. J. Neurosci. 2013;33:14791–14800. doi: 10.1523/JNEUROSCI.5864-12.2013.
35. Rinn J.L., Chang H.Y. Genome regulation by long noncoding RNAs. Annu. Rev. Biochem. 2012;81:145–166. doi: 10.1146/annurev-biochem-051410-092902.
36. Li G., Fullwood M.J., Xu H., Mulawadi F.H., Velkov S., Vega V., Ariyaratne P.N., Mohamed Y.B., Ooi H.-S., Tennakoon C., et al. ChIA-PET tool for comprehensive chromatin interaction analysis with paired-end tag sequencing. Genome Biol. 2010;11:R22. doi: 10.1186/gb-2010-11-2-r22.
37. Lieberman-Aiden E., van Berkum N.L., Williams L., Imakaev M., Ragoczy T., Telling A., Amit I., Lajoie B.R., Sabo P.J., Dorschner M.O., et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326:289–293. doi: 10.1126/science.1181369.
