Abstract
Motivation
Analysis of next-generation sequencing data often results in a list of genomic regions. These may include differentially methylated CpGs/regions, transcription factor binding sites, interacting chromatin regions, or GWAS-associated SNPs, among others. A common analysis step is to annotate such genomic regions to genomic annotations (promoters, exons, enhancers, etc.). Existing tools are limited by a lack of annotation sources and flexible options, the time it takes to annotate regions, an artificial one-to-one region-to-annotation mapping, a lack of visualization options to easily summarize data, or some combination thereof.
Results
We developed the annotatr Bioconductor package to flexibly and quickly summarize and plot annotations of genomic regions. The annotatr package reports all intersections of regions and annotations, giving a better understanding of the genomic context of the regions. A variety of graphics functions are implemented to easily plot numerical or categorical data associated with the regions across the annotations, and across annotation intersections, providing insight into how characteristics of the regions differ across the annotations. We demonstrate that annotatr is up to 27× faster than comparable R packages. Overall, annotatr enables a richer biological interpretation of experiments.
Availability and Implementation
http://bioconductor.org/packages/annotatr/ and https://github.com/rcavalcante/annotatr
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Genomic regions resulting from next-generation sequencing experiments and bioinformatics pipelines require annotation to genomic features for context. For example, hyper-methylation of CpG shores in promoters may indicate different regulatory regimes in one condition compared to another, or it may be of interest that a transcription factor overwhelmingly binds in enhancers, while another tends to bind at exon–intron boundaries.
While tools exist to intersect genomic regions of interest with genomic annotations, we found that their annotations, methods of intersection, and graphics options had room for improvement. ChIPpeakAnno (Zhu et al., 2010) is an R package that has been used in many studies across a variety of organisms. It returns only one genomic annotation per input region, and although it provides some plots, these cannot incorporate data associated with the regions of interest, such as methylation rates or fold changes. Another R package, goldmine (Bhasin and Ting, 2016), returns either all annotations intersecting regions of interest (a one-to-many mapping) or annotations based on a prioritization (a one-to-one mapping). The goldmine package provides helper functions to create annotations from any UCSC genome browser table. However, it offers no built-in functions for summary plots or for plotting data related to the regions over the annotations. Outside of the R ecosystem, BEDtools (Quinlan and Hall, 2010), implemented in C++, intersects and aggregates genomic regions with annotations, and is very fast. However, its more general purpose means users must provide all annotations and manually generate plots.
We developed annotatr, a Bioconductor package that reports all intersections of genomic regions with genomic annotations. Annotations may be built-in, for D. melanogaster (dm3 and dm6), H. sapiens (hg19 and hg38), M. musculus (mm9 and mm10) and R. norvegicus (rn4, rn5 and rn6); imported from the AnnotationHub R package; or custom annotations for any organism. annotatr lets users associate numerical or categorical data with regions, enabling better understanding of the underlying experiments via summarization and visualization functions. annotatr is fast, flexible and easily included in bioinformatics pipelines.
2 Implementation and features
A core feature of annotatr is the variety of standard and specialized genomic annotations it includes. Standard annotations include CpG island related features (CpG islands, shores, shelves and ‘open sea’) and genic features (promoters, 5′UTRs, exons, introns, CDS and 3′UTRs) (Supplementary Fig. S1). Specialized genomic annotations include intron/exon boundaries, enhancers, lncRNAs, and chromatin state segmentations. A built-in function easily transforms resources in the AnnotationHub R package (such as COSMIC, ENCODE and Roadmap Epigenomics) into usable annotations. Finally, custom annotations can supplement built-in annotations or enable annotation to any organism. Details regarding the sources, construction, and genome availability for all included annotations are provided in Supplementary Methods and Supplementary Table S1.
The annotatr package consists of four modules that read, annotate, summarize and visualize genomic regions. The read module reads a BED6+ file, defined as BED6 and any number of numerical or categorical data columns (Supplementary Table S2). The annotate module reports the overlap of all input regions with all intersecting genomic annotations selected by the user, with a user-defined threshold overlap between regions and annotations (Supplementary Table S3). The summarize module enables users to quickly compute summarized information of any numerical (Supplementary Table S4) or categorical data (Supplementary Table S5) over the annotations.
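The annotate and summarize steps described above can be illustrated with a minimal sketch. It is written in plain Python for illustration only and is not the annotatr API: the `Interval` type and the `annotate` and `summarize_mean` functions, along with the toy coordinates and values, are hypothetical names and data introduced here. The sketch assumes BED-style half-open coordinates and uses a naive all-pairs comparison, whereas a practical implementation would use an interval index.

```python
from collections import defaultdict
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive
    name: str   # region ID, or annotation type for annotations


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of base pairs shared by two intervals (0 if none)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def annotate(regions, annotations, min_overlap=1):
    """One-to-many mapping: keep every (region, annotation) pair whose
    overlap is at least min_overlap bp, rather than picking one winner."""
    return [
        (r, a)
        for r in regions
        for a in annotations
        if overlap_bp(r, a) >= min_overlap
    ]


def summarize_mean(pairs, values):
    """Mean of a per-region numeric value within each annotation type."""
    by_type = defaultdict(list)
    for region, annotation in pairs:
        by_type[annotation.name].append(values[region.name])
    return {t: sum(v) / len(v) for t, v in by_type.items()}


# Toy data (hypothetical): two regions and three annotations.
regions = [
    Interval("chr1", 100, 200, "r1"),
    Interval("chr1", 500, 600, "r2"),
]
annotations = [
    Interval("chr1", 150, 400, "promoter"),
    Interval("chr1", 180, 550, "cpg_island"),
    Interval("chr2", 100, 200, "promoter"),
]
meth_diff = {"r1": 20.0, "r2": -10.0}  # numeric column attached to regions

pairs = annotate(regions, annotations, min_overlap=1)
summary = summarize_mean(pairs, meth_diff)
```

In this toy example region r1 is reported against both the promoter and the CpG island, which is the behaviour a one-to-one prioritization would hide. Raising `min_overlap` to 30 drops the 20 bp overlap between r1 and the CpG island.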
The collective goal of the visualization module is to provide insight into modes of regulation, and to discover specific relationships among the input regions and genomic annotations with minimum code or forethought. Consider bisulfite sequencing results from methylSig (Park et al., 2014) reporting genome-wide differential methylation (DM) between two sample groups. The results have columns for DM status (hyper, hypo, none), P-value, methylation difference between the groups, and methylation rates of each group. The annotatr package implements functions to show: (1) the number of DM regions in each annotation type with the option to compare against randomized regions (Supplementary Fig. S2), (2) a heatmap of the number of regions annotated to pairs of annotation types (Supplementary Fig. S3), (3) the distribution of numerical data across the annotations or any categorical variable (Fig. 1A), (4) the joint distribution of two numerical data columns across the annotations or any categorical variable (Supplementary Fig. S4), (5) the distribution of numerical data for regions in any two intersecting annotations (Fig. 1B) and (6) the distribution of a categorical variable across the annotations or any other categorical variable (Fig. 1C).
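The tabulations behind views (2) and (5) can be sketched as follows. This is an illustrative Python sketch, not the package's implementation: `coannotation_counts`, `split_by_coannotation` and the toy `region_types` mapping are hypothetical. It assumes that the diagonal of the heatmap counts regions carrying a given annotation type, and that "just" one annotation is judged relative to the chosen pair only.

```python
from collections import Counter
from itertools import combinations_with_replacement

# Toy input (hypothetical): the set of annotation types hit by each region.
region_types = {
    "r1": {"promoter", "cpg_island"},
    "r2": {"cpg_island"},
    "r3": {"promoter", "cpg_island", "enhancer"},
    "r4": {"enhancer"},
}


def coannotation_counts(region_types):
    """Count regions annotated to each unordered pair of annotation types.
    Pairs of a type with itself give the per-type totals (the diagonal)."""
    counts = Counter()
    for types in region_types.values():
        for pair in combinations_with_replacement(sorted(types), 2):
            counts[pair] += 1
    return counts


def split_by_coannotation(region_types, a, b):
    """Partition regions into those in only a, in both, and in only b."""
    groups = {"only_a": [], "both": [], "only_b": []}
    for name, types in region_types.items():
        in_a, in_b = a in types, b in types
        if in_a and in_b:
            groups["both"].append(name)
        elif in_a:
            groups["only_a"].append(name)
        elif in_b:
            groups["only_b"].append(name)
    return groups


counts = coannotation_counts(region_types)
groups = split_by_coannotation(region_types, "cpg_island", "promoter")
```

The `counts` table is what a pairwise heatmap would display, and the three lists in `groups` are the sets of regions whose numerical data would be compared side by side.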
Fig. 1. (A) The distribution of the methylation rate across annotations (solid) with the background distribution (outline). Note the clearly visible hyper- and hypo-methylation trends in the different annotation types. (B) The distribution of the methylation rate of regions in just CpG islands (left), promoters and CpG islands (middle) and just promoters (right). Note the relative hypermethylation trend in the co-annotated regions compared to the singly annotated regions. (C) The proportion of annotations of hyper- and hypo-methylated regions, with the background distribution (All) for comparison. Note the differences in enhancers, CpG islands, lncRNAs and K562-insulators between hyper- and hypo-methylated regions compared to each other and all tested regions.
We compared runtimes between ChIPpeakAnno (v3.8.1), goldmine (v1.0.0) and annotatr (v1.0.1) on three data sets varying in size from 31 000 to 2 500 000 lines (Supplementary Methods). annotatr performs up to 13.1× faster than ChIPpeakAnno, and up to 27.5× faster than goldmine, with increasingly better performance as file size increases (Supplementary Table S6 and Fig. S5). In addition to benchmarking, we have compared the features of the three packages (Supplementary Table S7).
3 Discussion
Associating regions of interest to genomic annotations is a standard part of many bioinformatics pipelines. The annotatr package improves upon existing annotation tools by returning all the genomic annotations associated with a region instead of artificially prioritizing one annotation type over another, giving a clearer picture of the biological complexities at play. In addition to tabular output of the annotations, annotatr's built-in plotting functions provide an easy and flexible way to summarize the annotations and view how data associated with the regions changes in different genomic contexts. The annotatr package thus enables fast exploration, more complete genomic contextualization of experiments, and more potential discoveries.
Acknowledgements
We thank Hani Habra and Jian Kang for their input on the conception and implementation of the package. We also thank Shweta Ramdas and Chee Lee for feedback on the manuscript and package vignette.
Funding
R.C. was funded on a Bioinformatics Training Grant (National Institute of General Medical Sciences-T32 GM070499) and M.A.S. by National Cancer Institute grant R01 CA158286 and National Institute of Environmental Health Sciences grant P30 ES017885.
Conflict of Interest: none declared.
References
- Bhasin J.M., Ting A.H. (2016) Goldmine integrates information placing genomic ranges into meaningful biological contexts. Nucleic Acids Res., 44, 5550–5556.
- Park Y. et al. (2014) MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics, 30, 2414–2422.
- Quinlan A.R., Hall I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841–842.
- Zhu L.J. et al. (2010) ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP-chip data. BMC Bioinformatics, 11, 237.
