Bioinformatics. 2021 Oct 21;38(3):853–855. doi: 10.1093/bioinformatics/btab728

GPSmatch: an R package for comparing Genomic-binding Profile Similarity among transcriptional regulators using customizable databases

Amy Dong 1, Xiaomin Bao 2,3,4
Editor: Janet Kelso
PMCID: PMC8756198  PMID: 34672337

Abstract

Summary

Eukaryotic gene expression requires coordination among hundreds of transcriptional regulators. To characterize a specific transcriptional regulator, identifying how it shares genomic-binding sites with other regulators can generate important insights into its action. As genomic data such as chromatin immunoprecipitation assays with sequencing (ChIP-Seq) are being continuously generated by individual labs, there is a demand for timely integration and analysis of these new data. We have developed an R package, GPSmatch (Genomic-binding Profile Similarity match), for calculating the Jaccard index to compare the ChIP-Seq peaks from one experiment to other experiments stored in a user-supplied customizable database. GPSmatch also evaluates the statistical significance of the calculated Jaccard index using a nonparametric Monte Carlo procedure. We show that GPSmatch is suitable for identifying and ranking transcriptional regulators with shared genomic-binding profiles, which may unravel potential mechanistic actions of gene regulation.

Availability and implementation

The software is freely available at https://github.com/Bao-Lab/GPSmatch.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

In eukaryotes, transcriptional regulators can bind together in a subset of genomic regions, cooperatively controlling the timing and amplitude of gene expression in response to developmental cues or environmental stimuli. A variety of genomic techniques, including chromatin immunoprecipitation assays with sequencing (ChIP-Seq), are frequently used to capture the genomic-binding profiles of transcriptional regulators (i.e. ChIP-Seq peaks). As new genomic datasets are being continuously generated, timely integration of these new data can help to maximize the pace of research. Many bioinformatic tools are available for analyzing these genomic-binding datasets. For instance, HOMER (Heinz et al., 2010) and MEME-ChIP (Machanick and Bailey, 2011) identify enriched DNA motifs in ChIP-Seq peaks, while GoShifter (Trynka et al., 2015) and RELI (Harley et al., 2018) assess the statistical significance of the associations between ChIP-Seq peaks and genomic variants within complex trait loci. However, those bioinformatics tools are not suitable for comparing ChIP-Seq peaks from one experiment to the peaks of other specific ChIP-Seq experiments—a common task important for unraveling potential mechanistic actions of gene regulation. Thus, we have developed an R package, GPSmatch (Genomic-binding Profile Similarity match), for researchers to easily perform this task in a customizable database, even with just a laptop computer. In addition to ChIP-Seq datasets, we envision this package can be applied for the analysis of other genomic datasets, such as those generated by ATAC-Seq and DNase-Seq.

2 Implementation

Figure 1a depicts an overview of GPSmatch. The input files are ChIP-Seq peak files in BED format (i.e. chromosome, start genomic position and end genomic position) for both the query (a single BED file) and a customizable database (multiple BED files stored in a single folder). New datasets can be added to the database simply by placing additional BED files in this folder.

Fig. 1.


Overview and representative results of GPSmatch. (a) Illustration showing the architecture of GPSmatch. (b) Summary of GPSmatch output using p63 ChIP-Seq data as the query with the initial database comprising 2216 genomic datasets. The top three matched hits from the database are labeled. (c) Summary of GPSmatch output using the same query with the updated database. The top five matched hits from the database are labeled, with the two added genomic datasets in red.

As described below, GPSmatch includes three major computational steps: (i) calculating the Jaccard index (Jaccard, 1912) between the query and each dataset in the database; (ii) evaluating statistical significance of the calculated Jaccard index using empirical P-values estimated by a Monte Carlo procedure; (iii) ranking the selected database matches with the Jaccard index, empirical P-value, or π-score (Xiao et al., 2014).

The Jaccard index has been used for quantifying similarities of ChIP-Seq peaks of transcriptional regulators (Akalin, 2021). The Jaccard index for measuring the similarity between the query (represented as A) and a dataset (represented as B) in the database is calculated as J(A, B) = |A ∩ B| / |A ∪ B|, where |A ∩ B| is defined as the total size of all the overlapping regions between A and B, and |A ∪ B| is defined as the total size of the union of ChIP-Seq peaks in A and B. |A ∩ B| is computed using the intersect function in the bedtools package (Quinlan and Hall, 2010); |A ∪ B| is computed as |A| + |B| − |A ∩ B|, where |A| and |B| correspond to the size of the ChIP-Seq peaks in A and B, respectively.
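As an illustration of the definition above, the following is a minimal pure-Python sketch of the base-pair Jaccard index for peaks on a single chromosome. It is not the GPSmatch implementation, which relies on bedtools; the function names are hypothetical, and merging overlapping peaks within each dataset before measuring sizes is an assumption made here so that each base is counted once.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome, as in BED


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals (assumption of this sketch)."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_size(intervals: List[Interval]) -> int:
    """Number of bases covered by the intervals, i.e. |A| or |B|."""
    return sum(end - start for start, end in merge(intervals))


def intersection_size(a: List[Interval], b: List[Interval]) -> int:
    """Number of bases covered by both interval sets, i.e. |A ∩ B|."""
    a, b = merge(a), merge(b)
    i = j = size = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            size += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return size


def jaccard(a: List[Interval], b: List[Interval]) -> float:
    """J(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|)."""
    inter = intersection_size(a, b)
    union = total_size(a) + total_size(b) - inter
    return inter / union if union else 0.0


query = [(100, 200), (300, 400)]    # toy peaks, |A| = 200
dataset = [(150, 250), (300, 350)]  # toy peaks, |B| = 150
print(jaccard(query, dataset))      # |A ∩ B| = 100, |A ∪ B| = 250, prints 0.4
```

For multi-chromosome BED files, the same calculation would be applied per chromosome and the intersection and union sizes summed before taking the ratio.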

A Monte Carlo procedure is implemented in GPSmatch to evaluate the statistical significance of the calculated Jaccard index, J(A, B). Specifically, let n represent the number of times that the genomic coordinates of A are repeatedly shuffled into Aʹ (n = 2000 by default), using the shuffle function in the bedtools package. For each Aʹ, the Jaccard index J(Aʹ, B) between Aʹ and B is then computed. Let r represent the number of times that J(Aʹ, B) is equal to or greater than J(A, B). Then the empirical P-value is calculated as (r + 1)/(n + 1) (North et al., 2002). The nonparametric Monte Carlo procedure is chosen to avoid any incorrect assumptions of parametric probability distributions for hypothesis testing with the Jaccard index (Mainali et al., 2017).
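The procedure can be sketched in a few lines of Python. This is a simplified stand-in rather than the GPSmatch code: peaks live on one toy chromosome, and the hypothetical `shuffle_intervals` helper merely relocates each peak to a uniformly random position while preserving its length, which only approximates what bedtools shuffle does on a real genome. The (r + 1)/(n + 1) estimator follows the text.

```python
import random


def covered_bases(intervals):
    """Set of base positions covered by half-open [start, end) intervals."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return covered


def jaccard(a, b):
    """Base-pair Jaccard index; fine for toy data, too slow for real genomes."""
    bases_a, bases_b = covered_bases(a), covered_bases(b)
    union = bases_a | bases_b
    return len(bases_a & bases_b) / len(union) if union else 0.0


def shuffle_intervals(intervals, chrom_length, rng):
    """Move each interval to a random start, keeping its length (assumption)."""
    shuffled = []
    for start, end in intervals:
        length = end - start
        new_start = rng.randrange(chrom_length - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled


def empirical_p_value(observed, null_values):
    """(r + 1) / (n + 1), where r counts null values >= the observed value."""
    r = sum(1 for value in null_values if value >= observed)
    return (r + 1) / (len(null_values) + 1)


def monte_carlo_p(a, b, chrom_length, n=2000, seed=0):
    """Return the observed J(A, B), the n shuffled values J(A', B), and P."""
    rng = random.Random(seed)
    observed = jaccard(a, b)
    null_values = [
        jaccard(shuffle_intervals(a, chrom_length, rng), b) for _ in range(n)
    ]
    return observed, null_values, empirical_p_value(observed, null_values)


query = [(100, 200), (300, 400)]
dataset = [(150, 250), (300, 350)]
observed, null_values, p = monte_carlo_p(query, dataset, chrom_length=10_000)
print(observed, p)
```

Because 1 is added to both the numerator and the denominator, the smallest P-value this estimator can report with n shuffles is 1/(n + 1), so it never returns exactly zero.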

A π-score, originally developed to rank a set of differentially expressed genes in RNA-Seq analysis (Xiao et al., 2014), is subsequently calculated to combine the P-value with the ratio of Jaccard index J(A, B) versus the average Jaccard index J(Aʹ, B). For each pair of A and B, π-score(A, B) = −log(P-value(A, B)) × J(A, B) / mean(J(Aʹ, B)). A higher π-score reflects a higher similarity and/or higher statistical significance.
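A small numeric sketch may make the formula concrete. The function name is hypothetical, and two choices below are assumptions rather than details given in the text: the logarithm base (exposed as a parameter, defaulting to 10) and raising an error when the mean shuffled Jaccard index is zero.

```python
import math


def pi_score(p_value, observed_jaccard, null_jaccards, log_base=10.0):
    """-log(P) * J(A, B) / mean(J(A', B)); the log base is an assumption."""
    mean_null = sum(null_jaccards) / len(null_jaccards)
    if mean_null == 0:
        # The ratio is undefined here; how to handle it is left to the caller.
        raise ValueError("mean shuffled Jaccard index is zero")
    return -math.log(p_value, log_base) * observed_jaccard / mean_null


# Toy numbers: P = 0.01, J(A, B) = 0.4, mean J(A', B) = 0.2.
score = pi_score(0.01, 0.4, [0.1, 0.3])
print(round(score, 6))  # 2 * (0.4 / 0.2), prints 4.0
```

The score grows both when the P-value shrinks and when the observed overlap exceeds the shuffled background by a larger factor, which is why it can separate matches that tie on either measure alone.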

For each query A, users can rank the matched datasets B by Jaccard index, empirical P-value or π-score. In addition, GPSmatch outputs the fraction of the overlapping regions relative to A or B, i.e. |A ∩ B| / |A| and |A ∩ B| / |B|.
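The ranking and overlap-fraction outputs could be sketched as follows. The record layout, the function names and the numbers are all hypothetical toy values, not GPSmatch output; the only logic assumed is that larger Jaccard indices and π-scores, and smaller P-values, indicate better matches.

```python
def overlap_fractions(intersection_bp, size_a, size_b):
    """Return (|A ∩ B| / |A|, |A ∩ B| / |B|); an empty set yields 0.0."""
    frac_a = intersection_bp / size_a if size_a else 0.0
    frac_b = intersection_bp / size_b if size_b else 0.0
    return frac_a, frac_b


# Whether a larger value means a better match for each ranking criterion.
HIGHER_IS_BETTER = {"jaccard": True, "p_value": False, "pi_score": True}


def rank_matches(matches, by="pi_score"):
    """Sort database matches from best to worst by the chosen criterion."""
    if by not in HIGHER_IS_BETTER:
        raise ValueError(f"unknown ranking criterion: {by}")
    return sorted(matches, key=lambda m: m[by], reverse=HIGHER_IS_BETTER[by])


# Toy records with made-up values, for illustration only.
matches = [
    {"dataset": "toy_1", "jaccard": 0.10, "p_value": 0.0005, "pi_score": 9.0},
    {"dataset": "toy_2", "jaccard": 0.25, "p_value": 0.0100, "pi_score": 6.5},
    {"dataset": "toy_3", "jaccard": 0.02, "p_value": 0.4000, "pi_score": 0.4},
]

for criterion in ("jaccard", "p_value", "pi_score"):
    order = [m["dataset"] for m in rank_matches(matches, by=criterion)]
    print(criterion, order)
```

The two overlap fractions are useful alongside the Jaccard index because they are asymmetric: a small peak set fully contained in a large one has a low Jaccard index but a fraction of 1 relative to the smaller set.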

3 Results

We validated GPSmatch using a series of simulated ChIP-Seq datasets (Supplementary Fig. S1), as briefly described below. First, 100 ‘original’ ChIP-Seq datasets were randomly extracted from an in-house ChIP-Seq database consisting of 2216 publicly available datasets. Second, 100 ‘simulated’ ChIP-Seq datasets were generated by randomly replacing a certain percentage (e.g. 10%) of the ChIP-Seq peaks in each of those 100 ‘original’ datasets with randomly selected ChIP-Seq peaks in the database. Third, these ‘simulated’ datasets were used as queries and compared with each ChIP-Seq dataset in the in-house database; the number of times the corresponding ‘original’ datasets were identified as top hits was tallied, and their Jaccard indices were recorded. The above three steps were repeated for different degrees of replacement (i.e. from 10% to 100% in 10% increments). As expected, our results showed that the accuracy of GPSmatch in identifying the correct top database hits decreased as the amount of replacement increased (Supplementary Table S1).
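The peak-replacement step of this validation can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the authors' script: the function name is hypothetical, peaks are (chromosome, start, end) tuples, replacement peaks are drawn without repetition from a pooled list of database peaks, and the number of peaks to replace is rounded to the nearest integer.

```python
import random


def replace_peaks(original, pool, fraction, rng):
    """Replace a fraction of the peaks in `original` with peaks drawn from `pool`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    k = round(len(original) * fraction)
    if k > len(pool):
        raise ValueError("pool has too few peaks to draw replacements from")
    positions = rng.sample(range(len(original)), k)  # which peaks to replace
    donors = rng.sample(pool, k)                     # what to replace them with
    simulated = list(original)
    for position, donor in zip(positions, donors):
        simulated[position] = donor
    return simulated


rng = random.Random(0)
# Toy data: 10 'original' peaks on chr1 and 50 pooled database peaks on chr2.
original = [("chr1", i * 1000, i * 1000 + 200) for i in range(10)]
pool = [("chr2", i * 500, i * 500 + 150) for i in range(50)]

simulated = replace_peaks(original, pool, fraction=0.3, rng=rng)
changed = sum(1 for old, new in zip(original, simulated) if old != new)
print(changed)  # 3 of the 10 peaks were replaced
```

Repeating this for fractions from 0.1 to 1.0 and querying each simulated dataset against the database would reproduce the shape of the experiment described above.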

To illustrate the usage of GPSmatch with a real-world example, we queried GPSmatch using a recently published p63 ChIP-Seq dataset that profiled the genomic-binding sites of this lineage-specific transcription factor in human keratinocytes (Bao et al., 2015). When our initial in-house ChIP-Seq database was used, top hits from GPSmatch included STAT3, histone H3K4me2 and H3K27Ac ChIP-Seq data that were generated by the ENCODE project (ENCODE Project Consortium, 2012). This is consistent with a previous report that p63 binding sites are enriched with H3K4me2 modifications as well as the STAT3 motif (Sethi et al., 2014), which illustrates that GPSmatch is capable of producing biologically meaningful results. In addition, when our in-house ChIP-Seq database was updated by including p300 and Brg1 ChIP-Seq data in keratinocytes, GPSmatch identified p300 and Brg1 as better matches with p63 (Fig. 1b and 1c). This is consistent with our previous experimental findings that p63, Brg1 and p300 cooperate in activating keratinocyte terminal differentiation (Bao et al., 2015), and it illustrates that the ability of GPSmatch to readily incorporate new datasets is crucial for maximizing discovery power in exploring gene regulatory mechanisms.

Supplementary Material

btab728_Supplementary_Data

Acknowledgement

We would like to thank Dr. Daniel E. Webster for his pioneering work on genomic data comparison, which provided inspiration for the initiation of this project.

Funding

This work was supported by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (R00AR065480, R01AR075015), a Research Scholar Grant [RSG-21-018-01-DDC] from the American Cancer Society, and the Searle Leadership Fund.

Conflict of Interest: none declared.

Contributor Information

Amy Dong, Hinsdale Central High School, Hinsdale, IL 60521, USA.

Xiaomin Bao, Department of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA; Department of Dermatology, Northwestern University, Chicago, IL 60611, USA; Robert H. Lurie Comprehensive Cancer Center, Northwestern University, Chicago, IL 60611, USA.

References

  1. Akalin A. (2021) Computational Genomics with R. CRC Press, Boca Raton, Florida. https://www.routledge.com/Computational-Genomics-with-R/Akalin/p/book/9781498781855#.
  2. Bao X. et al. (2015) A novel ATAC-seq approach reveals lineage-specific reinforcement of the open chromatin landscape via cooperation between BAF and p63. Genome Biol., 16, 284.
  3. ENCODE Project Consortium. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature, 489, 57–74.
  4. Harley J.B. et al. (2018) Transcription factors operate across disease loci, with EBNA2 implicated in autoimmunity. Nat. Genet., 50, 699–707.
  5. Heinz S. et al. (2010) Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell, 38, 576–589.
  6. Jaccard P. (1912) The distribution of the flora in the Alpine zone. New Phytol., 11, 37–50.
  7. Machanick P., Bailey T.L. (2011) MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics, 27, 1696–1697.
  8. Mainali K.P. et al. (2017) Statistical analysis of co-occurrence patterns in microbial presence-absence datasets. PLoS One, 12, e0187132.
  9. North B.V. et al. (2002) A note on the calculation of empirical P values from Monte Carlo procedures. Am. J. Hum. Genet., 71, 439–441.
  10. Quinlan A.R., Hall I.M. (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841–842.
  11. Sethi I. et al. (2014) Role of chromatin and transcriptional co-regulators in mediating p63-genome interactions in keratinocytes. BMC Genomics, 15, 1042.
  12. Trynka G. et al. (2015) Disentangling the effects of colocalizing genomic annotations to functionally prioritize non-coding variants within complex-trait loci. Am. J. Hum. Genet., 97, 139–152.
  13. Xiao Y. et al. (2014) A novel significance score for gene selection and ranking. Bioinformatics, 30, 801–807.


Articles from Bioinformatics are provided here courtesy of Oxford University Press
