Author manuscript; available in PMC: 2022 Apr 19.
Published in final edited form as: Nat Protoc. 2021 Aug 9;16(9):4144–4176. doi: 10.1038/s41596-021-00567-5

Detecting chromosomal interactions in Capture Hi-C data with CHiCAGO and companion tools

Paula Freire-Pritchett 1,#, Helen Ray-Jones 2,3,#, Monica Della Rosa 2,3, Chris Q Eijsbouts 4,5, William R Orchard 6, Steven W Wingett 7,11, Chris Wallace 8,9, Jonathan Cairns 10, Mikhail Spivakov 2,3, Valeriya Malysheva 2,3
PMCID: PMC7612634  EMSID: EMS144252  PMID: 34373652

Abstract

Capture Hi-C is widely used to obtain high-resolution profiles of chromosomal interactions involving, at least on one end, regions of interest such as gene promoters. Signal detection in Capture Hi-C data is challenging and cannot be adequately accomplished with tools developed for other chromosome conformation capture methods, including standard Hi-C. Capture Hi-C Analysis of Genomic Organization (CHiCAGO) is a computational pipeline developed specifically for Capture Hi-C analysis. It implements a statistical model accounting for biological and technical background components, as well as bespoke normalization and multiple testing procedures for this data type. Here we provide a step-by-step guide to the CHiCAGO workflow that is aimed at users with basic experience of the command line and R. We also describe more advanced strategies for tuning the key parameters for custom experiments and provide guidance on data preprocessing and downstream analysis using companion tools. In a typical experiment, CHiCAGO takes ~2–3 h to run, although pre- and postprocessing steps may take much longer.

Introduction

DNA regulatory elements such as enhancers can be up to megabases away from the genes they control, but come into their physical proximity through 3D chromosomal contacts1. Therefore, understanding how DNA is folded in the nucleus is required for deciphering the mechanisms of gene regulation and their aberrations in disease. Chromosome conformation capture techniques are a family of powerful biochemical methods that detect 3D chromosomal topology through proximity ligation of crosslinked chromatin digested with a restriction enzyme2 (Fig. 1a). While these methods initially enabled querying specific interactions between candidate loci, the development of Hi-C (refs. 3,4) has theoretically made it possible to profile all interactions between every pair of DNA fragments in a single reaction (Fig. 1b). Hi-C enables a global view on 3D chromosomal topology, and has revealed large-scale chromosomal structures such as A/B compartments3 and topologically associated domains5–7 (TADs) that are relevant for genome integrity and gene control. The global nature of Hi-C, however, results in extreme complexity of Hi-C libraries (i.e., all pairwise contacts between ~10⁶ and 10⁸ fragments in a mammalian genome, depending on the restriction enzyme used). This, in turn, requires an impractical depth of sequencing to enable robust and sensitive detection of individual chromosomal contacts. Capture Hi-C (CHi-C) overcomes this limitation by using thousands of sequence capture probes to enrich Hi-C material for interactions that involve, at least on one end, restriction fragments of interest (‘baits’ or ‘viewpoints’), such as all annotated gene promoters (Promoter Capture Hi-C (PCHi-C))8–10. Sequence-based capture can also be applied to libraries arising from the earlier 3C technology in a technique known as Capture-C (ref. 11) (Fig. 1c).
The resulting enrichment for interactions of interest (15–70 fold in the case of PCHi-C, depending on the enzyme used) reduces the amount of sequencing required for detecting chromosomal contacts at a single restriction fragment level, allowing for a robust and sensitive detection of regulatory chromosomal interactions. Since its development in the early 2010s, CHi-C has been rapidly gaining in popularity and has been used to detect the dynamics of regulatory chromosomal contacts in a variety of developmental systems, as well as for linking disease-associated genetic variation with target genes (see ‘Applications of the method’ for more details).

Fig. 1. An overview of CHi-C and Capture-C.

a, Both technologies share the initial crosslinking step whereby formaldehyde is used to ‘fix’ the chromatin, maintaining the 3D DNA conformation. The cell membrane is then lysed, the nucleus permeabilized and chromatin incubated with a restriction enzyme. b, In Hi-C-based methods, the ends of restriction fragments are marked with biotin prior to ligation. The biotinylated ligation products are pulled down using streptavidin beads. Together these steps allow for enrichment of valid ligation products (‘valid pairs’ or ‘valid di-tags’, pie chart) that consist of fragments coming from two distinct noncontiguous genomic locations. Hi-C interaction maps reveal large-scale chromatin topological structures such as TADs (blue lines, heatmap graphic), but are too sparse for robust identification of fragment-resolution interactions without very deep sequencing. CHi-C increases the sequencing coverage of interactions that involve, at least on one end, fragments of interest, by using in-solution hybridization capture with biotinylated probes. CHiCAGO can identify significant interactions in CHi-C data, as marked by red dots in the bait interaction profile (bottom left). c, In 3C-based methods, ligation is carried out without prior biotinylation of restriction fragment ends, typically resulting in a much smaller proportion of valid sequencing pairs (pie chart). Capture-C uses in-solution hybridization capture to enrich 3C libraries for interactions of interest in a similar manner as in CHi-C. CHiCAGO can also be used to identify significant interactions in Capture-C data, marked by red dots in the bait interaction profile (bottom right). The bait interaction profiles in b and c were generated using publicly available CHi-C data of mouse ESCs from ref. 8 (two replicates) and Capture-C data from ref. 55 (four replicates). The profiles show interaction data over a 2 Mb window surrounding the baited Scnn1g promoter. 
Interacting loci are at the resolution of HindIII fragments (CHi-C) or DpnII fragments with other-ends merged into ~5 kb bins (Capture-C).

The asymmetric nature of CHi-C (many ‘baited’ viewpoints on one end versus all restriction fragments on the other end) confers unique statistical properties to the resulting data, making most statistical tools developed for Hi-C normalization and signal detection generally unsuitable for the analysis of this data type. In contrast to Hi-C, CHi-C generates rectangular, rather than square interaction matrices, and introduces an additional background component due to differences in capture efficiency, particularly between bait-to-bait and bait-to-non-bait interacting pairs. In contrast to 4C (refs. 12–14), which profiles a single viewpoint versus all restriction fragments on the other end, the high-throughput nature of CHi-C enables borrowing information across subsets of data with similar properties, providing a more robust analysis than is possible with 4C, but presents challenges for multiple testing correction. To account for the specific features and challenges of CHi-C, we previously developed Capture Hi-C Analysis of Genomic Organization (CHiCAGO), a computational pipeline that incorporates a bespoke statistical model, background correction and multiple testing procedures for detecting significant chromosomal interactions in these data15. The CHiCAGO pipeline consists of an R package, Chicago, available via Bioconductor16 and from Bitbucket (www.bitbucket.org/chicagoTeam/chicago), and a suite of auxiliary scripts, chicagoTools, available from the same Bitbucket repository. This article describes Capture Hi-C data analysis including data preprocessing, interaction detection with CHiCAGO and possible downstream analysis steps (Fig. 2). We summarize the main features of the CHiCAGO pipeline and key companion tools, provide guidelines for CHiCAGO parameter tuning and describe in detail how to install and run the pipeline.

Fig. 2. Standard CHi-C data analysis steps.

CHi-C: the CHi-C library is generated by performing capture on a Hi-C library followed by paired-end sequencing. HiCUP: HiCUP is a widely used pipeline specifically tailored for quality control and preprocessing of paired-end sequencing data, generally obtained using Hi-C or CHi-C. HiCUP performs read alignment and then filters out nonvalid, low-quality tags and PCR duplicates. The aligned and filtered interactions processed by HiCUP are then filtered for ‘on-target’ interactions involving fragments designed to be captured, producing a captured.bam file. CHiCAGO: each capture design requires its own design files that should be generated and saved in a single folder prior to running CHiCAGO. The design files and the captured.bam file are then used to generate a CHiCAGO input file (.chinput) that summarizes all data about the sample required to run the pipeline. A new CHiCAGO data object is then created and connected with the design files using the setExperiment() function. Next, the input data are read using the readAndMerge() function. Finally, the CHiCAGO pipeline is launched with the chicagoPipeline() function. When the run is complete, Chicago produces diagnostic plots and the updated CHiCAGO data object, containing all relevant information on every processed interaction. Downstream analysis: CHiCAGO objects can then be used for downstream analysis, such as differential analysis with Chicdiff45 and fine-mapping with Peaky44. MPPC: the marginal posterior probability of a contact.

The convolution background model and normalization procedures for CHi-C data implemented in CHiCAGO

CHiCAGO’s statistical model considers two main sources of noise: (i) ‘Brownian noise’, which arises from random collisions between fragment pairs, and (ii) ‘technical noise’, which accounts for any bias in the assay, including sequencing artifacts. The constrained ‘Brownian noise’ component is strongly influenced by the linear distance between fragment pairs, since the frequency of random collisions increases with the proximity between fragments on the DNA strand. CHiCAGO models read counts resulting from Brownian collisions as a negative binomial random variable whose expected levels depend on the linear distance and the properties of individual fragments (including, but not limited to, differences in capture efficiency). At multimegabase distances, the frequency of random collisions is highly reduced, and so the ‘Brownian noise’ approaches zero. In contrast, technical noise is independent of the linear distance, and CHiCAGO models it as a Poisson random variable. CHiCAGO then combines the two noise parameters in a two-component convolution model known as the Delaporte distribution, assuming that the two sources of noise are independent.
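The two-component convolution described above can be illustrated numerically. The following is a minimal sketch, not the Chicago implementation: it builds the probability mass function of the sum of an independent negative binomial variable (Brownian noise) and a Poisson variable (technical noise) by direct convolution. The function names and the mean/dispersion parameterization of the negative binomial are assumptions made for illustration.

```python
import math

def nb_pmf(k, mean, dispersion):
    """Negative binomial pmf parameterized by its mean and size (dispersion)."""
    p = dispersion / (dispersion + mean)  # success probability in the (size, p) form
    log_pmf = (math.lgamma(k + dispersion) - math.lgamma(dispersion)
               - math.lgamma(k + 1)
               + dispersion * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)

def poisson_pmf(k, lam):
    """Poisson pmf with mean lam (lam > 0)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def delaporte_pmf(n, brownian_mean, dispersion, technical_mean):
    """P(X = n) for X = NB + Poisson, assuming the two components are independent."""
    return sum(nb_pmf(k, brownian_mean, dispersion)
               * poisson_pmf(n - k, technical_mean)
               for k in range(n + 1))
```

Because the components are independent, the mean of the resulting distribution is simply the sum of the Brownian and technical means, which is a convenient sanity check for any parameter choice.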

The convolution model allows an independent estimation of each of the noise components, and CHiCAGO uses two different subsets of data to estimate them. For the Brownian component, it uses data from relatively short-range genomic distances where it can be assumed that the contribution of technical noise is negligible compared with that of the Brownian noise (up to ~1–2 Mb for a six-cutter restriction enzyme). To estimate Brownian noise, CHiCAGO pools information across all viewpoints to increase power, accounting for interaction distance and the noise properties of each interacting fragment. To estimate distance dependence, CHiCAGO combines data from all viewpoints into a ‘reference’ profile and fits a piecewise power law model17,18. CHiCAGO then estimates the normalization (scaling) factors for each fragment and combines them for each interaction pair. For baited fragments, scaling factors are estimated with respect to a ‘reference’ interaction profile estimated across all viewpoints. Using the same approach for nonbaited fragments is challenging as each of them is typically detected in only a small number of interactions. Therefore, these fragments are first pooled with respect to their noise properties (using the range of transchromosomal counts as a proxy for this), and normalization is performed in pools. The abundance of transchromosomal counts is also used to estimate technical noise, as the absolute majority of such signals detected in CHi-C reflect technical artifacts as opposed to true interaction signals15,19. The estimated background components are then used to parameterize the Delaporte distribution, and interaction-level P-values are estimated with a one-tailed hypothesis test.
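As a rough illustration of how the estimated components could feed into an interaction-level test, the sketch below computes a one-tailed P-value, P(X ≥ observed count), under the convolution model. It is a simplified sketch under stated assumptions rather than Chicago's code: the two-segment power law and all its numeric parameters are hypothetical, and combining the bait and other-end scaling factors by multiplication is an assumption of this example.

```python
import math

def distance_function(d, breakpoint=100_000, level=2.0,
                      slope_near=-0.5, slope_far=-1.2):
    """Hypothetical two-segment power law, continuous at `breakpoint` (d > 0)."""
    ratio = d / breakpoint
    return level * ratio ** (slope_near if d <= breakpoint else slope_far)

def nb_pmf(k, mean, size):
    p = size / (size + mean)
    return math.exp(math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
                    + size * math.log(p) + k * math.log1p(-p))

def poisson_pmf(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def delaporte_p_value(count, brownian_mean, size, technical_mean):
    """One-tailed P(X >= count) for X = NB + Poisson (independent components)."""
    if count <= 0:
        return 1.0
    below = sum(nb_pmf(k, brownian_mean, size) * poisson_pmf(n - k, technical_mean)
                for n in range(count) for k in range(n + 1))
    # 1 - CDF loses precision for very small P-values; adequate for a sketch only.
    return max(0.0, 1.0 - below)

def interaction_p_value(count, distance, s_bait, s_other, size, technical_mean):
    """Assumption: Brownian mean = bait factor * other-end factor * f(distance)."""
    brownian_mean = s_bait * s_other * distance_function(distance)
    return delaporte_p_value(count, brownian_mean, size, technical_mean)
```

With these illustrative parameters, the same read count yields a smaller P-value at 1 Mb than at 50 kb, because the expected Brownian background decays with distance.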

Multiple testing treatment in CHiCAGO

CHiCAGO performs a statistical test for every possible interaction (which implicitly also includes the interactions that yielded a zero count), leading to a significant multiple testing burden. Typically, in genomic assays, this burden is mitigated by using false discovery rate (FDR) approaches, such as the Benjamini–Hochberg procedure20. However, both technical and conceptual considerations make the use of FDR nonideal in the CHiCAGO setting. First, the P-value distributions generated by Delaporte tests are highly nonuniform, owing, at least in part, to very large numbers of interactions having counts of either zero or one, which violates the core assumption of FDR methods. Second, FDR correction treats all interactions equally, whereas in CHi-C we expect the true positive rate and statistical power to vary with interaction distance. Therefore, intuitively, we do not want the shorter-range interactions (on the order of hundreds of kilobase pairs) to ‘bear the brunt’ of the multiple testing burden resulting from testing millions of multimegabase-range and transchromosomal interactions, since this will result in a significant loss of sensitivity. To address these issues, CHiCAGO implements a weighted false discovery control procedure that builds on the theoretical foundations of Genovese et al.21 and is similar in spirit to the independent hypothesis weighting method developed in parallel by Ignatiadis et al.22. In this procedure, P-values are weighted according to their genomic distance, with weights estimated based on the reproducibility of interaction signals across replicates. The default weight profile applied in CHiCAGO is calibrated for use on CHi-C data generated using a six-cutter restriction enzyme; however, a precomputed alternative profile calibrated for four-cutter enzymes is available on Bitbucket (https://bitbucket.org/chicagoTeam/chicago).
CHiCAGO also provides a script enabling the users to estimate custom weight profiles based on their data (see ‘Custom parameter tuning’ below).
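The general idea of P-value weighting can be sketched in a few lines. The example below follows the generic weighted-testing recipe (weights that average to one, and P-values divided by their weights) and is not CHiCAGO's actual weight function: the shape of the weight profile, its `midpoint` and `steepness` parameters and the function names are all hypothetical.

```python
def distance_weights(distances, midpoint=1_000_000, steepness=2.0):
    """Hypothetical weight profile: larger weights for short-range pairs."""
    raw = [1.0 / (1.0 + (d / midpoint) ** steepness) for d in distances]
    mean_raw = sum(raw) / len(raw)
    return [w / mean_raw for w in raw]  # normalize so that weights average to 1

def weighted_p_values(p_values, weights):
    """Divide each P-value by its weight, capping the result at 1."""
    return [min(1.0, p / w) for p, w in zip(p_values, weights)]

# Four interactions with identical raw P-values but different distances (bp).
distances = [50_000, 200_000, 5_000_000, 50_000_000]
raw_p = [1e-4] * 4
adjusted = weighted_p_values(raw_p, distance_weights(distances))
```

In this toy example, the short-range interactions end up with weighted P-values below the raw value, while the multimegabase-range ones are penalized, so the testing burden is redistributed toward the distance range where true signals are less expected.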

Applications of the method

CHiCAGO analysis has been applied in an increasing number of studies using CHi-C to link 3D chromosomal conformation and gene control. Many such studies used PCHi-C, which enriches for interactions with fragments containing gene promoters. Coupling CHiCAGO-detected CHi-C interaction calls with other genomic readouts such as ChIP-seq, assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) and RNA-seq has provided insights, for example, into the rewiring of promoter–enhancer contacts upon the differentiation of embryonic stem (ES) cells23–25, adipocytes26 and keratinocytes27, as well as upon cohesin depletion in a human cell line28. PCHi-C and CHiCAGO have also proven to be powerful tools for linking noncoding genome-wide association study (GWAS) variants with target genes based on 3D chromosomal contacts, enabling the identification of thousands of gene-level genetic associations29–34.

CHiCAGO has also been applied to capture designs enriching for fragments other than gene promoters, such as DNase I hypersensitive sites (in a model of Serum-to-2i conversion of mouse ES cells35), as well as to CHi-C experiments using autoimmune disease-associated GWAS loci as baits to identify their gene targets36,37. It was also used to process data from a ‘reciprocal’ CHi-C experiment validating 1,000 promoter interactions initially identified using PCHi-C29.

Though CHiCAGO was initially designed to analyze data from CHi-C experiments, it has also been applied to Capture-C data (Fig. 1c) in organisms ranging from Drosophila38 to mouse39 to humans40,41. Targeting large numbers of regions in Capture-C is challenging, as the high proportion of nonspecific DNA fragments retained during 3C (in the absence of the ligation junction pulldown performed in Hi-C) drives down informative sequencing coverage. Therefore, most Capture-C designs use hundreds of baited fragments as opposed to the tens of thousands in a typical CHi-C experiment. Nonetheless, the latter two Capture-C-based studies40,41 have expanded the analysis to most annotated promoters.

Related methods

Several third-party computational tools developed recently also provide interaction calling in CHi-C data42–44. While a detailed description and full comparison of these methods is beyond the scope of this protocol, here we provide a brief overview of these methods and refer readers to the original articles presenting them for details. In addition, we describe how one of these tools, Peaky44, as well as a differential caller for CHi-C data, Chicdiff45, can be used in the follow-up analyses of CHiCAGO-processed data (Fig. 2).

HiCapTools42 estimates genome interaction background using either a negative control probe set targeting a random set of regions, or a random sample of ~10% of targeted regions if no negative probes are provided. While discrete value distributions (such as Poisson, negative binomial and Delaporte) are typically used for sequencing count data, HiCapTools estimates the statistical significance of interactions assuming the background counts are distributed normally.

CHiCMaxima43 proposes a ranking approach for interaction detection that does not involve a formal statistical test. It averages through the interaction profile with a sliding window and reports local maxima, which occur at a frequency above local averaged count, as true interactions.

Similarly to CHiCAGO, the recently developed Capture Hi-C Analysis Engine (CHiCANE)46 models the expected read count for each interaction as a function of distance between fragments and their ‘interactability’ informed by transchromosomal interactions. However, unlike in CHiCAGO, this parameter is fitted as a regression model, which enables incorporating additional user-specified covariates such as chromosome number or GC content. By default, CHiCANE assumes the expected count to follow the negative binomial distribution, but offers users a choice between a number of other distributions, including truncated negative binomial and Poisson. The Benjamini–Hochberg FDR method is then applied for multiple testing correction of the identified significant interactions. According to the authors, CHiCANE works best for interactions in the 100 kb to 5 Mb range of genomic distances46.

Some significant interactions identified by CHiCAGO appear in ‘runs’ over multiple contiguous fragments, potentially representing strong causal interactions generating their own patterns of constrained Brownian motion. Peaky44 may be used to further improve upon the resolution at such interactions. A key assumption behind Peaky is that only a subset of fragments in such runs truly form contacts. A true, causal contact produces not only a peak in signal but also introduces noise that decays with distance among adjacent fragments. In the immediate surroundings of a true contact, this leads to spurious calls, which contact calling with Peaky is intended to eliminate. Peaky jointly models data across interacting fragments for a given bait, assuming a few latent causal contacts with one or more fragments, and generates posterior probabilities of the number and identity of these causal contacts. Peaky features its own negative-binomial regression model to adjust raw read counts prior to fine-mapping, but has recently also been modified to fine-map interactions called by CHiCAGO. When run in this fashion, it uses the CHiCAGO-modeled mean of the Delaporte distribution and thus benefits from CHiCAGO’s background estimation and normalization procedures. Instructions for running Peaky are provided in the Procedure.

To perform differential analysis of significant interactions called by CHiCAGO (between different cell types, differentiation states of the same cell type, etc.), we have previously developed Chicdiff45. This tool combines moderated differential testing for count data using negative binomial generalized linear models (implemented in DESeq2; ref. 47) with signal normalization informed by CHiCAGO and nonuniform multiple testing correction. Chicdiff has been recently applied to detect subsets of promoter interactions that are lost, maintained or gained upon cohesin and CCCTC-binding factor (CTCF) depletion48. Instructions on how to run Chicdiff are provided in the Procedure.

While Hi-C analysis tools are generally not suitable for CHi-C and vice versa, some of the conceptual foundations of CHiCAGO are also relevant for Hi-C analysis. For example, distance-dependent background models are used in Fit-Hi-C49 and HOMER50. In addition, CHiCAGO’s bias-correction strategy is technically very different from, but is based on the same data-driven (‘implicit’) principle as, the widely used iterative correction method for Hi-C51. Finally, while, to our knowledge, P-value weighting has not been used in Hi-C analysis, this procedure may be potentially useful for detecting the most biologically relevant signals in these data.

Limitations

One of the limitations of CHiCAGO is that it is agnostic to higher-order chromosomal organization, such as A/B compartments and TAD structure. CHiCAGO avoids correcting for these structures explicitly, since their effect on specific DNA looping interactions is still not fully understood. In the current implementation, CHiCAGO calls interactions both within and across TADs in the same manner, resulting in a slight overestimation of the local Brownian background for inter-TAD interactions, but only very minor differences in the stringency of detected contacts (see Author Response Image 1 in ref. 23). As expected, the detected interactions are strongly enriched within the same TAD23,29,48. However, multiple cross-TAD contacts are still observed in CHi-C data23,29,48, and a recent study52 has provided evidence that such contacts may play a functional role in developmental gene control.

Another limitation is linked to the way in which CHiCAGO models the background from Brownian collisions. The estimation of scaling factors and distance dependence for the Brownian component relies on sharing count data across all baits and ‘other ends’. Thus, the accuracy of background estimation relies on the large number of baits present in the capture system. Typical genome-wide promoter capture systems contain >10,000 baits, allowing for a robust background estimation. In contrast, using CHiCAGO with capture systems containing much smaller numbers of baits may result in a misestimated background and a misannotation of significant interactions.

In addition, CHiCAGO may not work well with ‘tiled’ capture designs, where baits cover large contiguous regions. As bait-to-bait interactions have a higher capture efficiency compared with the rest, CHiCAGO estimates the distance function from bait-to-nonbait interactions, to ensure that the variability in other end biases cancels out in the averaging procedure (see Appendix A, Additional file 1 of ref. 15). Peaky44 or CHiCANE46 could be more appropriate for the analysis of data arising from such tiled capture designs, as these tools treat ‘bait-to-bait’ interactions as a covariate in their regression models without explicitly removing such interactions from any calculations. Alternatively, bait-to-bait interactions within the tiled regions can be seen as a ‘mini-Hi-C experiment’ and analyzed separately with Hi-C or 5C analysis methods. To facilitate this approach, CHiCAGO generates a .bedpe file containing the reads from all bait-to-bait interactions as part of its preprocessing step.

CHiCAGO workflow

This section summarizes the workflow of CHi-C interaction calling by CHiCAGO. The tools for sequencing data preprocessing and downstream analyses are described in the Procedure.

In addition to the CHi-C sequencing data, CHiCAGO requires five ‘design files’ that depend on the specific genome and restriction enzyme used: the restriction map file (.rmap) and the list of baited restriction fragments (.baitmap), as well as three auxiliary files termed ‘NPerBin’ (.npb), NBaitsPerBin (.nbpb) and proxOE (.poe). The restriction map and the bait map files must be provided by the user (see Materials), whereas the three auxiliary files can be generated using the makeDesignFiles.py script (Fig. 2, panel ‘CHiCAGO.Create design files’; see the Procedure for details) provided as part of the chicagoTools suite together with other auxiliary scripts (https://bitbucket.org/chicagoTeam/chicago/src/master/chicagoTools/). It is recommended that the five design files be placed in the same directory (denoted as designDir) that can be provided as a parameter to the setExperiment() function in the Chicago R package, as well as to various auxiliary scripts (see Supplementary Table 1 for a brief description of key functions in Chicago). Precomputed files for the HindIII PCHi-C design used in a number of previous studies are publicly available (https://osf.io/nemc6/). However, for new CHi-C experiments, even those using the same restriction enzyme and baiting strategy, users should regenerate design files based on their own set of capture probes as described in the Procedure.

The interaction data are supplied to the Chicago R package in the .chinput format. This is principally a pileup table listing read counts per pair of restriction fragments where at least one fragment is baited. These data are used in Chicago’s readAndMerge() function to populate the chicagoData object. The .chinput files are generated from aligned CHi-C reads using the auxiliary script bam2chicago.sh provided as part of the chicagoTools suite, which additionally requires the .rmap and .baitmap files as input (Fig. 2, panel ‘CHiCAGO.Prepare interaction data’). Once the chicagoData object is populated with both experimental settings and interaction data files, the user can run the pipeline by calling the chicagoPipeline() function in R (Fig. 2, panel ‘CHiCAGO. Run CHiCAGO pipeline’). The pipeline adds the detected interaction calls to the chicagoData object. The resulting object is typically saved in the RDS format (native R data format), and can be further converted into a human- and genome browser-readable format, using the exportResults() function (see the Procedure for details).

As a final quality control (QC) step, Chicago provides the peakEnrichment4Features() function, which computes observed and expected overlaps of the ‘other ends’ of significant interactions with features of interest. The expected overlap is estimated by sampling fragments randomly in a manner that replicates the distribution of interaction distances observed for significant interactions. Plots generated by this function are described later, in the ‘Feature enrichment plots’ section.
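The distance-matched sampling idea behind the expected overlap can be sketched as follows. This is an illustrative simplification and not the peakEnrichment4Features() implementation: the input layout (a list of `(distance, has_feature)` pairs for candidate other ends), the fixed-width distance bins and sampling with replacement are all assumptions of this example.

```python
import random
from collections import Counter

def expected_overlap(sig_distances, candidates, n_draws=100,
                     bin_width=50_000, seed=0):
    """Mean number of feature-overlapping fragments in distance-matched samples.

    sig_distances: distances (bp, non-negative) of the significant interactions.
    candidates: (distance, has_feature) pairs for all candidate other ends
                (hypothetical input format).
    """
    rng = random.Random(seed)
    # Group candidate fragments by distance bin.
    by_bin = {}
    for dist, has_feature in candidates:
        by_bin.setdefault(dist // bin_width, []).append(has_feature)
    # Number of fragments to draw per bin mirrors the significant interactions.
    needed = Counter(d // bin_width for d in sig_distances)
    totals = []
    for _ in range(n_draws):
        total = 0
        for bin_id, n in needed.items():
            pool = by_bin.get(bin_id, [])
            if pool:
                total += sum(rng.choices(pool, k=n))  # sampling with replacement
        totals.append(total)
    return sum(totals) / n_draws
```

The observed overlap (the number of significant other ends carrying the feature) can then be compared with this expected value; matching on distance prevents short-range bias in the significant set from inflating the apparent enrichment.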

The whole workflow described above is also implemented in a wrapper R script, runChicago.R, provided in the chicagoTools suite, which is executable from the command line and is particularly useful for running the analysis in batch mode. Finally, CHiCAGO results from multiple experiments can be summarized in the form of a ‘peak matrix’ using makePeakMatrix.R from the chicagoTools suite (see the Procedure for details).

Using CHiCAGO with CHi-C data produced with a four-cutter enzyme

The default settings of the CHiCAGO pipeline have been developed for PCHi-C experiments using the six-cutter enzyme HindIII, and may require some modification depending on the restriction enzyme of choice and the sequence coverage across the baited regions. HindIII produces average fragment sizes of ~3,700 bp in the human genome. However, more frequently cutting enzymes such as the four-cutters DpnII or MboI produce much smaller fragment sizes (in the case of these enzymes, an average of ~430 bp in the human genome). Consequently, the choice of the restriction enzyme impacts on the number of possible fragment pairs in the CHi-C library, which in turn affects the interaction-level sequencing coverage for a given total number of sequencing reads per sample. In this section, we describe the parameters recommended for standard use when analyzing CHi-C data generated with a four-cutter enzyme. A strategy to choose optimal parameter settings for a custom experiment is presented below, and Box 1 describes how CHi-C data can be binned into larger fragment pools to increase interaction-level coverage.

Box 1. Using CHiCAGO with binned restriction fragment pools.

CHi-C data, particularly those generated with four-cutter restriction enzymes or sequenced at relatively low coverage, may benefit from binning to enable robust and sensitive signal detection by CHiCAGO. For this reason, we suggest analyzing four-cutter data in ~5 kb bins (comparable with six-cutter resolution) alongside unbinned data.

To use binned data as input, it is sufficient to modify CHiCAGO design files such that they contain ‘virtual’ restriction fragments corresponding to sets of consecutive fragments of a predefined minimum length. The baited fragments can be left unbinned as at reasonable levels of capture efficiency they are usually sufficiently represented without the need to increase coverage.

The makeBins.R script provided as part of the chicagoTools suite can be used for this purpose, taking the .baitmap and .rmap files from the original restriction digest and capture design as input and generating the equivalents of these files for the binned design.

A typical run is executed in the following way:

Rscript --vanilla makeBins.R --include_baits <rmap>

The full set of parameters for makeBins.R is as follows:

Rscript makeBins.R [--] [--help] [--include_baits] [--verbose]  [--opts OPTS] [--baitmap BAITMAP]
[--output_prefix OUTPUT_PREFIX] [--binsize BINSIZE] [--start START] [--end END]

The <rmap> should be replaced with the full path to .rmap file (bed file containing coordinates of the restriction fragments). The rmap entries should be coordinate sorted.

The include_baits flag specifies whether the baited fragments should be included in the binning process or left solitary.

The full list of parameters is provided below:

Parameter Description
baitmap Full path to .baitmap file (bed file containing coordinates of baited restriction fragments). Baitmap should be coordinate sorted. If not provided, the script will look for .baitmap file matching .rmap file header
binsize Size of bins in bp. Default: 5,000
output_prefix File name prefix for the resulting binned .rmap file reflecting the binning mode (binned or not binned). If none is provided, it will default to the name of the original .rmap file
start The name of column listing the fragment start coordinate in the .rmap file. By default, it assumes that the .rmap file does not contain a header; thus, the column name assigned internally by R will be V2
end The name of column listing the fragment end coordinate in the .rmap file. By default, it assumes that the .rmap file does not contain a header; thus, the column name assigned internally by R will be V3
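The binning principle described in this box (merging consecutive fragments into ‘virtual’ fragments of a predefined minimum length while leaving baited fragments solitary) can be sketched as below. This is a conceptual sketch only and does not reproduce makeBins.R: the tuple layout, the 1-based inclusive coordinates and the handling of a short trailing bin are assumptions of this example.

```python
def bin_fragments(fragments, baited_ids=frozenset(), binsize=5000):
    """Merge consecutive fragments into bins of at least `binsize` bp.

    fragments: coordinate-sorted (chrom, start, end, frag_id) tuples
               (assumed 1-based, inclusive coordinates).
    Returns (chrom, start, end, [frag_ids]) tuples; baits are kept unbinned.
    """
    bins = []
    current = None  # bin under construction
    for chrom, start, end, frag_id in fragments:
        if frag_id in baited_ids:
            # Close any open bin, then emit the bait as its own 'virtual' fragment.
            if current is not None:
                bins.append(current)
                current = None
            bins.append((chrom, start, end, [frag_id]))
            continue
        if current is not None and current[0] != chrom:
            bins.append(current)  # never merge across chromosomes
            current = None
        if current is None:
            current = (chrom, start, end, [frag_id])
        else:
            current = (current[0], current[1], end, current[3] + [frag_id])
        if current[2] - current[1] + 1 >= binsize:
            bins.append(current)
            current = None
    if current is not None:
        bins.append(current)  # a short trailing bin is kept as is (simplification)
    return bins
```

Note that a bin interrupted by a bait or by the end of a chromosome may be shorter than `binsize` in this sketch.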

When running Chicago with four-cutter enzymes (or other frequent cutters), users should adjust several key parameters when creating the design files with makeDesignFiles.py script. These parameters are minFragLen, maxFragLen, maxLBrownEst and binsize (see the Procedure).

The minFragLen and maxFragLen parameters are the minimum and maximum fragment lengths in the experiment to be included in the analysis, respectively. The default minFragLen of 150 bp is optimized for six-cutters, while for four-cutters we recommend setting this parameter to 75 bp to accommodate the smaller ligation products generated by these enzymes. Likewise, the default maxFragLen setting of 40,000 bp is optimized for six-cutters, and we recommend setting it to a lower value of ~12,000 bp for four-cutter enzymes.

The parameters maxLBrownEst and binsize are used in the estimation of the Brownian background component. maxLBrownEst defines the proximal distance range from the bait (in either direction) for estimating the Brownian component, while binsize refers to the length of the restriction fragment bins used in this procedure. In general, we recommend that binsize covers four or five restriction fragments, corresponding to the default setting of ~20,000 bp for six-cutters and the recommended lower setting of ~1,500 bp for four-cutters. For maxLBrownEst, the six-cutter-optimized default is 1,500,000 bp, while a lower value of ~75,000–120,000 bp is recommended as a starting point for four-cutter data.

While the above parameter values are the recommended initial settings for four-cutter experimental designs, their optimal values depend on factors beyond the choice of the restriction enzyme, in particular on sequencing depth. It is therefore advisable to assess the suitability of specific settings for the data analyzed, as described in the ‘Custom parameter tuning’ section.
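The starting values quoted above can be collected in one place, as in the following minimal Python sketch. The names PRESETS, suggest_binsize and settings_lines are hypothetical helpers, and the tab-separated 'name, value' layout is an assumption about the settings file rather than a documented specification. The four-cutter maxLBrownEst is set to the lower end of the suggested 75,000–120,000 bp range.

```python
# Starting values quoted in the text; they should be tuned for each dataset.
PRESETS = {
    "six_cutter": {"minFragLen": 150, "maxFragLen": 40000,
                   "maxLBrownEst": 1500000, "binsize": 20000},
    "four_cutter": {"minFragLen": 75, "maxFragLen": 12000,
                    "maxLBrownEst": 75000, "binsize": 1500},
}


def suggest_binsize(fragment_lengths, fragments_per_bin=4.5):
    """Bin size covering roughly four to five average restriction fragments."""
    lengths = list(fragment_lengths)
    if not lengths:
        raise ValueError("no fragment lengths supplied")
    mean_length = sum(lengths) / len(lengths)
    return int(round(mean_length * fragments_per_bin))


def settings_lines(preset):
    """Render a preset as 'name<TAB>value' lines (assumed settings-file layout)."""
    return ["{}\t{}".format(name, value) for name, value in preset.items()]
```

For example, suggest_binsize() applied to the fragment lengths computed from an .rmap file gives a data-driven starting point that can be compared against the preset binsize before the design files are generated.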

Finally, users should be aware that the choice of the restriction enzyme affects the distance distribution of the detected interactions, with six-cutters generally leading to longer-range detected interactions compared with four-cutters. The reduced numbers of long-range interactions detectable in four-cutter data are likely due to the typically higher sparsity of data arising from these assays and can be mitigated by binning as shown below. By contrast, the lower frequency of short-range interactions detected with six-cutters is potentially due to the high variance of background signal in this range, leading to reduced sensitivity. We illustrate these effects by comparing two published PCHi-C datasets on pluripotent cell-derived cardiomyocytes, generated using HindIII34 and MboI33, respectively. The majority of detected interactions in HindIII PCHi-C fall between 100 kb and 400 kb, while for MboI most significant interactions are detected in the 3–100 kb range (Extended Data Fig. 1). As a result, the observed overlap in significant interactions between HindIII and MboI PCHi-C data is low (Extended Data Fig. 1a, left panel). Importantly, however, the other ends of both shared and MboI- or HindIII-specific interactions are enriched for regulatory histone marks (Extended Data Fig. 1a, right panel), suggesting that interactions in all three groups are biologically relevant. Binning of MboI PCHi-C data to 5 kb fragments allows CHiCAGO to detect more significant interactions in the longer range (~30–200 kb), increasing the overlap with HindIII-based data, while retaining the enrichment for regulatory histone marks in all groups of interactions (Extended Data Fig. 1b). Therefore, combining binned and unbinned four-cutter data leads to the best tradeoff between sensitivity and resolution. This point is further elaborated in a recent preprint by Su et al., who have independently arrived at the same conclusions53.

Custom parameter tuning

The optimal choice of CHiCAGO parameters depends on multiple data properties, including, but not limited to, sequencing coverage and the choice of restriction enzyme. Here we describe strategies for selecting optimal parameters in a data-driven way.

Mitigating the effects of data sparsity on background model estimation

While the procedure is reasonably robust to the choice of the maxLBrownEst and binsize parameters, CHiCAGO expects that sequencing coverage is sufficient to observe a distance decay of mean binwise read counts throughout the maxLBrownEst range for most baits. If this is not the case, sparsity may adversely affect the estimation of the scaling factors and, ultimately, interaction P-values. At the same time, the maxLBrownEst range should be large enough to enable robust estimation of the distance function and normalization parameters.

Suboptimally estimated background can first be recognized visually from bait interaction profiles plotted using the plotBaits() function with the plotBprof parameter set to TRUE. This setting will plot the estimated mean and the upper boundary of the 95% confidence interval for the Brownian background component alongside the observed read counts for each interaction. If sparsity is not accounted for correctly, the Brownian background estimates may decay too quickly towards zero with increasing distance, leading to many low-count long-range interactions being called as significant (Fig. 3a, top left).

Fig. 3. Visualizing the suitability of the background model estimation.

DpnII promoter CHi-C data in pluripotent cell-derived cardiomyocytes33 were analyzed using either default parameters optimized for six-cutter enzymes (maxLBrownEst 1.5 Mb and bin size 20 kb) or using the suggested parameters for four-cutter enzymes (maxLBrownEst 75 kb and bin size 1.5 kb). For comparison, we used the same DpnII data binned into 5 kb bins and analyzed them with the default six-cutter parameters, and we also analyzed six-cutter (HindIII) data34 with the default six-cutter parameters. a, Interaction profile of the ATG101 promoter, plotted using the plotBaits() function, for each of the four processed datasets. Note the likely spurious low-count interaction detected as significant in the DpnII data analyzed with the default six-cutter-optimized settings (red arrow), but not in the other three data/parameter combinations. b, Boxplots indicating data sparsity across the maxLBrownEst range used for estimating the Brownian background component. For each data/parameter combination, the distance range plotted corresponds to maxLBrownEst, and the size of each bin is set to binsize. Sparsity within each bin was calculated as the proportion of all possible other-end fragments in the bin with a count of zero. For well-defined parameter settings (DpnII with four-cutter parameters, and 5-kb-binned DpnII and HindIII with default parameters), sparsity increases gradually across the Brownian estimation range. In contrast, for DpnII data analyzed with six-cutter settings, sparsity rapidly reaches a plateau close to one, indicating that interactions with almost all other ends in most bins have zero count for most baits. The boxplots were generated using the plotBackgroundSparsity.R script provided as part of chicagoTools.

We recommend that binsize is set to correspond to the average length of four to five restriction fragments. This allows the bins to be large enough to estimate the average count per bin robustly, but small enough that the counts from individual interactions within each bin do not vary too greatly with distance. To test that maxLBrownEst is chosen appropriately at a given binsize setting, we recommend assessing data sparsity in the chosen maxLBrownEst range, for example by plotting the proportion of missing (zero-count) interactions per distance bin across baits. Figure 3b (generated using the plotBackgroundSparsity.R script included in chicagoTools) provides examples of such plots for DpnII, binned DpnII and HindIII data processed with six- and four-cutter parameter settings. Applying the default six-cutter settings (binsize 20 kb; maxLBrownEst 1.5 Mb) to DpnII-based data generates a rapid increase in the proportion of missing interactions with distance (Fig. 3b, top left panel; compare with the binned DpnII and HindIII data in the bottom panels, for which these parameters are suitable). As a result, the Brownian background component is underestimated, and many interactions with very low counts are called as significant (Fig. 3a, top left). To mitigate this problem, maxLBrownEst should be sufficiently large to observe a distance decay of averaged read counts throughout the maxLBrownEst range for most baits, but not so large that the baits become overly sparse across the assessed distance. Figure 3a,b (top right panel) shows DpnII-based data processed with the recommended four-cutter settings (binsize 1.5 kb; maxLBrownEst 75 kb).
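The sparsity measure described here (the proportion of possible other-end fragments with a zero count in each distance bin) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the plotBackgroundSparsity.R code: the function name, the input layout and the use of integer absolute distances are all hypothetical.

```python
from collections import defaultdict


def sparsity_per_bin(interactions, fragments_per_bin, binsize, max_l_brown_est):
    """Proportion of zero-count other-end fragments in each distance bin, per bait.

    interactions: iterable of (bait_id, abs_distance, read_count), integer distances
    fragments_per_bin: {bait_id: [number of possible other ends in bin 0, 1, ...]}
    Returns {bait_id: [sparsity in bin 0, 1, ...]}; bins without fragments give None.
    """
    n_bins = max_l_brown_est // binsize
    observed = defaultdict(lambda: [0] * n_bins)
    for bait_id, distance, count in interactions:
        if count > 0 and 0 < distance <= max_l_brown_est:
            # Bins are (0, binsize], (binsize, 2 * binsize], ...
            observed[bait_id][(distance - 1) // binsize] += 1

    result = {}
    for bait_id, possible in fragments_per_bin.items():
        seen = observed[bait_id]
        result[bait_id] = [
            None if total == 0 else 1.0 - min(hit, total) / total
            for hit, total in zip(seen, possible)
        ]
    return result
```

Values that climb rapidly to a plateau close to one across the bins would suggest that maxLBrownEst is too large for the coverage at hand.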

P-value weighting

CHiCAGO corrects for multiple testing using a P-value weighting approach dependent on interaction distance, such that longer-range and transchromosomal interactions are corrected more stringently. The weights are derived by fitting a bounded logistic curve to the number of observed reproducible interactions across replicates as a function of distance. The fitting parameters of this logistic regression (weightAlpha, weightBeta, weightGamma and weightDelta) are then used to estimate the P-value weight for each interaction given its distance. Technically, the weights therefore correspond to the probability of having a true interaction at a given distance relative to the whole distance range.
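To make the shape of such a weight curve concrete, the following Python sketch evaluates one possible bounded logistic function of log-distance. The functional form, the function names and the parameter values used in the usage note are assumptions chosen for illustration; they are not taken from the package source or its defaults, and the sketch omits any normalization of the weights.

```python
import math


def expit(x):
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def distance_weight(distance, alpha, beta, gamma, delta):
    """Bounded logistic curve in log-distance (assumed form, for illustration).

    The value stays between expit(gamma) and expit(delta).
    """
    eta = expit(alpha + beta * math.log(distance))
    return expit(gamma) + (expit(delta) - expit(gamma)) * eta


def weighted_log_p(log_p, distance, alpha, beta, gamma, delta):
    """Weighted log P-value, i.e. the P-value divided by its weight on the log scale."""
    return log_p - math.log(distance_weight(distance, alpha, beta, gamma, delta))
```

With illustrative values such as alpha = 34, beta = −2.6, gamma = −17 and delta = −7, the weight in this sketch decreases monotonically with distance, so the same raw P-value is penalized more heavily at 1 Mb than at 10 kb.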

The default P-value weights specified in CHiCAGO were calibrated on human macrophage datasets29, produced using the six-cutter HindIII. To estimate these weights robustly, we pooled across three different macrophage samples with similar interaction profiles (unstimulated, M1 and M2 stimulated), resulting in nine biological replicates overall. These estimates were broadly consistent with those obtained for two replicates of mouse ES cells and three replicates of human lymphoblastoid cells15, suggesting that for HindIII data of standard quality and coverage, using the default weights is generally appropriate. However, for other restriction enzymes, atypical coverage levels and other custom features, P-value weights need to be reestimated. This can be done using the fitDistCurve.R script provided as part of the chicagoTools suite (Box 2). We recommend assessing the goodness of fit using the curveFit plot produced by this script. If the fit is good, the estimated parameters can be used in the CHiCAGO P-value weighting procedure. To provide these parameters to CHiCAGO, it is recommended to save them in a custom settings file (see Supplementary Table 1) that can be provided as a parameter to the setExperiment() function when running the package within R or supplied to the runChicago.R wrapper through the --settings-file option (see the Procedure for details).

Box 2. Estimating P-value weights.

The script for estimation of P-value weights requires as input the CHiCAGO objects (.Rds) for each biological replicate, which can be obtained by running CHiCAGO with the default weights. The results of the fitting are reported in two diagnostic plots (named curveFit and mediancurveFit), and the new fitting parameters are saved in the output settings file.

A typical run is executed in the following way:

Rscript fitDistCurve.R <output_prefix> --inputs Rep1.Rds,Rep2.Rds,Rep3.Rds

The full set of parameters for fitDistCurve.R is as follows:

Rscript fitDistCurve.R [--help] [--opts OPTS] [--inputs INPUTS] [--threshold THRESHOLD] [--subsets SUBSETS]
[--largeBinSize  LARGEBINSIZE] [--binNumber BINNUMBER]  [--halfNumber HALFNUMBER]

The <output_prefix> should be replaced with the file name prefix to use for the output files: the settings file for use in Chicago, the summary object and the plot. The inputs are comma-separated (without spaces) paths to CHiCAGO objects (.Rda or .Rds). At least two datasets are required. The more datasets are provided, the more accurate the weight estimation. Additional optional parameters are described below.

Parameter Description
threshold Default: −10. Threshold applied to log(p) values (not CHiCAGO scores!). If the fitting is not successful (the red fitted line does not follow the data well), use a more lenient signal threshold, such as −5
subsets Number of subsets to partition the data into. Parameters estimated on subsets, median taken. Default: 5
largeBinSize Largest bin size to consider. Default: 1e6
binNumber Number of large bins. Default: 16
halfNumber First bin is subdivided into halves—the number of times to do this. Default: 5

We provide precomputed weights for both HindIII and DpnII/MboI in Chicago’s Bitbucket repository (chicagoTools/Tuning_CHiCAGO_settings_four_cutters.md). These weights are, however, only a starting point, and we recommend that users, particularly those generating large volumes of CHi-C data, optimize their own weight profiles as described above.

Score cutoff to call interactions

CHiCAGO scores correspond to weighted −log P-values that are ‘soft-thresholded’ such that very-short-range interactions with zero reads produce a score of zero. While we loosely refer to interactions passing a predefined CHiCAGO score cutoff as ‘significant’, these scores are primarily a ranking measure, and therefore the choice of a cutoff is to a degree a subjective exercise.

Based on a previously described analysis14, a score of 5 is suggested as a stringent cutoff for calling ‘significant’ interactions. This cutoff can be tuned further for a given experimental setting and research question according to custom criteria. One strategy for this is based on balancing the enrichment of nonbaited interacting regions (‘other ends’) for specific chromatin features (e.g., the enhancer-associated histone mark H3K4me1) with a recall of such features, as illustrated in Fig. 4. A recent publication by Disney-Hogg et al.54 has proposed alternative approaches for cutoff determination based on the consistency of interaction calls between individual replicates.
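The two quantities balanced in this strategy can be computed along the following lines. This is a minimal Python sketch assuming that each interaction has already been annotated with whether its other end overlaps the feature of interest; the function names and the input layout are hypothetical, and recall is defined here relative to the marked other ends present in the supplied interaction list.

```python
def enrichment_by_score_range(interactions, edges):
    """Percentage of interactions whose other end carries the mark, per score range.

    interactions: list of (other_end_id, score, is_marked)
    edges: ascending score boundaries, e.g. [0, 3, 5, 10]; ranges are [lo, hi).
    """
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_range = [marked for _, score, marked in interactions if lo <= score < hi]
        pct = 100.0 * sum(in_range) / len(in_range) if in_range else None
        rows.append(((lo, hi), pct))
    return rows


def recall_at_cutoffs(interactions, cutoffs):
    """Percentage of marked other-end fragments retained at each score cutoff."""
    all_marked = {oe for oe, _, marked in interactions if marked}
    rows = []
    for cutoff in cutoffs:
        kept = {oe for oe, score, marked in interactions if marked and score >= cutoff}
        pct = 100.0 * len(kept) / len(all_marked) if all_marked else None
        rows.append((cutoff, pct))
    return rows
```

Plotting the first output against the score ranges and the second against the cutoffs gives the two curves from which a cutoff can be chosen, for instance where enrichment starts to plateau while recall remains acceptable.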

Fig. 4. Tuning the CHiCAGO score cutoff by balancing recall and enrichment of regulatory chromatin features at interacting fragments.

a, Percentage of interactions with H3K4me1 marked fragments (y-axis) within each given CHiCAGO score range (as specified on the x-axis), computed for HindIII capture-HiC data generated on MyLa cell line (CD8+ T cell-derived)36. The region highlighted in blue shows the cutoff range, where the enrichment of interactions with H3K4me1-marked fragments starts to reach a plateau. b, Recall of H3K4me1 marked fragments (y-axis, expressed as a percentage) at the increasingly stringent CHiCAGO score cutoffs (x-axis), computed for HindIII Capture-HiC data in MyLa cells. The gray dashed line highlights the CHiCAGO score cutoff of 5.

While defining a CHiCAGO score cutoff is useful for prioritizing interactions of potential interest, we do not recommend treating it as a binary indicator of a ‘presence’ versus ‘absence’ of an interaction. Instead, we suggest that exploratory analyses such as clustering use CHiCAGO scores on the quantitative scale (potentially in an arcsinh-transformed form to compress the high score range while avoiding problems at zero), while formal differential testing uses read-count-based statistics—for example, as implemented in Chicdiff45 (see the Procedure).
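As a small illustration of the transformation mentioned above, the Python sketch below applies arcsinh to a list of scores. The function name is hypothetical; the only property relied on is that arcsinh is defined at zero, roughly linear for small values and roughly logarithmic for large ones.

```python
import math


def asinh_scores(scores):
    """arcsinh-transform CHiCAGO scores before exploratory analyses such as clustering."""
    return [math.asinh(score) for score in scores]
```

Unlike a plain logarithm, this transformation maps a score of zero to zero, so interactions with no evidence do not need special handling.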

Assessment of the results

QC plots

The chicagoPipeline() function produces three diagnostic plots with the option to save them to disk (which is done automatically in the runChicago.R wrapper) (Fig. 2, Fig. 5). The first figure (saved as <prefix>_oeNorm.pdf, Fig. 5a) is a barplot showing the scaling factors estimated for pools of ‘other end’ fragments. CHiCAGO selects these pools based on transchromosomal read counts detected for each other end. The barplot is expected to show a gradual increase of other-end scaling factors as the transchromosomal counts increase. The scaling factors are also expected to be larger for bait-to-bait interactions, which have a generally higher capture efficiency.

Fig. 5. QC plots generated by CHiCAGO.

a, Barplot showing the scaling factors (si’s) computed for each pool of other ends. b, Boxplots showing distribution of technical noise estimates for each pool of baits/viewpoints (top) and for each pool of other ends (bottom). c, Distance dependency of background counts and computed fit (red curve), plotted on a log–log scale. d, Interaction profiles for three example viewpoints. High-scoring interactions detected by Chicago (score ≥5) are shown in red, and subthreshold interactions (3≤ score <5) are shown in blue. e, Number of overlaps between chromatin features of interacting fragments detected using Chicago (yellow bars) versus number of overlaps from 100 random distance-matched subsets of HindIII fragments (blue bars). Error bars represent 95% confidence intervals. The plots were generated using CHi-C data in HaCaT cells36. B2B, bait-to-bait.

The second figure (<prefix>_techNoise.pdf, Fig. 5b) contains two plots showing how technical noise estimates vary for pools of baited fragments and those of other ends, respectively, each of them defined by the number of detected transchromosomal read pairs. Technical noise estimates should increase for higher trans counts, with bait-to-bait interactions showing higher values than those with nonbaited other ends (note that this may not be the case in analyses where the restriction fragments are binned).

Finally, the third plot (<prefix>_distFun.pdf, Fig. 5c) shows the estimated distance function reflecting how the mean number of Brownian reads for the estimated ‘reference’ bait varies with genomic distance. The curve should fit the points reasonably well, and the function should decrease monotonically with distance for reasons discussed above.

Bait interaction profile examples

To illustrate CHiCAGO performance on specific examples, the pipeline calls the plotBaits() function to plot raw read counts versus linear distance from bait for a subset of random baits, labeling significant interactions in a different color (Fig. 5d shows three examples). By default, 16 random bait interaction profiles are plotted within a 2 Mb window centered at the bait. Interactions highlighted in red pass the threshold of 5, while those highlighted in blue pass the more lenient threshold of 3. To visualize interaction profiles for baits of interest, the plotBaits() function can be called post hoc using a previously generated CHiCAGO object and a set of bait IDs (from the .baitmap file) as input (see the Procedure for details).

Feature enrichment plots

When baited fragments correspond to regulatory regions such as gene promoters or enhancers (which is typically the case), we expect their interacting fragments to be enriched for chromatin features such as histone marks and transcription factor binding sites. Thus, enrichment of the other ends of significant interactions for these features can be used to validate CHiCAGO interaction calls in most scenarios. This can be done automatically using the peakEnrichment4Features() function in the R package. This function returns a barplot and a text file containing the values used to construct the plot (Fig. 5e). In the barplot, yellow bars show how many interacting other ends overlap with each feature, and blue bars show the expected overlap values for noninteracting other ends, with error bars representing 95% confidence intervals. The expected values are computed by averaging the number of overlaps found in multiple sets (by default, 100) of distance-matched pairs drawn from the pool of nonsignificant contacts. Results show enrichment for cases where the yellow bar is higher than the upper bound of the confidence interval. It is important to note that this procedure requires sets of chromatin features provided by the user, and expected results may depend on the choice of these features and the identity of the baited fragments, as well as on a given cell type and condition.
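The comparison between observed and expected overlaps can be sketched in Python as follows. This is not the peakEnrichment4Features() implementation: the function names are hypothetical, the random samples are assumed to have been drawn already, and the interval is approximated as the mean plus or minus 1.96 standard deviations, which is an assumption of this sketch.

```python
import statistics


def expected_overlap(random_overlap_counts, z=1.96):
    """Mean and approximate 95% interval of overlap counts across random samples.

    Uses mean +/- z * standard deviation (an assumption of this sketch).
    """
    mean = statistics.mean(random_overlap_counts)
    sd = statistics.stdev(random_overlap_counts) if len(random_overlap_counts) > 1 else 0.0
    return mean, mean - z * sd, mean + z * sd


def is_enriched(observed_overlaps, random_overlap_counts):
    """True when the observed count exceeds the upper bound of the interval."""
    _, _, upper = expected_overlap(random_overlap_counts)
    return observed_overlaps > upper
```

In practice, random_overlap_counts would hold one overlap count per distance-matched random sample (100 by default in the procedure described above), computed separately for each chromatin feature.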

Expertise needed to implement the protocol

The following protocol is written for a user with some experience in operating the command line. Some minimal experience in the R/RStudio environment is required to plot interaction profiles for baits of interest and to run parameter tuning scripts if needed.

Materials

Equipment

Starting input files

  • CHi-C aligned bam files (Procedure Section 1, Steps 1–4; see also Box 3).

  • Restriction map file (.rmap): a bed-like file containing the coordinates of the restriction fragments. Usually this file contains four columns (without a header), as follows: chr, start, end, fragmentID (Table 1)

  • Bait map file (.baitmap): a bed-like file that contains the coordinates of the baited restriction fragments with respective annotation. Usually this file contains five columns (without a header), as follows: chr, start, end, fragmentID, baitAnnotation. It is important to note that the regions specified in this file should be an exact subset of the .rmap file (including fragmentID). The baitAnnotation field is a text field that is only used for output and plot annotation (Table 1)
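Because the .baitmap entries must be an exact subset of the .rmap entries, it can be worth checking the two files before generating design files. The Python sketch below shows one way of doing this; the function names are hypothetical, the files are assumed to be headerless and whitespace-separated with integer fragment IDs, and the coordinates in the usage note are made up for illustration.

```python
def parse_bed_like(lines):
    """Return (chr, start, end, fragmentID) tuples from whitespace-separated lines."""
    records = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom, start, end, frag_id = fields[:4]
        records.append((chrom, int(start), int(end), int(frag_id)))
    return records


def check_design(rmap_lines, baitmap_lines):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    rmap = parse_bed_like(rmap_lines)
    baits = parse_bed_like(baitmap_lines)
    ids = [record[3] for record in rmap]
    if len(ids) != len(set(ids)):
        problems.append("fragment IDs in the .rmap are not unique")
    rmap_set = set(rmap)
    for record in baits:
        if record not in rmap_set:
            problems.append(
                "bait {} is not an exact match to an .rmap fragment".format(record[3]))
    return problems
```

For example, check_design(open("design.rmap"), open("design.baitmap")) would return an empty list for a consistent pair of files (the file names here are placeholders).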

Box 3. QC and alignment of raw CHi-C reads using HiCUP.

Prior to using CHiCAGO, CHi-C data must first undergo QC in the same manner as conventional Hi-C data. The main purpose is to remove di-tags that are generated as identifiable technical artifacts in the Hi-C protocol. While CHiCAGO can theoretically work with any Hi-C aligner that generates .bam files, we recommend HiCUP19 for this task. Tutorials on using HiCUP are available online (https://www.bioinformatics.babraham.ac.uk/projects/hicup/).

HiCUP initially truncates paired-end Hi-C reads at the first detected ligation junction from the 5′ end, thereby creating single-restriction fragment reads that are then aligned individually to the genome using Bowtie (ref. 56) or (preferably) Bowtie2 (ref. 57). The reads are then re-paired, and any read whose pair did not map correctly is discarded. Next, HiCUP filters the reads to remove invalid di-tags that arise from scenarios such as same-fragment ligation, contiguous ligation and re-ligation events. The reads are then de-duplicated, and an output hicup.bam file of unique, filtered reads is generated.

In general, a high-quality Hi-C library is expected to have >75% valid di-tags. However, a high proportion of invalid di-tags does not necessarily mean that the data are unusable. An important additional measure of sample quality is the proportion of cis- to trans-chromosomal di-tags, which is reported by HiCUP in the final html file. A high proportion of transchromosomal di-tags may indicate a problem with Hi-C library preparation, in particular, with the formaldehyde fixation step.

Within HiCUP, a wrapper script is provided to run all of these processes automatically using a configuration file. This has the advantage that a summary HTML QC report is generated. The HiCUP configuration file should be customized according to the restriction enzyme used in the experiment. We recommend using the following insert length settings for a six-cutter enzyme such as HindIII:

Shortest: 150
Longest: 800

For a four-cutter such as DpnII, we recommend reducing the shortest allowed insert size as follows:

Shortest: 50
Longest: 800

Standard HiCUP truncates Hi-C reads at the first detected ligation junction, considering at most a pair of interacting fragments per di-tag. However, it is known that Hi-C commonly results in more than two fragments ligated together, and therefore each Hi-C read, particularly when a four-cutter enzyme is used, may also contain more than two ligated restriction fragments. Filtering out this information potentially results in missed valid di-tag pairs resulting from true chromatin contacts. A recent version of HiCUP (available at https://github.com/StevenWingett/HiCUP/tree/combinations) enables exploiting these cases to increase the numbers of valid read pairs. In this version, each read is split at all ligation junctions identified within it, and all resulting pairwise combinations of the restriction fragments within the read and across the di-tags are then considered in the analysis. An extra filtering step is then performed to retain only valid, unique di-tags within these combinations. In our initial tests, this approach has resulted in a ~10% increase in the number of unique, valid di-tags obtained from a Hi-C library sequenced with 150 bp paired ends.

Following QC, on-target di-tags can be detected using the script get_captured_reads provided with HiCUP. This script compares the bam file against the .baitmap and creates two bam files: uncaptured.bam and captured.bam. In addition, a capture_summary.txt file indicates the percentage of captured reads and the levels of cis and trans interactions. However, the true percentage of on-target reads can be affected by the level of PCR duplication within the Hi-C and PCHi-C libraries, since captured reads are proportionally more likely to be duplicated than uncaptured reads. Therefore, it may be desirable to run get_captured_reads on a non-deduplicated version of the HiCUP output: filt.bam. This file can be obtained by modifying the HiCUP configuration file to retain intermediate output files, as follows:

Keep: 1

The fold enrichment of CHi-C target regions can be calculated by first running get_captured_reads script on a conventional Hi-C sample using the same .baitmap file, and then dividing Percent_total_captured (CHi-C) by Percent_total_captured (Hi-C). To generate input files for CHiCAGO from the aligned CHi-C bam files, the script bam2chicago.sh is available as part of the chicagoTools suite. The script requires the input bam file (for example, hicup.captured.bam), the .rmap file and the .baitmap file. The script intersects the reads with the .baitmap using a minimum overhang of 60% and generates a CHi-C input (.chinput) file that contains read counts for each baited interaction. Note that if using the developmental version of HiCUP mentioned above (‘HiCUP combinations’), an updated version of the script, available with chicagoTools, is required to generate the .chinput file: bam2chicago_V02.sh. The option ‘--combinations’ should be specified.
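The fold-enrichment calculation described above amounts to a single division, shown here as a small Python helper. The function name is hypothetical, and the two percentages are assumed to have been read manually from the Percent_total_captured entries of the respective capture summary files.

```python
def capture_fold_enrichment(percent_captured_chic, percent_captured_hic):
    """Fold enrichment = Percent_total_captured (CHi-C) / Percent_total_captured (Hi-C)."""
    if percent_captured_hic <= 0:
        raise ValueError("the Hi-C capture percentage must be positive")
    return percent_captured_chic / percent_captured_hic
```

As a made-up example, a CHi-C library with 60% captured di-tags compared with a Hi-C library with 4% would correspond to a 15-fold enrichment.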

Table 1. Input files required to run Chicago.
Each entry lists: Input, Extension, Description, Generation.

Input: Restriction map file
Extension: .rmap
Description: Bed file containing coordinates of the restriction fragments. Default, 4 columns: <chr> <start> <end> <fragmentID>. FragmentIDs must be unique. Any fragment mapping outside of these coordinates will be disregarded by Chicago
Example:
1 1 3002504 1
1 3002505 3002505 2
1 3005877 3006265 3
Generation: Provided by user

Input: Baitmap file
Extension: .baitmap
Description: Bed file containing coordinates of the baited restriction fragments and their associated annotations. Default, 5 columns: <chr> <start> <end> <fragmentID> <baitAnnotation>. The regions specified in this file, including their fragmentIDs, must be an exact subset of those in the .rmap file. The baitAnnotation column is a text field that is used only to annotate the output and plots
Example:
1 3090913 3092556 31 U6.149-201
1 3455951 3457756 182 GM1992-001
1 3659706 3665317 261 Xkr4-001
Generation: Provided by user

Input: NPerBin file
Extension: .npb
Description: Tab-separated file containing the number of restriction fragments in each distance bin per bait. Default: <Total no. valid restriction fragments in distance bin 1> … <Total no. valid restriction fragments in distance bin N>, where the bins map within the ‘proximal’ distance range from each bait (0; maxLBrownEst] and bin size is defined by the bin size parameter
Generation: Generated by makeDesignFiles.py

Input: NBaitsPerBin file
Extension: .nbpb
Description: Tab-separated file containing the number of valid baits in each distance bin per other end. Default: <otherEndID> <Total no. valid baits in distance bin 1> … <Total no. valid baits in distance bin N>, where the bins map within the ‘proximal’ distance range from each other end (0; maxLBrownEst] and bin size is defined by the bin size parameter
Generation: Generated by makeDesignFiles.py

Input: Proximal Other End (ProxOE) file
Extension: .poe
Description: Tab-separated file containing distances between baits and other ends that map within the ‘proximal’ distance range from each other (0; maxLBrownEst]. Default, 3 columns: <baitID> <otherEndID> <absolute distance>
Generation: Generated by makeDesignFiles.py

Input: Interaction file
Extension: .chinput
Description: Tab-separated file containing information about all interactions detected in a bam file. Default: <baitID> <otherEndID> <N> <otherEndLen> <distSign>, where N is the number of reads detected for ligation products between the ‘bait’ and ‘other end’, otherEndLen is the length of the ‘other-end’ restriction fragment and distSign is the linear distance between the bait and other-end fragments, respectively
Generation: Generated by bam2chicago.sh

Input: Feature file
Extension: .bed
Description: Bed file containing coordinates of a genomic feature (e.g., histone marks, transcription factors). This will be used to assess its enrichment at loci that show significant interactions
Example:
chr18 3336569 3338454
chr18 3360132 3360614
chr18 3382678 3383869
Generation: Provided by user

Input: Feature list file (optional)
Extension: .txt
Description: Tab-separated file containing path information for all feature files to be used by Chicago. Default, 2 columns: <feature-name> <feature-bed-file-location>
Example:
H3K4me1 H3k04me1StdPk.narrowPeak.bed
H3K4me3 H3k04me3StdPk.narrowPeak.bed
H3K27ac H3k27acStdPk.narrowPeak.bed
Generation: Provided by user

Software

  • Operating system: Linux, Macintosh

  • R v3.3.1 or above (https://www.r-project.org)

  • Bedtools v2.25 (Chicago is currently incompatible with bedtools v2.26+ due to strict BED format compliance checking introduced in this version), preinstalled and added to the path. The required version of bedtools can be installed, for example, through Anaconda:
    conda install -c bioconda bedtools=2.25
  • Perl, preinstalled and added to the path (https://www.perl.org/get.html)

  • Python v2.7 or above (a developmental version of makeDesignFiles.py compatible with Python 3 is also available in the chicagoTools suite), preinstalled and added to the path (https://www.python.org/downloads/)

  • R package ‘argparser’

Hardware

  • 50 GB RAM (a typical Chicago job with two biological replicates for HindIII data)14

  • Four CPUs (suggested for a typical HindIII Capture-HiC dataset)14 ▲ CRITICAL Sequencing depth and the type of restriction enzyme used with or without binning may significantly affect the hardware requirements.

Equipment setup

Minimal requirements

In our experience, a CHi-C experiment with two biological replicates using HindIII or DpnII takes ~2–3 h wall-clock time and uses up to 50 GB RAM. This was tested with various samples that had ~30 million valid, unique on-target reads (after removal of nonvalid and noncaptured di-tags) mapping to between ~4,500 and 18,000 targeted fragments. The process involves ~2 h to generate each of the CHiCAGO input files (.chinput files) in parallel, and up to 20 min for the CHiCAGO wrapper script runChicago.R. Increasing the complexity of the experimental design (e.g., more replicates, baits or sequenced reads) may substantially increase the processing time. The operating system used must be able to run R, Python and Perl applications, and running Chicago might require the ability to run bash scripts in a shell.

Required data

For the purposes of this workflow, we use HindIII CHi-C data generated on two cell lines: HaCaT (keratinocyte) and MyLa (CD8+ T cell-derived) that were produced as part of a study into the genetics of psoriasis susceptibility36. The ~4,500 baits in this capture design target HindIII fragments harboring single-nucleotide polymorphisms associated with various immune disorders. For each cell line, there are two biological replicates. We downsampled these data to obtain ~10 million reads mapping to the bait fragments. Following this, CHiCAGO input (.chinput) files were generated using a reduced version of the baitmap containing only chromosomes 7, 9, 13, 17, 20, 21 and 22. These were the chromosomes that were identical between the two baitmaps for HaCaT and MyLa, since the HaCaT design included a few extra loci. Note that the rmap was accordingly filtered such that the trans interactions were also restricted to these chromosomes, further reducing the size of the .chinput files.
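Restricting a .baitmap or .rmap file to a subset of chromosomes, as was done for the example data, can be sketched in Python as follows. The function name is hypothetical, and ignoring a leading 'chr' prefix is an assumption about chromosome naming rather than a property of the published files.

```python
KEPT_CHROMOSOMES = {"7", "9", "13", "17", "20", "21", "22"}


def filter_by_chromosome(lines, chromosomes=KEPT_CHROMOSOMES):
    """Keep BED-like lines whose first column is one of `chromosomes`.

    A leading 'chr' prefix is ignored, which is an assumption about naming.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom = fields[0][3:] if fields[0].startswith("chr") else fields[0]
        if chrom in chromosomes:
            kept.append(line)
    return kept
```

Applying the same filter to both files keeps the .baitmap an exact subset of the .rmap, as required by Chicago.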

These downsampled data can be accessed from our OSF repository at https://osf.io/kt67f/ (DOI 10.17605/OSF.IO/KT67F) in the following formats (Supplementary Table 2):

  • FASTQ files for testing alignment programs, such as Hi-C User Pipeline (HiCUP)

  • BAM files for testing bam2chicago.sh

  • CHiCAGO input (.chinput) files for running CHiCAGO

  • Processed CHiCAGO Rds files for testing functions within CHiCAGO

In addition to the downsampled data accompanying this protocol, further examples of CHi-C datasets can be accessed through the PCHiCdata R package as described in the CHiCAGO vignette (https://www.bioconductor.org/packages/release/bioc/vignettes/Chicago/inst/doc/Chicago.html) or downloaded directly from Bitbucket (https://bitbucket.org/chicagoTeam/chicago/src/master/PCHiCdata/inst/extdata/).

Installing software

To install Chicago, make sure that you have R v3.3.1 or above. The scripts from the chicagoTools suite also require additional dependencies: bedtools (v2.25), Perl and Python 2.7 or above (a developmental version of makeDesignFiles.py compatible with Python 3 is also available in the chicagoTools suite). The dependencies need to be preinstalled and added to $PATH. Additionally, when running Chicago from the command line, the R package argparser is required, which can be installed with the following R code:

install.packages("argparser")

An easy way to install the Chicago R package is through Bioconductor, by running the following code:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Chicago")

or via Anaconda by running one of the following commands:

conda install -c bioconda bioconductor-chicago
conda install -c bioconda/label/gcc7 bioconductor-chicago
conda install -c bioconda/label/cf201901 bioconductor-chicago

However, Bioconductor releases happen only twice a year; more recent versions of R packages can be found at https://bitbucket.org/chicagoTeam/. Chicago can be downloaded directly from Bitbucket, using the functionality in devtools, by running the following code:

install.packages("devtools")
library(devtools)
install_bitbucket("chicagoTeam/Chicago", subdir="Chicago")
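Because the chicagoTools scripts expect their dependencies to be on $PATH, it can be useful to confirm this before starting. The following is a minimal sketch in Python, not part of chicagoTools; the executable names in REQUIRED_TOOLS are assumptions and may differ between systems (e.g., python versus python3).

```python
import shutil

# Executables named in this protocol; the exact binary names on a given
# system (e.g. "python" vs "python3") are an assumption.
REQUIRED_TOOLS = ["Rscript", "perl", "python", "bedtools"]


def find_missing_tools(tools, which=shutil.which):
    """Return the subset of `tools` that cannot be found on $PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    print("Missing from $PATH:", ", ".join(missing) if missing else "none")
```

Note that this check only reports whether each executable can be found; the bedtools version (v2.25) still needs to be verified separately.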

Procedure

Section 1: preprocessing: QC and alignment of raw sequencing data ● Timing Several hours to days, depending on the number of CPUs, available RAM and the sequencing depth

▲ CRITICAL For testing purposes, these steps can be skipped by instead downloading the downstream CHiCAGO input (.chinput) files from OSF (https://osf.io/wsc69/).

▲ CRITICAL Prior to running CHiCAGO, sequence data in the form of paired end FASTQ files must first undergo alignment and filtering for valid Hi-C di-tags. Here, we describe how to do this using HiCUP19 (see Box 3 for further information on the HiCUP pipeline).

  • 1

    Refer to the HiCUP guidelines to install the program, create aligner indices and generate the HindIII restriction digest of the GRCh37 genome https://www.bioinformatics.babraham.ac.uk/projects/hicup/read_the_docs/html/index.html

  • 2

    Download the raw FASTQ files for read 1 and read 2 for each sample from the OSF repository; e.g., for MyLa CHi-C replicate 1, navigate to the OSF directory https://osf.io/xm9an/ and download the files MyLa_rep1_CHiC_DS20M_R1.fastq.gz and MyLa_rep1_CHiC_DS20M_R2.fastq.gz

    ! CAUTION These files are ~900 MB each.

  • 3

    Generate a config file as described in the HiCUP documentation.

    In the config file, we have used the following optional parameters:
    Keep:1
    Longest: 800
    Shortest: 50
  • 4

    Run the HiCUP pipeline.

    -bash$ hicup --config MyLa1.config

    The HiCUP pipeline generates a QC report in html format as shown for MyLa replicate 1 in Extended Data Fig. 2.

  • 5

    To determine the capture efficiency, use the get_captured_reads script, which is provided with the HiCUP pipeline. To run the script, use the coordinates of the bait fragments, which were used for capture. For demonstration purposes, download the CHiCAGO design files from https://osf.io/sx7fu/, and place them into a directory called ‘designDir’.

  • 6
    Modify the .baitmap file such that it is in the tab-delimited format: chromosome, start, end:
    -bash$ cd designDir
    -bash$ cut -f 1-3 HindIII_GWAS_sharedChr.baitmap > HindIII_GWAS_baits.txt
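
    For users who prefer to script this step, the column selection can be sketched in Python as below. This is an illustrative equivalent of the cut command, not a chicagoTools utility. The second example row (fragment ID and bait name) is made up for demonstration, and the five-column baitmap layout is assumed.

```python
import csv
import io


def baitmap_to_bait_coords(baitmap_lines):
    """Keep chromosome, start and end (columns 1-3) of each baitmap row."""
    reader = csv.reader(baitmap_lines, delimiter="\t")
    return [row[:3] for row in reader if row]


# Example rows in an assumed five-column baitmap layout:
# chromosome, start, end, fragment ID, bait annotation.
# The second row uses a hypothetical fragment ID and name.
example = io.StringIO(
    "7\t18756496\t18765328\t666933\trs1178121\n"
    "20\t119103\t138049\t123456\texampleBait\n"
)
coords = baitmap_to_bait_coords(example)
for chrom, start, end in coords:
    print(chrom, start, end, sep="\t")
```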
  • 7
    Run the Perl script using the hicup.bam file as input:
    -bash$ perl get_captured_reads --baits designDir/HindIII_GWAS_baits.txt MyLa_rep1_CHiC_DS20M_R1_2.hicup.bam
  • 8

    Check the output summary report file capture_summary.txt

    ! CAUTION Here, the proportion of on-target di-tags (~13–16% across the four available samples) appears lower than it really is, because we have significantly filtered the baitmap.

  • 9
    Convert the hicup.bam file to the CHiCAGO input format using bam2chicago.sh within chicagoTools:
    -bash$ bam2chicago.sh MyLa_rep1_CHiC_DS20M_R1_2.hicup.bam \
    designDir/HindIII_GWAS_sharedChr.baitmap \
    designDir/GRCh37_HindIII_sharedChr.rmap \
    MyLa_rep1_CHiC_DS20M_HindIII_GWAS_sharedChr \
    nodelete

    ▲ CRITICAL STEP Make sure that the bedtools version in use is v2.25. When ‘nodelete’ is specified, bam2chicago.sh retains the intermediate .bedpe file, a paired-end bed file that lists all the read pairs corresponding to bait-to-bait interactions. This file is not used for the downstream CHiCAGO analysis, but it can be useful to analyze bait-to-bait interactions with any Hi-C normalization/interaction calling tool in addition to CHiCAGO, if needed.

Section 2: generation of CHiCAGO design files ● Timing 5 min (test dataset) to ~2 h (full data)

▲ CRITICAL The input files and parameters used to generate CHiCAGO design files are described in Box 4.

Box 4. Creating design files.

CHiCAGO requires five ‘design files’ that depend on the specific genome and restriction enzyme used: the restriction map file (.rmap) and the list of baited restriction fragments (.baitmap), as well as three auxiliary files termed ‘NPerBin’ (.npb), ‘NBaitsPerBin’ (.nbpb) and ‘proxOE’ (.poe). The restriction map and bait map files must be provided by the user (see Materials), while the three auxiliary files can be generated using the makeDesignFiles.py script provided as part of the chicagoTools suite.

A typical run is executed in the following way:

python makeDesignFiles.py --rmapfile --baitmapfile --minFragLen --maxFragLen

or

python makeDesignFiles.py --designDir --minFragLen --maxFragLen

The full set of parameters for makeDesignFiles.py is as follows:

python makeDesignFiles.py [--designDir=.] [--rmapfile=designDir/*.rmap] [--baitmapfile=designDir/*.baitmap] [--outfilePrefix=designDir/] --minFragLen= --maxFragLen= [--maxLBrownEst=] [--binsize=] [--removeb2b=True] [--removeAdjacent=True]
Parameter       Description
rmapfile        Full path to .rmap design file
baitmapfile     Full path to .baitmap design file
outfilePrefix   Output file prefix, including path. Default: same name as the .rmap file and the extension specific to each of the three file types: .npb, .nbpb and .poe, respectively
designDir       Full path to directory containing .rmap and .baitmap design files
binsize         Bin size for pooling restriction fragments (in bps). Recommended 1,500 for four-cutter enzymes or 20,000 for six-cutter enzymes
minFragLen      Minimum fragment length cutoff. Recommended 75 for four-cutter enzymes or 150 for six-cutter enzymes. No default
maxFragLen      Maximum fragment length cutoff. Recommended 1,200 for four-cutter enzymes or 40,000 for six-cutter enzymes. No default
maxLBrownEst    Proximal distance range for estimating Brownian noise. Recommended 75,000 for four-cutter enzymes or 1,500,000 for six-cutter enzymes
removeb2b       Boolean variable indicating whether bait-to-bait interactions should be excluded when counting the total number of fragments at a given distance. Default: True (strongly recommended)
removeAdjacent  Boolean variable indicating whether fragments immediately adjacent to bait should be discarded. Default: True (this is strongly recommended, except in the case where artificial restriction fragments are used)
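
As an illustration of how the parameters above fit together, the sketch below assembles a makeDesignFiles.py call for a six-cutter enzyme such as HindIII, using the six-cutter values recommended in the parameter table. The helper function and dictionary names are hypothetical; only the command-line flags come from the table.

```python
import shlex

# Six-cutter (e.g. HindIII) values recommended in the parameter table above.
SIX_CUTTER = {
    "minFragLen": 150,
    "maxFragLen": 40000,
    "maxLBrownEst": 1500000,
    "binsize": 20000,
}


def make_design_command(design_dir, outfile_prefix, params=SIX_CUTTER):
    """Assemble a makeDesignFiles.py call as a list of arguments."""
    cmd = [
        "python",
        "makeDesignFiles.py",
        f"--designDir={design_dir}",
        f"--outfilePrefix={outfile_prefix}",
    ]
    cmd += [f"--{name}={value}" for name, value in params.items()]
    return cmd


cmd = make_design_command(".", "HindIII_GWAS_sharedChr")
print(shlex.join(cmd))  # print only; nothing is executed here
```

The list form can be passed to subprocess.run() if the command is to be executed from Python rather than copied into a shell.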
  • 10

    Download the .rmap and .baitmap files for the sample data from https://osf.io/sx7fu/.

  • 11

    Place the .rmap and .baitmap files into a directory called ‘designDir’.

  • 12

    Run makeDesignFiles.py within chicagoTools:

    -bash$ cd designDir
    -bash$ python makeDesignFiles.py --designDir=. \
    --minFragLen=150 \
    --maxFragLen=40000 \
    --outfilePrefix=HindIII_GWAS_sharedChr
    -bash$ ls
    GRCh37_HindIII_sharedChr.rmap HindIII_GWAS_sharedChr.npb
    HindIII_GWAS_sharedChr.baitmap HindIII_GWAS_sharedChr.poe
    HindIII_GWAS_sharedChr.nbpb
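
    Because CHiCAGO expects exactly one file of each design type in the design folder (see the design-dir entry in Table 2), a quick check of the directory contents can prevent confusing errors later. The following Python sketch is an illustrative helper, not part of chicagoTools; it recreates the listing above in a temporary directory for demonstration.

```python
from pathlib import Path
import tempfile

DESIGN_EXTENSIONS = (".rmap", ".baitmap", ".npb", ".nbpb", ".poe")


def check_design_dir(design_dir):
    """Return (found, problems): files per extension, and extensions
    that do not have exactly one matching file."""
    files = [p for p in Path(design_dir).iterdir() if p.is_file()]
    found = {
        ext: sorted(p.name for p in files if p.suffix == ext)
        for ext in DESIGN_EXTENSIONS
    }
    problems = {ext: names for ext, names in found.items() if len(names) != 1}
    return found, problems


# Demonstration on empty placeholder files named as in the listing above.
with tempfile.TemporaryDirectory() as tmp:
    for name in [
        "GRCh37_HindIII_sharedChr.rmap",
        "HindIII_GWAS_sharedChr.baitmap",
        "HindIII_GWAS_sharedChr.npb",
        "HindIII_GWAS_sharedChr.nbpb",
        "HindIII_GWAS_sharedChr.poe",
    ]:
        (Path(tmp) / name).touch()
    found, problems = check_design_dir(tmp)

print("Problems:", problems if problems else "none")
```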

Section 3 (optional): preparing feature files ● Timing 5 min

▲ CRITICAL Files with genome coordinates of chromatin features such as histone modifications can be used to benchmark CHiCAGO peaks using the optional peakEnrichment4Features() function (see Supplementary Table 1 for more details). If data on relevant chromatin features are not available for the exact cell type used, those from another broadly related cell type can be used instead for QC purposes.

  • 13

    Download the files in the Feature Enrichment folder on OSF: https://osf.io/ajtpv/, and move them into a directory called ‘features’.

  • 14

    Check the contents of the directory. We have included bed files from ENCODE and Roadmap Epigenomics Project with peaks for H3K4me1, H3K27ac and H3K27me3. The data come from Normal Human Epidermal Keratinocytes (NHEK), to match against HaCaT cells, and CD8+ T cells (Roadmap sample E047), to match against MyLa cells. The tab-delimited bed files take the format: chromosome, start, end.

    -bash$ cd features
    -bash$ head -n 5 E047_CD8_H3K27ac_hg19.bed
    chr1    712798    714780
    chr1    761390    763337
    chr1    776457    780929
    chr1    892923    895966
    chr1    935677    936822

    The all_feat.txt file has two tab-delimited columns in the format: feature name, feature file.

    ▲ CRITICAL STEP peakEnrichment4Features defaults to the current working directory to find the feature bed files unless the full paths are specified in the second column of the feat file.

  • 15

    Generate a list of the relevant files for MyLa and HaCaT.

    -bash$ head all_feat.txt
      H3K4me1_CD8 E047_CD8_H3K4me1_hg19.bed
      H3K27ac_CD8 E047_CD8_H3K27ac_hg19.bed
      H3K27me3_CD8 E047_CD8_H3K27me3_hg19.bed
      H3K4me1_KC ENCFF898SZF_NHEK_H3K4me1_hg19.bed
      H4K27ac_KC ENCFF943CBQ_NHEK_H3K27ac_hg19.bed
      H3K27me3_KC ENCFF151HKM_NHEK_H3K27me3_hg19.bed
    -bash$ grep CD8 all_feat.txt > myla_feat.txt
    -bash$ head myla_feat.txt
      H3K4me1_CD8 E047_CD8_H3K4me1_hg19.bed
      H3K27ac_CD8 E047_CD8_H3K27ac_hg19.bed
      H3K27me3_CD8 E047_CD8_H3K27me3_hg19.bed
    -bash$ grep NHEK all_feat.txt > hacat_feat.txt
    -bash$ head hacat_feat.txt
      H3K4me1_KC ENCFF898SZF_NHEK_H3K4me1_hg19.bed
      H4K27ac_KC ENCFF943CBQ_NHEK_H3K27ac_hg19.bed
      H3K27me3_KC ENCFF151HKM_NHEK_H3K27me3_hg19.bed
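
    The same selection can be sketched in Python, which also makes it easy to confirm that each row has exactly two tab-delimited columns. This is an illustrative alternative to the grep commands above; the function names are hypothetical, and only a subset of the rows of all_feat.txt is reproduced.

```python
def select_rows(lines, pattern):
    """Return the rows of a feature list whose text contains `pattern`."""
    return [line for line in lines if pattern in line]


def parse_feature_list(lines):
    """Parse 'feature name<TAB>feature file' rows into a dict."""
    features = {}
    for line in lines:
        # Raises ValueError if a row does not have exactly two columns.
        name, bed_file = line.rstrip("\n").split("\t")
        features[name] = bed_file
    return features


# A subset of the rows of all_feat.txt shown above.
ALL_FEAT = [
    "H3K4me1_CD8\tE047_CD8_H3K4me1_hg19.bed",
    "H3K27ac_CD8\tE047_CD8_H3K27ac_hg19.bed",
    "H3K27me3_CD8\tE047_CD8_H3K27me3_hg19.bed",
    "H3K4me1_KC\tENCFF898SZF_NHEK_H3K4me1_hg19.bed",
    "H3K27me3_KC\tENCFF151HKM_NHEK_H3K27me3_hg19.bed",
]

myla = parse_feature_list(select_rows(ALL_FEAT, "CD8"))
hacat = parse_feature_list(select_rows(ALL_FEAT, "NHEK"))
print(sorted(myla))
print(sorted(hacat))
```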

Section 4: running CHiCAGO ● Timing 20 min (test dataset) to ~2 h per sample (full data)

▲ CRITICAL CHiCAGO can be run either step-by-step within R (Step 17A) or in a single step using the wrapper script, runChicago.R (Step 17B). In practice, the computational requirements for full datasets may necessitate running CHiCAGO noninteractively using runChicago.R. However, for training purposes, we recommend following the step-by-step analysis to gain a better understanding of the core CHiCAGO processes.

  • 16
    Prepare the following directories:
    • “designDir”: contains the .rmap, .baitmap, .npb, .nbpb and .poe files
    • “chinput_files”: contains the .chinput files (generated as in Section 1 (Steps 1–9) or obtained from OSF https://osf.io/wsc69/)
    • “features” (optional): contains the feature bed files and associated text file(s) (see Section 3).
  • 17
    Run CHiCAGO either step-by-step in R (option A) or using runChicago.R (option B):
    • (A)
      Step-by-step execution in R
      • (i)
        Load Chicago, and specify data locations showing full paths to designDir and chinput_files by modifying ‘your_full_path_to’ in the following R code:
        > library(Chicago)
        > testDesignDir <- file.path("your_full_path_to/designDir")
        > dir(testDesignDir)
        [1] "GRCh37_HindIII_sharedChr.rmap"  "HindIII_GWAS_sharedChr.baitmap"
        [3] "HindIII_GWAS_sharedChr.nbpb"    "HindIII_GWAS_sharedChr.npb"
        [5] "HindIII_GWAS_sharedChr.poe"
        > testDataPath <- file.path("your_full_path_to/chinput_files")
        > dir(testDataPath)
        [1] "HaCaT_unst_rep1_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput"
        [2] "HaCaT_unst_rep2_CHiC_DS40M_HindIII_GWAS_sharedChr.chinput"
        [3] "MyLa_rep1_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput"
        [4] "MyLa_rep2_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput"
        > hacat_files <- c(
         file.path(testDataPath, "HaCaT_unst_rep1_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput"),
         file.path(testDataPath, "HaCaT_unst_rep2_CHiC_DS40M_HindIII_GWAS_sharedChr.chinput"))
        > myla_files <- c(
         file.path(testDataPath, "MyLa_rep1_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput"),
         file.path(testDataPath, "MyLa_rep2_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput"))
        
      • (ii)
        (Optional) Specify the path to the settings file.
        These settings are modified from the defaults for running CHiCAGO on downsampled data. The settings in DS_settings.txt are the same as in the vignette (https://www.bioconductor.org/packages/release/bioc/vignettes/Chicago/inst/doc/Chicago.html) and can also be obtained from https://osf.io/b9p3v/.
        > settingsFile <- file.path("your_full_path_to/DS_settings.txt")
      • (iii)
        Create a new chicagoData object for each cell line, specifying where the design and settings files are.
        > cd_hac <- setExperiment(designDir = testDesignDir, settingsFile = settingsFile)
        > cd_myl <- setExperiment(designDir = testDesignDir, settingsFile = settingsFile)
      • (iv)
        Read in the input data files for each replicate per cell line. CHiCAGO will use information from each replicate.
        > cd_hac <- readAndMerge(files = hacat_files, cd = cd_hac)
        > cd_myl <- readAndMerge(files = myla_files, cd = cd_myl)
      • (v)
        Run the pipeline with the chicagoPipeline() function:
        > cd_hac <- chicagoPipeline(cd_hac)
        > cd_myl <- chicagoPipeline(cd_myl)
        The pipeline produces a series of diagnostic plots, which are described in detail in ‘Assessment of the results’. The example plots for downsampled MyLa are shown in Extended Data Fig. 3a–c.
      • (vi)
        Use the slot ‘x’ of the chicagoData object, which is a data.table containing fragment pair information, to obtain all interactions with CHiCAGO score over a chosen threshold (here 5):
        > int <- cd_hac@x
        > sig_int <- int[score >= 5]
        > head(sig_int, 2)
           baitID otherEndID           distbin     s_j otherEndLen
        1: 181259     181294 (1.6e+05,1.8e+05] 1.10723       14789
        2: 181259     181302   (1.8e+05,2e+05] 1.10723        1244
           distSign isBait2bait N.1 N.2  N refBinMean       s_i NNb
        1:   171410       FALSE  12   8 10   2.509301 0.8603894   9
        2:   199327       FALSE   4  13  9   2.290855 0.8603894   8
           NNboe   Tlb       Tblb       Tmean    Bmean     log.p    log.w
        1:    10 [0,1] [164, 231) 0.001068976 2.422647 -6.460730 6.103982
        2:     9 [0,1] [164, 231) 0.001068976 2.175203 -6.106312 6.080273
               log.q    score
        1: -12.56471 6.409173
        2: -12.18658 6.031045
        The chicagoData object also contains the following slots: ‘settings’, a list of settings defined by setExperiment(), and ‘params’, a list of parameters that are populated as the pipeline runs and estimated by CHiCAGO in turn.
      • (vii)
        Save the chicagoData object in the RDS format as follows:
        > outputDir_hacat <- "your_full_path_to/hacat_reps_results"
        > dir.create(outputDir_hacat)
        > saveRDS(cd_hac, file.path(outputDir_hacat,"hacat_reps_Chicago.Rds"))
        > outputDir_myla <- "your_full_path_to/myla_reps_results"
        > dir.create(outputDir_myla)
        > saveRDS(cd_myl, file.path(outputDir_myla, "myla_reps_Chicago.Rds"))
      • (viii)
        Export interaction calls exceeding a user-specified score cutoff (by default, 5) in human- and genome browser-readable formats. The exportResults() function will output files in ‘seqMonk’, ‘interBed’ and ‘washU_text’ formats:
        > exportResults(cd_hac, file.path(outputDir_hacat, "hacat_reps_Chicago"))
        > exportResults(cd_myl, file.path(outputDir_myla, "myla_reps_Chicago"))
        The output ‘washU_text’ file can be uploaded to the WashU Epigenome Browser to explore interactions by eye (as arcs, for example). See Box 5 for instructions on how to do this.
      • (ix)
        Plot raw read counts versus linear distance from one example bait. If specific baits are not specified, the function will plot profiles for 16 random baits.
        > plottedBaits_myla <- plotBaits(cd_myl, baits = 670997, plotBprof = TRUE)
        > plottedBaits_hacat <- plotBaits(cd_hac, baits = 670997, plotBprof = TRUE)
        See Extended Data Fig. 3d for how this plot should look in both cell lines.
        ▲ CRITICAL STEP Use plotBaits() with plotBprof=TRUE to plot the expected counts (solid gray line) and the upper bound of their 95% confidence interval (dashed gray line).
      • (x)
        (Optional): use the function peakEnrichment4Features() to test the hypothesis that other ends in the CHiCAGO output are enriched for genomic features of interest. Here, we test histone marks in matched cell types to HaCaT and MyLa. To obtain the relevant files, first follow ‘Section 3 (optional): preparing feature files’.
        > featuresFolder <- file.path("your_full_path_to/features")
        > dir(featuresFolder)
        [1] "all_feat.txt"                       "E047_CD8_H3K27ac_hg19.bed"
        [3] "E047_CD8_H3K27me3_hg19.bed"         "E047_CD8_H3K4me1_hg19.bed"
        [5] "ENCFF151HKM_NHEK_H3K27me3_hg19.bed" "ENCFF898SZF_NHEK_H3K4me1_hg19.bed"
        [7] "ENCFF943CBQ_NHEK_H3K27ac_hg19.bed"  "hacat_feat.txt"
        [9] "myla_feat.txt"
        > featuresFile_myla <- file.path(featuresFolder, "myla_feat.txt")
        > featuresTable_myla <- read.delim(featuresFile_myla, header=FALSE, as.is=TRUE)
        > featuresList_myla <- as.list(featuresTable_myla$V2)
        > names(featuresList_myla) <- featuresTable_myla$V1
        > featuresList_myla
        $H3K4me1_CD8
        [1] "E047_CD8_H3K4me1_hg19.bed"
        $H3K27ac_CD8
        [1] "E047_CD8_H3K27ac_hg19.bed"
        $H3K27me3_CD8
        [1] "E047_CD8_H3K27me3_hg19.bed"
        Use the following command to specify the number of distance bins for the analysis, ensuring that the bin size is ~10 kb.
        > no_bins_myla <- ceiling(max(abs(cd_myl@x$distSign), na.rm = TRUE)/1e4)
        > enrichmentResults_myla <- peakEnrichment4Features(cd_myl, folder=featuresFolder, list_frag=featuresList_myla, no_bins=no_bins_myla, sample_number=100)
        The resultant plots for HaCaT and MyLa are shown in Extended Data Fig. 3e. For HaCaT cells, use ‘hacat_feat.txt’ as the featuresFile.
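
        The bin-number calculation in this step is simple arithmetic: the largest absolute bait-to-other-end distance is divided by the desired bin size (~10 kb) and rounded up. A minimal Python sketch of the same calculation is given below, using the two distSign values from the example output in Step 17A(vi); None stands in for missing values (NA in R), which are ignored as with na.rm.

```python
import math


def n_distance_bins(dist_sign_values, bin_size=10_000):
    """Number of bins of `bin_size` bp needed to cover the largest |distSign|."""
    distances = [abs(d) for d in dist_sign_values if d is not None]
    return math.ceil(max(distances) / bin_size)


# distSign values from the example output in Step 17A(vi), plus one missing value.
print(n_distance_bins([171410, 199327, None]))  # 199327 / 10000 = 19.93, rounded up to 20
```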
    • (B)
      Using a wrapper script
      • (i)
        As an alternative to running Chicago step-by-step within R, the wrapper script runChicago.R (available from chicagoTools) can be used to perform a typical CHiCAGO analysis with a single command (Fig. 2 and Table 2).
        Provided the design files are generated as described in Section 2 (Steps 10–12), runChicago.R performs the following steps:
        • Creates the chicagoData object given the design folder and, if needed, applies custom settings using the setExperiment() function
        • Reads in the input file(s) and merges replicates if necessary using the readSample() or readAndMerge() functions
        • Runs interaction calling using the chicagoPipeline() function
        • Saves the full chicagoData object in a file in Rds or RDa format
        • Exports significant interactions in a genome browser-readable format using the exportResults() function
        • Plots interaction profiles of multiple random baits using the plotBaits() function
        • Estimates the enrichment of significant interactions for user-specified genomic features versus distance-matched controls using the peakEnrichment4Features() function
        • Saves the settings used for running Chicago and the input parameters of the script itself to a text file
        • Sorts output files into a directory tree with the following subfolders: data, diag_plots, enrichment_data and examples
        A typical run is executed in the following way:
        Rscript runChicago.R --design-dir [DESIGN-DIR] --en-feat-list [EN-FEAT-LIST] <input-files> <output-prefix>

Box 5. Visualizing CHiCAGO interaction calls in the WashU browser.

The Chicago output data folder contains the washU_text.txt that can be read by WashU Epigenome Browser58,59 for visualizing called interactions. The file is currently generated in the format supported by the legacy version of the browser (http://epigenomegateway.wustl.edu/legacy/):

chr20,119103,138049 chr20,161620,170741 5.08
chr20,119103,138049 chr20,523682,536237 6.79
chr20,161620,170741 chr20,73978,76092   5.13
chr20,233983,239479 chr20,206075,209203 5.99

with bait coordinates in the first column, and other-end coordinates and corresponding interaction scores in the second and third columns, respectively. The file can be uploaded for visualization by choosing ‘File Upload’ from the Apps dropdown menu. The track should then be set up as a ‘Pairwise interaction’ format. While the interactions can be viewed as arcs, it is also useful to set the view to ‘Full’ (by right clicking on the lefthand panel). In this configuration, the full restriction fragments can be observed.

The ‘new’ WashU Epigenome Browser released in 2018 (https://epigenomegateway.wustl.edu), however, supports a different format for the visualization of interaction data:

chr20 119103 138049 chr20:161620-170741,5.08
chr20 119103 138049 chr20:523682-536237,6.79
chr20 161620 170741 chr20:73978-76092,5.13
chr20 233983 239479 chr20:206075-209203,5.99

In this case, the first three columns represent bait coordinates, and the fourth column contains information on other-end coordinates and corresponding interaction scores. Additional information on the available custom track file formats can be found at https://eg.readthedocs.io/en/latest/text.html.

We provide a script to reformat the output WashU text file produced by the Chicago wrapper from the ‘old’ WashU to the ‘new’ WashU format: washuOld2New.R. The script can be run in the following manner from the command line:

Rscript washuOld2New.R --washuFile [--outputDir] [--outputName]

By choosing Text Tracks from the Tracks dropdown menu, a long-range interaction text file can be uploaded by selecting ‘long range text’ as text type file from the ‘Choose text type file’ dropdown menu for interaction data visualization. The browser screenshots in the SVG or PDF format can be obtained by clicking Apps/Screenshot, and the data can be saved in a session using Apps/Session.
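
To make the relationship between the two WashU text formats concrete, the sketch below converts the legacy lines shown in this box into the newer layout. It is an illustration of the format change only and is not the washuOld2New.R script; the use of tabs between the first four fields of the output is an assumption.

```python
def washu_old_to_new(line):
    """Convert one legacy WashU interaction line to the newer text format."""
    bait, other_end, score = line.split()
    bait_chr, bait_start, bait_end = bait.split(",")
    oe_chr, oe_start, oe_end = other_end.split(",")
    return (
        f"{bait_chr}\t{bait_start}\t{bait_end}\t"
        f"{oe_chr}:{oe_start}-{oe_end},{score}"
    )


# Legacy-format lines from the example in this box.
legacy_lines = [
    "chr20,119103,138049 chr20,161620,170741 5.08",
    "chr20,119103,138049 chr20,523682,536237 6.79",
    "chr20,161620,170741 chr20,73978,76092 5.13",
    "chr20,233983,239479 chr20,206075,209203 5.99",
]
new_lines = [washu_old_to_new(line) for line in legacy_lines]
print("\n".join(new_lines))
```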

Table 2. Parameters of the runChicago.R wrapper.

Parameter            Description
cutoff               Score cutoff for significant interactions. Default: 5
design-dir           Full path to the folder containing baitmap, rmap, nperbin, nbaitsperbin and proxoe files with the corresponding extensions: .baitmap, .rmap, .npb, .nbpb and .poe. Only one file of each type should be included. The option defaults to the current directory
en-feat-files        Comma-separated list of files with genomic feature coordinates for computing peaks’ enrichment. Alternative to the --en-feat-list option
en-feat-folder       Folder containing all feature files. If provided, --en-feat-files does not need to list the full paths
en-feat-list         Full path to file listing the feature files to be used. Alternative to --en-feat-files. File should be a tab-separated file with two columns: <feature-name> <feature-bed-file-location>
en-full-cis-range    Assess the enrichment for features for the full distance range (same chromosome only). Can be very slow
en-max-dist          Upper distance limit for computing enrichment for features. Default: 1 Mb
en-min-dist          Lower distance limit for computing enrichment for features. Default: 0
en-sample-no         Number of control samples, over which to compute the enrichment for features. Default: 100
en-trans             Include trans-interactions in the enrichment for features computation
examples-full-range  Plot interactions for the full range of distances from bait. These plots will appear in addition to plotting interactions within 1 Mb
examples-prox-dist   Distance limit for plotting interactions for a set of bait examples. Default: 1 Mb
export-format        File format for writing out peaks. It must be one or more of the following: seqMonk, interBed, washU_text, washU_track (comma-separated). Default: washU_text
export-order         Specify whether results should be ordered by ‘score’ or by genomic ‘position’. Default: position
features-only        Rerun feature enrichment analysis with CHiCAGO output files. With this option, <input-files> must be either a single Rds file containing full CHiCAGO objects or ‘-’, in which case the file location will be inferred automatically from <output-prefix> and files added to the corresponding folder
help                 Print a help message
print-memory         Print memory use during the execution of the chicagoPipeline() function
rda                  Save the chicagoData object in a file with an RDa format (under the name cd). Otherwise, the output will be an Rds file
save-df-only         Only save the data table part of the chicagoData object (cd@x) as a data frame. This will discard the @params and @settings slots
settings-file        Full path to the settings file to be loaded into the @settings slot of the chicagoData object

The input-files should be replaced with one or multiple comma-separated .chinput files. Multiple files must correspond to different biological replicates as technical replicates should instead be deduplicated and merged into a single .chinput file.

The output-prefix should be replaced with the header for both output folder and output file names.

The full list of parameters is provided in Table 2.

Use the wrapper to perform a CHiCAGO analysis of the test data, as demonstrated below for two replicates:

-bash$ Rscript runChicago.R --design-dir your_full_path_to/designDir \
--settings-file your_full_path_to/DS_settings.txt \
--en-feat-list your_full_path_to/myla_feat.txt \
your_full_path_to/MyLa_rep1_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput,your_full_path_to/MyLa_rep2_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput \
MyLa_DS_reps_HindIII_GWAS_sharedChr

▲ CRITICAL STEP When checking for feature enrichment of other ends using the runChicago.R wrapper, the full paths to feature files must be provided in the second column of the en-feat-list txt file, unless they are in the working directory. See Table 2 for a complete list of available options.
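
When many samples are processed, it may be convenient to assemble the wrapper call programmatically, in particular the comma-separated list of replicate .chinput files. The sketch below is a hypothetical helper that only builds the argument list; the flags are those used in the example command above, and nothing is executed.

```python
def run_chicago_command(design_dir, chinput_files, output_prefix,
                        settings_file=None, feat_list=None):
    """Assemble a runChicago.R call; replicates are joined with commas."""
    cmd = ["Rscript", "runChicago.R", "--design-dir", design_dir]
    if settings_file:
        cmd += ["--settings-file", settings_file]
    if feat_list:
        cmd += ["--en-feat-list", feat_list]
    cmd.append(",".join(chinput_files))  # <input-files>
    cmd.append(output_prefix)            # <output-prefix>
    return cmd


myla_cmd = run_chicago_command(
    "your_full_path_to/designDir",
    [
        "your_full_path_to/MyLa_rep1_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput",
        "your_full_path_to/MyLa_rep2_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput",
    ],
    "MyLa_DS_reps_HindIII_GWAS_sharedChr",
    settings_file="your_full_path_to/DS_settings.txt",
    feat_list="your_full_path_to/myla_feat.txt",
)
print(" ".join(myla_cmd))
```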

Section 5 (optional): summarizing interaction calls ● Timing 5 min (test dataset) to ~1 h (full data), depending on the number of baited fragments and the choice of score cutoff

  • 18

    Run the CHiCAGO pipeline on each sample separately (HaCaT replicate 1, HaCaT replicate 2, MyLa replicate 1 and MyLa replicate 2). This can either be done step-by-step in R, saving the Rds files, or using the runChicago.R wrapper. For each replicate, the wrapper would look something like:

    -bash$ Rscript runChicago.R --design-dir your_full_path_to/designDir \
    --settings-file your_full_path_to/DS_settings.txt \
    your_full_path_to/HaCaT_unst_rep1_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput \
    HaCaT_rep1_DS20M_HindIII_GWAS_sharedChr

    Alternatively, the Rds file for each downsampled replicate can be obtained from the OSF repository at https://osf.io/b9p3v/. Please note that the ‘settings’ slot must be edited to show the full paths to the files in the designDir.

  • 19

    Make names_file.txt: a tab-delimited file where the first column contains chosen sample names (e.g. HaCaT1, HaCaT2, MyLa1, MyLa2) and the second column corresponds to the full paths to the CHiCAGO Rds file for each replicate, as generated in the previous step.
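
    The names file can be written by hand or generated with a few lines of code. The Python sketch below writes the two tab-delimited columns to an in-memory buffer for demonstration; the Rds file names are hypothetical placeholders and should be replaced with the paths to the files produced in the previous step.

```python
import csv
import io


def write_names_file(handle, samples):
    """Write 'sample name<TAB>path to Rds file' rows to an open file handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for sample_name, rds_path in samples:
        writer.writerow([sample_name, rds_path])


# Hypothetical Rds paths; substitute the real per-replicate files.
samples = [
    ("HaCaT1", "your_full_path_to/HaCaT_rep1.Rds"),
    ("HaCaT2", "your_full_path_to/HaCaT_rep2.Rds"),
    ("MyLa1", "your_full_path_to/MyLa_rep1.Rds"),
    ("MyLa2", "your_full_path_to/MyLa_rep2.Rds"),
]

buffer = io.StringIO()
write_names_file(buffer, samples)
names_text = buffer.getvalue()
print(names_text, end="")
```

    To create the actual file, the same function can be called with a handle from open("names_file.txt", "w", newline="").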

  • 20

    Run makePeakMatrix.R from chicagoTools:

    -bash$ Rscript makePeakMatrix.R \
    your_full_path_to/names_file.txt \
    hacat_myla_DS_peakMatrix
    -bash$ head -n 2 hacat_myla_DS_peakMatrix.txt
    baitChr baitStart baitEnd baitID baitName oeChr oeStart oeEnd oeID
    oeName dist HaCaT1 HaCaT2 MyLa1 MyLa2
    7 18756496 18765328 666933 rs1178121 7 18408301 18411877 666818 . -350823
    1.09936903030889 5.3583502527233 0.170773196740831 2.16829960330778

    The resultant txt file contains the peak matrix while the pdf file contains the sample dendrogram showing how the samples cluster by cell line (Extended Data Fig. 4a). See Box 6 for a complete list of available options.
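
    As a simple example of working with the peak matrix, the sketch below reads the header and the first row shown above and reports which samples reach a score cutoff of 5 for that interaction. It assumes that the file is tab-delimited and that the sample columns follow the 11 annotation columns (baitChr to dist), as in the example output; missing scores are assumed to be written as NA.

```python
import csv
import io

N_ANNOTATION_COLUMNS = 11  # baitChr ... dist, as in the header shown in Step 20


def samples_over_cutoff(row, sample_names, cutoff=5.0):
    """Return the samples whose score in this row is at or above `cutoff`."""
    passing = []
    for name in sample_names:
        value = row[name]
        if value not in ("NA", "") and float(value) >= cutoff:
            passing.append(name)
    return passing


# Header and first row copied from the example output in Step 20.
header = ("baitChr baitStart baitEnd baitID baitName oeChr oeStart oeEnd oeID "
          "oeName dist HaCaT1 HaCaT2 MyLa1 MyLa2").split()
values = ("7 18756496 18765328 666933 rs1178121 7 18408301 18411877 666818 . "
          "-350823 1.09936903030889 5.3583502527233 0.170773196740831 "
          "2.16829960330778").split()
peak_matrix = io.StringIO("\t".join(header) + "\n" + "\t".join(values) + "\n")

reader = csv.DictReader(peak_matrix, delimiter="\t")
sample_names = reader.fieldnames[N_ANNOTATION_COLUMNS:]
calls = [samples_over_cutoff(row, sample_names) for row in reader]
print(calls)
```

    In this row, only the second HaCaT replicate passes the cutoff, which illustrates why an interaction is included in the peak matrix if it passes the threshold in at least one sample.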

Box 6. Summarizing interaction calls from multiple CHi-C samples.

A convenient way to summarize CHiCAGO results from multiple samples is in the form of a ‘peak matrix’ that lists the coordinates, annotations and sample-wise scores or read counts for all interactions that pass a signal threshold in at least one sample. The peak matrix is useful for downstream analyses (e.g., Chicdiff, clustering by interaction and/or by sample type) and integration with other types of data. Peak matrices can be generated using makePeakMatrix.R script provided as part of the chicagoTools suite.

A typical run is executed in the following way:

Rscript makePeakMatrix.R [--twopass] <names-file> <output-prefix>

The full set of parameters for makePeakMatrix.R is as follows:

Rscript makePeakMatrix.R [--help] [--twopass] [--notrans] [--vanilla] [--rda] [--setzero] [--var VAR]
[--print-memory] [--scorecol SCORECOL] [--cutoff CUTOFF] [--fetchcol FETCHCOL] [--lessthan] [--maxdist
MAXDIST] [--digestmap DIGESTMAP] [--baitmap BAITMAP] [--peaklist PEAKLIST] [--clustmethod CLUSTMETHOD]
[--clustsubset CLUSTSUBSET]

The names-file should be replaced with the full path to a tab-separated file providing the sample names and the full paths to the Rds files: <sample names> <full paths to Rds files>.

The output-prefix should be replaced with the prefix to use for the output files.

Parameter: Description
baitmap: Full path to bait map ID file. It overrides any settings provided in chicagoData. Required for --vanilla
clustmethod: Clustering method to cluster columns (average/ward.D2/complete) (default: average)
clustsubset: Number of interactions to randomly subset for clustering. The full dataset is used if the total number of interactions in the peak matrix is below this number (default: 1e+06)
cutoff: Score cutoff
setzero: By default, interactions absent in a sample/condition are given score NA for that sample/condition. If flagged --setzero, these interactions will receive a score of 0 for the corresponding sample/condition
digestmap: Full path to digest map file. It overrides any settings provided in chicagoData. Required for --vanilla
fetchcol: Specify the column to collect information from as an alternative to scorecol. scorecol will still be used to threshold interactions. If fetchcol is different from scorecol, the --twopass mode is activated
help: Print a help message
lessthan: Select interactions with scorecol below the cutoff
maxdist: Maximum distance (in bps) from bait to be included
notrans: Discard transchromosomal interactions. Faster alternative to save memory compared with --twopass
peaklist: Specify predefined peak list (e.g., one generated by --twopass mode)
print-memory: Print memory info at each step
Rda: Load data from an RDa archive rather than the default Rds. The name of the variable in the RDa file containing the chicagoData object or the peak data frame must be specified using the --var option (see below)
scorecol: Specify column name with CHiCAGO scores (default: score)
twopass: Memory-efficient processing method that runs in two passes: (1) list all significant interactions as a union of peaks in each dataset; (2) reload and subset for these interactions prior to merging. Warning: this method might be slower
vanilla: Input file in the RDa/Rds format contains only data frames and not Chicago objects. The --digestmap and --baitmap options are required
var: Variable containing the chicagoData object or the peak data frame in the input file saved in the RDa format (default: x)

Output: Description
Peak matrix: Contains the coordinates, annotations and sample-wise scores or read counts for all interactions that pass a signal threshold in at least one sample
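
To make the logic of the peak matrix concrete, the sketch below builds a toy version in base R: it takes the union of interactions that reach the cutoff in at least one sample and reports one score column per sample, with absent interactions set to NA or, optionally, to 0. This is an illustration only and not the code of makePeakMatrix.R; the function name build_peak_matrix, the toy IDs and scores, and the use of "greater than or equal to" for the cutoff are assumptions.

```r
# Toy per-sample score tables (hypothetical fragment IDs and scores).
s1 <- data.frame(baitID = c(1, 1, 2), otherEndID = c(10, 11, 20),
                 score = c(7.2, 1.0, 3.0))
s2 <- data.frame(baitID = c(2, 2), otherEndID = c(20, 21),
                 score = c(6.1, 0.5))

# Hypothetical helper, not part of Chicago or chicagoTools.
build_peak_matrix <- function(samples, cutoff = 5, setzero = FALSE) {
  # Give every interaction a single bait/other-end key.
  keyed <- lapply(samples, function(d) {
    d$key <- paste(d$baitID, d$otherEndID, sep = "_")
    d
  })
  # Union of interactions reaching the cutoff in at least one sample.
  keep <- unique(unlist(lapply(keyed, function(d) d$key[d$score >= cutoff])))
  # One score column per sample; interactions absent from a sample become NA.
  scores <- vapply(keyed, function(d) d$score[match(keep, d$key)],
                   numeric(length(keep)))
  mat <- matrix(scores, nrow = length(keep), ncol = length(samples),
                dimnames = list(keep, names(samples)))
  # Optionally replace NA with 0, mirroring the idea behind --setzero.
  if (setzero) mat[is.na(mat)] <- 0
  mat
}

pm_na   <- build_peak_matrix(list(HaCaT = s1, MyLa = s2))
pm_zero <- build_peak_matrix(list(HaCaT = s1, MyLa = s2), setzero = TRUE)
print(pm_na)
print(pm_zero)
```

Note that subthreshold scores are retained for interactions that pass the cutoff in another sample, which is what makes the matrix useful for comparing samples.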

Section 6 (optional): downstream analyses

Differential interaction calling with Chicdiff ● Timing 5 min (on test data, once input files are obtained), ~1 h (on full data)

  • 21

    Download and install Chicdiff following the instructions on the GitHub repository (https://github.com/RegulatoryGenomicsGroup/chicdiff).

  • 22

    Run the CHiCAGO pipeline on each individual sample in the two cell types (or obtain the Rds files from OSF at https://osf.io/b9p3v/), and get the peak matrix as outlined in Section 5.

  • 23

    In R, load Chicdiff and define the location of the design files.

    > library(Chicdiff)
    > testDesignDir <- file.path("your_full_path_to/designDir")
  • 24

    Define the location of the peak matrix.

    > peakFiles <- file.path("your_full_path_to/hacat_myla_DS_peakMatrix.txt")
  • 25

    Define the location of the count data in .chinput format (one per replicate).

    > testDataPath <- file.path("your_full_path_to/chinput_files")
    > countData <- list(
      HaCaT = c(HaCaT1 = file.path(testDataPath, "HaCaT_unst_rep1_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput"),
      HaCaT2 = file.path(testDataPath, "HaCaT_unst_rep2_CHiC_DS40M_HindIII_GWAS_sharedChr.chinput")),
      MyLa = c(MyLa1 = file.path(testDataPath, "MyLa_rep1_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput"),
      MyLa2 = file.path(testDataPath, "MyLa_rep2_CHiC_DS20M_HindIII_GWAS_sharedChr.chinput")))

    ▲ CRITICAL STEP Sample names provided in the countData list must match those in the first column of the peak matrix.

  • 26

    Define the location of the CHiCAGO data files in .Rds format (one per replicate).

    > testDataPath_rds <- file.path("your_full_path_to/Rds_files")
    > chicagoData <- list(
     HaCaT = c(HaCaT1 = file.path(testDataPath_rds, "HaCaT_rep1_DS20M_HindIII_GWAS_sharedChr.Rds"),
     HaCaT2 = file.path(testDataPath_rds, "HaCaT_rep2_DS40M_HindIII_GWAS_sharedChr.Rds")),
     MyLa = c(MyLa1 = file.path(testDataPath_rds, "MyLa_rep1_DS20M_HindIII_GWAS_sharedChr.Rds"),
     MyLa2 = file.path(testDataPath_rds, "MyLa_rep2_DS20M_HindIII_GWAS_sharedChr.Rds")))

    ▲ CRITICAL STEP The sample names provided in the chicagoData list must match those in the first column of the peak matrix.

  • 27

    Now run the experiment using chicdiffPipeline().

    > resultsPath <- file.path("your_full_path_to/HaCaT_MyLa_Chicdiff_results")
    > dir.create(resultsPath)
    > setwd(resultsPath)
    > chicdiff.settings <- setChicdiffExperiment(designDir = testDesignDir,
        chicagoData = chicagoData, countData = countData,
        peakfiles = peakFiles, outprefix = "HaCaT_MyLa")
    > output <- chicdiffPipeline(chicdiff.settings)
  • 28

    Visualize some of the results using plotDiffBaits().

    > outputRds <- file.path(resultsPath, "HaCaT_MyLa_results.Rds")
    > countputRds <- file.path(resultsPath, "HaCaT_MyLa_countput.Rds")
    > bmapRds <- file.path(testDesignDir, "HindIII_GWAS_sharedChr.baitmap")
    > baits = c(287087, 287096, 425899, 781515)
    > plotDiffBaits(outputRds, countputRds, bmapRds, baits = baits)

    This function plots the raw read counts of interactions versus their linear distance from the respective bait fragment as mirror images for the two cell lines. The output is shown in Extended Data Fig. 4b. See Box 7 for more detailed instructions on running Chicdiff and the structure of a typical run.
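
Because mismatched names between the countData and chicagoData lists are an easy mistake to make in Steps 25 and 26, it can help to check the two lists before launching the pipeline. The helper below is a minimal sketch in base R; check_chicdiff_inputs is a hypothetical function and not part of Chicdiff, and it does not inspect the peak matrix itself.

```r
# Hypothetical helper (not part of Chicdiff): returns a character vector of
# problems found, or character(0) if the two input lists are consistent.
check_chicdiff_inputs <- function(countData, chicagoData) {
  problems <- character(0)
  # Condition names (e.g., cell types) must agree and be in the same order.
  if (!identical(names(countData), names(chicagoData))) {
    problems <- c(problems,
                  "condition names differ between countData and chicagoData")
  }
  # Replicate names must agree within each shared condition.
  for (cond in intersect(names(countData), names(chicagoData))) {
    if (!identical(names(countData[[cond]]), names(chicagoData[[cond]]))) {
      problems <- c(problems,
                    paste0("replicate names differ for condition ", cond))
    }
  }
  # Every listed file must exist on disk.
  all_files <- c(unlist(countData), unlist(chicagoData))
  not_found <- all_files[!file.exists(all_files)]
  if (length(not_found) > 0) {
    problems <- c(problems, paste0("file not found: ", not_found))
  }
  problems
}

# Toy example using empty placeholder files in a temporary directory.
tmp <- tempdir()
paths <- file.path(tmp, c("a1.chinput", "a2.chinput", "b1.chinput",
                          "a1.Rds", "a2.Rds", "b1.Rds"))
invisible(file.create(paths))
toyCount   <- list(A = c(A1 = paths[1], A2 = paths[2]), B = c(B1 = paths[3]))
toyChicago <- list(A = c(A1 = paths[4], A2 = paths[5]), B = c(B1 = paths[6]))
print(check_chicdiff_inputs(toyCount, toyChicago))
```

An empty result indicates that the two lists are mutually consistent; the names should still be compared against the peak matrix as described in the critical steps.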

Box 7. Detecting differential chromosomal interactions with Chicdiff.

Detection of differential interactions between conditions (different cell types, gene knockout, enhancer modifications) is a common task in studies using CHi-C. While in an exploratory setting this task can be accomplished by comparing CHiCAGO interaction scores across conditions (e.g., using cluster analysis), this approach is not adequate for formal differential analysis, which requires accounting for random noise and the burden of multiple testing.

Existing statistical tools for differential analysis of genomic data such as Hi-C and RNA-seq do not account for the specific statistical properties of CHi-C, nor leverage them to improve the precision and power of differential interaction detection in these data. To address this issue, we developed Chicdiff45, an R package for differential CHi-C data analysis. Chicdiff combines moderated differential testing for count data implemented in DESeq2 (ref. 47) with CHi-C-specific procedures for signal normalization informed by CHiCAGO and a multiple testing treatment based on P-value weighting22. While Chicdiff needs to be run downstream of CHiCAGO, it can perform differential analysis of user-specified sets of interactions irrespective of whether they pass a CHiCAGO score cutoff in either condition.

A typical run is executed in the following way:

library(Chicdiff)
chicdiff.settings <- setChicdiffExperiment(designDir = testDesignDir, chicagoData = chicagoData,
  countData = countData, peakfiles = peakFiles, outprefix = "test")
output <- chicdiffPipeline(chicdiff.settings)
Input: Description
1. designDir: Path to the design directory, which contains the .baitmap and .rmap files
Bait map file (.baitmap): A bed-like file that contains the coordinates of the baited restriction fragments with their respective annotations. Usually this file contains five columns, as follows: chr, start, end, fragmentID, baitAnnotation. This is the same .baitmap file required to run Chicago
Restriction map file (.rmap): A bed-like file containing the coordinates of the restriction fragments. Usually this file contains four columns, as follows: chr, start, end, fragmentID. This is the same .rmap file required to run Chicago
2. chicagoData
Chicago output .Rds or .Rda files: Chicago output files. Should be generated for each replicate separately
3. countData
Count data .chinput files: The same .chinput files that were used to run Chicago. They are generated from aligned CHi-C BAM files by running the bam2chicago.sh script
4. peakFiles
One or more peakfile(s): The peakfile(s) contain interactions of interest and are generated by makePeakMatrix.R from the results of Chicago run(s) on either individual replicates or merged replicates for each condition
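
The idea of P-value weighting mentioned in this box can be illustrated with a generic weighted Benjamini-Hochberg adjustment, in which each P-value is divided by a weight before the standard procedure is applied. The sketch below is not Chicdiff's implementation; the function name weighted_bh, the toy P-values and the weights are hypothetical.

```r
# Toy P-values and hypothetical prior weights (larger weight = test considered
# more likely to be a true positive a priori). Values are illustrative only.
p <- c(0.001, 0.012, 0.03, 0.04, 0.2, 0.6)
w <- c(2.0, 2.0, 1.0, 0.5, 0.25, 0.25)

# Generic weighted Benjamini-Hochberg adjustment (hypothetical helper).
weighted_bh <- function(p, w) {
  stopifnot(length(p) == length(w), all(w > 0))
  w <- w / mean(w)              # normalize so that the weights average to 1
  p_weighted <- pmin(p / w, 1)  # weighted P-values, capped at 1
  p.adjust(p_weighted, method = "BH")
}

padj <- weighted_bh(p, w)
print(data.frame(p = p, weight = w, padj = padj))
```

With all weights equal to 1, the function reduces to the ordinary Benjamini-Hochberg adjustment, so the weights only redistribute the testing budget among hypotheses.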

Fine-mapping interaction calls with Peaky ● Timing 20+ min (to process two baits in the full dataset)

  • 29

    The full-sized CHi-C data are more suitable than downsampled data for demonstrating how to use Peaky. Obtain the full MyLa CHiCAGO object myla_reps_HindIII_GWAS_all_myla.Rds at https://osf.io/9xhp7/.

  • 30

    Specify the working directory and CHiCAGO dataset:

    > library(peaky)
    > peaky_output_dir <- file.path("your_path_to/myla_peaky_results")
    > dir.create(peaky_output_dir)
    > setwd(peaky_output_dir)
    > chicago_rds_path <- file.path("your_path_to/myla_reps_hindIII_GWAS_all_myla.Rds")
  • 31

    Convert the CHiCAGO data for analysis with Peaky.

    > peaky_prepare_from_chicago(chicago_rds_path, peaky_output_dir)
  • 32

    Run Peaky’s fine mapping command with a couple of example baits.

    > for(i in c(1558,3610)){ peaky_run(peaky_output_dir, i) } # Here, “i” refers to a line in myla_peaky_results/baits/baitlist.txt
  • 33

    Generate a .csv output and plots for all processed baits.

    > X = peaky_wrapup(peaky_output_dir)
  • 34

    Plot a given bait interactively in R.

    > peaky_plot(X$output, bait=642001)

    The output for bait 642001 can be seen in Extended Data Fig. 5. See Box 8 for the structure of a typical run and more detailed instructions.

Box 8. Fine-mapping CHiCAGO-detected chromosomal interactions with Peaky.

Peaky has been developed to call chromatin contacts at high resolution, considering seemingly long runs of contact calls across many adjacent ‘other ends’ and subsequently extracting a smaller subset of fine-mapped interactions. This approach assumes that any true, causal contact between a bait and a given ‘other end’ yields an increase in signal that extends across the neighboring ‘other ends’ fragments, as is often observed even in processed chromatin conformation capture data. Rather than designating all local ‘other ends’ with high counts as being in contact with the bait, Peaky aims to identify the causal contact(s) responsible for the local rise in signal. In practice, Peaky adjusts counts in a negative binomial regression (e.g., against the CHiCAGO-modeled expected count) and models patterns of local signal decay in the resulting residuals through reversible-jump Markov chain Monte Carlo. The probability of each ‘other end’ being a causal contact is quantified as the marginal posterior probability of contact (MPPC). To optimize the model’s performance, users may specify the rate at which signal decays around a truly interacting ‘other end’ (see ‘Estimate of omega’ in the vignette at https://cqgd.github.io/pky/). Peaky can be run as a post hoc analysis to fine-map interactions detected by CHiCAGO, or entirely independently from raw CHi-C or Capture-C counts and a restriction fragment map.

A typical run is executed in the following way:

library(peaky)
peaky_prepare_from_chicago(chicago_rds_path, peaky_output_dir)
for(i in 1:3){ peaky_run(peaky_output_dir, i) } #where i refers to a given bait

peaky_wrapup(peaky_output_dir)
Input: Description
1. CHiCAGO data
chicago_rds_path: Path to the .rds file produced by Chicago. Optional: chicago_bait_subset and chicago_max_dist can be passed as parameters to the peaky_prepare_from_chicago() function to limit the analysis to specific baits and to interactions up to a given distance
2. Peaky output directory
peaky_output_dir: Directory in which to store Peaky’s intermediate files and results. Will be created if it does not exist

Output: Description
1. Table of fine-mapping results
peaky_output.csv: Contains fine-mapping results, specifically the MPPC for each putative interaction
2. Plots showing fine-mapped contacts
plots/peaky_plot_bait_#.pdf: Plots showing each bait’s raw read counts, adjusted read counts, CHiCAGO scores (if available) and MPPC across the local genome
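
Once fine-mapping results are available, a simple way to use them is to retain only those CHiCAGO-significant fragments that also carry a non-negligible MPPC. The sketch below shows this on a toy table; the column names (other_end, chicago_score, mppc) and the MPPC cutoff of 0.1 are hypothetical, so the header of the actual peaky_output.csv should be checked before adapting it.

```r
# Toy table standing in for fine-mapping results for one bait. Column names
# and values are hypothetical and chosen for illustration only.
res <- data.frame(
  other_end     = c(101, 102, 103, 104, 105),
  chicago_score = c(6.3, 8.9, 7.1, 2.0, 5.4),
  mppc          = c(0.02, 0.81, 0.05, 0.00, 0.33)
)

# Keep fragments that pass both a CHiCAGO score cutoff and an MPPC cutoff.
# The MPPC cutoff is an arbitrary example value, not a recommendation.
fine_map <- function(d, score_cutoff = 5, mppc_cutoff = 0.1) {
  d[d$chicago_score >= score_cutoff & d$mppc >= mppc_cutoff, , drop = FALSE]
}

kept <- fine_map(res)
print(kept)
```

In this toy example, two of the four fragments with a CHiCAGO score of at least 5 are discounted because their MPPC is close to zero.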

Troubleshooting

Troubleshooting advice can be found in Table 3.

Table 3. Troubleshooting table.

Step 1, running bam2chicago.sh
Problem: Error: Type checker found wrong number of fields while tokenizing data line
Possible reason: Incorrect version of bedtools
Solution: Make sure that the bedtools version in use is v2.25

Problem: Empty chinputs are produced
Possible reason: Incorrect version of bedtools
Solution: Make sure that the bedtools version in use is v2.25

Step 12, running makeDesignFiles.py
Problem: SyntaxError, such as ‘Missing parentheses in call to “print”’
Possible reason: Incorrect version of Python
Solution: Use Python v2.7 (a development version of this script that is compatible with Python 3 is available in the chicagoTools suite)

Step 17B, running runChicago.R
Problem: Too few significant interactions (score >5) called
Possible reason: While it could be a biologically valid result, it could arise from purely technical reasons. If the sequencing coverage is low, many baits would be filtered out from the analysis
Solution: Check how many baits are filtered out and what the total number of contacts per bait is. By default, Chicago filters out baits with <250 different interactions in total. This can be adjusted by changing the minNPerBait parameter in the settings file

Problem: Too many significant interactions (score >5) called
Possible reason: Background is not estimated correctly
Solution: Follow our guidelines on custom parameter tuning

Problem: Too many trans-interactions are called significant (score >5)
Possible reason: Interactions over long distances may be overweighted
Solution: The P-value weighting scheme should be trained based on the user’s replicates by running the fitDistCurve.R script (found within the chicagoTools suite)
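
To check how many baits are likely to be filtered out because of low coverage, the per-bait totals can be tabulated from the count data. The sketch below uses a toy table; the column names (baitID, otherEndID, N) and the assumption that the threshold of 250 applies to the summed count per bait are illustrative and should be checked against the actual .chinput files and Chicago settings.

```r
# Toy count table with one row per bait/other-end pair. Column names are
# assumed for illustration; N is the read count for the pair.
counts <- data.frame(
  baitID     = c(1, 1, 1, 2, 2, 3),
  otherEndID = c(10, 11, 12, 20, 21, 30),
  N          = c(200, 100, 50, 120, 60, 400)
)

threshold <- 250  # example threshold, matching the default quoted in Table 3

# Sum the counts per bait and list baits falling below the threshold.
per_bait <- tapply(counts$N, counts$baitID, sum)
low <- per_bait[per_bait < threshold]
cat(length(low), "of", length(per_bait), "baits fall below", threshold, "\n")
print(low)
```

A large fraction of baits below the threshold suggests that the low number of significant interactions is driven by sequencing coverage rather than biology.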

Timing

The timings for running CHiCAGO functions are dependent on sequencing depth and the number of di-tag combinations in the .chinput file. A typical runChicago.R run might take ~20 min to several hours. Similarly, the time required to make the design directory files for CHiCAGO using makeDesignFiles.py script depends on the bait design and the chosen restriction enzyme. For ~20,000 baits using HindIII, it typically takes 5–10 min. However, processing a similar number of baits with a four-cutter takes 1–2 h. Note that the exact timing will strongly depend on the choice of maxLBrownEst parameter (see Box 4).

Anticipated results

Running the runChicago.R wrapper script produces four output folders: data, diag_plots, enrichment_data and examples. The data folder contains an .Rds file, which contains all R data output generated by CHiCAGO; chiefly, the interaction data between all detected fragment pairs that were included in the analysis. It also contains a params.txt file, which keeps a record of the parameters used to run the test. The final file in the data folder, washU_text.txt, contains all significant interactions at the requested score cutoff in a format suitable for direct upload to the original WashU Epigenome Browser. Note that the ‘New’ (2018) Epigenome Browser requires an altered format (see Box 5). Also note that the bait-to-bait interactions are listed twice in the .Rds file but only once in the WashU file, which contains the stronger interaction of the two possibilities (where the fragments are interchangeably considered to be the ‘bait’ or the ‘other-end’).
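
The treatment of bait-to-bait interactions described above can be reproduced on an interaction table by keeping, for each unordered pair of fragments, the record with the higher score. The following is a minimal sketch on toy data; the column names (baitID, otherEndID, score) and values are assumptions, and this is not the code used by runChicago.R to write the WashU file.

```r
# Toy interaction table in which fragments 5 and 9 are both baits, so their
# interaction appears twice with different scores (values are hypothetical).
ints <- data.frame(
  baitID     = c(5, 9, 5, 7),
  otherEndID = c(9, 5, 40, 41),
  score      = c(6.0, 8.5, 5.5, 7.0)
)

# Build an order-independent key so that 5-9 and 9-5 map to the same pair.
ints$pair <- paste(pmin(ints$baitID, ints$otherEndID),
                   pmax(ints$baitID, ints$otherEndID), sep = "_")

# Sort by decreasing score and keep the first (strongest) record per pair.
ord   <- ints[order(-ints$score), ]
dedup <- ord[!duplicated(ord$pair), ]
print(dedup)
```

Interactions between a bait and a non-baited fragment occur only once in the table and are therefore left unchanged by this step.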

The plots produced by runChicago.R are described in detail in ‘Assessment of the results’. Briefly, the diag_plots folder contains three pdf files showing diagnostic plots of the test. The enrichment_data folder contains a pdf file and a txt file that give the results of the other-end feature enrichment analysis, if specified. The examples folder contains a pdf file displaying the interaction profiles for 16 random baits in the analysis. The bait interaction profiles give an indication of sequence coverage and consistency between baits; for examples of sparsely sequenced and well-sequenced interaction profiles, see Fig. 3 and Fig. 5, respectively. Post hoc assessment of the suitability of the CHiCAGO model parameters can be conducted in several ways, as described in ‘Custom parameter tuning’.

Extended Data

Extended Data Fig. 1. Comparative analysis of PCHi-C data generated with a four- and a six-cutter restriction enzyme.

Three MboI PCHi-C replicates obtained from iPSC-derived cardiomyocytes (iPSC CMs33) were processed by CHiCAGO either at the restriction fragment level, using standard 4 bp cutter settings or in 5 kb bins, as described in the Procedure. Three HindIII PCHi-C replicates obtained from hESC-derived cardiomyocytes (hESC CMs34) were processed using standard 6 bp cutter settings. Only genes baited in both iPSC CMs and hESC CMs were included in the comparative analysis. An interaction was considered shared when the middle of the significantly interacting fragments in the MboI data fell within the respective interacting fragments in the HindIII dataset (CHiCAGO score >5). When several interactions in MboI data overlapped with the same HindIII interaction, it was counted as a single shared interaction to avoid double-counting. a,b, Comparison between MboI and HindIII PCHi-C datasets in nonbinned mode (a) and binned mode (b). The violin plots show the distance distribution of significant interactions belonging to shared, MboI- and HindIII-specific groups. The number of significant interactions in each group is indicated in gray. The barplots show enrichment for regulatory histone marks (as a ratio between observed and expected) in each group of interactions.

Extended Data Fig. 2. QC plots generated by HiCUP for downsampled CHi-C data.

MyLa CHi-C36 replicate 1 was downsampled to 20 million raw read pairs and processed using HiCUP19, as described in the Procedure. a, Truncation, alignment to GRCh37 and pairing results for read 1 (dark blue) and read 2 (light blue). The ~15 million paired reads are taken forwards for filtering. b, Detection of valid Hi-C di-tags (dark blue) and removal of Hi-C artifacts such as religation products (turquoise) and di-tags falling outside the specified size range (orange). c, Size distribution of di-tags with limits shown as red lines. d, Interacting fragments are grouped into cis < 10 kb (dark blue), cis > 10 kb (light blue) and trans (green) for di-tags before removal of PCR duplicates (left) and after (right).

Extended Data Fig. 3. QC plots generated by CHiCAGO for downsampled CHi-C data.

Downsampled CHi-C datasets36 were processed by CHiCAGO using both replicates per cell line as described in the Procedure. a, Barplot showing the scaling factors (si’s) computed for each pool of other ends for MyLa. b, Boxplots showing distribution of technical noise estimates for each pool of baits/viewpoints (top) and for each pool of other ends (bottom) for MyLa. c, Distance dependency of background counts and computed fit (red curve), plotted on a log–log scale for MyLa. d, Interaction profiles for the bait 670997, assigned to rs4141001, in MyLa (top) and HaCaT (bottom). High-scoring interactions detected by CHiCAGO (score ≥5) are shown in red, and subthreshold interactions (3 ≤ score < 5) are shown in blue. e, Number of overlaps between chromatin features of interacting fragments detected using CHiCAGO (yellow bars) versus number of overlaps from 100 random distance-matched subsets of HindIII fragments (blue bars) in MyLa (top) and HaCaT (bottom). Error bars represent 95% confidence intervals.

Extended Data Fig. 4. Identifying differential interactions between conditions using Chicdiff.

a, Dendrogram for downsampled HaCaT and MyLa samples36 obtained from running makePeakMatrix.R as outlined in the Procedure. b, Chicdiff45 bait profiles were generated for four loci as described in the Procedure. The plots show the raw read counts versus linear distance from the bait fragment as mirror images for HaCaT and MyLa. Other-end interacting fragments are pooled and color-coded by their adjusted weighted P-value.

Extended Data Fig. 5. Example of fine-mapping chromatin contacts with Peaky.

The full MyLa CHi-C36 data were processed by CHiCAGO using both replicates and then analyzed using Peaky44. The top panel shows the distribution of raw read counts for other end fragments for the bait 642001, with high-scoring interactions (CHiCAGO score ≥ 5) highlighted in blue. The second panel shows the CHiCAGO adjusted read counts with high-scoring interactions (CHiCAGO score ≥ 5) highlighted in blue and with the Peaky model fitted as a green line. The third panel shows CHiCAGO scores for those interactions with the blue dashed line showing the score cutoff of 5. In the bottom panel, the probability of each other-end fragment being a causal contact is quantified as the marginal posterior probability of contact (MPPC). Based on this metric, a number of fragments with CHiCAGO score ≥ 5 (points highlighted in blue) have MPPC very close to zero. After discounting these, a smaller subset of fine-mapped interactions may be identified.

Supplementary Material

The online version contains supplementary material available at https://doi.org/10.1038/s41596-021-00567-5.

Acknowledgements

We thank all users of CHiCAGO and associated packages for providing test data and reporting issues. Research in the M.S. lab is supported by core funding from the UK’s Medical Research Council (MRC) (MC-A652-5QA20). S.W.W. acknowledges core support from the UK’s Biotechnology and Biological Sciences Research Council (BBSRC). C.W. is supported by the MRC (MC_UU_00002/4) and the Wellcome Trust (WT107881, 215097/Z/18/Z).

Footnotes

Author contributions

J.C., P.F.P. and M.S. developed the CHiCAGO pipeline. J.C., W.R.O., V.M. and M.S. developed Chicdiff. C.E. and C.W. developed Peaky and its integration with CHiCAGO. S.W.W. developed the HiCUP pipeline. V.M., H.R.J. and M.D.R. optimized CHiCAGO parameters and contributed auxiliary scripts, with input from J.C., W.R.O. and M.S. P.F.P. and H.R.J. wrote and tested the code in the Procedure. P.F.P., H.R.J., M.D.R., C.E., C.W., M.S. and V.M. wrote the manuscript. All authors read and approved the final manuscript. M.S. and V.M. supervised the work.

Competing interests

P.F.P. is currently an employee of Inivata Limited. J.C. is currently an employee of AstraZeneca and may or may not own stock options. M.S. is a cofounder of Enhanc3D Genomics Ltd. The rest of the authors declare no competing interests.

Extended data is available for this paper at https://doi.org/10.1038/s41596-021-00567-5.

Peer review information Nature Protocols thanks Andrea M. Chiariello, Fulai Jin and Yun Li for their contribution to the peer review of this work.

Reprints and permissions information is available at www.nature.com/reprints.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Data availability

All of the figures for this paper were produced using publicly available data from Ray-Jones et al.36, Montefiori et al.33, and Choy et al.34. We provide downsampled FASTQ files and all intermediate file types (.bam, .chinput, .Rds) from Ray-Jones et al. on the OSF repository (https://osf.io/kt67f) to allow readers to test either the full pipeline or specific analysis steps.

Code availability

The Chicago and PCHiCdata R packages are available from Bioconductor and from the Bitbucket repository: https://bitbucket.org/chicagoTeam/chicago. The chicagoTools suite of auxiliary scripts is available from the same Bitbucket repository. Full documentation and installation instructions for HiCUP are available from https://www.bioinformatics.babraham.ac.uk/projects/hicup/. The Peaky R package is available from the GitHub repository: http://github.com/cqgd/pky. The Chicdiff R package is available from the GitHub repository: https://github.com/RegulatoryGenomicsGroup/chicdiff. The code presented in the Procedure and the versions of the software used in this protocol are deposited on OSF: https://osf.io/kt67f/ (DOI 10.17605/OSF.IO/KT67F).

References

  • 1.Schoenfelder S, Fraser P. Long-range enhancer-promoter contacts in gene expression control. Nat Rev Genet. 2019;20:437–455. doi: 10.1038/s41576-019-0128-0. [DOI] [PubMed] [Google Scholar]
  • 2.Schmitt AD, Hu M, Ren B. Genome-wide mapping and analysis of chromosome architecture. Nat Rev Mol Cell Biol. 2016;17:743–755. doi: 10.1038/nrm.2016.104. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Lieberman-Aiden E, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326:289–293. doi: 10.1126/science.1181369. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.van Berkum NL, et al. Hi-C: a method to study the three-dimensional architecture of genomes. J Vis Exp. 2020 doi: 10.3791/1869. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Dixon JR, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485:376–380. doi: 10.1038/nature11082. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Nora EP, et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature. 2012;485:381–385. doi: 10.1038/nature11049. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Sexton T, et al. Three-dimensional folding and functional organization principles of the Drosophila genome. Cell. 2012;148:458–472. doi: 10.1016/j.cell.2012.01.010. [DOI] [PubMed] [Google Scholar]
  • 8.Schoenfelder S, et al. The pluripotent regulatory circuitry connecting promoters to their long-range interacting elements. Genome Res. 2015;25:582–597. doi: 10.1101/gr.185272.114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Mifsud B, et al. Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. Nat Genet. 2015;47:598–606. doi: 10.1038/ng.3286. [DOI] [PubMed] [Google Scholar]
  • 10.Sahlén P, et al. Genome-wide mapping of promoter-anchored interactions with close to single-enhancer resolution. Genome Biol. 2015;16:156. doi: 10.1186/s13059-015-0727-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Hughes JR, et al. Analysis of hundreds of cis-regulatory landscapes at high resolution in a single, high-throughput experiment. Nat Genet. 2014;46:205–212. doi: 10.1038/ng.2871. [DOI] [PubMed] [Google Scholar]
  • 12.Würtele H, Chartrand P. Genome-wide scanning of HoxB1-associated loci in mouse ES cells using an open-ended chromosome conformation capture methodology. Chromosome Res. 2006;14:477–495. doi: 10.1007/s10577-006-1075-0. [DOI] [PubMed] [Google Scholar]
  • 13.Simonis M, et al. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture-on-chip (4C) Nat Genet. 2006;38:1348–1354. doi: 10.1038/ng1896. [DOI] [PubMed] [Google Scholar]
  • 14.Zhao Z, et al. Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomal interactions. Nat Genet. 2006;38:1341–1347. doi: 10.1038/ng1891. [DOI] [PubMed] [Google Scholar]
  • 15.Cairns J, et al. CHiCAGO: robust detection of DNA looping interactions in Capture Hi-C data. Genome Biol. 2016;17:127. doi: 10.1186/s13059-016-0992-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Huber W, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015;12:115–121. doi: 10.1038/nmeth.3252. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Rosa A, Becker NB, Everaers R. Looping probabilities in model interphase chromosomes. Biophys J. 2010;98:2410–2419. doi: 10.1016/j.bpj.2010.01.054. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Bohn M, Heermann DW. Diffusion-driven looping provides a consistent framework for chromatin organization. PLoS One. 2010;5:e12218. doi: 10.1371/journal.pone.0012218. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Wingett S, et al. HiCUP: pipeline for mapping and processing Hi-C data. F1000Res. 2015;4:1310. doi: 10.12688/f1000research.7334.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B Methodol. 1995;57:289–300. [Google Scholar]
  • 21.Genovese CR, Roeder K, Wasserman L. False discovery control with p-value weighting. Biometrika. 2006;93:509–524. [Google Scholar]
  • 22.Ignatiadis N, Klaus B, Zaugg JB, Huber W. Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nat Methods. 2016 doi: 10.1038/nmeth.3885. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Freire-Pritchett P, et al. Global reorganisation of cis-regulatory units upon lineage commitment of human embryonic stem cells. eLife. 2017;6:e21926. doi: 10.7554/eLife.21926. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Novo CL, et al. Long-range enhancer interactions are prevalent in mouse embryonic stem cells and are reorganized upon pluripotent state transition. Cell Rep. 2018;22:2615–2627. doi: 10.1016/j.celrep.2018.02.040. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Chovanec P, et al. Widespread reorganisation of pluripotent factor binding and gene regulatory interactions between human pluripotent states. Nat Commun. 2021;12:2098. doi: 10.1038/s41467-021-22201-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Siersbæk R, et al. Dynamic rewiring of promoter-anchored chromatin loops during adipocyte differentiation. Mol Cell. 2017;66:420–435.:e5. doi: 10.1016/j.molcel.2017.04.010. [DOI] [PubMed] [Google Scholar]
  • 27.Rubin AJ, et al. Lineage-specific dynamic and pre-established enhancer-promoter contacts cooperate in terminal differentiation. Nat Genet. 2017;49:1522–1528. doi: 10.1038/ng.3935. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Thiecke MJ, et al. Cohesin-dependent and -independent mechanisms mediate chromosomal contacts between promoters and enhancers. Cell Rep. 2020;32:107929. doi: 10.1016/j.celrep.2020.107929. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Javierre BM, et al. Lineage-specific genome architecture links enhancers and non-coding disease variants to target gene promoters. Cell. 2016;167:1369–1384.:e19. doi: 10.1016/j.cell.2016.09.037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Burren OS, et al. Chromosome contacts in activated T cells identify autoimmune disease candidate genes. Genome Biol. 2017;18:165. doi: 10.1186/s13059-017-1285-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Petersen R, et al. Platelet function is modified by common sequence variation in megakaryocyte super enhancers. Nat Commun. 2017;8:16058. doi: 10.1038/ncomms16058. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Litchfield K, et al. Identification of 19 new risk loci and potential regulatory mechanisms influencing susceptibility to testicular germ cell tumor. Nat Genet. 2017;49:1133–1140. doi: 10.1038/ng.3896. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Montefiori LE, et al. A promoter interaction map for cardiovascular disease genetics. eLife. 2018;7:e35788. doi: 10.7554/eLife.35788. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Choy MK, et al. Promoter interactome of human embryonic stem cell-derived cardiomyocytes connects GWAS regions to cardiac gene networks. Nat Commun. 2018;9:2526. doi: 10.1038/s41467-018-04931-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Joshi O, et al. Dynamic reorganization of extremely long-range promoter-promoter interactions between two states of pluripotency. Cell Stem Cell. 2015;17:748–757. doi: 10.1016/j.stem.2015.11.010. [DOI] [PubMed] [Google Scholar]
  • 36.Ray-Jones H, et al. Mapping DNA interaction landscapes in psoriasis susceptibility loci highlights KLF4 as a target gene in 9q31. BMC Biol. 2020;18:47. doi: 10.1186/s12915-020-00779-3.
  • 37.Martin P, et al. Chromatin interactions reveal novel gene targets for drug repositioning in rheumatic diseases. Ann Rheum Dis. 2019;78:1127–1134. doi: 10.1136/annrheumdis-2018-214649.
  • 38.Ghavi-Helm Y, et al. Highly rearranged chromosomes reveal uncoupling between genome topology and gene expression. Nat Genet. 2019;51:1272–1282. doi: 10.1038/s41588-019-0462-3.
  • 39.Andrey G, et al. Characterization of hundreds of regulatory landscapes in developing limbs reveals two regimes of chromatin folding. Genome Res. 2017;27:223–233. doi: 10.1101/gr.213066.116.
  • 40.Su C, et al. Mapping effector genes at lupus GWAS loci using promoter Capture-C in follicular helper T cells. Nat Commun. 2020;11:3294. doi: 10.1038/s41467-020-17089-5.
  • 41.Chesi A, et al. Genome-scale Capture C promoter interactions implicate effector genes at GWAS loci for bone mineral density. Nat Commun. 2019;10:1260. doi: 10.1038/s41467-019-09302-x.
  • 42.Anil A, Spalinskas R, Åkerborg Ö, Sahlén P. HiCapTools: a software suite for probe design and proximity detection for targeted chromosome conformation capture applications. Bioinformatics. 2018;34:675–677. doi: 10.1093/bioinformatics/btx625.
  • 43.Ben Zouari Y, Molitor AM, Sikorska N, Pancaldi V, Sexton T. ChiCMaxima: a robust and simple pipeline for detection and visualization of chromatin looping in Capture Hi-C. Genome Biol. 2019;20:102. doi: 10.1186/s13059-019-1706-3.
  • 44.Eijsbouts CQ, Burren OS, Newcombe PJ, Wallace C. Fine mapping chromatin contacts in capture Hi-C data. BMC Genomics. 2019;20:77. doi: 10.1186/s12864-018-5314-5.
  • 45.Cairns J, Orchard WR, Malysheva V, Spivakov M. Chicdiff: a computational pipeline for detecting differential chromosomal interactions in Capture Hi-C data. Bioinformatics. 2019;35:4764–4766. doi: 10.1093/bioinformatics/btz450.
  • 46.Holgersen EM, et al. Identifying high-confidence capture Hi-C interactions using CHiCANE. Nat Protoc. 2021;16:2257–2285. doi: 10.1038/s41596-021-00498-1.
  • 47.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8.
  • 48.Thiecke MJ, et al. Cohesin-dependent and -independent mechanisms mediate chromosomal contacts between promoters and enhancers. Cell Rep. 2020;32:107929. doi: 10.1016/j.celrep.2020.107929.
  • 49.Ay F, Bailey TL, Noble WS. Statistical confidence estimation for Hi-C data reveals regulatory chromatin contacts. Genome Res. 2014;24:999–1011. doi: 10.1101/gr.160374.113.
  • 50.Heinz S, et al. Transcription elongation can affect genome 3D structure. Cell. 2018;174:1522–1536.e22. doi: 10.1016/j.cell.2018.07.047.
  • 51.Imakaev M, et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat Methods. 2012;9:999–1003. doi: 10.1038/nmeth.2148.
  • 52.Beccari L, et al. Dbx2 regulation in limbs suggests inter-TAD sharing of enhancers. Dev Dyn. 2021 doi: 10.1002/dvdy.303.
  • 53.Su C, Pahl MC, Grant SFA, Wells AD. Restriction enzyme selection dictates detection range sensitivity in chromatin conformation capture-based variant-to-gene mapping approaches. bioRxiv. 2020 doi: 10.1101/2020.12.15.422932. Preprint.
  • 54.Disney-Hogg L, Kinnersley B, Houlston R. Algorithmic considerations when analysing capture Hi-C data. Wellcome Open Res. 2020;5:289. doi: 10.12688/wellcomeopenres.16394.1.
  • 55.Feldmann A, Dimitrova E, Kenney A, Lastuvkova A, Klose RJ. CDK-Mediator and FBXL19 prime developmental genes for activation by promoting atypical regulatory interactions. Nucleic Acids Res. 2020;48:2942–2955. doi: 10.1093/nar/gkaa064.
  • 56.Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10:R25. doi: 10.1186/gb-2009-10-3-r25.
  • 57.Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
  • 58.Zhou X, et al. The Human Epigenome Browser at Washington University. Nat Methods. 2011;8:989–990. doi: 10.1038/nmeth.1772.
  • 59.Zhou X, et al. Exploring long-range genome interactions using the WashU Epigenome Browser. Nat Methods. 2013;10:375–376. doi: 10.1038/nmeth.2440.

Associated Data

Supplementary Materials

File 1

Data Availability Statement

All of the figures for this paper were produced using publicly available data from Ray-Jones et al.36, Montefiori et al.33, and Choy et al.34. We provide downsampled FASTQ files and all intermediate file types (.bam, .chinput, .Rds) from Ray-Jones et al. on the OSF repository (https://osf.io/kt67f) to allow readers to test either the full pipeline or specific analysis steps.

The Chicago and PCHiCdata R packages are available from Bioconductor and from the Bitbucket repository: https://bitbucket.org/chicagoTeam/chicago. The chicagoTools suite of auxiliary scripts is available from the same Bitbucket repository. Full documentation and installation instructions for HiCUP are available from https://www.bioinformatics.babraham.ac.uk/projects/hicup/. The Peaky R package is available from the GitHub repository: http://github.com/cqgd/pky. The Chicdiff R package is available from the GitHub repository: https://github.com/RegulatoryGenomicsGroup/chicdiff. The code presented in the Procedure and the versions of the software used in this protocol are deposited on OSF: https://osf.io/kt67f/ (DOI 10.17605/OSF.IO/KT67F).
