Guiding the design of well-powered Hi-C experiments to detect differential loops

Sarah M Parker; Eric S Davis; Douglas H Phanstiel

doi:10.1093/bioadv/vbad152

Bioinformatics Advances. 2023 Oct 16;3(1):vbad152. doi: 10.1093/bioadv/vbad152

Editor: Thomas Lengauer

PMCID: PMC10645293 PMID: 38023330

Abstract

Motivation

Three-dimensional chromatin structure plays an important role in gene regulation by connecting regulatory regions and gene promoters. The ability to detect the formation and loss of these loops in various cell types and conditions provides valuable information on the mechanisms driving these cell states and is critical for understanding long-range gene regulation. Hi-C is a powerful technique for characterizing 3D chromatin structure; however, Hi-C can quickly become costly and labor-intensive, and proper planning is required to ensure efficient use of time and resources while maintaining experimental rigor and well-powered results.

Results

To facilitate better planning and interpretation of human Hi-C experiments, we conducted a detailed evaluation of statistical power using publicly available Hi-C datasets, paying particular attention to the impact of loop size on Hi-C contacts and fold change compression. In addition, we have developed Hi-C Poweraid, a publicly hosted web application to investigate these findings. For experiments involving well-replicated cell lines, we recommend a total sequencing depth of at least 6 billion contacts per condition, split between at least two replicates to achieve the power to detect differences in the majority of loops. For experiments with higher variation, more replicates and deeper sequencing depths are required. Values for specific cases can be determined by using Hi-C Poweraid. This tool simplifies Hi-C power calculations, allowing for more efficient use of time and resources and more accurate interpretation of experimental results.

Availability and implementation

Hi-C Poweraid is available as an R Shiny application deployed at http://phanstiel-lab.med.unc.edu/poweraid/, with code available at https://github.com/sarmapar/poweraid.

1 Introduction

3D chromatin structure is thought to play a critical role in the regulation of gene expression, particularly during development and in response to external stimuli (Phanstiel et al. 2017, Siersbæk et al. 2017, Reed et al. 2022). Abnormalities in this organization have been implicated in a number of human diseases and developmental disorders (Malod-Dognin et al. 2020). While multiple types of 3D chromatin structures have been identified, chromatin loops—point-to-point interactions between two genomically distal loci—are of particular interest as they can bring gene promoters into close physical proximity with distal enhancers and facilitate transcriptional activation (Kagey et al. 2010). Multiple genomic approaches have been developed to detect loops and other chromatin structures, including Hi-C, a widely used approach to quantify chromatin interactions in a genome-wide fashion (Dekker et al. 2002, Dostie et al. 2006, Simonis et al. 2006, Lieberman-Aiden et al. 2009). Computational analysis of Hi-C data can be used to both identify chromatin loops and quantify the interaction frequencies between pairs of loop anchors. The application of Hi-C to investigate differential looping, or changes in looping across samples or biological conditions, has provided valuable insights into the function of the human genome, the role of enhancers in regulating gene expression, and the genetic basis of human disease (Greenwald et al. 2019, Reed et al. 2022).

Proper design and interpretation of comparative Hi-C genomics experiments requires a rigorous understanding of the statistical power underlying the experiment in question. Statistical power is the probability of a test rejecting the null hypothesis, given that the null hypothesis is false. In this case, the null hypothesis is that a given loop is not changing between conditions. Power relies on several key factors including the sample size (e.g. biological replicates), the effect size (i.e. fold change), the counts representing the feature of interest (i.e. the loop pixel), the alpha level (i.e. the accepted false positive rate), and dispersion (i.e. the variance between replicates). Several software packages have been developed to estimate the power of comparative genomic studies and these tools have been extensively used to design and interpret studies involving RNA-seq, ChIP-seq, ATAC-seq, and numerous other genomic methodologies (Hart et al. 2013, Guo et al. 2014, Zuo and Keleş 2014, Vieth et al. 2017, Li et al. 2018). However, Hi-C has several inherent differences compared to these data types that make power analysis non-trivial. First, the generation of Hi-C datasets for preliminary power analysis is expensive since Hi-C requires sequencing depths that are orders of magnitude greater than those required for traditional sequencing experiments. Second, Hi-C library preparation and data analysis both require special expertise due to the lengthy protocol and the sheer size and complexity of the resulting datasets. Finally, the counts observed in a given Hi-C pixel arise due to multiple biological and technical forces, each of which is dependent on the genomic distance between the loci depicted by said pixel.

Hi-C interaction frequencies are governed by at least two main forces, or types of interactions, each of which affects the power to detect differential looping. It is the cumulative influence of these interaction types that gives rise to the counts observed at any particular pixel in a Hi-C dataset. The first type is polymer interactions, which are distance-dependent interactions between two regions of a chromosome due to their inclusion in a linear polymer. The second type is looping interactions, which are driven by specific chromatin-interacting proteins. Most looping interactions are presumably the result of CTCF binding and cohesin-driven loop extrusion; however, other less common mechanisms have also been identified (Tang et al. 2015, Fudenberg et al. 2016, Conte et al. 2022). Other forces, including compartmentalization and contact domain inclusion, also influence interaction frequencies, albeit to a lesser degree. The relationships between these forces can strongly influence the power to detect differential loops, as we describe here.

To facilitate better planning of Hi-C experiments, we have modeled these forces and conducted a detailed evaluation of the effects of sequencing depth and replicate number on differential loop detection using deeply sequenced, publicly available Hi-C datasets (Rao et al. 2014). We also developed an interactive web application to enable better planning of comparative Hi-C experiments. Finally, we provide guidance and recommendations for planning Hi-C experiments depending on project goals and budgets.

2 Methods

2.1 Hi-C subsampling, alignment, and processing

In situ Hi-C datasets for 29 samples (18 primary and 11 replicate) of GM12878 cells were downloaded as merge_no_dups files through GEO accession GSE63525. These files were randomly subsampled to the fraction of the total sequencing depth required for the unique reads to add up to the desired sequencing depth. For each line in the merge_no_dup file, a random float was generated, and if the float was less than the required fraction of the total, that line was included in the new subsampled file. This method was repeated for total sequencing depths of 50 million, 100 million, 250 million, 500 million, 750 million, 1 billion, 2 billion, 3 billion, 4 billion, and 5 billion contacts. The resulting files differed from the nominal sequencing depths by at most ±0.01%.
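The per-line subsampling described above can be sketched as follows. This is a minimal illustration rather than the script used in the study: the function name `subsample_lines`, the fixed seed, and the in-memory toy input are assumptions made for the example.

```python
import random
from typing import Iterable, Iterator


def subsample_lines(lines: Iterable[str], keep_fraction: float,
                    seed: int = 0) -> Iterator[str]:
    """Yield each line independently with probability keep_fraction."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded only to make the example reproducible
    for line in lines:
        # One random float per line; keep the line if it falls below the fraction.
        if rng.random() < keep_fraction:
            yield line


# Toy stand-in for a contact file: one line per unique contact (hypothetical data).
contacts = [f"contact_{i}\n" for i in range(100_000)]
target_depth = 25_000
kept = list(subsample_lines(contacts, target_depth / len(contacts)))
```

Because each line is kept independently, the resulting depth only approximates the target, which is consistent with the small deviations from the nominal depths noted above.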

The 29 subsampled merge_no_dup files per sequencing depth were combined into one file per sequencing depth and processed into Hi-C files using Juicer tools (Durand et al. 2016) (v1.14.08). Looping interactions were called at 5 kb resolution with Significant Interaction Peak (SIP) caller (Rowley et al. 2020) (v1.6.1) and Juicer tools using the merged Hi-C file from the 29, non-subsampled merge_no_dup files (5 536 073 657 total contacts) with the following parameters: “-g 2.0 -t 2000 -fdr 0.05” for a total of 14 849 loops after merging at 10 kb resolution.

The un-normalized expected and observed counts for each loop in each Hi-C file were extracted using a pre-release of mariner (https://github.com/EricSDavis/mariner) (v. 0.1.0) from Hi-C files with duplicate reads removed but no other quality filtering steps applied. Loops were then filtered for a length shorter than 2 Mb, observed counts greater than expected, and for those only located on Chromosomes 1–22, resulting in ∼14 000 loops per sequencing depth.
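As an illustration of the three filters just described, the sketch below applies them to a handful of made-up loops. The `Loop` record, the `chr`-prefixed chromosome names, and the definition of loop size as the distance between anchor start coordinates are assumptions for the example, not details taken from the study's pipeline.

```python
from dataclasses import dataclass

AUTOSOMES = {f"chr{i}" for i in range(1, 23)}  # Chromosomes 1-22 (naming is an assumption)
MAX_LOOP_SIZE_BP = 2_000_000                   # 2 Mb cutoff from the text


@dataclass(frozen=True)
class Loop:
    chrom: str
    anchor1_start: int  # bp
    anchor2_start: int  # bp
    observed: float     # un-normalized observed counts at the loop pixel
    expected: float     # un-normalized expected counts at the same distance

    @property
    def size_bp(self) -> int:
        return abs(self.anchor2_start - self.anchor1_start)


def passes_filters(loop: Loop) -> bool:
    """Keep loops shorter than 2 Mb, with observed > expected, on chr1-22."""
    return (
        loop.size_bp < MAX_LOOP_SIZE_BP
        and loop.observed > loop.expected
        and loop.chrom in AUTOSOMES
    )


# Hypothetical loops, for illustration only.
loops = [
    Loop("chr1", 1_000_000, 1_350_000, 159, 36),  # kept
    Loop("chr2", 5_000_000, 8_000_000, 20, 4),    # dropped: 3 Mb long
    Loop("chr3", 2_000_000, 2_100_000, 30, 45),   # dropped: observed <= expected
    Loop("chrX", 1_000_000, 1_200_000, 80, 25),   # dropped: not on chr1-22
]
filtered = [loop for loop in loops if passes_filters(loop)]
```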

2.2 Fold change and power calculations

For each loop, the fold change in observed counts that results from a given fold change in the counts due to looping interactions was calculated using

{FC}_{observed} = \frac{(o - e) \times {FC}_{looping} + e}{o},

where o is observed counts, e is expected counts, and FC_looping is the fold change of counts due to looping interactions. For example, in Fig. 2, the FC_looping values used were 2, 3, and 4.
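The formula translates directly into a small function. The sketch below is an illustration only; the function name is an assumption, and the counts are the ones quoted for the example loop in Fig. 2A (159 observed, 36 expected).

```python
def observed_fold_change(observed: float, expected: float,
                         fc_looping: float) -> float:
    """Fold change in total counts when only the looping counts change.

    Looping counts are estimated as observed minus expected; the expected
    (polymer) counts are assumed to stay the same between conditions.
    """
    if observed <= 0:
        raise ValueError("observed counts must be positive")
    looping = observed - expected
    return (looping * fc_looping + expected) / observed


# Example loop from Fig. 2A, evaluated at the FC_looping values used there.
compressed = {fc: observed_fold_change(159, 36, fc) for fc in (2, 3, 4)}
print(round(compressed[2], 2))  # 1.77
```

When expected counts are zero the observed fold change equals the looping fold change, and when looping counts are zero the observed fold change is 1 regardless of `fc_looping`.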

Figure 2. — Fold change compression is anti-correlated with loop size. (A) An example region showing a loop and the effect on observed counts and fold change if counts due to looping interactions are doubled. The counts due to looping interactions can be represented as the observed counts (159) minus the expected counts (36). When these looping counts are doubled, the expected counts remain the same, so the observed fold change of total counts (1.77) is less than that of the fold change of looping counts (2). (B) The distance-dependent nature of the percent of signal due to looping. The percent of signal due to looping is about 50%–80% of the observed counts and increases slightly with loop size. For smaller loop sizes, expected counts comprise more of the observed counts. (C) Observed fold change per loop for looping fold changes of two (red), three (green), and four (blue). Dark and light shaded areas represent interquartile and interdecile ranges. Observed fold change is compressed when looping doubles, triples, and quadruples, with an increasing range of effect as fold change increases. The compression effect is greatest for shorter-range loops, meaning longer-range loops typically have higher observed fold changes for the same change in looping counts. (D) Effect of loop size on power for a median loop at each distance when dispersion, alpha, replicates, and counts are held constant at 0.001, 0.05, 2, and 50, respectively.

Power was calculated per loop using the rnapower() function from the RNASeqPower package (Hart et al. 2013), where alpha was 0.05 divided by the number of loops for a given sequencing depth, depth was the observed counts for the given loop, cv was the square root of a given dispersion, effect was the fold change of observed counts for a given fold change due to looping counts, and n was the number of given replicates. For the purpose of our analysis, dispersion values ranging from 0.001 to 0.1, replicate values ranging from 2 to 10, and fold change values ranging from 1.1 to 10 were used.
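To make the roles of these inputs concrete, the sketch below evaluates a normal-approximation power formula based on the sample-size relationship described by Hart et al. (2013), rearranged to solve for power. It is not the RNASeqPower source code, and it is an assumption that it tracks `rnapower()` closely; the function name `approx_power` and the example loop count of 14 000 are chosen for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def approx_power(depth: float, n: int, cv: float, effect: float,
                 alpha: float) -> float:
    """Approximate power of a two-sided test with n replicates per condition.

    depth  : counts for the feature (here, observed counts at the loop pixel)
    cv     : coefficient of variation (square root of the dispersion)
    effect : fold change to detect (here, the observed fold change)
    alpha  : per-test false positive rate
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Standard error of the log fold change between two groups of n replicates.
    se_log_fc = math.sqrt(2 * (1 / depth + cv ** 2) / n)
    return STD_NORMAL.cdf(abs(math.log(effect)) / se_log_fc - z_alpha)


# Inputs mirroring the text: Bonferroni-style alpha, cv from a dispersion.
n_loops = 14_000                 # illustrative number of loops
alpha = 0.05 / n_loops
cv = math.sqrt(0.001)
power = approx_power(depth=159, n=2, cv=cv, effect=1.77, alpha=alpha)
```

In this approximation, power rises with counts, replicates, and effect size, and falls as dispersion increases or alpha becomes more stringent.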

For each combination of parameters, the percentage of well-powered loops was calculated. A well-powered loop was one where the power was greater than a set threshold, either 0.8 or 0.9 for a given fold change. A threshold of 0.8 to detect a 2-fold change was used for all analyses described here. For our initial analyses (Figs 1 and 2), we used the following parameters: 2 replicates, 2 billion reads per replicate, 0.001 dispersion. These were chosen to reflect the same values from a previous differential Hi-C study (Phanstiel et al. 2017), allowing for comparison between these two datasets (Supplementary Fig. S1).
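The per-combination summary can be sketched as a loop over a parameter grid. Everything below is illustrative: the toy loops are made up, the power function is a normal approximation in the spirit of Hart et al. (2013) rather than the RNASeqPower implementation, and the grid values are a small subset chosen for the example.

```python
import math
from itertools import product
from statistics import NormalDist

STD_NORMAL = NormalDist()


def observed_fold_change(observed, expected, fc_looping):
    """Fold change in total counts when only the looping counts change."""
    return ((observed - expected) * fc_looping + expected) / observed


def approx_power(depth, n, cv, effect, alpha):
    """Normal-approximation power for n replicates per condition (a sketch)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    se_log_fc = math.sqrt(2 * (1 / depth + cv ** 2) / n)
    return STD_NORMAL.cdf(abs(math.log(effect)) / se_log_fc - z_alpha)


def percent_well_powered(loops, n, dispersion, fc_looping=2.0,
                         threshold=0.8, family_alpha=0.05):
    """Percent of (observed, expected) loops whose power exceeds threshold."""
    if not loops:
        raise ValueError("at least one loop is required")
    alpha = family_alpha / len(loops)  # 0.05 divided by the number of loops
    cv = math.sqrt(dispersion)
    well_powered = sum(
        approx_power(o, n, cv, observed_fold_change(o, e, fc_looping), alpha)
        > threshold
        for o, e in loops
    )
    return 100.0 * well_powered / len(loops)


# Hypothetical (observed, expected) counts for four loops.
toy_loops = [(159, 36), (80, 30), (40, 20), (15, 9)]
grid = {
    (n, dispersion): percent_well_powered(toy_loops, n, dispersion)
    for n, dispersion in product((2, 4, 10), (0.001, 0.01, 0.1))
}
```

Each entry of `grid` plays the role of one point in a summary of well-powered loops across replicate and dispersion settings; repeating the calculation on counts from each subsampled depth would add the sequencing-depth dimension.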

Figure 1. — Counts and power decrease with increasing loop size. (A) Median counts per loop as a function of genomic distance are plotted as a solid dark purple line. Dark and light purple shaded areas represent inner interquartile and interdecile ranges, respectively. The dotted black line represents the median value of all (loop and non-looped) pixels at each 10 kb binned genomic distance. (B) Effect of loop size on power when dispersion, alpha, and effect are held constant at 0.001, 0.05, and 2, respectively. This distance-dependent effect is due only to changes in counts and is investigated across different replicate values. (C) Distribution of loop sizes showing a median loop size of 300 kb. (D) An example region on Chromosome 6 with loops labeled in order of increasing distance. The observed contacts and the power to detect a 2-fold change in counts due to looping interactions are listed for each loop pixel.

2.3 Extending Poweraid predictions to diverse datasets

Poweraid is built using the Hi-C data from GM12878 cells in Rao et al. (2014). In order to provide clearer recommendations, all Hi-C datasets from Rao et al. (2014) along with Hi-C datasets from Phanstiel et al. (2017) and Bond et al. (2023) were processed and the percentages of reads that resulted in contacts were reported (Supplementary Table S1 and Supplementary Fig. S2). Dispersion for each dataset was calculated using the common dispersion from the estimateDisp function in the edgeR package (Robinson et al. 2010). Power for each loop for each dataset was calculated using these dispersions, the average counts per loop across replicates, and an observed fold change corresponding to a 2-fold change in looping. The percent of well-powered loops was calculated for each dataset, along with the predicted percent of well-powered loops using the closest available values for dispersion, sequencing depth, and replicates in Poweraid. Since these values are only estimates, they are rounded to the closest percentage. The differences between these calculated and predicted values are shown in Supplementary Fig. S3.

2.4 Visualizations

To reduce noise and to aid in visualizations, the observed and expected counts used for Figs 1 and 2 were fit to a power law curve using aomisc (https://github.com/OnofriAndreaPG/aomisc). All figures were generated by use of the Bioconductor package plotgardener (Kramer et al. 2022).
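For readers unfamiliar with power law smoothing, the sketch below fits counts ≈ a × distance^b by ordinary least squares on log-transformed values. This is a simple stand-in technique chosen for illustration; the text does not describe the fitting procedure that aomisc uses internally, and the data here are synthetic.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(distances: Sequence[float],
                  counts: Sequence[float]) -> Tuple[float, float]:
    """Return (a, b) for counts ~ a * distance**b via log-log least squares."""
    if len(distances) != len(counts) or len(distances) < 2:
        raise ValueError("need at least two matched (distance, count) pairs")
    xs = [math.log(d) for d in distances]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                          # slope in log-log space (the exponent)
    a = math.exp(mean_y - b * mean_x)      # intercept back-transformed to a scale
    return a, b


# Synthetic median counts that follow an exact power law (hypothetical values).
distances = [50_000, 100_000, 200_000, 500_000, 1_000_000]
counts = [5e6 * d ** -1.0 for d in distances]
a, b = fit_power_law(distances, counts)
smoothed = [a * d ** b for d in distances]
```

A log-log fit weights relative rather than absolute errors, so it can differ from a direct nonlinear fit on noisy data.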

3 Results

3.1 Loop size is anti-correlated with sequencing counts

One of the key determinants of power in genomic experiments is the number of sequencing counts attributed to the feature being measured, in this case, loops. One of the unique aspects of Hi-C data is that on average, sequencing counts decrease as a function of genomic distance. We modeled this trend to understand how it affects the power to detect differential loops at various distances. Using deeply sequenced Hi-C data from Rao et al., which include roughly 5 billion unique contacts in GM12878 cells, we extracted the observed counts and expected counts for all loops at a 10 kb resolution and plotted the median counts as well as interquartile and interdecile ranges of observed counts (Fig. 1A). These counts decrease as genomic distance increases, and the relationship follows a power law curve as previously observed. This suggests that the size of a differential loop is closely correlated with the power we have to detect it.

To elucidate this relationship, we calculated the power to detect differential loops [via the RNASeqPower package (Hart et al. 2013)] using the median counts for each genomic distance as values for depth (Fig. 1B). Although initially developed for RNA-seq data, RNASeqPower’s statistical principles are applicable to Hi-C count data due to the shared count-based nature of RNA-seq and Hi-C data. While the specific biological characteristics differ, the overarching statistical principles that the package employs, such as assessing the impact of sample size or replicates on statistical power, are relevant across count-based datasets. By leveraging RNASeqPower’s robust framework, we can assess statistical power for Hi-C counts, ensuring reliable power calculations for differential Hi-C data analysis. We held dispersion, alpha, and fold change constant at 0.001, 0.05, and 2, respectively. A dispersion of 0.001 was chosen based on dispersions of some of the more deeply sequenced human Hi-C datasets (Supplementary Table S1). As expected, power decreases sharply with increasing loop size, an effect that is slightly alleviated with increased sequencing replicates. The distribution of loop sizes is skewed toward the shorter range of the distances shown (Fig. 1C). These trends were similar for other Hi-C datasets (Phanstiel et al. 2017) investigated (Supplementary Fig. S1). How the distributions of loop sizes and power intersect to affect the overall power of a differential Hi-C experiment will be explored in more detail later. Examples of loops at varying distances and their associated statistical power are depicted in Fig. 1D.

3.2 Loop size is anti-correlated with fold change compression

A second key determinant of statistical power is the effect size—or fold change—of the features of interest. However, since only a fraction of the counts at a given loop pixel arise from looping interactions, observed fold changes in a Hi-C experiment are smaller than the actual changes in chromatin looping. To illustrate this compression, we consider a single 350 kb loop from the GM12878 dataset acquired by Rao et al. (2014) (Fig. 2A). There are 159 observed counts at this loop pixel. Polymer interactions for this pixel can be estimated by calculating the median interactions for all pixels connecting loci 350 kb apart. Because the vast majority of pixels do not represent a chromatin loop, the resulting value is called the “expected” counts since this is the number of counts that would be expected in the absence of a chromatin loop. For the loop in question, this provides a value of 36 counts, or 23% of the observed counts. Looping interactions can be estimated by subtracting the expected counts from the observed counts which gives us 123, or 77% of the observed counts. Therefore, a 2-fold increase in looping interactions (123×2 = 246) would actually only be observed as a 1.77-fold increase in observed counts (Fig. 2A) since the expected counts would remain the same.
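The decomposition into expected and looping counts can be sketched as below. The pixel list is made up so that the median at 350 kb matches the value quoted above, and the sketch assumes that every pixel at a given distance is present in the input, including pixels with zero counts.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple

Pixel = Tuple[int, int, float]  # (bin1_start, bin2_start, count)


def expected_by_distance(pixels: Iterable[Pixel]) -> Dict[int, float]:
    """Median count of all pixels at each genomic distance."""
    counts_at = defaultdict(list)
    for start1, start2, count in pixels:
        counts_at[abs(start2 - start1)].append(count)
    return {dist: median(counts) for dist, counts in counts_at.items()}


def looping_fraction(observed: float, expected: float) -> float:
    """Share of observed counts attributed to looping interactions."""
    return (observed - expected) / observed


# Hypothetical pixels, all 350 kb apart; the last one is the loop pixel.
background = [33, 35, 36, 36, 38, 40]
pixels = [(i * 10_000, i * 10_000 + 350_000, c) for i, c in enumerate(background)]
pixels.append((1_000_000, 1_350_000, 159))

expected = expected_by_distance(pixels)[350_000]
fraction = looping_fraction(159, expected)
print(expected, round(fraction, 2))  # 36 0.77
```

Using the median rather than the mean keeps the few loop pixels at each distance from inflating the expected value.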

We next sought to explore this effect across all loops identified in a given Hi-C experiment. Using the same approach described above, we estimated the percentage of counts due to looping interactions for all loops (Fig. 2B). As loop size increases, we observe a higher percentage of counts due to looping. We next explored how this distance-dependent change in the composition of sequencing counts impacts fold change compression. For every loop, we calculated the percentage of counts due to looping and calculated what the observed fold change would be given a 2, 3, or 4-fold change in looping. The median as well as interquartile and interdecile ranges are plotted in Fig. 2C. As expected, fold change compression was observed for all loop sizes; however, the magnitude of fold change compression decreased with increasing loop size, meaning longer-range loops typically have higher observed fold changes for the same change in looping counts.

To determine how these distance-dependent effects on fold change compression impact statistical power, we calculated the power to detect the median values of observed fold change for each loop size (Fig. 2D). Dispersion, alpha, replicates, and counts were held constant at 0.001, 0.05, 2, and 50, respectively. Plotting the resulting power reveals the opposite trend to the one we observed when considering counts as a function of distance. That is, the inverse correlation between sequencing counts and loop size suggests a negative correlation between loop size and power, whereas the inverse correlation between fold change compression and loop size suggests a positive correlation between loop size and power. However, these models were built by isolating individual variables and using only median values of sequencing counts and fold change per loop size bin. How these features intersect to determine power on a per-loop basis in real datasets and how these relationships impact experimental design is explored below.

3.3 Recommendations for maximizing power

To determine optimal experimental parameters, such as sequencing depth and replicates, we calculated power for every loop in the deeply sequenced dataset created by Rao et al. (2014). We subsampled the dataset to 10 different sequencing depths and calculated power using a range of both dispersions and replicates. Observed fold changes were modeled using the fold change compression relationships for each loop as calculated in Fig. 2. For the case of this analysis, a well-powered loop is defined as one which has 80% or higher power to detect a 2-fold change in looping. As expected, a higher overall sequencing depth results in increased power. Even at relatively high sequencing depths (i.e. 6 billion contacts per condition), power is highly distance dependent with shorter loops being far more well-powered than longer loops. The percent of well-powered loops at each distance is shown in Fig. 3B and C. Fortunately, loop size distributions are skewed to shorter loops so the low power to detect loops >1 Mb does not heavily influence the overall power of the experiment. Nevertheless, for proper interpretation of differential Hi-C experiments, it is important to consider this bias toward the detection of shorter loops.

Figure 3. — Power increases with replicates and sequencing depth. (A) For each combination of replicates (2–10), dispersion (0.001, 0.01, and 0.04), and sequencing depth per replicate (50 million, 100 million, 250 million, 500 million, 750 million, 1 billion, 2 billion, 3 billion, 4 billion, and 5 billion contacts), the power to detect a 2-fold change in looping was calculated. Since each individual line represents a replicate number, the maximum of total contacts per line ranges from 10 billion (2 reps × 5 B contacts) to 50 billion (10 reps × 5 B contacts). Loops with a power above 0.8 were designated as “well-powered,” and the percentage of these loops is represented on the y axis. This percentage is investigated across the total sequencing depth per condition (multiplying replicates by the sequencing depth per replicate). For the lowest dispersion, we highlight two different scenarios: a recommendation of 6 billion contacts per condition to achieve ∼50% well-powered loops, and a more ideal scenario of 25 billion contacts per condition to achieve ∼90% well-powered loops. (B and C) The percent of well-powered loops (B) and number of loops (C) in Rao et al.’s GM12878 data for the recommended 6 billion contacts per condition, using dispersion and replicate values of 0.001 and 2, respectively. The power to detect a 2-fold change in looping interactions was calculated per loop, and the percentages and numbers of loops with a power over 0.8 were calculated per 10 kb bin. (D and E) The percent of well-powered loops (D) and number of loops (E) in Rao et al.’s GM12878 data for the ideal 25 billion contacts per condition, using dispersion and replicate values of 0.001 and 2, respectively. The power to detect a 2-fold change in looping interactions was calculated per loop, and the percentages and numbers of loops with a power over 0.8 were calculated per 10 kb bin.

Another important parameter affecting power is dispersion. Dispersion quantifies the variability in the data across replicates, capturing both biological and technical sources of variation within one condition. The relationship between dispersion and power is intricate and depends on various factors. For the human cell line Hi-C data (Rao et al. 2014, Phanstiel et al. 2017, Bond et al. 2023), dispersions ranged from ∼0.0008 to 0.04 (Supplementary Table S1). While high dispersion decreases power by introducing additional noise, increasing the number of replicates can mitigate its negative impact. As the number of replicates increases, the experimental design becomes more robust against variations introduced by dispersion, thereby enhancing the overall power of the study.

Based on these findings, we recommend sequencing to a depth of at least 6 billion contacts per condition. This will ensure that roughly 50% of loops will exhibit 80% power to detect a 2-fold change in looping. These calculations assume that replicates come from the same cell line and are grown and treated in reproducible ways. For higher dispersion experiments, e.g. comparing datasets from different individuals, even higher sequencing depth is required. In any case, more replicates are always as good as or better than fewer replicates. Therefore, in general, we recommend performing as many replicates as possible. In our experience, most Hi-C libraries are not complex enough to produce more than 1 billion unique contacts, so reaching 6 billion contacts likely requires many replicates anyway. Sequencing depths exceeding 25 billion contacts would be required to achieve 90% well-powered loops, even with very low dispersion. Such sequencing depths are not currently tractable for most researchers but may become so in the near future as we discuss later.
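As a rough planning aid, the number of libraries implied by a contact target and a per-library complexity ceiling can be computed as below. The function name is hypothetical, and the sketch assumes that every library reaches its ceiling and that unique contacts from separate libraries simply add up.

```python
def libraries_needed(target_contacts: int, max_contacts_per_library: int) -> int:
    """Smallest number of libraries whose combined contacts reach the target."""
    if target_contacts <= 0 or max_contacts_per_library <= 0:
        raise ValueError("both arguments must be positive")
    # Integer ceiling division avoids floating-point rounding at large counts.
    return -(-target_contacts // max_contacts_per_library)


per_library_ceiling = 1_000_000_000  # about 1 billion unique contacts per library
recommended = libraries_needed(6_000_000_000, per_library_ceiling)
ideal = libraries_needed(25_000_000_000, per_library_ceiling)
print(recommended, ideal)  # 6 25
```

Libraries that fall short of the ceiling would raise these numbers, so they are best read as lower bounds per condition.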

3.4 Hi-C Poweraid: a web application for differential Hi-C experiment design

Because planning a new Hi-C experiment involves many parameters and can quickly become complex and overwhelming, we built Hi-C Poweraid, an R Shiny app, to facilitate this planning (http://phanstiel-lab.med.unc.edu/poweraid/). Hi-C Poweraid consists of two main tabs: Tab 1 for investigating power across different sequencing depths (Supplementary Fig. S5A) and Tab 2 for investigating power across different loop sizes (Supplementary Fig. S5B). In Tab 1, the user can specify a range of replicate values, a power threshold, and one or more dispersion values. A well-powered loop is defined as one with a power above the given power threshold to detect a 2-fold change in looping interactions. This tab is useful for determining the ideal total sequencing depth and number of replicates for an experiment, and for investigating how differing dispersions can affect that decision. Tab 2 provides more granular information, albeit on just one set of parameters at a time. Here, a user can make fine-tuned selections of fold change, dispersion, replicate number, and sequencing depth per replicate. The plots provided can then be used to determine power as a function of loop size, which can help inform both the planning and interpretation of differential Hi-C experiments. For both tabs, we make use of interactive plotly plots, which allow for features such as pan and zoom, region selections, trace isolation, and hover effects to glean more specific information about certain regions or points on the plot (Sievert 2020).

4 Discussion

Due to the expense and difficulty of performing Hi-C experiments, it is critical to first perform a careful power analysis. A power analysis can also help inform the interpretation of experiments; e.g. it can help scientists determine if the small number of differential loops is due to a similarity between the samples or just a lack of power. It can also help explain biases in the sets of differential loops detected (e.g. shorter loops). To address this, we performed a rigorous power analysis of differential Hi-C experiments and developed a web application to facilitate interrogation of the resulting data. As a result of this analysis, we recommend designing experiments with at least 6 billion contacts per condition, split between two or more replicates to achieve a power above 0.8 to detect a 2-fold change for over 50% of loops. A more ideal, albeit currently infeasible, design would include 25 billion contacts per condition to achieve a power above 0.8 to detect a 2-fold change for over 90% of loops. The analyses and web app described here can aid in the effective use of time and resources and in justifying plans, costs, and resource distribution when proposing new experiments.

However, several caveats pertain to these estimates. First, “Hi-C contacts” refers to reads with duplicates removed, so actual sequencing depths need to be even higher than the numbers quoted here in order to achieve the appropriate number of contacts. The percentage of reads that result in Hi-C contacts varies widely based on library quality, complexity, and sequencing depth. For the human cell lines in Rao et al. (2014), these percentages range from 73% to 85%, with a mean of 81% (Supplementary Table S1 and Supplementary Fig. S2). In general, it is recommended to use Hi-C Poweraid to estimate power for human cell lines, as there are other features of non-human genomes that are not addressed here.
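The gap between contacts and sequenced reads can be estimated with a one-line conversion. The sketch below treats the contact fraction as a fixed number, which is a simplification given that it varies with library quality, complexity, and sequencing depth; the function name is hypothetical.

```python
def reads_required(target_contacts: float, contact_fraction: float) -> float:
    """Sequenced reads needed if contact_fraction of reads become contacts."""
    if not 0.0 < contact_fraction <= 1.0:
        raise ValueError("contact_fraction must be in (0, 1]")
    return target_contacts / contact_fraction


target = 6e9  # recommended contacts per condition
# Lowest, mean, and highest contact fractions reported for Rao et al. (2014).
estimates = {frac: reads_required(target, frac) for frac in (0.73, 0.81, 0.85)}
print(round(estimates[0.81] / 1e9, 1))  # 7.4 (billion reads)
```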

Second, while Hi-C Poweraid provides good general guidelines for experimental design, optimal design is difficult to pinpoint. Different experimental designs and protocols have different dispersions that can be hard to predict but have an important impact on power. For experiments performed on multiple replicates of the same cell line, low dispersion values are expected and deep sequencing of a small number (e.g. 2) of replicates is sufficient for optimal power. Experiments in which replicates represent different donors or animals are likely to exhibit higher dispersion values and more replicates may be required to reach similar power thresholds. However, exact values for dispersion based on different experimental designs are difficult to determine. When estimating a dispersion to use for Hi-C Poweraid, we advise using the dispersion from other, similar experiments in the lab, a collaborator’s lab, or from publicly available Hi-C data. If a dispersion cannot be estimated, it is recommended to use as many replicates as is feasible, as increasing replicates is likely to increase the overall power of the experiment.

Third, Poweraid is also built on the assumption that an experiment is using Hi-C counts at a 10 kb resolution. With bin sizes smaller than 10 kb (i.e. finer resolution), counts per loop decrease, which drastically reduces the power to detect differences. For example, for an experiment with Hi-C counts at 5 kb resolution and a dispersion of 0.001, it would require at least 15 billion total contacts to reach over 50% of well-powered loops (Supplementary Fig. S4). For bin sizes larger than 10 kb, there will be more counts per loop, but we begin to lose the ability to detect very short-range loops. Additionally, it becomes more difficult to distinguish loop pixels from the local background at larger bin sizes because each pixel condenses a larger region of data into one summarized count. Beyond this, these results pertain to Hi-C data only and it is currently unclear how these recommendations apply to other methods, such as Micro-C, HiChIP, ChIA-PET, and capture Hi-C. Estimates are likely to be comparable for Micro-C, as the experimental design and resulting data are similar; however, protocols that involve regional enrichment (e.g. HiChIP, ChIA-PET, capture Hi-C, etc.) will require their own power analysis. Finally, while the trends are likely to remain the same, the exact values depicted here and in the Poweraid app only apply to human Hi-C datasets.

We have found that optimally powered Hi-C experiments require far deeper sequencing than is typically performed or feasible for most labs (e.g. 25 billion contacts per condition). While current sequencing costs largely inhibit such experiments, sequencing costs have decreased drastically over the past two decades, and are likely to continue decreasing (Hayden 2014, Wetterstrand 2019). Multiple emerging sequencing technologies have the potential to decrease sequencing costs by 60% or more (Almogy et al. 2022). As newer technologies arise and are adopted, and as sequencing costs continue to decrease, the recommendations for sequencing depth proposed here will become more affordable and attainable.

Hi-C Poweraid is a useful tool that enables accurate Hi-C power estimates without the need to generate costly preliminary datasets or conduct complex computational analyses. These estimates will help facilitate grant proposals and improve experimental planning, which will ultimately translate into more robust scientific results. Well-planned experiments will allow time and resources to be allocated more efficiently, allow for more accurate interpretation of results, and expedite scientific progress.

Acknowledgements

We thank Erika Deoudes for graphic design and typesetting.

Contributor Information

Sarah M Parker, Curriculum in Bioinformatics and Computational Biology, Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, United States.

Eric S Davis, Curriculum in Bioinformatics and Computational Biology, Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, United States.

Douglas H Phanstiel, Curriculum in Bioinformatics and Computational Biology, Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, United States; Curriculum in Genetics and Molecular Biology, Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, United States; Thurston Arthritis Research Center, University of North Carolina, Chapel Hill, NC, 27599, United States; Department of Cell Biology and Physiology, University of North Carolina, Chapel Hill, NC, 27599, United States; Lineberger Comprehensive Cancer Center, The University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, United States.

Supplementary data

Supplementary data are available at Bioinformatics Advances online.

Conflict of interest

None declared.

Funding

This work was supported by the National Institutes of Health [R35-GM128645 to D.H.P., T32-GM067553 to E.S.D.]; and the National Science Foundation Graduate Research Fellowship Program [DGE-1650116 to S.M.P.]. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Data availability

Hi-C data can be accessed through SRA accession PRJNA385337 (Phanstiel et al. 2017) and GEO accession GSE63525 (Rao et al. 2014). Hi-C Poweraid is available as an R Shiny application deployed at https://phanstiel-lab.med.unc.edu/poweraid/. The R Shiny application code is available at https://github.com/sarmapar/poweraid.

References

Almogy G, Pratt M, Oberstrass F et al. Cost-efficient whole genome-sequencing using novel mostly natural sequencing-by-synthesis chemistry and open fluidics platform. bioRxiv, 2022.05.29.493900, 2022, preprint: not peer reviewed.
Bond ML, Davis ES, Quiroga IY et al. Chromatin loop dynamics during cellular differentiation are associated with changes to both anchor and internal regulatory features. Genome Res 2023;33:1258–68.
Conte M, Irani E, Chiariello AM et al. Loop-extrusion and polymer phase-separation can co-exist at the single-molecule level to shape chromatin folding. Nat Commun 2022;13:4070.
Dekker J, Rippe K, Dekker M et al. Capturing chromosome conformation. Science 2002;295:1306–11.
Dostie J, Richmond TA, Arnaout RA et al. Chromosome conformation capture carbon copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res 2006;16:1299–309.
Durand NC, Robinson JT, Shamim MS et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst 2016;3:99–101.
Fudenberg G, Imakaev M, Lu C et al. Formation of chromosomal domains by loop extrusion. Cell Rep 2016;15:2038–49.
Greenwald WW, Li H, Benaglio P et al. Subtle changes in chromatin loop contact propensity are associated with differential gene regulation and expression. Nat Commun 2019;10:1054.
Guo Y, Zhao S, Li C-I et al. RNAseqPS: a web tool for estimating sample size and power for RNAseq experiment. Cancer Inform 2014;13:1–5.
Hart SN, Therneau TM, Zhang Y et al. Calculating sample size estimates for RNA sequencing data. J Comput Biol 2013;20:970–8.
Hayden C. Technology: the $1,000 genome. Nature 2014;507:294–5.
Kagey MH, Newman JJ, Bilodeau S et al. Mediator and cohesin connect gene expression and chromatin architecture. Nature 2010;467:430–5.
Kramer NE, Davis ES, Wenger CD et al. Plotgardener: cultivating precise multi-panel figures in R. Bioinformatics 2022;38:2042–5.
Li C-I, Samuels DC, Zhao Y-Y et al. Power and sample size calculations for high-throughput sequencing-based experiments. Brief Bioinform 2018;19:1247–55.
Lieberman-Aiden E, van Berkum NL, Williams L et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009;326:289–93.
Malod-Dognin N, Pancaldi V, Valencia A et al. Chromatin network markers of leukemia. Bioinformatics 2020;36:i455–63.
Phanstiel DH, Van Bortle K, Spacek D et al. Static and dynamic DNA loops form AP-1-Bound activation hubs during macrophage development. Mol Cell 2017;67:1037–48.e6.
Rao SSP, Huntley MH, Durand NC et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 2014;159:1665–80.
Reed KSM, Davis ES, Bond ML et al. Temporal analysis suggests a reciprocal relationship between 3D chromatin structure and transcription. Cell Rep 2022;41:111567.
Robinson MD, McCarthy DJ, Smyth GK et al. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010;26:139–40.
Rowley MJ, Poulet A, Nichols MH et al. Analysis of Hi-C data using SIP effectively identifies loops in organisms from C. elegans to mammals. Genome Res 2020;30:447–58.
Siersbæk R, Madsen JGS, Javierre BM et al. Dynamic rewiring of Promoter-Anchored chromatin loops during adipocyte differentiation. Mol Cell 2017;66:420–35.e5.
Sievert C. Interactive Web-Based Data Visualization with R, plotly, and shiny. Boca Raton, FL, USA: CRC Press, 2020.
Simonis M, Klous P, Splinter E et al. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture–on-chip (4C). Nat Genet 2006;38:1348–54.
Tang Z, Luo OJ, Li X et al. CTCF-mediated human 3D genome architecture reveals chromatin topology for transcription. Cell 2015;163:1611–27.
Vieth B, Ziegenhain C, Parekh S et al. powsimR: power analysis for bulk and single cell RNA-seq experiments. Bioinformatics 2017;33:3486–8.
Wetterstrand KA. The Cost of Sequencing a Human Genome. Genome.gov. 2019.
Zuo C, Keleş S. A statistical framework for power calculations in ChIP-seq experiments. Bioinformatics 2014;30:753–60.
