Genome Research. 2021 May;31(5):877–889. doi: 10.1101/gr.269209.120

Correcting signal biases and detecting regulatory elements in STARR-seq data

Young-Sook Kim 1,2,3,4,5, Graham D Johnson 1,2,3,4, Jungkyun Seo 1,2,3,4,5, Alejandro Barrera 1,2,3,4, Thomas N Cowart 1,4, William H Majoros 1,3,4,5, Alejandro Ochoa 1,4,5, Andrew S Allen 1,2,4,5, Timothy E Reddy 1,2,3,4,5
PMCID: PMC8092017  PMID: 33722938

Abstract

High-throughput reporter assays such as self-transcribing active regulatory region sequencing (STARR-seq) have made it possible to measure regulatory element activity across the entire human genome at once. The resulting data, however, present substantial analytical challenges. Here, we identify technical biases that explain most of the variance in STARR-seq data. We then develop a statistical model to correct those biases and to improve detection of regulatory elements. This approach substantially improves precision and recall over current methods, improves detection of both activating and repressive regulatory elements, and controls for false discoveries despite strong local correlations in signal.


Gene regulation is of foundational importance to nearly all biological processes, and variation in gene regulatory activity plays a major role in human disease risk (Lee and Young 2013; Parker et al. 2013; Finucane et al. 2015). A major step toward measuring regulatory activity across the human genome has been the development of high-throughput reporter assays such as STARR-seq (Arnold et al. 2013) that allow regulatory element activity to be quantified with high-throughput sequencing rather than with optical detection of a fluorescent or luminescent signal.

High-throughput reporter assays create substantial analytical challenges that are distinct from other sequencing-based genomic assays. There is significant local variation in high-throughput reporter assay signal. We show here that, across data from several laboratories, most of that variation can be explained by features of the underlying genomic sequence and experimental procedures rather than by regulatory element activity. For example, nucleotide composition can alter PCR efficiency, leading to under- and overrepresentation of some sequences. Meanwhile, highly repetitive sequences often do not align uniquely to the human reference genome, also biasing signal estimates. Two further challenges are that STARR-seq signals can be positive or negative, reflecting activation and repression, and that the boundaries of regulatory elements are typically unknown and must therefore be estimated from the data. Together, those challenges distort signal representations, hinder estimation of regulatory element activity, and cause false positives and false negatives when left unaddressed.

Taken together, key requirements of statistical methods to analyze STARR-seq data are the ability to identify and estimate the effect of both activating and repressing regulatory elements while also correcting for underlying sequence biases in high-throughput reporter assays. A statistical model was recently introduced that corrects technical biases and detects regulatory elements in STARR-seq, but the model is limited to detecting only activating regulatory elements (Lee et al. 2020). Considering repression is a crucial gene regulation mechanism (Courey and Jia 2001), overlooking repressive elements may limit understanding of gene regulation with STARR-seq. To overcome that challenge, our correcting reads and analysis of differentially active elements (CRADLE) model takes a two-step approach. First, CRADLE uses a generalized linear regression model to estimate and correct major biases that we have identified in STARR-seq data. Next, CRADLE detects regions with statistically significant regulatory activity from the bias-corrected signals while rigorously controlling FDR. In doing so, CRADLE substantially improves the use of STARR-seq by providing a robust estimation of regulatory activity and improved visualization of raw signals.

Results

DNA sequence biases STARR-seq signals

To identify sources of signal variance in STARR-seq, we analyzed data from two whole-genome STARR-seq studies completed in different laboratories and in different human cell models: A549 (Johnson et al. 2018) and HeLa-S3 cells (Muerdter et al. 2018). Each study followed a similar protocol in which an input STARR-seq library was generated by cloning randomly fragmented genomic DNA into the 3′ untranslated region (UTR) of a reporter gene. The input library was then assayed by transfecting it into cultured human cells where the cloned DNA fragments regulate their own transcription into mRNA. The expression of each random fragment as mRNA was then measured with high-throughput sequencing. Finally, regulatory activity was estimated by comparing the expression of each fragment in the output library relative to its abundance in the input library.
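As a minimal illustration of the last step, the following Python sketch compares output and input abundance per window as a log2 ratio of RPKM-normalized counts. The function names, the toy counts, and the pseudocount are assumptions made for illustration; this is not the normalization code used in either study.

```python
import numpy as np

def rpkm(counts, window_bp, total_reads):
    """Reads per kilobase per million mapped reads for each window."""
    counts = np.asarray(counts, dtype=float)
    return counts / (window_bp / 1e3) / (total_reads / 1e6)

def log2_activity(output_counts, input_counts, window_bp, pseudocount=1.0):
    """log2(output / input) per window after RPKM normalization.

    The pseudocount is an illustrative assumption that avoids division by zero.
    Library size is taken as the sum of the toy counts (also an assumption).
    """
    out = rpkm(output_counts, window_bp, np.sum(output_counts))
    inp = rpkm(input_counts, window_bp, np.sum(input_counts))
    return np.log2((out + pseudocount) / (inp + pseudocount))

# Toy data: four 500-bp windows with equal input abundance.
input_counts = np.array([100, 100, 100, 100])
output_counts = np.array([100, 400, 25, 100])
activity = log2_activity(output_counts, input_counts, window_bp=500)
```

In this toy example the second window has the highest ratio and the third the lowest, the pattern expected for an activating and a repressive fragment, respectively.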

The STARR-seq input libraries showed substantially more signal variance than is observed in the controls for other genomic assays such as for ChIP-seq (Fig. 1A,B; The ENCODE Project Consortium 2012). That variance in STARR-seq input signal was consistent across replicates and between studies (Fig. 1C), more so than for ChIP-seq input signal (Supplemental Fig. S1). Here, we analyzed four potential sources of variance in STARR-seq signal: (1) DNA structure influencing DNA fragmentation, cloning or other enzymatic reactions, thus affecting which DNA fragments are available in the assay library (Poptsova et al. 2014); (2) differences in the Gibbs free energy of DNA fragments influencing multiplex PCR efficiency, leading to preferential amplification of some fragments (Cheung et al. 2011; Benjamini and Speed 2012; Hansen et al. 2012; Jiang et al. 2015; Love et al. 2016; Teng and Irizarry 2017); (3) G-quadruplexes in the genome impairing amplification by DNA polymerase (Chambers et al. 2015; Rhodes and Lipps 2015); and (4) biases resulting from differences in the mappability of short read sequences to the reference human genome, for example, owing to repetitive sequences (Derrien et al. 2012).

Figure 1.

Technical biases affect STARR-seq signal. (A) STARR-seq input libraries have higher signal variance than ChIP-seq input control libraries. Variance in per-base signal in individual RPKM-normalized libraries is plotted for Chromosome 1. The error bars indicate variance between replicates. The number of replicates plotted is as follows: six replicates for STARR-seq A549 data; two replicates for STARR-seq HeLa-S3 data; three replicates for ChIP-seq A549 data; two replicates for ChIP-seq HeLa-S3 data; three replicates for ChIP-seq LNCaP data. (B) Representative browser signal tracks are shown for STARR-seq and ChIP-seq input libraries (Chr 1: 11,197,048–11,236,707). Signals are RPKM-normalized. (C) Pearson's correlations of STARR-seq input library signals in 1-bp windows along Chromosome 1. (D) DNA sequence biases impact STARR-seq signals. STARR-seq signals are plotted for 500-bp windows with varying degrees of bias for the following physical properties of DNA: fragment-end DNA structures, Gibbs free energy, G-quadruplex structure, and mappability. Whiskers extend 1.5 times the interquartile range. Center lines in the boxes show the medians. In plots of fragment-end bias, minor groove width (MGW) and propeller twist (ProT) are plotted and the ideal is log2(Freq in input/Freq in ref)=0. In plots of other biases, the ideal line is the median signal. (E) PCR amplification introduces bias into STARR-seq libraries. The impact of Gibbs free energy bias is shown for PER1 BAC libraries amplified with different numbers of PCR cycles (3, 6, 12, and 18 cycles). Each point represents the sum of signals in a 500-bp window from three technical replicates. The solid line is a lowess fit line. The dashed ideal line is the median signal across all windows.

We found evidence that each source of bias influences the signal observed when sequencing STARR-seq libraries. To model biases attributed to DNA secondary structure, we computationally estimated the minor groove width (MGW) and propeller twist (ProT) at the 5′ ends of DNA fragments (Zhou et al. 2013). We analyzed 5′ ends of DNA fragments because they are not modified in the end-repairing process of generating STARR-seq libraries (Poptsova et al. 2014). We observed distinct biases in 5′ MGW and ProT (Fig. 1D). Consistent with signal biases caused by preferential fragmentation, ApG and GpG dinucleotides are most underrepresented at the 5′ ends of STARR-seq fragments, which were previously reported to be less prone to shearing (Supplemental Fig. S2; Poptsova et al. 2014). To estimate biases caused by differences in the thermodynamic stability of complementary DNA strands, G-quadruplex structure, and mappability, we binned the genome into 500-bp windows. We then used data from previous studies to estimate the Gibbs free energy of the duplexed DNA strands (Protozanova et al. 2004), stability of G-quadruplex structure (Chambers et al. 2015), and the fraction of redundant mappable positions in the reference genome for each window (Fig. 1D; Derrien et al. 2012). Fragments with the highest Gibbs free energy, highly stable G-quadruplex structure, and low mappability all had substantially depleted STARR-seq signals. Those trends were consistent across both whole-genome STARR-seq studies (Johnson et al. 2018; Muerdter et al. 2018).

To evaluate whether biases in estimated Gibbs free energy are caused by differences in PCR amplification efficiency, we generated DNA fragment libraries from a bacterial artificial chromosome (BAC) using between three and 18 cycles of PCR. The BAC contained 211 kb surrounding the PER1 gene on human Chromosome 17. DNA fragments with extreme Gibbs free energy were depleted from the final library, and particularly so after 12 or more PCR cycles (Fig. 1E). That observation also indicates that signal from output STARR-seq libraries will have more severe PCR-related biases than that from input libraries owing to the additional 15–16 PCR cycles used (Johnson et al. 2018; Muerdter et al. 2018), and that minimizing PCR can substantially reduce this source of bias.

Modeling biases in STARR-seq signal guides improved experimental designs

To model the aforementioned biases in STARR-seq signal, we developed a generalized linear regression model (GLM) with covariates to model DNA structure (Zhou et al. 2013) in fragment ends, annealing and denaturing efficiency of DNA fragments related to their Gibbs free energy (Protozanova et al. 2004), stability of G-quadruplex structure (Chambers et al. 2015), and mappability (Derrien et al. 2012) as a reduced set of independent variables (Fig. 2A). We then fit that model to predict biases in STARR-seq signals across the genome (Fig. 2A). To improve model fit, particularly at the extremes of STARR-seq signal, we separately modeled regions with high STARR-seq signal that we observed to have significantly different coefficients for biases related to the fragment Gibbs free energy (Fig. 2B). We used a biased structured sampling approach to better fit the tails of the signal distribution (Fig. 2C; Methods). The model fit was robust to the specific thresholds used in the biased structured sampling approach (Supplemental Fig. S3). Together, these adjustments improve model fit at the extremes of STARR-seq signal, where biases are strongest and thus most likely to impact analysis.
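The bias-correction idea can be sketched in a few lines of Python. The sketch below assumes a Poisson family with a log link, which this section does not specify, and fits it by iteratively reweighted least squares on simulated data. The six covariate names follow the Figure 2A legend; the covariate values, coefficients, and function names are hypothetical, and the separate high-signal model and structured sampling are omitted. It is not the CRADLE implementation.

```python
import numpy as np

def fit_poisson_glm(X, y, n_iter=50, tol=1e-10):
    """Fit log(E[y]) = b0 + X @ b by iteratively reweighted least squares."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X1.shape[1])
    beta[0] = np.log(np.mean(y) + 1e-8)       # start from the mean signal
    for _ in range(n_iter):
        eta = X1 @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu               # working response
        XtW = X1.T * mu                       # Poisson working weights are mu
        new_beta = np.linalg.solve(XtW @ X1, XtW @ z)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

def predict(beta, X):
    """Predicted (bias-only) signal for each position."""
    return np.exp(beta[0] + X @ beta[1:])

# Simulated covariates; names taken from the Figure 2A legend, values hypothetical.
names = ["MGW", "ProT", "Anneal", "Denature", "Gquad", "Map"]
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, len(names)))
true_beta = np.array([1.0, 0.3, -0.2, 0.5, -0.4, -0.3, 0.6])
y = rng.poisson(np.exp(true_beta[0] + X @ true_beta[1:])).astype(float)

beta_hat = fit_poisson_glm(X, y)
corrected = y - predict(beta_hat, X)          # residuals serve as corrected signal
```

Because the simulated signal contains only bias, the residuals are centered on zero and have much lower variance than the observed counts.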

Figure 2.

The CRADLE GLM approach accurately predicts signal bias. (A) Equation of the GLM to predict the impact of technical biases and approach used by CRADLE to calculate bias covariates. To estimate bias effects for each position (blue), we used a window centered on that position that was twice the median fragment length, L. We assume L number of fragments (green) in a window and that each fragment is L-bp in length. To calculate each bias covariate for the position, we combined quantitative measures from L fragments. (pos) Single-bp position; (MGWpos) minor groove width; (ProTpos) propeller twist; (Annealpos) annealing efficiency; (Denaturepos) denaturation efficiency; (Gquadpos) G-quadruplex structure; (Mappos) mappability. (B)–(F) The results from the GLM fitted with Johnson et al. (2018) STARR-seq data (six input libraries and five 0-h dex-treated output libraries) and Muerdter et al. (2018) STARR-seq data (two input libraries and two no-inhibitor-treated output libraries). For (C)–(F), the results were visualized for Chromosome 1. (B) Coefficients in input libraries for regions with signals above and below the 90th percentile (“Regions with high input signal” and “The rest of regions,” respectively). (C) Ratio of the sum of squared errors with structured sampling to the sum of squared errors with random sampling is plotted for regions with extremely high signals (above the 99th percentile). (D) Variance explained by CRADLE is plotted. The R2 values are from GLMs fitted with input and output STARR-seq libraries. The error bars indicate variance between replicates. (E) Distribution of GLM residuals and the STARR-seq effect sizes are shown after correction. (F) Squared semipartial correlations are shown for fragment-end, Gibbs free energy, G-quadruplex, and mappability covariates. The error bars indicate variance between replicates. (G) The R2 values of the GLMs are shown for PER1 BAC libraries amplified with different numbers of PCR cycles. (H) Coefficients of anneal and denature covariates are shown for the GLM fitted with PER1 BAC libraries. The error bars show a 95% confidence interval.

Overall, the GLM fit the observed signals with R2 up to 0.75 for input STARR-seq libraries (Fig. 2D). The model fit output libraries less well than the input libraries, owing in part to regulatory activity also contributing to differences in STARR-seq signal. Still, the GLM showed comparable performance to using the input library in predicting biases of the output library, despite the high degrees of freedom (Supplemental Fig. S4). The GLM had significantly better fit than a model that simply binned the genome and used GC content in each bin as a covariate (Supplemental Fig. S5). We think the improved fit of the GLM over the simple GC-content model is partially a result of the nonlinear relationship of GC content and signal (Fig. 1D). Residuals from the model approximately follow a normal distribution (Fig. 2E), supporting model fit. We also estimated the extent to which each covariate independently explained variation in STARR-seq signal (Fig. 2F). Overall, the median of the explained variation across the two studies showed that fragment-end sequences and Gibbs free energy explained the greatest amount of signal variation. In the data from Johnson et al. (2018), G-quadruplex bias in the input STARR-seq library and mappability bias in the output STARR-seq library had a negative marginal contribution to total predictive power, but the effects were minor. Meanwhile, in the data set of Muerdter et al. (2018), Gibbs free energy was the major contributor to signal biases, showing relatively large variance between replicates. This shows distinctive bias effects in the two input libraries, which aligns with the relatively small correlation between the two signals compared to Johnson et al. (Fig. 1C). These findings indicate that most of the variance in STARR-seq signal can be attributed to technical biases, and that it is important to model the distinct relative contributions of those biases in different STARR-seq library preparations.
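The independent contribution of a covariate group can be illustrated with the standard definition of a squared semipartial correlation: the drop in R2 when that group is removed from the full model. The Python sketch below uses ordinary least squares on simulated data as a stand-in for the GLM; the function names and data are hypothetical, and the exact procedure behind Figure 2F may differ.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of a least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def squared_semipartial(X, y, cols):
    """Drop in R^2 when the covariate columns listed in `cols` are removed."""
    keep = [j for j in range(X.shape[1]) if j not in cols]
    return r_squared(X, y) - r_squared(X[:, keep], y)

# Simulated signal: strong effect of column 0, weak effect of column 1, none of column 2.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000)

sr2 = [squared_semipartial(X, y, [j]) for j in range(3)]
```

Passing several column indices in `cols` gives the joint contribution of a covariate group, for example both fragment-end covariates together.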

Most of the parameters we modeled are not readily mitigated by modifying experimental procedures. As examples, reducing PCR cycles may not be feasible when template is limited, and DNA fragmentation is required for STARR-seq. Therefore, we investigated whether the GLM can instead statistically correct biases in STARR-seq signal. First, we fit the aforementioned GLM (Fig. 2A) to fragment sequencing libraries generated with different numbers of PCR cycles and calculated the amount of variance explained by the GLM. Consistent with our earlier observation that additional PCR cycles increased Gibbs free energy bias (Fig. 1E), the model explained more signal variance when more PCR cycles were used (Fig. 2G). There was also a monotonic increase in the coefficients for fragment annealing and denaturing efficiency based on the Gibbs free energy (Fig. 2H). Those results show that the GLM can correct different amounts of bias resulting from different experimental designs.

Removing technical biases in STARR-seq improves visualization

Visualizing signals from functional genomic assays is often a critical step in quality control, experiment interpretation, integrative analysis, and hypothesis generation. Because substantial signal variation in STARR-seq is attributed to the underlying DNA sequence, however, it is challenging to gain useful information from visual inspection of uncorrected STARR-seq signals. That visualization can be substantially improved by instead using the residuals from the GLM. For example, across the two genome-wide studies analyzed here (Johnson et al. 2018; Muerdter et al. 2018), the GLM reduced signal variance by between 40% and 80%, resulting in approximately zero-centered corrected signals (Fig. 3A,B). Further demonstrating generality across specific experimental procedures, the GLM also effectively corrected biases caused by different amounts of PCR (Fig. 3C,D). An example of the resulting correction for a 19-kb region is shown in Figure 3E, where the GLM residuals allow for clearer visual identification of elements with activating or repressive regulatory effects. For example, a region near the PER1 gene that is well known to have activating regulatory activity in response to activation of the glucocorticoid receptor (NR3C1) (Reddy et al. 2012; Johnson et al. 2018) showed much clearer indication of activity after correction (Fig. 3F). Similarly, an example of a repressive element that is bound by the REST repressor (The ENCODE Project Consortium 2012) is also better represented in corrected signals compared to uncorrected observed signals (Fig. 3G). To generalize the argument that correcting biases better reflects regulatory activity, we compared observed and corrected signal for NR3C1-binding regions and REST-binding regions that had corresponding motifs (Supplemental Fig. S6; The ENCODE Project Consortium 2012). Overall, corrected signal provides more stable background signal and shows clearer regulatory activity (Supplemental Fig. S6). Together, these results show that our model accounts for a substantial fraction of the signal variation in STARR-seq data and improves visualization of signals.

Figure 3.

The CRADLE GLM approach corrects technical bias. (A) STARR-seq signals are plotted for 500-bp windows along Chromosome 1 after removing technical bias with CRADLE. Signal is balanced despite varying degrees of technical biases. The ideal line is the median corrected signal. Whiskers extend 1.5 times the interquartile range. Center lines in the boxes show the medians. (B) Variance in observed signals and CRADLE-corrected signals in 1-bp windows is shown along Chromosome 1. The error bars indicate variance between replicates; six input libraries and five 0-h dex-treated output libraries in Johnson et al. (2018) are plotted; two input libraries and two no-inhibitor-treated output libraries in Muerdter et al. (2018) are plotted. (C) STARR-seq signals are shown for PER1 BAC libraries amplified with different numbers of PCR cycles after removing technical bias with CRADLE. Each point represents the sum of corrected signals in a 500-bp window from three technical replicates. The solid line is a lowess fit line. The dashed ideal line is the median signal across all windows. (D) Variance in observed signals and CRADLE-corrected signals in 1-bp windows is shown after correcting the PER1 BAC libraries. (E) Representative signal tracks are shown for STARR-seq input libraries before and after CRADLE correction (Chr 2: 29,772,197–29,791,543). (F) STARR-seq and ChIP-seq signal tracks are shown in the dex-responsive PER1 locus. Observed and corrected signals of Johnson et al. (2018) are presented for 0-h dex-treated (untreated) and 12-h dex-treated output libraries. ChIP-seq signal tracks are not corrected. The highlighted region (Chr 17: 8,151,204–8,152,809) is a known dex-responsive activating regulatory element. (G) STARR-seq signal tracks are shown for the TMEM63C locus. Observed and corrected signals of Johnson et al. (2018) are presented for input and 0-h dex-treated output libraries. The highlighted region (Chr 14: 77,207,895–77,210,261) contains a REST motif and is occupied by REST in multiple cell types.

Correcting biases improves detection of regulatory signals embedded in STARR-seq data

Next, to detect genomic regions with significant STARR-seq activity, we developed a new method that rigorously models two key features of the assay. First, STARR-seq measures both activation and repression of reporter gene expression (Johnson et al. 2018); thus, being able to detect both is important. Second, local STARR-seq signals are highly correlated (e.g., Fig. 3E). That correlation, if not appropriately modeled, can lead to nonconservative control of type I errors (Lun and Smyth 2014).

To overcome those challenges, we developed a two-step statistical approach to merge locally correlated signals while maintaining well-calibrated control of the false discovery rate (FDR) (Fig. 4A). Briefly, our approach first detects signals in broad genomic regions and then identifies more specific sources of signal variation within those regions. The approach is based on previous work from Benjamini (Benjamini and Hochberg 1995; Benjamini and Bogomolov 2014). To increase power of detecting regulatory elements, we also used independent filtering to remove regions without enough signal variation to reject the null (Fig. 4A,B; Bourgon et al. 2010).
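The hierarchical testing idea can be sketched in Python as follows. The sketch assumes that bin-level P-values within a region are merged with Simes' method (the merging rule is not stated in this section) and that the bin-level threshold is scaled by the fraction of selected regions, following Benjamini and Bogomolov (2014). Function names and the toy P-values are hypothetical; window merging and variance filtering are omitted.

```python
import numpy as np

def bh_reject(pvals, q):
    """Benjamini-Hochberg step-up procedure: boolean mask of rejections."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])      # largest rank passing its threshold
        reject[order[:k + 1]] = True
    return reject

def simes(pvals):
    """Simes combination of bin-level P-values into one region-level P-value."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = p.size
    return float(np.min(p * m / np.arange(1, m + 1)))

def two_stage(regions, q=0.05):
    """regions: list of 1-D arrays of bin-level P-values, one array per region."""
    region_p = np.array([simes(r) for r in regions])
    selected = bh_reject(region_p, q)                     # stage 1: regions
    q_within = q * selected.sum() / len(regions)          # adjusted bin-level level
    return [bh_reject(r, q_within) if s else np.zeros(len(r), dtype=bool)
            for r, s in zip(regions, selected)]           # stage 2: bins

regions = [np.array([1e-6, 2e-6, 0.4, 0.9]),
           np.array([0.3, 0.5, 0.7]),
           np.array([0.2, 0.8, 0.6, 0.9])]
calls = two_stage(regions, q=0.05)
```

In this toy example only the first region passes the region-level step, and within it only the two bins with very small P-values are called.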

Figure 4.

Detection of regulatory elements with CRADLE. (A) The CRADLE regulatory element detection pipeline is shown as a diagram. Effect sizes are calculated in windows of uniform length. Contiguous windows with similar effect sizes are merged into regions before filtering regions with small variance. Regions are binned and a statistical test is performed on each bin to compare corrected input and output signals. Bin-level P-values are merged to generate a region-level P-value before performing a region-level Benjamini–Hochberg (BH) procedure. Regions selected by the first BH procedure were used to perform a bin-level BH procedure to identify regulatory elements. (B) The number of detected regulatory elements is dependent on the variance filter. (C,D) Precision-recall curves, using corrected and uncorrected signals in the simulation study. To detect regulatory elements with uncorrected signals, two statistical approaches were used: (1) fitting uncorrected signals to a Poisson GLM and performing a Wald test (“Uncorrected 1”) and (2) using a Poisson distribution with the mean of uncorrected input signals as a null distribution and testing the significance of the mean of uncorrected output signals (“Uncorrected 2”). (C) Precision-recall curve when signals are simulated with mixed fold change (2, 3, 4) and a mix of activating and repressive elements. (D) Precision-recall curve when signals are simulated with a fixed fold change (FC) and with a fixed regulatory activity (either activating or repressive). (E–G) Comparison of inhibitor-responsive regulatory elements detected by CRADLE and Muerdter et al. (2018). (E) The Venn diagram shows the overlap of regulatory elements detected by both studies. (F) Transcription factor motif enrichment is shown for inhibitor-responsive repressive regulatory elements exclusively detected by each study. Rank* is the rank of the motif in the other study. (G) The mean of IRF3 ChIP-seq effect size is plotted for inhibitor-responsive repressive regulatory elements exclusively detected by each study. (H) The Venn diagram shows the overlap of dex-responsive activating and repressive regulatory elements detected by CRADLE and Johnson et al. (2018). (I) Transcription factor motif enrichment in A549 steady-state repressive regulatory elements detected by CRADLE.

To show the benefit of correcting technical biases when detecting regulatory elements in STARR-seq data, we simulated whole-genome STARR-seq signals with embedded activating and repressive regulatory elements across a range of effect sizes (Fig. 4C,D). We then used the method described above to detect regulatory elements in corrected or uncorrected signals. When detecting regulatory elements with uncorrected signals, we used statistical tests based on a Poisson distribution to avoid unfairly reducing the performance by violating key assumptions of a t-test. Specifically, we used two approaches: (1) fitting uncorrected signals to a Poisson GLM and Wald tests to reject the null (“Uncorrected 1”), and (2) using a Poisson distribution with the mean of uncorrected input signals as a null distribution and testing for a significant difference in the means of uncorrected output signals (“Uncorrected 2”).
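One possible reading of the second approach (“Uncorrected 2”) is sketched below in Python: the mean input signal in a bin defines the Poisson null, and the mean output signal is tested against it. Rounding the output mean to an integer count and using a two-sided P-value are assumptions made for illustration, as are the function names and toy signals.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability mass, computed on the log scale for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_two_sided_p(input_signal, output_signal):
    """Two-sided P-value of the mean output signal under Poisson(mean input).

    Assumes a positive mean input signal; the output mean is rounded to an
    integer count (an illustrative simplification).
    """
    lam = sum(input_signal) / len(input_signal)
    k = round(sum(output_signal) / len(output_signal))
    lower = sum(poisson_pmf(i, lam) for i in range(k + 1))   # P(X <= k)
    upper = 1.0 - lower + poisson_pmf(k, lam)                # P(X >= k)
    return min(1.0, 2.0 * min(lower, upper))

input_signal = [10, 12, 9, 11]
p_activating = poisson_two_sided_p(input_signal, [25, 30, 27, 26])
p_repressive = poisson_two_sided_p(input_signal, [2, 3, 1, 2])
p_null = poisson_two_sided_p(input_signal, [10, 11, 10, 11])
```

A two-sided test is used here so that both elevated and depleted output signal can be flagged; uncorrected technical bias would shift the output mean in the same way as regulatory activity, which is the failure mode the comparison is designed to expose.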

Overall, correcting biases with the GLM substantially improved the precision of detecting regulatory signals, especially at more stringent detection thresholds (Fig. 4C,D). In contrast, the majority of regulatory elements with uncorrected signals were false positives (Supplemental Fig. S7). Performance improvement was particularly pronounced when detecting repression (Fig. 4D), where the area under the precision recall curve (AUPRC) increased by 0.64 when correcting signals.

Overall, repressive signals are more difficult to detect. In the repression simulation with corrected signals, recall and precision were worse than in the activation simulation by as much as 0.43 in AUPRC. The decreased AUPRC was mainly a result of small simulated output signals of repressive regulatory elements that were largely filtered out by the overall variance filter. However, this simulation result still shows correcting technical biases helps to decrease false positives in detecting both activation and repression.
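For readers unfamiliar with the metric, the Python sketch below computes the area under a precision-recall curve as average precision, that is, the mean precision at the rank of each true element. How the curves in Figure 4 were integrated is not stated here, so this choice, the function name, and the toy labels and scores are assumptions; tied scores are not treated specially.

```python
import numpy as np

def average_precision(labels, scores):
    """Area under the precision-recall curve as mean precision at each true hit."""
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = labels[order]                              # ranked from highest score
    precision = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(np.sum(precision[hits]) / hits.sum())

# Toy example: 1 marks a simulated regulatory element, 0 marks background.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
ap = average_precision(labels, scores)                # (1 + 1 + 3/4) / 3
ap_perfect = average_precision([1, 1, 1, 0, 0, 0], scores)
```

A difference in this quantity between two methods, such as the 0.64 gain reported for repression, summarizes how much better one ranking separates true elements from background across all thresholds.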

Improved detection of regulatory elements in STARR-seq data

We used the CRADLE method described above to call regulatory elements in data from two published whole-genome STARR-seq studies (Johnson et al. 2018; Muerdter et al. 2018). Muerdter et al. (2018) measured differential regulatory activity in response to inhibitors that blocked interferon response. Using an analysis pipeline based on a binomial distribution and hypergeometric tests (Arnold et al. 2013; Muerdter et al. 2018), that study reported 12,010 inhibitor-responsive regulatory elements, of which 2892 were repressive. CRADLE detected a similar number of regulatory elements at 20% FDR (N = 11,997), 815 of which were repressive (Supplemental Table S1). Although the activating elements detected by each method overlapped by up to 46%, the repressive elements were largely different between the methods (Fig. 4E).

To investigate the biological properties of regulatory elements detected exclusively by each method, we used motif enrichment analysis to detect potential biologically important sequence signals in the nonoverlapping sets of regulatory elements. Motifs for interferon-responsive transcription factors (TFs) were most strongly enriched in the CRADLE-exclusive repressive elements (Fig. 4F; Supplemental Table S2). In contrast, activator-protein 1 (AP-1) TF motifs were most significantly enriched in the repressive elements unique to the Muerdter et al. (2018) analysis, with interferon response motifs ranked lower by enrichment (Fig. 4F; Supplemental Table S3). The motifs enriched in shared repressive regulatory elements of CRADLE and Muerdter et al. (2018) overall corresponded with the motifs enriched in CRADLE-exclusive repressive regulatory elements (Supplemental Tables S2, S4). In addition, the CRADLE-exclusive repressive elements showed higher enrichment for IRF3 ChIP-seq signal (Fig. 4G). We noted that CRADLE estimated positive effects for 1704 repressive elements uniquely detected by Muerdter et al. (Supplemental Fig. S8; Muerdter et al. 2018), suggesting they are false positives attributed to the biases in STARR-seq signal. Indeed, subsequent motif analysis of those repressive elements with positive effects revealed enriched NF-kB motifs, not corresponding to the experimental design.

The Johnson et al. (2018) study used STARR-seq to measure changes in regulatory activity in response to dexamethasone (dex) across time. The study used MACS2 (Zhang et al. 2008) and edgeR (Robinson et al. 2010) together to identify 4835 dex-responsive regulatory elements at 0.05 FDR with 3311 activating elements. With the data from Johnson et al. (2018), we used CRADLE to detect regulatory elements both in untreated A549 cells and in response to dex at the same 0.05 FDR (Supplemental Tables S5, S6). That analysis identified 10% more dex-responsive regulatory elements (N = 5368) than the methods used by Johnson et al. (2018), with 4683 activating and 685 repressive dex-responsive regulatory elements (Supplemental Table S6). As with the comparison to the Muerdter et al. (2018) analysis, we observed little overlap in repressive elements, whereas activating elements showed up to 70% overlap (Fig. 4H). Overall, the repressive regulatory elements identified by CRADLE in each study had higher control library signals than activating regulatory elements, indicating that CRADLE requires a region to have enough coverage to be reliably detected as repressive (Supplemental Fig. S9).

To validate the newly identified repressive elements, we again used motif enrichment analysis to identify potential sequence signals consistent with repressive elements. The motif for the RE1-silencing transcription factor (REST), a well-characterized repressive factor (Chong et al. 1995), was most enriched in repressive regulatory elements in untreated A549 cells (Fig. 4I; Supplemental Table S7). Meanwhile, the motif enrichment in dex-responsive regulatory elements exclusively detected from CRADLE corresponded to previous findings about NR3C1 biology. Namely, for the dex-responsive activating regulatory elements, the NR3C1 DNA binding motif was most enriched followed by cofactor AP-1 transcription family (Supplemental Table S8), corresponding to the motifs enriched in the shared dex-responsive activating regulatory elements (Supplemental Table S9). For the dex-responsive repressive regulatory elements, the AP-1 motif was most enriched, consistent with the role of AP-1 in NR3C1-mediated activation and repression (Supplemental Table S10; Gupte et al. 2013; Johnson et al. 2018; McDowell et al. 2018).

We also validated some of the 240 A549 steady-state REST-binding repressive regulatory elements using two independent studies (Supplemental Fig. S10; van Arensbergen et al. 2019; Doni Jayavelu et al. 2020). Although neither of these studies used A549 cells, we assumed the repressive activity of the REST-binding repressive regulatory elements could be validated in other cell models because REST is a common repressor in diverse cell lines. We intersected the REST-binding repressive regulatory elements with the regions tested by Doni Jayavelu et al. (2020), who used a massively parallel reporter assay (MPRA) to test repressive activity, and observed 30 elements in common (Supplemental Fig. S10). Of those, 27 elements (90%) had repressive activity in Doni Jayavelu et al. (2020), whereas two elements did not have coverage and one element did not show repressive activity. The one nonrepressive element is likely a result of the small overlap with their tested region (27 bp) that did not cover the REST motif. We also compared our repressive element calls with the signal from a genome-wide survey of regulatory elements (SuRE) (van Arensbergen et al. 2019). The SuRE signal in the REST-binding repressive regulatory elements showed repression in that study that was significantly different from both random and activator-binding elements (Supplemental Fig. S10).

Discussion

We showed that a substantial fraction of the variation in STARR-seq signal can be explained by DNA sequence features that are related to experimental artifacts rather than regulatory element activity. Overall, biases in PCR amplification had some of the strongest impacts on sequence biases in STARR-seq, and we show here that minimizing the amount of PCR can reduce variation in signals. DNA structure bias at the ends of fragments is possibly caused by preferential fragmentation, cloning, or efficiency as an enzymatic substrate. The efficiency of adding adaptors in cloning or in reverse transcription could also be affected by DNA sequences or structures at fragment ends (Zheng et al. 2011). Potential opportunities to mitigate those biases could include using multiple enzymes from different species that have different sequence biases or further refinement of reaction conditions. Similarly, increasing read length could mitigate mappability-induced biases by increasing the mappable space in the genome; G-quadruplex structure bias might be alleviated by optimizing experimental conditions to destabilize those structures. However, mitigating technical biases using a statistical model is much faster and easier. We showed the GLM had significant predictive power that led to substantially stabilized STARR-seq signals. Indeed, corrected signals showed noticeably reduced variance and improved visualization of regulatory activity.

With corrected signals from the GLM, we detected regulatory elements with substantially improved accuracy compared to previous models. CRADLE especially improved the identification of repressive regulatory elements that were challenging to detect previously, as we showed via simulations, comparisons to other studies, and through investigation of DNA binding motifs for repressive factors. That improvement will allow for a more complete understanding of the diversity of regulatory element activity across the human genome.

Lee et al. (2020) also recently addressed the need to model biases in STARR-seq to improve detection of regulatory elements. Conceptually, both approaches model physical characteristics of genomic sequence that substantially influence STARR-seq signal and develop novel peak calling approaches. In terms of implementation, there are differences in model parameters (e.g., how CRADLE models PCR biases), model fitting (e.g., weighted sampling of the tails of the coverage distribution), and peak calling methods. In terms of performance evaluation, we evaluated CRADLE over a broader range of simulations, and in doing so we showed that the reported False Discovery Rates are well-calibrated. We also showed that CRADLE is especially able to detect repressed regulatory elements.

Our work on CRADLE also opens up the possibility of developing analogous statistical models for other high-throughput sequencing technologies. Many high-throughput sequencing technologies share common experimental steps that cause technical biases in STARR-seq. In that regard, CRADLE exemplifies how those biases can be statistically modeled and corrected, thus allowing effect estimation and peak calling from data with a mean of zero. Of course, each sequencing technology may have assumptions that are distinct from STARR-seq and have other major bias effects that were not modeled in CRADLE. For example, antibody specificity might be one of the major bias sources in ChIP-seq. More studies need to be done to determine the scope of the applicability of CRADLE.

Methods

Downloaded data

For STARR-seq data, we downloaded FASTQ files of whole-genome STARR-seq that used A549 and HeLa-S3 cells from Johnson et al. (2018) and Muerdter et al. (2018), respectively. Those files were downloaded from NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) repository (Barrett et al. 2013) with accession codes available in those studies (GSE114063, GSE100432).

For ChIP-seq data (The ENCODE Project Consortium 2012; Davis et al. 2018), we downloaded ChIP-seq FASTQ files with the following GEO accession codes: GSE91296 for A549 control ChIP-seq; GSE91275 for A549 0-h dex-treated NR3C1 ChIP-seq; GSE91235 for A549 12-h dex-treated NR3C1 ChIP-seq; GSE92032 for HeLa-S3 control ChIP-seq; GSE101280 for A549 REST control ChIP-seq; GSE101362 for A549 REST ChIP-seq; GSM935570 for HeLa-S3 IRF3 ChIP-seq; GSM935339 for HeLa-S3 IRF3 control ChIP-seq.

Processing of high-throughput sequence data

FASTQ files were aligned to the human genome reference assembly hg38 with Bowtie 2 (version 2.3.4.3) (Langmead and Salzberg 2012), using the --sensitive option and requiring a MAPQ of at least 30. Fragments were discarded if they aligned to gaps, centromeres, or telomeres that are available in the UCSC Gap and Centromere Table Browser (Hinrichs et al. 2006) or to ENCODE blacklist regions (Amemiya et al. 2019). Alignments of paired-end data sets were further restricted to properly paired alignments. Unnormalized and RPKM-normalized (--binSize 1) bigWig files were generated by the bamCoverage subcommand in deepTools (version 3.0.1) (Ramírez et al. 2016) using --extendReads. The reported average fragment length was used to extend reads when generating single-end bigWigs. Unnormalized and normalized bigWig files were used as CRADLE input files and for visualizing signals in genome browser tracks, respectively. A549 ATAC-seq FASTQs were processed as above but were aligned to hg19 and required a less stringent MAPQ score (≥5). Peaks were called for ChIP-seq data sets using MACS2 (Zhang et al. 2008) with an FDR threshold of 0.05. For NR3C1-binding sites, we first called peaks using MACS2 (Zhang et al. 2008) independently for 0-h dex-treated NR3C1 ChIP-seq samples and 12-h dex-treated NR3C1 ChIP-seq samples with their respective control ChIP-seq samples. Then we merged those peaks, used edgeR (Robinson et al. 2010) to perform differential testing at FDR 0.05, and selected peaks with positive effect size to detect NR3C1-binding regions. The coordinates of autosomal inhibitor-responsive regulatory elements previously reported in hg19 (Muerdter et al. 2018) were converted to hg38 with liftOver (Hinrichs et al. 2006).

PER1 BAC library preparation and sequencing

Purified PER1 bacterial artificial chromosomes (BACs) (CH17-212C17; Chr 17: 7,981,103–8,192,310) were harvested from Escherichia coli using standard protocols. Following DNA shearing using the Covaris S2 instrument, the BAC DNAs were size-selected using solid phase reversible immobilization (SPRI) beads. STARR-seq insert libraries were prepared using the NEBNext DNA Library Prep Master Mix kit and 50 ng of template DNA. Adapted DNAs were enriched in triplicate reactions via 3, 6, 12, or 18 cycles of PCR using the NEB Q5 PCR kit. The resulting libraries were characterized on the Agilent TapeStation before 50 cycles of paired-end sequencing on the Illumina MiSeq platform. FASTQs were aligned as above. We checked the duplication rate for each PCR cycle number using Picard (MarkDuplicates, version 2.14.0). The mean duplication rates were as follows: 0.3% with 3 cycles; 0.5% with 6 cycles; 1.3% with 12 cycles; 1.9% with 18 cycles.

Data processing for bias covariates

To obtain DNA structure parameters for fragment-end bias, we estimated minor groove width (MGW) and propeller twist (ProT) for all 5-mers (total 1024 sequences) using DNAshape (Zhou et al. 2013). For Gibbs free energy parameters, we used the estimated Gibbs free energy for all dimers (Protozanova et al. 2004). For G-quadruplex structure parameters, we used bigWig files that reported stability of G-quadruplex structure in the whole genome with accession code GSE63874 in GEO (Chambers et al. 2015). For mappability scores, we downloaded the human mappability score bigWig files for 36-mer and 50-mer (Derrien et al. 2012), using accession codes (ENCSR821KQV, ENCSR093EEM) in ENCODE (The ENCODE Project Consortium 2012; Davis et al. 2018). For those G-quadruplex structure bigWig and mappability bigWig files, the genomic coordinates were in hg19 assembly, so we used the liftOver tool (Hinrichs et al. 2006) to convert them to hg38.

Measuring technical biases in STARR-seq libraries

To investigate fragment-end bias, we counted the frequency of 5-mers starting 2 bp upstream of the 5′ end of positive strands of fragments in STARR-seq input libraries (Johnson et al. 2018; Muerdter et al. 2018). To identify enriched fragmentation sites, we compared that observed 5-mer frequency distribution to that observed in the reference genome (hg38) excluding gaps, centromeres, telomeres that are available in UCSC Gap and Centromere Table Browser (Hinrichs et al. 2006), and ENCODE blacklist regions (Amemiya et al. 2019). To examine Gibbs free energy, G-quadruplex structure, and mappability bias, we binned human Chromosome 1 into 500-bp windows using a 250-bp stride. We estimated the amount of potential technical bias in a window by calculating the mean of per base measure of those biases using previously reported values: Gibbs free energy value (Protozanova et al. 2004); the percent of mismatch for G-quadruplex structure bias (Chambers et al. 2015); and mappability score (Derrien et al. 2012). This analysis was limited to the PER1 BAC when estimating Gibbs free energy bias in the PER1 BAC library.
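The fragment-end comparison described above can be illustrated with a minimal Python sketch. It assumes 0-based positive-strand fragment start coordinates and a plain-string genome sequence; the function names (`end_kmer`, `kmer_frequencies`, `background_frequencies`, `log2_enrichment`) are hypothetical and are not taken from the CRADLE code base, and the log2 ratio is only one possible way to compare the two distributions.

```python
from collections import Counter
import math


def end_kmer(genome, frag_start, k=5, upstream=2):
    """Return the k-mer starting `upstream` bp before a fragment's 5' end.

    `frag_start` is the 0-based index of the fragment's first base on the
    positive strand. Returns None if the k-mer would run off the sequence.
    """
    begin = frag_start - upstream
    if begin < 0 or begin + k > len(genome):
        return None
    return genome[begin:begin + k]


def kmer_frequencies(kmers):
    """Convert an iterable of k-mers into relative frequencies."""
    counts = Counter(kmers)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def background_frequencies(genome, k=5):
    """Frequencies of every overlapping k-mer in the reference sequence."""
    return kmer_frequencies(genome[i:i + k] for i in range(len(genome) - k + 1))


def log2_enrichment(observed, background):
    """log2(observed / background) for k-mers present in both distributions."""
    return {
        kmer: math.log2(freq / background[kmer])
        for kmer, freq in observed.items()
        if kmer in background
    }


# Toy example: two fragments whose 5' ends fall at positions 2 and 6.
genome = "ACGTACGTAC"
ends = [end_kmer(genome, start) for start in (2, 6)]
observed = kmer_frequencies(kmer for kmer in ends if kmer is not None)
enrichment = log2_enrichment(observed, background_frequencies(genome))
```

In a real analysis the background would be computed after masking gaps, centromeres, telomeres, and blacklist regions, as described in the text.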

Correcting technical biases in STARR-seq

We used the technical bias covariates in a generalized linear model (GLM) with a Poisson distribution and log link to correct STARR-seq signals. An estimate of the 90th percentile of observed coverage in input libraries (IsigP90) was calculated using 1-kb bookended regions. To ensure the GLM models effects across the range of observed signals, we trained the model using a structured sampling strategy to select bookended regions without replacement such that the final training set is approximately 10^6 bases in length. We evenly partitioned the training set to fit regions with input signal above and below IsigP90. The set of regions below IsigP90 were further evenly partitioned into the following percentile bins of observed coverage: [0, 20); [20, 40); [40, 60); [60, 80); [80, 90). To ensure representation across the upper tail of the STARR-seq signal distribution, regions above IsigP90 were asymmetrically partitioned as follows: 62.5% of regions were evenly divided into the following percentile bins of observed coverage, [90, 92); [92, 94); [94, 96); [96, 98); [98, 99), whereas the remaining 37.5% of regions were binned into the 99th percentile of coverage. With Muerdter et al. (2018) data, we empirically found that preferentially sampling Chromosome X in the last two bins improved performance.
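As a rough illustration of the structured sampling scheme above, the following Python sketch computes how many training regions would be drawn from each coverage-percentile bin for a given budget. The function name `allocate_training_regions` and the label `(99, 100)` for the top bin are assumptions made for this example; the sketch covers only the allocation arithmetic, not the sampling itself.

```python
def allocate_training_regions(n_regions):
    """Expected number of training regions per coverage-percentile bin.

    Half of the budget goes below the 90th percentile (split evenly over
    five bins). Of the half above it, 62.5% is split evenly over five
    narrow bins and 37.5% goes to the top percentile.
    """
    half = n_regions / 2
    low_bins = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 90)]
    high_bins = [(90, 92), (92, 94), (94, 96), (96, 98), (98, 99)]

    allocation = {}
    for bounds in low_bins:
        allocation[bounds] = half / len(low_bins)
    for bounds in high_bins:
        allocation[bounds] = half * 0.625 / len(high_bins)
    allocation[(99, 100)] = half * 0.375  # top-percentile bin (label assumed)
    return allocation


# Example budget of 1600 one-kilobase regions (an arbitrary round number).
allocation = allocate_training_regions(1600)
```

With this budget, each bin below the 90th percentile receives 160 regions, each narrow bin above it receives 100, and the top bin receives 300, which shows how heavily the upper tail is oversampled relative to its share of the genome.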

Single base positions with observed input signal (Isigpos) above and below IsigP90 were independently fit to the GLM. To predict the total bias effects at each single base position, we used windows twice the length of the median fragment length (L) centered on the position of interest (Fig. 2A). We assumed each position was covered by L number of hypothetical fragments of L bp length with each overlapping by a single base. We then multiplied the same bias covariates for all fragments in that window with each covariate to the power of a unique beta as follows:

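$$
\begin{aligned}
\mathrm{Observed\ signal}_{pos} = \prod_{i=pos-L+1}^{L}
&\ \mathrm{MGW}_i^{\;\beta_1 I(Isig_{pos}<Isig_{P90})+\beta_1'\,(1-I(Isig_{pos}<Isig_{P90}))} \\
&\times \mathrm{ProT}_i^{\;\beta_2 I(Isig_{pos}<Isig_{P90})+\beta_2'\,(1-I(Isig_{pos}<Isig_{P90}))} \\
&\times \mathrm{Anneal}_i^{\;\beta_3 I(Isig_{pos}<Isig_{P90})+\beta_3'\,(1-I(Isig_{pos}<Isig_{P90}))} \\
&\times \mathrm{Denature}_i^{\;\beta_4 I(Isig_{pos}<Isig_{P90})+\beta_4'\,(1-I(Isig_{pos}<Isig_{P90}))} \\
&\times \mathrm{Gquad}_i^{\;\beta_5 I(Isig_{pos}<Isig_{P90})+\beta_5'\,(1-I(Isig_{pos}<Isig_{P90}))} \\
&\times \mathrm{Map}_i^{\;\beta_6 I(Isig_{pos}<Isig_{P90})+\beta_6'\,(1-I(Isig_{pos}<Isig_{P90}))}
\end{aligned}
\tag{1}
$$

Each beta coefficient represents the relative effect of each bias predictor, with the unprimed and primed coefficients applying to positions with input signal below and above IsigP90, respectively. Here, we assumed the set of betas is the same for all overlapping fragments. Then, in log space, the observed signal can be estimated using the sums of bias covariates in the GLM as follows:

$$
\begin{aligned}
\log\big(E(\mathrm{Observed\ signal}_{pos})\big) =
&\ \big(\beta_1 I(Isig_{pos}<Isig_{P90})+\beta_1'\,(1-I(Isig_{pos}<Isig_{P90}))\big)\sum_{i=pos-L+1}^{L}\log(\mathrm{MGW}_i) \\
&+ \big(\beta_2 I(Isig_{pos}<Isig_{P90})+\beta_2'\,(1-I(Isig_{pos}<Isig_{P90}))\big)\sum_{i=pos-L+1}^{L}\log(\mathrm{ProT}_i) \\
&+ \big(\beta_3 I(Isig_{pos}<Isig_{P90})+\beta_3'\,(1-I(Isig_{pos}<Isig_{P90}))\big)\sum_{i=pos-L+1}^{L}\log(\mathrm{Anneal}_i) \\
&+ \big(\beta_4 I(Isig_{pos}<Isig_{P90})+\beta_4'\,(1-I(Isig_{pos}<Isig_{P90}))\big)\sum_{i=pos-L+1}^{L}\log(\mathrm{Denature}_i) \\
&+ \big(\beta_5 I(Isig_{pos}<Isig_{P90})+\beta_5'\,(1-I(Isig_{pos}<Isig_{P90}))\big)\sum_{i=pos-L+1}^{L}\log(\mathrm{Gquad}_i) \\
&+ \big(\beta_6 I(Isig_{pos}<Isig_{P90})+\beta_6'\,(1-I(Isig_{pos}<Isig_{P90}))\big)\sum_{i=pos-L+1}^{L}\log(\mathrm{Map}_i)
\end{aligned}
\tag{2}
$$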
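To make the log-space form of the model concrete, the sketch below evaluates its right-hand side for a single position, given already-fitted coefficients. It is an illustration only: the function name `predicted_log_signal`, the dictionary-based inputs, and the use of two coefficient sets selected by the indicator on input signal are assumptions for this example, and fitting the coefficients themselves (a Poisson GLM) is not shown.

```python
import math

COVARIATES = ("MGW", "ProT", "Anneal", "Denature", "Gquad", "Map")


def predicted_log_signal(fragments, betas_low, betas_high, isig_pos, isig_p90):
    """Log of the expected signal at one position.

    fragments  : list of dicts, one per hypothetical fragment overlapping the
                 position, mapping covariate name -> positive covariate value.
    betas_low  : coefficients used when isig_pos < isig_p90.
    betas_high : coefficients used otherwise.
    """
    # The indicator I(Isig_pos < Isig_P90) selects one coefficient set.
    betas = betas_low if isig_pos < isig_p90 else betas_high
    return sum(
        betas[name] * sum(math.log(frag[name]) for frag in fragments)
        for name in COVARIATES
    )


def predicted_signal(fragments, betas_low, betas_high, isig_pos, isig_p90):
    """Expected signal on the original scale (the product form of the model)."""
    return math.exp(
        predicted_log_signal(fragments, betas_low, betas_high, isig_pos, isig_p90)
    )


# Toy input: two hypothetical fragments with every covariate equal to e,
# so each log covariate is 1 and each per-covariate sum is 2.
toy_fragments = [{name: math.e for name in COVARIATES} for _ in range(2)]
low = {name: 0.5 for name in COVARIATES}
high = {name: 1.0 for name in COVARIATES}
log_low = predicted_log_signal(toy_fragments, low, high, isig_pos=3, isig_p90=10)
log_high = predicted_log_signal(toy_fragments, low, high, isig_pos=20, isig_p90=10)
```

Because the exponents multiply sums of log covariates, the per-fragment products collapse into six per-position sums, which is what makes the model a standard log-link GLM with six predictors per coefficient set.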

MGW and ProT values were calculated using DNAshape (Zhou et al. 2013) for the two 5-mers starting from 2 bp external to both 5′ ends of each hypothetical fragment. MGWi and ProTi were obtained by multiplying those two MGW and ProT values, respectively.

We used the nearest-neighbor model (SantaLucia 1998; Protozanova et al. 2004) to estimate the Gibbs free energy of each hypothetical fragment. To estimate the relative melting temperature (Tm) of each fragment, we divided the Gibbs free energy of a hypothetical fragment by the number of dimers in that hypothetical fragment and by the fixed entropy value (Protozanova et al. 2004). Tm values were normalized to the range [0, 1]. To model the nonlinear dependency of annealing and denaturing efficiencies on Tm, normalized Tm values in the ith fragment (Tm,i) were mapped to two exponential functions as follows:

Anneali=(eTm,i106e1061)/(10611e)Denaturei=(eTm,i106e1061)/(10611e) (3)

Those mapped values were used for Anneali and Denaturei in the GLM Model.

To obtain Gquadi for each hypothetical fragment, we used the maximum of G-quadruplex structure stability value in that sequence (Chambers et al. 2015). To obtain Mapi for each hypothetical fragment, we used a k-mer mappability score file (Derrien et al. 2012), where k-mer is a sequencing length, and multiplied the mappability scores of both ends of a fragment.

After fitting the GLM, the bias predicted by the model at each base position was removed by subtracting the estimated bias effect from the observed signal. To avoid false positives, positions with fewer than 10 observed overlapping fragments and with no signal in output libraries are not reported in the corrected signal file. The minimum number of observed overlapping fragments required for a position to be reported is parameterized (-mi) in the correctBias subcommand in CRADLE.

Modeling technical biases in STARR-seq libraries with only GC content

To show that our approach of modeling multiple biases fits better than simply modeling GC content, we took the following approach. We binned the genome into nonoverlapping windows using six different window sizes ranging from 10 to 1000 bp. We randomly selected approximately 10^6 bases for a training set. Then, with the training set, we calculated GC content in each bin of the chosen window size and used the GC content as a covariate in fitting a GLM with a Poisson distribution and log link. We fitted the GLM independently for each replicate. Then we used the resulting coefficients (intercept and the coefficient of GC content) to estimate the bias impact for the regions that were not in the training set.

Normalizing signals in STARR-seq

To make the corrected signals from the GLM comparable between replicates, for example, by correcting for overall differences in sequencing, we normalized STARR-seq signals between replicates using linear regression. We used the training set sampled as mentioned above and regressed per-nucleotide signal from each library against a common replicate of the input library. We estimated the slope in the linear regression and divided observed signals in each library by that slope estimate.
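The slope-based normalization above can be sketched in a few lines of Python. This example assumes an ordinary least-squares fit with an intercept, which the text does not specify, and the function names `ols_slope` and `normalize_to_reference` are hypothetical.

```python
def ols_slope(x, y):
    """Slope of the least-squares line y = a + b * x (intercept assumed)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def normalize_to_reference(reference, library):
    """Divide a library's signal by its regression slope against the reference."""
    slope = ols_slope(reference, library)
    return [value / slope for value in library]


# Toy per-nucleotide signals: the library is sequenced twice as deeply.
reference = [1.0, 2.0, 3.0, 4.0]
library = [2.0, 4.0, 6.0, 8.0]
normalized = normalize_to_reference(reference, library)
```

Here the estimated slope is 2, so the normalized library matches the reference, which is the intended effect of removing overall differences in sequencing depth.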

Evaluating model fit

To determine how well the CRADLE GLM explained variance in observed signal, we calculated R^2 with observed and predicted signals across Chromosome 1 for each STARR-seq library. To calculate R^2, we fitted the GLM in CRADLE and calculated the sum of squares (SSQ) by adding up squared residuals. Then, we calculated total SSQ, the sum of the squared difference of observed signals and the mean signal. By using the SSQ and total SSQ, we calculated R^2 with the following equation:

$$
R^2 = 1 - \frac{\mathrm{SSQ}}{\mathrm{total\ SSQ}}.
$$
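The calculation above translates directly into code. The following sketch uses a hypothetical function name, `r_squared`, and plain Python lists for the observed and predicted signals.

```python
def r_squared(observed, predicted):
    """R^2 = 1 - SSQ / total SSQ for paired observed and predicted signals."""
    mean_observed = sum(observed) / len(observed)
    # SSQ: sum of squared residuals between observed and predicted signal.
    ssq = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total SSQ: squared deviations of the observed signal from its mean.
    total_ssq = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ssq / total_ssq


observed = [1.0, 2.0, 3.0, 4.0]
perfect_fit = r_squared(observed, [1.0, 2.0, 3.0, 4.0])
mean_only_fit = r_squared(observed, [2.5, 2.5, 2.5, 2.5])
partial_fit = r_squared(observed, [1.0, 2.0, 3.0, 5.0])
```

A model that reproduces the data exactly gives 1, and a model that predicts only the mean gives 0, so the value can be read as the fraction of signal variance explained by the modeled biases.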

Evaluating the contribution of each covariate to model fit

To estimate the contribution of each bias type, we calculated semipartial correlations for Chromosome 1 using the GLM. To assess each technical bias type, we excluded the bias covariates that model the corresponding bias type when fitting the GLM. For example, we excluded the “anneal” and “denature” covariates when assessing the impact of Gibbs free energy bias. The R^2 of each of these reduced models was calculated as above and subtracted from the R^2 of the full model.

Calling regulatory elements with CRADLE

Genomic regions possessing regulatory activity were identified using a modified Benjamini method (Benjamini and Bogomolov 2014). We first binned the genome into windows (1.5 × L) and determined the effect size of each window by subtracting the mean corrected signal in input libraries from the mean corrected signal in output libraries. Each window was classified to one of the three types, using the following standard:

  • Type(windowx) = 1 if (effect size > 0 and |effect size| > 99th percentile of absolute effect sizes),

  • Type(windowx) = −1 if (effect size < 0 and |effect size| > 99th percentile of absolute effect sizes),

  • Type(windowx) = 0 otherwise.
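The three-way classification above can be sketched as follows. The percentile convention (nearest rank) and the function names `percentile` and `classify_windows` are assumptions for this illustration; the text does not state how the 99th percentile is computed.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100 (convention assumed)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def classify_windows(input_means, output_means, q=99):
    """Return (effect sizes, window types) with types in {1, -1, 0}."""
    effects = [out - inp for inp, out in zip(input_means, output_means)]
    threshold = percentile([abs(e) for e in effects], q)
    types = []
    for effect in effects:
        if effect > 0 and abs(effect) > threshold:
            types.append(1)    # candidate activating window
        elif effect < 0 and abs(effect) > threshold:
            types.append(-1)   # candidate repressive window
        else:
            types.append(0)    # no evidence of regulatory activity
    return effects, types


# Toy data: 198 windows with small effects plus one strong activating
# and one strong repressive window.
input_means = [0.0] * 200
output_means = [0.1, -0.1] * 99 + [5.0, -6.0]
effects, types = classify_windows(input_means, output_means)
```

Using a single threshold on the absolute effect size treats activation and repression symmetrically, which is what allows the same procedure to nominate both kinds of windows.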

The threshold of the 99th percentile of absolute effect sizes was chosen to classify windows because the majority of windows are not expected to encode regulatory activity. Contiguous windows of the same type, including Type 0, were merged to form regions for statistical testing. These regions were then binned with nonoverlapping bins of which the length is 1/6 × L. In each bin, the input and output STARR-seq signals were compared using Welch's t-test (Welch 1947) to account for potential differences in variance. Individual bin-level P-values from the same region were merged to a region-level P-value via the method of Simes (1986). To increase our power to detect potential regulatory regions for final testing, regions with small overall variance were removed from further analysis independently of the statistical test used (Bourgon et al. 2010). Specifically, we ranked regions according to their overall variance and then applied the overall variance filter that removed 0%–90% of regions with low variance using 10% intervals. P-values for regions passing each threshold were subjected to the first BH procedure (Benjamini and Hochberg 1995) using a parameterized FDR value (-fdr). Then, we chose the threshold of the overall variance filter that returned the greatest number of selected regions from the first BH procedure. To identify bins that have regulatory activity, bin-level P-values from the Welch's t-test in the selected regions were then subjected to the second BH procedure with new FDR adjusted by the following:

$$
\mathrm{New\ FDR} = (\mathrm{predetermined\ FDR}) \times \left(\frac{\mathrm{the\ number\ of\ selected\ regions}}{\mathrm{total\ number\ of\ regions}}\right).
$$
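The main ingredients of the two-stage procedure above (Simes combination of bin-level P-values, the Benjamini-Hochberg step-up procedure, and the adjusted FDR for the second stage) can be sketched in plain Python. The function names are hypothetical, and the sketch omits the variance filter and the Welch's t-tests that produce the bin-level P-values.

```python
def simes(p_values):
    """Combine bin-level P-values into one region-level P-value (Simes 1986)."""
    m = len(p_values)
    return min(m * p / rank for rank, p in enumerate(sorted(p_values), start=1))


def benjamini_hochberg(p_values, fdr):
    """Indices of hypotheses rejected by the BH step-up procedure at `fdr`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            cutoff = rank  # largest rank satisfying the BH condition
    return sorted(order[:cutoff])


def adjusted_fdr(fdr, n_selected, n_total):
    """FDR level for the second BH procedure on bins in selected regions."""
    return fdr * n_selected / n_total


# Stage 1: region-level P-values from Simes, then BH across regions.
region_bins = [[0.01, 0.04, 0.03], [0.5, 0.6, 0.9], [0.001, 0.002, 0.2]]
region_p = [simes(bins) for bins in region_bins]
selected = benjamini_hochberg(region_p, fdr=0.05)

# Stage 2: BH on the bins of selected regions at the adjusted level.
new_fdr = adjusted_fdr(0.05, len(selected), len(region_bins))
bin_p = [p for idx in selected for p in region_bins[idx]]
significant_bins = benjamini_hochberg(bin_p, new_fdr)
```

Scaling the FDR by the fraction of selected regions makes the second stage more stringent when few regions pass the first stage, which is how the selection step is accounted for in the final bin-level calls.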

Contiguous bins that encode regulatory activity with the same sign of effect sizes were merged in the final output and the minimum P-value was reported.

Simulation of STARR-seq signals

To evaluate the performance of CRADLE, we simulated STARR-seq signals that maintained the observed sequence biases and expected variance across replicates. STARR-seq signals were simulated using a negative binomial distribution and mean-variance relationships estimated independently for input and output libraries from previously published STARR-seq data (Johnson et al. 2018). Simulated input and output signal matrices generated using 300-bp bookended windows along Chromosome 1 were used to estimate the mean-dispersion relationship in DESeq2 (Love et al. 2014) before interpolation with the scipy.interpolate.interp1d function in Python. To generate a set of predefined regulatory elements (N = 50,504), we randomly sampled ∼0.5% of total windows, requiring that the selected windows be in at least the 70th percentile of coverage in the published input libraries. For each predefined regulatory element, we randomly assigned an absolute fold change [2, 3, or 4] and regulatory activity type [activating or repressing]. Predefined sets of regulatory elements with a specific fold change and regulatory activity type were generated as above using the specified fold change and regulatory activity types as described in text.

Five simulated STARR-seq input and output signals were generated using a negative binomial distribution. The mean parameters used to generate the simulated input and output signals were determined by calculating the mean window counts using the published input libraries (Johnson et al. 2018). The variance parameters were determined using either the input or output interpolation analyses described above. The mean parameters used to generate the simulated output signals were adjusted for predefined regulatory elements windows by multiplying or dividing the mean signal by the predetermined fold change and determining the corresponding variance parameter.
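A simplified version of this simulation can be written with the Python standard library by drawing negative binomial counts as a gamma-Poisson mixture. The sketch assumes the parameterization variance = mean + dispersion × mean^2 and a single constant dispersion, whereas the study interpolated separate mean-dispersion relationships for input and output libraries; all function names are hypothetical.

```python
import math
import random


def nb_shape_scale(mean, dispersion):
    """Gamma shape and scale giving NB variance = mean + dispersion * mean**2."""
    return 1.0 / dispersion, mean * dispersion


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the modest means used here."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def sample_negative_binomial(mean, dispersion, rng):
    """Draw a Poisson count whose rate is itself gamma distributed."""
    shape, scale = nb_shape_scale(mean, dispersion)
    return sample_poisson(rng.gammavariate(shape, scale), rng)


def simulate_window(mean, dispersion, fold_change=1.0, repressing=False,
                    n_reps=5, seed=0):
    """Simulate replicate input and output counts for one window."""
    rng = random.Random(seed)
    # Regulatory elements scale the output mean up (activating) or down.
    out_mean = mean / fold_change if repressing else mean * fold_change
    inputs = [sample_negative_binomial(mean, dispersion, rng) for _ in range(n_reps)]
    outputs = [sample_negative_binomial(out_mean, dispersion, rng) for _ in range(n_reps)]
    return inputs, outputs


# A repressive element with a twofold effect in a window of mean coverage 20.
inputs, outputs = simulate_window(20, 0.1, fold_change=2, repressing=True)
```

Windows that are not predefined regulatory elements would use `fold_change=1.0`, so their input and output counts share the same mean and differ only through sampling noise.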

Detecting regulatory elements in simulated data

To evaluate the effect of correcting STARR-seq signals on identifying regulatory elements, we used CRADLE to call regulatory elements before and after correcting biases in the simulated data sets. Owing to the normality assumption in Welch's t-test, we modified the CRADLE approach described above to call regulatory activity in uncorrected simulated signals. In place of the Welch's t-test, we used two alternative statistical approaches to compare uncorrected simulated input and output signals. First, we used a Poisson GLM as follows:

$$
\log\big(E(\mathrm{signal}_i)\big) = \beta_0 + \beta_1 \times \mathrm{datatype}_i, \qquad
\mathrm{datatype}_i =
\begin{cases}
0 & \text{if } \mathrm{signal}_i \text{ is from an input library} \\
1 & \text{if } \mathrm{signal}_i \text{ is from an output library}
\end{cases}
\tag{4}
$$

We then performed the Wald test for β1 with the F distribution. Second, we followed a similar approach as used by MACS2 (Zhang et al. 2008). We used the mean input bin signal as the mean parameter in a Poisson distribution to calculate a P-value for the mean output bin signal. We called regulatory activity in corrected simulated signals as described above.
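Both alternative tests can be illustrated with short, standard-library Python. For a Poisson GLM with a single binary covariate, the maximum-likelihood estimates have a closed form, which the sketch uses instead of an iterative fit. Two simplifications are assumptions of this example and not the study's procedure: the Wald statistic is referred to a standard normal distribution rather than an F distribution, and the Poisson test is shown only for the upper tail with the output mean rounded to an integer count. All function names are hypothetical.

```python
import math


def two_group_poisson_glm(input_counts, output_counts):
    """Closed-form fit of log E(signal) = b0 + b1 * datatype.

    Returns (b0, b1, standard error of b1).
    """
    mean_in = sum(input_counts) / len(input_counts)
    mean_out = sum(output_counts) / len(output_counts)
    b0 = math.log(mean_in)
    b1 = math.log(mean_out / mean_in)
    # Inverse Fisher information for b1 under the log link.
    se_b1 = math.sqrt(1 / sum(input_counts) + 1 / sum(output_counts))
    return b0, b1, se_b1


def wald_p_value(estimate, std_error):
    """Two-sided Wald P-value using a normal reference (an assumption here)."""
    z = abs(estimate / std_error)
    return math.erfc(z / math.sqrt(2))


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by summing the lower tail."""
    term = math.exp(-lam)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)


def poisson_bin_p_value(mean_input, mean_output):
    """Upper-tail P-value of the (rounded) output mean under Poisson(mean_input)."""
    return poisson_sf(round(mean_output), mean_input)


b0, b1, se_b1 = two_group_poisson_glm([10, 10], [20, 20])
p_wald = wald_p_value(b1, se_b1)
p_poisson = poisson_bin_p_value(mean_input=10.0, mean_output=20.0)
```

Testing for repression with the Poisson approach would use the lower tail instead, since a repressive bin has less output signal than expected from the input.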

Motif enrichment analysis

Motif enrichment analysis was performed using the findMotifsGenome subcommand in the HOMER (Heinz et al. 2010) 4.10.1 software suite using the following parameters: -size given -mis 3 -mset vertebrates.

Plotting heatmaps

Heatmaps were plotted using the deepTools commands “computeMatrix” in reference-point mode and “plotHeatmap” (Ramírez et al. 2016). In both cases, we specified single-nucleotide resolution using the option --binSize 1.

TF occupancy in CRADLE regulatory elements

To determine whether regulatory elements called by CRADLE are bound by TFs, we used the CRADLE pipeline to detect A549 steady-state activating and repressive regulatory elements from a previously published study (Johnson et al. 2018). We used the findMotifsGenome subcommand in the HOMER suite (version 4.10.1) (Heinz et al. 2010) with the parameters, -size given -mis 0 -mset vertebrates -find, to detect REST motifs in each repressive element and FOSL2, JUNB, and GABPA motifs in activating elements. For each element that encoded a specified motif, we intersected those elements with ENCODE ChIP-seq peaks for the corresponding TF in A549 cells (The ENCODE Project Consortium 2012; Davis et al. 2018).

Validation of REST-occupied repressive regulatory elements

Regions tested by Doni Jayavelu et al. (2020) in K562 cells (N = 7440) were intersected with repressive regulatory elements identified by CRADLE in A549 cells that also contained a REST motif and were bound by REST in the same cell line (N = 240) (The ENCODE Project Consortium 2012). Reported fold change values in K562 cells were compared for the intersection set except the two elements without coverage (N = 28), regions predicted by Doni Jayavelu et al. (2020) to be repressive elements (N = 3001), and control regions (N = 40).

Signals from a genome-wide Survey of Regulatory Elements (SuRE) in HepG2 and K562 cells (van Arensbergen et al. 2019) were compared in specific sets of regulatory elements identified by CRADLE in A549 cells. These regulatory elements included activating regulatory elements that contained either a FOSL2, GABPA, or JUNB motif and were bound by the corresponding TF in A549 (The ENCODE Project Consortium 2012) or repressive elements that likewise contained a REST motif and were bound by REST (N = 240) (The ENCODE Project Consortium 2012). A549 regulatory elements that contained a SNP in the genomes assayed in the SuRE study were excluded on a per genome basis. The minimum and maximum number of SNP-filtered elements compared for each TF are as follows: FOSL2 N = 650-651; GABPA N = 401-402; JUNB N = 723; REST N = 102.

We randomly generated a set of regions (N = 240) of fixed length (430 bp) controlling for accessibility (The ENCODE Project Consortium 2012) and dinucleotide composition. The fixed length was set to the median length of the compared repressive elements. In generating random regions, we excluded regions that overlapped gaps, centromeres, and telomeres that are available in UCSC Gap and Centromere Table Browser (Hinrichs et al. 2006) and ENCODE blacklist regions (Amemiya et al. 2019), or the following features defined by ChromHMM (Ernst and Kellis 2017) in K562 and HepG2 cells: promoters, promoter flanking regions, enhancers, CTCF enriched sites, and repressed regions. After applying the SNP-filter described above, we obtained 94 random regions.

Data access

The PER1 BAC data sets generated in this study have been submitted to NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE149914.

CRADLE is implemented in Python, and the source code is available in the Supplemental Code. CRADLE can be freely downloaded either from GitHub (https://github.com/ReddyLab/CRADLE) or pip (pip install cradle; https://pypi.org/project/CRADLE/). Instructions for installing and running CRADLE are available on the CRADLE GitHub page.

Acknowledgments

We thank Greg Crawford and David MacAlpine for their helpful comments and advice in developing this work. We also acknowledge funding from National Institutes of Health (NIH) grants R01HD085227 (T.E.R.), UM1HG009428 (Y.-S.K., J.S., A.B., W.H.M., A.S.A., T.E.R.), 5R01HG010741 (Y.-S.K., G.D.J., J.S., A.B., W.H.M., A.S.A., T.E.R.), and R01DK104927 (A.B., T.E.R.); G.D.J. was supported by NIH fellowship F32DK115188.

Author contributions: Conceptualization and project administration was by Y.-S.K. and T.E.R.; data curation, formal analysis, investigation, validation, and visualization was by Y.-S.K.; supervision and funding acquisition was by T.E.R.; methodology was by Y.-S.K., T.E.R., G.D.J., J.S., W.H.M., A.O., and A.S.A.; software development was by Y.-S.K., T.N.C., and A.B.; writing of the original draft was by Y.-S.K. and T.E.R.; and review and editing was by all the authors.

Footnotes

[Supplemental material is available for this article.]

Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.269209.120.

Freely available online through the Genome Research Open Access option.

Competing interest statement

The authors declare no competing interests.

References

  1. Amemiya HM, Kundaje A, Boyle AP. 2019. The ENCODE blacklist: identification of problematic regions of the genome. Sci Rep 9: 9354. doi:10.1038/s41598-019-45839-z
  2. Arnold CD, Gerlach D, Stelzer C, Boryn LM, Rath M, Stark A. 2013. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339: 1074–1077. doi:10.1126/science.1232542
  3. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, et al. 2013. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res 41: D991–D995. doi:10.1093/nar/gks1193
  4. Benjamini Y, Bogomolov M. 2014. Selective inference on multiple families of hypotheses. J R Stat Soc 76: 297–318. doi:10.1111/rssb.12028
  5. Benjamini Y, Hochberg Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc 57: 289–300. doi:10.1111/j.2517-6161.1995.tb02031.x
  6. Benjamini Y, Speed TP. 2012. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res 40: e72. doi:10.1093/nar/gks001
  7. Bourgon R, Gentleman R, Huber W. 2010. Independent filtering increases detection power for high-throughput experiments. Proc Natl Acad Sci 107: 9546–9551. doi:10.1073/pnas.0914005107
  8. Chambers VS, Marsico G, Boutell JM, Di Antonio M, Smith GP, Balasubramanian S. 2015. High-throughput sequencing of DNA G-quadruplex structures in the human genome. Nat Biotechnol 33: 877–881. doi:10.1038/nbt.3295
  9. Cheung MS, Down TA, Latorre I, Ahringer J. 2011. Systematic bias in high-throughput sequencing data and its correction by BEADS. Nucleic Acids Res 39: e103. doi:10.1093/nar/gkr425
  10. Chong JA, Tapia-Ramirez J, Kim S, Toledo-Aral JJ, Zheng Y, Boutros MC, Altshuller YM, Frohman MA, Kraner SD, Mandel G. 1995. REST: a mammalian silencer protein that restricts sodium channel gene expression to neurons. Cell 80: 949–957. doi:10.1016/0092-8674(95)90298-8
  11. Courey AJ, Jia S. 2001. Transcriptional repression: the long and the short of it. Genes Dev 15: 2786–2796. doi:10.1101/gad.939601
  12. Davis CA, Hitz BC, Sloan CA, Chan ET, Davidson JM, Gabdank I, Hilton JA, Jain K, Baymuradov UK, Narayanan AK, et al. 2018. The Encyclopedia of DNA Elements (ENCODE): data portal update. Nucleic Acids Res 46: D794–D801. doi:10.1093/nar/gkx1081
  13. Derrien T, Estellé J, Marco Sola S, Knowles DG, Raineri E, Guigó R, Ribeca P. 2012. Fast computation and applications of genome mappability. PLoS One 7: e30377. doi:10.1371/journal.pone.0030377
  14. Doni Jayavelu N, Jajodia A, Mishra A, Hawkins RD. 2020. Candidate silencer elements for the human and mouse genomes. Nat Commun 11: 1061. doi:10.1038/s41467-020-14853-5
  15. The ENCODE Project Consortium. 2012. An integrated encyclopedia of DNA elements in the human genome. Nature 489: 57–74. doi:10.1038/nature11247
  16. Ernst J, Kellis M. 2017. Chromatin-state discovery and genome annotation with ChromHMM. Nat Protoc 12: 2478–2492. doi:10.1038/nprot.2017.124
  17. Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh PR, Anttila V, Xu H, Zang C, Farh K, et al. 2015. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat Genet 47: 1228–1235. doi:10.1038/ng.3404
  18. Gupte R, Muse GW, Chinenov Y, Adelman K, Rogatsky I. 2013. Glucocorticoid receptor represses proinflammatory genes at distinct steps of the transcription cycle. Proc Natl Acad Sci 110: 14616–14621. doi:10.1073/pnas.1309898110
  19. Hansen KD, Irizarry RA, Wu Z. 2012. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics 13: 204–216. doi:10.1093/biostatistics/kxr054
  20. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. 2010. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell 38: 576–589. doi:10.1016/j.molcel.2010.05.004
  21. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F, et al. 2006. The UCSC Genome Browser Database: update 2006. Nucleic Acids Res 34: D590–D598. doi:10.1093/nar/gkj144
  22. Jiang Y, Oldridge DA, Diskin SJ, Zhang NR. 2015. CODEX: a normalization and copy number variation detection method for whole exome sequencing. Nucleic Acids Res 43: e39. doi:10.1093/nar/gku1363
  23. Johnson GD, Barrera A, McDowell IC, D'Ippolito AM, Majoros WH, Vockley CM, Wang X, Allen AS, Reddy TE. 2018. Human genome-wide measurement of drug-responsive regulatory activity. Nat Commun 9: 5317. doi:10.1038/s41467-018-07607-x
  24. Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nat Methods 9: 357–359. doi:10.1038/nmeth.1923
  25. Lee TI, Young RA. 2013. Transcriptional regulation and its misregulation in disease. Cell 152: 1237–1251. doi:10.1016/j.cell.2013.02.014
  26. Lee D, Shi M, Moran J, Wall M, Zhang J, Liu J, Fitzgerald D, Kyono Y, Ma L, White KP, et al. 2020. STARRPeaker: uniform processing and accurate identification of STARR-seq active regions. Genome Biol 21: 298. doi:10.1186/s13059-020-02194-x
  27. Love MI, Huber W, Anders S. 2014. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15: 550. doi:10.1186/s13059-014-0550-8
  28. Love MI, Hogenesch JB, Irizarry RA. 2016. Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation. Nat Biotechnol 34: 1287–1291. doi:10.1038/nbt.3682
  29. Lun AT, Smyth GK. 2014. De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly. Nucleic Acids Res 42: e95. doi:10.1093/nar/gku351
  30. McDowell IC, Barrera A, D'Ippolito AM, Vockley CM, Hong LK, Leichter SM, Bartelt LC, Majoros WH, Song L, Safi A, et al. 2018. Glucocorticoid receptor recruits to enhancers and drives activation by motif-directed binding. Genome Res 28: 1272–1284. 10.1101/gr.233346.117 [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Muerdter F, Boryń LM, Woodfin AR, Neumayr C, Rath M, Zabidi MA, Pagani M, Haberle V, Kazmar T, Catarino RR, et al. 2018. Resolving systematic errors in widely used enhancer activity assays in human cells. Nat Methods 15: 141–149. 10.1038/nmeth.4534 [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Parker SC, Stitzel ML, Taylor DL, Orozco JM, Erdos MR, Akiyama JA, van Bueren KL, Chines PS, Narisu N, Black BL, et al. 2013. Chromatin stretch enhancer states drive cell-specific gene regulation and harbor human disease risk variants. Proc Natl Acad Sci 110: 17921–17926. 10.1073/pnas.1317023110 [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Poptsova MS, Il'icheva IA, Nechipurenko DY, Panchenko LA, Khodikov MV, Oparina NY, Polozov RV, Nechipurenko YD, Grokhovsky SL. 2014. Non-random DNA fragmentation in next-generation sequencing. Sci Rep 4: 4532. 10.1038/srep04532 [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Protozanova E, Yakovchuk P, Frank-Kamenetskii MD. 2004. Stacked-unstacked equilibrium at the nick site of DNA. J Mol Biol 342: 775–785. 10.1016/j.jmb.2004.07.075 [DOI] [PubMed] [Google Scholar]
  35. Ramírez F, Ryan DP, Grüning B, Bhardwaj V, Kilpert F, Richter AS, Heyne S, Dündar F, Manke T. 2016. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res 44: W160–W165. 10.1093/nar/gkw257 [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Reddy TE, Gertz J, Crawford GE, Garabedian MJ, Myers RM. 2012. The hypersensitive glucocorticoid response specifically regulates period 1 and expression of circadian genes. Mol Cell Biol 32: 3756–3767. 10.1128/MCB.00062-12 [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Rhodes D, Lipps HJ. 2015. G-quadruplexes and their regulatory roles in biology. Nucleic Acids Res 43: 8627–8637. 10.1093/nar/gkv862 [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Robinson MD, McCarthy DJ, Smyth GK. 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26: 139–140. 10.1093/bioinformatics/btp616 [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. SantaLucia J Jr. 1998. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci 95: 1460–1465. 10.1073/pnas.95.4.1460 [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Simes RJ. 1986. An improved Bonferroni procedure for multiple tests of significance. Biometrika 73: 751–754. 10.1093/biomet/73.3.751 [DOI] [Google Scholar]
  41. Teng M, Irizarry RA. 2017. Accounting for GC-content bias reduces systematic errors and batch effects in ChIP-seq data. Genome Res 27: 1930–1938. 10.1101/gr.220673.117 [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. van Arensbergen J, Pagie L, FitzPatrick VD, de Haas M, Baltissen MP, Comoglio F, van der Weide RH, Teunissen H, Võsa U, Franke L, et al. 2019. High-throughput identification of human SNPs affecting regulatory element activity. Nat Genet 51: 1160–1169. 10.1038/s41588-019-0455-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Welch BL. 1947. The generalisation of student's problems when several different population variances are involved. Biometrika 34: 28–35. 10.1093/biomet/34.1-2.28 [DOI] [PubMed] [Google Scholar]
  44. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, et al. 2008. Model-based Analysis of ChIP-Seq (MACS). Genome Biol 9: R137. 10.1186/gb-2008-9-9-r137 [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Zheng W, Chung LM, Zhao H. 2011. Bias detection and correction in RNA-sequencing data. BMC Bioinformatics 12: 290. 10.1186/1471-2105-12-290 [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Zhou T, Yang L, Lu Y, Dror I, Dantas Machado AC, Ghane T, Di Felice R, Rohs R. 2013. DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale. Nucleic Acids Res 41: W56–W62. 10.1093/nar/gkt437 [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press