Direct characterization of cis-regulatory elements and functional dissection of complex genetic associations using HCR-FlowFISH

Steven K Reilly; Sager J Gosai; Alan Gutierrez; Ava Mackay-Smith; Jacob C Ulirsch; Masahiro Kanai; Kousuke Mouri; Daniel Berenzy; Susan Kales; Gina M Butler; Adrianne Gladden-Young; Redwan M Bhuiyan; Michael L Stitzel; Hilary K Finucane; Pardis C Sabeti; Ryan Tewhey

doi:10.1038/s41588-021-00900-4

Author manuscript; available in PMC: 2022 Mar 16.

Published in final edited form as: Nat Genet. 2021 Jul 29;53(8):1166–1176. doi: 10.1038/s41588-021-00900-4

PMCID: PMC8925018 NIHMSID: NIHMS1723995 PMID: 34326544

Abstract

Effective interpretation of genome function and genetic variation requires a shift from epigenetic mapping of cis-regulatory elements (CREs) to characterization of endogenous function. We developed HCR-FlowFISH, a broadly applicable approach to characterize CRISPR-perturbed CREs via accurate quantification of native transcripts, alongside CASA (CRISPR Activity Screen Analysis), a hierarchical Bayesian model to quantify CRE activity. Across >325,000 perturbations, we provide evidence that CREs can regulate multiple genes, skip over the nearest gene, and can display activating and/or silencing effects. At the cholesterol-level associated FADS locus, we combine endogenous screens with reporter assays to exhaustively characterize multiple genome-wide association signals, functionally nominating causal variants and importantly, identifying their target genes.

Identification and functional characterization of CREs - sequence modules controlling gene expression - is a critical challenge for modern genomics. While over 900,000 putative CREs have been identified via the co-occurrence of transcription factor (TF) binding, DNA accessibility, and histone modifications, few have been directly shown to regulate gene expression in their endogenous chromosomal context^1–4. Only 7% of CREs exhibiting chromatin interactions with distal promoters interact with the nearest gene⁵. Most putative CREs remain unvalidated, with target gene(s) and quantitative effects unknown.

As over 90% of sequence variants associated with human traits and disease risk are in non-coding regions, and many are thought to impact CRE function, there is a crucial need to better characterize CREs⁶. While genetic fine-mapping can help pinpoint likely causal variants^7,8, and episomal reporter assays can identify variants with regulatory potential^9–12, methods to directly quantify the function of CREs containing variants and identify their target gene(s) are lacking. This limitation hinders our ability to identify causal variants and elucidate the mechanistic role of common variation in complex diseases and traits.

One promising approach to functionally characterizing endogenous CREs is to perturb their predicted activity using CRISPR/dCas9 to induce a targeted transcriptionally-repressive chromatin state using a tethered histone-deacetylase KRAB (CRISPRi)¹³. Generally, such approaches link single guide-RNAs (gRNAs) and their target CREs in a cell to an easily observed phenotype, and are highly amenable to parallelized screening^14–17. However, many CRISPR non-coding studies rely on using specific cellular phenotypes downstream of gene transcription (such as cell growth or resistance to a cytotoxic drug) as readouts, restricting CRE characterization to a subset of genes.

A more direct and generalizable strategy to assess CREs is to measure transcript abundance in response to perturbations, but existing methods are limited and only effective for highly expressed transcripts. Tagging endogenous transcripts with transgenic reporters, such as GFP, indirectly estimates transcriptional regulation via translation, is non-universal, and requires unique cellular models for each target locus^18–20. Single-cell RNA-sequencing of CRISPR screens directly measures transcript abundance but is limited in scale to relatively few gRNAs, restricting interrogation to small portions of the genome at high cost²¹. One encouraging strategy has been to combine amplified single-cell fluorescent in-situ hybridization (FISH) techniques with CRISPR screens²². However, the current implementation uses Thermo Fisher’s PrimeFlow assay, which is unmodifiable and restricted to identifying CREs for only highly expressed targets. This severely limits our ability to characterize variants identified by GWAS, where multiple potential target genes must be exhaustively considered.

We developed HCR-FlowFISH to be able to reliably characterize most CREs. This method combines (1) CRISPRi-mediated perturbation of CREs with (2) hybridization chain reaction (HCR), a versatile amplified FISH method, for (3) flow cytometry-based single-cell measurements (Fig. 1a)²³. We also developed CASA, a Bayesian inference tool that directly models CRE activity from flow cytometry-based CRISPR screens and displays improved CRE quantification over previous analysis methods. Our approaches are open-sourced, sensitive, and scalable to thousands of CREs and multiple genes, enabling exhaustive screens for all genes in a single locus.

Figure 1 | — a, Overview of HCR-FlowFISH method showing CRE identification using endogenous CRISPRi perturbation of the genome, quantification of transcript abundance with HCR, and flow-cytometry assisted sorting to bin effector gRNAs. b, Timeline of HCR-FlowFISH protocol shows shortened, 2-day protocol. c, Detection of 23 transcripts and background (EGFP, non-expressed) via HCR. d, Probe-number normalized HCR signal correlates with gene expression levels in K562 cells, showcasing utility of HCR across a broad range of genes (R² = 0.7731 on log₁₀ probe-number normalized HCR signal). e, Increasing probe number, probe concentration, or hours of amplification increases the HCR signal-to-background ratio. f, Detection of *TUB1B* and *ACTB* across six suspension and adherent mammalian cell lines displays wide applicability of HCR-FlowFISH. g, HCR signal-to-background ratio does not diminish for 21 days.

Here we use HCR-FlowFISH to functionally characterize 326,130 perturbations, including four genes at the FADS locus, which is associated with lipid levels and has undergone recent positive-selection in human populations. We reveal complex regulatory landscapes in greater detail than provided by biochemical readouts like H3K27ac and DNase I hypersensitive sites (DHSs). Our technique detected CREs shared by multiple genes, including elements with opposing directions of effect. Finally, by combining HCR-FlowFISH with reporter assays, we are able to parse complex genetic associations and nominate functional regulatory variants with their target gene(s).

Results

HCR-FlowFISH is a robust transcript quantification method.

We sought to establish HCR-FlowFISH as a method that enables accurate amplification of endogenous transcript levels for detection by flow cytometry. HCR quantitatively increases FISH signal by combining transcript-specific DNA probes with fluorescently labeled, metastable hairpins that form long chains, amplifying fluorescent signal once initialized by the hybridization probe. We started by modifying HCR, a method initially developed for FISH microscopy, to allow scaling to millions of single cells per experiment while providing increased transcript sensitivity and compatibility with high-throughput screening (Fig. 1b and Methods)²³. We applied HCR-FlowFISH to a wide range of genes and were able to individually detect expression of 23 transcripts in K562 cells out of 23 attempted targets relative to a negative control (Fig. 1c). The assessed gene transcripts span a range of lengths (871–5,368 bp) and expression levels (1.2–2,734 transcripts per million (TPM)), with transcript levels correlating with the average probe-number normalized fluorescent signal (R² = 0.7731) (Fig. 1d).

We selected three genes to interrogate more closely, comparing HCR-FlowFISH to PrimeFlow, a proprietary RNA FISH method based on branched DNA amplification previously used for CRISPR non-coding screens²². For the three genes, representing low, middle, and high expression (LARGE1, 1.5 TPM; FADS3, 77 TPM; GATA1, 193 TPM), we compared the signal intensity. For all three genes, we observed an improved signal-to-noise ratio (SNR) using HCR, enabling a greater dynamic range and improved detection threshold for low-abundance transcripts (Supplementary Fig. 1a,b).

We investigated modifications to our HCR-FlowFISH protocol to further improve the SNR for lowly expressed genes and applicability in diverse cell types. We increased the SNR two-fold by increasing the duration of the hairpin amplification process. Increasing probe concentration and the number of probes used per target transcript increased SNR by five-fold and two-fold, respectively (Fig. 1e). Across a variety of common suspension (Jurkat, GM12878, TF1) and adherent (293T, HepG2, SK-N-SH) cell lines, we successfully observed HCR signals concordant with cellular transcript levels, demonstrating that HCR is robust across all cell types we tested (Fig. 1f)²⁴. Fluorescent signals for all 23 genes tested were stable for at least 21 days (Fig. 1g). Thus, signal detection and quantification can be optimized for individual gene targets and is robust across cell types.

HCR-FlowFISH and CASA identify CREs in non-coding CRISPR screens.

We next demonstrated the application of HCR-FlowFISH to measure transcript abundance as a phenotypic readout in CRISPR non-coding screens. We first assayed the GATA1 locus, which has been well studied by previous growth-based screens, allowing us to benchmark our method while highlighting the advantages of reading out expression rather than growth¹⁷. Our lentiviral library contained 15,000 gRNAs targeting 90 kb of sequence, comprising 278 DHSs in a 920-kb region encompassing GATA1; we infected cells such that each cell received no more than one gRNA (Fig. 1a).

To identify CRE-perturbing gRNAs, we then sorted doxycycline-induced CRISPRi-expressing cells and performed HCR-FlowFISH (Extended Data Fig. 1a). We used different fluorescently labeled probe-hairpin combinations targeting GATA1 and the housekeeping gene TBP to control for cell size and permeability, sorting for cells in the top and bottom 10% for expression (Extended Data Fig. 1b and Methods). DNA from the cells in each bin was isolated, gRNAs were sequenced and mapped to their genomic targets, and a guide score was calculated as the natural log ratio (low:high) of gRNA abundance in the two expression bins. Composite scores reflecting all gRNAs’ overlapping effects on a single nucleotide were also calculated (Methods). Regions repressed by CRISPRi, resulting in decreased gene expression, were assigned higher composite guide scores, indicating their roles as putative CREs that enhance gene transcription.
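The scoring described above can be sketched in a few lines, assuming per-gRNA read counts are available for each sorted bin. The library-size normalization, the pseudocount, the half-open interval representing each gRNA's footprint, and all function names are illustrative assumptions, not details taken from the Methods; averaging of overlapping guide scores follows the Figure 2 legend.

```python
import math
from collections import defaultdict

def guide_score(low_count, high_count, low_total, high_total, pseudocount=1.0):
    """Natural log ratio (low:high) of a gRNA's abundance in the two sorted bins.

    Counts are normalized by the total reads in each bin (an assumption), and
    a pseudocount (also an assumption) avoids division by zero.
    """
    low_frac = (low_count + pseudocount) / low_total
    high_frac = (high_count + pseudocount) / high_total
    return math.log(low_frac / high_frac)

def composite_scores(guides):
    """Average the scores of all gRNAs overlapping each nucleotide.

    `guides` is an iterable of (start, end, score) tuples with half-open
    coordinates describing the hypothetical footprint of each gRNA.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for start, end, score in guides:
        for pos in range(start, end):
            sums[pos] += score
            counts[pos] += 1
    return {pos: sums[pos] / counts[pos] for pos in sums}
```

With this sign convention, a gRNA enriched in the low-expression bin receives a positive score, matching the statement that repressed enhancing regions are assigned higher composite guide scores.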

In order to identify CREs and estimate their effect sizes from HCR-FlowFISH data, we developed CASA as a statistical framework. CASA models latent CRE activity over a disjoint sliding genomic window containing sequenced gRNAs and their propensity to alter gene expression (Extended Data Fig. 1c). Per replicate, we approximated posterior CRE activity scores of putative CREs and identified regions where scores diverged substantially from non-targeting control gRNAs (Methods). Using generative models based on empirical data, CASA activity scores are tightly correlated with absolute gene knockdown level (Pearson r > 0.92; P < 10⁻¹⁰⁰) (Supplementary Fig. 2a–d). Notably, we found that when analyzing a moderately expressed gene (112.0 TPM), CASA maintains 84% sensitivity for detecting CREs that knock down gene expression by 15% when perturbed. Additionally, sensitivity is improved when more guides are used to detect putative CREs of modest effect, especially when targeting less abundantly expressed genes (Supplementary Fig. 2e,f).

CASA identified three regions associated with GATA1 expression in our pilot screen. Both composite and individual guide scores of gRNAs were strongest at the promoter of GATA1 and two distal elements, eGATA1 and eHDAC6, previously identified in growth screens (Fig. 2b,c and Methods)¹⁷. In total, 1,219 gRNAs were shared with a previous study (Fig. 2a and Extended Data Fig. 2a), and we observed strong, significant correlation between studies¹⁷ in both individual gRNA scores (Pearson r = 0.84; P = 3.4 × 10⁻¹⁰⁶) and composite guide scores (Extended Data Fig. 2a,b,d). Correlation was much stronger in CASA-identified CREs (Spearman ρ = 0.79; P = 2 × 10⁻⁵⁴) than in non-CRE regions (Spearman ρ = 0.15; P = 4 × 10⁻⁷⁹) (Fig. 2d).

Figure 2 | — a, An HCR-FlowFISH CRISPRi screen on a 920-kb region centered on *GATA1*. Black boxes show regions targeted by guides, K562 DHS shown in light blue, H3K27ac in dark blue; composite guide scores for *GATA1* (red) and *HDAC6* (yellow) are shown. b, Zoom on a 40-kb region showing that CREs identified by growth screens at this locus, eGATA1 and eHDAC6, are also identified by HCR-FlowFISH, as well as the respective promoters for *GATA1* and *HDAC6*. Composite score tracks are averaged, overlapping guide scores. c, HCR-FlowFISH guide scores for gRNAs in eGATA1 (n = 85), eHDAC6 (n = 115), and eGLOD5 (n = 25), compared to randomly permuted gRNAs. eGATA1 and eHDAC6 are identified by CASA (*two-sided t-test, P ≤ 1 × 10⁻⁵; ns = not significant). The minima, centers, and maxima of the boxes indicate the 25th, 50th, and 75th percentiles of the data distributions. Whiskers capture all remaining data, excluding outliers extending beyond 1.5 times the interquartile range below or above the 25th or 75th percentiles, respectively. d, Comparison of HCR-FlowFISH composite guide scores to growth scores, both binned in 10-bp regions, showing high correlation for regions in CREs identified by CASA (Spearman ρ = 0.79, two-sided t-test P = 2 × 10⁻⁵⁴, dark red line is the ordinary least squares regression best fit, red shaded band is 95% confidence interval). e, Comparison of HCR-FlowFISH guide scores for gRNAs targeting the *GATA1* transcript (red) or *HDAC6* transcript (yellow) at promoter regions 1,000 bp upstream of the TSS. Guides at the *GATA1* (n = 46) and *HDAC6* (n = 88) promoters were high scoring when HCR was performed with probes for that promoter’s transcript and yielded significant CASA CREs, but not at nearby genes *TIMM17B* (n = 79) and *PIM2* (n = 89) (*two-sided t-test, P ≤ 1 × 10⁻⁵; ns = not significant). The minima, centers, and maxima of the boxes indicate the 25th, 50th, and 75th percentiles of the data distributions. Whiskers capture all remaining data, excluding outliers extending beyond 1.5 times the interquartile range below or above the 25th or 75th percentiles, respectively.

We compared HCR-FlowFISH’s performance to that of growth screens and PrimeFlow. HCR-FlowFISH identified CREs with improved specificity relative to growth screens: guide scores for gRNAs within CASA-defined CREs are better separated from non-CRE-overlapping gRNAs (Extended Data Figs. 2e and 3a). Specifically, HCR displays improved signal to noise and reduced guide score variability, especially in non-CRE regions, enabling more accurate CRE calls (Extended Data Fig. 3a,b)²².

Because HCR-FlowFISH can be applied directly on any transcript and can interrogate more lowly expressed genes, we were able to interrogate CREs for the nearby gene HDAC6 using the same cell library. As expected, CASA detected a CRE at the promoter of HDAC6, and promoter-targeting gRNAs displayed significantly increased guide scores (Mann-Whitney (MW) P < 9 × 10⁻²²), an effect not observed at gRNAs targeting nearby promoters for GATA1, TIMM17B, and PIM2 (Fig. 2e).

We identified two distal CREs for HDAC6. Surprisingly, both overlapped CREs for GATA1, one overlapping the previously identified eHDAC6 element and the other overlapping the GATA1 promoter (Fig. 2b). Notably, the gRNAs perturbing the promoter of GATA1 reduced GATA1 expression but increased HDAC6 expression, consistent with previous studies¹⁷ (Extended Data Fig. 2d). Our analysis at this locus highlights how HCR-FlowFISH can provide direct evidence of complex regulatory activity that eludes existing biochemical assays for chromatin structure.

Lastly, we applied HCR-FlowFISH and CASA to the MYC locus, another region well studied by CRISPRi growth screens¹⁷. We identified all previously validated CRE-promoter interactions for MYC that were tested by our screen, including a promoter boundary element at the non-coding transcript PVT1. This element, when inhibited or deleted, allows nearby CREs to target MYC²⁵, concordant with our screens, in which MYC expression increases when the PVT1 promoter is perturbed (Extended Data Fig. 4a). Interestingly, we observed that this switching is reciprocal, with perturbation of MYC increasing PVT1 expression.

Comprehensive CRE scans of five loci show the flexibility of HCR-FlowFISH.

Next, we deployed comprehensive HCR-FlowFISH CRE-scans by selecting all high-quality gRNAs across five loci covering 150 kb to 2.3 Mb each. We used standard gRNA design software²⁶ with strict cutoffs for guide specificity and efficiency to design libraries of 52,500 targeting gRNAs and 7,500 controls (Methods).

We studied the moderately expressed gene LMO2 (79.8 TPM), a regulator of hematopoiesis for which multiple CREs have been previously nominated^27,28. CASA identified a cluster of strong CREs +67–76 kb upstream of LMO2’s promoter (Extended Data Fig. 4b). These CREs reside within H3K27ac peaks that have been reported to interact directly with the LMO2 promoter and drive expression in vivo^27,28. This stretch of CREs has been shown to recapitulate LMO2 expression in blood cells, consistent with our use of K562 cells²⁹. HCR-FlowFISH also identified that these CREs act on the expression of both CAPRIN1 and CAT, nearly 500 kb away. In addition, CAGE data for K562 identifies three promoters at LMO2: a proximal, intermediate and distal promoter³⁰. We detected the stronger proximal and intermediate promoters but failed to observe a significant signal by HCR-FlowFISH when perturbing the distal promoter. To investigate further, we performed targeted knockdown of the distal promoter and observed increased expression from the internal promoters, thus maintaining overall LMO2 expression levels (Supplementary Fig. 6c).

We also examined two other similarly expressed genes, ERP29 and CD164. For ERP29, a reticuloplasmin, CASA identified the strongest CREs at the promoter and at distal elements within introns of TMEM116 (Fig. 3a). The CREs exhibit opposing effects on ERP29, with one proximal CRE 8 kb upstream of ERP29 that decreases transcript abundance and distal CRE signals 19 kb upstream of ERP29 that increase transcript abundance. This illustrates HCR-FlowFISH’s ability to detect both the strength and direction of regulatory activity. For CD164, a transmembrane sialomucin, we identified four CREs in addition to the promoter, clustered −77 kb and −145 kb downstream of the transcription start site (Fig. 3b). Notably, two closer, strong DHS/H3K27ac peaks displayed negligible effect on CD164 expression. eQTLs at these two CREs are associated with RP11-425D10.10, and promoter capture Hi-C links them to CCDC162P, suggesting they are functional CREs acting on more distant transcripts. At the −77 kb CRE, we had previously identified a SNP associated with red blood cell count, rs1546273, showing that it altered reporter gene expression and in its endogenous location looped to the promoter of CD164⁹. HCR-FlowFISH data indicate that the CRE harboring this variant, when perturbed, has the strongest impact on CD164 expression of all CREs in the region.

Figure 3 | — **a-d**, Connectogram diagrams showing K562 DHS (light blue), K562 H3K27ac (dark blue), guide coverage (black), HCR-FlowFISH composite guide score tracks, and CASA CREs for *ERP29* (yellow), *CD164* (purple), *NMU* (green), and *MEF2C* (pink). CASA-derived CRE activity scores are shown as lines connecting the CRE to the target gene, and colored by effect on transcript abundance (black decreases abundance, red increases abundance). In each case, CASA identified CREs supported by K562 H3K27ac and DHS. Stars on *MEF2C* and *CD164* guide score tracks indicate the locations of two variants of interest.

To test the limits of HCR-FlowFISH detection, we assayed transcripts for the lowly expressed MEF2C (14.8 TPM) as well as NMU (612.8 TPM) for which small effect CREs (<10% knockdown) have been identified. MEF2C is a multifunctional gene important for hematopoiesis, while NMU acts in lymphoid cells as a signaling modulator of type 2 inflammation. As expected, CASA identified strong CREs over the NMU and MEF2C promoters (Fig. 3c,d). For NMU, previous work using an orthogonal CRISPR-QTL method has identified a CRE in a cluster of DHS sites 30.5–34 kb upstream of NMU responsible for a 7% reduction in gene expression²¹. CASA called a CRE in the same upstream region, as well as other weak intronic CREs overlapping low levels of K562 H3K27ac (Fig. 3c).

The GWAS variant rs114694170 resides in one of MEF2C’s promoter-proximal CREs (Fig. 3d), highlighting the utility of HCR-FlowFISH for identifying the targets of GWAS-identified variants. We previously nominated this as a causal variant affecting platelet count by genetic fine-mapping and overlap with a hematopoietic progenitor specific CRE³¹, but we lacked experimental evidence for the effect of the CRE containing this variant. Our HCR-FlowFISH maps in K562 cells provide direct evidence that the CRE harboring rs114694170 is a functional regulator of MEF2C.

Dense perturbation of FADS1 identifies CREs and informs gRNA design.

We next investigated the fatty acid desaturase (FADS) locus (108 kb surrounding FADS1, FADS2, FADS3, and FEN1) in order to comprehensively interrogate the regulatory landscape of an entire gene cluster (Fig. 4a). After perturbing the FADS locus, sorting on FADS1 expression, and applying CASA, we identified four clusters of CREs for FADS1: (i) its own promoter and a nearby but weaker 1st intron CRE; (ii, iii) strong CREs 18 kb and 25 kb upstream in the introns of FADS2; and (iv) a +58 kb upstream intergenic CRE between the 3’ ends of FADS2 and FADS3 (Fig. 4a).

Figure 4 | — a, ~100-kb genomic interval surrounding the *FADS* locus displaying ENCODE DHS (light blue) and H3K27ac (dark blue) from K562 cells, along with the guide coverage tiled by HCR-FlowFISH (black). HCR-FlowFISH composite guide scores for one replicate of *FADS1*, *FADS2*, *FADS3*, and *FEN1* are shown. Tiling MPRA data for the same locus is included below in red. Individual gRNA binding locations for panel d are noted. b, Autocorrelation plot of adjacent guides on *FADS1* HCR-FlowFISH shows significant correlation of nearby guides. c, Guide-wise logit scores for guides in the gene promoters show significantly high scoring at promoters compared to an equal number of permuted guides (*two-sided Mann-Whitney-Wilcoxon test, P ≤ 1 × 10⁻¹⁰). d, qPCR analysis of *FADS1* (n = 3 primers, each three technical replicates) and *FADS3* (n = 4 primers, each three technical replicates) expression changes (bars represent standard deviation) after targeting with single guides or a non-targeting guide (NT) corroborates transcript abundance patterns in the full screen.

We also used this locus to test the relationship of gRNA design and density in a non-coding screen by densely tiling it using nearly all available gRNAs (9,554) regardless of predicted off-target quality. Even with these lenient gRNA selection criteria, we found HCR-FlowFISH experiments to be highly reproducible. For example, composite guide scores were well correlated across replicates of FADS1 HCR-FlowFISH (Pearson r = 0.98) (Supplementary Fig. 3a). Furthermore, individual guide scores were highly auto-correlated between neighboring gRNAs across the locus, as expected, due to overlapping CRISPRi activity (Fig. 4b).
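The autocorrelation between neighboring gRNAs can be illustrated with a short sketch. It assumes guide scores are already ordered by genomic position and uses a lag counted in guides rather than base pairs; the function names and that simplification are illustrative assumptions and are not the analysis behind Fig. 4b.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def lag_autocorrelation(scores, lag=1):
    """Correlate each guide score with the score `lag` guides downstream."""
    if lag < 1 or lag >= len(scores) - 1:
        raise ValueError("lag must be >= 1 and leave at least two pairs")
    # Pair score[i] with score[i + lag] for every valid i.
    return pearson(scores[:-lag], scores[lag:])
```

A value near 1 at lag 1 indicates that adjacent guides report similar effects, which is the pattern expected when nearby gRNAs perturb the same element.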

As expression and targeting efficiencies of gRNAs can vary widely³² and are shown to be associated with false positives in growth-based CRISPR screens³³, we investigated various design metrics for association with performance measures in our screen. When comparing overlapping gRNAs between HCR-FlowFISH and GATA1 growth screens, we observed that high representation in the cell library, rather than high gRNA specificity, appears to contribute more to the positive correlation between datasets (Extended Data Fig. 2c). HCR-FlowFISH results for FADS1 supported this observation, as neither gRNA efficiency nor specificity was predictive of activity (Supplementary Fig. 4a,b). GC content was associated with activity, likely due to the higher GC content at promoters (Supplementary Fig. 4c). Dense gRNA tiling also confirmed that modeling the strand targeted by the gRNA independently shows no differential effect, supporting a model where steric positioning of the KRAB domain is unimportant for local CRE repression (Supplementary Fig. 4d,e).

We downsampled gRNAs at the FADS locus to determine the minimum set required for reliable calls of CRE activity on the FADS1 gene. The full screen has an average of 30.36 gRNAs/kb. After a 74% reduction (7.41 gRNAs/kb), a maximum of two small spurious CREs were identified by CASA when using both FADS1 experimental replicates at any level of subsetting. These CRE calls are primarily driven by a few high-scoring gRNAs in gene bodies that are outliers relative to neighboring gRNAs (Supplementary Fig. 4f). Together, these data suggest that our transcription-based assay’s less restrictive gRNA design requirements allow for additional guides, increasing the accuracy of non-coding CRISPR screens.

HCR-FlowFISH reveals a complex regulatory landscape at the FADS locus.

We next performed HCR-FlowFISH to quantify CRE activity for three other genes in the locus (FADS2, FADS3 and FEN1), generating well-correlated replicates (Supplementary Fig. 3a), and found the strongest scores at the promoter for each gene (MW P ≤ 1 × 10⁻¹⁰) (Fig. 4c). For FADS2, CASA did not identify a CRE at its canonical promoter, but instead nominated an alternative promoter 12 kb upstream. Only this alternative promoter was supported by CAGE data in K562 cells, suggesting it is the sole active promoter in the cell line (MW P ≤ 1 × 10⁻²⁰) (Supplementary Fig. 3b).

Three non-promoter CREs identified at the FADS locus displayed evidence of sharing CRE activity across multiple genes (Extended Data Fig. 5a,b). Perturbation of an intronic CRE 18 kb downstream of the FADS2 promoter caused a decrease in expression of all four genes tested, highlighting the extent of sharing that can occur for a single regulatory element. Other CREs were specific for only a subset of the genes at the locus. Strikingly, some regions were clearly marked by DHS and H3K27ac in K562 cells, yet perturbation of these regions did not alter expression of the four proximal genes measured by HCR-FlowFISH, consistent with their looping to genes (BEST1/FTH1) farther away (Fig. 4a). This highlights the complexity of regulatory interactions and the need for direct experimental observation. We validated the effects of several of these CREs using single gRNAs, imaging, and qPCR (Fig. 4d and Supplementary Fig. 3c).

We further validated CREs identified by HCR-FlowFISH at FADS by orthogonal high-throughput methods. Comparing 10-bp bins in HCR-defined CREs versus an equal number of randomly permuted bins not in CREs, HCR-FlowFISH-identified CREs were specifically enriched for H3K27ac, DHS, PRO-seq, GRO-seq, and CAGE data from K562 cells (MW P ≤ 1 × 10⁻⁵ for each) (Supplementary Fig. 5a). Notably, all CREs identified by HCR-FlowFISH overlapped previously identified H3K27ac peaks in K562 cells (Fig. 4a) and >85% of significant gRNAs directly overlapped an H3K27ac element (Supplementary Fig. 5b). The composite score from HCR-FlowFISH was similar to H3K27ac signal in both position and magnitude, displaying the highest significant cross-correlation (corr = 0.75) with a 0-bp offset (Supplementary Fig. 5c).
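A minimal sketch of an offset scan like the cross-correlation reported above, assuming both tracks have been binned to the same resolution (for example 10-bp bins) and have equal length. The use of Pearson correlation at each shift, the function names, and the offset convention are illustrative assumptions rather than the procedure used for Supplementary Fig. 5c.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def cross_correlation(track_a, track_b, max_offset):
    """Correlation between two binned tracks at each bin offset.

    A positive offset k pairs track_a[i + k] with track_b[i].
    """
    n = len(track_a)
    result = {}
    for k in range(-max_offset, max_offset + 1):
        if k >= 0:
            x, y = track_a[k:], track_b[:n - k]
        else:
            x, y = track_a[:n + k], track_b[-k:]
        result[k] = pearson(x, y)
    return result

def best_offset(track_a, track_b, max_offset):
    """Offset (in bins) at which the two tracks are most correlated."""
    scores = cross_correlation(track_a, track_b, max_offset)
    return max(scores, key=scores.get)
```

Multiplying the returned offset by the bin width converts it to base pairs; a best offset of 0 means the two signals peak at the same positions.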

We found that HCR-FlowFISH was able to distinguish overlapping CREs and assign each to their target gene (Fig. 4a,c). One example is an intergenic CRE for FADS1 and FADS3 downstream of FADS2, which overlaps a single H3K27ac peak and two distinct DHS peaks 1.5 kb apart. While perturbation of this CRE is associated with decreases in FADS1 expression, it also is associated with increases in FADS3 expression.

To confirm that our screen could detect repressors as well as enhancers and promoters, and to investigate the molecular underpinnings of such elements, we targeted the FADS3 repressive CRE identified above with a CRISPR-KO cutting screen using 900 gRNAs overlapping the CRE. We sorted cells based on FADS3 expression and subjected high and low expression cell populations to long-read sequencing from a single amplicon covering the CRE to identify deletions impacting gene expression (Fig. 5a and Methods). We recapitulated the CRISPRi transcriptional effects with the CRISPR-KO screen, and applied CASA to nucleotide-resolution counts of deletions in high and low expression amplicon pools to call significant functional windows, localizing a 3,000-bp CRISPRi window down to a core 500-bp functional region directly overlapping a DHS peak (Fig. 5b). Notably, we identified three bound TFs via ChIP, along with their canonical motifs, including the transcriptional repressor REST, potentially explaining this CRE’s association with increased FADS3 expression when perturbed. We also identified binding of TFs important for long-range targeting of CREs including NRF1 and CTCF (Fig. 5c).

Figure 5 | — a, Clustering of high-quality single gRNA-targeted deletion sequences (purple bars) within a 3-kb CRE initially identified by the *FADS3* CRISPRi screen. Deletion-bearing cells were subject to HCR-FlowFISH and sorting based on *FADS3* transcript abundance. b, Individual guide scores (orange) from *FADS3* HCR-FlowFISH screen, overlaid with the log odds ratio of the cumulative deletion frequencies per nucleotide in the low versus high *FADS3* transcript abundance bins (purple). DHS (light blue) and H3K27ac (dark blue) K562 peak calls are shown, along with CASA CRISPR cutting CRE in purple identifying a core 500-bp CRE. c, Zoomed view of the CRISPR cutting CRE identifies underlying ChIP peaks for multiple transcription factors (black bars) and canonical TF motifs (inlaid green bars). d, Luciferase reporter assays of CREs with scrambled REST (R, red), CTCF (C, blue), NRF1 (N, pink), and all combinations. REST scrambles increased reporter expression (n = 6 each, *two-sided t-test P < 0.0001), except when paired with CTCF scrambles, which removed all CRE activity in all contexts. The minima, centers, and maxima of the boxes indicate the 25th, 50th, and 75th percentiles of the data distributions. Whiskers capture all remaining data, excluding outliers extending beyond 1.5 times the interquartile range below or above the 25th or 75th percentiles, respectively.

To confirm the function of this element as a repressor, we generated luciferase reporter assays comparing wild-type sequence to constructs where the REST, CTCF, and NRF1 binding motifs were scrambled (Fig. 5d). Abrogation of the NRF1 element had no effect on CRE activity, while CTCF abrogation resulted in weaker CRE activity (t-test P < 0.0001 in all comparisons). CREs with a scrambled REST element showed a marked increase in activity, both alone (1.8 log₂ fold change) and with the nonfunctional NRF1 element (1.2 log₂ fold change). We further confirmed that REST directly binds the CRE using an electrophoretic mobility shift assay (Supplementary Fig. 5d,e). These results support CTCF as a primary transcription-promoting factor, with REST acting as a direct repressor for the CRE.
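For readers less used to log₂ units, the conversion between reporter signal ratios and log₂ fold changes can be sketched as follows. The function names and the definition of fold change as scrambled signal divided by wild-type signal are assumptions for illustration; the normalization actually applied to the luciferase data is described in the Methods.

```python
import math

def log2_fold_change(scrambled_signal, wild_type_signal):
    """log2 of the ratio of scrambled-motif to wild-type reporter signal."""
    return math.log2(scrambled_signal / wild_type_signal)

def fold_from_log2(log2_fc):
    """Convert a log2 fold change back to a linear fold change."""
    return 2.0 ** log2_fc
```

Under this definition, log₂ fold changes of 1.8 and 1.2 correspond to roughly 3.5-fold and 2.3-fold increases in reporter activity, respectively.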

In summary, across all eight loci that we examined with HCR-FlowFISH, our data demonstrated both concordance with, and exceptions to, several conventions of regulatory elements. We find that 95% of our CREs are co-occurring with DHS and/or H3K27ac signal, and that our CRE activity is significantly correlated with H3K27ac (Spearman ρ = 0.58; P = 1.7 × 10⁻⁷) and more weakly to DHS (Spearman ρ = 0.35; P = 0.007) (Supplementary Fig. 6a,b). We also observed strong correlation with PRO-seq data (Spearman ρ = 0.50; P = 3.5 × 10⁻⁴) (Supplementary Fig. 6c). We catalogued widespread regulatory element sharing between genes, something that has been described recently^19,22. Notably, we frequently observed active distal regulators that skipped over more proximal DHS and H3K27ac regions with no functional effect. We note that distance of a CRE to its target promoter is not significantly correlated with CRE activity score in our dataset (Spearman ρ = −0.11; P = 0.38) (Supplementary Fig. 6d), though the five strongest CREs across all screens were within 100 kb of their targets. All CREs identified by HCR-FlowFISH co-occur within the same topologically associated domain as their target promoter. 50% of CREs have significant ChIA-PET promoter interactions compared to only 11% of nearby DHS peaks not identified by HCR-FlowFISH, and HCR-FlowFISH CRE interactions are supported by more reads (Supplementary Fig. 6e).

Nominating causal genetic variants and their effector transcripts.

The FADS locus harbors a high density of independent GWAS associations for lipid metabolites and related traits; however, causal SNPs and target genes have proven difficult to characterize due to a positive-selection driven elevation in linkage disequilibrium (LD) across the region^34–36. The enzymes encoded by the genes at the FADS locus are necessary in biosynthesis of omega-6 and omega-3 long-chain polyunsaturated fatty acids (PUFAs), with levels of their metabolites well-reflected in blood cells³⁷.

To identify functional variants that are candidate causal alleles at the FADS locus, we first investigated eQTLs, analyzing the 241 variants with a significant association to the expression of one or more genes in GTEx (Fig. 6a). Fifty-eight percent of these variants were significantly linked to multiple genes in the locus, with 57% showing the opposite direction of effect for FADS1 and FADS2. With HCR-FlowFISH, we determined that three eQTLs exist in CREs that regulate both genes, supporting the possibility that individual SNPs might act directly on multiple genes, rather than having multiple associations only due to tight LD to other eQTLs (Supplementary Fig. 7a–c). Further work will be required to understand how sequence changes at these CREs result in opposite gene expression effects, either by altering promoter specificity or another mechanism.

Figure 6 | a, ~110-kb region surrounding the *FADS* locus showing all variants from 1000 Genomes, with notations for eQTLs and the number of *FADS* genes they are associated with. A connectogram of all CREs identified via HCR-FlowFISH highlights a complex regulatory landscape with extensive CRE sharing. b, Fine-mapped credible variant sets for eGFR and HDL levels share similar posterior inclusion probabilities (PIP) due to high LD. MPRA (red), DHS (light blue) and H3K27ac (dark blue) signals, as well as composite guide scores for HCR-FlowFISH on *FADS1* (green), *FADS2* (teal), *FADS3* (orange), and *FEN1* (purple) at a CRE within an intron of *FADS2*. Variants within an HCR-FlowFISH identified *FADS1* CRE are labeled in green, and variants displaying allelic skew via MPRA are outlined in red, with the location of the only variant identified with both denoted by a green bar. c, Total cholesterol GWAS at the *FADS* locus yields 73 genome-wide significant variants. d, Results of allelic skew in the MPRA show that the alternative allele drives increased CRE activity. e, Many traits significantly associated with rs2727271 are direct metabolites of the FADS1 enzyme.

To further characterize genetic variants in the FADS region, we combined the CRE-gene links catalogued by HCR-FlowFISH with episomal measures of CRE activity by performing an MPRA in K562 cells. We tested 200-bp oligos in a 5-bp sliding window along the entire FADS locus. Oligos with regulatory activity measured via the MPRA overlapped DHS and H3K27ac data in the region (Fig. 4a), and CREs identified via HCR-FlowFISH had significantly higher MPRA scores (MW P ≤ 1 × 10⁻⁵) (Supplementary Fig. 5a). Not all MPRA active regions are HCR-FlowFISH CREs, which is likely due to the more cell-type agnostic activity observed in exogenous reporter assays and/or that the CREs are targeting genes not assayed in our screen. We also designed allele-specific MPRA oligos capturing every variant (3,108 polymorphisms) in the 1000 Genomes Project dataset at the FADS locus³⁸. 119 (3.8%) of the variants in the FADS region showed significant allelic skew (FDR ≤ 10%), 51 of which occurred in HCR-FlowFISH CREs (Fig. 6a, Supplementary Data 3 and Methods).

We next overlaid GWAS data for blood cell composition and metabolic traits from the NHGRI-EBI GWAS catalog³⁹ and Phenoscanner⁴⁰ with MPRA and HCR-FlowFISH results in order to nominate causal variants based on allelic skew and identify their transcriptional targets (Supplementary Data 3). One of the strongest GWAS effect sizes for PUFA levels is at rs174466 (β = −0.85, P = 2.5 × 10⁻²²), which is an eQTL for six genes within 120 kb, including FADS1, FADS2, and FADS3^41,42. Genetic analysis alone is unable to resolve the association to PUFA levels and identify which genes are affected given that 16 additional variants are in tight LD (r² > 0.8) with the sentinel variant (Extended Data Fig. 6a,b). By deploying HCR-FlowFISH and MPRA, we showed that rs174466 is the only variant that conveys both allelic skew by MPRA and validates as a CRE by HCR-FlowFISH. The allele lies in the promoter region with regulatory potential solely acting on FADS3, suggesting that eQTL associations to other genes are due to LD with additional functional SNPs or trans effects. The alternate allele establishes a canonical motif for SP2, a transcriptional activator at GC box promoters, which is also bound to this location in vivo, suggesting the alternative allele for rs174466 may act to increase FADS3 expression (Extended Data Fig. 6a,d). The MPRA recapitulates this, with the alternate variant increasing activity by 0.8 log₂ fold, notably only when the element is tested in the same orientation as its endogenous promoter (Extended Data Fig. 6c and Supplementary Data 3).

Interrogating associations with total cholesterol, we again demonstrated that HCR-FlowFISH allows us to nominate individual variants from tightly-linked, genetically indistinguishable alleles, while also providing empirical evidence of the affected transcript. LD imposed similar limitations here, as 73 variants reach genome-wide significance for association (P ≤ 5 × 10⁻⁸) (Fig. 6c)⁴³. Fine-mapping of cholesterol-related phenotypes, HDL levels and estimated glomerular filtration rate in the UK Biobank nominates multiple credible sets in the region⁴⁴. One includes a 95% credible set of four equally probable SNPs in perfect LD (r² = 1): rs2727270, rs2727271, rs2524299, and rs2072113 (Fig. 6b and Supplementary Data 4). Only rs2727271 displays an allelic skew effect by MPRA (1.2 log₂ fold) and overlaps an HCR-FlowFISH validated CRE for FADS1 and FADS2.

Discussion

The ability to interpret cis-regulatory grammar and functional organization of the genome will be essential for characterizing the mechanisms underpinning the majority of disease and trait-associated non-coding variants. Indeed, several consortia, including ENCODE, FANTOM, and BLUEPRINT, are devoted to answering this question. In this study, we present HCR-FlowFISH, a broadly applicable approach to mapping complex regulatory interactions in nearly any expressed gene and cell system, allowing for a generalizable approach to study gene regulation.

Critically, HCR-FlowFISH is less limited by concerns from growth-based screens regarding the global effects of off-target gRNAs³² as low off-target or targeting efficiency scores do not confound expression-based screens. At the NMU locus, for example, a 106-kb region containing four previously characterized CREs is untargetable using the standard gRNA off-target filter of <2 mismatches in the genome. The ability for future HCR-FlowFISH experiments to maintain high-quality results using a wider range of potential gRNAs will enable more comprehensive screens and improve interpretability.

CRE screens using HCR-FlowFISH reveal a rich regulatory landscape imperceptible by current biochemical assays and offer novel insights into the regulome. We detect widespread CRE sharing across genes and surprisingly observe that CRISPRi perturbations of some CREs can impart both positive and negative effects on nearby genes. While nearly all of the CREs identified by CASA have predictive regulatory marks such as DHS and H3K27ac, classical assumptions — such as that CREs impact the nearest gene or that epigenetic signal strength correlates with gene expression — are not universally true in our endogenous screens.

Our HCR-FlowFISH CRISPRi screens quantify both the activity of CREs and their genic targets but, in their current iteration, lack the resolution to decipher the functional effects of single-nucleotide variants. To overcome this, we combined HCR-FlowFISH with MPRA to enable a powerful new paradigm for parsing genetic disease associations and nominating causal variants. Additionally, HCR-FlowFISH is compatible with a host of different CRISPR effectors, allowing the perturbation of the genome at base pair resolution.

While this study presents detailed locus-centric, single gene-target CRE screens, HCR-FlowFISH is fully applicable to targeted genome-wide analyses. In addition, the ability to modify and multiplex HCR can enable complex screens, including targeting multiple genes simultaneously. Thus, HCR-FlowFISH provides a uniquely flexible platform to innovate new technologies for the study of cis-regulatory function to facilitate the translation of genetic associations into improved mechanistic comprehension of human disease.

Methods

CRISPR gRNA library design, synthesis, and infection.

All gRNAs and controls are reported in hg38 coordinates in Supplementary Data 1. Prior to synthesis, gRNA sequences lacking a 5’ guanine were appended with a G to aid in transcription efficiency. The sequence ‘tatcttgtggaaaggacgaaacacc’ was added before each gRNA and ‘gtttaagagctatgctggaaacagcatagc’ after it to facilitate cloning and provide a custom Illumina read 1 primer site.
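The oligo assembly described above can be sketched in a few lines. This is a minimal illustration, not the authors' script: the function name `build_grna_oligo` is hypothetical, and it assumes the extra G is added to the 5’ end of the protospacer rather than substituted for the first base. The two flanking sequences are copied from the text.

```python
# Flanking sequences copied verbatim from the Methods text.
LEFT_FLANK = "tatcttgtggaaaggacgaaacacc"
RIGHT_FLANK = "gtttaagagctatgctggaaacagcatagc"


def build_grna_oligo(protospacer: str) -> str:
    """Return a synthesis oligo for one gRNA protospacer (hypothetical helper).

    Assumption: a G is prepended (not substituted) when the protospacer
    does not already begin with a guanine.
    """
    spacer = protospacer.upper()
    if not spacer or not set(spacer) <= set("ACGT"):
        raise ValueError(f"unexpected protospacer: {protospacer!r}")
    if not spacer.startswith("G"):
        spacer = "G" + spacer
    return LEFT_FLANK + spacer + RIGHT_FLANK


if __name__ == "__main__":
    print(build_grna_oligo("ACGTACGTACGTACGTACGT"))
```

Under this reading, a 20-nt protospacer without a 5’ G yields a 21-nt spacer between the flanks, while one that already starts with G is left at 20 nt.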

For the entire FADS gene cluster chr11:61555967–61664630 (hg19) and all DHS sites at the GATA1 locus chrX:48,306,481–49,174,557 (hg19), gRNAs were designed using custom scripts. Briefly, all possible 20-bp gRNAs with the Cas9 protospacer adjacent motif “NGG” within the surrounding region were considered. gRNA efficiency scores were determined using the “Rule Set 2” method and range from 0–100, with 100 being optimal⁴⁵. To calculate off-target scores, we used bowtie to map gRNAs to the human reference (hg19) with a maximum of 10,000 matches and up to three mismatches (parameters: -n 3 -l 15 -e 10000 -y --all -S)^46,47. We designed 9,551 and 14,337 gRNAs in the FADS and GATA1 gRNA libraries, respectively, with 449 and 1,000 random non-targeting controls. gRNA coordinates were remapped to hg38 for reporting and plotting.
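The enumeration step (every 20-bp protospacer followed by an NGG PAM, on both strands) can be illustrated as follows. This is a sketch under stated assumptions, not the custom scripts used in the study: `find_protospacers` and `reverse_complement` are hypothetical names, coordinates are 0-based on the input sequence, and efficiency and off-target scoring are left out.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string (N is left unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_protospacers(seq: str, spacer_len: int = 20):
    """List (start, strand, protospacer) for every site followed by NGG.

    `start` is the 0-based plus-strand coordinate of the leftmost
    protospacer base; protospacers are reported 5'->3' on their own strand.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(n - spacer_len - 3 + 1):
            pam = s[i + spacer_len:i + spacer_len + 3]
            if pam[1:] == "GG":  # first PAM base is unconstrained (N)
                start = i if strand == "+" else n - (i + spacer_len)
                hits.append((start, strand, s[i:i + spacer_len]))
    return hits


if __name__ == "__main__":
    for hit in find_protospacers("CCA" + "T" * 20 + "ACGTACGTACGTACGTACGTAGG"):
        print(hit)
```

Candidates produced this way would still need the efficiency and off-target filters described in the text before entering a library.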

At the CD164, ERP29, LMO2, MEF2C, and NMU loci, we used the Guidescan software to design gRNAs. Guide sequences with ≤ 1 mismatch to genome or specificity scores below 0.2 were discarded²⁶. For each library, 52,500 guides were designed to target > 1 Mb surrounding each gene. 1,500 non-targeting and 6,000 safe-targeting-tiling control gRNAs were included. At NMU and CD164, 1,000 previously defined safe-targeting control gRNAs were also included⁴⁸. MYC loci designs were a combination of Guidescan and a contribution from ENCODE consortium members. Full details are provided in Supplementary Data 1.

Guide oligos were synthesized by Agilent and diluted in 100 μl dH₂O. Guide sequences were amplified using Q5 Hot Start High-Fidelity 2X Master Mix (NEB, M0494L) and 10 μM primers (Supplementary Data 1) in a total volume of 50 μl. PCR conditions were: initial denaturation at 98 °C for 2 min; 12 cycles of 98 °C for 10 s, 60 °C for 15 s, and 72 °C for 45 s; and final amplification at 72 °C for 5 min. Amplicons were purified with a 3.0× SPRI cleanup using Agencourt AMPure XP SPRI Beads (Beckman Coulter, A63881). SgOPTI plasmid (Addgene, 85681) was digested with BsmBI (NEB, R0580L). The amplified gRNA library was cloned into SgOPTI using Gibson Assembly (NEB, E2611S) and then electroporated into Endura Electro-competent cells (Lucigen, 60242–2), with < 16 h of growth at 30 °C. Transformation complexity (> 1,000× library size minimum) was assayed by serial dilutions. gRNA library plasmids were purified using Qiagen’s Plasmid Plus MidiPrep Kit (Qiagen, 12945).

Lentivirus was produced as in previous studies using HEK-293T cells⁴⁹. Transient transfections were performed with PAX2 and pCMV-VSV-G (Addgene, 35002, 8454) packaging plasmids, using the x-tremeGENE 9 DNA Transfection Agent (Sigma-Aldrich, 6365787001). Lentivirus was filtered, harvested, and frozen at −80 °C until use. Cells were infected at a range of viral titres (0–200 μL) to identify a multiplicity of infection of 0.3. Cells were spinfected in 6M cell increments using polybrene (Sigma Aldrich TR1003G) at 1,200 rpm for 45 min at 37 °C. Transduced cells were selected with 20 μg/ml puromycin for 3 days. Library-scale infections were performed identically to the above, using the appropriate viral titre and scaling so that live cells after puromycin selection would be > 1,000× library size.
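For planning purposes, the coverage requirement can be turned into a rough starting cell number. The sketch below is an illustration only and makes assumptions the Methods do not state: integrations per cell follow a Poisson distribution, every infected cell survives selection, and no uninfected cell does. The function names are hypothetical.

```python
import math


def infected_fraction(moi: float) -> float:
    """Fraction of cells with at least one integration (Poisson assumption)."""
    return 1.0 - math.exp(-moi)


def cells_to_infect(library_size: int, coverage: int = 1000, moi: float = 0.3) -> int:
    """Cells to expose to virus so survivors reach `coverage` x library size.

    Assumption: survivors after puromycin are exactly the infected cells.
    """
    survivors_needed = library_size * coverage
    return math.ceil(survivors_needed / infected_fraction(moi))


if __name__ == "__main__":
    # FADS library: 9,551 targeting + 449 non-targeting gRNAs = 10,000 guides.
    print(f"{infected_fraction(0.3):.3f} of cells infected at MOI 0.3")
    print(f"{cells_to_infect(10_000):,} cells to infect for 1,000x coverage")
```

At a multiplicity of infection of 0.3, roughly a quarter of cells are expected to carry a guide under this model, so the starting population needs to be about four times the post-selection target.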

Single gRNA K562 CRISPRi cell line generation.

Individual gRNAs (Supplementary Data 1) were amplified and cloned identically to the large-scale gRNA libraries. Individual gRNA plasmids were verified by Sanger sequencing.

Cell culture.

K562 cells with a doxycycline-inducible CRISPRi were a gift of the Lander lab and identical to those used in previous studies¹⁷. Cells were grown in RPMI 1640 GlutaMAX (Gibco) with 10% heat-inactivated FBS (HI-FBS, Gibco). CRISPRi was induced for 24 h with a final concentration of 1 μg/ml doxycycline (VWR). In tests of guides for FADS, GATA1, and MYC, doxed and undoxed distributions were highly correlated (Spearman ρ = 0.91, 0.80, and 0.86, respectively). Non lenti-library infected cells were periodically sorted on BFP signal when BFP/CRISPRi expression dropped below 80%. Jurkat and GM12878 cells were grown in RPMI 1640 GlutaMAX with 15% HI-FBS, 293T cells were grown in DMEM (Gibco) with 10% HI-FBS, SK-N-SH cells were grown in EMEM (ATCC) with 10% HI-FBS, and TF1 cells were grown in RPMI 1640 GlutaMAX (Gibco) with 10% heat-inactivated FBS, 2 mM L-Glutamine and 2 ng/ml recombinant human Granulocyte-Macrophage Colony-Stimulating Factor (GM-CSF) (Peprotech). All cells were grown at 37 °C and 5% CO₂. Cells were not synchronized prior to performing HCR-FlowFISH.

HCR-FlowFISH.

HCR probes and fluorescently labeled hairpins were purchased from Molecular Instruments using sequences listed in Supplementary Table 1, targeted against the mRNA transcript most abundant in K562 cells. Buffers were optimized for HCR-FlowFISH and prepared in lab with RNase-free practices.

All samples were prepared using low-binding plasticware (Eppendorf) where possible. All values below are listed per 5M cell aliquot. All centrifugations were performed at 500 × g for 5 min unless otherwise noted.

Cell fixing and permeabilization.

Cells were centrifuged and resuspended in 1 ml of 4% formaldehyde in PBST (1× PBS, 0.1% Tween 20) and incubated at room temperature for 1 h with rotation. After pelleting cells and aspirating supernatant, cells were washed with 1 ml of PBST, and this was repeated for a total of 4× PBST washes. Cells were then pelleted and resuspended in cold 70% ethanol and incubated at 4 °C for 10 min, and then pelleted again.

Expression detection.

Cells were washed twice with 0.5 ml PBST and resuspended in 400 μl of pre-warmed probe hybridization buffer (30% formamide, 5× sodium chloride sodium citrate (SSC), 9 mM citric acid pH 6.0, 0.1% Tween 20, 50 μg/ml heparin, 1× Denhardt’s solution, and 10% low MW dextran sulfate). The solution was pre-hybridized for 30 min at 37 °C with rotation in a hybridization oven. During pre-hybridization, 100 μl of probe hybridization buffer and 2 pmol of probe were prepared and warmed to 37 °C. The probe solution was then added to the sample for a final probe concentration of 4 nM (2 pmol in 500 μl), and incubated overnight (15–20 h) at 37 °C with rotation.

Excess probe removal.

500 μl of SSCT wash buffer (5× SSC, 0.1% Tween) was added and the sample was centrifuged at 750 × g for 5 min. The pellet was resuspended with 500 μl of probe wash buffer (30% formamide, 5× SSC, 9 mM citric acid pH 6.0, 0.1% Tween 20, and 50 μg/ml heparin) and cells were re-pelleted for 4× total washes. Cells were then resuspended in 500 μl SSCT wash buffer, incubated at room temperature for 5 min, and pelleted again.

Signal amplification.

Cells were resuspended in 150 μl of pre-warmed amplification buffer (5× SSC, 0.1% Tween, 10% low MW dextran sulfate) and pre-amplified for 30 min at room temperature with rotation. Next, 15 pmol of each hairpin was prepared by heating 5 μl of 3 μM hairpin at 95 °C for 90 s and cooling to room temperature in the dark. The hairpin mixture was prepared by adding all hairpins to 100 μl of amplification buffer, and the mixture was added to the sample to reach 60 nM hairpin concentration. The samples were then incubated at room temperature in the dark with rotation for 3 h (FADS1, FADS2, FADS3, GATA1, HDAC6, CD164, NMU), 12 h (ERP29, MEF2C, LMO2) or 18 h (MYC, PVT1). After centrifuging the samples and aspirating the hairpin solution, the cell pellet was resuspended in 500 μl of SSCT wash buffer. Cells were then pelleted again, and the SSCT wash was repeated for a total of six washes. The cells were then resuspended in PBS and kept dark at 4 °C prior to sorting.
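The stated amounts and concentrations can be checked with simple unit arithmetic. The sketch below reads the hairpin stock as 5 μl at 3 μM, the reading that reproduces the stated 15 pmol and 60 nM, and it ignores the small volume contributed by the hairpins themselves. Function names are hypothetical.

```python
def stock_amount_pmol(volume_ul: float, stock_uM: float) -> float:
    """Amount of oligo in a stock aliquot: microlitres x micromolar = picomoles."""
    return volume_ul * stock_uM


def final_concentration_nM(amount_pmol: float, total_volume_ul: float) -> float:
    """Final concentration: pmol per microlitre is micromolar, x 1000 is nanomolar."""
    return amount_pmol / total_volume_ul * 1000.0


if __name__ == "__main__":
    hairpin_pmol = stock_amount_pmol(volume_ul=5, stock_uM=3)
    # 150 ul pre-amplification volume + 100 ul hairpin mixture (assumed total).
    print(hairpin_pmol, "pmol per hairpin")
    print(final_concentration_nM(hairpin_pmol, 150 + 100), "nM hairpin")
    # Same check for the probe step: 2 pmol in 400 ul + 100 ul of buffer.
    print(final_concentration_nM(2, 400 + 100), "nM probe")
```

The same helper reproduces the 4 nM probe concentration given in the expression detection step, which supports reading both sets of volumes as microlitres.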

Note on HCR-FlowFISH target specific experimental design.

We have observed that target transcript TPM is correlated with HCR signal intensity (Fig. 1b) and that signal intensity can be increased via additional probe number, probe concentration, or hairpin amplification time. We suggest a minimum of 30 probe-pairs per target when feasible with transcript length, and increasing probe-pairs for targets < 50 TPM. Increases in hairpin amplification time are also suggested for targets < 50 TPM. Very highly expressed targets such as GATA1 (TPM = 193) are successful with fewer probes (19) and 3-h amplifications.

Prime-Flow.

PrimeFlow RNA Assay Kit was purchased from Thermo Fisher Scientific (88–18005-204) along with corresponding probes GATA1 (VA1–20436-PF), LARGE1 (VA1–3005734-PF), FADS3 (VA1–6005976-PF) and eGFP (60175). Cells were stained in triplicate according to the specifications provided in the kit. The cells were analyzed on a Beckman CytoFlex identically to HCR-FlowFISH stained cells.

Sorting and cytometric analysis.

Cells were filtered using a Celltrics 0.3 μm filter (04–004-2326, Sysmex) and diluted with PBS to an approximate concentration of 10 million cells/mL. The cells were sorted on a Sony MA900 with a 100 μm chip (LE-C3210, Sony), using the 405 nm, 488 nm, and 638 nm lasers. Cells were first gated for live, single-cell, and BFP+ cells, as our CRISPRi is tagged with blue fluorescent protein (BFP).

To accurately quantify changes in target gene expression, we take advantage of the multiplexable nature of HCR by assigning different hairpin-probe combinations to different transcripts. HCR-FlowFISH uses this feature to simultaneously quantify transcripts for a target gene and the housekeeping gene TBP to control for differential fluorescence signal caused by cell size, permeability, and total RNA abundance in each cell. To sort cells, we plotted FITC (TBP probes) vs. APC (target gene probes) and gated on the highest and lowest 10% FITC/APC ratio (Extended Data Fig. 1b). Each sorting bin contained cells equal to 100× gRNA library size to maintain complexity. The collected cells were spun down at 500 × g for 5 min and frozen at −20 °C until DNA isolation.
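The ratio-based gating can be mimicked computationally on exported per-cell intensities. This is an illustrative sketch, not the sorter's gating logic: the function name is hypothetical, and it follows the text in using the FITC/APC (TBP over target) ratio, so the highest ratios correspond to the lowest normalized target expression.

```python
def gate_extreme_bins(target_apc, tbp_fitc, fraction=0.10):
    """Return (low_expression, high_expression) sets of cell indices.

    Cells are ranked by FITC/APC; the top `fraction` of ratios is the
    low-expression bin and the bottom `fraction` is the high-expression bin.
    """
    ratios = [f / a for f, a in zip(tbp_fitc, target_apc)]
    order = sorted(range(len(ratios)), key=ratios.__getitem__)
    k = int(len(ratios) * fraction)
    if k == 0:
        return set(), set()
    high_expression = set(order[:k])   # smallest FITC/APC: most target signal
    low_expression = set(order[-k:])   # largest FITC/APC: least target signal
    return low_expression, high_expression


if __name__ == "__main__":
    apc = [float(x) for x in range(1, 21)]   # toy target-gene signal
    fitc = [10.0] * 20                       # toy constant TBP signal
    low, high = gate_extreme_bins(apc, fitc)
    print("low-expression cells:", sorted(low))
    print("high-expression cells:", sorted(high))
```

Dividing by the TBP signal is what removes the shared contribution of cell size and permeability before the extreme bins are taken.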

Single gRNA libraries and other non-sorted HCR-FlowFISH samples were analyzed using a Beckman CytoFlex LX Flow Cytometer. A minimum of two replicates each of 100,000 total cells were used. We analyzed APC (target gene) and FITC (TBP) signal on live, single-cell, BFP+ gated cells using FlowJo v9.

DNA isolation.

We isolated DNA from cells using the following protocol for 1M cells and scaled accordingly: pelleted cells were thawed and resuspended in 100 μl of Lysis Buffer (1% SDS, 10 mM EDTA, 50 mM Tris-HCl pH 8.1), and incubated at 65 °C for 3 h. Lysed cells were cooled to room temperature and incubated for 30 min at 37 °C with 14 units of RNase A (Qiagen, 19101), directly followed by a Proteinase K (8 units, NEB, P8107S) incubation (37 °C for 2 h, 95 °C for 20 min). DNA was purified with a 1.0× SPRI and 5× 70% ethanol washes with bead resuspension. DNA was eluted in 80 μl of DNase-free water after a 65 °C incubation for 5 min.

Library preparation and sequencing.

Guide sequences were amplified directly into sequencing libraries from all recovered genomic DNA. A maximum of 600 ng genomic DNA was used per PCR reaction: 25 μl of Q5 Hot Start High-Fidelity 2X Master Mix, 2.5 μl of 10 μM indexed forward/reverse primers (Supplementary Data 1), and 1 μl of 10 mM spermine (Sigma-Aldrich, 85590–5G) in a total volume of 50 μl. PCR conditions were: initial denaturation at 98 °C for 2 min; 22 cycles of 98 °C for 10 s, 60 °C for 15 s, and 72 °C for 45 s; and final amplification at 72 °C for 5 min. All reactions were pooled to maintain complexity. 600 μl of amplified gRNA sequence was purified using a double-sided SPRI (0.5× followed by a 1.2× SPRI) to remove primers and genomic DNA, with 2× 70% ethanol washes. Libraries were quantified on a High Sensitivity D1000 ScreenTape (Agilent, 5067–5584) with associated reagents (Agilent, 5067–5585) on an Agilent Technologies 2200 TapeStation. Libraries were pooled and sequenced with either an Illumina MiSeq v2 50 cycle kit (15033623) or a NextSeq v3 75 cycle kit (15057941) using custom read one and indexing primers (Supplementary Data 1). We suggest discarding libraries that display dropout of > 5% of guides or that do not display a Gamma-Poisson guide count distribution.

HCR-FlowFISH normalization and visualization.

We represent data from HCR-FlowFISH experiments in three forms: individual guide scores, composite guide scores, and CRE activity scores. The first two scores are derived directly from the data, while the third is inferred from a statistical model defined in the next section. For each experiment, gRNA counts measured for each sorting bin are semi-randomly scaled to equalize counts between pairs of bins. First, gRNAs that are not detected in either the low-expression or high-expression bins are discarded. Second, the larger sequencing library is scaled down by the small-to-large library ratio. Based on counts observed in the low-expression bin, l, and the high-expression bin, h, for all N observed gRNAs, the scaled values are:

l_{n}^{‡} = l_{n} \frac{\min (\sum_{n = 1}^{N} l_{n}, \sum_{n = 1}^{N} h_{n})}{\sum_{n = 1}^{N} l_{n}}

and

h_{n}^{‡} = h_{n} \frac{\min (\sum_{n = 1}^{N} l_{n}, \sum_{n = 1}^{N} h_{n})}{\sum_{n = 1}^{N} h_{n}}

Finally, we convert each continuous scaled value to a discrete count by isolating the integer parts and adding one count with probability equal to the fractional part. We use scaled library pairs for all downstream analyses and suppress this scaling in subsequent notation.
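The scaling and stochastic rounding steps above can be sketched as follows. This is an illustration rather than the published code: the function names are hypothetical, and the filter is read as dropping any gRNA with a zero count in at least one bin, which is one possible interpretation of "not detected in either".

```python
import math
import random


def stochastic_round(x: float, rng: random.Random) -> int:
    """Integer part of x, plus one with probability equal to the fractional part."""
    whole = math.floor(x)
    return whole + (1 if rng.random() < x - whole else 0)


def scale_bins(low, high, rng=None):
    """Equalize total counts between paired bins; return [(l, h), ...] per kept gRNA.

    Assumption: gRNAs with a zero count in either bin are discarded first.
    """
    rng = rng or random.Random(0)
    kept = [(l, h) for l, h in zip(low, high) if l > 0 and h > 0]
    if not kept:
        return []
    total_low = sum(l for l, _ in kept)
    total_high = sum(h for _, h in kept)
    target = min(total_low, total_high)   # the smaller library sets the depth
    return [
        (stochastic_round(l * target / total_low, rng),
         stochastic_round(h * target / total_high, rng))
        for l, h in kept
    ]


if __name__ == "__main__":
    print(scale_bins([10, 20, 0, 30], [100, 200, 50, 300]))
```

Rounding stochastically rather than deterministically keeps the expected count of each gRNA equal to its scaled value, so low-count guides are not systematically rounded to zero.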

Next, we assign each gRNA a 300-nucleotide area-of-effect window surrounding its predicted Cas9 cut-site. To visualize the data, we determine a nucleotide-wise composite guide score as the log-odds ratio (LOD-score) of the total counts observed in the low-expression bin (l) divided by the total in the high-expression bin (h) for all gRNAs in the set M_k that affect a given nucleotide, k:

s_{k} = \log \frac{\sum_{n \in M_{k}} l_{n}}{\sum_{n \in M_{k}} h_{n}}

Therefore, positive scores indicate enhancer activity. We also report individual guide scores by calculating LOD-score for each gRNA in each experimental replicate. To aid visualization, we subtracted the median composite guide score from each individual guide score for each replicate.
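A per-nucleotide composite guide score of this kind can be computed as in the sketch below. It is a hypothetical illustration: the function name is invented, the 300-nt window is assumed to be centered on the cut site, the natural logarithm is assumed, and positions where either summed count is zero are skipped because the ratio is undefined there.

```python
import math
from collections import defaultdict


def composite_scores(guides, window=300):
    """Map nucleotide position -> log(sum of low counts / sum of high counts).

    `guides` is an iterable of (cut_site, low_count, high_count) tuples.
    Each guide contributes to the half-open interval
    [cut_site - window // 2, cut_site + window // 2).
    """
    half = window // 2
    low_total = defaultdict(int)
    high_total = defaultdict(int)
    for cut_site, low, high in guides:
        for pos in range(cut_site - half, cut_site + half):
            low_total[pos] += low
            high_total[pos] += high
    return {
        pos: math.log(low_total[pos] / high_total[pos])
        for pos in low_total
        if low_total[pos] > 0 and high_total[pos] > 0
    }


if __name__ == "__main__":
    scores = composite_scores([(1000, 30, 10), (1100, 10, 10)])
    for pos in (900, 1000, 1200):
        print(pos, round(scores[pos], 3))
```

A guide enriched in the low-expression bin pushes the score of every nucleotide in its window above zero, which is why positive values read as enhancer activity under CRISPRi.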

CASA: cis-regulatory activity prediction.

We designed a hierarchical Bayesian model for CRISPR activity screen analysis (CASA) to demarcate putative CRE-gene connections using individual HCR-FlowFISH experimental replicates (Supplementary Fig. 2c). In our model, we assume that each gRNA has a detection rate, c, in a sequencing library with the prior distribution:

P (c_{n}) \sim Gamma (α, β) .

Concurrently, the latent CRE activity score of each genomic window is drawn from the prior:

P (a_{w}) \sim N (m, s) .

Finally, for each genomic window, w, and gRNA, n, we assume the likelihoods of observed gRNA counts in a pair of low and high gene target expression libraries are:

P (l_{w n} ∣ c_{w n}, a_{w}) \sim Poisson (\frac{c_{w n} e^{a_{w}}}{1 + e^{a_{w}}})

and

P (h_{w n} ∣ c_{w n}, a_{w}) \sim Poisson (\frac{c_{w n}}{1 + e^{a_{w}}}),

respectively. This formulation induces an overdispersed count distribution where a logit-normally distributed CRE activity score partitions observable gRNA detection rate between paired sequencing libraries generated from the same HCR-FlowFISH experiment.
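To make the model concrete, the log joint density for a single genomic window can be written out directly. This sketch is not the PyMC3 implementation used in the study: the function names are hypothetical, the Gamma prior is assumed to be parameterized by shape α and rate β, and s is assumed to be the standard deviation of the Normal prior.

```python
import math


def poisson_logpmf(k: int, rate: float) -> float:
    """Log probability of observing k events under Poisson(rate)."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def casa_window_log_joint(a, rates, low, high, alpha, beta, m, s):
    """Log joint density of activity `a`, detection rates and counts in one window.

    Assumptions (not stated in the text): beta is a rate parameter and
    s is a standard deviation.
    """
    p_low = 1.0 / (1.0 + math.exp(-a))   # e^a / (1 + e^a)
    logp = -0.5 * ((a - m) / s) ** 2 - math.log(s * math.sqrt(2.0 * math.pi))
    for c, l, h in zip(rates, low, high):
        # Gamma(alpha, beta) prior on the detection rate of this gRNA.
        logp += (alpha * math.log(beta) - math.lgamma(alpha)
                 + (alpha - 1.0) * math.log(c) - beta * c)
        # The activity score splits the detection rate between the two bins.
        logp += poisson_logpmf(l, c * p_low)
        logp += poisson_logpmf(h, c * (1.0 - p_low))
    return logp


if __name__ == "__main__":
    kwargs = dict(rates=[40.0], low=[30], high=[10],
                  alpha=2.0, beta=0.05, m=0.0, s=10.0)
    for a in (0.0, math.log(3)):
        print(round(a, 3), round(casa_window_log_joint(a, **kwargs), 3))
```

With a weak prior, the density for a guide observed 30 times in the low bin and 10 times in the high bin is higher at a = log 3 than at a = 0, matching the intuition that the activity score is the log odds of landing in the low-expression bin.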

CASA determines the posterior latent CRE activity, a, for each genomic window, w, of predefined size in a locus conditioned on HCR-FlowFISH count data for the N_w gRNAs with overlapping areas of effect:

P (a_{w} ∣ l_{w}, h_{w})

Additionally, we compute a_w conditioned on a large set of non-targeting gRNAs (e.g., a_NT) to standardize latent activity measurements:

P (a_{w} - a_{NT} ∣ l_{w}, h_{w})

and assume all a_w are conditionally independent to make inference tractable.

Finally, we use the region of practical equivalence (ROPE) decision rule to determine which genomic windows compose putative CRE-gene connections⁵⁰. Briefly, we set a symmetric ROPE centered on zero for each tested gene and calculate the 95% highest density interval (HDI) for each standardized CRE-activity prediction. Putative active bins are deemed significant when the entire 95% HDI falls completely above or below the ROPE. The absolute values of the ROPE boundaries are: log 2.0 for GATA1, HDAC6, and FADS1; log 1.6 for FADS2, FADS3, and FEN1; log 1.8 for ERP29, LMO2, and CD164; log 1.4 for MEF2C; log 1.2 for NMU. We discard solitary bins with ostensibly significant CRE-activity which are not contiguous with at least one more significant bin. This filters out artifacts due to poor coverage or edge effects from binning.
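The decision rule can be sketched on posterior samples as follows. This is an illustration under assumptions, not the CASA source: the function names are hypothetical, the HDI is estimated as the narrowest interval containing 95% of the samples, the ROPE boundaries are taken as natural logarithms, and a bin counts as contiguous if either neighbor is significant regardless of sign.

```python
import math


def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the posterior samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n))   # number of samples inside the interval
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def rope_call(samples, rope):
    """+1 or -1 if the 95% HDI lies entirely outside [-rope, rope], else 0."""
    lo, hi = hdi(samples)
    if lo > rope:
        return 1
    if hi < -rope:
        return -1
    return 0


def drop_solitary(calls):
    """Zero out significant bins that have no significant neighbor."""
    kept = []
    for i, call in enumerate(calls):
        left = i > 0 and calls[i - 1] != 0
        right = i + 1 < len(calls) and calls[i + 1] != 0
        kept.append(call if call != 0 and (left or right) else 0)
    return kept


if __name__ == "__main__":
    toy_posterior = [1.0 + i / 100 for i in range(100)]   # stand-in samples
    print(hdi(toy_posterior), rope_call(toy_posterior, rope=math.log(2.0)))
    print(drop_solitary([0, 1, 0, 1, 1, 0, -1]))
```

Requiring the whole interval to clear the ROPE, rather than only the posterior mean, means a window is called only when the effect is both sizeable and well supported.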

We implemented CASA in PyMC3⁵¹ and use the No-U-Turn Sampling algorithm⁵² to numerically compute posteriors. We have provided CASA in several forms to the end user, including the source code on GitHub, a public Docker environment, and a Python widget that can automatically run the analysis on the Google Cloud Platform using local data. Code to visualize results is also hosted on GitHub.

Cis-regulatory element identification.

We report CREs based on CASA activity prediction with consistent support between experimental replicates. First, within each replicate, we merge contiguous runs of CASA nominated bins with significant CRE activity scores. Next, we merge putative CREs across experimental replicates for a given HCR-FlowFISH target and discard putative CREs that are not supported in all experimental replicates. This minimizes our type I error rate. We also apply a minimum guide coverage filter as described in the next subsection (CASA sensitivity analysis).

Finally, we assign maximum a posteriori (MAP) estimates of activity to each CRE based on summits of absolute activity. First, we assume scaled activity posteriors are Gaussian and assign μ_wr as the midpoint of the 95% HDI of each window w and replicate r. Next, we linearly scale all replicates such that:

\max_{1 \leq w \leq W} (μ_{w 1}) = \max_{1 \leq w \leq W} (μ_{w r}), \forall r

and

\min_{1 \leq w \leq W} (μ_{w 1}) = \min_{1 \leq w \leq W} (μ_{w r}), \forall r .

To avoid expensive Bayesian inference under a model of replication we assume:

\begin{matrix} a_{w r}^{*} ∣ l, h \sim N (μ_{w r}^{*}, σ_{w r}^{2}) \\ μ_{w r}^{*} \sim N (ζ_{w}, σ^{2}) \end{matrix}

where the asterisk indicates normalization by non-targeting controls and scaling to the first replicate. Under these assumptions, the MAP activity estimate of each bin is:

MAP (ζ_{w}) = \frac{1}{R} \sum_{r = 1}^{R} μ_{w r}^{*} .

Finally, CRE activity is determined by the peak MAP estimate among the $ζ_{w}$ within the CRE area.
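The replicate scaling and averaging described above can be sketched as follows. The function names are hypothetical, the inputs are assumed to be per-window HDI midpoints that have already been standardized against non-targeting controls, and the CRE summit is taken as the value with the largest absolute magnitude, following the phrase "summits of absolute activity".

```python
def rescale_to_first(replicates):
    """Linearly map each replicate so its min and max match replicate 1."""
    ref_lo, ref_hi = min(replicates[0]), max(replicates[0])
    scaled = []
    for rep in replicates:
        lo, hi = min(rep), max(rep)
        if hi == lo:
            raise ValueError("replicate has no dynamic range to rescale")
        factor = (ref_hi - ref_lo) / (hi - lo)
        scaled.append([ref_lo + (x - lo) * factor for x in rep])
    return scaled


def map_activity(replicates):
    """Per-window MAP estimate: mean of the rescaled replicate midpoints."""
    scaled = rescale_to_first(replicates)
    return [sum(window) / len(scaled) for window in zip(*scaled)]


def cre_summit(zeta, start, stop):
    """Peak activity (largest absolute value) among windows start..stop-1."""
    return max(zeta[start:stop], key=abs)


if __name__ == "__main__":
    midpoints = [[0.0, 1.0, 2.0], [0.0, 2.0, 4.0]]   # two toy replicates
    zeta = map_activity(midpoints)
    print(zeta, cre_summit(zeta, 0, 3))
```

Under the Gaussian assumptions in the text, the mean of the rescaled midpoints is the MAP estimate, which is what allows the averaging to stand in for a full replicate-aware inference.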

CASA sensitivity analysis.

We conducted a simulation study to measure the sensitivity of our CRE identification pipeline under different theoretical conditions. Simulations occur in three stages: sampling guide effects, simulating observed cellular gene expression, and counting guide representation in the top and bottom deciles of normalized target gene expression. We generated two sets of simulations based on flow-cytometry readings collected during two separate experiments quantifying the expression of FADS1 and LMO2. During each experiment, we quantified target and housekeeping gene expression using the APC-A and FSC-A channels, respectively. In addition to cells treated with the appropriate probe, we quantified gene expression in control cells that were sham-treated with APC-A probes targeting eGFP, a synthetic gene not expressed in our cell lines. We log-transformed all observed and simulated flow-cytometry readings.

First, we sample guide-dependent impacts on target gene expression:

μ_{i} \sim N (m_{g} (1 - f_{kd}), σ_{g}^{2}),

where m_g is the difference in the average observed expression of the target gene in the control cells versus the probe-treated cells, f_kd is the theoretical fraction of gene expression knockdown, and $σ_{g}^{2}$ is the variance of observed target gene expression due to guide effects. For each simulation, we generate 1,000 simulated non-targeting guides where f_kd = 0 and 1,000–3,000 simulated CRE-targeting guides with $f_{kd} \in [0.01, 0.50]$.

Next, we sample 2,000 cells for each simulated guide as follows (Supplementary Fig. 2a,b):

{\vec{c}}_{i j} \sim N ([\begin{matrix} μ_{A} + μ_{i} \\ μ_{F} \end{matrix}], [\begin{matrix} σ_{A}^{2} & K_{A F} \\ K_{A F} & σ_{F}^{2} \end{matrix}]),

where $μ_{A} + μ_{i}$ is the mean signal detected in the target gene channel (e.g., APC-A), μ_F is the mean detected housekeeping gene expression, and K_AF is the covariance of these two measures. These parameters are specified by observed flow-cytometry data:

[\begin{matrix} μ_{A} \\ μ_{F} \end{matrix}] = [\begin{matrix} \bar{μ}_{A,\mathrm{Control}} \\ \bar{μ}_{F,\mathrm{Control}} \end{matrix}],

[\begin{matrix} σ_{A}^{2} + σ_{g}^{2} & K_{A F} \\ K_{A F} & σ_{F}^{2} \end{matrix}] = Q_{\mathrm{Control}},

m_{g} = \bar{μ}_{A,\mathrm{Probed}} - \bar{μ}_{A,\mathrm{Control}},

where $\bar{μ}_{A,\mathrm{Control}}$, $\bar{μ}_{F,\mathrm{Control}}$ and $Q_{\mathrm{Control}}$ are the mean observed expression in the APC-A channel, the mean observed expression in the FSC-A channel, and the observed covariance matrix of these two channels, respectively, in the control cells (e.g., eGFP probe-treated). Additionally, $\bar{μ}_{A,\mathrm{Probed}}$ is the mean target gene expression measured in the appropriately probed cells. We assume a 3:1 ratio for $σ_{A}^{2} : σ_{g}^{2}$.

Once all cells are simulated, they are sorted into deciles based on normalized target expression:

$$c_{ij,\text{norm}} = c_{ij,1} - c_{ij,2}$$

All c_ij’s observed in the top or bottom decile are then used to count the number of times each simulated guide, μ_i, is represented in that decile. Finally, simulated targeting guides are evenly assigned to 100 mock CREs such that each is covered by 10, 20, or 30 guides. These simulations were also repeated for varying f_kd. All simulated experiments were then analyzed using CASA to generate sensitivity estimates for our method under various conditions (Supplementary Fig. 2e,f). Based on these analyses, we broadly apply a filter to remove regions supported by fewer than 30 guides. The two exceptions are the FADS locus, where we aimed to test the effectiveness of a dense tiling screen, and an H3K27ac peak at ERP29, where a clear putative CRE is bisected by a region of zero guide density.
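The sorting and counting stage can be sketched as follows. This is an illustrative Python sketch rather than the pipeline used here: the input layout (one tuple per simulated cell), the function names, and the round-robin assignment of guides to mock CREs are assumptions.

```python
from collections import Counter

def count_guides_in_deciles(cells):
    """Count guide representation in the bottom and top deciles.

    cells: list of (guide_id, target_signal, housekeeping_signal) on the log scale.
    Returns (bottom_counts, top_counts) as Counters keyed by guide_id.
    """
    # Normalized expression is the log target signal minus the log housekeeping signal
    scored = sorted((target - house, guide) for guide, target, house in cells)
    n_bin = len(scored) // 10  # number of cells in one decile
    if n_bin == 0:
        return Counter(), Counter()
    bottom = Counter(guide for _, guide in scored[:n_bin])
    top = Counter(guide for _, guide in scored[-n_bin:])
    return bottom, top

def assign_to_mock_cres(guide_ids, n_cres=100):
    """Spread targeting guides evenly over mock CREs (round-robin is an assumption)."""
    return {guide: i % n_cres for i, guide in enumerate(guide_ids)}
```

With 1,000, 2,000, or 3,000 targeting guides, `assign_to_mock_cres` yields 10, 20, or 30 guides per mock CRE, matching the coverage levels described above.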

Reanalysis of the Prime-Flow screen for GATA1.

We downloaded gRNA annotations and three replicates of raw CRISPRi-FlowFISH counts for the GATA1 gene from the Open Science Framework (https://osf.io/uhnb4/)²². First, we remapped gRNAs to hg38 using Bowtie 1 with 0 mismatches. We simplified the raw count data from a 6-way flow cytometry sort by considering only counts of cells in the top and bottom 10% of GATA1 expression. Furthermore, we discarded all but the first PCR replicate from each sorting replicate. CASA was run on these simplified data using the same settings as for the analysis of HCR-FlowFISH on GATA1. We report putative CREs supported by at least two screen replicates.
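The simplification of the 6-bin count data can be sketched in Python as below. The record layout (keys `guide`, `sort_rep`, `pcr_rep`, `bin`, `count`) and the bin labels are hypothetical and do not reflect the actual format of the downloaded files; the sketch only illustrates keeping the two extreme expression bins from the first PCR replicate of each sorting replicate.

```python
def simplify_counts(records, low_bin, high_bin, keep_pcr_rep=1):
    """Reduce multi-bin sort counts to low/high counts per guide and sorting replicate.

    records: iterable of dicts with hypothetical keys
             'guide', 'sort_rep', 'pcr_rep', 'bin', 'count'.
    Returns {(guide, sort_rep): {'low': int, 'high': int}}.
    """
    simplified = {}
    for rec in records:
        # Drop every PCR replicate except the chosen one, and all middle bins
        if rec["pcr_rep"] != keep_pcr_rep or rec["bin"] not in (low_bin, high_bin):
            continue
        slot = simplified.setdefault((rec["guide"], rec["sort_rep"]), {"low": 0, "high": 0})
        slot["low" if rec["bin"] == low_bin else "high"] += rec["count"]
    return simplified
```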

Functional genomic annotations and analyses.

We downloaded all functional genomic data for comparison with our screens and visualization from the ENCODE portal (https://www.encodeproject.org/)² with the following identifiers: ENCSR000EKS (K562 DHS), ENCSR000AKP (K562 H3K27ac), ENCSR000CIL (K562 CAGE), ENCSR000BNL (K562 SP2 ChIP-seq). PRO-seq and GRO-seq datasets were obtained from Wang et al.⁵³. For all analysis and visualization, we used the comprehensive GENCODE 32 human gene annotation (https://www.gencodegenes.org/human/release_32.html) unless otherwise specified. Promoters were defined as the 1,000 bp upstream of the annotated transcription start site. Transcription factor binding sites and Factorbook motifs were accessed via the UCSC genome browser and the SP2 motif position weight matrix was accessed via HOCOMOCO⁵⁴,⁵⁵. K562 expression data and TPMs were calculated using the ARCHS4 database²⁴.
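As a small illustration of the promoter definition, the following Python sketch returns the 1,000-bp window upstream of a transcription start site in a strand-aware manner. The 0-based, half-open coordinate convention and the function name are assumptions chosen for the example.

```python
def promoter_interval(tss, strand, upstream=1000):
    """Return the 0-based, half-open interval covering `upstream` bp 5' of a TSS.

    tss: 0-based coordinate of the first transcribed base.
    strand: '+' or '-'.
    """
    if strand == "+":
        # Upstream lies at smaller coordinates; clip at the chromosome start
        return max(0, tss - upstream), tss
    if strand == "-":
        # Upstream of a minus-strand gene lies at larger coordinates
        return tss + 1, tss + 1 + upstream
    raise ValueError("strand must be '+' or '-'")
```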

Single guide knockdowns at LMO2.

TracrRNA and CRISPR guide RNA targeting the LMO2 promoters were purchased from IDT (Supplementary Data 1). All RNA was rehydrated to 400 and kept on ice prior to use. K562-CRISPRi cells were induced with doxycycline for 48 h prior to electroporation, identically to the screens. Duplexes of guide:tracrRNA for each perturbation were prepared at 50 μM. Electroporation was carried out with the Neon Transfector (Invitrogen) and Neon Transfection System 10 μl Kit (Thermo Fisher, MPK1096) using Buffer E, 10 μl tips, and the program: 1450 V, 3 pulses, 10 ms. For electroporations, 50,000 cells were harvested and washed with PBS prior to resuspension in 7.5 μl Buffer R and combined with 2.5 μl of guide complex. Electroporated cells were transferred to recovery media (RPMI 1640 GlutaMAX with no antibiotics (Gibco), 10% Heat-Inactivated FBS (HI-FBS, Gibco), and 1 μg/mL doxycycline (VWR)) and grown at 37 °C and 5% CO₂ for 48 h. The experiment was performed in triplicate. Cells were transfected with water instead of guide complex as a control. An eGFP plasmid was used as a positive control.

RT-qPCR.

Gene specific primers were purchased from Eton Bioscience (Supplementary Data 1). We generated standard curves for each amplicon in ten-fold serial dilutions ranging in concentration from 100 pM to 1 × 10⁻⁴ pM using the Power SYBR Green RNA-to-CT 1-Step Kit (Thermo Fisher, 4389986) on an Applied Biosystems’ QuantStudio 6 Flex. A minimum of 3 replicates were used to determine mean Ct values. Dilutions outside the linear range were discarded.

Total RNA was isolated from single-guide K562 cell lines using Qiagen’s RNeasy Mini Kit (Qiagen 74106) with DTT and homogenization with a QIAshredder (79654). Subsequently, DNA was removed using the TURBO DNase Kit (Thermo Fisher, AM2238) for 30 min at 37 °C. RNA was purified using 2× by volume Agencourt RNAClean XP SPRI beads (Beckman Coulter, A63881). We used 50 ng of total RNA per qPCR and determined target abundance from the standard curves described above, which we then used to calculate the fold-change reduction in perturbed versus unperturbed cells. We normalized target abundance against a housekeeping gene, TBP, across ≥3 replicates.
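The standard-curve quantification can be sketched in Python as below. This is a generic illustration, not the analysis code used here: it assumes a linear relationship between Ct and log10 concentration within the retained dilution range, and all function names are hypothetical.

```python
import math

def fit_standard_curve(concentrations_pm, ct_values):
    """Least-squares fit of Ct = slope * log10(concentration) + intercept."""
    xs = [math.log10(c) for c in concentrations_pm]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ct_values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def ct_to_abundance(ct, slope, intercept):
    """Invert the standard curve to estimate abundance (same units as the curve)."""
    return 10 ** ((ct - intercept) / slope)

def normalized_abundance(target_ct, ref_ct, target_curve, ref_curve):
    """Target abundance divided by housekeeping (e.g., TBP) abundance."""
    return ct_to_abundance(target_ct, *target_curve) / ct_to_abundance(ref_ct, *ref_curve)

def fold_change(perturbed, unperturbed):
    """Ratio of normalized abundance in perturbed versus unperturbed cells."""
    return perturbed / unperturbed
```

A fold change below 1 indicates reduced normalized target abundance in perturbed cells relative to unperturbed cells.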

HCR-FlowFISH cellular imaging.

Cell aliquots were taken from HCR-FlowFISH prepared samples prior to cytometric analysis and placed on a standard slide. Cells were imaged with a WideField EpiFluorescence microscope from ASI Imaging at 20× magnification with a 0.5 NA Nikon Objective. Acquisition settings were kept constant between images. A custom image analysis pipeline was created using CellProfiler (Broad Institute) to identify cells and obtain red:green intensity quantifications per cell.

MPRA design, experiment and analysis.

The MPRA library for the FADS locus was constructed as previously described¹¹. Briefly, 200-bp sequences were designed that tile across hg19 at chr11:61555001–61665622 moving over the region with a 5-bp sliding window, both the forward and reverse orientation were selected for testing for a total of 44,330 sequences. All 3,108 single nucleotide polymorphisms and small indels in the 1000 Genomes phase 3 were also tested by centering the allele in the middle of 200 bp of flanking sequence and taking both orientations. In addition, we included 265 positive and negative control sequences selected based on their activity in previous MPRA assays. Oligos were synthesized by Agilent Technologies with 15 bp of adaptor sequence on either end. Unique 20-bp barcodes were added by PCR along by Gibson assembly (primers MPRA_v3_F & MPRA_v3_20I_R). The oligo library was expanded by electroporation into E. coli and an appropriate number of expanded cultures were selected to achieve an average of 200 CFU (barcodes) per oligo sequence tested. The resulting purified ΔGFP plasmid library was sequenced using Illumina 2 × 150 bp chemistry to acquire oligo-barcode pairings. The library underwent AsiSI restriction digestion, and a GFP amplicon with a minimal TATA promoter was inserted by Gibson assembly resulting in the 200-bp oligo sequence positioned directly upstream of the promoter and the 20-bp barcode falling in the 3’ UTR of GFP. After expansion within E. coli, the final MPRA plasmid library was sequenced by Illumina 1 × 31 bp chemistry to acquire a baseline representation of each oligo-barcode pair within the library.
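The tiling design can be illustrated with a short Python sketch that slides a 200-bp window across a sequence in 5-bp steps and emits both orientations. It is a generic illustration under assumed names and does not reproduce the library design code; adaptor and barcode addition are omitted.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(_COMPLEMENT)[::-1]

def tile_region(sequence, tile_len=200, step=5):
    """Yield (offset, orientation, tile) for forward and reverse orientations."""
    for start in range(0, len(sequence) - tile_len + 1, step):
        tile = sequence[start:start + tile_len]
        yield start, "+", tile
        yield start, "-", reverse_complement(tile)
```

For a region of length L, this produces 2 × (⌊(L − 200) / 5⌋ + 1) tiles before any filtering.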

A total of 100 million K562 cells were transfected with a Thermo Fisher Neon Transfection System (3 pulses, 1450 V, 10 ms) using 10 million cells and 5 μg of plasmid per transfection in RPMI. Forty-eight hours after transfection, cells were collected by centrifugation and washed three times with PBS prior to freezing at −80 °C. RNA was extracted, treated with DNase, and a mixture of 3 GFP-specific biotinylated oligos (GFP_BiotinCapture_1–3) was used to capture GFP transcripts using Streptavidin C1 Dynabeads (Life Technologies). Following another round of DNase treatment, cDNA was synthesized and GFP mRNA abundance was quantified by qPCR to determine the cycle at which linear amplification begins for each replicate. Replicates were diluted to approximately the same concentration based on the qPCR results, and a 14-cycle PCR with NEBNext Ultra II Q5 Master Mix was used to amplify the cDNA (primers MPRA_v3_Illumina_GFP_F & Ilmn_P5_PCR). A second round of PCR (6 cycles) was used to add Illumina sequencing adaptors and indices to the DNA/RNA replicates. The resulting MPRA barcode libraries were spiked with 5% PhiX and sequenced using Illumina single-end chemistry (with 8-bp index read) on a NextSeq 500.

Data from the MPRA were analyzed as previously described¹¹. Briefly, the sum of the barcode counts for each oligo was provided to DESeq2 and replicates were median normalized followed by an additional normalization of the RNA samples to center the RNA/DNA activity distribution over a log₂ fold change of zero⁵⁶. Oligos showing differential expression relative to the plasmid input were identified by modeling a negative binomial distribution with DESeq2 and applying a false discovery rate (FDR) threshold of 1%. For sequences that displayed significant MPRA activity, a paired t-test was applied on the log-transformed RNA/plasmid ratios for each experimental replicate to test whether the reference and alternate allele had similar activity. An FDR threshold of 10% was used to identify SNPs with a significant skew in MPRA activity between alleles (allelic skew).
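FDR thresholds such as those above are commonly applied to adjusted p-values. The following Python sketch implements the Benjamini-Hochberg adjustment as one possible procedure; the text does not state which FDR procedure was used for the allelic-skew test, so its use here is an assumption for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def significant(pvalues, fdr=0.10):
    """Indices of tests passing the chosen FDR threshold."""
    return [i for i, q in enumerate(benjamini_hochberg(pvalues)) if q <= fdr]
```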

Cutting library design, construction, sequencing, and analysis.

We used the 909 guides within the ~8-kb region (chr11:61635743–61643818, hg38) surrounding the FADS1-FADS3 intergenic CRE to design a targeted subpool for use in a CRISPR cutting experiment. Oligos were synthesized by Twist Biosciences and were amplified identically to larger libraries. Guides were cloned into a vector bearing the catalytically active CRISPR protein (lentiCRISPRv2 Addgene: 52961). HCR was performed identically to CRISPRi screens after 10 days to allow deletions to be generated.

HCR-labeled and sorted cells were de-crosslinked and DNA isolated following the same protocol used for Illumina sgRNA libraries. The guide-targeted region was amplified as a single 8.9-kb amplicon in a 50 μL reaction containing 24 μL of extracted DNA, 1 μL PrimeStar GXL polymerase, 1 μL of 10 mM dNTP, 1 μL of each primer (10 μM), and 10 μL of 5× GXL buffer. Each sample was amplified with a unique primer pair containing a 5-bp barcode on the 5’ end used for demultiplexing post-sequencing. Samples were pooled at equimolar ratios prior to standard SMRTBell ligation library preparation. HiFi circular consensus sequencing was performed for 30 h on a Sequel II by the JAX Genome Technologies Laboratory group. Downstream processing was performed with the PacBio toolset using the lima package for demultiplexing individual samples (flags: --single-side, --ccs, --window-size 5, --min-length 500) and read mapping using the minimap2 wrapper pbmm2 with standard settings for CCS reads to build 38 of the human genome⁵⁷.

Mapped reads were quantified with respect to the number of unique deletions detected at each nucleotide by selecting reads mapping to the FADS locus and supported by 2 or more CCS passes of sequencing. We next designated the genomic area within the amplicon and outside of our previously defined CREs as the control area, such that deletions in this space help define the null effect distribution, similar to non-targeting gRNAs in CRISPRi screens. We then ran CASA on the CRISPR cutting data using 50-nt windows without extending the effects of deletions outside of the nucleotides they explicitly remove (in contrast to the effects of gRNAs in CRISPRi screens, which we assume generate a wide area of effect). An odds ratio threshold of 2.0 was used to define the region of significant activity.
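The per-nucleotide counting of unique deletions can be sketched in Python as below. The input format (half-open deletion intervals already extracted from alignments), the function names, and the summing of depth within 50-nt windows are assumptions made for illustration; how window-level values enter CASA is not specified here.

```python
def unique_deletion_depth(deletions, region_start, region_end):
    """Count distinct deletions covering each nucleotide of a region.

    deletions: iterable of (start, end) half-open genomic intervals.
    Identical intervals are treated as one unique deletion.
    """
    depth = [0] * (region_end - region_start)
    for start, end in set(deletions):
        # A deletion contributes only to the nucleotides it explicitly removes
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth

def window_sums(depth, window=50):
    """Summarize per-nucleotide depth in consecutive fixed-width windows."""
    return [sum(depth[i:i + window]) for i in range(0, len(depth), window)]
```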

Luciferase assay.

K562 cells were suspended in R buffer, mixed with 450 ng of empty pGL4.23 or 500 ng of pGL4.23 containing the tested element and 100 ng of pGL4.74, and electroporated in a 10 μL volume with the Neon transfection system (Thermo Fisher Scientific, MPK5000) by 3 pulses of 1450 V for 10 msec each. After adding 65 μL of RPMI 1640 supplemented with 10% FBS, cells were cultured in 96-well plates for 24 h. Each tested vector was assayed in 6 replicates. Cells were transferred to 96-well white plates before assay (Greiner, 655075). The Dual-Glo Luciferase assay system (Promega, E2940) was used to measure Firefly and Renilla luciferase activity following the manufacturer’s protocol, and luminescence was detected using a SpectraMax i3 (Molecular Devices). The Firefly/Renilla luminescence ratio was used as the activity of each replicate.

Electrophoretic mobility shift assays (EMSAs).

EMSA probes corresponding to long and short versions of the FADS3 repressive CRE were generated by designing, synthesizing and annealing biotin end-labeled probes (IDT, Supplementary Table 2). K562 nuclear extracts were prepared using the NE-PER Extraction Kit (Thermo Scientific), and EMSAs were completed using the LightShift Chemiluminescent EMSA kit (Thermo Scientific) following the manufacturer’s instructions. Binding reactions for the short oligos were generated by incubating the following components for 25 min at 25 °C: 1× binding buffer, 1 μg poly dI-dC, 4 μg K562 nuclear extract, and 200 fmol labeled probe. DNA-protein complexes were detected by chemiluminescence.

For antibody supershifts using the long oligos, reactions were pre-incubated with all components except the biotinylated probe at 25 °C for 25 min, after which the biotinylated probe (100 fmol) was added to each tube. Reactions were incubated for another 25 min at 25 °C before being stopped with loading dye and run as described above. REST antibody (EMD Millipore ChIP validated, cat# 17–641), including 30% glycerol, or IgG was added at concentrations of 0.5, 1, and 2 μg each.

Population genetic analysis, eQTLs, and UK Biobank fine-mapping.

LD (r²) was reported from analysis in the 1000 Genomes project, using individuals from Europe⁵⁸. eQTLs were obtained from all tissues from the GTEx portal (v8.0) on 30 April 2018. We used statistical fine-mapping results of 96 complex traits and diseases in the UK Biobank that we previously conducted (https://www.finucanelab.org/data). Briefly, we conducted GWAS and statistical fine-mapping in the UK Biobank using up to 361,194 white British individuals. We computed association statistics for the variants with INFO > 0.8, MAF > 0.01% (except for rare coding variants with MAC > 0), and HWE P > 1 × 10⁻¹⁰ using SAIGE (for case-control studies)⁵⁹ or BOLT-LMM (for continuous traits)⁶⁰ with covariates including the top 20 PCs, sex, age, age², sex × age, sex × age², and dilution factor where appropriate. Statistical fine-mapping was performed using FINEMAP (v1.3.1)⁶¹,⁶² and susieR (v0.8.1.0521)⁶³ with the maximum number of causal variants specified as 10. We defined a region based on a 3-Mb window around a lead variant and then merged any regions that overlapped. We used summary statistics from the GWAS and in-sample LD matrices, which were calculated from imputed dosages for individuals included in each GWAS using LDstore v2.0b.
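The region definition for fine-mapping can be illustrated with a brief Python sketch that places a 3-Mb window around each lead variant and merges overlapping windows. Centering the window on the lead variant (±1.5 Mb) and the function name are assumptions for illustration.

```python
def merge_regions(lead_variants, window=3_000_000):
    """Build and merge fine-mapping regions around lead variants.

    lead_variants: iterable of (chromosome, position).
    Returns a sorted list of (chromosome, start, end) with overlaps merged.
    """
    half = window // 2
    intervals = sorted((chrom, max(0, pos - half), pos + half)
                       for chrom, pos in lead_variants)
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous region on the same chromosome: extend it
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged]
```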

Connectivity and epigenetic correlation analysis.

We obtained 10-kb resolution, GMAP analyzed, topologically associating domain (TAD) annotations from TADKB⁶⁴ and assessed if putative CRE-promoter interactions characterized in this study are contained completely within contiguous TADs. ChIA-PET overlaps were performed using intrachromosomal peaks from ENCODE (ENCFF001TIC). For CRE CASA activity score correlation analysis, we took the absolute value of CASA activity scores and correlated it against either the DHS or H3K27ac mean signal in K562 cells, using ENCODE datasets described above. We compared only within called peaks of DHS or H3K27ac, calculating Spearman correlations. We correlated CASA activity scores with the sum maximum of plus strand and negated minus strand PRO-seq data from GSM1480327. We correlated CRE activity at non-promoter elements with their distance to GENCODE v32 promoter annotation⁶⁵.
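The correlation of absolute CASA activity scores with epigenetic signal can be sketched with a dependency-free Spearman implementation in Python. The function names are hypothetical, and the inputs are assumed to be paired per-peak values (one activity score and one mean signal per called peak).

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spearman_abs_activity(activity_scores, signal):
    """Spearman correlation between |activity score| and mean peak signal."""
    return pearson(average_ranks([abs(a) for a in activity_scores]),
                   average_ranks(signal))
```

Taking the absolute value first means that strongly activating and strongly silencing elements are both treated as high-activity elements in the comparison.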

Guide design optimization analysis.

We subsetted the number of guides used to analyze CRE interactions with FADS1 and determined the impact of guide density on peak calling with CASA using two experimental replicates. We subsetted guides by applying a 100-nt sliding window to the FADS locus and selecting a specified fraction of guides targeting each window for random removal.
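The guide subsetting procedure can be sketched in Python as follows. The sketch bins guides into consecutive, non-overlapping 100-nt windows, which is a simplifying assumption about how the sliding window was applied; the function name, rounding rule, and seed handling are likewise illustrative.

```python
import random

def subsample_guides(guide_positions, fraction_removed, window=100, seed=0):
    """Randomly remove a fixed fraction of guides within each window.

    guide_positions: list of genomic coordinates, one per guide.
    Returns the sorted indices of the guides that are kept.
    """
    rng = random.Random(seed)
    bins = {}
    for idx, pos in enumerate(guide_positions):
        bins.setdefault(pos // window, []).append(idx)
    kept = []
    for _, members in sorted(bins.items()):
        # Remove the requested fraction of guides targeting this window
        n_remove = round(len(members) * fraction_removed)
        removed = set(rng.sample(members, n_remove))
        kept.extend(i for i in members if i not in removed)
    return sorted(kept)
```

Removing guides per window, rather than across the whole locus at once, keeps the relative guide density along the locus approximately uniform after subsetting.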

Data availability.

All raw CRISPRi screening data, MPRA data, as well as processed files have been uploaded to the ENCODE Portal with the accession# ENCSR455UGU (https://www.encodeproject.org/publication-data/ENCSR455UGU/). Track hubs are available for each locus screened at the following links:

https://genome.ucsc.edu/s/skr2/GATA_HCR

https://genome.ucsc.edu/s/skr2/CD164_HCR

https://genome.ucsc.edu/s/skr2/ERP29_HCR

https://genome.ucsc.edu/s/skr2/LMO2_HCR

https://genome.ucsc.edu/s/skr2/NMU_HCR

https://genome.ucsc.edu/s/skr2/MEF2C_HCR

https://genome.ucsc.edu/s/skr2/FADS_HCR

https://genome.ucsc.edu/s/skr2/MYC_HCR

DNase hypersensitivity and histone modification data were collected from ENCODE (https://www.encodeproject.org). Topologically associating domains were collected from TADKB (http://dna.cs.miami.edu/TADKB/). Genome-wide association study data were collected from the UK Biobank and the Global Lipids Genetics Consortium (https://biobank.ndph.ox.ac.uk/showcase/ and http://lipidgenetics.org, respectively). Fine-mapping data are available at https://www.finucanelab.org/data.

Code availability.

CASA software is available at https://github.com/sjgosai/CASA. Python software is managed using conda, which is available at https://repo.continuum.io/miniconda/. Bowtie software is available at https://bioconda.github.io (version 1.2.3 was used). GuideScan software is available at https://bioconda.github.io (version 1.2–1 was used). FlowJo is available at https://www.flowjo.com/solutions/flowjo/ (version 10.7 was used). CellProfiler is available at https://cellprofiler.org/releases (version 4.1.3 was used). SAIGE software is available at https://github.com/weizhouUMICH/SAIGE (version 0.29.4.2 was used). BOLT-LMM software is available at https://alkesgroup.broadinstitute.org/BOLT-LMM/BOLT-LMM_manual.html (version 2.3.2 was used). FINEMAP software is available at http://www.christianbenner.com (version 1.3.1 was used). susieR software is available at https://github.com/stephenslab/susieR (version 0.8.1.0521, commit e81a5ca, was used).

Extended Data

Extended Data Fig. 3 — a, HCR-FlowFISH and PrimeFlow-CRISPRi individual guide score comparison for shared guides. Guides are grouped by overlap with CASA-nominated CREs. We find using HCR-FlowFISH improves separability between guide scores inside and outside of designated CREs compared to PrimeFlow. We also note guide score variability is reduced in HCR-FlowFISH. The minima, centers, and maxima of the boxes indicate the 25th, 50th, and 75th percentiles of the data distributions. Whiskers capture all remaining data, excluding outliers extending beyond 1.5 times the interquartile range below or above the 25th or 75th percentiles, respectively. n = 2,897 (grey boxes) and n = 88 (green boxes) shared guides analyzed outside and inside CRE boundaries, respectively. b, CASA CRE identification on simplified ABC data and comparison to HCR data. CASA only considers the highest and lowest expression bins from the first PCR replicate of each CRISPRi-FlowFISH screen replicate, yet distinguishes CREs from non-specific scores induced by perturbing the *GATA1* gene body, in contrast to the original analysis.

Extended Data Fig. 4 — **a,b**, Connectogram diagrams showing K562 DHS (light blue), K562 H3K27ac (dark blue), guide coverage (black), HCR-FlowFISH composite guide score tracks, and CASA CREs calls for *MYC* (teal), *PVT1* (salmon), *LMO2* (orange), *CAPRIN1* (navy), and *CAT* (lilac). CASA-derived CRE activity scores are shown as lines connecting the CRE to the target gene, and colored by effect on transcript abundance (black decreases abundance, red increases abundance). In a, ‘Pro’ and ‘e1–4’ denote the promoter and enhancers identified at this locus in Fulco et al.¹⁷. In b, ‘P’, ‘I’, ‘D’, denote the proximal, intermediate and distal promoters of *LMO2*, respectively. c, Relative mRNA expression compared to unperturbed cells for CRISPRi perturbations of distal, intermediate + distal, and proximal + distal promoters. Three technical replicates shown, bars represent standard deviation.

Extended Data Fig. 5 — **a,b**, Individual guide scores (points) and CASA CRE calls (bars) of HCR-FlowFISH screens for *FADS1* (green), *FADS2* (teal), and *FADS3* (orange). K562 DHS (light blue) and H3K27ac (dark blue) peaks are also shown. Notably, these elements are shared between all three *FADS* genes. Surprisingly, perturbing the CRE in a results in a modest, but detectable, increase in *FADS3* transcripts, in contrast to the decreases in *FADS1* and *FADS2* transcript abundance.

Extended Data Fig. 6 — a, Genomic region surrounding the *FADS3* promoter, highlighting tiling MPRA signal (red) and HCR-FlowFISH composite score for *FADS3* (orange). rs174466 is denoted, along with all variants in linkage disequilibrium (r² ≥ 0.2). Variants within an HCR-FlowFISH identified *FADS3* CRE are labeled in orange, and variants displaying allelic skew from MPRA are denoted with a red outline. SP2 ChIP-seq signal overlapping rs174466 is included in grey. b, GWAS trait associations with rs174466 shows multiple overlaps with metabolic targets of *FADS3*. c, MPRA activity for reference and alternate version of the rs174466 shows increased CRE activity on the alternate allele. d, Motif for SP2 highlighting change to alternate allele better matches the canonical motif.

Acknowledgements

We thank Charles Fulco, Aaron Lin, Cameron Myhrvold, Hayden Metsky, Brittany Petros, John Ray, Steve Schaffner, and James Xue for editing and conversations about the manuscript. We thank Chelsea Otis, Natan Pirete, and Patricia Rodgers in the Broad Flow Cytometry Core for cytometry and sorting assistance. We thank the Broad Imaging Platform for custom scripting and assistance in image analysis. We thank John Ray and Matthew Bakalar in the Hacohen Lab for sorting and microscopy assistance. We thank Charles Fulco, Jesse Engreitz, and Eric Lander for discussion on PrimeFlow and CRISPR screens. This work and S.K.R., S.J.G., A.G., A.M.-S., S.K., D.B., and R.T. were supported as an ENCODE Functional Characterization Center (UM1HG009435), a Broad SPARC grant, and the Howard Hughes Medical Institute. S.K.R. is partially supported by K99HG010669 and F32HG00922. R.T. is supported by R00HG008179. S.J.G. was partially supported by 4T32GM007226-41.

Footnotes

Competing Interests

P.C.S. is a co-founder of and consultant to Sherlock Biosciences and a Board Member of Danaher Corporation. The remaining authors declare no competing interests.

References

1. Davis CA et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 46, D794–D801 (2018).
2. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
3. Andersson R et al. An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461 (2014).
4. Chen L et al. Genetic drivers of epigenetic and transcriptional variation in human immune cells. Cell 167, 1398–1414.e24 (2016).
5. Sanyal A, Lajoie BR, Jain G & Dekker J. The long-range interaction landscape of gene promoters. Nature 489, 109–113 (2012).
6. Maurano MT et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
7. Huang H et al. Fine-mapping inflammatory bowel disease loci to single-variant resolution. Nature 547, 173–178 (2017).
8. Mahajan A et al. Fine-mapping type 2 diabetes loci to single-variant resolution using high-density imputation and islet-specific epigenome maps. Nat. Genet. 50, 1505–1513 (2018).
9. Ulirsch JC et al. Systematic functional dissection of common genetic variation affecting red blood cell traits. Cell 165, 1530–1545 (2016).
10. Vockley CM et al. Massively parallel quantification of the regulatory effects of noncoding genetic variation in a human cohort. Genome Res. 25, 1206–1214 (2015).
11. Tewhey R et al. Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay. Cell 165, 1519–1529 (2016).
12. Ray JP et al. Prioritizing disease and trait causal variants at the TNFAIP3 locus using functional and genomic features. Nat. Commun. 11, 1237 (2020).
13. Gilbert LA et al. CRISPR-mediated modular RNA-guided regulation of transcription in eukaryotes. Cell 154, 442–451 (2013).
14. Canver MC et al. BCL11A enhancer dissection by Cas9-mediated in situ saturating mutagenesis. Nature 527, 192–197 (2015).
15. Sanjana NE et al. High-resolution interrogation of functional elements in the noncoding genome. Science 353, 1545–1549 (2016).
16. Korkmaz G et al. Functional genetic screens for enhancer elements in the human genome using CRISPR-Cas9. Nat. Biotechnol. 34, 192–198 (2016).
17. Fulco CP et al. Systematic mapping of functional enhancer-promoter connections with CRISPR interference. Science 354, 769–773 (2016).
18. Rajagopal N et al. High-throughput mapping of regulatory DNA. Nat. Biotechnol. 34, 167–174 (2016).
19. Diao Y et al. A tiling-deletion-based genetic screen for cis-regulatory element identification in mammalian cells. Nat. Methods 14, 629–635 (2017).
20. Leonetti MD, Sekine S, Kamiyama D, Weissman JS & Huang B. A scalable strategy for high-throughput GFP tagging of endogenous human proteins. Proc. Natl. Acad. Sci. U. S. A. 113, E3501–8 (2016).
21. Gasperini M et al. A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell 176, 377–390 (2019).
22. Fulco CP et al. Activity-by-contact model of enhancer-promoter regulation from thousands of CRISPR perturbations. Nat. Genet. 51, 1664–1669 (2019).
23. Choi HMT et al. Third-generation in situ hybridization chain reaction: multiplexed, quantitative, sensitive, versatile, robust. Development 145, dev165753 (2018).
24. Lachmann A et al. Massive mining of publicly available RNA-seq data from human and mouse. Nat. Commun. 9, 1366 (2018).
25. Cho SW et al. Promoter of lncRNA gene PVT1 is a tumor-suppressor DNA boundary element. Cell 173, 1398–1412.e22 (2018).
26. Perez AR et al. GuideScan software for improved single and paired CRISPR guide RNA design. Nat. Biotechnol. 35, 347–349 (2017).
27. Bhattacharya A, Chen C-Y, Ho S & Mitchell JA. Upstream distal regulatory elements contact the Lmo2 promoter in mouse erythroid cells. PLoS One 7, e52880 (2012).
28. Visel A, Minovitsky S, Dubchak I & Pennacchio LA. VISTA Enhancer Browser--a database of tissue-specific human enhancers. Nucleic Acids Res. 35, D88–D92 (2007).
29. Landry J-R et al. Expression of the leukemia oncogene Lmo2 is controlled by an array of tissue-specific elements dispersed over 100 kb and bound by Tal1/Lmo2, Ets, and Gata factors. Blood 113, 5783–5792 (2009).
30. Oram SH et al. A previously unrecognized promoter of LMO2 forms part of a transcriptional regulatory circuit mediating LMO2 expression in a subset of T-acute lymphoblastic leukaemia patients. Oncogene 29, 5796–5808 (2010).
31. Ulirsch JC et al. Interrogation of human hematopoiesis at single-cell and single-variant resolution. Nat. Genet. 51, 683–693 (2019).
32. Gilbert LA et al. Genome-scale CRISPR-mediated control of gene repression and activation. Cell 159, 647–661 (2014).
33. Tycko J et al. Mitigation of off-target toxicity in CRISPR-Cas9 screens for essential non-coding elements. Nat. Commun. 10, 4063 (2019).
34. Ye K, Gao F, Wang D, Bar-Yosef O & Keinan A. Dietary adaptation of FADS genes in Europe varied across time and geography. Nat. Ecol. Evol. 1, 167 (2017).
35. Mychaleckyj JC et al. Multiplex genomewide association analysis of breast milk fatty acid composition extends the phenotypic association and potential selection of FADS1 variants to arachidonic acid, a critical infant micronutrient. J. Med. Genet. 55, 459–468 (2018).
36. Mathieson I et al. Genome-wide patterns of selection in 230 ancient Eurasians. Nature 528, 499–503 (2015).
37. Fenton JI, Gurzell EA, Davidson EA & Harris WS. Red blood cell PUFAs reflect the phospholipid PUFA composition of major organs. Prostaglandins Leukot. Essent. Fatty Acids 112, 12–23 (2016).
38. 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
39. Buniello A et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47, D1005–D1012 (2019).
40. Kamat MA et al. PhenoScanner V2: an expanded tool for searching human genotype-phenotype associations. Bioinformatics 35, 4851–4853 (2019).
41. Tukiainen T et al. Detailed metabolic and genetic characterization reveals new associations for 30 known lipid loci. Hum. Mol. Genet. 21, 1444–1455 (2012).
42. GTEx Consortium et al. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
43. Willer CJ et al. Discovery and refinement of loci associated with lipid levels. Nat. Genet. 45, 1274–1283 (2013).
44. Bycroft C et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).

Methods-only References

45. Doench JG et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol. 34, 184–191 (2016).
46. Langmead B, Trapnell C, Pop M & Salzberg SL Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
47. Hsu PD et al. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat. Biotechnol. 31, 827–832 (2013).
48. Morgens DW et al. Genome-scale measurement of off-target activity using Cas9 toxicity in high-throughput screens. Nat. Commun. 8, 15178 (2017).
49. Wang T, Lander ES & Sabatini DM Viral packaging and cell culture for CRISPR-based screens. Cold Spring Harb. Protoc. 2016, pdb.prot090811 (2016).
50. Kruschke JK Rejecting or accepting parameter values in Bayesian estimation. Advances in Methods and Practices in Psychological Science 1, 270–280 (2018).
51. Salvatier J, Wiecki TV & Fonnesbeck C Probabilistic programming in Python using PyMC3. PeerJ Comput. Sci. 2, e55 (2016).
52. Hoffman MD & Gelman A The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15, 1593–1623 (2014).
53. Wang J et al. Nascent RNA sequencing analysis provides insights into enhancer-mediated gene regulation. BMC Genomics 19, 633 (2018).
54. Wang J et al. Factorbook.org: a Wiki-based database for transcription factor-binding data generated by the ENCODE consortium. Nucleic Acids Res. 41, D171–D176 (2013).
55. Kulakovskiy IV et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res. 46, D252–D259 (2018).
56. Love MI, Huber W & Anders S Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
57. Li H Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
58. Sudmant PH et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).
59. Zhou W et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341 (2018).
60. Loh P-R, Kichaev G, Gazal S, Schoech AP & Price AL Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).
61. Benner C et al. FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493–1501 (2016).
62. Benner C, Havulinna AS, Salomaa V, Ripatti S & Pirinen M Refining fine-mapping: effect sizes and regional heritability. bioRxiv 318618 (2018) doi: 10.1101/318618.
63. Wang G, Sarkar A, Carbonetto P & Stephens M A simple new approach to variable selection in regression, with application to genetic fine-mapping. bioRxiv 501114 (2019) doi: 10.1101/501114.
64. Liu T et al. TADKB: Family classification and a knowledge base of topologically associating domains. BMC Genomics 20, 217 (2019).
65. Core LJ et al. Analysis of nascent RNA identifies a unified architecture of initiation regions at mammalian promoters and enhancers. Nat. Genet. 46, 1311–1320 (2014).

Associated Data

Supplementary Materials

1723995_Sup_Fig

NIHMS1723995-supplement-1723995_Sup_Fig.pdf (2.4 MB, pdf)

1723995_Sup_Tab

NIHMS1723995-supplement-1723995_Sup_Tab.xlsx (285.2 KB, xlsx)

1723995_reporting_summary

NIHMS1723995-supplement-1723995_reporting_summary.pdf (2.4 MB, pdf)

1723995_Sup_Data1

NIHMS1723995-supplement-1723995_Sup_Data1.pdf (4.9 MB, pdf)

1723995_Sup_Data2

NIHMS1723995-supplement-1723995_Sup_Data2.xlsx (50.5 MB, xlsx)

1723995_Sup_Data3

NIHMS1723995-supplement-1723995_Sup_Data3.pdf (45.6 KB, pdf)

1723995_Sup_Data4

NIHMS1723995-supplement-1723995_Sup_Data4.pdf (47.7 KB, pdf)

Data Availability Statement

https://genome.ucsc.edu/s/skr2/GATA_HCR

https://genome.ucsc.edu/s/skr2/CD164_HCR

https://genome.ucsc.edu/s/skr2/ERP29_HCR

https://genome.ucsc.edu/s/skr2/LMO2_HCR

https://genome.ucsc.edu/s/skr2/NMU_HCR

https://genome.ucsc.edu/s/skr2/MEF2C_HCR

https://genome.ucsc.edu/s/skr2/FADS_HCR

https://genome.ucsc.edu/s/skr2/MYC_HCR
