Abstract
Understanding the effects of genetic variation in gene regulatory elements is crucial to interpreting genome function. This is particularly pertinent for the hundreds of thousands of disease-associated variants identified by GWAS, which frequently sit within gene regulatory elements but whose functional effects are often unknown. Current methods are limited in their scalability and ability to assay regulatory variants in their endogenous context, independently of other tightly linked variants. Here, we present a new medium-throughput screening system: genome engineering based interrogation of enhancers assay for transposase accessible chromatin (GenIE-ATAC), that measures the effect of individual variants on chromatin accessibility in their endogenous genomic and chromatin context. We employ this assay to screen for the effects of regulatory variants in human induced pluripotent stem cells, validating a subset of causal variants, and extend our software package (rgenie) to analyse these new data. We demonstrate that this methodology can be used to understand the impact of defined deletions and point mutations within transcription factor binding sites. We thus establish GenIE-ATAC as a method to screen for the effect of gene regulatory element variation, allowing identification and prioritisation of causal variants from GWAS for functional follow-up and understanding the mechanisms of regulatory element function.
Graphical Abstract
INTRODUCTION
Genome-wide association studies (GWAS) of complex diseases and traits have identified an ever-growing list of associated genetic loci currently numbering 515 302 (1). However, it remains hugely challenging to identify specific causal variants and genes, since the majority of these loci lie in non-coding regions of the genome and there are often many candidate variants in linkage disequilibrium (LD). The identification of causal genes has been aided by expression quantitative trait loci (eQTL) datasets from relevant cell types, since colocalization between eQTL and GWAS results can indicate a shared genetic cause of disease and gene expression change (2,3). Whilst useful, these datasets are currently underpowered, resulting in many disease associated loci lacking an eQTL association to a specific gene, and in other cases, multiple genes being implicated at the same locus (4). Another method to link a disease association to a gene is through annotation of enhancer-promoter loops using chromatin conformation capture (3C, HiC, pchiC, captureC) methodologies (5), but these have limitations in resolution, frequently missing short-range interactions (6), and often implicating multiple genes (7).
Establishing causal variants is perhaps even more challenging. Statistical fine-mapping methods can be used to narrow down the list of putative causative SNPs from human genetic data (4,8), but their resolution is limited by the recombination structure in human populations. Sites of open chromatin, marked by accessibility to DNA modifying enzymes (e.g. DNAseI hypersensitivity (9) or assay for transposase accessible chromatin, ATAC (10)), are bound by transcription factors (TF) in place of canonical nucleosomes. Variants residing at these genomic locations could therefore exert their effects by altered TF binding and gene regulation. Thus, annotation of chromatin features, such as open chromatin domains and histone modifications, can also help to identify likely causative SNPs and provide an insight into the molecular mechanism (11). However, such chromatin features are highly cell type and stimulation specific, making it difficult to rule out particular variants, and frequently implicate multiple candidate variants. Chromatin QTL studies are able to identify regulatory DNA variation that directly impacts chromatin architecture (12) but similarly to eQTL analysis, these studies are often underpowered and do not necessarily implicate individual variants (6). Importantly, none of these methods are able to directly demonstrate the causality of a specific variant. Massively parallel reporter assays can be used to understand the effect of specific variants on enhancer activity (13,14), but enhancers are removed from their native genomic and chromatin context, and thus are unable to demonstrate that a specific variant truly has an effect in vivo. Use of genome engineering technologies to introduce specific SNPs is an attractive way to causally link a variant to a change in enhancer or gene function (15), but these experiments are time-consuming and difficult to scale. It is also possible to use CRISPR inhibition or activation techniques targeted to presumed enhancer regions to screen for those that are active in a particular cell type (16–18), but the resolution of these methods is limited to around 1 kb, and they do not show an effect of a specific genetic variant(s).
To circumvent some of these limitations, we have previously developed a medium throughput arrayed screening method we termed genome engineering based interrogation of enhancers (GenIE) that allows screening for the effect of a variant of interest on transcription of the gene in which it resides (19). However, this is limited to those variants present within transcribed regions of the genome, which does not allow analysis of the large number of intergenic enhancers, in which GWAS variants are selectively enriched (20). Here we have developed a method, GenIE-ATAC, that allows screening for the effect of specific genetic variants on the open chromatin region in which they are located, in their endogenous context.
MATERIALS AND METHODS
Experimental design
The GenIE-ATAC method requires the SNP of interest to be located within an ATAC peak in the assayed cell type. For each SNP, three primers were designed: an upstream biotinylated forward primer for the linear PCR enrichment step, a nested forward primer for the ATAC PCR step, and a reverse primer for gDNA amplification of the region (Supplementary Figure S1). All primers bound within the ATAC peak, the forward primers did not overlap with each other and were located close to the SNP but outside the HDR template for editing experiments. The nested forward primer and reverse primer contained adaptor sequence for the addition of barcodes for Miseq (31) and amplicons were <295 bp to allow sequencing using a 150 PE Miseq run. All primers were unique in BLAT searches. For reverse experiments, equivalent primers were designed but in the opposite directions. For editing experiments, we chose the guide with a cut site closest to the SNP of interest and off-target cutting of guides was checked using WGE (https://www.sanger.ac.uk/htgt/wge/). All guide sequences, HDR oligo sequences, and primer sequences used are detailed in Supplementary Tables S1 and S2.
Tn5 production
GenIE-ATAC requires the Tn5 protein to be loaded with a single oligonucleotide (MEDS-A) which differs from the commercially produced Tn5 which is loaded with MEDS-A and B oligonucleotides. We purified Tn5 using a C-terminal intein tag and a chitin-binding domain as previously published (32), with some adaptations including the use of chitin magnetic beads. We transformed pTBX1-Tn5 plasmid (Addgene 60240) into C3013 cells (NEB) and expressed Tn5 as previously described. Bacterial pellets (from 500 ml culture) were resuspended in 12 ml cold HEGX (20 mM HEPES–KOH at pH 7.2, 0.8 M NaCl, 1 mM EDTA, 10% glycerol, 0.2% Triton X-100) plus proteinase inhibitors (PI) (Complete, Roche), sonicated on ice at 80% power for 8 × 30 s on/30 s off with a probe sonicator and centrifuged (13 000 rpm for 20 min 4°C) to produce 12.5 ml cleared lysate. Neutralised PEI (1 ml 10%, final concentration 0.8%) was added dropwise to the lysate to precipitate the DNA, and the lysate was centrifuged to clear (13 000 rpm for 20 min 4°C). The supernatant (∼12 ml) was added to 1 ml HEGX washed chitin magnetic resin (NEB) and left to bind for 1 h at 4°C. Beads were washed 4 times with HEGX buffer before adding 250 nmol (50 μl) of annealed MEDS-A oligo in 3 ml HEGX buffer plus PI and rocked overnight at RT.
Annealed MEDS:
MEDS-A: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
MEDS-REV: pCTGTCTCTTATACACATCT/intT
Oligos were resuspended in 10 mM Tris–HCl (pH 8.5) to a concentration of 10 nmol/μl and 25 μl of each were mixed, heated to 95°C and left to cool slowly to RT. Importantly, annealed oligos were used on the same day.
After incubation with oligos, the beads were washed 4 times with HEGX buffer, and Tn5 protein was released from the beads by incubation with 50 mM fresh DTT in 4 ml HEGX for 48 h at 4°C. Eluted protein was collected and dialysed overnight in 2L 2× Tn5 dialysis buffer (100 HEPES–KOH at pH 7.2, 0.2 M NaCl, 0.2 mM EDTA, 2 mM DTT, 0.2% Triton X-100, 20% glycerol) at 4°C. The Tn5 was concentrated using 10K MWCO spin protein concentrator (Pierce) to 24 uM (1.32 mg/ml) as determined by Bradford. Then an equal volume of 100% glycerol was added and carefully mixed before aliquoting and storing at −20°C. Tn5 activity was titrated using gDNA in tagmentation buffer (5× buffer: 50 mM TAPS–NaOH, 25 mM MgCl2, 50% v/v DMF pH 8.5). The amount of Tn5 that was needed to tagment 1 μg gDNA to around 400 bp was found to be equivalent to the amount of Tn5 needed to be added to 0.5 × 106 nuclei to tagment to the optimal size of fragments (around 400–800 bp) for GenIE-ATAC. This amount was between 1–5 μl of Tn5 produced as described above.
hiPSC cell culture and ethics
iPSC lines were generated as part of the HipSci project (KOLF_2, Cambridgeshire 1 NRES REC Reference 09/H0304/77) and work on these is covered under HMDMC 14/013. Human KOLF_2 (HIPSCI, www.hipsci.org) were grown in feeder-free conditions in TeSR-E8 medium (StemCell Technologies) on Synthemax (Corning) (final amount 2 μg/cm2) and routinely passaged 1:10 every 5 days using Gentle Cell Dissociation Reagent (Stemcell Technologies).
Arrayed CRISPR-cas9 editing
hiPSCs were edited by nucleofection of ribonucleoprotein (RNP) complex containing full-length chemically modified synthetic guide RNA and SpCas9, along with a ssODN repair template (19,21). Briefly, SpCas9 was expressed and purified from Escherichia coli using a His-tag. Diluted SpCas9 (5 μl, 20 μg) was mixed with full-length guide RNA (Synthego) (5 μl, 225 pmol) at RT for 20 min for RNP complexes to form, followed by addition of the ssODN repair template (5 μl, 500 pmol) just before the nucleofection. Cells were dissociated using accutase and 1 × 106 cells in P3 buffer were mixed with the RNP/template complex and nucleofected in large cuvettes (V4XP-3024 Lonza) using 4D-Nucleofector on program CA137. After nucleofection cells were plated onto a 10 cm dishes coated with Synthemax (5 μg/cm2) with TeSR-E8 supplemented with Rock inhibitor. After 24 h, the media was exchanged for TeSR-E8 and after ∼10 days the cells, when they had grown to approximately 80% confluence, were harvested by accutase and processed for ATAC (0.5 × 106 cells) or pellets (2 × 106 cells) were frozen for gDNA extraction. Routinely at least three ATAC samples and three gDNA samples were analysed per pool of cells.
ATAC assay, linear PCR and nested PCR
hiPSCs grown on Synthemax were washed with PBS and 6 ml accutase was added for 5 min at 37°C. The accutase was aspirated and the cells were gently resuspended in 10 ml TeSR-E8. Cells were pelleted (300 g 3 min) and resuspended in 1 ml cold PBS. After counting, 0.5 × 106 cells were transferred to an eppendorf (not loBind) and pelleted (300 g 3 min). The cells were then resuspended in 500 μl fresh ice cold sucrose buffer (10 mM Tris–HCl pH 7.5, 3 mM CaCl2, 2 mM MgCl2, 0.32 M sucrose) and incubated on ice for 12 min. Triton-X-100 was added to a final concentration of 0.5% (25 μl of a fresh 10% solution), mixed gently and incubated on ice for 6 min. The whole lysis solution was transferred to a new eppendorf (not loBind) and spun 450g for 5 min at 4°C to collect the nuclei. All traces of the lysis buffer were removed from the nuclei and 50 μl tagmentation mixture (10 μl 5× tagmentation buffer, Tn5 (amount from titration above), water to 50 μl) was added to the nuclei, gently resuspended and transferred to a Lobind eppendorf. Nuclei were tagmented for 30 min at 37°C and the DNA was cleaned up using a MinElute PCR cleanup kit (Qiagen) and eluted in 21 μl EB. The DNA was quantified using Qubit and was typically 30–60 ng/μl. For each SNP, 3–4 independent tagmentations were carried out from the same pool of cells.
Linear PCR was carried out on the purified DNA for each ATAC repeat:
Purified tagmented DNA | 10 μl |
Biotinylated forward primer (2uM) fresh | 1.5 μl |
KAPA hotstart ready mix | 25 μl |
Water | 13.5 μl |
1. | 95°C | 5 min |
2. | 98°C | 20 s |
3. | 60°C | 30 s |
4. | 72°C | 30 s |
5. | Goto step 2 49× |
Kapa (non-hotstart) enzyme (1 μl) was spiked into the PCR mixtures and a further 50 cycles completed.
1. | 98°C | 20 s |
2. | 60°C | 30 s |
3. | 72°C | 30 s |
4. | Goto Step 1 49× | |
5. | 72°C | 10 min |
6. | 4°C storage |
Biotinylated DNA pulldown was performed using Dynabeads MyOne Streptavidin C1 and 20 μl slurry was used per PCR. The beads were washed twice in 40 μl PBS with 0.1% BSA and once with 2xBW buffer (10 mM Tris–HCl, pH 7.5, 1 mM EDTA, 2 mM NaCl) before being resuspended in 50 μl 2× BW buffer. The whole linear PCR (50 μl) was added to the bead suspension and rotated overnight at RT. The beads were collected on the magnet, and the supernatant removed. The beads were washed once with 100 μl water, and resuspended in 30 μl EB.
The subsequent amplification of DNA used the beads as template with the nested forward primer specific for the SNP of interest and the Tn5A primer (TCGTCGGCAGCGTCAG) specific for the oligonucleotide loaded into the Tn5 and integrated into the DNA during tagmentation. The amount of beads was tested (Supplementary Figure S4) and routinely we use 10 μl allowing up to three technical repeats to be performed per ATAC. PCR was carried out using PowerUp SYBR Green Master Mix (Applied Biosystems) with 10 μl bead template and 0.4 μM final concentration forward and reverse primers in a 50 μl reaction.
1. | 95°C | 10 min |
2. | 95°C | 15 s |
3. | 57°C | 15 s |
4. | 72°C | 30 s |
5. | Goto Step 2 35× | |
6. | 72°C | 5 min |
7. | 4°C storage |
The PCR products were analysed on a 2% agarose gel, stained with ethidium bromide, and should be a smear of DNA sizes ranging from ∼400–800 bp (similar to the tagmented DNA).
gDNA extraction and PCR
gDNA was prepared using the MagAttract HWM kit (Qiagen) (19). PCR was carried out using PowerUp SYBR Green Master Mix (Applied Biosystems) with 5 μl gDNA (250–500 ng) template and 0.4 μM final concentration forward and reverse primers in a 50 μl reaction. Typically 3–4 PCRs were carried out from one gDNA preparation. We also tested using tagmented gDNA as the template for PCR (Supplementary Figure S2) and in these experiments 250–500 ng DNA was tagmented as described in the ATAC experiment, cleaned up using the MinElute PCR cleanup kit (Qiagen) and used as PCR template.
Barcoding and sequencing
In order to add the Illumina indices (31) we performed a second PCR using 1 μl PCR1 (from ATAC or gDNA samples), PowerUp SYBR Green Master Mix (Applied Biosystems), and 0.4 μM final concentration forward and reverse primers in a 25 μl reaction.
FGFR OCT4 binding site experiment
To carry out editing, we mixed 2 HDR oligos equally (for 1bp edit or 3 bp edit at the OCT4 binding site in the FGFR intronic super enhancer) and performed a nucleofection with RNP as described above. We performed three biological repeats for ATAC, gDNA and RNA samples with three technical repeats for each at the PCR stage. The RNA sample was processed as previously published (19).
CTCF saturation mutagenesis experiment
To carry out the editing for the saturation mutagenesis of CTCF binding site, we mixed 9 HDR oligos equally and performed 8 nucleofections with RNP as described above. The cells were pooled and grown over 5 × 10 cm dishes and nine ATAC and nine gDNA independent samples were harvested. For the linear PCR we inputted 20 μl tagmented DNA, and after biotin pulldown, we used all 30 μl beads into the PCR (split over 4 PCR reactions). These PCRs were pooled before adding the Miseq barcodes. For the gDNA PCRs, we inputted 4x more gDNA and split over four PCR reactions before pooling for Miseq barcoding. This was to make sure we had good coverage of all of the editing events that had taken place in the pool of cells.
Sequencing and read alignment
PCR amplicons were sequenced using 2 × 150 bp reads on an Illumina MiSeq instrument. Since amplicons were all smaller than 295 bp, we merged the overlapping 150-bp paired-end reads using FLASH v1.2.11 (33) to improve alignment of Cas9-induced deletions. As input to FLASH we specified a minimum overlap of 10 bp, fragment size as the amplicon size, fragment standard deviation of 40, and maximum mismatch density of 10%, and used the –allow-outies parameter. A mean of 98.6% of reads could be successfully merged, with standard deviation of 1.5%. For all regions except rs141252451 in MIR8070, we aligned merged reads to the GRCh38 human reference sequence using bwa mem v0.7.17 (34), with lenient parameters to allow aligning Cas9-induced deletions (-O 24,48 -E 1 -A 4 -B 16 -T 70 -k 19 -w 200 -d 600 -L 20 -U 40). For rs141252451, we aligned to an amplicon sequence that included the 2-bp insertion defined by rs141252451, since rgenie excludes insertion reads by default.
Analysis with rgenie
The core analysis performed in rgenie is the same as previously described (19). Briefly, for each replicate (ATAC or gDNA) at each locus, rgenie extracts reads mapping to the targeted region from the aligned BAM file. To quantify the effect of an allele X on the ATAC-seq readout, we first determine for each replicate the ratio of the read count of X to that of the WT allele, r = (read count X / read count WT). We separately compute the mean of this ratio for ATAC replicates, rA, and for gDNA replicates, rG. If allele X alters chromatin accessibility, then the ratios will differ between ATAC and gDNA, i.e. rA ≠ rG. We use a two-tailed unequal variances t-test to test for a difference in this ratio, and we report p-values from this test. Although the statistical analysis is identical for an RNA or an ATAC readout, we updated rgenie to provide plots and statistical summaries tailored to an ATAC analysis. Specifically, an ATAC analysis by default reports only deletions wholly contained within a ±6 bp window around the targeted SNP. This is because with ATAC-seq the reads span a range of sizes depending upon the ATAC peak shape, with most reads tending to be short. As a result, longer deletions are not reliably captured; this differs from RNA, where all reads span the full amplicon. We also highlight allele effect statistics for the two most prevalent deletions within this window. In Supplementary Tables S3 and S4, we provide a statistical analysis summary from applying rgenie to the eight SNPs and the CTCF mutagenesis experiments, respectively. See the Data Availability section for links to rgenie input file details, amplicon sequences, and HDR/WT alleles for these analyses.
RESULTS
We extended our existing GenIE method (19) to assay the effect of variants lying within gene regulatory elements on the chromatin accessibility of that region. This increases the proportion of GWAS-associated variants (20) we can assay by a further 23.7% to a total of 92.7% across one or other assay (Figure 1A). It also allows us to measure the effect of a variant on both transcription (GenIE) and chromatin accessibility (GenIE-ATAC) for up to 52.9% of GWAS associated variants. We introduced a variant of interest by genome editing, and used a locus-specific ATAC to measure the effect of each allele on chromatin accessibility. Thus, we could directly demonstrate the causality of a specific variant in altering chromatin at its endogenous location within the genome. This will allow for instance to distinguish between several causal variants in high LD, or to establish cause and effect relationships for co-regulated enhancers. In contrast to the alternative approach of making edited isogenic cell lines, we were able to avoid the time-consuming step of clonal isolation (21) and the problem of clone-to-clone variability (22) by analysing the pool of edited cells. This made it possible to perform an arrayed screen of dozens of variants, and will, in the future, allow the use of more complex and disease relevant cellular models such as differentiated induced pluripotent stem cells (iPSCs) and primary cell types. Since there is no necessity for extended culture or clonal growth, this should be possible in most cell types, making it possible to assay the variants that are in chromatin accessible regions specific to a particular cell-type or ‘state’, which are frequently critical in disease.
We used human iPSCs to establish our method, since these cells can be used to model regulatory element variation in a wide variety of disease-relevant cell types and are amenable to CRISPR-mediated genome editing. We first delivered Cas9 and a guide RNA as a ribonucleoprotein along with a 100 nt single-stranded DNA oligonucleotide homology directed repair (HDR) template (21). This generated a population of cells containing the WT allele (unedited) and the HDR allele (SNP introduced) along with a large number of indels (small insertions/deletions) (Figure 1B).
We harvested genomic DNA (gDNA) from the mixed population of cells, performed PCR across the edited site and assessed the frequency of different alleles within the population using high throughput sequencing of the amplicon. Using exactly the same population of cells we performed ATAC, which selectively integrates adaptor sequences into open chromatin using the Tn5 transposase. We then carried out a nested PCR using a target-specific primer near to the edited site and a universal primer that binds the Tn5 adaptor sequence. This amplicon was also sequenced, and the effect on chromatin accessibility for the edited allele was then calculated as the ratio of sequencing reads in the ATAC relative to the gDNA for the edited allele, normalised to the same ratio for the unedited (WT) allele (Figure 1B).
Two developments were necessary to successfully establish a site-specific ATAC protocol. Firstly, we purified our own Tn5 and loaded it with a single oligonucleotide sequence (MEDS-A sequence) to allow a single PCR primer binding site for the subsequent PCR of the ATAC sample. Our method also requires a substantial input of ATAC DNA into the PCR to avoid any PCR bias, and the cost of commercial Tn5 protein would limit the number of variants that could be assayed in a single experiment. Secondly, we needed to carry out a linear PCR using a biotinylated primer, followed by a biotin pulldown of the amplified DNA to enrich for the region of interest prior to a semi-nested PCR (Supplementary Figure S1). This is because the Tn5 MEDS-A sequence is integrated at tens or hundreds of thousands of sites in the genome, whereas the gene specific primers only have one binding site, and thus a semi-nested PCR alone was unable to give sufficient enrichment of the desired region.
We first tested GenIE-ATAC on three heterozygous SNPs in KOLF_2 iPSCs that overlapped with accessible chromatin regions defined by ATAC-seq in these cells. These variants were selected due to their high probability of affecting chromatin accessibility based on fine-mapping of chromatin QTL data from sensory neurons (23). As chromatin QTL data was not available from hiPSCs, the variants were cross referenced with hiPSC ATACseq to make sure that the variants were within chromatin accessible regions, but it is still possible that they may not have an effect in this cell type. As expected for heterozygous alleles, when we performed amplicon sequencing over the variant of interest from gDNA we observed an equal proportion of the two alleles (ratio of 1) (Figure 2, red bars).
For each SNP tested, when we analysed amplicon sequencing using GenIE-ATAC, there was an imbalance of reads between the two alleles (ratio skewed away from 1), with the alt alleles either decreasing chromatin accessibility (for SNP1) or increasing it (for SNPs 2 and 3). These results were recapitulated using primers designed in the reverse direction with the locus specific primer on the opposite side of the variant (Supplementary Figure S2). Importantly, the effects of SNPs on chromatin accessibility measured by GenIE-ATAC were consistent with the expectations from the chromatin QTL data and confirmed independently by analysis of allele-specific reads from genome-wide ATAC-seq in hiPSCs (19). As negative controls for GenIE-ATAC we also performed the experiment using three heterozygous SNPs located in ATAC peaks that had no strong allelic imbalance in ATAC-seq (Figure 2, bottom panel), and once again the data from GenIE-ATAC corroborated that seen with genome-wide allele-specific ATAC-seq. We used the standard deviation between replicates (0.0477) to estimate the sensitivity of the assay. With three replicates, at a P-value of 0.01 we can detect a 21.9% change in accessibility, or a 13.3% change with a P-value of 0.05.
The control in these experiments was a simple PCR from gDNA, but we wanted to validate that we obtained the same results with tagmented gDNA that had been through the linear PCR and enrichment process. We found that tagmentation and enrichment did not cause an allelic bias, and we still observed a ratio of 1 for heterozygous alleles. However it did increase the variation we observed (Supplementary Figure S3) when compared to simple gDNA PCR, likely due to sampling bias. Thus, we used gDNA PCR as the control for subsequent experiments to increase consistency and the throughput of the assay. In order to increase the number of replicates that could be achieved per SNP per experiment, we also tested the effect of titrating the amount of ATAC DNA to be added into the linear PCR. The results and variation were similar with 3 times less input material (Supplementary Figure S4), which we used for subsequent experiments.
We next used GenIE-ATAC to screen 8 naturally occurring SNPs for their causal effect on accessibility of chromatin after editing with CRISPR. These SNPs were selected based on being homozygous, located in ATAC peaks in our iPSC line, and having a high probability of affecting chromatin accessibility from chromatin QTL data (23). The details of these regions and their predicted effects on transcription factor binding using the deep learning method DeepSEA (24) and manual curation are indicated in Table 1.
Table 1.
SNP # | SNP ID | Position of SNP | Nearest gene | Position of SNP with iPS ATAC peak | KOLF2 genotype | Edit introduced by HDR | Binding site information |
---|---|---|---|---|---|---|---|
1 | rs55685444 | chr5:33470746 | TARS | 0.54 | C hom ref | G alt | In motifs of CTCF and JUND, within binding peaks. DeepSEA predicts decrease in CTCF binding by the alt allele. |
2 | rs12269414 | chr10:63872514 | REEP3 (250 kb away) | 0.61 | G hom ref | A Alt | In a CTCF motif within a CTCF peak. DeepSEA predicts massively decreased CTCF binding for the alt allele. |
3 | rs7729529 | chr5:43007716 | ZNF131 and SEP1 | 0.17 | C hom ref | G alt | In a region of dense TF binding. DeepSEA predicts significantly increased binding of CEBPB. |
4 | rs6700034 | chr1:20508117 | MUL1 | 0.33 | C hom ref | A alt | In 5′UTR of MUL1. Alt allele had a transcriptional increase in GenIE assay. |
5 | rs8011143 | chr14:22599022 | ABHD4 | 0.52 | T hom ref | C alt | Within intron 1 of ABHD4. Alt allele had a transcriptional decrease in GenIE assay. |
6 | rs4757506 | chr11:17276730 | NUCB2 | 0.49 | A hom ref | G alt | SNP very near the start of the 5′ UTR. DeepSEA predicts slight decreases in CTCF binding and histone marks. |
7 | rs185220 | chr5:56909530 | SETD9 | 0.26 | G hom alt | A ref | In a region of dense TF binding in the 5′ UTR of SETD9. DeepSEA predicts strongly increased DNaseHS for the alt allele, as well as multiple TFs. |
8 | rs141252451 | chr11:11799003 | MIR8070 | 0.51 | hom ref | Insertion CA alt | SNP is right in the small peak centre, in a CTCF motif and Chip-seq peak. DeepSEA predicts completely lost CTCF binding. |
The editing efficiency of each of these SNPs in our WT iPSC line (KOLF_2) was variable but generally high (Figure 3A, bottom panel). This may be because they are located in regions of accessible chromatin that have previously been shown to have particularly high editing rates (25). This is advantageous for GenIE-ATAC since all variants will be located within ATAC peaks. Even in the cases with the lowest HDR rates (1%), using 1 million cells and assuming a 50% survival, there are around 5000 independent editing events, which essentially eliminates clonal artefacts. Data was processed using an R package we have developed (rgenie) which automates the analysis of experiments, and reports quality control and statistical analysis of the effect of both HDR alleles and defined small deletions over the variant of interest.
The HDR effect size after editing the desired SNPs to the alternative allele (Figure 3A, upper panel, and Supplementary Table S3) showed that three of the edited SNPs did not have any significant effect on chromatin accessibility (regions 1, 4 and 5). We have previously shown that the SNP in region 4 has an effect on transcription of the MUL1 gene (19), so a lack of change in chromatin accessibility suggests its effect is mediated post-transcriptionally, likely at the RNA level. SNPs edited in regions 2, 3 and 8 all showed a decrease in chromatin accessibility consistent with a reduction in transcription factor binding. The Region 2 SNP is within a CTCF binding motif and the editing of a G > A within the motif is consistent with a predicted reduction of CTCF binding and chromatin accessibility upon editing. Region 8 had the largest effect size in the screen, and in this case the HDR event created a 2 bp insertion (CA) within a CTCF binding site, which was predicted to result in the complete loss of CTCF binding. Interestingly, the SNPs in regions 6 and 7 showed an increase in chromatin accessibility, potentially due to the introduction of a new binding site for a transcription factor.
As well as introducing the SNP of interest at each site by HDR, the GenIE-ATAC method also creates a large population of deletion and insertion alleles around the cut site of the Cas9/guide. Since insertions are generally of lower frequency, and are more difficult to interpret, we excluded these from our analysis, but the deletion repertoire provides information about the effect of these mutations on chromatin accessibility. Accordingly, we analysed the effect of all deletions wholly contained within a 6 bp window around the SNP of interest (Figure 3A, middle panel, and Supplementary Table S3). As expected, deletion of bases within an ATAC peak very often results in a reduction of chromatin accessibility due to disruption of transcription factor binding, and we observed this in six of our eight edited locations. However, deletions around the SNP in region 5 did not have any effect and deletions in region 6 had an opposite effect to the introduction of the SNP, demonstrating the importance of being able to introduce and assay specific desired variants. Closer inspection of individual deletions in region 4 (Supplementary Figure S5) showed that those that removed a poly-C region had a strong effect on chromatin accessibility, but those that did not had a weak or no effect. This suggests that these C bases are important for binding of an unknown factor, and highlights how analysis of specific deletions can provide information about mechanism of action. Analysis of region 2 (Figure 3B), demonstrated that whilst introduction of the SNP led to a loss of accessibility, the deletions fell into two classes where accessibility was virtually abrogated, or was retained (red boxes). Closer inspection showed that in the case of the specific deletions which maintained chromatin accessibility, the adjacent sequences were such that the CTCF consensus site was recapitulated upon deletion. Thus, it is highly likely that CTCF binding was responsible for the accessible chromatin peak at this site. Deletions in region 8 showed a similar but smaller effect than HDR-generated 2 bp insertion, but closer analysis showed that the three most frequent deletions had only very small effects whereas most other deletions had strong effects. This is probably due to the repetitive nature of this region which likely recapitulated a binding site in these three specific examples. All plots were generated using the rgenie R package and the full output for all SNPs in this experiment is shown in supplementary note.
Since the majority of SNPs in chromatin accessible regions are also transcribed (Figure 1), there is an opportunity to measure the effect of variants on both transcription using GenIE (19) and chromatin accessibility using GenIE-ATAC. To demonstrate this, we chose a transcribed intronic region that is part of a super enhancer of the FGFR1 gene that had been shown to have activity in a reporter assay (26), overlaps with a chromatin accessible region in hiPSCs and contains a consensus OCT4-SOX2 binding site (Figure 4A). We performed an experiment where we analysed the effect of making a single or triple mutation in the OCT4 binding site using both GenIE and GenIE-ATAC. Editing was effective with HDR rates of 4.9% and 2.7% for single and triple mutations respectively (Figure 4A). This showed that as predicted, both single and triple mutations and deletions reduced chromatin accessibility and transcription (Figure 4B, supplementary note). The triple mutation showed a more substantial decrease in chromatin accessibility than the single mutant, but there was no difference in their effect on transcription. This may be due to a loss of binding of the neighbouring SOX2 protein, which may affect chromatin accessibility but not further decrease the activity of the enhancer. It highlights how chromatin accessibility is not perfectly correlated with transcription, and how the two assays can complement each other and help to understand the mechanisms of action of specific variants in vivo.
The effects of point mutations within transcription factor binding sites are often difficult to predict based on consensus sequences alone (27). We therefore performed GenIE-ATAC in a multiplex format to simultaneously introduce multiple SNPs across a defined region in one experiment and obtain nucleotide-level resolution of the determinants of transcription factor binding. We targeted the CTCF binding site in region 2 and designed HDR oligo templates to mutate 9 bases in the binding site individually into every other possible base (Figure 5A).
We performed GenIE-ATAC using a pool of all HDR oligos. It was not possible to introduce mutations at the N in the NGG PAM due to the fact that Cas9 would continue to introduce indels even after successful HDR, and bases further from the Cas9 cut site could not be evaluated due to low HDR rates. We increased the amount of ATAC DNA material and gDNA into each PCR proportionally for the number of HDR events to maintain representation (see methods). Figure 5B shows the effect size of each substitution (see also Supplementary Table S4). For example, a substitution of an A to G in position 4 of the CTCF site had no effect on binding, whereas mutation to any other base dramatically decreased chromatin accessibility. The inverse of this effect size reflects the importance of each base for CTCF binding, and broadly recapitulates the CTCF binding motif (Figure 5C, D). There are some differences, however, which could be due to several reasons. Firstly, the consensus binding motif measures the frequency of each base across all binding sites, and the nucleotide determinants of binding at a specific locus may be different (e.g. due to competitive or cooperative binding interactions, effects of DNA major/minor groove structure, or chromatin modifications or structure). Secondly, DNA binding factors other than CTCF could be contributing to the changes in chromatin accessibility. Finally, the effect of mutation has been shown to be highly context-specific, such that mutation of a consensus base can be rescued or buffered by a second mutation elsewhere in the binding site (27). This result illustrates the context-dependence of transcription factor binding and the difficulty in predicting the effect of a mutation from binding site consensus alone.
DISCUSSION
We believe that GenIE-ATAC will be a generally useful screening method to identify causal variants from GWAS studies and to understand transcription factor binding in the context of chromatin. Importantly, it is able to functionally link specific genetic variants to changes in chromatin accessibility in their endogenous locus and genomic context. Whilst chromatin accessibility is a direct readout of the effect of variants on transcription factor binding, it is not always directly correlated to downstream effects on transcription. We show that GenIE-ATAC can be combined with GenIE to simultaneously measure effects on transcription in many cases. This can help to identify alternative mechanisms of action, for example transcriptional repressor binding and post-transcriptional RNA regulatory elements. Our new methodology can thus validate causal variants, and provide mechanistic information about how they could influence transcription factor binding and gene expression, which is currently understood in a vanishingly small number of cases.
Homology directed repair can be used to precisely introduce virtually any mutation of interest given that Cas9 can cut every 8 bp in an average genome, and we can introduce SNPs at least 10 nt away from the cut site. However, advances in genome editing methods such as alternative Cas enzymes, improved homology directed repair efficiency (28), base editing (29) and prime editing (30) will further increase the number and range of mutations that can be screened and the cell types amenable to editing. Further development of the system for instance by using the Tn5 transposon to insert unique molecular identifiers will also help to improve sensitivity, reduce PCR artefacts and improve the accuracy of quantification.
Our ability to perform GenIE ATAC experiments in a multiplex manner also allows us to understand more generally the functional consequences of variation within transcription factor recognition sequences in their endogenous context. This will allow us to interpret the effects of sequence or chromatin context on the effect of individual substitutions within transcription factor binding sites.
We thus believe that the GenIE-ATAC method will facilitate the prioritisation of candidate genetic variants from GWAS studies for further functional follow-up and provide mechanistic insights into how these polymorphisms affect gene expression and influence disease.
DATA AVAILABILITY
The R package (rgenie) developed in this study is available at https://github.com/Jeremy37/rgenie. Raw data is available at Zenodo (https://doi.org/10.5281/zenodo.5802511).
Supplementary Material
ACKNOWLEDGEMENTS
pTXB1-Tn5 was a gift from Rickard Sandberg (Addgene plasmid # 60240; http://n2t.net/addgene:60240; RRID:Addgene_60240). We acknowledge Scientific Operations at the Sanger Institute for support in next generation sequencing and quality control. We thank members of the Bassett lab for helpful discussions.
Authors’ contributions: S.C. and J.S. contributed equally to this work. S.C. planned and conducted experiments with help from E.C. and Q.W. J.S. developed the software and conducted the analyses. A.B. supervised the study.
Contributor Information
Sarah Cooper, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; OpenTargets, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
Jeremy Schwartzentruber, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; OpenTargets, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
Eve L Coomber, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
Qianxin Wu, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
Andrew Bassett, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK; OpenTargets, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
Open Targets [OTAR037]; Wellcome [206194]. Funding for open access charge: Wellcome.
Conflict of interest statement. None declared.
REFERENCES
- 1. Buniello A., MacArthur J.A.L., Cerezo M., Harris L.W., Hayhurst J., Malangone C., McMahon A., Morales J., Mountjoy E., Sollis E.et al.. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019; 47:D1005–D1012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Kerimov N., Hayhurst J.D., Peikova K., Manning J.R., Walter P., Kolberg L., Samoviča M., Sakthivel M.P., Kuzmin I., Trevanion S.J.et al.. A compendium of uniformly processed human gene expression and splicing quantitative trait loci. Nat. Genet. 2021; 53:1290–1299. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Ghoussaini M., Mountjoy E., Carmona M., Peat G., Schmidt E.M., Hercules A., Fumis L., Miranda A., Carvalho-Silva D., Buniello A.et al.. Open Targets Genetics: systematic identification of trait-associated genes using large-scale genetics and functional genomics. Nucleic Acids Res. 2021; 49:D1311–D1320. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Schwartzentruber J., Cooper S., Liu J.Z., Barrio-Hernandez I., Bello E., Kumasaka N., Johnson T., Estrada K., Gaffney D.J., Beltrao P.et al.. Genome-wide meta-analysis, fine-mapping, and integrative prioritization identify new Alzheimer's disease risk genes. Nat. Genet. 2021; 53:392–402. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. McCord R.P., Kaplan N., Giorgetti L.. Chromosome conformation capture and beyond: toward an integrative view of Chromosome structure and function. Mol. Cell. 2020; 77:688–708. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Kumasaka N., Knights A.J., Gaffney D.J.. Fine-mapping cellular qtls with RASQUAL and ATAC-seq. Nat. Genet. 2016; 48:206–213. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Cano-Gamez E., Trynka G.. From GWAS to function: using functional genomics to identify the mechanisms underlying complex diseases. Front. Genet. 2020; 11:424. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Schaid D.J., Chen W., Larson N.B.. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat. Rev. Genet. 2018; 19:491–504. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Funk C.C., Casella A.M., Jung S., Richards M.A., Rodriguez A., Shannon P., Donovan-Maiye R., Heavner B., Chard K., Xiao Y.et al.. Atlas of transcription factor binding sites from ENCODE dnase hypersensitivity data across 27 tissue types. Cell Rep. 2020; 32:108029. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Buenrostro J.D., Giresi P.G., Zaba L.C., Chang H.Y., Greenleaf W.J.. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods. 2013; 10:1213–1218. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Soskic B., Cano-Gamez E., Smyth D.J., Rowan W.C., Nakic N., Esparza-Gordillo J., Bossini-Castillo L., Tough D.F., Larminie C.G.C., Bronson P.G.et al.. Chromatin activity at GWAS loci identifies T cell states driving complex immune diseases. Nat. Genet. 2019; 51:1486–1493. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Human InducedPluripotent Stem Cells Initiative Consortium Alasoo K., Rodrigues J., Mukhopadhyay S., Knights A.J., Mann A.L., Kundu K., Hale C., Dougan G., Gaffney D.J. Shared genetic effects on chromatin and gene expression indicate a role for enhancer priming in immune response. Nat. Genet. 2018; 50:424–431. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Klein J.C., Agarwal V., Inoue F., Keith A., Martin B., Kircher M., Ahituv N., Shendure J.. A systematic evaluation of the design and context dependencies of massively parallel reporter assays. Nat. Methods. 2020; 17:1083–1091. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Inoue F., Ahituv N.. Decoding enhancers using massively parallel reporter assays. Genomics. 2015; 106:159–164. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Soldner F., Stelzer Y., Shivalila C.S., Abraham B.J., Latourelle J.C., Barrasa M.I., Goldmann J., Myers R.H., Young R.A., Jaenisch R.. Parkinson-associated risk variant in distal enhancer of α-synuclein modulates target gene expression. Nature. 2016; 533:95–99. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Li K., Liu Y., Cao H., Zhang Y., Gu Z., Liu X., Yu A., Kaphle P., Dickerson K.E., Ni M.et al.. Interrogation of enhancer function by enhancer-targeting CRISPR epigenetic editing. Nat. Commun. 2020; 11:485. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Fulco C.P., Munschauer M., Anyoha R., Munson G., Grossman S.R., Perez E.M., Kane M., Cleary B., Lander E.S., Engreitz J.M.. Systematic mapping of functional enhancer-promoter connections with CRISPR interference. Science. 2016; 354:769–773. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Simeonov D.R., Gowen B.G., Boontanrart M., Roth T.L., Gagnon J.D., Mumbach M.R., Satpathy A.T., Lee Y., Bray N.L., Chan A.Y.et al.. Discovery of stimulation-responsive immune enhancers with CRISPR activation. Nature. 2017; 549:111–115. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Cooper S.E., Schwartzentruber J., Bello E., Coomber E.L., Bassett A.R.. Screening for functional transcriptional and splicing regulatory variants with GenIE. Nucleic Acids Res. 2020; 48:e131. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Maurano M.T., Humbert R., Rynes E., Thurman R.E., Haugen E., Wang H., Reynolds A.P., Sandstrom R., Qu H., Brody J.et al.. Systematic localization of common disease-associated variation in regulatory DNA. Science. 2012; 337:1190–1195. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Bruntraeger M., Byrne M., Long K., Bassett A.R.. Editing the genome of human induced pluripotent stem cells using CRISPR/Cas9 ribonucleoprotein complexes. Methods Mol. Biol. 2019; 1961:153–183. [DOI] [PubMed] [Google Scholar]
- 22. Volpato V., Smith J., Sandor C., Ried J.S., Baud A., Handel A., Newey S.E., Wessely F., Attar M., Whiteley E.et al.. Reproducibility of molecular phenotypes after long-term differentiation to Human iPSC-derived neurons: a multi-site omics study. Stem Cell Rep. 2018; 11:897–911. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Schwartzentruber J., Foskolou S., Kilpinen H., Rodrigues J., Alasoo K., Knights A.J., Patel M., Goncalves A., Ferreira R., Benn C.L.et al.. Molecular and functional variation in iPSC-derived sensory neurons. Nat. Genet. 2018; 50:54–61. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Qin Q., Feng J.. Imputation for transcription factor binding predictions based on deep learning. PLoS Comput. Biol. 2017; 13:e1005403. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Chari R., Mali P., Moosburner M., Church G.M.. Unraveling CRISPR-Cas9 genome engineering parameters via a library-on-library approach. Nat. Methods. 2015; 12:823–826. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Barakat T.S., Halbritter F., Zhang M., Rendeiro A.F., Perenthaler E., Bock C., Chambers I.. Functional dissection of the enhancer repertoire in Human embryonic stem cells. Cell Stem Cell. 2018; 23:276–288. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Maurano M.T., Wang H., Kutyavin T., Stamatoyannopoulos J.A.. Widespread site-dependent buffering of human regulatory polymorphism. PLoS Genet. 2012; 8:e1002599. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Skarnes W.C., Pellegrino E., McDonough J.A.. Improving homology-directed repair efficiency in human stem cells. Methods. 2019; 164-165:18–28. [DOI] [PubMed] [Google Scholar]
- 29. Komor A.C., Kim Y.B., Packer M.S., Zuris J.A., Liu D.R.. Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage. Nature. 2016; 533:420–424. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Anzalone A.V., Randolph P.B., Davis J.R., Sousa A.A., Koblan L.W., Levy J.M., Chen P.J., Wilson C., Newby G.A., Raguram A.et al.. Search-and-replace genome editing without double-strand breaks or donor DNA. Nature. 2019; 576:149–157. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Kozarewa I., Turner D.J.. 96-plex molecular barcoding for the Illumina Genome Analyzer. Methods Mol. Biol. 2011; 733:279–298. [DOI] [PubMed] [Google Scholar]
- 32. Picelli S., Björklund A.K., Reinius B., Sagasser S., Winberg G., Sandberg R.. Tn5 transposase and tagmentation procedures for massively scaled sequencing projects. Genome Res. 2014; 24:2033–2040. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33. Magoč T., Salzberg S.L.. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011; 27:2957–2963. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013; arXiv doi:16 March 2013, preprint: not peer reviewedhttps://arxiv.org/abs/1303.3997.
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The R package (rgenie) developed in this study is available at https://github.com/Jeremy37/rgenie. Raw data is available at Zenodo (https://doi.org/10.5281/zenodo.5802511).