Abstract
Cis-regulatory elements (CREs) have a major effect on phenotypes including disease. They are identified in a genome-wide manner by analyzing the binding of transcription factors (TFs), various co-factors and histone modifications in DNA using assays such as ChIP-seq, CUT&Tag and ATAC-seq. However, these assays are descriptive and must be complemented by high-throughput technologies, such as massively parallel reporter assays (MPRAs), to test the functional activity of these sequences and the effects of variants within them. Currently, technologies that can simultaneously analyze both the regulatory function of a specific sequence and the TFs, co-factors and epigenomic modifications that determine it do not exist. Here, we developed enrichment followed by epigenomic profiling MPRA (e2MPRA), a novel technology that utilizes lentivirus-based MPRA to enrich for the integration of specific CREs into the genome followed by CUT&Tag or ATAC-seq targeted specifically for these sequences. This method allows the simultaneous, high-throughput analysis of regulatory activity, protein binding and epigenetic modification for thousands of candidate CREs and their variants. We demonstrate that e2MPRA can be used to dissect the epigenetic functions of TF motifs arranged in synthetic enhancers, as well as to analyze the effect of enhancer sequence variants on epigenetic modifications. In summary, this technology will increase our understanding of the regulatory code, its effect on the epigenome and how its alteration can lead to a variety of phenotypes including human disease.
Introduction
Gene expression is regulated in a spatiotemporal manner by active cis-regulatory elements (CREs), such as promoters, enhancers, insulators and silencers. Nucleotide variants in CREs can alter gene expression, which can lead to subsequent changes in phenotypes, including cell differentiation, disease, evolution, and other biological phenomena. Active regulatory elements such as promoters and enhancers are bound by various transcription factors (TFs) that play different roles in transcriptional regulation, such as pioneering activity that induces chromatin accessibility, transcriptional activation and repression, histone modification, and chromatin remodeling activities that mediate higher-order chromatin structure.
Several technologies have been developed to identify CREs in a genome-wide manner. For example, DNase-seq and ATAC-seq can identify open chromatin regions that are accessible to TFs1,2. ChIP-seq identifies the binding of specific TFs, co-factors or histone marks by using chromatin immunoprecipitation followed by sequencing3. Cleavage Under Targets and Release Using Nuclease (CUT&RUN) uses an antibody to target specific TFs, co-factors or histone marks followed by the binding of a Protein A/G fused to micrococcal nuclease (pAG-MNase) to cleave the primary antibody-bound sites, allowing candidate CREs (cCREs) to be profiled using a lower cell number than ChIP-seq4,5 (i.e. at least 100 cells for a histone modification and 1,000 cells for a transcription factor are required). CUT&Tag (Cleavage Under Targets and Tagmentation) has been developed from CUT&RUN by using Tn5-mediated tagmentation, generating fragments ready for PCR enrichment and DNA sequencing6. These technologies have identified millions of cCREs across many different cell types and tissues. However, these technologies are descriptive, as the binding of a protein or the presence of a specific histone mark at a certain DNA sequence does not necessarily mean it is a functional CRE. In addition, it is hard to assess the effect of nucleotide variants in these cCREs on regulatory activity using these technologies.
Massively parallel reporter assays (MPRAs) overcome these hurdles by testing thousands of sequences, and variants within them, for regulatory activity through the measurement of a transcribed barcode7. We previously developed a lentivirus-based MPRA (lentiMPRA), where cCREs are integrated into the host genome and can be tested in a wide array of cell types8. By testing a similar episomal MPRA library side-by-side, we showed that this ‘in genome’ readout is more strongly correlated with ENCODE annotations and sequence-based models, and later work on numerous MPRA technologies showed that it also provides higher cell-type specificity predictions than episomal MPRA9. However, while MPRA can reveal the regulatory activity of a sequence, it cannot identify the proteins binding to that sequence nor its epigenetic modifications.
Here, we developed a novel technology, e2MPRA, by combining lentiMPRA with CUT&Tag and ATAC-seq (Fig. 1). We used e2MPRA to analyze synthetic enhancers containing liver-specific transcription factors and dissect their epigenetic roles. Additionally, we examined a perturbation library of nine enhancers containing the POU::SOX motif, which is essential for pluripotency, demonstrating the applicability of this technology for identifying key motifs involved in epigenetic function. This technology can systematically characterize thousands of cCREs and their variants for their functional effect on regulatory activity and epigenetic modification, allowing the dissection of the DNA and epigenetic regulatory code side by side.
Fig. 1: e2MPRA overview.
a, Designed cCRE libraries are associated with barcode sequences and cloned into reporter plasmids. b, The plasmid libraries are packaged into lentivirus and infected into cells to allow genomic integration and enrichment for epigenetic analyses. c, RNA barcodes (for lentiMPRA) and Tn5-tagmented DNA fragments (for ATAC and CUT&Tag) are extracted from infected cells. Genomic DNA is also extracted to estimate integration frequency. d, Extracted samples are amplified with unique molecular identifiers (UMIs) and sequenced to quantify genomic integration frequency for each cCRE and regulatory and epigenetic activity. e, LentiMPRA activity is calculated by RNA/DNA barcode count. For ATAC and CUT&Tag, epigenetic activity is calculated by dividing the number of Tn5-tagged fragments (enriched CRE counts) by the genomic integration frequency of each CRE (inserted CRE counts). This normalization accounts for differences in lentiviral integration efficiency across CREs.
Results
e2MPRA development
To develop e2MPRA, we first designed a pilot library consisting of 400 elements, each 100 base pairs (bp) in length (Methods), which have been previously tested for regulatory activity using lentiMPRA in HepG2 cells10,11. The pilot library included six distinct sequence categories: 1) 50 random genomic sequences as negative controls; 2) 100 scrambled sequences as negative controls; 3) 100 active CREs from the human genome; 4) 50 inactive genomic sequences; 5) 50 active synthetic sequences; and 6) 50 inactive synthetic sequences (Fig. 2a). These sequences were synthesized, amplified, and coupled with a random 15-bp barcode before insertion into a lentiMPRA plasmid containing an EGFP reporter vector to generate a sequence-barcode library (Fig. 1a, Extended Data Fig. 1). The resulting pilot library was packaged into lentivirus and transduced into HepG2 cells at a multiplicity of infection (MOI) of 50 (Fig. 1b). To minimize background signal arising from unintegrated lentiviral DNA, infected cells were passaged twice and cultured for 10 days following infection. Using these cultured cells, we simultaneously performed lentiMPRA, ATAC-seq and CUT&Tag specifically on the library to quantify transcriptional activity and epigenetic modifications in parallel (Fig. 1c, Methods). We conducted a single round of viral infection followed by three independent replicate experiments per assay, which served as technical replicates for downstream analyses.
Fig. 2: Properties of the pilot library.
a, Pilot library composition. b, Violin plots showing the distribution of log2-transformed epigenetic activities (log2(Activity)) of the pilot library elements, measured by lentiMPRA, ATAC-seq and H3K27ac CUT&Tag, across six categories: random genomic sequences (yellow), scrambled sequences (green), active genomic sequences (red), inactive genomic sequences (blue), active synthetic sequences (green), and inactive synthetic sequences (purple). To ensure comparability of activity scores across different assays, replicates, and libraries, trimmed mean of M-values (TMM) normalization was applied to each replicate so that the expectation of log2(Activity) of random genomic sequences was defined as 0. Statistical significance was determined by Mann–Whitney–Wilcoxon tests with Benjamini–Hochberg correction, comparing each category to random genomic sequences as a negative control. (****: adj-P ≤ 1.0e-4; ns: not significant; *(ns): 1.0e-2 < P ≤ 5.0e-2, but adj-P > 5.0e-2). c, Scatter plots comparing log2(Activity) of ATAC-seq or H3K27ac CUT&Tag (x-axis) against lentiMPRA (y-axis). Each dot is colored by its category as shown in panel a. Spearman’s correlation coefficients (ρ) are indicated in each plot. d, Scatter plots comparing log2(Activity) measured by ATAC-seq or H3K27ac CUT&Tag versus normalized mapped read counts from endogenous genomic regions. Each dot represents a genomic CRE from the library and is colored by its category as shown in panel a. Spearman’s correlation (ρ) values are shown.
We first analyzed the regulatory function of each sequence in the library using MPRAflow12. The regulatory activity of each element was quantified as the ratio of RNA barcode counts to the corresponding DNA barcode counts. Since barcode sequences are randomly associated with each element, we included only barcodes that were observed in both DNA and RNA barcode datasets within each replicate. Additionally, elements with fewer than 5 unique barcodes were excluded from the analysis. After filtering, the final dataset contained on average 79.1 unique barcodes per element, with only one element excluded (Extended Data Fig. 2a). This ensured a robust evaluation of activity and helped correct for site of integration biases. We observed a high correlation for the number of UMIs per barcodes (Spearman’s ρ=0.78–0.89; Extended Data Fig. 2b) and RNA/DNA ratio (Spearman’s ρ > 0.98; Extended Data Fig. 2c) across the three replicates. As expected, active genomic CREs and active synthetic sequences exhibited significantly higher regulatory activity compared to random genomic sequences and scrambled sequences (Fig. 2b). Inactive genomic and synthetic sequences showed low activity, recapitulating previous results.
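The barcode filtering and RNA/DNA ratio described above can be sketched as follows. This is a minimal illustration, not the MPRAflow implementation: the function name `lentimpra_activity`, the dictionary inputs, and the choice to aggregate by summing counts over barcodes without sequencing-depth normalization are all assumptions made for the example.

```python
import math
from collections import defaultdict


def lentimpra_activity(dna_counts, rna_counts, barcode_to_element, min_barcodes=5):
    """Return log2(RNA/DNA) per element for one replicate.

    dna_counts, rna_counts: dict mapping barcode -> UMI count (hypothetical inputs).
    barcode_to_element: dict mapping barcode -> element name.
    Barcodes must be observed in both DNA and RNA; elements with fewer
    than `min_barcodes` such barcodes are dropped.
    """
    dna_sum = defaultdict(int)
    rna_sum = defaultdict(int)
    n_barcodes = defaultdict(int)
    for barcode, element in barcode_to_element.items():
        dna = dna_counts.get(barcode, 0)
        rna = rna_counts.get(barcode, 0)
        if dna > 0 and rna > 0:  # keep barcodes seen in both datasets
            dna_sum[element] += dna
            rna_sum[element] += rna
            n_barcodes[element] += 1
    return {
        element: math.log2(rna_sum[element] / dna_sum[element])
        for element, n in n_barcodes.items()
        if n >= min_barcodes
    }
```

Summing counts before taking the ratio is only one possible aggregation; averaging per-barcode ratios would be an equally plausible reading of the text.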
Next, we investigated epigenetic activities of CREs in the pilot library by applying ATAC-seq and H3K27ac CUT&Tag, which profile two well-characterized enhancer marks (chromatin accessibility and H3K27ac). In these assays, Tn5-tagmented elements were enriched using primers that specifically amplify sequences in the pilot library and simultaneously labeled with unique molecular identifiers (UMIs) (enriched CRE count; Figs. 1d–e). To estimate the genomic integration frequency for each element, we quantified “non-tagmented” elements from the input genomic DNA (gDNA) using the same primers (inserted CRE count; Figs. 1d–e). We observed an average of 95.9 and 68.8 UMIs per CRE in ATAC-seq and H3K27ac CUT&Tag, respectively, while the inserted CRE count had an average of 306.7 UMIs per CRE (Extended Data Fig. 2d). Both the enriched CRE counts and the inserted CRE counts were highly correlated between replicates (Spearman’s ρ=0.83–0.97; Extended Data Fig. 2e). The epigenetic activity of each CRE was quantified as the ratio of enriched CRE count to inserted CRE count (Fig. 1e). Although the correlation of UMI counts across replicates was slightly better for ATAC-seq and H3K27ac CUT&Tag (Extended Data Fig. 2e) than for lentiMPRA (Extended Data Fig. 2b), the log2(Activity) correlation between replicates was only moderate (ATAC-seq, Spearman’s ρ=0.64–0.66; H3K27ac CUT&Tag, Spearman’s ρ=0.48–0.53, Extended Data Fig. 2f) and lower than that observed for lentiMPRA (Extended Data Fig. 2c). These moderate correlations are consistent with the way that the present method directly counts CRE fragments using UMIs, leading to greater error propagation when calculating activity, whereas lentiMPRA uses the average of multiple barcodes. To mitigate noise, we summed the counts across replicates and then computed activity by dividing the total enriched count by the total inserted count for each element.
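The replicate pooling described above amounts to a ratio of summed counts. A minimal sketch, assuming per-replicate UMI counts are available as dictionaries keyed by element name; the function name `pooled_epigenetic_activity` and the input layout are hypothetical, and the TMM normalization step is not included.

```python
import math


def pooled_epigenetic_activity(enriched_reps, inserted_reps):
    """Return log2(total enriched / total inserted) per element.

    enriched_reps, inserted_reps: lists with one dict per replicate,
    each mapping element name -> UMI count (hypothetical inputs).
    Elements with a zero total in either count are skipped.
    """
    elements = set().union(*inserted_reps)
    activity = {}
    for element in elements:
        enriched = sum(rep.get(element, 0) for rep in enriched_reps)
        inserted = sum(rep.get(element, 0) for rep in inserted_reps)
        if enriched > 0 and inserted > 0:
            activity[element] = math.log2(enriched / inserted)
    return activity
```

Because counts are summed before the ratio is taken, replicates with deeper coverage contribute more to the pooled estimate than they would if per-replicate ratios were averaged.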
We found that active genomic and synthetic sequences exhibited significantly higher activity than random genomic sequences and scrambled sequences (Mann–Whitney–Wilcoxon tests with Benjamini–Hochberg correction, adj-P values < 0.0001), consistent with the lentiMPRA result (Fig. 2b). Across the library, ATAC-seq and H3K27ac CUT&Tag activities correlated with lentiMPRA activity (Spearman’s ρ = 0.67 and 0.64, respectively; Fig. 2c). We observed a moderate correlation between epigenetic activity as measured by e2MPRA and endogenous ATAC-seq and H3K27ac ChIP-seq signals obtained from the ENCODE dataset13 (ATAC, Spearman’s ρ=0.44; H3K27ac, Spearman’s ρ=0.53; Fig. 2d). This could be due to a technical difference: e2MPRA uses specific primers to detect each element, while genome-wide CUT&Tag and ATAC-seq measure the enrichment of multiple fragment reads for each region. Taken together, our pilot library suggests that although e2MPRA does not quantitatively measure epigenetic activity in a comparable manner to the endogenous genomic context, it can qualitatively capture epigenetic modifications to assess intrinsic regulatory potential.
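The Benjamini–Hochberg correction used for these comparisons can be illustrated with a short standard-library sketch. It implements the textbook step-up procedure; the function name `benjamini_hochberg` is chosen for this example and the code is not taken from the analysis pipeline used here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A raw P value at or below 0.05 can exceed 0.05 after adjustment, which is the situation marked "*(ns)" in the Fig. 2 legend.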
Dissecting TF function on epigenetic modifications via e2MPRA
Next, we set out to use e2MPRA to systematically evaluate the effect of TFs on epigenetic modifications. We designed a library of synthetic cCREs that contain nine binding motifs of TFs that are known to be expressed in the liver: CEBPA, CTCF, FOXA1, HNF1A, NR2F2, ONECUT1, PPARA, REST, and XBP111,14 (Fig. 3a). These TF motifs were systematically arranged into three classes on two distinct neutral templates (chr9:83712634–83712733 and chr2:211153273–211153372; hg19) previously shown not to have enhancer activity in HepG2 cells11,14. Class 1 (N=27 per template, total N=54) consisted of homotypic arrangements of 1, 2, or 4 evenly spaced copies of each TF motif, to assess the impact of the number of motifs on their epigenetic activity. Class 2 consisted of heterotypic arrangements of two different motifs to examine their cooperation (N=288). Class 3 consisted of heterotypic arrangements of four distinct motifs in all possible combinations to explore more complex interactions (N=6,048). We then characterized this library using e2MPRA in HepG2 cells. cCRE activities on the two templates were treated as replicates, yielding two replicates per cCRE, in addition to the three biological replicates of independent library infections.
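The Class 1 and Class 3 library sizes quoted above are consistent with a simple enumeration. The sketch below assumes that Class 3 covers every ordered arrangement of four distinct motifs out of nine on each template, which is an inference from the stated totals rather than a description of the actual design script. Class 2 is omitted because the arrangement rule behind N=288 is not spelled out in this section.

```python
from itertools import permutations, product

MOTIFS = ["CEBPA", "CTCF", "FOXA1", "HNF1A", "NR2F2",
          "ONECUT1", "PPARA", "REST", "XBP1"]
TEMPLATES = ["chr9:83712634-83712733", "chr2:211153273-211153372"]
COPY_NUMBERS = (1, 2, 4)

# Class 1: each motif in 1, 2 or 4 copies on each template.
class1 = list(product(TEMPLATES, MOTIFS, COPY_NUMBERS))

# Class 3 (assumed): every ordered set of four distinct motifs on each template.
class3 = [(template,) + order
          for template in TEMPLATES
          for order in permutations(MOTIFS, 4)]

print(len(class1))  # 54
print(len(class3))  # 6048
```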
Fig. 3: Synthetic cCRE library dissecting TF motif effect on regulatory activity and epigenetic modifications.
a, Synthetic cCRE library design. Nine TF motifs were systematically arranged on two neutral templates into three distinct classes (Class 1: homotypic, Class 2: two-TF combinations, and Class 3: four-TF combinations) to evaluate epigenetic function. (*NR2F2 containing sequences were not sufficiently detected in the assays and excluded from subsequent analyses). b, Regulatory and epigenetic activities (lentiMPRA, ATAC-seq, and H3K27ac CUT&Tag) measured for Class 1 sequences. Significant Spearman’s correlations (FDR < 0.05) between TFBS copy number (x-axis) and log2(Activity) (y-axis) are indicated with a red background, and non-significant correlations are indicated in blue. c, Representative examples showing synergistic transcriptional effects when combining PPARA or REST motifs with other TF motifs (Class 2). Transcriptional activities of homotypic arrangements (Class 1; four identical motifs) were compared to heterotypic arrangements (Class 2; two motifs combined in a 2:2 ratio). Statistically significant synergistic effects (FDR < 0.01) are indicated by a red background and non-significant combinations in blue. d, Network visualization of significantly synergistic TF pairs (FDR < 0.01). Red lines indicate positive synergy, and blue lines indicate negative synergy. Line thickness corresponds to the −log10(adjusted p-value) for the significance of the interaction term.
For the lentiMPRA assay, we observed an average of 157.4 unique barcodes per cCRE (Extended Data Fig. 3a), and the correlations between replicates were high (Spearman’s ρ ≈ 0.9; Extended Data Fig. 3b). For the e2MPRA, we detected on average 406.0 UMIs per cCRE for inserted counts, and 66.5 and 38.8 UMIs per cCRE for enriched counts in ATAC-seq and H3K27ac CUT&Tag, respectively (Extended Data Fig. 3c). The correlations of the activity between replicates for ATAC-seq and H3K27ac were moderate, ranging from 0.4 to 0.5 (Extended Data Fig. 3d), as the e2MPRA data exhibited greater variability compared to lentiMPRA, consistent with our pilot library results. Additionally, sequences containing the NR2F2 motif were rarely amplified or not detected in genomic DNA, likely due to the high GC content of the motif (Supplementary Table 11). Therefore, we excluded these sequences, and the remaining eight TF motifs were used for subsequent analyses.
For Class 1, we found that having more CEBPA, FOXA1, HNF1A and XBP1 motifs significantly increased regulatory activity (MPRA), suggesting their primary role as direct transcriptional activators, as shown previously11 (t-test with Benjamini–Hochberg correction, FDR < 0.05; Fig. 3b, Extended Data Fig. 4). We found that HNF1A and ONECUT1 motifs were associated with an increase in ATAC signal, fitting with their characterized role as pioneer TFs15,16. The PPARA motif affected both chromatin accessibility and H3K27ac modification but not transcriptional activation, i.e. having more copies of it did not increase regulatory activity, consistent with its interaction with chromatin remodeling factors17. Conversely, REST and CTCF, which are known to function as repressors in a context-dependent manner18–20, showed no significant correlation with epigenetic activity (Extended Data Fig. 4).
Next, we compared Class 1 (homotypic arrangements of individual TF motifs) and Class 2 (heterotypic arrangements of two different motifs) to examine synergistic effects of TF motif pairs. For each pair, we performed linear regression to quantify the individual contributions to epigenetic activity, introducing a synergistic interaction term. We observed multiple cooperative effects of two TF motifs on the transcriptional activity measured by lentiMPRA (Fig. 3c–d, Extended Data Fig. 5a), but not for ATAC-seq and H3K27ac activities (Extended Data Fig. 5b–c). For example, PPARA synergistically increased transcriptional activity in combination with CEBPA, CTCF, FOXA1, ONECUT1, and XBP1 (Fig. 3c upper panel), despite its homotypic arrangements increasing only ATAC-seq and H3K27ac activity but not transcriptional activity (Fig. 3b). This suggests that PPARA facilitates transcriptional activation through its epigenetic activity.
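A regression with an interaction term of the kind described above can be sketched with ordinary least squares. The model form used here, activity = b0 + bA·nA + bB·nB + bAB·nA·nB with nA and nB the motif copy numbers, is an assumption for illustration; the exact design matrix and the significance test for the interaction term are not given in this section and are not reproduced. The helper names `fit_ols` and `design_row` are hypothetical.

```python
def fit_ols(X, y):
    """Least-squares coefficients via the normal equations (small problems only)."""
    p = len(X[0])
    # Build A = X^T X and b = X^T y.
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    beta = [0.0] * p
    for i in range(p - 1, -1, -1):
        tail = sum(A[i][j] * beta[j] for j in range(i + 1, p))
        beta[i] = (b[i] - tail) / A[i][i]
    return beta


def design_row(n_a, n_b):
    """Intercept, copies of motif A, copies of motif B, interaction term."""
    return [1.0, float(n_a), float(n_b), float(n_a * n_b)]


# Homotypic (Class 1) and 2:2 heterotypic (Class 2) arrangements for one motif pair.
ARRANGEMENTS = [(1, 0), (2, 0), (4, 0), (0, 1), (0, 2), (0, 4), (2, 2)]
```

With measured log2 activities `y` for these arrangements, `fit_ols([design_row(a, b) for a, b in ARRANGEMENTS], y)` returns the four coefficients, and a positive last coefficient would indicate that the pair performs better than the sum of its individual effects.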
To further investigate regulatory grammar, we analyzed Class 3 sequences (heterotypic arrangements of four distinct motifs in all possible combinations) and assessed the effects of motif order on regulatory activity. For each combination of four motifs (N=70), we performed one-way ANOVA to determine the significance of order on regulatory activity. Among the significant combinations affecting transcriptional activity, the sequence containing four TF motifs [HNF1A, PPARA, REST, XBP1] showed the largest differential regulatory activity depending on TF motif order (Fig. 4a–b). Many of these TF motif combinations showed significant effects on transcriptional activity, but not ATAC-seq or H3K27ac activities, which is consistent with epigenetic modifications occurring in broader regions around cCREs (FDR < 0.01; Fig. 4b). In addition, we further investigated positional enrichment of each TF motif in the top 200 and bottom 200 sequences, with position 1-to-4 being distal-to-proximal from the minimal promoter (mP) (Fig. 4c). Specifically, transcriptional activators such as HNF1A and XBP1 showed greater enhancement at positions closer to the mP in the top 200 sequences and depletion in the bottom 200. Conversely, the REST repressor exhibited stronger repressive effect when located closer to the mP. Taken together, these results demonstrate that e2MPRA allows the analysis of epigenetic activity of TFs and their cooperative function independent of their transcriptional activity.
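The positional enrichment described above is assessed with hypergeometric tests according to the Fig. 4 legend. A minimal sketch of the upper-tail probability using only the standard library follows; the function name and the example numbers are illustrative assumptions, not values from the analysis.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are 'successes'.

    Here N would be the number of ranked sequences, K the number that carry
    a given motif at a given position, n the size of the top (or bottom)
    set, and k the number of those n that carry the motif at that position.
    """
    total = comb(N, n)
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total


# Hypothetical example: 1,680 ordered arrangements of 4 out of 8 motifs,
# 210 of which place a given motif at a given position; 40 of the top 200
# sequences carry it there.
p_value = hypergeom_upper_tail(k=40, N=1680, K=210, n=200)
```

The resulting P values would then be adjusted for multiple testing across motifs and positions.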
Fig. 4: TF motif order affects enhancer activity.
a, Representative example demonstrating how rearranging motif order (HNF1A, PPARA, REST, and XBP1) affects enhancer epigenetic activity. b, Statistical analysis (one-way ANOVA) evaluating whether changes in motif order within each set of four distinct motifs (24 permutations per set) significantly affect epigenetic activities. All 70 unique combinations (selecting 4 TF motifs from 8, 8C4 = 70) were tested. The x-axis represents the difference in activity between the permutation with the highest activity and the permutation with the lowest activity among the 24 permutations tested for each combination. The y-axis indicates the corresponding statistical significance (−log10 adjusted P-value). c, Positional enrichment analysis of TF motifs within the top and bottom 200 permutations ranked by transcriptional activity. Statistical significance of enrichment was determined by hypergeometric tests with Benjamini–Hochberg correction (*: adj-P < 5.0e-2; **: adj-P < 1.0e-2; ***: adj-P < 1.0e-3; ****: adj-P < 1.0e-4). Motif positions are numbered from distal (position 1) to proximal (position 4) relative to the minimal promoter.
e2MPRA analysis of variant effects
We next set out to test the ability of e2MPRA to simultaneously characterize the effect of variants on regulatory activity and epigenetic marks. We selected five sequences positive for enhancer activity from a previous lentiMPRA10 carried out in induced pluripotent stem cells (iPSCs; WTC11 line). These sequences had ATAC-seq and ChIP-seq peaks for POU5F1/SOX2 in human ESCs and iPSCs, as well as being in close proximity (<1 Mb) to genes associated with pluripotency and/or early development (Fig. 5a, Methods). We generated a single-nucleotide substitution library (Fig. 5b), where each nucleotide within the 100-bp sequence was individually mutated to each of the three alternative nucleotides (N = 300 variants per CRE). We then performed e2MPRA using WTC11 iPSCs to quantify the impact of these variants on transcriptional and epigenetic activities.
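The single-nucleotide substitution design described above yields three variants per position, so a 100-bp CRE gives 300 variants. A minimal sketch of the enumeration; the function name and the tuple layout are chosen for this example.

```python
def single_nucleotide_variants(seq):
    """Return (position, ref, alt, variant_sequence) for every substitution.

    Positions are 0-based; each base is replaced by each of the three
    alternative nucleotides in turn.
    """
    variants = []
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt != ref:
                variants.append((pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]))
    return variants
```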
Fig. 5: CRE perturbation analyses.
a, CREs selected for perturbation analysis. Variants derived from seq6846_R (*) were not sufficiently detected in the assays and were therefore excluded from subsequent analyses. b, Two strategies were used for enhancer perturbation analyses. In the single-nucleotide substitution approach (left), each nucleotide in a 100-bp CRE was individually mutated to all three alternative nucleotides (300 variants per CRE). In the 6-bp window perturbation approach (right), consecutive 6-bp segments were randomly mutated across the enhancer sequence using a sliding window shifted by 1 bp increments, resulting in 95 variants per CRE replicate. This randomization was performed twice independently, generating a total of 190 variants per CRE. c-d, Analysis of single-nucleotide substitution effects on epigenetic activities (lentiMPRA, ATAC, and H3K27ac CUT&Tag) for seq68781_R (c) and POU5F1_DE_core (d). Annotated TF motifs22 are shown above each plot. Bar colors represent statistical significance (black: significant at P < 0.01; gray: not significant). e-f, Analysis of 6-bp window perturbation effects for seq68781_R (e) and POU5F1_DE_core (f). Positional effects of mutations were quantified as median absolute deviation (MAD) scores at each nucleotide position and smoothed using a Gaussian filter (line plot). Regions identified as significantly affected by mutations are indicated by gray lines along the baseline.
For the lentiMPRA, we observed an average of 211.8 unique barcodes per CRE (Extended Data Fig. 6a), and the activity of variants was highly correlated between replicates (Spearman’s ρ ≈ 0.95; Extended Data Fig. 6b). For the ATAC-seq and H3K27ac CUT&Tag, we detected on average 754.4 UMIs per element for inserted counts, and 43.3 and 56.5 UMIs per element for enriched counts, respectively (Extended Data Fig. 6c). The correlations between replicates were lower for both ATAC-seq and H3K27ac assays (Spearman’s ρ ≈ 0.4–0.5; Extended Data Fig. 6d), consistent with previous observations from the pilot and HepG2 libraries. One of the five active CREs (seq6846_R; Fig. 5a) yielded insufficient read coverage and was therefore excluded from downstream analyses. The remaining four CREs showed high lentiMPRA activities (log2(Activity) = 0.66–4.79) for the wild-type sequence, consistent with their selection as active enhancers (Fig. 5a). Thus, we focused our subsequent analyses on these four CREs. For the single-nucleotide substitution libraries, we quantified the effects of each variant using a linear regression model and found positional clusters of variant effects (Figs. 5c–d, Extended Data Fig. 7). For example, CRE seq68781_R, located near the FOXK1 gene, contains ETS and POU5F1::SOX2 motifs, and mutations in these motifs significantly decreased its activity (Fig. 5c).
Analyzing ATAC-seq and H3K27ac activities, we observed noisy signals and no clusters of variant effects around putative TF motifs in any CREs (Figs. 5c–d, Extended Data Fig. 7). This could be due to e2MPRA directly using read counts of amplified variants without using barcode counts, thus having weaker statistical power. To overcome this, we introduced 6-bp randomization with 1-bp sliding windows for each CRE (two randomizations per window, total N=190 per CRE; Fig. 5b). To quantify the impact of these perturbations, we calculated the perturbation effect at each position (k) as the median epigenetic activity across all sequences affected by perturbation at position k, then computed the median absolute deviation (MAD) score using wild-type (WT) enhancer activity as the baseline (set to 0). Finally, we applied a modified Canny edge detection algorithm to the position vs. MAD score data, smoothing and identifying regions with sharp epigenetic activity changes upon perturbation as significant functional peaks (Methods). This approach found that disruptions of POU5F1::SOX2 and ETS motifs in seq68781_R (Fig. 5e) significantly decreased ATAC-seq and H3K27ac activities, consistent with the single-nucleotide mutation results (Fig. 5c). In the previously characterized POU5F1 enhancer21 (POU5F1_DE_core), we found that disruption of the POU5F1::SOX2 motif decreased both regulatory activity and ATAC signals (Fig. 5f). In addition, while the disruption of the YY1 motif increased regulatory activity, it reduced ATAC activity (no significance for H3K27ac), suggesting that the transcriptional repression by YY1 is associated with chromatin opening mediated by its own binding. It was previously shown that both POU5F1::SOX2 and YY1 motifs are conserved among species and required for enhancer activity21.
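The window design and the per-position summary described above can be sketched as follows. Two points are assumptions made for this example: each base in a randomized window is forced to differ from the wild-type base, and the per-position effect is reported as the median activity, relative to wild type, over all variants whose window covers that position. The MAD scoring, Gaussian smoothing and modified Canny edge detection steps are not reproduced.

```python
import random
from statistics import median

WINDOW = 6  # length of the randomized segment in bp


def window_variants(seq, rng):
    """Return (start, variant_sequence) for every 6-bp window, shifted by 1 bp."""
    variants = []
    for start in range(len(seq) - WINDOW + 1):
        mutated = "".join(
            rng.choice([base for base in "ACGT" if base != ref])
            for ref in seq[start:start + WINDOW]
        )
        variants.append((start, seq[:start] + mutated + seq[start + WINDOW:]))
    return variants


def positional_effect(observations, seq_len, wt_activity):
    """Median activity change at each position k.

    observations: list of (window_start, log2_activity) pairs (hypothetical input).
    A variant contributes to position k if its window covers k.
    """
    effects = []
    for k in range(seq_len):
        covering = [activity - wt_activity
                    for start, activity in observations
                    if start <= k < start + WINDOW]
        effects.append(median(covering) if covering else None)
    return effects
```

For a 100-bp CRE, `window_variants` returns 95 variants per call, so two independent calls give the 190 variants per CRE stated above.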
Overall, our results demonstrate that e2MPRA effectively enables high-resolution identification of functional motifs, providing a powerful framework for systematic dissection of regulatory logic.
Discussion
In this study, we used lentiMPRA both for testing the function of cCREs and for the enrichment of specific cCREs within the genome (i.e. allowing for many copies of a specific cCRE to be integrated into the genome). In-genome enrichment of these cCREs enabled us to use CUT&Tag and ATAC-seq to characterize the proteins that bind to these sequences and their epigenetic and nucleosome profile. This approach allowed us to: 1) develop a novel technology that tests regulatory sequences for their function coupled with TF binding and epigenetic modification; 2) generate a novel catalog of regulatory elements that are characterized not only by their transcriptional activity but also via open chromatin and epigenetic modifications; 3) obtain a better understanding of regulatory grammar and how its alteration can lead to phenotypic consequences.
Our pilot library showcased the feasibility of this approach. While we observed a moderate correlation between replicates for ATAC-seq and H3K27ac CUT&Tag (Extended Data Fig. 2f) which is due to the use of UMI counts, summing up these counts across replicates showed the expected results for negative and positive controls. They also showed a good correlation with our lentiMPRA results. Factors that could drive differences are the way each method measures activity, i.e. lentiMPRA using an average of multiple barcodes, versus UMI counts. In addition, there could be inherent differences between regulatory activity and epigenetic mark changes and nucleosome opening. For example, ATAC-seq marks open chromatin regions that could also be due to the binding of repressive factors. Further development of experimental and/or computational methods that can improve the quantification of ATAC-seq and CUT&Tag and other biochemical assays for e2MPRA could overcome these limitations.
We applied e2MPRA to evaluate synthetic enhancers containing defined TF motifs and demonstrated its effectiveness in measuring epigenetic activity. We showed that CEBPA, FOXA1, HNF1A, and XBP1 motifs increased transcriptional activity, consistent with previous results11,14 and their known roles as transcriptional activators. In addition, HNF1A and ONECUT1, previously characterized as pioneer factors16, were shown to increase chromatin accessibility, further validating the ability of e2MPRA to disentangle epigenetic regulation from transcriptional activation, which conventional MPRAs alone cannot achieve. Moreover, e2MPRA enabled the detection of changes in chromatin accessibility and H3K27ac modification mediated by PPARA, likely through its interactions with chromatin remodeling factors. The ability to dissect distinct molecular functions of individual TFs underscores the utility of e2MPRA in elucidating the regulatory grammar of gene expression.
We further demonstrate that e2MPRA is a powerful tool to characterize the effect of variants on epigenetic modifications. Analysis of four regions that interact with POU5F1 and SOX2 showed that mutations in the POU5F1::SOX2 motif not only alter transcriptional activity but also impact chromatin accessibility and H3K27ac modification. This fits with pluripotent factors being known to function as pioneer factors, with SOX2 in particular being reported to interact with the histone acetyltransferase p30023. In addition, e2MPRA allowed the identification of motifs responsible for epigenetic function in enhancers, e.g. the YY1 motif in the POU5F1 distal enhancer, which requires chromatin accessibility but negatively regulates transcription. The “CR4-C” region24 in this enhancer contains the YY1 motif and is required for POU5F1 expression. It is reported that YY1 directly binds to this region (ENCODE, ENCFF509GYP) in human ESCs, and plays a role in regulating pluripotency by directly interacting with OCT4 and BAF chromatin remodeling factors in mouse ESCs25.
It is worth noting that e2MPRA also has several caveats and technical limitations. First, as cCREs are placed upstream of a promoter in the vector, it lacks the endogenous genomic context, such as enhancer–promoter looping and chromatin 3D architecture. Second, e2MPRA uses an artificial minimal promoter instead of the cognate promoter. Therefore, e2MPRA cannot detect epigenetic functions that are associated with 3D chromatin architecture or specific promoter-enhancer compatibilities. Due to the detection sensitivity, e2MPRA currently requires shorter CREs (approximately 100 bp), making it unsuitable for the analysis of longer elements such as super-enhancers. In addition, as this was a proof-of-concept study, we were more conservative and used a small number of sequences in these assays; however, this technology could be scaled up by increasing library size, cell numbers and number of viral integrations per cell. Further development of this technology could overcome many of these limitations. In summary, our study demonstrates the power and feasibility of this approach, which enables high-resolution characterization of regulatory activity along with epigenetic function for a large number of sequences, providing valuable insights for discovering functional elements and identifying disease-associated variants.
Methods
MPRA library design
HepG2 pilot library.
To design the pilot library, we selected the sequence features listed in Fig. 2a. One hundred active CREs were selected based on their ranking among the top sequences in our previous lentiMPRA in HepG210 and their overlap with H3K27ac peaks (ENCODE data: ENCFF001SWK). Fifty inactive genomic sequences were selected as the bottom 50 sequences. For active and inactive synthetic enhancers, we selected the top 50 and bottom 50 sequences based on MPRA signals from the library used in Smith et al. (2013), which consists of combinations of 12 active TF motifs. Fifty random genomic sequences were selected from genomic sequences that do not overlap with H3K27ac marks, H3K27me3 marks, or RepeatMasker annotations. Scrambled sequences were generated by randomizing the nucleotides of the 100 active genomic sequences. Each CRE was 200 bp in length and flanked by 15 bp adaptor sequences (5’-AGGACCGGATCAACT and CATTGCGTGAACCGA-3’).
To generate the 100 bp and 150 bp libraries, the initial 200 bp CRE sequences were cropped around the center of each sequence and flanked with a different pair of adaptor sequences (5’-AATGCTAGCGCATGG and CTGCAACCTACGGAA-3’). We tested the 100 bp, 150 bp and 200 bp libraries with e2MPRA. However, we could not obtain efficient PCR amplification following ATAC or CUT&Tag for the 150 bp and 200 bp libraries (data not shown). Therefore, we used the 100 bp library in the following experiments.
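The cropping step can be illustrated with a minimal Python sketch. The adaptor sequences are those quoted above; the function names, the toy input sequence, and the rule for splitting an odd flank are assumptions made for illustration only.

```python
# Adaptor pair quoted in the text for the cropped libraries.
ADAPTOR_5 = "AATGCTAGCGCATGG"
ADAPTOR_3 = "CTGCAACCTACGGAA"


def crop_center(seq: str, length: int) -> str:
    """Return the central `length` bases of `seq`.

    If the flanking bases cannot be split evenly, the extra base is
    trimmed from the 3' side (an arbitrary choice for this sketch).
    """
    if length > len(seq):
        raise ValueError("target length exceeds sequence length")
    start = (len(seq) - length) // 2
    return seq[start:start + length]


def build_oligo(cre: str, length: int) -> str:
    """Crop a CRE to `length` bp and flank it with the two adaptors."""
    return ADAPTOR_5 + crop_center(cre, length) + ADAPTOR_3


# Toy 200 bp sequence (hypothetical, not from the library).
cre_200 = "ACGT" * 50
oligo_100 = build_oligo(cre_200, 100)
oligo_150 = build_oligo(cre_200, 150)
```

With 15 bp adaptors on each side, a 100 bp CRE yields a 130 nt oligonucleotide and a 150 bp CRE yields a 180 nt oligonucleotide.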
HepG2 synthetic enhancer library.
Seven TF binding motifs (CEBPA, FOXA1, HNF1A, NR2F2, ONECUT1, PPARA, and XBP1) used in Smith et al. (2013)11 were selected based on their correlation significance with transcriptional activity. In addition, we included the CTCF and REST sites used in Georgakopoulos-Soares et al. (2022)14, resulting in a total of nine TF motifs in this library. These motifs were arranged in various combinations (Class 1, 2, 3; Fig. 3a) on two neutral DNA templates (hg19 chr9:83712634–83712733 and chr2:211153273–211153372), which were the same as those used in the previous study but cropped 100 bp from the center. In addition to these synthetic enhancer sequences, we also included 200 control sequences selected from four sequence features of the pilot library based on lentiMPRA activity: random genomic sequences, scrambled sequences, and active/inactive genomic sequences (50 each). These control sequences were used only for TMM normalization.
WTC11 enhancer perturbation library.
To select enhancer candidates for perturbation in the WTC11 iPSC line, we first collected POU5F1 and SOX2 peaks from TF ChIP-seq data and H3K27ac ChIP-seq peaks from ChIP-Atlas26 (last accessed: 2023-05-11). These peaks were overlapped with functional WTC11 CREs identified previously10, and the intersected CREs were scanned for POU5F1:SOX2 binding motifs using FIMO27 ver. 5.5.1, resulting in 1000 candidates. Among these candidates, we selected five active sequences (seq6846_R, seq68781_R, seq5934_F, POU5F1_DE_core, NANOG_p) located within 1 Mb of genes of interest and four inactive sequences (seq16767_R, seq34039_F, seq34899_R, seq2846_R). The inactive sequences were included to identify potential variants that relieve repressive elements and lead to their activation. However, their MPRA activity scores were too weak, and they were therefore excluded from subsequent analyses (data not shown). We trimmed these sequences to 100 bp, ensuring that the POU5F1:SOX2 motif was included. To generate the single-nucleotide substitution subset, we created sequences in which the nucleotide at every position of the target sequence was replaced with each of the other three nucleotides. To generate the 6-bp window perturbation subset, we created sequences in which random mutations were introduced within 6-bp windows, shifting by 1 bp at a time. To prevent the emergence of TF motifs due to random mutations, we extracted the mutated window together with the 6-bp regions upstream and downstream of it (totaling 18 bp) and matched this region against known binding motifs using FIMO. If known motifs were detected, the introduced mutation was rejected and a new mutation was generated, repeating until no known motif was detected. The mutation process was repeated twice, and both replicates were included in the library. We also included the same 200 control sequences from the HepG2 library, which were again used only for TMM normalization.
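The two perturbation subsets can be sketched as follows. This is an illustrative outline, not the authors' script: it assumes that every base inside a 6-bp window is replaced by a different base, replaces the FIMO scan with a hypothetical `creates_motif` callback, and uses a bounded retry loop in place of recursion.

```python
import random

BASES = "ACGT"


def single_nucleotide_variants(seq):
    """Return (position, ref, alt, variant_seq) for all 3 x len(seq) substitutions."""
    variants = []
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                variants.append((i, ref, alt, seq[:i] + alt + seq[i + 1:]))
    return variants


def window_perturbations(seq, window=6, flank=6, creates_motif=None,
                         seed=0, max_tries=100):
    """Mutate each sliding window, retrying when the callback flags a new motif.

    `creates_motif` is a hypothetical stand-in for a motif scan; it receives
    the mutated window plus `flank` bases on each side (18 bp by default).
    """
    rng = random.Random(seed)
    results = []
    for start in range(len(seq) - window + 1):
        for _ in range(max_tries):
            mutated = "".join(
                rng.choice([b for b in BASES if b != ref])
                for ref in seq[start:start + window]
            )
            variant = seq[:start] + mutated + seq[start + window:]
            context = variant[max(0, start - flank):start + window + flank]
            if creates_motif is None or not creates_motif(context):
                results.append((start, variant))
                break
        else:
            raise RuntimeError(f"no acceptable mutation for window at {start}")
    return results


# Toy 100 bp wild-type sequence (hypothetical).
wild_type = "ACGT" * 25
snvs = single_nucleotide_variants(wild_type)
windows = window_perturbations(wild_type)
```

For a 100 bp CRE this gives 300 single-nucleotide variants and 95 window positions; calling `window_perturbations` with two different seeds would mimic the two mutation replicates described above.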
Generation of MPRA libraries
The MPRA libraries were generated as previously described12, with minor modifications. In brief, all library sequences were synthesized as a Twist oligonucleotide pool with adaptor sequences at both ends (adaptor 02 for the pilot library and the WTC11 library; adaptor 03 for the HepG2 library; Extended Data Fig. 1a). The oligonucleotide pool was amplified by two rounds of PCR using NEBNext High-Fidelity 2X PCR Master Mix (NEB; Extended Data Fig. 1b). The first-round PCR was performed for 10 cycles using the primer set 5BC-AG02/03-f01 and 5BC-AG02/03-r01 (Supplementary Table 11) to attach a minimal promoter. The second-round PCR was performed for 10 cycles using the primer set 5BC-AG-f02 and 5BC-AG-r02 (Supplementary Table 11) to attach 15-bp random barcodes for lentiMPRA. For both PCR reactions, the following cycling program was used: 98°C for 2 min; 10 cycles of 98°C for 15 s, 60°C for 20 s, and 72°C for 30 s; followed by a final extension at 72°C for 5 min. The amplified fragments were then inserted into the SbfI/AgeI site of the pLS-SceI vector (Addgene, #137725) using the NEBuilder HiFi DNA Assembly mix (NEB). The recombinant products were transformed into electrocompetent cells (NEB) following the manufacturer’s protocol and incubated overnight at 37°C on 15-cm LB agar plates with 100 μL of 100 mg/mL carbenicillin (Nacalai). We collected approximately 80,000 colonies for the pilot library, 1.3 million colonies for the HepG2 synthetic enhancer library, and 0.9 million colonies for the WTC11 enhancer perturbation library, aiming to obtain on average 200 barcodes per CRE. To determine the association between CREs and random barcodes, CRE-mP-barcode fragments were PCR-amplified from each plasmid library pool, and P5 and P7 flowcell adaptors were attached using P5-pLSmP-ass-i# and P7-pLSmP-ass-gfp (98°C for 1 min; 15 cycles of 98°C for 15 s, 60°C for 20 s, and 72°C for 3 min; 72°C for 5 min) (Supplementary Table 11).
The fragments were then sequenced on an iSeq (the pilot library) or with a NextSeq Mid output 300-cycle kit (the HepG2 and WTC11 libraries) using custom primers (Read 1: pLSmP-ass02/03-seq-R1; Index read1: pLSmP-ass-seq-ind1; Index read2: pLSmP-rand-ind2; Read 2: pLSmP-ass-seq02/03-R2, Supplementary Table 11).
Lentivirus packaging and titration
Lentivirus packaging, titration, and infection were performed as previously described12, with minor modifications. Lentivirus for each plasmid library was produced by co-transfecting it with the helper plasmids pMD2.G (Addgene, #12259) and psPAX2 (Addgene, #12260) into HEK293T cells cultured in T175 flasks using the EndoFectin Lenti transfection reagent (GeneCopoeia), according to the manufacturer’s protocol. At 8 hours post-transfection, the culture medium was refreshed, and ViralBoost reagent (Alstem) was added. At 48 hours after adding the ViralBoost reagent, the culture supernatant was collected, filtered through a 0.45-μm PES filter unit, concentrated 50-fold using the Lenti-X concentrator (Takara), and stored at 4°C for up to three weeks. The purified lentivirus was then used for titration. For the pilot library and the HepG2 library, HepG2 cells were plated at 1 million cells per well in 6-well plates and incubated for 24 hours. Then, serial volumes (0, 4, 8, 16, 32, 64 μL) of the lentivirus were added along with 8 μg/mL Polybrene. For the WTC11 library, WTC11 cells (P59) were plated in 6-well plates at 20% confluency and incubated for 24 hours. Then, serial volumes (16, 32 μL) of the lentivirus were added along with an equal volume of ViroMag reagent (OZ Biosciences). All infected cells were cultured for three days and then washed three times with PBS. Genomic DNA was extracted using the Wizard SV Genomic DNA Purification Kit (Promega). The multiplicity of infection (MOI) was measured as the relative amount of viral DNA (WPRE region, primer set: WPRE.F and WPRE.R; backbone, primer set: BB.F and BB.R; Supplementary Table 11) to genomic DNA (intronic region of the LIPC gene, primer set: LP34.F and LP34.R, Supplementary Table 11) by qPCR using the Thunderbird SYBR qPCR Mix (Toyobo), according to the manufacturer’s protocol.
Lentiviral infections and cell cultures
For the pilot library and the HepG2 library, 4 million or 2 million cells per replicate (the pilot library: 1 replicate; the HepG2 library: 3 replicates) were seeded in 2 wells of a 6-well plate and incubated for 24 hours. The cells were then infected with the lentiviral libraries along with 8 μg/mL Polybrene, with an estimated MOI of 50. For the WTC11 library, 5 million cells per replicate (3 replicates) were seeded in a 10-cm dish and incubated for 24 hours. The cells were then infected with 160 μL of the lentiviral library along with 80 μL of ViroMag reagent, with an estimated MOI of 5. To eliminate residual viral DNA within the cells, infected cells were passaged twice after infection and maintained for a total of 8–10 days. EGFP fluorescence was observed to confirm that the infected library remained unsilenced after passaging. The cells were collected and cryopreserved at −80°C using 100 μL of BamBanker (Nippon Genetics) per 1 million cells. These cryopreserved cells were used for downstream assays.
gDNA/RNA extraction and DNA/RNA barcode sequencing for lentiMPRA assay
The lentiMPRA assay was performed as previously described12, with minor modifications. In brief, the frozen cells (pilot library: 1 million cells; HepG2 library: 2.5 million cells; WTC11 library: 10 million cells per replicate) were washed twice with PBS and lysed in RLT Plus buffer (pilot library: 100 μL; HepG2 library: 200 μL; WTC11 library: 1 mL; QIAGEN). Genomic DNA and total RNA were extracted from the lysate using the AllPrep DNA/RNA Mini Kit (QIAGEN). To prevent genomic DNA contamination, the Turbo DNA-free Kit (Thermo Fisher) was used for further purification of the RNA solution. Then, the entire purified RNA was used for reverse transcription to generate cDNA using SuperScript IV (Invitrogen). To prepare the sequencing library, 1.5 μg (pilot and HepG2 libraries) or 6 μg (WTC11 library) of genomic DNA and the entire RT product were PCR-amplified for 3 cycles (primer: P5-pLSmP-5bc-i#/P7-pLSmP-ass16UMI-gfp, Supplementary Table 11; cycling program: 98°C for 1 min; 3 cycles of 98°C for 10 s, 60°C for 30 s, and 72°C for 1 min; 72°C for 5 min) to attach P5 and P7 flow cell adaptors and UMIs to the sequencing library. The amplicons were further amplified for 18 cycles using the primer set P5/P7. The DNA and RNA barcode libraries were pooled at a molar ratio of 1:3 and sequenced on an iSeq (the pilot library) or with a NextSeq High output 75-cycle kit (the HepG2 and WTC11 libraries) with custom primers (Read 1: pLSmP-ass-seq-ind1, Read 2: pLSmP-bc-seq, i7 index: pLSmP-UMI-seq, i5 index: pLSmP-5bc-seq-R2) (Supplementary Table 11).
ATAC and CUT&Tag for e2MPRA assay
ATAC-seq28 (Illumina Tagment DNA Enzyme and Buffer Small Kit; Illumina) and CUT&Tag (Hyperactive In-Situ ChIP Library Prep Kit for Illumina; Vazyme) were performed on frozen cells following the manufacturer’s protocol with minor modifications. For both assays, we used 500,000 cells (pilot and HepG2 libraries) or 2 million cells (WTC11 library). The following procedure was performed for 500,000 cells. For 2 million cells, all reagent volumes were increased fourfold accordingly. For the ATAC assay, the cells were washed with PBS and suspended in 50 μL of lysis buffer for nuclei isolation. The isolated nuclei were then resuspended in 50 μL of transposition mix per 500,000 cells, incubated at 37°C for 30 minutes, and purified using the MinElute Reaction Cleanup Kit (QIAGEN).
For the CUT&Tag assay, the cells were washed with 250 μL of wash buffer, then resuspended in 50 μL of wash buffer and 5 μL of ConA bead solution. After treatment with ConA beads, the cells were washed again and resuspended in 50 μL of antibody buffer, followed by the addition of 0.5 μL of primary antibody and 0.5 μL of secondary antibody. We used H3K27ac (Abcam: ab4729) as the primary antibody and goat anti-rabbit IgG (Abcam: ab6702) as the secondary antibody. The primary antibody incubation was performed overnight at 4°C, while the secondary antibody was incubated for 2 hours at room temperature. After incubation, the solution was washed and resuspended in 150 μL of the tagmentation buffer, then incubated at 37°C for 1 hour. The tagmented DNA was extracted by stopping the reaction with 5 μL of 0.5 M EDTA, 1.5 μL of 10% SDS, and 1.25 μL of Proteinase K, followed by phenol/chloroform extraction and ethanol precipitation.
The purified Tn5-cleaved products from both assays underwent size selection using 0.65×/1.2× SPRIselect (Beckman Coulter) to remove genomic DNA contamination. To prepare the sequencing library, half of the total products were amplified for 3 cycles in the first-round PCR (primers: P5-ctMPRA-amp02/03-i# and P7-ctMPRA-amp02/03-UMI (Supplementary Table 11; Extended Data Fig. 1c); cycling program: 98°C for 1 min; 3 cycles of 98°C for 10 s, 68°C for 30 s, and 72°C for 1 min; final extension at 72°C for 5 min) to attach P5 and P7 flow cell adaptors and UMIs. The amplicons were then purified using 1.0× AMPure XP (Beckman Coulter). qPCR was performed using the purified product, and the cycle number at which the reaction reached 80% of the plateau was determined as the cycle number for the second-round PCR. The second-round PCR was performed for 17–24 cycles using the primer set P5/P7. To estimate the count of each CRE inserted into genomic DNA, we used 1 μg (the pilot library and the HepG2 library) or 4 μg (the WTC11 library) of gDNA obtained from the lentiMPRA AllPrep extraction and performed the same PCR amplification to construct an additional sequencing library. The sequencing libraries were pooled and sequenced on an iSeq (the pilot library) or a NextSeq Mid output 150-cycle kit flow cell (the HepG2 and WTC11 libraries) with custom primers (Read 1: pLSmP-ass-seq02/03-R1, Read 2: pLSmP-ass-seq02/03-R2, i7 index: ctMPRA-seq02/03-UMI, i5 index: ctMPRA-seq02/03-idx; Supplementary Table 11; Extended Data Fig. 1d) using the following setting: 72+72+15+8 bp.
Sequencing data processing pipeline
lentiMPRA data processing.
The analysis of CRE-barcode associations and DNA/RNA barcode sequencing data was performed using MPRAflow v2.3.512 with minor modifications. For CRE-barcode association data, FASTQ files were generated using bcl2fastq with the following parameters: “--no-lane-splitting --minimum-trimmed-read-length 0 --mask-short-adapter-reads 0 --use-bases-mask Y145n,Y15,I10,Y145n”. These FASTQ files were used as input for “association.nf” in MPRAflow, aligning reads to the designed library sequences with the following parameters: “--min-cov 3 --mapq -1 --cigar 100M”. Since our libraries, particularly the WTC11 library, were highly sensitive to single-nucleotide mismatches, we modified the alignment strategy by replacing “bwa mem” with “bwa aln -n 0” to exclude all reads containing any mismatches. For DNA/RNA barcode quantification, FASTQ files were generated from barcode sequencing data using bcl2fastq with the following settings: “--use-bases-mask Y15n,Y16,I10,Y15n”. These FASTQ files were then processed with “count.nf” in MPRAflow using the parameters: “--bc-length 15 --umi-length 16 --thresh 5”. The resulting DNA/RNA barcode count files for each replicate were subsequently used in downstream analyses.
ATAC and CUT&Tag data processing.
For ATAC and CUT&Tag sequencing data, FASTQ files were generated using bcl2fastq with the following parameters: “--no-lane-splitting --minimum-trimmed-read-length 0 --mask-short-adapter-reads 0 --use-bases-mask Y72,Y15,I8,Y72”. This command generated paired-end reads of CREs as R1 and R3, and UMI reads as R2. Using these FASTQ files, we first associated the CREs with UMIs using fastp v0.23.229 with the following parameters: “-i $R1/$R3 -I $R2 --umi --umi_loc=read2 --umi_len=15 -w 1 -Q -A -L -G -u 100 -n $UMI_LENGTH -Y 100”. This command appended the UMI to the end of the first part of the read header for R1/R3. Next, a consensus sequence was generated from the paired-end reads using fastq-join v1.3.130. All consensus sequences were then aligned to the designed sequences using BWA31 with the parameter: “bwa aln -n 0”. The resulting BAM file was processed to remove PCR duplicates using UMI-tools v1.1.432 with the following parameters: “dedup --per-gene --per-contig --umi-separator=“:””. Finally, UMIs per CRE were counted from the deduplicated BAM file using samtools33 idxstats (v1.11). This process was performed separately for each replicate.
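The outcome of the deduplication and counting steps can be approximated by counting distinct UMIs per CRE. The sketch below is a simplification: it collapses only exact UMI matches, whereas UMI-tools by default also merges UMIs that differ by sequencing errors, and the input format (pairs of CRE name and UMI) and all names are assumptions for illustration.

```python
from collections import defaultdict


def count_umis_per_cre(alignments):
    """Count distinct UMIs per CRE from (cre_name, umi) pairs.

    Exact-match collapsing only; error-aware UMI clustering is not modeled.
    """
    umis = defaultdict(set)
    for cre, umi in alignments:
        umis[cre].add(umi)
    return {cre: len(seen) for cre, seen in umis.items()}


# Toy alignments (hypothetical CRE names and UMIs).
toy = [
    ("cre_A", "ACGTACGTACGTACG"),
    ("cre_A", "ACGTACGTACGTACG"),  # PCR duplicate, counted once
    ("cre_A", "TTTTACGTACGTACG"),
    ("cre_B", "ACGTACGTACGTACG"),
]
umi_counts = count_umis_per_cre(toy)
```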
Replicates, normalization and activity scores.
Activity scores were calculated by dividing the enriched counts of CREs (or barcodes) from each assay by the counts detected from genomic DNA. For lentiMPRA data, the activity score for each CRE was calculated as the ratio of “rna_count” to “dna_count” of barcodes. For ATAC and CUT&Tag data, activity scores were determined by dividing the number of UMIs per CRE enriched by the number of UMIs per CRE amplified from genomic DNA. To ensure comparability of activity scores across different assays, replicates, and libraries, we first performed counts per million (CPM) normalization. We then calculated a scaling factor f using the trimmed mean of M-values (TMM) normalization method34, based on the enriched counts and genomic counts of random genomic sequences, which served as a negative control. We assumed that these sequences exhibited no variability between the enriched counts from assays and the genomic counts. All enriched counts for the library sequences were then divided by the calculated scaling factor f. This normalization process ensured that the activity scores were adjusted so that the expected log2(Activity) of the negative control sequences was 0 across all samples. To combine the data from all three replicates, we followed the method described by Wang et al. (2024)35, which first summarizes the enriched and genomic counts within each replicate and then calculates the activity score by dividing the enriched count by the genomic count. For regression analysis, we used the activity scores from each replicate.
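The normalization can be sketched in plain Python as follows. This is a simplified stand-in for TMM: it trims only on the log-ratio (M) values of the negative controls and omits the A-value trimming and precision weights of the published method. The trim fraction, function names, and toy counts are assumptions for illustration.

```python
import math


def cpm(counts):
    """Counts-per-million normalization of a {name: count} dict."""
    total = sum(counts.values())
    return {k: v * 1e6 / total for k, v in counts.items()}


def trimmed_mean(values, trim=0.3):
    """Mean after discarding a `trim` fraction from each tail."""
    vals = sorted(values)
    k = int(len(vals) * trim)
    kept = vals[k:len(vals) - k] or vals
    return sum(kept) / len(kept)


def scaling_factor(enriched_cpm, genomic_cpm, controls, trim=0.3):
    """Scaling factor f from negative-control log ratios (simplified TMM)."""
    m_values = [
        math.log2(enriched_cpm[c] / genomic_cpm[c])
        for c in controls
        if enriched_cpm.get(c, 0) > 0 and genomic_cpm.get(c, 0) > 0
    ]
    return 2 ** trimmed_mean(m_values, trim)


def log2_activity(enriched, genomic, controls):
    """log2(enriched CPM / f / genomic CPM) for every CRE with nonzero counts."""
    e, g = cpm(enriched), cpm(genomic)
    f = scaling_factor(e, g, controls)
    return {
        k: math.log2(e[k] / f / g[k])
        for k in e
        if e[k] > 0 and g.get(k, 0) > 0
    }


# Toy counts (hypothetical): four negative controls and one test CRE.
genomic = {"c1": 100, "c2": 100, "c3": 100, "c4": 100, "x": 100}
enriched = {"c1": 50, "c2": 50, "c3": 50, "c4": 50, "x": 200}
scores = log2_activity(enriched, genomic, ["c1", "c2", "c3", "c4"])
```

In this toy example the negative controls end up at log2(Activity) = 0 and the test CRE at 2, i.e. fourfold above the controls.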
Endogenous activity of CREs in the pilot library
The endogenous activities of CREs from genomic sequences in the pilot library were estimated using whole-genome epigenetic assay data from ENCODE36. We followed our previous method (Agarwal et al., 2025) to obtain and compute the consensus signal from these data. In brief, we extracted three bigWig files for H3K27ac ChIP-seq data (ENCFF084DIM, ENCFF515WSE, ENCFF759SNY) and five bigWig files for ATAC-seq data (ENCFF622FRD, ENCFF024GLW, ENCFF240VVR, ENCFF782GKX, ENCFF029XKY) of HepG2 cells from the hg38 assembly in ENCODE. For each CRE, we first extended the original 100 bp genomic region to 500 bp to mitigate positional biases and then calculated the mean bigWig signal from the corresponding genomic region using bigWigAverageOverBed37. All data were log-transformed, and multiple replicates corresponding to the same CRE were averaged to compute the consensus signal for each assay. These signals were then compared with the log2(Activity) of e2MPRA.
Analysis of HepG2 synthetic enhancer library
All analyses were performed using in-house Python scripts based on statsmodels38 v0.14.2 and scikit-learn39 v1.2.2. The difference between the two template types was considered as an additional replicate, resulting in a total of 2×3 replicates for each sequence. Of the nine TF motifs used, sequences containing the NR2F2 motif were either absent or present at very low counts in the genomic DNA counts (Supplementary Table 7); therefore, these sequences were excluded from the analyses.
To analyze the homotypic amplification of each epigenetic activity score, we calculated Spearman’s correlation and its P-value between log2(Activity) and the count of motifs for each motif and assay. All P-values were adjusted using the Benjamini-Hochberg method40 (FDR=0.05).
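The rank correlation and the multiple-testing adjustment can be illustrated with a dependency-free sketch. The authors used statsmodels and scikit-learn; the functions below are plain-Python equivalents written for illustration, and the toy data are hypothetical. P-value computation for the correlation itself is omitted.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with ties assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = n - offset  # rank of this P-value in ascending order
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data (hypothetical): motif copy number versus log2(Activity).
motif_counts = [0, 1, 2, 3, 4]
activity = [0.1, 0.4, 0.9, 1.5, 3.0]
rho = spearman_rho(motif_counts, activity)
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
```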
To evaluate potential synergistic effects between two TFs, we performed pairwise multiple linear regression analysis for each TF pair. The activity score was modeled as a function of the TF motifs present in each sequence. The observed log2(Activity) of Class 1 and Class 2 sequences was used as the response variable, while the individual TF motif count was used as the predictor variable. Additionally, we introduced an interaction term k, defined as: k = {0 | Class 1 , 1 or 2 | Class 2}. The regression model was expressed as follows:
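One way to write a model consistent with these definitions is given below. The symbols are assumptions introduced here for illustration and are not the authors' notation: $N_A$ and $N_B$ are the copy numbers of the two motifs in a sequence, $k$ is the interaction term defined above, and $\beta_k$ is the coefficient whose significance is tested.

```latex
\log_2(\mathrm{Activity}) = \beta_0 + \beta_A N_A + \beta_B N_B + \beta_k k + \varepsilon
```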
Multiple linear regression was performed for all TF pairs, and the statistical significance of the coefficient of k was assessed by applying multiple comparison corrections to the P-values. Synergistic effects were considered significant if the adjusted P-values met the predefined significance threshold (FDR = 0.01).
For Class 3 sequences, we first tested whether changes in TF motif order significantly affected epigenetic activity using one-way analysis of variance (ANOVA). The analysis was performed across all 70 unique motif sets generated by selecting 4 TF motifs out of 8 (8C4 = 70), with each set comprising 24 permutations (4! = 24). For each motif set, one-way ANOVA was conducted using the activity scores of the 24 permutations as input. Statistical significance was determined from ANOVA P-values, followed by Benjamini–Hochberg correction.
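The combinatorics of the Class 3 design and the ANOVA F statistic can be sketched as follows. The list of eight motifs assumes that the set consists of the nine library motifs minus NR2F2, which the text implies but does not state for this analysis; the F statistic is computed by hand here, and the P-value lookup against the F distribution is omitted.

```python
from itertools import combinations, permutations

# Assumed motif set: the nine library motifs without NR2F2.
MOTIFS = ["CEBPA", "FOXA1", "HNF1A", "ONECUT1", "PPARA", "XBP1", "CTCF", "REST"]

motif_sets = list(combinations(MOTIFS, 4))            # 8C4 = 70 sets
orderings = {s: list(permutations(s)) for s in motif_sets}  # 4! = 24 each


def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy groups (hypothetical); in the analysis each group would hold the
# replicate activity scores of one motif ordering.
f_stat = one_way_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

Across all sets this enumerates 70 × 24 = 1,680 ordered arrangements.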
To further investigate how the position of each motif relative to the minimal promoter (mP) influences activity, we conducted a positional enrichment analysis. Specifically, we selected the top 200 and bottom 200 sequences based on transcriptional activity from all permutations and calculated the frequency of each TFBS at each of the four positions (position 1 = farthest from mP, position 4 = closest to mP) as odds ratios. Enrichment was assessed using a hypergeometric test, and P-values were adjusted using the Benjamini–Hochberg method.
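A positional enrichment test of this kind can be sketched with the standard library alone. The choice of background (all permutations), the one-sided upper-tail test, and the odds-ratio definition are assumptions made for illustration; the text does not specify them in detail.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def odds_ratio(k, N, K, n):
    """Odds of the motif at this position inside versus outside the selection."""
    inside_with, inside_without = k, n - k
    outside_with, outside_without = K - k, (N - K) - (n - k)
    if inside_without == 0 or outside_with == 0:
        return float("inf")
    return (inside_with * outside_without) / (inside_without * outside_with)


# Toy numbers (hypothetical): 1,680 permutations in total, 420 of which carry
# a given motif at a given position; 100 of the top 200 sequences carry it.
p_value = hypergeom_upper_tail(100, 1680, 420, 200)
ratio = odds_ratio(100, 1680, 420, 200)
```

The resulting P-values for all motif and position combinations would then be adjusted with the Benjamini–Hochberg method.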
Analysis of WTC11 enhancer perturbation library
Single nucleotide substitution subset analysis.
To infer the effects of single nucleotide variants, we followed previous studies41,42 with minor modifications. We fitted a linear regression model of the form:
where enriched count refers to the number of unique molecular identifiers (UMIs) detected in the RNA, ATAC and CUT&Tag fragments; inserted count refers to the number of UMIs detected in the gDNA, representing the initial insertion frequency; and the remaining predictors are binary indicators for the mutated nucleotides and their positions. The estimated coefficients of these indicators and their P-values were reported as the effects of each variant.
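A model matching this description could be written as below. This is a sketch under stated assumptions rather than the authors' exact formula: the use of log2-transformed counts, the intercept, and all symbols are introduced here for illustration, with $x_v$ equal to 1 if a sequence carries variant $v$ (a specific nucleotide at a specific position) and 0 otherwise, and $\beta_v$ the reported effect of that variant.

```latex
\log_2(\text{enriched count}) = \beta_0 + \beta_{\mathrm{ins}} \log_2(\text{inserted count}) + \sum_{v} \beta_v x_v + \varepsilon
```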
6 bp window perturbation subset analysis.
To identify significant functional sites from the 6 bp window perturbation result, we adapted the Canny edge detection method43 to the one-dimensional data. This simplified Canny method was used to detect regions where perturbations induced abrupt changes in activity scores, identifying these regions as functional sites.
- Positional effect calculation.
For each position, we calculated the positional effect using the median activity score of the six sequences perturbing that position (due to the 6 bp window perturbation design). To assess the statistical significance of the positional effect, we computed the median absolute deviation (MAD) score. The MAD score was calculated using the wild-type (WT) activity score as the reference for the absolute deviation. However, in the ATAC and H3K27ac assays, the number of UMIs for the WT was abnormally higher than that of the variants across all target CREs: around 10,000 UMIs per WT sequence, compared to only hundreds for the variants (Supplementary Table 9). Therefore, instead of using the WT activity score directly, we estimated its reference value by taking the median of the single nucleotide substitution results.
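The positional effect can be illustrated with a short sketch. The median over the windows covering each position follows the description above; the `mad_score` function is only one plausible reading of the MAD score (deviation from the reference divided by the median absolute deviation around it) and is an assumption, as are all names and toy values.

```python
from statistics import median


def positional_effect(window_scores, seq_len, window=6):
    """Median score of all windows covering each position.

    window_scores[s] is the activity score of the variant whose mutated
    window starts at position s. Positions near the ends are covered by
    fewer than `window` windows.
    """
    effects = []
    for i in range(seq_len):
        first = max(0, i - window + 1)
        last = min(i, seq_len - window)
        effects.append(median(window_scores[s] for s in range(first, last + 1)))
    return effects


def mad_score(effects, reference):
    """Assumed scoring: deviation from `reference` in units of the MAD."""
    mad = median(abs(e - reference) for e in effects)
    if mad == 0:
        raise ValueError("MAD is zero; scores are undefined")
    return [(e - reference) / mad for e in effects]


# Toy example (hypothetical): a 10 bp sequence has 5 window start positions.
toy_effects = positional_effect([0, 1, 2, 3, 4], seq_len=10)
toy_scores = mad_score([1, 2, 3, 4, 5], reference=3)
```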
- Edge detection and functional site identification.
The computed MAD scores were smoothed using a Gaussian filter, and their derivatives were obtained for edge detection. Based on these derivatives, we applied the non-maximum suppression (NMS) technique to enhance edge localization. For hysteresis thresholding, we classified the detected edges as follows:
Strong edges: Local maxima identified through NMS and exceeding the median of derivative scores.
Weak edges: Values exceeding the median of derivative scores.
Weak edges connected to strong edges were recursively clustered and reclassified as strong edges.
Through this process, regions with substantial relative changes in activity scores were identified as strong edges.
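The edge detection steps above can be sketched in one dimension as follows. This is an illustrative reimplementation, not the authors' code: the Gaussian width, the central-difference derivative, the handling of the sequence ends, and all function names are assumptions.

```python
import math
from statistics import median


def gaussian_smooth(values, sigma=1.0):
    """Gaussian smoothing with the kernel renormalized at the ends."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    n = len(values)
    smoothed = []
    for i in range(n):
        num = den = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < n:
                num += w * values[j]
                den += w
        smoothed.append(num / den)
    return smoothed


def gradient_magnitude(values):
    """Absolute central difference (one-sided at the ends)."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    grad = []
    for i in range(n):
        lo, hi = max(0, i - 1), min(n - 1, i + 1)
        grad.append(abs(values[hi] - values[lo]) / (hi - lo))
    return grad


def canny_1d(scores, sigma=1.0):
    """Return a boolean list marking strong edges after hysteresis."""
    grad = gradient_magnitude(gaussian_smooth(scores, sigma))
    n = len(grad)
    threshold = median(grad)
    weak = [g > threshold for g in grad]
    # Non-maximum suppression: strong edges are weak edges that are local maxima.
    strong = [
        weak[i]
        and grad[i] >= grad[max(0, i - 1)]
        and grad[i] >= grad[min(n - 1, i + 1)]
        for i in range(n)
    ]
    # Hysteresis: weak edges adjacent to strong edges become strong.
    stack = [i for i in range(n) if strong[i]]
    while stack:
        i = stack.pop()
        for j in (i - 1, i + 1):
            if 0 <= j < n and weak[j] and not strong[j]:
                strong[j] = True
                stack.append(j)
    return strong


# Toy MAD-score profile (hypothetical): an abrupt step between positions 9 and 10.
edges = canny_1d([0.0] * 10 + [5.0] * 10)
```

In the toy profile, the positions around the step are marked as strong edges while the flat region at the start is not.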
- Peak detection and functional site estimation.
To further suppress noise, we selected MAD score peaks based on the extrema of smoothed values, where absolute values exceeded 0.75. Finally, functional site regions were defined as adjacent regions between peak candidates and strong edges.
Extended Data
Extended Data Fig. 1. Sequence scheme of e2MPRA.
a, Structure of the synthesized CRE oligonucleotides. For each of the three libraries, a distinct pair of 15-bp adapter sequences was added to both ends of the CRE. b, Primer sequences and their corresponding binding sites used in the first and second rounds of PCR for library amplification. c, Primer sequences and their binding sites used to amplify CRE fragments following ATAC or CUT&Tag. d, Primer sequences and their binding sites used for sequencing to quantify the amplified CRE fragments.
Extended Data Fig. 2. Comparison of barcode counts, enriched counts, and inserted CRE counts across replicates for the pilot library.
a, Distribution of barcode coverage per CRE in each lentiMPRA replicate. b, Scatter plot showing the correlation of the number of UMIs for DNA and RNA barcodes in lentiMPRA between replicates. This reflects the raw count-level reproducibility of each measurement. Spearman’s ρ is shown in the upper left. c, Scatter plot showing the correlation of log2(RNA barcode count / DNA barcode count) between replicates. Spearman’s correlation coefficient (ρ) is shown in the upper left. d, Distribution of inserted CRE counts (from gDNA) and enriched CRE counts (from ATAC and H3K27ac CUT&Tag) per CRE in each replicate. e, Scatter plot showing the correlation of inserted and enriched CRE counts between replicates. This reflects the raw count-level reproducibility of each measurement. Spearman’s ρ is shown in the upper left. f, Scatter plot showing the correlation of log2(enriched CRE count / inserted CRE count) between replicates. This represents the reproducibility of the normalized epigenetic activity scores. Spearman’s ρ is shown in the upper left.
Extended Data Fig. 3. Comparisons of barcode counts, enriched counts, and inserted CRE counts across replicates for the HepG2 library.
a, Distribution of barcode coverage per CRE in each lentiMPRA replicate. b, Scatter plot showing the correlation of log2(RNA barcode count / DNA barcode count) between replicates. Spearman’s correlation coefficient (ρ) is shown in the upper left. c, Distribution of inserted CRE counts (from gDNA) and enriched CRE counts (from ATAC and H3K27ac CUT&Tag) per CRE in each replicate. d, Scatter plot showing the correlation of log2(enriched CRE count / inserted CRE count) between replicates. This represents the reproducibility of the normalized epigenetic activity scores. Spearman’s ρ is shown in the upper left.
Extended Data Fig. 4. Epigenetic activities measured for Class 1 sequences across all TF binding motifs.
Significant Spearman’s correlations (FDR < 0.05) between TFBS copy number (x-axis) and log2(Activity) (y-axis) are indicated with a red background and non-significant correlations are indicated with a blue background.
Extended Data Fig. 5. Epigenetic activities measured for Class 2 sequences across all TF binding motif pairs.
a, lentiMPRA; b, ATAC-seq; c, H3K27ac CUT&Tag. Transcriptional or epigenetic activities of heterotypic TFBS arrangements (Class 2; two different motifs combined in a 2:2 ratio) were compared to homotypic arrangements (Class 1; four identical motifs) to assess synergistic effects. Statistically significant synergistic interactions (FDR < 0.01) are indicated by a red background and non-significant combinations are shown with a blue background.
Extended Data Fig. 6. Comparisons of barcode counts, enriched counts, and inserted CRE counts across replicates for the WTC11 library.
a, Distribution of barcode coverage per CRE in each lentiMPRA replicate. b, Scatter plot showing the correlation of log2(RNA barcode count / DNA barcode count) between replicates. Spearman’s correlation coefficient (ρ) is shown in the upper left. c, Distribution of inserted CRE counts (from gDNA) and enriched CRE counts (from ATAC and H3K27ac CUT&Tag) per CRE in each replicate. d, Scatter plot showing the correlation of log2(enriched CRE count / inserted CRE count) between replicates. This represents the reproducibility of the normalized epigenetic activity scores. Spearman’s ρ is shown in the upper left.
Extended Data Fig. 7. CRE perturbation analyses for the remaining two CREs not shown in Fig. 5.
a-b, Analysis of single-nucleotide substitution effects on epigenetic activities (lentiMPRA, ATAC, and H3K27ac CUT&Tag) for seq5934_F (a) and NANOG_p (b). Annotated TF motifs22 are shown above each plot. Bar colors represent statistical significance (black: significant at P < 0.01; gray: not significant). c-d, Analysis of 6-bp window perturbation effects for seq5934_F (c) and NANOG_p (d). Positional effects of mutations were quantified as median absolute deviation (MAD) scores at each nucleotide position and smoothed using a Gaussian filter (line plot). Regions identified as significantly affected by mutations are indicated by gray lines along the baseline.
Supplementary Material
Supplementary Table 1. Pilot library design
Supplementary Table 2. HepG2 synthetic enhancer library design
Supplementary Table 3. WTC11 enhancer perturbation library design
Supplementary Table 4. lentiMPRA DNA and RNA counts for the pilot library
Supplementary Table 5. ATAC and CUT&Tag counts for the pilot library
Supplementary Table 6. lentiMPRA DNA and RNA counts for the HepG2 library
Supplementary Table 7. ATAC and CUT&Tag counts for the HepG2 library
Supplementary Table 8. lentiMPRA DNA and RNA counts for the WTC11 library
Supplementary Table 9. ATAC and CUT&Tag counts for the WTC11 library
Supplementary Table 10. Primers used in e2MPRA
Supplementary Table 11. Transcription factor binding motifs used in the HepG2 library
Acknowledgements
This work was supported by the World Premier International Research Center Initiative (WPI), MEXT Japan, MEXT KAKENHI Grant Numbers JP24K02004 (F.I.), 24K18101 (Z.Z.), and AMED under Grant Number JP24gm7010002 (F.I.). This work was funded in part by the National Human Genome Research Institute grant numbers 1R21HG010683 (N.A.), 1UM1HG009408 (N.A.) and 1UM1HG011966 (N.A.). We thank the Single-Cell Genome Information Analysis Core (SignAC) at WPI-ASHBi, Kyoto University, for their support. The WTC11 cell line was kindly provided by Dr. Bruce R. Conklin (The Gladstone Institutes and UCSF).
Footnotes
Code availability
The code used for data processing and analysis in this study is available at: https://github.com/ziczhang/e2MPRA_analysis.
Competing interests
F.I. receives funding from Relation Therapeutics. N.A. is a Cofounder and on the scientific advisory board of Regel Therapeutics Inc. N.A. received funding from BioMarin Pharmaceutical Inc.
Data availability
The e2MPRA sequencing data generated in this study, including association barcode sequencing data and barcode sequencing data for lentiMPRA, as well as ATAC-seq and CUT&Tag-seq data, have been deposited at Zenodo (https://zenodo.org/records/15428846 and https://zenodo.org/records/15469962). Source data are provided with this paper as Supplementary Tables.