This is a preprint.

It has not yet been peer reviewed by a journal.

bioRxiv [Preprint]. 2024 Sep 8:2024.09.06.611737. [Version 1] doi: 10.1101/2024.09.06.611737

Mutagenesis Sensitivity Mapping of Human Enhancers In Vivo

Michael Kosicki 1, Boyang Zhang 2, Anusri Pampari 2,3, Jennifer A Akiyama 1, Ingrid Plajzer-Frick 1, Catherine S Novak 1, Stella Tran 1, Yiwen Zhu 1, Momoe Kato 1, Riana D Hunter 1, Kianna von Maydell 1, Sarah Barton 1, Erik Beckman 1, Anshul Kundaje 2,3, Diane E Dickel 1, Axel Visel 1,4,5,*, Len A Pennacchio 1,5,6,*
PMCID: PMC11398460  PMID: 39282388

Abstract

Distant-acting enhancers are central to human development. However, our limited understanding of their functional sequence features prevents the interpretation of enhancer mutations in disease. Here, we determined the functional sensitivity to mutagenesis of human developmental enhancers in vivo. Focusing on seven enhancers active in the developing brain, heart, limb and face, we created over 1700 transgenic mice for over 260 mutagenized enhancer alleles. Systematic mutation of 12-basepair blocks collectively altered each sequence feature in each enhancer at least once. We show that 69% of all blocks are required for normal in vivo activity, with mutations more commonly resulting in loss (60%) than in gain (9%) of function. Using predictive modeling, we annotated critical nucleotides at base-pair resolution. The vast majority of motifs predicted by these machine learning models (88%) coincided with changes to in vivo function, and the models showed considerable sensitivity, identifying 59% of all functional blocks. Taken together, our results reveal that human enhancers contain a high density of sequence features required for their normal in vivo function and provide a rich resource for further exploration of human enhancer logic.

Introduction

Distant-acting enhancers are critical for regulating gene expression in a tissue-specific manner during mammalian development. Enhancer sequences function by binding transcription factors (TFs), proteins that influence the transcriptional output of the enhancer’s target gene1. Individual TF binding motifs are typically 6–12bp in size1 and most mammalian enhancers are hundreds of basepairs long, containing multiple TF binding sites2–4. The potential TF binding sites within an enhancer can be predicted from DNA sequence2 and TF binding to DNA in a given tissue or cell type can be directly measured using epigenomic methods such as ChIP-seq5. However, given our lack of information on all possible TF binding events, their individual functional contributions, and interactions between bound TFs, we cannot currently predict enhancer activity directly from DNA sequence. This lack of knowledge about the functional underpinnings of enhancers precludes us from predicting how genetic variants affect gene expression.

Enhancer reporter assays offer a way to study the functional relevance of individual subregions or basepairs within an enhancer by coupling wild-type or mutated versions of an enhancer to a reporter gene and measuring the resulting expression. Crucially, these assays allow dissection of enhancer function outside of the enhancer’s endogenous genomic context, where interactions with promoters and other enhancers may confound the readout or even completely mask changes in their individual activity due to enhancer redundancy6,7. Recently improved mouse transgenic engineering approaches have enabled larger-scale, whole-organism, sensitive, and reproducible assessment of regulatory elements and mutation effects in the context of prenatal in vivo development (enSERT)8,9. Changes to spatiotemporal enhancer activity patterns observed in these assays are highly informative of the phenotypic impact of studied mutations on complex processes such as limb or brain development8,10. While other, complementary methods for enhancer perturbation (including massively parallel reporter assays) exist, they tend to rely on in vitro cell culture systems11. Transgenic mouse assays are unique in their ability to reveal the impact of sequence changes within enhancers on their complex spatiotemporal in vivo activity patterns in embryonic development.

In the present study, we applied these recent advances in mouse reporter assay technology at scale to explore the sequence determinants of human developmental enhancer function in vivo. We conducted a complete, systematic mutagenesis mapping of seven human enhancers active during embryonic development and assessed the consequences of mutations for in vivo enhancer activity in mouse transgenic assays. We observed a high density of sites required for correct tissue-specific activity within the enhancers studied, as well as different modes of functional interactions between sites within enhancers. We also trained machine learning models based on chromatin accessibility to predict the binding site motifs within these enhancers and validated them using in vivo transgenic assays. The models identified sequence motifs which coincided to a high degree with functional sites, offering a method to computationally predict nucleotides within enhancers that are likely to affect their in vivo function. Thus, these models are expected to be useful for the interpretation and prioritization of clinically observed variants in enhancers. Taken together, our data reveal a considerable functional complexity of human in vivo enhancers and provide a comprehensive resource for model development and validation.

Results

Large-Scale Block Mutagenesis of Developmental Enhancers

To study how the sequence features within mammalian enhancers relate to their in vivo activity patterns, we selected seven human enhancers that were between 223bp and 431bp long. Each of these enhancers drives strong and highly reproducible activity in transgenic mouse reporter assays at mid-gestation (embryonic day 11.5) in brain (enhancers NEU1–3), heart (enhancers HT1–3), or face and limb (enhancer FL, Figure 1A, Supplementary Table 1)12–18. We divided each enhancer into consecutive 12bp blocks for mutagenesis, corresponding to the average size of individual transcription factor binding sites, without biasing the design towards predicted binding sites (Figure 1B). In total, the seven enhancers encompassed 167 mutagenesis blocks. For each enhancer, we generated a series of transgenic reporter constructs in which all basepairs within one or several blocks were mutated using a transition mutagenesis scheme designed to eliminate any transcription factor binding sites that may be present within the block (A<>G, C<>T; Supplementary Figure 1; Supplementary Note 1; Supplementary Table 2).
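The block design and transition scheme described above can be illustrated with a minimal Python sketch. It is not the construct-design code used in the study: the function names, the toy sequence, and the handling of a final block shorter than 12bp are assumptions made for illustration only.

```python
# Transition substitutions as described in the text: A<>G, C<>T.
TRANSITION = str.maketrans("AGCTagct", "GATCgatc")

BLOCK_SIZE = 12  # block length used in the study


def split_blocks(seq, size=BLOCK_SIZE):
    """Return consecutive (start, end) intervals that tile seq.

    Assumption: a trailing block shorter than `size` is kept as its own block.
    """
    return [(i, min(i + size, len(seq))) for i in range(0, len(seq), size)]


def mutate_blocks(seq, block_indices, size=BLOCK_SIZE):
    """Apply transition mutations to every base in the chosen blocks (0-based)."""
    blocks = split_blocks(seq, size)
    out = list(seq)
    for b in block_indices:
        start, end = blocks[b]
        out[start:end] = seq[start:end].translate(TRANSITION)
    return "".join(out)


# Toy 30bp sequence (hypothetical, not one of the studied enhancers).
toy = "ACGTACGTACGTTTTTGGGGCCCCAAAATT"
mutant = mutate_blocks(toy, [1])  # mutate only the second block
print(mutant)
```

Because each transition is its own inverse, applying `mutate_blocks` twice to the same blocks returns the original sequence, which gives a simple sanity check for a construct list.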

Figure 1. General enhancer properties.

(A) Wild-type pattern of seven enhancers mutagenized in this study (see Supplementary Table 1 for details). (B) Initial screen design. (C) Examples of patterns in mutagenized constructs. (D) Functional annotation of 12bp blocks (N=108; see Supplementary Note 2 for adjustments). (E) Distribution of block mutation outcomes (N=108).

To identify subregions of enhancers not required for in vivo function, we first produced a series of 103 constructs in which between two and nine 12bp blocks had been mutated simultaneously (Figure 1B). Each mutagenized enhancer was coupled to a minimal promoter and LacZ reporter gene and used to generate transgenic mouse embryos using CRISPR-mediated insertion at a safe harbor locus8,9 (enSERT; Methods). We then compared the resulting in vivo activity patterns with those of the wildtype allele of each enhancer (Figure 1C). Overall, 33 of the 103 combinatorial constructs, encompassing 69 of the 167 individual blocks, caused no detectable changes in enhancer activity. The absence of changes could theoretically result from combinatorial compensation between loss- and gain-of-function effects. To exclude this possibility, we also tested 21 of these 69 blocks individually in single-block mutation constructs and observed that none of them altered the enhancer activity. Thus, we tentatively classified all 69 blocks as non-critical for in vivo enhancer activity. To complete the systematic block mutagenesis survey, we assayed the remaining 98 untested blocks individually, finding an additional 25 non-critical blocks for a total of 94 that appeared dispensable for normal enhancer function. Disruption of the remaining 73 blocks resulted in changes in activity. We also performed additional validation of the transition mutagenesis scheme, which resulted in minor adjustments to functional block annotations (Supplementary Note 2; Supplementary Figure 2).

We observed that the peripheral blocks of many enhancers were often not required for function and therefore we defined the functional core of each enhancer by the two outermost blocks whose mutation caused a change in function (Figure 1D). Across the seven enhancers, there were 108 functional core blocks in total. Mutagenesis of 6% of these 108 blocks led to full loss of function, 37% led to major loss, 17% led to minor loss, 9% led to gain of function and no change was observed when mutagenizing 31% (Figure 1E; Supplementary Table 3).

While all seven enhancers contained subregions that caused major changes in activity when mutated, across enhancers we observed notable differences in the proportion of blocks with critical functions and in the types of observed activity changes (Figure 1D). Gain-of-function changes in activity were almost exclusively observed in enhancers FL and NEU1, with 9 of 10 instances located in these two enhancers. This observation suggests that these enhancers contain multiple binding sites for repressive factors that, when mutated, cause de-repression of the enhancer and thereby ectopic activity. Four enhancers (FL, NEU2, HT2, HT3) contained at least one “Achilles’ heel” block that, when mutated, caused a full loss of enhancer function. We also observed substantial differences in the proportion of blocks within an enhancer causing major or full loss of function, ranging from 21% (HT3) to 67% (HT2). Nevertheless, all enhancers contained three or more such blocks.

To investigate whether experimentally observed function agrees with other indicators of DNA function, we examined its relationship with measures of selective constraints in mammalian evolution and in human populations. Blocks that altered in vivo enhancer function showed higher evolutionary conservation across mammals than those whose mutation did not cause activity changes (p<0.05, see Supplementary Figure 3A,B,C). Similarly, enhancers with a higher fraction of blocks that caused full/major loss or gain of function showed a lower density of variants across human populations (R2=68%, p-value<0.05, Supplementary Figure 3D). These findings support the notion that blocks contributing to enhancer activity, as observed by mutagenesis screening, contribute to fitness and are therefore subject to selective constraints in evolution and human populations. Taken together, these results show that all tested enhancers have multiple sites critical for their function, dispersed across extended core regions ranging from approximately 110bp to 250bp in length. However, they show substantial differences in their robustness to mutation and in their propensity to gain or lose tissue-specific activities upon mutation.

Basepair Resolution Prediction of Critical Sites Within Enhancers

The comprehensive in vivo dataset of block-mutated enhancers offers a unique opportunity to develop and assess tools for predicting the importance of individual nucleotides for normal in vivo enhancer function. We trained a machine learning model (ChromBPNet19) to predict genome-wide open chromatin signal from DNA sequence using 29 bulk ATAC-seq, single-cell ATAC-seq (scATAC) and DNase I hypersensitive site sequencing (DHS) human and mouse datasets from embryonic tissues in which the tested enhancers were active (Supplementary Table 4). Next, we used these models to predict the consequences of mutating individual or multiple 12bp blocks in each enhancer and compared the predicted changes in open chromatin signal to the observed differences in enhancer in vivo activity. For example, for enhancer FL and using a model derived from e11.5 limb DHS data, mutagenesis of block 12 resulted in a predicted minor reduction (log2 fold change = −0.24) in chromatin openness, which coincided with a minor loss of in vivo function in the limbs (Figure 2A). In contrast, mutagenesis of block 16 was predicted to reduce chromatin openness substantially (log2 fold change = −1.03), which coincided with an observed major loss of in vivo activity. Comparing all predicted changes in chromatin openness with observed in vivo results for enhancer FL revealed a strong correlation (R2=0.73, Figure 2B, see Methods for scoring of in vivo results). For five of the seven enhancers, we identified models trained on data from relevant tissues with high correlation between predicted mutation effects and in vivo results observed for mutant alleles (respective best-fit models: R2=0.50–0.79; Methods, Supplementary Figure 4A, Supplementary Table 4, Supplementary Note 3). For two of the seven enhancers, none of the models from relevant tissues showed good correlation with in vivo results and these enhancers were excluded from further analysis (NEU1 and NEU2, see Supplementary Note 3 for details).
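The comparison between predicted and observed mutation effects can be sketched as follows. This is an illustrative example only: the predicted signal values and the ordinal in vivo scores are invented, the scoring scale is an assumption (the actual scoring is described in Methods), and a rank correlation is written out by hand here so that the example has no dependencies.

```python
import math


def log2_fold_change(mutant_signal, wildtype_signal):
    """log2 ratio of predicted open chromatin signal, mutant over wild-type."""
    return math.log2(mutant_signal / wildtype_signal)


def average_ranks(values):
    """1-based ranks, with tied values receiving the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical predicted signals for one wild-type and five mutant constructs.
wildtype = 200.0
mutants = [200.0, 170.0, 98.0, 150.0, 60.0]
# Hypothetical in vivo scores: 0 = no change, -1 = minor, -2 = major, -3 = full loss.
in_vivo = [0, -1, -2, -1, -3]

predicted = [log2_fold_change(m, wildtype) for m in mutants]
rho = spearman(predicted, in_vivo)
print(f"rho = {rho:.2f}")
```

A negative log2 fold change corresponds to a predicted reduction in chromatin openness, as for blocks 12 and 16 of enhancer FL in the text.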

Figure 2. Machine learning model selection and validation.

(A) Examples of ChromBPNet model output and in vivo results for reference and mutagenized constructs of enhancer FL. White arrowheads indicate partial or full loss of in vivo activity. (B) Correlation of model-predicted mutation effects (change in predicted signal between wild-type and mutagenized sequence) and the observed in vivo mutagenesis results. Each dot represents a construct with a mutagenized block or a combination of blocks. R2=Spearman correlation. (C) Contribution scores for wild-type sequences with per block in vivo experiment results in boxes below. Best-fit models depicted. Clusters with high contribution scores boxed in (N=15). OFT = outflow tract, LV = left ventricle, RV = right ventricle, atr. = atrium. (D) Single or double basepair mutations were introduced at clusters with high contribution scores. Also see Supplementary Figure 4B.

For each model, we used DeepLIFT20 to predict the contribution of each basepair within the enhancer to the open chromatin signal (Figure 2C). Using only the best-fit model for each enhancer, we observed 15 locally dense clusters of contiguous nucleotides with high contribution scores. In many cases, the observed clusters were reminiscent of binding motifs of TFs expected to be active in the tissues observed in vivo. For example, in face and limb enhancer FL, the approach revealed high contribution scores for motifs relevant to craniofacial and limb development, including an isolated HAND2/TWIST1 E-box motif and a pair of homeobox and HAND2/TWIST1 motifs resembling a previously described Coordinator motif (Figure 2C)21–24. Likewise, in heart enhancers HT1, HT2 and HT3 we observed clusters of high contribution scores that corresponded to binding motifs for GATA, MEF2, and SRF, all of which are involved in cardiac development (Figure 2C)22,25. Of 15 clusters with high contribution scores, 14 overlapped blocks that showed loss of activity upon mutagenesis, indicating high positive predictive value (93%). Conversely, of the 53 blocks whose mutation caused a change of in vivo function, 19 overlapped clusters with high contribution scores, indicating moderate sensitivity (36%, also see Supplementary Note 4, Methods).
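The two summary metrics above follow from overlapping predicted clusters with functional blocks. The sketch below shows one way to compute them; the cluster coordinates and the set of functional blocks are toy values, and the half-open interval convention is an assumption rather than the definition used in Supplementary Note 4.

```python
BLOCK_SIZE = 12


def overlapped_blocks(start, end, size=BLOCK_SIZE):
    """0-based block indices touched by the half-open interval [start, end)."""
    return set(range(start // size, (end - 1) // size + 1))


def ppv_and_sensitivity(clusters, functional_blocks):
    """PPV: fraction of clusters overlapping at least one functional block.
    Sensitivity: fraction of functional blocks overlapped by any cluster."""
    hits = [overlapped_blocks(s, e) & functional_blocks for s, e in clusters]
    supported = sum(1 for h in hits if h)
    detected = set().union(*hits)
    return supported / len(clusters), len(detected) / len(functional_blocks)


# Toy example: three predicted clusters, four functional blocks (hypothetical).
clusters = [(14, 22), (30, 40), (70, 78)]
functional = {1, 2, 3, 4}
ppv, sens = ppv_and_sensitivity(clusters, functional)

# The counts reported in the text give the same percentages when plugged in.
reported_ppv = 14 / 15    # rounds to 93%
reported_sens = 19 / 53   # rounds to 36%
```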

Next, we assessed experimentally if the motifs identified by high contribution scores are indeed the critical functional components of the 12 basepair blocks tested previously by block mutagenesis. We introduced targeted mutations of single or two adjacent nucleotides predicted to disrupt 7 of 15 clusters with high contribution scores. In all cases, we observed a loss-of-function in line with contribution score-based predictions. For example, in enhancer HT1, upon introducing a point mutation (G78A) within a predicted GATA binding motif, we observed a complete loss of in vivo activity in the left cardiac ventricle that was indistinguishable from the effect of mutating the entire surrounding 12 basepair block (Figure 2D). Similar effects were observed for all other cases tested (Figure 2D, Supplementary Figure 4B). Together, these results indicate that contribution scores derived from models trained to predict open chromatin signal can identify functional TF motifs within enhancers and predict the impact of their mutation on enhancer activity with considerable accuracy.

Consideration of Degenerate Motifs and Multi-Tissue Activities Improves Detection Sensitivity

We hypothesized that some functionally relevant TF motifs escape detection because their weaker contribution scores do not stand out as distinct clusters in the wildtype sequence. To find such degenerate sites and thereby increase detection sensitivity, we performed in silico saturation mutagenesis of all five enhancers, generating 5082 variant sequences, each carrying a single 1bp substitution. Next, we examined the variant sequences for the emergence of new local clusters of nucleotides with high contribution scores, and for changes in overall predicted open chromatin signal across the enhancer. For example, in enhancer HT1, we observed that a single in silico point mutation (T111C) resulted in the emergence of a strong, predicted MEF2 motif that is not evident from the reference sequence. The mutation increased the predicted open chromatin signal substantially (log2 fold change = 0.74; Figure 3A, left). Targeted disruption of this MEF2 motif through mutation of a different single basepair (T112C) caused region-specific loss of cardiac in vivo activity in a pattern that was identical to the loss of activity observed upon mutating the entire 12bp block in which the mutation resides. A degenerate MEIS-TEAD site with similar in vivo impact was observed in another block of enhancer HT1 (Figure 3A, right). Across all enhancers, we identified 6 sites that both featured a novel cluster of high, positive contribution scores and had a predicted open chromatin signal 25% higher than the reference (Supplementary Figure 5A).
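In silico saturation mutagenesis can be sketched in a few lines of Python. A sequence of length L yields 3L single-base variants, so the 5082 variants reported here would correspond to 1694bp in total if every position was substituted with all three alternative bases. In the sketch, `toy_predict` is a hypothetical stand-in for a trained model, and only the signal criterion (predicted signal at least 25% above the reference) is implemented; the contribution score cluster criterion is not.

```python
BASES = "ACGT"


def saturation_variants(seq):
    """Yield (label, variant) for every single-base substitution in seq.

    Labels use 1-based positions, e.g. 'T111C'. Assumes uppercase A/C/G/T input.
    """
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                label = f"{ref}{pos + 1}{alt}"
                yield label, seq[:pos] + alt + seq[pos + 1:]


def gain_candidates(seq, predict, min_ratio=1.25):
    """Labels of variants whose predicted signal is >= min_ratio x reference."""
    reference = predict(seq)
    return [label for label, variant in saturation_variants(seq)
            if predict(variant) / reference >= min_ratio]


def toy_predict(seq):
    """Hypothetical model stand-in: signal rises with each exact GATA match."""
    return 1.0 + seq.count("GATA")


toy = "AAGACAAA"  # one substitution away from containing GATA
variants = dict(saturation_variants(toy))
candidates = gain_candidates(toy, toy_predict)
print(len(variants), candidates)
```

In the toy sequence, the only substitution that creates the motif is C5T, mirroring how a single in silico mutation can expose a degenerate site that is not evident in the reference sequence.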

Figure 3. Refined map of binding motifs and enhancer activity.

(A) Discovery of additional sites through in silico mutagenesis and validation. Also see Supplementary Figure 5 A and B. (B) Examples of block mutants with gain of brain activity and additional motifs discovered using alternative FL models trained on neuronal datasets. Black arrowheads indicate gain of function. Also see Supplementary Figure 5C. (C) Final TF binding motif and activity map. Includes motifs discovered using alternative models (element FL) and degenerate motifs (marked with asterisks; elements HT1, HT2 and HT3). (D) Fraction of blocks with motif predictions, by experimentally determined function. Major loss includes full loss. (E) Number of activator and inhibitor sites as estimated from experimental data alone (marked with asterisk; NEU1 and NEU2) or from experimental data combined with model motif predictions (FL, NEU3, HT1–3), by enhancer (Methods, Supplementary Figure 5D for visual guide).

We also explored if the sensitivity of detecting functional sites can be further increased by combining models derived from multiple training sets representing different relevant tissues. We tested this paradigm using face and limb enhancer FL, which showed a striking increase in activity in the brain in several block mutation experiments, suggesting latent neuronal activities that could potentially be studied using models derived from brain tissues containing many different types of neuronal cells (Figure 3B, left). Indeed, using an alternative model derived from e11.5 hindbrain ATAC-seq data, we observed two strong binding site motif predictions for activator SOX and repressor SNAI that were not apparent in the best-fit limb model (Figure 3B, right). A targeted 2-basepair mutation of the SOX motif resulted in loss of in vivo function, whereas a targeted single-basepair mutation in the repressive SNAI motif caused a major gain of in vivo function (Figure 3B). Using an additional model derived from glutamatergic neurons, we observed two more sites, including a repressive NR/RAR motif located in a sequence block that causes a gain of activity when mutated (Supplementary Figure 5C). Together, the use of two alternative models identified four additional binding motifs in enhancer FL, thereby providing mechanistic explanations for the observed in vivo activity changes.

The combined use of in silico saturation mutagenesis and alternative models (Figure 3C) predicted TF motifs in 30 of the 53 blocks that showed altered in vivo activity upon mutation, increasing sensitivity to 59% compared to 36% based on best-fit models alone. Despite this substantially improved sensitivity, we observed only a minor reduction in positive predictive value, from 14/15 (93%) to 22/25 (88%) of predicted functional sites showing altered in vivo activity. Blocks classified to cause a major loss of function when mutated had a predicted TF site more often than those causing minor loss of function (66% vs 38%), although the difference was not statistically significant (p=0.06, Fisher’s Exact Test; Figure 3D).

Combining the results of block mutagenesis and open chromatin model predictions also offers an opportunity to examine the overall complexity of individual enhancers by estimating the total number of functional sites (Methods, Supplementary Figure 5D). We observed that the seven enhancers examined had between 4 and 15 functionally relevant sites (average: 9; Figure 3E). Enhancers that contained blocks which, when mutagenized, caused a gain of function, had the highest number of sites (13–17 sites total; FL, NEU1, HT1; p<0.05, Mann-Whitney U-test). Taken together, these results show how combining large-scale in vivo mutagenesis, epigenomic data, and predictive modeling can elucidate the functional landscape of in vivo enhancers at base-pair resolution.

Response to Mutations Reveals Regulatory Modes

The complex spatial activity patterns of enhancers, which frequently include multiple developmental tissues and cell types, represent an additional hurdle for relating enhancer sequence content to in vivo function. We explored whether the results of our mutagenesis screen can be used to disentangle the relationship between sequence features within an enhancer and tissue-specific activities.

First, we examined the impact of different mutations on the in vivo activity of enhancer HT1 in different subregions of the developing heart. The reference allele of HT1 showed strong activity in the outflow tract and both ventricles, along with weaker activity in the atria (Figure 4A). We scored the activity changes observed in each of these four cardiac subregions for each single-block mutagenesis allele in comparison to the reference allele (Figure 4A). We observed that activity in the atria and left ventricle was typically more severely affected than activity in the right ventricle and the outflow tract. Extending this analysis to include constructs with multiple mutated blocks, and sorting them by the overall severity of the observed changes (Figure 4B) revealed a graded response in which atrial expression was most susceptible to mutations, followed by left ventricle, right ventricle, and outflow tract. We did not observe any cases in which outflow tract or right ventricle expression was affected in the absence of changes to left ventricle or atrial activity. This suggests that functional sites within enhancer HT1 predominantly do not drive expression in specific subregions of the heart, but contribute to an overall pattern in a graded fashion. A similarly graded response was observed for enhancer NEU2 (Supplementary Figure 7B).
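The nesting rule stated above (outflow tract or right ventricle activity is never affected unless left ventricle or atrial activity is also affected) can be checked mechanically once each construct has been scored per structure. The sketch below uses invented constructs and a simple affected/unaffected flag per structure; both are assumptions for illustration and do not reproduce the scoring described in Methods.

```python
def violates_graded_pattern(affected):
    """True if a less sensitive structure (OFT, RV) is affected while
    neither of the more sensitive structures (LV, atria) is."""
    less_sensitive = affected["OFT"] or affected["RV"]
    more_sensitive = affected["LV"] or affected["atria"]
    return less_sensitive and not more_sensitive


# Hypothetical constructs; True means activity differs from the reference allele.
constructs = {
    "mutant_a": {"atria": True, "LV": False, "RV": False, "OFT": False},
    "mutant_b": {"atria": True, "LV": True, "RV": True, "OFT": False},
    "mutant_c": {"atria": False, "LV": False, "RV": True, "OFT": False},
}

violations = [name for name, a in constructs.items() if violates_graded_pattern(a)]
print(violations)
```

Under a graded response, the list of violations is empty; `mutant_c` is included here only to show what a structure-specific exception would look like.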

Figure 4. Patterns of multi-tissue in vivo responses to mutations.

(A) Activity of single block mutants of enhancer HT1, scored across four cardiac substructures. Flanking wild-type blocks not shown. (B) Activity of all mutated HT1 constructs, scored across four cardiac substructures, arranged by overall expression (Methods). (C) Activity of mutated FL constructs, scored across three branchial arches. Arranged by structure-specific full loss of function. Only mutants with partial loss of function in one of the arches were included. OFT = outflow tract, LV = left ventricle, RV = right ventricle, atr. = atrium, (r) = random scrambling mutagenesis, (tv) = GC content preserving transversion mutagenesis, 1;11 = combinatorial mutagenesis of blocks 1 and 11, A190G = 1bp A to G mutation at position 190. Arrowheads: black = gain of function, blue = minor loss, white = full loss. Also see Supplementary Figure 7.

Next, we examined enhancer FL, which shows more complex expression changes, performing the same structure-specific annotations (Supplementary Figure 7A). Focusing on expression in the first, second, and third branchial arch, we observed structure-specific activity changes associated with distinct subsets of mutations (Figure 4C). For example, mutations of blocks 3 or 9 selectively abolished expression in branchial arch 2 while maintaining activity in branchial arches 1 and 3. In contrast, mutations of blocks 10 or 16 selectively abolished expression in branchial arch 3. These results show that distinct aspects of the complex in vivo activity pattern of enhancer FL require different functional subregions of the enhancer. A similar structure-specific response to mutations was observed for enhancer NEU1 (Supplementary Figure 7C).

In contrast to these structure-specific effects of mutations affecting branchial arch activity, some other tissues in which enhancer FL is active exhibited graded responses more similar to HT1 and NEU2. In particular, all mutations that caused a full loss of activity in any facial substructure also caused loss of limb activity (consistent with shared developmental signaling in these tissues26). Conversely, nearly all incomplete loss mutants (27/29) retained some activity in branchial arch 1 (Supplementary Figure 7A). These findings indicate that all functional sites within enhancer FL contribute to limb and branchial arch 1 activity in a graded fashion, while some functional sites are specifically required for activity in either branchial arch 2 or branchial arch 3.

In conclusion, the results of our large-scale in vivo enhancer mutagenesis highlight two distinct modes by which mutations can affect the activity of enhancers with complex, multi-tissue activity patterns. The more commonly observed mode is a graded response of structures to mutations, with some structures being overall more sensitive to mutations than others. In a second strictly structure-specific mode, distinct mutations affect activity in distinct substructures independently. As illustrated by enhancer FL, these modes are not a general property of a given enhancer, but can co-occur within the same enhancer, applying to different aspects of the complex activity pattern.

Paired Block Mutagenesis Demonstrates Pervasive Additive Logic

The severity of activity changes in enhancers generally increased with the number of introduced mutations (see, e.g., Figure 4B). However, this observation does not immediately reveal the functional impacts expected from compound mutations that affect more than one functional sequence block. Under a simple model of enhancer function, individual sites within the enhancer contribute to the enhancer’s overall regulatory activity in an additive fashion. Consequently, it is expected that combinations of mutations cause additive in vivo activity changes that reflect those observed in the constituent single-block mutagenesis experiments. However, more complex modes of functional intra-enhancer interaction resulting from compensatory or synergistic functional interactions between sites are also conceivable27–30. To examine the prevalence of such complex interactions in human in vivo enhancers, we systematically compared how mutagenesis of single 12bp blocks or pairs of such blocks affected in vivo activity. We only studied pairs separated by at least one block to avoid potentially confounding gain-of-binding events at the boundary of adjacent blocks and to exclude short-distance, homo- and heterodimer TF interactions. Under an additive model of function, we expected combining two loss-of-function mutations to result in a more pronounced loss. Any other outcome would indicate deviation from the additive model (Figure 5A; Supplementary Figure 8A).

Figure 5. Comparison of individual and paired block mutations.

(A) Classification of outcomes of paired block mutagenesis. A combination of two loss-of-function mutations resulting in a more pronounced loss is considered additive, while any other outcome is classified as non-additive (also see Supplementary Figure 8A). (B) Distribution of additive and non-additive outcomes of paired block mutagenesis. (C) Examples of additive pairs. (D) An example of a non-additive pair. White arrowheads highlight structures of interest (see main text). Also see Supplementary Figure 8.

We examined 32 pairs of blocks and found 29 (90%) to have patterns consistent with the additive model (Figure 5B). For example, combined mutation of blocks 13 and 19 of enhancer FL resulted in full loss of function, while mutagenesis of either of these blocks in isolation led to only incomplete reduction in enhancer activity (Figure 5C). A similar additive effect was observed for HT1 blocks 5 and 7, as well as NEU1 blocks 15 and 22, for which paired block mutation caused more severe loss than either of the individual block mutations (Figure 5C; see Supplementary Figure 8B for additional examples). As a contrasting example of non-additive changes in function of a compound mutagenesis construct, partial loss of midbrain activity caused by mutagenizing NEU1 blocks 19 and 22 together was highly similar to the effect of mutagenizing either of the blocks in isolation (Figure 5D; see Supplementary Figure 8C for remaining non-additive pairs). Taken together, our results demonstrate that most functional sites within human in vivo enhancers contribute to overall regulatory activity of the enhancer in an additive manner. More complex functional interactions between sites within an enhancer can occur, but are rare.
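The additive classification rule used in this section can be written down as a short sketch. The ordinal severity scale below is an assumption made for illustration (the actual classification scheme is shown in Figure 5A and Supplementary Figure 8A), and the category assignments in the two example calls are illustrative rather than taken from the data.

```python
# Assumed ordinal scale for loss-of-function severity.
SEVERITY = {"none": 0, "minor loss": 1, "major loss": 2, "full loss": 3}


def classify_pair(single_a, single_b, pair):
    """Classify a paired block mutant against its two single-block mutants.

    Additive: the pair shows a more pronounced loss than either single mutant.
    Any other outcome is non-additive.
    """
    a, b, p = SEVERITY[single_a], SEVERITY[single_b], SEVERITY[pair]
    if a == 0 or b == 0:
        raise ValueError("both single-block mutants must show loss of function")
    return "additive" if p > max(a, b) else "non-additive"


# Two incomplete losses combining into full loss (pattern described for FL blocks 13 and 19).
print(classify_pair("major loss", "minor loss", "full loss"))

# Pair no more severe than either single mutant (pattern described for NEU1 blocks 19 and 22).
print(classify_pair("minor loss", "minor loss", "minor loss"))
```

One limitation of this simple rule is that a pair containing a single-block mutant with full loss can never be scored as additive, since no more pronounced loss is possible on this scale.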

Discussion

Over the past decade, dramatically improved maps of the transcriptional enhancers orchestrating human genome function have emerged from genome-wide mapping efforts in hundreds of mammalian tissues and cell types31–34. In sharp contrast, our understanding of the genomic code for how individual enhancers direct gene expression in vivo remains cursory. This knowledge gap currently prevents accurate predictions of how a given mutation within an enhancer will impact its in vivo function. To develop a systematic and robust data foundation for gaining insight into this relationship, we performed comprehensive in vivo mutagenesis mapping of multiple human developmental enhancers with different tissue specificities, leveraging mouse genome editing to generate and analyze more than 1,700 independent transgenic mouse embryos. Our studies revealed a diversity of functional site arrangements within these enhancers, enabled the identification of machine learning models for prediction of functional binding motifs at basepair resolution, identified strategies to improve the sensitivity of machine learning models, described complementary modes of multi-tissue activity, and established an additive model as the predominant mode of functional site interactions.

Systematic block mutagenesis of seven in vivo enhancers showed that all had a complex functional architecture, with sites required for normal activity spread across hundreds of basepairs, and revealed pronounced differences in overall sensitivity to mutations (Figure 1). Three enhancers could be completely inactivated by mutagenesis of a single “Achilles’ heel” block. Conversely, three enhancers contained blocks which, when mutagenized, led to gains of function. In an example of extremely high density of functional sites, no single core block of enhancer FL could be mutagenized without affecting its in vivo activity. In contrast, in an example of low density of functional sites, the majority of blocks in the functional core of enhancer HT3 could be mutated without impact on the observed in vivo activity. Given the spectrum of density of functional sites observed across the enhancers studied here, we speculate that even more robust enhancers, in which no individual block mutation leads to major loss of function, may exist.

Predictive modeling greatly complemented our experimental survey, allowing us to interpret the results of block mutagenesis at basepair resolution, with considerable sensitivity and high positive predictive value (Figure 2 and Figure 3). Systematic comparison of models against experimental data from in vivo block mutagenesis enabled the selection of best-fit models for individual enhancers. We found that the models trained directly on tissue-specific open chromatin signal predicted coherent, tissue-appropriate sets of binding motifs. The resulting high-confidence predictions enabled the targeted experimental verification of functionally relevant nucleotides within each block, highlighting a powerful computationally guided strategy for the interpretation of human pathogenic mutations and evolutionary divergence at enhancers across species.

By applying machine learning models to in silico-mutated enhancer sequences, we uncovered additional, degenerate TF motifs that could not be detected in reference sequences, thereby further increasing model sensitivity (Figure 3A, Supplementary Figure 5A). Notably, despite their low contribution scores in the context of the wildtype enhancer, we showed experimentally that these motifs contribute to the in vivo function of the respective enhancers. This observation aligns with the notion that suboptimal, lower-affinity TF binding sites in enhancers contribute to tissue-specific activities35–38. Application of machine learning models to in silico-mutated enhancer sequences offers an effective and scalable approach for the systematic discovery of such sites in other enhancers.
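As an illustration of how in silico-mutated sequences of this kind could be generated, the sketch below applies transition substitutions (A↔G, C↔T) to one 12-bp block at a time. It assumes that blocks tile the sequence from its first base and uses 0-based block indices; the function names `mutate_block` and `all_single_block_mutants` are hypothetical and do not correspond to code released with this study.

```python
# Transition substitutions: purine <-> purine (A<->G), pyrimidine <-> pyrimidine (C<->T).
TRANSITION = str.maketrans("ACGTacgt", "GTACgtac")

BLOCK_SIZE = 12  # block length in bp (assumed to tile the sequence from position 0)


def mutate_block(seq: str, block_index: int, block_size: int = BLOCK_SIZE) -> str:
    """Return seq with one block (0-based index) replaced by its transition mutant."""
    start = block_index * block_size
    if block_index < 0 or start >= len(seq):
        raise ValueError("block_index is outside the sequence")
    end = min(start + block_size, len(seq))  # the last block may be shorter
    return seq[:start] + seq[start:end].translate(TRANSITION) + seq[end:]


def all_single_block_mutants(seq: str, block_size: int = BLOCK_SIZE) -> dict:
    """Map each block index to the sequence carrying only that block mutation."""
    n_blocks = -(-len(seq) // block_size)  # ceiling division
    return {i: mutate_block(seq, i, block_size) for i in range(n_blocks)}


if __name__ == "__main__":
    toy = "AAAAAAAAAAAACCCCCCCCCCCC"  # toy 24-bp sequence, not a real enhancer
    for index, mutant in all_single_block_mutants(toy).items():
        print(index, mutant)
```

Each mutant sequence would then be scored by a model and compared against the reference sequence; the scoring step is omitted here because it depends on the trained model.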

Three out of seven enhancers in our study harbored blocks that, when mutagenized, caused gains of activity, either in tissues in which the wildtype allele is inactive or quantitatively increasing activity in a tissue in which the wildtype allele is active. Such gains of function suggest the presence of repressive binding sites within these blocks, resulting in tissue-specific derepression upon mutagenesis. Generally, enhancers that included gain-of-function blocks also appeared to have overall more functional sites than enhancers that contained only blocks that caused loss of function when mutated (Figure 3E). The two enhancers containing multiple gain-of-function blocks (FL and NEU1) also had the clearest examples of mutations acting in a structure-specific manner (Figure 4A). This suggests that the activity in different tissues is enabled by the interplay of activating and repressive sites, which is consistent with observations of activator-repressor logic in other developmental enhancers29,39,40.

The complexity of functional impacts of mutations across tissues stresses the importance of studying human enhancers using whole-organism, multi-tissue experimental paradigms. For example, several of the gain-of-function activity changes we observed in face/limb enhancer FL appeared in unrelated organ systems, such as the heart and nervous tissues (Supplementary Figure 2B, Figure 3B). This aligns with our observation that some of the functional motifs for enhancer FL were not detected by machine learning models trained only on tissues in which the reference enhancer was predominantly active, namely face and limb (Figure 3C, Supplementary Figure 5C). It would be challenging to capture such mutation-induced ectopic activity in unexpected tissues even in complex in vitro systems. Our findings imply that the interpretation of human non-coding variation and regulatory evolution, as well as the design of safe, tissue-specific gene therapies, will require a multi-tissue, in vivo approach that takes into account the possibility of ectopic activation from as little as a single basepair mutation (Figure 3B).

Systematic mutagenesis also provided insight into the relationship between individual sequence features of enhancers and their respective function in directing complex activity patterns that include multiple tissues or anatomical regions (Figure 4). In particular, we observed that most mutations caused a quantitative reduction in activity relative to the wild-type baseline activity across all tissues. Since baseline activity may vary across tissues, this resulted in a general graded reduction in activity across tissues. However, we also observed several cases in which mutations affected in vivo activity selectively in individual anatomical structures, implying that the corresponding wildtype sequence feature interacts with TFs with spatially restricted expression.

Combining mutations in pairs of blocks allowed us to examine the possible presence of functional interactions between sites. We observed examples of additive effects on enhancer function, where the combined mutations resulted in additive in vivo activity changes, as well as non-additive effects. In 90% of cases, we found a simple additive pattern, suggesting that additive logic is the predominant mode in human developmental enhancers (Figure 5). The non-additive cases we identified may represent opposing or interfering effects of two TFs. Alternatively, they may be a special case of additive logic, in which block mutations simultaneously lead to a loss of activity in one cell type and a gain of activity in another cell type in the same anatomical structure. The effect of combining such block mutations may appear to be non-additive. Identifying the underlying TFs will help design experiments to interpret these observations.

In conclusion, our comprehensive mutagenesis survey of human in vivo enhancers revealed many facets of within-enhancer regulatory logic, in particular pertaining to the activator-repressor paradigm, multi-tissue expression and the applicability of predictive modeling. These findings provide a foundation for the interpretation of human non-coding variation and of changes in enhancer activity across evolution, and will aid in the design of synthetic enhancers for biotechnological and therapeutic purposes.

Methods

Transgenic assay

Transgenic E11.5 mouse embryos were generated as described previously9. Briefly, super-ovulating female FVB mice were mated with FVB males and fertilized embryos were collected from the oviducts. Enhancer sequences were synthesized by Twist Bioscience and cloned into the donor plasmid containing a minimal Shh promoter, a lacZ reporter gene and H11 locus homology arms (Addgene, 139098) using NEBuilder HiFi DNA Assembly Mix (NEB, E2621). The sequence identity of donor plasmids was verified using long-read sequencing (Primordium). Plasmids are available upon request. A mixture of Cas9 protein (Alt-R SpCas9 Nuclease V3, IDT, Cat#1081058, final concentration 20 ng/μL), hybridized sgRNA against the H11 locus (Alt-R CRISPR-Cas9 tracrRNA, IDT, cat#1072532 and Alt-R CRISPR-Cas9 locus targeting crRNA, gctgatggaacaggtaacaa, total final concentration 50 ng/μL) and donor plasmid (12.5 ng/μL) was injected into the pronucleus of donor FVB embryos. The efficiency of targeting and the gRNA selection process are described in detail in Osterwalder et al. 2022 (ref. 9).

Embryos were cultured in M16 with amino acids at 37°C, 5% CO2 for 2 hours and implanted into pseudopregnant CD-1 mice. Embryos were collected at E11.5 for lacZ staining as described previously9. Briefly, embryos were dissected from the uterine horns, washed in cold PBS, fixed in 4% PFA for 30 min and washed three times in embryo wash buffer (2 mM MgCl2, 0.02% NP-40 and 0.01% deoxycholate in PBS at pH 7.3). They were subsequently stained overnight at room temperature in X-gal stain (4 mM potassium ferricyanide, 4 mM potassium ferrocyanide, 1 mg/mL X-gal and 20 mM Tris pH 7.5 in embryo wash buffer). PCR using genomic DNA extracted from embryonic sacs digested with DirectPCR Lysis Reagent (Viagen, 301-C) containing Proteinase K (final concentration 6 U/mL) was used to confirm integration at the H11 locus and test for presence of tandem insertions9. Only embryos with donor plasmid insertion at H11 were used. The stained transgenic embryos were washed three times in PBS and imaged from both sides using a Leica MZ16 microscope and Leica DFC420 digital camera.

Correlating predictions of machine learning models and in vivo results

To assess fit between the models and in vivo results, experimental results were scored on a scale from 0 to 7, with 0 indicating full loss of function, 1–4 indicating various degrees of major loss of function, 5 indicating minor loss of function, 6 indicating no change and 7 indicating a gain of function. The Spearman correlation (R) between this in vivo score and the model-predicted log2 fold change in open chromatin signal across all single and multi-block transition mutagenesis constructs was computed for each model and enhancer combination. Total predicted signal for the input sequence was used. All model estimates were obtained from the count head, using as input 2114 bp centered on the enhancer, flanked by the reporter construct (H11 locus left homology arm on the left and Shh promoter and reporter LacZ gene on the right).
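The correlation step can be sketched in a few lines. The example below computes a Spearman correlation as the Pearson correlation of average ranks, which handles the tied values that a 0 to 7 score necessarily produces. The input values are invented for illustration and the function names are hypothetical; this is a minimal sketch, not the analysis code used in the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Invented example: in vivo scores (0-7) and predicted log2 fold changes
    # for five hypothetical constructs of one enhancer and one model.
    in_vivo_scores = [0, 3, 5, 6, 7]
    predicted_log2_fc = [-1.2, -0.6, -0.2, 0.0, 0.3]
    print(spearman(in_vivo_scores, predicted_log2_fc))
```

Repeating this calculation for every model and enhancer combination gives a table of correlations from which best-fit models can be chosen.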

Sensitivity, specificity and estimation of binding site numbers

Sensitivity and specificity were calculated simply as fractions of, respectively, functional or wild-type blocks overlapping model-predicted motifs. Positive predictive value was calculated as a fraction of predicted binding motifs overlapping at least one functional block. The GATA motif in HT1 block 5 (classified as major loss) also overlapped block 6 (gain) by 1bp, which was ignored for the sake of simplicity.
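These overlap-based metrics can be illustrated with a short sketch. Blocks and motifs are represented as half-open coordinate intervals, and all coordinates below are invented. Specificity is taken here in its standard sense, as the fraction of wild-type (non-functional) blocks that do not overlap any predicted motif; this is an interpretation of the wording above rather than a confirmed detail of the original analysis, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def block_metrics(blocks, motifs):
    """Compute (sensitivity, specificity, ppv) from blocks and predicted motifs.

    blocks: list of (start, end, is_functional)
    motifs: list of (start, end)
    A metric is None when its denominator is zero.
    """
    functional = [(s, e) for s, e, f in blocks if f]
    wild_type = [(s, e) for s, e, f in blocks if not f]

    def hit(block):
        return any(overlaps(block, m) for m in motifs)

    def fraction(numerator, denominator):
        return numerator / denominator if denominator else None

    # Functional blocks that overlap at least one predicted motif.
    sensitivity = fraction(sum(hit(b) for b in functional), len(functional))
    # Wild-type blocks that overlap no predicted motif (standard definition).
    specificity = fraction(sum(not hit(b) for b in wild_type), len(wild_type))
    # Predicted motifs that overlap at least one functional block.
    ppv = fraction(
        sum(any(overlaps(m, b) for b in functional) for m in motifs), len(motifs)
    )
    return sensitivity, specificity, ppv


if __name__ == "__main__":
    # Invented 12-bp blocks: (start, end, is_functional)
    blocks = [(0, 12, True), (12, 24, False), (24, 36, True),
              (36, 48, False), (48, 60, True)]
    motifs = [(2, 10), (30, 34), (40, 46)]
    print(block_metrics(blocks, motifs))
```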

To obtain the model-corrected number of binding sites per enhancer, we counted each predicted binding site once (even if it spanned multiple blocks) and assumed that each functional block without a prediction contains exactly one site: an activator site if loss of function was observed upon block mutagenesis, or a repressor site if gain of function was observed. We excluded sites predicted to be in non-functional blocks.
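The counting rule can be written out as a small sketch. It assumes that a predicted site is retained when it overlaps at least one functional block and is dropped only when it falls entirely within non-functional blocks; block coordinates, effect labels and the function name are hypothetical.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def corrected_site_count(blocks, motifs):
    """Estimate the number of binding sites in one enhancer.

    blocks: list of (start, end, effect), effect in {"loss", "gain", "none"}
    motifs: list of (start, end) predicted binding sites
    """
    functional = [b for b in blocks if b[2] != "none"]

    # Each predicted site counts once, however many blocks it spans;
    # sites overlapping no functional block are excluded.
    predicted = sum(
        1 for m in motifs if any(overlaps(m, b[:2]) for b in functional)
    )

    # Functional blocks without any prediction contribute exactly one site each.
    unexplained = [
        b for b in functional if not any(overlaps(b[:2], m) for m in motifs)
    ]
    imputed_activators = sum(1 for b in unexplained if b[2] == "loss")
    imputed_repressors = sum(1 for b in unexplained if b[2] == "gain")

    return {
        "predicted": predicted,
        "imputed_activators": imputed_activators,
        "imputed_repressors": imputed_repressors,
        "total": predicted + imputed_activators + imputed_repressors,
    }


if __name__ == "__main__":
    # Invented example: one motif spans two loss blocks, one lies in a
    # non-functional block, and two functional blocks have no prediction.
    blocks = [(0, 12, "loss"), (12, 24, "loss"), (24, 36, "none"),
              (36, 48, "gain"), (48, 60, "loss")]
    motifs = [(8, 16), (26, 32)]
    print(corrected_site_count(blocks, motifs))
```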

Selection of paired block mutations

We selected only block pairs that were separated by at least 1 block, to avoid potential gain-of-binding events at the interface of mutagenized blocks. We also excluded combinations of full loss of function blocks with other loss of function blocks, since the likely outcome - full loss of function - cannot be classified as either additive or non-additive in a meaningful way.
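The two selection rules can be expressed as a simple filter. The sketch below assumes blocks are numbered consecutively, so that a difference of at least 2 in block number means at least one intervening block, and it uses the 0 to 7 in vivo score described above (0 for full loss, 1–5 for partial loss). Block numbers and scores are invented, and the function names are hypothetical.

```python
from itertools import combinations

FULL_LOSS = 0  # score for full loss of function on the 0-7 scale


def is_loss(score):
    """Scores 0-5 denote full, major or minor loss of function."""
    return score <= 5


def eligible_pairs(scores):
    """Return block pairs that satisfy both selection rules.

    scores: dict mapping block number -> in vivo score (0-7)
    """
    pairs = []
    for a, b in combinations(sorted(scores), 2):
        # Rule 1: at least one block must lie between the two blocks.
        if b - a < 2:
            continue
        # Rule 2: skip a full-loss block combined with any other loss block.
        score_a, score_b = scores[a], scores[b]
        if (score_a == FULL_LOSS and is_loss(score_b)) or (
            score_b == FULL_LOSS and is_loss(score_a)
        ):
            continue
        pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Invented scores for five blocks of a hypothetical enhancer.
    scores = {5: 3, 6: 7, 7: 4, 9: 0, 12: 7}
    print(eligible_pairs(scores))
```

In this toy example the pair (5, 9) is dropped because block 9 is a full-loss block and block 5 is a loss block, whereas (6, 9) is kept because block 6 is a gain-of-function block.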

Machine learning model training and interpretation

Training of scATAC ChromBPNet models included in this study was described previously19,43.

Reference genome (mm10), blacklist regions, filtered BAM files for paired-end data and unfiltered BAM files for single-end data (ATAC-seq and DHS) were obtained from the ENCODE portal31,32. For unfiltered BAM files, an additional filtering step was performed using `samtools view -b -@50 -F780 -q30`. Isogenic replicates for each biological sample were merged to yield consolidated reads. For ATAC-seq samples, the peaks were directly retrieved from the ENCODE portal. For DHS samples, we used MACS2 (ref. 44) and followed the ENCODE ATAC-seq protocol for peak-calling. We further removed regions that overlap with blacklist regions. The dataset was divided into three groups (training, validation, and testing) by chromosome (1–19, X and Y). We employed a 5-fold chromosome hold-out cross-validation approach with different sets of chromosomes for different groups in each fold. Group compositions for each fold are available here.
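To make the filtering step explicit, the sketch below decodes the `-F780` mask using the standard SAM flag bits and mimics the per-read decision made by `-F780 -q30`: a read is dropped if any masked bit is set or if its mapping quality is below 30. The helper names are hypothetical, and the code only illustrates the logic; it does not read BAM files.

```python
# Standard SAM flag bits (from the SAM format specification).
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag_mask(mask):
    """List the flag names whose bits are set in mask, in bit order."""
    return [name for bit, name in sorted(SAM_FLAGS.items()) if mask & bit]


def passes_filter(flag, mapq, exclude_mask=780, min_mapq=30):
    """Mimic `samtools view -F <exclude_mask> -q <min_mapq>` for one read."""
    return (flag & exclude_mask) == 0 and mapq >= min_mapq


if __name__ == "__main__":
    print(decode_flag_mask(780))
    print(passes_filter(flag=99, mapq=42))   # properly paired, well-mapped read
    print(passes_filter(flag=355, mapq=60))  # secondary alignment
```

The mask 780 is the sum 4 + 8 + 256 + 512, so it removes unmapped reads, reads with an unmapped mate, secondary alignments and reads failing quality checks. It does not include the duplicate bit (1024).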

ChromBPNet models19 were trained to predict the read counts given 2114 bp sequences from both peak and background regions, or from background regions alone. The ultimate output of ChromBPNet was a prediction of counts corrected for Tn5 enzyme effects using a background-region model. The background regions were chosen not to overlap peak regions, to have fewer reads than the minimum number of total counts observed in any peak region and to match the GC-content distribution of peak regions. Pearson correlation between predicted and observed log counts was used as a metric of fit during training. We utilized the DeepSHAP implementation of the DeepLIFT algorithm to derive base-resolution contribution scores for each input sequence20,45.
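The three criteria for background regions can be illustrated with a simplified selection routine. The sketch below bins regions by GC fraction and greedily accepts candidates until each bin holds as many background regions as peaks. The greedy one-to-one matching, the ten GC bins, the dictionary keys and the function names are all assumptions made for illustration; the actual ChromBPNet procedure may differ.

```python
from collections import Counter


def gc_bin(seq, n_bins=10):
    """Assign a sequence to a GC-content bin using integer arithmetic."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return min(gc * n_bins // len(seq), n_bins - 1)


def select_background(peaks, candidates, n_bins=10):
    """Pick background regions that satisfy the three criteria.

    peaks: list of dicts with keys "seq" and "counts"
    candidates: list of dicts with keys "seq", "counts" and "overlaps_peak"
    """
    min_peak_counts = min(p["counts"] for p in peaks)
    # Number of background regions still wanted in each GC bin.
    quota = Counter(gc_bin(p["seq"], n_bins) for p in peaks)

    chosen = []
    for cand in candidates:
        if cand["overlaps_peak"]:
            continue  # criterion 1: no overlap with peak regions
        if cand["counts"] >= min_peak_counts:
            continue  # criterion 2: fewer reads than any peak region
        b = gc_bin(cand["seq"], n_bins)
        if quota[b] > 0:  # criterion 3: match the GC distribution of peaks
            quota[b] -= 1
            chosen.append(cand)
    return chosen


if __name__ == "__main__":
    # Invented toy regions (10 bp each instead of 2114 bp).
    peaks = [{"seq": "GGCCGGCCAT", "counts": 50},
             {"seq": "ATATGCATAT", "counts": 30}]
    candidates = [
        {"seq": "GCGCGCGCAA", "counts": 5, "overlaps_peak": False},
        {"seq": "GCGCGCGCTT", "counts": 3, "overlaps_peak": False},
        {"seq": "ATATATGCAT", "counts": 40, "overlaps_peak": False},
        {"seq": "TTTTGCAAAA", "counts": 2, "overlaps_peak": True},
        {"seq": "AATTCGAATT", "counts": 1, "overlaps_peak": False},
    ]
    for region in select_background(peaks, candidates):
        print(region["seq"], region["counts"])
```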

Motifs were identified using the TOMTOM web server (version 5.5.6) with default settings46.

Supplementary Material

Supplement 1
media-1.xlsx (131.1KB, xlsx)

Acknowledgements

This work was supported by a U.S. National Institutes of Health (NIH) grant to L.A.P. (R01HG003988). Research was conducted at the E.O. Lawrence Berkeley National Laboratory and performed under U.S. Department of Energy Contract DE-AC02-05CH11231, University of California (UC). The authors acknowledge funding support from NIH grants 5U24HG007234, U01HG009431, and U01HG012069 to A.K. We would like to thank Laura Cook, Evgeny Kvon, Om Patange and Fabrice Darbellay for critical reading of the manuscript.

Footnotes

Conflicts of Interest

A.K. is on the scientific advisory board of SerImmune, AINovo, TensorBio and OpenTargets. A.K. was a scientific co-founder of RavelBio, a paid consultant with Illumina, was on the SAB of PatchBio and owns shares in DeepGenomics, Immunai, Freenome, and Illumina.

References

  • 1. Lambert S. A. et al. The Human Transcription Factors. Cell 172, 650–665 (2018).
  • 2. Fickett J. W. Quantitative discrimination of MEF2 sites. Mol. Cell. Biol. 16, 437–441 (1996).
  • 3. Long H. K., Prescott S. L. & Wysocka J. Ever-Changing Landscapes: Transcriptional Enhancers in Development and Evolution. Cell 167, 1170–1187 (2016).
  • 4. Gotea V. et al. Homotypic clusters of transcription factor binding sites are a key component of human promoters and enhancers. Genome Res. 20, 565–577 (2010).
  • 5. Johnson D. S., Mortazavi A., Myers R. M. & Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science 316, 1497–1502 (2007).
  • 6. Dickel D. E. et al. Ultraconserved Enhancers Are Required for Normal Development. Cell 172, 491–499.e15 (2018).
  • 7. Hong J.-W., Hendrix D. A. & Levine M. S. Shadow enhancers as a source of evolutionary novelty. Science 321, 1314 (2008).
  • 8. Kvon E. Z. et al. Comprehensive In Vivo Interrogation Reveals Phenotypic Impact of Human Enhancer Variants. Cell 180, 1262–1271.e15 (2020).
  • 9. Osterwalder M. et al. Characterization of Mammalian In Vivo Enhancers Using Mouse Transgenesis and CRISPR Genome Editing. Methods Mol. Biol. 2403, 147–186 (2022).
  • 10. Snetkova V. et al. Ultraconserved enhancer function does not require perfect sequence conservation. Nat. Genet. 53, 521–528 (2021).
  • 11. Inoue F. & Ahituv N. Decoding enhancers using massively parallel reporter assays. Genomics 106, 159–164 (2015).
  • 12. Visel A. et al. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 457, 854–858 (2009).
  • 13. Attanasio C. et al. Fine tuning of craniofacial morphology by distant-acting enhancers. Science 342, 1241006 (2013).
  • 14. Visel A. et al. A high-resolution enhancer atlas of the developing telencephalon. Cell 152, 895–908 (2013).
  • 15. Visel A., Minovitsky S., Dubchak I. & Pennacchio L. A. VISTA Enhancer Browser--a database of tissue-specific human enhancers. Nucleic Acids Res. 35, D88–92 (2007).
  • 16. Spurrell C. H. et al. Genome-wide fetalization of enhancer architecture in heart disease. Cell Rep. 40, 111400 (2022).
  • 17. May D. et al. Large-scale discovery of enhancers from human heart tissue. Nat. Genet. 44, 89–93 (2011).
  • 18. Dickel D. E. et al. Genome-wide compendium and functional assessment of in vivo heart enhancers. Nat. Commun. 7, 12923 (2016).
  • 19. Trevino A. E. et al. Chromatin and gene-regulatory dynamics of the developing human cerebral cortex at single-cell resolution. Cell 184, 5053–5069.e23 (2021).
  • 20. Shrikumar A., Greenside P. & Kundaje A. Learning Important Features Through Propagating Activation Differences. in Proceedings of the 34th International Conference on Machine Learning (eds. Precup D. & Teh Y. W.) vol. 70 3145–3153 (PMLR, 06–11 Aug 2017).
  • 21. Firulli B. A., Redick B. A., Conway S. J. & Firulli A. B. Mutations within helix I of Twist1 result in distinct limb defects and variation of DNA binding affinities. J. Biol. Chem. 282, 27536–27546 (2007).
  • 22. Selleri L. & Rijli F. M. Shaping faces: genetic and epigenetic control of craniofacial morphogenesis. Nat. Rev. Genet. 24, 610–626 (2023).
  • 23. Prescott S. L. et al. Enhancer divergence and cis-regulatory evolution in the human and chimp neural crest. Cell 163, 68–83 (2015).
  • 24. Kim S. et al. DNA-guided transcription factor cooperativity shapes face and limb mesenchyme. Cell 187, 692–711.e26 (2024).
  • 25. Olson E. N. Gene regulatory networks in the evolution and development of the heart. Science 313, 1922–1927 (2006).
  • 26. Schneider R. A., Hu D. & Helms J. A. From head to toe: conservation of molecular signals regulating limb and craniofacial morphogenesis. Cell Tissue Res. 296, 103–109 (1999).
  • 27. Smith R. P. et al. Massively parallel decoding of mammalian regulatory sequences supports a flexible organizational model. Nat. Genet. 45, 1021–1028 (2013).
  • 28. Lettice L. A. et al. Opposing functions of the ETS factor family define Shh spatial expression in limb buds and underlie polydactyly. Dev. Cell 22, 459–467 (2012).
  • 29. Lettice L. A., Devenney P., De Angelis C. & Hill R. E. The Conserved Sonic Hedgehog Limb Enhancer Consists of Discrete Functional Elements that Regulate Precise Spatial Expression. Cell Rep. 20, 1396–1408 (2017).
  • 30. Spitz F. & Furlong E. E. M. Transcription factors: from enhancer binding to developmental control. Nat. Rev. Genet. 13, 613–626 (2012).
  • 31. Gorkin D. U. et al. An atlas of dynamic chromatin landscapes in mouse fetal development. Nature 583, 744–751 (2020).
  • 32. ENCODE Project Consortium et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).
  • 33. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
  • 34. Rebboah E. et al. The ENCODE mouse postnatal developmental time course identifies regulatory programs of cell types and cell states. bioRxiv (2024) doi: 10.1101/2024.06.12.598567.
  • 35. Farley E. K., Olson K. M., Zhang W., Rokhsar D. S. & Levine M. S. Syntax compensates for poor binding sites to encode tissue specificity of developmental enhancers. Proc. Natl. Acad. Sci. U. S. A. 113, 6508–6513 (2016).
  • 36. Farley E. K. et al. Suboptimization of developmental enhancers. Science 350, 325–328 (2015).
  • 37. Jindal G. A. et al. Single-nucleotide variants within heart enhancers increase binding affinity and disrupt heart development. Dev. Cell 58, 2206–2216.e5 (2023).
  • 38. Crocker J. et al. Low affinity binding site clusters confer hox specificity and regulatory robustness. Cell 160, 191–203 (2015).
  • 39. Boisclair Lachance J.-F., Webber J. L., Hong L., Dinner A. R. & Rebay I. Cooperative recruitment of Yan via a high-affinity ETS supersite organizes repression to confer specificity and robustness to cardiac cell fate specification. Genes Dev. 32, 389–401 (2018).
  • 40. Borok M. J., Tran D. A., Ho M. C. W. & Drewell R. A. Dissecting the regulatory switches of development: lessons from enhancer evolution in Drosophila. Development 137, 5–13 (2010).
  • 41. Rauluseviciute I. et al. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 52, D174–D182 (2024).
  • 42. Kircher M. et al. Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nat. Commun. 10, 3583 (2019).
  • 43. Ameen M. et al. Integrative single-cell analysis of cardiogenesis identifies developmental trajectories and non-coding mutations in congenital heart disease. Cell 185, 4937–4953.e23 (2022).
  • 44. Feng J., Liu T., Qin B., Zhang Y. & Liu X. S. Identifying ChIP-seq enrichment using MACS. Nat. Protoc. 7, 1728–1740 (2012).
  • 45. Lundberg S. M. & Lee S.-I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 4765–4774 (2017).
  • 46. Gupta S., Stamatoyannopoulos J. A., Bailey T. L. & Noble W. S. Quantifying similarity between motifs. Genome Biol. 8, R24 (2007).

Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
