Abstract
Transcriptional enhancers act as docking stations for combinations of transcription factors and thereby regulate spatiotemporal activation of their target genes1. It has been a long-standing goal in the field to decode the regulatory logic of an enhancer and to understand the details of how spatiotemporal gene expression is encoded in an enhancer sequence. Here we show that deep learning models2–6 can be used to efficiently design synthetic, cell-type-specific enhancers, starting from random sequences, and that this optimization process allows detailed tracing of enhancer features at single-nucleotide resolution. We evaluate the function of fully synthetic enhancers to specifically target Kenyon cells or glial cells in the fruit fly brain using transgenic animals. We further exploit enhancer design to create ‘dual-code’ enhancers that target two cell types and minimal enhancers smaller than 50 base pairs that are fully functional. By examining the state space searches towards local optima, we characterize enhancer codes through the strength, combination and arrangement of transcription factor activator and transcription factor repressor motifs. Finally, we apply the same strategies to successfully design human enhancers, which adhere to enhancer rules similar to those of Drosophila enhancers. Enhancer design guided by deep learning leads to better understanding of how enhancers work and shows that their code can be exploited to manipulate cell states.
Subject terms: Gene regulation, Genomics, Machine learning, Synthetic biology
Deep learning models were used to design synthetic cell-type-specific enhancers that work in fruit fly brains and human cell lines, an approach that also provides insights into these gene regulatory elements.
Main
Cell-type-specific expression of a target gene is achieved when a unique combination of transcription factors (TFs) activates a specific enhancer, whereas this enhancer remains either passively (default-off7,8) or actively repressed in other cell types (for example, through repressor binding9 or corepressor/polycomb recruitment). Typically, when an enhancer is translocated to another chromosome or to an episomal plasmid, it maintains cell-type-specific control of its nearby reporter gene1,10. Therefore, its regulatory capacity is contained in the enhancer DNA sequence and has co-evolved to respond uniquely to a specific trans-environment in a cell type. A thorough understanding of how enhancer activation is encoded in its DNA sequence is important, as it is a key component for the modelling and prediction of gene expression2,11; for the interpretation of non-coding genome variation12,13; for the improvement of gene therapy; and for the reconstruction and manipulation of dynamic gene regulatory networks underlying developmental, homeostatic and disease-related cell states.
Many complementary approaches and techniques have been used to decode enhancer logic1. These include studies of individual enhancers by mutational analysis14–16, in vitro TF binding (for example, electrophoresis mobility shift assay), cross-species conservation17 and reporter assays. The upscaling of such studies led to the identification of common features of coregulated enhancers18–20. These experimental findings also triggered the improvement of computational methods for the prediction of cis-regulatory modules, whereby feature selection and parameter optimization led to new insights into how binding sites cluster and how their strength (or binding energy) impacts enhancer function15,16,21–24. Wider adoption of genome-wide profiling of chromatin accessibility25, single-cell chromatin accessibility26–28, histone modifications29,30, TF binding31 and enhancer activity19,32 led to significantly larger training sets of coregulated enhancers that could then be used for a posteriori discoveries of TF motifs and enhancer rules, aided by the growing resources of high-quality TF motifs33,34. Further mechanistic insight has been provided by thermodynamic modelling of enhancers35,36, in vivo imaging of enhancer activity37, the analysis of genetic variation through expression quantitative trait loci and chromatin-accessibility quantitative trait loci analysis12,38 and high-throughput in vitro binding assays39,40. Recently, the enhancer biology field embraced the use of convolutional neural networks (CNN) and network-explainability techniques that again provided a substantial leap forward in terms of prediction accuracy and syntax formulation2–6,41–44.
An orthogonal strategy to decode enhancer logic is to engineer synthetic enhancers from scratch. This approach has the advantage that the designer knows exactly which features are implanted, so that the minimal requirements for enhancer function can be revealed. Recent work showed the promise of CNN-driven enhancer design by successfully designing yeast promoters45 and by using a CNN to select high-scoring enhancers for S2 cells, from a large pool of random sequences4. Here, we tackle the next challenge in enhancer design: to design enhancers that are cell-type specific. To this end, we used previously trained deep learning models for which we have already validated the accuracy of nucleotide-level interpretation and motif-level predictions12,5 (Supplementary Note 1). Using these enhancer models as a guide (or ‘oracle’), we tested three different sequence design approaches46,47 (Fig. 1).
In silico evolutions
As a first strategy for enhancer design, we created synthetic enhancers to specifically target Kenyon cells (KC) in the mushroom body of the fruit fly brain, using a nucleotide-by-nucleotide sequence evolution approach45 (Methods). This approach starts from a 500 base pair (bp) random sequence that is evolved from scratch (EFS) in silico towards a chosen cell type through several iterations. Prediction scores are calculated using DeepFlyBrain5, a deep learning model trained on differentially accessible regions across several cell types of the Drosophila brain and that can recognize motif-level nucleotide arrangements for many cell types (Supplementary Note 1). At each iteration we performed saturation mutagenesis13,44,48 whereby all nucleotides were mutated one by one and each sequence variation was scored by DeepFlyBrain to select the mutation with the greatest positive delta score for the KC class (among 81 classes representing different cell types that the model learned to predict). We performed this procedure starting from 6,000 GC-adjusted random sequences and observed that after 15 iterations, DeepFlyBrain KC prediction scores increased from around the minimal score (0) to nearly the maximum score (1), while remaining low for other cell types (Fig. 2a and Extended Data Fig. 1a,b). We found this greedy search to provide a good balance between computational cost and ability to efficiently yield high-scoring sequences, compared to alternative state space searches (Extended Data Fig. 2a–d and Methods).
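The greedy search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: `toy_score` is a hypothetical stand-in for a model's class output (it is not DeepFlyBrain), the activator word is an arbitrary placeholder, and only the repressor word CAATTA is borrowed from the text, with a made-up weight. The loop itself follows the description: at each iteration every single-nucleotide variant is scored and the one with the greatest positive delta score is kept.

```python
import random

ALPHABET = "ACGT"
ACTIVATOR = "ACGTGAC"  # arbitrary placeholder word, not a real motif
REPRESSOR = "CAATTA"   # repressor word named in the text; its weight here is made up


def toy_score(seq):
    """Hypothetical oracle in [0, 1]; a stand-in for a trained model's class score."""
    k = len(ACTIVATOR)
    best = max(
        sum(a == b for a, b in zip(seq[i:i + k], ACTIVATOR))
        for i in range(len(seq) - k + 1)
    )
    # Each exact repressor occurrence halves the score.
    return (best / k) * 0.5 ** seq.count(REPRESSOR)


def greedy_evolve(seq, score_fn, n_iterations=15):
    """Apply up to n_iterations single-nucleotide mutations, one per iteration."""
    history = [score_fn(seq)]
    for _ in range(n_iterations):
        base = history[-1]
        best_delta, best_seq = 0.0, None
        # Saturation mutagenesis: score every possible single substitution.
        for pos, ref in enumerate(seq):
            for alt in ALPHABET:
                if alt == ref:
                    continue
                mutant = seq[:pos] + alt + seq[pos + 1:]
                delta = score_fn(mutant) - base
                if delta > best_delta:
                    best_delta, best_seq = delta, mutant
        if best_seq is None:  # no substitution improves the score: local optimum
            break
        seq = best_seq
        history.append(score_fn(seq))
    return seq, history


rng = random.Random(0)
start = "".join(rng.choice(ALPHABET) for _ in range(60))  # short toy sequence
evolved, history = greedy_evolve(start, toy_score)
print(round(history[0], 2), "->", round(history[-1], 2))
```

Each iteration costs three model evaluations per position, so a 500 bp sequence needs 1,500 predictions per step; with a real model these would be batched rather than scored one by one as in this sketch.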
Next, we investigated the initial (random) sequence and the specific paths that are followed through the search space towards local optima. For only a small fraction (3%) of random sequences, the prediction score remained below 0.5 even after 15 mutations (Extended Data Fig. 1c). These sequences were mostly characterized by more instances of repressor binding sites together with an increased number of mutations required to generate sufficient activator binding sites. A second observation is that, even though 500 bp space is given to the model, the selected mutations accumulated in about 200 bp space, preferentially at the centre of the random sequence (Extended Data Fig. 1d,e).
We investigated the consequences of each mutation on shaping the enhancer code using DeepExplainer-based contribution scores (Fig. 2b; Methods). This revealed that initial random sequences harbour several short repressor binding sites by chance and these are preferentially destroyed during the first iterations (Extended Data Fig. 1f,g). These repressor sites contribute negatively to the KC class prediction and represent candidate binding sites for KC-specific repressor TFs such as Mamo and CAATTA5. The nucleotides with the highest impact represent mutations that destroy a repressor binding site and simultaneously generate a binding site for the key activators Eyeless (Ey), Mef2 or Onecut. Eventually, DeepExplainer highlighted several candidate activator binding sites, whereby Ey, Mef2 and Onecut sites dominate (Fig. 2b and Extended Data Fig. 1f,g).
To test whether the in silico evolved enhancers can drive reporter gene expression in vivo, we randomly selected 13 sequences after 10 or 15 iterations (Fig. 2c and Supplementary Figs. 1 and 2) and integrated them into the fly genome with a minimal promoter and a GFP reporter gene (Methods). Investigating the GFP expression pattern by confocal imaging showed that 10 of these 13 tested synthetic enhancers were active specifically in the targeted cell type, the KC (Fig. 2d and Extended Data Fig. 1h). Some enhancers did not show activity after ten mutations but became active after an extra five mutations (Fig. 2d, Extended Data Fig. 1i,j and Supplementary Fig. 3). The three enhancers without GFP signal in KC were found to also be Dachshund negative, indicating the potential loss of KC (Extended Data Fig. 1k). Using assay for transposase-accessible chromatin by sequencing (ATAC-seq) on the brains of the transgenic lines, we verified that the synthetic enhancers become accessible when integrated into the genome (Extended Data Fig. 1l), as predicted by the model.
We also generated transgenic lines to test enhancers at different steps during the evolutionary design process (Supplementary Figs. 4 and 5). We found that random sequences or sequences with only few mutations remain inactive, whereas enhancer activity is initiated when repressor sites are removed and Ey and Mef2 sites are generated; and activity further increases with more and stronger instances of activator motifs (Extended Data Fig. 1m,n).
To demonstrate that enhancers can be generated for other cell types, we started from the same random sequences as above and evolved them into perineurial glia (PNG) enhancers (Extended Data Fig. 2e). After 15 mutations, putative PNG repressor sites have been destroyed and activator sites have been generated (Fig. 2e and Supplementary Fig. 6). We validated six designed sequences by creating transgenic GFP reporter flies and confirmed that four were positive, as they drive GFP specifically in PNG cells (Fig. 2f and Extended Data Fig. 2f). Because the same random sequence was evolved into either KC or PNG enhancers, this experiment underscores that the chosen mutations and the candidate binding sites they destroy or generate, causally underlie the activity of these synthetic enhancers.
Given that KC enhancers can arise from random sequences after 10 or 15 mutations, we proposed that certain genomic regions may require even fewer mutations to acquire KC enhancer activity. We scanned the entire fly genome and identified regions with high prediction scores but without chromatin accessibility in KC (Extended Data Fig. 2g,h; Methods). By applying sequence evolution to these sequences, three of four sequences became positive KC enhancers with only six mutations (Fig. 2g,h, Extended Data Fig. 2i,j and Supplementary Fig. 7). When the negative enhancer was further evolved, with an extra five mutations, it also became positive (Fig. 2g and Extended Data Fig. 2i,j). This suggests that KC enhancers, and probably other cell-type enhancers as well, can arise de novo in the genome with few mutations.
To summarize the changes that happened during the design process, we performed motif discovery across all 6,000 sequences, at each step of the optimization path (Extended Data Fig. 1f,g). This confirmed that repressor sites are often present in random sequences and that they are preferentially destroyed during the first steps of the search algorithm. To experimentally test that these short repressor sites functionally cause repression, we selected three positive synthetic enhancers and three of the near-enhancers rescued from the genome and evolved these to become non-functional by manually choosing the mutations that decrease the prediction score by creating repressor binding sites (Extended Data Fig. 2i and Supplementary Figs. 8 and 9). We avoided mutating any of the predicted activator sites (Fig. 3a); thus, we placed repressor motifs in between activator sites. New transgenic lines with these sequences integrated into the genome confirm that all tested enhancers have entirely lost their activity (Fig. 3b). This shows that enough repressor sites can dominate over a functional combination of activator sites.
The sequence evolution strategy thus represents an intuitive and efficient approach to generate cell-type-specific enhancers and to characterize their functional constituents.
Multiple cell-type codes
A single enhancer can be active in several different cell types49, and our earlier work suggested that this can be achieved by enhancers that contain several codes for different cell types, intertwined in a single approximately 500 bp sequence5. On the basis of this finding, we wondered whether a genomic enhancer that is active in a single cell type, could be synthetically augmented to become also active in a second cell type. To test this, we started with two optic lobe enhancers (amon and CG15117) that are accessible and active in T4/T5 and T1 neurons, respectively5, and whose activity per cell type is also predicted correctly by DeepFlyBrain (Fig. 3c–e and Extended Data Fig. 3a–c). We then performed in silico evolution on these enhancers towards KC, while simultaneously maintaining a high prediction score for the original cell type. After 13 and 14 mutations, the enhancers were also predicted as KC enhancers but retained T4 and T1 binding sites. Testing the augmented sequences in vivo with a GFP reporter confirmed the spatial expansion of the enhancer activity to KC (Fig. 3f,g, Extended Data Fig. 3c–f and Supplementary Fig. 10; Methods).
Reciprocally, enhancers active in several cell types may be pruned towards a single cell-type code. We searched for genomic enhancers that score high for several cell types (Fig. 3h–l). We selected a Pkc53e enhancer that is accessible and active in both optic lobe T neurons and KC and predicted correctly by the model. This time, we drove the in silico evolution to maintain the KC prediction score, while decreasing the T neurons prediction score (Methods). After nine mutations, the sequence was predicted to have only KC activity (Fig. 3m). Nucleotide contribution scores show that the most important binding sites for KC were unaffected after nine mutations, whereas the activator binding sites were destroyed and new repressor binding sites were created for T neurons (Extended Data Fig. 3g). Testing the final sequence in vivo confirmed the spatial restriction of the enhancer activity (Fig. 3n). Together, our results indicate that, guided by the DeepFlyBrain model, intertwined enhancer codes can be independently dissected and altered.
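The pruning of a dual-code enhancer can be illustrated with a two-class variant of the greedy search. The sketch below is hypothetical throughout: the two placeholder words stand in for two cell-type codes, `toy_class_scores` is not DeepFlyBrain, and the combined objective (score of the class to keep minus score of the class to drop) is an assumption, since the exact objective is described only in Methods. Augmenting a single-code enhancer with a second code would use the same loop with an objective that rewards both classes.

```python
ALPHABET = "ACGT"
# Placeholder words standing in for two cell-type codes; not real motifs.
CODE_WORDS = {"keep": "ACGTGAC", "drop": "TTGCAAG"}


def match_fraction(seq, word):
    """Best fraction of matching positions between word and any window of seq."""
    k = len(word)
    return max(
        sum(a == b for a, b in zip(seq[i:i + k], word))
        for i in range(len(seq) - k + 1)
    ) / k


def toy_class_scores(seq):
    """Hypothetical per-class scores; a stand-in for a multi-class model output."""
    return {name: match_fraction(seq, word) for name, word in CODE_WORDS.items()}


def pruning_objective(scores):
    # Assumed objective: reward the class to keep, penalize the class to drop.
    return scores["keep"] - scores["drop"]


def single_mutants(seq):
    """Yield every single-nucleotide substitution of seq."""
    for pos, ref in enumerate(seq):
        for alt in ALPHABET:
            if alt != ref:
                yield seq[:pos] + alt + seq[pos + 1:]


def evolve(seq, objective, n_mutations):
    """Greedy search: accept the best single mutation while the objective improves."""
    for _ in range(n_mutations):
        current = objective(toy_class_scores(seq))
        best = max(single_mutants(seq), key=lambda m: objective(toy_class_scores(m)))
        if objective(toy_class_scores(best)) <= current:
            break
        seq = best
    return seq


dual = "AAAA" + CODE_WORDS["keep"] + "CCCC" + CODE_WORDS["drop"] + "AAAA"
pruned = evolve(dual, pruning_objective, n_mutations=3)
before, after = toy_class_scores(dual), toy_class_scores(pruned)
print(before, after)
```

In this toy setting the selected mutations fall inside the word for the dropped class and leave the kept word intact, which mirrors the observation that the most important KC binding sites were unaffected while the T neuron code was dismantled.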
Motif implantation
As a second strategy, we used a classical motif implantation approach to design KC enhancers. The rationale behind this strategy is based on our results above: nucleotide-by-nucleotide sequence evolution showed that all the selected mutations were associated with the creation or destruction of a TF binding site, rather than affecting contextual sequence between motif instances (Fig. 2b,e,h and Extended Data Fig. 3d,e,g). This suggested that a combination of appropriately positioned activator motifs, without the presence of repressor motifs, would be sufficient to create a cell-type-specific enhancer. Furthermore, we reasoned that by applying this design strategy to thousands of random sequences we could gain more insight into the KC enhancer logic. To this end, we iteratively implanted strong TF binding site instances in 2,000 random sequences, selecting locations with the highest prediction score towards the KC class. We first implanted a single binding site for one of the four key activators of KC enhancers, namely, Ey, Mef2, Onecut and Sr5, and then specific combinations of sites in a particular implantation order (Extended Data Fig. 4a; Methods). This revealed that Ey and Mef2 had the strongest effect on the prediction score, while Onecut and Sr increased the prediction score only marginally (Fig. 4a). Implanting Ey and Mef2 consecutively increased the score more than the sum of their individual contribution and their implantation order did not affect the final score. Adding Onecut and then Sr on top of Ey and Mef2 sites increased the scores even further until it reached the level that we obtained above after 15 mutations through in silico sequence evolution (Fig. 4a). We could also observe some minor preferences in the motif flanking sequence (for example, Mef2 is flanked by T or G in 5′ and A or C in 3′; Extended Data Fig. 4a).
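The iterative implantation procedure can be sketched as a scan over all positions for each motif in turn. This is a minimal sketch under stated assumptions: the motif strings are placeholders (not the real Ey, Mef2 or Onecut sites), `toy_score` is a hypothetical oracle that simply rewards motif presence and proximity and penalizes the repressor word CAATTA, and protecting earlier implants from being overwritten is a design choice of this sketch rather than a documented step.

```python
import random

# Placeholder motif instances; these are NOT the real Ey, Mef2 or Onecut sites.
MOTIFS = {"A": "TGCATGCA", "B": "CTAAATAG", "C": "ATCGATAC"}
REPRESSOR = "CAATTA"  # repressor word named in the text; its weight here is made up


def toy_score(seq):
    """Hypothetical oracle: more motifs, closer together, fewer repressors."""
    found = [seq.find(m) for m in MOTIFS.values() if m in seq]
    if not found:
        return 0.0
    presence = len(found) / len(MOTIFS)
    spread = min(max(found) - min(found), 100)
    proximity = 1.0 - 0.5 * spread / 100
    return presence * proximity * 0.5 ** seq.count(REPRESSOR)


def implant(seq, motif, pos):
    """Overwrite seq with motif at pos, keeping the sequence length fixed."""
    return seq[:pos] + motif + seq[pos + len(motif):]


def best_implant(seq, motif, score_fn, protected):
    """Try every position that avoids earlier implants; return the best one."""
    best = None
    for pos in range(len(seq) - len(motif) + 1):
        end = pos + len(motif)
        if any(pos < p_end and p_start < end for p_start, p_end in protected):
            continue
        candidate = implant(seq, motif, pos)
        score = score_fn(candidate)
        if best is None or score > best[0]:
            best = (score, pos, candidate)
    return best


def implant_in_order(seq, order, score_fn):
    """Implant motifs one after another, each at its highest-scoring position."""
    protected, scores = [], [score_fn(seq)]
    for name in order:
        score, pos, seq = best_implant(seq, MOTIFS[name], score_fn, protected)
        protected.append((pos, pos + len(MOTIFS[name])))
        scores.append(score)
    return seq, scores


rng = random.Random(0)
background = "".join(rng.choice("ACGT") for _ in range(200))
designed, scores = implant_in_order(background, ["A", "B", "C"], toy_score)
print([round(s, 2) for s in scores])
```

Because every candidate position is scored in the context of the full sequence, the chosen location reflects both the neighbouring motifs and any repressor words that the implant happens to overwrite or create.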
We also found that high-scoring configurations consisted of activator sites that are positioned close together within a distance usually smaller than 100 bp (Fig. 4b,c and Extended Data Fig. 4b). When the Ey and Mef2 pair were implanted on the same strand, we observed strong preference for a 5 bp distance (or 4 bp when implanted on opposite strands) between the two binding sites whereby Mef2 was located upstream of Ey (Fig. 4b and Extended Data Fig. 4c). For the Ey and Onecut pair, there was a strong preference for a 3 bp space and Onecut preferred the downstream side of Ey (Fig. 4c and Extended Data Fig. 4d).
We investigated the nucleotide contribution scores before and after motif implantations for an example sequence with high prediction score in which motifs were inserted close together (Fig. 4d,e and Supplementary Fig. 11). The initial random sequence contained several repressor binding sites and the Ey binding site implantation destroyed the strongest repressor binding site. Mef2 and Onecut implantations followed the predicted spacing relative to Ey, with a distance of 5 and 3 bp, respectively. This can explain why implantation of motifs at random locations yields lower scoring sequences (Fig. 4a). Even though some repressor binding sites were still present at further distances, their relative negative contribution was decreased after the activator binding site implantations (Fig. 4e). Testing this designed 500 bp sequence in vivo confirmed specific activity in KC (Fig. 4f). Introduction of mutations to generate repressor sites close to the implanted motifs (none of the activator sites was modified) resulted in complete loss of enhancer activity in vivo, suggesting dominance of repressor motifs (Fig. 4d,e,g). Furthermore, a 49 bp subsequence, containing just the three binding sites, resulted in the same activity and specificity in vivo (Fig. 4h,i and Supplementary Fig. 12). We further confirmed the robustness of the motif implanting design by validating in vivo a second 500 bp sequence showing increased spacing between motifs (Extended Data Fig. 4e,f,g). This result suggests that a functional KC enhancer can be created through motif-by-motif implantation with just these three binding sites and its size can be decreased to the minimal length required to contain these binding sites.
As a third strategy for enhancer design, we used generative adversarial networks (GAN) that have been shown to be powerful generators in different fields43,48, including the generation of functional genomic sequences46. This method was less interpretable than in silico evolution or motif implanting but still allowed the generation of functional and specific enhancers (Supplementary Note 2).
Human enhancer design
We used our previously trained and validated melanoma deep learning model, DeepMEL2 (ref. 12) (Supplementary Note 1) with the same three strategies as before, to design human melanocyte or melanocyte-like melanoma (MEL) enhancers. As for the Drosophila experiments, we started from GC-adjusted random sequences (Extended Data Fig. 5a) and, by following the nucleotide-by-nucleotide sequence evolution approach, we evolved them into sequences with high prediction scores for the MEL class. This process drove the generation of activator binding sites (SOX10, MITF and TFAP2) and the destruction of ZEB motifs to resemble MEL genomic enhancers; the prediction scores started to plateau after 15 mutations (Fig. 5a and Extended Data Fig. 5b,c). We randomly selected ten regions that were evolved from scratch (EFS-1–10) with 15 mutations and tested their activity with a luciferase assay in vitro, in a MEL cell line (MM001) (Fig. 5b,c and Methods). Seven of ten tested enhancers showed activity in the range of previously characterized positive control (native) enhancers and none of them showed activity in a cell line that represents another melanoma cell state (mesenchymal-like, MM047) in which the MEL-specific TFs (SOX10, MITF and TFAP2) are not expressed (Fig. 5d and Extended Data Fig. 5d). When we integrated these synthetic enhancers into the genome of the MM001 cell line using lentiviral vectors (Methods), they generated an ATAC-seq peak, whereas neither the random sequences nor the evolved sequence when integrated in a non-MEL cell line are accessible (Fig. 5e and Extended Data Fig. 5e,f).
Next, we tested the activity of a series of synthetic sequences, along the design path, from a random sequence to an active enhancer (Extended Data Fig. 6 and Supplementary Figs. 13 and 14). This shows that the predicted activity by DeepMEL2 correlates with the luciferase reporter activity in vitro (Fig. 5f and Extended Data Fig. 5g), suggesting that the steps of increased activity are not biased to our DeepMEL2 model but reflect biological activity. Functional in silico evolved enhancers lost their activity and accessibility, when ZEB sites were generated in proximity of activator sites (Fig. 5e,f and Extended Data Figs. 5g and 8) and this repressive mechanism depended on the number and the strength of repressor sites (Extended Data Fig. 8a,b–e and Supplementary Fig. 15). We confirmed that the same principles of repression apply to genomic enhancers, using the MEL enhancer in an IRF4 intron as example and through ChIP–seq we identified ZEB2 as the actual repressor TF (Fig. 5g,h and Supplementary Note 3). Mutating the endogenous ZEB2 site in the IRF4 enhancer causes a significant increase in activity, whereas mutations that generate more ZEB2 sites (without touching activator sites) decrease its activity (Fig. 5i and Supplementary Note 3).
These findings could be further corroborated by scoring all sequences during the optimization process with two other deep learning models, namely, a newly trained ChromBPNet model50 on bulk MM001 ATAC-seq data (Methods) and the previously published Enformer model, for which the SK-MEL-5 ATAC-seq class represents the MEL state2. The Enformer model has a receptive field of 200 kilobases (kb) and can be used to predict both enhancer activity and target gene expression in the context of an entire gene locus. To simulate whether our synthetic enhancers do function like genomic enhancers in a complex locus, we replaced the IRF4 enhancer studied above with synthetic enhancers, thus performing an in silico CRISPR experiment. Replacement of the IRF4 enhancer by a random sequence results in no predicted accessibility, whereas replacement by different synthetic enhancers along their design path gradually obtains increased prediction scores for accessibility, H3K27Ac signal and CAGE gene expression (Fig. 5j,k and Extended Data Fig. 7b). Because Enformer contains more than 600 chromatin accessibility (DNase hypersensitivity) output classes, across a wide variety of cell types, we used it to assess the specificity of our designed enhancers and found high prediction scores for only four classes, each representing either melanocytes or melanocyte-like melanoma cell states (Fig. 5l and Extended Data Fig. 7a). The ChromBPNet model shows continuous increases of predicted enhancer activity along the optimization path (Fig. 5m). Again, all three models correctly predict that synthetic enhancers, after they reach their highest activity level, can be switched off entirely by introducing point mutations that generate ZEB binding sites (Fig. 5j,k,m and Extended Data Fig. 7a,b). 
Furthermore, changing the location of the enhancer relative to the transcriptional start site did not alter its functionality, suggesting that the enhancers are not dependent on the local sequence context around the IRF4 enhancer location to be functional (Extended Data Fig. 7c). As a final example of in silico evolution, we identified a human ‘near-enhancer’ and rescued its activity with only four mutations (Extended Data Fig. 9a–d).
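The in silico replacement experiment described above amounts to splicing a synthetic sequence into a fixed window of a longer locus sequence before scoring. The sketch below shows only that splicing step; the function name, the toy locus and the coordinates are hypothetical, and the subsequent scoring of the edited locus by a sequence-to-activity model is not shown. Requiring an equal-length insert is an assumption made here so that downstream coordinates, such as the transcriptional start site, do not shift.

```python
def replace_region(locus, start, end, insert):
    """Swap locus[start:end] for an equal-length insert so coordinates do not shift."""
    if not 0 <= start <= end <= len(locus):
        raise ValueError("region lies outside the locus")
    if len(insert) != end - start:
        raise ValueError("insert must match the length of the replaced region")
    return locus[:start] + insert + locus[end:]


# Toy locus: a 'native enhancer' (lower case for visibility) inside neutral flanks.
native = "ggggcccc"
locus = "A" * 10 + native + "T" * 12
enh_start, enh_end = 10, 10 + len(native)

synthetic = "ACGTGACA"  # placeholder for a designed enhancer of the same length
edited = replace_region(locus, enh_start, enh_end, synthetic)

# Moving the synthetic enhancer: restore a neutral sequence at the original
# site, then place the enhancer at a different offset in the locus.
neutral = replace_region(locus, enh_start, enh_end, "A" * len(native))
moved = replace_region(neutral, 20, 20 + len(synthetic), synthetic)
print(edited)
print(moved)
```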
We also applied the motif implantation strategy to design human enhancers. We implanted SOX10, MITF and TFAP2 binding sites to 2,000 random sequences of 500 bp. Whereas implanting only MITF or TFAP2 resulted in a small increase in the prediction score, implanting SOX10 alone had the strongest effect (Fig. 5n). Adding MITF and then TFAP2 on top of SOX10 sites increased the prediction scores to 0.6 on average. The prediction scores continued increasing even further after adding another set of SOX10, MITF and TFAP2 binding sites (Fig. 5n). We did not observe a preferential location for the implantation of MITF or TFAP2 relative to SOX10; however, both binding sites were located within 100 bp of SOX10 (Fig. 5o). The second SOX10 binding site was placed further away at a 200–250 bp distance relative to the first SOX10 (Fig. 5o). We selected four sequences with either single or double SOX10, MITF and TFAP2 implanted sites and tested their activity with luciferase assays. All enhancers showed activity in the range of native enhancers and adding the binding sites twice consistently increased the activity of the enhancers (Fig. 5p and Extended Data Fig. 10a,b,c). Replacing the implanted binding sites with their weaker versions taken from a native enhancer (IRF4) decreased the activity of the enhancers dramatically (Extended Data Fig. 10a,b,c). To confirm that the activity of the enhancers was driven by the implanted binding sites, we cut the sequences from the most upstream binding site to the most downstream binding site. These subsequences (116–164 bp) were also active with a slight change in their activity levels (Extended Data Fig. 10a,b,c). Finally, instead of choosing the best location for MITF and TFAP2 implantation, we implanted them at the closest location to the SOX10 binding site that would result in a positive change in the prediction score. These minimal enhancers (51–64 bp) were as active as their longer (500 bp) version (Extended Data Fig. 10a,b,c).
Finally, we applied the GAN-based sequence generation approach to the generation of human enhancers and obtained similar performances as with the Drosophila GAN-generated enhancers (Supplementary Note 2).
In conclusion, these results show that enhancer design strategies are adaptable to different biological systems and even other species, including human.
Discussion
Understanding the code of transcriptional regulation and using this knowledge to design synthetic enhancers has been a persistent challenge. We successfully designed synthetic enhancer sequences in human and fly guided by deep learning models. By combining a stepwise enhancer design approach alongside model interpretation techniques, we followed the trajectories of in silico enhancer emergence in Drosophila and human, towards local optima. Nucleotide-by-nucleotide evolution revealed that the selected mutations predominantly destroy candidate repressor TF binding sites and create candidate activator sites. Mostly, ten iterative mutations were sufficient to convert a random sequence into a cell-type-specific functional enhancer. Similarly, for native yeast promoter sequences, it was recently shown that only four mutations could dramatically increase or decrease their activities45. This evolutionary design process may represent an optimized version of natural evolution of genomic enhancers. We found that the fly and human genomes contain ‘near-enhancers’ that require few mutations to become functional.
The location, orientation, strength and number of TF motifs in a single enhancer and their distance to other motifs are important features determining an enhancer code that is unique to each cell type. This array of well-arranged TF binding sites constitutes a docking platform for a specific combination of TFs. Their cooperative binding makes the enhancer accessible/active at different levels and in different cell types. We found certain enhancers to be active in several cell types. Besides the trivial possibility whereby two cell types share a common set of TFs that bind to a common set of sites (for example, different KC subtypes), we showed that some enhancers have evolved several intertwined codes (for example, KC and T neurons). We could prove this by either removing a code from a native dual-code enhancer or adding a second code to a native single-code enhancer.
The consequence of this motif-driven enhancer model is that it allows enhancer design by motif implantation. Several studies have used motif implantation in an attempt to reconstitute enhancer activity but successes of accurate in vivo activity have been limited51,52. More recently, motif embedding has also been used in combination with deep learning models4,6,53 with the advantage that many different motif implanting scenarios can be tested in silico, before performing experimental validation4,6,43,53, as compared to high-throughput testing of random implantations32,54,55. By exploiting motif implantation further, particularly by scoring each possible implant position, as well as combinations of motifs, we could reveal motif synergies (for example, Ey + Mef2 or SOX10 + MITF), as well as preferred orientations and distances between motifs, motif strengths and motif copy number. A minimal fly brain enhancer designed with three abutting motif instances illustrates that functional enhancers can be created without further sequence context. Compared to random insertions of motif instances52,56, deep learning guided implantation has the capacity to take the entire enhancer sequence into account. Consequently, what makes an enhancer is not only the optimal combination of motifs used (including each motif’s strength and copy number) but also the optimal balance between repressor and activator motifs and the optimal motif arrangement.
Two of 13 KC enhancers remain negative, whereas one is inconclusive. Nevertheless, this leads to a conservative success rate greater than 75%. We also envision several routes for further improvement in enhancer design. First, whereas our examples focused on adult cell types, we did not consider temporal changes. It thus remains to be investigated whether developmental enhancers with highly dynamic and complex output functions can be decoded and designed along the same principles. Studies of the shavenbaby enhancer in Drosophila showed that its output is affected by mutations in most of its nucleotides57. This may be due to a densely packed motif content, such as our minimal enhancer or to yet-unknown sequence features. It may be interesting to investigate such developmental enhancers with deep learning models58. Also, we observed slight variations in the GFP output pattern of (genomic and synthetic) enhancers. Incorporating such high-resolution variations in the training data may yield models with improved spatial and quantitative resolution. Lastly, the repressor motifs identified by our models recruit TFs that cause a decrease in chromatin accessibility. However, this is probably not true for all transcriptional repressors (for example, binding sites of the REST repressor overlap with accessible chromatin59). A future challenge will be to take repressor motifs into account that do not decrease chromatin accessibility. To train such models, more enhancer activity data or gene expression data will be needed.
The successful application of enhancer design on both fly brain and human cancer cells has shown that simple, yet powerful strategies guided by deep learning models are adaptable to different organisms or systems. Our proof-of-concept study is an encouraging step forward towards the development of organism-wide deep learning models. Such models will facilitate the generation of synthetic enhancers during development, disease and homeostasis; and will further improve our understanding and control of the genomic cis-regulatory code.
Methods
Data reporting
No statistical methods were used to predetermine sample size. The number of synthetic enhancers that were tested using transgenic flies was set to at least six per cell type and was bounded by the feasibility of the transgenic animal generation experiments. In total, 68 transgenic fly lines were generated. The number of synthetic enhancers that were tested with luciferase assays was set to at least ten per category (in silico evolution, motif embedding, GAN, repressors and mutational steps). In total, 97 sequences were tested using luciferase assays. The initial random sequences (used for sequence evolution and motif implantation) were sampled from the sequence space that matches the GC content of the genomic sequences. Flies fitting the sex (equal numbers of male and female flies) and age (less than 10 days) criteria were selected randomly for all experiments. In this study, we did not perform experiments that needed to be allocated into different groups. The investigators were blinded when performing cloning, transfection, antibody staining and luciferase experiments by using enhancer IDs.
Statistics and reproducibility
Statistics were calculated using Scipy (v.1.6.0; RRID: SCR_008058)60. The results here and throughout the manuscript were visualized using matplotlib (v.3.1.1; RRID: SCR_008624)61. The deep learning models were run in a conda environment in which python (v.3.7; RRID: SCR_008394), tensorflow-gpu (v.1.15; RRID: SCR_016345)62, numpy (v.1.19.5; RRID: SCR_008633)63, ipykernel (v.5.1.2; RRID: SCR_024813) and h5py (v.2.10.0; RRID: SCR_024812) packages were installed. The same results were obtained from different replication experiments. Several brains (at least ten) were stained and imaged for the fly experiments. Three biological replicates were performed for the main luciferase experiments. Two biological replicates were performed for the negative control luciferase experiments. No biological replicates were performed for ATAC-seq or ChIP–seq experiments.
In silico saturation mutagenesis
To measure the effect of each possible single mutation on a given DNA sequence, we performed in silico saturation mutagenesis, as described earlier13,48,64. We first generated the sequences of all single mutations for a given 500 bp sequence (three possible mutations for each nucleotide, making 1,500 sequences in total). We scored these sequences and the initial sequence with the deep learning models. For a chosen class, we calculated the delta prediction score by subtracting the score of the initial sequence from the score of the mutated sequence for each mutation.
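The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: `toy_score` is a hypothetical stand-in (GC fraction) for the deep learning model's prediction for the chosen class, and the function names are ours.

```python
from typing import Callable, Dict, Tuple

NUCLEOTIDES = "ACGT"


def saturation_mutagenesis(
    seq: str, score: Callable[[str], float]
) -> Dict[Tuple[int, str], float]:
    """Delta score (mutant minus initial) for every single-nucleotide substitution."""
    base_score = score(seq)
    deltas = {}
    for pos, ref in enumerate(seq):
        for alt in NUCLEOTIDES:
            if alt == ref:
                continue  # only the three non-reference nucleotides are mutations
            mutant = seq[:pos] + alt + seq[pos + 1:]
            deltas[(pos, alt)] = score(mutant) - base_score
    return deltas


def toy_score(seq: str) -> float:
    """Hypothetical scorer (GC fraction); a real analysis would call the model here."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# A 500 bp sequence yields 500 x 3 = 1,500 single mutants.
deltas = saturation_mutagenesis("ACGT" * 125, toy_score)
```

In practice the 1,500 mutants would be scored in one batch rather than one call per sequence; the loop form is kept here only for readability.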
Random sequence generation
We generated random 500 bp sequences to use as a starting set for the in silico sequence evolution and motif implantation by using the numpy.random.choice(["A","C","G","T"]) command. For each position, instead of using a 25% probability for each nucleotide, we used the position-specific nucleotide frequencies observed in fly or human genomic regions. In these genomic regions, the GC content was on average higher in the centre of the regions than in the flanks. We used 6,126 KC regions for fly and 3,885 MEL regions for human that we identified in our previous publications3,5.
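A minimal sketch of position-specific sampling is given below. It assumes equal-length input regions containing only A, C, G and T, and uses NumPy's `Generator.choice` with the `p` argument instead of the legacy `numpy.random.choice` call named above; the four toy regions and the helper names are hypothetical.

```python
import numpy as np

NUCLEOTIDES = np.array(["A", "C", "G", "T"])


def position_frequencies(regions):
    """Per-position nucleotide frequencies (length x 4) from equal-length sequences."""
    counts = np.zeros((len(regions[0]), 4))
    for seq in regions:
        for pos, nt in enumerate(seq):
            counts[pos, "ACGT".index(nt)] += 1
    return counts / counts.sum(axis=1, keepdims=True)


def sample_sequence(freqs, rng):
    """Draw one nucleotide per position using that position's frequencies."""
    return "".join(rng.choice(NUCLEOTIDES, p=row) for row in freqs)


rng = np.random.default_rng(0)
toy_regions = ["ACGT", "ACGA", "TCGA", "ACCA"]  # stand-ins for 500 bp genomic regions
freqs = position_frequencies(toy_regions)
random_seq = sample_sequence(freqs, rng)
```

Because the frequencies are estimated per position, the sampled sequences inherit the higher central GC content of the genomic regions described above.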
In silico sequence evolution
By using the saturation mutagenesis scores mentioned above, we performed in silico sequence evolution. For the in silico evolution from random sequences, we calculated saturation mutagenesis scores for a random sequence. Then, we selected the mutation that had the highest positive delta prediction score for the selected class (for γ-KC, class no. 35 in DeepFlyBrain; for PNG, class no. 34 in DeepFlyBrain; for MEL, class no. 16 in DeepMEL2). For the selected sequence with one mutation, we recalculated the saturation mutagenesis scores for each nucleotide and again selected the mutation with the highest delta score and repeated this procedure until the initial random sequence accumulated 20 mutations.
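The greedy loop described above can be sketched as follows. This is an illustration under stated assumptions: `toy_score` is a hypothetical scorer that counts matches to the Mef2 site listed in this paper, standing in for the model's class prediction, and ties are broken by taking the first best mutation encountered.

```python
MOTIF = "CTATTTATAG"  # Mef2 site, used here only as a toy optimization target


def toy_score(seq):
    """Hypothetical scorer: best number of positions matching MOTIF at any offset."""
    return max(
        sum(a == b for a, b in zip(seq[i:i + len(MOTIF)], MOTIF))
        for i in range(len(seq) - len(MOTIF) + 1)
    )


def best_single_mutation(seq, score):
    """Return (delta, position, nucleotide) of the highest-scoring single mutation."""
    base = score(seq)
    best = None
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt == ref:
                continue
            delta = score(seq[:pos] + alt + seq[pos + 1:]) - base
            if best is None or delta > best[0]:
                best = (delta, pos, alt)
    return best


def evolve(seq, score, n_mutations=20):
    """Apply the best single mutation repeatedly; return every intermediate sequence."""
    path = [seq]
    for _ in range(n_mutations):
        _, pos, alt = best_single_mutation(seq, score)
        seq = seq[:pos] + alt + seq[pos + 1:]
        path.append(seq)
    return path


# Seven steps suffice for this toy target; the study used 20 steps from random 500 bp sequences.
path = evolve("A" * 30, toy_score, n_mutations=7)
```

Each step rescans all single mutants of the current sequence, so a 500 bp sequence evolved for 20 steps requires 20 × 1,500 model evaluations.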
Even though we used a simple objective function to direct the sequence evolution towards a single cell type, without explicitly penalizing off-target cell types, the generated sequences were mostly active only in the targeted cell type. We believe this is because of the type of enhancer models we are using, which were trained on cell-type-specific accessible regions. When more general models are used, for example trained on entire ATAC-seq tracks, adapted objective functions can be used and are available in our code. The cell-type-specific activity of our synthetic enhancers suggests that: (1) activator binding sites were not created for other cell types; and (2) repressor sites, which are present in random sequences by chance, were not destroyed for other cell types. For example, in KC we observed that activator binding sites are usually longer than repressor sites (18 and 10 bp versus 5 and 6 bp for Ey, Mef2, Mamo and CAATTA, respectively). This implies that a random sequence is more likely to have several repressor binding sites by chance compared to activator sites (Extended Data Fig. 1f). Indeed, the average prediction scores of our initial 6,000 random sequences were close to zero for all classes. This may at least in part explain why earlier enhancer design efforts may have failed.
We used 6,000 initial random sequences for KC and PNG and 4,000 for MEL. For the generation of KC enhancers from genomic regions, we performed six iterative mutations. For the multiple cell-type code enhancers, we started from optic lobe enhancers and in each iteration we manually selected the mutations that increased the γ-KC prediction score while keeping the optic lobe prediction scores high. For the pruning experiment of a multiple cell-type code enhancer into only KC code, we manually selected the mutations that kept the γ-KC prediction score high while decreasing the optic lobe prediction scores. The DeepFlyBrain class numbers used for optic lobe neurons are 23 for T1, 20 for T2 and 2 for T4 neurons.
To rescue the designed enhancers that were weak or negative, we performed five more mutations on both from-scratch and from-genomic sequences.
To repress the sequences with the creation of repressor binding sites, we selected single or double mutations manually, by going over in silico saturation mutagenesis plots calculated on the evolved sequences.
To explore the alternative in silico sequence evolution paths besides choosing the best mutation (greedy algorithm), we chose the top 20 mutations on each sequence for every incremental step starting from a random sequence. We followed this procedure for five incremental mutational steps. Starting from the random sequence used to generate enhancer KC EFS-4, we obtained 3.2 million paths/sequences at the end.
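The branching search above grows as k^steps, which is where the 3.2 million figure comes from (20^5). The sketch below shows the expansion on a toy scale; `toy_score` is a hypothetical scorer and the helper names are ours. Different paths can converge on the same sequence, so the frontier may contain duplicates.

```python
def top_k_mutations(seq, score, k):
    """Return the k single mutants with the highest delta score."""
    base = score(seq)
    scored = []
    for pos, ref in enumerate(seq):
        for alt in "ACGT":
            if alt != ref:
                mutant = seq[:pos] + alt + seq[pos + 1:]
                scored.append((score(mutant) - base, mutant))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [mutant for _, mutant in scored[:k]]


def expand_paths(seq, score, k, steps):
    """Follow the top-k mutations of every sequence for a fixed number of steps."""
    frontier = [seq]
    for _ in range(steps):
        frontier = [m for s in frontier for m in top_k_mutations(s, score, k)]
    return frontier


def toy_score(seq):
    """Hypothetical scorer (number of G nucleotides)."""
    return seq.count("G")


paths = expand_paths("AAAAAA", toy_score, k=2, steps=3)  # 2**3 = 8 paths
n_paths_in_study = 20 ** 5  # top 20 mutations over five steps
```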
Nucleotide contribution scores
We used a network-explaining tool, DeepExplainer (SHAP package65,66; RRID: SCR_021362), to calculate the contribution of each nucleotide to the final prediction of the deep learning model for the chosen class. We used 250 randomly selected genomic regions to initialize the explainer.
The DeepFlyBrain model takes a single strand as input. For a given 500 bp sequence, we multiplied the explainer's output by the one-hot encoded DNA sequence and visualized it as the height of the nucleotide letters. The DeepMEL2 model takes the forward and reverse strands separately as input. In this case, the explainer returns contribution scores for each strand. We first averaged the contribution scores of the two strands for each nucleotide and then multiplied the average by the one-hot encoded DNA sequence for visualization.
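The masking and strand-averaging steps can be sketched with NumPy as below. The explainer output is represented by a plain array, and one assumption is made that the text does not spell out: the reverse-strand scores are taken to be in reverse-complement orientation, so they are flipped along both axes (with A, C, G, T column order this maps A to T and C to G) before averaging.

```python
import numpy as np


def one_hot(seq):
    """One-hot encode a DNA string as a (length, 4) array with A, C, G, T columns."""
    idx = np.array(["ACGT".index(nt) for nt in seq])
    return np.eye(4)[idx]


def letter_heights(contrib, seq):
    """Keep only the contribution of the nucleotide actually present at each position."""
    return contrib * one_hot(seq)


def average_strands(contrib_fwd, contrib_rev):
    """Average forward scores with reverse-strand scores re-oriented to the forward strand.

    Assumption: contrib_rev is indexed along the reverse complement.
    """
    return (contrib_fwd + contrib_rev[::-1, ::-1]) / 2.0


seq = "ACGT"
contrib = np.arange(16, dtype=float).reshape(4, 4)  # stand-in for explainer output
heights = letter_heights(contrib, seq)
averaged = average_strands(contrib, contrib[::-1, ::-1])
```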
Motif annotation
To identify TF binding sites during the in silico evolution of designed sequences, we used TF-Modisco (v.0.5.5.4; RRID: SCR_024811)67 and Cluster-Buster (RRID: SCR_024810)68. First, we calculated the nucleotide contribution scores on every mutational step, including random sequences. Then, we ran TF-Modisco on each mutational step separately to identify which patterns appear or disappear. The TF-Modisco parameters we used were num_to_samp=5000, sliding_window_size=15, flank_size=5, target_seqlet_fdr=0.15, trim_to_window_size=15, initial_flank_to_add=5, final_flank_to_add=5, final_min_cluster_size=60. After investigating the TF-Modisco patterns that were identified on each mutational step, we used mutational step 1 for KC and mutational step 4 for MEL to collect the identified patterns, as they contained all the activator and repressor patterns. (Earlier steps did not have good representation of activators because they are close to random sequences. Later steps did not have good representation of repressors because they were destroyed during the mutational steps.) We trimmed the patterns on the basis of information content (threshold = 0.1) and saved them as a .cb file to be used by Cluster-Buster.
By using the TF-Modisco patterns, we ran Cluster-Buster (with -c 0 and -m 3 options) to identify motifs on each mutational step, including random sequences. We selected only the motif instances from Cluster-Buster results and merged (by using BEDTools v.2.30.0; RRID: SCR_006646; ref. 69) the overlapping hits of the motifs into a single hit. We calculated mean + s.d. on the hit scores coming from random sequences for each motif separately and used these thresholds to get the significant hits.
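The per-motif thresholding can be illustrated as follows. This is a sketch with made-up hit scores; whether the population or the sample standard deviation was used is not stated, so the use of `statistics.pstdev` here is an assumption, as are the function names.

```python
from statistics import mean, pstdev


def significance_thresholds(random_hits):
    """Per-motif threshold: mean + s.d. of hit scores found in random sequences."""
    return {motif: mean(scores) + pstdev(scores) for motif, scores in random_hits.items()}


def significant_hits(hits, thresholds):
    """Keep (motif, score) hits whose score exceeds that motif's threshold."""
    return [(motif, score) for motif, score in hits if score > thresholds[motif]]


# Hypothetical Cluster-Buster hit scores on random sequences (mean 5, s.d. 2).
random_hits = {"Ey": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]}
thresholds = significance_thresholds(random_hits)
kept = significant_hits([("Ey", 7.5), ("Ey", 6.0)], thresholds)
```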
Identification of TF binding sites similar to TF-Modisco patterns was performed using Tomtom (RRID: SCR_024809)70 using the cisTarget motif collection (RRID: SCR_024808)71.
Scoring the fly genome
To identify the regions that have high prediction scores for γ-KC but have less accessibility in γ-KC, we scored the whole fly genome. We used the bedtools makewindows -g dm6.chromsize -w 500 -s 50 command69 to create the coordinates of the binned fly genome with a 500 bp window and 50 bp stride. We removed the regions that are not exactly 500 bp. This resulted in 2,750,893 regions to be scored with the DeepFlyBrain model. We used the stats function of deeptools/pyBigWig package (RRID: SCR_024807)72 to calculate mean γ-KC accessibility values for each bin.
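For readers without BEDTools at hand, the windowing and filtering step can be mimicked in plain Python. This sketch assumes 0-based, half-open coordinates as in BED files; the chromosome name and size are toy values, not dm6.

```python
def make_windows(chrom_sizes, window=500, step=50):
    """Yield (chrom, start, end) sliding windows, keeping only full-length ones."""
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, step):
            end = min(start + window, size)
            if end - start == window:  # drop truncated windows at chromosome ends
                yield chrom, start, end


windows = list(make_windows({"chrToy": 1000}))
```

A chromosome of length L at least as long as the window contributes floor((L − 500) / 50) + 1 full windows, so the toy chromosome above yields 11.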
Motif implanting
To implant binding sites into 500 bp sequences, we started from a random sequence. We implanted a binding site into every possible location on the random sequence one-by-one by replacing the nucleotides on the random sequences with the binding site. Then, we scored these sequences with the model. We selected the binding site position that gives the highest prediction score and implanted the motif on that position. Then, starting from this sequence with one binding site implanted, we implanted the next binding sites one-by-one by using the same procedure. The sequence of binding sites that maximize the TF-Modisco pattern score were selected to implant and they are as follows: Ey, TGCTCACTCAAGCGTAA; Mef2, CTATTTATAG; Onecut, ATCGAT; Sr, CCACCC; SOX10, AACAATGGGCCCATTGTT; MITF, GTCACGTGAC; and TFAP2, GCCTGAGGC. We used 2,000 initial random sequences for KC and 2,000 for MEL. The weaker binding sites taken from the IRF4 enhancer are as follows: SOX10_1, GTGAATGACAGCTTTGTT; SOX10_2, TACAAGTATCTCCATTGT; MITF_1, ATCATGTGAA; MITF_2, GCCATATGAC; TFAP2_1, TCTTCAGGC; and TFAP2_2, CCCTGTGGT.
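The sequential implantation procedure can be sketched as below. The Ey and Mef2 sites are those listed above; `toy_score` is a hypothetical scorer that simply counts which sites are intact, standing in for the model prediction, and the poly-A background replaces a random sequence for readability.

```python
EY = "TGCTCACTCAAGCGTAA"
MEF2 = "CTATTTATAG"


def implant(seq, motif, pos):
    """Replace the nucleotides starting at pos with the motif (length is preserved)."""
    return seq[:pos] + motif + seq[pos + len(motif):]


def best_implant(seq, motif, score):
    """Try every possible position and keep the highest-scoring sequence."""
    candidates = [implant(seq, motif, p) for p in range(len(seq) - len(motif) + 1)]
    return max(candidates, key=score)


def implant_motifs(seq, motifs, score):
    """Implant motifs one by one, each at its best position in the current sequence."""
    for motif in motifs:
        seq = best_implant(seq, motif, score)
    return seq


def toy_score(seq):
    """Hypothetical scorer: number of intact binding sites present."""
    return sum(site in seq for site in (EY, MEF2))


designed = implant_motifs("A" * 60, [EY, MEF2], toy_score)
```

Because later implants are scored on the full sequence, a position that would overwrite an earlier site is penalized automatically, which is the point of scoring every position with the model rather than inserting at random.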
When TF motifs are implanted at random positions in a random sequence, prediction scores are very low, probably because repressor sites remain present. Likewise, to be able to generate a functional enhancer through random sequence generation, many sequences need to be generated (that is, 100 million and 1 billion; refs. 38,73).
To measure if there is a preference for a flanking sequence when performing motif implanting, we aggregated all the sequences aligned by the location of the implanted motif. Then, we calculated the position probability matrix and visualized it by subtracting 0.25 from each position.
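A minimal sketch of this calculation is shown below, assuming the sequences have already been aligned on the implanted motif and trimmed to equal length; the four toy sequences are hypothetical.

```python
import numpy as np


def flank_preference(aligned_seqs):
    """Position probability matrix minus the uniform expectation of 0.25."""
    counts = np.zeros((len(aligned_seqs[0]), 4))
    for seq in aligned_seqs:
        for pos, nt in enumerate(seq):
            counts[pos, "ACGT".index(nt)] += 1
    ppm = counts / len(aligned_seqs)
    return ppm - 0.25


# Position 0 is always A, position 1 is uniform, position 2 is always C.
preference = flank_preference(["AAC", "ATC", "AGC", "ACC"])
```

Positive values mark nucleotides enriched at a position relative to a uniform background, negative values mark depleted ones, and each row sums to zero.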
To measure the effect of different background sequences on the minimal KC enhancer, we generated 1 million random sequences of 20 bp. We then used each of them to replace the 20 bp of background around the implanted Ey, Mef2 and Onecut binding sites, namely the 6 bp flanks on both sides and the 8 bp of intermotif space. Then, we scored the sequences with the model and measured the effect of different backgrounds around the motif implantation area.
Generative adversarial network
To train a GAN model, we used a Wasserstein GAN architecture with gradient penalty74, similar to earlier work47. The model consists of two parts: a generator and a discriminator. The generator takes noise as input (size 128), followed by a dense layer with 64,000 (500 × 128) units with ELU activation, a reshape layer (500, 128), a convolution tower of five convolution blocks with skip connections, a one-dimensional (1D) convolution layer with four filters with kernel width 1 and finally a softmax activation layer. The output of the generator is a 500 × 4 matrix, which represents a one-hot encoded DNA sequence. The discriminator takes a 500 bp one-hot encoded DNA sequence as input (real or fake), followed by a 1D convolution layer with 128 filters with kernel width 1, a convolution tower of five convolution blocks with skip connections, a flatten layer and finally a dense layer with one unit.
Each block in the convolution tower consists of a ReLU activation layer followed by a 1D convolution with 128 filters with kernel width 5. The noise is generated by the numpy.random.normal(0, 1, (batch_size, 128)) command. We used a batch size of 128. For every train_on_batch iteration of the generator, we performed ten train_on_batch iterations for the discriminator. We used the Adam optimizer with a learning_rate of 0.0001, beta_1 of 0.5 and beta_2 of 0.9. We trained the models for around 260,000 batch training iterations for KC and around 160,000 batch training iterations for MEL.
We used 6,126 KC regions for the fly model and 3,885 MEL regions for the human model, which we identified in our previous publications, as real genomic sequences to train the models. After the training, we sampled 6,144 (48 × batch size) sequences for KC and 3,968 (31 × batch size) sequences for MEL by using the generator for every 10,000 batch training iteration. The sampled synthetic sequences were generated by calculating predictions on noise and then the numpy.argmax() command was used to convert the predictions into one-hot encoded representations.
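The sampling step (noise in, one-hot sequences out) can be sketched without the trained network. In the snippet below the generator is replaced by a hypothetical stand-in that returns random softmax-like outputs of the right shape; only the noise generation and the argmax conversion mirror the description above.

```python
import numpy as np

batch_size = 128
rng = np.random.default_rng(0)

# Noise vectors as described in the text: one 128-dimensional vector per sequence.
noise = rng.normal(0, 1, (batch_size, 128))

# Hypothetical stand-in for generator.predict(noise): per-position softmax over A, C, G, T.
logits = rng.normal(size=(batch_size, 500, 4))
pred = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)


def predictions_to_one_hot(pred):
    """Convert softmax outputs to one-hot by taking the argmax at each position."""
    return np.eye(4, dtype=np.int8)[np.argmax(pred, axis=-1)]


def one_hot_to_strings(one_hot):
    """Decode one-hot arrays of shape (n, length, 4) into DNA strings."""
    return ["".join("ACGT"[i] for i in row.argmax(axis=-1)) for row in one_hot]


one_hot = predictions_to_one_hot(pred)
sequences = one_hot_to_strings(one_hot)
```

Sampling 48 such batches gives the 6,144 KC sequences and 31 batches give the 3,968 MEL sequences mentioned above.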
Background model
To compare against the GAN-generated sequences, we generated random sequences from background models of different orders by using the CreateBackgroundModel function from the INCLUSive package (RRID: SCR_013488)75, based on the same genomic regions that we used to train the GANs.
Training ChromBPNet models
For training ChromBPNet models, we used a prerelease version (v.1.3-pre-release; RRID: SCR_024806) from the ChromBPNet GitHub repository (https://github.com/kundajelab/chrombpnet/tree/v1.3-pre-release). We followed all the preprocessing and training steps as described in the tutorial: from the aligned ATAC reads in the MM001 BAM file, we made a BigWig of Tn5 insertion sites and trained a bias model that predicts Tn5 binding sites in non-peak regions, which is then used in the ChromBPNet model to filter out Tn5 bias. ChromBPNet uses a 2,114 bp DNA sequence as input and predicts both the ATAC track and the natural log count of the aligned reads for the central 1,000 bp. To be able to score 500 bp DNA sequences (IRF4 enhancer and synthetic enhancers), we used the flanking sequences of the cloned/integrated enhancer sequences surrounded by the integrated cassette. Both scalar and track predictions were plotted. Flanking sequences are provided in the Supplementary Code.
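To make the length arithmetic concrete, the sketch below pads a 500 bp insert to the 2,114 bp model input, which requires 807 bp of flank on each side. Centring the insert is our assumption, and the poly-C and poly-G flanks are placeholders for the cassette sequences provided in the Supplementary Code.

```python
MODEL_INPUT_LEN = 2114


def pad_with_flanks(insert, left_flank, right_flank, input_len=MODEL_INPUT_LEN):
    """Centre the insert in a window of input_len using the nearest flanking bases."""
    missing = input_len - len(insert)
    left = missing // 2
    right = missing - left
    if len(left_flank) < left or len(right_flank) < right:
        raise ValueError("flanking sequences are too short for the model input")
    left_part = left_flank[-left:] if left else ""
    return left_part + insert + right_flank[:right]


padded = pad_with_flanks("A" * 500, "C" * 1000, "G" * 1000)
```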
Using the Enformer model
We used the Enformer model (RRID: SCR_024805) to do in silico CRISPR experiments. We took the IRF4 locus (chr. 6: 339010:453698) centred on the IRF4 enhancer (chr. 6: 396104:396604). We replaced the endogenous IRF4 enhancer with the random, evolved or repressed designed sequences and calculated the prediction scores for the related cell types. The prediction scores were plotted across the whole locus. For DNase and ChIP-Histone:H3K27ac tracks, the mean values were calculated using the middle three bins or one bin spanning the enhancer location. For CAGE tracks, the mean values were calculated using one bin spanning the transcriptional start site of IRF4. The indices of the tracks that we used to get the prediction scores are as follows: 4832 (CAGE/melanoma cell line:G-361), 162 (DNase/SK-MEL-5) and 2162 (ChIP-Histone:H3K27ac/foreskin melanocyte male newborn).
To measure the locational effect of the designed enhancers on gene expression, chromatin accessibility and histone modification, we moved the synthetic enhancer around the IRF4 locus: (1) 10 kb upstream, (2) 5 kb upstream (which is next to the promoter of the IRF4 gene) and (3) 17.5 kb downstream of the original location.
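The locus coordinates above span 114,688 bp, which corresponds to Enformer's 896 output bins of 128 bp each. The sketch below works out the central bins and averages a track over them. It treats the coordinates as half-open intervals and takes "middle three bins" to mean the central bin plus one neighbour on each side; both are our reading rather than something stated in the text, and the track values are dummies.

```python
BIN_SIZE = 128

LOCUS = (339010, 453698)      # IRF4 locus on chr. 6
ENHANCER = (396104, 396604)   # IRF4 enhancer on chr. 6


def centre_bins(locus_start, locus_end, n_bins_wanted, bin_size=BIN_SIZE):
    """Indices of the n central output bins of a locus."""
    n_bins = (locus_end - locus_start) // bin_size
    first = n_bins // 2 - n_bins_wanted // 2
    return list(range(first, first + n_bins_wanted))


def mean_signal(track, bins):
    """Mean predicted signal over the selected bins."""
    return sum(track[b] for b in bins) / len(bins)


n_bins = (LOCUS[1] - LOCUS[0]) // BIN_SIZE
three_bins = centre_bins(*LOCUS, n_bins_wanted=3)
one_bin = centre_bins(*LOCUS, n_bins_wanted=1)

dummy_track = [0.0] * n_bins
for b in three_bins:
    dummy_track[b] = 3.0  # pretend the enhancer produces a local peak
peak_mean = mean_signal(dummy_track, three_bins)
```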
Cloning of synthetic Drosophila enhancers
Synthetic sequences were ordered from Twist Bioscience, precloned in the pTwist ENTR vector. The motif-implantation and double-coded sequences were synthesized with an extra 5′ CACC sequence as double-stranded DNA (gBlocks Gene Fragments) by IDT. The 49 bp motif-implantation sequence was ordered from IDT as forward and reverse single-stranded DNA oligos, which were then annealed by heating for 5 min at 95 °C and cooling down to room temperature over 1 h. The double-stranded DNA sequences were then cloned into the pENTR/D-TOPO plasmid (Invitrogen).
All sequences were introduced in a modified pH-Stinger vector76, containing nuclear GFP, Hsp70 promoter, gypsy insulators and attB site for phiC31 integration, through Gateway LR recombination reaction (Invitrogen). A total of 2 µl of the reaction was transformed into 25 µl of Stellar chemically competent bacteria (Takara). Plasmid minipreps were performed using the NucleoSpin Plasmid Transfection-grade Mini kit (Macherey-Nagel) and sequenced with Sanger sequencing to confirm the correct insertion of the regions in the destination plasmid. After confirmation of the sequence, plasmid midipreps were performed using the NucleoBond Xtra endotoxin-free Midi kit (Macherey-Nagel). Next, the plasmids were sent to FlyORF (Switzerland) for injection in Drosophila embryos (21F site on chromosome 2L) and positive transformants were selected on the basis of eye colour.
Drosophila flies were raised on a yeast-based medium at 25 °C under a 12 h/12 h day/night light cycle.
Immunohistochemistry analysis of Drosophila brains
Brains of adult flies (Drosophila melanogaster, less than 10 days old, equally mixed sex) were dissected in PBS and transferred to a tube for fixation in 4% formaldehyde in PBS for 20 min. All incubations were done at room temperature, unless otherwise indicated. Brains were washed in PBS with 0.3% Triton-X (PBST) three times for 10 min each, then they were placed in blocking solution (5% normal goat serum (Abcam) in PBST) for 3 h. We incubated the brains overnight at 4 °C in primary antibodies diluted in blocking solution (rabbit anti-GFP, IgG (Invitrogen), 1:1,000 and mouse anti-Dachshund, mAB dac1-1 (DSHB), 1:250). The brains were then washed in PBST three times for 10 min each and incubated with the fluorochrome-conjugated secondary antibodies diluted in blocking solution for 2 h (Alexa Fluor 488 donkey anti-rabbit IgG (Invitrogen), 1:500 and Alexa Fluor 647 goat anti-mouse IgG (Invitrogen), 1:500). Next, brains were washed in PBS three times for 10 min each. Finally, samples were mounted onto microscope slides with Prolong Glass Antifade Mountant (Invitrogen).
For image acquisition, a Zeiss LSM900 microscope equipped with Airyscan2 in combination with a ×20 objective (Plan Apo 0.80 Air) was used. The setup was controlled by ZEN blue (v.3.4.91, Carl Zeiss Microscopy GmbH). GFP was excited with a blue diode 100 mW at 488 nm and tiled images were collected with emission filter BP450-490/BS495/BP500-550.
Cloning of synthetic human enhancers
The 500 bp synthetic sequences were ordered from Twist Bioscience, precloned in the pTwist ENTR vector. The 500 bp regions were introduced in the pGL4.23-GW luciferase reporter vector (Promega) through Gateway LR recombination reaction (Invitrogen) and 2 µl of the reaction was transformed into 25 µl of Stellar chemically competent bacteria (Takara).
Synthetic sequences shorter than 150 bp were ordered as gBlocks from IDT (Integrated DNA Technologies) with 5′ (cccgtcgacgaattctgcagatatcacaagtttgtacaaaaaagcaggct) and 3′ (acccagctttcttgtacaaagtggtgataaacccgctgatcag) adaptors. The pGL4.23-GW luciferase reporter vector was linearized through inverse PCR with primers Lin_pSA335_short_ME_For (gtggtgataaacccgctgatcag) and Lin_pSA335_short_ME_Rev (tctgcagaattcgtcgacggg). The short sequences and the linearized vector were combined in an NEBuilder reaction (New England Biolabs) and 2 µl of the reaction was transformed into 25 µl of Stellar chemically competent bacteria.
For all cloning procedures, plasmid minipreps were performed using the NucleoSpin Plasmid Transfection-grade Mini kit (Macherey-Nagel) and sequenced with Sanger sequencing to confirm the correct insertion of the regions in the destination plasmid.
To generate stable cell lines with synthetic enhancers, the synthetic sequences were cloned into the pSA351_SCP1_intron_eGFP vector (Addgene no. 206906). The vector was linearized through inverse PCR with primers Lin_pSA351_For (ctgagctccctagggtact) and Lin_pSA351_Rev (cgactcgaggctagtctc). The synthetic sequences were PCR-amplified from their respective pGL4.23-GW vector with their respective primer pairs: MM_EFS_1_For (gagactagcctcgagtcgctgattgtttgaaccattgttacgatttgg) and MM_EFS_1_Rev (agtaccctagggagctcagcaattttgttttttgcgcgtgac) for MM-EFS-1 sequences; MM_EFS_4_For (gagactagcctcgagtcgtgatatgtattcacccatgccctca) and MM_EFS_4_Rev (agtaccctagggagctcaagggtttgtatatgtatgctcctttatacga) for MM-EFS-4 sequences; MM_EFS_8_For (gagactagcctcgagtcgatacgcacgacaaagcctcat) and MM_EFS_8_Rev (agtaccctagggagctcacactgtacaaggcatcccgc) for MM-EFS-8 sequences; IRF_4_For (gagactagcctcgagtcggctgccattggtgtggattttaag) and IRF_4_Rev (agtaccctagggagctcaactggcatcgagacggg) for IRF4 sequences. The PCR amplicons and the linearized vector were combined in an NEBuilder reaction and 2 µl of the reaction was transformed into 25 µl of Stellar chemically competent bacteria. Plasmid minipreps were performed using the NucleoSpin Plasmid Transfection-grade Mini kit (Macherey-Nagel) and sequenced with Sanger sequencing to confirm the correct insertion of the regions in the vector. After confirmation of the sequence, plasmid maxipreps were performed using the NucleoBond Xtra endotoxin-free Maxi kit (Macherey-Nagel).
Transfection and luciferase assay
MM001 and MM047 cells were seeded in 24-well plates and transfected with 400 ng of pGL4.23-enhancer vector + 40 ng of pRL-TK Renilla vector (Promega) with Lipofectamine 2000 (Thermo Fisher Scientific). As positive controls, the previously published enhancers MLANA_5-I, IRF4_4-I and TYR_−9-D or ABCC3_11-I and GPR39_23-I were used for MM001 and MM047, respectively77. One day after transfection, luciferase activity was measured through the Dual-Luciferase Reporter Assay System (Promega) by following the manufacturer’s protocol. Briefly, cells were lysed with 100 µl of passive lysis buffer for 15 min at 500 rpm. A total of 20 µl of the lysate was transferred in duplicate into wells of an OptiPlate-96 HB (PerkinElmer) and 100 µl of luciferase assay reagent II was added to each well. Luciferase-generated luminescence was measured on a Victor X luminometer (PerkinElmer). A total of 100 µl of the Stop & Glo Reagent was added to each well and the luminescence was measured again to record Renilla activity. Luciferase activity was estimated by calculating the ratio luciferase/Renilla; this value was normalized by the ratio calculated on blank wells containing only reagents. Three biological replicates were done per condition for MM001 and two biological replicates for MM047.
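The normalization described above amounts to a ratio of ratios. The sketch below assumes that "normalized by" means division by the blank-well ratio and that replicates are averaged afterwards; all luminescence readings are invented for illustration.

```python
from statistics import mean


def relative_luciferase(firefly, renilla, blank_firefly, blank_renilla):
    """Luciferase/Renilla ratio of a sample divided by the same ratio in blank wells."""
    return (firefly / renilla) / (blank_firefly / blank_renilla)


def replicate_mean(readings, blank):
    """Average the normalized activity over (firefly, renilla) replicate readings."""
    return mean(relative_luciferase(f, r, *blank) for f, r in readings)


blank = (10.0, 5.0)  # hypothetical blank-well readings (luciferase, Renilla)
single = relative_luciferase(2000.0, 100.0, *blank)
averaged = replicate_mean([(2000.0, 100.0), (3000.0, 100.0), (1000.0, 100.0)], blank)
```

Dividing by the Renilla signal corrects for well-to-well differences in transfection efficiency and cell number before the comparison with the blank.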
Production of lentivirus
The lentivirus plasmids were transfected into HEK293T cells using the Lipofectamine 3000 reagent (Thermo Fisher Scientific). A total of 30 µg of pooled plasmid DNA was combined with 20 µg of a Pax2 plasmid (Addgene no. 12260; RRID: Addgene_12260) and 10 µg of the MD2.G plasmid (Addgene no. 12259; RRID: Addgene_12259). At 48 h posttransfection, medium was collected and refreshed. At 72 h posttransfection, medium was collected a second time. Both medium collections were combined and spun down for 5 min at 1,500 rpm. The supernatant was carefully collected with a blunt needle and a syringe and filtered through a 0.45 µm syringe disc filter (Millex-HV Millipore) into an Ultra-15 MWCO100 centrifugal filter (Amicon). The concentrator tube containing the supernatant was spun down at 4,000 rpm for approximately 45 min until the desired volume of 250 µl was reached. The virus suspension was aliquoted and stored at −80 °C.
Transduction of melanoma cells
The MM001 cells were seeded into a six-well plate at a density of 250,000 cells per well. Transduction was performed by adding 5–40 µl of lentivirus and Polybrene at 8 µg ml⁻¹. Cells were incubated for 24 h before washing away the Polybrene with PBS and with growth medium. After 3 days the cells were split and expanded further.
OmniATAC-seq
Omni-assay for transposase-accessible chromatin using sequencing (OmniATAC-seq) was performed as described previously78. Briefly, 50,000 MM001 cells transduced with the enhancer pools were resuspended in 50 µl of cold ATAC-seq resuspension buffer (RSB; 10 mM TrisHCl pH 7.4, 10 mM NaCl and 3 mM MgCl2 in water) containing 0.1% NP40, 0.1% Tween-20 and 0.01% digitonin by pipetting up and down three times. This cell lysis reaction was incubated on ice for 3 min. After lysis, 1 ml of ATAC-seq RSB containing 0.1% Tween-20 was added and the tubes were inverted to mix. Nuclei were then centrifuged for 10 min at 500g in a prechilled (4 °C) fixed-angle centrifuge. Supernatant was removed and nuclei were resuspended in 50 µl of transposition mix (25 µl of 2× TD buffer, 2.5 µl of transposase (Nextera Tn5 transposase, Illumina), 16.5 µl of PBS, 0.5 µl of 1% digitonin, 0.5 µl of 10% Tween-20 and 5 µl of water) by pipetting up and down six times. Transposition reactions were incubated at 37 °C for 30 min in a thermoblock. Reactions were cleaned-up by MinElute (Qiagen). Transposed DNA was amplified (ten cycles) with primers i5_Indexing_For (aatgatacggcgaccaccgagatctacacnnnnnnnntcgtcggcagcgtcagatgtg) and i7_Indexing_Rev (caagcagaagacggcatacgagatnnnnnngtctcgtgggctcggagatgt). All libraries were sequenced on a NextSeq2000 instrument (Illumina).
Reads were demultiplexed using bcl2fastq (v.2.20; RRID: SCR_015058; https://emea.support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software.html). Adaptors were trimmed by trimgalore (v.0.6.7; RRID: SCR_011847; https://github.com/FelixKrueger/TrimGalore). Reads were mapped to a custom hg38 genome, which contains integrated sequences as extra chromosomes, using bwa-mem2 (v.2.2.1; RRID: SCR_022192)79. By using SAMtools (v.1.16.1; RRID: SCR_002105)80, reads were sorted and deduplicated and reads from the blacklisted regions (https://www.encodeproject.org/files/ENCFF356LFX/) were cleaned. Bigwig files with RPGC normalization were generated by using deepTools (v.3.5.0; RRID: SCR_016366) bamCoverage72.
ChIP–seq
ChIP–seq was performed by following the Myers Lab ChIP–seq Protocol v.011014 on 2 × 10⁷ MM001 cells. A total of 5 µg of rabbit anti-ZEB2 (1 mg ml⁻¹; Bethyl A302-473A; RRID: AB_3076293) was used for ChIP. A total of 15 ng of immunoprecipitated DNA was used to perform library preparation according to the Illumina TruSeq DNA Sample preparation guide. Briefly, the immunoprecipitated DNA was end-repaired, A-tailed and ligated to diluted sequencing adaptors (1/100). After PCR amplification with i5_Indexing_For and i7_Indexing_Rev (18 cycles) and bead purification (Agencourt AmpureXP, Analis), the libraries with fragment size of 300–500 bp were sequenced using the NextSeq2000 instrument (Illumina).
Reads were demultiplexed using bcl2fastq (v.2.20; RRID: SCR_015058). Adaptors were trimmed by trimgalore (v.0.6.7; RRID: SCR_011847). Reads were mapped to hg38 using bwa-mem2 (v.2.2.1; RRID: SCR_022192)79. By using SAMtools (v.1.16.1; RRID: SCR_002105)80, reads were sorted and deduplicated and reads from the blacklisted regions (https://www.encodeproject.org/files/ENCFF356LFX/) were cleaned. Bigwig files with RPGC normalization were generated by using deepTools (v.3.5.0; RRID: SCR_016366) bamCoverage72. Peaks were called using MACS2 (v.2.1.2.1; RRID: SCR_013291) callpeak81.
Cell lines
MM001, MM047 and MM099 were obtained from G. Ghanem and were cultured in Ham’s F-10 Nutrient Mix (Invitrogen) + 10% FBS (Invitrogen). We authenticated the cell lines by checking their genomic, transcriptomic and epigenomic profiles12,82,83. HEK293T cells used for lentivirus production were obtained from ATCC (catalogue no. CRL-3216; RRID: CVCL_0063) and were cultured in DMEM (Invitrogen) + 10% FBS (Invitrogen). Cell lines were tested for mycoplasma contamination before experiments and were found to be negative.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-023-06936-2.
Acknowledgements
We acknowledge A. Kundaje and A. Pampari for assistance and early access to their ChromBPNet code; the Genomics Core Leuven for assistance in sequencing; and the VIB Bio Imaging Core for their support and assistance in imaging. Computing was performed at the Vlaams Supercomputer Center. This work is funded by the following grants to S.A.: ERC Consolidator Grant (724226_cis‐CONTROL), ERC Proof of Concept (963884), ERC Advanced Grant (101054387_Genome2Cells), Special Research Fund (BOF) KU Leuven (grant C14/18/092; C14/22/125), Foundation Against Cancer (F/2020/1396) and FWO (grants G094121N; G0B5619N; G0I2722N - EOS ID: 40007513); Michael J. Fox Foundation for Parkinson’s Research (Michael J. Fox Foundation) (ASAP-000430).
Author contributions
I.I.T. and S.A. conceived the study. I.I.T. performed all computational analyses and designed synthetic enhancers. V.C. performed enhancer cloning with assistance from K.I.S. and D.M. V.C. performed luciferase assays with assistance from D.M. K.I.S. performed antibody staining and visualization with assistance from I.I.T., H.D. and J.I. R.V. performed lentivirus production and cell line transduction. V.C. performed ATAC-seq and ChIP–seq experiments with assistance from H.D., K.T. and A.P. G.H. performed ATAC-seq and ChIP–seq data preprocessing. N.K. trained ChromBPNet models with assistance from E.C.E. I.I.T. and S.A. wrote the manuscript with assistance from D.M.
Peer review
Peer review information
Nature thanks Ivan Ovcharenko and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
Cloned Drosophila and human sequences were provided as Supplementary Tables. DeepMEL, DeepMEL2 and DeepFlyBrain deep learning model files were obtained from Kipoi84 (http://kipoi.org/models/DeepMEL; https://kipoi.org/models/DeepFlyBrain) with Zenodo record ids 3592129, 4590308 and 5153337. The fasta files used to train GAN models and the trained GAN models are available on Zenodo at 10.5281/zenodo.6701504. Custom genomes (hg38 and dm6) generated in this study are available on Zenodo at 10.5281/zenodo.10184648. Chromatin accessibility values in KC in adult Drosophila brains were obtained from GSE163697 (ref. 39). In vitro saturation mutagenesis on IRF4 data were obtained from https://kircherlab.bihealth.org/satMutMPRA/85. Chromatin accessibility of Drosophila and transduced melanoma lines and ZEB2 ChIP–seq data generated for this study have been submitted to the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE240003.
Code availability
Code used to load deep learning models, create random sequences, perform sequence evolution, perform motif implantation and train GAN models, together with the IPython notebooks that reproduce all the figures, is provided as Supplementary Code. The data to run the scripts, the models and the intermediate files can be found together with the code at 10.5281/zenodo.10184648.
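As an orientation for readers, the steps named above (random sequence creation, sequence evolution and motif implantation) can be illustrated with a minimal, self-contained sketch. It is not the Supplementary Code: the function names, the greedy single-substitution search and, in particular, `toy_score` (a motif-matching stand-in for a trained deep learning model's prediction) are assumptions made purely for illustration.

```python
import random

ALPHABET = "ACGT"


def random_sequence(length, rng):
    """Draw a uniformly random DNA sequence of the given length."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def toy_score(seq, motif="TGACTCA"):
    """Hypothetical stand-in for a model prediction (assumption, not a real model):
    the best fraction of motif positions matched by any window of the sequence."""
    best = 0
    for i in range(len(seq) - len(motif) + 1):
        matches = sum(a == b for a, b in zip(seq[i:i + len(motif)], motif))
        best = max(best, matches)
    return best / len(motif)


def evolve(seq, score_fn, n_steps):
    """Greedy in silico evolution: at each step, score every single-nucleotide
    substitution and keep the best one, stopping when nothing improves the score."""
    history = [(seq, score_fn(seq))]
    for _ in range(n_steps):
        best_seq, best_score = seq, history[-1][1]
        for pos in range(len(seq)):
            for base in ALPHABET:
                if base == seq[pos]:
                    continue
                candidate = seq[:pos] + base + seq[pos + 1:]
                score = score_fn(candidate)
                if score > best_score:
                    best_seq, best_score = candidate, score
        if best_seq == seq:  # local optimum reached
            break
        seq = best_seq
        history.append((seq, best_score))
    return history


def implant_motif(seq, motif, position):
    """Overwrite the sequence with the motif starting at the given position."""
    if position < 0 or position + len(motif) > len(seq):
        raise ValueError("motif does not fit at this position")
    return seq[:position] + motif + seq[position + len(motif):]


if __name__ == "__main__":
    rng = random.Random(0)
    start = random_sequence(50, rng)
    for step, (s, score) in enumerate(evolve(start, toy_score, n_steps=10)):
        print(step, round(score, 3), s)
```

In practice `score_fn` would wrap the prediction of a trained model for the cell type of interest; keeping the full `history` is what allows each accepted mutation to be traced back to the step at which it was selected.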
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data is available for this paper at 10.1038/s41586-023-06936-2.
Supplementary information
The online version contains supplementary material available at 10.1038/s41586-023-06936-2.
References
- 1.Davidson, E. H. Genomic Regulatory Systems: Development and Evolution (Academic, 2001).
- 2.Avsec Ž, et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods. 2021;18:1196–1203. doi: 10.1038/s41592-021-01252-x.
- 3.Minnoye L, et al. Cross-species analysis of enhancer logic using deep learning. Genome Res. 2020;30:1815–1834. doi: 10.1101/gr.260844.120.
- 4.de Almeida BP, Reiter F, Pagani M, Stark A. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat. Genet. 2022;54:613–624. doi: 10.1038/s41588-022-01048-5.
- 5.Janssens, J. et al. Decoding gene regulation in the fly brain. Nature 10.1038/s41586-021-04262-z (2022).
- 6.Avsec Ž, et al. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat. Genet. 2021;53:354–366. doi: 10.1038/s41588-021-00782-6.
- 7.Zaret KS, Carroll JS. Pioneer transcription factors: establishing competence for gene expression. Genes Dev. 2011;25:2227–2241. doi: 10.1101/gad.176826.111.
- 8.Jacobs J, et al. The transcription factor Grainy head primes epithelial enhancers for spatiotemporal activation by displacing nucleosomes. Nat. Genet. 2018;50:1011–1020. doi: 10.1038/s41588-018-0140-x.
- 9.Payankaulam S, Li LM, Arnosti DN. Transcriptional repression: conserved and evolved features. Curr. Biol. 2010;20:R764–R771. doi: 10.1016/j.cub.2010.06.037.
- 10.Pennacchio LA, et al. In vivo enhancer analysis of human conserved non-coding sequences. Nature. 2006;444:499–502. doi: 10.1038/nature05295.
- 11.Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Preprint at bioRxiv 10.1101/2023.08.30.555582 (2023).
- 12.Atak ZK, et al. Interpretation of allele-specific chromatin accessibility using cell state-aware deep learning. Genome Res. 2021;31:1082–1096. doi: 10.1101/gr.260851.120.
- 13.Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods. 2015;12:931–934. doi: 10.1038/nmeth.3547.
- 14.Yuh CH, Bolouri H, Davidson EH. Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science. 1998;279:1896–1902. doi: 10.1126/science.279.5358.1896.
- 15.Patwardhan RP, et al. Massively parallel functional dissection of mammalian enhancers in vivo. Nat. Biotechnol. 2012;30:265–270. doi: 10.1038/nbt.2136.
- 16.Kheradpour P, et al. Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay. Genome Res. 2013;23:800–811. doi: 10.1101/gr.144899.112.
- 17.Hare EE, Peterson BK, Iyer VN, Meier R, Eisen MB. Sepsid even-skipped enhancers are functionally conserved in Drosophila despite lack of sequence conservation. PLoS Genet. 2008;4:e1000106. doi: 10.1371/journal.pgen.1000106.
- 18.Kvon EZ, et al. Genome-scale functional characterization of Drosophila developmental enhancers in vivo. Nature. 2014;512:91–95. doi: 10.1038/nature13395.
- 19.Arnold CD, et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science. 2013;339:1074–1077. doi: 10.1126/science.1232542.
- 20.Zinzen RP, Girardot C, Gagneur J, Braun M, Furlong EEM. Combinatorial binding predicts spatio-temporal cis-regulatory activity. Nature. 2009;462:65–70. doi: 10.1038/nature08531.
- 21.May D, et al. Large-scale discovery of enhancers from human heart tissue. Nat. Genet. 2011;44:89–93. doi: 10.1038/ng.1006.
- 22.Narlikar L, et al. Genome-wide discovery of human heart enhancers. Genome Res. 2010;20:381–392. doi: 10.1101/gr.098657.109.
- 23.Ghandi M, Lee D, Mohammad-Noori M, Beer MA. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol. 2014;10:e1003711. doi: 10.1371/journal.pcbi.1003711.
- 24.Kantorovitz MR, et al. Motif-blind, genome-wide discovery of cis-regulatory modules in Drosophila and mouse. Dev. Cell. 2009;17:568–579. doi: 10.1016/j.devcel.2009.09.002.
- 25.Meuleman W, et al. Index and biological spectrum of human DNase I hypersensitive sites. Nature. 2020;584:244–251. doi: 10.1038/s41586-020-2559-3.
- 26.Lareau CA, et al. Droplet-based combinatorial indexing for massive-scale single-cell chromatin accessibility. Nat. Biotechnol. 2019;37:916–924. doi: 10.1038/s41587-019-0147-6.
- 27.Satpathy AT, et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 2019;37:925–936. doi: 10.1038/s41587-019-0206-z.
- 28.Cusanovich DA, et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell. 2018;174:1309–1324. doi: 10.1016/j.cell.2018.06.052.
- 29.Dunham I, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
- 30.Ernst J, Kellis M. ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods. 2012;9:215–216. doi: 10.1038/nmeth.1906.
- 31.Yan J, et al. Transcription factor binding in human cells occurs in dense clusters formed around cohesin anchor sites. Cell. 2013;154:801–813. doi: 10.1016/j.cell.2013.07.034.
- 32.Smith RP, et al. Massively parallel decoding of mammalian regulatory sequences supports a flexible organizational model. Nat. Genet. 2013;45:1021–1028. doi: 10.1038/ng.2713.
- 33.Weirauch MT, et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell. 2014;158:1431–1443. doi: 10.1016/j.cell.2014.08.009.
- 34.Rauluseviciute, I. et al. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 10.1093/nar/gkad1059 (2023).
- 35.He X, Samee MAH, Blatti C, Sinha S. Thermodynamics-based models of transcriptional regulation by enhancers: the roles of synergistic activation, cooperative binding and short-range repression. PLoS Comput. Biol. 2010;6:e1000935. doi: 10.1371/journal.pcbi.1000935.
- 36.Parker DS, White MA, Ramos AI, Cohen BA, Barolo S. The cis-regulatory logic of Hedgehog gradient responses: key roles for Gli binding affinity, competition and cooperativity. Sci. Signal. 2011;4:ra38. doi: 10.1126/scisignal.2002077.
- 37.Fukaya T, Lim B, Levine M. Enhancer control of transcriptional bursting. Cell. 2016;166:358–368. doi: 10.1016/j.cell.2016.05.025.
- 38.Deplancke B, Alpern D, Gardeux V. The genetics of transcription factor DNA binding variation. Cell. 2016;166:538–554. doi: 10.1016/j.cell.2016.07.012.
- 39.Jolma A, et al. DNA-binding specificities of human transcription factors. Cell. 2013;152:327–339. doi: 10.1016/j.cell.2012.12.009.
- 40.Zhu F, et al. The interaction landscape between transcription factors and the nucleosome. Nature. 2018;562:76–81. doi: 10.1038/s41586-018-0549-5.
- 41.Koo PK, Majdandzic A, Ploenzke M, Anand P, Paul SB. Global importance analysis: an interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput. Biol. 2021;17:e1008925. doi: 10.1371/journal.pcbi.1008925.
- 42.Karollus A, Mauermeier T, Gagneur J. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biol. 2023;24:56. doi: 10.1186/s13059-023-02899-9.
- 43.Toneyan S, Tang Z, Koo PK. Evaluating deep learning for predicting epigenomic profiles. Nat. Mach. Intell. 2022;4:1088–1100. doi: 10.1038/s42256-022-00570-9.
- 44.Yuan H, Kelley DR. scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat. Methods. 2022;19:1088–1096. doi: 10.1038/s41592-022-01562-8.
- 45.Vaishnav, E. D. et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature 10.1038/s41586-022-04506-6 (2022).
- 46.Zrimec J, et al. Controlling gene expression with deep generative design of regulatory DNA. Nat. Commun. 2022;13:5099. doi: 10.1038/s41467-022-32818-8.
- 47.Killoran, N., Lee, L. J., Delong, A., Duvenaud, D. & Frey, B. J. Generating and designing DNA with deep generative models. Preprint at 10.48550/arXiv.1712.06148 (2017).
- 48.Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015;33:831–838. doi: 10.1038/nbt.3300.
- 49.Preger-Ben Noon E, et al. Comprehensive analysis of a cis-regulatory region reveals pleiotropy in enhancer function. Cell Rep. 2018;22:3021–3031. doi: 10.1016/j.celrep.2018.02.073.
- 50.Brennan KJ, et al. Chromatin accessibility in the Drosophila embryo is determined by transcription factor pioneering and enhancer activation. Dev. Cell. 2023;58:1898–1916. doi: 10.1016/j.devcel.2023.07.007.
- 51.Vincent BJ, Estrada J, DePace AH. The appeasement of Doug: a synthetic approach to enhancer biology. Integr. Biol. 2016;8:475–484. doi: 10.1039/c5ib00321k.
- 52.Swanson CI, Schwimmer DB, Barolo S. Rapid evolutionary rewiring of a structurally constrained eye enhancer. Curr. Biol. 2011;21:1186–1196. doi: 10.1016/j.cub.2011.05.056.
- 53.Koo PK, Ploenzke M. Improving representations of genomic sequence motifs in convolutional networks with exponential activations. Nat. Mach. Intell. 2021;3:258–266. doi: 10.1038/s42256-020-00291-x.
- 54.King DM, et al. Synthetic and genomic regulatory elements reveal aspects of cis-regulatory grammar in mouse embryonic stem cells. eLife. 2020;9:e41279. doi: 10.7554/eLife.41279.
- 55.Davis JE, et al. Dissection of c-AMP response element architecture by using genomic and episomal massively parallel reporter assays. Cell Syst. 2020;11:75–85. doi: 10.1016/j.cels.2020.05.011.
- 56.Tsai A, Alves MR, Crocker J. Multi-enhancer transcriptional hubs confer phenotypic robustness. eLife. 2019;8:e45325. doi: 10.7554/eLife.45325.
- 57.Fuqua T, et al. Dense and pleiotropic regulatory information in a developmental enhancer. Nature. 2020;587:235–239. doi: 10.1038/s41586-020-2816-5.
- 58.de Almeida, B. P. et al. Targeted design of synthetic enhancers for selected tissues in the Drosophila embryo. Nature 10.1038/s41586-023-06905-9 (2024).
- 59.Imrichova, H. & Aerts, S. ChIP–seq meta-analysis yields high quality training sets for enhancer classification. Preprint at bioRxiv 10.1101/388934 (2018).
- 60.Virtanen P, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods. 2020;17:261–272. doi: 10.1038/s41592-019-0686-2.
- 61.Hunter JD. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 2007;9:90–95. doi: 10.1109/MCSE.2007.55.
- 62.Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at 10.48550/arXiv.1603.04467 (2015).
- 63.Harris CR, et al. Array programming with NumPy. Nature. 2020;585:357–362. doi: 10.1038/s41586-020-2649-2.
- 64.Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26:990–999. doi: 10.1101/gr.200535.115.
- 65.Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. Preprint at 10.48550/arXiv.1704.02685 (2019).
- 66.Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Proc. 31st International Conference on Neural Information Processing Systems 4768–4777 (2017).
- 67.Shrikumar, A. et al. Technical note on transcription factor motif discovery from importance scores (TF-MoDISco) version 0.5.6.5. Preprint at 10.48550/arXiv.1811.00416 (2020).
- 68.Frith MC, Li MC, Weng Z. Cluster-Buster: finding dense clusters of motifs in DNA sequences. Nucleic Acids Res. 2003;31:3666–3668. doi: 10.1093/nar/gkg540.
- 69.Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033.
- 70.Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8:R24. doi: 10.1186/gb-2007-8-2-r24.
- 71.Bravo González-Blas C, et al. SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks. Nat. Methods. 2023;20:1355–1367. doi: 10.1038/s41592-023-01938-4.
- 72.Ramírez F, et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 2016;44:W160–W165. doi: 10.1093/nar/gkw257.
- 73.Sahu, B. et al. Sequence determinants of human gene regulatory elements. Nat. Genet. 10.1038/s41588-021-01009-4 (2022).
- 74.Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. Improved training of Wasserstein GANs. Preprint at 10.48550/arXiv.1704.00028 (2017).
- 75.Thijs G, et al. INCLUSive: INtegrated Clustering, Upstream sequence retrieval and motif Sampling. Bioinformatics. 2002;18:331–332. doi: 10.1093/bioinformatics/18.2.331.
- 76.Aerts S, et al. Robust target gene discovery through transcriptome perturbations and genome-wide enhancer predictions in Drosophila uncovers a regulatory basis for sensory specification. PLoS Biol. 2010;8:e1000435. doi: 10.1371/journal.pbio.1000435.
- 77.Mauduit D, et al. Analysis of long and short enhancers in melanoma cell states. eLife. 2021;10:e71735. doi: 10.7554/eLife.71735.
- 78.Corces MR, et al. An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nat. Methods. 2017;14:959–962. doi: 10.1038/nmeth.4396.
- 79.Vasimuddin, M. D., Misra, S., Li, H. & Aluru, S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 314–324 (IEEE, 2019); 10.1109/IPDPS.2019.00041.
- 80.Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 81.Gaspar, J. Improved peak-calling with MACS. Preprint at bioRxiv 10.1101/496521 (2018).
- 82.Verfaillie A, et al. Decoding the regulatory landscape of melanoma reveals TEADS as regulators of the invasive cell state. Nat. Commun. 2015;6:6683. doi: 10.1038/ncomms7683.
- 83.Wouters J, et al. Robust gene expression programs underlie recurrent cell states and phenotype switching in melanoma. Nat. Cell Biol. 2020;22:986–998. doi: 10.1038/s41556-020-0547-3.
- 84.Avsec Ž, et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 2019;37:592–600. doi: 10.1038/s41587-019-0140-0.
- 85.Kircher M, et al. Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nat. Commun. 2019;10:3583. doi: 10.1038/s41467-019-11526-w.