Abstract
Enhancers control gene expression and have crucial roles in development and homeostasis1–3. However, the targeted de novo design of enhancers with tissue-specific activities has remained challenging. Here we combine deep learning and transfer learning to design tissue-specific enhancers for five tissues in the Drosophila melanogaster embryo: the central nervous system, epidermis, gut, muscle and brain. We first train convolutional neural networks using genome-wide single-cell assay for transposase-accessible chromatin with sequencing (ATAC-seq) datasets and then fine-tune the convolutional neural networks with smaller-scale data from in vivo enhancer activity assays, yielding models with 13% to 76% positive predictive value according to cross-validation. We designed and experimentally assessed 40 synthetic enhancers (8 per tissue) in vivo, of which 31 (78%) were active and 27 (68%) functioned in the target tissue (100% for central nervous system and muscle). The strategy of combining genome-wide and small-scale functional datasets by transfer learning is generally applicable and should enable the design of tissue-, cell type- and cell state-specific enhancers in any system.
Subject terms: Gene regulation, Machine learning
Deep learning and transfer learning were used to design tissue-specific enhancers in the Drosophila embryo that were active and specific, validating this approach to achieve tissue-, cell type- and cell state-specific expression control.
Main
Enhancers are non-coding DNA elements that activate transcription from target promoters in a highly cell type-specific fashion1. Although the existence of enhancer activities within DNA sequences has been recognized since the early 1980s2, and hundreds of enhancers have been functionally characterized in model organisms such as flies4 and mice5, the precise encoding of regulatory activities within the DNA sequence has remained elusive. Specifically, although it is known that enhancer sequences contain binding sites for transcription factors, the specific arrangement of these sites and the potential role of additional sequence properties have remained unknown, hampering the prediction and the de novo design of enhancers with tissue-specific activities.
By utilizing genome-wide enhancer activity datasets in a model cell line, it is possible to train deep learning convolutional neural networks (CNNs) to predict enhancer activity and strength directly from the DNA sequence and to design synthetic enhancers de novo6. However, extending this achievement to in vivo systems has been challenging, presumably owing to the limited number of functionally characterized enhancers, which has remained relatively low, typically falling below a few hundred per tissue in flies4 and mice5. Such quantities have been considered insufficient for effectively training deep learning models.
A widely applicable approach to improving prediction performance with limited data is transfer learning, which has been used successfully in various fields7, including cell biology8, network biology9 and genomics10–13. Transfer learning involves pre-training models using large-scale datasets that share similarities with the target task, followed by target task-specific adjustment or fine-tuning on smaller datasets. Provided pre-training is carried out with datasets sufficiently similar to the target task, transfer learning yields improved prediction performance7. To predict enhancer activity from the DNA sequence, leveraging genome-wide datasets of enhancer-associated chromatin features as a stepping stone seems particularly promising (see, for example, refs. 3,11,13,14).
Single-cell assay for transposase-accessible chromatin with sequencing (scATAC-seq) datasets provide measurements of DNA accessibility at the single-cell level and thus allow the determination of cell type-specific accessibility profiles even within complex tissues comprising diverse cell populations15. Given the association of enhancers with accessible chromatin, we decided to use a combination of scATAC-seq datasets and results from in vivo enhancer activity assays to develop a deep learning model predictive of enhancer activity using transfer learning (Fig. 1a).
Specifically, we selected four prominent and distinct tissues within the 10- to 12-hour-old Drosophila melanogaster embryo, namely the central nervous system (CNS), epidermis, muscle and gut. In addition, we selected enhancers that were specifically active in the brain but not in the rest of the CNS, an enhancer–activity pattern that we considered particularly challenging given the shared cell types with the CNS and the relatively small number of functionally characterized brain-specific enhancers available for training.
We first trained single-task CNNs to map 1-kb-long DNA sequences tiled across the genome to the corresponding pseudo-bulk ATAC-seq signals based on our recently published scATAC-seq atlas of the Drosophila embryo16 (sequence-to-accessibility models; Fig. 1a and Extended Data Fig. 1a). We used a tenfold chromosome hold-out cross-validation scheme to train and evaluate the predictive performance of the model. As expected on the basis of previous work6,17–20, these models performed well with Pearson correlation coefficients (PCCs) between the predicted and experimentally measured ATAC-seq signals of approximately 0.73 for all tissues in all held-out test set chromosomes (range of PCCs: 0.72–0.75; Fig. 1a and Extended Data Fig. 1b,d). Moreover, using model-interpretation tools21–24 revealed known transcription factor motifs, such as GGGGT (Kr and Ttk) for CNS25, and motifs for Grh for epidermis26, GATA for gut27–29, Mef2, forkhead (Bin) and Twist for muscle30, and Zelda and Klu for brain31,32 (Extended Data Figs. 1e and 2, Supplementary Fig. 1 and Supplementary Table 1). Finally, the models also captured cell type-specific differences in accessibility, that is, sites that were preferentially accessible in specific tissues were also predicted to be accessible in these tissues (Extended Data Fig. 1c).
We next utilized functionally characterized enhancers from our previous work4,33 for transfer learning to build sequence-to-activity models. We framed the enhancer–activity prediction task as a binary classification (active/inactive) as the in vivo enhancer–activity data are derived from annotated non-quantitative in situ hybridization assays4,33. We initialized CNNs that predict tissue-specific enhancer activities directly from the DNA sequence with the sequence-to-accessibility models trained on ATAC-seq data for the respective tissues (CNS, epidermis, gut, muscle and brain; see previous paragraph), and trained them on the enhancer prediction task until convergence (Fig. 1a; see Methods). We evaluated the models using cross-validation with left-out datasets containing active and inactive enhancers, with and without ATAC-seq signals. This revealed that the sequence-to-activity models obtained by transfer learning substantially improved the predictions for all five tissues as assessed by several performance measures compared to: (1) models directly trained on the in vivo enhancer activity data starting from random initialization; (2) models pre-trained on ATAC-seq data from a different tissue (salivary gland); and (3) the sequence-to-accessibility models without transfer learning (Fig. 1b and Extended Data Figs. 3 and 4). The transfer-learned models also outperformed the other models in correctly discriminating accessible regions with and without enhancer activity, and the improvement was particularly strong for muscle and brain, which had the fewest known enhancers for training (177 and 119, respectively) (Extended Data Fig. 5). The models also reliably discriminated additional positive and negative control enhancers, including the known enhancers in tissue-specific marker gene loci (Extended Data Fig. 6).
Moreover, and particularly relevant for enhancer design, in which only a very limited number of predictions can be tested in vivo, these models reached positive predictive values (PPVs) between 36% (brain) and 88% (CNS) at prediction thresholds that recovered at least 10 known enhancers during cross-validation (or PPVs between 13% and 76% at ≥50 known enhancers; Fig. 1c), suggesting that it was reasonable to attempt the de novo design of synthetic enhancers for these tissues. We therefore proceeded to design synthetic enhancers with defined tissue-specific activities de novo (Fig. 2a). Specifically, we created random sequences with a zero-order Markov model and selected 8 enhancers for each of the 5 tissues (40 enhancers total) that had high predicted accessibility and activity scores specifically in the CNS, epidermis, gut, muscle or brain, focusing on distinct motif signatures when possible to remove potential redundancies (see Methods; Extended Data Figs. 7 and 8, Supplementary Fig. 2 and Supplementary Table 2).
We ordered the designed enhancer sequences, cloned them into a previously used reporter system that features a minimal hsp70 promoter and lacZ reporter gene, and integrated the constructs into a consistent landing site in the Drosophila genome33 (see Methods for details on the reporter system and its properties). We then collected and fixed embryos and scored the enhancer activities of the candidates by two-colour fluorescent in situ hybridization, comparing lacZ reporter expression to the expression of the tissue-specific marker genes elav (CNS), wg (epidermis), GATAe (gut), Mef2 (muscle) and tll (brain). In addition to a qualitative visual assessment, we also quantitatively compared the expression patterns by pixel-wise PCCs across the entire volumes of the acquired microscopy image z-stacks.
This revealed that eight out of eight CNS enhancers were active in the CNS; some of these had additional, mainly weak and sporadic, activity in the peripheral nervous system (Fig. 2, Extended Data Fig. 9a and Supplementary Table 2). Similarly, seven out of eight epidermis enhancers and eight out of eight muscle enhancers functioned specifically in the epidermis and muscle, respectively (Fig. 2, Extended Data Fig. 9b,d and Supplementary Table 2). For both the gut and brain enhancers, two out of eight were active in the respective target tissue and had partial additional activities in other tissues such as the CNS, salivary gland or amnioserosa (Fig. 2, Extended Data Fig. 9c,e and Supplementary Table 2), in line with the expectations from cross-validation. These results from our qualitative visual assessment were confirmed by quantitative assessment of pattern similarities (Extended Data Fig. 10 and Supplementary Table 2). All patterns deemed correct by visual assessment and three out of the four gut enhancer patterns that were deemed incorrect by visual assessment were significantly different from random and negative control patterns (t-test P value < 0.05; n = 4 embryos).
Notably, given the aim of this study to target broad tissue types that comprise distinct subtypes, not all of the enhancers that were active in the correct target tissue exhibited identical activity patterns. For example, the epidermis enhancers were active in segmental and/or pharyngeal parts of the epidermis, and a similar sub-pattern variability within the correct overall tissue type was seen for CNS and muscle (Extended Data Fig. 9). Also notable are the different success rates for muscle (100%) and gut (25%), and the observation that several gut enhancers were active outside the gut in epidermis, sensory complexes and amnioserosa (Extended Data Fig. 9c and Supplementary Table 2). This probably stems from a more complex gut ‘enhancer grammar’ involving low-information GATA motifs (for example, in Fig. 2c and Extended Data Fig. 2d): the five GATA transcription factors in the fly are utilized rather broadly in endoderm and gut (Serpent and dGATAe34,35), but also in amnioserosa, dorsal epidermis, the heart (Pannier36,37) and other tissues38—that is, the very tissues for which we observe ectopic gut enhancer activity. In this context, it is notable that the pattern similarity (PCC) with the gut marker gene dGATAe is significantly above random for all but one of the gut enhancers deemed incorrect by visual assessment (and for all the correct ones), potentially indicating pattern overlap and/or relatedness of the tissues (Extended Data Figs. 9c and 10b). After this proof of concept at the level of broad tissue types, it will be interesting to see the development of more fine-grained models that discriminate between closely related tissue subtypes and individual cell types, especially those that share prominent transcription factors (such as GATA factors in gut and other tissues).
Overall, our work demonstrates the feasibility of targeted design of synthetic enhancers for selected tissues by deep and transfer learning. The framework proposed here should be applicable to any species and tissue provided a genome-wide dataset of enhancer-associated features (for example, DNA accessibility, characteristic histone modifications, transcription factor or cofactor binding and enhancer RNAs) and a reasonable number of functionally validated enhancers (in this study, more than 100 were used per tissue).
More traditional machine learning approaches have been used successfully for the prediction of chromatin features, transcription factor binding and enhancer sequences4,39–42 and for predicting genomic elements with highly constrained cis-regulatory codes and limited architectures (for example, core promoter elements43 or highly defined enhancer motif contexts44). However, flexible enhancer design has only become possible with deep learning6,45 (and ref. 46, which was published as a preprint while this manuscript was under review).
For the near future, we foresee great progress in deep and transfer learning approaches to the prediction and design of enhancers and other genomic regulatory elements. These will probably include the application of large multitask models trained simultaneously on many datasets comprising different tissues and cell types47. As predictive sequence features such as transcription factor motifs are often shared between tissues (for example, in Extended Data Fig. 2 and Supplementary Fig. 1), shared learning of large models might further improve model performance compared to the dedicated single-task models used here. Conversely, improved performance might come from the combination of many small, dedicated models such as the ones developed here, each specialized for one specific type of function or genomic element, into a larger overarching framework. Another likely improvement for the specific task of enhancer design will be the move from computational screening of random sequences, which can only sample a very small part of the possible sequence space, to a more direct and efficient way to generate synthetic enhancer sequences, such as the use of generative adversarial networks48, variational autoencoders49,50 and diffusion models51 that can ‘hallucinate’ possible solutions.
Our work complements approaches to design enhancers in or via cell culture models6,46 or via the modelling of cell type-characteristic DNA accessibility patterns and their sequence signatures (topic modelling45) and ongoing efforts to predict gene expression47 and 3D genome architecture52,53 from extended DNA sequences. Models to predict endogenous gene expression must integrate the regulatory cues of multiple enhancers acting from different distances, consider distinct promoter types with enhancer–promoter compatibilities, and insulator, silencer and tethering elements, together with the sequence determinants of RNA processing and stability. It will be interesting to see these models integrate lessons from enhancer-centric approaches to further develop and move towards designing entire synthetic gene loci with complex gene-expression patterns.
We envision that our work will synergize with ongoing efforts to build comprehensive ‘cell atlases’ for gene expression and DNA accessibility in the fly, mouse and human, thus providing the opportunity to design enhancers for many, if not all, tissues in these organisms, potentially even for aberrant tissue or cell states. In conclusion, our work not only demonstrates the remarkable progress in enhancer design made possible by deep and transfer learning and the growing datasets on enhancers and chromatin, but also sets the stage for a future in which the precise design and manipulation of gene-expression patterns become a reality.
Methods
Processing of pseudo-bulk DNA accessibility data
We retrieved sci-ATAC-seq3 mapped reads (dm6) for each of the 18 tissue pseudo-bulks (that is, mapped reads from all cells combined) at the 10–12 h timepoint from ref. 16 (downloaded from https://shendure-web.gs.washington.edu/content/members/DEAP_website/public/ on 1 February 2022, BAM files available upon request; see also Extended Data Fig. 1a). We generated coverage tracks for each tissue pseudo-bulk, including the five tissues of interest: CNS, brain, epidermis, midgut and muscle. Muscle was initially modelled separately for the somatic muscle and visceral muscle pseudo-bulks, as these were annotated separately in the respective publication; we proceeded with only the visceral muscle model, as explained below. All read fragments from each pseudo-bulk were used for peak calling with MACS2 (refs. 54,55) with the following command: macs2 callpeak --nomodel --keep-dup all --extsize 200 --shift -100 --gsize dm -B.
Deep learning sequence-to-accessibility models
Data preparation
We binned the dm6 genome (downloaded from https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz) into 1,001-bp windows with a stride of 50 bp, and filtered windows in the chromosomes chr2L, chr2R, chr3L, chr3R, chr4, chrX, chrY and chrM. For each window, we computed the log average of the depth-normalized ATAC coverage over the central 201 bp of the window. We combined the accessibility peaks of all scATAC-seq pseudo bulks and selected all bins whose central 151 bp were within any 301 bp-centred peak region. We further added 144,424 random windows throughout the genome with a range of accessibility levels to obtain a dataset with reasonable class imbalances while maintaining high diversity in negative examples. Finally, we only included windows with non-zero ATAC signals across every pseudo-bulk and removed the ones with outlier values (quantile <0.01 or >0.999 in any pseudo-bulk). We augmented our dataset by adding the reverse complement of each original sequence, with the same output, ending up with 464,203 examples (928,406 post-augmentation).
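The window tiling and reverse-complement augmentation described above can be illustrated with a minimal Python sketch. The function names are hypothetical and the coordinates are assumed to be 0-based and half-open; this is not the code used in the study, only an illustration of the two operations.

```python
def tile_windows(chrom_length, window=1001, stride=50):
    """Return (start, end) coordinates (0-based, half-open; an assumption)
    of all full-length windows tiled along one chromosome."""
    return [(s, s + window)
            for s in range(0, chrom_length - window + 1, stride)]


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def augment(examples):
    """examples: list of (sequence, signal) pairs.
    Each sequence is followed by its reverse complement,
    which keeps the same output value."""
    augmented = []
    for seq, signal in examples:
        augmented.append((seq, signal))
        augmented.append((reverse_complement(seq), signal))
    return augmented
```

Because every example contributes exactly one reverse complement, augmentation doubles the dataset, in line with the 464,203 examples becoming 928,406.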
Cross-validation scheme
We used cross-validation to obtain a more robust estimate of model performance. We divided the sequences into ten folds based on their chromosomal positions (considering chromosome halves; see Supplementary Table 3 for the specific folds used) and, in each cross-validation round, used eight folds for training, one for validation and one for testing. Each genomic window can therefore serve as an example in a training, validation/tuning or test set, depending on the round.
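A minimal sketch of such a fold rotation is given below. The actual assignment of chromosome halves to folds is given in Supplementary Table 3; the choice of the validation fold here (the fold following the test fold) is purely an assumption for illustration, and the function name is hypothetical.

```python
def cv_splits(n_folds=10):
    """Rotate through the folds: in each round one fold is held out for
    testing, one is used for validation and the rest for training.
    Picking the validation fold as (test + 1) % n_folds is an assumption."""
    splits = []
    for test in range(n_folds):
        val = (test + 1) % n_folds
        train = [f for f in range(n_folds) if f not in (test, val)]
        splits.append({"train": train, "val": val, "test": test})
    return splits
```

With ten folds, each round uses eight training folds, and every fold serves as the held-out test set exactly once.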
Model architecture and training
We used the previously optimized DeepSTARR CNN architecture for predicting genome-wide enhancer activity from DNA sequence with minor adaptations6. Using the DeepSTARR architecture as a starting point, we performed a hyperparameter grid search to yield the best performance on the DNA accessibility validation set of fold01 across the different tissues. The final CNN takes a one-hot-encoded, 1,001-bp-long DNA sequence (A = [1,0,0,0], C = [0,1,0,0], G = [0,0,1,0], T = [0,0,0,1]) as input to predict DNA accessibility signals. The CNN contains four 1D convolutional layers (filters = 256,120,60,60; size = 7,3,3,3; padding = same), each followed by batch normalization, a ReLU non-linearity, and max-pooling (size = 3). After the convolutional layers there are two fully connected layers, with 64 and 256 neurons, respectively, followed by batch normalization, a ReLU non-linearity, and dropout (fraction = 0.4). The final layer is mapped to the accessibility signal output. Hyperparameters were manually adjusted to yield the best performance on the validation set of one cross-validation fold. The models were implemented and trained in Keras (https://keras.io/) from TensorFlow v.1.14.0 (ref. 56) using the Adam optimizer (ref. 57; learning rate = 0.005), mean squared error as loss function, a batch size of 128, and early stopping with patience of five epochs.
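To make the input encoding and the effect of the four pooling steps concrete, the following Python sketch one-hot encodes a sequence and tracks the sequence length through the convolutional blocks. It assumes that max-pooling uses a stride equal to the pool size and no padding (the Keras default), which is not stated in the text; encoding non-ACGT characters as all zeros is likewise an assumption, and the helper names are hypothetical.

```python
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def one_hot(seq):
    """Encode a DNA sequence as a list of 4-element rows (A, C, G, T).
    Characters other than A/C/G/T become all zeros (an assumption)."""
    return [ONE_HOT.get(base, [0, 0, 0, 0]) for base in seq.upper()]


def feature_map_lengths(input_len=1001, n_blocks=4, pool=3):
    """Length of the feature map after each conv + max-pool block.
    'same' padding keeps the length in the convolution; pooling with
    stride == pool size and no padding floor-divides it (an assumption)."""
    lengths = []
    n = input_len
    for _ in range(n_blocks):
        n = n // pool
        lengths.append(n)
    return lengths
```

Under these assumptions the 1,001 positions shrink to 333, 111, 37 and finally 12, so the last convolutional layer (60 filters) would pass 12 × 60 = 720 values to the fully connected layers.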
To account for variance between different training runs and improve the accuracy and robustness of the models, we trained three replicate models on each held-out test fold (that is, 30 models for each pseudo-bulk tissue). After analysing the variance in predictions, and removing the model runs that did not converge (PCC on the test set ≤ 0.1), we averaged the predictions of the replicate models per test set.
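The replicate filtering and averaging can be sketched as follows. The function name and data layout (one list of predictions per replicate, plus one test-set PCC per replicate) are hypothetical; only the convergence criterion (PCC ≤ 0.1 is removed) is taken from the text.

```python
def average_converged(replicate_preds, replicate_pccs, min_pcc=0.1):
    """Average per-sequence predictions over replicate models,
    dropping replicates whose test-set PCC is <= min_pcc."""
    kept = [preds for preds, pcc in zip(replicate_preds, replicate_pccs)
            if pcc > min_pcc]
    if not kept:
        raise ValueError("no replicate model converged")
    n = len(kept)
    # zip(*kept) groups the predictions of all kept replicates per sequence
    return [sum(values) / n for values in zip(*kept)]
```

For example, with three replicates of which one has a PCC of 0.05, only the remaining two contribute to the average.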
Model performance
The performance of each model was evaluated on the held-out test chromosomes of each fold. We used the PCC across all bins for a quantitative genome-wide evaluation.
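For reference, the PCC between predicted and measured signals across bins can be computed as in the following minimal Python sketch (the function name is hypothetical; in practice a library routine would normally be used).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```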
Prediction on full Drosophila genome
We extracted 1,001 bp sequences tiled across the Drosophila dm6 genome (downloaded from https://hgdownload.soe.ucsc.edu/goldenPath/dm6/bigZips/dm6.fa.gz) with a stride of 20 bp using bedtools makewindows (parameters -w 1001 -s 20) and bedtools getfasta (ref. 58). For each model, we next predicted the accessibility of each genomic window and averaged these per nucleotide to obtain genome-wide coverage.
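The per-nucleotide averaging of overlapping window predictions can be sketched as below. This is a deliberately simple illustration with hypothetical names and 0-based window starts; for a whole genome, a cumulative-sum (difference-array) formulation would be far more efficient.

```python
def per_nucleotide_coverage(window_starts, predictions, genome_length,
                            window=1001):
    """Average the predictions of all windows covering each position.
    Positions not covered by any window are returned as None."""
    total = [0.0] * genome_length
    count = [0] * genome_length
    for start, pred in zip(window_starts, predictions):
        for pos in range(start, min(start + window, genome_length)):
            total[pos] += pred
            count[pos] += 1
    return [t / c if c else None for t, c in zip(total, count)]
```

With 1,001-bp windows and a 20-bp stride, most positions are covered by about 50 windows, so each nucleotide's value is the mean of roughly 50 predictions.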
Nucleotide contributions
We used DeepExplainer (the DeepSHAP implementation of DeepLIFT; see refs. 22–24; update from https://github.com/AvantiShri/shap/blob/master/shap/explainers/deep/deep_tf.py) to compute contribution scores for all nucleotides in all sequences with respect to the accessibility predictions. We used 100 dinucleotide-shuffled versions of each input sequence as reference sequences. For each sequence, the obtained hypothetical importance scores were multiplied by the one-hot encoded matrix of the sequences to derive the final nucleotide contribution scores. We used one replicate model for each of the 10 folds of cross-validation and averaged the scores for each sequence in each cell type across all the 10 folds. The nucleotide contribution scores were visualized using the ggseqlogo function from the R package ggseqlogo (v.0.124).
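The last two steps, projecting hypothetical importance scores onto the observed bases and averaging across folds, can be illustrated with the following Python sketch. It does not compute DeepSHAP scores itself; the hypothetical scores are taken as given, and the function names and list-based data layout are assumptions.

```python
BASES = "ACGT"


def contribution_scores(seq, hypothetical):
    """Multiply hypothetical scores (one [A, C, G, T] row per position)
    by the one-hot encoding of seq, so that only the score of the base
    actually present at each position is retained."""
    scores = []
    for base, row in zip(seq.upper(), hypothetical):
        one_hot_row = [1 if base == b else 0 for b in BASES]
        scores.append([h * o for h, o in zip(row, one_hot_row)])
    return scores


def average_over_folds(per_fold_scores):
    """Element-wise mean of the score matrices obtained from each fold."""
    n_folds = len(per_fold_scores)
    n_positions = len(per_fold_scores[0])
    return [[sum(fold[i][j] for fold in per_fold_scores) / n_folds
             for j in range(4)]
            for i in range(n_positions)]
```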
Motif discovery using TF-Modisco
To find important predictive motifs, we ran TF-Modisco (v.0.5.12.0 (ref. 21)) on the nucleotide contribution scores of one model fold for each tissue type separately, using the respective accessible regions. We specified the following parameters: sliding_window_size=15, flank_size=5, max_seqlets_per_metacluster=50000 and TfModiscoSeqletsToPatternsFactory(trim_to_window_size=15, initial_flank_to_add=5, final_min_cluster_size=30). We trimmed the PWM motifs by removing flanking positions with an information content lower than 0.4. The motifs discovered by TF-Modisco are detailed in Extended Data Fig. 2, together with the converted PWM logo and the closest match from the transcription factor motif database available at https://github.com/bernardo-de-almeida/motif-clustering (ref. 6); similarity was assessed using TOMTOM (ref. 59) with the following command: tomtom -dist kullback -motif-pseudo 0.1 -text -min-overlap 1.
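The flank trimming can be sketched as follows. The text does not specify how information content was computed; the sketch assumes the standard definition in bits against a uniform background (2 + Σ p·log2 p per position), and the function names are hypothetical.

```python
import math


def information_content(column):
    """Information content (bits) of one PWM column of A/C/G/T
    probabilities, assuming a uniform background (an assumption)."""
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)


def trim_pwm(pwm, min_ic=0.4):
    """Remove flanking columns with information content below min_ic.
    Low-information columns inside the motif are kept."""
    informative = [i for i, col in enumerate(pwm)
                   if information_content(col) >= min_ic]
    if not informative:
        return []
    return pwm[informative[0]:informative[-1] + 1]
```

Only the flanks are trimmed: a low-information position between two informative ones stays part of the motif.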
Transcription factor motif analyses across tissues
For the transcription factors that we could assign to the identified motifs, we retrieved their RNA in situ expression data at Drosophila embryogenesis stage 13–16 from the Berkeley Drosophila Genome Project (BDGP; https://insitu.fruitfly.org/cgi-bin/ex/insitu.pl) and matched their tissue annotation with the tissues used for the sequence-to-accessibility model (see Supplementary Fig. 1b for summary results across tissues and Supplementary Table 1 for full annotation). In addition, we retrieved the expression of these transcription factors in matched single-cell RNA-seq clusters from the same publication from which we retrieved the single-cell ATAC-seq data16. The cluster assignment was done through non-negative least-squares matrix factorization (see respective publication for details and data; https://shendure-web.gs.washington.edu/content/members/DEAP_website/public/). Transcription factor expression across tissues is displayed in Supplementary Fig. 1c and Supplementary Table 1.
Deep learning sequence-to-activity models
Data preparation
We retrieved the in vivo enhancer activity data from the CAD4 database (available in supplementary table 13 in ref. 33), which also includes all enhancer activity data from the Vienna Tiles library (https://enhancers.starklab.org/). For each of the 5 tissues of interest (CNS, epidermis, gut, muscle, brain-specific), we defined sequences as active if they were active between stages 13 and 16 in any of the related tissue annotation terms. CNS: ventral nerve cord, neuroblast of ventral nerve cord primordium, embryonic brain, embryonic central brain, embryonic central brain glial cell, embryonic central brain neuron; epidermis: embryonic dorsal epidermis, embryonic ventral epidermis, embryonic head epidermis, lateral head epidermis, embryonic lateral epidermis, embryonic ventral trunk epidermis, ventral head epidermis, dorsal head epidermis, embryonic epidermis; gut: embryonic hindgut, embryonic midgut chamber, hindgut, embryonic/larval midgut, foregut, midgut interstitial cell; muscle: embryonic/larval somatic muscle, somatic muscle, embryonic somatic muscle, visceral muscle, embryonic/larval visceral muscle, circular visceral muscle fibre, longitudinal visceral muscle fibre, oesophageal visceral muscle, embryonic/larval muscle system, muscle system, dorsal pharyngeal muscle; brain-specific: active in embryonic brain, embryonic central brain, embryonic central brain glial cell or embryonic central brain neuron, and inactive in the ventral nerve cord (VNC) terms (ventral nerve cord, neuroblast of ventral nerve cord primordium). All the remaining sequences were considered inactive for the respective tissues. For data augmentation, we tiled every sequence in 1,001 bp windows and also added the reverse complement of each original sequence, with the same output, ending up with 176,424 examples (352,848 post-augmentation). Separately for each tissue, we further filtered for active sequences that overlap (minimum overlap of 151 bp) accessibility peaks of the respective tissue to obtain a cleaner positive set. For negative fragments, we selected at most five different sequences to keep reasonable class imbalances.
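The labelling logic, in particular the difference between the CNS and brain-specific classes, can be illustrated with the following Python sketch. It assumes that each enhancer comes with the set of annotation terms in which it is active between stages 13 and 16; the function names and this data layout are hypothetical.

```python
BRAIN_TERMS = {
    "embryonic brain",
    "embryonic central brain",
    "embryonic central brain glial cell",
    "embryonic central brain neuron",
}
VNC_TERMS = {
    "ventral nerve cord",
    "neuroblast of ventral nerve cord primordium",
}
CNS_TERMS = BRAIN_TERMS | VNC_TERMS


def is_cns_active(active_terms):
    """Active in the CNS if any CNS-related term is annotated."""
    return bool(set(active_terms) & CNS_TERMS)


def is_brain_specific(active_terms):
    """Brain-specific: active in a brain term and in no VNC term."""
    terms = set(active_terms)
    return bool(terms & BRAIN_TERMS) and not (terms & VNC_TERMS)
```

An enhancer annotated with both "embryonic brain" and "ventral nerve cord" is therefore a positive example for the CNS model but a negative example for the brain-specific model.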
Cross-validation scheme
We used the same cross-validation folds for training, validation and testing as for the accessibility models. Hence, for each fold, the test sets are completely held out across both stages of training.
Model architecture and training
The architecture and weights learned in the first model of the respective tissue were used to initialize this second CNN model to classify DNA sequences based on their activity in vivo, an approach known as transfer learning. For muscle we initialized the model with the visceral muscle accessibility model because it led to a slightly higher performance than initializing with the somatic muscle model (AUPRC of 0.14 vs. 0.12, respectively). We kept all layers trainable and changed the last layer to a sigmoid activation. The models were trained using the Adam optimizer57 (with smaller learning rate = 0.0001), binary cross-entropy as loss function, a batch size of 128, and early stopping with patience of twenty epochs.
To account for variance between different training runs and improve the accuracy and robustness of the models, we trained three replicate models on each held-out test fold (that is, 30 models for each of the five tissues, total of 150 models). After analysing the variance in predictions, and removing the model runs that did not converge (area under the curve ≤ 0.7), we averaged the predictions of the replicate models per test set.
Model performance
We assessed the performance of the models of each tissue only on the original, non-augmented Vienna Tiles data, to have a more unbiased set of active and inactive sequences. To have a confident set of positive sequences, we considered as active sequences only the accessibility peaks of the respective tissue that fall (minimum overlap of 201 bp) within tiles active in the respective tissue. As negative sequences we considered both the accessibility peaks that fall (minimum overlap of 201 bp) within tiles inactive in the respective tissue, as well as all other sequences in inactive tiles. We computed the predictions for each sequence using the respective cross-validation set where the sequence is held out for testing. Using this set of active and inactive tiles per tissue, model performance was assessed using the AUPRC, accuracy and F1-scores (all calculated using confusionMatrix from R package caret v.6.0-90 (ref. 60)), and by estimating the positive predictive value (percentage of validated active sequences among all positive predictions) at different prediction thresholds.
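The positive predictive value at a given prediction threshold follows directly from the definition above. The Python sketch below (hypothetical function name; labels are 1 for active and 0 for inactive) also returns the number of known active sequences recovered, which is the quantity used to choose thresholds in the main text (for example, at least 10 or at least 50 recovered enhancers).

```python
def ppv_at_threshold(scores, labels, threshold):
    """Return (PPV, number of true positives) among sequences whose
    prediction score is >= threshold. PPV is None if nothing is predicted
    positive."""
    predicted_positive = [label for score, label in zip(scores, labels)
                          if score >= threshold]
    if not predicted_positive:
        return None, 0
    true_positives = sum(predicted_positive)
    return true_positives / len(predicted_positive), true_positives
```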
We also evaluated the sequence-to-activity models for known tissue-specific enhancers in marker gene loci of each tissue (enhancers in our database located within ±50 kb of the transcription start site): elav (CNS), grh (epidermis), GATAe (gut), Mef2 (muscle) and tll (brain) (Extended Data Fig. 6). There were no enhancers in the locus of the epidermis marker gene wg, so we replaced it with the epidermis marker gene grh.
Comparison with different model initializations
For each of the five tissues, we compared the performance of the fine-tuned models with transfer learning with (1) models pre-trained on DNA accessibility of a different tissue (salivary gland, since it has very different profiles when compared with the five tissues of interest; see Extended Data Fig. 1a) and (2) models directly trained on the in vivo enhancer activity data starting from random initialization (no fine-tuning). Model architecture, training and cross-validation schemes, as well as performance evaluation were identical to the ones described above for the main model.
Nucleotide contributions
Same as described for the accessibility models above.
Computational design of Drosophila enhancers
Three billion random 501 bp DNA sequences were generated in bash with the following code (cat /dev/urandom | tr -dc 'ACGT' | fold -w 501 | head -n 3000000000) and flanked left and right with random 250 bp sequences to obtain 1,001 bp long sequences. We predicted these sequences’ activities and accessibilities with one replicate model per tissue (taking less than 10 min for 100,000 sequences per model on a single CPU) until we had ~15,000 sequences predicted to be specifically active and accessible in the five target tissues (CNS, epidermis, gut, muscle, brain). From the top 3,000 candidates, we randomly sampled 100 and computed the nucleotide contribution scores for visual inspection of motif content and arrangement, alongside the candidates’ prediction scores. We made sure the predicted activity is independent of the ±250 bp flanks by predicting the activity of each of the selected middle 501-bp sequences with 100 different ±250 bp flanks. Based on this combined information, we then manually selected eight candidates per tissue for testing in vivo (Supplementary Table 2). We searched the candidate synthetic enhancers against the Drosophila genome (taxid:7227) using Blastn via NIH NCBI Blast (https://blast.ncbi.nlm.nih.gov/Blast.cgi) with default parameters, except for a word size of 7 (smallest and thus most sensitive setting) and an expectation value (E value) threshold of 10. Two candidates (active muscle_synth5 and inactive gut_synth9) had matches with E values of 0.032, which corresponds to 22/501 bp shared sequence; no other candidate had matches with E value ≤ 0.1.
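The construction of the 1,001-bp test sequences and of the flank variants used for the flank-independence check can be sketched in Python as follows. The sketch uses a seedable pseudo-random generator rather than /dev/urandom, draws all four bases with equal probability, and uses hypothetical function names; the model-scoring step is not shown.

```python
import random


def random_sequence(length, rng):
    """Random DNA sequence with equal base probabilities."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def with_random_flanks(core, rng, flank=250):
    """Embed a core sequence between two random flanks of equal length."""
    return random_sequence(flank, rng) + core + random_sequence(flank, rng)


def flank_variants(core, rng, n=100, flank=250):
    """Generate n versions of the same core with different random flanks,
    which can then be scored to check that predictions depend on the core
    rather than on the flanks."""
    return [with_random_flanks(core, rng, flank) for _ in range(n)]
```

A 501-bp core with two 250-bp flanks gives the 1,001-bp input length expected by the models; for example, `flank_variants(random_sequence(501, random.Random(0)), random.Random(1))` yields 100 such inputs sharing one core.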
Distribution of prediction scores in random sequences
We scored 100,000 random 1,001 bp sequences with the sequence-to-activity transfer-learned models as well as activity models directly trained on the in vivo enhancer activity data starting from random initialization. For each tissue, we used the same replicate model as for the random sequence selection above. We calculated the percentiles of the final 40 synthetic enhancers in the distributions of the two models in each tissue.
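A percentile of this kind can be computed as in the sketch below. It defines the percentile as the percentage of background (random-sequence) scores that are less than or equal to the candidate's score; this definition and the function name are assumptions, as the text does not specify the exact convention used.

```python
from bisect import bisect_right


def percentile_of_score(background_scores, score):
    """Percentage of background scores that are <= score."""
    ordered = sorted(background_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)
```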
Final enhancer activity scores of the selected 40 candidates
To obtain the final expected enhancer activities (= final scores) for the selected 40 candidates, we placed the 501-bp sequence of each candidate within the ±250-bp flanks of the actual reporter construct and scored the resultant 1-kb sequences with the transfer-learned sequence-to-activity models of each tissue. We used one replicate model for each of the ten folds of cross-validation and averaged the predictions across folds.
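The fold-averaging step can be sketched as follows. This is an illustrative Python example only: the flank sequences and the ten "models" are hypothetical stand-ins (simple callables returning fixed scores), not the trained networks or the real reporter plasmid sequence.

```python
from statistics import mean


def embed_in_flanks(core, left_flank, right_flank):
    """Place a 501-bp candidate between fixed 250-bp reporter-construct flanks."""
    if len(core) != 501 or len(left_flank) != 250 or len(right_flank) != 250:
        raise ValueError("expected a 501-bp core and two 250-bp flanks")
    return left_flank + core + right_flank


def final_score(sequence, fold_models):
    """Average the predictions of one model per cross-validation fold."""
    return mean(model(sequence) for model in fold_models)


# Hypothetical stand-ins: ten "models" that each return a fixed score.
fold_models = [lambda seq, s=0.1 * k: s for k in range(10)]
construct = embed_in_flanks("A" * 501, "C" * 250, "G" * 250)
score = final_score(construct, fold_models)
```

Averaging across folds means that no single train/test split determines the final score of a candidate.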
Nucleotide contributions
Nucleotide contributions were computed as described for the accessibility models above, but using the 501-bp synthetic sequences flanked by the actual sequence of the plasmid into which they were inserted for testing in vivo.
Cloning of synthetic Drosophila enhancers
The 501-bp synthetic sequences (designed above; Supplementary Table 2) were ordered from Twist Bioscience flanked by 20-bp linkers for Gibson assembly (5′, GAATTGGGAATTCGTTAACA; 3′, TGGTCTAGAGCCCGGGCGAA). Using Gibson assembly, sequences were cloned into a BglII-linearized, attB-containing plasmid33, upstream of a minimal hsp70 promoter driving a lacZ reporter gene. Plasmids were verified by Sanger sequencing. For each plasmid, 27 µg (45 µl at 600 ng µl−1) was sent to BestGene for injection into Drosophila embryos (integration site: http://flybase.org/reports/FBst0024482.html), and positive transformants were selected. All constructs were injected according to standard methods and inserted into the attP landing site line M{3×P3-RFP.attP′}ZH-51C by PhiC31 integrase-mediated insertion, yielding integration at chromosomal position 51C1.
Such reporter systems provide an opportunity to measure enhancer activities and the enhancers' spatio-temporal activity patterns in a constant and controlled environment4,33,61. The hsp70 core promoter has been widely used for transgene expression and enhancer testing (for example, ref. 33) and functions very similarly to other developmental promoters (for example, DSCP)62. Although controlled reporter systems differ from endogenous gene regulation, we previously found that 82% of the enhancer–activity patterns reflect the enhancers' endogenous activities4.
Embryo fixation for imaging
Embryos of the respective genotypes were washed off collection plates into a collection bottle with a mesh at the bottom using paintbrushes and water. The embryos were then dechorionated for 2 min in 50% bleach. Following dechorionation, embryos were washed extensively with water and finally collected on the mesh of the collection bottle with 1× PBT (PBS, 0.1% Triton X-100). After the embryos had been dried on the mesh on a piece of tissue paper, they were transferred into a 1.5-ml reaction tube with 1 volume of fixation solution (4% (v/v) formaldehyde in PBS) and 1 volume of heptane. Embryos were fixed for 20 min on a horizontal shaker at 500 rpm. To devitellinize the embryos, the fixation solution was aspirated and 1 volume of methanol was added to the tube, followed by extensive shaking. The heptane phase and excess methanol were removed, leaving the devitellinized embryos at the bottom of the tube. Embryos were washed three times with methanol and stored in methanol or ethanol at −20 °C.
FISH in Drosophila embryos
Whole-mount Drosophila RNA in situ hybridization experiments were carried out as described previously63. Digoxigenin-labelled antisense RNA probes for elav, wg, GATAe, mef2 and tll were prepared from the corresponding EST clones from the Drosophila Genomics Resource Center (DGRC) collections (NIH grant 2P40OD010949) using the DIG labelling mix (Roche, 11175033910) and T3, T7 or SP6 RNA polymerase (Roche) according to the manufacturer's instructions. A fluorescein-labelled antisense RNA probe for lacZ was prepared from a PCR fragment amplified from a pGEMT easy plasmid containing the lacZ gene using the Fluorescein labelling mix (Roche, 11685619910) and T7 RNA polymerase (Roche) according to the manufacturer's instructions. mRNA expression was visualized from these probes using anti-Digoxigenin-Peroxidase (Roche, 11633716001) and anti-Fluorescein-Peroxidase (Roche, 11426346910) (all antibodies diluted 1:2,000) coupled with the TSA Plus Cyanine 3 (Akoya Biosciences, NEL744001KT) and TSA Plus Fluorescein (Akoya Biosciences, NEL741001KT) kits.
Qualitative visual pattern assessment and imaging of representative FISH-stained embryos
Two hundred to three hundred double FISH-stained embryos of the respective genetic background were mounted in ProLong Gold mounting medium with DAPI (ThermoFisher Scientific, P36931) and scored individually for lacZ reporter expression at embryonic stages 13–14. If a synthetic enhancer-driven lacZ expression pattern was observed reproducibly in all homozygous embryos, the enhancer was scored as active. For each active enhancer, one representative homozygous embryo was selected and a z stack (1 µm step size, 7–12 slices per embryo) was imaged on a Zeiss LSM 880 Airyscan Fast confocal microscope using a Plan Apochromat 20×/0.8 objective. To visualize the enhancer-driven reporter expression in relation to the tissue-specific marker gene expression, a maximum projection of the z stack was performed in Fiji64.
Quantification of tissue-specific enhancer activity in FISH-stained embryos
To quantify the activity of each enhancer in the predicted tissue, we analysed its reporter expression pattern in spatial relation to the expression of the respective tissue-specific marker and calculated a Pearson correlation coefficient (PCC). For this purpose, we imaged z stacks (1 µm step size, 7–12 slices per embryo) of four double FISH-stained embryos of the respective genotype at low resolution (256 × 256 pixels) on a Zeiss LSM 880 Airyscan Fast confocal microscope using a Plan Apochromat 20×/0.8 objective. We then calculated the PCC between the two channels with Fiji64 using the JACoP plugin65 with standard parameters. As controls, we used either double FISH-stained embryos that showed no reporter expression or embryos double FISH-stained for the unrelated Myosin heavy chain (MHC, muscle) and cacophony (cac, CNS) genes.
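For illustration, the PCC between two image channels can be computed as in the following Python sketch. It is not the JACoP implementation, and it omits any thresholding or preprocessing the plugin may apply; the tiny "images" are hypothetical pixel intensities chosen so that the result is easy to check.

```python
from math import sqrt


def pearson_correlation(channel_a, channel_b):
    """Pearson correlation coefficient between two equally sized intensity lists."""
    if len(channel_a) != len(channel_b) or len(channel_a) < 2:
        raise ValueError("channels must have the same length (>= 2 pixels)")
    n = len(channel_a)
    mean_a = sum(channel_a) / n
    mean_b = sum(channel_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(channel_a, channel_b))
    var_a = sum((a - mean_a) ** 2 for a in channel_a)
    var_b = sum((b - mean_b) ** 2 for b in channel_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("a constant channel has no defined correlation")
    return cov / sqrt(var_a * var_b)


def flatten(image):
    """Flatten a 2D image (list of pixel rows) into a single list."""
    return [pixel for row in image for pixel in row]


# Toy 2 x 3 "images": a reporter channel, a marker twice as bright, and an inverse.
reporter = [[0, 10, 20], [30, 40, 50]]
marker = [[0, 20, 40], [60, 80, 100]]
inverse = [[50, 40, 30], [20, 10, 0]]

pcc_same = pearson_correlation(flatten(reporter), flatten(marker))
pcc_opposite = pearson_correlation(flatten(reporter), flatten(inverse))
```

Because the PCC is invariant to linear rescaling of intensities, a marker channel that is uniformly brighter than the reporter channel still yields a coefficient of 1 when the two patterns coincide.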
Statistics and data visualization
All statistical calculations and graphical displays were performed in the R statistical computing environment (v.3.5.1 (ref. 66)) using the R package ggplot2 (v.3.2.1 (ref. 67)). Coverage data tracks were visualized in the UCSC Genome Browser68 and used to create displays of representative genomic loci. In all boxplots, the central line denotes the median, the box encompasses the 25th to 75th percentiles (interquartile range) and the whiskers extend to 1.5× the interquartile range.
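The boxplot convention can be made concrete with a small Python sketch (the analyses themselves were done in R). It assumes the ggplot2 behaviour in which each whisker ends at the most extreme data point lying within 1.5× the interquartile range of the box, and it assumes linearly interpolated quartiles (Python's "inclusive" method); both choices are assumptions of this example, and the input values are made up.

```python
from statistics import median, quantiles


def boxplot_stats(values):
    """Median, quartiles and whisker ends (most extreme points within 1.5 x IQR)."""
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit, upper_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if lower_limit <= v <= upper_limit]
    return {
        "median": median(data),
        "q1": q1,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < lower_limit or v > upper_limit],
    }


stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

In this toy data set the value 30 lies beyond the upper whisker limit and would be drawn as an individual outlier point.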
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41586-023-06905-9.
Acknowledgements
Basic research at the IMP is supported by Boehringer Ingelheim GmbH and the Austrian Research Promotion Agency (FFG). For the purpose of Open Access, the author has applied a CC-BY-NC-ND 4.0 International license to any Author Accepted Manuscript version arising from this submission.
Author contributions
B.P.d.A., E.E.M.F. and A.S. conceived the project. B.P.d.A. performed all computational analyses and designed the synthetic enhancers. M.P. cloned all reporter constructs. C.S. performed fly husbandry, embryo collection, in situ hybridization and imaging. S.S. provided assistance with the pseudo-bulk scATAC-seq data, data analysis and interpretation of the results. B.P.d.A., C.S., E.E.M.F. and A.S. interpreted the data. B.P.d.A. and A.S. wrote the manuscript, with input from all authors. E.E.M.F. and A.S. supervised the project.
Peer review
Peer review information
Nature thanks Shaun Mahony and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Data availability
The transcription factor motif database is available at https://github.com/bernardo-de-almeida/motif-clustering. The final pre-trained accessibility and enhancer activity models, as well as the data used to train and evaluate the models, are available at https://doi.org/10.5281/zenodo.8011697. All reporter DNA constructs and transgenic flies for active synthetic enhancers are available from the Vienna Drosophila Resource Center (VDRC) at https://shop.vbc.ac.at/vdrc_store/vdrc-fly-stocks/other-resources/a-stark-stocks-as-stock.html.
Code availability
Code used to train the models and to make predictions on new sequences is available on GitHub (https://github.com/bernardo-de-almeida/DeepSTARR_embryo).
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data is available for this paper at https://doi.org/10.1038/s41586-023-06905-9.
Supplementary information
The online version contains supplementary material available at https://doi.org/10.1038/s41586-023-06905-9.
References
- 1. Levine M. Transcriptional enhancers in animal development and evolution. Curr. Biol. 2010;20:R754–R763. doi: 10.1016/j.cub.2010.06.070.
- 2. Banerji J, Rusconi S, Schaffner W. Expression of a β-globin gene is enhanced by remote SV40 DNA sequences. Cell. 1981;27:299–308. doi: 10.1016/0092-8674(81)90413-X.
- 3. Shlyueva D, Stampfel G, Stark A. Transcriptional enhancers: From properties to genome-wide predictions. Nat. Rev. Genet. 2014;15:272–286. doi: 10.1038/nrg3682.
- 4. Kvon EZ, et al. Genome-scale functional characterization of Drosophila developmental enhancers in vivo. Nature. 2014;512:91–95. doi: 10.1038/nature13395.
- 5. Visel A, Minovitsky S, Dubchak I, Pennacchio LA. VISTA Enhancer Browser—a database of tissue-specific human enhancers. Nucleic Acids Res. 2007;35:D88–D92. doi: 10.1093/nar/gkl822.
- 6. de Almeida BP, Reiter F, Pagani M, Stark A. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat. Genet. 2022;54:613–624. doi: 10.1038/s41588-022-01048-5.
- 7. Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. in Advances in Neural Information Processing Systems, Vol. 27 (Curran Associates, 2014).
- 8. Zheng SC, et al. Universal prediction of cell-cycle position using transfer learning. Genome Biol. 2022;23:41. doi: 10.1186/s13059-021-02581-y.
- 9. Theodoris CV, et al. Transfer learning enables predictions in network biology. Nature. 2023;618:616–624. doi: 10.1038/s41586-023-06139-9.
- 10. Schwessinger R, et al. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat. Methods. 2020;17:1118–1124. doi: 10.1038/s41592-020-0960-3.
- 11. Salvatore M, Horlacher M, Marsico A, Winther O, Andersson R. Transfer learning identifies sequence determinants of cell-type specific regulatory element accessibility. NAR Genomics Bioinformatics. 2023;5:lqad026. doi: 10.1093/nargab/lqad026.
- 12. Novakovsky G, Saraswat M, Fornes O, Mostafavi S, Wasserman WW. Biologically relevant transfer learning improves transcription factor binding prediction. Genome Biol. 2021;22:280. doi: 10.1186/s13059-021-02499-5.
- 13. Bravo González-Blas, C. et al. Enhancer grammar of liver cell types and hepatocyte zonation states. Preprint at bioRxiv https://doi.org/10.1101/2022.12.08.519575 (2022).
- 14. Janssens J, et al. Decoding gene regulation in the fly brain. Nature. 2022;601:630–636. doi: 10.1038/s41586-021-04262-z.
- 15. Domcke S, et al. A human cell atlas of fetal chromatin accessibility. Science. 2020;370:eaba7612. doi: 10.1126/science.aba7612.
- 16. Calderon D, et al. The continuum of Drosophila embryonic development at single-cell resolution. Science. 2022;377:eabn5800. doi: 10.1126/science.abn5800.
- 17. Kelley DR. Cross-species regulatory sequence activity prediction. PLoS Comput. Biol. 2020;16:e1008050. doi: 10.1371/journal.pcbi.1008050.
- 18. Kelley DR, Snoek J, Rinn JL. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016;26:990–999. doi: 10.1101/gr.200535.115.
- 19. Kelley DR, et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 2018;28:739–750. doi: 10.1101/gr.227819.117.
- 20. Kim D, et al. The dynamic, combinatorial cis-regulatory lexicon of epidermal differentiation. Nat. Genet. 2021;53:1564–1576. doi: 10.1038/s41588-021-00947-3.
- 21. Shrikumar, A. et al. TF-MoDISco v0.4.4.2-alpha: technical note. Preprint at https://arxiv.org/abs/1811.00416v1 (2018).
- 22. Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. Preprint at https://arxiv.org/abs/1704.02685 (2017).
- 23. Lundberg SM, et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020;2:56–67. doi: 10.1038/s42256-019-0138-9.
- 24. Lundberg, S. M. & Lee, S.-I. in 31st Conference on Neural Information Processing Systems (ACM, 2017).
- 25. Doe CQ. Temporal patterning in the Drosophila CNS. Annu. Rev. Cell Dev. Biol. 2017;12:55. doi: 10.1146/annurev-cellbio-111315-125210.
- 26. Jacobs J, et al. The transcription factor Grainy head primes epithelial enhancers for spatiotemporal activation by displacing nucleosomes. Nat. Genet. 2018;50:1011–1020. doi: 10.1038/s41588-018-0140-x.
- 27. Spahn P, et al. Multiple regulatory safeguards confine the expression of the GATA factor serpent to the hemocyte primordium within the Drosophila mesoderm. Dev. Biol. 2014;386:272–279. doi: 10.1016/j.ydbio.2013.12.012.
- 28. Reuter R. The gene serpent has homeotic properties and specifies endoderm versus ectoderm within the Drosophila gut. Development. 1994;120:1123–1135. doi: 10.1242/dev.120.5.1123.
- 29. Aronson BE, Stapleton KA, Krasinski SD, Friedman DR. Role of GATA factors in development, differentiation, and homeostasis of the small intestinal epithelium. Am. J. Physiol. 2014;306:474–490. doi: 10.1152/ajpgi.00119.2013.
- 30. Ciglar L, Furlong EE. Conservation and divergence in developmental networks: a view from Drosophila myogenesis. Curr. Opin. Cell Biol. 2009;21:754–760. doi: 10.1016/j.ceb.2009.10.001.
- 31. Larson ED, et al. Cell-type-specific chromatin occupancy by the pioneer factor Zelda drives key developmental transitions in Drosophila. Nat. Commun. 2021;12:7153. doi: 10.1038/s41467-021-27506-y.
- 32. Berger C, et al. FACS purification and transcriptome analysis of Drosophila neural stem cells reveals a role for Klumpfuss in self-renewal. Cell Rep. 2012;2:407–418. doi: 10.1016/j.celrep.2012.07.008.
- 33. Cusanovich DA, et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature. 2018;555:538–542. doi: 10.1038/nature25981.
- 34. Okumura T, Matsumoto A, Tanimura T, Murakami R. An endoderm-specific GATA factor gene, dGATAe, is required for the terminal differentiation of the Drosophila endoderm. Dev. Biol. 2005;278:576–586. doi: 10.1016/j.ydbio.2004.11.021.
- 35. Okumura T, Tajiri R, Kojima T, Saigo K, Murakami R. GATAe-dependent and -independent expressions of genes in the differentiated endodermal midgut of Drosophila. Gene Expr. Patterns. 2007;7:178–186. doi: 10.1016/j.modgep.2006.07.001.
- 36. Winick J, et al. A GATA family transcription factor is expressed along the embryonic dorsoventral axis in Drosophila melanogaster. Development. 1993;119:1055–1065. doi: 10.1242/dev.119.4.1055.
- 37. Gajewski K, Fossett N, Molkentin JD, Schulz RA. The zinc finger proteins Pannier and GATA4 function as cardiogenic factors in Drosophila. Development. 1999;126:5679–5688. doi: 10.1242/dev.126.24.5679.
- 38. Lebestky T, Chang T, Hartenstein V, Banerjee U. Specification of Drosophila hematopoietic lineage by conserved transcription factors. Science. 2000;288:146–149. doi: 10.1126/science.288.5463.146.
- 39. Weinstein ML, et al. A novel role for trithorax in the gene regulatory network for a rapidly evolving fruit fly pigmentation trait. PLoS Genet. 2023;19:e1010653. doi: 10.1371/journal.pgen.1010653.
- 40. Grossman SR, et al. Systematic dissection of genomic features determining transcription factor binding and enhancer function. Proc. Natl Acad. Sci. USA. 2017;114:E1291–E1300. doi: 10.1073/pnas.1621150114.
- 41. Ghandi M, Lee D, Mohammad-noori M, Beer MA. Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol. 2014;10:e1003711. doi: 10.1371/journal.pcbi.1003711.
- 42. Yanez-Cuna JO, et al. Dissection of thousands of cell type-specific enhancers identifies dinucleotide repeat motifs as general enhancer features. Genome Res. 2014;24:1147–1156. doi: 10.1101/gr.169243.113.
- 43. Vongoc L, Huang CY, Cassidy CJ, Medrano C, Kadonaga JT. Identification of the human DPR core promoter element using machine learning. Nature. 2020;21:51–60. doi: 10.1038/s41586-020-2689-7.
- 44. Reiter F, de Almeida BP, Stark A. Enhancers display constrained sequence flexibility and context-specific modulation of motif function. Genome Res. 2023;33:346–358. doi: 10.1101/gr.277246.122.
- 45. Taskiran, I. I. et al. Cell type directed design of synthetic enhancers. Nature https://doi.org/10.1038/s41586-023-06936-2 (2023).
- 46. Gosai, S. et al. Machine-guided design of synthetic cell type-specific cis-regulatory elements. Preprint at bioRxiv https://doi.org/10.1101/2023.08.08.552077 (2023).
- 47. Avsec Ž, et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods. 2021;18:1196–1203. doi: 10.1038/s41592-021-01252-x.
- 48. Goodfellow, I. J. et al. in Proc. 27th International Conference on Neural Information Processing Systems (MIT Press, 2014).
- 49. Rezende, D. J., Mohamed, S. & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. Proc. 31st International Conference on Machine Learning (ICML) (2014).
- 50. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2014).
- 51. Yang, L. et al. Diffusion models: a comprehensive survey of methods and applications. Preprint at https://arxiv.org/abs/2209.00796 (2022).
- 52. Fudenberg G, Kelley DR, Pollard KS. Predicting 3D genome folding from DNA sequence with Akita. Nat. Methods. 2020;17:1111–1117. doi: 10.1038/s41592-020-0958-x.
- 53. Zhou J. Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nat. Genet. 2022;54:725–734. doi: 10.1038/s41588-022-01065-4.
- 54. Feng J, Liu T, Qin B, Zhang Y, Liu XS. Identifying ChIP–seq enrichment using MACS. Nat. Protoc. 2012;7:1728–1740. doi: 10.1038/nprot.2012.101.
- 55. Zhang Y, et al. Model-based analysis of ChIP–seq (MACS). Genome Biol. 2008;9:R137. doi: 10.1186/gb-2008-9-9-r137.
- 56. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at https://arxiv.org/abs/1603.04467 (2016).
- 57. Kingma, D. P. & Ba, J. L. Adam: A method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2015).
- 58. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033.
- 59. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8:R24. doi: 10.1186/gb-2007-8-2-r24.
- 60. Kuhn M. Building predictive models in R using the caret package. J. Stat. Softw. 2008;28:1–26. doi: 10.18637/jss.v028.i05.
- 61. Erceg J, et al. Subtle changes in motif positioning cause tissue-specific effects on robustness of an enhancer's activity. PLoS Genet. 2014;10:e1004060. doi: 10.1371/journal.pgen.1004060.
- 62. Zabidi MA, et al. Enhancer–core-promoter specificity separates developmental and housekeeping gene regulation. Nature. 2015;518:556–559. doi: 10.1038/nature13994.
- 63. Schor IE, et al. Non-coding RNA expression, function, and variation during Drosophila embryogenesis. Curr. Biol. 2018;28:3547–3561.e9. doi: 10.1016/j.cub.2018.09.026.
- 64. Schindelin, J. et al. Fiji: an open-source platform for biological-image analysis. Nat. Methods 9, 676–682 (2012).
- 65. Bolte S, Cordelières FP. A guided tour into subcellular colocalization analysis in light microscopy. J. Microsc. 2006;224:213–232. doi: 10.1111/j.1365-2818.2006.01706.x.
- 66. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/ (R Foundation for Statistical Computing, 2020).
- 67. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).
- 68. Kent WJ, et al. The Human Genome Browser at UCSC. Genome Res. 2002;12:996–1006. doi: 10.1101/gr.229102.