Author manuscript; available in PMC: 2021 Jun 7.
Published in final edited form as: Nature. 2020 Nov 25;588(7837):337–343. doi: 10.1038/s41586-020-2962-9

A map of cis-regulatory elements and 3D genome structures in zebrafish

Hongbo Yang 1,*, Yu Luan 1,*, Tingting Liu 1,*, Hyung Joo Lee 2, Li Fang 3, Yanli Wang 4, Xiaotao Wang 1, Bo Zhang 4, Qiushi Jin 1, Khai Chung Ang 5, Xiaoyun Xing 2, Juan Wang 1, Jie Xu 1, Fan Song 4, Iyyanki Sriranga 1, Chachrit Khunsriraksakul 4, Tarik Salameh 4, Daofeng Li 2, Mayank N K Choudhary 2, Jacek Topczewski 6,7, Kai Wang 3, Glenn S Gerhard 8, Ross C Hardison 9, Ting Wang 2, Keith C Cheng 5, Feng Yue 1,10,#
PMCID: PMC8183574  NIHMSID: NIHMS1695157  PMID: 33239788

Abstract

The zebrafish has been widely used in the study of human disease and development, with ~70% of the protein-coding genes conserved between the two species1. However, the annotation of functional control elements in the zebrafish genome has been lagging. Here, we performed RNA-seq, ATAC-seq, ChIP-seq, whole-genome bisulfite sequencing (WGBS), and Hi-C experiments in up to eleven adult and two embryonic tissues to generate a comprehensive map of transcriptomes, cis-regulatory elements, heterochromatin, methylomes, and 3D genome organization in the zebrafish Tübingen reference strain. A comparison of zebrafish, human, and mouse regulatory elements allowed the identification of both evolutionarily conserved and species-specific regulatory sequences and networks. We observed enrichment of evolutionary breakpoints at TAD boundaries, which were correlated with strong H3K4me3 and CCCTC-binding factor (CTCF) signals. We performed single-cell ATAC-seq in zebrafish brain, which delineated 25 different clusters of cell types. By combining long-molecule sequencing and Hi-C, we assembled the sex-determining chromosome 4 de novo. Overall, our work provides an additional epigenomic anchor for the functional annotation of vertebrate genomes and the study of evolutionarily conserved elements of the 3D genome organization.


Zebrafish (Danio rerio) has been an important vertebrate model system for several decades because of its high fecundity, external embryogenesis, rapid embryonic development, and nearly transparent embryos. These features have made it an ideal system for the study of vertebrate development and aging2, comparative genomics3, and human disease modeling. However, comprehensive annotation of the cis-regulatory elements in the zebrafish genome has been lagging. Although previous genomic studies in zebrafish have provided critical biological insights4–8, most used whole embryos, and our understanding of tissue-specific regulators remains limited.

To profile the transcribed regions, chromatin accessibility, and DNA methylation patterns in the zebrafish genome, we performed strand-specific RNA sequencing (RNA-seq), assay for transposase-accessible chromatin using sequencing (ATAC-seq)9, and whole-genome bisulfite sequencing (WGBS) in up to 11 zebrafish adult tissues and two embryonic tissues (Fig. 1a, b, Supplemental Table 1). Because histone modifications have been used to predict different classes of potential regulatory elements such as enhancers and repressors10,11, we also performed chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq) for a panel of histone modifications, including: H3 lysine 4 trimethylation (H3K4me3), H3 lysine 27 acetylation (H3K27ac), H3 lysine 9 dimethylation (H3K9me2) and trimethylation (H3K9me3). To study higher-order chromatin structure and link distal enhancers to their target genes, we performed Hi-C experiments in adult brain and muscle (Fig. 1a, b). Although chromosome 4 is regarded as the ‘rudimentary’ sex chromosome in zebrafish12, the quality of its current assembly is poor due to the heavy presence of heterochromatin. Therefore, we performed three long-molecule sequencing experiments (Nanopore, 10X Genomics, and Bionano optical mapping) in one Tübingen female zebrafish to generate a de novo assembly of chromosome 4. To investigate the cell types and their regulatory elements in the zebrafish brain, we performed single-cell ATAC-seq (scATAC-seq). In total, we generated 161 genomic datasets that comprised over 10 billion reads. To our knowledge, this is the most comprehensive analysis of candidate cis-regulatory elements (CREs) in zebrafish to date and represents a major resource for comparative genomics and the study of gene regulation in this vertebrate model organism.

Fig. 1 |. Identification of cis-regulatory elements in the zebrafish genome.

a, Tissues analyzed and techniques performed in this study. b, WashU Epigenome Browser snapshot of an example region, showing Hi-C, ChIP-seq, ATAC-seq, WGBS, and RNA-seq data in adult zebrafish brain, muscle and liver. The values on the y-axis for ChIP-seq were input-normalized. c, Expression of myh6 is heart-specific in both zebrafish and human (n=2). The human expression values were from GTEx. Error bars represent standard error of the mean. d, Boxplot of the expression of brain-specific genes in zebrafish (upper panel) (n=2,481) and the expression of their orthologs in human (lower panel) (n=2,481). e, An example of a predicted novel transcript. Vertical scale 0–20 for H3K27ac and H3K4me3, 0–10 for RNA-seq. f, RNA-seq and H3K4me3 ChIP-seq signals for the predicted 8,311 novel transcripts across all the tissues. For all boxplots in this manuscript: horizontal line, median; box, interquartile range (IQR); whiskers, 5th to 95th percentiles.

Transcriptome analysis

We detected 39,188 transcripts across all tissues using RNA-Seq, 14,764 of which exhibited tissue-specific patterns (Extended Data Fig. 1a-c). We identified 13,285 novel transcripts, 8,311 of which were also supported by H3K4me3 peaks at the promoter regions (Fig. 1e, f, Extended Data Fig. 1d, Supplemental Table 2). These 8,311 novel transcripts include 976 lncRNAs, 3,596 novel isoforms, and 3,739 potential novel protein-coding genes.

Next, we examined whether the expression patterns for the orthologs of tissue-specific genes were conserved between zebrafish and human. Among the 14,764 tissue-specific zebrafish transcripts, 3,737 have a one-to-one human ortholog, 1,747 of which (47%) also show tissue-specific patterns in human (Fig. 1c, d and Supplemental Table 3), suggesting that these genes might play a critical and conserved role in the tissues where they are uniquely expressed.

Chromatin accessibility, promoter and enhancer dynamics

Chromatin accessibility is associated with a wide range of regulatory elements13,14, thus we performed ATAC-seq in all 11 adult tissues. We identified 66k–180k ATAC-seq peaks (Supplemental Table 4) in each tissue and merged them into a list of 436,036 non-redundant peaks across all tissues (Fig. 2a, Supplemental Table 5). 116,353 of the peaks were novel compared with previous work in whole embryos15–20 (Extended Data Fig. 2a and Supplemental Table 6). As expected, distal ATAC-seq peaks have higher tissue specificity than proximal peaks (Fig. 2b, c).
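The merging of per-tissue peaks into a non-redundant list can be illustrated with a simple interval merge. The sketch below is an illustration on toy coordinates only: the function name `merge_peaks`, the half-open coordinate convention, and the rule that any overlap triggers a merge are assumptions, not a description of the pipeline used in this study.

```python
from collections import defaultdict


def merge_peaks(peaks_by_tissue):
    """Merge overlapping peaks from several tissues into a non-redundant list.

    peaks_by_tissue maps a tissue name to a list of (chrom, start, end)
    half-open intervals. Returns a list of (chrom, start, end, tissues)
    tuples, where tissues is the sorted list of contributing tissues.
    """
    by_chrom = defaultdict(list)
    for tissue, peaks in peaks_by_tissue.items():
        for chrom, start, end in peaks:
            by_chrom[chrom].append((start, end, tissue))

    merged = []
    for chrom in sorted(by_chrom):
        current = None  # [start, end, set of tissues] of the open merged peak
        for start, end, tissue in sorted(by_chrom[chrom]):
            if current is not None and start < current[1]:
                # Overlaps the open merged peak: extend it.
                current[1] = max(current[1], end)
                current[2].add(tissue)
            else:
                if current is not None:
                    merged.append((chrom, current[0], current[1], sorted(current[2])))
                current = [start, end, {tissue}]
        if current is not None:
            merged.append((chrom, current[0], current[1], sorted(current[2])))
    return merged


# Toy input (hypothetical coordinates).
toy = {
    "brain": [("chr1", 100, 300), ("chr1", 1000, 1200)],
    "liver": [("chr1", 250, 400), ("chr2", 50, 150)],
}
for peak in merge_peaks(toy):
    print(peak)
```

Keeping the list of contributing tissues with each merged peak makes it straightforward to ask, in a second pass, how many tissues support a given peak.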

Fig. 2 |. Characterization of tissue-specific cis-regulatory elements.

a, Number of ATAC-seq peaks predicted in each tissue and their genomic distribution. b and c, Tissue specificity of proximal and distal ATAC-seq peaks in 11 adult tissues. d, Clustering analysis identified tissue-specific enhancers. Values in the heatmap were input-normalized H3K27ac intensity (number of enhancers, n=58,226). e, Normalized ATAC-seq intensity in the corresponding enhancer elements shown in d. f, Examples of validated tissue-specific enhancers by GFP reporter assay in zebrafish embryos. For brain-enhancer 5, 112 out of 143 surviving embryos showed similar patterns. For heart-enhancer 7, 61 out of the 67 surviving embryos showed similar patterns. For muscle-enhancer 5, 53 out of the 63 surviving embryos showed similar patterns. For kidney-enhancer 1, 47 out of the 82 surviving embryos showed similar patterns. Scale bar: 200 μm. g, Pearson correlation coefficient (PCC) between aggregated signals of scATAC-seq and bulk ATAC-seq data. Values are the sums of the reads in continuous 10 kb bins, normalized by sequencing depth. h, The overlap between peaks predicted in bulk and scATAC-seq data. i, t-SNE (t-distributed stochastic neighbor embedding) analysis identified 25 clusters in the scATAC-seq data in zebrafish adult brain (n=19,955). j, Examples of enriched motifs in different clusters from scATAC-seq peaks (n=19,955).

We then defined the CREs with the following combinations of histone modifications and ATAC-seq peaks: active promoter (H3K27ac + H3K4me3 + ATAC-seq), weak promoter (H3K4me3 + ATAC-seq), active enhancer (distal H3K27ac + ATAC-seq), and heterochromatin (H3K9me2 or H3K9me3 sites). Across all the tissues, we predicted 25,593 active promoters, 40,220 weak promoters, 58,065 active enhancers, and 112,445 heterochromatin sites (Extended Data Fig. 2b, c and Supplemental Tables 7–10). 40.9% of the predicted promoters and 62.5% of the predicted enhancers reported in this study were novel when compared with previous reports6,21–25 (Extended Data Fig. 2a). 71.3% of the enhancers were tissue-specific and also showed tissue-specific ATAC-seq signals (Fig. 2d, e). Gene Ontology (GO) analysis showed that they were located near genes important for relevant tissue-specific functions (Extended Data Fig. 2d).
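The combinatorial definitions above can be written as a small decision function. This is a minimal sketch assuming each candidate region has already been annotated with boolean flags for the marks it overlaps; the function name `classify_cre`, the `distal` flag, the order in which the rules are tested, and the "unclassified" fallback are illustrative assumptions rather than the implementation used here.

```python
def classify_cre(h3k27ac, h3k4me3, atac, h3k9me2=False, h3k9me3=False, distal=False):
    """Assign a CRE class from the marks overlapping a region.

    All arguments are booleans. `distal` marks regions that lie away from
    annotated promoters (how "distal" is measured is left to the caller).
    """
    if atac and h3k4me3 and h3k27ac:
        return "active promoter"      # H3K27ac + H3K4me3 + ATAC-seq
    if atac and h3k4me3:
        return "weak promoter"        # H3K4me3 + ATAC-seq
    if atac and h3k27ac and distal:
        return "active enhancer"      # distal H3K27ac + ATAC-seq
    if h3k9me2 or h3k9me3:
        return "heterochromatin"      # H3K9me2 or H3K9me3 site
    return "unclassified"             # assumption: no rule matched


print(classify_cre(h3k27ac=True, h3k4me3=True, atac=True))
print(classify_cre(h3k27ac=True, h3k4me3=False, atac=True, distal=True))
```

Testing the promoter rules before the enhancer rule means that a region carrying all three active marks is never reported as an enhancer, which is one reasonable reading of the definitions but not the only one.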

To validate the predicted enhancers and their tissue-specificity, we used a GFP-based zebrafish embryo reporter assay. Of the 32 tissue-specific enhancers tested, 87.5% (28/32) showed restricted GFP expression (Fig. 2f and Extended Data Fig. 3, Supplemental Table 11).

Single cell ATAC-Seq in zebrafish brain

We performed single-cell ATAC-seq (scATAC-seq) in adult zebrafish brain, generating 654 million usable reads from 19,955 cells. We identified 268,268 non-redundant peaks, covering 98.7% of the bulk ATAC-seq peaks in the brain (Fig. 2g, h and Extended Data Fig. 4a-c). 73,264 peaks were identified only in the scATAC-seq data, suggesting that there are potentially more regulatory elements in the zebrafish genome.
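A coverage figure of this kind (the share of one peak set overlapped by another) can be computed with a sorted-interval lookup. The sketch below is illustrative only: the function name `fraction_covered`, the half-open coordinates, the "any overlap counts" rule, and the toy peaks are assumptions, and the reference peaks are assumed to be non-overlapping within each chromosome.

```python
from bisect import bisect_right
from collections import defaultdict


def fraction_covered(query, reference):
    """Fraction of query peaks that overlap at least one reference peak.

    Both arguments are lists of (chrom, start, end) half-open intervals.
    Reference peaks must not overlap one another on the same chromosome.
    """
    per_chrom = defaultdict(list)
    for chrom, start, end in reference:
        per_chrom[chrom].append((start, end))

    starts, ends = {}, {}
    for chrom, intervals in per_chrom.items():
        intervals.sort()
        starts[chrom] = [s for s, _ in intervals]
        ends[chrom] = [e for _, e in intervals]

    hits = 0
    for chrom, start, end in query:
        if chrom not in ends:
            continue
        # Index of the first reference peak that ends after the query starts.
        i = bisect_right(ends[chrom], start)
        if i < len(ends[chrom]) and starts[chrom][i] < end:
            hits += 1
    return hits / len(query) if query else 0.0


# Toy peaks (hypothetical coordinates).
bulk = [("chr1", 100, 200), ("chr1", 500, 600), ("chr1", 900, 950), ("chr2", 10, 20)]
single_cell = [("chr1", 150, 250), ("chr1", 580, 700), ("chr2", 0, 15)]
print(fraction_covered(bulk, single_cell))
```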

With the scATAC-seq data, we identified 25 clusters of cells in the zebrafish brain (Fig. 2i). By identifying the key cell type-specific transcription factor (TF) motifs, we inferred the potential cell type of each cluster, such as oligodendrocyte progenitor cells (OPC) and prefrontal cortex cells (PCC) (Extended Data Fig. 4d, e). We quantitatively determined the enrichment of TF motifs in each of the 25 clusters (Fig. 2j and Extended Data Fig. 4f). Interestingly, we observed that many neuronal TFs (such as SOX9 and OLIG2) were enriched in different clusters, suggesting their potential roles in cell-type specific regulation in the zebrafish brain.

Heterochromatin and DNA methylation

We performed ChIP-seq for H3K9me2 and H3K9me3 in 11 adult zebrafish tissues (Extended Data Fig. 5a). Across all tissues, we identified 73,777 non-redundant H3K9me2 sites and 68,798 non-redundant H3K9me3 sites. While both H3K9me2 and H3K9me3 are heterochromatic marks, they were located in different parts of the genome, with only a ~10% overlap in the same tissue (Extended Data Fig. 5b-d). Short interspersed nuclear elements (SINE) were enriched in both H3K9me2 and H3K9me3 sites, while long terminal repeats (LTR) were only enriched in H3K9me2 sites (Fig. 3b). Although both H3K9me2 and H3K9me3 sites were depleted of active marks within the same tissue, 20% of them overlap with ATAC-seq peaks or other active marks in other tissues (Fig. 3a, Extended Data Fig. 5e, g), suggesting that heterochromatin regions in one tissue may be active regulatory elements in other tissues.

Fig. 3 |. Analysis of heterochromatin and repetitive elements and de novo assembly of zebrafish chromosome 4.

a, Comparison of H3K9me2 and H3K9me3 sites with active marks (ATAC-seq, H3K4me3, or H3K27ac peaks) from the same tissue (left panel) and active marks in other tissues (right panel). The number of heterochromatin regions in each tissue: Testis 36,672; Spleen 20,813; Skin 24,687; Muscle 25,692; Liver 29,117; Kidney 21,821; Intestine 24,072; Heart 14,706; Colon 22,426; Blood 19,082; and Brain 21,596. b, Repetitive elements enrichment analysis for predicted enhancers, H3K9me3, and H3K9me2 sites. Color and size indicate fold enrichment. c, DNA methylation levels, H3K27ac ChIP-seq, and ATAC-seq signals for tissue-specific hypoDMRs. Tissue-specific hypoDMRs cluster size, Muscle=1,912, Heart=1,708, Liver=3,386. d, Examples of tissue-specific hypoDMRs. Vertical scale 0–300 for H3K4me3, 0–100 for H3K27ac and ATAC-seq, 0–1 for WGBS, 0–40 for RNA-seq. e, Brain Hi-C data mapped to GRCz10, GRCz11, and the de novo assembled chr4. Aberrant Hi-C signals were observed when Hi-C reads were mapped to the GRCz10 or GRCz11 reference genome but were not visible when mapped to the de novo assembled chr4. f, Alignment of the de novo assembled chr4 to the GRCz11 reference genome (alignment of 2 kb bins by LASTZ).

To study DNA methylation patterns in zebrafish, we performed WGBS in 11 adult tissues with ~30X coverage in each dataset. Genome-wide CpG methylation levels were ~80% across different tissues, with the exception of the testis, which had a higher level (Extended Data Fig. 6a, b). We also detected elevated levels of methylation at the CAC trinucleotide in brain compared to other tissues (Extended Data Fig. 6c), similar to what has been reported in human and mouse26. Unmethylated CpGs were mostly found in CpG islands (CGIs), gene promoters, and 5’ UTRs (Extended Data Fig. 6d, e), while CpGs in gene bodies and different classes of repetitive elements were heavily methylated (Extended Data Fig. 6e). We also identified unmethylated and lowly methylated regions (UMRs and LMRs, Supplemental Table 12). Most UMRs overlapped with candidate promoters and proximal ATAC-seq peaks, while LMRs overlapped more with candidate enhancers and distal ATAC-seq peaks (Extended Data Fig. 6f). We identified differentially methylated regions (DMRs) and tissue-specific hypomethylated DMRs (hypoDMRs) (Extended Data Fig. 6g, Supplemental Table 13). Tissue-specific hypoDMRs were enriched with tissue-specific H3K27ac and ATAC-seq signals (Fig. 3c, d), suggesting that they can potentially identify tissue-specific CREs.

De novo assembly of chromosome 4

Chromosome 4 has been regarded as the ‘rudimentary’ sex chromosome in zebrafish12. However, extensive heterochromatin and transposable elements have made the analysis of this chromosome challenging. Indeed, we observed strong enrichment of H3K9me3, H3K9me2, and DNA methylation on the long arm of chr4 (Extended Data Fig. 7a-b). When we examined the Hi-C data (Fig. 3e) mapped to the current GRCz11 genome, there were many aberrant off-diagonal signals, indicating the presence of structural variations (SVs) between the reference genome and the genome of the fish used in this study, due to either assembly errors or inter-individual variation.

We therefore performed Nanopore, 10X Genomics, and BioNano optical mapping in a single female zebrafish of the Tübingen strain. By combining these long-molecule sequencing results with the Hi-C data from brain, we de novo assembled a new version of chr4 (Methods, Extended Data Fig. 7c, Supplementary Dataset 1). With the newly assembled genome, we re-processed the Hi-C data and observed that most of the aberrant signals were no longer visible on the Hi-C map (Fig. 3e, f). We re-processed the Bionano optical mapping data and also observed fewer SV events (Extended Data Fig. 7d, e). This newly assembled chromosome 4 will serve as a more accurate resource to study sex determination and other processes in zebrafish that involve genes on this chromosome.

Conservation of cis-regulatory elements

Functional elements are often conserved during evolution27. We first examined the sequence conservation of different classes of CREs. Promoters had the highest degree of sequence conservation, while enhancers had a much lower but still significant level of sequence conservation (Fig. 4a, Extended Data Fig. 8b). 88.6% of the predicted enhancers whose sequences were not conserved in human or mouse were conserved in other fish species (Fig. 4c and Extended Data Fig. 8c). Furthermore, 60–90% of zebrafish enhancers whose sequences were conserved in human were also predicted as candidate enhancer elements by the NIH ENCODE and Roadmap Epigenetics Project (Fig. 4b and Extended Data Fig. 8a). Interestingly, even for zebrafish enhancers with no detectable sequence conservation with human, we found that they were conserved in other fish species with elevated Fish phyloP scores (Fig. 4c), suggesting that they might be used as enhancers in other fish.

Fig. 4 |. Conservation of zebrafish cis-regulatory elements and transcriptional networks.

a, Percentage of zebrafish exons and CREs that have orthologous sequences in human. Total number for each bar: Exon 1,000; Promoter 25,593; Enhancer 58,065; and Random 1,000. For exons and random, we randomly sampled 1,000 elements and computed their conservation percentage. The simulations were performed 20 times and the average percentage is shown. b, Percentage of human orthologous sequences of zebrafish enhancers that were predicted as enhancers in human tissues (total number of each bar: Brain 1,241; Blood 748; Colon 775; Heart 839; Intestine 564; Kidney 173; Liver 402; Muscle 591; Skin 356; and Spleen 1,000). c, Fish phyloP score for the zebrafish enhancers whose sequences were not conserved in human (number of enhancers in red line is 51,446, blue line is 50,000). d, An ultra-conserved non-coding element predicted as a brain enhancer in zebrafish, mouse, and human. This enhancer element has been validated by transgenic reporter assay in mouse (#hs1056 in the VISTA Enhancer Browser). e, Heatmap showing TF motif enrichment in tissue-specific enhancers in zebrafish and human. f, Linking distal enhancers to their target genes by correlation of tissue-specific activity. (Left) Distribution of the predicted number of enhancers per gene. (Right) Distribution of predicted number of genes per enhancer. g, Validation of the predicted enhancer-to-gene pairs by Hi-C interaction counts in brain.

Prior efforts have identified several thousand Ultra-Conserved Non-coding Elements (UCNEs) in vertebrates28. There are 2,405, 4,337, and 4,351 UCNEs in zebrafish, mouse, and human, respectively. We found that 69% of the zebrafish UCNEs overlapped with the predicted CREs: 15% as promoters, 53% as enhancers, and 1% as non-coding RNAs. One such example is shown in Fig. 4d: the zebrafish UCNE SALL3_Anna is conserved in both human and mouse, and this element is predicted as an enhancer sequence in all three species based on the enrichment of H3K27ac signals (another example shown in Extended Data Fig. 8d). Furthermore, this enhancer has also been validated by transgenic reporter assays in mouse embryos29. Overall, enhancers localized in the ultra-conserved regions were more likely to be predicted as enhancers in mouse and human (30% vs. 5.68%).

Linking distal elements to target genes

To link distal ATAC-seq peaks to their target genes, we adopted a correlation-based strategy30. We generated 340,527 ATAC-seq peak-to-gene links with FDR < 0.01, which contain 144,886 distal ATAC-seq peaks and 18,792 genes (Extended Data Fig. 9a). Using a similar strategy, we predicted 96,540 enhancer-to-gene links, 37,241 of which were also supported by ATAC-seq peak-to-gene links (Fig. 4f, Extended Data Fig. 9b, c, Supplemental Table 14). The enhancer-to-gene links contain 33,728 putative enhancers and 16,935 genes. We observed higher Hi-C reads for the predicted enhancer-to-gene pairs than expected values at each genomic distance (Fig. 4g, Extended Data Fig. 9d), supporting the linkages between genes and distal elements based on activity correlation.
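The idea behind correlation-based linking can be sketched as follows: for each distal peak, correlate its accessibility across tissues with the expression of nearby genes across the same tissues, and keep well-correlated pairs. This is a simplified illustration, not the method of ref. 30 or of this study: the function names, the `window` and `min_r` parameters, and the use of a fixed correlation cutoff in place of an FDR estimate are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def link_peaks_to_genes(peaks, genes, window=500_000, min_r=0.9):
    """Return sorted (peak_id, gene_id, r) links.

    peaks: {peak_id: (chrom, midpoint, [accessibility per tissue])}
    genes: {gene_id: (chrom, tss, [expression per tissue])}
    window and min_r are hypothetical cutoffs chosen for illustration.
    """
    links = []
    for peak_id, (p_chrom, midpoint, signal) in peaks.items():
        for gene_id, (g_chrom, tss, expression) in genes.items():
            if p_chrom != g_chrom or abs(midpoint - tss) > window:
                continue
            r = pearson(signal, expression)
            if r >= min_r:
                links.append((peak_id, gene_id, round(r, 3)))
    return sorted(links)


# Toy data across four tissues (hypothetical values).
peaks = {
    "peak1": ("chr1", 10_000, [1.0, 5.0, 1.0, 0.5]),
    "peak2": ("chr1", 40_000, [3.0, 3.1, 2.9, 3.0]),
}
genes = {
    "geneA": ("chr1", 60_000, [2.0, 10.0, 2.0, 1.0]),
    "geneB": ("chr2", 5_000, [1.0, 5.0, 1.0, 0.5]),
}
print(link_peaks_to_genes(peaks, genes))
```

In a real analysis the cutoff would be replaced by a significance estimate against a background of correlations, which is what an FDR threshold such as the one quoted above implies.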

Tissue-specific transcriptional regulatory network

To identify putative key transcription factors in each tissue, we performed motif analysis in each group of tissue-specific enhancers and identified a set of motifs that were enriched in the same tissues in zebrafish and human (Fig. 4e). To further probe the similarities in TF connections between zebrafish and human, we performed the three-node network analysis as previously described31 (Methods). We observed that CTCF was predicted as the driver node in most tissues, while tissue-specific TFs such as NEUROD and MYOD were predicted as middle and passenger nodes in the networks of brain and muscle tissue, respectively (Extended Data Fig. 9e, f). The overall patterns of the three-node networks were highly similar between zebrafish and human (Extended Data Fig. 9f), further demonstrating the value of zebrafish as a system to study human TF regulatory circuits.

Higher-order genome structure in zebrafish

To study higher-order chromatin structure in zebrafish, we generated high-resolution Hi-C data (10 kb) in the adult brain and muscle (Extended Data Fig. 10a), with ~2.1 and ~1.4 billion paired-end reads, respectively. Different replicates of the Hi-C experiments were highly reproducible32. We predicted the A/B compartments and found that their genomic coverages in these two tissues were similar. H3K27ac, H3K4me3 and ATAC-seq signals were enriched in the A compartment, while H3K9me2 and H3K9me3 signals were enriched in the B compartment (Extended Data Fig. 10b). We identified 5,348 regions with switched A/B compartments between the two tissues, and these regions were associated with altered gene expression and H3K27ac signals (Extended Data Fig. 10c). We predicted 1,350 TADs in the brain and 1,238 TADs in the muscle (Extended Data Fig. 10d, Supplemental Table 15). Most of them were shared between the two tissues (Extended Data Fig. 10b, e), and TAD boundaries were enriched for CTCF binding sites, SINE, and satellite elements (Extended Data Fig. 10f-h).

We identified 7,708 and 5,312 chromatin loops in the adult brain and muscle, respectively (Fig. 5a, Supplemental Table 16). The majority of the loop anchors had convergent CTCF binding motifs (72% in muscle, 63% in brain). 98.6% of the predicted loops in the brain were between regions that contain either at least one promoter or one enhancer, and 91.6% of enhancer-promoter loops overlapped with predicted enhancer-promoter or distal ATAC-seq peak-promoter linkage pairs (Fig. 5b). We performed motif analysis to identify the TFs that may play a role in forming the chromatin loops. CTCF and BORIS were enriched in shared loops (Fig. 5a), and tissue-specific TFs were enriched in tissue-specific chromatin interactions (Fig. 5a). For example, RFX and NeuroD2 were enriched in brain-specific loops while two muscle-specific master regulators, Myf5 and Ascl1, were enriched in muscle-specific loops (Fig. 5c).
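The notion of convergent CTCF motifs at loop anchors can be made concrete with a short example: a loop is convergent when the motif at the upstream anchor lies on the forward strand and the motif at the downstream anchor lies on the reverse strand, so that the two point toward each other. The sketch below assumes one motif strand has already been assigned per anchor; the function names and the choice to exclude loops lacking a motif at either anchor from the denominator are assumptions for illustration.

```python
def anchor_orientation(left_strand, right_strand):
    """Classify a loop by the strands of the CTCF motifs at its two anchors.

    left_strand belongs to the anchor with the smaller genomic coordinate.
    """
    if left_strand == "+" and right_strand == "-":
        return "convergent"  # motifs point toward each other
    if left_strand == "-" and right_strand == "+":
        return "divergent"   # motifs point away from each other
    return "tandem"          # both motifs on the same strand


def convergent_fraction(loops):
    """Fraction of loops with convergent motifs.

    loops is a list of (left_strand, right_strand); a strand of None means no
    motif was found at that anchor, and such loops are skipped.
    """
    scored = [(l, r) for l, r in loops if l is not None and r is not None]
    if not scored:
        return 0.0
    hits = sum(anchor_orientation(l, r) == "convergent" for l, r in scored)
    return hits / len(scored)


# Toy loops (hypothetical strands).
toy_loops = [("+", "-"), ("+", "-"), ("-", "+"), ("+", "+"), (None, "-")]
print(convergent_fraction(toy_loops))
```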

Fig. 5 |. Higher-order chromatin structure and zebrafish genome evolution.

a, Aggregate Peak Analysis (APA) plot and motif analysis of tissue-specific or shared chromatin loops. In each panel, n is the number of loops in that group. b, (Left) Annotation of cis-elements in the predicted loop anchors in brain with total number of loops in pie chart, 7,710. (Right) Comparison of promoter-enhancer chromatin loops with correlation-based linkage between ATAC-seq or histone modification-based enhancer-to-gene pairs with total number of loops in pie chart, 4,996. c, Examples of shared, brain-specific and muscle-specific chromatin loops. d, Relative position of evolutionary breakpoints to TADs. Breakpoints were between zebrafish and mouse (left) or between zebrafish and human (right). In all cases, we found that the evolutionary breakpoints were enriched at zebrafish TAD boundaries and depleted from the center of TADs. Kb, kilobases. e, TADs without breakpoints (upper panel) have stronger interactions inside than TADs with breakpoints (lower panel). f, Expression pattern of genes in TADs without an evolutionary breakpoint is more conserved than that of genes in TADs with a breakpoint inside. For each gene, we collected its expression profile across the same 10 tissues in both zebrafish and human, and computed a Spearman correlation coefficient (SCC) between the two profiles. The number of gene pairs without BPs is 4,625 and the number with BPs is 3,918 (P = 3.56×10^−26, Mann-Whitney U test, two-sided). g, H3K4me3 signals were higher in TAD boundaries with breakpoints than in TAD boundaries without breakpoints. h, Higher transcriptional activities at TAD boundaries with breakpoints and containing CTCF binding sites in human GM12878 cells (the number of breakpoints in blue line is 639, red line is 625) (K562 data in Extended Data Fig. 12c). *Note: Results from 17 additional vertebrates are shown in Extended Data Fig. 11a, b and Extended Data Fig. 12d.

Zebrafish genome evolution and TADs

Topologically associating domains (TADs) have been shown to be conserved among different species33–36. To investigate the relationship between TADs and zebrafish genome evolution, we first identified three sets of zebrafish evolutionary breakpoints by aligning the zebrafish genome against the chicken, mouse, and human genomes, respectively. We then compared the breakpoints with TAD annotations in zebrafish and observed that 80.5% of breakpoints (984 of 1,223 zebrafish-to-human breakpoints) were located near TAD boundaries, whereas breakpoints were depleted toward the center of TADs (Fig. 5d, Extended Data Fig. 11a). We divided TADs into two groups, TADs containing a breakpoint and TADs not containing a breakpoint. Interestingly, TADs without breakpoints had stronger interaction frequencies in the middle than TADs with breakpoints (Fig. 5e, Extended Data Fig. 12d). Further, the expression patterns of genes across different tissues in the TADs without breakpoints were more correlated with their homologs in human (Fig. 5f) than the other group, suggesting that there is an association between TAD stability and conservation of the expression pattern. This may be due to the following reasons: 1) strong chromatin interactions may contribute to TAD stability during evolution; or 2) breaking of TADs with strong interactions is selected against in evolution, since these interactions and the genes involved in the interactions may be physiologically important for zebrafish.
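The comparison of breakpoints with TAD annotations can be illustrated by expressing each breakpoint's distance to the nearest boundary as a fraction of the length of the TAD that contains it (0 at a boundary, 0.5 at the center). The sketch below is a toy illustration: the function names, the `max_fraction` tolerance used to call a breakpoint "near" a boundary, and the decision to ignore breakpoints that fall outside every TAD are assumptions, since the criterion used in this study is described elsewhere.

```python
def nearest_boundary_fraction(position, tad_start, tad_end):
    """Distance to the nearest TAD boundary as a fraction of TAD length."""
    relative = (position - tad_start) / (tad_end - tad_start)
    return min(relative, 1.0 - relative)


def fraction_near_boundary(breakpoints, tads, max_fraction=0.1):
    """Share of breakpoints lying within max_fraction of a TAD boundary.

    breakpoints: list of (chrom, position)
    tads: list of (chrom, start, end) half-open, non-overlapping intervals
    Breakpoints outside every TAD are ignored (an assumption of this sketch).
    """
    near = total = 0
    for chrom, position in breakpoints:
        for tad_chrom, start, end in tads:
            if chrom == tad_chrom and start <= position < end:
                total += 1
                if nearest_boundary_fraction(position, start, end) <= max_fraction:
                    near += 1
                break
    return near / total if total else 0.0


# Toy annotation (hypothetical coordinates).
tads = [("chr1", 0, 1_000_000), ("chr1", 1_000_000, 3_000_000)]
breakpoints = [
    ("chr1", 50_000),
    ("chr1", 500_000),
    ("chr1", 1_100_000),
    ("chr1", 2_900_000),
    ("chr2", 10),
]
print(fraction_near_boundary(breakpoints, tads))
```

Normalizing by TAD length, rather than using absolute distance, lets TADs of different sizes be pooled into a single profile like the one in Fig. 5d.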

Next, we divided zebrafish TAD boundaries into two classes, boundaries overlapping with breakpoints and boundaries not overlapping with breakpoints. We observed elevated H3K4me3 signals at TAD boundaries as previously described36. Surprisingly, we found a much higher level of H3K4me3 at TAD boundaries with breakpoints (Fig. 5g, Extended Data Fig. 11b, c). We also confirmed similar higher H3K4me3 enrichment in human or mouse TAD boundaries with evolutionary breakpoints (Extended Data Fig. 11d-f). As a control, we did not observe different levels of H3K27ac or ATAC-seq enrichment between the two groups of TAD boundaries (Extended Data Fig. 12a, b). Interestingly, an earlier report showed H3K4me3 signal enrichment at recombination hotspots in mouse37, and our finding in zebrafish further suggests its potential association with genome stability and evolution.

Previous work has suggested a link between transcription at CTCF-containing TAD boundaries and their potential role in translocations38–40. Therefore, we investigated the transcriptional status at the evolutionary breakpoints at TAD boundaries, using the Pol2, CTCF ChIP-seq, and GRO-seq data in GM12878 and K562 human cells. We observed that there were much higher transcription activities at breakpoints overlapping with CTCF-containing TAD boundaries, compared with those without CTCF TAD boundaries (Fig. 5h, Extended Data Fig. 12c).

A key feature of the zebrafish genome is an extra genome-duplication event compared with other vertebrates41. There were 2,456 paralogous gene pairs annotated in Ensembl, and the paralogs show similar expression patterns across all different tissues, with a median Pearson correlation of 0.458 (Extended Data Fig. 12e). We analyzed paralog pairs located on the same chromosome and observed that paralogs located in the same TADs have a higher correlation in gene expression patterns than paralogs located in different TADs (Extended Data Fig. 12f).
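The same-TAD versus different-TAD comparison can be sketched as a grouping step applied to precomputed expression correlations. The example below is illustrative only: the function names, the input layout, the toy values, and the choice to skip pairs in which either gene lies outside every annotated TAD are assumptions.

```python
from statistics import median


def tad_index(chrom, position, tads):
    """Index of the TAD containing (chrom, position), or None."""
    for i, (tad_chrom, start, end) in enumerate(tads):
        if chrom == tad_chrom and start <= position < end:
            return i
    return None


def median_r_by_tad(pairs, tads):
    """Median expression correlation for same-TAD and different-TAD pairs.

    pairs: list of (chrom, position1, position2, r) for paralogs on one
    chromosome, where r is a precomputed correlation of their expression.
    """
    groups = {"same_tad": [], "different_tad": []}
    for chrom, position1, position2, r in pairs:
        first = tad_index(chrom, position1, tads)
        second = tad_index(chrom, position2, tads)
        if first is None or second is None:
            continue  # skip genes outside annotated TADs
        key = "same_tad" if first == second else "different_tad"
        groups[key].append(r)
    return {key: (median(values) if values else None) for key, values in groups.items()}


# Toy data (hypothetical coordinates and correlations).
toy_tads = [("chr3", 0, 1_000_000), ("chr3", 1_000_000, 2_000_000)]
toy_pairs = [
    ("chr3", 100_000, 400_000, 0.8),
    ("chr3", 200_000, 900_000, 0.6),
    ("chr3", 50_000, 700_000, 0.7),
    ("chr3", 300_000, 1_500_000, 0.3),
    ("chr3", 800_000, 1_200_000, 0.1),
    ("chr3", 600_000, 1_900_000, 0.2),
    ("chr3", 500_000, 2_500_000, 0.9),
]
print(median_r_by_tad(toy_pairs, toy_tads))
```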

In summary, we report the most comprehensive annotation of the zebrafish genome to date, and we described both conserved and divergent gene regulatory networks and 3D genome structures between zebrafish and human. The breadth and depth of the data establish a genomic foundation for conducting further human disease modeling and biological studies in zebrafish.

Methods

Adult tissue ChIP-seq:

One-year-old adult Tübingen fish (sex information: Supplemental Table 17) were dissected to separate tissues that were washed twice in 1X PBS buffer before flash freezing on dry ice. Collection of peripheral blood from adult fish was done according to the protocol from JoVE42. All procedures on live animals have been approved by the Institutional Animal Care and Use Committee (IACUC) at the Pennsylvania State University (PRAMS201445659). At the beginning of ChIP-seq, all tissues (except peripheral blood) were ground in liquid nitrogen and fixed by 1% formaldehyde at room temperature for 15 min. 2.5 M glycine was added at a final concentration of 0.2 M and incubated at room temperature for 5 min to quench the fixation. Fixed tissues were then washed once by cold 1X PBS. Tissue pellets were resuspended and incubated on ice for 10 min in 100 μL of ChIP-seq lysis buffer (20 mM Tris-HCl, pH 8.0, 1% SDS, 50 mM EDTA, 1X proteinase inhibitor cocktail). Next, 900 μL cold 1X TE buffer was added to dilute the SDS, and the nuclei suspension was sonicated using a Covaris E220 with the following parameters: 140 W, duty factor 5, 200 per burst. To check the chromatin fragmentation size, 20 μL of input chromatin was reverse crosslinked in elution buffer (20 mM Tris-HCl, pH 8.0, 1% SDS, 1 mM EDTA) at 65 °C overnight, treated with RNase A and proteinase K and purified by phenol-chloroform extraction. Input DNA was then loaded on a Lonza flash gel to ensure that the majority of DNA was between 100–300 bp. To prepare the antibody-beads complex, 3 μg of histone H3K27ac antibody (Active Motif, 39133), H3K4me3 antibody (EMD Millipore, 07–473), H3K9me2 antibody (Cell Signaling, 4658), or H3K9me3 antibody (Abcam, ab8898) was mixed with 12 μL M-280 sheep anti-rabbit (ThermoFisher, 11203D) or sheep anti-mouse IgG Dynabeads (ThermoFisher, 11201D) in 150 μL of 5 mg/mL BSA/1X PBS buffer, with rotation at 4 °C for 3 hours. 
After incubation, the antibody-beads complexes were washed once with BSA/1X PBS buffer. About 200 μg chromatin was used per immunoprecipitation. An equal volume of master mix (1X TE, 2% Triton X-100, 0.2% sodium deoxycholate, 2X proteinase inhibitor cocktail) was mixed with 200 μg chromatin and then incubated with antibody-beads complexes overnight with rotation. The next morning, the beads were washed 5 times with cold RIPA wash buffer (20 mM Tris-HCl, pH 8.0, 1% NP-40, 0.7% sodium deoxycholate, 500 mM LiCl, 1 mM EDTA, 1X proteinase inhibitor cocktail). Then the bead-bound chromatin was eluted using 150 μL of elution buffer at 65 °C for 30 min. To prepare the library, eluted chromatin was reverse cross-linked and purified by phenol-chloroform extraction. Then DNA was end-repaired by END-IT DNA end-repair kit (Epicentre, ER81050) according to the kit protocol, adenylated using Klenow fragment (3’->5’ exo-) (NEB, M0212S), ligated with Illumina TruSeq adaptor (Illumina, FC-121–3001) and subsequently amplified by PCR (Roche, kk2601). The quality and quantity of all the libraries were checked using a BioAnalyzer High Sensitivity DNA Kit (Agilent).
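The concentration arithmetic behind the quenching and SDS-dilution steps of this protocol can be checked with two small helpers. This is a worked illustration only: the function names are hypothetical, and the 1,000 μL fixation volume used in the glycine example is an assumed value, since the protocol does not state it.

```python
def stock_volume_for_final(stock_conc, final_conc, sample_volume):
    """Volume of stock to add so the mixture reaches final_conc.

    Solves stock_conc * v = final_conc * (sample_volume + v) for v.
    Concentrations share one unit; volumes share one unit.
    """
    if final_conc >= stock_conc:
        raise ValueError("final concentration must be below stock concentration")
    return final_conc * sample_volume / (stock_conc - final_conc)


def diluted_concentration(conc, volume, added_volume):
    """Concentration after adding added_volume of diluent to volume at conc."""
    return conc * volume / (volume + added_volume)


# 2.5 M glycine stock to 0.2 M final, in an assumed 1,000 uL of fixation mix.
glycine_ul = stock_volume_for_final(2.5, 0.2, 1000.0)
print(round(glycine_ul, 1))

# 100 uL of lysis buffer at 1% SDS diluted with 900 uL of 1X TE.
sds_percent = diluted_concentration(1.0, 100.0, 900.0)
print(round(sds_percent, 3))
```

The second calculation shows why the TE step matters: it brings SDS from 1% down to about 0.1% before sonication and immunoprecipitation.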

Embryonic tissue ChIP-seq:

Zebrafish embryonic tissue ChIP-seq was performed similarly to the adult tissue with a few modifications. For embryonic trunk, 50 trunks of 1-dpf embryos were dissected and digested in 500 μL of 0.25% trypsin at room temperature for 20 min. The reaction was then neutralized with FBS. Trunk cells were washed once with cold 1X PBS and cross-linked with 1% formaldehyde at room temperature for 12 min. For embryonic neuronal cells, Tg(Huc:Kaede) transgenic fish were crossed with wildtype Tübingen fish. At 1 dpf, embryos were checked under the fluorescence microscope, and green-positive embryos were selected, dechorionated, and pipetted in calcium-free Ringer’s solution to remove the yolk. The supernatant was removed, and 500 μL of 0.25% trypsin was added to digest embryos at room temperature for 20 min. The digestion was neutralized with 500 μL of FBS, and the cells were washed once with cold 1X PBS and then sorted; only green fluorescent cells, which corresponded to embryonic neuronal cells, were collected and cross-linked with 1% formaldehyde at room temperature for 12 min.

RNA-seq

For each RNA-seq experiment, tissues from at least two Tübingen fish were combined and used as one replicate. For embryonic trunk, ten 1-dpf fish were dechorionated with pronase, and the trunk was separated for RNA-seq. For embryonic neurons, green cells from Tg(Huc:Kaede) fish were sorted by FACS, and approximately 20,000 cells were used for one replicate. Tissue RNA was extracted with Trizol® according to the manufacturer’s protocol (Invitrogen). The cDNA libraries were constructed using the SureSelect Strand Specific RNA Library Preparation Kit (Agilent) according to the manufacturer’s protocol. Briefly, polyA RNA was purified from 1000 ng of total RNA using oligo (dT) beads (Invitrogen). Extracted RNA was first fragmented, followed by reverse transcription, end-repair, adenylation, adaptor ligation, and subsequent PCR amplification. The size distribution and concentration of the final product were checked using a BioAnalyzer High Sensitivity DNA Kit (Agilent) and Kapa Library Quantification Kit (Kapa Biosystems).

ATAC-seq

Tübingen adult tissues were freshly dissected and processed immediately for ATAC-seq. Briefly, the tissues were resuspended in 1 mL of lysis buffer (1X PBS, 0.2% NP-40, 5% BSA, 1 mM DTT, protease inhibitors), followed by dounce homogenization with loose pestle for 20 strokes. The lysate was then filtered through 40 μm cell strainer, and nuclei were collected at 500 g for 5 min. Tagmentation was performed immediately according to the ATAC-seq protocol reported previously9.

Whole-genome bisulfite sequencing (WGBS)

Tübingen adult tissues were dissected, and the genomic DNA was extracted using DNeasy Blood & Tissue Kit (Qiagen, 69504). Then 1 μg of genomic DNA of each tissue was subjected to bisulfite conversion using EZ DNA Methylation-Gold Kit. The final libraries were prepared using the Accel-NGS® Methyl-Seq DNA Library Kit.

BioNano optical mapping

Genomic DNA from one Tübingen female muscle was extracted for BioNano optical mapping. DNA extraction was done according to the BioNano Prep Animal Tissue DNA Isolation Fibrous Tissue Protocol-30071. The homogenized genomic DNA was then directly labeled by DLE enzyme (BioNano, 80005) according to the BioNano Prep Direct Label and Stain (DLS)-30206 protocol. The labeled and stained genomic DNA was then loaded onto a Saphyr chip, and 407 Gb data was collected for genome assembly.

Single-cell ATAC-seq

Single-cell ATAC-seq (scATAC-seq) was performed using one female brain and one male brain, respectively, on the 10X Genomics platform. To isolate nuclei, a freshly dissected single brain was transferred to 1 mL NbActiv1 medium (BrainBits, NbActiv1 500) with a wide-bore pipette tip to break the tissue into small pieces. Then the tissue was further fragmented using regular-bore pipette tips followed by filtering through a 30 μm cell strainer. The isolated cells were spun down at 500 g for 5 min at 4 °C and then lysed in 100 μL chilled 0.1X Lysis Buffer (10 mM Tris HCl, pH 7.5, 10 mM NaCl, 3 mM MgCl2, 1% BSA, 0.01% Tween-20, 0.01% NP40 and 0.001% digitonin) on ice for 5 min. Then 1 mL chilled Wash Buffer (10 mM Tris HCl, pH 7.5, 10 mM NaCl, 3 mM MgCl2, 1% BSA, 0.1% Tween-20) was added to the lysed cells and cells were spun down at 500 g for 5 min at 4 °C. Finally, 300 μL chilled Diluted Nuclei Buffer (10x Genomics, PN 2000153/2000207) was added to resuspend the nuclei. The nuclei were filtered again using a 30 μm cell strainer before cell counting. Around 12,000 nuclei were used for one Tn5 tagmentation reaction, and the scATAC-seq library was prepared and sequenced according to the 10X Genomics user guide.

Zebrafish tissue Hi-C

Hi-C experiments on adult zebrafish brain and muscle tissues were performed according to a previously published protocol43 with a few modifications. For brain replicate 1, two adult Tübingen zebrafish brains were gently ground into small granules in liquid nitrogen and resuspended in 1 mL of cold hypotonic buffer (20 mM Tris-HCl, pH 8.0, 10 mM NaCl, 20 mM EDTA). For brain replicate 2, only one female Tübingen brain was used. Granular brains were then Dounce homogenized with a loose pestle for 20 strokes. The upper layer of the homogenate was carefully transferred into a new tube and fixed with 2% formaldehyde at room temperature for 10 min. 0.2 M of glycine was added to stop fixation. For muscle tissue, 60 mg of Tübingen muscle was first chopped into small pieces and digested by 0.25% trypsin at room temperature for 30 min. After neutralization with FBS, muscle cells were resuspended in cold 1X PBS and fixed with 2% of formaldehyde at room temperature for 10 min. 0.2 M glycine was then added to stop fixation. Two muscle Hi-C experiments were performed using tissues from female and male Tübingen fish, respectively.

Enhancer reporter assays

Selected potential tissue-specific enhancers were evaluated using a Tol2 transposon-mediated zebrafish transgenesis approach as previously described44. Selected enhancers were PCR amplified from Tübingen genomic DNA and subcloned upstream of the hsp70 promoter/eGFP cassette in the pT2HE vector (Supplemental Table 18 for primers). These enhancer reporter plasmids were subsequently co-injected with transposase mRNA into one-cell-stage Tübingen wildtype zebrafish embryos. eGFP expression patterns were monitored with a Zeiss SteREO DiscoveryV8 microscope and AxioVision Rel.4.8 software at four different developmental time points: 24–36 hpf, 2 dpf, 3 dpf, and 5 dpf. Expression patterns were recorded only if eGFP signal was consistently displayed by at least 30–40% of total embryos (approximately 100 embryos were analyzed per enhancer).

Novel transcript identification

All the RNA-seq reads were mapped to the genome GRCz10 using Tophat245 with the parameters “--mate-inner-dist 50 --mate-std-dev 1000 --read-mismatches 2 --read-edit-dist 2” and without the “-G” option, then assembled using Cufflinks46. The “cuffmerge” function was applied to combine Cufflinks assemblies from each library. The “cuffcompare” function was then used to analyze assembled transcripts of all libraries and to merge assembled transcripts with ENSEMBL and RefSeq reference annotations (ftp://ftp.ncbi.nlm.nih.gov/genomes/Danio_rerio/ARCHIVE/ANNOTATION_RELEASE.105/Gnomon/ and ftp://ftp.ensembl.org/pub/release-91/gtf/danio_rerio/). Novel transcripts were defined as those with a “u/x/j/i” class code that were present in both biological replicates. Class code “u” represents no overlap with any known transcript; “x” represents exonic overlap with known transcripts on the opposite strand; “j” represents potential novel isoforms with at least one splice junction shared with a known transcript; “i” represents a transcript falling entirely within a reference intron.

lncRNA identification and novel transcript annotation

To identify lncRNAs from novel transcripts, we first selected novel transcripts with more than one exon and longer than 150 bp. The coding potential of these transcripts was predicted using CPAT with a zebrafish model47. Transcripts with a coding potential score less than 0.38 were identified as lncRNAs. Transcripts with a coding potential score greater than 0.38 were considered novel protein-coding gene candidates. TransDecoder (https://transdecoder.github.io/) was then used to predict open reading frames of these novel protein-coding gene candidates. The predicted coding sequences were searched with BLASTp against known protein databases of other species (Xenopus, cattle, pig, chicken, mouse, and human) to identify homologs in these species (cut-off: 1e-3). The GTF files for the predicted lncRNAs and the novel transcripts were integrated with reference GTF files for further analysis.
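The filtering and classification rule described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name and its arguments are hypothetical, and the treatment of a score of exactly 0.38 (assigned here to the coding candidates) is an assumption because the text does not specify it.

```python
def classify_novel_transcript(n_exons, length_bp, coding_score, cutoff=0.38):
    """Classify one novel transcript from its exon count, length and CPAT score.

    Returns None when the transcript fails the initial filter
    (more than one exon and longer than 150 bp).
    Assumption: a score exactly equal to the cutoff counts as coding.
    """
    if n_exons <= 1 or length_bp <= 150:
        return None
    if coding_score < cutoff:
        return "lncRNA"
    return "coding_candidate"
```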

Quantification for RNA-seq signals and identification of tissue-specific genes

RNA-seq reads were trimmed using TrimGalore (https://github.com/FelixKrueger/TrimGalore). We aligned trimmed RNA-seq reads to the zebrafish genome (GRCz10) and generated bigwig files using STAR48. QC and expression values (TPM: Transcripts Per Million) for the RNA-seq libraries were computed with the RSEM package49, and the previously generated GTF files were used for the genome annotation. To correct the batch effect across libraries prepared at different times, we used the limma package in R on the log2(TPM+1) matrix, and genes with TPM <1 in two biological replicates in all tissues were further filtered out. Pearson correlation coefficients were then computed between biological replicates using the read counts of 10 kb-binned matrices. The average TPM value of the two biological replicates was used to represent the gene expression value of each tissue. To identify tissue-specific genes, TPM values for the same gene in different tissues were transformed into Z-scores, and genes were called tissue-specific at a threshold of Z >2 and a minimum of 3-fold change.
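As an illustration of the Z-score and fold-change criteria, the sketch below tests one gene across tissues. It is not the authors' code. The function name is hypothetical, the Z-score uses the population standard deviation across tissues, and the fold change is measured against the mean TPM of the remaining tissues; both choices are assumptions, since the reference for the fold change is not stated in the text.

```python
import statistics


def tissue_specific_tissues(tpm_by_tissue, z_cut=2.0, fold_cut=3.0):
    """Return the tissues in which a gene passes both cut-offs.

    tpm_by_tissue: dict mapping tissue name -> mean TPM of the replicates.
    Assumption: fold change = TPM in tissue / mean TPM of the other tissues.
    """
    values = list(tpm_by_tissue.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # identical expression everywhere, nothing is specific
    hits = []
    for tissue, tpm in tpm_by_tissue.items():
        others = [v for t, v in tpm_by_tissue.items() if t != tissue]
        background = statistics.mean(others)
        z = (tpm - mean) / sd
        fold = tpm / background if background > 0 else float("inf")
        if z > z_cut and fold >= fold_cut:
            hits.append(tissue)
    return hits
```

Note that with a population standard deviation the largest attainable Z-score for n tissues is sqrt(n − 1), so a cut-off of Z >2 can only be met when at least six tissues are compared.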

Comparison of orthologous genes between zebrafish and human

To compare the ortholog gene expression pattern between zebrafish and human, we downloaded the TPM matrix from the GTEx dataset from the same tissue (except testis and embryonic tissues), and the median TPM value of replicates was used to represent the gene expression value of each tissue. The one-to-one orthologous genes detected by BioMart were selected for the comparison.

ATAC-seq data processing

ATAC-seq library adapters were detected and trimmed using detect_adapter.py (https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/src/detect_adapter.py) and cutadapt50. The trimmed reads were aligned using bowtie251 with the following parameters: -X2000 --mm. PCR duplicates were removed using Picard (http://broadinstitute.github.io/picard/). We calculated the Pearson correlation coefficient between two biological replicates using the read counts of 10 kb-binned matrices and merged paired files with high correlation.

We then processed the ATAC-seq data using parameters recommended by ENCODE-DCC (https://github.com/ENCODE-DCC/atac-seq-pipeline). Peaks were filtered to keep those with p-value<0.00001 and q-value<0.01, and peak lengths were fixed to 500 bp. If several peaks overlapped within a tissue, we kept only the peak with the most significant signal. We then separated the peaks into proximal ATAC-seq peaks (average N=23,233), defined as those overlapping any TSS region (2.5 kb upstream to 500 bp downstream of a TSS), and distal ATAC-seq peaks (average N=122,979).
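The proximal/distal split can be sketched as below. This is an illustrative reading of the definition, not the code used in the study: the function names are hypothetical, coordinates are treated as half-open intervals on a single chromosome, and the TSS window is oriented by gene strand (an assumption consistent with "upstream" and "downstream").

```python
def tss_window(tss, strand, upstream=2500, downstream=500):
    """Return the (start, end) TSS region, oriented by strand."""
    if strand == "+":
        return tss - upstream, tss + downstream
    return tss - downstream, tss + upstream


def is_proximal(peak_start, peak_end, tss_list):
    """True if the peak overlaps any TSS region.

    tss_list: iterable of (position, strand) on the same chromosome.
    Intervals are half-open: [start, end).
    """
    for tss, strand in tss_list:
        w_start, w_end = tss_window(tss, strand)
        if peak_start < w_end and peak_end > w_start:
            return True
    return False
```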

Chromatin accessibility can reveal a full range of regulatory elements, including both active and poised enhancers and promoters. To generate the zebrafish open chromatin landscape, we merged the ATAC-seq peaks across 11 tissues. The peaks with the most significant signals were selected as representative open chromatin regions, and all other peaks overlapping a representative open chromatin region by at least one base pair were removed. In total, we identified 436,035 representative open chromatin regions in the zebrafish genome. We annotated the ATAC-seq peaks using the ChIPseeker package. Footprints were identified with the HINT software v0.13.0 (Hmm-based IdeNtification of Transcription factor footprints)52 based on ATAC-seq data. Briefly, ATAC-seq narrowPeak files were used as input, footprint regions were filtered by footprint score >10, and transcription factor motifs overlapping footprints were identified using the MOODS package v1.9.4 53 (https://github.com/jhkorhonen/MOODS) with motifs from the HOCOMOCO database54 (http://hocomoco11.autosome.ru/).

H3K27ac and H3K4me3 ChIP-seq data processing

We mapped the ChIP-seq reads to the zebrafish genome GRCz10 using the BWA aligner55. Mapped reads with MAPQ less than 30 were removed, and PCR duplicate reads were removed by Picard. We calculated the Pearson correlation coefficient between two biological replicates and merged paired files with a high correlation. For the H3K27ac and H3K4me3 histone marks, we used parameters recommended by ENCODE-DCC (https://github.com/ENCODE-DCC/chip-seq-pipeline2). In brief, candidate narrow peaks were first selected with -log10(P) >5 and -log10(Q) >2. Reads per million (RPM) of IP data and input data in each peak region were then calculated, and qualified peaks had to pass the thresholds of two-fold enrichment (RPM(IP) ≥ 2 × RPM(Input)) and RPM(IP) − RPM(Input) >1.

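$$\mathrm{RPM(IP)} = \frac{\text{Read counts of peak in IP}}{\text{Number of non-duplicated reads in IP (in millions)}}$$

$$\mathrm{RPM(Input)} = \frac{\text{Read counts of peak in input}}{\text{Number of non-duplicated reads in input (in millions)}}$$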
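The RPM calculation and the two enrichment thresholds can be written out as a short sketch. The function names are hypothetical and the code only restates the criteria given in the text; it is not taken from the ENCODE pipeline.

```python
def rpm(read_count, total_nondup_reads):
    """Reads per million: peak read count divided by library size in millions."""
    return read_count / (total_nondup_reads / 1_000_000)


def passes_enrichment(ip_count, ip_total, input_count, input_total):
    """Apply both thresholds: RPM(IP) >= 2 x RPM(Input) and a difference > 1."""
    rpm_ip = rpm(ip_count, ip_total)
    rpm_input = rpm(input_count, input_total)
    return rpm_ip >= 2 * rpm_input and (rpm_ip - rpm_input) > 1
```

For example, a peak with 50 IP reads and 10 input reads in libraries of 20 million non-duplicated reads each has RPM(IP) = 2.5 and RPM(Input) = 0.5, and passes both thresholds.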

H3K9me3 and H3K9me2 ChIP-seq data processing

Since H3K9me3 and H3K9me2 mark broad domains, we called peaks using Homer with the parameter “-region -size 1,000”, and peaks within 5 kb of each other were merged.

Reproducibility of ChIP-seq data

To check the reproducibility of biological replicates, we divided the reference genome into 10-kb bins and computed the number of reads within each bin. The Pearson correlation coefficients between biological replicates for H3K4me3, H3K27ac, H3K9me3 and H3K9me2 were calculated using the normalized 10-kb binned reads. After confirming that all replicates were highly correlated, we pooled the bam files of biological replicates together with the merge function of Samtools56 for further analysis.
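A minimal sketch of the binning and correlation step is given below, assuming read positions have already been extracted for one chromosome. The function names are hypothetical. Because the Pearson coefficient is unchanged by linear rescaling, normalizing the counts by library size would give the same value.

```python
import math
from collections import Counter


def bin_counts(read_positions, chrom_length, bin_size=10_000):
    """Count reads per fixed-size bin along one chromosome."""
    n_bins = math.ceil(chrom_length / bin_size)
    counts = Counter(pos // bin_size for pos in read_positions)
    return [counts.get(i, 0) for i in range(n_bins)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)
```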

Identification of the zebrafish genome blacklist

Genomic experiments based on Illumina sequencing (e.g., ChIP-seq, ATAC-seq) often produce artificially high signals in certain genomic regions, such as centromeres, telomeres, and satellite repeats. It is therefore essential to identify and remove these artificial signals from ChIP-seq and ATAC-seq experiments. To flag these artificial regions, ChIP-seq input sequencing data were used as IP with the whole genome sequence as background to call the artificial peaks by MACS257. The peaks with -log10(q-value) less than 5 were filtered out. Then, the ChIP sequencing data were used to call peaks with the genome sequence as background, with the same threshold. The overlapping peaks of these two datasets across all tissues were defined as the zebrafish genome blacklist, and these regions were filtered out for all ChIP-seq and ATAC-seq peak-calling analyses (Supplemental Table 19).

Identification of cis-regulatory elements

To systematically compare ChIP-seq data, we used a 1-kb flanking region of summit peaks to define the promoters and enhancers across all tissues. We observed, on average, 96% of H3K27ac and H3K4me3 were overlapping with representative open chromatin regions (ATAC-seq peaks ± 500 bp). Weak promoters were defined by H3K4me3 peaks that overlapped with representative open chromatin region but without H3K27ac peaks. Active promoters were defined by H3K4me3 peaks that overlapped with representative open chromatin region and H3K27ac peaks. Active enhancers were defined as H3K27ac peaks not overlapping with H3K4me3 peaks or TSS regions (2.5kb upstream to 500bp downstream of TSS) but overlapping with representative open chromatin 500 bp flanking regions. Since testis contains broad-spread H3K4me3 signals across the whole genome, instead of using H3K4me3 peaks, we used the annotated TSS regions that overlapped with H3K27ac peaks and representative open chromatin regions to define active promoters. Those that overlapped with representative open chromatin regions without H3K27ac peaks defined weak promoters, while active enhancers were defined by H3K27ac peaks that were away from the annotated TSS regions but overlapped with representative open chromatin regions.
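For tissues other than testis, the three definitions above reduce to a small decision rule. The sketch below assumes the overlaps have already been computed (for example with an interval-intersection tool) and are passed in as booleans; the function and argument names are hypothetical.

```python
def classify_element(has_h3k4me3, has_h3k27ac, in_open_chromatin, in_tss_region):
    """Assign a cis-regulatory class from precomputed overlap flags.

    Returns "active_promoter", "weak_promoter", "active_enhancer" or None.
    """
    if not in_open_chromatin:
        return None  # every class requires a representative open chromatin region
    if has_h3k4me3:
        return "active_promoter" if has_h3k27ac else "weak_promoter"
    if has_h3k27ac and not in_tss_region:
        return "active_enhancer"
    return None
```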

To generate the non-redundant union sets of each type of cis-regulatory elements, we merged the active enhancers/active promoters/weak promoters from different tissues, respectively, if the peaks were within 500 bp, and used the middle points to represent the location of merged active enhancers/active promoters/weak promoters, then fixed the length of each element into 2 kb.
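The construction of a union set can be illustrated as follows. This is a sketch under stated assumptions rather than the original implementation: the function name is hypothetical, "within 500 bp" is read as a gap of at most 500 bp between neighbouring peaks, and the "middle point" is taken as the midpoint of each merged interval.

```python
def union_elements(peaks, max_gap=500, fixed_len=2000):
    """Merge nearby peaks from all tissues and return fixed-length elements.

    peaks: list of (start, end) on one chromosome.
    Elements near the chromosome start are clipped at 0 and may be shorter.
    """
    merged = []
    for start, end in sorted(peaks):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current cluster
        else:
            merged.append([start, end])
    half = fixed_len // 2
    elements = []
    for start, end in merged:
        mid = (start + end) // 2
        elements.append((max(0, mid - half), mid + half))
    return elements
```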

We defined heterochromatin in each tissue by merging the H3K9me3 and H3K9me2 peaks within 500 bp. We converted the cis-regulatory elements from zebrafish genomic locations (danRer10) to human/mouse genomic locations (hg38/mm10) using the liftOver tool with the center 100 bp of each element and minMatch >0.1. The danRer10ToHg38/Mm10.over.chain files were modified based on the zebrafish conserved CNE database58 (http://zebrafish.stanford.edu).

Identification of tissue-specific cis-regulatory elements

We identified tissue-specific cis-regulatory elements based on the union sets of each type of cis-regulatory element. We computed the H3K27ac RPM change to represent the active enhancer/active promoter intensity, and the resulting matrix was quantile normalized. The normalized matrix was transformed into Z-scores to identify the tissue-specific elements at a threshold of Z >2 and a minimum of 2-fold change in magnitude. GO term analysis of the top 1,000 tissue-specific enhancers/active promoters (ranked by intensity) was performed using GREAT 3.0.059 after lifting over the genomic coordinates of each type of cis-regulatory element to Zv9/danRer7. Motif analysis was performed by HOMER2. For super-enhancers, only the narrow peaks within super-enhancers were used for the GO term and motif analysis.

WGBS data processing

WGBS paired-end reads were mapped to the zebrafish genome assembly GRCz10 as previously described, with minor modifications60. To increase the mapping efficiency, the first ten low-quality base pairs of read 1 and the first 15 of read 2 were trimmed along with adapter sequences using Trim Galore! (The Babraham Institute) version 0.6.1 with the following parameters: --clip_R1 10 --clip_R2 15 --paired --retain_unpaired -r1 21 -r2 21. The trimmed reads were mapped to the in silico bisulfite-converted zebrafish reference genome using Bismark61 version 0.18.1 with the following parameters: -X 2000 --un -N 1 -L 28. Unpaired or unmapped read 1s were then mapped in single-read mode using Bismark with the following parameters: -N 1 -L 28. Unpaired or unmapped read 2s were also mapped in single-read mode using Bismark with the following parameters: --pbat -N 1 -L 28. Redundant reads from PCR amplification were then removed using the following Bismark command: deduplicate_bismark --bam. The methylation information for individual cytosines was extracted from the deduplicated reads using the following Bismark command: bismark_methylation_extractor --comprehensive --merge_non_CpG --gzip. After merging the paired-end and single-end extracted files, the Bismark commands bismark2bedGraph --CX and coverage2cytosine --CX were used to calculate the total read count and methylated read count for each cytosine. The methylation levels and read coverage of each CpG were visualized on the WashU Epigenome Browser using a methylC track62.

The mean CpG methylation levels, percentages of CpGs with low, medium, and high methylation levels, and distribution of CpG methylation levels were calculated by using CpGs with at least five read coverage.

UMRs and LMRs

UMRs and LMRs were identified using MethylSeekR v.1.22.063, following the tool’s recommendations with minor modifications. A random number generator seed of 123 was set at the beginning of the analysis to ensure reproducibility. Partially methylated domains were identified and masked using the smallest chromosome, chromosome 25, as a training set. UMRs and LMRs were identified using cutoffs of methylation levels below 0.5 and at least 5 or 6 CpGs, ensuring FDRs below 5%. UMRs and LMRs were assigned as active promoters, weak promoters, or active enhancers if they overlapped with these CREs defined in the same tissue, using BEDTools v.2.27.1. UMRs and LMRs were similarly intersected with proximal or distal ATAC-seq peaks.

DMRs

DMRs between two tissues were identified using DSS v.2.14.064. First, the mean methylation level of each CpG site was estimated with smoothing65 in the DSS package. Then, dispersion at each CpG site was estimated, and the Wald test was performed on each CpG site to calculate the statistical significance of the methylation difference between samples. Without replicates, DMRs were detected using the DSS package’s callDMR function with the following parameters: delta=0.2, p.threshold=0.05, minlen=200, minCG=5, dis.merge=50, pct.sig=0.5. Hypomethylated DMRs in a given tissue were further filtered by intersecting with UMRs or LMRs in the same tissue. Tissue-specific hypoDMRs were defined as DMRs hypomethylated in at least 8 out of 10 pairwise comparisons. For the heatmap of DNA methylation levels in hypoDMRs (Fig. 3d), the union set of hypoDMRs was obtained using BEDtools’ merge function. DNA methylation levels of the union set of hypoDMRs were calculated using CpGs with at least 5 read coverage. Only regions that exist uniquely in the hypoDMRs of each tissue and have DNA methylation levels available across all 11 tissues were used. Gene ontology and wildtype expression enrichment analyses were performed with GREAT v.3.0.0 after converting the genomic coordinates of hypoDMRs to Zv9/danRer7. Heatmaps of DNA methylation levels, H3K27ac ChIP-seq and ATAC-seq signals of tissue-specific hypoDMRs along with their neighboring regions were plotted using deepTools66.

Repetitive elements enrichment analysis

We compared the predicted CREs with different subtypes of repetitive elements, including LTR, DNA, SINE, LINE, satellite, unknown and rolling circle (RC), annotated by RepeatMasker to investigate whether some CREs were enriched or depleted of any specific types of repetitive elements. We analyzed the enrichment by calculating the number of overlapped base pairs between each cis-regulatory element and subtype of repetitive elements. The equation of calculated fold enrichment is shown below: j represents the H3K27ac peaks in the active enhancer regions/ H3K9me3 peak regions/ H3K9me2 peak regions; i represents the subtype of repetitive elements.

$$\text{Fold enrichment} = \frac{\text{Length of } j \text{ overlap with } i \,/\, \sum \text{Length of } j}{\text{Length of } i \,/\, \sum \text{Length of Genome}}$$
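Reading the fold enrichment as the fraction of element bases that overlap a repeat subtype, divided by the fraction of the genome covered by that subtype, the calculation can be sketched as follows. The function and argument names are hypothetical, and the base-pair totals are assumed to have been computed beforehand.

```python
def fold_enrichment(overlap_bp, total_j_bp, total_i_bp, genome_bp):
    """Observed overlap fraction divided by the genome-wide repeat fraction.

    overlap_bp: bases of elements j that overlap repeat subtype i
    total_j_bp: total bases covered by elements j
    total_i_bp: total bases covered by repeat subtype i
    genome_bp:  total genome length
    """
    observed = overlap_bp / total_j_bp
    expected = total_i_bp / genome_bp
    return observed / expected
```

A value above 1 indicates enrichment of the repeat subtype in the element set, and a value below 1 indicates depletion.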

Genome assembly

The Oxford Nanopore sequencing reads of zebrafish Tu (50X coverage) were assembled using Canu software67 (version: 1.8) with default settings for nanopore reads. To improve the accuracy of our assembly, we performed four rounds of genome polishing. In the first two rounds, we mapped the nanopore long reads to the contigs using minimap2 (version: 2.16-r922) and polished the contigs with Nanopolish (version: 0.11.1). In rounds 3 and 4, the 10X Genomics reads (barcodes trimmed) were aligned to the contigs with BWA-MEM (0.7.15-r1140), duplicated reads were marked by Picard (version: 2.17.2), and Pilon software68 (version: 1.23) was used to polish the contigs. In total, 1,325 Nanopore contigs (max length = 40.077 Mbp) were generated with an N50 of 9.418 Mbp. Next, we applied BioNano Solve v3.2.1 software (“non-haplotype” with “no extend split” and “no cut seg dups”) to combine the Nanopore contigs with BioNano optical mapping data to generate 114 chromosome arm-level scaffolds (N50 = 29.725 Mbp and max length = 42.585 Mbp). In the last step, we leveraged the brain Hi-C data with 3d-dna software69 (default parameters) to generate the final chromosome-level scaffolds with a total length of 1,519 Mbp (N50 = 56.768 Mbp, max length = 75.632 Mbp). Structural variations between the de novo assembly and the GRCz10 reference genome were called with BioNano_Access software.
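The N50 values quoted for the contigs and scaffolds follow the standard definition: the length of the sequence at which the cumulative length, counted from the longest sequence downwards, first reaches half of the total assembly length. A minimal sketch, with a hypothetical function name:

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths (0 if empty)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```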

Linking distal ATAC-seq peaks/Active enhancers to genes by correlation

We strictly followed the strategies proposed by Corces et al.30. Briefly, we first removed ATAC-seq peaks whose normalized RPMs were less than 1 across all tissues. We then computed the Pearson correlation coefficient (PCC) between all ATAC-seq peaks (log2(RPM)) and genes (log2(TPM+1)) whose TSSs were located within 500 kb of the ATAC-seq peaks. To determine the PCC cutoff and estimate the FDR, we randomly selected 10,000 ATAC-seq peaks and computed their PCC with genes located on other chromosomes. To compute the p-value for each correlation value, we calculated a z-score by comparing it with the mean and the standard deviation of the correlations of all the random ATAC-seq-to-gene pairs. The z-score was converted into a two-tailed p-value. We then estimated the FDR using the Benjamini-Hochberg procedure. The distal H3K27ac peaks were processed in the same way, with FDR < 0.01 and PCC cutoff = 0.56142.
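The significance steps above (z-score against the null correlations, two-tailed p-value, Benjamini-Hochberg adjustment) can be sketched with the standard library alone. The function names are hypothetical, and the conversion of z to p assumes a normal null distribution, which is implied but not stated explicitly in the text.

```python
import math
import statistics


def two_tailed_p(value, null_values):
    """Two-tailed normal p-value of `value` against an empirical null set."""
    mu = statistics.mean(null_values)
    sd = statistics.stdev(null_values)
    z = (value - mu) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Peak-to-gene pairs whose adjusted p-value falls below the chosen FDR threshold would then be retained.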

Motif comparative analysis between zebrafish and human

First, we predicted the motif enrichment in each zebrafish tissue-specific enhancer group using the Homer findMotifGenome function. Then, we merged the enriched motifs in all tissues to generate a tissue-motif matrix. The p-values of motifs in each tissue were quantile normalized. The human motif matrix was generated with the same method using human Roadmap ChIP-seq datasets70. We then ranked the motifs by normalized p-value, and the top three motifs of each zebrafish tissue were compared with the human motif matrix.

Core transcriptional regulatory network analysis

To identify the regulatory TF motifs for each tissue in zebrafish and human, we used HOMER71 to analyze the nucleosome-free regions (based on H3K27ac intensity) within the 10-kb flanking regions of TSS sites for 661 TF genes. We next used FIMO72 to scan motif occupancy in the nucleosome-free regions with p < 1e-5 as the threshold. Then, we performed the three-node motif network analysis using the method described previously31.

Hi-C data processing

Mapping and matrix generation:

For Hi-C data, adaptor sequences were trimmed, and low-quality reads were removed. Paired-end reads were mapped to the GRCz10 genome using HiC-Pro v2.9.073 (https://github.com/nservant/HiC-Pro). Singleton, multi-mapped, dumped, dangling, self-circle paired-end reads, and PCR duplicates were all removed by HiC-Pro after mapping. We generated raw contact matrices at 25-kb, 40-kb, 100-kb, 500-kb, and 1-Mb resolutions. Visualization of Hi-C contact matrices was done using juicer tools v1.8.974 and juicebox v1.1175 (https://github.com/aidenlab/juicer/wiki/Download). HiCRep was used to compute correlations between replicates32. The cool files were generated by cooler v.0.8.676, and cooltools v.0.4.0 (https://github.com/mirnylab/cooltools) was used to compute the expected and observed values at different resolutions.

A/B Compartment:

A/B compartment analysis was performed at 40-kb, 100-kb, and 250-kb resolutions using HOMER software. The positive eigenvalues were set to A compartments, and negative values were set to B compartments. We identified the regions with changes in sign of the PC1 value between muscle and brain as A/B compartment switched regions.
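The sign-based compartment assignment and the detection of switched regions can be sketched as below, assuming the PC1 values of the two tissues are given as equal-length lists over the same bins. The function names are hypothetical, and bins with a PC1 of exactly zero are left unassigned, which is an assumption.

```python
def compartment(pc1):
    """Return "A" for positive PC1, "B" for negative PC1, None for zero."""
    if pc1 > 0:
        return "A"
    if pc1 < 0:
        return "B"
    return None


def switched_bins(pc1_muscle, pc1_brain):
    """List (bin index, muscle compartment, brain compartment) for sign changes."""
    switches = []
    for idx, (m, b) in enumerate(zip(pc1_muscle, pc1_brain)):
        cm, cb = compartment(m), compartment(b)
        if cm and cb and cm != cb:
            switches.append((idx, cm, cb))
    return switches
```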

DI calculation:

The directionality index (DI) of the 40-kb binned raw Hi-C matrix was calculated as previously described36.
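For orientation, the sketch below implements the commonly used directionality index of Dixon et al. (2012), DI = sign(B − A) × ((A − E)²/E + (B − E)²/E), where A and B are the contacts of a bin with its upstream and downstream windows and E = (A + B)/2. That the cited method uses this exact form, and the 2-Mb window (50 bins at 40 kb), are assumptions; the function name is hypothetical.

```python
def directionality_index(matrix, i, window_bins=50):
    """Directionality index of bin i in a square contact matrix (list of lists).

    window_bins: number of bins summed on each side (assumed 50 = 2 Mb at 40 kb).
    """
    n = len(matrix)
    a = sum(matrix[i][j] for j in range(max(0, i - window_bins), i))          # upstream
    b = sum(matrix[i][j] for j in range(i + 1, min(n, i + window_bins + 1)))  # downstream
    if a == b:
        return 0.0  # no bias, also covers bins without any contacts
    e = (a + b) / 2
    sign = 1.0 if b > a else -1.0
    return sign * ((a - e) ** 2 / e + (b - e) ** 2 / e)
```

Positive values indicate a downstream interaction bias and negative values an upstream bias.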

Insulation, boundary calculation, and TAD calling:

The TAD structure (insulation/boundaries) was defined by the insulation score, as in previous studies77,78. The matrices used to calculate the insulation score were normalized by the ICE method79 to remove the bias of the raw matrices. The insulation score of the ICE matrix was calculated with the following parameters: -is 480,000 -ids 320,000 -im iqrMean -ss 160,000.

Hi-C loop calling:

Loops were computed from Hi-C matrices using HiCCUPS43, which is part of the juicer tools package, with the parameters “--ignore_sparsity -r 10000 -k KR -f .1,.1,.1 -p 4,2,1 -i 7,5,3 -t 0.02,1.5,1.75,2 -d 20000,20000,50000”, as previously described. Consistent loops were identified using the pairtopair function in BEDtools with the parameter “-type both -f 0.5”. Tissue-specific loops were identified using the pairtopair function in BEDtools with the parameter “-type notboth -slop 40000”.

Identification of evolutionary breakpoints distribution in TADs

We used progressiveMauve80 to identify the genomic rearrangement breakpoints between zebrafish and human, gibbon, chimp, gorilla, mouse, cat, dog, pig, sheep, cattle, chicken, zebra finch, Xenopus, platyfish, stickleback, tilapia, medaka, and fugu, respectively. To calculate the breakpoint density in TADs, we divided the zebrafish genome into 100-kb bins and computed the number of overlapped breakpoints. We plotted the profile figure using the computeMatrix and plotProfile packages from deepTools, with a TAD length of 2,800 kb (median size of the TADs) and boundary length of 40 kb. We defined TADs with breakpoints using bedtools intersect function with parameter “-f 0.3”.

H3K4me3 and super-enhancer pattern analysis of pile-up TADs

To integrate histone signals and CREs intensity in TADs, we divided the zebrafish genome into 100-kb bins. For H3K4me3, we computed the H3K4me3 per base-pair signal in muscle within each bin. We used computeMatrix and plotProfile packages from deepTools to plot the density of H3K4me3. To compare the expression similarity of the genes in TADs with breakpoints and TADs without breakpoints, we calculated the Spearman correlation coefficients of gene expression TPM across eleven tissues between zebrafish and human.

Interaction pattern analysis of pile-up TADs

To visualize the overall interaction patterns within TADs, we performed a pile-up analysis similar to APA plotting43. We first normalized the interaction matrix by dividing each entry of the raw 25-kb matrix by the expected interaction frequency at the corresponding genomic distance. To make the TAD boundary sharper, we only considered the whole-genome intra-TAD interaction frequencies and excluded inter-TAD interactions when we calculated the expected values. Then the TAD sets with and without breakpoints were separately piled up and averaged over the normalized matrix. Specifically, for each TAD of interest, we extended fixed 40 bins (1 Mb) on both sides of the midpoint and extracted the 80×80 matrix within the extended region. Hundreds of such matrices were then aggregated, and the average interaction intensity was calculated for each location of the matrix. In the resulting plot, the center point corresponds to the center of each TAD, and boundaries of the center block represent the average locations of piled-up TAD boundaries.
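The two steps of the pile-up (distance normalization, then averaging of fixed-size windows around TAD midpoints) can be sketched as follows. This is a simplified illustration: the function names are hypothetical, and the expected value is computed here from all bin pairs at each distance, whereas the analysis described above restricts it to intra-TAD pairs.

```python
def expected_by_distance(matrix):
    """Mean contact frequency for each genomic distance (in bins)."""
    n = len(matrix)
    expected = []
    for d in range(n):
        vals = [matrix[i][i + d] for i in range(n - d)]
        expected.append(sum(vals) / len(vals))
    return expected


def normalize(matrix):
    """Divide each entry by the expected value at its distance."""
    exp = expected_by_distance(matrix)
    n = len(matrix)
    return [[matrix[i][j] / exp[abs(i - j)] if exp[abs(i - j)] else 0.0
             for j in range(n)] for i in range(n)]


def pile_up(norm, midpoints, flank):
    """Average the (2*flank x 2*flank) windows centred on each TAD midpoint.

    Windows running off the matrix edge are skipped.
    """
    size = 2 * flank
    total = [[0.0] * size for _ in range(size)]
    used = 0
    for mid in midpoints:
        lo, hi = mid - flank, mid + flank
        if lo < 0 or hi > len(norm):
            continue
        for r in range(size):
            for c in range(size):
                total[r][c] += norm[lo + r][lo + c]
        used += 1
    return [[v / used for v in row] for row in total] if used else total
```

With 25-kb bins, `flank=40` reproduces the 80×80 window described in the text.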

Single-cell analysis

We generated 781,123,374 reads from 23,871 cells using the 10X Genomics Chromium single-cell ATAC-seq protocol. The BCL files generated from sequencing were used as input to the 10X Genomics Cell Ranger ATAC-seq pipeline; the FASTQ files were then aligned to the GRCz10 genome using BWA⁸¹, and fragments with MAPQ>30 were kept for further analysis, each associated with a single cell barcode. To assess the quality of the scATAC-seq data, we calculated the Pearson correlation coefficient between the aggregated scATAC-seq signal and bulk ATAC-seq. The single-cell data were highly correlated with bulk ATAC-seq (Extended Data Fig. 15a). A cell-by-bin matrix was then generated by segmenting the genome into 5-kb windows and scoring each cell for reads in each window. Cells with a log10(UMI) value between 3 and 5 and a fraction of reads in promoters between 10% and 65% were retained as high quality (Extended Data Fig. 15b). In total, 19,955 cells passed quality control and were used for further analysis. We used SnapATAC (https://github.com/r3fang/SnapATAC) to reduce the dimensionality of the dataset and the deep neural network-based scAlign⁸² to remove the batch effect between the two replicates. We then identified 25 clusters using graph-based clustering. To identify cell-type-specific regulatory elements, we called peaks on aggregated single cells from each cluster using MACS2. In total, we identified 268,268 non-redundant peaks, of which 195,004 were also called as peaks in bulk ATAC-seq data. A total of 171,159 cluster-specific differentially accessible peaks (DA peaks) were identified by Fisher's exact test at a threshold of FDR<0.05. To determine the cell type of each cluster, we performed motif analysis based on known transcription factor binding motifs in the DA peak regions of each cluster using HOMER. We used ChromVAR⁸³ to estimate bias-corrected deviations of transcription factor binding motif enrichment, and the result was consistent with the top motif enrichments for the DA peaks of each cluster.
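The cell-by-bin matrix and the barcode filter described above can be illustrated with a small sketch. It is not the SnapATAC implementation: the helper names (`build_bin_index`, `cell_by_bin`, `passes_qc`), the `(chrom, start, end, barcode)` fragment layout, assigning each fragment to the bin containing its midpoint, the dense matrix, and the inclusive thresholds are all assumptions made for the example.

```python
import numpy as np


def build_bin_index(chrom_sizes, bin_size=5000):
    """Map each chromosome to the column offset of its first bin."""
    offsets, total = {}, 0
    for chrom, size in chrom_sizes.items():
        offsets[chrom] = total
        total += -(-size // bin_size)  # ceiling division
    return offsets, total


def cell_by_bin(fragments, chrom_sizes, bin_size=5000):
    """Count fragments per (barcode, genomic bin); dense for illustration only."""
    offsets, n_bins = build_bin_index(chrom_sizes, bin_size)
    barcodes = sorted({frag[3] for frag in fragments})
    row = {bc: i for i, bc in enumerate(barcodes)}
    mat = np.zeros((len(barcodes), n_bins), dtype=np.int32)
    for chrom, start, end, bc in fragments:
        mid = (start + end) // 2  # assumption: assign fragment by its midpoint
        mat[row[bc], offsets[chrom] + mid // bin_size] += 1
    return barcodes, mat


def passes_qc(n_umi, promoter_frac,
              log_umi_range=(3.0, 5.0), frac_range=(0.10, 0.65)):
    """Boolean mask of barcodes inside both QC windows (bounds assumed inclusive)."""
    n_umi = np.asarray(n_umi, dtype=float)
    promoter_frac = np.asarray(promoter_frac, dtype=float)
    log_umi = np.log10(np.maximum(n_umi, 1.0))  # guard against log10(0)
    return (
        (log_umi >= log_umi_range[0]) & (log_umi <= log_umi_range[1])
        & (promoter_frac >= frac_range[0]) & (promoter_frac <= frac_range[1])
    )
```

At genome scale the matrix would normally be stored in a sparse format; the dense array here only keeps the example short.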

Statistics

All boxplots in the main and Extended Data figures were drawn with the boxplot functions of R and Python: horizontal line, median; box, interquartile range (IQR); whiskers, the values within the 5th and 95th percentiles.
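As a rough illustration of this boxplot convention, the sketch below computes the summary values directly. The function name `boxplot_summary` and the rule that each whisker ends at the most extreme data point lying within the 5th to 95th percentile range are assumptions for the example, not a description of the plotting code used in the study.

```python
import numpy as np


def boxplot_summary(values, whisker_percentiles=(5, 95)):
    """Median, IQR box edges, and percentile-bounded whisker ends."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]
    if x.size == 0:
        raise ValueError("no finite values to summarize")
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    p_lo, p_hi = np.percentile(x, whisker_percentiles)
    return {
        "median": float(median),
        "q1": float(q1),
        "q3": float(q3),
        # whiskers end at the most extreme observations inside the percentiles
        "whisker_low": float(x[x >= p_lo].min()),
        "whisker_high": float(x[x <= p_hi].max()),
    }
```

In matplotlib, passing `whis=(5, 95)` to `boxplot` draws whiskers with the same percentile convention.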

Extended Data

Extended Data Figure 1. Tissue-specific gene expression in zebrafish.

a, Clustering analysis of transcripts from RNA-seq data in embryonic and adult tissues (n=31,842). b and c, Gene Ontology and KEGG pathway analysis of the tissue-specific genes in adult brain, heart and testis (numbers of tissue-specific genes in these two panels: brain=3,693, heart=392, testis=1,605). d, Distribution of H3K4me3 signals surrounding the known and predicted novel transcripts. e, Human orthologs of zebrafish tissue-specific genes were more tissue-specific than human orthologs of non-tissue-specific zebrafish genes (n=14,764, 3,739, 6,043; two-sided Mann-Whitney U test, ***P<2.2×10⁻¹⁶).

Extended Data Figure 2. Comparative analysis of zebrafish cis-regulatory elements.

a, Comparison of the predicted regulatory elements with previously published data. Enhancers were based on H3K27ac signals in the same four tissues (brain, heart, intestine, testis) from Perez-Rico et al. 2017. Note that our data were generated from the Tübingen zebrafish strain, whereas the published results were from the AB strain. b, Number of predicted cis-regulatory elements in each tissue. E-brain stands for 1 dpf embryonic neuron cells. E-trunk stands for the 1 dpf zebrafish whole trunk region. c, An example showing that genes with active promoters have higher expression levels. Blue hollow bar indicates the known mrpl39 promoter. Orange hollow bar indicates the potential novel promoter. Note that the mrpl39 promoter has H3K4me3 peaks in both muscle and brain, but has strong H3K27ac signals only in muscle, where its expression is higher (4.43-fold). d, Gene Ontology results for the muscle-specific and skin-specific enhancers. We used the GREAT tool for this analysis (numbers of tissue-specific enhancers used in this figure: muscle=813, skin=512).

Extended Data Figure 3. Enhancer reporter assay for tissue-specific enhancers.

In total, 28 of 32 predicted tissue-specific enhancers showed consistent GFP signals in the corresponding tissues. For the eight brain enhancers tested, 63/95, 51/86, 85/119, 112/143, 27/45, 34/48, 27/41, 62/77, and 37/45 embryos, respectively, had green signals in the brain region. For the six tested heart enhancers, 64/94, 52/85, 79/121, 20/41, 51/95, 32/55 and 20/31 embryos, respectively, had green signals in the heart region. For the six tested muscle enhancers, 52/57, 26/30, 107/124, 53/63, 93/114, 61/67 and 66/78 embryos, respectively, had green signals in the trunk muscle. For the four selected kidney enhancers, 47/82, 35/67, 44/62, 15/42 and 56/110 embryos, respectively, had green signals in the kidney region.

Extended Data Figure 4. Single cell ATAC-seq in zebrafish brain.

a, Barcode selection for single-cell ATAC-seq. The x-axis represents the log value of the number of unique molecular identifiers (UMIs); the y-axis represents the fraction of fragments in promoter regions; the red lines mark the thresholds, and the grey shading marks barcodes that passed the filter. b, Genomic distribution of all differentially accessible (DA) peaks. c, Overlap of all differentially accessible peaks with enhancers predicted in bulk brain. d, Top panel, the cluster distribution in the tSNE projection. Bottom left, pileups of differentially accessible ATAC-seq signals for each cluster; the ±10-kb flanking region surrounding peak centers is shown. Bottom right, most significantly enriched transcription factor motif for each cluster. e, tSNE projection of all scATAC-seq cells colored by Z-score of peak enrichment. f, Motif enrichment of known neuron-specific TFs in scATAC-seq predicted clusters (n=19,955).

Extended Data Figure 5. Heterochromatin annotation in adult tissues.

a, WashU Epigenome Browser screenshot of H3K9me3 and H3K9me2 histone ChIP-seq signals in 11 zebrafish adult tissues. The values on the y-axis were input-normalized. b, Distribution of H3K9me3 and H3K9me2 sites in the zebrafish genome. c, Venn diagram showing the overlap between H3K9me3 and H3K9me2 sites in the zebrafish genome. d, Percentage of overlap between H3K9me3 and H3K9me2 peaks in adult tissues. e, H3K9me3 and H3K9me2 sites were depleted of ATAC-seq, H3K4me3 and H3K27ac ChIP-seq signals (n=68,789 H3K9me3 sites and n=73,777 H3K9me2 sites). f, Overlap of H3K9me3 sites, H3K9me2 sites, and ATAC-seq peaks with repetitive elements (total number for each bar, from left to right: 68,789, 73,777 and 436,036). g, Examples of H3K9me3 sites in one tissue found to be active regions in other tissues. Horizontal scale 0–20 for H3K27ac and H3K4me3, 0–10 for RNA-seq, 0–5 for H3K9me3 and H3K9me2.

Extended Data Figure 6. DNA methylation level and distribution in adult tissues.

a, Fraction of total CpGs with low (< 25%), medium (≥ 25% and < 75%), and high (≥ 75%) methylation levels and mean CpG methylation levels (mCG/CG) in zebrafish adult tissues (the mCG/CG ratio, from left to right: 0.788, 0.859, 0.790, 0.777, 0.791, 0.797, 0.781, 0.777, 0.804, 0.789, 0.781). b, Distribution of CpG methylation levels across zebrafish adult tissues. c, The distribution of non-CpG methylation in 11 adult tissues. d, Mean methylation levels of the tissue-specific gene promoters. n represents the number of tissue-specific gene promoters. e, Mean methylation level of CpGs overlapping different genomic features or repetitive element classes. CDS, coding sequence. f, Number of UMRs and LMRs in zebrafish tissues and their overlap with enhancers and promoters (left panel) (numbers of UMRs and LMRs, from top to bottom: 14,990, 10,569, 14,569, 14,587, 14,831, 14,289, 13,842, 13,569, 14,424, 14,374, 13,908, 30,009, 7,916, 19,038, 21,411, 22,591, 16,796, 14,961, 16,268, 17,481, 15,932, 15,665) and with ATAC-seq peaks (right panel) (numbers of UMRs and LMRs are the same as in the left panel). g, Clustering of tissue-specific hypoDMRs. Values in the heatmap are mean methylation levels of hypoDMRs (n=17,654, number of tissue-specific hypoDMRs).

Extended Data Figure 7. De novo assembly of zebrafish chromosome 4 of the Tübingen strain.

a, WashU Epigenome Browser snapshot showing that the heterochromatic marks H3K9me2 and H3K9me3 were enriched on chromosome 4 in zebrafish testis. The values on the y-axis were input-normalized. b, H3K9me2, H3K9me3, and DNA methylation levels on the chr4 long arm are significantly higher than in other regions in all tissues (n=11, two-sided t-test). c, Overall strategy for de novo assembly of the Tübingen chr4 by integrating 10X, Nanopore, Bionano, and Hi-C data. d, Bionano long-molecule sequencing data show that there were many SVs on chr4 when mapped to the GRCz11 reference genome. e, SVs on chr4 detected by Bionano when the data were mapped to the de novo assembled chr4.

Extended Data Figure 8. Conservation of CREs from zebrafish to other vertebrates.

a, Percentage of zebrafish enhancers whose sequences were conserved in human (number for each bar, from left to right: 13,307, 7,018, 11,940, 7,499, 14,783, 14,272, 8,995, 13,777, 10,757, 15,505, 1,734, 4,011, 5,247). b and c, Similar to Fig. 4a. Percentage of zebrafish exons and CREs that have orthologous sequences in mouse and other fish species. Total number for each bar, from left to right: 1,000, 25,593, 58,065, 1,000. For exons and random regions, we randomly sampled 1,000 elements and computed their conservation percentage. The simulations were performed 20 times and the average percentage is presented. d, Another example of an ultra-conserved non-coding element (UCNE). This element (FOXP1_Finn_1) is predicted to be a muscle enhancer in zebrafish, mouse, and human. Grey vertical bar marks the ultra-conserved region. Red vertical bar is the enhancer sequence in the human genome that was validated as a limb enhancer by transgenic mouse reporter assay in the VISTA Enhancer Browser (#hs956).

Extended Data Figure 9. Distal ATAC-seq peak-to-gene pairs, enhancer-to-gene pairs, and transcriptional regulation network.

a and b, Distance distribution of CREs to their linked gene TSS. c, Correlation of ATAC-seq peak-to-gene pairs and enhancer-to-gene pairs (n from left to right=3,292, 3,827, 3,544, 3,281, 3,008, 2,795, 2,357, 2,001, 1,106). d, Validation of predicted enhancer-to-gene pairs by Hi-C interaction counts in muscle. e, mef2d is a regulator in both zebrafish muscle and heart, but motif prediction analysis indicates that it regulates different downstream targets in the two tissues. f, The overall structure of the regulatory network is conserved between human and zebrafish. Feed-forward loop (FFL) connection analysis was performed. In this analysis, there are three types of nodes: A, driver node that regulates B and C; B, middle node, regulated by A but regulating node C; C, passenger node, regulated by both A and B.

Extended Data Figure 10. Compartment and TADs in zebrafish.

a, Heatmap of genome-wide Hi-C interaction matrices in zebrafish brain (blue) and muscle (red). b, Active marks (H3K4me3, H3K27ac, and ATAC-seq) were enriched in compartment A and depleted in compartment B. Repressive marks (H3K9me2 and H3K9me3) were enriched in compartment B. Error bands represent standard error of the mean. c, Genome browser snapshot of A/B compartments in brain and muscle. The blue vertical shaded area marks a region that is located in compartment B in brain but in compartment A in muscle. As expected, the A compartment is associated with more ATAC-seq peaks and stronger H3K27ac and RNA-seq signals. d, Examples of shared TADs between zebrafish brain and muscle. e, Average DI scores surrounding TAD boundaries identified in brain (upper panel) and muscle (lower panel). f, ChIP-seq data show that CTCF binding sites were enriched at TAD boundaries. g, Footprint analysis of ATAC-seq peaks at TAD boundaries shows enrichment of the CTCF binding motif (value of each bar, from left to right: 0.213, 0.24, 0.22, 0.237, 0.251, 0.232, 0.24, 0.262, 0.271, 0.281, 0.37, 0.27, 0.253, 0.25, 0.252, 0.253, 0.26, 0.23, 0.238, 0.24, 0.22). h, Repetitive elements enriched at TAD boundaries (left panel) and loop anchors (right panel).

Extended Data Figure 11. Comparing zebrafish evolutionary breakpoints with TAD annotation.

a, Similar to Fig. 5d. Enrichment of evolutionary breakpoints at TAD boundaries. Relative positions of evolutionary breakpoints to TADs in 15 vertebrates. In all cases, we found that the evolutionary breakpoints were enriched at zebrafish TAD boundaries and depleted from the center of TADs. Grey vertical bar labels the TAD body area. b, By comparing zebrafish with 17 vertebrates, H3K4me3 signals were found to be more enriched at TAD boundaries with breakpoints than at those without breakpoints. Orange vertical bar labels the TAD boundaries. c, Higher H3K4me3 levels at breakpoint-containing TAD boundaries were also found when using the TAD annotation from zebrafish muscle, similar to Fig. 5g. d, H3K4me3 enrichment at human ESC (H1) TAD boundaries with or without zebrafish-to-human breakpoints. e, H3K4me3 enrichment at mouse ESC TAD boundaries with or without zebrafish-to-mouse breakpoints. f, H3K4me3 enrichment at human ESC (H1) TAD boundaries with or without mouse-to-human breakpoints.

Extended Data Figure 12. TADs with and without breakpoints.

a, H3K27ac and ATAC-seq signals do not differ between TAD boundaries with breakpoints and those without breakpoints. Orange vertical bar labels the TAD boundaries. b, Sizes of TADs with and without evolutionary breakpoints were similar (n=573, 777, two-sided t-test). c, Enrichment of transcription at breakpoints (BP) that overlap with CTCF TAD boundaries in K562 cells (number of breakpoints: blue line, 639; red line, 625). d, In 17 vertebrates, TADs without evolutionary breakpoints (bottom panel) have stronger interaction frequencies in the middle than TADs with evolutionary breakpoints (upper panel). Breakpoints in these 17 vertebrates were defined by comparing their genomes to the zebrafish genome. e, Distribution of correlations between the expression patterns of each pair of paralogs across 11 adult zebrafish tissues. f, Correlations between pairs of paralogs located on the same chromosome. Among them, 17 pairs were located within the same TAD, and the remaining 65 pairs were located in different TADs. As a control, we randomly sampled 100 genes. Number for each bar, from left to right: 17, 65, 100.

Supplementary Material

Supplementary Information
Supplementary Data 1
Supplementary Tables 1–19

Acknowledgements

This work was supported by NIH grants R35GM124820, R01HG009906, R24DK106766 (R.C.H. and F.Y.), and R01DK107735 (G.S.G.). F.Y. is also supported by U01CA200060. T.W. is supported by NIH grants R01HG007175, R01HG007354, R01ES024992, U24ES026699, and U01HG009391. We thank John A. Stamatoyannopoulos for discussion and suggestions. We thank Huijue Lyu for proofreading and other Yue lab members for discussion. We thank Elizabeth DeForest, Salvatore Stella, Peggy Hubley, and the Penn State Zebrafish Functional Genomics Core for fish husbandry and embryo collection.

Footnotes

Declaration of interests

The authors declare no competing financial interests.

Data availability and visualization

All sequencing data are deposited in NCBI GEO under accession GSE134055. All genomic data generated in this study can be visualized in the WashU EpiGenome Browser (https://epigenome.wustl.edu/zebrafishENCODE/). The human histone modification ChIP-seq data were downloaded from the ROADMAP Project. The mouse histone modification ChIP-seq data were downloaded from the mouse ENCODE Consortium. The human tissue transcriptome data were downloaded from the GTEx Consortium. All public zebrafish ChIP-seq and ATAC-seq datasets used in this study are listed in Supplementary Table 6. The human H1-ESC Hi-C data were downloaded from GSE52457. GM12878 and K562 GRO-seq data were downloaded from GSE60456. GM12878 and K562 CTCF ChIP-seq data were downloaded from GSE31477. GM12878 and K562 Pol2 ChIP-seq data were downloaded from GSE91426 and GSE31477.

References

  • 1. Howe K. et al. The zebrafish reference genome sequence and its relationship to the human genome. Nature 496, 498–503, doi: 10.1038/nature12111 (2013).
  • 2. Gerhard GS et al. Life spans and senescent phenotypes in two strains of Zebrafish (Danio rerio). Experimental Gerontology 37, 1055–1068 (2002).
  • 3. Lamason RL et al. SLC24A5, a putative cation exchanger, affects pigmentation in zebrafish and humans. Science 310, 1782–1786, doi: 10.1126/science.1116238 (2005).
  • 4. Vastenhouw NL et al. Chromatin signature of embryonic pluripotency is established during genome activation. Nature 464, 922–926, doi: 10.1038/nature08866 (2010).
  • 5. Bogdanovic O. et al. Dynamics of enhancer chromatin signatures mark the transition from pluripotency to cell specification during embryogenesis. Genome Res 22, 2043–2053, doi: 10.1101/gr.134833.111 (2012).
  • 6. Kaaij LJ et al. Enhancers reside in a unique epigenetic environment during early zebrafish development. Genome Biol 17, 146, doi: 10.1186/s13059-016-1013-1 (2016).
  • 7. Aday AW, Zhu LJ, Lakshmanan A, Wang J. & Lawson ND Identification of cis regulatory features in the embryonic zebrafish genome through large-scale profiling of H3K4me1 and H3K4me3 binding sites. Dev Biol 357, 450–462, doi: 10.1016/j.ydbio.2011.03.007 (2011).
  • 8. Vesterlund L, Jiao H, Unneberg P, Hovatta O. & Kere J. The zebrafish transcriptome during early development. BMC Developmental Biology 11, 30, doi: 10.1186/1471-213X-11-30 (2011).
  • 9. Buenrostro JD, Giresi PG, Zaba LC, Chang HY & Greenleaf WJ Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 10, 1213–1218, doi: 10.1038/nmeth.2688 (2013).
  • 10. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74, doi: 10.1038/nature11247 (2012).
  • 11. Yue F. et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364, doi: 10.1038/nature13992 (2014).
  • 12. Anderson JL et al. Multiple sex-associated regions and a putative sex chromosome in zebrafish revealed by RAD mapping and population genomics. PLoS One 7, e40701, doi: 10.1371/journal.pone.0040701 (2012).
  • 13. Klemm SL, Shipony Z. & Greenleaf WJ Chromatin accessibility and the regulatory epigenome. Nat Rev Genet 20, 207–220, doi: 10.1038/s41576-018-0089-8 (2019).
  • 14. Meuleman W. et al. Index and biological spectrum of human DNase I hypersensitive sites. Nature 584, 244–251, doi: 10.1038/s41586-020-2559-3 (2020).
  • 15. Quillien A. et al. Robust Identification of Developmentally Active Endothelial Enhancers in Zebrafish Using FANS-Assisted ATAC-Seq. Cell Rep 20, 709–720, doi: 10.1016/j.celrep.2017.06.070 (2017).
  • 16. Letelier J. et al. Evolutionary emergence of the rac3b/rfng/sgca regulatory cluster refined mechanisms for hindbrain boundaries formation. Proc Natl Acad Sci U S A 115, E3731–E3740, doi: 10.1073/pnas.1719885115 (2018).
  • 17. Liu G, Wang W, Hu S, Wang X. & Zhang Y. Inherited DNA methylation primes the establishment of accessible chromatin during genome activation. Genome Res 28, 998–1007, doi: 10.1101/gr.228833.117 (2018).
  • 18. Marletaz F. et al. Amphioxus functional genomics and the origins of vertebrate gene regulation. Nature 564, 64–70, doi: 10.1038/s41586-018-0734-6 (2018).
  • 19. Meier M. et al. Cohesin facilitates zygotic genome activation in zebrafish. Development 145, doi: 10.1242/dev.156521 (2018).
  • 20. Torbey P. et al. Cooperation, cis-interactions, versatility and evolutionary plasticity of multiple cis-acting elements underlie krox20 hindbrain regulation. PLoS Genet 14, e1007581, doi: 10.1371/journal.pgen.1007581 (2018).
  • 21. Paik EJ et al. A Cdx4-Sall4 regulatory module controls the transition from mesoderm formation to embryonic hematopoiesis. Stem Cell Reports 1, 425–436, doi: 10.1016/j.stemcr.2013.10.001 (2013).
  • 22. Kang J. et al. Modulation of tissue repair by regeneration enhancer elements. Nature 532, 201–206, doi: 10.1038/nature17644 (2016).
  • 23. Kaufman CK et al. A zebrafish melanoma model reveals emergence of neural crest identity during melanoma initiation. Science 351, aad2197, doi: 10.1126/science.aad2197 (2016).
  • 24. Goldman JA et al. Resolving Heart Regeneration by Replacement Histone Profiling. Dev Cell 40, 392–404 e395, doi: 10.1016/j.devcel.2017.01.013 (2017).
  • 25. Perez-Rico YA et al. Comparative analyses of super-enhancers reveal conserved elements in vertebrate genomes. Genome Res 27, 259–268, doi: 10.1101/gr.203679.115 (2017).
  • 26. Lister R. et al. Global epigenomic reconfiguration during mammalian brain development. Science 341, 1237905, doi: 10.1126/science.1237905 (2013).
  • 27. Visel A. et al. Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat Genet 40, 158–160, doi: 10.1038/ng.2007.55 (2008).
  • 28. Dimitrieva S. & Bucher P. UCNEbase--a database of ultraconserved non-coding elements and genomic regulatory blocks. Nucleic Acids Res 41, D101–109, doi: 10.1093/nar/gks1092 (2013).
  • 29. Visel A, Minovitsky S, Dubchak I. & Pennacchio LA VISTA Enhancer Browser--a database of tissue-specific human enhancers. Nucleic Acids Res 35, D88–92, doi: 10.1093/nar/gkl822 (2007).
  • 30. Corces MR et al. The chromatin accessibility landscape of primary human cancers. Science 362, doi: 10.1126/science.aav1898 (2018).
  • 31. Neph S. et al. Circuitry and dynamics of human transcription factor regulatory networks. Cell 150, 1274–1286, doi: 10.1016/j.cell.2012.04.040 (2012).
  • 32. Yang T. et al. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res 27, 1939–1949, doi: 10.1101/gr.220640.117 (2017).
  • 33. Krefting J, Andrade-Navarro MA & Ibn-Salem J. Evolutionary stability of topologically associating domains is associated with conserved gene regulation. BMC Biol 16, 87, doi: 10.1186/s12915-018-0556-x (2018).
  • 34. Lazar NH et al. Epigenetic maintenance of topological domains in the highly rearranged gibbon genome. Genome Res 28, 983–997, doi: 10.1101/gr.233874.117 (2018).
  • 35. Fishman V. et al. 3D organization of chicken genome demonstrates evolutionary conservation of topologically associated domains and highlights unique architecture of erythrocytes’ chromatin. Nucleic Acids Res 47, 648–665, doi: 10.1093/nar/gky1103 (2019).
  • 36. Dixon JR et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380, doi: 10.1038/nature11082 (2012).
  • 37. Smagulova F. et al. Genome-wide analysis reveals novel molecular features of mouse recombination hotspots. Nature 472, 375–378, doi: 10.1038/nature09869 (2011).
  • 38. Canela A. et al. Genome Organization Drives Chromosome Fragility. Cell 170, 507–521 e518, doi: 10.1016/j.cell.2017.06.034 (2017).
  • 39. Gothe HJ et al. Spatial Chromosome Folding and Active Transcription Drive DNA Fragility and Formation of Oncogenic MLL Translocations. Mol Cell 75, 267–283 e212, doi: 10.1016/j.molcel.2019.05.015 (2019).
  • 40. Canela A. et al. Topoisomerase II-Induced Chromosome Breakage and Translocation Is Determined by Chromosome Architecture and Transcriptional Activity. Mol Cell 75, 252–266 e258, doi: 10.1016/j.molcel.2019.04.030 (2019).
  • 41. Postlethwait JH et al. Vertebrate genome evolution and the zebrafish gene map. Nat Genet 18, 345–349, doi: 10.1038/ng0498-345 (1998).
  • 42. Pedroso GL et al. Blood collection for biochemical analysis in adult zebrafish. J Vis Exp, e3865, doi: 10.3791/3865 (2012).
  • 43. Rao SS et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680, doi: 10.1016/j.cell.2014.11.021 (2014).
  • 44. Xie W. et al. Epigenomic analysis of multilineage differentiation of human embryonic stem cells. Cell 153, 1134–1148, doi: 10.1016/j.cell.2013.04.022 (2013).
  • 45. Kim D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14, R36, doi: 10.1186/gb-2013-14-4-r36 (2013).
  • 46. Trapnell C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28, 511–515, doi: 10.1038/nbt.1621 (2010).
  • 47. Wang L. et al. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res 41, doi: 10.1093/nar/gkt006 (2013).
  • 48. Dobin A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21, doi: 10.1093/bioinformatics/bts635 (2013).
  • 49. Li B. & Dewey CN RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, doi: 10.1186/1471-2105-12-323 (2011).
  • 50. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12 (2011).
  • 51. Langmead B. & Salzberg SL Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357 (2012).
  • 52. Li Z. et al. Identification of transcription factor binding sites using ATAC-seq. Genome Biol 20, 45 (2019).
  • 53. Korhonen J, Martinmaki P, Pizzi C, Rastas P. & Ukkonen E. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics 25, 3181–3182, doi: 10.1093/bioinformatics/btp554 (2009).
  • 54. Kulakovskiy IV et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res 46, D252–D259, doi: 10.1093/nar/gkx1106 (2018).
  • 55. Gerstein MB et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science 330, 1775–1787, doi: 10.1126/science.1196914 (2010).
  • 56. Li H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079, doi: 10.1093/bioinformatics/btp352 (2009).
  • 57. Liu T. Use model-based Analysis of ChIP-Seq (MACS) to analyze short reads generated by sequencing protein-DNA interactions in embryonic stem cells. Methods Mol Biol 1150, 81–95, doi: 10.1007/978-1-4939-0512-6_4 (2014).
  • 58. Hiller M. et al. Computational methods to detect conserved non-genic elements in phylogenetically isolated genomes: application to zebrafish. Nucleic Acids Res 41, e151 (2013).
  • 59. Hiller M. et al. Computational methods to detect conserved non-genic elements in phylogenetically isolated genomes: application to zebrafish. Nucleic Acids Res 41, e151, doi: 10.1093/nar/gkt557 (2013).
  • 60. Lee HJ et al. Regenerating zebrafish fin epigenome is characterized by stable lineage-specific DNA methylation and dynamic chromatin accessibility. Genome Biol 21, 1–17 (2020).
  • 61. Krueger F. & Andrews SR Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 27, 1571–1572 (2011).
  • 62. Zhou X, Li D, Lowdon RF, Costello JF & Wang T. methylC Track: visual integration of single-base resolution DNA methylation data on the WashU EpiGenome Browser. Bioinformatics 30, 2206–2207 (2014).
  • 63. Burger L, Gaidatzis D, Schübeler D. & Stadler MB Identification of active regulatory regions from DNA methylation data. Nucleic Acids Res 41, e155 (2013).
  • 64. Wu H. et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 43, e141 (2015).
  • 65. Hansen KD, Langmead B. & Irizarry RA BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol 13, R83 (2012).
  • 66. Ramirez F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res 44, W160–165, doi: 10.1093/nar/gkw257 (2016).
  • 67. Koren S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27, 722–736, doi: 10.1101/gr.215087.116 (2017).
  • 68. Walker BJ et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9, e112963, doi: 10.1371/journal.pone.0112963 (2014).
  • 69. Dudchenko O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95, doi: 10.1126/science.aal3327 (2017).
  • 70. Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330, doi: 10.1038/nature14248 (2015).
  • 71. Heinz S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell 38, 576–589, doi: 10.1016/j.molcel.2010.05.004 (2010).
  • 72. Grant CE, Bailey TL & Noble WS FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018, doi: 10.1093/bioinformatics/btr064 (2011).
  • 73. Servant N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol 16, 259 (2015).
  • 74. Durand NC et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Systems 3, 95–98 (2016).
  • 75. Robinson JT et al. Juicebox.js provides a cloud-based visualization system for Hi-C data. Cell Systems 6, 256–258.e251 (2018).
  • 76. Abdennur N. & Mirny LA Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics 36, 311–316 (2020).
  • 77. Crane E. et al. Condensin-driven remodelling of X chromosome topology during dosage compensation. Nature 523, 240–244, doi: 10.1038/nature14450 (2015).
  • 78. Giorgetti L. et al. Structural organization of the inactive X chromosome in the mouse. Nature 535, 575–579, doi: 10.1038/nature18589 (2016).
  • 79. Imakaev M. et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat Methods 9, 999–1003 (2012).
  • 80. Darling AE, Mau B. & Perna NT progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5, e11147, doi: 10.1371/journal.pone.0011147 (2010).
  • 81. Li H. & Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
  • 82. Johansen N. & Quon G. scAlign: a tool for alignment, integration, and rare cell identification from scRNA-seq data. Genome Biol 20, 1–21 (2019).
  • 83. Schep AN, Wu B, Buenrostro JD & Greenleaf WJ chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat Methods 14, 975–978 (2017).
