Protocol to profile snATAC-seq datasets and motif enrichment analysis during zebrafish early embryogenesis

Jie Zhou; Xueqian Yang; Xiumei Lin; Kaichen Zhao; Xue Wang; Zhiqiang Dong; Chuanyu Liu; Chang Liu

doi:10.1016/j.xpro.2024.103501

2024 Dec 12;5(4):103501. doi: 10.1016/j.xpro.2024.103501

PMCID: PMC11697690 PMID: 39671284

Summary

The scarcity of zebrafish-specific motif databases complicates the analysis of transcription factor (TF) motifs in zebrafish single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) data and thus hinders the identification of regulatory elements throughout zebrafish embryonic development. Here, we provide a protocol to analyze single-nucleus chromatin accessibility datasets during zebrafish early embryogenesis. We describe steps for fragment file retrieval, sample integration, quality control, latent semantic indexing (LSI) clustering, and peak calling via ArchR. Crucially, we detail procedures for zebrafish-specific motif database construction, motif enrichment, and TF footprinting analysis.

For complete details on the use and execution of this protocol, please refer to Lin et al.¹

Subject areas: bioinformatics, single cell, developmental biology, model organisms

Highlights

• Instructions for ArchR package installation and dataset preparation
• Procedures to obtain a high-quality snATAC-seq dataset by quality control
• Steps for dimensionality reduction, unsupervised clustering, and cell annotation
• Code for zebrafish-specific motif database construction and motif enrichment

Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.

Before you begin

Timing: 2–8 h

Overview

A genome-wide landscape of open chromatin regions at single-cell resolution provides critical insights into the combined regulatory state of a cell, which arises from the interplay of several regulatory mechanisms, including higher-order chromosome structure, DNA methylation, histone modifications, and the binding of transcription factors.²,³ Investigating genome-wide DNA accessibility during zebrafish early embryogenesis enables the identification of cell-type-specific cis-regulatory elements (CREs) and their cell-to-cell variation, thus delineating the diversity and dynamic regulation of gene expression that underpins cell lineage specification and tissue morphogenesis in embryonic development.⁴ However, transcription factor (TF) motif enrichment analysis, which is integral to pinpointing regulatory elements in diverse cell types during zebrafish embryogenesis, is hindered by the lack of motif databases specialized for the zebrafish genome, which obstructs the precise delineation of regulators across cell types.

ArchR is a robust and scalable software package designed for complex snATAC-seq analysis, which provides an intuitive user interface and performs well in terms of running speed and memory consumption. Here, we describe a modified protocol that uses the R package ArchR⁵ to analyze single-nucleus assay for transposase-accessible chromatin using sequencing (snATAC-seq) datasets of zebrafish at various developmental stages. To address the lack of zebrafish-specific motif databases, this protocol encompasses steps for quality control, latent semantic indexing (LSI) clustering, and peak calling, alongside the construction of a zebrafish-specific motif database, TF motif enrichment analysis, and TF footprinting analysis.

Install tools and packages

Timing: 0.5–4 h

1.
Before running this tutorial, please ensure that you have installed R (https://www.r-project.org/) on your machine.

Note: This tutorial primarily uses R version 4.1.3 on Linux for data processing and analysis.

2.
First install devtools and BiocManager if they are not already installed, and then install the ArchR package and its dependencies.

>if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools") # Install devtools if not already installed

>if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") # Install BiocManager if not already installed

>devtools::install_github("GreenleafLab/ArchR",ref="master", repos = BiocManager::repositories()) # Install ArchR

>library(ArchR)

>ArchR::installExtraPackages() # Install ArchR dependencies

Note: This tutorial uses ArchR version 1.0.2, which is not currently optimized to run on Windows. It should work, but parallelization in ArchR has not been enabled for Windows. We therefore highly recommend installing ArchR on a Linux operating system to run this pipeline effectively in R. For more detailed installation instructions, installation troubleshooting tips, and full documentation of ArchR, visit https://www.archrproject.com/.

3.
Install and load all the relevant R packages mentioned in the key resources table.

># Install packages listed in the key resources table

>BiocManager::install(c("TFBSTools","org.Dr.eg.db","TxDb.Drerio.UCSC.danRer11.refGene","BSgenome.Drerio.UCSC.danRer11"))

>install.packages("readr")

>install.packages("dplyr")

>install.packages("pheatmap")

>devtools::install_version("ggplot2",version ="3.4.0")

># Load packages

>library(BSgenome.Drerio.UCSC.danRer11)

>library(org.Dr.eg.db)

>library(TxDb.Drerio.UCSC.danRer11.refGene)

>library(pheatmap)

>library(dplyr)

>library(TFBSTools)

>library(readr)
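
Before moving on, it can help to confirm that every package from the key resources table is actually installed. The following is a minimal sketch, not part of the original protocol: the helper name check_packages() is hypothetical, and it relies only on base R functions (requireNamespace() and utils::packageVersion()).

```r
# Hypothetical helper (not part of the protocol): report which packages
# are installed and, if so, which version is available.
check_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  version <- vapply(pkgs, function(p) {
    if (installed[[p]]) as.character(utils::packageVersion(p)) else NA_character_
  }, character(1))
  data.frame(package = pkgs, installed = installed, version = version,
             row.names = NULL, stringsAsFactors = FALSE)
}

# Packages named in the key resources table
required <- c("ArchR", "TFBSTools", "org.Dr.eg.db",
              "TxDb.Drerio.UCSC.danRer11.refGene",
              "BSgenome.Drerio.UCSC.danRer11",
              "readr", "dplyr", "ggplot2", "pheatmap")
status <- check_packages(required)
print(status)
```

Any row with installed = FALSE points to a package that still needs to be installed before continuing.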

Download or prepare datasets

Timing: 1–6 h

4.
Download the datasets (all fragments.tsv.gz files and “atac_all_metaData.txt” file) from China National GeneBank DataBase (CNGBdb)⁶^,⁷ (CNGBdb: https://ftp.cngb.org/pub/CNSA/data4/CNP0002827/Single_Cell/CSE0000120/) or NCBI (NCBI: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA987386).

Note: This tutorial primarily uses snATAC-seq data of zebrafish embryos from 3 hpf (hours post fertilization) to 24 hpf, with fragment files as input data. Each fragments.tsv.gz file is a tabular file that contains one line per unique fragment, with tab-separated fields for chromosome (chr), start position, end position, cell barcode, and duplicate count.
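
To illustrate the five-column fragment layout described in the note, here is a minimal base-R sketch. The records, barcode names, and column names are made up for illustration; for a real file, a gzfile() connection to the downloaded fragments.tsv.gz could be passed to read.delim() instead of the text argument.

```r
# Toy fragment records in the five-column layout (made-up values)
fragment_text <- paste(
  "chr1\t100\t250\tBARCODE_A\t1",
  "chr1\t300\t480\tBARCODE_A\t2",
  "chr2\t50\t210\tBARCODE_B\t1",
  sep = "\n"
)

# Column names are illustrative; fragment files carry no header line
fragments <- read.delim(
  text = fragment_text, header = FALSE,
  col.names = c("chr", "start", "end", "barcode", "dup_count"),
  stringsAsFactors = FALSE
)

# Number of unique fragments per cell barcode
frags_per_cell <- table(fragments$barcode)

# Fragment length in base pairs
fragments$length <- fragments$end - fragments$start
print(frags_per_cell)
```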

Key resources table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data
snATAC-seq data across several embryonic stages	Lin et al.¹	https://db.cngb.org/search/project/CNP0002827/
Software and algorithms
R (v.4.1.3)	R CRAN	https://cran.r-project.org/
ArchR 1.0.2	Granja JM, Corces MR et al., 2021⁵	https://github.com/GreenleafLab/ArchR
BSgenome.Drerio.UCSC.danRer11 1.4.2	Team TBD, 2019	https://bioconductor.org/packages/BSgenome.Drerio.UCSC.danRer11/
org.Dr.eg.db 3.14.0	Marc Carlson, 2021	https://bioconductor.org/packages/org.Dr.eg.db/
TxDb.Drerio.UCSC.danRer11.refGene 3.4.6	Team BC, Maintainer BP, 2019	https://bioconductor.org/packages/TxDb.Drerio.UCSC.danRer11.refGene/
readr 2.1.5	Wickham H et al., 2024	https://readr.tidyverse.org/
TFBSTools 1.32.0	Tan G, Lenhard B, 2016⁸	https://github.com/ge11232002/TFBSTools
dplyr 1.1.4	Wickham H. et al., 2023	https://dplyr.tidyverse.org/
ggplot2 3.4.0	H. Wickham, 2016	https://ggplot2.tidyverse.org/
pheatmap 1.0.12	Raivo Kolde, 2019	https://github.com/raivokolde/pheatmap

Step-by-step method details

Part 1: Quality control

Timing: 1–6 h

The objective of quality control (QC) is to identify and eliminate low-quality cells from the raw single-cell dataset. This step is crucial to remove potential sources of noise, technical biases, measurement errors, and other factors that could distort downstream analysis outcomes. By performing QC, we obtain a high-quality cell dataset that is more reliable and suitable for further analysis.

1.
Read the fragment files and create arrow files using ArchR.
- a.
  Set the work directory and list the path of the required fragment files of zebrafish embryos across different developmental time points.
  Note: In step 1d, these file paths are provided as a character vector to the createArrowFiles() function, which reads accessible fragments from the input files.
  
  >save_folder='/data/work/zf_snATAC'
  
  >setwd(save_folder) # Set the current working directory
  
  >data_path="/data/work/zf_snATAC/03_data/fragment/" # The directory where fragment files are saved
  
  >files = list.files(data_path,pattern ="tsv.gz$") # List the fragment files with the suffix "tsv.gz" in "data_path"
  
  >files = paste0(data_path,files) # Get the full path of each fragment file
  
  >names(files)=c("10hpf_1","10hpf_2","10hpf_3","12hpf_1","12hpf_2","12hpf_3","18hpf_1","18hpf_2","18hpf_3","24hpf_1","24hpf_2","24hpf_3","24hpf_4","3hpf_1","3hpf_2","5hpf_1","5hpf_2","5hpf_3","6hpf_1","6hpf_2") # Name each fragment file
  Note: The values of “names(files)” must follow the same order as the paths in “files”, which list.files() returns in alphabetical order. Each name identifies the developmental time point and replicate of a zebrafish embryo sample.
- b.
  Adjust the number of threads by utilizing the addArchRThreads() function based on the specifications of your local environment.
- c.
  Create gene and genome annotations for zebrafish, which ArchR does not support natively, using the createGeneAnnotation() and createGenomeAnnotation() functions.
  >addArchRThreads(threads=20) # Set the default number of threads for parallelized operations
  
  ># Create gene and genome annotations specific to zebrafish
  
  >genomeAnnotation=createGenomeAnnotation(genome=BSgenome.Drerio.UCSC.danRer11 )
  
  >geneAnnotation=createGeneAnnotation(TxDb=TxDb.Drerio.UCSC.danRer11.refGene, OrgDb = org.Dr.eg.db)
  
  >genomeAnnotation$chromSizes=resize(genomeAnnotation$chromSizes, width=c(width(genomeAnnotation$chromSizes)+5))
  CRITICAL: Parallel processing in ArchR is not supported on Windows, so the number of threads will automatically be set to 1.
- d.
  Create arrow files for each fragment file using the createArrowFiles() function.
  Note: The createArrowFiles() function reads accessible fragments, performs preliminary quality control, and creates a genome-wide TileMatrix and GeneScoreMatrix using the gene and genome annotations created in step 1c. During preliminary quality control, nuclei with a TSS enrichment score below minTSS or with fewer unique fragments than minFrags are filtered out. These parameters set the minimum transcription start site (TSS) enrichment score and the minimum number of fragments required for a nucleus to be retained.
  
  ># Create arrow files for each fragment file
  
  >ArrowFiles <- createArrowFiles(
  
  >inputFiles = files,
  
  >sampleNames = names(files),
  
  >minTSS = 2,
  
  >minFrags = 1000,
  
  >addTileMat = T,
  
  >addGeneScoreMat = T,
  
  >geneAnnotation = geneAnnotation,
  
  >genomeAnnotation = genomeAnnotation
  
  >)
  Note: Don’t set minTSS too high because we will adjust the threshold for each sample separately in later steps.
2.
Create ArchRProject and remove low-quality cells further.
- a.
  Calculate doublet score for each cell using the function addDoubletScores().
  ># Calculate doublet score
  
  >doubScores <- addDoubletScores(
  
  >input = ArrowFiles,
  
  >k = 10,
  
  >knnMethod = "UMAP",
  
  >LSIMethod = 1
  
  >)
- b.
  Create ArchRProject and save it. The ArchRProject is the basis of nearly all ArchR functions and analytical workflows which organizes numerous Arrow files into a project.
  ># Create ArchRProject
  
  >proj <- ArchRProject(
  
  >ArrowFiles = ArrowFiles,
  
  >outputDirectory = "/data/work/zf_snATAC/proj",
  
  >copyArrows = TRUE,
  
  >geneAnnotation = geneAnnotation,
  
  >genomeAnnotation = genomeAnnotation
  
  >)
  
  >proj$stage = gsub("_1|_2|_3|_4","",proj$Sample)
  
  >saveArchRProject(proj)
  Note: Set outputDirectory to your own directory.
- c.
  For samples of each stage, filter out cells with a low TSS enrichment score or a low log10(nFrags) value.
  Note: These two thresholds are sample-specific and determined by the actual characteristics of each stage.
- d.
  Filter cells predicted as doublets using filterDoublets().
  Note: Doublets can compromise the accuracy and interpretation of the data and impede the analysis of cellular functions.
- e.
  Get the vector of cell IDs that passed QC.
  ># Filter cells with low TSS Enrichment score and log10(nFrags) value which are sample-specific
  
  >tp = paste0(c(3,5,6,10,12,18,24),"hpf")
  
  >nFrags = c(3.8,3.8,3.5,3.5,3.6,3.6,3.4)
  
  >TSS = c(4,4,4,5,5,4,5)
  
  >id = c()
  
  >all_cellid = rownames(proj@cellColData)
  
  >for (num in 1:7){
  
  >sub_proj = proj[which(proj$stage == tp[num]),]
  
  >sub_proj = sub_proj[which(sub_proj$TSSEnrichment > TSS[num] & log10(sub_proj$nFrags) > nFrags[num]),]
  
  >sub_proj = filterDoublets(sub_proj,filterRatio = 5)
  
  >tmp_cellid = rownames(sub_proj@cellColData)
  
  >tmp = match(tmp_cellid,all_cellid)
  
  >id = c(id,tmp)
  
  >print(paste0(tp[num]," finish!"))
  
  >}
- f.
  Subset the project to keep only high-quality nuclei and save the ArchRProject in the “all_filter” directory.
- g.
  For samples of each stage, create violin plots in the “Plots” directory showing TSS enrichment score (Figure 1A) and log10(unique nuclear fragments) (Figure 1B) in each sample respectively, followed by TSS enrichment profiles (Figure 1C).
- h.
  Create a scatter plot in the “Plots” directory using the ggPoint() function, showing log10(unique nuclear fragments) vs. TSS enrichment score (Figure 1D).
  >proj = proj[id,] # Retain only nuclei that have passed quality control
  
  >proj=saveArchRProject(proj,dropCells=T,load=T,outputDirectory = "all_filter") # Save ArchRProject in the “all_filter” directory
  
  ># Create violin plots showing TSS enrichment score
  
  >p1 <- plotGroups(
  
  >ArchRProj = proj,
  
  >groupBy = "stage",
  
  >colorBy = "cellColData",
  
  >name = "TSSEnrichment",
  
  >plotAs = "violin",
  
  >alpha = 0.4,
  
  >addBoxPlot = TRUE)
  
  ># Create violin plots showing log10(nFrags)
  
  >p2 <- plotGroups(
  
  >ArchRProj = proj,
  
  >groupBy = "stage",
  
  >colorBy = "cellColData",
  
  >name = "log10(nFrags)",
  
  >plotAs = "violin",
  
  >alpha = 0.4,
  
  >addBoxPlot = TRUE)
  
  ># Plot TSS enrichment profiles
  
  >p3 <- plotTSSEnrichment(ArchRProj = proj, groupBy = "stage")
  
  ># Create plot shows the log10(unique nuclear fragments) vs TSS enrichment score
  
  >df <- getCellColData(proj, select = c("log10(nFrags)", "TSSEnrichment"))
  
  >p4 <- ggPoint(
  
  >x = df[,1],
  
  >y = df[,2],
  
  >colorDensity = TRUE,
  
  >continuousSet = "blueYellow",
  
  >xlabel = "Log10 Unique Fragments",
  
  >ylabel = "TSS Enrichment"
  
  >) + geom_hline(yintercept = 4, lty = "dashed") + geom_vline(xintercept = 3, lty = "dashed")
  
  >plotPDF(p1,p2,p3,p4, name = "AfterQC.pdf", ArchRProj = proj, addDOC = FALSE) # Save editable vectorized versions of these plots
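
The effect of the filterRatio argument used with filterDoublets() in step 2 can be worked out by hand. The sketch below assumes the formula stated in the ArchR documentation for filterDoublets(), namely that the maximum number of cells removed from a sample is filterRatio * nCells^2 / 100000. The helper name max_doublets_removed() is hypothetical and serves only to illustrate the arithmetic.

```r
# Upper bound on the number of predicted doublets removed from one sample,
# assuming the formula in the ArchR filterDoublets() documentation:
# filterRatio * nCells^2 / 100000
max_doublets_removed <- function(n_cells, filter_ratio = 1) {
  filter_ratio * n_cells^2 / 100000
}

# A sample with 5000 pass-filter nuclei
ratio_1 <- max_doublets_removed(5000, filter_ratio = 1)  # 250 nuclei (5%)
ratio_5 <- max_doublets_removed(5000, filter_ratio = 5)  # 1250 nuclei (25%)
print(c(ratio_1 = ratio_1, ratio_5 = ratio_5))
```

Because the bound grows with the square of the number of nuclei, a large filterRatio removes a substantial share of densely loaded samples, which is worth keeping in mind when adapting the value to other datasets.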

Figure 1. Quality control of snATAC-seq data and features

(A) Violin plots show the TSS enrichment scores of samples across 7 developmental stages.

(B) Violin plots show the unique fragment numbers of samples across 7 developmental stages.

(C) Plots show the enrichment of snATAC-seq fragments around TSS.

(D) Scatter plots show bivariate distributions of TSS enrichment vs. log10 (unique fragments) of all samples.
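
The stage-specific filtering in step 2 boils down to looking up two thresholds per stage and keeping nuclei that exceed both. The following base-R sketch reproduces that logic on a toy table, without ArchR. The threshold values are those used in the protocol; the four example nuclei and the column names of the toy table are made up.

```r
# Stage-specific thresholds taken from the protocol code
thresholds <- data.frame(
  stage = paste0(c(3, 5, 6, 10, 12, 18, 24), "hpf"),
  min_log10_nFrags = c(3.8, 3.8, 3.5, 3.5, 3.6, 3.6, 3.4),
  min_TSS = c(4, 4, 4, 5, 5, 4, 5),
  stringsAsFactors = FALSE
)

# Toy per-nucleus QC metrics (made-up values)
cells <- data.frame(
  cell = c("c1", "c2", "c3", "c4"),
  stage = c("3hpf", "3hpf", "24hpf", "24hpf"),
  TSSEnrichment = c(6.0, 3.5, 5.5, 7.0),
  nFrags = c(10000, 20000, 2000, 5000),
  stringsAsFactors = FALSE
)

# Look up the thresholds of each nucleus' stage, then apply both filters
idx <- match(cells$stage, thresholds$stage)
keep <- cells$TSSEnrichment > thresholds$min_TSS[idx] &
  log10(cells$nFrags) > thresholds$min_log10_nFrags[idx]
passed <- cells$cell[keep]
print(passed)
```

Here c2 fails the TSS enrichment threshold for 3 hpf and c3 fails the fragment-count threshold for 24 hpf, so only c1 and c4 are retained.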

Part 2: LSI clustering and peak calling of snATAC-seq data

Timing: 2–8 h

Given the sparsity of snATAC-seq data, we use Latent Semantic Indexing (LSI) to reduce the dimensionality and then perform unsupervised clustering and cell type annotation.
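
As a rough illustration of the idea behind LSI, the base-R sketch below applies one common TF-IDF variant followed by a singular value decomposition to a toy binary accessibility matrix. This is an assumption-laden simplification and not ArchR's implementation: addIterativeLSI() is more involved, and the matrix size, scaling factor, and number of components used here are arbitrary choices for demonstration.

```r
set.seed(1)

# Toy binary accessibility matrix: 50 genomic tiles (rows) x 20 nuclei (columns)
acc <- matrix(rbinom(50 * 20, size = 1, prob = 0.2), nrow = 50, ncol = 20)

# Drop empty nuclei and tiles so that no division by zero occurs
acc <- acc[, colSums(acc) > 0, drop = FALSE]
acc <- acc[rowSums(acc) > 0, , drop = FALSE]

# Term frequency: normalize each nucleus by its total accessibility
tf <- sweep(acc, 2, colSums(acc), "/")

# Inverse document frequency: down-weight tiles open in many nuclei
idf <- log(1 + ncol(acc) / rowSums(acc))

# Log-scaled TF-IDF (idf is recycled down columns, i.e. applied per tile)
tfidf <- log(tf * idf * 1e4 + 1)

# Reduce to k dimensions with SVD; rows of lsi are nuclei
k <- min(5, dim(tfidf))
sv <- svd(tfidf, nu = 0, nv = k)
lsi <- sweep(sv$v, 2, sv$d[seq_len(k)], "*")
print(dim(lsi))
```

The resulting low-dimensional coordinates play the role that the “IterativeLSI” reduced dimensions play in the steps that follow: clustering and UMAP operate on them rather than on the sparse tile matrix.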

3.
Dimensionality reduction and unsupervised clustering.
- a.
  Use addIterativeLSI() function to perform LSI and create a reducedDims object called “IterativeLSI”.
- b.
  Perform unsupervised clustering using addClusters() function with the parameter resolution = 1.0.
- c.
  Create an embedding to visualize the clusters identified in LSI sub-space using addUMAP() function and visualize the results in a umap plotting via plotEmbedding() function.

>#Perform LSI to reduce the dimensionality

>proj <- addIterativeLSI(

>ArchRProj = proj,

>useMatrix = "TileMatrix",

>name = "IterativeLSI",

>iterations = 2,

>clusterParams = list(

>resolution = c(0.2),

>sampleCells = 10000,

>n.start = 10),

>varFeatures = 25000,

>dimsToUse = 1:30,force = T)

># Perform unsupervised clustering

>proj = addClusters(

>input = proj,

>reducedDims = "IterativeLSI",

>method = "Seurat",

>name = "Clusters",

>resolution = 1,force = T)

>#Create an embedding to visualize the clusters

>proj=addUMAP(ArchRProj=proj,

>reducedDims='IterativeLSI',

>name='UMAP',

>nNeighbors=30,

>minDist=0.5,

>metric='cosine',

>force = T)

>p1=plotEmbedding(ArchRProj=proj,colorBy="cellColData",name = "Clusters",embedding = "UMAP")

>plotPDF(p1,name='Plot_UMAP.pdf',ArchRProj=proj,addDOC=FALSE ,width=5,height=5)

>saveArchRProject(proj)

Note: LSI results may not be exactly reproducible between runs, even with the same parameters.

4.
Cell type annotation.
- a.
  Identify marker genes for manual cell type annotation of each cluster obtained in the previous step using the getMarkerFeatures() function with the gene score matrix.
- b.
  Create a heatmap in the “Plots” directory to visualize the top 10 marker genes based on Log2FC (Figure 2A) using the markerHeatmap() function.
  ># Identify marker features
  
  >markersGS <- getMarkerFeatures(
  
  >ArchRProj = proj, useMatrix = "GeneScoreMatrix",
  
  >groupBy = "Clusters",
  
  >bias = c("TSSEnrichment", "log10(nFrags)"),
  
  >testMethod = "wilcoxon"
  
  >)
  
  ># Get a data frame containing the relevant marker features of each cluster
  
  >markerList <- getMarkers(markersGS, cutOff = "FDR <= 0.05& Log2FC >= 1")
  
  >markerList = as.data.frame(markerList)
  
  ># Get top10 marker genes of each cluster based on Log2FC
  
  >top10 = markerList %>%
  
  >group_by(group_name) %>%
  
  >slice_max(n = 10, order_by = Log2FC)
  
  >write.csv(top10,file=paste0('all_filter/',"Plots/","zf_snATAC_clusters_top10markers.csv"))
  
  ># Create a heatmap to visualize marker features
  
  >heatmapGS <- markerHeatmap(
  
  >seMarker = markersGS,
  
  >cutOff = "FDR <= 0.05 & Log2FC >= 1",
  
  ># labelMarkers = markerGenes,
  
  >transpose = TRUE,
  
  >limits = c(-5, 5),
  
  >nLabel = 2,
  
  ># labelRows=TRUE,
  
  >nPrint = 5,
  
  >binaryClusterRows = TRUE,
  
  >clusterCols = FALSE
  
  ># returnMatrix = T
  
  >)
  
  >ComplexHeatmap::draw(heatmapGS, heatmap_legend_side = "bot", annotation_legend_side = "bot")
  
  >plotPDF(heatmapGS,name=paste0("zf_snATAC","_GeneScores-Marker-Heatmap"), width=25, height=16, ArchRProj=proj, addDOC=FALSE)
- c.
  Annotate cells using the information stored in “atac_all_metaData.txt” and visualize the cell types on the UMAP embedding with the plotEmbedding() function (Figure 2C).
  Note: Prior knowledge of cell-type-specific marker genes is often used to identify the cell types in each cluster. This tutorial, however, uses the same annotation as Lin et al.¹ The cell type information is stored in “atac_all_metaData.txt”, which can be downloaded from CNGBdb (CNGBdb: https://ftp.cngb.org/pub/CNSA/data4/CNP0002827/Single_Cell/CSE0000120/).
  
  ># Get the identity of each cluster
  
  >metadata =read.table("atac_all_metaData.txt",sep = ",",header = TRUE,row.names=1)
  
  >proj = proj[which(proj$cellNames %in% rownames(metadata)),]
  
  >proj$celltype=metadata[proj$cellNames,]$celltype
  
  ># Visualize the cell types on UMAP embedding
  
  >p0 = plotEmbedding(proj, colorBy = "cellColData", name = "celltype")
  
  >plotPDF(p0,name = 'Celltype_UMAP.pdf',ArchRProj = proj, addDOC = FALSE,width = 5,height = 5)
5.
Cell-type-specific peak calling.
- a.
  Use the addGroupCoverages() function to construct pseudo-bulk replicates by grouping together similar nuclei in the snATAC-seq data.
  Note: Modify the number of threads via the parameter threads depending on the configuration of your local environment when using addGroupCoverages() function.
- b.
  Identify the path to MACS2 with the function findMacs2().
- c.
  Call peaks enriched in accessible regions of genome via MACS2⁹ using function addReproduciblePeakSet().
- d.
  Add the new peak matrix to the ArchRProject by the function addPeakMatrix().
  ># Generate pseudo-bulk replicates
  
  >proj <- addGroupCoverages(
  
  >ArchRProj = proj,
  
  >groupBy = "celltype",
  
  >force = TRUE,
  
  >threads=100
  
  >)
  
  ># Searching For MACS2 in the PATH environment variable
  
  >pathToMacs2 <- findMacs2()
  
  ># Call peaks enriched in accessible regions of genome
  
  >proj <- addReproduciblePeakSet(
  
  >ArchRProj = proj,
  
  >groupBy = "celltype",
  
  >pathToMacs2 = pathToMacs2,force=TRUE,
  
  >cutOff = 0.05,
  
  >genomeSize = 1345118429
  
  >)
  
  ># Add the new peak matrix to the ArchRProject
  
  >proj <- addPeakMatrix(proj)
  Note: To perform peak calling with MACS2 using ArchR, you need to ensure that ArchR can locate the MACS2 executable. Each ArchRProject object is restricted to using only one peak set. If you want to explore different peak sets, you will need to save a duplicate of the ArchRProject and copy the Arrow files to create a new project. This way, you can analyze different peak sets independently.
- e.
  Use the getPeakSet() function to retrieve peak set and then save peaks information in a data frame.
- f.
  Identify marker features from the peak matrix using the getMarkerFeatures() function with the parameter useMatrix = "PeakMatrix".
  Note: getMarkerFeatures() returns a SummarizedExperiment object, from which the getMarkers() function can extract the subsets of interest.
- g.
  Create a bar plot to show the distribution of cell-type-specific peaks across the genome (Figure 2B).
  ># Retrieve the peak set
  
  >pks = getPeakSet(proj)
  
  >pks_info = data.frame(peaktype = pks$peakType)
  
  >pks_info$stage = pks@ranges@NAMES
  
  >pks_info$peakname=paste(seqnames(pks),start(pks),end(pks),sep = "_")
  
  >pks_info$near_gene = pks$nearestGene
  
  >rownames(pks_info) = pks_info$peakname
  
  ># Identify marker features
  
  >markersPeaks <- getMarkerFeatures(
  
  >ArchRProj = proj,
  
  >useMatrix = "PeakMatrix",
  
  >groupBy="celltype",
  
  >bias = c("TSSEnrichment", "log10(nFrags)"),
  
  >testMethod = "wilcoxon"
  
  >)
  
  ># Retrieve particular slices from markersPeaks
  
  >markerList <- getMarkers(
  
  >markersPeaks,
  
  >cutOff = "FDR <= 0.01 & Log2FC >= 1"
  
  >)
  
  >#Create a bar plot to show the distribution of cell-type-specific peaks across the genome
  
  >markers = as.data.frame(markerList)
  
  >markers$peak_name=paste(markers$seqnames,markers$start,markers$end,sep="_")
  
  >markers$peak_anno=pks_info[markers$peak_name,]$peaktype
  
  >p = ggplot(data = markers,aes(x = group_name,fill = peak_anno)) + geom_bar(position = 'fill') + theme_bw()
  
  >pdf("all_filter/Plots/markerPeak_Perstage.pdf")
  
  >print(p)
  
  >dev.off()

Figure 2. Clustering and cell-type-specific peak calling of the developing zebrafish embryos

(A) Heatmap of marker genes using gene score matrix in the developing zebrafish embryos.

(B) Bar plots show the distribution of cell-type-specific peaks across the genome.

(C) Clustering visualization and annotations of snATAC-seq data.
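
The bar plot in step 5g uses geom_bar(position = 'fill'), which displays, for each cell type, the fraction of marker peaks falling into each peak annotation class. The same fractions can be computed as numbers with base R. The sketch below uses a made-up marker table with the same two columns (group_name and peak_anno) as the protocol's “markers” data frame.

```r
# Toy marker-peak table (made-up rows) with the two columns used for plotting
markers <- data.frame(
  group_name = c("EVL", "EVL", "EVL", "forebrain", "forebrain"),
  peak_anno = c("Promoter", "Distal", "Distal", "Intronic", "Promoter"),
  stringsAsFactors = FALSE
)

# Count marker peaks per cell type and annotation class
counts <- table(markers$group_name, markers$peak_anno)

# Convert counts to row-wise fractions, matching position = 'fill'
fractions <- prop.table(counts, margin = 1)
print(round(fractions, 2))
```

Saving such a table alongside the PDF makes it easier to report exact proportions rather than reading them off the plot.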

Part 3: Motif database construction, motif enrichment, and TF footprinting analysis

Timing: 6–24 h

It is necessary to construct a zebrafish-specific motif database for the accurate scanning of DNA sequences and identification of positions that match the specified motifs in the database, enabling further motif enrichment and TF footprinting analysis. To construct the database, we need to convert the PWM (Position Weight Matrix) matrices downloaded from the CIS-BP (Catalog of Inferred Sequence Binding Preferences) database¹⁰ into PWMatrixList objects utilizing TFBSTools.
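
To make the conversion from position probabilities to position weights concrete, here is a minimal base-R sketch on a toy two-position motif. It follows the arithmetic used later in this part (pseudo-count of 0.008, uniform background of 0.25, natural logarithm); the probability values themselves are made up, and no TFBSTools objects are created.

```r
# Toy position probability matrix: rows are motif positions (made-up values);
# the second position contains zeros to show why a pseudo-count is needed
ppm <- rbind(
  c(A = 0.70, C = 0.10, G = 0.10, T = 0.10),
  c(A = 0.00, C = 0.00, G = 1.00, T = 0.00)
)

pseudocount <- 0.008  # value used in the protocol code
background <- 0.25    # equal background frequency for A, C, G, and T

# Add the pseudo-count and re-normalize so each position sums to 1
smoothed <- ppm + pseudocount
smoothed <- smoothed / rowSums(smoothed)

# Log-odds weight of each base relative to the background
pwm <- log(smoothed / background)
print(round(pwm, 3))
```

Without the pseudo-count, the zero entries in the second position would become -Inf after the log transformation. With it, favored bases receive positive weights and disfavored bases receive finite negative weights.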

6.
Download TF_Information_all_motifs_plus.txt and pwms_all_motifs for zebrafish from the CIS-BP database and read TF_Information_all_motifs_plus.txt.

># For constructing PWMatrixList objects

>library(TFBSTools)

># For reading txt file separated by "\t" in the next step

>library(readr)

># Extract transcription factors information

>TF_Information=read_delim("MotifSet/TF_Information_all_motifs_plus.txt",delim = "\t",escape_double = FALSE, trim_ws = TRUE)

>TF_Information=TF_Information[TF_Information$Motif_ID != ".",]

>TF_Information=TF_Information[TF_Information$Motif_ID != "NA",]

>TF_Information=TF_Information[TF_Information$TF_Name != "NA",]

Note: To download TF_Information_all_motifs_plus.txt and pwms_all_motifs, first open the CIS-BP database website: http://cisbp2.ccbr.utoronto.ca/index.php. On the homepage, navigate to the “bulk downloads” section. In the “Selection” dropdown menu, choose “Danio rerio” as the species of interest. Click on the “Download Species Archive!” button to initiate the download. TF_Information_all_motifs_plus.txt and pwms_all_motifs are contained within the downloaded zip archive. When reading TF_Information_all_motifs_plus.txt, provide its actual path based on the folder where the archive was extracted on your computer.

7.
Convert the PPM (position probability matrix) to PWM (position weight matrix) and then convert them into PWMatrixList objects.

># List all text files giving the frequency of each base at each position in the motif in the "MotifSet/pwms_all_motifs/" directory.Each file is named by its Motif_ID.

>pwms_matrix=list.files("MotifSet/pwms_all_motifs/")

>motif_set=names(table(TF_Information$Motif_ID))

>pwmlist = list()

>mt = c()

>#Get PPM and convert the PPM to PWM

>for (motif in motif_set){

>pm = pwms_matrix[grep(motif,pwms_matrix)]

>if (length(pm) == 0){print(motif);next}

>pwms_m=read.table(paste0("MotifSet/pwms_all_motifs/",pm),header = T)

>if (nrow(pwms_m) == 0){print(motif);next}

>rownames(pwms_m) = pwms_m$Pos

>pwms_m1 = as.matrix(pwms_m[,-1])

>pwms_m1 = pwms_m1+0.008 # Add a pseudo-count to eliminate zero values before the log transformation

>pwms_m1 <- pwms_m1 / rowSums(pwms_m1) # Each position will sum to 1

>pwms_m1 = pwms_m1/0.25 # Normalization using the background letter frequencies as we assume that the content of the four nucleotides A, T, C, and G is equal in the genome.

>pwms_m1 = log(pwms_m1)

>motif = unlist(strsplit(motif,split = "_"))[1]

>mt = c(mt,motif)

>pwm <- PWMatrix(ID= motif,

>name= motif,

>matrixClass="Unknown", strand="*",

>bg=c(A=0.25,C=0.25, G=0.25, T=0.25),

>tags=list(ensembl = motif),

>profileMatrix= matrix(pwms_m1,byrow=TRUE,

>nrow = 4,

>dimnames=list(c("A", "C", "G", "T"))))

>pwmlist = c(pwmlist,list(pwm)) # Collect the PWMatrix objects

>} # End of the loop over motifs

># Construct PWMatrixList objects

>PWMListM = do.call(PWMatrixList,pwmlist)

>names(PWMListM) = mt

>save(PWMListM,file = "MotifSet/MotifSet_noTF.rda")

>tf_name = unique(TF_Information$TF_Name)

8.
Add motif annotations to ArchRProject using defined PWMatrixList objects.

># Add motif annotations

>proj=addMotifAnnotations(

>ArchRProj=proj,

>motifPWMs=PWMListM,

>name = "Motif",force = T

>)

Note: It takes time to run this step, typically ranging from 4 to 12 h. In the absence of any error messages, it is recommended to patiently wait for the computation to finish.

9.
Perform motif enrichment using cell-type-specific differential peaks and create the corresponding plots (Figure 3A).

>row_seq = c("blastomere",'epiblast','hypoblast','neural crest','neural keel','central nervous system','forebrain','immature eye','neural stem cell','primary neuron','anterior/posterior axis','segmental plate','musculature system','lateral plate mesoderm','mesenchyme cell','erythroid lineage cell','EVL','periderm/epidermis',"integument","YSL/presumptive endoderm","YSL","digestive system","UND")

>markersPeaks1 = markersPeaks[,row_seq] # markersPeaks was obtained in step 5f

># Use significantly differential peaks for motif enrichment analyses

>enrichMotifs <- peakAnnoEnrichment(

>seMarker = markersPeaks1,

>ArchRProj = proj,

>peakAnnotation = "Motif",

>cutOff = "FDR <= 0.05 & Log2FC >= 1")

># Plot these motif enrichments across all cell groups

>heatmapEM_dat <- plotEnrichHeatmap(enrichMotifs, n = 10, transpose = TRUE,clusterCols = F, returnMatrix = T)

>p = pheatmap(heatmapEM_dat,cluster_rows = F,cluster_cols = F)

>print(p)

10.
Call the getPositions() function to get the genomic positions of motifs and subset the GRangesList object to obtain the positions of the motifs of interest.

># Extract the enriched motifs shown in the heatmap

>enrichMotif = colnames(heatmapEM_dat)

>enrichMotif = sapply(enrichMotif,function(x){strsplit(x,split="[ (]")[[1]][1]})

>motif_info = data.frame(motif = enrichMotif)

>motif_info$celltype=rownames(heatmapEM_dat)[unlist(apply(heatmapEM_dat,2,which.max))]

># Get the positions of all the enriched motifs

>motifPositions <- getPositions(proj)

>motif_name2 = enrichMotif

>markerMotifs <- unlist(lapply(motif_name2, function(x) grep(x, names(motifPositions), value = TRUE)))

>flank = 250

>chromLengths = getChromLengths(proj) #Get the length of each chromosome

>positions = motifPositions[markerMotifs] # Subset the positions to the enriched motifs of interest

>positions0 = positions

># Filter a set of enriched motifs position that the start position should be greater than flank + 50 and the end position plus flank + 50 should be less than the chromosome length.

>positions = lapply(seq_along(positions), function(x){

>idx1 = start(positions[[x]]) > flank + 50

>idx2 = end(positions[[x]]) + flank + 50 < chromLengths[paste0(seqnames(positions[[x]]))]

>if(sum(idx1 & idx2)==0){NULL}

>else{positions[[x]][idx1 & idx2]}

>})

>names(positions) = names(positions0)

>positions = as(positions, "GRangesList")# Converts the object positions to a "GRangesList" object.

11.
Compute footprints for the motifs of interest and subtract the Tn5 bias when calling plotFootprints() to generate motif footprint plots (Figure 3B).

>motif_choose=c("M08706","M03782","M08061","M09472","M09126","M03071")

># Choose the specific motifs of interest for computing footprints

>positions1 = positions[motif_choose]

># It is suggested to conduct footprinting on a subset of motifs rather than all motifs via the positions parameter of getFootprints() function.

>seFoot <- getFootprints(

>ArchRProj = proj,

>positions = positions1,

>groupBy = "celltype",flank=250,nTop = 20)

>motif_name = names(assays(seFoot))

>tf_info = TF_Information

>tf_motif = data.frame(TF_Name = tf_info$TF_Name,Motif_ID = tf_info$Motif_ID)

>tf_motif$Motif_ID=sapply(tf_motif$Motif_ID,function(x){strsplit(x,split="_2")[[1]][1]})

>tf_motif_name = c()

>for (motif in motif_name){

>tmp1 = tf_motif[match(motif,tf_motif$Motif_ID),]

>tmp2 = tmp1$TF_Name

>tmp3 = paste0(motif,'_',tmp2)

>tf_motif_name = c(tf_motif_name,tmp3)}

>seFoot1 = seFoot

>names(assays(seFoot1)) = tf_motif_name #Rename assays(seFoot1) for footprints plotting to display both Motif_ID and TF_Name simultaneously in the figure caption.

># Plot footprints and save the PDF file in the directory “Plots” of outputDirectory of the ArchRProject

>plotFootprints(

>seFoot = seFoot1,

>ArchRProj = proj,

>normMethod = "Subtract",

>plotName = "Marker_Footprints-Subtract-Bias",

>addDOC = FALSE,

>smoothWindow = 5,flank=250,height = 12,width = 6

>)

Note: It is advisable to conduct footprinting on a selected group of motifs instead of all motifs.
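
The boundary filter applied to motif positions in step 10 keeps only sites that lie more than flank + 50 bp away from both chromosome ends, so that the footprint window never runs off a chromosome. The base-R sketch below reproduces that condition on a plain data frame instead of a GRangesList. The chromosome lengths and motif sites are made up for illustration.

```r
flank <- 250
margin <- flank + 50  # same margin as in the protocol code

# Made-up chromosome lengths and motif sites
chrom_lengths <- c(chr1 = 10000, chr2 = 2000)
sites <- data.frame(
  chr = c("chr1", "chr1", "chr2", "chr2"),
  start = c(100, 5000, 1000, 1800),
  end = c(110, 5010, 1010, 1810),
  stringsAsFactors = FALSE
)

# Keep sites far enough from the chromosome start and end
far_from_start <- sites$start > margin
far_from_end <- sites$end + margin < chrom_lengths[sites$chr]
kept <- sites[far_from_start & far_from_end, ]
print(kept)
```

The first site is dropped because it starts within 300 bp of the chromosome start, and the last because its window would extend past the end of chr2.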

Figure 3. TF footprinting analysis identifies representative cell-type-specific TF activities

(A) Heatmap of motif enrichment in marker peaks of each cell type.

(B) TF footprint profiles of tfap2e and tal1, which are specific to neural crest and erythroid lineage cells, respectively.

Expected outcomes

Our protocol offers a comprehensive analysis pipeline for studying the dynamic chromatin landscapes of single-nucleus samples from early zebrafish embryogenesis, incorporating several crucial steps such as quality control, LSI clustering, peak calling, and the construction of a zebrafish-specific motif database, along with motif enrichment analysis and TF footprinting analysis.

To ensure the reliability of our analysis, we implemented rigorous quality control measures. We excluded cells with low TSS enrichment scores or a small number of unique nuclear fragments. The TSS enrichment profiles exhibited a distinct peak around the TSS, validating the high quality of our dataset (Figure 1). Using our pipeline, we successfully identified 23 clusters representing various cell types including periderm/epidermis, neural stem cells, enveloping layer (EVL), neural keel, and immature eye, among others. Moreover, we identified cell-type-specific peaks and determined their distribution across the genome (Figure 2). To gain insights into the underlying transcriptional regulation, we constructed a zebrafish-specific motif database. By executing TF footprinting analysis, we observed active TF binding in the corresponding cell types (Figure 3).

In summary, our study provides a comprehensive analysis framework that enables a detailed characterization of the chromatin landscape dynamics during zebrafish early embryogenesis. These findings will contribute to a deeper understanding of the regulatory mechanisms underlying embryonic development in zebrafish.

Limitations

This protocol has two main limitations. First, the zebrafish-specific motif database constructed here may not be directly applicable to other species, so researchers working with other organisms may need to build their own motif databases, which can be time-consuming and resource-intensive. Second, the protocol is computationally demanding for large datasets: scanning DNA sequences for positions that match the specified motifs requires substantial time and memory. Researchers should plan computing resources accordingly when applying this protocol to large-scale datasets.

Troubleshooting

Problem 1

The message “Detected windows OS, setting threads to 1. Setting default number of Parallel threads to 1.” appeared while running the code “addArchRThreads(threads = 20)”.

Potential solution

This message appears because ArchR is not currently optimized to run on Windows. ArchR should still work, but parallelization has not been enabled for Windows, as stated on the official website (https://www.archrproject.com/). We therefore highly recommend installing ArchR on a Linux operating system to run this pipeline effectively in R.

Problem 2

An error prompt “Error in h5disableFileLocking(): could not find function "h5disableFileLocking"” occurred while executing the createArrowFiles() function.

Potential solution

Please check whether the R package “rhdf5” is successfully installed. We recommend using version 2.38.1 for this protocol.

>BiocManager::install("rhdf5",force = TRUE)

Problem 3

An error prompt “ERROR Found in ggplot for 18 hpf 3 (1 of 20)” occurred while executing the addDoubletScores() function in Step 2a, or plots cannot be created with the plotPDF() function (related to Step 2d).

Potential solution

Please check the installed version of the “ggplot2” package on your machine. We recommend using version 3.4.0 for this protocol.

>devtools::install_version("ggplot2",version ="3.4.0")

Problem 4

Significant batch effects are observed when integrating multiple samples (related to Step 3).

Potential solution

We recommend using Harmony for batch effect correction before clustering, as follows.

>proj <- addHarmony(

>ArchRProj = proj,

>reducedDims = "IterativeLSI",

>name = "Harmony",

>groupBy = "Sample")

>proj2 <- addClusters(

>input = proj,

>reducedDims = "Harmony",

>method = "Seurat",

>name = "Clusters_Harmony",

>resolution = 1)
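After Harmony-based clustering, one quick way to judge whether batch effects remain is to tabulate how each cluster is composed by sample. The sketch below uses a hypothetical toy table; with ArchR, the equivalent columns ("Sample" and "Clusters_Harmony") would be read from getCellColData(proj2). The 0.9 cutoff is an arbitrary assumption for illustration.

```r
# Hypothetical per-cell metadata standing in for getCellColData(proj2)
cell_meta <- data.frame(
  Sample = c("s1", "s1", "s1", "s2", "s2", "s2", "s1", "s2", "s1", "s1"),
  Clusters_Harmony = c("C1", "C1", "C2", "C1", "C2", "C2", "C3", "C3", "C4", "C4"),
  stringsAsFactors = FALSE
)

# Cells per cluster (rows) and sample (columns)
counts <- table(cell_meta$Clusters_Harmony, cell_meta$Sample)

# Convert each row to fractions so clusters of different size are comparable
frac <- prop.table(counts, margin = 1)

# Flag clusters in which a single sample contributes more than 90% of cells
dominated <- rownames(frac)[apply(frac, 1, max) > 0.9]
print(round(frac, 2))
print(dominated)
```

A cluster dominated by one sample is not necessarily a batch artifact. When samples come from different developmental stages, it may reflect a cell type that is genuinely stage-specific, so flagged clusters should be inspected rather than discarded.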

Problem 5

The MACS2 executable cannot be found (related to Step 5).

Potential solution

•
Make sure that MACS2 has been installed on your machine using pip or pip3.
•
Provide the MACS2 path to ArchR using the pathToMacs2 parameter when executing the addReproduciblePeakSet() function.
•
If an error prompt "Error: File 'your path/PeakCalls/InsertionBeds/posterior.axis._.10hpf_2–1_summits.bed' does not exist or is non-readable.” occurs while executing the addReproduciblePeakSet() function in Step 5, try reinstalling MACS2. We recommend using version 2.2.9.1 for this protocol.

> pip install --upgrade --force-reinstall MACS2
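Before calling addReproduciblePeakSet(), it can help to confirm from within R that a MACS2 executable is visible on the PATH. The following is a minimal sketch using base R only; the variable name macs2_path is an assumption for illustration. ArchR's own findMacs2() helper serves a similar purpose.

```r
# Sys.which() returns the full path of an executable, or "" if it is not on the PATH
macs2_path <- unname(Sys.which("macs2"))

if (nzchar(macs2_path)) {
  message("MACS2 found at: ", macs2_path)
} else {
  message("MACS2 not found on PATH; install it or set macs2_path manually")
}

# The path would then be handed to ArchR (not run here), for example:
# proj <- addReproduciblePeakSet(ArchRProj = proj, pathToMacs2 = macs2_path)
```

If R was started from an environment other than the one in which MACS2 was installed (for example, a different conda environment), the executable may exist but still not be found, in which case the full path has to be supplied by hand.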

Problem 6

How to generate cell type annotations using snATAC-seq datasets (related to Step 6).

Potential solution

We recommend comparing marker genes identified from the gene score matrix, or marker peaks identified from the peak matrix, with canonical cell type markers, which can be looked up in ZFIN (http://zfin.org/) and in the published literature.

Problem 7

Make sure that PWMatrixList objects have been constructed successfully (related to Step 9).

Potential solution

The process of scanning DNA sequences and identifying positions that match specified motifs can be time-consuming when working with large PWMatrixList objects. To confirm the successful construction of PWMatrixList objects, we can subset the list and test the addMotifAnnotations function. This step will help verify the functionality and accuracy of the objects before proceeding with further analysis.
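The subset test described above could look like the sketch below. It assumes that an ArchRProject with a peak set (here called proj) and the PWMatrixList built in Step 9 (here called pwm_list) already exist; both names, the helper check_pwm_subset(), and the annotation name "MotifTest" are hypothetical. The helper is only defined here, not run, and any other arguments should mirror those used in Step 9.

```r
# Sketch only: annotate motif positions with a small subset of the PWM list
# to confirm that the PWMatrixList is well formed before the full run.
check_pwm_subset <- function(proj, pwm_list, n = 10) {
  stopifnot(length(pwm_list) > 0)

  # Take the first n motifs (or all of them if the list is shorter)
  small <- pwm_list[seq_len(min(n, length(pwm_list)))]

  # Use a separate annotation name so the test does not overwrite
  # the annotation produced by the full motif set
  ArchR::addMotifAnnotations(
    ArchRProj = proj,
    motifPWMs = small,
    annoName  = "MotifTest",
    force     = TRUE
  )
}

# Example call in an ArchR session (not run here):
# proj_test <- check_pwm_subset(proj, pwm_list, n = 10)
```

If the call succeeds on the subset, the same list can be passed in full; wrapping the call in system.time() would additionally give a rough idea of how the run time scales with the number of motifs.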

Problem 8

A warning “Warning message in mclapply(..., mc.cores = threads, mc.preschedule = preschedule): '38 parallel function calls did not deliver results'” occurred.

Potential solution

Decrease the number of threads based on the specifications of your local environment.
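One way to choose a lower thread count is to derive it from the number of cores that R detects. The helper below is a hedged sketch, not part of ArchR; the name choose_threads() and the rule of leaving one core free are assumptions.

```r
# Pick a conservative thread count: never more than requested,
# and leave one core free for the operating system.
choose_threads <- function(requested = 20) {
  # ArchR runs single-threaded on Windows
  if (.Platform$OS.type == "windows") return(1L)

  # detectCores() can return NA on some systems; fall back to one thread
  cores <- parallel::detectCores()
  if (is.na(cores)) return(1L)

  as.integer(max(1, min(requested, cores - 1)))
}

n_threads <- choose_threads(20)
print(n_threads)

# In the ArchR session (not run here): addArchRThreads(threads = n_threads)
```

This warning is often caused by forked workers running out of memory rather than by a shortage of cores, so if it persists the thread count may need to be reduced further than the core count alone suggests.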

Problem 9

An error prompt “Error: [matrixStats (>= 1.2.0)] useNames = NA is defunct.” occurred when running the plotFootprints() function.

Potential solution

Please check the installed version of the “matrixStats” package on your machine. We recommend using version 1.1.0 for this protocol.

> devtools::install_version("matrixStats", version="1.1.0")
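Several of the problems above come down to package versions, so it may be convenient to compare the installed versions against the recommended ones in a single step. The sketch below uses base R only; the helper name check_versions() is hypothetical, and the version numbers are those recommended in this section.

```r
# Versions recommended in the troubleshooting section
recommended <- c(rhdf5 = "2.38.1", ggplot2 = "3.4.0", matrixStats = "1.1.0")

check_versions <- function(recommended) {
  installed <- vapply(names(recommended), function(pkg) {
    # system.file() returns "" for packages that are not installed
    if (nzchar(system.file(package = pkg))) {
      as.character(utils::packageVersion(pkg))
    } else {
      NA_character_
    }
  }, character(1))

  data.frame(
    package     = names(recommended),
    recommended = unname(recommended),
    installed   = unname(installed),
    matches     = !is.na(installed) & unname(installed == recommended),
    row.names   = NULL,
    stringsAsFactors = FALSE
  )
}

print(check_versions(recommended))
```

Rows with matches = FALSE point to packages that are either missing (installed is NA) or installed in a different version from the one recommended here.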

Note: For any other error related to the use of ArchR, please refer to the ArchR GitHub Issues page.

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Chang Liu, PhD (liuchang4@genomics.cn).

Technical contact

Questions about the technical specifics of performing the protocol should be directed to and will be fulfilled by the technical contact, Jie Zhou (zhoujie5@genomics.cn).

Materials availability

This study did not generate new unique reagents.

Data and code availability

The raw and processed snATAC-seq data have been deposited in the China National GeneBank DataBase (CNGBdb)⁶,⁷ under accession number CNP0002827 and in the NCBI Sequence Read Archive under BioProject accession PRJNA987386. The code is available online (https://figshare.com/articles/dataset/Code/22121171).

Acknowledgments

This research was supported by the Guangdong Provincial Key Laboratory of Genome Read and Write (no. 2017B030301011), Hangzhou Science and Technology Department (no. TD2023003 and 2024SZD0128), and Shenzhen Key Laboratory of Single-Cell Omics (no. ZDSYS20190902093613831).

Author contributions

J.Z. wrote the manuscript. J.Z. and X.Y. undertook the main task of data analyses with the assistance of X.L., X.W., K.Z., and Z.D. Chuanyu Liu and Chang Liu supervised the study and revised the manuscript. All the authors reviewed and approved the final manuscript.

Declaration of interests

The authors declare no competing interests.

Contributor Information

Jie Zhou, Email: zhoujie5@genomics.cn.

Chuanyu Liu, Email: liuchuanyu@genomics.cn.

Chang Liu, Email: liuchang4@genomics.cn.

References

1. Lin X., Yang X., Chen C., Ma W., Wang Y., Li X., Zhao K., Deng Q., Feng W., Ma Y., et al. Single-nucleus chromatin landscapes during zebrafish early embryogenesis. Sci. Data. 2023;10:464. doi: 10.1038/s41597-023-02373-y.
2. Buchwalter A., Kaneshiro J.M., Hetzer M.W. Coaching from the sidelines: the nuclear periphery in genome regulation. Nat. Rev. Genet. 2019;20:39–50. doi: 10.1038/s41576-018-0063-5.
3. Isbel L., Grand R.S., Schübeler D. Generating specificity in genome regulation through transcription factor sensitivity to chromatin. Nat. Rev. Genet. 2022;23:728–740. doi: 10.1038/s41576-022-00512-6.
4. Ma S., Zhang B., LaFave L.M., Earl A.S., Chiang Z., Hu Y., Ding J., Brack A., Kartha V.K., Tay T., et al. Chromatin Potential Identified by Shared Single-Cell Profiling of RNA and Chromatin. Cell. 2020;183:1103–1116.e20. doi: 10.1016/j.cell.2020.09.056.
5. Granja J.M., Corces M.R., Pierce S.E., Bagdatli S.T., Choudhry H., Chang H.Y., Greenleaf W.J. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 2021;53:403–411. doi: 10.1038/s41588-021-00790-6.
6. Guo X., Chen F., Gao F., Li L., Liu K., You L., Hua C., Yang F., Liu W., Peng C., et al. CNSA: a data repository for archiving omics data. Database. 2020;2020. doi: 10.1093/database/baaa055.
7. Chen F.Z., You L.J., Yang F., Wang L.N., Guo X.Q., Gao F., Hua C., Tan C., Fang L., Shan R.Q., et al. CNGBdb: China National GeneBank DataBase. Yi Chuan. 2020;42:799–809. doi: 10.16288/j.yczz.20-080.
8. Tan G., Lenhard B. TFBSTools: an R/bioconductor package for transcription factor binding site analysis. Bioinformatics. 2016;32:1555–1556. doi: 10.1093/bioinformatics/btw024.
9. Zhang Y., Liu T., Meyer C.A., Eeckhoute J., Johnson D.S., Bernstein B.E., Nusbaum C., Myers R.M., Brown M., Li W., Liu X.S. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9:R137. doi: 10.1186/gb-2008-9-9-r137.
10. Weirauch M.T., Yang A., Albu M., Cote A.G., Montenegro-Montero A., Drewe P., Najafabadi H.S., Lambert S.A., Mann I., Cook K., et al. Determination and inference of eukaryotic transcription factor sequence specificity. Cell. 2014;158:1431–1443. doi: 10.1016/j.cell.2014.08.009.

