Summary
While the single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) is a powerful single-cell resolution tool for studying chromatin accessibility, its analytical workflow presents significant challenges for researchers new to scATAC-seq. Here, we present a protocol for conducting scATAC-seq analysis using a publicly available dataset as an example. We describe steps for data pre-processing and downstream analysis. We then detail procedures for computational multi-omics integration.
Subject areas: Bioinformatics, Sequence analysis, Single Cell
Graphical abstract

Highlights
• Instructions for processing scATAC-seq data
• Steps to infer gene regulatory network by integrating scATAC-seq and scRNA-seq data
• Guidance on identifying and resolving issues in scATAC-seq data processing
Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.
Before you begin
Hardware preparation
The scATAC-seq analysis pipeline should run on a Linux operating system with a network connection. The required Random Access Memory (RAM) depends on the number of cells being analyzed. For sequence alignment using Cell Ranger ATAC (v.2.1.0), the minimum hardware requirement is an 8-core Intel or AMD processor, while 24 cores are recommended for optimal performance. At least 64 GB of RAM is required, although 160 GB is recommended to enhance efficiency. Additionally, a minimum of 1 TB of available disk space is required. The operating system must be a 64-bit version of either CentOS/RedHat 7.0 or Ubuntu 14.04. For fewer than 100,000 cells, downstream analysis of the fragment file with ArchR requires a minimum of 8 CPU cores, 32 GB of RAM, and 100 GB of available disk space, and completes in approximately 1 hour. For one million cells, the ArchR analysis takes about 8 hours with 8 CPU cores and 32 GB of RAM. However, for integrated analysis using Seurat, 64 GB of RAM is required for approximately 10,000 cells, while for 100,000 cells, computer clusters with more than 100 GB of RAM are necessary. Analysis of millions of cells using Seurat is not available.
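As a quick pre-flight check, the minimum requirements listed above can be compared against the current machine. The following is a minimal sketch, not part of the published pipeline: `check_min` is a hypothetical helper, the thresholds are the Cell Ranger ATAC minimums quoted above, and the script assumes a Linux host that provides `nproc`, `/proc/meminfo`, and `df`. Reported RAM is usually slightly below the nominal installed amount, so a nominal 64 GB machine may be flagged as LOW.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare an available amount against a required minimum.
check_min() {  # usage: check_min LABEL HAVE NEED
    if [ "$2" -ge "$3" ]; then
        echo "OK   $1: $2 (minimum $3)"
    else
        echo "LOW  $1: $2 (minimum $3)"
    fi
}

# Gather resources of the current machine (Linux only).
cores=$(nproc)
mem_gb=$(awk '/^MemTotal:/ {printf "%d", $2 / 1024 / 1024}' /proc/meminfo)
disk_gb=$(df -Pk . | awk 'NR == 2 {printf "%d", $4 / 1024 / 1024}')

# Minimums quoted in the hardware section for Cell Ranger ATAC.
check_min "CPU cores" "$cores" 8
check_min "RAM (GB)" "$mem_gb" 64
check_min "free disk in current directory (GB)" "$disk_gb" 1024
```

Run the script from the directory that will hold the project, since `df` reports free space for the file system of the current directory.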
Software preparation
Timing: 1.5 h
This section guides users through downloading and installing the software and tools needed to run our pipeline. Users who have difficulty installing the software or tools on their devices should seek assistance from their system administrator. Most Linux distributions support the command-line operations listed below. Where applicable, a code snippet ends with a command that adds the tool to the PATH, making it available in subsequent steps.
We have created a Docker image named ‘STARProtocol_scATAC’ in DCS Cloud (https://cloud.stomics.tech/#/dashboard), a multi-omics data intelligent analysis platform, to facilitate single-cell ATAC-seq analyses. Users must first register an account and create a new project by clicking the “Create a project” on the homepage. The public image ‘STARProtocol_scATAC’ can be copied to a personal image repository using the platform’s shared image function (refer to the documentation at https://cloud.stomics.tech/helpcenter/usermanual/image.html#how-to-use-shared-images). Raw files and resource files (e.g., FASTQ files, reference datasets, and repetitive element annotation files) should then be uploaded to the “Data” module (for detailed instructions, please visit: https://cloud.stomics.tech/helpcenter/usermanual/data.html#add-files). Upon configuration, users can initiate a new analysis task by clicking “New Online Analysis” in the “Analysis” section, selecting the ‘STARProtocol_scATAC’ image and allocating recommended computational resources (e.g., 8 CPU cores and 64 GB RAM) to execute the full analytical pipeline within the designated workspace. Step-by-step guidance for creating new analyses is available at: https://cloud.stomics.tech/helpcenter/usermanual/analysis.html#quick-start-for-online-analysis. For additional tutorials on navigating DCS Cloud, visit the official help center: https://www.stomics.tech/helpcenter/.
To configure your local environment instead, follow the steps below.
Note: Each line of executable code is marked by a greater-than sign (>).
1. Install conda/Miniconda and modify the channels in the conda configuration. Then create and activate a new environment named “scATACPipeline” for the scATAC-seq analysis pipeline.
>project_path=/user/projects #path to create projects directory
>resource_path=$project_path/ref
>software_path=$project_path/software
>mkdir -p $resource_path
>mkdir -p $software_path
>cd $software_path
>#wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh
>#bash Anaconda3-2023.09-0-Linux-x86_64.sh -b -p "$software_path" -u
# Miniconda is a lightweight alternative to Anaconda and is great for managing environments and packages.
>wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
>sh Miniconda3-latest-Linux-x86_64.sh -b -p "$software_path/miniconda3" # install into the miniconda3 subdirectory used by the following commands
>rm Miniconda3-latest-Linux-x86_64.sh
>echo "source ${software_path}/miniconda3/bin/activate" >> ~/.bashrc
>source ~/.bashrc
>conda config --add channels conda-forge
>conda config --add channels defaults
>conda config --add channels r
>conda config --add channels bioconda
>conda config --show channels
>conda create -n scATACPipeline python=3.10
>source activate scATACPipeline
# To verify whether Miniconda is installed
>which conda
2. Install cellranger-atac for data preprocessing of scATAC-seq data.
Note: The download link for “cellranger-atac-2.1.0.tar.gz” changes periodically. For updated commands and further details regarding cellranger-atac, please refer to the Cell Ranger Download page: https://support.10xgenomics.com/single-cell-atac/software/downloads/latest.
>cd $software_path/miniconda3/envs/scATACPipeline/bin
# Download "cellranger-atac-2.1.0.tar.gz"; refer to the Cell Ranger Download page (https://support.10xgenomics.com/single-cell-atac/software/downloads/latest) for the current download command
>tar -xzvf cellranger-atac-2.1.0.tar.gz
>export PATH=path_to_cellranger-atac-2.1.0/bin:$PATH
#To verify whether cellranger-atac is installed
>which cellranger-atac
3. Install Amulet to remove doublets from scATAC-seq data.
>pip3 install numpy pandas scipy statsmodels
>wget -O Amulet_v1.1.tar.gz "https://github.com/UcarLab/AMULET/archive/refs/tags/v1.1.tar.gz"
>tar -xzvf Amulet_v1.1.tar.gz
>cd AMULET-1.1
>chmod +x AMULET.sh
4. Install base R and the R packages needed for this protocol. Troubleshooting 1 and 2.
>source activate scATACPipeline
>conda install -c conda-forge r-base=4.2.0
>conda install -y conda-forge::r-rmpfr
>conda install conda-forge::r-devtools
>conda install conda-forge::r-matrix
>conda install conda-forge::r-mass
>Rscript -e "devtools::install_github('caleblareau/BuenColors')"
>Rscript -e "install.packages('BiocManager')"
>export C_INCLUDE_PATH=$C_INCLUDE_PATH:/usr/include/
>Rscript -e "BiocManager::install('SummarizedExperiment')"
>Rscript -e "BiocManager::install('chromVAR')"
>Rscript -e "BiocManager::install('ComplexHeatmap')"
>Rscript -e "BiocManager::install('motifmatchr')"
>Rscript -e "devtools::install_github('buenrostrolab/FigR')"
>Rscript -e "devtools::install_version(package = 'Signac', version = package_version('1.9.0'))"
>Rscript -e "devtools::install_github('quadbio/Pando')"
>conda install -y -c bioconda bioconductor-rhdf5
>Rscript -e "devtools::install_github('GreenleafLab/ArchR', ref = 'master')"
>Rscript -e "install.packages('pheatmap')"
>Rscript -e "install.packages('shinythemes')"
>conda install conda-forge::r-arrow
>Rscript -e "devtools::install_github('aertslab/RcisTarget')"
>Rscript -e "devtools::install_github('aertslab/GENIE3')"
>Rscript -e "devtools::install_github('aertslab/SCENIC', ref='v1.1.0')"
>Rscript -e "BiocManager::install('BSgenome.Hsapiens.UCSC.hg38')"
# To verify whether base R is installed
>which R
# To verify whether the R packages are installed, attempt to load them in an R session.
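The package check described above can also be scripted from the shell. The sketch below is illustrative: `pkg_status` is a hypothetical helper, and the package list is taken from the installation commands in this step. It relies on `requireNamespace()`, which loads a package namespace without attaching it, and reports a failure if the package (or `Rscript` itself) is unavailable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether one R package can be loaded.
pkg_status() {  # usage: pkg_status PACKAGE
    if Rscript -e "if (!requireNamespace('$1', quietly = TRUE)) quit(status = 1)" >/dev/null 2>&1; then
        echo "ok      $1"
    else
        echo "FAILED  $1"
    fi
}

# Packages installed in step 4 of the software preparation.
for pkg in ArchR Signac chromVAR motifmatchr ComplexHeatmap SummarizedExperiment \
           FigR Pando SCENIC RcisTarget GENIE3 BuenColors pheatmap rhdf5 \
           BSgenome.Hsapiens.UCSC.hg38; do
    pkg_status "$pkg"
done
```

Any package reported as FAILED should be reinstalled with the corresponding command above before continuing.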
5. Download and install samtools.
>wget https://sourceforge.net/projects/samtools/files/samtools/1.9/samtools-1.9.tar.bz2
>tar -xvf samtools-1.9.tar.bz2
>cd samtools-1.9
>./configure --prefix=path_to_samtools-1.9 #path to the directory installation
>make && make install
>export PATH=path_to_samtools-1.9:$PATH
# Alternatively, you can install using Conda.
>#conda install bioconda::samtools
# To verify whether samtools is installed
>which samtools
6. Download and install bedtools.
>wget https://github.com/arq5x/bedtools2/releases/download/v2.29.1/bedtools-2.29.1.tar.gz
>tar -zxvf bedtools-2.29.1.tar.gz
>cd bedtools2
>make
>export PATH=path_to_bedtools2/bin:$PATH # make places the bedtools binary in the bin/ subdirectory
# Alternatively, you can install using Conda.
>#conda install bioconda::bedtools
# To verify whether bedtools is installed
>which bedtools
7. Install macs2.
>conda install bioconda::macs2
# To verify whether macs2 is installed
>which macs2
8. Download custom script.
> git clone https://github.com/M-wen/scATAC-seq-Analysis.git
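After the installation steps above, a short loop can confirm that every command-line tool is visible on the PATH in the current shell. This is a minimal sketch under the assumption that the tools were exported as described; `report_tool` is a hypothetical helper and is not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print where a command resolves, or flag it as missing.
report_tool() {  # usage: report_tool COMMAND
    if command -v "$1" >/dev/null 2>&1; then
        echo "found    $1 -> $(command -v "$1")"
    else
        echo "MISSING  $1"
    fi
}

# Tools installed in the software preparation steps.
for tool in conda cellranger-atac samtools bedtools macs2 R Rscript; do
    report_tool "$tool"
done
```

A tool reported as MISSING usually means the corresponding `export PATH=...` line was not run in the current shell session or points to the wrong directory.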
Resource download
Timing: 1 h
9. Download the human reference dataset (GRCh38) required for Cell Ranger ATAC.
>cd ${resource_path}
>wget https://cf.10xgenomics.com/supp/cell-atac/refdata-cellranger-arc-GRCh38-2020-A-2.0.0.tar.gz
>tar -zxvf refdata-cellranger-arc-GRCh38-2020-A-2.0.0.tar.gz
10. Create a repetitive elements file required for Amulet, which is merged from three sets of regions: simpleRepeats and genomicSuperDups from UCSC, along with the exclusion list from ENCODE.
>mkdir $resource_path/repetitive_elements
>cd $resource_path/repetitive_elements
# download genomicSuperDups from UCSC
>wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/genomicSuperDups.txt.gz
# download simpleRepeats from UCSC
>wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/simpleRepeat.txt.gz
# download the exclusion list from ENCODE
>wget https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz
>gunzip simpleRepeat.txt.gz
>gunzip genomicSuperDups.txt.gz
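# Both UCSC tables start with a bin column, so chrom, chromStart, and chromEnd are columns 2-4
>awk -v OFS="\t" '{print $2,$3,$4}' simpleRepeat.txt > hg38_simpleRepeat.bed
>awk -v OFS="\t" '{print $2,$3,$4}' genomicSuperDups.txt > hg38_genomicSuperDups_1.bed
# otherChrom, otherStart, and otherEnd of each segmental duplication are columns 8-10
>awk -v OFS="\t" '{print $8,$9,$10}' genomicSuperDups.txt > hg38_genomicSuperDups_2.bed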
>gunzip ENCFF356LFX.bed.gz
>mv ENCFF356LFX.bed GRCh38_unified_blacklist.bed
>cat hg38_simpleRepeat.bed hg38_genomicSuperDups_1.bed hg38_genomicSuperDups_2.bed GRCh38_unified_blacklist.bed > blacklist_repeats_segdups_rmsk_hg38.bed
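Because the merged file is assembled from three sources with different column layouts, it is worth checking that every line is a valid three-column BED record. The sketch below is one possible sanity check, shown on a toy file: `validate_bed` is a hypothetical helper that counts lines whose first column does not start with `chr`, whose coordinates are not integers, or whose start is not smaller than the end.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count malformed records in a tab-separated BED file.
validate_bed() {  # usage: validate_bed FILE.bed
    awk -F '\t' '
        NF < 3 || $1 !~ /^chr/ || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 { bad++ }
        END { printf "%d lines, %d malformed\n", NR, bad + 0 }
    ' "$1"
}

# Toy example: one valid record, one with start >= end, one with a numeric first column.
demo=$(mktemp)
printf 'chr1\t100\t200\nchr2\t50\t40\n7\t10\t20\n' > "$demo"
validate_bed "$demo"   # prints: 3 lines, 2 malformed
rm -f "$demo"
```

On the real data, the same function would be called as `validate_bed blacklist_repeats_segdups_rmsk_hg38.bed`; a non-zero malformed count suggests that a wrong column was selected when building one of the intermediate BED files.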
11. Get chromosome sizes from FASTA files.
>cd ${resource_path}/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/fasta
>samtools faidx genome.fa
>cut -f1,2 genome.fa.fai |grep -v "^chrUn" |grep -v "random" > chrom_hg38.sizes
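To see what the `cut`/`grep` pipeline above keeps and discards, it can be run on a toy `.fai` index. This is only an illustration: the contig names and offsets in the toy file are made up for the example, and `filter_chrom_sizes` is a hypothetical wrapper around the same commands.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the same filter used to build chrom_hg38.sizes.
filter_chrom_sizes() {  # usage: filter_chrom_sizes GENOME.fa.fai
    cut -f1,2 "$1" | grep -v '^chrUn' | grep -v 'random'
}

# Toy .fai index: name, length, offset, line bases, line width.
fai=$(mktemp)
printf 'chr1\t248956422\t6\t60\t61\n' > "$fai"
printf 'chrUn_KI270302v1\t2274\t10\t60\t61\n' >> "$fai"
printf 'chr1_KI270706v1_random\t175055\t20\t60\t61\n' >> "$fai"
printf 'chrM\t16569\t30\t60\t61\n' >> "$fai"

# Only chr1 and chrM remain, each with its length in column 2.
filter_chrom_sizes "$fai"
rm -f "$fai"
```

The first `grep` pattern is anchored with `^` so that only names beginning with `chrUn` are dropped, whereas `random` is matched anywhere in the line.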
12. Download scATAC-seq datasets from 10x Genomics.
In this protocol, we utilized peripheral blood mononuclear cells (PBMCs) from three healthy individuals to demonstrate the analysis of scATAC-seq.
>dataset_path=${project_path}/data/ATAC
>mkdir -p ${dataset_path}
>cd ${dataset_path}
>wget https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_Controller/10k_pbmc_ATACv2_nextgem_Chromium_Controller_fastqs.tar
>wget https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_fastqs.tar
>wget https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-atac/2.1.0/10k_pbmc_ATACv1p1_nextgem_Chromium_X/10k_pbmc_ATACv1p1_nextgem_Chromium_X_fastqs.tar
>tar -xvf 10k_pbmc_ATACv2_nextgem_Chromium_Controller_fastqs.tar
>tar -xvf 10k_pbmc_ATACv2_nextgem_Chromium_X_fastqs.tar
>tar -xvf 10k_pbmc_ATACv1p1_nextgem_Chromium_X_fastqs.tar
Key resources table
Step-by-step method details
We have prepared a GitHub repository containing all the necessary scripts for the method presented below. The repository can be accessed at https://github.com/M-wen/scATAC-seq-Analysis. To follow the step-by-step instructions, users are encouraged to download the entire repository. In all subsequent steps, the paths to the required software tools should match the “exported” paths in the “Before you begin” section.
scATAC-seq data pre-processing analysis
Timing: 28 h
In this section, we describe essential steps for pre-processing analysis of scATAC-seq datasets, focusing on the formatting and mathematical characteristics of the data. These steps include alignment to the genome, quality control, cell-feature matrix generation, dimensionality reduction, embedding, clustering, and cell annotation. The quality of the data library is the primary consideration in this section.
Note: In this protocol, we use FASTQ data from 10x Genomics as an example for data demonstration. Therefore, we employ Cell Ranger ATAC for alignment analysis. For data from other platforms, we provide detailed descriptions in Step 1. From the fragment file to downstream analysis, we utilize ArchR. Other commonly used software includes Signac and SnapATAC2. Signac (https://stuartlab.org/signac/) uses the peak-cell matrix for analysis, while SnapATAC2 (https://kzhang.org/SnapATAC2/tutorials/index.html) is better suited for analyzing millions of cells.
1. Sequence alignment (critical step).
Align the FASTQ files to the human genome (GRCh38) and call valid cell barcodes using the cellranger-atac count command with default parameters. The command-line options include:
a. --id: a unique run ID and output folder name;
b. --reference: path to the reference genome;
c. --fastqs: path to the input FASTQ data;
d. --sample: prefix of the FASTQ filenames to select;
e. --localcores: maximum number of cores the pipeline may request at one time;
f. --localmem: maximum memory (in GB) the pipeline may request at one time.
Note: For additional options, you can use the command line: `${cellranger_atac} count -h` or visit the website: https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/using/count.
The output files will be located in the outs/ subdirectory within this step’s output directory. For more details, please visit the website: https://support.10xgenomics.com/single-cell-atac/software/pipelines/latest/output/overview. The files fragments.tsv.gz and singlecell.csv within the outs/ directory will be used for next steps.
Note: The alignment step must be followed to ensure successful implementation of the workflow. The output files serve as essential inputs for the following steps in the process. Currently, three companies—10x Genomics, Bio-Rad, and MGI—offer droplet-based commercial devices and reagents for performing scATAC-seq library construction. Most single-cell ATAC-seq libraries worldwide are derived from these three main platforms. Due to their unique methods for constructing sequence libraries, each platform has developed its own data preprocessing and analysis software: Cell Ranger ATAC for 10x Genomics, the Bio-Rad ATAC-seq Analysis Toolkit (https://www.bio-rad.com/webroot/web/pdf/lsr/literature/Bulletin_7191.pdf) for Bio-Rad, and dnbc4tools (https://github.com/MGI-tech-bioinformatics/DNBelab_C_Series_HT_scRNA-analysis-software) for MGI. We use the dataset from 10x Genomics as an example; therefore, Cell Ranger ATAC is utilized for analysis. For information on the other two software tools, please visit their respective websites.
Note: Multiple CPUs improve performance in this step. The wall time for cellranger-atac count as a function of available memory across different CPU architectures is detailed on the 10x Genomics system requirements page (https://support.10xgenomics.com/single-cell-atac/software/overview/system-requirements). While performance generally improves with allocation of additional cores and memory beyond the minimum threshold of 64 GB, diminishing returns emerge when exceeding either 160 GB of memory or 48 cores. Furthermore, increasing the number of computational threads proportionally raises memory requirements. Processing approximately 10,000 cells with cellranger-atac count typically requires ∼10 hours of wall time, 8 CPU cores, and 64 GB of memory. To reduce wall time, additional CPU cores may be employed for the alignment analysis, contingent upon available hardware resources. For rapid workflow familiarization, we recommend initially processing a single pair of FASTQ files to understand analytical outputs. Subsequently, precomputed fragment files and metadata can be downloaded from Zenodo (https://doi.org/10.5281/zenodo.14715304) for downstream analyses.
>mkdir -p ${project_path}/results/00_preprocess
>cd ${project_path}/results/00_preprocess
>ref=${resource_path}/refdata-cellranger-arc-GRCh38-2020-A-2.0.0
>cellranger_atac=path_to_cellranger-atac-2.1.0/bin/cellranger-atac #path to the cellranger-atac software
>for x in `ls ${dataset_path}`
>do
>fq=${dataset_path}/${x}
>name=${x}
>${cellranger_atac} count --id ${name} --reference ${ref} --fastqs ${fq} --sample ${name} --localcores 8 --localmem 100
>done
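Because `--sample` must match the prefix of the FASTQ filenames, it can help to list the prefixes actually present in a FASTQ directory before launching the loop above. The sketch below assumes the standard Illumina naming scheme `<sample>_S<n>_L<lane>_<read>_001.fastq.gz`; both helper functions are hypothetical and are not part of Cell Ranger ATAC.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip the Illumina suffix to recover the sample prefix.
fastq_sample_prefix() {  # usage: fastq_sample_prefix FASTQ_FILENAME
    basename "$1" | sed -E 's/_S[0-9]+_L[0-9]+_[RI][0-9]_001\.fastq\.gz$//'
}

# Hypothetical helper: list the distinct prefixes found in one FASTQ directory.
list_sample_prefixes() {  # usage: list_sample_prefixes FASTQ_DIR
    for f in "$1"/*.fastq.gz; do
        [ -e "$f" ] || continue          # skip when the directory has no FASTQ files
        fastq_sample_prefix "$f"
    done | sort -u
}

# Toy example with a made-up filename.
fastq_sample_prefix "10k_pbmc_S1_L001_R1_001.fastq.gz"   # prints: 10k_pbmc
```

If `list_sample_prefixes ${dataset_path}/${x}` prints a value that differs from the directory name `${x}`, pass the printed prefix to `--sample` instead.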
2. Remove doublets using Amulet (optional step).
Perform doublet removal using Amulet, which identifies all genomic loci with >2 uniquely aligned reads per cell and detects doublets exhibiting significantly more such loci than expected.
The command-line options include:
a. $FragPath: path to the fragment file (fragments.tsv.gz from step 1);
b. $Barcode: path to a barcode-to-cell_id map in CSV format (singlecell.csv from step 1);
c. $Human_autosomes: path to the list of chromosomes to use for the analysis (one column with chromosome names);
d. $Repeat: path to known repetitive elements (blacklist_repeats_segdups_rmsk_hg38.bed, created in step 10 of Resource download);
e. $Out: path to an existing output directory where output files will be written;
f. $Script: path to the directory containing the Amulet scripts.
Note: Step 2 produces six files: MultipletBarcodes_01.txt, MultipletCellIds_01.txt, MultipletProbabilities.txt, MultipletSummary.txt, Overlaps.txt, and OverlapSummary.txt. The specific descriptions of the contents of these files can be accessed on the website (https://github.com/UcarLab/AMULET). The file MultipletBarcodes_01.txt includes the list of doublets, which will be used to remove doublets in the next step.
Note: When cells are loaded for droplet-based high-throughput single-cell sequencing, two or more cells may be encapsulated within the same droplet; these are referred to as “doublets”. Doublets fall into two types based on the origin of the cells: “homotypic doublets”, which are formed by cells of the same type, and “heterotypic doublets”, which are formed by cells of different types. In this step, we use Amulet to remove homotypic doublets. This step is optional; the `ArchR::addDoubletScores` function can be used instead to identify doublets. For more details about the `ArchR::addDoubletScores` function, please refer to the following link: https://www.archrproject.com/bookdown/doublet-inference-with-archr.html.
>for x in `ls ${dataset_path}`
>do
>mkdir -p ${project_path}/results/01_singlet_amulet/${x}
>cd ${project_path}/results/01_singlet_amulet/${x}
>FragPath=${project_path}/results/00_preprocess/${x}/outs/fragments.tsv.gz
>Barcode=${project_path}/results/00_preprocess/${x}/outs/singlecell.csv
>Human_autosomes=path_to_AMULET-1.1/human_autosomes.txt
>Repeat=${resource_path}/repetitive_elements/blacklist_repeats_segdups_rmsk_hg38.bed
>Out=${project_path}/results/01_singlet_amulet/${x}
>Script=path_to_AMULET-1.1
>path_to_AMULET-1.1/AMULET.sh $FragPath $Barcode $Human_autosomes $Repeat $Out $Script
>done
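Once Amulet has finished, the proportion of called cells flagged as doublets can be summarized per sample from the two files described above. This is a minimal sketch on toy data: `doublet_rate` is a hypothetical helper, and it assumes that singlecell.csv has a header containing `barcode` and `is__cell_barcode` columns and that MultipletBarcodes_01.txt lists one barcode per line in the same format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fraction of called cells that Amulet flagged as multiplets.
doublet_rate() {  # usage: doublet_rate singlecell.csv MultipletBarcodes_01.txt
    awk -F ',' '
        FILENAME == ARGV[1] { multiplet[$1] = 1; next }               # multiplet barcodes
        FNR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }      # CSV header
        $(col["is__cell_barcode"]) != 0 {
            cells++
            if ($(col["barcode"]) in multiplet) dbl++
        }
        END { printf "cells=%d doublets=%d rate=%.3f\n", cells, dbl, (cells ? dbl / cells : 0) }
    ' "$2" "$1"
}

# Toy example: three called cells, one of which is listed as a multiplet.
sc_csv=$(mktemp); mb_txt=$(mktemp)
printf 'barcode,total,is__cell_barcode\nNO_BARCODE,10,0\nAAA-1,5,1\nCCC-1,7,1\nGGG-1,9,1\nTTT-1,2,0\n' > "$sc_csv"
printf 'CCC-1\n' > "$mb_txt"
doublet_rate "$sc_csv" "$mb_txt"   # prints: cells=3 doublets=1 rate=0.333
rm -f "$sc_csv" "$mb_txt"
```

For the real data, the inputs would be `${project_path}/results/00_preprocess/${x}/outs/singlecell.csv` and `${project_path}/results/01_singlet_amulet/${x}/MultipletBarcodes_01.txt` for each sample `${x}`.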
-
3. Set up the ArchR genome, create Arrow files using ArchR, and remove low quality cells (critical step).
a. Set up the ArchR genome and create Arrow files using ArchR. Troubleshooting 3.
i. Load the ArchR library, and change the working directory using the `setwd` function within the R session.
ii. Set the default genome to hg38 for gene and genome annotation with the `addArchRGenome` function, and set the number of threads with the `addArchRThreads` function.
iii. Use the `getValidBarcodes` function to create a character vector of valid barcodes from the data frame (singlecell.csv from step 1) by filtering out those with an `is__cell_barcode` value of 0 and excluding any barcodes present in the doublets list (MultipletBarcodes_01.txt from step 2).
iv. Create Arrow files using the `createArrowFiles` function, with the paths to the fragment files (the fragments.tsv.gz files from step 1) provided as a character vector.
Note: The Arrow files are stored on disk as large HDF5-format files containing basic metadata and matrices (“TileMatrix” and “GeneScoreMatrix”). Output files appear in the “ArrowFiles/” subdirectory within the output directory. In addition to the Arrow files, a “QualityControl” folder is created in “ArrowFiles/”. This folder holds two plots: the first shows log10(unique nuclear fragments) versus TSS enrichment score, with dotted lines marking the thresholds that were applied; the second shows the fragment size distribution.
Note: Multiple CPUs improve performance in this step. Wall time and peak memory usage for `createArrowFiles` across different CPU core configurations (benchmarked on ∼10,000 cells) are shown in Figures S1A and S1B. Performance typically scales with additional CPU cores, but returns diminish beyond 4 cores. Furthermore, increasing the number of computational threads proportionally increases memory requirements. To optimize resource utilization, set the thread count to between 50% and 75% of the available cores, depending on local hardware constraints.
CRITICAL: The genome should be the same as in step 1.
>source activate scATACPipeline
>R
# load library
>library(ArchR)
>addArchRGenome("hg38")
>addArchRThreads(threads = 4)
# 1. make arrow files
>raw <- "path_to_project/results/00_preprocess/" #insert path to raw data
>amulet <- "path_to_project/results/01_singlet_amulet/"
# create the output directory, including missing parent directories
>dir.create("path_to_project/results/02_clustering/ArrowFiles", recursive = TRUE, showWarnings = FALSE)
>setwd("path_to_project/results/02_clustering/ArrowFiles")
>samples <- c("10k_pbmc_ATACv1-1_nextgem_Chromium_X","10k_pbmc_ATACv2_nextgem_Chromium_Controller","10k_pbmc_ATACv2_nextgem_Chromium_X")
# keep barcodes that are called cells and are not listed as multiplets by Amulet
>getValidBarcodes <- function(csvFiles = NULL, multiplet = NULL, sampleNames = NULL){
  if (length(sampleNames) != length(csvFiles)) {
    stop("csvFiles and sampleNames must exist!")
  }
  if (!all(file.exists(csvFiles))) {
    stop("Not All csvFiles exists!")
  }
  barcodeList <- lapply(seq_along(csvFiles), function(x) {
    df <- ArchR:::.suppressAll(data.frame(readr::read_csv(csvFiles[x])))
    multi <- read.table(multiplet[x],header=F)
    if ("is__cell_barcode" %ni% colnames(df)) {
      stop("is__cell_barcode not in colnames of 10x singlecell.csv file!")
    }
    as.character(df[which(paste0(df$is__cell_barcode) != 0 & !df$barcode %in% multi$V1),]$barcode)
  }) %>% SimpleList
  names(barcodeList) <- sampleNames
  barcodeList
}
>for(i in 1:length(samples)){
  # use only valid 10x barcodes
  barcodes <- getValidBarcodes(csvFiles = paste0(raw, samples[i],"/outs/singlecell.csv"),
    multiplet = paste0(amulet,samples[i],"/MultipletBarcodes_01.txt"),
    sampleNames = samples[i])
  ArrowFiles <- createArrowFiles(inputFiles = paste0(raw,samples[i], "/outs/fragments.tsv.gz"),
    sampleNames = samples[i], minTSS = 0, minFrags = 0,
    validBarcodes = barcodes[[1]], addTileMat = TRUE,
    addGeneScoreMat = TRUE, force=TRUE, offsetPlus = 0, offsetMinus = 0,
    excludeChr = c("chrM", "chrY", "chrX"))
}
# If Step 2 is not performed, use the function provided below.
>#getValidBarcodes <- function(csvFiles = NULL, multiplet = NULL, sampleNames = NULL){
#  if (length(sampleNames) != length(csvFiles)) {
#    stop("csvFiles and sampleNames must exist!")
#  }
#  if (!all(file.exists(csvFiles))) {
#    stop("Not All csvFiles exists!")
#  }
#  barcodeList <- lapply(seq_along(csvFiles), function(x) {
#    df <- ArchR:::.suppressAll(data.frame(readr::read_csv(csvFiles[x])))
#    if ("is__cell_barcode" %ni% colnames(df)) {
#      stop("is__cell_barcode not in colnames of 10x singlecell.csv file!")
#    }
#    as.character(df[which(paste0(df$is__cell_barcode) != 0),]$barcode)
#  }) %>% SimpleList
#  names(barcodeList) <- sampleNames
#  barcodeList
#}
>#samples <- c("10k_pbmc_ATACv1-1_nextgem_Chromium_X","10k_pbmc_ATACv2_nextgem_Chromium_Controller","10k_pbmc_ATACv2_nextgem_Chromium_X")
>#for(i in 1:length(samples)){
#  # use only valid 10x barcodes
#  barcodes <- getValidBarcodes(csvFiles = paste0(raw, samples[i],"/outs/singlecell.csv"), sampleNames = samples[i])
#  ArrowFiles <- createArrowFiles(inputFiles = paste0(raw,samples[i], "/outs/fragments.tsv.gz"),
#    sampleNames = samples[i], minTSS = 0, minFrags = 0,
#    validBarcodes = barcodes[[1]], addTileMat = TRUE,
#    addGeneScoreMat = TRUE, force=TRUE, offsetPlus = 0, offsetMinus = 0,
#    excludeChr = c("chrM", "chrY", "chrX"))
#}
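The recommendation to use between 50% and 75% of the available cores can be turned into a concrete value for `addArchRThreads` before starting R. The sketch below is illustrative only; `thread_range` is a hypothetical helper that uses integer arithmetic and never suggests fewer than one thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: 50%-75% of the given core count, with a floor of 1 thread.
thread_range() {  # usage: thread_range N_CORES
    lo=$(( $1 / 2 ))
    hi=$(( $1 * 3 / 4 ))
    if [ "$lo" -lt 1 ]; then lo=1; fi
    if [ "$hi" -lt "$lo" ]; then hi=$lo; fi
    echo "$lo-$hi"
}

thread_range 8   # prints: 4-6
echo "Suggested addArchRThreads(threads = ...) range on this machine: $(thread_range "$(nproc)")"
```

On shared servers, the number of cores granted by the scheduler, rather than the value reported by `nproc`, is the better input.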
b. Make an ArchR project.
Run the `ArchRProject()` function with a character vector of Arrow file paths and the required parameters to initialize the ArchRProject.
Note: The outputDirectory stores all subsequent plots and downstream analyses. ArchR automatically links the genome and gene annotations supplied in step 3a to the new ArchRProject. These annotations were stored when we ran `addArchRGenome("hg38")` in step 3a. In the R session, the variable “sc” is an ArchRProject, which is an S4 object that includes several important slots. The slots can be accessed using the @ operator. For example, sc@cellColData contains a DataFrame that holds cell-level metadata (such as TSSEnrichment and nFrags).
>ArrowFiles <- paste0("path_to_project/results/02_clustering/ArrowFiles/", samples, ".arrow")
>set.seed(1)
>setwd("../")
>sc <- ArchRProject(ArrowFiles = ArrowFiles, outputDirectory = "path_to_project/results/02_clustering", copyArrows = FALSE)
c. Quality control (QC) of scATAC-seq data.
Calculate fragment counts per cell and TSS Enrichment Scores, and then visualize their distribution across the library in Figures 1A and 1B.
Note: Quality control (QC) of scATAC-seq data is critical for limiting the impact of low-quality cells on downstream analysis. Several characteristics describe the quality of scATAC-seq data, including the number of unique nuclear fragments per cell, the signal-to-background ratio, and the fragment size distribution. The fragment size distribution is the least informative of these for filtering because it mainly reflects library complexity. Shorter fragments (∼50-200 bp) are preferred for scATAC-seq because they provide higher sensitivity for detecting regulatory elements and better resolution of chromatin accessibility patterns within individual cells. Even so, a good library should show nucleosome periodicity, with fragments corresponding to nucleosome-free regions (NFR) (< 100 bp) and to mono-, di-, and tri-nucleosomes (∼ 200, 400, and 600 bp, respectively). We therefore usually rely on the number of fragments and the signal-to-background ratio for QC. First, the number of fragments per cell is analogous to the number of genes detected per cell in scRNA-seq. Cells with too few or too many fragments are considered debris or doublets, respectively, so the outliers in the fragment number distribution are removed. The second trait used for QC is the signal-to-background ratio. A low signal-to-background ratio is often due to dead or dying cells with de-chromatinized DNA, which allows random transposition across the genome. The transcription start site (TSS) enrichment score and the fraction of reads in called peak regions (FRiP) are used to measure signal to background. A typical TSS enrichment plot in ATAC-seq shows that most fragments are enriched around the TSS and that the signal weakens with increasing distance from the TSS.19,20 Therefore, if the fragments of an individual cell aggregate around TSSs, the cell was not damaged when its DNA was transposed by Tn5.
The TSS Enrichment Score, which is calculated as the fold enrichment of the peak fragment depth around gene TSSs across the genome relative to the average read depth at a specified distance (1,000 or 2,000 bp) from the TSSs,21 is then used to determine whether the fragments of a cell aggregate around the TSS.22 FRiP, like TSS enrichment, reflects the signal-to-noise ratio of scATAC-seq and is calculated as the fraction of all unique mapped Tn5 insertion events that overlap a set of accessibility peaks. The value of FRiP ranges from 0 to 1: values closer to 1 indicate that the fragments obtained from the cell come from open chromatin regions, whereas values closer to 0 indicate that the fragments were captured genome-wide.
Note: In this protocol, we use the TSS enrichment score and the number of fragments to measure the quality of each library. Typically, TSS enrichment scores greater than 5 or 6 (10 for PBMC samples), FRiP exceeding 0.3 (0.2 is acceptable), and the number of unique nuclear fragments greater than 1,000 are empirically recommended for human and mouse data. For more suggestions, please visit the website: https://www.encodeproject.org/atac-seq/.
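>library(ggplot2) # plotting functions used below
>library(cowplot) # provides theme_cowplot() and plot_grid()
>metadata <- as.data.frame(sc@cellColData)
>colours <- ArchR:::paletteDiscrete( values = unique(metadata$Sample))
>dir.create("./Plots", showWarnings = FALSE) # create the plot directory if it does not exist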
>metadata$Sample <- factor(metadata$Sample, levels = unique(metadata$Sample))
>P1<- ggplot(metadata, aes(x = Sample, y = TSSEnrichment, fill=Sample)) +
geom_violin(trim=TRUE,color="white",show.legend = F) +
geom_boxplot(width=0.1,position=position_dodge(0.9),show.legend = F,fill="white",outlier.size = 0,outlier.stroke = 0)+
scale_fill_manual(values = colours)+
theme_cowplot()+
theme(axis.text.x=element_blank(),axis.title.x = element_blank())+
ylab("TSSEnrichment")
>P2<- ggplot(metadata, aes(x = Sample, y = log10(nFrags), fill=Sample)) +
geom_violin(trim=TRUE,color="white",show.legend = F) +
geom_boxplot(width=0.1,position=position_dodge(0.9),show.legend = F,fill="white",outlier.size = 0,outlier.stroke = 0)+
scale_fill_manual(values = colours)+
theme_cowplot()+
scale_y_continuous(limits = c(3, 5))+
theme(axis.text.x=element_text(colour="black",family="Times",size=10))+
ylab(expression(Log[10]*paste("(","nFragment",")",sep = "")))+xlab("")
>ggsave("./Plots/sample_qc.pdf",do.call(plot_grid,c(list(P1,P2),ncol=1,align = "v")),width=3,height=3)
>gg <- ArchR:::ggPoint(
x = pmin(log10(metadata$nFrags), 5) + rnorm(length(metadata$nFrags), sd = 0.00001),
y = metadata$TSSEnrichment + rnorm(length(metadata$nFrags), sd = 0.00001),
colorDensity = TRUE,
xlim = c(2.5, 5),
ylim = c(0, max(metadata$TSSEnrichment) * 0.8),
baseSize = 6,
continuousSet = "sambaNight",
xlabel = "Log 10 (Unique Fragments)",
ylabel = "TSS Enrichment",
rastr = TRUE) +
geom_hline(yintercept=4, lty = "dashed", size = 0.25) +
geom_vline(xintercept=log10(1000), lty = "dashed", size = 0.25)
>ggsave("./Plots/totalcell_qc.pdf", gg, width=5, height=5)
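The plots above cover fragment counts and TSS enrichment. FRiP, as defined in the note above, can be approximated per barcode directly from the Cell Ranger ATAC output. The sketch below is an assumption-laden illustration rather than part of the pipeline: it assumes that singlecell.csv contains the columns `barcode`, `is__cell_barcode`, `passed_filters`, and `peak_region_fragments`, and it uses fragments rather than individual Tn5 insertion events. `frip_per_cell` is a hypothetical helper, and the default cutoff of 0.3 is the empirical value quoted in the note.

```shell
#!/usr/bin/env bash
# Hypothetical helper: per-cell FRiP = peak_region_fragments / passed_filters.
frip_per_cell() {  # usage: frip_per_cell singlecell.csv [MIN_FRIP]
    awk -F ',' -v min="${2:-0.3}" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }     # CSV header
        $(col["is__cell_barcode"]) != 0 && $(col["passed_filters"]) > 0 {
            frip = $(col["peak_region_fragments"]) / $(col["passed_filters"])
            printf "%s\t%.3f\t%s\n", $(col["barcode"]), frip, (frip >= min ? "pass" : "low")
        }
    ' "$1"
}

# Toy example with two called cells and one background barcode.
demo=$(mktemp)
printf 'barcode,passed_filters,is__cell_barcode,peak_region_fragments\n' > "$demo"
printf 'AAA-1,1000,1,450\nCCC-1,2000,1,300\nGGG-1,50,0,5\n' >> "$demo"
frip_per_cell "$demo"
# AAA-1   0.450   pass
# CCC-1   0.150   low
rm -f "$demo"
```

The second argument overrides the cutoff, for example `frip_per_cell singlecell.csv 0.2` for the more permissive threshold.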
4. Remove low quality cells (critical step).
Figure 1.
The mathematical characteristics of scATAC-seq data
(A) Distribution of TSS Enrichment Scores and the number of detected fragments per cell for individual libraries. The boxes span the first to third quartiles (Q1 to Q3), and the horizontal line denotes the median.
(B) Scatter plot illustrating the relationship between the TSS Enrichment score and the number of detected fragments. Each point represents an individual cell. The vertical dashed line indicates the expected minimum number of fragments; the horizontal one represents the expected minimum TSS enrichment score.
(C) Scatter plot showing the correlation between sequencing depth and each reduced LSI dimension.
(D) UMAP embedding of PBMCs, colored by individual PBMC replicates.
(E) UMAP embedding and clustering analysis of PBMCs, colored by cluster.
(F) Heatmap displaying the number of cells of each annotated cell type in each cluster, with annotations transferred from the integrated scRNA-seq data.
(G) Heatmap plot of marker genes specifically expressed in different clusters.
(H) UMAP embedding and clustering analysis of PBMCs, colored by cell type.
Estimate the density distribution of these quality control indicators (log10(fragment number), TSS Enrichment Score) and remove outliers.
Note: The distributions of TSS enrichment and the number of fragments for each library after removing low-quality cells are shown in Figures S2A and S2B.
Note: There is no universal standard for filtering cells. Cells with fragment numbers or TSS Enrichment Scores below the chosen thresholds are discarded, and the thresholds should be chosen from the density distribution of each sample. Do not set the thresholds too high at this stage, as you can always tighten them later.
>tss_outliers <- list()
>tss_outliers_names <- list()
>for(i in 1:length(samples)){
sample_i <- samples[i]
tss_enrich <- sc$TSSEnrichment[sc$Sample==sample_i]
tss_outliers[[sample_i]] <- isOutlier(tss_enrich, nmads=1, type="lower")
tss_outliers_names[[sample_i]] <-
rownames(sc@cellColData)[sc$Sample==sample_i][tss_outliers[[sample_i]]]
}
>lower_threshold <- sapply(tss_outliers, function(x) attr(x, "thresholds")[1])
>filter_tss <- sapply(tss_outliers, function(x) sum(x))
>tss_outliers_names <- unlist(tss_outliers_names)
>sc <- sc[!rownames(sc@cellColData) %in% tss_outliers_names,]
>sc <- sc[which(log10(sc$nFrags) > 3.5),]
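The outlier rule used above can be illustrated in base R. This is a minimal sketch on simulated scores, assuming that `isOutlier` (as provided by the scater/scuttle packages) with `type="lower"` flags values lying more than `nmads` median absolute deviations below the median; `flag_lower_outliers` is a hypothetical helper written for this illustration, not part of any package.

```r
# Hypothetical helper: flag values more than `nmads` MADs below the median.
flag_lower_outliers <- function(x, nmads = 1) {
  med <- median(x, na.rm = TRUE)
  # stats::mad() scales by 1.4826 by default, making it comparable to a standard deviation
  dev <- mad(x, center = med, na.rm = TRUE)
  threshold <- med - nmads * dev
  list(threshold = threshold, is_outlier = x < threshold)
}

set.seed(1)
# Simulated TSS enrichment scores: mostly good cells plus a low-quality tail
tss <- c(rnorm(950, mean = 12, sd = 2), rnorm(50, mean = 3, sd = 1))
res <- flag_lower_outliers(tss, nmads = 1)
kept <- tss[!res$is_outlier]
cat("threshold:", round(res$threshold, 2), " removed:", sum(res$is_outlier), "\n")
```

With `nmads = 1` the rule is strict and discards a sizeable fraction of cells; a larger `nmads` keeps more cells.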
-
5.
Dimension reduction (critical step).
Use the genome-wide tile matrix as the input and perform dimensionality reduction through multiple iterations of the LSI method.
Note: For descriptions of the parameters, you can visit the website: https://www.archrproject.com/bookdown/iterative-latent-semantic-indexing-lsi.html, or see `?addIterativeLSI` for more details. Then, a new object called `IterativeLSI` will be created in the reducedDims slot (see it using sc@reducedDims$IterativeLSI).
>sc <- addIterativeLSI(ArchRProj = sc, useMatrix = "TileMatrix",name = "IterativeLSI", iterations = 4, clusterParams = list(resolution = 4, sampleCells = 2000, n.start = 10),varFeatures = 50000, dimsToUse = 1:30,
force=TRUE, seed=1)
-
6.Assess the correlation between each LSI component and sequencing depth (optional step).
- Compute Pearson correlations between iterative LSI components and per-cell fragment counts, then generate diagnostic scatter plots to assess depth-dependent technical artifacts (Figure 1C).
Note: If there is a strong correlation between the first LSI component and the total number of fragments in a cell, we should exclude it from downstream analysis because it frequently captures sequencing depth (technical variation) rather than biological variability. In this example, the correlation between the first LSI component and the total number of fragments is approximately 0.9; therefore, the follow-up analyses (such as batch correction, clustering, and embedding) use the parameter `dimsToUse = 2:30`.
Note: This step is important but not mandatory; it only determines the dimensions used in subsequent analysis steps and enhances the results. However, we recommend performing this step.
>embed <- sc@reducedDims$IterativeLSI$matSVD
>counts <- subset(sc@cellColData,select= "nFrags")
>embed <- embed[rownames(x = counts), ]
>n=10
>n <- Signac:::SetIfNull(x = n, y = ncol(x = embed))
>embed <- embed[, seq_len(length.out = n)]
>counts$nFrags <- as.numeric(counts$nFrags)
>depth.cor <- as.data.frame(cor(x = embed, y = counts$nFrags))
>depth.cor$Component <- rownames(depth.cor)
>colnames(depth.cor)[1] <- "correlation"
>depth.cor$Component <- factor(depth.cor$Component, levels = depth.cor$Component)
>P3 <- ggplot(depth.cor, aes(Component, correlation)) +
geom_point() +
# scale_x_continuous(n.breaks = 10, limits = c(1, 10)) +
ylab("Correlation") +
ylim(c(-1, 1)) +
theme_light() +
ggtitle("Correlation between depth and reduced dimension components")
> ggsave("./Plots/correlation_with_depth.pdf", P3,width=5,height=5)
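The decision rule described in the notes of this step can be sketched without ArchR. The example below uses simulated data in which the first component is constructed to track depth; the cutoff of 0.75 is an assumption for this sketch (it mirrors the `corCutOff` default shown later in this protocol), and all object names are illustrative.

```r
set.seed(1)
n_cells <- 500
depth <- round(10^runif(n_cells, 3.5, 5))        # simulated fragments per cell
embed <- matrix(rnorm(n_cells * 5), ncol = 5,
                dimnames = list(NULL, paste0("LSI", 1:5)))
# Make component 1 follow sequencing depth, as often seen in real data
embed[, 1] <- scale(log10(depth))[, 1] + rnorm(n_cells, sd = 0.2)

depth_cor <- cor(embed, depth)[, 1]              # Pearson correlation per component
cutoff <- 0.75                                   # assumed cutoff for this sketch
dims_to_use <- which(abs(depth_cor) <= cutoff)   # components kept for downstream steps
print(round(depth_cor, 2))
print(dims_to_use)
```

Components whose absolute correlation exceeds the cutoff are dropped, which corresponds to passing `dimsToUse = 2:30` in the steps that follow.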
-
7.
Batch correction (optional step). Troubleshooting 4.
Perform Harmony integration to remove batch effects (grouped by “Sample”) on the specified reduced dimensions (IterativeLSI) and store the corrected dimensions under the name “Harmony”.
Note: Batch effects are unwanted technical variations in the data that arise from differences in sequencing platforms, library construction platforms for scATAC-seq, laboratories, operators, and sample acquisition and processing. These effects can introduce systematic errors that blur the distinction between technical and biological variability. As a consequence, cells of the same cell type may exhibit different chromatin accessibility profiles and fall into different clusters.23 In this example, there is no biological variance between the three PBMC replicates; therefore, the embedding colored by sample is expected to show no batch effect (Figure 1D). A cluster that contains cells from only one sample indicates a batch effect that should be removed using the `addHarmony` function. Caution must also be exercised regarding overcorrection: if cells from different cell types are clustered together, overcorrection has likely occurred and the parameters need to be adjusted. We therefore recommend forming clear expectations of how the samples should cluster, to avoid both residual batch effects and overcorrection. Proper clustering is crucial for addressing biological questions in downstream analyses.
Note: If you use the `addHarmony` function in your workflow, an object named “Harmony” will be created in the `reducedDims` slot and used in the follow-up steps (clustering and embedding). Otherwise, set the `reducedDims` parameter of those steps to “IterativeLSI”.
>sc <- addHarmony(ArchRProj = sc, reducedDims = "IterativeLSI", name = "Harmony", groupBy = "Sample", force=TRUE, dimsToUse = 2:30)
-
8.
Clustering (critical step).
Utilize the `addClusters` function to conduct clustering.
Note: It is important to pay attention to two parameters: `maxClusters` and `dimsToUse`. `maxClusters` must be set to a value greater than or equal to the expected number of cell types. If the number of clusters exceeds this value, `hclust` and `cutree` are used to merge clusters in an unbiased manner. For `dimsToUse`, if the first dimension was found to be strongly correlated with depth during dimensionality reduction, `dimsToUse` should start from the second dimension during clustering. You can use `?addClusters` to view explanations of the other parameters. The clusters are saved in the `cellColData` slot; use the $ accessor to get the cluster of each cell (e.g., sc@cellColData$Clusters).
>sc <- addClusters(input = sc, reducedDims = "Harmony", method = "Seurat",name = "Clusters", resolution = 0.8, force=TRUE, seed = 1, maxClusters = 100,dimsToUse = 2:30)
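The merging behaviour mentioned for `maxClusters` can be illustrated with base R. This is a sketch of the general idea only, on simulated data: cluster centroids are grouped by hierarchical clustering and the tree is cut at the maximum allowed number of clusters. ArchR's internal implementation may differ in its details, and all names below are illustrative.

```r
set.seed(1)
# Simulated reduced dimensions for 300 cells and an over-split clustering (12 clusters)
dims <- matrix(rnorm(300 * 5), ncol = 5)
clusters <- sample(paste0("C", 1:12), 300, replace = TRUE)
max_clusters <- 8

# Mean position of each cluster in the reduced space
sums <- rowsum(dims, clusters)
centroids <- sums / as.vector(table(clusters)[rownames(sums)])

# Hierarchically cluster the centroids and cut the tree at max_clusters groups
tree <- hclust(dist(centroids))
merged_id <- cutree(tree, k = max_clusters)   # named by original cluster label
merged <- paste0("C", merged_id[clusters])    # merged label for every cell
table(merged)
```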
-
9.
Embedding (critical step).
We perform an additional nonlinear dimensionality reduction (UMAP) based on LSI dimensionality reduction, since only one to three dimensions can be displayed in single-cell ATAC-seq clustering visualizations.
Note: UMAP, which is grounded in graph algorithms, is well suited for exploratory data visualization. It preserves global structure reasonably well on large datasets while remaining fast and memory-efficient. The embedding object named “UMAP” will be saved in the `embeddings` slot.
>sc <- addUMAP(ArchRProj = sc, reducedDims = "Harmony", name = "UMAP",nNeighbors = 30, minDist = 0.5, metric = "cosine",force=TRUE, seed = 1,dimsToUse = 2:30)
-
10.
Visualize the cell embedding within clusters and across samples (critical step).
Note: The output file named “Plot-UMAP-Sample-Clusters.pdf” is saved in the “Plots/” subdirectory within the output directory. The charts are illustrated in Figures 1D and 1E.
>P4 <- plotEmbedding(ArchRProj = sc, colorBy = "cellColData", name = "Clusters",embedding = "UMAP")
>P5 <- plotEmbedding(ArchRProj = sc, colorBy = "cellColData", name = "Sample",
embedding = "UMAP")
>plotPDF(P4,P5, name = "Plot-UMAP-Sample-Clusters.pdf", ArchRProj = sc,
addDOC = FALSE, width = 5, height = 5)
-
11.
Imputation with MAGIC (optional step).
Apply MAGIC to denoise gene scores using cellular neighborhood information. This constructs a Markov affinity graph to smooth values across similar cells, generating imputed gene score profiles. The imputation weights are saved in the `imputeWeights` slot.
Note: scATAC-seq data are sparse and noisy, which obscures the profile of gene activity scores. This step is important but not mandatory. However, we recommend performing it, as it significantly enhances the visual interpretation of gene scores.
>proj <- addImputeWeights(
ArchRProj = sc, # project object from the previous steps; the result is stored as proj from here on
reducedDims = "IterativeLSI",
dimsToUse = NULL,
scaleDims = NULL,
corCutOff = 0.75,
td = 3,
ka = 4,
sampleCells = 1000,
nRep = 2,
k = 15,
epsilon = 1,
useHdf5 = TRUE,
randomSuffix = FALSE,
threads = getArchRThreads(),
seed = 1,
verbose = TRUE,
logFile = createLogFile("addImputeWeights")
)
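The smoothing idea behind this step can be sketched in base R. The example below is not the MAGIC implementation and not ArchR's code; it only illustrates, on simulated data, how a row-normalized nearest-neighbour affinity matrix raised to a diffusion time of 3 (the `td` value used above) averages a sparse score across similar cells. The kernel and the neighbour count `k` are assumptions made for this illustration.

```r
set.seed(1)
n_cells <- 60
# Two simulated cell groups in a 2-D reduced space
lsi <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))
# A sparse gene score: present in group 2 only, with random dropouts
truth <- rep(c(0, 5), each = 30)
score <- truth * rbinom(n_cells, 1, 0.5)

k <- 10                                            # neighbours per cell (assumed)
d <- as.matrix(dist(lsi))
W <- matrix(0, n_cells, n_cells)
for (i in seq_len(n_cells)) {
  nn <- order(d[i, ])[seq_len(k + 1)]              # k neighbours plus the cell itself
  W[i, nn] <- exp(-(d[i, nn] / max(d[i, nn]))^2)   # Gaussian-like affinity
}
W <- (W + t(W)) / 2                                # symmetrise the affinities
M <- W / rowSums(W)                                # row-stochastic Markov matrix
Mt <- M %*% M %*% M                                # diffusion time td = 3
imputed <- as.vector(Mt %*% score)                 # smoothed gene score per cell
round(tapply(imputed, rep(c("group1", "group2"), each = 30), mean), 2)
```

Because every row of the diffused matrix sums to one, each imputed value is a weighted average of the observed scores of neighbouring cells.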
-
12.Cell type annotation (critical step).
-
a.Cell type annotation using marker genes (Figure S3).
Assign cell types to each cell cluster based on reported marker genes. This step will include displaying the expression of known marker genes using the “GeneScoreMatrix” stored in the Arrow files. Save the plots in the “Plots/” subdirectory within the output directory.
Note: Most strategies for cell type annotation in scATAC-seq are primarily based on gene expression. Therefore, obtaining an accurate gene activity score for each cell is crucial. Several methods have been developed for calculating gene activity, including Signac, ArchR, and snapATAC. It has been reported that the gene score computed by ArchR is particularly accurate.2
>markerGenes <- c(
"CD3D","CD4","CD8A", #T cells
"CD19", #B cells
"XBP1", #plasma
"CD14", #Monocytes
"GNLY" # NK cells
)
>p <- plotEmbedding(
ArchRProj = proj,
colorBy = "GeneScoreMatrix",
name = markerGenes,
embedding = "UMAP",
imputeWeights = getImputeWeights(proj)
)
>P5 <- lapply(p, function(x){
x + guides(color = FALSE, fill = FALSE) +
theme_ArchR(baseSize = 6.5) +
theme(plot.margin = unit(c(0, 0, 0, 0), "cm"),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
})
-
b.Transfer the scRNA-seq annotation label to scATAC-seq dataset (Figure 1F).
Perform cell type annotation through integrative analysis with scRNA-seq data, transferring the scRNA-seq annotation labels to the scATAC-seq dataset using the addGeneIntegrationMatrix function.
Note: The predicted cell type will be saved in the cellColData slot with the column name “predictedsubGroup” (proj@cellColData$predictedsubGroup).
>library(Seurat)
>library(SeuratData)
>InstallData("pbmc3k")
>library(pbmc3k.SeuratData)
>data("pbmc3k")
>seRNA <- pbmc3k
>seRNA@active.assay <- "RNA"
>proj <- addGeneIntegrationMatrix(
ArchRProj = proj,
useMatrix = "GeneScoreMatrix",
matrixName = "GeneIntegrationMatrix",
reducedDims = "IterativeLSI",
seRNA = seRNA,
addToArrow = FALSE,
groupRNA = "seurat_annotations",
nameCell = "predictedsubCell",
nameGroup = "predictedsubGroup",
nameScore = "predictedsubScore"
)
>P6 <- plotEmbedding(
proj,
colorBy = "cellColData",
name = "predictedsubGroup",
embedding = "UMAP"
)
>plotPDF(P6, name = "Plot-UMAP-CellTypeAnnotation.pdf", ArchRProj = proj,
addDOC = FALSE, width = 5, height = 5)
>saveRDS(proj, "pbmc_scATAC_annotation.rds")
Downstream analysis of scATAC-seq data
Timing: 2 h
In this section, we outline the steps for the downstream analysis of scATAC-seq datasets, which are guided by the scientific hypothesis and experimental design. In other words, there is no standard method for this type of study. Nonetheless, the following steps are typically performed: profiling chromatin accessibility for each cluster or cell type, identifying differentially accessible regions between different clusters or cell types, detecting key factors that drive changes in chromatin accessibility, and uncovering promoter–enhancer linkages.
Note: In this pipeline, the profiling of chromatin accessibility for each cluster or cell type (Steps 13-16) is a critical step for subsequent analyses. These steps can alternatively be performed with ArchR's `addReproduciblePeakSet` and `addPeakMatrix` functions. The remaining steps depend on the scientific hypothesis and experimental design.
-
13.
Peak calling.
For each cell cluster or cell type, we combine all fragments to generate a pseudo-bulk ATAC-seq dataset for each biological replicate. Furthermore, we generate two pseudo-replicates, each consisting of half of the fragments from the pooled cell cluster or cell type. We call peaks independently for each of the pseudo-bulk datasets and for the pooled dataset of all replicates.24
a.Generate two equal-sized, cell-type-specific pseudo-bulk ATAC-seq datasets from individual biological replicates.
-
i.Generate a cell type-Arrow file mapping list (≥40 cells per type) by concatenating cell type names with Arrow filenames, using cell annotations and Arrow files as inputs.
-
ii.Split fragment files into equal-sized pseudo-bulks using the `make_beds_pseudo` function.
Note: The `make_beds_pseudo` function writes output files like the following to “project_path/results/03_peakcalling/bedfiles_pseudo”:
B_10k_pbmc_ATACv1-1_nextgem_Chromium_X_pseudo1_blacklistrm.bed
B_10k_pbmc_ATACv1-1_nextgem_Chromium_X_pseudo2_blacklistrm.bed
B_10k_pbmc_ATACv2_nextgem_Chromium_Controller_pseudo1_blacklistrm.bed
B_10k_pbmc_ATACv2_nextgem_Chromium_Controller_pseudo2_blacklistrm.bed
B_10k_pbmc_ATACv2_nextgem_Chromium_X_pseudo1_blacklistrm.bed
B_10k_pbmc_ATACv2_nextgem_Chromium_X_pseudo2_blacklistrm.bed
…
Platelet_10k_pbmc_ATACv2_nextgem_Chromium_Controller_pseudo1_blacklistrm.bed
Platelet_10k_pbmc_ATACv2_nextgem_Chromium_Controller_pseudo2_blacklistrm.bed
Platelet_10k_pbmc_ATACv2_nextgem_Chromium_X_pseudo1_blacklistrm.bed
Platelet_10k_pbmc_ATACv2_nextgem_Chromium_X_pseudo2_blacklistrm.bed
-
b.Execute the `call_pseudo_cluster` function to generate cell-type-specific pseudo-replicates and call peaks for each replicate.
Note: The `call_pseudo_cluster` function takes two parameters: `cell_type`, which specifies the particular cell type for analysis, and `bed_files`, a vector containing the paths to the relevant bed files.
The output files for this function look like this:
B_pseudo1.bed
B_pseudo1_peaks.narrowPeak
B_pseudo1_peaks.xls
B_pseudo1_summits.bed
B_pseudo2.bed
B_pseudo2_peaks.narrowPeak
B_pseudo2_peaks.xls
B_pseudo2_summits.bed
…
Platelet_pseudo1.bed
Platelet_pseudo1_peaks.narrowPeak
Platelet_pseudo1_peaks.xls
Platelet_pseudo1_summits.bed
Platelet_pseudo2.bed
Platelet_pseudo2_peaks.narrowPeak
Platelet_pseudo2_peaks.xls
Platelet_pseudo2_summits.bed
-
c.Execute the `make_beds` function to generate a cell-type-specific pseudo-bulk ATAC-seq dataset from individual biological replicates and use the `call_peaks` function to call peaks for each dataset.
Note: The output files for this step look like this:
B_10k_pbmc_ATACv1-1_nextgem_Chromium_X_blacklistrm.bed
B_10k_pbmc_ATACv1-1_nextgem_Chromium_X_peaks.narrowPeak
B_10k_pbmc_ATACv1-1_nextgem_Chromium_X_peaks.xls
B_10k_pbmc_ATACv1-1_nextgem_Chromium_X_summits.bed
B_10k_pbmc_ATACv2_nextgem_Chromium_Controller_blacklistrm.bed
B_10k_pbmc_ATACv2_nextgem_Chromium_Controller_peaks.narrowPeak
B_10k_pbmc_ATACv2_nextgem_Chromium_Controller_peaks.xls
B_10k_pbmc_ATACv2_nextgem_Chromium_Controller_summits.bed
B_10k_pbmc_ATACv2_nextgem_Chromium_X_blacklistrm.bed
B_10k_pbmc_ATACv2_nextgem_Chromium_X_peaks.narrowPeak
B_10k_pbmc_ATACv2_nextgem_Chromium_X_peaks.xls
B_10k_pbmc_ATACv2_nextgem_Chromium_X_summits.bed
…
Platelet_10k_pbmc_ATACv2_nextgem_Chromium_Controller_blacklistrm.bed
Platelet_10k_pbmc_ATACv2_nextgem_Chromium_Controller_peaks.narrowPeak
Platelet_10k_pbmc_ATACv2_nextgem_Chromium_Controller_peaks.xls
Platelet_10k_pbmc_ATACv2_nextgem_Chromium_Controller_summits.bed
Platelet_10k_pbmc_ATACv2_nextgem_Chromium_X_blacklistrm.bed
Platelet_10k_pbmc_ATACv2_nextgem_Chromium_X_peaks.narrowPeak
Platelet_10k_pbmc_ATACv2_nextgem_Chromium_X_peaks.xls
Platelet_10k_pbmc_ATACv2_nextgem_Chromium_X_summits.bed
-
d.Execute `call_cluster` for each cell type to:
-
i.Combine fragment files from all biological replicates into a pseudo-bulk ATAC-seq dataset.
-
ii.Call peaks for the combined dataset.
Note: The output files for this function look like this:
B_blacklistrm.bed
B_peaks.narrowPeak
B_peaks.xls
B_summits.bed
…
Platelet_blacklistrm.bed
Platelet_peaks.narrowPeak
Platelet_peaks.xls
Platelet_summits.bed
Note: Utilizing multiple CPU cores enhances computational efficiency in this step. We benchmarked wall time and peak memory utilization during peak calling across varying core allocations (Figures S1C and S1D). While performance scales linearly with additional cores up to 8 threads, diminishing returns emerge beyond this threshold. Furthermore, increasing thread count proportionally elevates memory requirements. To optimize resource utilization, configure thread counts not to exceed the number of distinct cell types processed.
>R
>options(scipen=999)
>library(ArchR)
>library(rtracklayer)
>library(parallel)
>library(data.table)
>library(GenomicRanges)
>addArchRThreads(threads = 1)
>set.seed(10)
>addArchRGenome("hg38")
>source("path_to_software/scATAC-seq-Analysis/bin/function_peakcalling.R")
>setwd("path_to_project/results")
>system("mkdir -p 03_peakcalling/bedfiles_pseudo")
>system("mkdir -p 03_peakcalling/bedfiles")
>sc <- readRDS("path_to_project/results/02_clustering/pbmc_scATAC_annotation.rds")
>blacklist <- import("path_to_resource/repetitive_elements/GRCh38_unified_blacklist.bed")
>ArrowFiles <- getArrowFiles(sc)
>Groups <- getCellColData(ArchRProj = sc, select = "predictedsubGroup", drop = TRUE)
>Cells <- sc$cellNames
>cellGroups <- split(Cells, Groups)
>availableChr <- ArchR:::.availableSeqnames(head(getArrowFiles(sc)))
>chromLengths <- getChromLengths(sc)
>chromSizes <- getChromSizes(sc)
>cell_types <- names(cellGroups)
>input <- lapply(1:length(cellGroups), function(x) lapply(names(ArrowFiles),
function(y) {
if(sum(grepl(paste0(y, "#"), cellGroups[[x]]))>=40)
c(names(cellGroups)[x], y)
}
))
>input <- unlist(input, recursive = FALSE)
>input <- input[!sapply(input, is.null)]
# get fragments from ArrowFiles for every sample and every celltype and split into two pseudo-pseudobulks of equal size
>mcmapply(function(X,Y) make_beds_pseudo(X,Y, cellGroups_new=cellGroups),
X=ArrowFiles, Y=names(ArrowFiles),
mc.cores=10)
>bed_files <- list.files("03_peakcalling/bedfiles_pseudo")
# make bedfiles and call peaks for pseudo-pseudobulks for cell types
>mclapply(cell_types, function(x) call_pseudo_cluster(x, bed_files), mc.cores=5)
# make bedfiles for pseudobulk for cluster and sample
>mcmapply(function(X,Y) make_beds(X, Y, cellGroups),X=ArrowFiles, Y=names(ArrowFiles),mc.cores=10)
>mclapply(input, function(x) call_peaks(x[1], x[2]), mc.cores=10)
>bed_files <- list.files("03_peakcalling/bedfiles")
# make bedfiles and call peaks for pseudobulks for clusters
>mclapply(cell_types, function(x) call_cluster(x, bed_files), mc.cores=10)
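The pseudo-replicate construction used in this step can be illustrated on simulated fragments. This sketch assumes that fragments are shuffled and dealt at random into two halves of equal size; the actual `make_beds_pseudo` function in the repository may split differently (for example by cell), and `split_pseudo` is a hypothetical helper for this illustration.

```r
set.seed(10)
# Simulated fragments for one cell type in one sample
frags <- data.frame(
  chr   = "chr1",
  start = sort(sample(1e6, 1001)),
  cell  = sample(paste0("cell", 1:50), 1001, replace = TRUE)
)
frags$end <- frags$start + 100L

# Hypothetical helper: shuffle fragments and deal them into two equal halves
split_pseudo <- function(df) {
  idx <- sample(nrow(df))
  half <- floor(nrow(df) / 2)            # with an odd count, one fragment is left out
  list(pseudo1 = df[sort(idx[seq_len(half)]), ],
       pseudo2 = df[sort(idx[half + seq_len(half)]), ])
}

reps <- split_pseudo(frags)
sapply(reps, nrow)
```

Peaks that are called in both halves are less likely to be driven by a few fragments, which is the rationale for the pseudo-replicate filter in the next step.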
-
14.Generate a reproducible peak set for each cell type or cluster.
-
a.Generate a list of reproducible peaks using the `filter_peaks` function, retaining two types of peaks:
-
i.Type 1: Peaks identified in the pooled dataset with ≥ 50% overlap in both biological replicates.
-
ii.Type 2: Peaks identified in the pooled dataset with ≥ 50% overlap in both pseudo-replicates.24
-
b.Save the filtered peak data with a `.naiveSummitList.bed` suffix for each cell type under the path `$project_path/results/04_parsePeak/`.
-
c.Create a summit list file named `pbmc.naiveSummitList.list` in the working directory.
>R
>library(ArchR)
>library(parallel)
>source("path_to_software/scATAC-seq-Analysis/bin/function_get_reproducible_peak.R")
>setwd("path_to_project/results")
>system("mkdir 04_parsePeak")
>sc <- readRDS("02_clustering/pbmc_scATAC_annotation.rds")
>sc$predictedsubGroup <- gsub(" ","_",sc$predictedsubGroup)
>sc$predictedsubGroup <- gsub("\\+","",sc$predictedsubGroup)
>ArrowFiles <- getArrowFiles(sc)
>Groups <- getCellColData(ArchRProj = sc, select = "predictedsubGroup", drop = TRUE)
>Cells <- sc$cellNames
>cellGroups <- split(Cells, Groups)
>input <- lapply(1:length(cellGroups), function(x) lapply(names(ArrowFiles),
function(y) {
if(sum(grepl(paste0(y, "#"), cellGroups[[x]]))>=40)
c(names(cellGroups)[x], y)
}
))
>input <- unlist(input, recursive = FALSE)
>input <- input[!sapply(input, is.null)]
>celltypes <- unique(sapply(seq_along(input),function(x){y=input[[x]][1]
return(y)}))
>cores <- length(celltypes)
>mclapply(celltypes,function(x) filter_peaks(x),mc.cores = cores)
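The 50% overlap criterion can be made concrete with a few toy intervals. This is a base-R sketch under a simplifying assumption: a pooled peak counts as supported by a replicate if a single replicate peak covers at least half of its width. `has_overlap` is a hypothetical helper, and the repository's `filter_peaks` function may define the overlap differently.

```r
# Hypothetical helper: TRUE if some peak in `rep` covers >= `min_frac` of the pooled peak
has_overlap <- function(pooled, rep, min_frac = 0.5) {
  vapply(seq_len(nrow(pooled)), function(i) {
    same_chr <- rep$chr == pooled$chr[i]
    # Overlap width in bp; negative values mean the intervals do not overlap
    ov <- pmin(pooled$end[i], rep$end[same_chr]) -
          pmax(pooled$start[i], rep$start[same_chr])
    width <- pooled$end[i] - pooled$start[i]
    any(ov / width >= min_frac)
  }, logical(1))
}

pooled <- data.frame(chr = "chr1", start = c(100, 1000, 5000), end = c(600, 1500, 5500))
rep1   <- data.frame(chr = "chr1", start = c(150, 1300, 9000), end = c(650, 1800, 9500))
rep2   <- data.frame(chr = "chr1", start = c(90, 1200, 5100),  end = c(500, 1700, 5600))

# Keep pooled peaks supported by both replicates
reproducible <- has_overlap(pooled, rep1) & has_overlap(pooled, rep2)
pooled[reproducible, ]
```

In this toy example only the first pooled peak is retained: the second is covered by 40% in one replicate, and the third has no counterpart in `rep1`.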
-
15.Merge all cell type or cluster peak sets to a union peak set.
-
a.Extend peak summits by 500 bp using `extendSummit` function.
-
b.Filter peak regions.
-
i.Remove peak regions containing N bases using the `filter4N` function.
-
ii.Remove peak regions that extend beyond chromosome boundaries using the `filter4chrom` function.
-
c.Normalize MACS2 peak scores (–log10(q-value)) to “score per million” using the `norm2spm` function.
-
d.Filter reproducible peaks by applying a “score per million” cut-off of 2.
-
e.Save the output file as `04_parsePeak/pbmc.filteredNfixed.union.peakSet`.
>R
>library("data.table")
>library("GenomicRanges")
>library("BSgenome")
>source("path_to_software/scATAC-seq-Analysis/bin/funtion_peakmerging.R")
>inF = "path_to_project/results/pbmc.naiveSummitList.list"
>genome = "hg38"
>chromF = "path_to_resource/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/fasta/chrom_hg38.sizes"
>outDir = "path_to_project/results/04_parsePeak/"
>outF = "pbmc"
>summitF <- read.table(inF, sep="\t", header=F)
>label.lst <- as.character(summitF$V1)
>file.lst <- as.character(summitF$V2)
>peak.list = lapply(seq(file.lst), function(i){
p.gr <- read2gr(file.lst[i], label=label.lst[i])
p.gr <- extendSummit(p.gr, size=500)
p.gr <- filter4chrom(p.gr, chromF)
p.gr <- filter4N(p.gr, genome=genome)
p.gr <- nonOverlappingGR(p.gr, by = "score", decreasing = TRUE)
p.gr <- norm2spm(p.gr, by="score")
p.gr
})
>for(i in 1:length(label.lst)){
outPeak <- as.data.frame(peak.list[[i]])
outFname <- paste(outDir, label.lst[i], ".filterNfixed.peakset", sep="")
fwrite(outPeak, file=outFname, sep="\t", quote = F, col.names = T, row.names = F)
}
# merge
>merged.gr <- do.call(c, peak.list)
# filter reproducible peaks by choosing a spm cut-off of 2
>merged.filtered.gr <- merged.gr[which(mcols(merged.gr)$spm >2),]
>outUnion <- as.data.frame(merged.filtered.gr)
>outfname = paste(outDir, outF, ".filteredNfixed.union.peakSet",sep="")
>fwrite(outUnion, file=outfname, sep="\t", quote = F, col.names = T, row.names = F)
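The sub-steps above can be traced on four toy summits. This sketch rests on several labeled assumptions: summits are turned into fixed 500-bp windows centred on the summit, overlapping peaks are resolved greedily by keeping the higher-scoring one, and “score per million” is the score divided by the sum of scores times one million. The repository functions (`extendSummit`, `nonOverlappingGR`, `norm2spm`) may differ in detail; `keep_non_overlapping` is a hypothetical helper.

```r
summits <- data.frame(chr = "chr1",
                      summit = c(10000, 10300, 50000, 90000),
                      score  = c(40, 10, 25, 0.00005))
# Assumption: a fixed 500-bp window centred on each summit
peaks <- transform(summits, start = summit - 250, end = summit + 250)

# Hypothetical helper: walk peaks from highest to lowest score and keep a peak
# only if it does not overlap a peak that has already been kept
keep_non_overlapping <- function(df) {
  df <- df[order(df$score, decreasing = TRUE), ]
  kept <- df[0, ]
  for (i in seq_len(nrow(df))) {
    clash <- kept$chr == df$chr[i] & kept$start < df$end[i] & kept$end > df$start[i]
    if (!any(clash)) kept <- rbind(kept, df[i, ])
  }
  kept[order(kept$chr, kept$start), ]
}

kept <- keep_non_overlapping(peaks)
kept$spm <- kept$score / sum(kept$score) * 1e6   # assumed "score per million" definition
final <- kept[kept$spm > 2, ]                    # cut-off of 2, as in the protocol
print(final)
```

Normalizing to a per-million scale makes scores comparable between cell types with different numbers of peaks, so a single cut-off can be applied to the merged set.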
-
16.Add the peak matrix to the ArchR project.
-
a.Annotate the peak sets using `ArchR:::.fastAnnoPeaks` function.
-
b.Add the annotated peak sets to the `peakSet` slot of the ArchR object using `addPeakSet` function.
-
c.Create the `PeakMatrix` and save it in Arrow files using `addPeakMatrix` function.
Note: Peak annotation in scATAC-seq involves the identification of genomic regions with open chromatin, which likely correspond to active regulatory elements such as promoters, enhancers, and transcription factor binding sites. Once peaks are identified, they can be annotated using publicly available databases, such as the ENCODE project,22 which provide information on the genomic features and functional annotations of the identified regions. For example, the ENCODE project can provide information on whether a peak is located near a TSS, exon, or intron, and whether it overlaps with a known regulatory element. Peak annotation is valuable for understanding the regulatory landscape of individual cells, identifying cell type-specific enhancers, and determining the potential targets of transcription factors.
>R
>library(ArchR)
>library(data.table)
>set.seed(10)
>addArchRGenome("hg38")
>proj <- readRDS("path_to_project/results/02_clustering/pbmc_scATAC_annotation.rds")
>peakset <- as.data.frame(fread("path_to_project/results/04_parsePeak/pbmc.filteredNfixed.union.peakSet",header=T))
>peakSet_gr <- GRanges(
peakset$seqnames,
IRanges(peakset$start, peakset$end)
)
>mcols(peakSet_gr)$score <- peakset$score
>mcols(peakSet_gr)$name <- peakset$name
>mcols(peakSet_gr)$Group <- peakset$label
>mcols(peakSet_gr)$spm <- peakset$spm
>genomeAnnotation = getGenomeAnnotation(proj)
>BSgenome <- eval(parse(text = genomeAnnotation$genome))
>BSgenome <- validBSgenome(BSgenome)
>geneAnnotation = getGeneAnnotation(proj)
>promoterRegion = c(2000, 100)
>peakSet_gr <- ArchR:::.fastAnnoPeaks(peakSet_gr, BSgenome = BSgenome, geneAnnotation = geneAnnotation, promoterRegion = promoterRegion)
>proj <- addPeakSet(
ArchRProj = proj,
peakSet = peakSet_gr,
force = TRUE
)
>proj <- addPeakMatrix(proj)
-
17.Identify differentially accessible regions.
-
a.Identify DARs between two or more groups of cells and annotate these regions as cell-type-specific marker features using the `getMarkerFeatures` function in ArchR.
-
b.Visualize the DARs for each cell type (Figure 2A).
Note: A differentially accessible region (DAR) is a genomic region that exhibits substantial differences in chromatin accessibility between various biological conditions or cell types. The discovery of DARs can elucidate the molecular processes underlying variations in gene expression and cellular function across different biological states.20 DARs are critical for understanding gene expression regulation, developmental processes, and diseases such as cancer.
Note: Alternatively, in a Signac workflow, Seurat's `FindAllMarkers` function performs differential accessibility analysis using statistical tests such as the Wilcoxon rank sum test or Student's t-test. It reports the log2 fold change and the statistical significance of each DAR and filters the results based on user-defined thresholds.
>markersPeaks <- getMarkerFeatures(
ArchRProj = proj,
useMatrix = "PeakMatrix",
groupBy = "predictedsubGroup",
bias = c("TSSEnrichment", "log10(nFrags)"),
testMethod = "wilcoxon"
)
>heatmapPeaks <- markerHeatmap(
seMarker = markersPeaks,
cutOff = "FDR <= 0.1 & Log2FC >= 0.5",
transpose = TRUE
)
>plotPDF(heatmapPeaks, name = "Peak-Marker-Heatmap", width = 8, height = 6, ArchRProj = proj, addDOC = FALSE)
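The statistical core of this step can be sketched on a simulated peak-by-cell matrix: a per-peak Wilcoxon rank sum test, Benjamini-Hochberg correction, and the same cutoffs as in the heatmap call above. This is a simplified illustration only; it does not reproduce the bias-matched background selection that `getMarkerFeatures` performs, and all object names are illustrative.

```r
set.seed(1)
n_peaks <- 200
grp <- rep(c("B", "other"), times = c(40, 60))
# Simulated normalised accessibility (peaks x cells); the first 20 peaks are more open in B cells
acc <- matrix(rexp(n_peaks * 100, rate = 1), nrow = n_peaks)
acc[1:20, grp == "B"] <- acc[1:20, grp == "B"] + 3

is_b <- grp == "B"
# Wilcoxon rank sum test per peak: B cells versus all other cells
pval <- apply(acc, 1, function(x) wilcox.test(x[is_b], x[!is_b])$p.value)
# Log2 fold change of mean accessibility, with a small pseudo-count
log2fc <- log2((rowMeans(acc[, is_b]) + 1e-3) / (rowMeans(acc[, !is_b]) + 1e-3))
fdr <- p.adjust(pval, method = "BH")

dar <- which(fdr <= 0.1 & log2fc >= 0.5)   # same cutoffs as the heatmap call above
length(dar)
```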
-
18.
Add motif annotation (Figure 2B).
Perform motif annotation, motif enrichment analysis, and deviation calculation for the ArchR project.
a.Add the motif annotations to the `peakAnnotation` slot named 'Motif' (proj@peakAnnotation$Motif) using the 'cisbp' motif set.
-
b.Perform motif enrichment using peak annotations derived from marker peaks (identified by the `getMarkerFeatures` function in Step 17), applying a false discovery rate cutoff (FDR <= 0.01) and a log2 fold change cutoff (Log2FC >= 0.5).
-
c.Add background peaks to the project using the addBgdPeaks function. These peaks are essential for computing deviations as they model the expected distribution of chromatin accessibility based on peak characteristics, including GC content and fragment counts.
-
d.Compute per-cell deviations across all motif annotations using the `addDeviationsMatrix` function, generate a deviations matrix named 'MotifMatrix' and save it to Arrow files for the specified peak annotation.
Note: This motif matrix represents how much each individual cell's accessibility differs from a general background expectation derived from the motifs. This process enables researchers to identify and visualize the significance of specific transcription factor binding motifs within their dataset, thereby enhancing the understanding of the regulatory mechanisms driving changes in gene expression.
>proj <- addMotifAnnotations(ArchRProj = proj, motifSet = "cisbp", name = "Motif")
>enrichMotifs <- peakAnnoEnrichment(
seMarker = markersPeaks,
ArchRProj = proj,
peakAnnotation = "Motif",
cutOff = "FDR <= 0.01 & Log2FC >= 0.5"
)
>heatmapEM <- plotEnrichHeatmap(enrichMotifs, n = 7, transpose = TRUE)
>plotPDF(heatmapEM, name = "Motifs-Enriched-Marker-Heatmap", width = 8, height = 6, ArchRProj = proj, addDOC = FALSE)
>proj <- addBgdPeaks(proj)
>proj <- addDeviationsMatrix(
ArchRProj = proj,
peakAnnotation = "Motif",
force = TRUE
)
-
19.Get co-accessibility.
-
a.Calculate the co-accessibility by running the `addCoAccessibility` function.
Note: The function takes as input a peak-by-cell matrix of single-cell ATAC-seq data and a list of putative enhancers and promoters, and outputs a co-accessibility matrix and a list of putative interactions.
Note: In scATAC-seq analysis, co-accessibility means that two or more genomic regions tend to be accessible in the same cells. It can be measured by calculating the Pearson correlation coefficient between the accessibility of genomic regions in scATAC-seq data. Co-accessibility can be used to explore the interactions and relationships between different regions of the genome and their role in regulating gene expression, and to study genomic structural changes across cell types, developmental stages, and biological processes. Co-accessibility between enhancers and target genes can also be calculated with Cicero, which provides a co-accessibility matrix in which each element represents the degree of co-accessibility between any two enhancers or target genes. Specifically, Cicero uses the Pearson correlation coefficient to calculate the co-accessibility matrix and identifies the cell-type-specific co-occurrence patterns between enhancers and target genes by clustering ATAC-seq peak regions across all single cells.25
>proj <- addCoAccessibility(
ArchRProj = proj,
k = 10,
maxDist = 250000,
reducedDims = "IterativeLSI"
)
-
b.Plot browser tracks of co-accessibility (Figure 2C).
>markerGenes <- c(
"CD3D","CD4","CD8A", #T cells
"CD19", #B cells
"XBP1", #plasma
"CD14", #Monocytes
"GNLY" # NK cells
)
>p <- plotBrowserTrack(
ArchRProj = proj,
groupBy = "predictedsubGroup",
geneSymbol = markerGenes,
upstream = 50000,
downstream = 50000,
loops = getCoAccessibility(proj)
)
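The notion of co-accessibility described in this step can be illustrated with a few simulated peaks: Pearson correlation is computed only for peak pairs that lie within a maximum genomic distance (250 kb, matching `maxDist` above). This sketch assumes accessibility has already been summarised over groups of similar cells; it is not ArchR's implementation, and all object names are illustrative.

```r
set.seed(1)
n_groups <- 100                                   # aggregates of similar cells (assumed)
peaks <- data.frame(chr = "chr1", mid = c(10000, 60000, 200000, 900000))
shared <- rnorm(n_groups)
acc <- rbind(shared + rnorm(n_groups, sd = 0.3),  # peak 1
             shared + rnorm(n_groups, sd = 0.3),  # peak 2 co-varies with peak 1
             rnorm(n_groups),                     # peak 3 varies independently
             shared + rnorm(n_groups, sd = 0.3))  # peak 4 co-varies but lies too far away

max_dist <- 250000
pairs <- t(combn(nrow(peaks), 2))                 # all peak pairs on this chromosome
dist_bp <- abs(peaks$mid[pairs[, 1]] - peaks$mid[pairs[, 2]])
pairs <- pairs[dist_bp <= max_dist, , drop = FALSE]

links <- data.frame(
  peak1 = pairs[, 1],
  peak2 = pairs[, 2],
  correlation = apply(pairs, 1, function(p) cor(acc[p[1], ], acc[p[2], ]))
)
print(links)
```

Pairs with a high correlation are the candidate loops drawn in the browser tracks; peak 4 is never tested here because it lies beyond the distance limit.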
Figure 2.
The biological characteristics of scATAC-seq data
(A) Heatmap plot of differentially accessible regions (DARs) in each cell type.
(B) Heatmap plot of enriched motifs in each cell type.
(C) Track plot of CD3D gene with peak co-accessibility.
Integrated with scRNA-seq data to infer gene regulatory network
Timing: 2 h
Here, we detail the workflow for constructing gene regulatory networks. Initially, we integrate scATAC-seq data with scRNA-seq data to create in-silico pseudo-multiomics cell datasets. Using these synthetic datasets, we then calculate the correlation between the peaks and gene expression, known as peak-gene linkage. Thereafter, the gene regulatory network is inferred by combining the peak-gene linkage with peak-peak co-accessibility.
-
20.Integrated analysis with scRNA-seq data. Troubleshooting 5.
-
a.Load the 3K PBMC scRNA-seq dataset from SeuratData.
-
b.Extract the gene score matrix from Arrow files using the `ArchR:::.getPartialMatrix` function. Use this gene score matrix as input to create the Seurat object.
-
c.Integrate the scATAC and scRNA-seq data using the Seurat `FindIntegrationAnchors` function.
- d.
-
e.The results of cell type annotation are presented in Figure 3B.
Note: scATAC-seq measures chromatin accessibility to identify cell-type-specific regulatory mechanisms, whereas scRNA-seq profiles gene expression to identify cell-type-specific expression programs. An integrated study of scRNA-seq and scATAC-seq provides a more thorough molecular profile of individual cells and their identities. Currently, the majority of integrative analysis methods are based on the gene score matrix from scATAC-seq datasets and the gene expression matrix from scRNA-seq datasets. As a result, integrating scATAC-seq and scRNA-seq data poses challenges similar to those of integrating scRNA-seq datasets, such as batch correction. Several tools designed for the integrated analysis of scATAC-seq and scRNA-seq perform well. Examples include scJoint, which is based on a semi-supervised neural network framework,26 and Seurat’s integration method, which uses canonical correlation analysis (CCA) for batch correction and dimensionality reduction to remove batch effects and identify shared sources of variation between datasets. To facilitate the creation of a pseudo multi-omics dataset, this tutorial uses Seurat’s CCA and Harmony for the integrative analysis.
>library(ArchR)
>library(Seurat)
>library(Signac)
>library(future)
>library(future.apply)
>library(parallel)
>library(dplyr)
>library(pbmc3k.SeuratData)
>system("mkdir path_to_project/results/06_integrated")
>data("pbmc3k")
>pbmc_rna <- pbmc3k
>DefaultAssay(pbmc_rna) <- "RNA"
>pbmc_rna <- NormalizeData(pbmc_rna)
>pbmc_rna <- FindVariableFeatures(pbmc_rna,nfeatures = 3000)
>pbmc_rna <- ScaleData(pbmc_rna)
>pbmc_rna <- RunPCA(pbmc_rna)
>pbmc_rna <- FindNeighbors(pbmc_rna, dims = 1:30)
>pbmc_rna <- FindClusters(pbmc_rna)
>genesUse <- VariableFeatures(object = pbmc_rna)
>pbmc_atac <- readRDS("path_to_project/results/04_parsePeak/pbmc_scATAC_addpeakMatrix.rds")
>allCells <- pbmc_atac$cellNames
>useMatrix <- "GeneScoreMatrix"
>geneDF <- ArchR:::.getFeatureDF(getArrowFiles(pbmc_atac), useMatrix)
>GeneScoreMatrix <- ArchR:::.getPartialMatrix(
ArrowFiles = getArrowFiles(pbmc_atac),
featureDF = geneDF[geneDF$name %in% genesUse,],
threads = 10,
cellNames = allCells,
useMatrix = useMatrix,
verbose = FALSE)
>rownames(GeneScoreMatrix) <- geneDF[geneDF$name %in% genesUse, "name"]
>mat <- log(GeneScoreMatrix + 1)
>Seurat_ATAC_pbmc <- CreateSeuratObject(
counts = GeneScoreMatrix,
assay = 'GeneScore',
project = 'ATAC',
min.cells = 1,
meta.data = as.data.frame(pbmc_atac@cellColData))
>rm(list=c("mat","GeneScoreMatrix"))
>gc()
>Seurat_ATAC_pbmc <- ScaleData(Seurat_ATAC_pbmc, verbose = FALSE)
>DefaultAssay(Seurat_ATAC_pbmc) <- 'GeneScore'
>Seurat_ATAC_pbmc$Tech <- "ATAC"
>pbmc_rna$Tech <- "RNA"
>plan("multicore", workers = 10)
>options(future.globals.maxSize = 100 * 1024^3)
>DefaultAssay(pbmc_rna) <- 'RNA'
>reference.list <- c(pbmc_rna, Seurat_ATAC_pbmc)
>names(reference.list) <- c("RNA", "ATAC")
>rna_atac.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
>rna_atac_integrated <- IntegrateData(anchorset = rna_atac.anchors, dims = 1:30)
>rna_atac_integrated <- ScaleData(object = rna_atac_integrated, verbose = F)
>rna_atac_integrated <- RunPCA(object = rna_atac_integrated, verbose = F)
>rna_atac_integrated <- FindNeighbors(object = rna_atac_integrated, dims = 1:30)
>rna_atac_integrated <- FindClusters(object = rna_atac_integrated, resolution = 0.5)
>library(harmony)
>rna_atac_integrated <- RunHarmony(rna_atac_integrated, "Tech")
>rna_atac_integrated <- RunUMAP(rna_atac_integrated, reduction = "harmony",reduction.name = "UMAPHarmony",dims = 1:30)
>rna_atac_integrated$Merged_cluster <- ifelse(rna_atac_integrated@meta.data$Tech == "RNA",
rna_atac_integrated@meta.data$seurat_annotations, rna_atac_integrated@meta.data$predictedsubGroup)
>saveRDS(rna_atac_integrated, "path_to_project/results/06_integrated/pbmc_scRNA_scATAC_integrated_harmony.rds")
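The CCA idea behind step 20c can be illustrated on toy data with base R alone. The sketch below is not Seurat's implementation: it assumes two simulated matrices that share the same features (standing in for gene expression and gene scores), standardizes each cell, and takes the singular value decomposition of their cross-product so that cells from both assays receive coordinates in one shared space. All object names and the simulated data are hypothetical.

```r
# Toy sketch of a diagonalized CCA between two assays that share features.
# All names and data are illustrative assumptions, not protocol objects.
set.seed(1)
n_features <- 50
n_rna <- 30
n_atac <- 40

# Simulate a shared two-group signal measured by both assays (features x cells)
signal <- matrix(rnorm(n_features * 2), nrow = n_features)
group_rna <- rep(c(1, 2), length.out = n_rna)
group_atac <- rep(c(1, 2), length.out = n_atac)
rna <- signal[, group_rna] +
  matrix(rnorm(n_features * n_rna, sd = 0.3), nrow = n_features)
atac <- signal[, group_atac] +
  matrix(rnorm(n_features * n_atac, sd = 0.3), nrow = n_features)

# Standardize every cell (column); scale() centers and scales column-wise
rna_std <- scale(rna)
atac_std <- scale(atac)

# Cross-product links every RNA cell to every ATAC cell (n_rna x n_atac)
cross <- crossprod(rna_std, atac_std)

# Leading singular vectors give shared coordinates for both sets of cells
n_cc <- 2
sv <- svd(cross, nu = n_cc, nv = n_cc)
embedding <- rbind(sv$u, sv$v)
rownames(embedding) <- c(paste0("rna_", seq_len(n_rna)),
                         paste0("atac#", seq_len(n_atac)))
colnames(embedding) <- paste0("CC_", seq_len(n_cc))
```

In the protocol itself, anchor finding and integration are handled by `FindIntegrationAnchors` and `IntegrateData`; the sketch only shows why a shared feature space (here, genes with both expression values and gene scores) is needed before the two assays can be co-embedded.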
-
21. Pair cells from scRNA-seq with cells from scATAC-seq. Troubleshooting 6.
-
a. Pair scRNA-seq and scATAC-seq cells using the `pairCells` function from the FigR package with Harmony-based dimensionality reduction.
-
b. Rename the uniquely paired cells for downstream analysis.
>R
>library(Seurat)
>library(FigR)
>library(optmatch)
>library(dplyr)
# 01 Pair scRNA-seq and scATAC-seq cells with FigR
## Harmony-corrected embedding (variable names keep the CCA_ prefix)
>object <- readRDS("path_to_project/results/06_integrated/pbmc_scRNA_scATAC_integrated_harmony.rds")
>CCA_PCs <- object@reductions$harmony@cell.embeddings
>isATAC <- grepl("#",rownames(CCA_PCs))
>table(isATAC) # ATAC vs RNA
>ATACcells <- rownames(CCA_PCs)[isATAC]
>RNAcells <- rownames(CCA_PCs)[!isATAC]
>ATAC_PCs <- CCA_PCs[isATAC,]
>RNA_PCs <- CCA_PCs[!isATAC,]
>pairing <- pairCells(ATAC = ATAC_PCs,
RNA = RNA_PCs,
keepUnique = TRUE
)
## deduplicate
>if (length(unique(pairing$RNA)) > length(unique(pairing$ATAC))) {
dupli_ATAC <- unique(pairing$ATAC[duplicated(pairing$ATAC)])
dupli_pairing <- pairing[which(pairing$ATAC %in% dupli_ATAC),]
dupli_pairing <- data.frame(ATAC = dupli_pairing$ATAC,RNA = dupli_pairing$RNA, dist = dupli_pairing$dist)
## deduplicate
dedupli_pairing <- dupli_pairing %>% group_by(ATAC) %>% top_n(n=-1, wt=dist)
unique_pairing <- pairing[which(! pairing$ATAC %in% dupli_ATAC),]
total_uniq_pairing <- rbind(unique_pairing,dedupli_pairing)
}else {
dupli_RNA <- unique(pairing$RNA[duplicated(pairing$RNA)])
dupli_pairing <- pairing[which(pairing$RNA %in% dupli_RNA),]
dupli_pairing <- data.frame(ATAC = dupli_pairing$ATAC,RNA = dupli_pairing$RNA, dist = dupli_pairing$dist)
## deduplicate
dedupli_pairing <- dupli_pairing %>% group_by(RNA) %>% top_n(n=-1, wt=dist)
unique_pairing <- pairing[which(! pairing$RNA %in% dupli_RNA),]
total_uniq_pairing <- rbind(unique_pairing,dedupli_pairing)
}
>paired <- data.frame(ATAC = total_uniq_pairing$ATAC, RNA = total_uniq_pairing$RNA, muti = paste0("multi_cell","_",c(1:length(total_uniq_pairing$ATAC))))
>saveRDS(paired,"path_to_project/results/06_integrated/paired.rds")
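The deduplication in step 21b keeps, for each repeated cell, the partner at the smallest distance. A minimal base R sketch of that rule follows, using a hypothetical pairing table with the same three columns (`ATAC`, `RNA`, `dist`) as the code above; the cell names and distances are made up for illustration. Unlike `top_n`, which keeps ties, this version keeps exactly one row per key.

```r
# Hypothetical pairing table; names and distances are illustrative only
pairing <- data.frame(
  ATAC = c("s1#AAAC", "s1#AAAC", "s1#CCGT", "s1#GGTA"),
  RNA  = c("rna_1", "rna_2", "rna_3", "rna_4"),
  dist = c(0.40, 0.10, 0.25, 0.30),
  stringsAsFactors = FALSE
)

# For every value of `key`, keep only the row with the smallest distance
dedup_pairs <- function(pairing, key) {
  ordered <- pairing[order(pairing[[key]], pairing$dist), ]
  ordered[!duplicated(ordered[[key]]), ]
}

unique_pairs <- dedup_pairs(pairing, key = "ATAC")

# Give each retained pair a shared pseudo-cell name
unique_pairs$multi <- paste0("multi_cell_", seq_len(nrow(unique_pairs)))
```

Passing `key = "RNA"` instead applies the same rule to repeated scRNA-seq cells, which mirrors the second branch of the `if` statement above.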
-
22. Create in-silico pseudo-multiomics cells.
- Generate multi-omics objects containing integrated peak-cell and gene-cell matrices.
>library(tidyverse)
>library(Signac)
>library(data.table)
>library(GenomicRanges)
>library(Seurat)
>library(Pando)
>library("BSgenome.Hsapiens.UCSC.hg38")
>library(ArchR)
>addArchRGenome("hg38")
>proj <- readRDS("path_to_project/results/04_parsePeak/pbmc_scATAC_addpeakMatrix.rds")
>useMatrix <- "PeakMatrix"
>paired <- readRDS("path_to_project/results/06_integrated/paired.rds")
>peakDF <- ArchR:::.getFeatureDF(getArrowFiles(proj), useMatrix)
>peaks <- paste(peakDF$seqnames, peakDF$start,peakDF$end,sep="_")
>cells <- paired$ATAC
>PeakMatrix <- ArchR:::.getPartialMatrix(
ArrowFiles = getArrowFiles(proj),
featureDF = peakDF,
threads = 10,
cellNames = cells,
useMatrix = useMatrix,
verbose = FALSE
)
>rownames(PeakMatrix) <- peaks
>geneAnnotation <- geneAnnoHg38
>genes <- geneAnnotation$genes
>exons <- geneAnnotation$exons
>genesdf <- data.frame(seqnames = seqnames(genes),start = start(genes),end = end(genes),strand = strand(genes),gene_id = genes$gene_id,symbol = genes$symbol,gene_name = genes$symbol,type = "gene")
>exonsdf <- data.frame(seqnames = seqnames(exons),start = start(exons),end = end(exons),strand = strand(exons),gene_id = exons$gene_id,symbol = exons$symbol,gene_name = exons$symbol,type = "exon")
>gene_annot_df <- rbind(genesdf,exonsdf)
>gene_annot <- makeGRangesFromDataFrame(gene_annot_df,keep.extra.columns= T)
>peakassay <- CreateChromatinAssay(PeakMatrix, sep=c("_","_"),annotation = gene_annot)
>ATAC <- CreateSeuratObject(
counts = peakassay,
assay = "peaks"
)
>ATAC <- ATAC[,paired$ATAC]
>ATAC <- RenameCells(ATAC,old.names = colnames(ATAC),new.names = paired$muti)
>coembed <- readRDS("path_to_project/results/06_integrated/pbmc_scRNA_scATAC_integrated_harmony.rds")
>RNA <- coembed[,paired$RNA]
>RNA <- RenameCells(RNA,old.names = colnames(RNA),new.names = paired$muti)
>multi <- RNA
>multi[["peaks"]] <- ATAC[["peaks"]]
>saveRDS(multi,"path_to_project/results/06_integrated/pbmc_multiomic_object.rds")
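The core of step 22 is that both modalities are subset in the same pairing order and then given identical pseudo-cell names, so that column k of the peak matrix and column k of the gene matrix describe the same in-silico cell. The base R sketch below shows this on small hypothetical matrices; the peak names, gene names, and counts are invented, and a plain list stands in for the Seurat object.

```r
set.seed(2)
# Hypothetical toy matrices: peaks x ATAC cells and genes x RNA cells
peak_mat <- matrix(
  rpois(12, 2), nrow = 3,
  dimnames = list(c("chr1_100_200", "chr1_500_600", "chr1_900_1000"),
                  c("s1#A", "s1#B", "s1#C", "s1#D"))
)
gene_mat <- matrix(
  rpois(15, 5), nrow = 3,
  dimnames = list(c("CD3E", "MS4A1", "LYZ"), paste0("rna_", 1:5))
)

# Hypothetical pairing table in the same layout as paired.rds
paired <- data.frame(ATAC = c("s1#C", "s1#A"),
                     RNA = c("rna_4", "rna_2"),
                     stringsAsFactors = FALSE)
paired$multi <- paste0("multi_cell_", seq_len(nrow(paired)))

# Subset each modality in pairing order, then apply the shared names
peak_sub <- peak_mat[, paired$ATAC, drop = FALSE]
gene_sub <- gene_mat[, paired$RNA, drop = FALSE]
colnames(peak_sub) <- paired$multi
colnames(gene_sub) <- paired$multi

pseudo_multiome <- list(RNA = gene_sub, peaks = peak_sub)
```

Because the renaming relies purely on column order, the subsetting must follow the order of the pairing table in both modalities, as it does in the protocol code above.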
-
23. Infer GRN.
-
a. Infer the gene regulatory network (GRN) using Pando with the pseudo-multiomics object. The global GRN is visualized in Figure 3C; consult the Pando documentation (https://quadbio.github.io/Pando/articles/getting_started.html) for implementation details.
Note: Gene expression can be regulated at three distinct levels: DNA, transcriptional control, and translational control. At the DNA level, gene silencing occurs when nucleosomes are densely packed or when DNA is methylated. As a result, the cis-regulatory elements (CREs) involved in regulating gene expression (such as promoters, enhancers, insulators, and silencers) remain relatively closed, making it difficult for trans-acting factors, including transcription factors (TFs), to bind to them. When chromatin accessibility increases, TFs can bind these cis-acting elements effectively and promote the transcription of target genes into mRNA. Therefore, chromatin opening is a prerequisite for gene expression. During transcription, once receptor proteins receive signals for gene expression, activated TFs bind to cis-acting elements to promote the expression of target genes. This process is also regulated by other TFs and by feedback mechanisms related to the expression of the target genes. A series of TFs and proteins can collectively govern the expression of a target gene, which in turn can influence the expression of other genes. Enhancer-driven gene regulatory networks (eGRNs) involve TFs interacting with groups of CREs to regulate the transcription of their target genes. In an eGRN, a regulon consists of a TF along with a collection of CREs and the regulated target genes.
Note: With the increasing availability of scRNA-seq and scATAC-seq datasets, co-expression networks derived from scRNA-seq data have been combined with co-accessible candidate CREs from scATAC-seq to create eGRNs. Tools such as SCENIC+ (https://scenicplus.readthedocs.io/en/latest/), Pando (https://quadbio.github.io/Pando/), STREAM (https://github.com/OSU-BMBL/STREAM/), and CellOracle (https://morris-lab.github.io/CellOracle.documentation/) have been designed to predict eGRNs. In this tutorial, Pando is used to create the eGRN because it allows the regulatory network to be inferred directly from a Seurat object that integrates transcriptomic and chromatin accessibility data.
>library(tidyverse)
>library(Seurat)
>library(FigR)
>library(optmatch)
>library(data.table)
>library(Pando)
>library("BSgenome.Hsapiens.UCSC.hg38")
>muo_data <- read_rds('path_to_project/results/06_integrated/pbmc_multiomic_object.rds')
>muo_data <- initiate_grn(muo_data,
rna_assay = 'RNA',
peak_assay = 'peaks',
regions = phastConsElements20Mammals.UCSC.hg38,
exclude_exons = TRUE)
>data('motifs')
>data('motif2tf')
>muo_data <- find_motifs(muo_data,
pfm = motifs,
motif_tfs = motif2tf,
genome = BSgenome.Hsapiens.UCSC.hg38)
>genesUsed <- unique(rownames(muo_data@assays$RNA@data))
>regions <- NetworkRegions(muo_data)
>muo_data <- infer_grn(muo_data,
peak_to_gene_method = 'Signac',
genes = genesUsed,
parallel = F)
>muo_data <- find_modules(muo_data,
p_thresh = 0.1,
nvar_thresh = 2,
min_genes_per_module = 1,
rsq_thresh = 0.05)
>saveRDS(muo_data,"path_to_project/results/06_integrated/pbmc_GRN.rds")
>muo_data <- get_network_graph(muo_data, graph_name='umap_graph')
>plot_network_graph(muo_data, graph='umap_graph')
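To make the regulon concept in the notes above concrete, the following base R sketch fits a linear model in which a target gene depends on the product of TF expression and the accessibility of a TF-bound region. This illustrates the general idea of linking TF, CRE, and target gene only; it is not the model that `infer_grn` fits internally, and every variable and simulated value is a hypothetical stand-in.

```r
set.seed(3)
n_cells <- 200

# Simulated per-cell measurements (all hypothetical)
tf_expr  <- runif(n_cells, 0, 2)      # expression of a TF with a real effect
decoy_tf <- runif(n_cells, 0, 2)      # expression of a TF with no effect
peak_acc <- rbinom(n_cells, 1, 0.5)   # accessibility of a candidate CRE

# The target gene responds to the TF only when the CRE is accessible
target <- 2 * tf_expr * peak_acc + rnorm(n_cells, sd = 0.1)

# Interaction terms are precomputed so the coefficient names are explicit
tf_x_peak    <- tf_expr * peak_acc
decoy_x_peak <- decoy_tf * peak_acc

fit <- lm(target ~ tf_x_peak + decoy_x_peak)
coef_table <- summary(fit)$coefficients
```

In `coef_table`, the `tf_x_peak` term recovers the simulated effect while `decoy_x_peak` stays near zero; filtering such coefficients by significance and explained variance is the kind of selection that thresholds like `p_thresh` and `rsq_thresh` control in `find_modules`.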
-
b. Plot TF and target gene expression of the eRegulon (Figure 3D) to identify cell-type-specific regulons.
-
i. Calculate the regulon specificity score (RSS) for each eRegulon in different cell types.
-
ii. Assess the expression levels of the transcription factors (TFs) of each eRegulon in each cell type.
>library(Seurat)
>library(Signac)
>library(tidyverse)
>library(AUCell)
>library(SCENIC)
>library(Pando)
>object <- readRDS("path_to_project/results/06_integrated/pbmc_GRN.rds")
>grn_module <- object@grn@networks$glm_network@modules@meta
>grn_module_2 <- grn_module[which(grn_module$n_genes > 1),]
>tf <- unique(grn_module_2$tf)
>tf_target_meta <- as.data.frame(grn_module_2)
>tf_target <- lapply(unique(tf),function(x){
targe_genes <- tf_target_meta[which(tf_target_meta$tf == x),"target"]
})
>names(tf_target) <- tf
>cellInfo <- data.frame(object@meta.data)
>assaydata <- object@assays$RNA@data
>assaydata <- assaydata[,which(colnames(assaydata) %in% rownames(cellInfo))]
>cells_rankings <- AUCell_buildRankings(assaydata, nCores=1, plotStats=TRUE)
>cells_AUC <- AUCell_calcAUC(tf_target, cells_rankings)
>raw_rss <- calcRSS(AUC = cells_AUC, cellAnnotation = cellInfo[,"seurat_annotations"])
>rssPlot <- plotRSS(raw_rss, col.low="#473172", col.mid="#20988b", col.high="#f9e920", verbose=F)
>ggsave("path_to_project/results/06_integrated/celltype_target_gene_rss_dotplot.pdf",rssPlot$plot,width = 6,height=12)
>library(viridis)
>bk = seq(0.1, 1,by = 0.01)
>col_length = length(bk)
>if_1 = inferno(col_length/2)
>if_2 = rev(mako(col_length/2))
>col= append(if_2,if_1)
>library(pheatmap)
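>DataForDotPlot <- function(GeneList){
# Average normalized expression of each gene in each annotated cell type
Pv_M = as.matrix(object@assays$RNA@data)
Pv_IDENT = as.data.frame(object@meta.data)
Cluster_IDENT = names(table(Pv_IDENT$seurat_annotations))
HEAD = data.frame(GeneID = "character", Cluster = "character", average = 10.1)
for( i in GeneList){
for(j in Cluster_IDENT){
temp_Barcodes = rownames(Pv_IDENT[which(Pv_IDENT$seurat_annotations == j),])
# NOTE: the rest of this function is truncated in the source; the lines below are an assumed completion
HEAD = rbind(HEAD, data.frame(GeneID = i, Cluster = j, average = mean(Pv_M[i, temp_Barcodes])))
}
}
HEAD[-1,] # drop the placeholder first row
}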
>mat_tmp <- DataForDotPlot(levels(rssPlot$df$Topic))
>matrix_df <- reshape2::dcast(mat_tmp, GeneID ~ Cluster, value.var = "average", fill = 0)
>rownames(matrix_df) <- matrix_df$GeneID
>matrix_df <- matrix_df[,-1]
>matrix_df <- matrix_df[rev(levels(rssPlot$df$Topic)),]
>pheatmap(matrix_df, cluster_cols = F, cluster_rows = F,
show_rownames = T,
color = col,
scale = "row",
filename = "./celltype_TF_heatmap.pdf",width=6,height=8
)
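The regulon specificity score used in step 23b can be written out in a few lines. The sketch below assumes the commonly described definition, in which a regulon's activity across cells and a cell type's membership indicator are each normalized to a probability distribution and the score is one minus the square root of their Jensen-Shannon divergence (in bits). It is a hand-written illustration, not the `calcRSS` source, and the toy activity vector is hypothetical.

```r
# Shannon entropy in bits, ignoring zero-probability entries
entropy_bits <- function(p) {
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Jensen-Shannon divergence between two distributions over the same cells
jsd_bits <- function(p, q) {
  m <- (p + q) / 2
  entropy_bits(m) - (entropy_bits(p) + entropy_bits(q)) / 2
}

# Specificity of one regulon for one cell type (assumed definition)
regulon_specificity <- function(activity, is_cell_type) {
  p <- activity / sum(activity)
  q <- as.numeric(is_cell_type) / sum(is_cell_type)
  1 - sqrt(max(jsd_bits(p, q), 0))
}

# Toy example: ten cells, and a regulon that is active only in B cells
cell_type <- rep(c("B", "T"), times = c(4, 6))
activity_b_only <- as.numeric(cell_type == "B")

rss_in_b <- regulon_specificity(activity_b_only, cell_type == "B")
rss_in_t <- regulon_specificity(activity_b_only, cell_type == "T")
```

A regulon whose activity matches a cell type exactly scores close to 1 for that type, and one whose activity lies entirely outside it scores close to 0, which is the contrast the RSS dot plot is meant to expose.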
Figure 3.
The GRN constructed using scATAC-seq and scRNA-seq data
(A) UMAP embedding of PBMCs, colored by omics modality.
(B) UMAP plot of PBMCs from the integrated scATAC-seq and scRNA-seq data, colored by cell type.
(C) Global GRN.
(D) Heatmap displaying the expression of the TFs of each eRegulon, and dot plot showing the regulon specificity score (RSS), computed from the expression of TF target genes, for each eRegulon.
Expected outcomes
We have developed a practical workflow that covers most aspects of a typical scATAC-seq analysis, from FASTQ processing to downstream analysis. By following this protocol, users can expect to complete the analysis of scATAC-seq data, including cell clustering, cell type identification for each cluster, identification of differentially accessible elements among cell types, motif enrichment analysis, and inference of gene regulatory networks.
Limitations
The current protocol has several limitations. It does not cover every scATAC-seq application: integration with GWAS data, CNV calling, and lineage tracing are not included. Furthermore, the protocol cannot be applied identically to every dataset, so some steps may need to be adapted. For instance, if batch effects exist across platforms but not within samples, the “platform” factor should be accounted for to address these effects. Additionally, parameters such as doublet rates and clustering resolution may need adjustment for datasets from different species or library construction platforms.
Moreover, inferring gene regulatory networks (GRNs) by integrating scRNA-seq data is more feasible with approximately 10,000 cells, as both time and memory requirements increase dramatically with larger cell numbers. Furthermore, when using FigR to pair single-omics scRNA-seq and scATAC-seq data, the upper limit is around 25,000 cells.
Troubleshooting
Problem 1
Error message: “io_utils.c:16:10: fatal error: zlib.h: No such file or directory” while installing the FigR package (related to step 4 of the software preparation section).
Potential solution
Check whether `zlib.h` is installed on your system. If it is already installed, export the necessary environment variables.
-
•
Check installation:
`find / -name zlib.h`
-
•
Export environment variables (if needed):
`export C_INCLUDE_PATH=/path/to/zlib:$C_INCLUDE_PATH`
If the `zlib.h` file is not available, install the zlib development headers. On Debian or Ubuntu, use the following command: `sudo apt -y install zlib1g-dev`.
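The same check can be run from within the R session that performs the installation. The sketch below is one possible approach rather than part of the original protocol: the helper `find_header` and the list of candidate directories are assumptions, and the environment variable is only modified when the header is actually found.

```r
# Return the full paths of `header` found in any of `search_dirs`
find_header <- function(header, search_dirs) {
  candidates <- file.path(search_dirs, header)
  candidates[file.exists(candidates)]
}

# Common include locations; adjust for your system (assumed, not exhaustive)
default_dirs <- c("/usr/include", "/usr/local/include",
                  file.path(Sys.getenv("CONDA_PREFIX"), "include"))

hits <- find_header("zlib.h", default_dirs)

if (length(hits) > 0) {
  # Prepend the directory to C_INCLUDE_PATH for compilers started from R
  include_dir <- dirname(hits[1])
  old_path <- Sys.getenv("C_INCLUDE_PATH")
  new_path <- if (nzchar(old_path)) {
    paste(include_dir, old_path, sep = ":")
  } else {
    include_dir
  }
  Sys.setenv(C_INCLUDE_PATH = new_path)
} else {
  message("zlib.h not found; install the zlib development headers first.")
}
```

Variables set with `Sys.setenv` are inherited by processes launched from the same R session, so the FigR installation should be started from that session.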
Problem 2
Error message: “ERROR: dependencies ‘SummarizedExperiment’, ‘chromVAR’, ‘motifmatchr’, ‘GenomicRanges’, ‘ComplexHeatmap’ are not available for package ‘FigR’” (related to step 4 of the software preparation section).
Potential solution
Install each missing dependency before installing FigR. The packages named in this error are distributed through Bioconductor, so they can usually be installed with `BiocManager::install()`. Alternatively, search for the package name together with the installation route you use, for example 'R package SummarizedExperiment install' or 'conda install SummarizedExperiment', to find the appropriate installation method.
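A short helper can report which of the listed dependencies are still missing before FigR is installed. The function name `missing_packages` is hypothetical, and the installation call is left commented out because it requires network access and the BiocManager package.

```r
# Dependencies named in the error message above
figr_deps <- c("SummarizedExperiment", "chromVAR", "motifmatchr",
               "GenomicRanges", "ComplexHeatmap")

# Return the subset of `pkgs` that is not installed in the current library
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs,
                      function(p) nzchar(system.file(package = p)),
                      logical(1))
  pkgs[!installed]
}

to_install <- missing_packages(figr_deps)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment to install from Bioconductor:
  # if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
  # BiocManager::install(to_install)
}
```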
Problem 3
The addArchRGenome() function supports only the human and mouse genomes (related to step 3a).
Potential solution
-
•
Create a genomeAnnotation using the createGenomeAnnotation() function with a BSgenome object.
-
•
Create a geneAnnotation using the createGeneAnnotation() function with a TxDb object (transcript database) and an OrgDb object (organism database). Alternatively, if you do not have TxDb and OrgDb objects, you can create a geneAnnotation object from three GRanges objects: a genes object containing gene coordinates, an exons object containing exon coordinates, and a TSS object containing standard TSS coordinates. Details on creating the genomeAnnotation and geneAnnotation objects can be found at https://www.archrproject.com/bookdown/getting-set-up.html.
Problem 4
The batch effect was overcorrected (related to step 7).
Potential solution
Explicit batch correction is not necessary for all datasets. For datasets from the same platform and with similar samples, batch effects can be corrected by iterative latent semantic indexing (LSI) during dimensionality reduction with the `addIterativeLSI()` function.
Problem 5
Integrating more than 1 million scATAC-seq and scRNA-seq cells is not feasible with the “CCA” method (related to step 20).
Potential solution
The scanpy.tl.ingest function in Scanpy, which plays a role similar to the ‘CCA’ method in Seurat, can be used for integrated analysis of more than 1 million cells.
Problem 6
Error message: “Error in graph.adjacency.dense(adjmatrix, mode = mode, weighted = weighted, :long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537 Calls:pairCells ... graph from adjacency matrix -> graph.adjacency.dense” while implementing the ‘pairCells’ function in step 21.
Potential solution
When using FigR to pair single-omics scRNA-seq and scATAC-seq data, the limit is around 25,000 cells. Cells can be downsampled for pairing if necessary.
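Downsampling before pairing can be done with a small helper such as the one sketched below. The function name, the seed, and the example cell names are hypothetical; the cap of 25,000 cells is taken from the limit stated above.

```r
# Randomly keep at most `max_cells` names, preserving the original order
downsample_cells <- function(cell_names, max_cells, seed = 1) {
  if (length(cell_names) <= max_cells) {
    return(cell_names)
  }
  set.seed(seed)
  keep <- sort(sample.int(length(cell_names), max_cells))
  cell_names[keep]
}

# Hypothetical example: 30,000 ATAC cell names capped at 25,000
atac_cells <- paste0("s1#cell_", seq_len(30000))
atac_subset <- downsample_cells(atac_cells, max_cells = 25000)
```

In step 21, the returned names could be used to subset the rows of `ATAC_PCs` and `RNA_PCs` before calling `pairCells`. Uniform sampling may under-represent rare cell types, so sampling within clusters is worth considering when rare populations matter.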
Resource availability
Lead contact
For further information and resource requests, please contact the lead contact, Chuanyu Liu (liuchuanyu@genomics.cn).
Technical contact
Technical questions on executing this protocol should be directed to and will be answered by the technical contact, Wen Ma (mawen3@genomics.cn).
Materials availability
This study did not generate new unique materials.
Data and code availability
Analyses were conducted in R, and all the code described in this protocol has been deposited on GitHub: https://github.com/M-wen/scATAC-seq-Analysis. The main output files have been deposited to Zenodo: https://doi.org/10.5281/zenodo.14715304.
Acknowledgments
This work was supported by the Shenzhen Key Laboratory of Single-Cell Omics (ZDSYS20190902093613831), Zhejiang Science and Technology Department (2024C03004), and Hangzhou Science and Technology Department (2024SZD1B09). We thank all members of our team and acknowledge the high-performance computational capabilities provided by BGI Research. The cartoon illustrations used in the experiment process diagram were sourced from https://biogdp.com/.
Author contributions
W.M. analyzed the data and drafted the manuscript. P.M. and D.H. analyzed the data. K.L. edited the manuscript. Y.Y., P.C., and C.L. edited the manuscript and supervised the project.
Declaration of interests
The authors declare no competing interests.
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.xpro.2025.103960.
References
- 1. R Core Team. R Foundation for Statistical Computing; 2020. R: A Language and Environment for Statistical Computing. https://www.R-project.org/
- 2. Granja J.M., Corces M.R., Pierce S.E., Bagdatli S.T., Choudhry H., Chang H.Y., Greenleaf W.J. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 2021;53:403–411. doi: 10.1038/s41588-021-00790-6.
- 3. Lawrence M., Huber W., Pagès H., Aboyoun P., Carlson M., Gentleman R., Morgan M.T., Carey V.J. Software for Computing and Annotating Genomic Ranges. PLoS Comput. Biol. 2013;9. doi: 10.1371/journal.pcbi.1003118.
- 4. Huber W., Carey V.J., Gentleman R., Anders S., Carlson M., Carvalho B.S., Bravo H.C., Davis S., Gatto L., Girke T., et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods. 2015;12:115–121. doi: 10.1038/nmeth.3252.
- 5. Wickham H. The split-apply-combine strategy for data analysis. J. Stat. Softw. 2011;40:1–29.
- 6. Stoeckius M., Hafemeister C., Stephenson W., Houck-Loomis B., Chattopadhyay P.K., Swerdlow H., Satija R., Smibert P. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods. 2017;14:865–868. doi: 10.1038/nmeth.4380.
- 7. Stuart T., Srivastava A., Madad S., Lareau C.A., Satija R. Single-cell chromatin state analysis with Signac. Nat. Methods. 2021;18:1333–1341. doi: 10.1038/s41592-021-01282-5.
- 8. Kartha V.K., Duarte F.M., Hu Y., Ma S., Chew J.G., Lareau C.A., Earl A., Burkett Z.D., Kohlway A.S., Lebofsky R., Buenrostro J.D. Functional inference of gene regulation using single-cell multi-omics. Cell Genom. 2022;2. doi: 10.1016/j.xgen.2022.100166.
- 9. Hao Y., Stuart T., Kowalski M.H., Choudhary S., Hoffman P., Hartman A., Srivastava A., Molla G., Madad S., Fernandez-Granda C., Satija R. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 2024;42:293–304. doi: 10.1038/s41587-023-01767-y.
- 10. Wickham H., Averick M., Bryan J., Chang W., McGowan L., François R., Grolemund G., Hayes A., Henry L., Hester J., et al. Welcome to the Tidyverse. J. Open Source Softw. 2019;4:1686. doi: 10.21105/joss.01686.
- 11. Fleck J.S., Jansen S.M.J., Wollny D., Zenk F., Seimiya M., Jain A., Okamoto R., Santel M., He Z., Camp J.G., Treutlein B. Inferring and perturbing cell fate regulomes in human brain organoids. Nature. 2023;621:365–372. doi: 10.1038/s41586-022-05279-8.
- 12. Schep A.N., Wu B., Buenrostro J.D., Greenleaf W.J. chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods. 2017;14:975–978. doi: 10.1038/nmeth.4401.
- 13. Lawrence M., Gentleman R., Carey V. rtracklayer: an R package for interfacing with genome browsers. Bioinformatics. 2009;25:1841–1842. doi: 10.1093/bioinformatics/btp328.
- 14. Aibar S., González-Blas C.B., Moerman T., Huynh-Thu V.A., Imrichova H., Hulselmans G., Rambow F., Marine J.-C., Geurts P., Aerts J., et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods. 2017;14:1083–1086. doi: 10.1038/nmeth.4463.
- 15. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 16. Zheng G.X.Y., Terry J.M., Belgrader P., Ryvkin P., Bent Z.W., Wilson R., Ziraldo S.B., Wheeler T.D., McDermott G.P., Zhu J., et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 2017;8. doi: 10.1038/ncomms14049.
- 17. Thibodeau A., Eroglu A., McGinnis C.S., Lawlor N., Nehar-Belaid D., Kursawe R., Marches R., Conrad D.N., Kuchel G.A., Gartner Z.J., et al. AMULET: a novel read count-based method for effective multiplet detection from single nucleus ATAC-seq data. Genome Biol. 2021;22:252. doi: 10.1186/s13059-021-02469-x.
- 18. Zhang Y., Liu T., Meyer C.A., Eeckhoute J., Johnson D.S., Bernstein B.E., Nusbaum C., Myers R.M., Brown M., Li W., Liu X.S. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9. doi: 10.1186/gb-2008-9-9-r137.
- 19. Buenrostro J.D., Giresi P.G., Zaba L.C., Chang H.Y., Greenleaf W.J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods. 2013;10:1213–1218. doi: 10.1038/nmeth.2688.
- 20. Yan F., Powell D.R., Curtis D.J., Wong N.C. From reads to insight: a hitchhiker's guide to ATAC-seq data analysis. Genome Biol. 2020;21:22. doi: 10.1186/s13059-020-1929-3.
- 21. Ou J., Liu H., Yu J., Kelliher M.A., Castilla L.H., Lawson N.D., Zhu L.J. ATACseqQC: a Bioconductor package for post-alignment quality assessment of ATAC-seq data. BMC Genom. 2018;19:169. doi: 10.1186/s12864-018-4559-3.
- 22. Hitz B.C., Jin-Wook L., Jolanki O., Kagda M.S., Graham K., Sud P., Gabdank I., Strattan J.S., Sloan C.A., Dreszer T., et al. The ENCODE Uniform Analysis Pipelines. Preprint at bioRxiv. 2023. doi: 10.1101/2023.04.04.535623.
- 23. Luecken M.D., Büttner M., Chaichoompu K., Danese A., Interlandi M., Mueller M.F., Strobl D.C., Zappia L., Dugas M., Colomé-Tatché M., Theis F.J. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods. 2022;19:41–50. doi: 10.1038/s41592-021-01336-8.
- 24. Li Y.E., Preissl S., Miller M., Johnson N.D., Wang Z., Jiao H., Zhu C., Wang Z., Xie Y., Poirion O., et al. A comparative atlas of single-cell chromatin accessibility in the human brain. Science. 2023;382. doi: 10.1126/science.adf7044.
- 25. Pliner H.A., Packer J.S., McFaline-Figueroa J.L., Cusanovich D.A., Daza R.M., Aghamirzaie D., Srivatsan S., Qiu X., Jackson D., Minkina A., et al. Cicero predicts cis-regulatory DNA interactions from single-cell chromatin accessibility data. Mol. Cell. 2018;71:858–871.e8. doi: 10.1016/j.molcel.2018.06.044.
- 26. Lin Y., Wu T.Y., Wan S., Yang J.Y.H., Wong W.H., Wang Y.X.R. scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning. Nat. Biotechnol. 2022;40:703–710. doi: 10.1038/s41587-021-01161-6.

Timing: 1.5 h

