Summary
Here, we present workflows for integrating independent transcriptomic and chromatin accessibility datasets and analyzing multiomics. First, we describe steps for integrating independent transcriptomic and chromatin accessibility measurements. Next, we detail multimodal analysis of transcriptomes and chromatin accessibility performed in the same sample. We demonstrate their use by analyzing datasets obtained from mouse embryonic stem cells induced to differentiate toward mesoderm-like, myogenic, or neurogenic phenotypes.
For complete details on the use and execution of this protocol, please refer to Khateb et al.1
Subject areas: Bioinformatics, Computer sciences
Graphical abstract
Highlights
• Integration of scRNA-seq and scATAC-seq from independent datasets
• Multiomics of snRNA-seq and snATAC-seq from the same sample
• Inference of cell states from multiomics pseudotime
Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.
Before you begin
Hardware preparation
A computer running macOS or Windows with a network connection is required. The RAM requirement depends on the number of cells to be analyzed; 16 GB of RAM should be sufficient for an initial analysis. If more than 10,000 cells are analyzed, a computer cluster with more than 32 GB of RAM running Linux is required.
Software preparation
Timing: 1 h (for step 1)
The applications described in this section are required for single cell (sc)RNA-seq analysis, single cell (sc)ATAC-seq analysis, integration of scRNA-seq and scATAC-seq datasets, and multiomics analysis.
1. Prepare the Docker environment (troubleshooting 1).
For scRNA-seq, scATAC-seq, multiomics analysis, and data integration, single-cell analysis tools on the R platform are required. To avoid conflicts between R library installations, a Docker development environment is used.
a. Access the Docker webpage (https://www.docker.com/) and install the latest version of Docker Desktop.
b. Pull the Docker image from Docker Hub.
> docker pull holyone70/mesoderm_pipeline:mesoderm_pipeline
c. Run the Docker image to prepare the R development environment.
> docker run -e PASSWORD=rstudio -p 8787:8787 --name mesoderm_pipeline holyone70/mesoderm_pipeline:mesoderm_pipeline
d. Open a web browser and enter the local address of the R server (http://localhost:8787).
e. Enter the username (rstudio) and password (rstudio).
f. Check the availability of R packages (troubleshooting 2).
g. Check the availability of data files (troubleshooting 3).
h. Run the scripts in the following order:
i. scRNA_analysis.R.
ii. scATAC_analysis.R.
iii. int_scRNA.R.
iv. int_scRNA_scATAC.R.
v. multiomics_anal.R.
i. How to run the R scripts.
i. Clean the environment by clicking the broom symbol in the upper right corner.
ii. Go to File → open the star_protocol project → select the R script of interest to run in the File, Packages, Help panel.
Note: To pull and run the Docker image, use the terminal on macOS and Linux, and PowerShell on Windows. The command lines for downloading the data are located at the top of each R script and are commented out.
CRITICAL: To run multiomics_anal.R successfully, the minimum Docker resource allocation is 32 GB memory, 8 CPUs, and 4 GB swap.
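The package check in step 1f can be sketched in base R as follows. This is a minimal illustration rather than part of the published scripts: the helper name `check_packages` is hypothetical, and the package list is taken from the key resources table, so it may not cover every library the scripts load.

```r
# Packages named in the key resources table (assumed to be the core requirements).
required_pkgs <- c("Seurat", "Signac", "harmony", "monocle3",
                   "SeuratWrappers", "JASPAR2020", "TFBSTools")

# Hypothetical helper: TRUE for each package that can be loaded, FALSE otherwise.
check_packages <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

status <- check_packages(required_pkgs)
print(status)

# Report anything that is not available in the current R library paths.
missing_pkgs <- names(status)[!status]
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```

Inside the Docker container all entries would be expected to be TRUE; a FALSE entry points to the package to investigate under troubleshooting 2.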
Data collection
Timing: 30 min (for step 2)
Single cell datasets analyzed in this protocol were deposited in the GEO repository (GSE198730). scRNA-seq datasets consist of barcodes, features, and matrix files, and scATAC-seq datasets contain barcodes, fragments, matrix, and peaks files. The time points for each dataset are described in Figure 1.
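Because each scRNA-seq dataset consists of three files, it can help to confirm that a folder is complete before loading it. The sketch below uses only base R; the helper name `check_10x_dir` is hypothetical, and the demonstration runs on empty placeholder files in a temporary folder rather than on the GEO data.

```r
# Hypothetical helper: report which of the three 10x files are absent from a folder.
check_10x_dir <- function(data_dir) {
  expected <- c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")
  expected[!file.exists(file.path(data_dir, expected))]
}

# Toy demonstration with placeholder files (not the real GEO data).
demo_dir <- file.path(tempdir(), "filtered_feature_bc_matrix_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("barcodes.tsv.gz", "features.tsv.gz")))

missing_files <- check_10x_dir(demo_dir)
print(missing_files)
```

An empty result means the folder has the three expected file names. Files downloaded from GEO carry a dataset prefix, so they would need to be renamed to these names first.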
Figure 1.
Scheme illustrating ESCs differentiation time points
Key resources table
REAGENT or RESOURCE | SOURCE | IDENTIFIER |
---|---|---|
Deposited data | ||
Single cell RNA-seq datasets | Khateb et al.1 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE198730 (GSE198730_aPSM_scRNA_rep1_barcodes.tsv.gz, GSE198730_aPSM_scRNA_rep1_features.tsv.gz, GSE198730_aPSM_scRNA_rep1_matrix.mtx.gz) |
Single cell ATAC-seq datasets | Khateb et al.1 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE198730 (aPSM_scATAC_rep1_filtered_peak_bc_matrix.h5) |
Single cell omics datasets | Khateb et al.1 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE198730 (GSE198730_HIFLR_snRNA_barcodes.tsv.gz, GSE198730_HIFLR_snRNA_features.tsv.gz, GSE198730_HIFLR_snRNA_matrix.mtx.gz, GSE198730_HIFLR_snATAC_fragments.tsv.gz, GSE198730_HIFLR_snATAC_fragments.tsv.gz.tbi.gz) |
Github repository | Single cell RNA-seq, Single cell ATAC-seq, Multiome single nuclei ATAC and gene expression | https://github.com/LMSCGR/mesoderm_induced_ESCs_pipeline (HIFLR_snATAC_fragments.tsv.gz.tbi, aPSM_fragments.tsv.gz.tbi, cell_cycle.txt, naive_instructed_esc.csv, aPSM_f.txt) |
Software and algorithms | ||
BioRender | https://biorender.com/ | |
R v4.2.2 | The R Project for Statistical Computing | https://www.r-project.org/ |
RStudio server (v 2022.12.0+353) | RStudio Team2 | https://posit.co/ |
Seurat v4.3.0 | Stuart et al.3 | https://cran.r-project.org/web/packages/Seurat/index.html |
Signac v1.9.0 | Stuart et al.4 | https://satijalab.org/signac |
Harmony v0.1.1 | Korsunsky et al.5 | https://github.com/immunogenomics/harmony |
Monocle3 v1.3.1 | Cao et al.6 | https://cole-trapnell-lab.github.io/monocle3/ |
JASPAR 2020 v 0.99.10 | Fornes et al.7 | https://bioconductor.org/packages/release/data/annotation/html/JASPAR2020.html |
TFBSTools v 1.36.0 | Tan and Lenhard8 | https://bioconductor.org/packages/release/bioc/html/TFBSTools.html |
SeuratWrappers v0.3.1 | https://github.com/satijalab/seurat-wrappers | |
Other | ||
Local computer – memory: 16GB required, 32GB recommended; processors: 4 required, 8 recommended | N/A | N/A |
Step-by-step method details
Part 1: Single cell RNA seq analysis
Timing: 1 h (for step 1 to step 9)
In this section, we describe essential steps to analyze scRNA-seq datasets.
1. Load datasets using the Seurat package (troubleshooting 4).
library(dplyr)
library(Seurat)
library(monocle3)
library(SeuratWrappers)
# for plotting
library(ggplot2)
library(patchwork)
set.seed(1234)
aPSM.matrix <- Read10X(data.dir ="./aPSM_scRNA/filtered_feature_bc_matrix/")
2. Create Seurat object.
aPSM <- CreateSeuratObject(counts = aPSM.matrix, min.cells = 3, min.features = 200, project = "aPSM")
Note: The min.cells and min.features options are set to the default values from the Seurat tutorials (https://satijalab.org/seurat/articles/pbmc3k_tutorial.html). File names should be barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz. On Windows, ".\\filtered_feature_bc_matrix\\" can be used as the path.
3. Select cells for the analysis through quality control (QC) (Figure 2A). Troubleshooting 5.
aPSM[["percent.mt"]] <- PercentageFeatureSet(aPSM, pattern = "^mt-")
VlnPlot(aPSM, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
plot1 <- FeatureScatter(aPSM, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(aPSM, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1+plot2
aPSM <- subset(aPSM, subset = nFeature_RNA > 0 & nFeature_RNA < 8000 & percent.mt < 20)
4. Preprocess data and select features for the analysis.
aPSM <- NormalizeData(object = aPSM, normalization.method = "LogNormalize", scale.factor = 1e4)
aPSM <- FindVariableFeatures(aPSM, selection.method = "vst", nfeatures = 2000)
aPSM_top10 <- head(VariableFeatures(aPSM), 10)
plot1 <- VariableFeaturePlot(aPSM)
plot2 <- LabelPoints(plot = plot1, points = aPSM_top10, repel = TRUE)
plot1+plot2
aPSM.all.genes <- rownames(aPSM)
aPSM <- ScaleData(aPSM, features = aPSM.all.genes)
5. Filter cell cycle genes.
convertHumanGeneList <- function(x){
require("biomaRt")
human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl" , host = "https://dec2021.archive.ensembl.org/")
mouse <- useMart("ensembl", dataset = "mmusculus_gene_ensembl" ,host = "https://dec2021.archive.ensembl.org/")
tmp <- getLDS(attributes = c("hgnc_symbol"), filters = "hgnc_symbol", values = x , mart = human, attributesL = c("mgi_symbol"), martL = mouse, uniqueRows=TRUE)
mousex <- unique(tmp[,2])
return(mousex)}
s.genes <- convertHumanGeneList(cc.genes.updated.2019$s.genes)
g2m.genes <- convertHumanGeneList(cc.genes.updated.2019$g2m.genes)
cell_cycle <- t(read.csv(file="aPSM_scRNA/cell_cycle.txt",header=F))[,1]
filtered_genes <- c(s.genes,cell_cycle)
6. Filter cell cycle genes, reduce dimensions, and establish dataset dimensionality (Figure 2B).
aPSM <- RunPCA(object = aPSM, features = VariableFeatures(object = aPSM), verbose = FALSE)
aPSM <- CellCycleScoring(aPSM, s.features = filtered_genes, g2m.features = g2m.genes, set.ident = TRUE)
aPSM <- ScaleData(aPSM, vars.to.regress = c("S.Score", "G2M.Score"), features = rownames(aPSM))
aPSM <- JackStraw(aPSM, num.replicate = 100)
aPSM <- ScoreJackStraw(aPSM, dims = 1:20)
ElbowPlot(object = aPSM,ndims =50)
7. Cluster and visualize cells (Figure 3A).
aPSM <- FindNeighbors(object = aPSM, dims = 1:30)
aPSM <- FindClusters(object = aPSM, resolution = 0.25)
aPSM <- RunTSNE(object = aPSM, dims = 1:30)
aPSM <- RunUMAP(object = aPSM, dims = 1:30)
DimPlot(object=aPSM,reduction='umap',label=T)+labs(title = " aPSM")
save(aPSM,file="aPSM_scRNA.RData")
8. Analyze unique features of each cluster.
aPSM.markers <- FindAllMarkers(aPSM, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
aPSM.markers_table <- aPSM.markers %>% group_by(cluster) %>% slice_max(n = 20, order_by = avg_log2FC)
Note: The min.pct and logfc.threshold options are set to the default values from the Seurat tutorials (https://satijalab.org/seurat/articles/pbmc3k_tutorial.html).
9. Visualize cluster pseudotime (Figure 3B).
DefaultAssay(aPSM) <- "RNA"
aPSM_cds <- as.cell_data_set(aPSM)
aPSM_cds <- cluster_cells(aPSM_cds,reduction="UMAP",k = 30,resolution = 0.00012)
aPSM_cds <- learn_graph(aPSM_cds, close_loop = F,use_partition = T,learn_graph_control =list(minimal_branch_len=5))
plot_cells(aPSM_cds, label_groups_by_cluster = T, label_leaves = F, label_branch_points = T,graph_label_size = 3)
aPSM.min.umap <- which.min(unlist(FetchData(aPSM, "UMAP_2")))
aPSM.min.umap <- colnames(aPSM)[aPSM.min.umap]
aPSM_cds <- order_cells(aPSM_cds, root_cells = aPSM.min.umap)
plot_cells(aPSM_cds, color_cells_by = "pseudotime", label_cell_groups =T, label_leaves = F, label_branch_points = F,show_trajectory_graph = T,graph_label_size = 3,label_groups_by_cluster = T)
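To make the QC and normalization in steps 3 and 4 concrete, the sketch below reproduces the arithmetic in base R on an invented 4-gene, 3-cell count matrix. It assumes that percent.mt is the share of counts from genes matching "^mt-" and that LogNormalize divides each cell by its total counts, multiplies by the scale factor, and applies log1p; it is an illustration of those ideas, not the Seurat implementation.

```r
# Toy count matrix: 4 genes (rows) x 3 cells (columns); all values are invented.
counts <- matrix(c(40,  0, 10,
                   50,  5, 20,
                    0,  0, 60,
                   10,  5, 10),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("Meis1", "Pbx1", "Myod1", "mt-Co1"),
                                 c("cell1", "cell2", "cell3")))

# Per-cell QC metrics analogous to nCount_RNA, nFeature_RNA, and percent.mt.
n_count    <- colSums(counts)
n_feature  <- colSums(counts > 0)
mt_rows    <- grepl("^mt-", rownames(counts))
percent_mt <- colSums(counts[mt_rows, , drop = FALSE]) / n_count * 100

# Same form of filter as in step 3 (thresholds copied from the protocol).
keep     <- n_feature > 0 & n_feature < 8000 & percent_mt < 20
filtered <- counts[, keep, drop = FALSE]

# Log-normalization: counts / total counts per cell * scale factor, then log1p.
scale_factor <- 1e4
log_norm <- log1p(sweep(filtered, 2, colSums(filtered), "/") * scale_factor)

print(round(percent_mt, 1))
print(colnames(filtered))
print(round(log_norm, 2))
```

In this toy example cell2 is removed because half of its counts are mitochondrial, which is the kind of cell the percent.mt threshold is meant to exclude.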
Figure 2.
scRNA-seq data quality control
(A) Violin plots of aPSM scRNA-seq data: mRNA counts (nCount_RNA), number of detected genes (nFeature_RNA), and mitochondrial gene percentage (percent.mt).
(B) Elbow plot of aPSM scRNA-seq data showing the standard deviations of the principal components (PCs).
Figure 3.
Visualization of aPSM scRNA-seq data
(A) UMAP plot of aPSM scRNA-seq data.
(B) Pseudotime plot of aPSM scRNA-seq data. The heatmap represents units of progress, with 1 located at the root of the trajectory.
Part 1: Single cell ATAC seq analysis
Timing: 1 h (for step 10 to step 17)
In this section, we describe steps to evaluate chromatin accessibility using scATAC-seq.
10. Load datasets using the Signac package.
library(Signac)
library(Seurat)
library(GenomeInfoDb)
library(EnsDb.Mmusculus.v79)
library(patchwork)
set.seed(1234)
aPSM.counts <- Read10X_h5("./aPSM_scATAC/filtered_peak_bc_matrix.h5")
aPSM_meta <- read.table("./aPSM_scATAC/singlecell.csv.gz", sep = ",", header = TRUE, row.names = 1)
aPSM_chrom_assay <- CreateChromatinAssay(
counts = aPSM.counts,
sep = c(":","-"),
genome = 'mm10',
fragments = './aPSM_scATAC/filtered_feature_bc_matrix/fragments.tsv.gz', min.cells = 3, min.features = 100)
Note: The min.cells and min.features options are set to the default values from the Signac tutorials (https://stuartlab.org/signac/articles/pbmc_vignette.html).
11. Create Seurat object.
aPSM_atac <- CreateSeuratObject(
counts = aPSM_chrom_assay,
assay = 'aPSM_peaks',
project = 'aPSM_atac',
meta.data = aPSM_meta[colnames(aPSM_chrom_assay),])
12. Add annotation information.
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79)
# change to UCSC style since the data was mapped to mm10
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- "mm10"
# add the gene information to the object
Annotation(aPSM_atac) <- annotations
13. Select cells for analysis through QC (Figure 4A).
aPSM_atac <- NucleosomeSignal(object = aPSM_atac)
# compute TSS enrichment score per cell
aPSM_atac <- TSSEnrichment(object = aPSM_atac, fast = FALSE)
aPSM_atac$pct_reads_in_peaks <- aPSM_atac$peak_region_fragments / aPSM_atac$passed_filters * 100
aPSM_atac$blacklist_ratio <- aPSM_atac$blacklist_region_fragments / aPSM_atac$peak_region_fragments
VlnPlot(
object = aPSM_atac,
features = c('pct_reads_in_peaks', 'peak_region_fragments',
'TSS.enrichment', 'blacklist_ratio', 'nucleosome_signal'),pt.size = 0.1, ncol = 5)
14. Preprocess data and select features for the analysis.
FeatureScatter(aPSM_atac, feature1 = "peak_region_fragments", feature2 = "nCount_aPSM_peaks")
aPSM_atac <- subset(
x = aPSM_atac,
subset = peak_region_fragments > 2586 &
peak_region_fragments < 20000 & pct_reads_in_peaks > 15 & blacklist_ratio < 0.05)
ncol(aPSM_atac)
VlnPlot(
object = aPSM_atac,
features = c('nucleosome_signal','peak_region_fragments'),pt.size = 0.1) + NoLegend()
FeatureScatter(aPSM_atac, feature1 = "peak_region_fragments", feature2 = "nCount_aPSM_peaks")
15. Reduce dimensions.
aPSM_atac <- RunTFIDF(aPSM_atac)
aPSM_atac <- FindTopFeatures(aPSM_atac, min.cutoff = 'q0')
aPSM_atac <- RunSVD(
object = aPSM_atac, assay = 'aPSM_peaks',
reduction.key = 'LSI_', reduction.name = 'lsi')
16. Cluster and visualize cells (Figure 4B).
library(ggplot2)
aPSM_atac <- RunUMAP(object = aPSM_atac, reduction = 'lsi', dims = 1:40)
aPSM_atac <- RunTSNE(object = aPSM_atac, reduction = 'lsi', dims = 1:40)
aPSM_atac <- FindNeighbors(object = aPSM_atac, reduction = 'lsi', dims = 1:40)
aPSM_atac <- FindClusters(object = aPSM_atac, verbose = FALSE,resolution=0.25)
DimPlot(object = aPSM_atac, label = F,reduction = 'umap') +labs(title = " aPSM scATAC")
17. Calculate gene activities and add them to the Seurat object.
aPSM_gene.activities <- GeneActivity(aPSM_atac)
save(aPSM_gene.activities,file="aPSM_atac_gene.activities.RData")
# add the gene activity matrix to the Seurat object as a new assay and normalize it
aPSM_atac[['RNA']] <- CreateAssayObject(counts = aPSM_gene.activities)
aPSM_atac <- NormalizeData(
object = aPSM_atac, assay = 'RNA', normalization.method = 'LogNormalize',scale.factor = median(aPSM_atac$nCount_RNA) )
save(aPSM_atac,file="aPSM_scATAC.RData")
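The TF-IDF normalization applied in step 15 can be illustrated with a small base-R sketch. It follows the general log(TF × IDF) idea on an invented peak-by-cell matrix; the scale factor of 1e4 and the exact form of each term are assumptions of this sketch and may differ from what RunTFIDF computes internally.

```r
# Toy peak-by-cell matrix: 3 peaks (rows) x 4 cells (columns); values are invented.
peaks <- matrix(c(2, 0, 1, 1,
                  0, 4, 0, 0,
                  2, 4, 1, 3),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("chr1:100-600", "chr2:200-700", "chr3:300-800"),
                                paste0("cell", 1:4)))

# Term frequency: each cell's counts divided by that cell's total counts.
tf <- sweep(peaks, 2, colSums(peaks), "/")

# Inverse document frequency: number of cells divided by each peak's total counts.
idf <- ncol(peaks) / rowSums(peaks)

# log(1 + TF * IDF * scale factor); the scale factor is an assumption here.
scale_factor <- 1e4
tfidf <- log1p(sweep(tf, 1, idf, "*") * scale_factor)

print(round(tfidf, 2))
```

In cell1 the first and third peaks have the same raw count, but the third peak is accessible in every cell and therefore receives a lower weight, which is the intended effect of the IDF term.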
Figure 4.
Quality control and visualization of integration between scATAC and scRNA-seq data
(A) Features distribution of aPSM scATAC-seq data.
(B) UMAP plot of aPSM scATAC-seq datasets.
(C) Features distribution between naïve and instructed scRNA-seq datasets.
(D) UMAP plots before and after Harmony integration.
Part 1: Integrated data analysis
Timing: 1 h (for step 18 to step 19)
In this section, we describe steps to integrate and analyze data from different platforms. Users can infer changes in gene expression across time points, or relationships between gene expression and chromatin accessibility during cellular differentiation.
18. Integrate single cell RNA-seq datasets.
a. Prepare R libraries for the integration.
library(dplyr)
library(Seurat)
library(harmony)
library(data.table)
library(parallel)
set.seed(1234)
# Set number of cores to use
NCORES = 1
meta <- fread("naive_instructed_esc.csv")
b. Load datasets.
data_dir <- list("./naive_scRNA/","./instructed_scRNA/")
mat.list <- list()
soupx.used <- list()
for(i in 1:length(data_dir)){
  mat.list[[i]] <- Read10X(data.dir = paste0(data_dir[i], 'filtered_feature_bc_matrix'))
  soupx.used[[i]] <- F
}
cat(sum(unlist(lapply(mat.list, ncol))),"cells (total) loaded...\n")
sample_num <- min(ncol(mat.list[[1]]), ncol(mat.list[[2]]))
sel.id <- sample(colnames(mat.list[[2]]), size = sample_num, replace = FALSE)
mat.list[[2]] <- mat.list[[2]][,sel.id]
Note: Files should be in the filtered_feature_bc_matrix folder, and file names should be barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz.
c. Create Seurat objects.
seu.list <- list()
seu.list <- mclapply(mat.list, FUN = function(mat){
  return(CreateSeuratObject(counts = mat, min.features = 200, min.cells = 3, project = 'naive_instructed_data'))
}, mc.cores = NCORES)
for(i in 1:length(seu.list)){
  cat(' ------------------------------------\n','--- Processing dataset number ', i, '-\n','------------------------------------\n')
  # Add meta data
  for(md in colnames(meta)){
    seu.list[[i]][[md]] <- meta[[md]][i]
  }
  # add %MT
  seu.list[[i]][["percent.mt"]] <- PercentageFeatureSet(seu.list[[i]], pattern = "^mt-")
  # Filter out low quality cells according to the metrics defined above
  seu.list[[i]] <- subset(seu.list[[i]], subset = nFeature_RNA > 1600 & nFeature_RNA < 8000 & percent.mt < 20)
  # Only mito and floor filtering; trying to find doublets
}
cat((sum(unlist(lapply(mat.list, ncol)))-sum(unlist(lapply(seu.list, ncol)))),"cells (total) removed...\n")
d. Preprocess Seurat objects.
seuPreProcess <- function(seu, assay='RNA', n.pcs=30, res=0.25){
  pca.name = paste0('pca_', assay)
  pca.key = paste0(pca.name,'_')
  umap.name = paste0('umap_', assay)
  seu = NormalizeData(seu) %>%
    FindVariableFeatures(assay = assay, selection.method = "vst", nfeatures = 2000, verbose = F) %>%
    ScaleData(assay = assay) %>%
    RunPCA(assay = assay, reduction.name = pca.name, reduction.key = pca.key, verbose = F, npcs = n.pcs)
  n.pcs.use = n.pcs
  # FindNeighbors %>% RunUMAP, FindClusters
  seu <- FindNeighbors(seu, reduction = pca.name, dims = 1:n.pcs.use, force.recalc = TRUE, verbose = FALSE) %>%
    RunUMAP(reduction = pca.name, dims = 1:n.pcs.use, reduction.name = umap.name)
  seu@reductions[[umap.name]]@misc$n.pcs.used <- n.pcs.use
  seu <- FindClusters(object = seu, resolution = res)
  seu[[paste0('RNA_res.',res)]] <- as.numeric(seu@active.ident)
  return(seu)
}
# preprocess each dataset individually
seu.list <- lapply(seu.list, seuPreProcess)
e. Merge datasets (Figure 4C).
tmp.list <- list()
for(i in 1:length(seu.list)){
  DefaultAssay(seu.list[[i]]) <- "RNA"
  tmp.list[[i]] <- DietSeurat(seu.list[[i]], assays = "RNA")
}
# merge tmp count matrices
scMuscle.pref.seurat <- merge(tmp.list[[1]], y = tmp.list[[2]])
VlnPlot(scMuscle.pref.seurat, features = c('nCount_RNA','nFeature_RNA','percent.mt'), group.by = 'source', pt.size = 0)
f. Preprocess merged data.
# Seurat preprocessing on merged data ----
DefaultAssay(scMuscle.pref.seurat) <- 'RNA'
scMuscle.pref.seurat <- NormalizeData(scMuscle.pref.seurat, assay = 'RNA') %>%
  FindVariableFeatures(selection.method = 'vst', nfeatures = 2000, verbose = TRUE) %>%
  ScaleData(assay = 'RNA', verbose = TRUE) %>%
  RunPCA(assay = 'RNA', reduction.name = 'pca_RNA', reduction.key = 'pca_RNA_', verbose = TRUE, npcs = 50)
ElbowPlot(scMuscle.pref.seurat, reduction = 'pca_RNA', ndims = 50)
g. Find clusters in the merged dataset.
n.pcs = 30
scMuscle.pref.seurat <- RunUMAP(scMuscle.pref.seurat, reduction = 'pca_RNA', dims = 1:n.pcs, reduction.name = 'umap_RNA') %>%
  FindNeighbors(reduction = 'pca_RNA', dims = 1:n.pcs, force.recalc = TRUE, verbose = F)
scMuscle.pref.seurat <- FindClusters(object = scMuscle.pref.seurat, resolution = 0.25)
scMuscle.pref.seurat[['RNA_res.0.25']] <- as.numeric(scMuscle.pref.seurat@active.ident)
h. Integrate datasets using the Harmony package.
scMuscle.pref.seurat <- scMuscle.pref.seurat %>%
  RunHarmony(group.by.vars = c('sample'), reduction = 'pca_RNA', assay = 'RNA', plot_convergence = TRUE, verbose = TRUE)
scMuscle.pref.seurat <- scMuscle.pref.seurat %>%
  RunUMAP(reduction = 'harmony', dims = 1:n.pcs, reduction.name = 'umap_harmony')
scMuscle.pref.seurat@reductions$umap_harmony@misc$n.pcs.used <- n.pcs
scMuscle.pref.seurat <- scMuscle.pref.seurat %>%
  FindNeighbors(reduction = 'harmony', dims = 1:n.pcs, graph.name = 'harmony_snn', force.recalc = TRUE, verbose = FALSE)
scMuscle.pref.seurat <- FindClusters(object = scMuscle.pref.seurat, resolution = 1.0, graph.name = 'harmony_snn')
scMuscle.pref.seurat[['harmony_res.1.0']] <- as.numeric(scMuscle.pref.seurat@active.ident)
scMuscle.pref.seurat <- FindClusters(object = scMuscle.pref.seurat, resolution = 2.0, graph.name = 'harmony_snn')
scMuscle.pref.seurat[['harmony_res.2.0']] <- as.numeric(scMuscle.pref.seurat@active.ident)
i. Validate integrated results (Figure 4D).
library(cowplot)
library(ggplot2)
p1 <- DimPlot(object = scMuscle.pref.seurat, reduction = "umap_RNA", pt.size = .1, group.by = "sample") + labs(title = "Merged by Seurat")
p2 <- DimPlot(object = scMuscle.pref.seurat, reduction = "umap_harmony", pt.size = .1, group.by = "sample") + labs(title = "Merged by Seurat with Harmony")
p1+p2
save(scMuscle.pref.seurat, file="naive_instructed_scRNA_ESCs.RData")
19. Integrate the scATAC-seq dataset with the scRNA-seq dataset.
a. Prepare R libraries for integration.
library(Signac)
library(Seurat)
library(GenomeInfoDb)
library(EnsDb.Mmusculus.v79)
library(patchwork)
library(ggplot2)
set.seed(1234)
b. Load datasets.
load("aPSM_scRNA.RData")
load("aPSM_scATAC.RData")
c. Infer relations between scRNA-seq and scATAC-seq.
DefaultAssay(aPSM_atac) <- 'RNA'
ncol(aPSM_atac)
transfer.anchors <- FindTransferAnchors(reference = aPSM, query = aPSM_atac, k.anchor = 20, k.filter = 200, reduction = 'cca', dims = 1:30)
predicted.labels <- TransferData(anchorset = transfer.anchors, refdata = aPSM$seurat_clusters, weight.reduction = aPSM_atac[['lsi']], dims = 2:30)
save(transfer.anchors, file="transfer.anchors_aPSM_atac.RData")
aPSM_atac <- AddMetaData(object = aPSM_atac, metadata = predicted.labels)
save(aPSM_atac, file="aPSM_atac_meta.RData")
d. Visualize the clusters of the integrated datasets.
DimPlot(object = aPSM_atac, label = F, reduction = 'umap', group.by = 'predicted.id') + labs(title = "aPSM scATAC")
DimPlot(object = aPSM, label = F, reduction = 'umap') + labs(title = "aPSM")
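Step 18b balances the two scRNA-seq datasets by randomly subsampling the larger one to the size of the smaller one before integration. The base-R sketch below shows that idea on invented placeholder matrices; the object names are hypothetical, and unlike the protocol code it generalizes to any number of matrices by subsampling only those larger than the smallest.

```r
set.seed(1234)

# Placeholder matrices standing in for the naive and instructed count matrices.
naive_mat <- matrix(1, nrow = 2, ncol = 5,
                    dimnames = list(c("g1", "g2"), paste0("n", 1:5)))
instructed_mat <- matrix(1, nrow = 2, ncol = 8,
                         dimnames = list(c("g1", "g2"), paste0("i", 1:8)))
mat_list <- list(naive = naive_mat, instructed = instructed_mat)

# Target size: the number of cells in the smallest dataset.
target <- min(vapply(mat_list, ncol, integer(1)))

# Subsample, without replacement, only the matrices that exceed the target.
balanced <- lapply(mat_list, function(m) {
  if (ncol(m) > target) {
    m[, sample(colnames(m), size = target, replace = FALSE), drop = FALSE]
  } else {
    m
  }
})

print(vapply(balanced, ncol, integer(1)))
```

Balancing cell numbers in this way keeps one dataset from dominating the variable feature selection and PCA of the merged object, at the cost of discarding some cells from the larger dataset.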
Part 2: Multiomics analysis
Timing: 1 h (for step 20 and step 21)
In this section, we describe major steps on how to perform multimodal analysis.
20. Data preprocessing.
a. Load the libraries and set up the working directory.
library(Seurat)
library(Signac)
library(patchwork)
library(monocle3)
library(SeuratWrappers)
library(EnsDb.Mmusculus.v79)
library(GenomeInfoDb)
library(ggplot2)
library(dplyr)
set.seed(1234)
setwd(getwd())
b. Load snRNA and snATAC data and create Seurat object.
Star.data <- Read10X(data.dir = "./multiomics/filtered_feature_bc_matrix/")
# Extract RNA and ATAC data
rna_counts <- Star.data$`Gene Expression`
atac_counts <- Star.data$Peaks
# Create Seurat object containing snRNA data
Star <- CreateSeuratObject(counts = rna_counts, project = "Star", min.cells = 5, min.features = 100, assay = "RNA")
CRITICAL: HIFLR_snRNA_barcodes.tsv.gz, HIFLR_snRNA_features.tsv.gz, and HIFLR_snRNA_matrix.mtx.gz are the files generated by CellRanger-arc v2.0.0. Files should be kept in the same folder, named as filtered_feature_bc_matrix.
c. Load snATAC-seq fragments files.
grange.counts <- StringToGRanges(rownames(atac_counts), sep = c(":", "-"))
grange.use <- seqnames(grange.counts) %in% standardChromosomes(grange.counts)
atac_counts <- atac_counts[as.vector(grange.use), ]
d. Add annotation.
annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79)
seqlevelsStyle(annotation) <- "UCSC"
genome(annotation) <- "mm10"
e. Create snATAC assay.
fragpath <- "./multiomics/filtered_feature_bc_matrix/fragments.tsv.gz"
Star[["ATAC"]] <- CreateChromatinAssay(counts = atac_counts, sep = c(":", "-"), genome = 'mm10', fragments = fragpath, min.cells = 10, annotation = annotation)
f. Downsize the dataset.
set.seed(111)
Star <- subset(x = Star, downsample = 6000)
save(Star, file="Star_ds6k.RData")
load("./Star_ds6k.RData")
CRITICAL: To load the snATAC-seq fragments file properly, fragments.tsv.gz.tbi file is required to be in the same folder.
CRITICAL: Use only peaks in standard chromosomes and set sequence level style as UCSC.
g. Quality control:
i. Calculate the percentage of mitochondrial genes in snRNA-seq.
ii. Compute both TSS enrichment score and nucleosome signal metrics in Signac for snATAC-seq (Figure 5A).
DefaultAssay(Star) <- "RNA"
Star[["percent.mt"]] <- PercentageFeatureSet(Star, pattern = "^mt-")
VlnPlot(Star, features = c("nCount_RNA", "nFeature_RNA", "percent.mt"), ncol = 3, log = TRUE, pt.size = 0) + NoLegend()
DefaultAssay(Star) <- "ATAC"
Star <- NucleosomeSignal(Star)
Star <- TSSEnrichment(object = Star, fast = FALSE)
VlnPlot(Star, features = c("nCount_ATAC", "nFeature_ATAC", "TSS.enrichment", "nucleosome_signal"), ncol = 4, log = TRUE, pt.size = 0) + NoLegend()
Note: Low-quality cells refer to potentially damaged cells, empty droplets, cell doublets, or multiplets.
iii. Remove low quality cells (Figure 5B).
Star <- subset(x = Star, subset = nCount_RNA < 100000 & nCount_RNA > 1200 & nCount_ATAC < 1e5 & nCount_ATAC > 1e2 & nucleosome_signal < 2.5 & TSS.enrichment > 3 & percent.mt < 10)
saveRDS(Star, file="Star.RData")
VlnPlot(Star, features = c("nCount_RNA", "nFeature_RNA", "percent.mt"), ncol = 3, log = TRUE, pt.size = 0) + NoLegend()
VlnPlot(Star, features = c("nCount_ATAC", "nFeature_ATAC", "TSS.enrichment", "nucleosome_signal"), ncol = 4, log = TRUE, pt.size = 0) + NoLegend()
CRITICAL: The filtering criteria are dataset specific. Choose cutoffs that avoid losing unique cell populations while excluding noisy cells.
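Since the QC cutoffs are dataset specific, one way to explore them is to tabulate how many cells survive under several candidate thresholds before committing to one. The sketch below is a base-R illustration on simulated metrics; the helper `fraction_kept`, the candidate cutoffs, and the simulated distributions are all assumptions made for the example, and real values would come from the object's metadata.

```r
set.seed(1)

# Simulated per-cell metrics (invented distributions, 1000 cells).
qc <- data.frame(
  nCount_RNA     = round(rlnorm(1000, meanlog = 8.5, sdlog = 0.8)),
  TSS.enrichment = rgamma(1000, shape = 9, rate = 2)
)

# Quantiles show where the bulk of each distribution sits.
print(sapply(qc, quantile, probs = c(0.01, 0.05, 0.5, 0.95, 0.99)))

# Hypothetical helper: fraction of cells retained for a given pair of cutoffs.
fraction_kept <- function(qc, min_count, min_tss) {
  mean(qc$nCount_RNA > min_count & qc$TSS.enrichment > min_tss)
}

# Grid of candidate cutoffs; min_count varies fastest across rows.
candidates <- expand.grid(min_count = c(500, 1200, 2000), min_tss = c(2, 3, 4))
candidates$kept <- mapply(fraction_kept, candidates$min_count, candidates$min_tss,
                          MoreArgs = list(qc = qc))
print(candidates)
```

A sharp drop in the retained fraction between neighboring cutoffs suggests that the stricter value is cutting into the main cell population rather than trimming its tail.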
21. WNN analysis.
a. Perform normalization and dimensional reduction of the snRNA-seq and snATAC-seq assays independently.
# snRNA analysis
DefaultAssay(Star) <- "RNA"
Star <- SCTransform(Star, verbose = FALSE) %>% RunPCA() %>% RunUMAP(dims = 1:30, reduction.name = 'umap', reduction.key = 'UMAP_')
# snATAC analysis
DefaultAssay(Star) <- "ATAC"
Star <- RunTFIDF(Star)
Star <- FindTopFeatures(Star, min.cutoff = 'q0')
Star <- RunSVD(Star)
Star <- RunUMAP(Star, reduction = 'lsi', dims = 2:30, reduction.name = "umap.atac", reduction.key = "atacUMAP_")
Note: In the snATAC-seq assay, the first LSI dimension is typically correlated with sequencing depth rather than biological variation. It is therefore excluded from the UMAP computation.
b. Learn cell-specific modality weights and construct a WNN graph.
Star <- FindMultiModalNeighbors(Star, reduction.list = list("pca", "lsi"), dims.list = list(1:30, 2:30))
Star <- RunUMAP(Star, nn.name = "weighted.nn", reduction.name = "umap.wnn", reduction.key = "wnnUMAP_")
Star <- FindClusters(Star, graph.name = "wsnn", resolution = 0.8, algorithm = 3, verbose = FALSE)
c. Visualize the clusters (Figure 6A).
p1 <- DimPlot(Star, reduction = "umap", group.by = "seurat_clusters", label = TRUE, label.size = 8, repel = TRUE) + ggtitle("RNA")
p2 <- DimPlot(Star, reduction = "umap.atac", group.by = "seurat_clusters", label = TRUE, label.size = 8, repel = TRUE) + ggtitle("ATAC")
p3 <- DimPlot(Star, reduction = "umap.wnn", group.by = "seurat_clusters", label = TRUE, label.size = 8, repel = TRUE) + ggtitle("WNN")
p1 + p2 + p3 & NoLegend()
d. snRNA-seq analysis: Cell states are characterized and annotated by identifying marker genes through differential expression analysis, both in the clusters identified by pseudotemporal ordering and in the WNN clusters. Cell types are defined using known gene markers. For example, Myod1, Myog, and Myf5 are myogenic markers, and Ascl1, Neurod4, and Nhlh1 are neurogenic markers. Pax7 drives both myogenesis and neurogenesis. Meis1 and Pbx1 are anterior presomitic mesoderm (aPSM) markers. As an example, here we analyze the myogenic genes Myod1 and Myog.
i. Find markers.
DefaultAssay(Star) <- "RNA"
Star.rna.markers <- FindAllMarkers(Star, assay = "RNA", only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
Star.rna.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
ii. Add cell state annotations.
Star <- RenameIdents(Star, '10' = 'cell_5', '11' = 'cell_2')
Star <- RenameIdents(Star, '5' = 'cell_4', '6' = 'cell_1', '7' = 'cell_1', '8' = 'cell_3', '9' = 'cell_5')
Star <- RenameIdents(Star, '0' = 'cell_1', '1' = 'cell_2', '2' = 'cell_1', '3' = 'cell_1', '4' = 'cell_3')
Star$celltype <- Idents(Star)
CRITICAL: Cell states can be assigned using known markers. We recommend writing Star.rna.markers to a file and examining the markers that could be used to annotate the clusters.
iii. Visualize the cell states (Figure 6D).
p1 <- DimPlot(Star, reduction = "umap", group.by = "celltype", label = FALSE, label.size = 8, repel = TRUE) + ggtitle("RNA")
p2 <- DimPlot(Star, reduction = "umap.atac", group.by = "celltype", label = FALSE, label.size = 8, repel = TRUE) + ggtitle("ATAC")
p3 <- DimPlot(Star, reduction = "umap.wnn", group.by = "celltype", label = FALSE, label.size = 8, repel = TRUE) + ggtitle("WNN")
p1 + p2 + p3
e. snATAC-seq analysis.
i. Load libraries.
library(chromVAR)
library(motifmatchr)
library(JASPAR2020)
library(TFBSTools)
library(BSgenome.Mmusculus.UCSC.mm10)
ii. Find snATAC markers.
DefaultAssay(Star) <- "ATAC"
Star.atac.markers <- FindAllMarkers(Star, assay = "ATAC", only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
Star.atac.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_log2FC)
iii. Add motif information.
pwm_set <- getMatrixSet(x = JASPAR2020, opts = list(collection = "CORE", tax_group = 'vertebrates', all_versions = FALSE))
Star <- AddMotifs(object = Star, genome = BSgenome.Mmusculus.UCSC.mm10, pfm = pwm_set, assay = "ATAC")
iv. Compute motif activities.
Star <- RunChromVAR(object = Star, genome = BSgenome.Mmusculus.UCSC.mm10)
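The cluster-to-state assignment in step 21d can also be expressed as a single lookup table, which makes the mapping easier to review. The sketch below uses base R only: the mapping values are copied from the RenameIdents calls in the protocol, while the `clusters` vector is an invented example standing in for per-cell cluster labels.

```r
# Cluster-to-state mapping, copied from the RenameIdents calls in step 21d.
state_map <- c("0" = "cell_1", "1" = "cell_2", "2" = "cell_1", "3" = "cell_1",
               "4" = "cell_3", "5" = "cell_4", "6" = "cell_1", "7" = "cell_1",
               "8" = "cell_3", "9" = "cell_5", "10" = "cell_5", "11" = "cell_2")

# Invented per-cell cluster labels (as character strings, like factor levels).
clusters <- c("0", "4", "9", "11", "5")

# Guard against clusters that have no assigned state.
unmapped <- setdiff(clusters, names(state_map))
if (length(unmapped) > 0) {
  stop("No state assigned for cluster(s): ", paste(unmapped, collapse = ", "))
}

# Look up the state of each cell by its cluster label.
celltype <- unname(state_map[clusters])
print(table(celltype))
```

Keeping the mapping in one named vector makes it simple to rerun the annotation if the clustering resolution changes and the cluster numbers shift.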
Figure 5.
Multiomics data quality control
(A) snRNA and snATAC QC plot before removing low quality cells.
(B) snRNA and snATAC QC plot after removing low quality cells.
Figure 6.
Characterization and annotation of cell states
(A) UMAP visualization of the clustering based on snRNA-seq, snATAC-seq, and WNN analysis before cell state annotation.
(B) Pseudotime single cell trajectory plot. The heatmap represents units of progress, with 1 located at the root of the trajectory.
(C) Cell states derived from pseudotime trajectory inference. State 2 and state 4 are marked NA (not assigned), since they may represent transitioning states and could not be unambiguously assigned to a specific developmental stage.
(D) UMAP visualization of cell states after annotated clustering.
Part 2: Data visualization
Timing: 1 h (for step 22)
In this section, we describe steps for data visualization.
22. Pseudotime analysis:
a. Convert Seurat object to Monocle object.
DefaultAssay(Star) <- "RNA"
set.seed(22)
cds <- SeuratWrappers::as.cell_data_set(Star, assay = "RNA", reduction = "umap", group.by = "celltype")
cds@rowRanges@elementMetadata@listData[["gene_short_name"]] <- rownames(Star[["RNA"]])
b. Create CDS object.
cds <- preprocess_cds(cds, method = "PCA")
cds <- reduce_dimension(cds, preprocess_method = "PCA", umap.n_neighbors = 14L, reduction_method = "UMAP")
cds <- cluster_cells(cds, reduction_method = "UMAP")
cds <- learn_graph(cds, use_partition = FALSE, close_loop = FALSE)
c. Set the root with Seurat cluster 0 and order cells.
cell_ids <- colnames(cds)[Star$seurat_clusters == "0"]
closest_vertex <- cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
closest_vertex <- closest_vertex[cell_ids, ]
closest_vertex <- as.numeric(names(which.max(table(closest_vertex))))
mst <- principal_graph(cds)$UMAP
root_pr_nodes <- igraph::V(mst)$name[closest_vertex]
rowData(cds)$gene_name <- rownames(cds)
rowData(cds)$gene_short_name <- rowData(cds)$gene_name
cds <- order_cells(cds, root_pr_nodes = root_pr_nodes)
d. Visualize trajectory plot (Figure 6B).
plot_cells(cds, color_cells_by = "pseudotime", label_cell_groups = T, label_leaves = F, label_branch_points = F, show_trajectory_graph = T, graph_label_size = 3, label_groups_by_cluster = T)
-
e. Visualize cell states derived from trajectory inference (Figure 6C).
plot_cells(cds, color_cells_by = "cluster", cell_size = 1, label_cell_groups = TRUE, group_label_size = 4, show_trajectory_graph = FALSE, label_branch_points = FALSE, label_roots = FALSE, label_leaves = FALSE)
-
f. Visualize paired expression plots of Myod1 and Myog (Figure 7A).
Star.seur <- as.Seurat(cds, assay = NULL, clusters = "UMAP")
Star.seur <- AddMetaData(Star.seur, metadata = cds@principal_graph_aux$UMAP$pseudotime, col.name = "monocle3_pseudotime")
FeaturePlot(Star.seur, features = c("Myod1", "Myog"), reduction = "UMAP", combine = TRUE, blend = TRUE, blend.threshold = 0.0, min.cutoff = 0, max.cutoff = 6)
-
g. Visualize footprinting plots (Figure 7B).
Star_135 <- subset(x = Star, idents = c("cell_1", "cell_3", "cell_5"), invert = FALSE)
DefaultAssay(Star_135) <- "ATAC"
Star_135 <- Footprint(object = Star_135, motif.name = c("MYOG", "MYOD1"), genome = BSgenome.Mmusculus.UCSC.mm10)
PlotFootprint(Star_135, features = c("MYOD1")) + patchwork::plot_layout(ncol = 1)
PlotFootprint(Star_135, features = c("MYOG")) + patchwork::plot_layout(ncol = 1)
Note: cell_1 corresponds to aPSM cells, cell_3 to the neurogenic cluster, and cell_5 to the myogenic cluster.
Figure 7.
Visualization of myogenic cells
(A) Individual and paired expression plots of Myod1 and Myog in cell states derived from pseudotime trajectory inference.
(B) Myod1 and Myog footprinting profiles in aPSM, neurogenic, and myogenic clusters.
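The root selection in step 22c amounts to a majority vote: among the cells of the chosen starting cluster, the principal-graph vertex that is closest to the largest number of cells becomes the root. The following base-R sketch illustrates that logic on made-up data. The cell names, cluster labels, vertex indices, and the helper pick_root_vertex are all hypothetical and are not part of the protocol code.

```r
# Toy illustration of the majority-vote root selection in step 22c.
# All values below are invented for demonstration purposes.
cells <- paste0("cell", 1:8)
seurat_clusters <- c("0", "0", "0", "1", "1", "0", "2", "2")

# One-column matrix: index of the closest principal-graph vertex per cell,
# mimicking the layout of pr_graph_cell_proj_closest_vertex.
closest_vertex <- matrix(c(3, 3, 5, 7, 7, 3, 9, 9), ncol = 1,
                         dimnames = list(cells, NULL))

# Hypothetical helper: return the vertex index shared by most root cells.
pick_root_vertex <- function(closest_vertex, root_cells) {
  votes <- table(closest_vertex[root_cells, ])
  as.numeric(names(which.max(votes)))
}

root_cells <- cells[seurat_clusters == "0"]
root_vertex <- pick_root_vertex(closest_vertex, root_cells)
print(root_vertex)
```

In the protocol, the resulting index is then used to look up the vertex name in the principal graph before calling order_cells. Choosing a different starting cluster only requires changing the cluster label used to select the root cells.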
Expected outcomes
This protocol provides a resource for profiling the transcriptional and chromatin accessibility features of pluripotent ESCs, mesoderm-induced ESCs, and ESC-derived cell lineages. Expression profiles and chromatin accessibility are determined for each developmental time point. Transcriptomic changes across differentiation time points are revealed by the pipeline that integrates individual scRNA-seq datasets (protocol 1: integrated data analysis, step 1), and the correlation between gene expression and chromatin accessibility is revealed by the pipeline that integrates scRNA-seq and scATAC-seq datasets (protocol 1: integrated data analysis, step 2). In addition, multiomics datasets can be visualized and cell states inferred through the multiomics analysis pipeline (protocol 2: omics analysis).
Limitations
The protocols are based on the R package Seurat and are run in an R/RStudio environment. Users who need to run the protocols in high-performance computing environments will require batch-processing tools such as Swarm. Furthermore, the data integration parameters were chosen by heuristic hyperparameter tuning on datasets from specific time points. An automatic tuning module would therefore be needed to explore optimal hyperparameters for new datasets. In addition, users could compare the outputs of these protocols with results from other single-cell packages such as SCANPY if a module for converting data between R and Python formats is developed.
Troubleshooting
Problem 1
Unable to run the Docker image with Docker Desktop.
Potential solution
In the software preparation step, follow the instructions in Docker_manual_mac.docx or Docker_manual_windowOS.docx to set up the Docker Desktop environment properly.
Problem 2
R packages cannot be loaded with the library() command.
Potential solution
Run the following code in the R environment:
p <- installed.packages()
# Package names are stored as the row names of the returned matrix
rownames(p)
If a package is not listed after running the code, visit the Bioconductor website (https://www.bioconductor.org/), search for the package, and follow its installation guidelines. If the package cannot be found in Bioconductor, run install.packages("package_name") in the R environment. More details and examples can be found in Software_preparation.R.
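As a convenience, the check above can be wrapped so that it reports only the packages that are missing. The sketch below uses base R only. The helper missing_packages and the example package list are illustrative assumptions; the actual list of packages should be taken from Software_preparation.R.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  pkgs[!pkgs %in% installed]
}

# Example list for illustration only; replace with the protocol's packages.
# "notARealPackage123" is a deliberately non-existent name.
required <- c("stats", "utils", "notARealPackage123")
print(missing_packages(required))
```

Any names returned can then be installed from Bioconductor or CRAN as described above.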
Problem 3
Data files cannot be loaded.
Potential solution
Check whether the files are in the expected folder. If they are, check that their names match the file names used in the code.
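Both checks can be done from within R before loading any data. The following is a minimal sketch using base R. The folder name, the file names, and the helper check_files are placeholders and should be replaced with the paths used in your own analysis.

```r
# Hypothetical helper: report which of the given paths exist on disk.
check_files <- function(paths) {
  data.frame(path = paths, found = file.exists(paths),
             stringsAsFactors = FALSE)
}

# Placeholder folder and file names, for illustration only.
data_dir <- "data"
expected <- file.path(data_dir, c("sample1.rds", "sample2.rds"))

print(check_files(expected))

# Listing the folder helps to spot misspelled or differently cased names.
print(list.files(data_dir))
```

Note that file names are case-sensitive on Linux, so a name that loads on MacOS or Windows may fail on a cluster.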
Problem 4
Monocle3 failed to be installed.
Potential solution
-
•
Install monocle3: Monocle3 runs in the R statistical computing environment. R version 4.2.2 or higher is required.
-
•
Install a few Bioconductor dependencies that aren’t automatically installed.
BiocManager::install(c('BiocGenerics', 'DelayedArray', 'DelayedMatrixStats', 'limma', 'lme4', 'S4Vectors', 'SingleCellExperiment', 'SummarizedExperiment', 'batchelor', 'Matrix.utils', 'HDF5Array', 'terra', 'ggrastr'))
-
•
Install monocle3 from the cole-trapnell-lab GitHub repository using the first two commands below. To confirm that monocle3 was installed correctly, start a new R session and run the last command.
install.packages('devtools')
devtools::install_github('cole-trapnell-lab/monocle3')
library(monocle3)
CRITICAL: monocle3 installation can be difficult. Troubleshooting guidance is available on the cole-trapnell-lab GitHub page (https://cole-trapnell-lab.github.io/monocle3/docs/installation).
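Before attempting the installation, it can help to confirm that the running R session meets the version requirement stated above. The sketch below uses only base R. The helper r_version_ok is a hypothetical name, and the default threshold is the 4.2.2 minimum given in this protocol.

```r
# Hypothetical helper: TRUE if the running R is at least `minimum`.
r_version_ok <- function(minimum = "4.2.2") {
  getRversion() >= minimum
}

if (r_version_ok()) {
  message("R ", getRversion(), " meets the stated requirement.")
} else {
  message("R ", getRversion(), " is older than 4.2.2; update R first.")
}
```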
Problem 5
Plots cannot be drawn.
Potential solution
Run the following code in the R environment:
gg2 <- try(find.package("ggplot2"), silent = TRUE)
gg2
If the package cannot be found after running the code, run install.packages("ggplot2") in the R environment.
Resource availability
Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Vittorio Sartorelli (sartorev@mail.nih.gov).
Materials availability
This study did not generate unique reagents.
Data and code availability
Original data and code have been deposited at Zenodo: https://doi.org/10.5281/zenodo.7224723.
Acknowledgments
We thank the NIAMS Genomic Technology, Biodata Mining and Discovery, Flow Cytometry, and Light Imaging Sections. Dr. Hong-Wei Sun and Dr. Stephen Brooks (Biodata Mining and Discovery Section) provided useful suggestions for data analysis. This study utilized the high-performance computational capabilities of the Helix Systems at the NIH, Bethesda, MD, USA (https://helix.nih.gov/). This work was supported in part by the Intramural Research Program of the NIAMS at the NIH (grants AR041126 and AR041164 to V.S.).
Author contributions
K.D.K. and K.J. analyzed and interpreted data and drafted the manuscript. S.D.O. and V.S. edited the manuscript and supervised the project.
Declaration of interests
The authors declare no competing interests.
Contributor Information
Kyung Dae Ko, Email: kyungdae.ko@nih.gov.
Kan Jiang, Email: kan.jiang@nih.gov.
Vittorio Sartorelli, Email: vittorio.sartorelli@nih.gov.
References
- 1. Khateb M., Perovanovic J., Ko K.D., Jiang K., Feng X., Acevedo-Luna N., Chal J., Ciuffoli V., Genzor P., Simone J., et al. Transcriptomics, regulatory syntax, and enhancer identification in mesoderm-induced ESCs at single-cell resolution. Cell Rep. 2022;40:111219. doi: 10.1016/j.celrep.2022.111219.
- 2. RStudio Team. RStudio. Integrated Development for R; 2022.
- 3. Stuart T., Butler A., Hoffman P., Hafemeister C., Papalexi E., Mauck W.M., 3rd, Hao Y., Stoeckius M., Smibert P., Satija R. Comprehensive integration of single-cell data. Cell. 2019;177:1888–1902.e21. doi: 10.1016/j.cell.2019.05.031.
- 4. Stuart T., Srivastava A., Madad S., Lareau C.A., Satija R. Single-cell chromatin state analysis with Signac. Nat. Methods. 2021;18:1333–1341. doi: 10.1038/s41592-021-01282-5.
- 5. Korsunsky I., Millard N., Fan J., Slowikowski K., Zhang F., Wei K., Baglaenko Y., Brenner M., Loh P.R., Raychaudhuri S. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods. 2019;16:1289–1296. doi: 10.1038/s41592-019-0619-0.
- 6. Cao J., Spielmann M., Qiu X., Huang X., Ibrahim D.M., Hill A.J., Zhang F., Mundlos S., Christiansen L., Steemers F.J., et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature. 2019;566:496–502. doi: 10.1038/s41586-019-0969-x.
- 7. Fornes O., Castro-Mondragon J.A., Khan A., van der Lee R., Zhang X., Richmond P.A., Modi B.P., Correard S., Gheorghe M., Baranašić D., et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2020;48:D87–D92. doi: 10.1093/nar/gkz1001.
- 8. Tan G., Lenhard B. TFBSTools: an R/Bioconductor package for transcription factor binding site analysis. Bioinformatics. 2016;32:1555–1556. doi: 10.1093/bioinformatics/btw024.