Protocol for directly selecting cell type marker genes for single-cell clustering analyses by Festem

Zihao Chen; Changhu Wang; Ruibin Xi

STAR Protoc. 2024 Dec 18;6(1):103514. doi: 10.1016/j.xpro.2024.103514

PMCID: PMC11728985 PMID: 39700012

Summary

Feature selection by expectation maximization test (Festem) enables the direct selection of cell type marker genes, facilitating downstream clustering of single-cell RNA sequencing (scRNA-seq) data. Here, we present a protocol for using Festem to identify marker genes in scRNA-seq data and perform subsequent analyses. We describe comprehensive steps for setting up the environment, marker gene selection, clustering, and marker gene assignment. This protocol yields both clustering results and identified marker genes, enhancing the interpretation of biological information in scRNA-seq data.

For complete details on the use and execution of this protocol, please refer to Chen et al.¹

Subject areas: bioinformatics, single cell, RNA-seq, systems biology

Highlights

• Instructions for directly selecting cell type marker genes using Festem
• Steps for clustering using Festem-selected marker genes
• Guidance on batch effect removal based on Festem-selected genes

Publisher’s note: Undertaking any experimental protocol requires adherence to local institutional guidelines for laboratory safety and ethics.

Before you begin

In single-cell RNA sequencing (scRNA-seq) research, cell types and their marker genes are typically identified through clustering and differentially expressed gene (DEG) analysis. Traditionally, genes are first selected based on surrogate criteria such as variance and deviance. The selected genes are then used for clustering, and markers are identified through DEG analysis that treats the resulting clusters as known cell types. However, surrogate criteria may overlook crucial genes or include irrelevant ones, and DEG analysis can suffer from selection bias.² To address these limitations, we developed Festem, a method that directly selects marker genes for cell-type identification by exploiting the intrinsic clustering information within each gene’s expression distribution. By doing so, Festem circumvents the pitfalls of surrogate criteria and avoids the selection bias of existing DEG methods.

Installation and environment setup

Timing: ∼1 h

1.
Install Miniconda.
- a.
  Download Miniconda from https://docs.anaconda.com/miniconda/.
- b.
  Run the following command in the terminal.

bash Miniconda3-latest-Linux-x86_64.sh

Note: The installation process may vary depending on your operating system (Windows, macOS, or Linux). For comprehensive guidance, please refer to the official Miniconda installation instructions at the following link: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html.

2.
Create virtual environment.
- a.
  Run the following command in the terminal.

conda create -n Festem python=3.12.2

conda activate Festem

3.
Install R using Miniconda.
- a.
  Run the following command in the terminal.

conda install conda-forge::r-base=4.4.1

Note: If you are using Windows, you will also need to install RTools (4.4). You can download it from the following link: https://cran.r-project.org/bin/windows/Rtools/.

4.
Install R packages. Troubleshooting 1 and Troubleshooting 2.
- a.
  Install Seurat and devtools. Run the following command in the conda terminal.
  conda install conda-forge::r-seurat=5.1.0
  
  conda install conda-forge::r-devtools
- b.
  Install related packages. Run the following command in R Console. (Access R Console by running the command “R” in the conda terminal).
  install.packages("BiocManager")
  
  BiocManager::install("edgeR")
  
  devtools::install_github("satijalab/seurat-data")
  Note: Seurat utilizes several packages to significantly enhance speed and performance. Based on the developers’ recommendations, we suggest installing the following packages:
  
  setRepositories(ind = 1:3, addURLs = c("https://satijalab.r-universe.dev", "https://bnprks.r-universe.dev/"))
  
  install.packages(c("BPCells", "presto", "glmGamPoi"))
  Optional: If you intend to process scRNA-seq data with multiple batches, we recommend installing the Harmony package for effective batch removal.
  
  install.packages("harmony")
- c.
  Install Festem. Troubleshooting 3.
  devtools::install_github("XiDsLab/Festem")
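Before moving on, it may help to confirm that R can find the packages installed above. The following is a minimal sketch using only base R; the helper name `check_installed` is hypothetical (it is not part of Festem or Seurat), and the package list simply repeats the names used in steps 4a–4c.

```r
# Report which of the given packages can be found in the current R library paths.
# `check_installed` is a hypothetical helper, not part of Festem or Seurat.
check_installed <- function(pkgs) {
  found <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  if (any(!found)) {
    message("Missing packages: ", paste(pkgs[!found], collapse = ", "))
  }
  found
}

# Packages installed in step 4 (harmony is optional).
pkgs <- c("Seurat", "SeuratData", "edgeR", "Festem", "harmony")
status <- check_installed(pkgs)
print(status)
```

Any package reported as missing can be reinstalled with the corresponding command in step 4 before continuing.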

Data collection

Timing: 5 min

5.
Download an scRNA-seq dataset using SeuratData. Run the following command in R Console. Troubleshooting 4.

library(SeuratData)

options(timeout = 1000)

InstallData("ifnb")

Note: Single-cell datasets analyzed in this protocol were preprocessed and deposited into Zenodo: https://doi.org/10.5281/zenodo.11331165.

Key resources table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data

Immune cell gene expressions from eight patients with lupus	Kang et al.³	GEO: GSE96583

Software and algorithms

Miniconda3	Anaconda, Inc.	https://docs.anaconda.com/miniconda/
R v.4.4.1	R Core Team	https://www.r-project.org/
Bioconductor v.1.30.23	Bioconductor Core Team⁴	https://bioconductor.org/
Festem v.1.2.1	Chen et al.¹	https://github.com/XiDsLab/Festem
Seurat v.5.1.0	Hao et al.⁵	https://cloud.r-project.org/web/packages/Seurat/index.html
SeuratData v.0.2.2.9001	Satija et al.⁶	https://github.com/satijalab/seurat-data
devtools v.2.4.5	Wickham et al.⁷	https://cran.r-project.org/web/packages/devtools/index.html
edgeR v.3.32.1	Robinson et al.⁸	https://bioconductor.org/packages/release/bioc/html/edgeR.html
ScottKnott v.1.3–1	Jelihovschi et al.⁹	https://cran.r-project.org/web/packages/ScottKnott/index.html
harmony v.1.2.0	Korsunsky et al.¹⁰	https://cran.r-project.org/web/packages/harmony/index.html
dplyr v.1.1.4	Wickham et al.¹¹	https://cloud.r-project.org/web/packages/dplyr/index.html

Other

Personal computer	HP, Inc.	HP Star Book Pro 14 (AMD Ryzen 7 8845H processor)

Materials and equipment

All analyses performed here (and the associated timing estimates) were conducted on a personal computer with an AMD Ryzen 7 8845H processor with 8 cores and 32 GB of RAM (running on Windows 11).

Step-by-step method details

The Festem protocol consists of three main parts: marker gene identification (identifying all heterogeneously distributed genes), cell clustering, and assignment of marker genes to clusters (determining which cluster each marker gene represents).

For scRNA-seq data, different instructions are required depending on the presence of batch effects. To ensure clarity, we provide two variants of the Festem protocol to handle both types of data: For data without batch effects, follow steps 1–7; for multi-batch data, follow steps 8–14.

Single scRNA-seq dataset workflow

Timing: <1 min (for step 1)

Timing: ∼5 min (for step 2)

Timing: <1 min (for step 3)

Timing: ∼1 s (for step 4)

Timing: ∼1 min (for steps 5–7)

1.
Package and data import. Troubleshooting 5.

library(Seurat)

library(SeuratData)

library(Festem)

library(dplyr)

data("ifnb")

if (as.numeric(substr(packageVersion("SeuratObject"),1,1))==5){

ifnb <- UpdateSeuratObject(ifnb)

}

ifnb <- ifnb[,ifnb@meta.data$stim=="CTRL"]

2.
Run Festem to select cell-type marker genes. Troubleshooting 6 and Troubleshooting 7.

ifnb <- RunFestem(ifnb, num.threads = 4)

Note: To speed up Festem, you can enable parallelization by setting the “num.threads” parameter to the desired number of CPU cores.
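One way to choose a value for “num.threads” is to query the number of available cores from R. The sketch below uses `parallel::detectCores()` from base R and leaves one core free; the variable names and the "leave one core free" rule are our assumptions, not a Festem recommendation.

```r
# Pick a thread count for RunFestem, leaving one core free for the system.
# parallel::detectCores() can return NA on some platforms, so fall back to 1.
n_cores <- parallel::detectCores()
n_threads <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)
print(n_threads)

# Then, in the protocol:
# ifnb <- RunFestem(ifnb, num.threads = n_threads)
```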

3.
Clustering using Festem-selected marker genes.

gene_set <- rownames(ifnb)[ifnb[["RNA"]][[]][,"Festem_rank"] <= 2500]

ifnb <- NormalizeData(ifnb)

ifnb <- ScaleData(ifnb, features = gene_set)

ifnb <- RunPCA(ifnb , verbose = FALSE, features = gene_set)

ifnb <- FindNeighbors(object = ifnb, dims = 1:20)

ifnb <- FindClusters(object = ifnb, resolution = 1.5)

ifnb <- RunTSNE(ifnb, reduction = "pca", dims = 1:20)
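The first line of step 3 keeps the genes whose “Festem_rank” is at most 2500. The toy example below illustrates that selection with base R only; the data frame is a made-up stand-in for the feature-level metadata returned by `ifnb[["RNA"]][[]]`, and the gene names, ranks, and cutoff of 3 are invented for illustration.

```r
# Toy stand-in for the feature-level metadata returned by ifnb[["RNA"]][[]].
# Gene names and ranks below are made up for illustration only.
feature_meta <- data.frame(
  Festem_rank = c(3, 1, 5, 2, 4),
  row.names = c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E")
)

# Keep the top-ranked genes (the protocol uses a cutoff of 2500; 3 is used here).
n_top <- 3
keep <- feature_meta[, "Festem_rank"] <= n_top

# which() guards against NA ranks, should any be present.
gene_set <- rownames(feature_meta)[which(keep)]
print(gene_set)
```

Raising or lowering the cutoff changes how many genes enter scaling and PCA, so it is worth checking `length(gene_set)` before clustering.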

4.
Visualize the clusters (Figure 1A).

DimPlot(ifnb, label = T) + NoLegend()

5.
Assign Festem-detected marker genes to identified clusters (Figure 1B).

marker <- AllocateMarker(ifnb,VariableFeatures(ifnb))

6.
Visualize marker gene expression across clusters and annotate clusters (Figures 2A and 2B). Troubleshooting 8.

ifnb <- RenameIdents(ifnb, "0" = "CD4 Naive T", "1" = "CD4 Memory T", "2" = "CD14 Monocyte", "3" = "CD14 Monocyte", "4" = "CD16 Monocyte", "5" = "CD14 Monocyte", "6" = "B", "7" = "T Activated", "8" = "NK", "9" = "CD4 Memory T", "10" = "DC", "11" = "CD8 T", "12" = "B Activated", "13" = "T cell:Monocyte Complex", "14" = "HSP+ CD4 T", "15" = "IFNhi CD14 Monocyte", "16" = "Mk", "17" = "pDC", "18" = "CD34+ Progenitors", "19" = "CD14 Monocyte")

DimPlot(ifnb, label = T) + NoLegend()

MarkerHeatmap(ifnb, VariableFeatures(ifnb))

CRITICAL: When using different operating systems, the cluster index and t-SNE plot may exhibit slight variations compared to Figure 1A. Before renaming them, users should verify the expression of canonical cell type markers (e.g., those in Figure 2C) to prevent errors.
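The check recommended above can be sketched on a toy matrix: compute the mean expression of a few canonical markers within each cluster and see which marker dominates. This is an illustration in base R with made-up expression values and cluster labels; it is not part of Festem, and with a real Seurat object the dot plot in step 7 serves the same purpose.

```r
# Toy normalized expression matrix (genes x cells); values are made up.
expr <- matrix(
  c(5, 4, 0, 0, 1, 0,   # CD3D  (T cells)
    0, 0, 6, 5, 0, 1,   # MS4A1 (B cells)
    0, 1, 0, 0, 7, 6),  # CD14  (monocytes)
  nrow = 3, byrow = TRUE,
  dimnames = list(c("CD3D", "MS4A1", "CD14"), paste0("cell", 1:6))
)
clusters <- factor(c("0", "0", "1", "1", "2", "2"))

# Mean expression of each canonical marker within each cluster.
cluster_means <- sapply(levels(clusters), function(cl) {
  rowMeans(expr[, clusters == cl, drop = FALSE])
})
print(cluster_means)

# For each cluster, the marker with the highest mean expression.
top_marker <- rownames(cluster_means)[apply(cluster_means, 2, which.max)]
names(top_marker) <- colnames(cluster_means)
print(top_marker)
```

If the dominant marker of a cluster does not match the label you are about to assign in `RenameIdents`, the cluster indices on your system likely differ from those shown in step 6.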

7.
Check canonical markers of annotated cell-types (Figure 2C).

# Check canonical markers

marker_list <- c("CD3D","CREM","IL7R","CCR7","CD27","SELL","GIMAP5","CACYBP","TCF7","GNLY","NKG7","CCL5","CD247","GZMB","CD8A","MS4A1","CD79A","CD37","MIR155HG","NME1","FCGR3A","VMO1","MS4A7","CCL2","S100A9","CD14","LYZ","HLA-DQA1","GPR183","FCER1A","CST3","CD1C","TSPAN13","IL3RA","IGJ","HSPA1A","HSPB1","HSPA1B","HSPH1","HSPE1","HSPD1","CD34","TPSAB1","GATA2","SNHG7","ISG15","ISG20","IFI6","IFIT1","PPBP","PF4")

# Reorder cell types

ifnb@active.ident <- factor(ifnb@active.ident,levels = c("Mk","T cell:Monocyte Complex","IFNhi CD14 Monocyte","CD34+ Progenitors","HSP+ CD4 T","pDC","DC","CD14 Monocyte","CD16 Monocyte","B Activated","B","CD8 T","NK","T Activated","CD4 Naive T","CD4 Memory T"))

ifnb <- ScaleData(ifnb,features = marker_list)

DotPlot(ifnb, features = marker_list, cols = c("blue", "red"), dot.scale = 8, idents = levels(ifnb@active.ident)) + RotatedAxis()

Figure 1. Clustering and marker genes in the control group of the IFNB dataset

(A) t-SNE plot of the control group in the IFNB dataset generated from step 4.

(B) An example of marker genes detected by Festem from step 5. p values are adjusted with the Benjamini-Hochberg method.

Figure 2. Cell type annotations and marker genes in the control group of the IFNB dataset

(A) Cell type annotations of cells generated from step 6.

(B) Heatmap for marker gene expression generated from step 6.

(C) The expressions of canonical markers of different cell types generated from step 7.

Multi-batch scRNA-seq dataset workflow

Timing: <1 min (for step 8)

Timing: ∼10 min (for step 9)

Timing: ∼1 min (for steps 10 and 11)

Timing: ∼2 min (for step 12)

Timing: <1 min (for steps 13 and 14)

In our pipeline, we first use Festem to select marker genes, and then apply batch correction. Festem is applied to each batch individually, and the results are combined to obtain the final set of marker genes. The rationale behind this approach is that each batch contains valuable information about whether a gene is a marker. We can evaluate whether a gene is a marker gene for each batch individually and then combine evidence from all batches to achieve a more confident evaluation. The advantages of this method include: (1) minimal influence of batch effects on marker gene selection; (2) independence from the batch correction method used; (3) easy parallelization, making it computationally efficient for large datasets.

8.
Package and data import.

library(Seurat)

library(SeuratData)

library(Festem)

library(dplyr)

data("ifnb")

if (as.numeric(substr(packageVersion("SeuratObject"),1,1))==5){

ifnb <- UpdateSeuratObject(ifnb)

}

9.
Run Festem to select cell-type marker genes. Troubleshooting 6 and Troubleshooting 7.

ifnb <- RunFestem(ifnb, batch = "stim", num.threads = 4)

Note: You can also provide a vector containing batch labels for each cell, for example:

ifnb <- RunFestem(ifnb, batch = ifnb@meta.data$stim, num.threads = 4)

10.
Batch removal based on Festem-selected marker genes via Harmony.¹⁰

gene_set <- rownames(ifnb)[ifnb[["RNA"]][[]][,"Festem_rank"] <= 2500]

ifnb <- NormalizeData(ifnb)

ifnb <- ScaleData(ifnb, features = gene_set)

ifnb <- RunPCA(ifnb , verbose = FALSE, features = gene_set)

ifnb <- harmony::RunHarmony(ifnb,"stim",plot_convergence = T, lambda = 10)

11.
Clustering using Festem-selected marker genes.

ifnb <- FindNeighbors(object = ifnb, dims = 1:30, reduction = "harmony")

ifnb <- FindClusters(object = ifnb, resolution = 1.25)

ifnb <- RunTSNE(ifnb, reduction = "harmony", dims = 1:30)

12.
Assign Festem-detected marker genes to identified clusters.

marker <- AllocateMarker(ifnb,VariableFeatures(ifnb))

13.
Visualize marker gene expression across clusters and annotate clusters (Figures 3A and 3B). Troubleshooting 8.

ifnb <- RenameIdents(ifnb, "0" = "IFNhi CD14 Monocyte", "1" = "CD4 Naive T", "2" = "CD14 Monocyte", "3" = "CD4 Memory T", "4" = "CD16 Monocyte", "5" = "B", "6" = "CD8 T", "7" = "NK", "8" = "T Activated", "9" = "HSP+ CD4 T", "10" = "DC", "11" = "B Activated", "12" = "T cell:Monocyte Complex", "13" = "CD4 Naive T", "14" = "CD8 T", "15" = "Mk", "16" = "pDC", "17" = "B", "18" = "CD34+ Progenitors", "19" = "Eryth")

DimPlot(ifnb, label = T) + NoLegend()

MarkerHeatmap(ifnb, VariableFeatures(ifnb))

14.
Check canonical markers of annotated cell-types (Figure 3C).

# Check canonical markers (reuses marker_list from step 7; define it first if you skipped steps 1–7)

# Reorder cell types

ifnb@active.ident <- factor(ifnb@active.ident,levels = c("Eryth","Mk","T cell:Monocyte Complex","IFNhi CD14 Monocyte","CD34+ Progenitors","HSP+ CD4 T","pDC","DC","CD14 Monocyte","CD16 Monocyte","B Activated","B","CD8 T","NK","T Activated","CD4 Naive T","CD4 Memory T"))

ifnb <- ScaleData(ifnb,features = marker_list)

DotPlot(ifnb, features = marker_list, cols = c("blue", "red"), dot.scale = 8, idents = levels(ifnb@active.ident)) + RotatedAxis()

Figure 3. Joint analysis of the control and stimulated groups in the IFNB dataset

(A) Cell type annotations of cells generated from step 13.

(B) Heatmap for marker gene expression generated from step 13.

(C) The expressions of canonical markers of different cell types generated from step 14.

Expected outcomes

Running the above pipelines will generate a Seurat object containing marker genes identified by Festem, along with clustering and dimensionality reduction results (Figure 1A). Specifically, the detected marker genes will be stored as “VariableFeatures” in the Seurat object, with their adjusted p-values located in the “meta.data” of the active assay. Additionally, a data frame containing fold-changes and adjusted p-values of marker genes will be generated (Figure 1B).

Quantification and statistical analysis

In this protocol, all p-values are adjusted using the Benjamini-Hochberg procedure.¹² When the data consist of multiple batches, the per-batch p-values are first combined using the Bonferroni method¹³ before FDR adjustment. Marker genes are called at an FDR significance level of 0.05.
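The two-stage adjustment described above can be illustrated with base R. The sketch below applies the standard Bonferroni combination (K times the smallest of K p-values, capped at 1) followed by `p.adjust(..., method = "BH")`. The p-values are made up, and we assume this textbook form of the combination; it is not taken from Festem's source code.

```r
# Toy per-batch p-values: rows are genes, columns are batches (made-up numbers).
p_batch <- matrix(
  c(0.001, 0.004,
    0.200, 0.030,
    0.600, 0.900,
    0.010, 0.020),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("CTRL", "STIM"))
)

# Bonferroni combination: K times the smallest p-value across K batches, capped at 1.
K <- ncol(p_batch)
p_combined <- pmin(1, K * apply(p_batch, 1, min))

# Benjamini-Hochberg adjustment of the combined p-values.
p_adj <- p.adjust(p_combined, method = "BH")

# Genes called markers at an FDR level of 0.05.
markers <- names(p_adj)[p_adj < 0.05]
print(round(p_adj, 4))
print(markers)
```

In this toy example, gene1 and gene4 pass the 0.05 threshold, while gene2 has a small p-value in only one batch and is not called after combination and adjustment.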

Limitations

Festem assumes that gene expression follows a negative binomial distribution, a model suitable for most scRNA-seq count data, especially the most widely used unique molecular identifier (UMI) data.¹⁴^,¹⁵ Festem may have low precision or recall if the scRNA-seq data significantly violate this assumption. Another limitation of Festem is its higher computational time compared to many gene selection methods, owing to the use of EM iterations. Nevertheless, Festem can concurrently perform gene selection and identify cell-type markers with computational times that are comparable to or faster than many widely used DEG identification methods.¹ Therefore, we consider the computational time required by Festem to be largely acceptable.

Troubleshooting

Problem 1

Failed to install R package igraph. R Console reports an error: “installation of package 'igraph' had non-zero exit status”.

Potential solution

This issue may arise due to a missing or non-default location of the GLPK package, particularly on Linux systems without root privileges. To resolve this, install igraph using Miniconda before installing Seurat with the following code:

conda install conda-forge::r-igraph

Problem 2

Conda dependency resolution takes a long time.

Potential solution

You can use mamba as an alternative to conda. First, download Miniforge from https://github.com/conda-forge/miniforge?tab=readme-ov-file and configure mamba according to its tutorial at https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html. Then, open the Miniforge terminal and install R and R packages using the following commands.

mamba create -n Festem python=3.12.2

mamba activate Festem

mamba install conda-forge::r-base=4.4.1

mamba install conda-forge::r-seurat=5.1.0

mamba install conda-forge::r-devtools

Problem 3

Festem failed to install on Windows. R Console reports an error: “Error in system(paste(MAKE, p1(paste("-f", shQuote(makefiles))), "compilers"), : 'make' not found”.

Potential solution

1.
Download RTools (4.4) from https://cran.r-project.org/bin/windows/Rtools/rtools44/rtools.html and install it on your system. Then, open the conda terminal and add the path of Rtools to the environment variable:

set PATH=%PATH%;C:\rtools44\usr\bin

(Replace “C:\rtools44\usr\bin” with the path where you installed RTools).

2.
Alternatively, you can download the binary version of Festem from https://github.com/XiDsLab/Festem/releases/download/v1.2.1/Festem_1.2.1.zip and install it through the R console:

install.packages("Festem_1.2.1.zip", repos = NULL)
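To confirm that the PATH change in option 1 took effect, you can ask R whether it can locate “make”. This is a small diagnostic sketch using base R's `Sys.which()`; the variable names are ours. R must be started from the same terminal session in which PATH was modified.

```r
# Check whether R can see the build tools needed to compile packages from source.
# Sys.which() returns an empty string when a program is not on the PATH.
make_path <- Sys.which("make")
has_make <- unname(nzchar(make_path))
if (has_make) {
  message("make found at: ", make_path)
} else {
  message("make not found; check that the Rtools directory is on the PATH.")
}
```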

Problem 4

Failed to download package “ifnb.SeuratData”.

Potential solution

Manually download the package “ifnb.SeuratData” from https://seurat.nygenome.org/src/contrib/ifnb.SeuratData_3.1.0.tar.gz. Then, run the following command in R console to install it.

install.packages("ifnb.SeuratData_3.1.0.tar.gz", repos = NULL, type = "source")

library(ifnb.SeuratData)

LoadData("ifnb")

Problem 5

Failed to subset the control group of the dataset. R Console reports an error: “invalid class "Assay" object: slots in class definition but not in object: ‘assay.orig’” (related to step 1).

Potential solution

This issue arises because the downloaded data are stored in the Seurat v4 object format, while you are using Seurat v5. To resolve this, run the following command.

ifnb <- UpdateSeuratObject(ifnb)

Problem 6

When running Festem, R runs out of memory (related to step 2 and step 9).

Potential solution

Use a smaller block size during parallelization. For example, use the following code.

# Single scRNA-seq dataset

ifnb <- RunFestem(ifnb, num.threads = 4, block_size = 1e4)

# Multi-batch scRNA-seq dataset

ifnb <- RunFestem(ifnb, batch = "stim", num.threads = 4, block_size = 1e4)

Problem 7

Prior information about cell types is available and needs to be incorporated into the Festem protocol (related to step 2 and step 9).

Potential solution

If we have a good estimate of the number of cell types in the dataset, we can set the parameter “G” so that Festem generates a pre-clustering with G cell types. For example:

ifnb <- RunFestem(ifnb, G = 14, num.threads = 4)

If we have a pre-clustering result or cell labels for the dataset, we can also provide it to Festem by setting the parameter “prior”. For example, if a pre-clustering result is stored in a vector called “label”, then we can use the following code.

ifnb <- RunFestem(ifnb, prior = label, num.threads = 4)

Problem 8

R fails or takes too much time to generate a heatmap for marker gene expressions (related to step 6 and step 13).

Potential solution

To reduce the time and computational resources required, you can sub-sample a smaller fraction of cells for plotting by setting the parameter “plot_cell_prop”. For example:

MarkerHeatmap(ifnb, VariableFeatures(ifnb), plot_cell_prop = 0.01)
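To give a sense of what sub-sampling cells for plotting involves, the sketch below draws a fixed proportion of cells within each cluster using base R. It is an independent illustration on made-up cluster labels; how `MarkerHeatmap` samples cells internally is not described here, so this should not be read as its implementation.

```r
# Toy cluster labels for 1,000 cells (made up for illustration).
set.seed(1)
clusters <- sample(c("A", "B", "C"), size = 1000, replace = TRUE,
                   prob = c(0.7, 0.25, 0.05))

# Sample a fixed proportion of cells within each cluster, keeping at least one
# cell per cluster so that rare clusters still appear in the plot.
prop <- 0.1
idx_by_cluster <- split(seq_along(clusters), clusters)
keep <- unlist(lapply(idx_by_cluster, function(idx) {
  n_keep <- max(1, floor(length(idx) * prop))
  idx[sample.int(length(idx), n_keep)]
}), use.names = FALSE)

print(table(clusters[keep]))
```

Sampling within clusters rather than across all cells keeps small populations visible, which matters when the proportion is as low as 0.01.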

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Dr. Ruibin Xi (ruibinxi@math.pku.edu.cn).

Technical contact

Technical questions on executing this protocol should be directed to and will be answered by the technical contacts, Zihao Chen (g.e.challenger@pku.edu.cn) and Dr. Changhu Wang (wangch156@pku.edu.cn).

Materials availability

This study did not generate any new reagents.

Data and code availability

• The datasets used in these workflows are downloaded using SeuratData. Raw expression counts are available at 10× Genomics: https://www.10xgenomics.com/resources/datasets.
• Additional information and R code for the latest version of Festem are provided on the GitHub repository: https://github.com/XiDsLab/Festem and at Zenodo: https://doi.org/10.5281/zenodo.14159896.
• Preprocessed datasets analyzed in this protocol are available at Zenodo: https://doi.org/10.5281/zenodo.11331165.

Acknowledgments

We thank Qinghua Ran for testing our scripts. This work was supported by the National Key R&D Program of China (2020YFE0204200 to R.X.), the National Natural Science Foundation of China (12425110 and 12371286 to R.X.), the Foundation of Shuanghu Laboratory (SH-2024JK10 to R.X.), and the Sino-Russian Mathematics Center.

Author contributions

Conceptualization, R.X. and C.W.; methodology, C.W. and Z.C.; software, Z.C. and C.W.; formal analysis, Z.C.; writing, Z.C. and R.X.; funding acquisition and supervision, R.X.

Declaration of interests

R.X. holds stock in GeneX Health Co., Ltd.

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work, the authors used Microsoft Copilot in Bing in order to polish the text. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.

Contributor Information

Zihao Chen, Email: g.e.challenger@pku.edu.cn.

Changhu Wang, Email: wangch156@pku.edu.cn.

Ruibin Xi, Email: ruibinxi@math.pku.edu.cn.

References

1. Chen Z., Wang C., Huang S., Shi Y., Xi R. Directly selecting cell-type marker genes for single-cell clustering analyses. Cell Rep. Methods. 2024;4 doi: 10.1016/j.crmeth.2024.100810.
2. Zhang J.M., Kamath G.M., Tse D.N. Valid post-clustering differential analysis for single-cell RNA-seq. Cell Syst. 2019;9:383–392.e6. doi: 10.1016/j.cels.2019.07.012.
3. Kang H.M., Subramaniam M., Targ S., Nguyen M., Maliskova L., McCarthy E., Wan E., Wong S., Byrnes L., Lanata C.M., et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 2018;36:89–94. doi: 10.1038/nbt.4042.
4. Huber W., Carey V.J., Gentleman R., Anders S., Carlson M., Carvalho B.S., Bravo H.C., Davis S., Gatto L., Girke T., et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods. 2015;12:115–121. doi: 10.1038/nmeth.3252.
5. Hao Y., Stuart T., Kowalski M.H., Choudhary S., Hoffman P., Hartman A., Srivastava A., Molla G., Madad S., Fernandez-Granda C., Satija R. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 2024;42:293–304. doi: 10.1038/s41587-023-01767-y.
6. Satija R., Hoffman P., Butler A. SeuratData: Install and Manage Seurat Datasets. R package version 0.2.2.9001, commit 4dc08e022f51c324bc7bf785b1b5771d2742701d. 2023. https://github.com/satijalab/seurat-data
7. Wickham H., Hester J., Chang W., Bryan J. devtools: Tools to Make Developing R Packages Easier. R package version 2.4.5. 2022. https://CRAN.R-project.org/package=devtools
8. Robinson M.D., McCarthy D.J., Smyth G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616.
9. Jelihovschi E., Faria J.C., Allaman I.B. ScottKnott: a package for performing the Scott-Knott clustering algorithm in R. TeMA. 2014;15 003-017.
10. Korsunsky I., Fan J., Slowikowski K., Zhang F., Wei K., Baglaenko Y., Brenner M.B., Loh P.-R., Raychaudhuri S. Fast, sensitive, and accurate integration of single cell data with Harmony. Nat. Methods. 2019;16:1289–1296. doi: 10.1038/s41592-019-0619-0.
11. Wickham H., François R., Henry L., Müller K., Vaughan D. dplyr: A Grammar of Data Manipulation. R package version 1.1.4. 2023. https://CRAN.R-project.org/package=dplyr
12. Benjamini Y., Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B. 1995;57:289–300.
13. Vovk V., Wang R. Combining p-values via averaging. Biometrika. 2020;107:791–808.
14. Chen W., Li Y., Easton J., Finkelstein D., Wu G., Chen X. UMI-count modeling and differential expression analysis for single-cell RNA sequencing. Genome Biol. 2018;19:70. doi: 10.1186/s13059-018-1438-9.
15. Choi K., Chen Y., Skelly D.A., Churchill G.A. Bayesian model selection reveals biological origins of zero inflation in single-cell transcriptomics. Genome Biol. 2020;21:183. doi: 10.1186/s13059-020-02103-2.
