Scdrake: a reproducible and scalable pipeline for scRNA-seq data analysis

Jan Kubovčiak; Michal Kolář; Jiří Novotný

Bioinformatics Advances. 2023 Jul 6;3(1):vbad089. doi: 10.1093/bioadv/vbad089

Editor: Guoqiang Yu

PMCID: PMC10351969 PMID: 37465398

Abstract

Motivation

While the workflow for primary analysis of single-cell RNA-seq (scRNA-seq) data is well established, the secondary analysis of the feature-barcode matrix is usually done by custom scripts. There is no fully automated pipeline in the R statistical environment that follows current best programming practices and meets the requirements for reproducibility.

Results

We have developed scdrake, a fully automated workflow for secondary analysis of scRNA-seq data, implemented entirely in the R language and built within the drake framework. The pipeline includes quality control, cell and gene filtering, normalization, detection of highly variable genes, dimensionality reduction, clustering, cell type annotation, detection of marker genes, differential expression analysis and integration of multiple samples. The pipeline is reproducible and scalable, executes efficiently, is easy to extend, provides access to intermediate results and outputs rich HTML reports. Scdrake is distributed as a Docker image, which provides a straightforward setup and enhances reproducibility.

Availability and implementation

The source code and documentation are available under the MIT license at https://github.com/bioinfocz/scdrake and https://bioinfocz.github.io/scdrake, respectively.

Supplementary information

Supplementary data are available at Bioinformatics Advances online.

1 Introduction

Single-cell RNA-seq (scRNA-seq) is a technology that captures transcriptional profiles of thousands of individual cells (Islam et al., 2011; Mereu et al., 2020). Analysis of scRNA-seq data remains challenging, mainly because of the technology itself, and partly due to the plethora of analysis tools developed and the lack of gold standards (Luecken and Theis, 2019). While there are currently well-established tools and pipelines for the initial quantification part of the analysis [e.g. 10× Genomics Cell Ranger (Zheng et al., 2017)], the secondary analysis of the acquired feature-barcode matrix usually requires custom scripts and combinations of different software packages.

Several pipelines offering the secondary analysis of scRNA-seq data have already been published (see Supplementary Table S1). However, some of them do not follow current best analysis practices, omit important steps of the analysis or offer only a graphical user interface without full automation. Moreover, most of them are not implemented in a pipeline toolkit and thus are neither reproducible nor scalable, and do not provide access to intermediate results.

Here, we introduce scdrake, an automated pipeline for the secondary analysis of scRNA-seq data, implemented entirely in the R statistical environment. Scdrake provides all important steps of the secondary analysis of a feature-barcode matrix: quality control, cell and gene filtering, normalization, detection of highly variable genes, dimensionality reduction, clustering, cell type annotation, detection of marker genes, differential expression analysis and integration of multiple samples. For those steps, we followed the best practices described in Amezquita et al. (2020) using the current state-of-the-art packages from the Bioconductor project (Huber et al., 2015), namely, scran (Lun et al., 2016), scater (McCarthy et al., 2017), scDblFinder (Germain et al., 2022), SC3 (Kiselev et al., 2017), SingleR (Aran et al., 2019) and batchelor (Haghverdi et al., 2018), together with Seurat (Hao et al., 2021). The pipeline is extensively configurable and provides rich graphical outputs and reports in HTML format. Internally, scdrake is an R package built upon drake (Landau, 2018), a Make-like pipeline toolkit for R. Thus, the pipeline is highly reproducible and scalable, executes efficiently, provides easy access to intermediate results and is arbitrarily extendable. We strove to achieve maximum practicability for bioinformaticians while providing ample and meaningful outputs for biologists. At the same time, bioinformaticians can quickly react to biologists’ needs by changing the parameters of the pipeline, which then skips the parts that are already finished and unaffected by the change. This dialogue between the biologist and the bioinformatician is indispensable during scRNA-seq data analysis. Scdrake ensures that this communication is performed in an effective and reproducible manner.

2 Results

2.1 Pipeline overview

The scdrake workflow (Fig. 1 and Supplementary Figs S1 and S2) consists of two pipelines, which are further divided into stages. Each stage (e.g. clustering or identification of marker genes) finishes with a standalone HTML report. Users can use predefined standard tools and parameters or modify them for their particular needs.

Figure 1. — Overview of the scdrake workflow. The workflow consists of two pipelines: the first one performs analysis of an individual sample (A) while the second one integrates multiple independent samples (B). Stages for detection of cluster marker genes and differential expression analysis are shared by both pipelines (C). More details are provided in Supplementary Figs S1 and S2

The first pipeline (Fig. 1A) performs analysis of an individual sample. It starts with a feature-barcode matrix (in various formats) and contains two stages: the first performs quality control and filtering, and the second performs normalization, dimensionality reduction, clustering and cell type annotation. The second pipeline (Fig. 1B) integrates multiple independent samples that were previously processed by the single-sample pipeline. Its first stage performs the sample integration itself, while its second stage is similar to the second stage of the single-sample pipeline, except that it works with the integrated data. Stages for detection of cluster marker genes and differential expression analysis are shared by both pipelines (Fig. 1C). More detailed information about the structure of scdrake’s pipelines can be found at https://bioinfocz.github.io/scdrake/articles/pipeline_overview.html.

We have tested the pipeline using the dataset of peripheral blood mononuclear cells (PBMCs 1K; 10× Genomics, CITE) and provided full outputs on the scdrake documentation page (https://onco.img.cas.cz/novotnyj/scdrake/). In Figure 2, we present an example output of the normalization and clustering stage of the single-sample pipeline.

Figure 2. — An example of the normalization and clustering stage HTML output from the single-sample pipeline. Users can view clustering results and any cell covariate [e.g. cell cycle phase or total number of unique molecular identifiers (UMIs) per cell] requested in the configuration in three different dimensionality reduction coordinates [uniform manifold approximation and projection (UMAP), t-distributed stochastic neighbour embedding (t-SNE), principal component analysis (PCA)]. These can be also used to display labels from the automatic cell annotation via SingleR. In the actual report, images link to full-size PDFs

2.2 Implementation details

Scdrake applies a project-based analysis approach; therefore, a new analysis starts with the initialization of a project directory, which includes all necessary configuration files, RMarkdown templates for stage reports and initial scripts for drake. Configuration files are stored in the language-agnostic YAML (YAML Ain’t Markup Language) format.

As the scdrake pipelines are implemented entirely in the R language using the drake toolkit, the pipeline definitions take the form of R objects in which each pipeline step (called a target) is a piece of R code. Thus, scdrake can be used either from within R or through a simple command-line interface that wraps its most important R functions. When the pipeline is executed, drake constructs a dependency tree of targets and runs, and saves to its cache, only those targets whose code or upstream dependencies have changed since the last execution.
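The Make-like behaviour described above can be illustrated with a small, language-neutral sketch. The Python code below is not drake's or scdrake's code, and the `make` function, the plan layout and the target names are hypothetical. It assumes that a target is outdated whenever a fingerprint of its own code and of its upstream targets' fingerprints differs from the one stored in the cache, which simplifies the real change detection.

```python
import hashlib

def make(plan, cache):
    """Build every target in `plan`, reusing cached values when nothing changed.

    plan:  dict mapping target name -> (code string, list of upstream names),
           listed so that dependencies come before their dependents.
    cache: dict mapping target name -> (fingerprint, value); updated in place.
    Returns the names of the targets that were actually rebuilt.
    """
    rebuilt = []
    fingerprints = {}
    for name, (code, deps) in plan.items():
        # The fingerprint covers the target's own code and the fingerprints of
        # its upstream targets, so a change upstream propagates downstream.
        digest = hashlib.sha256(code.encode())
        for dep in deps:
            digest.update(fingerprints[dep].encode())
        fingerprints[name] = digest.hexdigest()

        if name in cache and cache[name][0] == fingerprints[name]:
            continue  # up to date: keep the cached value and skip execution

        # Evaluate the target's code with its upstream values in scope.
        scope = {dep: cache[dep][1] for dep in deps}
        cache[name] = (fingerprints[name], eval(code, scope))
        rebuilt.append(name)
    return rebuilt

# Hypothetical three-step plan (toy data, not real expression counts).
plan = {
    "counts": ("[5, 0, 3, 8]", []),
    "filtered": ("[c for c in counts if c > 0]", ["counts"]),
    "total": ("sum(filtered)", ["filtered"]),
}
cache = {}
first = make(plan, cache)   # empty cache: every target is built
second = make(plan, cache)  # nothing changed: nothing is rebuilt

# Changing the code of one target invalidates it and everything downstream.
plan["filtered"] = ("[c for c in counts if c > 3]", ["counts"])
third = make(plan, cache)   # "counts" is reused from the cache
```

In this sketch the third call rebuilds only `filtered` and `total`, which mirrors how a changed parameter in the middle of a pipeline leaves earlier, unaffected steps untouched.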

During runtime, drake recognizes which targets are currently independent of other targets and can be run in parallel (implicit parallelism). Together with the caching of finished targets, this greatly reduces the execution time. In addition, users can load each target from the cache and implement custom pipelines that reuse the existing targets in scdrake (modular design). The majority of targets use data structures from Bioconductor, e.g. SingleCellExperiment, making it easy to use them within other Bioconductor packages or export them to other widely used formats.
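The idea of implicit parallelism can be sketched as follows. This Python example is only an illustration under stated assumptions: the target names are hypothetical, and grouping targets into level-by-level batches is a simplification that may differ from how drake actually schedules work. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to find, at each step, the targets whose upstream dependencies are already finished.

```python
from graphlib import TopologicalSorter

def parallel_batches(dependencies):
    """Group targets into batches whose members do not depend on each other.

    dependencies: dict mapping target name -> set of upstream target names.
    Each batch only needs results from earlier batches, so the targets within
    one batch could be dispatched to separate workers at the same time.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorter.get_ready()     # targets whose upstreams are all done
        batches.append(sorted(ready))  # sorted only to make the result stable
        sorter.done(*ready)            # mark as finished, unlocking dependents
    return batches

# Hypothetical targets loosely modelled on the stages described in the text.
dependencies = {
    "qc": set(),
    "normalization": {"qc"},
    "umap": {"normalization"},
    "tsne": {"normalization"},
    "clustering": {"normalization"},
    "report": {"umap", "tsne", "clustering"},
}
batches = parallel_batches(dependencies)
```

Here `umap`, `tsne` and `clustering` end up in the same batch because none of them depends on another, so they are the candidates for parallel execution.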

As scdrake depends on many other packages, their exact versions must be tracked for the sake of reproducibility. For this purpose, scdrake utilizes the renv package (Ushey, 2022), which can restore exact package versions using a lock file that is tracked by git. Moreover, to facilitate the installation of scdrake, we provide a Docker image in which all dependencies are installed. The Docker image is the most reproducible and straightforward way to use scdrake, and we recommend it to all users. We provide a separate vignette on the Docker image installation and usage (also within Singularity). Additional modes of installation are available for those who cannot use the Docker image.
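To make the role of the lock file concrete, the following Python sketch compares the versions recorded in a lock file with the versions found in an environment. It is an illustration only and not part of renv or scdrake: the `lockfile_mismatches` helper is hypothetical, the version numbers are made up, and the example assumes only the general layout of an renv lock file, i.e. JSON with a top-level `Packages` object whose entries carry a `Version` field.

```python
import json

def lockfile_mismatches(lockfile_text, installed):
    """Return {package: (locked, installed)} for packages deviating from the lock file.

    lockfile_text: JSON text with a top-level "Packages" object whose entries
                   carry a "Version" field.
    installed:     dict mapping package name -> installed version string;
                   packages that are not installed are simply absent.
    """
    locked = json.loads(lockfile_text)["Packages"]
    mismatches = {}
    for name, record in locked.items():
        have = installed.get(name)  # None when the package is missing
        if have != record["Version"]:
            mismatches[name] = (record["Version"], have)
    return mismatches

# Illustrative lock file; the version numbers are made up for this example.
lockfile_text = """
{
  "Packages": {
    "drake":  {"Package": "drake",  "Version": "7.13.4"},
    "scran":  {"Package": "scran",  "Version": "1.26.0"},
    "scater": {"Package": "scater", "Version": "1.26.1"}
  }
}
"""
installed = {"drake": "7.13.4", "scran": "1.24.0"}
mismatches = lockfile_mismatches(lockfile_text, installed)
```

In this example one package has a different version than the locked one and one is missing entirely; these are the deviations that restoring from the lock file, or using the Docker image, is meant to rule out.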

3 Conclusions

Scdrake provides a completely R-based, fully automated pipeline for the secondary analysis of scRNA-seq data, covering its most important parts. The pipeline is implemented using the drake toolkit, which provides significant benefits in terms of reproducibility, speed, efficiency, access to intermediate results and modularity. We believe scdrake will become a popular choice for bioinformaticians and that its outputs will, in turn, prove interesting and useful for biologists.

Scdrake is available at https://github.com/bioinfocz/scdrake and its extensive documentation is at https://bioinfocz.github.io/scdrake. We are committed to further extending scdrake’s capabilities, mainly by including more analysis modules, such as gene set enrichment analysis, trajectory inference or support for spatial transcriptomic data. The project is open to the community and we gladly welcome all contributions.

Supplementary Material

vbad089_Supplementary_Data (zip, 180 KB)

Acknowledgements

We thank our colleagues at IMG for testing the beta version of scdrake and providing valuable feedback.

Contributor Information

Jan Kubovčiak, Laboratory of Genomics and Bioinformatics, Institute of Molecular Genetics of the Czech Academy of Sciences, Vídeňská 1083, 142 20 Prague 4, Czech Republic.

Michal Kolář, Laboratory of Genomics and Bioinformatics, Institute of Molecular Genetics of the Czech Academy of Sciences, Vídeňská 1083, 142 20 Prague 4, Czech Republic; Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology in Prague, Technická 5, 166 28 Prague 6, Czech Republic.

Jiří Novotný, Laboratory of Genomics and Bioinformatics, Institute of Molecular Genetics of the Czech Academy of Sciences, Vídeňská 1083, 142 20 Prague 4, Czech Republic; Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology in Prague, Technická 5, 166 28 Prague 6, Czech Republic.

Author contributions

Jan Kubovčiak (Conceptualization, Investigation, Methodology, Software, Validation [equal]), Michal Kolář (Project administration [lead], Supervision [equal], Validation [equal], Writing—original draft [equal], Writing—review & editing [equal]) and Jiří Novotný (Conceptualization, Investigation, Methodology, Project administration, Software, Validation, Writing—original draft, Writing—review & editing [equal]).

Funding

This work was supported by ELIXIR CZ research infrastructure project (MEYS Grant No: LM2018131). M.K. and J.N. were supported in part by the Operational Programme Research, Development and Education under the project (No. CZ.02.1.01/0.0/0.0/16_019/0000785) and by the project National Institute for Cancer Research (Programme EXCELES, ID Project No. LX22NPO5102)—Funded by the European Union—Next Generation EU.

Conflict of Interest

None declared.

References

Amezquita R.A. et al. (2020) Orchestrating single-cell analysis with Bioconductor. Nat. Methods, 17, 137–145.
Aran D. et al. (2019) Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol., 20, 163–172.
Germain P. et al. (2022) Doublet identification in single-cell sequencing data using scDblFinder. F1000Research, 10, 979.
Haghverdi L. et al. (2018) Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol., 36, 421–427.
Hao Y. et al. (2021) Integrated analysis of multimodal single-cell data. Cell, 184, 3573–3587.e29.
Huber W. et al. (2015) Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods, 12, 115–121.
Islam S. et al. (2011) Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res., 21, 1160–1167.
Kiselev V.Y. et al. (2017) SC3: consensus clustering of single-cell RNA-seq data. Nat. Methods, 14, 483–486.
Landau W.M. (2018) The drake R package: a pipeline toolkit for reproducibility and high-performance computing. J. Open Source Softw., 3, 550.
Luecken M.D., Theis F.J. (2019) Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol., 15, e8746.
Lun A. et al. (2016) A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research, 5, 2122.
McCarthy D.J. et al. (2017) Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics, 33, 1179–1186.
Mereu E. et al. (2020) Benchmarking single-cell RNA-sequencing protocols for cell atlas projects. Nat. Biotechnol., 38, 747–755.
Ushey K. (2022) renv: Project Environments. https://rstudio.github.io/renv/ (1 February 2023, date last accessed).
Zheng G.X.Y. et al. (2017) Massively parallel digital transcriptional profiling of single cells. Nat. Commun., 8, 14049.
