Abstract
Here we present ICARUS, a web server to enable users without experience in R to undertake single cell RNA-seq analysis. The focal point of ICARUS is its intuitive tutorial-style user interface, designed to guide logical navigation through the multitude of pre-processing, analysis and visualization steps. ICARUS is easily accessible through a dedicated web server (https://launch.icarus-scrnaseq.cloud.edu.au/) and avoids installation of software on the user's computer. Notable features include the facility to apply quality control thresholds and adjust dimensionality reduction and cell clustering parameters. Data is visualized through 2D/3D UMAP and t-SNE plots and may be curated to remove potential confounders such as cell cycle heterogeneity. ICARUS offers flexible differential expression analysis with user-defined cell groups and gene set enrichment analysis to identify likely affected biological pathways. Eleven organisms including human, dog, mouse, rat, zebrafish, fruit fly, nematode, yeast, cattle, chicken and pig are currently supported. Visualization of multimodal data including those generated by CITE-seq and the 10X Genomics Multiome kit is included. ICARUS incorporates a function to save the current state of analysis avoiding computationally intensive steps during repeat analysis. The complete analysis of a typical single cell RNA-seq dataset by inexperienced users may be achieved in 1–2 h.
Graphical Abstract
INTRODUCTION
Many single cell RNA-seq data analysis software require knowledge of programming in R or python. We present ICARUS (Interactive single cell RNA-seq analysis with R-shiny using Seurat), an application for single cell RNA-seq (scRNA-seq) analysis accessible through standard web browsers, with an intuitive tutorial style interface and user control over a wide range of parameters at each analysis step (Table 1 and Figure 1). ICARUS functionality is based on the Seurat R package which contains the most comprehensive tool set for scRNA-seq data analysis with the latest methodologies and frequent updates (1–4). ICARUS improves on previous analysis and visualization software tools including 10X Genomics Loupe cell browser or third-party tools such as Cerebro (5), NASQAR (6), Alona (7), single cell explorer (8) and others in three main areas: (i) intuitive tutorial-style user interface to guide logical navigation through the multitude of pre-processing, analysis and visualization steps; (ii) extended functionality and increased diversity of analysis tools to accommodate the rapidly evolving singe cell RNA-seq research field; (iii) easy accessibility through a dedicated web server (https://launch.icarus-scrnaseq.cloud.edu.au/) that avoids installation of software on the user's computer.
Table 1.
Step in ICARUS | Main command | R packages | Reference |
---|---|---|---|
Quality control | - | Seurat | (1–4) |
Integration of second dataset | Seurat::FindIntegrationAnchors Harmony::RunHarmony |
Seurat, Harmony | (1–4,22) |
Dimensionality reduction | Seurat::FindVariableFeatures Seurat::NormalizeData Seurat::ScaleData Seurat::RunPCA |
Seurat | (1–4) |
Clustering | Seurat::FindNeighbours Seurat::FindClusters Seurat::RunUMAP Seurat::RunTSNE |
Seurat | (1–4) |
Data correction | Seurat::CellCycleScoring Seurat::ScaleData(vars.to.regress) |
Seurat | (1–4) |
Labelling clusters | SingleR::SingleR Celldex::BlueprintEncodeData Celldex::DatabaseImmuneCellExpressionData Celldex::HumanPrimaryCellAtlasData Celldex::ImmGenData Celldex::MonacoImmuneData Celldex::MouseRNAseqData Celldex::NovershternHematopoieticData |
SingleR, Celldex | (12) |
Multimodal analysis | - | Seurat | (1–4) |
Differential expression analysis | Seurat::FindMarkers | Seurat | (1–4) |
Pathway analysis | ClusterProfiler::gseGO ClusterProfiler::gseKEGG ClusterProfiler::gseWP ReactomePA::gsePathway |
ClusterProfiler, ReactomePA | (20,21) |
MATERIALS AND METHODS
Input data
ICARUS requires a matrix of UMI counts for each gene per cell. Input data may be in the form of a tab delimited table with cells as columns and gene features as rows. Seurat R objects (RDS file) can also be loaded. Alternatively, 10X CellRanger output data files (barcodes.tsv, features.tsv and matrix.tsv) may be directly loaded. For data privacy reasons, the user data is not retained on the server after the user-session is terminated.
Pre-processing
The user may apply various quality control parameters to the dataset with ICARUS including, the number of unique genes detected per cell, the number of molecules detected per cell, mitochondrial and ribosomal percentage. Cells with a low number of unique genes, low number of molecules and a high mitochondrial percentage are indicative of perforated cells or low-quality libraries that have undergone a loss of cytoplasmic RNA. It is recommended to remove these cells before downstream steps to avoid formation of spurious clusters that may mask biological interpretation during analysis. A scaling and normalization step is then applied to account for differences in sequence coverage that arise from cDNA capture errors or PCR amplification efficiency differences. ICARUS integrates Seurat's dimensionality reduction workflow with principal component analysis prioritising a set of highly variable genes (user-selectable, 2000 genes default) (3). For visualization of the sources of cell heterogeneity across the dataset, a dimensionality reduction heatmap and loadings plot displays top genes within each dimension. Further, a plot of variance explained by successive principal components can be helpful in determining the number of dimensions that captures the majority of the variance in the dataset (often referred to as the ‘elbow point’). The elbow point is the recommended number of dimensions to use for downstream cell clustering.
Processing
ICARUS performs clustering with graph-based community detection. Briefly, a graph of k-nearest neighbours is formed between cells in high dimensional space where each cell is a node that is connected to its k-nearest neighbours. The edges of every connection are then weighted based on its similarities to neighbouring cells using the Jaccard-similarity algorithm and a community detection algorithm is used to define clusters (3). Clusters are visualized with 2D or 3D UMAP and t-SNE plots. ICARUS offers user control over clustering parameters including number of dimensions, number of k-nearest neighbours for graph construction and choice of clustering algorithm (either Louvain, SLM or Leiden). UMAP parameters (nearest neighbours and minimum distance) and t-SNE parameters (perplexity and number of iterations) can also be adjusted.
A data correction step allows removal of potential technical and biological confounders that could mask biological signal, such as cell cycle heterogeneity. For removal of cell cycle effects, a cell cycle score is first assigned to each cell based on its expression of G2/M and S phase markers. If the cell displays low or no expression of these markers, then the cell is likely not cycling (G1 phase). Each cell cycle score is modelled against highly variable genes and a corrected expression matrix is generated for dimensionality reduction and clustering (2,9,10). Alternatively, the user may opt to remove the effects of user selected gene(s) or an entire gene set (MsigDB gene sets) (11). In this case, average normalized gene counts of each gene of interest are computed and modelled against all expressed genes. The residuals of the model are scaled and centred and used for dimensionality reduction and clustering (2).
Marker genes for each cluster may be compared against known cell marker databases to assign a cell type to existing clusters. ICARUS utilizes the singleR and Celldex R packages (12) to annotate cell clusters against cell marker databases including Blueprint Encode (13), Database of immune cell expression (DICE) (14), Human primary cell atlas (15), Immunologic genome project (ImmGen) (www.immgen.org), Monaco immune data (16), Mouse RNAseq Dataset (17) and Novershtern Hematopoietic data (18).
Analysis
Visualization of gene expression across clusters is helpful in determining the main drivers of cluster formation and cell heterogeneity. ICARUS provides visualization of normalised and scaled gene counts as UMAP/t-SNE plots and violin plots. Further, multiple genes or gene sets (MsigDB) may be visualized as average normalised gene count between all selected genes.
ICARUS enables users to perform pairwise differential expression tests between cell clusters of interest or between samples within a cell cluster and output a list of differentially expressed genes. Statistical significance testing between pairwise observations is conducted with the Wilcoxon rank sum test (Wilcoxon–Mann–Whitney test) with Bonferroni multiple testing correction. The list of differentially expressed genes are then tested for enriched terms (gene set enrichment analysis) to identify potential affected biological pathways (19). Gene set enrichment and visualization are performed using ClusterProfiler (20) and ReactomePA (21) R packages. ICARUS currently supports eleven model organisms and livestock species including human (Homo sapiens), dog (Canis lupus familiaris), mouse (Mus musculus), rat (Rattus norvegicus), zebrafish (Danio rerio), fruit fly (Drosophila melanogaster), nematode (Caenorhabditis elegans), yeast (Saccharomyces cerevisiae), cattle (Bos taurus), chicken (Gallus gallus) and pig (Sus scrofa).
Graphical visualization of enriched terms includes a dot plot ordered by gene ratio (number of genes in enriched gene set/total number of genes in gene set), a gene concept network showcasing genes involved with the enriched terms and an enrichment map consisting of a network of enriched terms with edges connecting overlapping gene sets.
Occasionally, significant heterogeneity exists within computed cell clusters and differential expression tests may be confounded by the broad cluster classification. ICARUS provides the user the option to interactively select customised cell groups using a lasso select function for differential expression tests. This function may provide additional power to detect unique gene expression profiles in rare cell types that were previously obscured in a wider cluster of cells.
Advanced features
Integration of a second dataset
Comparisons between different scRNA-seq samples (i.e. control versus treatment samples or samples across different batches) are challenging due to technical differences in library preparation and batch effects (e.g. cells on different microfluidic chips, cells harvested at different time points or cells in different sequencing lanes). Simply merging gene expression matrices of samples from different conditions/datasets/batches may result in spurious cell clustering. Instead, methods for data integration rely on identification of a set of shared variable genes from each dataset to group cells with common biological states. In brief, pairs of cells from each dataset that exists within similar high dimensional space are identified through either Canonical Correction Analysis, Reciprocal PCA or Harmony. A neighbour graph is then constructed to compute shared neighbour overlap between cells to output a combined integrated dataset (2,22). The aim of integration is to enable successful grouping of cells from one condition/dataset/batch with the same cell types of the other condition/dataset/batch.
ICARUS offers users the ability to upload a second dataset for integration with the first dataset. Quality control functionality as per described in the preprocessing section is also available for the second dataset. Integration may be performed with user selected parameters including the choice of reduction method (Canonical Correction Analysis, Reciprocal PCA or Harmony) and strength of integration (k anchors).
Support for multimodal analysis
Multimodal analysis refers to the simultaneous measurements of several data types from the same cell. For example, methodologies including CITE-seq (23), cell hashing oligos (24) and single cell RNA-seq plus ATAC-seq (10X multiome kit) allow measurements of both single cell transcriptomes and cell-surface proteins of the same cell. Multimodal data in the form of a tab delimited count matrix with cells as columns (cells must be matching those of the scRNA-seq dataset) and multimodal features as rows can be uploaded into ICARUS. Multimodal data is visualised alongside scRNA-seq data to ensure easy comparison and interpretation.
Local save and continue functionality
ICARUS features local save and continue functionality. At each analysis step the user may opt to save their progress as a Rdata file containing the Seurat R object and the current working environment. The saved file can then be loaded at the CONTINUE interface at the ‘Introduction’ tab to resume analysis. The save and continue feature bypasses potential long computationally intensive processing steps during repeat analysis. Further, the user is able to reproduce all parameters that were used to complete each analysis step and replicate plots and tables. An automatically updated log will record the user's input parameters. Additionally, the Seurat R object may be saved individually and this may be shared across sites/teams to facilitate collaboration. The Seurat R object can be loaded into RStudio for a more in-depth analysis with a variety of supporting scRNA-seq analysis packages available in the R environment.
Downloadable plots and tables
All plots and tables produced in ICARUS are downloadable. Plots may be downloaded with user defined height and width and are available in 6 different file formats including png, jpeg, tiff, png, bmp and svg. Tables are downloaded as comma separated values (csv) files.
RESULTS
To demonstrate the utility of ICARUS, five example datasets listed in Table 2 are available within ICARUS. These include a dataset of 2700 peripheral blood mononuclear cells (PBMC) that was used as demonstration for the Seurat guided clustering tutorial vignette (https://satijalab.org/seurat/articles/pbmc3k_tutorial.html). Using this 2,700 PBMC dataset, a quality control filter was first applied removing low quality cells with unique gene counts more than 2,500 or less than 200 and cells with over 5% mitochondrial content. Dimensionality reduction with PCA was performed prioritising 2,000 highly variable genes. Cell clustering was performed with the first 10 dimensions, a k-nearest neighbour value of 20 and the Louvain community detection algorithm. ICARUS was able to successfully reproduce the UMAP/t-SNE plots shown in the Seurat clustering vignette (Figure 2A) and differentiate PBMC cells into 6 cell types including T cells CD4+, naïve; T cells CD4+, TFH; B cells, naïve; NK cells; Monocytes, CD14 + and Monocytes, CD16+ (Figure 2B). An example of differential expression analyses between B cells and T cells CD4+, naïve cell clusters showcased top differentially expressed genes comprising of HLA genes (HLA-DRB1, HLA-DQA1, HLA-DPB1, HLA-DPA1) and CD79. Gene set enrichment analyses revealed enriched terms including MHC protein complex assembly, antigen processing and presentation and peptide antigen assembly (Figure 3).
Table 2.
DISCUSSION
This manuscript describes ICARUS, a web server application for guided processing and analysis of single cell RNA-seq data. The intended goal for ICARUS is to provide users a platform to enable logical and stepwise interpretation of scRNA-seq data without the requirement of previous programming knowledge. ICARUS combines a tutorial style interface with graphical outputs at every step to reduce the steep learning curve associated with scRNA-seq data processing and analysis. Compared to other similar tools, ICARUS has flexible functionality and includes additional tools to enable biological interpretation and accommodate the rapidly evolving scRNA-seq research field. A typical scRNA-seq dataset containing 8000 cells with 4000 UMIs per cell may be analysed in ICARUS within 1–2 h using commodity hardware. Experienced users may utilise the downloadable R objects for a more in-depth analysis using the multitude of supporting scRNA-seq and other analysis packages available in the R environment.
DATA AVAILABILITY
ICARUS is available at https://launch.icarus-scrnaseq.cloud.edu.au/. The application is free and open to all users with no login requirement. Alternatively, a docker version is accessible through the Docker Hub under the name ‘icarusscrnaseq/icarus’.
ACKNOWLEDGEMENTS
We thank the New Zealand–China Non Communicable Diseases Research Collaboration Centre (NCD CRCC). This web server was supported by the Australian National eResearch Collaboration Tools and Resources (Nectar Research Cloud) initiative.
Notes
Present address: Andrew Jiang, Applied Translational Genetics Group, School of Biological Sciences, The University of Auckland, Auckland, New Zealand.
Contributor Information
Andrew Jiang, Applied Translational Genetics Group, School of Biological Sciences, The University of Auckland, Auckland, New Zealand.
Klaus Lehnert, Applied Translational Genetics Group, School of Biological Sciences, The University of Auckland, Auckland, New Zealand.
Linya You, Department of Human Anatomy & Histoembryology, School of Basic Medical Sciences, Fudan University, Shanghai, China.
Russell G Snell, Applied Translational Genetics Group, School of Biological Sciences, The University of Auckland, Auckland, New Zealand.
FUNDING
New Zealand Ministry of Business Innovation and Employment funding for New Zealand–China Non Communicable Diseases Research [UOOX1601]. Funding for open access charge: New Zealand–China Non Communicable Diseases Research [UOOX1601].
Conflict of interest statement. None declared.
REFERENCES
- 1. Hao Y., Hao S., Andersen-Nissen E., Mauck W.M. 3rd, Zheng S., Butler A., Lee M.J., Wilk A.J., Darby C., Zager M.et al.. Integrated analysis of multimodal single-cell data. Cell. 2021; 184:3573–3587. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Butler A., Hoffman P., Smibert P., Papalexi E., Satija R.. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 2018; 36:411–420. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Satija R., Farrell J.A., Gennert D., Schier A.F., Regev A.. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. 2015; 33:495–502. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Stuart T., Butler A., Hoffman P., Hafemeister C., Papalexi E., Mauck W.M. 3rd, Hao Y., Stoeckius M., Smibert P., Satija R.. Comprehensive integration of single-cell data. Cell. 2019; 177:1888–1902. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Hillje R., Pelicci P.G., Luzi L.. Cerebro: interactive visualization of scRNA-seq data. Bioinformatics. 2019; 36:2311–2313. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Yousif A., Drou N., Rowe J., Khalfan M., Gunsalus K.C.. NASQAR: a web-based platform for high-throughput sequencing data analysis and visualization. BMC Bioinf. 2020; 21:267. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Franzén O., Björkegren J.L.M.. alona: a web server for single-cell RNA-seq analysis. Bioinformatics. 2020; 36:3910–3912. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Feng D., Whitehurst C.E., Shan D., Hill J.D., Yue Y.G.. Single cell explorer, collaboration-driven tools to leverage large-scale single cell RNA-seq data. BMC Genomics. 2019; 20:676. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Tirosh I., Izar B., Prakadan S.M., Wadsworth M.H., Treacy D., Trombetta J.J., Rotem A., Rodman C., Lian C., Murphy G.et al.. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science (New York, N.Y.). 2016; 352:189–196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Scialdone A., Tanaka Y., Jawaid W., Moignard V., Wilson N.K., Macaulay I.C., Marioni J.C., Göttgens B.. Resolving early mesoderm diversification through single-cell expression profiling. Nature. 2016; 535:289–293. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Liberzon A., Subramanian A., Pinchback R., Thorvaldsdóttir H., Tamayo P., Mesirov J.P.. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011; 27:1739–1740. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Aran D., Looney A.P., Liu L., Wu E., Fong V., Hsu A., Chak S., Naikawadi R.P., Wolters P.J., Abate A.R.et al.. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 2019; 20:163–172. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Martens J.H., Stunnenberg H.G.. BLUEPRINT: mapping human blood cell epigenomes. Haematologica. 2013; 98:1487–1489. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Schmiedel B.J., Singh D., Madrigal A., Valdovino-Gonzalez A.G., White B.M., Zapardiel-Gonzalo J., Ha B., Altay G., Greenbaum J.A., McVicker G.et al.. Impact of genetic polymorphisms on human immune cell gene expression. Cell. 2018; 175:1701–1715. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Mabbott N.A., Baillie J.K., Brown H., Freeman T.C., Hume D.A.. An expression atlas of human primary cells: inference of gene function from coexpression networks. BMC Genom. 2013; 14:632. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Monaco G., Lee B., Xu W., Mustafah S., Hwang Y.Y., Carré C., Burdin N., Visan L., Ceccarelli M., Poidinger M.et al.. RNA-Seq signatures normalized by mRNA abundance allow absolute deconvolution of human immune cell types. Cell Rep. 2019; 26:1627–1640. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Benayoun B.A., Pollina E.A., Singh P.P., Mahmoudi S., Harel I., Casey K.M., Dulken B.W., Kundaje A., Brunet A.. Remodeling of epigenome and transcriptome landscapes with aging in mice reveals widespread induction of inflammatory responses. Genome Res. 2019; 29:697–709. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Novershtern N., Subramanian A., Lawton L.N., Mak R.H., Haining W.N., McConkey M.E., Habib N., Yosef N., Chang C.Y., Shay T.et al.. Densely interconnected transcriptional circuits control cell states in human hematopoiesis. Cell. 2011; 144:296–309. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Subramanian A., Tamayo P., Mootha V.K., Mukherjee S., Ebert B.L., Gillette M.A., Paulovich A., Pomeroy S.L., Golub T.R., Lander E.S.et al.. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 2005; 102:15545–15550. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Wu T., Hu E., Xu S., Chen M., Guo P., Dai Z., Feng T., Zhou L., Tang W., Zhan L.et al.. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation (N Y). 2021; 2:100141. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Yu G., He Q.Y.. ReactomePA: an R/Bioconductor package for reactome pathway analysis and visualization. Mol. Biosyst. 2016; 12:477–479. [DOI] [PubMed] [Google Scholar]
- 22. Korsunsky I., Millard N., Fan J., Slowikowski K., Zhang F., Wei K., Baglaenko Y., Brenner M., Loh P.R., Raychaudhuri S.. Fast, sensitive and accurate integration of single-cell data with harmony. Nat. Methods. 2019; 16:1289–1296. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Stoeckius M., Hafemeister C., Stephenson W., Houck-Loomis B., Chattopadhyay P.K., Swerdlow H., Satija R., Smibert P.. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods. 2017; 14:865–868. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Stoeckius M., Zheng S., Houck-Loomis B., Hao S., Yeung B.Z., Mauck W.M., Smibert P., Satija R.. Cell hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. 2018; 19:224. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Data Availability Statement
ICARUS is available at https://launch.icarus-scrnaseq.cloud.edu.au/. The application is free and open to all users with no login requirement. Alternatively, a docker version is accessible through the Docker Hub under the name ‘icarusscrnaseq/icarus’.