SingleCellNet: a computational tool to classify single cell RNA-Seq data across platforms and across species

Yuqi Tan; Patrick Cahan

doi:10.1016/j.cels.2019.06.004

Author manuscript; available in PMC: 2020 Aug 28.

Published in final edited form as: Cell Syst. 2019 Jul 31;9(2):207–213.e2. doi: 10.1016/j.cels.2019.06.004

PMCID: PMC6715530 NIHMSID: NIHMS1532557 PMID: 31377170

Summary

Single cell RNA-Seq has emerged as a powerful tool in diverse applications, from determining the cell-type composition of tissues to uncovering regulators of developmental programs. A near-universal step in the analysis of single cell RNA-Seq data is to hypothesize the identity of each cell. Often, this is achieved by searching for combinations of genes that have previously been implicated as being cell-type specific, an approach that is not quantitative and does not explicitly take advantage of other single cell RNA-Seq studies. Here, we describe our tool, SingleCellNet, which addresses these issues and enables the classification of query single cell RNA-Seq data in comparison to reference single cell RNA-Seq data. SingleCellNet compares favorably to other methods in sensitivity and specificity, and it is able to classify across platforms and species. We highlight SingleCellNet’s utility by classifying previously undetermined cells and by assessing the outcome of a cell fate engineering experiment.

eTOC Blurb

A major obstacle in analyzing single cell RNA-Seq data is determining the identity of each cell. Often this process is time-consuming, error prone, and lacking in quantitative rigor. We have addressed this challenge by developing SingleCellNet (SCN), which provides a quantitative classification of single cell RNA-Seq data. SCN compares favorably to other methods in sensitivity and specificity. One of the major advantages of SCN is that it is possible to use it to classify cells across platforms and across species.

Introduction

Single cell RNA-Seq (scRNA-Seq) has rapidly emerged as a powerful tool to generate cell atlases of organs, tissues, and complete organisms (Cao et al., 2017; Han et al., 2018; Tabula Muris Consortium et al., 2018), to define stages and regulators of development (Kumar et al., 2017), and to determine how perturbations such as age, pathology, or genetic variation impact cell composition and state (Haber et al., 2017; Kowalczyk et al., 2015; Park et al., 2018; Patel et al., 2014). One of the most time-consuming aspects of scRNA-Seq investigations is ‘cell-typing’, or determining the identity of each cell. This often requires further experimentation such as in situ-based methods to localize cells within a tissue, or prospective isolation followed by functional assessment. A faster and more quantitatively rigorous method is clearly needed.

One approach is to integrate ‘query’ scRNA-Seq data with existing scRNA-Seq datasets in which the cells have already been identified, such as a cell atlas. Several methods to integrate scRNA-Seq datasets have been proposed. For example, canonical correlation analysis (Butler et al., 2018) and MnnCorrect (Haghverdi et al., 2018) have proven useful in aggregating scRNA-Seq data sets so as to increase statistical power in differential gene expression analysis and in gene-to-gene correlation analysis. However, these approaches require that at least one relatively abundant cell type is present in both data sets. Furthermore, these methods do not explicitly provide a means to quantitatively classify query cell types in comparison to a reference data set, which is the goal of our method SingleCellNet (SCN). The MetaNeighbor tool compares cell types across scRNA-Seq data sets, yet it addresses the question “to what extent is a group of cells reproducible across scRNA-Seq data sets?”, which is distinct from our aim (Crow et al., 2018). SCMAP is the method most akin to SCN in intent (Kiselev et al., 2018) because it classifies query cells according to their similarity to reference cell types based on various measures of correlation. While SCMAP is fast, it ultimately returns a binary cell type assignment for each cell. In many applications, a quantitative measure of similarity can be more informative than a categorical assignment of identity. For example, the extent to which a query cell derived from a cell fate engineering experiment (e.g. directed differentiation) resembles a reference cell type is valuable information that can be obscured by categorical assignments of identity. Here, we present SCN, a method to quantitatively classify scRNA-Seq data based on comparison to a reference data set.
To make query and reference data compatible across platforms and species, we use a transformation based on comparing the expression of pairs of genes within each cell, a method inspired by the top-scoring pair classifier (Geman et al., 2004). Here we evaluate the performance of SCN, compare it to the intermediate quantitative outputs of SCMAP, and highlight its utility in four realistic use-cases: a cross-platform identification of previously unclassified cells, the identification of cell types resulting from a cell fate engineering experiment, a cross-species classification of hematopoietic cell types and a cross-study classification comparison of neuronal atlases.

Results

Building a multi-class scRNA-Seq classifier with top-pair transform and Random Forest

We previously developed CellNet, a computational method designed to classify populations of engineered cells (Cahan et al., 2014; Radley et al., 2017) using Random Forest classifiers (Breiman, 2001). With SCN, we have revamped this approach to enable cross-platform and cross-species classification of scRNA-Seq data. We do not use gene counts or expression estimates directly in training or in classification. Rather, we transform the data into a binary matrix derived by pairwise comparisons of selected genes on a per cell basis (Top-Pair transformation), limited to genes that are preferentially expressed in each cell type defined in the training data, as well as those genes that are specifically under-expressed in each type (Fig 1A). Then, to limit the set of predictors for input to training the Random Forest classifier, we use template matching (Pavlidis and Noble, 2001) to find the most discriminating sets of gene pairs. After gene-pair selection, the training data is then transformed into a binary matrix and used to train a multi-class Random Forest classifier. Our training process also includes a step to generate, by random sampling of gene-pair values, a set of transformed single cell profiles that are unlike any others in the training data. This ‘unknown’ category can be useful to identify query cells to which no class in the training data corresponds. Query scRNA-Seq data undergoes the same Top-Pair (TP) transform. To measure the performance of the classifier throughout this study, we have used two assessment metrics: Cohen’s kappa (κ), which measures agreement of categorical variables normalized for chance (Cohen, 1968), and mean area under the precision-recall curve (mean AUPR). As ground truth, we used a variety of gold standard data sets in which the cell identity is given as our base for comparison.
To evaluate the performance of the existing method SCMAP using mean AUPR, which requires a quantitative score, we used the intermediate outputs of Pearson correlations, Spearman correlations, and cosine similarity.
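The Top-Pair transform and the template-matching step described above can be illustrated with a minimal Python sketch. This is not the SCN code: the function names, the toy expression values, and the use of a Pearson correlation against a 0/1 template as the matching score are assumptions made for illustration only.

```python
from math import sqrt

def top_pair_transform(expr, pairs):
    """Binary-encode each cell: 1 if gene_a is expressed above gene_b, else 0.

    expr  : dict mapping cell id -> dict of gene -> expression value
    pairs : list of (gene_a, gene_b) tuples
    """
    return {
        cell: [1 if genes[a] > genes[b] else 0 for a, b in pairs]
        for cell, genes in expr.items()
    }

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def score_pairs(binary, labels, cell_type, pairs):
    """Template matching (assumed scoring): correlate each pair's 0/1 column
    with a template that is 1 in cells of `cell_type` and 0 elsewhere."""
    cells = list(binary)
    template = [1 if labels[c] == cell_type else 0 for c in cells]
    return {
        pair: pearson([binary[c][j] for c in cells], template)
        for j, pair in enumerate(pairs)
    }

# Toy data: hypothetical expression values for four cells and three genes.
expr = {
    "c1": {"Ins1": 9.0, "Gcg": 1.0, "Actb": 5.0},
    "c2": {"Ins1": 8.0, "Gcg": 0.5, "Actb": 6.0},
    "c3": {"Ins1": 0.5, "Gcg": 7.0, "Actb": 8.0},
    "c4": {"Ins1": 1.0, "Gcg": 9.0, "Actb": 4.0},
}
labels = {"c1": "beta", "c2": "beta", "c3": "alpha", "c4": "alpha"}
pairs = [("Ins1", "Gcg"), ("Actb", "Gcg"), ("Ins1", "Actb")]

binary = top_pair_transform(expr, pairs)
scores = score_pairs(binary, labels, "beta", pairs)

# Within-cell comparisons are unchanged by per-cell rescaling, one reason
# such a transform can be expected to tolerate platform differences.
scaled = {c: {g: v * 100 for g, v in genes.items()} for c, genes in expr.items()}
same_after_scaling = top_pair_transform(scaled, pairs) == binary
```

In this toy example the pair (Ins1, Gcg) separates beta cells perfectly and would be ranked above (Actb, Gcg), which is also "on" in one alpha cell.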

Figure 1. SingleCellNet (SCN) schematic and performance comparisons. (A) SCN takes annotated scRNA-Seq data as training data, selects the best classifying gene pairs using an approach similar to the top-scoring pair algorithm, and pair-transforms the training data into a binary matrix. A multi-class Random Forest classifier is then trained with the transformed training data. The query scRNA-Seq data are also pair-transformed, and a classification score is generated for each query cell. (B) A representative example of the fifteen pairs of cross-platform scRNA-Seq training-query performance analyses. The Baron human pancreas scRNA-Seq data set (training) and the Muraro human pancreas scRNA-Seq data set (query) were used to benchmark the performance of the five methods: SCN-TP, SCN-Base, SCMAP-cosine, SCMAP-pearson and SCMAP-spearman, using two quantitative metrics: mean AUPR and κ. The training for each pair of the cross-platform comparisons was cross-validated ten times. Left: The barplot shows the mean and standard deviation of the classifier performance. Right: The classification scores are displayed with a violin plot, where the x axis shows the true cell annotation, and the y axis label on the right-hand side shows the classifier category. (C) Mean and standard deviation of the classifier performance (Left: κ, right: mean AUPR) of cross-platform and cross-species scRNA-Seq classifiers.

Performance of the SingleCellNet TP-RF classifier

We first set out to determine how the number of top pairs, the primary user-tunable parameter of our method, impacts classifier performance. In this and other analyses in this section, we used the Tabula Muris 10x scRNA-Seq (TM-10x) data set. We found that both κ and mean AUPR plateaued when the number of top pairs was 10, corresponding to 320 total predictor genes (Fig S1A).

Next, we determined the effect of adjusting profiles based on stage of cell cycle on TP-RF classifiers, as this is sometimes considered an uninformative biological confounder of clustering analysis (Barron and Li, 2016). We evaluated performance in three scenarios: no cycle adjustment, adjustment of both training and validation data, and adjustment of validation data only, where adjustment is performed by regression on stage of cell cycle (Wolf et al., 2018) (Fig S1B). In addition to evaluating SCN-TP, we also trained and evaluated a version of SCN in which no top-pair transform is performed, but rather in which expression estimates of differentially expressed genes are used as predictor variables in training the Random Forest. We refer to this method as SCN-Base. In brief, this analysis shows that SCN-TP is resilient to regressing on stage of cell cycle except when only the query data was adjusted, whereas classifiers trained directly on expression levels are prone to performance degradation.

Classifier performance across platforms and across species

There is wide diversity in scRNA-Seq methodologies, and the extent to which classifiers trained on data from one platform would be applicable to a query data set from another is unclear. We explored this by training classifiers and assessing their performance when applied to independent, well-annotated scRNA-Seq data from other studies using different scRNA-Seq platforms (Table S1). We have assessed the classifier performance of all five methods with fifteen different pairs of cross-platform training and query data (Fig S2A). As a representative analysis, we discuss here the results of using human pancreas cells profiled by inDrop as training data (Baron et al., 2016), and human pancreas cells profiled by CEL-Seq2 as the query (Muraro et al., 2016). SCN-TP had significantly higher mean AUPR compared to the SCMAP correlation methods and compared to SCN-Base (Fig 1B-left). SCN-TP and SCMAP-cluster had similar κ, and both methods significantly outperformed SCN-Base. In typical scRNA-Seq studies, the data are clustered into groups of cells with similar profiles. If the clusters represent cell states or cell types that are robustly detectable across platforms and studies, then cells within the clusters should share high classification scores for the same category, and low scores for all other categories. To test this idea, we visualized the classification results as violin plots, and observed that SCN methods achieved a starker contrast in classification scores than SCMAP-cluster correlation methods (Fig 1B-right), which may contribute to the lower mean AUPR of the latter. The performance results described above held true more generally. SCN methods had significantly higher mean AUPRs than SCMAP correlation methods in 14/15 analyses, and SCN had similar or higher κ in 12/15 analyses (Fig S2A).

Finally, we determined the performance of the methods when applied across species with five data sets: three of the pancreas and two of the central nervous system (Table S2). Consistent with prior results, SCN-TP achieved significantly higher mean AUPR values, and either similar or higher κ values than other methods (Fig 1C). The only exception was in one of the pancreas analyses in which SCN-Base achieved a similar mean AUPR and moderately higher κ than SCN-TP.

To benchmark SCN against SCMAP more comprehensively, we also included SCMAP-cell in our comparisons using sixteen gold standard data sets (Fig S2B). We were limited to evaluating SCMAP-cell using only κ because the cell-to-cell comparison nature of SCMAP-cell (i.e. SCMAP-cell assigns identity based on cosine similarities of the top k-nearest neighbors) precluded computing Precision-Recall curves. Unlike the reported assessment in the original article, we also included the unassigned cells in the assessment, which resulted in lower κ values (Table S3). In brief, this analysis shows that the SCN methods achieve higher κ than SCMAP-cell on all sixteen benchmark data sets.

We compared the run times for feature selection between SCN and SCMAP (Fig S3A). SCN feature selection is roughly twenty-fold slower than SCMAP. We also provide the classification/projection step runtime comparison among SCN, SCMAP-cluster and SCMAP-cell (Fig S3B). For small query data sets, the three methods showed similar time in classification/projection. But as the query cell number increases, the projection time increases significantly for SCMAP-cell but not for SCN-TP. For example, when the query cell number is 40,600, SCMAP-cell took 24 minutes to project, whereas SCN-TP completed in less than one minute.

Collectively, these analyses show that SCN-TP achieves a high, if not the highest, classifier performance across a range of conditions, including correction for cell cycle, differences in platforms, and differences in species. We note that while the SCN-TP feature selection step is slower than SCMAP’s, it still required only 20 minutes on the largest training data set, and therefore the gains in performance are worth the extra time required to train SCN-TP.

Example applications of SingleCellNet

Below, we briefly describe two example applications of SCN-TP: cell-typing of a human pancreas tissue atlas and cell-typing the derivatives of a direct conversion experiment. For the first example, we chose to perform cell-typing on human pancreas cells from Segerstolpe et al., which consists of 2,209 cells of 14 different cell types generated by Smart-Seq2 (Segerstolpe et al., 2016). To illustrate the power of SCN-TP, we first classified the Segerstolpe et al. data in comparison to the 69 murine cell types from 20 organs/tissues represented in the Tabula Muris cell atlas (Tabula Muris Consortium et al., 2018). As expected, nearly all of the Segerstolpe et al. cells classified as pancreatic in origin, with the exception of a small number of endothelial and immune cells (Fig 2A). Next, to provide a finer level of cell-typing resolution, we classified the Segerstolpe et al. data in comparison to the Baron et al. human pancreas data, which consists of 14 different cell types generated by inDrop (Baron et al., 2016). This resulted in a nearly unequivocal classification of most of the query cells according to their annotation as defined in the study (Fig 2B), with the exception of some acinar cells, which are cross-classified as ductal. Notably, our analysis suggested that cells originally annotated as ‘co-expressors’ are most similar to alpha cells. Another output of SCN is the attribution plot, in which SCN assigns a single identity to each cell based on the category with the maximum classification score. We used this strategy to assign putative identity to the 43 cells that had been previously “unclassified” or “unclassified endocrine” in the original study.

Figure 2. Application of SCN across scRNA-Seq platforms to determine the identity of unknown cells (A-C) and the quantitative assessment of direct reprogramming protocols (D-G). (A) Tabula Muris FACS-based mouse pancreatic scRNA-Seq data with the annotation provided by the authors and (B) Baron human pancreatic scRNA-Seq data (generated with a different scRNA-Seq technique) with its annotation were used to train SCN-TP classifiers, with Segerstolpe adult human pancreatic scRNA-Seq data as the query. (C) An attribution plot summarizes the percentage of newly classified cells within the previously unknown categories. (D) We used adult Microwell-Seq scRNA-Seq data (45 cell types) and its annotations to train a cell-type specific SCN-TP classifier. The scRNA-Seq profiles of induced neurons (iNs) from the Tsunemoto et al. screening experiment, generated by applying transcription factor combinations, were used as query data. (E) In this classification heatmap, the columns annotate the five different transcription factor pairs used to generate each iN profile, namely *Neurog3/Pou5f1* (N3O4), *Neurog3/Pou3f4* (N3B4), *Neurog1/Pou4f1* (N1B3a), *Ascl2/Nr4a2* (A1NR1) and *Neurog3/Pou1f1* (N3P1). The four most prominent classification categories (ganglion, MEF, unknown and skeletal) are labeled. (F) The classification score is visualized with a violin plot, where the x-axis displays the transcription factor pair combinations and the y-axis is the range of classification scores in a given category. The three most relevant classifier categories are shown. (G) To obtain a more comprehensive understanding of the composition of cells in each reprogramming experiment, an attribution plot is used, with the rows showing the transcription factor pair combinations and the columns denoting the percentage count. Each cell is colored by its classification category with the highest score.

Our analysis predicted that fifty percent of the ‘unclassified’ cells are Schwann cells, and the remaining are predicted to be gamma cells (Fig 2C). Similarly, the ‘unclassified endocrine’ category was predicted to contain a mixture of alpha, beta, delta, gamma and ductal cells. We note that the ‘winner take all’ approach of the attribution plot can obscure hybrid classifications that could arise from technical sources (e.g. doublets) or from biological sources (e.g. intermediate developmental states). Indeed, a closer examination of the classification heatmap of the previously unclassified cells indicated a clear classification for 20 out of the 43 cells, with the remaining cells exhibiting seemingly dual identities of either alpha/beta cells or alpha/gamma cells (Fig S4).
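The difference between a winner-take-all attribution and a hybrid classification can be made concrete with a short Python sketch. The function name, the example scores, and the `hybrid_margin` threshold are hypothetical and chosen only for illustration; SCN itself reports the full vector of classification scores.

```python
def attribute(scores, hybrid_margin=0.2):
    """Return (winning category, runner-up or None) for one cell.

    scores        : dict mapping category -> classification score
    hybrid_margin : hypothetical threshold; if the top two scores differ by
                    less than this, the runner-up is reported as well
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_score), (second, second_score) = ranked[0], ranked[1]
    runner_up = second if top_score - second_score < hybrid_margin else None
    return top, runner_up

# Hypothetical classification scores for two query cells.
cells = {
    "cell_1": {"alpha": 0.85, "beta": 0.05, "gamma": 0.10},
    "cell_2": {"alpha": 0.45, "beta": 0.40, "gamma": 0.15},
}
calls = {cell_id: attribute(s) for cell_id, s in cells.items()}
```

Both cells would be drawn as alpha in an attribution plot, but only the first is an unambiguous call; the second carries nearly as much beta signal, which is the kind of dual identity a classification heatmap reveals.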

For the second example, we chose to perform cell-typing on fibroblasts that had been reprogrammed to a neural-like identity (Fig 2D) (Tsunemoto et al., 2018). We used a subset of the Han et al. Microwell-Seq data (45 cell types) to train an SCN-TP classifier, as this reference data included a diverse set of primary cell types including neurons (Ganglion cells), mouse embryonic fibroblasts, and skeletal muscle cells (Han et al., 2018). Our analysis indicates that three of the transcription-factor pairs (Neurog3/Pou5f1, Neurog3/Pou3f4, Neurog1/Pou4f1) are equally successful in generating cells with a generic neural transcriptome (Fig 2E–F). Cells converted with the transcription factor pair Neurog3/Pou1f1 maintained a residual embryonic fibroblast signature (Fig 2E–F). We found that a small number of cells were classified as skeletal muscle, consistent with prior observations that direct conversion can yield a minority of skeletal muscle-like cells (Treutlein et al., 2016). The transcription factor pair Ascl2/Nr4a2 generated a higher proportion of skeletal myocytes than the other pairs (Fig 2G).

In the online documentation for SCN and in the supplemental figures, we have included several other example applications, including a cross-species classification of peripheral blood mononuclear cells (Fig S5) (Zheng et al., 2017), a cross-study classification comparison of neuronal atlases (Fig S6 and Fig S7) (Tasic et al., 2016; Zeisel et al., 2018), and a workflow of SCN (Fig S8).

Discussion

The demand for a robust method to quantify cell identity will grow as technologies to generate scRNA-Seq data proliferate and become more accessible. Here, we have shown that SCN quantifies cell identity in a manner that is robust to different scRNA-Seq platforms and that is capable of classifying cells across species. In contrast to other scRNA-Seq integration and comparison methods, we expect that SCN will be especially useful when a quantitative, rather than a binary, metric of identity is informative, or when the presence of shared cell types across data sets is unclear. One such application is the classification of engineered cell types in comparison to a reference data set of in vivo-derived cells. As more data across developmental time points are accrued, we anticipate that SCN will provide a means to quantify not only the identity but also the stage of development and maturation of engineered cells.

STAR Methods

Lead Contact and Materials Availability

Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Patrick Cahan (patrick.cahan@jhmi.edu). This study did not generate new unique reagents.

Method Details

Building and assessing the classifier

Building a classifier begins with a preprocessed gene expression matrix and a metadata table in which each cell in the gene expression matrix is annotated (Fig S8). We demonstrate how to use the SCN pipeline with the example provided in our online README (http://github.com/pcahan1/singleCellNet/), where the Tabula Muris 10x data set was used as training and the Park et al. data set was used as query (Park et al., 2018; Tabula Muris Consortium et al., 2018). If the training and query data sets are from two different species, we convert the query data set gene symbols to the symbols of their orthologs as determined by HCOP (Seal et al., 2011) (Step 0). Since in this case the training and query data sets are from the same species, we proceed to find the intersection of genes between the training data and the query data prior to training the SCN-TP classifier. Then, we randomly selected 100 cells (ncell = 100) per cell type from the entire training data to train the SCN-TP classifier and reserved the remaining cells to measure the classifier’s performance (Step 1). The subsetted training data was then downsampled to 1500 counts per cell (total = 1.5e3), scaled up such that the total expression per cell was 10000 (xFact = 1e4), and log-transformed. Based on the annotation (dLevel = “newAnn”), we found the top ten (topX = 10) most differentially expressed genes per cell type (Step 2a), then we ranked the top twenty-five gene pairs per cell type (topX = 25) from those genes (Step 2b). To optimize memory usage, we parallelized the ranking process, examining sets of gene pairs in chunks of 5000 (sliceSize = 5000) (Step 2b). The preprocessed training data was then transformed according to the selected gene pairs (Step 3), and was used to build a multi-class SCN-TP classifier of 1000 trees (ntrees = 1000) (Step 4).
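The normalization described in this paragraph (downsampling to 1500 counts per cell, scaling to a total of 10000, and log-transforming) can be sketched in Python as follows. This is an illustration rather than the SCN implementation: the function names are hypothetical, sampling without replacement and the use of log(1 + x) are assumptions, and cells with fewer counts than the target are simply left unchanged here.

```python
import math
import random

def downsample(counts, total=1500, rng=None):
    """Sample `total` transcripts without replacement from one cell.

    counts : list of integer counts, one entry per gene
    Cells with at most `total` counts are returned unchanged (assumption).
    """
    rng = rng or random.Random()
    if sum(counts) <= total:
        return list(counts)
    # One pool entry per observed transcript, labeled by its gene index.
    pool = [gene for gene, c in enumerate(counts) for _ in range(c)]
    out = [0] * len(counts)
    for gene in rng.sample(pool, total):
        out[gene] += 1
    return out

def scale_and_log(counts, x_fact=1e4):
    """Scale a cell to `x_fact` total expression, then apply log(1 + x)."""
    s = sum(counts)
    if s == 0:
        return [0.0] * len(counts)
    return [math.log1p(c / s * x_fact) for c in counts]

# Toy cell with 5000 counts spread over three genes (hypothetical values).
cell = [3000, 1500, 500]
ds = downsample(cell, total=1500, rng=random.Random(0))
norm = scale_and_log(ds, x_fact=1e4)
```

Because the Top-Pair transform only compares genes within a cell, the scaling and log steps do not change which gene of a pair is higher; they matter mainly for the differential expression step that nominates candidate genes.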
Additionally, we created 100 randomized cell expression profiles (nrand = 100) to train a “rand” or “unknown” category in the SCN-TP classifier, which can help in cases where some cell types that are present in the query data are not included in the training data (Step 2b). After the SCN-TP classifier was built, we transformed the remaining held-out data according to the top gene pairs selected (Step 5a), along with another 100 randomized cells (numRand = 100). We queried the transformed held-out data (Step 5b) and assessed the performance of the classifier on the held-out data using Precision-Recall curves, κ, and mean AUPR (Step 5c-e). This is a crucial quality control step as it will indicate the optimal performance that can be expected from the classifier. If the classifier performs poorly on held-out data, then the user should troubleshoot the training procedure beginning with the scRNA-Seq data annotation.
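One way to picture the randomized profiles is to draw each gene-pair value independently from the values observed for that pair across the training cells. The Python sketch below does exactly this; it is an assumption about how such profiles could be built, not a description of the SCN code, and the function name and toy matrix are hypothetical.

```python
import random

def random_profiles(binary_matrix, n_rand=100, seed=0):
    """Build n_rand synthetic profiles from a cells x gene-pairs 0/1 matrix.

    Each value is drawn independently from the corresponding column, so the
    per-pair frequencies are kept while cell-type structure is broken.
    """
    rng = random.Random(seed)
    n_pairs = len(binary_matrix[0])
    columns = [[row[j] for row in binary_matrix] for j in range(n_pairs)]
    return [[rng.choice(col) for col in columns] for _ in range(n_rand)]

# Toy training matrix: two cell types with opposite pair patterns.
train = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
rand = random_profiles(train, n_rand=100)
```

Profiles built this way mix the patterns of different cell types, so most of them resemble no single training class and can be labeled "rand" during training.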

Classifying query data

Once we determined that the classifier performed well, we transformed the external query data (Park et al., 2018) using the top pairs selected from the training data (Step 6) and classified it with the SCN-TP classifier (Step 7). The classification results can be displayed as i) a classification heatmap, ii) a UMAP, iii) an attribution plot, iv) a skyline plot, and v) a classification violin plot (Step 8).

Notes to users

The quality and annotation of the training data are critical to building reliable classifiers. We recommend starting with 10–20 distinct cell types when training an SCN-TP classifier, then iteratively adding more cell types and assessing classifier performance. The user should not attempt to assess a classifier with the query data, as the true identity of the query cells is unknown.

Quantification and Statistical Analysis

Benchmark assessment

Three variants of SCMAP-cluster (Pearson correlation, Spearman correlation, and cosine similarity) were run by adapting code from SCMAP-cluster to calculate similarities and correlations and to create labels. All outputs were stored as intermediate matrices and then transformed to confusion matrices to calculate the area under the precision-recall curve and κ. SCMAP-cell similarities and labels were obtained by running SCMAP-cell directly and were also transformed to a confusion matrix to calculate κ. Because of how SCMAP-cell similarity is computed, the top k nearest neighbors often come from a few similar cell types rather than from all possible reference cell types, which precluded us from computing a fair mean AUPR metric for SCMAP-cell.
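As an illustration of how κ follows from a confusion matrix, the Python sketch below computes the unweighted form of Cohen's kappa, (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name and the example matrix are hypothetical, and the use of the unweighted form is an assumption.

```python
def cohens_kappa(confusion):
    """Unweighted Cohen's kappa from a square confusion matrix.

    confusion[i][j] = number of cells of true class i predicted as class j.
    Undefined (division by zero) when chance agreement p_e equals 1.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    p_o = sum(confusion[i][i] for i in range(k)) / n
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(r * c for r, c in zip(row_tot, col_tot)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical two-class confusion matrix for 100 cells.
confusion = [
    [45, 5],
    [10, 40],
]
kappa = cohens_kappa(confusion)  # p_o = 0.85, p_e = 0.50
```

Treating "unassigned" as one more predicted category, as was done for SCMAP-cell, moves cells off the diagonal and therefore lowers p_o and κ.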

All cross-platform and cross-species comparisons between SCN and SCMAP-cluster were cross-validated (Fig 1C and Fig S2A): 50 cells per cell type were randomly selected from the training set to train a classifier/reference, and the corresponding external query data set was then queried. The process was repeated 10 times for each data set pair. Mean and standard error were reported for each quantitative measurement.
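The resampling scheme can be sketched in Python as follows. The function names and the toy labels are hypothetical, the per-repeat metric values are placeholders rather than results, and the handling of cell types with fewer than 50 cells (all cells are taken) is an assumption.

```python
import random
import statistics

def sample_per_type(labels, n_per_type=50, rng=None):
    """Randomly pick up to n_per_type cells from each cell type.

    labels : dict mapping cell id -> cell type
    """
    rng = rng or random.Random()
    by_type = {}
    for cell, cell_type in labels.items():
        by_type.setdefault(cell_type, []).append(cell)
    picked = []
    for cell_type in sorted(by_type):
        cells = by_type[cell_type]
        picked.extend(rng.sample(cells, min(n_per_type, len(cells))))
    return picked

def mean_and_se(values):
    """Mean and standard error of the mean over repeated runs."""
    return statistics.mean(values), statistics.stdev(values) / len(values) ** 0.5

# Toy annotation: 120 cells of type A and 80 of type B.
labels = {f"cell_{i}": ("A" if i < 120 else "B") for i in range(200)}
rng = random.Random(0)
training_sets = [sample_per_type(labels, 50, rng) for _ in range(10)]

# Placeholder metric values standing in for one score per repeat.
mean_metric, se_metric = mean_and_se([0.90, 0.92, 0.88])
```

Each of the ten training sets would be used to train a classifier that is then applied to the same external query data, giving ten values per metric to summarize.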

Performance of the SCN TP-RF classifier on varying parameters

To assess how the top-pair parameter influences the performance of the TP-RF classifier, we used the Tabula Muris 10x scRNA-Seq (TM-10x) data set (Fig S1A). First, we sampled 50 cells from each of the 32 defined cell types from this data set for training top-pair SCN (SCN-TP) across a range of top pairs per cell type. Next, we used the 23,337 remaining cells as held-out validation data and measured both κ and mean AUPR.
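Mean AUPR treats each category as a one-versus-rest problem, computes the area under its precision-recall curve from the classification scores, and averages over categories. The Python sketch below uses average precision as the area estimate, which is one common choice and an assumption here; the function names and scores are hypothetical, and ties between scores are not treated specially.

```python
def average_precision(scores, is_positive):
    """Estimate the area under the precision-recall curve for one category.

    scores      : classification score of each cell for this category
    is_positive : True where the cell truly belongs to this category
    """
    n_pos = sum(is_positive)
    if n_pos == 0:
        return float("nan")
    ranked = sorted(zip(scores, is_positive), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos

def mean_aupr(score_table, true_labels):
    """Average the per-category areas.

    score_table : dict mapping category -> list of scores, one per cell
    true_labels : list of true categories, in the same cell order
    """
    areas = [
        average_precision(score_table[cat], [t == cat for t in true_labels])
        for cat in sorted(score_table)
    ]
    return sum(areas) / len(areas)

# Hypothetical scores for four held-out cells and two categories.
true_labels = ["a", "a", "b", "b"]
score_table = {
    "a": [0.9, 0.4, 0.6, 0.1],
    "b": [0.1, 0.6, 0.4, 0.9],
}
result = mean_aupr(score_table, true_labels)
```

Unlike κ, this metric needs a continuous score per cell and category, which is why it cannot be computed from hard label assignments alone.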

We used the same data set to investigate how adjusting profiles based on stage of cell cycle affects TP-RF classifiers. We evaluated the performance of the SCN-Base and SCN-TP classifiers in three scenarios: no cycle adjustment, adjustment of both training and validation data, and adjustment of validation data only, where adjustment is performed by regression on stage of cell cycle (Fig S1B). In the scenario of no adjustments, SCN-TP and SCN-Base performed similarly (κ 0.93 vs 0.94 and mean AUPR 0.95 vs 0.96). When both the training data and validation data were adjusted for cell cycle, the performance of SCN-TP held steady while that of SCN-Base plummeted (κ 0.93 vs 0.29 and mean AUPR 0.94 vs 0.31). In the case in which only the training data was adjusted for cell cycle, SCN-Base worsened further while SCN-TP remained stable (κ 0.92 vs 0.05, mean AUPR 0.93 vs 0.19). In the case in which only the validation data was adjusted for cell cycle, SCN-TP and SCN-Base both showed reduced performance (κ 0.03 vs 0.52, mean AUPR 0.39 vs 0.88).

Data and Code Availability

To aid in the community’s use and improvement of SCN, we have made it available under an Open Source software license, and the code is accessible at GitHub: (http://github.com/pcahan1/singleCellNet/). Our documentation includes sections on training new classifiers, troubleshooting, expected computation time, and step-by-step procedures to reproduce the examples described here. All data sets and cell type annotations were obtained through public accession (STAR Methods and Table S1–2). Similar cell types were merged in Tabula Muris and Microwell-seq data based on hierarchical clustering. Details of the merge can be found in our list of GitHub-hosted reference data (http://github.com/pcahan1/singleCellNet/ and Table S4). We also provide 12 curated, ready-to-use atlases/datasets (Table S4), including expression matrices and metadata, on our GitHub page that are ready to be used as training data.

Supplementary Material

NIHMS1532557-supplement-1.pdf (5.5 MB, pdf)

Key Resource Table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited Data
Tabula Muris scRNA-Seq data	(Tabula Muris Consortium et al., 2018)	https://figshare.com/projects/Tabula_Muris_Transcriptomic_characterization_of_20_organs_and_tissues_from_Mus_musculus_at_singlecellresolution/27733
Microwell-Seq scRNA-Seq data	(Han et al., 2018)	https://figshare.com/articles/MCA_DGE_Data/5435866
Baron pancreas scRNA-Seq data	(Baron et al., 2016)	https://hemberg-lab.github.io/scRNA.seq.datasets/human/pancreas/
Xin pancreas scRNA-Seq data	(Xin et al., 2016)	https://hemberg-lab.github.io/scRNA.seq.datasets/human/pancreas/
Segerstolpe pancreas scRNA-Seq data	(Segerstolpe et al., 2016)	https://hemberg-lab.github.io/scRNA.seq.datasets/human/pancreas/
Muraro pancreas scRNA-Seq data	(Muraro et al., 2016)	https://hemberg-lab.github.io/scRNA.seq.datasets/human/pancreas/
Zheng PBMC scRNA-Seq data	(Zheng et al., 2017)	https://support.10xgenomics.com/single-cell-gene-expression/datasets
Darmanis brain scRNA-Seq data	(Darmanis et al., 2015)	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67835
Zeisel brain scRNA-Seq data	(Zeisel et al., 2018)	http://mousebrain.org/downloads.html
Tasic cortex scRNA-Seq data	(Tasic et al., 2016)	https://github.com/AllenInstitute/tasic2016data
Curated training data sets	This paper	http://github.com/pcahan1/singleCellNet/
Software and Algorithms
R version 3.5.1	R Foundation for Statistical Computing	https://www.r-project.org/
SCMAP	(Kiselev et al., 2018)	https://github.com/hemberg-lab/scmap
SingleCellNet	This paper	http://github.com/pcahan1/singleCellNet/

Highlights

SingleCellNet (SCN) enables quantitative classification of scRNA-Seq data
SCN can be applied across platforms and across species
SCN can assess the fidelity of cell fate engineering experiments
SCN provides 12 ready-to-use public reference datasets

Acknowledgments

This work was supported by the National Institutes of Health under grant R35GM124725 to PC and the Biochemistry, Cellular, and Molecular Biology Program training grant to YT. We would also like to thank members of the Cahan lab, especially Emily Su, Emily Lo, Dan Peng, Ray Cheng, and Abby Spangler, for providing feedback and support.

Declaration of Interests

The authors declare no competing interests.

References

Baron M, Veres A, Wolock SL, Faust AL, Gaujoux R, Vetere A, Ryu JH, Wagner BK, Shen-Orr SS, Klein AM, et al. (2016). A Single-Cell Transcriptomic Map of the Human and Mouse Pancreas Reveals Inter- and Intra-cell Population Structure. Cell Syst. 3, 346–360.e4.
Barron M, and Li J (2016). Identifying and removing the cell-cycle effect from single-cell RNA-Sequencing data. Sci. Rep 6, 33892.
Breiman L (2001). Random Forests. Machine Learning 45, 5–32.
Butler A, Hoffman P, Smibert P, Papalexi E, and Satija R (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol 36, 411–420.
Cahan P, Li H, Morris SA, Lummertz da Rocha E, Daley GQ, and Collins JJ (2014). CellNet: network biology applied to stem cell engineering. Cell 158, 903–915.
Cao J, Packer JS, Ramani V, Cusanovich DA, Huynh C, Daza R, Qiu X, Lee C, Furlan SN, Steemers FJ, et al. (2017). Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, 661–667.
Cohen J (1968). Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull 70, 213–220.
Crow M, Paul A, Ballouz S, Huang ZJ, and Gillis J (2018). Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor. Nat. Commun 9, 884.
Darmanis S, Sloan SA, Zhang Y, Enge M, Caneda C, Shuer LM, Hayden Gephart MG, Barres BA, and Quake SR (2015). A survey of human brain transcriptome diversity at the single cell level. Proc. Natl. Acad. Sci. USA 112, 7285–7290.
Geman D, d’Avignon C, Naiman DQ, and Winslow RL (2004). Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol 3, Article19.
Haber AL, Biton M, Rogel N, Herbst RH, Shekhar K, Smillie C, Burgin G, Delorey TM, Howitt MR, Katz Y, et al. (2017). A single-cell survey of the small intestinal epithelium. Nature 551, 333–339.
Haghverdi L, Lun ATL, Morgan MD, and Marioni JC (2018). Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol 36, 421–427.
Han X, Wang R, Zhou Y, Fei L, Sun H, Lai S, Saadatpour A, Zhou Z, Chen H, Ye F, et al. (2018). Mapping the Mouse Cell Atlas by Microwell-Seq. Cell.
Kiselev VY, Yiu A, and Hemberg M (2018). scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359–362.
Kowalczyk MS, Tirosh I, Heckl D, Rao TN, Dixit A, Haas BJ, Schneider RK, Wagers AJ, Ebert BL, and Regev A (2015). Single-cell RNA-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells. Genome Res. 25, 1860–1872.
Kumar P, Tan Y, and Cahan P (2017). Understanding development and stem cells using single cell-based analyses of gene expression. Development 144, 17–32.
Muraro MJ, Dharmadhikari G, Grün D, Groen N, Dielen T, Jansen E, van Gurp L, Engelse MA, Carlotti F, de Koning EJP, et al. (2016). A Single-Cell Transcriptome Atlas of the Human Pancreas. Cell Syst. 3, 385–394.e3.
Park J, Shrestha R, Qiu C, Kondo A, Huang S, Werth M, Li M, Barasch J, and Suszták K (2018). Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science 360, 758–763.
Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, et al. (2014). Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 344, 1396–1401.
Pavlidis P, and Noble WS (2001). Analysis of strain and regional variation in gene expression in mouse brain. Genome Biol. 2, RESEARCH0042.
Radley AH, Schwab RM, Tan Y, Kim J, Lo EKW, and Cahan P (2017). Assessment of engineered cells using CellNet and RNA-seq. Nat. Protoc 12, 1089–1102.
Seal RL, Gordon SM, Lush MJ, Wright MW, and Bruford EA (2011). genenames.org: the HGNC resources in 2011. Nucleic Acids Res. 39, D514–9.
Segerstolpe Å, Palasantza A, Eliasson P, Andersson E-M, Andréasson A-C, Sun X, Picelli S, Sabirsh A, Clausen M, Bjursell MK, et al. (2016). Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes. Cell Metab 24, 593–607.
Tabula Muris Consortium, Overall coordination, Logistical coordination, Organ collection and processing, Library preparation and sequencing, Computational data analysis, Cell type annotation, Writing group, Supplemental text writing group, and Principal investigators (2018). Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372.
Tasic B, Menon V, Nguyen TN, Kim TK, Jarsky T, Yao Z, Levi B, Gray LT, Sorensen SA, Dolbeare T, et al. (2016). Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci 19, 335–346.
Treutlein B, Lee QY, Camp JG, Mall M, Koh W, Shariati SAM, Sim S, Neff NF, Skotheim JM, Wernig M, et al. (2016). Dissecting direct reprogramming from fibroblast to neuron using single-cell RNA-seq. Nature 534, 391–395.
Tsunemoto R, Lee S, Szűcs A, Chubukov P, Sokolova I, Blanchard JW, Eade KT, Bruggemann J, Wu C, Torkamani A, et al. (2018). Diverse reprogramming codes for neuronal identity. Nature 557, 375–380.
Wolf FA, Angerer P, and Theis FJ (2018). SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19, 15.
Xin Y, Kim J, Okamoto H, Ni M, Wei Y, Adler C, Murphy AJ, Yancopoulos GD, Lin C, and Gromada J (2016). RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metab. 24, 608–615.
Zeisel A, Hochgerner H, Lönnerberg P, Johnsson A, Memic F, van der Zwan J, Häring M, Braun E, Borm LE, La Manno G, et al. (2018). Molecular architecture of the mouse nervous system. Cell 174, 999–1014.e22.
Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, Ziraldo SB, Wheeler TD, McDermott GP, Zhu J, et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nat. Commun 8, 14049.


Supplementary Materials

NIHMS1532557-supplement-1.pdf (5.5 MB, PDF)

