Abstract
Background
A key challenge in the emerging field of single-cell RNA-Seq is to characterize phenotypic diversity between cells and visualize this information in an informative manner. A common technique when dealing with high-dimensional data is to project the data to 2 or 3 dimensions for visualization. However, there are a variety of methods to achieve this result and once projected, it can be difficult to ascribe biological significance to the observed features. Additionally, when analyzing single-cell data, the relationship between cells can be obscured by technical confounders such as variable gene capture rates.
Results
To aid in the analysis and interpretation of single-cell RNA-Seq data, we have developed FastProject, a software tool which analyzes a gene expression matrix and produces a dynamic output report in which two-dimensional projections of the data can be explored. Annotated gene sets (referred to as gene ‘signatures’) are incorporated so that features in the projections can be understood in relation to the biological processes they might represent. FastProject provides a novel method of scoring each cell against a gene signature so as to minimize the effect of missed transcripts as well as a method to rank signature-projection pairings so that meaningful associations can be quickly identified. Additionally, FastProject is written with a modular architecture and designed to serve as a platform for incorporating and comparing new projection methods and gene selection algorithms.
Conclusions
Here we present FastProject, a software package for two-dimensional visualization of single cell data, which utilizes a plethora of projection methods and provides a way to systematically investigate the biological relevance of these low dimensional representations by incorporating domain knowledge.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-016-1176-5) contains supplementary material, which is available to authorized users.
Keywords: Single-Cell, RNA-Seq, Dimensionality reduction
Background
In an analogous manner to the maturation of RNA-Seq methodologies, single-cell RNA-Seq (scRNA-Seq) is now in its infancy and requires new computational tools to realize its full potential for dissecting and understanding the functional meaning of cell-to-cell heterogeneity [1, 2]. Visualization methods provide an effective strategy for inspecting and characterizing the phenotypic diversity between cells. In a typical scenario, the analysis begins with a matrix of expression levels of thousands of genes in hundreds of cells. An appealing way to make sense out of this immense data is to project it into a two dimensional scatter-plot, where each cell is represented by a single point. While such representations provide an easy way to see obvious stratification of cells into a taxonomy of discrete types, they can also provide more nuanced views of gradual transitions, reflecting for instance, developmental processes [3], physical locations [4], or the cell cycle [5]. Indeed, supplementing these two-dimensional views with additional, phenotypical information (e.g. the expression level of marker genes) can be used to provide the correct context, and make the observed diversity between cells interpretable [6, 7]).
There are three main challenges in making effective use of such visualizations for scRNA-Seq data. The first challenge is selecting an appropriate method for dimensionality reduction (projection) among candidates such as principal component analysis (PCA) [1, 2], independent component analysis (ICA) [3] or various non-linear methods such as t-distributed stochastic neighbor embedding (t-SNE) [8], each of which may highlight different aspects of the data. Once a projection is created, a second challenge is to interpret its biological significance, namely which cellular phenotypes [7] or processes [9] are most responsible for the resulting arrangement of cells. Lastly, scRNA-Seq can be difficult to correctly interpret due to technical confounders such as differences in gene capture rates [10, 11]. Performing functional interpretation on the input data without accounting for this effect may lead to incorrect interpretation of the biological meaning of cell-to-cell heterogeneity.
Introducing FastProject
To address these issues, we have developed FastProject, a software tool for the visualization and interpretation of scRNA-Seq data. FastProject allows the user to explore a gene expression matrix using a plethora of two-dimensional visualization methods. To interpret these two-dimensional plots, we use the concept of biological signatures - sets of genes that represent either a collection of genes with a common associated function (e.g. Glycolysis) or a dichotomy between two cellular states of interest (e.g. naïve vs. memory CD8 T-cells)[7]. We evaluate the extent to which these phenotypic dichotomies are reflected in the projections (i.e. to what extent do neighboring cells in the projection have similar values for the genes included in the signature), and highlight the significant projection-signature pairs. This analysis is made possible in single cells by modeling the probability that a missed transcript was actually expressed in the cell, and using these probabilities when evaluating signature scores on samples to minimize the effect of variable capture rates between cells. As a source for biological signatures, we use publicly available datasets that consist of comparative information from hundreds of studies (e.g. MSigDB [12]), which can be supplemented by the user to include any other gene signatures of interest. Through this automated analysis, FastProject therefore provides a systematic view of the main axes of variation in the data, along with their possible biological meaning.
Compared with currently-available software, FastProject is unique in the combination of methods it employs. Visualization software such as viSNE [6], although designed for cytometry data, allows for the integrated visualization of transcript levels overlaid on a two-dimensional tSNE projection. However, viSNE does not incorporate the use of gene sets or provide a method to systematically search for biological variation within a two-dimensional projection. Another method, PAGODA [13], makes use of gene sets, but does so in the context of a clustered heatmap of expression data, not a 2-dimensional embedding.
It is important to clarify up-front that FastProject is not a normalization tool. Indeed, it has been observed by us and others that without proper normalization scRNA-Seq data can be heavily confounded by technical factors such as library depth and complexity [7]. We strongly advocate the use of scRNA-Seq normalization tools (e.g. based on [5] or [7]) prior to any downstream analysis, and we assume that the data has been normalized prior to input to FastProject. Nevertheless, since scRNA-Seq measurements tend to be characterized by strong zero-inflation, we conduct our biological signature analysis while aiming to minimize the effect of gene dropout events (i.e. false negatives, described in [10, 14]).
When running FastProject, processing is done upfront, producing dozens of projection maps (using different algorithms and parameterizations) and their functional annotation, which can be inspected through an interactive HTML report. On a typical data set, consisting of around 1000 cells, processing time is typically between 10 and 30 min (see Additional file 1 for benchmarking results). Because processing is performed upfront, the user can quickly switch between different projection maps in the output report as well as share the results with colleagues who would not themselves need to install FastProject. Importantly, FastProject has been written to be easily extensible so that it may serve as a general platform for deploying and evaluating new gene filtering schemes, false-negative estimation methods, or projection techniques. Instructions for developers on how to augment FastProject are detailed in the FastProject wiki hosted at https://github.com/YosefLab/FastProject/wiki.
Implementation
Overview
The steps involved in the FastProject processing pipeline are depicted in Fig. 1.
The software starts with an evaluation of false negative rates, later used to down-weight the effect of drop-outs on the biological signature analysis. It then selects sufficiently variable genes for further analysis using increasingly stringent criteria. With these genes in hand, FastProject uses 11 different projection methods (summarized below) to calculate two dimensional coordinate for each cell. Based on a user-provided database of gene signatures it then computes a score for every cell/signature pair and uses a randomization test to identify statistically significant projection-signature associations. Importantly, the confounding effect of missed transcripts is mitigated by estimating the probability that each undetected gene was actually expressed in the cell, and down-weighting the contributions of these measurements in downstream analysis (similar to [7]). Altogether, FastProject outputs 76 possible projections (a combination of choosing different gene filtering criteria, whether or not the data was reduced to significant PCs prior to projection, and the final projection method) along with their functional annotations, which can be interactively inspected through a user-friendly HTML report (Figs. 1b and 3). The results are also provided in the form of text files (including signature scores, projection coordinates etc.), which can be used for downstream analysis.
False-negative estimates
To account for expressed transcripts that are not detected (false negatives) due to the limitations in sensitivity [1, 10, 11, 15], an initial step in the processing pipeline is to evaluate these detection rates so that the subsequent analysis can down-weight the contribution of less reliably measured transcripts. More specifically, it has been observed that the detection probability for a transcript varies with the transcripts relative abundance measurement as well as the total amount of RNA in the sample [10, 14]. To characterize this, we calculate a false negative rate curve which estimates the probability of detection based on a gene’s abundance. A separate curve is estimated for each cell individually so that varying levels of library quality and cell integrity can be accounted for.
To perform this estimate, our procedure utilizes a set of human housekeeping genes [16] under the assumption that these housekeeping genes are constitutively expressed in all cells and only missed due to technical errors. Importantly, as the appropriate set of constitutively expressed genes may differ from study to study and between organisms, FastProject can accept a user-defined housekeeping list.
Our estimation of false negative rates is built on and extends upon our previous work [7]. For each housekeeping gene, we estimate its mean expression level by taking the average of non-zero measurements for the gene. We then use the estimated means to group the genes into 30 quantiles, and denote the mean of genes in quantile 1≤q≤30 as μq. For each cell j and quantile q, we then compute Fjq as the proportion of genes from q that are not detected by j. Based on our assumption of constitutive expression, we treat Fjq as an empirical estimate to the dropout rates (i.e. probability that a gene is not detected, given that it is expressed). We use the Fjq values to fit a sigmoid function that describes the observed dropout rates as a function the genes’ average expression when detected (Fig. 1a):
1 |
where the estimated parameters αj and βj minimize the residual sum of squares (RSS) term:
2 |
Applying the fitted function globally for all genes, we estimate the conditional probability for a dropout event for gene i in cell j as: where μi is the average of gene i’s expression level when detected. Finally, we estimate the prior probabilities for detection and expression of each gene in order to evaluate weights for the missed transcripts, P(Gene i is not expressed by cell j∣Gene i isnot detected in cell j), as described in the Methods section. These probability estimates are used further in the pipeline to reduce the effect of missed transcripts when evaluating signature scores and generating projections. As a first step in validating the efficacy of this approach, a set of simulations were performed (Additional file 2, procedure detailed in the “Methods” section) in which FastProject’s weighting procedure was able to distinguish between true negatives (representing unexpressed genes) and false negatives (representing technical drop-outs).
The fitted sigmoids can also be used to provide an overall evaluation of the abundance of false negatives in each cell j by taking the area under the curve. This in turn provides a way to identify and exclude poor quality samples, assuming that such samples have higher dropout rates. Such quality control filter (on cells) is available in FastProject and can be enabled when running the pipeline (it is turned off by default). With this option enabled, samples that score 1.6 median absolute deviation (MAD) units worse than the median quality score are removed prior to calculating signatures and low-dimensional projections.
Generating 2-dimensional projections
Gene filtering
Selecting an informative set of genes is necessary for obtaining biologically meaningful patterns of variability between cells. To this end, FastProject applies a strict pre-filtering step that discards genes detected in less than a threshold number of cells. The default threshold is 20 % of the input cells, however this is configurable as an input option. Subsequently, FastProject produces projections that derive from all pre-filtered genes as well as subsets thereof, calculated using either of two filtering schemes. The first filter option selects bimodal genes, a characteristic which in the past has been useful in isolating biological variation (e.g. cytokines related to Th17 differentiation [7] or antiviral and inflammatory response genes in stimulated dendritic cells [17]). In FastProject bimodal genes are selected using the Hartigan’s Dip Test (p<0.05). The second filter option selects highly-variable genes, based on their Fano-factor (σ2/μ where μ is the average expression and σ the standard deviation across all cells). An increase in the Fano factor past a value of 1 indicates a departure from a Poisson statistic, which is the steady state distribution expected of constitutively expressed genes [18]. To select candidates with high Fano-factor, similar to the procedure in [19], genes are first stratified into 30 quantiles according to μ, and within each quantile genes are retained if their Fano-factor is more than 2 MAD above the quantile’s median Fano-factor.
Projection methods
The variety of methods available for the task of dimensionality reduction each come with strengths and weaknesses. For instance, projections of scRNA-Seq data based on PCA [7], provide an appealing guarantee about the preservation of variation, and makes the contribution of individual genes readily interpretable [6]. However, the underlying assumptions of PCA may not necessarily be supported by single cell data. In particular, PCA is a linear transformation, which may not be able to accurately capture non-linear relationships in the data (i.e. if the data is embedded within a non-linear low-dimensional manifold). Additionally, PCA posits that the projection axes should be uncorrelated, which again may not necessarily result in the most informative representation. The same criticisms apply for other linear methods such as ICA [3]. A complementary, and commonly used approach (t-SNE [8]) uses a non-linear projection that aims to preserve the structure in the data locally. However, while this method aims to ensure that points that are close in the high dimensional space will be close (with high probability) in the low dimensionality embedding, more global relations are not directly interpretable from the results. To avoid the issue of choosing up-front between these different options, FastProject uses these methods plus additional non-linear projection methods, including: ISOMAP [20] (using four nearest neighbors when defining the adjacency graph), PCA with a radial basis function (RBF) kernel (with kernel coefficient of ), Multi-Dimensional Scaling (MDS), and Spectral Embedding [21] (with Graph laplacian formed using k-nearest neighbors with ). For the linear PCA we consider three combinations of principal components (1st and 2nd, 1st and 3rd, 2nd and 3rd); and for tSNE we use two perplexity values (10 and 30), totaling in 11 projection methods overall. All methods are run as implemented in the Scikit-Learn package for Python [22]. After each projection, the resulting sets of 2-dimensional coordinates are mean-centered and scaled such that the average r2=[xcoordinate]2+[ycoordinate]2 equals 1. This standardization is performed so that signature-projection scores (defined below) are more comparable between projections.
Incorporating false-negative estimates into projections
Non-linear methods such as t-SNE have been shown to effectively combine with PCA. In this approach, PCA is performed first to reduce high-dimensional data to an intermediate number of dimensions, and this intermediate representation is further reduced to 2 or 3 dimensions by the non-linear method. The procedure is effective, assuming that the initial PCA operation preserves the main structure of the data while discarding unnecessary components.
FastProject makes use of this procedure to allow for a general method of incorporating weights (derived from false-negative estimates) into nonlinear projections. Prior to a non-linear projection, a weighted-PCA is performed by using a weighted covariance matrix (similar to [7]). Entries in the weighted covariance matrix are calculated as:
3 |
where Xij is the log-transformed expression of gene i in cell j and w represents the matrix of weights of equivalent size, defined as:
4 |
To select the number of components to retain, following weighted-PCA, the software employs a randomization scheme, as described in [23], to select the top principal components with statistically significant (p<0.05) contribution to the overall variance (retaining at least 5 components when less than 5 are significant). All the subsequent non-linear projections are evaluated on this reduced-dimension matrix. To provide a way for evaluating the effects of this reduction, FastProject also runs each non-linear projection method on the original, non-reduced matrix, allowing the user to switch between these representations in the output report.
Signature-based analysis
To interpret the biological meaning of the organization of cells in the resulting two dimensional maps, FastProject incorporates domain knowledge through the input of gene “signatures” [7]. The signatures can reflect a comparative analysis between two conditions of interest and consist of a set of genes, each of which is labeled as either “up-regulated” or “down-regulated” in that comparison. Signatures, such as these, are based off of annotations of cell states obtained from public databases (e.g. the Immsig collection [12]), or provided by the user. For each signature, a score is computed against each cell by aggregating over the weighted expression level of its genes. Specifically, for signature S and cell j, the respective signature score Rs(j) is calculated as:
5 |
Where is the standardized (Z-normalized across all cells) log expression level of gene i in cell j,wij is the estimated False-Negative weight (defined above), and signS(i)=−1 for genes in the "down-regulated" set and +1 otherwise. Notably, signatures can also be undirected, in which case the sign value is set to +1 for all member genes.
Projections vs. signatures
A signature-projection consistency score is calculated to evaluate how well each projection reflects the phenotypic variation that is captured by each signature. To this end, for each pair of signature, s, and projection, p, we compute a signature consistency score representing the extent to which neighboring points in the projection have similar signature scores. This is done by calculating for each cell j, an estimated signature rank based on its location in the two dimensional plot:
6 |
where Δjk is the Euclidean distance between cells j and k in the projection, α defines an effective neighborhood size (set to 0.33 by default), and rs(j) is the rank of the signature score for cell j (i.e. a rank transformation of the quantitative signature scores RS). The signature-projection consistency is then determined by the respective goodness of fit:
7 |
In this way, each signature/projection pairing is scored based on how similar signature scores are for samples located nearby in the projection space. To identify significantly consistent signature/projection pairs, we use signatures of randomly selected genes to create empirical background distributions of signature scores. We compare the consistency scores computed for the original signatures with those of the random ones (for example, see Fig. 2a), obtaining P-values using a Z-test and correcting for multiple hypothesis testing using the Benjamini Hochberg procedure. We observed that the distribution parameters of scores generated by random signatures varied with the number of genes in the signature. To account for this, separate distributions are generated for different signature sizes (10, 20, 50, 100, and 200 genes) and when assessing the significance of a signature score, the score is compared against the background distribution with the most similar number of genes. In the output, we report all the signatures that had a significant match (FDR<0.05) with at least one projection.
Results
Software
FastProject has been implemented as a command line Python package. As an input the software receives: (1) an expression matrix in a tab-delimited format (genes in rows, cells in columns). (2) a set of gene signatures, using the standard GMT format. Such sets of directed signatures are publicly available from various databases, such as MsigDB [12] (e.g. including signaling effects of genetic and chemical perturbations, cell cycle signatures, and comparison of cell types) and NetPath [24] (transcriptional effects of signaling cascades). Un-directed gene sets are naturally more abundant, and can be drawn from resources such as Gene Ontology [25], KEGG [26, 27], and MSigDB [12] (note that the latter includes much of the information in the former two). Importantly, the user can also upload his/her own signatures that reflect a phenotype of interest. For instance, in the example below, we study glioblastoma cells, and use signatures derived from microarray experiments, which define different tumor sub-types. FastProject also accepts pre-computed signatures, namely, an association of cells with pre-computed values. These can be categorical (e.g. annotations of clusters, computed by a different tool); or numerical (e.g. additional meta data associated with cells; e.g. levels of a handful of proteins, or technical factors such as library complexity; see Methods section for description of how pre-computed signatures are analyzed).
FastProject will calculate projections, signature scores, and their associations, covering all the options above (totaling in 76 different projections; Fig. 1b). To examine the extent to which the projections are affected by zero values, FastProject treats the sample quality score (defined above), and the percentage of genes with zero expression as pre-computed signatures and evaluates their association with each projection method. The output is provided as an HTML file (Fig. 3), where projections, signatures and their associations can be inspected interactively. Additionally, a data export feature embedded in the HTML report allows the generation of tab-delimited text files that depict the output projection coordinates, signature scores, and other relevant information. The source code, running, examples and user manual are bundled with this manuscript submission and also hosted at https://github.com/YosefLab/FastProject.
Extending FastProject
FastProject has been designed using a modular architecture so that new projection methods or criteria to filter genes can easily be added to the pipeline. Since dimensionality reduction is still an active research area, this allows new methods to easily be compared against more traditional approaches. For example, the recently proposed ZIFA algorithm [28] can be added by appending the following lines to Projections.py:
This is documented in the software wiki, hosted with the project repository at https://github.com/YosefLab/FastProject/wiki.
Proof of concept
We applied FastProject to a recently published data sets of tumor cells from five glioblastoma patients [2]. The analyzed data, consisting of 430 single cells with mRNA abundance in units of transcripts per million (TPM) and normalized as described in [2], was downloaded from the Gene Expression Omnibus, accession number GSE57872. We applied FastProject on this data, using a large collection of both “signed” (i.e. up- and down- regulated genes) and “unsigned” signatures from MSig DB (the C2 (curated), H (Hallmark), and C7 (Immune signatures) collections) and NetPath [24].
As a first check, we explored the distributions of signature consistency scores obtained for the original vs. random signatures, and compared the results to an application of FastProject on randomized datasets, where the entries in each column (Cell) were shuffled (i.e. maintaining the percentage of zeroes in every cell; Fig. 2a). As expected, we see pronounced differences between the original input signatures and the randomized ones when FastProject is applied on the non-shuffled data, and these differences disappear when we apply FastProject on the randomized data. As a second test, we evaluated the extent by which the signature consistency scores are biased by the abundance of zero-values in the analyzed cells. As expected, when we do not down-weigh the potential false negative entries, the signature consistency scores highly correlate with the amount of zeroes in each cell; namely the analysis primarily reflect what might be a result of technical dropouts. Conversely, down-weighing the false negatives reduces this bias (Fig. 2b). We repeated this procedure using a second dataset of scRNA-Seq data from mouse dendritic cells responding to antigen stimulation [1], obtaining similar results (Additional files 3 and 4).
Examining the output report of FastProject (Fig. 3a), we first observe that the various projection methods correctly stratify the cells according to their respective donors (Fig. 3b). More importantly, FastProject automatically picks up on several of the most important features in this data, providing a proof of concept for its utility as an unbiased analysis tool. Specifically, the two dimensional position of the cells is highly consistent with their scoring according to an epithelial to mesenchymal transition signature, which is a strong marker of poor prognosis [29, 30]. The two dimensional positions are also associated with signatures of other key pathways altered in cancer, including immune and hypoxic responses as detailed below. While these observations were made using a general database of signatures (MSigDB), we supplemented our analysis with case-specific signatures - in this case gene signatures from TCGA that are predictive of Glioblastoma subtypes [30] to provide further support. As expected we see high level of concordance between the TCGA-derived scores of the Mesenchymal tumor sub-type and the epithelial to mesenchymal transition signatures from MSigDB. We also see the mirror image of the cell ranking when we consider the TCGA-derived signature of the Proneural tumor subtype of glioblastoma.
In addition to the HTML report, FastProject outputs all the cell-signature scores in textual format. Taking advantage of this feature, we were able to more closely inspect the relationship between the different pathways that were highly correlated with the two dimensional positions and reveal new associations in the data. Considering the top ranked signatures (p<10−15 for at least one projection method), and filtering overlapping signatures (Jaccard coefficient >30 %; 63 signatures remaining), we observe a clear pattern of signature clusters (Fig. 4). Interestingly, the mesenchymal signature is positively correlated with the expression of coagulation/complement genes (whose expression in the glioblastoma cells studied here was already verified by [2]). Both signatures are also correlated with the TNF α signaling, which supports previous findings concerning the role of this pathway in mesenchymal emergence [31]. On the other hand, the mesenchymal signature is negatively correlated with a hypoxia signature. While hypoxic regions are characteristic in solid tumors [32], this inverse correlation is surprising and possibly aligned with the up-regulation of angiogenetic markers in mesenchymal glioblastoma tumors [29]. Finally, we see a strong negative correlation with a signature of response to the PPAR γ agonist rosiglitazone, which aligns with previous observations of beneficial effects PPAR γ agonists have in glioblastoma [33–35]. In addition to the inter-donor variation, FastProject’s visualization also highlights potentially important variation within a tumor. Indeed, the cells from patient MGH31 (Fig. 3b, purple) are clearly divided in accordance to the two programs mentioned above - with cells with low mesenchymal and high hypoxic score on one side and the mirror image on the other.
The glioblastoma case study underscores the utility of FastPtroject as a tool for scRNA-Seq data exploration. Starting from a normalized input matrix of gene expression in single cells, and a generic set of signatures, it clearly highlighted some of the main sources for phenotypic variation between cells, and the relation between these sources. Such insights provide an important first step in working with data sets an immense and as complex as the one presented here.
Conclusions
FastProject is a flexible, comprehensive, and automated pipeline that combines multiple techniques for the analysis of scRNA-Seq. It provides a first glance on the main axes of variation in the data (captured by projections of interest) and their biological meaning (the biological signatures that may explain the organization of cells within the projections). The tool was designed with a flexible API in mind, with the aim of establishing a general platform that will be used by the scRNA-Seq research community for deployment and evaluation of future methods, such as normalization, batch correction, removal of undesired effects (cell cycle, drop-outs), gene/cell filtering, and dimensionality reduction.
Methods
False-negative estimates
Let P(Eij∣NotDij) denote the probability that a gene i is expressed in cell j conditioned that it is not detected (i.e. probability for a false negative). To estimate this probability we first estimate the priors for the detection, P(Dij), and expression, P(Eij), events:
For P(Dij) we use the percentage of cells that detect gene i (expression>0), which we denote as . For P(Eij), one approach would be to use:
8 |
where μi is the average of gene i’s log expression when detected and (defined in Eq. 1) represents P(NotDij∣Eij) for each cell. However, in order to get a more robust estimation, we use the population mean of the conditional probability, taking: where N is the number of cells in the dataset. Combining these terms, the full expression is:
9 |
This value is then complemented to obtain P(NotEij|NotDij) (the true negative probability), which is used to weight missed transcripts. Notably, on occasions where the detection rate in a cell is higher than the prior estimate, this formulation results in a negative value. We therefore restrict the estimate to the range [ 0,1] by applying a clipping operation.
Simulating false-negatives
To evaluate our approach, we first applied it on simulated data. Count values for each gene were either set to zero (with probability Pi) or drawn from a negative binomial distribution parameterized by the mean, μi, and the dispersion, . Pi was set to 1.0 for the first 500 genes (considered to be housekeeping genes, and therefore constitutively active) and varied uniformly from 0 to 1 for the remaining genes. μi was generated from an exponential distribution with mean=75 and γ was generated using a uniform distribution from 0 to 4 (reasonable distributions and ranges for μ and γ found by measuring these parameters using non-zero values from the Shalek et al. [1] single-cell data). The generated counts were transformed using log(x+1). In this way, an expression matrix of 3000 genes across 2000 cells was simulated.
After generating ‘true’ expression data, sigmoidal drop-out curves (as defined in Eq. 1) were generated for each cell by drawing α and β from uniform distributions (α∈[ 2,6];β∈[ 1,3]). Using these curves, ‘measured’ expression data was generated by dropping-out non-zero measurements using each cell’s drop-out characteristic and each gene’s non-zero mean. Using FastProject’s weight-estimation procedure, weights were estimated using the ‘measured’ data matrix. For comparison purposes, weights were also generated using the likelihood, P(Detection∣Expression), as described in Gaublomme et al. [7]. The results, detailed in Additional file 2, demonstrate that FastProject’s weighting system produces different weight distributions for true negatives (un-expressed genes) as opposed to false negatives (dropped-out genes), assigning higher weights to the true negatives, as would be expected in an effective weighting procedure.
Clustering
For each of the projections, the FastProject HTML report includes a simple clustering analysis using the cells’ positioning in the respective two dimensional map. Specifically, samples are clustered based on Euclidean distance in the two-dimensional space using k-means with different k values. These clusters are used when rendering a heat-map below the projection showing the (per cluster) average z-score of expression for each gene in the signature.
Evaluating the consistency of projections and categorical pre-computed signatures
A different method is used to evaluate the significance of signature/projection pairings when operating on pre-computed signatures. For numerical pre-computed signatures, the assigned values are shuffled among cells and for each iteration of this procedure, the signature/projection consistency score is evaluated. P-values are then assessed by comparing the unshuffled score against this distribution using a Z-test as above. For factor signatures, it is necessary to calculated the consistency score in a different manner. To this end, for each pair of signature s and projection p we compute a signature consistency score representing the extent to which neighboring points in the projection are assigned to the same category. First, we evaluate a neighborhood-consistency score for every cell:
10 |
where
11 |
In this way, is closer to 1 if most of sample i’s neighbors have the same label. The signature-projection consistency is then determined by a measure of the overall consistency:
12 |
These consistency scores are compared against a distribution of scores calculated from shuffled label assignments to assess significance using a Z-test as above.
Availability and requirements
Project Name: FastProject.Software Repository:https://github.com/yoseflab/fastproject.Software Manual:https://github.com/YosefLab/FastProject/wiki.Archived Versions:https://pypi.python.org/pypi/FastProject.Operating Systems: Platform-Independent.Programming language: Python.License: Free for Educational, Research, and Not-For-Profit purposes. Results from the analysis in this article can by found at https://github.com/YosefLab/FastProject/wiki/Example-Output-Reports.
Acknowledgments
Not applicable.
Funding
DD was funded by a National Institutes of Health NRSA Trainee appointment on grant number T32 and by the California Research Alliance by BASF (CARA). NY was supported by NIH grants U01MH105979 and U01HG007910.
Authors’ contributions
NY and DD conceived and wrote the manuscript. DD implemented the software and analyzed the data. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Abbreviations
- EM
Expectation maximization
- FNC
False-negative curve
- HTML
Hypertext markup language
- ICA
Independent component analysis
- LPS
Lipopolysaccharide
- MAD
Median absolute deviation
- MDS
Multidimensional scaling
- MsigDB
Molecular signatures database
- PCA
Principal component analysis
- RBF
Radial basis function
- scRNA-seq
Single cell RNA-seq
- TPM
Transcripts per million
- t-SNE
t-Distributed stochastic neighbor embedding
Additional files
Contributor Information
David DeTomaso, Email: David.DeTomaso@berkeley.edu.
Nir Yosef, Email: niryosef@berkeley.edu.
References
- 1.Shalek AK, Satija R, Shuga J, Trombetta JJ, Gennert D, Lu D, Chen P, Gertner RS, Gaublomme JT, Yosef N, Schwartz S, Fowler B, Weaver S, Wang J, Wang X, Ding R, Raychowdhury R, Friedman N, Hacohen N, Park H, May AP, Regev A. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature. 2014;509(7505):363–9. doi: 10.1038/nature13437. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, Louis DN, Rozenblatt-Rosen O, Suvà ML, Regev A, Bernstein BE. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science (New York) 2014;344(6190):1396–401. doi: 10.1126/science.1254257. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014;32(4):381–6. doi: 10.1038/nbt.2859. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol. 2015;33(5):495–502. doi: 10.1038/nbt.3192. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33(2):155–60. doi: 10.1038/nbt.3102. [DOI] [PubMed] [Google Scholar]
- 6.Amir E-aD, Davis KL, Tadmor MD, Simonds EF, Levine JH, Bendall SC, Shenfeld DK, Krishnaswamy S, Nolan GP, Pe’er D. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nat Biotechnol. 2013;31(6):545–2. doi: 10.1038/nbt.2594. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Gaublomme J, Yosef N, Lee Y, Gertner R, Yang L, Wu C, Pandolfi P, Mak T, Satija R, Shalek A, Kuchroo V, Park H, Regev A. Single-Cell Genomics Unveils Critical Regulators of Th17 Cell Pathogenicity. Cell. 2015;163(6):1400–1412. doi: 10.1016/j.cell.2015.11.009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.van der Maaten L, Hinton G. Visualizing Data using t-SNE. J Mach Learn Res. 2008;9:2579–605. [Google Scholar]
- 9.Kafri R, Levy J, Ginzberg MB, Oh S, Lahav G, Kirschner MW. Dynamics extracted from fixed cells reveal feedback linking cell growth to cell cycle. Nature. 2013;494(7438):480–3. doi: 10.1038/nature11897. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Ramsköld D, Luo S, Wang YC, Li R, Deng Q, Faridani OR, Daniels GA, Khrebtukova I, Loring JF, Laurent LC, Schroth GP, Sandberg R. Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells. Nat Biotechnol. 2012;30(8):777–82. doi: 10.1038/nbt.2282. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Brennecke P, Anders S, Kim JK, Kołodziejczyk AA, Zhang X, Proserpio V, Baying B, Benes V, Teichmann SA, Marioni JC, Heisler MG. Accounting for technical noise in single-cell RNA-seq experiments. Nat Methods. 2013;10(11):1093–5. doi: 10.1038/nmeth.2645. [DOI] [PubMed] [Google Scholar]
- 12.Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005;102(43):15545–50. doi: 10.1073/pnas.0506580102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Fan J, Salathia N, Liu R, Kaeser GE, Yung YC, Herman JL, Kaper F, Fan JB, Zhang K, Chun J, Kharchenko PV. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nat Methods. 2016;13(3):241–4. doi: 10.1038/nmeth.3734. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Marinov GK, Williams BA, McCue K, Schroth GP, Gertz J, Myers RM, Wold BJ. From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Res. 2014;24(3):496–510. doi: 10.1101/gr.161034.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Deng Q, Ramsköld D, Reinius B, Sandberg R. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science (New York) 2014;343(6167):193–6. doi: 10.1126/science.1245316. [DOI] [PubMed] [Google Scholar]
- 16.Eisenberg E, Levanon EY. Human housekeeping genes are compact. Trends Genet (TIG) 2003;19(7):362–5. doi: 10.1016/S0168-9525(03)00140-9. [DOI] [PubMed] [Google Scholar]
- 17.Shalek AK, Satija R, Adiconis X, Gertner RS, Gaublomme JT, Raychowdhury R, Schwartz S, Yosef N, Malboeuf C, Lu D, Trombetta JJ, Gennert D, Gnirke A, Goren A, Hacohen N, Levin JZ, Park H, Regev A. Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature. 2013;498(7453):236–40. doi: 10.1038/nature12172. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Munsky B, Neuert G, van Oudenaarden A. Using gene expression noise to understand gene regulation. Science (New York) 2012;336(6078):183–7. doi: 10.1126/science.1216379. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Macosko E, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas A, Kamitaki N, Martersteck E, Trombetta J, Weitz D, Sanes J, Shalek A, Regev A, McCarroll S. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell. 2015;161(5):1202–14. doi: 10.1016/j.cell.2015.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Tenenbaum JB, de Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science (New York) 2000;290(5500):2319–323. doi: 10.1126/science.290.5500.2319. [DOI] [PubMed] [Google Scholar]
- 21.Campbell K, Ponting CP, Webber C. Laplacian eigenmaps and principal curves for high resolution pseudotemporal ordering of single-cell RNA-seq profiles. Technical report. 2015. doi:10.1101/027219. http://dx.doi.org/10.1101/027219.
- 22.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: Machine Learning in {P}ython. J Mach Learn Res. 2011;12:2825–830. [Google Scholar]
- 23.Buja A, Eyuboglu N. Remarks on Parallel Analysis. Multivar Behav Res. 1992;27(4):509–40. doi: 10.1207/s15327906mbr2704_2. [DOI] [PubMed] [Google Scholar]
- 24.Kandasamy K, Mohan SS, Raju R, Keerthikumar S, Kumar GSS, Venugopal AK, Telikicherla D, Navarro JD, Mathivanan S, Pecquet C, Gollapudi SK, Tattikota SG, Mohan S, Padhukasahasram H, Subbannayya Y, Goel R, Jacob HKC, Zhong J, Sekhar R, Nanjappa V, Balakrishnan L, Subbaiah R, Ramachandra YL, Rahiman BA, Prasad TSK, Lin JX, Houtman JCD, Desiderio S, Renauld JC, Constantinescu SN, Ohara O, Hirano T, Kubo M, Singh S, Khatri P, Draghici S, Bader GD, Sander C, Leonard WJ, Pandey A. NetPath: a public resource of curated signal transduction pathways. Genome Biol. 2010;11(1):3. doi: 10.1186/gb-2010-11-1-r3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25(1):25–9. doi: 10.1038/75556. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Kanehisa M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28(1):27–30. doi: 10.1093/nar/28.1.27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 2014;42(Database issue):199–205. doi: 10.1093/nar/gkt1076. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Pierson E, Yau C. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol. 2015;16(1):241. doi: 10.1186/s13059-015-0805-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Phillips HS, Kharbanda S, Chen R, Forrest WF, Soriano RH, Wu TD, Misra A, Nigro JM, Colman H, Soroceanu L, Williams PM, Modrusan Z, Feuerstein BG, Aldape K. Molecular subclasses of high-grade glioma predict prognosis, delineate a pattern of disease progression, and resemble stages in neurogenesis. Cancer Cell. 2006;9(3):157–73. doi: 10.1016/j.ccr.2006.02.019. [DOI] [PubMed] [Google Scholar]
- 30.Verhaak RGW, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O’Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, Getz G, Perou CM, Hayes DN. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010;17(1):98–110. doi: 10.1016/j.ccr.2009.12.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Bhat KPL, Balasubramaniyan V, Vaillant B, Ezhilarasan R, Hummelink K, Hollingsworth F, Wani K, Heathcock L, James JD, Goodman LD, Conroy S, Long L, Lelic N, Wang S, Gumin J, Raj D, Kodama Y, Raghunathan A, Olar A, Joshi K, Pelloski CE, Heimberger A, Kim SH, Cahill DP, Rao G, Den Dunnen WFA, Boddeke HWGM, Phillips HS, Nakano I, Lang FF, Colman H, Sulman EP, Aldape K. Mesenchymal differentiation mediated by NF- κB promotes radiation resistance in glioblastoma. Cancer Cell. 2013;24(3):331–46. doi: 10.1016/j.ccr.2013.08.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Brown JM, Wilson WR. Exploiting tumour hypoxia in cancer treatment. Nat Rev Cancer. 2004;4(6):437–7. doi: 10.1038/nrc1367. [DOI] [PubMed] [Google Scholar]
- 33.Panigrahy D, Singer S, Shen LQ, Butterfield CE, Freedman DA, Chen EJ, Moses MA, Kilroy S, Duensing S, Fletcher C, Fletcher JA, Hlatky L, Hahnfeldt P, Folkman J, Kaipainen A. PPARgamma ligands inhibit primary tumor growth and metastasis by inhibiting angiogenesis. J Clin Investig. 2002;110(7):923–32. doi: 10.1172/JCI0215634. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Ellis HP, Kurian KM. Biological Rationale for the Use of PPAR γ Agonists in Glioblastoma. Front Oncol. 2014;4:52. doi: 10.3389/fonc.2014.00052. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Grommes C, Conway DS, Alshekhlee A, Barnholtz-Sloan JS. Inverse association of PPAR γ agonists use and high grade glioma development. J Neuro-Oncol. 2010;100(2):233–9. doi: 10.1007/s11060-010-0185-x. [DOI] [PubMed] [Google Scholar]
- 36.Amit I, Garber M, Chevrier N, Leite AP, Donner Y, Eisenhaure T, Guttman M, Grenier JK, Li W, Zuk O, Schubert LA, Birditt B, Shay T, Goren A, Zhang X, Smith Z, Deering R, McDonald RC, Cabili M, Bernstein BE, Rinn JL, Meissner A, Root DE, Hacohen N, Regev A. Unbiased reconstruction of a mammalian transcriptional network mediating pathogen responses. Science (New York) 2009;326(5950):257–63. doi: 10.1126/science.1179050. [DOI] [PMC free article] [PubMed] [Google Scholar]