Abstract
Identification of genes that reliably mark distinct cell types is key to leveraging single-cell RNA sequencing to better understand organismal biology. Such genes are usually chosen by measurement of differential expression between groups of cells and selecting those with the greatest magnitude or most statistically significant change. Many methods have been developed for performing such analyses, but no single, best method has emerged. Validating the results of these analyses is costly in terms of time, effort and resources. We demonstrate that applying an ensemble of such methods robustly identifies genes that mark cells that cluster together and that show restricted expression assessed by antisense mRNA in situ and immunofluorescence. This technique is easily extensible to any number of differential expression methods and the inclusion of additional methods is expected to result in further improvement in performance.
Keywords: Transcriptomics, Ensemble, Cellular heterogeneity, Marker gene, Cellular identity
Introduction
The development of single-cell sequencing technology has permitted the identification and characterization of striking cellular heterogeneity within tissues and the organs they compose. Ultimately, biologists must discretize the observed variability into a finite number of states that can be subjected to experimental interrogation. A necessary first step is to characterize such states so that they can be identified using available techniques. For the molecular biologist this requires the determination of molecular species whose measurement, either individually or in combination, allows one to identify the state of interest. In the conventional application of single-cell (and single-nucleus) RNA-sequencing, this entails the identification of genes that mark groups of cells that cluster together in high-dimensional expression space.
The utility of a marker gene is determined by the extent to which it satisfies certain desiderata.
1. The gene should be expressed at a detectable level, yet not ubiquitously expressed.
2. Expression of the gene within the compartment of interest should vary over a sufficiently large range to permit detection of differential expression using available molecular techniques.
3. Expression of the gene should be concentrated within the state of interest.
Typically, marker genes are identified using single-cell sequencing by measurement of differential expression between sets of clusters. Such sets can contain a single cluster or the union of several clusters. Methods for measuring differential expression include those initially developed for bulk sequencing and later adapted to single-cell (and single-nucleus) sequencing as well as those designed to account for unique technical and statistical properties of single-cell (and single-nucleus) measurements. No consensus has formed to establish a single best method. Indeed, it is likely that different methods capture important and non-overlapping information about the way in which expression varies between states.
The “wisdom of the crowd” is an expression that captures the notion that the synthesis of multiple non-redundant observations often captures the salient features of a phenomenon more completely than any single observation. In the language of Hong and Page,1 individual members of the crowd possess different perspectives and heuristics for problem solving, which permits a more efficient search of the solution space when the agents work as a team. This concept has been applied to the inference and modeling of gene regulatory networks2 and gene set enrichment testing.3 Here we present the results of applying this approach to the problem of identifying genes that mark cell states, with a particular interest in identifying genes that can be used to spatially distinguish distinct cell types.
Results
Differential expression methods
There are currently many methodologies available to quantify how gene expression varies between experimental groups. We applied four techniques commonly used to measure differential expression between groups of single cells: Welch’s t-test, the Wilcoxon rank-sum test, the binomial test and Model-based Analysis of Single-cell Transcriptomics (MAST).4 We analyzed publicly available expression data from peripheral blood mononuclear cells as this dataset typifies a common use case and the biological heterogeneity of these cells is well-characterized.5 In this study, we consider the different techniques as individual problem-solving agents. Each technique embodies different assumptions about the statistical properties of the expression data and different decision rules about the likelihood that expression of a gene varies significantly between pre-specified conditions; that is, each possesses a different perspective and heuristic for solving the problem of identifying genes that are likely to mark the conditions under consideration. We then compared the performance of each of these individual methods to a novel approach we have developed that generates a community consensus ranking of genes by the likelihood that they are sufficiently enriched to uniquely mark the underlying cell state. This is done by assimilating the individual rankings of each gene across the ensemble of methods. From here on we will refer to this approach as EIGEN (Ensemble Identification of Gene Enrichment).
Arriving at a gold standard for performance evaluation was challenging as most curated gene sets associated with peripheral blood mononuclear cells (PBMCs) were derived from expression data measured with either next-generation sequencing or microarray methods and subjected to differential expression analysis using a variant of one of the techniques under consideration. Hence, gold standard marker sets were constructed by first clustering the cells and then performing weighted gene correlation network analysis (WGCNA)6 to identify modules of genes that were correlated with cluster identity.
To quantitatively compare performance, receiver operating characteristic (ROC) (Figure 1(A)) and precision recall (PR) curves (Figure 2(A)) were constructed and the areas under these curves, AUROC and AUPR, were calculated for each method applied to each cluster (Figure 1(B) and Figure 2(B), respectively). An overall score for each method was calculated reflecting the performance of the method across clusters (Figure 3).
Figure 1.
Performance assessment for each individual method, EIGEN and a random ranking of genes depicted as receiver operating characteristic curves (ROC) (A) and measured as the area under the receiver operating characteristic curves (AUROC) (B).
Figure 2.
Performance assessment for each individual method, EIGEN and a random ranking of genes depicted as precision-recall curves (PR) (A) and measured as the area under the precision-recall curves (AUPR) (B).
Figure 3.
Overall performance of each individual method, EIGEN and a random ranking of genes.
The community consensus of all methods was the best performer for 11 of 12 clusters when assessed by AUROC (Figure 1(B)) and 7 of 12 clusters when assessed by AUPR (Figure 2(B)). The overall superior performance of EIGEN is reflected in the community consensus of all methods obtaining the highest combined score (Figure 3).
Experimental support of identified markers
The findings above suggest that EIGEN represents a superior method to identify genes that robustly and reliably distinguish the different cell states that are represented in single cell expression data. However, the true test of the reliability of the technique is the ability to identify genes that represent biological variability that is measurable using experimental techniques standardly used to identify cell state differences in the field. In organismal biology, the gold standard techniques are antisense mRNA in situ hybridization and immunohistochemistry/immunofluorescence. Although these techniques are less quantitative than transcriptomics, quantitative reverse transcription PCR and Western blotting, they provide spatial information that these other tools lack.
In our previous work, we characterized heterogeneity of cell states within the kidney interstitium using single cell RNA-seq.7 MAST was used to quantify differences in expression of genes in a given cluster compared to the union of all other clusters to identify candidate markers of spatially distinct cell states. We then performed antisense in situ hybridization and immunofluorescence (when possible) on the most enriched candidates for each cluster (Figure 4). Multiple candidates for each cluster had to be tested prior to identifying unique markers of each cell state. Such spatially unique identifiers are frequently referred to in the developmental biology field as “anchor genes”. We next compared how EIGEN would have performed, relative to the other methods (including our previous utilization of MAST), in identifying the validated “anchor genes” (Table 1). This post hoc analysis shows that EIGEN would have ranked the validated marker at least as highly as every other method in 9 of the 13 cases in which we have identified an “anchor gene”. It was near optimal in the remaining four cases.
Figure 4.
Colorimetric antisense in situ mRNA hybridization images for validated marker genes (a) Ntn1, (b) Gdnf, (d) Clca3a1, (e) Cldn11, (h) Wnt4, (i) Lef1, (j) Fbln2, (k) Reln, (l) Igf1 and (m) Dlk1. Immunofluorescence images for validated marker genes (c) Pdgfrb, (f) Tagln and (g) Snai2.
Table 1.
Validated marker genes for clusters of E18.5 mouse renal interstitial cells together with the ranking assigned to each gene for the given cluster by the individual methods and EIGEN. The ranking assigned using MAST as in England et al.7 is included for comparison.
| Cluster | Gene | Binomial rank | MAST (original) rank | MAST rank | t rank | Wilcox rank | EIGEN rank |
|---|---|---|---|---|---|---|---|
| 1 | Ntn1 | 2 | 7 | 5 | 2 | 3 | 1 |
| 3 | Gdnf | 11 | 14 | 9 | 7 | 21 | 4 |
| 6 | Pdgfrb | 101 | 39 | 169 | 5 | 6 | 7 |
| 7 | Clca3a1 | 1 | 8 | 83 | 1 | 2 | 1 |
| 8 | Cldn11 | 33 | 30 | 110 | 1 | 1 | 3 |
| 9 | Tagln | 79 | 2 | 2 | 9 | 3 | 2 |
| 10 | Snai2 | 135 | 7 | 100 | 49 | 80 | 24 |
| 11 | Wnt4 | 3 | 3 | 2 | 8 | 9 | 1 |
| 12 | Lef1 | 2 | 23 | 1 | 3 | 4 | 1 |
| 13 | Fbln2 | 555 | 47 | 41 | 2073 | 54 | 18 |
| 14 | Reln | 120 | 30 | 18 | 26 | 16 | 12 |
| 15 | Igf1 | 25 | 16 | 24 | 25 | 20 | 10 |
| 16 | Dlk1 | 912 | 1 | 42 | 14 | 11 | 19 |
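
The comparison summarized in Table 1 can be checked with a short tally. The sketch below copies the ranks from the table and counts the clusters in which EIGEN's rank is at least as good as that of every other method. The names `TABLE_1` and `eigen_best_or_tied` are introduced here for illustration only.

```python
# Ranks copied from Table 1, in column order:
# (binomial, MAST original, MAST, t, Wilcoxon, EIGEN).
TABLE_1 = {
    "Ntn1":    (2, 7, 5, 2, 3, 1),
    "Gdnf":    (11, 14, 9, 7, 21, 4),
    "Pdgfrb":  (101, 39, 169, 5, 6, 7),
    "Clca3a1": (1, 8, 83, 1, 2, 1),
    "Cldn11":  (33, 30, 110, 1, 1, 3),
    "Tagln":   (79, 2, 2, 9, 3, 2),
    "Snai2":   (135, 7, 100, 49, 80, 24),
    "Wnt4":    (3, 3, 2, 8, 9, 1),
    "Lef1":    (2, 23, 1, 3, 4, 1),
    "Fbln2":   (555, 47, 41, 2073, 54, 18),
    "Reln":    (120, 30, 18, 26, 16, 12),
    "Igf1":    (25, 16, 24, 25, 20, 10),
    "Dlk1":    (912, 1, 42, 14, 11, 19),
}


def eigen_best_or_tied(table):
    """Genes whose EIGEN rank is no worse than the best rank from any other method."""
    return [gene for gene, ranks in table.items() if ranks[-1] <= min(ranks[:-1])]


best = eigen_best_or_tied(TABLE_1)
worst_eigen_rank = max(ranks[-1] for ranks in TABLE_1.values())
print(len(best), "of", len(TABLE_1), "| worst EIGEN rank:", worst_eigen_rank)
```

Ties count in EIGEN's favor here, which is the reading under which the tally agrees with the 9 of 13 cases reported in the text.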
Discussion
Single cell (nucleus) RNA sequencing technology has profoundly altered our ability to identify and study unique cell states among groups of cells, either in a petri-dish or a living tissue. The desired goal of these types of studies differs depending on the line of inquiry. While computational biologists may desire a set of markers that can be combined to quantitatively distinguish a cell state represented as a high dimensional vector in silico, organismal biologists frequently desire individual markers that can spatially distinguish cell states in situ. Identifying unique cell state identifiers for in situ analysis has proven difficult. The ideal gene/gene product will be expressed uniquely or at greatly enriched levels in a distinct cell state and at levels that are detectable using standard techniques such as antisense in situ hybridization or immunostaining. The issue with single cell technologies is that differentially expressed genes are frequently expressed ubiquitously but at different levels in different cell states or they are expressed uniquely in specific cell states but at levels that are not detectable using spatial analysis tools. Ideally, one would rapidly be able to identify genes that fit the “Goldilocks principle” of being expressed cell state specifically and at levels that are not too high but not too low. Such a technique is currently lacking.
In this study, we implemented an algorithm that integrates an ensemble of functionally diverse methods for identification of genes that robustly and reliably mark different cell states with an emphasis on identifying genes that can be used with in situ tissue analysis techniques. We show that this ensemble technique, which we refer to as EIGEN, outperforms each individual technique on both a reference standard and experimentally validated data sets. EIGEN scored validated anchor genes among the top 25 in all experimentally validated cases. This is of note because it suggests that this technique will allow users to identify anchor genes more efficiently than any single technique. It is important to observe that we have not attempted to validate any of the genes that were scored higher by EIGEN than those we had previously validated. It is possible that several, if not all, of the higher scoring candidates will show meaningful restricted expression as is the case for our validated anchor genes. Such a result would allow future investigators to rapidly and efficiently identify robust marker genes (Goldilocks genes), minimizing wasted time, resources and effort. We recognize that many differential expression methods have been developed that were not considered here for simplicity. Importantly, our algorithm was designed to be extensible so that any such method can be readily included. Indeed, studies by Marbach et al.2 and Hong and Page1 suggest that the performance of the ensemble will improve as additional diverse approaches are integrated.
The synthesis of results from application of an ensemble of related methods to the solution of a given problem has frequently been shown to outperform the individual methods. While a mathematically rigorous explanation for this phenomenon is not immediately apparent, Marbach et al.2 argued that each method’s ability to identify true positives more often than would be expected by chance biases the rankings of true positives toward the top of the list and true negatives away from the top of the list. Consequently, the rankings of true positives and true negatives are distributed separately. If the ranks are assigned independently across methods, then the central limit theorem implies that the average of the ranks of true positives (and true negatives) across methods tends toward a normal distribution whose variance shrinks as the number of integrated methods increases. As the variance shrinks, the distributions of average ranks of true positives and true negatives become more segregated. Consequently, the probabilities that true positives are ranked higher and conversely that true negatives are ranked lower both increase.
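The variance argument can be written out as a short derivation. This is a sketch under the independence assumption stated above, with the further simplification that the ranks a gene receives from different methods share a common mean and variance. The symbols $M$ (number of methods), $r_m$ (rank from method $m$), $\mu_{\pm}$ and $\sigma_{\pm}$ (mean and standard deviation of the rank of a true positive or true negative) are introduced here for illustration.

```latex
\[
\bar{r}_{\pm} = \frac{1}{M}\sum_{m=1}^{M} r_{m}, \qquad
\mathbb{E}\left[\bar{r}_{\pm}\right] = \mu_{\pm}, \qquad
\operatorname{Var}\left[\bar{r}_{\pm}\right] = \frac{\sigma_{\pm}^{2}}{M}
\]
\[
\frac{\mu_{-} - \mu_{+}}
     {\sqrt{\operatorname{Var}\left[\bar{r}_{+}\right] + \operatorname{Var}\left[\bar{r}_{-}\right]}}
= \sqrt{M}\,\frac{\mu_{-} - \mu_{+}}{\sqrt{\sigma_{+}^{2} + \sigma_{-}^{2}}}
\]
```

With rank 1 at the top of the list, $\mu_{+} < \mu_{-}$, so the standardized gap between the average ranks of true positives and true negatives grows in proportion to $\sqrt{M}$ under these assumptions.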
Hong and Page1 approached the problem differently, articulating their framework in the language of machine learning. They modeled groups of functionally diverse problem-solving agents equipped with different internal representations of problems and algorithms used to locate solutions and were able to demonstrate conditions under which the collective performance of a random group of agents outperforms a collection of the individually best agents. This outcome was found to depend on the diversity of the agents, i.e., the ability of individual agents to improve on the collective solution by using a different approach to the problem. If, in addition to diversity, one assumes that no single agent can always find the optimum but that a best performing agent exists, then with probability 1 a group of randomly selected agents will outperform the group of best performers.
Both aforementioned analyses informed our analysis of the results presented here with an eye toward the desiderata introduced earlier. For example, the binomial test is parameterized by a threshold that determines which cells express the gene so that the proportions of cells with expression above this threshold are compared (Supplementary figure 1). This embodies a unique internal representation of the problem that confers some degree of independence from the other methods. Further, this formulation effectively ensures satisfaction of desideratum 1. To the extent that the specified parameter corresponds to a threshold of detection, genes ranked more highly by the binomial test should be more likely to be detectably expressed. Similarly, ubiquitous expression of a gene would result in a less significant result by the binomial test since almost all cells in both clusters would express the gene at a level above the specified threshold and so such genes would be ranked lower. Alternatively, because the Mann-Whitney U statistic is directly related to the area under the distribution curve, a more extreme value of this statistic implies greater separation of the distribution of expression in the clusters tested and so greater concentration of expression (Supplementary figure 2).
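The behavior of a thresholded binomial test can be illustrated with a minimal sketch. It assumes one simple formulation: among all cells expressing the gene above the threshold, the number that fall in the first cluster is compared with a binomial distribution whose success probability is that cluster's share of the cells. This is not claimed to reproduce the exact procedure used in the analysis, and the function names and toy expression values are hypothetical.

```python
from math import comb


def count_expressing(values, threshold):
    """Number of cells whose expression exceeds the detection threshold."""
    return sum(v > threshold for v in values)


def binomial_two_sided_p(k, n, p):
    """Exact two-sided binomial p-value: total probability of outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def binomial_marker_p(cluster_a, cluster_b, threshold):
    """Compare the proportions of expressing cells in two clusters."""
    k_a = count_expressing(cluster_a, threshold)
    k_b = count_expressing(cluster_b, threshold)
    if k_a + k_b == 0:
        return 1.0  # no cell expresses the gene, so there is nothing to compare
    share_a = len(cluster_a) / (len(cluster_a) + len(cluster_b))
    return binomial_two_sided_p(k_a, k_a + k_b, share_a)


# Ubiquitous gene: every cell is above threshold, although levels differ fivefold.
ubiquitous = binomial_marker_p([5.0] * 20, [1.0] * 20, threshold=0.5)
# Restricted gene: 18 of 20 cells express in one cluster, 1 of 20 in the other.
restricted = binomial_marker_p([5.0] * 18 + [0.0] * 2, [0.0] * 19 + [5.0], threshold=0.5)
print(ubiquitous, restricted)
```

The ubiquitous gene yields a p-value near 1 despite its difference in expression level, while the restricted gene yields a very small one, which is the ranking behavior described above.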
Materials and Methods
Expression data
A subset of the peripheral blood mononuclear cell data from Zheng et al.5 curated in the DropletTestFiles8 Bioconductor package was analyzed. Cells were called from empty droplets by testing for deviation of the expression profile for each cell from the ambient RNA pool.9 Cells with large mitochondrial proportions, i.e., more than 3 median absolute deviations away from the median, were removed. Cells were pre-clustered, a deconvolution method was applied to compute size factors for all cells10 and normalized log-expression values were calculated. Variance was partitioned into technical and biological components by assuming technical noise was Poisson-distributed and attributing any estimated variance in excess of that accounted for by a fitted Poisson trend to biological variation. The dimensionality of the data set was reduced by performing principal component analysis and discarding the later principal components for which the variance explained was less than the variance attributable to technical noise. Cells were clustered by building a shared nearest neighbor graph11 and executing the Walktrap algorithm.12
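The mitochondrial filter can be sketched as follows, assuming the deviation in question is the median absolute deviation (MAD) and that only unusually high proportions are removed. The scaling constant 1.4826 is the default used by R's `mad` function. The function name and the toy proportions are hypothetical.

```python
from statistics import median


def high_outlier_flags(values, n_mads=3.0, scale=1.4826):
    """Flag values lying more than n_mads scaled MADs above the median."""
    center = median(values)
    mad = scale * median([abs(v - center) for v in values])
    cutoff = center + n_mads * mad
    return [v > cutoff for v in values]


# Toy per-cell mitochondrial read fractions; the last cell is clearly aberrant.
mito_fraction = [0.02, 0.03, 0.025, 0.04, 0.035, 0.03, 0.50]
flags = high_outlier_flags(mito_fraction)
kept = [v for v, flagged in zip(mito_fraction, flags) if not flagged]
print(flags, len(kept))
```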
Differential expression
Differential expression using the t-test, binomial test and Wilcoxon rank-sum test was performed using the pairwiseTTests, pairwiseBinom and pairwiseWilcox methods from the scran13 Bioconductor package. Zero-inflated regression using a hurdle model was performed by supplying a model formula with cluster assignment as the only predictor and no intercept term to the zlm method of the MAST Bioconductor package. The log-fold change of differential expression between each pair of clusters was calculated by supplying appropriately constructed contrasts to the getLogFC method from MAST. The resulting combined z-scores were used to calculate p-values for the comparisons.
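The final conversion step can be sketched under two assumptions: that the scores are z-scores formed by dividing a log-fold change by its standard error, and that the p-values are two-sided under a standard normal reference. How the scores were combined is not specified here, so the sketch covers only the conversion. The function names are hypothetical.

```python
from math import erfc, sqrt


def z_score(log_fc, var_log_fc):
    """Standardize a log-fold change by its estimated standard error."""
    return log_fc / sqrt(var_log_fc)


def two_sided_p(z):
    """P(|Z| >= |z|) for a standard normal Z, i.e. 2 * (1 - Phi(|z|))."""
    return erfc(abs(z) / sqrt(2.0))


z = z_score(log_fc=2.0, var_log_fc=4.0)
print(z, two_sided_p(z))
```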
Integration of rankings
For each method and each cluster, differential expression between that cluster and every other cluster was calculated. Given $N$ clusters, $(N - 1) \times 4$ comparisons were made for each cluster, one for each combination of comparator cluster and differential expression measurement method. For each comparison, genes were ranked by p-value; ties were broken by order of appearance, so that tied genes received consecutive, increasing ranks. A ranking of genes for each technique-cluster pair was created by combining the results of the comparisons using the Borda count election method.14 Similarly, the community consensus of the ensemble of methods was generated by combining the rankings for each technique-cluster pair by the Borda count election method. This procedure was implemented by adapting the source code from the combineMarkers method from scran.
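A Borda count over ranked gene lists can be sketched as follows. This is a minimal illustration rather than the adapted scran code. Each ranking awards a gene one point for every gene placed below it, and genes are then ordered by total points. The alphabetical tie-break and the gene names are arbitrary choices made for this example.

```python
def borda_points(rankings):
    """Total Borda points per gene from rankings given best-first over the same genes."""
    genes = set(rankings[0])
    n = len(genes)
    points = {gene: 0 for gene in genes}
    for ranking in rankings:
        if set(ranking) != genes or len(ranking) != n:
            raise ValueError("every ranking must order the same set of genes")
        for position, gene in enumerate(ranking):
            points[gene] += n - 1 - position  # one point per gene ranked below
    return points


def borda_consensus(rankings):
    """Consensus ordering: most points first, ties broken alphabetically."""
    points = borda_points(rankings)
    return sorted(points, key=lambda gene: (-points[gene], gene))


# Hypothetical rankings of four genes for one cluster by four methods.
by_method = [
    ["geneA", "geneB", "geneC", "geneD"],  # t-test
    ["geneB", "geneA", "geneC", "geneD"],  # Wilcoxon rank-sum
    ["geneA", "geneC", "geneB", "geneD"],  # binomial
    ["geneA", "geneB", "geneD", "geneC"],  # MAST
]
points = borda_points(by_method)
consensus = borda_consensus(by_method)
print(points, consensus)
```

The same function could be applied at both levels described above: first across pairwise comparisons within a method, then across methods.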
Gold standards
We performed WGCNA to identify modules of highly correlated genes that were related to cluster assignment. For each cluster in the dataset, the most highly correlated gene module was designated as the gold standard marker set for that cluster. While the WGCNA algorithm operates on the same covariance structure as that manipulated by the various differential expression techniques, it was reasoned that the way in which the information was used was sufficiently different as to not bias the construction of a gold standard gene set that favored one differential expression method over another.
Performance metrics
Receiver operating characteristic (ROC) curves are generated by plotting the true positive rate (TPR) versus the false positive rate (FPR) as an implicit function of a threshold t. In this case, t is interpreted as the position along the list of ranked genes. A system that perfectly separates positives from negatives would rank every gold standard gene above every other gene, so that all positives are recovered at a threshold equal to the length of the gold standard list. In this case, as t increases the ROC curve would move vertically to the top left corner and then horizontally to the top right corner, coinciding with the lines FPR = 0 and TPR = 1 and resulting in an area under the receiver operating characteristic curve (AUROC) of 1. Hence, the AUROC measured for a system under consideration is interpreted as the fraction of this ideal performance achieved by the system.
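The construction can be sketched for a ranked gene list. The sketch assumes every gold standard gene appears somewhere in the ranking and integrates the curve with the trapezoidal rule. The function names and the five-gene example are hypothetical.

```python
def roc_points(ranked_genes, gold_standard):
    """(FPR, TPR) after truncating the ranked list at t = 0, 1, ..., len(ranked_genes)."""
    gold = set(gold_standard)
    n_pos = sum(gene in gold for gene in ranked_genes)
    n_neg = len(ranked_genes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative gene")
    tp = fp = 0
    points = [(0.0, 0.0)]
    for gene in ranked_genes:
        if gene in gold:
            tp += 1  # curve moves up
        else:
            fp += 1  # curve moves right
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points in order."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


ranked = ["g1", "g2", "g3", "g4", "g5"]
perfect = trapezoid_area(roc_points(ranked, {"g1", "g2"}))
mixed = trapezoid_area(roc_points(ranked, {"g1", "g3"}))
print(perfect, mixed)
```

In the mixed case, five of the six positive-negative gene pairs are ordered correctly, so the area is 5/6.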
Because all genes are ranked, the number of true negatives is a parameter of the algorithm and not part of the problem definition. Hence, a measure that only considers positives may be more informative. This is accomplished by plotting precision versus recall as an implicit function of position along the ranked list. Here precision is defined as the ratio of true positives (TP) to the sum of true positives and false positives (FP). At a given position $t$ along the ranked gene list, $\mathrm{Precision}(t) = \mathrm{TP}(t) / (\mathrm{TP}(t) + \mathrm{FP}(t)) = \mathrm{TP}(t) / t$. Recall is defined as the ratio of true positives to the sum of true positives and false negatives (FN). If the gold standard gene set is of cardinality $G$, then $\mathrm{Recall}(t) = \mathrm{TP}(t) / (\mathrm{TP}(t) + \mathrm{FN}(t)) = \mathrm{TP}(t) / G$. In this case, an optimally performing system would move along the line Precision = 1 from Recall of 0 to Recall of 1 and then along the line Recall = 1 from Precision of 1 to Precision of 0, resulting in an area under the precision-recall curve (AUPR) of 1.
We assessed performance by proceeding as in Marbach et al.2 Each differential expression method was considered a binary classifier system with the discrimination threshold specified by the index where the ranked list was truncated. A gene was considered a true positive if it appeared in the truncated ranked list and was a member of the gold standard gene set. If a gene appeared in the truncated ranked list and was not a member of the gold standard, then it was designated a false positive. The difference of the gold standard and the truncated ranked list constituted false negatives. Finally, the set of true negatives included all those genes ranked lower than the truncation index that were not members of the gold standard. With the foregoing definitions, receiver operating characteristic and precision-recall curves were generated for each combination of differential expression method and cluster and the area under these curves was calculated.
Empirical p-values for the obtained AUROC and AUPR values were calculated by numerically integrating null distribution functions obtained by fitting stretched exponentials to the histograms of AUROC and AUPR values for 25,000 random rankings of the input genes. Separate functions were fit for the half-line to the left and right of the mode as in Prill et al.15 A combined score was obtained for each method by integrating the above measures of performance.
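Precision and recall along a ranked list can be sketched directly from the definitions above, assuming every gold standard gene appears in the ranking. The area is summarized here by average precision, a common step-wise approximation to the AUPR that is not necessarily the integration scheme used in this study. The function names are hypothetical.

```python
def precision_recall_curve(ranked_genes, gold_standard):
    """(recall, precision) at each truncation position t = 1, ..., len(ranked_genes)."""
    gold = set(gold_standard)
    if not gold:
        raise ValueError("gold standard must not be empty")
    tp = 0
    curve = []
    for t, gene in enumerate(ranked_genes, start=1):
        tp += gene in gold
        curve.append((tp / len(gold), tp / t))  # TP(t)/G and TP(t)/t
    return curve


def average_precision(ranked_genes, gold_standard):
    """Mean of the precision values at the positions where a gold standard gene is recovered."""
    gold = set(gold_standard)
    curve = precision_recall_curve(ranked_genes, gold)
    hits = [precision for (_, precision), gene in zip(curve, ranked_genes) if gene in gold]
    return sum(hits) / len(gold)


ranked = ["g1", "g2", "g3", "g4", "g5"]
curve = precision_recall_curve(ranked, {"g1", "g3"})
ap_mixed = average_precision(ranked, {"g1", "g3"})
ap_perfect = average_precision(ranked, {"g1", "g2"})
print(curve, ap_mixed, ap_perfect)
```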
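The idea of a null distribution built from random rankings can be sketched in simplified form. The sketch skips the stretched-exponential fit and instead reports the fraction of random rankings that score at least as well as the observed one, with a plus-one correction. It uses far fewer rankings than the 25,000 described above, and the gene names, counts and seed are hypothetical.

```python
import random


def auroc(ranked_genes, gold):
    """Fraction of (positive, negative) gene pairs in which the positive is ranked higher."""
    n_pos = sum(gene in gold for gene in ranked_genes)
    n_neg = len(ranked_genes) - n_pos
    negatives_seen = 0
    concordant = 0
    for gene in ranked_genes:
        if gene in gold:
            concordant += n_neg - negatives_seen  # negatives ranked below this positive
        else:
            negatives_seen += 1
    return concordant / (n_pos * n_neg)


def null_aurocs(genes, gold, n_rankings, seed=0):
    """AUROC values for uniformly random rankings of the same genes."""
    rng = random.Random(seed)
    shuffled = list(genes)
    values = []
    for _ in range(n_rankings):
        rng.shuffle(shuffled)
        values.append(auroc(shuffled, gold))
    return values


def empirical_p_value(observed, null_values):
    """Share of null values at least as large as the observed one, with a +1 correction."""
    exceed = sum(value >= observed for value in null_values)
    return (exceed + 1) / (len(null_values) + 1)


genes = [f"gene{i}" for i in range(50)]
gold = set(genes[:5])
null = null_aurocs(genes, gold, n_rankings=2000)
p_perfect = empirical_p_value(auroc(genes, gold), null)
print(sum(null) / len(null), p_perfect)
```

Fitting a smooth function to the null histogram, as described above, allows p-values smaller than the reciprocal of the number of random rankings to be estimated, which the direct count cannot provide.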
In situ hybridization
For section in situ hybridization, kidneys isolated at specific stages were fixed overnight in 4 % paraformaldehyde (in PBS) at 4 °C and cryopreserved in 30 % sucrose. Tissues were frozen in OCT (Tissue Tek) and sectioned at 10 μm. Sections were subjected to in situ hybridization as previously described.16 Antisense RNA probes were linearized and transcribed as previously described. For genes with no available plasmids, single-stranded DNA for each gene was purchased with the T7 RNA polymerase binding site in the reverse orientation added to the 3′ end of the gene sequence. These probes were made by transcribing the single-stranded DNA gBlocks with T7 polymerase.
Immunofluorescence (IF) on paraffin and/or OCT sections
Embryonic tissue was fixed in 4 % paraformaldehyde, and either embedded in OCT media and cryosectioned to 10 μm slices, or embedded in paraffin and sectioned into 5 μm slices, and subjected to IF. Slides for IF were immersed and boiled with either 10 mM sodium citrate or TE antigen retrieval buffer and blocked with a solution of 5 % FBS/PBS for 1 h at room temperature followed by the application of primary antibodies diluted in blocking solution.
Acknowledgements
Work in the Carroll lab was supported by NIH grants R01DK095057, R01DK080004, R01DK106743, R24DK090127 and RC2DK125960 to TJC. This work was supported by the UT Southwestern George O’Brien Kidney Research Core DK079328.
Footnotes
CRediT authorship contribution statement
Christopher P. Chaney: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Visualization, Writing – review & editing. Keri A. Drake: Investigation, Visualization. Thomas J. Carroll: Conceptualization, Resources, Writing – original draft, Writing – review & editing, Supervision, Project administration, Funding acquisition.
DECLARATION OF COMPETING INTEREST
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendix A. Supplementary Data
Supplementary data to this article can be found online at https://doi.org/10.1016/j.jmb.2022.167754.
DATA AVAILABILITY
Data will be made available on request.
References
- 1. Hong L, Page SE (2004). Groups of diverse problem solvers can outperform groups of high-ability problem solvers. In: Proceedings of the National Academy of Sciences. National Academy of Sciences, pp. 16385–16389.
- 2. Marbach D, et al. (2012). Wisdom of crowds for robust gene network inference. Nature Methods, 796–804.
- 3. Alhamdoosh M, et al. (2017). Easy and efficient ensemble gene set testing with EGSEA. F1000Research 6 (2010).
- 4. Finak G, et al. (2015). MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biol 16 (1), 1–13.
- 5. Zheng GXY, et al. (2017). Massively parallel digital transcriptional profiling of single cells. Nature Commun 8 (1), 1–12.
- 6. Langfelder P, Horvath S (2008). WGCNA: an R package for weighted correlation network analysis. BMC Bioinf, 559.
- 7. England AR, et al. (2020). Identification and characterization of cellular heterogeneity within the developing renal interstitium. Development (Cambridge, England). NLM.
- 8. Lun A (2021). DropletTestFiles: Test Files for Single-Cell Droplet Utilities.
- 9. Lun ATL, et al. (2019). EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol 20 (1), 1–9.
- 10. Lun ATL, Bach K, Marioni JC (2016). Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biology. BioMed Central Ltd.
- 11. Xu C, Su Z (2015). Identification of cell types from single-cell transcriptomes using a novel clustering method. Bioinformatics 31 (12), 1974–1980.
- 12. Pons P, Latapy M (2006). Computing communities in large networks using random walks. J Graph Algorithms Appl. Citeseer.
- 13. Lun AT, McCarthy DJ, Marioni JC (2016). A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research 5.
- 14. Borda J-C de (1781). Mémoire sur les élections au scrutin. Histoire Acad R Sci.
- 15. Prill RJ, et al. (2010). Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS ONE 5 (2).
- 16. Karner CM, et al. (2011). Canonical Wnt9b signaling balances progenitor cell expansion and differentiation during kidney development. Development 138 (7), 1247–1257.