Summary
Markers are increasingly being used for several high-throughput data analysis and experimental design tasks. Examples include the use of markers for assigning cell types in scRNA-seq studies, for deconvolving bulk gene expression data, and for selecting marker proteins in single-cell spatial proteomics studies. Most marker selection methods focus on differential expression (DE) analysis. Although such methods work well for data with a few non-overlapping marker sets, they are not appropriate for large atlas-size datasets where several cell types and tissues are considered. To address this, we define the phenotype cover (PC) problem for marker selection and present algorithms that can improve the discriminative power of marker sets. Analysis of these sets on several marker-selection tasks suggests that these methods can lead to solutions that accurately distinguish different phenotypes in the data.
Keywords: marker discovery, scRNA-seq, set cover, multiset multicover, gene sets, biomarker, phenotype cover, algorithm, cross-entropy method
Highlights
• Marker selection is critical to define cell types in large atlas studies
• We present phenotype cover as a conceptual formulation for the marker-selection problem
• Our algorithms for phenotype cover demonstrate high discriminatory power
Motivation
To date, marker selection is based on methods that focus on each cell type separately and do not consider the relationship between different types. Such methods can select overlapping marker sets for different cell types, making it hard to discriminate between similar cell types. To address this issue and to improve the ability to select a discriminating set of markers, we defined an optimization function for biomarker selection that takes the overlap into account.
Hasanaj et al. propose a marker-selection strategy for improving specificity and selectivity of atlas-scale cell-type assignments. They define an optimization problem for selecting a minimal set of such markers that covers all types. An analysis of the proposed approximation algorithms suggests that these marker sets have high discriminatory power.
Introduction
Several international efforts focus on characterizing gene expression in different tissues, organs, disease states, and more. Examples include HuBMAP, a large NIH effort to reconstruct a three-dimensional (3D) map of the human body at the single-cell resolution,1 the Human Cell Atlas,2,3 the Cancer Cell Atlas,4 and the Brain Atlas.5 One of the first steps of studies at the single-cell level is to characterize cell states or cell types. Typically, this relies on marker genes whose expression or co-expression with other such markers indicates a cell type.6,7,8 To find such markers, researchers often perform differential expression (DE) testing, where a statistical hypothesis test is used to compare the expression of genes in one group of cells versus all other groups (one versus all). These groups are usually defined by cluster, cell type, or condition labels.
To date, marker selection has mainly focused on the most significant DE genes or proteins for each group. While this works well with a small number of distinct groups (e.g., major cell types), it may not work well when there is a much larger number of groups with overlapping DE genes. In such cases, markers are not just useful for defining a specific group or type but are also critical for discriminating between similar types. Consider, for example, large multiorgan single-cell RNA sequencing (scRNA-seq) datasets. In such datasets, we may be interested in markers that are specific for both a cell type and a tissue (i.e., markers that are uniquely found only in cell types from this tissue). Such markers can be less significant than overall DE genes since they may only distinguish between two similar types, but they are still of major importance. An example is given by the Tabula Muris dataset,9 a collection of scRNA-seq profiles of over 100,000 cells from over 20 different organs and tissues in Mus musculus. When analyzing these data, the authors used traditional clustering and DE analysis without considering the issue of cell-type/tissue combination. Another example is T cells, which mature in the thymus.10,11 While T cells later migrate and reside in tissues throughout the body, the identification of T cells that have recently left the thymus (recent thymic emigrants [RTEs]) plays a role in treatment decisions.12 Similarly, the role of resident and infiltrating immune cell types is still an active area of research for neurodegenerative diseases. A key challenge is the current inability to distinguish the resident central nervous system (CNS) immune cells and the bone-marrow-derived immune cells.13 Better signatures of CNS-specific immune cells and signatures of infiltrating immune cells are needed to understand the immune responses to therapies.
In addition to cell-type/tissue markers, multivariable label partitioning is central to many other questions in functional genomics. Cell type or disease states are often simultaneously considered when identifying markers,14 and so state-specific markers for cell types are of interest. Deconvolution of cell types from bulk data is also highly dependent on the ability to select not just good markers for each individual cell type but also a set of discriminatory markers between all types.15,16 Finally, a number of recent single-cell proteomics technologies, including CODEX17 and Cell DIVE, require the pre-selection of markers to profile. The ability to identify a subset of markers that would suffice for distinguishing all cell types in the sample is a key criterion for such a selection.18
Broadly, marker selection represents a feature-selection problem. Feature-selection methods can be largely divided into three categories: filters, wrappers, and embedded.19,20,21 Wrapper and embedded methods interact with a specific classifier. Wrapper methods select (often in a greedy manner) a subset of the features that lead to a classifier with the highest accuracy. Examples include sequential forward and backward selection methods.22,23 Embedded methods use the output of the classifier itself, which comes in the form of an explicit ranking of the features or implicitly via a scoring system (e.g., information gain in decision trees24). Since these methods are geared toward classification, they may not be applicable to other problems, including deconvolution.
Filter methods, on the other hand, are not tied to a specific classifier. For example, scGeneFit25 selects those genes that maintain a separation of the different cell types similar to that of the original space. This method supports both a flat partition or a hierarchy of labels (e.g., major cell types and subtypes). RankCorr26 works in a one-versus-all fashion and selects markers for a fixed cell type by performing a rank transformation. Another algorithm, Relief,27 and its extension ReliefF28 penalize features that cannot distinguish a given instance from its negative (having a different label) neighbors, while assigning high scores to features that take similar values among instances from the same class. Minimum-redundancy-maximum-relevance (mRMR) selects features that are relevant to the target class but are not similar to each other.29 CIBERSORT15 and a number of prior methods16,30,31 analyzed a signature matrix of DE genes to identify submatrices with a low condition number for use in deconvolution of bulk mixtures. Thus, while these methods can successfully select discriminative features when the overlap between sets is small, the ability of such methods to select markers that discriminate all pairs of phenotypes has not been extensively studied.
In this article, we explore the problem of determining a global set of biomarkers. These represent features that collectively distinguish between higher context phenotypes. We assume we are given a phenotype × feature, binary or real score matrix M, whose (i,s) entry represents the relevance of feature s (e.g., average gene expression) for phenotype i. We formulate the task as a combinatorial optimization problem where the goal is to identify the smallest set of features such that for every phenotypic pair (i,j) there exists a set of features that can be used to “distinguish” between i and j. We term this problem phenotype cover (PC). We show that PC is equivalent to multiset multicover, which is nondeterministic polynomial-time (NP)-complete32 and propose two algorithms that can approximate it in polynomial time (STAR Methods). The first is based on the extended greedy algorithm to set cover (G-PC),33 and the second is based on the cross-entropy method (CEM-PC).34,35 By analyzing several marker-selection problems, we show that the greedy algorithm outperforms competitors across a variety of tasks. We also analyze some of the specific markers selected by the method and discuss their ability to distinguish between similar cell types.
Results
We developed methods to select discriminative features from a large set of (potentially overlapping) signatures. The goal of the features we select is to enable the separation of the different components in the set. This can be either for supervised learning (for example, classification) or for other learning approaches such as deconvolution or dimensionality reduction. Our method takes as input a signature or score matrix M, which is used to estimate the importance of a feature for a phenotype of interest. Features are then selected by reformulating the problem as a multiset multicover instance where the goal is to select features such that every phenotypic pair is covered at least K times, for some positive K (Figure 1). We developed two solutions to the multiset multicover problem: the first is based on a greedy approach (G-PC), and the second is based on the cross-entropy method (CEM-PC). See STAR Methods for details.
Figure 1.
Graphical illustration of (binary) phenotype cover and its reformulation as a set cover problem
Given a binary score matrix (left), each feature s induces a bipartite graph between classes (center left). Edges in this graph form a set $E_s$. Multiset multicover is then performed on the collection of sets $E_s$ to select a small number of features that “distinguish” all phenotypic pairs (at least K times). The idea can be naturally extended to non-binary score matrices by assigning a multiplicity to each element (STAR Methods).
We tested G-PC and CEM-PC and compared them with eight prior methods: scGeneFit,25 decision trees,36 top differentially expressed genes (TopDE), RankCorr,26 ReliefF,28 mRMR,29 ANOVA F values, and mutual information.37,38 We used three scRNA-seq datasets from lung, mouse cortex, and a human cell atlas (Table 1). We vary the coverage factor K from 1 to 20 for the idiopathic pulmonary fibrosis (IPF) dataset, from 1 to 40 for mouse cortex (MC), and from 1 to 9 for human cell atlas (HCA). For all baselines but TopDE and RankCorr, we select a number of features that matches the solution size returned by G-PC. For TopDE, we take the union of the top k differentially expressed genes for each phenotype (k varying from 1 to <10). For RankCorr, we tuned the hyperparameters until a similar number of features was returned. Finally, for CEM-PC, all features with a probability score greater than 0.98 after convergence were chosen (Methods S1, alg. 3). We compare all methods in terms of phenotype classification performance, deconvolution of bulk mixtures, and feature stability. We also validate the features selected by G-PC by performing gene set enrichment analysis and comparing them with known markers in the literature.
Table 1.
scRNA-seq datasets used in this study
| Dataset | Genes | High var. | Cells | Tissues | Cell types | Reference |
|---|---|---|---|---|---|---|
| Idiopathic pulmonary fibrosis (IPF) | 4,443 | yes | 96,301 | 1 | 33 | Adams et al.39 |
| Mouse cortex (MC) | 20,006 | no | 3,005 | 1 | 7 | Zeisel et al.40 |
| Human cell atlas (HCA) | 2,968 | yes | 84,363 | 15 | 7 | He et al.41 |
For HCA, we consider a combination of tissues and cell types (85). For IPF, only healthy samples were kept. Endothelial-mural and astrocyte-ependymal pairs of cells were grouped for MC.
Classification
We first test the ability of a classifier to predict the correct phenotype given only a subset of the features. For each method, we select a feature set S using a subset of the data, train a logistic regression model on the same subset, and evaluate performance on left-out data. G-PC exhibits strong performance on the IPF and MC datasets across a wide range of coverage factors. For example, when 42 genes are selected on the IPF data, G-PC obtains an F1 score of 0.70, followed by scGeneFit (0.65) and CEM-PC (0.61) (Figure 2A). On the MC data (Figure S1A), G-PC again performs best when 30–140 genes are selected (F1 ≈ 0.94–0.95). mRMR also performs well on these data except when the number of genes selected is small (<30). Decision trees, on the other hand, do not improve in performance when more than 30 genes are selected (F1 ≈ 0.92).
Figure 2.
Comparison of feature selection methods for the IPF dataset
(A and B) Performance scores for (A) and (B) were averaged across five different random train and test splits. SD is shown as a shaded region.
(A) Performance of a logistic regression model trained on the selected features. G-PC achieves the highest F1 score across all coverage factors, followed by scGeneFit and CEM-PC.
(B) Jensen-Shannon divergence (lower is better) between CIBERSORT-predicted mixture proportions and the ground truth.
(C) Stability scores for all eight methods over 5 runs. Sequential methods like G-PC, decision trees, and CEM-PC suffer slightly in stability compared with other, more global methods. Nonetheless, G-PC shares about 70% of the features across runs.
(D) Selected biomarkers assigned to each cell type (rows [columns] are differentiated by color [shape]). Gene s (column) is assigned to cell type i (row) if there exists another cell type j such that $M_{i,s} - M_{j,s} \geq 1$ (see Biomarker validation). Rows and columns were ordered via hierarchical clustering.
(E) For every phenotypic pair, we compute the coverage (i.e., the score difference between the two phenotypes) provided by the selected gene set. A histogram of these coverage factors corresponding to a coverage of 10 is shown for each method. As can be seen, for G-PC and CEM-PC, which optimize for coverage, each element is covered at least 10 times. Other methods provide high coverage for some elements but miss out on others.
These two datasets are obtained from a single tissue. We thus next tested the ability of PC to differentiate between the same cell types across multiple tissues. For this, we used all tissue and cell-type combinations present in the HCA dataset. Decision trees outperform other methods on this classification task (Figure S2A). G-PC is the second-best method when more than 100 genes are selected, while scGeneFit is the second best when fewer than 100 genes are selected. scGeneFit, however, does not improve in performance when more than 100 genes are selected. At 235 genes, decision trees converge at an F1 of 0.70, while G-PC and mutual information reach an F1 of 0.68.
We note that scGeneFit can take the hierarchy of labels into account and that the authors describe improved performance when cell subtypes are considered in the MC dataset. For a fair comparison, we ran three different variants of scGeneFit that take advantage of this hierarchical structure and evaluated performance by using a nearest centroid classifier fit on the entire data. All the hyperparameters we used were identical to those provided by the authors. While G-PC does not use cell subtype information, it still outperforms all three variants across a different number of markers (Figure S4B).
We also tested an additional classifier (k nearest neighbors) and observed very similar results to those obtained with logistic regression (Figures S3A–S3C). Finally, we tested the impact of batch effects by using two pancreas datasets42,43 and observed that our method, G-PC, along with TopDE are the most robust to batch effects (Figure S4A).
Deconvolution
Inferring cell-type proportions from bulk transcriptomics data is an important task in understanding the composition of tumors and other tissues. Many methods have been developed to perform deconvolution of bulk mixtures.16,30,31,44 Deconvolution typically requires solving a linear equation of the form m = Sp, where m is a given mixture vector, S is a signature matrix containing cell-type-specific expression signatures (known), and p is the unknown class proportion vector. One widely used method for deconvolution is CIBERSORT,15 which uses ν-support vector regression (ν-SVR). CIBERSORT constructs the signature matrix S by considering the top k DE genes for every cell-type subset (which leads to the exact same selection as the TopDE baseline we consider in this study). Next, CIBERSORT selects the k that leads to a signature matrix S with the lowest condition number. Finally, ν-SVR is fit on the data, and the regression coefficients in the solution are used to estimate p.
To test the usefulness of the features selected by our method for deconvolution, we constructed pseudo-bulk mixtures using the IPF, MC, and HCA datasets by averaging expression levels across all single cells in the test sets. The signature matrix S was constructed with features selected from the training set, and deconvolution via ν-SVR was then applied to the pseudo-bulk mixtures. As recommended by the authors, we initialize three linear ν-SVR instances with ν ∈ {0.25, 0.5, 0.75} and save the model that achieves the lowest root-mean-square error between the deconvolution result Sp and m. We compute the Jensen-Shannon (JS) divergence45 between the predicted mixture p and the ground truth. G-PC performs well on the IPF data, with RankCorr doing better only when 50–80 genes are selected (Figure 2B). For example, when 163 genes are selected, G-PC achieves an average JS = 0.045, followed by RankCorr (0.056) and scGeneFit (0.062). For the MC dataset, G-PC is also the top-ranking method, though TopDE and RankCorr also accurately resolve mixture proportions (Figure S1B). All three methods obtain a JS score of less than ≈0.025 across all K. CEM-PC performs well on some instances for both datasets; however, the results are unstable and vary between runs. None of the methods clearly outperforms all others on the HCA dataset (Figure S2B). These results demonstrate the challenges of trying to distinguish cell types across tissues.
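The divergence used to score a deconvolution result can be illustrated with a minimal sketch. The function names below are hypothetical, and the natural logarithm is an assumption (the text does not state the base); with this choice the JS divergence lies between 0 and ln 2. The sketch only scores two proportion vectors and does not perform the regression step itself.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two proportion vectors.

    Assumes p and q are non-negative, have the same length, and each sum to 1.
    """
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)

# Toy proportions for three cell types (made-up numbers, not from the paper).
truth = [0.5, 0.3, 0.2]
predicted = [0.4, 0.4, 0.2]
score = js_divergence(predicted, truth)
```

A score of 0 means that the predicted proportions match the ground truth exactly, and larger values indicate a worse estimate.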
Finally, we also tested another version of deconvolution that uses linear least squares (LLS) as the target. We observed that, for LLS, G-PC performs no worse than other methods on IPF and MC (Figures S3D–S3F).
Stability
The focus of the comparison so far has been on accuracy. However, other considerations are also important, especially when selecting features that will be used across different platforms and potentially modalities. One such important issue is feature stability.46 The stability index measures the average size of the overlap divided by the size of the union for all pairs of feature sets (STAR Methods). To test stability, we randomly sample half the data and compute the stability index for the features selected by each method over 5 runs. Stability scores are shown in Figures 2C, S1C, and S2C. G-PC is more stable than decision trees for IPF and MC. However, due to their greedy sequential nature, both G-PC and decision trees are less stable than more global methods such as ReliefF and F values. Nonetheless, G-PC uses from 60% to 70% of the same genes across all runs. Perhaps not surprisingly, due to its random sampling nature, CEM-PC is the least stable method.
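As a rough illustration of the stability index described above, the following sketch averages the overlap-over-union ratio across all pairs of selected feature sets. The function name and the toy gene sets are hypothetical, and the exact definition used in the study is the one given in STAR Methods.

```python
from itertools import combinations

def stability_index(feature_sets):
    """Average |A ∩ B| / |A ∪ B| over all pairs of feature sets."""
    sets = [set(features) for features in feature_sets]
    if len(sets) < 2:
        raise ValueError("need at least two feature sets to compare")
    ratios = []
    for a, b in combinations(sets, 2):
        union = a | b
        # Two empty selections are treated as identical.
        ratios.append(len(a & b) / len(union) if union else 1.0)
    return sum(ratios) / len(ratios)

# Three hypothetical runs on different random halves of the data.
runs = [
    {"KRT19", "CD69", "CXCL2"},
    {"KRT19", "CD69", "CCL5"},
    {"KRT19", "CD69", "CXCL2"},
]
stability = stability_index(runs)
```

In this toy case the three pairwise ratios are 0.5, 1.0, and 0.5, so the index is 2/3.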
Biomarker validation
To validate the set of biomarkers selected by G-PC and decision trees, we performed enrichment analysis for the HCA dataset. We fix a coverage of 8, and for every phenotype i, we select from the solution all those genes s for which there exists some phenotype j satisfying $M_{i,s} \geq M_{j,s} + 1$. We consider each of these sets as a biomarker set for the given phenotype for both G-PC (Figures 2D and S6) and decision trees.
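The assignment rule above can be sketched in a few lines. The function name, the `margin` parameter (set to 1 to match the rule), and the toy score matrix are assumptions made for illustration only.

```python
def assign_markers(M, solution, margin=1.0):
    """Assign each selected feature s to every phenotype i for which some
    other phenotype j satisfies M[i][s] >= M[j][s] + margin."""
    num_phenotypes = len(M)
    assigned = {i: [] for i in range(num_phenotypes)}
    for i in range(num_phenotypes):
        for s in solution:
            if any(M[i][s] >= M[j][s] + margin
                   for j in range(num_phenotypes) if j != i):
                assigned[i].append(s)
    return assigned

# Toy 3 phenotypes x 3 features score matrix (made-up values).
M = [
    [3.0, 0.2, 1.0],
    [0.5, 0.1, 2.5],
    [0.4, 2.0, 1.2],
]
marker_sets = assign_markers(M, solution=[0, 1, 2])
```

Here each phenotype receives a single marker, but in general a feature can be assigned to several phenotypes at once.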
We next performed gene set enrichment analysis (GSEA)47 using the HuBMAP ASCT + B marker set48 to determine if the selected marker sets for a specific cell type are enriched for pathways associated with these cell types. We test the ability of G-PC and decision trees to identify the correct (1) tissue, (2) cell type, and (3) tissue/cell type combination. G-PC obtains lower q values for 42% (3/7) of the cell types and 54% (6/11) of the tissues (Figures 3A and 3B). No markers were found for four tissues. When tested against the correct tissue and cell-type pair, G-PC obtained lower q values for 71% (30/42) of the pairs (Figure 3C). The remaining combinations (33) either belonged to a tissue that was not present in the marker set or were not identified by either method. Some known markers assigned correctly by G-PC are shown in Figure 3D. Esophagus and trachea tissues were mapped to respiratory system in the ASCT + B set. The top two principal components of the markers that provide coverage for a given tissue or cell type show visible separation between different classes (Figure 4).
Figure 3.
GSEA q values for the HCA dataset
(A and B) We select markers that provide coverage for each cell type for both G-PC and decision trees and perform gene set enrichment analysis (GSEA) using the HuBMAP ASCT + B gene set.48 We first record q values for the top entry, which contains the correct cell type or the correct tissue independently. When comparing the ability of each method to assign the correct class, G-PC obtains a lower q value, i.e., a higher −log(q value), for 42% (3/7) of the cell types (A) and 54% (6/11) of the tissues (B). We did not find markers for four tissues in the gene set (common bile duct, muscle, rectum, stomach).
(C) When tested for the ability to identify both the correct tissue and the correct cell type, G-PC obtained lower q values in 71% (30/42) of the cases. The remaining tissue/cell-type pairs (33) either belonged to a tissue that was not present in the marker set or were not identified by either method.
(D) Connected by an edge are known markers for CD4 and myeloid cells that were assigned to the correct tissue/cell-type pair by G-PC. Some markers are assigned to multiple cell types (multiple outgoing edges), while others are pair specific.
Figure 4.
Principal-component analysis (PCA) plots of the selected markers for the different phenotypes present in the HCA dataset
(A and B) A total of 121 markers were selected via G-PC (coverage = 5). For every tissue (A) and cell type (B), the top two principal components of the markers providing coverage (≥1) for that phenotype are plotted. The exact number of markers used is shown in parentheses. There is visible separation between classes.
Due to limitations of the marker set we are using, only 20 cell types could be identified for IPF. Among these, G-PC obtains lower q values for 12 (60%) (Figure S5A). We observe good agreement between genes selected using our greedy procedures and genes known to be involved in specific cell types. For example, G-PC correctly assigns KRT19 and ADGRF5 to type I and II epithelial cells (ATI and ATII49,50,51). CD69 is assigned to both B and T cells,52,53 COBLL1 is assigned to B cells,54 JCHAIN to B and B plasma cells,55 CXCL2 to macrophages,56 and CCL5, PRF1, and CD247 to natural killer cells.57,58,59 See Figure 2D for a larger list of identified markers.
In addition to selecting known cell-type markers, G-PC is also able to select markers that distinguish between similar cell types. For example, it assigns CXCL2 to ATII and not to ATI60 and CD69 and AFF3 to B cells and not plasma cells.52,53,61 Another example is A2M and CST7, which are assigned to cytotoxic T cells,62 whereas NAMPT and TNFRSF18 are assigned to regulatory T cells.63
Discussion
Selection and use of markers is a common step in many analysis pipelines. Most recently, this topic received increased attention due to the large number of new cell types that have been identified and characterized using scRNA-seq data.64,65,66,67 To date, such selection was mainly based on methods that focused on each cell type separately and did not consider the relationship between markers selected for different types. Such methods can select overlapping marker sets for different cell types, making it hard to discriminate between similar cell types. This is especially important for large datasets where multiple cell types in multiple tissues are being profiled.9,41
To address this issue and improve the ability to select a discriminating set of markers, we defined a new optimization function for biomarker selection that takes the overlap into account. Specifically, we defined the PC problem that aims to optimize the accuracy of identifying different sets when using the selected markers. We presented two heuristic filter methods since these lead to solutions that can be used in several different analysis pipelines including classification, deconvolution, experimental design, and more. The first is based on a greedy approximation algorithm (G-PC) and the second is based on the cross-entropy method (CEM-PC).
We evaluated these methods and compared them with prior methods developed for marker selection using several high-throughput scRNA-seq datasets. Our analysis indicates that G-PC assigns equal importance to all different phenotypes in the data and is affected less by class imbalance as shown by the F1 score. Other methods tend to select features that discriminate only dominating classes. Furthermore, G-PC can be used with signature matrices rather than direct expression measurements. In such cases, there is only a single score for all phenotype/gene pairs, which makes using other methods difficult. This allows G-PC to construct signature matrices for deconvolution, which leads to an accurate estimation of cell-type proportions from bulk mixtures. While G-PC is slightly less stable than some other methods, it nonetheless retains most of the features (∼70%) across runs.
Decision trees outperform G-PC with regard to the F1 score in one of the datasets we analyzed (HCA). However, even for HCA, G-PC seems to obtain a more accurate list of cell-type markers based on enrichment analysis. We note that our method is best suited for datasets that require detailed annotations, which usually means that several cell types partially overlap in their markers. In contrast, for large datasets where the focus is on coarser cell types, we see less advantage compared to standard marker selection methods. Finally, we provide a C++ implementation of G-PC with Python bindings, which makes it the fastest method we tested (Table 2). Speed is an important consideration when working with large scRNA-seq datasets.
Table 2.
Runtimes of all feature selection methods for all three datasets used in this study (s)
| Data | G-PC | CEM-PC | DT | scGF | DE | RC | FVal | ReliefF | MI | mRMR |
|---|---|---|---|---|---|---|---|---|---|---|
| IPF | 1.3 | 55 | 304 | 102 | 41 | 90 | 1.5 | 388 | 1039 | 980 |
| MC | 0.17 | 227 | 5 | 39 | 2.7 | 3 | 0.15 | 11 | 116 | 139 |
| HCA | 1.25 | 88 | 50 | 52 | 30 | 95 | 0.8 | 340 | 790 | 310 |
Method names were abbreviated. 178 features were selected for IPF, 66 for MC, and 121 for HCA. Our C++ implementation of G-PC takes less than 2 s for all three datasets, making it the fastest along with F value computation. Performance tests are conducted on a machine with a 2.3 GHz 8-Core Intel Core i9 CPU and 32 GB memory.
We observed that CEM-PC sometimes selects a smaller set of genes that achieves the same coverage as G-PC. However, due to its random sampling nature, CEM-PC is very unstable and can lead to a completely different set of features across runs.
Limitations of study
The biological analysis relied on computing the overlap with existing gene marker lists that may be incomplete. A more thorough biological analysis of the selected genes and their relation to each cell type might provide more insight into the performance of our algorithms.
While G-PC worked well for the data analyzed in this paper, it does not provide an optimal solution. It would be interesting to see whether other approximation algorithms that optimize for coverage lead to better results when tested on biological data.
STAR★Methods
Key resources table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Deposited data | | |
| Idiopathic Pulmonary Fibrosis (IPF) | Adams et al.39 | GEO: GSE136831 |
| Mouse Cortex (MC) | Zeisel et al.40 | GEO: GSE60361 |
| Human Cell Atlas (HCA) | He et al.41 | GEO: GSE159929 |
| Software and algorithms | | |
| Multiset multicover algorithm | This paper | Zenodo: https://doi.org/10.5281/zenodo.7158750 |
| Phenotype cover algorithms (G-PC, CEM-PC) | This paper | Zenodo: https://doi.org/10.5281/zenodo.7158780 |
| Experiments in this paper | This paper | Zenodo: https://doi.org/10.5281/zenodo.7158788 |
| scGeneFit algorithm | Dumitrascu et al.25 | GitHub: https://github.com/solevillar/scGeneFit-python |
| Decision Trees, ANOVA F-values, Mutual Information, Logistic Regression | Pedregosa et al.,68 scikit-learn | Zenodo: https://doi.org/10.5281/zenodo.6968622 |
| T-test for differentially expressed genes | Theis Lab; PI: Fabian Theis | GitHub: https://github.com/theislab/diffxpy |
| RankCorr algorithm | Vargo et al.26 | GitHub: https://github.com/ahsv/RankCorr |
| mRMR algorithm | Peng et al.29 | GitHub: https://github.com/smazzanti/mrmr |
| Approximate nearest neighbors algorithm | Johnson et al.69 | GitHub: https://github.com/facebookresearch/faiss |
| GSEA algorithm | Subramanian et al.,47 GSEApy | Zenodo: https://doi.org/10.5281/zenodo.3748084 |
| Python version 3.8 | Python Software Foundation | https://www.python.org/ |
Resource availability
Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Ziv Bar-Joseph (zivbj@andrew.cmu.edu).
Materials availability
This study did not generate new unique reagents.
Method details
Notation
Let $M \in \mathbb{R}^{P \times F}$ represent a score matrix. We denote by P the number of phenotypes (e.g., cell types) and by F the number of features (e.g., genes). In this paper, we use scRNA-seq read count data denoted by $X \in \mathbb{R}^{N \times G}$. Here, N denotes the number of cells and G denotes the number of genes. Given a known vector y of length N representing class labels, we derive a matrix M from X by averaging expression values of cells with the same class label. In this case, P = {number of distinct classes} and F = G. We denote by [n] the set $\{1, \dots, n\}$. Finally, let $(x)_+ = \max\{x, 0\}$.
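As an illustration of how the score matrix M can be derived from X and y, the following minimal sketch averages the rows of X that share a class label. The function name and the toy counts are hypothetical, and any normalization or transformation applied to the counts before averaging is outside the scope of this sketch.

```python
from collections import defaultdict

def average_by_label(X, y):
    """Return (labels, M), where row p of M is the mean of all rows of X
    whose class label equals labels[p]."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for g, value in enumerate(row):
            sums[label][g] += value
        counts[label] += 1
    labels = sorted(sums)  # fixed ordering of the P phenotypes
    M = [[total / counts[label] for total in sums[label]] for label in labels]
    return labels, M

# Toy data: N = 3 cells, G = 3 genes, two classes (made-up counts).
X = [
    [2, 0, 1],
    [4, 0, 3],
    [0, 6, 1],
]
y = ["T cell", "T cell", "B cell"]
labels, M = average_by_label(X, y)
```

The resulting matrix has P = 2 rows (one per class) and F = G = 3 columns.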
Problem formulation and complexity
Phenotype cover (PC)
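Given a score (signature) matrix $M \in \mathbb{R}^{P \times F}$, find a subset $S \subseteq [F]$ of minimal cardinality, such that for every $i, j \in [P]$ with $i \neq j$, and some fixed positive K, the following holds:
$$\sum_{s \in S} (M_{i,s} - M_{j,s})_+ \geq K.$$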
PC asks for a small subset of features such that for any given ordered pair of phenotypes (i,j), one can find enough features that collectively distinguish i from j by a factor of at least K. This problem allows the selection of a gene that covers several phenotypic pairs, e.g., multiple cell subtypes vs. another major cell type, but also demands sufficient coverage between the subtypes themselves. The straightforward solution of iterating over all possible feature subsets that satisfy the requirements above and selecting the one with the smallest cardinality suffers from exponential complexity in the number of subsets considered. In fact, PC is equivalent to multiset multicover, which is NP-complete.33
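To make the requirement concrete, the sketch below checks whether a candidate feature subset covers every ordered phenotypic pair at least K times, and then performs the exhaustive search described above on a toy matrix. All function names and the example matrix are hypothetical. Coverage of an ordered pair (i,j) is taken to be the sum over the selected features of max(M[i][s] − M[j][s], 0), which is our reading of the definition.

```python
from itertools import combinations, permutations

def pair_coverage(M, S, i, j):
    """Total amount by which the features in S score i above j."""
    return sum(max(M[i][s] - M[j][s], 0) for s in S)

def is_phenotype_cover(M, S, K):
    """True if every ordered pair (i, j), i != j, has coverage >= K."""
    return all(pair_coverage(M, S, i, j) >= K
               for i, j in permutations(range(len(M)), 2))

def brute_force_cover(M, K):
    """Smallest cover by exhaustive search; exponential in the number of
    features, so only usable on toy inputs."""
    num_features = len(M[0])
    for size in range(1, num_features + 1):
        for S in combinations(range(num_features), size):
            if is_phenotype_cover(M, S, K):
                return list(S)
    return None  # no subset reaches coverage K for all pairs

# Toy binary score matrix: 3 phenotypes x 4 features.
M = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
smallest = brute_force_cover(M, K=1)
```

On this example, features 0, 1, and 3 together cover all six ordered pairs, whereas no subset of size two does. The exhaustive search inspects up to 2^F subsets, which motivates the approximation algorithms.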
To establish this equivalence, it may help to first consider a simplified version of the problem where we restrict $M$ to be binary and $K = 1$; call this problem PC-B. In this case, we require a small subset of features $S \subseteq [F]$, such that for any two phenotypes $i \neq j$ there exists some index $s \in S$ where $M_{i,s} - M_{j,s} = 1$. Note that in this simplified form, every feature $s$ induces a bipartite graph $\mathcal{G}_s = (U_s, V_s, E_s)$, where
$$U_s = \{i \in [P] : M_{i,s} = 1\}, \quad V_s = \{j \in [P] : M_{j,s} = 0\}, \quad E_s = U_s \times V_s.$$
Every edge corresponds to an ordered pair of phenotypes (Figure 1).
Now, given the collection of sets $\{E_s\}_{s \in [F]}$, set cover asks to find the smallest subcollection $\mathcal{C} \subseteq \{E_s\}_{s \in [F]}$ such that for every element $e \in \{(i,j) : i, j \in [P],\, i \neq j\}$, there exists a set in $\mathcal{C}$ which contains $e$. It is easy to see that the features corresponding to $\mathcal{C}$ are the solution to PC-B.
So far, we only considered a binary score matrix. However, a solution to the binary problem can be naturally extended to solve non-binary scoring matrices by assigning multiplicities to the elements of $E_s$. To every $e = (i,j)$ we assign the multiplicity $(M_{i,s} - M_{j,s})_+$ and view $E_s$ as a multiset. Note that since we are working with real numbers, we need to round the multiplicities to integers. Higher precision can be easily obtained by first scaling both $M$ and $K$ by some scalar $c$ and performing the rounding after. Finally, the requirement $K = 1$ can also be relaxed by solving for a multicover, where we require each element to be contained at least $K$ times in $\mathcal{C}$ (counting multiplicities).
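The construction of the multisets can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the multiplicity of the pair (i,j) in the multiset of feature s is taken to be (M[i][s] - M[j][s])_+, both M and K are scaled by a scalar c, and rounding is to the nearest integer (the rounding direction is our assumption). The name `build_multisets` is hypothetical.

```python
from itertools import permutations


def build_multisets(M, K, c=1.0):
    """Turn a score matrix into one multiset per feature.

    Returns (multisets, demand), where multisets[s] maps an ordered pair
    (i, j) to its integer multiplicity and demand is the rounded, scaled K.
    Pairs with multiplicity zero are omitted from the multiset.
    """
    P, F = len(M), len(M[0])
    multisets = []
    for s in range(F):
        mult = {}
        for i, j in permutations(range(P), 2):
            m = round(c * max(M[i][s] - M[j][s], 0.0))
            if m > 0:
                mult[(i, j)] = m
        multisets.append(mult)
    return multisets, round(c * K)


# Toy example: scaling by c = 10 keeps one decimal digit of precision.
M = [[1.26, 0.0], [0.0, 0.74]]
coarse, coarse_demand = build_multisets(M, K=1.0, c=1.0)
fine, fine_demand = build_multisets(M, K=1.0, c=10.0)
```

With c = 1 both features appear to cover their pair exactly once, whereas with c = 10 the second feature covers the pair (1,0) only 7 times against a demand of 10, so the scaled problem retains the information that 0.74 < K.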
Approximating a solution to phenotype cover
Given the NP-Completeness of PC, we present two greedy solutions that run in polynomial time.
Greedy phenotype cover (G-PC)
First, we consider the well-known greedy approach to solving set cover that iteratively picks the set which covers the greatest number of elements not covered yet.70,71 The algorithm can be trivially extended to solve multiset multicover.33 The full algorithm is presented in Methods S1, algorithms 1 and 2. Every time we select a set, we need to correct the multiplicities of all the remaining $O(F)$ sets, each of which may contain up to $O(P^2)$ elements (all phenotypic pairs). Therefore, if we denote the solution size by $k$, the run-time complexity of G-PC is $O(kFP^2)$. In practice, $P$ is small and $k \ll F$; therefore, the method is almost linear in the number of features considered. The approximation quality of this solution was previously analyzed: the solution returned by the greedy algorithm for multiset multicover is at most a factor of $H_m$ larger than the optimal solution, where $H_m = \sum_{i=1}^{m} 1/i$ and $m$ is the cardinality of the largest multiset.33
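The greedy strategy can be sketched in a few lines. The version below is a simplified illustration and not the algorithm of Methods S1 or the C++ package: it recomputes gains from scratch in every round, caps each multiplicity by the remaining demand of its element, and breaks ties by lowest index. The function name and the toy input are hypothetical.

```python
def greedy_multiset_multicover(multisets, elements, demand):
    """Pick multisets greedily until every element is covered `demand` times.

    multisets: list of dicts mapping element -> integer multiplicity.
    elements:  iterable of all elements that must be covered.
    Returns the indices of the selected multisets in selection order.
    Raises ValueError if the demand cannot be met.
    """
    remaining = {e: demand for e in elements}   # unmet demand per element
    available = list(range(len(multisets)))
    chosen = []

    def gain(s):
        # Useful coverage only: copies beyond the unmet demand do not count.
        return sum(min(m, remaining.get(e, 0)) for e, m in multisets[s].items())

    while any(r > 0 for r in remaining.values()):
        best = max(available, key=gain, default=None)
        if best is None or gain(best) == 0:
            raise ValueError("demand cannot be satisfied")
        chosen.append(best)
        available.remove(best)
        for e, m in multisets[best].items():
            if e in remaining:
                remaining[e] = max(remaining[e] - m, 0)
    return chosen


# Toy input: three sets over elements a, b, c, each with multiplicity 1.
sets = [{"a": 1, "b": 1}, {"b": 1, "c": 1}, {"a": 1, "b": 1, "c": 1}]
elems = {"a", "b", "c"}
cover_once = greedy_multiset_multicover(sets, elems, demand=1)
cover_twice = greedy_multiset_multicover(sets, elems, demand=2)
```

Applied to phenotype cover, the elements would be the ordered phenotype pairs and each multiset would correspond to one feature.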
Cross-entropy method phenotype cover (CEM-PC)
In addition to the greedy multiset multicover approach, we developed a new method based on cross-entropy (CEM).34 CEM was originally used to estimate probabilities of rare events and it was later extended to solve combinatorial problems.72 Roughly, CEM consists of two steps: 1) generate a random sample based on a specific distribution, and 2) update distribution parameters such that “high-scoring” samples are more likely to be produced in the next iteration. This two-step procedure is repeated until convergence, or until a maximal number of iterations is reached. The final parameters determine the solution to the combinatorial problem (in our case, selecting features whose probability is greater than some threshold). For a more detailed analysis of CEM, the reader may refer to the excellent tutorial of De Boer et al.35
We present a variant of CEM for solving set cover by introducing a scoring function that encourages high coverage but penalizes a large number of features (Methods S1, alg. 3). The run-time complexity of CEM-PC depends on the maximum number of iterations $I$, the number of random samples per iteration $R_s$, and the complexity of the scoring function (in this case, the smallest coverage attained per random sample). Since computing the smallest coverage of one sample takes at most $O(FP^2)$ time, this leads to a total run-time complexity of $O(I R_s F P^2)$. In this paper, we use $I = 500$ and $R_s = 1000$. In practice, convergence is attained in fewer iterations.
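The two-step procedure can be illustrated with a small sketch. This is not the algorithm of Methods S1: the scoring function (smallest pairwise coverage capped at K, minus a penalty `lam` per selected feature), the independent Bernoulli sampling distribution, the elite fraction, the smoothing factor, and the final threshold are all assumptions made for illustration, and every name below is hypothetical.

```python
import random


def min_pair_coverage(M, S):
    """Smallest sum_{s in S} max(M[i][s] - M[j][s], 0) over ordered pairs."""
    P = len(M)
    return min(
        sum(max(M[i][s] - M[j][s], 0.0) for s in S)
        for i in range(P) for j in range(P) if i != j
    )


def score(M, S, K, lam):
    """Assumed score: reward coverage up to K, penalize each selected feature."""
    return min(min_pair_coverage(M, S), K) - lam * len(S)


def cem_select(M, K, lam=0.1, iters=20, n_samples=200, elite_frac=0.1,
               smoothing=0.7, threshold=0.5, seed=0):
    rng = random.Random(seed)
    F = len(M[0])
    p = [0.5] * F                       # inclusion probability per feature
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(iters):
        # Step 1: draw random feature subsets from the current distribution.
        samples = [[s for s in range(F) if rng.random() < p[s]]
                   for _ in range(n_samples)]
        # Step 2: move the distribution toward the best-scoring samples.
        samples.sort(key=lambda S: score(M, S, K, lam), reverse=True)
        elite = samples[:n_elite]
        for s in range(F):
            freq = sum(s in S for S in elite) / n_elite
            p[s] = smoothing * freq + (1 - smoothing) * p[s]
    return [s for s in range(F) if p[s] > threshold]


# Toy problem: two phenotypes, each with a single marker; both are needed.
M = [[1.0, 0.0], [0.0, 1.0]]
selected = cem_select(M, K=1.0)
```

The default of 20 iterations and 200 samples is chosen only to keep the toy example fast; it is unrelated to the values of I and R_s used in the paper.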
Baselines
As mentioned above, several prior methods have been developed for marker and feature selection. We thus compared our method against several baselines on traditional supervised learning tasks, ability to construct signature matrices for deconvolution of bulk mixtures, and feature stability. Specifically, we compare our method to scGeneFit25 and RankCorr26 which were used for discriminative marker selection. We use the implementations provided by the authors of each method. For scGeneFit, we used a redundancy of 0.1 and kept the remaining parameters at defaults. We compare against an embedded method that uses decision trees with the Gini Index criterion to rank features. Note that here we use decision trees as a feature selection method and not as a classifier. The performance of decision trees as a classifier was worse than that of Logistic Regression using the same features, hence, we excluded these results from the manuscript. We also compare against several other filter methods. We consider the union of the top differentially expressed genes per phenotype as determined by Welch’s t-test73 (TopDE). We compare against ReliefF28 which uses nearest neighbors' information to update feature weights. Since computing exact neighbors is slow for the single cell data we are using, we developed a variant of ReliefF that uses approximate neighbors based on the faiss package.69 We compute 30 neighbors per sample. ANOVA F-values and mutual information between gene expression and phenotype are also computed using the popular package scikit-learn.68 Finally, we compare against minimum-redundancy-maximum-relevance (mRMR).29 For mRMR, we use the open-source Python package mrmr (https://github.com/smazzanti/mrmr) which measures relevance via the F-value and measures redundancy via Pearson’s correlation. For all the baselines but TopDE and RankCorr, we take the top k scoring features, where k equals the size of the solution returned by G-PC.
Datasets and preprocessing
We use three public scRNA-seq datasets to validate our method (Table 1). For all three datasets we remove classes with fewer than 50 cells. This leads to 75 tissue/cell type pairs for HCA. We also keep only genes expressed in at least 10 cells, and for runtime efficiency purposes, we only consider highly variable genes for IPF and HCA for all methods. Also, scGeneFit was slow for MC, so we considered only highly variable genes for MC when running this method. Each dataset is normalized using Scanpy74 so that the total counts for all cells are equal. The data is then log(x+1) transformed and each feature is scaled to unit variance and zero mean. scGeneFit performed very poorly when the data was scaled; hence, for a fair comparison we skipped the scaling step when running scGeneFit. Log-transforming and scaling the data had a positive effect on the F1 score for all the other methods. We show these results for the MC dataset in Figure S4D. On the other hand, deconvolution via CIBERSORT works best if the data is in linear space as recommended by the authors; hence, we did not log-transform the data during deconvolution. Feature selection, however, is applied on log-transformed data.
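The three preprocessing steps can be sketched without any external dependency. This is an illustration only and not the Scanpy pipeline used in the paper: the target total of 1e4 counts per cell and the use of the population standard deviation are assumptions, and the function name `preprocess` is hypothetical.

```python
import math


def preprocess(counts, target_sum=1e4):
    """Normalize library size, apply log(x + 1), then z-score each feature."""
    # 1) Rescale every cell so that its counts sum to target_sum.
    normed = []
    for row in counts:
        total = sum(row)
        normed.append([v * target_sum / total if total > 0 else 0.0 for v in row])
    # 2) log(x + 1) transform.
    logged = [[math.log1p(v) for v in row] for row in normed]
    # 3) Scale each feature (column) to zero mean and unit variance.
    n = len(logged)
    scaled = [row[:] for row in logged]
    for g in range(len(logged[0])):
        col = [row[g] for row in logged]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for r in range(n):
            # Constant features are set to zero instead of dividing by zero.
            scaled[r][g] = (col[r] - mean) / std if std > 0 else 0.0
    return scaled


# Toy count matrix (made-up values): 3 cells x 2 genes.
counts = [[1, 3], [2, 2], [30, 10]]
scaled = preprocess(counts)
column_means = [sum(row[g] for row in scaled) / 3 for g in range(2)]
column_vars = [sum(row[g] ** 2 for row in scaled) / 3 for g in range(2)]
```

For deconvolution, where the data should stay in linear space, only the first step would be applied.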
We split all datasets into a train and test set of equal size in a stratified fashion. To obtain a signature matrix M for G-PC and CEM-PC, we average expression values for every phenotype. While it is true that this operation summarizes the data and leads to information loss, we note that our goal is not reconstruction or dimensionality reduction but rather marker selection. We argue that for such a task the individual cell-based expression is less important since we are looking for markers that are generally observed across most or all cells. Furthermore, commonly used DE tests such as t-test also rely on a small set of sufficient statistics.
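A stratified split into two halves can be sketched as follows. The shuffling scheme, the seed, and the handling of classes with an odd number of cells (the extra cell goes to the test set) are assumptions for illustration; the function name is hypothetical.

```python
import random
from collections import defaultdict


def stratified_half_split(y, seed=0):
    """Split indices into train/test halves while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        half = len(idx) // 2
        train.extend(idx[:half])
        test.extend(idx[half:])
    return sorted(train), sorted(test)


# Toy labels: 4 cells of class "a" and 6 cells of class "b".
y = ["a"] * 4 + ["b"] * 6
train, test = stratified_half_split(y)
```

The signature matrix would then be computed from the training half only, by averaging the expression values of the training cells of each phenotype.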
Regarding the choice of K, in this paper we test the performance of our methods across multiple values of K. In practice, a single value for K could be obtained in a cross-validation fashion.
Quantification and statistical analysis
To compare the performance of Logistic Regression classifiers, we use the macro-average F1 score. This score weighs the F1 score of each class equally, which is desirable as we are interested in finding markers for all phenotypes, regardless of any class imbalance in the data. For a single class $p$, the F1 score is the harmonic mean of precision and recall:
$$F1_p = \frac{2 \cdot \mathrm{precision}_p \cdot \mathrm{recall}_p}{\mathrm{precision}_p + \mathrm{recall}_p}.$$
The macro-average F1 score is simply the unweighted mean of per-class F1 scores:
$$F1_{\mathrm{macro}} = \frac{1}{P} \sum_{p=1}^{P} F1_p.$$
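A minimal sketch of the macro-average F1 score is given below. It uses the identity F1 = 2TP / (2TP + FP + FN), which is algebraically equal to the harmonic mean of precision and recall; assigning a score of 0 to a class with no true or predicted members is our convention. The function name and the toy labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over the classes in y_true."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels: class "a" has F1 = 4/5 and class "b" has F1 = 2/3.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
value = macro_f1(y_true, y_pred)
```

Because every class contributes equally, the minority class "b" has the same influence on the result as the majority class "a".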
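To evaluate deconvolution performance, we use the Jensen-Shannon divergence45 which is a symmetric measure between two probability distributions. Given two discrete probability distributions $P$ and $Q$, the Kullback-Leibler divergence75 is given by
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)},$$
where $\mathcal{X}$ is a probability space. Letting $M = \frac{1}{2}(P + Q)$ denote the mixture of the two distributions, the Jensen-Shannon divergence is
$$D_{\mathrm{JS}}(P \,\|\, Q) = \frac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \frac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M).$$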
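For discrete distributions given as lists of probabilities, the Jensen-Shannon divergence can be computed as in the sketch below. The natural logarithm is assumed, terms with zero probability are treated as contributing zero, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Identical distributions have divergence 0; disjoint ones reach log(2).
same = js_divergence([0.2, 0.8], [0.2, 0.8])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

The mixture is positive wherever either input is positive, so the divergence stays finite even when the two distributions have different supports.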
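Feature stability computes the average size of the overlap divided by the size of the union for all pairs of feature sets. More precisely, given a collection of feature sets $\{S_1, \dots, S_n\}$, stability is given by
$$\mathrm{stability} = \frac{2}{n(n-1)} \sum_{1 \le a < b \le n} \frac{|S_a \cap S_b|}{|S_a \cup S_b|}.$$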
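The stability measure, i.e., the mean ratio of intersection size to union size over all pairs of feature sets, can be sketched as follows. The function name and the toy gene sets are hypothetical, and the sketch assumes at least two non-empty feature sets.

```python
from itertools import combinations


def stability(feature_sets):
    """Mean of |A & B| / |A | B| over all unordered pairs of feature sets."""
    sets = [set(s) for s in feature_sets]
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


# Toy example: three runs of a selector returning two genes each.
runs = [{"G1", "G2"}, {"G2", "G3"}, {"G1", "G2"}]
value = stability(runs)   # pairwise ratios are 1/3, 1 and 1/3
```

A value of 1 means that every run selected exactly the same features, and a value of 0 means that no two runs shared a feature.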
Finally, we performed gene set enrichment analysis (GSEA) using the Python package GSEApy (https://gseapy.readthedocs.io/) and the Enrichr API.76 We used the HuBMAP_ASCTplusB_augmented_2022 gene set.48 All p values reported in this paper were corrected for multiple testing.
Acknowledgments
This work was partially supported by NIH grants OT2OD026682, 1U54AG075931, and 1U24CA268108 and by NSF grant CBET2134998 to Z.B.-J.
Author contributions
Z.B.-J. and E.H. designed the study. E.H. and A.G. derived the theoretical results. Z.B.-J., E.H., A.A., and B.P. designed the empirical analysis and analyzed the results. E.H. wrote the software and performed the analysis. All authors contributed to manuscript writing. All authors read and approved the final manuscript.
Declaration of interests
The authors declare no competing interests.
Published: November 11, 2022
Footnotes
Supplemental information can be found online at https://doi.org/10.1016/j.crmeth.2022.100332.
Data and code availability
• This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the key resources table.
• We implemented a general-purpose package for running the greedy multiset multicover algorithm in C++ and expose it to Python. The code has been deposited at https://github.com/euxhenh/multiset-multicover. The G-PC and CEM-PC algorithms for feature selection can be found at https://github.com/euxhenh/phenotype-cover. Installation instructions are available in each repository. The code for running experiments in this paper is available from https://github.com/euxhenh/phenotype-cover-experiments. DOIs are listed in the key resources table.
• Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
References
1. HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature. 2019;574:187–192. doi: 10.1038/s41586-019-1629-x.
2. Regev A., Teichmann S.A., Lander E.S., Amit I., Benoist C., Birney E., Bodenmiller B., Campbell P., Carninci P., Clatworthy M., et al. The human cell atlas. Elife. 2017;6 doi: 10.7554/eLife.27041.
3. Rozenblatt-Rosen O., Stubbington M.J.T., Regev A., Teichmann S.A. The human cell atlas: from vision to reality. Nature. 2017;550:451–453. doi: 10.1038/550451a.
4. Cancer Genome Atlas Research Network. Weinstein J.N., Collisson E.A., Mills G.B., Shaw K.R.M., Ozenberger B.A., Ellrott K., Shmulevich I., Sander C., Stuart J.M. The cancer genome atlas pan-cancer analysis project. Nat. Genet. 2013;45:1113–1120. doi: 10.1038/ng.2764.
5. Hawrylycz M.J., Lein E.S., Guillozet-Bongaarts A.L., Shen E.H., Ng L., Miller J.A., van de Lagemaat L.N., Smith K.A., Ebbert A., Riley Z.L., et al. An anatomically comprehensive atlas of the adult human brain transcriptome. Nature. 2012;489:391–399. doi: 10.1038/nature11405.
6. Usoskin D., Furlan A., Islam S., Abdo H., Lönnerberg P., Lou D., Hjerling-Leffler J., Haeggström J., Kharchenko O., Kharchenko P.V., et al. Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat. Neurosci. 2015;18:145–153. doi: 10.1038/nn.3881.
7. Lo Giudice Q., Leleu M., La Manno G., Fabre P.J. Single-cell transcriptional logic of cell-fate specification and axon guidance in early-born retinal neurons. Development. 2019;146:dev178103. doi: 10.1242/dev.178103.
8. Bassett E.A., Wallace V.A. Cell fate determination in the vertebrate retina. Trends Neurosci. 2012;35:565–573. doi: 10.1016/j.tins.2012.05.004.
9. Tabula Muris Consortium. Overall coordination. Logistical coordination. Organ collection and processing. Library preparation and sequencing. Computational data analysis. Cell type annotation. Writing group. Supplemental text writing group. Principal investigators Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature. 2018;562:367–372. doi: 10.1038/s41586-018-0590-4.
10. Charles A Janeway J., Travers P., Walport M., Shlomchik M.J. 5th Edition. Immunobiol. Immune Syst. Health Dis.; 2001. Generation of Lymphocytes in Bone Marrow and Thymus.
11. Heath W.R. In: Encyclopedia of Immunology. Second Edition. Delves P.J., editor. Elsevier; 1998. T lymphocytes; pp. 2341–2343.
12. Ravkov E., Slev P., Heikal N. Thymic output: assessment of CD4+ recent thymic emigrants and T-Cell receptor excision circles in infants. Cytometry B Clin. Cytom. 2017;92:249–257. doi: 10.1002/cyto.b.21341.
13. Ronning K.E., Karlen S.J., Miller E.B., Burns M.E. Molecular profiling of resident and infiltrating mononuclear phagocytes during rapid adult retinal degeneration using single-cell RNA sequencing. Sci. Rep. 2019;9:4858. doi: 10.1038/s41598-019-41141-0.
14. Gawel D.R., Serra-Musach J., Lilja S., Aagesen J., Arenas A., Asking B., Bengnér M., Björkander J., Biggs S., Ernerudh J., et al. A validated single-cell-based strategy to identify diagnostic and therapeutic targets in complex diseases. Genome Med. 2019;11:47. doi: 10.1186/s13073-019-0657-3.
15. Newman A.M., Liu C.L., Green M.R., Gentles A.J., Feng W., Xu Y., Hoang C.D., Diehn M., Alizadeh A.A. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods. 2015;12:453–457. doi: 10.1038/nmeth.3337.
16. Gong T., Hartmann N., Kohane I.S., Brinkmann V., Staedtler F., Letzkus M., Bongiovanni S., Szustakowski J.D. Optimal deconvolution of transcriptional profiling data using quadratic programming with application to complex clinical blood samples. PLoS One. 2011;6 doi: 10.1371/JOURNAL.PONE.0027156.
17. Goltsev Y., Samusik N., Kennedy-Darling J., Bhate S., Hale M., Vazquez G., Black S., Nolan G.P. Deep profiling of mouse splenic architecture with CODEX multiplexed imaging. Cell. 2018;174:968–981.e15. doi: 10.1016/J.CELL.2018.07.010.
18. Chattopadhyay P.K., Roederer M. Cytometry: today’s technology and tomorrow’s horizons. Methods. 2012;57:251–258. doi: 10.1016/J.YMETH.2012.02.009.
19. Bolón-Canedo V., Sánchez-Maroño N., Alonso-Betanzos A., Benítez J., Herrera F. A review of microarray datasets and applied feature selection methods. Inf. Sci. 2014;282:111–135. doi: 10.1016/j.ins.2014.05.042.
20. Tadist K., Najah S., Nikolov N.S., Mrabti F., Zahi A. Feature selection methods and genomic big data: a systematic review. J. Big Data. 2019;6 doi: 10.1186/S40537-019-0241-0/TABLES/6. 79–24.
21. Saeys Y., Inza I., Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23:2507–2517. doi: 10.1093/BIOINFORMATICS/BTM344.
22. Whitney A.W. A direct method of nonparametric measurement selection. IEEE Trans. Comput. 1971;C-20:1100–1103. doi: 10.1109/T-C.1971.223410.
23. Marill T., Green D. On the effectiveness of receptors in recognition systems. IEEE Trans. Inf. Theory. 1963;9:11–17. doi: 10.1109/TIT.1963.1057810.
24. Breiman L., Friedman J.H., Olshen R.A., Stone C.J. Routledge; 1984. Classification and Regression Trees.
25. Dumitrascu B., Villar S., Mixon D.G., Engelhardt B.E. Optimal marker gene selection for cell type discrimination in single cell analyses. Nat. Commun. 2021;12:1186. doi: 10.1038/s41467-021-21453-4.
26. Vargo A.H.S., Gilbert A.C. A rank-based marker selection method for high throughput scRNA-seq data. BMC Bioinf. 2020;21:477. doi: 10.1186/s12859-020-03641-z.
27. Kira K., Rendell L.A. Proceedings of the ninth international workshop on Machine learning ML92. Morgan Kaufmann Publishers Inc.; 1992. A practical approach to feature selection; pp. 249–256.
28. Kononenko I. In: Machine Learning: ECML-94 Lecture Notes in Computer Science. Bergadano F., Raedt L., editors. Springer Berlin Heidelberg; 1994. Estimating attributes: analysis and extensions of RELIEF; pp. 171–182.
29. Peng H., Long F., Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005;27:1226–1238. doi: 10.1109/TPAMI.2005.159.
30. Abbas A.R., Wolslegel K., Seshasayee D., Modrusan Z., Clark H.F. Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS One. 2009;4 doi: 10.1371/JOURNAL.PONE.0006098.
31. Gong T., Szustakowski J.D. DeconRNASeq: a statistical framework for deconvolution of heterogeneous tissue samples based on mRNA-Seq data. Bioinformatics. 2013;29:1083–1085. doi: 10.1093/BIOINFORMATICS/BTT090.
32. Vazirani V.V. Springer; 2003. Approximation Algorithms.
33. Rajagopalan S., Vazirani V.V. Proceedings of 1993 IEEE 34th Annual Foundations of Computer Science. 1993. Primal-dual RNC approximation algorithms for (multi)-set (multi)-cover and covering integer programs; pp. 322–331.
34. Rubinstein R.Y. Optimization of computer simulation models with rare events. Eur. J. Oper. Res. 1997;99:89–112. doi: 10.1016/S0377-2217(96)00385-2.
35. De Boer P.T., Kroese D.P., Mannor S., Rubinstein R.Y. A tutorial on the cross-entropy method. Ann. Oper. Res. 2005;134:19–67. doi: 10.1007/S10479-005-5724-Z.
36. Quinlan J.R. Induction of decision trees. Mach. Learn. 1986;1:81–106. doi: 10.1007/BF00116251.
37. Kozachenko L.F., Leonenko N.N. Sample estimate of the entropy of a random vector. Probl. Peredachi Infor. 1987;23:9–16.
38. Kraskov A., Stögbauer H., Grassberger P. Estimating mutual information. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 2004;69 doi: 10.1103/PhysRevE.69.066138.
39. Adams T.S., Schupp J.C., Poli S., Ayaub E.A., Neumark N., Ahangari F., Chu S.G., Raby B.A., DeIuliis G., Januszyk M., et al. Single-cell RNA-seq reveals ectopic and aberrant lung-resident cell populations in idiopathic pulmonary fibrosis. Sci. Adv. 2020;6 doi: 10.1126/sciadv.aba1983.
40. Zeisel A., Muñoz-Manchado A.B., Codeluppi S., Lönnerberg P., La Manno G., Juréus A., Marques S., Munguba H., He L., Betsholtz C., et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015;347:1138–1142. doi: 10.1126/SCIENCE.AAA1934/SUPPL_FILE/ZEISEL-SM.PDF.
41. He S., Wang L.-H., Liu Y., Li Y.-Q., Chen H.-T., Xu J.-H., Peng W., Lin G.-W., Wei P.-P., Li B., et al. Single-cell transcriptome profiling of an adult human cell atlas of 15 major organs. Genome Biol. 2020;21:294. doi: 10.1186/s13059-020-02210-0.
42. Segerstolpe Å., Palasantza A., Eliasson P., Andersson E.-M., Andréasson A.C., Sun X., Picelli S., Sabirsh A., Clausen M., Bjursell M.K., et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 2016;24:593–607. doi: 10.1016/j.cmet.2016.08.020.
43. Muraro M.J., Dharmadhikari G., Grün D., Groen N., Dielen T., Jansen E., van Gurp L., Engelse M.A., Carlotti F., de Koning E.J.P., van Oudenaarden A. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 2016;3:385–394.e3. doi: 10.1016/j.cels.2016.09.002.
44. Tsoucas D., Dong R., Chen H., Zhu Q., Guo G., Yuan G.-C. Accurate estimation of cell-type composition from gene expression data. Nat. Commun. 2019;10:2975. doi: 10.1038/s41467-019-10802-z.
45. Lin J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory. 1991;37:145–151. doi: 10.1109/18.61115.
46. Nogueira S., Sechidis K., Brown G. On the stability of feature selection algorithms. J. Mach. Learn. Res. 2018;18:1–54.
47. Subramanian A., Tamayo P., Mootha V.K., Mukherjee S., Ebert B.L., Gillette M.A., Paulovich A., Pomeroy S.L., Golub T.R., Lander E.S., Mesirov J.P. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102.
48. Börner K., Teichmann S.A., Quardokus E.M., Gee J.C., Browne K., Osumi-Sutherland D., Herr B.W., Bueckle A., Paul H., Haniffa M., et al. Anatomical structures, cell types and biomarkers of the Human Reference Atlas. Nat. Cell Biol. 2021;23:1117–1128. doi: 10.1038/s41556-021-00788-6.
49. Coulombe P.A., Wong P. Cytoplasmic intermediate filaments revealed as dynamic and multipurpose scaffolds. Nat. Cell Biol. 2004;6:699–706. doi: 10.1038/ncb0804-699.
50. Saha S.K., Kim K., Yang G.M., Choi H.Y., Cho S.G. Cytokeratin 19 (KRT19) has a role in the reprogramming of cancer stem cell-like cells to less aggressive and more drug-sensitive cells. Int. J. Mol. Sci. 2018;19:E1423. doi: 10.3390/IJMS19051423.
51. Kubo F., Ariestanti D.M., Oki S., Fukuzawa T., Demizu R., Sato T., Sabirin R.M., Hirose S., Nakamura N. Loss of the adhesion G-protein coupled receptor ADGRF5 in mice induces airway inflammation and the expression of CCL2 in lung endothelial cells 11 Medical and Health Sciences 1102 Cardiorespiratory Medicine and Haematology. Respir. Res. 2019;20:11–21. doi: 10.1186/S12931-019-0973-6/FIGURES/11.
52. Vazquez B.N., Laguna T., Carabana J., Krangel M.S., Lauzurica P. CD69 gene is differentially regulated in T and B cells by evolutionarily conserved promoter-distal elements. J. Immunol. 2009;183:6513–6521. doi: 10.4049/JIMMUNOL.0900839.
53. Ziegler S.F., Ramsdell F., Alderson M.R. The activation antigen CD69. Stem Cell. 1994;12:456–465. doi: 10.1002/STEM.5530120502.
54. Plešingerová H., Janovská P., Mishra A., Smyčková L., Poppová L., Libra A., Plevová K., Ovesná P., Radová L., Doubek M., et al. Expression of COBLL1 encoding novel ROR1 binding partner is robust predictor of survival in chronic lymphocytic leukemia. Haematologica. 2018;103:313–324. doi: 10.3324/HAEMATOL.2017.178699.
55. Castro C.D., Flajnik M.F. Putting J-chain back on the map: how might its expression define plasma cell development? J. Immunol. 2014;193:3248–3255. doi: 10.4049/JIMMUNOL.1400531.
56. De Plaen I.G., Han X.B., Liu X., Hsueh W., Ghosh S., May M.J. Lipopolysaccharide induces CXCL2/macrophage inflammatory protein-2 gene expression in enterocytes via NF-kappaB activation: independence from endogenous TNF-alpha and platelet-activating factor. Immunology. 2006;118:153–163. doi: 10.1111/J.1365-2567.2006.02344.X.
57. Robertson M.J. Role of chemokines in the biology of natural killer cells. J. Leukoc. Biol. 2002;71:173–183.
58. Molleran Lee S., Villanueva J., Sumegi J., Zhang K., Kogawa K., Davis J., Filipovich A.H. Characterisation of diverse PRF1 mutations leading to decreased natural killer cell activity in North American families with haemophagocytic lymphohistiocytosis. J. Med. Genet. 2004;41:137–144. doi: 10.1136/JMG.2003.011528.
59. Valés-Gómez M., Esteso G., Aydogmus C., Blázquez-Moreno A., Marín A.V., Briones A.C., Garcillán B., García-Cuesta E.M., López Cobo S., Haskologlu S., et al. Natural killer cell hyporesponsiveness and impaired development in a CD247-deficient patient. J. Allergy Clin. Immunol. 2016;137:942–945.e4. doi: 10.1016/J.JACI.2015.07.051.
60. Vanderbilt J.N., Mager E.M., Allen L., Sawa T., Wiener-Kronish J., Gonzalez R., Dobbs L.G. CXC chemokines and their receptors are expressed in type II cells and upregulated following lung injury. Am. J. Respir. Cell Mol. Biol. 2003;29:661–668. doi: 10.1165/RCMB.2002-0227OC.
61. Shi Y., Zhao Y., Zhang Y., Aierken N., Shao N., Ye R., Lin Y., Wang S. AFF3 upregulation mediates tamoxifen resistance in breast cancers. J. Exp. Clin. Cancer Res. 2018;37:254. doi: 10.1186/S13046-018-0928-7.
62. Maher K., Konjar S., Watts C., Turk B., Kopitar-Jerala N. Cystatin F regulates proteinase activity in IL-2-activated natural killer cells. Protein Pept. Lett. 2014;21:957–965. doi: 10.2174/0929866521666140403124146.
63. Ronchetti S., Ricci E., Petrillo M.G., Cari L., Migliorati G., Nocentini G., Riccardi C. Glucocorticoid-induced tumour necrosis factor receptor-related protein: a key marker of functional regulatory T cells. J. Immunol. Res. 2015;2015:171520. doi: 10.1155/2015/171520.
64. Fu Y., Huang X., Zhang P., van de Leemput J., Han Z. Single-cell RNA sequencing identifies novel cell types in Drosophila blood. J. Genet. Genomics Yi Chuan Xue Bao. 2020;47:175–186. doi: 10.1016/j.jgg.2020.02.004.
65. Shekhar K., Menon V. In: Computational Methods for Single-Cell Data Analysis Methods in Molecular Biology. Yuan G.-C., editor. Springer; 2019. Identification of cell types from single-cell transcriptomic data; pp. 45–77.
66. Wilkerson B.A., Zebroski H.L., Finkbeiner C.R., Chitsazan A.D., Beach K.E., Sen N., Zhang R.C., Bermingham-McDonogh O. Novel cell types and developmental lineages revealed by single-cell RNA-seq analysis of the mouse crista ampullaris. Elife. 2021;10 doi: 10.7554/eLife.60108.
67. Wu H., Kirita Y., Donnelly E.L., Humphreys B.D. Advantages of single-nucleus over single-cell RNA sequencing of adult kidney: rare cell types and novel cell states revealed in fibrosis. J. Am. Soc. Nephrol. 2019;30:23–32. doi: 10.1681/ASN.2018090912.
68. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830.
69. Johnson J., Douze M., Jégou H. Billion-scale similarity search with GPUs. Preprint at arXiv. 2017. doi: 10.48550/arXiv.1702.08734.
70. Johnson D.S. Approximation algorithms for combinatorial problems. J. Comput. Syst. Sci. 1974;9:256–278. doi: 10.1016/S0022-0000(74)80044-9.
71. Chvatal V. A greedy heuristic for the set-covering problem. Math. Oper. Res. 1979;4:233–235.
72. Rubinstein R. The cross-entropy method for combinatorial and continuous optimization. Methodol. Comput. Appl. Probab. 1999;1:127–190. doi: 10.1023/A:1010091220143.
73. Welch B.L. The generalisation of student’s problems when several different population variances are involved. Biometrika. 1947;34:28–35. doi: 10.1093/biomet/34.1-2.28.
74. Wolf F.A., Angerer P., Theis F.J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:15. doi: 10.1186/s13059-017-1382-0.
75. Kullback S., Leibler R.A. On information and sufficiency. Ann. Math. Statist. 1951;22:79–86. doi: 10.1214/aoms/1177729694.
76. Chen E.Y., Tan C.M., Kou Y., Duan Q., Wang Z., Meirelles G.V., Clark N.R., Ma’ayan A. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinf. 2013;14:128. doi: 10.1186/1471-2105-14-128.