Abstract
Although single cell RNA-sequencing (scRNA-seq) provides unprecedented insights into the biology of complex tissues, analyzing such data on a gene-by-gene basis is challenging due to the large number of tested hypotheses and consequent low statistical power and difficult interpretation. These issues are magnified by the increased noise, significant sparsity and multi-modal distributions characteristic of single cell data. One promising approach for addressing these challenges is gene set testing, or pathway analysis. Unfortunately, statistical and biological differences between single cell and bulk transcriptomic data make it challenging to use existing gene set collections, which were developed for bulk tissue analysis, on scRNA-seq data. In this paper, we describe a procedure for customizing gene set collections originally created for bulk tissue analysis to reflect the structure of gene activity within specific cell types. Our approach leverages information about mean gene expression in the 81 human cell types profiled via scRNA-seq by the Human Protein Atlas (HPA) Single Cell Type Atlas. This HPA information is used to compute cell type-specific gene and gene set weights that can be used to filter or weight gene set collections. As demonstrated through the analysis of immune cell scRNA-seq data using gene sets from the Molecular Signatures Database (MSigDB), accounting for cell type-specificity can significantly improve gene set testing power and interpretability.
Keywords: gene set testing, pathway analysis, single cell transcriptomics, gene set optimization, cell type specificity
1. Introduction
1.1. Single cell analysis challenges
Although single cell assays such as single cell RNA-sequencing (scRNA-seq) [15] are a powerful tool for studying complex tissues, technical and biological limitations make statistical analysis challenging [29,10]. Single cell methods profile very small amounts of genomic material, leading to significant amplification bias and sparsity relative to bulk tissue assays [2]. Single cell approaches for quality control, normalization and statistical analysis (e.g., zero-inflated models) only partially address these challenges [16,18]. In addition to the challenges of noise and sparsity, important biological differences exist between bulk tissue and single cell data. As the average over a large number of cells, bulk tissue measurements are non-sparse, typically unimodal and, in many cases, approximately normally distributed. In contrast, single cell datasets reflect a heterogeneous mixture of cell types and states resulting in multi-modal and non-normal distributions [2]. The diverse mixture of cell types and states found in complex tissues also leads to significant differences in gene expression patterns between bulk tissue and single cell data. As evidenced by projects such as the Human Protein Atlas (HPA) [26], gene activity in bulk tissue, as quantified via gene expression or protein abundance, can differ substantially from the activity occurring within the cell subpopulations comprising the tissue.
The HPA repository was recently updated with the Single Cell Type Atlas (SCTA) [25], which captures gene expression values for 81 common human cell types as measured by scRNA-seq on healthy tissue for 31 different tissue types. For the HPA SCTA, the source scRNA-seq data was obtained from the Single Cell Expression Atlas (SCEA) [17], the Human Cell Atlas (HCA) [21,11], the Gene Expression Omnibus (GEO) [3], and the European Genome-phenome Archive (EGA) [13]. The datasets from these source repositories were carefully curated to identify high-quality scRNA-seq data measured on 31 different tissue types where the samples were obtained from healthy individuals and processed without cell type enrichment. Using this data, cell type clusters were identified representing 81 distinct cell types. The mean expression profile of each cell type enables the quantification of the cell type-specificity of human protein coding genes. Importantly, mean gene expression differs not only between different cell types but also between bulk tissue samples and the cell types that comprise that tissue. Figure 1 illustrates these marginal differences for a subset of genes in the Molecular Signatures Database (MSigDB) [14] Hallmark TGF-β signaling pathway based on the cell types captured via scRNA-seq in human skin [23] as represented in the HPA SCTA. As illustrated by this figure, gene expression values measured via scRNA-seq on the cell types that comprise skin can look very different from the values computed via bulk RNA-seq on skin samples, which will be a weighted average of the cell type-specific measurements.
Fig. 1.

Cell type-specific expression of genes in the MSigDB Hallmark TGF-β signaling pathway. Cells represent the fold-change in mean expression of the gene in a given cell type relative to the average across all 81 cell types profiled by the HPA Single Cell Type Atlas.
The pattern of co-expression can also vary significantly between single cells and bulk tissue. A comparison of gene co-expression in single cell and bulk glioblastoma samples performed by Wang et al. [27] found that over 90% of the gene co-expression pairs were unique to either the bulk or single cell data. This dramatic difference in the pattern of co-expression is due to the fact that co-expression at the bulk tissue level is often driven by variation in the proportion of cell types in a given tissue which can bear little resemblance to gene co-expression across cells of a specific type [27,4]. Genes that are uncorrelated at the single cell level can appear to be correlated at the bulk tissue level if the mean expression varies across cell types and cell type proportions vary across bulk samples. The inverse is also possible, i.e., genes whose expression is correlated at the single cell level can appear uncorrelated in bulk tissue.
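The mechanism described above can be illustrated with a small simulation. The following is a minimal sketch under invented assumptions (two hypothetical cell types, two hypothetical genes, Gaussian expression values); it is not drawn from the data analyzed in this paper. Within each cell type the two genes are sampled independently, yet their bulk averages become strongly correlated once the cell type proportion varies across bulk samples.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical mean expression of genes A and B in two cell types.
# Within a cell type, A and B are sampled independently (uncorrelated).
MEANS = {"type1": (2.0, 2.0), "type2": (10.0, 10.0)}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_cell(cell_type):
    mean_a, mean_b = MEANS[cell_type]
    return random.gauss(mean_a, 1.0), random.gauss(mean_b, 1.0)


def simulate_bulk_sample(n_cells=200):
    # The proportion of type2 cells differs from one bulk sample to the next.
    prop_type2 = random.random()
    cells = [
        simulate_cell("type2" if random.random() < prop_type2 else "type1")
        for _ in range(n_cells)
    ]
    # A bulk measurement is the average over all cells in the sample.
    bulk_a = statistics.fmean([a for a, _ in cells])
    bulk_b = statistics.fmean([b for _, b in cells])
    return bulk_a, bulk_b


# Correlation across cells of a single type.
type1_cells = [simulate_cell("type1") for _ in range(2000)]
r_single = pearson([a for a, _ in type1_cells], [b for _, b in type1_cells])

# Correlation across bulk samples with varying cell type proportions.
bulk = [simulate_bulk_sample() for _ in range(100)]
r_bulk = pearson([a for a, _ in bulk], [b for _, b in bulk])

print(f"within one cell type: r = {r_single:.2f}")
print(f"across bulk samples:  r = {r_bulk:.2f}")
```

In this toy setting the within-type correlation is close to zero while the bulk correlation is close to one, because the bulk signal is dominated by the shared dependence of both genes on the cell type proportion.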
A similar issue exists for the association of gene expression values with a given experimental condition, i.e., the experimental associations found at the bulk tissue level can be very dissimilar to those found for a specific cell type. Figure 2 provides a simplified illustration of the marginal and joint distribution characteristics of single cell and bulk tissue expression data. In this figure, the marginal distribution is represented by density plots for a single gene while the joint distribution is represented by covariance matrices. Collectively, the distributional differences between single cell and bulk tissue genomic data make it challenging to successfully analyze single cell expression data using biological models originally developed for bulk tissue, which represent the pattern of gene product abundance within an average cell.
Fig. 2.

Bulk tissue vs. single cell distributions. The middle and bottom rows illustrate approximate marginal and joint expression distributions.
1.2. Gene set testing for single cell data
Although high-dimensional genomic data provides a molecular-level lens on biological systems, the gain in fidelity obtained by testing thousands of genomic variables comes at the price of impaired interpretation, loss of power due to multiple hypothesis correction and poor reproducibility [1,9]. To help address these challenges for bulk tissue data, researchers developed gene set testing, or pathway analysis, methods [12]. Gene set testing is an effective hypothesis aggregation technique that lets researchers step back from the level of individual genomic variables and explore associations for biologically meaningful groups of genes. Focusing on a small number of pathways can substantially improve power, interpretation and replication relative to an analysis focused on individual genomic variables [24]. The benefits that gene set-based hypothesis aggregation offers for the analysis of bulk tissue data are even more pronounced for single cell data given increased technical variance and sparsity. Although significant progress has been made developing gene set testing methods, including methods developed by us that are specifically optimized for scRNA-seq data [7,6], and building and maintaining gene set collections, existing collections were largely developed for the analysis of bulk tissue data. This is problematic since many gene sets in collections like the Molecular Signatures Database (MSigDB) [14] are defined to contain groups of genes whose expression in bulk tissue is correlated (e.g., MSigDB cancer modules [22]) or is associated with a specific experimental variable (e.g., MSigDB chemical and genetic perturbations). Such gene sets will often represent biological associations that do not hold at the single cell level [27]. It is important to note that new collections, e.g., the Human Cell Atlas-based MSigDB C8 collection [19], are being developed that contain gene sets derived from scRNA-seq data.
2. Data and methods
To address the bulk tissue bias that exists in most public gene set collections, we have developed a procedure for customizing gene set collections to reflect the structure of gene activity within specific cell types as measured by single cell transcriptomic assays. Our approach leverages information about mean gene expression in the 81 human cell types profiled via scRNA-seq by the HPA SCTA. As detailed below, this SCTA information was used to compute cell type-specific gene and gene set weights that can be used to filter or weight gene set collections. An example vignette and gene and gene set weights for the 81 HPA SCTA cell types and MSigDB collections are available at https://hrfrost.host.dartmouth.edu/SCGeneSetOpt/.
2.1. Data sources
The following data sources were leveraged to compute the cell type-specific gene and gene set weights and generate the paper results:
Human Protein Atlas Single Cell Type Atlas (HPA SCTA): Information about the cell type-specific expression of human protein coding genes was obtained from the HPA SCTA via the downloadable file https://www.proteinatlas.org/download/rna_single_cell_type.tsv.zip.
Molecular Signatures Database (MSigDB): Gene set definitions were obtained from the MSigDB v2024.1 via the downloadable files at https://www.gsea-msigdb.org/gsea/msigdb/index.jsp.
10x PBMC3k scRNA-seq data: The PBMC scRNA-seq dataset used to generate the results in Section 3 is also used in the Seurat Guided Clustering Tutorial (https://satijalab.org/seurat/articles/pbmc3k_tutorial.html), is included in the SeuratData R package and is freely accessible from 10x Genomics via a Creative Commons Attribution license at https://cf.10xgenomics.com/samples/cell/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz. Processing of the PBMC3k dataset was performed using the same logic employed in the Seurat Guided Clustering Tutorial, which is also contained in the vignette available at https://hrfrost.host.dartmouth.edu/SCGeneSetOpt/.
2.2. Computation of cell type-specific gene and gene set weights
Building on our prior work creating customized versions of MSigDB for different normal human tissue types [5], our method first computes cell type-specific weights for each protein coding human gene for all 81 normal human cell types profiled by the HPA SCTA ($w_{g,c}$ for gene $g$ in cell type $c$). Specifically, $w_{g,c}$ is set to the fold-change of the mean normalized transcript abundance of gene $g$ in cell type $c$, as computed via scRNA-seq, relative to the average across all 81 cell types. These gene-level weights are then leveraged to compute cell type-specific gene set weights for the sets in the MSigDB collections. Specifically, a weight $w_s$ for each gene set $s$ in a target collection is computed as the −log of the p-value from a one-sided, two-sample t test comparing the $w_{g,c}$ values for the genes in set $s$ to the $w_{g,c}$ values for genes not in $s$ (this is similar to the competitive gene set test implemented by the cameraPR method in the R limma package [20]).
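The two weight computations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used for the paper results: the function names and toy expression values are hypothetical, the test direction (set members have larger weights) and the base-10 logarithm are assumptions, and the upper-tail p-value uses a standard normal approximation to the t distribution, which is only reasonable when the number of genes is large.

```python
import math
import statistics


def gene_weights(mean_expr, cell_type):
    """Fold-change of each gene's mean expression in `cell_type` relative to
    its average across all profiled cell types.

    mean_expr: dict mapping gene -> dict mapping cell type -> mean expression.
    """
    weights = {}
    for gene, by_type in mean_expr.items():
        overall = statistics.fmean(by_type.values())
        # Assumption: genes with no expression anywhere get a neutral weight.
        weights[gene] = by_type[cell_type] / overall if overall > 0 else 1.0
    return weights


def gene_set_weight(weights, gene_set):
    """-log10 p-value of a one-sided, pooled-variance two-sample t statistic
    comparing the weights of genes in the set to the weights of all others."""
    inside = [w for g, w in weights.items() if g in gene_set]
    outside = [w for g, w in weights.items() if g not in gene_set]
    n1, n2 = len(inside), len(outside)
    pooled_var = (
        (n1 - 1) * statistics.variance(inside)
        + (n2 - 1) * statistics.variance(outside)
    ) / (n1 + n2 - 2)
    t_stat = (statistics.fmean(inside) - statistics.fmean(outside)) / math.sqrt(
        pooled_var * (1 / n1 + 1 / n2)
    )
    # Normal approximation to the upper tail of the t distribution.
    p_value = 0.5 * math.erfc(t_stat / math.sqrt(2))
    # Guard against underflow to zero before taking the log.
    return -math.log10(max(p_value, 1e-300))


# Toy mean expression values for three hypothetical cell types.
mean_expr = {
    "gene1": {"B": 90.0, "T": 5.0, "NK": 5.0},
    "gene2": {"B": 80.0, "T": 10.0, "NK": 10.0},
    "gene3": {"B": 70.0, "T": 10.0, "NK": 20.0},
    "gene4": {"B": 5.0, "T": 90.0, "NK": 25.0},
    "gene5": {"B": 10.0, "T": 80.0, "NK": 30.0},
    "gene6": {"B": 50.0, "T": 50.0, "NK": 50.0},
    "gene7": {"B": 40.0, "T": 45.0, "NK": 35.0},
    "gene8": {"B": 5.0, "T": 25.0, "NK": 90.0},
}

b_weights = gene_weights(mean_expr, "B")
b_like = gene_set_weight(b_weights, {"gene1", "gene2", "gene3"})
t_like = gene_set_weight(b_weights, {"gene4", "gene5"})
print(f"B-like set weight: {b_like:.2f}, T-like set weight: {t_like:.2f}")
```

With B cell weights, the set made up of genes that are enriched in the hypothetical B cell profile receives a much larger weight than the set of genes enriched in the T cell profile.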
2.3. Using gene weights for annotation filtering and weighting
The cell type-specific gene weights can be used to customize gene sets via either annotation filtering or weighting. Gene set annotations can be customized for cell type $c$ by simply removing all annotations for each gene $g$ if $w_{g,c} < t$, where $t$ is a threshold that can be user specified or selected to optimize a gene set testing performance metric, e.g., replication of gene set testing results across related datasets. Filtering has the benefits of parsimony (it results in smaller and more easily interpreted sets) and improved power (the remaining set members should be more likely to capture the relevant biological signal), and it can work with any gene set testing method; however, it does require the specification of a threshold and ignores most of the information contained in the continuous weights. An alternate approach that uses the continuous gene weights and does not require a threshold is annotation weighting. In this scenario, the log of the gene weights is used to provide a directional weight that can be used with gene set testing methods like fry (see Section 2.6 for more details) that accept gene weights.
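Both options can be sketched as follows. The function names and toy weights are hypothetical, and two details are assumptions: genes without a weight are dropped during filtering, and the pseudo-count (the value of 1e-4 used in Section 3.3) is added inside the base-2 logarithm.

```python
import math


def filter_annotations(gene_sets, weights, threshold):
    """Remove gene-to-set annotations whose cell type-specific gene weight is
    below `threshold`. Genes without a weight are also removed (assumption)."""
    return {
        name: {g for g in genes if weights.get(g, 0.0) >= threshold}
        for name, genes in gene_sets.items()
    }


def directional_weights(weights, pseudo_count=1e-4):
    """log2 of each gene weight plus a small pseudo-count. The result is
    positive for genes expressed above the cross-cell-type average and
    negative for genes expressed below it."""
    return {g: math.log2(w + pseudo_count) for g, w in weights.items()}


# Toy fold-change weights for one cell type.
weights = {"gene1": 4.0, "gene2": 1.0, "gene3": 0.25}
gene_sets = {"setA": {"gene1", "gene2", "gene3"}}

filtered = filter_annotations(gene_sets, weights, threshold=1.0)
log_weights = directional_weights(weights)

print(sorted(filtered["setA"]))
print({g: round(w, 3) for g, w in log_weights.items()})
```

Filtering discards `gene3` entirely, whereas weighting keeps it in the set with a negative directional weight, which is the information that a threshold-based approach throws away.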
2.4. Using gene set weights for collection filtering and weighting
Similar to the application of gene-level weights, the cell type-specific gene set weights can be used for either filtering or weighting. Filtering can be performed by removing (or not using) all sets in a given collection where the gene set weight $w_s$ is less than some threshold. Collection filtering has the benefits of interpretation and statistical power since a smaller number of more biologically relevant gene sets are tested, which makes interpretation easier and reduces the multiple hypothesis correction penalty. Like annotation filtering, the downsides of collection filtering include the need for a specific threshold and the fact that most of the information in the weights is not used. The weights can alternatively be used for p-value weighting (e.g., weighted FDR [8]) following gene set testing, which avoids the need for a specific threshold.
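The sketch below contrasts the two uses of gene set weights on toy values. It is illustrative only: the function names, p-values and weights are hypothetical, and the weighted procedure shown is one common form of p-value weighting (divide each p-value by its weight after rescaling the weights to average 1, then apply the Benjamini-Hochberg adjustment), which may differ in detail from the procedure a given analysis would use.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def filter_collection(set_weights, threshold):
    """Names of the gene sets whose weight is at least `threshold`."""
    return [name for name, w in set_weights.items() if w >= threshold]


def weighted_fdr(p_values, set_weights):
    """Weighted FDR: scale each p-value by its mean-normalized weight, then
    apply Benjamini-Hochberg. Sets with zero weight get a p-value of 1."""
    names = list(p_values)
    mean_w = sum(set_weights[n] for n in names) / len(names)

    def scaled(name):
        w = set_weights[name] / mean_w
        return min(1.0, p_values[name] / w) if w > 0 else 1.0

    return dict(zip(names, benjamini_hochberg([scaled(n) for n in names])))


# Toy gene set p-values and cell type-specific gene set weights.
p_values = {"a": 0.01, "b": 0.04, "c": 0.03, "d": 0.20}
set_weights = {"a": 3.0, "b": 0.5, "c": 0.4, "d": 0.1}

unweighted = dict(zip(p_values, benjamini_hochberg(list(p_values.values()))))
weighted = weighted_fdr(p_values, set_weights)
kept = filter_collection(set_weights, threshold=0.5)
filtered = dict(zip(kept, benjamini_hochberg([p_values[n] for n in kept])))

print("unweighted:", unweighted)
print("weighted:  ", weighted)
print("filtered:  ", filtered)
```

In this example the highly weighted set `a` receives a smaller adjusted p-value under both weighting and filtering than under the unweighted adjustment, while low-weight sets are either penalized (weighting) or not tested at all (filtering).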
2.5. Choice of cell type weights
The effectiveness of the filtering and weighting techniques detailed in Sections 2.3 and 2.4 is strongly dependent on which cell type weights are employed. Considerations for several common scenarios are discussed below:
Analysis of a single cell type in different experimental conditions: For this scenario, gene set analysis is typically performed to identify sets that are differentially active between the experimental conditions. Using gene and/or gene set weights for the type of cell in the dataset is usually appropriate since this will prioritize the pathways/functions most critical to the biology of that cell type.
Analysis of multiple cell types in different experimental conditions: For this scenario, the goal of gene set analysis is usually to identify sets that are differentially active between the experimental conditions irrespective of the cell type. In this case, using an average (or weighted average) of the weights for all cell types present in the data can be effective. Similar to the single cell type case, this will prioritize the pathways/functions most relevant to the function of those cell types.
Comparative analysis of different cell types in the same experimental condition: For this scenario, gene set analysis aims to identify sets that are differentially active in cells of one type relative to cells of a different type. If the analysis is primarily focused on one of the cell types, then the weights for that type could be used. Alternatively, the average of the weights (or maximum weight) for the analyzed cell types could be used.
The results in Section 3 correspond to the last scenario and weights for just one of the cell types in the comparison are used (either B cell or T cell weights).
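The averaging and maximum options mentioned in the scenarios above can be sketched as a small helper. The function name, its arguments and the toy weights are hypothetical; weighting the average by cell type proportions is shown as one possible reading of "weighted average".

```python
def combine_weights(weights_by_type, cell_types, how="mean", proportions=None):
    """Combine per-cell-type weights into a single weight per gene (or set).

    weights_by_type: dict mapping cell type -> dict mapping gene -> weight.
    how: "mean" (optionally weighted by `proportions`) or "max".
    proportions: optional dict mapping cell type -> relative abundance.
    """
    if how not in ("mean", "max"):
        raise ValueError("how must be 'mean' or 'max'")
    # Only genes with a weight in every requested cell type are combined.
    genes = set.intersection(*(set(weights_by_type[c]) for c in cell_types))
    combined = {}
    for g in genes:
        values = [weights_by_type[c][g] for c in cell_types]
        if how == "max":
            combined[g] = max(values)
        elif proportions is None:
            combined[g] = sum(values) / len(values)
        else:
            total = sum(proportions[c] for c in cell_types)
            combined[g] = (
                sum(proportions[c] * weights_by_type[c][g] for c in cell_types)
                / total
            )
    return combined


# Toy gene weights for two cell types.
weights_by_type = {
    "B": {"gene1": 3.0, "gene2": 0.5},
    "T": {"gene1": 1.0, "gene2": 2.5},
}

mean_w = combine_weights(weights_by_type, ["B", "T"])
max_w = combine_weights(weights_by_type, ["B", "T"], how="max")
prop_w = combine_weights(
    weights_by_type, ["B", "T"], proportions={"B": 0.25, "T": 0.75}
)
print(mean_w, max_w, prop_w)
```

The maximum keeps any gene that is prominent in at least one of the compared cell types, while the average favors genes that are prominent in all of them.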
2.6. Gene set testing
The gene set testing results shown in Section 3 were generated using two techniques, both available via the limma R package [20], for population-level gene set analysis:
Camera [20]: This technique implements a competitive and population-level gene set test that accounts for inter-gene correlation.
Fry [28]: This technique implements a self-contained and population-level gene set test that can accept gene-level weights.
Both methods were executed using default parameters unless otherwise specified.
3. Results
To evaluate the cell type-specific weight model detailed in Section 2.2, we computed gene and gene set weights for all 81 normal human cell types included in the HPA SCTA. For each of these cell types, gene weights were generated for all human protein coding genes profiled by the HPA SCTA and gene set weights were calculated for all collections in v2024.1 of the MSigDB. To illustrate the utility of these weights and their application for the filtering and weighting approaches detailed in Sections 2.3 and 2.4, we performed various gene set analyses of the 10x PBMC3k example scRNA-seq dataset (visualized in Figure 3) using both B and T cell weights and the sets in the Gene Ontology Biological Process collection (MSigDB collection C5.GO.BP). Results from these analyses are detailed in Sections 3.1–3.4 below.
Fig. 3.

Projection of PBMC scRNA-seq data onto the first two UMAP dimensions. Each point in the plot represents one cell.
Both the cell type-specific weights and R logic for the B cell-based results are available on the paper website (https://hrfrost.host.dartmouth.edu/SCGeneSetOpt/).
3.1. Collection filtering using B cell gene set weights
As illustrated by Table 1, which lists the top ten MSigDB C5.GO.BP (Gene Ontology Biological Process) terms for B cells, the gene sets with the largest $w_s$ values clearly reflect the known biological features of B cells.
Table 1.
Top 10 MSigDB C5.GO.BP gene sets based on B cell gene set weights.
| Gene set | Weight ($w_s$) |
|---|---|
| GOBP_B_CELL_RECEPTOR_SIGNALING_PATHWAY | 412 |
| GOBP_B_CELL_ACTIVATION | 270 |
| GOBP_ANTIGEN_RECEPTOR_MEDIATED_SIGNALING_PATHWAY | 206 |
| GOBP_ADAPTIVE_IMMUNE_RESPONSE | 205 |
| GOBP_LYMPHOCYTE_ACTIVATION | 200 |
| GOBP_B_CELL_PROLIFERATION | 186 |
| GOBP_IMMUNE_RESPONSE_REGULATING_CELL_SURFACE_RECEPTOR_SIGNALING | 185 |
| GOBP_CELL_ACTIVATION | 161 |
| GOBP_POSITIVE_REGULATION_OF_IMMUNE_RESPONSE | 155 |
| GOBP_IMMUNE_RESPONSE_REGULATING_SIGNALING_PATHWAY | 155 |
Following the approach outlined in Section 2.4, we used the B cell-based gene set weights to filter the MSigDB C5.GO.BP collection with the goal of both improving statistical power by reducing the multiple hypothesis correction burden and improving the biological relevance and interpretability of the analysis by only testing sets specific to B cells. To perform this analysis, we created a filtered version of the C5.GO.BP collection that retained the sets with B cell weights above 10, which corresponds to 4.5% of the 5,777 C5.GO.BP sets retained after alignment with the PBMC scRNA-seq genes (annotations were removed for genes not captured in the PBMC data and then sets were eliminated if they had fewer than 5 or more than 200 members). We then performed a population-level and competitive gene set analysis using the camera method comparing set expression among B cells against expression in non-B cells. Weight-based collection filtering had the desired effect of improving the multiple correction-adjusted statistical significance of the results. Specifically, without filtering only 36 terms had FDR values < 0.1; with filtering this number increased to 67. This trend is visualized in Figure 4.
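The collection preparation steps described above (alignment with the measured genes, the size filter, and the weight filter) can be sketched as follows. The function names and toy inputs are hypothetical; the defaults mirror the limits of 5 and 200 members and the weight cutoff of 10 stated in the text.

```python
def align_and_filter(gene_sets, measured_genes, min_size=5, max_size=200):
    """Drop annotations for genes absent from the dataset, then drop sets
    whose remaining size falls outside [min_size, max_size]."""
    measured = set(measured_genes)
    aligned = {name: set(genes) & measured for name, genes in gene_sets.items()}
    return {
        name: genes
        for name, genes in aligned.items()
        if min_size <= len(genes) <= max_size
    }


def keep_by_weight(gene_sets, set_weights, threshold=10.0):
    """Retain only the sets whose cell type-specific weight exceeds
    `threshold`; sets without a weight are dropped (assumption)."""
    return {
        name: genes
        for name, genes in gene_sets.items()
        if set_weights.get(name, 0.0) > threshold
    }


# Toy collection: "x1" and "x2" stand for genes not captured in the dataset.
measured_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
collection = {
    "set_a": ["g1", "g2", "g3", "g4", "g5", "x1"],
    "set_b": ["g1", "g2", "x2"],
    "set_c": ["g2", "g3", "g4", "g5", "g6"],
}
set_weights = {"set_a": 25.0, "set_b": 40.0, "set_c": 3.0}

aligned = align_and_filter(collection, measured_genes)
retained = keep_by_weight(aligned, set_weights)
print(sorted(aligned), sorted(retained))
```

Here `set_b` is removed by the size filter despite its large weight, and `set_c` is removed by the weight filter, so only `set_a` would be tested and counted toward the multiple hypothesis correction.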
Fig. 4.

Distribution of FDR values from a gene set analysis comparing B cells against other cell types in the PBMC data using the MSigDB C5.GO.BP collection. Each point captures the FDR values for one of the terms remaining after collection filtering with the x-axis representing the non-filtered FDR value and the y-axis representing the filtered FDR value. Blue lines reflect the FDR threshold of 0.1 and the red line reflects expected distribution for equal FDR values.
3.2. Collection filtering using T cell gene set weights
Table 2 lists the top ten MSigDB C5.GO.BP (Gene Ontology Biological Process) terms according to the T cell-specific gene set weights. Similar to the top terms for B cells shown in Table 1, the Gene Ontology terms with the largest $w_s$ values effectively capture the key aspects of T cell biology.
Table 2.
Top 10 MSigDB C5.GO.BP gene sets based on T cell gene set weights.
| Gene set | Weight ($w_s$) |
|---|---|
| GOBP_ADAPTIVE_IMMUNE_RESPONSE | 500.0 |
| GOBP_T_CELL_ACTIVATION | 109 |
| GOBP_T_CELL_RECEPTOR_SIGNALING_PATHWAY | 101 |
| GOBP_LYMPHOCYTE_ACTIVATION | 91 |
| GOBP_ANTIGEN_RECEPTOR_MEDIATED_SIGNALING_PATHWAY | 81 |
| GOBP_BIOLOGICAL_PROCESS_INVOLVED_IN_INTERSPECIES_INTERACTION | 75 |
| GOBP_T_CELL_DIFFERENTIATION | 69 |
| GOBP_ALPHA_BETA_T_CELL_ACTIVATION | 68 |
| GOBP_POSITIVE_REGULATION_OF_IMMUNE_SYSTEM_PROCESS | 68 |
| GOBP_CELL_ACTIVATION | 67 |
Similar to the B cell analysis above, we created a filtered version of the C5.GO.BP collection that retained the sets with T cell weights above 10, which corresponds to 4.4% of the 5,777 C5.GO.BP sets retained after alignment with the PBMC scRNA-seq genes. We then performed a gene set analysis using the camera method comparing set expression among T cells against expression in non-T cells. Weight-based collection filtering again had the desired effect of improving the multiple correction-adjusted statistical significance of the results. Specifically, without filtering 56 terms had FDR values < 0.1; with filtering this number increased to 68. This trend is visualized in Figure 5.
Fig. 5.

Distribution of FDR values from a gene set analysis comparing T cells against other cell types in the PBMC data using the MSigDB C5.GO.BP collection. Each point captures the FDR values for one of the terms remaining after collection filtering with the x-axis representing the non-filtered FDR value and the y-axis representing the filtered FDR value. Blue lines reflect the FDR threshold of 0.1 and the red line reflects expected distribution for equal FDR values.
3.3. Annotation weighting using B cell gene weights
Following the approach in Section 2.3, we used the B cell-based gene weights to perform a weighted gene set analysis using the fry method. Specifically, set expression in B cells was compared to set expression in non-B cells and this analysis was performed both without weights and with gene weights set to the log2 of the B cell-based gene weight plus a pseudo-count of 1e-4 (the log transformation generates sign-based directional weights as needed by fry). This weighting scheme prioritizes genes that are strongly up-regulated or down-regulated in B cells according to the HPA SCTA. The fry method (which is a fast approximation of the roast technique) generated unexpectedly small FDR values (much smaller than camera); however, the rank ordering of the sets based on significance was similar to camera and matched the expected biology of B cells. Use of B cell gene weights improved the statistical power of the gene set analysis. Specifically, without weights 1,965 terms had FDR values < 1e-4; with weights this number increased to 2,348.
3.4. Annotation weighting using T cell gene weights
Similar to the B cell analysis in Section 3.3, we used T cell-based gene weights to perform a weighted gene set analysis comparing set expression in T cells to expression in non-T cells. Use of T cell gene weights also improved the statistical power of the gene set analysis. Specifically, without weights 2,920 terms had FDR values < 1e-4; with weights this number increased to 3,590.
4. Conclusions
Gene set testing is a powerful tool for the analysis of scRNA-seq data that addresses the challenges of sparsity and noise. Unfortunately, the utility of gene set testing for single cell data is limited by the fact that most existing gene set collections were developed to capture gene activity within bulk tissue data, which can differ substantially from gene activity in specific cell types. In particular, the pattern of gene co-expression found among cells of a specific type is often significantly different from the pattern seen in bulk tissue samples, which is driven by cell type proportions. A similar issue exists for the association of gene expression values with a given experimental condition. To address this challenge, we explored methods for computing gene and gene set weights using information about the cell type-specificity of human protein coding genes from the HPA SCTA. These cell type-specific weights can be leveraged to improve the power and interpretability of gene set analyses through the filtering or weighting of gene set collections or gene set annotations. To support this type of analysis by other researchers, an example vignette along with gene and gene set weights for the 81 HPA SCTA cell types and MSigDB collections are available at https://hrfrost.host.dartmouth.edu/SCGeneSetOpt/.
Acknowledgments.
This work was funded by National Institutes of Health grants R35GM146586, R21CA253408, and P30CA023108.
Footnotes
Disclosure of Interests. The authors have no conflicts of interest to declare.
References
- 1. Allison DB, Cui X, Page GP, Sabripour M: Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics 7(1), 55–65 (Jan 2006). https://doi.org/10.1038/nrg1749
- 2. Bacher R, Kendziorski C: Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol 17, 63 (Apr 2016). https://doi.org/10.1186/s13059-016-0927-y
- 3. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A: NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res 41(Database issue), D991–5 (Jan 2013). https://doi.org/10.1093/nar/gks1193
- 4. Crow M, Paul A, Ballouz S, Huang ZJ, Gillis J: Exploiting single-cell expression to characterize co-expression replicability. Genome Biol 17, 101 (May 2016). https://doi.org/10.1186/s13059-016-0964-6
- 5. Frost HR: Computation and application of tissue-specific gene set weights. Bioinformatics (Apr 2018). https://doi.org/10.1093/bioinformatics/bty217
- 6. Frost HR: Reconstruction set test (RESET): A computationally efficient method for single sample gene set testing based on randomized reduced rank reconstruction error. PLoS Comput Biol 20(4), e1012084 (Apr 2024). https://doi.org/10.1371/journal.pcbi.1012084
- 7. Frost HR: Variance-adjusted Mahalanobis (VAM): a fast and accurate method for cell-specific gene set scoring. Nucleic Acids Res (Jul 2020). https://doi.org/10.1093/nar/gkaa582
- 8. Genovese CR, Roeder K, Wasserman L: False discovery control with p-value weighting. Biometrika 93(3), 509–524 (2006). https://doi.org/10.1093/biomet/93.3.509, http://biomet.oxfordjournals.org/content/93/3/509.abstract
- 9. Goeman JJ, Buehlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23(8), 980–987 (Apr 2007). https://doi.org/10.1093/bioinformatics/btm051
- 10. Heumos L, Schaar AC, Lance C, Litinetskaya A, Drost F, Zappia L, Lücken MD, Strobl DC, Henao J, Curion F, Single-cell Best Practices Consortium, Schiller HB, Theis FJ: Best practices for single-cell analysis across modalities. Nat Rev Genet 24(8), 550–572 (Aug 2023). https://doi.org/10.1038/s41576-023-00586-w
- 11. Hon CC, Shin JW, Carninci P, Stubbington MJT: The human cell atlas: Technical approaches and challenges. Brief Funct Genomics (Oct 2017). https://doi.org/10.1093/bfgp/elx029
- 12. Khatri P, Sirota M, Butte AJ: Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Computational Biology 8(2), e1002375 (Feb 2012). https://doi.org/10.1371/journal.pcbi.1002375
- 13. Lappalainen I, Almeida-King J, Kumanduri V, Senf A, Spalding JD, Ur-Rehman S, Saunders G, Kandasamy J, Caccamo M, Leinonen R, Vaughan B, Laurent T, Rowland F, Marin-Garcia P, Barker J, Jokinen P, Torres AC, de Argila JR, Llobet OM, Medina I, Puy MS, Alberich M, de la Torre S, Navarro A, Paschall J, Flicek P: The European genome-phenome archive of human data consented for biomedical research. Nat Genet 47(7), 692–5 (Jul 2015). https://doi.org/10.1038/ng.3312
- 14. Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP: Molecular signatures database (MSigDB) 3.0. Bioinformatics 27(12), 1739–40 (Jun 2011). https://doi.org/10.1093/bioinformatics/btr260
- 15. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA: Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161(5), 1202–1214 (May 2015). https://doi.org/10.1016/j.cell.2015.05.002
- 16. McCarthy DJ, Campbell KR, Lun ATL, Wills QF: Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33(8), 1179–1186 (Apr 2017). https://doi.org/10.1093/bioinformatics/btw777
- 17. Papatheodorou I, Moreno P, Manning J, Fuentes AMP, George N, Fexova S, Fonseca NA, Füllgrabe A, Green M, Huang N, Huerta L, Iqbal H, Jianu M, Mohammed S, Zhao L, Jarnuczak AF, Jupp S, Marioni J, Meyer K, Petryszak R, Prada Medina CA, Talavera-López C, Teichmann S, Vizcaino JA, Brazma A: Expression atlas update: from tissues to single cells. Nucleic Acids Res 48(D1), D77–D83 (Jan 2020). https://doi.org/10.1093/nar/gkz947
- 18. Qiu X, Hill A, Packer J, Lin D, Ma YA, Trapnell C: Single-cell mRNA quantification and differential analysis with Census. Nat Methods 14(3), 309–315 (Mar 2017). https://doi.org/10.1038/nmeth.4150
- 19. Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, Bodenmiller B, Campbell P, Carninci P, Clatworthy M, Clevers H, Deplancke B, Dunham I, Eberwine J, Eils R, Enard W, Farmer A, Fugger L, Göttgens B, Hacohen N, Haniffa M, Hemberg M, Kim S, Klenerman P, Kriegstein A, Lein E, Linnarsson S, Lundberg E, Lundeberg J, Majumder P, Marioni JC, Merad M, Mhlanga M, Nawijn M, Netea M, Nolan G, Pe’er D, Phillipakis A, Ponting CP, Quake S, Reik W, Rozenblatt-Rosen O, Sanes J, Satija R, Schumacher TN, Shalek A, Shapiro E, Sharma P, Shin JW, Stegle O, Stratton M, Stubbington MJT, Theis FJ, Uhlen M, van Oudenaarden A, Wagner A, Watt F, Weissman J, Wold B, Xavier R, Yosef N, Human Cell Atlas Meeting Participants: The human cell atlas. Elife 6 (Dec 2017). https://doi.org/10.7554/eLife.27041
- 20. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK: limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43(7), e47 (Apr 2015). https://doi.org/10.1093/nar/gkv007
- 21. Rozenblatt-Rosen O, Stubbington MJT, Regev A, Teichmann SA: The human cell atlas: from vision to reality. Nature 550(7677), 451–453 (Oct 2017). https://doi.org/10.1038/550451a
- 22. Segal E, Friedman N, Koller D, Regev A: A module map showing conditional activity of expression modules in cancer. Nat Genet 36(10), 1090–8 (Oct 2004). https://doi.org/10.1038/ng1434
- 23. Solé-Boldo L, Raddatz G, Schütz S, Mallm JP, Rippe K, Lonsdorf AS, Rodríguez-Paredes M, Lyko F: Single-cell transcriptomes of the human skin reveal age-related loss of fibroblast priming. Commun Biol 3(1), 188 (Apr 2020). https://doi.org/10.1038/s42003-020-0922-4
- 24. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43), 15545–15550 (Oct 2005). https://doi.org/10.1073/pnas.0506580102
- 25. Thul PJ, Åkesson L, Wiking M, Mahdessian D, Geladaki A, Ait Blal H, Alm T, Asplund A, Björk L, Breckels LM, Bäckström A, Danielsson F, Fagerberg L, Fall J, Gatto L, Gnann C, Hober S, Hjelmare M, Johansson F, Lee S, Lindskog C, Mulder J, Mulvey CM, Nilsson P, Oksvold P, Rockberg J, Schutten R, Schwenk JM, Sivertsson Å, Sjöstedt E, Skogs M, Stadler C, Sullivan DP, Tegel H, Winsnes C, Zhang C, Zwahlen M, Mardinoglu A, Pontén F, von Feilitzen K, Lilley KS, Uhlén M, Lundberg E: A subcellular map of the human proteome. Science 356(6340) (May 2017). https://doi.org/10.1126/science.aal3321
- 26. Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson Å, Kampf C, Sjöstedt E, Asplund A, Olsson I, Edlund K, Lundberg E, Navani S, Szigyarto CAK, Odeberg J, Djureinovic D, Takanen JO, Hober S, Alm T, Edqvist PH, Berling H, Tegel H, Mulder J, Rockberg J, Nilsson P, Schwenk JM, Hamsten M, von Feilitzen K, Forsberg M, Persson L, Johansson F, Zwahlen M, von Heijne G, Nielsen J, Pontén F: Proteomics. Tissue-based map of the human proteome. Science 347(6220), 1260419 (Jan 2015). https://doi.org/10.1126/science.1260419
- 27. Wang J, Xia S, Arand B, Zhu H, Machiraju R, Huang K, Ji H, Qian J: Single-cell co-expression analysis reveals distinct functional modules, co-regulation mechanisms and clinical outcomes. PLoS Comput Biol 12(4), e1004892 (Apr 2016). https://doi.org/10.1371/journal.pcbi.1004892
- 28. Wu D, Lim E, Vaillant F, Asselin-Labat M, Visvader JE, Smyth GK: ROAST: rotation gene set tests for complex microarray experiments. Bioinformatics (Oxford, England) 26(17), 2176–2182 (Sep 2010). https://doi.org/10.1093/bioinformatics/btq401, http://www.ncbi.nlm.nih.gov/pubmed/20610611
- 29. Yuan GC, Cai L, Elowitz M, Enver T, Fan G, Guo G, Irizarry R, Kharchenko P, Kim J, Orkin S, Quackenbush J, Saadatpour A, Schroeder T, Shivdasani R, Tirosh I: Challenges and emerging directions in single-cell analysis. Genome Biol 18(1), 84 (May 2017). https://doi.org/10.1186/s13059-017-1218-y
