Ranking of cell clusters in a single-cell RNA-sequencing analysis framework using prior knowledge

Anastasis Oulas; Kyriaki Savva; Nestoras Karathanasis; George M Spyrou

doi:10.1371/journal.pcbi.1011550

. 2024 Apr 18;20(4):e1011550. doi: 10.1371/journal.pcbi.1011550

Ranking of cell clusters in a single-cell RNA-sequencing analysis framework using prior knowledge

Anastasis Oulas ^1,^*, Kyriaki Savva ¹, Nestoras Karathanasis ¹, George M Spyrou ¹

Editor: Christos A Ouzounis²

PMCID: PMC11060557 PMID: 38635836

Abstract

Prioritization or ranking of different cell types in a single-cell RNA sequencing (scRNA-seq) framework can be performed in a variety of ways, some of these include: i) obtaining an indication of the proportion of cell types between the different conditions under study, ii) counting the number of differentially expressed genes (DEGs) between cell types and conditions in the experiment or, iii) prioritizing cell types based on prior knowledge about the conditions under study (i.e., a specific disease). These methods have drawbacks and limitations thus novel methods for improving cell ranking are required. Here we present a novel methodology that exploits prior knowledge in combination with expert-user information to accentuate cell types from a scRNA-seq analysis that yield the most biologically meaningful results with respect to a disease under study. Our methodology allows for ranking and prioritization of cell types based on how well their expression profiles relate to the molecular mechanisms and drugs associated with a disease. Molecular mechanisms, as well as drugs, are incorporated as prior knowledge in a standardized, structured manner. Cell types are then ranked/prioritized based on how well results from data-driven analysis of scRNA-seq data match the predefined prior knowledge. In additional cell-cell communication perturbations between disease and control networks are used to further prioritize/rank cell types. Our methodology has substantial advantages to more traditional cell ranking techniques and provides an informative complementary methodology that utilizes prior knowledge in a rapid and automated manner, that has previously not been attempted by other studies. The current methodology is also implemented as an R package entitled Single Cell Ranking Analysis Toolkit (scRANK) and is available for download and installation via GitHub (https://github.com/aoulas/scRANK).

Author summary

Single-cell RNA sequencing (scRNA-seq) provides an additive resolution down to the cellular level that was previously not available from traditional “Bulk” RNA sequencing experiments. However, it is often difficult to prioritize the specific cell types which are primarily responsible for the cause of a disease. This work presents a novel methodology that utilizes predefined prior knowledge information to highlight informative cell types from a scRNA-seq analysis. Our methodology allows for ranking and prioritization of cell types based on how well predefined prior knowledge, in the form of pathways and drugs, are mapped to the results obtained from a basic analysis of a scRNA-seq dataset. Prior knowledge is incorporated in a standardized, structured manner, whereby a checklist is attained by querying the human disease database called MalaCards. This checklist comes in the form of pathways and drugs associated with the disease. The expert-user is prompted to “edit” the prior knowledge checklist by removing or adding terms from the list of predefined terms. Our methodology also utilizes cell-cell communication perturbations between disease and control networks to further prioritize/rank cell types.

We believe, this methodology may offer substantial advantages to existing cell prioritizing or ranking techniques and provide a rapid and automated approach that further compliments other methods of cell prioritization in a scRNA-seq framework.

Introduction

Single-cell RNA-sequencing (scRNA-seq) analysis generates clusters of cells that are grouped together due to their similar expression profiles. These clusters define specific cell types and provide scRNA-seq data with increased resolution in comparison to bulk-RNA-seq. This has apparent advantages when studying disease phenotypes because researchers can now narrow down the specific cell types that are affected by a given disease and perform more targeted, specialized analyses. The analysis of genes within specific cell types provides a very precise expression that is unique for the genes within the given cell type. However, it is often unclear which cell types may be more informative to investigate for a specific disease under study. It is common practice to perform a type of prioritization or ranking for the different cell types in a scRNA-seq framework by: 1) obtaining an indication of the proportion of cell types between the different conditions under study (e.g., cases vs. controls) [1], 2) counting the number of differentially expressed genes (DEGs) between cell types and conditions in the experiment [2] or, 3) prioritizing cell types based on prior knowledge about the disease [3]. These methods have certain drawbacks and limitations such as: a) cell type proportion estimates from scRNA-seq data are variable. Statistical methods that can correctly account for different sources of variability are needed to confidently identify statistically significant shifts in cell type composition between experimental conditions [4]. b) It has been shown that when using DEGs to prioritize cells, in both simulated and experimental datasets, the number of DEGs was strongly correlated with the number of cells per type, causing abundant cell types with modest transcriptional perturbations to be prioritized over rare but more strongly perturbed cell types [4]. c) Using prior knowledge is limited to the selection of specific cell types known to be affected in a disease under study. This requires that the cell types are well documented and characterized from a clinical perspective. This is often not the case and most disease phenotypes are not so straightforward. Moreover, the specific type of cells that are mostly affected by the disease are not always so well documented.

Efforts have been made to improve prioritizing methods like the ones described above [4,5]. However, although examples of utilizing prior knowledge and literature to annotate cell populations exist [3], there is currently no structured and automated method of incorporating prior knowledge in scRNA-seq analysis pipelines. Prior knowledge is often introduced in the analysis pipeline after manual interpretation in order to select for the most informative cell types and consequently these cell types are used to perform downstream analyses (e.g., differential expression, pathway enrichment analysis, drug repurposing (DR)). The majority of studies do not use prior knowledge but instead adopt a pure discovery mode in their analysis pipeline (e.g., [1,2]). Results from these studies are difficult to validate as they do not coincide to previously documented information for the disease and it often takes the researchers thorough inspection to arrive to a concrete interpretation of the results.

In this study we present a method that exploits prior knowledge (in combination with expert-user information) to guide the choice of cell types that yield the most biologically meaningful results from a scRNA-seq dataset with at least two conditions (e.g., disease vs. control). Prior knowledge is incorporated in a standardized, structured manner, whereby a checklist is attained by querying the MalaCards human disease database [6] with a disease of interest. The checklist is comprised of pathways and drugs and, optionally, drug mode of actions (MOAs), associated with the disease. The user is prompted to “edit” this checklist by removing or adding terms (in the form of keywords) from the list of predefined terms. The user may also define de novo, a set of keywords that best suit a hypothesis the user is interested in investigating (hypothesis-driven approach). Once the checklist is finalized, a “mapping” step is performed, whereby the incorporated prior knowledge is compared to results obtained from analyzing scRNA-seq data. The methodology is fully automated and a ranking is generated for all cell types in the analysis allowing the user to pinpoint the specific cells that are most prominently affected by the disease under study in accordance to the provided prior knowledge. The output of our methodology further provides a validation of scRNA-seq results and allows for better interpretation of findings. It also provides a means of assessing the efficacy of enrichment analysis and DR performed using scRNA-seq in comparison to bulk-RNA-seq data.

In addition, the results provide greater credence to de novo information as it also backed-up by prior knowledge for the disease under study. Novel information is obtained in the form of predicted pathways that are enriched for each cell type, as well as previously unreported repurposed drugs that are obtained from the cell type-specific DEGs between disease-control conditions.

Materials and methods

Datasets

Three datasets were used in order to assess the performance of our methodology:

COVID dataset 30 samples: Gene Expression Omnibus (GEO) accession GSE159812 [1].

LAM dataset 8 samples: GEO accession GSE135851 [3]

Autism Dataset 41 samples: Sequence Read Archive, accession number PRJNA434002 [2].

Basic analysis flow

The basic analysis flow of our methodology (see Fig 1 –Step 1 (Basic Analysis)) begins by performing scRNA-seq analysis using SEURAT (ver. 5.0.1) [7]. We utilized standardized analysis pipelines following best practices as described in the SEURAT web page (https://satijalab.org/seurat/index.html). The basic analysis includes, pre-processing, normalization, sample integration using anchors, dimensionality reduction via Principal Component Analysis (PCA), nearest-neighbor graph construction, clustering and detection of differentially expressed genes (DEGs) between conditions under study (e.g., disease vs. control). DEGs were identified following data normalization which corrects for cell count imbalances between conditions, as well as other biases. Finally, a 0.05 adjusted p-value threshold was applied for DEG selection. Custom-made scripts were used to perform in-script connections to enrichR via the corresponding R package (enrichR ver. 3.1) [8]. The latter was done to attain annotation of cell types based on marker gene analysis (if not supplied by the user) and also perform pathway enrichment analysis and DR.

Prior knowledge incorporation

The Next step of our methodology entails the incorporation of “prior knowledge” information (see Fig 1 –Step 2 (Prior Knowledge acquisition)). There are two options to consider: 1) Discovery mode: where the user proceeds with all the terms obtained from a checklist of predefined terms (default option). In this mode the checklist is obtained by querying the MalaCards human disease database [6] with a disease of interest (e.g., Covid-19). The checklist is comprised of pathways and drugs, as well as drug MOAs (optional), associated with the disease. 2) Hypothesis-driven mode: where the user can provide specific keywords/terms associated with the hypothesis under investigation (i.e., to check for signs of inflammation in the different cell types, the user may provide targeted terms associated with inflammation, e.g., Cytokines). For the hypothesis-driven approach, the new keyword terms that are defined are used to perform de novo querying across the different supported databases (see below). The search extracts the relevant database entries (if any) associated with the keywords. As it may be difficult for the user to provide specific drug names as keywords for this approach, our methodology supports the input of drug MOAs, which can be supplied instead. Our method then proceeds to search and extract the relevant drugs associated with these MOAs using information from the Drug Repurposing Hub Database (https://clue.io/repurposing). The Drug repurposing hub is a curated and annotated collection of FDA-approved drugs, clinical trial drugs, and pre-clinical tool compounds [9]. This information is made available for our methodology as a text file downloadable from our GitHub page (https://github.com/aoulas/scRANK/).

The order of the prior knowledge information is important for downstream analytics. For the discovery mode approach the order is defined by an enrichment score provided by MalaCards. For the hypothesis-driven approach the user has to define the order of the keywords with most relevant/important keywords provided first.

Mapping prior knowledge to data-driven results

Once the prior knowledge information is finalized, the “mapping” step is performed (see Fig 1 –Step 3 (Mapping Basic Analysis to Prior Knowledge)). This step merges the outputs of Steps 1 and 2 and utilizes this information as input for Step 3. This step assesses how well prior knowledge “maps” to the results obtained from scRNA-seq analysis. This is done firstly by mapping prior knowledge pathways against pathway enrichment results attained from analyzing the scRNA-seq data (data-driven analysis). Pathway enrichment is performed using selected DEGs and five pathway-related databases: (i) the Kyoto Encyclopedia of Genes and Genomes (KEGG) [10], (ii) Gene Ontology (GO) [11], (iii) the Molecular Signatures Database (MSIG) [12], (iv) WikiPathways [13] and Reactome [14]. Secondly, prior knowledge on drug names and/or MOAs (depending on whether the discovery or hypothesis-driven approach is undertaken) are mapped against DR results from the data-driven analysis using the CMAP database. Our methodology supports two options for Step 3. The default option generates different sets of results for every cell type in the dataset. There is also a “Bulk-RNA” option designed to simulate results generated from a bulk-RNA-seq experiment. The generation of this pseudo-bulk signature has been previously described [3] and is performed by merging all different cell types together (per condition) and obtaining DEGs using these pooled cells across conditions. This signature mimics the signature obtained by a bulk-RNA-seq analysis.

Ranking of cell types

Our methodology then uses the prior knowledge information to score/rank the cell types with respect to how well the data-driven output from pathway enrichment analysis and drug repurposing for individual cell types, “maps” to the predefined information provided by the expert user (see Fig 1 –Step 4 (Cell Ranking)). This is achieved by matching the position of the prior knowledge to the output of the scRNA-seq analysis and then taking the Euclidian distance (E) between the matched positions (see Fig 1 –Step 4.1 (Cell Ranking using Pathways and Drugs)). Cell types that exhibit good mapping between prior knowledge and analysed results are characterized by small Euclidean distances (see Fig 2). To correct for biases generated by multiple NA terms in the matching/mapping process we have incorporated a penalty for mappings that display sparse or absent terms. This was done by initially converting the mapping information into binary vectors whereby, 1 at both positions indicates presence or matching of terms at the specific position, while 1 and 0 represents non-matching of terms (see Fig 2). We next used the Jaccard index (J) to compute the similarity between these binary vectors. A J value close to 1 indicates high presence or matching of terms, while a J similarity close to 0 indicates limited or no matching of terms. The final ranking score (RS) was a combination of the Euclidean distance together with the Jaccard similarity. It was computed by dividing the Euclidean distance with the Jaccard similarity, thus penalizing vectors with missing information. To avoid cases where J = 0 we simply divide by a very small number (i.e., 1e-06).

Fig 2 — Mapping is performed by matching the position of the prior knowledge (left) to the output (enriched pathways) of the scRNA-seq analysis (right) and then taking the *Euclidian* distance between the final vectors generated from the matched positions. Positions that are not matched/mapped at all receive a NA value. To avoid biases from cases with multiple NA terms, we used the *Jaccard similarity (J)* to penalize sparsely mapped vectors. This was done by dividing the *Euclidian distance (E)* with *Jaccard similarity (J)* calculated from binary asymmetric variable vectors.

In addition to this ranking step, cells are also ranked using cell-cell communication networks generated using CellChat [15] (see Fig 1 –Step 4.2 (Cell Ranking using CellChat)). Global communications between cells are affected by disease. The assessment of these communications requires precise depiction of cell-cell signalling interactions and effective systems-level visualization. CellChat provides a database of interactions among ligands, receptors and their co-factors representing heteromeric molecular complexes. Furthermore, CellChat is capable of quantitatively analyzing intercellular communication networks utilizing scRNA-seq data. It predicts major signaling inputs and outputs for cells and how those cells and signals coordinate towards the implementation of specific functions, using network analysis and pattern recognition approaches.

Capitalizing on the functionalities of CellChat, we first split scRNA-seq data into control and disease samples and then generate two different types of cell-cell communication networks using CellChat. We then compare the number of interactions (edges) between cell types (nodes) in the two different networks (disease and control) and rank the cells by log fold difference in interactions (LogFDI). Ranking can be performed in either descending or ascending order as it is actually the absolute LogFDI value that determines the magnitude of the differences in interactions. The sign (+ve/-ve) merely dictates whether there is loss or gain of interactions with respect the control network. The LogFDI is obtained by calculating the log2 of the fold change of the interactions between cell types across disease and control conditions (see Eqs 1, 2 and 3). The incentive here is to detect cell types whose signalling undergoes major alterations between disease and control conditions. These cells may be considered to be key effectors in the disease under study (see Fig 3). This provides a ranking which is based on topology information obtained from the CellChat networks.

Fig 3 — A. Data is split into two networks control and disease. B. Cell types are ranked based on taking the log of the fold difference in their edge number interactions (*LogFDI*) between disease and control networks. Taking in consideration both positive and negative log fold changes. Node sizes are proportional to the number of cells in each cell group and edge widths with the number of interactions between nodes.

Finally, the different rankings obtained from the individual pathways and DR are merged by obtaining the mean rank across all the rankings and a final ranking is assigned to each cell type of the analysis. The CellChat data provides a unique type of information about cell-cell communication perturbations during disease and was hence, not incorporated into this mean ranking. CellChat rankings were considered independently in union with the mean rankings.

The log fold difference in interactions (LogFDI_i) between disease and control networks for every cell type i is obtained as such:

{L o g F D I}_{i} = \log_{2} \frac{{T N D I}_{i}}{{T N C I}_{i}} \begin{matrix} \geq 0 i f {T N C I}_{i} {\leq T N D I}_{i} \\ < 0 i f {T N C I}_{i} > {T N D I}_{i} \end{matrix}

(1)

Where:

The normalized total number of control interactions for every cell type in the control network (TNCI_i) is calculated as such:

{T N C I}_{i} = \frac{{N S C}_{i} + {N T C}_{i}}{T I C}

(2)

Where: NSC_i is the number of source-to-target interactions and NTC_i is the number of target-to-source interactions for each cell type i for the control network. TIC is the total number of interactions in the control network.

The normalized total number of disease interactions for every cell type in the disease network (TNDI_i) is calculated as such:

{T N D I}_{i} = \frac{{N S D}_{i} + {N T D}_{i}}{T I D}

(3)

Where: NSD_i is the number of source-to-target interactions and NTD_i is the number of target-to-source interactions for each cell type i for the disease network. TID is the total number of interactions in the disease network.

Bulk-RNA-seq simulations

Each scRNA-seq case-study dataset was also evaluated using bulk-RNA-seq simulations performed by merging all different cell types together and obtaining differentially expressed genes across conditions. The generation of this pseudo-bulk signature has been previously described [3] and is performed by differential expression between all disease cells and all control cells. This signature mimics the signature obtained by bulk-RNA-seq analysis. The Euclidian distances between the bulk-RNA-seq data-driven results and the prior knowledge were treated as for the individual cell types and a ranking obtained for the bulk-RNA-seq.

Additional methods for cell prioritization

In addition to prior knowledge, each scRNA-seq case-study dataset was also assessed using methods which are commonly used to prioritize cell types in a scRNA-seq analysis framework. These method included: 1) comparing the total number of DEGs between disease/control conditions for individual cell types and utilizing this as a means of ranking cell types and identifying the cells which are primarily affected by the disease under study [2], 2) ranking cell types by calculating the proportion of cell types across the different conditions in the analysis. For example, counting the number of different cell types between disease/control conditions can be an indication of the cellular differentiation/proliferation in disease vs. control conditions [1]. The results from both of these approaches were compared with our ranking methodology as well as the results reported by the authors who first published these datasets.

R package implementation

The current methodology was implemented as an R package entitled Single Cell Ranking Analysis Toolkit (scRANK) and is available for download and installation via GitHub (https://github.com/aoulas/scRANK). The basic analysis function supports data in multiple output formats obtained from Cell Ranger [16], namely: txt, csv, as well as H5 format. The methodology also supports pre-processed SEURAT R objects, so the user has multiple options by which to upload data into our pipeline. Moreover, if annotated datasets exist, scRANK supports the upload of meta.txt files containing SEURAT metadata for the corresponding dataset.

Results

Showcasing the importance of our methodology using case studies

Public datasets were used in order to perform three example case studies for showcasing the importance of our methodology. Every case study was evaluated by: a) Using the default option (discovery mode) whereby information derived from MalaCards was used as prior knowledge input and b) Using the hypothesis-driven mode whereby an expert user can manually input keywords as prior knowledge in order to extract information from the relevant databases, thus, allowing for the investigation of specific hypotheses of interest. The three case studies were selected to demonstrate the method’s versatility in different disease scenarios. Moreover, we made sure to select datasets with control samples that had been extensively analyzed down to the cellular/molecular level in the initial dataset publication. This allowed us to compare results from our methodology and the results reported by the authors of the initial publication.

In order to evaluate the ranking of cell types by our methodology in comparison to bulk-RNA-seq data, we further performed bulk-RNA-seq simulations. This was done by merging all different cell types and obtaining differential expressed genes across conditions for each of the datasets used as a case study. A comparison was then performed between the mapping agreement of prior knowledge to data-driven results from a) bulk-RNA-seq and b) the individual cell types.

Case study 1—LAM disease

LAM discovery mode approach

Lymphangioleiomyomatosis (LAM) is a lung disease caused by the abnormal growth of smooth muscle cells, especially in the lungs and lymphatic system. This abnormal growth leads to the formation of holes or cysts in the lung. Causative mutations for LAM disease are known to be present in TSC1 or TSC2 genes. These cause the hyperactivation of the mammalian target of rapamycin (mTOR) complex 1 (mTORC1). Under normal circumstances, mTOR is a major regulator of cell growth and division. The target of rapamycin (TOR) signal-transduction pathway is an important mechanism by which eucaryotic cells adjust their protein biosynthetic capacity to nutrient availability. Both in yeast and in mammals, the TOR pathway regulates the synthesis of ribosomal components, including transcription and processing of pre-rRNA, expression of ribosomal proteins and the synthesis of 5S rRNA. However, in LAM cells, abnormally activated mTOR sends signals that encourage cells to grow uncontrollably. Activators of mTOR involve phosphoinositide 3-kinase (PI3K), phosphatidylinositol-dependent kinase-1 (PDK1), and serine-threonine protein kinase (AKT). Dysregulated mTOR leads to the hyperactivation of multiple pathways, including the MAPK signalling cascade and hence to downstream growth-related perturbations. The mTORC1 inhibitor Sirolimus is the only FDA-approved drug to treat LAM. Novel therapies for LAM are urgently needed as the disease recurs with discontinuation of the treatment and some patients are insensitive to the drug.

When studying LAM it may be more informative to look at the expression profiles of genes as they are expressed specifically in smooth muscle cells, as these are the primary affected cells for this type of disease [17–19]. This was the approach adopted by the authors that first published this dataset.

We further assessed the LAM dataset using our methodology in order to 1) Incorporate prior knowledge on LAM disease and perform ranking of the cell types identified in the SEURAT analysis. 2) Obtain additional repurposed drugs that can potentially be used in LAM disease.

The prior knowledge which was included as a list of terms for the LAM disease dataset was obtained by querying MalaCards with the disease name “Lymphangioleiomyomatosis”. The full list of pathways, drugs and MOAs (see Tables A-E in S1 Text) obtained from the search was then used as input for our methodology to perform cell-ranking. These terms best characterize the pathways, biological processes, as well as the drugs and drug MOAs, that are known to be implicated in LAM disease (according to MalaCards).

Results from the rankings of cell types using pathways and drug information show that our results are in agreement with the results reported by the authors who first published this dataset [3] as well as the clinical phenotype of the disease [17–19]. Smooth muscle cells attained the highest ranking with respect to the rest of the cell types in the analysis (see Fig 4A). Results of the enrichment analysis and DR for some of the LAM top ranked cells are shown as A-C Figs in S1 Text.

Fig 4 — A. LAM cell-rankings based on drugs (DRUG) and pathway databases (KEGG, GOBP, MSIG, WIKI, REACT). The boxplot shows the individual cell rankings obtained by our methodology. The order of the cells is defined by the average rankings across all six parameters used to generate the final ranking for the annotated cell types in the analysis. B. CellChat rankings of LAM dataset cell types based of *LogFDI*. C. ASD cell-rankings based on drugs and pathway databases (WIKI and GOBP—Note that KEGG, MSIG and REACT were not successfully mapped with prior knowledge for this case study). The boxplot shows the individual ranking obtained by our methodology. The order of the cells types is defined by the average rankings across the three parameters that attained mapping information in the analysis. D. CellChat ranking of autism dataset cell types based on *LogFDI*. E. COVID cell-rankings based on drugs (DRUG) and pathway databases (KEGG, GOBP, MSIG, WIKI, REACT). The boxplot shows the individual ranking obtained by our methodology. The order of the cells is defined by the average rankings across all six parameters used to generate the final ranking for the annotated cell types in the analysis. F. CellChat rankings of COVID-19 dataset cell types based on *LogFDI*. Results from the bulk RNA-seq simulation are also shown (boxplots highlighted in red).

LAM Cell-cell communication ranking approach

Interesting results were obtained from comparing number of signalling interactions using the cell-cell communication tool CellChat. Results showed that smooth muscle cells sustained the most differences between donor and LAM conditions, with a markable >3-fold decrease in the normalized number of interactions in the LAM condition compared to the donor (see Fig 4B). It appears that a lot of cell-cell communications pathways may be hindered in these cells under LAM conditions, resulting in loss of cell-cell interactions and denoted as decrease in the number of edge connections in the CellChat LAM network. Smooth muscle cells are the major effectors of LAM disease; as reported in clinical conditions [17–19], hence, perhaps it is apparent for these cell types to sustain the greatest loss in connectivity when compared to other cell types in the analysis. Results from the LAM bulk RNA-seq simulation are also shown in the box-plots (see Fig 4A). In this case, we observe a relatively high ranking, albeit lower than the most informative cell -types for the disease under study.

LAM hypothesis-driven approach

For the hypothesis-driven approach prior knowledge was included as keywords in order to query the five pathway databases (KEGG, GOBP and MSIG Hallmark, WIKI and REACT) and the drug repurposing database (CMAP).

The prior knowledge keywords for LAM disease used for searching the five pathway databases are outlined below:

Keywords MSIG: MTOR, PI3K, MAPK, apoptosis, NF-k and TNF.

Keywords REACT: MTOR signalling, PI3K, MAPK, apoptosis, NF-k and TNF.

Keywords WIKI: MTOR, PI3K, MAPK, apoptosis, NF-k,TNF.

As mentioned in the Material and Methods (see above), the hypothesis-driven approach utilizes the keywords in the form of drug MOAs for searching drugs. This is achieved by mapping them to the relevant MOAs extracted from CMAP. The keywords used for this case study are shown below:

Keywords (MoAs): CDK inhibitor, MTOR inhibitor, MEK inhibitor.

The full list of pathways and drug MOAs obtained from the search were then used as input for our methodology to perform mapping and consequently cell-ranking. These keywords best characterize the pathways, biological processes, as well as the drug MOAs, that are known to be implicated in LAM disease (in accordance to LAM bibliography [17–19]).

Results from the rankings of cell types via the hypothesis-driven approach are partly in agreement with the results obtained by the discovery mode approach and the results reported by the authors of the original dataset publication [3]. The results show that lung capillary aerocytes together with NK Cells and smooth muscle cells attain the highest rankings with respect to the rest of the cell types in the analysis (see Fig 5A). In addition, blood erythrocytes are shown to rank highly in this approach. A possible explanation for this could be the immunity/inflammation related keywords used in the hypothesis-driven approach (NF-k and TNF). Erythrocytes are deeply sensitive cells and important health indicators and they have also been reported as markers of disease especially for inflammatory and oxidative stress related pathologies [20]. Moreover, erythrocytes have previous been observed within lymphatic vessels in LAM lung nodules [21].

Fig 5 — A. LAM cell-rankings based on drugs (DRUG) and pathway databases (KEGG, GOBP, MSIG, WIKI, REACT). The boxplot shows the individual cell rankings obtained by our methodology. The order of the cells is defined by the average rankings across all six parameters used to generate the final ranking for the annotated cell types in the analysis B. ASD cell-rankings based on drugs and pathway databases (KEGG, GOBP, MSIG). The boxplot shows the individual ranking obtained by our methodology. The order of the cells is defined by the average rankings across the three parameters that attained informative information used to generate the final ranking for the annotated cell types in the analysis. C. COVID-19 cell-rankings based on drugs (DRUG) and pathway databases (KEGG, GOBP, MSIG, WIKI, REACT). The boxplot shows the individual ranking obtained by our methodology. The order of the cells is defined by the average rankings across all six parameters used to generate the final ranking for the annotated cell types in the analysis. Results from the bulk RNA-seq simulation are also shown (boxplots highlighted in red).

Case study 2—Autistic Spectrum Disorder (ASD)

ASD discovery mode approach

Clinical classification of patients with autistic spectrum disorder (ASD) is based on current WHO criteria and provides a valuable but simplified depiction of the true nature of the disorder. A recent study [2] has shown that single-nucleus RNA sequencing of cortical tissue from patients with autism can be used to significantly enhance the resolution provided by bulk gene expression studies. Overall, the scRNA-seq study shows that changes in the neocortex of autism patients converge on common genes and pathways. The authors report that synaptic signalling of upper-layer excitatory neurons (L2/3), vasoactive intestinal polypeptide (IN-VIP)–expressing interneurons and the molecular state of microglia are preferentially affected in autism. Moreover, results show that dysregulation of specific groups of genes in L5/6 cortico-cortical projection neurons correlates with clinical severity of autism.

Prior knowledge was included as a list of terms for ASD after querying MalaCards with the disease names “Autism Spectrum Disorder” as well as “Autism”. As these terms are synonymous they were both used to query MalaCards. We obtained a list of terms that best characterize the pathways, biological processes, as well as drugs that are known to be implicated in ASD disease (with respect to MalaCards). The full list of pathways and drugs (see Tables F-J in S1 Text) obtained from the search was then used as input for our methodology to perform cell-ranking.

The results from the rankings of cell types using pathways and drug information show that our results are only partly in alignment with the results reported by the authors of the original ASD dataset publication [2]. The authors report that top differentially expressed neuronal genes were down-regulated for ASD in L2/3 excitatory neurons and vasoactive intestinal polypeptide (IN-VIP)–expressing interneurons. The top genes differentially expressed in non-neuronal cell types were up-regulated for ASD in protoplasmic astrocytes and microglia. They also report cell types that are recurrently affected across multiple patients as upper (L2/3) and deep layer (L5/6-CC) cortico-cortical projection neurons. This is only partly seen in our results whereby, IN-VIP cells attained high ranking with respect to the rest of the cell types in the analysis (see Fig 4C). We did not achieve high ranking for the L2/3 cell, which was one of the main cell types reported in the ASD study and perhaps this can be attributed to the lack of mapped information for the pathways extracted from the MalaCards database. Perhaps an interesting result obtained by our methodology was that Neu-NRGN-I and II cells attained notable high rankings (see Fig 4C). These cells were not previously deemed significant by the authors of this dataset in their original study. They are Neurogranin (Ng) protein expressing neurons, which is encoded by the schizophrenia risk gene NRGN. Ng is critical for encoding contextual memory and regulating developmental plasticity in the primary visual cortex during the critical period. The overall impact of Ng on the neuronal signalling and inflammation that regulates synaptic plasticity is unknown [22]. However, previous reports have implicated decreasing Ng expression levels with altered phosphorylation patterns of postsynaptic density proteins. Postsynaptic proteins, such as glutamate receptors, GTPases, kinases, RNA binding proteins, selective ion channels and ionic transporters are known to affect schizophrenia- as well as autism-related genes [22].

Enrichment analysis and DR results for some of the ASD top ranked cells are shown as D-G Figs in S1 Text. Results from the ASD bulk RNA-seq simulation are also shown in the box-plots (see Fig 4C) attaining an average ranking with respect to the individual cell types in the analysis.

ASD Cell-cell communication ranking approach

CellChat results from comparing the normalized number of signalling interactions between ASD and control networks showed specific differences between disease and control networks. Once again, Neu-NRGN-I and II cells are extenuated, showing a notable increase in fold change of normalized interactions between ASD and control networks with respect to the rest of the cell types in the analysis (see Fig 4D). As mentioned above, these are Ng expressing neurons, a protein which has previously been implicated in brain plasticity and ASD [22]. Pseudo bulk RNA analysis also showed differences in signalling interactions, while from the cell types; Microglia also displayed some differences between control and disease conditions showing a 1.14-fold decrease in the normalized number of interactions in the ASD compared to the control network (see Fig 4D). The authors report that the molecular state of microglia is preferentially affected in ASD and they are also tightly correlated with clinical severity of the disorder. As ASD is also linked to inflammation [23–25], it is not surprising to observe cell-to-cell communication changes in the primary innate immune cells of the brain.

ASD hypothesis-driven approach

Prior knowledge was included using specific “user-defined” keywords (pathways and drug MOAs) and once again our methodology was employed. The prior knowledge keywords for ASD in order to search pathway databases (KEGG, GOBP, MSIG Hallmark, WIKI and REACT) are outlined below:

Keywords KEGG: Pathways of neurodegeneration, Prion, Alzheimer, Parkinson, ALS, axon.

Keywords GO: axon, synapse, neurotransmitter, neuron, PI3K, MAPK, apoptosis, NF-k, TNF, JAK, STAT, Cytokine, Inflammation, Th17, Th1, IL-.

Keywords MSIG: Oxidative, PI3K, MAPK, apoptosis, NF-k, TNF, JAK, STAT, Cytokine, Inflammation, Th17, Th1, IL-.

Keywords REACT: Pathways of neurodegeneration, Prion, Alzheimer, Parkinson, ALS, axon

Keywords WIKI: autism, thyroid hormone

These keywords were selected to specifically address a targeted hypothesis, namely the effect of inflammation on ASD [24–26]. They best characterize the pathways and biological processes that are known to be implicated in inflammation and immunity (to the best of our knowledge).

The keywords used for searching drug MOAs from CMAP are shown below:

Keywords (MOAs): dopamine receptor antagonist, choline, histamine, serotonin.

Once, again these keywords aim to capture the main drug MOAs that have been associated with ASD therapeutics. The full list of pathways and drug MOAs obtained from the search was then used as input for our methodology to perform cell-ranking.

The results from the rankings of cell types using the hypothesis-driven approach are very similar to the discovery-driven approach. Once again there are discrepancies with the results reported by the authors who first published the ASD dataset [2]. As mentioned above, the authors report L2/3 excitatory neurons and vasoactive intestinal polypeptide (IN-VIP)–expressing interneurons as the main neuronal cell types implicated in ASD. While protoplasmic astrocytes and microglia are reported as the most significant non-neuronal cell types. Our results for this approach show high-ranking for L5/6-CC cell types (ranked 3^rd) which is in accordance to the previously reported results for the dataset. However, the hypothesis-driven approach also extenuated the Neu-NRGN-I and II cells which attained the highest ranking with respect to the rest of the cell types in the analysis (see Fig 5B). Although these cells were not highlighted by the authors of the autism dataset, these results are in alignment with both the discover-mode as well as the CellChat rankings attained by our methodology (see Fig 4D).

Case study 3 –COVID-19

COVID-19 discovery mode approach

Although SARS-CoV-2 primarily targets the respiratory system, patients and survivors of COVID-19 can suffer neurological symptoms. A recent study performed scRNA-seq on post-mortem brains of COVID-19 patients with the aim of investigating long-COVID [1]. They revealed interesting themes linking neuroinflammation signals relayed to the brains of COVID-19 patients, particularly affecting microglia, as well as astrocytic cells. Moreover, the authors also report the neuronal subtypes that were mostly affected by these inflammatory signals. They narrow down these neuronal subtypes to gene-expression changes of the excitatory neurons, specifically L2/3 and L2/3-residing VIP interneurons. They also report microglia and astrocyte as the main cell types for relaying signals between brain barriers.

Furthermore, they show links of COVID-19 affected gene expression changes with neurological disorders such as schizophrenia, depression, Alzheimer’s disease, multiple sclerosis, Huntington’s disease and ASD.

Prior knowledge for COVID-19 was used as input for our methodology to perform cell-ranking. The prior knowledge terms for the brain COVID-19 dataset were obtained by querying MalaCards with the disease name “Covid-19”. The resulting list of terms best characterizes the drugs, pathways and biological processes that are known to be implicated in COVID-19 (with respect to MalaCards). The full list of pathways and drugs (see Tables K-O in S1 Text) obtained from the search was then used as input for our methodology to perform cell-ranking.

Results show that macrophages and astrocytes obtained the highest ranking according to our methodology. This is in alignment with the results obtained by the main authors of this dataset, which show that microglia and astrocytes are associated with the immune landscape of the brain in individuals with COVID-19. Furthermore, excitatory neurons received the third highest-ranking in accordance to our methodology (see Fig 4E). This is also in agreement to the neuronal subtypes reported by the authors of the COVID-19 dataset publication. These were narrowed down to gene-expression changes of the excitatory neurons, specifically L2/3 and L2/3-residing VIP interneurons [1]. Enrichment analysis and DR results for some of the COVID-19 top ranked cells are shown as H-K Figs in S1 Text. Results from the COVID-19 bulk RNA-seq simulation is also shown in the box-plots (see Fig 4E), attaining an average ranking with respect to the individual cell types in the analysis.

COVID-19 Cell-cell communication ranking approach

Results from COVID-19 and control CellChat networks, also highlighted astrocytes as the cell type that have substantial cell-cell communication differences between control and disease conditions, with a >1.3-fold decrease in the number of normalized interactions between networks (see Fig 4F). The CellChat results also show that endothelial cells have considerable cell-cell communication differences between control and disease conditions (see Fig 4F). These cell types showed a >3-fold increase in the number of normalized interactions in the COVID-19 network in comparison to the control network (see Fig 4F). CellChat results once again highlights astrocytes, which is in agreement to outcomes reported by the original authors of this dataset, whereby astrocytes were proposed as one of the main cell types for relaying signals between brain barriers in COVID-19 patient samples [1].

COVID-19 hypothesis-driven approach

As with the above scenarios, prior knowledge was provided as keywords for our methodology. The prior knowledge keywords used for the brain COVID-19 dataset in order to search the pathway databases (KEGG, GOBP, MSIG Hallmark, REACTOME, WIKI) are outlined below:

Keywords KEGG: Corona disease, viral, Prion, Alzheimer, Parkinson, ALS, axon

Keywords GO: axon, synapse, neurotransmitter, neuron, PI3K, MAPK, apoptosis, NF-k, TNF, JAK, STAT, Cytokine, Inflammation, Th17, Th1, IL-

Keywords MSIG: Oxidative, PI3K, MAPK, apoptosis, NF-k, TNF, JAK, STAT, Cytokine, Inflammation, Th17, Th1, IL-.

Keywords REACT: SARS-CoV-2, Infectious disease, Innate Immune, Alzheimer, Huntington, autism.

Keywords WIKI: SARS-CoV-2, Infectious disease, Innate Immune, Alzheimer, Huntington, autism

These keywords were selected in order to address the long COVID effects with respect to inflammation in the brains of infected patients. They characterize the pathways and biological processes that are known to be implicated in inflammation as well as long COVID (to the best of our knowledge). The keywords used for searching drug MOAs from CMAP were:

Keywords (MoAs): dopamine receptor antagonist, choline, histamine, serotonin.

These keywords aim to capture the main drug MOAs that have been implicated in neurological disorders. According to current research [1], these disorders show multiple similarities to long COVID. The full list of pathways and drug MOAs obtained from the search was then used as input for our methodology to perform cell-ranking.

Results show that top ranked cells were epithelial cells and excitatory neurons receiving the second highest-ranking (similarly to the Discovery mode). Keywords used were broadly associated with COVID-19 (e.g., Corona disease) as well as some targeted keywords of inflammation (e.g., Cytokine—see full list above) (see Fig 5C). These results are also in agreement with the cell types reported by the authors of the COVID-19 dataset publication [1], which, were narrowed down to gene-expression changes of the excitatory neurons. Astrocytes did not receive a high ranking in the Hypothesis-driven mode, as previously reported for the discovery mode and the original publication for this dataset. Perhaps an explanation for this discrepancy between our result and the ones reported in the original publication, could be that the authors analyse the samples attained from the different brain areas (cortex and choroid plexus) separately, while we integrated all samples in our scRNA-seq analysis.

Comparison to standard cell type ranking methods

Standard cell ranking methods, such as comparison of the total number of DEGs between disease/control conditions for individual cell types, are often used to prioritize cell types and highlight cells which can be deemed important for a specific disease [2]. The calculation of the proportion of cell types across the different conditions in a scRNA-seq experiment, is yet another standard way of cell type prioritization for a targeted disease. In order to assess the differences between these standard approaches and our methodology we proceeded to generate results using these methods for all three datasets used as case studies and compared them to our results.

Comparing results of our ranking methodology to the ranking generated from standard methods used for scRNA-seq ranking, we observe certain similarities but also some discrepancies. For example, for the LAM dataset it appears that smooth muscle cells are not ranked high with respect to the proportional differences between disease donor conditions (see Fig 6A). Perhaps the number of cells is not affected in an analogous manner to the molecular perturbations which are commonly seen in LAM disease [17–19]. Similarly, DEG analysis for these cells does not show major differences in the number of DEGs between disease and donor conditions (see Fig 7A). For the ASD dataset the top ranked cell types with respect to proportional differences between disease and control conditions are oligodendrocytes while the cells which are reported in the main primary work that published this dataset (i.e., L2/3, IN-VIP) are found lower down in the rankings (see Fig 6B). In the analysis of total number of DEGs between disease and control samples for this dataset (see Fig 7B), results are in accordance to the original publication with L2/3 excitatory neurons showing the highest number of DEGs. For the COVID-19 brain dataset, both the proportion of cells and DEGs analyses showed the same output. With top ranked cell types being the epithelial and excitatory neuron cells (see Fig 6C and 7C). As mentioned above, excitatory neurons are also deemed as highly informative by the authors of this dataset. The astrocyte cells which, were deemed highly significant in the original publication for this dataset, appear lower down in the rankings (see Fig 6C and 7C). This discrepancy could be attributed to the way the COVID dataset was analysed. Whereby, the authors analyzed the samples attained from the different brain areas (cortex and choroid plexus) separately, while we used an integration of all samples in our analysis

Fig 6 — A. LAM disease dataset and the proportion of cell types across conditions. B. AUTISM disorder dataset showing the proportion of cell types across conditions. C. Brain COVID-19 dataset showing the proportion of cell types across conditions. Cells are ranked in descending order with respect to the difference between the proportion of cells in disease vs control samples.

Fig 7 — A. LAM disease dataset and the number of DEGs for individual cell types across conditions. B. AUTISM disorder dataset showing the number of DEGs for individual cell types across conditions. C. Brain COVID-19 dataset showing the number of DEGs for individual cell types across conditions. Cells are ranked in descending order with respect to the total number DEGs in control vs disease samples.

Discussion

Currently the use of methods such as, total number of differentially expressed genes (DEGs) between disease/control conditions for individual cell types, are utilized in order to obtain an indication as to which cell types are more profoundly implicated in the disease under study [2]. These methods have certain drawbacks and limitations as shown by assessments in both simulated and experimental datasets. Specifically, the number of DEGs is often strongly correlated with the number of cells per type, causing abundant cell types with modest transcriptional perturbations to be prioritized over rare but more strongly perturbed cell types [4]. Similarly, counting the number of different cell types between disease/control conditions and calculating the proportion of cells between conditions can also be an indication of the cellular differentiation in disease vs. control. For example, in a recent study by Yang et al [1] the authors use differences in proportion of cells between macrophages in disease vs. control conditions to pin-point specific disease associated macrophages. This again, can be a very important indication as to which cell types are mostly affected by the disease. However, this method relies heavily on accurate normalization procedures to ensure that all libraries are contributing equally towards this assessment criterion. Also cell type proportion estimates from scRNA-seq data are variable, and statistical methods that can correctly account for different sources of variability are needed to confidently identify statistically significant shifts in cell type composition between experimental conditions [4]. Moreover, it may be difficult to decide on that exact number of cells that can reliably be considered as a significant difference between disease and control samples and further play a role in disease specific, causative, cellular differentiation.

We propose an approach which uses prior knowledge in an automated, structured manner. Our approach combines both prior knowledge available via disease related databases, but also allows for intervention by expert users which are knowledgeable in the molecular basis of the disease under study. This permits a more targeted hypothesis-driven approach by supplying keywords by which to search available database resources. The supplied prior knowledge is further mapped to enrichment and drug repurposing results obtained from a scRNA-seq data-driven analysis and a ranking of the individual cell clusters is performed based on the accuracy of this mapping. We perform three example case studies using published scRNA-seq datasets and compare the results of the rankings obtained by our methodology with the results reported in the original published research papers for the individual datasets. We find that our methodology is able to highlight specific cell types that are known to be implicated in the diseases under study as reported by clinical information as well as previously published results [1–3]. We show that using our method can provide an automated, structured analysis pipeline that can allude to very similar results attained by relevant research which utilized more traditional, manual, time-consuming methods that require extensive and thorough investigation into the molecular basis of the disease under study. Discrepancies between previously published results and the results of our methodology were predominantly seen for the ASD dataset. Previously, the authors report L2/3 excitatory neurons as the main cell type relevant to the disease under study [2], while our results allude to Neurogranin (Ng) protein expressing neurons. Ng is encoded by the schizophrenia risk gene NRGN which is implicated in numerous brain activities including encoding contextual memory and regulating developmental plasticity [22]. In addition previous studies have implicated decreasing Ng expression levels with altered phosphorylation patterns of postsynaptic density proteins known to affect schizophrenia- as well as autism-related genes [22]. Therefore, there is ample evidence associating of Ng with autism spectrum disorders (ASDs), schizophrenia, as well as other neurodegenerative disorders such as Alzheimer’s and Creutzfeldt-Jakob disease (CJD) [27–29]. Recent work has also reviewed the potential role of Ng as a biomarker of neurological and mental diseases [30]. Given the evidence reported by these studies, it is actually surprising that these cell types were not deemed significant in the original publication for this dataset.

Furthermore, our methodology allows for more targeted hypothesis-driven approaches to highlight specific cell types that are implicated in molecular mechanisms relevant to the hypothesis under consideration. We show that applying specific themes (e.g., inflammation) in our hypothesis driven approach, alters cell ranking results and allows for the identification of cell types which potentially contribute towards these processes.

Overall, our methodology has substantial advantages to more traditional cell ranking techniques. However, we cannot replace other prioritization methods or exclude their potential usefulness. We provide a complementary methodology that utilizes prior knowledge in a rapid and automated manner, that has previously not been attempted by other studies. Potential limitations of our methodology entail the non-completeness of the databases from which the prior knowledge was obtained (MalaCards and thereafter enrichR) as well as the lack of uniform nomenclature. Moreover, a significant disparity or minimal overlap between prior knowledge databases (i.e., MalaCards and enrichR) and results from scRNA-seq analysis (mapping), can also act as a limitation. The use of the hypothesis driven approach can potentially mitigate this problem, as the user can manual add or replace keywords in the prior knowledge checklist. This can allow for better mapping between terms derived from prior knowledge with those from data-driven results. The inclusion of additional pathway databases in our methodology can also provide a means to combat this limitation, by enriching the repertoire from which to extract prior knowledge. An informative extra functionality which, may also mitigate this issue, will be to provide the user with the option to select a specific subset of cell types to merge. This will permit the analysis of combinatorial expression profiles of certain groups of cell types in a disease under study. These functionalities will be included in the next version of our R package (scRANK). ScRANK provides all aspects of our methodology in a fully documented R package and is freely available for download and installation (https://github.com/aoulas/scRANK).

Supporting information

S1 Text. Table A. Drugs extracted from MalaCards for “lymphangioleiomyomatosis”.

Table B. Reactome Pathways according to MalaCards for “lymphangioleiomyomatosis”. Table C. Wiki pathways according to MalaCards for “lymphangioleiomyomatosis”. Table D. KEGG and other pathways from MalaCards for “lymphangioleiomyomatosis”. Table E. Gene Ontology (Biological Processes) from MalaCards for “lymphangioleiomyomatosis”. Table F. Drugs extracted from MalaCards for alias terms “Autism Disorder and Autism”. Table G. Reactome pathways from MalaCards for alias terms “Autism Disorder and Autism”. Table H. Wiki pathways from MalaCards for alias terms “Autism Disorder and Autism”. Table I. KEGG and other pathways from MalaCards for alias terms “Autism Disorder and Autism”. Table J. Gene Ontology (Biological Processes) from MalaCards for alias terms “Autism Disorder and Autism”. Table K. Drugs extracted from MalaCards for “COVID-19”. Table L. Reactome pathways from MalaCards for alias terms “COVID-19”. Table M. Wiki pathways from MalaCards for alias terms “COVID-19”. Table N. KEGG and other pathways from MalaCards for alias terms “COVID-19”. Table O. Gene Ontology (Biological Processes) from MalaCards for alias terms “COVID-19”. Fig A. Pathway enrichment analysis for LAM smooth muscle cell types. Fig B. Pathway enrichment analysis for LAM lung capillary aerocyte cell types. Fig C. Drug repurposing via enrichR for LAM smooth muscle cell types. Fig D. Pathway enrichment analysis for ASD IN-VIP cell types. Fig E. Pathway enrichment analysis for ASD endothelial cell types. Fig F. Drug repurposing via enrichR for ASD IN-VIP cell types. Fig G. Drug repurposing via enrichR for ASD endothelial cell types. Fig H. Pathway enrichment analysis for COVID-19 brain macrophage cell types. Fig I. Pathway enrichment analysis for COVID-19 brain astrocyte cell types. Fig J. Pathway enrichment analysis for COVID-19 brain excitatory neuron cell types. Fig K. Drug repurposing via enrichR for COVID-19 brain excitatory neuron cell types.

(DOCX)

pcbi.1011550.s001.docx^{(3MB, docx)}

Data Availability

All data and code used for running experiments, model fitting, and plotting is available on a GitHub repository at https://github.com/aoulas/scRANK/.

Funding Statement

NK and KS were supported by HORIZON EUROPE via the project ELMUMY (Grant ID: 101097094, URL: https://cordis.europa.eu/project/id/101097094). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1.Yang AC, Kern F, Losada PM, Agam MR, Maat CA, Schmartz GP, et al. Dysregulation of brain and choroid plexus cell types in severe COVID-19. Nature. 2021;595: 565–571. doi: 10.1038/s41586-021-03710-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
2.Velmeshev D, Schirmer L, Jung D, Haeussler M, Perez Y, Mayer S, et al. Single-cell genomics identifies cell type–specific molecular changes in autism. Science (80-). 2019;364: 685–689. doi: 10.1126/science.aav8130 [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Al Mahi N, Zhang EY, Sherman S, Yu JJ, Medvedovic M. Connectivity Map Analysis of a Single-Cell RNA-Sequencing -Derived Transcriptional Signature of mTOR Signaling. Int J Mol Sci. 2021;22: 4371. doi: 10.3390/ijms22094371 [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Phipson B, Sim CB, Porrello ER, Hewitt AW, Powell J, Oshlack A. propeller: testing for differences in cell type proportions in single cell data. Bioinformatics. 2022;38: 4720–4726. doi: 10.1093/bioinformatics/btac582 [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Skinnider MA, Squair JW, Kathe C, Anderson MA, Gautier M, Matson KJE, et al. Cell type prioritization in single-cell data. Nat Biotechnol. 2021;39: 30–34. doi: 10.1038/s41587-020-0605-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Rappaport N, Nativ N, Stelzer G, Twik M, Guan-Golan Y, Iny Stein T, et al. MalaCards: an integrated compendium for diseases and their annotation. Database. 2013;2013. doi: 10.1093/database/bat018 [DOI] [PMC free article] [PubMed] [Google Scholar]
7.Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184: 3573–3587.e29. doi: 10.1016/j.cell.2021.04.048 [DOI] [PMC free article] [PubMed] [Google Scholar]
8.Xie Z, Bailey A, Kuleshov M V., Clarke DJB, Evangelista JE, Jenkins SL, et al. Gene Set Knowledge Discovery with Enrichr. Curr Protoc. 2021;1. doi: 10.1002/cpz1.90 [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Corsello SM, Bittker JA, Liu Z, Gould J, McCarren P, Hirschman JE, et al. The Drug Repurposing Hub: a next-generation drug library and information resource. Nat Med. 2017;23: 405–408. doi: 10.1038/nm.4306 [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Kanehisa M, Furumichi M, Sato Y, Kawashima M, Ishiguro-Watanabe M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2022. doi: 10.1093/nar/gkac963 [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Carbon S, Douglass E, Good BM, Unni DR, Harris NL, Mungall CJ, et al. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res. 2021;49: D325–D334. doi: 10.1093/nar/gkaa1113 [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst. 2015;1: 417–425. doi: 10.1016/j.cels.2015.12.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
13.Martens M, Ammar A, Riutta A, Waagmeester A, Slenter DN, Hanspers K, et al. WikiPathways: connecting communities. Nucleic Acids Res. 2021;49: D613–D621. doi: 10.1093/nar/gkaa1024 [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Croft D, Mundo AF, Haw R, Milacic M, Weiser J, Wu G, et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 2014;42: D472–7. doi: 10.1093/nar/gkt1102 [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Jin S, Guerrero-Juarez CF, Zhang L, Chang I, Ramos R, Kuan C-H, et al. Inference and analysis of cell-cell communication using CellChat. Nat Commun. 2021;12: 1088. doi: 10.1038/s41467-021-21246-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8: 14049. doi: 10.1038/ncomms14049 [DOI] [PMC free article] [PubMed] [Google Scholar]
17.Johnson SR. Lymphangioleiomyomatosis. Eur Respir J. 2006;27: 1056–65. doi: 10.1183/09031936.06.00113303 [DOI] [PubMed] [Google Scholar]
18.Taveira-DaSilva AM, Steagall WK, Moss J. Lymphangioleiomyomatosis. Cancer Control. 2006;13: 276–85. doi: 10.1177/107327480601300405 [DOI] [PubMed] [Google Scholar]
19.Juvet SC, McCormack FX, Kwiatkowski DJ, Downey GP. Molecular pathogenesis of lymphangioleiomyomatosis: lessons learned from orphans. Am J Respir Cell Mol Biol. 2007;36: 398–408. doi: 10.1165/rcmb.2006-0372TR [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Massaccesi L, Galliera E, Corsi Romanelli MM. Erythrocytes as markers of oxidative stress related pathologies. Mech Ageing Dev. 2020;191: 111333. doi: 10.1016/j.mad.2020.111333 [DOI] [PubMed] [Google Scholar]
21.Pacheco-Rodriguez G, Glasgow CG, Ikeda Y, Steagall WK, Yu Z-X, Tsukada K, et al. A Mixed Blood-Lymphatic Endothelial Cell Phenotype in Lymphangioleiomyomatosis and Idiopathic Pulmonary Fibrosis but Not in Kaposi’s Sarcoma or Tuberous Sclerosis Complex. Am J Respir Cell Mol Biol. 2022;66: 337–340. doi: 10.1165/rcmb.2021-0293LE [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Hwang H, Szucs MJ, Ding LJ, Allen A, Ren X, Haensgen H, et al. Neurogranin, Encoded by the Schizophrenia Risk Gene NRGN, Bidirectionally Modulates Synaptic Plasticity via Calmodulin-Dependent Regulation of the Neuronal Phosphoproteome. Biol Psychiatry. 2021;89: 256–269. doi: 10.1016/j.biopsych.2020.07.014 [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Eissa N, Sadeq A, Sasse A, Sadek B. Role of Neuroinflammation in Autism Spectrum Disorder and the Emergence of Brain Histaminergic System. Lessons Also for BPSD? Front Pharmacol. 2020;11. doi: 10.3389/fphar.2020.00886 [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Siniscalco D, Schultz S, Brigida A, Antonucci N. Inflammation and Neuro-Immune Dysregulations in Autism Spectrum Disorders. Pharmaceuticals. 2018;11: 56. doi: 10.3390/ph11020056 [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Meltzer A, Van de Water J. The Role of the Immune System in Autism Spectrum Disorder. Neuropsychopharmacology. 2017;42: 284–298. doi: 10.1038/npp.2016.158 [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Tsilioni I, Patel AB, Pantazopoulos H, Berretta S, Conti P, Leeman SE, et al. IL-37 is increased in brains of children with autism spectrum disorder and inhibits human microglia stimulated by neurotensin. Proc Natl Acad Sci. 2019;116: 21659–21665. doi: 10.1073/pnas.1906817116 [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Blennow K, Diaz-Lucena D, Zetterberg H, Villar-Pique A, Karch A, Vidal E, et al. CSF neurogranin as a neuronal damage marker in CJD: a comparative study with AD. J Neurol Neurosurg Psychiatry. 2019;90: 846–853. doi: 10.1136/jnnp-2018-320155 [DOI] [PubMed] [Google Scholar]
28.Sun X, Dong C, Levin B, Crocco E, Loewenstein D, Zetterberg H, et al. APOE ε4 carriers may undergo synaptic damage conferring risk of Alzheimer’s disease. Alzheimer’s Dement. 2016;12: 1159–1166. doi: 10.1016/j.jalz.2016.05.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Kivisäkk P, Carlyle BC, Sweeney T, Quinn JP, Ramirez CE, Trombetta BA, et al. Increased levels of the synaptic proteins PSD-95, SNAP-25, and neurogranin in the cerebrospinal fluid of patients with Alzheimer’s disease. Alzheimers Res Ther. 2022;14: 58. doi: 10.1186/s13195-022-01002-x [DOI] [PMC free article] [PubMed] [Google Scholar]
30.Xiang Y, Xin J, Le W, Yang Y. Neurogranin: A Potential Biomarker of Neurological and Mental Diseases. Front Aging Neurosci. 2020;12. doi: 10.3389/fnagi.2020.584743 [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Text. Table A. Drugs extracted from MalaCards for “lymphangioleiomyomatosis”.

(DOCX)

pcbi.1011550.s001.docx^{(3MB, docx)}

Data Availability Statement

All data and code used for running experiments, model fitting, and plotting is available on a GitHub repository at https://github.com/aoulas/scRANK/.

[pcbi.1011550.ref001] 1.Yang AC, Kern F, Losada PM, Agam MR, Maat CA, Schmartz GP, et al. Dysregulation of brain and choroid plexus cell types in severe COVID-19. Nature. 2021;595: 565–571. doi: 10.1038/s41586-021-03710-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref002] 2.Velmeshev D, Schirmer L, Jung D, Haeussler M, Perez Y, Mayer S, et al. Single-cell genomics identifies cell type–specific molecular changes in autism. Science (80-). 2019;364: 685–689. doi: 10.1126/science.aav8130 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref003] 3.Al Mahi N, Zhang EY, Sherman S, Yu JJ, Medvedovic M. Connectivity Map Analysis of a Single-Cell RNA-Sequencing -Derived Transcriptional Signature of mTOR Signaling. Int J Mol Sci. 2021;22: 4371. doi: 10.3390/ijms22094371 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref004] 4.Phipson B, Sim CB, Porrello ER, Hewitt AW, Powell J, Oshlack A. propeller: testing for differences in cell type proportions in single cell data. Bioinformatics. 2022;38: 4720–4726. doi: 10.1093/bioinformatics/btac582 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref005] 5.Skinnider MA, Squair JW, Kathe C, Anderson MA, Gautier M, Matson KJE, et al. Cell type prioritization in single-cell data. Nat Biotechnol. 2021;39: 30–34. doi: 10.1038/s41587-020-0605-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref006] 6.Rappaport N, Nativ N, Stelzer G, Twik M, Guan-Golan Y, Iny Stein T, et al. MalaCards: an integrated compendium for diseases and their annotation. Database. 2013;2013. doi: 10.1093/database/bat018 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref007] 7.Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184: 3573–3587.e29. doi: 10.1016/j.cell.2021.04.048 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref008] 8.Xie Z, Bailey A, Kuleshov M V., Clarke DJB, Evangelista JE, Jenkins SL, et al. Gene Set Knowledge Discovery with Enrichr. Curr Protoc. 2021;1. doi: 10.1002/cpz1.90 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref009] 9.Corsello SM, Bittker JA, Liu Z, Gould J, McCarren P, Hirschman JE, et al. The Drug Repurposing Hub: a next-generation drug library and information resource. Nat Med. 2017;23: 405–408. doi: 10.1038/nm.4306 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref010] 10.Kanehisa M, Furumichi M, Sato Y, Kawashima M, Ishiguro-Watanabe M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2022. doi: 10.1093/nar/gkac963 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref011] 11.Carbon S, Douglass E, Good BM, Unni DR, Harris NL, Mungall CJ, et al. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res. 2021;49: D325–D334. doi: 10.1093/nar/gkaa1113 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref012] 12.Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database Hallmark Gene Set Collection. Cell Syst. 2015;1: 417–425. doi: 10.1016/j.cels.2015.12.004 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref013] 13.Martens M, Ammar A, Riutta A, Waagmeester A, Slenter DN, Hanspers K, et al. WikiPathways: connecting communities. Nucleic Acids Res. 2021;49: D613–D621. doi: 10.1093/nar/gkaa1024 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref014] 14.Croft D, Mundo AF, Haw R, Milacic M, Weiser J, Wu G, et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 2014;42: D472–7. doi: 10.1093/nar/gkt1102 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref015] 15.Jin S, Guerrero-Juarez CF, Zhang L, Chang I, Ramos R, Kuan C-H, et al. Inference and analysis of cell-cell communication using CellChat. Nat Commun. 2021;12: 1088. doi: 10.1038/s41467-021-21246-9 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref016] 16.Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8: 14049. doi: 10.1038/ncomms14049 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref017] 17.Johnson SR. Lymphangioleiomyomatosis. Eur Respir J. 2006;27: 1056–65. doi: 10.1183/09031936.06.00113303 [DOI] [PubMed] [Google Scholar]

[pcbi.1011550.ref018] 18.Taveira-DaSilva AM, Steagall WK, Moss J. Lymphangioleiomyomatosis. Cancer Control. 2006;13: 276–85. doi: 10.1177/107327480601300405 [DOI] [PubMed] [Google Scholar]

[pcbi.1011550.ref019] 19.Juvet SC, McCormack FX, Kwiatkowski DJ, Downey GP. Molecular pathogenesis of lymphangioleiomyomatosis: lessons learned from orphans. Am J Respir Cell Mol Biol. 2007;36: 398–408. doi: 10.1165/rcmb.2006-0372TR [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref020] 20.Massaccesi L, Galliera E, Corsi Romanelli MM. Erythrocytes as markers of oxidative stress related pathologies. Mech Ageing Dev. 2020;191: 111333. doi: 10.1016/j.mad.2020.111333 [DOI] [PubMed] [Google Scholar]

[pcbi.1011550.ref021] 21.Pacheco-Rodriguez G, Glasgow CG, Ikeda Y, Steagall WK, Yu Z-X, Tsukada K, et al. A Mixed Blood-Lymphatic Endothelial Cell Phenotype in Lymphangioleiomyomatosis and Idiopathic Pulmonary Fibrosis but Not in Kaposi’s Sarcoma or Tuberous Sclerosis Complex. Am J Respir Cell Mol Biol. 2022;66: 337–340. doi: 10.1165/rcmb.2021-0293LE [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref022] 22.Hwang H, Szucs MJ, Ding LJ, Allen A, Ren X, Haensgen H, et al. Neurogranin, Encoded by the Schizophrenia Risk Gene NRGN, Bidirectionally Modulates Synaptic Plasticity via Calmodulin-Dependent Regulation of the Neuronal Phosphoproteome. Biol Psychiatry. 2021;89: 256–269. doi: 10.1016/j.biopsych.2020.07.014 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref023] 23.Eissa N, Sadeq A, Sasse A, Sadek B. Role of Neuroinflammation in Autism Spectrum Disorder and the Emergence of Brain Histaminergic System. Lessons Also for BPSD? Front Pharmacol. 2020;11. doi: 10.3389/fphar.2020.00886 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref024] 24.Siniscalco D, Schultz S, Brigida A, Antonucci N. Inflammation and Neuro-Immune Dysregulations in Autism Spectrum Disorders. Pharmaceuticals. 2018;11: 56. doi: 10.3390/ph11020056 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref025] 25.Meltzer A, Van de Water J. The Role of the Immune System in Autism Spectrum Disorder. Neuropsychopharmacology. 2017;42: 284–298. doi: 10.1038/npp.2016.158 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref026] 26.Tsilioni I, Patel AB, Pantazopoulos H, Berretta S, Conti P, Leeman SE, et al. IL-37 is increased in brains of children with autism spectrum disorder and inhibits human microglia stimulated by neurotensin. Proc Natl Acad Sci. 2019;116: 21659–21665. doi: 10.1073/pnas.1906817116 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref027] 27.Blennow K, Diaz-Lucena D, Zetterberg H, Villar-Pique A, Karch A, Vidal E, et al. CSF neurogranin as a neuronal damage marker in CJD: a comparative study with AD. J Neurol Neurosurg Psychiatry. 2019;90: 846–853. doi: 10.1136/jnnp-2018-320155 [DOI] [PubMed] [Google Scholar]

[pcbi.1011550.ref028] 28.Sun X, Dong C, Levin B, Crocco E, Loewenstein D, Zetterberg H, et al. APOE ε4 carriers may undergo synaptic damage conferring risk of Alzheimer’s disease. Alzheimer’s Dement. 2016;12: 1159–1166. doi: 10.1016/j.jalz.2016.05.003 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref029] 29.Kivisäkk P, Carlyle BC, Sweeney T, Quinn JP, Ramirez CE, Trombetta BA, et al. Increased levels of the synaptic proteins PSD-95, SNAP-25, and neurogranin in the cerebrospinal fluid of patients with Alzheimer’s disease. Alzheimers Res Ther. 2022;14: 58. doi: 10.1186/s13195-022-01002-x [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1011550.ref030] 30.Xiang Y, Xin J, Le W, Yang Y. Neurogranin: A Potential Biomarker of Neurological and Mental Diseases. Front Aging Neurosci. 2020;12. doi: 10.3389/fnagi.2020.584743 [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Ranking of cell clusters in a single-cell RNA-sequencing analysis framework using prior knowledge

Anastasis Oulas

Kyriaki Savva

Nestoras Karathanasis

George M Spyrou

Roles

Abstract

Author summary

Introduction

Materials and methods

Datasets

Basic analysis flow

Fig 1. Flowchart of overall adopted methodology.

Prior knowledge incorporation

Mapping prior knowledge to data-driven results

Ranking of cell types

Fig 2. Riverplot showing an example of mapping of prior knowledge from MalaCards source wiki pathways to wiki pathways obtained from enrichment analysis on the scRNA-seq data for a specific cell type in a scRNA-seq dataset.

Fig 3. Example output from CellChat for a disease-control dataset.

Bulk-RNA-seq simulations

Additional methods for cell prioritization

R package implementation

Results

Showcasing the importance of our methodology using case studies

Case study 1—LAM disease

LAM discovery mode approach

Fig 4. Discovery mode approach rankings across all 3 datasets.

LAM Cell-cell communication ranking approach

LAM hypothesis-driven approach

Fig 5. Hypothesis-driven rankings across all 3 datasets.

Case study 2—Autistic Spectrum Disorder (ASD)

ASD discovery mode approach

ASD Cell-cell communication ranking approach

ASD hypothesis-driven approach

Case study 3 –COVID-19

COVID-19 discovery mode approach

COVID-19 Cell-cell communication ranking approach

COVID-19 hypothesis-driven approach

Comparison to standard cell type ranking methods

Fig 6. Proportion of cell types across different conditions for the 3 datasets used to validate our methodology.

Fig 7. Number of differentially expressed genes (DEGs) for each cell type across different conditions for the 3 datasets used to validate our methodology.

Discussion

Supporting information

Data Availability

Funding Statement

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases