Briefings in Bioinformatics. 2023 Nov 25;25(1):bbad424. doi: 10.1093/bib/bbad424

Identifying phenotype-associated subpopulations through LP_SGL

Juntao Li 1, Hongmei Zhang 2, Bingyu Mu 3, Hongliang Zuo 4, Kanglei Zhou 5
PMCID: PMC10753413  PMID: 38008419

Abstract

Single-cell RNA sequencing (scRNA-seq) enables the resolution of cellular heterogeneity in diseases and facilitates the identification of novel cell types and subtypes. However, the grouping effects caused by cell–cell interactions are often overlooked in the development of tools for identifying subpopulations. We proposed LP_SGL which incorporates cell group structure to identify phenotype-associated subpopulations by integrating scRNA-seq, bulk expression and bulk phenotype data. Cell groups from scRNA-seq data were obtained by the Leiden algorithm, which facilitates the identification of subpopulations and improves model robustness. LP_SGL identified a higher percentage of cancer cells, T cells and tumor-associated cells than Scissor and scAB on lung adenocarcinoma diagnosis, melanoma drug response and liver cancer survival datasets, respectively. Biological analysis on three original datasets and four independent external validation sets demonstrated that the signaling genes of this cell subset can predict cancer, immunotherapy and survival.

Keywords: data integration, cell–cell interaction, cell subpopulation, biological analysis

INTRODUCTION

Human tumors are complex ecosystems composed of multiple cell types [1]. Fortunately, the increasing availability of omics data has provided important support for unraveling the complex features of tumors [2, 3]. Bulk data represent the average measurement of the entire tissue, while single-cell RNA sequencing (scRNA-seq) offers advantages in identifying cell types and therapeutic targets by revealing intratumoral heterogeneity [1, 4, 5]. Cell types are typically annotated by marker genes [6], but determining the role of specific cells in driving sample phenotypes remains a challenge. Although scRNA-seq data can provide high-resolution cell type information, it frequently lacks adequate sample phenotypes and clinical information due to its high cost [1]. Conversely, publicly available databases such as TCGA [7] contain a large amount of bulk data with sample phenotypes and clinical information.

Integrating bulk and scRNA-seq data effectively leverages the benefits of both phenotype and single-cell information simultaneously. Using scRNA-seq data, significant genes were selected as features to build a predictive breast cancer prognosis model with bulk data [8]. To identify subpopulations associated with sample phenotype, Scissor was developed with a sparse regression model [9]. In addition, scAB was developed to detect clinically significant multiresolution cell states using a knowledge- and graph-guided matrix factorization method [10]. As biological processes depend on complex interactions among different cells, we contend that incorporating cell group structure into the model will facilitate the identification of subpopulations associated with the phenotype. The implementation of Scissor and scAB relies on a correlation matrix, which comprises Pearson correlation coefficients of shared genes from bulk and scRNA-seq data. The screening of differentially expressed genes (DEGs) may potentially influence the performance of these methods. Thus, integrating the cell group structure into the model is likely to bolster its robustness.

Feature grouping has been considered in previous studies. The group lasso (GL) method was introduced to select features at the group level while performing regression [11]. To achieve intragroup sparsity, the sparse group lasso (SGL) was formulated for applications in linear regression, logistic regression and Cox regression [12]. A fundamental requirement for successfully applying SGL to bioinformatics is to group features beforehand. Although weighted gene co-expression network analysis (WGCNA) has been successfully applied to gene grouping of cancer bulk data [13, 14], it is not readily applicable to scRNA-seq data due to the large number of genes and cells. Therefore, identifying biologically meaningful group structures for scRNA-seq data is a challenging problem. Fortunately, community clustering algorithms such as Louvain [15] and Leiden [16] present promising avenues to solve this problem.

Inspired by the similarity between community connectivity and cell–cell interactions, we considered the cell communities obtained by the Leiden algorithm on scRNA-seq data as cell groups. We then proposed LP_SGL which incorporates cell group structure to identify phenotype-associated subpopulations by integrating scRNA-seq, bulk expression and bulk phenotype data. The experimental results showed that LP_SGL outperformed Scissor and scAB on datasets related to lung adenocarcinoma (LUAD) diagnosis, melanoma drug response and liver cancer survival. The robustness of the three methods was tested on seven datasets, including six incomplete datasets obtained under different threshold conditions. The subpopulation identification performance of LP_SGL remained almost unchanged, while the latter two methods showed significant fluctuations. Furthermore, the biological analysis confirmed the effectiveness of the proposed method.

MATERIALS AND METHODS

The structure of LP_SGL

LP_SGL is a specialized SGL [12] model that integrates scRNA-seq, bulk expression and bulk phenotype data. The model calculates the Pearson correlation coefficients between bulk samples and cells over their shared genes, thereby integrating the scRNA-seq and bulk expression data into a single correlation matrix. The letter ‘L’ indicates the use of the Leiden algorithm to obtain the cellular community structure from the scRNA-seq data. The letter ‘P’ represents the use of phenotype information to construct sample labels. The LP_SGL workflow is presented in Figure 1.
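The construction of the correlation matrix can be illustrated with a minimal sketch. It assumes that both expression tables are keyed by gene name, that rows of the result correspond to bulk samples and columns to cells, and that a constant expression profile is given a correlation of 0; the function names and toy values are hypothetical and are not taken from the LP_SGL implementation.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / math.sqrt(var_u * var_v)


def correlation_matrix(bulk, single_cell):
    """Correlate every bulk sample with every cell over the shared genes.

    bulk:        {gene: [expression per bulk sample]}
    single_cell: {gene: [expression per cell]}
    Returns the sorted shared genes and a samples-by-cells matrix.
    """
    shared = sorted(set(bulk) & set(single_cell))
    n_samples = len(bulk[shared[0]])
    n_cells = len(single_cell[shared[0]])
    matrix = []
    for i in range(n_samples):
        sample_profile = [bulk[g][i] for g in shared]
        row = []
        for j in range(n_cells):
            cell_profile = [single_cell[g][j] for g in shared]
            row.append(pearson(sample_profile, cell_profile))
        matrix.append(row)
    return shared, matrix


# Toy data (hypothetical): 2 bulk samples, 3 cells; gene "D" is bulk-only.
toy_bulk = {"A": [1, 3], "B": [2, 2], "C": [3, 1], "D": [9, 9]}
toy_sc = {"A": [1, 5, 0], "B": [2, 5, 1], "C": [3, 5, 0]}
shared_genes, X = correlation_matrix(toy_bulk, toy_sc)
```

Each row of `X` then plays the role of one sample's feature vector in the regression, with one feature per cell.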

Figure 1. The workflow of LP_SGL.

The Leiden algorithm [16] partitions nodes in a graph based on their similarity, which is analogous to each cell group representing a collection of cells with similar characteristics or functions. Therefore, it is reasonable to consider the cell communities obtained by the Leiden algorithm on scRNA-seq data as cell groups. Before executing the Leiden algorithm, the shared nearest neighbor graph was first constructed. Then, cells were divided into communities by maximizing the following modularity score:

$$Q = \frac{1}{2m}\sum_{i,j}\left(A_{ij} - \gamma\,\frac{k_i k_j}{2m}\right)\delta(c_i, c_j) \qquad (1)$$

where $m$ stands for the total number of edges in the graph, $A_{ij}$ represents the weight of the edge between cell $i$ and cell $j$, $\gamma$ is a resolution parameter, and $k_i$ and $k_j$ are the degrees of cell $i$ and cell $j$, respectively. $c_i$ denotes the community to which cell $i$ is assigned, and the $\delta$ function is 1 if $c_i = c_j$ and 0 otherwise. The Leiden algorithm utilizes an iterative approach to enhance the initial partition by exchanging cells between communities to maximize the modularity score. This process continues until no further improvement is achievable. The algorithm was implemented through the R package ‘leidenAlg’.
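As an illustration of the modularity score, the following sketch evaluates it for a fixed partition of a small weighted graph. It only scores a given partition and does not perform the Leiden optimization itself; the function name, the edge-dictionary input format and the toy graph are assumptions made for this example, and $m$ is taken as the total edge weight.

```python
def modularity(edges, community, gamma=1.0):
    """Modularity of a partition with resolution parameter gamma.

    edges:     {(i, j): weight}, each undirected edge listed once, i != j
    community: {node: community label}
    """
    adjacency = {}
    degree = {node: 0.0 for node in community}
    m = 0.0  # total edge weight (equals the number of edges for unit weights)
    for (i, j), w in edges.items():
        adjacency[(i, j)] = adjacency.get((i, j), 0.0) + w
        adjacency[(j, i)] = adjacency.get((j, i), 0.0) + w
        degree[i] += w
        degree[j] += w
        m += w
    score = 0.0
    for i in community:
        for j in community:
            if community[i] == community[j]:  # delta(c_i, c_j) = 1
                expected = gamma * degree[i] * degree[j] / (2 * m)
                score += adjacency.get((i, j), 0.0) - expected
    return score / (2 * m)


# Toy graph (hypothetical): two triangles joined by a single edge.
toy_edges = {(0, 1): 1, (1, 2): 1, (0, 2): 1,
             (3, 4): 1, (4, 5): 1, (3, 5): 1,
             (2, 3): 1}
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

q_two = modularity(toy_edges, two_groups)               # 5/14
q_one = modularity(toy_edges, one_group)                # 0
q_two_low_res = modularity(toy_edges, two_groups, 0.6)  # 12/14 - 0.3
```

In this toy graph, splitting the two triangles scores higher than merging them, which is the behaviour the optimization exploits when assigning cells to groups.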

Let $K$ be the number of the obtained cell groups, and $p_l$ be the number of cells in the $l$th group. Let $x_i$ be the $i$th row vector from the correlation matrix, and $x_i^{(l)}$ be its subvector corresponding to the $l$th group. LP_SGL can be described as

$$\hat{\beta} = \arg\min_{\beta}\; \frac{1}{n}\mathcal{L}(\beta) + (1-\alpha)\lambda\sum_{l=1}^{K}\sqrt{p_l}\,\left\|\beta^{(l)}\right\|_2 + \alpha\lambda\left\|\beta\right\|_1 \qquad (2)$$

where $\mathcal{L}(\beta)$ is a loss function that depends on the phenotype information, $n$ represents the number of samples, $\lambda$ and $\alpha$ are regularization parameters, $\beta$ is the regression coefficient vector and $\beta^{(l)}$ is its subvector corresponding to the $l$th group. If the phenotype information on cancer diagnosis (or treatment response) is utilized, then the sample label $y_i$ is encoded as 1 or 0, and the negative log-likelihood function is adopted:

$$\mathcal{L}(\beta) = -\sum_{i=1}^{n}\left[y_i\, x_i\beta - \log\left(1 + \exp(x_i\beta)\right)\right] \qquad (3)$$

If the phenotype information on survival is utilized, then the following loss function is adopted:

$$\mathcal{L}(\beta) = -\sum_{i\in D}\left[x_i\beta - \log\left(\sum_{j\in R_i}\exp(x_j\beta)\right)\right] \qquad (4)$$

where $D$ is the failure index set of samples determined by the occurrence of events, and $R_i$ is the index set of samples with survival time longer than that of the $i$th sample.
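To make the objective concrete, the sketch below evaluates the two loss functions and the sparse group lasso penalty for a given coefficient vector. It is an illustration under assumptions rather than the solver used by the R package ‘SGL’: it evaluates the objective without minimizing it, the $1/n$ scaling and the $(1-\alpha)$ and $\alpha$ weighting follow the usual SGL formulation, the risk set uses the common convention of including every sample whose time is at least that of sample $i$, and all function names are hypothetical.

```python
import math


def dot(x, beta):
    return sum(a * b for a, b in zip(x, beta))


def logistic_loss(X, y, beta):
    """Negative log-likelihood for 0/1 labels."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = dot(x_i, beta)
        total += y_i * eta - math.log(1.0 + math.exp(eta))
    return -total


def cox_loss(X, time, event, beta):
    """Negative log partial likelihood (ties in time are not treated specially)."""
    eta = [dot(x_i, beta) for x_i in X]
    total = 0.0
    for i in range(len(X)):
        if event[i] == 1:
            # assumption: the risk set includes sample i itself (time >= t_i)
            risk = [j for j in range(len(X)) if time[j] >= time[i]]
            total += eta[i] - math.log(sum(math.exp(eta[j]) for j in risk))
    return -total


def sgl_penalty(beta, groups, lam, alpha):
    """groups: list of index lists, one list per cell group."""
    group_term = sum(
        math.sqrt(len(idx)) * math.sqrt(sum(beta[k] ** 2 for k in idx))
        for idx in groups
    )
    lasso_term = sum(abs(b) for b in beta)
    return (1 - alpha) * lam * group_term + alpha * lam * lasso_term


def sgl_objective(loss_value, n, beta, groups, lam, alpha):
    return loss_value / n + sgl_penalty(beta, groups, lam, alpha)


# Toy checks (hypothetical values).
loss_bin = logistic_loss([[1, 0], [0, 1]], [1, 0], [0.0, 0.0])            # 2*log(2)
loss_surv = cox_loss([[1, 0], [0, 1], [1, 1]], [1, 2, 3], [1, 0, 1],
                     [0.0, 0.0])                                          # log(3)
penalty = sgl_penalty([3.0, 4.0, 0.0], [[0, 1], [2]], lam=1.0, alpha=0.5)
objective = sgl_objective(loss_bin, 2, [0.0, 0.0], [[0], [1]], 1.0, 0.5)
```

The group term shrinks whole cell groups toward zero together, while the lasso term keeps only some cells within a selected group, which is how the cell group structure enters the model.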

The $\hat{\beta}$ in (2) can be solved through the R package ‘SGL’. Each regression coefficient reflects the corresponding cell’s impact on the phenotype, with positive and negative coefficients indicating associations with the phenotypes encoded by the higher and lower value, respectively. When the phenotype is survival information, positive coefficients correspond to cells associated with worse survival outcomes. For simplicity, we denoted cells as LP_SGL+ cells (positive coefficients), LP_SGL- cells (negative coefficients) and Background cells (coefficients equal to 0).
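The labeling rule above amounts to a sign test on each coefficient, as in this minimal sketch. The function name and the optional tolerance argument are assumptions for illustration; with the default tolerance of 0 the rule is exactly the one stated in the text.

```python
def label_cells(coefficients, tol=0.0):
    """Assign each cell a label from the sign of its regression coefficient."""
    labels = []
    for b in coefficients:
        if b > tol:
            labels.append("LP_SGL+")
        elif b < -tol:
            labels.append("LP_SGL-")
        else:
            labels.append("Background")
    return labels


# Toy coefficients (hypothetical).
toy_labels = label_cells([0.8, 0.0, -0.3, 0.0, 0.05])
n_positive = toy_labels.count("LP_SGL+")
```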

During the implementation of the LP_SGL model, three parameters need to be determined: the resolution parameter $\gamma$ and the regularization parameters $\lambda$ and $\alpha$. The parameter $\gamma$ acts as a threshold, requiring a minimum density of $\gamma$ within each group. Higher values of $\gamma$ result in more groups being obtained. We tested a sequence of $\gamma$ values to assess their impact on the results, with detailed results presented in Supplementary Table 1 (see Supplementary Data available online at https://academic.oup.com/bib). Because the results fluctuated only minimally as $\gamma$ changed for each dataset, we simplified the experimental process by setting $\gamma$ to 0.6. The parameter $\lambda$ determines the overall strength of the penalty term, while $\alpha$ balances the lasso and GL penalties. We created a search list of candidate values for $\alpha$ in advance. For each fixed $\alpha$, $\lambda$ was determined through 5-fold cross-validation, and the optimal parameter pair $(\alpha, \lambda)$ was determined through experimental results.

Datasets

The LUAD scRNA-seq data were downloaded from ArrayExpress (accession numbers: E-MTAB-6149 and E-MTAB-6653), including 29 888 cells and 8 cell types [17]: cancer cell, endothelial cell, T cell, B cell, myeloid cell, alveolar cell, epithelial cell and fibroblast. The bulk data of LUAD were downloaded from TCGA-LUAD. In total, there are 539 tumor samples and 59 normal samples, and 508 samples with overall survival time and status. An external bulk validation set of LUAD diagnosis was downloaded from GEO (accession code: GSE40419), including 87 tumor samples and 77 normal samples.

The melanoma scRNA-seq data (accession code: GSE115978) contained 6879 cells and 9 cell types [18]: T cell, CD4+ T cell, CD8+ T cell, B cell, macrophage, malignant cell, cancer-associated fibroblast (CAF), endothelial cell and natural killer (NK) cell. In reference [18], cells were defined as T cells based on the overall expression of established cell type markers (CD2, CD3D, CD3E, CD3G). T cells were further classified as CD8+ or CD4+ T cells if they expressed CD8 (CD8A or CD8B) or CD4, respectively, while the rest were still labeled as T cells. The melanoma bulk dataset PRJEB23709 was downloaded from [19]. In total, there are 46 treatment responders and 27 nonresponders. External bulk validation sets for melanoma and thymic carcinoma were downloaded from GEO (accession codes: GSE91061 and GSE181815, respectively).

The liver cancer scRNA-seq data (accession code: GSE125449) contained 8853 cells and 7 cell types [20]: CAF, tumor-associated macrophage (TAM), malignant cell, tumor-associated endothelial cell (TEC), cells with an unknown entity that express hepatic progenitor cell markers (HPC-like), T cell and B cell. TCGA-LIHC provides bulk data of 370 liver cancer samples with survival information, while GEO (GSE14520) provides another liver cancer bulk validation set with survival and recurrence information.

The gene expression values were averaged for genes with multiple occurrences of the same name during data preprocessing. For bulk data, a logarithmic transformation with a base of 2 was performed on the original count data. For scRNA-seq data, the R package ‘Seurat’ was used for preprocessing. Genes expressed in at least 400 cells were retained, and the filtered expression matrix was normalized using the ‘NormalizeData’ function. Highly variable genes between cells were identified using the ‘FindVariableFeatures’ function with the default ‘vst’ method. Subsequently, standardization and principal component analysis were performed using the ‘ScaleData’ and ‘RunPCA’ functions, respectively. The shared nearest neighbor graph was constructed based on the first 10 principal components using the ‘FindNeighbors’ function. Two-dimensional cell visualization was achieved using the ‘RunUMAP’ function.
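The first preprocessing steps can be sketched in a few lines. This is not the ‘Seurat’ pipeline: it only illustrates averaging duplicated gene names, the base-2 logarithmic transformation and the filter on the number of expressing cells. The pseudocount of 1 in the log transform and the definition of "expressed" as a value greater than 0 are assumptions, and the function names and toy values are hypothetical.

```python
import math


def average_duplicate_genes(rows):
    """rows: list of (gene name, [values]); duplicated names are averaged."""
    sums, counts = {}, {}
    for gene, values in rows:
        if gene not in sums:
            sums[gene] = [0.0] * len(values)
            counts[gene] = 0
        sums[gene] = [s + v for s, v in zip(sums[gene], values)]
        counts[gene] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


def log2_transform(values, pseudocount=1.0):
    # assumption: a pseudocount avoids taking the logarithm of zero counts
    return [math.log2(v + pseudocount) for v in values]


def filter_genes(expression, min_cells):
    """Keep genes with a value greater than 0 in at least min_cells cells."""
    return sorted(
        gene for gene, values in expression.items()
        if sum(1 for v in values if v > 0) >= min_cells
    )


# Toy data (hypothetical).
averaged = average_duplicate_genes(
    [("TP53", [2, 4]), ("ACTB", [1, 1]), ("TP53", [4, 0])]
)
logged = log2_transform([3, 0])
kept = filter_genes({"A": [0, 1, 2], "B": [0, 0, 3]}, min_cells=2)
```

In the study the cell-count filter was applied with a minimum of 400 cells; the toy call uses 2 only to keep the example small.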

Testing and biological analysis

To assess the robustness of the model to incomplete or missing data, we deliberately removed some genes. We split the binary phenotype bulk data into two groups and used the R package ‘limma’ to identify DEGs between the two groups, based on the filtering criteria of an absolute logarithm of fold change ($|\log \mathrm{FC}|$) greater than the threshold and a P-value from the default t-test less than 0.05. We set the threshold sequence as {0.5, 0.6, 0.7, 0.8, 0.9, 1} to obtain six different gene sets. The different gene sets resulted in different correlation matrices when integrating bulk data with scRNA-seq data. We evaluated the model’s robustness using these six incomplete datasets.
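The construction of the six gene sets can be sketched as a simple filter over per-gene statistics. The statistics would come from ‘limma’ in the study; here they are made-up toy values, and the function names and input format are assumptions for illustration.

```python
THRESHOLDS = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]


def select_degs(stats, threshold, p_cutoff=0.05):
    """stats: {gene: (log fold change, P-value)}; returns the retained genes."""
    return sorted(
        gene for gene, (log_fc, p_value) in stats.items()
        if abs(log_fc) > threshold and p_value < p_cutoff
    )


def gene_sets_by_threshold(stats, thresholds=THRESHOLDS):
    return {t: select_degs(stats, t) for t in thresholds}


# Toy statistics (hypothetical values, not results from the paper).
toy_stats = {
    "G1": (1.20, 0.001),
    "G2": (-0.75, 0.010),
    "G3": (0.55, 0.200),   # fails the P-value criterion
    "G4": (-0.95, 0.040),
}
toy_gene_sets = gene_sets_by_threshold(toy_stats)
```

Stricter thresholds retain fewer genes, so each gene set yields a smaller set of shared genes and therefore a different correlation matrix.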

We conducted functional enrichment analysis on DEGs between LP_SGL+ cells and LP_SGL- cells. To assess the activity level of the over-expressed gene set across different samples, we employed the R package ‘GSVA’ to conduct gene set variation analysis (GSVA). We compared the GSVA scores between the two types of samples using the t-test. Furthermore, we performed gene set enrichment analysis (GSEA) to investigate the enrichment of DEGs under different biological conditions. GSEA was implemented by utilizing the ‘gseGO’ and ‘gseKEGG’ functions in the R package ‘clusterProfiler’. P-values were calculated based on the hypergeometric distribution, and the false discovery rate (FDR) was calculated using the Benjamini–Hochberg method.
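For readers unfamiliar with the Benjamini–Hochberg adjustment, the following sketch shows the standard step-up procedure on a list of P-values. It is a generic illustration rather than the code path used inside ‘clusterProfiler’; the function name and toy P-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy P-values (hypothetical).
toy_fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in toy_fdr]
```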

The lasso-Cox model was implemented using the R package ‘glmnet’ based on DEGs between LP_SGL+ cells and LP_SGL- cells. Subsequently, multivariable Cox regression was performed using the R package ‘survival’ for genes with nonzero coefficients. Samples were then divided into high- and low-risk groups based on the median of predicted prognostic scores. To assess the difference in survival time between the two groups, Kaplan–Meier (K-M) survival analysis was conducted using the R package ‘survminer’, with the log-rank test. In addition, the concordance index (C-index) was calculated to measure the predictive ability of the model. To reduce the influence of chance on the results, the experiment was repeated 10 times with random seeds 1 to 10. For clinical characteristics-based methods including age, stage and sex, univariate Cox regression was performed using the R package ‘survival’.
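The median split and the C-index can be sketched as follows. This is a plain implementation of Harrell's concordance index under assumptions, not the routine from the R package ‘survival’: pairs with tied survival times are skipped, tied scores count as half concordant, samples with a score equal to the median are placed in the low-risk group, and the function names and toy values are hypothetical.

```python
from statistics import median


def risk_groups(scores):
    """Split samples at the median prognostic score."""
    cut = median(scores)
    # assumption: scores equal to the median fall into the low-risk group
    return ["high" if s > cut else "low" for s in scores]


def concordance_index(time, event, score):
    """Fraction of comparable pairs in which the higher score fails earlier."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when sample i had an event before time[j].
            if event[i] == 1 and time[i] < time[j]:
                comparable += 1
                if score[i] > score[j]:
                    concordant += 1.0
                elif score[i] == score[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy data (hypothetical): higher score should mean shorter survival.
toy_time = [5, 10, 15, 20]
toy_event = [1, 1, 0, 1]
c_perfect = concordance_index(toy_time, toy_event, [4, 3, 2, 1])
c_reversed = concordance_index(toy_time, toy_event, [1, 2, 3, 4])
toy_groups = risk_groups([4, 3, 2, 1])
```

A C-index of 0.5 corresponds to random ordering, so values closer to 1 indicate better prognostic ranking.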

RESULTS

Identifying cell subpopulations associated with LUAD and normal phenotypes

We initially applied the LP_SGL method to the LUAD dataset to identify cells associated with either the LUAD or the normal phenotype. After preprocessing the data, 29 888 cells were assigned to 24 groups using the Leiden algorithm. The UMAP visualizations of the 24 cell groups and 8 cell types are presented in Figure 2A and Supplementary Figure S1a (see Supplementary Data available online at https://academic.oup.com/bib), respectively. Subsequently, 1317 LP_SGL+ cells and 775 LP_SGL- cells were selected by LP_SGL. A bar chart of the distribution of LP_SGL+ cells and LP_SGL- cells with respect to cell groups is presented in Figure 2B, and the corresponding UMAP visualization is displayed in Supplementary Figure S1b (see Supplementary Data available online at https://academic.oup.com/bib). Of the LP_SGL+ cells, 63.25% (833/1317) and 36.45% (480/1317) appeared in groups 12 and 21, respectively, while 100% of LP_SGL- cells appeared in group 10. A bar chart of the distribution of LP_SGL+ cells and LP_SGL- cells with respect to cell types is presented in Figure 2C: 99.92% (1316/1317) of LP_SGL+ cells were cancer cells and 99.74% (773/775) of LP_SGL- cells were endothelial cells. This concentrated distribution of LP_SGL+ cells and LP_SGL- cells within both cell groups and cell types demonstrated the ability of LP_SGL to accurately identify phenotype-associated subpopulations by introducing cell group structure.

Figure 2. Experimental results on the LUAD dataset. (A) UMAP visualization of 24 cell groups obtained using the Leiden algorithm. (B and C) Bar charts of the distribution of LP_SGL+ cells and LP_SGL- cells with respect to cell groups and cell types, respectively. (D) Line chart of the proportions of cancer cells contained in the LUAD phenotype cells identified by LP_SGL, Scissor and scAB. (E) Volcano plot of DEGs between LP_SGL+ cells and LP_SGL- cells. (F and G) Box plots of GSVA scores for cancer and normal samples on the TCGA-LUAD and GSE40419 datasets, respectively. (H) K-M survival curves of high- and low-risk group samples divided by the median prognostic score in the TCGA-LUAD dataset.

We then evaluated the robustness of LP_SGL, Scissor and scAB by using seven datasets (including six different incomplete datasets obtained under different thresholds). The line chart of the proportions of cancer cells contained in the LUAD phenotype cells identified by these methods was shown in Figure 2D. In the original data, the proportion of cancer cells contained in the LUAD-associated cells identified by LP_SGL was 99.92%, which was 11.73 and 53.36% higher than that identified by Scissor and scAB, respectively. On six incomplete datasets, the results obtained by LP_SGL remained almost unchanged, while the other two methods exhibited some degree of fluctuation.

To further reveal the biological significance of the identified cells, we performed differential expression analysis (DEA) between LP_SGL+ cells and LP_SGL- cells. A total of 210 upregulated and 89 downregulated genes were identified by requiring $|\log \mathrm{FC}|$ greater than 1 and FDR less than 0.05. The volcano plot of the DEGs is shown in Figure 2E. Notably, some of these genes have been identified as important regulatory factors in LUAD, such as ENO1, which has been previously reported to promote tumor progression in LUAD [21]. Similarly, YBX1 has been shown to induce the migration of LUAD cells and contribute to tumor metastasis [22]. On the other hand, GPX3 has been found to play an inhibitory role in LUAD, with lower expression levels in tumors compared with normal tissues [23]. These findings demonstrated the potential of LP_SGL for identifying significant DEGs that may be used as diagnostic or therapeutic targets for LUAD.

To assess the clinical relevance of the 210 over-expressed genes identified by LP_SGL, GSVA scores were calculated for each sample in the bulk data. As shown in Figure 2F, the cancer samples exhibited significantly higher scores than the normal samples in the TCGA-LUAD dataset. The same trend was observed in another independent LUAD dataset, as depicted in Figure 2G. These results suggested that the identified upregulated genes were strongly correlated with LUAD. Furthermore, using survival information from the TCGA-LUAD dataset, 508 samples were divided into high- and low-risk groups based on the median predicted prognostic score. As presented in Figure 2H, the K-M survival curve indicated that samples with higher prognostic scores had significantly worse survival outcomes compared with those with lower scores. This analysis further supports the association of the identified LUAD-associated subpopulations with poor prognosis. Together, these results demonstrate the utility of LP_SGL in accurately identifying cell subpopulations associated with a particular phenotype.

Identifying T cell subpopulations related to immunotherapy

Understanding the mechanism behind the response to immune checkpoint blockade (ICB) is crucial: ICB significantly improves the 10-year survival rate of melanoma patients, yet most treated patients do not benefit from the therapy [24]. To address this issue, we employed LP_SGL to analyze melanoma data and identify T cell subpopulations associated with ICB response. The 6879 cells were assigned to 17 groups via the Leiden algorithm. The UMAP visualizations of the 17 cell groups and 9 cell types are presented in Figure 3A and Supplementary Figure S2a (see Supplementary Data available online at https://academic.oup.com/bib), respectively. Then, 404 LP_SGL+ cells and 0 LP_SGL- cells were identified by LP_SGL. A bar chart of the distribution of LP_SGL+ cells with respect to cell types is presented in Figure 3B, and the corresponding UMAP visualization is displayed in Supplementary Figure S2b (see Supplementary Data available online at https://academic.oup.com/bib). Of the LP_SGL+ cells, 99.26% (401/404) appeared in group 1, showing the same concentrated distribution as in the experimental results on the LUAD dataset. It is noteworthy that 99.26% (401/404) of LP_SGL+ cells were T cells (CD8+ T cells: 92.82%, 375/404; CD4+ T cells: 2.72%, 11/404; T cells: 3.72%, 15/404), with the remaining 0.74% (3/404) being NK cells. Recent research has highlighted the great potential of NK cells in cancer immunotherapy [25]. This result demonstrated that LP_SGL can accurately identify subpopulations related to ICB response, which has the potential to improve the effectiveness of immunotherapy for melanoma patients.

Figure 3. Experimental results on the melanoma dataset. (A) UMAP visualization of 17 cell groups obtained using the Leiden algorithm. (B) Bar chart of the distribution of LP_SGL+ cells with respect to cell types. (C) Line chart of the proportions of T cells contained in the response phenotype cells identified by LP_SGL, Scissor and scAB. (D) Volcano plot of DEGs between LP_SGL+ cells and other cells. (E–G) Box plots of GSVA scores for response and non-response in the PRJEB23709, GSE91061 and GSE181815 datasets, respectively. (H) GSEA plots of upregulated and downregulated biological processes (BP) of the overall DEGs. (I) GSEA plots of upregulated BP of the upregulated DEGs.

In addition, we tested the robustness of LP_SGL, Scissor and scAB by using seven datasets of melanoma. The line chart in Figure 3C showed the proportions of T cells contained in the response phenotype cells identified by these methods. In the original data, the proportion of T cells contained in the response phenotype cells identified by LP_SGL was 99.26%, which was 16.92 and 38.49% higher than that identified by Scissor and scAB, respectively. On six incomplete datasets, the results obtained by LP_SGL remained stable, while the other two methods exhibited significant fluctuations.

To gain a deeper understanding of the immunotherapy response mechanism, we performed DEA between LP_SGL+ cells and other cells, as LP_SGL- cells were not identified. A total of 253 upregulated and 131 downregulated DEGs were identified by requiring $|\log \mathrm{FC}|$ greater than 3 and FDR less than 0.05. The volcano plot of the DEGs is shown in Figure 3D. Many of these genes have been confirmed to be closely related to melanoma: a reduction in MITF level promotes melanoma invasion [26], silencing CCL5 abrogates tumor regression [27] and CST7 is significantly upregulated in melanoma patients who respond to ICB treatment [28]. These results demonstrated that LP_SGL has the ability to identify gene signals related to immunotherapy responses.

We subsequently calculated the GSVA score of each sample to evaluate the clinical relevance of the identified DEGs. As shown in Figure 3E, the responders in the melanoma dataset had significantly higher scores than the nonresponders. Moreover, the external melanoma validation set showed similar results, as depicted in Figure 3F. We also tested whether the immunotherapy-associated cell subpopulations identified from the melanoma dataset were applicable to thymic carcinoma samples. Interestingly, as shown in Figure 3G, thymic carcinoma samples that responded to treatment had significantly higher scores than those that did not respond. Furthermore, the GSEA of the overall DEGs revealed overactivation of immune response processes and suppression of lipid transport processes, as shown in Figure 3H. The GSEA of the upregulated DEGs is shown in Figure 3I, while no significant enrichment of biological processes was observed for downregulated DEGs. These findings were consistent with previous research demonstrating that inhibiting lipid transport to melanoma cells effectively reduces their growth and invasion [29]. In summary, LP_SGL identified cell subpopulations that were associated with ICB response, and the signal genes from these cells could reliably predict ICB response in melanoma and other types of cancer.

Identifying cell subpopulations associated with worse survival in liver cancer

To further evaluate the model’s performance on survival phenotype data, we applied the LP_SGL method to the liver cancer dataset to identify cell subpopulations associated with poorer survival outcomes. The 8853 cells were assigned to 16 groups via the Leiden algorithm. The UMAP visualizations of the 16 cell groups and 7 cell types are presented in Figure 4A and Supplementary Figure S3a (see Supplementary Data available online at https://academic.oup.com/bib), respectively. In total, 746 LP_SGL+ cells and 1243 LP_SGL- cells were identified. A bar chart of the distribution of LP_SGL+ cells with respect to cell types is presented in Figure 4B, and the corresponding UMAP visualization is displayed in Supplementary Figure S3b (see Supplementary Data available online at https://academic.oup.com/bib). Of the LP_SGL+ cells, 91.68% (684/746) were tumor-associated cells (TAM, CAF, TEC and malignant cell). Additionally, the cells identified by Scissor were labeled as Scissor+ cells and Scissor- cells, and those identified by scAB as scAB+ cells, following the conventions of their respective papers. We applied scAB and Scissor to the liver cancer dataset and obtained proportions of 85.14% (779/915) and 90.48% (19/21) tumor-associated cells in scAB+ cells and Scissor+ cells, respectively. LP_SGL thus identified a higher proportion of tumor-associated cells among the cells associated with the poorer survival phenotype than Scissor and scAB. Specifically, the proportion identified by LP_SGL was 1.2 and 6.54 percentage points higher than that identified by Scissor and scAB, respectively.
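The comparison above is a difference of proportions, which can be checked directly from the reported cell counts. The short sketch below recomputes the three proportions and their differences; the variable and function names are hypothetical, and small deviations from the reported figures are due to rounding.

```python
def percentage(count, total):
    return 100.0 * count / total


# Tumor-associated cells among the cells selected by each method (counts from the text).
lp_sgl = percentage(684, 746)   # LP_SGL+ cells
scab = percentage(779, 915)     # scAB+ cells
scissor = percentage(19, 21)    # Scissor+ cells

# Differences in percentage points.
diff_vs_scissor = lp_sgl - scissor
diff_vs_scab = lp_sgl - scab
```

Computed from the unrounded proportions, the differences are about 1.2 and 6.5 percentage points, in line with the values stated above.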

Figure 4. Experimental results on the liver cancer datasets. (A) UMAP visualization of 16 cell groups obtained using the Leiden algorithm. (B) Bar chart of the distribution of LP_SGL+ cells with respect to cell types. (C) Volcano plot of DEGs between LP_SGL+ cells and LP_SGL- cells. (D) Bar chart of the average C-index over the 10 repeated experiments. (E) The K-M survival curves of high- and low-risk group samples in the TCGA-LIHC dataset. (F and G) The survival and recurrence K-M curves of the high- and low-risk groups in the GSE14520 dataset, respectively. (H) Gene set enrichment analysis plots of the upregulated Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway of DEGs.

We conducted DEA between LP_SGL+ cells and LP_SGL- cells to explore potential biological mechanisms related to poorer survival. Figure 4C shows 77 upregulated and 12 downregulated DEGs that met the conditions of $|\log \mathrm{FC}|$ greater than 1 and FDR less than 0.05. Most of these DEGs have been reported to be associated with liver cancer. For example, high expression of YBX1 and NUPR1 is associated with poor overall survival in liver cancer [30, 31]. In addition, overexpression of IL32 has been found to inhibit cancer cell growth and may serve as a therapeutic target for various cancers, including liver cancer [32]. We also identified DEGs between scAB+ cells and other cells, as well as between Scissor+ cells and Scissor- cells, using the same criteria. Subsequently, we used the DEGs obtained from each method to construct lasso-Cox models. The average C-index over the 10 repeated experiments for each method is presented in Figure 4D. We found that LP_SGL, scAB and Scissor all achieved comparable results and outperformed traditional clinical characteristic-based methods.

We conducted a survival analysis on the TCGA-LIHC dataset. As depicted in Figure 4E, there was a significant survival difference between the two groups, with the median survival time of the high-risk group being almost four times shorter than that of the low-risk group. To verify the generalization of the identified DEGs, we conducted a survival analysis on an independent external validation set by following the same steps. The K-M survival curves of the high- and low-risk groups are shown in Figure 4F. We found that the high-risk group in the independent validation set still had worse survival outcomes. We also predicted the recurrence risk of the samples based on DEGs using the recurrence time and status of the samples. Figure 4G shows a significant difference in recurrence between the high- and low-risk groups. Furthermore, we performed GSEA based on DEGs and found that the cholesterol metabolism pathway was significantly enriched (Figure 4H). As the liver is the main organ responsible for cholesterol metabolism, abnormal cholesterol metabolism has been associated with the occurrence of liver diseases [33].

DISCUSSION

In this paper, we proposed LP_SGL to identify phenotype-associated subpopulations by integrating scRNA-seq, bulk expression and bulk phenotype data. Importantly, our method is applicable to binary, survival and linear phenotype data, although we were unable to demonstrate the linear case due to a lack of suitable data. Moreover, our method can be extended to other omics data, such as chromatin accessibility and DNA methylation data. We also evaluated the performance of cell grouping using the Louvain algorithm (Supplementary Table 2, see Supplementary Data available online at https://academic.oup.com/bib), and the comparable results indicated that incorporating cell group structure into the model was effective. This provides a new perspective for incorporating other cell clustering methods into integrated multi-omics data models.

We compared the proposed LP_SGL with the current mainstream methods for identifying phenotype-associated subpopulations, Scissor [9] and scAB [10], where the data preprocessing and parameter settings of both methods were consistent with their respective original literature. LP_SGL selected the highest proportions of cancer cells, T cells and tumor-associated cells when the three methods were applied to the LUAD diagnosis, melanoma drug response and liver cancer survival datasets, respectively. It is worth noting that, compared with LP_SGL and Scissor, scAB consistently selected the largest number of cells, which may be the reason why the cells it identified contained a lower proportion of cancer cells or T cells. LP_SGL selected a larger number of cells than Scissor on both the LUAD and liver cancer datasets. On the melanoma dataset, LP_SGL identified 404 LP_SGL+ cells in the optimal results. Moreover, when LP_SGL identified 1406 LP_SGL+ cells, which was more than the 1212 Scissor+ cells identified by Scissor, the proportion of T cells in LP_SGL+ cells was 95.87%, still higher than the proportion in Scissor+ cells. These results indicated that LP_SGL had a more accurate and comprehensive ability to identify phenotype-associated subpopulations.

Flow cytometry is a widely used experimental technique for identifying cell subpopulations [34]. It enables target cells to be separated from a mixed cell population based on the fluorescence signal of cell surface markers [35]. However, because our research focused on exploring phenotype-associated subpopulations using available transcriptomic data, flow cytometry data for identifying cell subpopulations were not available. In future studies, we plan to integrate flow cytometry data with our algorithm. Moreover, the patients who underwent bulk RNA-seq in this study were different from those who underwent scRNA-seq. This prevented us from examining the distribution of the identified cells across response and nonresponse samples. Nevertheless, the performance comparison among LP_SGL, Scissor and scAB, along with extensive biological analyses, supported the credibility of the proposed LP_SGL. Using data from patients who have undergone both bulk RNA-seq and scRNA-seq may be advantageous for identifying phenotype-associated subpopulations, and this will be a focus of our future research.

Key Points

  • We proposed LP_SGL, a method for integrating scRNA-seq, bulk expression and bulk phenotype data.

  • The group effects caused by cell–cell interactions were introduced into the model to guide the identification of phenotype-associated subpopulations.

  • LP_SGL identified a higher percentage of cancer cells, T cells and tumor-associated cells than Scissor and scAB on lung adenocarcinoma diagnosis, melanoma drug response and liver cancer survival datasets, respectively.

  • The biological analysis on three original datasets and four independent external validation sets demonstrated that the signaling genes of this cell subset have the ability to predict cancer, immunotherapy and survival.

Supplementary Material

supplementary_material_for_lp_sgl_bbad424

ACKNOWLEDGEMENTS

The authors express their gratitude for the support provided by the high performance computing center of Henan Normal University.

Author Biographies

Juntao Li is a full professor at the College of Mathematics and Information Science, Henan Normal University. His research interests include machine learning and data mining.

Hongmei Zhang is a master's student at the College of Mathematics and Information Science, Henan Normal University. Her research interests include machine learning and RNA-seq data analysis.

Bingyu Mu is a lecturer at the School of the Art and Design, Zhengzhou University of Light Industry. Her research interests include interface and interaction, virtual reality and digital twin.

Hongliang Zuo is a full professor at the College of Mathematics and Information Science, Henan Normal University. His research interests include machine learning and pattern recognition.

Kanglei Zhou is currently pursuing a Ph.D. at the School of Computer Science and Engineering, Beihang University. His research interests include human motion analysis and augmented reality.

Contributor Information

Juntao Li, College of Mathematics and Information Science, Henan Normal University, 46 Jianshe East Road, 453007, Xinxiang, China.

Hongmei Zhang, College of Mathematics and Information Science, Henan Normal University, 46 Jianshe East Road, 453007, Xinxiang, China.

Bingyu Mu, College of Arts and Design, Zhengzhou University of Light Industry, No. 5 Dongfeng Road, 450000, Zhengzhou, China.

Hongliang Zuo, College of Mathematics and Information Science, Henan Normal University, 46 Jianshe East Road, 453007, Xinxiang, China.

Kanglei Zhou, School of Computer Science and Engineering, Beihang University, 37 Xueyuan Road, Haidian District, 100191, Beijing, China.

AUTHORS’ CONTRIBUTIONS

J.L. designed this study. B.M. and H.Z. collected and preprocessed the data. H.Z. and K.Z. implemented the experiments and analysis. J.L. and H.Z. wrote the manuscript. All authors have read and approved the final manuscript.

FUNDING

National Natural Science Foundation of China (grant number 61203293), Scientific and Technological Project of Henan Province (grant number 212102210140).

CODE AND DATA AVAILABILITY

The code for LP_SGL is freely available in the GitHub repository (https://github.com/hongmeizhanghm/LP_SGL). The data underlying this article are available in the article.

References

  • 1. Suvà ML, Tirosh I. Single-cell RNA sequencing in cancer: lessons learned and emerging challenges. Mol Cell 2019;75(1):7–12.
  • 2. Zhao J, Zhao B, Song X, et al. Subtype-DCC: decoupled contrastive clustering method for cancer subtype identification based on multi-omics data. Brief Bioinform 2023;24(2):bbad025.
  • 3. Kaushik AC, Wang YJ, Wang X, Wei DQ. Irinotecan and vandetanib create synergies for treatment of pancreatic cancer patients with concomitant TP53 and KRAS mutations. Brief Bioinform 2021;22(3):bbaa149.
  • 4. Patel AP, Tirosh I, Trombetta JJ, et al. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science 2014;344(6190):1396–401.
  • 5. Yofe I, Dahan R, Amit I. Single-cell genomic approaches for developing the next generation of immunotherapies. Nat Med 2020;26(2):171–7.
  • 6. Dumitrascu B, Villar S, Mixon DG, et al. Optimal marker gene selection for cell type discrimination in single cell analyses. Nat Commun 2021;12(1):1186.
  • 7. Tomczak K, Czerwińska P, Wiznerowicz M. The cancer genome atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol (Pozn) 2015;19(1A):A68–77.
  • 8. Li X, Liu L, Goodall GJ, et al. A novel single-cell based method for breast cancer prognosis. PLoS Comput Biol 2020;16(8):e1008133.
  • 9. Sun D, Guan X, Moran AE, et al. Identifying phenotype-associated subpopulations by integrating bulk and single-cell sequencing data. Nat Biotechnol 2022;40(4):527–38.
  • 10. Zhang Q, Jin S, Zou X. scAB detects multiresolution cell states with clinical significance by integrating single-cell genomics and bulk sequencing data. Nucleic Acids Res 2022;50(21):12112–30.
  • 11. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B 2006;68(1):49–67.
  • 12. Simon N, Friedman J, Hastie T, Tibshirani R. A sparse-group lasso. J Comput Graph Stat 2013;22(2):231–45.
  • 13. Song X, Liang K, Li J. WGRLR: a weighted group regularized logistic regression for cancer diagnosis and gene selection. IEEE/ACM Trans Comput Biol Bioinform 2023;20(2):1563–73.
  • 14. Li J, Dong W, Meng D. Grouped gene selection of cancer via adaptive sparse group lasso based on conditional mutual information. IEEE/ACM Trans Comput Biol Bioinform 2018;15(6):2028–38.
  • 15. Blondel VD, Guillaume JL, Lambiotte R, et al. Fast unfolding of communities in large networks. J Stat Mech 2008;2008(10):P10008.
  • 16. Traag VA, Waltman L, Van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 2019;9(1):5233.
  • 17. Lambrechts D, Wauters E, Boeckx B, et al. Phenotype molding of stromal cells in the lung tumor microenvironment. Nat Med 2018;24(8):1277–89.
  • 18. Jerby-Arnon L, Shah P, Cuoco MS, et al. A cancer cell program promotes T cell exclusion and resistance to checkpoint blockade. Cell 2018;175(4):984–97.e24.
  • 19. Xiong D, Wang Y, You M. A gene expression signature of TREM2hi macrophages and γδ T cells predicts immunotherapy response. Nat Commun 2020;11(1):5084.
  • 20. Ma L, Hernandez MO, Zhao Y, et al. Tumor cell biodiversity drives microenvironmental reprogramming in liver cancer. Cancer Cell 2019;36(4):418–30.e6.
  • 21. Zhou J, Zhang S, Chen Z, et al. CircRNA-ENO1 promoted glycolysis and tumor progression in lung adenocarcinoma through upregulating its host gene ENO1. Cell Death Dis 2019;10(12):885.
  • 22. Peng Z, Wang J, Shan B, et al. The long noncoding RNA LINC00312 induces lung adenocarcinoma migration and vasculogenic mimicry through directly binding YBX1. Mol Cancer 2018;17:167.
  • 23. Nirgude S, Choudhary B. Insights into the role of GPX3, a highly efficient plasma antioxidant, in cancer. Biochem Pharmacol 2021;184:114365.
  • 24. Havel JJ, Chowell D, Chan TA. The evolving landscape of biomarkers for checkpoint inhibitor immunotherapy. Nat Rev Cancer 2019;19(3):133–50.
  • 25. Wolf NK, Kissiov DU, Raulet DH. Roles of natural killer cells in immunity to cancer, and applications to immunotherapy. Nat Rev Immunol 2023;23(2):90–105.
  • 26. Goding CR. A picture of Mitf in melanoma immortality. Oncogene 2011;30(20):2304–6.
  • 27. Mgrditchian T, Arakelian T, Paggetti J, et al. Targeting autophagy inhibits melanoma growth by enhancing NK cells infiltration in a CCL5-dependent manner. Proc Natl Acad Sci U S A 2017;114(44):E9271–9.
  • 28. Schetters STT, Rodriguez E, Kruijssen LJW, et al. Monocyte-derived APCs are central to the response of PD1 checkpoint blockade and provide a therapeutic target for combination therapy. J Immunother Cancer 2020;8(2):e000588.
  • 29. Zhang M, Di Martino JS, Bowman RL, et al. Adipocyte-derived lipids mediate melanoma progression via FATP proteins. Cancer Discov 2018;8(8):1006–25.
  • 30. Gandhi M, Groß M, Holler JM, et al. The lncRNA lincNMR regulates nucleotide metabolism via a YBX1-RRM2 axis in cancer. Nat Commun 2020;11(1):3214.
  • 31. Lee YK, Jee BA, Kwon SM, et al. Identification of a mitochondrial defect gene signature reveals NUPR1 as a key regulator of liver cancer progression. Hepatology 2015;62(4):1174–89.
  • 32. Hong JT, Son DJ, Lee CK, et al. Interleukin 32, inflammation and cancer. Pharmacol Ther 2017;174:127–37.
  • 33. Li H, Yu X, Ou X, et al. Hepatic cholesterol transport and its role in non-alcoholic fatty liver disease and atherosclerosis. Prog Lipid Res 2021;83:101109.
  • 34. Zhu J, Chen T, Mao X, et al. Machine learning of flow cytometry data reveals the delayed innate immune responses correlate with the severity of COVID-19. Front Immunol 2023;14:974343.
  • 35. de Pablo JG, Lindley M, Hiramatsu K, Goda K. High-throughput Raman flow cytometry and beyond. Acc Chem Res 2021;54(9):2132–43.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
